MILOS NIKOLIC
Lecturer in Database Systems
School of Informatics, University of Edinburgh
milos.nikolic@ed.ac.uk

My research interests are in databases and large-scale data management systems.

My recent work studies in-database learning, stream processing, incremental computation, and query compilation. I am a member of the Database Group and the Laboratory for Foundations of Computer Science.

bio | Google Scholar

PROJECTS

Adaptive Query Processing based on Degree Information
  • investigating trade-offs in static and dynamic evaluation of queries
Incremental Maintenance of Complex Data Analytics
  • maintaining a wide range of analytics using novel factorisation techniques
Deep Analytics over Data Streams
  • deep integration of relational and digital signal processing (DSP)
DBToaster
  • compilation of database queries into high-performance streaming engines

TEACHING

Instructor

Teaching Assistant

PUBLICATIONS

  • Scalable Analysis of Multi-Modal Biomedical Data
    Targeted diagnosis and treatment options depend on insights drawn from multi-modal analysis of large-scale biomedical datasets. Advances in genomic sequencing, image processing, and medical data management have supported data collection and management within medical institutions. These efforts have produced large-scale datasets and have enabled integrative analyses that provide a more thorough view of the impact of a disease on the underlying system. The integration of large-scale biomedical data commonly involves several complex data transformation steps, such as combining datasets to build feature vectors for learning analysis. Thus, scalable data integration solutions play a key role in the future of targeted medicine. Though large-scale data processing frameworks have shown promising performance for many domains, they fail to support scalable processing of complex data types. To address these issues and achieve scalable processing of multi-modal biomedical data, we present TraNCE, a framework that automates the design of distributed analyses over complex biomedical data types. We outline research and clinical applications for the platform, including data integration support for building feature sets for classification. We show that the system outperforms the common alternative based on "flattening" complex data structures, and runs efficiently in cases where the flattening-based approaches cannot run at all.
    @article{smith2021scalable,
    title={Scalable Analysis of Multi-Modal Biomedical Data},
    author={Smith, Jaclyn and Shi, Yao and Benedikt, Michael and Nikolic, Milos},
    journal={GigaScience},
    volume={10},
    number={9},
    year={2021}
    }
  • Machine Learning over Static and Dynamic Relational Data
    This tutorial gives an overview of the principles behind recent work on training and maintaining machine learning models over relational data, with an emphasis on exploiting the structure of the relational data to improve the runtime performance of the learning task. The tutorial has the following parts (a toy illustration of reducing learning to database aggregates follows the citation below):
    • Database research for data science
    • Three main ideas to achieve performance improvements:
      • Turn the ML problem into a DB problem
      • Exploit structure of the data and problem
      • Exploit engineering tools of a DB researcher
    • Avenues for future research
    @inproceedings{kara2021machine,
    title={Machine learning over static and dynamic relational data},
    author={Kara, Ahmet and Nikolic, Milos and Olteanu, Dan and Zhang, Haozhe},
    booktitle={Proceedings of the 15th ACM International Conference on Distributed and Event-based Systems},
    pages={160--163},
    year={2021}
    }
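    A toy sketch, not code from the tutorial, in the spirit of the ideas listed above: least-squares regression over the join of R(k, x) and S(k, z) can be trained from a few per-relation aggregates combined over the join key, without ever materialising the join. The relations, attributes, and data below are invented for illustration.
    from collections import defaultdict
    import numpy as np

    R = [(1, 2.0), (1, 3.0), (2, 1.0)]   # tuples (k, x)
    S = [(1, 5.0), (2, 4.0), (2, 6.0)]   # tuples (k, z)

    def partials(rel):
        # Per-key count, sum, and sum of squares of the value attribute.
        agg = defaultdict(lambda: [0.0, 0.0, 0.0])
        for k, v in rel:
            c = agg[k]
            c[0] += 1.0
            c[1] += v
            c[2] += v * v
        return agg

    pr, ps = partials(R), partials(S)

    # Each aggregate over the join factorises into a sum over the join key
    # of products of per-relation partial aggregates.
    n = sx = sz = sxx = sxz = 0.0
    for k in pr.keys() & ps.keys():
        (cr, sr, qr), (cs, ss, qs) = pr[k], ps[k]
        n   += cr * cs    # number of join tuples
        sx  += sr * cs    # sum of x over the join
        sz  += cr * ss    # sum of z over the join
        sxx += qr * cs    # sum of x^2 over the join
        sxz += sr * ss    # sum of x*z over the join

    # Normal equations for z ~ w0 + w1*x, built only from these aggregates.
    A = np.array([[n, sx], [sx, sxx]])
    b = np.array([sz, sxz])
    w0, w1 = np.linalg.solve(A, b)
    print(w0, w1)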
  • TraNCE: Transforming Nested Collections Efficiently
    Nested relational query languages have long been seen as an attractive tool for scenarios involving large hierarchical datasets. In recent years, there has been a resurgence of interest in nested relational languages. One driver has been the affinity of these languages for large-scale processing platforms such as Spark and Flink.
    This demonstration gives a tour of TraNCE, a new system for processing nested data on top of distributed processing systems. The core innovation of the system is a compiler that processes nested relational queries in a series of transformations; these include variants of two prior techniques, shredding and unnesting, as well as a materialization transformation that customizes the way levels of the nested output are generated. The TraNCE platform builds on these techniques by adding components for users to create and visualize queries, as well as data exploration and notebook execution targets to facilitate the construction of large-scale data science applications. The demonstration will both showcase the system from the viewpoint of usability by data scientists and illustrate the data management techniques employed. A toy sketch of the shredding idea follows the citation below.
    @article{smith2021trance,
    title={TraNCE: Transforming Nested Collections Efficiently},
    author={Smith, Jaclyn and Benedikt, Michael and Moore, Brandon and Nikolic, Milos},
    journal={Proceedings of the VLDB Endowment (PVLDB)},
    volume={14},
    number={12},
    pages={2727--2730},
    year={2021},
    publisher={VLDB Endowment}
    }
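    A toy sketch, not TraNCE code, of the shredding transformation mentioned above: a nested collection is split into flat collections connected by synthetic labels, so each level can be stored, partitioned, and processed independently. The sample data and label scheme are invented for illustration.
    patients = [
        {"id": "p1", "samples": [{"gene": "BRCA1", "expr": 0.7},
                                 {"gene": "TP53",  "expr": 0.2}]},
        {"id": "p2", "samples": [{"gene": "TP53",  "expr": 0.9}]},
    ]

    def shred(nested):
        top, inner = [], []
        for label, rec in enumerate(nested):
            # The top level keeps a label in place of the nested collection.
            top.append({"id": rec["id"], "samples_label": label})
            # The inner level becomes a flat collection keyed by that label.
            for s in rec["samples"]:
                inner.append({"label": label, **s})
        return top, inner

    top, inner = shred(patients)
    print(top)    # [{'id': 'p1', 'samples_label': 0}, {'id': 'p2', 'samples_label': 1}]
    print(inner)  # one flat record per sample, each carrying its parent label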
  • Scalable Querying of Nested Data
    While large-scale distributed data processing platforms have become an attractive target for query processing, these systems are problematic for applications that deal with nested collections. Programmers are forced either to perform nontrivial translations of collection programs or to employ automated flattening procedures, both of which lead to performance problems. These challenges only worsen for nested collections with skewed cardinalities, where both handcrafted rewriting and automated flattening are unable to enforce load balancing across partitions.
    In this work, we propose a framework that translates a program manipulating nested collections into a set of semantically equivalent shredded queries that can be efficiently evaluated. The framework employs a combination of query compilation techniques, an efficient data representation for nested collections, and automated skew-handling. We provide an extensive experimental evaluation, demonstrating significant improvements provided by the framework in diverse scenarios for nested collection programs. A toy sketch of one skew-handling idea follows the citation below.
    @article{smith2020scalable,
    title={Scalable Querying of Nested Data},
    author={Smith, Jaclyn and Benedikt, Michael and Nikolic, Milos and Shaikhha, Amir},
    journal={Proceedings of the VLDB Endowment (PVLDB)},
    volume={14},
    number={3},
    pages={445--457},
    year={2020},
    publisher={VLDB Endowment}
    }
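    A toy sketch of one standard skew-handling trick, in the spirit of (but not identical to) the automated skew handling mentioned above: join keys whose degree exceeds a threshold are "salted" across several sub-keys, and the matching tuples of the other relation are replicated over those sub-keys, so that no single partition receives an entire heavy group. The relations, threshold, and fan-out below are invented for illustration.
    import random
    from collections import Counter, defaultdict

    THRESHOLD = 2   # degree above which a join key counts as heavy (toy value)
    FANOUT = 4      # number of sub-keys a heavy key is spread over

    def skew_aware_join(R, S):
        # Join R(k, x) with S(k, z) on k; heavy k values in R are salted.
        heavy = {k for k, d in Counter(k for k, _ in R).items() if d > THRESHOLD}
        R2 = [((k, random.randrange(FANOUT)) if k in heavy else (k, 0), x)
              for k, x in R]
        # S tuples with a heavy key are replicated over every salt value,
        # so no join pair is lost.
        S2 = [((k, i), z)
              for k, z in S
              for i in (range(FANOUT) if k in heavy else [0])]
        # A distributed engine would hash-partition R2 and S2 on the salted
        # key; the local hash join below only checks the result is unchanged.
        buckets = defaultdict(list)
        for key, z in S2:
            buckets[key].append(z)
        return [(key[0], x, z) for key, x in R2 for z in buckets[key]]

    R = [("u", 1), ("u", 2), ("u", 3), ("v", 4)]   # "u" is heavy here
    S = [("u", 10), ("v", 20)]
    print(sorted(skew_aware_join(R, S)))
    # [('u', 1, 10), ('u', 2, 10), ('u', 3, 10), ('v', 4, 20)]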
  • Maintaining Triangle Queries under Updates
    We consider the problem of incrementally maintaining the triangle queries with arbitrary free variables under single-tuple updates to the input relations.
    We introduce an approach called IVMε that exhibits a trade-off between the update time, the space, and the delay for the enumeration of the query result, such that the update time ranges from the square root to linear in the database size while the delay ranges from constant to linear time.
    IVMε achieves Pareto worst-case optimality in the update-delay space conditioned on the Online Matrix-Vector Multiplication conjecture. It is strongly Pareto optimal for the triangle queries with no or three free variables and weakly Pareto optimal for the remaining triangle queries with one or two free variables.
    IVMε recovers prior work such as the suboptimal classical view maintenance approach that uses delta query processing and the worst-case optimal approach that computes all triangles in a static database.
    @article{kara2020maintaining,
    title={Maintaining triangle queries under updates},
    author={Kara, Ahmet and Ngo, Hung Q and Nikolic, Milos and Olteanu, Dan and Zhang, Haozhe},
    journal={ACM Transactions on Database Systems (TODS)},
    volume={45},
    number={3},
    pages={1--46},
    year={2020},
    publisher={ACM New York, NY, USA}
    }
  • F-IVM: Learning over Fast-Evolving Relational Data
    F-IVM is a system for real-time analytics, such as machine learning applications, over training datasets defined by queries over fast-evolving relational databases. We will demonstrate F-IVM for three such applications: model selection, Chow-Liu trees, and ridge linear regression. A toy sketch of the underlying incremental-aggregate idea follows the citation below.
    @inproceedings{nikolic2020f,
    title={F-IVM: learning over fast-evolving relational data},
    author={Nikolic, Milos and Zhang, Haozhe and Kara, Ahmet and Olteanu, Dan},
    booktitle={Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data},
    pages={2773--2776},
    year={2020}
    }
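    A toy sketch, not F-IVM itself, of the aggregate-maintenance idea behind the demo: the small matrices a ridge-regression model needs (X^T X and X^T y) are kept up to date under single-tuple inserts and deletes, so the model can be re-solved at any time without rescanning the training data. The class and feature layout below are invented; F-IVM additionally maintains such aggregates over join queries and supports further applications.
    import numpy as np

    class IncrementalRidge:
        def __init__(self, num_features, lam=0.1):
            self.xtx = np.zeros((num_features, num_features))  # X^T X
            self.xty = np.zeros(num_features)                  # X^T y
            self.lam = lam

        def update(self, x, y, sign=+1):
            # sign = +1 for an insert, -1 for a delete of a training tuple.
            x = np.asarray(x, dtype=float)
            self.xtx += sign * np.outer(x, x)
            self.xty += sign * y * x

        def solve(self):
            # Re-solving touches only the maintained aggregates, never the data.
            d = self.xtx.shape[0]
            return np.linalg.solve(self.xtx + self.lam * np.eye(d), self.xty)

    m = IncrementalRidge(num_features=2)
    m.update([1.0, 2.0], 3.0)           # insert a training tuple
    m.update([1.0, 0.5], 1.0)           # insert another
    m.update([1.0, 2.0], 3.0, sign=-1)  # delete the first one again
    print(m.solve())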
  • Trade-offs in Static and Dynamic Evaluation of Hierarchical Queries
    We investigate trade-offs in static and dynamic evaluation of hierarchical queries with arbitrary free variables. In the static setting, the trade-off is between the time to partially compute the query result and the delay needed to enumerate its tuples. In the dynamic setting, we additionally consider the time needed to update the query result under single-tuple inserts or deletes to the database.
    Our approach observes the degree of values in the database and uses different computation and maintenance strategies for high-degree (heavy) and low-degree (light) values. For the latter it partially computes the result, while for the former it computes enough information to allow for on-the-fly enumeration. A toy sketch of this heavy/light split follows the citation below.
    The main result of this work defines the preprocessing time, the update time, and the enumeration delay as functions of the light/heavy threshold. By conveniently choosing this threshold, our approach recovers a number of prior results when restricted to hierarchical queries.
    For a restricted class of hierarchical queries, our approach can achieve worst-case optimal update time and enumeration delay conditioned on the Online Matrix-Vector Multiplication Conjecture.
    @inproceedings{kara2020trade,
    title={Trade-offs in static and dynamic evaluation of hierarchical queries},
    author={Kara, Ahmet and Nikolic, Milos and Olteanu, Dan and Zhang, Haozhe},
    booktitle={Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems},
    pages={375--392},
    year={2020}
    }
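    A toy sketch of the heavy/light split described above, not the paper's algorithm, for the simple hierarchical query Q(a,b,c) = R(a,b) JOIN S(b,c): join values b with degree at most a threshold are materialised during preprocessing, while high-degree values are only indexed and their result tuples are produced during enumeration. The data and threshold are invented; the paper's update-time and delay guarantees require considerably more machinery.
    from collections import defaultdict

    def preprocess(R, S, threshold):
        deg = defaultdict(int)                 # degree of each b value in R
        for _, b in R:
            deg[b] += 1
        s_index = defaultdict(list)            # b -> list of c values in S
        for b, c in S:
            s_index[b].append(c)
        materialised, heavy_r = [], defaultdict(list)
        for a, b in R:
            if deg[b] <= threshold:
                # Light b: pay time now for constant-delay enumeration later.
                materialised.extend((a, b, c) for c in s_index[b])
            else:
                # Heavy b: store only the input and defer work to enumeration.
                heavy_r[b].append(a)
        return materialised, heavy_r, s_index

    def enumerate_result(materialised, heavy_r, s_index):
        yield from materialised
        for b, a_values in heavy_r.items():
            for a in a_values:
                for c in s_index[b]:
                    yield (a, b, c)

    R = [(1, "x"), (2, "x"), (3, "y")]
    S = [("x", 10), ("x", 20), ("y", 30)]
    parts = preprocess(R, S, threshold=1)      # "x" is heavy, "y" is light
    print(sorted(enumerate_result(*parts)))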
  • Counting Triangles under Updates in Worst-Case Optimal Time
    We consider the problem of incrementally maintaining the triangle count query under single-tuple updates to the input relations. We introduce an approach that exhibits a space-time trade-off such that the space-time product is quadratic in the size of the input database and the update time can be as low as the square root of this size. This lowest update time is worst-case optimal conditioned on the Online Matrix-Vector Multiplication conjecture. The classical and factorized incremental view maintenance approaches are recovered as special cases of our approach within the space-time trade-off. In particular, they require linear-time update maintenance, which is suboptimal. Our approach also recovers the worst-case optimal time complexity for computing the triangle count in the non-incremental setting. A toy sketch of the classical delta-based maintenance follows the citation below.
    @inproceedings{kara2019counting,
    title={Counting Triangles under Updates in Worst-Case Optimal Time},
    author={Kara, Ahmet and Ngo, Hung Q and Nikolic, Milos and Olteanu, Dan and Zhang, Haozhe},
    booktitle={22nd International Conference on Database Theory (ICDT 2019)},
    year={2019},
    organization={Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik}
    }
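    A toy sketch of the classical first-order delta maintenance that the paper recovers as a suboptimal (linear update time) special case, not the worst-case optimal algorithm itself: each single-tuple update is joined with the two other relations to compute the change to the triangle count. The class and index layout below are invented for illustration.
    from collections import defaultdict

    def _index():
        return defaultdict(lambda: defaultdict(int))

    class TriangleCount:
        # Maintains the count of (a, b, c) with R(a,b), S(b,c), T(c,a),
        # with multiplicities, under single-tuple inserts (delta=+1) and
        # deletes (delta=-1).
        def __init__(self):
            self.count = 0
            self.R_by_a, self.R_by_b = _index(), _index()
            self.S_by_b, self.S_by_c = _index(), _index()
            self.T_by_c, self.T_by_a = _index(), _index()

        @staticmethod
        def _dot(m1, m2):
            # Sum of products of multiplicities over shared keys. This scan
            # is linear in the database size in the worst case, which is
            # exactly the suboptimality the paper's approach removes.
            if len(m1) > len(m2):
                m1, m2 = m2, m1
            return sum(v * m2.get(k, 0) for k, v in m1.items())

        def update_R(self, a, b, delta=1):
            # delta count = delta * sum_c S(b,c) * T(c,a)
            self.count += delta * self._dot(self.S_by_b[b], self.T_by_a[a])
            self.R_by_a[a][b] += delta
            self.R_by_b[b][a] += delta

        def update_S(self, b, c, delta=1):
            # delta count = delta * sum_a T(c,a) * R(a,b)
            self.count += delta * self._dot(self.T_by_c[c], self.R_by_b[b])
            self.S_by_b[b][c] += delta
            self.S_by_c[c][b] += delta

        def update_T(self, c, a, delta=1):
            # delta count = delta * sum_b R(a,b) * S(b,c)
            self.count += delta * self._dot(self.R_by_a[a], self.S_by_c[c])
            self.T_by_c[c][a] += delta
            self.T_by_a[a][c] += delta

    tc = TriangleCount()
    tc.update_R(1, 2)
    tc.update_S(2, 3)
    tc.update_T(3, 1)
    print(tc.count)   # 1: the single triangle a=1, b=2, c=3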

STUDENTS

  • Pavlos Gogousis (MSc, May - Aug 2019)
  • Cecilia Cobos Santes (MSc, May - Aug 2019)
  • Haozhe Zhang (DPhil, Sept 2017 - present, co-advised w/ Dan Olteanu)
  • Theodore Dickson (MSc, April - Sept 2018)
  • Canberk Koparal (3rd year CS, Oct 2017 - May 2018)
  • Alexandru Valeanu (3rd year CS, Oct 2017 - May 2018, co-advised w/ Dan Olteanu)
  • Xu Sun (MSc, April - Sept 2017)