MILOS NIKOLIC
Lecturer in Database Systems
School of Informatics, University of Edinburgh
milos.nikolic@ed.ac.uk

My research interests are in databases and large-scale data management systems.

My recent work studies in-database learning, stream processing, incremental computation, and query compilation. I am a member of the Database Group and the Laboratory for Foundations of Computer Science.

I am looking for a PhD student to work on topics related to data management. Feel free to contact me for more information.

bio | Google Scholar

PROJECTS

Adaptive Query Processing based on Degree Information
  • investigating trade-offs in static and dynamic evaluation of queries
Incremental Maintenance of Complex Data Analytics
  • maintaining a wide range of analytics using novel factorisation techniques
Deep Analytics over Data Streams
  • deep integration of relational and digital signal processing (DSP)
DBToaster
  • compilation of database queries into high-performance streaming engines

TEACHING

Instructor

Teaching Assistant

PUBLICATIONS

  • In-Database Data Imputation,
    Missing data is a widespread problem in many domains, creating challenges in data analysis and decision making. Traditional techniques for dealing with missing data, such as excluding incomplete records or imputing simple estimates (e.g., mean), are computationally efficient but may introduce bias and disrupt variable relationships, leading to inaccurate analyses. Model-based imputation techniques offer a more robust solution that preserves the variability and relationships in the data, but they demand significantly more computation time, limiting their applicability to small datasets.
    This work enables efficient, high-quality, and scalable data imputation within a database system using the widely used MICE method. We adapt this method to exploit computation sharing and a ring abstraction for faster model training. To impute both continuous and categorical values, we develop techniques for in-database learning of stochastic linear regression and Gaussian discriminant analysis models. Our MICE implementations in PostgreSQL and DuckDB outperform alternative MICE implementations and model-based imputation techniques by up to two orders of magnitude in terms of computation time, while maintaining high imputation quality.
    @article{perini2024pacmmod,
    title={{In-Database Data Imputation}},
    author={Perini, Massimo and Nikolic, Milos},
    journal={Proc. ACM Manag. Data (PACMMOD)},
    volume={2},
    doi={10.1145/3639326},
    year={2024},
    publisher={Association for Computing Machinery},
    }
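    The chained-equations loop that MICE uses to refine simple mean estimates can be sketched in plain Python. This is an illustrative NumPy version, not the paper's in-database implementation; the function name and the fixed iteration count are assumptions:

```python
# Illustrative sketch of MICE-style imputation: initialise missing
# entries with column means, then repeatedly re-impute each column
# from the others via least squares. Not the in-database version.
import numpy as np

def mice_impute(X, num_iters=5):
    """Iteratively re-impute each column of X from the remaining columns."""
    X = X.astype(float)                      # work on a copy
    missing = np.isnan(X)
    # Initialise missing entries with column means (the simple estimate
    # that MICE then refines).
    col_means = np.nanmean(X, axis=0)
    X[missing] = np.take(col_means, np.where(missing)[1])
    for _ in range(num_iters):
        for j in range(X.shape[1]):
            rows = missing[:, j]
            if not rows.any():
                continue
            others = np.delete(X, j, axis=1)
            A = np.c_[np.ones(len(X)), others]
            # Fit a linear model on the rows where column j is observed...
            coef, *_ = np.linalg.lstsq(A[~rows], X[~rows, j], rcond=None)
            # ...and re-impute the rows where it is missing.
            X[rows, j] = A[rows] @ coef
    return X
```

The paper's contribution is to push this loop inside the database and share computation across the per-column models; the sketch only shows the statistical skeleton.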
  • F-IVM: Analytics over Relational Databases under Updates,
    This article describes F-IVM, a unified approach for maintaining analytics over changing relational data. We exemplify its versatility in four disciplines: processing queries with group-by aggregates and joins; learning linear regression models using the covariance matrix of the input features; building Chow-Liu trees using pairwise mutual information of the input features; and matrix chain multiplication.
    F-IVM has three main ingredients: higher-order incremental view maintenance; factorized computation; and ring abstraction. F-IVM reduces the maintenance of a task to that of a hierarchy of simple views. Such views are functions mapping keys, which are tuples of input values, to payloads, which are elements from a ring. F-IVM also supports efficient factorized computation over keys, payloads, and updates. Finally, F-IVM treats uniformly seemingly disparate tasks. In the key space, all tasks require joins and variable marginalization. In the payload space, tasks differ in the definition of the sum and product ring operations.
    We implemented F-IVM on top of DBToaster and show that it can outperform classical first-order and fully recursive higher-order incremental view maintenance by orders of magnitude while using less memory.
    @article{kara2023vldbj,
    title={{F-IVM: Analytics over Relational Databases under Updates}},
    author={Kara, Ahmet and Nikolic, Milos and Olteanu, Dan and Zhang, Haozhe},
    journal={The VLDB Journal},
    doi={10.1007/s00778-023-00817-w},
    year={2023},
    }
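    The ring abstraction above can be illustrated with a toy payload ring: the same view-maintenance code computes different tasks once the ring's sum and product are redefined. The class below is our illustration (not F-IVM's code); it carries count, sum, and sum of squares, the building blocks of variance-style aggregates:

```python
# Toy payload ring: keys join and marginalise as usual, while payloads
# combine via the ring operations below. Swapping in a different ring
# changes the task without changing the view-maintenance logic.
from dataclasses import dataclass

@dataclass(frozen=True)
class CountSumRing:
    """Payload (c, s, q): count, sum of values, and sum of squared values."""
    c: int
    s: float
    q: float

    def __add__(self, other):   # union of two partial results
        return CountSumRing(self.c + other.c, self.s + other.s, self.q + other.q)

    def __mul__(self, other):   # join of two partial results
        # Lifted product: counts multiply; each side's sums scale by the
        # other side's count.
        return CountSumRing(self.c * other.c,
                            self.s * other.c + other.s * self.c,
                            self.q * other.c + other.q * self.c)

ONE = CountSumRing(1, 0.0, 0.0)   # multiplicative identity

def lift(x):
    """Lift a single value x into the ring."""
    return CountSumRing(1, x, x * x)
```

For example, `(lift(2.0) + lift(3.0)) * lift(10.0)` yields the count, sum, and sum of squares over the two joined pairs, exactly as if the join result had been materialised first.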
  • Trade-offs in Static and Dynamic Evaluation of Hierarchical Queries,
    We investigate trade-offs in static and dynamic evaluation of hierarchical queries with arbitrary free variables. In the static setting, the trade-off is between the time to partially compute the query result and the delay needed to enumerate its tuples. In the dynamic setting, we additionally consider the time needed to update the query result under single-tuple inserts or deletes to the database.
    Our approach observes the degree of values in the database and uses different computation and maintenance strategies for high-degree (heavy) and low-degree (light) values. For light values it partially computes the result, while for heavy values it computes enough information to allow for on-the-fly enumeration.
    We define the preprocessing time, the update time, and the enumeration delay as functions of the light/heavy threshold. By appropriately choosing this threshold, our approach recovers a number of prior results when restricted to hierarchical queries.
    We show that for a restricted class of hierarchical queries, our approach achieves worst-case optimal update time and enumeration delay conditioned on the Online Matrix-Vector Multiplication Conjecture.
    @article{kara2023lmcs,
    title={{Trade-offs in Static and Dynamic Evaluation of Hierarchical Queries}},
    author={Kara, Ahmet and Nikolic, Milos and Olteanu, Dan and Zhang, Haozhe},
    journal={Logical Methods in Computer Science},
    volume={19},
    number={3},
    pages={1--50},
    year={2023},
    }
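    The heavy/light dichotomy behind this trade-off can be sketched as a partitioning step. This is our illustration, with an explicit `threshold` argument standing in for the degree threshold chosen by the analysis:

```python
# Split a relation's tuples by the degree of the value in one column:
# light (low-degree) values get one strategy (eager materialisation),
# heavy (high-degree) values another (on-the-fly enumeration).
from collections import defaultdict

def partition_by_degree(relation, key_pos, threshold):
    """Return (light, heavy) tuple lists, split by the degree of the
    value at position key_pos."""
    degree = defaultdict(int)
    for t in relation:
        degree[t[key_pos]] += 1
    light = [t for t in relation if degree[t[key_pos]] <= threshold]
    heavy = [t for t in relation if degree[t[key_pos]] > threshold]
    return light, heavy
```

Moving the threshold trades preprocessing and update time against enumeration delay: a higher threshold materialises more, a lower one defers more work to enumeration.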
  • Conjunctive Queries with Free Access Patterns under Updates,
    We study the problem of answering conjunctive queries with free access patterns under updates. A free access pattern is a partition of the free variables of the query into input and output. The query returns tuples over the output variables given a tuple of values over the input variables.
    We introduce a fully dynamic evaluation approach for such queries. We also give a syntactic characterisation of those queries that admit constant time per single-tuple update and whose output tuples can be enumerated with constant delay given an input tuple. Finally, we chart the complexity trade-off between the preprocessing time, update time and enumeration delay for such queries. For a class of queries, our approach achieves optimal, albeit non-constant, update time and delay. Their optimality is predicated on the Online Matrix-Vector Multiplication conjecture. Our results recover prior work on the dynamic evaluation of conjunctive queries without access patterns.
    @inproceedings{kara2023icdt,
    title={{Conjunctive Queries with Free Access Patterns under Updates}},
    author={Kara, Ahmet and Nikolic, Milos and Olteanu, Dan and Zhang, Haozhe},
    booktitle={26th International Conference on Database Theory (ICDT 2023)},
    year={2023},
    organization={Schloss Dagstuhl-Leibniz-Zentrum f{\"u}r Informatik}
    }
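    A free access pattern can be illustrated with a simple index (our sketch, not the paper's data structure): after preprocessing, the output tuples for a given input value are enumerated one after another, with constant delay per tuple:

```python
# For a query Q(a, b) with input variable a and output variable b:
# preprocessing indexes the result by a; answering an access request
# then enumerates the b-values for that a.
from collections import defaultdict

def preprocess(result_tuples):
    """Index the query result by the input variable a."""
    index = defaultdict(list)
    for a, b in result_tuples:
        index[a].append(b)
    return index

def enumerate_output(index, a):
    """Given an input value a, yield the output values with constant
    delay per tuple."""
    yield from index.get(a, [])
```

The paper's contribution lies in avoiding this full materialisation and in bounding the preprocessing, update, and delay costs; the sketch only shows the access-pattern interface.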
  • Evaluation Trade-Offs for Acyclic Conjunctive Queries,
    We consider the evaluation of acyclic conjunctive queries, where the evaluation time is decomposed into preprocessing time and enumeration delay. In a seminal paper at CSL'07, Bagan, Durand, and Grandjean showed that acyclic queries can be evaluated with linear preprocessing time and linear enumeration delay. If the query is free-connex, the enumeration delay becomes constant. Further prior work showed that constant enumeration delay can be achieved for arbitrary acyclic conjunctive queries at the expense of a preprocessing time that is characterised by the fractional hypertree width.
    We introduce an approach that exposes a trade-off between preprocessing time and enumeration delay for acyclic conjunctive queries. The aforementioned prior works represent extremes in this trade-off space. Our approach also covers the space between these extremes; in particular, the delay may lie between constant and linear time.
    Our approach decomposes the given query into subqueries and achieves for each subquery a trade-off that depends on a parameter controlling the times for preprocessing and enumeration. The complexity of the query is given by the Pareto optimal points of a bi-objective optimisation program whose inputs are possible query decompositions and parameter values.
    @inproceedings{kara2023csl,
    title={{Evaluation Trade-Offs for Acyclic Conjunctive Queries}},
    author={Kara, Ahmet and Nikolic, Milos and Olteanu, Dan and Zhang, Haozhe},
    booktitle={31st EACSL Annual Conference on Computer Science Logic (CSL 2023)},
    year={2023},
    organization={Schloss Dagstuhl-Leibniz-Zentrum f{\"u}r Informatik}
    }
  • Scalable Analysis of Multi-Modal Biomedical Data,
    Targeted diagnosis and treatment options depend on insights drawn from multi-modal analysis of large-scale biomedical datasets. Advances in genomics sequencing, image processing, and medical data management have supported data collection and management within medical institutions. These efforts have produced large-scale datasets and have enabled integrative analyses that provide a more thorough view of the impact of a disease on the underlying system. The integration of large-scale biomedical data commonly involves several complex data transformation steps, such as combining datasets to build feature vectors for learning analysis. Thus, scalable data integration solutions play a key role in the future of targeted medicine. Though large-scale data processing frameworks have shown promising performance for many domains, they fail to support scalable processing of complex data types. To address these issues and achieve scalable processing of multi-modal biomedical data, we present TraNCE, a framework that automates away the difficulties of designing distributed analyses with complex biomedical data types. We outline research and clinical applications for the platform, including data integration support for building feature sets for classification. We show that the system can outperform the common alternative, based on "flattening" complex data structures, and runs efficiently when alternative approaches are unable to perform at all.
    @article{smith2021scalable,
    title={{Scalable Analysis of Multi-Modal Biomedical Data}},
    author={Smith, Jaclyn and Shi, Yao and Benedikt, Michael and Nikolic, Milos},
    journal={GigaScience},
    volume={10},
    number={9},
    year={2021}
    }
  • Machine Learning over Static and Dynamic Relational Data,
    This tutorial overviews principles behind recent works on training and maintaining machine learning models over relational data, with an emphasis on the exploitation of the relational data structure to improve the runtime performance of the learning task. The tutorial has the following parts:
    • Database research for data science
    • Three main ideas to achieve performance improvements:
      • Turn the ML problem into a DB problem
      • Exploit structure of the data and problem
      • Exploit engineering tools of a DB researcher
    • Avenues for future research
    @inproceedings{kara2021machine,
    title={{Machine Learning over Static and Dynamic Relational Data}},
    author={Kara, Ahmet and Nikolic, Milos and Olteanu, Dan and Zhang, Haozhe},
    booktitle={Proceedings of the 15th ACM International Conference on Distributed and Event-based Systems},
    pages={160--163},
    year={2021}
    }
  • TraNCE: Transforming Nested Collections Efficiently,
    Nested relational query languages have long been seen as an attractive tool for scenarios involving large hierarchical datasets. In recent years, there has been a resurgence of interest in nested relational languages. One driver has been the affinity of these languages for large-scale processing platforms such as Spark and Flink.
    This demonstration gives a tour of TraNCE, a new system for processing nested data on top of distributed processing systems. The core innovation of the system is a compiler that processes nested relational queries in a series of transformations; these include variants of two prior techniques, shredding and unnesting, as well as a materialization transformation that customizes the way levels of the nested output are generated. The TraNCE platform builds on these techniques by adding components for users to create and visualize queries, as well as data exploration and notebook execution targets to facilitate the construction of large-scale data science applications. The demonstration will both showcase the system from the viewpoint of usability by data scientists and illustrate the data management techniques employed.
    @article{smith2021trance,
    title={{TraNCE: Transforming Nested Collections Efficiently}},
    author={Smith, Jaclyn and Benedikt, Michael and Moore, Brandon and Nikolic, Milos},
    journal={Proceedings of the VLDB Endowment (PVLDB)},
    volume={14},
    number={12},
    pages={2727--2730},
    year={2021},
    publisher={VLDB Endowment}
    }
  • Scalable Querying of Nested Data,
    While large-scale distributed data processing platforms have become an attractive target for query processing, these systems are problematic for applications that deal with nested collections. Programmers are forced either to perform nontrivial translations of collection programs or to employ automated flattening procedures, both of which lead to performance problems. These challenges only worsen for nested collections with skewed cardinalities, where both handcrafted rewriting and automated flattening are unable to enforce load balancing across partitions.
    In this work, we propose a framework that translates a program manipulating nested collections into a set of semantically equivalent shredded queries that can be efficiently evaluated. The framework employs a combination of query compilation techniques, an efficient data representation for nested collections, and automated skew-handling. We provide an extensive experimental evaluation, demonstrating significant improvements provided by the framework in diverse scenarios for nested collection programs.
    @article{smith2020scalable,
    title={{Scalable Querying of Nested Data}},
    author={Smith, Jaclyn and Benedikt, Michael and Nikolic, Milos and Shaikhha, Amir},
    journal={Proceedings of the VLDB Endowment (PVLDB)},
    volume={14},
    number={3},
    pages={445--457},
    year={2020},
    publisher={VLDB Endowment}
    }
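    The shredding idea referenced above can be sketched as a pair of inverse transformations. The label scheme below is a deliberate simplification of ours, not the system's actual representation:

```python
# Shredding flattens a nested collection into a top-level relation plus
# a child relation linked by synthetic labels, so each level can be
# partitioned and processed independently by a distributed engine.
from collections import defaultdict

def shred(nested):
    """nested: list of (key, [items]) pairs -> (top, child) flat relations."""
    top, child = [], []
    for label, (key, items) in enumerate(nested):
        top.append((key, label))          # parent refers to its bag by label
        for item in items:
            child.append((label, item))   # child tuples carry the label
    return top, child

def unshred(top, child):
    """Inverse transformation: reassemble the nested collection."""
    bags = defaultdict(list)
    for label, item in child:
        bags[label].append(item)
    return [(key, bags[label]) for key, label in top]
```

Because inner bags become ordinary flat tuples, skewed bag cardinalities no longer force all of a large bag onto one partition, which is the load-balancing benefit discussed in the abstract.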
  • Maintaining Triangle Queries under Updates,
    We consider the problem of incrementally maintaining the triangle queries with arbitrary free variables under single-tuple updates to the input relations.
    We introduce an approach called IVMε that exhibits a trade-off between the update time, the space, and the delay for the enumeration of the query result, such that the update time ranges from the square root to linear in the database size while the delay ranges from constant to linear time.
    IVMε achieves Pareto worst-case optimality in the update-delay space conditioned on the Online Matrix-Vector Multiplication conjecture. It is strongly Pareto optimal for the triangle queries with no or three free variables and weakly Pareto optimal for the remaining triangle queries with one or two free variables.
    IVMε recovers prior work such as the suboptimal classical view maintenance approach that uses delta query processing and the worst-case optimal approach that computes all triangles in a static database.
    @article{kara2020maintaining,
    title={{Maintaining Triangle Queries under Updates}},
    author={Kara, Ahmet and Ngo, Hung Q and Nikolic, Milos and Olteanu, Dan and Zhang, Haozhe},
    journal={ACM Transactions on Database Systems (TODS)},
    volume={45},
    number={3},
    pages={1--46},
    year={2020},
    publisher={Association for Computing Machinery}
    }
  • F-IVM: Learning over Fast-Evolving Relational Data,
    F-IVM is a system for real-time analytics such as machine learning applications over training datasets defined by queries over fast-evolving relational databases. We will demonstrate F-IVM for three such applications: model selection, Chow-Liu trees, and ridge linear regression.
    @inproceedings{nikolic2020f,
    title={{F-IVM: Learning over Fast-Evolving Relational Data}},
    author={Nikolic, Milos and Zhang, Haozhe and Kara, Ahmet and Olteanu, Dan},
    booktitle={Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data},
    pages={2773--2776},
    year={2020}
    }
  • Trade-offs in Static and Dynamic Evaluation of Hierarchical Queries,
    We investigate trade-offs in static and dynamic evaluation of hierarchical queries with arbitrary free variables. In the static setting, the trade-off is between the time to partially compute the query result and the delay needed to enumerate its tuples. In the dynamic setting, we additionally consider the time needed to update the query result under single-tuple inserts or deletes to the database.
    Our approach observes the degree of values in the database and uses different computation and maintenance strategies for high-degree (heavy) and low-degree (light) values. For light values it partially computes the result, while for heavy values it computes enough information to allow for on-the-fly enumeration.
    The main result of this work defines the preprocessing time, the update time, and the enumeration delay as functions of the light/heavy threshold. By appropriately choosing this threshold, our approach recovers a number of prior results when restricted to hierarchical queries.
    For a restricted class of hierarchical queries, our approach can achieve worst-case optimal update time and enumeration delay conditioned on the Online Matrix-Vector Multiplication Conjecture.
    @inproceedings{kara2020trade,
    title={{Trade-offs in Static and Dynamic Evaluation of Hierarchical Queries}},
    author={Kara, Ahmet and Nikolic, Milos and Olteanu, Dan and Zhang, Haozhe},
    booktitle={Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems},
    pages={375--392},
    year={2020}
    }
  • Counting Triangles under Updates in Worst-Case Optimal Time,
    We consider the problem of incrementally maintaining the triangle count query under single-tuple updates to the input relations. We introduce an approach that exhibits a space-time tradeoff such that the space-time product is quadratic in the size of the input database and the update time can be as low as the square root of this size. This lowest update time is worst-case optimal conditioned on the Online Matrix-Vector Multiplication conjecture. The classical and factorized incremental view maintenance approaches are recovered as special cases of our approach within the space-time tradeoff. In particular, they require linear-time update maintenance, which is suboptimal. Our approach also recovers the worst-case optimal time complexity for computing the triangle count in the non-incremental setting.
    @inproceedings{kara2019counting,
    title={{Counting Triangles under Updates in Worst-Case Optimal Time}},
    author={Kara, Ahmet and Ngo, Hung Q and Nikolic, Milos and Olteanu, Dan and Zhang, Haozhe},
    booktitle={22nd International Conference on Database Theory (ICDT 2019)},
    year={2019},
    organization={Schloss Dagstuhl-Leibniz-Zentrum f{\"u}r Informatik}
    }
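    The classical first-order IVM baseline that this work proves suboptimal can be sketched as follows: on a single-tuple insert into R, the delta of the triangle count is computed by scanning the matching slice of S and probing T. This is our illustration; only the R-delta is shown, and the deltas for inserts into S and T are symmetric:

```python
# Classical delta maintenance of the triangle count
# Q = |{(a, b, c) : R(a, b), S(b, c), T(c, a)}| under single-tuple
# inserts. The R-delta scans the b-slice of S, i.e. linear update time
# in the worst case -- the bound the paper improves to sqrt(N).
class TriangleCount:
    def __init__(self):
        self.S = {}      # b -> {c: multiplicity}
        self.T = {}      # (c, a) -> multiplicity
        self.count = 0

    def insert_S(self, b, c):
        slice_b = self.S.setdefault(b, {})
        slice_b[c] = slice_b.get(c, 0) + 1

    def insert_T(self, c, a):
        self.T[(c, a)] = self.T.get((c, a), 0) + 1

    def insert_R(self, a, b):
        # delta(Q) = sum over c of S(b, c) * T(c, a)
        self.count += sum(m * self.T.get((c, a), 0)
                          for c, m in self.S.get(b, {}).items())
```

The space-time trade-off in the paper comes from treating high-degree values differently rather than always scanning the full slice, which this baseline cannot avoid.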