Yang Cao's photo

Research Interests

My broad research areas include database systems and theory, data science, and machine learning for data management. I like both theory and systems work, and my focus is the combination of the two: developing principled methods for systems with real-life impact, while providing theoretical foundations and guarantees. Specifically, I have been working on the following topics:

Research Projects

(1) Query processing and optimization: resource-bounded query processing
The last decade has seen an explosion of big data, particularly in terms of its volume and variety. While its rising role in, e.g., business intelligence is evident, big data analytics is expensive. For example, assuming the largest solid-state drives (SSDs), which read at 12 GB/s, a linear scan of a 15 TB dataset takes more than 20 minutes. Joining tables with millions of tuples can easily take hours. To cope with the unprecedented quantity of big data under limited resources such as time and storage, this line of my research centers on principled methods (both theory and systems) for querying big data by accessing a bounded amount of small data, i.e., making big data small for queries. Such methods build on the insight that not all data values are necessary for answering queries and that, for many queries, the relevant fraction is even of bounded size if we access the data smartly.

[Related Publications]

(2) Query large graphs: approximation and parallelization
Graphs are pervasive, with applications ranging from social networking and traffic planning to bioinformatics. Querying large data graphs is typically computationally expensive due to the high complexity of graph queries and the lack of data locality in graphs. This project aims to alleviate the cost of querying large graphs by (a) relaxing and approximating graph query semantics and (b) effectively and automatically parallelizing graph computations.

[Related Publications]

(3) Data quality: accuracy, consistency and completeness
Real-life datasets are typically dirty, containing noisy data, inconsistencies, or incomplete information. This project studies foundations and techniques for managing dirty data, with a focus on data consistency, accuracy, and information completeness.

[Related Publications]

Selected Awards

Professional Services