He Sun's homepage

Algorithmic Foundation of Data Science

Optional Course in Informatics, University of Edinburgh

Background

Algorithmic Foundation of Data Science (AFDS) is an optimal course that I designed for the 3rd- and 4th-year undergraduates in computer science and mathematics. The overall set of materials is presented through 15 lectures, each of which lasts for about 1 hour. I taught AFDS at the University of Edinburgh in 2018 and 2019, and this webpage archives most materials that I used when teaching AFDS.

Course Aim

Data Science and Big Data Analysis is the research field that studies efficient processing and analysing massive datasets, and has received considerable attention from different research fields, including algorithms, machine learning, network sciences, probability theory and statistics. This line of research has also motivated a sequence of novel techniques for designing efficient algorithms for massive graphs, and has enormous industrial impact. This course aims to introduce algorithmic techniques that form the foundations of processing and analysing massive data sets of various forms. In particular, the course discusses how to process massive datasets, effectively store massive datasets, design fast algorithms for massive datasets, and analyse the performance of the designed algorithm. Through various examples as well as the coursework, the students will see applications of the topics discussed in class in other research field of computer science, e.g., machine learning, and network optimisation.

Learning Outcome

Demonstrate familiarity with fundamentals for processing massive datasets.
Describe and compare various algorithmic design techniques covered in the syllabus to process massive datasets.
Apply the learned techniques to design efficient algorithms for massive datasets.
Apply basic knowledge in linear algebra and probability theory to prove the efficiency of the designed algorithms.
Use appropriate software to solve certain algorithmic problems for a given data set.

Pre-requisites

Calculus: limits, sums, integration, differentiation, recurrence relations
Graph theory: graphs, digraphs, trees
Probability: random variables, expectation, variance, Markov's inequality, Chebyshev's inequality
Linear algebra: vectors, matrices, eigenvectors and eigenvalues, rank
Students should be familiar with the definition and use of big-O notation, and must be comfortable with both reading and constructing mathematical proofs using various methods such as proof by induction and proof by contradiction.

Lectures

Lecture 1: Introduction Slides
Lecture 2: High-Dimensional Spaces (1) Lecture Notes
Lecture 3: High-Dimensional Spaces (2) Lecture Notes
Lecture 4: High-Dimensional Spaces (3) Lecture Notes
Lecture 5: Best-Fit Subspaces and Singular Value Decomposition (1) Lecture Notes
Lecture 6: Best-Fit Subspaces and Singular Value Decomposition (2) Lecture Notes
Lecture 7: Best-Fit Subspaces and Singular Value Decomposition (3) Lecture Notes
Lecture 8: Hashing Functions Slides
Lecture 9: Data Streaming Algorithms (1) Slides
Lecture 10: Data Streaming Algorithms (2) Slides
Lecture 11: Graphs versus Matrices Lecture Notes
Lecture 12: The Cheeger Inequality Lecture Notes
Lecture 13: Spectral Clustering Slides
Lecture 14: Spectral Sparsification of Graphs (1) Lecture Notes
Lecture 15: Spectral Sparsification of Graphs (2) Lecture Notes

Coursework & Solutions

Coursework
Sample Images & Sample Output