He Sun

Algorithmic Foundation of Data Science

Optional Course in Informatics, University of Edinburgh

Algorithmic Foundation of Data Science (AFDS) is an optimal course that I designed for the 3rd- and 4th-year undergraduates in computer science and mathematics. The overall set of materials is presented through 15 lectures, each of which lasts for about 1 hour. I taught AFDS at the University of Edinburgh in 2018 and 2019, and this webpage archives most materials that I used when teaching AFDS.

Data Science and Big Data Analysis is the research field that studies efficient processing and analysing massive datasets, and has received considerable attention from different research fields, including algorithms, machine learning, network sciences, probability theory and statistics. This line of research has also motivated a sequence of novel techniques for designing efficient algorithms for massive graphs, and has enormous industrial impact. This course aims to introduce algorithmic techniques that form the foundations of processing and analysing massive data sets of various forms. In particular, the course discusses how to process massive datasets, effectively store massive datasets, design fast algorithms for massive datasets, and analyse the performance of the designed algorithm. Through various examples as well as the coursework, the students will see applications of the topics discussed in class in other research field of computer science, e.g., machine learning, and network optimisation.

Learning Outcome
  • Demonstrate familiarity with fundamentals for processing massive datasets.
  • Describe and compare various algorithmic design techniques covered in the syllabus to process massive datasets.
  • Apply the learned techniques to design efficient algorithms for massive datasets.
  • Apply basic knowledge in linear algebra and probability theory to prove the efficiency of the designed algorithms.
  • Use appropriate software to solve certain algorithmic problems for a given data set.

Pre-requisites
  • Calculus: limits, sums, integration, differentiation, recurrence relations
  • Graph theory: graphs, digraphs, trees
  • Probability: random variables, expectation, variance, Markov's inequality, Chebyshev's inequality
  • Linear algebra: vectors, matrices, eigenvectors and eigenvalues, rank
  • Students should be familiar with the definition and use of big-O notation, and must be comfortable with both reading and constructing mathematical proofs using various methods such as proof by induction and proof by contradiction.

  • Lecture 1: Introduction Slides

  • Lecture 2: High-Dimensional Spaces (1) Lecture Notes

  • Lecture 3: High-Dimensional Spaces (2) Lecture Notes

  • Lecture 4: High-Dimensional Spaces (3) Lecture Notes

  • Lecture 5: Best-Fit Subspaces and Singular Value Decomposition (1) Lecture Notes

  • Lecture 6: Best-Fit Subspaces and Singular Value Decomposition (2) Lecture Notes

  • Lecture 7: Best-Fit Subspaces and Singular Value Decomposition (3) Lecture Notes

  • Lecture 8: Hashing Functions Slides

  • Lecture 9: Data Streaming Algorithms (1) Slides

  • Lecture 10: Data Streaming Algorithms (2) Slides

  • Lecture 11: Graphs versus Matrices Lecture Notes

  • Lecture 12: The Cheeger Inequality Lecture Notes

  • Lecture 13: Spectral Clustering Slides

  • Lecture 14: Spectral Sparsification of Graphs (1) Lecture Notes

  • Lecture 15: Spectral Sparsification of Graphs (2) Lecture Notes

Coursework & Solutions
Problem Sets & Solutions