High-Dimensional Data Analysis
Instructor : Jeff Phillips (email) | Office hours: Thursdays 9-10am (MEB 3404) and directly after each class
TAs: Atreya Ghosal (email) | Office hours: Tuesday 12-2pm

Fall 2024 | Monday, Wednesday 3:00 pm - 4:20 pm
AEB 310
Catalog number: CS 5966/6966 01


Syllabus

Description:
In modern data analysis, it is common to deal with data represented as high-dimensional vectors. These have become part of the default featurized or "learned" representation for large language models, images, audio, and just about any other structured data one wants to learn from. This course will cover the curse of dimensionality, as well as various methods to partially overcome it, including dimensionality reduction, margin-based approaches, locality sensitive hashing, and no-dimensional coresets. The focus will mainly be on methods with mathematically rigorous proofs, including analysis of correctness and efficiency; but keeping to simple methods that work in practice, and showing how to use these approaches on data.

For undergraduate students, prerequisites are CS 3190 and CS 4150.
For representing and working with high dimensions, some linear algebra is a necessity. The instructor will use geometric perspectives to explain phenomenons when possible. Some probabilistic arguments will be necessary, especially concentration of measure bounds. While the ideas will conceptually require synthesis of advanced ideas, the aim will be explanations from first principals as much as possible.
Coding tasks will be agnostic to language, but convenient vector process (e.g., matlab or numpy in python) will be an essential component.

Books: There will not be a single textbook for this course. Here are some I may draw on:
  • FDS: Foundations of Data Science by Avrim Blum, John Hopcroft, and Ravi Kannan (and here)
  • M4D: Mathematical Foundations for Data Analysis by Jeff Phillips
  • HDP: High Dimensional Probability by Roman Vershynin
  • HDLM: High-Dimensional Data Analysis with Low-Dimensional Models by John Wright and Yi Ma
  • AHRS: Algorithmic High-Dimensional Robust Statistics by Ilias Diakonikolas and Daniel M. Kane
    More resources like papers and course notes are linked in the schedule topic.

    Lecture notes are in markdown and can be rendered nicely with a browser plug-in. Make sure to enable latex/mathjax.

    (tentative) Schedule:
    Date Chapter Notes Topic Assignment
    Mon 8.19 Class Overview
    Wed 8.21 FDS 2, HDP 3 L1 Curse of Dimensionality : Basic Geometry
    Mon 8.26 L2 Curse of Dimensionality : Nearest Neighbors
    Quiz 1
    Wed 8.28 seidel, eps-kernel, John's Ellipsoid L3 Curse of Dimensionality : Convex Hulls
    Mon 9.2
    LABOR DAY
    Wed 9.4 FDS 5.11, HDP 8.3, AB book L4 Hardness of Estimation : VC-dimension and Learning Theory
    Mon 9.9 AHRS 2, HDP 4, 7.3 L5 Hardness of Estimation : Mean/Parameter Estimation
    Wed 9.11 RFFs, RKHS, kernel distance L6 Building Embeddings : Lifting HW 1 due
    Mon 9.16 M4D 4.2, Magen notes L7 Building Embeddings : Metric Embeddings
    Quiz 2
    Wed 9.18 BM25, SIFT L8 Building Embeddings : Feature Modeling
    Mon 9.23 Embed Vis, Collab L9 Building Embeddings : Masked Word Models
    Wed 9.25 L10 Building Embeddings : Contrastive Learning
    Mon 9.30 FDS 3, M4D 7.6, t-SNE, analysis L11 Dimensionality Reduction : PCA, t-SNE
    Quiz 3
    Wed 10.2 M4D 7.7, DML L12 Dimensionality Reduction : Distance Metric Learning HW 2 due
    Mon 10.7
    FALL BREAK
    Wed 10.11
    FALL BREAK
    Mon 10.14 FDS 2.7 - Dimensionality Reduction : Random Projections + JL extensions
    Wed 10.16 M4D 4.6 - Similarity Search : Locality Sensitive Hashing
    Project Proposal
    Mon 10.21 - Similarity Search : Graph-based Search (aka HNSW)
    Quiz 4
    Wed 10.23 - Similarity Search : SOTA - Pinecone, FAISS
    Mon 10.28 FDS 5.2 - Optimization : Perceptron
    Wed 10.30 - Optimization : Minimum Enclosing Ball (and Frank-Wolfe) HW 3 due
    Mon 11.4 - Optimization : Convexity, non-convexity, Polyak-Lojasiewicz
    Quiz 5
    Wed 11.6 - Coresets : Kernel Density Estimates
    Mon 11.11 - Coresets : Sensitivity Sampling (Radamacher Complexity)
    Wed 11.13 - Coresets : Low-Rank Convex Hulls
    Mon 11.18 - Sampling : from Convex Polytopes
    Quiz 6
    Wed 11.20 - Sampling : Hit-and-Run HW 4 due
    Mon 11.25
    Project Presentations
    Project Report due
    Wed 11.27
    Project Presentations
    Mon 12.2
    Project Presentations
    Wed 12.4
    Project Presentations
    Quiz 7
    Thu 12.12
    FINAL EXAM 3:30pm - 5:30pm



    Class Organization: The class will be run through this webpage and Canvas. The schedule, notes, and links will be maintained here. All homeworks will be turned in through Canvas.


    Grading: Quizzes will be worth 20% of the grade. There will be 7. The lowest score dropped.
    Assignments will be worth 30% of the grade. There will be about 4.
    A Project will be worth 30% of the grade.
    The Final Exam worth 20% of the grade. It will be like the quizzes.

    The homeworks will usually consist of an analytical problems set, and/or light programming exercises in a programming language of your choice.
    The quizzes should take about 10 minutes each, and should be simple if you are attending and following the lectures.

    Late Policy: To get full credit for an assignment, it must be turned in through Canvas by the end of the day it is due, specifically 11:50 pm. Once the 11:50pm deadline is missed, those turned in late will lose 10%. Every subsequent 24 hours until it is turned another 10% is deducted. That is, a homework 30 hours late worth 10 points will have lost 2 points. Once the graded assignment is returned, or 48 hours has passed, any assignment not yet turned in will be given a 0.



    This class has the following collaboration policy:
    For assignments, students may discuss answers with anyone, including problem approach, proofs, and code. But all students must write their own code, proofs, and write-ups. If you collaborated with another student on homeworks to the extent that you expect your answers may start to look similar, you must explain the extent to which you collaborated explicitly on the homework. Students whose homeworks appear too similar, and did not explain the collaboration will get a 0 on that assignment.
    For quizzes and exams should be taken individually. Students discovered discussing with other students the contents of a quiz or test before they have both turned in their quiz will constitute cheating for both students. Students caught cheating will get a 0 and will not be able to drop a low score on a quiz. Discussion threads, chat areas, and emails are all considered to be equivalent to the classroom, and your behavior in all these venues should conform to the university's student code