Foundations of Data Analysis
Instructor: Jeff Phillips (email) | Office hours: Thursdays 9-10am (MEB 3404) and directly after each class
TAs: Arman Ashkari (arman.ashkari@utah.edu) | Office hours: Mondays 11am-1pm (TBA)
         Lucas Pearce (u1110118@utah.edu) | Office hours: Tuesdays 3:30-5:30pm (TBA)

Spring 2026 | Monday, Wednesday 3pm-4:20 pm
WEB L101 (YouTube)
Catalog number: CS/DS 3190 01



Syllabus | Course Description | ELOs | Book | Pre-Reqs | Schedule | Policies

Description:
This class will be an introduction to computational data analysis, focusing on the mathematical foundations, but providing some basic experience with analysis techniques. The goal will be to carefully develop and explore several core topics that form the backbone of modern data analysis, including Machine Learning, Data Mining, Artificial Intelligence, and Visualization. This will include some background in probability and linear algebra, and then various topics including Bayes' Rule and its connection to inference, linear regression and its polynomial and high-dimensional extensions, principal component analysis and dimensionality reduction, as well as classification and clustering. We will also focus on modern PAC (probably approximately correct) and cross-validation models for algorithm evaluation.
Some of these topics are often covered very briefly at the end of a probability or linear algebra class, and then are often assumed knowledge in advanced data mining or machine learning classes. This class fills that gap. The planned pace will be closer to CS 3130 or MATH 2270 than to the 5000-level advanced data analysis courses.

Expected Learning Outcomes: On completion of the course, students should be able to:
  • Represent data points as vectors and data sets as matrices, and manipulate them with tools from linear algebra.
  • Express a model that fits data as a geometric object described by a small number of parameters, with the goal of minimizing the sum of squared errors, motivated by probability under the assumption of iid data.
  • Understand basic formulations, models, and algorithms for linear regression, dimensionality reduction, clustering, and classification.
  • Optimize a convex function with gradient descent, and apply these tools to optimize model parameters with respect to a cost function derived from data (see the sketch after this list).
  • Evaluate supervised learning methods (regression and classification) by how well they generalize to new data, using cross-validation.
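
As a small taste of how these outcomes fit together, here is a minimal sketch in Python (using numpy on made-up data; the variable names and numbers are invented for illustration, not taken from the book) that stores a data set as a matrix and fits a linear model by minimizing the sum of squared errors with gradient descent:

    import numpy as np

    # Made-up data set: n points stored as rows of a matrix X (with a
    # column of 1s for the offset term) and a dependent-variable vector y.
    rng = np.random.default_rng(0)
    n = 100
    X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 2))])
    w_true = np.array([1.0, 2.0, -0.5])
    y = X @ w_true + rng.normal(scale=0.1, size=n)

    # Cost: sum of squared errors f(w) = ||Xw - y||^2, a convex function
    # whose gradient is 2 X^T (Xw - y).
    def gradient(w):
        return 2 * X.T @ (X @ w - y)

    # Plain gradient descent with a small fixed step size.
    w = np.zeros(3)
    for _ in range(500):
        w = w - 0.001 * gradient(w)

    print(w)  # should be close to w_true

The closed-form least-squares solution, and evaluating such fits with cross-validation, are both covered later in the course.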

We will use Python in the class to demonstrate and explore basic concepts, but programming will not be the main focus.
Former TA Hasan Poormahmood created a short Python tutorial on loading, manipulating, processing, and plotting data in Python in Colab. Here is the Python notebook so you can follow along.
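
For a sense of what that tutorial covers, here is a minimal sketch of loading, processing, and plotting a data set with pandas and matplotlib (the file name and column names are hypothetical placeholders):

    import pandas as pd
    import matplotlib.pyplot as plt

    # Load a table of data from a CSV file (hypothetical file name).
    df = pd.read_csv("data.csv")

    # Basic processing: summarize the columns, then keep only some rows.
    print(df.describe())
    subset = df[df["value"] > 0]   # hypothetical column name

    # Plot one column against another.
    plt.scatter(subset["x"], subset["y"])
    plt.xlabel("x")
    plt.ylabel("y")
    plt.show()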

Book: Mathematical Foundations of Data Analysis (v1.0)
An earlier version (v0.6) is free and available online as a pdf. In v1.0, the formatting and page numbering are updated, and the writing is improved in spots. Some content is also added in v1.0, but it does not affect the parts covered in this course.

Videos: Lectures will be given in person, and live streamed on YouTube. Soon after, videos will automatically appear on the YouTube playlist.

Prerequisites:
The official pre-requisites are CS 2100, CS 2420, and MATH 2270. These ensure a certain basic mathematical maturity (CS 2100), a basic understanding of how to store and manipulate data with some efficiency (CS 2420), and the basics of linear algebra and high dimensions (MATH 2270).
We have as a co-requisite CS 3130 (or MATH 3070) to ensure some familiarity with probability.
A few lectures will be devoted to reviewing linear algebra and probability, but at a fast pace and with a focus on the data interpretation of these domains. I understand students now obtain background in data analysis in a variety of different ways; contact the instructor if you think you may manage without these pre-requisites.
This course is a pre-requisite for CS 5350 (Machine Learning) and CS 5140 (Data Mining), and is a required course for the BS in Data Science.
The COMP 5960 course, meant for non-SoC graduate students and non-matriculated students, does not have pre-requisites.

Schedule:
Date | Chapter | Topic | Assignment
Mon 1.05 | | Class Overview |
Wed 1.07 | Ch 1 - 1.2 | Probability Review: Sample Space, Random Variables, Independence | Quiz 0
Mon 1.12 | Ch 1.3 - 1.6 | Probability Review: PDFs, CDFs, Expectation, Variance, Joint and Marginal Distributions (colab) | HW 1 out
Wed 1.14 | Ch 1.7 | Bayes' Rule: MLEs and Log-likelihoods |
Mon 1.19 | | MLK DAY |
Wed 1.21 | Ch 1.8 | Bayes' Rule: Bayesian Reasoning | Quiz 1
Mon 1.26 | Ch 2.1 - 2.2 | Convergence: Central Limit Theorem and Estimation (colab) |
Wed 1.28 | Ch 2.3 | Convergence: PAC Algorithms and Concentration of Measure | HW 1 due
Mon 2.02 | Ch 3.1 - 3.2 | Linear Algebra Review: Vectors, Matrices, Multiplication and Scaling | HW 2 out
Wed 2.04 | Ch 3.3 - 3.5 | Linear Algebra Review: Norms, Linear Independence, Rank and numpy (colab) | Quiz 2
Mon 2.09 | Ch 3.6 - 3.8 | Linear Algebra Review: Inverse, Orthogonality |
Wed 2.11 | Ch 5.1 | Linear Regression: explanatory & dependent variables (colab) | HW 2 due
Mon 2.16 | | PRESIDENTS DAY | HW 3 out
Wed 2.18 | Ch 5.2 - 5.3 | Linear Regression: multiple regression (colab), polynomial regression (colab) | Quiz 3
Mon 2.23 | Ch 5.4 | Linear Regression: overfitting and cross-validation + double descent (colab) |
Wed 2.25 | Ch 6.1 - 6.2 | Gradient Descent: functions, minimum, maximum, convexity & gradients | HW 3 due
Mon 3.02 | Ch 6.3 | Gradient Descent: algorithms & convergence (colab) | HW 4 out
Wed 3.04 | Ch 6.4 | Gradient Descent: fitting models to data and stochastic gradient descent | Quiz 4
Mon 3.09 | | SPRING BREAK |
Wed 3.11 | | SPRING BREAK |
Mon 3.16 | Ch 7.1 - 7.2 | Dimensionality Reduction: projecting onto a basis |
Wed 3.18 | Ch 7.2 - 7.3 | Dimensionality Reduction: SVD and rank-k approximation (colab) | HW 4 due
Mon 3.23 | Ch 7.4 | Dimensionality Reduction: eigendecomposition and power method (colab) | HW 5 out
Wed 3.25 | Ch 7.5 - 7.6 | Dimensionality Reduction: PCA, centering (colab), and MDS (colab) | Quiz 5
Mon 3.30 | Ch 8.1 | Clustering: Voronoi Diagrams + Assignment-based Clustering |
Wed 4.01 | Ch 8.3 | Clustering: k-means (colab) | HW 5 due
Mon 4.06 | Ch 8.4, 8.7 | Clustering: EM, Mixture of Gaussians, Mean-Shift | HW 6 out
Wed 4.08 | Ch 9.1 | Classification: Linear prediction | Quiz 6
Mon 4.13 | Ch 9.2 | Classification: Perceptron Algorithm |
Wed 4.15 | Ch 9.3 | Classification: Kernels and SVMs | HW 6 due
Mon 4.20 | Ch 9.4 - 9.5 | Classification: Neural Nets, Decision Trees, etc. |
Mon 4.27 | | FINAL EXAM 3:30pm - 5:30pm (practice) |



Class Organization: The class will be run through this webpage and Canvas. The schedule, notes, and links will be maintained here. All homeworks will be turned in through GradeScope.


Grading: There will be one final exam worth 30% of the grade. Homeworks will be worth 40% of the grade; there will be 6 homeworks, and the lowest score is dropped. Quizzes will be worth 30% of the grade; there will be 6 or 7 quizzes (the first, Quiz 0, is worth fewer points).
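
For concreteness, here is a small sketch of how these weights combine (the scores are invented, and treating Quiz 0 the same as the other quizzes is a simplification):

    # Hypothetical scores, all out of 100.
    hws = [95, 80, 60, 90, 85, 88]
    quizzes = [75, 90, 85, 80, 95, 88]
    final_exam = 82

    hw_avg = sum(sorted(hws)[1:]) / (len(hws) - 1)   # drop the lowest homework
    quiz_avg = sum(quizzes) / len(quizzes)

    grade = 0.40 * hw_avg + 0.30 * quiz_avg + 0.30 * final_exam
    print(round(grade, 1))   # 85.3 with these numbers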

The homeworks will usually consist of an analytical problem set, and sometimes light programming exercises in Python. When Python is used, we will typically work through examples in class first.


Late Policy: To get full credit for an assignment, it must be turned in through GradeScope by 2:30pm on the day it is due (then come to class at 3pm). Once GradeScope marks a submission late, it loses 10% of the total points. After 24 hours, another 20% is deducted. After 48 hours, submissions receive a 0 -- but an assignment can still be turned in and graded, so it can count as the lowest score, which is dropped.


KSoC Policies: This course follows the guidelines in the KSoC Handbook.

This class has the following collaboration policy:
For assignments, students may discuss questions with anyone, including the problem approach, proofs, and code. But all students must write their own code, proofs, and write-ups. If you collaborated with another student on a homework to the extent that you expect your answers may start to look similar, you must explicitly explain the extent of the collaboration on that homework. Students whose homeworks appear too similar and who did not explain the collaboration will receive a 0 on that assignment. For quizzes and exams, you must work by yourself. Students talking during exams without the instructor or a TA present will have their tests confiscated and receive a 0.


More Resources:
I hope the book provides all the information required to understand the material for the class, and gives a solid footing beyond it. However, it is sometimes useful to also explore other sources.
Wikipedia is often a good source on many of these topics. In the past, students have also enjoyed 3 Blue 1 Brown.

Here are a few other books that cover some of the material, but at a more advanced level:
Understanding ML | Foundations of Data Science | Introduction to Statistical Learning