Data Mining
Instructor: Jeff Phillips (email) | Office hours: 11am-noon on Thursdays @ MEB 3442 (and often directly after class)
TA: Yan Zheng | Office hours: 10am-noon on Mondays @ MEB 3115
Spring 2013 | Mondays, Wednesdays 5:15 pm - 6:35 pm
MEB 3147 (LCR) WEB L114
Catalog number: CS 5955 01 (ugrad) or CS 6955 01 (grad)

Data mining is the study of efficiently finding structures and patterns in data sets. We will also study what structures and patterns you cannot find. The structures and patterns are based on statistical and probabilistic principles, and they are found efficiently through the use of clever algorithms. This class will take this two-pronged approach to the topic: we will first understand the models, and then explore efficient algorithms to find them.
This class may differ greatly from many data mining classes offered elsewhere. Perhaps it should be called "Large Scale Data Mining," since many of the techniques we will discuss have been designed to deal with (or have survived the onslaught of) very large scale data. Many of these techniques use randomized algorithms; these are often extremely simple to use, but more difficult to analyze. We will focus on how to use them, and give explanations (but often not proofs) of their correctness.
Topics will include: similarity search, clustering, regression/dimensionality reduction, link analysis (PageRank), and small space summaries. We may also discuss anomaly detection, compressed sensing, and pattern matching.
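As a taste of the randomized flavor of these techniques, here is a minimal Python sketch (illustrative only, not course code) of estimating Jaccard similarity with min hashing. The seed-based hash functions are a simplifying assumption; the course covers more careful constructions.

```python
def minhash_signature(s, seeds):
    # For each seeded hash function, keep the minimum hash value over the set.
    return [min(hash((seed, x)) for x in s) for seed in seeds]

def estimate_jaccard(a, b, k=500):
    # Under a random hash, the probability that two sets share the same
    # minimum equals their Jaccard similarity, so the fraction of matching
    # signature entries estimates it.
    seeds = range(k)
    sig_a = minhash_signature(a, seeds)
    sig_b = minhash_signature(b, seeds)
    return sum(x == y for x, y in zip(sig_a, sig_b)) / k

# Two sets with true Jaccard similarity 4/12 = 1/3.
a = set(range(1, 9))
b = set(range(5, 13))
print(estimate_jaccard(a, b))  # close to 0.333 for large enough k
```

The point of the sketch: signatures of length k replace the sets themselves, so similarity can be estimated from small summaries.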

MMDS(v1.2): Mining Massive Data Sets by Anand Rajaraman and Jeff Ullman. The digital version of the book is free, but you may wish to purchase a hard copy.
CSTIA: Computer Science Theory for the Information Age by John Hopcroft and Ravi Kannan. This is currently only collated lecture notes from a theory class that covers some similar topics. I may refer to this book for proofs not covered in the notes.
When material is not covered by the books, free reference material will be linked to or produced.

Videos: All lectures have been videotaped and are available online. See the discussion group for details on accessing videos on iTunes U.
I have begun uploading videos to my YouTube Channel for broader availability.

Prerequisites: A student who is comfortable with basic probability, basic big-O analysis, and simple programming will be qualified for the class. There is no specific language we will use.
For undergrads, the prerequisites are CS 3505 and CS 2100. It is also highly recommended that you have taken CS 3130; in many ways, this class is the natural continuation of that course.
In the past, this class has had undergraduates, masters, and PhD students, including many from outside of Computer Science. Most have kept up fine, and still most have been challenged. If you are unsure if the class is right for you, contact the instructor.
Date Topic Link Assignment (latex) Project
Mon 1.07 (Instructor Traveling - No Class)
Wed 1.09 Class Overview MMDS 1.1
Mon 1.14 Statistics Principles : Birthday Paradox + Coupon Collector (N) MMDS 1.2
Wed 1.16 Chernoff-Hoeffding Bounds + Applications (N) CSTIA 2.3 | Terry Tao Notes
Mon 1.21 (MLK Day - No Class)
Wed 1.23 Similarity : Jaccard + Shingling (N) MMDS 3.1 + 3.2 | CSTIA 7.3
Mon 1.28 Similarity : Min Hashing (N) MMDS 3.3
Wed 1.30 Similarity : LSH (N) MMDS 3.4 Statistical Principles
Mon 2.04 Similarity : Distances (N) MMDS 3.5 + 7.1 | CSTIA 8.1 Proposal
Wed 2.06 Similarity : SIFT and ANN vs. LSH (N) MMDS 3.7 + 7.1.3
Mon 2.11 Clustering : Hierarchical (N) MMDS 7.2 | CSTIA 8.7
Wed 2.13 Clustering : K-Means (N) MMDS 7.3 | CSTIA 8.3
Mon 2.18 (Presidents Day - No Class)
Wed 2.20 Clustering : Spectral (N) MMDS 10.4 | Luxburg Tutorial | CSTIA 8.4 Document Hashing
Mon 2.25 Frequent Items : Heavy Hitters (N) MMDS 4.1 | CSTIA 7.1.3 | Min-Count Sketch | Misra-Gries Data Collection Report
Wed 2.27 Frequent Itemsets : Apriori Algorithm (N) MMDS 6+4.3 | Careful Bloom Filter Analysis
Mon 3.04 Regression : Basics in 2-dimensions (N) ESL 3.2 and 3.4
Wed 3.06 Regression : PCA (N) Geometry of SVD - Chap 3 | CSTIA 4 Clustering
Mon 3.11 (Spring Break - No Class)
Wed 3.13 (Spring Break - No Class)
Mon 3.18 Regression : Column Sampling and Frequent Directions (N) MMDS 9.4 | CSTIA 2.7 + 7.2.2 | arXiv
Wed 3.20 Regression : Compressed Sensing and OMP (N) CSTIA 10.3 | Tropp + Gilbert Intermediate Report
Mon 3.25 Regression : L1 Regression and Lasso (N) Davenport | ESL 3.8
Wed 3.27 Noise : Cross Validation and Uncertain Data MMDS 9.1 | Tutorial Frequent
Mon 4.01 Noise : Outliers + Heavy Tails (N) Dwork
Wed 4.03 Link Analysis : Markov Chains (N) MMDS 10.1 + 5.1 | CSTIA 5 | Weckesser notes
Mon 4.08 Link Analysis : PageRank (N) MMDS 5.1 + 5.4 Final Report
Wed 4.10 Link Analysis : MapReduce MMDS 2 Regression
Mon 4.15 Link Analysis : PageRank via MapReduce (N) MMDS 5.2
Wed 4.17 Link Analysis : Communities (N) MMDS 10.2 + 5.5 | CSTIA 8.8 + 3.4 Poster Outline
Mon 4.22 Link Analysis : Graph Sparsification (N) MMDS 4.1
Wed 4.24 Poster Day !!! Poster Presentation
Mon 4.30 Graphs
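To give a sense of the small-space summaries in the Frequent Items lectures above, here is a short illustrative Python sketch of the Misra-Gries algorithm (an assumed implementation for this page, not code from the course materials). It keeps at most k-1 counters yet guarantees any item appearing more than n/k times in a stream of length n survives.

```python
def misra_gries(stream, k):
    # Maintain at most k-1 counters. When a new item arrives and the table
    # is full, decrement every counter instead of adding it.
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# 'a' appears 50 times out of 100, well above n/k = 100/3, so it must survive.
stream = ['a'] * 50 + ['b'] * 30 + [f'c{i}' for i in range(20)]
print(misra_gries(stream, k=3))
```

Each surviving counter undercounts its item by at most n/k, which is exactly the kind of guarantee-with-small-space tradeoff this part of the course is about.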

Grading: The grading will be 50% from homeworks and 50% from a project.

We will plan to have 5 or 6 short homework assignments, roughly covering each main topic in the class. The homeworks will usually consist of an analytical problem set, and sometimes a light programming exercise. There will be no specific programming language for the class, but some assignments may be designed around a specific one that is convenient for that task.

Each person in the class will be responsible for a small project. I will allow small groups to work together. The project will be very open-ended; basically it will consist of finding an interesting data set, exploring it with one or more techniques from class, and presenting what you found. I will try to provide suggestions for data sources and topics, but ultimately the groups will need to decide on their own topic. There will be several intermediate deadlines so projects are not rushed at the end of the semester.

Late Policy: To get full credit for an assignment, a hard (that means printed) copy must be turned in to the TA at the start of class. Once class starts, those turned in late will lose 10%. Every subsequent 24 hours until it is turned in, another 10% is deducted. That is, a homework 30 hours late worth 10 points will have lost 2 points. Once the graded assignment is returned, any assignment not yet turned in will be given a 0.

Cheating Policy: The Utah School of Computing has a Cheating Policy which requires all registered students to sign an Acknowledgement Form. This form must be signed and turned in to the department office before any homeworks are graded.

This class has the following collaboration policy. For assignments, students may discuss answers with anyone, including problem approach, proofs, and code. But all students must write their own code, proofs, and write-ups. For projects, you may of course work however you like within your groups. You may discuss your project with anyone as well, but if this contributes to your final product, they must be acknowledged (this does not count towards page limits). Of course any outside materials used must be referenced appropriately.

LaTeX: I very highly recommend using LaTeX for writing up homeworks. It is something that everyone should know for research and writing scientific documents. This linked directory contains a sample .tex file, as well as what its compiled .pdf outcome looks like. It also has a figure .pdf to show how to include figures.
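If you have not used LaTeX before, a homework write-up is roughly of the following shape (a minimal sketch for orientation, not the actual sample file from the directory; the figure filename is a placeholder):

```latex
\documentclass{article}
\usepackage{amsmath}
\usepackage{graphicx}

\title{CS 5955: Homework 1}
\author{Your Name}

\begin{document}
\maketitle

\paragraph{Problem 1.}
By the union bound, $\Pr[A \cup B] \le \Pr[A] + \Pr[B]$.

\begin{figure}[h]
  \centering
  % "figure.pdf" is a placeholder; use your own figure file.
  \includegraphics[width=0.5\textwidth]{figure.pdf}
  \caption{An included figure.}
\end{figure}

\end{document}
```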

Discussion Group: