Assignment #2: Streaming & Sketching

Assignment Overview

The second assignment will teach you about streaming (specifically, one-pass streaming) and sketching algorithms. We will use sketching algorithms to solve the heavy hitters problem. Given an input stream S of size N, a φ-heavy hitter is an item that occurs at least φN times in S. The problem of finding heavy hitters is extensively studied in the database literature. The primary goal of this assignment is to become familiar with three sketching algorithms for streaming (Misra-Gries, Count Sketch, and Count-Min Sketch) and to evaluate the accuracy (precision and recall) of their results.

All the code in this programming assignment must be written in C/C++. If you have not used C++ before, here's a short tutorial on the language. Even if you are familiar with C++, go over this guide for additional information on writing code in the system. If you want to use another programming language for this assignment, please ask the instructor first.

This is a single-person project that will be completed individually (i.e., no groups).
Implementation Details

There are four steps in this assignment:

Step #1 - Sketch API implementation

Implement the three sketching algorithms separately. All three sketching data structures should support the following operations.
Add function. At any time, one can get the count estimate of an item using the Estimate function. After ingesting the stream, you can query the sketch to get the φ-heavy hitters using the HeavyHitter function.
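The operations named in this step (Add, Estimate, HeavyHitter) can be illustrated with a minimal sketch of Misra-Gries. This is only one possible shape, not the required interface: the class name `MisraGries`, the constructor parameter `num_counters`, and the choice to have HeavyHitter report every tracked item whose upper-bound count reaches φN (which favors recall over precision) are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical Misra-Gries summary that keeps at most num_counters counters.
class MisraGries {
 public:
  explicit MisraGries(std::size_t num_counters) : k_(num_counters) {
    assert(k_ >= 1);
  }

  // Ingest one item from the stream.
  void Add(std::uint64_t item) {
    ++n_;
    auto it = counters_.find(item);
    if (it != counters_.end()) {
      ++it->second;  // item is already tracked
      return;
    }
    if (counters_.size() < k_) {
      counters_.emplace(item, 1);  // a free counter is available
      return;
    }
    // No free counter: decrement every counter and drop those that reach zero.
    for (auto c = counters_.begin(); c != counters_.end();) {
      if (--c->second == 0) {
        c = counters_.erase(c);
      } else {
        ++c;
      }
    }
  }

  // Lower-bound estimate; it undercounts by at most N / (k + 1).
  std::uint64_t Estimate(std::uint64_t item) const {
    auto it = counters_.find(item);
    return it == counters_.end() ? 0 : it->second;
  }

  // Report tracked items whose estimate plus the maximum undercount reaches phi * N.
  std::vector<std::uint64_t> HeavyHitter(double phi) const {
    const double n = static_cast<double>(n_);
    const double slack = n / static_cast<double>(k_ + 1);
    std::vector<std::uint64_t> result;
    for (const auto& entry : counters_) {
      if (static_cast<double>(entry.second) + slack >= phi * n) {
        result.push_back(entry.first);
      }
    }
    return result;
  }

 private:
  std::size_t k_;        // maximum number of counters
  std::uint64_t n_ = 0;  // number of items ingested so far
  std::unordered_map<std::uint64_t, std::uint64_t> counters_;
};
```

With k counters, every item that occurs more than N / (k + 1) times is guaranteed to remain tracked, so choosing k at least 1/φ keeps all φ-heavy hitters in the summary. Count Sketch and Count-Min Sketch would take different size parameters (for example, a width and a depth).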
Use one or more parameters in the initialize function to define the size/error of the sketch, depending on the algorithm.

Step #2 - Ground truth

As part of the assignment, we have included a test program that generates a stream of N items (64 bits each) from a Zipfian distribution, ingests the stream into a hash table, and then computes the heavy hitters based on the frequency of items in the stream. You need to treat the heavy hitter result from the hash table as the ground truth to evaluate the accuracy of your sketch implementations. You can download the source code, implemented in C++. This program creates a hash table using The code to generate a stream of items from a Zipfian distribution is present in To build and run the test program, follow these instructions:

make
./test NUM_ITEMS PHI (NUM_ITEMS: 1000000, PHI: 0.01)

The test program also records the time to generate N items, count N items, and compute heavy hitters and reports them using

Step #3 - Evaluation

You need to evaluate the accuracy (precision and recall) of the three sketch implementations (Misra-Gries, Count Sketch, and Count-Min Sketch) against the ground truth heavy hitters obtained using the hash table. You need to write code to compute the precision and recall of the heavy hitters output by the sketching algorithms. For all three sketch implementations, you need to ingest the items from the stream. Once all items are ingested, you can compute the heavy hitters for different values of φ. For example, you can vary φ from 0.001 to 0.01 with a step size of 0.001, i.e., 0.001, 0.002, 0.003, ..., 0.01. You further need to tweak the configuration parameters of each sketching algorithm to vary the precision/recall and evaluate the resulting variation in the time/space of the sketching algorithm. You should compare the space and time required by each sketch to the space and time used by the hash table.
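The ground-truth and accuracy computations described above can be sketched as follows. The names `Accuracy`, `ExactHeavyHitters`, and `ComputeAccuracy` are hypothetical, and `std::unordered_map` is used here only as a stand-in for the hash table in the provided test program, whose actual container may differ. Precision is the fraction of reported items that are true heavy hitters, and recall is the fraction of true heavy hitters that were reported.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical result type for the evaluation.
struct Accuracy {
  double precision;
  double recall;
};

// Exact phi-heavy hitters: items that occur at least phi * N times in the stream.
std::vector<std::uint64_t> ExactHeavyHitters(
    const std::vector<std::uint64_t>& stream, double phi) {
  assert(phi > 0.0 && phi <= 1.0);
  std::unordered_map<std::uint64_t, std::uint64_t> counts;
  for (std::uint64_t item : stream) {
    ++counts[item];
  }
  const double threshold = phi * static_cast<double>(stream.size());
  std::vector<std::uint64_t> result;
  for (const auto& entry : counts) {
    if (static_cast<double>(entry.second) >= threshold) {
      result.push_back(entry.first);
    }
  }
  return result;
}

// Compare the heavy hitters reported by a sketch against the ground truth.
Accuracy ComputeAccuracy(const std::vector<std::uint64_t>& reported,
                         const std::vector<std::uint64_t>& truth) {
  const std::unordered_set<std::uint64_t> reported_set(reported.begin(),
                                                       reported.end());
  const std::unordered_set<std::uint64_t> truth_set(truth.begin(), truth.end());
  std::size_t true_positives = 0;
  for (std::uint64_t item : reported_set) {
    if (truth_set.count(item) > 0) {
      ++true_positives;
    }
  }
  Accuracy acc;
  // Convention assumed here: an empty set yields a score of 1.0.
  acc.precision = reported_set.empty()
                      ? 1.0
                      : static_cast<double>(true_positives) /
                            static_cast<double>(reported_set.size());
  acc.recall = truth_set.empty()
                   ? 1.0
                   : static_cast<double>(true_positives) /
                         static_cast<double>(truth_set.size());
  return acc;
}
```

For the φ sweep, one option is to compute the ground truth once per value of φ and call `ComputeAccuracy` on the output of each sketch's HeavyHitter function for the same φ.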
To compute the final precision and recall for reporting purposes, you should use φ: 0.001. For the stream, you need to use the default Zipfian parameters (Universe: 2^30 and Exponent: 1.5) and stream size N: 100 million. You can extend the test program for evaluation and benchmarking. You need to use your sketch implementations in the same way the program uses the std::map.

Step #4 - Report

You need to write a report including:
In the report, you need to plot the following:
Tip: Format the output of your main function to match the requirements of your plotting code/app.
Tip: If you are looking for a new/better way to plot these charts, try searching 'python matplotlib' and 'latex pgfplots' in combination with 'line chart' on Google. It is OK to borrow some example code from open source code for plotting only.
Tip: To organize multiple plots in a single figure, try searching 'latex groupplot' or 'python matplotlib subplot'.

Instructions

You will use the CADE cluster to finish this project. CADE manages clusters that you can use to do your development and testing for all of the class projects. You are free to use other machines and environments, but all grading will be done on these machines. Please test your solutions on these machines.
Check with CADE if you need to set up an account. CADE machines all share your home directory, so you needn't log in to the same machine each time to continue working.
After you have an account, choose a machine at random from the lab1- set of machines listed on the lab status page (that is,

ssh lab1-10.eng.utah.edu

CADE user accounts have tcsh set as their default shell. Each time you log in, first run bash before anything else. All instructions, examples, and scripts from this class assume you are using bash as your shell. You'll need to do this each time unless you reset your default shell (link), which I'd recommend. Perhaps savvy users can provide slick setups. This step is important. If you don't reset your shell, other things will mysteriously break as you try to work through the labs. There is also a CADE setup document available in Canvas for reference.

Submission

You need to submit a You should also include a separate
We will evaluate the correctness and the performance of your implementation offline after the project due date.

Collaboration Policy
If you have any questions, please contact the instructor.