Assignment #2: Streaming & Sketching

Assignment Overview

The second assignment will teach you about streaming (specifically, one-pass streaming) and sketching algorithms. We will use sketching algorithms to solve the heavy hitters problem. Given an input stream S of size N, a Φ-heavy hitter is an item that occurs at least ΦN times in S. The problem of finding heavy hitters is extensively studied in the database literature. The primary goal of this assignment is to become familiar with three sketching algorithms for streaming (Misra-Gries, Count Sketch, Count-Min Sketch) and to evaluate the accuracy (precision and recall) of their results.

All the code in this programming assignment must be written in C/C++. If you have not used C++ before, here's a short tutorial on the language. Even if you are familiar with C++, go over this guide for additional information on writing code in the system.

If you want to use another programming language for this assignment, please ask the instructor first.

This is a single-person project that will be completed individually (i.e., no groups).

  • Release date: Mon, February 6

  • Due date: Mon, February 27

Implementation Details

There are four steps in this assignment:

  1. Sketching API
  2. Ground truth
  3. Evaluation
  4. Report

Step #1 - Sketch API implementation

Implement the three sketching algorithms separately. All three sketching data structures should support the following operations.

  • Add (x): Increments the count of item x by 1.
  • Estimate (x): Returns the estimated frequency of item x.
  • HeavyHitters(phi): Returns the set of items whose estimated count is at least ΦN, where N is the size of the stream.
  • Size: Returns the size of the sketch (allocated memory).

The sketch receives a one-pass stream of N items. Each item from the stream is added to the sketch using the Add function. At any time, one can get the count estimate of an item using the Estimate function. After ingesting the stream, you can query the sketch to get the Φ-heavy hitters using the HeavyHitters function.
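As a rough illustration of the four operations, here is a minimal sketch of a common interface together with a Misra-Gries summary. The class names, the choice of std::unordered_map, and the way Size is approximated are all assumptions made for this example, not requirements of the assignment. Note that Misra-Gries never overestimates a count, so thresholding its estimates at ΦN can miss true heavy hitters when the number of counters k is small.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <unordered_map>

// Hypothetical interface; method names mirror the operations listed above.
class Sketch {
public:
    virtual ~Sketch() = default;
    virtual void Add(uint64_t x) = 0;
    virtual uint64_t Estimate(uint64_t x) const = 0;
    virtual std::set<uint64_t> HeavyHitters(double phi) const = 0;
    virtual std::size_t Size() const = 0;
};

// Misra-Gries summary that keeps at most k counters.
class MisraGries : public Sketch {
public:
    explicit MisraGries(std::size_t k) : k_(k) { assert(k > 0); }

    void Add(uint64_t x) override {
        ++n_;
        auto it = counters_.find(x);
        if (it != counters_.end()) {
            ++it->second;                 // item already tracked
        } else if (counters_.size() < k_) {
            counters_.emplace(x, 1);      // free slot available
        } else {
            // No free slot: decrement every counter and drop those at zero.
            for (auto c = counters_.begin(); c != counters_.end();) {
                if (--c->second == 0) c = counters_.erase(c);
                else ++c;
            }
        }
    }

    uint64_t Estimate(uint64_t x) const override {
        auto it = counters_.find(x);
        return it == counters_.end() ? 0 : it->second;
    }

    // Reports tracked items whose estimate is at least phi * N.
    std::set<uint64_t> HeavyHitters(double phi) const override {
        std::set<uint64_t> result;
        const double threshold = phi * static_cast<double>(n_);
        for (const auto& kv : counters_)
            if (static_cast<double>(kv.second) >= threshold) result.insert(kv.first);
        return result;
    }

    // Approximation (assumption): k (item, count) pairs, ignoring the
    // bookkeeping overhead of the hash table itself.
    std::size_t Size() const override { return k_ * 2 * sizeof(uint64_t); }

private:
    std::size_t k_;
    uint64_t n_ = 0;  // number of items ingested so far
    std::unordered_map<uint64_t, uint64_t> counters_;
};
```

With k = 2, adding item 7 five times and then items 1 and 2 once each leaves a single counter for 7 with value 4: the arrival of 2 finds the table full and decrements both existing counters.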

Use one or more parameters in the initialize function to define the size/error of the sketch, depending on the algorithm.
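For example, a Count-Min Sketch is commonly parameterized by an error bound ε and a failure probability δ, with width ⌈e/ε⌉ and depth ⌈ln(1/δ)⌉. The sketch below shows one way such parameters could determine the allocated size. The class layout and the hash mixer (a SplitMix64-style finalizer seeded per row) are assumptions chosen for brevity; the mixer is not a provably pairwise-independent family, so you may want a different hash in your own implementation. HeavyHitters is omitted here because a Count-Min Sketch needs additional candidate tracking to enumerate items.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

class CountMinSketch {
public:
    // epsilon: additive error as a fraction of N; delta: failure probability.
    CountMinSketch(double epsilon, double delta) {
        assert(epsilon > 0.0 && delta > 0.0 && delta < 1.0);
        width_ = static_cast<std::size_t>(std::ceil(std::exp(1.0) / epsilon));
        depth_ = static_cast<std::size_t>(std::ceil(std::log(1.0 / delta)));
        table_.assign(width_ * depth_, 0);
    }

    void Add(uint64_t x) {
        for (std::size_t row = 0; row < depth_; ++row) ++table_[Index(row, x)];
    }

    // Minimum over rows; never smaller than the true count.
    uint64_t Estimate(uint64_t x) const {
        uint64_t best = std::numeric_limits<uint64_t>::max();
        for (std::size_t row = 0; row < depth_; ++row)
            best = std::min(best, table_[Index(row, x)]);
        return best;
    }

    std::size_t Size() const { return table_.size() * sizeof(uint64_t); }
    std::size_t width() const { return width_; }
    std::size_t depth() const { return depth_; }

private:
    // SplitMix64-style mixer (assumption: adequate for illustration only).
    static uint64_t Mix(uint64_t z) {
        z += 0x9e3779b97f4a7c15ULL;
        z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
        z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
        return z ^ (z >> 31);
    }

    // Cell for item x in the given row; each row uses a different seed.
    std::size_t Index(std::size_t row, uint64_t x) const {
        return row * width_ + Mix(x ^ Mix(row + 1)) % width_;
    }

    std::size_t width_ = 0;
    std::size_t depth_ = 0;
    std::vector<uint64_t> table_;  // depth_ rows of width_ counters
};
```

With ε = 0.001 and δ = 0.01 this allocates 5 rows of 2719 counters. Shrinking ε widens the table and tightens the error bound at the cost of more memory, which is the kind of trade-off the evaluation step asks you to measure.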

Step #2 - Ground truth

As part of the assignment, we have included a test program that generates a stream of N 64-bit items from a Zipfian distribution, ingests the stream into a hash table, and then computes the heavy hitters based on the frequency of items in the stream.

You need to treat the heavy hitter result from the hash table as the ground truth to evaluate the accuracy of sketch implementations.

You can download the source code, which is implemented in C++. This program creates a hash table using std::map and computes the heavy hitters in the stream using std::multimap.
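The idea of counting with std::map and then ordering by frequency with std::multimap can be sketched as follows. This is an illustration written for this handout, not the code from the provided test program; the function name ExactHeavyHitters and the use of an in-memory std::vector as the stream are assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <vector>

// Maps count -> item, sorted with the most frequent items first.
using ByFrequency = std::multimap<uint64_t, uint64_t, std::greater<uint64_t>>;

// Counts every item exactly, then keeps those occurring at least phi * N times.
ByFrequency ExactHeavyHitters(const std::vector<uint64_t>& stream, double phi) {
    assert(phi > 0.0 && phi <= 1.0);

    std::map<uint64_t, uint64_t> counts;  // item -> exact frequency
    for (uint64_t item : stream) ++counts[item];

    const double threshold = phi * static_cast<double>(stream.size());
    ByFrequency heavy;
    for (const auto& kv : counts)
        if (static_cast<double>(kv.second) >= threshold)
            heavy.emplace(kv.second, kv.first);
    return heavy;
}
```

For the stream 1, 1, 1, 2, 2, 3, 1, 4, 1, 5 and Φ = 0.15, the threshold is 1.5, so the result contains item 1 (count 5) followed by item 2 (count 2).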

The code to generate a stream of items from a Zipfian distribution is in zipf.h and zipf.c. It generates N items (specified in the program argument) from a universe of 2^30 items with a Zipfian exponent of 1.5. The stream generation code is slow, so you need a little patience while running it. For example, generating 1 million items takes around 73 seconds on a CADE machine. For testing purposes, you can reduce the universe size in test.cc to 2^24 and generate only 1000 items. However, for the final accuracy evaluation you need to switch the parameters back to the default ones.

To build and run the test program, follow these instructions:

make
./test NUM_ITEMS PHI (NUM_ITEMS: 1000000, PHI: 0.01)

The test program also measures the time taken to generate N items, count N items, and compute the heavy hitters, and reports these times using the std::chrono library in C++.
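If you want to time your own sketch operations in the same spirit, a small stopwatch built on std::chrono::steady_clock is one option. The Stopwatch class below is a hypothetical helper written for this handout, not part of the provided test program.

```cpp
#include <cassert>
#include <chrono>

// Minimal stopwatch; steady_clock is monotonic, so intervals are never negative.
class Stopwatch {
public:
    void Start() {
        start_ = Clock::now();
        running_ = true;
    }

    // Returns the seconds elapsed since the matching Start() call.
    double StopSeconds() {
        assert(running_ && "Start() must be called before StopSeconds()");
        running_ = false;
        return std::chrono::duration<double>(Clock::now() - start_).count();
    }

private:
    using Clock = std::chrono::steady_clock;
    Clock::time_point start_;
    bool running_ = false;
};
```

A typical use is to call Start() before the loop that feeds the stream into a sketch and StopSeconds() right after it, then divide by N to get an average update time per item.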

Step #3 - Evaluation

You need to evaluate the accuracy (precision and recall) of the three sketching implementations (Misra-Gries, Count Sketch, and Count-Min Sketch) against the ground truth heavy hitters obtained using the hash table. You need to write code to compute the precision and recall of the heavy hitters output by the sketching algorithms.
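Precision is the fraction of reported items that are true heavy hitters, and recall is the fraction of true heavy hitters that were reported. A minimal sketch of that computation is shown below. The Accuracy struct, the function name, and the convention of returning 1.0 when a set is empty are assumptions; pick whatever convention you prefer and state it in your report.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

struct Accuracy {
    double precision;  // true positives / items reported by the sketch
    double recall;     // true positives / items in the ground truth
};

Accuracy ComputeAccuracy(const std::set<uint64_t>& reported,
                         const std::set<uint64_t>& truth) {
    std::size_t true_positives = 0;
    for (uint64_t item : reported)
        if (truth.count(item) > 0) ++true_positives;
    assert(true_positives <= reported.size() && true_positives <= truth.size());

    Accuracy result;
    // Assumed convention: an empty set counts as perfect (1.0) to avoid 0/0.
    result.precision = reported.empty()
        ? 1.0
        : static_cast<double>(true_positives) / static_cast<double>(reported.size());
    result.recall = truth.empty()
        ? 1.0
        : static_cast<double>(true_positives) / static_cast<double>(truth.size());
    return result;
}
```

For instance, if a sketch reports {1, 2, 3, 4} and the ground truth is {2, 3, 5}, two items are correct, giving a precision of 2/4 and a recall of 2/3.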

For all three sketch implementations, you need to ingest the items from the stream. Once all items are ingested, you can compute the heavy hitters for different values of φ. For example, you can vary φ from 0.001 to 0.01 with a step size of 0.001, i.e., 0.001, 0.002, 0.003, ..., 0.01.
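One practical detail when generating these φ values: repeatedly adding 0.001 to a floating-point variable can accumulate rounding error and cause the loop to skip or repeat the last value. A minimal sketch that avoids this by looping over an integer index is shown below; the function name PhiSweep and its parameters are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Returns first_step * step, (first_step + 1) * step, ..., last_step * step.
std::vector<double> PhiSweep(int first_step, int last_step, double step) {
    assert(first_step <= last_step && step > 0.0);
    std::vector<double> phis;
    for (int i = first_step; i <= last_step; ++i)
        phis.push_back(i * step);  // one multiplication per value, no accumulation
    return phis;
}
```

Calling PhiSweep(1, 10, 0.001) yields the ten thresholds from 0.001 to 0.01, each of which can then be passed to HeavyHitters on an already-ingested sketch.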

You further need to tweak the configuration parameters of each sketching algorithm to vary the precision/recall and evaluate how the time and space of the sketch change as a result. You should compare the space and time required by each sketch with the space and time used by the hash table.

To compute the final precision and recall for reporting purposes, you should use φ = 0.001.

For the stream, you need to use the default Zipfian parameters (universe: 2^30, exponent: 1.5) and a stream size of N = 100 million.

You can extend the test program for evaluation and benchmarking. Wherever the program ingests the stream into the std::map, ingest it into your sketch implementations in the same way.

Step #4 - Report

You need to write a report including:

  • Description of your implementations, any design decisions you made and why.
  • Any interesting phenomena you observe in your plots, with explanations based on theoretical constructs.

In the report, you need to plot the following:

  1. Precision (Y-axis) vs. update time (X-axis), one line for each algorithm.
  2. Precision (Y-axis) vs. space (X-axis), one line for each algorithm.
  3. Recall (Y-axis) vs. update time (X-axis), one line for each algorithm.
  4. Recall (Y-axis) vs. space (X-axis), one line for each algorithm.

Tip: Format the output of your main function to match the requirements of your plotting code/app.

Tip: If you are looking for a new/better way to plot these charts, try searching 'python matplotlib' and 'latex pgfplots' in combination with 'line chart' on Google. It is OK to borrow some example code from open source code for plotting only.

Tip: To organize multiple plots in a single figure, try searching 'latex groupplot' or 'python matplotlib subplot'.

Instructions

You will use the CADE cluster to finish this project.

CADE manages clusters that you can use to do your development and testing for all of the class projects. You are free to use other machines and environments, but all grading will be done on these machines. Please test your solutions on these machines.

Check with CADE if you need to set up an account.

CADE machines all share your home directory, so you needn't log in to the same machine each time to continue working.

After you have an account, choose a machine at random from the lab1- set of machines on the lab status page (that is, lab1-1.eng.utah.edu through lab1-40.eng.utah.edu).

ssh lab1-10.eng.utah.edu

CADE user accounts have tcsh set as their default shell. Each time you log in, run bash before anything else. All instructions, examples, and scripts from this class assume you are using bash as your shell. You'll need to do this each time unless you reset your default shell (link), which I'd recommend. Savvy users may be able to provide slicker setups. This step is important: if you don't switch to bash, other things will mysteriously break as you try to work through the labs.

There is also a CADE setup document available in Canvas for reference.

Submission

You need to submit a tar.gz file of your source code to Canvas.

You should also include a separate report.pdf in your submission that contains:

  1. Plots showing the accuracy (precision/recall) and performance (space/time) of the three sketching algorithms.
  2. A brief analysis of the trade-off between the accuracy and performance of the three sketching algorithms, including a discussion of their theoretical guarantees and empirical performance.
  3. An analysis of the bias in the count estimates of the three sketching algorithms.

We will evaluate the correctness and the performance of your implementation off-line after the project due date.

Collaboration Policy

  • Every student has to work individually on this assignment.
  • Students are allowed to discuss high-level details about the project with others.
  • Students are not allowed to copy the contents of a white-board after a group meeting with other students.
  • Students are not allowed to copy the solutions from another colleague.
  • You cannot copy code from the internet (GitHub Copilot, ChatGPT, Stack Overflow, etc.).

If you have any questions please contact the instructor.