PIRS: Randomized Synopses for Query Assurance on Data Streams

Ke Yi, Feifei Li, Graham Cormode, Marios Hadjieleftheriou, George Kollios and  Divesh Srivastava

[Overview] [Papers and Talks] [PIRS: Polynomial Identity Random Synopses] [Source Code] [Dataset] [Contacts] 

Overview

The overwhelming flow of information in many data stream applications forces many companies to outsource to a third-party the deployment of a Data Stream Management System (DSMS) for performing desired computations. Remote computations intrinsically raise issues of trust, making query execution assurance on data streams a problem with practical implications. Consider a client observing the same data stream as a remote server (e.g., network traffic), that registers a continuous query on the server's DSMS, and receives answers upon request. The client needs to verify the integrity of the results using significantly fewer resources than evaluating the query locally. Towards that goal, we propose a {\em probabilistic algorithm} for selection and aggregate/group-by queries, that uses constant space irrespective of the result-set size, has low update cost, and arbitrarily small probability of failure. We generalize this algorithm to allow some tolerance on the number of errors permitted (irrespective of error magnitude), and also discuss the hardness of permitting arbitrary errors of small magnitude. We also perform an empirical evaluation using live network traffic.
 

Papers and Talks

1. Randomized Synopses for Query Assurance on Data Streams,

    Full version:   Talk:  

PIRS: Polynomial Identity Random Synopses

 Please refer papers above for details. 

 Illustration of the problem settings

 

Source Code

Important Notice

If you find any bugs or any suggestions/comments, we are very happy to hear from you!

Library Description

The library is developed in GNU C++.  The library depends on: Tools Library and GNU GMP for arbitrary precision arithmetic calculation.

Our library is tested with the following version of corresponding dependent libraries:

1. Tools Library : [zip]

2. GNU GMP: [tar.gz]

Download

PIRS Library [tar.gz]

Update: (05/14/08)Fixed some bugs and newly released the PIRS* and FM-PIRS!

This version includes the followings: PIRS, PIRS test with World Cup Data Set, PIRS test with IPs data set, PIRS^\gamma (PIRSET, the exact solution) , PIRS^\pm\gamma (PIRSAT, the approximate solution) and their corresponding buffering techniques (using LRU to explore locality), and PIRS-attack to simulate attacks to PIRS. Now also includes PIRS* (to identify which groups contain errors and) and FM-PIRS that could indentify how many errors in all groups!

Quick Install

The subfolder's names are self-explain (pirsgi refers to PIRS*, gi stands for group identifier). Each subfolder contains a Makefile for easy-compilation. All the main test program has a verbose help output to explain what parameters it expects.

Dataset

We have tested PIRS (and its variants) for two sets of real data streams. One is the World Cup data set which consists of the access log for 1998 world cup web sites. You may find detailed description of this data set from here. The second one is the live IP traces from AT&T backbone network. Due to privacy and legal requirement, we are not able to release the IPs dataset. However, the World Cup data stream is public. Our experiment consists of the traffic/request from Day 46 to Day 47. You may download the original data from WCUP_data_day4647. However, you need to combine the individual small data files into one file. For testing purpose, you may grab any one small file from ITA website and test-run PIRS and its variants.

Contacts

Feifei Li     

Last modified 05/14/08