\documentclass[11pt]{book}
\usepackage{classMCMD13}
\newcommand{\A}{\textsc{Alice}\xspace}
\newcommand{\B}{\textsc{Bob}\xspace}
\newcommand{\JS}{\ensuremath{\textsf{\small JS}}}
\newcommand{\D}{\texttt{\textbf{d}}}
\newcommand{\Ded}{\texttt{\textbf{d}}_{\textsf{ed}}}
\newcommand{\vor}{\textsf{Vor}}
\newcommand{\rad}{\textsf{rad}}
\newcommand{\vol}{\textsf{Vol}}
\newcommand{\cut}{\textsf{Cut}}
\newcommand{\ncut}{\textsf{NCut}}
\begin{document}
\setcounter{chapter}{11}
\chapter{Heavy Hitters}
\begin{center}
\vspace{-.2in}
scribe(s): \emph{student name(s)}
\end{center}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\section*{Overview}
A core mining problem is to find items that occur more often than one would expect. These may be called outliers, anomalies, or heavy hitters, among other terms. Statistical models can be layered on top of these notions.
We begin with a very simple problem. There are $m$ elements and they come from a domain $[n]$ (but both $m$ and $n$ might be very large, and we don't want to use $\Omega(m)$ or $\Omega(n)$ space). Some items in the domain occur more than once, and we want to find the items which occur the most frequently.
If we can keep a counter for each item in the domain, this is easy. But we will assume $n$ is huge (e.g., the number of possible IP addresses), and $m$ is also huge (e.g., the number of packets passing through a router in a day).
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Streaming}
Streaming is a model of computation that emphasizes \emph{space} over all else. The goal is to compute something using as little storage space as possible. So much so that we cannot even store the input. Typically, you get to read the data once, you can then store something about the data, and then let it go forever! Or sometimes, less dramatically, you can make 2 or more passes on the data.
Formally, there is a stream $A = \langle a_1, a_2, \ldots, a_m\rangle$ of $m$ items, where each $a_i \in [n]$. This means each $a_i$ takes about $\log n$ bits to represent (which element it is), and just counting how many items have been seen requires $\log m$ bits (although allowing approximation can reduce this). Unless otherwise specified, $\log$ denotes $\log_2$, the base-2 logarithm.
The goal is to compute a function $g(A)$ using space that is only $\textsf{poly}(\log n, \log m)$.
Let $f_j = |\{a_i \in A \mid a_i = j\}|$ represent the number of items in the stream that have value $j$.
Let $F_1 = \sum_j f_j = m$ be the total number of elements seen.
Let $F_2 = \sqrt{\sum_j f_j^2}$ be the square root of the sum of squared counts.
Let $F_0 = \sum_j f_j^0$ be the number of distinct elements.
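As a quick sanity check of these definitions on a toy stream (an illustrative Python snippet; the stream values are arbitrary):

```python
from collections import Counter

# A tiny stream over a domain [n]; f[j] counts occurrences of j.
stream = [3, 1, 3, 2, 3, 1]
f = Counter(stream)

F1 = sum(f.values())                       # total number of elements: m = 6
F2 = sum(c ** 2 for c in f.values()) ** 0.5  # sqrt(3^2 + 2^2 + 1^2) = sqrt(14)
F0 = len(f)                                # number of distinct elements: 3
```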
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Majority and Heavy Hitters}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\subsection{Streaming Majority}
One of the most basic streaming problems is as follows:
\textsc{Majority:} if some $f_j > m/2$, output $j$. Otherwise, output anything.
How can we do this with $\log n + \log m$ space (one counter $c$, and one location $\ell$)?
\begin{algorithm}
\caption{\label{alg:MAJ} Majority($A$)}
\begin{algorithmic}
\STATE Set $c=0$ and $\ell = \emptyset$
\FOR {$i=1 \textbf{ to } m$}
\IF {$(a_i = \ell)$}
\STATE $c = c+1$
\ELSE
\STATE $c = c-1$
\ENDIF
\IF {$(c \leq 0)$}
\STATE $c=1$, $\ell = a_i$
\ENDIF
\ENDFOR
\STATE \textbf{return} $\ell$
\end{algorithmic}
\end{algorithm}
Why is Algorithm \ref{alg:MAJ} correct? If $f_j > m/2$, then
\begin{itemize}\denselist \vspace{-.1in}
\item each item $a_i \neq j$ can cancel at most one occurrence of $j$: either it decrements $c$ while $\ell = j$, or an occurrence of $j$ decrements the counter it supports;
\item there are fewer than $m/2$ such items, but more than $m/2$ occurrences of $j$, so at the end $\ell = j$ with $c > 0$.
\end{itemize}
On the other hand, if $f_j \leq m/2$ for all $j$, then any answer is acceptable.
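The same counter logic can be written directly in Python (a minimal sketch of Algorithm \ref{alg:MAJ}):

```python
def majority(stream):
    """One-pass Majority: one counter c and one stored location ell.

    If some element occurs more than m/2 times it is returned;
    otherwise the return value may be arbitrary.
    """
    c, ell = 0, None
    for a in stream:
        if a == ell:
            c += 1
        else:
            c -= 1
        if c <= 0:
            c, ell = 1, a
    return ell

# 1 occurs 4 > 6/2 times, so 1 must be returned.
print(majority([1, 2, 1, 3, 1, 1]))  # 1
```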
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Heavy Hitters}
Now we generalize the \textsc{Majority} problem to something much more useful.
\textsc{$k$-Frequency-Estimation}: Build a data structure $S$.
For any $j \in [n]$ we can return $S(j) = \hat f_j$ such that
\[
f_j - m/k \leq \hat f_j \leq f_j.
\]
From another view, a \emph{$\phi$-heavy hitter} is an element $j \in [n]$ such that $f_j > \phi m$. We want to build a data structure for $\eps$-approximate $\phi$-heavy hitters so that it returns
\begin{itemize} \denselist
\item every $j$ such that $f_j > \phi m$
\item no $j$ such that $f_j < \phi m - \eps m$
\item (any $j$ such that $\phi m - \eps m \leq f_j \leq \phi m$ may be returned, but need not be).
\end{itemize}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Misra-Gries Algorithm} [Misra+Gries 1982]
Solves $k$-\textsc{Frequency-Estimation} in $k (\log m + \log n)$ space.
The trick is to run the \textsc{Majority} algorithm, but with $k$ counters instead of $1$.
Let $C$ be an array of $k$ counters $C[1]$, $C[2]$, \ldots, $C[k]$.
Let $L$ be an array of $k$ locations $L[1]$, $L[2]$, \ldots, $L[k]$.
\begin{algorithm}
\caption{\label{alg:MG} Misra-Gries($A$)}
\begin{algorithmic}
\STATE Set all $C[i]=0$ and all $L[i] = \emptyset$
\FOR {$i=1 \textbf{ to } m$}
\IF {$(a_i \in L)$ (at index $j$)}
\STATE $C[j] = C[j]+1$
\ELSE
\IF {($|L| < k$)}
\STATE Let $j$ be some index with $L[j] = \emptyset$: set $C[j] = 1$ \& $L[j] = a_i$
\ELSE
\STATE \textbf{for} {$j \in [k]$} \textbf{do} $C[j] = C[j]-1$
\ENDIF
\ENDIF
\FOR {$j \in [k]$}
\STATE \textbf{if} {$C[j] \leq 0$} \textbf{then} $L[j] = \emptyset$
\ENDFOR
\IF {($|L| < k$ \& $a_i \notin L$)}
\STATE For some $j$ where $C[j] =0$: $C[j] = 1$ \& $L[j] = a_i$
\ENDIF
\ENDFOR
\STATE \textbf{return} $C$, $L$
\end{algorithmic}
\end{algorithm}
Then on a query $q \in [n]$ to $C,L$, if $q \in L$ (specifically $L[j]=q$), then return $\hat f_q = C[j]$. Otherwise return $\hat f_q = 0$.
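The same logic can be sketched compactly in Python, using a dictionary in place of the parallel arrays $C$ and $L$ (an illustrative sketch of the standard dictionary formulation of Misra-Gries, which maintains the same invariants as Algorithm \ref{alg:MG}):

```python
def misra_gries(stream, k):
    """k-counter generalization of Majority.

    Returns a dict of at most k (element -> counter) pairs; the estimate
    hat_f(q) is counters.get(q, 0), satisfying f_q - m/k <= hat_f(q) <= f_q.
    """
    counters = {}
    for a in stream:
        if a in counters:
            counters[a] += 1
        elif len(counters) < k:
            counters[a] = 1
        else:
            # Decrement all k counters; drop any that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

est = misra_gries([1, 2, 1, 3, 1, 1, 2, 1], k=2)
# Here m = 8, f_1 = 5, m/k = 4, and est.get(1, 0) = 4 is within the guarantee.
```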
\paragraph{Analysis:}
Why is Algorithm \ref{alg:MG} correct?
\begin{itemize}\denselist
\item A counter $C[j]$ representing $L[j] = q$ is only incremented if $a_i = q$, so we always have
\[
\hat f_q \leq f_q.
\]
\item If a counter $C[j]$ representing $L[j]=q$ is decremented, then the $k-1$ other counters are decremented at the same time. Each such step removes $k$ from the total of all counters, and at most $m$ is ever added in total, so there are at most $m/k$ such steps. Thus a counter $C[j]$ representing $L[j]=q$ is decremented at most $m/k$ times. Thus
\[
f_q - m/k \leq \hat f_q.
\]
\end{itemize}
We can now apply this to get an additive $\eps$-approximate \textsc{Frequency-Estimation} by setting $k=1/\eps$. We return $\hat f_q$ such that
\[
| f_q - \hat f_q| \leq \eps m.
\]
Or we can set $k=2/\eps$ and return $\hat f_q = C[j] + (m/k)/2$, which centers the estimate so the error is at most $(m/k)/2 = \eps m / 4$ on either side.
Space is $(1/\eps) (\log m + \log n)$, since there are $(1/\eps)$ counters and locations.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Count-Min Sketch}
We now describe a completely different way to solve the \textsc{Heavy-Hitter} problem, called the Count-Min Sketch [Cormode + Muthukrishnan 2005].
Start with $t$ independent (random) hash functions $\{h_1, \ldots, h_t\}$ where each $h_i : [n] \to [k]$.
Now we store a 2D array of counters with $t = \log(1/\delta)$ rows and $k = 2/\eps$ columns:
\[
\begin{array}{|r||cccc|}
\hline
h_1 & C_{1,1} & C_{1,2} & \ldots & C_{1,k} \\
h_2 & C_{2,1} & C_{2,2} & \ldots & C_{2,k} \\
\ldots & \ldots & \ldots & \ldots & \ldots \\
h_t & C_{t,1} & C_{t,2} & \ldots & C_{t,k} \\
\hline
\end{array}
\]
\begin{algorithm}
\caption{\label{alg:CM} Count-Min($A$)}
\begin{algorithmic}
\STATE Set all $C_{i,j}=0$
\FOR {$i=1 \textbf{ to } m$}
\FOR {$j =1 \textbf{ to } t$}
\STATE $C_{j,h_j(a_i)} = C_{j,h_j(a_i)}+1$
\ENDFOR
\ENDFOR
\end{algorithmic}
\end{algorithm}
After running Algorithm \ref{alg:CM} on a stream $A$, then on a query $q \in [n]$ we can return
\[
\hat f_q = \min_{j \in [t]} C_{j,h_j(q)}.
\]
This is why it is called a \emph{count-min sketch}.
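A sketch of this structure in Python (illustrative only; the per-row hash functions here are simulated by mixing a random seed into Python's built-in \texttt{hash}, where a real implementation would use, e.g., a pairwise-independent family):

```python
import random

class CountMin:
    """Count-Min sketch: t rows of k counters, one hash function per row."""

    def __init__(self, k, t, seed=0):
        rng = random.Random(seed)
        self.k, self.t = k, t
        self.seeds = [rng.getrandbits(32) for _ in range(t)]  # one per row
        self.C = [[0] * k for _ in range(t)]

    def _h(self, j, x):
        # Stand-in hash h_j : [n] -> [k], derived from the j-th seed.
        return hash((self.seeds[j], x)) % self.k

    def update(self, x):
        for j in range(self.t):
            self.C[j][self._h(j, x)] += 1

    def query(self, q):
        # hat_f(q) = min over rows of the counter q hashes into.
        return min(self.C[j][self._h(j, q)] for j in range(self.t))

cm = CountMin(k=20, t=5)
for a in [1, 2, 1, 3, 1]:
    cm.update(a)
# cm.query(1) is at least f_1 = 3, and exceeds it only on hash collisions.
```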
\paragraph{Analysis:}
Clearly $f_q \leq \hat f_q$, since each counter $C_{j,h_j(q)}$ counts every occurrence of $q$, but may also count other elements (from hash collisions).
Next we claim that $\hat f_q \leq f_q + W$ for some overcount value $W$. So how large is $W$?
Consider just one hash function $h_i$. An element $j \neq q$ contributes to $W$ exactly when there is a collision $h_i(j) = h_i(q)$, which happens with probability $1/k$.
So we can create a random variable $Y_{i,j}$ that represents the overcount in row $i$ for $q$ caused by element $j \in [n]$. Every instance of $j$ hashes to the same cell $h_i(j)$, so either all $f_j$ instances land in $q$'s cell or none do. Thus
\begin{itemize}\denselist
\item $Y_{i,j} = \begin{cases} f_j & \text{with probability } 1/k, \\ 0 & \text{otherwise.} \end{cases}$
\item $\E[Y_{i,j}] = f_j/k$.
\end{itemize}
Then let $X_i$ be another random variable defined
\begin{itemize}\denselist
\item $X_i = \sum_{j \in [n], j \neq q} Y_{i,j}$, and
\item $\E[X_i] = \E[\sum_{j \neq q} Y_{i,j}] = \sum_{j \neq q} f_j/k \leq F_1/k = \eps F_1/2$.
\end{itemize}
Now we recall the Markov inequality: for a random variable $X$ and a value $\alpha>0$, we have $\Pr[|X| \geq \alpha] \leq \E[|X|]/\alpha$.
Since $X_i \geq 0$, we have $|X_i| = X_i$; set $\alpha = \eps F_1$, and note that $\E[|X_i|]/\alpha \leq (\eps F_1 /2)/(\eps F_1) = 1/2$. It follows that
\[
\Pr[X_i \geq \eps F_1] \leq 1/2.
\]
But this was for just one hash function $h_i$. Now we extend this to $t$ \emph{independent} hash functions:
\begin{align*}
\Pr[\hat f_q - f_q \geq \eps F_1]
&=
\Pr[\min_i X_i \geq \eps F_1]
=
\Pr[\forall_{i \in [t]} (X_i \geq \eps F_1)]
\\ &=
\prod_{i \in [t]} \Pr[X_i \geq \eps F_1]
\leq
1/2^t
=
\delta,
\end{align*}
since $t = \log(1/\delta)$.
So that gives us a PAC bound. The Count-Min Sketch for any $q$ has
\[
f_q \leq \hat f_q \leq f_q + \eps F_1
\]
where the first inequality always holds, and the second holds with probability at least $1-\delta$.
\paragraph{Space.}
Since there are $kt$ counters, and each requires $\log m$ space, the total counter space is $kt \log m$.
But we also need to store $t$ hash functions, these can be made to take $\log n$ space each. Then since $t = \log(1/\delta)$ and $k = 2/\eps$ it follows the overall total space is
$t(k \log m + \log n) = ((2/\eps) \log m + \log n) \log(1/\delta)$.
\paragraph{Turnstile Model:}
There is a variation of streaming algorithms where each element $a_i \in A$ can either add one to or subtract one from the count of some domain item (like a turnstile at the entrance of a football game), but each count must remain non-negative. The Count-Min sketch has the same guarantees in the turnstile model, but Misra-Gries does not.
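As a minimal illustration of the turnstile setting (a toy single-row sketch in Python; a full turnstile Count-Min keeps $t$ such rows and returns the minimum, and the seeded use of Python's \texttt{hash} stands in for a proper hash family):

```python
import random

# One row of k counters; each update carries a sign delta.
k = 16
seed = random.Random(7).getrandbits(32)
C = [0] * k

def h(x):
    return hash((seed, x)) % k

def update(x, delta):
    # delta = +1 for an arrival, -1 for a departure
    C[h(x)] += delta

# Element 1 arrives twice; element 2 arrives and then departs.
for x, d in [(1, +1), (2, +1), (1, +1), (2, -1)]:
    update(x, d)

# C[h(1)] holds f_1 = 2 plus the net mass (here zero) of any colliding element.
```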
\end{document}