Matrix Sketching

So let us revisit matrix sketching. Let us look at it from a code perspective. I will implement this in python, but I strongly encourage you to implement this using spark streaming.

First let us import numpy and numpy.linalg and math

import numpy as np
import numpy.linalg as ln
import math

Let us define a function to compute the \(L_2\) norm of the error, as we will be using these to check how good our approximation is. Remember that the sketch \(B\) of matrix \(A\) has the following bound,

\[ \| AA^T - BB^T\|_2 \leq \epsilon \|A\|_f^2 \]

where \(\epsilon = 2/\ell\). The Frobenious norm of a matrix can be computed using ln.norm(A, ord='fro')

def calculate_error(mat_A, mat_B):
    mat_AAT = np.dot(mat_A, mat_A.T)
    mat_BBT = np.dot(mat_B, mat_B.T)
    return ln.norm(mat_AAT - mat_BBT, ord=2)

Now, let us define our sketch function that takes a matrix \(A\) and a parameter \(\ell\), the maximum number of columns to store.

def sketch(A, l):
    # dimensions of the matrix
    m,n = A.shape
    
    B = np.zeros([m, l])
    next_empty_col = 0
    
    for i in range(0,n):
        # insert a column into B
        B[:,next_empty_col] = A[:,i]
        
        # check if no zero columns remain,
        next_empty_col += 1
        if (next_empty_col == l) & (i < (n-1)):
            # compute svd
            U,S,V = ln.svd(B, full_matrices=False)
            
            # compute δ
            deltaSq = S[math.floor(l/2)] ** 2
            
            # reduce norms of columns of B
            Sbar = [math.sqrt(max(s,0)) for s in (S**2 - deltaSq)]
            
            # update B
            B = np.dot(U, np.diagflat(Sbar))
            
            # update next_empty_column
            empty_cols = np.where(~B.any(axis=0))[0]
            # print empty_cols
            next_empty_col = min(empty_cols)

    print ("Done Sketching")
    return B

Now lets create a random matrix and compute its sketch. We will also make sure the error is within bounds.

A = np.random.randn(100,1000)
l = 5
B = sketch(A, l)
err = calculate_error(A,B)
err_max = (2.0/l)* (ln.norm(A, ord='fro')**2)
print err, err_max