Scan, Search, Ranking

scan

scan, prefix sum


  • Given an array, \([4, 2, 1, 8]\)
  • Compute, \([4, 4+2, 4+2+1, 4+2+1+8]\)
  • \([4, 6, 7, 15]\)


obvious sequential algorithm

\[ D_s(n) = \mathcal{O}(n) \]


not parallelizable, need to change algorithm

scan


  • a good example of a computation that seems inherently sequential, but for which there is an efficient parallel algorithm
  • prefix-sums can be used to convert certain sequential computations into equivalent, but parallel, computations

scan


Given a binary associative operator \(\oplus\) with identity \(I\), and an array of \(n\) elements, \([a_0,a_1,\cdots,a_{n-1}]\), compute:

\[ \left [ I, a_0, (a_0\oplus a_1),\cdots,(a_0\oplus a_1 \oplus \cdots \oplus a_{n-2}) \right ] \]


This is the exclusive scan. The inclusive scan is,

\[ \left [ a_0, (a_0\oplus a_1),\cdots,(a_0\oplus a_1 \oplus \cdots \oplus a_{n-1}) \right ] \]

sequential scan


// exclusive scan with + (identity 0)
out[0] = 0;
for (k=1; k<n; ++k)
  out[k] = in[k-1] + out[k-1];


\[ T_1(n) = \mathcal{O}(n) \]
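The two definitions above can be sketched sequentially for any associative operator. This is a minimal illustration, not part of the original slides; the names `exclusive_scan` and `inclusive_scan` and the `op`/`identity` parameters are assumptions made for the example.

```python
from operator import add


def exclusive_scan(values, op=add, identity=0):
    """Return [I, a0, a0+a1, ..., a0+...+a_{n-2}] for an associative op."""
    out = []
    running = identity
    for v in values:
        out.append(running)       # combination of everything strictly before v
        running = op(running, v)
    return out


def inclusive_scan(values, op=add, identity=0):
    """Return [a0, a0+a1, ..., a0+...+a_{n-1}] for an associative op."""
    out = []
    running = identity
    for v in values:
        running = op(running, v)  # combination of everything up to and including v
        out.append(running)
    return out


print(exclusive_scan([4, 2, 1, 8]))  # [0, 4, 6, 7]
print(inclusive_scan([4, 2, 1, 8]))  # [4, 6, 7, 15]
```

Both loops carry a dependency from one iteration to the next, which is why the depth equals the work here.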

parallel scan


  • binary tree (like sum)
  • two tree traversals
    • up-sweep bottom-up — reduce phase
    • down-sweep top-down
  • Work/Depth
    • \(W(n) = \mathcal{O}(n)\)
    • \(D(n)=\mathcal{O}(\log n)\)
  • recursive & iterative implementations
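One possible iterative form of the two traversals is sketched below, run sequentially for illustration. It assumes the input is padded with the identity to a power-of-two length; the function name `blelloch_exclusive_scan` is hypothetical. Every iteration of an inner `for` loop touches disjoint entries, so on a parallel machine each inner loop could run as one parallel step.

```python
from operator import add


def blelloch_exclusive_scan(values, op=add, identity=0):
    n = len(values)
    if n == 0:
        return []
    size = 1
    while size < n:               # pad to a power of two (assumption of this sketch)
        size *= 2
    tree = list(values) + [identity] * (size - n)

    # Up-sweep (reduce phase): each node receives the sum of its subtree.
    stride = 1
    while stride < size:
        for i in range(2 * stride - 1, size, 2 * stride):  # independent iterations
            tree[i] = op(tree[i - stride], tree[i])
        stride *= 2

    # Down-sweep: push the prefix "from the left" down to the leaves.
    tree[size - 1] = identity
    stride = size // 2
    while stride >= 1:
        for i in range(2 * stride - 1, size, 2 * stride):  # independent iterations
            left_sum = tree[i - stride]
            tree[i - stride] = tree[i]          # left child inherits parent's prefix
            tree[i] = op(tree[i], left_sum)     # right child: prefix, then left subtree
        stride //= 2
    return tree[:n]


print(blelloch_exclusive_scan([4, 2, 1, 8]))  # [0, 4, 6, 7]
```

There are \(\log n\) levels in each sweep and the total number of `op` calls is about \(2n\), which matches the work and depth bounds listed above.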

scan up-sweep


scan down-sweep


simple use case


parallel select

parallel select


  • consider a shared memory machine (PRAM)
  • given an array \(A\), we wish to select all entries of \(A\) satisfying some condition
  • for example, when implementing quicksort we would like to compute all values less than a given pivot

parallel select for a[i] < pv


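(l,m) = select_lower(a, n, pv) {
  // t = t[0,...,n-1], flags: t[i] = 1 if a[i] < pv, else 0
  parallel for (i=0; i<n; ++i)
    t[i] = a[i] < pv;
  s = scan(t);   // inclusive scan: s[i] = t[0] + ... + t[i]
  m = s[n-1];    // number of selected entries
  // allocate l with m entries
  parallel for (i=0; i<n; ++i)
    if (t[i])
      l[s[i]-1] = a[i];
}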

\[W=\mathcal{O}(n), \quad D=\mathcal{O}(\log n)\]
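A sequential sketch of the same select, to make the role of the scan concrete. `itertools.accumulate` stands in for the parallel inclusive scan, and the two loops marked as parallel are plain Python loops here; the Python signature is an assumption of this example.

```python
from itertools import accumulate


def select_lower(a, pivot):
    """Return (lower, m): the entries of a below pivot, in order, and their count."""
    flags = [1 if x < pivot else 0 for x in a]   # parallel map in the PRAM version
    s = list(accumulate(flags))                  # inclusive scan of the flags
    m = s[-1] if s else 0
    lower = [None] * m
    for i, x in enumerate(a):                    # parallel scatter: targets are distinct
        if flags[i]:
            lower[s[i] - 1] = x
    return lower, m


print(select_lower([5, 1, 9, 3, 7, 2], 5))  # ([1, 3, 2], 3)
```

The scan gives every selected entry its own output slot, so the scatter has no write conflicts and the original order is preserved.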

parallel thinking

think about whether parallel scan can be used for any of the problems in assignment 1

search


problem description

  • given a sorted list \(X\) of size \(n\) and an element \(y\)
  • find the index \(i\), such that \(x_i \leq y < x_{i+1}\)

sequential

  • use binary search
  • \(\mathcal{O}(\log n)\) time

work / depth

parallel for (i)
  if (x[i] <= y && y < x[i+1]) return i; // no duplicates, so at most one i matches
// W = O(n), D = O(1) with n processors

PRAM

  • \(\mathcal{O}(\log n/ \log p)\) using \(p\) processors

parallel search


  • split \(X\) into \((p+1)\) segments
  • compare \(y\) with the \(p\) splitters (boundary elements)
  • we will find, \(i\) such that,
    1. \(X_i = y\), or
    2. \(X_{ni/(p+1)} < y < X_{n(i+1)/(p+1)}\) — bucket
  • if case 1, stop
  • if case 2, and size of bucket is \(\sim p\)
    • do \(p\) comparisons in parallel
  • else recurse
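The steps above can be simulated sequentially as follows. This is a sketch under assumptions: the name `parallel_search` is hypothetical, the recursion is written as a loop, and the special case for small buckets is folded into the same loop. Each round performs at most \(p\) independent comparisons, which is the part that would run in parallel.

```python
def parallel_search(x, y, p):
    """Largest i with x[i] <= y in sorted x (or -1), plus the number of rounds."""
    lo, hi = 0, len(x)   # invariant: x[j] <= y for j < lo, x[j] > y for j >= hi
    rounds = 0
    while lo < hi:
        width = hi - lo
        # Up to p distinct splitter positions inside [lo, hi).
        splitters = sorted({lo + (width * (j + 1)) // (p + 1) for j in range(p)})
        # One comparison per processor; all of them are independent.
        below = [x[s] <= y for s in splitters]
        k = sum(below)   # x is sorted, so below looks like True..True False..False
        if k > 0:
            lo = splitters[k - 1] + 1
        if k < len(splitters):
            hi = splitters[k]
        rounds += 1
    return lo - 1, rounds


index, rounds = parallel_search([7, 13, 25, 26, 31, 54], 27, p=2)
print(index)  # 3, since 26 <= 27 < 31
```

With `p=1` this reduces to ordinary binary search; larger `p` shrinks the bucket by a factor of about \(p+1\) per round.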

parallel search


example

parallel search


  • \(p\) processes
  • \(\mathcal{O}(p)\) work per step, \(\mathcal{O}(1)\) depth
  • \(p^k = n \Rightarrow k = \log_p n\)
  • Work optimal ?
  • no parallelism if we insist on work optimality
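The work-optimality question can be made concrete with a short calculation. This is a sketch using \(k\) rounds of \(p\) comparisons each, and \(W_{\text{seq}}\) (a symbol introduced here) for the work of sequential binary search.

```latex
\begin{align*}
k &= \log_p n = \frac{\log n}{\log p} \\
D_p(n) &= \mathcal{O}(k) = \mathcal{O}\!\left(\frac{\log n}{\log p}\right) \\
W_p(n) &= \mathcal{O}(p \cdot k) = \mathcal{O}\!\left(\frac{p \log n}{\log p}\right) \\
\frac{W_p(n)}{W_{\text{seq}}(n)} &= \Theta\!\left(\frac{p}{\log p}\right),
\qquad W_{\text{seq}}(n) = \Theta(\log n)
\end{align*}
```

The ratio \(p/\log p\) grows with \(p\), so the parallel search matches the sequential work only when \(p\) is a constant.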

ranking

ranking


  • given ordered lists, \(A,B\) of lengths \(n, m\)
  • define:

    rank\((z:A) :=\) number of elements \(a_i \in A\) with \(a_i\leq z\)

  • define:

    rank\((B:A) := (r_1,r_2,\cdots,r_m)\)
    \(r_i =\) rank\((b_i:A)\)

parallel ranking


  • A = [7, 13, 25, 26, 31, 54]
  • B = [1, 8, 13, 27]
  • Rank(B:A) = (0, 1, 2, 4)
  • Rank(A:B) = (1,3,3,3,4,4)

algorithm for Rank(b:A)


  • ranking one element in an array A
  • use a binary search algorithm
  • Depth
    • sequential search \(\mathcal{O}( \log(n) )\)
    • parallel search \(\mathcal{O}( \log(n)/\log(p) )\)
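Ranking by binary search can be sketched with Python's standard `bisect` module: `bisect_right(a, z)` returns the number of elements of the sorted list `a` that are \(\leq z\), which is exactly rank\((z:A)\) as defined earlier. The helper names `rank` and `rank_list` are assumptions of this example.

```python
from bisect import bisect_right


def rank(z, a):
    """rank(z:A): number of elements of the sorted list a that are <= z."""
    return bisect_right(a, z)


def rank_list(b, a):
    """rank(B:A): one independent binary search per element of b."""
    return [rank(z, a) for z in b]


A = [7, 13, 25, 26, 31, 54]
B = [1, 8, 13, 27]
print(rank_list(B, A))  # [0, 1, 2, 4]
```

The searches for different elements of `b` do not interact, so they can all run concurrently; swapping the arguments gives Rank(A:B).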

merge sort

divide & conquer mergesort

  • divide \(X\) into \(X_1\) and \(X_2\)
  • sort \(X_1\) and \(X_2\)
  • merge \(X_1\) and \(X_2\)
  • uses a binary tree
    • bottom-up approach
    • start with the leaves
    • climb to the root
    • merge the branches
  • requires parallel merge

mergesort - example

mergesort


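b = Merge_Sort(a,n)
  if n < 100
    return seqSort(a, n);   // small inputs: sort sequentially
  // the two recursive calls are independent and can run in parallel
  b1 = Merge_Sort(a[0,…,n/2-1], n/2);
  b2 = Merge_Sort(a[n/2,…,n-1], n - n/2);   // second half has n - n/2 elements (n may be odd)
  return Merge (b1, b2);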
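A runnable sequential sketch of the same recursion. `sorted` stands in for `seqSort` and `heapq.merge` stands in for `Merge`; both substitutions, and the `cutoff` parameter, are assumptions made for this example rather than the parallel merge discussed later.

```python
from heapq import merge


def merge_sort(a, cutoff=100):
    n = len(a)
    if n < max(cutoff, 2):              # small inputs: sort sequentially
        return sorted(a)
    mid = n // 2
    left = merge_sort(a[:mid], cutoff)  # the two calls are independent,
    right = merge_sort(a[mid:], cutoff) # so they could run in parallel
    return list(merge(left, right))     # sequential merge of two sorted lists


print(merge_sort([5, 2, 9, 1, 7, 3], cutoff=2))  # [1, 2, 3, 5, 7, 9]
```

With a sequential merge the depth is dominated by the final \(\mathcal{O}(n)\) merge at the root, which is why a parallel merge is needed.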

parallel merge

merging two sorted lists


  • best sequential time — \(\mathcal{O}(n)\)


parallel merge

tradeoffs between

  • depth-optimal
  • work-optimal

merging using ranking


  • Assume elements in \(A\) and \(B\) are distinct
  • Let \(C\) be the merged result. Given, \(x \in C\) and rank\((x:C)=i\)
    • \(c_i=x\)
  • rank\((x:C) =\) rank\((x:A) +\) rank\((x:B)\)
  • Solution to the merging problem,
    • find rank\((x:A)\) and rank\((x:B)\)
    • parallel searches using \(p=nm\) processors: \(D=\mathcal{O}(1)\) but \(W=\mathcal{O}(nm)\), which is \(\mathcal{O}(n^2)\) for \(m=\mathcal{O}(n)\)
    • concurrent binary searches: \(D=\mathcal{O}(\log n)\) and \(W=\mathcal{O}(n \log n)\)

  • Goal: Parallelize with optimal work
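The rank identity above translates directly into a merge in which every element computes its own output slot. A minimal sketch, assuming all elements of `a` and `b` are distinct (as the slide does); the name `merge_by_ranking` is hypothetical, and the loops are sequential stand-ins for concurrent binary searches.

```python
from bisect import bisect_right


def merge_by_ranking(a, b):
    """Merge sorted lists a and b (all elements distinct) via ranks."""
    c = [None] * (len(a) + len(b))
    for i, x in enumerate(a):             # independent iterations
        # rank(x:A) = i + 1, so the 0-based slot is rank(x:A) + rank(x:B) - 1.
        c[i + bisect_right(b, x)] = x
    for j, y in enumerate(b):             # independent iterations
        c[j + bisect_right(a, y)] = y
    return c


print(merge_by_ranking([7, 13, 25, 26, 31, 54], [1, 8, 14, 27]))
# [1, 7, 8, 13, 14, 25, 26, 27, 31, 54]
```

Each element costs one binary search, so this variant has logarithmic depth but \(\mathcal{O}(n \log n)\) work, which is what the work-optimal version improves on. If the two lists shared a value, both copies would be sent to the same slot, which is why distinctness is assumed.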

work-optimal merge

work-optimal parallel merge


  1. partition \(B\) into blocks with \(\log m\) elements
  2. rank splitters of \(B\) in \(A\)
  3. partition \(A\) accordingly
  4. merge blocks \(B_i\) and \(A_i\) sequentially

work-optimal parallel merge


  • partition \(B\) into \(m/\log m\) blocks, each with \(\log m\) elements
  • parallel for \(i=1:m/\log m\)
    • \(r_i =\) seq_rank\((b_{iK}: A)\), with block size \(K=\log m\)
  • partition \(A\) accordingly
    • block \(A_i: (a_{r_{i-1}+1},\cdots,a_{r_i})\)
  • merge blocks of \(A\) and \(B\) sequentially in \(\mathcal{O}(\log n)\) time
  • but if \(|A_i|\gg|B_i|=\log m\), merge that pair with par_merge\((B_i, A_i)\) instead of sequentially
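The blocked scheme above can be sketched sequentially as follows. Assumptions of this example: the block size is `int(log2(m))` (at least 1), the last element of each block of `b` is used as its splitter, and the recursive `par_merge` fallback for oversized blocks of `a` is left out. The names `merge_seq` and `blocked_merge` are hypothetical.

```python
from bisect import bisect_right
from math import log2


def merge_seq(a, b):
    """Standard two-pointer sequential merge."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out


def blocked_merge(a, b):
    m = len(b)
    if m == 0:
        return list(a)
    k = max(1, int(log2(m)))                 # block size, roughly log m
    starts = list(range(0, m, k))            # first index of each block of b
    # Rank each block's last element in a (independent binary searches).
    cuts = [bisect_right(a, b[min(s + k, m) - 1]) for s in starts]
    cuts[-1] = len(a)                        # the last block of a takes the remainder
    pieces, prev = [], 0
    for s, r in zip(starts, cuts):           # independent block merges
        pieces.append(merge_seq(a[prev:r], b[s:s + k]))
        prev = r
    return [x for piece in pieces for x in piece]


a = list(range(0, 40, 3))
b = list(range(1, 20, 2))
print(blocked_merge(a, b) == sorted(a + b))  # True
```

Every block pair covers a disjoint value range, so the merged pieces can simply be concatenated in block order.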

work-optimal parallel merge


assuming \(m=\mathcal{O}(n)\),

\[W=\mathcal{O}(n)\]

and,

\[D=\mathcal{O}(\log n)\]
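A sketch of where these bounds come from, split into the ranking phase and the block-merge phase. The subscripts "rank" and "merge" are labels introduced here, and the depth bound assumes every block satisfies \(|A_i| = \mathcal{O}(\log n)\); larger blocks need the par_merge fallback.

```latex
\begin{align*}
W_{\text{rank}} &= \frac{m}{\log m} \cdot \mathcal{O}(\log n)
  = \mathcal{O}\!\left(\frac{m \log n}{\log m}\right) = \mathcal{O}(n)
  \quad \text{for } m = \mathcal{O}(n) \\
D_{\text{rank}} &= \mathcal{O}(\log n) \\
W_{\text{merge}} &= \sum_i \mathcal{O}\big(|A_i| + |B_i|\big) = \mathcal{O}(n + m) = \mathcal{O}(n) \\
D_{\text{merge}} &= \max_i \mathcal{O}\big(|A_i| + |B_i|\big) = \mathcal{O}(\log n) \\
W &= W_{\text{rank}} + W_{\text{merge}} = \mathcal{O}(n), \qquad
D = D_{\text{rank}} + D_{\text{merge}} = \mathcal{O}(\log n)
\end{align*}
```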

next time …


  • bitonic sort
  • sample sort

self-test questions


  • design a work-optimal ranking algorithm
    • similar to parallel merge
  • what are the challenges of implementing the parallel merge in a shared memory framework? how about using message passing?
  • can you implement a parallel mergesort using the algorithms discussed today? What will its time complexity be?