Scan, Search, Ranking

scan

scan, prefix sum

Given an array, $[4, 2, 1, 8]$
Compute, $[4, 4+2, 4+2+1, 4+2+1+8]$
$[4, 6, 7, 15]$

obvious sequential algorithm

\[ D_s(n) = \mathcal{O}(n) \]

not parallelizable as written (each output depends on the previous one), need to change algorithm

scan

a good example of a computation that seems inherently sequential, but for which there is an efficient parallel algorithm
prefix-sums can be used to convert certain sequential computations into equivalent, but parallel, computations

scan

Given a binary associative operator $\oplus$ with identity $I$, and an array of $n$ elements, $[a_0,a_1,\cdots,a_{n-1}]$, compute:

\[ \left [ I, a_0, (a_0\oplus a_1),\cdots,(a_0\oplus a_1 \oplus \cdots \oplus a_{n-2}) \right ] \]

this is the exclusive scan; the inclusive scan is,

\[ \left [ a_0, (a_0\oplus a_1),\cdots,(a_0\oplus a_1 \oplus \cdots \oplus a_{n-1}) \right ] \]

sequential scan

// exclusive scan with + (identity 0)
out[0] = 0;
for (int k = 1; k < n; ++k)
  out[k] = out[k-1] + in[k-1];

\[ T_1(n) = \mathcal{O}(n) \]
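The two definitions above can be sketched for a generic operator as follows. This is a minimal illustration, not the lecture's code: the function names and the `op`/`identity` parameters are assumptions introduced here.

```python
from operator import add

def exclusive_scan(values, op=add, identity=0):
    """Return [I, a0, a0+a1, ..., a0+...+a_{n-2}] for an associative op."""
    out = []
    running = identity
    for v in values:
        out.append(running)       # prefix of everything strictly before v
        running = op(running, v)
    return out

def inclusive_scan(values, op=add, identity=0):
    """Return [a0, a0+a1, ..., a0+...+a_{n-1}] for an associative op."""
    out = []
    running = identity
    for v in values:
        running = op(running, v)  # prefix including v
        out.append(running)
    return out
```

With the array from the first slide, `inclusive_scan([4, 2, 1, 8])` gives `[4, 6, 7, 15]` and `exclusive_scan([4, 2, 1, 8])` gives `[0, 4, 6, 7]`.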

parallel scan

binary tree (like sum)
two tree traversals
- up-sweep bottom-up — reduce phase
- down-sweep top-down
Work/Depth
- $W(n) = \mathcal{O}(n)$
- $D(n)=\mathcal{O}(\log n)$
recursive & iterative implementations

scan up-sweep

scan down-sweep
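The up-sweep and down-sweep can be sketched iteratively as below. This is a sequential simulation under stated assumptions: the function name is hypothetical, the input is padded with the identity to a power of two, and each inner `for` loop stands in for one parallel step (its iterations touch disjoint entries).

```python
def tree_exclusive_scan(values, op=lambda x, y: x + y, identity=0):
    """Exclusive scan via up-sweep (reduce) and down-sweep on an implicit binary tree."""
    n = len(values)
    if n == 0:
        return []
    size = 1
    while size < n:               # pad to a power of two (assumption of this sketch)
        size *= 2
    tree = list(values) + [identity] * (size - n)

    # Up-sweep: each level combines pairs of partial sums, bottom-up.
    stride = 1
    while stride < size:
        for i in range(2 * stride - 1, size, 2 * stride):   # independent iterations
            tree[i] = op(tree[i - stride], tree[i])
        stride *= 2

    # Down-sweep: push prefixes from the root back to the leaves.
    tree[size - 1] = identity
    stride = size // 2
    while stride >= 1:
        for i in range(2 * stride - 1, size, 2 * stride):   # independent iterations
            left_sum = tree[i - stride]
            tree[i - stride] = tree[i]            # left child inherits the parent's prefix
            tree[i] = op(tree[i], left_sum)       # right child: parent's prefix, then left sum
        stride //= 2
    return tree[:n]
```

There are about $\log n$ levels in each sweep and the total number of operator applications is linear in the padded size, which matches the work and depth bounds stated above.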

simple use case

parallel select

consider a shared memory machine (PRAM)
given an array $A$, we wish to select all entries of $A$ satisfying some condition
say we are implementing quicksort and would like to collect all values less than a given pivot

parallel select for a[i] < pv

(l,m) = select_lower(a, n, pv) {
  // t = t[0,...,n-1]
  parallel for (i=0; i<n; ++i)
    t[i] = a[i] < pv;
  s = scan(t);   // inclusive scan
  m = s[n-1];    // number of selected entries
  // allocate l
  parallel for (i=0; i<n; ++i)
    if (t[i])
      l[s[i]-1] = a[i];
}

\[W=\mathcal{O}(n), \quad D=\mathcal{O}(\log n)\]
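A runnable sketch of the same select, assuming an inclusive scan (here `itertools.accumulate`) and using list comprehensions and loops where a parallel machine would use a parallel for. The Python function name mirrors the pseudocode but is otherwise an assumption.

```python
from itertools import accumulate

def select_lower(a, pivot):
    """Return (lower, m): the entries of a below pivot, in order, and their count."""
    flags = [1 if x < pivot else 0 for x in a]   # parallel map
    pos = list(accumulate(flags))                # inclusive scan of the flags
    m = pos[-1] if pos else 0
    lower = [None] * m
    for i, x in enumerate(a):                    # parallel scatter: targets are distinct
        if flags[i]:
            lower[pos[i] - 1] = x
    return lower, m
```

For example, `select_lower([5, 1, 9, 3, 7, 2], 4)` returns `([1, 3, 2], 3)`. The scan is what gives every selected entry a unique output slot without any coordination between iterations.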

parallel thinking

think about whether parallel scan can be used for any of the problems in assignment 1

search

parallel search

problem description

given a sorted list $X$ of size $n$ and an element $y$
find the index $i$, such that $x_i \leq y < x_{i+1}$

sequential

use binary search
$\mathcal{O}(\log n)$ time

work / depth

parallel for (i=0; i<n-1; ++i)
  if (x[i] <= y && y < x[i+1]) return i; // no duplicates

PRAM

$\mathcal{O}(\log n/ \log p)$ using $p$ processors

parallel search

split $X$ into $(p+1)$ segments
compare $y$ with the $p$ splitters (boundary elements)
we will find, $i$ such that,
1. $X_i = y$, or
2. $X_{ni/(p+1)} < y < X_{n(i+1)/(p+1)}$ — bucket
if case 1, stop
if case 2, and size of bucket is $\sim p$
- do $p$ comparisons in parallel
else recurse
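The recursion above can be simulated sequentially as in the sketch below. Assumptions made here, not taken from the slides: the function name, the sentinel return value `-1` when $y$ is smaller than every element, and the simplification of not stopping early when a splitter equals $y$.

```python
def p_ary_search(x, y, p):
    """Return (i, rounds): i is the largest index with x[i] <= y, or -1 if none."""
    lo, hi = -1, len(x)       # sentinels: position -1 acts as -inf, position n as +inf
    rounds = 0
    while hi - lo > 1:
        gap = hi - lo
        # Up to p distinct splitter indices strictly between lo and hi.
        splitters = sorted({lo + (gap * j) // (p + 1) for j in range(1, p + 1)} - {lo})
        # One parallel step: every splitter is compared with y independently.
        is_le = [x[s] <= y for s in splitters]
        # x is sorted, so is_le is a run of True followed by a run of False.
        count = sum(is_le)
        new_lo = splitters[count - 1] if count > 0 else lo
        new_hi = splitters[count] if count < len(splitters) else hi
        lo, hi = new_lo, new_hi
        rounds += 1
    return lo, rounds
```

With `p = 1` this degenerates to ordinary binary search; larger `p` shrinks the interval by roughly a factor of $p+1$ per round.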

parallel search

example

parallel search

$p$ processes
$\mathcal{O}(p)$ work per step, $\mathcal{O}(1)$ depth
$p^k = n \Rightarrow k = \log_p n$
Work optimal ?
no parallelism if we insist on work optimality
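A short worked estimate of why work optimality fails, assuming each of the $k$ rounds costs $p$ comparisons as stated above:

\[ W_p(n) = p \cdot \log_p n = \frac{p}{\log p}\,\log n, \qquad D_p(n) = \log_p n = \frac{\log n}{\log p} \]

Sequential binary search does $\mathcal{O}(\log n)$ work, so the parallel search matches it only when $p/\log p = \mathcal{O}(1)$, that is, for constant $p$. The speedup over binary search is a factor of $\log p$.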

ranking

given ordered lists, $A,B$ of lengths $n, m$
define:

rank$(z:A) :=$ number of elements $a_i \in A$ such that $a_i\leq z$
define:

rank$(B:A) := (r_1,r_2,\cdots,r_m)$
$r_i =$ rank$(b_i:A)$

parallel ranking

A = [7, 13, 25, 26, 31, 54]
B = [1, 8, 13, 27]
Rank(B:A) = (0, 1, 2, 4)
Rank(A:B) = (1, 3, 3, 3, 4, 4)

algorithm for Rank(b:A)

ranking one element in an array A
use a binary search algorithm
Depth
- sequential search $\mathcal{O}( \log(n) )$
- parallel search $\mathcal{O}( \log(n)/\log(p) )$
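A minimal sketch of ranking by binary search, using `bisect_right` from the Python standard library, which returns the number of elements less than or equal to the query in a sorted list. The function names are assumptions for illustration.

```python
from bisect import bisect_right

def rank(z, a):
    """rank(z:A): number of elements of the sorted list a that are <= z."""
    return bisect_right(a, z)

def rank_list(b, a):
    """rank(B:A): the ranks are independent, so this loop is a parallel for."""
    return [rank(z, a) for z in b]
```

With `A = [7, 13, 25, 26, 31, 54]` and `B = [1, 8, 13, 27]`, `rank_list(B, A)` gives `[0, 1, 2, 4]` and `rank_list(A, B)` gives `[1, 3, 3, 3, 4, 4]`.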

merge sort

divide & conquer mergesort

divide $X$ into $X_1$ and $X_2$
sort $X_1$ and $X_2$
merge $X_1$ and $X_2$
uses a binary tree
- bottom-up approach
- start with the leaves
- climb to the root
- merge the branches
requires parallel merge

mergesort - example

mergesort

b = Merge_Sort(a,n)
  if n < 100 
    return seqSort(a, n);
  b1 = Merge_Sort(a[0,…,n/2-1], n/2);
  b2 = Merge_Sort(a[n/2,…,n-1], n - n/2);
  return Merge (b1, b2);

parallel merge

merging two sorted lists

best sequential time — $\mathcal{O}(n)$

parallel merge

tradeoffs between

depth-optimal
work-optimal

merging using ranking

Assume elements in $A$ and $B$ are distinct
Let $C$ be the merged result. Given, $x \in C$ and rank$(x:C)=i$
- $c_i=x$
rank$(x:C) = $rank$(x:A)+$rank$(x:B)$
Solution to the merging problem,
- find rank$(x:A)$ and rank$(x:B)$
- parallel searches using $p=nm$ processors: $D=\mathcal{O}(1)$ but $W=\mathcal{O}(nm)$, i.e. $\mathcal{O}(n^2)$ when $m=\mathcal{O}(n)$
- Concurrent binary searches, $D=\mathcal{O}(\log n)$ and $W=\mathcal{O}(n \log n)$
Goal: Parallelize with optimal work
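The rank identity above gives each element its final position directly. The sketch below corresponds to the concurrent-binary-search variant, so it is not work-optimal; it relies on the stated assumption that all elements of $A$ and $B$ are distinct, and the function name is hypothetical.

```python
from bisect import bisect_right

def merge_by_ranking(a, b):
    """Merge sorted lists with distinct elements by computing each output position."""
    c = [None] * (len(a) + len(b))
    for i, x in enumerate(a):              # parallel for: one binary search per element
        c[i + bisect_right(b, x)] = x      # 0-based position = (rank in A - 1) + rank in B
    for j, y in enumerate(b):              # parallel for
        c[j + bisect_right(a, y)] = y
    return c
```

Every write goes to a distinct slot, so no synchronization is needed between iterations. If the two lists share a value, two elements compute the same position and one slot is left empty, which is why the distinctness assumption matters.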

work-optimal merge

work-optimal parallel merge

partition $B$ into blocks with $\log m$ elements

work-optimal parallel merge

rank splitters of $B$ in $A$

work-optimal parallel merge

partition $A$ accordingly

work-optimal parallel merge

merge blocks $B_i$ and $A_i$ sequentially

work-optimal parallel merge

partition $B$ into $m/\log m$ blocks, each with $\log m$ elements
parallel for $i=1:m/\log m$
- $r_i = $`seq_rank`$(b_{iK}: A)$, where $K=\log m$ is the block length and $b_{iK}$ is the last element of block $B_i$
partition $A$ accordingly
- block $A_i: (a_{r_{i-1}+1},\cdots,a_{r_i})$
merge blocks of $A$ and $B$ sequentially in $\mathcal{O}(\log n)$ time
but, if $|A_i|\gg|B_i|=\log m$ then par_merge$(B_i, A_i)$
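The steps above can be sketched as follows. This is a sequential simulation under stated assumptions: the names are hypothetical, the block length is $\lfloor \log_2 m \rfloor$ (at least 1), the last block of $A$ absorbs everything beyond the final splitter, and the fallback to a recursive par_merge for oversized $A_i$ is omitted.

```python
from bisect import bisect_right
from math import log2

def seq_merge(a, b):
    """Standard two-pointer merge of two sorted lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out

def blocked_merge(a, b):
    """Merge sorted a and b by cutting b into blocks of about log m elements."""
    m = len(b)
    if m == 0:
        return list(a)
    k = max(1, int(log2(m)))                 # block length K
    starts = list(range(0, m, k))            # block i is b[starts[i] : starts[i] + k]
    # Rank the last element of each block of b in a (parallel for, one binary search each).
    cuts = [bisect_right(a, b[min(s + k, m) - 1]) for s in starts]
    cuts[-1] = len(a)                        # the last block of a takes the remaining tail
    out, lo = [], 0
    for s, hi in zip(starts, cuts):          # block pairs are independent (parallel for);
        out.extend(seq_merge(a[lo:hi], b[s:s + k]))   # pair i would write at offset lo + s
        lo = hi
    return out
```

Each pair $(A_i, B_i)$ covers a disjoint value range, so concatenating the per-block merges in block order yields the merged list.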

work-optimal parallel merge

assuming $m=\mathcal{O}(n)$,

\[W=\mathcal{O}(n)\]

and,

\[D=\mathcal{O}(\log n)\]
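One way to see where these bounds come from, assuming $m \leq n$, block length $K=\log m$, and $|A_i| = \mathcal{O}(\log n)$ for every block (oversized blocks are handed to par_merge instead):

\[ W_{\text{rank}} = \frac{m}{\log m}\cdot\mathcal{O}(\log n) = \mathcal{O}(n), \qquad W_{\text{merge}} = \sum_i \mathcal{O}(|A_i|+|B_i|) = \mathcal{O}(n+m) = \mathcal{O}(n) \]

\[ D = \underbrace{\mathcal{O}(\log n)}_{\text{ranking}} + \underbrace{\mathcal{O}(\log m + \max_i |A_i|)}_{\text{block merges}} = \mathcal{O}(\log n) \]

The first bound uses that $m/\log m$ is increasing, so $\frac{m}{\log m}\log n \leq \frac{n}{\log n}\log n = n$.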

next time …

bitonic sort
sample sort

self-test questions

design a work-optimal ranking algorithm
- similar to parallel merge
what are the challenges of implementing the parallel merge in a shared memory framework? how about using message passing?
can you implement a parallel mergesort using the algorithms discussed today? What will its time complexity be?
