Shared Memory & Network Models

story thus far …

four-step design methodology

message passing interface


minimal set of six MPI functions we will need

  // Initiates use of MPI
  int MPI_Init(int *argc, char ***argv);

  // Concludes use of MPI
  int MPI_Finalize(void);

  // On return, size contains number of processes in comm
  int MPI_Comm_size(MPI_Comm comm, int *size);

  // On return, rank contains rank of calling process in comm
  int MPI_Comm_rank(MPI_Comm comm, int *rank);

  // On return, msg can be reused immediately
  int MPI_Send(void *msg, int count, MPI_Datatype datatype,
               int dest, int tag, MPI_Comm comm);

  // On return, msg contains requested message
  int MPI_Recv(void *msg, int count, MPI_Datatype datatype,
               int source, int tag, MPI_Comm comm, MPI_Status *status);
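
As a point of reference, here is a minimal sketch that uses only these six calls (my own example, not from the notes): rank 1 sends one integer to rank 0. It assumes the program is launched with at least two processes.

// minimal sketch using only the six functions above (hypothetical example);
// run with at least two processes, e.g. "mpirun -np 2 ./a.out"
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);

  int rank, size;
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank == 1) {
    int msg = 42;
    MPI_Send(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);     // dest = 0, tag = 0
  } else if (rank == 0) {
    int msg;
    MPI_Status status;
    MPI_Recv(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &status);
    printf("rank 0 of %d received %d\n", size, msg);
  }

  MPI_Finalize();
  return 0;
}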

example: reduction


Given \(A\in\mathbb{R}^n\), compute \(\sum_{i=1}^n A_i\)

// option 1
double sum1(double *A, int n) {         //    W     D
  double sum = 0.0;                     //    1     1
  for (int i=0; i<n; ++i) sum += A[i];  //    n     n
  return sum;                           //    1     1
}

\[ W(n) = 2 + n, \quad D(n) = 2 + n, \quad P(n) = \mathcal{O}(1) \]

// option 2 (B is a scratch array; assume n is a power of two)
double sum2(double *A, int n) {       //    W       D
  if (n == 1) return A[0];            //    1       1
  parallel for (int i=0; i<n/2; ++i)  //    n       1
    B[i] = A[2*i] + A[2*i+1];
  return sum2(B, n/2);                //  W(n/2)  D(n/2)
}

\[ W(n) = n + W(n/2), \quad D(n) = 1 + D(n/2), \quad P(n) = \mathcal{O}(n/\log n) \]
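
One way to realize option 2 on a shared-memory machine is the sketch below (my own C/OpenMP version, not from the notes): it assumes n is a power of two and, instead of a scratch array B, reduces A in place with a doubling stride. It performs \(\log_2 n\) parallel passes, matching \(D(n) = \mathcal{O}(\log n)\).

// in-place pairwise reduction, one parallel pass per level of the tree
// (compile with -fopenmp; modifies A)
double sum2_omp(double *A, int n) {
  for (int s = 1; s < n; s *= 2) {     // s = 1, 2, 4, ... : log2(n) passes
    #pragma omp parallel for
    for (int i = 0; i < n; i += 2*s)
      A[i] += A[i + s];                // each pass combines disjoint pairs
  }
  return A[0];                         // A[0] now holds the total
}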

today


  • PRAM
  • Network Models
  • Collective communications

PRAM


  • parallel random access machine
  • shared memory model (synchronous)
  • SIMD - specify code for single thread
    • same instruction for all threads
  • no memory access cost
    • each thread has its own memory along with shared memory
    • shared memory variables must be declared
  • global/local memory (read/write)
    • ER/CR: exclusive read, concurrent read
    • EW/CW: exclusive write, concurrent write
    • CREW PRAM: most practical
  • Complexity estimates: \(T(n,p), W(n,p)\)

PRAM sum

# t: thread id (0-based), p: number of threads (n, p: powers of two)
PRAM_sum(A, n):
  if n == 1:
    return A[0]

  if n > p:
    r = n/p
    w = SEQ_sum (A[t*r], r)        # each thread sums its block of r values
    S[t] = global_write (w)        # EW
    return PRAM_sum (S, p)

  if t < n/2:
    a1 = global_read (A[2*t])      # ER
    a2 = global_read (A[2*t+1])    # ER
    S[t] = global_write (a1+a2)    # EW
    return PRAM_sum (S, n/2)
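
Adding up the two phases (a quick tally, assuming \(n \ge p\), both powers of two): the sequential block sums cost \(\mathcal{O}(n/p)\), and the pairwise combining then halves the problem \(\log_2 p\) times at \(\mathcal{O}(1)\) per round, so

\[ T(n,p) = \mathcal{O}\!\left(\frac{n}{p} + \log p\right), \qquad W(n,p) = \mathcal{O}(n) \]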

PRAM MATVEC


\(A\) is \(n \times n\); \(x\) (input) and \(y\) (output) are in global memory

# assume n>p, n%p == 0
PRAM_matvec(A, x, y, n):
  r = n/p
  z = global_read (x)                    # CR - O(n)
  B = global_read (A[t*r, :], r)         # ER - O(n^2/p), rows t*r .. (t+1)*r - 1
  w = SEQ_matvec (B, z)                  # W  - O(n^2/p)
  y[t*r : (t+1)*r] = global_write (w)    # EW - O(n/p)
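
Totaling the annotated per-thread costs (a quick estimate, assuming \(n > p\)): the concurrent read of \(x\) costs \(\mathcal{O}(n)\) and the block work \(\mathcal{O}(n^2/p)\), so

\[ T(n,p) = \mathcal{O}\!\left(\frac{n^2}{p} + n\right), \qquad W(n,p) = \mathcal{O}(n^2) \]

In this model, the \(\mathcal{O}(n)\) read of \(x\) means processors beyond \(p \approx n\) no longer reduce the runtime.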

scheduling DAGs to PRAM

  • scheduling principle: \(T(n,p) \leq \lfloor W(n)/p \rfloor + D(n)\)
  • simulate the work \(W_i(n)\) at the \(i\)-th level of the DAG using \(p\) processors
  • total runtime

    \[ \begin{align} T(n,p) & \leq \sum_i \left\lceil\frac{W_i(n)}{p}\right\rceil \\ & \leq \sum_i \left(\left\lfloor\frac{W_i(n)}{p}\right\rfloor + 1 \right) \\ & \leq \left\lfloor\frac{W(n)}{p}\right\rfloor + D(n) \end{align} \]
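
For example, plugging the pairwise-sum recurrences from option 2 (\(W(n) = \mathcal{O}(n)\), \(D(n) = \mathcal{O}(\log n)\)) into the scheduling principle gives

\[ T(n,p) \leq \left\lfloor\frac{W(n)}{p}\right\rfloor + D(n) = \mathcal{O}\!\left(\frac{n}{p} + \log n\right) \]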

network models

message passing model

network topologies

  • Access to remote data requires communication
  • Direct connections would require \(\mathcal{O}(p^2)\) wires and communication ports, which is infeasible for large \(p\)
  • Limited connectivity necessitates routing data through intermediate processors or switches
  • Topology of network affects algorithm design, implementation, and performance

common network topologies


message passing


Simple model for time required to send message (move data) between adjacent nodes:


\(T_{msg} = t_s + t_w L\)


  • \(t_s\) = startup time = latency (i.e., time to send message of length zero)
  • \(t_w\) = incremental transfer time per word (\(1/t_w\) = bandwidth in words per unit time)
  • \(L\) = length of message in words


For most real parallel systems, \(t_s \gg t_w\)
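
For a rough feel, take some hypothetical (not measured) numbers: \(t_s = 10\,\mu s\) and \(t_w = 10\) ns per word. A 1-word message costs \(T_{msg} \approx 10.01\,\mu s\), essentially all latency, while a \(10^6\)-word message costs \(\approx 10\) ms, essentially all transfer time; it therefore pays to send a few large messages rather than many small ones.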

cost of collectives

reduction

ring reduction pseudocode


reduction(A, id, p) {
  s = A;
  for k=0:log2(p)-1 {
    r = 2^k;
    // only nodes whose id is a multiple of r = 2^k are still active
    if (id % r != 0) break;
    cid = id/r;

    if even(cid) {
      partner = (cid+1)*r;    // i.e., id + r
      recv(sp, partner);
      s = s + sp;
    } else {
      partner = (cid-1)*r;    // i.e., id - r
      send(s, partner);
    }
  }
  return s;
}
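
A quick trace of this scheme for \(p = 8\) (node ids 0-7), read off the pseudocode above ("a→b" means a sends to b):

  • k=0 (r=1): 1→0, 3→2, 5→4, 7→6
  • k=1 (r=2): 2→0, 6→4
  • k=2 (r=4): 4→0

After \(\log_2 p = 3\) steps node 0 holds the sum; in general the scheme takes \(\log_2 p\) send/receive rounds.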

hypercube


hypercube reduction


hcube_reduction(A, id, p) {
  d = log2(p);
  s = A;
  mask = 0;
  for k=0:d-1 {
    r = 2^k;

    // participate only if id's lower k bits are all zero
    if ((id & mask) == 0) {
      partner = id ^ r;       // flip bit k
      if (id & r) {
        send(s, partner);     // higher node of the pair sends
      } else {
        recv(sp, partner);    // lower node receives and accumulates
        s = s + sp;
      }
    }
    mask = mask | r;          // set bit k: lower k+1 bits are now 1
  }
  return s;
}
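
Below is a rough MPI translation of the hypercube reduction (my own sketch, not a library routine): it assumes the number of ranks is a power of two and reduces one double onto rank 0. In practice you would simply call MPI_Reduce.

#include <mpi.h>
#include <stdio.h>

double hcube_reduce(double s, int id, int p) {
  int mask = 0;
  for (int r = 1; r < p; r <<= 1) {      // r = 2^k for k = 0 .. log2(p)-1
    if ((id & mask) == 0) {              // only ranks with lower k bits zero
      int partner = id ^ r;              // flip bit k
      if (id & r) {
        MPI_Send(&s, 1, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD);
      } else {
        double sp;
        MPI_Recv(&sp, 1, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        s += sp;
      }
    }
    mask |= r;                           // lower k+1 bits are now 1
  }
  return s;                              // valid on rank 0
}

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int id, p;
  MPI_Comm_rank(MPI_COMM_WORLD, &id);
  MPI_Comm_size(MPI_COMM_WORLD, &p);

  double local = id + 1.0;               // toy local value on each rank
  double total = hcube_reduce(local, id, p);
  if (id == 0) printf("sum = %g (expected %g)\n", total, p*(p+1)/2.0);

  MPI_Finalize();
  return 0;
}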

broadcast

source node sends the same message to each of \(p-1\) other nodes

int MPI_Bcast (void* buffer, int count, MPI_Datatype datatype, 
               int root, MPI_Comm comm);
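
A small usage sketch (hypothetical example): rank 0 fills an array of four doubles and broadcasts it to every process. Note that every rank in the communicator must make the same MPI_Bcast call.

#include <mpi.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  double coeff[4] = {0.0, 0.0, 0.0, 0.0};
  if (rank == 0) { coeff[0] = 1.0; coeff[1] = 2.0; coeff[2] = 3.0; coeff[3] = 4.0; }

  // after this call returns, every rank holds rank 0's values
  MPI_Bcast(coeff, 4, MPI_DOUBLE, 0, MPI_COMM_WORLD);

  MPI_Finalize();
  return 0;
}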

broadcast


Cost of broadcast depends on the network topology; for example:

  topology  | broadcast time
  ----------|---------------------------------
  1-D mesh  | \(T=(p-1)(t_s + t_wL)\)
  2-D mesh  | \(T=2(\sqrt{p}-1)(t_s + t_wL)\)
  hypercube | \(T=\log p (t_s + t_wL)\)

scatter & gather


int MPI_Gather(void* sbuf, int scount, MPI_Datatype stype,
               void* rbuf, int rcount, MPI_Datatype rtype,
               int root, MPI_Comm comm )

int MPI_Scatter(void* sbuf, int scount, MPI_Datatype stype,
                void* rbuf, int rcount, MPI_Datatype rtype,
                int root, MPI_Comm comm)
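
A small usage sketch (hypothetical example, assuming the vector length is exactly size * BLOCK): rank 0 scatters one block per rank, each rank scales its block, and the blocks are gathered back on rank 0 in rank order.

#include <mpi.h>
#include <stdlib.h>

#define BLOCK 4

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  double *v = NULL;
  if (rank == 0) {                       // only the root owns the full vector
    v = malloc(size * BLOCK * sizeof(double));
    for (int i = 0; i < size * BLOCK; ++i) v[i] = i;
  }

  double local[BLOCK];
  // rank i receives entries i*BLOCK .. (i+1)*BLOCK - 1 of v
  MPI_Scatter(v, BLOCK, MPI_DOUBLE, local, BLOCK, MPI_DOUBLE, 0, MPI_COMM_WORLD);

  for (int i = 0; i < BLOCK; ++i) local[i] *= 2.0;   // some local work

  // gather the blocks back into v on rank 0, in rank order
  MPI_Gather(local, BLOCK, MPI_DOUBLE, v, BLOCK, MPI_DOUBLE, 0, MPI_COMM_WORLD);

  if (rank == 0) free(v);
  MPI_Finalize();
  return 0;
}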



allgather


int MPI_Allgather(void* sbuf, int scount, MPI_Datatype stype,
                  void* rbuf, int rcount, MPI_Datatype rtype, MPI_Comm comm)


all to all


int MPI_Alltoall(void* sbuf, int scount, MPI_Datatype stype,
                 void* rbuf, int rcount, MPI_Datatype rtype, MPI_Comm comm)


scan, prefix sum


  • Given an array, \([4, 2, 1, 8]\)
  • Compute, \([4, 4+2, 4+2+1, 4+2+1+8]\)
  • \([4, 6, 7, 15]\)


obvious sequential algorithm

\[ D_s(n) = \mathcal{O}(n) \]
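
The sequential loop makes the dependence chain explicit (a small C sketch of the obvious algorithm):

// inclusive scan: s[i] depends on s[i-1], a loop-carried dependence,
// so the depth of this algorithm is O(n)
void seq_scan(const double *a, double *s, int n) {
  s[0] = a[0];
  for (int i = 1; i < n; ++i)
    s[i] = s[i-1] + a[i];
}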


not parallelizable as written; we need to change the algorithm

recursive scan


s = Rec_Scan(a, n)
// s[k] = a[0] + a[1] + … + a[k], k = 0, …, n-1   (inclusive scan)
if n = 1 { s[0] ← a[0]; return; }    // base case
par for (i=0; i<n/2; ++i)            // pairwise sums
  b[i] ← a[2i] + a[2i+1];
c ← Rec_Scan (b, n/2);               // c[j] = a[0] + … + a[2j+1]
s[0] ← a[0];
par for (i=1; i<n; ++i)
  if isOdd(i)
    s[i] ← c[i/2];                   // i = 2j+1: prefix ends at a[2j+1], exactly c[j]
  else
    s[i] ← c[i/2 - 1] + a[i];        // i = 2j, j >= 1: previous odd prefix plus a[i]
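
Tracing Rec_Scan on the earlier example \([4, 2, 1, 8]\):

  • pairwise sums: b = [6, 9]
  • recursive call: c = Rec_Scan(b, 2) = [6, 15]
  • combine: s[0] = 4, s[1] = c[0] = 6, s[2] = c[0] + a[2] = 7, s[3] = c[1] = 15

which matches the inclusive scan \([4, 6, 7, 15]\) above.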

reading


Intro to Parallel Algorithms

first chapter of JáJá's book.