Parallel Computing

Spring 2015 - CS:6230

Hari Sundar, MEB 3454
hari@cs.utah.edu
WEB L120 - MW 3-4:20pm

overview

scope


scalable algorithms


efficient implementations

outline

  • Parallel programming models
    • Shared Memory (Work/Depth, PRAM), APIs (OpenMP)
    • Distributed Memory (message passing), APIs (MPI)
  • Discrete Algorithms
    • Sorting, Searching, Selecting, Merging
    • Trees, Graphs (coloring, partitioning, traversal)
  • Numerical Algorithms
    • Dense/Sparse Linear Algebra (LU, SOR, Krylov, Nested Dissection)
    • Fast Transforms - FFT, Multigrid, Gauss
    • \(n\)-body Algorithms - Fast Multipole Methods, Tree codes, nearest neighbors
    • Time parallelism (parareal algorithms)
  • Optional topics

Logistics

Logistics

  • 3-4 assignments - individual
  • midterm exam
  • final project
    • teams of two
    • project proposals due by end of February
    • project reports & presentations


no extensions

Logistics

  • Primarily using MPI and OpenMP
    • modern C/C++ compiler
    • install openmpi or mpich on your machine
    • good for development and small debugging
  • Clusters
    • CHPC - Tangent - 64 nodes with 16 cores each
    • XSEDE - Stampede
      • 6400(256) nodes with 16 cores each
      • will also consider Intel Xeon Phi later in course

Logistics

programming problems

  • course github page
    • basic code template
    • you will submit functions
    • automatically graded
      • correctness
      • scalability - rankings

motivation

scientific discovery

inference, prediction, decision making


in the past, largely driven by

  • experiments
    • expensive
    • impossible
    • hazardous
    • difficult to reproduce


increasingly,

  • computational science & engineering
  • in silico experiments

experiments vs simulation

architectures

Flynn’s taxonomy

Classifies architectures based on the relationship between the instruction and data streams:


SISD : single instruction stream, single data stream → standard sequential machine

SIMD : single instruction stream, multiple data streams → vectorized, SSE, AVX etc

MISD : multiple instruction streams, single data stream → rarely used in practice

MIMD : multiple instruction streams, multiple data streams → modern multicore machines
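
As a concrete illustration of the SIMD style, here is a minimal sketch (using AVX intrinsics; not part of the original slides) in which a single instruction operates on eight floats at once:

#include <immintrin.h>   // AVX intrinsics; compile with -mavx

// Add two float arrays eight elements at a time (SIMD).
// Assumes n is a multiple of 8 and all pointers are valid.
void add_avx(const float *a, const float *b, float *c, int n) {
  for (int i = 0; i < n; i += 8) {
    __m256 va = _mm256_loadu_ps(a + i);   // load 8 floats
    __m256 vb = _mm256_loadu_ps(b + i);
    __m256 vc = _mm256_add_ps(va, vb);    // one instruction, 8 additions
    _mm256_storeu_ps(c + i, vc);          // store 8 results
  }
}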

SPMD

(Single Program, Multiple Data)


the developer writes a single program that runs on every process, but each process computes only a subset of the total work

  • easier to program than the general MIMD model, but not as restrictive as the SIMD model
  • although most modern machines are MIMD, the vast majority of codes are written in an SPMD fashion (see the sketch below)
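
A minimal SPMD sketch (illustrative; the loop and its bounds are assumptions, not from the slides): every process runs the same program, but uses its rank to pick its own block of the work:

#include <stdio.h>
#include <mpi.h>

/* SPMD: the same program runs on every process; each one uses its
 * rank to work on a different block of the iteration space. */
int main(int argc, char *argv[]) {
  MPI_Init(&argc, &argv);

  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const int N = 1000;                  /* total work items (assumed)   */
  int chunk = (N + size - 1) / size;   /* block size per process       */
  int lo = rank * chunk;
  int hi = (lo + chunk < N) ? lo + chunk : N;

  double local_sum = 0.0;
  for (int i = lo; i < hi; ++i)        /* each rank does its own slice */
    local_sum += 0.5 * i;

  double global_sum = 0.0;
  MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
  if (rank == 0) printf("sum = %f\n", global_sum);

  MPI_Finalize();
  return 0;
}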

distributed vs shared memory


Shared Memory

  • memory is uniformly accessible to all processes/threads
  • no formal inter-process communication mechanism is needed; communication happens implicitly through shared variables


Distributed Memory

  • memory is local to each process
  • a formal mechanism for inter-process communication (message passing) is required
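
To make the contrast concrete, here is a minimal shared-memory sketch (illustrative OpenMP, not from the slides): one thread writes a shared variable and another simply reads it, with no explicit communication call. The distributed-memory equivalent requires an explicit send/recv pair, shown later under the network model.

#include <stdio.h>
#include <omp.h>

/* Shared memory: threads communicate implicitly through the shared
 * variable x.  Compile with e.g.  gcc -fopenmp shared_demo.c  */
int main(void) {
  int x = 0;                                  /* visible to all threads    */
  #pragma omp parallel num_threads(2) shared(x)
  {
    if (omp_get_thread_num() == 0)
      x = 42;                                 /* thread 0 writes            */
    #pragma omp barrier                       /* wait until the write lands */
    if (omp_get_thread_num() == 1)
      printf("thread 1 sees x = %d\n", x);    /* thread 1 reads, no message */
  }
  return 0;
}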

hybrid systems

memory is shared locally within SMP (symmetric multiprocessor) nodes but distributed globally across nodes
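
A common way to program such hybrid systems is MPI across nodes combined with OpenMP within a node; a minimal sketch (illustrative, not part of the slides):

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

/* Hybrid model: one MPI process per node (distributed memory) and
 * several OpenMP threads per process (shared memory within the node).
 * Build with mpicc -fopenmp; run e.g. OMP_NUM_THREADS=4 mpirun -np 2 ./hybrid */
int main(int argc, char *argv[]) {
  int provided;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  #pragma omp parallel
  {
    printf("rank %d, thread %d of %d\n",
           rank, omp_get_thread_num(), omp_get_num_threads());
  }

  MPI_Finalize();
  return 0;
}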

heterogeneous systems

has accelerators (e.g. GPUs) in addition to SMP cores within a node


Stampede

network topologies

network models

  • Distributed memory models
  • Graph \(G=(N,E)\)
    • Nodes: processors
    • Edges: two-way communication link
  • Memory
    • Each processor \(p\) has local RAM; no shared RAM
  • Asynchronous
  • Two basic communication constructs (see the MPI sketch below)
    • send(data, to_proc_i)
    • recv(data, from_proc_j)
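
A minimal point-to-point sketch of these two constructs using MPI (illustrative; the payload and tag are arbitrary):

#include <stdio.h>
#include <mpi.h>

/* Point-to-point message passing: rank 0 sends, rank 1 receives.
 * Run with at least two processes, e.g.  mpirun -np 2 ./sendrecv  */
int main(int argc, char *argv[]) {
  MPI_Init(&argc, &argv);

  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  int data = 0;
  if (rank == 0) {
    data = 123;                                /* arbitrary payload */
    MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
  } else if (rank == 1) {
    MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("rank 1 received %d from rank 0\n", data);
  }

  MPI_Finalize();
  return 0;
}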

common network topologies

network properties


  • Diameter
    • Maximum distance (number of hops) between any two nodes
  • Connectivity
    • Minimum number of links that must be removed to disconnect a node
  • Bisection width
    • Minimum number of links that must be cut to split the network into two equal halves
  • Cost
    • Total number of links

network properties

Network   | Nodes   | Cost              | Diameter   | Bisection Width
1-D mesh  | \(k\)   | \(k-1\)           | \(k-1\)    | \(1\)
2-D mesh  | \(k^2\) | \(2k(k-1)\)       | \(2(k-1)\) | \(k\)
3-D mesh  | \(k^3\) | \(3k^2(k-1)\)     | \(3(k-1)\) | \(k^2\)
n-D mesh  | \(k^n\) | \(nk^{n-1}(k-1)\) | \(n(k-1)\) | \(k^{n-1}\)
1-D torus | \(k\)   | \(k\)             | \(k/2\)    | \(2\)
2-D torus | \(k^2\) | \(2k^2\)          | \(k\)      | \(2k\)
3-D torus | \(k^3\) | \(3k^3\)          | \(3k/2\)   | \(2k^2\)
n-D torus | \(k^n\) | \(nk^n\)          | \(nk/2\)   | \(2k^{n-1}\)
hypercube | \(2^k\) | \(k\,2^{k-1}\)    | \(k\)      | \(2^{k-1}\)
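
As a quick sanity check of the table (a worked example, not from the slides), take a hypercube with \(k = 3\), i.e. an ordinary cube:

\[
\text{nodes} = 2^3 = 8, \qquad
\text{cost} = 3 \cdot 2^{2} = 12 \text{ links}, \qquad
\text{diameter} = 3, \qquad
\text{bisection width} = 2^{2} = 4.
\]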

MPI

building & running

hello mpi

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
  MPI_Init(&argc, &argv);

  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  printf ("Hello from process % of %d\n", rank, size);

  MPI_Finalize();
  return 0;
}

compiling & running

$ mpicc -o hello hello_mpi.c 
$ mpirun -np 8 ./hello
$
$ sbatch batch.sh      # submit the SLURM batch script (batch.sh) shown below
#!/bin/bash
#SBATCH -J hello        # Job name
#SBATCH -o hello.o%j    # name of output file (%j expands to jobID)
#SBATCH -N 1 -n 16      # total number of mpi tasks requested
#SBATCH -p normal       # Queue name --- normal, development, etc.
#SBATCH -t 00:30:00     # run time (hh:mm:ss)
#SBATCH -A TG-CDA150001 # account 

set -x                 # echo commands

cd $HOME/hello

ibrun tacc_affinity ./hello

assignment 0

readings