Parallel Computing

Spring 2015 - CS:6230

Hari Sundar, MEB 3454
hari@cs.utah.edu
WEB L120 - MW 3-4:20pm

overview

scope


scalable algorithms


efficient implementations

outline

  • Parallel programming models
    • Shared Memory (Work/Depth, PRAM), APIs (OpenMP)
    • Distributed Memory (message passing), APIs (MPI)
  • Discrete Algorithms
    • Sorting, Searching, Selecting, Merging
    • Trees, Graphs (coloring, partitioning, traversal)
  • Numerical Algorithms
    • Dense/Sparse Linear Algebra (LU, SOR, Krylov, Nested Dissection)
    • Fast Transforms - FFT, Multigrid, Gauss
    • \(n\)-body Algorithms - Fast Multipole Methods, Tree codes, nearest neighbors
    • Time parallelism (parareal algorithms)
  • Optional topics

Logistics

Logistics

  • 3-4 assignments - individual
  • midterm exam
  • final project
    • teams of two
    • project proposals due by end of February
    • project reports & presentations


no extensions

Logistics

  • Primarily using MPI and OpenMP
    • modern C/C++ compiler
    • install openmpi or mpich on your machine
    • good for development and small debugging
  • Clusters
    • CHPC - Tangent - 64 nodes with 16 cores each
    • XSEDE - Stampede
      • 6400(256) nodes with 16 cores each
      • will also consider Intel Xeon Phi later in course

Logistics

programming problems

  • course github page
    • basic code template
    • you will submit functions
    • automatically graded
      • correctness
      • scalability - rankings

motivation

scientific discovery

inference, prediction, decision making


in the past, largely driven by

  • experiments
    • expensive
    • impossible
    • hazardous
    • difficult to reproduce


increasingly,

  • computational science & engineering
  • in silico experiments

experiments vs simulation

architectures

Flynn’s taxonomy

Classifies architectures based on the relationship between the instruction and data streams:


SISD : single instruction stream, single data stream → standard sequential machine

SIMD : single instruction stream, multiple data streams → vectorized, SSE, AVX etc

MISD : multiple instruction streams, single data stream → rarely used in practice

MIMD : multiple instruction streams, multiple data streams → modern multicore machines
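
As a concrete illustration of the SIMD style, here is a minimal sketch (using AVX intrinsics; not part of the original slides) in which a single instruction operates on eight floats at once:

#include <immintrin.h>   // AVX intrinsics; compile with -mavx

// Add two float arrays eight elements at a time (SIMD).
// Assumes n is a multiple of 8 and all pointers are valid.
void add_avx(const float *a, const float *b, float *c, int n) {
  for (int i = 0; i < n; i += 8) {
    __m256 va = _mm256_loadu_ps(a + i);   // load 8 floats
    __m256 vb = _mm256_loadu_ps(b + i);
    __m256 vc = _mm256_add_ps(va, vb);    // one instruction, 8 additions
    _mm256_storeu_ps(c + i, vc);          // store 8 results
  }
}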

SPMD

(Single Program, Multiple Data)


the developer writes a single program that runs on every process, but each process computes only a subset of the total work

  • easier to program than the general MIMD model, but not as restrictive as the SIMD model
  • although most modern machines are MIMD, the vast majority of codes are written in an SPMD fashion (see the sketch below)
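
A minimal SPMD sketch (illustrative; the loop and its bounds are assumptions, not from the slides): every process runs the same program, but uses its rank to pick its own block of the work:

#include <stdio.h>
#include <mpi.h>

/* SPMD: the same program runs on every process; each one uses its
 * rank to work on a different block of the iteration space. */
int main(int argc, char *argv[]) {
  MPI_Init(&argc, &argv);

  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const int N = 1000;                  /* total work items (assumed)   */
  int chunk = (N + size - 1) / size;   /* block size per process       */
  int lo = rank * chunk;
  int hi = (lo + chunk < N) ? lo + chunk : N;

  double local_sum = 0.0;
  for (int i = lo; i < hi; ++i)        /* each rank does its own slice */
    local_sum += 0.5 * i;

  double global_sum = 0.0;
  MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
  if (rank == 0) printf("sum = %f\n", global_sum);

  MPI_Finalize();
  return 0;
}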

distributed vs shared memory


Shared Memory

  • memory is uniformly accessible to all processes/threads
  • no formal inter-process communication mechanism is needed; communication happens implicitly through shared variables


Distributed Memory

  • memory is local to each process
  • a formal mechanism for inter-process communication (message passing) is required
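
To make the contrast concrete, here is a minimal shared-memory sketch (illustrative OpenMP, not from the slides): one thread writes a shared variable and another simply reads it, with no explicit communication call. The distributed-memory equivalent requires an explicit send/recv pair, shown later under the network model.

#include <stdio.h>
#include <omp.h>

/* Shared memory: threads communicate implicitly through the shared
 * variable x.  Compile with e.g.  gcc -fopenmp shared_demo.c  */
int main(void) {
  int x = 0;                                  /* visible to all threads    */
  #pragma omp parallel num_threads(2) shared(x)
  {
    if (omp_get_thread_num() == 0)
      x = 42;                                 /* thread 0 writes            */
    #pragma omp barrier                       /* wait until the write lands */
    if (omp_get_thread_num() == 1)
      printf("thread 1 sees x = %d\n", x);    /* thread 1 reads, no message */
  }
  return 0;
}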

hybrid systems

memory is shared locally within SMP (symmetric multiprocessor) nodes but distributed globally across nodes
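
A common way to program such hybrid systems is MPI across nodes combined with OpenMP within a node; a minimal sketch (illustrative, not part of the slides):

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

/* Hybrid model: one MPI process per node (distributed memory) and
 * several OpenMP threads per process (shared memory within the node).
 * Build with mpicc -fopenmp; run e.g. OMP_NUM_THREADS=4 mpirun -np 2 ./hybrid */
int main(int argc, char *argv[]) {
  int provided;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  #pragma omp parallel
  {
    printf("rank %d, thread %d of %d\n",
           rank, omp_get_thread_num(), omp_get_num_threads());
  }

  MPI_Finalize();
  return 0;
}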

heterogeneous systems

has accelerators (e.g. GPUs) in addition to SMP cores within a node


Stampede

network topologies

network models

  • Distributed memory models
  • Graph \(G=(N,E)\)
    • Nodes: processors
    • Edges: two-way communication link
  • Memory
    • Each processor \(p\) has local RAM; no shared RAM
  • Asynchronous
  • Two basic communication constructs (see the MPI sketch below)
    • send(data, to_proc_i)
    • recv(data, from_proc_j)
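
A minimal point-to-point sketch of these two constructs using MPI (illustrative; the payload and tag are arbitrary):

#include <stdio.h>
#include <mpi.h>

/* Point-to-point message passing: rank 0 sends, rank 1 receives.
 * Run with at least two processes, e.g.  mpirun -np 2 ./sendrecv  */
int main(int argc, char *argv[]) {
  MPI_Init(&argc, &argv);

  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  int data = 0;
  if (rank == 0) {
    data = 123;                                /* arbitrary payload */
    MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
  } else if (rank == 1) {
    MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("rank 1 received %d from rank 0\n", data);
  }

  MPI_Finalize();
  return 0;
}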

common network topologies

network properties


  • Diameter
    • Maximum distance (number of hops) between any two nodes
  • Connectivity
    • Minimum number of links that must be removed to disconnect a node
  • Bisection width
    • Minimum number of links that must be cut to split the network into two equal halves
  • Cost
    • Total number of links

network properties

Network   | Nodes   | Cost              | Diameter   | Bisection Width
1-D mesh  | \(k\)   | \(k-1\)           | \(k-1\)    | \(1\)
2-D mesh  | \(k^2\) | \(2k(k-1)\)       | \(2(k-1)\) | \(k\)
3-D mesh  | \(k^3\) | \(3k^2(k-1)\)     | \(3(k-1)\) | \(k^2\)
n-D mesh  | \(k^n\) | \(nk^{n-1}(k-1)\) | \(n(k-1)\) | \(k^{n-1}\)
1-D torus | \(k\)   | \(k\)             | \(k/2\)    | \(2\)
2-D torus | \(k^2\) | \(2k^2\)          | \(k\)      | \(2k\)
3-D torus | \(k^3\) | \(3k^3\)          | \(3k/2\)   | \(2k^2\)
n-D torus | \(k^n\) | \(nk^n\)          | \(nk/2\)   | \(2k^{n-1}\)
hypercube | \(2^k\) | \(k\,2^{k-1}\)    | \(k\)      | \(2^{k-1}\)
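
As a quick sanity check of the table (a worked example, not from the slides), take a hypercube with \(k = 3\), i.e. an ordinary cube:

\[
\text{nodes} = 2^3 = 8, \qquad
\text{cost} = 3 \cdot 2^{2} = 12 \text{ links}, \qquad
\text{diameter} = 3, \qquad
\text{bisection width} = 2^{2} = 4.
\]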

MPI

building & running

hello mpi

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
  MPI_Init(&argc, &argv);

  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  printf ("Hello from process % of %d\n", rank, size);

  MPI_Finalize();
  return 0;
}

compiling & running

$ mpicc -o hello hello_mpi.c 
$ mpirun -np 8 ./hello
$
$ sbatch batch.sh      # submit the SLURM batch script (batch.sh) shown below
#!/bin/bash
#SBATCH -J hello        # Job name
#SBATCH -o hello.o%j    # name of output file (%j expands to jobID)
#SBATCH -N 1 -n 16      # total number of mpi tasks requested
#SBATCH -p normal       # Queue name --- normal, development, etc.
#SBATCH -t 00:30:00     # run time (hh:mm:ss)
#SBATCH -A TG-CDA150001 # account 

set -x                 # echo commands

cd $HOME/hello

ibrun tacc_affinity ./hello

assignment 0

readings