CS 6230: Parallel Computing and High Performance Computing

Assignment 2

Due Feb 3, 11:59pm

Please submit a single pdf and a single tgz named uid.{pdf,tgz} respectively. Let the uid start with a lowercase u followed by 7 digits. Submit/upload via canvas.

Reading: First chapter of Jájá’s book.
Given two vectors in \(x, y \in \mathbb{R}^n\), we want to compute their outer product \(A = x \otimes y \in \mathbb{R}^{n×\times n}\), deﬁned by \(A_{ij} = x_iy_j\) . Give work-depth pseudocode for this problem. Derive work, depth, and parallelism as a function of \(n\).
Let \(A\) be an \(n\times n\) lower triangular matrix (i.e., \(A_{ij} = 0\) if \(j > i\)) such that \(A_{ii} \neq 0\), for \(1\leq i \leq n\), and let \(b\) be a \(n\)-dimensional vector. Consider the forward-substitution algorithm for solving \(Ax = b\) for \(x\): \[ \begin{aligned} x_1 & = \frac{1}{A_{11}}b_1, \\ x_i & = \frac{1}{A_{ii}}(b_i - \sum_{j=1}^{i-1}A_{ij} x_j ), \quad i=2,\cdots,n \end{aligned} \] Determine the work and depth of the algorithm. State a PRAM version of this algorithm and derive its time and work complexity as a function of the input size \(n\) and the number of processors \(p\). Optional: Suggest ways to improve the complexity of this algorithm.
Consider the matrix-vector multiplication problem \(y = Ax\) on a PRAM machine. For PRAM, we discussed an algorithm that uses row-wise partitioning of \(A\) and \(y\). This type of partitioning is called one-dimensional, because partitions run across one dimension of the matrix. State a PRAM algorithm that uses a two-dimensional partitioning of the matrix. That is, if we have \(p^2\) cores, we partition \(A\) into \(p\) row blocks and \(p\) column blocks and \(x, y\) in \(p\) blocks. Derive the complexity ( \(T(n, p)\) and \(W(n, p)\) ) and the speedup of the algorithm. Is your algorithm work efficient (assuming \(W_{sequential}(n) = \mathcal{O}(n^2)\) ) ?
Programming Implement an in-place generic scan operation in C/C++ using OpenMP (no MPI). The signature of your function should be similar to
```
void genericScan(void *X, size_t n, size_t l, void (*oper)(void *x1, void *x2));
```
You may change the interface to improve performance if you wish. Give an example in which each element in the input array X is a three-dimensional double precision vector and the binary operator is the vector addition. Your code should compile using GNU or Intel compilers on x86 platforms. Please report wallclock times for summing (a) 1D and (b) 3D double-precision vectors. The input array should be 300M keys long. Write a driver routine called scan that takes one argument, the size of array to be scanned (initialized randomly). In the write-up, give pseudocode for your algorithm, and report wall clock time/core/n results for n = 1M, 10M, 100M, 1B of elements, using up to one node on Tangent (no MPI). Report weak and strong scaling results.
Programming Parallelize your quicksort implementation from assignment 1 using OpenMP. In the write-up, give pseudocode for your algorithm, and report wall clock time/core/n results for n = 1M, 10M, 100M, 1B of elements, using up to one node on Tangent (no MPI). Report weak and strong scaling results.

Note Try to estimate how much memory your code will require based on the input. For example you might not be able to run the n=1B case if your implementation is not efficient. Just skip the 1B case if that happens. If you have implemented it efficiently, are you able to run n=10B?