Hari Sundar


CS 6230: Parallel Computing and High Performance Computing

Assignment 3

Due Feb 24, 11:59pm

Please submit a single PDF and a single tgz, named uid.pdf and uid.tgz respectively, where uid is your university ID: a lowercase u followed by 7 digits. Submit/upload via Canvas.

  1. Reading: MPI Tutorial
  2. Using the work-depth language, give an algorithm for computing the median of an unsorted array of \(n\) integers. What are the work and depth of your algorithm? Is your algorithm work optimal? Suggest how you would find the \(k\)-ranked element in an unsorted array (hint: the median is the \(k = n/2\)-ranked element).
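     As a point of reference for work optimality (a baseline, not the intended answer): sorting the array with a work-efficient parallel sort such as Cole's merge sort and reading off the middle element has
     \[ W(n) = O(n \log n), \qquad D(n) = O(\log n), \]
     whereas sequential selection runs in \(O(n)\) time, so this sorting baseline is not work optimal.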
  3. Generalize the hypercube reduction algorithm on a \(d\)-dimensional hypercube to the case in which the number of processors is not a power of two, for any problem size \(n \geq p\). Derive an expression for the wall-clock time \(T(n, p)\) as a function of the number of processors \(p\) and the problem size \(n\). Your estimate should include communication costs in terms of latency, bandwidth, and message size.
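     For reference, under the usual cost model (latency \(t_s\), per-word transfer time \(t_w\), per-operation compute time \(t_c\)), reducing \(n\) values on \(p = 2^d\) processors by local summation followed by the standard hypercube reduction costs roughly
     \[ T(n, p) = t_c \, \frac{n}{p} + (t_s + t_w) \log_2 p; \]
     your generalization should reduce to a comparable form when \(p\) is a power of two.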
  4. Finding MPI bugs. This archive contains the files mpi_bug1.c, mpi_bug2.c, …, mpi_bug7.c. These example codes contain bugs that result in hangs or other undesirable behavior. Find and fix these bugs. Add a short comment to each code describing what was wrong and how you fixed the problem (an example header format is shown below). Add the solutions to your submission using the naming convention mpi_solved1.c through mpi_solved7.c. Each problem should be run with 4 MPI tasks.
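     A possible header format for the fixed files (the bug/fix text here is a placeholder); compile with mpicc and run with 4 tasks, e.g. mpirun -np 4 ./mpi_solved1:

       /* mpi_solved1.c
        * Bug: (what was wrong, e.g., a receive posted with a tag that no
        *       send ever matches, so the program hangs)
        * Fix: (what was changed, e.g., made the send and receive tags agree)
        */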
  5. Programming: MPI ring communication. Write a distributed-memory program that sends an integer in a ring, starting from process 0 to 1 to 2 (and so on). The last process sends the message back to process 0. A minimal sketch appears after the list below.
    • Allow for a command-line parameter \(N\) that specifies how many times the message is sent around the ring.
    • Start by sending the integer 0 and let every process add its rank to the integer before sending it on. Use the result after \(N\) loops to check that all processes properly added their contribution each time they received and sent the message.
    • Time your program for a large \(N\) and estimate the latency on your system (i.e., the time used for each communication). If possible, test your communication ring on more than one machine, so that communication must go through the network. Note that if you use MPI on a single processor with multiple cores, the available memory is logically distributed, but messages are not actually sent through a network. (How sends and receives of data stored in the same physical memory are handled depends on the MPI implementation.)
    • Modify your code so that instead of a single integer, a large array is sent around the ring. Time the communication and use these timings to estimate the bandwidth of your system (i.e., the amount of data that can be communicated per second).
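     A minimal sketch of the single-integer ring, assuming blocking MPI_Send/MPI_Recv (timing and the array variant are left to you; run with at least 2 ranks):

       #include <stdio.h>
       #include <stdlib.h>
       #include <mpi.h>

       int main(int argc, char **argv) {
           MPI_Init(&argc, &argv);
           int rank, p;
           MPI_Comm_rank(MPI_COMM_WORLD, &rank);
           MPI_Comm_size(MPI_COMM_WORLD, &p);
           int N = (argc > 1) ? atoi(argv[1]) : 1;   /* trips around the ring */
           int next = (rank + 1) % p, prev = (rank + p - 1) % p;
           int msg = 0;
           for (int i = 0; i < N; i++) {
               if (rank == 0) {                      /* rank 0 starts each trip */
                   msg += rank;                      /* contributes 0 */
                   MPI_Send(&msg, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
                   MPI_Recv(&msg, 1, MPI_INT, prev, 0, MPI_COMM_WORLD,
                            MPI_STATUS_IGNORE);
               } else {                              /* everyone else relays */
                   MPI_Recv(&msg, 1, MPI_INT, prev, 0, MPI_COMM_WORLD,
                            MPI_STATUS_IGNORE);
                   msg += rank;
                   MPI_Send(&msg, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
               }
           }
           /* After N trips every rank has added itself N times, so the
              expected value is N * (0 + 1 + ... + (p-1)) = N*p*(p-1)/2. */
           if (rank == 0)
               printf("result %d (expected %d)\n", msg, N * p * (p - 1) / 2);
           MPI_Finalize();
           return 0;
       }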
  6. Programming: Implement an MPI+OpenMP parallel binary search algorithm that has a signature similar to the ANSI C bsearch() routine but supports both OpenMP and MPI parallelism and multiple keys. To this end, it should have three additional arguments (compared to the standard bsearch): the number of keys, the MPI communicator, and the number of threads per MPI process. A possible prototype is sketched below. Write a driver routine called search that takes one argument, the size of the array to be searched (initialized to random integers). In the write-up, give pseudocode for your algorithm, and report wall-clock time as a function of core count and problem size for n = 1M, 10M, 100M, and 1B elements (for 1 key), using up to four nodes on Tangent with 1 MPI task per socket. Report weak and strong scaling results using 1 core, 1 socket, and 1, 2, and 4 nodes.
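     One possible prototype, mirroring bsearch() plus the three extra arguments described above; the function name, parameter names, and result convention are illustrative assumptions, not requirements:

       #include <stddef.h>
       #include <mpi.h>

       /* For each of the nkeys keys, results[i] receives a pointer to the
        * matching element in the locally held portion of base, or NULL if
        * the key is not found locally. comm and nthreads carry the MPI and
        * OpenMP parallelism parameters that plain bsearch() lacks. */
       void bsearch_par(const void *keys, size_t nkeys,
                        const void *base, size_t nmemb, size_t size,
                        int (*compar)(const void *, const void *),
                        MPI_Comm comm, int nthreads,
                        void **results);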