Divide & Conquer

Last time we focused primarily on OpenMP and covered p2p communication in MPI. Today we will look at some more advanced OpenMP directives. We will switch to MPI in the next assignment. Our primary focus will be on divide and conquer style parallelism using omp task. We will also revisit several of the synchronization constructs we discussed in class last time along with illustrative examples.

tasks

Remember how we talked about parallel programming constructs?

Tasks are primarily useful for divide & conquer style algorithms defined in a recursive fashion. At first glance, tasks will look very similar to the sections/section construct we covered last time, but there are subtle differences that are important to know. Tasks, like sections, allow task-based division of work across threads. Look at this thread for a very nice and detailed discussion on the differences between tasks and sections.

Tasks are generated dynamically in recursive structures or while loops. When a thread encounters a task construct, it may choose to execute the task immediately or defer its execution until a later time. If task execution is deferred, the task is placed in a pool of tasks, and the thread that eventually executes the task might be different from the one that encountered it. Like parallel for, tasks can take the standard data attribute clauses: shared, private, firstprivate, etc. Note that since there is no implicit barrier at the end of a task, there is a corresponding taskwait directive that waits on the completion of child tasks generated since the beginning of the current task. These concepts are best illustrated with a simple example. Let us consider the Fibonacci series; this is not the best way to generate Fibonacci numbers in parallel, but it is good for illustration.
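A minimal sketch of this (variable and function names are illustrative):

    #include <stdio.h>
    #include <omp.h>

    long fib(int n) {
      long x, y;
      if (n < 2) return n;

      #pragma omp task shared(x)   // child task computes fib(n-1)
      x = fib(n - 1);

      #pragma omp task shared(y)   // child task computes fib(n-2)
      y = fib(n - 2);

      #pragma omp taskwait         // wait for both child tasks before combining
      return x + y;
    }

    int main(void) {
      int n = 10;
      long result;

      #pragma omp parallel shared(result)
      {
        #pragma omp single         // a single thread launches the recursion
        result = fib(n);
      }

      printf("fib(%d) = %ld\n", n, result);
      return 0;
    }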

Note how we first write the function in terms of the divide and conquer (recursion) strategy, carefully specifying the task(s) being created. Then, we use omp parallel and omp single to launch the recursion. Also note the use of taskwait in the above example, as we need the values of fib(n-1) and fib(n-2) before we can compute fib(n). This only waits on the tasks created at this level, but of course the nested tasks have their own taskwait(s), making this work properly. Let's look at computing the Fibonacci number for n = 5; we can write this as a tree and look at the number of tasks being created,

fib(5)
  fib(4)
    fib(3)
      fib(2)
        fib(1)
        fib(0)
      fib(1)
    fib(2)
      fib(1)
      fib(0)
  fib(3)
    fib(2)
      fib(1)
      fib(0)
    fib(1)

As one can see, this recursion produces a lot of tasks, which is inefficient: more time is spent creating tasks than doing actual work. Let's see if this can be rectified. We don't need to make any changes to the calling function; only fib() needs to be modified. The first option is to use the if clause to limit when a task is created.
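Something along these lines should work; the cutoff of 20 is an illustrative choice, and the driver from above is unchanged:

    long fib(int n) {
      long x, y;
      if (n < 2) return n;

      // only defer tasks for large sub-problems; below the cutoff the task is
      // executed immediately (undeferred) by the encountering thread
      #pragma omp task shared(x) if(n > 20)
      x = fib(n - 1);

      #pragma omp task shared(y) if(n > 20)
      y = fib(n - 2);

      #pragma omp taskwait
      return x + y;
    }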

But even in this case, we note that at the upper levels we create two tasks while the parent task only waits and adds two numbers. This can be rectified by creating only one task and having the parent compute the other branch itself.
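A sketch of this variant (again with an illustrative cutoff):

    long fib(int n) {
      long x, y;
      if (n < 2) return n;

      // create a single child task for one branch ...
      #pragma omp task shared(x) if(n > 20)
      x = fib(n - 1);

      // ... and let the parent compute the other branch itself
      y = fib(n - 2);

      #pragma omp taskwait         // wait only for the single child task
      return x + y;
    }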

Task 1: Implement a parallel quicksort using OpenMP tasks. You can use templates for this assignment if you want, or work with your earlier code. Your primary goal is to get good scalability on Kingspeak, so tailor your recursion in a clever fashion. Report your strategy and plot strong and weak scalability using a grain size of 1 million long integers. For strong scalability, work with a problem size of num_cores*1 million. Would a parallelization using parallel for be easier? Faster?

nowait, barrier

Just as we needed taskwait to synchronize tasks, for the work-sharing constructs that have implicit barriers we have the nowait clause, which avoids the implicit barrier. This is most commonly used with sections and for. Note that this only makes sense when there are multiple work-sharing directives within a single parallel region; nowait does not avoid the join at the end of the parallel region. In these and other cases, one can explicitly force a barrier using the barrier directive.

Here are some examples of the nowait clause.
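For instance, the following sketch (array names and sizes are illustrative) runs two independent loops back to back without intermediate barriers, and then forces an explicit barrier before a loop that depends on both:

    #include <stdio.h>
    #include <omp.h>

    #define N 1000

    int main(void) {
      static float a[N], b[N], c[N];

      #pragma omp parallel
      {
        // the two loops are independent, so a thread finishing its share of
        // the first can start on the second without waiting at a barrier
        #pragma omp for nowait
        for (int i = 0; i < N; i++) a[i] = 2.0f * i;

        #pragma omp for nowait
        for (int i = 0; i < N; i++) b[i] = i + 1.0f;

        // the last loop reads a and b, so all previous work must be complete
        #pragma omp barrier

        #pragma omp for
        for (int i = 0; i < N; i++) c[i] = a[i] + b[i];
      }  // nowait does not remove the join at the end of the parallel region

      printf("c[%d] = %f\n", N - 1, c[N - 1]);
      return 0;
    }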

Task 2: Does the following parallel code work correctly? If yes, why? If no, can you fix it? It would be helpful to consider the data accesses as concurrent/exclusive reads and writes (C/E R/W).

single, master, critical

We already discussed the single construct in class. Since we potentially have multiple threads executing a parallel region, if we want to perform a sequential operation in that region, say for example I/O, we can use the single construct. Only one (non-deterministically chosen) thread will execute the code block following the construct; all others will skip this block and wait at the implicit barrier at the end of the construct. Note that the nowait clause can be used to avoid this synchronization.
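For example (a sketch):

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
      #pragma omp parallel
      {
        #pragma omp single   // one (arbitrary) thread performs the I/O
        printf("running with %d threads\n", omp_get_num_threads());

        // all threads wait at the implicit barrier of single before continuing;
        // with "#pragma omp single nowait" they would proceed immediately
      }
      return 0;
    }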

The master directive is similar to the single construct except that there is no implicit barrier and the block is always executed by the master thread. Its behavior is similar to the omp single nowait directive, but it is usually faster.
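For example (a sketch; the explicit barrier is needed only because the other threads read what the master wrote):

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
      int flag = 0;

      #pragma omp parallel shared(flag)
      {
        #pragma omp master      // always thread 0, and no implicit barrier
        flag = 1;

        #pragma omp barrier     // make the update visible to all threads

        printf("thread %d sees flag = %d\n", omp_get_thread_num(), flag);
      }
      return 0;
    }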

A similar situation is where we wish every thread to execute some piece of code, but we would like them to do so only one at a time, say for a concurrent write to some shared data structure. This can be achieved using the critical construct. Here is a simple example,
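(A sketch; the per-thread work is just a stand-in.)

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
      int total = 0;

      #pragma omp parallel
      {
        int my_work = omp_get_thread_num() + 1;   // stand-in for per-thread work

        #pragma omp critical    // one thread at a time updates the shared variable
        total += my_work;
      }

      printf("total = %d\n", total);
      return 0;
    }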

locks

For situations where we need more fine-grained control over synchronization and resource sharing, we can use locks. In the simplest case, they function similarly to omp critical, but with explicit lock and unlock operations. This also enables us to, for example, do other useful work while waiting for a lock. Let's first look at a simple example providing functionality similar to that offered by critical,
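(A sketch mirroring the critical example above.)

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
      omp_lock_t lock;            // 1. shared lock variable
      int total = 0;

      omp_init_lock(&lock);       // 2. initialize the lock

      #pragma omp parallel shared(lock, total)
      {
        int my_work = omp_get_thread_num() + 1;

        omp_set_lock(&lock);      // 3. acquire the lock (blocks until available)
        total += my_work;         //    the critical work
        omp_unset_lock(&lock);    // 4. release the lock
      }

      omp_destroy_lock(&lock);    // 5. free the lock
      printf("total = %d\n", total);
      return 0;
    }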

Note that the code is more verbose in this case. We have to,

  1. declare a shared lock variable (of type omp_lock_t),
  2. initialize the lock using omp_init_lock,
  3. acquire the lock using omp_set_lock,
  4. once the critical work is done, release the lock using omp_unset_lock, and
  5. finally free the lock using omp_destroy_lock.

Indeed, this is more complex than using critical, but it does offer more flexibility. Specifically, locks allow us to perform other work while waiting for the lock. Here is an example,
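(A sketch; other_work() is a placeholder for whatever useful non-critical work is available.)

    #include <stdio.h>
    #include <omp.h>

    void other_work(void) { /* placeholder for useful non-critical work */ }

    int main(void) {
      omp_lock_t lock;
      int total = 0;

      omp_init_lock(&lock);

      #pragma omp parallel shared(lock, total)
      {
        int my_work = omp_get_thread_num() + 1;
        int done = 0;

        while (!done) {
          if (omp_test_lock(&lock)) {   // non-blocking attempt to acquire the lock
            total += my_work;           // the critical work
            omp_unset_lock(&lock);
            done = 1;
          } else {
            other_work();               // lock is busy: do something useful instead
          }
        }
      }

      omp_destroy_lock(&lock);
      printf("total = %d\n", total);
      return 0;
    }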

In this case we use omp_test_lock instead of omp_set_lock. This function attempts to set the lock without blocking: if the lock is already set by another thread, it returns 0; if it managed to set the lock, it returns a non-zero value. We use this property within a while loop to do other useful work while we wait for the lock. Once the lock is acquired, we do the critical task. This is a very powerful concept; the same basic idea will be used later when we overlap computation and communication in MPI.

reduce

The reduction clause causes a reduction on the variables that appear in its list. The clause follows a work-sharing construct such as parallel for. Variables in the list must be shared in the enclosing context. A private copy of each list variable is created for each thread. At the end of the construct, the reduction operator is applied to all the private copies, and the final result is written to the global shared variable. The clause also takes the operator as an argument; the supported (associative) operators are listed below. Here is a simple example,
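(A sketch; the loop body is just a stand-in computation.)

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
      const int n = 1000000;
      double result = 0.0;

      // each thread gets a private copy of result, initialized to 0;
      // the private copies are combined with + at the end of the loop
      #pragma omp parallel for reduction(+:result)
      for (int i = 0; i < n; i++)
        result += 1.0 / (double)(i + 1);

      printf("result = %f\n", result);
      return 0;
    }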

Here we use the addition operator + on result. The other operators that can be used are,

Operator    Initial Value
+           0
*           1
-           0
&           ~0
|           0
^           0
&&          1
||          0

Recent implementations should also support min and max operators for the reduction.

Task 3: Implement an in-place generic scan operation in C/C++ using OpenMP (no MPI). The signature of your function should be similar to

You may change the interface to improve performance if you wish, e.g. you might want to specify the operator as an in-place operator. Give an example in which each element of the input array X is a three-dimensional double-precision vector and the binary operator is vector addition. Your code should compile using GNU or Intel compilers on x86 platforms. Please report wall-clock times for summing (a) 1D and (b) 3D double-precision vectors. The input array should be 300M keys long. Write a driver routine called scan that takes one argument, the size of the array to be scanned (initialized randomly). In the write-up, give pseudocode for your algorithm, and report wall-clock time/core/n results for n = 1M, 10M, 100M elements, using up to one node on Kingspeak (no MPI). Your goal is to get it to work correctly for arbitrary n and p (the number of threads). Aim for the best parallel performance compared to the sequential case. Report both strong and weak scalability.