Cannon's Algorithm

Now that we have covered the basic concepts of OpenMP and MPI, it's time to move on to more advanced concepts. From here on we will work with example problems and consider the best approach to implementing them using MPI and OpenMP. From this stage on (I will not explicitly mention it for subsequent assignments), you should test and demonstrate the correctness of your code as well as its performance and scalability, both strong and weak. In all cases, aim to use at least 8 nodes of Kingspeak. When running with both MPI and OpenMP, launch one task per socket of Kingspeak and use OpenMP to parallelize within the socket. This means you will have 2 MPI tasks per node and OMP_NUM_THREADS=14 per socket. Make sure your job script and environment variables set up your environment correctly for this configuration. Read the SLURM documentation on CHPC and use the Canvas discussion in case of any problems with this. Submit a report and your codes.
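To make the expected setup concrete, here is a minimal sketch of a job script for this configuration. The partition and account names, the executable name, and the mpirun mapping flag (which assumes Open MPI; srun needs its own binding options) are placeholders you should adapt to your CHPC environment,

#!/bin/bash
#SBATCH --nodes=8               # at least 8 Kingspeak nodes
#SBATCH --ntasks-per-node=2     # one MPI task per socket
#SBATCH --cpus-per-task=14      # 14 cores per socket for OpenMP
#SBATCH --partition=kingspeak   # placeholder; check the CHPC documentation
#SBATCH --account=your-account  # placeholder
#SBATCH --time=00:30:00

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # 14 threads per MPI task
mpirun -np $SLURM_NTASKS --map-by socket ./cannon   # ./cannon is a placeholder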

Matrix Multiplication

Our problem for this week will be to efficiently implement Cannon's Matrix Multiplication. For this task, we want to multiply two matrices $A$ and $B$ to produce a new matrix $C = AB$. The matrices can be of arbitrary sizes as long as their inner dimensions match, i.e., if $A$ is $m \times n$, then $B$ should be of size $n \times k$ to produce $C$ of size $m \times k$. Assume that the number of processes, $p$, can be factorized into a square grid of processes, $p = q \times q$. All matrices should be blocked into $q \times q$ blocks. The square grid simplifies things as the local products will be possible, i.e., $C_{ij} = \sum_{k=0}^{q-1} A_{ik} B_{kj}$, where the indices refer to blocks. The general idea of the algorithm is shown in the figure below.

[Figure: Cannon's Algorithm]

Now, let us look at the general structure of the algorithm and figure out the general strategy for coding this up. I will assume $p = 16$ with $q = 4$.

  1. The initial distribution of data is the natural one: process $(i,j)$ of the grid holds the blocks $A_{ij}$ and $B_{ij}$.

  2. We first shift data to prepare for the algorithm. Recall that we need to,

    1. shift row $i$ of $A$ circularly by $i$ elements to the left,
    2. shift column $j$ of $B$ circularly by $j$ elements up.

    This should result in the following layout: process $(i,j)$ now holds $A_{i,(i+j) \bmod q}$ and $B_{(i+j) \bmod q,\,j}$, so every process starts with a pair of blocks it can multiply directly.

  3. Now, we can start the actual multiplication, which will take $q$ rounds: in each round we multiply the local blocks and accumulate the result into the block of $C$ belonging to the process. Remember that only blocks of $A$ and $B$ are communicated; the blocks of $C$ do not move. The multiplication of the local blocks of $A$ and $B$ is itself a matrix multiplication, and you should use OpenMP to parallelize this step (see the sketch after this list). In pseudocode,
 
C += A * B;                          // multiply the initially aligned blocks
for (int k = 1; k < q; ++k) {
    // Circularly shift A one step left and B one step up. A naive blocking
    // recv/send pair here deadlocks; MPI_Sendrecv_replace does it safely.
    sendrecv_replace ( A, left_proc /*dest*/, right_proc /*source*/ );
    sendrecv_replace ( B, top_proc /*dest*/, bottom_proc /*source*/ );
    C += A * B;                      // local block product: use OpenMP here
}
  4. Remember to return the blocks of $A$ and $B$ to their original positions once done. This should ideally be done as part of the last exchange.
  5. Compare a pure MPI version (28 tasks per node) with the hybrid version (2 MPI tasks × 14 OpenMP threads per node). Which version is faster? Which scales better? Why?
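Since the local block product is itself a dense matrix multiplication, here is a minimal sketch of the OpenMP parallelization mentioned in step 3. The function name, the nb-by-nb row-major block layout, and the collapse clause are assumptions for illustration, not requirements,

#include <omp.h>

/* Accumulate C += A * B for nb x nb row-major blocks.
   A simple triple loop; a tiled kernel or a vendor BLAS will be faster. */
void local_block_multiply(int nb, const double *A, const double *B, double *C)
{
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < nb; ++i)
        for (int j = 0; j < nb; ++j) {
            double sum = 0.0;
            for (int k = 0; k < nb; ++k)
                sum += A[i*nb + k] * B[k*nb + j];
            C[i*nb + j] += sum;
        }
}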

MPI Communicators

For implementing the matrix multiplication, we need to communicate with processes either along the same row or column of the process grid. Recall that in MPI we identify processes using a rank, which is a 1D concept. Now, it's not terribly complex to map the 1D rank to a 2D rank of the form $(i,j)$ that will be helpful for us, but MPI has built-in functionality for imposing a virtual topology on the active processes. For this particular case we would like to use the Cartesian topology, which we create using the following function,

 
int MPI_Cart_create(MPI_Comm comm_old, int ndims, const int dims[],
                    const int periods[], int reorder, MPI_Comm *comm_cart);

Here ndims is the number of dimensions of the Cartesian grid; clearly 2 and 3 are the most common. We will use ndims=2 for our problem. The dims array should be of length ndims and contains the number of processes along each dimension. The product of dims should be less than or equal to the size of comm_old. If the product of dims is less than the size of comm_old, then some processes will be idle in the new Cartesian communicator, comm_cart. The periods argument, also of length ndims, is set to 1 or 0 depending on whether periodicity (or cyclic connectivity) is desired along each dimension. The reorder argument determines whether the ranking may be reordered. This allows for topology mapping and can improve performance if allowed; it also means that processes might have a new rank in the new communicator. While not a big concern for us, this can be an issue if this is part of a larger program, where reordering ranks might require us to move data. Here is a simple example for our problem,

 
int dims[2] = {4,4};
int periods[2] = {1,1};
MPI_Comm cart_comm;
MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart_comm);

In our case, creating the dims array is easy, as we require it to be square. But for more generic problems, MPI can also help you fill in the values using MPI_Dims_create, which creates a dims array for MPI_Cart_create given the number of nodes (MPI tasks) in the grid and the number of dimensions.
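For instance, a minimal sketch, assuming we want MPI to factorize the communicator size into a 2D grid,

int size, dims[2] = {0, 0};        /* zero entries let MPI choose the factors */
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Dims_create(size, 2, dims);    /* e.g., size = 16 yields dims = {4, 4} */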

Once we have created our Cartesian communicator, we can use it to do our left and up shifts. Here are the useful functions to keep in mind,

  1. The most useful one first: MPI_Cart_shift. This computes the shifted source and destination ranks. We will look at an example soon.
  2. Translate coordinates into a rank, MPI_Cart_rank. This converts our ndims coordinates to the 1D rank of the process.
  3. Translate a rank to coordinates, MPI_Cart_coords. This is the exact inverse of MPI_Cart_rank; a short example of both follows after this list.
  4. Other useful functions: MPI_Cartdim_get, MPI_Cart_get.
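As a quick illustration of the two translation functions, here is a sketch that reuses the cart_comm created above,

int rank, same_rank, coords[2];
MPI_Comm_rank(cart_comm, &rank);              /* rank in the Cartesian communicator */
MPI_Cart_coords(cart_comm, rank, 2, coords);  /* rank -> (row, col) */
MPI_Cart_rank(cart_comm, coords, &same_rank); /* (row, col) -> rank; same_rank == rank */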

As mentioned, the most relevant function is MPI_Cart_shift. Let's look at this closely,

 
int MPI_Cart_shift(MPI_Comm comm, int direction, int displacement, 
                   int *rank_source, int *rank_dest);

The most important arguments are the direction and the displacement. These allow us to specify whether the shifts are along the rows or the columns (direction = 0 or 1) and the amount of the shift: a uniform 1 (or -1) during the multiplication rounds, or variable, depending on the coordinates, for the initial shifts. Note that the rank returned might be MPI_PROC_NULL, although this should not happen if we are using a periodic grid. We will look at MPI_PROC_NULL in a future assignment when we use finite differences to solve the Poisson equation.

The function returns two values, rank_source and rank_dest. These are the values you will use in your send and recv calls. Consider a row of processes $0, 1, \ldots, q-1$. If we consider process $i$ and a displacement of $+1$, then the source will be process $i-1$ and the destination process $i+1$ (wrapping around circularly on a periodic grid). Flipping the sign of the displacement simply swaps the source and destination.
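Putting this together, here is a minimal sketch of the one-step left shift of a block of A during the multiplication rounds. Ablock, nb, and the tag 0 are placeholders, and whether direction 1 actually selects the columns is one of the questions below, so verify it yourself,

int src, dst;
/* Displacement -1 along direction 1: the destination is the neighbor at the
   lower coordinate (our "left"), the source the one at the higher coordinate. */
MPI_Cart_shift(cart_comm, 1, -1, &src, &dst);

/* Send our block to the left and receive the right neighbor's block in place;
   this is the deadlock-free exchange suggested in the pseudocode above. */
MPI_Sendrecv_replace(Ablock, nb*nb, MPI_DOUBLE,
                     dst, 0, src, 0, cart_comm, MPI_STATUS_IGNORE);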

I would suggest you experiment with these parameters to understand how they behave. Here are some questions that you should work through to understand what MPI is doing; a small test program along these lines is sketched after the list.

  1. Which value of direction chooses the rows, and which the columns?
  2. Are we using positive or negative values for the displacement? Does it make a difference?
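A minimal test program along these lines (a sketch; run it on 16 ranks and vary the direction and displacement arguments of MPI_Cart_shift),

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int dims[2] = {4, 4}, periods[2] = {1, 1};
    MPI_Comm cart_comm;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart_comm);

    int rank, coords[2], src, dst;
    MPI_Comm_rank(cart_comm, &rank);
    MPI_Cart_coords(cart_comm, rank, 2, coords);

    /* Vary direction (0 or 1) and displacement (+1 or -1) and compare the
       printed neighbors against your expectations. */
    MPI_Cart_shift(cart_comm, 0, 1, &src, &dst);
    printf("rank %d at (%d,%d): source %d, dest %d\n",
           rank, coords[0], coords[1], src, dst);

    MPI_Finalize();
    return 0;
}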