



UNIVERSITY







## **Terminology Review**

- device = GPU = set of multiprocessors
  - Multiprocessor = set of processors & shared memory
- Kernel = GPU program
- Grid = array of thread blocks that execute a kernel
- Thread block = group of SIMD threads that execute a kernel and can communicate via shared memory

| Memory   | Location | Cached         | Access     | Who                    |
|----------|----------|----------------|------------|------------------------|
| Local    | Off-chip | No             | Read/write | One thread             |
| Shared   | On-chip  | N/A - resident | Read/write | All threads in a block |
| Global   | Off-chip | No             | Read/write | All threads + host     |
| Constant | Off-chip | Yes            | Read       | All threads + host     |
| Texture  | Off-chip | Yes            | Read       | All threads + host     |

## Access Times (REWRITE?)

- Register dedicated HW single cycle •
- Constant and Texture caches possibly single cycle, proportional to addresses accessed by warp
- Shared Memory dedicated HW single cycle Local Memory DRAM, no cache \*slow\*
- Global Memory DRAM, no cache \*slow\*
- Constant Memory DRAM, cached, 1...10s...100s of cycles, depending on cache locality Texture Memory DRAM, cached, 1...10s...100s of cycles, depending on cache locality
- Instruction Memory (invisible) DRAM, cached

10 L5: Memory Hierarchy 





















6





















UNIVERSITY OF UTAH



| CUDA Code - Kernel Overview                                                                                                   | CUDA Code - Load Data to Shared<br>Memory                                                                                     |  |  |
|-------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------|--|--|
| // Block index                                                                                                                | // Get a pointer to the current sub-matrix Msub of M                                                                          |  |  |
| <pre>int bx = blockIdx.x;<br/>int by = blockIdx.y;</pre>                                                                      | <pre>Matrix Msub = GetSubMatrix(M, m, by);</pre>                                                                              |  |  |
| // Thread index                                                                                                               | // Get a pointer to the current sub-matrix Nsub of N                                                                          |  |  |
| <pre>int tx = threadIdx.x;<br/>int ty = threadIdx.y;</pre>                                                                    | <pre>Matrix Nsub = GetSubMatrix(N, bx, m);</pre>                                                                              |  |  |
| <pre>// Pvalue stores the element of the block sub-matrix // that is computed by the thread</pre>                             | shared float Ms[BLOCK_SIZE][BLOCK_SIZE];                                                                                      |  |  |
| float Pvalue = 0;                                                                                                             | shared float Ns[BLOCK_SIZE][BLOCK_SIZE];                                                                                      |  |  |
| // Loop over all the sub-matrices of M and N                                                                                  | // each thread loads one element of the sub-matrix                                                                            |  |  |
| <pre>// required to compute the block sub-matrix for (int m = 0; m &lt; M.width/BLOCK_SIZE; ++m) {</pre>                      | <pre>Ms[ty][tx] = GetMatrixElement(Msub, tx, ty);</pre>                                                                       |  |  |
| code from the next few slides };                                                                                              | // each thread loads one element of the sub-matrix                                                                            |  |  |
|                                                                                                                               | <pre>Ns[ty][tx] = GetMatrixElement(Nsub, tx, ty);</pre>                                                                       |  |  |
| © Devid Kirk/NVIDIA and Wen-mel W. Hwu, 2007 39<br>ECE 498AL, University of Illinois, Urbana-Champaign L5: Memory Hierarchy I | © David Kirk/NVIDIA and Wen-mel W. Hwu, 2007 40<br>ECE 498AL, University of Illinois, Urbana-Champaign L6: Memory Hierarchy I |  |  |







