























## Terminology Review

- device = GPU = set of multiprocessors
- Multiprocessor = set of processors & shared memory
- Kernel = GPU program
- Grid = array of thread blocks that execute a kernel
- Thread block = group of SIMD threads that execute a kernel and can communicate via shared memory

| Memory             | Location            | Cached         | Access     | Who                    |
|--------------------|---------------------|----------------|------------|------------------------|
| Local              | Off-chip            | No             | Read/write | One thread             |
| Shared             | On-chip             | N/A - resident | Read/write | All threads in a block |
| Global             | Off-chip            | No             | Read/write | All threads + host     |
| Constant           | Off-chip            | Yes            | Read       | All threads + host     |
| Texture            | Off-chip            | Yes            | Read       | All threads + host     |
|                    |                     |                |            |                        |
| Kirk/NVIDIA and We | en-mei W. Hwu, 2007 |                |            | UNIVER:                |





- Register dedicated HW single cycle •
- •
- Constant and Texture caches possibly single cycle, proportional to addresses accessed by warp Shared Memory dedicated HW single cycle if no "bank conflicts" Local Memory DRAM, no cache \*slow\* Global Memory DRAM, no cache \*slow\*
- •
- Constant Memory DRAM, no cache "slow" Constant Memory DRAM, cached, 1...10s...100s of cycles, depending on cache locality Texture Memory DRAM, cached, 1...10s...100s of cycles, depending on cache locality
- Instruction Memory (invisible) DRAM, cached

DIA and Wen-mei W. Hwu, 2007 ersity of Illinois, Urbana-Champaigr L3: Memory Hierarchy, 1

```
UNIVERSITY
OF UTAH
```

| <ul> <li>Copies from hos<br/>(possibly, conste<br/>How to use SP s</li> <li>Must construct</li> <li>How to use cons</li> <li>Read-only "ret<br/>by host</li> <li>Also, how to use</li> <li>Most locally-a</li> <li>Even array var<br/>access patter</li> <li>Can allocate "s</li> <li>Excessive use</li> <li>Local memory</li> <li>Deals with cap</li> </ul> | ct or be copied <sup>'</sup> from global memor<br>stant or texture cache<br>used" data can be placed in consta<br>e registers<br>illocated data is placed directly in<br>riables can use registers if compil | global memory<br>ry by kernel program<br>unt & texture memory<br>registers<br>er understands<br>at4<br>cal memory | c<br>o<br>• Throu<br>- Flav<br>whe<br>• Implic<br>- Dev<br>- Hos<br>- Thr | Data Placement: Syntax<br>agh type qualifiers<br>onstant,shared,local,<br>levice<br>agh cudaMemcpy calls<br>for of call and symbolic constant des<br>are to copy<br>cit default behavior<br>rice memory without qualifier is global<br>tt by default copies to global memory<br>ead-local variables go into registers<br>acity exceeded, then local memory | al memory<br>1        |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------|
| CS6963                                                                                                                                                                                                                                                                                                                                                       | L3: Memory Hierarchy, 1                                                                                                                                                                                      | UNIVERSITY<br>OF UTAH                                                                                             | CS6963                                                                    | L3: Memory Hierarchy, 1                                                                                                                                                                                                                                                                                                                                    | UNIVERSITY<br>OF UTAH |







6









7

















## CUDA Code - Kernel Execution CUDA Code - Kernel Overview Configuration // Block index // Setup the execution configuration int bx = blockIdx.x; int by = blockIdx.y; dim3 dimBlock(BLOCK SIZE, BLOCK SIZE); // Thread index int tx = threadIdx.x; int ty = threadIdx.y; dim3 dimGrid(N.width / dimBlock.x, M.height / dimBlock.y); // Pvalue stores the element of the block sub-matrix For very large N and M dimensions, one // that is computed by the thread float Pvalue = 0; will need to add another level of blocking // Loop over all the sub-matrices of M and N // required to compute the block sub-matrix for (int m = 0; m < M.width/BLOCK\_SIZE; ++m) { code from the next few slides }; and execute the second-level blocks sequentially. David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 CE 498AL, University of Illinois, Urbana-Champaign David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 498AL, University of Illinois, Urbana-Champaign L3: Memory Hierarchy, 1 UNIVERSITY

| Memory<br>// Get a pointer to the current sub-matrix<br>Matrix Msub = GetSubMatrix (M           |                       |
|-------------------------------------------------------------------------------------------------|-----------------------|
| // Get a pointer to the current sub-matrix<br>Matrix Nsub = GetSubMatrix(N                      |                       |
| shared float Ms[BLOCK_SI<br>shared float Ns[BLOCK_SI                                            |                       |
| <pre>// each thread loads one element of the su<br/>Ms[ty][tx] = GetMatrixElemen</pre>          |                       |
| <pre>// each thread loads one element of the su<br/>Ns[ty][tx] = GetMatrixElemen</pre>          |                       |
| David Kirk/NVIDIA and Wen-mei W. Hwu, 2007<br>E 498AL, University of Illinois, Urbana-Champaign | UNIVERSITY<br>OF UTAH |



## CUDA Code - Save Result

// Get a pointer to the block sub-matrix of P
Matrix Psub = GetSubMatrix(P, bx, by);

// Write the block sub-matrix to device memory; // each thread writes one element SetMatrixElement(Psub, tx, ty, Pvalue);

This code should run at about 150 Gflops on a

GTX or Tesla.

State-of-the-art mapping (in CUBLAS 3.2 on C2050) yields just above 600 Gflops. Higher on GTX480.



