









|      | Data Placement: Conceptual                                                                                                                                                                                                                                                                                                     |
|------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| •    | Copies from host to device go to some part of global memory<br>(possibly, constant or texture memory)                                                                                                                                                                                                                          |
| •    | How to use SP shared memory <ul> <li>Must construct or be copied from global memory by kernel program</li> </ul>                                                                                                                                                                                                               |
| •    | How to use constant or texture cache<br>– Read-only "reused" data can be placed in constant & texture memory<br>by host                                                                                                                                                                                                        |
|      | Also, how to use registers<br>- Most locally-allocated data is placed directly in registers<br>- Even array variables can use registers if compiler understands<br>access patterns<br>- Can allocate "superwords" to registers, e.g., float4<br>- Excessive use of registers will "spill" data to local memory<br>Local memory |
|      | <ul> <li>Deals with capacity limitations of registers and shared memory</li> <li>Eliminates worries about race conditions</li> <li> but SLOW</li> </ul>                                                                                                                                                                        |
| 6963 | 8<br>L5: Memory Hierarchy II UNIVERSITY<br>6 UTAH                                                                                                                                                                                                                                                                              |



























| CUDA Code - Load Data to Shared<br>Memory<br>// Get a pointer to the current sub-matrix Msub of M<br>Matrix Msub = GetSubMatrix (M, m, by); |
|---------------------------------------------------------------------------------------------------------------------------------------------|
| <pre>// Get a pointer to the current sub-matrix Nsub of N Matrix Nsub = GetSubMatrix(N, bx, m);</pre>                                       |
| shared float Ms[BLOCK_SIZE][BLOCK_SIZE];<br>shared float Ns[BLOCK_SIZE][BLOCK_SIZE];                                                        |
| <pre>// each thread loads one element of the sub-matrix Ms[ty][tx] = GetMatrixElement(Msub, tx, ty);</pre>                                  |
| <pre>// each thread loads one element of the sub-matrix<br/>Ns[ty][tx] = GetMatrixElement(Nsub, tx, ty);</pre>                              |
| © Devid Kirk/NVIDIA and Wen-mel W. Hwu, 2007 25<br>ECE 498AL, University of Illinois, Urbana-Champaign L5: Memory Hierarchy II              |



## CUDA Code - Save Result

// Get a pointer to the block sub-matrix of P
Matrix Psub = GetSubMatrix(P, bx, by);

// Write the block sub-matrix to device memory; // each thread writes one element SetMatrixElement(Psub, tx, ty, Pvalue);

This code should run at about 150 Gflops on a GTX or Tesla.

State-of-the-art mapping (in CUBLAS 2.0) yields just under 400 Gflops.

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 27 ECE 498AL, University of Illinois, Urbana-Champaign L5: Memory Hierarchy II

UNIVERSITY



## "Tiling" for Registers

- A similar technique can be used to map data to registers
- Unroll-and-jam
  - Unroll outer loops in a nest and fuse together resulting inner loops
- Equivalent to "strip-mine" followed by permutation
- Fusion safe if dependences are not reversed
- Scalar replacement

CS6963

- May be followed by replacing array references with scalar variables to help compiler identify register opportunities
- Used to be important because earlier compilers would not place array variables in registers, but not the case with nvcc compiler UNIVERSITY

L5: Memory Hierarchy II

- Overview of Texture Memory • Recall, texture cache of read-only data Special protocol for allocating and copying to GPU – texture<Type, Dim, ReadMode> texRef; • Dim: 1, 2 or 3D objects • Special protocol for accesses (macros)
- tex2D(<name>,dim1,dim2);
- In full glory can also apply functions to textures

30 L5: Memory Hierarchy II CS6963

## Using Texture Memory (simpleTexture project from SDK)

cudaMalloc( (void\*\*) &d\_data, size);

- cudaChannelFormatBesc channelDesc = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
- cudaArray\* cu\_array; cudaMallocArray( &cu array, &channelDesc, width, height );
- cudaMemcpyToArray( cu\_array, 0, 0, h\_data, size, cudaMemcpyHostToDevice);
- // set texture parameters
- tex.addressMode[0] = tex.addressMode[1] = cudaAddressModeWrap
- tex.filterMode = cudaFilterModeLinear; tex.normalized = true; cudaBindTextureToArray( tex,cu\_array, channelDesc);
- // execute the kernel
- transformKernel<<< dimGrid, dimBlock, 0 >>>( d\_data, width, height, angle);

Kernel function:

// declare texture reference for 2D float texture texture<float, 2, cudaReadModeElementType> tex;

## ... = tex2D(tex,i,j);

CS6963

31 L5: Memory Hierarchy II



CS6963









