L4: Memory Hierarchy Optimization I, Locality and Data Placement

CS6963



### Administrative

- · Next assignment on the website
  - Description at end of class
  - Due Wednesday, Feb. 17, 5PM
  - Use handin program on CADE machines
    - "handin cs6963 lab2 <probfile>"
- Mailing lists
  - cs6963s10-discussion@list.eng.utah.edu
    - Please use for all questions suitable for the whole class
    - Feel free to answer your classmates questions!
  - cs6963s10-teach@list.eng.utah.edu
     Please use for questions to Protonu and me



### Overview

- Where data can be stored
- And how to get it there
- Some guidelines for where to store data

  - Who needs to access it?Read only vs. Read/Write
- Footprint of data
- · High level description of how to write code to optimize for memory hierarchy
  - More details Wednesday and (probably) next week
- · Reading:
  - Chapter 4, Kirk and Hwu
  - http://courses.ece.illinois.edu/ece498/al/textbook/ Chapter4-CudaMemoryModel.pdf



### Targets of Memory Hierarchy **Optimizations**

- Reduce memory latency
   The latency of a memory access is the time (usually in cycles) between a memory request and its completion
- Maximize memory bandwidth
  - Bandwidth is the amount of useful data that can be retrieved over a time interval
- Manage overhead
  - Cost of performing optimization (e.g., copying) should be less than anticipated gain

L4: Memory Hierarchy I









### device = GPU = set of multiprocessors Multiprocessor = set of processors & shared memory Kernel = GPU program Grid = array of thread blocks that execute a kernel Thread block = group of SIMD threads that execute a kernel and can communicate via shared memory Location Local Off-chip Read/write One thread Shared On-chip N/A - resident All threads in a block Read/write Global Off-chip Nο Read/write All threads + host Constant Off-chip Yes Read All threads + host Off-chip Read All threads + host Texture Yes

Terminology Review

### Reuse and Locality

- · Consider how data is accessed
  - Data reuse:
    - · Same data used multiple times
    - Intrinsic in computation
  - Data locality:
    - · Data is reused and is present in "fast memory"
    - · Same data or same data transfer
- If a computation has reuse, what can we do to get locality?
  - · Appropriate data placement and layout
  - Code reordering transformations



### Access Times

- Register dedicated HW single cycle
- Constant and Texture caches possibly single cycle, proportional to addresses accessed by warp
- Shared Memory dedicated HW single cycle if no "bank conflicts"

  Local Memory DRAM, no cache \*slow\*
- Global Memory DRAM, no cache \*slow\*
- Constant Memory DRAM, rached, 1...10s...100s of cycles, depending on cache locality
  Texture Memory DRAM, cached, 1...10s...100s of cycles, depending on cache locality
- Instruction Memory (invisible) DRAM, cached



### Data Placement: Conceptual

- Copies from host to device go to some part of global memory (possibly, constant or texture memory)
- How to use SP shared memory

  Must construct or be copied from global memory by kernel program
  How to use constant or texture cache
- Read-only "reused" data can be placed in constant & texture memory by host
- Also, how to use registers
  - Most locally-allocated data is placed directly in registers
  - Most locally-allocated and as placed an interesty in registers
     Even array variables can use registers if compiler understands access patterns
     Can allocate "superwords" to registers, e.g., float4
     Excessive use of registers will "spill" data to local memory
- Local memory

   Deals with capacity limitations of registers and shared memory
   Eliminates worries about race conditions

  - ... but SLOW



### Data Placement: Syntax

- Through type qualifiers
  - \_\_constant\_\_, \_\_shared\_\_, \_\_local\_\_, \_device\_\_
- · Through cudaMemcpy calls
  - Flavor of call and symbolic constant designate where to copy
- · Implicit default behavior
  - Device memory without qualifier is global memory
  - Host by default copies to global memory
  - Thread-local variables go into registers unless capacity exceeded, then local memory



### Language Extensions: Variable Type Qualifiers

|        |          |     |              | Memory   | Scope  | Lifetime    |
|--------|----------|-----|--------------|----------|--------|-------------|
| device | _local   | int | LocalVar;    | local    | thread | thread      |
| device | _shared  | int | SharedVar;   | shared   | block  | block       |
| device |          | int | GlobalVar;   | global   | grid   | application |
| device | constant | int | ConstantVar; | constant | grid   | application |

- \_\_device\_\_ is optional when used with \_\_local\_\_, \_\_shared\_\_, or \_\_constant\_\_
- Automatic variables without any qualifier reside in a register
  - Except arrays that reside in local memory

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007





### Variable Type Restrictions

- Pointers can only point to memory allocated or declared in global memory:
  - Allocated in the host and passed to the kernel:

```
__global__ void KernelFunc(float*
ptr)
```

- Obtained as the address of a global variable: float\* ptr = &GlobalVar;

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007

4: Memory Hierarchy I



### Rest of Today's Lecture

- Mechanics of how to place data in shared memory and constant memory
- Tiling transformation to reuse data within
  - Shared memory
  - Constant cache

15 1: Memory Hierarchy I



### Constant Memory Example

- · Signal recognition:
  - Apply input signal (a vector) to a set of precomputed transform matrices
  - Compute  $M_1V$ ,  $M_2V$ , ...,  $M_nV$

```
__constant__float d_signalVector[M];
__device__float R[N][M];
__host__void outerApplySignal () {
    float *h_inputSignal;
    dim3 dimGrid(N);
    dim3 dimBlock[M];
    cudaMemcpyToSymbol (d_signalVector,
        h_inputSignal, M*sizeof(float));
    // input marix is in d_mat
    ApplySignal
    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    ApplySignal

    Appl
```



### Additional Detail

- Suppose each thread accesses different data from constant memory on same instruction
  - Reuse across threads?
    - · Consider capacity of constant cache and locality
    - · Code transformation needed? (later in lecture)
    - Cache latency proportional to number of accesses in a warp
  - No reuse?
    - Should not be in constant memory.

18
L4: Memory Hierarchy I UNIVERSITY OF UTAH



### Mechanics of Using Shared Memory • \_\_shared\_\_ type qualifier required · Must be allocated from global/device function, or as "extern" Examples: extern \_\_shared\_\_ float d\_s\_array[]; \_\_global\_\_ void compute2() { \_\_shared\_\_ float d\_s\_array[M]; /\* a form of dynamic allocation \*/ /\* MEMSIZE is size of per-block \*/ /\* shared memory\*/ /\* create or copy from global memory \*/ d\_s\_array[j] = ...; \_\_host\_\_ void outerCompute() { compute<<<gs,bs>>>(); /\* write result back to global memory \*/ d\_g\_array[j] = d\_s\_array[j]; global void compute() { d\_s\_array[i] = ...;



### Reuse and Locality

- · Consider how data is accessed
  - Data reuse:
    - · Same data used multiple times
    - Intrinsic in computation
  - Data locality:
    - Data is reused and is present in "fast memory"
    - Same data or same data transfer
- If a computation has reuse, what can we do to get locality?
  - · Appropriate data placement and layout
  - Code reordering transformations

3 22 UNIVERSI L4: Memory Hierarchy I OF UTAH

### Temporal Reuse in Sequential Code

• Same data used in distinct iterations I and T'

for (i=1; i<N; i++)
for (j=1; j<N; j++)
A[j]= A[j]+A[j+1]+A[j-1]

• A[j] has self-temporal reuse in loop i

CSEGES

23 14: Memory Hierarchy



### Spatial Reuse (Ignore for now)

 Same data transfer (usually cache line) used in distinct iterations I and I'

for (i=1; i<N; i++)
for (j=1; j<N; j++)
A[j]= A[j]+A[j+1]+A[j-1];

- · A[j] has self-spatial reuse in loop j
- Multi-dimensional array note: C arrays are stored in row-major order

CS6963

24 4: Memory Hierarchy I



### Group Reuse

• Same data used by distinct references

```
for (i=1; i<N; i++)
  for (j=1; j<N; j++)
    A[j]= A[j]+A[j+1]+A[j-1];</pre>
```

\* A[j],A[j+1] and A[j-1] have group reuse (spatial and temporal) in loop j

CS6963

25 4: Memory Hierarchy I UNIVERSITY

### Can Use Reordering Transformations!

- · Analyze reuse in computation
- Apply loop reordering transformations to improve locality based on reuse
- With any loop reordering transformation, always ask
  - Safety? (doesn't reverse dependences)
  - Profitablity? (improves locality)

26 L4: Memory Hier 1 4

UNIVERSI OF UTAH





## Tiling (Blocking): Another Loop Reordering Transformation • Blocking reorders loop iterations to bring iterations that reuse data closer in time



### Legality of Tiling

- Tiling = strip-mine and permutation
  - -Strip-mine does not reorder iterations
  - -Permutation must be legal OR
  - strip size less than dependence distance

CS696

31 4: Memory Hierarchy



### A Few Words On Tiling

- Tiling can be used hierarchically to compute partial results on a block of data wherever there are capacity limitations
  - Between grids if total data exceeds global memory capacity
  - Across thread blocks if shared data exceeds shared memory capacity (also to partition computation across blocks and threads)
  - Within threads if data in constant cache exceeds cache capacity or data in registers exceeds register capacity or (as in example) data in shared memory for block still exceeds shared memory capacity

CS6963

32 4: Memory Hierarchy







### Shared Memory Usage

- · Assume each SMP has 16KB shared memory
  - Each Thread Block uses 2\*256\*4B = 2KB of shared memory.
  - Can potentially have up to 8 Thread Blocks actively executing
  - For BLOCK\_SIZE = 16, this allows up to 8\*512 = 4,096 pending loads
    - In practice, there will probably be up to half of this due to scheduling to make use of SPs.
  - The next BLOCK\_SIZE 32 would lead to 2\*32\*32\*4B= 8KB shared memory usage per Thread Block, allowing only up to two Thread Blocks active at the same time



### First-order Size Considerations

- Each Thread Block should have a minimal of 192 threads
   BLOCK\_SIZE of 16 gives 16\*16 = 256 threads
- A minimal of 32 Thread Blocks
   A 1024\*1024 P Matrix gives 64\*64 = 4096 Thread Blocks
- Each thread block performs 2\*256 = 512 float loads from global memory for 256 \* (2\*16) = 8,192 mul/add operations.
  - Memory bandwidth no longer a limiting factor

36 L4: Memory Hierarchy I



### CUDA Code - Kernel Execution Configuration

For very large N and M dimensions, one will need to add another level of blocking and execute the second-level blocks sequentially.

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007

4: Memory Hierarchy I



### CUDA Code - Kernel Overview

```
// Block index
int bx = blockIdx.x;
int by = blockIdx.y;
// Thread index
int tx = threadIdx.x;
int ty = threadIdx.y;

// Pvalue stores the element of the block sub-matrix
// that is computed by the thread
float Pvalue = 0;

// Loop over all the sub-matrices of M and N
// required to compute the block sub-matrix
for (int m = 0; m < M.width/BLOCK_SIZE; ++m) {
    code from the next few slides };</pre>
```

4: Memory Hierarchy I



### CUDA Code - Load Data to Shared Memory

```
Memory

// Get a pointer to the current sub-matrix Msub of M

Matrix Msub = GetSubMatrix(M, m, by);

// Get a pointer to the current sub-matrix Nsub of N

Matrix Nsub = GetSubMatrix(N, bx, m);

__shared__ float Ms[BLOCK_SIZE][BLOCK_SIZE];
__shared__ float Ns[BLOCK_SIZE][BLOCK_SIZE];

// each thread loads one element of the sub-matrix

Ms[ty][tx] = GetMatrixElement(Msub, tx, ty);

// each thread loads one element of the sub-matrix

Ns[ty][tx] = GetMatrixElement(Nsub, tx, ty);
```

David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 EE 498AL, University of Illinois, Urbana-Champaign 39 L4: Memory Hierarchy



### CUDA Code - Compute Result

David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 40 L4: Memory Hierarchy I



### // Get a pointer to the block sub-matrix of P Matrix Psub = GetSubMatrix(P, bx, by); // Write the block sub-matrix to device memory; // each thread writes one element SetMatrixElement(Psub, tx, ty, Pvalue); This code should run at about 150 Gflops on a GTX or Tesla. State-of-the-art mapping (in CUBLAS 2.0) yields just under 400 Gflops.

### Matrix Multiply in CUDA

- Imagine you want to compute extremely large matrices.
  - That don't fit in global memory
- This is where an additional level of tiling could be used, between grids

6963 42 L4: Memory Hierarchy I UTIAH OF UTAH





# General Approach 0. Provided a. Input file b. Sample output file c. CPU implementation 1. Structure a. Compare CPU version and GPU version output [compareInt from L3, slide 30] b. Time performance of two GPU versions (see 2 & 3 below) [see timing construct from L2, p. 9] 2. GPU version 1 (partial credit if correct) implementation using global memory 3. GPU version 2 (highest points to best performing versions) use memory hierarchy optimizations from this and next 2 lectures Handin using the following on CADE machines, where probfile includes all files "handin cs6963 lab2 <probfile>" LINIVERSITY OF UTAH

### Summary of Lecture

- How to place data in constant memory and shared memory
- Reordering transformations to improve locality
- Tiling transformation
- Matrix multiply example

6963 46 L4: Memory Hierarchy I

### THE

### Next Time

- Complete this example
  - Also, registers and texture memory
- · Reasoning about reuse and locality
- Introduction to bandwidth optimization



