# Administrative • Next assignment available - Next three slides - Goals of assignment: - simple memory hierarchy management - block-thread decomposition tradeoff - Due Friday, Feb. 8, 5PM - Use handin program on CADE machines • "handin CS6235 lab2 <probfiles" # General Approach O. Provided a. Input file b. Sample output file c. CPU implementation 1. Structure a. Compare CPU version and GPU version output [compareInt] b. Time performance of two GPU versions (see 2 & 3 below) [EventRecord] 2. GPU version 1 (partial credit if correct) implementation using global memory 3. GPU version 2 (highest points to best performing versions) use memory hierarchy optimizations from previous, current lecture 4. Extra credit: Try two different block / thread decompositions, What happens if you use more threads versus more blocks? What if you do more work per thread? Explain your choices in a README file. Handin using the following on CADE machines, where probfile includes all files "handin cs6235 lab2 probfile>"" ### Overview of Lecture - Tiling for constant memory and registers - · Global Memory Coalescing - · Reading: - Chapter 5, Kirk and Hwu book CS6235 L5: Memory Hierarchy, 3 UNIVERSITY UNIVERSITY # Review: Targets of Memory Hierarchy Optimizations - Reduce memory latency - The latency of a memory access is the time (usually in cycles) between a memory request and its completion - Maximize memory bandwidth - Bandwidth is the amount of useful data that can be retrieved over a time interval - · Manage overhead - Cost of performing optimization (e.g., copying) should be less than anticipated gain CS6235 L5: Memory Hierarchy, 3 ### Discussion (Simplified Code) ``` for (int i = 0; i < Width; ++i) for (int j = 0; j < Width; ++j) { double sum = 0; for (int k = 0; k < Width; ++k) { double a = M[i * width + k]; double b = N[k * width + j]; sum += a * b; } P[i * Width + j] = sum; } </pre> for (int i = 0; i < Width; ++i) { double sum = 0; for (int k = 0; k < Width; ++k) { sum += M[i][k] * N[k][j]; } P[i][j] = sum; } </pre> ``` ``` What Does this Look Like in CUDA #define TI 32 #define TJ 32 #define TK 32 __global__ matMult(float *M, float *N, float *P) { ii = blockIdx.y; jj = blockIdx.x; i = threadIdx.y; j = threadIdx.x; __shared__ Ms[TI][TK], Ns[TK][TJ]; double sum = 0; for (int kk = 0; kk < Width; kk+=TK) { Tiling for shared Ms[j][i] = M[(ii*TI+i)*Width+TJ*jj+j+kk]; memory Ns[j][i] = N[(kk*TK+i)*Width+TJ*jj+j]; _syncthreads(); Now eliminate mods for (int k = kk; k < kk+TK; k++) sum += Ms[k%TK][i] * Ns[j][k%TK]; _synchthreads(); P[(ii*TI+i)*Width+jj*TJ+j] = sum; ``` ### Final Code (from text, p. 87) \_global\_void MatrixMulKernel (float \*Md, float \*Nd, float \*Pd, int Width) { 1. \_\_shared\_\_float Mds [TILE\_WIDTH] [TILE\_WIDTH]; 2. \_\_shared\_\_float Nds [TILE\_WIDTH] [TILE\_WIDTH]; 3 & 4. int bx = blockldx x: int by = blockldx y: int tx = threadldx x: int ty = threadldx y: //Identify the row and column of the Pd element to work on 5 & 6. int Row = by \* TILE\_WIDTH + ty; int Col = bx \* TILE\_WIDTH + tx; float Pvalue = 0: // Loop over the Md and Nd tiles required to compute the Pd element for (int m=0; m < Width / TILE\_WIDTH; ++m) $\{$ // Collaborative (parallel) loading of Md and Nd tiles into shared memory Mds [ty] [tx] = Md [Row\*Width + (m\*TILE\_WIDTH + tx)]; Nds [ty] [tx] = Nd [(m\*TILE\_WIDTH + ty)\*Width + Col]; \_syncthreads(); // make sure all threads have completed copy bef for (int k = 0; k < TILE\_WIDTH; ++k) // Update Pvalue for TKxTK tiles in Mds and Nds 11. // make sure all threads have completed copy before calculation 12 Pvalue += Mds [ty] [k] \* Nds [k] [tx]; 13. 14. \_syncthreads(); // make sure calculation complete before copying next tile } // m loop Pd [Row\*Width + Col] = Pvalue; L5: Memory Hierarchy, 3 UNIVERSITY ### "Tiling" for Registers - A similar technique can be used to map data to registers - · Unroll-and-jam - Unroll outer loops in a nest and fuse together resulting inner loops - Equivalent to "strip-mine" followed by permutation and unrolling - Fusion safe if relative order of memory reads and writes is preserved - Scalar replacement - May be followed by replacing array references with scalar variables to help compiler identify register opportunities LS: Memory Hierarchy, 3 UNIVERSIT ### Unroll and Jam: Matrix Multiply Code Example ``` for (int i = 0; i < Width; ++i) for (int i = 0; i < Width; ++i) for (int j = 0; j < Width; ++j) { for (int j = 0; j < Width; ++j) { double sum = 0; double sum = 0; for (int k = 0; k < Width; ++k) { for (int k = 0; k < Width; ++k) { sum += M[i][k] * N[k][j]; sum += M[i][k] * N[k][j]; P[i][j] = sum; P[i][j] = sum; L5: Memory Hierarchy, 3 ``` # Unroll J Loop and Fuse K Loop Copies ``` for (int i = 0; i < Width; i++) for (int j = 0; j < Width; j+=2) { Why is this helpful? double sum1 = sum2 = 0; Reuse M[i][k], possibly in register Two independent memory streams increases instruction- for (int k = 0; k < Width; k++) { sum1 += M[i][k] * N[k][j]; sum2 += M[i][k] * N[k][j+1]; P[i][j] = sum1; P[i][j+1] = sum2; L5: Memory Hierarchy, 3 ``` ### Unroll I and J Loops and Fuse J, K Loop Copies ``` for (int i = 0; i < Width; i+=2) Why is this helpful? Added reuse of N More independent memory for (int j = 0; j < Width; j+=2) { double sum1 = sum2 = sum3 = sum4 = 0; for (int k = 0; k < Width; k++) { sum1 += M[i][k] * N[k][j]; sum2 += M[i][k] * N[k][j+1]; sum3 += M[i+1][k] * N[k][j]; This code is almost always faste than the original on ANY architecture sum4 += M[i+1][k] * N[k][j+1]; More unrolling is better up to a point where registers are P[i][j] = sum1; exceeded P[i][j+1] = sum2; P[i+1][j] = sum3; P[i+1][j+1] = sum4; UNIVERSITY ``` ## Scalar Replacement (beyond sum) ``` for (int i = 0; i < Width; i+=2) Scalar Replacement Replace array variables with for (int j = 0; j < Width; j+=2) { scalar temporaries double sum1 = sum2 = sum3 = sum4 = 0; for (int k = 0; k < Width; k++) { Sometimes this helps compilers put array variables in registers, but tmpm1 = M[i][k]; tmpm2 = M[i+1][k]; tmpn1 = N[k][j]; tmpn2 = N[k][j+1]; not always necessary sum1 += tmpm1 * tmpn1; This code is almost always sum2 += tmpm1 * tmpn2; faster than the original on ANY architecture sum3 += tmpm2 * tmpn1; sum4 += tmpm2 * tmpn2; More unrolling is better up to a point where registers are P[i][j] = sum1; P[i][j+1] = sum2; P[i+1][j] = sum3; P[i+1][j+1] = sum4; L5: Memory Hierarchy, 3 UNIVERSITY ``` ### Additional Detail - Suppose each thread accesses different data from constant memory on same instruction - Reuse across threads? - · Consider capacity of constant cache and locality - Code transformation needed? -- tile for constant memory, constant cache - Cache latency proportional to number of accesses in a warp - No reuse? - · Should not be in constant memory CS623! L5: Memory Hierarchy, 3 ### Overview of Texture Memory - · Recall, texture cache of read-only data - Special protocol for allocating and copying to GPU - texture<Type, Dim, ReadMode> texRef; - Dim: 1, 2 or 3D objects - Special protocol for accesses (macros) - tex2D(<name>,dim1,dim2); - In full glory can also apply functions to textures - Writing possible, but unsafe if followed by read in same kernel CS6235 L5: Memory Hierarchy, 3 # Using Texture Memory (simpleTexture project from SDK) cudaMalloc( (void\*\*) &d\_data, size); cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat): cudaArray\* cu\_array; cudaMallocArray( &cu\_array, &channelDesc, width, height ); cudaMemcpyToArray( cu\_array, 0, 0, h\_data, size, cudaMemcpyHostToDevice); // set texture parameters tex.addressMode[0] = tex.addressMode[1] = cudaAddressModeWrap; tex.filterMode = cudaFilterModeLinear; tex.normalized = true; cudaBindTextureToArray( tex,cu\_array, channelDesc); // execute the kernel transformKernel<<< dimGrid, dimBlock, 0 >>>( d\_data, width, height, angle); Kernel function: // declare texture reference for 2D float texture texture/float, 2, cudaReadModeElementType> tex; ... = tex2D(tex,i,j); CS6235 L5: Memory Hierarchy, 3 ### When to use Texture (and Surface) Memory (From 5.3 of CUDA manual) Reading device memory through texture or surface fetching present some benefits that can make it an advantageous alternative to reading device memory from global or constant memory: - If memory reads to global or constant memory will not be coalesced, higher bandwidth can be achieved providing that there is locality in the texture fetches or surface reads (this is less likely for devices of compute capability 2.x given that global memory reads are cached on these devices); - Addressing calculations are performed outside the kernel by dedicated units; - Packed data may be broadcast to separate variables in a single operation: - 8-bit and 16-bit integer input data may be optionally converted to 32-bit floating-point values in the range [0.0, 1.0] or [-1.0, 1.0] (see Section 3.2.4.1.1). L5: Memory Hierarchy, 3 ### Memory Bandwidth Optimization - Goal is to maximize utility of data for each data transfer from global memory - Memory system will "coalesce" accesses for a collection of consecutive threads if they are within an aligned 128 byte portion of memory (from half-warp or warp) - $\bullet \ \ \textbf{Implications for programming:}$ - Desirable to have consecutive threads in tx dimension accessing consecutive data in memory - Significant performance impact, but Fermi data cache makes it slightly less important L5: Memory Hierarchy, I UNIVERS OF UTAH ### Introduction to Global Memory Bandwidth: Understanding Global Memory Accesses # Memory protocol for compute capability 1.2\* (CUDA Manual 5.1.2.1) - Start with memory request by smallest numbered thread. Find the memory segment that contains the address (32, 64 or 128 byte segment, depending on data type) - Find other active threads requesting addresses within that segment and coalesce - Reduce transaction size if possible - · Access memory and mark threads as "inactive" - Repeat until all threads *in half-warp* are serviced \*Includes Tesla and GTX platforms CS6235 L5: Memory Hierarchy, III L5: Memory Hierarchy II # Protocol for most systems (including lab6 machines) even more restrictive - For compute capability 1.0 and 1.1 - Threads must access the words in a segment in sequence - The kth thread must access the kth word CS6235 L5: Memory Hierarchy, III 25 UN ### Coalescing in Matrix Multiply \_global\_\_void MatrixMulKernel (float "Md, float "Nd, float "Pd, int Width) { . \_\_shared\_\_float Mds [TILE\_WIDTH] [TILE\_WIDTH]; . \_\_shared\_\_float Nds [TILE\_WIDTH] [TILE\_WIDTH]; 3 & 4. int bx = blockldx.x; int by = blockldx.y; int tx = threadldx.x; int ty = threadldx.y; //Identify the row and column of the Pd element to work on 5 & 6. int Row = by \* TILE\_WIDTH + ty; int Col = bx \* TILE\_WIDTH + tx; float Pyalue = 0: // Loop over the Md and Nd tiles required to compute the Pd element for (int m=0; m < Width / TILE\_WIDTH; ++m) { 8. Tor (int in=y, in \text{ \text{Visit in} \text{ Nds [ty] [tx] = Nd [(m\*TILE\_WIDTH + ty)\*Width + Col]; COALESCED - See Col 11. \_\_syncthreads(); for (int k = 0; k < TILE\_WIDTH; ++k) 12. Pvalue += Mds [ty] [k] \* Nds [k] [tx]; 14. \_syncthreads(); } // m loop Pd [Row\*Width + Col] = Pvalue; COALESCED – See Col L5: Memory Hierarchy, 3 UNIVERSITY OF UTAH ## Summary of Lecture - How to place data in constant memory and registers - Introduction to Bandwidth Optimization Global Memory Coalescing CS6235 Memory Hierarchy,