



















| Tiling View (Simplified Code)                                                                                                                                                                                                                                                                                            |                                                                                                                                                                                                                                            |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| <pre>for (int i = 0; i &lt; Width; ++i)<br>    for (int j = 0; j &lt; Width; ++j) {<br>        double sum = 0;<br>        for (int k = 0; k &lt; Width; ++k) {<br>            double a = M[i * Width + k];<br>            double b = N[k * Width + j];<br>            sum += a * b;<br>        }<br>        P[i * Width + j] = sum;<br>    }</pre> | <pre>for (int i = 0; i &lt; Width; ++i)<br>    for (int j = 0; j &lt; Width; ++j) {<br>        double sum = 0;<br>        for (int k = 0; k &lt; Width; ++k) {<br>            sum += M[i][k] * N[k][j];<br>        }<br>        P[i][j] = sum;<br>    }</pre> |
| L5: Memory Hierarchy, III | |
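
The loop nest above can be restructured so that it works on one small block of `M`, `N`, and `P` at a time, which is the idea behind the tiling view. The following is a minimal CPU sketch in plain C, not the course's GPU kernel. The names `matmul_naive`, `matmul_tiled`, and `min_int`, and the tile size `TILE`, are assumptions made for illustration.

```c
/* TILE is an assumed block size, not a value taken from the slides. */
#define TILE 4

/* Reference version: the same loop nest as the linearized code above. */
void matmul_naive(const double *M, const double *N, double *P, int Width)
{
    for (int i = 0; i < Width; ++i)
        for (int j = 0; j < Width; ++j) {
            double sum = 0;
            for (int k = 0; k < Width; ++k)
                sum += M[i * Width + k] * N[k * Width + j];
            P[i * Width + j] = sum;
        }
}

static int min_int(int a, int b) { return a < b ? a : b; }

/* Tiled version: the three outer loops step over TILE x TILE blocks,
 * the three inner loops do the multiply-accumulate inside one block.
 * The min_int bounds handle a Width that is not a multiple of TILE. */
void matmul_tiled(const double *M, const double *N, double *P, int Width)
{
    /* P accumulates partial sums across the kk tiles, so clear it first. */
    for (int idx = 0; idx < Width * Width; ++idx)
        P[idx] = 0.0;

    for (int ii = 0; ii < Width; ii += TILE)
        for (int jj = 0; jj < Width; jj += TILE)
            for (int kk = 0; kk < Width; kk += TILE) {
                int i_end = min_int(ii + TILE, Width);
                int j_end = min_int(jj + TILE, Width);
                int k_end = min_int(kk + TILE, Width);
                for (int i = ii; i < i_end; ++i)
                    for (int j = jj; j < j_end; ++j) {
                        double sum = P[i * Width + j];
                        for (int k = kk; k < k_end; ++k)
                            sum += M[i * Width + k] * N[k * Width + j];
                        P[i * Width + j] = sum;
                    }
            }
}
```

Both functions compute the same product; only the order in which elements are visited changes. On a GPU the block handled by the inner loops would typically be staged in shared memory, which this sketch does not attempt to model.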









This code should run at about 150 Gflops. A state-of-the-art mapping (in CUBLAS 3.2 on C2050) yields just above 600 Gflops. Higher on 20
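
For context on how such Gflops figures are usually derived: a `Width` x `Width` matrix multiply performs one multiply and one add per innermost iteration, so about 2 x Width^3 floating-point operations in total. The sketch below turns a measured run time into Gflops. The function names are hypothetical, and any timing passed in is the caller's own measurement, not a number from the slides.

```c
/* Floating-point operations in a Width x Width matrix multiply:
 * one multiply and one add for each of the Width^3 inner iterations. */
double matmul_flops(int width)
{
    double w = (double)width;
    return 2.0 * w * w * w;
}

/* Achieved rate in Gflops for a run that took `seconds` seconds. */
double matmul_gflops(int width, double seconds)
{
    return matmul_flops(width) / seconds / 1.0e9;
}
```

As an illustration with made-up numbers, a 1000 x 1000 multiply that finished in 0.01 s would correspond to roughly 200 Gflops.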











## Using Texture Memory (simpleTexture project from SDK)

cudaMalloc( (void\*\*) &d\_data, size);

cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);

cudaArray\* cu\_array;

cudaMallocArray( &cu\_array, &channelDesc, width, height );

cudaMemcpyToArray( cu\_array, 0, 0, h\_data, size, cudaMemcpyHostToDevice);

// set texture parameters

tex.addressMode[0] = tex.addressMode[1] = cudaAddressModeWrap;

tex.filterMode = cudaFilterModeLinear;

tex.normalized = true;

// bind the array to the texture

cudaBindTextureToArray( tex, cu\_array, channelDesc);

// execute the kernel

transformKernel<<< dimGrid, dimBlock, 0 >>>( d\_data, width, height, angle);


Kernel function:

// declare texture reference for 2D float texture

texture<float, 2, cudaReadModeElementType> tex;

... = tex2D(tex,i,j);

CS6963
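
To make the texture parameters above more concrete, here is a rough CPU emulation in plain C of what a `tex2D` fetch returns with wrap addressing, linear filtering, and normalized coordinates. It follows the bilinear filtering formula given in the texture fetching appendix of the CUDA programming guide. The function and helper names are hypothetical, the texels are assumed to be stored row by row, and the sketch ignores the reduced fixed-point precision the hardware uses for interpolation weights.

```c
/* Largest integer not greater than x (avoids depending on libm). */
static int floor_to_int(float x)
{
    int i = (int)x;
    return (x < (float)i) ? i - 1 : i;
}

/* Wrap an index into [0, n), also for negative values. */
static int wrap_index(int i, int n)
{
    int r = i % n;
    return r < 0 ? r + n : r;
}

/* Emulates tex2D(tex, u, v) for a float texture of width x height texels
 * with cudaAddressModeWrap, cudaFilterModeLinear and normalized coordinates.
 * texels[j * width + i] is the texel in column i, row j (assumed layout). */
float tex2d_emulated(const float *texels, int width, int height,
                     float u, float v)
{
    /* Wrap mode: keep only the fractional part of each coordinate,
     * then scale to texel units. */
    float x = (u - (float)floor_to_int(u)) * (float)width;
    float y = (v - (float)floor_to_int(v)) * (float)height;

    /* Linear filtering samples around texel centers, hence the 0.5 shift. */
    float xb = x - 0.5f;
    float yb = y - 0.5f;
    int i = floor_to_int(xb);
    int j = floor_to_int(yb);
    float a = xb - (float)i;   /* horizontal interpolation weight */
    float b = yb - (float)j;   /* vertical interpolation weight   */

    int i0 = wrap_index(i, width),  i1 = wrap_index(i + 1, width);
    int j0 = wrap_index(j, height), j1 = wrap_index(j + 1, height);

    return (1.0f - a) * (1.0f - b) * texels[j0 * width + i0]
         + a          * (1.0f - b) * texels[j0 * width + i1]
         + (1.0f - a) * b          * texels[j1 * width + i0]
         + a          * b          * texels[j1 * width + i1];
}
```

With normalized coordinates the same `(u, v)` pair addresses the same relative position regardless of the texture's size, and with wrap mode a coordinate such as `1.25` behaves like `0.25`.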


## When to use Texture (and Surface) Memory

(From 5.3 of CUDA manual)

Reading device memory through texture or surface fetching present some benefits that can make it an advantageous alternative to reading device memory from global or constant memory:

- If memory reads to global or constant memory will not be coalesced, higher bandwidth can be achieved providing that there is locality in the texture fetches or surface reads (this is less likely for devices of compute capability 2.x given that global memory reads are cached on these devices);
- Addressing calculations are performed outside the kernel by dedicated units;
- Packed data may be broadcast to separate variables in a single operation;
- 8-bit and 16-bit integer input data may be optionally converted to 32-bit floating-point values in the range [0.0, 1.0] or [-1.0, 1.0] (see Section 3.2.4.1.1).
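
The integer-to-float conversion mentioned above can be pictured with a small CPU sketch in plain C. The unsigned case maps the full integer range onto [0.0, 1.0], so an 8-bit value of `0xff` reads as 1.0. For the signed case the sketch assumes the common convention of dividing by the largest positive value and clamping at -1.0; that exact rule is an assumption here, not something stated in the slides. All function names are hypothetical.

```c
#include <stdint.h>

/* Unsigned 8-bit texel to a float in [0.0, 1.0]; 0xff maps to 1.0. */
float unorm8_to_float(uint8_t x)
{
    return (float)x / 255.0f;
}

/* Unsigned 16-bit texel to a float in [0.0, 1.0]; 0xffff maps to 1.0. */
float unorm16_to_float(uint16_t x)
{
    return (float)x / 65535.0f;
}

/* Signed 8-bit texel to a float in [-1.0, 1.0].
 * Assumed convention: divide by 127 and clamp, so -128 also maps to -1.0. */
float snorm8_to_float(int8_t x)
{
    float f = (float)x / 127.0f;
    return f < -1.0f ? -1.0f : f;
}
```

On the GPU this conversion is done by the texture unit when the texture is declared with a normalized-float read mode, so the kernel receives floats without spending arithmetic instructions on the scaling.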










