









| Data Placement: Conceptual | Data Placement: Syntax |
|---|---|
| <ul> <li>Copies from host to device go to some part of global memory<br/>(possibly constant or texture memory)</li> <li>How to use SP shared memory <ul> <li>Must be constructed in, or copied from global memory by, the kernel program</li> </ul> </li> <li>How to use constant or texture cache <ul> <li>Read-only "reused" data can be placed in constant and texture memory by the host</li> </ul> </li> <li>Also, how to use registers <ul> <li>Most locally allocated data is placed directly in registers</li> <li>Even array variables can use registers if the compiler understands their access patterns</li> <li>Can allocate "superwords" to registers, e.g., `float4`</li> <li>Excessive use of registers will "spill" data to local memory</li> </ul> </li> <li>Local memory <ul> <li>Deals with capacity limitations of registers and shared memory</li> <li>Eliminates worries about race conditions</li> <li>But slow</li> </ul> </li> </ul> | <ul> <li>Through type qualifiers <ul> <li>`__constant__`, `__shared__`, `__device__`, and local (automatic variables)</li> </ul> </li> <li>Through `cudaMemcpy` calls <ul> <li>Flavor of call and symbolic constant designate where to copy</li> </ul> </li> <li>Implicit default behavior <ul> <li>Device memory without a qualifier is global memory</li> <li>Host by default copies to global memory</li> <li>Thread-local variables go into registers unless capacity is exceeded, then into local memory</li> </ul> </li> </ul> |
| CS6963 L4: Memory Hierarchy, II, University of Utah | |
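The placement rules above can be seen together in one small program. What follows is a minimal sketch, not code from the lecture: the names `coeff`, `scaleKernel`, `runPlacementDemo`, and the size `PLACEMENT_N` are all hypothetical, and error handling is kept to the bare minimum. It shows a `__constant__` array filled by the host through `cudaMemcpyToSymbol`, a `__shared__` buffer filled by the kernel from global memory, unqualified device pointers that refer to global memory, and thread-local scalars that the compiler will normally keep in registers.

```cuda
#include <cuda_runtime.h>

#define PLACEMENT_N 256   // hypothetical size: one block of 256 threads

// Read-only, reused data: placed in constant memory by the host.
__constant__ float coeff[4];

// Pointers without a qualifier refer to global memory.
__global__ void scaleKernel(const float *in, float *out)
{
    // Shared memory must be filled by the kernel itself.
    __shared__ float tile[PLACEMENT_N];

    int tx = threadIdx.x;      // thread-local scalar: normally a register
    tile[tx] = in[tx];         // copy global -> shared
    __syncthreads();           // all loads finish before any thread reads tile

    float acc = 0.0f;          // thread-local accumulator: normally a register
    for (int c = 0; c < 4; ++c)
        acc += coeff[c] * tile[(tx + c) % PLACEMENT_N];
    out[tx] = acc;             // write result back to global memory
}

// Host driver: returns true when the device result matches a CPU reference.
bool runPlacementDemo()
{
    float h_in[PLACEMENT_N], h_out[PLACEMENT_N];
    const float h_coeff[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    for (int i = 0; i < PLACEMENT_N; ++i) h_in[i] = (float)i;

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, sizeof(h_in)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }

    // Default flavor: the host copies into global memory.
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    // Symbol flavor: the host copies into the __constant__ array.
    cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff));

    scaleKernel<<<1, PLACEMENT_N>>>(d_in, d_out);
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    // All values are small integers, so float comparison is exact here.
    for (int i = 0; i < PLACEMENT_N; ++i) {
        float ref = 0.0f;
        for (int c = 0; c < 4; ++c)
            ref += h_coeff[c] * h_in[(i + c) % PLACEMENT_N];
        if (h_out[i] != ref) return false;
    }
    return true;
}
```

Calling `runPlacementDemo()` from a `main` and compiling with `nvcc` should be enough to try it. Whether `acc` and `tx` really live in registers is up to the compiler; the sketch only shows where each kind of declaration is allowed to end up.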

































<pre>
for (int i = 0; i &lt; Width; ++i)
    for (int j = 0; j &lt; Width; ++j) {
        double sum = 0;
        for (int k = 0; k &lt; Width; ++k) {
            sum += M[i][k] * N[k][j];
        }
        P[i][j] = sum;
    }
</pre>









<pre>
// Synchronize to make sure the sub-matrices are loaded
// before starting the computation
__syncthreads();

// Each thread computes one element of the block sub-matrix
for (int k = 0; k &lt; BLOCK_SIZE; ++k)
    Pvalue += Ms[ty][k] * Ns[k][tx];

// Synchronize to make sure that the preceding
// computation is done before loading two new
// sub-matrices of M and N in the next iteration
__syncthreads();
</pre>
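For context, the fragment above sits inside a loop over tiles. The sketch below is one plausible reconstruction of the surrounding kernel, not the original slide code: the kernel name `MatrixMulTiled`, the host helper `runTiledMatMulDemo`, the row-major flat indexing, and `BLOCK_SIZE = 16` are assumptions. It also assumes `Width` is a multiple of `BLOCK_SIZE`, so no boundary checks are needed. The names `Ms`, `Ns`, `Pvalue`, `tx`, and `ty` follow the fragment.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define BLOCK_SIZE 16   // assumed tile edge; Width must be a multiple of it

// P = M * N for square, row-major Width x Width matrices.
__global__ void MatrixMulTiled(const float *M, const float *N, float *P,
                               int Width)
{
    __shared__ float Ms[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Ns[BLOCK_SIZE][BLOCK_SIZE];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * BLOCK_SIZE + ty;
    int col = blockIdx.x * BLOCK_SIZE + tx;

    float Pvalue = 0.0f;
    for (int t = 0; t < Width / BLOCK_SIZE; ++t) {
        // Each thread loads one element of each sub-matrix.
        Ms[ty][tx] = M[row * Width + t * BLOCK_SIZE + tx];
        Ns[ty][tx] = N[(t * BLOCK_SIZE + ty) * Width + col];
        __syncthreads();   // sub-matrices fully loaded before use

        for (int k = 0; k < BLOCK_SIZE; ++k)
            Pvalue += Ms[ty][k] * Ns[k][tx];
        __syncthreads();   // computation done before the next load
    }
    P[row * Width + col] = Pvalue;
}

// Host driver: returns true when the GPU result equals a CPU reference.
bool runTiledMatMulDemo(int Width)
{
    if (Width <= 0 || Width % BLOCK_SIZE != 0) return false;

    size_t count = (size_t)Width * Width, bytes = count * sizeof(float);
    std::vector<float> hM(count), hN(count), hP(count);
    // Small integer values keep every float sum exact.
    for (size_t i = 0; i < count; ++i) {
        hM[i] = (float)(i % 7);
        hN[i] = (float)(i % 5);
    }

    float *dM = nullptr, *dN = nullptr, *dP = nullptr;
    bool ok = cudaMalloc(&dM, bytes) == cudaSuccess &&
              cudaMalloc(&dN, bytes) == cudaSuccess &&
              cudaMalloc(&dP, bytes) == cudaSuccess;
    if (ok) {
        cudaMemcpy(dM, hM.data(), bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dN, hN.data(), bytes, cudaMemcpyHostToDevice);
        dim3 block(BLOCK_SIZE, BLOCK_SIZE);
        dim3 grid(Width / BLOCK_SIZE, Width / BLOCK_SIZE);
        MatrixMulTiled<<<grid, block>>>(dM, dN, dP, Width);
        ok = cudaMemcpy(hP.data(), dP, bytes, cudaMemcpyDeviceToHost)
             == cudaSuccess;
    }
    cudaFree(dM);
    cudaFree(dN);
    cudaFree(dP);
    if (!ok) return false;

    // CPU reference: the same triple loop as the sequential version.
    for (int i = 0; i < Width; ++i)
        for (int j = 0; j < Width; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < Width; ++k)
                sum += hM[i * Width + k] * hN[k * Width + j];
            if (hP[i * Width + j] != sum) return false;
        }
    return true;
}
```

A call such as `runTiledMatMulDemo(64)` from `main` exercises a 4 x 4 grid of tiles. Removing either `__syncthreads()` call would introduce a race: the first guards reads of a tile that is still being loaded, the second guards overwriting a tile that other threads are still reading.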

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007. ECE 498AL, University of Illinois, Urbana-Champaign





