







## Host Blocking: Common Examples

- How do you guarantee the GPU is done and results are ready?
- Timing example (excerpt from simpleStreams in CUDA SDK):

        cudaEvent_t start_event, stop_event;
        cudaEventCreate(&start_event);
        cudaEventCreate(&stop_event);
        cudaEventRecord(start_event, 0);
        init_array<<<blocks, threads>>>(d_a, d_c, niterations);
        cudaEventRecord(stop_event, 0);
        cudaEventSynchronize(stop_event);  // host blocks until stop_event has completed

- A bunch of runs in a row example (excerpt from transpose in CUDA SDK):

        for (int i = 0; i < numIterations; ++i) {
            transpose<<< grid, threads >>>(d_odata, d_idata, size_x, size_y);
        }
        // host blocks until all launched kernels finish
        // (deprecated name; current CUDA uses cudaDeviceSynchronize())
        cudaThreadSynchronize();
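The timing excerpt stops before the elapsed time is read back. A minimal, self-contained sketch of the full pattern might look like the following. The kernel `add_one`, the helper `time_kernel`, and the sizes used are hypothetical stand-ins (they are not from the SDK sample); only the CUDA runtime calls are real API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the SDK's init_array: adds 1 to each element.
__global__ void add_one(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Launches add_one `iterations` times and returns the elapsed GPU time in
// milliseconds, or a negative value if something went wrong.
float time_kernel(int *d_data, int n, int iterations) {
    cudaEvent_t start_event, stop_event;
    if (cudaEventCreate(&start_event) != cudaSuccess) return -1.0f;
    if (cudaEventCreate(&stop_event) != cudaSuccess) {
        cudaEventDestroy(start_event);
        return -1.0f;
    }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    cudaEventRecord(start_event, 0);
    for (int i = 0; i < iterations; ++i) {
        add_one<<<blocks, threads>>>(d_data, n);  // asynchronous: returns immediately
    }
    cudaEventRecord(stop_event, 0);
    cudaEventSynchronize(stop_event);  // host blocks here until the GPU reaches stop_event

    float ms = -1.0f;
    if (cudaGetLastError() == cudaSuccess) {
        cudaEventElapsedTime(&ms, start_event, stop_event);
    }
    cudaEventDestroy(start_event);
    cudaEventDestroy(stop_event);
    return ms;
}

int main() {
    const int n = 1 << 20;
    const int iterations = 10;
    int *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(int)) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(int));

    float ms = time_kernel(d_data, n, iterations);

    // Each element started at 0 and was incremented once per launch.
    int first = -1;
    cudaMemcpy(&first, d_data, sizeof(int), cudaMemcpyDeviceToHost);
    printf("elapsed: %.3f ms, data[0] = %d (expected %d)\n", ms, first, iterations);

    cudaFree(d_data);
    return (ms >= 0.0f && first == iterations) ? 0 : 1;
}
```

Events are recorded in the same stream as the kernels, so the measured interval covers GPU execution rather than host-side launch overhead.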

## Predominant Control Mechanisms: Some definitions

| Name | Meaning | Examples |
|------|---------|----------|
| Single Instruction, Multiple Data (SIMD) | A single thread of control, same computation applied across "vector" elements | Array notation as in Fortran 95: `A[1:n] = A[1:n] + B[1:n]`<br>Kernel functions within a block: `compute<<<gs,bs,msize>>>` |
| Multiple Instruction, Multiple Data (MIMD) | Multiple threads of control, processors periodically synchronize | OpenMP parallel loop: `forall (i=0; i<n; i++)`<br>Kernel functions across blocks: `compute<<<gs,bs,msize>>>` |
| Single Program, Multiple Data (SPMD) | Multiple threads of control, but each processor executes the same code | Processor-specific code: `if ($threadIdx.x == 0) { }` |

CS6963, L2: Hardware Overview
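To make the three definitions concrete, here is a small sketch of a per-block sum in which all three show up in one kernel. The names `block_sum`, `gpu_sum`, and `kBlockSize` are hypothetical and chosen only for illustration; the sketch assumes a non-empty input and a power-of-two block size.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kBlockSize = 256;  // assumed power of two, needed by the halving loop

__global__ void block_sum(const int *in, int *block_totals, int n) {
    __shared__ int partial[kBlockSize];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // SIMD-style: every thread in the block applies the same operation to its own element.
    partial[threadIdx.x] = (i < n) ? in[i] : 0;
    __syncthreads();

    // Tree reduction within the block; threads synchronize after each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) {
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        }
        __syncthreads();
    }

    // SPMD-style: all threads run the same program, but only thread 0 takes this branch.
    if (threadIdx.x == 0) {
        // MIMD-style across blocks: each block proceeds independently and writes its own slot.
        block_totals[blockIdx.x] = partial[0];
    }
}

// Sums a non-empty host vector on the GPU; returns -1 on a CUDA error.
long long gpu_sum(const std::vector<int> &host) {
    int n = static_cast<int>(host.size());
    int blocks = (n + kBlockSize - 1) / kBlockSize;
    int *d_in = nullptr;
    int *d_totals = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_totals, blocks * sizeof(int)) != cudaSuccess) {
        cudaFree(d_in);
        return -1;
    }
    cudaMemcpy(d_in, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    block_sum<<<blocks, kBlockSize>>>(d_in, d_totals, n);

    // This copy blocks the host until the kernel above has finished.
    std::vector<int> totals(blocks);
    cudaError_t err = cudaMemcpy(totals.data(), d_totals, blocks * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_totals);
    if (err != cudaSuccess) return -1;

    long long sum = 0;
    for (int t : totals) sum += t;  // combine the independent per-block results on the host
    return sum;
}

int main() {
    std::vector<int> data(1000, 1);
    long long sum = gpu_sum(data);
    printf("sum = %lld\n", sum);
    return sum == 1000 ? 0 : 1;
}
```

Blocks never synchronize with each other inside the kernel, which is why the per-block totals are combined on the host after the copy.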





























[Figure: instruction state ("Computing", "Operands ready to go") and the schedule at time k]











## Summary of Lecture

- SIMT = SIMD + SPMD
- SIMD execution model within a warp, and conceptually within a block
- MIMD execution model across blocks
- Multithreading of SMs used to hide memory latency
  - Motivation for lots of threads to be concurrently active
- Scoreboarding used to track warps ready to execute
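One way to see how many threads an SM can keep active for latency hiding is to ask the CUDA runtime about occupancy. The sketch below is illustrative only: the kernel `scale` and the helper `resident_warps_per_sm` are hypothetical names, and the numbers printed depend on the device it runs on.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical memory-bound kernel; any kernel could be queried the same way.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Returns how many warps of `scale` can be resident on one SM of device 0
// for the given block size, or -1 on a CUDA error.
int resident_warps_per_sm(int block_size) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return -1;

    int blocks_per_sm = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, scale,
                                                      block_size, 0) != cudaSuccess) {
        return -1;
    }
    int warps_per_block = (block_size + prop.warpSize - 1) / prop.warpSize;
    return blocks_per_sm * warps_per_block;
}

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("no CUDA device available\n");
        return 1;
    }
    int max_warps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("device limit: %d warps per SM\n", max_warps);

    const int block_sizes[] = {64, 128, 256, 512};
    for (int bs : block_sizes) {
        int warps = resident_warps_per_sm(bs);
        if (warps < 0) return 1;
        // More resident warps give the scheduler more candidates to switch to
        // while other warps wait on memory.
        printf("block size %3d: %d resident warps per SM\n", bs, warps);
    }
    return 0;
}
```

The more warps are resident, the more likely it is that at least one has its operands ready whenever another stalls on a memory access.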


