




# Reminder: Content of Proposal, MPM/GIMP as Example

#### III. Suitability for GPU acceleration, cont.

- Synchronization and Communication: Discuss what data structures may need to be protected by synchronization, or communication through host.
  - Some challenges on boundaries between nodes in grid
- Discuss the data footprint and anticipated cost of copying to/from host memory.
  - Measure grid and patches to discover data footprint. Consider ways to combine computations to reduce copying overhead.

#### IV. Intellectual Challenges

- Generally, what makes this computation worthy of a project?
  - Importance of computation, and challenges in partitioning computation, dealing with scope, managing copying overhead
- Point to any difficulties you anticipate at present in achieving high speedup
  - See previous

CS6963, L10: Floating Point

## Midterm Exam

- Goal is to reinforce understanding of CUDA and NVIDIA architecture
- Material will come from lecture notes and assignments
- In class, should not be difficult to finish


# Parts of Exam

- I. Definitions
  - A list of 10 terms you will be asked to define
- II. Constraints
  - Understand constraints on numbers of threads, blocks, warps, size of storage
- III. Problem Solving
  - Derive distance vectors for sequential code and use these to transform code to CUDA, making use of constant memory
  - Given some CUDA code, indicate whether global memory accesses will be coalesced and whether there will be bank conflicts in shared memory
  - Given some CUDA code, add synchronization to derive a correct implementation
  - Given some CUDA code, provide an optimized version that will have fewer divergent branches
  - Given some CUDA code, derive a partitioning into threads and blocks that does not exceed various hardware limits
- IV. (Brief) Essay Question
  - Pick one from a set of 4
UNIVERSITY OF UTAH

### How Much? How Many?

- How many threads per block? Max 512
- How many blocks per grid? Max 65535 in each grid dimension
- How many threads per warp? 32
- How many warps per multiprocessor? 24
- How much shared memory per streaming multiprocessor? 16 Kbytes
- How many registers per streaming multiprocessor? 8192
- Size of constant cache: 8Kbytes
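To practice deriving a partitioning that stays within these limits, the arithmetic can be sketched on the host side as below. This is a rough model, not the Occupancy Calculator: the function names are hypothetical, the cap of 8 resident blocks per multiprocessor is an assumption that does not appear in the list above, and register and shared memory allocation granularity is ignored.

```cpp
#include <algorithm>

// Limits taken from the list above (per streaming multiprocessor unless noted).
constexpr int kMaxThreadsPerBlock = 512;
constexpr int kThreadsPerWarp = 32;
constexpr int kMaxWarpsPerSM = 24;
constexpr int kSharedBytesPerSM = 16 * 1024;
constexpr int kRegistersPerSM = 8192;
// Assumption, not in the list above: at most 8 resident blocks per multiprocessor.
constexpr int kMaxBlocksPerSM = 8;

// Number of warps needed to hold one block (a partial warp still occupies a full warp).
int warpsPerBlock(int threadsPerBlock) {
    return (threadsPerBlock + kThreadsPerWarp - 1) / kThreadsPerWarp;
}

// How many blocks of this shape fit on one multiprocessor at the same time.
// Returns 0 if even a single block exceeds a limit.
int residentBlocksPerSM(int threadsPerBlock, int regsPerThread, int sharedBytesPerBlock) {
    if (threadsPerBlock < 1 || threadsPerBlock > kMaxThreadsPerBlock) return 0;
    int byWarps = kMaxWarpsPerSM / warpsPerBlock(threadsPerBlock);
    // A resource that is not used places no constraint on the block count.
    int byRegs = regsPerThread > 0
                     ? kRegistersPerSM / (regsPerThread * threadsPerBlock)
                     : kMaxBlocksPerSM;
    int byShared = sharedBytesPerBlock > 0
                       ? kSharedBytesPerSM / sharedBytesPerBlock
                       : kMaxBlocksPerSM;
    return std::min({kMaxBlocksPerSM, byWarps, byRegs, byShared});
}

// Fraction of the multiprocessor's warp slots that are filled by resident blocks.
double occupancy(int threadsPerBlock, int regsPerThread, int sharedBytesPerBlock) {
    int blocks = residentBlocksPerSM(threadsPerBlock, regsPerThread, sharedBytesPerBlock);
    return static_cast<double>(blocks * warpsPerBlock(threadsPerBlock)) / kMaxWarpsPerSM;
}
```

Under these assumptions, a block of 256 threads using 10 registers per thread and 4096 bytes of shared memory gives 3 resident blocks (limited by both warps and registers) and fills all 24 warp slots, while 512 threads at 17 registers per thread does not fit at all because one block would need 8704 registers.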


# Syllabus

- L1 & L2: Introduction and CUDA Overview
  - Not much there...
- L3: Synchronization and Data Partitioning
  - What does \_\_syncthreads() do?
  - Indexing to map portions of a data structure to a particular thread
- L4: Hardware and Execution Model
  - How are threads in a block scheduled?
  - How are blocks mapped to streaming multiprocessors?
- L5: Dependence Analysis and Parallelization
  - Constructing distance vectors
  - Determining if parallelization is safe
- L6: Memory Hierarchy I: Data Placement
  - What are the different memory spaces on the device, who can read/write them?
  - How do you tell the compiler that something belongs in a particular memory space?
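For the L5 material, constructing a distance vector and testing whether a loop is safe to parallelize can be sketched as follows. This is a minimal illustration for a two-deep loop nest, not code from the lectures: the names `IterVec`, `distanceVector`, `carrierLevel`, and `loopIsParallel` are hypothetical, and it assumes every dependence is already known as a (source iteration, sink iteration) pair.

```cpp
#include <array>
#include <vector>

// An iteration (i, j) of a two-deep loop nest: for i { for j { ... } }.
using IterVec = std::array<int, 2>;

// Distance vector = sink iteration minus source iteration.
IterVec distanceVector(const IterVec& source, const IterVec& sink) {
    return {sink[0] - source[0], sink[1] - source[1]};
}

// Index of the loop that carries the dependence: the position of the first
// nonzero entry. Returns -1 when all entries are zero (loop-independent).
int carrierLevel(const IterVec& d) {
    for (int level = 0; level < static_cast<int>(d.size()); ++level) {
        if (d[level] != 0) return level;
    }
    return -1;
}

// A loop may run in parallel (with the loops outside it kept sequential)
// if none of the dependences is carried by that loop.
bool loopIsParallel(const std::vector<IterVec>& distances, int level) {
    for (const IterVec& d : distances) {
        if (carrierLevel(d) == level) return false;
    }
    return true;
}
```

As a worked case, take `A[i][j] = A[i-1][j+1] + 1`. The value written in iteration (1, 5) is read in iteration (2, 4), so the distance vector is (1, -1). The dependence is carried by the `i` loop, which must stay sequential, while the `j` loop can be mapped to threads.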


# Syllabus

- L7: Memory Hierarchy II: Reuse and Tiling
  - Safety and profitability of tiling
- L8: Memory Hierarchy III: Memory Bandwidth
  - Understanding global memory coalescing (for compute capability < 1.2 and ≥ 1.2)
  - Understanding memory bank conflicts
- L9: Control Flow
  - Divergent branches
  - Execution model
- L10: Floating Point
  - Intrinsics vs. arithmetic operations, what is more precise?
  - What operations can be performed in 4 cycles, and what operations take longer?
- L11: Tools: Occupancy Calculator and Profiler
  - How do they help you?
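For the L8 questions, the reasoning about a strided access pattern can be sketched on the host as below. The model rests on assumptions that are not stated in this outline: shared memory has 16 banks of 32-bit words, accesses are evaluated one half-warp (16 threads) at a time, and for compute capability below 1.2 a global access coalesces only when thread k reads the k-th 32-bit word of a 64-byte-aligned segment. The function names are hypothetical.

```cpp
#include <algorithm>
#include <array>

// Assumptions for this sketch (compute capability 1.x style hardware).
constexpr int kBanks = 16;     // shared memory banks, one 32-bit word wide
constexpr int kHalfWarp = 16;  // threads whose accesses are evaluated together

// Thread t of a half-warp accesses 32-bit word index t * strideWords
// (strideWords >= 0). Returns the worst-case number of threads that hit
// the same bank, i.e. how many times the access is serialized.
int bankConflictDegree(int strideWords) {
    if (strideWords == 0) return 1;  // all threads read one word: broadcast
    std::array<int, kBanks> hits{};  // zero-initialized hit count per bank
    for (int t = 0; t < kHalfWarp; ++t) {
        ++hits[(t * strideWords) % kBanks];
    }
    return *std::max_element(hits.begin(), hits.end());
}

// Simplified coalescing test for compute capability below 1.2:
// thread t reads 32-bit word index baseWord + t * strideWords.
bool isCoalescedPre12(int baseWord, int strideWords) {
    bool sequential = (strideWords == 1);         // thread k reads the k-th word
    bool aligned = (baseWord % kHalfWarp == 0);   // segment starts on a 64-byte boundary
    return sequential && aligned;
}
```

In this model an odd stride is conflict-free, a stride of 2 words gives a 2-way conflict, and a stride of 16 words sends every thread of the half-warp to the same bank. Devices at compute capability 1.2 and above relax the coalescing rule, so `isCoalescedPre12` returning false there means extra memory transactions rather than fully serialized accesses.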


## Next Time

- March 23: Guest Lecture, Austin Robison
- March 25: MIDTERM, in class
