












## Bandwidth to Shared Memory: Parallel Memory Accesses

- Consider each thread accessing a different location in shared memory
- Bandwidth maximized if each one is able to proceed *in parallel*
- Hardware to support this
  - Banked memory: each bank can service one access on every memory cycle
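The banked layout can be illustrated with a small host-side model. This is a minimal sketch, not hardware documentation: it assumes 16 banks and 32-bit words with successive words assigned to successive banks, which matches the half-warp model used in these slides. `NUM_BANKS`, `WORD_BYTES`, and `bank_of` are hypothetical names introduced only for illustration.

```python
NUM_BANKS = 16   # assumption: 16 banks, one per thread of a half-warp
WORD_BYTES = 4   # assumption: each bank serves one 32-bit word per cycle


def bank_of(byte_address: int) -> int:
    """Return the bank holding the 32-bit word at byte_address.

    Successive words are assumed to map to successive banks,
    wrapping around after NUM_BANKS words.
    """
    return (byte_address // WORD_BYTES) % NUM_BANKS


# A half-warp in which thread tid reads the float at index tid:
# every thread lands in its own bank, so all 16 accesses can
# proceed in the same memory cycle.
banks = [bank_of(tid * WORD_BYTES) for tid in range(16)]
```

Under these assumptions, consecutive 32-bit accesses spread evenly over the banks, which is the case the hardware is built to serve at full bandwidth.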









## Shared memory bank conflicts

- Shared memory is as fast as registers if there are no bank conflicts
- The fast case:
  - If all threads of a half-warp access different banks, there is no bank conflict
  - If all threads of a half-warp access the identical address, there is no bank conflict (broadcast)
- The slow case:
  - Bank conflict: multiple threads in the same half-warp access different addresses in the same bank
    - Must serialize the accesses
    - Cost = max # of simultaneous accesses to a single bank
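The cost rule above can be checked with a small host-side model. This is a simplified sketch under stated assumptions, not a description of the actual hardware: it assumes 16 banks, word-granular addressing, and that threads reading the identical word are served together (the broadcast case). `conflict_cost` is a hypothetical helper introduced for illustration.

```python
from collections import defaultdict


def conflict_cost(word_indices, num_banks=16):
    """Estimate serialized cycles for one half-warp access.

    word_indices holds the 32-bit word index each thread touches.
    The cost is the largest number of distinct words that fall
    into any single bank; identical words count once (broadcast).
    """
    words_per_bank = defaultdict(set)
    for w in word_indices:
        words_per_bank[w % num_banks].add(w)
    return max(len(words) for words in words_per_bank.values())


half_warp = range(16)
stride_1 = conflict_cost([tid * 1 for tid in half_warp])    # conflict-free
stride_2 = conflict_cost([tid * 2 for tid in half_warp])    # 2-way conflict
stride_3 = conflict_cost([tid * 3 for tid in half_warp])    # odd stride, conflict-free
stride_16 = conflict_cost([tid * 16 for tid in half_warp])  # all threads hit bank 0
same_word = conflict_cost([7 for _ in half_warp])           # broadcast
```

In this model, odd strides stay conflict-free because they are coprime with the bank count, while a stride equal to the bank count is the worst case: every access lands in one bank and the half-warp is fully serialized.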

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 ECE 498AL, University of Illinois, Urbana-Champaign











## Summary of Lecture

- A deeper probe of performance issues
  - Heterogeneous memory hierarchy
  - Locality and bandwidth
  - Tiling for CUDA code generation
