



























- Two recent papers:
  - "Auto-Generation and Auto-Tuning of 3D Stencil Codes on GPU Clusters," Zhang and Mueller, CGO 2012.
  - "High-Performance Code Generation for Stencil Computations on GPU Architectures," Holewinski et al., ICS 2012.
- Key issues:
  - Exploit reuse in shared memory.
  - Avoid fetching from global memory.
  - Thread decomposition to support global memory coalescing.
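
The shared-memory reuse idea can be sketched on the host. This is a minimal CPU emulation, not code from either paper: the function names, the 5-point stencil, and the one-cell halo are assumptions chosen for illustration. Each block stages its tile plus halo into a small local buffer (standing in for shared memory) and then computes only from that buffer, so every input cell is fetched from the large array once per block rather than once per neighboring output.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// 5-point stencil on the interior of an nx-by-ny row-major grid,
// reading every operand directly from the large input array.
inline std::vector<float> stencil_direct(const std::vector<float>& in,
                                         int nx, int ny) {
    std::vector<float> out(in.size(), 0.0f);
    for (int y = 1; y < ny - 1; ++y)
        for (int x = 1; x < nx - 1; ++x)
            out[y * nx + x] = in[y * nx + x]
                            + in[y * nx + x - 1] + in[y * nx + x + 1]
                            + in[(y - 1) * nx + x] + in[(y + 1) * nx + x];
    return out;
}

// Same stencil, processed in bx-by-by blocks (hypothetical block sizes).
// `tile` plays the role of shared memory: block interior plus a 1-cell halo.
inline std::vector<float> stencil_tiled(const std::vector<float>& in,
                                        int nx, int ny, int bx, int by) {
    assert(bx > 0 && by > 0);
    std::vector<float> out(in.size(), 0.0f);
    const int tw = bx + 2;  // tile width including left and right halo
    std::vector<float> tile(static_cast<std::size_t>(tw) * (by + 2));
    for (int y0 = 1; y0 < ny - 1; y0 += by) {
        for (int x0 = 1; x0 < nx - 1; x0 += bx) {
            // Clip partial blocks at the grid edge.
            const int h = std::min(by, ny - 1 - y0);
            const int w = std::min(bx, nx - 1 - x0);
            // Stage: one read from the large array per tile or halo cell.
            for (int ty = 0; ty < h + 2; ++ty)
                for (int tx = 0; tx < w + 2; ++tx)
                    tile[ty * tw + tx] = in[(y0 - 1 + ty) * nx + (x0 - 1 + tx)];
            // Compute: all five operands come from the staged tile.
            for (int ty = 1; ty <= h; ++ty)
                for (int tx = 1; tx <= w; ++tx)
                    out[(y0 - 1 + ty) * nx + (x0 - 1 + tx)] =
                          tile[ty * tw + tx]
                        + tile[ty * tw + tx - 1] + tile[ty * tw + tx + 1]
                        + tile[(ty - 1) * tw + tx] + tile[(ty + 1) * tw + tx];
        }
    }
    return out;
}
```

On a GPU the staging loop would be spread across the threads of a block, with consecutive threads reading consecutive `x` positions so that the global loads coalesce.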








## **Other Optimizations**

- X dimension delivers coalesced global memory accesses
  - Pad to multiples of 32 stencil elements
- Halo regions are aligned to 128-bit boundaries
- Input (parameter) arrays are padded to match halo region, to share indexing.
- BlockSize.x is maximized to avoid non-coalesced accesses to halo region
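
The padding rules above can be sketched as a small row-layout calculation. This is one possible reading, not the layout used by the cited code generators: `RowLayout`, `padded_row`, and the choice to pad the left halo so that the interior starts on a 128-bit boundary are assumptions, as is the 4-byte `float` element type.

```cpp
#include <cassert>
#include <cstddef>

static_assert(sizeof(float) == 4, "sketch assumes 4-byte float elements");

// Round n up to the next multiple of m (m > 0).
inline std::size_t round_up(std::size_t n, std::size_t m) {
    assert(m > 0);
    return (n + m - 1) / m * m;
}

// Hypothetical padded layout of one row of the X dimension.
struct RowLayout {
    std::size_t lead;   // elements before the first interior element
    std::size_t pitch;  // total elements per padded row
};

// nx: interior elements per row; halo: halo width on each side.
inline RowLayout padded_row(std::size_t nx, std::size_t halo) {
    assert(nx > 0);
    const std::size_t warp = 32;  // pad rows to multiples of 32 elements
    const std::size_t vec = 4;    // floats per 128-bit (16-byte) segment
    RowLayout r;
    // Widen the left halo so the interior begins on a 128-bit boundary.
    r.lead = round_up(halo, vec);
    // Left padding + interior + right halo, rounded up to a multiple of 32.
    r.pitch = round_up(r.lead + nx + halo, warp);
    return r;
}
```

For example, `padded_row(100, 1)` gives a lead of 4 elements and a pitch of 128 elements. If the base allocation is itself aligned, every row then starts on a 128-byte boundary, and giving the input arrays the same pitch lets them share one index expression with the stencil array.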


- Blocks are square to reduce the area of redundancy (the halo overlap between neighboring blocks).
- Use of shared memory for input.
- Use of texture fetch for input.
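
A short calculation illustrates why square blocks reduce redundancy. It assumes the redundant area is the halo ring that each block must handle in addition to its own interior; `halo_cells` and the 2D block shapes are illustrative, not taken from the papers.

```cpp
#include <cassert>

// Halo cells surrounding a bx-by-by block for a stencil of radius r:
// the extended (bx + 2r) x (by + 2r) footprint minus the block interior.
inline long halo_cells(long bx, long by, long r) {
    assert(bx > 0 && by > 0 && r >= 0);
    return (bx + 2 * r) * (by + 2 * r) - bx * by;
}
```

With 256 threads per block and radius 1, a 16x16 block has `halo_cells(16, 16, 1) == 68` halo cells, while a 64x4 block has `halo_cells(64, 4, 1) == 140`. The square shape minimizes the perimeter for a fixed area, and so the overlap with neighboring blocks.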



