



- Is there a shared memory bank conflict when each thread accesses contiguous 8-bit or 16-bit data?
  - YES for compute capability below 2.0, see Section G.3.3 of the CUDA 3.2 Programming Guide
  - NO for compute capability 2.0 and greater, see Section G.4.3 of the CUDA 3.2 Programming Guide
- GTX 460 and 560 do have 48 cores per SM (with 7 or 8 SMs)
  - Do they use a different warp size?
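The YES/NO answers to the bank-conflict question above can be reproduced with a small host-side model. This is a rough sketch, not the hardware's exact arbitration: the function name `conflictDegree` and its parameters are hypothetical, and the model assumes 16 banks serviced per half-warp for compute capability 1.x and 32 banks serviced per warp for 2.x, with 32-bit-wide banks. For 1.x it deliberately ignores the broadcast mechanism and counts every thread that lands in a bank; for 2.x it counts only distinct 32-bit words per bank, since threads touching bytes of the same word do not conflict there.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Simplified model (an assumption, not the exact hardware behaviour):
// thread t reads the element of size elemBytes at index t, and the
// function returns the largest number of accesses that one bank must
// serialize for a single memory request.
//   banks             : 16 for compute capability 1.x, 32 for 2.x
//   threadsPerRequest : 16 (half-warp) for 1.x, 32 (warp) for 2.x
//   sameWordIsFree    : false -> threads sharing a 32-bit word still count
//                                (models the 1.x answer for this pattern)
//                       true  -> only distinct words in a bank count (2.x)
int conflictDegree(std::size_t elemBytes, int banks, int threadsPerRequest,
                   bool sameWordIsFree) {
    assert(elemBytes > 0 && banks > 0 && threadsPerRequest > 0);
    std::vector<std::vector<std::size_t>> wordsInBank(banks);
    for (int t = 0; t < threadsPerRequest; ++t) {
        // Index of the 32-bit word that holds thread t's element.
        std::size_t word = (static_cast<std::size_t>(t) * elemBytes) / 4;
        wordsInBank[word % banks].push_back(word);
    }
    std::size_t worst = 1;
    for (const auto& words : wordsInBank) {
        std::size_t n = sameWordIsFree
            ? std::set<std::size_t>(words.begin(), words.end()).size()
            : words.size();
        worst = std::max(worst, n);
    }
    return static_cast<int>(worst);
}

// Convenience wrappers for the two cases discussed on the slide.
int conflictDegreeCC1x(std::size_t elemBytes) {
    return conflictDegree(elemBytes, 16, 16, false);
}
int conflictDegreeCC2x(std::size_t elemBytes) {
    return conflictDegree(elemBytes, 32, 32, true);
}
```

Under these assumptions, `conflictDegreeCC1x(1)` (contiguous `char` data) reports a degree greater than 1, while `conflictDegreeCC2x(1)` and `conflictDegreeCC2x(2)` report 1, which matches the YES/NO answers above. The exact degree on real 1.x hardware should be checked against the programming guide rather than taken from this model.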

L8: Control Flow








## Code from asyncAPI SDK project

    // allocate host memory
    CUDA_SAFE_CALL( cudaMallocHost((void**)&a, nbytes) );
    memset(a, 0, nbytes);

    // allocate device memory
    CUDA_SAFE_CALL( cudaMalloc((void**)&d_a, nbytes) );
    CUDA_SAFE_CALL( cudaMemset(d_a, 255, nbytes) );

    ... // declare grid and thread dimensions and create start and stop events

    // asynchronously issue work to the GPU (all to stream 0)
    cudaEventRecord(start, 0);
    cudaMemcpyAsync(d_a, a, nbytes, cudaMemcpyHostToDevice, 0);
    increment_kernel<<<blocks, threads, 0, 0>>>(d_a, value);
    cudaMemcpyAsync(a, d_a, nbytes, cudaMemcpyDeviceToHost, 0);
    cudaEventRecord(stop, 0);

    // have CPU do some work while waiting for GPU to finish

    // release resources
    CUDA_SAFE_CALL( cudaFreeHost(a) );
    CUDA_SAFE_CALL( cudaFree(d_a) );


## More Parallelism to Come (Compute Capability 2.0)

- Stream concept: create, destroy, tag asynchronous operations with stream
  - Special synchronization mechanisms for streams: queries, waits and synchronize functions
- Concurrent Kernel Execution
  - Execute multiple kernels (up to 4) simultaneously
- Concurrent Data Transfers
  - Can concurrently copy from host to GPU and GPU to host using asynchronous Memcpy

Section 3.2.6 of CUDA 3.2 manual



## Debugging: Using Device Emulation Mode

- An executable compiled in device emulation mode (nvcc -deviceemu) runs completely on the host using the CUDA runtime
  - No device or CUDA driver is needed
  - Each device thread is emulated with a host thread
- When running in device emulation mode, one can:
  - Use host native debug support (breakpoints, inspection, etc.)
  - Access any device-specific data from host code and vice-versa
  - Call any host function from device code (e.g. printf) and vice-versa
  - Detect deadlock situations caused by improper usage of \_\_syncthreads

David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign


## Debugging: Device Emulation Mode Pitfalls

- Emulated device threads execute sequentially, so simultaneous accesses of the same memory location by multiple threads can produce results that differ from those on the device.
- Dereferencing device pointers on the host or host pointers on the device can produce correct results in device emulation mode, but will generate an error in device execution mode
- Results of floating-point computations will slightly differ because of:
  - Different compiler outputs, instruction sets
  - Use of extended precision for intermediate results
    - There are various options to force strict single precision on the host
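Two of the pitfalls above can be illustrated with plain host code. The sketch below is only a model under stated assumptions: all function names are hypothetical, "lockstep" stands in for threads of a warp that all read before any of them writes, and `double` stands in for whatever wider intermediate precision a host compiler might use. It does not use the CUDA runtime or the emulation mode itself.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential model: "thread" i finishes before "thread" i+1 starts, as in
// emulation. Each thread executes a[i + 1] = a[i].
std::vector<int> shiftSequential(std::vector<int> a) {
    assert(!a.empty());
    for (std::size_t i = 0; i + 1 < a.size(); ++i)
        a[i + 1] = a[i];  // thread i+1 will read the value written here
    return a;
}

// Lockstep model (assumption): every thread performs its read before any
// thread performs its write.
std::vector<int> shiftLockstep(std::vector<int> a) {
    assert(!a.empty());
    const std::vector<int> loaded = a;  // all reads happen first
    for (std::size_t i = 0; i + 1 < a.size(); ++i)
        a[i + 1] = loaded[i];           // then all writes
    return a;
}

// Strict single precision: the intermediate sum is rounded to float.
// volatile forces the store, so the compiler cannot keep extra precision.
float addThenSubtractSingle(float big, float small) {
    volatile float t = big + small;
    return t - big;
}

// Wider intermediate: the same arithmetic carried out in double.
float addThenSubtractWide(float big, float small) {
    double t = static_cast<double>(big) + static_cast<double>(small);
    return static_cast<float>(t - static_cast<double>(big));
}
```

With the input `{1, 2, 3, 4}`, the sequential version propagates the first element through the whole array, while the lockstep version shifts the array by one position, so code that appears correct under one execution order can give a different answer under the other. With `big = 1e8f` and `small = 1.0f`, the single-precision version loses the small term entirely, while the wider intermediate keeps it.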





















































