L5: Writing Correct Programs, cont. # Outline - How to tell if your parallelization is correct? - · Race conditions and data dependences - Tools that detect race conditions - Abstractions for writing correct parallel code - Mapping these abstractions to CUDA - Reading (if you can find it): - "Optimizing Compilers for Modern Architectures: A Dependence-Based Approach", Allen and Kennedy, 2002, Ch. 2. CS6963 2 LS: Writing Correct Programs UNIVERSITY # Administrative Next assignment (a homework) given out on Monday CS6963 3 L5: Writing Correct Programs # Is this CUDA code correct? ``` __global kernelcode(float * d_array1, __host callkernel() { d_array2, int N) { dim3 blocks(10); float result; dim3 threads(100); float *d_array; for (int i=0; i<N; i++) { d_array1[threadldx] += d_array2[blockldx][i]; cudaMalloc(&d_array1,...); cudaMalloc(&d_array2,...); kernelcode<<<bl/>blocks,threads, for (int i=1; i<N; i++) { 0>>>{d_array1, d_array2, result += d_array1[threadIdx-1]; 1000); } CS6963 4 L5: Writing Correct Programs ``` # Threads Access Shared Memory! - Global memory and shared memory within an SM can be freely accessed by multiple threads - Requires appropriate sequencing of memory accesses across threads to same location if at least one access is a write - Recall using \_\_syncthreads() within a thread block for synchronization - Not to be used for different blocks within a grid CS6963 5: Writing Correct Programs ### Is this CUDA code correct? \_\_host callkernel() { \_\_global kernelcode(float \* d\_array1, d\_array2, int N) { dim3 blocks(10); dim3 threads(100); for (int i=0; i<N; i++) { float \*d\_array; d\_array1[threadldx] += d\_array2[blockldx][i]; cudaMalloc(&d\_array1,...); cudaMalloc(&d\_array2,...); \_\_syncthreads(); kernelcode<<<bl/>blocks.threads. for (int i=1; i< N; i++) { 0>>>{d\_array1, d\_array2, 1000}; result += d\_array1[threadIdx-1]; # More Formally: Race Condition or Data Dependence - A race condition exists when the result of an execution depends on the timing of two or more events. - A data dependence is an ordering on a pair of memory operations that must be preserved to maintain correctness. CS6963 7 L5: Writing Correct Programs # How about other Shared Memory Architectures? - Race detection software (e.g., Intel ThreadChecker) - Trace memory accesses for each thread - Compare addresses accessed by each thread - Race condition exists if, between synchronization points, - multiple threads access the same memory location - and, at least one access is a write CS6963 CS6963 # What can we do to debug parallelization in CUDA? - · -deviceemu code (to be emulated on host) - Support for pthread debugging? - Can compare GPU output to CPU output, or compare GPU output to device emulation output - Race condition may still be present - · Or can (try to) prevent introduction of race conditions (remainder of lecture) # Data Dependence Two memory accesses are involved in a data dependence if they may refer to the same memory location and one of the references is a A data dependence can either be between two distinct program statements or two different dynamic executions of the same program statement. Two important uses of data dependence information (among others): Parallelization: no data dependence between two computations parallel execution safe Locality optimization: absence of data dependences & presence of reuse > reorder memory accesses for better data locality (next week) 10 L5: Writing Correct Programs # Data Dependence of Scalar Variables Definition: Data dependence exists from a reference instance i to i' iff either i or i' is a write operation i and i' refer to the same variable i executes before i' # Some Definitions (from Allen & Kennedy) - Definition 2.5: - Two computations are equivalent if, on the same inputs, - they produce identical outputs - · the outputs are executed in the same order - Definition 2.6: - A reordering transformation - · changes the order of statement execution - · without adding or deleting any statement executions. - Definition 2.7: - A reordering transformation preserves a dependence if - · it preserves the relative execution order of the dependences' source and sink. # Fundamental Theorem of Dependence - Theorem 2.2: Any reordering transformation that preserves every dependence in a program preserves the meaning of that program. Now we will discuss abstractions and algorithms to determine whether reordering transformations preserve dependences... CS6963 ### Parallelization as a Reordering Transformation in CUDA ``` __host callkernel() { _host callkernel() { dim3 blocks(bx,by); for (int bldx_x=0; bldx_x<bx; bldx_x++) { dim3 threads(tx,ty,tz); for (int bldx_y=0; bldx_y<by; bldx_y++) { kernelcode <\!\!<\!\! blocks, threads O\!\!>\!\!>\!\! [<\ for\ \{int\ tldx\_x=0;\ tldx\_x<\!tx;\ tldx\_x++\}\ \{ args>]; for (int tldx_y=0; tldx_y<ty; tldx_y++) { for (int tldx_z=0; tldx_z<tz; tldx_z++) { __global kernelcode(<args>) { /* code refers to threadldx.x, / * code refers to tldx_x, tldx_y, tldx_z, threadldx.y, threadldx.z, blockldx.x, blockldx.y */ bldx_x, bldx_y */ }}}} EQUIVALENT? CS6963 ``` # In Today's Lecture: Parallelizable Loops Forall (or CUDA kernels or Doall) loops: Loops whose iterations can execute in parallel (a particular reordering transformation) ``` forall (i=1; i<=n; i++)</pre> A[i] = B[i] + C[i]; ``` # Meaning? Each iteration can execute independently of others Free to schedule iterations in any order Why are parallelizable loops an important concept for data-parallel programming models? CS6963 15 L5: Writing Correct Programs CS6963 # CUDA Equivalent to "Forall" ``` _host callkernel() { forall (int bldx_x=0; bldx_x<bx; bldx_x++) { forall (int bldx_y=0; bldx_y < by; bldx_y++) \{ for all \ (int \ tldx\_x=0; \ tldx\_x< tx; \ tldx\_x++) \ \{ forall (int tldx_y=0; tldx_y<ty; tldx_y++) { for all \ (int \ tldx_z=0; \ tldx_z<\!tz; \ tldx_z++) \ \{ /* code refers to tldx_x, tldx_y, tldx_z, bldx_x, bldx_y */ }}}}} 16 L5: Writing Correct Programs ``` # Distance and Direction Vectors • Distance vectors: (infinitely large set) $$\left( \begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 0 & \dots & 0 \\ 1 & 2 & \dots & n \end{bmatrix} \right) \left( \begin{bmatrix} 1 & \dots & 1 \\ -n & \dots & 0 \end{bmatrix} \dots \left( \begin{bmatrix} n & \dots & n \\ -n & \dots & n \end{bmatrix} \right)$$ - Direction vectors: (realizable if 0 or lexicographically positive) - ([=,=],[=,<],[<,>], [<,=], [<,<]) - Common notation: - 0 = . - > +/- \* CS6963 21 L5: Writing Correct Program: # Parallelization Test: 1-Dimensional Loop • Examples: $\begin{array}{lll} \text{for } (j=1; \ j < N; \ j + +) & \text{for } (j=1; \ j < N; \ j + +) \\ & A[j] = A[j] + 1; & B[j] = B[j-1] + 1; \end{array}$ - · Dependence (Distance and Direction) Vectors? - Test for parallelization: - A loop is parallelizable if for all data dependences D e $\mathbf{D}$ , D = $\mathbf{0}$ CS6963 : Writing Correct Programs # n-Dimensional Loop Nests for (i=1; i<=N; i++) for (j=1; j<=N; j++) A[i][j] = A[i][j-1]+1; for (i=1; i<=N; i++)</pre> or (i=1; i<=N; i++) for (j=1; j<=N; j++) A[i][j] = A[i-1][j+1]+1; - · Distance and direction vectors? - Definition: D = (d<sub>1</sub>, ... d<sub>p</sub>) is loop-carried at level i if d<sub>i</sub> is the first nonzero element. CS6963 23 L5: Writing Correct Programs # A Few Words about n-Dimensional Arrays in C - Largely conceptual, due to difficulty in expressing this in C for dynamically allocated data structures - · Imagine the following macros, #define 2dAccess(i,j,dim\_i) \ i+j\*dim\_i #define 3dAccess(i,j,k,dim\_i,dim\_j) \ i+j\*dim\_i + k\*dim\_i\*dim\_j CS6963 # Test for Parallelization The *i* th loop of an *n*-dimensional loop is parallelizable if there does not exist any level *i* data dependences. The ith loop is parallelizable if for all dependences $D=(d_1,\;\dots\;d_n),$ either $(d_1, \ldots d_{i-1}) > 0$ $(d_1,\;\dots\;d_i)=0$ CS6963 # Safe Parallelization of CUDA Code - · Dependences must be carried by - (a) Loops that execute on the host - OR, loops that execute within a kernel function - \*May be able to use synchronization for dependences across threads, but not across blocks (subtle distinction) for (j=1; j<n; j++) A[threadIdx][j] = A[threadIdx][j-1]; (b) Dependence carried within thread code # Parallelization Algorithm - · For each pair of dynamic accesses to the same array within a loop nest: - determine if there exists a dependence between that pair - Key points: - $n^2$ tests for n accesses in loop! - a single access is compared with itself - includes accesses in all loops within a nest # Dependence Testing - · Question so far: - · What is the distance/direction (in the iteration space) between two dynamic accesses to the same memory location? - · Simpler question: - Can two data accesses ever refer to the same memory location? for (i=11;i<=20;i++) for (i=11;i<=20;i++) A[i] = A[i-1] + 3;A[i] = A[i-10] + 1; CS6963 # Restrict to an Affine Domain for (i=1; i<N; i++) for (j=1; j<N j++) { A[i+2\*j+3, 4\*i+2\*j, 3\*i] = ...; ... = A[1, 2\*i+1, j]; } • Only use loop bounds and array indices which are integer linear functions of loop variables. • Non-affine example: for (i=1; i<N; i++) for (j=1; j<N j++) { A[i\*j] = A[i\*j-1]; A[B[i]] = A[B[j]]; }</pre> # Equivalence to Integer Programming Need to determine if F(i) = G(i'), where i and i' are iteration vectors, with constraints i,i' >= L, U>= i, i' Example: for (i=1; i<=100; i++) A[i] = A[i-1]; Inequalities: 1 <= iw <= 100, ir = iw -1, ir <= 100 integer vector I, AI <= b Integer Programing is NP-complete O(size of the coefficients) O(n") CS6963 CS6963 # Introducing Omega Calculator A software tool used in compilers Description: Solves "Presburger formulas", logical formulas built from affine constraints over integer variables, logical connectives and quantifiers. Can formulate these dependence problems, and derive existence of dependences and distances. Relevant to locality optimizations as well, next week's focus Can download from: http://www.cs.utah.edu/~chunchen/omega/ Also available from CADE Linux machines in: ~cs6963/bin/oc # ``` 2-D Omega Example • Example: for (i=0; i<n; i++) for (j=i; j<n; j++) a[i][j+1] = a[n][j]; • Formula (more complex): R := {[iw,jw] -> [ir,jr] : exists(n : /* unbound variable */ 1 <= iw <= n && iw <= jw <= n /* iteration space */ && 1 <= ir <= jr <= n /* loop-carried dependence? */ && jw+1 = jr && ir = n)); Result: {[iw,jw] -> [ir,jr] : FALSE} ``` ``` Calculating Distance Vectors • Example from before: for (i=1; i<=100; i++) for (j=1; j<=100; j++) A[i][j] = A[i-1][j+1]+1; • Omega formula: R:= {[iw, jw] → [di, dj]: exists (ir, jr: /* read iters unbound */ 1 <= iw, ir <= 100 && 1 <= jw, jr <= 100 /* iteration space */ && iw = ir -1 && jw = jr +1 && /* access exprs */ && di = ir - iw && dj = jr - jw)}; /* distances */ Result: {[iw,jw] → [1,-1]: 1 <= iw <= 99 && 2 <= jw <= 100} CS6963 LS: Writing Correct Programs ``` # Aside: What about dependences for other data structures? - Pointer-based - Pointer alias analysis - Shape analysis - · Objects - Escape analysis - · In practice - Lots of #pragma and special flags for programmer to assert no dependence CS6963 37 L5: Writing Correct Programs # Homework Assigned Monday - · Example questions - Given some sequential code, do the following: - Show distance vectors and decide which loop can be parallelized - Show Omega dependence relations - · Show correct CUDA code - Memory hierarchy optimization - · Simple tiling example - · Identify safety of code transformations - Given description of goal, show CUDA code to manage memory hierarchy CS6963 38 : Writing Correct Programs # Summary of Lecture - Data dependence can be used to determine the safety of reordering transformations such as parallelization - preserving dependences = preserving "meaning" - Iteration, distance and direction vectors are abstractions for understanding whether reordering transformations preserve dependences. - Parallelization of CUDA kernel programs can be viewed as a reordering transformation of a sequential implementation - Dependence testing on array accesses in loops has been shown to be equivalent to integer programming. - Omega calculator demonstrated CS696 39 L5: Writing Correct Programs # What's Ahead - Next week - Homework assignment on Monday - Managing the memory hierarchy - Initial discussion of projects - · February 16: President's Day holiday - February 18: - Jim Guilkey to present MPM for projects - February 20 (Friday): - Make up class (we'll discuss control flow) CS6963