

## Project Proposal (due 3/8)

- Team of 2-3 people
- Please let me know if you need a partner
- Proposal Logistics:
  - Significant implementation, worth 50% of grade
  - Each person turns in the proposal (it should be the same as the other team members')
- Proposal:
  - 3-4 page document (11pt, single-spaced)
  - Submit with handin program:
    - "handin CS6235 prop <pdf-file>"

CS6235


## Project Parts (Total = 50%)

- Proposal (5%)
  - Short written document, next few slides
- Design Review (10%)
  - Oral, in-class presentation 2 weeks before the end
- Presentation and Poster (15%)
  - Poster session last week of class, dry run week before
- Final Report (20%)
  - Due during finals week; no final exam for this class

UNIVERSITY OF UTAH



## Projects - How to Approach

- Some questions:
  1. Amdahl's Law: target the bulk of the computation; profile to identify the key computations...
  2. What is the strategy for gradually adding GPU execution to CPU code while maintaining correctness?
  3. How to partition data & computation to avoid synchronization?
  4. What types of floating point operations are used, and what are the accuracy requirements?
  5. How to manage copy overhead? Can you overlap computation and copying?
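
Question 1 can be made concrete with Amdahl's Law: if a fraction p of the run time is accelerated by a factor s, the overall speedup is 1 / ((1 - p) + p / s). The sketch below is only an illustration; the function name `amdahl_speedup` and the sample numbers are hypothetical and do not come from the course material.

```c
#include <assert.h>

/* Overall speedup when a fraction p (0 <= p <= 1) of the original run time
 * is accelerated by a factor s (> 0) and the remainder is unchanged.
 * Hypothetical helper, for illustration only. */
double amdahl_speedup(double p, double s)
{
    assert(p >= 0.0 && p <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - p) + p / s);
}
```

For instance, `amdahl_speedup(0.9, 10.0)` is roughly 5.26: a kernel that runs 10x faster but covers only 90% of the run time gives about a 5x overall gain, which is why profiling to find the bulk of the computation comes first.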


## Floating Point

- Incompatibility
  - Most scientific apps are double precision codes!
  - Graphics applications do not need double precision (the criteria are speed and whether the picture looks OK, not whether it accurately models some scientific phenomenon).
  - Prior to the GTX and Tesla platforms, double precision floating point was not supported at all, and there were some inaccuracies in single-precision operations.
- In general
  - Double precision is needed for convergence on fine meshes, or for large sets of values
  - Single precision is OK for coarse meshes
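
The remark about large sets of values can be checked on the host. The following is a minimal sketch, assuming a host with 32-bit IEEE 754 `float` and 64-bit `double`; the helper names `sum_float`, `sum_double`, and `abs_error` are hypothetical. It accumulates the same value many times in each precision and measures how far each sum drifts from the product n * value.

```c
/* Accumulate `value` n times in single precision. */
float sum_float(float value, long n)
{
    float sum = 0.0f;
    for (long i = 0; i < n; ++i)
        sum += value;
    return sum;
}

/* Same accumulation, but carried out in double precision. */
double sum_double(float value, long n)
{
    double sum = 0.0;
    for (long i = 0; i < n; ++i)
        sum += (double)value;
    return sum;
}

/* Absolute error of a computed sum against n * value (evaluated in double). */
double abs_error(double computed, float value, long n)
{
    double diff = computed - (double)value * (double)n;
    return diff < 0.0 ? -diff : diff;
}
```

With `value = 0.1f` and `n = 10000000`, the single-precision sum drifts far from the reference once the running total is large compared with the value being added, while the double-precision sum stays close to it.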

L9: Projects and Floating Point




- What is IEEE floating-point format?
- A floating point binary number consists of three parts:
  - sign (S), exponent (E), and mantissa (M).
  - Each (S, E, M) pattern uniquely identifies a floating point number.
- For each bit pattern, its IEEE floating-point value is derived as:
  - $\text{value} = (-1)^{S} \times M \times 2^{E}$, where $1.0 \le M < 10.0_{B}$ (binary $10.0$ is decimal $2.0$)
- The interpretation of S is simple: S=0 results in a positive number and S=1 a negative number.
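
To see the (S, E, M) decomposition on a real bit pattern, here is a minimal sketch for single precision. It relies on details of the IEEE 754 single-precision layout that the slide does not spell out (1 sign bit, 8 exponent bits with a bias of 127, 23 fraction bits with an implicit leading 1) and assumes the host `float` uses that layout. The type `FloatFields` and the function `decode_float` are hypothetical names.

```c
#include <stdint.h>
#include <string.h>

/* Fields of a normalized IEEE 754 single-precision number,
 * in the slide's notation: value = (-1)^S * M * 2^E. */
typedef struct {
    int    S;  /* sign bit: 0 or 1 */
    int    E;  /* unbiased exponent */
    double M;  /* mantissa including the implicit leading 1, 1.0 <= M < 2.0 */
} FloatFields;

/* Decode a normalized float. Zero, denormals, infinities and NaNs use
 * reserved exponent patterns and are not handled by this sketch. */
FloatFields decode_float(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);           /* reinterpret the 32 bits safely */

    FloatFields f;
    f.S = (int)(bits >> 31);                  /* 1 sign bit */
    f.E = (int)((bits >> 23) & 0xFFu) - 127;  /* 8 exponent bits, bias 127 */
    f.M = 1.0 + (double)(bits & 0x7FFFFFu) / 8388608.0;  /* 23 fraction bits / 2^23 */
    return f;
}
```

For example, -6.5 is -1.101 (binary) times 2^2, so `decode_float(-6.5f)` yields S = 1, E = 2, M = 1.625.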



- Platforms of compute capability 1.2 and below only support single precision floating point
- Some systems (GTX 200 series, Tesla) include double precision, but it is much slower than single precision
  - A single dp arithmetic unit shared by all SPs in an SM
  - Similarly, a single fused multiply-add unit
- Greatly improved in Fermi
  - Up to 16 double precision operations performed per warp (subsequent slides)






| GPU                                           | G80                  | GT200                  | Fermi                          |
|-----------------------------------------------|----------------------|------------------------|--------------------------------|
| Transistors                                   | 681 million          | 1.4 billion            | 3.0 billion                    |
| CUDA Cores                                    | 128                  | 240                    | 512                            |
| Double Precision Floating<br>Point Capability | None                 | 30 FMA ops / clock     | 256 FMA ops /clock             |
| Single Precision Floating<br>Point Capability | 128 MAD<br>ops/clock | 240 MAD ops /<br>clock | 512 FMA ops /clock             |
| Warp schedulers (per SM)                      | 1                    | 1                      | 2                              |
| Special Function Units<br>(SFUs) / SM         | 2                    | 2                      | 4                              |
| Shared Memory (per SM)                        | 16 KB                | 16 KB                  | Configurable 48 KB or 16 KB    |
| L1 Cache (per SM)                             | None                 | None                   | Configurable 16 KB or 48 KB    |
| L2 Cache                                      | None                 | None                   | 768 KB                         |
| ECC Memory Support                            | No                   | No                     | Yes                            |
| Concurrent Kernels                            | No                   | No                     | Up to 16                       |
| Load/Store Address Width                      | 32-bit               | 32-bit                 | 64-bit                         |

|                                                 | G80                                   | SSE                                              | IBM Altivec                    | Cell SPE                        |
|-------------------------------------------------|---------------------------------------|--------------------------------------------------|--------------------------------|---------------------------------|
| Precision                                       | IEEE 754                              | IEEE 754                                         | IEEE 754                       | IEEE 754                        |
| Rounding modes for<br>FADD and FMUL             | Round to nearest and<br>round to zero | All 4 IEEE, round to<br>nearest, zero, inf, -inf | Round to nearest only          | Round to zero/<br>truncate only |
| Denormal handling                               | Flush to zero                         | Supported,<br>1000's of cycles                   | Supported,<br>1000's of cycles | Flush to zero                   |
| NaN support                                     | Yes                                   | Yes                                              | Yes                            | No                              |
| Overflow and Infinity<br>support                | Yes, only clamps to<br>max norm       | Yes                                              | Yes                            | No, infinity                    |
| Flags                                           | No                                    | Yes                                              | Yes                            | Some                            |
| Square root                                     | Software only                         | Hardware                                         | Software only                  | Software only                   |
| Division                                        | Software only                         | Hardware                                         | Software only                  | Software only                   |
| Reciprocal estimate<br>accuracy                 | 24 bit                                | 12 bit                                           | 12 bit                         | 12 bit                          |
| Reciprocal sqrt<br>estimate accuracy            | 23 bit                                | 12 bit                                           | 12 bit                         | 12 bit                          |
| log2(x) and 2<sup>x</sup><br>estimates accuracy | 23 bit                                | No                                               | 12 bit                         | No                              |





## Arithmetic Instruction Throughput (G80)

- int and float add, shift, min, max and float mul, mad: 4 cycles per warp
  - int multiply (\*) is by default 32-bit
    - requires multiple cycles / warp
  - Use \_\_mul24() / \_\_umul24() intrinsics for 4-cycle 24-bit int multiply
- Integer divide and modulo are expensive
  - Compiler will convert literal power-of-2 divides to shifts
  - Be explicit in cases where compiler can't tell that divisor is a power of 2!
  - Useful trick: foo % n == foo & (n-1) if n is a power of 2

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, University of Illinois, Urbana-Champaign
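
The power-of-2 tricks above can be written out explicitly. This is a minimal host-side sketch in plain C; the helper names `is_power_of_two`, `mod_pow2`, and `div_pow2` are hypothetical. It uses unsigned operands because the identity foo % n == foo & (n-1) does not hold for negative signed values.

```c
#include <assert.h>

/* True when n is a nonzero power of 2 (exactly one bit set). */
int is_power_of_two(unsigned int n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* foo % n computed with a bitwise AND; valid only when n is a power of 2. */
unsigned int mod_pow2(unsigned int foo, unsigned int n)
{
    assert(is_power_of_two(n));
    return foo & (n - 1);
}

/* foo / n computed with a shift, where n == 1u << log2_n
 * and log2_n is smaller than the bit width of unsigned int. */
unsigned int div_pow2(unsigned int foo, unsigned int log2_n)
{
    return foo >> log2_n;
}
```

For example, `mod_pow2(37, 8)` gives 5, matching 37 % 8, and `div_pow2(37, 3)` gives 4, matching 37 / 8. Being explicit like this helps when the divisor is a run-time value that the compiler cannot prove is a power of 2.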

## Arithmetic Instruction Throughput (G80)

- Reciprocal, reciprocal square root, sin/cos, log, exp: 16 cycles per warp
  - These are the versions prefixed with "\_\_"
  - Examples: \_\_rcp(), \_\_sin(), \_\_exp()
- Other functions are combinations of the above
  - y / x == rcp(x) \* y == 20 cycles per warp (16 for the reciprocal plus 4 for the multiply)
  - sqrt(x) == rcp(rsqrt(x)) == 32 cycles per warp (two 16-cycle operations)




