## 3810 Review Session Spring 2023

#### Hit Record!

#### Reminders:

- Practice exam, annotated slides, class notes, homework solutions
- Friday April 28, 8-10am. Room assignments: Last names A-R in WEB L104. Last names S-Z in WEB 2230.
- 80-20 split between post-midterm and pre-midterm material
- Office hours: today until 11:30am, Wed and Thurs 9-11am
- No laptops/textbooks. 6 sheets + green sheet. Calculators ok.
- SoC code of conduct

#### **Disks Basics**

Non-volatile storage





- Focus on other metrics, especially reliability
- A sector on the disk is associated with a cyclic redundancy code (CRC), a hash that tells us whether the data read is correct. It is simply an error detector, not an error corrector. To correct the error, RAID (redundant array of inexpensive disks) is commonly used.
- Reliability measures continuous service accomplishment and is usually expressed as mean time to failure (MTTF)
- Availability is measured as MTTF/(MTTF+MTTRecovery)
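The availability formula above can be sketched in a few lines. The MTTF and recovery-time numbers below are hypothetical, chosen only to illustrate the arithmetic.

```python
def availability(mttf_hours, mttr_hours):
    """Fraction of time the system is delivering service:
    MTTF / (MTTF + mean time to recovery)."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical disk: fails on average every 1,000,000 hours,
# and takes 24 hours to recover after each failure.
example = availability(mttf_hours=1_000_000, mttr_hours=24)
print(f"availability = {example:.6f}")
```

A longer recovery time lowers availability even when MTTF (reliability) is unchanged.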

$$f(D) = CRC$$
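A minimal sketch of CRC as an error detector, assuming Python's `zlib.crc32` as a stand-in for f(D). Real disks use their own codes; the helper names `store` and `read_is_correct` are hypothetical.

```python
import zlib


def store(data: bytes):
    """Return the data together with its CRC, as if written to a sector."""
    return data, zlib.crc32(data)


def read_is_correct(data: bytes, stored_crc: int) -> bool:
    """Recompute f(D) on the data read back and compare with the stored CRC."""
    return zlib.crc32(data) == stored_crc


sector, crc = store(b"sector contents")

# Flip one bit of the first byte to model a read error.
corrupted = bytes([sector[0] ^ 0x01]) + sector[1:]

print(read_is_correct(sector, crc))     # clean read
print(read_is_correct(corrupted, crc))  # the single-bit error is detected
```

The check only reports that something is wrong. It does not say which bit flipped, which is why a separate mechanism such as RAID is needed for correction.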





#### **Pipeline Performance**

10-stage pipeline, pipelined processor

- CPI: 1
- Clock speed: 1.67 GHz
- Throughput: 1.67 GHz × 1 = 1.67 BIPS (billion instructions per second)

#### **No Bypassing (for the 5-stage pipeline)**

Point of production: always RW middle

Point of consumption: always D/R middle

I1 add: IF DR AL DM RW
I2 add: IF DR DR DR AL DM RW
\* PoC



#### **Bypassing**

Point of production:

add, sub, etc.: end of ALU

lw: end of DM

Point of consumption:

add, sub, lw: start of ALU

sw \$1, 8(\$2): start of ALU for \$2,

start of DM for \$1

```
                * PoP
I1 add: IF DR AL DM RW
I2 add:    IF DR AL DM RW
                * PoC
```
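The point-of-production / point-of-consumption rules can be turned into a small stall calculator. This is a sketch under the assumptions in these notes (5 stages named IF, DR, AL, DM, RW; one cycle per stage); the function names are hypothetical.

```python
import math

STAGES = ["IF", "DR", "AL", "DM", "RW"]


def stage_time(stage, where):
    """Time, in cycles after an instruction's fetch begins, of the start,
    middle, or end of one of its stages (assuming no stalls)."""
    fraction = {"start": 0.0, "middle": 0.5, "end": 1.0}[where]
    return STAGES.index(stage) + fraction


def stall_cycles(pop, poc, distance=1):
    """Stalls for a consumer issued `distance` instructions after its producer.

    pop: (stage, where) at which the producer's result becomes available.
    poc: (stage, where) at which the consumer needs the value.
    """
    produced_at = stage_time(*pop)
    needed_at = stage_time(*poc) + distance
    return max(0, math.ceil(produced_at - needed_at))


# No bypassing: produced at RW middle, consumed at DR middle.
no_bypass = stall_cycles(("RW", "middle"), ("DR", "middle"))

# Bypassing: add -> add, lw -> add, and lw -> sw (store data register).
add_then_add = stall_cycles(("AL", "end"), ("AL", "start"))
lw_then_add = stall_cycles(("DM", "end"), ("AL", "start"))
lw_then_sw_data = stall_cycles(("DM", "end"), ("DM", "start"))

print(no_bypass, add_then_add, lw_then_add, lw_then_sw_data)
```

The no-bypassing case gives 2 stalls, which matches the two extra DR cycles of I2 in the no-bypassing diagram.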

#### **Control Hazards**

**Assumptions**

- 100 instructions (100 cycles in the idealized, no-stall case)
- 20 branches: 14 Not-Taken, 6 Taken
- Branch resolved in 6th cycle (penalty of 5)

Approach 1: Panic and wait

Execution time = 100 cycles (idealized) + 20 branches × 5-cycle penalty = 100 + 100 = 200 cycles

Approach 2: Fetch-next-instr

Execution time = 100 + 6 taken branches × 5-cycle penalty = 130 cycles

Approach 3: Branch Delay Slot (hw/compiler)

- Option A: always useful: 100 cycles
- Option B: useful when the branch goes along common fork: 100 + 6 × 5 = 130 cycles
- Option C: useful when the branch goes along uncommon fork: 100 + 14 × 5 = 170 cycles
- Option D: no-op, always non-useful: 100 + 20 × 5 = 200 cycles

Approach 4: Branch predictor (hw), accuracy of 90%

Execution time = 100 + 20 branches × 10% misses × 5-cycle penalty = 110 cycles
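The control-hazard arithmetic can be checked with a short script. It is a sketch under the slide's assumptions: 100 instructions (100 idealized cycles), 20 branches of which 6 are taken, and a 5-cycle penalty for every branch that stalls. The predictor case assumes each misprediction pays the full 5-cycle penalty.

```python
INSTRUCTIONS = 100   # also the idealized cycle count (no stalls)
BRANCHES = 20
TAKEN = 6
NOT_TAKEN = BRANCHES - TAKEN
PENALTY = 5          # cycles lost per branch that stalls


def exec_time(stalled_branches):
    """Idealized cycles plus the penalty for each branch that stalls."""
    return INSTRUCTIONS + stalled_branches * PENALTY


panic_and_wait = exec_time(BRANCHES)   # every branch stalls
fetch_next = exec_time(TAKEN)          # only taken branches stall

delay_slot = {
    "A": exec_time(0),          # slot always useful
    "B": exec_time(TAKEN),      # useful only on the common (not-taken) fork
    "C": exec_time(NOT_TAKEN),  # useful only on the uncommon (taken) fork
    "D": exec_time(BRANCHES),   # no-op, never useful
}

ACCURACY_PCT = 90
mispredicts = BRANCHES * (100 - ACCURACY_PCT) // 100
predictor = exec_time(mispredicts)

print(panic_and_wait, fetch_next, delay_slot, predictor)
```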



**Out of Order Processor** 

#### **Cache Latency**

**Assumptions**

- 1000 instructions, 1000 cycles, no stalls with L1 hits (idealized: all L1 hits, L1 latency already captured)
- # loads/stores: 400
- % of loads/stores that show up at L2: 5% (5% of 400 = 20)
- % of loads/stores that show up at L3: 3% (3% of 400 = 12)
- % of loads/stores that show up at mem: 1% (1% of 400 = 4)
- L1 acc = 2 cyc, L2 acc = 10 cyc, L3 acc = 25 cyc, mem acc = 200 cyc

Execution time for 1000 instrs = 1000 cyc (idealized) + 20 × 10 (L2) + 12 × 25 (L3) + 4 × 200 (mem) = 1000 + 200 + 300 + 800 = 2300 cyc

CPI = 2300/1000 = 2.3
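The cache-latency computation can be reproduced as follows. The sketch assumes the numbers read off the slide: 1000 instructions in 1000 idealized cycles, 400 loads/stores, with 5%, 3%, and 1% of them showing up at L2, L3, and memory, at 10, 25, and 200 cycles respectively.

```python
INSTRUCTIONS = 1000
BASE_CYCLES = 1000     # idealized: all L1 hits, L1 latency already captured
LOADS_STORES = 400

# level -> (percent of loads/stores that show up there, access latency in cycles)
LEVELS = {
    "L2": (5, 10),
    "L3": (3, 25),
    "mem": (1, 200),
}

stall_cycles_by_level = {}
for level, (percent, latency) in LEVELS.items():
    accesses = LOADS_STORES * percent // 100
    stall_cycles_by_level[level] = accesses * latency

total_cycles = BASE_CYCLES + sum(stall_cycles_by_level.values())
cpi = total_cycles / INSTRUCTIONS

print(stall_cycles_by_level, total_cycles, cpi)
```

Memory contributes the largest share of the stall cycles even though only 4 accesses reach it.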

#### **Cache Size**

**Assumptions**

512KB cache, 8-way set-associative, 64-byte blocks, 32-bit addresses

$$512\,\text{KB} = \#\text{sets} \times 8 \times 64\,\text{B}$$

$$\#\text{sets} = \frac{512\,\text{KB}}{8 \times 64\,\text{B}} = \frac{2^{19}}{2^{9}} = 2^{10} = 1024 \text{ sets}$$
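The same relation (cache size = sets × ways × block size) gives the address breakdown as well. A minimal sketch, assuming all sizes are powers of two; `cache_geometry` is a hypothetical helper name.

```python
def cache_geometry(cache_bytes, ways, block_bytes, address_bits=32):
    """Return (sets, tag_bits, index_bits, offset_bits).

    Assumes cache_bytes, ways, and block_bytes are powers of two.
    """
    sets = cache_bytes // (ways * block_bytes)
    offset_bits = block_bytes.bit_length() - 1   # log2(block size)
    index_bits = sets.bit_length() - 1           # log2(number of sets)
    tag_bits = address_bits - index_bits - offset_bits
    return sets, tag_bits, index_bits, offset_bits


# 512KB, 8-way, 64-byte blocks, 32-bit addresses
print(cache_geometry(512 * 1024, 8, 64))
```

For this cache the result is 1024 sets, with a 16-bit tag, 10-bit index, and 6-bit offset.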

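**Assumptions**

16 sets, 1 way, 32-byte blocks

Access pattern: 4, 40, 400, 480, 512, 520, 1032, 1540

- Offset = address % 32 (address modulo 32, extract last 5 bits)
- Index = address/32 % 16 (shift right by 5, extract last 4 bits)
- Tag = address/512 (shift address right by 9)

32-bit address: 23 bits tag, 4 bits index, 5 bits offset

| Address | Tag (23 bits) | Index (4 bits) | Offset (5 bits) | H/M | Evicted address |
|---------|---------------|----------------|-----------------|-----|-----------------|
| 4:      | 0             | 0              | 4               | M   | Inv             |
| 40:     | 0             | 1              | 8               | M   | Inv             |
| 400:    | 0             | 12             | 16              | M   | Inv             |
| 480:    | 0             | 15             | 0               | M   | Inv             |
| 512:    | 1             | 0              | 0               | M   | 0               |
| 520:    | 1             | 0              | 8               | H   |                 |
| 1032:   | 2             | 0              | 8               | M   | 512             |
| 1540:   | 3             | 0              | 4               | M   | 1024            |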
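The access pattern 4, 40, 400, 480, 512, 520, 1032, 1540 can be replayed with a tiny direct-mapped cache model (16 sets, 32-byte blocks). This is a sketch with hypothetical names; it tracks only the tag stored in each set and reports the block address evicted on a miss.

```python
def simulate_direct_mapped(addresses, num_sets=16, block_bytes=32):
    """Replay addresses through a direct-mapped cache.

    Returns one tuple per access:
    (address, tag, index, offset, "H" or "M", evicted block address).
    The evicted field is "Inv" if the line was invalid and None on a hit.
    """
    lines = [None] * num_sets          # tag held by each set; None = invalid
    results = []
    for addr in addresses:
        offset = addr % block_bytes
        index = (addr // block_bytes) % num_sets
        tag = addr // (block_bytes * num_sets)
        if lines[index] == tag:
            results.append((addr, tag, index, offset, "H", None))
            continue
        old_tag = lines[index]
        if old_tag is None:
            evicted = "Inv"
        else:
            # Rebuild the starting address of the block being replaced.
            evicted = (old_tag * num_sets + index) * block_bytes
        lines[index] = tag
        results.append((addr, tag, index, offset, "M", evicted))
    return results


trace = simulate_direct_mapped([4, 40, 400, 480, 512, 520, 1032, 1540])
for row in trace:
    print(row)
```

The last four accesses all map to set 0, so each new tag evicts the previous block there (0, then 512, then 1024).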

### Example 0b

Show how the following addresses map to the cache and yield hits or misses.

The cache is direct-mapped, has 16 sets, and a 64-byte block size.

Addresses: 8, 96, 32, 480, 976, 1040, 1096



- Offset = address % 64 (address modulo 64, extract last 6 bits)
- Index = address/64 % 16 (shift right by 6, extract last 4 bits)
- Tag = address/1024 (shift address right by 10)

| Address | Tag (22 bits) | Index (4 bits) | Offset (6 bits) | H/M |
|---------|---------------|----------------|-----------------|-----|
| 8:      | 0             | 0              | 8               | M   |
| 96:     | 0             | 1              | 32              | M   |
| 32:     | 0             | 0              | 32              | H   |
| 480:    | 0             | 7              | 32              | M   |
| 976:    | 0             | 15             | 16              | M   |
| 1040:   | 1             | 0              | 16              | M   |
| 1096:   | 1             | 1              | 8               | M   |
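The "shift right, extract last N bits" wording for Example 0b maps directly onto bit operations. A minimal sketch with hypothetical names, assuming 6 offset bits and 4 index bits as in this example:

```python
OFFSET_BITS = 6   # 64-byte blocks
INDEX_BITS = 4    # 16 sets


def split_address(addr):
    """Split an address into (tag, index, offset) using shifts and masks."""
    offset = addr & ((1 << OFFSET_BITS) - 1)                  # last 6 bits
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)   # next 4 bits
    tag = addr >> (OFFSET_BITS + INDEX_BITS)                  # remaining bits
    return tag, index, offset


fields = {a: split_address(a) for a in [8, 96, 32, 480, 976, 1040, 1096]}
for addr, (tag, index, offset) in fields.items():
    print(f"{addr:5d}: tag={tag} index={index} offset={offset}")
```

Addresses 8 and 32 share tag 0 and index 0, which is why the access to 32 hits after 8 has brought the block in.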

6. Consider a 4-processor multiprocessor connected with a shared bus that has the following properties: (i) centralized shared memory accessible with the bus, (ii) snooping-based MSI cache coherence protocol, (iii) write-invalidate policy. Also assume that the caches have a writeback policy. Initially, the caches all have invalid data. The processors issue the following three requests, one after the other. Similar to slide 4 of lecture 25, fill in the following table to indicate what happens for every request. Also indicate if/when memory writeback is performed. (12 points)

(a) P3: Read X

(b) P3: Write X

(c) P2: Write X

M -> S: memory writeback

| Request  | Cache Hit/Miss | Request on bus | Who responds | State Cache 1 | State Cache 2 | State Cache 3 | State Cache 4 |
|----------|----------------|----------------|--------------|---------------|---------------|---------------|---------------|
|          |                |                |              | Inv           | Inv           | Inv           | Inv           |
| P3: Rd X | Rd Miss        | Rd Miss X      | Mem          | Inv           | Inv           | Sh            | Inv           |
| P3: Wr X | Perms Miss     | Upgrade        | No one       | Inv           | Inv           | M             | Inv           |
| P2: Wr X | Wr Miss        | Wr Miss X      | P3 responds  | Inv           | M             | Inv           | Inv           |
| P1: Rd X | Rd Miss        | Rd Miss X      | P2 responds  | Sh            | Sh            | Inv           | Inv           |
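The table can be replayed with a small model of a snooping MSI protocol for a single block X. This is a sketch, not the protocol definition from lecture: it assumes write-invalidate, that a cache in M responds to a miss from another processor, and that memory is written back only on an M -> S transition. Function and label names are hypothetical.

```python
def msi_step(states, proc, op):
    """Apply one request to the MSI states of block X.

    states: list of "I", "S", or "M", one per cache (index 0 is P1).
    proc:   0-based index of the requesting processor.
    op:     "R" for read, "W" for write.
    Returns (new_states, cache_event, bus_request, responder, mem_writeback).
    """
    s = list(states)
    mine = s[proc]
    owner = s.index("M") if "M" in s else None

    if op == "R":
        if mine in ("S", "M"):
            return s, "Rd Hit", None, None, False
        s[proc] = "S"
        if owner is not None:
            s[owner] = "S"             # M -> S: dirty data is written back
            return s, "Rd Miss", "Rd Miss", f"P{owner + 1}", True
        return s, "Rd Miss", "Rd Miss", "Mem", False

    # Write request
    if mine == "M":
        return s, "Wr Hit", None, None, False
    event, bus = ("Perms Miss", "Upgrade") if mine == "S" else ("Wr Miss", "Wr Miss")
    if mine == "S":
        responder = None               # data already present, only permissions needed
    else:
        responder = f"P{owner + 1}" if owner is not None else "Mem"
    s = ["I"] * len(s)                 # invalidate every other copy
    s[proc] = "M"
    return s, event, bus, responder, False


states = ["I", "I", "I", "I"]
history = []
for proc, op in [(2, "R"), (2, "W"), (1, "W"), (0, "R")]:  # P3 Rd, P3 Wr, P2 Wr, P1 Rd
    states, event, bus, responder, writeback = msi_step(states, proc, op)
    history.append((states, event, bus, responder, writeback))
    print(f"P{proc + 1} {op}: {event}, bus={bus}, responder={responder}, "
          f"states={states}, writeback={writeback}")
```

Only the last request (P1 reading while P2 holds the block in M) triggers a memory writeback in this model.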

How does Meltdown work?

How does Spectre work?

How can you force a footprint? (the relevant code sequence)

How can you examine footprints? (exploiting the side channel)

How can you defend against these attacks?

illegal access

What does the programmer/compiler deal with? What does the OS deal with? How is translation done efficiently?

Page tables, TLB

Why do multiprocs need to deal with prog. models, coherence, synchronization, consistency? What are race conditions?

What is an example synchronization primitive and how is it implemented?

What consistency model is assumed by a programmer?

Why is it slow? no reordering

How do I make life easier for the programmer and provide high performance?

locks

reordering in most places

What are the central philosophies in a GPU? In what ways does the GPU design differ from a CPU?

What are the different ways that disks provide high reliability? Can you explain how parity is used to recover lost data?
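For the parity question, a minimal sketch of XOR parity across disks: the parity block is the bytewise XOR of the data blocks, so any single lost block equals the XOR of the survivors and the parity. The block contents and helper names are hypothetical.

```python
from functools import reduce


def parity_block(blocks):
    """Bytewise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def recover(surviving_blocks, parity):
    """Rebuild the one missing block from the survivors plus the parity block."""
    return parity_block(list(surviving_blocks) + [parity])


data = [b"AAAA", b"BBBB", b"CCCC"]   # blocks on three data disks
parity = parity_block(data)          # stored on a fourth disk

# Suppose the second disk fails: rebuild its block from the other two and parity.
rebuilt = recover([data[0], data[2]], parity)
print(rebuilt)
```

This works for exactly one missing block per stripe. The CRC on each sector identifies which block is bad, and parity supplies the correction.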