## CS/EE 3810: Computer Organization

## Lecture 19: Final Recap

Anton Burtsev December , 2022

## **MIPS** Datapath



## Designing a Latch

An S-R latch: set-reset latch

- When Set is high, a 1 is stored
- When Reset is high, a 0 is stored
- When both are low, the previous state is preserved (hence, known as a storage or memory element)
- Both are high this set of inputs is not allowed

Verify the above behavior!



## **D** Latch

Incorporates a clock

• The value of the input D signal (data) is stored only when the clock is high – the previous state is preserved when the clock is low



## D Flip Flop

• Terminology:

Latch: outputs can change any time the clock is high (asserted) Flip flop: outputs can change only on a clock edge

• Two D latches in series – ensures that a value is stored only on the falling edge of the clock



# Register file



# Register file (read ports)

- Build from D-flops
- Read ports
  - Multiplexors
  - N-to-1, 32bit wide



## Common Logic Blocks – Multiplexor

 Multiplexor or selector: one of N inputs is reflected on the output depending on the value of the log<sub>2</sub>N selector bits



# Register file (write ports)

• Decoder



## Common Logic Blocks – Decoder

Takes in N inputs and activates one of  $2^{N}$  outputs

| I <sub>o</sub> | <b>I</b> <sub>1</sub> | <b>I</b> <sub>2</sub> | <b>O</b> <sub>0</sub> | <b>O</b> <sub>1</sub> | <b>O</b> <sub>2</sub> | <b>O</b> <sub>3</sub> | <b>O</b> <sub>4</sub> | <b>O</b> <sub>5</sub> | <b>O</b> <sub>6</sub> | <b>O</b> <sub>7</sub> |  |
|----------------|-----------------------|-----------------------|-----------------------|-----------------------|-----------------------|-----------------------|-----------------------|-----------------------|-----------------------|-----------------------|--|
| 0              | 0                     | 0                     | 1                     | 0                     | 0                     | 0                     | 0                     | 0                     | 0                     | 0                     |  |
| 0              | 0                     | 1                     | 0                     | 1                     | 0                     | 0                     | 0                     | 0                     | 0                     | 0                     |  |
| 0              | 1                     | 0                     | 0                     | 0                     | 1                     | 0                     | 0                     | 0                     | 0                     | 0                     |  |
| 0              | 1                     | 1                     | 0                     | 0                     | 0                     | 1                     | 0                     | 0                     | 0                     | 0                     |  |
| 1              | 0                     | 0                     | 0                     | 0                     | 0                     | 0                     | 1                     | 0                     | 0                     | 0                     |  |
| 1              | 0                     | 1                     | 0                     | 0                     | 0                     | 0                     | 0                     | 1                     | 0                     | 0                     |  |
| 1              | 1                     | 0                     | 0                     | 0                     | 0                     | 0                     | 0                     | 0                     | 1                     | 0                     |  |
| 1              | 1                     | 1                     | 0                     | 0                     | 0                     | 0                     | 0                     | 0                     | 0                     | 1                     |  |



# Register file (write ports)

 What if we Write read and write С 0 **Register 0** the register in D n-to-2<sup>n</sup> Register number the same decoder С Register 1 n-2cycle? D n-1add \$s0, \$s0, \$s1 • С Register n-2 Л С Register n-1Register data

## **MIPS** Datapath



Pipelining

## Building a Car

Time

## Building a Car



## The Assembly Line



Time

Break the job into smaller stages

## Performance Improvements?

Does it take longer to finish each individual job?

Does it take shorter to finish a series of jobs?

What assumptions were made while answering these questions?

Is a 10-stage pipeline better than a 5-stage pipeline?

# **Quantitative Effects**

- As a result of pipelining:
  - Time in ns per instruction goes up
  - Each instruction takes more cycles to execute
  - But... average CPI remains roughly the same
  - Clock speed goes up
  - Total execution time goes down, resulting in lower average time per instruction
  - Under ideal conditions, speedup

= ratio of elapsed times between successive instruction completions

= number of pipeline stages = increase in clock speed

## **Clocks and Latches**



## **Clocks and Latches**



#### Some Equations

- Unpipelined: time to execute one instruction = T + Tovh
- For an N-stage pipeline, time per stage = T/N + Tovh
- Total time per instruction = N (T/N + Tovh) = T + N Tovh
- Clock cycle time = T/N + Tovh
- Clock speed = 1 / (T/N + Tovh)
- Ideal speedup = (T + Tovh) / (T/N + Tovh)
- Cycles to complete one instruction = N
- Average CPI (cycles per instr) = 1

#### Hazards

## A 5-Stage Pipeline



Source: H&P textbook <sup>24</sup>

#### Hazards

- Structural hazards: different instructions in different stages (or the same stage) conflicting for the same resource
- Data hazards: an instruction cannot continue because it needs a value that has not yet been generated by an earlier instruction
- Control hazard: fetch cannot continue because it does not know the outcome of an earlier branch – special case of a data hazard – separate category because they are treated in different ways

#### Structural hazards

## A 5-Stage Pipeline



Source: H&P textbook <sup>27</sup>

### **Structural Hazards**

- Example: a unified instruction and data cache  $\rightarrow$  stage 4 (MEM) and stage 1 (IF) can never coincide
- The later instruction and all its successors are delayed until a cycle is found when the resource is free → these are pipeline bubbles
- Structural hazards are easy to eliminate increase the number of resources (for example, implement a separate instruction and data cache)

#### Data hazards

## A 5-Stage Pipeline



Source: H&P textbook <sup>30</sup>

## **Pipeline Implementation**

- Signals for the muxes have to be generated some of this can happen during ID
- Need look-up tables to identify situations that merit bypassing/stalling the number of inputs to the muxes goes up



31















 For the 5-stage pipeline, bypassing can eliminate delays between the following example pairs of instructions: add/sub
 R1, R2, R3 add/sub/lw/sw
 R4, R1, R5

| W  | R1, 8(R2) |
|----|-----------|
| SW | R1, 4(R3) |

The following pairs of instructions will have intermediate stalls:

| lw<br>add/sub/ | /lw | R1, 8(R2)<br>R3, R1, R4 | or | SW | R3, 8(R1) |
|----------------|-----|-------------------------|----|----|-----------|
| fmul<br>fadd   | ,   | F2, F3<br>F1, F4        |    |    |           |

#### **Control hazards**



- Structural hazards
- Data hazards
- Control hazards

# **Control Hazards**

- Simple techniques to handle control hazard stalls:
  - for every branch, introduce a stall cycle (note: every 6<sup>th</sup> instruction is a branch on average!)
  - assume the branch is not taken and start fetching the next instruction – if the branch is taken, need hardware to cancel the effect of the wrong-path instructions
  - predict the next PC and fetch that instr if the prediction is wrong, cancel the effect of the wrong-path instructions
  - Fetch the next instruction (branch delay slot) and execute it anyway – if the instruction turns out to be on the correct path, useful work was done – if the instruction turns out to be on the wrong path, hopefully program state is not lost

# Branch delay slot



39

**Branch prediction** 

## **Control Hazards**

- Simple techniques to handle control hazard stalls:
  - for every branch, introduce a stall cycle (note: every 6<sup>th</sup> instruction is a branch!)
  - assume the branch is not taken and start fetching the next instruction – if the branch is taken, need hardware to cancel the effect of the wrong-path instruction
  - Fetch the next instruction (branch delay slot) and execute it anyway – if the instruction turns out to be on the correct path, useful work was done – if the instruction turns out to be on the wrong path, hopefully program state is not lost
    - make a smarter guess and fetch instructions from the expected target

## **Branch Delay Slots**



42

Source: H&P textbook

#### • Pipeline without Branch Predictor



In the 5-stage pipeline, a branch completes in two cycles  $\rightarrow$ If the branch went the wrong way, one incorrect instr is fetched  $\rightarrow$ One stall cycle per incorrect branch

#### • Pipeline with Branch Predictor



In the 5-stage pipeline, a branch completes in two cycles  $\rightarrow$ If the branch went the wrong way, one incorrect instr is fetched  $\rightarrow$ One stall cycle per incorrect branch

#### • 1-Bit Bimodal Prediction

- For each branch, keep track of what happened last time and use that outcome as the prediction
- What are prediction accuracies for branches 1 and 2 below:

```
while (1) {
    for (i=0;i<10;i++) {
        branch-1
        ...
    }
    for (j=0;j<20;j++) {
        branch-2
        ...
    }
}</pre>
```

#### • 2-Bit Bimodal Prediction

- For each branch, maintain a 2-bit saturating counter: if the branch is taken: counter = min(3,counter+1) if the branch is not taken: counter = max(0,counter-1)
- If (counter >= 2), predict taken, else predict not taken
- Advantage: a few atypical branches will not influence the prediction (a better measure of "the common case")
- Especially useful when multiple branches share the same counter (some bits of the branch PC are used to index into the branch predictor)
- Can be easily extended to N-bits (in most processors, N=2)

### **Bimodal Predictor**



# **Multicycle Instructions**



© 2003 Elsevier Science (USA). All rights reserved.

- Multiple parallel pipelines each pipeline can have a different number of stages
- Instructions can now complete out of order must make sure that writes to a register happen in the correct order

# • The Alpha 21264 Out-of-Order Implementation



Issue Queue (IQ)

• Logical and physical registers

Temporary values are stored in the register file and not the ROB



At the start, R1-R32 can be found in P1-P32 Instructions stop entering the pipeline when P64 is assigned



#### The Commit Process

- On commit, no copy is required
- The register map table is updated the "committed" value of R1 is now in P33 and not P1 – on an exception, P33 is copied to memory and not P1
- An instruction in the issue queue need not modify its input operand when the producer commits
- When instruction-1 commits, we no longer have any use for P1 – it is put in a free pool and a new instruction can now enter the pipeline → for every instr that commits, a new instr can enter the pipeline → number of in-flight instrs is a constant = number of extra (rename) registers



- When does the decode stage stall? When we either run out of registers, or ROB entries, or issue queue entries
- Issue width: the number of instructions handled by each stage in a cycle. High issue width → high peak ILP
- Window size: the number of in-flight instructions in the pipeline. Large window size → high ILP
- No more WAR and WAW hazards because of rename registers – must only worry about RAW hazards

# • The Alpha 21264 Out-of-Order Implementation



Issue Queue (IQ)

# Additional Details

- When does the decode stage stall? When we either run out of registers, or ROB entries, or issue queue entries
- Issue width: the number of instructions handled by each stage in a cycle. High issue width → high peak ILP
- Window size: the number of in-flight instructions in the pipeline. Large window size → high ILP
- No more WAR and WAW hazards because of rename registers – must only worry about RAW hazards

# • The Alpha 21264 Out-of-Order Implementation



Issue Queue (IQ)

## Caches

- Data and instructions are stored on DRAM chips DRAM is a technology that has high bit density, but relatively poor latency – an access to data in memory can take as many as 300 cycles today!
- Hence, some data is stored on the processor in a structure called the cache – caches employ SRAM technology, which is faster, but has lower bit density
- Internet browsers also cache web pages same concept

# **Memory Hierarchy**

• As you go further, capacity and latency increase



# Locality

- Why do caches work?
  - Temporal locality: if you used some data recently, you will likely use it again
  - Spatial locality: if you used some data recently, you will likely access its neighbors
    - No hierarchy: average access time for data = 300 cycles
    - 32KB 1-cycle L1 cache that has a hit rate of 95%: average access time = 0.95 x 1 + 0.05 x (301) = 16 cycles

# Accessing the Cache



# The Tag Array



## **Example Access Pattern**



# Increasing Line Size



# Associativity



# Associativity



Memory and virtual memory

# Paging

# Pages





# Pages



# Paging idea

- Break up memory into 4096-byte chunks called pages
  - Modern hardware supports 2MB, 4MB, and 1GB pages
- Independently control mapping for each page of linear address space

- Compare with segmentation (single base + limit)
  - many more degrees of freedom

# How can we build this translation mechanism?

#### Paging: naive approach: translation array



• Virtual address 0x410010

## What is wrong?

# What is wrong?

- We need 4 bytes to relocate each page
  - 20 bits for physical page number
  - 12 bits of access flags
  - Therefore, we need array of 4 bytes x 1M entries
    - 4MBs

#### Paging: page table















#### Cache-coherence

#### **Snooping-Based Protocols**

- Three states for a block: invalid, shared, modified
- A write is placed on the bus and sharers invalidate themselves
- The protocols are referred to as MSI, MESI, etc.



#### Example

- P1 reads X: not found in cache-1, request sent on bus, memory responds,
   X is placed in cache-1 in shared state
- P2 reads X: not found in cache-2, request sent on bus, everyone snoops this request, cache-1does nothing because this is just a read request, memory responds, X is placed in cache-2 in shared state



- P1 writes X: cache-1 has data in shared state (shared only provides read perms), request sent on bus, cache-2 snoops and then invalidates its copy of X, cache-1 moves its state to modified
- P2 reads X: cache-2 has data in invalid state, request sent on bus, cache-1 snoops and realizes it has the only valid copy, so it downgrades itself to shared state and responds with data, X is placed in cache-2 in shared state, memory is also updated

#### Example

| Request  | Cache<br>Hit/Miss | Request<br>on the bus | Who responds                                | State in<br>Cache 1 | State in<br>Cache 2 | State in<br>Cache 3 | State in<br>Cache 4 |
|----------|-------------------|-----------------------|---------------------------------------------|---------------------|---------------------|---------------------|---------------------|
|          |                   |                       |                                             | Inv                 | Inv                 | Inv                 | Inv                 |
| P1: Rd X | Miss              | Rd X                  | Memory                                      | S                   | Inv                 | Inv                 | Inv                 |
| P2: Rd X | Miss              | Rd X                  | Memory                                      | S                   | S                   | Inv                 | Inv                 |
| P2: Wr X | Perms<br>Miss     | Upgrade X             | No response.<br>Other caches<br>invalidate. | Inv                 | М                   | Inv                 | Inv                 |
| P3: Wr X | Write<br>Miss     | Wr X                  | P2 responds                                 | Inv                 | Inv                 | М                   | Inv                 |
| P3: Rd X | Read Hit          | -                     | -                                           | Inv                 | Inv                 | М                   | Inv                 |
| P4: Rd X | Read<br>Miss      | Rd X                  | P3 responds.<br>Mem wrtbk                   | Inv                 | Inv                 | S                   | S                   |

#### Before midterm

## Simple program: String Copy

• MIPS code:

| stro | cpy: |       |                           |   |                         |
|------|------|-------|---------------------------|---|-------------------------|
|      | addi | \$sp, | \$sp, -4                  | # | adjust stack for 1 item |
|      | SW   | \$s0, | 0(\$sp)                   | # | save \$s0               |
|      | add  | \$s0, | <pre>\$zero, \$zero</pre> | # | i = 0                   |
| L1:  | add  | \$t1, | \$s0, \$a1                | # | addr of y[i] in \$t1    |
|      | lbu  | \$t2, | 0(\$t1)                   | # | $t_{2} = y[i]$          |
|      | add  | \$t3, | \$s0, \$a0                | # | addr of x[i] in \$t3    |
|      | sb   | \$t2, | 0(\$t3)                   | # | x[i] = y[i]             |
|      | beq  | \$t2, | \$zero, L2                | # | exit loop if y[i] == 0  |
|      | addi | \$s0, | \$s0, 1                   | # | i = i + 1               |
|      | j    | L1    |                           | # | next iteration of loop  |
| L2:  | lw   | \$s0, | 0(\$sp)                   | # | restore saved \$s0      |
|      | addi | \$sp, | \$sp, 4                   | # | pop 1 item from stack   |
|      | jr   | \$ra  |                           | # | and return              |

- Where is this program stored?
- How does it get executed?

#### Instruction Fetch



# Three types of instructions

- R-Type
  - Add, and, jr, nor, or, slt, sll, sub
- I-Type
  - addi, beq, bne, lw (and other loads), ori, slti, sw (and other stores)
- J-Type
  - j, jal

#### **MIPS** Reference Data

1

OPCODE

#### CORE INSTRUCTION SET

| FOR- / FUNCT                   |       |     |                                                          |         |                       |  |  |
|--------------------------------|-------|-----|----------------------------------------------------------|---------|-----------------------|--|--|
| NAME, MNEMONIC                 |       | MAT |                                                          |         | (Hex)                 |  |  |
| Add add                        |       | R   | R[rd] = R[rs] + R[rt]                                    | (1)     | 0 / 20 <sub>hex</sub> |  |  |
| Add Immediate                  | addi  | Ι   | R[rt] = R[rs] + SignExtImm                               | (1,2)   | 8 <sub>hex</sub>      |  |  |
| Add Imm. Unsigned addiu        |       | Ι   | R[rt] = R[rs] + SignExtImm                               | (2)     | 9 <sub>hex</sub>      |  |  |
| Add Unsigned                   | addu  | R   | R[rd] = R[rs] + R[rt]                                    |         | 0 / 21 <sub>hex</sub> |  |  |
| And                            | and   | R   | R[rd] = R[rs] & R[rt]                                    |         | 0 / 24 <sub>hex</sub> |  |  |
| And Immediate                  | andi  | Ι   | R[rt] = R[rs] & ZeroExtImm                               | (3)     | c <sub>hex</sub>      |  |  |
| Branch On Equal                | beq   | Ι   | if(R[rs]==R[rt])<br>PC=PC+4+BranchAddr                   | (4)     | 4 <sub>hex</sub>      |  |  |
| Branch On Not Equa             | lbne  | Ι   | if(R[rs]!=R[rt])<br>PC=PC+4+BranchAddr                   | (4)     | 5 <sub>hex</sub>      |  |  |
| Jump                           | j     | J   | PC=JumpAddr                                              | (5)     | $2_{\text{hex}}$      |  |  |
| Jump And Link                  | jal   | J   | R[31]=PC+8;PC=JumpAddr                                   | (5)     | 3 <sub>hex</sub>      |  |  |
| Jump Register                  | jr    | R   | PC=R[rs]                                                 |         | 0 / 08 <sub>hex</sub> |  |  |
| Load Byte Unsigned             | lbu   | Ι   | $R[rt] = \{24'b0, M[R[rs] + SignExtImm](7:0)\}$          | (2)     | $24_{hex}$            |  |  |
| Load Halfword<br>Unsigned      | lhu   | Ι   | $R[rt] = \{16'b0, M[R[rs] + SignExtImm](15:0)\}$         | (2)     | $25_{\rm hex}$        |  |  |
| Load Linked                    | 11    | Ι   | R[rt] = M[R[rs]+SignExtImm]                              | (2,7)   | 30 <sub>hex</sub>     |  |  |
| Load Upper Imm.                | lui   | Ι   | $R[rt] = \{imm, 16'b0\}$                                 |         | fhex                  |  |  |
| Load Word                      | lw    | Ι   | R[rt] = M[R[rs]+SignExtImm]                              | (2)     | $23_{hex}$            |  |  |
| Nor                            | nor   | R   | $R[rd] = \sim (R[rs]   R[rt])$                           |         | 0 / 27 <sub>hex</sub> |  |  |
| Or                             | or    | R   | R[rd] = R[rs]   R[rt]                                    |         | 0 / 25 <sub>hex</sub> |  |  |
| Or Immediate                   | ori   | Ι   | R[rt] = R[rs]   ZeroExtImm                               | (3)     | d <sub>hex</sub>      |  |  |
| Set Less Than                  | slt   | R   | R[rd] = (R[rs] < R[rt]) ? 1 : 0                          |         | 0 / 2a <sub>hex</sub> |  |  |
| Set Less Than Imm.             | slti  | Ι   | R[rt] = (R[rs] < SignExtImm)? 1                          | : 0 (2) | a <sub>hex</sub>      |  |  |
| Set Less Than Imm.<br>Unsigned | sltiu | Ι   | R[rt] = (R[rs] < SignExtImm)<br>? 1 : 0                  | (2,6)   | b <sub>hex</sub>      |  |  |
| Set Less Than Unsig            | .sltu | R   | R[rd] = (R[rs] < R[rt]) ? 1 : 0                          | (6)     | 0 / 2b <sub>hex</sub> |  |  |
| Shift Left Logical             | sll   | R   | $R[rd] = R[rt] \leq shamt$                               |         | 0 / 00 <sub>hex</sub> |  |  |
| Shift Right Logical            | srl   | R   | R[rd] = R[rt] >>> shamt                                  |         | 0 / 02 <sub>hex</sub> |  |  |
| Store Byte                     | sb    | Ι   | M[R[rs]+SignExtImm](7:0) = R[rt](7:0)                    | (2)     | $28_{hex}$            |  |  |
| Store Conditional              | SC    | Ι   | M[R[rs]+SignExtImm] = R[rt];<br>R[rt] = (atomic) ? 1 : 0 | (2,7)   | 38 <sub>hex</sub>     |  |  |
| Store Halfword                 | sh    | Ι   | M[R[rs]+SignExtImm](15:0) = R[rt](15:0)                  | (2)     | 29 <sub>hex</sub>     |  |  |
| Store Word                     | SW    | Ι   | M[R[rs]+SignExtImm] = R[rt]                              | (2)     | $2b_{hex}$            |  |  |
| Subtract                       | sub   | R   | R[rd] = R[rs] - R[rt]                                    | (1)     | $0/22_{hex}$          |  |  |
| Subtract Unsigned              | subu  | R   | R[rd] = R[rs] - R[rt]                                    |         | 0 / 23 <sub>hex</sub> |  |  |

## Fits on one page

# Three types of instructions

• R-Type

| ор     | rs     | rt     | rd     | shamt  | funct  |
|--------|--------|--------|--------|--------|--------|
| 6 bits | 5 bits | 5 bits | 5 bits | 5 bits | 6 bits |

I-Type

| ор     | rs     | rt     | constant or address |
|--------|--------|--------|---------------------|
| 6 bits | 5 bits | 5 bits | 16 bits             |

• J-Type

| ор     | rs     | rt     | constant or address |
|--------|--------|--------|---------------------|
| 6 bits | 5 bits | 5 bits | 16 bits             |

## **Datapath With Control**



## **Datapath With Jumps Added**



How do we get from high-level languages to MIPS instructions?



## **Translation and Startup**



#### If then .. else ...

if (i == j) f = g + h; else f = g - h;

• Assume that f,g, h, i, and j are in \$s0, \$s1, etc.

#### If then ... else ...

if (i == j) f = g + h; else f = g - h;

• Assume that f,g, h, i, and j are in \$s0, \$s1, etc.

bne \$s3,\$s4,Else # go to Else if i ≠ j
add \$s0,\$s1,\$s2 # f = g + h (skipped if i ≠ j)
j Exit # unconditional jump to Exit
Else:

sub \$s0,\$s1,\$s2 # f = g - h (skipped if i = j)
Exit:

#### Loops

while (save[i] == k)

i += 1;

• Assume that i and k are in \$s3 and \$s5

#### Loops

while (save[i] == k)

i += 1;

• Assume that i and k are in \$s3 and \$s5

```
Loop: sll $t1,$s3,2

add $t1,$t1,$s6  # $t1 = address of save[i]

lw $t0,0($t1)  # Temp reg $t0 = save[i]

bne $t0,$s5, Exit # go to Exit if save[i] ≠ k

addi $s3,$s3,1  # i = i + 1

j Loop  # go to Loop
```

Exit:

#### Procedures

# Calling functions

```
// some code...
foo();
// more code..
```

- \$ra contains information for how to return from a subroutine
  - i.e., from foo()

• Functions can be called from different places in the program

```
if (a == 0) {
    foo();
    ...
} else {
```

```
foo();
```

...

}

## **Procedure Call Instructions**

- Procedure call: jump and link jal ProcedureLabel
  - Address of following instruction put in \$ra
  - Jumps to target address
- Procedure return: jump register
  - jr \$ra
    - Copies \$ra to program counter
    - Can also be used for computed jumps
      - e.g., for case/switch statements

# Calling conventions

- Goal: re-entrant programs
  - How to pass arguments
    - On the stack?
    - In registers?
  - How to return values
    - On the stack?
    - In registers?
  - What registers have to be preserved
    - All? Some subset?
- Conventions differ from compiler, optimizations, etc.

# Passing arguments

- First 4 arguments in registers
  - \$a0 \$a3
- Other arguments on the stack
- Return values in registers
  - \$v0 \$v1

# **Preserving registers**

- \$t0 \$t9: temporaries
  - Can be overwritten by callee
- \$s0 \$s7: saved
  - Must be saved/restored by callee

## Leaf Procedure Example

- C code: int leaf\_example (int g, h, i, j) { int f; f = (g + h) - (i + j); return f; }
  - Arguments g, ..., j in \$a0, ..., \$a3
  - f in \$s0 (hence, need to save \$s0 on stack)
  - Result in \$v0

## Leaf Procedure Example

• MIPS code:



### **Recursive invocations**

```
foo(int a) {
    if (a == 0)
        return;
    a--;
    foo(a);
    return;
}
```

foo(4);

## Non-Leaf Procedures

- Procedures that call other procedures
- For nested call, caller needs to save on the stack:
  - Its return address
  - Any arguments and temporaries needed after the call
- Restore from the stack after the call

## Non-Leaf Procedure Example

```
• C code:
int fact (int n)
{
    if (n < 1) return f;
    else return n * fact(n - 1);
}
```

- Argument n in \$a0
- Result in \$v0

## Non-Leaf Procedure Example

• MIPS code:

| fact: |      |       |            |   |                          |
|-------|------|-------|------------|---|--------------------------|
|       | addi | \$sp, | \$sp, -8   | # | adjust stack for 2 items |
|       | SW   | \$ra, | 4(\$sp)    | # | save return address      |
|       |      |       | 0(\$sp)    |   | save argument            |
|       | slti | \$t0, | \$a0, 1    | # | test for n < 1           |
|       | beq  | \$t0, | \$zero, L1 |   |                          |
|       | addi | \$v0, | \$zero, l  | # | if so, result is 1       |
|       | addi | \$sp, | \$sp, 8    | # | pop 2 items from stack   |
|       | jr   |       |            | # | and return               |
| L1:   | addi | \$a0, | \$a0, -1   | # | else decrement n         |
|       | jal  | fact  |            | # | recursive call           |
|       | lw   | \$a0, | 0(\$sp)    | # | restore original n       |
|       | lw   | \$ra, | 4(\$sp)    | # | and return address       |
|       | addi | \$sp, | \$sp, 8    | # | pop 2 items from stack   |
|       | mul  | \$v0, | \$a0, \$v0 | # | multiply to get result   |
|       | jr   | \$ra  |            | # | and return               |

# Calling convention (again)

| Preserved                     | Not preserved                            |
|-------------------------------|------------------------------------------|
| Saved registers: \$s0-\$s7    | Temporary registers: \$t0-\$t9           |
| Stack pointer register: \$sp  | Argument registers: \$a0-\$a3            |
| Return address register: \$ra | <b>Return value registers:</b> \$v0-\$v1 |
| Stack above the stack pointer | Stack below the stack pointer            |











Relocation (What needs to be done to move the program in memory?)

# What types of variables do you know?

• Or where these variables are allocated in memory?

# What types of variables do you know?

- Global variables
  - Initialized  $\rightarrow$  data section
  - Uninitalized  $\rightarrow$  BSS
- Dynamic variables
  - Heap
- Local variables
  - Stack

## **Global variables**

1. #include <stdio.h>

2.

- 3. char hello[] = "Hello";
- 4. int main(int ac, char \*\*av)
- 5. {
- 6. static char world[] = "world!";
- 7. printf("%s %s\n", hello, world);
- 8. return 0;

9. }

## **Global variables**

```
1. #include <stdio.h>
2.
3. char hello[] = "Hello";
4. int main(int ac, char **av)
5. {
6. static char world[] = "world!";
7. printf("%s %s\n", hello, world);
8. return 0;
9. }
```

- Allocated in the data section
  - It is split in initialized (non-zero), and non-initialized (zero)
  - As well as read/write, and read only data section

### **Global variables**

## Dynamic variables (heap)

```
1. #include <stdio.h>
2. #include <string.h>
3. #include <stdlib.h>
4.
5. char hello[] = "Hello";
6. int main(int ac, char **av)
7. {
8.
     char world[] = "world!";
9.
      char *str = malloc(64);
      memcpy(str, "beautiful", 64);
10.
11. printf("%s %s %s\n", hello, str, world);
12. return 0;
13.}
```

# Dynamic variables (heap)

```
1. #include <stdio.h>
2. #include <string.h>
3. #include <stdlib.h>
4.
5. char hello [] = "Hello";
6. int main(int ac, char **av)
7. {
8. char world[] = "world!";
9. char *str = malloc(64);
10.
      memcpy(str, "beautiful", 64);
      printf("%s %s %s\n", hello, str, world);
11.
12.
      return 0;
13.}
```

- Allocated on the heap
  - Special area of memory provided by the OS from where malloc() can allocate memory

## Local variables

- Local variables
- 1. #include <stdio.h>
- 2.
- 3. char hello[] = "Hello";
- 4. int main(int ac, char \*\*av)

5. {

- 6. //static char world[] = "world!";
- 7. char world[] = "world!";
- 8. printf("%s %s\n", hello, world);
- 9. return 0;

10.}

```
1 # "Hello World" in MIPS assembly
2 # From: http://labs.cs.upt.ro/labs/so2/html/resources/nachos-doc/mipsf.html
 3
4 # All program code is placed after the
 5 # .text assembler directive
 6 .text
 7
8 # Declare main as a global function
9 .globl main
10
11 # The label 'main' represents the starting point
12 main:
          # Run the print_string syscall which has code 4
13
14
           li
                   $v0,4
                                  # Code for syscall: print_string
                                 # Pointer to string (load the address of msg)
15
                   $a0, msg
           la
           syscall
16
17
          li
                   $v0,10
                                 # Code for syscall: exit
           syscall
18
19
20 # All memory structures are placed after the
21 # .data assembler directive
22
           .data
23
24
           # The .asciiz assembler directive creates
           # an ASCII string in memory terminated by
25
          # the null character. Note that strings are
26
27
          # surrounded by double-quotes
          .asciiz "Hello World!\n"
28 msg:
```

#### What needs to be relocated?

User Text Segment [00400000]..[00440000]

[00400000] 8fa40000lw \$4, 0(\$29); 183: lw \$a0 0(\$sp) # argc[00400004] 27a50004addiu \$5, \$29, 4; 184: addiu \$a1 \$sp 4 # argv[00400008] 24a60004addiu \$6, \$5, 4; 185: addiu \$a2 \$a1 4 # envp[0040000c] 00041080sll \$2, \$4, 2; 186: sll \$v0 \$a0 2[00400010] 00c23021addu \$6, \$6, \$2; 187: addu \$a2 \$a2 \$v0[00400014] 0c100009jal 0x00400024 [main]; 188: jal main[00400016] 3402000aori \$2, \$0, 10; 191: li \$v0 10[00400020] 000000csyscall; 192: syscall # syscall 10 (a[00400024] 34020004ori \$2, \$0, 4; 14: li \$v0, 4 # Code for system

[00400028] 3c011001 lui \$1, 4097 [msg]

[0040002c] 34240000 ori \$4, \$1, 0 [msg] [00400030] 000000c syscall [00400034] 3402000a ori \$2, \$0, 10

[00400038] 000000c syscall

```
; 189: nop
; 192: syscall # syscall 10 (exit)
       ; 14: li $v0,4 # Code for syscall:
       ; print string
       ; 15: la $a0, msg # Pointer to string
       ; (load the address of msg)
       ; 16: syscall
       ; 17: li $v0,10 # Code for syscall:
       : exit
```

; 18: syscall

#### What needs to be relocated?

User Text Segment [00400000]..[00440000] [00400000] 8fa40000 lw \$4, 0(\$29) [00400004] 27a50004 addiu \$5, \$29, 4 [00400008] 24a60004 addiu \$6, \$5, 4 [0040000c] 00041080 sll \$2, \$4, 2 [00400010] 00c23021 addu \$6, \$6, \$2 [00400014] 0c100009 jal 0x00400024 [main] [00400018] 00000000 nop [0040001c] 3402000a ori \$2, \$0, 10 [00400020] 0000000c syscall [00400024] 34020004 ori \$2, \$0, 4 [00400028] 3c011001 lui \$1, 4097 [msg] ori \$4, \$1, 0 [msg] [0040002c] 34240000 [00400030] 000000c syscall [00400034] 3402000a ori \$2, \$0, 10 [00400038] 000000c syscall

; 183: lw \$a0 0(\$sp) # argc ; 184: addiu \$a1 \$sp 4 # argv ; 185: addiu \$a2 \$a1 4 # envp : 186: sll \$v0 \$a0 2 : 187: addu \$a2 \$a2 \$v0 ; 188: jal main ; 189: nop ; 191: li \$v0 10 ; 192: syscall # syscall 10 (exit) ; 14: li \$v0,4 # Code for syscall: ; print string ; 15: la \$a0, msg # Pointer to string ; (load the address of msg) ; 16: syscall ; 17: li \$v0,10 # Code for syscall: : exit ; 18: syscall



#### Memory layout of a process



#### Where do these areas come from?



### Memory layout of a process



Compiler and linker

#### Load program in memory





## Thank you!