Lecture 7: Pipelining Contd.

Kunle Olukotun
Gates 302
kunle@ogun.stanford.edu

http://www-leland.stanford.edu/class/ee282h/

More pipelining complications:
Interrupts and Exceptions

- Hard to handle in pipelined machines
- Overlapping instructions
  - When can you update the state of the machine
  - Exceptions cause instructions to be aborted in the middle of execution
What kind of interrupts can occur?

» In DLX,
  – IF: page fault, misaligned memory access, memory protection violation
  – ID: Illegal opcode
  – EX: arithmetic overflow
  – MEM: page fault, misaligned memory access, memory protection violation
  – WB: none

» In other processors:
  – Other arithmetic events: fixed/floating under/overflow, divide-by-zero, loss of significance
  – Privilege violations (kernel vs. user state)
  – I/O interrupts
  – Debugging interrupts

Classifying Interrupts

- Synchronous vs. asynchronous
- User requested vs. coerced
- User maskable vs. nonmaskable
- Within vs. between Instructions
- Resume vs. terminate
### Classifying Interrupts

<table>
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>I/O request</td>
<td>N</td>
<td>Y</td>
<td>N</td>
<td>N</td>
<td>Y</td>
</tr>
<tr>
<td>Trap</td>
<td>Y</td>
<td>N</td>
<td>Y</td>
<td>N</td>
<td>Y</td>
</tr>
<tr>
<td>FP overflow</td>
<td>Y</td>
<td>Y</td>
<td>Y</td>
<td>Y</td>
<td>Y</td>
</tr>
<tr>
<td>Page fault</td>
<td>Y</td>
<td>Y</td>
<td>N</td>
<td>Y</td>
<td>Y</td>
</tr>
<tr>
<td>Undefined instruction</td>
<td>Y</td>
<td>Y</td>
<td>N</td>
<td>Y</td>
<td>N</td>
</tr>
<tr>
<td>Power failure</td>
<td>N</td>
<td>Y</td>
<td>N</td>
<td>Y</td>
<td>Y</td>
</tr>
</tbody>
</table>

### What makes interrupts difficult

<table>
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>I/O request</td>
<td>N</td>
<td>Y</td>
<td>N</td>
<td>N</td>
<td>Y</td>
</tr>
<tr>
<td>Trap</td>
<td>Y</td>
<td>N</td>
<td>Y</td>
<td>N</td>
<td>Y</td>
</tr>
<tr>
<td>FP overflow</td>
<td>Y</td>
<td>Y</td>
<td>Y</td>
<td>Y</td>
<td>Y</td>
</tr>
<tr>
<td>Page fault</td>
<td>Y</td>
<td>Y</td>
<td>N</td>
<td>Y</td>
<td>Y</td>
</tr>
<tr>
<td>Undefined instruction</td>
<td>Y</td>
<td>Y</td>
<td>N</td>
<td>Y</td>
<td>N</td>
</tr>
<tr>
<td>Power failure</td>
<td>N</td>
<td>Y</td>
<td>N</td>
<td>Y</td>
<td>Y</td>
</tr>
</tbody>
</table>
What makes pipelined interrupts difficult

- Precise Interrupts
  - Interrupts that occur in the middle of an instruction and must be restartable

\[
I_{n-1} < n : \text{complete} \\
I_n \quad \text{Exception} \\
I_{n+1} \geq n : \text{not executed}
\]

Precise Interrupts

- Precise interrupts can be hard to implement
  - Interrupts can occur anywhere during the execution of an instruction
  - Multiple instructions are being executed, and each may have partially updated state information.
  - Multiple instructions are being executed, so multiple interrupts can occur at the same time.

- The alternative is “imprecise” interrupts that cannot be restarted, but they can’t be used for:
  - demand paging memory faults
  - IEEE 754 compliant floating point interrupts
Example of multiple interrupts

<table>
<thead>
<tr>
<th>Instr</th>
<th>Fault</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
</tr>
</thead>
<tbody>
<tr>
<td>add r1,r2,r3</td>
<td>overflow</td>
<td>IF</td>
<td>ID</td>
<td><strong>EX</strong></td>
<td>MEM</td>
<td>WB</td>
<td></td>
<td></td>
</tr>
<tr>
<td>lw r3,0(r5)</td>
<td>MEM alignment error</td>
<td>IF</td>
<td>ID</td>
<td>EX</td>
<td><strong>MEM</strong></td>
<td>WB</td>
<td></td>
<td></td>
</tr>
<tr>
<td>sub r6,r6,r6</td>
<td>IF page fault</td>
<td></td>
<td></td>
<td></td>
<td><strong>IF</strong></td>
<td>ID</td>
<td>EX</td>
<td>MEM</td>
</tr>
</tbody>
</table>

- *add* and *sub* faults occur at the same time
- *sub* fault occurs before the *lw* fault -- out of order

Getting to the interrupt routine and back

- In DLX, no state change occurs until MEM or WB. So we can handle interrupts by:
  - Forcing a “trap” instruction into the pipeline
  - Turn off all writes until the trap executes. (Turn all instructions into NOPs by clearing pipeline registers other than the PC chain)
  - Give the faulting PC to the trap routine to allow restart.
Getting to the interrupt routine and back

Killing Instructions in Pipeline

<table>
<thead>
<tr>
<th>Instruction</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
<th>8</th>
<th>9</th>
</tr>
</thead>
<tbody>
<tr>
<td>i</td>
<td>IF</td>
<td>ID</td>
<td>EX</td>
<td>MEM</td>
<td>WB</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>i+1</td>
<td>IF</td>
<td>ID</td>
<td>EX</td>
<td>MEM</td>
<td>WB</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>i+2</td>
<td>IF</td>
<td>ID</td>
<td>EX</td>
<td>MEM</td>
<td>WB</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>i+3</td>
<td>IF</td>
<td>ID</td>
<td>EX</td>
<td>MEM</td>
<td>WB</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>i+4</td>
<td>IF</td>
<td>ID</td>
<td>EX</td>
<td>MEM</td>
<td>WB</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
Branch Delay Slots

- Complication: Branch delay slots! If the instruction in a delay slot faults, which PC do you save?
  
  ```
  beq r2, loc
  add r3, r4, r5        <--- overflow here. Save which PC?
  ...
  loc: sub r6, r7, r8
  ```

- Answer: We must save both the address of the “add” and the address of the “sub”. In general, save #branch delay slots + 1 PCs. (Or: save the address of the branch!)

Simultaneous and out-of-order interrupts

- If two fault conditions occur at the same time, take the interrupt for the earlier instruction. The later one will recur after returning from the interrupt routine.

- A fault for a later instruction may occur before one for an earlier instruction. We have two choices:
  - Take the interrupts as they occur. Do not allow previous instructions to complete. Identify the faulting instructions, but restart at the first uncompleted instruction.
  - Defer all interrupts until the end of an instruction. An instruction’s interrupt status moves down the pipeline with it. Take the interrupt only at the end of MEM -- the “commit” point at which all interrupt conditions are known, but no effects have occurred.
Precise Interrupts in DLX Pipeline

<table>
<thead>
<tr>
<th>Instr</th>
<th>Fault</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
</tr>
</thead>
<tbody>
<tr>
<td>add r1,r2,r3</td>
<td>overflow</td>
<td>IF</td>
<td>ID</td>
<td><strong>EX</strong></td>
<td>MEM</td>
<td>WB</td>
<td></td>
<td></td>
</tr>
<tr>
<td>lw r3,0(r5)</td>
<td>MEM alignment error</td>
<td>IF</td>
<td>ID</td>
<td>EX</td>
<td><strong>MEM</strong></td>
<td>WB</td>
<td></td>
<td></td>
</tr>
<tr>
<td>sub r6,r6,r6</td>
<td>IF page fault</td>
<td></td>
<td>IF</td>
<td>ID</td>
<td>EX</td>
<td><strong>MEM</strong></td>
<td>WB</td>
<td></td>
</tr>
</tbody>
</table>

- Interrupt status moves with instruction
- Interrupt is handled when instruction reaches commit point

More pipelining complications: Multicycle operations

- Some operations are too long to accommodate in any fixed-length pipeline:
  - integer divide
  - most floating-point operations
  - complex CISC addressing modes
- Number of clock cycles varies, perhaps based on execution data
  - Latency
  - Repeat rate
    - 1 cycle for fully pipelined unit
- Cannot use a static pipeline with all instructions going through all stages; must use a dynamic pipeline with optional stages.
- As long as we’re at it, we can do it for simple instructions too!
  - ALU: IF ID EX WB
  - STORE: IF ID EX MEM
  - etc.
Pipeline with Multicycle Execution Units

Instructions share front of pipeline
Split for execution unit paths
Rejoin to share writeback stage

The problems

- We now have structural hazards:
  » Units not fully pipelined may be busy: multiply, divide
  » Because cycle times vary, multiple instructions may reach WB at the same time
- We now have WAW hazards:
  » Short instructions that start late may reach WB before long instructions that start early
  » (WAR hazards are still not possible because we issue in order and register reads always occur in ID)
- We now have interrupt headaches:
  » Instructions may complete out-of-order, so how can we restart instruction i if instructions i+1 and i+2 are already finished?
Reservation Tables

Reservation Shift Register

Reservation *shift register* indicates what cycles a resource (register write port) will be available:

Now

shift once each cycle

<table>
<thead>
<tr>
<th></th>
<th>0</th>
<th>0</th>
<th>0</th>
<th>0</th>
<th>1</th>
<th>0</th>
<th>1</th>
<th>0</th>
<th>0</th>
<th>0</th>
<th>0</th>
<th>0</th>
<th>0</th>
<th>0</th>
<th>1</th>
<th>0</th>
</tr>
</thead>
<tbody>
<tr>
<td>FP mult</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>FP add</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>FP divide</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
</tbody>
</table>

AND the mask with the reservation; if non-zero, then don’t issue the instruction.
### Competing for the Writeback Port

Consider the sequence:

```plaintext
FDIV
FMULT
NOP
FADD
AND
```

All 4 units request W stage in same cycle

Three approaches
1. Don’t issue until reservation table indicates writeback stage will be available
2. Queue requests before writeback stage. Allows subsequent instructions to issue
3. Stall at output of unit until available

### Dealing with data hazards

- Handle RAW the usual way: Stall the instruction issue until source registers can be forwarded
- WAR hazards can’t happen
- WAW hazards:
  - Example: `div f0, f2, f4 ;finishes second
      subf f0, f8, f10 ;finishes first`
  - Is this a useless sequence that we can produce incorrect results for? No!
  - Two choices:
    - Delay issuing `subf` until `divf` will enter WBF first
    - Detect the hazard and turn the `divf` into a NOP.
  - Note that an intervening instruction that uses `f0` makes this become a standard RAW hazard, and the WAW disappears.
Dealing with out-of-order completion when an exception occurs

- After structural or data hazards have been eliminated, instructions may write results out-of-order:
  - `div f0, f2, f4`
  - `addf f10, f10, f8`; this destroys its operand
  - `subf f12, f12, f14`; this could interrupt with `addf` done and `divf` not

- Several solutions:
  - Punt: Imprecise interrupts that disallow restart (360/91, CRAY)
  - Allow the software to optionally determine restart points (DEC Alpha “trapb” barrier causes all prior instructions to commit; can be used in basic blocks without operand destruction to get precise interrupts.)
  - Delay instruction issue until all prior instruction are known not to cause an interrupt. (Use early rough tests that can have false positives. MIPS 3010, Pentium do this.)
  - Buffer results to allow the correct register set to be reconstructed (“history file”, “reorder buffer”, “future file”)

Reconstructing the register file for out-of-order completion

- **#1 History file:** keep original values that can be restored after an exception

  Results → Register file (Saved destinations) → History file → Restored destinations

- **#2 Reorder buffer:** don’t write register values out-of-order

  Results → Reorder buffer → Register file (Forwarded results) → Operands to ALU

  Each entry: status, PC, Rdest, value

(Requires many ports/compators in reorder buffer, since the same register may appear multiple times.)
Reconstructing the register file for out-of-order completion

#3 Reorder buffer with future file

Used for exception state

Reorder buffer

Sequential Register file

Operands to ALU

Future Register file

Results

Used for normal operands

• Fewer comparators, but duplicates all registers.

Another pipeline example: high memory latency

• An 8-stage pipeline with multiple stages for memory (the MIPS R4000, approximately):

  » IF1: Instruction fetch, part 1
  » IF2: Instruction fetch, part 2
  » ID Decode, Register fetch, branch target address computation
  » EX ALU ops, effective address computation branch condition determination
  » MEM1 memory access, part 1
  » MEM2 memory access, part 2
  » MEM3 memory access, part 3
  » WB Write back to register file

• Assume that the fast cycle time makes register reads and write take a full cycle -- ie, we get no free WB->ID forwarding.
The pipeline diagram

<table>
<thead>
<tr>
<th>Instr</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
<th>8</th>
<th>9</th>
<th>10</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>IF1</td>
<td>IF2</td>
<td>ID</td>
<td>EX</td>
<td>MEM1</td>
<td>MEM2</td>
<td>MEM3</td>
<td>WB</td>
<td></td>
<td></td>
</tr>
<tr>
<td>i+1</td>
<td>IF1</td>
<td>IF2</td>
<td>ID</td>
<td>EX</td>
<td>MEM1</td>
<td>MEM2</td>
<td>MEM3</td>
<td>WB</td>
<td></td>
<td></td>
</tr>
<tr>
<td>i+2</td>
<td>IF1</td>
<td>IF2</td>
<td>ID</td>
<td>EX</td>
<td>MEM1</td>
<td>MEM2</td>
<td>MEM3</td>
<td>WB</td>
<td></td>
<td></td>
</tr>
<tr>
<td>i+3</td>
<td>IF1</td>
<td>IF2</td>
<td>ID</td>
<td>EX</td>
<td>MEM1</td>
<td>MEM2</td>
<td>MEM3</td>
<td></td>
<td>MEM3</td>
<td>WB</td>
</tr>
<tr>
<td>i+4</td>
<td>IF1</td>
<td>IF2</td>
<td>ID</td>
<td>EX</td>
<td>MEM1</td>
<td>MEM2</td>
<td>MEM3</td>
<td></td>
<td>MEM3</td>
<td>MEM3</td>
</tr>
<tr>
<td>i+5</td>
<td>IF1</td>
<td>IF2</td>
<td>ID</td>
<td>EX</td>
<td>MEM1</td>
<td>MEM2</td>
<td>MEM3</td>
<td></td>
<td>MEM3</td>
<td>MEM3</td>
</tr>
<tr>
<td>i+6</td>
<td>IF1</td>
<td>IF2</td>
<td>ID</td>
<td>EX</td>
<td>MEM1</td>
<td>MEM2</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>i+7</td>
<td>IF1</td>
<td>IF2</td>
<td>ID</td>
<td></td>
<td>MEM1</td>
<td>MEM2</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

What are the forwarding paths?

<table>
<thead>
<tr>
<th>From stage</th>
<th>To stage</th>
<th>To which subsequent instruction</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Characteristics of this pipeline

- How many comparators are needed to implement the forwarding decisions?

- What instruction sequences will still cause stalls?

- What is the branch delay?

- What is the load delay?
Pipeline Performance

5-stage pipeline

8-stage pipeline

- 8-stage R4400, 200 MHz, 140 SPECint92
- 5-stage R4700, 175 MHz, 130 SPECint92

Better performance from more instruction-level parallelism (Chap 4)

- Pipeline CPI = Ideal CPI +
  - structural stalls (resource conflict)
  - control stalls (control dependency)
  - RAW stalls (true data dependency)
  - WAR stalls (anti-dependency, a name dependence)
  - WAW stalls (output dependency, another name dependence)

- More hardware tricks
  - multiple instruction issue ("superscaler", VLIW)
  - dynamic scheduling
  - advanced branch prediction
  - speculative execution

- Clever compilers can also help
  - pipeline scheduling
  - loop unrolling
  - register renaming