Laboratorio de Tecnologías de Información
Introduction to Pipelining: Datapath
Arquitectura de Computadoras
Arturo Díaz Pérez
Centro de Investigación y de Estudios Avanzados del IPN
Laboratorio de Tecnologías de Información
[email protected]
Pipelining
A way of exploiting instruction-level parallelism
[Figure: two diagrams plotting instructions (instrns) against time, each showing instructions passing through steps 1-4 and annotated with Throughput and Latency]
Observations
♦ Pipelining doesn’t help latency of a single task, it helps throughput of the entire workload
♦ Pipeline rate limited by slowest pipeline stage
♦ Multiple tasks operating simultaneously
♦ Potential speedup = Number of pipe stages
♦ Unbalanced lengths of pipe stages reduce speedup
♦ Time to “fill” pipeline and time to “drain” it reduce speedup
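These observations can be checked with a little arithmetic. The sketch below is only an illustration: the stage latencies are hypothetical values, and the model assumes the clock period equals the slowest stage and that no hazards occur.

```python
def pipeline_time(stage_times_ns, n_tasks):
    """Total time for n_tasks through a pipeline clocked at its slowest stage."""
    cycle = max(stage_times_ns)          # rate is limited by the slowest stage
    n_stages = len(stage_times_ns)
    # The first task needs n_stages cycles (fill); each later task adds one cycle.
    return cycle * (n_stages + n_tasks - 1)


def unpipelined_time(stage_times_ns, n_tasks):
    """Each task runs all of its stages before the next task starts."""
    return sum(stage_times_ns) * n_tasks


def speedup(stage_times_ns, n_tasks):
    return unpipelined_time(stage_times_ns, n_tasks) / pipeline_time(stage_times_ns, n_tasks)


balanced = [10, 10, 10, 10, 10]      # hypothetical stage latencies in ns
unbalanced = [5, 10, 20, 10, 5]      # same total work, uneven stages

print(f"balanced, 1000 tasks:   {speedup(balanced, 1000):.2f}")
print(f"unbalanced, 1000 tasks: {speedup(unbalanced, 1000):.2f}")
print(f"balanced, 4 tasks:      {speedup(balanced, 4):.2f}")
```

With many tasks the balanced pipeline approaches a speedup of 5 (the number of stages), the unbalanced one is held back by its 20 ns stage, and with only 4 tasks the fill and drain time dominates.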
5 Steps of DLX Datapath
♦ Instruction Fetch
♦ Instruction Decode / Register Fetch
♦ Execute / Address Calculation
♦ Memory Access
♦ Write Back
[Figure: DLX datapath showing PC, NPC, instruction memory, IR, register file (A, B), 16-to-32-bit sign extend, ALU with Zero? test, ALU output, data memory (LMD, SMD), adders, and multiplexers]
Pipelined DLX Datapath
Data stationary control: local decode for each phase / pipeline stage
Visualizing Pipelining
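A pipeline diagram of the usual kind (instructions down, clock cycles across) can be sketched in text. This is a minimal illustration, assuming the five DLX stages, one instruction issued per cycle, and no stalls; the stage name `ID` abbreviates ID/RF, and the function name is hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_diagram(instructions):
    """Return text rows; instruction i enters IF in cycle i + 1 (no stalls)."""
    n_cycles = len(instructions) + len(STAGES) - 1
    width = max(len(text) for text in instructions)
    header = " " * width + "".join(f"{c:>5}" for c in range(1, n_cycles + 1))
    rows = [header]
    for i, text in enumerate(instructions):
        cells = [""] * n_cycles
        for s, stage in enumerate(STAGES):
            cells[i + s] = stage          # each stage is one cycle later
        rows.append(f"{text:<{width}}" + "".join(f"{c:>5}" for c in cells))
    return rows


for row in pipeline_diagram(["add r1,r2,r3", "sub r4,r1,r3", "and r6,r1,r7"]):
    print(row)
```

Each row is shifted one column to the right of the row above it, so in any given cycle every stage holds a different instruction.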
Single Cycle, Multiple Cycle, vs. Pipeline
[Figure: clock timing of three implementations.
Single Cycle Implementation (Cycle 1, Cycle 2): Load, Store, with a “Waste” interval marked.
Multiple Cycle Implementation (Cycle 1 to Cycle 10): Load (Ifetch, Reg, Exec, Mem, Wr), Store (Ifetch, Reg, Exec, Mem), R-type (Ifetch, ...).
Pipeline Implementation: Load, Store, and R-type each pass through Ifetch, Reg, Exec, Mem, Wr, overlapped in time.]
Why Pipeline?
♦ Suppose we execute 100 instructions
♦ Single Cycle Machine
  ■ 45 ns/cycle x 1 CPI x 100 inst = 4500 ns
♦ Multicycle Machine
  ■ 10 ns/cycle x 4.6 CPI (due to inst mix) x 100 inst = 4600 ns
♦ Ideal pipelined machine
  ■ 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
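The three totals can be reproduced with a few lines of Python. This sketch uses only the numbers on the slide; the function and variable names are illustrative.

```python
def execution_time_ns(cycle_ns, cycles):
    """Execution time = clock period x number of clock cycles."""
    return cycle_ns * cycles


N = 100  # instructions executed

single_cycle = execution_time_ns(45, 1 * N)      # 1 CPI, long clock
multicycle = execution_time_ns(10, 4.6 * N)      # 4.6 CPI from the instruction mix
pipelined = execution_time_ns(10, 1 * N + 4)     # 1 CPI plus 4 cycles to drain

for name, t in [("single cycle", single_cycle),
                ("multicycle", multicycle),
                ("pipelined", pipelined)]:
    print(f"{name:<13}{t:7.0f} ns   speedup vs single cycle: {single_cycle / t:.2f}")
```

The ideal pipeline keeps the short 10 ns clock of the multicycle machine while approaching the 1 CPI of the single cycle machine, which is where the roughly 4x gain comes from.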
Limits to Pipelining
♦ It’s not that easy for computers
♦ Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
  ■ Structural hazards: HW cannot support this combination of instructions
  ■ Data hazards: instruction depends on result of prior instruction still in the pipeline
  ■ Control hazards: pipelining of branches & other instructions that change the PC
♦ Common solution is to stall the pipeline until the hazard is resolved, inserting one or more “bubbles” in the pipeline
1 Memory is Structural Hazard
Detection is easy in this case! (right half highlight means read, left half write)
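Why detection is easy can be sketched in code. The example assumes a single memory port shared by instruction fetch and data access, one instruction fetched per cycle, and no stalls; cycles and instructions are counted from 0, and the function name is hypothetical.

```python
def memory_conflicts(uses_data_memory):
    """Find cycles where a data access and an instruction fetch need the one port.

    uses_data_memory[i] is True if instruction i is a load or store.
    Instruction i is in IF during cycle i and in MEM during cycle i + 3.
    Returns a list of (cycle, memory_instruction, fetched_instruction).
    """
    conflicts = []
    for i, needs_port in enumerate(uses_data_memory):
        fetched = i + 3                  # instruction in IF while i is in MEM
        if needs_port and fetched < len(uses_data_memory):
            conflicts.append((i + 3, i, fetched))
    return conflicts


# A load followed by four instructions that do not touch data memory.
program = [True, False, False, False, False]
print(memory_conflicts(program))   # [(3, 0, 3)]
```

The check only needs to know whether the instruction currently in MEM accesses memory; if it does, the fetch in that cycle has to wait.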
Data Hazard on r1
• Dependencies backwards in time are hazards
[Figure: pipeline diagram, time (clock cycles) across with stages IF, ID/RF, EX, MEM, WB; instruction order down]
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
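Which of these instructions read a stale r1 can be worked out from the cycle numbers. The sketch below assumes no forwarding and no stalls, with cycles counted from 0; the `split_cycle_regfile` flag is a hypothetical switch modeling a register file that writes in the first half of a cycle and reads in the second.

```python
PROGRAM = [
    ("add", "r1", ("r2", "r3")),
    ("sub", "r4", ("r1", "r3")),
    ("and", "r6", ("r1", "r7")),
    ("or", "r8", ("r1", "r9")),
    ("xor", "r10", ("r1", "r11")),
]


def raw_hazards(program, split_cycle_regfile=True):
    """List (producer, consumer, register) triples that read a stale value.

    Instruction i reads its sources in cycle i + 1 (ID/RF) and writes its
    destination in cycle i + 4 (WB).
    """
    hazards = []
    for j, (_, _, sources) in enumerate(program):
        read_cycle = j + 1
        for src in sources:
            # Look for the most recent earlier instruction writing this register.
            for i in range(j - 1, -1, -1):
                if program[i][1] == src:
                    write_cycle = i + 4
                    stale = read_cycle < write_cycle or (
                        read_cycle == write_cycle and not split_cycle_regfile)
                    if stale:
                        hazards.append((program[i][0], program[j][0], src))
                    break
    return hazards


print(raw_hazards(PROGRAM))
print(raw_hazards(PROGRAM, split_cycle_regfile=False))
```

With the split-cycle register file only `sub` and `and` are affected; without it, `or` also reads r1 in the same cycle in which `add` writes it.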
Data Hazard Solution
• “Forward” result from one stage to another
[Figure: pipeline diagram, time (clock cycles) across with stages IF, ID/RF, EX, MEM, WB; instruction order down]
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
• “or” is OK if register read/write is defined properly
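The forwarding choice for each consumer of r1 depends only on how far it is from the producer. This is a sketch under stated assumptions: the producer is an ALU instruction, the pipeline register names EX/MEM and MEM/WB follow the common textbook naming rather than anything on the slide, and the register file writes in the first half of a cycle and reads in the second.

```python
def operand_source(distance):
    """Where a consumer gets a value produced `distance` instructions earlier."""
    if distance < 1:
        raise ValueError("producer must come before consumer")
    if distance == 1:
        return "forward from EX/MEM"    # result just left the ALU
    if distance == 2:
        return "forward from MEM/WB"    # result is one stage further along
    return "register file"              # written back in time to be read


consumers = ["sub", "and", "or", "xor"]   # instructions following add r1,r2,r3
for distance, name in enumerate(consumers, start=1):
    print(f"{name:<4}{operand_source(distance)}")
```

`sub` and `and` need forwarding paths, while `or` relies on the write-then-read register file and `xor` needs nothing special.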
3 Generic Data Hazards
♦ Instr_i followed by Instr_j
♦ Read After Write (RAW)
  ■ Instr_j tries to read operand before Instr_i writes it
♦ Write After Read (WAR)
  ■ Instr_j tries to write operand before Instr_i reads it
  ■ Can’t happen in DLX 5 stage pipeline because
    » all instructions take 5 stages
    » reads are always in stage 2, and
    » writes are always in stage 5
♦ Write After Write (WAW)
  ■ Instr_j tries to write operand before Instr_i writes it
    » Leaves wrong result (Instr_i not Instr_j)
  ■ Can’t happen in DLX 5 stage pipeline because
    » all instructions take 5 stages
    » writes are always in stage 5
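The three categories correspond to three ways the register names of two instructions can overlap. The sketch below only classifies the name overlap; whether an overlap becomes an actual hazard depends on the pipeline (in the DLX 5 stage pipeline only RAW does). The tuple representation and function name are illustrative assumptions.

```python
def classify_dependences(instr_i, instr_j):
    """Classify register overlaps between Instr_i and a later Instr_j.

    Each instruction is (dest, sources): dest is a register name or None,
    sources is a set of register names.
    """
    dest_i, srcs_i = instr_i
    dest_j, srcs_j = instr_j
    kinds = set()
    if dest_i is not None and dest_i in srcs_j:
        kinds.add("RAW")    # j reads what i writes
    if dest_j is not None and dest_j in srcs_i:
        kinds.add("WAR")    # j writes what i reads
    if dest_i is not None and dest_i == dest_j:
        kinds.add("WAW")    # both write the same register
    return kinds


add = ("r1", {"r2", "r3"})      # add r1,r2,r3
sub = ("r4", {"r1", "r3"})      # sub r4,r1,r3
print(classify_dependences(add, sub))   # {'RAW'}
```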
Control Hazard: Wait
[Figure: pipeline diagram, time (clock cycles) across, instruction order down; instructions shown: Add, Beq, Load, with “Lost potential” marked]
♦ Stall: wait until decision is clear
♦ Impact: 2 lost cycles (i.e. 3 clock cycles per branch instruction) => slow
♦ Move decision to end of decode
  ■ save 1 cycle per branch
Control Hazard: Predict
[Figure: pipeline diagram, time (clock cycles) across, instruction order down; instructions shown: Add, Beq, Load]
♦ Predict: guess one direction, then back up if wrong
♦ Impact: 0 lost cycles per branch instruction if right, 1 if wrong (right ≈ 50% of time)
  ■ Need to “Squash” and restart following instruction if wrong
  ■ Produce CPI on branch of (1 * .5 + 2 * .5) = 1.5
  ■ Total CPI might then be: 1.5 * .2 + 1 * .8 = 1.1 (20% branch)
♦ More dynamic scheme: history of 1 branch (≈ 90%)
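The CPI arithmetic generalizes to any prediction accuracy. The sketch below uses the slide's numbers (1 lost cycle on a wrong guess, 20% branches); the function names are illustrative, and applying the same formula to the 90% dynamic scheme is an extrapolation, not a figure from the slide.

```python
def branch_cpi(p_correct, penalty_cycles=1):
    """Average cycles per branch: 1 if predicted right, 1 + penalty if wrong."""
    return 1 + (1 - p_correct) * penalty_cycles


def total_cpi(branch_fraction, cpi_branch, cpi_other=1.0):
    """Weighted average CPI over branches and all other instructions."""
    return branch_fraction * cpi_branch + (1 - branch_fraction) * cpi_other


static_guess = total_cpi(0.2, branch_cpi(0.5))    # right about 50% of the time
one_bit_history = total_cpi(0.2, branch_cpi(0.9)) # right about 90% of the time

print(f"branch CPI at 50%: {branch_cpi(0.5):.2f}")
print(f"total CPI at 50%:  {static_guess:.2f}")
print(f"total CPI at 90%:  {one_bit_history:.2f}")
```

Because only one instruction in five is a branch, even a coin-flip prediction costs just 0.1 CPI overall, and better prediction shrinks that further.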
Control Hazard: Delayed Branch
[Figure: pipeline diagram, time (clock cycles) across, instruction order down; instructions shown: Add, Beq, Misc, Load]
♦ Delayed Branch: redefine branch behavior (the branch takes place after the next instruction)
♦ Impact: 0 clock cycles per branch instruction if an instruction can be found to put in the “slot” (≈ 50% of time)
♦ Less useful as more instructions are launched per clock cycle
Forwarding to Avoid Data Hazard
Data Hazard even with Forwarding
Forwarding and Loads
• Dependencies backwards in time are hazards
[Figure: pipeline diagram, time (clock cycles) across with stages IF, ID/RF, EX, MEM, WB]
lw r1,0(r2)
sub r4,r1,r3
• Can’t solve with forwarding: must delay/stall instruction dependent on loads
Forwarding and Loads
• Dependencies backwards in time are hazards
[Figure: pipeline diagram, time (clock cycles) across with stages IF, ID/RF, EX, MEM, WB, with a Stall inserted]
lw r1,0(r2)
sub r4,r1,r3
• Can’t solve with forwarding: must delay/stall instruction dependent on loads
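A load delivers its data at the end of MEM, which is one cycle too late for an instruction immediately behind it that needs the value at the start of EX. A minimal sketch of the resulting stall rule follows, assuming full forwarding so that the load-use pair is the only data hazard left; the tuple format and function name are hypothetical.

```python
def schedule_with_load_stalls(program):
    """Insert one bubble after each load whose result the next instruction uses.

    Each instruction is (opcode, dest, sources).
    """
    issued = []
    for k, instr in enumerate(program):
        if k > 0:
            prev_op, prev_dest, _ = program[k - 1]
            if prev_op == "lw" and prev_dest in instr[2]:
                issued.append(("stall", None, ()))   # one-cycle bubble
        issued.append(instr)
    return issued


program = [
    ("lw", "r1", ("r2",)),          # lw  r1,0(r2)
    ("sub", "r4", ("r1", "r3")),    # sub r4,r1,r3
]
print([op for op, _, _ in schedule_with_load_stalls(program)])   # ['lw', 'stall', 'sub']
```

If an independent instruction sits between the load and its consumer, the rule inserts no bubble, which is why compilers try to schedule something useful into that position.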
Designing a Pipelined Processor
♦ Go back and examine your datapath and control diagram
♦ Associate resources with states
♦ Ensure that flows do not conflict, or figure out how to resolve them
♦ Assert control in the appropriate stage
Control and Datapath: Split state diag into 5 pieces IR