Pipelining: the laundry analogy
[Figure: two laundry timelines, each with time running from 6 PM to 2 AM and tasks A, B, C, D listed in task order.]
1998 Morgan Kaufmann Publishers
Pipelining
• Improve performance by increasing instruction throughput
[Figure: program execution order against time for
    lw $1, 100($0)
    lw $2, 200($0)
    lw $3, 300($0)
Top panel: without pipelining, a new instruction starts every 8 ns (instruction fetch, Reg, ALU, data access, Reg).
Bottom panel: with pipelining, a new instruction starts every 2 ns.]
• Ideal speedup is number of stages in the pipeline. Do we achieve this?
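The question can be explored with the numbers in the figure. The sketch below is an illustration rather than part of the original slides. It assumes 8 ns per instruction without pipelining, a 2 ns pipelined clock, and the five-stage pipeline introduced later in the deck; the function names are hypothetical.

```python
# Numbers read from the figure; NUM_STAGES is an assumption (five-stage pipeline).
NONPIPELINED_NS_PER_INSTRUCTION = 8
PIPELINED_CYCLE_NS = 2
NUM_STAGES = 5


def nonpipelined_time_ns(n):
    """Total time when each instruction finishes before the next starts."""
    return n * NONPIPELINED_NS_PER_INSTRUCTION


def pipelined_time_ns(n):
    """The first instruction needs NUM_STAGES cycles; each later one adds one cycle."""
    return (NUM_STAGES + n - 1) * PIPELINED_CYCLE_NS


def speedup(n):
    return nonpipelined_time_ns(n) / pipelined_time_ns(n)


for n in (3, 1000, 1_000_000):
    print(n, nonpipelined_time_ns(n), pipelined_time_ns(n), round(speedup(n), 2))
```

For the three loads in the figure this gives 24 ns against 14 ns. Under these assumptions the speedup approaches 8/2 = 4 for long programs, not 5, because five 2 ns stages add up to 10 ns, which is more than the 8 ns an instruction takes on its own.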
Pipelining
• What makes it easy
  – all instructions are the same length
  – just a few instruction formats
  – memory operands appear only in loads and stores
• What makes it hard?
  – structural hazards: suppose we had only one memory
  – control hazards: need to worry about branch instructions
  – data hazards: an instruction depends on a previous instruction
• We’ll build a simple pipeline and look at these issues
Basic Idea
IF: Instruction fetch
ID: Instruction decode/ register file read
EX: Execute/ address calculation
MEM: Memory access
WB: Write back
[Figure: datapath divided into the five stages listed above, showing the PC, instruction memory, registers, sign extend (16 to 32 bits), shift left 2, adders, ALU, data memory, and multiplexers.]
What do we need to add to actually split the datapath into stages?
Pipelined Datapath
[Figure: pipelined datapath with the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB placed between the stages.]
Can you find a problem even if there are no dependencies? What instructions can we execute to manifest the problem?
Corrected Datapath
[Figure: corrected pipelined datapath with the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB.]
Graphically Representing Pipelines
[Figure: pipeline diagram, time in clock cycles CC 1 to CC 6, program execution order in instructions:
    lw $10, 20($1)
    sub $11, $2, $3
Each instruction passes through IM, Reg, ALU, DM, Reg.]
• Can help with answering questions like:
  – how many cycles does it take to execute this code?
  – what is the ALU doing during cycle 4?
  – use this representation to help understand datapaths
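The first two questions can be answered by building the diagram programmatically. This is a minimal sketch, not part of the original slides. It assumes one instruction enters the pipeline per clock cycle with no stalls; the names `schedule`, `total_cycles`, and `in_alu` are hypothetical.

```python
STAGES = ["IM", "Reg", "ALU", "DM", "Reg"]  # stage labels as drawn in the figure


def schedule(program):
    """Return [(instruction, {cycle: stage})], one instruction issued per cycle."""
    table = []
    for i, instr in enumerate(program):
        row = {i + 1 + s: name for s, name in enumerate(STAGES)}
        table.append((instr, row))
    return table


def total_cycles(program):
    """Cycles until the last instruction leaves the last stage."""
    return len(program) + len(STAGES) - 1


def in_alu(program, cycle):
    """Instruction occupying the ALU stage in the given cycle, or None."""
    for instr, row in schedule(program):
        if row.get(cycle) == "ALU":
            return instr
    return None


program = ["lw $10, 20($1)", "sub $11, $2, $3"]
n = total_cycles(program)
print("".ljust(18) + "".join(f"CC{c}".ljust(5) for c in range(1, n + 1)))
for instr, row in schedule(program):
    print(instr.ljust(18) + "".join(row.get(c, "").ljust(5) for c in range(1, n + 1)))
print("total cycles:", n)
print("ALU in cycle 4:", in_alu(program, 4))
```

For the two-instruction program in the figure, the code takes 6 cycles, and in cycle 4 the ALU is working on the sub while the lw is in its DM stage.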
Pipeline Control
[Figure: pipelined datapath (IF/ID, ID/EX, EX/MEM, MEM/WB) with the control lines labeled: PCSrc, RegWrite, Branch, MemRead, MemWrite, MemtoReg, ALUSrc, ALUOp, RegDst. The ALU control takes the 6-bit function field and ALUOp; Instruction [15–0] feeds the sign extend, and Instruction [20–16] and Instruction [15–11] feed the RegDst multiplexer.]
Pipeline control
• We have 5 stages. What needs to be controlled in each stage?
  – Instruction Fetch and PC Increment
  – Instruction Decode / Register Fetch
  – Execution
  – Memory Stage
  – Write Back
• How would control be handled in an automobile plant?
  – a fancy control center telling everyone what to do?
  – should we use a finite state machine?
Pipeline Control
• Pass control signals along just like the data

              Execution/Address Calculation     Memory access               Write back
              stage control lines               stage control lines         stage control lines
Instruction   RegDst  ALUOp1  ALUOp0  ALUSrc    Branch  MemRead  MemWrite   RegWrite  MemtoReg
R-format      1       1       0       0         0       0        0          1         0
lw            0       0       0       1         0       1        0          1         1
sw            X       0       0       1         0       0        1          0         X
beq           X       0       1       0         1       0        0          0         X

[Figure: Control decodes the instruction held in IF/ID. ID/EX carries the WB, M, and EX control fields; EX/MEM carries WB and M; MEM/WB carries WB.]
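The control values above can be written as a lookup table, grouped by the stage that uses them. The sketch below is an illustration, not the original authors' code. The dictionary layout and the function name `pipeline_register_contents` are hypothetical; the values are those given on the slide, with "X" for don't care.

```python
# Control-line values per instruction class, grouped by consuming stage.
CONTROL = {
    "R-format": {
        "EX": {"RegDst": 1, "ALUOp1": 1, "ALUOp0": 0, "ALUSrc": 0},
        "M": {"Branch": 0, "MemRead": 0, "MemWrite": 0},
        "WB": {"RegWrite": 1, "MemtoReg": 0},
    },
    "lw": {
        "EX": {"RegDst": 0, "ALUOp1": 0, "ALUOp0": 0, "ALUSrc": 1},
        "M": {"Branch": 0, "MemRead": 1, "MemWrite": 0},
        "WB": {"RegWrite": 1, "MemtoReg": 1},
    },
    "sw": {
        "EX": {"RegDst": "X", "ALUOp1": 0, "ALUOp0": 0, "ALUSrc": 1},
        "M": {"Branch": 0, "MemRead": 0, "MemWrite": 1},
        "WB": {"RegWrite": 0, "MemtoReg": "X"},
    },
    "beq": {
        "EX": {"RegDst": "X", "ALUOp1": 0, "ALUOp0": 1, "ALUSrc": 0},
        "M": {"Branch": 1, "MemRead": 0, "MemWrite": 0},
        "WB": {"RegWrite": 0, "MemtoReg": "X"},
    },
}


def pipeline_register_contents(instruction):
    """Control fields each pipeline register still carries for this instruction.

    Each register keeps only the groups that later stages need,
    so the EX group is dropped after EX and the M group after MEM.
    """
    signals = CONTROL[instruction]
    return {
        "ID/EX": {g: signals[g] for g in ("EX", "M", "WB")},
        "EX/MEM": {g: signals[g] for g in ("M", "WB")},
        "MEM/WB": {g: signals[g] for g in ("WB",)},
    }


for register, fields in pipeline_register_contents("lw").items():
    print(register, fields)
```

Keeping the signals with the instruction means no central controller has to track which instruction is in which stage.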
Datapath with Control
[Figure: pipelined datapath with the Control unit added. The EX, M, and WB control fields travel through ID/EX, EX/MEM, and MEM/WB alongside the data and drive PCSrc, RegWrite, Branch, MemRead, MemWrite, MemtoReg, ALUSrc, ALUOp, and RegDst.]
Dependencies
• Problem with starting next instruction before first is finished
  – dependencies that “go backward in time” are data hazards
[Figure: pipeline diagram, time in clock cycles CC 1 to CC 9, program execution order in instructions:
    sub $2, $1, $3
    and $12, $2, $5
    or $13, $6, $2
    add $14, $2, $2
    sw $15, 100($2)
Each instruction passes through IM, Reg, ALU, DM, Reg.]

Clock cycle            CC 1  CC 2  CC 3  CC 4  CC 5    CC 6  CC 7  CC 8  CC 9
Value of register $2   10    10    10    10    10/-20  -20   -20   -20   -20
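The backward dependencies in this sequence can be found mechanically. The sketch below is an illustration, not part of the original slides. It assumes the five-stage pipeline without forwarding, and a register file that is written in the first half of a clock cycle and read in the second half (consistent with the 10/-20 value shown for CC 5). Under those assumptions a result is safe to read three or more instructions after its producer. The helper names `parse` and `data_hazards` are hypothetical.

```python
import re


def parse(instr):
    """Return (opcode, destination register or None, source registers)."""
    op = instr.split()[0]
    regs = re.findall(r"\$\w+", instr)
    if op in ("sw", "beq"):        # these write no register
        return op, None, regs
    return op, regs[0], regs[1:]   # first register named is the destination


def data_hazards(program):
    """List (producer, consumer, register) triples that read a stale value."""
    parsed = [parse(i) for i in program]
    hazards = []
    for j, (_, _, sources) in enumerate(parsed):
        for reg in dict.fromkeys(sources):               # each source once
            if reg == "$0":
                continue
            # Only the two preceding instructions are too close; nearest first.
            for i in range(j - 1, max(j - 3, -1), -1):
                if parsed[i][1] == reg:
                    hazards.append((program[i], program[j], reg))
                    break
    return hazards


program = [
    "sub $2, $1, $3",
    "and $12, $2, $5",
    "or $13, $6, $2",
    "add $14, $2, $2",
    "sw $15, 100($2)",
]
for producer, consumer, reg in data_hazards(program):
    print(f"{consumer!r} reads {reg} too early after {producer!r}")
```

With these assumptions only the and and the or read $2 before the sub has written it; the add reads it in CC 5, the cycle in which it is written, and the sw reads it later.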
Software Solution
• Have compiler guarantee no hazards
• Where do we insert the “nops”?
    sub $2, $1, $3
    and $12, $2, $5
    or $13, $6, $2
    add $14, $2, $2
    sw $15, 100($2)
• Problem: this really slows us down!
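One possible answer to the nop question can be worked out in code. The sketch below is an illustration, not the original authors' method. It assumes no forwarding and a register file that is written in the first half of a cycle and read in the second half, so a consumer must sit at least three instruction slots after its producer. `MIN_DISTANCE`, `parse`, and `insert_nops` are hypothetical names.

```python
import re

MIN_DISTANCE = 3  # assumption: slots between a register write and a safe read


def parse(instr):
    """Return (destination register or None, source registers)."""
    op = instr.split()[0]
    regs = re.findall(r"\$\w+", instr)
    if op in ("sw", "beq"):        # these write no register
        return None, regs
    return regs[0], regs[1:]


def insert_nops(program):
    """Pad the program with nops so no instruction reads a register too early."""
    out = []
    last_write = {}                # register -> index in `out` of its latest writer
    for instr in program:
        dest, sources = parse(instr)
        earliest = 0
        for reg in sources:
            if reg in last_write:
                earliest = max(earliest, last_write[reg] + MIN_DISTANCE)
        while len(out) < earliest:
            out.append("nop")
        if dest is not None and dest != "$0":
            last_write[dest] = len(out)
        out.append(instr)
    return out


program = [
    "sub $2, $1, $3",
    "and $12, $2, $5",
    "or $13, $6, $2",
    "add $14, $2, $2",
    "sw $15, 100($2)",
]
print("\n".join(insert_nops(program)))
```

Under these assumptions two nops go between the sub and the and, which turns five instructions into seven issue slots and illustrates the slowdown.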
Forwarding
• Use temporary results, don’t wait for them to be written
  – register file forwarding to handle read/write to same register
  – ALU forwarding
[Figure: pipeline diagram, time in clock cycles CC 1 to CC 9, program execution order in instructions:
    sub $2, $1, $3
    and $12, $2, $5
    or $13, $6, $2
    add $14, $2, $2
    sw $15, 100($2)
Each instruction passes through IM, Reg, ALU, DM, Reg. Annotation on the figure: what if this $2 was $13?]

Clock cycle            CC 1  CC 2  CC 3  CC 4  CC 5    CC 6  CC 7  CC 8  CC 9
Value of register $2   10    10    10    10    10/-20  -20   -20   -20   -20
Value of EX/MEM        X     X     X     -20   X       X     X     X     X
Value of MEM/WB        X     X     X     X     -20     X     X     X     X
Forwarding
[Figure: pipelined datapath with a forwarding unit and multiplexers on the ALU inputs. Labeled signals: IF/ID.RegisterRs (Rs), IF/ID.RegisterRt (Rt), IF/ID.RegisterRd (Rd), EX/MEM.RegisterRd, and MEM/WB.RegisterRd. The WB, M, and EX control fields pass through ID/EX, EX/MEM, and MEM/WB.]
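The decision the forwarding unit makes for one ALU operand can be sketched as below. The slide does not spell out the conditions, so this follows the commonly taught rule: prefer the newer result in EX/MEM, then MEM/WB, and never forward register 0. The function name `forward` and the returned labels are hypothetical.

```python
def forward(operand_reg, ex_mem_regwrite, ex_mem_rd, mem_wb_regwrite, mem_wb_rd):
    """Choose the source for one ALU operand of the instruction in EX.

    operand_reg is the Rs or Rt register number of that instruction.
    Returns "EX/MEM", "MEM/WB", or "REG" (value read from the register file).
    """
    # Newest result first: the instruction one ahead is now in MEM.
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == operand_reg:
        return "EX/MEM"
    # Otherwise the instruction two ahead, now in WB.
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == operand_reg:
        return "MEM/WB"
    return "REG"


# CC 4: "and $12, $2, $5" is in EX while "sub $2, $1, $3" is in MEM.
print(forward(2, True, 2, False, 0))
# CC 5: "or $13, $6, $2" is in EX; the and (Rd = 12) is in MEM, the sub in WB.
print(forward(2, True, 12, True, 2))
print(forward(6, True, 12, True, 2))
```

The two examples line up with the slide's values: -20 is available in EX/MEM during CC 4 and in MEM/WB during CC 5.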
Can't always forward
• Load word can still cause a hazard:
  – an instruction tries to read a register following a load instruction that writes to the same register.
[Figure: pipeline diagram, time in clock cycles CC 1 to CC 9, program execution order in instructions:
    lw $2, 20($1)
    and $4, $2, $5
    or $8, $2, $6
    add $9, $4, $2
    slt $1, $6, $7
Each instruction passes through IM, Reg, ALU, DM, Reg.]
• Thus, we need a hazard detection unit to “stall” the instruction that follows the load
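The check such a hazard detection unit performs can be sketched as follows. The slide does not give the condition, so this uses the commonly taught load-use test: the instruction in EX is a load, and its destination (Rt) matches a source register of the instruction in ID. The function name `must_stall` and its parameters are hypothetical.

```python
def must_stall(id_ex_memread, id_ex_rt, if_id_rs, if_id_rt):
    """True when the instruction in ID needs a value that a load in EX has not fetched yet."""
    return bool(id_ex_memread) and id_ex_rt in (if_id_rs, if_id_rt)


# "lw $2, 20($1)" is in EX and "and $4, $2, $5" is in ID: the and must wait.
print(must_stall(True, 2, 2, 5))
# Same registers, but the instruction in EX is not a load: forwarding is enough.
print(must_stall(False, 2, 2, 5))
```

After a one-cycle stall the loaded value is available from the later pipeline register, so forwarding can supply it.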
Stalling
• We can stall the pipeline by keeping an instruction in the same stage
[Figure: pipeline diagram, time in clock cycles CC 1 to CC 10, program execution order in instructions:
    lw $2, 20($1)
    and $4, $2, $5
    or $8, $2, $6
    add $9, $4, $2
    slt $1, $6, $7
A bubble is inserted after the lw; the and repeats its Reg stage and the or repeats its IM stage.]
Branch Hazards
• When we decide to branch, other instructions are in the pipeline!
[Figure: pipeline diagram, time in clock cycles CC 1 to CC 9, program execution order in instructions, with instruction addresses on the left:
    40 beq $1, $3, 7
    44 and $12, $2, $5
    48 or $13, $6, $2
    52 add $14, $2, $2
    72 lw $4, 50($7)
Each instruction passes through IM, Reg, ALU, DM, Reg.]
• We are predicting “branch not taken”
  – need to add hardware for flushing instructions if we are wrong
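The addresses in the figure follow from the MIPS branch rule, where the offset counts words relative to the instruction after the branch. The sketch below is an illustration with hypothetical function names; the count of three wrongly fetched instructions is taken from the figure, not derived.

```python
def branch_target(branch_address, word_offset):
    """Target of a taken MIPS branch: PC + 4 plus the offset shifted left by 2."""
    return branch_address + 4 + (word_offset << 2)


def wrongly_fetched(branch_address, count=3):
    """Addresses fetched after the branch under 'predict not taken'.

    count=3 is an assumption matching the figure (and, or, add).
    """
    return [branch_address + 4 * k for k in range(1, count + 1)]


print(branch_target(40, 7))    # address of the lw in the figure
print(wrongly_fetched(40))     # instructions to flush if the branch is taken
```

So `beq $1, $3, 7` at address 40 branches to 40 + 4 + 7 × 4 = 72, and the instructions at 44, 48, and 52 are the ones that must be flushed when the prediction is wrong.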
Improving Performance
• Try and avoid stalls! E.g., reorder these instructions:
    lw $t0, 0($t1)
    lw $t2, 4($t1)
    sw $t2, 0($t1)
    sw $t0, 4($t1)
• Add a “branch delay slot”
  – the next instruction after a branch is always executed
  – rely on compiler to “fill” the slot with something useful
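For the reordering exercise, one candidate answer is to swap the two stores so that no instruction uses a register loaded by the instruction just before it. The sketch below counts load-use stalls before and after. It is an illustration with a hypothetical helper, `load_use_stalls`, and it assumes a store needs its data register as early as an ALU operand, as in a simple hazard detection unit.

```python
import re


def load_use_stalls(program):
    """Count instructions that use a register loaded by the immediately preceding lw."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev.split()[0] != "lw":
            continue
        loaded = re.findall(r"\$\w+", prev)[0]           # lw destination
        cur_regs = re.findall(r"\$\w+", cur)
        # sw reads every register it names; other instructions write the first one.
        sources = cur_regs if cur.split()[0] == "sw" else cur_regs[1:]
        if loaded in sources:
            stalls += 1
    return stalls


original = ["lw $t0, 0($t1)", "lw $t2, 4($t1)", "sw $t2, 0($t1)", "sw $t0, 4($t1)"]
reordered = ["lw $t0, 0($t1)", "lw $t2, 4($t1)", "sw $t0, 4($t1)", "sw $t2, 0($t1)"]
print(load_use_stalls(original), load_use_stalls(reordered))
```

The swap is safe here because the two stores write different addresses, 0($t1) and 4($t1), and neither changes $t1.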
Dynamic Scheduling
• The hardware performs the “scheduling”
  – hardware tries to find instructions to execute
  – out of order execution is possible
  – dynamic branch prediction
• All modern processors are very complicated
  – DEC Alpha 21264: 9 stage pipeline
  – PowerPC and Pentium: branch history table for branch prediction
  – Compiler technology is important
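To make the branch history table idea concrete, here is a minimal sketch of one common scheme, a table of 2-bit saturating counters indexed by low bits of the branch address. The slide does not say which scheme the processors it names use, so the class name, table size, indexing, and initial counter value are all assumptions made for illustration.

```python
class BranchHistoryTable:
    """Table of 2-bit saturating counters: 0 and 1 predict not taken, 2 and 3 predict taken."""

    def __init__(self, entries=16):
        self.entries = entries
        self.counters = [1] * entries          # assumption: start weakly not taken

    def _index(self, pc):
        return (pc >> 2) % self.entries        # drop byte offset, keep low bits

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def count_mispredictions(table, pc, outcomes):
    """Run the predictor over a sequence of actual branch outcomes."""
    wrong = 0
    for taken in outcomes:
        if table.predict(pc) != taken:
            wrong += 1
        table.update(pc, taken)
    return wrong


# A loop branch taken nine times and then not taken, executed twice.
loop_outcomes = ([True] * 9 + [False]) * 2
print(count_mispredictions(BranchHistoryTable(), 40, loop_outcomes))
```

With two bits per entry, the single not-taken outcome at the loop exit does not flip the prediction for the next run of the loop.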