CPSC 330 Computer Organization Chapter 6-II Pipelining
Problem 6.2
• A computer architect needs to design the pipeline for a new microprocessor. The core of the workload program consists of 1E6 instructions. Each instruction takes 100 ps to finish.
• How long does it take to execute this program on a non-pipelined processor?
• The current processor has about 20 pipeline stages. Assuming it is perfectly pipelined, what potential speedup will it achieve compared to a non-pipelined processor?
• Pipelining introduces overhead per pipeline stage. Will this affect instruction latency, instruction throughput, or both?
CPSC330 CompOrg: Dr. Gerousis
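The arithmetic behind the first two questions can be sketched as follows. This is a rough check under ideal assumptions (no hazards, no per-stage overhead, stages of equal length); the function names are illustrative and not part of the problem statement.

```python
# Sketch for Problem 6.2 under ideal assumptions:
# no hazards, no per-stage overhead, all stages equally long.
INSTRUCTIONS = 1_000_000  # 1E6 instructions (from the problem)
LATENCY_PS = 100          # time for one instruction (from the problem)
STAGES = 20               # pipeline depth (from the problem)


def non_pipelined_time_ps(n, latency_ps):
    """Each instruction runs to completion before the next one starts."""
    return n * latency_ps


def pipelined_time_ps(n, latency_ps, stages):
    """The first instruction fills the pipe; after that one finishes per stage time."""
    stage_ps = latency_ps / stages
    return (n + stages - 1) * stage_ps


t_seq = non_pipelined_time_ps(INSTRUCTIONS, LATENCY_PS)
t_pipe = pipelined_time_ps(INSTRUCTIONS, LATENCY_PS, STAGES)
speedup = t_seq / t_pipe

print(t_seq / 1e6, "microseconds without pipelining")
print(round(speedup, 3), "times faster with the ideal 20-stage pipeline")
```

For a long program the speedup approaches the number of stages; the small shortfall comes from the cycles needed to fill the pipeline.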
Hazards
3 types of pipelining hazards:
• structural hazards: attempt to use the same resource in two different ways at the same time
  – a register is written back at the same time that the instruction currently in the decode cycle reads its bits from the same register
• data hazards: attempt to use a register before its proper value is ready
  – an instruction depends on the result of a prior instruction still in the pipeline
• control hazards: attempt to make a decision before the information needed to make it is available
  – branch instructions
Hazards - Review
• For R-type instructions there are 4 possible conflicts:
  – 1a. EX/MEM.RegisterRd = ID/EX.RegisterRs
  – 1b. EX/MEM.RegisterRd = ID/EX.RegisterRt
  – 2a. MEM/WB.RegisterRd = ID/EX.RegisterRs
  – 2b. MEM/WB.RegisterRd = ID/EX.RegisterRt
Example (conflict 1a: $2 is read as Rs):
sub $2, $1, $3
and $12, $2, $5
[Figure: pipeline diagram of the two instructions through IM, Reg, DM, Reg, with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB]
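The four conditions can be checked mechanically. The sketch below is only an illustration: the `PipelineRegs` record and its field names are hypothetical, and the additional checks that real hardware needs (for example, that the earlier instruction actually writes a register) are left out.

```python
from collections import namedtuple

# Hypothetical record holding only the register numbers the four conditions compare.
# None means "no instruction writing a register in that pipeline register".
PipelineRegs = namedtuple("PipelineRegs", "ex_mem_rd mem_wb_rd id_ex_rs id_ex_rt")


def conflicts(regs):
    """Return the labels (1a, 1b, 2a, 2b) of every condition that holds."""
    found = []
    if regs.ex_mem_rd is not None:
        if regs.ex_mem_rd == regs.id_ex_rs:
            found.append("1a")
        if regs.ex_mem_rd == regs.id_ex_rt:
            found.append("1b")
    if regs.mem_wb_rd is not None:
        if regs.mem_wb_rd == regs.id_ex_rs:
            found.append("2a")
        if regs.mem_wb_rd == regs.id_ex_rt:
            found.append("2b")
    return found


# sub $2, $1, $3 is in EX/MEM (Rd = 2); and $12, $2, $5 is in ID/EX (Rs = 2, Rt = 5).
sub_then_and = PipelineRegs(ex_mem_rd=2, mem_wb_rd=None, id_ex_rs=2, id_ex_rt=5)
print(conflicts(sub_then_and))
```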
Hazards - Review (continued)
Example (conflict 1b: $2 is read as Rt):
sub $2, $1, $3
and $12, $5, $2
[Figure: pipeline diagram of the two instructions through IM, Reg, DM, Reg, with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB]
Hazards - Review (continued)
Example (conflict 2a: the or instruction reads $2 as Rs):
sub $2, $1, $3
and $12, $2, $5
or $13, $2, $6
[Figure: pipeline diagram of the three instructions through IM, Reg, DM, Reg, with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB]
Hazards - Review (continued)
Example (conflict 2b: the or instruction reads $2 as Rt):
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
[Figure: pipeline diagram of the three instructions through IM, Reg, DM, Reg, with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB]
Data Hazard - Problem
ADD R1, R2, R3
SUB R4, R5, R1
OR R8, R3, R9
ADD R3, R2, R4
[Figure: pipeline diagram of the four instructions through IM, Reg, DM, Reg, with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB]
• Identify the instruction(s) affected by the data hazard.
• Fix the hazard by inserting ‘nops’ (note: not efficient).
Fixing the data hazard using software: ‘NOPS’
[Figure: pipeline diagram of the instruction sequence with nops inserted; each row passes through IM, Reg, DM, Reg]
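One way to check a nop placement is to schedule the instructions in software. The sketch below assumes there is no forwarding and that the register file is written in the first half of a cycle and read in the second half, so a consumer must sit at least 3 slots after its producer. The tuple encoding and the function name `insert_nops` are illustrative only.

```python
# Each instruction is (opcode, destination register, source registers).
NOP = ("nop", None, ())


def insert_nops(program, min_distance=3):
    """Pad with nops so every consumer is at least min_distance slots after its producer.

    min_distance=3 assumes no forwarding and a register file that is written
    in the first half of a cycle and read in the second half.
    """
    scheduled = []
    last_write = {}  # register name -> slot of the most recent instruction writing it
    for op, dest, sources in program:
        earliest = len(scheduled)
        for src in sources:
            if src in last_write:
                earliest = max(earliest, last_write[src] + min_distance)
        scheduled.extend([NOP] * (earliest - len(scheduled)))
        if dest is not None:
            last_write[dest] = len(scheduled)
        scheduled.append((op, dest, sources))
    return scheduled


program = [
    ("ADD", "R1", ("R2", "R3")),
    ("SUB", "R4", ("R5", "R1")),
    ("OR", "R8", ("R3", "R9")),
    ("ADD", "R3", ("R2", "R4")),
]

for op, dest, sources in insert_nops(program):
    print(op if dest is None else f"{op} {dest}, {', '.join(sources)}")
```

Under these assumptions SUB waits for R1 and the final ADD waits for R4, while OR is unaffected because R3 is only written by a later instruction.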
Fixing the data hazard using data forwarding
• Use temporary results, don’t wait for them to be written
• Pipeline registers have data too!
• The “sub” ALU operation completes prior to the “and” execution
• Same for “or”
• No conflict with “add”, since the write completes before the read in the same cycle
[Figure: pipeline diagram, time in clock cycles (CC 1 to CC 9), each row passing through IM, Reg, DM, Reg. Program execution order (in instructions):]
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
Fixing data hazard: Hardware for data forwarding
• The main idea (some details not shown)
[Figure: pipelined datapath with a forwarding unit. Blocks: PC, Instruction memory, Control, Registers, ALU, Data memory, multiplexers (Mux), Forwarding unit. Pipeline registers: IF/ID, ID/EX (WB, M, EX), EX/MEM (WB, M), MEM/WB (WB). Signals: IF/ID.RegisterRs (Rs), IF/ID.RegisterRt (Rt), IF/ID.RegisterRd (Rd), EX/MEM.RegisterRd, MEM/WB.RegisterRd]
Can't always forward
• Load word can still cause a hazard:
  – an instruction (and) tries to read a register following a load instruction that writes to the same register.
  – the ‘lw’ to ‘and’ dependence goes backwards in time.
• Thus, we need a hazard detection unit to “stall” the instruction that follows the load
[Figure: pipeline diagram, time in clock cycles (CC 1 to CC 9), each row passing through IM, Reg, DM, Reg. Program execution order (in instructions):]
lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
Data Hazards and “stalls”
• By stalling instructions in the pipe for 1 cycle, the dependence no longer goes backwards in time
• A “stall” is said to inject a “bubble” or “nop” into the pipe
[Figure: pipeline diagram, time in clock cycles (CC 1 to CC 10), each row passing through IM, Reg, DM, Reg. Program execution order (in instructions):]
lw $2, 20($1)
and becomes nop (bubble)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
Detecting “lw” hazard & injecting a “bubble” (nop)
• If (ID/EX.MemRead AND ((ID/EX.RegisterRt = IF/ID.RegisterRs) OR (ID/EX.RegisterRt = IF/ID.RegisterRt))), then stall the pipeline
• Inject a “bubble” (a “nop”) in the pipe: set control lines to 0
[Figure: pipelined datapath with a hazard detection unit and a forwarding unit. Blocks: PC, Instruction memory, Control, Registers, ALU, Data memory, multiplexers (Mux), Hazard detection unit, Forwarding unit. Pipeline registers: IF/ID, ID/EX, EX/MEM, MEM/WB. Signals: ID/EX.MemRead, IF/ID Write, PC Write, ID/EX.RegisterRt, IF/ID.RegisterRs, IF/ID.RegisterRt, IF/ID.RegisterRd, EX/MEM.RegisterRd, MEM/WB.RegisterRd; a 0 input to the control mux]
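The stall condition above translates almost word for word into code. The following is a minimal sketch; the function name and the plain-integer register encoding are assumptions made for illustration.

```python
def must_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """Load-use check: the instruction in EX is a load (MemRead set) and the
    instruction in ID reads the register that the load will write (Rt of the load)."""
    return bool(id_ex_mem_read and (id_ex_rt == if_id_rs or id_ex_rt == if_id_rt))


# lw $2, 20($1) is in EX (Rt = 2); and $4, $2, $5 is in ID (Rs = 2, Rt = 5).
print(must_stall(True, 2, 2, 5))

# The same register overlap after a non-load instruction does not stall,
# because forwarding can supply the value in time.
print(must_stall(False, 2, 2, 5))
```

When the check is true, the hardware holds PC and IF/ID (PC Write and IF/ID Write deasserted) and zeroes the control lines so that a bubble moves down the pipe.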
Branch Hazards
• When we decide to branch, other instructions are in the pipeline!
• Assuming “branch not taken” (simple form of branch prediction)
  – need to ‘flush’ instructions, if we are wrong
  – add control line, IF.flush
[Figure: pipeline diagram, time in clock cycles (CC 1 to CC 9), each row passing through IM, Reg, DM, Reg. Program execution order (in instructions), with instruction addresses:]
40 beq $1, $3, 28
44 and $12, $2, $5
48 or $13, $6, $2
52 add $14, $2, $2
72 lw $4, 50($7)
Control Hazards: branch prediction
• It’s easy to continue executing sequentially, assuming or predicting “branch not taken”
• If the prediction fails, simply flush the pipe, inject a bubble, and take the branch
• However, what if the branch is taken the majority of the time?
• We could record the result of the branch condition for future branches
• A “branch prediction buffer” or “branch history table” could provide this dynamically
• Hence, dynamic branch prediction: prediction of branches at runtime using runtime information
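To make the idea of a branch history table concrete, here is a minimal sketch of a 1-bit predictor that remembers the last outcome of each branch. The table size, the indexing by word address, and the class name are assumptions for illustration, not details from the slides.

```python
class OneBitPredictor:
    """Branch history table with one bit per entry: the last outcome seen."""

    def __init__(self, entries=16):
        self.entries = entries
        self.table = [False] * entries  # False means "predict not taken"

    def _index(self, pc):
        # Instructions are word aligned, so drop the two low address bits.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        return self.table[self._index(pc)]

    def update(self, pc, taken):
        self.table[self._index(pc)] = taken


def count_mispredictions(predictor, pc, outcomes):
    """Run one branch through a sequence of actual outcomes."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


# A loop branch at address 40: taken 9 times, then falls through once.
loop_outcomes = [True] * 9 + [False]
dynamic_misses = count_mispredictions(OneBitPredictor(), 40, loop_outcomes)
static_misses = sum(loop_outcomes)  # "always predict not taken" misses every taken branch
print(dynamic_misses, static_misses)
```

For this loop the 1-bit table mispredicts only on the first and last iterations, whereas a fixed "not taken" prediction is wrong on every taken iteration.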
Control Hazards: branch delayed decision
• To avoid a possible pipeline flush as a result of a mispredicted branch, we could place instructions that execute regardless of the branch outcome (“non-conditional” instructions) in the delayed branch slots
• Compilers and assemblers could handle this, inserting “nop”s whenever suitable instructions can’t be found
• MIPS actually implements delayed branches
[Figure: delayed-branch timing diagram. Time axis from 2 to 14, with 2 ns marked between successive instructions; each row passes through Instruction fetch, Reg, ALU, Data access, Reg. Program execution order (in instructions):]
beq $1, $2, 40
add $4, $5, $6 (delayed branch slot)
lw $3, 300($0)
Enhancing performance with pipelining: Summary
• What makes it easy/simple?
  – all instructions are the same length
  – just a few instruction formats
  – memory operands appear only in loads and stores
• What makes it tough/challenging?
  – structural hazards: suppose we had only one memory
  – control hazards: need to worry about branch instructions
  – data hazards: an instruction needs data that is not yet available
Comparing Performance (p.425)
• Compare the performance for single-cycle, multicycle, and pipelined control using the SPECint2000 instruction mix:
  – 25% loads
  – 10% stores
  – 11% branches
  – 2% jumps
  – 52% ALU
• The number of clock cycles for each instruction class:
  – Loads: 5
  – Stores: 4
  – Branches: 3
  – Jumps: 3
  – ALU: 4
Comparing Performance
• Start with the performance of the single-cycle machine:
  – 200 ps for memory access
  – 100 ps for ALU operation
  – 50 ps for register file read or write
• What is the clock cycle time for the single-cycle datapath?
• What is the average CPI for the multicycle design?
• What is the average CPI for the pipelined design? (Loads, stores, and ALU instructions take 1 clock cycle. Branches take 1 clock cycle when predicted correctly and 2 when not. Jump CPI = 2)
• Note that the long cycle time of memory is a performance bottleneck for the pipelined and multicycle designs.
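The three questions reduce to weighted sums, sketched below. Two things are assumptions rather than slide content: the single-cycle critical path is taken to be a load (instruction fetch, register read, ALU, data access, register write), and the branch misprediction rate is left as a parameter because the slide does not give one (the 25% used in the example is purely hypothetical).

```python
# Instruction mix and multicycle cycle counts from the slides.
MIX = {"load": 0.25, "store": 0.10, "branch": 0.11, "jump": 0.02, "alu": 0.52}
MULTICYCLE_CYCLES = {"load": 5, "store": 4, "branch": 3, "jump": 3, "alu": 4}

# Unit delays from the slides, in picoseconds.
MEM_PS, ALU_PS, REG_PS = 200, 100, 50

# Assumption: the slowest single-cycle instruction is a load, which uses
# instruction memory, register read, ALU, data memory, and register write.
single_cycle_clock_ps = MEM_PS + REG_PS + ALU_PS + MEM_PS + REG_PS

# Multicycle CPI: weight each class's cycle count by its frequency.
multicycle_cpi = sum(MIX[k] * MULTICYCLE_CYCLES[k] for k in MIX)


def pipelined_cpi(mispredict_rate):
    """Pipelined CPI using the per-class costs stated on the slide.

    mispredict_rate is a hypothetical parameter; the slide does not specify it.
    """
    branch_cpi = 1 * (1 - mispredict_rate) + 2 * mispredict_rate
    per_class = {"load": 1, "store": 1, "alu": 1, "branch": branch_cpi, "jump": 2}
    return sum(MIX[k] * per_class[k] for k in MIX)


print(single_cycle_clock_ps, "ps single-cycle clock")
print(round(multicycle_cpi, 2), "multicycle CPI")
print(round(pipelined_cpi(0.25), 4), "pipelined CPI at a 25% misprediction rate")
```

Because the multicycle and pipelined clocks are both limited by the 200 ps memory access, multiplying each CPI by that cycle time gives a rough time per instruction for comparison against the single-cycle clock.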
Problem 6.3
• Show the forwarding paths needed to execute the following four instructions:
add $3, $4, $6
sub $5, $3, $2
lw $7, 100($5)
add $8, $7, $2
Problem 6.3
[Figure: pipeline diagram with five rows, each passing through IM, Reg, DM, Reg. Instructions shown: add $3, $4, $6; sub $5, $3, $2; lw $7, 100($5); add $8, $7, $2 (labelled on two rows)]
Problem 6.22
Problem 6.22
[Figure: pipeline diagram, each row passing through IM, Reg, DM, Reg. Instructions shown: add $2, $3, $5; lw $4, 100($2); sub $6, $4, $3]
Problem 6.22 - continued
[Figure: pipeline diagram extending to CC8, with four rows, each passing through IM, Reg, DM, Reg. Instructions shown: add $2, $3, $5; lw $4, 100($2); sub $6, $4, $3]
Advanced Pipelining
• Computer Organization and Design covers advanced pipelining in 18 pages → Sections 6.9 – 6.10
• Consult one of the advanced books, Computer Architecture: A Quantitative Approach (CPEN414)
• A Verilog Hardware Description Language model of a pipeline like that in the Pentium 4 will be on the order of thousands of lines.
Advanced Pipelining: Extracting More Performance
• Increase the depth of the pipeline
  – DEC Alpha 21264: 9 stage pipeline
• Launch multiple instructions in every pipeline stage (multiple issue)
  – Replicate units
• A 6 GHz four-way multiple-issue microprocessor (CPI = 0.25) can execute a peak of 24 billion instructions per second!
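The peak rate quoted above follows from dividing the clock rate by the CPI. A short check, with an illustrative function name:

```python
def peak_instructions_per_second(clock_hz, cpi):
    """Peak rate = cycles per second divided by cycles per instruction."""
    return clock_hz / cpi


# Four-way multiple issue means up to 4 instructions per cycle, so CPI = 1/4.
peak = peak_instructions_per_second(6e9, 0.25)
print(peak / 1e9, "billion instructions per second")
```

This is a peak figure: it assumes all four issue slots are filled on every cycle, which hazards and stalls prevent in practice.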
Concluding Remark
• This chapter gives you the background you need to learn more!
• Next time we will start chapter 7: Memory