Multicycle Datapath Implementation
Adapted from instructor’s supplementary material from Computer Organization and Design, 4th Edition, Patterson & Hennessy, © 2008, MK, and Computer Architecture: From Microprocessors to Supercomputers, B. Parhami, 2005, Oxford Press
Review: A Multicycle Data Path
[Figure: Fig. 14.2 Abstract view of a multicycle instruction execution unit for MicroMIPS. For naming of instruction fields, see Fig. 13.1. Labeled blocks: PC, Cache, Inst Reg, Reg file, x Reg, y Reg, ALU, z Reg, Data Reg, Control. Labeled signals: Address, Data, jta, rs, rt, rd, (rs), (rt), imm, op, fn.]
Feb. 2011, Computer Architecture, Data Path and Control
[Figure: control state machine with States 0 to 8 spread over Cycles 1 to 5. Start enters State 0. Transitions out of State 1 are labeled Jump/Branch, lw/sw, and ALUtype; State 2 leads to State 3 for lw and to State 6 for sw.]
State 0: InstData = 0, MemRead = 1, IRWrite = 1, ALUSrcX = 0, ALUSrcY = 0, ALUFunc = ‘+’, PCSrc = 3, PCWrite = 1
State 1: ALUSrcX = 0, ALUSrcY = 3, ALUFunc = ‘+’
State 2: ALUSrcX = 1, ALUSrcY = 2, ALUFunc = ‘+’
State 3: InstData = 1, MemRead = 1
State 4: RegDst = 0, RegInSrc = 0, RegWrite = 1
State 5: ALUSrcX = 1, ALUSrcY = 1, ALUFunc = ‘−’, JumpAddr = %, PCSrc = @, PCWrite = #
State 6: InstData = 1, MemWrite = 1
State 7: ALUSrcX = 1, ALUSrcY = 1 or 2, ALUFunc = Varies
State 8: RegDst = 0 or 1, RegInSrc = 1, RegWrite = 1
Notes for State 5:
% 0 for j or jal, 1 for syscall, don’t-care for other instr’s
@ 0 for j, jal, and syscall, 1 for jr, 2 for branches
# 1 for j, jr, jal, and syscall, ALUZero (or its complement) for beq (bne), bit 31 of ALUout for bltz
For jal, RegDst = 2, RegInSrc = 1, RegWrite = 1
Note for State 7: ALUFunc is determined based on the op and fn fields
[Figure: microprogrammed control unit. Labeled parts: op (from instruction register), Dispatch table 1, Dispatch table 2, Start (0), Incr, MicroPC, Address, Microprogram memory or PLA, Data, Microinstruction register, Control signals to data path, Sequence control.]
Microinstruction fields:
PC control: JumpAddr, PCSrc, PCWrite
Cache control: InstData, MemRead, MemWrite, IRWrite
Register control: RegInSrc, RegDst, RegWrite
ALU inputs: ALUSrcX, ALUSrcY
ALU function: FnType, LogicFn, AddSub
Sequence control
Review: Microprogramming
fetch:    PCnext, CacheFetch            # State 0 (start)
          PC + 4imm, μPCdisp1           # State 1
lui1:     lui(imm)                      # State 7lui
          rt ← z, μPCfetch              # State 8lui
add1:     x + y                         # State 7add
          rd ← z, μPCfetch              # State 8add
sub1:     x - y                         # State 7sub
          rd ← z, μPCfetch              # State 8sub
slt1:     x - y                         # State 7slt
          rd ← z, μPCfetch              # State 8slt
addi1:    x + imm                       # State 7addi
          rt ← z, μPCfetch              # State 8addi
slti1:    x - imm                       # State 7slti
          rt ← z, μPCfetch              # State 8slti
and1:     x ∧ y                         # State 7and
          rd ← z, μPCfetch              # State 8and
or1:      x ∨ y                         # State 7or
          rd ← z, μPCfetch              # State 8or
xor1:     x ⊕ y                         # State 7xor
          rd ← z, μPCfetch              # State 8xor
nor1:     ¬(x ∨ y)                      # State 7nor
          rd ← z, μPCfetch              # State 8nor
andi1:    x ∧ imm                       # State 7andi
          rt ← z, μPCfetch              # State 8andi
ori1:     x ∨ imm                       # State 7ori
          rt ← z, μPCfetch              # State 8ori
xori1:    x ⊕ imm                       # State 7xori
          rt ← z, μPCfetch              # State 8xori
lwsw1:    x + imm, μPCdisp2             # State 2
lw2:      CacheLoad                     # State 3
          rt ← Data, μPCfetch           # State 4
sw2:      CacheStore, μPCfetch          # State 6
j1:       PCjump, μPCfetch              # State 5j
jr1:      PCjreg, μPCfetch              # State 5jr
branch1:  PCbranch, μPCfetch            # State 5branch
jal1:     PCjump, $31 ← PC, μPCfetch    # State 5jal
syscall1: PCsyscall, μPCfetch           # State 5syscall
Review: Exception Control States
[Figure: Fig. 14.10 Exception states 9 and 10 added to the control state machine. States 0 to 8 keep the signal settings of the basic control state machine; the new transitions are labeled Illegal operation and Overflow.]
State 9: IntCause = 1, CauseWrite = 1, ALUSrcX = 0, ALUSrcY = 0, ALUFunc = ‘−’, EPCWrite = 1, JumpAddr = 1, PCSrc = 0, PCWrite = 1
State 10: IntCause = 0, CauseWrite = 1, ALUSrcX = 0, ALUSrcY = 0, ALUFunc = ‘−’, EPCWrite = 1, JumpAddr = 1, PCSrc = 0, PCWrite = 1
MIPS Pipelined Datapath and Control
Single-Cycle vs. Multicycle vs. Pipelined
[Figure: timing of Instr 1 to Instr 4 against the clock. Single-cycle panel: labels Time needed and Time allotted. Multicycle panel: the four instructions take 3, 5, 3, and 4 cycles; labels Time needed, Time allotted, and Time saved.]
[Figure: pipelined execution of seven instructions over cycles 1 to 11, with f = Fetch, r = Reg read, a = ALU op, d = Data access, w = Writeback. (a) Task-time diagram: Instruction versus Cycle. (b) Space-time diagram: Pipeline stage versus Cycle, with the Start-up region and Drainage region marked.]
Pipelining Analogy
• Pipelined laundry: overlapping execution
  – Parallelism improves performance
• Four loads:
  – Speedup = 8/3.5 = 2.3
• Non-stop:
  – Speedup = 2n/(0.5n + 1.5) ≈ 4 = number of stages
§4.5 An Overview of Pipelining (Chapter 4, The Processor)
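The two laundry speedups can be checked with a few lines of Python. This is a minimal sketch: the four stages of 0.5 hours each are inferred from the 2n and 0.5n + 1.5 terms in the formula, and the function names are hypothetical.

```python
def laundry_hours(loads, stages=4, stage_hours=0.5, pipelined=True):
    """Total hours to finish `loads` loads (stage count and length are assumptions)."""
    if pipelined:
        # The first load passes through every stage; each later load
        # finishes one stage time after the one before it.
        return stages * stage_hours + (loads - 1) * stage_hours
    # Without overlap, every load occupies all stages by itself.
    return loads * stages * stage_hours


def speedup(loads):
    return laundry_hours(loads, pipelined=False) / laundry_hours(loads, pipelined=True)


print(round(speedup(4), 1))       # 2.3, i.e. 8 / 3.5
print(round(speedup(10_000), 2))  # 4.0, approaching the number of stages
```

For small n the 1.5-hour fill time dominates, which is why four loads only reach 2.3 rather than 4.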
MIPS Pipeline
Five stages, one step per stage
1. IF: Instruction fetch from memory
2. ID: Instruction decode & register read
3. EX: Execute operation or calculate address
4. MEM: Access memory operand
5. WB: Write result back to register
[Figure: lw passing through IFetch, Dec, Exec, Mem, WB in Cycle 1 to Cycle 5]
Pipeline Performance
• Assume time for stages is
  – 100ps for register read or write
  – 200ps for other stages
• Compare pipelined datapath with single-cycle datapath

Instr     Instr fetch  Register read  ALU op  Memory access  Register write  Total time
lw        200ps        100ps          200ps   200ps          100ps           800ps
sw        200ps        100ps          200ps   200ps          -               700ps
R-format  200ps        100ps          200ps   -              100ps           600ps
beq       200ps        100ps          200ps   -              -               500ps
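The totals in the table, and the clock periods they imply, can be recomputed directly. The sketch below assumes only the stage times stated on the slide; the dictionary keys are hypothetical names for the table columns.

```python
# Stage latencies in picoseconds, taken from the slide.
STAGE_PS = {"fetch": 200, "reg_read": 100, "alu": 200, "mem": 200, "reg_write": 100}

# Which steps each instruction class actually needs (one table row each).
USES = {
    "lw": ["fetch", "reg_read", "alu", "mem", "reg_write"],
    "sw": ["fetch", "reg_read", "alu", "mem"],
    "R-format": ["fetch", "reg_read", "alu", "reg_write"],
    "beq": ["fetch", "reg_read", "alu"],
}

totals = {instr: sum(STAGE_PS[s] for s in steps) for instr, steps in USES.items()}

# A single-cycle clock must fit the slowest instruction;
# a pipelined clock must fit the slowest stage.
single_cycle_tc = max(totals.values())
pipelined_tc = max(STAGE_PS.values())

print(totals)                         # lw 800, sw 700, R-format 600, beq 500
print(single_cycle_tc, pipelined_tc)  # 800 200
```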
Pipeline Performance
[Figure: instruction timelines for Single-cycle (Tc = 800ps) and Pipelined (Tc = 200ps)]
Pipeline Speedup
• If all stages are balanced
  – i.e., all take the same time
  – Time between instructions (pipelined) = Time between instructions (nonpipelined) / Number of stages
• If not balanced, speedup is less
• Speedup due to increased throughput
  – Latency (time for each instruction) does not decrease
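A short calculation illustrates both points: the unbalanced stages cap the speedup below the stage count, and the latency of a single instruction does not improve. This sketch assumes the 800ps and 200ps clock periods from the Pipeline Performance slides and no hazards; the function names are hypothetical.

```python
def total_time_ps(n, tc_ps, stages=1):
    """Time to finish n instructions; stages=1 models the single-cycle design."""
    # The last instruction completes (stages - 1) cycles after it is fetched.
    return (n + stages - 1) * tc_ps


def speedup(n, single_tc=800, pipe_tc=200, stages=5):
    return total_time_ps(n, single_tc) / total_time_ps(n, pipe_tc, stages)


print(total_time_ps(3, 800))      # 2400 ps for three instructions, single-cycle
print(total_time_ps(3, 200, 5))   # 1400 ps for the same three, pipelined
print(round(speedup(10**6), 2))   # 4.0, not 5, because the stages are unbalanced
print(total_time_ps(1, 200, 5))   # 1000 ps latency for one instruction, above 800 ps
```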
Pipelining and ISA Design
• MIPS ISA designed for pipelining
  – All instructions are 32-bits
    • Easier to fetch and decode in one cycle
    • c.f. x86: 1- to 17-byte instructions
  – Few and regular instruction formats
    • Can decode and read registers in one step
  – Load/store addressing
    • Can calculate address in 3rd stage, access memory in 4th stage
  – Alignment of memory operands
    • Memory access takes only one cycle
Graphically Representing MIPS Pipeline
[Figure: one instruction drawn as the sequence IM, Reg, ALU, DM, Reg]
• Can help with answering questions like:
  – How many cycles does it take to execute this code?
  – What is the ALU doing during cycle 4?
  – Is there a hazard, why does it occur, and how can it be fixed?
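The first two questions can be answered mechanically for a hazard-free instruction stream. The sketch below is an illustration only: it assumes 0-based instruction numbers, 1-based cycles, and no stalls, and the helper names are hypothetical.

```python
STAGES = ["IM", "Reg", "ALU", "DM", "Reg"]


def stage_of(instr, cycle):
    """Stage index used by instruction `instr` (0-based) in `cycle` (1-based), or None."""
    s = cycle - 1 - instr
    return s if 0 <= s < len(STAGES) else None


def diagram(n):
    """One text row per instruction, one 4-character column per cycle."""
    total_cycles = n + len(STAGES) - 1
    rows = []
    for i in range(n):
        cells = []
        for c in range(1, total_cycles + 1):
            s = stage_of(i, c)
            cells.append(f"{STAGES[s]:<4}" if s is not None else "    ")
        rows.append(f"Inst {i}  " + "".join(cells).rstrip())
    return rows


def alu_user(cycle, n):
    """Which instruction, if any, occupies the ALU in the given cycle."""
    for i in range(n):
        if stage_of(i, cycle) == 2:
            return i
    return None


print("\n".join(diagram(5)))  # five instructions finish in 5 + 4 = 9 cycles
print(alu_user(4, 5))         # 1: the ALU is executing Inst 1 during cycle 4
```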
Why Pipeline? For Performance!
[Figure: Inst 0 to Inst 4 in instruction order, each passing through IM, Reg, ALU, DM, Reg across time (clock cycles), with the time to fill the pipeline marked]
Once the pipeline is full, one instruction is completed every cycle, so CPI = 1
Hazards
• Situations that prevent starting the next instruction in the next cycle
• Structure hazards
  – A required resource is busy
• Data hazard
  – Need to wait for previous instruction to complete its data read/write
• Control hazard
  – Deciding on control action depends on previous instruction
Structure Hazards
• Conflict for use of a resource
• In MIPS pipeline with a single memory
  – Load/store requires data access
  – Instruction fetch would have to stall for that cycle
    • Would cause a pipeline “bubble”
• Hence, pipelined datapaths require separate instruction/data memories
  – Or separate instruction/data caches
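The conflict can be located by lining up fetch and data-access cycles. This is a simplified sketch: it assumes the five-stage timing with 0-based instruction numbers and 1-based cycles, flags each clash independently, and does not model how an inserted bubble would shift later instructions. The function name is hypothetical.

```python
def memory_conflicts(program, unified_memory=True):
    """Return (cycle, fetching_instr, accessing_instr) for each memory clash."""
    if not unified_memory:
        return []  # separate instruction and data memories never collide
    conflicts = []
    for i, op in enumerate(program):
        # Instruction i is in MEM during cycle i + 4, the same cycle
        # in which instruction i + 3 is being fetched.
        if op in ("lw", "sw") and i + 3 < len(program):
            conflicts.append((i + 4, i + 3, i))
    return conflicts


program = ["lw", "add", "sub", "and", "or"]
print(memory_conflicts(program))                        # [(4, 3, 0)]
print(memory_conflicts(program, unified_memory=False))  # []
```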
A Single Memory Would Be a Structural Hazard
[Figure: lw followed by Inst 1 to Inst 4 in instruction order, each passing through Mem, Reg, ALU, Mem, Reg across time (clock cycles). In one cycle lw is reading data from memory while a later instruction is reading an instruction from memory.]
Fix with separate instr and data memories (I$ and D$)
Data Hazards
• An instruction depends on completion of data access by a previous instruction
  – add $s0, $t0, $t1
    sub $t2, $s0, $t3
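Dependences like the one between add and sub can be found by tracking the most recent writer of each register. The sketch below assumes instructions are given as (op, dest, src1, src2) tuples, a hypothetical encoding chosen for illustration.

```python
def raw_dependences(program):
    """Return (producer_index, consumer_index, register) for each read-after-write pair."""
    deps = []
    last_writer = {}  # register name -> index of the latest instruction writing it
    for idx, (op, dest, *srcs) in enumerate(program):
        for reg in dict.fromkeys(srcs):  # drop a repeated source, keep order
            if reg in last_writer:
                deps.append((last_writer[reg], idx, reg))
        last_writer[dest] = idx
    return deps


program = [
    ("add", "$s0", "$t0", "$t1"),
    ("sub", "$t2", "$s0", "$t3"),
]
print(raw_dependences(program))  # [(0, 1, '$s0')]
```

Whether a dependence becomes a stall depends on how far apart the two instructions are and on whether forwarding is available.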
Register Usage Can Cause Data Hazards
• Dependencies backward in time cause hazards
[Figure: five instructions, each passing through IM, Reg, ALU, DM, Reg, with the read before write data hazard marked]
add $1,
sub $4,$1,$5
and $6,$1,$7
or  $8,$1,$9
xor $4,$1,$5
Loads Can Cause Data Hazards
• Dependencies backward in time cause hazards
[Figure: five instructions in instruction order, each passing through IM, Reg, ALU, DM, Reg, with the load-use data hazard marked]
lw  $1,4($2)
sub $4,$1,$5
and $6,$1,$7
or  $8,$1,$9
xor $4,$1,$5
How About Register File Access?
[Figure: add $1, followed by Inst 1, Inst 2, and add $2,$1, in instruction order across time (clock cycles), each passing through IM, Reg, ALU, DM, Reg. Two clock edges are marked: the clock edge that controls register writing and the clock edge that controls loading of pipeline state registers.]
Fix register file access hazard by doing reads in the second half of the cycle and writes in the first half
One Way to “Fix” a Data Hazard
[Figure: add $1, followed by two stall slots, then sub $4,$1,$5 and and $6,$1,$7, in instruction order, each passing through IM, Reg, ALU, DM, Reg]
Can fix data hazard by waiting (stall), but impacts CPI
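The number of stall cycles follows from when each source register becomes readable. The sketch below assumes no forwarding, the five-stage timing, and a register file that writes in the first half of a cycle and reads in the second half, so a consumer may decode in the producer's write-back cycle. The add operands $2 and $3 are hypothetical, since the slide does not show them.

```python
def schedule_without_forwarding(program):
    """program: list of (op, dest, src1, src2). Return the decode (ID) cycle of each instruction."""
    wb_cycle = {}   # register -> cycle in which its newest value is written back
    id_cycles = []
    prev_id = 1     # makes the first instruction decode in cycle 2
    for op, dest, *srcs in program:
        earliest = prev_id + 1  # in-order issue: at least one cycle after the previous ID
        for reg in srcs:
            earliest = max(earliest, wb_cycle.get(reg, 0))
        id_cycles.append(earliest)
        wb_cycle[dest] = earliest + 3  # ID, then EX, MEM, WB
        prev_id = earliest
    return id_cycles


def stall_cycles(program):
    # Without stalls, instruction k (0-based) would decode in cycle k + 2.
    return schedule_without_forwarding(program)[-1] - (len(program) + 1)


program = [
    ("add", "$1", "$2", "$3"),  # source operands are hypothetical
    ("sub", "$4", "$1", "$5"),
    ("and", "$6", "$1", "$7"),
]
print(schedule_without_forwarding(program))  # [2, 5, 6]
print(stall_cycles(program))                 # 2
```

The two bubbles raise the cycle count for these three instructions from 7 to 9, which is the CPI cost the slide refers to.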
Forwarding (aka Bypassing)
• Use result when it is computed
  – Don’t wait for it to be stored in a register
  – Requires extra connections in the datapath
Another Way to “Fix” a Data Hazard
[Figure: five instructions in instruction order, each passing through IM, Reg, ALU, DM, Reg, with forwarding paths drawn from the add result to the later instructions]
add $1,
sub $4,$1,$5
and $6,$1,$7
or  $8,$1,$9
xor $4,$1,$5
Fix data hazards by forwarding results as soon as they are available to where they are needed
Forwarding Illustration
[Figure: three instructions in instruction order, each passing through IM, Reg, ALU, DM, Reg, with two forwarding paths labeled EX forwarding and MEM forwarding]
add $1,
sub $4,$1,$5
and $6,$7,$1
Yet Another Complication!
• Another potential data hazard can occur when there is a conflict between the result of the WB stage instruction and the MEM stage instruction – which should be forwarded?
[Figure: three instructions in instruction order, each passing through IM, Reg, ALU, DM, Reg]
add $1,$1,$2
add $1,$1,$3
add $1,$1,$4
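The usual resolution is to prefer the MEM stage instruction, because it holds the more recent value of the register. The sketch below models that priority with the standard textbook forwarding conditions; the Latch class and the returned labels are hypothetical names, not signals from these slides.

```python
from dataclasses import dataclass


@dataclass
class Latch:
    """Fields of a pipeline register that matter for forwarding (illustrative)."""
    reg_write: bool
    rd: int


def forward_select(src, ex_mem, mem_wb):
    """Choose where the ALU operand for register number `src` should come from."""
    # Check the instruction now in MEM first: it is the newer producer.
    if ex_mem.reg_write and ex_mem.rd != 0 and ex_mem.rd == src:
        return "EX/MEM"
    # Fall back to the instruction now in WB only if MEM did not match.
    if mem_wb.reg_write and mem_wb.rd != 0 and mem_wb.rd == src:
        return "MEM/WB"
    return "REGFILE"


# Third add in EX: the second add is in MEM and the first is in WB, both writing $1.
second_add = Latch(reg_write=True, rd=1)
first_add = Latch(reg_write=True, rd=1)
print(forward_select(1, second_add, first_add))  # EX/MEM
print(forward_select(4, second_add, first_add))  # REGFILE
```

Register $0 is excluded because it always reads as zero, so a result nominally written to it must never be forwarded.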
Load-Use Data Hazard
• Can’t always avoid stalls by forwarding
  – If value not computed when needed
  – Can’t forward backward in time!
Code Scheduling to Avoid Stalls
• Reorder code to avoid use of load result in the next instruction
• C code for A = B + E; C = B + F;

Original order (13 cycles):
        lw   $t1, 0($t0)
        lw   $t2, 4($t0)
stall   add  $t3, $t1, $t2
        sw   $t3, 12($t0)
        lw   $t4, 8($t0)
stall   add  $t5, $t1, $t4
        sw   $t5, 16($t0)

Reordered (11 cycles):
        lw   $t1, 0($t0)
        lw   $t2, 4($t0)
        lw   $t4, 8($t0)
        add  $t3, $t1, $t2
        sw   $t3, 12($t0)
        add  $t5, $t1, $t4
        sw   $t5, 16($t0)
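The 13-cycle and 11-cycle figures can be reproduced by counting load-use bubbles. The sketch below assumes a five-stage pipeline with full forwarding, so the only stall is one cycle when an instruction uses the result of the lw immediately before it. The tuple encoding (op, dest, sources...) is hypothetical, with dest set to None for sw.

```python
def load_use_stalls(program):
    """Count one bubble for each lw whose result the very next instruction reads."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest = prev[0], prev[1]
        cur_srcs = cur[2:]
        if prev_op == "lw" and prev_dest in cur_srcs:
            stalls += 1
    return stalls


def total_cycles(program, stages=5):
    # Fill time (stages - 1), one cycle per instruction, plus bubbles.
    return len(program) + stages - 1 + load_use_stalls(program)


original = [
    ("lw", "$t1", "$t0"),
    ("lw", "$t2", "$t0"),
    ("add", "$t3", "$t1", "$t2"),
    ("sw", None, "$t3", "$t0"),
    ("lw", "$t4", "$t0"),
    ("add", "$t5", "$t1", "$t4"),
    ("sw", None, "$t5", "$t0"),
]
reordered = [original[i] for i in (0, 1, 4, 2, 3, 5, 6)]

print(total_cycles(original))   # 13
print(total_cycles(reordered))  # 11
```

Moving the third lw up separates each load from its first use, which removes both bubbles without changing the result.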