CSCE 230J Computer Organization
Processor Architecture V: Making the Pipelined Implementation Work
Dr. Steve Goddard
[email protected]
http://cse.unl.edu/~goddard/Courses/CSCE230J

Giving credit where credit is due
Most of the slides for this lecture are based on slides created by Dr. Bryant, Carnegie Mellon University. I have modified them and added new slides.

Overview
Make the pipelined processor work!
Data Hazards
  Instruction having register R as source follows shortly after instruction having register R as destination
  Common condition, don’t want to slow down pipeline
Control Hazards
  Mispredict conditional branch
    Our design predicts all branches as being taken
    Naïve pipeline executes two extra instructions
  Getting return address for ret instruction
    Naïve pipeline executes three extra instructions
Making Sure It Really Works
  What if multiple special cases happen simultaneously?

Pipeline Stages
Fetch
  Select current PC
  Read instruction
  Compute incremented PC
Decode
  Read program registers
Execute
  Operate ALU
Memory
  Read or write data memory
Write Back
  Update register file
[Figure: pipeline datapath with pipeline registers F, D, E, M, and W separating the Fetch, Decode, Execute, Memory, and Write back stages]

PIPE- Hardware
Pipeline registers hold intermediate values from instruction execution
Forward (Upward) Paths
  Values passed from one stage to next
  Cannot jump past stages
    e.g., valC passes through decode
[Figure: PIPE- hardware structure showing the F, D, E, M, and W pipeline registers and the signals passed between stages]

Data Dependencies: 2 Nop’s
# demo-h2.ys
0x000: irmovl $10,%edx
0x006: irmovl $3,%eax
0x00c: nop
0x00d: nop
0x00e: addl %edx,%eax
0x010: halt
[Figure: pipeline timing diagram, cycles 1 to 10]
Cycle 6
  W: R[%eax] ← 3
  D: valA ← R[%edx] = 10, valB ← R[%eax] = 0
  Error
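The cycle 6 error in demo-h2.ys can be reproduced with a small timing sketch. It assumes an idealized pipeline in which one instruction is fetched per cycle and every instruction advances one stage per cycle with no stalls. The names stage_at and occupancy are illustrative and do not come from the slides.

```python
STAGES = ["F", "D", "E", "M", "W"]

# demo-h2.ys, in program order; instruction i is fetched in cycle i + 1.
PROGRAM = [
    "irmovl $10,%edx",
    "irmovl $3,%eax",
    "nop",
    "nop",
    "addl %edx,%eax",
    "halt",
]


def stage_at(issue_cycle, cycle):
    """Return the stage an instruction occupies in `cycle`, or None.

    Assumes no stalls or bubbles: the instruction is fetched in
    `issue_cycle` and moves forward one stage on every clock.
    """
    index = cycle - issue_cycle
    return STAGES[index] if 0 <= index < len(STAGES) else None


def occupancy(cycle):
    """Map each occupied stage to the instruction it holds in `cycle`."""
    result = {}
    for i, text in enumerate(PROGRAM):
        stage = stage_at(i + 1, cycle)
        if stage is not None:
            result[stage] = text
    return result


print(occupancy(6))
```

In cycle 6 the sketch places addl in decode while the second irmovl is still in write back, so a register file read in that cycle would see the old value of %eax.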

Data Dependencies: No Nop
# demo-h0.ys
0x000: irmovl $10,%edx
0x006: irmovl $3,%eax
0x00c: addl %edx,%eax
0x00e: halt
[Figure: pipeline timing diagram, cycles 1 to 8]
Cycle 4
  M: M_valE = 10, M_dstE = %edx
  E: e_valE ← 0 + 3 = 3, E_dstE = %eax
  D: valA ← R[%edx] = 0, valB ← R[%eax] = 0
  Error

Stalling for Data Dependencies
# demo-h2.ys
0x000: irmovl $10,%edx
0x006: irmovl $3,%eax
0x00c: nop
0x00d: nop
bubble
0x00e: addl %edx,%eax
0x010: halt
[Figure: pipeline timing diagram, cycles 1 to 11]
If instruction follows too closely after one that writes register, slow it down
Hold instruction in decode
Dynamically inject nop into execute stage

Stall Condition
Source Registers
  srcA and srcB of current instruction in decode stage
Destination Registers
  dstE and dstM fields
  Instructions in execute, memory, and write-back stages
Special Case
  Don’t stall for register ID 8
    Indicates absence of register operand
[Figure: PIPE- hardware structure showing srcA and srcB in decode and the dstE and dstM fields of the E, M, and W pipeline registers]

Detecting Stall Condition
# demo-h2.ys
0x000: irmovl $10,%edx
0x006: irmovl $3,%eax
0x00c: nop
0x00d: nop
bubble
0x00e: addl %edx,%eax
0x010: halt
[Figure: pipeline timing diagram, cycles 1 to 11]
Cycle 6
  W: W_dstE = %eax, W_valE = 3
  D: srcA = %edx, srcB = %eax
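The stall condition described above can be sketched as a comparison between the decode-stage source registers and the destination fields held in later stages. This is a minimal sketch, not the course's control logic: the function name needs_stall and the list-based interface are assumptions, and the only facts taken from the slides are the fields compared and the special treatment of register ID 8.

```python
RNONE = 8  # register ID 8 indicates the absence of a register operand


def needs_stall(d_srcA, d_srcB, pending_dests):
    """Return True if a decode-stage source matches a pending destination.

    `pending_dests` holds the dstE and dstM register IDs of the
    instructions currently in the execute, memory, and write-back stages.
    Register ID 8 never causes a stall, on either side of the comparison.
    """
    pending = {reg for reg in pending_dests if reg != RNONE}
    return any(src != RNONE and src in pending for src in (d_srcA, d_srcB))


# Cycle 6 of demo-h2.ys: addl %edx,%eax is in decode, the two nops are in
# execute and memory, and irmovl $3,%eax is in write back.
EAX, EDX = 0, 2
dests = [RNONE, RNONE,  # E_dstE, E_dstM (nop)
         RNONE, RNONE,  # M_dstE, M_dstM (nop)
         EAX, RNONE]    # W_dstE, W_dstM (irmovl $3,%eax)
print(needs_stall(EDX, EAX, dests))  # True
```

With all six destination fields set to 8, the same call returns False, which is the case where decode can proceed.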

Stalling X3
# demo-h0.ys
0x000: irmovl $10,%edx
0x006: irmovl $3,%eax
bubble
bubble
bubble
0x00c: addl %edx,%eax
0x00e: halt
[Figure: pipeline timing diagram, cycles 1 to 11]
Cycle 4
  E: E_dstE = %eax
  D: srcA = %edx, srcB = %eax
Cycle 5
  M: M_dstE = %eax
  D: srcA = %edx, srcB = %eax
Cycle 6
  W: W_dstE = %eax
  D: srcA = %edx, srcB = %eax

What Happens When Stalling?
# demo-h0.ys
0x000: irmovl $10,%edx
0x006: irmovl $3,%eax
0x00c: addl %edx,%eax
0x00e: halt
[Figure: contents of the Write Back, Memory, Execute, Decode, and Fetch stages in cycles 4 through 8]
Stalling instruction held back in decode stage
Following instruction stays in fetch stage
Bubbles injected into execute stage
  Like dynamically generated nop’s
  Move through later stages

Implementing Stalling
Pipeline Control
  Combinational logic detects stall condition
  Sets mode signals for how pipeline registers should update
[Figure: pipe control logic reading D_icode, d_srcA, d_srcB, E_dstE, E_dstM, M_dstE, M_dstM, W_dstE, and W_dstM, and generating F_stall, D_stall, and E_bubble]

Pipeline Register Modes
Mode     stall   bubble   Before rising clock     After rising clock
Normal   0       0        Input = y, Output = x   Output = y
Stall    1       0        Input = y, Output = x   Output = x
Bubble   0       1        Input = y, Output = x   Output = nop
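The three register modes can be sketched as a small class. This is an illustrative model only: the class name PipeReg and its clock method are assumptions, and the state is a single opaque value rather than the individual fields of a real pipeline register.

```python
class PipeReg:
    """A pipeline register with normal, stall, and bubble update modes."""

    def __init__(self, nop_value, initial):
        self.nop_value = nop_value  # state that represents a nop
        self.output = initial       # current state x

    def clock(self, new_input, stall=False, bubble=False):
        """Apply one rising clock edge and return the new output."""
        if stall and bubble:
            # Asking for both modes at once is a pipeline error.
            raise ValueError("pipeline error: stall and bubble both set")
        if stall:
            pass                          # keep the old state x
        elif bubble:
            self.output = self.nop_value  # replace the state with a nop
        else:
            self.output = new_input       # normal: load the input y
        return self.output


print(PipeReg("nop", "x").clock("y"))               # y
print(PipeReg("nop", "x").clock("y", stall=True))   # x
print(PipeReg("nop", "x").clock("y", bubble=True))  # nop
```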

Data Forwarding
Naïve Pipeline
  Register isn’t written until completion of write-back stage
  Source operands read from register file in decode stage
    Needs to be in register file at start of stage
Observation
  Value generated in execute or memory stage
Trick
  Pass value directly from generating instruction to decode stage
  Needs to be available at end of decode stage

Data Forwarding Example
# demo-h2.ys
0x000: irmovl $10,%edx
0x006: irmovl $3,%eax
0x00c: nop
0x00d: nop
0x00e: addl %edx,%eax
0x010: halt
[Figure: pipeline timing diagram, cycles 1 to 10]
Cycle 6
  W: W_dstE = %eax, W_valE = 3
  D: srcA = %edx, srcB = %eax
     valA ← R[%edx] = 10, valB ← W_valE = 3
irmovl in write-back stage
Destination value in W pipeline register
Forward as valB for decode stage

Bypass Paths
Decode Stage
  Forwarding logic selects valA and valB
  Normally from register file
  Forwarding: get valA or valB from later pipeline stage
Forwarding Sources
  Execute: valE
  Memory: valE, valM
  Write back: valE, valM
[Figure: pipeline datapath with bypass paths from e_valE, M_valE, m_valM, W_valE, and W_valM into the Forward block of the decode stage]

Data Forwarding Example #2
# demo-h0.ys
0x000: irmovl $10,%edx
0x006: irmovl $3,%eax
0x00c: addl %edx,%eax
0x00e: halt
[Figure: pipeline timing diagram, cycles 1 to 8]
Cycle 4
  M: M_dstE = %edx, M_valE = 10
  E: E_dstE = %eax, e_valE ← 0 + 3 = 3
  D: srcA = %edx, srcB = %eax
     valA ← M_valE = 10, valB ← e_valE = 3
Register %edx
  Generated by ALU during previous cycle
  Forward from memory as valA
Register %eax
  Value just generated by ALU
  Forward from execute as valB

Implementing Forwarding
Add additional feedback paths from E, M, and W pipeline registers into decode stage
Create logic blocks to select from multiple sources for valA and valB in decode stage
[Figure: PIPE hardware structure with "Sel+Fwd A" and "Fwd B" blocks in the decode stage, fed by e_valE, m_valM, M_valE, W_valM, and W_valE]

Implementing Forwarding
## What should be the A value?
int new_E_valA = [
    # Use incremented PC
    D_icode in { ICALL, IJXX } : D_valP;
    # Forward valE from execute
    d_srcA == E_dstE : e_valE;
    # Forward valM from memory
    d_srcA == M_dstM : m_valM;
    # Forward valE from memory
    d_srcA == M_dstE : M_valE;
    # Forward valM from write back
    d_srcA == W_dstM : W_valM;
    # Forward valE from write back
    d_srcA == W_dstE : W_valE;
    # Use value read from register file
    1 : d_rvalA;
];
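The priority order in the HCL above can be traced with a short sketch that applies the same case-by-case selection to cycle 4 of demo-h0.ys. The function names forward and select_valA, the use of strings for instruction codes, and the list-of-pairs interface are assumptions made for illustration; only the order of the cases is taken from the HCL.

```python
def forward(src, reg_value, sources):
    """Return the first forwarded value whose destination matches `src`.

    `sources` is a list of (dst, value) pairs in priority order. If no
    destination matches, fall back to the value read from the register file.
    """
    for dst, value in sources:
        if src == dst:
            return value
    return reg_value


def select_valA(D_icode, D_valP, d_srcA, d_rvalA, sources):
    """Mirror the new_E_valA cases: valP first, then forwarding."""
    if D_icode in {"ICALL", "IJXX"}:
        return D_valP  # use incremented PC
    return forward(d_srcA, d_rvalA, sources)


# Cycle 4 of demo-h0.ys; register IDs: %eax = 0, %edx = 2, none = 8.
EAX, EDX, RNONE = 0, 2, 8
sources = [
    (EAX, 3),     # (E_dstE, e_valE): irmovl $3,%eax in execute
    (RNONE, 0),   # (M_dstM, m_valM)
    (EDX, 10),    # (M_dstE, M_valE): irmovl $10,%edx in memory
    (RNONE, 0),   # (W_dstM, W_valM)
    (RNONE, 0),   # (W_dstE, W_valE)
]
valA = select_valA("IOPL", 0x00E, EDX, 0, sources)
valB = forward(EAX, 0, sources)
print(valA, valB)  # 10 3
```

Listing the execute-stage source first means the most recently generated value for a register is chosen when several stages hold the same destination.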

Limitation of Forwarding
# demo-luh.ys
0x000: irmovl $128,%edx
0x006: irmovl $3,%ecx
0x00c: rmmovl %ecx, 0(%edx)
0x012: irmovl $10,%ebx
0x018: mrmovl 0(%edx),%eax # Load %eax
0x01e: addl %ebx,%eax # Use %eax
0x020: halt
[Figure: pipeline timing diagram, cycles 1 to 11]
Cycle 7
  M: M_dstE = %ebx, M_valE = 10
  D: valA ← M_valE = 10, valB ← R[%eax] = 0
  Error
Cycle 8
  M: M_dstM = %eax, m_valM ← M[128] = 3
Load-use dependency
  Value needed by end of decode stage in cycle 7
  Value read from memory in memory stage of cycle 8

Avoiding Load/Use Hazard
# demo-luh.ys
0x000: irmovl $128,%edx
0x006: irmovl $3,%ecx
0x00c: rmmovl %ecx, 0(%edx)
0x012: irmovl $10,%ebx
0x018: mrmovl 0(%edx),%eax # Load %eax
bubble
0x01e: addl %ebx,%eax # Use %eax
0x020: halt
[Figure: pipeline timing diagram, cycles 1 to 12]
Cycle 8
  W: W_dstE = %ebx, W_valE = 10
  M: M_dstM = %eax, m_valM ← M[128] = 3
  D: valA ← W_valE = 10, valB ← m_valM = 3
Stall using instruction for one cycle
Can then pick up loaded value by forwarding from memory stage

Detecting Load/Use Hazard
Condition         Trigger
Load/Use Hazard   E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB }
[Figure: decode and execute stages of the PIPE hardware, showing E_icode, E_dstM, d_srcA, and d_srcB]

Control for Load/Use Hazard
# demo-luh.ys
0x000: irmovl $128,%edx
0x006: irmovl $3,%ecx
0x00c: rmmovl %ecx, 0(%edx)
0x012: irmovl $10,%ebx
0x018: mrmovl 0(%edx),%eax # Load %eax
bubble
0x01e: addl %ebx,%eax # Use %eax
0x020: halt
[Figure: pipeline timing diagram, cycles 1 to 12]
Stall instructions in fetch and decode stages
Inject bubble into execute stage
Condition         F       D       E        M        W
Load/Use Hazard   stall   stall   bubble   normal   normal
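The load/use trigger and its control action can be sketched together. This is a minimal sketch under the assumption that instruction codes are represented as strings and register IDs as small integers; the function names load_use_hazard and load_use_actions are illustrative.

```python
def load_use_hazard(E_icode, E_dstM, d_srcA, d_srcB):
    """Trigger from the table above: a load in execute feeds decode."""
    return E_icode in {"IMRMOVL", "IPOPL"} and E_dstM in {d_srcA, d_srcB}


def load_use_actions(hazard):
    """Return the per-register action for the next clock edge."""
    if hazard:
        return {"F": "stall", "D": "stall", "E": "bubble",
                "M": "normal", "W": "normal"}
    return {stage: "normal" for stage in "FDEMW"}


# Cycle 7 of demo-luh.ys: mrmovl 0(%edx),%eax is in execute and
# addl %ebx,%eax is in decode. Register IDs: %eax = 0, %ebx = 3.
EAX, EBX = 0, 3
hazard = load_use_hazard("IMRMOVL", EAX, EBX, EAX)
print(hazard)                    # True
print(load_use_actions(hazard))
```

Stalling F and D keeps addl and halt in place for one cycle, and the bubble in E fills the gap so that the loaded value can be forwarded from the memory stage on the following cycle.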

Branch Misprediction Example
# demo-j.ys
0x000: xorl %eax,%eax
0x002: jne t # Not taken
0x007: irmovl $1, %eax # Fall through
0x00d: nop
0x00e: nop
0x00f: nop
0x010: halt
0x011: t: irmovl $3, %edx # Target (Should not execute)
0x017: irmovl $4, %ecx # Should not execute
0x01d: irmovl $5, %edx # Should not execute
Should only execute first 7 instructions

Handling Misprediction
# demo-j.ys
0x000: xorl %eax,%eax
0x002: jne target # Not taken
0x011: t: irmovl $2,%edx # Target
bubble
0x017: irmovl $3,%ebx # Target+1
bubble
0x007: irmovl $1,%eax # Fall through
0x00d: nop
[Figure: pipeline timing diagram, cycles 1 to 10]
Predict branch as taken
  Fetch 2 instructions at target
Cancel when mispredicted
  Detect branch not-taken in execute stage
  On following cycle, replace instructions in execute and decode by bubbles
  No side effects have occurred yet

Detecting Mispredicted Branch
Condition             Trigger
Mispredicted Branch   E_icode = IJXX & !e_Bch
[Figure: execute and memory stages of the PIPE hardware, showing E_icode and e_Bch]

Control for Misprediction
[Figure: pipeline timing diagram for demo-j.ys, cycles 1 to 10, with bubbles replacing the two instructions fetched at the target]
Condition             F        D        E        M        W
Mispredicted Branch   normal   bubble   bubble   normal   normal

Return Example
# demo-retb.ys
0x000: irmovl Stack,%esp # Initialize stack pointer
0x006: call p # Procedure call
0x00b: irmovl $5,%esi # Return point
0x011: halt
0x020: .pos 0x20
0x020: p: irmovl $-1,%edi # procedure
0x026: ret
0x027: irmovl $1,%eax # Should not be executed
0x02d: irmovl $2,%ecx # Should not be executed
0x033: irmovl $3,%edx # Should not be executed
0x039: irmovl $4,%ebx # Should not be executed
0x100: .pos 0x100
0x100: Stack: # Stack: Stack pointer
Previously executed three additional instructions

Correct Return Example
# demo-retb
0x026: ret
bubble
bubble
bubble
0x00b: irmovl $5,%esi # Return
[Figure: pipeline timing diagram showing ret followed by three bubbles]
As ret passes through pipeline, stall at fetch stage
  While in decode, execute, and memory stage
Inject bubble into decode stage
Release stall when reach write-back stage
When ret reaches write back
  W: valM = 0x0b
  F: valC ← 5, rB ← %esi

Detecting Return
Condition        Trigger
Processing ret   IRET in { D_icode, E_icode, M_icode }
[Figure: decode, execute, and memory stages of the PIPE hardware, showing D_icode, E_icode, and M_icode]

Control for Return
# demo-retb
0x026: ret
bubble
bubble
bubble
0x00b: irmovl $5,%esi # Return
[Figure: pipeline timing diagram showing ret followed by three bubbles]
Condition        F       D        E        M        W
Processing ret   stall   bubble   normal   normal   normal

Special Control Cases
Detection
Condition             Trigger
Processing ret        IRET in { D_icode, E_icode, M_icode }
Load/Use Hazard       E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB }
Mispredicted Branch   E_icode = IJXX & !e_Bch
Action (on next cycle)
Condition             F        D        E        M        W
Processing ret        stall    bubble   normal   normal   normal
Load/Use Hazard       stall    stall    bubble   normal   normal
Mispredicted Branch   normal   bubble   bubble   normal   normal

Implementing Pipeline Control
Combinational logic generates pipeline control signals
Action occurs at start of following cycle
[Figure: pipe control logic reading D_icode, d_srcA, d_srcB, E_icode, E_dstM, e_Bch, and M_icode, and generating F_stall, D_stall, D_bubble, and E_bubble]

Initial Version of Pipeline Control
bool F_stall =
    # Conditions for a load/use hazard
    E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB } ||
    # Stalling at fetch while ret passes through pipeline
    IRET in { D_icode, E_icode, M_icode };

bool D_stall =
    # Conditions for a load/use hazard
    E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB };

bool D_bubble =
    # Mispredicted branch
    (E_icode == IJXX && !e_Bch) ||
    # Stalling at fetch while ret passes through pipeline
    IRET in { D_icode, E_icode, M_icode };

bool E_bubble =
    # Mispredicted branch
    (E_icode == IJXX && !e_Bch) ||
    # Load/use hazard
    E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB };

Control Combinations
[Figure: pipeline states for Load/use (load in E, use in D), Mispredict (JXX in E), and ret 1, ret 2, ret 3 (ret in D, E, and M), with Combination A and Combination B marked]
Special cases that can arise on same clock cycle
Combination A
  Not-taken branch
  ret instruction at branch target
Combination B
  Instruction that reads from memory to %esp
  Followed by ret instruction
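The initial control equations can be evaluated for the two combinations to see what each one asks of the pipeline registers. This is a sketch that transcribes the four HCL expressions into Python; the function name initial_control and the string instruction codes are assumptions. It also assumes, as in the Y86 decode logic, that ret uses %esp (register ID 4) as both srcA and srcB.

```python
def initial_control(D_icode, E_icode, M_icode, E_dstM, d_srcA, d_srcB, e_Bch):
    """Evaluate the initial version of the pipeline control signals."""
    load_use = E_icode in {"IMRMOVL", "IPOPL"} and E_dstM in {d_srcA, d_srcB}
    ret = "IRET" in {D_icode, E_icode, M_icode}
    mispredict = E_icode == "IJXX" and not e_Bch
    return {
        "F_stall": load_use or ret,
        "D_stall": load_use,
        "D_bubble": mispredict or ret,
        "E_bubble": mispredict or load_use,
    }


ESP, RNONE = 4, 8

# Combination A: not-taken jump in execute, ret in decode.
combo_a = initial_control("IRET", "IJXX", "INOP", RNONE, ESP, ESP, False)
# Combination B: load into %esp in execute, ret in decode.
combo_b = initial_control("IRET", "IMRMOVL", "INOP", ESP, ESP, ESP, False)

print(combo_a)
print(combo_b)
```

For Combination A the signals match the mispredicted-branch handling plus a stall of F. For Combination B both D_stall and D_bubble come out true, which is the conflicting request on pipeline register D.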

Control Combination A
[Figure: Combination A, a mispredicted JXX in the execute stage with ret 1 (ret in the decode stage)]
Condition             F        D        E        M        W
Processing ret        stall    bubble   normal   normal   normal
Mispredicted Branch   normal   bubble   bubble   normal   normal
Combination           stall    bubble   bubble   normal   normal
Should handle as mispredicted branch
Stalls F pipeline register
But PC selection logic will be using M_valA anyhow
[Figure: fetch-stage PC selection logic choosing f_PC from M_valA, W_valM, and predPC]

Control Combination B
[Figure: Combination B, a load in the execute stage with ret 1 (ret in the decode stage) as the use]
Condition         F       D                E        M        W
Processing ret    stall   bubble           normal   normal   normal
Load/Use Hazard   stall   stall            bubble   normal   normal
Combination       stall   bubble + stall   bubble   normal   normal
Would attempt to bubble and stall pipeline register D
Signaled by processor as pipeline error

Handling Control Combination B
[Figure: Combination B, a load in the execute stage with ret 1 (ret in the decode stage) as the use]
Condition         F       D        E        M        W
Processing ret    stall   bubble   normal   normal   normal
Load/Use Hazard   stall   stall    bubble   normal   normal
Combination       stall   stall    bubble   normal   normal
Load/use hazard should get priority
ret instruction should be held in decode stage for additional cycle

Corrected Pipeline Control Logic
bool D_bubble =
    # Mispredicted branch
    (E_icode == IJXX && !e_Bch) ||
    # Stalling at fetch while ret passes through pipeline
    IRET in { D_icode, E_icode, M_icode }
    # but not condition for a load/use hazard
    && !(E_icode in { IMRMOVL, IPOPL } && E_dstM in { d_srcA, d_srcB });

Condition         F       D        E        M        W
Processing ret    stall   bubble   normal   normal   normal
Load/Use Hazard   stall   stall    bubble   normal   normal
Combination       stall   stall    bubble   normal   normal
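One way to gain confidence in the corrected D_bubble expression is to check that D_stall and D_bubble are never requested together. The sketch below transcribes the corrected logic into Python and searches a small representative set of inputs; the function names, the string instruction codes, and the choice of sample values are assumptions made for illustration. The grouping of the last term follows the usual precedence of && over ||.

```python
from itertools import product


def corrected_control(D_icode, E_icode, M_icode, E_dstM, d_srcA, d_srcB, e_Bch):
    """Evaluate the control signals with the corrected D_bubble."""
    load_use = E_icode in {"IMRMOVL", "IPOPL"} and E_dstM in {d_srcA, d_srcB}
    ret = "IRET" in {D_icode, E_icode, M_icode}
    mispredict = E_icode == "IJXX" and not e_Bch
    return {
        "F_stall": load_use or ret,
        "D_stall": load_use,
        # ret bubbles D only when no load/use hazard is present
        "D_bubble": mispredict or (ret and not load_use),
        "E_bubble": mispredict or load_use,
    }


def find_conflicts():
    """List sampled input combinations that set D_stall and D_bubble."""
    icodes = ["INOP", "IMRMOVL", "IPOPL", "IJXX", "IRET"]
    regs = [0, 4, 8]  # %eax, %esp, and "no register"
    conflicts = []
    for combo in product(icodes, icodes, icodes, regs, regs, regs,
                         [False, True]):
        signals = corrected_control(*combo)
        if signals["D_stall"] and signals["D_bubble"]:
            conflicts.append(combo)
    return conflicts


ESP = 4
combo_b = corrected_control("IRET", "IMRMOVL", "INOP", ESP, ESP, ESP, False)
print(combo_b)
print(len(find_conflicts()))  # 0
```

For Combination B the result is stall for F and D and bubble for E, matching the Combination row of the table.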

Pipeline Summary
Data Hazards
  Most handled by forwarding
    No performance penalty
  Load/use hazard requires one cycle stall
Control Hazards
  Cancel instructions when detect mispredicted branch
    Two clock cycles wasted
  Stall fetch stage while ret passes through pipeline
    Three clock cycles wasted
Control Combinations
  Must analyze carefully
  First version had subtle bug
    Only arises with unusual instruction combination