Recovery Mechanism for Latency Misprediction
Enric Morancho, José María Llabería and Àngel Olivé
Departament d'Arquitectura de Computadors
Universitat Politècnica de Catalunya - Spain
Work supported by the Ministry of Education and Science of Spain (TIC98-0511-C02-01)

Motivation
• High-performance processors demand back-to-back execution of dependent instructions
    R1 ← ...      ...  IQ   R    exe  W
    ... ← R1           ...  IQ   R    exe  W
  ❑ The latency of the source instruction must be known on its issue cycle
• Load instructions have unknown latency
  ❑ Delaying the issue of dependent instructions degrades IPC:
    ✓ Hit latency of 3 cycles: 6% (+1 cycle), 11% (+2 cycles), 16% (+3 cycles)
  ❑ Back-to-back execution is achieved by:
    ✓ Latency prediction (hit)
    ✓ Speculative scheduling of dependent instructions
    ✓ Recovery mechanism on latency mispredictions

Outline
• Terminology
• Processor Model
• Recovery Mechanisms
  ❑ Issue-Queue Mechanism
  ❑ Recovery-Buffer Mechanism
• Methodology and Results
• Conclusions

Terminology
[Figure: pipeline timing diagram, cycles 1 to 7. The latency-predicted load (r1 ← load ...) occupies stages IQ, R, @, M and M/TC in cycles 1 to 5. The instructions issued in the following cycles each pass through IQ, R, exe and W; among them are r2 ← r1 (1-cycle latency) and r3 ← r2. The Independent Window (IW) and the Speculative Window (SW) are marked on the cycle axis.]
• Independent Window (IW): interval in which the issued instructions are independent of the latency-predicted load instruction
• Speculative Window (SW): interval between the issue of the first instruction potentially dependent on the load and the tag check (TC)
• Verification Delay: duration of the Speculative Window
  ❑ Constant value

Tasks of a Recovery Mechanism
• Nullify some instructions issued during the Speculative Window
  ❑ They remain marked as uncompleted in the Reorder Buffer
  ❑ Nullification policy:
    ✓ Non-selective: all instructions
    ✓ Selective: only dependent instructions
• Put to sleep the instructions that depend on mispredicted or nullified instructions
• Re-issue nullified instructions
  ❑ Independent instructions: on the next cycles
  ❑ Dependent instructions: on data availability
• Keep speculatively issued instructions in a storage structure

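The two nullification policies listed under "Tasks of a Recovery Mechanism" can be sketched as below. This is only an illustration under stated assumptions, not the authors' implementation: the `Instr` record, the register names and the helper `to_nullify` are hypothetical, and destination registers are assumed to be renamed physical registers, so each one is written once.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    dest: str           # destination physical register (hypothetical naming)
    srcs: tuple = ()    # source physical registers


def to_nullify(window, mispredicted_reg, selective):
    """Return the names of the instructions to nullify.

    window: instructions issued during the Speculative Window, in issue order.
    """
    if not selective:
        # Non-selective policy: every instruction in the window is nullified.
        return [i.name for i in window]
    # Selective policy: follow the dependence chain that starts at the load.
    tainted = {mispredicted_reg}
    nullified = []
    for i in window:
        if tainted & set(i.srcs):
            nullified.append(i.name)
            tainted.add(i.dest)   # consumers of this result are dependent too
    return nullified


window = [
    Instr("b", dest="r2", srcs=("r1",)),   # depends on the load result r1
    Instr("y", dest="r9", srcs=("r8",)),   # independent of the load
    Instr("c", dest="r3", srcs=("r2",)),   # depends on b
]
print(to_nullify(window, "r1", selective=False))
print(to_nullify(window, "r1", selective=True))
```

In this sketch the selective policy spares `y`, which is the kind of independent instruction that a non-selective policy has to re-issue.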
Ex.: Non-selective Recovery Mechanism
[Figure: pipeline timing diagram, cycles 1 to 7, for the latency-predicted load a and the instructions v, w, x, b, y, c and z. The load a goes through IQ, R, @, M and M/TC; the other instructions go through IQ, R, exe and W. Labels in the figure: a mispredicted; b, y nullified; c, z not issued; y re-issued; b, c, z slept; Lost Cycles.]

Processor Model (1/2)
• Pipeline stages: Fetch, Decode/Rename, Issue Queue, Register Read, Execute, Write, Commit
  ❑ Instructions are extracted from the IQ after issuing them
    ✓ Issue-Queue capacity < Reorder-Buffer capacity
• Latency predictor:
  ❑ Always predicts cache-hit latency

Processor Model (2/2)
• Structure of the Issue Queue
[Figure: a Dependence Matrix with one row per queued instruction (instruction0 ... instructionm) and one column per physical register (R0, R1, ..., Rn). A Register Scoreboard Circuit with latency counters drives the columns; each row produces a ready bit (ready0 ... readym).]
  ❑ Rows: related to queued instructions
  ❑ Columns: related to physical registers; they mark data availability and are set by latency counters
  ❑ Ready bits are evaluated every cycle

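The ready-bit evaluation described for the Dependence Matrix can be sketched as below. The `Scoreboard` class, the function `ready_bits` and the boolean matrix encoding are assumptions made for illustration; the slides only state that columns are set by latency counters and that ready bits are evaluated every cycle.

```python
class Scoreboard:
    """Hypothetical register scoreboard: one availability bit per physical register."""

    def __init__(self, n_regs):
        self.available = [False] * n_regs
        self.counters = {}            # register -> cycles until its column is set

    def start(self, reg, latency):
        """Arm a latency counter when the producer of `reg` is issued."""
        self.available[reg] = False
        self.counters[reg] = latency

    def tick(self):
        """Advance one cycle; an expiring counter sets its column."""
        for reg in list(self.counters):
            self.counters[reg] -= 1
            if self.counters[reg] == 0:
                self.available[reg] = True
                del self.counters[reg]


def ready_bits(dep_matrix, available):
    """A row is ready when every register it needs is marked available."""
    return [
        all(avail for need, avail in zip(row, available) if need)
        for row in dep_matrix
    ]


# Rows: queued instructions. Columns: physical registers R0, R1, R2.
dep_matrix = [
    [True, False, False],   # instruction0 needs R0
    [True, True, False],    # instruction1 needs R0 and R1
    [False, False, True],   # instruction2 needs R2
]
sb = Scoreboard(3)
sb.available[0] = True      # R0 already holds its value
sb.start(1, latency=2)      # the producer of R1 was just issued
for cycle in range(3):
    print(cycle, ready_bits(dep_matrix, sb.available))
    sb.tick()
```

With a predicted latency, the counter is armed with the hit latency; a misprediction then means the column was set too early and has to be unset again.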
Storage for speculative instructions
• Evaluated structures
  ❑ Issue Queue
  ❑ Recovery Buffer

Issue-Queue Mechanism
• Issued instructions are made non-visible to the select logic
  ❑ Non-request bits are added to the IQ entries
[Figure: Issue Queue extended with non-request bits. Blocks: Removal Circuit, Dependence Matrix, Register Scoreboard Circuit, Select Logic. Signals: issued, misprediction, remove, ready, non-request, selected, Destination Register, latency-predicted, result available, activation of latency counters. Entries switch between non-visible and visible. Legend: a) ready and no non-request; b) selected.]
• On mispredictions, it unsets columns and makes instructions visible
• Non-visible instructions are extracted after Verification Delay cycles

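A minimal sketch of the Issue-Queue mechanism follows, assuming a non-selective policy. It is not the hardware design from the slides: the class `IssueQueue`, its per-entry `ready` flag (which stands in for the Dependence Matrix columns) and the `age` counter are simplifications introduced here.

```python
class IssueQueue:
    """Hypothetical model: issued entries stay queued but hidden from selection."""

    def __init__(self, verification_delay):
        self.vd = verification_delay
        self.entries = []   # dicts with keys: name, ready, no_request, age

    def insert(self, name, ready=True):
        self.entries.append(
            {"name": name, "ready": ready, "no_request": False, "age": 0}
        )

    def select(self, width):
        """Issue up to `width` ready, visible entries and hide them."""
        issued = []
        for e in self.entries:
            if len(issued) == width:
                break
            if e["ready"] and not e["no_request"]:
                e["no_request"] = True   # non-visible to the select logic
                e["age"] = 0
                issued.append(e["name"])
        return issued

    def tick(self):
        """Advance one cycle; extract hidden entries after the Verification Delay."""
        for e in self.entries:
            if e["no_request"]:
                e["age"] += 1
        self.entries = [
            e for e in self.entries
            if not (e["no_request"] and e["age"] >= self.vd)
        ]

    def mispredict(self, dependents):
        """Make hidden entries visible again; dependents sleep (lose their ready bit)."""
        for e in self.entries:
            if e["no_request"]:
                e["no_request"] = False
                if e["name"] in dependents:
                    e["ready"] = False


iq = IssueQueue(verification_delay=2)
iq.insert("b")                 # depends on the latency-predicted load
iq.insert("y")                 # independent
iq.insert("c", ready=False)    # waits for b
print(iq.select(width=2))      # b and y issue speculatively
iq.tick()
iq.mispredict(dependents={"b"})
print(iq.select(width=2))      # only y can be re-issued right away
```

The sketch makes the cost of this mechanism visible: issued entries keep occupying the queue until they are verified, which leaves fewer entries for looking ahead.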
Recovery-Buffer Mechanism (1/3)
• Keeping speculatively issued instructions in the Issue Queue reduces its ability to look ahead for independent instructions
• The RB keeps issued instructions while they can be nullified
  ❑ As soon as an instruction is issued, it is extracted from the IQ
  ❑ An RB entry contains all the instructions issued concurrently
  ❑ Instructions are ordered in issue-cycle order
[Figure: pipeline with stages Fetch, Decode/Rename, Issue Queue, Register Read, Execute, Write and Commit, extended with a Recovery Buffer (RB); signal: correct/mispredicted result available.]

Recovery-Buffer Mechanism (2/3)
• On a misprediction:
  ❑ Nullifies issued instructions dependent on the misprediction
  ❑ Puts to sleep the instructions dependent on the misprediction
    ✓ Same operations as in the IQ mechanism
  ❑ Nullified instructions are kept in the RB
    ✓ An entry range is related to every misprediction
    ✓ The scheduling recorded in the RB is valid: re-issue does not need to account for latencies
[Figure: Recovery Buffer and Misprediction Buffer; each misprediction is related to a range of Verification Delay − 1 entries.]

Recovery-Buffer Mechanism (3/3)
• Re-issue is performed from the RB
  ❑ Checks one RB entry per cycle
  ❑ Also issues instructions from the IQ in the free issue slots
  ❑ Wakes up dependent instructions recorded in the IQ
[Figure: Issue Queue and Recovery Buffer (RB) sharing the Select Logic. Blocks: Dependence Matrix, Register Scoreboard Circuit, Select Logic, Recovery Buffer. Signals: ready, Destination Register, destination registers, latency-predicted, mispredictions. Selected instructions go to the Execution Pipelines and to the Recovery Buffer.]

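The Recovery-Buffer behaviour described across the three slides can be sketched as below. The class `RecoveryBuffer`, its method names and the `is_nullified` predicate are hypothetical; the depth of Verification Delay − 1 entries follows the entry range mentioned on slide (2/3), and filling the free issue slots from the IQ is left out.

```python
from collections import deque


class RecoveryBuffer:
    """Hypothetical model: one entry per issue cycle, oldest entry first."""

    def __init__(self, verification_delay):
        self.depth = verification_delay - 1   # entries that can still be nullified
        self.entries = deque()
        self.reissue = deque()                # entries pending re-issue

    def record(self, issued):
        """Store the instructions issued this cycle; drop the verified entry."""
        self.entries.append(list(issued))
        if len(self.entries) > self.depth:
            self.entries.popleft()

    def mispredict(self, is_nullified):
        """Keep nullified instructions, preserving the recorded issue order."""
        for entry in self.entries:
            kept = [name for name in entry if is_nullified(name)]
            if kept:
                self.reissue.append(kept)
        self.entries.clear()

    def next_reissue(self):
        """Hand back one recorded entry per cycle, or [] when none is pending."""
        return self.reissue.popleft() if self.reissue else []


rb = RecoveryBuffer(verification_delay=3)
rb.record(["w"])
rb.record(["b", "o", "c"])
rb.record(["d", "p"])
# Selective policy: only b, c and d depend on the mispredicted load.
rb.mispredict(lambda name: name in {"b", "c", "d"})
print(rb.next_reissue())
print(rb.next_reissue())
print(rb.next_reissue())
```

Passing a predicate that always returns `True` gives the non-selective variant, in which whole entries are re-issued as recorded.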
Example: RB with selective nullification
[Figure: pipeline timing diagram, cycles 1 to 8, for the issued instructions a, m, n, b, o, c, d and p. The load a goes through IQ, R, @, M, M and TC; the other instructions go through IQ, R, exe and W. After the misprediction, instructions are re-issued from the RB in cycles i, i+1, i+2 and i+3. Recovery Buffer contents shown: (b,o,c) and (b,-,c).]

Methodology
• Cycle-by-cycle simulation of SPEC-95 benchmarks
• Simulations performed:
  ❑ 4-way processor
    ✓ First-level cache latency: 2 cycles
  ❑ Issue-Queue size:
    ✓ 15-, 20- and 25-entry integer IQs
    ✓ 10-, 15- and 20-entry floating-point IQs
  ❑ Verification Delay: 2, 3 and 4 cycles

Evaluated mechanisms

                     Nullification policy
Storage structure  | Non-selective | Selective
Issue Queue        | IQNS          | IQS
Recovery Buffer    | RBNS          | RBS

• After issuing an instruction, extracting it from the IQ is delayed:
  ❑ IQNS: Verification-Delay cycles
  ❑ IQS: 1 cycle (to decide dependencies)
  ❑ RBNS and RBS: 0 cycles

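The extraction delays of the four evaluated mechanisms can be written as a small lookup. The function names are hypothetical, and `max_issued_entries` is only a rough upper bound that assumes the processor issues at full width every cycle, which the slides do not claim.

```python
def iq_extraction_delay(mechanism, verification_delay):
    """Cycles an instruction stays in the Issue Queue after being issued."""
    delays = {
        "IQNS": verification_delay,  # kept until the prediction is verified
        "IQS": 1,                    # one cycle to decide dependencies
        "RBNS": 0,                   # moved to the Recovery Buffer on issue
        "RBS": 0,
    }
    return delays[mechanism]


def max_issued_entries(mechanism, verification_delay, issue_width):
    """Upper bound on IQ entries held by already-issued instructions."""
    return issue_width * iq_extraction_delay(mechanism, verification_delay)


for mechanism in ("IQNS", "IQS", "RBNS", "RBS"):
    print(mechanism, max_issued_entries(mechanism, verification_delay=4, issue_width=4))
```

Under these assumptions, a 4-way processor with a 4-cycle Verification Delay could in the worst case fill a 15-entry queue with issued instructions under IQNS, while the Recovery-Buffer mechanisms never hold issued instructions in the queue.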
Results: Integer benchmarks (1/2)
• Sensitivity to the verification delay
  ❑ IQNS and IQS: function of Issue-Queue size
    ✓ Significant for the 15-entry Issue Queue
  ❑ RBNS and RBS: almost independent
[Figure: three panels (15-entry, 20-entry and 25-entry integer issue queue). Y axis: IPC, 1.4 to 1.9. X axis: verification delay (verif=2, verif=3, verif=4). Series: RBS, RBNS, IQS, IQNS.]
• For 25-entry Issue Queues, IQS and RBNS are almost equivalent

Results: Integer benchmarks (2/2)
• Sensitivity to Issue-Queue size
  ❑ RBNS and RBS allow Issue-Queue size reductions of around 25% with respect to IQNS and IQS
[Figure: two panels (3-cycle and 4-cycle verification delay). Y axis: IPC, 1.4 to 1.9. X axis: Issue-Queue size (iq=15, iq=20, iq=25). Series: RBS, RBNS, IQS, IQNS.]

Results: Floating-Point benchmarks (1/2)
• They behave differently from integer benchmarks
  ❑ Latencies forbid the existence of a chain of dependent instructions longer than one in the Speculative Window
[Figure: pipeline timing diagram for Fr1 ← load ..., Fr2 ← float (Fr1, ...) and Fr3 ← float (Fr2, ...). The load goes through IQ, R, @, M, M and TC; the floating-point instructions go through IQ, R, several exe stages and W.]

Results: Floating-Point benchmarks (2/2)
• Sensitivity to the verification delay
  ❑ Non-selective mechanisms present degradation
    ✓ Most nullified instructions are independent of the misprediction: 85% (floating-point) versus 53% (integer)
  ❑ Selective mechanisms are almost independent
[Figure: 20-entry floating-point issue queue. Y axis: IPC, 1.7 to 2.2. X axis: verification delay (verif=2, verif=3, verif=4). Series: RBS, RBNS, IQS, IQNS.]

Conclusions
• The Recovery Buffer increases the capacity of the scheduler to look ahead for independent instructions
• Results depend on the dominating instruction latency
  ❑ Integer benchmarks (1-cycle latency):
    ✓ Recovery-Buffer mechanisms are less sensitive to the Verification Delay than IQ mechanisms
    ✓ Recovery-Buffer mechanisms allow a reduction in the Issue-Queue size of around 25%
    ✓ The Recovery-Buffer mechanism with non-selective nullification policy is an attractive alternative
  ❑ Floating-point benchmarks (4-cycle latency):
    ✓ Non-selective nullification degrades performance

Comparison of prediction types

Question                                            | Branch prediction | Memory dependence prediction | Latency prediction
Which instructions are predicted?                   | branches          | loads                        | loads
Can the speculative instructions be issued before   | Yes               | No                           | No
issuing the predicted instruction?                  |                   |                              |
Which instruction performs the verification of the  | branch            | previous store               | load
prediction?                                         |                   |                              |
Speculative-Window duration?                        | Variable          | Variable                     | Fixed
Which instructions must be re-executed?             | New path          | The nullified ones           | The nullified ones

Verification-Delay values
[Figure: four pipeline timing diagrams; in each one the load is followed by instructions issued in consecutive cycles.]
• Tag check in the last memory stage: load stages IQ, R, @, M, M/TC; other instructions IQ, R, exe, W. Verification Delay: 2
• Decoupled TC: load stages IQ, R, @, M, M, TC; other instructions IQ, R, exe, W. Verification Delay: 3
• 2 Read Stages, Decoupled TC: load stages IQ, R, R, @, M, M, TC; other instructions IQ, R, R, exe, W. Verification Delay: 4
• Pipelined scheduling logic, Decoupled TC: load stages WU, S, R, @, M, M, TC; other instructions WU, S, R, exe, W. Verification Delay: 4

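The Verification-Delay values on this slide can be reproduced from the stage sequences with a short calculation. The function below is a sketch under two assumptions that the slides do not state explicitly: the load result is usable by a dependent's `exe` stage in the cycle after the load's last memory stage, and the Speculative Window is counted from the dependent's first scheduling stage up to and including the tag-check cycle.

```python
def verification_delay(load_stages, dep_stages):
    """Length of the Speculative Window, in cycles.

    load_stages: stage names of the load, one per cycle, starting at cycle 1.
    dep_stages:  stage names of a dependent instruction, one per cycle.
    """
    cycles = list(enumerate(load_stages, start=1))
    last_mem = max(c for c, s in cycles if s.startswith("M"))
    tag_check = next(c for c, s in cycles if s.endswith("TC"))
    # The dependent executes right after the last memory stage, so its first
    # stage is placed that many cycles earlier.
    dep_issue = (last_mem + 1) - dep_stages.index("exe")
    return tag_check - dep_issue + 1


configs = {
    "TC in last memory stage": (["IQ", "R", "@", "M", "M/TC"],
                                ["IQ", "R", "exe", "W"]),
    "Decoupled TC": (["IQ", "R", "@", "M", "M", "TC"],
                     ["IQ", "R", "exe", "W"]),
    "2 Read Stages, Decoupled TC": (["IQ", "R", "R", "@", "M", "M", "TC"],
                                    ["IQ", "R", "R", "exe", "W"]),
    "Pipelined scheduling, Decoupled TC": (["WU", "S", "R", "@", "M", "M", "TC"],
                                           ["WU", "S", "R", "exe", "W"]),
}
for name, (load, dep) in configs.items():
    print(name, verification_delay(load, dep))
```

Under these assumptions the four configurations give 2, 3, 4 and 4 cycles, matching the values on the slide: every extra stage between scheduling and execution, or between the last memory access and the tag check, lengthens the Speculative Window by one cycle.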
Recovery Buffer & Branch Mispredictions
• On a branch misprediction, some instructions that belong to a wrong path can be recorded in the Recovery Buffer
  ❑ These instructions must not be re-issued
    ✓ A "structure" contains the instruction-identifier ranges related to wrong-path instructions, in order to filter them out
  ❑ The Recovery Buffer locally maintains the status of the physical registers:
    ✓ Set: on issue and re-issue
    ✓ Unset: on nullifications
• These actions are performed concurrently with the normal operations of the Recovery Buffer
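
The wrong-path filtering described above can be sketched as a range check on instruction identifiers. The slides do not detail the "structure", so the representation used here (a list of inclusive identifier ranges, with identifiers that never wrap around) and the function name `filter_wrong_path` are assumptions.

```python
def filter_wrong_path(entry, wrong_ranges):
    """Drop the instructions whose identifier falls in a wrong-path range.

    entry:        list of (instruction_id, name) pairs from one RB entry.
    wrong_ranges: list of inclusive (low, high) identifier ranges.
    """
    return [
        (instr_id, name)
        for instr_id, name in entry
        if not any(low <= instr_id <= high for low, high in wrong_ranges)
    ]


entry = [(10, "b"), (14, "o"), (15, "c")]
wrong_ranges = [(12, 14)]          # identifiers fetched down a wrong path
print(filter_wrong_path(entry, wrong_ranges))
```

A real design would have to handle identifier wrap-around and the retirement of old ranges, which this sketch leaves out.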