Itanium™ Processor Core
Intel® Itanium™ Processor Core
Harsh Sharangpani, Principal Engineer and IA-64 Microarchitecture Manager, Intel Corporation
Hot Chips, 15 August 2000
Itanium™ Processor Silicon
[Figure: die photo. Core Processor Die regions: IA-32 Control, FPU, IA-64 Control, Integer Units, Instr. Fetch & Decode, Cache, TLB, Cache, Bus. Alongside the core die: 4 x 1MB L3 cache.]
Machine Characteristics
Frequency: 800 MHz
Transistor Count: 25.4M CPU; 295M L3
Process: 0.18 µ CMOS, 6 metal layer
Package: Organic Land Grid Array
Machine Width: 6 insts/clock (4 ALU/MM, 2 Ld/St, 2 FP, 3 Br)
Registers: 14 ported 128 GR & 128 FR; 64 Predicates
Speculation: 32 entry ALAT, Exception Deferral
Branch Prediction: Multilevel 4-stage Prediction Hierarchy
FP Compute Bandwidth: 3.2 GFlops (DP/EP); 6.4 GFlops (SP)
Memory -> FP Bandwidth: 4 DP (8 SP) operands/clock
Virtual Memory Support: 64 entry ITLB, 32/96 2-level DTLB, VHPT
L2/L1 Cache: Dual ported 96K Unified & 16KD; 16KI
L2/L1 Latency: 6 / 2 clocks
L3 Cache: 4MB, 4-way s.a., BW of 12.8 GB/sec
System Bus: 2.1 GB/sec; 4-way Glueless MP; Scalable to large (512+ proc) systems
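The peak FP figures in this table follow from the clock rate and the number of multiply-add units. The sketch below is a back-of-envelope check, not vendor code. It assumes each FMAC counts as two floating-point operations per clock and that single precision uses four FMAC lanes (the deck states "2 FMACs (4 for SSE)"); the function and variable names are hypothetical.

```python
# Back-of-envelope check of the peak figures quoted in the table.
CLOCK_HZ = 800e6        # 800 MHz, from the table
FLOPS_PER_FMAC = 2      # assumption: one multiply plus one add per fused op


def peak_gflops(fmac_lanes, clock_hz=CLOCK_HZ):
    """Peak GFLOPs if every FMAC lane issues one operation per clock."""
    return fmac_lanes * FLOPS_PER_FMAC * clock_hz / 1e9


dp_peak = peak_gflops(2)   # 2 DP/EP FMACs
sp_peak = peak_gflops(4)   # 4 SP FMAC lanes (assumed from "4 for SSE")

# Derived, not stated in the deck: bytes per clock implied by the L3 bandwidth.
l3_bytes_per_clock = 12.8e9 / CLOCK_HZ

print(dp_peak, sp_peak, l3_bytes_per_clock)
```

The two results line up with the 3.2 GFlops (DP/EP) and 6.4 GFlops (SP) entries above.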
EPIC compared to Dynamic Scheduled RISC

Bottleneck: Scheduling Scope
  Itanium EPIC Approach: Entire compilation scope
  Dynamic RISC Approach: Traditional compiler + limited hardware window

Bottleneck: Memory Latency & Control Flow Barriers
  Itanium EPIC Approach: Control Speculation across compiler scope; Data Speculation for undisambiguated memory; Extensive Memory Hints
  Dynamic RISC Approach: Hardware Scheduling across dynamic window assisted by Memory Order Buffer

Bottleneck: Control Flow Disruptions
  Itanium EPIC Approach: Predication for flaky branches; Extensive branch/prefetch Hints; Superscalar branching
  Dynamic RISC Approach: Large Dynamic Branch Predictors; 1 branch/clock

Bottleneck: Operand Delivery
  Itanium EPIC Approach: Large Register File, with Stacking & Rotation
  Dynamic RISC Approach: Small Architectural File with Register Rename Tables

Bottleneck: Interprocedural Overhead
  Itanium EPIC Approach: Stacking for parameter passing
  Dynamic RISC Approach: Require spill/fill to memory or registers
Itanium™ EPIC Design Maximizes SW-HW Synergy

Architecture features programmed by compiler:
• Branch Hints
• Explicit Parallelism
• Register Stack & Rotation
• Data & Control Speculation
• Predication
• Memory Hints

Micro-architecture features in hardware:
• Fetch: Instruction Cache & Branch Predictors
• Issue: Fast, Simple 6-Issue
• Register Handling: 128 GR & 128 FR, Register Remap & Stack Engine
• Control: Bypasses & Dependencies
• Parallel Resources: 4 Integer + 4 MMX Units; 2 FMACs (4 for SSE); 2 LD/ST units; 32 entry ALAT; Speculation Deferral Management
• Memory Subsystem: Three levels of cache: L1, L2, L3
10 Stage In-Order Core Pipeline

Stages: IPG (Inst Pointer Generation), FET (Fetch), ROT (Rotate), EXP (Expand), REN (Rename), WLD (Word-Line Decode), REG (Register Read), EXE (Execute), DET (Exception Detect), WRB (Write-Back)

Front End (IPG, FET, ROT)
• Pre-fetch/Fetch of up to 6 instructions/cycle
• Hierarchy of branch predictors
• Decoupling buffer

Instruction Delivery (EXP, REN)
• Dispersal of up to 6 instructions on 9 ports
• Reg. remapping
• Reg. stack engine

Operand Delivery (WLD, REG)
• Reg read + Bypasses
• Register scoreboard
• Predicated dependencies

Execution (EXE, DET, WRB)
• 4 single cycle ALUs, 2 ld/str
• Advanced load control
• Predicate delivery & branch
• NaT/Exception/Retirement
Front End (IPG, FET, ROT)
• SW-triggered prefetch loads target code early using BRP hints
• I-Fetch of 32 Bytes/clock feeds an 8-bundle decoupling buffer
• Branch hints combine with predictor hierarchy to improve branch prediction, delivering up to four progressive resteers
  – 4 TARs under compiler control
  – Adaptive 2-level predictor (512-entry 2-way + 64-entry Multiway); 64-entry Target Address Cache fed by hints; Return stack buffer
  – Perfect loop-exit predictor, BAC1, BAC2

[Figure: front-end datapath across stages IPG, FET, ROT, EXP. Labeled blocks: IP MUX, Target Address Registers, 16KB I-Cache & ITLB, Return Stack Buffer, Adaptive 2-Level Predictor, Br Target Address Cache, BAC 1 & Loop Exit Predictor, BAC 2, decoupling buffer (to Dispersal).]
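To illustrate what an adaptive 2-level predictor does in general, the sketch below keeps a short per-branch outcome history and uses it to select a 2-bit saturating counter. This is a textbook model only: the history length, the dictionary-based tables, and all names are assumptions for illustration, and it does not reproduce the 512-entry 2-way organization or the hint mechanisms described on the slide.

```python
class TwoLevelPredictor:
    """Generic two-level adaptive predictor (illustrative, not Itanium's design)."""

    def __init__(self, history_bits=4):
        self.mask = (1 << history_bits) - 1
        self.history = {}    # level 1: branch address -> recent outcomes (bit vector)
        self.counters = {}   # level 2: (address, history) -> 2-bit counter, 0..3

    def predict(self, addr):
        h = self.history.get(addr, 0)
        # Counters start at 1 (weakly not-taken); 2 or 3 means predict taken.
        return self.counters.get((addr, h), 1) >= 2

    def update(self, addr, taken):
        h = self.history.get(addr, 0)
        c = self.counters.get((addr, h), 1)
        c = min(3, c + 1) if taken else max(0, c - 1)
        self.counters[(addr, h)] = c
        # Shift the actual outcome into this branch's history register.
        self.history[addr] = ((h << 1) | int(taken)) & self.mask


def accuracy(pred, addr, outcomes):
    """Fraction of outcomes predicted correctly, training as it goes."""
    hits = 0
    for taken in outcomes:
        hits += pred.predict(addr) == taken
        pred.update(addr, taken)
    return hits / len(outcomes)


p = TwoLevelPredictor()
accuracy(p, 0x40, [True, False] * 20)          # warm-up on an alternating branch
print(accuracy(p, 0x40, [True, False] * 10))   # accuracy once the pattern is learned
```

A single 2-bit counter would mispredict an alternating branch about half the time; keying the counter on recent history is what lets the pattern be learned.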
Instruction Delivery (EXP, REN)
• Stop bits eliminate dependency checking
• Templates simplify routing
• 1st available dispersal from 6 syllables to 9 issue ports
• REN stage supports renaming for stacking & rotation
• Register remapping done via several parallel 7-bit adders
• Stacking eliminates most register spill / fills
• Stack engine performs the few required spill/fills

[Figure: instruction delivery across stages EXP and REN. Syllables S0 to S5 enter the Dispersal Network, which feeds issue ports M0, M1, I0, I1, F0, F1, B0, B1, B2, followed by the Integer, FP, & Predicate Renamers. The Stack Engine connects through Stall and Spill/Fill Injection.]
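The register remapping step can be pictured as a small modular addition on a 7-bit register id. The sketch below is a simplified model, assuming the IA-64 convention that r0 to r31 are static and r32 to r127 are stacked; `frame_base` and `remap` are hypothetical names, and the wraparound logic of the real adders is not described in the slides.

```python
NUM_GR = 128                         # 128 general registers, so ids fit in 7 bits
NUM_STATIC = 32                      # IA-64: r0-r31 are static, r32-r127 are stacked
NUM_STACKED = NUM_GR - NUM_STATIC    # 96 stacked registers


def remap(virtual_reg, frame_base):
    """Map a virtual register id to a physical one for the current frame.

    frame_base is a hypothetical offset (in registers) of the current
    procedure frame within the stacked region.
    """
    if not 0 <= virtual_reg < NUM_GR:
        raise ValueError("register id must fit in 7 bits")
    if virtual_reg < NUM_STATIC:
        return virtual_reg           # static registers are never renamed
    # One small addition with wraparound inside the stacked region.
    offset = (virtual_reg - NUM_STATIC + frame_base) % NUM_STACKED
    return NUM_STATIC + offset


# A callee whose frame starts 10 registers into the stacked region:
print(remap(32, 10), remap(33, 10), remap(5, 10))
```

Because a call only changes the base, the callee sees fresh registers starting at r32 without copying anything, which is why most spills and fills disappear.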
Operand Delivery (WLD, REG)
• Multiported register file + mux hierarchy delivers operands in REG
• Unique “Delayed Stall” mechanism used for register dependencies
  – Avoids pipeline flush or replay on unavailable data
  – Stall computed in REG, but core pipeline stalls only in EXE
  – Special Operand Latch Manipulation (OLM) captures data returns into operand latches, to mimic register file read
  => Retains benefits of “stall paradigm” on wide and high-frequency machine

[Figure: operand delivery across stages WLD, REG, EXE. Labeled blocks: 128 Entry Integer Register File (8R / 6W), Src operands, Bypass Muxes, ALUs, Dependency Control, Scoreboard Comparators, OLM comparators, Delayed Stall, Dst, Preds.]
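As a rough picture of what a register scoreboard does in an in-order machine, the sketch below tracks the cycle at which each register's value becomes available and delays a consumer until its sources are ready. It is a generic model with hypothetical names; it does not model the delayed-stall timing (stall computed in REG, applied in EXE) or the OLM described above. The latencies in the example use the 2-clock L1 load and 1-clock ALU figures quoted elsewhere in the deck.

```python
class Scoreboard:
    """Minimal in-order register scoreboard (illustrative only)."""

    def __init__(self):
        self.ready_at = {}   # register name -> cycle its value becomes available

    def issue(self, cycle, srcs, dest, latency):
        """Return the cycle at which the instruction can really start.

        The instruction stalls until every source register is ready,
        then marks its destination as busy for `latency` clocks.
        """
        start = max([cycle] + [self.ready_at.get(r, 0) for r in srcs])
        self.ready_at[dest] = start + latency
        return start


sb = Scoreboard()
load_start = sb.issue(0, [], "r1", latency=2)            # load into r1
add_start = sb.issue(1, ["r1", "r3"], "r2", latency=1)   # add needs r1
print(load_start, add_start)
```

Here the add is presented at cycle 1 but cannot start until the load's data is ready, so the pipeline stalls for one clock rather than flushing or replaying.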
Execution Resources (EXE)

Memory and Integer Resources (ports M0, M1, I0, I1)
  Instruction Class                                        Latency (clocks)
  ALU (Add, shift-add, logical, addp4, cmp)                1
  Sign/zero extend, MoveLong                               1
  Fixed Extract/Deposit, TBit, TNaT                        1
  Multimedia ALU                                           2
  MM Shift, Avg, Mix, Pack                                 2
  Move to/from BR/PR/ARs, Packed Multiply, PopCount        2
  LD/ST/Prefetch/SetF/Cache Control                        2+
  Memory Mngmt/System/GetF                                 2+

FP Resources (ports F0, F1)
  Instruction Class                                        Latency (clocks)
  FMAC, SIMD FMAC                                          5
  Fixed Multiply                                           7
  Fset, Fchk                                               1
  FCompare                                                 2
  FP Logicals/Class/Min/Max                                5

Branch Resources (ports B0, B1, B2)
  Instruction Class                                        Ports
  Cond/Uncond                                              B0, B1, B2
  Call/Ret/Indirect                                        B0, B1, B2
  Branch.iA, EPC                                           B0, B1, B2
  Loop, BSW, Cover                                         one port
  RFI                                                      one port

[Per-class port marks (•) for the memory/integer and FP tables are not reproduced here.]
Predication Support
• Basic strategy: All instructions read operands and execute
  – Canceled at retirement if predicates off
• Predicates generated in EXE (by cmps), delivered in DET, & feed into retirement, branch execution and dependency detection
• Smart control cancels false stalls on predicated dependencies
  – Special detection exists in REG for cancelled producer or consumer
Predication supported transparently: branches (& mispredicts) eliminated without introduction of spurious stalls

[Figure: predicate datapath across stages REG, EXE, DET. Labeled blocks: Predicate Register File Read, Bypass Muxes, I-Cmps, F-Cmps. Outputs: To Dependency Detect (x6), To Branch Execution (x3), To Retirement (x6).]
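The basic strategy above (execute everything, cancel at retirement when the predicate is off) can be sketched with a toy interpreter. The instruction tuples, register names, and the `abs_diff` example are all hypothetical; the point is only that an if/else becomes two predicated instructions and no branch.

```python
import operator


def execute(program, regs, preds):
    """Run predicated instructions: always compute, commit only if predicate is on."""
    for kind, qp, op, a, b, *dests in program:
        result = op(regs[a], regs[b])   # operands are read and the op runs regardless
        if not preds[qp]:
            continue                     # predicate off: result is dropped at "retirement"
        if kind == "cmp":
            p_true, p_false = dests      # a compare writes a complementary predicate pair
            preds[p_true], preds[p_false] = result, not result
        else:
            (dest,) = dests
            regs[dest] = result
    return regs


def abs_diff(a, b):
    """if a > b: x = a - b  else: x = b - a, written without a branch."""
    regs = {"a": a, "b": b, "x": 0}
    preds = {"p0": True, "p1": False, "p2": False}   # p0 plays the always-true predicate
    program = [
        ("cmp", "p0", operator.gt, "a", "b", "p1", "p2"),
        ("alu", "p1", operator.sub, "a", "b", "x"),
        ("alu", "p2", operator.sub, "b", "a", "x"),
    ]
    return execute(program, regs, preds)["x"]


print(abs_diff(3, 7), abs_diff(9, 2))
```

Both subtractions execute on every call; only the one whose predicate is on updates `x`, so there is nothing to mispredict.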
Speculation Hardware (EXE, DET, WRB)
• Control Speculation support requires minimal hardware
  – Computed memory exception delivered with data as tokens (NaTs)
  – NaTs propagate through subsequent executions like source data
• Data Speculation enabled efficiently via ALAT structure
  – 32 outstanding advanced loads
  – Indexed by reg-ids, keeps partial physical address tag
  – 0 clk checks: dependent use can be issued in parallel with check
Efficient elimination of memory bottlenecks

[Figure: speculation datapath across stages EXE, DET, WRB. Labeled blocks: TLB & Memory Subsystem, 32-entry ALAT, Exception Logic. Signals: Address, Physical Address, Adv Ld Status, Spec. Ld. Status (NaT), Check Instruction, Check Exception.]
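The ALAT behaviour listed above can be modeled as a small table keyed by register id. The sketch below is an illustrative model, not the hardware design: the 20-bit tag width, the oldest-first replacement, and the method names are assumptions. Only the 32-entry size, indexing by register id, and the partial address tag come from the slide.

```python
class ALAT:
    """Toy Advanced Load Address Table (illustrative assumptions throughout)."""

    def __init__(self, entries=32, tag_bits=20):
        self.entries = entries
        self.tag_mask = (1 << tag_bits) - 1   # tag width is an assumption
        self.table = {}                       # register id -> partial address tag

    def advanced_load(self, reg, paddr):
        """Record that `reg` holds data loaded early from `paddr`."""
        self.table.pop(reg, None)
        if len(self.table) >= self.entries:
            # Assumed policy: drop the oldest entry (dicts keep insertion order).
            self.table.pop(next(iter(self.table)))
        self.table[reg] = paddr & self.tag_mask

    def store(self, paddr):
        """A store removes every entry whose partial tag matches its address."""
        tag = paddr & self.tag_mask
        for reg in [r for r, t in self.table.items() if t == tag]:
            del self.table[reg]

    def check(self, reg):
        """True if the advanced load is still valid; False means recovery is needed."""
        return reg in self.table


alat = ALAT()
alat.advanced_load(40, 0x1000)
alat.store(0x2000)           # unrelated store: entry survives
print(alat.check(40))
alat.store(0x1000)           # conflicting store: entry is removed
print(alat.check(40))
```

Because only a partial tag is kept, a store to a different address with the same low bits also removes the entry. That costs an unnecessary recovery but never lets a stale value through.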
Floating Point Features
• Native 82-bit hardware provides support for multiple numeric models
• 2 Extended precision pipelined FMACs deliver 4 EP / DP FLOPs/cycle
• Balanced with plenty of operand bandwidth from registers / memory
• Tuned for 3D graphics: 2 Additional SP FMACs deliver 8 SP FLOPs/cycle; Software divide allows SW pipelining for high throughput
• FPU hardware used for twin Integer multiply-add (>1000 RSA decrypts/sec)

[Figure: FP operand bandwidth. 4Mbyte L3 Cache feeds L2 Cache at 2 DP Ops/clk; L2 Cache feeds the 128 entry 82-bit RF (even / odd banks) at 4 DP Ops/clk (2 x Fld-pair); 2 stores/clk back to cache; RF supplies 6 x 82-bit operands and receives 2 x 82-bit results.]
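Software divide is typically built from a low-precision reciprocal estimate refined by a short chain of multiply-add steps, which is what makes it easy to software-pipeline. The sketch below shows that Newton-Raphson pattern under stated assumptions: the linear seed stands in for a hardware reciprocal approximation, the `fma` helper rounds twice (unlike a real FMAC), and all names and the step count are illustrative rather than the sequence any compiler actually emits.

```python
import math


def fma(a, b, c):
    """Stand-in for a fused multiply-add; plain Python rounds twice, hardware once."""
    return a * b + c


def reciprocal_seed(b):
    """Crude 1/b estimate (assumed stand-in for a hardware approximation)."""
    mantissa, exponent = math.frexp(abs(b))            # abs(b) = mantissa * 2**exponent
    approx = 48.0 / 17.0 - (32.0 / 17.0) * mantissa    # linear fit for mantissa in [0.5, 1)
    return math.copysign(math.ldexp(approx, -exponent), b)


def soft_divide(a, b, steps=4):
    """Approximate a / b using only multiply-add operations after the seed."""
    if b == 0:
        raise ZeroDivisionError("division by zero")
    y = reciprocal_seed(b)
    for _ in range(steps):
        e = fma(-b, y, 1.0)     # error term: e = 1 - b*y
        y = fma(y, e, y)        # refine: y = y * (1 + e); error roughly squares each step
    q = a * y                   # first quotient estimate
    r = fma(-b, q, a)           # remainder: r = a - b*q
    return fma(r, y, q)         # one correction step on the quotient


print(soft_divide(1.0, 3.0), soft_divide(-7.5, 2.5))
```

Every step after the seed is an independent multiply-add, so divides from different loop iterations can be interleaved to keep both FMAC pipelines busy.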
Intel® Itanium™ Processor Block Diagram
[Figure: block diagram. Blocks shown: L1 Instruction Cache and Fetch/Pre-fetch Engine with ITLB; Branch Prediction; Instruction Queue (8 bundles); IA-32 Decode and Control; 9 Issue Ports (B B B M M I I F F); Register Stack Engine / Re-Mapping; Branch & Predicate Registers; 128 Integer Registers; 128 FP Registers; Branch Units; Integer and MM Units; Dual-Port L1 Data Cache and DTLB; ALAT; Floating Point Units (two SIMD FMACs); Scoreboard, Predicate, NaTs, Exceptions; L2 Cache; L3 Cache; Bus Controller. ECC labels mark several of the cache and bus blocks.]
Itanium™ Processor Core Summary
• State-of-the-Art processor for Servers and Workstations
  – Combines High Performance with 64-bit addressing, Reliability features for Mission critical Applications, & full iA-32 compatibility in hardware
• Highly parallel and deeply pipelined hardware at 800 MHz
  – 6-wide, 10-stage pipeline at 800 MHz on 0.18 µ process
• EPIC technology increases Instruction Level Parallelism (ILP)
  – Speculation, Predication, Explicit Parallelism, Register Stacking, Rotation, & Branch/Memory Hints maximize hardware-software synergy
• Dynamic features enable high-throughput on compiled schedule
  – Register scoreboard, non-blocking caches, Decoupled instruction prefetch & aggressive branch prediction
• Supercomputer-level FP (3.2 GFLOPs) for technical workstations