Intel Itanium Processor Core

Author: Ralf Spencer
Itanium™ Processor Core

Intel® Itanium™ Processor Core

Harsh Sharangpani
Principal Engineer and IA-64 Microarchitecture Manager
Intel Corporation

Hot Chips, 15 August 2000


Itanium™ Processor Silicon

[Die photo. Labeled parts: Core Processor Die (IA-32 Control, FPU, IA-64 Control, Integer Units, Instr. Fetch & Decode, Cache, TLB, Cache, Bus); 4 x 1MB L3 cache.]


Machine Characteristics

Frequency: 800 MHz
Transistor Count: 25.4M CPU; 295M L3
Process: 0.18u CMOS, 6 metal layer
Package: Organic Land Grid Array

Machine Width: 6 insts/clock (4 ALU/MM, 2 Ld/St, 2 FP, 3 Br)
Registers: 14 ported 128 GR & 128 FR; 64 Predicates

Speculation: 32 entry ALAT, Exception Deferral
Branch Prediction: Multilevel 4-stage Prediction Hierarchy

FP Compute Bandwidth: 3.2 GFlops (DP/EP); 6.4 GFlops (SP)
Memory -> FP Bandwidth: 4 DP (8 SP) operands/clock
Virtual Memory Support: 64 entry ITLB, 32/96 2-level DTLB, VHPT
L2/L1 Cache: Dual ported 96K Unified & 16KD; 16KI
L2/L1 Latency: 6 / 2 clocks
L3 Cache: 4MB, 4-way s.a., BW of 12.8 GB/sec

System Bus: 2.1 GB/sec; 4-way Glueless MP; Scalable to large (512+ proc) systems
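The peak-rate figures above follow from the clock frequency and the per-clock resources. The short sketch below reproduces that arithmetic. The FMAC counts come from elsewhere in the deck ("2 FMACs (4 for SSE)"); the 16-bytes-per-clock L3 data path is an assumption chosen to match the quoted 12.8 GB/sec, not a figure stated on the slide, and the function names are illustrative.

```python
# Sanity check of the peak-rate figures quoted on the slide.
CLOCK_HZ = 800e6          # 800 MHz core clock (from the slide)

FLOPS_PER_FMAC = 2        # one fused multiply-add = 1 multiply + 1 add
DP_FMACS = 2              # 2 extended/double-precision FMACs (from the deck)
SP_FMACS = 4              # "4 for SSE": FMAC-equivalents for single precision

L3_BYTES_PER_CLOCK = 16   # ASSUMPTION: 128-bit L3 data path at core clock


def peak_gflops(fmacs: int) -> float:
    """Peak floating-point rate in GFlops for a given number of FMACs."""
    return fmacs * FLOPS_PER_FMAC * CLOCK_HZ / 1e9


def l3_bandwidth_gb_per_s() -> float:
    """Peak L3 bandwidth in GB/sec under the assumed data-path width."""
    return L3_BYTES_PER_CLOCK * CLOCK_HZ / 1e9


if __name__ == "__main__":
    print(f"DP/EP peak: {peak_gflops(DP_FMACS):.1f} GFlops")
    print(f"SP peak:    {peak_gflops(SP_FMACS):.1f} GFlops")
    print(f"L3 BW:      {l3_bandwidth_gb_per_s():.1f} GB/sec")
```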


EPIC compared to Dynamic Scheduled RISC

Bottleneck: Scheduling Scope
  Itanium EPIC approach: Entire compilation scope
  Dynamic RISC approach: Traditional compiler + limited hardware window

Bottleneck: Memory Latency & Control Flow Barriers
  Itanium EPIC approach: Control Speculation across compiler scope; Data Speculation for undisambiguated memory; Extensive Memory Hints
  Dynamic RISC approach: Hardware Scheduling across dynamic window assisted by Memory Order Buffer

Bottleneck: Control Flow Disruptions
  Itanium EPIC approach: Predication for flaky branches; Extensive branch/prefetch Hints; Superscalar branching
  Dynamic RISC approach: Large Dynamic Branch Predictors; 1 branch/clock

Bottleneck: Operand Delivery
  Itanium EPIC approach: Large Register File, with Stacking & Rotation
  Dynamic RISC approach: Small Architectural File with Register Rename Tables

Bottleneck: Interprocedural Overhead
  Itanium EPIC approach: Stacking for parameter passing
  Dynamic RISC approach: Require spill/fill to memory or registers


Itanium™ EPIC Design Maximizes SW-HW Synergy

Architecture features programmed by compiler:
• Branch Hints
• Explicit Parallelism
• Register Stack & Rotation
• Predication
• Data & Control Speculation
• Memory Hints

Micro-architecture features in hardware:
• Fetch: Instruction Cache & Branch Predictors
• Issue: Fast, Simple 6-Issue
• Register Handling: 128 GR & 128 FR, Register Remap & Stack Engine
• Control: Bypasses & Dependencies
• Parallel Resources: 4 Integer + 4 MMX Units; 2 FMACs (4 for SSE); 2 LD/ST units; 32 entry ALAT; Speculation Deferral Management
• Memory Subsystem: Three levels of cache: L1, L2, L3


10 Stage In-Order Core Pipeline

Front End (IPG, FET, ROT)
• Pre-fetch/Fetch of up to 6 instructions/cycle
• Hierarchy of branch predictors
• Decoupling buffer

Instruction Delivery (EXP, REN)
• Dispersal of up to 6 instructions on 9 ports
• Reg. remapping
• Reg. stack engine

Operand Delivery (WLD, REG)
• Reg read + Bypasses
• Register scoreboard
• Predicated dependencies

Execution (EXE, DET, WRB)
• 4 single cycle ALUs, 2 ld/str
• Advanced load control
• Predicate delivery & branch
• NaT/Exception/Retirement

Stages:
IPG: Instruction Pointer Generation
FET: Fetch
ROT: Rotate
EXP: Expand
REN: Rename
WLD: Word-Line Decode
REG: Register Read
EXE: Execute
DET: Exception Detect
WRB: Write-Back
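As a compact reference, the ten stages and their four groups can be written down as a small table. This is only a restatement of the slide in code form; the `Group` enum, the `PIPELINE` list, and the `stages_in` helper are illustrative names, not anything defined by the presentation.

```python
from enum import Enum


class Group(Enum):
    FRONT_END = "Front End"
    INSTRUCTION_DELIVERY = "Instruction Delivery"
    OPERAND_DELIVERY = "Operand Delivery"
    EXECUTION = "Execution"


# (abbreviation, full name, group), in pipeline order as listed on the slide.
PIPELINE = [
    ("IPG", "Instruction Pointer Generation", Group.FRONT_END),
    ("FET", "Fetch", Group.FRONT_END),
    ("ROT", "Rotate", Group.FRONT_END),
    ("EXP", "Expand", Group.INSTRUCTION_DELIVERY),
    ("REN", "Rename", Group.INSTRUCTION_DELIVERY),
    ("WLD", "Word-Line Decode", Group.OPERAND_DELIVERY),
    ("REG", "Register Read", Group.OPERAND_DELIVERY),
    ("EXE", "Execute", Group.EXECUTION),
    ("DET", "Exception Detect", Group.EXECUTION),
    ("WRB", "Write-Back", Group.EXECUTION),
]


def stages_in(group: Group) -> list[str]:
    """Return the stage abbreviations that belong to one pipeline group."""
    return [abbr for abbr, _name, g in PIPELINE if g is group]


if __name__ == "__main__":
    for group in Group:
        print(f"{group.value}: {', '.join(stages_in(group))}")
```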


Front End (IPG, FET, ROT)

• SW-triggered prefetch loads target code early using BRP hints
• I-Fetch of 32 Bytes/clock feeds an 8-bundle decoupling buffer
• Branch hints combine with predictor hierarchy to improve branch prediction, delivering up to four progressive resteers
  - 4 TARs under compiler control
  - Adaptive 2-level predictor (512-entry 2-way + 64-entry Multiway); 64-entry Target Address Cache fed by hints; Return stack buffer
  - Perfect loop-exit predictor, BAC1, BAC2

[Diagram, stages IPG / FET / ROT / EXP: IP MUX; Target Address Registers; 16KB I-Cache & ITLB; decoupling buffer (to Dispersal); Return Stack Buffer; Adaptive 2-Level Predictor; Br Target Address Cache; BAC 1 & Loop Exit Predictor; BAC 2.]
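To illustrate the general idea behind an adaptive 2-level predictor, here is a minimal textbook-style sketch: each table entry keeps a short history of recent outcomes, and that history selects a 2-bit saturating counter. It is not a model of the Itanium predictor itself. The 512-entry size is taken from the slide, but the direct-mapped indexing, the 4-bit history length, and all class and function names are assumptions made for the example.

```python
class TwoLevelPredictor:
    """Generic two-level adaptive branch predictor (illustrative only)."""

    def __init__(self, entries: int = 512, history_bits: int = 4):
        # ASSUMPTION: direct-mapped table and 4 bits of per-entry history.
        self.entries = entries
        self.mask = (1 << history_bits) - 1
        self.history = [0] * entries
        # One 2-bit counter per history pattern, starting at "weakly taken".
        self.counters = [[2] * (1 << history_bits) for _ in range(entries)]

    def _index(self, ip: int) -> int:
        # IA-64 bundles are 16 bytes, so drop the low 4 address bits.
        return (ip >> 4) % self.entries

    def predict(self, ip: int) -> bool:
        i = self._index(ip)
        return self.counters[i][self.history[i]] >= 2

    def update(self, ip: int, taken: bool) -> None:
        i = self._index(ip)
        h = self.history[i]
        c = self.counters[i][h]
        # Saturating increment/decrement of the selected counter.
        self.counters[i][h] = min(3, c + 1) if taken else max(0, c - 1)
        # Shift the actual outcome into the history register.
        self.history[i] = ((h << 1) | int(taken)) & self.mask


def run(predictor: TwoLevelPredictor, ip: int, outcomes: list[bool]) -> list[bool]:
    """Feed outcomes for one branch; return per-branch hit (True) or miss."""
    hits = []
    for taken in outcomes:
        hits.append(predictor.predict(ip) == taken)
        predictor.update(ip, taken)
    return hits


if __name__ == "__main__":
    alternating = [i % 2 == 0 for i in range(20)]
    hits = run(TwoLevelPredictor(), 0x4000, alternating)
    print("mispredictions:", hits.count(False))
```

A strictly alternating branch defeats a single 2-bit counter but is learned by the history-indexed scheme after a short warm-up, which is the property the second level is there to provide.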


Instruction Delivery (EXP, REN)

• Stop bits eliminate dependency checking; Templates simplify routing; 1st available dispersal from 6 syllables to 9 issue ports
• Stacking eliminates most register spill / fills
  - Register remapping done via several parallel 7-bit adders
  - Stack engine performs the few required spill/fills
• REN stage supports renaming for stacking & rotation

[Diagram, stages EXP / REN / WLD: syllables S0-S5 enter the Dispersal Network, which feeds issue ports M0 M1, I0 I1, F0 F1, B0 B1 B2; Integer, FP, & Predicate Renamers; Stack Engine with Stall and Spill/Fill Injection.]
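The remapping mentioned above amounts to a small modular addition per register operand, which is why 7-bit adders suffice for a 128-entry file. The sketch below shows a simplified version for the general registers, using the IA-64 split of 32 static registers (r0-r31) and 96 stacked registers (r32-r127). The parameter names `bof` (bottom of frame), `sor` (size of rotating region), and `rrb` (rotating register base) follow IA-64 terminology, but treating `rrb` as a non-negative offset and the function name `remap` are simplifications assumed for this example.

```python
NUM_STATIC = 32    # r0-r31 are never remapped
NUM_STACKED = 96   # r32-r127 are subject to stacking and rotation


def remap(logical: int, bof: int, sor: int = 0, rrb: int = 0) -> int:
    """Map a logical general register number to a physical register number.

    logical: register number as written in the instruction (0..127)
    bof:     physical offset of the current frame's first stacked register
    sor:     size of the rotating region within the frame (0 = no rotation)
    rrb:     rotation offset, simplified here to a value in 0..sor-1
    """
    if logical < NUM_STATIC:
        return logical                      # static registers pass through
    offset = logical - NUM_STATIC
    if offset < sor:
        offset = (offset + rrb) % sor       # rotation: wrap within the region
    # Stacking: add the frame base and wrap within the stacked partition.
    return NUM_STATIC + (bof + offset) % NUM_STACKED


if __name__ == "__main__":
    # A callee whose frame starts 90 registers into the stacked partition.
    for r in (5, 32, 40):
        print(f"r{r} -> physical {remap(r, bof=90)}")
```

Because a procedure call only changes `bof`, the caller's registers stay in place and no memory traffic is needed unless the physical partition overflows, which is the case the stack engine handles.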


Operand Delivery (WLD, REG)

• Multiported register file + mux hierarchy delivers operands in REG
• Unique "Delayed Stall" mechanism used for register dependencies
  - Avoids pipeline flush or replay on unavailable data
  - Stall computed in REG, but core pipeline stalls only in EXE
  - Special Operand Latch Manipulation (OLM) captures data returns into operand latches, to mimic register file read
  => Retains benefits of "stall paradigm" on wide and hi-frequency machine

[Diagram, stages WLD / REG / EXE: 128 Entry Integer Register File (8R / 6W); Src operands through Bypass Muxes to ALUs; Dependency Control with Scoreboard Comparators and OLM comparators producing the Delayed Stall; inputs Dst, Preds.]


Execution Resources (EXE)

Memory and Integer Resources (ports M0, M1, I0, I1)
Instruction class: latency in clocks
  ALU (Add, shift-add, logical, addp4, cmp): 1
  Sign/zero extend, MoveLong: 1
  Fixed Extract/Deposit, TBit, TNaT: 1
  Multimedia ALU: 2
  MM Shift, Avg, Mix, Pack: 2
  Move to/from BR/PR/ARs, Packed Multiply, PopCount: 2
  LD/ST/Prefetch/SetF/Cache Control: 2+
  Memory Mngmt/System/GetF: 2+

FP Resources (ports F0, F1)
Instruction class: latency in clocks
  FMAC, SIMD FMAC: 5
  Fixed Multiply: 7
  Fset, Fchk: 1
  FCompare: 2
  FP Logicals/Class/Min/Max: 5

Branch Resources (ports B0, B1, B2)
Instruction class: ports
  Cond/Uncond: B0, B1, B2
  Call/Ret/Indirect: B0, B1, B2
  Branch.iA, EPC: B0, B1, B2
  Loop, BSW, Cover: one port
  RFI: one port


Predication Support (EXE)

• Basic strategy: All instructions read operands and execute
  - Canceled at retirement if predicates off
• Predicates generated in EXE (by cmps), delivered in DET, & feed into retirement, branch execution and dependency detection
• Smart control cancels false stalls on predicated dependencies
  - Special detection exists in REG for cancelled producer or consumer
• Predication supported transparently: branches (& mispredicts) eliminated without introduction of spurious stalls

[Diagram, stages REG / EXE / DET: Predicate Register File Read; Bypass Muxes; I-Cmps and F-Cmps; outputs To Dependency Detect (x6), To Branch Execution (x3), To Retirement (x6).]
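The basic strategy can be shown with a toy interpreter: every instruction computes its result, and the qualifying predicate only decides whether that result is committed. The example program is an if-converted "absolute difference" with no branch in it. The tuple-based instruction encoding and the `execute` function are invented for this sketch; the one architectural fact relied on is that predicate p0 always reads as true.

```python
def execute(program, regs):
    """Run a list of predicated instructions over a register dict.

    Instruction forms (hypothetical encoding for this sketch):
      ("cmp", qp, p_true, p_false, cond)  -> writes two complementary predicates
      ("alu", qp, dst, op)                -> writes one register
    """
    preds = {0: True}                       # p0 is hardwired to true in IA-64
    for kind, qp, *rest in program:
        if kind == "cmp":
            p_true, p_false, cond = rest
            outcome = cond(regs)            # always evaluated
            if preds.get(qp, False):        # committed only if qualifying predicate is on
                preds[p_true], preds[p_false] = outcome, not outcome
        else:
            dst, op = rest
            value = op(regs)                # always executes
            if preds.get(qp, False):        # canceled at "retirement" if predicate is off
                regs[dst] = value
    return regs


# if (r1 > r2) r3 = r1 - r2; else r3 = r2 - r1;   expressed without a branch
ABS_DIFF = [
    ("cmp", 0, 1, 2, lambda r: r[1] > r[2]),
    ("alu", 1, 3, lambda r: r[1] - r[2]),
    ("alu", 2, 3, lambda r: r[2] - r[1]),
]

if __name__ == "__main__":
    print(execute(ABS_DIFF, {1: 3, 2: 10})[3])
    print(execute(ABS_DIFF, {1: 10, 2: 3})[3])
```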


Speculation Hardware (DET, WRB)

• Control Speculation support requires minimal hardware
  - Computed memory exception delivered with data as tokens (NaTs)
  - NaTs propagate through subsequent executions like source data
• Data Speculation enabled efficiently via ALAT structure
  - 32 outstanding advanced loads
  - Indexed by reg-ids, keeps partial physical address tag
  - 0 clk checks: dependent use can be issued in parallel with check
• Efficient elimination of memory bottlenecks

[Diagram, stages EXE / DET / WRB: Address into TLB & Memory Subsystem; Physical Address into 32-entry ALAT; Adv Ld Status and Spec. Ld. Status (NaT); Check Instruction; Exception Logic; Check Exception.]
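The ALAT behaviour described above can be sketched as a small table keyed by destination register: an advanced load allocates an entry, a later store to a matching address removes it, and a check succeeds only if the entry survived. The 32-entry capacity and the partial address tag come from the slide. The tag width, the oldest-first replacement, the exact-address match (real hardware also considers access size), and all names in the code are assumptions for illustration.

```python
class ALAT:
    """Toy Advanced Load Address Table (illustrative only)."""

    def __init__(self, entries: int = 32, tag_bits: int = 20):
        # ASSUMPTION: tag width and oldest-first replacement are placeholders.
        self.entries = entries
        self.tag_mask = (1 << tag_bits) - 1
        self.table: dict[int, int] = {}   # register id -> partial address tag

    def advanced_load(self, reg: int, addr: int) -> None:
        """Record an advanced load (ld.a) into register `reg` from `addr`."""
        self.table.pop(reg, None)
        if len(self.table) >= self.entries:
            oldest = next(iter(self.table))   # dicts keep insertion order
            del self.table[oldest]
        self.table[reg] = addr & self.tag_mask

    def store(self, addr: int) -> None:
        """A store removes every entry whose partial tag matches."""
        tag = addr & self.tag_mask
        for reg in [r for r, t in self.table.items() if t == tag]:
            del self.table[reg]

    def check(self, reg: int) -> bool:
        """True if the advanced load into `reg` is still valid (no recovery needed)."""
        return reg in self.table


if __name__ == "__main__":
    alat = ALAT()
    alat.advanced_load(reg=5, addr=0x1000)
    alat.store(0x2000)
    print("after unrelated store:", alat.check(5))
    alat.store(0x1000)
    print("after conflicting store:", alat.check(5))
```

Keeping only a partial tag means two different addresses can occasionally match. That costs an unnecessary recovery but never returns stale data, so the table stays small without affecting correctness.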


Floating Point Features

• Native 82-bit hardware provides support for multiple numeric models
• 2 Extended precision pipelined FMACs deliver 4 EP / DP FLOPs/cycle
• Balanced with plenty of operand bandwidth from registers / memory
• Tuned for 3D graphics:
  - 2 Additional SP FMACs deliver 8 SP FLOPs/cycle
  - Software divide allows SW pipelining for high throughput
• FPU hardware used for twin Integer multiply-add (>1000 RSA decrypts/sec)

[Diagram labels: 4Mbyte L3 Cache; L2 Cache; 128 entry 82-bit RF (even / odd); 4 DP Ops/clk (2 x Fld-pair); 2 DP Ops/clk; 2 stores/clk; 6 x 82-bit operands; 2 x 82-bit results.]
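A software divide is typically built from a reciprocal estimate refined by multiply-add steps, which is what lets a compiler interleave several divides in a software-pipelined loop. The sketch below uses a generic Newton-Raphson refinement to show the shape of such a sequence; it is not the actual Itanium divide sequence. The linear seed stands in for a hardware reciprocal approximation, `fma` here is an ordinary multiply and add (a real FMAC rounds once, not twice), and the function names and iteration count are assumptions. A zero divisor is not handled.

```python
import math


def fma(a: float, b: float, c: float) -> float:
    # Stand-in for a fused multiply-add. Plain Python rounds twice here.
    return a * b + c


def reciprocal_seed(b: float) -> float:
    """Rough approximation of 1/b (relative error at most about 1/17)."""
    m, e = math.frexp(abs(b))                 # abs(b) = m * 2**e, 0.5 <= m < 1
    approx = 48.0 / 17.0 - (32.0 / 17.0) * m  # classic linear fit to 1/m
    return math.copysign(math.ldexp(approx, -e), b)


def divide(a: float, b: float, iterations: int = 4) -> float:
    """Approximate a / b using only multiply-add style operations."""
    y = reciprocal_seed(b)
    for _ in range(iterations):
        err = fma(-b, y, 1.0)                 # err = 1 - b*y
        y = fma(y, err, y)                    # y = y + y*err, error roughly squares
    q = a * y
    r = fma(-b, q, a)                         # remainder of the first quotient
    return fma(r, y, q)                       # one correction step


if __name__ == "__main__":
    for a, b in [(1.0, 3.0), (355.0, 113.0), (-7.5, 0.3)]:
        print(a, "/", b, "~", divide(a, b))
```

Each refinement step depends only on the previous one for the same divide, so steps from independent divides can fill the FMAC pipeline latency between them.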


Intel® Itanium™ Processor Block Diagram

[Block diagram. Labeled blocks:
• L1 Instruction Cache and Fetch/Pre-fetch Engine, with ITLB and Branch Prediction
• Instruction Queue (8 bundles)
• IA-32 Decode and Control
• 9 Issue Ports: B B B M M I I F F
• Register Stack Engine / Re-Mapping
• Branch & Predicate Registers; 128 Integer Registers; 128 FP Registers
• Branch Units; Integer and MM Units; Floating Point Units (2 SIMD FMACs)
• Dual-Port L1 Data Cache and DTLB; ALAT
• Scoreboard, Predicate, NaTs, Exceptions
• L2 Cache; L3 Cache; Bus Controller
• ECC marked on several cache and bus blocks]


Itanium™ Processor Core Summary

• State-of-the-Art processor for Servers and Workstations
  - Combines High Performance with 64-bit addressing, Reliability features for Mission critical Applications, & full IA-32 compatibility in hardware
• Highly parallel and deeply pipelined hardware at 800 MHz
  - 6-wide, 10-stage pipeline at 800 MHz on 0.18 µ process
• EPIC technology increases Instruction Level Parallelism (ILP)
  - Speculation, Predication, Explicit Parallelism, Register Stacking, Rotation, & Branch/Memory Hints maximize hardware-software synergy
• Dynamic features enable high-throughput on compiled schedule
  - Register scoreboard, non-blocking caches, Decoupled instruction prefetch & aggressive branch prediction
• Supercomputer-level FP (3.2 GFLOPs) for technical workstations
