The Intel® Pentium® 4 Processor
Doug Carmean, Principal Architect
Intel Architecture Group
Spring 2002
Intel® Copyright © 2002 Intel Corporation.
PDX
Agenda
• Review
• Pipeline Depth
• Execution Trace Cache
• Data Speculation
• Spec Performance
• Summary

Information in this document is provided in connection with Intel® products. No license, express or implied, by estoppel or otherwise, to any intellectual property rights is granted by this document. Except as provided in Intel’s Terms and Conditions of Sale for such products, Intel assumes no liability whatsoever, and Intel disclaims any express or implied warranty, relating to sale and/or use of Intel® products including liability or warranties relating to fitness for a particular purpose, merchantability, or infringement of any patent, copyright or other intellectual property right. Intel products are not intended for use in medical, life saving, or life sustaining applications. Intel may make changes to specifications and product descriptions at any time, without notice. This document contains information on products in the design phase of development. Do not finalize a design with this information. Revised information will be published when the product is available. Verify with your local sales office that you have the latest datasheet before finalizing a design. Intel processors may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request. Intel, Pentium, and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and foreign countries. Copyright © (2001) Intel Corporation.
Intel® NetBurst™ Micro-architecture vs P6

Basic P6 Pipeline (10 stages)
1 Fetch, 2 Fetch, 3 Decode, 4 Decode, 5 Decode, 6 Rename, 7 ROB Rd, 8 Rdy/Sch, 9 Dispatch, 10 Exec

Basic Pentium® 4 Processor Pipeline (20 stages)
1-2 TC Nxt IP, 3-4 TC Fetch, 5 Drive, 6 Alloc, 7-8 Rename, 9 Que, 10-12 Sch, 13-14 Disp, 15-16 RF, 17 Ex, 18 Flgs, 19 Br Ck, 20 Drive

Deeper Pipelines enable higher frequency and performance

Hyper Pipelined Technology
[Chart: Frequency vs. Time, one step per micro-architecture, labeled with pipeline depth: P5 Micro-Architecture (5), P6 Micro-Architecture (10), Netburst Micro-Architecture (20). Frequency labels: 60MHz, 166MHz, 233MHz, 1.2GHz, 1.5GHz, 1.8GHz, 2.2GHz. Markers: "Introduction", "Intro", "Today".]

Deeper Pipelines are Better
[Chart: Performance Improvement (0% to 120%) vs. Pipeline Depth (10 to 30), with one curve each labeled 256 KB, 512 KB, 1 MB, and 2 MB.]
Source: Average of 2000 application segments from performance simulations

Why not deeper pipelines?
• Increases complexity
– Harder to balance
– More challenges to architect around
– More algorithms
– Greater validation effort
– Need to pipeline the wires
Overall Engineering Effort Increases Quickly as Pipeline Depth Increases

Performance
• High bandwidth front end
• Low latency core
• Lower memory latency
High Bandwidth Front End

Higher Frequency increases requirements of front end
• Branch prediction is more important
– So we improved it
• Need greater uop bandwidth
– Branches constantly change the flow
– Need to decode more instructions in parallel

Block Diagram
[Block diagram of the processor. Components and labels:]
System Bus: 64-bit wide, Quad Pumped, 3.2 GB/s
Bus Interface Unit
L2 Cache: 256-bit wide, 64GB/s
Front end: Instruction TLB, Instruction Decoder, Dynamic BTB (4K Entries), Micro Instruction Sequencer, Execution Trace Cache (12K µops), Trace Cache BTB (512 Entries)
Allocator / Register Renamer
µop queues: Memory µop Queue, Integer/Floating Point µop Queue
Schedulers: Memory; Integer Schedulers (Slow, Fast Int, Fast Int); Floating Point Schedulers (FP Move, FP Gen)
Register files: Integer Register File / Bypass Network, FP Register / Bypass
Execution units: 2xAGU (Ld/St Address unit), 2xALU (Simple Instr.), 2xALU (Simple Instr.), Slow ALU (Complex Instr.), FP Move, Fmul / FAdd
L1 Data Cache: 8Kbyte, 4-way

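The two bandwidth labels on the block diagram can be cross-checked with simple arithmetic. The sketch below is illustrative only: the 100 MHz base bus clock and the 2 GHz clock for the L2 interface (one 256-bit transfer per clock) are assumptions that do not appear on this slide, and the function name is made up for the example.

```python
def bandwidth_gb_per_s(width_bits, transfers_per_clock, clock_mhz):
    """Peak bandwidth in GB/s for a bus of the given width and clocking."""
    bytes_per_transfer = width_bits / 8
    bytes_per_second = bytes_per_transfer * transfers_per_clock * clock_mhz * 1e6
    return bytes_per_second / 1e9

# System bus: 64-bit wide, quad pumped (4 transfers per clock).
# ASSUMPTION: 100 MHz base bus clock.
system_bus = bandwidth_gb_per_s(64, 4, 100)

# L2 interface: 256-bit wide.
# ASSUMPTION: one transfer per clock at a 2 GHz clock.
l2_interface = bandwidth_gb_per_s(256, 1, 2000)

print(f"System bus:   {system_bus:.1f} GB/s")
print(f"L2 interface: {l2_interface:.1f} GB/s")
```

Under those assumptions the results match the 3.2 GB/s and 64GB/s labels on the diagram.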
Execution Trace Cache

Program layout in memory:
    1  cmp
    2  br -> T1
    ..  ... (unused code)
T1: 3  sub
    4  br -> T2
    ..  ... (unused code)
T2: 5  mov
    6  sub
    7  br -> T3
    ..  ... (unused code)
T3: 8  add
    9  sub
    10 mul
    11 cmp
    12 br -> T4

Trace Cache Delivery (one row per fetch):
1 cmp     | 2 br T1   | 3 T1: sub
4 br T2   | 5 mov     | 6 sub
7 br T3   | 8 T3: add | 9 sub
10 mul    | 11 cmp    | 12 br T4

Execution Trace Cache

P6 Microarchitecture (one row per fetch):
1 cmp     | 2 br T1
3 T1: sub | 4 br T2
5 mov     | 6 sub     | 7 br T3
8 T3: add | 9 sub     | 10 mul
11 cmp    | 12 br T4
BW = 1.5 uops/ns

Trace Cache Delivery (one row per fetch):
1 cmp     | 2 br T1   | 3 T1: sub
4 br T2   | 5 mov     | 6 sub
7 br T3   | 8 T3: add | 9 sub
10 mul    | 11 cmp    | 12 br T4
BW = 6 uops/ns

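The difference between the two delivery patterns can be reproduced with a toy fetch model. This is a minimal sketch, not a description of either machine: it assumes a fetch width of 3, treats each instruction as one uop, treats every branch in the example as taken, and uses made-up function names. It shows only the grouping effect; it does not derive the 1.5 and 6 uops/ns figures, which also depend on clock frequency.

```python
# Program-order instruction stream from the slide (all branches assumed taken).
STREAM = ["cmp", "br T1", "sub", "br T2", "mov", "sub", "br T3",
          "add", "sub", "mul", "cmp", "br T4"]

def is_taken_branch(uop):
    """In this example every branch is taken."""
    return uop.startswith("br")

def conventional_fetch(uops, width=3):
    """A fetch group ends after `width` uops or at a taken branch."""
    groups, current = [], []
    for uop in uops:
        current.append(uop)
        if len(current) == width or is_taken_branch(uop):
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups

def trace_cache_fetch(uops, width=3):
    """Uops are stored in execution order, so taken branches do not end a group."""
    return [uops[i:i + width] for i in range(0, len(uops), width)]

conventional = conventional_fetch(STREAM)
trace = trace_cache_fetch(STREAM)
print(len(conventional), "fetches without a trace cache")
print(len(trace), "fetches with a trace cache")
```

In this model the conventional front end needs five fetches for the twelve instructions and the trace cache needs four, because no slots are lost after a taken branch.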
Inside the Execution Trace Cache
[Diagram: Instruction Pointer 0x0900 looks up an array organized as Set 0 to Set 4 by Way 0 to Way 3. The trace occupies four lines labeled head, body 1, body 2, and tail.]
head:   cmp, br T1, T1:sub, br T2, mov, sub
body 1: br T3, T3:add, sub, mul, cmp, br T4
body 2: T4:add, sub, mov, add, add, mov
tail:   add, sub, mov, add, add, mov

Self Modifying Code
• Programs that modify the instruction stream that is being executed
• Very common in Java* code from JITs
• Requires hardware mechanisms to maintain consistency
*Other names and brands may be claimed as the property of others

Self Modifying Code
• The hardware needs to handle two basic cases:
– Stores that write to instructions in the Trace Cache
– Instruction fetches that hit pending stores
    – Speculative
    – Committed

Case 1: Stores to cached instructions
[Diagram labels: Execution Core, Data TLB, Store’s Physical Address, Instruction TLB (128 entries) with “in use” bits, Trace Cache; addr paths between them.]

Case 2: Fetches to pending stores
[Diagram labels: Instruction Pointer, Instruction TLB (128 entries), Store Buffer, Write Combining Buffer, Execution Core; Speculative and Committed stores; addr paths; responses “Please Re-Fetch” and “Please Flush Pipeline”.]

Execution Trace Cache
• Provides higher bandwidth for higher frequency core
• Reduces fetch latency
• Requires fundamentally new algorithms

Performance
• High bandwidth front end
• Low latency core
• Lower memory latency
Low Latency Core

Data Speculation
• Use data before we are sure it is valid
– Lowers effective LD latency
– Fast ALUs in Pentium 4 want fast LDs
– Ratio of LD latency to ADD latency is important if 1 in 5 uops is a LD
• As pipelines get deeper, data speculation gets more important
– Number of cycles saved with data speculation increases as pipeline depth increases

L1 Data Cache
[Block diagram repeated from the "Block Diagram" slide, with the L1 Data Cache (8Kbyte, 4-way) highlighted.]

L1 Cache is >3x Faster
• P6:
– 3 clocks @ 1GHz = 3ns
• P4P:
– 2 clocks @ 2GHz = 1ns
Lower Latency is Higher Performance

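The latency figures on this slide follow from dividing the clock count by the clock frequency. A minimal sketch of that arithmetic, with a function name chosen for illustration:

```python
def latency_ns(clocks, freq_ghz):
    """Access latency in nanoseconds: one clock at f GHz lasts 1/f ns."""
    return clocks / freq_ghz

p6_latency = latency_ns(clocks=3, freq_ghz=1.0)   # P6: 3 clocks @ 1GHz
p4_latency = latency_ns(clocks=2, freq_ghz=2.0)   # P4P: 2 clocks @ 2GHz
ratio = p6_latency / p4_latency

print(f"P6:  {p6_latency:.1f} ns")
print(f"P4P: {p4_latency:.1f} ns")
print(f"Ratio: {ratio:.1f}x")
```

With exactly the two data points listed, the ratio comes out to 3; a core clock above 2GHz would push it past that.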
L1 Data Cache Pipeline Stages
[Diagram: L1 data cache pipeline on the 2x Clock. Labels: VA 15:0 (tag, data), VA 31:16, Data, Fast SB, VA 31:0 (TLB, TAG), Slow SB, and three replay paths.]

L1 Data Cache
[Diagram: the same pipeline with array detail. Labels: VA 15:0 (tag, data), VA 31:16, 2x Clock, Data, Fast SB, VA 31:0 (TLB, TAG), Slow SB; address bit fields 10:6, 15:11, and 19:16; Way Predictor (Tag array); Way select; Data array; Hit (Replay).]

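One plausible reading of the bit-field labels is sketched below. It is an interpretation, not the documented design: it assumes 64-byte lines (which gives 8Kbyte / 4 ways / 64 bytes = 32 sets, consistent with a 5-bit index at 10:6), and it assumes bits 15:11 act as a partial tag for way prediction. The 19:16 label is not modeled, and all names are hypothetical.

```python
LINE_BYTES = 64   # ASSUMPTION: line size is not stated on the slide
NUM_WAYS = 4      # from the slide: 4-way
NUM_SETS = 8 * 1024 // NUM_WAYS // LINE_BYTES  # 32 sets under the assumption

def bits(value, hi, lo):
    """Extract the inclusive bit range value[hi:lo]."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)

def split_address(va):
    """Split a virtual address into the fields assumed above."""
    return {
        "offset": bits(va, 5, 0),         # byte within the line
        "set": bits(va, 10, 6),           # set index
        "partial_tag": bits(va, 15, 11),  # ASSUMED partial tag
    }

class WayPredictor:
    """Toy way predictor keyed on the partial tag only."""

    def __init__(self):
        self.tags = [[None] * NUM_WAYS for _ in range(NUM_SETS)]

    def fill(self, va, way):
        fields = split_address(va)
        self.tags[fields["set"]][way] = fields["partial_tag"]

    def predict(self, va):
        """Return the predicted way, or None if no partial tag matches."""
        fields = split_address(va)
        for way, tag in enumerate(self.tags[fields["set"]]):
            if tag == fields["partial_tag"]:
                return way
        return None

predictor = WayPredictor()
predictor.fill(0x12345678, way=2)
print(split_address(0x12345678))
print(predictor.predict(0x12345678))  # same address: predicted way
print(predictor.predict(0xABCD5678))  # same low 16 bits: also predicted
```

Because the prediction uses only the low 16 address bits, two addresses that differ only in their upper bits predict the same way. In this reading, the later full tag check is what catches the mismatch and triggers the replay shown in the diagram.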
A Digression on Stores
• Two components to a store:
– STA: address computation
– STD: data piece
• Hybrid uOP
– Single uOP in the front, back ends
– Two uOPs in the middle

Memory Disambiguation
Ld EAX 20% – The other traces are unaffected – Average performance improvement < 0.1%
Performance
• High bandwidth front end
• Low latency core
• Lower memory latency
Lower Memory Latency

Reducing Latency
• As frequency increases, it is important to improve the performance of the memory subsystem
• Data Prefetch Logic
– Watches processor memory traffic
– Looks for patterns
– Initiates accesses

Data Prefetch Logic
[Block diagram repeated from the "Block Diagram" slide, with a Data Prefetch Logic block added.]

Data Prefetch Logic
[Diagram labels: Instruction Buffers, Instruction Fetch, Data Prefetch Logic, L1 Data Cache, L2 Advanced Transfer Cache, Bus Queue; numeric labels 64 and 256.]
Prefetch logic first checks the L2 cache, then fetches from memory the lines that miss in L2.

Data Prefetch Logic
• Watches for streaming memory access patterns
– Can track 8 independent streams
– Loads, Stores or Instruction
– Forward or Backward
• Analysis on 32 byte cache line granularity
• Looks for “mostly” complete streams:
– Access to cache lines 1,2,3,4,5,6 will prefetch
– Access to cache lines 1,2, 4,5,6 will prefetch
– 1, ,3, , ,6, , ,9 will not prefetch

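The slide does not give the rule the hardware uses to decide that a stream is "mostly" complete. The sketch below uses an invented heuristic, the fraction of cache lines touched within the covered range, with a hypothetical threshold of 0.75 chosen only because it reproduces the three examples above. It is not the actual prefetcher logic.

```python
LINE_BYTES = 32            # from the slide: 32 byte cache line granularity
DENSITY_THRESHOLD = 0.75   # HYPOTHETICAL: picked to match the slide's examples

def lines_from_addresses(addresses):
    """Map byte addresses to cache line numbers."""
    return [addr // LINE_BYTES for addr in addresses]

def is_mostly_complete(lines):
    """True if enough of the lines between the lowest and highest were touched.

    Order does not matter, so forward and backward streams are treated alike.
    """
    touched = set(lines)
    if len(touched) < 2:
        return False
    span = max(touched) - min(touched) + 1
    return len(touched) / span >= DENSITY_THRESHOLD

examples = {
    "1,2,3,4,5,6": [1, 2, 3, 4, 5, 6],
    "1,2,4,5,6": [1, 2, 4, 5, 6],
    "1,3,6,9": [1, 3, 6, 9],
}
for label, lines in examples.items():
    verdict = "will prefetch" if is_mostly_complete(lines) else "will not prefetch"
    print(f"{label}: {verdict}")
```

A real implementation would also need per-stream state for the 8 tracked streams and a direction bit for forward or backward access; both are left out to keep the example short.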
Performance
• High bandwidth front end
• Low latency core
• Lower memory latency