The Intel Pentium 4 Processor

Author: Horace Gibson
The Intel® Pentium® 4 Processor
Doug Carmean, Principal Architect
Intel Architecture Group
Spring 2002

Intel® Copyright © 2002 Intel Corporation.

PDX

Agenda
• Review
• Pipeline Depth
• Execution Trace Cache
• Data Speculation
• Spec Performance
• Summary

Information in this document is provided in connection with Intel® products. No license, express or implied, by estoppel or otherwise, to any intellectual property rights is granted by this document. Except as provided in Intel’s Terms and Conditions of Sale for such products, Intel assumes no liability whatsoever, and Intel disclaims any express or implied warranty, relating to sale and/or use of Intel® products including liability or warranties relating to fitness for a particular purpose, merchantability, or infringement of any patent, copyright or other intellectual property right. Intel products are not intended for use in medical, life saving, or life sustaining applications. Intel may make changes to specifications and product descriptions at any time, without notice. This document contains information on products in the design phase of development. Do not finalize a design with this information. Revised information will be published when the product is available. Verify with your local sales office that you have the latest datasheet before finalizing a design. Intel processors may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request. Intel, Pentium, and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and foreign countries. Copyright © (2001) Intel Corporation.


Intel® NetBurst™ Micro-architecture vs P6

Basic P6 Pipeline
   Stage 1      Fetch
   Stage 2      Fetch
   Stage 3      Decode
   Stage 4      Decode
   Stage 5      Decode
   Stage 6      Rename
   Stage 7      ROB Rd
   Stage 8      Rdy/Sch
   Stage 9      Dispatch
   Stage 10     Exec

Basic Pentium® 4 Processor Pipeline
   Stages 1-2   TC Nxt IP
   Stages 3-4   TC Fetch
   Stage 5      Drive
   Stage 6      Alloc
   Stages 7-8   Rename
   Stage 9      Que
   Stages 10-12 Sch
   Stages 13-14 Disp
   Stages 15-16 RF
   Stage 17     Ex
   Stage 18     Flgs
   Stage 19     Br Ck
   Stage 20     Drive

Deeper pipelines enable higher frequency and performance

Hyper Pipelined Technology

[Figure: frequency versus time, from introduction to today, for three micro-architectures labeled with their pipeline depths: P5 Micro-Architecture (5), P6 Micro-Architecture (10), and NetBurst Micro-Architecture (20). Frequency labels shown: 60MHz, 166MHz, 233MHz, 1.2GHz, 1.5GHz, 1.8GHz, and 2.2GHz (today).]

Deeper Pipelines are Better

[Figure: performance improvement (0% to 120%) versus pipeline depth (10 to 30), with one curve per cache size: 256 KB, 512 KB, 1 MB, and 2 MB.]

Source: Average of 2000 application segments from performance simulations

Why not deeper pipelines?
• Increases complexity
   – Harder to balance
   – More challenges to architect around
   – More algorithms
   – Greater validation effort
   – Need to pipeline the wires

Overall engineering effort increases quickly as pipeline depth increases

Performance
• High bandwidth front end
• Low latency core
• Lower memory latency

High Bandwidth Front End

Higher Frequency increases requirements of front end
• Branch prediction is more important
   – So we improved it
• Need greater uop bandwidth
   – Branches constantly change the flow
   – Need to decode more instructions in parallel

Block Diagram

[Figure: Pentium 4 processor block diagram. Blocks and labels shown:]
   System Bus: 64-bit wide, Quad Pumped, 3.2 GB/s
   Bus Interface Unit
   L2 Cache: 256-bit wide, 64GB/s
   Instruction TLB
   Instruction Decoder
   Dynamic BTB: 4K Entries
   Micro Instruction Sequencer
   Execution Trace Cache: 12K µops
   Trace Cache BTB: 512 Entries
   Allocator / Register Renamer
   Memory µop Queue
   Integer/Floating Point µop Queue
   Schedulers: Memory; Integer Schedulers (Slow, Fast Int, Fast Int); Floating Point Schedulers (FP Move, FP Gen)
   Integer Register File / Bypass Network
   Integer execution units: 2xAGU (Ld/St Address unit), 2xALU (Simple Instr.), 2xALU (Simple Instr.), Slow ALU (Complex Instr.)
   FP Register / Bypass
   FP execution units: FP Move, Fmul, FAdd
   L1 Data Cache: 8Kbyte, 4-way
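As a rough cross-check of the bandwidth labels in the block diagram, the sketch below back-computes the clock rate that each figure implies. It assumes 1 GB = 10^9 bytes, that "quad pumped" means four transfers per bus clock, and that the 64GB/s, 256-bit label describes the L2 cache interface. The function name `implied_clock_hz` is illustrative and does not come from the slides.

```python
def implied_clock_hz(bandwidth_bytes_per_s: float,
                     width_bits: int,
                     transfers_per_clock: int = 1) -> float:
    """Clock rate needed to reach a bandwidth on a link of the given width."""
    bytes_per_clock = (width_bits // 8) * transfers_per_clock
    return bandwidth_bytes_per_s / bytes_per_clock


# System bus: 64-bit wide, quad pumped, 3.2 GB/s (figures from the slide).
bus_clock_hz = implied_clock_hz(3.2e9, width_bits=64, transfers_per_clock=4)

# L2 interface: 256-bit wide, 64 GB/s (figures from the slide).
l2_clock_hz = implied_clock_hz(64e9, width_bits=256)

print(f"implied bus clock: {bus_clock_hz / 1e6:.0f} MHz")
print(f"implied L2 clock:  {l2_clock_hz / 1e9:.1f} GHz")
```

Under these assumptions the bus figure corresponds to a 100 MHz base clock and the L2 figure to a 2 GHz clock.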

Execution Trace Cache

         1 cmp
         2 br -> T1
           ... (unused code)
   T1:   3 sub
         4 br -> T2
           ... (unused code)
   T2:   5 mov
         6 sub
         7 br -> T3
           ... (unused code)
   T3:   8 add
         9 sub
        10 mul
        11 cmp
        12 br -> T4

Trace Cache Delivery
    1 cmp       2 br T1      3 T1: sub
    4 br T2     5 mov        6 sub
    7 br T3     8 T3: add    9 sub
   10 mul      11 cmp       12 br T4

Execution Trace Cache

P6 Microarchitecture
    1 cmp       2 br T1
    3 T1: sub   4 br T2
    5 mov       6 sub        7 br T3
    8 T3: add   9 sub       10 mul
   11 cmp      12 br T4
   BW = 1.5 uops/ns

Trace Cache Delivery
    1 cmp       2 br T1      3 T1: sub
    4 br T2     5 mov        6 sub
    7 br T3     8 T3: add    9 sub
   10 mul      11 cmp       12 br T4
   BW = 6 uops/ns
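The contrast between the two delivery schemes can be sketched in a few lines of Python. This is a simplified model, not Intel's algorithm: it assumes a fetch width of 3 uops per clock, assumes every branch in the example is taken, and assumes a conventional front end has to end a fetch group at each taken branch. The names `conventional_groups` and `trace_groups` are made up for illustration.

```python
from typing import List, Tuple

# (uop number, mnemonic, is_taken_branch) for the 12-uop example on the slide.
Uop = Tuple[int, str, bool]
STREAM: List[Uop] = [
    (1, "cmp", False), (2, "br T1", True),
    (3, "sub", False), (4, "br T2", True),
    (5, "mov", False), (6, "sub", False), (7, "br T3", True),
    (8, "add", False), (9, "sub", False), (10, "mul", False),
    (11, "cmp", False), (12, "br T4", True),
]


def conventional_groups(stream: List[Uop], width: int = 3) -> List[List[Uop]]:
    """Fetch groups that end at a taken branch or when the group is full."""
    groups: List[List[Uop]] = []
    current: List[Uop] = []
    for uop in stream:
        current.append(uop)
        if len(current) == width or uop[2]:
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups


def trace_groups(stream: List[Uop], width: int = 3) -> List[List[Uop]]:
    """Groups taken from a trace stored in execution order: always full width."""
    return [stream[i:i + width] for i in range(0, len(stream), width)]


for name, groups in (("conventional", conventional_groups(STREAM)),
                     ("trace cache", trace_groups(STREAM))):
    numbers = [[u[0] for u in g] for g in groups]
    print(f"{name}: {len(groups)} fetches, "
          f"{len(STREAM) / len(groups):.1f} uops per fetch, groups={numbers}")
```

In this toy model the conventional front end needs 5 fetch groups for the 12 uops and the trace cache needs 4. The model does not attempt to reproduce the slide's 1.5 and 6 uops/ns figures, which also depend on clock frequency.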

Inside the Execution Trace Cache

[Figure: the Instruction Pointer (0x0900) indexes a cache organized as sets (Set 0 to Set 4 shown) and ways (Way 0 to Way 3). A trace occupies several lines labeled head, body 1, body 2, and tail.]
   head:     cmp, br T1, T1:sub, br T2, mov, sub
   body 1:   br T3, T3:add, sub, mul, cmp, br T4
   body 2:   T4:add, sub, mov, add, add, mov
   tail:     add, sub, mov, add, add, mov

Self Modifying Code
• Programs that modify the instruction stream that is being executed
• Very common in Java* code from JITs
• Requires hardware mechanisms to maintain consistency

*Other names and brands may be claimed as the property of others

Self Modifying Code
• The hardware needs to handle two basic cases:
   – Stores that write to instructions in the Trace Cache
   – Instruction fetches that hit pending stores
      – Speculative
      – Committed

Case 1: Stores to cached instructions

[Figure: labels shown: Execution Core, Data TLB, Store’s Physical Address, Instruction TLB (128 entries) with “in use” bits, Trace Cache.]
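One way to picture Case 1 is a check of every store address against a record of which pages currently have instructions in the Trace Cache. The sketch below is a loose software analogy under stated assumptions, not a description of the actual hardware: the 4 KB page granularity, the class `TraceCacheModel`, and its methods are all hypothetical, and the slides only state that “in use” bits exist alongside the Instruction TLB.

```python
from typing import Dict, Set

PAGE_SHIFT = 12  # assumed 4 KB pages; the slides do not state the granularity


class TraceCacheModel:
    """Toy model: tracks which physical pages have traces built from them."""

    def __init__(self) -> None:
        # page number -> ids of traces built from code on that page
        self.traces_by_page: Dict[int, Set[str]] = {}

    def build_trace(self, trace_id: str, phys_addr: int) -> None:
        """Record that a trace was built from code at phys_addr."""
        page = phys_addr >> PAGE_SHIFT
        self.traces_by_page.setdefault(page, set()).add(trace_id)

    def in_use(self, phys_addr: int) -> bool:
        """True if the page holding phys_addr has its 'in use' bit set."""
        return (phys_addr >> PAGE_SHIFT) in self.traces_by_page

    def on_store(self, phys_addr: int) -> Set[str]:
        """Check a store's physical address; return the traces invalidated."""
        return self.traces_by_page.pop(phys_addr >> PAGE_SHIFT, set())


tc = TraceCacheModel()
tc.build_trace("trace_at_0x0900", 0x0900)
print(tc.on_store(0x0904))   # store hits a page with cached instructions
print(tc.on_store(0x5000))   # store to an unrelated page: nothing to do
```

A store to a page without cached instructions returns an empty set, which in this analogy is the common case that costs nothing.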

Case 2: Fetches to pending stores

[Figure: labels shown: Instruction Pointer, Instruction TLB (128 entries), Execution Core, Store Buffer, Write Combining Buffer, Speculative, Committed, “Please Re-Fetch”, “Please Flush Pipeline”.]

Execution Trace Cache
• Provides higher bandwidth for higher frequency core
• Reduces fetch latency
• Requires fundamentally new algorithms

Performance
• High bandwidth front end
• Low latency core
• Lower memory latency

Low Latency Core

Data Speculation
• Use data before we are sure it is valid
   – Lowers effective LD latency
   – Fast ALUs in Pentium 4 want fast LDs
   – Ratio of LD latency to ADD latency is important if 1 in 5 uops is a LD
• As pipelines get deeper, data speculation gets more important
   – Number of cycles saved with data speculation increases as pipeline depth increases

L1 Data Cache

[Figure: the processor block diagram from the Block Diagram slide, repeated with the L1 Data Cache (8Kbyte, 4-way) highlighted.]

L1 Cache is >3x Faster
• P6:
   – 3 clocks @ 1GHz = 3ns
• P4P:
   – 2 clocks @ 2GHz = 1ns

Lower Latency is Higher Performance
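The two latency figures follow from dividing the clock count by the clock frequency. A minimal worked example, using only the numbers on the slide (the helper name `latency_ns` is illustrative):

```python
def latency_ns(clocks: int, freq_ghz: float) -> float:
    """Latency in nanoseconds for a given clock count at a given frequency."""
    return clocks / freq_ghz


p6_ns = latency_ns(clocks=3, freq_ghz=1.0)    # P6: 3 clocks @ 1GHz
p4p_ns = latency_ns(clocks=2, freq_ghz=2.0)   # P4P: 2 clocks @ 2GHz

print(f"P6  L1 latency: {p6_ns:.0f} ns")
print(f"P4P L1 latency: {p4p_ns:.0f} ns")
print(f"ratio: {p6_ns / p4p_ns:.0f}x")
```

With these round numbers the ratio comes out at 3x; a P4P clock above 2GHz would push it past 3x.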

L1 Data Cache Pipeline Stages

[Figure: L1 data cache access pipeline. Labels shown: 2x Clock, VA 15:0, VA 31:16, VA 31:0, tag, data, TLB, TAG, Fast SB, Slow SB, replay.]

L1 Data Cache

[Figure: L1 data cache access path. Labels shown: 2x Clock, VA 15:0, VA 31:16, VA 31:0, address bit fields 10:6, 15:11, and 19:16, Way Predictor (Tag array), Way select, Data array, TLB, TAG, Fast SB, Slow SB, Hit (Replay).]
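The bit-field labels in the figure can be related to the cache size with a little arithmetic. The sketch below assumes that address bits 10:6 are the set index, which the figure suggests but does not state outright; the derived line size is an inference from that assumption, not a number taken from the slides. All names are illustrative.

```python
CACHE_BYTES = 8 * 1024        # "8Kbyte" from the block diagram
WAYS = 4                      # "4-way" from the block diagram
INDEX_HI, INDEX_LO = 10, 6    # assumed set-index field, from the "10:6" label

INDEX_BITS = INDEX_HI - INDEX_LO + 1
NUM_SETS = 1 << INDEX_BITS
LINE_BYTES = CACHE_BYTES // (WAYS * NUM_SETS)


def set_index(virtual_addr: int) -> int:
    """Extract the assumed set-index field (bits 10:6) from an address."""
    return (virtual_addr >> INDEX_LO) & (NUM_SETS - 1)


print(f"sets: {NUM_SETS}, implied line size: {LINE_BYTES} bytes")
print(f"set index of 0x0900: {set_index(0x0900)}")
```

The result is self-consistent: 32 sets of 4 ways imply 64-byte lines, and a 64-byte line needs exactly the 6 offset bits (5:0) that sit below bit 6.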

A Digression on Stores
• Two components to a store:
   – STA: address computation
   – STD: data piece
• Hybrid uOP
   – Single uOP in the front, back ends
   – Two uOPs in the middle

Memory Disambiguation

Ld EAX 20% – The other traces are unaffected – Average performance improvement < 0.1%


Performance
• High bandwidth front end
• Low latency core
• Lower memory latency

Lower Memory Latency

Reducing Latency
• As frequency increases, it is important to improve the performance of the memory subsystem
• Data Prefetch Logic
   – Watches processor memory traffic
   – Looks for patterns
   – Initiates accesses

Data Prefetch Logic

[Figure: the processor block diagram from the Block Diagram slide, repeated with a Data Prefetch Logic block added next to the Bus Interface Unit.]

Data Prefetch Logic

[Figure: labels shown: Instruction Buffers, Instruction Fetch, Data Prefetch Logic, L1 Data Cache, L2 Advanced Transfer Cache, Bus Queue, and path widths 64 and 256.]

Prefetch logic first checks L2 cache and then fetches lines from memory that miss L2 cache.

Data Prefetch Logic
• Watches for streaming memory access patterns
   – Can track 8 independent streams
   – Loads, Stores or Instruction
   – Forward or Backward
• Analysis on 32 byte cache line granularity
• Looks for “mostly” complete streams:
   – Access to cache lines 1, 2, 3, 4, 5, 6 will prefetch
   – Access to cache lines 1, 2, _, 4, 5, 6 will prefetch
   – Access to cache lines 1, _, 3, _, _, 6, _, _, 9 will not prefetch
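The “mostly complete” rule can be illustrated with a small coverage test. This is only a sketch of the idea: the slides do not give the actual detection criteria, so the thresholds `MIN_ACCESSES` and `MIN_COVERAGE` and both function names are hypothetical, chosen so that the three examples above come out as the slide says.

```python
from typing import Optional, Sequence

MIN_ACCESSES = 4      # hypothetical: distinct lines needed before deciding
MIN_COVERAGE = 0.75   # hypothetical: fraction of the span that must be touched


def looks_like_stream(lines: Sequence[int]) -> bool:
    """True if the touched cache lines cover most of the range they span."""
    touched = set(lines)
    if len(touched) < MIN_ACCESSES:
        return False
    span = max(touched) - min(touched) + 1
    return len(touched) / span >= MIN_COVERAGE


def next_prefetch_line(lines: Sequence[int]) -> Optional[int]:
    """Next line to fetch, forward or backward, or None if not a stream."""
    if not looks_like_stream(lines):
        return None
    step = 1 if lines[-1] >= lines[0] else -1
    return lines[-1] + step


for pattern in ([1, 2, 3, 4, 5, 6], [1, 2, 4, 5, 6], [1, 3, 6, 9], [6, 5, 4, 3, 2, 1]):
    print(pattern, "->", next_prefetch_line(pattern))
```

The coverage test ignores access order within the window, which keeps the sketch short; real hardware would also have to bound how many accesses it remembers per stream.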

Performance
• High bandwidth front end
• Low latency core
• Lower memory latency
