IBM's Micro Processor Design and Methodology

Author: Leslie Flowers

Ron Kalla IBM Systems and Technology Group

© 2003 IBM Corporation


Outline
• POWER5
• POWER6
• Design Process
• Power Aware Design


POWER Server Roadmap

• 2001: POWER4 (180 nm). Two 1.3 GHz cores, shared L2, distributed switch. Chip multiprocessing with distributed switch and shared L2; dynamic LPARs (16).
• 2002-3: POWER4+ (130 nm). Two 1.9 GHz cores, shared L2, distributed switch. Reduced size, lower power, larger L2, more LPARs (32).
• 2004*: POWER5 (130 nm). Two 1.9 GHz cores, shared L2, distributed switch. Simultaneous multi-threading, sub-processor partitioning, dynamic firmware updates, enhanced scalability and parallelism, high throughput performance, enhanced memory subsystem.
• 2005*: POWER5+ (90 nm). Two 2.2 GHz cores, shared L2, distributed switch.
• 2007: POWER6 (65 nm). HF core at 4-5 GHz, L2 caches, advanced system features.

Autonomic Computing Enhancements

* Planned to be offered by IBM. All statements about IBM’s future direction and intent are subject to change or withdrawal without notice and represent goals and objectives only.


POWER5
• Technology: 90 nm lithography, Cu, SOI
• 245 mm², 300M transistors
• Dual processor core
• 8-way superscalar
• Simultaneous multithreaded (SMT) core
  – Up to 2 virtual processors per real processor


Multi-threading Evolution

[Figure: occupancy of the execution units (FX0, FX1, FP0, FP1, LS0, LS1, BRX, CRL) over time under four schemes: single thread, coarse grain threading, fine grain threading, and simultaneous multi-threading. Legend: thread 0 executing, thread 1 executing, no thread executing.]
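The four schemes on this slide differ only in which thread may fill an issue slot in a given cycle. The sketch below is a toy model, not a description of POWER5 hardware: the per-cycle demand lists, the `width` parameter, and the zero-cost thread switch in the coarse-grain case are all assumptions made for illustration.

```python
def simulate(policy, demand0, demand1, width=4):
    """Return per-cycle (slots used by thread 0, slots used by thread 1).

    demand0/demand1 give how many instructions each thread could issue
    in each cycle (0 means the thread is stalled). Toy model only.
    """
    schedule = []
    current = 0  # thread that owns the pipeline (coarse-grain only)
    for cycle, (d0, d1) in enumerate(zip(demand0, demand1)):
        demand = (d0, d1)
        if policy == "single":
            used = (min(d0, width), 0)
        elif policy == "coarse":
            # Switch threads only when the running thread has nothing to issue.
            if demand[current] == 0:
                current = 1 - current
            n = min(demand[current], width)
            used = (n, 0) if current == 0 else (0, n)
        elif policy == "fine":
            # Strict alternation: even cycles thread 0, odd cycles thread 1.
            owner = cycle % 2
            n = min(demand[owner], width)
            used = (n, 0) if owner == 0 else (0, n)
        elif policy == "smt":
            # Both threads may issue in the same cycle and share the slots.
            n0 = min(d0, width)
            n1 = min(d1, width - n0)
            used = (n0, n1)
        else:
            raise ValueError(f"unknown policy: {policy}")
        schedule.append(used)
    return schedule


def slots_used(schedule):
    """Total number of issue slots filled across all cycles."""
    return sum(a + b for a, b in schedule)


if __name__ == "__main__":
    d0 = [2, 0, 0, 3, 1, 0]
    d1 = [1, 2, 2, 0, 3, 1]
    for policy in ("single", "coarse", "fine", "smt"):
        print(policy, slots_used(simulate(policy, d0, d1)))
```

In this model only the SMT policy can fill slots from both threads in the same cycle, which is the property the figure's last panel illustrates.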


Thread Priority
• Instances when unbalanced execution desirable
  – No work for opposite thread
  – Thread waiting on lock
  – Software determined non uniform balance
  – Power management
  – …
• Solution: Control instruction decode rate
  – Software/hardware controls 8 priority levels for each thread

[Figure: Thread 0 IPC and Thread 1 IPC plotted against the priority difference (axis label: Thread 1 Priority - Thread 0 Priority; ticks 0,7, -5, -3, -1, 0, 1, 3, 5, 7,0 and 1,1). The 0,7 and 7,0 end points are marked Single Thread Mode and the 1,1 point is marked Power Save Mode.]
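The slide says only that decode rate is controlled through 8 priority levels per thread. As a minimal sketch, the function below assumes the ratio reported in published POWER5 descriptions, R = 2^(|X - Y| + 1), where the lower-priority thread receives 1 of every R decode cycles. That formula is not on the slide, and the function name, mode labels, and return shape are hypothetical.

```python
def decode_shares(prio0, prio1):
    """Return (mode, share0, share1) of decode cycles for two threads.

    Priorities range over 0..7. The R = 2**(|X - Y| + 1) rule is an
    assumption taken from published POWER5 descriptions, not the slide.
    """
    for p in (prio0, prio1):
        if not 0 <= p <= 7:
            raise ValueError("priority must be in 0..7")
    if prio0 == 0 and prio1 == 0:
        return ("idle", 0.0, 0.0)
    if prio0 == 0 or prio1 == 0:
        # Slide: the 0,7 and 7,0 end points are single-thread mode.
        return ("single-thread", float(prio0 > 0), float(prio1 > 0))
    if prio0 == 1 and prio1 == 1:
        # Slide: 1,1 is power save mode; no decode rate is stated there.
        return ("power-save", None, None)
    if prio0 == prio1:
        return ("smt", 0.5, 0.5)
    r = 2 ** (abs(prio0 - prio1) + 1)
    low, high = 1 / r, (r - 1) / r
    if prio0 > prio1:
        return ("smt", high, low)
    return ("smt", low, high)


if __name__ == "__main__":
    for pair in [(4, 4), (6, 4), (2, 7), (0, 7), (1, 1)]:
        print(pair, decode_shares(*pair))
```

Under this assumed rule, a priority difference of 2 gives the favored thread 7 of every 8 decode cycles.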


Terminology
• PowerPC Addresses
  – Virtualization drives more levels
  – Effective > (SLB) > Virtual > (Page Table) > Real > (LPAR) > Physical
• Instruction Execution
  – I-fetch
  – Decode
  – Dispatch
  – Issue
  – Finish
  – Complete
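The address chain in the terminology list (effective, virtual, real, physical) can be read as three successive table lookups. The sketch below illustrates that chain only. The table contents, the 256 MiB segment and 4 KiB page sizes, and the dictionary-based LPAR mapping are assumptions chosen for the example, not details taken from the slides.

```python
SEGMENT_SHIFT = 28  # assumed 256 MiB segments
PAGE_SHIFT = 12     # assumed 4 KiB pages

# Hypothetical translation tables for illustration.
slb = {0x0: 0x5}              # effective segment ID -> virtual segment ID
page_table = {0x50001: 0x77}  # virtual page number -> real page number
lpar_map = {0x77: 0x900}      # real page number -> physical page number


def translate(ea):
    """Walk Effective > (SLB) > Virtual > (Page Table) > Real > (LPAR) > Physical."""
    esid = ea >> SEGMENT_SHIFT
    seg_offset = ea & ((1 << SEGMENT_SHIFT) - 1)
    va = (slb[esid] << SEGMENT_SHIFT) | seg_offset   # SLB step

    vpn = va >> PAGE_SHIFT
    page_offset = va & ((1 << PAGE_SHIFT) - 1)
    rpn = page_table[vpn]                            # page table step
    real = (rpn << PAGE_SHIFT) | page_offset

    ppn = lpar_map[rpn]                              # LPAR step
    physical = (ppn << PAGE_SHIFT) | page_offset
    return {"effective": ea, "virtual": va, "real": real, "physical": physical}


if __name__ == "__main__":
    for name, addr in translate(0x1234).items():
        print(f"{name:9s} 0x{addr:x}")
```

A missing entry in any table raises `KeyError` here; a real processor would instead take a segment or page fault, which this sketch does not model.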


Multithreaded Instruction Flow in Processor Pipeline

[Figure: pipeline diagram. Instruction fetch (IF, IC, BP) feeds group formation and instruction decode (D0, D1, D2, D3, Xfer, GD). Out-of-order processing then runs through the branch (BR), load/store (LD/ST), fixed-point (FX) and floating-point (FP) pipelines, with stages MP, ISS, RF, EX or EA, DC, Fmt, F6, WB and Xfer, ending in completion (CP). Branch redirects and interrupts & flushes feed back to the front of the pipeline.]

[Figure: instruction flow through the core. Front end: program counter, branch prediction (branch history tables, return stack, target cache), an Alternate selector, instruction cache, instruction translation, and instruction buffers 0 and 1. Middle: thread priority, dynamic instruction selection, group formation / instruction decode / dispatch, shared register mappers, shared issue queues. Back end: read shared register files, shared execution units (LSU0, LSU1, FXU0, FXU1, FPU0, FPU1, BXU, CRL), write shared register files, group completion, store queue, data translation, data cache, L2 cache. Legend: shared by two threads; resource used by thread 0; resource used by thread 1.]


Resource Sizes
• Analysis done to optimize every micro-architectural resource size
  – GPR/FPR rename pool size
  – I-fetch buffers
  – Reservation Station
  – SLB/TLB/ERAT
  – I-cache/D-cache
• Many Workloads examined
• Associativity also examined

[Figure: IPC versus number of GPR renames (50 to 130) for ST and SMT. Results based on simulation of an online transaction processing application. Vertical axis does not originate at 0.]


Single Thread Operation
• Advantageous for execution unit limited applications
  – Floating or fixed point intensive workloads
• Execution unit limited applications provide minimal performance leverage for SMT
  – Extra resources necessary for SMT provide higher performance benefit when dedicated to single thread
• Determined dynamically on a per processor basis

[Figure: IPC on Matrix Multiply for POWER4+, POWER5 ST and POWER5 SMT.]


16-way Building Block

[Figure: one book holding two MCMs. Each MCM carries four POWER5 chips, and each POWER5 chip has its own L3, memory, and I/O attachment, for eight chips in total.]


POWER5 Multi-chip Module
• 95 mm × 95 mm
• Four POWER5 chips
• Four cache chips
• 4,491 signal I/Os
• 89 layers of metal


POWER6


POWER6 Physical Overview
• 5+ GHz operation
• >790M transistors
• 341 mm² die
• 65 nm SOI process with 10 levels of Cu interconnect and low-k dielectric on 1st 8 levels
• 2 superscalar, SMT cores
• 8 MB Level-2 cache
• Support for 32 MB L3
• 2 memory controllers
• Two-tier SMP Fabric

[Figure: die floorplan showing Core 0, Core 1, four 2 MB L2 regions, the L2 directory, the SMP fabric, and two memory controllers.]


POWER6 Core
• POWER6 offers ~2X the frequency of POWER5 (4 to 5+ GHz).
• POWER6 maintains POWER5’s instruction pipeline depth
  – Achieves same power envelope
  – Scales performance with frequency

[Figure: pipeline comparison across the stages Instruction Fetch, Instruction Buffer/Decode, Instruction Dispatch/Issue, and Data Fetch / Execute, annotated ~6ns / instr and ~3ns / instr, with FXU dependent execution and load dependent execution marked.]

• POWER6 extends functionality of POWER5 Core
  – Enhanced 2-way SMT with 7 instruction dispatch
  – 64K, 4-way I Cache; 64K, 8-way D Cache
  – Out of order floating point
  – Speculative load look-ahead and enhanced data prefetch
  – 2 FXU, 2 FPU, 2 LSU, 1 Branch Unit
  – VMX Unit
  – Decimal Floating Point Unit

Bullet-Proof Computing
• Error Detection
  – 100% ECC protection for large caches, interfaces, and architected state
  – >99% of small SRAMs and register files parity protected
  – Dataflow & control protected by parity and logical consistency checkers
  – Experiments indicate ~3400 random soft errors needed to cause 1 undetected data corruption
• Error Recovery
  – Processor architected state checkpointed every cycle
  – ECC & non-ECC protected circuitry checked every cycle
  – Error found: processor restarts from last saved checkpoint
  – No error found after restart: soft error case
  – Error found again after restart: hard error case, processor workload moved to another CPU
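The recovery flow on this slide (checkpoint, check, restart, then classify the error as soft or hard) can be sketched as a small retry loop. This is a software analogy under stated assumptions, not the POWER6 mechanism: the function name, the `attempt` argument passed to each step, the single-retry limit, and the `"migrate"` status are all hypothetical.

```python
import copy


def run_with_recovery(state, steps, has_error, max_retries=1):
    """Run steps with checkpoint/restart; return (state, log, status).

    Each step is called as step(state_copy, attempt) and returns the new
    state. has_error(state) plays the role of the hardware checkers.
    """
    checkpoint = copy.deepcopy(state)  # last known-good architected state
    log = []
    for index, step in enumerate(steps):
        retries = 0
        while True:
            candidate = step(copy.deepcopy(checkpoint), retries)
            if not has_error(candidate):
                checkpoint = candidate  # commit a new checkpoint
                if retries:
                    log.append((index, "soft error recovered"))
                break
            if retries >= max_retries:
                # Error persists after restart: treat as a hard error and
                # hand the last good checkpoint to another CPU.
                log.append((index, "hard error: migrate workload"))
                return checkpoint, log, "migrate"
            retries += 1  # restart from the last saved checkpoint
    return checkpoint, log, "done"


def inc(s, attempt):
    s["n"] += 1
    return s


def flaky(s, attempt):
    s["n"] += 1
    if attempt == 0:  # transient fault on the first try only
        s["corrupt"] = True
    return s


def broken(s, attempt):
    s["corrupt"] = True  # fault that never goes away
    return s


def detect(s):
    return s.get("corrupt", False)


if __name__ == "__main__":
    print(run_with_recovery({"n": 0}, [inc, flaky, inc], detect))
    print(run_with_recovery({"n": 0}, [inc, broken, inc], detect))
```

The point of the design is that a faulty result is never committed: the checkpoint only advances after a clean check, so either outcome leaves a consistent state to resume from.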

POWER6 Enables Energy Efficiency
• Supports a variety of energy policies
  – Power capping
  – Energy reduction
  – Acoustic optimization
  – Performance optimization
• Extensive hardware controls
  – Wide voltage / frequency range
  – Architected idle state (Nap) for increased clock gating
  – Memory request throttling
  – Power down of memory ranks
  – Programmable fetch / dispatch throttling

[Figure: Benefits of Voltage Frequency Slewing. Relative performance and relative power (scale 0 to 1) as voltage and frequency are lowered.]

[Figure: Impact of Nap Mode on Power. Current (A), scale 0 to 110. Bar labels: O/S stress workload, O/S idle, w/ Nap.]
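The slide gives no formula for the benefit of voltage/frequency slewing. As a rough sketch, the code below uses the standard CMOS dynamic-power approximation P ∝ C·V²·f and further assumes that voltage scales linearly with frequency and that performance scales with frequency. The frequency steps and function names are hypothetical, and leakage power is ignored.

```python
def relative_power(freq_scale, volt_scale):
    """Dynamic power relative to nominal, assuming P ~ C * V^2 * f."""
    return volt_scale ** 2 * freq_scale


def pick_operating_point(power_cap, freq_steps=(1.0, 0.9, 0.8, 0.7, 0.6)):
    """Power capping: highest frequency step whose power fits under the cap.

    Assumes voltage is lowered in proportion to frequency, so relative
    power is roughly freq**3. Falls back to the lowest step otherwise.
    """
    for f in sorted(freq_steps, reverse=True):
        if relative_power(f, f) <= power_cap:
            return f
    return min(freq_steps)


if __name__ == "__main__":
    for f in (1.0, 0.9, 0.8, 0.7, 0.6):
        # Relative performance is taken to equal the frequency scale.
        print(f"freq {f:.1f}  perf {f:.2f}  power {relative_power(f, f):.3f}")
    print("cap 0.6 ->", pick_operating_point(0.6))
```

Under these assumptions a 20% drop in frequency and voltage cuts dynamic power by roughly half, which is why power falls much faster than performance when both are slewed together.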