IBM's POWER5 Microprocessor Design and Methodology

Ron Kalla, IBM Systems Group

© 2003 IBM Corporation

Outline
§ Motivation
§ Background
§ Threading Fundamentals
§ Enhanced SMT Implementation in POWER5
§ Memory Subsystem Enhancements
§ Additional SMT Considerations
§ Summary

UT Computer Architecture Seminar, November 2003

Microprocessor Design Optimization Focus Areas
§ Memory latency
    - Increased processor speeds make memory appear further away
    - Longer stalls possible
§ Branch processing
    - Mispredict more costly as pipeline depth increases, resulting in stalls and wasted power
    - Predication drives increased power and larger chip area
§ Execution Unit Utilization
    - Currently 20-25% execution unit utilization common
§ Simultaneous multi-threading (SMT) and POWER architecture address these areas

POWER4 --- Shipped in Systems December 2001
§ Technology: 180nm lithography, Cu, SOI
    - POWER4+ shipping in 130nm today
    - 267mm², 185M transistors
§ Dual processor core
§ 8-way superscalar
    - Out of Order execution
    - 2 Load / Store units
    - 2 Fixed Point units
    - 2 Floating Point units
    - Logical operations on Condition Register
    - Branch Execution unit
§ > 200 instructions in flight
§ Hardware instruction and data prefetch

[Figure: chip floorplan showing, for each of the two cores, the IFU, IDU, ISU, FXU, FPU, LSU and BXU, together with the L2 and the L3 Directory/Control.]

POWER5 --- The Next Step
§ Technology: 130nm lithography, Cu, SOI
§ 389mm², 276M Transistors
§ Dual processor core
§ 8-way superscalar
§ Simultaneous multithreaded (SMT) core
    - Up to 2 virtual processors per real processor
    - Natural extension to POWER4 design

Multi-threading Evolution

[Figure: four panels (Single Thread, Coarse Grain Threading, Fine Grain Threading, Simultaneous Multi-Threading), each showing how the execution units FX0, FX1, FP0, FP1, LS0, LS1, BRX and CRL are occupied over time. Legend: Thread 0 Executing, Thread 1 Executing, No Thread Executing.]
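As a rough illustration of the four threading models named on this slide, the Python sketch below counts how many issue slots get filled under each policy. Everything in it is an assumption made for illustration: the 8-slot width (one slot per unit label, FX0 through CRL), the per-cycle demand lists `T0` and `T1`, and the simplified switching rules. It is a toy slot-filling model, not a description of the POWER5 pipeline.

```python
WIDTH = 8  # assumed: one issue slot per unit label (FX0 ... CRL)

def single_thread(d0, d1):
    """Only thread 0 runs; d1 is ignored (kept for a uniform interface)."""
    return sum(min(a, WIDTH) for a in d0)

def coarse_grain(d0, d1):
    """Run one thread until it stalls (demand 0), then switch.
    The cycle in which the stall is seen is lost."""
    demand, current, used = (d0, d1), 0, 0
    for cycle in range(len(d0)):
        ready = demand[current][cycle]
        if ready == 0:
            current ^= 1          # switch to the other thread
        else:
            used += min(ready, WIDTH)
    return used

def fine_grain(d0, d1):
    """Alternate threads every cycle, whether or not they are ready."""
    demand = (d0, d1)
    return sum(min(demand[c % 2][c], WIDTH) for c in range(len(d0)))

def smt(d0, d1):
    """Both threads may fill slots in the same cycle."""
    return sum(min(a + b, WIDTH) for a, b in zip(d0, d1))

def utilization(policy, d0, d1):
    """Fraction of all issue slots that were filled."""
    return policy(d0, d1) / (WIDTH * len(d0))

# Hypothetical per-cycle counts of instructions each thread could issue.
T0 = [2, 2, 0, 0, 0, 0, 3, 1]
T1 = [1, 2, 2, 1, 0, 0, 2, 2]

if __name__ == "__main__":
    for policy in (single_thread, coarse_grain, fine_grain, smt):
        print(policy.__name__, utilization(policy, T0, T1))
```

With these made-up demand lists, each step from single thread to SMT fills more slots, because stalled cycles of one thread can be covered by the other.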

Changes Going From ST to SMT Core
§ SMT easily added to Superscalar Micro-architecture
    - Second Program Counter (PC) added to share I-fetch bandwidth
    - GPR/FPR rename mapper expanded to map second set of registers (High order address bit indicates thread)
    - Completion logic replicated to track two threads
    - Thread bit added to most address/tag buses

[Figure: pipeline block diagram. The Fetch Unit (PC, I-Cache) and Decode feed Register Rename, which feeds the BR/CRU, FP and Integer Issue Qs; these use the CR/LR/CTR, FPRs and GPRs and issue to the BR and CRL Units, the FPUs, and the FXUs and LSUs, with the LSUs accessing the Data Cache.]
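The rename-mapper bullet on this slide says the high-order address bit indicates the thread. A minimal Python sketch of that indexing follows, assuming 32 architected GPRs per thread (so 5 register bits plus 1 thread bit). The names `map_index` and `split_index` are hypothetical and only illustrate the addressing idea; they are not taken from the POWER5 design.

```python
ARCH_REGS = 32                           # assumed architected GPRs per thread
REG_BITS = ARCH_REGS.bit_length() - 1    # 5 bits select the register

def map_index(thread, reg):
    """Mapper entry for (thread, reg): the thread id is the high-order bit."""
    if thread not in (0, 1):
        raise ValueError("thread must be 0 or 1")
    if not 0 <= reg < ARCH_REGS:
        raise ValueError("register out of range")
    return (thread << REG_BITS) | reg

def split_index(index):
    """Inverse of map_index: recover (thread, reg) from a mapper index."""
    return index >> REG_BITS, index & (ARCH_REGS - 1)

if __name__ == "__main__":
    print(map_index(0, 3), map_index(1, 3))
```

Because the two threads differ only in the top bit, the same lookup logic serves both; the mapper simply has twice as many entries.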

Resource Sizes
§ Analysis done to optimize every micro-architectural resource size
    - GPR/FPR rename pool size
    - I-fetch buffers
    - Reservation Station
    - SLB/TLB/ERAT
    - I-cache/D-cache
§ Many Workloads examined
§ Associativity also examined

[Figure: IPC versus Number of GPR Renames (50 to 130), with one curve for SMT and one for ST. Results based on simulation of an online transaction processing application. Vertical axis does not originate at 0.]

POWER5 Resources Size Enhancements
§ Enhanced caches and translation resources
    - I-cache: 64 KB, 2-way set associative, LRU
    - D-cache: 32 KB, 4-way set associative, LRU
    - First level Data Translation: 128 entries, fully associative, LRU
    - L2 Cache: 1.92 MB, 10-way set associative, LRU
§ Larger resource pools
    - Rename registers: GPRs, FPRs increased to 120 each
    - L2 cache coherency engines: increased by 100%
§ Enhanced data stream prefetching
§ Memory controller moved on chip

Resource Sharing
§ Threads share many resources
    - Global Completion Table, BHT, TLB, . . .
§ Higher performance realized when resources balanced across threads
    - Tendency to drift toward extremes accompanied by reduced performance
§ Solution: Dynamically adjust resource utilization

[Figure: Global Completion Table Occupancy. Two plots of Relative Occurrence against Thread 0 and Thread 1 occupancy (0 to 20 on each axis): one without dynamic resource utilization adjustment and one with dynamic resource utilization adjustment. Results based on simulation of an online transaction processing application.]
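The slide states only that resource utilization is adjusted dynamically. One way such balancing could work is a per-thread occupancy limit on a shared table, sketched below in Python. The table size of 20 is assumed from the 0 to 20 axis range in the figure, and the `THRESHOLD` value and the `may_decode` function are hypothetical; the actual POWER5 mechanism is not described here.

```python
GCT_ENTRIES = 20   # assumed from the 0-20 axis range in the figure
THRESHOLD = 12     # hypothetical per-thread occupancy limit

def may_decode(occupancy, thread):
    """Return True if `thread` may allocate another shared-table entry.

    `occupancy` is a pair (entries held by thread 0, entries held by thread 1).
    A thread holding THRESHOLD or more entries is throttled so that the
    other thread is not starved.
    """
    if sum(occupancy) >= GCT_ENTRIES:
        return False                       # table full: nobody allocates
    return occupancy[thread] < THRESHOLD

if __name__ == "__main__":
    print(may_decode((12, 3), 0), may_decode((12, 3), 1))
```

The design choice illustrated is that throttling acts on the thread that is drifting toward an extreme, which is the behavior the slide associates with reduced performance.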

Thread Priority
§ Instances when unbalanced execution desirable
    - No work for opposite thread
    - Thread waiting on lock
    - Software determined non uniform balance
    - Power management
    - …
§ Solution: Control instruction decode rate
    - Software/hardware controls 8 priority levels for each thread

[Figure: Thread 0 IPC and Thread 1 IPC plotted against Thread 1 Priority - Thread 0 Priority (axis labels: 0,7; -5; -3; -1; 0; 1; 3; 5; 7,0; 1,1), with Single Thread Mode and Power Save Mode marked.]
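The slide says that the decode rate is controlled through 8 priority levels per thread, without giving the rule. As a sketch of how a priority difference could translate into decode-slot shares, the Python code below assumes a ratio rule R = 2^(|P0 - P1| + 1), in which the lower-priority thread gets 1 decode cycle out of every R. The rule and the name `decode_shares` are assumptions for illustration, and the special combinations marked in the figure (Single Thread Mode, Power Save Mode) are not modelled.

```python
from fractions import Fraction

def decode_shares(p0, p1):
    """Return (share of thread 0, share of thread 1) of decode cycles.

    Assumed rule: R = 2 ** (|p0 - p1| + 1); the lower-priority thread
    gets 1/R of the cycles and the other thread gets the rest.
    """
    for p in (p0, p1):
        if not 0 <= p <= 7:
            raise ValueError("priority must be in 0..7")
    r = 2 ** (abs(p0 - p1) + 1)
    low, high = Fraction(1, r), Fraction(r - 1, r)
    return (high, low) if p0 > p1 else (low, high)

if __name__ == "__main__":
    for pair in ((4, 4), (6, 2), (2, 3)):
        print(pair, decode_shares(*pair))
```

Under this assumed rule, equal priorities split decode cycles evenly, and each additional level of difference halves the lower-priority thread's share.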

Dynamic Thread Switching
§ Used if no task ready for second thread to run
§ Allocates all machine resources to one thread
§ Initiated by software
§ Dormant thread wakes up on:
    - External interrupt
    - Decrementer interrupt
    - Special instruction from active thread

[Figure: Thread States. State diagram with the states Active, Dormant and Null; the transitions are labelled "software" or "hardware or software".]
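The bullets above describe a small state machine: software puts a thread to sleep, and a dormant thread wakes on one of three events. A minimal Python sketch follows. The event names (`software_sleep`, `external_interrupt`, and so on) are hypothetical labels chosen for this example, and the Null state from the figure is listed but its transitions are left unmodelled because the slide text does not describe them.

```python
from enum import Enum

class ThreadState(Enum):
    ACTIVE = "active"
    DORMANT = "dormant"
    NULL = "null"      # present in the figure; transitions not modelled here

# Wake-up causes listed on the slide (names are illustrative).
WAKE_EVENTS = {
    "external_interrupt",
    "decrementer_interrupt",
    "special_instruction",   # issued by the active thread
}

def next_state(state, event):
    """Return the state after `event`; unknown events leave it unchanged."""
    if state is ThreadState.ACTIVE and event == "software_sleep":
        return ThreadState.DORMANT
    if state is ThreadState.DORMANT and event in WAKE_EVENTS:
        return ThreadState.ACTIVE
    return state

if __name__ == "__main__":
    s = next_state(ThreadState.ACTIVE, "software_sleep")
    print(s, next_state(s, "decrementer_interrupt"))
```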

Single Thread Operation
§ Advantageous for execution unit limited applications
    - Floating or fixed point intensive workloads
§ Execution unit limited applications provide minimal performance leverage for SMT
    - Extra resources necessary for SMT provide higher performance benefit when dedicated to single thread
§ Determined dynamically on a per processor basis

[Figure: IPC for Matrix Multiply on POWER4+, POWER5 ST and POWER5 SMT.]

Modifications to POWER4 System Structure

[Figure: side-by-side block diagrams of POWER4 Systems and POWER5 Systems, each showing processors (P), L2, fabric controller (Fab Ctl), L3, memory controller (Mem Ctl) and Memory.]

16-way Building Block

[Figure: a Book containing two MCMs; the diagram shows eight POWER5 chips and eight L3 blocks, with Memory and I/O attached.]

POWER5 Multi-chip Module
§ 95mm × 95mm
§ Four POWER5 chips
§ Four cache chips
§ 4,491 signal I/Os
§ 89 layers of metal

64-way SMP Interconnection

Interconnection exploits enhanced distributed switch
§ All chip interconnections operate at half processor frequency and scale with processor frequency
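The system sizes quoted in this deck (16-way building block, 64-way SMP, 128 threads) follow from a few per-level counts. The Python sketch below multiplies them out. Three of the constants are quoted from the slides; `MCMS_PER_BOOK = 2` is an assumption read from the two MCM labels in the building-block figure, and the function names are illustrative.

```python
CORES_PER_CHIP = 2     # "Dual processor core"
CHIPS_PER_MCM = 4      # "Four POWER5 chips"
MCMS_PER_BOOK = 2      # assumed from the two MCM labels in the figure
THREADS_PER_CORE = 2   # "Up to 2 virtual processors per real processor"

def processors(books):
    """Number of processor cores in a system built from `books` books."""
    return books * MCMS_PER_BOOK * CHIPS_PER_MCM * CORES_PER_CHIP

def hardware_threads(books):
    """Number of SMT hardware threads in the same system."""
    return processors(books) * THREADS_PER_CORE

if __name__ == "__main__":
    for books in (1, 4):
        print(books, processors(books), hardware_threads(books))
```

Under these assumptions one book gives the 16-way building block, and four books give 64 processors and 128 threads.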

POWER4 and POWER5 Storage Hierarchy

                               POWER4                 POWER5
L2 Cache
  Capacity, line size          1.44 MB, 128 B line    1.92 MB, 128 B line
  Associativity, replacement   8-way, LRU             10-way, LRU
Off-chip L3 Cache
  Capacity, line size          32 MB, 512 B line      36 MB, 256 B line
  Associativity, replacement   8-way, LRU             12-way, LRU
Chip interconnect
  Type                         Distributed switch     Enhanced distributed switch
  Intra-MCM data buses         ½ processor speed      Processor speed
  Inter-MCM data buses         ½ processor speed      ½ processor speed
Memory                         512 GB maximum         1024 GB (1 TB) maximum

Other SMT Considerations
§ Power Management
    - SMT increases execution unit utilization
    - Dynamic power management does not impact performance
§ Debug tools / Lab bring-up
    - Instruction tracing
    - Hang detection
    - Forward progress monitor
§ Performance Monitoring
§ Serviceability

POWER Server Roadmap

Year     Processor   Technology   Chip
2001     POWER4      180 nm       Two 1.3 GHz cores, shared L2, distributed switch
2002-3   POWER4+     130 nm       Two 1.7 GHz cores, shared L2, distributed switch
2004*    POWER5      130 nm       Two > GHz cores, shared L2, distributed switch
2005*    POWER5+     90 nm        Two >> GHz cores, shared L2, distributed switch
2006*    POWER6      65 nm        Ultra high frequency cores, L2 caches, Advanced System Features

Features listed on the roadmap:
§ POWER4
    - Chip Multi Processing (Distributed Switch, Shared L2)
    - Dynamic LPARs (16)
§ POWER4+
    - Reduced size
    - Lower power
    - Larger L2
    - More LPARs (32)
§ POWER5
    - Simultaneous multi-threading
    - Sub-processor partitioning
    - Dynamic firmware updates
    - Enhanced scalability, parallelism
    - High throughput performance
    - Enhanced memory subsystem
§ Autonomic Computing Enhancements

*Planned to be offered by IBM. All statements about IBM’s future direction and intent are subject to change or withdrawal without notice and represent goals and objectives only.

Summary
§ POWER5 SMT implementation is more than SMT
    - Good ROI for silicon area: Performance gain > Area increase
    - Resource sizes optimized
    - Dynamic feedback enhances instruction throughput
    - Software controlled priority exploits machine architecture
    - Dynamic ST to/from SMT mode capability optimizes system resources
§ SMT impacts pervasive throughout chip
§ Storage subsystem scalable to 64 Processor / 128 Threads
§ Operating in laboratory
    - AIX, Linux and OS/400 booted and running