IBM's POWER5 Microprocessor Design and Methodology Ron Kalla IBM Systems Group
© 2003 IBM Corporation
Outline
§ Motivation
§ Background
§ Threading Fundamentals
§ Enhanced SMT Implementation in POWER5
§ Memory Subsystem Enhancements
§ Additional SMT Considerations
§ Summary
UT Computer Architecture Seminar, November 2003
Microprocessor Design Optimization Focus Areas
§ Memory latency
  - Increased processor speeds make memory appear further away
  - Longer stalls possible
§ Branch processing
  - Mispredicts more costly as pipeline depth increases, resulting in stalls and wasted power
  - Predication drives increased power and larger chip area
§ Execution unit utilization
  - 20-25% execution unit utilization currently common
§ Simultaneous multi-threading (SMT) and the POWER architecture address these areas
POWER4 --- Shipped in Systems December 2001
§ Technology: 180nm lithography, Cu, SOI
§ POWER4+: 267mm², 185M transistors, shipping in 130nm today
§ Dual processor core
§ 8-way superscalar
  - 2 Load/Store units
  - 2 Fixed Point units
  - 2 Floating Point units
  - Logical operations on Condition Register
  - Branch execution unit
§ Out of Order execution
§ > 200 instructions in flight
§ Hardware instruction and data prefetch
(Die photo with units labeled: FPU, FXU, ISU, IDU, LSU, IFU, BXU, L2, L3 Directory/Control.)
POWER5 --- The Next Step
§ Technology: 130nm lithography, Cu, SOI
§ 389mm², 276M transistors
§ Dual processor core
§ 8-way superscalar
§ Simultaneous multithreaded (SMT) core
  - Up to 2 virtual processors per real processor
  - Natural extension to POWER4 design
Multi-threading Evolution
(Diagram: execution-unit slots (FX0, FX1, FP0, FP1, LS0, LS1, BRX, CRL) over time for four schemes: Single Thread, Coarse Grain Threading, Fine Grain Threading, and Simultaneous Multi-Threading. Legend: Thread 0 Executing, Thread 1 Executing, No Thread Executing.)
Changes Going From ST to SMT Core
§ SMT easily added to superscalar micro-architecture
  - Second Program Counter (PC) added to share I-fetch bandwidth
  - GPR/FPR rename mapper expanded to map second set of registers (high-order address bit indicates thread)
  - Completion logic replicated to track two threads
  - Thread bit added to most address/tag buses
(Pipeline diagram: Fetch Unit/PC, I-Cache, Decode, Register Rename; BR/CRU, FP, and Integer Issue Qs; CR/LR/CTR, FPRs, GPRs; BR/CRL Units, FPUs, FXUs/LSUs; Data Cache.)
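The rename-mapper change above can be sketched in code. This is a minimal illustration, assuming a dictionary-based map table and a 120-entry shared physical pool (the POWER5 rename count); the class and method names are invented for the sketch, not IBM's design.

```python
# Sketch of a thread-tagged register rename mapper (illustrative, not
# the actual POWER5 logic). The high-order bit of the mapper index is
# the thread ID, so two threads share one physical rename pool.

NUM_ARCH_REGS = 32          # architected GPRs per thread
NUM_PHYS_REGS = 120         # shared physical rename pool (POWER5-sized)

class RenameMapper:
    def __init__(self):
        # Map table indexed by (thread_bit << 5) | arch_reg.
        self.map_table = {}
        self.free_list = list(range(NUM_PHYS_REGS))
        # Initially, back each thread's architected state with a physical reg.
        for tid in (0, 1):
            for r in range(NUM_ARCH_REGS):
                self.map_table[self.index(tid, r)] = self.free_list.pop()

    @staticmethod
    def index(tid, arch_reg):
        # High-order bit indicates the thread, as on POWER5.
        return (tid << 5) | arch_reg

    def rename_dest(self, tid, arch_reg):
        """Allocate a new physical register for a destination write."""
        if not self.free_list:
            raise RuntimeError("rename stall: no free physical registers")
        old = self.map_table[self.index(tid, arch_reg)]
        new = self.free_list.pop()
        self.map_table[self.index(tid, arch_reg)] = new
        return new, old          # old is freed when the write completes

    def lookup_src(self, tid, arch_reg):
        """Source operands read the current mapping for their thread."""
        return self.map_table[self.index(tid, arch_reg)]

m = RenameMapper()
p, _ = m.rename_dest(0, 3)       # thread 0 writes r3
assert m.lookup_src(0, 3) == p   # thread 0 sees the new mapping
assert m.lookup_src(1, 3) != p   # thread 1's r3 is unaffected
```

Because the thread bit simply widens the index, the single-thread rename logic is reused almost unchanged, which is why the slide calls SMT "easily added."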
Resource Sizes
§ Analysis done to optimize every micro-architectural resource size
  - GPR/FPR rename pool size
  - I-fetch buffers
  - Reservation station
  - SLB/TLB/ERAT
  - I-cache/D-cache
§ Many workloads examined
§ Associativity also examined
(Chart: IPC vs. number of GPR renames, 50 to 130, for ST and SMT. Results based on simulation of an online transaction processing application; vertical axis does not originate at 0.)
POWER5 Resource Size Enhancements
§ Enhanced caches and translation resources
  - I-cache: 64 KB, 2-way set associative, LRU
  - D-cache: 32 KB, 4-way set associative, LRU
  - First-level data translation: 128 entries, fully associative, LRU
  - L2 cache: 1.92 MB, 10-way set associative, LRU
§ Larger resource pools
  - Rename registers: GPRs, FPRs increased to 120 each
  - L2 cache coherency engines: increased by 100%
§ Enhanced data stream prefetching
§ Memory controller moved on chip
Resource Sharing
§ Threads share many resources
  - Global Completion Table, BHT, TLB, . . .
§ Tendency to drift toward extremes accompanied by reduced performance
§ Solution: Dynamically adjust resource utilization
§ Higher performance realized when resources balanced across threads
(Charts: relative occurrence of Global Completion Table occupancy, Thread 0 vs. Thread 1, without and with dynamic resource utilization adjustment. Results based on simulation of an online transaction processing application.)
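The balancing idea above can be pictured as a decode throttle driven by completion-table occupancy: a thread holding a disproportionate share of a shared resource is held back until the balance recovers. The threshold and policy below are invented for illustration; POWER5's actual feedback mechanism is not specified at this level in the talk.

```python
# Simplified illustration of dynamic resource balancing. The threshold
# and policy are invented for the example, not POWER5's real values.

GCT_ENTRIES = 20             # shared Global Completion Table size
THROTTLE_THRESHOLD = 0.7     # throttle a thread holding >70% of entries

def decode_allowed(gct_in_use_by_thread, tid):
    """Return True if thread `tid` may decode more instructions.

    A thread holding a disproportionate share of the shared completion
    table is throttled until balance recovers, keeping the occupancy
    distribution away from the extremes shown in the charts."""
    used = gct_in_use_by_thread[tid]
    total = sum(gct_in_use_by_thread)
    if total >= GCT_ENTRIES:
        return False                      # table full: nobody decodes
    return used < THROTTLE_THRESHOLD * GCT_ENTRIES

# Thread 0 holds 15 of 20 entries: it is throttled, thread 1 is not.
occupancy = [15, 3]
assert decode_allowed(occupancy, 0) is False
assert decode_allowed(occupancy, 1) is True
```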
Thread Priority
§ Instances when unbalanced execution desirable
  - No work for opposite thread
  - Thread waiting on lock
  - Software determined non-uniform balance
  - Power management
  - ...
§ Solution: Control instruction decode rate
  - Software/hardware controls 8 priority levels for each thread
(Chart: Thread 0 and Thread 1 IPC vs. priority difference (Thread 1 Priority - Thread 0 Priority, -7 to +7), with Single Thread Mode at the 0,7 and 7,0 extremes and Power Save Mode at 1,1.)
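One way to picture decode-rate control is to split a window of decode cycles between the two threads according to their priorities. The exponential split below is an illustrative guess, not IBM's documented algorithm; only the eight priority levels, the bias toward the higher-priority thread, and the treatment of priority 0 as a stopped thread follow the slide.

```python
# Hedged sketch of priority-based decode-slot sharing. The policy
# (the lower-priority thread gets its share halved per level of
# priority difference) is an invented illustration.

def decode_slots(prio0, prio1, window=32):
    """Split `window` decode cycles between two threads by priority."""
    assert 0 <= prio0 <= 7 and 0 <= prio1 <= 7
    if prio0 == 0 and prio1 == 0:
        return 0, 0                      # both stopped (power save)
    if prio0 == 0:
        return 0, window                 # thread 0 stopped
    if prio1 == 0:
        return window, 0                 # single-thread-like operation
    diff = abs(prio0 - prio1)
    if diff == 0:
        return window // 2, window // 2  # equal priority: even split
    minority = max(1, window >> (diff + 1))  # low-priority thread's share
    if prio0 > prio1:
        return window - minority, minority
    return minority, window - minority

assert decode_slots(4, 4) == (16, 16)
assert decode_slots(7, 0) == (32, 0)
s0, s1 = decode_slots(6, 2)   # diff = 4: minority gets 32 >> 5 = 1 slot
assert (s0, s1) == (31, 1)
```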
Dynamic Thread Switching
§ Used if no task ready for second thread to run
§ Allocates all machine resources to one thread
§ Initiated by software
§ Dormant thread wakes up on:
  - External interrupt
  - Decrementer interrupt
  - Special instruction from active thread
(State diagram: Active to/from Dormant via hardware or software; Dormant to/from Null via software.)
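The thread states above (Active, Dormant, Null) and their triggers can be sketched as a small state machine. The class and event names are illustrative only; the transitions follow the slide's diagram.

```python
# Illustrative state machine for POWER5 thread states, following the
# transitions named on the slide (not firmware or hardware detail).

class ThreadState:
    ACTIVE, DORMANT, NULL = "active", "dormant", "null"

    def __init__(self):
        self.state = self.ACTIVE

    def sleep(self):
        """Software puts an active thread to sleep (Active -> Dormant)."""
        if self.state == self.ACTIVE:
            self.state = self.DORMANT

    def wake(self, event):
        # A dormant thread wakes on an external interrupt, a decrementer
        # interrupt, or a special instruction from the active thread.
        if self.state == self.DORMANT and event in (
            "external_interrupt", "decrementer", "special_instruction"
        ):
            self.state = self.ACTIVE

    def deconfigure(self):
        """Software removes a dormant thread entirely (Dormant -> Null)."""
        if self.state == self.DORMANT:
            self.state = self.NULL

t = ThreadState()
t.sleep()
assert t.state == ThreadState.DORMANT
t.wake("decrementer")
assert t.state == ThreadState.ACTIVE
```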
Single Thread Operation
§ Advantageous for execution unit limited applications
  - Floating point or fixed point intensive workloads, e.g. matrix multiply
§ Execution unit limited applications provide minimal performance leverage for SMT
§ Extra resources necessary for SMT provide higher performance benefit when dedicated to single thread
§ Determined dynamically on a per-processor basis
(Chart: IPC of POWER4+, POWER5 SMT, and POWER5 ST on an execution unit limited workload.)
Modifications to POWER4 System Structure
(Diagrams: In POWER4 systems, each dual-core chip's L2 connects through the fabric controller to an off-chip L3 and then to an off-chip memory controller and memory. In POWER5 systems, the memory controller is on chip and the L3 attaches directly to the processor chip.)
16-way Building Block
(Diagram: a "book" built from two multi-chip modules, each MCM carrying four POWER5 chips and four L3 chips, with memory and I/O attached to each POWER5 chip.)
POWER5 Multi-chip Module
§ 95mm × 95mm
§ Four POWER5 chips
§ Four cache chips
§ 4,491 signal I/Os
§ 89 layers of metal
64-way SMP Interconnection
§ Interconnection exploits enhanced distributed switch
§ All chip interconnections operate at half processor frequency and scale with processor frequency
POWER4 and POWER5 Storage Hierarchy

                              POWER4                POWER5
L2 cache
  Capacity, line size         1.44 MB, 128 B line   1.92 MB, 128 B line
  Associativity, replacement  8-way, LRU            10-way, LRU
Off-chip L3 cache
  Capacity, line size         32 MB, 512 B line     36 MB, 256 B line
  Associativity, replacement  8-way, LRU            12-way, LRU
Chip interconnect
  Type                        Distributed switch    Enhanced distributed switch
  Intra-MCM data buses        Processor speed       ½ processor speed
  Inter-MCM data buses        ½ processor speed     ½ processor speed
Memory                        512 GB maximum        1024 GB (1 TB) maximum
Other SMT Considerations
§ Power management
  - SMT increases execution unit utilization
  - Dynamic power management does not impact performance
§ Debug tools / lab bring-up
  - Instruction tracing
  - Hang detection
  - Forward progress monitor
§ Performance monitoring
§ Serviceability
POWER Server Roadmap
§ 2001: POWER4 (180 nm) - 1.3 GHz cores, shared L2, distributed switch; chip multiprocessing, dynamic LPARs (16)
§ 2002-3: POWER4+ (130 nm) - 1.7 GHz cores, shared L2, distributed switch; reduced size, lower power, larger L2, more LPARs (32)
§ 2004*: POWER5 (130 nm) - > GHz cores, shared L2, distributed switch; simultaneous multi-threading, sub-processor partitioning, dynamic firmware updates, enhanced scalability and parallelism, high throughput performance, enhanced memory subsystem
§ 2005*: POWER5+ (90 nm) - autonomic computing enhancements
§ 2006*: POWER6 (65 nm) - >> GHz ultra high frequency cores, L2 caches, advanced system features

*Planned to be offered by IBM. All statements about IBM's future direction and intent are subject to change or withdrawal without notice and represent goals and objectives only.
Summary
§ POWER5 SMT implementation is more than SMT
  - Good ROI for silicon area: performance gain > area increase
  - Resource sizes optimized
  - Dynamic feedback enhances instruction throughput
  - Software-controlled priority exploits machine architecture
  - Dynamic ST to/from SMT mode capability optimizes system resources
§ SMT impacts pervasive throughout chip
§ Storage subsystem scalable to 64 processors / 128 threads
§ Operating in laboratory
  - AIX, Linux, and OS/400 booted and running