Scaling Power and the Future of CMOS
Mark Horowitz, EE/CS, Stanford University
A Long Time Ago
In a building far away
A man made a prediction
On surprisingly little data
That has defined an industry
Moore’s Law
CMOS Computer Performance
[Figure: SPECint 2006 (log scale, 0.01 to 100) versus year, 1988 to 2004. Legend: Intel 486, Pentium, Pentium 2, Pentium 3, Pentium 4, Itanium; Alpha 21064, 21164, 21264; Sparc, SuperSparc, Sparc64; MIPS; HP PA; PowerPC; AMD K6, K7, x86-64; IBM Power; Sun UltraSPARC; Intel Core 2; AMD Opteron; AMD Phenom.]
CMOS Computer Performance
[Figure: the same SPECint 2006 plot (log scale, 0.01 to 100) with the year axis extended to 2010.]
Moore’s Original Issues
• Design cost
• Power dissipation
• What to do with all the functionality possible
V scales to αV, L scales to αL
• So C scales to αC, i scales to αi (i/μ is stable)
• Delay = CV/i scales as α
• Energy = CV^2 scales as α^3
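As a worked check of these scaling rules, here is a minimal Python sketch. It assumes a linear shrink factor α = 0.7 per generation, a commonly quoted value that the slide itself does not state, and it uses normalized starting values. The function name `dennard_scale` is hypothetical.

```python
def dennard_scale(alpha, v=1.0, c=1.0, i=1.0):
    """Apply one generation of scaling to normalized V, C and i.

    Returns (delay_ratio, energy_ratio) relative to the unscaled device.
    """
    v2, c2, i2 = alpha * v, alpha * c, alpha * i
    delay_ratio = (c2 * v2 / i2) / (c * v / i)     # Delay = CV/i, scales as alpha
    energy_ratio = (c2 * v2 ** 2) / (c * v ** 2)   # Energy = CV^2, scales as alpha^3
    return delay_ratio, energy_ratio


# alpha = 0.7 is an assumed shrink factor, not a number from the slides.
delay_ratio, energy_ratio = dennard_scale(0.7)
print(f"delay x{delay_ratio:.3f}, energy x{energy_ratio:.3f}")
```

Under that assumption each generation gives gates that are about 30% faster and use roughly a third of the switching energy.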
Processor Power
[Figure: Watts (log scale, 1 to 1000) versus year, 1988 to 2010.]
Power Density
[Figure: Watts/mm^2 (log scale, 0.01 to 1.00) versus year, 1985 to 2009.]
Why Power Increased
[Figure: clock frequency in MHz (log scale, 10 to 10000) versus year, 1985 to 2009.]
Good News
Die growth & super frequency scaling have stopped
[Figure: cycle time in FO4 (log scale, 10 to 100) versus year, 1985 to 2005.]
Processor Power
They were high power too
[Figure: processor power (log scale, 1 to 100; axis label not recoverable) versus year, 1985 to 2003.]
Bad News
Voltage scaling has stopped as well
• kT/q does not scale
• Vth scaling has power consequences
If Vdd does not scale
• Energy scales slowly
Ed Nowak, IBM
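To make "kT/q does not scale" concrete, here is a rough Python sketch of the thermal voltage and the ideal subthreshold swing it implies. The 300 K temperature and the ideality factor n = 1 are assumptions for illustration; real devices have a larger swing. The function names are hypothetical.

```python
import math

K_BOLTZMANN = 1.380649e-23    # J/K (exact SI value)
Q_ELECTRON = 1.602176634e-19  # C (exact SI value)


def thermal_voltage(temp_k):
    """kT/q in volts."""
    return K_BOLTZMANN * temp_k / Q_ELECTRON


def subthreshold_swing_mv(temp_k, n=1.0):
    """Gate voltage (mV) needed to change leakage by 10x; n = 1 is the ideal case (assumed)."""
    return n * thermal_voltage(temp_k) * math.log(10) * 1e3


def leakage_increase(delta_vth_mv, temp_k=300.0, n=1.0):
    """Factor by which subthreshold leakage grows when Vth is lowered by delta_vth_mv."""
    return 10 ** (delta_vth_mv / subthreshold_swing_mv(temp_k, n))


vt = thermal_voltage(300.0)          # about 26 mV at an assumed 300 K
swing = subthreshold_swing_mv(300.0) # about 60 mV/decade in the ideal case
print(f"kT/q = {vt * 1e3:.1f} mV, swing = {swing:.1f} mV/decade")
print(f"Lowering Vth by 100 mV raises leakage by about {leakage_increase(100.0):.0f}x")
```

Because the swing is tied to kT/q, every reduction in Vth is paid for in exponentially higher leakage. This is one way to see why threshold voltage, and with it Vdd, has stopped scaling.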
Technology Scaling Today
Device sizes are still scaling
• Cost/device is still scaling down
• This is what is driving scaling
Voltages are not scaling very fast
• Threshold voltages set by leakage
• Gate oxide thickness is set by leakage
  This means that the channel lengths are not scaling
  Current is increasing by stressing silicon
Now Vdd and Vth are set by optimization
Other Technologies
For computing, I am not optimistic
Current problems are set by Physics:
• Vdd set by kT/q
  Sets the on-off ratio
• Wire energy by CVdd^2
To get around these limitations
• Need to create something very different!
Problem with Different Technologies
Design processes have been optimized for silicon
• Working on making it better for over 30 years
Silicon has set:
• Notions of logic (binary signals), digital design styles
• Computing (distinct memory and logic)
• Relative size and speed of memory and logic
No new technology will fit this mold well
• Changing the world is hard
• If you build it, generally they don’t come
  Unless they absolutely have to
Maturing of Silicon
Silicon will not disappear
• It will still be a huge business
  Growth rate is slower, eventually very slow scaling
Silicon will become like concrete and steel
• Basis of a huge industry
• Critical to nearly everything
• But fairly stable and predictable
Will remain the dominant substrate for computing
• And performance will be limited by power dissipation
Optimizing Energy
Every design is a point on a 2-D plane
[Figure: energy (vertical axis) versus performance (horizontal axis).]
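One way to read the energy versus performance plane is that a design is worth keeping only if no other design is at least as fast and at least as frugal. Here is a minimal sketch of that filter. The design points are invented for illustration and the helper `pareto_frontier` is hypothetical.

```python
def pareto_frontier(designs):
    """Return the designs not dominated by any other design.

    designs: list of (name, performance, energy) tuples, where higher
    performance and lower energy are both better.
    """
    frontier = []
    for name, perf, energy in designs:
        dominated = any(
            (p >= perf and e <= energy) and (p > perf or e < energy)
            for _, p, e in designs
        )
        if not dominated:
            frontier.append((name, perf, energy))
    return sorted(frontier, key=lambda d: d[1])  # order by performance


# Invented design points in arbitrary units, for illustration only.
designs = [
    ("A", 1.0, 1.0),
    ("B", 2.0, 2.5),
    ("C", 2.0, 4.0),  # same performance as B but more energy
    ("D", 3.0, 6.0),
    ("E", 1.5, 3.0),  # slower than B and more energy
]
frontier_names = [name for name, _, _ in pareto_frontier(designs)]
print(frontier_names)
```

The surviving points trace the optimal energy/performance curve. Designs above it waste energy, performance, or both.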
Years of Low Power Research …
Shown only one design technique to reduce power
• Reduce waste
Can waste
• Energy (clock gating, leakage control, etc)
• Performance
  Adding additional constraints to operation flow
If technology scaling has stalled
• Need to focus on reducing waste in our systems
Increase in efficiency in our designs will set performance
Future Systems
Some simple math
• Assume scaling continues
• Dies don’t shrink in size
• Average power/gate must decrease by 2x / gen
  Or need to build systems that increase in power
Since gates are shrinking in size
• Get 1.4x from capacitive reduction
• Where is the other factor of 1.4x?
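The "simple math" can be written out as a short sketch. It assumes the gate count per die doubles each generation and the linear dimension shrinks by 0.7x. Both are standard rules of thumb implied by the 2x and 1.4x figures, not values stated explicitly on the slide. The function name is hypothetical.

```python
def missing_power_factor(gate_count_growth=2.0, linear_shrink=0.7):
    """Extra per-gate power reduction needed beyond the capacitance shrink.

    With constant die size and constant die power, power per gate must fall
    by the same factor that the gate count grows.
    """
    required_reduction = gate_count_growth   # 2x fewer watts per gate
    capacitive_gain = 1.0 / linear_shrink    # C shrinks with feature size: about 1.4x
    return required_reduction / capacitive_gain


extra = missing_power_factor()
print(f"capacitance gives {1 / 0.7:.2f}x, still need {extra:.2f}x from elsewhere")
```

Under the classic rule where Vdd also scales by the same factor, the Vdd^2 term would supply about 2x on its own. With voltage scaling stalled, the remaining 1.4x has to come from somewhere else.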
The Push for Parallelism
[Figure: Watts/Spec*L*Vdd^2 (log scale, 0.01 to 1) versus Spec2000*L (log scale, 1 to 1000). Legend: Intel 386, 486, Pentium, Pentium 2, Pentium 3, Pentium 4, Itanium; Alpha 21064, 21164, 21264; Sparc, SuperSparc, Sparc64; MIPS; HP PA; PowerPC; AMD K6, K7, x86-64. Annotation: 10 parallel processors.]
Exploit Parallelism / Scale Vdd
If you have parallelism
• Add more function units
  Fill up new die (2x)
• Lower energy/op
  ΔE/ΔP will decrease
  Vdd, sizes, etc will reduce
  Build simpler architectures
Works well when ΔE/ΔP is large
• But what happens when that runs out?
[Figure: energy/op versus performance.]
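Here is a rough sketch of why trading parallelism for voltage works, and why it eventually runs out. All modelling choices here are assumptions, not from the slides: gate speed follows the alpha-power law f ∝ (Vdd - Vth)^a / Vdd with a = 1.3, Vth = 0.3 V and a nominal Vdd of 1.0 V, energy/op is dynamic only (∝ Vdd^2), and parallelization overhead and leakage are ignored.

```python
def rel_frequency(vdd, vth=0.3, a=1.3):
    """Relative clock frequency under the (assumed) alpha-power delay model."""
    return (vdd - vth) ** a / vdd


def vdd_for_parallelism(n_units, v0=1.0, vth=0.3, a=1.3):
    """Vdd at which each of n_units runs at 1/n_units of the nominal frequency."""
    target = rel_frequency(v0, vth, a) / n_units
    lo, hi = vth, v0  # rel_frequency increases monotonically on (vth, v0]
    for _ in range(60):  # bisection
        mid = 0.5 * (lo + hi)
        if rel_frequency(mid, vth, a) < target:
            lo = mid
        else:
            hi = mid
    return hi


def energy_per_op_ratio(n_units, v0=1.0, vth=0.3, a=1.3):
    """Dynamic energy/op relative to one unit at nominal Vdd, at equal throughput."""
    return (vdd_for_parallelism(n_units, v0, vth, a) / v0) ** 2


for n in (1, 2, 4, 8):
    print(f"{n} units: Vdd = {vdd_for_parallelism(n):.2f} V, "
          f"energy/op x{energy_per_op_ratio(n):.2f}")
```

In this model the first doubling of units buys the largest energy/op reduction. Each further doubling buys less as Vdd approaches Vth, which is the sense in which ΔE/ΔP shrinks.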
Problem Reformulation
Best way to save energy is to do less work
• Energy directly reduced by the reduction in work
• But required time for the function decreases as well
  Convert this into extra power gains
• Shifts the optimal curve down and to the right
[Figure: energy/op versus user performance.]
Exploit Specialization
Optimize execution units for specific applications
• Reformulate the hardware to reduce needed work
• Can improve energy efficiency for a class of applications
DSP/Vector engines are more efficient than CPUs
• Exploit locality, reuse
• High compute density
ASICs are more efficient than DSP/Vector engines
• If we want efficiency, we need more application optimization
ASIC/SOC Design Trends
Rising non-recurring engineering costs
• Increasing design complexity
• Growing verification complexity
• Challenging physical design
• Rising mask costs
Few markets can justify ASIC NRE
ASIC Future Depends on Your Religion
Believe in correct by construction?
Believe in a generic high-level design language?
Historically neither has worked
• I believe history is correct
Allowing people to connect complex blocks
• Yields a complex validation problem, and a $20M+ design
• General SoC, SiP will never be cheap
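To see how a $20M+ design cost limits which markets can carry an ASIC, here is a small sketch that spreads the NRE over shipped units. The $20M figure is the lower bound quoted above. The volumes are hypothetical, chosen only to show the trend, and `nre_per_unit` is an illustrative helper.

```python
def nre_per_unit(nre_dollars, units_shipped):
    """Non-recurring engineering cost carried by each shipped chip."""
    if units_shipped <= 0:
        raise ValueError("units_shipped must be positive")
    return nre_dollars / units_shipped


NRE = 20_000_000  # lower bound of the "$20M+" design cost

# Hypothetical lifetime volumes, for illustration only.
for volume in (100_000, 1_000_000, 10_000_000):
    print(f"{volume:>10,} units: ${nre_per_unit(NRE, volume):,.2f} of NRE per chip")
```

Only high-volume products bring the per-chip share of the design cost down to a small amount.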
Computing’s Future:
Create a new universal computing platform
• That is more efficient than today/tomorrow’s CMP
• Bill Dally is working on this one
Leverage existing large volume processors for other applications
• GPUs moving into general processing
• OMAP being distributed as Unix system
Can We Do Better?
Chip design is expensive since chips are complex
But the building blocks are well known
• Many of the optimizations are well known too
• Designers often do many of the same steps
• Part of the reason for off-shoring
  Don’t need experience
Getting the system to work is hard
• There is a lot of turning the crank that is needed
Can we automate some of the crank turning?
Chip Generator Idea
[Diagram: three processes leading from an application to a final product.]
• $$$ ASIC design: Custom System
• Configure a programmable chip: Configured System (not efficient)
• Configure / program a virtual chip, then generate optimized chip: Semi-custom System
Another Way To Put It…
[Figure: performance per watt versus cost structure (per app to amortized), placing ASIC, generator, and programmable CMP.]
ASIC
• Best perf/watt per app
• Highest costs
• Specific for a single app
Generator
• Excellent perf/watt per app
• Amortized costs
• Wide app domain
CMP
• Worse perf/watt per app
• Amortized costs
• Wider app domain
Smart Memories - A “Pretend” Generator
[Figure: two panels, "Smart Memories Architecture: Single Tile" and "Chip Generator Derivative". Blocks labeled M, T, and D are grouped as a data cache and an instruction cache, connected through crossbars to processors with function units (FU).]
Looks Promising
Large energy / performance gains are possible:
• Use H.264 as example application
• Use a GP-CMP chip generator
  400-600X initial perf. gap
New And Exciting Challenges
[Diagram: four challenge areas around a Chip Multiprocessor Generator.]
Potential Benefits (Ofer)
• Is it useful?
• How much better than CMP?
• How much worse than ASIC?
Process (Wajahat, Rehan, Megan)
• How does the user use a generator?
• How would a generator be built?
• How is the chip specified?
Optimization (Omid, Pete)
• Given an application and a target (power / perf.), how do we find the best “chip plan”?
Verification (Megan, Ofer)
• Verification of a chip is difficult. How do we verify a generator?
Conclusions
The technology engine driving IT is slowing down
• Power efficiency is the real problem
Application optimization leads to efficiency
• But design is too expensive today to do this
Need to rethink design
• Build chip generators not chips
  These are virtual programmable chips
  Have tools generate the real chips that customers want