Scaling, Power, and the Future of CMOS
Mark Horowitz, EE/CS, Stanford University

A Long Time Ago

In a building far away

A man made a prediction

On surprisingly little data

That has defined an industry


Moore’s Law


CMOS Computer Performance
[Figure: SpecInt 2006 (log scale, 0.01 to 100) vs. year, 1988 to 2004, for Intel 486, Pentium, Pentium 2, Pentium 3, Pentium 4, and Itanium; Alpha 21064, 21164, and 21264; Sparc, SuperSparc, and Sparc64; MIPS; HP PA; PowerPC; AMD K6, K7, and x86-64; IBM Power; Sun UltraSPARC; Intel Core 2; AMD Opteron; AMD Phenom]

CMOS Computer Performance
[Figure: SpecInt 2006 (log scale, 0.01 to 100) vs. year for the same processors, extended to cover 1988 to 2010]

Moore’s Original Issues
• Design cost
• Power dissipation
• What to do with all the functionality possible

ftp://download.intel.com/research/silicon/moorespaper.pdf

Scaling MOS Devices

JSSC Oct 74, pg 256

In this ideal scaling:
• $V \to \alpha V$ and $L \to \alpha L$
• So $C \to \alpha C$ and $i \to \alpha i$ (current per micron of width is stable)
• Delay $= CV/i$ scales as $\alpha$
• Energy $= CV^2$ scales as $\alpha^3$
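The ideal-scaling rules above can be checked with a few lines of arithmetic. This minimal Python sketch assumes α = 0.7 per generation (a common rule of thumb, not stated on the slide) and also derives two quantities the slide leaves implicit: power per gate (energy/delay, which scales as α²) and power density, which stays constant because gate density rises as 1/α².

```python
def ideal_scaling(alpha: float) -> dict:
    """Per-generation scale factors under ideal (Dennard) scaling, 0 < alpha < 1."""
    voltage = alpha                # V -> alpha * V
    length = alpha                 # L -> alpha * L
    capacitance = alpha            # C -> alpha * C
    current = alpha                # i -> alpha * i (current per unit width fixed)
    delay = capacitance * voltage / current   # CV/i -> alpha
    energy = capacitance * voltage ** 2       # CV^2 -> alpha^3
    power_per_gate = energy / delay           # E/t -> alpha^2
    gates_per_area = 1 / length ** 2          # density -> 1/alpha^2
    return {
        "delay": delay,
        "energy": energy,
        "power_per_gate": power_per_gate,
        "power_density": power_per_gate * gates_per_area,  # -> 1 (unchanged)
    }


if __name__ == "__main__":
    for name, factor in ideal_scaling(0.7).items():
        print(f"{name:>15}: {factor:.3f}")
```

With α = 0.7 this prints delay 0.700, energy 0.343, power_per_gate 0.490, and power_density 1.000: under ideal scaling, each generation doubled density without raising power density.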

Processor Power
[Figure: processor power in watts (log scale, 1 to 1000) vs. year, 1988 to 2010]

Power Density
[Figure: power density in W/mm² (log scale, 0.01 to 1) vs. year, 1985 to 2009]

Why Power Increased
[Figure: clock frequency in MHz (log scale, 10 to 10,000) vs. year, 1985 to 2009]

Good News
Die growth & super frequency scaling have stopped
[Figure: cycle time in FO4 delays (log scale, 10 to 100) vs. year, 1985 to 2005]

Processor Power
They were high power too
[Figure: processor power (log scale, 1 to 100) vs. year, 1985 to 2003]

Bad News

Voltage scaling has stopped as well
• kT/q does not scale
• Vth scaling has power consequences

If Vdd does not scale
• Energy scales slowly

Ed Nowak, IBM
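Why kT/q matters: below threshold, MOSFET current falls off exponentially, by at best one decade per (kT/q)·ln 10 of gate voltage, so every step down in Vth multiplies off-state leakage. This textbook estimate is not from the slides; it assumes an ideal subthreshold slope factor n = 1 (real devices are worse) and room temperature (300 K).

```python
import math

K_B = 1.380649e-23   # Boltzmann constant, J/K
Q = 1.602176634e-19  # elementary charge, C


def thermal_voltage(temp_k: float = 300.0) -> float:
    """kT/q in volts."""
    return K_B * temp_k / Q


def min_subthreshold_swing(temp_k: float = 300.0) -> float:
    """Ideal (n = 1) subthreshold swing, in volts per decade of current."""
    return thermal_voltage(temp_k) * math.log(10)


def leakage_increase(delta_vth: float, temp_k: float = 300.0, n: float = 1.0) -> float:
    """Factor by which off-current grows when Vth is lowered by delta_vth volts."""
    swing = n * min_subthreshold_swing(temp_k)
    return 10 ** (delta_vth / swing)


if __name__ == "__main__":
    print(f"kT/q      = {thermal_voltage() * 1e3:.1f} mV")
    print(f"min swing = {min_subthreshold_swing() * 1e3:.1f} mV/decade")
    print(f"leakage x {leakage_increase(0.100):.0f} for a 100 mV lower Vth")
```

At 300 K this gives about 26 mV for kT/q, about 60 mV/decade for the swing, and roughly a 48x leakage increase for a 100 mV reduction in Vth, which is why Vth, and with it Vdd, stopped scaling.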

Technology Scaling Today

Device sizes are still scaling
• Cost/device is still scaling down
• This is what is driving scaling

Voltages are not scaling very fast
• Threshold voltages are set by leakage
• Gate oxide thickness is set by leakage
• This means that the channel lengths are not scaling
• Current is being increased by stressing the silicon

Now Vdd and Vth are set by optimization

Other Technologies

For computing, I am not optimistic
Current problems are set by physics:
• Vdd is set by kT/q, which sets the on-off ratio
• Wire energy is set by $CV_{dd}^2$

To get around these limitations
• Need to create something very different!

Problem with Different Technologies

Design processes have been optimized for silicon
• Working on making it better for over 30 years

Silicon has set:
• Notions of logic (binary signals), digital design styles
• Computing (distinct memory and logic)
• Relative size and speed of memory and logic

No new technology will fit this mold well
• Changing the world is hard
• If you build it, generally they don’t come, unless they absolutely have to

Maturing of Silicon

Silicon will not disappear
• It will still be a huge business
• Growth rate is slower; eventually very slow scaling

Silicon will become like concrete and steel
• Basis of a huge industry
• Critical to nearly everything
• But fairly stable and predictable

Will remain the dominant substrate for computing
• And performance will be limited by power dissipation

Optimizing Energy

Every design is a point on a 2-D plane
[Figure: energy (vertical axis) vs. performance (horizontal axis)]
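To make the 2-D picture concrete: given a set of design points, the ones worth considering lie on the frontier where no other design is both at least as fast and at least as energy-efficient. This Python sketch computes that Pareto frontier; the design names and numbers are hypothetical.

```python
from typing import List, Tuple

# (name, performance, energy per operation); higher performance and lower energy are better
Design = Tuple[str, float, float]


def dominates(a: Design, b: Design) -> bool:
    """True if design a is at least as good as b on both axes and strictly better on one."""
    _, perf_a, energy_a = a
    _, perf_b, energy_b = b
    no_worse = perf_a >= perf_b and energy_a <= energy_b
    better = perf_a > perf_b or energy_a < energy_b
    return no_worse and better


def pareto_frontier(designs: List[Design]) -> List[Design]:
    """Designs not dominated by any other, sorted by increasing performance."""
    frontier = [d for d in designs if not any(dominates(o, d) for o in designs)]
    return sorted(frontier, key=lambda d: d[1])


# Hypothetical design points, in arbitrary units.
DESIGNS: List[Design] = [
    ("A", 1.0, 1.0),
    ("B", 2.0, 3.0),
    ("C", 1.5, 2.5),
    ("D", 2.0, 4.0),
    ("E", 3.0, 8.0),
]

if __name__ == "__main__":
    for name, perf, energy in pareto_frontier(DESIGNS):
        print(name, perf, energy)
```

Here the frontier is A, C, B, E; design D is dropped because B delivers the same performance for less energy. Moving along the frontier trades energy for performance.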


Years of Low Power Research …

… have shown only one design technique to reduce power
• Reduce waste

Can waste
• Energy (clock gating, leakage control, etc.)
• Performance (adding additional constraints to operation flow)

If technology scaling has stalled
• Need to focus on reducing waste in our systems
• Increases in the efficiency of our designs will set performance

Future Systems

Some simple math
• Assume scaling continues
• Dies don’t shrink in size
• Average power/gate must decrease by 2x / gen, or we need to build systems that increase in power

Since gates are shrinking in size
• Get 1.4x from capacitive reduction
• Where is the other factor of 1.4x?
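The slide's arithmetic can be made explicit. Assuming, as the slide does, that device count doubles each generation on a fixed-size die at constant total power, and further assuming clock rate and switching activity per gate stay unchanged, power per gate must halve. Capacitance per gate falls by √2 (linear dimensions scale by 1/√2), so this sketch computes what is left over and, as one possible answer, how far Vdd would have to drop if the rest came from the V² term alone.

```python
import math


def leftover_factor(required: float = 2.0, from_capacitance: float = math.sqrt(2)) -> float:
    """Power/gate reduction still needed per generation after capacitance scaling."""
    return required / from_capacitance


def vdd_ratio_needed(factor: float) -> float:
    """Per-generation Vdd ratio if the leftover factor must come from the Vdd^2 term alone."""
    return 1 / math.sqrt(factor)


if __name__ == "__main__":
    f = leftover_factor()
    print(f"leftover factor: {f:.3f}")
    print(f"Vdd must scale by {vdd_ratio_needed(f):.3f} per generation")
```

The leftover factor is √2 ≈ 1.414; getting it from voltage alone would require Vdd to fall by about 16% (a ratio of 2^(-1/4) ≈ 0.841) every generation, which the earlier slides show is no longer happening.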

The Push for Parallelism
[Figure: Watts/(Spec × L × Vdd²) (log scale, 0.01 to 1) vs. Spec2000 × L (log scale, 1 to 1000) for Intel 386, 486, Pentium, Pentium 2, Pentium 3, Pentium 4, and Itanium; Alpha 21064, 21164, and 21264; Sparc, SuperSparc, and Sparc64; MIPS; HP PA; PowerPC; AMD K6, K7, and x86-64; annotated "10 parallel processors"]

Exploit Parallelism / Scale Vdd

If you have parallelism:
• Add more function units; fill up the new die (2x)
• Lower energy/op: ΔE/ΔP will decrease; Vdd, sizes, etc. will reduce; build simpler architectures

Works well when ΔE/ΔP is large
• But what happens when that runs out?

[Figure: energy/op vs. performance]
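One way to see why ΔE/ΔP shrinks: split a fixed throughput across N parallel units, let each run N times slower, and lower Vdd until each unit's speed just meets that target. This sketch assumes the Sakurai-Newton alpha-power delay model (delay ∝ Vdd/(Vdd - Vth)^a) with hypothetical values Vth = 0.3 V, nominal Vdd = 1.0 V, and a = 1.3. It counts only switching energy (CV²), ignores leakage and parallelization overhead, and is an illustration rather than the model behind the slide.

```python
VTH = 0.3    # assumed threshold voltage (V)
VNOM = 1.0   # assumed nominal supply voltage (V)
A = 1.3      # assumed alpha-power-law exponent


def rel_speed(vdd: float) -> float:
    """Alpha-power-law speed (1/delay), normalized to 1.0 at VNOM."""
    return ((vdd - VTH) ** A / vdd) / ((VNOM - VTH) ** A / VNOM)


def vdd_for_speed(target: float, iters: int = 60) -> float:
    """Bisection for Vdd in (VTH, VNOM]; rel_speed increases with Vdd for A >= 1."""
    lo, hi = VTH + 1e-6, VNOM
    for _ in range(iters):
        mid = (lo + hi) / 2
        if rel_speed(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def energy_per_op(n_units: int) -> float:
    """Relative CV^2 energy/op when n_units share a fixed total throughput."""
    vdd = vdd_for_speed(1.0 / n_units)
    return (vdd / VNOM) ** 2


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(f"N = {n}: energy/op = {energy_per_op(n):.3f}")
```

With these assumptions, energy/op falls to roughly a third at N = 2, but each further doubling buys less because Vdd cannot go below Vth; including leakage would make the returns diminish even faster.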

Problem Reformulation

Best way to save energy is to do less work
• Energy is directly reduced by the reduction in work
• But the required time for the function decreases as well
• Convert this into extra power gains

Shifts the optimal curve down and to the right
[Figure: energy/op vs. user performance]

Exploit Specialization

Optimize execution units for specific applications
• Reformulate the hardware to reduce needed work
• Can improve energy efficiency for a class of applications

DSP/Vector engines are more efficient than CPUs
• Exploit locality, reuse
• High compute density

ASICs are more efficient than DSP/Vector engines
• If we want efficiency, we need more application optimization

ASIC/SOC Design Trends

• Rising non-recurring engineering (NRE) costs
• Increasing design complexity
• Growing verification complexity
• Challenging physical design
• Rising mask costs

Few markets can justify ASIC NRE
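Why so few markets can justify the NRE: a one-time cost is spread over every chip shipped. This sketch uses the "$20M+" design cost quoted later in the talk; the per-chip costs of the ASIC and of an off-the-shelf programmable alternative are hypothetical, and the alternative is simplified to have no NRE of its own.

```python
def cost_per_unit(nre: float, unit_cost: float, volume: int) -> float:
    """Amortized cost per chip: one-time NRE spread over the volume, plus per-unit cost."""
    return nre / volume + unit_cost


def breakeven_volume(nre_asic: float, unit_asic: float, unit_alt: float) -> float:
    """Volume at which an ASIC costs the same per chip as a no-NRE alternative."""
    return nre_asic / (unit_alt - unit_asic)


NRE_ASIC = 20e6   # "$20M+ design" figure quoted in the talk
UNIT_ASIC = 10.0  # hypothetical per-chip cost of the ASIC
UNIT_ALT = 50.0   # hypothetical per-chip cost of a programmable part

if __name__ == "__main__":
    for volume in (100_000, 500_000, 5_000_000):
        asic = cost_per_unit(NRE_ASIC, UNIT_ASIC, volume)
        print(f"{volume:>9} units: ASIC ${asic:.2f}/chip vs. programmable ${UNIT_ALT:.2f}/chip")
    print(f"break-even: {breakeven_volume(NRE_ASIC, UNIT_ASIC, UNIT_ALT):,.0f} units")
```

Under these numbers the ASIC only wins beyond 500,000 units; at 100,000 units it costs $210 per chip. Rising NRE pushes that break-even point higher, shrinking the set of markets that can afford a custom chip.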

ASIC Future Depends on Your Religion

Believe in correct by construction?
Believe in a generic high-level design language?

Historically, neither has worked
• I believe history is correct

Allowing people to connect complex blocks
• Yields a complex validation problem, and a $20M+ design
• General SoC, SiP will never be cheap

Computing’s Future

Create a new universal computing platform
• That is more efficient than today’s/tomorrow’s CMP
• Bill Dally is working on this one

Leverage existing large-volume processors for other applications
• GPUs moving into general processing
• OMAP being distributed as a Unix system

Can We Do Better?

Chip design is expensive since chips are complex
But the building blocks are well known
• Many of the optimizations are well known too
• Designers often do many of the same steps
• Part of the reason for off-shoring: don’t need experience

Getting the system to work is hard
• There is a lot of turning the crank that is needed

Can we automate some of the crank turning?

Chip Generator Idea

Starting from an application, three paths (process → final product):
• $$$ ASIC design → custom system
• Configure a programmable chip → configured system (not efficient)
• Configure / program a virtual chip, then generate an optimized chip → semi-custom system

Another Way To Put It…

[Figure: performance per watt vs. cost structure, ranging from per app to amortized]
• ASIC: best perf/watt per app; highest costs; specific to a single app
• Generator: excellent perf/watt per app; amortized costs; wide app domain
• CMP (programmable): worse perf/watt per app; amortized costs; wider app domain

Smart Memories - A “Pretend” Generator
[Figure: Smart Memories Architecture: Single Tile, with processors, functional units (FU), crossbars, and an array of memory blocks (M) configured as instruction and data caches; alongside a Chip Generator Derivative]

Looks Promising

Large energy / performance gains are possible:
• Use H.264 as example application
• Use a GP-CMP chip generator
• 400-600X initial perf. gap

Chip Multiprocessor Generator: New and Exciting Challenges

Potential Benefits (Ofer)
• Is it useful? How much better than a CMP? How much worse than an ASIC?

Process (Wajahat, Rehan, Megan)
• How does the user use a generator? How would a generator be built? How is the chip specified?

Optimization (Omid, Pete)
• Given an application and a target (power / perf.), how do we find the best “chip plan”?

Verification (Megan, Ofer)
• Verification of a chip is difficult. How do we verify a generator?
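The Optimization question (given an application and a power/performance target, find the best chip plan) can be framed as a search over a generator's parameters. This minimal Python sketch is hypothetical throughout: the knobs (`cores`, `simd_width`, `vdd`), the toy linear frequency model, and the assumption of perfectly parallel work are illustrative, not the actual parameters or cost models of any real generator. It brute-forces a tiny design space and keeps the fastest plan under a power budget.

```python
from dataclasses import dataclass
from itertools import product
from typing import Optional


@dataclass(frozen=True)
class ChipPlan:
    cores: int       # number of processor tiles (hypothetical knob)
    simd_width: int  # lanes per core (hypothetical knob)
    vdd: float       # supply voltage in volts


VTH = 0.3  # assumed threshold voltage (V)


def rel_frequency(vdd: float) -> float:
    """Toy model: clock rate grows linearly with gate overdrive, 1.0 at 1.0 V."""
    return (vdd - VTH) / (1.0 - VTH)


def performance(p: ChipPlan) -> float:
    """Relative throughput, assuming the application parallelizes perfectly."""
    return p.cores * p.simd_width * rel_frequency(p.vdd)


def power(p: ChipPlan) -> float:
    """Relative dynamic power: units * C * Vdd^2 * f, with C normalized to 1."""
    return p.cores * p.simd_width * p.vdd ** 2 * rel_frequency(p.vdd)


def best_plan(power_budget: float) -> Optional[ChipPlan]:
    """Exhaustively search the small design space for the fastest plan within budget."""
    space = product([1, 2, 4, 8, 16], [1, 4, 8], [0.6, 0.7, 0.8, 0.9, 1.0])
    feasible = [ChipPlan(c, w, v) for c, w, v in space if power(ChipPlan(c, w, v)) <= power_budget]
    return max(feasible, key=performance, default=None)


if __name__ == "__main__":
    plan = best_plan(power_budget=16.0)
    print(plan, f"perf={performance(plan):.2f}", f"power={power(plan):.2f}")
```

Because this toy model rewards perfect parallelism, the search favors many units at the lowest Vdd. A real generator would need application-specific performance and power models and a design space far too large to enumerate.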

Conclusions

The technology engine driving IT is slowing down
• Power efficiency is the real problem

Application optimization leads to efficiency
• But design is too expensive today to do this

Need to rethink design
• Build chip generators, not chips
• These are virtual programmable chips
• Have tools generate the real chips that customers want