Memory Power Management via Dynamic Voltage/Frequency Scaling

Howard David (Intel) Eugene Gorbatov (Intel) Ulf R. Hanebutte (Intel)

Chris Fallin (CMU) Onur Mutlu (CMU)

Memory Power is Significant

- Power consumption is a primary concern in modern servers
- Many works address CPU, whole-system, or cluster-level power
- But memory power is largely unaddressed
- In our server system*, memory is 19% of system power (avg)
  - Some work notes up to 40% of total system power

[Figure: System power vs. memory power (W, 0-400) across the SPEC CPU2006 benchmarks]

- Goal: Can we reduce this figure?

*Dual 4-core Intel Xeon®, 48GB DDR3 (12 DIMMs), SPEC CPU2006, all cores active. Measured AC power, analytically modeled memory power.

Existing Solution: Memory Sleep States?

- Most memory energy-efficiency work uses sleep states
  - Shut down DRAM devices when no memory requests are active
- But even low-memory-bandwidth workloads keep memory awake
  - Idle periods between requests diminish in multicore workloads
  - CPU-bound workloads/phases are rarely completely cache-resident

[Figure: Time spent in sleep states (sleep-state residency, 0-8%) across the SPEC CPU2006 benchmarks]

Memory Bandwidth Varies Widely

- Workload memory bandwidth requirements vary widely

[Figure: Memory bandwidth for SPEC CPU2006, bandwidth/channel (GB/s), per benchmark]

- Memory system is provisioned for peak capacity → often underutilized

Memory Power can be Scaled Down

- DDR can operate at multiple frequencies → reduced power
  - Lower frequency directly reduces switching power
  - Lower frequency allows for lower voltage
  - Comparable to CPU DVFS:

        Knob                  Reduction   System power
        CPU voltage/freq.     ↓ 15%       ↓ 9.9%
        Memory freq.          ↓ 40%       ↓ 7.6%

- Frequency scaling increases latency → reduced performance
  - Memory storage array is asynchronous
  - But bus transfer depends on frequency
  - When bus bandwidth is the bottleneck, performance suffers

Observations So Far

- Memory power is a significant portion of total power
  - 19% (avg) in our system; up to 40% noted in other works
- Sleep-state residency is low in many workloads
  - Multicore workloads reduce idle periods
  - CPU-bound applications send requests frequently enough to keep memory devices awake
- Memory bandwidth demand is very low in some workloads
- Memory power is reduced by frequency scaling
  - And voltage scaling can give further reductions

DVFS for Memory

- Key idea: observe memory bandwidth utilization, then adjust memory frequency/voltage to reduce power with minimal performance loss
  → Dynamic Voltage/Frequency Scaling (DVFS) for memory

- Goal in this work:
  - Implement DVFS in the memory system
  - Develop a simple control algorithm that exploits opportunities for reduced memory frequency/voltage by observing behavior
  - Evaluate the proposed algorithm on a real system

Outline

- Motivation
- Background and Characterization
  - DRAM Operation
  - DRAM Power
  - Frequency and Voltage Scaling
- Performance Effects of Frequency Scaling
- Frequency Control Algorithm
- Evaluation and Conclusions

Outline

- Motivation
- Background and Characterization  ← current section
  - DRAM Operation
  - DRAM Power
  - Frequency and Voltage Scaling
- Performance Effects of Frequency Scaling
- Frequency Control Algorithm
- Evaluation and Conclusions

DRAM Operation

- Main memory consists of DIMMs of DRAM devices
- Each DIMM is attached to a memory bus (channel)

[Figure: One DIMM of eight x8 DRAM devices (/8 each) feeding a 64-bit memory bus]

DRAM Operation n  n  n 

Main memory consists of DIMMs of DRAM devices Each DIMM is attached to a memory bus (channel) Multiple DIMMs can connect to one channel

to Memory Controller

11

Inside a DRAM Device

[Figure: DRAM device block diagram]

Inside a DRAM Device

- Banks:
  - Independent arrays
  - Asynchronous: independent of memory bus speed

[Figure: Block diagram highlighting a bank: row decoder, sense amps, column decoder]

Inside a DRAM Device

- I/O circuitry:
  - Runs at bus speed
  - Clock sync/distribution
  - Bus drivers and receivers
  - Buffering/queueing

[Figure: Block diagram highlighting the I/O circuitry: receivers, drivers, registers, write FIFO]

Inside a DRAM Device

- On-Die Termination (ODT):
  - Required by bus electrical characteristics for reliable operation
  - Resistive element that dissipates power when the bus is active

[Figure: Block diagram highlighting the on-die termination (ODT)]

Inside a DRAM Device

[Figure: Complete block diagram: bank, row decoder, sense amps, column decoder, receivers, drivers, registers, write FIFO, ODT]

Effect of Frequency Scaling on Power

Reduced memory bus frequency:

- Does not affect bank power
  - Constant energy per operation
  - Depends only on utilized memory bandwidth
- Decreases I/O power
  - Dynamic power in the bus interface and clock circuitry falls due to less frequent switching
- Increases termination power
  - The same data takes longer to transfer
  - Hence, bus utilization increases
- The tradeoff between I/O and termination results in a net power reduction at lower frequencies
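To make the tradeoff concrete, here is a minimal sketch in C of this kind of three-component power decomposition. All constants are invented placeholders chosen only to exhibit the shape described above (bank power flat in frequency, I/O power falling, termination power rising with utilization); this is not the paper's calibrated power model.

```c
/* Illustrative three-part DRAM power model (NOT the paper's calibrated model).
 * Constants are made up to show the shape of the tradeoff only. */
#include <stdio.h>

/* Peak bandwidth (MB/s) of a 64-bit (8-byte) channel at a given MT/s rate. */
static double peak_bw_mbs(double rate_mts) { return rate_mts * 8.0; }

static double dram_power(double rate_mts, double used_bw_mbs) {
    const double e_per_mb   = 0.004; /* bank energy per MB moved: constant/op  */
    const double k_io       = 0.006; /* I/O + clock dynamic power, per MT/s    */
    const double p_term_max = 6.0;   /* ODT power at 100% bus utilization      */

    double util   = used_bw_mbs / peak_bw_mbs(rate_mts); /* rises as rate drops */
    double p_bank = e_per_mb * used_bw_mbs;   /* unaffected by bus frequency    */
    double p_io   = k_io * rate_mts;          /* falls with bus frequency       */
    double p_term = p_term_max * util;        /* rises as transfers take longer */
    return p_bank + p_io + p_term;
}

int main(void) {
    const double rates[] = { 1333.0, 1066.0, 800.0 };
    for (int i = 0; i < 3; i++)   /* low-bandwidth workload: 1 GB/s */
        printf("%4.0f MT/s -> %.2f power units\n",
               rates[i], dram_power(rates[i], 1000.0));
    return 0;
}
```

Running it for a low-bandwidth workload shows total power falling as the transfer rate drops, because the I/O savings outweigh the added termination power.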

Effects of Voltage Scaling on Power

- Voltage scaling further reduces power because all parts of the memory devices draw less current at lower voltage
- Voltage reduction is possible because stable operation requires lower voltage at lower frequency:

[Figure: Minimum stable voltage for 8 DIMMs in a real system at 1333MHz, 1066MHz, and 800MHz, vs. the Vdd used in the power model (DIMM voltage, 1.0-1.6V)]
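As a rough first-order illustration of why the voltage drop compounds the savings: CMOS-style dynamic switching power scales approximately as V²·f. The operating points below are hypothetical, not the measured minimum-stable voltages from the chart above.

```c
/* Rough first-order check: dynamic switching power ~ V^2 * f.
 * The voltage/frequency pairs are hypothetical operating points; the chart
 * above gives the actual minimum stable voltages measured on the system. */
#include <stdio.h>

static double rel_dyn_power(double v, double f, double v0, double f0) {
    return (v * v * f) / (v0 * v0 * f0);   /* relative to the baseline point */
}

int main(void) {
    /* Hypothetical baseline: 1.5 V at 1333 MT/s. */
    printf("frequency only (1.5 V, 800 MT/s): %.2f\n",
           rel_dyn_power(1.5, 800.0, 1.5, 1333.0));   /* ~0.60 */
    printf("freq + voltage (1.3 V, 800 MT/s): %.2f\n",
           rel_dyn_power(1.3, 800.0, 1.5, 1333.0));   /* ~0.45 */
    return 0;
}
```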

Outline

- Motivation
- Background and Characterization
  - DRAM Operation
  - DRAM Power
  - Frequency and Voltage Scaling
- Performance Effects of Frequency Scaling  ← current section
- Frequency Control Algorithm
- Evaluation and Conclusions

How Much Memory Bandwidth is Needed?

[Figure: Memory bandwidth for SPEC CPU2006, bandwidth/channel (GB/s, 0-7), per benchmark]

Performance Impact of Static Frequency Scaling

[Figure: Performance drop (%) under static frequency scaling, 1333→800 and 1333→1066, across SPEC CPU2006 (full 0-80% scale)]

- Performance impact is proportional to bandwidth demand
- Many workloads tolerate lower frequency with minimal performance drop

Performance Impact of Static Frequency Scaling

[Figure: Same data, zoomed to a 0-8% performance-drop scale]

- Performance impact is proportional to bandwidth demand
- Many workloads tolerate lower frequency with minimal performance drop

Outline

- Motivation
- Background and Characterization
  - DRAM Operation
  - DRAM Power
  - Frequency and Voltage Scaling
- Performance Effects of Frequency Scaling
- Frequency Control Algorithm  ← current section
- Evaluation and Conclusions

Memory Latency Under Load

- At low load, most time is spent in array access and bus transfer
  → small constant offset between the latency curves at different bus frequencies
- As load increases, queueing delay begins to dominate
  → bus frequency significantly affects latency

[Figure: Memory latency (ns, 60-180) as a function of utilized channel bandwidth (0-8000 MB/s) at 800MHz, 1067MHz, and 1333MHz]
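The shape of these curves can be reproduced for intuition with a toy queueing model: a fixed asynchronous array-access time, a bus-transfer time inversely proportional to peak bandwidth, and an M/M/1-style queueing term that explodes as utilization approaches 1. All constants are illustrative assumptions, not the measurements behind the figure.

```c
/* Toy latency model for intuition only: constant array access, frequency-
 * dependent transfer time, and an M/M/1-style queueing term. Constants are
 * illustrative assumptions, not the measurements behind the figure. */
#include <stdio.h>

static double latency_ns(double rate_mts, double used_bw_mbs) {
    const double t_array = 45.0;              /* asynchronous array access, ns */
    double peak_bw = rate_mts * 8.0;          /* 64-bit bus peak, MB/s         */
    double t_xfer  = 64.0 / (peak_bw * 1e-3); /* ns to move one 64B line       */
    double util    = used_bw_mbs / peak_bw;
    double t_queue = t_xfer * util / (1.0 - util); /* blows up near saturation */
    return t_array + t_xfer + t_queue;
}

int main(void) {
    for (double bw = 1000.0; bw <= 6000.0; bw += 1000.0)
        printf("%5.0f MB/s: 800 MT/s = %5.1f ns   1333 MT/s = %5.1f ns\n",
               bw, latency_ns(800.0, bw), latency_ns(1333.0, bw));
    return 0;
}
```

At low bandwidth the two curves differ by only a few nanoseconds of transfer time; near the 800 MT/s saturation point the queueing term dominates and the curves diverge sharply, matching the figure's qualitative behavior.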

Control Algorithm: Demand-Based Switching

[Figure: The same latency-vs-bandwidth curves, with the switching thresholds T800 and T1066 marked on the bandwidth axis]

After each epoch of length Tepoch:
    Measure per-channel bandwidth BW
    if BW < T800:        switch to 800MHz
    else if BW < T1066:  switch to 1066MHz
    else:                switch to 1333MHz
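In code, the whole controller is a few lines per epoch. The sketch below uses the paper's epoch length (10ms) and the BW(0.5, 1) thresholds from the evaluation; the bandwidth-counter and frequency-setting hooks are hypothetical stubs standing in for platform-specific memory controller interfaces.

```c
/* Sketch of the demand-based switching loop. The two platform hooks are
 * stubbed out here (hypothetical); on real hardware they would read memory
 * controller performance counters and reprogram the DIMM clock/voltage. */
#include <stdio.h>
#include <unistd.h>

enum mem_freq { MHZ_800 = 800, MHZ_1066 = 1066, MHZ_1333 = 1333 };

/* Thresholds from the BW(0.5, 1) variant, in MB/s; epoch length 10 ms. */
#define T800_MBS    500.0
#define T1066_MBS  1000.0
#define TEPOCH_US   10000

static double read_channel_bw_mbs(int ch) { (void)ch; return 700.0; } /* stub */
static void set_mem_freq(int ch, enum mem_freq f) {                   /* stub */
    printf("channel %d -> %d MT/s\n", ch, (int)f);
}

static void dvfs_epoch(int channel) {
    usleep(TEPOCH_US);                       /* wait one epoch */
    double bw = read_channel_bw_mbs(channel);
    if (bw < T800_MBS)        set_mem_freq(channel, MHZ_800);
    else if (bw < T1066_MBS)  set_mem_freq(channel, MHZ_1066);
    else                      set_mem_freq(channel, MHZ_1333);
}

int main(void) {
    for (int i = 0; i < 3; i++) dvfs_epoch(0);  /* a few epochs for demo */
    return 0;
}
```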

Implementing V/F Switching

- Halt memory operations
  - Pause requests
  - Put DRAM in self-refresh
  - Stop the DIMM clock
- Transition voltage/frequency
  - Begin voltage ramp
  - Relock memory controller PLL at the new frequency
  - Restart DIMM clock
  - Wait for DIMM PLLs to relock
- Begin memory operations
  - Take DRAM out of self-refresh
  - Resume requests

Implementing V/F Switching

Feasibility notes on the sequence above:
- Memory frequency is already adjustable statically ✓
- Voltage regulators for CPU DVFS can work for memory DVFS ✓
- The full transition takes ~20µs
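Written out as code, the transition is a strict sequence of steps. Every hardware operation below is a printing stub (hypothetical); on a real platform these would be memory controller and voltage regulator commands, so this is a sketch of the ordering only, not a real driver.

```c
/* The V/F transition sequence from the slide, expressed as code. All hardware
 * operations are printing stubs; real implementations are platform-specific. */
#include <stdio.h>

static void step(const char *s) { printf("%s\n", s); } /* stand-in for HW ops */

/* The full sequence takes roughly 20 microseconds on the paper's platform. */
static void memory_vf_switch(unsigned new_mts, double new_vdd) {
    /* 1. Halt memory operations */
    step("pause requests");
    step("put DRAM in self-refresh");  /* DRAM keeps data with no DIMM clock */
    step("stop DIMM clock");

    /* 2. Transition voltage/frequency */
    printf("ramp voltage to %.2f V\n", new_vdd);
    printf("relock memory controller PLL at %u MT/s\n", new_mts);
    step("restart DIMM clock");
    step("wait for DIMM PLLs to relock");

    /* 3. Resume memory operations */
    step("take DRAM out of self-refresh");
    step("resume requests");
}

int main(void) { memory_vf_switch(800, 1.3); return 0; }
```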

Outline

- Motivation
- Background and Characterization
  - DRAM Operation
  - DRAM Power
  - Frequency and Voltage Scaling
- Performance Effects of Frequency Scaling
- Frequency Control Algorithm
- Evaluation and Conclusions  ← current section

Evaluation Methodology

- Real-system evaluation
  - Dual 4-core Intel Xeon®, 3 memory channels/socket
  - 48 GB of DDR3 (12 DIMMs, 4GB dual-rank, 1333MHz)
- Emulating memory frequency for performance
  - Altered memory controller timing registers (tRC, tB2BCAS)
  - Gives performance equivalent to slower memory frequencies
- Modeling power reduction
  - Measure baseline system (AC power meter, 1s samples)
  - Compute reductions with an analytical model (see paper)

Evaluation Methodology

- Workloads
  - SPEC CPU2006: CPU-intensive workloads
  - All cores run a copy of the benchmark
- Parameters
  - Tepoch = 10ms
  - Two variants of the algorithm with different switching thresholds:
    - BW(0.5, 1): T800 = 0.5GB/s, T1066 = 1GB/s
    - BW(0.5, 2): T800 = 0.5GB/s, T1066 = 2GB/s
      → more aggressive frequency/voltage scaling

Performance Impact of Memory DVFS

- Minimal performance degradation: 0.2% (avg), 1.7% (max)
- Experimental error is ~1%

[Figure: Performance degradation (%, -1 to 4) for BW(0.5,1) and BW(0.5,2) across SPEC CPU2006 and the average]

Memory Frequency Distribution

- Frequency distribution shifts toward higher memory frequencies with more memory-intensive benchmarks

[Figure: Fraction of time (0-100%) spent at 800, 1066, and 1333MHz for each SPEC CPU2006 benchmark]

Memory Power Reduction

- Memory power reduces by 10.4% (avg), 20.5% (max)

[Figure: Memory power reduction (%, 0-25) for BW(0.5,1) and BW(0.5,2) across SPEC CPU2006 and the average]

System Power Reduction

- As a result, system power reduces by 1.9% (avg), 3.5% (max)

[Figure: System power reduction (%, 0-4) for BW(0.5,1) and BW(0.5,2) across SPEC CPU2006 and the average]

System Energy Reduction

- System energy reduces by 2.4% (avg), 5.1% (max)

[Figure: System energy reduction (%, -1 to 6) for BW(0.5,1) and BW(0.5,2) across SPEC CPU2006 and the average]

Related Work

- MemScale [Deng11], concurrent work (ASPLOS 2011)
  - Also proposes memory DVFS
  - Uses an application performance-impact model to decide voltage and frequency: requires specific modeling for a given system; our bandwidth-based approach avoids this complexity
  - Simulation-based evaluation; our work is a real-system proof of concept
- Memory sleep states: creating opportunity with data placement [Lebeck00, Pandey06], OS scheduling [Delaluz02], VM subsystem [Huang05]; making better decisions with better models [Hur08, Fan01]
- Power limiting/shifting: RAPL [David10] uses memory throttling for thermal limits; CPU throttling for memory traffic [Lin07, Lin08]; power shifting across the system [Felter05]

Conclusions

- Memory power is a significant component of system power
  - 19% average in our evaluation system, 40% in other work
- Workloads often keep memory active but underutilized
  - Channel bandwidth demands are highly variable
  - Use of memory sleep states is often limited
- Scaling memory frequency/voltage can reduce memory power with minimal system performance impact
  - 10.4% average memory power reduction
  - Yields 2.4% average system energy reduction
- Greater reductions are possible with a wider frequency/voltage range and better control algorithms

Memory Power Management via Dynamic Voltage/Frequency Scaling

Howard David (Intel) Eugene Gorbatov (Intel) Ulf R. Hanebutte (Intel)

Chris Fallin (CMU) Onur Mutlu (CMU)

Why Real-System Evaluation?

- Advantages:
  - Captures all effects of altered memory performance
    - System/kernel code, interactions with IO and peripherals, etc.
  - Able to run full-length benchmarks (SPEC CPU2006) rather than short instruction traces
  - No concerns about architectural simulation fidelity
- Disadvantages:
  - More limited room for novel algorithms and detailed measurements
  - Inherent experimental error due to background-task noise, real power measurements, and nondeterministic timing effects
- For a proof of concept, we chose to run on a real system in order to have results that capture all potential side effects of altering memory frequency

CPU-Bound Applications in a DRAM-Rich System

- We evaluate CPU-bound workloads with 12 DIMMs: what about smaller memory, or IO-bound workloads?
- 12 DIMMs (48GB): are we magnifying the problem?
  - Large servers can have this much memory, especially for database or enterprise applications
  - Memory can be up to 40% of system power [1,2], and reducing its power in general is an academically interesting problem
- CPU-bound workloads: will it matter in real life?
  - Many workloads have CPU-bound phases (e.g., database scan or business logic in server workloads)
  - Focusing on CPU-bound workloads isolates the problem of varying memory bandwidth demand while memory cannot enter sleep states, and our solution applies to any compute phase of a workload

[1] L. A. Barroso and U. Hölzle. "The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines." Synthesis Lectures on Computer Architecture. Morgan & Claypool, 2009.
[2] C. Lefurgy et al. "Energy Management for Commercial Servers." IEEE Computer, pp. 39-48, December 2003.

Combining Memory & CPU DVFS?

- Our evaluation did not incorporate CPU DVFS:
  - Need to understand the effect of a single knob (memory DVFS) first
  - Combining with CPU DVFS might produce second-order effects that would need to be accounted for
- Nevertheless, memory DVFS is effective by itself, and mostly orthogonal to CPU DVFS:
  - Each knob reduces power in a different component
  - Our memory DVFS algorithm has negligible performance impact → negligible impact on CPU DVFS
  - CPU DVFS will only further reduce bandwidth demands relative to our evaluations → no negative impact on memory DVFS

Why is this Autonomic Computing?

- Power management in general is autonomic: a system observes its own needs and adjusts its behavior accordingly
  → Lots of previous work comes from the architecture community, but crossover in ideas and approaches could be beneficial
- This work exposes a new knob for control algorithms to turn, provides a simple model for the power/energy effects of that knob, and observes the opportunity to apply it in a simple way
- Exposes future work for:

  - More advanced control algorithms
  - Coordinated energy efficiency across the rest of the system
  - Coordinated energy efficiency across a cluster/datacenter, integrated with memory DVFS, CPU DVFS, etc.