Memory Power Management via Dynamic Voltage/Frequency Scaling
Howard David (Intel) Eugene Gorbatov (Intel) Ulf R. Hanebutte (Intel)
Chris Fallin (CMU) Onur Mutlu (CMU)
Memory Power is Significant
- Power consumption is a primary concern in modern servers
- Many works: CPU, whole-system, or cluster-level approaches
- But memory power is largely unaddressed
- In our server system*, memory is 19% of system power (avg)
  - Some work notes up to 40% of total system power

[Figure: System power vs. memory power (W, 0–400) across SPEC CPU2006 benchmarks, from lbm (most memory-intensive) to hmmer (least)]

Goal: Can we reduce this figure?

*Dual 4-core Intel Xeon®, 48GB DDR3 (12 DIMMs), SPEC CPU2006, all cores active. Measured AC power; analytically modeled memory power.
Existing Solution: Memory Sleep States?
- Most memory energy-efficiency work uses sleep states
  - Shut down DRAM devices when no memory requests are active
- But even low-memory-bandwidth workloads keep memory awake
  - Idle periods between requests diminish in multicore workloads
  - CPU-bound workloads/phases are rarely completely cache-resident

[Figure: Time Spent in Sleep States — sleep state residency stays below 8% across all SPEC CPU2006 benchmarks]
Memory Bandwidth Varies Widely
- Workload memory bandwidth requirements vary widely

[Figure: Memory Bandwidth for SPEC CPU2006 — bandwidth/channel (GB/s), 0–8 GB/s scale]

- Memory system is provisioned for peak capacity → often underutilized
Memory Power can be Scaled Down
- DDR can operate at multiple frequencies → reduced power
  - Lower frequency directly reduces switching power
  - Lower frequency allows for lower voltage
  - Comparable to CPU DVFS:

      CPU voltage/freq.  ↓ 15%  →  system power ↓ 9.9%
      Memory freq.       ↓ 40%  →  system power ↓ 7.6%

- Frequency scaling increases latency → reduced performance
  - Memory storage array is asynchronous
  - But bus transfer depends on frequency
  - When bus bandwidth is the bottleneck, performance suffers
Observations So Far
- Memory power is a significant portion of total power
  - 19% (avg) in our system; up to 40% noted in other works
- Sleep state residency is low in many workloads
  - Multicore workloads reduce idle periods
  - CPU-bound applications send requests frequently enough to keep memory devices awake
- Memory bandwidth demand is very low in some workloads
- Memory power is reduced by frequency scaling
  - And voltage scaling can give further reductions
DVFS for Memory
- Key idea: observe memory bandwidth utilization, then adjust memory frequency/voltage to reduce power with minimal performance loss → Dynamic Voltage/Frequency Scaling (DVFS) for memory
- Goal in this work: implement DVFS in the memory system, by:
  - Developing a simple control algorithm to exploit the opportunity for reduced memory frequency/voltage by observing behavior
  - Evaluating the proposed algorithm on a real system
Outline
- Motivation
- Background and Characterization
  - DRAM Operation
  - DRAM Power
  - Frequency and Voltage Scaling
- Performance Effects of Frequency Scaling
- Frequency Control Algorithm
- Evaluation and Conclusions
DRAM Operation
- Main memory consists of DIMMs of DRAM devices
- Each DIMM is attached to a memory bus (channel)
- Multiple DIMMs can connect to one channel

[Diagram: eight x8 DRAM devices on a DIMM, sharing a 64-bit memory bus to the memory controller]
Inside a DRAM Device
- Banks (row decoder, sense amps, column decoder)
  - Independent arrays
  - Asynchronous: independent of memory bus speed
- I/O circuitry (receivers, drivers, registers, write FIFO)
  - Runs at bus speed
  - Clock sync/distribution
  - Bus drivers and receivers
  - Buffering/queueing
- On-die termination (ODT)
  - Required by bus electrical characteristics for reliable operation
  - Resistive element that dissipates power when the bus is active
Effect of Frequency Scaling on Power
Reduced memory bus frequency:
- Does not affect bank power
  - Constant energy per operation
  - Depends only on utilized memory bandwidth
- Decreases I/O power
  - Dynamic power in the bus interface and clock circuitry falls due to less frequent switching
- Increases termination power
  - The same data takes longer to transfer
  - Hence, bus utilization increases
- The tradeoff between I/O and termination results in a net power reduction at lower frequencies
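The tradeoff above can be sketched numerically. This toy model and its coefficients are illustrative assumptions, not the paper's analytical power model: I/O power is taken as proportional to bus frequency, and termination power as proportional to bus utilization.

```python
def io_power(freq_mhz, k_io=0.01):
    # Dynamic I/O and clock power: switches every cycle, so it scales
    # with bus frequency. k_io is an illustrative coefficient (W/MHz).
    return k_io * freq_mhz

def termination_power(bw_gbps, freq_mhz, p_active=2.0):
    # ODT dissipates while the bus is active; the same data occupies
    # the bus longer at a lower frequency, raising utilization.
    # p_active (W at 100% utilization) is an illustrative coefficient.
    peak_gbps = freq_mhz * 8 / 1000.0   # 64-bit DDR bus: 8 bytes/transfer
    return p_active * min(bw_gbps / peak_gbps, 1.0)

def bus_power(bw_gbps, freq_mhz):
    return io_power(freq_mhz) + termination_power(bw_gbps, freq_mhz)

# The two components move in opposite directions when frequency drops...
assert io_power(800) < io_power(1333)                              # I/O down
assert termination_power(2.0, 800) > termination_power(2.0, 1333)  # termination up
# ...but net bus power is still lower at 800MHz under a light load:
assert bus_power(0.5, 800) < bus_power(0.5, 1333)
```

Bank power is deliberately absent: it depends only on the utilized bandwidth, not the bus frequency, so it cancels out of the comparison.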
Effects of Voltage Scaling on Power
- Voltage scaling further reduces power because all parts of the memory devices draw less current at lower voltage
- Voltage reduction is possible because stable operation requires a lower voltage at a lower frequency

[Figure: Minimum stable DIMM voltage for 8 DIMMs in a real system at 1333MHz, 1066MHz, and 800MHz (1.0–1.6V scale), with the Vdd values used in the power model]
How Much Memory Bandwidth is Needed?

[Figure: Memory Bandwidth for SPEC CPU2006 — bandwidth/channel (GB/s), from ~7 GB/s (lbm) down to near zero (hmmer)]
Performance Impact of Static Frequency Scaling

[Figure: Performance drop (%) under static frequency scaling, 1333→800 and 1333→1066, across SPEC CPU2006; shown at full scale (0–80%) and zoomed in (0–8%)]

- Performance impact is proportional to bandwidth demand
- Many workloads tolerate lower frequency with minimal performance drop
Memory Latency Under Load
- At low load, most time is spent in array access and bus transfer → small constant offset between the bus-frequency latency curves
- As load increases, queueing delay begins to dominate → bus frequency significantly affects latency

[Figure: Memory latency (ns, 60–180) as a function of utilized channel bandwidth (MB/s, 0–8000) at 800MHz, 1067MHz, and 1333MHz]
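The shape of the measured curves can be reproduced with a toy queueing sketch. This is not the paper's data: the constants (array time, queueing coefficient) are made-up assumptions chosen only to show the two regimes, a near-constant offset at low load and a blow-up as the slower bus saturates.

```python
def latency_ns(bw_mbps, freq_mhz, array_ns=45.0, k_q=10.0):
    """Toy latency model: a fixed array-access time, a bus-transfer
    term that shrinks with frequency, and a queueing term that blows
    up as utilized bandwidth approaches the channel's peak."""
    peak_mbps = freq_mhz * 8.0                 # 64-bit DDR bus: 8 bytes/transfer
    transfer_ns = 64 / (peak_mbps / 1000.0)    # one 64-byte line
    rho = min(bw_mbps / peak_mbps, 0.999)      # bus utilization
    queueing_ns = k_q * rho / (1.0 - rho)      # grows without bound as rho -> 1
    return array_ns + transfer_ns + queueing_ns

# Low load: the 800MHz and 1333MHz curves differ by a small offset
low_gap = latency_ns(500, 800) - latency_ns(500, 1333)
# High load: 800MHz nears saturation, so the gap widens sharply
high_gap = latency_ns(5500, 800) - latency_ns(5500, 1333)
assert 0 < low_gap < high_gap
```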
Control Algorithm: Demand-Based Switching

[Figure: the same latency-vs-bandwidth curves, with the switching thresholds T800 and T1066 marked on the bandwidth axis]

After each epoch of length Tepoch:
  Measure per-channel bandwidth BW
  if BW < T800:        switch to 800MHz
  else if BW < T1066:  switch to 1066MHz
  else:                switch to 1333MHz
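The per-epoch decision above is a three-way threshold test. A minimal sketch in Python, using the threshold values of the BW(0.5, 1) and BW(0.5, 2) variants from the evaluation (the function name and defaults are ours; the policy is the slide's):

```python
def pick_frequency(bw_gbps, t800=0.5, t1066=1.0):
    """Choose a memory frequency (MHz) from the measured per-channel
    bandwidth for the last epoch, using thresholds T800 and T1066."""
    if bw_gbps < t800:
        return 800
    elif bw_gbps < t1066:
        return 1066
    else:
        return 1333

# BW(0.5, 1): T800 = 0.5 GB/s, T1066 = 1 GB/s
assert pick_frequency(0.3) == 800
assert pick_frequency(0.7) == 1066
assert pick_frequency(5.0) == 1333

# BW(0.5, 2) is more aggressive: loads up to 2 GB/s stay at 1066MHz
assert pick_frequency(1.5, t1066=2.0) == 1066
```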
Implementing V/F Switching
- Halt memory operations
  - Pause requests
  - Put DRAM in self-refresh
  - Stop the DIMM clock
- Transition voltage/frequency
  - Begin voltage ramp
  - Relock memory controller PLL at the new frequency
  - Restart the DIMM clock
  - Wait for DIMM PLLs to relock
- Resume memory operations
  - Take DRAM out of self-refresh
  - Resume requests

Notes:
- Memory frequency is already adjustable statically
- Voltage regulators designed for CPU DVFS can work for memory DVFS
- A full transition takes ~20µs
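The switching sequence can be written down as a fixed-order procedure. The hardware hooks below are stand-in stubs (no such API is given in the talk); only the step ordering and the ~20µs total come from the slides.

```python
# V/F switching sequence sketch. Each step records its name so the
# required ordering can be checked; a real implementation would poke
# memory-controller and voltage-regulator registers instead.
log = []

def step(name):
    log.append(name)   # stand-in for the actual hardware operation

def switch_memory_vf(new_freq_mhz):
    # 1. Halt memory operations
    step("pause requests")
    step("enter self-refresh")
    step("stop DIMM clock")
    # 2. Transition voltage/frequency
    step("begin voltage ramp")
    step(f"relock controller PLL @ {new_freq_mhz}MHz")
    step("restart DIMM clock")
    step("wait for DIMM PLLs to relock")
    # 3. Resume memory operations (full transition ~20us end to end)
    step("exit self-refresh")
    step("resume requests")

switch_memory_vf(800)
assert log[0] == "pause requests" and log[-1] == "resume requests"
assert log.index("enter self-refresh") < log.index("exit self-refresh")
```

Because DRAM sits in self-refresh for the whole transition, no requests are lost; they are simply delayed by the ~20µs window, which the 10ms epoch length amortizes away.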
Evaluation Methodology
- Real-system evaluation
  - Dual 4-core Intel Xeon®, 3 memory channels/socket
  - 48 GB of DDR3 (12 DIMMs, 4GB dual-rank, 1333MHz)
- Emulating memory frequency for performance
  - Altered memory controller timing registers (tRC, tB2BCAS)
  - Gives performance equivalent to slower memory frequencies
- Modeling power reduction
  - Measure baseline system (AC power meter, 1s samples)
  - Compute reductions with an analytical model (see paper)
Evaluation Methodology
- Workloads
  - SPEC CPU2006: CPU-intensive workloads
  - All cores run a copy of the benchmark
- Parameters
  - Tepoch = 10ms
  - Two variants of the algorithm with different switching thresholds:
    - BW(0.5, 1): T800 = 0.5GB/s, T1066 = 1GB/s
    - BW(0.5, 2): T800 = 0.5GB/s, T1066 = 2GB/s → more aggressive frequency/voltage scaling
Performance Impact of Memory DVFS
- Minimal performance degradation: 0.2% (avg), 1.7% (max)
- Experimental error is ~1%

[Figure: Performance degradation (%) per benchmark and AVG, for BW(0.5,1) and BW(0.5,2); y-axis −1% to 4%]
Memory Frequency Distribution
- The frequency distribution shifts toward higher memory frequencies with more memory-intensive benchmarks

[Figure: Per-benchmark fraction of time spent at 800, 1066, and 1333MHz (stacked bars, 0–100%)]
Memory Power Reduction
- Memory power reduces by 10.4% (avg), 20.5% (max)

[Figure: Memory power reduction (%) per benchmark and AVG, for BW(0.5,1) and BW(0.5,2); 0–25% scale]
System Power Reduction
- As a result, system power reduces by 1.9% (avg), 3.5% (max)

[Figure: System power reduction (%) per benchmark and AVG, for BW(0.5,1) and BW(0.5,2); 0–4% scale]
System Energy Reduction
- System energy reduces by 2.4% (avg), 5.1% (max)

[Figure: System energy reduction (%) per benchmark and AVG, for BW(0.5,1) and BW(0.5,2); −1% to 6% scale]
Related Work
- MemScale [Deng11], concurrent work (ASPLOS 2011)
  - Also proposes memory DVFS
  - Uses an application performance impact model to decide voltage and frequency: requires specific modeling for a given system; our bandwidth-based approach avoids this complexity
  - Simulation-based evaluation; our work is a real-system proof of concept
- Memory sleep states
  - Creating opportunity with data placement [Lebeck00, Pandey06], OS scheduling [Delaluz02], VM subsystem [Huang05]
  - Making better decisions with better models [Hur08, Fan01]
- Power limiting/shifting
  - RAPL [David10] uses memory throttling for thermal limits
  - CPU throttling for memory traffic [Lin07, Lin08]
  - Power shifting across the system [Felter05]
Conclusions
- Memory power is a significant component of system power
  - 19% average in our evaluation system; 40% in other work
- Workloads often keep memory active but underutilized
  - Channel bandwidth demands are highly variable
  - Use of memory sleep states is often limited
- Scaling memory frequency/voltage can reduce memory power with minimal system performance impact
  - 10.4% average memory power reduction
  - Yields 2.4% average system energy reduction
- Greater reductions are possible with a wider frequency/voltage range and better control algorithms
Why Real-System Evaluation?
- Advantages:
  - Captures all effects of altered memory performance
    - System/kernel code, interactions with IO and peripherals, etc.
  - Able to run full-length benchmarks (SPEC CPU2006) rather than short instruction traces
  - No concerns about architectural simulation fidelity
- Disadvantages:
  - More limited room for novel algorithms and detailed measurements
  - Inherent experimental error due to background-task noise, real power measurements, and nondeterministic timing effects
- For a proof of concept, we chose to run on a real system in order to have results that capture all potential side effects of altering memory frequency
CPU-Bound Applications in a DRAM-rich System
- We evaluate CPU-bound workloads with 12 DIMMs: what about smaller memory, or IO-bound workloads?
- 12 DIMMs (48GB): are we magnifying the problem?
  - Large servers can have this much memory, especially for database or enterprise applications
  - Memory can be up to 40% of system power [1,2], and reducing its power in general is an academically interesting problem
- CPU-bound workloads: will it matter in real life?
  - Many workloads have CPU-bound phases (e.g., database scan or business logic in server workloads)
  - Focusing on CPU-bound workloads isolates the problem of varying memory bandwidth demand while memory cannot enter sleep states, and our solution applies to any compute phase of a workload

[1] L. A. Barroso and U. Hölzle. "The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines." Synthesis Lectures on Computer Architecture. Morgan & Claypool, 2009.
[2] C. Lefurgy et al. "Energy Management for Commercial Servers." IEEE Computer, pp. 39–48, December 2003.
Combining Memory & CPU DVFS?
- Our evaluation did not incorporate CPU DVFS:
  - Need to understand the effect of a single knob (memory DVFS) first
  - Combining with CPU DVFS might produce second-order effects that would need to be accounted for
- Nevertheless, memory DVFS is effective by itself, and mostly orthogonal to CPU DVFS:
  - Each knob reduces power in a different component
  - Our memory DVFS algorithm has negligible performance impact → negligible impact on CPU DVFS
  - CPU DVFS will only further reduce bandwidth demands relative to our evaluations → no negative impact on memory DVFS
Why is this Autonomic Computing?
- Power management in general is autonomic: a system observes its own needs and adjusts its behavior accordingly → much previous work comes from the architecture community, but crossover in ideas and approaches could be beneficial
- This work exposes a new knob for control algorithms to turn, provides a simple model for the power/energy effects of that knob, and observes the opportunity to apply it in a simple way
- It exposes future work on:
  - More advanced control algorithms
  - Coordinated energy efficiency across the rest of the system
  - Coordinated energy efficiency across a cluster/datacenter, integrated with memory DVFS, CPU DVFS, etc.