SER2343BU
Extreme Performance Series: vSphere Compute & Memory Schedulers
Xunjia Lu - VMware, Inc
#VMworld #SER2343BU
VMworld 2017 Content: Not for publication or distribution
Disclaimer
• This presentation may contain product features that are currently under development.
• This overview of new technology represents no commitment from VMware to deliver these features in any generally available product.
• Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.
• Technical feasibility and market demand will affect final delivery.
• Pricing and packaging for any new technologies or features discussed or presented have not been determined.
#SER2343BU CONFIDENTIAL
Agenda
1 CPU Scheduler
2 Memory Management
3 NUMA Scheduler
4 VM Sizing and Host Configuration
CPU Scheduler Overview
• Goals
  – High CPU utilization, high application throughput
  – Ensure fairness (shares, reservation, limit)
CPU Scheduler Overview
• When?
  – Idle PCPUs have new runnable worlds (wakeup: VM power on, etc.)
  – The running world voluntarily yields the CPU (wait: idle/non-idle)
  – The running world involuntarily gives up the CPU (preemption: higher priority / fair share reached)
• What?
  – The world in the ready queue with the least consumed CPU time / fair share
• Where?
  – Balance load across PCPUs
  – Preserve cache state, minimize migration cost
  – Avoid HT/LLC contention between sibling vCPUs
  – Close to worlds with frequent communication patterns
Scheduling Through the Lens of esxtop
A Command Line Tool for Performance Monitoring
• For real-time monitoring
  – Just type esxtop in the ESXi shell / terminal
• For batch-mode collection
  – esxtop -b -a -d $DELAY -n $SAMPLES > $FILE_NAME.csv
  – Tools: perfmon, excel
• A few changes in 6.5
  – Processor turbo or frequency-scaling efficiency (%APERF/MPERF)
  – More intuitive accounting
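Batch-mode output is a PDH-style CSV whose header row names every counter; a minimal Python sketch for pulling one VM's CPU "% Used" series out of such a capture (the host name, group id, and exact counter wording in real esxtop files vary, so the matching rule here is an assumption, not the official format):

```python
import csv
import io

def used_series(csv_text, vm_name):
    r"""Extract the CPU '% Used' samples for one VM group from an
    esxtop batch-mode capture (esxtop -b -a -d $DELAY -n $SAMPLES).
    Counter names roughly follow the PDH-CSV convention,
    e.g. \\host\Group Cpu(<id>:<vm>)\% Used."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, samples = rows[0], rows[1:]
    # Pick the column whose counter name carries this VM and '% Used'.
    col = next(i for i, name in enumerate(header)
               if f":{vm_name})" in name and "% Used" in name)
    return [float(row[col]) for row in samples]
```

From here the series can be averaged or plotted, which is exactly what the perfmon/excel workflow on the slide does by hand.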
Peeking into a Virtual Machine Using esxtop
[esxtop screenshot]
A virtual machine consists of more than vCPU worlds.
CPU Scheduler Accounting: %USED vs. %RUN
• %USED vs. %RUN (UTIL)
  – For the same 2 seconds, a world achieves different amounts of work
  – RUN (UTIL) is based on wall-clock time (TSC)
  – USED reflects frequency scaling (power, turbo) and hyper-thread contention
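As a rough sketch of that relationship (ignoring %SYS, %OVRLP, and HT effects, which the full accounting formula later in this deck adds), %USED scales %RUN by the effective-frequency ratio that esxtop 6.5 reports as %APERF/MPERF; the function and its inputs are illustrative:

```python
def used_from_run(pct_run, aperf_over_mperf):
    """Illustrative only: scale scheduled time (%RUN) by the
    effective-frequency ratio (APERF/MPERF). Turbo pushes the
    ratio above 1.0; power saving pushes it below."""
    return pct_run * aperf_over_mperf
```

So a world with 100% %RUN can report %USED above 100 under turbo, or well below 100 when throttled.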
CPU Scheduler: Throughput Gain due to Hyperthreading
[Chart: normalized throughput gain of SPEC CPU2006 benchmarks (perlbench, bzip2, gcc, mcf, gobmk, hmmer, sjeng, libquantum, h264ref) in 1-vCPU VMs on Haswell, baseline vs. HT]
CPU Scheduler: Slowdown due to Hyperthreading
[Chart: runtime in minutes of the same SPEC CPU2006 benchmarks in 1-vCPU VMs on Haswell, baseline vs. HT; per-benchmark slowdowns range from 1.5x to 1.9x]
CPU Scheduler: Hyperthreading and %USED Time
• Hyperthreading improves throughput
  – But each vCPU might run slower due to contention from its hyper-twin
  – 2 cores vs. 2 HTs on a single core
• Hyperthreading-aware scheduling
  – The same %RUN translates into different amounts of %USED
  – 100% %RUN may only give 70% %USED in case of contention
• HT is enabled by default for the extra throughput
• Be aware of HT contention. ☺
CPU Scheduler Accounting: Breakdown
%USED = %RUN + %SYS - %OVRLP - E
[Timeline diagram, t0 through t8:
  A – time in ready queue (%RDY)
  W – waiting (%WAIT)
  C – actual execution (%RUN)
  D – CPU scheduling cost; %SYS += D if it is for this VM
  E – efficiency loss from power mgmt, hyper-threading, etc.
  %OVRLP – time the running world was interrupted]
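The breakdown can be sanity-checked numerically; a minimal sketch of the identity, with E standing for the efficiency loss (the sample percentages are illustrative, not measurements):

```python
def pct_used(pct_run, pct_sys, pct_ovrlp, eff_loss):
    """%USED = %RUN + %SYS - %OVRLP - E, where E is the efficiency
    loss from power management and hyper-thread contention."""
    return pct_run + pct_sys - pct_ovrlp - eff_loss
```

With 100% %RUN, 2% %SYS charged to this VM, 2% %OVRLP, and a 30-point HT/power loss, %USED comes out at 70%, matching the "100% %RUN may only give 70% %USED" rule of thumb from the hyperthreading slide.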
CPU Scheduler Accounting: Time from Kernel Contexts
NEW!
[esxtop screenshots comparing kernel-context time accounting on vSphere 6.0 vs. vSphere 6.5]
CPU Scheduler Accounting: Group vs. World
[esxtop screenshot of a VM with 128 vCPUs]
Group (VM) stats aggregate world stats.
%RDY Impact on Throughput (Java Workload)
[Chart: normalized throughput (bops) vs. %RDY from 0 to 20; throughput drops roughly 15% across the range]
%RDY affects throughput.
%RDY Impact on Latency (Redis Workload)
[Charts: 99.99th-percentile latency (msec) vs. %RDY, against a "spiky" competing workload (annotated -7ms) and a "flat" one (annotated -4ms)]
Latency depends on the competing workloads.
CPU Scheduler: Co-scheduling
• *NOT* gang-scheduling
  – Allows a subset of vCPUs to run simultaneously
  – Costops a leading vCPU if it advances too far ahead
  – Efficient in a consolidated setup
• High %CSTOP?
  – Any %RDY time?
  – Watch out for the vCPU's (WAIT - WAIT_IDLE), i.e. %VMWAIT from esxtop
    • A vCPU blocks due to IO (to snapshot) or host-level memory swap
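A toy sketch of the relaxed co-scheduling idea above: only vCPUs that have advanced too far ahead of their slowest sibling get costopped, while the rest keep running. The skew metric and threshold here are illustrative, not the ESXi internals:

```python
def vcpus_to_costop(progress_ms, skew_threshold_ms):
    """Relaxed co-scheduling, not gang scheduling: costop only the
    vCPUs whose accumulated guest progress leads the slowest
    sibling by more than the allowed skew. progress_ms maps
    vCPU name -> progress in milliseconds (hypothetical units)."""
    slowest = min(progress_ms.values())
    return [v for v, p in progress_ms.items()
            if p - slowest > skew_threshold_ms]
```

When all siblings stay within the skew bound, nothing is costopped, which is why co-scheduling stays efficient in consolidated setups.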
Memory Management Overview
• Goals
  – Allow memory over-commitment
  – Handle transient memory pressure well
• Terminology
  [Diagram of nested sets: Total Memory Size contains Allocated Memory, which contains Active Memory; allocated-but-inactive memory is Idle Memory, and the unallocated remainder is Free Memory]
Memory Management Overview
• Reclaim memory if consumed > entitled
  – Entitlement: shares, limit, reservation, active estimation
  – Order: Page sharing > Ballooning > Compression > Host swapping
  – Breaks host large pages
• Page sharing vs. large pages
  – Using large pages for both guest and ESXi improves performance by 10-30%
  – Page sharing avoids ballooning and swapping
  – vSphere 6.0 breaks large pages earlier and increases page sharing (clear state)
Transient Memory Pressure Example
• Six 4GB Swingbench VMs (VM-4,5,6 are idle) in a 16GB host
[Charts over ~56 minutes: operations per minute for VM1-VM3 (ΔVM1 = 0%, ΔVM2 = 0%), and reclaimed memory size in GB (Balloon, Swap Used, Compressed, Shared)]
Constant Memory Pressure Example
• All six VMs run Swingbench workloads
[Charts over ~56 minutes: operations per minute for VM1-VM6 (ΔVM1 = -16%, ΔVM2 = -21%), and host swap-in rate in KB per second]
General Principles
• Two types of memory overcommitment
  – "Configured" memory overcommitment: SUM (memory size of all VMs) / host memory size
  – "Active" memory overcommitment: SUM (mem.active of all VMs) / host memory size
• Performance impact
  – "Active" memory overcommitment ≈ 1: high likelihood of performance degradation!
    • Some active memory is not in physical RAM
  – "Configured" memory overcommitment > 1: zero or negligible impact
    • Most reclaimed memory is free/idle guest memory
• Aim for high consolidation while keeping down "active" memory overcommitment
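The two ratios above are straightforward to compute from per-VM stats; a small sketch (the VM sizes in the usage note are taken from the Swingbench example, the active sizes are assumed):

```python
def overcommit_ratios(vms_gb, host_mem_gb):
    """vms_gb: list of (configured_gb, active_gb) pairs, one per VM.
    Returns the ('configured', 'active') memory overcommitment
    ratios as defined on the slide: each SUM / host memory size."""
    configured = sum(c for c, _ in vms_gb) / host_mem_gb
    active = sum(a for _, a in vms_gb) / host_mem_gb
    return configured, active
```

Six 4 GB VMs with roughly 1 GB active each on a 16 GB host give a configured overcommitment of 1.5 but an active overcommitment of only 0.375 - high consolidation with active overcommitment well below 1, which is the target the slide recommends.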
NUMA
• Non-Uniform Memory Access system architecture
  – Each node consists of CPU cores, memory, and possibly devices
  – Access time can be 30% ~ 200% longer across nodes
• NUMA node vs. sockets
  – Multiple NUMA nodes per socket (Cluster-on-Die)
  – Multiple sockets per NUMA node (less common)
• Small VMs are scheduled on a single physical NUMA node
  – 100% local memory accesses
  – "Fixes" scale-out apps that don't scale up well
  – Consider sizing databases to fit
[Diagram: NUMA node 0 and NUMA node 1]
NUMA Scheduler: Overview
• Load balancing
  – Balance VMs across different NUMA nodes
• vNUMA
  – Properly expose a virtualized NUMA topology to guest VMs for best performance
NUMA Scheduler: Load Balancing
• Initial placement
  – Based on CPU/memory load + round-robin
• Periodic rebalancing algorithm
  – Every 2 seconds, try an incremental move (1 or 2 VMs)
  – To improve load balance / memory locality / relation sharing / fairness
NUMA Rebalancing In Action (TPCx-V)
[Charts: PCPU number (0-120) over time for vm2-vm13, split into Group-1 through Group-4, showing the rebalancer migrating VMs across NUMA nodes]
NUMA Scheduler: Impact of Cluster-on-Die (Haswell)
• Cluster-on-Die
  – Breaks each socket into 2 NUMA domains
• Performance considerations
  – Lower LLC hit latency and local memory latency
  – Higher local memory bandwidth
• vSphere 6.5 supports Haswell, Broadwell, and future generations of processors
[Chart: SPECjbb2015 normalized throughput gain, default vs. CoD, for 36Vx1, 18Vx1, and 9Vx2 configurations]
NUMA Scheduler: vNUMA
• vNUMA
  – Useful for wide VMs (#vCPUs > #cores/NUMA node)
  – Expose a virtual NUMA topology to improve memory locality and guest scheduling
• Example: a 10-vCPU VM maps directly to 2 pNUMA nodes (out of 4)
[Diagram: four pNUMA nodes of 6 cores each (C0-C5); the VM's vCPUs occupy two of them]
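The wide-VM rule above reduces to a node-count calculation; this sketch gives only the minimum node count, while ESXi's actual placement also weighs load and locality:

```python
import math

def min_vnuma_nodes(n_vcpus, cores_per_pnuma_node):
    """A VM is 'wide' when #vCPUs exceeds the cores per physical
    NUMA node; it then spans multiple (v)NUMA nodes."""
    return max(1, math.ceil(n_vcpus / cores_per_pnuma_node))
```

For the slide's example, 10 vCPUs on 6-core pNUMA nodes span 2 nodes, while anything with 6 or fewer vCPUs stays on one.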
NEW!
NUMA Scheduler: vNUMA vs. vSocket
• Decoupling vSocket from vNUMA (vSphere 6.5)
  – ESXi will always try to pick the optimal vNUMA topology when possible
    • As long as: vNUMA = N x vSocket, or vice versa
  – Blog post: "Virtual Machine vCPU and vNUMA Rightsizing – Rules of Thumb"
[Diagram: four 2-vCPU vSockets mapped onto two vNUMA nodes, two vSockets per node]
NUMA Scheduler: Virtual CPU Topology Example
[Screenshots of the NUMA topology reported by in-guest tools]
• numactl --hardware
• lstopo -s
• coreinfo -n -s
Host Configuration: Power Management Policy
[Chart: normalized throughput of a 1-vCPU VM Java workload under the High Performance vs. Balanced (P-states + C-states) policies, on IvyBridge and Haswell]
VM Sizing: #vCPUs
• Cost of over-sizing
  – Small CPU overhead per vCPU from periodic timers, etc.
  – May hurt performance due to process migrations (e.g. Redis, up to 40% regression)
• Cost of under-sizing
  – Internal CPU contention
  – Check CPU usage and processor queue length
VM Sizing: vRAM
• Cost of over-sizing
  – Some apps/OSes treat unused memory as cache (e.g. SuperFetch)
    • Increases active memory
    • May suffer from memory reclamation
• Cost of under-sizing
  – Guest-level paging
Summary
• Be aware of the difference between per-VM %RDY and per-world %RDY
• Pay attention to ready time (%RDY) if tail latency matters
• Size VMs based on the number of physical cores instead of hyperthreads
• In vSphere 6.5 and beyond (NEW!)
  – Insignificant %OVRLP, and some %SYS moved to %RUN
  – No more hassle between vSocket and vNUMA
  – Even smaller scheduling overhead
• Avoid active memory overcommit
• Watch out for under-sizing a VM
• Power policy matters!
Extreme Performance Series – Las Vegas
• SER2724BU  Performance Best Practices
• SER2723BU  Benchmarking 101
• SER2343BU  vSphere Compute & Memory Schedulers
• SER1504BU  vCenter Performance Deep Dive
• SER2734BU  Byte Addressable Non-Volatile Memory in vSphere
• SER2849BU  Predictive DRS – Performance & Best Practices
• SER1494BU  Encrypted vMotion Architecture, Performance, & Futures
• STO1515BU  vSAN Performance Troubleshooting
• VIRT1445BU Fast Virtualized Hadoop and Spark on All-Flash Disks
• VIRT1397BU Optimize & Increase Performance Using VMware NSX
• VIRT2550BU Reducing Latency in Enterprise Applications with VMware NSX
• VIRT1052BU Monster VM Database Performance
• VIRT1983BU Cycle Stealing from the VDI Estate for Financial Modeling
• VIRT1997BU Machine Learning and Deep Learning on VMware vSphere
• FUT2020BU  Wringing Max Perf from vSphere for Extremely Demanding Workloads
• FUT2761BU  Sharing High Performance Interconnects across Multiple VMs
Extreme Performance Series – Barcelona
• SER2724BE  Performance Best Practices
• SER2343BE  vSphere Compute & Memory Schedulers
• SER1504BE  vCenter Performance Deep Dive
• SER2849BE  Predictive DRS – Performance & Best Practices
• VIRT1445BE Fast Virtualized Hadoop and Spark on All-Flash Disks
• VIRT1397BE Optimize & Increase Performance Using VMware NSX
• VIRT1052BE Monster VM Database Performance
• FUT2020BE  Wringing Max Perf from vSphere for Extremely Demanding Workloads
Extreme Performance Series – Hands-on Labs
Don't miss these popular Extreme Performance labs:
• HOL-1804-01-SDC: vSphere 6.5 Performance Diagnostics & Benchmarking
  – Each module dives deep into vSphere performance best practices, diagnostics, and optimizations using various interfaces and benchmarking tools.
• HOL-1804-02-CHG: vSphere Challenge Lab
  – Each module places you in a different fictional scenario to fix common vSphere operational and performance problems.
Performance Survey
The VMware Performance Engineering team is always looking for feedback about your experience with the performance of our products, our various tools and interfaces, and where we can improve.
Scan this QR code to access a short survey and provide us direct feedback.
Alternatively: www.vmware.com/go/perf
Thank you!