SER2343BU

Extreme Performance Series: vSphere Compute & Memory Schedulers

Xunjia Lu - VMware, Inc

#VMworld #SER2343BU


Disclaimer

• This presentation may contain product features that are currently under development.
• This overview of new technology represents no commitment from VMware to deliver these features in any generally available product.
• Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.
• Technical feasibility and market demand will affect final delivery.
• Pricing and packaging for any new technologies or features discussed or presented have not been determined.

#SER2343BU CONFIDENTIAL


Agenda

1  CPU Scheduler
2  Memory Management
3  NUMA Scheduler
4  VM Sizing and Host Configuration


CPU Scheduler Overview

• Goals
  – High CPU utilization, high application throughput
  – Ensure fairness (shares, reservation, limit)

CPU Scheduler Overview

• When?
  – Idle PCPUs have new runnable worlds (wakeup: VM power on, etc.)
  – The running world voluntarily yields the CPU (wait: idle/non-idle)
  – The running world involuntarily gives up the CPU (preemption: higher priority / fair share reached)

• What?
  – The world in the ready queue with the least (consumed CPU time / fair share)

• Where?
  – Balance load across PCPUs
  – Preserve cache state, minimize migration cost
  – Avoid HT/LLC contention between sibling vCPUs
  – Close to worlds that have frequent communication patterns

Scheduling Through the Lens of esxtop

A Command Line Tool for Performance Monitoring

• For real-time monitoring
  – Just type esxtop in the ESXi shell / terminal

• For batch-mode collection
  – esxtop -b -a -d $DELAY -n $SAMPLES > $FILE_NAME.csv
  – Tools: perfmon, Excel

• A few changes in 6.5
  – Processor turbo or frequency scaling efficiency (%APERF/MPERF)
  – More intuitive accounting
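The %A/MPERF efficiency added in 6.5 is just the ratio of the APERF and MPERF counter deltas over the sampling interval. A minimal sketch of that arithmetic — the counter values below are made-up numbers for illustration, not real MSR reads:

```python
def freq_scaling_efficiency(aperf_delta, mperf_delta):
    """APERF counts cycles at the actual (turbo/throttled) frequency;
    MPERF counts at the nominal frequency. Their ratio, as a percentage,
    approximates the %A/MPERF efficiency esxtop reports."""
    if mperf_delta == 0:
        return 0.0
    return 100.0 * aperf_delta / mperf_delta

# Turbo: actual frequency above nominal -> efficiency > 100%
print(freq_scaling_efficiency(aperf_delta=2_600_000, mperf_delta=2_000_000))  # 130.0
# Power saving: actual frequency below nominal -> efficiency < 100%
print(freq_scaling_efficiency(aperf_delta=1_400_000, mperf_delta=2_000_000))  # 70.0
```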

Peeking into a Virtual Machine Using esxtop

[esxtop screenshot of a VM’s worlds]

A virtual machine consists of more than vCPU worlds.

CPU Scheduler Accounting: %USED vs. %RUN

• %USED vs. %RUN (UTIL)
  – Over the same 2 seconds, a world achieves different amounts of work
  – %RUN (UTIL) is based on wall-clock time (TSC)
  – %USED reflects frequency scaling (power, turbo) and hyper-thread contention

[Diagram comparing %USED and %RUN accounting for a VM world]

CPU Scheduler: Throughput Gain due to Hyperthreading

SPEC CPU2006 in 1-vCPU VMs (Haswell)

[Chart: normalized throughput gain, baseline vs. HT, for perlbench, bzip2, gcc, mcf, gobmk, hmmer, sjeng, libquantum, and h264ref; HT adds throughput across the benchmarks]

CPU Scheduler: Slowdown due to Hyperthreading

SPEC CPU2006 in 1-vCPU VMs (Haswell)

[Chart: runtime in minutes, baseline vs. HT, for the same benchmarks; with HT contention each run takes 1.5x–1.9x longer]

CPU Scheduler: Hyperthreading and %USED time

• Improves throughput
• Each vCPU might run slower with contention from its hyper-twin
  – 2 cores vs. 2 HTs on a single core

• Hyperthreading-aware scheduling
  – The same %RUN translates into different amounts of %USED
    • 100% %RUN may only give 70% %USED in case of contention

• Enable HT by default for the extra throughput
• Be aware of HT contention. ☺
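The trade-off above — each hyper-twin runs slower, yet the core as a whole does more work — can be sketched with a toy model. The per-thread speed is a hypothetical fraction, not a measured value:

```python
def ht_aggregate_throughput(per_thread_speed):
    """Two hyper-threads sharing a core: each runs at a fraction of
    full-core speed, but the core's aggregate throughput is their sum.
    per_thread_speed is illustrative (e.g. 0.65 = 65% of a full core)."""
    return 2 * per_thread_speed

# If each hyper-twin achieves ~65% of a dedicated core (the same ballpark
# as the slide's "100% %RUN may only give 70% %USED"), the core still
# gains ~30% aggregate throughput:
print(ht_aggregate_throughput(0.65))  # 1.3
```

This is why HT is worth enabling for throughput even though any single vCPU sees a slowdown.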

CPU Scheduler Accounting: Breakdown

%USED = %RUN + %SYS - %OVRLP - E

[Timeline t0–t8 for one world: A = time in ready queue (%RDY); B, D = CPU scheduling cost while the world is interrupted (%OVRLP; %SYS += D if the work is done for this VM); C = actual execution (%RUN); W = waiting (%WAIT); E = efficiency loss from power mgmt, hyper-threading, etc.]
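The accounting identity above can be sanity-checked numerically; a small sketch with made-up per-interval percentages:

```python
def pct_used(run, sys, ovrlp, e):
    """esxtop accounting identity from the slide:
    %USED = %RUN + %SYS - %OVRLP - E, where E is the efficiency loss
    from power management, hyperthreading, etc."""
    return run + sys - ovrlp - e

# A world that ran 80% of the interval, was charged 10% of system time,
# had its run interrupted for 5%, and lost 15% to HT/power effects:
print(pct_used(run=80.0, sys=10.0, ovrlp=5.0, e=15.0))  # 70.0
```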

CPU Scheduler Accounting: Time from Kernel Contexts

NEW!

[esxtop screenshots comparing vSphere 6.0 and vSphere 6.5 kernel-context time accounting]

CPU Scheduler Accounting: Group vs. World

[esxtop screenshot: a VM group with 128 vCPU worlds]

Group (VM) stats aggregate world stats.

%RDY Impact on Throughput (Java Workload)

[Chart: normalized throughput (bops) vs. %RDY from 0 to 20; throughput drops about 15% as %RDY grows]

%RDY affects throughput.

%RDY Impact on Latency (Redis Workload)

[Charts: 99.99th-percentile latency (msec) vs. %RDY, with a spiky and a flat competing workload]

Latency depends on the competing workloads.

CPU Scheduler: Co-scheduling

• *NOT* gang-scheduling
  – Allows a subset of vCPUs to run simultaneously
  – Co-stops a leading vCPU if it advances too far ahead
  – Efficient in a consolidated setup

• High %CSTOP?
  – Any %RDY time?
  – Watch out for a vCPU’s (WAIT – WAIT_IDLE), i.e. %VMWAIT from esxtop
    • A vCPU blocks due to I/O (to snapshot) or host-level memory swap
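The co-stop idea — stop a leading vCPU once it gets too far ahead of its slowest sibling — can be illustrated with a toy skew check. This is illustrative only: the progress units and threshold are invented, and this is not ESXi’s actual algorithm:

```python
def should_costop(vcpu_progress, skew_threshold=10.0):
    """Relaxed co-scheduling sketch (illustrative, not ESXi's real
    implementation): flag any vCPU whose progress exceeds the slowest
    sibling's by more than the threshold as a co-stop candidate."""
    slowest = min(vcpu_progress.values())
    return {v: p - slowest > skew_threshold for v, p in vcpu_progress.items()}

# vcpu0 is 12 units ahead of vcpu2, beyond the 10-unit threshold:
print(should_costop({"vcpu0": 112.0, "vcpu1": 105.0, "vcpu2": 100.0}))
# {'vcpu0': True, 'vcpu1': False, 'vcpu2': False}
```

Note only the leader is stopped, not the whole sibling set — that is what distinguishes this from gang-scheduling.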


Memory Management Overview

• Goals
  – Allow memory over-commitment
  – Handle transient memory pressure well

• Terminology

[Diagram: Total Memory Size contains Allocated Memory (split into Active Memory and Idle Memory) and Free Memory]

Memory Management Overview

• Reclaim memory if consumed > entitled
  – Entitlement: shares, limit, reservation, active estimation
  – Page sharing > Ballooning > Compression > Host swapping
    • Breaks host large pages

• Page sharing vs. large pages
  – Using large pages for both guest and ESXi improves performance by 10–30%
  – Page sharing avoids ballooning and swapping
  – vSphere 6.0 breaks large pages earlier and increases page sharing (clear state)

Transient Memory Pressure Example

• Six 4GB Swingbench VMs (VM-4,5,6 are idle) in a 16GB host

[Charts: operations per minute for VM1–VM3 over ~56 minutes (ΔVM1 = 0%, ΔVM2 = 0%); balloon, swap used, compressed, and shared sizes in GB over the same period]

Constant Memory Pressure Example

• All six VMs run Swingbench workloads

[Charts: operations per minute for VM1–VM6 over ~56 minutes (ΔVM1 = -16%, ΔVM2 = -21%); host swap-in rate in KB per second over the same period]

General Principles

• Two types of memory overcommitment
  – “Configured” memory overcommitment: SUM (memory size of all VMs) / host memory size
  – “Active” memory overcommitment: SUM (mem.active of all VMs) / host memory size

• Performance impact
  – “Active” memory overcommitment ≈ 1 → high likelihood of performance degradation!
    • Some active memory is not in physical RAM
  – “Configured” memory overcommitment > 1 → zero or negligible impact
    • Most reclaimed memory is free/idle guest memory

• Aim for high consolidation while keeping down “active” memory overcommitment
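The two ratios above are straightforward to compute. A small sketch using the 16GB host from the earlier example — the per-VM active-memory figures are assumed for illustration:

```python
def overcommitment(vm_mem_sizes, vm_active_mem, host_mem):
    """The two ratios from the slide: configured overcommitment uses
    each VM's configured memory size; active overcommitment uses
    mem.active. All values in GB."""
    configured = sum(vm_mem_sizes) / host_mem
    active = sum(vm_active_mem) / host_mem
    return configured, active

# Six 4GB VMs on a 16GB host (as in the transient-pressure example),
# assuming only ~2GB of each VM is active:
configured, active = overcommitment([4] * 6, [2] * 6, host_mem=16)
print(configured)  # 1.5  -> configured > 1: usually fine
print(active)      # 0.75 -> active well below 1: little reclamation risk
```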


NUMA

• Non-Uniform Memory Access system architecture
  – Each node consists of CPU cores, memory, and possibly devices
  – Access time can be 30% ~ 200% longer across nodes

• NUMA node vs. sockets
  – Multiple NUMA nodes per socket (Cluster-on-Die)
  – Multiple sockets per NUMA node (less common)

• Small VMs are scheduled on a single physical NUMA node
  – 100% local memory accesses
  – “Fixes” scale-out apps that don’t scale up well
  – Consider sizing databases to fit

[Diagram: NUMA node 0 and NUMA node 1]

NUMA Scheduler: Overview

• Load balancing
  – To balance VMs across different NUMA nodes

• vNUMA
  – To properly expose a virtualized NUMA topology to guest VMs for best performance

NUMA Scheduler: Load Balancing

• Initial placement
  – Based on CPU/memory load + round-robin

• Periodic rebalancing algorithm
  – Every 2 seconds, try an incremental move (1 or 2 VMs)
  – To improve load balance / memory locality / relation sharing / fairness

NUMA Rebalancing In Action (TPCx-V)

[Charts: PCPU number (0–120) over time for vm2–vm13, split into Group-1 through Group-4, showing the NUMA scheduler migrating VMs across PCPUs]

NUMA Scheduler: Impact of Cluster-on-Die (Haswell)

• Cluster-on-Die
  – Breaks each socket into 2 NUMA domains
  – Lower LLC hit latency and local memory latency
  – Higher local memory bandwidth

• Performance Considerations
  – vSphere 6.5 supports Haswell, Broadwell, and future generations of processors

[Chart: SPECjbb2015 normalized throughput gain for 36Vx1, 18Vx1, and 9Vx2 VM configurations, default vs. CoD]

NUMA Scheduler: vNUMA

• vNUMA
  – Useful for wide VMs (#vCPUs > #cores/NUMA node)
  – Exposes a virtual NUMA topology to improve memory locality for better guest scheduling

• Example: a 10-vCPU VM maps directly to 2 pNUMA nodes (out of 4)

[Diagram: four pNUMA nodes with cores C0–C5 each; the VM’s vCPUs span two of them]
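The wide-VM test above amounts to a ceiling division. An illustrative sketch — this mirrors the slide’s 10-vCPU example but is not ESXi’s exact placement policy:

```python
def vnuma_nodes(num_vcpus, cores_per_pnuma):
    """Illustrative sizing check: a VM is "wide" when #vCPUs exceeds
    the cores in one physical NUMA node, and then spans
    ceil(#vCPUs / cores-per-node) vNUMA nodes."""
    if num_vcpus <= cores_per_pnuma:
        return 1  # fits in a single node: 100% local memory possible
    return -(-num_vcpus // cores_per_pnuma)  # ceiling division

# The slide's example: a 10-vCPU VM on a host with 6 cores per
# pNUMA node maps onto 2 vNUMA nodes:
print(vnuma_nodes(10, 6))  # 2
print(vnuma_nodes(4, 6))   # 1
```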

NEW!

NUMA Scheduler: vNUMA vs. vSocket

• Decoupling vSocket from vNUMA (vSphere 6.5)
  – ESXi will always try to pick the optimal vNUMA topology when possible
    • As long as: vNUMA = N x vSocket or vice versa
  – Blog post: “Virtual Machine vCPU and vNUMA Rightsizing – Rules of Thumb”

[Diagram: four vSockets of two vCPUs each, mapped onto two vNUMA nodes]

NUMA Scheduler: Virtual CPU Topology Example

[Screenshot: guest-visible CPU/NUMA topology]

• numactl --hardware
• lstopo -s
• coreinfo -n -s


Host Configuration: Power Management Policy

Throughput of 1-vCPU VM Java Workload

[Chart: normalized performance under the High Performance vs. Balanced (P-states + C-states) policies on IvyBridge and Haswell]

VM Sizing: #vCPUs

• Cost of over-sizing
  – Small CPU overhead per vCPU from periodic timers, etc.
  – May hurt performance due to process migrations (e.g. Redis, up to 40% regression)

• Cost of under-sizing
  – Internal CPU contention
  – Check CPU usage and processor queue length

VM Sizing: vRAM

• Cost of over-sizing
  – Some apps/OSes treat unused memory as cache
    • e.g. SuperFetch
    • Increases active memory
    • May suffer from memory reclamation

• Cost of under-sizing
  – Guest-level paging

Summary

• Be aware of the difference between per-VM %RDY and per-world %RDY
• Pay attention to ready time (%RDY) if tail latency matters
• Size VMs based on the number of physical cores instead of hyperthreads
• In vSphere 6.5 and beyond (NEW!)
  – Insignificant %OVRLP, and some %SYS moved to %RUN
  – No more hassle between vSocket and vNUMA
  – Even smaller scheduling overhead
• Avoid active memory overcommit.
• Watch out for under-sizing a VM.
• Power policy matters!

Extreme Performance Series – Las Vegas

• SER2724BU   Performance Best Practices
• SER2723BU   Benchmarking 101
• SER2343BU   vSphere Compute & Memory Schedulers
• SER1504BU   vCenter Performance Deep Dive
• SER2734BU   Byte Addressable Non-Volatile Memory in vSphere
• SER2849BU   Predictive DRS – Performance & Best Practices
• SER1494BU   Encrypted vMotion Architecture, Performance, & Futures
• STO1515BU   vSAN Performance Troubleshooting
• VIRT1445BU  Fast Virtualized Hadoop and Spark on All-Flash Disks
• VIRT1397BU  Optimize & Increase Performance Using VMware NSX
• VIRT2550BU  Reducing Latency in Enterprise Applications with VMware NSX
• VIRT1052BU  Monster VM Database Performance
• VIRT1983BU  Cycle Stealing from the VDI Estate for Financial Modeling
• VIRT1997BU  Machine Learning and Deep Learning on VMware vSphere
• FUT2020BU   Wringing Max Perf from vSphere for Extremely Demanding Workloads
• FUT2761BU   Sharing High Performance Interconnects across Multiple VMs

Extreme Performance Series – Barcelona

• SER2724BE   Performance Best Practices
• SER2343BE   vSphere Compute & Memory Schedulers
• SER1504BE   vCenter Performance Deep Dive
• SER2849BE   Predictive DRS – Performance & Best Practices
• VIRT1445BE  Fast Virtualized Hadoop and Spark on All-Flash Disks
• VIRT1397BE  Optimize & Increase Performance Using VMware NSX
• VIRT1052BE  Monster VM Database Performance
• FUT2020BE   Wringing Max Perf from vSphere for Extremely Demanding Workloads

Extreme Performance Series – Hands-on Labs

Don’t miss these popular Extreme Performance labs:

• HOL-1804-01-SDC: vSphere 6.5 Performance Diagnostics & Benchmarking
  – Each module dives deep into vSphere performance best practices, diagnostics, and optimizations using various interfaces and benchmarking tools.

• HOL-1804-02-CHG: vSphere Challenge Lab
  – Each module places you in a different fictional scenario to fix common vSphere operational and performance problems.

Performance Survey

The VMware Performance Engineering team is always looking for feedback about your experience with the performance of our products, our various tools and interfaces, and where we can improve.

Scan this QR code to access a short survey and provide us direct feedback.

Alternatively: www.vmware.com/go/perf

Thank you!
