Linux on z/VM – Understanding CPU Usage
Rob van der Heij, rob@velocitysoftware.de

IBM System z Technical Conference Brussels, 2009 Session LX44

Velocity Software GmbH http://www.velocitysoftware.com/

Copyright 2009 Velocity Software, Inc. All Rights Reserved. Other products and company names mentioned herein may be trademarks of their respective owners.

Introduction – Why would you care about CPU Usage?

 CPU virtualization is the easiest part
• System z CPU is designed for virtualization and sharing

 Sharing CPU resources raises questions as well
• Is my workload held back because of CPU constraints?
• Why did other workload get the resources that I want?
• Are there more resources available, and why didn't I get them?

 Sharing the resources requires "social behavior"
• Cycles wasted on useless work can't be used for real work

LX44 – Linux on z/VM – Understanding CPU Usage


Introduction – What could you do about CPU Usage?

 Understand where your CPU cycles are spent
 Measure and identify CPU usage
 Reduce peak CPU requirements
• Find more economic ways to do the work
• Avoid work that does not need to be done
• Move some work to quiet "night shift" hours
 Pay attention to "idle load"

Not the same as benchmarks
 Top speed versus mileage
 Most economic way to do the work within SLA
 Scalability of applications

Agenda

CPU Usage Breakdown
 LPAR
 z/VM
 Virtual Machine

CPU Accounting

Linux CPU Usage
• What is that Penguin Doing?
• Linux Server with High Overhead
• Improving TSM Throughput
• My Penguin can't Sleep

Performance data shown in the presentation was collected and processed with ESALPS.

CPU Usage Breakdown – Logical Partition Level

 LPARs more or less share the CPU resources
 Logical CPUs are defined shared or dedicated
 For shared CPUs the LPAR weight may be important
 CPs and IFLs are not mixed
• Exception: "VM Mode LPAR" – z/VM 5.4 on selected hardware

PR/SM overhead is normally pretty small:

Name     No. Type  VCPU Addr Total  Ovhd
-------- --- ----  --------- -----  ----
LP1        0 CP            0  49.2   0.4
                           1  48.6   0.3
                           2  49.2   0.4
LP2        1 IFL           0 100.1   0.0
                           1 100.1   0.0
                           2  99.9   0.0
LP3        2 CP            0   0.1   0.0

LP2 is the only IFL LPAR, with 3 dedicated CPUs: nobody to share with.
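For shared logical CPUs, each LPAR's entitlement is proportional to its weight. A minimal sketch of that arithmetic, with hypothetical weights (real PR/SM behavior adds capping and redistribution of unused entitlement):

```python
def lpar_share(weights, physical_cpus):
    """Entitled share of the shared physical CPUs per LPAR,
    proportional to the LPAR weights (dedicated CPUs excluded)."""
    total = sum(weights.values())
    return {name: physical_cpus * w / total for name, w in weights.items()}

# Hypothetical weights for three LPARs sharing 3 physical CPs:
shares = lpar_share({"LP1": 100, "LP2": 50, "LP3": 50}, physical_cpus=3)
# LP1 is entitled to 1.5 CPUs' worth; entitlement an LPAR does not
# use is given by PR/SM to whichever LPAR has demand.
```

This is why an LPAR can exceed its entitled share on a lightly loaded machine: the weight only matters when there is contention.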

CPU Usage Breakdown – PR/SM Management Time

 Not attributed to one specific LPAR
• Scheduling etc. – spread over all physical CPUs
• Typically < 1% per CPU
• Depends on number of LPARs & logical CPUs

Physical CPU management time:
CPU  Percent
---  -------
  0    0.003
  4    0.414
  6    0.414
  9    0.416
 11    0.003
 13    0.003
     -------
Total: 1.254

LPAR Overhead
 PR/SM work on behalf of a specific LPAR
• Dispatching, QDIO, etc. – spread over logical CPUs of the LPAR
• Typically < 1% per logical CPU
• Depends on workload

              Virt
Name     Nbr  CPUs Total Ovhd
-------- ---  ---- ----- ----
LP1        0     3  83.6  1.2
LP3        2     1   0.1  0.0

(Chart: stacked bars per LPAR showing productive work by the guest OS, LPAR overhead, LPAR management time, and unused capacity.)

CPU Usage Breakdown – Logical CPU View

 Guest OS dispatches workload over its logical CPUs
• When there is no work to do, the CPU is idle (wait state)
• LPAR recognizes idle CPUs – the "white space" is shared
  Exception: when "wait complete" is set for the LPAR

    Total Emul  User  Sys   Idle
CPU util  time  ovrhd ovrhd time
--- ----- ----- ----- ----- ----
00   40.9  30.8   6.9   3.3 58.6
01   40.4  32.4   5.9   2.1 59.1
02   40.2  31.9   6.0   2.3 59.3

(Chart: utilization of CPU 0, CPU 1, and CPU 2 within 100% of LPAR1, including LPAR overhead.)

CPU Usage Breakdown – z/VM View: Totals

 System Overhead – general CP work
• Scheduling, Monitor, Accounting

 User Overhead – CP work on behalf of a specific user
• I/O translation, instruction simulation, CP functions

 Emulation Time – productive work for the user

 z/VM metrics: True CPU%

    Total Emul  User  Sys   Idle
CPU util  time  ovrhd ovrhd time
--- ----- ----- ----- ----- ----
00   40.9  30.8   6.9   3.3 58.6
01   40.4  32.4   5.9   2.1 59.1
02   40.2  31.9   6.0   2.3 59.3

(Chart: 100% bar split into Emulation Time, User Overhead, System Overhead, and idle; "vtime" is the emulation time, "ttime" adds the user overhead, and the T/V ratio compares the two.)

CPU Usage Breakdown – z/VM View: Virtual Machines

 Virtual Time – virtual machine work (SIE)
 Total Time – Virtual Time plus User Overhead
 Some virtual machines are not real "users" but system functions
• RACFVM, TCPIP, DIRMAINT

Screen: ESAUSP2          Velocity Software          1 of 3
User Percent Utilization

Time     UserID/Class Total  Virt
-------- ------------ -----  ----
16:13:00 SUSELNX2      3.64  3.59
         REDHAT04      2.89  2.80
         ORACLE        2.12  2.08
         VMRLNX        1.89  1.88
         DXT2LV        0.61  0.31
         ROBLX1        0.35  0.35
         TCPIP         0.28  0.13
         SUSELNX1      0.24  0.21
         REDHAT3       0.21  0.18
         SLES8         0.19  0.18
         ROBLX2        0.12  0.11

Sum of the per-user usage versus the total gives the Capture Ratio.

CPU Accounting

Mainframe operating systems do CPU accounting
 Required for charge-back of shared resources

z/VM account records
 At logoff or through CP command
 Resource usage per virtual machine
• CPU usage
• I/O operations
 Very simple to process
• Easy to audit
 Does not tell you why
 Lacks detail for Linux

CPU Accounting – Charge-back

Charge-back is meant to recover the total data center cost
 CPU is not the major cost factor anymore
 CPU usage is traditionally still used for charge-back
• CPU usage considered representative for the amount of usage
• Total data center cost divided by consumed CPU hours
• CPU tariff based on estimated usage and capacity plan

IFL with Linux and z/VM breaks the model
 Installations add a substantial amount of MIPS
 Linux applications also consume a lot of CPU hours
• A Linux Proof of Concept may get charged much of the z/OS license cost

Charge-back motivates users to save resources
 Make sure to arrange a correct cost model for Linux
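The break in the model is simple arithmetic: cheap IFL capacity inflates the consumed CPU hours, which silently lowers the tariff for everyone. A sketch with illustrative numbers only (all figures hypothetical):

```python
def cpu_tariff(datacenter_cost, consumed_cpu_hours):
    """Cost per CPU hour when the entire data center cost is
    recovered through CPU-hour charge-back."""
    return datacenter_cost / consumed_cpu_hours

# Hypothetical: adding IFL capacity barely moves the cost,
# but greatly increases the consumed CPU hours.
before = cpu_tariff(1_000_000, 10_000)   # traditional workload only
after = cpu_tariff(1_050_000, 25_000)    # after adding IFL Linux work
# The tariff drops sharply, so the Linux work is undercharged
# unless the cost model is revised per pool (CP vs IFL).
```

This is why the slide advises a separate, correct cost model for Linux rather than one blended CPU-hour rate.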

CPU Accounting – Detail

CPU Accounting for Linux on z/VM needs detail
 Just listing totals is not enough to convince customers
 Exceptional usage must be explained very clearly

A Performance Monitor can reveal the detail
 Collects CPU data along with many other metrics
• ESALPS collects ~3500 unique metrics every minute; hundreds of them are repeated per device, per user, or per Linux process
• Helps to understand the sequence of events causing a problem
• Explains any excessive usage to the application owner
 Requires a detailed "performance history"
 Requires complete data – a capture ratio of 100%

A Performance Monitor helps to validate the cost model

Visualization Techniques – Comparing Memory and CPU Usage

(Chart not reproduced.)

CPU Usage Breakdown – Linux System View

 Virtual Machine Emulation Time is available for Linux usage
 Steal Time: when Linux does not know what the CPU was used for
 Linux administrators be aware: "idle" ≠ "available for use"

Linux-2.4  Linux-2.6   Meaning
---------  ---------   --------------------------------------
User       User        Process usage
Nice       Nice        Background process usage
System     Kernel      Kernel usage
           Soft-IRQ    Kernel-related CPU usage
           Interrupt   First-level interrupt handlers
Idle       Idle        No CPU usage
           I/O wait    CPU waiting for I/O
           Steal       CPU cycles "stolen" by the hypervisor

(Chart: the Linux virtual machine's categories lined up against the z/VM view of Emulation Time and User Overhead.)
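The 2.6 categories above are exactly the counters on the `cpu` lines of `/proc/stat` (cumulative ticks since boot). A minimal parser, run here on a made-up sample line rather than the live file:

```python
def cpu_fields(stat_line):
    """Split a /proc/stat 'cpu' line into named tick counters
    (kernel 2.6 layout, which adds iowait and steal)."""
    names = ("user", "nice", "system", "idle",
             "iowait", "irq", "softirq", "steal")
    parts = stat_line.split()[1:1 + len(names)]
    return dict(zip(names, (int(p) for p in parts)))

# Sample line with made-up tick values (USER_HZ = 100):
sample = "cpu  3080 10 330 5860 120 5 15 580"
ticks = cpu_fields(sample)
total = sum(ticks.values())
steal_pct = 100.0 * ticks["steal"] / total
# Steal is time the LPAR/z/VM gave to someone else -- it is NOT
# part of "idle", which is the point of the slide above.
```

In practice you would take two snapshots of `/proc/stat` and difference the counters over the interval.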

CPU Usage Breakdown – Linux Process View

 CPU resources are allocated to processes
 Each process uses some system time plus some user time (or nice)
 Processes should add up to the total system and user time
• Capture ratio!

Linux CPU accounting
 Traditionally wrong due to virtualization
• Linux tools would show too high numbers
 Modern kernels use virtual CPU accounting
• Linux tools sometimes show wrong data

(Chart: per-process Nice, User, and System usage adding up to the Linux totals.)
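The capture ratio mentioned above is simply the accounted per-process CPU divided by the measured system total. A sketch with hypothetical interval data (the process names and numbers are illustrative):

```python
def capture_ratio(process_cpu, system_cpu):
    """Fraction of the measured system CPU time that is accounted
    to individual processes; < 1.0 means usage 'disappeared'."""
    return sum(process_cpu.values()) / system_cpu

# Hypothetical CPU seconds over a one-minute interval:
procs = {"db2sysc": 1.1, "snmpd": 0.1, "cron": 0.05}
ratio = capture_ratio(procs, system_cpu=1.5)
# A ratio well below 1.0 is a warning sign: either short-lived
# processes escaped sampling, or the accounting itself is wrong.
```

The "Linux Server with High Overhead" case later in this deck is an extreme example: processes explained 3% while z/VM measured 30%.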

Why can't I use my Linux Tools?

Linux data is incomplete and sometimes incorrect
 Virtualization changes the rules of the game
• CPU usage perceived by Linux can be very wrong
• Assumptions about used and available do not hold anymore
 z/VM performance impacts Linux behavior
 Need to combine Linux and z/VM performance data

z/VM does not clone system administrators
 You may not have time to look when it happens
 Complex interactions make it hard to reproduce scenarios
 A multi-tier application involves multiple virtual servers
 Centralized data collection is easier to manage
 May need to share data with others to understand it

What is that Penguin doing? – High Level Overview

 Shows no real detail
 Sometimes enough for a quick check

Screen: ESAMAIN          1 of 3
System Overview

                        Transact. per Avg.      Utilization
Time     On   Actv In Q Sec.  Time   CPUs Total Virt.
-------- ---- ---- ---- ----- ------ ---- ----- -----
02:49:00   87   72 67.0  19.3   0.35    3  67.1  63.7
02:48:00   87   71 69.0  19.3   0.35    3  65.9  62.6
02:47:00   87   72 64.0  19.1   0.35    3  54.3  50.7
02:46:00   87   68 70.0  18.3   0.40    3  48.6  44.9
02:45:00   87   69 66.0  19.1   0.32    3  42.8  39.3
02:44:00   87   69 68.0  20.1   0.34    3  43.2  39.7
02:43:00   87   69 67.0  18.3   0.35    3  42.8  39.1

What is that Penguin doing? – High Level Overview

 Shows the breakdown per class
 Zoom in on one class

Screen: ESAUSP2          1 of 3
User Percent Utilization

         UserID                              Lock
Time     /Class   Total  Virt  Total  Actv    -ed  Total  Actv
-------- -------- ----- ----- ------ ----- ------ ------ -----
02:56:00 System:    118   112    19M   19M    142    19M   19M
         *TheUsrs   116   110    18M   18M  92.00    18M   18M
         LINUX     0.93  0.85   594K  594K   0.00   594K  594K
         *Servers  0.68  0.61  11564  4096   0.00  11505  4050
         TCPCTL    0.00  0.00  20814 20814  34.00  20672 20672
02:55:00 System:    151   144    19M   19M    140    19M   19M
         *TheUsrs   148   142    18M   18M  90.00    18M   18M
         LINUX     1.08  0.99   594K  594K   0.00   594K  594K
         *Servers  0.66  0.59  11565  4390   0.00  11486  4315
         TCPCTL    0.00  0.00  20814 20814  34.00  20672 20672

What is that Penguin doing? – Usage Breakdown per user

 So one server used 25% of a CPU last minute
• Is that good or bad?
• Often you can't really tell without knowing the behavior over time

Screen: ESAUSP2          1 of 3       ESAMON 3.7.0 CLASS *THEUS
User Percent Utilization

         UserID                              Lock
Time     /Class   Total  Virt  Total  Actv    -ed  Total  Actv
-------- -------- ----- ----- ------ ----- ------ ------ -----
02:57:00 DOMINOZ1 25.62 23.10   522K  522K   0.00   522K  522K
         IBMTSM   21.51 19.35   522K  522K  29.00   522K  522K
         TDIRTIM  12.30 12.24  1044K 1044K   0.00  1044K 1044K
         EBIZ2     9.72  9.51   260K  260K   0.00   260K  260K
         DB2-A1    6.83  6.79   522K  522K   0.00   522K  522K
         ACME      4.73  4.53   190K  190K   0.00   190K  190K
         EBIZDEV1  4.53  4.42   260K  260K   0.00   260K  260K
         EBIZDEV2  4.27  4.18   260K  260K   0.00   260K  260K
         EBIZ1     4.23  4.14   260K  260K   0.00   260K  260K
         TDIRDB2   3.35  3.32   852K  852K   0.00   852K  852K
         IBMRED2   2.70  2.64   260K  260K   0.00   260K  260K

What is that Penguin doing? – Single User over Time

 Looking at usage in the recent past shows "when it started"
• Frequently more productive than waiting until it stops
 For multi-tier applications you need to look at multiple servers
• Arrange servers in classes for an "application view"

Screen: ESAUSP2          1 of 3       ESAMON 3.7.0 CLASS * USER
User Percent Utilization

         UserID                              Lock
Time     /Class   Total  Virt  Total  Actv    -ed  Total  Actv
-------- -------- ----- ----- ------ ----- ------ ------ -----
03:07:00 DOMINOZ1 34.75 31.13   522K  522K   0.00   522K  522K
03:06:00 DOMINOZ1 28.27 24.91   522K  522K   0.00   522K  522K
03:05:00 DOMINOZ1 26.57 23.97   522K  522K   0.00   522K  522K
03:04:00 DOMINOZ1 24.01 21.65   522K  522K   0.00   522K  522K
03:03:00 DOMINOZ1 24.07 21.50   522K  522K   0.00   522K  522K
03:02:00 DOMINOZ1 75.92 72.51   522K  522K   0.00   522K  522K
03:01:00 DOMINOZ1 33.35 30.01   522K  522K   0.00   522K  522K
03:00:00 DOMINOZ1 26.31 23.74   522K  522K   0.00   522K  522K
02:59:00 DOMINOZ1 22.17 19.47   522K  522K   0.00   522K  522K

What is that Penguin doing? – Looking inside the Linux server

 Identify the Linux processes that consume the resources

Screen: ESALNXP          1 of 3       ESAMON 3.7.0  03/27
LINUX VSI Process Statistics Report   NODE DOMINOZ1 LIMIT 2

Time     Node     Name        ID  PPID   GRP user syst usrt
-------- -------- -------- ----- ----- ----- ---- ---- ----
03:02:00 dominoz1 clrepl   12194  2536  2483  0.1  0.0  0.0
                  updall   11500  2536  2483  4.2  0.0  0.0
                  smdemf    5209  2536  2483  0.1  0.0  0.0
                  sched     5181  2536  2483  1.8  0.0  0.0
                  update    5174  2536  2483  0.5  0.0  0.0
                  replica   5168  2536  2483 29.4  0.0  0.0
                  server    2536  2483  2483 20.1  0.0  0.0
                  snmpd     1768     1  1767  0.1  0.0  0.0
                  kjournal  1140     1     1  0.0  0.0  0.0
                  kswapd0    134     1     1  0.0  0.0  0.0
                  pdflush    133     8     0  0.0  0.0  0.0
                  *Totals*     0     0     0 56.6  0.0  0.0

CPU Overhead

CPU Overhead can mean many different things
 Productive work for one is overhead for another
 Make sure your peer means the same thing
 You're only aware of it when you can measure it
 With System z and z/VM we can measure it
 Hardware support keeps the overhead mostly low

Sometimes abnormal behavior increases overhead
 Spending resources on other things than the workload
 A Performance Monitor often helps to clarify things

Linux Server with High Overhead

Customer reports a Linux server with high CP cost
 Linux server using 25-30% of a CPU
 Almost half of that is "CP overhead"
• T/V ratio of 1.8
• Work that CP does on behalf of the virtual machine
 z/VM has plenty of CPU resources
• Linux guest does not appear to be held back

Question
 What is Linux doing?
 Why the high overhead?

Answer: Doing Nothing!

      UserID               T:V
Time  /Class   Total  Virt Rat
----- -------- ----- ----- ---
12:50 LINUX806 25.84 14.30 1.8
12:51 LINUX806 26.44 14.73 1.8
12:52 LINUX806 28.25 15.37 1.8
12:53 LINUX806 27.78 15.26 1.8
12:54 LINUX806 28.20 15.51 1.8
12:55 LINUX806 29.95 16.52 1.8
12:55 LINUX806 27.01 14.94 1.8

Linux Server with High Overhead – Linux internal CPU statistics

 Linux reports a total usage of ~5-6%
 z/VM reports a total usage of ~25-30%
 Someone is off by a factor of 5

The server runs SLES 10
 Uses "virtual time accounting" to get "correct" numbers

Date/
Time     Node     Total Syst User Idle
-------- -------- ----- ---- ---- ----
12:50:00 LINUX806   5.9  4.1  1.8  194
12:51:00 LINUX806   6.1  4.4  1.8  194
12:52:00 LINUX806   6.3  4.5  1.8  193
12:53:00 LINUX806   5.9  4.3  1.6  194
12:54:00 LINUX806   6.1  4.4  1.7  189
12:55:00 LINUX806   6.5  4.8  1.8  198

Linux Server with High Overhead – Per-process breakdown

 Many db2sysc processes
• DB2 worker threads
 One db2fmcd with 1%

Only 2.7% accounted for
 The remainder disappeared
 Linux claims "idle"

z/VM monitor:     30%
Linux statistics:  6%
Explained usage:   3%

node/Name          Tot  sys
---------         ---- ----
12:51:00 LINUX806 2.79 0.72
  events/0        0.02 0.02
  kjournal        0.02 0.02
  kjournal        0.02 0.02
  multipat        0.02 0.02
  snmpd           0.13 0.08
  ha_logd         0.02 0
  heartbea        0.05 0.02
  heartbea        0.02 0.02
  ntpd            0.02 0.02
  nscd            0.02 0.02
  cron            0.08 0
  db2fmcd         1.01 0.03
  db2fmd          0.18 0.02
  db2fmd          0.18 0.02
  db2fmp          0.03 0.02
  db2sysc         0.02 0.02
  db2sysc         0.05 0.03
  db2sysc         0.02 0.02
  db2sysc         0.05 0.03
  db2sysc         0.02 0
  db2sysc         0.07 0.03
  db2sysc         0.02 0.02
  db2sysc         0.05 0.03
  db2fmp          0.02 0
  db2sysc         0.03 0.02
  . . .

Linux Server with High Overhead – db2fmcd

DB2 process 'db2fmcd' is suspicious
 No function with Linux on System z
• Provided for compatibility with some other configurations
 Largest single source of CPU usage in the sample
• Likely triggers the work done by the db2sysc processes
 Probably does something that creates high overhead

Reviewed CP trace data to understand the overhead
 Determine the cause for the SIE intercepts
 Normal behavior: Linux goes idle and wakes up again
 But it does that very often...
• 100,000 SIE intercepts per second

      Totl Ovrhead  Diag- Inst    SIE Fast  Page
Time  Util Usr Sys  nose   Sim intrcp path fault
----- ---- --- --- ----- ----- ------ ---- -----
12:50 1078  48  77   512   17K  95073  25K  23.9
12:51 1010  49  78  1042   17K 100235  43K  36.4
12:52 1018  50  84   503   16K 103837  21K  21.4
12:53  896  46  69   479   15K 103922  19K  12.2
12:54  909  46  71   506   15K 104306  18K  11.2
12:55  817  46  64   520   14K 111731  23K  15.3

Linux Server with High Overhead – Frequent wake-up

The application requests frequent wake-ups
 Wake-up requests with a delay of less than 10 ms
• This is polling – frowned upon in a shared environment
 Unclear whether this is a bug or a design failure

A kernel bug rounds the small delay to 0
 Introduced with "high resolution timer" support
 Rounded to 0 ms, i.e. an immediate wake-up

The timer interrupt is presented when enabled
 CP dispatches the virtual machine immediately
 Eventually the minor time slice is consumed
• The scheduler reviews the queue and dispatches later
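The effect of the rounding bug can be sketched in a few lines. This is an illustration of the failure mode described above, not the actual kernel code: a requested delay rounded down to the timer resolution turns any sub-tick sleep into an immediate wake-up, and hence into a busy loop:

```python
def rounded_wakeup_ms(requested_ms, resolution_ms=10):
    """Sketch of the reported bug: the requested delay is rounded
    DOWN to the timer resolution, so anything below one tick
    becomes 0 ms -- an immediate wake-up."""
    return (requested_ms // resolution_ms) * resolution_ms

# A 9 ms sleep request becomes 0 ms: the guest wakes immediately,
# generating an SIE intercept storm like the 100,000/s above.
immediate = rounded_wakeup_ms(9)
# A 25 ms request merely loses precision (25 -> 20 ms).
shortened = rounded_wakeup_ms(25)
```

Correct behavior would round such a delay up (or honor it at high resolution), so the guest actually sleeps.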

Linux Server with High Overhead – Conclusion

 Something in the application is polling
• Customer did some traces that point at process db2redom
• The db2fmcd process was the biggest single consumer
• Probably DB2 was confused in the recovery process
• Most likely not productive processing
 High CP overhead due to a Linux kernel bug
• Turns a short sleep into an immediate wake-up
• Fix is upstream and will eventually go into distributions
 Wrong Linux CPU accounting due to another bug
• Fix is supposed to be in the pipeline
 Latency in z/VM prevented Linux from taking more
 You can't always tell from CPU alone that it is looping

Improving TSM Throughput

Customer Scenario
 Nightly backup of discrete servers to TSM on System z
 Dedicated OSA for the Linux server with TSM
 The bottleneck appears to be the physical GbE connection
 Limited CPU usage thanks to QEBSM

(Diagram: TSMSERV with a dedicated OSA port, next to other guests on a VSWITCH.)

Improving TSM Throughput – LACP

LACP: Link Aggregation Control Protocol
 Bundles multiple physical links into one logical path (IEEE 802.3ad)
 Connection between the external switches and the VSWITCH
 Also provides the fail-over function
 Using 4 GbE ports should give 4-fold throughput

(Diagram: TSMSERV on a VSWITCH connected through an LACP bundle of four OSA ports.)

Improving TSM Throughput – LACP VSWITCH: Real World Experience

 The potential 4-fold throughput is just theoretical
• Discrete servers connect with a single GbE
• Need sufficient servers to provide the data
 Distribution over the physical paths is not balanced
• Connections are spread over the paths by some hash function
• In this scenario only 3-4 communication pairs are active
 Still achieved almost 50% improvement over a single fiber
• Increased the qdio buffers from 16 to 128

(Chart: Network Throughput, 19 Jan 2009 – MB/s received per OSA device (3D00, 2D00, 1D00, 0D00) from 00:00 to 02:30, up to ~160 MB/s combined, with a puzzling dip marked "Huh?".)
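Why the distribution is not balanced follows from how LACP spreads frames: a hash of the address pair selects the physical link, so one communication pair always stays on one link. A toy stand-in (using CRC32 as a hypothetical hash; real switches use their own functions):

```python
import zlib

def link_for(src, dst, links):
    """Toy frame-distribution hash: one src/dst pair always maps
    to the same physical link, as LACP requires for ordering."""
    return zlib.crc32((src + ">" + dst).encode()) % links

# With only a handful of active client/server pairs, some of the
# 4 links can collide on the same hash bucket and others stay idle.
pairs = [("client%d" % i, "tsmserv") for i in range(4)]
used = {link_for(s, d, 4) for s, d in pairs}
# len(used) may well be less than 4: the aggregate bandwidth is
# only reached with many independent communication pairs.
```

This is why the slide calls the 4-fold figure theoretical: it needs enough distinct pairs to populate all hash buckets.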

Improving TSM Throughput – LACP VSWITCH: CPU Usage per Guest

 CP overhead has increased significantly – T/V ratio of 1.3
 The dedicated OSA was replaced by a VNIC
• No hardware support from QEBSM – CP simulates SIGA
 Strong correlation between bandwidth and user overhead
• No strange things happening – a linear relation: y = 0.2217x + 1.311
• Receiving 100 MB/s costs 65% of a CPU in Linux time and 22% of a CPU in user overhead

(Charts: TSMSERV CPU usage (cp and emul) over the night of 19 Jan 2009, and CP overhead and emulation time for TSMSERV plotted against VSWITCH throughput in MB/s, both close to linear.)
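The trend line on the chart is an ordinary least-squares fit. A minimal sketch of such a fit, applied to hypothetical sample points laid along the measured relation (the data here is generated, not the customer's):

```python
def least_squares(xs, ys):
    """Ordinary least-squares fit y = a*x + b -- the kind of fit
    shown on the chart (coefficients there: ~0.2217x + 1.311)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

# Hypothetical samples lying exactly on a similar line:
mbps = [10, 30, 50, 70, 90, 110]
cp_ovhd = [0.22 * x + 1.3 for x in mbps]
slope, intercept = least_squares(mbps, cp_ovhd)
# A clean linear fit like this is what the slide means by
# "no strange things happening": overhead scales with bandwidth.
```

The slope is the marginal CPU cost per MB/s; a large residual scatter, by contrast, would point at something other than the traffic driving the overhead.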

Improving TSM Throughput – LACP VSWITCH: Total CPU Usage

 More than just the virtual machine
• Total CPU utilization ~190%
 Also a rather large System Overhead
 Other high-priority workload kicked in
• Matches the dip in throughput
• Throughput is now limited by CPU

(Charts: CPU usage on 19 Jan 2009 – TSMSERV cp and emul time; a breakdown into User, CP, and System time; and a per-guest view showing SAP guests (SAP000, SAP005, SAP025) and others taking CPU away from TSMSERV.)

Improving TSM Throughput – LACP VSWITCH: System Overhead

 System overhead correlates with the VSWITCH bandwidth
• This is different from the CP overhead charged to TSMSERV
• A pretty linear relation – about 24% CPU for 100 MB/s: y = 0.2372x + 1.4176
 Probably for work that CP does to receive data
• Decoding the LACP packets
• Copying data from real QDIO buffers to VNIC buffers

Receiving 100 MB/s:
Linux internal work    65%
CP overhead Linux      22%
System overhead        24%
                      ----
Total                 111%

(Chart: system overhead in CPU% against VSWITCH throughput in MB/s, with the least-squares fit.)
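The two published fits can be combined into a rough cost model for this configuration. The linear scaling of the Linux emulation time is my own simplifying assumption, anchored only at the ~65% observed at 100 MB/s:

```python
def cpu_cost_at(mbps):
    """Rough CPU% versus VSWITCH throughput for this setup,
    using the two measured fits (user overhead charged to
    TSMSERV, and z/VM system overhead) plus an ASSUMED linear
    scaling of the ~65% Linux emulation time at 100 MB/s."""
    user_ovhd = 0.2217 * mbps + 1.311    # fit from the guest chart
    sys_ovhd = 0.2372 * mbps + 1.4176    # fit from the system chart
    emul = 0.65 * mbps                   # assumption, not measured fit
    return emul + user_ovhd + sys_ovhd

total_at_100 = cpu_cost_at(100)
# In the same ballpark as the ~111% tabulated on the slide; the
# intercepts and the rough emulation term explain the difference.
```

Such a model is only useful for capacity-planning estimates; it says nothing about latency or about the contention shown on the previous slide.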

Improving TSM Throughput – Ethernet Bonding in Linux

 The Linux implementation of LACP
 Requires exclusive OSA ports, like the VSWITCH does
• Other ports remain available for VSWITCH fail-over

(Diagram: TSMSERV with its own LACP bundle of OSA ports, next to the VSWITCH bundle.)

Improving TSM Throughput – Linux Bonding: Performance Measurements

 Maximum throughput slightly higher using all 4 paths
 The System Overhead has disappeared
• CP has no inbound traffic for the VSWITCH anymore
 The CP Overhead for TSMSERV is gone
• CP is not even aware of the traffic – QEBSM handles it
 Linux CPU usage per MB has increased
• Code paths are different for qeth using QEBSM vs SIGA

(Charts: throughput with Linux bonding per OSA device, and TSMSERV CPU usage – System, CP, Emul – from 01:00 to 02:45.)

Improving TSM Throughput – VSWITCH LACP versus Linux Bonding

 The VSWITCH solution provides flexibility and ease of use
• At very high bandwidth there is a significant CPU cost
 The Linux Bonding solution does not share interfaces among servers
• Additional OSA and router ports may be required
• Network routing becomes more complicated
 Throughput improvement less than expected
• Still latencies to be discovered
 Not every application uses 100 MB/s
• With lower bandwidth the CPU cost is less
• But LACP is meant for high bandwidth
 It is not obvious what the CPU is used for
• There may be options for improvement

(Chart: CPU usage at 100 MB/s – System Overhead, TSMSERV CP Overhead, and TSMSERV Emulation Time – comparing VSWITCH LACP against Linux Bonding.)

My Penguin can't sleep

Linux servers without work should be idle
 Virtual machines drop from queue at transaction end
• CP considers a transaction complete after 300 ms idle
  (the queue drop delay is a bit more complicated than this)
 Linux servers tend to have some background work
• Frequent CPU usage causes the server to stay in queue
• CP is reluctant to take pages from in-queue virtual machines
• No queue drop = a non-interactive virtual machine (batch-like)
 In-queue idle servers impact scalability
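The queue-drop condition above reduces to a simple comparison: a guest only leaves the queue if it stays idle longer than the test-idle period. A sketch (the real CP heuristic is more involved, as the slide notes):

```python
def drops_from_queue(wakeup_interval_ms, test_idle_ms=300):
    """A guest drops from queue only when its idle stretches
    exceed the ~300 ms test-idle period; a simplified model
    of the CP scheduler behavior described above."""
    return wakeup_interval_ms > test_idle_ms

# A guest on the 100 Hz timer tick wakes every 10 ms:
ticking_guest = drops_from_queue(10)       # never drops
# An on-demand-timer guest idling ~1.5 s between transactions:
quiet_guest = drops_from_queue(1500)       # drops, pages reclaimable
```

This is the whole motivation for the on-demand timer on the following slides: turning a 10 ms wake-up pattern into multi-second idle stretches is what lets CP treat the server as interactive and reclaim its pages.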

My Penguin can't sleep – Example of an idle Linux server

 Found waiting for CPU resource 5% of the time
 Never found actually running
 95% of the time in "test idle" (waiting for queue drop)

Screen: ESAXACT          1 of 2       ESAMON 3.7.0 CLASS * USER
Transaction Delay Analysis

CP trace shows the 100 Hz timer tick – consecutive STCK instructions, mostly 10 ms apart:

000000000010CB4E' STCK B2050DE8 >> R00000DE8 C067D666 6D01818A
000000000010CB4E' STCK B2050DE8 >> R00000DE8 C067D666 6F80F54A

(Chart: Timer Interrupts with the 10 ms timer – the time difference between ticks hovers around 10 ms over one second of elapsed time.)
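The 10 ms gap can be verified from the two STCK values in the trace. Bit 51 of the z/Architecture TOD clock represents one microsecond, so dividing the 64-bit value by 4096 yields microseconds:

```python
def tod_delta_ms(tod1, tod2):
    """Difference between two STCK (TOD clock) values in ms.
    TOD bit 51 is one microsecond, so value / 4096 = microseconds."""
    return (tod2 - tod1) / 4096 / 1000

# The two STCK results from the trace above:
t1 = 0xC067D6666D01818A
t2 = 0xC067D6666F80F54A
gap = tod_delta_ms(t1, t2)   # close to 10 ms: the 100 Hz timer tick
```

The same arithmetic works on any pair of STCK values pulled from a CP trace, which is a quick way to confirm a suspected tick interval.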

My Penguin can't sleep – Linux on-demand timer: System #1

 Avoids the 10 ms timer tick when otherwise idle
 Should be configured as /proc/sys/kernel/hz_timer = 0
 The default setting changed with various releases

With the on-demand timer: 100% of the time in "test idle"; 42 transactions/min, so on average 1.5 seconds idle.

Screen: ESAXACT          1 of 2       ESAMON 3.7.0 CLASS * USER
Transaction Delay Analysis

CP trace of the wake-up shows the comparison against the TOD clock:

000000000033BCCC' LGR B904003B  G03=00000000000005DC
-> 000000000033BCD4' LG  E3100DD8 G01=000000001FE86D88
   V1FE86E8C 0000044C 0000044B
   V00000DE8 C066F5C8 C8E8E5C2   (TOD clock)

My Penguin can't sleep – Timer Requests: System #1

 PID 1: init
• 5 sec check for dead orphans
 PID 1086 / 1087: nscd
• 15 sec to expire any cached items
 There are also timer interrupts for the kernel threads and drivers
• Visible in TRACE EXT 1004
• Not something you tune yourself
 Timer interrupts are different from wake-up calls
• Multiple places where the call is made
• Timer requests get merged

Request Time               Timeout   PID
2007-04-10 10:29:25.387323     500     1
2007-04-10 10:29:30.386720     500     1
2007-04-10 10:29:33.227057    1500  1086
2007-04-10 10:29:33.227057    1500  1087
2007-04-10 10:29:36.226925     500     1
2007-04-10 10:29:41.227045     500     1
2007-04-10 10:29:46.227236     500     1
2007-04-10 10:29:48.227111    1500  1086
2007-04-10 10:29:48.227111    1500  1087
2007-04-10 10:29:52.226871     500     1
2007-04-10 10:29:57.227049     500     1
2007-04-10 10:30:02.227412     500     1
2007-04-10 10:30:04.227001    1500  1086
2007-04-10 10:30:04.227001    1500  1087
2007-04-10 10:30:07.226932     500     1
2007-04-10 10:30:12.226734     500     1
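The Timeout column lines up with the stated intervals if it is read as kernel jiffies at HZ=100 (an assumption consistent with the 10 ms tick: 500 jiffies for init's 5-second check, 1500 for nscd's 15-second cache expiry):

```python
def timeout_seconds(jiffies, hz=100):
    """Convert a timeout in kernel jiffies to seconds.
    HZ=100 is assumed here, matching the 10 ms timer tick."""
    return jiffies / hz

init_interval = timeout_seconds(500)    # init's orphan check
nscd_interval = timeout_seconds(1500)   # nscd's cache expiry
```

Reading raw trace output this way makes it easy to attribute each periodic wake-up to the process and the documented interval behind it.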

My Penguin can't sleep – Timer Requests: System #1 (continued)

 Stopped the nscd process
 What remains is init at a 5 second interval
 Kernel interrupts
• 2 sec: reap_cache
• 30 sec: do_cache_clean

(Chart: Timer Interrupt Analysis, System #1 – time between interrupts in the dormant and test-idle states, mostly between 0 and 2.5 s.)

My Penguin can't sleep – PowerTOP

 Frequent wake-ups for nothing bother others too!
• Unable to lower the CPU frequency – reduces laptop battery life
 PowerTOP reveals what causes the wake-ups
• Here: java processes cause 120 wake-up calls per second (worse than the 100 Hz timer)

PowerTOP 1.8       (C) 2007 Intel Corporation

Collecting data for 15 seconds
< Detailed C-state information is only available on Mobile CPUs (laptops) >
P-states (frequencies)
Wakeups-from-idle per second : 122.5    interval: 15.0s
Top causes for wakeups:
  98.4% (120.5)  java : schedule_timeout (process_timeout)
   0.4% (  0.5)  : queue_delayed_work_on (delayed_work_timer_fn)
   0.2% (  0.2)  init : schedule_timeout (process_timeout)
   0.2% (  0.2)  : page_writeback_init (wb_timer_fn)
   0.2% (  0.2)  : neigh_table_init_no_netlink (neigh_periodic_timer)
   0.2% (  0.2)  nscd : schedule_timeout (process_timeout)
   0.1% (  0.1)  : neigh_table_init_no_netlink (neigh_periodic_timer)

My Penguin can't sleep – PowerTOP (continued)

 The wake-up calls disappear when the JVM is stopped
• This may not be a useful option in real life
 PowerTOP requires a 2.6.21 kernel; should work on SLES 11

PowerTOP 1.8       (C) 2007 Intel Corporation

Collecting data for 15 seconds
< Detailed C-state information is only available on Mobile CPUs (laptops) >
P-states (frequencies)
Wakeups-from-idle per second : 1.9      interval: 15.0s
Top causes for wakeups:
  29.6% ( 0.5)  : queue_delayed_work_on (delayed_work_timer_fn)
  14.8% ( 0.3)  : neigh_table_init_no_netlink (neigh_periodic_timer)
  11.1% ( 0.2)  init : schedule_timeout (process_timeout)
  11.1% ( 0.2)  : page_writeback_init (wb_timer_fn)
  11.1% ( 0.2)  nscd : schedule_timeout (process_timeout)
   7.4% ( 0.1)  : neigh_table_init_no_netlink (neigh_periodic_timer)
   3.7% ( 0.1)  sshd : schedule_timeout (process_timeout)
   3.7% ( 0.1)  : sk_reset_timer (tcp_delack_timer)
   3.7% ( 0.1)  sshd : sk_reset_timer (tcp_write_timer)
   3.7% ( 0.1)  ip : __netdev_watchdog_up (dev_watchdog)

My Penguin can't sleep – Linux on-demand timer: System #2

 Virtual machine reported as 135% in-queue: a virtual 2-way
• To be really idle, both virtual CPUs must be idle at the same time
• Makes it very hard for CP to find the virtual machine idle
  Not an easy candidate to take pages away from

Screen: ESAXACT          1 of 2       ESAMON 3.7.0 CLASS * USER
Transaction Delay Analysis          Marist OSDL

Counting Ext 1004 – CPU 00: 97, CPU 01: 314