Linux on z/VM – Understanding CPU Usage
Rob van der Heij – rob@velocitysoftware.de
IBM System z Technical Conference, Brussels, 2009 – Session LX44

Copyright 2009 Velocity Software, Inc. All Rights Reserved. Other products and company names mentioned herein may be trademarks of their respective owners.
Introduction – Why would you care about CPU Usage?

CPU virtualization is the easiest part
• System z CPU is designed for virtualization and sharing

Sharing CPU resources raises questions as well
• Is my workload held back because of CPU constraints?
• Why did other workload get the resources that I want?
• Are there more resources available, and why didn't I get them?

Sharing the resources requires "social behavior"
• Cycles wasted on useless work can't be used for real work
Introduction – What could you do about CPU Usage?

Understand where your CPU cycles are spent
• Measure and identify CPU usage

Reduce peak CPU requirements
• Find more economic ways to do the work
• Avoid work that does not need to be done
• Move some work to quiet "night shift" hours

Pay attention to "idle load"

This is not the same as benchmarking: top speed versus mileage
• Most economic way to do the work within the SLA
• Scalability of applications
Agenda

CPU Usage Breakdown
• LPAR
• z/VM
• Virtual Machine

CPU Accounting

Linux CPU Usage
• What is that Penguin Doing?
• Linux Server with High Overhead
• Improving TSM Throughput
• My Penguin can't Sleep

Performance data shown in the presentation was collected and processed with ESALPS.
CPU Usage Breakdown – Logical Partition Level

LPARs more or less share the CPU resources
• Logical CPUs are defined shared or dedicated
• For shared CPUs the LPAR weight may be important
• CPs and IFLs are not mixed
  Exception: "VM Mode LPAR" – z/VM 5.4 on selected hardware

PR/SM overhead is normally pretty small

Example: only 1 IFL LPAR with 3 dedicated CPUs – nobody to share with
CPU Usage Breakdown – PR/SM Management Time

Not attributed to one specific LPAR
• Scheduling etc. – spread over all physical CPUs
• Typically < 1% per CPU
• Depends on number of LPARs and logical CPUs

PR/SM work on behalf of a specific LPAR
• Dispatching, QDIO, etc. – spread over logical CPUs of the LPAR
• Typically < 1% per logical CPU
• Depends on workload

         Virt
Name      Nbr  CPUs  Total  Ovhd
--------  ---  ----  -----  ----
LP1         0     3   83.6   1.2
LP3         2     1    0.1   0.0
[Figure: stacked bars for LPAR1, LPAR2 and LPAR3 showing productive work by the guest OS, LPAR management time, LPAR overhead, and unused capacity]
CPU Usage Breakdown – Logical CPU View

Guest OS dispatches workload over logical CPUs
• When there is no work to do: CPU idle (wait state)
• LPAR recognizes an idle CPU – the "white space" is shared
  Exception: when "wait complete" is set for the LPAR

     Total  Emul   User   Sys   Idle
CPU   util  time  ovrhd  ovrhd  time
---  -----  ----  -----  -----  ----
00    40.9  30.8    6.9    3.3  58.6
01    40.4  32.4    5.9    2.1  59.1
02    40.2  31.9    6.0    2.3  59.3
[Figure: 100% stacked bars for logical CPU 0, CPU 1 and CPU 2 of LPAR1, plus LPAR overhead]
CPU Usage Breakdown – z/VM View: Totals

System Overhead – general CP work
• Scheduling, Monitor, Accounting

User Overhead – CP work on behalf of a specific user

z/VM metrics: True CPU%

     Total  Emul   User   Sys   Idle
CPU   util  time  ovrhd  ovrhd  time
---  -----  ----  -----  -----  ----
00    40.9  30.8    6.9    3.3  58.6
01    40.4  32.4    5.9    2.1  59.1
02    40.2  31.9    6.0    2.3  59.3

[Figure: 100% stacked bar splitting CPU time into emulation time, user overhead and system overhead]
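The T/V ratio used later in this presentation follows directly from these metrics: total time is emulation (virtual) time plus the user overhead done by CP for that user. A minimal sketch using the CPU 00 row from the table above:

    #!/usr/bin/env python3
    # Sketch: derive the T/V ratio from the z/VM metrics shown above.
    # Total time = emulation (virtual) time + user overhead.
    emul, user_ovrhd = 30.8, 6.9        # CPU 00 from the table
    total = emul + user_ovrhd           # 37.7 - excludes system overhead
    print("T/V ratio = %.2f" % (total / emul))   # ~1.22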
CPU Usage Breakdown – z/VM View: Virtual Machines

Virtual Time – Virtual Machine work (SIE)
Total Time – Virtual Time plus User Overhead

Some Virtual Machines are not real "users" but system functions
• RACFVM, TCPIP, DIRMAINT

Screen: ESAUSP2          Velocity Software          1 of 3
User Percent Utilization
Time     UserID     Total  Virt
         /Class
-------- --------   -----  ----
16:13:00 SUSELNX2    3.64  3.59
         REDHAT04    2.89  2.80
         ORACLE      2.12  2.08
         VMRLNX      1.89  1.88
         DXT2LV      0.61  0.31
         ROBLX1      0.35  0.35
         TCPIP       0.28  0.13
         SUSELNX1    0.24  0.21
         REDHAT3     0.21  0.18
         SLES8       0.19  0.18
         ROBLX2      0.12  0.11

Sum of per-user totals versus total usage = Capture Ratio
[Figure: stacked bar splitting total usage into emulation time, user overhead and system overhead]
CPU Accounting

Mainframe Operating Systems do CPU accounting
• Required for charge-back of shared resources

z/VM account records
• Written at logoff or through a CP command
• Resource usage per virtual machine: CPU usage, I/O operations
• Very simple to process and easy to audit

But the records do not tell you why, and they lack detail for Linux
CPU Accounting – Charge-back

Charge-back is meant to recover the total data center cost
• CPU is not the major cost factor anymore
• CPU usage is traditionally still used for charge-back
  CPU usage is considered representative for the amount of use
  Total data center cost divided by consumed CPU hours
  CPU tariff based on estimated usage and the capacity plan

IFLs with Linux and z/VM break the model
• Installations add a substantial amount of MIPS
• Linux applications also consume a lot of CPU hours
• A Linux Proof of Concept was charged much of the z/OS license cost

Charge-back motivates users to save resources
• Make sure to arrange a correct cost model for Linux
CPU Accounting – Detail for Linux

CPU Accounting for Linux on z/VM needs detail
• Just listing totals is not enough to convince customers
• Exceptional usage must be explained very clearly

A Performance Monitor can reveal the detail
• Collects CPU data along with many other metrics
  ESALPS collects ~3500 unique metrics every minute;
  hundreds of them are repeated per device, per user or per Linux process
• Helps to understand the sequence of events causing a problem
• Explains any excessive usage to the application owner

Requires detailed "performance history"
Requires complete data – a capture ratio of 100%

The Performance Monitor helps to validate the cost model
Visualization Techniques – Comparing Memory and CPU Usage
CPU Usage Breakdown – Linux System View

Virtual Machine Emulation Time is available for Linux usage
Steal Time: when Linux does not know what the CPU was used for
Linux Administrator be aware: "idle" ≠ "available for use"

Linux-2.4  Linux-2.6  Meaning
---------  ---------  --------------------------------------
User       User       Process usage
Nice       Nice       Background process usage
System     Kernel     Kernel usage
           Soft-IRQ   Kernel-related CPU usage
           Interrupt  First-level interrupt handlers
Idle       Idle       No CPU usage
           I/O wait   CPU waiting for I/O
           Steal      CPU cycles "stolen" by the hypervisor

[Figure: mapping of the Linux CPU states (user, nice, system, idle) onto the virtual machine's emulation time and user overhead]
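On 2.6 kernels the steal state shown above is exposed as the eighth column of the "cpu" line in /proc/stat. A minimal sketch (assuming a kernel recent enough to report the steal column) that samples the split over a few seconds:

    #!/usr/bin/env python3
    # Sketch: sample the CPU-state split from /proc/stat. On 2.6 kernels the
    # "cpu" line holds: user nice system idle iowait irq softirq steal.
    import time

    def cpu_fields():
        with open("/proc/stat") as f:
            for line in f:
                if line.startswith("cpu "):
                    return [int(v) for v in line.split()[1:]]

    before = cpu_fields()
    time.sleep(5)
    after = cpu_fields()
    delta = [a - b for a, b in zip(after, before)]
    total = sum(delta)
    names = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]
    for name, ticks in zip(names, delta):
        print("%-8s %5.1f%%" % (name, 100.0 * ticks / total))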
CPU Usage Breakdown – Linux Process View

CPU resources are allocated to processes
• Each process accumulates some system time plus some user time (or nice)
• Processes should add up to total system and user time
  Capture ratio!

Linux CPU accounting
• Traditionally wrong due to virtualization
  Linux tools would show numbers that are too high
• Modern kernels use virtual CPU accounting
  Linux tools sometimes show wrong data
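The capture ratio idea can be tried by hand: sum utime+stime over all processes and divide by the system-wide user+nice+system ticks. A rough sketch (short-lived processes that already exited are missed, which is exactly why the capture ratio matters):

    #!/usr/bin/env python3
    # Sketch: compare the sum of per-process CPU time (utime+stime from
    # /proc/<pid>/stat, fields 14 and 15) with the system-wide
    # user+nice+system ticks from /proc/stat.
    import os

    def process_ticks():
        total = 0
        for pid in filter(str.isdigit, os.listdir("/proc")):
            try:
                with open("/proc/%s/stat" % pid) as f:
                    # split after the last ")" so comm names with spaces survive
                    fields = f.read().rsplit(")", 1)[1].split()
                total += int(fields[11]) + int(fields[12])  # utime, stime
            except IOError:
                pass  # process exited while we were looking
        return total

    def system_ticks():
        with open("/proc/stat") as f:
            cpu = [int(v) for v in f.readline().split()[1:]]
        return cpu[0] + cpu[1] + cpu[2]  # user + nice + system

    print("capture ratio ~ %.2f" % (process_ticks() / system_ticks()))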
Why can't I use my Linux Tools?

Linux data is incomplete and sometimes incorrect
• Virtualization changes the rules of the game
• CPU usage as perceived by Linux can be very wrong
• Assumptions about "used" and "available" do not hold anymore

z/VM performance impacts Linux behavior
• Need to combine Linux and z/VM performance data

z/VM does not clone system administrators
• You may not have time to look when it happens
• Complex interactions make it hard to reproduce scenarios
• A multi-tier application involves multiple virtual servers
• Centralized data collection is easier to manage
• May need to share data with others to understand it
What is that Penguin doing – High Level Overview

Shows no real detail
Sometimes enough for a quick check

Screen: ESAMAIN                1 of 3
System Overview
[Screen excerpt: one-minute samples 02:43:00 through 02:49:00; detail columns not reproduced]
What is that Penguin doing – Usage Breakdown per User

So one server used 25% of a CPU last minute
• Is that good or bad?
• Often you can't really tell without knowing behavior over time

Screen: ESAUSP2                1 of 3
User Percent Utilization
What is that Penguin doing – Single User over Time

Looking at usage in the recent past shows "when it started"
• Frequently more productive than waiting until it stops

For multi-tier applications you need to look at multiple servers
• Arrange servers in classes for an "application view"

Screen: ESAUSP2                1 of 3
User Percent Utilization
[Screen excerpt: one-minute samples 02:59:00 through 03:07:00; detail columns not reproduced]
What is that Penguin doing – Looking inside the Linux Server

Identify the Linux processes that consume the resources

Screen: ESALNXP                1 of 3
LINUX VSI Process Statistics Report
Time     Node     Name         ID  PPID   GRP
-------- -------- --------  ----- ----- -----
03:02:00 dominoz1 clrepl    12194  2536  2483
                  updall    11500  2536  2483
                  smdemf     5209  2536  2483
                  sched      5181  2536  2483
                  update     5174  2536  2483
                  replica    5168  2536  2483
                  server     2536  2483  2483
                  snmpd      1768     1  1767
                  kjournal   1140     1     1
                  kswapd0     134     1     1
                  pdflush     133     8     0
                  *Totals*      0     0     0
CPU Overhead

CPU Overhead can mean many different things
• Productive work for one is overhead for another
• Make sure your peer means the same thing
• You are only aware of it when you can measure it
  With System z and z/VM we can measure it
• Hardware support keeps overhead mostly low

Sometimes abnormal behavior increases overhead
• Spending resources on other things than the workload
• A Performance Monitor often helps to clarify things
Linux Server with High Overhead

Customer reports on a Linux server with high CP cost
• Linux server using 25-30% of a CPU
• Almost half of that is "CP overhead"
  T/V ratio of 1.8 – total time is 1.8 × virtual time
  (e.g. 27% total ≈ 15% virtual + 12% CP overhead)
  Work that CP does on behalf of the virtual machine

z/VM has plenty of CPU resources
• Linux guest does not appear to be held back

Question: What is Linux doing? Why the high overhead?
Answer: Doing Nothing!
Linux Server with High Overhead – Review Linux Internal CPU Statistics

Linux reports total usage of ~5-6%
z/VM reports total usage of ~25-30%
Someone is off by a factor of 5

Server runs SLES 10

[Screen excerpt: per-minute samples 12:50:00 through 12:55:00; detail columns not reproduced]
Linux Server with High Overhead – Suspect Process

DB2 process 'db2fmcd' is suspicious
• Has no function with Linux on System z
  Provided for compatibility with some other configurations
• Largest single source of CPU usage in the sample
  Likely triggers the work done by the db2sysc processes
• Probably does something that creates high overhead

Reviewed CP trace data to understand the overhead
• Determine the cause of the SIE intercepts
• Normal behavior: Linux goes idle and wakes up again
• But it does that very often… 100,000 SIE intercepts per second
Linux Server with High Overhead – Frequent Wake-up

Application requests frequent wake-up
• Wake-up requests with a delay of less than 10 ms
• This is polling – frowned upon in a shared environment
• Unclear whether this is a bug or a design failure

Kernel bug rounds the small delay to 0
• Introduced with "high resolution timer" support
• Rounded to 0 ms, i.e. an immediate wake-up

Timer interrupt is presented when enabled
• CP dispatches the virtual machine immediately
• Eventually the minor time slice is consumed
  Scheduler reviews the queue and dispatches later
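For illustration only – this is not the customer's actual code – the pattern behind such an intercept storm is a loop that sleeps for less than the timer granularity, as in this sketch:

    #!/usr/bin/env python3
    # Illustration of the polling anti-pattern: a sub-10 ms sleep in a loop.
    # Under z/VM each wake-up forces a SIE intercept and a re-dispatch of the
    # virtual machine, so "doing nothing" still burns CPU in CP.
    import time

    def poll_for_work(check, interval=0.001):   # 1 ms - far too eager
        while not check():
            time.sleep(interval)                # thousands of wake-ups/second

    def wait_for_work(work_queue, timeout=None):
        # The friendly alternative: block until work actually arrives,
        # e.g. on a queue.Queue - zero wake-ups while idle.
        return work_queue.get(block=True, timeout=timeout)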
Linux Server with High Overhead – Conclusion

Something in the application is polling
• Customer did some traces that point at process db2redom
• The db2fmcd process was the biggest single consumer
• Probably DB2 was confused in the recovery process
• Most likely not productive processing

High CP overhead due to a Linux kernel bug
• Turns a short sleep into an immediate wake-up
• Fix is upstream and will eventually go into the distributions

Wrong Linux CPU accounting due to another bug
• Fix is supposed to be in the pipeline

Latency in z/VM prevented Linux from taking more
• You can't always tell from CPU alone that it is looping
Improving TSM Throughput – Customer Scenario

Nightly backup of discrete servers to TSM on System z
• Dedicated OSA for the Linux server with TSM
• Bottleneck appears to be the physical GbE connection
• Limited CPU usage thanks to QEBSM

[Diagram: TSMSERV guest with a dedicated OSA port; other guests connect through the VSWITCH]
Improving TSM Throughput – LACP

LACP: Link Aggregation Control Protocol
• Bundles multiple physical links into one logical path (IEEE 802.3ad)
• Connects external switches with the VSWITCH
• Also provides the fail-over function
• Using 4 GbE ports should give 4-fold throughput

[Diagram: TSMSERV connected through the VSWITCH, which bundles multiple OSA ports with LACP towards the external switch]
Improving TSM Throughput – LACP VSWITCH: Real World Experience

Potential 4-fold throughput is just theoretical
• Discrete servers connect with a single GbE each
• Need sufficient servers to provide the data

Distribution over the physical paths is not balanced
• Connections are spread over the paths by some hash function
  (see the sketch after this slide)
• In this scenario only 3-4 communication pairs are active

Still achieved almost 50% improvement over a single fiber
• Increased QDIO buffers from 16 to 128
[Chart: Network Throughput – 19 Jan 2009; MB/s received per OSA device (0D00, 1D00, 2D00, 3D00), 00:00 to 02:30; annotated "Huh?" where the distribution over the devices is clearly uneven]
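The imbalance is inherent to per-connection hashing: with only a handful of source/destination pairs, a hash cannot spread flows evenly over four links. A small sketch of the effect (the hash function is a made-up stand-in for whatever the switch implements):

    #!/usr/bin/env python3
    # Sketch: spreading connections over 4 links with a per-pair hash.
    # With few communication pairs some links carry several flows and
    # others none, which is what the per-OSA throughput chart showed.
    NUM_LINKS = 4

    def pick_link(src, dst):
        # Illustrative stand-in for the switch's hash on the address pair.
        return hash((src, dst)) % NUM_LINKS

    pairs = [("client%d" % i, "tsmserv") for i in range(4)]  # 4 backup clients
    links = [0] * NUM_LINKS
    for src, dst in pairs:
        links[pick_link(src, dst)] += 1
    print("flows per link:", links)   # e.g. [2, 0, 1, 1] - not balanced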
Improving TSM Throughput – LACP VSWITCH: Real World Experience (continued)

CP overhead has increased significantly – T/V ratio of 1.3
• Dedicated OSA was replaced by a VNIC
• No hardware support from QEBSM – CP simulates SIGA
Strong correlation between bandwidth and user overhead
• No strange things happening – a linear relation
• Receiving 100 MB/s: Linux time 65% of a CPU, user overhead 22% of a CPU

[Chart: TSMSERV CP overhead (CPU%) versus VSWITCH throughput (MB/s), 10-150 MB/s, with linear fit y = 0.2217x + 1.311]
[Chart: TSMSERV emulation time and CP overhead versus VSWITCH throughput, and over time 00:00 to 02:30]
Improving TSM Throughput – LACP VSWITCH: CPU Usage

More than just the virtual machine
• Total CPU utilization ~190%
• Also a rather large System Overhead

Other high priority workload kicked in
• Matches the dip in throughput
• Throughput is now limited by CPU

[Chart: TSMSERV CPU Usage – 19 Jan 2009, CP overhead versus emulation time, 00:00 to 02:30]
[Chart: CPU Usage – 19 Jan 2009, by type: System, CP, User]
[Chart: CPU Usage – 19 Jan 2009, by user: TSMSERV, SAP000, SAP005, SAP025, Others]
Improving TSM Throughput – LACP VSWITCH: System Overhead

System overhead correlates with VSWITCH bandwidth
• This is different from the CP overhead charged to TSMSERV
• Pretty linear relation – about 24% CPU for 100 MB/s

Probably for work that CP does to receive the data
• Decoding the LACP packets
• Copying data from real QDIO buffers to VNIC buffers

Receiving 100 MB/s:
  Linux internal work     65%
  CP overhead for Linux   22%
  System overhead         24%
  Total                  111%

[Chart: System overhead (CPU%) versus VSWITCH throughput (MB/s), 10-150 MB/s, with linear fit y = 0.2372x + 1.4176]
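Taken together, the fitted lines give a rough planning formula for the LACP VSWITCH setup. A minimal sketch; the coefficients are the ones measured above and are specific to this TSM workload, not general constants:

    #!/usr/bin/env python3
    # Sketch: estimate the CPU cost of the LACP VSWITCH path from the two
    # linear fits measured above. Coefficients are workload-specific.
    def cp_overhead(mbs):       # CP overhead charged to TSMSERV (CPU%)
        return 0.2217 * mbs + 1.311

    def system_overhead(mbs):   # CP system overhead for the VSWITCH (CPU%)
        return 0.2372 * mbs + 1.4176

    LINUX_CPU_PER_100MB = 65.0  # measured TSMSERV emulation time at 100 MB/s

    mbs = 100.0
    total = LINUX_CPU_PER_100MB * mbs / 100 + cp_overhead(mbs) + system_overhead(mbs)
    print("estimated CPU at %.0f MB/s: %.0f%% of a CPU" % (mbs, total))
    # prints ~114%, in line with the slide's rounded 111% total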
Improving TSM Throughput – Ethernet Bonding in Linux

Linux implementation of LACP
• Requires exclusive OSA ports, like the VSWITCH does
• Other ports remain for VSWITCH fail-over

[Diagram: TSMSERV guest driving its own LACP bond over dedicated OSA ports; the remaining ports serve the VSWITCH]
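As a quick sanity check, the negotiated bonding mode can be read from the bonding driver's status file; a minimal sketch (the device name bond0 is an example):

    #!/usr/bin/env python3
    # Sketch: verify that a Linux bond is running LACP (802.3ad) by reading
    # the bonding driver's status file. "bond0" is an example device name.
    def bond_mode(device="bond0"):
        with open("/proc/net/bonding/%s" % device) as f:
            for line in f:
                if line.startswith("Bonding Mode:"):
                    return line.split(":", 1)[1].strip()

    print(bond_mode())   # expect "IEEE 802.3ad Dynamic link aggregation"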
Improving TSM Throughput – Linux Bonding: Performance Measurements

Maximum throughput slightly higher using all 4 paths

System Overhead has disappeared
• CP has no inbound traffic for the VSWITCH anymore

CP Overhead for TSMSERV is gone
• CP is not even aware of the traffic – QEBSM handles it

Linux CPU usage per MB has increased
• Code paths are different for qeth using QEBSM versus SIGA

[Chart: Throughput – Linux Bonding, MB/s, 01:00 to 02:45]
[Chart: CPU Usage – Linux Bonding, TSMSERV System/CP/Emul, 01:00 to 02:45]
Improving TSM Throughput – VSWITCH LACP versus Linux Bonding

VSWITCH solution provides flexibility and ease of use
• At very high bandwidth there is a significant CPU cost
• With lower bandwidth the CPU cost is less
• But LACP is meant for high bandwidth

Linux Bonding solution does not share interfaces among servers
• Additional OSA and router ports may be required
• Network routing becomes more complicated

Throughput improvement less than expected
• Still latencies to be discovered
• Not every application uses 100 MB/s

It is not obvious what the CPU is used for
• There may be options for improvement

[Chart: CPU Usage at 100 MB/s, VSWITCH LACP versus Linux Bonding, split into system overhead, TSMSERV CP overhead and TSMSERV emulation time]
My Penguin can't sleep

Linux servers without work should be idle
• Virtual machines drop from queue at transaction end
• CP considers a transaction complete after 300 ms idle
  (the queue drop delay is a bit more complicated than this)

Linux servers tend to have some background work
• Frequent CPU usage causes the server to stay in queue
• CP is reluctant to take pages from in-queue virtual machines
• No queue drop = a non-interactive virtual machine (batch-like)

In-queue idle servers impact scalability
My Penguin can't sleep – Example of an Idle Linux Server

Found waiting for CPU resource 5% of the time
Never found actually running
Waiting for queue drop 95% of the time

Screen: ESAXACT                1 of 2
Transaction Delay Analysis
My Penguin can't sleep – Linux On-demand Timer: System #1

Avoids the 10 ms timer ticks when otherwise idle
• Configure with /proc/sys/kernel/hz_timer = 0
• The default setting changed with various releases

100% of the time in "test idle"
42 transactions/min ~ average 1.5 seconds idle
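A minimal sketch to inspect, and as root disable, the regular tick through the sysctl named above (s390-specific; availability depends on the kernel level):

    #!/usr/bin/env python3
    # Sketch: inspect the s390 on-demand timer setting named on the slide.
    # Writing requires root; 0 = on-demand ticks, 1 = regular 100 Hz ticks.
    HZ_TIMER = "/proc/sys/kernel/hz_timer"

    with open(HZ_TIMER) as f:
        print("hz_timer =", f.read().strip())

    # To disable the regular tick (run as root):
    # with open(HZ_TIMER, "w") as f:
    #     f.write("0\n")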
My Penguin can't sleep – Timer Requests: System #1

Stopped the nscd process
• What remains: init at a 5 second interval
• Kernel interrupts:
  reap_cache every 2 seconds
  do_cache_clean every 30 seconds

[Chart: Timer Interrupt Analysis – System #1; time between interrupts (s), 0 to 2.5, "dormant" versus "test idle"]
My Penguin can't sleep – PowerTOP

Frequent wake-up for nothing bothers others too!
• On a laptop it keeps the CPU from lowering its frequency – reduces battery life

PowerTOP reveals what causes the wake-ups
• Here: java processes cause 120 wake-up calls per second (worse than the 100 Hz timer)

PowerTOP 1.8       (C) 2007 Intel Corporation

Collecting data for 15 seconds
< Detailed C-state information is only available on Mobile CPUs (laptops) >
P-states (frequencies)
Wakeups-from-idle per second : 122.5     interval: 15.0s
Top causes for wakeups:
  98.4% (120.5)       java : schedule_timeout (process_timeout)
   0.4% (  0.5)            : queue_delayed_work_on (delayed_work_timer_fn)
   0.2% (  0.2)       init : schedule_timeout (process_timeout)
   0.2% (  0.2)            : page_writeback_init (wb_timer_fn)
   0.2% (  0.2)            : neigh_table_init_no_netlink (neigh_periodic_timer)
   0.2% (  0.2)       nscd : schedule_timeout (process_timeout)
   0.1% (  0.1)            : neigh_table_init_no_netlink (neigh_periodic_timer)
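PowerTOP 1.x derives this list from the kernel's /proc/timer_stats interface (CONFIG_TIMER_STATS, kernel 2.6.21 and later). The same measurement can be done by hand, as in this sketch assuming that interface is present:

    #!/usr/bin/env python3
    # Sketch: sample /proc/timer_stats (2.6.21+, CONFIG_TIMER_STATS) the way
    # PowerTOP 1.x does: enable collection, wait, then read the event counts.
    import time

    with open("/proc/timer_stats", "w") as f:   # requires root
        f.write("1\n")                          # start collecting
    time.sleep(15)
    with open("/proc/timer_stats") as f:
        print(f.read())   # lines like "  120,  1234 java  schedule_timeout (process_timeout)"
    with open("/proc/timer_stats", "w") as f:
        f.write("0\n")                          # stop collecting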
My Penguin can't sleep – PowerTOP (continued)

Wake-up calls disappear when the JVM is stopped
• This may not be a useful option in real life

Requires a 2.6.21 kernel – should work on SLES 11

PowerTOP 1.8       (C) 2007 Intel Corporation

Collecting data for 15 seconds
< Detailed C-state information is only available on Mobile CPUs (laptops) >
P-states (frequencies)
Wakeups-from-idle per second : 1.9     interval: 15.0s
Top causes for wakeups:
  29.6% (  0.5)            : queue_delayed_work_on (delayed_work_timer_fn)
  14.8% (  0.3)            : neigh_table_init_no_netlink (neigh_periodic_timer)
  11.1% (  0.2)       init : schedule_timeout (process_timeout)
  11.1% (  0.2)            : page_writeback_init (wb_timer_fn)
  11.1% (  0.2)       nscd : schedule_timeout (process_timeout)
   7.4% (  0.1)            : neigh_table_init_no_netlink (neigh_periodic_timer)
   3.7% (  0.1)       sshd : schedule_timeout (process_timeout)
   3.7% (  0.1)            : sk_reset_timer (tcp_delack_timer)
   3.7% (  0.1)       sshd : sk_reset_timer (tcp_write_timer)
   3.7% (  0.1)         ip : __netdev_watchdog_up (dev_watchdog)
My Penguin can't sleep – Linux On-demand Timer: System #2

Virtual machine reported as 135% in-queue: a virtual 2-way
• To be really idle, both virtual CPUs must be idle at the same time
• Makes it very hard for CP to find the virtual machine idle
  Not an easy candidate to take pages away from

Screen: ESAXACT   Marist OSDL   1 of 2
Transaction Delay Analysis