VSP3866 Performance Best Practices and Troubleshooting Name, Title, Company.


Transcript of VSP3866 Performance Best Practices and Troubleshooting Name, Title, Company.

Page 1: VSP3866 Performance Best Practices and Troubleshooting Name, Title, Company.

VSP3866

Performance Best Practices and Troubleshooting

Name, Title, Company

Page 2

Disclaimer

This session may contain product features that are currently under development.

This session/overview of the new technology represents no commitment from VMware to deliver these features in any generally available product.

Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.

Technical feasibility and market demand will affect final delivery.

Pricing and packaging for any new technologies or features discussed or presented have not been determined.

Page 3

Page 4

Overview

Have you ever heard:

“My VM is running slow and I don’t know what to do”

“I tried adding more memory and CPUs but that didn’t work”

If you hear this, you have obviously hit a performance bottleneck.

Today, you’ll learn some of the most common gotchas and best practices behind the performance and density calls we see in VMware’s customer support centers.

Page 5

Agenda

Benchmarking & Tools

Memory

• Best Practices

• Over commitment

• Troubleshooting

CPU

• Best Practices

• Scheduling

• Troubleshooting

Storage

• Considerations

• Troubleshooting

Network

Page 6

© 2009 VMware Inc. All rights reserved

BENCHMARKING & TOOLS

Page 7

A word about performance… (my favorite slide!)

Chethan Kumar and Hal Rosenberg have a great statement in their Performance Troubleshooting guide for vSphere 4.1:

A performance troubleshooting methodology must provide guidance on how to find the root-cause of the observed performance symptoms, and how to fix the cause once it is found. To do this, it must answer the following questions:

1. How do we know when we are done?

2. Where do we start looking for problems?

3. How do we know what to look for to identify a problem?

4. How do we find the root-cause of a problem we have identified?

5. What do we change to fix the root-cause?

6. Where do we look next if no problem is found?

http://communities.vmware.com/docs/DOC-14905

Page 8

Benchmarking Methodologies

When troubleshooting for performance purposes, it is extremely helpful to have a mutually agreed upon base level of acceptable performance.

Talk to your business sponsors well before putting an application or VM into production about what sort of performance is expected and accepted.

• Use benchmarking tools to generate a good baseline of performance for your virtual machines prior to deployment.

• If appropriate, run the benchmarking on physical hardware first, arrive at a base level of performance, and only then take that data and apply it towards a virtualized environment.

• Try to stay clear of metrics that are subjective, or may be due to client issues on the user’s workstation. (Such as producing a report based off of server data that then is fed to the client, where a viewer is opened locally.)

Page 9

Benchmarking Methodologies

Benchmarking can also be accomplished at the application layer.

But be aware:

• In situations where human interaction is required, utilize third party automation or macro applications where possible to benchmark applications.

• Always contact your vendor if you are unsure of a specific product’s virtualization awareness.

• Do not run your benchmarking on a host that is already busy with other workloads. Get an idea what an optimum situation looks like on paper, then start to introduce load.

Page 10

Monitoring with vCenter Operations

Aggregates thousands of metrics into Workload, Capacity, Health scores

Self-learns “normal” conditions using patented analytics

Smart alerts of impending performance and capacity degradation

Designed for virtualization and cloud with vSphere Health Model

Powerful monitoring and drill down troubleshooting from datacenter to component level

Integration of 3rd party performance data

An integrated approach and patented analytics to transform how IT ensures service levels in dynamic environments


Page 11

Troubleshooting Tools - esxtop

esxtop is a valuable tool that allows you to troubleshoot performance issues without the aid of the UI’s Performance graphs

• esxtop allows you to see six distinct performance data views in real-time, and to select multiple counters to see at one time through selectable fields.

• Most counters in performance graphs have a direct correlation within esxtop, but the presentation may be different. (%RDY as an example.)

• esxtop has the advantage of incurring little overhead on the ESX host, and also is available even when conditions may prevent connection to the VI Client.

• The data from esxtop can be exported into a file and viewed offline or played back after the fact. It can even be imported into third party tools.

For the purposes of this presentation, we will be referencing esxtop counters.

Want to know all about esxtop and its counters? http://communities.vmware.com/docs/DOC-9279
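The exported esxtop data mentioned above can be mined offline. As a rough sketch, a batch-mode capture (`esxtop -b > capture.csv`) can be filtered for one counter like this — note the column names below are simplified stand-ins for illustration; a real capture uses long PDH-style headers such as `\\host\Group Cpu(123:vm1)\% Ready`:

```python
import csv
import io

def extract_counters(csv_text, name_fragment):
    """Pull every column whose header contains name_fragment from an
    esxtop batch-mode CSV export. Returns {column_header: [float samples]}."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Map column index -> header for the columns we care about.
    wanted = {i: col for i, col in enumerate(header) if name_fragment in col}
    series = {col: [] for col in wanted.values()}
    for row in reader:
        for i, col in wanted.items():
            series[col].append(float(row[i]))
    return series

# Hypothetical two-sample capture with simplified column names:
sample = (
    "time,vm1 % Ready,vm1 % Used\n"
    "0,4.25,61.0\n"
    "5,22.50,12.0\n"
)
ready = extract_counters(sample, "% Ready")
```

The same approach works for any counter the deck discusses (%RDY, %CSTP, SWR/s, and so on) — match on the header fragment and trend the samples over time.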

Page 12

MEMORY

Page 13

Memory – Resource Types

When assigning a VM a “physical” amount of RAM, all you are really doing is telling ESX how much memory a given VM process will maximally consume in addition to the overhead.

Whether or not that memory is physical or virtual (swap) depends on a few factors:

• Host configuration

• DRS shares/Limits

• Reservations

• Host Load

Storage bandwidth (swapping) is always more expensive, time-wise, than memory bandwidth.

Page 14

Memory – Rightsizing is KING

Generally speaking, it is better to OVER-commit than UNDER-commit:

If the running set of VMs is consuming too much host memory…

• Some VMs do not get enough host memory.

This causes: forced ballooning or host swapping to satisfy VM demands.

Leading to: greater disk usage as all VMs start to swap pages out.

Which results in: all VMs (and even any attached physical servers!) may slow down as a result of increased disk traffic to the SAN.

If we do not size a VM properly (e.g., create a Windows VM with 128MB RAM)

• Within the VM, pagefile swapping occurs, resulting in disk traffic. And again, all VMs may slow down as a result of that increased disk traffic.

But…don’t make memory footprints too big!

(High overhead memory, HA complexities, etc…)

Page 15

Memory – Over-commitment & Sizing

So what do I do?!?

Avoid high active host memory over-commitment

• Total memory demand = active memory (%ACTV) working sets of all VMs:

+ memory overhead

– page sharing of similar VMs.

• No ESX swapping occurs when total memory demand is less than the physical memory. (Discounting limits!)

Right-size guest memory

• Define adequate guest memory to avoid in-guest swapping.

• Per-VM memory space overhead grows with guest memory, so don’t go overboard!

• Ensure that you allocate enough memory to cover demand peaks.
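The demand formula on this slide is simple enough to sketch as arithmetic. The dict keys and sample numbers below are hypothetical illustration, not measured values:

```python
def total_memory_demand_mb(vms, shared_mb=0.0):
    """Slide's rule: demand = sum of active working sets (%ACTV) of all VMs,
    plus per-VM memory overhead, minus pages shared between similar VMs.
    `vms` is a list of dicts with 'active_mb' and 'overhead_mb'."""
    return sum(vm["active_mb"] + vm["overhead_mb"] for vm in vms) - shared_mb

def host_may_swap(vms, host_physical_mb, shared_mb=0.0):
    """No ESX swapping is expected while total demand stays under physical
    memory (discounting limits!)."""
    return total_memory_demand_mb(vms, shared_mb) > host_physical_mb

# Two illustrative VMs: active working set + overhead, with some page sharing.
fleet = [
    {"active_mb": 1024, "overhead_mb": 120},
    {"active_mb": 2048, "overhead_mb": 180},
]
```

With 256MB shared, this fleet demands 3116MB — safe on a 4GB host, swap-prone on a 3GB host.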

Page 16

Memory – Balancing & Overcommitment

ESX must balance memory usage for all worlds

Virtual machines, Service Consoles, and vmkernel all consume memory. They manage memory using a number of methods:

• Page sharing to reduce all-around memory footprint of similar Virtual Machines

• Ballooning to assure VMs that actively need memory get it from VMs that aren’t using it.

• Compression to alleviate our need to swap memory to slow disk.

• Host swapping as a last resort to relieve memory pressure when ballooning and compression are insufficient.

Lack of host memory can also cause systemic slowness.

• If the host is undersized for the VM load, the host pages out memory to the vswp files in order to cover the memory configuration for the VMs.

Page 17

Memory – Ballooning vs. Swapping

Bottom line:

• Ballooning is vastly preferable to swapping

• Guest can surrender unused/free pages through use of VMware Tools

• Even if the balloon driver has to swap to satisfy the balloon request, the guest chooses what to swap, avoiding swapping of “hot” pages within the guest

• Ballooning may occur even when there is no memory pressure, simply to keep memory proportions under control

Why does my “Memory Granted” never decrease?

• Most VMs will consume 100% of their physical memory when booting. Unless there is pressure, it is more efficient and architecturally inexpensive to keep this memory allocated to the process than it is to de-allocate and re-allocate as needed.

Page 18

Better Performance Through Hardware Assist (HWMMU)

Prior to these hardware technologies, “shadow paging” (software MMU, or SWMMU) was used. This cost both CPU time and memory overhead.

Intel

• Extended Page Tables (EPT)

• Available since: 2009

• Supported in ESX4.0 +

• Nehalem or better

AMD

• Rapid Virtualization Indexing (RVI)

• Available since: 2008

• Supported in ESX3.5 +

• Shanghai or better

For more info: http://www.vmware.com/files/pdf/perf-vsphere-monitor_modes.pdf

Page 19

Memory: Common Performance Issues

Using SWMMU

• If you have the option, hardware memory assist (HWMMU) reduces virtualization overhead for memory-intensive workloads. (e.g., Tier 1)

Not monitoring for ballooning or swapping at the host level

• Ballooning is an early warning sign that paging may occur.

• Swapping is a performance issue if seen over an extended period.

• Look for the SWR/s and SWW/s counters in esxtop.

Not monitoring for swapping at the guest level

• Under provisioning guest memory.

Removing balloon driver or disabling/marginalizing TPS

Under sizing VMs for peak loads

Page 20

CPU

Page 21

CPU - Intro

Basically….

CPU resources are the raw processing speed of a given host or VM

However, on a more abstract level, we are also bound by the hosts’ ability to schedule those resources.

We also have to account for running a VM in the most optimal fashion, which typically means running it on the same processor that the last cycle completed on.

Page 22

CPU - Performance Overhead & Utilization

In esxtop, we can observe the % of CPU used for a particular VM, but that is actually a sum of three different metrics:

%USED = %RUN + %SYS - %OVRLP

“%USED” – The actual CPU used by this VM.

“%SYS” - The percentage of time spent by system services on behalf of the world. Some possible system services are interrupt handling, system date/time calls, and system worlds.

“%OVRLP” - The percentage of time spent by system services on behalf of other VMs.

“%RUN” - The percentage of total scheduled time for the world to run.
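The identity above is simple, but it's worth restating as code so the sign on %OVRLP is unambiguous:

```python
def pct_used(pct_run, pct_sys, pct_ovrlp):
    """The esxtop CPU identity from the slide: %USED = %RUN + %SYS - %OVRLP.
    %OVRLP is subtracted because that portion of system time was spent on
    behalf of OTHER VMs while this world happened to be charged for it."""
    return pct_run + pct_sys - pct_ovrlp
```

For example, a VM scheduled 80% of the time with 10% system-service time, 4% of which was on behalf of other VMs, shows 86% %USED.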

Page 23

CPU - Performance Overhead & Utilization

Remember: Different workloads can present entirely different virtualization overhead costs (%SYS) even at the same utilization (%USED) in CPU.

CPU virtualization adds varying amounts of system overhead Little or no overhead for the part of the workload that can run in direct execution

• An Example of this might be a web server or other basic workload that does not take advantage of CPU features; e.g., wholly reliant upon I/O for processing (File/Web/non kernel-ring)

Small to significant overhead for virtualizing sensitive privileged instructions

• For example, an SSL gateway that requires context switching for computation work, or an email gateway with virus scanning that uses SSE 4.1

• Servicing non-paravirtualized adapters & virtual hardware (Interrupts!)

Page 24

CPU – vSMP Processor Support

ESX 4.x supports up to eight virtual processors per VM

• Multi-Processor VMs can run out-of-sync in ESX 4.x, resulting in greater density. (Relaxed Co-Scheduling)

• Use Uniprocessor VMs for single-threaded applications

• For vSMP VMs, configure only as many vCPUs as needed

• Unused vCPUs in SMP VMs:

Impose unnecessary scheduling constraints on ESX Server

Waste system resources (idle looping, process migrations, etc.)

vSMP VMs may not always use those vCPUs. Test your applications and verify that the threads for that application are being split among the processors equitably. An idle vCPU incurs a scheduling penalty.

• Pay attention to the %CSTP counter on vSMP VMs. The more you see this, the more your processing is unbalanced. (The ESX 4.x relaxed co-scheduler is needing to catch all vCPUs up to a vCPU that is much further advanced.)

Page 25

CPU – Ready Time

One very common issue is high CPU ready (%RDY) time

What is it?

CPU Ready Time is the percentage of time that a VM is ready to run, but there is no physical processor that is free to run it.

High ready time indicates possible contention for CPU scheduling among VMs on a particular host.

Page 26

CPU Related Performance Problems

Over committing physical CPUs

ESX Scheduler


Page 29

CPU – Ready Time

So what causes Ready Time?

Many possible reasons

• CPU over commitment (typically high %rdy + low %used)

• Workload variability (High overhead, or lopsided VM load on host)

There is no fixed threshold for trouble, but > 20% for a vCPU warrants further investigation.

With multi-vCPU VMs, divide reported %RDY in esxtop by the number of vCPUs configured in order to see a “human-readable” percentage.

(Or press “E” to expand the GID of the VM in order to see a per-vCPU counter.)
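The divide-by-vCPU advice above, as a tiny helper — the 20% figure is the slide's rule of thumb (not a hard limit), baked in here as a default:

```python
def per_vcpu_ready(pct_rdy, n_vcpus):
    """esxtop reports %RDY summed across a VM's vCPUs; divide by the vCPU
    count to get the human-readable per-vCPU figure."""
    return pct_rdy / n_vcpus

def worth_investigating(pct_rdy, n_vcpus, threshold=20.0):
    """True when the per-vCPU ready time crosses the rule-of-thumb line."""
    return per_vcpu_ready(pct_rdy, n_vcpus) > threshold
```

So a 4-vCPU VM showing 90% %RDY in esxtop is really at 22.5% per vCPU — over the line — while the same reading on paper would be alarming if you forgot to divide.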

Page 30

CPU – LLC & the Modern Multicore Processor

Last Level Cache – Shared on-die cache for multicore CPUs

The ESX CPU scheduler, by default, tries to place the vCPUs of an SMP virtual machine into as much Last-Level Cache (LLC) as possible.

For some workloads, this is not efficient due to the usage pattern of those applications – it is more beneficial to have the VM consolidated into the fewest LLCs that it can.

Use the following directive in your VMX to use as little LLC as possible:

sched.cpu.vsmpConsolidate = "true"

VMware KB Article #1017936

Page 31

CPU: Common Issues

CPU (#vCPU) over allocation

Too many vCPUs for the ESX scheduler to keep up with. This is evidenced by low %USED counters and high in-VM CPU counters, with high %RDY counters. The ESX host is unable to find idle processors to run the VM on, or enough idle processors to accommodate vSMP VMs that need to run all of their vCPUs at once.

• Migrate VMs off the host.

• Reduce the number of vCPUs on the host.

• Right-size vSMP VMs with a preference for 1 vCPU by default, using vSMP VMs only when needed.

• Use anti-affinity rules to keep 4-way and 8-way VMs separated as much as possible.

vSMP VMs may not always use those vCPUs. Test your applications and verify that the threads for that application are being split among the processors equitably. An idle vCPU incurs a scheduling penalty.

• Pay attention to the %CSTP counter on vSMP VMs. The more you see this, the more your processing is unbalanced.

Page 32

CPU: Common Issues

Accidental limits

• Always check reservations/limits on VMs and resource pools as many people “forget” to remove temporary limits, or there are too many “cooks in the kitchen.”

• Nested resource pools with too few resources in the parent pools to cover their child pools can create artificial limits.

Continuing to expect the same consolidation ratios with different workloads

• As you progress further into your virtualization journey, it is not uncommon to virtualize the “easy” workloads first. This means that as you progress, the workloads change significantly. (You can virtualize more DNS servers than you can mail servers…)

Page 33

STORAGE

Page 34

Storage - esxtop Counters

Once you are in the main storage section, you can switch between different views – Adapter (d), VM (v), and Disk Device (u)

Key Fields:

KAVG – Average delay from vmkernel to the adapter.

DAVG – Average delay from the adapter to the target.

GAVG – Average delay as perceived by the guest.

DAVG + KAVG = GAVG

QUED/USD – Command Queue Depth

CMDS/s – Commands Per Second

MBREAD/s – Megabytes read per second

MBWRTN/s – Megabytes written per second
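The DAVG + KAVG = GAVG relationship lends itself to a small triage helper. The numeric limits below are commonly cited rules of thumb, not VMware-published thresholds — treat them as assumptions to tune for your environment:

```python
def kavg_ms(gavg_ms, davg_ms):
    """Since DAVG + KAVG = GAVG, kernel-side latency is the guest-observed
    latency minus the device latency."""
    return gavg_ms - davg_ms

def suspects(davg_ms, kavg_ms, davg_limit=25.0, kavg_limit=2.0):
    """Point at the layer to investigate. Limits are illustrative
    rules of thumb (assumption, not an official threshold)."""
    found = []
    if davg_ms > davg_limit:
        found.append("beyond the HBA: zoning, SP load, RAID spindles")
    if kavg_ms > kavg_limit:
        found.append("host storage stack: full queue or memory pressure")
    return found
```

A 30ms GAVG with 28ms of DAVG leaves only 2ms in the kernel: look at the array, not the host.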

Page 35

Storage – Troubleshooting with esxtop

High DAVG numbers indicate that something is wrong beyond the adapter of the ESX host – bad/overloaded zoning, over utilized storage processors, too few platters in the RAID set, etc.

High KAVG numbers indicate an issue with the ESX host’s storage stack. This could be due to lack of memory or a full command queue.

Aborts are caused by either DAVG or KAVG exceeding 5000ms – you can see these in /var/log/vmkernel, or track the ABRTS/s counter.

Aborts and/or excessive resets can often be caused by unbalanced paths to the storage. Balance out your paths! All I/O down one HBA can lead to a saturated link in peak demand.

Also consider the load on the Storage Processor from all attached hosts – even non-VMware.

ESX 4.x supports Round Robin with ALUA – use it if your vendor supports it! (Do not use Round Robin on pseudo active-active arrays without ALUA enabled.)

Page 36

Storage - Considerations

• Consider providing different tiers of storage – Tier 3 for SATA-backed slow storage, up to Tier 0, which is fast RAID 10. Place the OS drive on Tier 3, and the database on Tier 0.

• But be careful when snapshotting those VMs! Consider the placement of your vmx file. (Snapshot deltas reside, by default, where the VMX lives)

• Consider your I/O patterns – should this affect how you allocate your SAN’s cache mechanisms for that LUN?

• It should certainly affect how you build your LUN!! (Spindles/Stripe size…)

• Be very aware of any shared datastores that reside on one LUN or RAID set – particularly if you are sharing a SAN between virtualized and non-virtualized loads.

• Use path balancing where possible, either through plugins (PowerPath) or through Round Robin and ALUA, if supported.

Page 37

Storage - Troubleshooting

Use the paravirtualized SAS adapter to reduce CPU overhead. (Especially important with high I/O workloads.)

Excessive traffic down one HBA / Switch / SP can cause aborts and data loss / collisions in extreme situations when the path becomes overloaded.

• Consider using Round Robin in conjunction with ALUA.

• Always be paranoid when it comes to monitoring storage I/O. You can never watch it too much in high churn/change environments. Put in place monitoring such as vCenter Operations.

Always talk to your array vendor to determine if you are following best practices for their array! They will be able to tell you the appropriate fan-in ratio, command queue depth, etc….

Page 38

NETWORK

Page 39

Load Balancing

A quick note regarding Load Balancing in ESX:

With the exception of “IP Hash,” all load balancing options are simply ways to influence what pNIC a VM is assigned to use as its uplink. “Port ID” uses the virtual Port ID as a value in a pseudorandom calculation to determine the uplink.

Until an event occurs that forces a re-election (pNIC state change, adding/removing pNIC from a team), the VMs will stay on their elected uplink pNIC.

IP Hash will give the most benefit if used in situations where the VM is communicating with a wide variety of different IPs (Think file servers, print servers, etc.) If the IPs always stay the same, then the elected uplink NIC will always stay the same.
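A sketch of why IP Hash behaves this way. The real ESX hash algorithm is not reproduced here — XOR of the last octets, mod the team size, is a stand-in assumption that shows the same property the slide describes:

```python
def ip_hash_uplink(src_ip, dst_ip, n_uplinks):
    """Toy model of IP-Hash-style uplink election: a deterministic hash of
    the source/destination pair selects the uplink, so a fixed pair always
    lands on the same pNIC."""
    src_octet = int(src_ip.rsplit(".", 1)[1])
    dst_octet = int(dst_ip.rsplit(".", 1)[1])
    return (src_octet ^ dst_octet) % n_uplinks
```

The design consequence: a VM talking to one backup server gains nothing from IP Hash (same pair, same uplink every time), while a file server talking to hundreds of distinct client IPs spreads across the team.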

Page 40

Troubleshooting

• Be sure to check counters for both vSwitch and per-VM. There could potentially be another VM that is experiencing high network load on the same uplink as the VM that is having a connection speed issue.

• 10 Gbps NICs can incur a significant CPU load when running at 100%. Using TCP/IP Segmentation Offload (TSO) in conjunction with paravirtualized (VMXNET3) hardware can help out. Additionally, ensure that you are running the latest drivers for your NIC on the host.

• VMs without paravirtualized adapters can cause excess CPU usage when under high load.

• Consider using intra-vSwitch communications and DRS affinity rules for VM pairs such as a web server/database backend. This will utilize the system bus rather than the network for communications.

• ESX 4.1 includes the ability to use network I/O shares – ideal for blade systems where 10Gb NICs are becoming common, but there may only be one or two. This allows equitable sharing of resources without an IP Hash load balancing setup.

Page 41

Questions?

See us at the VMware Genius Bar!