
Transcript of VMware ESX Server HP Virtualization

Page 1: VMware ESX Server HP Virtualization

VMware ESX Server 3.0: How VMware ESX Server virtualizes HP ProLiant servers

Executive summary .................................................. 3
    This white paper ............................................... 3
Architecture ....................................................... 4
    Hardware performance differences ............................... 5
        Driver translation ......................................... 5
        World switching ............................................ 5
        Accommodating increased utilization ........................ 5
Service console .................................................... 5
    Boot process ................................................... 5
    Service console overview ....................................... 7
VMware Virtual SMP ................................................. 8
Resource virtualization ............................................ 9
    CPU ............................................................ 9
        Reacting to an idle VM ..................................... 11
        Updating the VM clock ...................................... 12
        Default settings ........................................... 12
        Impact of processor cache size ............................. 12
        Impact of cache on scheduling .............................. 13
        Intel Xeon processors ...................................... 14
        AMD Opteron processors ..................................... 15
        Using NUMA architecture .................................... 15
        Disabling NUMA capability .................................. 17
    Memory ......................................................... 17
        Other memory consumers ..................................... 17
        Memory management .......................................... 19
        Using the balloon driver ................................... 20
        Using a swap file .......................................... 21
        Using background memory page sharing ....................... 21
        More on memory overcommitting .............................. 21
        Recommendations for memory virtualization .................. 22

Page 2: VMware ESX Server HP Virtualization

    Network ........................................................ 22
        How to dedicate a physical NIC to a VM ..................... 23
        Configuring virtual switches ............................... 24
        Load distribution .......................................... 25
        Distributing outbound traffic .............................. 27
        Distributing inbound traffic ............................... 27
        Eliminating the switch as a single point of failure ........ 28
        Improving network performance .............................. 28
        How the network perceives VMs .............................. 28
        VLANs ...................................................... 29
        Considerations when configuring virtual switches ........... 29
    Storage ........................................................ 30
        Architecture ............................................... 31
        VMFS ....................................................... 32
        LUN performance considerations ............................. 33
        Tuning VM storage .......................................... 34
        Using raw device mapping ................................... 34
        Other design considerations ................................ 35
        Sizing VM disk files ....................................... 35
        Presenting a raw LUN to a VM ............................... 35
        Raw device mapping ......................................... 36
        Planning partitions ........................................ 39
        Implementing boot-from-SAN ................................. 39
        Noting changes to the boot drive and device specification .. 40
        Taking care during the installation ........................ 40
        Defining the connection type ............................... 40
        Fibre Channel multipathing and failover .................... 40
        Fail-back .................................................. 41
Resource Management ................................................ 41
    Clusters ....................................................... 41
    VMware High Availability (HA) Clusters ......................... 42
    VMware Distributed Resource Scheduling (DRS) Clusters .......... 42
    Resource Pools ................................................. 42
    Resource Allocation ............................................ 42
        Absolute allocation ........................................ 43
        Share-based allocation ..................................... 43
        Differences between allocation methods ..................... 43
        Warning on setting a guaranteed minimum .................... 43
        Allocating shares for other resources ...................... 44
Best practices ..................................................... 44
VMware VirtualCenter ............................................... 44
    Architecture ................................................... 45
    Templates and clones ........................................... 45
        Template ................................................... 45
        Cloning .................................................... 46
        Differences between templates and clones ................... 46
    Considerations and requirements for VirtualCenter server ....... 46
        Compatibility .............................................. 47
        Virtual Infrastructure Client application requirements ..... 47
VMotion ............................................................ 47
    Architecture ................................................... 47
    Considerations and requirements ................................ 48
For more information ............................................... 51

Page 3: VMware ESX Server HP Virtualization


Executive summary This document contains functional information for VMware ESX Server and describes the new features and functionality introduced by VMware ESX Server 3.0. Specifically, it provides operational parameters, component virtualization methodologies, general utilization guidance, and best practices for integrating and operating a virtual infrastructure.

This guide is intended for project managers and corporate decision makers involved in the initial phases of enterprise virtualization. It provides an overall understanding of how ESX Server works and should help the reader make informed decisions about implementing virtualization.

The reader should be familiar with industry terminology and have a general familiarity with virtualized infrastructures. For more in-depth information, see the reference section of this guide.

This guide is the result of a joint effort by VMware and HP.

This white paper
• Architecture – Outlines the virtualized computing environment implemented by VMware ESX Server; describes performance differentials
• Service console – Outlines the architecture and capabilities of the ESX Server service console; explains the differences between Linux, the service console, and ESX Server
• VMware Virtual SMP – Outlines the use of Virtual SMP to give a VM access to four virtual processors; explains processor fragmentation
• Resource virtualization – Outlines resource utilization issues in a virtualized computing environment
  – CPU – Describes processor virtualization concepts such as virtual processors; provides an explanation of single-core, dual-core, and hyperthreaded processing resources; outlines the management of idle VMs and virtual processor scheduling; describes the impact of cache size on performance; describes the concept and impact of cache fragmentation; outlines the impact of Intel® Xeon™ and AMD Opteron™ chipset technologies; describes the influence of Intel® and AMD virtualization technologies; describes the use of Non-Uniform Memory Access (NUMA) architecture
  – Memory – Describes the memory guarantees required for VMs; outlines the memory management features of ESX Server; describes the use of the balloon driver, swap files, and background memory page sharing to free up memory; discusses the implications of memory overcommitting; provides recommendations for memory virtualization
  – Network – Describes the concept of a virtual switch; outlines how to configure a virtual switch; describes MAC- and IP-based methods for load distribution; describes how to eliminate the switch as a single point of failure; outlines methods for improving network performance; discusses the uses of VLANs; outlines NIC recommendations for the service console, VMotion, and the VMs
  – Storage – Provides an overview of virtual storage architecture; explains why VMFS-3 is used for virtualized storage; outlines LUN performance considerations; provides methods for tuning VM storage; describes iSCSI support; outlines support for NAS; describes the use of raw device mapping; outlines the use of internal storage controllers; describes how to implement boot-from-SAN; discusses multipathing and failover in a virtualized environment
• Resource management – Outlines the ability to divide and allocate the resources of a combined group of ESX Server hosts
  – Cluster – Describes the ability to group similar hosts in order to improve workload distribution and failover capability
    • VMware HA – Briefly describes the failover capability of a cluster configured for HA and provides a link to the VMware HA whitepaper, which includes best practices
    • VMware DRS – Outlines the basic principles of DRS and provides a link to the VMware DRS whitepaper and best practices
  – Resource pools – Outlines the advantages of using resource pools; describes the process of establishing resource pools and adding VMs to the newly created pools; outlines the process of modifying the

Page 4: VMware ESX Server HP Virtualization


resource allocation allotted to a specific resource pool; explains potential issues that arise when assigning resource reservations, along with workarounds for overallocation

Architecture VMware ESX Server provides a virtualized computing environment that, unlike VMware Server or VMware Workstation, does not rely on an underlying operating system to communicate with the server hardware – instead, ESX Server is installed directly on the server hardware. Virtual Machines (VMs) are then installed and managed on top of the ESX Server software layer.

Since virtualization components are not hosted within the confines of a host operating system, the ESX Server architecture has been described as “unhosted,” “native,” or “hostless.” Hosted and unhosted architectures are compared in Figure 1.

Figure 1: Comparing the native ESX Server architecture with a typical hosted architecture

The ESX Server architecture provides shorter, more efficient computational and I/O paths for VMs and their applications, reducing virtualization overhead and improving application performance. The unhosted architecture also enables ESX Server to provide more granular and enforceable policies for hardware allocation and VM prioritization – an important differentiator for ESX Server over a hosted architecture. In a hosted virtualization environment, the host OS governs the execution of VM threads and typically limits the granularity of prioritization to categories such as “high,” “low,” or “normal.”

Furthermore, in the unhosted architecture of ESX Server, VM processes do not contend with the many and various processes that consume the resources allocated to a host OS.

In short, the unhosted architecture of ESX Server provides a lightweight, single-purpose virtualization environment that allows enforceable hardware allocation and prioritization policies. The single-purpose micro-kernel, called the VMkernel, delivers higher performance and greater flexibility. Also, because the VMkernel uses only drivers ported and rigorously validated by both HP and VMware, the micro-kernel provides exceptional stability.

Page 5: VMware ESX Server HP Virtualization


Hardware performance differences Performance and resource utilization for a particular operating system instance and application differ when running in virtualized and unvirtualized environments, as discussed below.

Driver translation While the quantification of performance differentials is very complex, it can be stated that, in general, CPU and memory performance overheads in a VM tend to be lower than overheads for network or disk traffic. This is because CPU and memory operations do not require the same degree of translation as data flowing between virtual and physical device drivers. With the introduction of ESX Server 3.0, many of the physical device drivers have been incorporated into the kernel to further improve VM performance. In general, the performance of a primarily CPU-intensive application in a VM is likely to be closer to its performance on a physical server than that of an application that is more network- or disk-intensive.

However, translation between virtual and physical devices does consume some additional CPU resources on top of those required for application and guest OS processing. This translation results in a higher percentage of CPU utilization being needed for each request processed in a VM when compared to a physical server running the same application.

World switching The world switching process enables the sharing of physical system resources by preempting a currently-running VM, capturing and saving the instantaneous execution state of that VM, and initiating CPU execution for a second VM.

Although world switching allows VMs to share physical system resources, it introduces a small amount of additional overhead to running VMs; however, the benefits of virtualization strongly outweigh this cost.

Accommodating increased utilization On average¹, the CPU utilization of a Microsoft® Windows®-based x86 server is approximately 4%. Even with virtualization overhead and driver translation, many systems and application environments have sufficient CPU resources to accommodate a substantial increase in utilization. While this is not always the case, many OS and application environments can be virtualized without sacrificing much performance.

The performance sacrifice has been further reduced by optimizations in ESX Server 3.0 that specifically target OLTP, Citrix, Windows 2003 Web Server, and custom Linux workloads.
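To put these utilization figures in perspective, the following back-of-the-envelope estimate (a Python sketch) shows how much headroom a lightly loaded workload leaves on a consolidated host. The 4% average comes from the text above; the overhead and target-utilization values are illustrative assumptions only, not ESX Server measurements.

```python
# Back-of-the-envelope consolidation estimate. The 10% per-VM overhead and the
# 70% target utilization are assumed illustrative values, not measured numbers.

AVG_UTILIZATION = 0.04      # ~4% average CPU utilization (from the text)
VIRT_OVERHEAD = 0.10        # assumed relative overhead for driver translation etc.
TARGET_HOST_UTIL = 0.70     # leave headroom on the consolidated host

per_vm_load = AVG_UTILIZATION * (1 + VIRT_OVERHEAD)
vms_per_host = int(TARGET_HOST_UTIL / per_vm_load)
print(f"Each virtualized workload: ~{per_vm_load:.1%} of a host CPU")
print(f"Workloads fitting under {TARGET_HOST_UTIL:.0%} utilization: {vms_per_host}")
```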

Service console ESX Server is often thought of as Linux or Linux-based – a misconception that might stem from the service console. To rebut this misconception, consider the following:

• The VMkernel, responsible for the creation and execution of virtual machines, is a single-purpose micro-kernel.

• The VMkernel cannot boot itself and has no user interface. It relies on a privileged, modified Red Hat Enterprise Linux installation to provide ancillary services like a boot loader and user interface.

• It is important to understand that the VMkernel – not the Linux kernel – is the governing authority in an ESX Server deployment. It is the VMkernel that creates, monitors, and defines the virtualization components; it alone makes the final decisions on execution allocation – even the Linux service console is subject to the scheduling decisions of the VMkernel.

Boot process The boot process helps explain the relationship between Linux, the service console, and ESX Server.

¹ According to industry averages compiled by VMware Capacity Planner.

Page 6: VMware ESX Server HP Virtualization


During boot, the bootloader (GRUB) loads a Linux kernel. Since certain PCI devices are masked by the GRUB configuration, the Linux kernel only loads drivers for visible devices.

After most Linux services have been loaded, the Linux kernel loads the vmnixmod module, which loads the VMkernel logger, which, in turn, loads the VMkernel itself. During its loading process, the VMkernel assumes nearly all hardware interrupts and, effectively, takes over server hardware that was not allocated to the Linux service console kernel. At this point, with the VMkernel owning most of the server hardware, it is free to schedule VM execution and distribute physical resources between VMs.

The final component loaded is the VMkernel core dump process, which is designed to capture the state of the VMkernel in the event of a kernel crash.

Page 7: VMware ESX Server HP Virtualization


Service console overview With the advent of ESX Server 3.0, the service console is now executed as a VM. Drivers for service console devices, such as the NIC and storage, are loaded in the VMkernel, which allows the service console to access the configured hardware through the kernel itself. Although the service console accesses most devices through kernel modules, access to USB devices and the floppy drive is direct.

The VMware host agent, which runs within the service console, provides access for the Virtual Infrastructure Client. Additionally, a web access client is available, powered by a Tomcat web service that runs within the confines of the service console. The web access client allows users to perform many management tasks using a web-based interface. Furthermore, the Secure Shell (SSH) within the service console provides secure access for command-line management of ESX Server. While these interfaces might appear identical to any Linux installation, the service console includes packages that allow both the command line and the web interfaces to pass commands and configuration data to the VMkernel.

Figure 2 shows how the service console integrates into the ESX Server architecture.

Figure 2: The relationship between ESX Server and the service console

The service console, which uses a uni-processor 2.4.21 Linux kernel, is scheduled only on physical CPU0. By default, the VMkernel also reserves a minimum of 8% of CPU0 for the service console through the same guarantee mechanism used for VMs; VMs are free to consume the remaining CPU0 resources. This CPU allocation, in most cases, ensures that the service console remains responsive, even if other, busy VMs are consuming all other available physical resources.

Although the service console is not responsible for scheduling VMs, there is a correlation between the responsiveness of the service console and the responsiveness of VMs. This is due, in part, to the fact that ESX Server transmits the keyboard, video, and mouse access of a VM to a VMware Remote Console session through the service console network connection.

Page 8: VMware ESX Server HP Virtualization


Because of this relationship, if the service console should become unresponsive or unable to perform its supporting processes (such as updating /proc nodes or maintaining the VMkernel logger), the virtualized environment may exhibit symptoms of this contention, ranging from slow remote console access to a VMkernel crash. To combat this, consider increasing the memory allocation and/or minimum CPU guarantee for the service console. Note, however, that this discussion addresses an extreme case; in most cases, the default allocations should provide stable and responsive operation.

The service console also provides an execution environment for management and backup agents; the loads generated by these additional processes further justify an increase in memory and CPU allocations over the default values.

Note: HP Systems Insight Manager (SIM) and other hardware monitoring agents run in the service console, not in VMs. ESX Server does not support the running of unqualified packages within the service console environment.

Access to floppy drives, serial port devices, parallel port devices, and CD-ROM drives – even from within a VM – is proxied through the service console. This delegation of slower-access devices allows the VMkernel to focus on high-speed, low-latency devices like hard disks.

VMware Virtual SMP With a valid Virtual SMP license, ESX Server can optionally give a single VM simultaneous access to four execution cores by exposing four virtual processors within the VM, allowing multithreaded applications within the VM to process simultaneous instructions on four distinct processor cores. Since ESX Server simply abstracts – as opposed to emulating – processors, simultaneous execution requires four processor cores to be allocated – simultaneously and exclusively – to a single VM. Two of these cores may reside within a single package (for a dual-core processor or a processor with Hyper-Threading Technology), or the cores may be spread across physical packages.

Note: Virtual SMP is licensed separately; this license is required to expose up to four virtual processors within a single VM. The license can be purchased on its own or as part of the Virtual Infrastructure Node bundle.

Although it may be tempting to use Virtual SMP by default when creating new VMs, it should be used carefully – especially when running on systems that offer few execution cores, for instance a dual-processor system populated with single-core processors. A VM is never allocated a portion of a core; during its allocated unit of CPU time, a VM’s access is exclusive. As a result, when a VM using Virtual SMP is deployed on a physical server with only two cores (as in a dual-processor, single-core server without Hyper-Threading Technology), both cores are allocated to this VM during the scheduled period; no CPU resources are available for other VMs or the service console. The corollary also applies: when any other VM or service console process is scheduled for execution on either one of the two execution cores, processes on the Virtual SMP-enabled VM cannot execute. This phenomenon is known as processor fragmentation and is shown in Figure 3.

Page 9: VMware ESX Server HP Virtualization


Figure 3: Both physical processor cores have been allocated to a Virtual SMP-enabled VM, leaving no CPU resources available for other VMs

Processor fragmentation is often the reason for poor performance on servers with only two execution cores. With dual-processor VMs on servers with only two cores, there is nearly 100% contention for CPU resources when the system is under load. If the goal were to run only a single VM with Virtual SMP on a platform with two execution cores, the performance impact of this contention would be less noticeable; however, it is far more common to deploy multiple VMs on such a platform, making contention a significant issue. When using Virtual SMP, it is recommended that the physical server provide more processor cores than are allocated to any single VM.
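The co-scheduling constraint behind processor fragmentation can be modeled in a few lines. The following Python sketch is purely illustrative – it is not VMkernel code – and assumes a host with two execution cores: a VM is dispatched only when idle cores exist for all of its virtual processors at once.

```python
# Simplified model of processor fragmentation on a two-core host.
# This is an illustrative sketch, not actual VMkernel scheduling logic.

TOTAL_CORES = 2

def can_dispatch(vm_vcpus: int, cores_in_use: int) -> bool:
    """A VM runs only if enough idle cores exist for ALL of its
    virtual processors at the same time (no partial cores)."""
    return TOTAL_CORES - cores_in_use >= vm_vcpus

# A Virtual SMP VM with two virtual processors on an otherwise idle host:
print(can_dispatch(vm_vcpus=2, cores_in_use=0))   # True  - both cores allocated to it
# While it runs, nothing else (other VMs, service console) can be scheduled:
print(can_dispatch(vm_vcpus=1, cores_in_use=2))   # False - no core available
# Conversely, one busy uniprocessor VM prevents the 2-vCPU VM from running:
print(can_dispatch(vm_vcpus=2, cores_in_use=1))   # False - fragmentation
```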

New technologies, such as Hyper-Threading Technology and dual-core processors, change this behavior slightly and are examined more closely in later sections of this white paper.

Resource virtualization In discussing how ESX Server presents virtual abstractions of hardware and schedules the execution of VMs, it is helpful to discuss the four primary resource groups (CPU, memory, network, and disk) independently.

CPU The primary concepts of CPU virtualization are as follows:

• A physical processor package
• A virtual processor
• A logical processor capable of executing a thread

The physical processor is a familiar concept; it has a clock speed, a cache size, and a manufacturer; and you can hold it in your hand.

Introduced with newer technologies such as Hyper-Threading Technology and dual-core processors, the logical processor is slightly more abstract, and may best be explained by examples such as those shown in Figure 4.

Page 10: VMware ESX Server HP Virtualization


Figure 4: Representations of physical processors
• Single-core processor without Hyper-Threading Technology – one logical processor
• Single-core processor with Hyper-Threading Technology – two logical processors
• Dual-core processor – two logical processors

In the context of ESX Server, not all logical processors are equal. For example, a processor with Hyper-Threading Technology includes two instruction pipelines; however, only one of these can access the execution pipeline at any given moment. Contrast this with a dual-core processor where both instruction pipelines have access to their own execution pipelines. As such, when discussing virtual processors, it might be helpful to refer specifically to an execution core to avoid confusion with the non-executing second instruction pipeline in a processor with Hyper-Threading Technology.

By far the most abstract of the concepts of CPU virtualization is the virtual processor, which is best defined as a period of time allocated for exclusive execution on a processor core. When a VM is powered on and its virtual processor is scheduled to execute (or multiple virtual processors if using Virtual SMP), a slice of time on one logical execution core (or multiple logical execution cores if using Virtual SMP) within the physical processors is assigned to the virtual processor(s) within the VM. Since this concept is purely abstract, consider the following simplified examples:

• A physical processor with a single logical core and a single VM with a single virtual processor
Ignoring all virtualization overheads and execution cycles for system services, the virtual processor in this scenario receives 100% of the execution time. If the logical core of the physical processor is a 3.0 GHz CPU, the virtual processor receives all three billion cycles of the CPU clock.

• A single processor with a single logical core and two VMs, each with one virtual processor
Ignoring all virtualization overheads and execution cycles for system services, and assuming equal priority for both VMs, each virtual processor receives, over some period of time, 50% of the execution time. It is important to understand that when one VM is executing, that VM has exclusive access to the allocated logical execution core. In other words, only one virtual processor can execute within a single logical processor at any given moment. The x86 architecture does not allow two virtual processors to have simultaneous access to a single execution pipeline. Thus, in this example, if one VM is executing, the other is not. The non-executing VM is not idle; rather, it has been preempted.
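To make the time-slice abstraction concrete, the following Python sketch performs the cycle arithmetic from the two examples above. It is a simplified illustration that ignores virtualization overhead and idle-VM handling, exactly as the examples do.

```python
# Hypothetical illustration of dividing one logical core's cycles among
# equal-priority virtual processors (overheads and idle handling ignored).

CORE_HZ = 3_000_000_000  # a 3.0 GHz logical execution core

def cycles_per_vm(num_virtual_processors: int) -> float:
    """Each virtual processor gets an equal share of the core's cycles
    over some period of time; while one executes, the others are preempted."""
    return CORE_HZ / num_virtual_processors

print(f"{cycles_per_vm(1):,.0f} cycles/s")  # one VM: all 3 billion cycles
print(f"{cycles_per_vm(2):,.0f} cycles/s")  # two VMs: 50% each, i.e. 1.5 billion
```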

Page 11: VMware ESX Server HP Virtualization


Figure 5 shows a number of scenarios featuring a single virtual processor.

Figure 5: Showing scenarios with one virtual processor per physical processor core

As can be inferred from the above discussion, when the number of virtual processors increases, the period of time for execution may become shorter. Similarly, as the ratio of virtual processors to logical execution cores increases², the contention for physical resources may increase. One possible result – depending on the amount of idleness within the VMs – is that the number of computational cycles available to a VM may be less.

The qualifications in the previous paragraph – “may become shorter,” “may increase,” and “may be less” – are rooted in the manner in which ESX Server treats an “idle” VM. When an operating system is not consuming resources for system-sustaining processes or in support of an application, it is idle; indeed, most operating systems spend considerable amounts of time in this state. While idle, the operating system issues instructions³ to the CPU indicating that no work is to be done.

ESX Server is capable of recognizing this idle loop – unique to each operating system – and automatically gives priority to VMs using CPU cycles to perform non-idle operations, giving rise to the qualifications stated above. For example, as the ratio of virtual processors to logical processors increases, the contention for physical resources may increase unless there are idle VMs.

Reacting to an idle VM Consider the following to illustrate how idle VMs may affect the scheduling of virtual processors. In this scenario, there is a single processor with a single logical core supporting two VMs, each with one virtual processor. If one VM is performing CPU-intensive operations that entirely consume the cycles allotted to it

² This ratio is often called virtual machine density or consolidation ratio.
³ Generally referred to as an idle loop.

Page 12: VMware ESX Server HP Virtualization


and the other VM is completely idle, ESX Server recognizes this disparity and effectively increases the percentage of cycles allocated to the busy VM. Note that the busy VM never receives 100% of the CPU cycles; some cycles are allocated to – and consumed by – the idle VM to advance its clock and ensure that it has the opportunity to become busy.
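The behavior described above can be sketched as a simple weighting rule. The Python fragment below is a conceptual model only – the VMkernel’s actual algorithm is more sophisticated – and the 5% idle floor is an arbitrary illustrative value, not an ESX Server parameter.

```python
# Conceptual model: redistributing cycles from an idle VM to a busy one.
# The 5% floor below is an arbitrary illustrative value, not an ESX parameter.

IDLE_FLOOR = 0.05  # minimum share kept by an idle VM (assumed for illustration)

def effective_shares(demands: dict[str, float]) -> dict[str, float]:
    """demands: fraction of its nominal share each VM actually wants (0.0-1.0).
    Returns the fraction of the core each VM receives."""
    nominal = 1.0 / len(demands)
    wanted = {vm: max(d * nominal, IDLE_FLOOR * nominal) for vm, d in demands.items()}
    spare = 1.0 - sum(wanted.values())
    busy = [vm for vm, d in demands.items() if d >= 1.0]
    for vm in busy:                       # hand the unused cycles to busy VMs
        wanted[vm] += spare / len(busy)
    return wanted

# One fully busy VM and one idle VM sharing a single logical core:
print(effective_shares({"busy_vm": 1.0, "idle_vm": 0.0}))
# -> busy_vm gets ~97.5% of the core; idle_vm keeps ~2.5% to advance its clock
```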

Updating the VM clock As mentioned earlier, when a VM is not scheduled for execution within a processor core, time does not pass in that VM. As a result, the measurement of time within that VM is incorrect – unless the VMware Tools package is installed.

This package includes a component that, when enabled, updates the clock within the VM to ensure more accurate timekeeping. This updating does not provide for a real-time measurement; however, the accuracy of the updated time should be sufficient for most application purposes.

IMPORTANT: VMware strongly recommends that measurements of time not be used for benchmarking purposes. Any application running in a VM that measures performance with respect to time – for example, requests per second, transactions per second, or response time – has temporal components that should be considered unreliable.

Default settings To this point, the discussion has assumed that none of the CPU resource management features of ESX Server have been changed from their default values, which allow a virtual processor to consume up to 100% of an execution core or as little as 0%. The default configuration allows ESX Server to dynamically change which logical processor clock cycles are used to fulfill the time allotted to a virtual processor; in other words, ESX Server can move a virtual processor between logical cores, or even physical processors, in response to shifting loads within the physical host.

While unique resource management settings can be configured for each VM, VMware recommends leaving these settings at their defaults and allowing the VMkernel to make scheduling decisions with maximum flexibility.

Impact of processor cache size Generally speaking, VM performance is more sensitive to processor cache size than to the speed of the processor. Cache size is important when multiple VMs are switching between execution states, reducing the effective CPU cache hit ratio. In a non-virtualized server, this ratio may be as high as 90%, which is sustainable in a single operating system environment; however, in a virtualized environment, many operating systems are utilizing the same physical processor and core, making it difficult to sustain such a high ratio. This reduction of the cache hit ratio is known as cache fragmentation or cache pollution.

Page 13: VMware ESX Server HP Virtualization


To illustrate the performance impact of cache fragmentation, consider an ESX Server environment freshly booted with no VMs powered on. When the first VM is powered on, the cache hit ratio for its processor is initially zero but begins to increase; after some time, this VM, running alone on the processor, might achieve a hit ratio that is high enough to improve performance. When a second VM is powered on and scheduled to execute on the same processor, this VM begins to populate processor cache with its own data and processes, replacing the cached data and processes from the first VM. When the first VM is next scheduled for execution, the cache hit ratio will be lower than previously achieved.

While the hit ratio will improve over time, it is likely to be lower during initial execution cycles (as shown in Figure 6), forcing VMs to execute from main memory. Because access to main memory is much slower than access to processor cache, the VM will run slower until the hit ratio improves.

Figure 6: Simulated impact of cache fragmentation on the CPU cache hit ratio, showing the ratio dropping to zero each time a world switch (indicated by a red line) occurs

Note: Unlike processor registers, processor cache is not restored or saved when switching between executions of virtual machines.

The impact of cache fragmentation is intensified in a higher-density deployment. With more VMs running per processor, each VM runs for a shorter period of time, which may limit a VM’s ability to fully populate and realize the benefits of processor cache. As density increases, the following conditions occur:

• There are more VMs to push data out of cache
• The length of time between executions for each VM increases

These conditions combine to reduce the amount of data that remains cached between executions.

A larger processor cache can help combat this performance degradation by providing sufficient capacity to retain a significant amount of cached data between executions, improving the cache hit ratios for all VMs running on the processor.
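A toy simulation in the spirit of Figure 6 illustrates why longer runs between world switches (lower VM density) or a larger cache lead to better hit ratios. The numbers and growth rate below are invented for illustration and do not model any real processor.

```python
# Toy simulation of a CPU cache hit ratio that resets at every world switch,
# in the spirit of Figure 6. All numbers are illustrative, not measured.

def simulate(run_length: int, switches: int, warmup: int) -> list[float]:
    """run_length: scheduling quanta a VM runs before being switched out.
    warmup: quanta needed to fully repopulate the cache (a larger cache that
    retains data between runs would behave like a smaller warmup).
    Returns the hit ratio over time; it resets to zero at each world switch."""
    ratios = []
    for _ in range(switches):
        hit = 0.0                                # world switch: cache now cold
        for _ in range(run_length):
            hit = min(0.9, hit + 0.9 / warmup)   # ramps linearly toward 90%
            ratios.append(hit)
    return ratios

short_runs = simulate(run_length=4, switches=3, warmup=10)   # high VM density
long_runs = simulate(run_length=20, switches=3, warmup=10)   # low VM density
print(max(short_runs), max(long_runs))  # ~0.36 vs 0.9: short runs never warm up
```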

Impact of cache on scheduling Cache also plays an important role in the scheduling decisions made by the VMkernel. The VMkernel understands the relationship between cache and VM performance. Specifically, the VMkernel understands

Page 14: VMware ESX Server HP Virtualization


that performance can be maximized by continuing to run a virtual processor on the same logical core (which is likely to contain cache pages for the particular VM⁴). As a result, the VMkernel is prepared to accept a temporary increase in contention within one logical core before migrating a virtual processor to a different core.

The decision to migrate or leave a virtual processor in place is governed by the potential penalty imposed by the migration. If the contention within a logical core causes a virtual processor to delay an execution request by a period that exceeds the migration penalty, the VMkernel recognizes that cache relevance has been outweighed by the contention and will migrate the virtual processor to a logical core able to serve the request more quickly.

To retain the ability to migrate virtual processors, VMware recommends that users allow ESX Server to determine the processor affinity of virtual machines. Many variables govern the scheduling decisions made by the VMkernel to guarantee the best possible performance from a particular physical configuration. Specifying processor affinity reduces the flexibility available to the VMkernel to make optimizations. Thus, VMware discourages forcing the association of a virtual machine with a specific processing core.
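The stay-or-migrate trade-off can be expressed as a simple comparison. The Python sketch below is a conceptual model rather than the actual VMkernel heuristic, and the migration penalty value is an assumed placeholder.

```python
# Conceptual model of the cache-relevance vs. contention trade-off.
# The migration penalty value is a made-up placeholder, not an ESX constant.

MIGRATION_PENALTY_US = 50  # assumed cost of losing cache relevance (microseconds)

def should_migrate(expected_wait_us: float,
                   penalty_us: float = MIGRATION_PENALTY_US) -> bool:
    """Stay on the current logical core (keeping warm cache) unless the expected
    delay from contention exceeds the penalty of rebuilding cache elsewhere."""
    return expected_wait_us > penalty_us

print(should_migrate(expected_wait_us=10))   # False - tolerate brief contention
print(should_migrate(expected_wait_us=200))  # True  - contention outweighs cache relevance
```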

Intel Xeon processors Xeon processors introduced Hyper-Threading Technology, which allows ESX Server to treat a single physical processor package as two logical processors. By design, hyperthreaded processors include a second instruction pipeline but still feature a single execution pipeline. The processor is solely responsible for distributing execution cycles between the instruction pipelines. From processor allocation and guarantee accounting perspectives, the VMkernel considers the two cores (instruction pipelines) to be equivalent, even though only one is executing at any given moment. This, in turn, means that two virtual processors scheduled to run in the two cores of a hyperthreaded processor are considered to have equal access to the physical processor.

One VM can be running in the physical processor’s execution pipeline while instructions for a second VM are staged in the secondary core. When the scheduled allocation for the first VM ends, the next VM is already populated within the processor and ready to execute, improving the speed of the world switch. However, the overall impact of hyperthreading on VM performance depends on the nature of the application.

⁴ A concept known as cache relevance.

Page 15: VMware ESX Server HP Virtualization


On the other hand, if only one virtual processor were scheduled to run within the hyperthreaded physical processor, the VMkernel would account for this exclusive access through its internal accounting capabilities. In this case, the virtual processor would be charged more for its exclusive consumption of the physical processor. The rationale behind this extra charge is that the single virtual processor consumes the full physical package, whereas two virtual processors within the two logical cores of a hyperthreaded physical processor are each utilizing half of the physical processor. Other than this, the concepts of resource management in ESX Server apply to servers with hyperthreaded processors in exactly the same manner as servers with single-core processors.

ESX Server, however, can also use this secondary core to address processor fragmentation by scheduling the two virtual processors of a Virtual SMP-enabled VM to use both cores of a hyperthreaded processor. However, since hyperthreading may cause contention between these two cores, overall performance depends on the nature of the particular application. If the application uses only a single virtual processor, leaving the second processor largely unused, hyperthreading gives ESX Server the flexibility to avoid the effects of processor fragmentation without significantly impacting application performance. If, however, simultaneous, parallel execution of the two virtual processor threads is required, poor application performance is likely.

As with most other parameters governing scheduling decisions, the policy for scheduling virtual processors in a hyperthreaded environment can be changed. For more information on these policies and how to change them, type man hyperthreading at the service console command prompt.

Note that the default hyperthreading sharing policy for a virtual processor is any, which places no restrictions on the allocation of cores between virtual processors. This setting allows the VMkernel to use both cores within each hyperthreaded physical processor for the scheduling and execution of any virtual processor within the system.

Setting hyperthreading sharing policy to none causes the particular VM to effectively ignore the fact that the physical processor is hyperthreaded and continue to consume physical packages as though they contained only a single logical core. Since this policy is set per VM, no virtual processor associated with this VM will share a physical package with any other; the other logical core within the package will remain unused.

VMs with more than one virtual processor can also use the internal setting for hyperthreading sharing policy. This allows the virtual processors of a single VM to share cores within a physical package; however, these virtual processors will not share cores with virtual processors associated with any other VM.

As with any parameters that can alter scheduling decisions made by the VMkernel, VMware strongly recommends accepting the default values for the hyperthreading sharing policy.
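The three sharing policies can be summarized by the constraint they place on which virtual processors may occupy the two logical cores of one hyperthreaded package at the same time. The Python sketch below is a conceptual illustration of those constraints, not a representation of VMkernel data structures or configuration syntax.

```python
# Conceptual model of the hyperthreading sharing policies described above.
# This is illustrative only; it does not reproduce VMkernel internals.

def may_share_package(policy: str, vcpu_a: tuple[str, int], vcpu_b: tuple[str, int]) -> bool:
    """vcpu_a / vcpu_b are (vm_name, vcpu_index) pairs that would occupy the two
    logical cores of one hyperthreaded physical package at the same time.
    policy is the sharing policy of the VM owning vcpu_a."""
    same_vm = vcpu_a[0] == vcpu_b[0]
    if policy == "any":        # default: no restriction on who shares the package
        return True
    if policy == "none":       # the VM consumes the package alone; the other
        return False           # logical core stays unused
    if policy == "internal":   # only vCPUs of the same VM may share the package
        return same_vm
    raise ValueError(f"unknown policy: {policy}")

print(may_share_package("any", ("vmA", 0), ("vmB", 0)))       # True
print(may_share_package("none", ("vmA", 0), ("vmB", 0)))      # False
print(may_share_package("internal", ("vmA", 0), ("vmA", 1)))  # True
print(may_share_package("internal", ("vmA", 0), ("vmB", 0)))  # False
```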

AMD Opteron processors AMD Opteron processors feature a unique memory architecture that integrates the memory controller into the high-speed core of the processor. As a result, AMD Opteron processors can access memory with extremely low latency, a capability that is particularly useful when cache fragmentation reduces the processor cache hit ratio, making VMs operate from main memory. The high-speed memory bus delivered by an Opteron processor can also translate into increased VM density on a particular processor by supporting faster world switches⁵.

Using NUMA architecture Non-Uniform Memory Access (NUMA) is a system architecture that groups memory and processor cores into nodes consisting of some physical memory and some processor packages and cores. Processor cores and memory within a single node are said to be within the same proximity domain. All memory is accessible to all cores, regardless of node membership. However, for cores accessing memory deployed in the same proximity domain, accesses are faster and encounter less contention than accesses to memory within a different node, as shown in Figure 9.

⁵ The process by which one VM is unscheduled and another scheduled to execute is known as a world switch. This process involves capturing one VM’s processor registers and writing them to main memory, then reading the registers for the other VM from main memory and, finally, writing those registers to the processor.

Page 16: VMware ESX Server HP Virtualization


Figure 9: Showing a NUMA implementation with two nodes

Page 17: VMware ESX Server HP Virtualization


HP ProLiant servers with AMD Opteron processors are NUMA systems: that is, the system BIOS creates a System Resource Allocation Table (SRAT) that presents the nodes and proximity domains to ESX Server. ESX Server is NUMA-aware and uses the contents of the SRAT to make decisions on how to optimally schedule VMs and allocate memory.

On NUMA systems, ESX Server attempts to schedule a VM thread to execute in core(s) that are in the same proximity domain as the memory associated with that VM. ESX Server also attempts to maintain physical memory for a particular VM within a single NUMA node. If a VM needs more memory than is available in its NUMA node, ESX Server allocates the additional memory from the nearest proximity domain.
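This placement preference can be summarized in a short sketch. The Python fragment below models node-local-first allocation with spillover to the nearest node; the node layout and distances are illustrative values, not data read from a real SRAT, and the logic is a simplification of the VMkernel’s actual NUMA scheduler.

```python
# Simplified model of NUMA-aware memory placement: prefer the node where the
# VM's virtual processors are scheduled, then spill to the nearest node.
# Node layout and distances are illustrative, not read from a real SRAT.

NODES = {0: {"free_mb": 2048}, 1: {"free_mb": 4096}}
DISTANCE = {(0, 0): 10, (0, 1): 20, (1, 0): 20, (1, 1): 10}  # lower = closer

def place_memory(home_node: int, needed_mb: int) -> dict[int, int]:
    """Allocate needed_mb starting with the VM's home node, then the remaining
    nodes in order of increasing distance."""
    allocation = {}
    for node in sorted(NODES, key=lambda n: DISTANCE[(home_node, n)]):
        take = min(needed_mb, NODES[node]["free_mb"])
        if take:
            allocation[node] = take
            NODES[node]["free_mb"] -= take
            needed_mb -= take
        if needed_mb == 0:
            break
    return allocation

# A VM scheduled on node 0 asking for more memory than node 0 has free:
print(place_memory(home_node=0, needed_mb=3072))   # {0: 2048, 1: 1024}
```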

The single-core Opteron processor creates a unique NUMA architecture where each NUMA node has only one processor core. While this is perfectly valid within NUMA specifications, it is more typical to deploy multiple cores in a single NUMA node. Within the ESX Server context, the only scenario that is affected by the unique Opteron NUMA presentation involves Virtual SMP.

The Virtual SMP code of ESX Server has been written to take particular advantage of the NUMA architecture of Opteron processors. When a VM is allocated two virtual processors, ESX Server attempts to schedule both threads to execute within a single NUMA node; with single-core Opteron processors, however, the SRAT dictates that each node has only one processor core, making it impossible for ESX Server to execute both virtual processor threads within a single proximity domain.

Dual-core Opteron processors implement an architecture with two processor cores per NUMA node, allowing ESX Server to schedule dual-processor VMs within the confines of a single NUMA node. As long as the number of virtual processors does not exceed the number of execution cores in a proximity domain, the ESX Server NUMA optimizations should remain in effect.

Disabling NUMA capability Most HP ProLiant servers with Opteron processors offer the capability to disable NUMA by enabling the node interleaving option in the BIOS. With node interleaving enabled, the HP ProLiant BIOS does not construct or present the NUMA SRAT, and the system appears to have a flat, uniform memory architecture.

VMware and HP recommend using NUMA features with single-processor VMs. For VMs with multiple virtual processors, testing is recommended to determine which setting delivers the best performance for your application.

Memory

Note: An outstanding resource for detailed information on memory virtualization is available at the ESX Server command line. Issuing the command man mem displays a comprehensive guide on ESX Server memory virtualization.

In a non-overcommitted situation, when a VM is powered on, the VMkernel attempts to allocate a region of physical memory for the exclusive use of this VM. This memory space must be no larger than the maximum memory size and no smaller than the minimum memory guarantee (assuming the VM is requesting at least its minimum memory allocation). The total memory space may initially consist of both physical RAM and VMkernel swap space. ESX Server performs this allocation and creates the address mappings that allow virtual memory to be mapped to the physical memory. When the VM is powered off, its memory allocation is returned to the pool of free, available physical memory.

Other memory consumers
Apart from VMs, there are two other consumers of memory within a physical host:

Service console
In the past, the service console required a specific amount of memory overhead for each VM running on a host. The release of ESX Server 3.0 employs a new architecture whereby the service console is no longer burdened with memory requirements per VM running on a host. Individual VM process threads are handled directly by the VMkernel, thereby eliminating the need for additional service console memory per VM. However, if you intend to run additional agents – for hardware monitoring and/or backup – in the service console, it may be prudent to allocate more memory than the defaults allow.

Memory management
ESX Server provides advanced memory management features that help ensure the flexible and efficient use of system memory resources. For example, ESX Server systems support VM memory allocations that are greater than the amount of physical memory available – overcommitting – as well as background memory page sharing and ballooning.

When attempting to power on a VM, an ESX Server host first verifies that there is enough free physical memory to meet the guaranteed minimum needed to support this VM. Once this admission control feature has been passed, the VMkernel creates and presents the virtual memory space.

While virtual memory space is created and completely addressed as the VM is powered on, physical memory is not allocated entirely at this time; instead, the VMkernel allocates physical memory to the VM as needed. In every case, VMs are granted uncontested allocations of physical memory up to their guaranteed minimums; because of admission control, these allocations are known to be present and available in the server.
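As a rough illustration of these two behaviors (illustrative Python, not VMkernel internals; the class and attribute names are invented), the sketch below admits a VM only if its guaranteed minimum fits in physical RAM, then grants physical pages only as they are actually touched.

class VM:
    def __init__(self, name, min_mb, max_mb):
        self.name, self.min_mb, self.max_mb, self.used_mb = name, min_mb, max_mb, 0

class Host:
    def __init__(self, physical_mb):
        self.physical_mb = physical_mb
        self.reserved_mb = 0        # sum of guaranteed minimums of running VMs
        self.allocated_mb = 0       # physical memory actually handed out

    def power_on(self, vm):
        # Admission control: every running VM's minimum must fit in physical RAM.
        if self.reserved_mb + vm.min_mb > self.physical_mb:
            raise RuntimeError("not enough memory to guarantee the minimum")
        self.reserved_mb += vm.min_mb

    def touch(self, vm, mb):
        # Lazy allocation: physical pages are granted only as the VM uses them.
        grant = min(mb, vm.max_mb - vm.used_mb)
        vm.used_mb += grant
        self.allocated_mb += grant

host = Host(physical_mb=1536)
vma = VM("VMA", min_mb=256, max_mb=512)
host.power_on(vma)       # admitted: the 256 MB minimum can be guaranteed
host.touch(vma, 100)     # only 100 MB of physical memory is consumed so far
print(host.reserved_mb, host.allocated_mb)   # 256 100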

If the entire physical memory pool is already being actively used when a VM requests the memory due to it under its guaranteed minimum, the VMkernel makes physical memory available by decreasing the physical memory allocation of another VM deployed on the same host. If memory cannot be reclaimed quickly enough, the VMkernel relies on its own swap file to accommodate the increased physical memory demand.

Consider the following example where two VMs, VMA and VMB, are each guaranteed a minimum of 256 MB and a maximum of 512 MB of RAM, and two additional VMs, VMC and VMD, are each guaranteed a minimum of 512 MB and a maximum of 1024 MB of RAM. Ignoring all service console allocations and virtualization overheads for a moment, assume that the server has 1.5 GB of RAM and that VMA and VMB are each actively using 512 MB of physical memory while VMC and VMD are each actively using only 256 MB. Admission control allows all of these VMs to run since the 1.5 GB of physical memory can accommodate the guaranteed minimums. While all physical memory is consumed in this example (as shown in Figure 7), some machines are not actively using their guaranteed minimum or maximum allocations.

Figure 7: In this example, VMC and VMD are using less than their guaranteed minimum memory allocations; all physical memory is consumed

Now, to continue with this example, VMC and VMD each request an additional 256 MB of memory, which is guaranteed and must be granted. To accommodate these additional allocations, ESX Server reclaims physical memory from VMA and VMB (as shown in Figure 8), both of which are operating above their minimum guaranteed allocations. If VMA and VMB have equivalent memory shares, each should relinquish roughly the same amount of memory.

Figure 8: In this continuing example, VMC and VMD are allocated memory that has been reclaimed from VMA and VMB

However, the applications running within VMA and VMB are not aware of memory guarantees and will continue to address what they perceive to be their full memory ranges; as a result, the VMkernel must address the deficits.
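Using the illustrative numbers from this example (simple arithmetic, not the VMkernel's actual reclamation algorithm; the share values are invented), the following sketch shows VMA and VMB each relinquishing 256 MB so that VMC and VMD can reach their guaranteed minimums.

guaranteed_min = {"VMA": 256, "VMB": 256, "VMC": 512, "VMD": 512}   # MB
in_use         = {"VMA": 512, "VMB": 512, "VMC": 256, "VMD": 256}   # MB
shares         = {"VMA": 1000, "VMB": 1000}   # equal shares for the two donors

# VMC and VMD each need 256 MB more to reach their guaranteed minimums.
shortfall = sum(guaranteed_min[vm] - in_use[vm] for vm in ("VMC", "VMD"))  # 512

total_shares = sum(shares.values())
for vm in shares:
    give_back = shortfall * shares[vm] // total_shares   # 256 MB from each donor
    in_use[vm] -= give_back

print(in_use["VMA"], in_use["VMB"])   # 256 256 -- back at their guaranteed minimums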

Using the balloon driver
The VMkernel has a range of options for addressing these deficits; two options are active and one passive. The preferred active approach is to employ the balloon driver, a virtual memory controller driver installed in a VM with the VMware Tools package. The VMkernel can instruct the balloon driver to inflate (consume) memory within the memory space of a VM, forcing the guest operating system to use its own algorithms to swap its own memory contents.

Note: In the case of a balloon driver-induced swap within a VM, memory is swapped to the VM’s – rather than the VMkernel’s – swapfile.

Memory that is reclaimed by the balloon driver and then distributed to an alternate VM is cleared prior to the re-distribution. Therefore, the re-allocated memory contains no residual information from the virtual machine that previously occupied that memory space. This process reinforces the isolation and encapsulation properties that are inherent within VMware virtual machines.

Since the guest operating system is able to make intelligent decisions about which pages are appropriate to swap and which are not, ESX Server uses the balloon driver to force the guest operating system to apply this intelligence to reduce the physical memory used by its processes. At the same time, ESX Server is able to identify the memory pages consumed by the balloon driver. These consumed pages are useless to the VM but, to ESX Server, they represent physical memory that is essentially free to commit to other VMs. The balloon driver only inflates by an amount that is enough to reduce the VM's physical memory utilization to the appropriate guaranteed minimum memory allocation.

Using a swap file
Beyond the balloon driver, ESX Server also supports a swap file for each VM. Once the balloon driver has reduced every VM within a physical host to the guaranteed minimum memory allocation, additional memory requested by VMs is granted through the VM swap file. ESX Server, which maintains a swap file for each VM in the same location as the virtual machine configuration file (.vmx file), simulates the additional memory by swapping memory contents to disk. The swap file is used when there simply is not enough physical memory to accommodate requests beyond the guaranteed minimums.

Since the balloon driver is not an instantaneous solution, it may take a few minutes to fully inflate and free hundreds of megabytes of physical memory. During this time, the VMkernel can use the VM swapfile to provide the memory requested. After the ballooning operation is complete, the VMkernel may be able to move all pages from the swapfile into physical memory.

In extreme circumstances when the VMkernel cannot provide timely memory allocations – either through ballooning or the use of the swapfile – the VMkernel may temporarily pause a VM in an attempt to meet memory allocation requests.

Using background memory page sharing
Both the VMkernel swapfile and the balloon driver can be considered active mechanisms to combat memory overcommit. By contrast, background memory page sharing is a passive process that uses idle cycles to identify redundant memory contents and consolidate them to reclaim physical memory. When ESX Server detects an extended period of idleness in the system, the VMkernel begins to compare physical memory pages using a hashing algorithm. After encountering two memory pages that appear to have the same contents, a binary compare is executed to confirm that the contents are identical. ESX Server then frees one of the memory pages by updating the memory mappings for both VMs to point to the same physical memory address. In this way, physical memory can be freed up for additional VMs.

ESX Server performs a copy-on-write operation so that one VM can update a shared page without affecting the original, shared data. Should a VM attempt to write to or modify a shared memory page, ESX Server first copies the shared memory page so that a distinct instance is created for each VM. The VM requesting the write operation is then able to modify the page's contents without affecting other VMs sharing this same page.
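The sketch below is a highly simplified Python model of this process (hash, binary compare, share one copy, then copy-on-write on modification); it is not VMkernel code, and the VM names and page contents are invented.

import hashlib

def share_pages(pages):
    """pages: dict of (vm, page_no) -> bytes. Returns a backing map."""
    by_hash, backing = {}, {}
    for key, data in pages.items():
        digest = hashlib.sha1(data).digest()
        canon = by_hash.get(digest)
        if canon is not None and pages[canon] == data:   # binary compare after a hash hit
            backing[key] = canon                         # back both pages with one copy
        else:
            by_hash[digest] = key
            backing[key] = key                           # keeps its own physical copy
    return backing

pages = {("VM1", 0): b"\x00" * 4096,
         ("VM2", 0): b"\x00" * 4096,
         ("VM2", 1): b"unique data"}
backing = share_pages(pages)
print(backing[("VM2", 0)])              # ('VM1', 0): the identical page is shared

def write_page(key, new_data):
    # Copy-on-write: the writer gets its own private copy before the update.
    pages[key] = new_data
    backing[key] = key

write_page(("VM2", 0), b"\x01" * 4096)
print(backing[("VM2", 0)])              # ('VM2', 0): sharing is broken for this page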

Background memory page sharing should not have a noticeable impact on VM performance since, by default, the memory scrubbing algorithms that detect redundant pages are only active during periods of low activity.

The intended effect of background memory page sharing is to create free physical memory; how this free memory is used will be dependent upon the specific virtualized environment. In many cases, the memory simply remains free until VMs attempt to modify the shared pages; however, in some cases, the free physical memory created by background memory page sharing is used to power on an additional VM. Note that it is possible to free enough physical memory to allow more total memory to be allocated for VMs than is available on the server.

More on memory overcommitting
Memory overcommit should be well understood before relying on this feature to deliver higher VM densities; there is the potential for a significant performance impact in memory overcommit situations. This feature, like many others, is best explained through an example. Begin with the following assumptions:

• The physical host is an HP ProLiant server with 2.0 GB of RAM (assume 2.0 GB = 2048 MB)
• There are three VMs; all are powered on
• Each VM has been assigned 576 MB of RAM

The total physical memory required for all three VMs is therefore:

3 (virtual machines currently powered on) × 576 MB (physical memory currently allocated per VM) = 1728 MB

Initially, this physical host would not have enough memory available to power on a fourth VM. However, through background memory page sharing, ESX Server may eventually find the necessary 256 MB of redundant memory pages.
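The arithmetic behind this example can be summarized in a few lines (figures taken directly from the assumptions above):

host_ram_mb = 2048
vm_size_mb  = 576
running_vms = 3

allocated_mb = running_vms * vm_size_mb      # 1728 MB in use by the three VMs
free_mb      = host_ram_mb - allocated_mb    # 320 MB left over
shortfall_mb = vm_size_mb - free_mb          # 256 MB must come from page sharing

print(allocated_mb, free_mb, shortfall_mb)   # 1728 320 256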

Note: The amount of redundant memory reclaimed on a system is highly dependent on the nature of the specific environment. The opportunity to share memory dramatically increases in an environment where VMs are executing the same OS. Metrics appearing in this example are only used for illustrative purposes and do not represent actual physical memory that may be reclaimed.

With this newly reclaimed memory, ESX Server has enough free RAM to power on a fourth VM. Unless the first three VMs attempt to update their shared memory pages, this system will continue to function as expected with no discernible manifestation of either the memory page sharing or the memory overcommitment. However, if activity in the three original VMs should increase to the extent that each VM needs its own distinct pages, ESX Server can accommodate this shortage of physical memory through the use of a swapfile.

Based on the share-based memory allocation policy, ESX Server reclaims physical memory by moving the memory contents of a VM to disk. Ordinarily, this would be a very risky operation since there is no reliable, programmatic method for the VMkernel to distinguish VM pages that are optimal for swapping to disk from pages that should never be swapped to disk (for example, it would be inappropriate to swap the guest operating system's kernel pages). ESX Server solves this problem through the use of the balloon driver.

It is important to note, though, that this overcommit scenario with the use of the ESX Server swapfile can have serious implications on the performance of the applications running in a VM. In environments where performance is, in any way, a concern, avoid memory overcommitment. If possible, configure the physical ESX Server platform with enough physical memory to accommodate all hosted VMs.

Recommendations for memory virtualization
From a performance perspective, the recommendations for memory virtualization are few and straightforward:

• When configuring servers to run ESX Server, try to install as much memory as possible on these systems, so that more VMs can be run on them without memory overcommitment.

• To handle memory overcommit for virtual machines, you should install the VMware Tools package into your VMs. This package includes a memory controller driver that allows ESX Server to gracefully reclaim memory from individual VMs.

• When installing ESX Server, VMware recommends creating a VMkernel swapfile whose size is between 100% and 200% of the amount of physical RAM installed in a system. This large swapfile can accommodate significant overcommitment and provide the VMkernel with maximum flexibility when addressing memory allocation requests.

Network
Network virtualization in ESX Server is centered on the concept of a virtual switch, which is a software representation of a 1016-port, full-duplex Ethernet switch. The virtual switch is the conduit between VM network interfaces in two VMs or between a VM and the physical network. VM network interfaces connect to virtual switches; virtual switches connect to both physical and virtual network interfaces, as shown in Figure 10.

Figure 10: Showing a virtual switch providing connectivity between VMs and the network

When an application running in a VM attempts to send a packet, the request is handled by the operating system and pushed to the network interface card device driver through the network stack. Inside the VM, however, the network interface driver is actually the driver for the abstracted instance of the network resource; the pathway through this virtual interface is not directly to the physical network interface but, instead, passes the packet to the virtual switch components of the VMkernel. Once the packet has been passed to the virtual switch, the VMkernel forwards the packet to the appropriate destination – either out of the physical interface or to the virtual interface of another VM connected to the same virtual switch. Since the virtual switch is implemented entirely in software, switch speed is a function of server processor power.

How to dedicate a physical NIC to a VM
A common question is, “How do I dedicate a physical NIC to a VM?”

By connecting a single virtual network adapter and a single physical network interface to a virtual switch, a single VM obtains exclusive use of the physical interface. In this way, a physical NIC can be dedicated to a VM. However, if a second VM is connected to the same virtual switch, both VMs will pass traffic through the same physical interface.

A more complete response to the question of dedicating a physical NIC to a VM would be that a physical network interface is not dedicated to a VM; instead, ESX Server is configured, through virtual switches, to bridge only a single virtual network adapter through an individual physical interface.

Configuring virtual switches
Since it is possible to connect either more than one virtual adapter to a virtual switch or more than one physical adapter to a single virtual switch (as shown in Figure 11), consider the multiplexing operation of a virtual switch in each of these scenarios.

Figure 11: Showing connectivity options

First, consider a virtual switch connected to two virtual network adapters deployed in two different VMs. Just as if this were a physical switch with two servers connected, these VMs can communicate with one another via the virtual switch. This scenario can be scaled up to the limits of the virtual switch, a 1016-port device (32 ports by default), allowing up to 1016 VMs attached to the same virtual switch to communicate within a physical host. In this environment, with no physical adapter connected to the virtual switch, network traffic is utterly isolated from the physical network segment.

If this example is modified by connecting 1016 VMs and one physical adapter to the virtual switch, all 1016 VMs can communicate with the physical network via the single physical interface. Interestingly, in ESX Server, physical adapters connected to virtual switches do not deduct from the number of ports available on a virtual switch.

Note that, in the example, each VM has only a single network interface connected to the virtual switch; there is no reason for a VM to have more than one virtual network interface connected to a single virtual switch. In fact, ESX Server does not allow a VM to be configured with more than one virtual network adapter connected to a single virtual switch.

Virtual network adapters, virtual switches, and the connections between these devices, are VMkernel processes whose speeds are dictated by server CPU speed; as purely software processes within the VMkernel, these devices are operational as long as the VMkernel is operational. Unlike physical network components, virtual network devices cannot fail or reach the physical limitations of media throughput. As a result, there is no need to use multiple virtual adapters to address fault tolerance or performance concerns – not always the case for physical adapters.

When interfacing with the physical world, however, virtual switches can be connected to multiple physical network adapters, as shown in Figure 12.

Figure 12: Connecting virtual switches to multiple physical network adapters

When multiple physical network interfaces are attached to a single virtual switch, ESX Server and the VMkernel recognize this as an attempt to address the fault-tolerance and performance concerns of physical networks and automatically create a bonded team of physical network interfaces. This bonded team is able to send and receive higher rates of data and, should a link in the bonded team fail, the remaining member(s) of the team continue to provide network access. In other words, the connection of multiple physical adapters to the same virtual switch creates a fault-tolerant NIC team for all VMs communicating through this virtual switch. There is no need for a driver or special configuration settings within the VMs.

The fault-tolerance delivered by the NIC team is completely transparent to the guest operating system. Indeed, even if the guest operating system does not support NIC teaming or fault-tolerant network connections, the VMkernel and the virtual switch deliver this functionality through the abstracted network service exposed to the VM.

Load distribution
ESX Server does not distribute the frames that make up a single TCP session across multiple links in a bond. This means that a session with a single source and single destination never consumes more bandwidth than is provided by a single network interface. However, with IP-based load balancing, multiple sessions to multiple destinations can consume more total bandwidth than any single physical link provides. The only scenario in which the same frame is sent over more than one interface occurs when no network links within the bond can be verified as functional.

The network load sharing capability of ESX Server can be configured to employ load sharing policies based on either Layer 2 (based on the source MAC address) or Layer 3 (based on a combination of the source and destination IP addresses). By default, ESX Server uses the Layer 2 policy, which does not require any configuration on the external, physical switch.

With MAC-based teaming, because the only consideration when determining which link to use is the source MAC address, a VM always transmits frames over the same physical NIC within a bond. However, the IP-based load distribution algorithm typically results in a more evenly balanced utilization of all physical links in a bond.

If ESX Server is configured to use the IP-based load-distribution algorithm, the external, physical switch must be configured to communicate using the IEEE 802.3ad specification. Because the VM’s MAC address will appear on each of the physical switch ports over which it transmits, this configuration is likely to confuse the switch unless 802.3ad is enabled. The load-distribution algorithm also handles inbound and outbound traffic differently.

Figure 13 compares MAC-based load balancing with IP-based load balancing.

Figure 13: Comparing MAC- and IP-based load distribution


Distributing outbound traffic
With the IP-address-based load-distribution algorithm enabled, and outbound network packets being sent from a virtual machine, non-IP frames are distributed among the network interfaces within a single bond in a round-robin fashion. IP-based traffic is, by default, distributed among the member interfaces of a bond based on the destination IP address within the packet. This algorithm has the following strengths and weaknesses:

• It prevents out-of-order TCP segments and provides, in most cases, reasonable distribution of the load.
• 802.3ad-capable network switches may be required, as there have been reports that this algorithm confuses non-802.3ad switches.
• It is not well suited for environments where a VM communicates with a single host – in this environment, all traffic is destined for the same IP address and would not be distributed.
• With this algorithm, all traffic between each pair of hosts traverses only one link per pair of hosts until a failover occurs.
• The transmit NIC to be used by a VM for a network transaction is chosen based on whether the combination of destination and source IP addresses is even or odd. Specifically, the IP-based algorithm uses an exclusive-or (XOR) of the last bytes in both the source and destination IP addresses to determine which physical link should be used for a source-destination pair (see the sketch after this list). By considering both the source and destination IP addresses when selecting the bond member for a particular host, it is possible for two VMs within the same ESX Server host, using the same network bond, to select different physical interfaces, even when communicating with the same remote host. It is important to note that the load distribution algorithm is not bound on a per-VM basis; in other words, the path selection for load distribution supports different physical paths to the same destination on a multi-homed virtual adapter.
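A minimal Python sketch of this selection logic follows. The XOR of the last address bytes comes from the description above, while the modulo mapping of that value onto a NIC index (and the MAC-based comparison function) are simplifying assumptions for illustration only.

def ip_based_nic(src_ip, dst_ip, nic_count):
    # XOR the last byte of the source and destination IPv4 addresses, then map
    # the result onto one of the NICs in the bond (modulo is an assumption here).
    src_last = int(src_ip.split(".")[-1])
    dst_last = int(dst_ip.split(".")[-1])
    return (src_last ^ dst_last) % nic_count

def mac_based_nic(src_mac, nic_count):
    # MAC-based selection keys only on the source adapter, so a given VM always
    # leaves through the same physical NIC in the bond.
    return int(src_mac.replace(":", ""), 16) % nic_count

# Two VMs on the same host, talking to the same remote host, can end up on
# different physical links once both addresses are considered.
print(ip_based_nic("10.0.0.11", "10.0.0.50", 2))    # 1
print(ip_based_nic("10.0.0.12", "10.0.0.50", 2))    # 0
print(mac_based_nic("00:50:56:aa:bb:01", 2))        # always the same NIC for this VM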

Distributing inbound traffic
For inbound traffic destined for a VM, there are two possible load distribution methods, depending on the capabilities of the external, physical switch. For non-802.3ad switches, the return path for packets is determined by the ARP table built by the switch and by an understanding of which machines (virtual or physical) are connected to which physical ports. Because performance with non-802.3ad switches may be affected by this learning process, higher throughput is possible with 802.3ad-compatible switches when using bonded NICs with ESX Server.

Having decided which load distribution mechanism is most appropriate for your environment, you must configure your virtual switches appropriately. This advanced configuration is available through the Virtual Infrastructure Client management interface under the “Configuration” tab for the appropriate ESX Server host.

ESX Server supports several other virtual switch and load-distribution settings, which are documented in the Virtual Infrastructure Server Configuration Guide.

Eliminating the switch as a single point of failure
ESX Server allows a single team of bonded NICs to be connected to multiple physical switches, eliminating the switch as a single point of failure. This feature requires the beacon monitoring feature of both the physical switch and ESX Server NIC team to be enabled.

Beacon monitoring allows ESX Server to test the links in a bond by sending a packet from one adapter to the other adapters within a virtual switch across the physical links. For more information on the beacon monitoring feature, see the ESX Server administration guide.

Improving network performance
If a VM is not delivering acceptable network performance, load distribution through NIC teaming and bonding (as described above) can improve performance in certain situations.

Bonding physical Gigabit Ethernet ports is not likely to improve performance; in general, VMs cannot saturate a single Gigabit Ethernet port because of the CPU overhead on extremely high network throughput. However, bonding 100Mbps Ethernet ports can improve throughput if all the following conditions are met:

• There are spare CPU cycles within the physical server to handle the additional processing required for increased traffic

• The VM is communicating with more than one destination
• ESX Server is configured to perform Layer 3-based load distribution (using IP addresses)
• The performance limitation is the network interface

Another effective way to improve VM performance is by deploying the VMware vmxnet device driver. By default, VMs are created with a highly-compatible virtual network adapter – the device reports itself as an AMD PCNet PCI Ethernet adapter (vlance). This device is used as the default because of its near-universal compatibility – there are DOS drivers for this adapter, as well as Linux, Netware, and all versions of Windows. While this virtual adapter reports link speeds of 10Mbps with only a half-duplex interface, the actual throughput can be much closer to the capabilities of the physical interface.

If the vlance adapter is not delivering acceptable throughput or if the physical host is suffering from excessive CPU utilization, higher throughput may be possible by changing to the vmxnet adapter, which is a highly-tuned virtual network adapter for VMs. The vmxnet driver is installed as a component of the VMware Tools package, and must be supported by the operating system running in the virtual machine. For a list of supported operating systems, please see the System Compatibility Guide for ESX Server 3.

Another key to maximizing the performance of physical network adapters is the manual configuration of the speed and duplex settings of both the physical network adapters in an ESX Server and the physical switches to which the ESX Server is connected. VMware Knowledge Base article #813 details the settings and steps necessary to force the speed and duplex of most physical network adapters.

In most cases, ESX Server is configured to dedicate a physical network adapter to the service console for management and administration. There are, however, scenarios where it may be necessary to have the service console use a network adapter that is allocated to the ESX Server VMkernel. Such scenarios are usually introduced by dense server blade configurations that have only two physical NICs and cannot spare an entire physical interface for the service console. In this case, the service console can access the same virtual networking resources (virtual switches and network adapters). This is achieved by correctly configuring a single virtual switch to handle a combination of interfaces such as the service console, VMotion and VM networks. Although consolidating the interfaces is not a recommended best practice, it is possible.

How the network perceives VMs
It is critical to remember that all virtual network adapters have their own MAC addresses. Since TCP/IP is governed by the operating system within a VM, each VM requires its own IP address for network connectivity. To an external network, a VM looks exactly like a physical machine, with every packet having a unique source MAC and IP address.

To handle multiple source MAC addresses, the physical network interface of the server is put into promiscuous mode. This causes its physical MAC address to be masked; all packets transmitted on the network segment are presented to the VMkernel virtual switch interface. Any packets destined for a VM are forwarded to the virtual network adapter through the virtual switch interface. Packets not destined for a VM are immediately discarded. Similarly, network nodes perceive packets from a VM to have been transmitted by the VM; the role of the physical interface is undetectable – the physical network interface has become similar to a port on a switch, an identity-less conduit.

VLANs
ESX Server and virtual switches also support IEEE 802.1Q VLAN tagging. To increase isolation or improve the security of network traffic, ESX Server allows VMs to fully leverage existing VLAN capabilities and even extends this functionality by implementing VLAN tagging within virtual switches.

VLAN tagging allows traffic to be isolated within the confines of a switched network. Traditionally, VLAN tagging is performed by a physical switch, based on the physical port on which a packet arrives at the switch. In an environment with no virtualized server instances, this approach provides complete isolation within broadcast domains. However, when virtualization is introduced, port-based tagging at the physical switch does not provide VLAN isolation between VMs that share the same physical network connection.

To address the scenario where broadcast-domain isolation is required between two VMs sharing the same physical network, virtual switches support the creation of port groups that can provide VLAN tagging isolation between VMs within the confines of a virtual switch. Port groups aggregate multiple ports under a common configuration and provide a stable anchor point for virtual machines connecting to labeled networks. Each port group is identified by a network label, which is unique to the current host, and can optionally have a VLAN tagging ID.
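The toy Python model below (not the VMware API; the switch name, network label, and VLAN ID are hypothetical) shows the relationship just described: a VM's virtual adapter connects to a port group by its network label, and the port group optionally carries the VLAN ID.

class PortGroup:
    def __init__(self, label, vlan_id=None):
        self.label, self.vlan_id = label, vlan_id

class VirtualSwitch:
    def __init__(self, name):
        self.name, self.port_groups = name, {}

    def add_port_group(self, label, vlan_id=None):
        self.port_groups[label] = PortGroup(label, vlan_id)

    def connect(self, vm_name, label):
        # A VM's virtual adapter attaches to the port group by its network label.
        pg = self.port_groups[label]
        tag = f"VLAN {pg.vlan_id}" if pg.vlan_id else "untagged"
        return f"{vm_name} -> {self.name}/{pg.label} ({tag})"

vswitch = VirtualSwitch("vSwitch0")
vswitch.add_port_group("Production Network", vlan_id=105)
print(vswitch.connect("VM1", "Production Network"))
# VM1 -> vSwitch0/Production Network (VLAN 105)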

Considerations when configuring virtual switches
• When initially configuring your virtual switches on ESX Server, invest in creating a naming convention that provides meaningful names for these switches beyond the context of a single server. For example, VMotion requires that both the source and destination ESX Server have the same network names; for this reason, virtual switch names like “Second Network” may not translate from server to server as easily as more definitive designations like “Production Network” or “Management Network.”

• ESX Server supports a maximum of 20 physical NICs, whether 100 Mbps or 1 Gbps

• A virtual switch provides up to 1016 ports for virtual network adapter connections (the default is 32). However, physical connections do not consume ports on virtual switches. For example, if four physical network cards are connected to a single virtual switch, that switch still has all 1016 ports available for VMs.

• When using VLAN tagging within a virtual switch, you should configure the VM’s network adapter to connect to the name of the port group rather than the name of the virtual switch. Note that the external, physical switch port to which ESX Server connects should be set to VLAN trunking mode to allow the port to receive packets bound for multiple broadcast domains.

• A virtual switch may connect to multiple virtual network adapters (multiple VMs), but a VM can have no more than one connection to any virtual switch

• A physical adapter may not connect to more than one virtual switch, but a virtual switch may connect to multiple physical network adapters. When multiple physical adapters are connected to the same virtual switch, they are automatically teamed and bonded.

• If you are implementing VMotion within your ESX Server environment, reserve or assign a Gigabit NIC for VMotion to ensure the quickest possible migration.

Note: VMware only supports VMotion over Gigabit Ethernet; VMotion over a 10/100 Mbps network is not supported.

Storage
Storage virtualization is probably the most complex component within an ESX Server environment. Some of this complexity can be attributed to the robust, feature-rich Storage Area Network (SAN) devices deployed to provide storage, but much is due to the fact that SANs and servers are often managed independently, sometimes by entirely different organizations. As a result, this white paper discusses storage virtualization from the following two perspectives:

• How the SAN (iSCSI, fibre channel) sees ESX Server
• How ESX Server sees the SAN (iSCSI and fibre channel)

Presenting both perspectives should help both SAN and server administrators better communicate their unique requirements in an ESX Server deployment.

Architecture
Figure 14 presents an overview of virtual storage.

Figure 14: A virtual storage solution with three VMs accessing a single VMFS volume

ESX Server storage virtualization allows VMs to access underlying physical storage as though it were a locally attached SCSI disk (JBOD) within the VM – regardless of the physical storage topology or protocol. In other words, a VM accesses physical storage by issuing read and write commands to what appears to be a local SCSI controller with a locally-attached SCSI drive. Either an LSILogic or BusLogic SCSI controller driver is loaded in the VM so that the guest operating system can access storage exactly as if this were a physical environment.

When an application within the VM issues a file read or write request to the operating system, the operating system performs a file-to-block conversion and passes the request to the driver. However, the driver in an ESX Server environment does not talk directly to the hardware; instead, the driver passes the block read/write request to the VMkernel, where the physical device driver resides, and the request is then forwarded to the physical storage controller. In previous versions of ESX Server, the physical device drivers were not loaded in the kernel, which created an extra leg in the journey from the VM to the physical storage. The integration of the drivers into the kernel in ESX Server 3 removes this extra translation layer and improves I/O performance.

The storage controller may be a locally-attached RAID controller or a remote, multi-pathed SAN device – the physical storage infrastructure is completely hidden from the virtual machine. To the SAN, however, the converse is true: VMs are completely hidden from the physical storage infrastructure. The storage controller sees I/O requests that appear to originate from an ESX Server; all storage bus traffic from VMs on a particular physical host appears to originate from a single source.

There are two ways to make blocks of storage accessible to a VM:

• Using an encapsulated, VMware File System (VMFS)-hosted VM disk file
• Using a raw LUN formatted with the operating system’s native file system

VMFS
The vast majority of (unclustered) VMs use encapsulated disk files stored on a VMFS volume.

Note: VMFS is a high-performance file system that stores large, monolithic virtual disk files and is tuned for this task alone.

To understand why VMFS is used requires an understanding of VM disk files. Perhaps the closest analogy to a VM disk file is an .ISO image of a CD-ROM disk, which is a single, large file containing a file system with many individual files. Through the virtualization layer, the storage blocks within this single, large file are presented to the VM as a SCSI disk drive, made possible by the file and block translations described above. To the VM, this file is a hard disk, with physical geometry, files, and a file system; to the storage controller, this is a range of blocks.

A VM disk file is, for all intents and purposes, the hard drive of a VM. This file contains the operating system, applications, data, and all the settings associated with a typical/conventional hard drive. If an administrator were to delete a VM disk file, it would be analogous to throwing a physical hard drive in the trash – the data, the operating system, the applications, the settings, and even blocks of storage would be lost. By the same token, if an administrator were to copy a VM disk file, an exact duplicate of the VM’s hard drive would be created for use as a backup or for cloning the particular configuration.

Unlike Windows and Linux operating systems, ESX Server does not lock a LUN when it is mounted – a simple fact that is the source of both power and potential confusion in an ESX Server environment. When configuring a switched SAN topology, it is critical to use zoning, selective storage presentation, or LUN masking to limit the number of physical servers (non-ESX Server) that can see a particular LUN. Without limiting which physical – Windows or Linux – servers can see a LUN, locking and LUN contention will quickly cause data to become inaccessible or inconsistent between nodes.

VMFS is inherently a distributed file system, allowing more than one ESX Server to view the same LUN. Unlike Windows/NTFS or Linux/ext3, ESX Server/VMFS supports simultaneous access by multiple hosts. This means that while numerous ESX Server instances may view the contents of a VMFS LUN, only one ESX Server may open a file at any given moment. To an ESX Server and VMFS, when a VM is powered on, the VM disk file is locked.
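The following Python sketch models the behavior described above (many hosts can see the volume, but a powered-on VM's disk file is locked by a single host); it is a conceptual illustration rather than an implementation of VMFS on-disk locking, and the host and file names are hypothetical.

locks = {}      # VM disk file path -> host that currently holds the lock

def power_on(host, vmdk_path):
    owner = locks.get(vmdk_path)
    if owner and owner != host:
        raise RuntimeError(f"{vmdk_path} is locked by {owner}")
    locks[vmdk_path] = host      # the lock is per file, not per LUN or volume

def power_off(host, vmdk_path):
    if locks.get(vmdk_path) == host:
        del locks[vmdk_path]

power_on("esx01", "/vmfs/demo/vm1.vmdk")
power_on("esx02", "/vmfs/demo/vm2.vmdk")      # a different file: allowed
try:
    power_on("esx02", "/vmfs/demo/vm1.vmdk")  # same file: rejected while locked
except RuntimeError as err:
    print(err)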

While VMotion is described in detail later in the document, it might be helpful to explain now that, in a VMotion operation, the VM disk file remains in place on the SAN, in the same LUN; file ownership is simply transferred between ESX Server hosts that have access to the same LUN.

The distributed nature of VMFS means that, when configuring the SAN to which ESX Server is attached, zoning should be configured to allow multiple ESX Servers to access the same LUN where the VMFS partition resides. This may be out of the ordinary for the SAN administrator.

Figure 15 shows a typical SAN solution.

Figure 15: A virtual storage solution with six VMs accessing LUNs on a SAN array

There are many perspectives to a virtualized storage environment:

• To VMs, VMFS is completely hidden. A VM is not aware that the storage that it sees is encapsulated in a larger file within a VMFS volume.

• To ESX Server, multiple LUNs and multiple VMFS partitions may be visible. ESX Server can run multiple VMs from multiple SAN devices; that is, ESX Server does not require or prefer that all VMs are in the same VMFS volumes.

• To the SAN, the controller should be configured to expose the LUN containing the VMFS volume to any and all ESX Servers that might be involved in VMotion operations for a particular VM.

LUN performance considerations
When constructing a LUN for VMFS volumes, you should follow some basic storage rules that apply to VMFS and a few that require further consideration.

As with any LUN, more spindles mean more concurrent I/O operations. When planning a storage configuration that maximizes storage performance, you should deploy as many spindles as practical to create your VMFS LUN.

Remember that the VMFS volume will host multiple VMs, which has two effects on LUN performance:

• Since a single VMFS volume may have multiple ESX Servers and each ESX Server may have multiple VMs within the same partition, the I/O loads on a VMFS-formatted LUN can be significantly higher than the loads on a single-host, single-operating system LUN.

• Since many VM disk files are likely to be stored within a single VMFS volume, the importance for fault tolerance on this LUN is amplified. Always employ at least the level of fault tolerance used for physical machines. Fault tolerance becomes even more of a concern if a larger VMFS volume is created from multiple, smaller VMFS extents within ESX Server. Should any one extent fail, all data within that extent would be lost, whereas information on the remaining extents would remain available. Therefore, measures like RAID technology and stand-by drives should be considered standard as part of any VMFS LUN.

From a pure performance perspective, tuning an array for a particular application may not be as effective with VMs as with physical machines. Since VM storage is abstracted from the VM and, typically, encapsulated in a virtual machine disk file within a VMFS volume, it is probable that the same parameters that enhanced database performance in an NTFS partition will not deliver the same gains in a virtualized environment. As a result, at this time there are no recommended application-specific tuning parameters for a VMFS formatted LUN.

Tuning VM storage
While it may be possible to perform some storage performance tuning in an ESX Server environment, you should consider some potential trade-offs.

Storage performance tuning generally involves a low-level understanding of how an application accesses disks and how to configure placement, allocation units, and caches within an array to optimize the performance of this application. What is not always considered is that enhancing the performance of one application may, in practice, degrade the performance of many other applications.

Understanding this tuning trade-off is especially important in an ESX Server environment where dissimilar applications may access the same groups of spindles. If an array hosting the virtual disks for several file and print server VMs were tuned to optimize Microsoft SQL Server performance, for example, the performance of the file and print servers would probably be degraded. It is also possible that, since the array is tuned for SQL Server traffic – and is therefore less efficient when handling file and print traffic – SQL Server performance could be degraded while the array struggles with the suboptimal file and print workload.

What may ultimately determine the degree to which storage is tuned for VM application performance is the trade-off in flexibility. For the majority of deployments, the flexible, on-demand capability to create and move VMs is one of the most powerful features of an ESX Server environment. To some extent, creating LUNs that are tuned for specific applications restricts this flexibility.

Using raw device mapping
While the concept of raw device mapping is described in detail later in this white paper, it is relevant to mention here that raw device mapping can allow a VM to access a LUN in much the same way as a non-virtualized machine. In this scenario, where LUNs are created on a per-machine basis, the strategy of tuning a LUN for the specific application within a VM may be more appropriate.

Since raw device mappings do not encapsulate the VM disk as a file within the VMFS file system, LUN access more closely resembles the native application access for which the LUN is tuned.

Other design considerations
• When designing LUN schemes and storage layouts for a virtualized environment, you should consider the requirements of VMotion, which needs all VM disk files (or the raw device mapping file) to be visible on the SAN to both the source and destination servers.

• According to the “VirtualCenter Technical Best Practices” white paper, available at http://www.vmware.com/pdf/vc_technical_best.pdf, there should be no more than 16 ESX Server hosts connected to a single VMFS volume.

• In a larger deployment, it may not be practical to expose all VMs to all hosts; as a result, care should be taken to ensure that VM disk files or disk mappings are accessible to the appropriate ESX Server hosts.

Sizing VM disk files
If a 72 GB hard drive is created for a virtual machine, ESX Server will create a sparse file within the specified VMFS volume. As the space requirements for that VM increase, the VM disk file increases as well. While a 72 GB file is too large for many other file systems, VMFS can accommodate a file of up to 2 TB, allowing the VM to support disk files that meet the needs of almost all enterprise applications.

Presenting a raw LUN to a VM
Presenting the encapsulated VM disk file may not always be the optimal storage configuration. To a VM running on an ESX Server, its VM disk file appears as a hard drive, with geometry, many files, and a file system. However, to an external system not running within the ESX Server instance, the VM disk file appears as a single, monolithic file.

If, for example, a VM were to be clustered with a physical machine, the data and quorum drives could not be VM disk files since the physical cluster node would not be able to read a VMFS file system, let alone the encapsulated virtual disk file.

To accommodate scenarios where external physical machines must share data (at a block level) with a VM, ESX Server allows a raw LUN to be presented to the VM. The raw LUN is nothing more than a traditional array of drives, as opposed to the encapsulated, monolithic virtual machine disk file. With a raw LUN, the VM can be configured to use storage in nearly the same way that a physical device accesses storage (except that the VM still accesses this raw LUN through a driver that presents the blocks as a locally attached SCSI drive; the VMkernel still does the translation and encapsulation that results in the I/O reaching the SAN storage controller).

This configuration is most commonly deployed when a VM is clustered with a physical server (as shown in Figure 16). However, a VM cannot be clustered with a physical server running multipathing software such as HP StorageWorks Secure Path; in this scenario, some custom multipathing commands are not supported in ESX Server bus sharing.

Figure 16: A VM and physical machine clustered with a raw LUN

Using its capability to attach a raw device as a local storage device, a VM can hold or host data within native operating system file systems, such as NTFS or ext3.

Raw device mapping
Prior to the release of ESX Server 2.5, the use of raw devices meant that many of the flexible aspects of VMFS and VM disk files were not available. However, a feature called Raw Device Mapping (RDM) addresses this shortcoming by allowing a VM to attach to a raw device as though it were a VMFS-hosted file. With RDM, raw devices can deliver many of the same features previously reserved for VM disk files – particularly VMotion and .redo logs.

Note: .redo logs for VM disk files used in undoable disks and VM snapshots are available only when raw device mapping is in virtual compatibility mode.

Raw device mapping relies on a VMFS-hosted pointer – or proxy – file to redirect requests from the VMFS file system to the raw LUN.

For example, consider the following VMFS directory:

[root@System1 root]# ls -la /vmfs/demo/
total 25975808
drwxrwxrwt    1 root  root          512 Aug 19 20:12 .
drwxrwxrwt    1 root  root          512 Aug 22 19:10 ..
-rw-------    1 root  root   4194304512 Aug 25 15:13 W2K-SQL.vmdk
-rw-------    1 root  root  18210038272 Aug 19 20:12 W2K-SQLDATA.vmdk
-rw-------    1 root  root   4194304512 Aug 25 15:13 WNT-BDC.vmdk

In the above example, the VMFS volume demo contains both VM disk files and VM raw device mappings. W2K-SQLDATA.vmdk is the raw device mapping that points the physical host to the appropriate LUN.

Note that the raw device mapping appears to be exactly like a VM disk file, even appearing to have a file size equivalent to that of the LUN to which the mapping refers. Since the map file is accessible through VMFS, it is visible to all physical hosts that can see the VMFS volume. When a VM attempts to access its raw-device-mapped storage, the VMkernel resolves the SAN target through the data stored in the mapping file, which allows per-host resolution of the raw device proxied by the mapping file.

Consider a second example with two physical hosts; on each host is one node of a two-node cluster. Each node – NodeA and NodeB – references a shared data disk that is a raw device. The VM configuration file for NodeA references the shared disk as /vmfs/demo/data_disk.vmdk; NodeB shares this apparently identical reference to /vmfs/demo/data_disk.vmdk for the shared data drive. However, because of physical and configuration differences between the two systems, the physical SAN paths to the VMFS volume demo and the physical SAN paths to the LUN referenced by the mapping file data_disk.vmdk are different.

For the server hosting NodeA, the physical SAN address for the demo LUN is vmhba1:0:1:2; for the server hosting NodeB, the physical SAN address for the demo LUN is vmhba2:0:1:2. Similarly, the paths to the LUN referred to by the raw device mapping might be different. Without raw device mapping, only the physical, static SAN path is used to access the raw LUN; since the two physical hosts access the LUN over different physical SAN paths, the VM configuration files would have to be updated to resolve the differing paths.

By removing the limitations of static SAN path definitions, raw device maps enable VMotion operations with VMs that use raw devices. Additional functionality is enabled by raw device mappings; now, all raw device access for a mapped LUN is proxied through a VMFS volume. As a result, the raw device may have access to many of the features of the VMFS file system, depending on the mapping mode used.

There are two raw device mapping modes – virtual compatibility mode and physical compatibility mode.

• Virtual compatibility mode allows a mapped raw device to inherit nearly all of the features of a VM disk file – such as file locking, file permissions, and .redo logs.

• Physical compatibility mode allows nearly every SCSI command to be passed directly to the storage controller. This means that SAN-based replication tools, such as HP StorageWorks Business Copy or Continuous Access, should work within a VM that is presented storage through a raw device map in physical compatibility mode. This mode should allow SAN management applications to communicate directly with storage controllers for monitoring and configuration. Check with the storage vendor to determine if the appropriate storage management software has been tested and is supported for running in a VM.

Testing has shown no performance difference between VMs accessing storage as encapsulated disk files and those accessing storage as raw volumes; however, from an administrative perspective, the use of raw volumes requires more coordination between SAN and server administrators. VMFS does not require the strict SAN zoning needed to support raw devices with non-distributed file systems.

From a functional perspective, with the introduction of RDM, many of the differences between VMFS and raw devices have been resolved. As such, unless there is an application requirement or architectural justification for using raw devices, the use of VM disk files in a VMFS volume is preferable due to their flexibility and ease of management. For example, with raw storage devices an administrator must create a LUN on the SAN whenever a new VM is to be created; on the other hand, when creating a VM using a VM disk file within a VMFS volume, no SAN administration is required since the LUN already exists.

Planning partitions
Before installing ESX Server, VMware strongly recommends that you consider your partitioning needs. Repartitioning an ESX Server requires some Linux expertise; it is easier to plan an appropriate installation rather than having to repartition later. Table 2 shows the recommended partitioning for a typical scenario. See the Installation and Upgrade Guide for more detailed information.

Table 2: Default storage configuration and partitioning for a VMFS volume on internal drives

Each entry lists the partition, its file system format, its size, and its purpose:

• /boot (ext3, 100 MB, fixed): contains the service console kernel, drivers, and the Linux boot loader (LILO), as well as the LILO configuration files.
• / (ext3, 2560 MB, fixed): called the “root” partition, this contains all user and system files used by the service console, including the ESX Server configuration files.
• swap (544 MB, fixed): the swap partition used by the Linux kernel of the service console.
• /var/log (ext3, 2 GB): provides log file storage outside of the service console root partition.
• vmkcore (vmkcore, 100 MB): serves as a repository for the VMkernel core dump files in the event of a VMkernel core dump.
• VMFS (VMFS3, remaining space): the VMFS file system for the storage of virtual machine disk files.

Note: If your ESX Server host has no network storage and only one local disk, you must create two more required partitions on the local disk (for a total of five required partitions): a vmkcore partition, which is required to store core dumps for troubleshooting (VMware does not support ESX Server host configurations without a vmkcore partition), and a vmfs3 partition, which is required to store your virtual machines. These vmkcore and vmfs3 partitions are required on a local disk only if the ESX Server host has no network storage.

The /var partition can be particularly important as a log file repository. By having the /var mount point reference a partition that is separate from the root partition, the root partition is less likely to become full. If the root partition on the service console becomes completely full, the system may become unstable.

Implementing boot-from-SAN
The distributed nature of the VMFS file system can only be leveraged in a shared storage environment; a SAN (iSCSI or fibre channel) or NAS is currently the only form of shared storage certified for use with ESX Server. As a result, most ESX Server deployments are attached to a SAN.

ESX Server supports boot-from-SAN, wherein the boot partitions for the Linux-based service console are placed on the SAN (iSCSI or fibre channel); NAS does not support boot-from-SAN. In this boot-from-SAN environment, there is no need for local drives within the physical host.

Unlike the VMFS volumes used for storing VM disk files, the partitions for booting ESX Server, which use the standard Linux ext3 files system, should not be zoned for access by more than one system. In other words, in the zoning configuration within your SAN, VMFS volumes may be exposed to many hosts; however, boot partitions – /boot, / (root), swap and any other service console partitions you may have created – should only be exposed to a single host.


The configuration of ESX Server to boot from SAN should be performed at installation time. If you are installing from the product CD, you must select either the bootfromsan or bootfromsan-text option.

Noting changes to the boot drive and device specification
When booting from a local controller, which uses the cciss.o driver, the drives and partitions are referenced under /dev/cciss/cXdY (where X is the controller number and Y is the device number on the controller).

When booting from SAN, it is important to note the change to the boot drive and device specification. The boot devices are presented as SCSI devices to the service console and therefore are referenced under /dev/sda (or /dev/sdX, where X corresponds to the controller that provides access to the boot partitions).

Taking care during the installation
When installing ESX Server in a boot-from-SAN environment, exercise caution with the storage configuration presented during the installation process. If you already have VMFS volumes on the SAN, ensure that the ESX Server installer is not configured to create and format VMFS volumes; otherwise, the installer will format your volumes, destroying VM disk files on the SAN.

Defining the connection type
In addition to the zoning configuration on the SAN, the connection type for each physical host must be defined. This setting is different for each SAN, depending on model and manufacturer.

• For an HP StorageWorks Modular Smart Array (MSA), using either the Array Configuration Utility (ACU) or serial cable command line, set the operating system type to Linux in the Selective Storage Presentation options.

• For an HP StorageWorks EVAgl (active/passive) set the connection type to Custom and enter the following string for the connection parameters: 000000002200282E. For firmware 4001 (active/active firmware for “gl” series) and above, use type vmware.

• For an HP StorageWorks EVAxl (active/active) set the connection type to Custom and enter the following string for the connection parameters: 000000002200283E. For firmware 5031 and above, use type vmware.

• For an HP StorageWorks XP disk array, use host mode type 0C.

Fibre Channel multipathing and failover
SAN multipathing functionality is built into the VMkernel.

Note: ESX Server multipathing even allows fault-tolerant SAN access to non-VMFS volumes, if there are raw devices within the VMs.

ESX Server identifies storage entities through a hierarchical naming convention that references the following elements: controller, target, LUN, and partition. This convention provides a unique reference for each VMFS volume – for example, vmhba2:0:1:2.

This example corresponds to the partition accessed through HBA vmhba2, target 0, LUN 1, and partition 2.
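As a rough illustration of this naming convention, the following minimal Python sketch splits such a path into its four fields. The class and field names are invented for this example and are not part of any VMware tool or API.

# Minimal sketch: parse an ESX Server storage path such as "vmhba2:0:1:2"
# into its adapter, target, LUN, and partition fields. Names are illustrative.
from dataclasses import dataclass

@dataclass
class StoragePath:
    adapter: str     # e.g. "vmhba2" -- the HBA providing access
    target: int      # SCSI target on that adapter
    lun: int         # logical unit number
    partition: int   # partition on the LUN (0 = whole LUN)

def parse_storage_path(path: str) -> StoragePath:
    adapter, target, lun, partition = path.split(":")
    return StoragePath(adapter, int(target), int(lun), int(partition))

print(parse_storage_path("vmhba2:0:1:2"))
# StoragePath(adapter='vmhba2', target=0, lun=1, partition=2)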

When ESX Server scans the SAN, each HBA reports all LUNs visible on the storage network; each LUN reports an ID that uniquely identifies it to all nodes on the storage network. When the same unique LUN ID is reported through more than one path, the VMkernel automatically enables multiple, redundant paths to this LUN – a capability known as multipathing.
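Conceptually, this grouping step amounts to collecting every reported (LUN ID, path) pair and treating any LUN seen through more than one path as multipathed. The short Python sketch below is illustrative only; the LUN IDs and path names are made up.

# Hypothetical sketch: group discovered paths by unique LUN ID, so that a
# LUN visible through more than one path can be treated as multipathed.
from collections import defaultdict

# (LUN ID, path) pairs as they might be reported during a SAN rescan;
# the IDs and path names here are invented for illustration.
discovered = [
    ("600508B4-0001", "vmhba1:0:1"),
    ("600508B4-0001", "vmhba2:0:1"),   # same LUN seen via a second HBA
    ("600508B4-0002", "vmhba1:0:2"),
]

paths_by_lun = defaultdict(list)
for lun_id, path in discovered:
    paths_by_lun[lun_id].append(path)

for lun_id, paths in paths_by_lun.items():
    state = "multipathed" if len(paths) > 1 else "single path"
    print(f"{lun_id}: {state} via {paths}")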

ESX Server uses a single storage path for a particular LUN until the LUN becomes unavailable over this path. After noting the path failure, ESX Server switches to an operational path.


Fail-back
For fail-back after all paths are restored, two policies are available to govern the ESX Server response: fixed and Most Recently Used (MRU). These policies can be configured through the Storage Management Options in the web interface or from the command line.

• The fixed policy dictates that access to a particular LUN should always use the specified path, if available. Should the specified, preferred path become unavailable, the VMkernel uses an alternate path to access data and partitions on the LUN. ESX Server periodically attempts to initialize the failed SAN path; when the preferred path is restored, the VMkernel reverts to this path for access to the LUN.

• The MRU policy does not place a preference on SAN paths; instead, the VMkernel accesses LUNs over any available path. In the event of a failure, the VMkernel maintains LUN connectivity by switching to a healthy SAN path. The LUN will continue to be accessed over this path, regardless of the state of the previously-failed path; ESX Server does not attempt to initialize and restore any particular path.

Note: The concept of a preferred path applies only when the failover policy is fixed; with the MRU policy, the preferred path specification is ignored.
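The behavioral difference between the two policies can be captured in a few lines of illustrative logic. The sketch below is not ESX Server code; it simply models how path selection reacts to a failure and a subsequent recovery under each policy.

# Illustrative sketch of the two fail-back behaviours described above.
def select_path(policy, preferred, current, available):
    """Return the path to use given the set of currently available paths.
    policy is "fixed" or "mru"; preferred matters only for "fixed"."""
    if policy == "fixed" and preferred in available:
        return preferred                    # always revert to the preferred path
    if current in available:
        return current                      # MRU: stay on the current path
    return next(iter(available))            # fail over to any healthy path

available = {"vmhba1:0:1", "vmhba2:0:1"}
preferred = "vmhba1:0:1"

# Preferred path fails, then comes back.
after_failure  = select_path("fixed", preferred, preferred, available - {preferred})
after_recovery = select_path("fixed", preferred, after_failure, available)
print(after_failure, "->", after_recovery)   # vmhba2:0:1 -> vmhba1:0:1 (reverts)

after_failure  = select_path("mru", preferred, preferred, available - {preferred})
after_recovery = select_path("mru", preferred, after_failure, available)
print(after_failure, "->", after_recovery)   # vmhba2:0:1 -> vmhba2:0:1 (stays)

After the preferred path recovers, the fixed policy reverts to it, while the MRU policy keeps using the path adopted at failover time.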

Application of the path policy is dictated, to a large extent, by the particular storage array deployed.

• For the active-passive SAN controllers found in HP StorageWorks EVA3000, EVA5000 and MSA-series arrays, avoid the fixed policy; only use MRU.

• For the newer HP StorageWorks EVA4000, EVA6000, and EVA8000 arrays and all members of the HP StorageWorks XP disk array family, which are all true active-active storage controller platforms, either policy – fixed or MRU – can be used.

Since the physical storage mechanism is masked by the VMkernel, VMs are unaware of the underlying infrastructure hosting their data. As a result, multipathing, multipathing policy, and path failover are all irrelevant within a VM.

Resource Management
ESX Server 3.0 allows organizations to pool computing resources and then logically and dynamically allocate guaranteed resources as appropriate – whether to organizations, individuals, or job functions. For the following sections, it is helpful to think in terms of resource providers and resource consumers.

Clusters
VirtualCenter allows users to create clusters, which can be viewed as logical containers within which computing resources are grouped. Each cluster can be configured to support VMware DRS and VMware HA, which are discussed later in this section. Clusters are consumers of host resources and are providers to resource pools and VMs.


Figure 17: Host systems aggregated into a single resource pool

VMware High Availability (HA) Clusters
ESX Server 3.0 provides a method to help improve service levels and uptime while removing the complexity and expense of alternative high-availability solutions. In addition to being easy to configure, HA operates independently of hardware and OS. In the event of a host failure, HA allows VMs to restart automatically on an appropriate host within the HA cluster. The alternate host is chosen based on several factors, including resource availability and current workload. For detailed information on VMware HA, please refer to “Automating High Availability (HA) Services with VMware HA.”

VMware Distributed Resource Scheduling (DRS) Clusters
Clusters that have been enabled for DRS allow a global scheduler, managed by VirtualCenter, to automatically distribute VMs across the cluster. DRS provides different levels of service based on the configuration of the cluster. For instance, the cluster can be configured to automatically place VMs within the DRS cluster when they are powered on. Additionally, DRS can be configured to dynamically balance the workload across physical hosts within a cluster, based on real-time utilization of the cluster’s resources. For additional details and best practices regarding DRS, please refer to “Resource Management with VMware DRS.”

Resource Pools
Resource pools are used to hierarchically divide CPU and memory resources within a designated cluster. Each individual host and each DRS cluster has a root resource pool that aggregates the resources of that host or cluster. Child resource pools can be created from the root resource pool. Each child owns a portion of the parent’s resources and can, in turn, provide a hierarchy of child pools. Resource pools can contain both child resource pools and virtual machines. Within each pool, users can specify reservations, limits, and shares, which are then available to the child resource pools or VMs (a simple sketch follows this paragraph). For a detailed discussion of the benefits, usage, and best practices for resource pools, please refer to the VMware “Resource Management Guide.”
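As a conceptual illustration of this hierarchy, the brief Python sketch below models a root pool whose capacity is divided among child pools. The pool names and MHz figures are invented for the example and do not correspond to any default values.

# Minimal sketch of a resource-pool hierarchy: each child owns a portion of
# its parent's reservation. Numbers and names are illustrative only.
class ResourcePool:
    def __init__(self, name, reservation_mhz, parent=None):
        self.name = name
        self.reservation_mhz = reservation_mhz
        self.children = []
        if parent:
            parent.children.append(self)

    def unreserved_mhz(self):
        # Capacity of this pool not yet promised to child pools or VMs.
        return self.reservation_mhz - sum(c.reservation_mhz for c in self.children)

root = ResourcePool("cluster-root", 24000)            # aggregated cluster capacity
prod = ResourcePool("production", 16000, parent=root)
test = ResourcePool("test-dev", 4000, parent=root)

print(root.unreserved_mhz())   # 4000 MHz still unallocated at the root
print(prod.unreserved_mhz())   # 16000 MHz available to VMs in "production"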

Resource Allocation
ESX Server provides powerful, flexible hardware allocation policies to enforce Quality of Service (QoS) or performance requirements, allowing users to define limits and reservations for CPU and memory allocations within each VM. These dynamic resource management policies make it possible to reserve CPU resources for a particular VM – even while the VM is operational. For example, administrators could improve the potential performance of one VM by specifying a reservation of 100% of CPU resources; at the same time, other VMs on the same physical host could be constrained to a limit of 25%.


Allocations can be absolute or share-based. In addition, the allocations can be made to a resource pool or to an individual VM.

Absolute allocation
It is possible to set a limit and a reservation for each VM on a physical host. If, for example, a VM has been allocated a reservation of 25% of a CPU, the VMkernel gives that VM at least 25% of a CPU regardless of the demands of other VMs, unless the VM with the reservation is idle6. Likewise, a resource pool can be guaranteed a reservation of 25% of the CPU resources of a cluster; the VMs within that resource pool then further divide the compute resources available to the pool. Regardless of reservations, idle VMs are preempted in favor of VMs requesting resources.

Share-based allocation
In addition to the absolute allocation of resources for an individual, busy VM (with limit and reservation guarantees), share-based allocation provides a mechanism for the relative distribution of server resources between VMs. This concept applies to resource pools as well.

Each VM is assigned a certain number of per-resource, per-VM shares. For example, if two VMs have an equal number of CPU shares, VMkernel ensures that they receive an equal number of CPU cycles (assuming that neither reservations nor limits are violated for either VM and that neither VM is idle). If one VM has twice as many shares as another, the VM with the larger share receives twice as many CPU cycles (again, assuming that minimum or maximum guarantees are not violated for either VM and that neither VM is idle).

Consider a cluster that contains two resource pools, each with an equivalent number of CPU shares. The VMkernel guarantees that each pool within that cluster is provided an equal number of CPU cycles. If one resource pool has twice as many shares as the other, the resource pool with the larger share count will receive twice the number of CPU cycles allotted within that cluster.
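The proportional behavior described above can be illustrated with a small calculation. The sketch below assumes, for simplicity, that reservations and limits are not in play and that idle VMs receive nothing; the VM names and share counts are invented.

# Illustrative sketch: distribute available CPU cycles in proportion to shares,
# skipping idle VMs (reservations and limits are ignored here for brevity).
def distribute_cycles(available_mhz, shares, idle):
    active = {vm: s for vm, s in shares.items() if not idle.get(vm, False)}
    total = sum(active.values())
    return {vm: available_mhz * s / total for vm, s in active.items()}

shares = {"vm-a": 1000, "vm-b": 1000, "vm-c": 2000}
print(distribute_cycles(3000, shares, idle={"vm-b": True}))
# {'vm-a': 1000.0, 'vm-c': 2000.0} -- vm-c gets twice vm-a's allocation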

Differences between allocation methods
The differences between using shares and using minimum and maximum guarantees to affect relative VM performance are subtle but important. When using guaranteed allocations, ESX Server is more likely to encounter admission control issues, because it enforces the policy that there must be enough free physical resources to meet the guaranteed minimums for all virtual resources of all VMs. If ESX Server cannot meet the guaranteed minimum allocation for any resource, the VM requesting the allocation is not powered on; in the case of a VMotion migration, the operation is denied.

In short, a physical host must have enough free resources to meet a VM’s guaranteed reservations in order to power on the VM. There is a caveat to this rule when using VMware HA, which is covered later in this document.

Adjusting relative allocations through resource shares does not result in a similar limitation since share-based allocation guarantees only the relative distribution of resources. In the hierarchy of enforcement, meeting guaranteed maximums and minimums takes precedence over maintaining the relative distribution defined by share-based mechanisms.

Warning on setting a guaranteed minimum
Be aware that setting a guaranteed reservation might limit the maximum VM density of a physical host. For example, if each VM were guaranteed a reservation of 25% of a processor core in a dual-processor server (single-core processors, without Intel Hyper-Threading Technology), that server could power on only seven VMs. The explanation for this limitation is as follows: the service console has a minimum allocation of 5% of a processor core, leaving 195% of the server’s 200% total core capacity; dividing 195% by 25% yields a limit of seven VMs that can meet the guaranteed minimum core allocation.

Note that the total capacity of a server system is the sum of the percentages delivered by each processor core. For example, for an eight-processor server with Hyper-Threading Technology, the capacity is 8 x 100% x 2 = 1600%.

By default, VMs are allocated a CPU reservation of 0%.
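The arithmetic behind the seven-VM limit can be restated in a few lines of Python; the figures come directly from the example above.

# Worked version of the density arithmetic above (percentages of one core).
cores = 2                      # dual-processor, single-core, no Hyper-Threading
total_capacity = cores * 100   # 200%
service_console = 5            # minimum service-console allocation
per_vm_reservation = 25        # guaranteed minimum per VM

max_vms = (total_capacity - service_console) // per_vm_reservation
print(max_vms)                 # 7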

6 An idle VM is only attempting to execute instructions that constitute the idle loop process.


Allocating shares for other resources
Shares can be defined and allocated per resource – for CPU, memory, or disk – for each VM. Note that the application of the relative share allocation policy for other resources differs slightly from that for CPU:

• Memory – The share allocation policy for memory defines the relative extent to which memory is reclaimed from a VM if memory overcommitment occurs. A VM with a larger allocation of shares retains a proportionally larger allocation of physical memory when the VMkernel needs to reclaim memory from VMs (a simplified sketch follows this list).

• Disk – For disk accesses, the proportional share algorithm provides relative prioritization of each VM’s disk access.

There are no shares associated with network traffic; instead of using shares, network resources are constrained either by traffic shaping or limiting outbound bandwidth.
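The memory-share behavior described in the list above can be approximated with a simple model. The sketch below assumes, purely for illustration, that reclaimed memory is divided in inverse proportion to each VM’s share count; the actual VMkernel algorithm is more involved.

# Simplified sketch (an assumption, not the actual ESX Server algorithm):
# reclaim memory from each VM in inverse proportion to its share count, so a
# VM with more shares gives up proportionally less physical memory.
def reclaim(total_mb, mem_shares):
    weights = {vm: 1.0 / s for vm, s in mem_shares.items()}
    norm = sum(weights.values())
    return {vm: total_mb * w / norm for vm, w in weights.items()}

print(reclaim(600, {"vm-a": 1000, "vm-b": 2000}))
# {'vm-a': 400.0, 'vm-b': 200.0} -- the higher-share VM surrenders less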

Best practices
VMware publishes best practices guides for many components of the virtualized environment. These include the following:

• VirtualCenter & Templates: http://www.vmware.com/pdf/vc_2_templates_usage_best_practices_wp.pdf
• Virtual SMP: http://www.vmware.com/pdf/vsmp_best_practices.pdf
• ESX Server: http://www.vmware.com/pdf/esx3_best_practices.pdf

For additional best practices and technical documents, please refer to http://www.vmware.com/vmtn/resources/cat/91,100.

VMware VirtualCenter
VirtualCenter is a centralized management application that supports the hierarchical and logical organization and viewing of physical ESX Server resources and associated VMs.

VirtualCenter 2.0 allows users to view the following key items:

• All running VMs
• The current state and utilization of each VM
• All ESX Server physical hosts
• The current state and utilization of each ESX Server physical host
• Historical performance and utilization data for each VM
• Historical performance and utilization data for each ESX Server physical host
• VM configuration
• Cluster (DRS and HA) and resource pool configuration

VirtualCenter also offers remote console access to each VM.

However, VirtualCenter is more than just a management application; it is the cornerstone of the Virtual Infrastructure from VMware. This centralized management service allows any application to authenticate and issue commands to the VirtualCenter server through the VMware VirtualCenter SDK, providing a single point of administration for users and applications.

The VirtualCenter user interface also provides access to higher-level functionality, such as VMotion, DRS and HA.


Architecture
VirtualCenter is based on a client–server–agent architecture, with each managed host requiring a management agent license7.

When an ESX Server physical host connects to VirtualCenter, VirtualCenter automatically installs an agent, which communicates status as well as command and control functions between the VirtualCenter server and ESX Server.

VirtualCenter server8 is a Windows service that may run either on a physical server or inside a VM. Each VirtualCenter server should be able to manage between 50 and 100 physical hosts and between 1000 and 2000 VMs, depending on the configuration of the server running the VirtualCenter server service.

The Virtual Infrastructure Client application acts as the user interface for VirtualCenter. It does not require a license, and many clients may access the same VirtualCenter server simultaneously. The Virtual Infrastructure Client also acts as the main interface to ESX Server 3; this is convenient for environments that have a small number of ESX Server hosts, in which it is feasible to manage these hosts directly.

Note: VirtualCenter requires an ODBC-compliant database for its datastore. This database holds historical performance data and VM configuration information.

Templates and clones
VirtualCenter builds on the portability and encapsulation of VM disk files by enabling two additional features: templates and clones.

Template
A template is analogous to a “golden master” server image and represents a ready-to-provision server installation that helps eliminate the redundant tasks associated with provisioning a new server. For instance, a template can be built by creating a VM and installing an operating system, all of the required patches and service packs, and standard security and management applications, along with any common configuration parameters. The VM’s network identity is then reset using a tool such as SysPrep, and the VM is powered off. VirtualCenter can then be used to create a template from this VM. New VMs can be deployed and customized using a wizard-driven interface or an XML-formatted file containing the desired customizations.

Templates are not required to be stored within the VirtualCenter server filesystem; they can also be stored on NAS shared storage or on a VMFS3 datastore.

7 The management agent is licensed based on the number of physical processors present in the platform to be managed. 8 This server is also a separately licensed component of the Virtual Infrastructure, though, unlike other separately licensed products, the license for VirtualCenter Management Server is not included within the Virtual Infrastructure Node bundle.


Cloning
VirtualCenter can also clone a VM to achieve the rapid deployment and replication of a server configuration.

Differences between templates and clones
The differences between templates and clones are subtle but important, as shown below.

• A template is static; once created, it never changes. A clone is dynamic.
• To update a template, you must first create a VM from the current template, install or apply the desired updates, and then create a new template to replace the original. A clone, by contrast, can simply be changed.
• A template is not a VM and cannot be powered on. A clone is a VM.
• A template has a rigid definition, ensuring consistency in the deployment of VMs. A clone is easy to update and patch.

Both of these deployment options support the thorough customization of a new VM before it is powered on. Consider your particular environment before selecting the approach that best meets your needs.

Considerations and requirements for VirtualCenter server
• VirtualCenter server installs as an application and runs as a Windows service. As such, it requires Windows 2000 Server SP4, Windows Server 2003 (Web, Standard, or Enterprise) except 64-bit editions, or Windows XP at any SP level. The VirtualCenter installer requires Internet Explorer 5.5 or higher in order to run.

• VirtualCenter server can run either in a VM or on a physical server. In either case, the server instance hosting the VirtualCenter server must have at least 2 GB of RAM, a 2 GHz processor, and at least one network interface. This minimum configuration should support a total of 50 ESX Server hosts, 1000 VMs, and 20 simultaneous VirtualCenter client connections. Scaling up the machine configuration to 3 GB of RAM, dual 2 GHz processors, and a Gigabit network interface should provide support for a total of 100 ESX Server hosts, 2000 VMs, and 50 simultaneous VirtualCenter client connections.

• VirtualCenter server also requires a database, which may run on the same server as the VirtualCenter server service or on a remote system. Consider the following:
– VirtualCenter 2.0 supports Microsoft SQL Server 2000 and SQL Server 7, Oracle® 8i, 9i, and 10g, as well as Microsoft MSDE (not recommended for production environments). Note that Microsoft Access is no longer a supported database.


When configuring the database connection for VirtualCenter, configure the ODBC client to use a System DSN with SQL Authentication.

– VirtualCenter 2.0 does not support Windows Authentication to the database servers.

Compatibility
Refer to Table 3 for compatibility between current and previous versions of VirtualCenter and ESX Server.

Table 3: Compatibility matrix showing capabilities of VirtualCenter and ESX Server hosts

• Manage ESX Server 2 hosts and their VMs with VirtualCenter 2? Yes, but no DRS, HA, or other new features
• Manage ESX Server 3 hosts with VirtualCenter 1? No
• VMotion from ESX Server 2 to ESX Server 3? No
• After upgrading a VM on ESX Server 3, boot it on ESX Server 2? No
• Store ESX Server 2 and ESX Server 3 VM files in the same VMFS? No

Virtual Infrastructure Client application requirements
• The Virtual Infrastructure Client application requires .NET Framework 1.1 in order to operate. The application is designed to run on Windows XP Pro (at any SP level), Windows 2000 Pro SP4, Windows 2000 Server SP4, and all versions of Windows Server 2003 except 64-bit editions.
• The application must run on a 266 MHz or faster Intel or AMD processor.
• The application requires a network interface and at least 256 MB of RAM (512 MB of RAM recommended).
• 150 MB of free disk space is required for a basic installation; you must have 55 MB free on the destination drive for installation of the program and 100 MB free on the drive containing your %temp% directory.

• A Gigabit Ethernet port is recommended, although 10/100 is supported.

VMotion
With the release of VMotion, VMware introduced a unique new technology that allows a VM to move between physical platforms while the VM is running. VMotion can address a wide range of IT challenges – from accommodating scheduled downtime to building an Adaptive Enterprise.

Architecture
VMotion relies on several of the underlying components of ESX Server virtualization, most notably the VMFS file system.

As described earlier, VMFS is a distributed file system that locks VM disk files at the file level – a unique locking mechanism that allows multiple ESX Server instances to share a particular VMFS volume. This mechanism ensures that only one physical host at a time can access a given disk file and power on the associated VM.

To support the rapid movement of VMs between physical machines, it is imperative that the large amount of data associated with each VM does not move – moving the many gigabytes of disk storage associated with a typical VM would take a significant length of time. As a result, instead of moving the disk storage, VMotion and VirtualCenter simply change the owner of the VM disk file, allowing the VM to migrate to a different physical host (as shown in Figure 17).
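The ownership hand-off can be pictured with a toy lock model: the disk file stays on shared storage, and only the recorded owner changes. The Python sketch below is purely conceptual and does not reflect the VMFS on-disk lock format; the paths and host names are invented.

# Toy model of the ownership hand-off described above: the VM's disk file
# stays in place on shared storage, and only the recorded lock owner changes.
class DiskFileLock:
    def __init__(self, vmdk_path):
        self.vmdk_path = vmdk_path
        self.owner = None          # host currently allowed to run the VM

    def acquire(self, host):
        if self.owner not in (None, host):
            raise RuntimeError(f"{self.vmdk_path} is locked by {self.owner}")
        self.owner = host

    def release(self, host):
        if self.owner == host:
            self.owner = None

lock = DiskFileLock("/vmfs/volumes/san-vol1/vm01/vm01.vmdk")
lock.acquire("esx-host-a")         # source host runs the VM
lock.release("esx-host-a")         # during VMotion the lock changes hands...
lock.acquire("esx-host-b")         # ...and the destination resumes execution
print(lock.owner)                  # esx-host-b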


Figure 17: Migrating a VM from one physical host to another without moving the VM disk file

The new physical host also requires access to the memory contents and CPU state information of the VM to be migrated. However, unlike the disk-bound data, there is no shared medium for memory and CPU resources; the memory contents and CPU state must be migrated by copying the data over a network connection.

When initiating a migration, VMotion takes a snapshot of source server memory, then sends a copy of these memory pages – unencrypted – to the destination server. During this copying process, execution continues on the source server; as a result, memory contents on this server are likely to change. ESX Server tracks these changes and, when copying is complete, sends the destination server a map indicating which memory pages have changed.

At this point, the CPU state is sent to the destination server, the file lock is changed, and the destination server opens the VM file and assumes execution of the VM. Accesses to one of the changed memory pages are served from the source server until all memory changes have been communicated to the destination server via background processes.
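The sequence described in the two preceding paragraphs can be sketched conceptually as a bulk copy, a dirty-page map, and on-demand fetches from the source. The Python fragment below is a toy model with an eight-page "guest"; it is not VMware code.

# Conceptual sketch of the memory hand-off described above (not VMware code):
# one bulk copy of guest memory, a map of pages dirtied during that copy, and
# fetches from the source until the remaining pages have arrived.
source_memory = {page: f"data-{page}" for page in range(8)}   # toy 8-page guest

# 1. Bulk copy while the VM keeps running on the source.
dest_memory = dict(source_memory)

# 2. Pages 2 and 5 changed on the source during the copy; send the dirty map.
dirty = {2, 5}
source_memory[2], source_memory[5] = "data-2'", "data-5'"

# 3. CPU state moves and the destination resumes; dirty pages are fetched from
#    the source on access (or by a background transfer) until none remain.
def read_page(page):
    if page in dirty:
        dest_memory[page] = source_memory[page]   # pull latest copy from source
        dirty.discard(page)
    return dest_memory[page]

print(read_page(5))    # data-5' -- fetched from the source on first access
print(read_page(3))    # data-3  -- already present from the bulk copy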

These network-intensive operations justify the deployment of a Gigabit network interface to minimize latency between source and destination servers and maximize the rate at which memory pages can be moved between these servers. Moreover, since these memory pages are not encrypted, security needs may justify the deployment of a dedicated network interface.

Considerations and requirements
To ensure stable and consistent execution after migrating a VM to a different physical host, VirtualCenter thoroughly reviews the capabilities of both source and destination servers prior to initiating a migration. These servers must comply with the following safeguards (a simple pre-check sketch follows the list):


• Both must have access to the VMFS SAN-based partition that holds either the VM disk file or the VM disk raw device map file. Since the VM disk file or raw device map file is not moved during the VMotion operation, both servers must be able to access this partition. A VMotion operation cannot involve moving the disk file from one LUN (either local or SAN-based) to another9. Ensure that the SAN is configured to expose the LUN to the HBAs of both the source and destination servers.

• Both must have identical virtual switches defined and available for all virtual network adapters within the VM. Assuming that the VM to be migrated is using a network interface to perform meaningful activity, VirtualCenter must ensure that this connection is still available after the migration. Before initiating a VMotion operation, VirtualCenter examines and compares virtual switch definitions and configurations on both source and destination servers to ensure that they are identical. Note that VirtualCenter does not attempt to validate the defined connectivity; it assumes that the IT staff followed good practice during configuration. For example, if the VM connects to a virtual switch named devnet on the source server, the destination server must also have a virtual switch named devnet. If the appropriate virtual switch exists on the destination server, VirtualCenter assumes that the networks are identical and provide the same functional connectivity. As a result, if the virtual switch on the source server connects to a development network and the virtual switch on the destination server connects to a production network, the VMotion operation still continues; however, the application within the migrated VM will likely be unable to access the appropriate network resources. To avoid this, take care to be consistent and use meaningful names when configuring and creating virtual switches.

• Both must have compatible processors. Since both the CPU and execution states are moved from one server to the other, it is critical that both processors implement the same instruction sets in exactly the same manner. If not, unsupported or altered instruction execution could have unknown and potentially catastrophic effects on the migrated VM. While this safeguard seems straightforward, the enhancements regularly implemented by Intel and AMD mean that compatibility is not always clear – particularly when these enhancements occur within a single processor family. VMware Knowledge Base article #1377 provides an overview of the challenges faced when migrating a VM from a physical host that supports the SSE3 instruction set to one that does not, and vice versa. For example, VMotion reports an incompatibility between HP ProLiant BL20p G2 and G3 server blades. If an incompatibility is reported, the Knowledge Base article describes an unsupported method for overriding this safeguard.

• The destination server must have enough free memory to support the minimum guarantee for the VM to be moved. In practice, this statement could apply to all server resources: if a physical resource allocation guaranteed to the VM on the source machine cannot be met on the destination machine, the VMotion operation fails. In this case, the VM continues to run on the source server.
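As promised above, here is a minimal pre-check sketch that tests the four safeguards for a hypothetical VM and host pair. The dictionary fields are invented for this illustration and are not VirtualCenter API properties.

# Illustrative pre-migration check covering the four safeguards listed above.
# Host and VM descriptions are plain dictionaries invented for this sketch.
def can_vmotion(vm, src, dst):
    checks = {
        "shared datastore": vm["datastore"] in dst["datastores"],
        "matching virtual switches": set(vm["vswitches"]) <= set(dst["vswitches"]),
        "compatible CPU": src["cpu_features"] == dst["cpu_features"],
        "enough free memory": dst["free_memory_mb"] >= vm["memory_reservation_mb"],
    }
    return all(checks.values()), [name for name, ok in checks.items() if not ok]

vm  = {"datastore": "san-vol1", "vswitches": ["devnet"], "memory_reservation_mb": 1024}
src = {"cpu_features": {"sse2", "sse3"}}
dst = {"datastores": {"san-vol1"}, "vswitches": ["devnet", "prodnet"],
       "cpu_features": {"sse2"}, "free_memory_mb": 4096}

ok, failures = can_vmotion(vm, src, dst)
print(ok, failures)    # False ['compatible CPU'] -- SSE3 mismatch blocks the move

In this example the migration would be refused because of the CPU feature mismatch, mirroring the kind of incompatibility noted above for the BL20p G2 and G3 server blades.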

Clustered VMs unsupported
Currently, clustered VMs are not supported for VMotion operations.

VMotion requires that VMs access VMFS volumes using the public bus access mode; however, because of their shared storage requirements, ESX Server requires clustered VMs to use the shared mode. These two access modes are incompatible.

In order to migrate a clustered VM node from one physical host to another, you must take down one node and perform a “cold migration.” After the migration is complete, bring the cluster node back up and rejoin the cluster. With the cluster complete, repeat the process with the other node, if desired.

9 To perform a migration that requires the disk file to be moved, the VM must be powered off (or suspended) and “cold migrated.”


For more information
For access to VMware product guides, see http://www.vmware.com/support/pubs

For detailed information on planning, deploying, or managing a virtual infrastructure on ProLiant, see http://h71019.www7.hp.com/ActiveAnswers/cache/71086-0-0-0-121.html

Copyright © 2006 VMware, Inc. All rights reserved. Protected by one or more of U.S. Patent Nos. 6,397,242, 6,496,847, 6,704,925, 6,711,672, 6,725,289, 6,735,601, 6,785,886, 6,789,156, 6,795,966, 6,880,022, 6,961,941, 6,961,806 and 6,944,699; patents pending. VMware, the VMware “boxes” logo and design, Virtual SMP and VMotion are registered trademarks or trademarks of VMware, Inc. in the United States and/or other jurisdictions. Microsoft, Windows and Windows NT are registered trademarks of Microsoft Corporation. Linux is a registered trademark of Linus Torvalds. All other marks and names mentioned herein may be trademarks of their respective companies.