VMware Technical Journal - Summer 2013
Transcript of VMware Technical Journal - Summer 2013
VOL. 2, NO. 1, JUNE 2013
VMWARE TECHNICAL JOURNAL
Editors: Curt Kolovson, Steve Muir, Rita Tavilla
TABLE OF CONTENTS
1 Introduction
Steve Muir, Director, VMware Academic Program
2 Memory Overcommitment in the ESX Server
Ishan Banerjee, Fei Guo, Kiran Tati, Rajesh Venkatasubramanian
13 Redefining ESXi IO Multipathing in the Flash Era
Fei Meng, Li Zhou, Sandeep Uttamchandani, Xiaosong Ma
19 Methodology for Performance Analysis of VMware vSphere under Tier-1 Applications
Jeffrey Buell, Daniel Hecht, Jin Heo, Kalyan Saladi, H. Reza Taheri
29 vATM: VMware vSphere Adaptive Task Management
Chirag Bhatt, Aalap Desai, Rajit Kambo, Zhichao Li, Erez Zadok
35 An Anomaly Event Correlation Engine: Identifying Root Causes, Bottlenecks,
and Black Swans in IT Environments
Mazda A. Marvasti, Arnak V. Poghosyan, Ashot N. Harutyunyan, Naira M. Grigoryan
46 Simplifying Virtualization Management with Graph Databases
Vijayaraghavan Soundararajan, Lawrence Spracklen
54 Autonomous Resource Sharing for Multi-Threaded Workloads in Virtualized Servers
Can Hankendi, Ayse K. Coskun
VMware Academic Program (VMAP)
The VMware Academic Program (VMAP) supports a number of academic research projects across
a range of technical areas. We initiate an annual Request for Proposals (RFP), and also support a
small number of additional projects that address particular areas of interest to VMware.
The 2013 Spring RFP, focused on Storage in support of Software Defined Datacenters (SDDC),
is currently conducting final reviews of a shortlist of proposals and will announce the recipients of
funding at the VMware Research Symposium in July. Please contact Rita Tavilla ([email protected])
if you wish to receive notification of future RFP solicitations and other opportunities for collaboration
with VMware.
The 2012 RFP, Security for Virtualized and Cloud Platforms, awarded funding to three projects:
Timing Side-Channels in Modern Cloud Environments
Prof. Michael Reiter, University of North Carolina at Chapel Hill
Random Number Generation in Virtualized Environments
Prof. Yevgeniy Dodis, New York University
VULCAN: Automatically Generating Tools for Virtual Machine Introspection using Legacy Binary Code
Prof. Zhiqiang Lin, University of Texas, Dallas
Visit http://labs.vmware.com/academic/rfp to find out more about the RFPs.
Introduction
Happy first birthday to the VMware Technical Journal! We are very pleased to have seen such a
positive reception to the first couple of issues of the journal, and hope you will find this one equally
interesting and informative. We will publish the journal twice per year going forward, with a Spring edition that highlights ongoing R&D initiatives at VMware and the Fall edition providing a showcase
for our interns and collaborators.
VMware's market leadership in infrastructure for the software-defined datacenter (SDDC) is built
upon the strength of our core virtualization technology combined with innovation in automation
and management. At the heart of the vSphere product is our hypervisor, and two papers highlight
ongoing enhancements in memory management and IO multi-pathing, the latter being based
upon work done by Fei Meng, one of our fantastic PhD student interns.
A fundamental factor in the success of vSphere is the high performance of the Tier-1 workloads
most important to our customers. Hence we undertake in-depth performance analysis and comparison to native deployments, some key results of which are presented here. We also develop the necessary
features to automatically manage those applications, such as the adaptive task management scheme
described in another paper.
However, the SDDC is much more than just a large number of servers running virtualized
workloads; it requires sophisticated analytics and automation tools if it is to be managed
efficiently at scale. vCenter Operations, VMware's automated operations management suite,
has proven to be extremely popular with customers, using correlation between anomalous events
to identify performance issues and root causes of failure. Recent developments in the use of graph
algorithms to identify relationships between entities have received a great deal of attention for their application to social networks, but we believe they can also provide insight into the
fundamental structure of the datacenter.
The final paper in the journal addresses another key topic in the datacenter: the management
of energy consumption. An ongoing collaboration with Boston University, led by Professor Ayse
Coskun, has demonstrated the importance of automatic application characterization and its use
in guiding scheduling decisions to increase performance and reduce energy consumption.
The journal is brought to you by the VMware Academic Program team. We lead VMware's efforts
to create collaborative research programs and support VMware R&D in connecting with the research
community. We are always interested to hear your feedback on our programs; please contact us electronically or look out for us at various research conferences throughout the year.
Steve Muir, Director, VMware Academic Program
Abstract
Virtualization of computer hardware continues to reduce the cost of operation in datacenters. It
enables users to consolidate virtual hardware on less physical hardware, thereby efficiently using
hardware resources. The consolidation ratio is a measure of the virtual hardware that has been
placed on physical hardware. A higher consolidation ratio typically indicates greater efficiency.
VMware's ESX Server is a hypervisor that enables competitive memory and CPU consolidation
ratios. ESX allows users to power on virtual machines (VMs) with a total configured memory that
exceeds the memory available on the physical machine. This is called memory overcommitment.
Memory overcommitment raises the consolidation ratio, increases operational efficiency, and
lowers the total cost of operating virtual machines. Memory overcommitment in ESX is reliable;
it does not cause VMs to be suspended or terminated under any conditions.
This article describes memory overcommitment in ESX, analyzes the cost of various memory
reclamation techniques, and empirically demonstrates that memory overcommitment induces an
acceptable performance penalty in workloads. Finally, best practices for implementing memory
overcommitment are provided.
General Terms: memory management, memory overcommitment, memory reclamation
Keywords: ESX Server, memory, resource management
Introduction
VMware's ESX Server offers competitive operational efficiency of virtual machines (VMs) in the
datacenter. It enables users to consolidate VMs on a physical machine while reducing the cost of
operation. The consolidation ratio is a measure of the number of VMs placed on a physical machine;
a higher consolidation ratio indicates a lower cost of operation. ESX Server's overcommitment
technology allows users to operate VMs with a high consolidation ratio. Overcommitment is the
ability to allocate more virtual resources than the available physical resources. ESX Server offers
users the ability to overcommit memory and CPU resources on a physical machine.
ESX is said to be CPU-overcommitted when the total configured virtual CPU resources of all
powered-on VMs exceed the physical CPU resources on ESX. When ESX is CPU-overcommitted,
it distributes physical CPU resources amongst powered-on VMs in a fair and efficient manner.
Similarly, ESX is said to be memory-overcommitted when the total configured guest memory size
of all powered-on VMs exceeds the physical memory on ESX. When ESX is memory-overcommitted,
it distributes physical memory fairly and efficiently amongst powered-on VMs. Both CPU and memory
scheduling give resources to those VMs which need them most, while reclaiming resources from
those VMs which are not actively using them.
Memory overcommitment in ESX is very similar to that in traditional operating systems (OSes)
such as Linux and Windows. In a traditional OS, a user may execute applications whose total mapped
memory exceeds the amount of memory available to the OS. This is memory overcommitment. If the
applications consume more memory than the available physical memory, the OS reclaims memory
from some of the applications and swaps it to a swap space. It then distributes the available free
memory between applications.
Similar to traditional OSes, ESX allows VMs to power on with a total configured memory size that
may exceed the memory available to ESX. For the purpose of discussion in this article, the memory
installed in an ESX Server is called ESX memory. If VMs consume all the ESX memory, then ESX
reclaims memory from VMs and redistributes the ESX memory in an efficient and fair manner to all
VMs such that the memory resource is best utilized. A simple example of memory overcommitment
is two VMs powered on in an ESX Server whose installed memory is smaller than their combined
configured memory size.
With a continuing fall in the cost of physical memory, it can be argued that ESX does not need
to support memory overcommitment. However, in addition to the traditional use case of improving
the consolidation ratio, memory overcommitment can also be used during disaster recovery, high
availability (HA), and distributed power management (DPM) to provide good performance. This
technology gives ESX a leading edge over contemporary hypervisors.
Memory overcommitment does not necessarily lead to performance loss in a guest OS or its
applications. Experimental results presented in this paper with two real-life workloads show gradual
performance degradation when ESX is progressively overcommitted.
This article describes memory overcommitment in ESX. It provides guidance on best practices
and discusses potential pitfalls.
Memory Overcommitment in the ESX Server
Ishan Banerjee, VMware, Inc., ishan@vmware.edu
Fei Guo, VMware, Inc., guo@vmware.com
Kiran Tati, VMware, Inc., ktati@vmware.com
Rajesh Venkatasubramanian, VMware, Inc., vrajeshi@vmware.com
VMs are regular processes in KVM, and therefore standard memory management techniques like
swapping apply. For Linux guests a balloon driver is installed, and it is controlled by the host via
the balloon monitor command. Some hosts also support Kernel Samepage Merging (KSM) [1],
which works similarly to ESX page sharing. Although KVM presents different memory reclamation
techniques, it requires certain hosts and guests to support memory overcommitment. In addition,
the management and policies for the interactions among the memory reclamation techniques are
missing in KVM.
XenServer uses a mechanism called Dynamic Memory Control (DMC) to implement memory
reclamation. It works by proportionally adjusting memory among running VMs based on predefined
minimum and maximum memory. VMs generally run with maximum memory, and memory can be
reclaimed via a balloon driver when memory contention occurs in the host. However, Xen does not
provide a way to overcommit the host physical memory; hence its consolidation ratio is largely
limited. Unlike other hypervisors, Xen provides a transcendent memory management mechanism
to manage all host idle memory and guest idle memory. The idle memory is collected into a pool
and distributed based on the demand of running VMs. This approach requires the guest OS to be
paravirtualized and only works well for guests with non-concurrent memory pressure.
When compared to existing hypervisors, ESX allows for reliable memory overcommitment to
achieve a high consolidation ratio with no requirement to modify running guests. It implements
various memory reclamation techniques to enable overcommitment and manages them in an
efficient manner to mitigate possible performance penalties to VMs.
Work on optimal use of host memory in a hypervisor has also been demonstrated by the research
community. An optimization of the KSM technique has been attempted in KVM with Singleton [6].
Sub-page-level page sharing using patching has been demonstrated in Xen with Difference Engine [2].
Paravirtualized guests and sharing of pages read from storage devices have been shown using Xen
in Satori [4]. Memory ballooning has been demonstrated for the z/VM hypervisor using CMM [5].
Ginkgo [3] implements a hypervisor-independent overcommitment framework allowing Java
applications to optimize their memory footprint. All these works target specific aspects of the
memory overcommitment challenge in virtualized environments. They are valuable references
for future optimizations in ESX memory management.
Background: Memory overcommitment in ESX is reliable (see Table 1). This implies that VMs will
not be prematurely terminated or suspended owing to memory overcommitment. Memory
overcommitment in ESX is theoretically limited by the overhead memory of ESX. ESX guarantees
reliability of operation under all levels of overcommitment.
The remainder of this article is organized as follows: Section 2 provides background information
on memory overcommitment and related work, Section 3 describes memory overcommitment,
Section 4 provides a quantitative understanding of the performance characteristics of memory
overcommitment, and Section 5 concludes the article.
2 Background and Related Work
Memory overcommitment enables a higher consolidation ratio in a hypervisor. Using memory
overcommitment, users can consolidate VMs on a physical machine such that physical resources
are utilized in an optimal manner while delivering good performance. For example, in a virtual
desktop infrastructure (VDI) deployment, a user may operate many Windows VMs, each containing
a word processing application. It is possible to overcommit a hypervisor with such VDI VMs. Since
the VMs contain similar OSes and applications, many of their memory pages may contain identical
content. The hypervisor will find and consolidate memory pages with identical content from these
VMs, thus saving memory. This enables better utilization of memory and a higher consolidation ratio.
Related work: Contemporary hypervisors such as Hyper-V, KVM, and Xen implement different
memory overcommitment, reclamation, and optimization strategies. Table 1 summarizes the
memory reclamation technologies implemented in existing hypervisors.
METHOD                 ESX    HYPER-V    KVM    XEN
share                   X                 X
balloon                 X        X        X      X
compress                X
hypervisor swap         X        X        X
memory hot-add                   X
transcendent memory                              X

Table 1. Comparing memory overcommitment technologies in existing hypervisors.
ESX implements reliable overcommitment.
Hyper-V uses dynamic memory for supporting memory overcommitment. With dynamic memory,
each VM is configured with a small initial RAM when powered on. When the guest applications
require more memory, a certain amount of memory is hot-added to the VM and the guest OS.
When a host lacks free memory, a balloon driver reclaims memory from other VMs and makes
memory available for hot-adding to the demanding VM. In rare and restricted scenarios, Hyper-V
will swap VM memory to a host swap space. Dynamic memory works for the latest versions of the
Windows OS, and it uses only ballooning and hypervisor-level swapping to reclaim memory. ESX,
on the other hand, works for all guest OSes. It also uses content-based page sharing and memory
compression. This approach improves VM performance as compared to the use of only ballooning
and hypervisor-level swapping.
1 www.microsoft.com/hyper-v-server/
2 www.linux-kvm.org/
3 www.xen.org/
This method greatly reduces the memory footprint of VMs with common memory content. For
example, if an ESX host has many VMs executing word processing applications, then ESX may
transparently apply page sharing to those VMs and collapse the text and data content of these
applications, thereby reducing the footprint of all those VMs. The collapsed memory is freed by ESX
and made available for powering on more VMs. This raises the consolidation ratio of ESX and
enables higher overcommitment. In addition, if the shared pages are not subsequently written into
by the VMs, then they remain shared for a prolonged time, maintaining the reduced footprint of the VM.
Memory ballooning is an active method for reclaiming idle memory from VMs. It is used when ESX
is in the soft state. If a VM has consumed memory pages but is not subsequently using them in an
active manner, ESX attempts to reclaim them from the VM using ballooning. In this method, an
OS-specific balloon driver inside the VM allocates memory from the OS kernel and hands that
memory to ESX, which is then free to re-allocate it to another VM that might be actively requiring
memory. The balloon driver effectively utilizes the memory management policy of the guest OS to
reclaim idle memory pages: the guest OS typically reclaims idle memory inside the guest and, if
required, swaps it to its own swap space.
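To make the ballooning mechanism concrete, the following Python sketch models the inflate/deflate
cycle under stated assumptions; the object and method names (guest_os, hypervisor, alloc_page,
release_backing, and so on) are hypothetical placeholders, not the actual driver or ESX interfaces.

    # Minimal sketch of a balloon driver's inflate/deflate cycle (hypothetical APIs).
    class BalloonDriver:
        def __init__(self, guest_os, hypervisor):
            self.guest_os = guest_os          # guest kernel page allocator
            self.hypervisor = hypervisor      # communication channel to ESX
            self.pinned_pages = []            # guest pages currently held by the balloon

        def inflate(self, target_pages):
            # Allocate and hold guest pages; satisfying these allocations may cause
            # the guest OS to swap other, idle pages to its own swap space.
            while len(self.pinned_pages) < target_pages:
                page = self.guest_os.alloc_page()
                self.pinned_pages.append(page)
                # Tell ESX this guest physical page is unused, so ESX can reclaim
                # the backing machine page and give it to another VM.
                self.hypervisor.release_backing(page)

        def deflate(self, count):
            # Return pages to the guest OS; ESX backs them again on the next access.
            for _ in range(min(count, len(self.pinned_pages))):
                page = self.pinned_pages.pop()
                self.hypervisor.restore_backing(page)
                self.guest_os.free_page(page)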
When ESX enters the hard state, it actively and aggressively reclaims memory from VMs by swapping
out memory to a swap space. During this step, if ESX determines that a memory page is sharable or
compressible, then that page is shared or compressed instead. The reclamation done by ESX using
swapping is different from that done by the guest OS inside the VM: the guest OS may swap out
guest memory pages to its own swap space (for example, a swap partition or page file), whereas
ESX uses hypervisor-level swapping to reclaim memory from a VM into ESX's own swap space.
The low state is similar to the hard state. In addition to compressing and swapping memory pages,
ESX may block certain VMs from allocating memory in this state. It aggressively reclaims memory
from VMs until ESX moves back into the hard state.
Page sharing is a passive memory reclamation technique that operates continuously on a powered-on
VM. The remaining techniques are active ones that operate when free memory in ESX is low. Also,
page sharing, ballooning, and compression are opportunistic techniques; they do not guarantee
memory reclamation from VMs. For example, a VM may not have sharable content, the balloon
driver may not be installed, or its memory pages may not yield good compression. Reclamation
by swapping is a guaranteed method for reclaiming memory from VMs.
In summary, ESX allows for reliable memory overcommitment to achieve a higher consolidation
ratio. It implements various memory reclamation techniques to enable overcommitment while
improving efficiency and lowering the cost of operating VMs. The next section describes memory
overcommitment.
When ESX is memory overcommitted, it allocates memory to those powered-on VMs that need it
most and will perform better with more memory. At the same time, ESX reclaims memory from
those VMs which are not actively using it. Memory reclamation is therefore an integral component
of memory overcommitment. ESX uses different memory reclamation techniques to reclaim memory
from VMs: transparent page sharing, memory ballooning, memory compression, and memory swapping.
ESX has an associated memory state which is determined by the amount of free ESX memory at
a given time. The states are high, soft, hard, and low. Table 2 shows the ESX state thresholds. Each
threshold is internally split into two sub-thresholds to avoid oscillation of the ESX memory state
near the threshold. At each memory state, ESX utilizes a combination of memory reclamation
techniques to reclaim memory, as shown in Table 3.
STATE   SHARE   BALLOON   COMPRESS   SWAP   BLOCK
high      X
soft      X        X
hard      X                  X         X
low       X                  X         X      X

Table 3. Actions performed by ESX in different memory states.
Transparent page sharing (page sharing) is a passive and opportunistic memory reclamation
technique. It operates on a powered-on VM at all memory states throughout its lifetime, looking
for opportunities to collapse different memory pages with identical content into a single memory page.
Table 2. Free-memory state transition thresholds in ESX. (a) User visible. (b) Internal thresholds
to avoid oscillation.
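As a rough illustration of Tables 2 and 3, the sketch below maps a free-memory measurement to a
state and to the reclamation techniques active in that state; the dictionary mirrors Table 3, while the
threshold structure is only an assumption about how the values might be represented, not the ESX
implementation.

    # Reclamation actions per free-memory state, following Table 3.
    ACTIONS_BY_STATE = {
        "high": ["share"],
        "soft": ["share", "balloon"],
        "hard": ["share", "compress", "swap"],
        "low":  ["share", "compress", "swap", "block"],
    }

    def select_reclamation_actions(free_memory, thresholds):
        """thresholds maps a state name to the minimum free memory for that state."""
        if free_memory >= thresholds["high"]:
            state = "high"
        elif free_memory >= thresholds["soft"]:
            state = "soft"
        elif free_memory >= thresholds["hard"]:
            state = "hard"
        else:
            state = "low"
        return state, ACTIONS_BY_STATE[state]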
ESX also reserves a small amount of memory called minfree. This amount is a buffer against rapid
allocations by memory consumers. ESX is in the high state as long as there is at least this amount of
memory free. If free memory dips below this value, then ESX is no longer in the high state and it
begins to actively reclaim memory.
Figure 1(c) shows the schematic diagram representing an undercommitted ESX host when overhead
memory is taken into account. In this diagram, the overhead memory consumed by ESX for itself and
for each powered-on VM is shown. Figure 1(d) shows the schematic diagram representing an
overcommitted ESX host when overhead memory is taken into account.
Figure 1(e) shows the theoretical limit of memory overcommitment in ESX. In this case, all of ESX
memory is consumed by the ESX overhead and the per-VM overhead. VMs will be able to power on
and boot; however, execution of the VMs will be extremely slow.
For simplicity of discussion and calculation, the definition of overcommitment from Figure 1(b) is
followed; overhead memory is ignored when defining memory overcommitment. From this figure,

    overcommit = ( \sum_{v \in V} memsize_v ) / ESXmemory        (1)

where
    overcommit : memory overcommitment factor
    V          : set of powered-on VMs in ESX
    memsize_v  : configured memory size of VM v
    ESXmemory  : total installed ESX memory

The representation from this figure is used in the remainder of this article.
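For illustration, Equation (1) can be evaluated directly from the configured VM sizes; the sizes used
below are arbitrary example values, not figures from the article.

    # Overcommitment factor per Equation (1): total configured VM memory
    # divided by installed ESX memory (overhead memory ignored).
    def overcommit_factor(vm_memsizes_gb, esx_memory_gb):
        return sum(vm_memsizes_gb) / esx_memory_gb

    # Hypothetical example: three VMs totalling 6GB on a host with 4GB of
    # ESX memory give an overcommitment factor of 1.5.
    print(overcommit_factor([1, 2, 3], 4))   # prints 1.5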
To understand memory overcommitment and its effect on VMs and applications, mapped, consumed,
and working set memory are described next.
3.2 Mapped Memory
The definition of memory overcommitment does not consider the memory consumption or memory
access characteristics of the powered-on VMs. Immediately after a VM is powered on, it does not
have any memory pages allocated to it. Subsequently, as the guest OS boots, the VM accesses pages
in its memory address space by reading or writing into them. ESX allocates physical memory pages
to back the virtual address space of the VM during this access. Gradually, as the guest OS completes
booting and applications are launched inside the VM, more pages in the virtual address space are
backed by physical memory pages. During the lifetime of the VM, the VM may or may not access
all pages in its virtual address space.
Windows, for example, writes the zero pattern to the complete memory address space of the VM.
This causes ESX to allocate memory pages for the complete address space by the time Windows
has completed booting. Linux, on the other hand, does not access the complete address space of
the VM when it boots; it accesses only the memory pages required to load the OS.
3 Memory Overcommitment
Memory overcommitment enables a higher consolidation ratio of VMs on an ESX Server. It reduces
the cost of operation while utilizing compute resources efficiently. This section describes memory
overcommitment and its performance characteristics.
3.1 Definitions
ESX is said to be memory overcommitted when VMs are powered on such that their total configured
memory size is greater than ESX memory. Figure 1 shows an example of memory overcommitment
on ESX.
Figure 1(a) shows the schematic diagram of a memory-undercommitted ESX Server. In this example,
two VMs are powered on, and their total configured memory size is less than ESX memory. Hence
ESX is considered to be memory undercommitted.
Figure 1(b) shows the schematic diagram of a memory-overcommitted ESX Server. In this example,
three VMs are powered on, and their total configured memory size is more than ESX memory. Hence
ESX is considered to be memory overcommitted.
The scenarios described above omit the memory overhead consumed by ESX. ESX consumes a
fixed amount of memory for its own text and data structures. In addition, it consumes overhead
memory for each powered-on VM.
Figure 1. Memory overcommitment shown with and without overhead memory. (a) Undercommitted
ESX. (b) Overcommitted ESX; this model is typically followed when describing overcommitment.
(c) Undercommitted ESX shown with overhead memory. (d) Overcommitted ESX shown with
overhead memory. (e) Limit of memory overcommitment.
4 A memory page with all 0x00 content is a zero page.
All memory pages of a VM which are ever accessed by the VM are considered mapped by ESX.
A mapped memory page is backed by a physical memory page by ESX during the very first access
by the VM. A mapped page may subsequently be reclaimed by ESX; it is considered mapped
throughout the lifetime of the VM.
When ESX is overcommitted, the total pages mapped by all VMs may or may not exceed ESX memory.
Hence it is possible for ESX to be memory overcommitted while, owing to the nature of the guest
OSes and their applications, the total mapped memory remains within ESX memory. In such a
scenario, ESX does not actively reclaim memory from VMs, and VM performance is not affected
by memory reclamation.
Mapped memory is illustrated in Figure 2. This figure uses the representation from Figure 1(b),
where ESX is memory overcommitted. In Figure 2(a), the memory mapped by each VM is shown.
ESX is overcommitted, but the total mapped memory of all VMs is less than ESX memory, so ESX
will not actively reclaim memory in this case. In Figure 2(b), the memory mapped by each VM is
again shown. In this case, ESX is overcommitted and the total mapped memory in ESX exceeds ESX
memory. ESX may or may not actively reclaim memory from the VMs; this depends on the current
consumed memory of each VM.
3.3 Consumed Memory
A VM is considered to be consuming a physical memory page when a physical memory page is used
to back a memory page in its address space. A memory page of the VM may exist in different states
with ESX. They are as follows.
Regular: A regular memory page in the address space of a VM is one which is backed by one
physical page in ESX.
Shared: A memory page marked as shared in a VM may be shared with many other memory pages.
If a memory page is being shared with n other memory pages, then each memory page is considered
to be consuming 1/n of a whole physical page.
Compressed: VM pages which are compressed typically consume only a fraction of a physical
memory page.
Ballooned: Ballooned memory pages are not backed by any physical page.
Swapped: Swapped memory pages are not backed by any physical page.
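The per-state accounting above can be summarized in a small sketch; the function and parameter
names are illustrative, and the charge for compressed pages is left as a parameter because the
exact fraction is not restated here.

    # Estimate the physical pages consumed by a VM from its page states.
    # Ballooned and swapped pages are charged nothing; a page shared among n
    # VM pages is charged 1/n; compressed pages are charged a fraction of a page.
    def consumed_pages(regular, sharing_degrees, compressed, compressed_fraction):
        consumed = float(regular)
        consumed += sum(1.0 / n for n in sharing_degrees)   # one entry per shared page
        consumed += compressed * compressed_fraction
        return consumed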
Consumed memory is illustrated in Figure 3. This figure uses the representation from Figure 1(b),
where ESX is memory overcommitted. In Figure 3(a), the mapped and consumed memory of the
powered-on VMs and of ESX is shown. ESX is overcommitted; however, the total mapped memory
is less than ESX memory. In addition, the total consumed memory is less than the total mapped
memory. The consumption is lower than the mapped memory since some memory pages may have
been shared, ballooned, compressed, or swapped. In this state, ESX will not actively reclaim memory
from VMs. However, the VMs may possess ballooned, compressed, or swapped memory, because
ESX may earlier have reclaimed memory from these VMs owing to the memory state at that time.
The total consumption of all VMs taken together cannot exceed ESX memory. This is shown in
Figure 3(b). In this figure, ESX is overcommitted and the total mapped memory is greater than ESX
memory. However, whenever VMs attempt to consume more memory than ESX memory, ESX will
reclaim memory from the VMs and redistribute the ESX memory amongst all VMs. This prevents
ESX from running out of memory. In this state, ESX is likely to actively reclaim memory from VMs
to prevent memory exhaustion.
Figure 2. Mapped memory for an overcommitted ESX host. The mapped region is shaded. The
mapped region in ESX is the sum of the mapped regions in the VMs. (a) Total mapped memory is
less than ESX memory. (b) Total mapped memory is more than ESX memory.
Figure 3. Consumed memory for an overcommitted ESX host. The consumed and mapped regions
are shaded. The consumed (and mapped) region in ESX is the sum of the consumed (mapped)
regions in the VMs. (a) Total consumed and mapped memory is less than ESX memory. (b) Total
mapped memory is more than ESX memory; total consumed memory is equal to ESX memory.
Page-fault cost: When a shared page is read by a VM, it is accessed in a read-only manner. Hence
that shared page does not need to be page-faulted, and there is no page-fault cost. A write access
to a shared page, however, incurs a cost.
CPU: When a shared page is written to by a VM, ESX must allocate a new page and replicate the
shared content before allowing the write access from the VM. This allocation incurs a CPU cost.
Typically this cost is very low and does not significantly affect VM applications and benchmarks.
Wait: The copy-on-write operation performed when a VM accesses a shared page with write access
is fairly fast. VM applications accessing the page do not incur a noticeable temporal cost.
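The write path just described is a copy-on-write step; the sketch below is illustrative only, with
hypothetical helper objects, and is not the ESX code path.

    # Copy-on-write handling for a write fault on a shared page (illustrative).
    def handle_write_fault(vm, guest_page, shared_page, allocator):
        # Allocate a private page and replicate the shared content (the CPU cost
        # noted above), then remap the guest page writable to the private copy.
        private_page = allocator.alloc_page()
        private_page.copy_from(shared_page)
        vm.map(guest_page, private_page, writable=True)
        shared_page.refcount -= 1          # one fewer sharer of the original page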
b) Ballooning
Reclamation cost: Memory is ballooned from a VM using a balloon driver residing inside the guest
OS. When the balloon driver expands, it may induce the guest OS to reclaim memory from guest
applications.
CPU: Ballooning incurs a CPU cost on a per-VM basis, since it induces memory allocation and
reclamation inside the VM.
Storage: The guest OS may swap out memory pages to the guest swap space. This incurs storage
space and storage bandwidth cost.
Page-fault cost:
CPU: A ballooned page acquired by the balloon driver may subsequently be released by it. The
guest OS or an application may then allocate and access it. This incurs a page fault in the guest OS
as well as in ESX. The page fault incurs a low CPU cost, since a memory page simply needs to be
allocated.
Storage: During reclamation by ballooning, application pages may have been swapped out by the
guest OS. When the application attempts to access such a page, the guest OS needs to swap it in.
This incurs a storage bandwidth cost.
Wait: A temporal wait cost may be incurred by an application if its pages were swapped out by the
guest OS. Swapping in a memory page by the guest OS incurs a smaller overall wait cost to the
application than a hypervisor-level swap-in, because during a page fault in the guest OS by one
thread, the guest OS may schedule another thread. However, if ESX is swapping in a page, it may
deschedule the entire VM, because ESX cannot reschedule guest OS threads.
c) Compression
Reclamation cost: Memory is compressed in a VM by compressing a full guest memory page such
that it consumes only a fraction of a physical memory page. There is effectively no memory cost,
since every successful compression releases memory.
CPU: A CPU cost is incurred for every attempted compression. The CPU cost is typically low and is
charged to the VM whose memory is being compressed. The CPU cost is, however, higher than that
of page sharing. It may lead to noticeably reduced VM performance and may affect benchmarks.
3.6 Cost of Memory Overcommitment
Memory overcommitment incurs certain costs in terms of compute resources as well as VM
performance. This section provides a qualitative understanding of the different sources of cost
and their magnitude.
When ESX is memory overcommitted and powered-on VMs attempt to consume more memory
than ESX memory, ESX will begin to actively reclaim memory from VMs. Hence memory reclamation
is an integral component of memory overcommitment.
Table 4 shows the memory reclamation techniques and their associated costs. Each technique incurs
a cost of reclamation when a page is reclaimed, and a cost of page fault when a VM accesses a
reclaimed page. The number of * qualitatively indicates the magnitude of the cost.
             SHARE   BALLOON   COMPRESS   SWAP (SSD)   SWAP (DISK)
Reclaim
  Memory       *
  CPU          *        *         **
  Storage               *                     *             *
  Wait
Page-fault
  Memory
  CPU          *        *         **
  Storage               *                     *             *
  Wait         *       ***        *          **           ****

1 indicates that this cost may be incurred under certain conditions.
Table 4. Cost of reclaiming memory from VMs and cost of page fault when a VM accesses a
reclaimed page. The number of * qualitatively indicates the magnitude of the cost. They are not
mathematically proportional.
A Memory cost indicates that the technique consumes memory for meta-data overhead. A CPU cost
indicates that the technique consumes non-trivial CPU resources. A Storage cost indicates that the
technique consumes storage space or bandwidth. A Wait cost indicates that the VM incurs a
hypervisor-level page-fault cost as a result of the technique; this may cause the guest application
to stall and hence lead to a drop in its performance. The reclamation and page-fault costs are
described as follows.
a) Page sharing
Reclamation cost: Page sharing is a continuous process which opportunistically shares VM memory pages.
Memory: This technique incurs an ESX-wide memory cost for storing page-sharing meta-data,
which is a part of ESX overhead memory.
CPU: Page sharing incurs a nominal CPU cost on a per-VM basis. This cost is typically very small
and does not impact the performance of VM applications or benchmarks.
4 Performance
This section provides a quantitative demonstration of memory overcommitment in ESX. Simple
applications are shown to work under memory overcommitment. This section does not provide a
comparative analysis using software benchmarks.
4.1 Microbenchmark
This set of experiments was conducted on a development build of ESX. The ESX Server has a
quad-core AMD CPU in a multi-node NUMA configuration. The VMs used in the experiments run
the RHEL OS. The ESX Server consumes memory for the ESX memory overhead, the per-VM memory
overhead, and the reserved minfree; the remainder is available for allocating to VMs. For these
experiments, the available ESX memory is 8GB, and ESX will actively reclaim memory from VMs
when VMs consume more than this amount. RHEL consumes a modest amount of memory when
booted and idle; this contributes to the mapped and consumed memory of each VM.
These experiments demonstrate the performance of VMs as the total working set varies relative
to the available ESX memory. Figure 5 shows the results from three experiments. For the purpose
of demonstration and simplicity, three reclamation techniques (page sharing, ballooning, and
compression) are disabled. Hypervisor-level memory swapping is the only active memory reclamation
technique in these experiments; this reclamation starts when VMs consume more than the available
ESX memory.
In these experiments, the presence of hypervisor-level page faults is an indicator of performance
loss. The figures show cumulative swap-in (page-fault) values. When this value rises, the VM
experiences a temporal wait cost; when it does not rise, there is no temporal wait cost and hence
no performance loss.
Experiment a: Figure 5(a) shows an experiment in which the total mapped and total working set
memory always remain less than the available ESX memory.
Page-fault cost: When a compressed memory page is accessed by the VM, ESX must allocate a new
page and decompress the page before allowing access by the VM.
CPU: Decompression incurs a CPU cost.
Wait: The VM also waits until the page is decompressed. This wait is typically not very high, since
the decompression takes place in memory.
d) Swap
Reclamation cost: ESX swaps out memory pages from a VM to avoid memory exhaustion. The
swap-out process takes place asynchronously to the execution of the VM and its applications.
Hence the VM and its applications do not incur a temporal wait cost.
Storage: Swapping out memory pages from a VM incurs storage space and storage bandwidth cost.
Page-fault cost: When a VM page-faults a swapped page, ESX must read the page from the swap
space synchronously before allowing the VM to access the page.
Storage: This incurs a storage bandwidth cost.
Wait: The VM also incurs a temporal wait cost while the swapped page is synchronously read from
the storage device. The temporal wait cost is highest for a swap space located on spinning disk;
a swap space located on SSD incurs a lower temporal wait cost.
The costs of reclamation and page fault vary significantly between different reclamation techniques.
Reclamation by ballooning may incur a storage cost only in certain cases, whereas reclamation by
swapping always incurs a storage cost. Similarly, a page fault owing to reclamation incurs a temporal
wait cost in all cases, and a page fault on a swapped page incurs the highest temporal wait cost.
Table 4 shows the cost of reclamation as well as the cost of page fault incurred by a VM on a
reclaimed memory page.
The reclamation itself does not impact VM and application performance significantly, since the
temporal wait cost of reclamation is zero for all techniques. Some techniques have a low CPU cost;
this cost may be charged to the VM, leading to slightly reduced performance.
However, during a page fault, a temporal wait cost exists for all techniques. This affects VM and
application performance. Write access to a shared page and access to compressed or ballooned
pages incur the least performance cost. Access to a swapped page, especially a page swapped to
spinning disk, incurs the highest performance cost.
Section 3 described memory overcommitment in ESX. It defined various terms (mapped, consumed,
and working set memory) and showed their relation to memory overcommitment. It also showed
that memory overcommitment does not necessarily impact VM and application performance, and
provided a qualitative analysis of how memory overcommitment may impact VM and application
performance depending on the VM's memory content and access characteristics.
The next section provides a quantitative description of overcommitment.
6 CPU cost of reclamation may affect performance only if ESX is operating at 100% CPU load.
Figure 5. Effect of working set on memory reclamation and page fault. Available ESX memory = 8GB.
(a) mapped, working set < available ESX memory. (b) mapped > available ESX memory, working
set < available ESX memory. (c) mapped, working set > available ESX memory.
Two VMs of equal configured memory size are powered on, overcommitting ESX memory. Each VM
runs a workload that allocates a block of memory, writes a random pattern into it, and then reads
all of that memory continuously in a round-robin manner for a number of iterations. The memory
mapped by each VM is the workload allocation plus the memory used by RHEL; the total mapped
memory and the working set both exceed the available ESX memory.
The figure shows the mapped memory of each VM and the cumulative swap-in. The X-axis shows
time in seconds; the Y-axis (left) shows mapped memory in MB and the Y-axis (right) shows
cumulative swap-in. It can be seen from this figure that a steady page-fault rate is maintained once
the workloads have mapped the target memory. This indicates that as the workloads are accessing
memory, they are experiencing page faults: ESX is continuously reclaiming memory from each VM
as the VM page-faults its working set.
This experiment demonstrates that workloads will experience steady page faults, and hence
performance loss, when the working set exceeds the available ESX memory. Note that if the working
set read-accesses shared pages and page sharing is enabled (the default ESX behavior), then a
page fault is avoided.
In this section, experiments were designed to highlight the basic working of overcommitment. In
these experiments, page sharing, ballooning, and compression were disabled for simplicity. These
reclamation techniques reclaim memory effectively before reclamation by swapping can take place.
Since page faults on shared, ballooned, and compressed memory have a lower temporal wait cost,
workload performance is better when they are enabled. This will be demonstrated in the next section.
4.2 Real Workloads
In this section, the experiments are conducted using vSphere. The ESX Server has a quad-core AMD
CPU. The workloads used to evaluate memory overcommitment performance are DVD Store and a
VDI workload. Experiments were conducted with the default memory management configurations
for page sharing, ballooning, and compression.
Experiment d: The DVD Store workload simulates online database operations for ordering DVDs.
In this experiment, five DVD Store VMs are used, each configured with multiple vCPUs and several
GB of memory and containing the Windows Server OS and SQL Server. The performance metric is
the total operations per minute over all VMs.
The total memory size of all powered-on VMs is smaller than the installed RAM of the ESX Server,
so the host was effectively undercommitted. Hence memory overcommitment was simulated with
the use of a memory hog VM. The memory hog VM had full memory reservation, and its configured
memory size was progressively increased in each run. This effectively reduced the ESX memory
available to the DVD Store VMs.
Two VMs of equal configured memory size are powered on; relative to ESX memory, this results in
a modest overcommitment factor. Each VM executes an identical hand-crafted workload. The
workload is a memory stress program: it allocates a block of memory, writes a random pattern into
all of the allocated memory, and continuously reads this memory in a round-robin manner. This
workload is executed in both VMs for a fixed duration.
The figure shows the mapped memory of each VM, along with the cumulative swap-in of each VM.
The X-axis shows time in seconds; the Y-axis (left) shows mapped memory in MB and the Y-axis
(right) shows cumulative swap-in. The memory mapped by each VM is the workload allocation plus
the memory used by RHEL, and the total is less than the available ESX memory. It can be seen from
this figure that there is no swap-in (page-fault) activity at any time; hence there is no performance
loss. The memory mapped by the VMs rises as the workload allocates memory. Subsequently, as
the workload accesses the memory in a round-robin manner, all of the memory is backed by
physical memory pages by ESX.
This experiment demonstrates that although memory is overcommitted, VMs mapping and actively
using less than the available ESX memory will not be subjected to active memory reclamation and
hence will not experience any performance loss.
Experiment b: Figure 5(b) shows an experiment in which the total mapped memory exceeds the
available ESX memory while the total working set memory remains less than the available ESX memory.
Two VMs of equal configured memory size are powered on, resulting in memory overcommitment.
The workload in each VM allocates a block of memory and writes a random pattern into it. It then
reads a fixed sub-block of this memory in a round-robin manner for a number of iterations; thereafter
the same fixed sub-block is read in a round-robin manner for a fixed duration. This workload is
executed in both VMs. The memory mapped by each VM is the workload allocation plus the memory
used by RHEL, and the total exceeds the available ESX memory. The working set thereafter is the
two fixed sub-blocks, whose total is less than the available ESX memory.
The figure shows the mapped memory of each VM and the cumulative swap-in. The X-axis shows
time in seconds; the Y-axis (left) shows mapped memory in MB and the Y-axis (right) shows
cumulative swap-in. It can be seen from this figure that as the workload maps memory, the mapped
memory rises. Thereafter there is an initial rise in swap-in activity as the working set is page-faulted.
After the working set has been page-faulted, the cumulative swap-in remains steady.
This experiment demonstrates that although memory is overcommitted and the VMs have mapped
more memory than the available ESX memory, the VMs perform better when their working set is
smaller than the available ESX memory.
Experiment c: Figure 5(c) shows an experiment in which the total mapped memory as well as the
total working set exceed the available ESX memory.
7 http://en.community.dell.com/techcenter/extras/w/wiki/dvd-store.aspx
In this experiment, fifteen VDI VMs are powered on in the ESX Server. Each VM is configured with
one vCPU and the same amount of memory, and the VMs contain the Windows OS. Similar to
experiment d, a memory hog VM with full memory reservation is used to simulate overcommitment
by effectively reducing the ESX memory on the ESX Server.
Figure 7 shows a high percentile of the operation latency in the VDI VMs for a series of decreasing
ESX memory values. At these data points, the memory hog VM has a correspondingly larger
configured memory size; owing to the reduced simulated ESX memory, the simulated overcommitment
factor (with respect to the total VM memory size) rises across the series.
The Y-axis (right) shows the total amount of ballooned and swapped memory (MB) from all VMs.
The Y-axis (left) shows the average latency of operations in the VDI workload. At the largest ESX
memory value, ballooning and swapping are not observed, because (1) page sharing helps reclaim
memory and (2) the applications' total mapped memory and working set fit into the available ESX
memory. When ESX memory is reduced, some memory is ballooned across the VMs, which causes
a slight increase in the average latency; this indicates that the VMs' active working set is being
reclaimed at this point. When ESX memory decreases further, the amount of ballooned memory
increases dramatically and the swapped memory also rises. As a result, the average latency increases
sharply. This is owing to the significantly higher page-fault wait cost of ballooning and hypervisor-level
swapping.
This experiment was repeated after attaching an SSD to the ESX Server; ESX is designed to utilize
an SSD as a device for swapping out VM memory pages. The average latency in the presence of the
SSD is shown in the same figure. It can be seen that the performance degradation is significantly
lower. This experiment shows that certain workloads maintain performance in the presence of an
SSD even when the ESX Server is severely overcommitted.
The experiments in this section show that memory overcommitment does not necessarily indicate
performance loss. As long as the application's active working set size is smaller than the available
ESX memory, the performance degradation may be tolerable.
Figure 6 shows the performance of the DVD Store VMs for a series of decreasing ESX memory
values. At these data points, the memory hog VM has a correspondingly larger configured memory
size; owing to the reduced simulated ESX memory, the simulated overcommitment factor, computed
using Equation (1) with respect to the total VM memory size, rises across the series.
The Y-axis (right) shows the total amount of ballooned and swapped memory in MB from all VMs.
It can be seen that the amount of ballooned memory increases gradually, without significant
hypervisor-level swapping, as ESX memory is reduced across most of the range. The Y-axis (left)
shows the operations per minute (OPM) executed by the workload. The rising ballooned memory
contributes to reduced application performance as ESX memory is reduced; this is owing to the
wait cost to the application as some of its memory is swapped out by the guest OS. Between the
intermediate X-axis points, the OPM decreases moderately owing to ballooning.
However, when ESX memory is reduced to the lowest value, and overcommitment is highest,
ballooning by itself is insufficient to reclaim enough memory, and non-trivial amounts of
hypervisor-level swapping take place. Owing to the higher wait cost of hypervisor-level page faults,
there is a larger performance loss and the OPM drops sharply.
This experiment was repeated after attaching an SSD to the ESX Server. The OPM in the presence
of the SSD is shown in the same figure. It can be seen that the performance degradation is smaller
than without the SSD.
Experiment e: The VDI workload is a set of interactive office user-level applications such as the
Microsoft Office suite and Internet Explorer. The workload is a custom-made set of scripts which
simulate user actions on the VDI applications. The scripts trigger mouse clicks and keyboard inputs
to the applications, and the time to complete each operation is recorded. In one complete run of
the workload, a fixed number of user actions are performed. The measured performance metric is
a high percentile of the latency values of all operations; a smaller latency indicates better performance.
Figure 6. DVD Store: Operations per minute (OPM) of five DVD Store VMs as the simulated ESX
memory is progressively reduced.
Figure 7. VDI: Average high-percentile operation latency of fifteen VDI VMs as the simulated ESX
memory is progressively reduced.
References
1 A. Arcangeli, I. Eidus, and C. Wright. Increasing memory density by using KSM. In Proceedings
of the Linux Symposium, pages 313-328, 2009.
2 D. Gupta, S. Lee, M. Vrable, S. Savage, A. C. Snoeren, G. Varghese, G. M. Voelker, and A. Vahdat.
Difference engine: harnessing memory redundancy in virtual machines. Commun. ACM,
53(10):85-93, Oct. 2010.
3 M. Hines, A. Gordon, M. Silva, D. Da Silva, K. D. Ryu, and M. Ben-Yehuda. Applications Know
Best: Performance-Driven Memory Overcommit with Ginkgo. In Cloud Computing Technology
and Science (CloudCom), 2011 IEEE Third International Conference on, pages 130-137, 2011.
4 G. Milos, D. Murray, S. Hand, and M. Fetterman. Satori: Enlightened page sharing. In Proceedings
of the 2009 Conference on USENIX Annual Technical Conference. USENIX Association, 2009.
5 M. Schwidefsky, H. Franke, R. Mansell, H. Raj, D. Osisek, and J. Choi. Collaborative Memory
Management in Hosted Linux Environments. In Proceedings of the Linux Symposium, pages
313-328, 2006.
6 P. Sharma and P. Kulkarni. Singleton: system-wide page deduplication in virtual environments.
In Proceedings of the 21st International Symposium on High-Performance Parallel and
Distributed Computing, HPDC '12, pages 15-26, New York, NY, USA, 2012. ACM.
7 C. A. Waldspurger. Memory resource management in VMware ESX Server. SIGOPS Oper.
Syst. Rev., 36(SI):181-194, Dec. 2002.
In many situations where memory is slightly or moderately overcommitted, page sharing and
ballooning are able to reclaim memory gracefully without a significant performance penalty.
However, under high memory overcommitment, hypervisor-level swapping may occur, leading to
significant performance degradation.
5 Conclusion
Reliable memory overcommitment is a unique capability of ESX, not present in any contemporary
hypervisor. Using memory overcommitment, ESX can power on VMs such that the total configured
memory of all powered-on VMs exceeds ESX memory. ESX distributes memory between all VMs in
a fair and efficient manner so as to maximize utilization of the ESX Server. At the same time, memory
overcommitment is reliable: VMs will not be prematurely terminated or suspended owing to memory
overcommitment. The memory reclamation techniques of ESX guarantee safe operation of VMs in
a memory-overcommitted environment.
6 Acknowledgments
Memory overcommitment in ESX was designed and implemented by Carl Waldspurger [7].
Introduction
Traditionally, storage arrays were built of spinning disks with a few gigabytes of battery-backed
NVRAM as a local cache. The typical IO response time was multiple milliseconds, and the maximum
supported IOPS were a few thousand. Today, in the flash era, arrays are advertising IO latencies of
under a millisecond and IOPS on the order of millions. XtremIO [] (now EMC), Violin Memory [],
WhipTail [], Nimbus [], SolidFire [], Pure Storage [], Nimble [], GridIron (now Violin) [],
CacheIQ (now NetApp) [], and Avere Systems [] are some of the emerging startups developing
storage solutions that leverage flash. Additionally, established players (namely EMC, IBM, HP, Dell,
and NetApp) are also actively developing solutions. Flash is also being adopted within servers as a
flash cache to accelerate IOs by serving them locally; some example solutions include []. Given the
current trends, it is expected that all-flash and hybrid arrays will completely replace traditional
disk-based arrays by the end of this decade. To summarize, the IO saturation bottleneck is shifting:
administrators are no longer worried about how many requests the array can service, but rather
about how fast the server can be configured to send these IO requests and utilize the bandwidth.
Multipathing is a mechanism for a server to connect to a storage array using multiple available
fabric ports. ESXi's multipathing logic is implemented as a Path Selection Plug-in (PSP) within the
PSA (Pluggable Storage Architecture) layer []. The ESXi product today ships with three different
multipathing algorithms under the NMP (Native Multipathing Plug-in) framework: PSP_FIXED,
PSP_MRU, and PSP_RR. Both PSP_FIXED and PSP_MRU utilize only one fixed path for IO requests
and do not perform any load balancing, while PSP_RR does simple round-robin load balancing
among all Active Optimized paths. There are also commercial solutions available (as described in
the related work section) that essentially differ in how they distribute load across the Active
Optimized paths.
In this paper we explore a novel idea of using both Active Optimized and Active Un-optimized paths
concurrently. Active Un-optimized paths have traditionally been used only in failover scenarios,
since these paths are known to exhibit a higher service time compared to Active Optimized paths.
The hypothesis of our approach was that the service times were high because the contention
bottleneck was the array bandwidth, limited by the disk IOPS. In the new flash era, the array is far
from being a hardware bottleneck. We discovered that our hypothesis is half true, and we designed
a plug-in solution around it called PSP Adaptive.
Abstract
At the advent of virtualization, primary storage equated to spinning disks. Today the enterprise
storage landscape is rapidly changing, with low-latency all-flash storage arrays, specialized
flash-based IO appliances, and hybrid arrays with built-in flash. Also, with the adoption of host-side
flash cache solutions (similar to vFlash), the read-write mix of operations emanating from the server
is more write-dominated, since reads are increasingly served locally from cache. Is the original ESXi
IO multipathing logic, developed for disk-based arrays, still applicable in this new flash storage era?
Are there optimizations we can develop as a differentiator in the vSphere platform for supporting
this core functionality?
This paper argues that the existing IO multipathing in ESXi is not optimal for flash-based arrays. In
our evaluation, the maximum IO throughput is not bound by a hardware resource bottleneck, but
rather by the Pluggable Storage Architecture (PSA) module that implements the multipathing logic.
The root cause is the affinity maintained by the PSA module between the host traffic and a subset
of the ports on the storage array (referred to as Active Optimized paths). Today the Active
Un-optimized paths are used only during hardware failover events, since un-optimized paths exhibit
higher service time than optimized paths. Thus, even though the Host Bus Adapter (HBA) hardware
is not completely saturated, we are artificially constrained in software by limiting IO to the Active
Optimized paths only.
We implemented a new multipathing approach called PSP Adaptive as a Path Selection Plug-in in
the PSA. This approach detects IO path saturation (leveraging existing SIOC techniques) and spreads
the write operations across all the available paths (optimized and un-optimized), while reads continue
to maintain their affinity paths. The key observation was that the higher service times on the
un-optimized paths are still lower than the wait times on the saturated optimized paths. Further,
read affinity is important to maintain, given the session-based prefetching and caching semantics
used by the storage arrays. During periods of non-saturation, our approach switches to the traditional
affinity model for both reads and writes. In our experiments we observed significant improvements
in throughput for some workload scenarios. We are currently working with a wide range of storage
partners to validate this model for various Asymmetric Logical Unit Access (ALUA) storage
implementations, and even MetroClusters.
Redefining ESXi IO Multipathing in the Flash Era
Fei Meng, North Carolina State University, meng@ncsu.edu
Li Zhou1, Facebook, Inc., lzhou@fb.com
Sandeep Uttamchandani, VMware, Inc., sandeepu@vmware.com
Xiaosong Ma, North Carolina State University & Oak Ridge National Lab, ma@csc.ncsu.edu
1 Li Zhou was a VMware employee when working on this project.
Multipathing in ESXi Today
Figure 1 shows the IO multipathing architecture of vSphere. In a typical SAN configuration, each
host has multiple Host Bus Adapter (HBA) ports connected to both of the storage array's controllers.
The host thus has multiple paths to the storage array and performs load balancing among all paths
to achieve better performance. In vSphere this is done by the Path Selection Plug-ins (PSPs) at the
PSA (Pluggable Storage Architecture) layer []. The PSA framework collapses multiple paths to the
same datastore and presents one logical device to the upper layers, such as the file system. Internally,
the NMP (Native Multipathing Plug-in) framework allows different path-selection policies by
supporting different PSPs. The PSPs decide which path an IO request is routed to. vSphere provides
three different PSPs: PSP_FIXED, PSP_MRU, and PSP_RR. Both PSP_FIXED and PSP_MRU utilize
only one path for IO requests and do not do any load balancing, while PSP_RR does simple
round-robin load balancing among all active paths for active/active arrays.
In summary, none of the existing path load-balancing implementations concurrently utilizes the
Active Optimized and Un-optimized paths of ALUA storage arrays for IO. VMware can provide
significant differentiated value by supporting PSP Adaptive as a native option for flash-based arrays.
The key contributions of this paper are: (1) a novel approach for I/O multipathing in ESXi, specifically
optimized for flash-enabled arrays; within this context, we designed an algorithm that adaptively
switches between traditional multipathing and spreading writes across Active Optimized and Active
Un-optimized paths; and (2) an implementation of the PSP Adaptive plug-in within the PSA module,
together with an experimental evaluation.
The rest of the paper is organized as follows: we first summarize multipathing in ESXi today and
related work, then cover the design details and evaluation, followed by conclusions.
2 Related Work
There are three different types of storage arrays with respect to the dual-controller implementation:
active/active, active/standby, and Asymmetric Logical Unit Access (ALUA) []. Defined in the SCSI
standard, ALUA provides a standard way for hosts to discover and manage multiple paths to the
target. Unlike active/active systems, ALUA designates one of the controllers as optimized, and the
current VMware ESXi path selection plug-in [] serves IOs from this controller. Unlike active/standby
arrays, which cannot serve IO through the standby controller, an ALUA storage array is able to serve
IO requests from both optimized and un-optimized controllers. ALUA storage arrays have become
very popular; today most mainstream storage arrays (e.g., most popular arrays made by EMC and
NetApp) support ALUA.
Multipathing for Storage Area Networks (SAN) was designed as a fault-tolerance technique to avoid
single points of failure, as well as to provide performance enhancement via load balancing [].
Multipathing has been implemented in all major operating systems, such as Linux (at different
storage stack layers) [], Solaris [], FreeBSD [], and Windows []. Multipathing has also been
offered as third-party products such as Symantec Veritas [] and EMC PowerPath []. ESXi has inbox
multipathing support in the Native Multipathing Plug-in (NMP), and it has several different flavors of
path selection algorithms (such as Round Robin, MRU, and Fixed) for different devices. ESXi also
supports third-party multipathing plug-ins in the forms of PSP (Path Selection Plug-in), SATP (Storage
Array Type Plug-in), and MPP (Multi-Pathing Plug-in), all under the ESXi PSA framework. EMC,
NetApp, Dell EqualLogic, and others have developed their solutions on ESXi PSA. Most of the
implementations do simple round robin among active paths [], based on the number of completed
IOs or transferred bytes for each path. Some third-party solutions such as EMC's PowerPath adopt
more complicated load-balancing algorithms, but performance-wise are only at par with, or even
worse than, VMware's NMP. Kiyoshi et al. proposed a dynamic load-balancing, request-based
device-mapper multipath for Linux, but did not implement this feature [].
Figure 1. High-level architecture of IO multipathing in vSphere.
In such cases, spreading writes to un-optimized paths helps lower the load on the optimized paths
and thereby boosts system performance. Thus, in our optimized plug-in, only writes are spread to
un-optimized paths.
3.3 Spread Start and Stop Triggers
Because of the asymmetric performance between optimized and un-optimized paths, we should
only spread IO to un-optimized paths when the optimized paths are saturated (i.e., there is IO
contention). Therefore accurate IO contention detection is key. Another factor we need to consider
is that the ALUA specification does not specify implementation details; different ALUA arrays from
different vendors can have different ALUA implementations, and hence different behaviors when
serving IO issued on the un-optimized paths. In our experiments we found that at least one ALUA
array shows unacceptable performance for IOs issued on the un-optimized paths. We need to take
this into account and design PSP Adaptive to detect such behavior and stop routing WRITEs to the
un-optimized paths if no IO performance improvement is observed. The following sections describe
the implementation details.
3.3.1 I/O Contention Detection
We apply the same techniques that SIOC (Storage IO Control) uses today to PSP Adaptive for IO
contention detection: IO latency thresholds. To avoid thrashing, two latency thresholds ta and tb
(ta > tb) are used to trigger the start and stop of write spread to non-optimized paths. PSP Adaptive
keeps monitoring the IO latency
3 Design and Implementation
Consider a highway and a local road, both leading to the same destination. When the traffic is
bounded by a toll plaza at the destination, there is no point in routing traffic to the local road.
However, if the toll plaza is removed, it starts to make sense to route a part of the traffic to the local
road during rush hours, because the contention point has shifted. The same strategy can be applied
to load balancing for ALUA storage arrays. When the array is the contention point, there is no point
in routing IO requests to the non-optimized paths: their latency is higher, and host-side IO bandwidth
is not the bound. However, when an array with flash is able to serve millions of IOPS, it is no longer
the contention point, and the host-side IO pipes can become the contention point during heavy IO
load. It then makes sense to route a part of the IO traffic to the un-optimized paths. Although the
latency on the un-optimized paths is higher, considering that the optimized paths are saturated,
using un-optimized paths can still boost aggregate system IO performance, with increased IO
throughput and IOPS. This should only be done during rush hours, when the IO load is heavy and
the optimized paths are saturated. The new PSP Adaptive plug-in is implemented using this strategy.
3.1 Utilize Active/Non-optimized Paths for ALUA Systems
Figure 1 shows the high-level overview of multipathing in vSphere. NMP collapses all paths and
presents only one logical device to upper layers, which can be used to store virtual disks of VMs.
When an IO is issued to the device, NMP queries the path selection plug-in, PSP_RR, to select a path
on which to issue the IO. Internally, PSP_RR uses a simple round-robin algorithm to select the path
for ALUA systems. Figure 2(a) shows the default path-selection algorithm: IOs are dispatched to all
Active Optimized paths alternately. Active Un-optimized paths are not used even if the Active
Optimized paths are saturated. This approach wastes resources when the optimized paths are saturated.
To improve performance when the Active Optimized paths are saturated, we spread WRITE IOs to
the un-optimized paths. Even though the latency will be higher compared to IOs using the optimized
paths, the aggregate system throughput and IOPS will be improved. Figure 2(b) illustrates the
optimized path dispatching.
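A minimal sketch of the dispatching policy of Figure 2(b) is shown below; the path objects, the IO
object, and the write_spread_on flag are placeholder abstractions rather than PSA interfaces.

    # Illustrative path selection: reads keep affinity to optimized paths, while
    # writes are spread over all paths only while the optimized paths are saturated.
    import itertools

    class AdaptiveSelector:
        def __init__(self, optimized_paths, unoptimized_paths):
            self.opt_cycle = itertools.cycle(optimized_paths)
            self.all_cycle = itertools.cycle(optimized_paths + unoptimized_paths)
            self.write_spread_on = False   # toggled by the contention detector

        def select_path(self, io):
            if io.is_write and self.write_spread_on:
                return next(self.all_cycle)   # spread writes across every path
            return next(self.opt_cycle)       # round robin on optimized paths only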
3.2 Write Spread Only
For any active/active (including ALUA) dual-controller array, each controller has its own cache for
data blocks of both reads and writes. On write requests, controllers need to synchronize their caches
with each other to guarantee data integrity; for reads it is not necessary to synchronize the cache.
As a result, reads have affinity to a particular controller while writes do not. We therefore assume
that issuing writes on either controller is symmetric, while it is better to issue reads on the same
controller so that the cache hit rate is higher.
Most workloads have many more reads than writes. However, with the increasing adoption of
host-side flash caching, the actual IOs hitting the storage controllers are expected to have a much
higher write/read ratio: a large portion of the reads will be served from the host-side flash cache,
while all writes still hit the array.
Figure 2. PSP Round-Robin policy.
of the optimized paths (to). If to exceeds ta, PSP Adaptive starts write spread; if to falls below tb,
PSP Adaptive stops write spread. Like SIOC, the actual values of ta and tb are set by the user and
can differ for different storage arrays.
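The start/stop trigger amounts to a hysteresis on the observed optimized-path latency; a minimal
sketch, assuming ta and tb are the user-configured values described above:

    # Hysteresis on the observed optimized-path latency (to): start write spread
    # when to exceeds ta, stop it when to falls below tb (with ta > tb).
    def update_write_spread(write_spread_on, to, ta, tb):
        if not write_spread_on and to > ta:
            return True    # optimized paths saturated: start spreading writes
        if write_spread_on and to < tb:
            return False   # contention cleared: fall back to the affinity model
        return write_spread_on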
3.3.2 Max I/O Latency Threshold
As described earlier, different storage vendors' ALUA implementations vary. The I/O performance on the un-optimized paths of some ALUA arrays can be very poor. For such arrays we should not spread I/O to the un-optimized paths. To handle such cases we introduce a third threshold, the max I/O latency tc (tc > ta). Latency higher than this value is unacceptable to the user. PSP Adaptive monitors the I/O latency on the un-optimized paths (tuo) when write spread is turned on. If PSP Adaptive detects tuo exceeding the value of tc, it concludes that the un-optimized paths should not be used and stops write spread.

A simple on/off switch is also added as a configurable knob for administrators. If a user does not want to use un-optimized paths, an administrator can simply turn the feature off through the esxcli command. PSP Adaptive will then behave the same as PSP RR, without spreading I/O to un-optimized paths.
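These two guards fit naturally around the hysteresis sketched above. Again, this is a sketch with assumed names rather than the shipped logic:

# Illustrative guards from Section 3.3.2 layered on the Section 3.3.1 hysteresis sketch.
def update_write_spread_guarded(selector, t_o, t_uo, t_a, t_b, t_c, feature_enabled):
    """t_uo: measured latency on un-optimized paths; t_c (> t_a) is the max acceptable latency."""
    if not feature_enabled:             # administrator disabled the feature: behave like PSP RR
        selector.write_spread = False
        return
    if selector.write_spread and t_uo > t_c:
        selector.write_spread = False   # un-optimized paths are too slow on this array
        return
    update_write_spread(selector, t_o, t_a, t_b)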
3.3.3 I/O Performance Improvement Detection
We want to spread I/O to un-optimized paths only if doing so improves aggregate system IOPS and/or throughput. PSP Adaptive continuously monitors the aggregate IOPS and throughput on all paths to the specific target. Detecting I/O performance improvements is complicated, however, because system load and I/O patterns (e.g., block size) can change; for this reason, I/O latency numbers alone cannot be used to decide whether system performance has improved.

To handle this situation, we monitor and compare both IOPS and throughput. When the I/O latency on the optimized paths exceeds threshold ta, PSP Adaptive saves the IOPS and throughput data as reference values before it turns on write spread to un-optimized paths. It then periodically checks whether the aggregate IOPS and/or throughput have improved by comparing them against the reference values. If they have not improved, it stops write spread; otherwise, no action is taken. To filter out noise, the improvement in either IOPS or throughput must exceed a minimum percentage before PSP Adaptive concludes that performance has improved.

Overall system performance is considered improved even if only one of the two measures (IOPS and throughput) improves, because an I/O pattern change should not decrease both values simultaneously. For example, if I/O block sizes go down, aggregate throughput could go down, but IOPS should go up. If system load goes up, both IOPS and throughput should go up with write spread. If both aggregate IOPS and throughput go down, PSP Adaptive concludes that it is because system load is going down.

If system load goes down, the aggregate IOPS and throughput could go down as well and cause PSP Adaptive to stop write spread. This is fine, because less system load means I/O latency will improve. Unless the I/O latency on the optimized paths exceeds ta again, write spread will not be turned on again.
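The comparison against the saved reference values can be sketched as follows; MIN_GAIN and the function shape are assumptions for illustration, since the paper does not state the exact improvement threshold:

# Illustrative improvement check from Section 3.3.3; threshold and structure are assumed.
MIN_GAIN = 0.05   # assumed minimum relative improvement needed to call it a real gain

def should_keep_spreading(ref_iops, ref_tput, cur_iops, cur_tput):
    iops_gain = (cur_iops - ref_iops) / ref_iops
    tput_gain = (cur_tput - ref_tput) / ref_tput
    # One improving metric is enough: a block-size change alone should not lower both.
    return iops_gain >= MIN_GAIN or tput_gain >= MIN_GAIN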
3.3.4 Impact on Other Hosts in the Same Cluster
Ordinarily, one host utilizing the un-optimized paths could negatively affect other hosts that are connected to the same ALUA storage array. However, as explained in the earlier sections, the greatly boosted I/O performance of new flash-based storage means the storage array is much less likely to become the contention point, even if multiple hosts are pumping heavy I/O load to the array simultaneously. The negative impact is therefore negligible. Our performance benchmark also confirms this.
4 Evaluation and Analysis
All the performance numbers are collected on one ESXi server. The physical machine configuration is listed in Table 1. Iometer running inside Windows VMs is used to generate the workload. Since vFlash and VFC were not available at the time the prototype was built, we used an Iometer load mix of random reads and random writes to simulate the effect of host-side caching, which changes the I/O READ/WRITE ratio hitting the array.
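As a back-of-the-envelope illustration of why host-side caching shifts that ratio (the numbers below are assumed for the example, not measurements from this evaluation):

# Assumed, illustrative numbers: a 70/30 read/write guest workload with an 80% read hit rate
# in the host-side flash cache. Only read misses and all writes reach the array.
reads, writes, read_hit_rate = 0.70, 0.30, 0.80
array_reads = reads * (1 - read_hit_rate)
array_writes = writes
total = array_reads + array_writes
print(f"array sees {array_reads / total:.0%} reads, {array_writes / total:.0%} writes")
# Prints: array sees 32% reads, 68% writes, a much write-heavier mix than the original 70/30.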
CPU: Intel Xeon, cores, logical CPUs
Memory: GB DRAM
HBA: Dual-port Gbps HBA
Storage Array: ALUA-enabled array with LUNs on SSD; each LUN MB
FC Switch: Gbps FC switch

Table 1. Testbed Configuration
Figure 3 and Figure 4 compare the performance of PSP RR and PSP Adaptive when there is I/O contention on the HBA port. By spreading WRITEs to un-optimized paths during I/O contention, PSP Adaptive is able to increase aggregate system IOPS and throughput at the cost of slightly higher average WRITE latency. The aggregate system throughput improvement also increases with increasing I/O block size.

Overall, the performance evaluation results show that PSP Adaptive can increase aggregate system throughput and IOPS during I/O contention and is self-adaptive to workload changes.
Figure 3. Throughput: PSP RR vs. PSP Adaptive
Figure 4. IOPS: PSP RR vs. PSP Adaptive
5 Conclusion and Future Work
With the rapid adoption of flash, it is important to revisit some of the fundamental building blocks of the vSphere stack. I/O multipathing is critical for scale and performance, and any improvements translate into an improved end-user experience as well as higher VM density on ESXi. In this paper we explored an approach that challenged the old wisdom of multipathing that active un-optimized paths should not be used for load balancing. Our implementation showed that spreading writes across all paths has advantages during contention but should be avoided during normal load scenarios. PSP Adaptive is a PSP plug-in that we developed to adaptively switch load-balancing strategies based on system load.

Moving forward, we are working to further enhance the adaptive logic by introducing a path-scoring attribute that ranks different paths based on I/O latency, bandwidth, and other factors. The score is used to decide whether a specific path should be used under different system I/O load conditions. Further, we want to decide the percentage of I/O requests that should be dispatched to a certain path. We could also combine the path score with I/O priorities by introducing priority queuing within PSA.

Another important storage trend is the emergence of active/active storage across metro distances. EMC's VPLEX [2] is the leading solution in this space. Similar to ALUA, active/active storage arrays expose asymmetry in service times even across active optimized paths, due to unpredictable network latencies and the number of intermediate hops. An adaptive multipath strategy could be useful for the overall performance.
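Purely as a hypothetical illustration of the path-scoring idea mentioned above, such a score could combine per-path attributes along these lines; the weights and attributes are not from this work:

# Hypothetical path score for the future-work idea above; weights and fields are not from the paper.
def path_score(latency_ms, bandwidth_mbps, is_optimized,
               w_latency=0.6, w_bandwidth=0.3, w_optimized=0.1):
    # Lower latency and higher bandwidth raise the score; optimized paths get a small bonus.
    return (w_latency * (1.0 / max(latency_ms, 0.001))
            + w_bandwidth * (bandwidth_mbps / 1000.0)
            + w_optimized * (1.0 if is_optimized else 0.0))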
References
1. EMC PowerPath. http://www.emc.com/storage/powerpath/powerpath.htm
2. EMC VPLEX. http://www.emc.com/storage/vplex/vplex.htm
3. Multipath I/O. http://en.wikipedia.org/wiki/Multipath_I/O
4. Implementing vSphere Metro Storage Cluster using EMC VPLEX. http://kb.vmware.com/kb/2007545
5. Solaris SAN Configuration and Multipathing Guide, 2000. http://docs.oracle.com/cd/E19253-01/820-1931/820-1931.pdf
6. FreeBSD disk multipath control. http://www.freebsd.org/cgi/man.cgi?query=gmultipath&apropos=0&sektion=0&manpath=FreeBSD+7.0-RELEASE&format=html
7. Nimbus Data Unveils High-Performance Gemini Flash Arrays, 2012. http://www.crn.com/news/storage/240005857/nimbus-data-unveils-high-performance-gemini-flash-arrays-with-10-year-warranty.htm
8. Proximal Data's Caching Solutions Increase Virtual Machine Density in Virtualized Environments, 2012. http://www.businesswire.com/news/home/20120821005923/en/Proximal-Data%E2%80%99s-Caching-Solutions-Increase-Virtual-Machine
9. DM Multipath, 2012. https://access.redhat.com/knowledge/docs/en-US/Red_Hat_Enterprise_Linux/6/html-single/DM_Multipath/index.html
10. EMC VFCache, Server Flash Cache, 2012. http://www.emc.com/about/news/press/2012/20120206-01.htm
11. Flash and Virtualization Take Center Stage at SNW for Avere, 2012. http://www.averesystems.com/
12. Multipathing policies in ESXi 4.x and ESXi 5.x, 2012. http://kb.vmware.com/kb/1011340
13. No-Compromise Storage for the Modern Datacenter, 2012. http://info.nimblestorage.com/rs/nimblestorage/images/Nimble_Storage_Overview_White_Paper.pdf
14. Pure Storage: 100% flash storage array: Less than the cost of spinning disk, 2012. http://www.purestorage.com/flash-array/
15. Symantec Veritas Dynamic Multi-Pathing, 2012. http://www.symantec.com/docs/DOC5811
16. Violin Memory, 2012. http://www.violin-memory.com/
17. VMware vFlash framework, 2012. http://blogs.vmware.com/vsphere/2012/12/virtual-flash-vflash-tech-preview.html
18. Who is WHIPTAIL?, 2012. http://whiptail.com/papers/who-is-whiptail/
19. Windows Multipath I/O, 2012. http://technet.microsoft.com/en-us/library/cc725907.aspx
20. XtremIO, 2012. http://www.xtremio.com/
21. NetApp Quietly Absorbs CacheIQ, Nov. 2012. http://www.networkcomputing.com/storage-networkingmanagement/netapp-quietly-absorbs-cacheiq/240142457
22. SolidFire Reveals New Arrays for White Hot Flash Market, Nov. 2012. http://siliconangle.com/blog/2012/11/13/solidfire-reveals-new-arrays-for-white-hot-flash-market/
23. Gridiron Intros New Flash Storage Appliance, Hybrid Flash Array, Oct. 2012. http://www.crn.com/news/storage/240008223/gridiron-introsnew-flash-storage-appliance-hybrid-flash-array.htm
24. Byan, S., Lentini, J., Madan, A., and Pabon, L. Mercury: Host-side Flash Caching for the Data Center. In MSST (2012), IEEE, pp. 1-12.
25. EMC ALUA System. http://www.emc.com/collateral/hardware/white-papers/h2890-emc-clariion-asymm-active-wp.pdf
26. Kiyoshi Ueda, Junichi Nomura, M. C. Request-based Device-mapper multipath and Dynamic load balancing. In Proceedings of the Linux Symposium (2007).
27. Luo, J., Shu, J.-W., and Xue, W. Design and Implementation of an Efficient Multipath for a SAN Environment. In Proceedings of the 2005 International Conference on Parallel and Distributed Processing and Applications (Berlin, Heidelberg, 2005), ISPA'05, Springer-Verlag, pp. 101-110.
28. Michael Anderson, P. M. SCSI Mid-Level Multipath. In Proceedings of the Linux Symposium (2003).
29. Ueda, K. Request-based dm-multipath, 2008. http://lwn.net/Articles/274292/
Methodology for Performance Analysis of VMware vSphere under Tier-1 Applications

Jeffrey Buell (VMware Performance), Daniel Hecht (VMware Hypervisor Group), Jin Heo (VMware Performance), Kalyan Saladi (VMware Performance), H. Reza Taheri (VMware Performance)
jbuell@vmware.com, dhecht@vmware.com, heoj@vmware.com, ksaladi@vmware.com, rtaheri@vmware.com

1 When VMware introduced its server-class, type 1 hypervisor in 2001, it was called ESX. The name changed to vSphere with release 4.1 in 2009.
Abstract
With the recent advances in virtualization technology, both in the hypervisor software and in the processor architectures, one can state that VMware vSphere runs virtualized applications at near-native performance levels. This is certainly true against a baseline of the early days of virtualization, when reaching even half the performance of native systems was a distant goal. However, this near-native performance has made the task of root-causing any remaining performance differences more difficult. In this paper we will present a methodology for a more detailed examination of the effects of virtualization on performance, and present sample results. We will look at the performance on vSphere of an OLTP workload, a typical Hadoop application, and low-latency applications. We will analyze the performance and highlight the areas that cause an application to run more slowly in a virtualized server. The pioneers of the early days of virtualization invented a battery of tools to study and optimize performance. We will show that, as the gap between virtual and native performance has closed, these traditional tools are no longer adequate for detailed investigations. One of our novel contributions is combining the traditional tools with hardware monitoring facilities to see how the processor execution profile changes on a virtualized server.

We will show that the increase in translation lookaside buffer (TLB) miss handling costs due to the hardware-assisted memory management unit (MMU) is the largest contributor to the performance gap between native and virtual servers. The TLB miss ratio also rises on virtual servers, further increasing the miss processing costs. Many of the other performance differences with native (e.g., additional data cache misses) are also due to the heavier TLB miss processing behaviour of virtual servers. Depending on the application, part of the performance difference is also due to the time spent in the hypervisor kernel. This is expected, as all networking and storage I/O has to get processed twice: once in the virtual device in the guest, and once in the hypervisor kernel. The hypervisor's virtual machine monitor (VMM) is responsible for only a small fraction of the overall time. In other words, the VMM, which was responsible for much of the virtualization overhead of the early hypervisors, is now a small contributor to virtualization overhead. The results point to new areas, such as TLBs and address translation, to work on in order to further close the gap between virtual and native performance.

Categories and Subject Descriptors: C [Computer Systems Organization]: Performance of Systems: Design studies, Measurement techniques, Performance attributes
General Terms: Measurement, Performance, Design, Experimentation
Keywords: Tier-1 applications, database performance, hardware counters
Introduction
Over the years, the performance of applications running on vSphere has gone from a fraction of native performance in the very early days of ESX to what is commonly called near-native performance, a term that has come to imply a very small overhead. Many tools and methods were invented along the way to measure and tune performance in virtual environments []. But with overheads at such low levels, drilling down into the source of performance differences between native and virtual has become very difficult.

The genesis of this work was a project to study the performance of vSphere running Tier-1 applications, in particular a relational database running an Online Transaction Processing (OLTP) workload, which is commonly considered a very heavy workload. Along the way we discovered that existing tools could not account for the difference between native and virtual performance. At this point we turned to hardware event monitoring tools and combined them with software profiling tools to drill down into the performance overheads. Later measurements with Hadoop and latency-sensitive applications showed that the same methodology and tools can be used for investigating the performance of those applications, and furthermore that the sources of virtualization performance overhead are similar for all these workloads. This paper describes the tools and the methodology for such measurements.

vSphere hypervisor
VMware vSphere contains various components, including the virtual machine monitor (VMM) and the VMkernel. The VMM implements the virtual hardware that runs a virtual machine. The virtual hardware comprises the virtual CPUs, timers, and devices. A virtual CPU includes a Memory Management Unit (MMU), which
We relied heavily on Oracle statspack (also known as AWR) stats for the OLTP workload.

We collected and analyzed Unisphere performance logs from the arrays to study any possible bottlenecks in the storage subsystem for the OLTP workload.

Hadoop maintains extensive statistics on its own execution. These are used to analyze how well-balanced the cluster is, and the execution times of various tasks and phases.
2.3.1 Hardware counters
The above tools are commonly used by vSphere performance analysts inside and outside VMware to study the performance of vSphere applications, and we made extensive use of them. But one of the key take-aways of this study is that these tools are not enough to really understand the sources of virtualization overhead. We needed to go one level deeper and augment the traditional tools with tools that reveal the hardware execution profile.

Recent processors have expanded the categories of hardware events that software can monitor []. But using these counters requires:
1. A tool to collect data
2. A methodology for choosing from among the thousands of available events, and for combining the event counts to derive statistics for events that are not directly monitored
3. Meaningful interpretation of the results

All of the above typically require a close working relationship with the microprocessor vendors, which we relied on heavily to collect and analyze the data in this study.

Processor architecture has become increasingly complex, especially with the advent of multiple cores residing in a NUMA node, typically with a shared last-level cache. Analyzing application execution on these processors is a demanding task and necessitates examining how the application interacts with the hardware. Processors come equipped with hardware performance monitoring units (PMUs) that enable efficient monitoring of hardware-level activity without significant performance overhead imposed on the application or OS. While the implementation details of a PMU can vary from processor to processor, two common types of PMU are Core and Uncore. Each processor core has its own PMU, and in the case of hyper-threading each logical core appears to have a dedicated core PMU all by itself.

A typical Core PMU consists of one or more counters that are capable of counting events occurring in different hardware components such as the CPU, cache, or memory. The performance counters can be controlled individually or as a group; the controls offered are usually enable, disable, and detect overflow (via generation of an interrupt), to name a few, and have different modes (user/kernel). Each performance counter can monitor one event at a time. Some processors restrict or limit the type of events that can be monitored from a given counter. The counters and their control registers can be accessed through special registers (MSRs) and/or through the PCI bus. We will look into details of the PMUs supported by the two main x86 processor vendors, AMD and Intel.
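As a concrete illustration of collecting such counter data on Linux (this is not the tooling used in the study, and the generic event aliases below may map to different raw events on Intel and AMD parts), the perf facility mentioned in Section 2.3 can be driven from a small script:

# Illustrative only: gather a few generic hardware events system-wide with Linux 'perf stat'.
# Assumes perf is installed and these event aliases are supported on the host.
import subprocess

EVENTS = ["cycles", "instructions", "dTLB-load-misses", "iTLB-load-misses"]

def collect(duration_s=10):
    # -a: all CPUs; -x,: CSV output (written to stderr), one line per event
    cmd = ["perf", "stat", "-a", "-x", ",", "-e", ",".join(EVENTS),
           "sleep", str(duration_s)]
    result = subprocess.run(cmd, capture_output=True, text=True)
    counts = {}
    for line in result.stderr.splitlines():
        fields = line.split(",")
        if len(fields) >= 3 and fields[0].strip().isdigit():
            counts[fields[2]] = int(fields[0])
    return counts

if __name__ == "__main__":
    stats = collect()
    if stats.get("instructions"):
        misses = stats.get("dTLB-load-misses", 0)
        print("dTLB load misses per 1K instructions: %.3f"
              % (1000.0 * misses / stats["instructions"]))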
2.2.2 Hadoop Workload Configuration
Cluster of HP DL servers
Two 3.6 GHz 4-core Intel Xeon X5687 (Westmere-EP) processors with hyper-threading enabled
72GB 1333MHz memory, 68GB used for the VM
16 local 15K RPM SAS drives, 12 used for Hadoop data
Broadcom 10GbE NICs connected together in a flat topology through an Arista switch
Software: vSphere 5.1, RHEL 6.1, CDH 4.1.1 with MRv1
vSphere network driver bnx2x, upgraded for best performance
2.2.3 Latency-sensitive workload configuration
The testbed consists of one server machine for serving RR (request-response) workload requests and one client machine that generates the RR requests.

The server and client machines are each configured with dual-socket, quad-core Intel Xeon processors. Both machines are equipped with a 10GbE Broadcom NIC. Hyper-threading is not used.

Two different configurations are used. A native configuration runs native RHEL Linux on both the client and server machines, while a VM configuration runs vSphere on the server machine and native RHEL on the client machine. The vSphere host runs a VM with the same version of RHEL Linux. For both the native and VM configurations, only one CPU (one vCPU for the VM configuration) is configured to be used, to remove the impact of parallel processing and discard any multi-CPU related overhead.

The client machine is used to generate the RR workload requests, and the server machine is used to serve the requests and send replies back to the client in both configurations. VMXNET is used for the virtual NICs in the VM configuration. A Gigabit Ethernet switch is used to interconnect the two machines.
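For illustration only, the essence of such an RR workload is a synchronous request-reply ping-pong. The sketch below assumes an echo-style server is already listening on the given port and is not the workload generator used in the study:

# Hypothetical RR latency probe; assumes an echo-style server on (host, port).
import socket
import time

def rr_latency(host, port, payload=b"x" * 64, iters=1000):
    samples = []
    with socket.create_connection((host, port)) as s:
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)  # send each small request immediately
        for _ in range(iters):
            start = time.perf_counter()
            s.sendall(payload)
            s.recv(len(payload))                                 # block until the reply arrives
            samples.append(time.perf_counter() - start)
    samples.sort()
    return samples[len(samples) // 2], samples[int(len(samples) * 0.99)]  # median and 99th percentile

# Example with a hypothetical endpoint: median_s, p99_s = rr_latency("server", 9000)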
2.3 Tools
Obviously, tools are critical in a study like this.

We collected data from the usual Linux performance tools, such as mpstat, iostat, sar, numastat, and vmstat.

With the later RHEL 6.1 experiments, we used the Linux perf [] facility to profile the native/guest application.

Naturally, in a study of virtualization overhead, vSphere tools are the most widely used.

Esxtop is comm