VMware Technical Journal - Summer 2013
Transcript of VMware Technical Journal - Summer 2013
VOL. 2, NO. 1, JUNE 2013
VMWARE TECHNICAL JOURNAL
Editors: Curt Kolovson, Steve Muir, Rita Tavilla
TABLE OF CONTENTS
1 Introduction
Steve Muir, Director, VMware Academic Program
2 Memory Overcommitment in the ESX Server
Ishan Banerjee, Fei Guo, Kiran Tati, Rajesh Venkatasubramanian
13 Redefining ESXi IO Multipathing in the Flash Era
Fei Meng, Li Zhou, Sandeep Uttamchandani, Xiaosong Ma
19 Methodology for Performance Analysis of VMware vSphere under Tier-1 Applications
Jeffrey Buell, Daniel Hecht, Jin Heo, Kalyan Saladi, H. Reza Taheri
29 vATM: VMware vSphere Adaptive Task Management
Chirag Bhatt, Aalap Desai, Rajit Kambo, Zhichao Li, Erez Zadok
35 An Anomaly Event Correlation Engine: Identifying Root Causes, Bottlenecks,
and Black Swans in IT Environments
Mazda A. Marvasti, Arnak V. Poghosyan, Ashot N. Harutyunyan, Naira M. Grigoryan
46 Simplifying Virtualization Management with Graph Databases
Vijayaraghavan Soundararajan, Lawrence Spracklen
54 Autonomous Resource Sharing for Multi-Threaded Workloads in Virtualized Servers
Can Hankendi, Ayse K. Coskun
VMware Academic Program (VMAP)
The VMware Academic Program (VMAP) supports a number of academic research projects across
a range of technical areas. We initiate an annual Request for Proposals (RFP), and also support a
small number of additional projects that address particular areas of interest to VMware.
The 2013 Spring RFP, focused on Storage in support of Software Defined Datacenters (SDDC),
is currently conducting final reviews of a shortlist of proposals and will announce the recipients of
funding at the VMware Research Symposium in July. Please contact Rita Tavilla ([email protected])
if you wish to receive notification of future RFP solicitations and other opportunities for collaboration
with VMware.
The 2012 RFP, Security for Virtualized and Cloud Platforms, awarded funding to three projects:
Timing Side-Channels in Modern Cloud Environments
Prof. Michael Reiter, University of North Carolina at Chapel Hill
Random Number Generation in Virtualized Environments
Prof. Yevgeniy Dodis, New York University
VULCAN: Automatically Generating Tools for Virtual Machine Introspection using Legacy Binary Code
Prof. Zhiqiang Lin, University of Texas, Dallas
Visit http://labs.vmware.com/academic/rfp to find out more about the RFPs.
Introduction
Happy first birthday to the VMware Technical Journal! We are very pleased to have seen such a
positive reception to the first couple of issues of the journal, and hope you will find this one equally
interesting and informative. We will publish the journal twice per year going forward, with a Spring edition that highlights ongoing R&D initiatives at VMware and the Fall edition providing a showcase
for our interns and collaborators.
VMware's market leadership in infrastructure for the software-defined datacenter (SDDC) is built
upon the strength of our core virtualization technology combined with innovation in automation
and management. At the heart of the vSphere product is our hypervisor, and two papers highlight
ongoing enhancements in memory management and IO multi-pathing, the latter being based
upon work done by Fei Meng, one of our fantastic PhD student interns.
A fundamental factor in the success of vSphere is the high performance of the Tier-1 workloads
most important to our customers. Hence we undertake in-depth performance analysis and comparison to native deployments, some key results of which are presented here. We also develop the necessary
features to automatically manage those applications, such as the adaptive task management scheme
described in another paper.
However, the SDDC is much more than just a large number of servers running virtualized
workloads; it requires sophisticated analytics and automation tools if it is to be managed
efficiently at scale. vCenter Operations, VMware's automated operations management suite,
has proven to be extremely popular with customers, using correlation between anomalous events
to identify performance issues and root causes of failure. Recent developments in the use of graph
algorithms to identify relationships between entities have received a great deal of attention for their application to social networks, but we believe they can also provide insight into the
fundamental structure of the datacenter.
The final paper in the journal addresses another key topic in the datacenter: the management
of energy consumption. An ongoing collaboration with Boston University, led by Professor Ayse
Coskun, has demonstrated the importance of automatic application characterization and its use
in guiding scheduling decisions to increase performance and reduce energy consumption.
The journal is brought to you by the VMware Academic Program team. We lead VMware's efforts
to create collaborative research programs and support VMware R&D in connecting with the research
community. We are always interested to hear your feedback on our programs; please contact us electronically or look out for us at various research conferences throughout the year.
Steve Muir, Director, VMware Academic Program
Abstract
Virtualization of computer hardware continues to reduce the cost of operation in datacenters. It
enables users to consolidate virtual hardware on less physical hardware, thereby efficiently using
hardware resources. The consolidation ratio is a measure of the virtual hardware that has been
placed on physical hardware. A higher consolidation ratio typically indicates greater efficiency.
VMware's ESX Server is a hypervisor that enables competitive memory and CPU consolidation
ratios. ESX allows users to power on virtual machines (VMs) with a total configured memory that
exceeds the memory available on the physical machine. This is called memory overcommitment.
Memory overcommitment raises the consolidation ratio, increases operational efficiency, and
lowers the total cost of operating virtual machines. Memory overcommitment in ESX is reliable;
it does not cause VMs to be suspended or terminated under any conditions.
This article describes memory overcommitment in ESX, analyzes the cost of various memory
reclamation techniques, and empirically demonstrates that memory overcommitment induces an
acceptable performance penalty in workloads. Finally, best practices for implementing memory
overcommitment are provided.
General Terms: memory management, memory overcommitment, memory reclamation
Keywords: ESX Server, memory, resource management
Introduction
VMware's ESX Server offers competitive operational efficiency of virtual machines (VMs) in the
datacenter. It enables users to consolidate VMs on a physical machine while reducing the cost of
operation. The consolidation ratio is a measure of the number of VMs placed on a physical machine;
a higher consolidation ratio indicates a lower cost of operation. ESX Server's overcommitment
technology allows users to operate VMs with a high consolidation ratio. Overcommitment is the
ability to allocate more virtual resources than the available physical resources. ESX Server offers
users the ability to overcommit memory and CPU resources on a physical machine.
ESX is said to be CPU-overcommitted when the total configured virtual CPU resources of all
powered-on VMs exceed the physical CPU resources on ESX. When ESX is CPU-overcommitted,
it distributes physical CPU resources amongst powered-on VMs in a fair and efficient manner.
Similarly, ESX is said to be memory-overcommitted when the total configured guest memory size
of all powered-on VMs exceeds the physical memory on ESX. When ESX is memory-overcommitted,
it distributes physical memory fairly and efficiently amongst powered-on VMs. Both CPU and memory
scheduling give resources to those VMs which need them most, while reclaiming resources from
those VMs which are not actively using them.
Memory overcommitment in ESX is very similar to that in traditional operating systems (OSes)
such as Linux and Windows. In a traditional OS, a user may execute applications whose total mapped
memory exceeds the amount of memory available to the OS. This is memory overcommitment. If the
applications consume more memory than the available physical memory, the OS reclaims memory
from some of the applications and swaps it to a swap space. It then distributes the available free
memory between applications.
Similar to traditional OSes, ESX allows VMs to power on with a total configured memory size that
may exceed the memory available to ESX. For the purpose of discussion in this article, the memory
installed in an ESX Server is called ESX memory. If VMs consume all the ESX memory, then ESX
reclaims memory from VMs and redistributes the ESX memory in an efficient and fair manner to all
VMs such that the memory resource is best utilized. A simple example of memory overcommitment
is two VMs powered on in an ESX Server whose installed memory is smaller than their combined
configured memory size.
With a continuing fall in the cost of physical memory, it can be argued that ESX does not need
to support memory overcommitment. However, in addition to the traditional use case of improving
the consolidation ratio, memory overcommitment can also be used during disaster recovery, high
availability (HA), and distributed power management (DPM) to provide good performance. This
technology gives ESX a leading edge over contemporary hypervisors.
Memory overcommitment does not necessarily lead to performance loss in a guest OS or its
applications. Experimental results presented in this paper with two real-life workloads show gradual
performance degradation when ESX is progressively overcommitted.
This article describes memory overcommitment in ESX. It provides guidance on best practices
and discusses potential pitfalls.
Memory Overcommitment in the ESX Server
Ishan Banerjee, VMware, Inc., ishan@vmware.edu
Fei Guo, VMware, Inc., guo@vmware.com
Kiran Tati, VMware, Inc., ktati@vmware.com
Rajesh Venkatasubramanian, VMware, Inc., vrajeshi@vmware.com
VMs are regular processes in KVM, and therefore standard memory management techniques like
swapping apply. For Linux guests a balloon driver is installed, and it is controlled by the host via
the balloon monitor command. Some hosts also support Kernel Samepage Merging (KSM) [1],
which works similarly to ESX page sharing. Although KVM presents different memory reclamation
techniques, it requires certain hosts and guests to support memory overcommitment. In addition,
the management and policies for the interactions among the memory reclamation techniques are
missing in KVM.
XenServer uses a mechanism called Dynamic Memory Control (DMC) to implement memory
reclamation. It works by proportionally adjusting memory among running VMs based on predefined
minimum and maximum memory. VMs generally run with maximum memory, and memory can be
reclaimed via a balloon driver when memory contention occurs in the host. However, Xen does not
provide a way to overcommit the host physical memory; hence its consolidation ratio is largely
limited. Unlike other hypervisors, Xen provides a transcendent memory management mechanism
to manage all host idle memory and guest idle memory. The idle memory is collected into a pool
and distributed based on the demand of running VMs. This approach requires the guest OS to be
paravirtualized and only works well for guests with non-concurrent memory pressure.
When compared to existing hypervisors, ESX allows for reliable memory overcommitment to
achieve a high consolidation ratio with no requirement to modify running guests. It implements
various memory reclamation techniques to enable overcommitment and manages them in an
efficient manner to mitigate possible performance penalties to VMs.
Work on optimal use of host memory in a hypervisor has also been demonstrated by the research
community. An optimization of the KSM technique has been attempted in KVM with Singleton [6].
Sub-page-level page sharing using patching has been demonstrated in Xen with Difference Engine [2].
Paravirtualized guests and sharing of pages read from storage devices have been shown using Xen
in Satori [4]. Memory ballooning has been demonstrated for the z/VM hypervisor using CMM [5].
Ginkgo [3] implements a hypervisor-independent overcommitment framework allowing Java
applications to optimize their memory footprint. All these works target specific aspects of the
memory overcommitment challenge in virtualized environments. They are valuable references
for future optimizations in ESX memory management.
Background: Memory overcommitment in ESX is reliable (see Table 1). This implies that VMs will
not be prematurely terminated or suspended owing to memory overcommitment. Memory
overcommitment in ESX is theoretically limited by the overhead memory of ESX. ESX guarantees
reliability of operation under all levels of overcommitment.
The remainder of this article is organized as follows: Section 2 provides background information
on memory overcommitment and related work, Section 3 describes memory overcommitment,
Section 4 provides a quantitative understanding of the performance characteristics of memory
overcommitment, and Section 5 concludes the article.
2 Background and Related Work
Memory overcommitment enables a higher consolidation ratio in a hypervisor. Using memory
overcommitment, users can consolidate VMs on a physical machine such that physical resources
are utilized in an optimal manner while delivering good performance. For example, in a virtual
desktop infrastructure (VDI) deployment, a user may operate many Windows VMs, each containing
a word processing application. It is possible to overcommit a hypervisor with such VDI VMs. Since
the VMs contain similar OSes and applications, many of their memory pages may contain identical
content. The hypervisor will find and consolidate memory pages with identical content from these
VMs, thus saving memory. This enables better utilization of memory and a higher consolidation ratio.
Related work: Contemporary hypervisors such as Hyper-V, KVM, and Xen implement different
memory overcommitment, reclamation, and optimization strategies. Table 1 summarizes the
memory reclamation technologies implemented in existing hypervisors.
METHOD                 ESX    HYPER-V    KVM    XEN
share                   X                 X
balloon                 X        X        X      X
compress                X
hypervisor swap         X        X        X
memory hot-add                   X
transcendent memory                              X

Table 1. Comparing memory overcommitment technologies in existing hypervisors.
ESX implements reliable overcommitment.
Hyper-V uses dynamic memory for supporting memory overcommitment. With dynamic memory,
each VM is configured with a small initial RAM when powered on. When the guest applications
require more memory, a certain amount of memory is hot-added to the VM and the guest OS.
When a host lacks free memory, a balloon driver reclaims memory from other VMs and makes
memory available for hot-adding to the demanding VM. In rare and restricted scenarios, Hyper-V
will swap VM memory to a host swap space. Dynamic memory works for the latest versions of the
Windows OS, and it uses only ballooning and hypervisor-level swapping to reclaim memory. ESX,
on the other hand, works for all guest OSes. It also uses content-based page sharing and memory
compression. This approach improves VM performance as compared to the use of only ballooning
and hypervisor-level swapping.
1 www.microsoft.com/hyper-v-server/
2 www.linux-kvm.org/
3 www.xen.org/
This method greatly reduces the memory footprint of VMs with common memory content. For
example, if an ESX host has many VMs executing word processing applications, then ESX may
transparently apply page sharing to those VMs and collapse the text and data content of these
applications, thereby reducing the footprint of all those VMs. The collapsed memory is freed by ESX
and made available for powering on more VMs. This raises the consolidation ratio of ESX and
enables higher overcommitment. In addition, if the shared pages are not subsequently written into
by the VMs, then they remain shared for a prolonged time, maintaining the reduced footprint of the VM.
Memory ballooning is an active method for reclaiming idle memory from VMs. It is used when ESX
is in the soft state. If a VM has consumed memory pages but is not subsequently using them in an
active manner, ESX attempts to reclaim them from the VM using ballooning. In this method, an
OS-specific balloon driver inside the VM allocates memory from the OS kernel and hands that
memory to ESX, which is then free to re-allocate it to another VM that might be actively requiring
memory. The balloon driver effectively utilizes the memory management policy of the guest OS to
reclaim idle memory pages: the guest OS typically reclaims idle memory inside the guest and, if
required, swaps it to its own swap space.
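To make the ballooning mechanism concrete, the following Python sketch models the inflate/deflate
cycle under stated assumptions; the object and method names (guest_os, hypervisor, alloc_page,
release_backing, and so on) are hypothetical placeholders, not the actual driver or ESX interfaces.

    # Minimal sketch of a balloon driver's inflate/deflate cycle (hypothetical APIs).
    class BalloonDriver:
        def __init__(self, guest_os, hypervisor):
            self.guest_os = guest_os          # guest kernel page allocator
            self.hypervisor = hypervisor      # communication channel to ESX
            self.pinned_pages = []            # guest pages currently held by the balloon

        def inflate(self, target_pages):
            # Allocate and hold guest pages; satisfying these allocations may cause
            # the guest OS to swap other, idle pages to its own swap space.
            while len(self.pinned_pages) < target_pages:
                page = self.guest_os.alloc_page()
                self.pinned_pages.append(page)
                # Tell ESX this guest physical page is unused, so ESX can reclaim
                # the backing machine page and give it to another VM.
                self.hypervisor.release_backing(page)

        def deflate(self, count):
            # Return pages to the guest OS; ESX backs them again on the next access.
            for _ in range(min(count, len(self.pinned_pages))):
                page = self.pinned_pages.pop()
                self.hypervisor.restore_backing(page)
                self.guest_os.free_page(page)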
When ESX enters the hard state, it actively and aggressively reclaims memory from VMs by swapping
out memory to a swap space. During this step, if ESX determines that a memory page is sharable or
compressible, then that page is shared or compressed instead. The reclamation done by ESX using
swapping is different from that done by the guest OS inside the VM: the guest OS may swap out
guest memory pages to its own swap space (for example, a swap partition or page file), whereas
ESX uses hypervisor-level swapping to reclaim memory from a VM into ESX's own swap space.
The low state is similar to the hard state. In addition to compressing and swapping memory pages,
ESX may block certain VMs from allocating memory in this state. It aggressively reclaims memory
from VMs until ESX moves back into the hard state.
Page sharing is a passive memory reclamation technique that operates continuously on a powered-on
VM. The remaining techniques are active ones that operate when free memory in ESX is low. Also,
page sharing, ballooning, and compression are opportunistic techniques; they do not guarantee
memory reclamation from VMs. For example, a VM may not have sharable content, the balloon
driver may not be installed, or its memory pages may not yield good compression. Reclamation
by swapping is a guaranteed method for reclaiming memory from VMs.
In summary, ESX allows for reliable memory overcommitment to achieve a higher consolidation
ratio. It implements various memory reclamation techniques to enable overcommitment while
improving efficiency and lowering the cost of operating VMs. The next section describes memory
overcommitment.
When ESX is memory overcommitted, it allocates memory to those powered-on VMs that need it
most and will perform better with more memory. At the same time, ESX reclaims memory from
those VMs which are not actively using it. Memory reclamation is therefore an integral component
of memory overcommitment. ESX uses different memory reclamation techniques to reclaim memory
from VMs: transparent page sharing, memory ballooning, memory compression, and memory swapping.
ESX has an associated memory state which is determined by the amount of free ESX memory at
a given time. The states are high, soft, hard, and low. Table 2 shows the ESX state thresholds. Each
threshold is internally split into two sub-thresholds to avoid oscillation of the ESX memory state
near the threshold. At each memory state, ESX utilizes a combination of memory reclamation
techniques to reclaim memory, as shown in Table 3.
STATE   SHARE   BALLOON   COMPRESS   SWAP   BLOCK
high      X
soft      X        X
hard      X                  X         X
low       X                  X         X      X

Table 3. Actions performed by ESX in different memory states.
Transparent page sharing (page sharing) is a passive and opportunistic memory reclamation
technique. It operates on a powered-on VM at all memory states throughout its lifetime, looking
for opportunities to collapse different memory pages with identical content into a single memory page.
Table 2. Free-memory state transition thresholds in ESX. (a) User visible. (b) Internal thresholds
to avoid oscillation.
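As a rough illustration of Tables 2 and 3, the sketch below maps a free-memory measurement to a
state and to the reclamation techniques active in that state; the dictionary mirrors Table 3, while the
threshold structure is only an assumption about how the values might be represented, not the ESX
implementation.

    # Reclamation actions per free-memory state, following Table 3.
    ACTIONS_BY_STATE = {
        "high": ["share"],
        "soft": ["share", "balloon"],
        "hard": ["share", "compress", "swap"],
        "low":  ["share", "compress", "swap", "block"],
    }

    def select_reclamation_actions(free_memory, thresholds):
        """thresholds maps a state name to the minimum free memory for that state."""
        if free_memory >= thresholds["high"]:
            state = "high"
        elif free_memory >= thresholds["soft"]:
            state = "soft"
        elif free_memory >= thresholds["hard"]:
            state = "hard"
        else:
            state = "low"
        return state, ACTIONS_BY_STATE[state]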
ESX also reserves a small amount of memory called minfree. This amount is a buffer against rapid
allocations by memory consumers. ESX is in the high state as long as there is at least this amount of
memory free. If free memory dips below this value, then ESX is no longer in the high state and it
begins to actively reclaim memory.
Figure 1(c) shows the schematic diagram representing an undercommitted ESX host when overhead
memory is taken into account. In this diagram, the overhead memory consumed by ESX for itself and
for each powered-on VM is shown. Figure 1(d) shows the schematic diagram representing an
overcommitted ESX host when overhead memory is taken into account.
Figure 1(e) shows the theoretical limit of memory overcommitment in ESX. In this case, all of ESX
memory is consumed by the ESX overhead and the per-VM overhead. VMs will be able to power on
and boot; however, execution of the VMs will be extremely slow.
For simplicity of discussion and calculation, the definition of overcommitment from Figure 1(b) is
followed; overhead memory is ignored when defining memory overcommitment. From this figure,

    overcommit = ( \sum_{v \in V} memsize_v ) / ESXmemory        (1)

where
    overcommit : memory overcommitment factor
    V          : set of powered-on VMs in ESX
    memsize_v  : configured memory size of VM v
    ESXmemory  : total installed ESX memory

The representation from this figure is used in the remainder of this article.
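For illustration, Equation (1) can be evaluated directly from the configured VM sizes; the sizes used
below are arbitrary example values, not figures from the article.

    # Overcommitment factor per Equation (1): total configured VM memory
    # divided by installed ESX memory (overhead memory ignored).
    def overcommit_factor(vm_memsizes_gb, esx_memory_gb):
        return sum(vm_memsizes_gb) / esx_memory_gb

    # Hypothetical example: three VMs totalling 6GB on a host with 4GB of
    # ESX memory give an overcommitment factor of 1.5.
    print(overcommit_factor([1, 2, 3], 4))   # prints 1.5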
To understand memory overcommitment and its effect on VMs and applications, mapped, consumed,
and working set memory are described next.
3.2 Mapped Memory
The definition of memory overcommitment does not consider the memory consumption or memory
access characteristics of the powered-on VMs. Immediately after a VM is powered on, it does not
have any memory pages allocated to it. Subsequently, as the guest OS boots, the VM accesses pages
in its memory address space by reading or writing into them. ESX allocates physical memory pages
to back the virtual address space of the VM during this access. Gradually, as the guest OS completes
booting and applications are launched inside the VM, more pages in the virtual address space are
backed by physical memory pages. During the lifetime of the VM, the VM may or may not access
all pages in its virtual address space.
Windows, for example, writes the zero pattern to the complete memory address space of the VM.
This causes ESX to allocate memory pages for the complete address space by the time Windows
has completed booting. Linux, on the other hand, does not access the complete address space of
the VM when it boots; it accesses only the memory pages required to load the OS.
3 Memory Overcommitment
Memory overcommitment enables a higher consolidation ratio of VMs on an ESX Server. It reduces
the cost of operation while utilizing compute resources efficiently. This section describes memory
overcommitment and its performance characteristics.
3.1 Definitions
ESX is said to be memory overcommitted when VMs are powered on such that their total configured
memory size is greater than ESX memory. Figure 1 shows an example of memory overcommitment
on ESX.
Figure 1(a) shows the schematic diagram of a memory-undercommitted ESX Server. In this example,
two VMs are powered on, and their total configured memory size is less than ESX memory. Hence
ESX is considered to be memory undercommitted.
Figure 1(b) shows the schematic diagram of a memory-overcommitted ESX Server. In this example,
three VMs are powered on, and their total configured memory size is more than ESX memory. Hence
ESX is considered to be memory overcommitted.
The scenarios described above omit the memory overhead consumed by ESX. ESX consumes a
fixed amount of memory for its own text and data structures. In addition, it consumes overhead
memory for each powered-on VM.
Figure 1. Memory overcommitment shown with and without overhead memory. (a) Undercommitted
ESX. (b) Overcommitted ESX; this model is typically followed when describing overcommitment.
(c) Undercommitted ESX shown with overhead memory. (d) Overcommitted ESX shown with
overhead memory. (e) Limit of memory overcommitment.
4 A memory page with all 0x00 content is a zero page.
All memory pages of a VM which are ever accessed by the VM are considered mapped by ESX.
A mapped memory page is backed by a physical memory page by ESX during the very first access
by the VM. A mapped page may subsequently be reclaimed by ESX; it is considered mapped
throughout the lifetime of the VM.
When ESX is overcommitted, the total pages mapped by all VMs may or may not exceed ESX memory.
Hence it is possible for ESX to be memory overcommitted while, owing to the nature of the guest
OSes and their applications, the total mapped memory remains within ESX memory. In such a
scenario, ESX does not actively reclaim memory from VMs, and VM performance is not affected
by memory reclamation.
Mapped memory is illustrated in Figure 2. This figure uses the representation from Figure 1(b),
where ESX is memory overcommitted. In Figure 2(a), the memory mapped by each VM is shown.
ESX is overcommitted, but the total mapped memory of all VMs is less than ESX memory, so ESX
will not actively reclaim memory in this case. In Figure 2(b), the memory mapped by each VM is
again shown. In this case, ESX is overcommitted and the total mapped memory in ESX exceeds ESX
memory. ESX may or may not actively reclaim memory from the VMs; this depends on the current
consumed memory of each VM.
3.3 Consumed Memory
A VM is considered to be consuming a physical memory page when a physical memory page is used
to back a memory page in its address space. A memory page of the VM may exist in different states
with ESX. They are as follows.
Regular: A regular memory page in the address space of a VM is one which is backed by one
physical page in ESX.
Shared: A memory page marked as shared in a VM may be shared with many other memory pages.
If a memory page is being shared with n other memory pages, then each memory page is considered
to be consuming 1/n of a whole physical page.
Compressed: VM pages which are compressed typically consume only a fraction of a physical
memory page.
Ballooned: Ballooned memory pages are not backed by any physical page.
Swapped: Swapped memory pages are not backed by any physical page.
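The per-state accounting above can be summarized in a small sketch; the function and parameter
names are illustrative, and the charge for compressed pages is left as a parameter because the
exact fraction is not restated here.

    # Estimate the physical pages consumed by a VM from its page states.
    # Ballooned and swapped pages are charged nothing; a page shared among n
    # VM pages is charged 1/n; compressed pages are charged a fraction of a page.
    def consumed_pages(regular, sharing_degrees, compressed, compressed_fraction):
        consumed = float(regular)
        consumed += sum(1.0 / n for n in sharing_degrees)   # one entry per shared page
        consumed += compressed * compressed_fraction
        return consumed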
Consumed memory is illustrated in Figure 3. This figure uses the representation from Figure 1(b),
where ESX is memory overcommitted. In Figure 3(a), the mapped and consumed memory of the
powered-on VMs and of ESX is shown. ESX is overcommitted; however, the total mapped memory
is less than ESX memory. In addition, the total consumed memory is less than the total mapped
memory. The consumption is lower than the mapped memory since some memory pages may have
been shared, ballooned, compressed, or swapped. In this state, ESX will not actively reclaim memory
from VMs. However, the VMs may possess ballooned, compressed, or swapped memory, because
ESX may earlier have reclaimed memory from these VMs owing to the memory state at that time.
The total consumption of all VMs taken together cannot exceed ESX memory. This is shown in
Figure 3(b). In this figure, ESX is overcommitted and the total mapped memory is greater than ESX
memory. However, whenever VMs attempt to consume more memory than ESX memory, ESX will
reclaim memory from the VMs and redistribute the ESX memory amongst all VMs. This prevents
ESX from running out of memory. In this state, ESX is likely to actively reclaim memory from VMs
to prevent memory exhaustion.
Figure 2. Mapped memory for an overcommitted ESX host. The mapped region is shaded. The
mapped region in ESX is the sum of the mapped regions in the VMs. (a) Total mapped memory is
less than ESX memory. (b) Total mapped memory is more than ESX memory.
Figure 3. Consumed memory for an overcommitted ESX host. The consumed and mapped regions
are shaded. The consumed (and mapped) region in ESX is the sum of the consumed (mapped)
regions in the VMs. (a) Total consumed and mapped memory is less than ESX memory. (b) Total
mapped memory is more than ESX memory; total consumed memory is equal to ESX memory.
Page-fault cost: When a shared page is read by a VM, it is accessed in a read-only manner. Hence
that shared page does not need to be page-faulted, and there is no page-fault cost. A write access
to a shared page, however, incurs a cost.
CPU: When a shared page is written to by a VM, ESX must allocate a new page and replicate the
shared content before allowing the write access from the VM. This allocation incurs a CPU cost.
Typically this cost is very low and does not significantly affect VM applications and benchmarks.
Wait: The copy-on-write operation performed when a VM accesses a shared page with write access
is fairly fast. VM applications accessing the page do not incur a noticeable temporal cost.
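The write path just described is a copy-on-write step; the sketch below is illustrative only, with
hypothetical helper objects, and is not the ESX code path.

    # Copy-on-write handling for a write fault on a shared page (illustrative).
    def handle_write_fault(vm, guest_page, shared_page, allocator):
        # Allocate a private page and replicate the shared content (the CPU cost
        # noted above), then remap the guest page writable to the private copy.
        private_page = allocator.alloc_page()
        private_page.copy_from(shared_page)
        vm.map(guest_page, private_page, writable=True)
        shared_page.refcount -= 1          # one fewer sharer of the original page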
b) Ballooning
Reclamation cost: Memory is ballooned from a VM using a balloon driver residing inside the guest
OS. When the balloon driver expands, it may induce the guest OS to reclaim memory from guest
applications.
CPU: Ballooning incurs a CPU cost on a per-VM basis, since it induces memory allocation and
reclamation inside the VM.
Storage: The guest OS may swap out memory pages to the guest swap space. This incurs storage
space and storage bandwidth cost.
Page-fault cost:
CPU: A ballooned page acquired by the balloon driver may subsequently be released by it. The
guest OS or an application may then allocate and access it. This incurs a page fault in the guest OS
as well as in ESX. The page fault incurs a low CPU cost, since a memory page simply needs to be
allocated.
Storage: During reclamation by ballooning, application pages may have been swapped out by the
guest OS. When the application attempts to access such a page, the guest OS needs to swap it in.
This incurs a storage bandwidth cost.
Wait: A temporal wait cost may be incurred by an application if its pages were swapped out by the
guest OS. Swapping in a memory page by the guest OS incurs a smaller overall wait cost to the
application than a hypervisor-level swap-in, because during a page fault in the guest OS by one
thread, the guest OS may schedule another thread. However, if ESX is swapping in a page, it may
deschedule the entire VM, because ESX cannot reschedule guest OS threads.
c) Compression
Reclamation cost: Memory is compressed in a VM by compressing a full guest memory page such
that it consumes only a fraction of a physical memory page. There is effectively no memory cost,
since every successful compression releases memory.
CPU: A CPU cost is incurred for every attempted compression. The CPU cost is typically low and is
charged to the VM whose memory is being compressed. The CPU cost is, however, higher than that
of page sharing. It may lead to noticeably reduced VM performance and may affect benchmarks.
3.6 Cost of Memory Overcommitment
Memory overcommitment incurs certain costs in terms of compute resources as well as VM
performance. This section provides a qualitative understanding of the different sources of cost
and their magnitude.
When ESX is memory overcommitted and powered-on VMs attempt to consume more memory
than ESX memory, ESX will begin to actively reclaim memory from VMs. Hence memory reclamation
is an integral component of memory overcommitment.
Table 4 shows the memory reclamation techniques and their associated costs. Each technique incurs
a cost of reclamation when a page is reclaimed, and a cost of page fault when a VM accesses a
reclaimed page. The number of * qualitatively indicates the magnitude of the cost.
             SHARE   BALLOON   COMPRESS   SWAP (SSD)   SWAP (DISK)
Reclaim
  Memory       *
  CPU          *        *         **
  Storage               *                     *             *
  Wait
Page-fault
  Memory
  CPU          *        *         **
  Storage               *                     *             *
  Wait         *       ***        *          **           ****

1 indicates that this cost may be incurred under certain conditions.
Table 4. Cost of reclaiming memory from VMs and cost of page fault when a VM accesses a
reclaimed page. The number of * qualitatively indicates the magnitude of the cost. They are not
mathematically proportional.
A Memory cost indicates that the technique consumes memory for meta-data overhead. A CPU cost
indicates that the technique consumes non-trivial CPU resources. A Storage cost indicates that the
technique consumes storage space or bandwidth. A Wait cost indicates that the VM incurs a
hypervisor-level page-fault cost as a result of the technique; this may cause the guest application
to stall and hence lead to a drop in its performance. The reclamation and page-fault costs are
described as follows.
a) Page sharing
Reclamation cost: Page sharing is a continuous process which opportunistically shares VM memory pages.
Memory: This technique incurs an ESX-wide memory cost for storing page-sharing meta-data,
which is a part of ESX overhead memory.
CPU: Page sharing incurs a nominal CPU cost on a per-VM basis. This cost is typically very small
and does not impact the performance of VM applications or benchmarks.
4 Performance
This section provides a quantitative demonstration of memory overcommitment in ESX. Simple
applications are shown to work under memory overcommitment. This section does not provide a
comparative analysis using software benchmarks.
4.1 Microbenchmark
This set of experiments was conducted on a development build of ESX. The ESX Server has a
quad-core AMD CPU in a multi-node NUMA configuration. The VMs used in the experiments run
the RHEL OS. The ESX Server consumes memory for the ESX memory overhead, the per-VM memory
overhead, and the reserved minfree; the remainder is available for allocating to VMs. For these
experiments, the available ESX memory is 8GB, and ESX will actively reclaim memory from VMs
when VMs consume more than this amount. RHEL consumes a modest amount of memory when
booted and idle; this contributes to the mapped and consumed memory of each VM.
These experiments demonstrate the performance of VMs as the total working set varies relative
to the available ESX memory. Figure 5 shows the results from three experiments. For the purpose
of demonstration and simplicity, three reclamation techniques (page sharing, ballooning, and
compression) are disabled. Hypervisor-level memory swapping is the only active memory reclamation
technique in these experiments; this reclamation starts when VMs consume more than the available
ESX memory.
In these experiments, the presence of hypervisor-level page faults is an indicator of performance
loss. The figures show cumulative swap-in (page-fault) values. When this value rises, the VM
experiences a temporal wait cost; when it does not rise, there is no temporal wait cost and hence
no performance loss.
Experiment a: Figure 5(a) shows an experiment in which the total mapped and total working set
memory always remain less than the available ESX memory.
Page-fault cost: When a compressed memory page is accessed by the VM, ESX must allocate a new
page and decompress the page before allowing access by the VM.
CPU: Decompression incurs a CPU cost.
Wait: The VM also waits until the page is decompressed. This wait is typically not very high, since
the decompression takes place in memory.
d) Swap
Reclamation cost: ESX swaps out memory pages from a VM to avoid memory exhaustion. The
swap-out process takes place asynchronously to the execution of the VM and its applications.
Hence the VM and its applications do not incur a temporal wait cost.
Storage: Swapping out memory pages from a VM incurs storage space and storage bandwidth cost.
Page-fault cost: When a VM page-faults a swapped page, ESX must read the page from the swap
space synchronously before allowing the VM to access the page.
Storage: This incurs a storage bandwidth cost.
Wait: The VM also incurs a temporal wait cost while the swapped page is synchronously read from
the storage device. The temporal wait cost is highest for a swap space located on spinning disk;
a swap space located on SSD incurs a lower temporal wait cost.
The costs of reclamation and page fault vary significantly between different reclamation techniques.
Reclamation by ballooning may incur a storage cost only in certain cases, whereas reclamation by
swapping always incurs a storage cost. Similarly, a page fault owing to reclamation incurs a temporal
wait cost in all cases, and a page fault on a swapped page incurs the highest temporal wait cost.
Table 4 shows the cost of reclamation as well as the cost of page fault incurred by a VM on a
reclaimed memory page.
The reclamation itself does not impact VM and application performance significantly, since the
temporal wait cost of reclamation is zero for all techniques. Some techniques have a low CPU cost;
this cost may be charged to the VM, leading to slightly reduced performance.
However, during a page fault, a temporal wait cost exists for all techniques. This affects VM and
application performance. Write access to a shared page and access to compressed or ballooned
pages incur the least performance cost. Access to a swapped page, especially a page swapped to
spinning disk, incurs the highest performance cost.
Section 3 described memory overcommitment in ESX. It defined various terms (mapped, consumed,
and working set memory) and showed their relation to memory overcommitment. It also showed
that memory overcommitment does not necessarily impact VM and application performance, and
provided a qualitative analysis of how memory overcommitment may impact VM and application
performance depending on the VM's memory content and access characteristics.
The next section provides a quantitative description of overcommitment.
6 CPU cost of reclamation may affect performance only if ESX is operating at 100% CPU load.
Figure 5. Effect of working set on memory reclamation and page fault. Available ESX memory = 8GB.
(a) mapped, working set < available ESX memory. (b) mapped > available ESX memory, working
set < available ESX memory. (c) mapped, working set > available ESX memory.
Two VMs of equal configured memory size are powered on, overcommitting ESX memory. Each VM
runs a workload that allocates a block of memory, writes a random pattern into it, and then reads
all of that memory continuously in a round-robin manner for a number of iterations. The memory
mapped by each VM is the workload allocation plus the memory used by RHEL; the total mapped
memory and the working set both exceed the available ESX memory.
The figure shows the mapped memory of each VM and the cumulative swap-in. The X-axis shows
time in seconds; the Y-axis (left) shows mapped memory in MB and the Y-axis (right) shows
cumulative swap-in. It can be seen from this figure that a steady page-fault rate is maintained once
the workloads have mapped the target memory. This indicates that as the workloads are accessing
memory, they are experiencing page faults: ESX is continuously reclaiming memory from each VM
as the VM page-faults its working set.
This experiment demonstrates that workloads will experience steady page faults, and hence
performance loss, when the working set exceeds the available ESX memory. Note that if the working
set read-accesses shared pages and page sharing is enabled (the default ESX behavior), then a
page fault is avoided.
In this section, experiments were designed to highlight the basic working of overcommitment. In
these experiments, page sharing, ballooning, and compression were disabled for simplicity. These
reclamation techniques reclaim memory effectively before reclamation by swapping can take place.
Since page faults on shared, ballooned, and compressed memory have a lower temporal wait cost,
workload performance is better when they are enabled. This will be demonstrated in the next section.
4.2 Real Workloads
In this section, the experiments are conducted using vSphere. The ESX Server has a quad-core AMD
CPU. The workloads used to evaluate memory overcommitment performance are DVD Store and a
VDI workload. Experiments were conducted with the default memory management configurations
for page sharing, ballooning, and compression.
Experiment d: The DVD Store workload simulates online database operations for ordering DVDs.
In this experiment, five DVD Store VMs are used, each configured with multiple vCPUs and several
GB of memory and containing the Windows Server OS and SQL Server. The performance metric is
the total operations per minute over all VMs.
The total memory size of all powered-on VMs is smaller than the installed RAM of the ESX Server,
so the host was effectively undercommitted. Hence memory overcommitment was simulated with
the use of a memory hog VM. The memory hog VM had full memory reservation, and its configured
memory size was progressively increased in each run. This effectively reduced the ESX memory
available to the DVD Store VMs.
Two VMs of equal configured memory size are powered on; relative to ESX memory, this results in
a modest overcommitment factor. Each VM executes an identical hand-crafted workload. The
workload is a memory stress program: it allocates a block of memory, writes a random pattern into
all of the allocated memory, and continuously reads this memory in a round-robin manner. This
workload is executed in both VMs for a fixed duration.
The figure shows the mapped memory of each VM, along with the cumulative swap-in of each VM.
The X-axis shows time in seconds; the Y-axis (left) shows mapped memory in MB and the Y-axis
(right) shows cumulative swap-in. The memory mapped by each VM is the workload allocation plus
the memory used by RHEL, and the total is less than the available ESX memory. It can be seen from
this figure that there is no swap-in (page-fault) activity at any time; hence there is no performance
loss. The memory mapped by the VMs rises as the workload allocates memory. Subsequently, as
the workload accesses the memory in a round-robin manner, all of the memory is backed by
physical memory pages by ESX.
This experiment demonstrates that although memory is overcommitted, VMs mapping and actively
using less than the available ESX memory will not be subjected to active memory reclamation and
hence will not experience any performance loss.
Experiment b: Figure 5(b) shows an experiment in which the total mapped memory exceeds the
available ESX memory while the total working set memory remains less than the available ESX memory.
Two VMs of equal configured memory size are powered on, resulting in memory overcommitment.
The workload in each VM allocates a block of memory and writes a random pattern into it. It then
reads a fixed sub-block of this memory in a round-robin manner for a number of iterations; thereafter
the same fixed sub-block is read in a round-robin manner for a fixed duration. This workload is
executed in both VMs. The memory mapped by each VM is the workload allocation plus the memory
used by RHEL, and the total exceeds the available ESX memory. The working set thereafter is the
two fixed sub-blocks, whose total is less than the available ESX memory.
The figure shows the mapped memory of each VM and the cumulative swap-in. The X-axis shows
time in seconds; the Y-axis (left) shows mapped memory in MB and the Y-axis (right) shows
cumulative swap-in. It can be seen from this figure that as the workload maps memory, the mapped
memory rises. Thereafter there is an initial rise in swap-in activity as the working set is page-faulted.
After the working set has been page-faulted, the cumulative swap-in remains steady.
This experiment demonstrates that although memory is overcommitted and the VMs have mapped
more memory than the available ESX memory, the VMs perform better when their working set is
smaller than the available ESX memory.
Experiment c: Figure 5(c) shows an experiment in which the total mapped memory as well as the
total working set exceed the available ESX memory.
7 http://en.community.dell.com/techcenter/extras/w/wiki/dvd-store.aspx
In this experiment, fifteen VDI VMs are powered on in the ESX Server. Each VM is configured with
one vCPU and the same amount of memory, and the VMs contain the Windows OS. Similar to
experiment d, a memory hog VM with full memory reservation is used to simulate overcommitment
by effectively reducing the ESX memory on the ESX Server.
Figure 7 shows a high percentile of the operation latency in the VDI VMs for a series of decreasing
ESX memory values. At these data points, the memory hog VM has a correspondingly larger
configured memory size; owing to the reduced simulated ESX memory, the simulated overcommitment
factor (with respect to the total VM memory size) rises across the series.
The Y-axis (right) shows the total amount of ballooned and swapped memory (MB) from all VMs.
The Y-axis (left) shows the average latency of operations in the VDI workload. At the largest ESX
memory value, ballooning and swapping are not observed, because (1) page sharing helps reclaim
memory and (2) the applications' total mapped memory and working set fit into the available ESX
memory. When ESX memory is reduced, some memory is ballooned across the VMs, which causes
a slight increase in the average latency; this indicates that the VMs' active working set is being
reclaimed at this point. When ESX memory decreases further, the amount of ballooned memory
increases dramatically and the swapped memory also rises. As a result, the average latency increases
sharply. This is owing to the significantly higher page-fault wait cost of ballooning and hypervisor-level
swapping.
This experiment was repeated after attaching an SSD to the ESX Server; ESX is designed to utilize
an SSD as a device for swapping out VM memory pages. The average latency in the presence of the
SSD is shown in the same figure. It can be seen that the performance degradation is significantly
lower. This experiment shows that certain workloads maintain performance in the presence of an
SSD even when the ESX Server is severely overcommitted.
The experiments in this section show that memory overcommitment does not necessarily indicate
performance loss. As long as the application's active working set size is smaller than the available
ESX memory, the performance degradation may be tolerable.
Figure 6 shows the performance of the DVD Store VMs for a series of decreasing ESX memory
values. At these data points, the memory hog VM has a correspondingly larger configured memory
size; owing to the reduced simulated ESX memory, the simulated overcommitment factor, computed
using Equation (1) with respect to the total VM memory size, rises across the series.
The Y-axis (right) shows the total amount of ballooned and swapped memory in MB from all VMs.
It can be seen that the amount of ballooned memory increases gradually, without significant
hypervisor-level swapping, as ESX memory is reduced across most of the range. The Y-axis (left)
shows the operations per minute (OPM) executed by the workload. The rising ballooned memory
contributes to reduced application performance as ESX memory is reduced; this is owing to the
wait cost to the application as some of its memory is swapped out by the guest OS. Between the
intermediate X-axis points, the OPM decreases moderately owing to ballooning.
However, when ESX memory is reduced to the lowest value, and overcommitment is highest,
ballooning by itself is insufficient to reclaim enough memory, and non-trivial amounts of
hypervisor-level swapping take place. Owing to the higher wait cost of hypervisor-level page faults,
there is a larger performance loss and the OPM drops sharply.
This experiment was repeated after attaching an SSD to the ESX Server. The OPM in the presence
of the SSD is shown in the same figure. It can be seen that the performance degradation is smaller
than without the SSD.
Experiment e: The VDI workload is a set of interactive office user-level applications such as the
Microsoft Office suite and Internet Explorer. The workload is a custom-made set of scripts which
simulate user actions on the VDI applications. The scripts trigger mouse clicks and keyboard inputs
to the applications, and the time to complete each operation is recorded. In one complete run of
the workload, a fixed number of user actions are performed. The measured performance metric is
a high percentile of the latency values of all operations; a smaller latency indicates better performance.
Figure 6. DVD Store: Operations per minute (OPM) of five DVD Store VMs as the simulated ESX
memory is progressively reduced.
Figure 7. VDI: Average high-percentile operation latency of fifteen VDI VMs as the simulated ESX
memory is progressively reduced.
References
1 A. Arcangeli, I. Eidus, and C. Wright. Increasing memory density by using KSM. In Proceedings
of the Linux Symposium, pages 313-328, 2009.
2 D. Gupta, S. Lee, M. Vrable, S. Savage, A. C. Snoeren, G. Varghese, G. M. Voelker, and A. Vahdat.
Difference engine: harnessing memory redundancy in virtual machines. Commun. ACM,
53(10):85-93, Oct. 2010.
3 M. Hines, A. Gordon, M. Silva, D. Da Silva, K. D. Ryu, and M. Ben-Yehuda. Applications Know
Best: Performance-Driven Memory Overcommit with Ginkgo. In Cloud Computing Technology
and Science (CloudCom), 2011 IEEE Third International Conference on, pages 130-137, 2011.
4 G. Milos, D. Murray, S. Hand, and M. Fetterman. Satori: Enlightened page sharing. In Proceedings
of the 2009 Conference on USENIX Annual Technical Conference. USENIX Association, 2009.
5 M. Schwidefsky, H. Franke, R. Mansell, H. Raj, D. Osisek, and J. Choi. Collaborative Memory
Management in Hosted Linux Environments. In Proceedings of the Linux Symposium, pages
313-328, 2006.
6 P. Sharma and P. Kulkarni. Singleton: system-wide page deduplication in virtual environments.
In Proceedings of the 21st International Symposium on High-Performance Parallel and
Distributed Computing, HPDC '12, pages 15-26, New York, NY, USA, 2012. ACM.
7 C. A. Waldspurger. Memory resource management in VMware ESX Server. SIGOPS Oper.
Syst. Rev., 36(SI):181-194, Dec. 2002.
In many situations where memory is slightly or moderately overcommitted, page sharing and
ballooning are able to reclaim memory gracefully without a significant performance penalty.
However, under high memory overcommitment, hypervisor-level swapping may occur, leading to
significant performance degradation.
5 Conclusion
Reliable memory overcommitment is a unique capability of ESX, not present in any contemporary
hypervisor. Using memory overcommitment, ESX can power on VMs such that the total configured
memory of all powered-on VMs exceeds ESX memory. ESX distributes memory between all VMs in
a fair and efficient manner so as to maximize utilization of the ESX Server. At the same time, memory
overcommitment is reliable: VMs will not be prematurely terminated or suspended owing to memory
overcommitment. The memory reclamation techniques of ESX guarantee safe operation of VMs in
a memory-overcommitted environment.
6 Acknowledgments
Memory overcommitment in ESX was designed and implemented by Carl Waldspurger [7].
Introduction
Traditionally, storage arrays were built of spinning disks with a few gigabytes of battery-backed
NVRAM as a local cache. The typical IO response time was multiple milliseconds, and the maximum
supported IOPS were a few thousand. Today, in the flash era, arrays are advertising IO latencies of
under a millisecond and IOPS on the order of millions. XtremIO [] (now EMC), Violin Memory [],
WhipTail [], Nimbus [], SolidFire [], Pure Storage [], Nimble [], GridIron (now Violin) [],
CacheIQ (now NetApp) [], and Avere Systems [] are some of the emerging startups developing
storage solutions that leverage flash. Additionally, established players (namely EMC, IBM, HP, Dell,
and NetApp) are also actively developing solutions. Flash is also being adopted within servers as a
flash cache to accelerate IOs by serving them locally; some example solutions include []. Given the
current trends, it is expected that all-flash and hybrid arrays will completely replace traditional
disk-based arrays by the end of this decade. To summarize, the IO saturation bottleneck is shifting:
administrators are no longer worried about how many requests the array can service, but rather
about how fast the server can be configured to send these IO requests and utilize the bandwidth.
Multipathing is a mechanism for a server to connect to a storage array using multiple available
fabric ports. ESXi's multipathing logic is implemented as a Path Selection Plug-in (PSP) within the
PSA (Pluggable Storage Architecture) layer []. The ESXi product today ships with three different
multipathing algorithms under the NMP (Native Multipathing Plug-in) framework: PSP_FIXED,
PSP_MRU, and PSP_RR. Both PSP_FIXED and PSP_MRU utilize only one fixed path for IO requests
and do not perform any load balancing, while PSP_RR does simple round-robin load balancing
among all Active Optimized paths. There are also commercial solutions available (as described in
the related work section) that essentially differ in how they distribute load across the Active
Optimized paths.
In this paper we explore a novel idea of using both Active Optimized and Active Un-optimized paths
concurrently. Active Un-optimized paths have traditionally been used only in failover scenarios,
since these paths are known to exhibit a higher service time compared to Active Optimized paths.
The hypothesis of our approach was that the service times were high because the contention
bottleneck was the array bandwidth, limited by the disk IOPS. In the new flash era, the array is far
from being a hardware bottleneck. We discovered that our hypothesis is half true, and we designed
a plug-in solution around it called PSP Adaptive.
Abstract
At the advent of virtualization, primary storage equated to spinning disks. Today the enterprise
storage landscape is rapidly changing, with low-latency all-flash storage arrays, specialized
flash-based IO appliances, and hybrid arrays with built-in flash. Also, with the adoption of host-side
flash cache solutions (similar to vFlash), the read-write mix of operations emanating from the server
is more write-dominated, since reads are increasingly served locally from cache. Is the original ESXi
IO multipathing logic, developed for disk-based arrays, still applicable in this new flash storage era?
Are there optimizations we can develop as a differentiator in the vSphere platform for supporting
this core functionality?
This paper argues that the existing IO multipathing in ESXi is not optimal for flash-based arrays. In
our evaluation, the maximum IO throughput is not bound by a hardware resource bottleneck, but
rather by the Pluggable Storage Architecture (PSA) module that implements the multipathing logic.
The root cause is the affinity maintained by the PSA module between the host traffic and a subset
of the ports on the storage array (referred to as Active Optimized paths). Today the Active
Un-optimized paths are used only during hardware failover events, since un-optimized paths exhibit
higher service time than optimized paths. Thus, even though the Host Bus Adapter (HBA) hardware
is not completely saturated, we are artificially constrained in software by limiting IO to the Active
Optimized paths only.
We implemented a new multipathing approach called PSP Adaptive as a Path Selection Plug-in in
the PSA. This approach detects IO path saturation (leveraging existing SIOC techniques) and spreads
the write operations across all the available paths (optimized and un-optimized), while reads continue
to maintain their affinity paths. The key observation was that the higher service times on the
un-optimized paths are still lower than the wait times on the saturated optimized paths. Further,
read affinity is important to maintain, given the session-based prefetching and caching semantics
used by the storage arrays. During periods of non-saturation, our approach switches to the traditional
affinity model for both reads and writes. In our experiments we observed significant improvements
in throughput for some workload scenarios. We are currently working with a wide range of storage
partners to validate this model for various Asymmetric Logical Unit Access (ALUA) storage
implementations, and even MetroClusters.
Redefining ESXi IO Multipathing in the Flash Era
Fei Meng, North Carolina State University, meng@ncsu.edu
Li Zhou1, Facebook, Inc., lzhou@fb.com
Sandeep Uttamchandani, VMware, Inc., sandeepu@vmware.com
Xiaosong Ma, North Carolina State University & Oak Ridge National Lab, ma@csc.ncsu.edu
1 Li Zhou was a VMware employee when working on this project.
Multipathing in ESXi Today
Figure 1 shows the IO multipathing architecture of vSphere. In a typical SAN configuration, each
host has multiple Host Bus Adapter (HBA) ports connected to both of the storage array's controllers.
The host thus has multiple paths to the storage array and performs load balancing among all paths
to achieve better performance. In vSphere this is done by the Path Selection Plug-ins (PSPs) at the
PSA (Pluggable Storage Architecture) layer []. The PSA framework collapses multiple paths to the
same datastore and presents one logical device to the upper layers, such as the file system. Internally,
the NMP (Native Multipathing Plug-in) framework allows different path-selection policies by
supporting different PSPs. The PSPs decide which path an IO request is routed to. vSphere provides
three different PSPs: PSP_FIXED, PSP_MRU, and PSP_RR. Both PSP_FIXED and PSP_MRU utilize
only one path for IO requests and do not do any load balancing, while PSP_RR does simple
round-robin load balancing among all active paths for active/active arrays.
In summary, none of the existing path load-balancing implementations concurrently utilizes the
Active Optimized and Un-optimized paths of ALUA storage arrays for IO. VMware can provide
significant differentiated value by supporting PSP Adaptive as a native option for flash-based arrays.
The key contributions of this paper are: (1) a novel approach for I/O multipathing in ESXi, specifically
optimized for flash-enabled arrays; within this context, we designed an algorithm that adaptively
switches between traditional multipathing and spreading writes across Active Optimized and Active
Un-optimized paths; and (2) an implementation of the PSP Adaptive plug-in within the PSA module,
together with an experimental evaluation.
The rest of the paper is organized as follows: we first summarize multipathing in ESXi today and
related work, then cover the design details and evaluation, followed by conclusions.
2 Related Work
There are three different types of storage arrays with respect to the dual-controller implementation:
active/active, active/standby, and Asymmetric Logical Unit Access (ALUA) []. Defined in the SCSI
standard, ALUA provides a standard way for hosts to discover and manage multiple paths to the
target. Unlike active/active systems, ALUA designates one of the controllers as optimized, and the
current VMware ESXi path selection plug-in [] serves IOs from this controller. Unlike active/standby
arrays, which cannot serve IO through the standby controller, an ALUA storage array is able to serve
IO requests from both optimized and un-optimized controllers. ALUA storage arrays have become
very popular; today most mainstream storage arrays (e.g., most popular arrays made by EMC and
NetApp) support ALUA.
Multipathing for Storage Area Networks (SAN) was designed as a fault-tolerance technique to avoid
single points of failure, as well as to provide performance enhancement via load balancing [].
Multipathing has been implemented in all major operating systems, such as Linux (at different
storage stack layers) [], Solaris [], FreeBSD [], and Windows []. Multipathing has also been
offered as third-party products such as Symantec Veritas [] and EMC PowerPath []. ESXi has inbox
multipathing support in the Native Multipathing Plug-in (NMP), and it has several different flavors of
path selection algorithms (such as Round Robin, MRU, and Fixed) for different devices. ESXi also
supports third-party multipathing plug-ins in the forms of PSP (Path Selection Plug-in), SATP (Storage
Array Type Plug-in), and MPP (Multi-Pathing Plug-in), all under the ESXi PSA framework. EMC,
NetApp, Dell EqualLogic, and others have developed their solutions on ESXi PSA. Most of the
implementations do simple round robin among active paths [], based on the number of completed
IOs or transferred bytes for each path. Some third-party solutions such as EMC's PowerPath adopt
more complicated load-balancing algorithms, but performance-wise are only at par with, or even
worse than, VMware's NMP. Kiyoshi et al. proposed a dynamic load-balancing, request-based
device-mapper multipath for Linux, but did not implement this feature [].
Figure 1. High-level architecture of IO multipathing in vSphere.
In such cases, spreading writes to un-optimized paths helps lower the load on the optimized paths
and thereby boosts system performance. Thus, in our optimized plug-in, only writes are spread to
un-optimized paths.
3.3 Spread Start and Stop Triggers
Because of the asymmetric performance between optimized and un-optimized paths, we should
only spread IO to un-optimized paths when the optimized paths are saturated (i.e., there is IO
contention). Therefore accurate IO contention detection is key. Another factor we need to consider
is that the ALUA specification does not specify implementation details; different ALUA arrays from
different vendors can have different ALUA implementations, and hence different behaviors when
serving IO issued on the un-optimized paths. In our experiments we found that at least one ALUA
array shows unacceptable performance for IOs issued on the un-optimized paths. We need to take
this into account and design PSP Adaptive to detect such behavior and stop routing WRITEs to the
un-optimized paths if no IO performance improvement is observed. The following sections describe
the implementation details.
3.3.1 I/O Contention Detection
We apply the same techniques that SIOC (Storage IO Control) uses today to PSP Adaptive for IO
contention detection: IO latency thresholds. To avoid thrashing, two latency thresholds ta and tb
(ta > tb) are used to trigger the start and stop of write spread to non-optimized paths. PSP Adaptive
keeps monitoring the IO latency
3 Design and Implementation
Consider a highway and a local road, both leading to the same destination. When the traffic is
bounded by a toll plaza at the destination, there is no point in routing traffic to the local road.
However, if the toll plaza is removed, it starts to make sense to route a part of the traffic to the local
road during rush hours, because the contention point has shifted. The same strategy can be applied
to load balancing for ALUA storage arrays. When the array is the contention point, there is no point
in routing IO requests to the non-optimized paths: their latency is higher, and host-side IO bandwidth
is not the bound. However, when an array with flash is able to serve millions of IOPS, it is no longer
the contention point, and the host-side IO pipes can become the contention point during heavy IO
load. It then makes sense to route a part of the IO traffic to the un-optimized paths. Although the
latency on the un-optimized paths is higher, considering that the optimized paths are saturated,
using un-optimized paths can still boost aggregate system IO performance, with increased IO
throughput and IOPS. This should only be done during rush hours, when the IO load is heavy and
the optimized paths are saturated. The new PSP Adaptive plug-in is implemented using this strategy.
3.1 Utilize Active/Non-optimized Paths for ALUA Systems
Figure 1 shows the high-level overview of multipathing in vSphere. NMP collapses all paths and
presents only one logical device to upper layers, which can be used to store virtual disks of VMs.
When an IO is issued to the device, NMP queries the path selection plug-in, PSP_RR, to select a path
on which to issue the IO. Internally, PSP_RR uses a simple round-robin algorithm to select the path
for ALUA systems. Figure 2(a) shows the default path-selection algorithm: IOs are dispatched to all
Active Optimized paths alternately. Active Un-optimized paths are not used even if the Active
Optimized paths are saturated. This approach wastes resources when the optimized paths are saturated.
To improve performance when the Active Optimized paths are saturated, we spread WRITE IOs to
the un-optimized paths. Even though the latency will be higher compared to IOs using the optimized
paths, the aggregate system throughput and IOPS will be improved. Figure 2(b) illustrates the
optimized path dispatching.
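A minimal sketch of the dispatching policy of Figure 2(b) is shown below; the path objects, the IO
object, and the write_spread_on flag are placeholder abstractions rather than PSA interfaces.

    # Illustrative path selection: reads keep affinity to optimized paths, while
    # writes are spread over all paths only while the optimized paths are saturated.
    import itertools

    class AdaptiveSelector:
        def __init__(self, optimized_paths, unoptimized_paths):
            self.opt_cycle = itertools.cycle(optimized_paths)
            self.all_cycle = itertools.cycle(optimized_paths + unoptimized_paths)
            self.write_spread_on = False   # toggled by the contention detector

        def select_path(self, io):
            if io.is_write and self.write_spread_on:
                return next(self.all_cycle)   # spread writes across every path
            return next(self.opt_cycle)       # round robin on optimized paths only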
3.2 Write Spread Only
For any active/active (including ALUA) dual-controller array, each controller has its own cache for
data blocks of both reads and writes. On write requests, controllers need to synchronize their caches
with each other to guarantee data integrity; for reads it is not necessary to synchronize the cache.
As a result, reads have affinity to a particular controller while writes do not. We therefore assume
that issuing writes on either controller is symmetric, while it is better to issue reads on the same
controller so that the cache hit rate is higher.
Most workloads have many more reads than writes. However, with the increasing adoption of
host-side flash caching, the actual IOs hitting the storage controllers are expected to have a much
higher write/read ratio: a large portion of the reads will be served from the host-side flash cache,
while all writes still hit the array.
Figure 2. PSP Round-Robin policy.
of the optimized paths (to). If to exceeds ta, PSP Adaptive starts write spread; if to falls below tb,
PSP Adaptive stops write spread. Like SIOC, the actual values of ta and tb are set by the user and
can differ for different storage arrays.
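The start/stop trigger amounts to a hysteresis on the observed optimized-path latency; a minimal
sketch, assuming ta and tb are the user-configured values described above:

    # Hysteresis on the observed optimized-path latency (to): start write spread
    # when to exceeds ta, stop it when to falls below tb (with ta > tb).
    def update_write_spread(write_spread_on, to, ta, tb):
        if not write_spread_on and to > ta:
            return True    # optimized paths saturated: start spreading writes
        if write_spread_on and to < tb:
            return False   # contention cleared: fall back to the affinity model
        return write_spread_on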
3.3.2 Max I/O Latency Threshold
As described earlier, different storage vendors' ALUA implementations vary. The I/O performance on the un-optimized paths of some ALUA arrays can be very poor. For such arrays we should not spread I/O to the un-optimized paths. To handle such cases we introduce a third threshold, the max I/O latency tc (tc > ta). Latency higher than this value is unacceptable to the user. PSP Adaptive monitors the I/O latency on the un-optimized paths (tuo) when write spread is turned on. If PSP Adaptive detects tuo exceeding the value of tc, it concludes that the un-optimized paths should not be used and stops write spread.

A simple on/off switch is also added as a configurable knob for administrators. If a user does not want to use un-optimized paths, an administrator can simply turn the feature off through the esxcli command. PSP Adaptive will then behave the same as PSP RR, without spreading I/O to un-optimized paths.
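These two guards fit naturally around the hysteresis sketched above. Again, this is a sketch with assumed names rather than the shipped logic:

# Illustrative guards from Section 3.3.2 layered on the Section 3.3.1 hysteresis sketch.
def update_write_spread_guarded(selector, t_o, t_uo, t_a, t_b, t_c, feature_enabled):
    """t_uo: measured latency on un-optimized paths; t_c (> t_a) is the max acceptable latency."""
    if not feature_enabled:             # administrator disabled the feature: behave like PSP RR
        selector.write_spread = False
        return
    if selector.write_spread and t_uo > t_c:
        selector.write_spread = False   # un-optimized paths are too slow on this array
        return
    update_write_spread(selector, t_o, t_a, t_b)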
3.3.3 I/O Performance Improvement Detection
We want to spread I/O to un-optimized paths only if doing so improves aggregate system IOPS and/or throughput. PSP Adaptive continuously monitors the aggregate IOPS and throughput on all paths to the specific target. Detecting I/O performance improvements is complicated, however, because system load and I/O patterns (e.g., block size) can change; for this reason, I/O latency numbers alone cannot be used to decide whether system performance has improved.

To handle this situation, we monitor and compare both IOPS and throughput. When the I/O latency on the optimized paths exceeds threshold ta, PSP Adaptive saves the IOPS and throughput data as reference values before it turns on write spread to un-optimized paths. It then periodically checks whether the aggregate IOPS and/or throughput have improved by comparing them against the reference values. If they have not improved, it stops write spread; otherwise, no action is taken. To filter out noise, the improvement in either IOPS or throughput must exceed a minimum percentage before PSP Adaptive concludes that performance has improved.

Overall system performance is considered improved even if only one of the two measures (IOPS and throughput) improves, because an I/O pattern change should not decrease both values simultaneously. For example, if I/O block sizes go down, aggregate throughput could go down, but IOPS should go up. If system load goes up, both IOPS and throughput should go up with write spread. If both aggregate IOPS and throughput go down, PSP Adaptive concludes that it is because system load is going down.

If system load goes down, the aggregate IOPS and throughput could go down as well and cause PSP Adaptive to stop write spread. This is fine, because less system load means I/O latency will improve. Unless the I/O latency on the optimized paths exceeds ta again, write spread will not be turned on again.
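The comparison against the saved reference values can be sketched as follows; MIN_GAIN and the function shape are assumptions for illustration, since the paper does not state the exact improvement threshold:

# Illustrative improvement check from Section 3.3.3; threshold and structure are assumed.
MIN_GAIN = 0.05   # assumed minimum relative improvement needed to call it a real gain

def should_keep_spreading(ref_iops, ref_tput, cur_iops, cur_tput):
    iops_gain = (cur_iops - ref_iops) / ref_iops
    tput_gain = (cur_tput - ref_tput) / ref_tput
    # One improving metric is enough: a block-size change alone should not lower both.
    return iops_gain >= MIN_GAIN or tput_gain >= MIN_GAIN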
3.3.4 Impact on Other Hosts in the Same Cluster
Ordinarily, one host utilizing the un-optimized paths could negatively affect other hosts that are connected to the same ALUA storage array. However, as explained in the earlier sections, the greatly boosted I/O performance of new flash-based storage means the storage array is much less likely to become the contention point, even if multiple hosts are pumping heavy I/O load to the array simultaneously. The negative impact is therefore negligible. Our performance benchmark also confirms this.
4 Evaluation and Analysis
All the performance numbers are collected on one ESXi server. The physical machine configuration is listed in Table 1. Iometer running inside Windows VMs is used to generate the workload. Since vFlash and VFC were not available at the time the prototype was built, we used an Iometer load mix of random reads and random writes to simulate the effect of host-side caching, which changes the I/O READ/WRITE ratio hitting the array.
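As a back-of-the-envelope illustration of why host-side caching shifts that ratio (the numbers below are assumed for the example, not measurements from this evaluation):

# Assumed, illustrative numbers: a 70/30 read/write guest workload with an 80% read hit rate
# in the host-side flash cache. Only read misses and all writes reach the array.
reads, writes, read_hit_rate = 0.70, 0.30, 0.80
array_reads = reads * (1 - read_hit_rate)
array_writes = writes
total = array_reads + array_writes
print(f"array sees {array_reads / total:.0%} reads, {array_writes / total:.0%} writes")
# Prints: array sees 32% reads, 68% writes, a much write-heavier mix than the original 70/30.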
CPU: Intel Xeon, cores, logical CPUs
Memory: GB DRAM
HBA: Dual-port Gbps HBA
Storage Array: ALUA-enabled array with LUNs on SSD; each LUN MB
FC Switch: Gbps FC switch

Table 1. Testbed Configuration
Figure 3 and Figure 4 compare the performance of PSP RR and PSP Adaptive when there is I/O contention on the HBA port. By spreading WRITEs to un-optimized paths during I/O contention, PSP Adaptive is able to increase aggregate system IOPS and throughput at the cost of slightly higher average WRITE latency. The aggregate system throughput improvement also increases with increasing I/O block size.

Overall, the performance evaluation results show that PSP Adaptive can increase aggregate system throughput and IOPS during I/O contention and is self-adaptive to workload changes.
Figure 3. Throughput: PSP RR vs. PSP Adaptive
Figure 4. IOPS: PSP RR vs. PSP Adaptive
5 Conclusion and Future Work
With the rapid adoption of flash, it is important to revisit some of the fundamental building blocks of the vSphere stack. I/O multipathing is critical for scale and performance, and any improvements translate into an improved end-user experience as well as higher VM density on ESXi. In this paper we explored an approach that challenged the old wisdom of multipathing that active un-optimized paths should not be used for load balancing. Our implementation showed that spreading writes across all paths has advantages during contention but should be avoided during normal load scenarios. PSP Adaptive is a PSP plug-in that we developed to adaptively switch load-balancing strategies based on system load.

Moving forward, we are working to further enhance the adaptive logic by introducing a path-scoring attribute that ranks different paths based on I/O latency, bandwidth, and other factors. The score is used to decide whether a specific path should be used under different system I/O load conditions. Further, we want to decide the percentage of I/O requests that should be dispatched to a certain path. We could also combine the path score with I/O priorities by introducing priority queuing within PSA.

Another important storage trend is the emergence of active/active storage across metro distances. EMC's VPLEX [2] is the leading solution in this space. Similar to ALUA, active/active storage arrays expose asymmetry in service times even across active optimized paths, due to unpredictable network latencies and the number of intermediate hops. An adaptive multipath strategy could be useful for the overall performance.
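Purely as a hypothetical illustration of the path-scoring idea mentioned above, such a score could combine per-path attributes along these lines; the weights and attributes are not from this work:

# Hypothetical path score for the future-work idea above; weights and fields are not from the paper.
def path_score(latency_ms, bandwidth_mbps, is_optimized,
               w_latency=0.6, w_bandwidth=0.3, w_optimized=0.1):
    # Lower latency and higher bandwidth raise the score; optimized paths get a small bonus.
    return (w_latency * (1.0 / max(latency_ms, 0.001))
            + w_bandwidth * (bandwidth_mbps / 1000.0)
            + w_optimized * (1.0 if is_optimized else 0.0))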
References
1. EMC PowerPath. http://www.emc.com/storage/powerpath/powerpath.htm
2. EMC VPLEX. http://www.emc.com/storage/vplex/vplex.htm
3. Multipath I/O. http://en.wikipedia.org/wiki/Multipath_I/O
4. Implementing vSphere Metro Storage Cluster using EMC VPLEX. http://kb.vmware.com/kb/2007545
5. Solaris SAN Configuration and Multipathing Guide, 2000. http://docs.oracle.com/cd/E19253-01/820-1931/820-1931.pdf
6. FreeBSD disk multipath control. http://www.freebsd.org/cgi/man.cgi?query=gmultipath&apropos=0&sektion=0&manpath=FreeBSD+7.0-RELEASE&format=html
7. Nimbus Data Unveils High-Performance Gemini Flash Arrays, 2012. http://www.crn.com/news/storage/240005857/nimbus-data-unveils-high-performance-gemini-flash-arrays-with-10-year-warranty.htm
8. Proximal Data's Caching Solutions Increase Virtual Machine Density in Virtualized Environments, 2012. http://www.businesswire.com/news/home/20120821005923/en/Proximal-Data%E2%80%99s-Caching-Solutions-Increase-Virtual-Machine
9. DM Multipath, 2012. https://access.redhat.com/knowledge/docs/en-US/Red_Hat_Enterprise_Linux/6/html-single/DM_Multipath/index.html
10. EMC VFCache, Server Flash Cache, 2012. http://www.emc.com/about/news/press/2012/20120206-01.htm
11. Flash and Virtualization Take Center Stage at SNW for Avere, 2012. http://www.averesystems.com/
12. Multipathing policies in ESXi 4.x and ESXi 5.x, 2012. http://kb.vmware.com/kb/1011340
13. No-Compromise Storage for the Modern Datacenter, 2012. http://info.nimblestorage.com/rs/nimblestorage/images/Nimble_Storage_Overview_White_Paper.pdf
14. Pure Storage: 100% flash storage array: Less than the cost of spinning disk, 2012. http://www.purestorage.com/flash-array/
15. Symantec Veritas Dynamic Multi-Pathing, 2012. http://www.symantec.com/docs/DOC5811
16. Violin Memory, 2012. http://www.violin-memory.com/
17. VMware vFlash framework, 2012. http://blogs.vmware.com/vsphere/2012/12/virtual-flash-vflash-tech-preview.html
18. Who is WHIPTAIL?, 2012. http://whiptail.com/papers/who-is-whiptail/
19. Windows Multipath I/O, 2012. http://technet.microsoft.com/en-us/library/cc725907.aspx
20. XtremIO, 2012. http://www.xtremio.com/
21. NetApp Quietly Absorbs CacheIQ, Nov. 2012. http://www.networkcomputing.com/storage-networkingmanagement/netapp-quietly-absorbs-cacheiq/240142457
22. SolidFire Reveals New Arrays for White Hot Flash Market, Nov. 2012. http://siliconangle.com/blog/2012/11/13/solidfire-reveals-new-arrays-for-white-hot-flash-market/
23. Gridiron Intros New Flash Storage Appliance, Hybrid Flash Array, Oct. 2012. http://www.crn.com/news/storage/240008223/gridiron-introsnew-flash-storage-appliance-hybrid-flash-array.htm
24. Byan, S., Lentini, J., Madan, A., and Pabon, L. Mercury: Host-side Flash Caching for the Data Center. In MSST (2012), IEEE, pp. 1-12.
25. EMC ALUA System. http://www.emc.com/collateral/hardware/white-papers/h2890-emc-clariion-asymm-active-wp.pdf
26. Kiyoshi Ueda, Junichi Nomura, M. C. Request-based Device-mapper multipath and Dynamic load balancing. In Proceedings of the Linux Symposium (2007).
27. Luo, J., Shu, J.-W., and Xue, W. Design and Implementation of an Efficient Multipath for a SAN Environment. In Proceedings of the 2005 International Conference on Parallel and Distributed Processing and Applications (Berlin, Heidelberg, 2005), ISPA'05, Springer-Verlag, pp. 101-110.
28. Michael Anderson, P. M. SCSI Mid-Level Multipath. In Proceedings of the Linux Symposium (2003).
29. Ueda, K. Request-based dm-multipath, 2008. http://lwn.net/Articles/274292/
Methodology for Performance Analysis of VMware vSphere under Tier-1 Applications

Jeffrey Buell (VMware Performance), Daniel Hecht (VMware Hypervisor Group), Jin Heo (VMware Performance), Kalyan Saladi (VMware Performance), H. Reza Taheri (VMware Performance)
jbuell@vmware.com, dhecht@vmware.com, heoj@vmware.com, ksaladi@vmware.com, rtaheri@vmware.com

1 When VMware introduced its server-class, type 1 hypervisor in 2001, it was called ESX. The name changed to vSphere with release 4.1 in 2009.
Abstract
With the recent advances in virtualization technology, both in the hypervisor software and in the processor architectures, one can state that VMware vSphere runs virtualized applications at near-native performance levels. This is certainly true against a baseline of the early days of virtualization, when reaching even half the performance of native systems was a distant goal. However, this near-native performance has made the task of root-causing any remaining performance differences more difficult. In this paper we will present a methodology for a more detailed examination of the effects of virtualization on performance, and present sample results. We will look at the performance on vSphere of an OLTP workload, a typical Hadoop application, and low-latency applications. We will analyze the performance and highlight the areas that cause an application to run more slowly in a virtualized server. The pioneers of the early days of virtualization invented a battery of tools to study and optimize performance. We will show that, as the gap between virtual and native performance has closed, these traditional tools are no longer adequate for detailed investigations. One of our novel contributions is combining the traditional tools with hardware monitoring facilities to see how the processor execution profile changes on a virtualized server.

We will show that the increase in translation lookaside buffer (TLB) miss handling costs due to the hardware-assisted memory management unit (MMU) is the largest contributor to the performance gap between native and virtual servers. The TLB miss ratio also rises on virtual servers, further increasing the miss processing costs. Many of the other performance differences with native (e.g., additional data cache misses) are also due to the heavier TLB miss processing behaviour of virtual servers. Depending on the application, part of the performance difference is also due to the time spent in the hypervisor kernel. This is expected, as all networking and storage I/O has to get processed twice: once in the virtual device in the guest, and once in the hypervisor kernel. The hypervisor's virtual machine monitor (VMM) is responsible for only a small fraction of the overall time. In other words, the VMM, which was responsible for much of the virtualization overhead of the early hypervisors, is now a small contributor to virtualization overhead. The results point to new areas, such as TLBs and address translation, to work on in order to further close the gap between virtual and native performance.

Categories and Subject Descriptors: C [Computer Systems Organization]: Performance of Systems: Design studies, Measurement techniques, Performance attributes
General Terms: Measurement, Performance, Design, Experimentation
Keywords: Tier-1 applications, database performance, hardware counters
Introduction
Over the years, the performance of applications running on vSphere has gone from a fraction of native performance in the very early days of ESX to what is commonly called near-native performance, a term that has come to imply a very small overhead. Many tools and methods were invented along the way to measure and tune performance in virtual environments []. But with overheads at such low levels, drilling down into the source of performance differences between native and virtual has become very difficult.

The genesis of this work was a project to study the performance of vSphere running Tier-1 applications, in particular a relational database running an Online Transaction Processing (OLTP) workload, which is commonly considered a very heavy workload. Along the way we discovered that existing tools could not account for the difference between native and virtual performance. At this point we turned to hardware event monitoring tools and combined them with software profiling tools to drill down into the performance overheads. Later measurements with Hadoop and latency-sensitive applications showed that the same methodology and tools can be used for investigating the performance of those applications, and furthermore that the sources of virtualization performance overhead are similar for all these workloads. This paper describes the tools and the methodology for such measurements.

vSphere hypervisor
VMware vSphere contains various components, including the virtual machine monitor (VMM) and the VMkernel. The VMM implements the virtual hardware that runs a virtual machine. The virtual hardware comprises the virtual CPUs, timers, and devices. A virtual CPU includes a Memory Management Unit (MMU), which
We relied heavily on Oracle statspack (also known as AWR) stats for the OLTP workload.

We collected and analyzed Unisphere performance logs from the arrays to study any possible bottlenecks in the storage subsystem for the OLTP workload.

Hadoop maintains extensive statistics on its own execution. These are used to analyze how well-balanced the cluster is, and the execution times of various tasks and phases.
2.3.1 Hardware counters
The above tools are commonly used by vSphere performance analysts inside and outside VMware to study the performance of vSphere applications, and we made extensive use of them. But one of the key take-aways of this study is that these tools are not enough to really understand the sources of virtualization overhead. We needed to go one level deeper and augment the traditional tools with tools that reveal the hardware execution profile.

Recent processors have expanded the categories of hardware events that software can monitor []. But using these counters requires:
1. A tool to collect data
2. A methodology for choosing from among the thousands of available events, and for combining the event counts to derive statistics for events that are not directly monitored
3. Meaningful interpretation of the results

All of the above typically require a close working relationship with the microprocessor vendors, which we relied on heavily to collect and analyze the data in this study.

Processor architecture has become increasingly complex, especially with the advent of multiple cores residing in a NUMA node, typically with a shared last-level cache. Analyzing application execution on these processors is a demanding task and necessitates examining how the application interacts with the hardware. Processors come equipped with hardware performance monitoring units (PMUs) that enable efficient monitoring of hardware-level activity without significant performance overhead imposed on the application or OS. While the implementation details of a PMU can vary from processor to processor, two common types of PMU are Core and Uncore. Each processor core has its own PMU, and in the case of hyper-threading each logical core appears to have a dedicated core PMU all by itself.

A typical Core PMU consists of one or more counters that are capable of counting events occurring in different hardware components such as the CPU, cache, or memory. The performance counters can be controlled individually or as a group; the controls offered are usually enable, disable, and detect overflow (via generation of an interrupt), to name a few, and have different modes (user/kernel). Each performance counter can monitor one event at a time. Some processors restrict or limit the type of events that can be monitored from a given counter. The counters and their control registers can be accessed through special registers (MSRs) and/or through the PCI bus. We will look into details of the PMUs supported by the two main x86 processor vendors, AMD and Intel.
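As a concrete illustration of collecting such counter data on Linux (this is not the tooling used in the study, and the generic event aliases below may map to different raw events on Intel and AMD parts), the perf facility mentioned in Section 2.3 can be driven from a small script:

# Illustrative only: gather a few generic hardware events system-wide with Linux 'perf stat'.
# Assumes perf is installed and these event aliases are supported on the host.
import subprocess

EVENTS = ["cycles", "instructions", "dTLB-load-misses", "iTLB-load-misses"]

def collect(duration_s=10):
    # -a: all CPUs; -x,: CSV output (written to stderr), one line per event
    cmd = ["perf", "stat", "-a", "-x", ",", "-e", ",".join(EVENTS),
           "sleep", str(duration_s)]
    result = subprocess.run(cmd, capture_output=True, text=True)
    counts = {}
    for line in result.stderr.splitlines():
        fields = line.split(",")
        if len(fields) >= 3 and fields[0].strip().isdigit():
            counts[fields[2]] = int(fields[0])
    return counts

if __name__ == "__main__":
    stats = collect()
    if stats.get("instructions"):
        misses = stats.get("dTLB-load-misses", 0)
        print("dTLB load misses per 1K instructions: %.3f"
              % (1000.0 * misses / stats["instructions"]))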
2.2.2 Hadoop Workload Configuration
Cluster of HP DL servers
Two 3.6 GHz 4-core Intel Xeon X5687 (Westmere-EP) processors with hyper-threading enabled
72GB 1333MHz memory, 68GB used for the VM
16 local 15K RPM SAS drives, 12 used for Hadoop data
Broadcom 10GbE NICs connected together in a flat topology through an Arista switch
Software: vSphere 5.1, RHEL 6.1, CDH 4.1.1 with MRv1
vSphere network driver bnx2x, upgraded for best performance
2.2.3 Latency-sensitive workload configuration
The testbed consists of one server machine for serving RR (request-response) workload requests and one client machine that generates the RR requests.

The server and client machines are each configured with dual-socket, quad-core Intel Xeon processors. Both machines are equipped with a 10GbE Broadcom NIC. Hyper-threading is not used.

Two different configurations are used. A native configuration runs native RHEL Linux on both the client and server machines, while a VM configuration runs vSphere on the server machine and native RHEL on the client machine. The vSphere host runs a VM with the same version of RHEL Linux. For both the native and VM configurations, only one CPU (one vCPU for the VM configuration) is configured to be used, to remove the impact of parallel processing and discard any multi-CPU related overhead.

The client machine is used to generate the RR workload requests, and the server machine is used to serve the requests and send replies back to the client in both configurations. VMXNET is used for the virtual NICs in the VM configuration. A Gigabit Ethernet switch is used to interconnect the two machines.
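For illustration only, the essence of such an RR workload is a synchronous request-reply ping-pong. The sketch below assumes an echo-style server is already listening on the given port and is not the workload generator used in the study:

# Hypothetical RR latency probe; assumes an echo-style server on (host, port).
import socket
import time

def rr_latency(host, port, payload=b"x" * 64, iters=1000):
    samples = []
    with socket.create_connection((host, port)) as s:
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)  # send each small request immediately
        for _ in range(iters):
            start = time.perf_counter()
            s.sendall(payload)
            s.recv(len(payload))                                 # block until the reply arrives
            samples.append(time.perf_counter() - start)
    samples.sort()
    return samples[len(samples) // 2], samples[int(len(samples) * 0.99)]  # median and 99th percentile

# Example with a hypothetical endpoint: median_s, p99_s = rr_latency("server", 9000)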
2.3 Tools
Obviously, tools are critical in a study like this.

We collected data from the usual Linux performance tools, such as mpstat, iostat, sar, numastat, and vmstat.

With the later RHEL 6.1 experiments, we used the Linux perf [] facility to profile the native/guest application.

Naturally, in a study of virtualization overhead, vSphere tools are the most widely used.

Esxtop is comm