Optimized SDDC
Transcript of Optimized SDDC
© 2014 VMware Inc. All rights reserved.
Optimized SDDC
Iwan ‘e1’ Rahabok
Staff SE (Strategic Accounts) & CTO Ambassador
[email protected] | 9119-9226 | Linkedin.com/in/e1ang | Tweeter: e1_ang
https://www.facebook.com/groups/vmware.users/
VCAP-DCD, TOGAF Certified, vExpert
CONFIDENTIAL 2
Peer comparison
• Average consolidation ratio for VSI (not VDI)
  – 1 – 10: ___ customers
  – 11 – 20: ___ customers
  – 21 – 30: ___ customers
  – >30: ___ customers
• Degree of virtualisation (total servers; UNIX counted as physical)
  – <60%: ___ customers
  – 60 – 85%: ___ customers
  – 86 – 99%: ___ customers
  – 100%: ___ customers
Peer comparison
• No. of Server VMs in your company
  – <1000 VMs: ___ customers
  – 1000 – 5000 VMs: ___ customers
  – 5000 – 10K VMs: ___ customers
  – >10K VMs: ___ customers
• No. of VDI VMs
  – <2500 VMs: ___ customers
  – 2500 – 10K VMs: ___ customers
  – 10K – 25K VMs: ___ customers
  – >25K VMs: ___ customers
Warm-up exercise
You get an email from the app team saying the main Intranet application was slow.
• The email arrived 1 hour ago. It stated the application was slow for 1 hour, and was ok after that.
• So it was slow between 1 and 2 hours ago, but is ok now.
• You did a check. Everything is indeed ok in the past 1 hour.
• The application spans 10 VMs in 2 different clusters, 4 datastores and 1 RDM.
• You are not familiar with the application. You do not know what apps run on each VM, as you have no access to the Guest OS.
• Your environment: 1 vCenter, 4 clusters, 30 hosts, 500 VMs, 40 datastores, 1 midrange array, 10 GE, iSCSI storage.
Test your vSphere knowledge! How do you solve/approach this with just vSphere?
What do you do?
A: Smile, as this will be a nice challenge for your TAM/BCS/MCS/RE.
B: No sweat, you're VCDX + CCIE + ITIL Master. You were born for this.
C: SMS your wife: "Honey, I'm staying overnight at the datacenter."
D: Take blood pressure medicine so it won't shoot up.
E: Buy the app team a very nice dinner, and tell them to keep quiet.
What do you optimize for?
• Performance
• Cost
• Availability
• Recover-ability
• Security?
• Manageability?
• What about…
  – Upgrade-ability
  – "Debug-ability"?
We will dive into a few topics so you can go back and implement.
This session does not cover the products. vC Ops and SRM will be covered in later sessions.
What is an optimised SDDC Infrastructure?
• That thin line where Demand meets Supply.
  – Too much Supply, and you're not optimizing for cost.
  – Too much Demand, and you're not optimizing for performance and availability.
• It's optimised when it has met the criteria you set when designing your infrastructure.
Optimized Performance
Performance: How do you know it's optimised?
• What do you measure?
  – Utilisation?
    • Utilisation of 100% means it's performing…?
    • Utilisation of 5% means it's performing…?
    • Utilisation of 50% means it's performing…? Really?
  – Something else?
    • What is that something else?
To understand this "something else", we need to go back to fundamentals.
What do we care about at each layer?

At the VM level (the VM Owner's point of view):
1. We care whether the VM is being served well by the platform. Other VMs are irrelevant from the VM Owner's point of view. Make sure it is not contending for resources.
2. We check whether it is sized properly. If too small, increase its configuration. If too big, right-size it for better performance.

At the SDDC platform level:
1. We care whether the platform is serving everyone well. Make sure there is no contention for resources among all the VMs on the platform.
2. We check overall utilisation. Too low, and we are not investing wisely in hardware. Too high, and we need to buy more hardware.
Take Away: Contention and Utilisation
• Unlike a physical DC, in virtual infrastructure…
  – we use Contention, not Utilisation, as the primary counter for Performance Management
  – we use Utilisation (short range) as a secondary input for Performance Management
  – we use Utilisation (long range) for Capacity Management
• Contention is how you measure that the platform is performing well.
• Sounds good! But how do you measure "Contention"?
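The split above can be sketched as a simple decision rule. This is an illustrative sketch, not VMware guidance; the thresholds and messages are assumptions you would replace with your own tier's values.

```python
def assess(contention_pct, utilization_pct,
           contention_sla=3.0, util_high=80.0, util_low=20.0):
    """Classify state: contention drives performance management,
    utilisation drives capacity management (illustrative thresholds)."""
    findings = []
    # Performance management: contention is the primary signal.
    if contention_pct > contention_sla:
        findings.append("performance: VMs are contending; rebalance or add capacity")
    # Capacity management: long-range utilisation, both directions.
    if utilization_pct > util_high:
        findings.append("capacity: utilisation high; plan hardware purchase")
    elif utilization_pct < util_low:
        findings.append("capacity: utilisation low; hardware under-invested")
    return findings or ["healthy: low contention, sensible utilisation"]

print(assess(contention_pct=5.0, utilization_pct=85.0))
print(assess(contention_pct=1.0, utilization_pct=50.0))
```

Note that high utilisation with low contention is a capacity finding, not a performance one, which is exactly the distinction the slide makes.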
Performance: The counters
• What counters prove that it is optimised?
  – You need a technical fact to assure yourself.
    • Either that, or take a sleeping pill at night.
  – You need a technical fact to show to your customers.
    • Your SLA must be based on something concrete, not subject to interpretation or the "feeling of the day".
  – If you can't prove it, how does anyone know it is optimised? ;-)
Optimized Infrastructure Performance*
• CPU
• RAM
• Storage
• Network
* While keeping Cost in mind.
Our scope is Infrastructure. Applications are a separate topic, covered later by Kasim.
How a VM gets its resources
(Diagram: a vertical scale from 0 up to Provisioned, marking the Limit, Reservation and Entitlement levels, with the Contention, Usage and Demand counters plotted against it.)
VM CPU: The 4 States
VM CPU: What do you monitor?
• Contention
  – Ready (ms)?
  – Co-Stop (ms)?
  – Latency (%)?
  – Max Limited (ms)?
  – Overlap (ms)?
  – Swap Wait (ms)?
• Utilisation
  – Used (ms)?
  – Usage (%)?
  – Demand (MHz)?
Quiz Time! What's the difference between Average, Summation and Latest? How does the timeline impact the value?
VM CPU: What you should monitor

In vCenter Operations:
• Contention: Contention (%)
• Utilisation: Workload (%)

In vCenter:
• Contention: Latency (%), Max Limited (if applicable)
• Utilisation: Usage (%), Demand (MHz)

Discussion Time! What should the value be for an optimized environment?
One more thing…
• The hypervisor does not have visibility inside the Guest OS.
• There is 1 particular CPU counter that you should get. It tells you that there is not enough CPU to meet demand.
• vCenter Operations (via Hyperic) does not collect this counter.
• Which counter is that?
Enough about CPU.Let’s move to RAM!
Quiz Time!
• Which of the following sentences are True?
  – Ballooning is bad. If you see a VM has ballooned, that VM has a memory performance problem.
  – Ballooning happens before Compression, which happens before Swapping. If you see a VM with Compressed memory but no Ballooned memory, then vCenter is buggy, or your eyes are just tired.
  – If all the VMs on an ESXi host have a low Usage counter, then the ESXi host's Usage must also be low.
  – Turn on Large Pages, and there goes all your TPS.
  – To check if a VM has memory contention, check its CPU Swap Wait counter.
  – Why are all the questions difficult?!
• Answers
  – Ballooning indicates the ESXi host has memory pressure. It does not mean the VM has a memory performance issue.
  – Pages remain compressed or swapped if they are not accessed.
  – The Usage counter is different for a VM and for ESXi! For a VM it is based on Active; for ESXi it is based on Consumed. This is due to the 2-level memory concept.
  – Yes, unless your ESXi host is under heavy memory constraint.
2 Levels of Memory Hierarchy
• There is a hierarchy in VMware's memory overcommit technology:
  – Transparent Page Sharing
  – Ballooning
  – Memory Compression
  – Swap to Host Cache (SSD)
  – Disk swapping
• Decompression is sub-ms, compared to swap-in (15-20 ms)!
(Diagram: Guest OS memory management sits on top of Hypervisor memory management.)
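The latency gap above is why compression sits above disk swapping in the hierarchy. A rough back-of-envelope sketch, using hypothetical per-page costs in the range the slide quotes:

```python
# Back-of-envelope cost of recovering 10,000 reclaimed pages.
# Per-page latencies are assumptions in the range the slide quotes:
# sub-millisecond to decompress vs 15-20 ms to swap in from disk.
PAGES = 10_000
DECOMPRESS_MS = 0.05   # assumed sub-millisecond decompression
SWAP_IN_MS = 15.0      # low end of the 15-20 ms swap-in range

decompress_total_s = PAGES * DECOMPRESS_MS / 1000
swap_total_s = PAGES * SWAP_IN_MS / 1000

print(f"decompress: {decompress_total_s:.1f} s, swap-in: {swap_total_s:.1f} s")
```

With these assumed numbers, recovering the same working set is 300x slower from disk swap than from compressed memory, which is why a compressed-but-not-ballooned VM is not automatically a problem.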
vSphere Memory Management
• 2 types of Memory Management:
  – Guest OS level
    • Balloon
  – Hypervisor level
    • TPS
    • Compression, Swap to disk, Swap to cache (SSD)
Volunteer Time! Explain Balloon, TPS, Compression.
VM RAM: What do you monitor?
• Contention
  – Swapped?
  – Balloon?
  – Compressed?
  – Latency?
  – CPU Swap Wait?
• Utilisation
  – Active?
  – Usage?
  – Consumed?
VM RAM: What you should monitor

In vCenter Operations:
• Contention: RAM Contention (%)
• Utilisation: Workload (%), Consumed (KB)

In vCenter:
• Contention: Latency (%), CPU Swap Wait (ms)
• Utilisation: Usage (%), Consumed (KB)

Discussion Time! What should the value be for an optimized environment?
One more thing…
• The hypervisor does not have visibility inside the Guest OS.
• There is 1 particular RAM counter that you should get. It tells you that there is not enough RAM to meet demand.
• Which counter is that?
• You can monitor Guest OS paging activity by separating the page file into its own vmdk.
  – You can then use vC Ops to analyse the pattern.
Enough about RAM.Let’s move to Storage!
Quiz Time!
• Which of the following sentences are True?
  – The latency counter is (Write Latency + Read Latency) / 2.
  – If you have an RDM, vCenter does not track the latency.
  – If the VM virtual disk counter shows 1000 IOPS, but the VM datastore counter shows twice the IOPS, something is seriously wrong. Time to call your TAM!
  – If all your VMs are experiencing high latency, the first thing you do is check the VMkernel queue.
• Answers
  – It is not. It takes into account the number of commands issued; it's a weighted average.
  – vCenter does track it, but only the latency of the latest data point; it does not include the other data during the collection period.
  – Check for a snapshot. Snapshot IOPS are transparent to the virtual disk counter.
  – The first thing you do is check the physical device queue and your storage array. The VMkernel queue rarely exceeds 1 ms.
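The weighted average in the first answer can be made concrete with a small sketch. The function name and the sample numbers are hypothetical; the point is that weighting by commands issued gives a very different answer than a naive (read + write) / 2.

```python
def weighted_latency(samples):
    """Weighted average latency across IO streams.

    samples: list of (commands_issued, latency_ms) tuples.
    A plain (read + write) / 2 ignores how many commands each
    stream issued; the weighted form does not.
    """
    total_cmds = sum(cmds for cmds, _ in samples)
    if total_cmds == 0:
        return 0.0
    return sum(cmds * lat for cmds, lat in samples) / total_cmds

# Hypothetical interval: 900 writes at 2 ms, 100 reads at 20 ms.
samples = [(900, 2.0), (100, 20.0)]
print(weighted_latency(samples))   # 3.8 ms, dominated by the many fast writes
print((2.0 + 20.0) / 2)            # naive average: 11.0 ms, misleading
```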
VM Storage: where and what do you monitor?
• For vmdk, use the Datastore metric group.
• For RDM, use the Disk metric group.
• The Disk metric group is naturally not relevant for NFS (files).
(Diagram: a VM with three virtual disks (scsi0:0 on a VMFS datastore, scsi0:1 on an NFS datastore, scsi0:2 as an RDM), showing where the Virtual Disk, Datastore and Disk metric groups apply on the path down to the physical disks Disk 1, Disk 2 and Disk 3.)
VM Storage: What do you monitor?

In vCenter Operations:
• Contention: Latency (ms)
• Utilisation: Commands per second, Usage (KBps), Workload (%)

In vCenter:
• Contention: Latency (ms)
• Utilisation: Commands Issued, Usage (KBps)
VM Network
• Contention
  – Dropped packets
  – Packet retransmits
• Utilisation
  – Network throughput
• Limitations
  – We cannot monitor latency (e.g. between source and destination).
Different Tiers, Different Optimization
• Business logic:
  – Tier 1 is optimised for Performance and Availability.
  – Tier 3 is optimised for Cost.
• Do you allow a Tier 1 VM on Tier 3 Storage?
  – Or do you map the Compute Tier to the Storage Tier?
• What distinguishes Tier 1 from Tier 3?
  – Availability
  – Performance
  – Monitoring
  – Cost!
Tiering: Considerations
• Compute
  – No. of spare hosts
  – No. of hosts
  – Consolidation Ratio (VM:Host)
  – vCPU:pCPU oversubscription
  – vRAM:pRAM oversubscription
  – Clustering (e.g. VCS)
• Storage
  – IOPS per VM
  – Latency
• Monitoring
  – Application availability monitoring (e.g. App HA)
  – Application performance monitoring (e.g. vC Ops Enterprise)
• Availability
  – Automated DR (SRM)
  – RPO
  – RTO
3-Tier Offering: Example

                                     Tier 1      Tier 2      Tier 3
No. of spare hosts                   2           1           1
No. of hosts                         6           8           10
Consolidation Ratio (VM:Host)        10:1        20:1        40:1
vCPU:pCPU oversubscription           n/a         2.0x        4.0x
vRAM:pRAM oversubscription           n/a         1.5x        2.0x
IOPS per VM                          400         200         100
Latency                              <10 ms      15-20 ms    20-25 ms
Clustering (e.g. VCS)                Yes         Yes         No
Application monitoring (e.g. AppHA)  Yes         Yes         No
Apps                                 Yes         Yes         No
Automated DR (SRM)                   Yes         Yes         Yes
RPO                                  5 minutes   1-2 hours   2-8 hours
RTO                                  1 hour      <2 hours    <4 hours
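A tier catalogue like the example above can be encoded so that placement or capacity tooling can reason about it. The dictionary below mirrors a few rows of the example table; the structure and the helper function are illustrative, not a VMware artifact.

```python
TIERS = {
    # Values taken from the example tier table; adjust to your own offering.
    "tier1": {"spare_hosts": 2, "hosts": 6,  "vm_per_host": 10,
              "iops_per_vm": 400, "max_latency_ms": 10},
    "tier2": {"spare_hosts": 1, "hosts": 8,  "vm_per_host": 20,
              "iops_per_vm": 200, "max_latency_ms": 20},
    "tier3": {"spare_hosts": 1, "hosts": 10, "vm_per_host": 40,
              "iops_per_vm": 100, "max_latency_ms": 25},
}

def cluster_capacity(tier_name):
    """Max VMs a tier's cluster supports once spare hosts are set aside."""
    t = TIERS[tier_name]
    usable_hosts = t["hosts"] - t["spare_hosts"]
    return usable_hosts * t["vm_per_host"]

print(cluster_capacity("tier1"))  # (6 - 2) hosts x 10 VM/host = 40 VMs
print(cluster_capacity("tier3"))  # (10 - 1) hosts x 40 VM/host = 360 VMs
```

This also makes the cost trade-off visible: Tier 3 carries 9x the VMs of Tier 1 on fewer spare hosts, which is exactly what "optimised for cost" means here.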
Demystifying "Peak"
• There are 2 types of "Peak":
  – Peak across time
  – Peak across objects
• Impacts:
  – Peak across time can be too high if the burst is high.
    • A VM is low for 24 hours, bursts to 100% for 5 minutes, and you get 100% reported.
  – Peak across time can be too low if the number of member objects is high.
    • The peak of a cluster in the past 1 day is 70%. That only means at least 1 host was >70%.
  – Peak across objects can be too high if the load is unbalanced.
    • This happens when cluster utilisation is not high enough to trigger DRS or Storage DRS.
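The two peaks can be made concrete with a small sketch over hypothetical per-host utilisation samples. The host names and numbers are made up; the point is how the same data yields different "peaks" depending on which axis you aggregate over.

```python
# Utilisation samples (%) for 3 hosts over 5 intervals (hypothetical).
samples = {
    "host1": [30, 35, 95, 30, 30],   # a short burst to 95%
    "host2": [40, 40, 40, 40, 40],
    "host3": [20, 20, 20, 20, 20],
}

# Peak across time: for each host, the max over the timeline.
peak_across_time = {h: max(vals) for h, vals in samples.items()}

# Peak across objects: for each interval, the busiest host.
peak_across_objects = [max(col) for col in zip(*samples.values())]

# The cluster average hides the burst entirely.
cluster_avg = [sum(col) / len(samples) for col in zip(*samples.values())]

print(peak_across_time)    # host1 reports 95 even though it was mostly idle
print(peak_across_objects)
print(max(cluster_avg))    # the cluster-level view never shows the burst
```

host1's peak across time is 95% (too high, driven by one burst), while the cluster average never exceeds about 52% (too low, diluted by quiet members): both distortions the slide describes.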
Sample SLA and Internal Threshold

SLA (the SLA only applies to the VM; the VM owner does not care about the underlying platform):
                 Tier 1    Tier 2    Tier 3
CPU Contention   3%        8%        13%
RAM Contention   0%        5%        10%
Disk Latency     10 ms     20 ms     30 ms

Internal Threshold (tighter, so you act before the SLA is breached):
                 Tier 1    Tier 2    Tier 3
CPU Contention   2%        6%        10%
RAM Contention   0%        2%        8%
Disk Latency     10 ms     15 ms     20 ms
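A breach check against thresholds like these is straightforward to script. The threshold values are copied from the sample internal-threshold table; the function name and metric keys are illustrative.

```python
# Internal thresholds from the sample table (act before the SLA is hit).
INTERNAL = {
    "tier1": {"cpu_contention": 2.0, "ram_contention": 0.0, "disk_latency_ms": 10.0},
    "tier2": {"cpu_contention": 6.0, "ram_contention": 2.0, "disk_latency_ms": 15.0},
    "tier3": {"cpu_contention": 10.0, "ram_contention": 8.0, "disk_latency_ms": 20.0},
}

def breaches(tier, metrics):
    """Return the metrics that exceed the tier's internal threshold.

    metrics: dict with the same keys as the threshold table.
    """
    limits = INTERNAL[tier]
    return {k: v for k, v in metrics.items() if v > limits[k]}

vm = {"cpu_contention": 4.5, "ram_contention": 0.0, "disk_latency_ms": 12.0}
print(breaches("tier1", vm))  # CPU contention and disk latency exceed Tier 1
print(breaches("tier3", vm))  # the same VM is fine on Tier 3: {}
```

The same VM metrics breach Tier 1 but pass Tier 3, which is the whole point of tiered SLAs: the counter is judged against the tier the VM owner paid for.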
Where to monitor at the Platform level?
• Compute
  – Host? Cluster? Datacenter? vCenter?
• Storage
  – Host? Cluster? Datastore? Datastore Cluster? Datacenter? vCenter?
• Network
  – Standard Switch and port group? :-)
  – Host? Distributed Switch? Distributed Port Group?
Where to monitor

Not here (DRS and Storage DRS will balance the cluster):
• Compute: Host, Datacenter
• Storage: Host, Cluster
• Network: Host

Monitor these:
• Compute: Cluster
• Storage: Datastore, Datastore Cluster
• Network: Distributed Switch, Distributed Port Group
QoS in a shared environment
• QoS is mandatory in a shared environment.
• Areas to control:
  – Compute
  – Network
  – Storage
• For CPU and RAM:
  – Shares
  – Reservation
  – Limit?
  – Resource Pool?
• Storage I/O Control
• Network I/O Control
QoS: Compute
• When should you not use a Resource Pool?
• When should you use a Resource Pool?
• What's the impact of Reservation?
  – HA Slot Size (unless you use percentage-based admission control)
  – Boot time
  – Oversubscription: you cannot go beyond 100% reservation.
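The HA slot-size impact can be illustrated numerically. This is a simplified model, an assumption for illustration only: the slot is taken as the largest CPU and RAM reservation among powered-on VMs, ignoring memory overhead and the default minimum slot values that the real admission-control algorithm applies.

```python
# Simplified HA slot-size arithmetic (hypothetical reservations;
# ignores VM memory overhead and default minimum slot values).
vm_reservations = [            # (cpu_mhz, ram_mb) per powered-on VM
    (500, 1024),
    (500, 1024),
    (4000, 8192),              # one large reservation inflates the slot
]

# Slot size: the largest CPU and RAM reservation in the cluster.
slot_cpu = max(cpu for cpu, _ in vm_reservations)
slot_ram = max(ram for _, ram in vm_reservations)

host_cpu_mhz, host_ram_mb = 20000, 65536
slots_per_host = min(host_cpu_mhz // slot_cpu, host_ram_mb // slot_ram)

print(f"slot = {slot_cpu} MHz / {slot_ram} MB -> {slots_per_host} slots per host")
```

One VM with a large reservation shrinks the whole cluster to 5 slots per host, even though the small VMs would fit far more densely. This is why percentage-based admission control sidesteps the problem.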
QoS: Storage
• A single VM can hog storage throughput.
  – It just needs to run IOmeter.
  – This unfairly penalizes VMs on hosts with high consolidation ratios.
• Existing resource management only works among VMs on the same host.
• SIOC calculates datastore latency to identify contention.
  – Latency is a normalized average across VMs.
  – IO size and IOPS are included.
Without SIOC – Latency is Unbounded
(Diagram: two ESX servers issue IO to a congested storage array queue, each with a device queue depth of 24. Without Storage IO Control, the actual disk resources utilized by each VM are not in the correct ratio.)
QoS: Storage
• SIOC enforces fairness when datastore latency crosses a threshold.
  – Dynamic threshold setting.
  – Fairness is enforced by limiting VMs' access to queue slots.
• What are the limitations?
  – No inter-datastore awareness.
  – Does not work on RDM.
  – Non-VM workloads are not included.
• Work with your Storage team.
  – Auto-tiering arrays are supported.
With SIOC – Latency is Controlled
(Diagram: VM A with 1500 shares and VMs B and C with 500 shares each, spread across two ESX servers. SIOC throttles the storage queue by reducing the device queue depths, so the actual disk resources utilized by each VM are in the correct 60% / 20% / 20% ratio, even across ESX hosts, and the storage array is controlled rather than congested.)
Key Takeaways
• Optimization in SDDC has many more components than we normally think.
• Contention is 1st. Utilisation is 2nd.
• The SLA is at the VM level, not the Infrastructure level.
• Peak can be too low or too high.
• Anything else?
Optimized Availability
Take a 2-minute bio-break.
Disaster Recovery (DR) vs Disaster Avoidance (DA)
DA requires that the Disaster be avoidable.
• DA implies there is time to respond to an impending Disaster. The time window must be large enough to evacuate all necessary systems.
Once avoided, for all practical purposes, there is no more disaster.
• There is no recovery required.
• There is no panic and chaos.
DA is about Preventing (no downtime). DR is about Recovering (already down).
• 2 opposite contexts.
It is insufficient to have DA only.
DA does not protect the business when Disaster strikes.
Get DR in place first, then DA.
DR Context: It's a Disaster, so…
It might strike when we're not ready.
• E.g. the IT team is at an offsite meeting, and the next flight is 8 hours away.
• Key technical personnel are not around (e.g. sick or on holiday).
We can't assume Production is up.
• There might be nothing for us to evacuate or migrate to the DR site.
• Even if the servers are up, we might not even be able to access them (e.g. the network is down).
Even if it's up, we can't assume we have time to gracefully shut down or migrate.
• Shutting down multi-tier apps is complex and takes time when you have hundreds…
We can't assume certain systems will not be affected.
• A DR Exercise should involve the entire datacenter.
Assume the worst, and start from that point.
Singapore MAS Guidelines
MAS is very clear that DR means a Disaster has happened, as there is an outage.
Clause 8.3.3 states the Total Site should be tested. So if you are not doing an entire-DC test, you're not in compliance.
DR: Assumptions
A company-wide DR Solution shall assume:
• Production is down or not accessible.
  – The entire datacenter, not just some systems.
• Key personnel are not available.
  – Storage admin, Network admin, AD admin, VMware admin, DBA, security, Windows admin, RHEL admin, etc. Intelligence should be built into the system to eliminate reliance on human experts.
• Manual Run Books are not 100% up to date.
  – Manual documents (Word, Excel, etc.) covering every step to recover an entire datacenter are prone to human error. They contain thousands of steps, written by multiple authors. Automation and virtualisation reduce this risk.
DR Principles
To Business Users, the actual DR experience must be identical to the Dry Run they experienced.
• In a panic or chaotic situation, users should deal with something they are trained on.
• This means the Dry Run has to simulate Production (without shutting down Production).
Dry Runs must be done regularly.
• This ensures:
  – New employees are covered.
  – Existing employees do not forget.
  – The procedures are not outdated (hence incorrect or damaging).
• Annual is too long a gap, especially if many users or departments are involved.
The DR System must be a replica of the Production System.
• Testing with a system that is not identical to production deems the Dry Run invalid.
• Manually maintaining 2 copies of hundreds of servers, network, storage and security settings is a classic example of an invalid Dry Run, as the DR System is not the Production system.
• System is not the same as Datacenter. Normally, the DR DC is smaller. "System" here means the collection of servers, storage, network and security that makes up an application from the business point of view.
Datacenter-wide DR Solution: Technical Requirements

Fully Automated
• Eliminates reliance on many key personnel.
• Eliminates outdated (hence misleading) manual runbooks.

Enables frequent Dry Runs, with zero impact on Production
• Production must not be shut down, as this impacts the business. Once you shut down production, it is no longer a Dry Run. An Actual Run is great, but it is not practical, as the Business will not allow the entire datacenter to go down regularly just for IT to test infrastructure.
• No clashing with Production hostnames and IP addresses.
• If Production is not impacted, users can take their time to test DR. There is no need to finish within a certain time window anymore.

Scalable to the entire datacenter
• Thousands of servers.
• Covers all aspects of infrastructure, not just server + storage. Network, Security and Backup have to be included so the entire datacenter can be failed over automatically.
DR 1.0 architecture (current thinking)
A typical DR 1.0 solution (at the infrastructure layer) has the following properties:

Area     Solution
Server   • The data drive (LUN) is replicated.
         • The OS/App drive is not, so there are 2 copies: Production and DR. They have different hostnames and IP addresses. They can't be the same, as identical hostnames/IPs would conflict because the network spans both datacenters.
         • This means the DR system is actually different from Production, even in an actual DR. Production never fails over to DR; only the data gets mounted.
         • Technically, this is not a "production recovery" solution but a "Site 2 mounting Site 1 data" solution. IT has been telling the Business that IT is recovering Production, while what IT actually does is run a different system; the only thing used from Production is the data.
Storage  • Not integrated with the server. Practically 2 different solutions, manually run by 2 different teams, with a lot of manual coordination and unhappiness.
Network  • Not aware of DR Test vs Dry Run. It's 1 network for all purposes.
         • Firewall rules are manually maintained on both sides.
DR 1.0 architecture: Limitations
Technically, it is not even a DR solution.
• We do not recover the Production System. We merely mount Production data on a different System.
  – The only way for the System itself to be recovered is to SAN-boot on the DR Site.
• We can't prove to audit that DR = Production.
• Registry changes, config changes, etc. are hard to track at the OS and Application level.
Manual mapping of data drives to their associated servers on the DR site.
• Not a scalable solution, as manual updates don't scale well to thousands of servers.
Heavy on scripting, which is not tested regularly.
DR Testing relies heavily on IT expertise.
DR Requirements: Summary

ID   Requirement                              Description
R01  DR copy = Production copy.               This avoids an invalid Dry Run where the System Under Test itself is not the same.
     Dry Run = Actual Run.                    No changes are allowed (e.g. IP address and hostname), as changes mean the Dry Run is not the real DR.
R02  Identical User Experience                From the business users' point of view, the entire Dry Run exercise must match the real/actual DR experience.
R03  No impact on Production during Dry Run.  A DR test should not require Production to be shut down, as that becomes a real failover. A real failover can't be done frequently because it impacts the business; the Business will resist testing, making the DR Solution risky due to rare testing.
R04  Frequent Dry Runs                        This is only possible if Production is not affected.
R05  No reliance on human experts             A datacenter-wide DR needs many experts from many disciplines, making it an expensive effort. The actual procedure should be simple, and it should not require recovering from an error state.
R06  Scalable to entire datacenter            The DR solution should scale to more than 1000 servers while maintaining RTO and simplicity.
Open Discussion & Sharing (15 minutes)