Optimized SDDC
Transcript of Optimized SDDC
© 2014 VMware Inc. All rights reserved.
Optimized SDDC
Iwan ‘e1’ Rahabok
Staff SE (Strategic Accounts) & CTO Ambassador
[email protected] | 9119-9226 | Linkedin.com/in/e1ang | Tweeter: e1_ang
https://www.facebook.com/groups/vmware.users/
VCAP-DCD, TOGAF Certified, vExpert
CONFIDENTIAL 2
Peer comparison
• Average consolidation ratio for VSI (not VDI)
  – 1 – 10: ___ customers
  – 11 – 20: ___ customers
  – 21 – 30: ___ customers
  – >30: ___ customers
• Degree of virtualisation (total servers; UNIX counted as physical)
  – <60%: ___ customers
  – 60 – 85%: ___ customers
  – 86 – 99%: ___ customers
  – 100%: ___ customers
Peer comparison
• No. of Server VMs in your company
  – <1000 VMs: ___ customers
  – 1000 – 5000 VMs: ___ customers
  – 5000 – 10K VMs: ___ customers
  – >10K VMs: ___ customers
• No. of VDI VMs
  – <2500 VMs: ___ customers
  – 2500 – 10K VMs: ___ customers
  – 10K – 25K VMs: ___ customers
  – >25K VMs: ___ customers
Warm-up exercise
You get an email from the app team saying the main Intranet application was slow.
• The email arrived 1 hour ago. It stated the application was slow for 1 hour, and was ok after that.
• So it was slow between 1 and 2 hours ago, but is ok now.
• You did a check. Everything is indeed ok in the past 1 hour.
• The application spans 10 VMs in 2 different clusters, 4 datastores and 1 RDM.
• You are not familiar with the application. You do not know what apps run on each VM, as you have no access to the Guest OS.
• Your environment: 1 vCenter, 4 clusters, 30 hosts, 500 VMs, 40 datastores, 1 midrange array, 10 GE, iSCSI storage.
Test your vSphere knowledge! How do you solve/approach this with just vSphere?
What do you do?
A: Smile, as this will be a nice challenge for your TAM/BCS/MCS/RE.
B: No sweat, you're VCDX + CCIE + ITIL Master. You were born for this.
C: SMS your wife: "Honey, I'm staying overnight at the datacenter."
D: Take blood pressure medicine so it won't shoot up.
E: Buy the app team a very nice dinner, and tell them to keep quiet.
What do you optimize for?
• Performance
• Cost
• Availability
• Recover-ability
• Security?
• Manageability?
• What about…
  – Upgrade-ability
  – "Debug-ability"?
We will dive into a few topics so you can go back and implement.
This session does not cover the products. vC Ops and SRM will be covered in later sessions.
What is an optimised SDDC Infrastructure?
• That thin line where Demand meets Supply.
  – Too much Supply, and you're not optimizing for cost.
  – Too much Demand, and you're not optimizing for performance and availability.
• It's optimised when it has met the criteria you set when designing your infrastructure.
Optimized Performance
Performance: How do you know it's optimised?
• What do you measure?
  – Utilisation?
    • Utilisation of 100% means it's performing…?
    • Utilisation of 5% means it's performing…?
    • Utilisation of 50% means it's performing…? Really?
  – Something else?
    • What is that something else?
To understand this "something else", we need to go back to fundamentals.
What do we care about at each layer?

At the VM level (the VM Owner's point of view):
1. We care whether the VM is being served well by the platform. Other VMs are irrelevant from the VM Owner's point of view. Make sure it is not contending for resources.
2. We check whether it is sized properly. If too small, increase its configuration. If too big, right-size it for better performance.

At the SDDC platform level:
1. We care whether the platform is serving everyone well. Make sure there is no contention for resources among all the VMs on the platform.
2. We check overall utilisation. Too low, and we are not investing wisely in hardware. Too high, and we need to buy more hardware.
Take Away: Contention and Utilisation
• Unlike a physical DC, in virtual infrastructure…
  – we use Contention, not Utilisation, as the primary counter for Performance Management
  – we use Utilisation (short range) as a secondary input for Performance Management
  – we use Utilisation (long range) for Capacity Management
• Contention is how you measure that the platform is performing well.
• Sounds good! But how do you measure "Contention"?
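The split above can be sketched as a simple decision rule. This is an illustrative sketch, not VMware guidance; the thresholds and messages are assumptions you would replace with your own tier's values.

```python
def assess(contention_pct, utilization_pct,
           contention_sla=3.0, util_high=80.0, util_low=20.0):
    """Classify state: contention drives performance management,
    utilisation drives capacity management (illustrative thresholds)."""
    findings = []
    # Performance management: contention is the primary signal.
    if contention_pct > contention_sla:
        findings.append("performance: VMs are contending; rebalance or add capacity")
    # Capacity management: long-range utilisation, both directions.
    if utilization_pct > util_high:
        findings.append("capacity: utilisation high; plan hardware purchase")
    elif utilization_pct < util_low:
        findings.append("capacity: utilisation low; hardware under-invested")
    return findings or ["healthy: low contention, sensible utilisation"]

print(assess(contention_pct=5.0, utilization_pct=85.0))
print(assess(contention_pct=1.0, utilization_pct=50.0))
```

Note that high utilisation with low contention is a capacity finding, not a performance one, which is exactly the distinction the slide makes.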
Performance: The counters
• What counters prove that it is optimised?
  – You need a technical fact to assure yourself.
    • Either that, or take a sleeping pill at night.
  – You need a technical fact to show to your customers.
    • Your SLA must be based on something concrete, not subject to interpretation or the "feeling of the day".
  – If you can't prove it, how does anyone know it is optimised? ;-)
Optimized Infrastructure Performance*
• CPU
• RAM
• Storage
• Network
* While keeping Cost in mind.
Our scope is Infrastructure. Applications are a separate topic, covered later by Kasim.
How a VM gets its resources
(Diagram: a vertical scale from 0 up to Provisioned, marking the Limit, Reservation and Entitlement levels, with the Contention, Usage and Demand counters plotted against it.)
VM CPU: The 4 States
VM CPU: What do you monitor?
• Contention
  – Ready (ms)?
  – Co-Stop (ms)?
  – Latency (%)?
  – Max Limited (ms)?
  – Overlap (ms)?
  – Swap Wait (ms)?
• Utilisation
  – Used (ms)?
  – Usage (%)?
  – Demand (MHz)?
Quiz Time! What's the difference between Average, Summation and Latest? How does the timeline impact the value?
VM CPU: What you should monitor

In vCenter Operations:
• Contention: Contention (%)
• Utilisation: Workload (%)

In vCenter:
• Contention: Latency (%), Max Limited (if applicable)
• Utilisation: Usage (%), Demand (MHz)

Discussion Time! What should the value be for an optimized environment?
One more thing…
• The hypervisor does not have visibility inside the Guest OS.
• There is 1 particular CPU counter that you should get. It tells you that there is not enough CPU to meet demand.
• vCenter Operations (via Hyperic) does not collect this counter.
• Which counter is that?
Enough about CPU.Let’s move to RAM!
Quiz Time!
• Which of the following sentences are True?
  – Ballooning is bad. If you see a VM has ballooned, that VM has a memory performance problem.
  – Ballooning happens before Compression, which happens before Swapping. If you see a VM with Compressed memory but no Ballooned memory, then vCenter is buggy, or your eyes are just tired.
  – If all the VMs on an ESXi host have a low Usage counter, then the ESXi host's Usage must also be low.
  – Turn on Large Pages, and there goes all your TPS.
  – To check if a VM has memory contention, check its CPU Swap Wait counter.
  – Why are all the questions difficult?!
• Answers
  – Ballooning indicates the ESXi host has memory pressure. It does not mean the VM has a memory performance issue.
  – Pages remain compressed or swapped if they are not accessed.
  – The Usage counter is different for a VM and for ESXi! For a VM it is based on Active; for ESXi it is based on Consumed. This is due to the 2-level memory concept.
  – Yes, unless your ESXi host is under heavy memory constraint.
2 Levels of Memory Hierarchy
• There is a hierarchy in VMware's memory overcommit technology:
  – Transparent Page Sharing
  – Ballooning
  – Memory Compression
  – Swap to Host Cache (SSD)
  – Disk swapping
• Decompression is sub-ms, compared to swap-in (15-20 ms)!
(Diagram: Guest OS memory management sits on top of Hypervisor memory management.)
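The latency gap above is why compression sits above disk swapping in the hierarchy. A rough back-of-envelope sketch, using hypothetical per-page costs in the range the slide quotes:

```python
# Back-of-envelope cost of recovering 10,000 reclaimed pages.
# Per-page latencies are assumptions in the range the slide quotes:
# sub-millisecond to decompress vs 15-20 ms to swap in from disk.
PAGES = 10_000
DECOMPRESS_MS = 0.05   # assumed sub-millisecond decompression
SWAP_IN_MS = 15.0      # low end of the 15-20 ms swap-in range

decompress_total_s = PAGES * DECOMPRESS_MS / 1000
swap_total_s = PAGES * SWAP_IN_MS / 1000

print(f"decompress: {decompress_total_s:.1f} s, swap-in: {swap_total_s:.1f} s")
```

With these assumed numbers, recovering the same working set is 300x slower from disk swap than from compressed memory, which is why a compressed-but-not-ballooned VM is not automatically a problem.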
vSphere Memory Management
• 2 types of Memory Management:
  – Guest OS level
    • Balloon
  – Hypervisor level
    • TPS
    • Compression, Swap to disk, Swap to cache (SSD)
Volunteer Time! Explain Balloon, TPS, Compression.
VM RAM: What do you monitor?
• Contention
  – Swapped?
  – Balloon?
  – Compressed?
  – Latency?
  – CPU Swap Wait?
• Utilisation
  – Active?
  – Usage?
  – Consumed?
VM RAM: What you should monitor

In vCenter Operations:
• Contention: RAM Contention (%)
• Utilisation: Workload (%), Consumed (KB)

In vCenter:
• Contention: Latency (%), CPU Swap Wait (ms)
• Utilisation: Usage (%), Consumed (KB)

Discussion Time! What should the value be for an optimized environment?
One more thing…
• The hypervisor does not have visibility inside the Guest OS.
• There is 1 particular RAM counter that you should get. It tells you that there is not enough RAM to meet demand.
• Which counter is that?
• You can monitor Guest OS paging activity by separating the page file into its own vmdk.
  – You can then use vC Ops to analyse the pattern.
Enough about RAM.Let’s move to Storage!
Quiz Time!
• Which of the following sentences are True?
  – The latency counter is (Write Latency + Read Latency) / 2.
  – If you have an RDM, vCenter does not track the latency.
  – If the VM virtual disk counter shows 1000 IOPS, but the VM datastore counter shows twice the IOPS, something is seriously wrong. Time to call your TAM!
  – If all your VMs are experiencing high latency, the first thing you do is check the VMkernel queue.
• Answers
  – It is not. It takes into account the number of commands issued; it's a weighted average.
  – vCenter does track it, but only the latency of the latest data point; it does not include the other data during the collection period.
  – Check for a snapshot. Snapshot IOPS are transparent to the virtual disk counter.
  – The first thing you do is check the physical device queue and your storage array. The VMkernel queue rarely exceeds 1 ms.
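The weighted average in the first answer can be made concrete with a small sketch. The function name and the sample numbers are hypothetical; the point is that weighting by commands issued gives a very different answer than a naive (read + write) / 2.

```python
def weighted_latency(samples):
    """Weighted average latency across IO streams.

    samples: list of (commands_issued, latency_ms) tuples.
    A plain (read + write) / 2 ignores how many commands each
    stream issued; the weighted form does not.
    """
    total_cmds = sum(cmds for cmds, _ in samples)
    if total_cmds == 0:
        return 0.0
    return sum(cmds * lat for cmds, lat in samples) / total_cmds

# Hypothetical interval: 900 writes at 2 ms, 100 reads at 20 ms.
samples = [(900, 2.0), (100, 20.0)]
print(weighted_latency(samples))   # 3.8 ms, dominated by the many fast writes
print((2.0 + 20.0) / 2)            # naive average: 11.0 ms, misleading
```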
VM Storage: where and what do you monitor?
• For vmdk, use the Datastore metric group.
• For RDM, use the Disk metric group.
• The Disk metric group is naturally not relevant for NFS (files).
(Diagram: a VM with three virtual disks (scsi0:0 on a VMFS datastore, scsi0:1 on an NFS datastore, scsi0:2 as an RDM), showing where the Virtual Disk, Datastore and Disk metric groups apply on the path down to the physical disks Disk 1, Disk 2 and Disk 3.)
VM Storage: What do you monitor?

In vCenter Operations:
• Contention: Latency (ms)
• Utilisation: Commands per second, Usage (KBps), Workload (%)

In vCenter:
• Contention: Latency (ms)
• Utilisation: Commands Issued, Usage (KBps)
VM Network
• Contention
  – Dropped packets
  – Packet retransmits
• Utilisation
  – Network throughput
• Limitations
  – We cannot monitor latency (e.g. between source and destination).
Different Tiers, Different Optimization
• Business logic:
  – Tier 1 is optimised for Performance and Availability.
  – Tier 3 is optimised for Cost.
• Do you allow a Tier 1 VM on Tier 3 Storage?
  – Or do you map the Compute Tier to the Storage Tier?
• What distinguishes Tier 1 from Tier 3?
  – Availability
  – Performance
  – Monitoring
  – Cost!
Tiering: Considerations
• Compute
  – No. of spare hosts
  – No. of hosts
  – Consolidation Ratio (VM:Host)
  – vCPU:pCPU oversubscription
  – vRAM:pRAM oversubscription
  – Clustering (e.g. VCS)
• Storage
  – IOPS per VM
  – Latency
• Monitoring
  – Application availability monitoring (e.g. App HA)
  – Application performance monitoring (e.g. vC Ops Enterprise)
• Availability
  – Automated DR (SRM)
  – RPO
  – RTO
3-Tier Offering: Example

                                     Tier 1      Tier 2      Tier 3
No. of spare hosts                   2           1           1
No. of hosts                         6           8           10
Consolidation Ratio (VM:Host)        10:1        20:1        40:1
vCPU:pCPU oversubscription           n/a         2.0x        4.0x
vRAM:pRAM oversubscription           n/a         1.5x        2.0x
IOPS per VM                          400         200         100
Latency                              <10 ms      15-20 ms    20-25 ms
Clustering (e.g. VCS)                Yes         Yes         No
Application monitoring (e.g. AppHA)  Yes         Yes         No
Apps                                 Yes         Yes         No
Automated DR (SRM)                   Yes         Yes         Yes
RPO                                  5 minutes   1-2 hours   2-8 hours
RTO                                  1 hour      <2 hours    <4 hours
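A tier catalogue like the example above can be encoded so that placement or capacity tooling can reason about it. The dictionary below mirrors a few rows of the example table; the structure and the helper function are illustrative, not a VMware artifact.

```python
TIERS = {
    # Values taken from the example tier table; adjust to your own offering.
    "tier1": {"spare_hosts": 2, "hosts": 6,  "vm_per_host": 10,
              "iops_per_vm": 400, "max_latency_ms": 10},
    "tier2": {"spare_hosts": 1, "hosts": 8,  "vm_per_host": 20,
              "iops_per_vm": 200, "max_latency_ms": 20},
    "tier3": {"spare_hosts": 1, "hosts": 10, "vm_per_host": 40,
              "iops_per_vm": 100, "max_latency_ms": 25},
}

def cluster_capacity(tier_name):
    """Max VMs a tier's cluster supports once spare hosts are set aside."""
    t = TIERS[tier_name]
    usable_hosts = t["hosts"] - t["spare_hosts"]
    return usable_hosts * t["vm_per_host"]

print(cluster_capacity("tier1"))  # (6 - 2) hosts x 10 VM/host = 40 VMs
print(cluster_capacity("tier3"))  # (10 - 1) hosts x 40 VM/host = 360 VMs
```

This also makes the cost trade-off visible: Tier 3 carries 9x the VMs of Tier 1 on fewer spare hosts, which is exactly what "optimised for cost" means here.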
Demystifying "Peak"
• There are 2 types of "Peak":
  – Peak across time
  – Peak across objects
• Impacts:
  – Peak across time can be too high if the burst is high.
    • A VM is low for 24 hours, bursts to 100% for 5 minutes, and you get 100% reported.
  – Peak across time can be too low if the number of member objects is high.
    • The peak of a cluster in the past 1 day is 70%. That only means at least 1 host was >70%.
  – Peak across objects can be too high if the load is unbalanced.
    • This happens when cluster utilisation is not high enough to trigger DRS or Storage DRS.
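The two peaks can be made concrete with a small sketch over hypothetical per-host utilisation samples. The host names and numbers are made up; the point is how the same data yields different "peaks" depending on which axis you aggregate over.

```python
# Utilisation samples (%) for 3 hosts over 5 intervals (hypothetical).
samples = {
    "host1": [30, 35, 95, 30, 30],   # a short burst to 95%
    "host2": [40, 40, 40, 40, 40],
    "host3": [20, 20, 20, 20, 20],
}

# Peak across time: for each host, the max over the timeline.
peak_across_time = {h: max(vals) for h, vals in samples.items()}

# Peak across objects: for each interval, the busiest host.
peak_across_objects = [max(col) for col in zip(*samples.values())]

# The cluster average hides the burst entirely.
cluster_avg = [sum(col) / len(samples) for col in zip(*samples.values())]

print(peak_across_time)    # host1 reports 95 even though it was mostly idle
print(peak_across_objects)
print(max(cluster_avg))    # the cluster-level view never shows the burst
```

host1's peak across time is 95% (too high, driven by one burst), while the cluster average never exceeds about 52% (too low, diluted by quiet members): both distortions the slide describes.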
Sample SLA and Internal Threshold

SLA (the SLA only applies to the VM; the VM owner does not care about the underlying platform):
                 Tier 1    Tier 2    Tier 3
CPU Contention   3%        8%        13%
RAM Contention   0%        5%        10%
Disk Latency     10 ms     20 ms     30 ms

Internal Threshold (tighter, so you act before the SLA is breached):
                 Tier 1    Tier 2    Tier 3
CPU Contention   2%        6%        10%
RAM Contention   0%        2%        8%
Disk Latency     10 ms     15 ms     20 ms
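A breach check against thresholds like these is straightforward to script. The threshold values are copied from the sample internal-threshold table; the function name and metric keys are illustrative.

```python
# Internal thresholds from the sample table (act before the SLA is hit).
INTERNAL = {
    "tier1": {"cpu_contention": 2.0, "ram_contention": 0.0, "disk_latency_ms": 10.0},
    "tier2": {"cpu_contention": 6.0, "ram_contention": 2.0, "disk_latency_ms": 15.0},
    "tier3": {"cpu_contention": 10.0, "ram_contention": 8.0, "disk_latency_ms": 20.0},
}

def breaches(tier, metrics):
    """Return the metrics that exceed the tier's internal threshold.

    metrics: dict with the same keys as the threshold table.
    """
    limits = INTERNAL[tier]
    return {k: v for k, v in metrics.items() if v > limits[k]}

vm = {"cpu_contention": 4.5, "ram_contention": 0.0, "disk_latency_ms": 12.0}
print(breaches("tier1", vm))  # CPU contention and disk latency exceed Tier 1
print(breaches("tier3", vm))  # the same VM is fine on Tier 3: {}
```

The same VM metrics breach Tier 1 but pass Tier 3, which is the whole point of tiered SLAs: the counter is judged against the tier the VM owner paid for.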
Where to monitor at the Platform level?
• Compute
  – Host? Cluster? Datacenter? vCenter?
• Storage
  – Host? Cluster? Datastore? Datastore Cluster? Datacenter? vCenter?
• Network
  – Standard Switch and port group? :-)
  – Host? Distributed Switch? Distributed Port Group?
Where to monitor

Not here (DRS and Storage DRS will balance the cluster):
• Compute: Host, Datacenter
• Storage: Host, Cluster
• Network: Host

Monitor these:
• Compute: Cluster
• Storage: Datastore, Datastore Cluster
• Network: Distributed Switch, Distributed Port Group
QoS in a shared environment
• QoS is mandatory in a shared environment.
• Areas to control:
  – Compute
  – Network
  – Storage
• For CPU and RAM:
  – Shares
  – Reservation
  – Limit?
  – Resource Pool?
• Storage I/O Control
• Network I/O Control
QoS: Compute
• When should you not use a Resource Pool?
• When should you use a Resource Pool?
• What's the impact of Reservation?
  – HA Slot Size (unless you use percentage-based admission control)
  – Boot time
  – Oversubscription: you cannot go beyond 100% reservation.
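The HA slot-size impact can be illustrated numerically. This is a simplified model, an assumption for illustration only: the slot is taken as the largest CPU and RAM reservation among powered-on VMs, ignoring memory overhead and the default minimum slot values that the real admission-control algorithm applies.

```python
# Simplified HA slot-size arithmetic (hypothetical reservations;
# ignores VM memory overhead and default minimum slot values).
vm_reservations = [            # (cpu_mhz, ram_mb) per powered-on VM
    (500, 1024),
    (500, 1024),
    (4000, 8192),              # one large reservation inflates the slot
]

# Slot size: the largest CPU and RAM reservation in the cluster.
slot_cpu = max(cpu for cpu, _ in vm_reservations)
slot_ram = max(ram for _, ram in vm_reservations)

host_cpu_mhz, host_ram_mb = 20000, 65536
slots_per_host = min(host_cpu_mhz // slot_cpu, host_ram_mb // slot_ram)

print(f"slot = {slot_cpu} MHz / {slot_ram} MB -> {slots_per_host} slots per host")
```

One VM with a large reservation shrinks the whole cluster to 5 slots per host, even though the small VMs would fit far more densely. This is why percentage-based admission control sidesteps the problem.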
QoS: Storage
• A single VM can hog storage throughput.
  – It just needs to run IOmeter.
  – This unfairly penalizes VMs on hosts with high consolidation ratios.
• Existing resource management only works among VMs on the same host.
• SIOC calculates datastore latency to identify contention.
  – Latency is a normalized average across VMs.
  – IO size and IOPS are included.
Without SIOC – Latency is Unbounded
(Diagram: two ESX servers issue IO to a congested storage array queue, each with a device queue depth of 24. Without Storage IO Control, the actual disk resources utilized by each VM are not in the correct ratio.)
QoS: Storage
• SIOC enforces fairness when datastore latency crosses a threshold.
  – Dynamic threshold setting.
  – Fairness is enforced by limiting VMs' access to queue slots.
• What are the limitations?
  – No inter-datastore awareness.
  – Does not work on RDM.
  – Non-VM workloads are not included.
• Work with your Storage team.
  – Auto-tiering arrays are supported.
With SIOC – Latency is Controlled
(Diagram: VM A with 1500 shares and VMs B and C with 500 shares each, spread across two ESX servers. SIOC throttles the storage queue by reducing the device queue depths, so the actual disk resources utilized by each VM are in the correct 60% / 20% / 20% ratio, even across ESX hosts, and the storage array is controlled rather than congested.)
Key Takeaways
• Optimization in SDDC has many more components than we normally think.
• Contention is 1st. Utilisation is 2nd.
• The SLA is at the VM level, not the Infrastructure level.
• Peak can be too low or too high.
• Anything else?
Optimized Availability
Take a 2-minute bio-break.
Disaster Recovery (DR) vs Disaster Avoidance (DA)
DA requires that the Disaster be avoidable.
• DA implies there is time to respond to an impending Disaster. The time window must be large enough to evacuate all necessary systems.
Once avoided, for all practical purposes, there is no more disaster.
• There is no recovery required.
• There is no panic and chaos.
DA is about Preventing (no downtime). DR is about Recovering (already down).
• 2 opposite contexts.
It is insufficient to have DA only.
DA does not protect the business when Disaster strikes.
Get DR in place first, then DA.
DR Context: It's a Disaster, so…
It might strike when we're not ready.
• E.g. the IT team is at an offsite meeting, and the next flight is 8 hours away.
• Key technical personnel are not around (e.g. sick or on holiday).
We can't assume Production is up.
• There might be nothing for us to evacuate or migrate to the DR site.
• Even if the servers are up, we might not even be able to access them (e.g. the network is down).
Even if it's up, we can't assume we have time to gracefully shut down or migrate.
• Shutting down multi-tier apps is complex and takes time when you have hundreds…
We can't assume certain systems will not be affected.
• A DR Exercise should involve the entire datacenter.
Assume the worst, and start from that point.
Singapore MAS Guidelines
MAS is very clear that DR means a Disaster has happened, as there is an outage.
Clause 8.3.3 states the Total Site should be tested. So if you are not doing an entire-DC test, you're not in compliance.
DR: Assumptions
A company-wide DR Solution shall assume:
• Production is down or not accessible.
  – The entire datacenter, not just some systems.
• Key personnel are not available.
  – Storage admin, Network admin, AD admin, VMware admin, DBA, security, Windows admin, RHEL admin, etc. Intelligence should be built into the system to eliminate reliance on human experts.
• Manual Run Books are not 100% up to date.
  – Manual documents (Word, Excel, etc.) covering every step to recover an entire datacenter are prone to human error. They contain thousands of steps, written by multiple authors. Automation and virtualisation reduce this risk.
DR Principles
To Business Users, the actual DR experience must be identical to the Dry Run they experienced.
• In a panic or chaotic situation, users should deal with something they are trained on.
• This means the Dry Run has to simulate Production (without shutting down Production).
Dry Runs must be done regularly.
• This ensures:
  – New employees are covered.
  – Existing employees do not forget.
  – The procedures are not outdated (hence incorrect or damaging).
• Annual is too long a gap, especially if many users or departments are involved.
The DR System must be a replica of the Production System.
• Testing with a system that is not identical to production deems the Dry Run invalid.
• Manually maintaining 2 copies of hundreds of servers, network, storage and security settings is a classic example of an invalid Dry Run, as the DR System is not the Production system.
• System is not the same as Datacenter. Normally, the DR DC is smaller. "System" here means the collection of servers, storage, network and security that makes up an application from the business point of view.
Datacenter-wide DR Solution: Technical Requirements

Fully Automated
• Eliminates reliance on many key personnel.
• Eliminates outdated (hence misleading) manual runbooks.

Enables frequent Dry Runs, with zero impact on Production
• Production must not be shut down, as this impacts the business. Once you shut down production, it is no longer a Dry Run. An Actual Run is great, but it is not practical, as the Business will not allow the entire datacenter to go down regularly just for IT to test infrastructure.
• No clashing with Production hostnames and IP addresses.
• If Production is not impacted, users can take their time to test DR. There is no need to finish within a certain time window anymore.

Scalable to the entire datacenter
• Thousands of servers.
• Covers all aspects of infrastructure, not just server + storage. Network, Security and Backup have to be included so the entire datacenter can be failed over automatically.
DR 1.0 architecture (current thinking)
A typical DR 1.0 solution (at the infrastructure layer) has the following properties:

Area     Solution
Server   • The data drive (LUN) is replicated.
         • The OS/App drive is not, so there are 2 copies: Production and DR. They have different hostnames and IP addresses. They can't be the same, as identical hostnames/IPs would conflict because the network spans both datacenters.
         • This means the DR system is actually different from Production, even in an actual DR. Production never fails over to DR; only the data gets mounted.
         • Technically, this is not a "production recovery" solution but a "Site 2 mounting Site 1 data" solution. IT has been telling the Business that IT is recovering Production, while what IT actually does is run a different system; the only thing used from Production is the data.
Storage  • Not integrated with the server. Practically 2 different solutions, manually run by 2 different teams, with a lot of manual coordination and unhappiness.
Network  • Not aware of DR Test vs Dry Run. It's 1 network for all purposes.
         • Firewall rules are manually maintained on both sides.
DR 1.0 architecture: Limitations
Technically, it is not even a DR solution.
• We do not recover the Production System. We merely mount Production data on a different System.
  – The only way for the System itself to be recovered is to SAN-boot on the DR Site.
• We can't prove to audit that DR = Production.
• Registry changes, config changes, etc. are hard to track at the OS and Application level.
Manual mapping of data drives to their associated servers on the DR site.
• Not a scalable solution, as manual updates don't scale well to thousands of servers.
Heavy on scripting, which is not tested regularly.
DR Testing relies heavily on IT expertise.
DR Requirements: Summary

ID   Requirement                              Description
R01  DR copy = Production copy.               This avoids an invalid Dry Run where the System Under Test itself is not the same.
     Dry Run = Actual Run.                    No changes are allowed (e.g. IP address and hostname), as changes mean the Dry Run is not the real DR.
R02  Identical User Experience                From the business users' point of view, the entire Dry Run exercise must match the real/actual DR experience.
R03  No impact on Production during Dry Run.  A DR test should not require Production to be shut down, as that becomes a real failover. A real failover can't be done frequently because it impacts the business; the Business will resist testing, making the DR Solution risky due to rare testing.
R04  Frequent Dry Runs                        This is only possible if Production is not affected.
R05  No reliance on human experts             A datacenter-wide DR needs many experts from many disciplines, making it an expensive effort. The actual procedure should be simple, and it should not require recovering from an error state.
R06  Scalable to entire datacenter            The DR solution should scale to more than 1000 servers while maintaining RTO and simplicity.
Open Discussion & Sharing (15 minutes)