© 2010 VMware Inc. All rights reserved Enterprise Management.
3
What We Are Hearing From You
Performance & Capacity
“I need a more integrated, simpler approach to ensure the performance, capacity, and health of our virtual environment”
“In the past we’ve just over-provisioned as that was the safest way to CYA. But now management is asking for usage reports and capacity plans before they allow us to buy more infrastructure.”
Configuration & Compliance
“We are constantly preparing for or responding to an audit. We basically shut down normal IT Ops, other than emergencies, during each quarterly audit period.”
“We don’t really have good visibility with our servers – we don’t know what patches have been applied or when. It would be great to know what percentage we are patched each week.”
4
Managing Performance/Capacity in vSphere: the basics
Is it healthy?
• Every VM & ESX performing well?
CPU, RAM, Network, Disk?
• Are they behaving as expected?
• Any fault on any component?
Is it enough?
• Enough CPU, RAM, Network, Disk?
Future risk?
• Time remaining?
• Capacity remaining?
• Where are the “stress points” in time?
Is it optimised?
• Which VMs need adjustment?
• What are my key ratios?
• How much can I claim back from “fat” VMs?
• How many more VMs can I add without impacting performance?
5
Deep understanding of vCenter is required
Yes, buy more RAM.
ESXi has 32 GB RAM.
It is highly used
7
vCenter Operations Solution: Bringing Together 3 Disciplines
Purpose Built Capacity Planning & Analysis
• Integrated capacity analysis and forecasting
• Decision support & automation via views, alerts, reports
• VM right sizing and capacity reclamation
Automated Configuration & Compliance
• Automated Patching and Provisioning
• Comprehensive change tracking to isolate root cause
• Single-click rollback to remediate and return to normal
Patented Performance Analytics
• Self-learning of “normal” performance conditions
• Service health baseline and trending
• Smart alerts of impending performance degradation
8
Threshold: a shift in mindset needed
vCenter sets “static” thresholds, which can be misleading
• During peak, it is common for a VM to reach high utilisation.
• A static threshold will generate alerts when it should not.
• vSphere admins quickly learn to ignore them, defeating the purpose of the alert to begin with.
• During non-peak, it might be abnormal for a VM to reach even 50% utilisation.
• A static threshold will not generate alerts when it should.
vCenter only sets a high threshold
• Do you set a static threshold for when CPU or RAM utilisation drops below 5%?
• A drop in IOPS across an entire storage array might be a sign of a terrible day ahead.
• vCenter will not alert when these happen:
• Utilisation drops from 75% to 1% when it should not.
• Utilisation changes from 5% to 70% when it should not.
• We need to plot both an upper range and a lower range.
But each VM differs. And the same VM differs depending on day/time…
• Intelligence is required to analyse each metric and its expected “normal” behaviour.
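The idea of an upper and a lower range that varies by day and time can be sketched in a few lines of Python. This is only an illustration: the mean plus or minus k standard deviations rule, the function names and the `hour_of_week` bucketing are assumptions made here, not the patented analytics that vC Ops actually uses.

```python
from collections import defaultdict
from statistics import mean, pstdev


def learn_normal_ranges(samples, k=2.0):
    """Learn a (low, high) band per hour of the week.

    samples: iterable of (hour_of_week, value) pairs, hour_of_week in 0..167.
    The band is mean +/- k standard deviations (an assumption for this sketch).
    """
    buckets = defaultdict(list)
    for hour_of_week, value in samples:
        buckets[hour_of_week].append(value)

    ranges = {}
    for hour_of_week, values in buckets.items():
        centre = mean(values)
        spread = pstdev(values)
        ranges[hour_of_week] = (centre - k * spread, centre + k * spread)
    return ranges


def is_abnormal(ranges, hour_of_week, value):
    """True when the value falls outside the learned band for that hour."""
    low, high = ranges[hour_of_week]
    return value < low or value > high


# Hypothetical history: hour 10 is a busy hour, hour 3 is a quiet hour.
history = [(10, v) for v in (70, 75, 80, 72, 78)] + [(3, v) for v in (4, 5, 6, 5, 5)]
ranges = learn_normal_ranges(history)
```

With this history, a drop to 1% during the busy hour and a jump to 70% during the quiet hour are both flagged, while a single static "alert above 90%" rule would stay silent in both cases.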
11
vCenter Operations Management Suite 5.0
VMware vCenter Operations Manager
• Key part of the VMware vCenter Operations Management Suite
CapacityIQ Merged with vC Ops
• CIQ gets VCOPs features
Dashboard
• New Badges (11 – Up from 3)
• Improved Details Page
Greater Emphasis on the Datastore (First Class Object)
• Performance Management and Capacity Management
New Integrations
• VCM vC Ops
• vC Ops Chargeback
• vCD vC Ops
12
vCenter Operations Mgr. – High Level Architecture
[Architecture diagram. Recoverable labels: vCenter Operations Manager vApp with a UI VM and an Analytics VM; vSphere WebApp, Custom WebApp, Admin WebApp; Capacity Analytics; Performance Analytics; Collector; ActiveMQ; FSDB; Postgres DB; OpenVPN; rolled up capacity data; metric data; vCenter communications over SSL. Data sources: VMware Cloud / vCenter, vCenter Configuration Manager, 3rd party data sources. Clients: vC Ops Mgr vSphere UI, vC Ops Mgr Custom UI.]
13
Brand new UI in vCOps5
Updates to the 1.0 Skittles View
Operations Badges
Relationship to the Datastore
Left Pane Navigation Drives Focus (e.g. Datastore)
New World Object
Multi vCenter Support
15
vC Ops vSphere UI – Unified Dashboard
Launching Pad
• Click to Drill down
Focused on problems
• Click to drill into details!
• Almost everything is clickable
Main Themes
• Health
• Risk
• Efficiency
New Concepts
• Faults
• Weekly Stress Profile
• Reclaimable Waste
• Density
16
vC Ops vSphere UI – Two Different Users
Operations
• Immediate problems
• What is happening right now?
• What do I need to pay attention to?
Short and Long Term Capacity
• Forward Looking
• Are there areas that I should be concerned about from a capacity perspective?
• Have I deployed my VI in the most efficient manner?
17
vC Ops Default UI – Major and Minor Badges
Major x 3
• High level Understanding
• Calculated from scores of Minor Badges
Minor x 8
• Specifics
• Guidance
18
Operations: Major Badge – Health
“How is this object doing right now?”
• Identifies current problems in the system
• Issues that need to be resolved immediately to avoid problems
High Health is good (100-0)
Heatmap
• Provides quick view of many objects at once
• Shows Health of all parent and child objects
• Go back in time (6 hours) and see the “weather” of the Virtual Infrastructure
Health Score is calculated from its Minor Badges
• Workload
• Anomalies
• Faults
19
Operations: Health Minor Badge – Workload
Measures how hard an object is working
High Workload is bad (0-100 or more!)
• Workload is Demand divided by effective capacity, expressed as a percentage
• As Workload approaches (and exceeds) 100%, expect performance problems: the object is starved for resources!
Focused attention
• CPU
• Memory
• Disk I/O
• Network I/O
Improved Network and Disk I/O calculations
Eliminates idle networks and storage from showing High Workload
Limits the erroneous 100% Workload scores
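As a rough illustration of "demand divided by effective capacity", the Python sketch below computes a Workload percentage per resource. The function names, units and sample figures are hypothetical; how vC Ops combines the four resources into one badge score is not stated on the slide and is not modelled here.

```python
def workload_percent(demand, effective_capacity):
    """Workload as a percentage: demand divided by effective capacity.

    The result can exceed 100 when demand is higher than what the object
    can actually deliver.
    """
    if effective_capacity <= 0:
        raise ValueError("effective capacity must be positive")
    return 100.0 * demand / effective_capacity


def resource_workloads(demand, capacity):
    """Per-resource Workload for the resources present in `demand`."""
    return {name: workload_percent(demand[name], capacity[name]) for name in demand}


# Hypothetical host: memory demand (30 GB) exceeds effective capacity (24 GB).
demand = {"cpu_ghz": 12.0, "memory_gb": 30.0}
capacity = {"cpu_ghz": 48.0, "memory_gb": 24.0}
per_resource = resource_workloads(demand, capacity)
```

Here CPU Workload is 25% while memory Workload is 125%, which is the "or more!" case: the object is asking for more memory than it can get.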
20
Operations: Health Minor Badge – Anomalies
Measures how normally this object is behaving
• Equivalent to the vC Ops 1.x Health score, but inverted
Derived from the number of metrics that are outside of their “Normal” trended ranges
• Learns dynamic ranges of “Normal” for each metric
• Identifies metric abnormalities
Low Anomalies is good (0-100)
• Zero means the object is performing exactly the way vC Ops expects it to for that time of day and that day of the week
• A high number of anomalies is usually an indication of a problem
Anomalies Chart
• Current number of Abnormal Metrics
• Problem/Noise Threshold
Crossing problem threshold will increase the Anomalies Score
Does not generate an alert in this vSphere UI
21
Workload and Anomalies
Workload and Anomalies together tell you a lot…
Workload High & Anomalies Low
• Workload – Object is Running Hot
• Workload – Potentially Starving for Resources
• Anomalies – Normal Behavior for this timeframe
• Work with users to determine if more resources are needed
Workload High & Anomalies High
• Workload – Object is Running Hot
• Workload – Potentially Starving for Resources
• Anomalies – Abnormal behavior for this timeframe
• Something is amiss!!!
• Immediate Attention!!!
22
Operations: Health Minor Badge – Faults
Measures the degree of faults or problems the object is experiencing
• Pulled from active vCenter events
VMware specific knowledge of which vCenter Events affect Availability and Performance (examples):
• Loss of redundancy in NICs or HBAs
• Memory checksum errors
• HA failover problems
Low Faults is good (0-100)
• Each fault has a default score (e.g. 25, 50, 75, 100)
• The highest individual Fault Score drives the object's Faults Score
Best Practices:
• Do not change the Faults Threshold
• Use Alerts View to manage Faults
Faults shown in Widget
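The rule that the highest individual fault score drives the object's score can be written as a one-line maximum. This Python sketch is illustrative only; returning 0 when no faults are active is an assumption based on "Low Faults is good (0-100)".

```python
def faults_score(active_fault_scores):
    """Object Faults score: the highest score among its active faults.

    Returns 0 when there are no active faults (assumed, not stated on the slide).
    """
    return max(active_fault_scores, default=0)


# Hypothetical object with three active faults at the default score levels.
score = faults_score([25, 75, 50])
```

One consequence of taking the maximum is that clearing minor faults does not lower the score while a more severe fault is still active.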
23
Capacity Planning: Major Badge – Risk
Are there future risks to my systems and VI?
Identifies potential problems that could eventually hurt performance
Low Risk is good (0-100)
Risk Score is calculated from its Minor Badges
• Time Remaining
• Capacity Remaining
• Stress
Risk Chart
• Shows Risk score over the last 7 days
24
Capacity Planning: Risk Minor Badge – Time Remaining
Measures time remaining before each resource type reaches its capacity
• CPU
• Memory
• Disk
• Network I/O
Early warning of upcoming provisioning needs
• Avoid future performance issues
High Time Remaining is good (100-0)
Graph shows resource utilization trends
25
Capacity Planning: Risk Minor Badge – Capacity Remaining
Measures how many more VMs can be placed on the object
Percentage of Total VM “Slots” Remaining
• Based on the average size of the VMs on the object (i.e. the VM profile)
• Each object has its OWN VM profile size: Host, Cluster, Datacenter, etc.
High Capacity Remaining is good (100-0)
• Zero means no room left for more VMs
333 More VMs correlates to 77% Capacity Remaining for this object
26
Capacity Remaining Calculation
Determine Capacity Constraint Resource
• Dashboard Chart does not show which resource is the limiting one
• Must drill into the Details Chart
Deployed or Powered On VMs
• Deployed/Powered Off VMs only use disk space resources
• Powered On VMs use ALL of the 4 resources
Calculation Example Shown:
• Limiting Resource is Disk Space with 333 VMs available
• Use the Deployed VM number of 99 to do the calculation for percentage space remaining
• Determine Capacity Remaining
• 333 / (333 + 99) = 77%
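The worked example can be reproduced in Python. The function name is made up for this sketch, and the per-resource slot counts other than the 333 for disk space are hypothetical values added only to show how the limiting resource is picked.

```python
def capacity_remaining_percent(vms_remaining, vms_deployed):
    """Share of total VM slots still free, as a percentage."""
    total_slots = vms_remaining + vms_deployed
    if total_slots == 0:
        return 0.0
    return 100.0 * vms_remaining / total_slots


# Remaining VM slots per resource. Only the disk space figure (333) comes
# from the slide; the other numbers are invented for illustration.
slots_remaining = {"cpu": 410, "memory": 520, "disk_space": 333, "network_io": 900}

# The limiting resource is the one with the fewest remaining slots.
limiting_resource = min(slots_remaining, key=slots_remaining.get)

percent = capacity_remaining_percent(slots_remaining[limiting_resource], 99)
```

With 333 slots remaining and 99 VMs deployed, `percent` rounds to 77, matching the badge value on the slide.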
27
Capacity Planning: Risk Minor Badge – Stress
Stress measures long-term or chronic workload
• Workload shows an instantaneous value
• Stress looks over a longer period of time
Quickly find and resolve
• Undersized objects
• Population contention
Low Stress is good (0-100)
Stress score encompasses a six (6) week period
• Workloads > 70% = “Stressed”
• Threshold Configurable
Chart shows a weekly breakdown of Stress for each day/hour, averaged over the last six (6) weeks
28
Stress Calculation
Stress Score is a percentage based on the area of Workload above the “Stress Line” threshold, compared to the total capacity of the object
• Stress line is configured in the vC Ops Configuration Wizard
• Stress Score = (Stress area / Stress Zone) * 100
Example
• Stress Line is 70% Workload
• 12% of the area is above the 70% threshold
• Stress Score is 12
[Chart: Workload Line plotted from 0 to 100 over 6 weeks; the Stress Zone lies above the Stress Line at 70; 12% of the zone is covered.]
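One possible reading of the Stress formula is sketched below in Python. It assumes that the Stress Zone is the band between the Stress Line and 100% capacity across the whole period, that Workload samples are evenly spaced, and that Workload above 100% is capped. None of these details are confirmed by the slide, so treat it as an approximation.

```python
def stress_score(workloads, stress_line=70.0, capacity=100.0):
    """Stress Score = (Stress area / Stress Zone) * 100.

    workloads: evenly spaced Workload percentages over the period.
    Stress area: the part of each sample above the stress line, capped at capacity.
    Stress Zone: the full band between the stress line and capacity (assumed).
    """
    if stress_line >= capacity:
        raise ValueError("stress line must be below capacity")
    if not workloads:
        return 0.0
    stress_zone = (capacity - stress_line) * len(workloads)
    stress_area = sum(max(0.0, min(w, capacity) - stress_line) for w in workloads)
    return 100.0 * stress_area / stress_zone


# Four hypothetical samples: only the second and third rise above the 70% line.
score = stress_score([70, 100, 85, 50])
```

In this example the area above the line is 30 + 15 = 45 out of a possible 120, giving a score of 37.5. An object that never crosses the line scores 0.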
29
Stress Configuration – Host or Cluster
Access via Configuration Widget
• Stressed Cluster and Host
• Undersized VM
Stress Line
• CPU and/or Memory Workload
Stress Threshold
• When should an object appear on the Stress Reports
• Does not affect Badge Score
Object is stressed if its degree stressed is greater than the % Stressed threshold
Determines the Stress line for a physical resource (viz. CPU, Memory)
30
Stress Configuration – Undersized VM Detection
A cluster or host is identified as stressed if its degree stressed is greater than the % Stressed threshold
Use Any or All thresholds for detection
Determines the Stress line for a physical resource (viz. CPU, Memory)
31
Workload, Anomalies and Stress
Adding Stress Badge can tell you even more…
Workload High & Anomalies Low & Stress High
• Workload – Object is Running Hot
• Workload – Potentially Starving for Resources
• Anomalies – Normal Behavior for this timeframe
• Stress – Object is often running under high Workload
• Add resources!!!
Workload High & Anomalies Low & Stress Low
• Workload – Object is Running Hot
• Workload – Potentially Starving for Resources
• Anomalies – Normal Behavior for this timeframe
• Stress – Object usually has enough resources
• Not likely a big problem…a cyclical workload spike?
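The badge combinations discussed on the slides can be collected into a small lookup, sketched here in Python. The function takes plain booleans because the slides do not say which score counts as "high"; combinations the slides do not cover return a neutral message.

```python
def interpret(workload_high, anomalies_high, stress_high=None):
    """Map badge states to the guidance given on the slides.

    stress_high=None means the Stress badge is not being considered.
    """
    if not workload_high:
        return "Combination not covered by the slides"
    if anomalies_high:
        # Running hot AND behaving abnormally for this timeframe.
        return "Something is amiss: immediate attention"
    if stress_high is None:
        return "Work with users to determine if more resources are needed"
    if stress_high:
        # Running hot, normal for this timeframe, and often under high Workload.
        return "Add resources"
    return "Not likely a big problem, possibly a cyclical workload spike"


advice = interpret(workload_high=True, anomalies_high=False, stress_high=True)
```

Here `advice` is "Add resources": the object is running hot, this is normal for it, and it is often under high Workload.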
32
Capacity Planning: Major Badge – Efficiency
Are there optimization opportunities in my systems?
How to run a leaner datacenter
Save $$$ by better utilizing resources
High Efficiency is good (100-0)
Efficiency Score calculated from Minor Badges
• Reclaimable Waste
• Density
Graph Depicts VMs by Percent
• Optimal – Optimally Provisioned VMs
• Waste – Over Provisioned VMs
• Stress – Under Provisioned VMs
• Not used in Efficiency Calculation (see Risk)
Three Resources Considered
• CPU
• Memory
• Disk Space
Note: VMs can appear in both Stress and Waste
33
Capacity Planning: Efficiency Minor Badge – Reclaimable Waste
Measures the over-provisioning for an object
It identifies the amount of reclaimable resources
• CPU
• Memory
• Disk
Low Reclaimable Waste is good (0-100)
Reclaimable Waste = Reclaimable Capacity / Deployed Capacity
• Score depicts the MAX of the CPU, Memory and Disk calculation
• Disk calculation can also include old snapshots and templates
Graph shows breakdown of the Waste section of the Efficiency Badge pie chart
• % Idle VMs (based on configured settings)
• % Powered Off VMs
• % Oversized VMs
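The Reclaimable Waste formula, with the score taken as the maximum across CPU, Memory and Disk, might look like this in Python. The dictionary keys, units and sample numbers are hypothetical, and skipping resources with no deployed capacity is a choice made for this sketch.

```python
def reclaimable_waste_percent(reclaimable, deployed):
    """Reclaimable Waste = Reclaimable Capacity / Deployed Capacity, as a percentage.

    The ratio is computed per resource and the score is the maximum of them.
    Resources with no deployed capacity are skipped.
    """
    ratios = [
        100.0 * reclaimable.get(resource, 0) / amount
        for resource, amount in deployed.items()
        if amount > 0
    ]
    return max(ratios, default=0.0)


# Hypothetical cluster (vCPUs, GB of memory, GB of disk).
reclaimable = {"cpu": 10, "memory": 64, "disk": 500}
deployed = {"cpu": 40, "memory": 128, "disk": 4000}
waste = reclaimable_waste_percent(reclaimable, deployed)
```

The per-resource ratios are 25%, 50% and 12.5%, so the score is 50: memory is the most over-provisioned resource.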
34
Efficiency Configuration – Powered-Off & Idle VMs
Access via Configuration Widget
Powered-Off Threshold
• Based on % time
Idle VM Detection
• Based on % time
- AND -
• All or One of the following thresholds
• CPU
• Disk I/O
• Network I/O
Listed as Powered-Off if the total powered-off time is greater than the given % Time Powered-Off threshold in a given time interval
Listed as Idle if the total time during which all or any of the resource usages are below the specified thresholds is greater than the given % time threshold in a given time interval
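The idle rule combines per-resource thresholds ("all" or "any") with a percentage-of-time threshold. A minimal Python sketch, assuming evenly spaced usage samples and hypothetical metric names and threshold values:

```python
def is_idle(samples, thresholds, min_idle_percent, mode="all"):
    """Decide whether a VM should be listed as Idle.

    samples: list of dicts of usage per metric, evenly spaced in time.
    thresholds: usage below these values counts as idle for that metric.
    mode: "all" requires every metric below its threshold, "any" just one.
    The VM is idle when the share of idle samples exceeds min_idle_percent.
    """
    if mode not in ("all", "any"):
        raise ValueError("mode must be 'all' or 'any'")
    if not samples:
        return False
    combine = all if mode == "all" else any
    idle_samples = sum(
        1 for s in samples if combine(s[m] < thresholds[m] for m in thresholds)
    )
    return 100.0 * idle_samples / len(samples) > min_idle_percent


# Hypothetical thresholds and samples.
thresholds = {"cpu": 5, "disk_io": 10, "net_io": 10}
quiet = {"cpu": 1, "disk_io": 2, "net_io": 1}
busy = {"cpu": 60, "disk_io": 2, "net_io": 1}
```

For example, a VM that is quiet in 9 of 10 samples is idle under an 80% time threshold, while one that is quiet in only 5 of 10 is not, unless "any" mode is used and a single quiet metric is enough.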
37
Efficiency Configuration – Oversized VMs
Access via Configuration Widget
Oversized Detection
• CPU and/or Memory Workload
Oversized Threshold
• What percentage of Oversized is acceptable
• When should an object be reported
An Object is oversized if its degree oversized is greater than the % Oversized threshold
For the given time interval, CapacityIQ first calculates if a physical resource (viz. CPU, Memory) is over-sized based on the configurable Utilization Less Than threshold.
38
Oversized VMs - Calculation
• % Oversized Threshold = Area in Blue / Area of Grey Box
• The higher the ratio (i.e. more blue), the greater the over-sizing
39
Capacity Planning: Efficiency Minor Badge – Density
Contrasts Actual vs. Ideal Density
Identify Optimal Resource Deployment Before Contention Occurs
Greater Consolidation $$$
High Density is good (100-0)
Measures consolidation ratios:
• VMs/Host Ratios
• vCPU/Physical CPU Ratios
• vMem/Physical Memory Ratios
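The three consolidation ratios are simple divisions, sketched below in Python with hypothetical parameter names and cluster figures. How vC Ops derives the "ideal" density and turns the comparison into a 0-100 score is not described on the slide and is not modelled here.

```python
def consolidation_ratios(vm_count, host_count, vcpus, physical_cpus,
                         vmem_gb, physical_mem_gb):
    """Return the three consolidation ratios listed on the slide."""
    return {
        "vms_per_host": vm_count / host_count,
        "vcpu_per_physical_cpu": vcpus / physical_cpus,
        "vmem_per_physical_mem": vmem_gb / physical_mem_gb,
    }


# Hypothetical cluster: 120 VMs on 4 hosts, 256 vCPUs on 64 physical CPUs,
# 512 GB of configured VM memory on 256 GB of physical memory.
ratios = consolidation_ratios(120, 4, 256, 64, 512, 256)
```

This cluster runs 30 VMs per host, 4 vCPUs per physical CPU and 2 GB of configured VM memory per GB of physical memory.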
41
vC Ops Default UI – Badge Thresholds
Adjust levels to user defined settings
Access via Configuration Widget
Set Infrastructure and VM thresholds separately
• Capacity problems for a Host require more “warning” than for a VM
Disable Color Threshold by Clicking the Level Off
44
Operations: Environment
Updates to the 1.0 Skittles View
Operations Badges
Relationship to the Datastore
Left Pane Navigation Drives Focus (e.g. Datastore)
New World Object
Multi vCenter Support
48
Operations: Details
Workload Badge Focus : Host Example
Improved Legends and Keys
Scroll Down for new graphs for Disk and Network I/O
Individual objects color-coded to match badge score
49
Operations: Details
Workload Badge Focus : VM Example
Reserved, Limits and Entitlement Highlighted on Graphs
50
Operations: Details
Workload Badge Focus : Datastore Example
Space Available
Throughput
IOPS
Latency
51
Operations: Details
Anomalies Badge Focus
Subset of the Anomalies for an object
Help with any troubleshooting efforts
Visualize magnitude and impact
53
Operations: Events
Updates to the 1.0 Events View
Choose Badge
For which objects should I show Alerts and Events?
Overlay Badge Alerts
Overlay Change Events
Health Score Line
56
Planning: Environment
Updates to the 1.0 Skittles View
Planning Badges
Relationship to the Datastore
Left Pane Navigation Drives Focus (e.g. Datastore)
New World Object
Multi vCenter Support
58
Planning: Summary
“Classic CapIQ” Dashboard rolled up under Summary tab
• Summary view context sensitive to object selected
Network I/O trending and forecasting
• Usable Capacity supports Network I/O
What-if Modeling allows CPU & Memory Reservations and Limits configuration
59
Planning: Views
Reports Organized by “Badge”
• 5 different categories – one for each minor badge under Risk and Efficiency
New List Reports
• VM List
• Datastores List
• Datastores Waste List
Views associated with Datastores
64
Configuration Widget: Planning & Reports – Usage Calculation
By default, CapacityIQ calculates capacity usage based on all 24 hours of data every day
Use specific hours and days to match business week workload, so that off-peak usage does not skew the data
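The effect of restricting the calculation to business hours can be seen in a short Python sketch. The Monday-to-Friday, 09:00-17:00 window and the function names are assumptions chosen for illustration; in the product the hours and days are set in the Configuration Widget.

```python
from datetime import datetime


def in_business_window(timestamp, days=frozenset({0, 1, 2, 3, 4}),
                       start_hour=9, end_hour=17):
    """True for Monday-Friday (weekday 0-4) between start_hour and end_hour."""
    return timestamp.weekday() in days and start_hour <= timestamp.hour < end_hour


def average_usage(samples, business_only=True):
    """Average the usage values, optionally keeping only business-window samples.

    samples: list of (datetime, usage_percent) pairs.
    Returns None when no sample qualifies.
    """
    values = [
        usage for timestamp, usage in samples
        if not business_only or in_business_window(timestamp)
    ]
    return sum(values) / len(values) if values else None


# Hypothetical samples: Monday 10:00, Monday 03:00, Saturday 10:00.
samples = [
    (datetime(2012, 1, 2, 10), 80.0),
    (datetime(2012, 1, 2, 3), 5.0),
    (datetime(2012, 1, 7, 10), 2.0),
]
```

Over all three samples the average usage is 29%, but within the business window it is 80%, which is the figure a capacity plan for business-hour load would need.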
66
Smart Alerts – Overview
New Alerting Functionality
Smart Alerts Available in EACH vC Ops Suite edition
Different Types of Smart Alerts
• Custom UI Alerts
• Can show vSphere UI Badge Alerts
• Alerts driven by
• Problem/Noise Threshold Anomaly Breaches
• KPI Threshold Breaches
• Very useful for groups of objects (e.g. Application Monitoring)
• vSphere UI Badge Alerts
• Threshold Based
• Driven by Badge Color Change Thresholds
• Only Alert on Minor Badges
• Workload YES – Health NO
• Good for Alerts on single objects (e.g. VM)
67
Smart Alerts - Configuration
Enable/Disable Alerts by Specific Badge Definitions
Create alerts on vCenter faults
• Subset of events from vCenter are considered faults
• VMware best practices and knowledge
Enable Infrastructure and VM Alert separately
Access via Configuration Widget
• Disable threshold level to disable the alert
• Turn off “Workload Orange” – No Alert
70
Smart Alerts – Usability
Filter to view specific Badges
Filter on column values
Add and Remove columns
Search for specific alerts
71
Smart Alerts Details
Double click on an alert to see the details
Details view differs based on the alert type (e.g. Workload vs. Anomalies)
72
Smart Alerts – External Notification Configuration
Configure via the Administration UI
SNMP Notifications
• All alerts are streamed to the source
• Filtering must occur on the Destination System
SMTP Notifications
• Create Email Rules for filtering
73
Smart Alerts – Email Notification Rules
Configure via the Notification Widget
Create Email Rules via Notification Widget
Configure
• Email address
• Alert Types
• Criticality Levels
• Object
• Children
75
Analysis – Heatmaps
Heatmaps like in vC Ops Std 1.0
We now have the Capacity badges and metrics available in the heatmaps
Examples:
• Which Clusters are Healthy and have available Capacity?
• Which hosts have a Low Workload and a low Density?
77
Reports
CapIQ Reports merged into Reports Tab
Only Reports related to vSphere Capacity, even in Ent Plus
81
vCM vC Ops : Change Events Correlated with Performance
Overview
• Integration between vCM and vC Ops Mgr for change events
• Overlay Guest OS configuration changes from vCM in vC Ops performance trend graphs
• Launch in context into vCM to see full details of changes and potentially remediate them
Benefits
• Enable Operations to quickly understand and resolve performance issues arising from configuration changes (reduce MTTR)
• Drive efficient & effective troubleshooting by correlating Guest OS configuration changes w/ VM performance degradations
82
vCM Events in vC Ops – Events Collected
vC Ops does not pull in every event from vCenter
• Only events that could affect health or workload (vSphere Knowledge!)
Adapter only pulls in change events for Guest OSs
• No ESX/i Host configuration changes (these come from the vCenter Adapter)
• Guest OS has to be managed by vCM
Events Collected
Reboot
Software Install/Uninstall
Windows Registry
IP/Networking changes
Device Driver changes
Memory/CPU changes
Windows Firewall
Patches
83
vCM Change Events Correlated with Performance
Launches to the Master Change Log view in vCM for the change in question
Roll back the change (if possible)
86
vCenter Operations Management Suite Packaging
Automated Operations Management
Standard Edition (for smaller vSphere environments)
• VC Ops Mgr 5.0 – Std.
Advanced Edition (for larger vSphere environments)
• VC Ops Mgr 5.0 (incl. CapIQ)
Enterprise Edition (for virtual and cloud infrastructure)
• VC Ops Mgr 5.0 (incl. CapIQ)
• VC Infra Navigator **
• VCM for vSphere **
• Chargeback Mgr
Enterprise Plus Edition (for hybrid cloud and heterogeneous environments)
• VC Ops Mgr 5.0 (incl. CapIQ)
• VC Infra Navigator **
• VC Configuration Mgr
• Chargeback Mgr
** Not available a-la-carte.
Slide callouts: New SKU, New Name