Availability Management Issues - itSMF
© 2007 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice
Technology for better business outcomes
Availability Management Issues
Arthur Lampert, Ph.D., CISSP
HP Services
Objectives for this session
• During this session you will learn:
  − Effective approaches for defining service availability
  − An understanding of some availability design issues
September 26, 2008 2
Agenda
• The Availability Lifecycle
• Defining Service Availability
• Designing Services to Meet Availability Requirements
Service Value comes from …
• Service Utility
  − What does the service do?
  − Functional requirements
  − Features, inputs, outputs…
  − “fit for purpose”
• Service Warranty
  − How well does the service do it?
  − Non-functional requirements
  − Capacity, performance, availability, security, continuity…
  − “fit for use”
Service Warranty and risks
[Figure: risks plotted by frequency (low to high) against impact (low to high) and grouped under Continuity, Security, Capacity and Availability. Risks shown: power/network failure; virus attack; security breach (hacker); natural disaster (fire, flood, adverse weather); man-made disaster (terrorism, malicious damage); planned downtime; internal security/fraud; denial of service attack; software failure; application failure; configuration issues; hardware failure; data corruption; insufficient capacity.]
The ITIL Service Lifecycle
[Figure: the ITIL service lifecycle, showing Service Strategy, Service Design, Service Transition, Service Operation and Continual Service Improvement.]
The Availability Lifecycle
[Figure: the availability lifecycle across the Strategy, Design, Transition and Operation stages, with Continual Improvement. Labels recoverable from the diagram:
• Business Requirements Gathering: Key Business Functions, Impact of Downtime, Definition of Availability
• Define, Educate, Negotiate, Document: Availability Targets, Service Levels, Acceptance Criteria, Reporting Metrics
• Availability Design: Design for Availability, Design for Recovery, Design for Continuity
• People, Process, Technology
• Detect, Respond, Diagnose, Repair, Recover, Restore
• Site, Data, Service, Process
• OLAs & Underpinning Contracts; Instrumentation and Tools; IT Service Management Processes; Business Management
• Build, Test, Deploy; Release; Changes
• Manage & Support; Monitor, React, Prevent
• Availability Reporting, Service Failure Analysis; Continual Improvement Process & Feedback Loop]
Traditional view of availability
Percentage Availability    Annual Downtime
99%                        87.6 hours (3½ days)
99.5%                      43.8 hours
99.9%                      8.8 hours
99.95%                     4.4 hours
99.99%                     53 minutes
99.999%                    5.3 minutes
The traditional calculation
$$\text{Availability} = \frac{AST - DT}{AST} \times 100\%$$
• AST = Agreed Service Time
• DT = Downtime
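The formula above and the downtime table on the preceding slide can be cross-checked with a few lines of Python. This is only a sketch: the function names are illustrative, and it assumes a 24x7 agreed service time of 8760 hours per year, which is what the table's figures imply.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours; assumes a 24x7 agreed service time


def availability_pct(ast_hours: float, downtime_hours: float) -> float:
    """Availability = (AST - DT) / AST x 100%."""
    if ast_hours <= 0:
        raise ValueError("agreed service time must be positive")
    return (ast_hours - downtime_hours) / ast_hours * 100.0


def annual_downtime_hours(target_pct: float,
                          ast_hours: float = HOURS_PER_YEAR) -> float:
    """Downtime allowed per year for a given availability target."""
    return ast_hours * (1.0 - target_pct / 100.0)


if __name__ == "__main__":
    for target in (99.0, 99.5, 99.9, 99.95, 99.99, 99.999):
        hours = annual_downtime_hours(target)
        print(f"{target}% -> {hours:.2f} hours ({hours * 60:.1f} minutes)")
```

Rounding the printed values gives the figures in the table, for example 87.6 hours at 99%.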
What’s wrong with this approach?
• Customers care about end-to-end service
• What if some locations are OK and others aren’t?
• What if some users are OK and others aren’t?
• What if some operations work and others don’t?
• What if the service is so slow that it is unusable?
• What if there are frequent 5 second outages?
• What are we actually measuring and reporting?
Service Level Agreements
• An SLA documents what has been agreed
  − From the perspective of the users and customers
• Contents should include
  − Availability definitions
  − Targets
  − Measurement and reporting
  − Penalties
• Every goal in an SLA must be SMART
  − Specific, Measurable, Achievable, Relevant, Timely
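Outage frequency and duration
• MTBF = Mean Time Between Failures
• MTBSi = Mean Time Between System Incidents
• MTRS = Mean Time to Restore Service
[Figure: timeline alternating between Up and Down, with the intervals TRS (time to restore service), TBF (time between failures) and TBSi (time between system incidents) marked.]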
Outage frequency and duration
• Which of these is better?
  − MTBF = 19 days, MTTR = 1 day, Availability = 95%
  − MTBF = 22.8 hrs, MTTR = 1.2 hrs, Availability = 95%
[Figure: two Up/Down timelines, one for each case.]
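Both cases report the same percentage, which a short calculation makes concrete. The sketch below assumes that MTBF here means the average uptime between failures, so that availability is MTBF / (MTBF + MTTR); that reading matches the slide's numbers (19 / (19 + 1) = 95%). The function names are illustrative.

```python
def availability_from_mtbf(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability (%) when MTBF is read as mean uptime between failures."""
    return mtbf_hours / (mtbf_hours + mttr_hours) * 100.0


def outages_per_year(mtbf_hours: float, mttr_hours: float,
                     hours_per_year: float = 24 * 365) -> float:
    """Average number of failure/repair cycles in a year."""
    return hours_per_year / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    cases = {
        "long, rare outages": (19 * 24, 1 * 24),  # 19 days up, 1 day down
        "short, frequent outages": (22.8, 1.2),   # hours
    }
    for name, (mtbf, mttr) in cases.items():
        print(f"{name}: {availability_from_mtbf(mtbf, mttr):.1f}% available, "
              f"about {outages_per_year(mtbf, mttr):.0f} outages per year")
```

The percentage is identical, but the first pattern means roughly 18 day-long outages a year and the second means an outage of over an hour every day, which is why frequency and duration need to be agreed separately.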
Failover Events
• How long does a failover take?
  − Between cluster members?
  − When a RAID disk fails?
  − When a network link fails?
• Does failover have a business impact?
  − Do transactions have to be restarted?
  − What is the longest “short” outage that can be ignored?
• What if the cluster continuously fails over?
  − What is the maximum frequency of these types of event?
Outage frequency and duration
• Agree availability in terms of
  − Frequency of incidents
  − Duration of incidents
• Agree failover events which won’t be counted
  − Frequency
  − Duration
  − Impact
An agreement with the business
• Outage duration and frequency must be agreed
  − In terms that the business understands
  − With metrics that support the business mission
• What might such an agreement look like?
Example agreement
Outage duration            Maximum frequency
up to 2 minutes            2 events in any hour; 5 events in any day; 10 events in any week
2 minutes to 30 minutes    2 events in any week; 6 events in any quarter
30 minutes to 4 hours      4 events in any year
4 hours to 8 hours         1 event in any year

Maximum Annual Downtime = 8 hours + (4 * 4 hours) + (24 * 30 mins) = 36 hours
Availability = (8760 – 36) / 8760 = 99.6%
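The maximum annual downtime follows from taking the longest outage in each band at its maximum yearly frequency. The sketch below reproduces that arithmetic. Two points are read off the slide's own sum rather than stated in it: the 24 events in the 30-minute band come from 6 events per quarter over 4 quarters, and outages of up to 2 minutes are left out of the total. The variable names are illustrative.

```python
HOURS_PER_YEAR = 24 * 365  # 8760

# (longest outage in the band, in hours; maximum events per year)
# Outages of up to 2 minutes are not counted, matching the slide's sum.
counted_bands = [
    (8.0, 1),      # 4 hours to 8 hours: 1 event in any year
    (4.0, 4),      # 30 minutes to 4 hours: 4 events in any year
    (0.5, 6 * 4),  # 2 to 30 minutes: 6 events in any quarter
]

max_annual_downtime = sum(hours * events for hours, events in counted_bands)
availability = (HOURS_PER_YEAR - max_annual_downtime) / HOURS_PER_YEAR * 100.0

if __name__ == "__main__":
    print(f"Maximum annual downtime: {max_annual_downtime} hours")
    print(f"Availability: {availability:.1f}%")
```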
Number of users affected
• Most failures do not cause complete loss of service
• Typical scenario
  − Some users have no service at all
  − Other users completely unaffected
• Extreme cases
  − Only one user is affected
  − Only one user is able to work!
• Should these count as downtime or not?
User Outage Minutes
$$\text{Availability} = \frac{\text{PotentialUserMinutes} - \text{UserOutageMinutes}}{\text{PotentialUserMinutes}} \times 100\%$$
• Potential User Minutes = Number of users * Agreed service time
• User Outage Minutes = Number of affected users * Downtime
Potential User Minutes
• Not every minute is equal!
Day and time                     Potential number of users    Weekly PotentialUserMinutes
Monday – Friday 00:00 – 07:00    500                          1,050,000
Monday – Friday 07:00 – 09:00    2,500                        1,500,000
Monday – Friday 09:00 – 18:00    5,000                        13,500,000
Monday – Friday 18:00 – 21:00    1,000                        900,000
Monday – Friday 21:00 – 00:00    500                          450,000
Saturday – Sunday                500                          1,440,000
Weekly Total                                                  18,840,000
User Outage Minutes example
• Using data from previous slide
• There is an outage that loses email service to 500 users for 2 hours on a Monday morning at 10:00
• UserOutageMinutes = 500 * 2 * 60 = 60,000
• PotentialUserMinutes for the week = 18,840,000
• Availability = (18,840,000 – 60,000) / 18,840,000 = 99.68%
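The weekly total and the worked example can be reproduced as follows. This is a minimal sketch: the tuple layout and function name are illustrative, while the user counts and time bands are taken from the Potential User Minutes slide.

```python
# (days per week, hours per day, potential users) for each time band
weekly_profile = [
    (5, 7, 500),    # Monday - Friday 00:00 - 07:00
    (5, 2, 2500),   # Monday - Friday 07:00 - 09:00
    (5, 9, 5000),   # Monday - Friday 09:00 - 18:00
    (5, 3, 1000),   # Monday - Friday 18:00 - 21:00
    (5, 3, 500),    # Monday - Friday 21:00 - 00:00
    (2, 24, 500),   # Saturday - Sunday
]

potential_user_minutes = sum(days * hours * 60 * users
                             for days, hours, users in weekly_profile)


def user_outage_minutes(affected_users: int, downtime_minutes: int) -> int:
    """User Outage Minutes = number of affected users * downtime."""
    return affected_users * downtime_minutes


# 500 users lose email for 2 hours on Monday morning
outage = user_outage_minutes(500, 2 * 60)
availability = (potential_user_minutes - outage) / potential_user_minutes * 100.0

if __name__ == "__main__":
    print(f"PotentialUserMinutes: {potential_user_minutes:,}")
    print(f"UserOutageMinutes:    {outage:,}")
    print(f"Availability:         {availability:.2f}%")
```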
What if there aren’t users?
• Transaction based systems
• Manufacturing systems
• etc...
$$\text{Availability} = \frac{\text{PotentialUnits} - \text{LostUnits}}{\text{PotentialUnits}} \times 100\%$$
$$\text{Availability} = \frac{\text{PotentialITTransactions} - \text{LostTransactions}}{\text{PotentialITTransactions}} \times 100\%$$
Critical business functions
• Some failures only affect part of a service
  − ATMs can dispense money but not print statements
  − Can browse old emails but can’t send or receive
  − Reservation system can see bookings but not make new ones
• It is up to the business to define the relative importance of each type of transaction
• You can use transaction weightings to modify availability figures
Example transaction weightings
IT function that is not available                % service impact
Sending email                                    100%
Receiving email                                  100%
Using shared distribution lists to send email    10%
Updating shared distribution lists               5%
Reading public folders                           50%
Updating public folders                          10%
Accessing shared calendars                       30%
Updating shared calendars                        10%
Why don’t these add up to 100%?
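The slides do not give a formula for applying these weightings. One possible reading, shown here purely as an assumption, is to scale the user outage minutes for an incident by the agreed impact of the function that was lost. Each percentage is then the impact of losing that one function, not a share of a total, which would explain why the column need not add up to 100%. The function name is illustrative; the potential user minutes figure is the weekly total from the earlier slide.

```python
# Agreed service impact per unavailable function, from the table above
SERVICE_IMPACT = {
    "Sending email": 1.00,
    "Receiving email": 1.00,
    "Using shared distribution lists to send email": 0.10,
    "Updating shared distribution lists": 0.05,
    "Reading public folders": 0.50,
    "Updating public folders": 0.10,
    "Accessing shared calendars": 0.30,
    "Updating shared calendars": 0.10,
}

POTENTIAL_USER_MINUTES = 18_840_000  # weekly total from the earlier slide


def weighted_user_outage_minutes(function: str, affected_users: int,
                                 downtime_minutes: int) -> float:
    """Assumed rule: outage minutes scaled by the function's service impact."""
    return SERVICE_IMPACT[function] * affected_users * downtime_minutes


if __name__ == "__main__":
    # Hypothetical incident: public folders unreadable for 500 users for 2 hours
    lost = weighted_user_outage_minutes("Reading public folders", 500, 120)
    availability = (POTENTIAL_USER_MINUTES - lost) / POTENTIAL_USER_MINUTES * 100.0
    print(f"Weighted outage minutes: {lost:,.0f}")
    print(f"Weighted availability:   {availability:.2f}%")
```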
What about poor performance?
• Most SLAs have performance targets
• What if the performance is SO SLOW that the service can’t be used?
  − Some SLAs count this as downtime
  − Others count it separately, with its own penalties
  − The important thing is to discuss, agree, and document
• IT organisation can only agree performance if customer agrees maximum workload
  − It is the job of the business to forecast the work, not IT
Example performance agreement
IT function                Required response time (when service is available)
Login                      99% within 5 seconds; 99.9% within 15 seconds
Seat availability check    95% within 10 seconds; 99% within 30 seconds
Seat booking               99% within 40 seconds; 100% within 60 seconds
Check in                   95% within 20 seconds; 100% within 60 seconds
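Targets of this shape can be checked mechanically against measured response times. The sketch below is one way to do it, assuming response times are collected as a plain list of seconds; the function names and sample data are hypothetical, and only the Login targets come from the table.

```python
from typing import Iterable, Sequence, Tuple


def fraction_within(samples: Sequence[float], limit_seconds: float) -> float:
    """Fraction of response times at or under the limit."""
    if not samples:
        raise ValueError("no response time samples")
    return sum(1 for s in samples if s <= limit_seconds) / len(samples)


def meets_targets(samples: Sequence[float],
                  targets: Iterable[Tuple[float, float]]) -> bool:
    """targets holds (required fraction, limit in seconds) pairs."""
    return all(fraction_within(samples, limit) >= required
               for required, limit in targets)


# Login: 99% within 5 seconds, 99.9% within 15 seconds
LOGIN_TARGETS = [(0.99, 5.0), (0.999, 15.0)]

if __name__ == "__main__":
    measured = [1.0] * 995 + [6.0] * 5  # hypothetical measurements
    print("Login targets met:", meets_targets(measured, LOGIN_TARGETS))
```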
Planned downtime
• What effect does a planned outage have on availability?
• AST = Agreed Service Time
• If the planned outage is not in a service window then it does not impact availability
  − Some SLAs specify when maintenance will happen
  − Some SLAs allow additional planned downtime with sufficient notice
Measurement Period
• Remember that Availability is defined as
$$\text{Availability} = \frac{AST - DT}{AST} \times 100\%$$
• AST = Agreed Service Time
• DT = Downtime
• What time period should we use for the agreed service time?
Measurement period
• Availability after a single 8 hour incident
• Weekly: $\frac{(24 \times 5) - 8}{24 \times 5} \times 100\% = 93.3\%$
• Monthly: $\frac{(24 \times 365 / 12) - 8}{24 \times 365 / 12} \times 100\% = 98.9\%$
• Quarterly: $\frac{(24 \times 365 / 4) - 8}{24 \times 365 / 4} \times 100\% = 99.6\%$
• Annual: $\frac{(24 \times 365) - 8}{24 \times 365} \times 100\% = 99.9\%$
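The same four calculations can be run in a loop to show how strongly the reported figure depends on the period chosen. This is a small sketch; the period lengths copy the slide's arithmetic, including its weekly agreed service time of 24 * 5 hours, and the names are illustrative.

```python
INCIDENT_HOURS = 8  # a single 8 hour incident

# Agreed service time in hours for each measurement period, as on the slide
periods = {
    "Weekly": 24 * 5,
    "Monthly": 24 * 365 / 12,
    "Quarterly": 24 * 365 / 4,
    "Annual": 24 * 365,
}


def availability_pct(ast_hours: float, downtime_hours: float) -> float:
    """Availability = (AST - DT) / AST x 100%."""
    return (ast_hours - downtime_hours) / ast_hours * 100.0


results = {name: availability_pct(ast, INCIDENT_HOURS)
           for name, ast in periods.items()}

if __name__ == "__main__":
    for name, pct in results.items():
        print(f"{name}: {pct:.1f}%")
```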
Measuring Availability
• You have a good definition of Availability
  − It is specific about what will be delivered
  − It is achievable
  − It is relevant to the service you deliver
  − It is defined over a clear time period
• So what have we forgotten?
  − A definition is of no use at all if you can’t measure it
How can you measure availability?
• Service Desk Records
  − Fairly easy to implement, inexpensive
  − Can lead to disputes about accuracy of data
• Instrument all components and calculate
  − Difficult to implement, expensive
  − May fail to detect complex or subtle failures
• Use dummy transactions / clients to simulate
  − Actually measures end-to-end availability
  − May miss complex or subtle failures
Defining Availability - Summary
• “How many 9s” is not good enough
• Must account for
  − End-to-end service availability
  − Number and duration of outages
  − Number of users or transactions affected by incidents
  − Criticality of business functions affected by incidents
  − Performance of critical functions
  − Planned downtime
  − Agreed measurement period
  − Agreed measurement process
• Everything must be documented in an SLA
  − Using SMART metrics
The Availability Lifecycle
[Figure: the availability lifecycle diagram, repeated from earlier in the deck.]
Traditional view of availability design
[Figure: six components with individual availabilities of 99.2%, 99.7%, 99.4%, 99.1%, 99.3% and 99.0%, giving a combined 95.7% availability probability, or 4.3% unavailability.]
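The combined figure is consistent with multiplying the component availabilities together. The sketch below assumes that the service needs every component to be up and that components fail independently; the slide shows only the numbers, so both points are assumptions, and the function name is illustrative.

```python
from math import prod


def serial_availability(component_pcts):
    """Combined availability (%) when every component must be up.

    Assumes independent failures, so the probabilities multiply.
    """
    return prod(p / 100.0 for p in component_pcts) * 100.0


components = [99.2, 99.7, 99.4, 99.1, 99.3, 99.0]
combined = serial_availability(components)

if __name__ == "__main__":
    print(f"Combined availability: {combined:.2f}%")
    print(f"Unavailability:        {100.0 - combined:.2f}%")
```

The product is a little under 95.8%, in line with the 95.7% on the slide. Every component is at 99% or better, yet the service as a whole is not.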
IT Services can be very complex
[Figure: an example IT service built from many interdependent components: web clients and Windows clients; firewalls (appliance); web servers (Linux); an application gateway; application servers (Windows); a database server (HP-UX); SAN storage; DHCP, DNS and directory services (Windows); all connected by a wide area network, LAN, SAN and message bus.]
Modelling and predicting availability
• What is the MTBF of a human being?
• Availability modelling tools
  − Can only model and predict technology
  − Are not good at predicting failover duration and errors
  − Assume that automation is always perfect
• Most downtime comes from people and process
• So how do you design a solution to meet an availability requirement?
What is the MTBF of a human being?
• Less than 10 years
• 10 to 40 years
• 40 to 70 years
• 70 to 120 years
• 120 to 200 years
• More than 200 years
Design the technology
• You need to use the right technique for each component in your solution
• Understand contribution of each component to overall service
• Understand the impact of other components
• Define component availability requirements based on service requirements
• Identify possible strategies for each component
• Think about requirements for maintenance
  − Planned downtime is still downtime
What happens when a component fails
[Figure: incident timeline. The service is available (uptime) until the incident starts, unavailable (downtime) through the Detect, Diagnose, Repair, Recover and Restore stages, and available again once the incident ends.]
Instrumentation and tools
• Event detection and correlation
• Tools to manage blades, virtualisation, SOA…
  − Consolidate and release changes and patches
  − Rapidly deploy servers
  − Automatically redeploy applications to other servers
• All these tools must work together
  − What happens if you move services to balance power without thinking about the whole service?
• What happens if the management tools go wrong?
• What happens if system managers get it wrong?
People and Process
• Skills, roles, and processes must match technology
• Managing virtualised SOA and Blade technology
  − How do you make sure you can apply the required updates and patches in a high availability environment?
  − How do you manage your CMDB?
  − How much automation of changes can you risk?
  − How do you manage network and VLANs?
• Imagine a single blade enclosure where every blade supports a different business application
  − You may never be able to do maintenance
  − Imagine virtualisation running above the blade layer
Design the underpinning support
• Supporting services need to be appropriate for each component
• Service Level Managers can’t usually negotiate a separate contract for each service
• Use supplier management to ensure you get what you need for all your services
  − Continual Improvement as well as good contracts
  − Differentiated service levels for different components
• Make sure support teams can actually deliver
  − An OLA is not an alternative to skills, tools and processes
In conclusion
• Define availability targets in a way that:
  − means something to the business
  − can be reliably measured and reported
• Design solutions that can meet your targets