Capacity Planning: why, what and how
Virgilio A. F. Almeida
Computer Science Department
Federal University of Minas Gerais
Brazil
NOMS 2004, Seoul, Korea
April 22, 2004
Virgilio Almeida, all rights reserved. April 2004
Outline
• Capacity planning: why and what
• A capacity planning methodology: how
• Workload model
– An example of a hierarchical characterization of live streaming media
• Performance models
– Examples of simple models
– Example of detailed models
• Conclusions
Facts and trends in IT services
• IT costs have skyrocketed.
• Businesses rely more and more on the performance and availability of IT applications and networks.
• Multi-tier architecture designs increase the complexity of managing the infrastructure.
• Utility computing needs tools to manage the service environment and maintain service levels.
• Increasing variety of middleware architectures (e.g., CORBA, EJB, DCOM) and distributed applications.
• Growth of new applications that demand large amounts of resources, e.g., p2p, multimedia, VoIP.
Facts and trends in IT services
• IT budgets range from 1 to 10% of a corporation’s overall revenue. In a typical IT budget, 78% (*) is spent managing existing systems, computational and networking infrastructure.
• Need to measure and manage services: service level metrics, cost metrics, development metrics.
– System performance could be viewed as the metric that indicates the percentage of time the systems, applications and infrastructure are available and performing at a level specified by users, according to service level agreements (SLAs).
(*) Source: Gartner Group
Capacity planning: why
• There is a clear need for tools for capacity and resource management, availability management, operation management, security management, and infrastructure management at large.
• Capacity planning is a useful technique for both the IT community and the network operations and management community.
The System Management Relation
[Diagram: Capacity Planning relates three elements]
– Capacity: processors, I/O, network bandwidth
– Demand: workload characteristics
– QoS: performance, availability, security, cost
Capacity Planning: what
• Capacity planning is more than just performance prediction...
• It’s about:
– Performance
– Availability
– Cost
– Revenue
– Security
• It’s a management tool...
When Performance is a Problem...
TO OUR VISITORS
The recent launch of the free ****.com service, designed to be the most trusted source of information, learning and knowledge on the Internet, has created such an enormous volume of traffic that the company’s servers have experienced a temporary slowdown.
We apologize to everyone who has been unable to access ***.com. The tremendous response to ***.com has created a tidal wave of activity on our site, and we are working hard to make the site available as quickly as possible… but we had no idea that this volume of traffic would be achieved so quickly.
Sincerely, A. B.
CEO ***.com Inc.
(taken from the Home Page of ***.com)
When Availability is a Problem...
• Availability is becoming a vital metric for IT services!
– Ideal world: businesses aim at 100% availability for certain business periods
  • for e-business, online services, mission-critical apps, ISPs, ASPs, and data centers.
– Real world: service outages (*) are still frequent
  • 65% of IT managers report that their websites were unavailable to customers over a 6-month period
    – 25%: 3 or more outages
– Unavailability costs are high
  • loss of customers and business and negative press (eBay failures in 1999, Amazon outages in 2000, and MSN Messenger in 2001)
(*) Source: Patterson, U.C. Berkeley, and InternetWeek 4/3/2000
Downtime Costs (per Hour)
– Brokerage operations: $6,450,000
– Credit card authorization: $2,600,000
– eBay (1 outage, 22 hours): $225,000
– Amazon.com: $180,000
– Package shipping services: $150,000
– Home shopping channel: $113,000
– Catalog sales center: $90,000
– Airline reservation center: $89,000
– Cellular service activation: $41,000
– On-line network fees: $25,000
– ATM service fees: $14,000
Source: Patterson, U.C. Berkeley, and InternetWeek 4/3/2000
Performance and availability problems for IT services tend to get worse!
• Proliferation of mobile devices.
• Easier-to-use interfaces (VUI, wireless and Web services on cars and airplanes, novel browsing paradigms).
• Increasing load placed by agents and robots.
• Impacts of authentication and security protocols (e.g., SSL, TLS) on IT service performance.
• Increase in the complexity of middleware and distributed applications.
• Flash crowds that overload Web services.
Capacity Planning: what
[Diagram labels: Workload demands; Business requirements; SLAs; Cost; Capacity Planning Process; Future IT resources needed]
Typical Planning Questions
• How many servers do I need in the utility computing environment to handle the new enterprise resource applications?
• Is the online trading site prepared to accommodate a 75% increase in trades/day?
• Do I have enough bandwidth to handle a peak demand 10x greater than the average?
• What are our configuration and capacity?
• How fast can the site architecture be scaled up? What components should be upgraded? Database servers? Web servers? Application servers? Bandwidth?
There is a big gap between average and peak workload!
Some possible actions
• Size to peak workload: cost issues, it’s very expensive!
• Size to average workload:
– Bad QoS at peak (i.e., at the most important times!)
– Unhappy customers and lost business!
• Design to workload properties:
– Examples: distribute non-critical services; use temporal and spatial locality to design and allocate proxies, caches and mirrors; and exploit statistical properties of the load for multiplexing purposes.
Which one should we pick?
Some capacity planning questions for new IT services
• Will the service work?
• How well will the service work?
• Could the service work better?
• What are the bounds for the service?
• What is the cost of the service?
• What are the end-user’s needs for tomorrow?
Capacity Planning: big picture
[Diagram labels: Workload; Information technology & network infrastructure; QoS ?; Cost $ ?]
Some useful concepts
• Workload
– The set of all inputs the IT infrastructure receives from its environment.
• Service level agreement (SLA) QoS
• Adequate Capacity
– An IT system has adequate capacity if the SLAs are continuously met for a specified technology and standards, and the services are provided within cost constraints.
QoS Metrics for IT Services
• QoS measures the user's experience interacting with a system:
– Availability, download time, transaction time
– Level of system security
– Errors
• Metrics must be available quickly
– To determine if an SLA is being violated and to take fast corrective action
• Metrics must also be useful for long-term trending
– To evaluate the return on investment in services and technologies
– To evaluate how cost-effective the IT services are
Capacity Planning: how
[Diagram: the Capacity Planning Process, driven by Business Requirements & Measurable Goals]
– Understand Service Architecture
– Characterize the Workload
– Forecast Workload Evolution
– Obtain Model Parameters
– Develop a Performance Model
– Model Validation and Calibration
– Predict Service Performance-Availability
– Cost-Performance Analysis & Actions
Models used in the process: Workload Model; Performance-Availability Model; Cost Model
Workload Models
An example of workload characterization and modeling
Hierarchical Characterization of Live Streaming Media (*)
(*) Joint work with Azer Bestavros and Shudong Jin (Boston University), to appear in the IEEE/ACM Transactions on Networking (TON), August 2004: Veloso, Almeida, Meira, Bestavros, Jin.
Measure, Model, Synthesize
[Diagram labels: Observations; Logs and Traces; Characterization; Analysis; Models; Distributions of Random Variables; Workload Parameters; Generation; Synthetic Workloads; Trace-driven Evaluation; Parametric Evaluation; Validation; Artifacts; Protocol, Resources; Caching, Multicast, …]
Workload analysis: Live versus Stored Content
• How different are workloads resulting from clicking with a mouse versus surfing with a remote control?
• Many studies on stored content access characteristics, but none on live content!
Live versus Stored Content
• Live ⇒ Streaming (not vice versa)
– Access to stored streaming media (e.g., movie clips, music) is not access to “live” content
– Periodic rebroadcast of content (e.g., pay-per-view) is not access to “live” content
• Value of live content is in its spontaneity
– Watching the Brazil soccer team beat Germany “live” is intrinsically different from watching it on tape
• Internet as live content delivery device
– Enables bypassing of editorial controls (e.g., user chooses which feed to watch)
Primary Workload Considered
• Live reality show workload from one of the top content providers in Brazil
• 24x7 live content complements a one hour/day reality TV show (à la “Big Brother” in the US)
• Web site offers users two live objects, each a feed from one of 48 cameras mounted around a “house” where contestants live
• Content served over unicast, with the server adjusting the rate to match client bandwidth
What did we log?
• Client info:
– ID, IP address, DNS, CPU, OS, language, …
• Access info:
– Object URI, start and stop times, codec, …
• Transfer stats:
– Packets sent, received, recovered, …
• Server info:
– CPU load, # of sessions, configuration, …
Characterization Hierarchy
• Client layer
• Session layer
• Transfer layer
Basic Statistics
• Peak 1-minute aggregate bandwidth ~ 80 Mbps
• Server network/CPU not an issue; this is important to ensure characterization is not impeded by lack of server resources
Client Layer: Concurrency Profile
• Clear periodic patterns (diurnal/weekly)
• Marginal distribution fits an exponential
Client Layer: Arrival Process
• Generative arrival process at time t is Poisson with a periodic (diurnal) rate λ(t)
– Good fit when λ(t) is piecewise constant over periods shorter than one hour
[Figure label: Poisson process]
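The arrival model above can be sketched as a synthetic generator. This is a minimal sketch, assuming 30-minute slots and made-up hourly rates (`HYPOTHETICAL_HOURLY_RATES` is not from the measured workload); within each slot the rate is constant, so inter-arrival times are exponential, and the memoryless property lets each slot restart independently.

```python
import random

# Hypothetical diurnal profile: client arrivals per hour for each hour of a day.
HYPOTHETICAL_HOURLY_RATES = [
    20, 10, 5, 5, 5, 10, 30, 60, 90, 100, 110, 120,
    130, 120, 110, 100, 110, 130, 160, 200, 220, 180, 100, 50,
]

def piecewise_poisson_arrivals(rates_per_hour, slot_hours, seed=0):
    """Arrival times (in hours) of a Poisson process whose rate is
    constant inside each slot of length `slot_hours`."""
    rng = random.Random(seed)
    arrivals = []
    for i, rate in enumerate(rates_per_hour):
        start, end = i * slot_hours, (i + 1) * slot_hours
        if rate <= 0:
            continue  # no arrivals in a slot with zero rate
        # Exponential gaps with mean 1/rate; restart at each slot boundary.
        t = start + rng.expovariate(rate)
        while t < end:
            arrivals.append(t)
            t += rng.expovariate(rate)
    return arrivals

# 30-minute slots: each hourly rate is used for two consecutive slots.
slot_rates = [r for r in HYPOTHETICAL_HOURLY_RATES for _ in range(2)]
arrivals = piecewise_poisson_arrivals(slot_rates, slot_hours=0.5)
expected_total = sum(r * 0.5 for r in slot_rates)
print(len(arrivals), "arrivals generated; expected about", expected_total)
```

The generated count should fluctuate around the sum of rate × slot length, which is a quick sanity check before feeding the arrivals to a simulator.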
Session Layer: Request IAT
• Inter-arrival time of requests within a session is best fitted to a lognormal
Transfer Layer: Transfer Length
• Transfer length is lognormal
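A lognormal fit of this kind can be sketched with the standard maximum-likelihood estimates: the mean and standard deviation of the logarithms of the samples. The parameter values below (`TRUE_MU`, `TRUE_SIGMA`) are hypothetical, chosen only to generate test data; they are not the values measured in the study.

```python
import math
import random
import statistics

def fit_lognormal(samples):
    """Maximum-likelihood lognormal fit: mu and sigma are the mean and
    (population) standard deviation of log(x)."""
    logs = [math.log(x) for x in samples]
    return statistics.fmean(logs), statistics.pstdev(logs)

# Hypothetical parameters used only to synthesize transfer lengths (seconds).
TRUE_MU, TRUE_SIGMA = 4.0, 1.5
rng = random.Random(1)
synthetic_lengths = [rng.lognormvariate(TRUE_MU, TRUE_SIGMA) for _ in range(20000)]

mu_hat, sigma_hat = fit_lognormal(synthetic_lengths)
print("estimated mu, sigma:", round(mu_hat, 3), round(sigma_hat, 3))
```

The same function applies to request inter-arrival times within a session; in practice the samples would come from the server logs rather than from a generator.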
Detailed Workload Model: “typical” characteristics
Performance Models
Types of Models for Capacity Planning
[Figure: models placed on two axes, intuition (low to high) and accuracy (low to high); labeled regions: clueless, naive, complex, practical (trends), ideal]
Suggested by Faloutsos, DIMACS Workshop 2002
Examples of Capacity Planning Models
• Back-of-the-envelope models:
– Simple queuing results (e.g., Little’s Law)
– Simple bounds
• Elaborate models:
– Queuing network model that calculates response times, utilization, and queue length.
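As a back-of-the-envelope illustration, Little’s Law states that the average number of requests in a system equals throughput times average response time (N = X × R). The sketch below uses the figures from the Web server example in this talk (5 req/sec, 0.67702 sec); the function names are illustrative only.

```python
def littles_law_population(throughput, response_time):
    """Average number of requests in the system: N = X * R."""
    return throughput * response_time

def littles_law_response_time(population, throughput):
    """Average response time from population and throughput: R = N / X."""
    return population / throughput

# 5 requests/sec with an average response time of 0.67702 sec.
in_system = littles_law_population(5.0, 0.67702)
print("average requests in the system:", round(in_system, 4))
```

The law holds for any stable system regardless of arrival or service distributions, which is why it is a common first check on measured data.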
Simple Model: C/S x p2p architecture (*)
• Increase in streaming media traffic
• Traditional distribution approach:
– Client-server architecture
– High server and network bandwidth requirements
[Diagram labels: Server, Clients]
(*) Quantitative Analysis of Strategies for Streaming Media Distribution, LA-Web 2003, Almeida, Vasconcelos, Meira, Mata, Rocha
csdl.computer.org/comp/proceedings/la-web/2003/2058/00/20580154abs.htm
Simple Model: p2p overlay network
– Application-level multicast-based approach
– Peer-to-peer systems
– Cooperation of clients and server
– Servent: client + server
– Intuition: better system scalability, verified by experimentation and modeling.
[Diagram labels: Server, Servents]
Servent Scalability
Given:
n = number of clients a servent forwards packets to (fan-out)
b = average file bitrate
λ = average file packet rate
S(n) = CPU time to forward a packet to n clients
CPU utilization U_CPU(n) (applying Little’s Result):
$U_{CPU,video}(n) = \lambda_{video} \times S_{video}(n) \approx \lambda_{video} \times S_{video}(1) \times n$
$U_{CPU,audio}(n) = \lambda_{audio} \times S_{audio}(n) \approx \lambda_{audio} \times S_{audio}(1) \times n$
$U_{CPU}(n) = \dfrac{\lambda_{video} \times U_{CPU,video}(n) + \lambda_{audio} \times U_{CPU,audio}(n)}{\lambda_{video} + \lambda_{audio}}$
Network bandwidth BW(n):
$BW(n) = b \times n$
A simple sizing model
• Maximum fan-out F of a servent
S_CPU: dedicated servent CPU time; S_BW: dedicated servent output network bandwidth
$F = \min\left(S_{CPU} / U_{CPU}(1),\; S_{BW} / b\right)$
• Lower bound on the number of levels L in the P2P tree (to assess expected client delay and packet loss)
$n \le \sum_{i=1}^{L} F^{i} \;\Rightarrow\; L \ge \log_{F}\left(n \times \dfrac{F-1}{F} + 1\right)$
• Ex: n = 1000, S_BW = 1 Mbps, music (S_CPU negligible, b = 100 kbps)
F = 10, L ≥ 3
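The sizing rule can be checked numerically. This is a minimal sketch under one stated assumption: "S_CPU negligible" is modeled as a per-client CPU utilization of zero, so only bandwidth limits the fan-out. The function names are illustrative, and the number of levels is found by direct summation of F^i rather than by the closed-form logarithm, which avoids floating-point rounding at exact powers.

```python
import math

def max_fan_out(s_cpu, u_cpu_1, s_bw, b):
    """F = min(S_CPU / U_CPU(1), S_BW / b), rounded down to whole clients.
    A zero U_CPU(1) is treated as 'CPU is not the limiting resource'."""
    cpu_limit = math.inf if u_cpu_1 == 0 else s_cpu / u_cpu_1
    return math.floor(min(cpu_limit, s_bw / b))

def min_tree_levels(n, fan_out):
    """Smallest L such that n <= F + F^2 + ... + F^L."""
    if fan_out < 1:
        raise ValueError("fan-out must be at least 1")
    levels, reachable = 0, 0
    while reachable < n:
        levels += 1
        reachable += fan_out ** levels
    return levels

# Example from the slide: n = 1000 clients, S_BW = 1 Mbps, b = 100 kbps.
F = max_fan_out(s_cpu=1.0, u_cpu_1=0.0, s_bw=1_000_000, b=100_000)
L = min_tree_levels(1000, F)
print("fan-out F =", F, "levels L >=", L)
```

With F = 10, two levels reach only 110 clients while three reach 1110, which matches the bound L ≥ 3.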
Simple model: bounds on performance
• Consider a database server with a processor and 10 disks. Given the transaction service demands, we want to calculate the maximum throughput.
• Considering (from the workload model) that disk 8 is the bottleneck, we want to understand the impact on the server if disk 8 is replaced by one twice as fast as the original.
Simple model (2): upper bounds on performance
[Diagram: transactions arrive at rate λ at a processor followed by disks 1 through 10; X is the throughput]
$D_{max} = \max_{k} \{D_k\}$
$X \le \dfrac{1}{D_k} \;\; \forall k \;\Rightarrow\; X \le \dfrac{1}{D_{max}}$
Open systems: X = λ if λ ≤ 1/D_max
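The bound can be applied to the database server scenario with a few lines of code. The service demands below are hypothetical (the slides do not give them); they are chosen so that disk 8 is the bottleneck, as stated in the scenario.

```python
def throughput_bound(service_demands):
    """Return (bottleneck resource, upper bound on throughput X <= 1/D_max)."""
    bottleneck = max(service_demands, key=service_demands.get)
    return bottleneck, 1.0 / service_demands[bottleneck]

# Hypothetical service demands in seconds per transaction.
demands = {"processor": 0.012}
demands.update({f"disk{k}": 0.015 for k in range(1, 11)})
demands["disk7"] = 0.025
demands["disk8"] = 0.040  # the bottleneck in this scenario

before = throughput_bound(demands)

# Replace disk 8 by a device twice as fast: its service demand halves.
upgraded = dict(demands, disk8=demands["disk8"] / 2)
after = throughput_bound(upgraded)

print("before upgrade:", before)
print("after upgrade: ", after)
```

In this made-up configuration the bound rises from about 25 to about 40 transactions/sec rather than doubling, because the bottleneck moves to disk 7 once disk 8 is upgraded.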
Throughput Asymptotic Bound
Upgraded system = bottleneck (disk C) replaced by a 2x faster device.
A Detailed Model of a Web server
[Diagram: Web server modeled as an incoming link, a server (cpu, disk 1, disk 2), and an outgoing link]
Model Parameters
• Input parameters
– Workload intensity
  • HTTP requests/sec
  • Transactions/sec
  • E-business functions/sec
– Service demands for each resource and each type of request.
• Results
– Response time, utilization, queue length
Residence Time at the CPU
[Diagram: Web server with λ = 5 req/sec. Service demands: incoming link 0.00107 sec, cpu 0.003 sec, disk 1 0.08 sec, disk 2 0.12 sec, outgoing link 0.109 sec]
$R'_{CPU} = \dfrac{D_{cpu}}{1 - \lambda D_{cpu}} = \dfrac{0.003}{1 - 5 \times 0.003} = 0.00305 \text{ sec}$
Residence Time at Outgoing Link
[Diagram: Web server with λ = 5 req/sec. Service demands: incoming link 0.00107 sec, cpu 0.003 sec, disk 1 0.08 sec, disk 2 0.12 sec, outgoing link 0.109 sec]
$R'_{OutLink} = \dfrac{D_{OutLink}}{1 - \lambda D_{OutLink}} = \dfrac{0.109}{1 - 5 \times 0.109} = 0.239 \text{ sec}$
Summary of Results
Resource     Service Demand (sec)   Utilization   Residence Time (sec)
Inc. Link    0.00107                0.54%         0.00108
CPU          0.00300                1.50%         0.00305
Disk 1       0.08000                40.00%        0.13333
Disk 2       0.12000                60.00%        0.30000
Out. Link    0.10900                54.50%        0.23956
Total        0.31307                              0.67702
Sum of service demands: 0.31307 sec
Average response time: 0.67702 sec
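The table can be reproduced from the service demands and the arrival rate alone. The sketch below applies the two relations used on the previous slides to every resource: utilization U = λ × D and residence time R' = D / (1 − λ × D). The function and variable names are illustrative.

```python
ARRIVAL_RATE = 5.0  # requests/sec, from the example

# Service demands in seconds per request, from the example.
SERVICE_DEMANDS = {
    "Inc. Link": 0.00107,
    "CPU": 0.003,
    "Disk 1": 0.08,
    "Disk 2": 0.12,
    "Out. Link": 0.109,
}

def open_model_metrics(arrival_rate, demands):
    """Per-resource (utilization, residence time) for an open queuing model."""
    rows = {}
    for name, demand in demands.items():
        utilization = arrival_rate * demand
        if utilization >= 1.0:
            raise ValueError(f"{name} is saturated at this arrival rate")
        rows[name] = (utilization, demand / (1.0 - utilization))
    return rows

rows = open_model_metrics(ARRIVAL_RATE, SERVICE_DEMANDS)
total_demand = sum(SERVICE_DEMANDS.values())
response_time = sum(residence for _, residence in rows.values())

for name, (utilization, residence) in rows.items():
    print(f"{name:10s} {SERVICE_DEMANDS[name]:.5f} {utilization:7.2%} {residence:.5f}")
print(f"Sum of service demands: {total_demand:.5f} sec")
print(f"Average response time:  {response_time:.5f} sec")
```

Queueing accounts for the difference between the two totals: the request needs 0.31307 sec of pure service but spends 0.67702 sec in the system.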
Response vs. Arrival Rate
[Figure: response time (sec, 0 to 4.5) versus arrival rate (requests/sec, 1 to 8), with the service level agreement marked]
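The curve can be regenerated from the same open-model formula, and the highest arrival rate that still meets an SLA can be found by bisection, since response time grows monotonically with the arrival rate. The SLA threshold below (`HYPOTHETICAL_SLA_SECONDS`) is an assumption; the slide does not state its value.

```python
import math

# Service demands (sec) of the Web server example.
SERVICE_DEMANDS = [0.00107, 0.003, 0.08, 0.12, 0.109]

def response_time(arrival_rate, demands):
    """Sum of residence times D / (1 - lambda * D); infinite if saturated."""
    total = 0.0
    for demand in demands:
        utilization = arrival_rate * demand
        if utilization >= 1.0:
            return math.inf
        total += demand / (1.0 - utilization)
    return total

def max_rate_within_sla(demands, sla_seconds, tol=1e-6):
    """Largest arrival rate (within tol) whose response time meets the SLA."""
    if response_time(0.0, demands) > sla_seconds:
        return 0.0  # the SLA cannot be met even with no queueing
    lo, hi = 0.0, 1.0 / max(demands)  # hi is the saturation rate
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if response_time(mid, demands) <= sla_seconds:
            lo = mid
        else:
            hi = mid
    return lo

HYPOTHETICAL_SLA_SECONDS = 1.0  # assumed threshold, not from the slides

for rate in range(1, 9):
    print(rate, "req/sec ->", round(response_time(rate, SERVICE_DEMANDS), 3), "sec")
best_rate = max_rate_within_sla(SERVICE_DEMANDS, HYPOTHETICAL_SLA_SECONDS)
print("max arrival rate within the assumed SLA:", round(best_rate, 3))
```

The steep rise near 8 req/sec reflects disk 2 approaching saturation at 1/0.12 ≈ 8.33 req/sec.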
Conclusions
• Capacity planning and performance modeling are useful management tools as businesses and individuals become increasingly dependent on IT and communication services.
• Convergence of IT and Telecom worlds
– Telecom community: performance modeling
– IT community: testing and performance monitoring
• Performance models are essential for understanding the complex integration of information and communication technologies. They could be a useful tool for the network operations and management community.