Providing Performance Guarantees for Cloud Applications

Anshul Gandhi (IBM T. J. Watson Research Center; Stony Brook University), Parijat Dube, Alexei Karve, Andrew Kochut, Li Zhang (IBM T. J. Watson Research Center)


Page 1: Providing Performance Guarantees for Cloud Applications

Providing Performance Guarantees for Cloud Applications

Anshul Gandhi
IBM T. J. Watson Research Center
Stony Brook University

Parijat Dube, Alexei Karve, Andrew Kochut, Li Zhang
IBM T. J. Watson Research Center

Page 2: Providing Performance Guarantees for Cloud Applications

Motivation

• Businesses have started moving to the cloud for their IT needs
  ─ reduces capital cost of buying servers
  ─ allows for elastic resizing of applications that have dynamic workload demand

• Cloud Service Providers (CSPs) offer monitoring and rule-based triggers to enable dynamic scaling of applications

Amazon Auto Scaling, Microsoft Azure Watch

[Figure: demand vs. time, with a "?" annotation]

Page 3: Providing Performance Guarantees for Cloud Applications

Motivation

• The values have to be determined by the user
  ─ requires expert knowledge of application (CPU, memory, network thresholds)
  ─ requires performance modeling expertise (when and how to scale)

Amazon Auto Scaling, Microsoft Azure Watch

How to set these values?

Page 4: Providing Performance Guarantees for Cloud Applications

Motivation

• The values have to be determined by the user
  ─ requires expert knowledge of application (CPU, memory, network thresholds)
  ─ requires performance modeling expertise (when and how to scale)

[Figure: 95th-percentile response time (ms) vs. arrival rate (req/s), annotated at 400 ms and 60 req/s]

Offline benchmarking
Trial-and-error
Expert application knowledge

The values are critical for application performance and require expertise

Page 5: Providing Performance Guarantees for Cloud Applications

Goal

• Take the burden away from the user
  ─ user only specifies the performance requirement
  ─ CSP should be fully responsible for scaling

Offline benchmarking
Trial-and-error
Expert application knowledge

Not possible for CSPs!

[Figure: 95th-percentile response time (ms) vs. arrival rate (req/s), annotated at 400 ms and 60 req/s]

Page 6: Providing Performance Guarantees for Cloud Applications


View from user’s perspective

Page 7: Providing Performance Guarantees for Cloud Applications


View from CSP’s perspective

Page 8: Providing Performance Guarantees for Cloud Applications

Problem statement

How to scale an unobservable cloud application to provide performance guarantees?

Page 9: Providing Performance Guarantees for Cloud Applications

Outline

1. Existing CSP solutions (survey)
2. Our solution: Dependable Compute Cloud (DC2)
   a) High-level idea behind DC2
   b) DC2 system architecture
   c) Modeling + Optimization (Queueing) (Kalman filtering)
3. Evaluation
   • RUBiS (eBay.com) multi-tier application
   • Various traces
   • Various workload mixes
4. Limitations and future work

Page 10: Providing Performance Guarantees for Cloud Applications

Existing CSP solutions

• Resource usage triggers
  ─ Amazon Auto Scaling, Microsoft Azure Watch, VMware AppInsight, CiRBA
• Request rate for specific software (e.g., Apache)
  ─ RightScale
• Latency/VM
  ─ Amazon Elastic Load Balancing
• Web site response time
  ─ Scalr

User has to set the values


Page 12: Providing Performance Guarantees for Cloud Applications

DC2: High-level idea

Observed:
• VM utilization
• Request rate
• End-to-end response time

Hidden:
• Service requirements of requests at each tier
• Network delay
• Background utilization (overhead)

Page 13: Providing Performance Guarantees for Cloud Applications

DC2: High-level idea

Hidden, to be inferred:
• Service requirements of requests at each tier
• Network delay
• Background utilization (overhead)

Observed:
• VM utilization
• Request rate
• End-to-end response time

Inference method: Kalman filtering



Page 16: Providing Performance Guarantees for Cloud Applications

DC2: System architecture

[Figure: DC2 architecture. Labels: user supplies initial topology + perf SLA to DC2; resource monitoring; application monitoring; modeling + optimization; scaling directives]


Page 18: Providing Performance Guarantees for Cloud Applications

DC2: Modeling + Optimization

Given the initial topology, how to dynamically scale the application in a cost-effective manner to ensure compliance with the user-specified SLA?

[Figure label: T_SLA]

Page 19: Providing Performance Guarantees for Cloud Applications

DC2: Modeling

Multi-tier queueing network model

[Figure: queueing network with request classes home, browse, buy]

Page 20: Providing Performance Guarantees for Cloud Applications

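DC2: Modeling

[Figure: three-tier queueing network with three request classes. Labels: arrival rates λ1, λ2, λ3; response times T1, T2, T3; service requirements S11 ... S33; background utilization U0j; network latency di]

Parameters:
• λi – Request rate for class i
• Ti – Response time for class i
• Sij – Service requirement for class i at tier j
• di – Network latency for class i
• U0j – Background utilization on tier j
• Uj – Utilization of tier j

24 parameters: 9 known + 15 unknown

$$U_j = U_{0j} + \sum_i \lambda_i S_{ij}$$

$$T_i = d_i + \sum_j \frac{S_{ij}}{1 - U_j}$$

6 equations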
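Reading the six equations as U_j = U0_j + Σ_i λi·Sij (one per tier) and T_i = d_i + Σ_j Sij / (1 − Uj) (one per class), the forward model can be sketched in a few lines. This is only an illustration: the function name `evaluate_model` and all numeric values are made up, not taken from the talk.

```python
def evaluate_model(lam, S, d, U0):
    """Return (U, T): per-tier utilizations and per-class response times.

    lam[i]  - request rate of class i (req/s)
    S[i][j] - service requirement of class i at tier j (seconds)
    d[i]    - network latency of class i (seconds)
    U0[j]   - background utilization of tier j (fraction of 1)
    """
    n_classes, n_tiers = len(lam), len(U0)
    # U_j = U0_j + sum_i lam_i * S_ij
    U = [U0[j] + sum(lam[i] * S[i][j] for i in range(n_classes))
         for j in range(n_tiers)]
    if any(u >= 1.0 for u in U):
        raise ValueError("a tier is saturated (U_j >= 1); model does not apply")
    # T_i = d_i + sum_j S_ij / (1 - U_j)
    T = [d[i] + sum(S[i][j] / (1.0 - U[j]) for j in range(n_tiers))
         for i in range(n_classes)]
    return U, T


# Hypothetical values for 3 classes (home, browse, buy) and 3 tiers.
lam = [20.0, 30.0, 10.0]
S = [[0.002, 0.004, 0.001],
     [0.002, 0.008, 0.003],
     [0.002, 0.006, 0.005]]
d = [0.001, 0.001, 0.001]
U0 = [0.05, 0.05, 0.05]

U, T = evaluate_model(lam, S, d, U0)
# The tier with the highest utilization is the likely bottleneck.
bottleneck = max(range(len(U)), key=lambda j: U[j])
print("utilization per tier:", [round(u, 3) for u in U])
print("response time per class (s):", [round(t, 4) for t in T])
print("most utilized tier index:", bottleneck)
```

In the setting of the talk, λi, Ti and Uj are monitored while Sij, di and U0j are unknown, so this function is the part that gets evaluated with guessed values.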

Page 21: Providing Performance Guarantees for Cloud Applications

DC2: Modeling

• Underdetermined system: 6 equations, 15 unknowns
• Need to "infer" unknowns
• Can leverage monitored values
• Kalman filtering
  ─ Observed states: {λi, Ti, Uj}
  ─ Hidden states: {Sij, di, U0j}

Estimation loop:
1. "Guess" unknowns
2. Evaluate functions using guesses
3. Compare with monitored values
4. Improve guess

Page 22: Providing Performance Guarantees for Cloud Applications

Kalman filtering

• KF is a reactive, feedback-based estimation approach that has only recently been employed for computer systems
• KF automatically learns the (possibly changing) system parameters, for any system, including combinations of workloads
• We extend KF to a 3-tier, 3-workload-class system
• Based on KF estimation, DC2 automatically and proactively detects which tier is the bottleneck and how to resolve the bottleneck (scale VMs)
  ─ does not require any knowledge of the application, except its topology
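A minimal sketch of the guess / evaluate / compare / improve loop, deliberately simplified to one tier and one request class so that only the linear relation U = U0 + λ·S is used. This is not the 3-tier, 3-class filter from the talk, which also involves the nonlinear response-time relation. The noise variances `q` and `r`, the initial guess, and the "true" values are all assumptions chosen for illustration.

```python
import random


def kalman_step(x, P, lam, u_obs, q=1e-10, r=2.5e-5):
    """One predict/update step for the hidden state x = [S, U0].

    Measurement model (linear): u_obs = lam * S + U0 + noise.
    q and r are assumed process and measurement noise variances.
    """
    # Predict: random-walk state, so only the uncertainty grows.
    P = [[P[0][0] + q, P[0][1]], [P[1][0], P[1][1] + q]]
    # "Evaluate functions using guesses": predicted utilization.
    h = [lam, 1.0]
    u_pred = h[0] * x[0] + h[1] * x[1]
    # "Compare with monitored values": the innovation.
    y = u_obs - u_pred
    # Kalman gain for a scalar measurement (no matrix inverse needed).
    ph = [P[0][0] * h[0] + P[0][1] * h[1], P[1][0] * h[0] + P[1][1] * h[1]]
    s = h[0] * ph[0] + h[1] * ph[1] + r
    k = [ph[0] / s, ph[1] / s]
    # "Improve guess": correct the state and shrink the covariance.
    x = [x[0] + k[0] * y, x[1] + k[1] * y]
    P = [[P[a][b] - k[a] * ph[b] for b in range(2)] for a in range(2)]
    return x, P


random.seed(0)
true_S, true_U0 = 0.010, 0.05      # hypothetical ground truth
x = [0.020, 0.0]                   # initial "guess" for [S, U0]
P = [[1.0, 0.0], [0.0, 1.0]]       # large initial uncertainty
for _ in range(60):                # 60 monitoring intervals
    lam = random.uniform(10.0, 60.0)
    u_obs = true_U0 + lam * true_S + random.gauss(0.0, 0.005)
    x, P = kalman_step(x, P, lam, u_obs)

print("estimated S  = %.4f s" % x[0])
print("estimated U0 = %.3f" % x[1])
```

The estimates only become identifiable once the request rate varies across intervals; with a constant λ, S and U0 cannot be separated from utilization alone.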

Page 23: Providing Performance Guarantees for Cloud Applications

Kalman filtering + Queueing

• KF can be integrated with system models (e.g., queueing models) to improve accuracy and convergence
• Model need not be accurate
  ─ KF leverages (true) monitored values to account for model inaccuracies
  ─ Well suited for approximate system models such as queueing-theoretic models
  ─ Can use other models as well, e.g., machine-learning-based models

Page 24: Providing Performance Guarantees for Cloud Applications

Kalman filtering + Queueing: Evaluation

[Figure: KF estimates over time]
• Initially: time to converge ~1 min (6 intervals), good accuracy
• After a change in workload is triggered: time to converge ~3 min (18 intervals), good accuracy


Page 26: Providing Performance Guarantees for Cloud Applications

Experimental setup

• SoftLayer machines
• OpenStack
• 8-core, 8 GB
• RUBiS

Page 27: Providing Performance Guarantees for Cloud Applications

RUBiS


Page 28: Providing Performance Guarantees for Cloud Applications

RUBiS

• RUBiS is an open-source benchmark inspired by ebay.com
• We focus on scaling the Tomcat app tier
• Different workload classes (home, browse, buy)

[Figure: multi-tier deployment with tiers labeled 4 vCPU, 4 vCPU, and 2 vCPU]

SLA: Tbrowse < 40 ms for every 10 s monitoring interval
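One way to check an SLA of this form is to count the monitoring intervals that breach the threshold. The sketch below assumes each 10 s interval reports a single response-time value for the browse class (the slides do not say whether that value is a mean or a percentile). The function name and the sample data are hypothetical.

```python
def violation_rate(samples_ms, threshold_ms=40.0):
    """Fraction of monitoring intervals whose value breaches the SLA."""
    if not samples_ms:
        raise ValueError("need at least one monitoring interval")
    # The SLA requires T < threshold, so T >= threshold is a violation.
    violations = sum(1 for t in samples_ms if t >= threshold_ms)
    return violations / len(samples_ms)


# Hypothetical browse response times, one value per 10 s interval.
browse_ms = [22.0, 31.5, 38.9, 41.2, 36.0, 29.4, 45.8, 33.3, 27.1, 30.0]
rate = violation_rate(browse_ms)
print("violating intervals: %.0f%%" % (100 * rate))
```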

Page 29: Providing Performance Guarantees for Cloud Applications


Demo

Page 30: Providing Performance Guarantees for Cloud Applications

Evaluation

Traces: Bursty trace [WITS], Hill trace [ITA], Rampdown trace [WITS]

Workload mixes: Base, MoreDB, MoreApp, MoreWeb

Page 31: Providing Performance Guarantees for Cloud Applications

DC2: All traces

[Figure panels: Bursty trace [WITS], Hill trace [ITA], Rampdown trace [WITS]]

Page 32: Providing Performance Guarantees for Cloud Applications


THRES(x,y): Bursty trace
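The slides do not spell out THRES(x,y). A common reading, assumed here, is a rule-based policy that removes a VM when average CPU utilization drops below x% and adds one when it exceeds y%. The function below is a sketch under that assumption; the VM limits and the utilization samples are made up.

```python
def thres_policy(cpu_util_pct, n_vms, low, high, min_vms=1, max_vms=10):
    """Return the new VM count under a THRES(low, high)-style rule."""
    if cpu_util_pct > high and n_vms < max_vms:
        return n_vms + 1   # scale out: utilization above upper threshold
    if cpu_util_pct < low and n_vms > min_vms:
        return n_vms - 1   # scale in: utilization below lower threshold
    return n_vms           # inside the band: do nothing


# Hypothetical average CPU utilization (%) per monitoring interval.
n = 2
history = []
for util in [45, 65, 70, 55, 25, 20, 40]:
    n = thres_policy(util, n, low=30, high=60)
    history.append(n)
print(history)  # VM count after each interval
```

The rule reacts only to utilization, so its behavior depends entirely on how well the two thresholds were chosen for the application.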

Page 33: Providing Performance Guarantees for Cloud Applications

Bursty trace: All policies

Policy          Bursty trace [WITS]
STATIC-OPT      V=0%, K=3.00
DC2             V=0%, K=2.50
THRES(30,60)    V=0%, K=2.50
THRES(30,50)    V=0%, K=2.79
THRES(40,60)    V=2.02%, K=2.19

Page 34: Providing Performance Guarantees for Cloud Applications

All traces: All policies

Policy          Bursty trace [WITS]    Hill trace [ITA]      Rampdown trace [WITS]
STATIC-OPT      V=0%, K=3.00           V=0%, K=4.00          V=0%, K=6.00
DC2             V=0%, K=2.50           V=0%, K=2.44          V=0%, K=4.76
THRES(30,60)    V=0%, K=2.50           V=6.66%, K=2.56       V=0%, K=6.00
THRES(30,50)    V=0%, K=2.79           V=0%, K=2.72          V=0%, K=6.00
THRES(40,60)    V=2.02%, K=2.19        V=15.87%, K=2.13      V=0%, K=4.62

Page 35: Providing Performance Guarantees for Cloud Applications

All workloads: All policies (Bursty trace)

Policy          Base             MoreDB              MoreApp             MoreWeb
STATIC-OPT      V=0%, K=3.00     V=0%, K=4.00        V=0%, K=3.00        V=0%, K=3.00
DC2             V=0%, K=2.50     V=0%, K=3.66        V=0%, K=2.94        V=0%, K=2.87
THRES(30,60)    V=0%, K=2.50     V=3.06%, K=3.40     V=2.04%, K=2.98     V=0%, K=3.00

• Rule-based policies like THRES require tuning and are not robust

• Other auto-scaling policies require control of application

• DC2 is superior to THRES and does not require application control

Page 36: Providing Performance Guarantees for Cloud Applications

Other applications

• Bottleneck analysis
  ─ HeavyDB use case
• What-if analysis
  ─ Optimal VM configuration
• Can be combined with forecasting models
• Can be combined with other system models
  ─ ML-based instead of queueing models
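A what-if analysis for the VM configuration could, for example, search for the smallest number of VMs at one tier that keeps a class under its response-time target. The sketch below uses the queueing relations U_j = U0_j + Σ_i λi·Sij and T_i = d_i + Σ_j Sij / (1 − Uj), and assumes that requests are split evenly across the VMs of the scaled tier. That assumption, the function names, and all numbers are illustrative and not taken from the talk.

```python
def response_time(k, lam, S, d, U0, tier):
    """Predicted per-class response times with k VMs at `tier`.

    Assumes an even split of requests across the k VMs, so each VM of
    that tier sees lam_i / k; every other tier keeps a single VM.
    """
    n_classes, n_tiers = len(lam), len(U0)
    U = []
    for j in range(n_tiers):
        share = k if j == tier else 1
        load = sum(lam[i] * S[i][j] for i in range(n_classes)) / share
        U.append(U0[j] + load)
    if max(U) >= 1.0:
        return None  # saturated: no finite response time in this model
    return [d[i] + sum(S[i][j] / (1.0 - U[j]) for j in range(n_tiers))
            for i in range(n_classes)]


def min_vms(lam, S, d, U0, tier, cls, t_sla, k_max=20):
    """Smallest k in 1..k_max for which class `cls` meets t_sla, else None."""
    for k in range(1, k_max + 1):
        T = response_time(k, lam, S, d, U0, tier)
        if T is not None and T[cls] < t_sla:
            return k
    return None


# Hypothetical parameters: classes (home, browse, buy), tiers (web, app, db).
lam = [40.0, 60.0, 20.0]
S = [[0.001, 0.004, 0.001],
     [0.001, 0.008, 0.002],
     [0.001, 0.006, 0.003]]
d = [0.002, 0.002, 0.002]
U0 = [0.05, 0.05, 0.05]

# Scale the app tier (index 1) so that browse (index 1) stays under 40 ms.
k = min_vms(lam, S, d, U0, tier=1, cls=1, t_sla=0.040)
print("VMs needed at the app tier:", k)
```

In practice the inputs Sij, di and U0j would come from the estimation step rather than being known in advance.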


Page 38: Providing Performance Guarantees for Cloud Applications

Limitations and future work

• Evaluation limited to dynamic web applications
  ─ Currently investigating Hadoop-type applications
• Only applies to stateless tiers
  ─ DB scaling would be challenging
• Scaling algorithm can be further improved
  ─ Add delayedoff
• Non-zero convergence time
• Tunable parameters
  ─ Response time threshold (35 ms)
  ─ Monitoring interval (10 s)

Page 39: Providing Performance Guarantees for Cloud Applications

Conclusions

• Need for adaptive scaling services for (opaque) cloud applications
  ─ Application agnostic
  ─ Robust to arrival pattern and workload mix
• Existing commercial offerings do not suffice: rule-based
• Existing auto-scaling research solutions do not apply due to lack of visibility and control of opaque cloud applications
• Our solution: Dependable Compute Cloud (DC2)
  ─ Does not require offline benchmarking or expert knowledge
  ─ Can adapt to dynamic changes in workload
• Well suited for cloud users who lack expertise in system modeling and application knowledge

Page 40: Providing Performance Guarantees for Cloud Applications

Thank You!
