Mohammad Hajjat Purdue University Joint work with: Shankar P N (Purdue), David Maltz (Microsoft),...


  • Slide 1
  • Mohammad Hajjat (Purdue University). Joint work with: Shankar P N (Purdue), David Maltz (Microsoft), Sanjay Rao (Purdue), and Kunwadee Sripanidkulchai (NECTEC Thailand). Dealer: Application-aware Request Splitting for Interactive Cloud Applications
  • Slide 2
  • Performance of Interactive Applications. Interactive apps place stringent requirements on user response time. Amazon: every 100 ms of latency costs 1% in sales. Google: a 0.5 s delay caused traffic and revenue to drop by 20%. Tail importance: SLAs are defined on the 90th-percentile and higher response times.
  • Slide 3
  • Cloud Computing: Benefits and Challenges. Benefits: elasticity; cost savings; geo-distribution (service resilience, disaster recovery, better user experience, etc.). Challenges: performance is variable [Ballani11], [Wang10], [Li10], [Mangot09], etc. Even worse, data centers fail, at an average cost of $5,600 per minute!
  • Slide 4
  • Approaches for Handling Cloud Performance Variability. Autoscaling? Can't tackle storage problems, network congestion, etc., and is slow: tens of minutes in public clouds. DNS-based and server-based redirection? They overload the remote DC, waste local resources, and DNS-based schemes may take hours to react.
  • Slide 5
  • Contributions. Introduce Dealer to help interactive multi-tier applications respond to transient performance variability in the cloud. Split requests at component granularity (rather than at the granularity of an entire DC). Pick the best combination of component replicas (potentially across multiple DCs) to serve each individual request. Benefits over naive approaches: handles a wide range of cloud variability (performance problems, network congestion, workload spikes, failures, etc.); adapts on short time scales (tens of seconds to a few minutes); improves the performance tail (90th percentile and higher): more than 6x under natural cloud dynamics, and more than 3x over redirection schemes (e.g., DNS-based load balancers).
  • Slide 6
  • Outline: Introduction; Measurement and Observations; System Design; Evaluation
  • Slide 7
  • Performance Variability in Multi-tier Interactive Applications. Thumbnail application: [architecture figure: Load Balancer → Web Role / IIS (FE) → Queue → Worker Roles (BL1, BL2) → blob storage (BE)]. Multi-tier apps may consist of hundreds of components. Each app is deployed on 2 DCs simultaneously.
  • Slide 8
  • Performance Variability in Multi-tier Interactive Applications. [Box plots of per-component delay for FE, BL1, BL2, and BE, showing the 25th percentile, median, 75th percentile, and outliers]
  • Slide 9
  • Observations. Replicas of a component are uncorrelated. Few components show poor performance at any given time. Performance problems are short-lived: 90% last less than 4 minutes. [Plots for FE, BL1, BL2, and BE in each DC]
  • Slide 10
  • Outline: Introduction; Measurement and Observations; System Design; Evaluation
  • Slide 11
  • Dealer Approach: Per-Component Re-routing. Split requests at each component dynamically. Serve each request using a combination of replicas across multiple DCs. [Figure: components C1, C2, C3, C4, ..., Cn replicated in two DCs behind a GTM]
  • Slide 12
  • Dealer System Overview. [Figure: a GTM directs users to DCs; Dealer splits traffic among component replicas C1, C2, C3, ..., Cn across DCs]
  • Slide 13
  • Dealer High-Level Design. Pipeline: Determine Delays → Compute Split-Ratios → Application, supported by Dynamic Capacity Estimation and Stability mechanisms.
  • Slide 14
  • Determining Delays. Monitoring: instrument apps to record component processing time and inter-component delay. Use X-Trace for instrumentation, which propagates a global ID; automate integration using Aspect-Oriented Programming (AOP); push logs asynchronously to reduce overhead. Active probing: send requests along lightly used links and components; use workload generators (e.g., Grinder); heuristics speed recovery by biasing probes toward better-performing paths.
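The monitoring idea on this slide can be sketched as follows. This is an illustrative sketch, not the paper's actual instrumentation (which uses X-Trace and AOP in the application's own language): each request carries a global ID, a decorator records per-component processing time, and records are pushed to a background thread so logging does not slow the request path.

```python
import queue
import threading
import time

# Thread-safe buffer: components enqueue records, a background thread drains them.
log_queue = queue.Queue()

def log_writer():
    # Background consumer: in a real system this would ship records to a log store.
    while True:
        record = log_queue.get()
        if record is None:  # sentinel for shutdown
            break
        log_queue.task_done()

threading.Thread(target=log_writer, daemon=True).start()

def traced(component):
    """Decorator: record a component's processing time under a global request ID."""
    def wrap(fn):
        def inner(request_id, *args, **kwargs):
            start = time.perf_counter()
            result = fn(request_id, *args, **kwargs)
            elapsed = time.perf_counter() - start
            # Asynchronous push: enqueueing is cheap, shipping happens off-path.
            log_queue.put((request_id, component, elapsed))
            return result
        return inner
    return wrap

@traced("FE")
def frontend(request_id, payload):
    # Stand-in for a front-end component's work.
    return payload.upper()

result = frontend("req-42", "hello")  # → "HELLO", plus one queued log record
```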
  • Slide 15
  • Determining Delays. Monitoring and probing produce: the delay matrix D[,] (component processing and inter-component communication delays) and the transaction matrix T[,] (transaction rates between components). The two estimates are combined, then smoothed for stability.
  • Slide 16
  • Calculating Split Ratios. [Figure: components FE, BL1, BL2, and BE replicated across two DCs, with replica Cim denoting component i in data center m]
  • Slide 17
  • Calculating Split Ratios. Given: delay matrix D[im, jn], transaction matrix T[i, j], and capacity matrix C[i, m] (the capacity of component i in data center m). Goal: find split ratios TF[im, jn], the number of transactions between each pair of component replicas Cim and Cjn, such that overall delay is minimized. Algorithm: a greedy algorithm that assigns requests to the best-performing combination of replicas (across DCs).
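A greedy computation in the spirit of this slide can be sketched as below. This is an assumption-laden sketch, not the paper's exact algorithm: it processes the heaviest inter-component links first, routes each link's transactions to the replica pair with the lowest combined delay, and treats each replica's capacity as a shared budget across all its links.

```python
def compute_split_ratios(D, T, C, n_dcs):
    """Greedy split-ratio sketch.

    D: (i, m, j, n) -> combined delay of replica C_im talking to C_jn
    T: (i, j)       -> transaction rate between components i and j
    C: C[i][m]      -> capacity of component i in data center m
    Returns TF: (i, m, j, n) -> transactions routed between C_im and C_jn.
    """
    remaining = {(i, m): C[i][m] for i in C for m in range(n_dcs)}
    TF = {}
    # Handle the heaviest inter-component links first.
    for (i, j) in sorted(T, key=T.get, reverse=True):
        demand = T[(i, j)]
        # Replica pairs for this link, cheapest combined delay first.
        pairs = sorted((D[(i, m, j, n)], m, n)
                       for m in range(n_dcs) for n in range(n_dcs))
        for _, m, n in pairs:
            if demand <= 0:
                break
            take = min(demand, remaining[(i, m)], remaining[(j, n)])
            if take > 0:
                TF[(i, m, j, n)] = TF.get((i, m, j, n), 0) + take
                remaining[(i, m)] -= take
                remaining[(j, n)] -= take
                demand -= take
    return TF
```

For example, with two DCs, a single link FE→BL carrying 10 transactions, and the FE replica in DC 0 capped at 6, the cheapest pair absorbs 6 transactions and the overflow spills to the next-cheapest pair rather than overloading the replica.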
  • Slide 18
  • Other Design Aspects. Dynamic capacity estimation: an algorithm that dynamically captures component capacities and prevents components from being overloaded by re-routed traffic. Stability, at multiple levels: smooth the matrices with a weighted moving average (WMA); damp split ratios by a factor to avoid abrupt shifts. Integration with apps: can be integrated with any app, including stateful ones (e.g., StockTrader); provides generic pull/push APIs.
  • Slide 19
  • Outline: Introduction; Measurement and Observations; System Design; Evaluation
  • Slide 20
  • Real multi-tier, interactive apps: Thumbnails (photo processing, data-intensive) and StockTrader (stock trading, delay-sensitive, stateful), deployed in 2 Azure data centers in the US. Workload: a real workload trace from a large campus ERP app, and the DaCapo benchmark. Comparison with existing schemes: DNS-based redirection and server-based redirection, under performance-variability scenarios (single fault-domain failure, storage latency, transaction-mix change, etc.).
  • Slide 21
  • Running in the Wild. Evaluate Dealer under natural cloud dynamics to explore the inherent performance variability in cloud environments. Result: more than 6x difference.
  • Slide 22
  • Running in the Wild. [Figure: response-time results for FE, BL, and BE deployed in DCs A and B]
  • Slide 23
  • Dealer vs. GTM. Global Traffic Managers (GTMs) use DNS to route users to the closest DC. But the best-performing DC is not necessarily the closest DC (as measured by RTT). Results: more than 3x improvement at the 90th percentile and higher.
  • Slide 24
  • Dealer vs. Server-level Redirection. Server-level redirection re-routes the entire request at the granularity of DCs (e.g., an HTTP 302 redirect from DC A to DC B).
  • Slide 25
  • Evaluating Against Server-level Redirection. [Figure: FE, BL1, BL2, and BE replicated in two DCs, with redirection between them]
  • Slide 26
  • Conclusions. Dealer: a novel technique to handle cloud variability in multi-tier interactive apps. Per-component re-routing: dynamically split user requests across replicas in multiple DCs at component granularity. Handles transient cloud variability: performance problems in cloud services, workload spikes, failures, etc. Short time-scale adaptation: tens of seconds to a few minutes. Performance-tail improvement: more than 6x under natural cloud dynamics; more than 3x over coarse-grained redirection (e.g., DNS-based GTM).
  • Slide 27
  • Questions?