Identifying Performance Bottlenecks in CDNs through TCP ...mfreed/docs/wand-wmust11-slides.pdf ·...

15
Identifying Performance Bottlenecks in CDNs through TCP-Level Monitoring Peng Sun Minlan Yu, Michael J. Freedman, Jennifer Rexford Princeton University August 19, 2011

Transcript of Identifying Performance Bottlenecks in CDNs through TCP ...mfreed/docs/wand-wmust11-slides.pdf ·...

Page 1: Identifying Performance Bottlenecks in CDNs through TCP ...mfreed/docs/wand-wmust11-slides.pdf · Identifying Performance Bottlenecks in CDNs through TCP-Level Monitoring Peng Sun

Identifying Performance Bottlenecks in CDNs through

TCP-Level Monitoring

Peng Sun Minlan Yu, Michael J. Freedman, Jennifer Rexford

Princeton University

August 19, 2011

Page 2: Identifying Performance Bottlenecks in CDNs through TCP ...mfreed/docs/wand-wmust11-slides.pdf · Identifying Performance Bottlenecks in CDNs through TCP-Level Monitoring Peng Sun

Performance Bottlenecks

2

Server

APP

Server OS

CDN Servers

Internet Clients

APP Write too slowly

Server OS Insufficient send buffer or Small initial congestion window

Internet Network congestion

Client Insufficient receive buffer

Page 3: Identifying Performance Bottlenecks in CDNs through TCP ...mfreed/docs/wand-wmust11-slides.pdf · Identifying Performance Bottlenecks in CDNs through TCP-Level Monitoring Peng Sun

Reaction to Each Bottleneck

3

Server

APP

Server OS

CDN Servers

Internet Clients

APP is bottleneck: Debug application

Server OS is bottleneck: Tune buffer size, or upgrade server

Internet is bottleneck: Circumvent the congested part of network

Client is bottleneck: Notify client to change

Page 4: Identifying Performance Bottlenecks in CDNs through TCP ...mfreed/docs/wand-wmust11-slides.pdf · Identifying Performance Bottlenecks in CDNs through TCP-Level Monitoring Peng Sun

Server

APP

Packet

Sniffer Server OS

Previous Techniques Not Enough

4

Application logs: No details of network activities

Packet sniffing: Expensive to capture

Active probing: Extra load on network

Transport-layer stats: Directly reveal perf. bottlenecks

Page 5: Identifying Performance Bottlenecks in CDNs through TCP ...mfreed/docs/wand-wmust11-slides.pdf · Identifying Performance Bottlenecks in CDNs through TCP-Level Monitoring Peng Sun

How TCP Stats Reveal Bottlenecks

CDN Servers Internet Clients

CDN Server Applications

Server Network Stack Network Path Clients

5

Insufficient data in send buffer

Send buffer full or Initial congestion window too small

Packet loss Receive

window too small

Page 6: Identifying Performance Bottlenecks in CDNs through TCP ...mfreed/docs/wand-wmust11-slides.pdf · Identifying Performance Bottlenecks in CDNs through TCP-Level Monitoring Peng Sun

Measurement Framework

•  Collect TCP statistics •  Web100 kernel patch •  Extract useful TCP stats for analyzing perf.

•  Analysis tool •  Bottleneck classifier for individual connections

•  Cross-connection correlation at AS level •  Map conn. to AS based on RouteView

•  Correlate bottlenecks to drive CDN decisions

6

Page 7: Identifying Performance Bottlenecks in CDNs through TCP ...mfreed/docs/wand-wmust11-slides.pdf · Identifying Performance Bottlenecks in CDNs through TCP-Level Monitoring Peng Sun

How Bottleneck Classifier Works

7

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 180

50

100

150

200

250

300

350

Time in seconds

KB

RwinCwinBytesInSndBuf

BytesInSndBuf = Rwin

Rwin limits sending

Client is bottleneck

Cwin drops greatly and Packet loss

Network path is bottleneck

Small initial Cwin

Slow start limits perf.

Network Stack is bottleneck

Page 8: Identifying Performance Bottlenecks in CDNs through TCP ...mfreed/docs/wand-wmust11-slides.pdf · Identifying Performance Bottlenecks in CDNs through TCP-Level Monitoring Peng Sun

CoralCDN Experiment

•  CoralCDN serves 1 million clients per day

•  Experiment Environment •  Deployment: A Clemson PlanetLab node •  Polling interval: 50 ms •  Traces to Show: Feb 19th – 25th 2011 •  Total # of Conn.: 209K •  After removing

Cache-Miss Conn.: 137K (Total 2008 ASes)

•  Log Space overhead •  < 200MB per Coral server per day

8

Page 9: Identifying Performance Bottlenecks in CDNs through TCP ...mfreed/docs/wand-wmust11-slides.pdf · Identifying Performance Bottlenecks in CDNs through TCP-Level Monitoring Peng Sun

What are Major Bottleneck for Individual Clients?

•  We calculate the fraction of time that the connection is under each bottleneck in lifetime

9

Bottlenecks % of Conn. With Bottleneck for

>40% of Lifetime

Server Application 10.75%

Server Network Stack 18.72%

Network Path 3.94%

Clients 1.27%

Reasons: Slow CPU or scarce disk resources of the PlanetLab node

Reasons: Congestion window rises too slowly for short conn. (>80% of the connections last <1 second)

Reasons: Spotty network (discussed in next slide) Reasons: Receive buffer too small (Most of them are <30KB) Our suggestion: Use more powerful PlanetLab machines Our suggestion: Use larger initial congestion window Our suggestion: Filter them out of decision making

Page 10: Identifying Performance Bottlenecks in CDNs through TCP ...mfreed/docs/wand-wmust11-slides.pdf · Identifying Performance Bottlenecks in CDNs through TCP-Level Monitoring Peng Sun

AS-Level Correlation

•  CDNs make decision at the AS level •  e.g., change server selection for 1.1.1.0/24

•  Explore at the AS level: •  Filter out non-network bottlenecks •  Whether network problems exist

•  Whether the problem is consistent

10

Page 11: Identifying Performance Bottlenecks in CDNs through TCP ...mfreed/docs/wand-wmust11-slides.pdf · Identifying Performance Bottlenecks in CDNs through TCP-Level Monitoring Peng Sun

Filtering Out Non-Network Bottlenecks

•  CDNs change server selection if clients have low throughput

•  Non-network factors can limit throughput

•  236 out of 505 low-throughput ASes limited by non-network bottlenecks

•  Filtering is helpful: •  Don’t worry about things CDNs cannot control •  Produce more accurate estimates of perf.

11

Page 12: Identifying Performance Bottlenecks in CDNs through TCP ...mfreed/docs/wand-wmust11-slides.pdf · Identifying Performance Bottlenecks in CDNs through TCP-Level Monitoring Peng Sun

Network Problem at AS Level

•  CDN make decision at AS level

•  Whether conn. in the same AS have common network problem

•  For 7.1% of the ASes, half of conn. have >10% packet loss rate

•  Network problems are significant at the AS level

12

Page 13: Identifying Performance Bottlenecks in CDNs through TCP ...mfreed/docs/wand-wmust11-slides.pdf · Identifying Performance Bottlenecks in CDNs through TCP-Level Monitoring Peng Sun

Consistent Packet Loss of AS

•  CDNs care about predictive value of measurement

•  Analyze the variance of average packet loss rates •  Each epoch (1 min) has nonzero average loss rate •  Loss rate is consistent across epochs

(standard deviation < mean)

13

Analysis Length # of ASes with

Consistent Packet Loss

One Week 377 / 2008

One Day (Feb 21st) 122 / 739

One Hour (Feb 21st 18:00~19:00)

19 / 121

Page 14: Identifying Performance Bottlenecks in CDNs through TCP ...mfreed/docs/wand-wmust11-slides.pdf · Identifying Performance Bottlenecks in CDNs through TCP-Level Monitoring Peng Sun

Conclusion & Future Work •  Use TCP-level stats to detect performance

bottlenecks

•  Identify major bottlenecks for a production CDN

•  Discuss how to improve CDN’s operation with our tool

•  Future Works •  Automatic and real-time analysis combined into

CDN operation •  Detect the problematic AS on the path •  Combine TCP-level stats with application logs to

debug online services 14

Page 15: Identifying Performance Bottlenecks in CDNs through TCP ...mfreed/docs/wand-wmust11-slides.pdf · Identifying Performance Bottlenecks in CDNs through TCP-Level Monitoring Peng Sun

Thanks!

Questions?

15