Computing on Jetstream: Streaming Analytics In the Wide-Area

COMPUTING ON JETSTREAM: STREAMING ANALYTICS IN THE WIDE-AREA Matvey Arye Joint work with: Ari Rabkin, Sid Sen, Mike Freedman and Vivek Pai


Transcript of Computing on Jetstream: Streaming Analytics In the Wide-Area

Page 1: Computing on Jetstream: Streaming Analytics In the Wide-Area

COMPUTING ON JETSTREAM:STREAMING ANALYTICS IN THE WIDE-AREA

Matvey Arye

Joint work with: Ari Rabkin, Sid Sen, Mike Freedman and Vivek Pai

Page 2: Computing on Jetstream: Streaming Analytics In the Wide-Area

THE RISE OF GLOBAL DISTRIBUTED SYSTEMS

Image shows CDN

Page 3: Computing on Jetstream: Streaming Analytics In the Wide-Area

TRADITIONAL ANALYTICS

Image shows CDN

Centralized

Database

Page 4: Computing on Jetstream: Streaming Analytics In the Wide-Area

BANDWIDTH IS EXPENSIVE

Price Trends 2005-2008: CPU (16x), Storage (10x), Bandwidth (2.7x)

[Above the Clouds, Armbrust et al.]

Page 5: Computing on Jetstream: Streaming Analytics In the Wide-Area

BANDWIDTH TRENDS

[TeleGeography's Global Bandwidth Research Service]

Page 6: Computing on Jetstream: Streaming Analytics In the Wide-Area

BANDWIDTH TRENDS

[TeleGeography's Global Bandwidth Research Service]

20% 20%

Page 7: Computing on Jetstream: Streaming Analytics In the Wide-Area

BANDWIDTH COSTS

• Amazon EC2 bandwidth: $0.05 per GB

• Wireless broadband: $2 per GB

• Cell phone broadband (ATT/Verizon): $6 per GB
  – (Other providers are similar)

• Satellite bandwidth: $200 - $460 per GB
  – May drop to ~$20

Page 8: Computing on Jetstream: Streaming Analytics In the Wide-Area

THIS APPROACH IS NOT SCALABLE

Image shows CDN

Centralized

Database

Page 9: Computing on Jetstream: Streaming Analytics In the Wide-Area

THE COMING FUTURE: DISPERSED DATA

Diagram labels: Dispersed Databases (label repeated four times)

Page 10: Computing on Jetstream: Streaming Analytics In the Wide-Area

WIDE-AREA COMPUTER SYSTEMS

• Web Services
  – CDNs
  – Ad Services
  – IaaS
  – Social Media

• Infrastructure
  – Energy Grid

• Military
  – Global Network
  – Drones
  – UAVs
  – Surveillance

Page 11: Computing on Jetstream: Streaming Analytics In the Wide-Area

NEED QUERIES ON A GLOBAL VIEW

• CDNs
  – Popularity of websites globally
  – Tracking security threats

• Military
  – Threat “chatter” correlation
  – Big picture view of battlefield

• Energy Grid
  – Wide-area view of energy production and expenditure

Page 12: Computing on Jetstream: Streaming Analytics In the Wide-Area

SOME QUERIES ARE EASY

Alert me when servers crash

Server Crashed

Page 13: Computing on Jetstream: Streaming Analytics In the Wide-Area

OTHERS ARE HARD

How popular are all of my domains? URLs?

Diagram labels: CDN; Requests (label repeated across the diagram)

Page 14: Computing on Jetstream: Streaming Analytics In the Wide-Area

BEFORE JETSTREAM

Chart axes: Bandwidth vs. Time [two days]

Chart labels: Needed for backhaul; 95% Level

Analyst’s remorse: not enough data; wasted bandwidth

Buyer’s remorse: system overload or overprovisioning

Page 15: Computing on Jetstream: Streaming Analytics In the Wide-Area

WHAT HAPPENS DURING OVERLOAD?

Chart axes: Bandwidth vs. Time [one day]; Latency vs. Time

Chart labels: Needed for backhaul; Available

Queue size grows without bound!

Page 16: Computing on Jetstream: Streaming Analytics In the Wide-Area

THE JETSTREAM VISION

Chart axes: Bandwidth vs. Time [two days]

Chart labels: Available; Needed for backhaul; Used by JetStream

JetStream lets programs adapt to shortages and backfill later.

Need new abstractions for programmers

Page 17: Computing on Jetstream: Streaming Analytics In the Wide-Area

SYSTEM ARCHITECTURE

Diagram labels: compute resources (several sites); Control plane; Data plane; Query graph; Coordinator Daemon; Optimized query; JetStream API; Planner Library; worker node; stream source

Page 18: Computing on Jetstream: Streaming Analytics In the Wide-Area

AN EXAMPLE QUERY

Diagram labels: File Read Operator; Parse Log File; Local Storage; Query Every 10 s (this chain is shown twice); Central Storage; Site A; Site B; Site C

Page 19: Computing on Jetstream: Streaming Analytics In the Wide-Area

ADAPTIVE DEGRADATION

Diagram labels: Local Data; Dataflow Operators; Summarized or Approximated Data; Network; Dataflow Operators; Feedback control

• Feedback control to decide when to degrade

• User-defined policies for how to degrade data

Page 20: Computing on Jetstream: Streaming Analytics In the Wide-Area

MONITORING AVAILABLE BANDWIDTH

Data Data Data Time Marker Data

• Sources insert time markers into the data stream every k seconds

• Network monitor records the time t it took to process each interval

=> k/t estimates available capacity
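The k/t estimate can be sketched in a few lines. This is a minimal illustration, not JetStream's code: the function names and the `fraction_to_drop` helper are hypothetical, and the only thing taken from the slide is the ratio k/t itself.

```python
def capacity_ratio(k_seconds: float, t_seconds: float) -> float:
    """Estimate available capacity relative to the current send rate.

    k_seconds: spacing between time markers inserted by the source.
    t_seconds: time the network monitor measured to process that interval.
    A ratio below 1.0 means the link carries less than is being sent.
    """
    if k_seconds <= 0 or t_seconds <= 0:
        raise ValueError("k and t must be positive")
    return k_seconds / t_seconds


def fraction_to_drop(ratio: float) -> float:
    """Hypothetical helper: share of data to shed so the stream fits the link."""
    return max(0.0, 1.0 - ratio)


# A k = 10 s interval that took t = 40 s to process: sending 4x too much.
ratio = capacity_ratio(10.0, 40.0)
print(ratio, fraction_to_drop(ratio))  # 0.25 0.75
```

A ratio of 0.25 corresponds to the "Sending 4x too much" case used in later slides.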

Page 21: Computing on Jetstream: Streaming Analytics In the Wide-Area

WAYS TO DEGRADE DATA

Can drop low-rank values

Can coarsen a dimension

Page 22: Computing on Jetstream: Streaming Analytics In the Wide-Area

AN INTERFACE FOR DEGRADATION (I)

• First attempt: policy specified by choosing an operator.

• Operators read the congestion sensor and respond.

Diagram labels: Incoming data; Coarsening Operator; Sampled Data; Network; Sending 4x too much

Page 23: Computing on Jetstream: Streaming Analytics In the Wide-Area

COARSENING REDUCES DATA VOLUMES

Time      URL        Count

01:01:01  foo.com/a  1
01:01:01  foo.com/b  10
01:01:01  foo.com/c  5

01:01:02  foo.com/a  2
01:01:02  foo.com/b  15
01:01:02  foo.com/c  20

01:01:*   foo.com/a  3
01:01:*   foo.com/b  25
01:01:*   foo.com/c  25
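The rollup on this slide can be reproduced with a short sketch. It assumes each row is a (time, URL, count) tuple and that coarsening replaces the seconds field with `*`; the function name is hypothetical and this is not JetStream's coarsening operator.

```python
from collections import Counter


def coarsen_seconds(rows):
    """Roll per-second (time, url, count) rows up to per-minute rows."""
    totals = Counter()
    for time, url, count in rows:
        minute = time.rsplit(":", 1)[0] + ":*"  # "01:01:02" -> "01:01:*"
        totals[(minute, url)] += count
    return sorted((t, u, c) for (t, u), c in totals.items())


rows = [
    ("01:01:01", "foo.com/a", 1), ("01:01:01", "foo.com/b", 10),
    ("01:01:01", "foo.com/c", 5), ("01:01:02", "foo.com/a", 2),
    ("01:01:02", "foo.com/b", 15), ("01:01:02", "foo.com/c", 20),
]
coarse = coarsen_seconds(rows)
print(len(rows), "->", len(coarse))  # 6 -> 3
```

Six rows shrink to three here only because both seconds carry the same URLs; rows with disjoint URLs would not shrink at all.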

Page 24: Computing on Jetstream: Streaming Analytics In the Wide-Area

BUT NOT ALWAYS

Time      URL        Count

01:01:01  foo.com/a  1
01:01:01  foo.com/b  10
01:01:01  foo.com/c  5

01:01:02  bar.com/a  2
01:01:02  bar.com/b  15
01:01:02  bar.com/c  20

01:01:*   foo.com/a  1
01:01:*   foo.com/b  10
01:01:*   foo.com/c  5
01:01:*   bar.com/a  2
01:01:*   bar.com/b  15
01:01:*   bar.com/c  20

Page 25: Computing on Jetstream: Streaming Analytics In the Wide-Area

DEPENDS ON LEVEL OF COARSENING

Data from CoralCDN logs

Page 26: Computing on Jetstream: Streaming Analytics In the Wide-Area

GETTING THE MOST DATA QUALITY FOR THE LEAST BW

Issue: Some degradation techniques result in good quality but have unpredictable savings.

Solution: Use multiple techniques
  – Start off with technique that gives best quality
  – Supplement with other techniques when BW scarce

=> Keeps latency bounded; minimizes analyst’s remorse
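One way to read "start with the best-quality technique, then supplement" is sketched below. It is an assumption-laden illustration: coarsening stands in for the preferred technique, sampling for the supplement, and `plan_degradation` and its parameters are hypothetical names rather than anything from JetStream.

```python
def plan_degradation(link_ratio: float, coarsened_ratio: float):
    """Return (use_coarsening, sampling_keep_fraction).

    link_ratio: fraction of the undegraded stream the link can carry.
    coarsened_ratio: measured output/input size after coarsening, which
        is only known after the fact because its savings are unpredictable.
    """
    if link_ratio >= 1.0:
        return (False, 1.0)  # enough bandwidth: send everything
    if coarsened_ratio <= link_ratio:
        return (True, 1.0)   # coarsening alone fits the link
    # Coarsening did not save enough: sample the coarsened data to fit.
    return (True, link_ratio / coarsened_ratio)


print(plan_degradation(0.25, 0.5))  # (True, 0.5)
```

In the printed case the link carries a quarter of the stream and coarsening only halves it, so half of the coarsened data is also sampled away.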

Page 27: Computing on Jetstream: Streaming Analytics In the Wide-Area

ALLOWING COMPOSITE POLICIES

• Chaos if two operators are simultaneously responding to the same sensor

• Operator placement constrained in ways that don’t match degradation policy.

Diagram labels: Incoming data; Coarsening Operator; Sampling Operator; Network; Sending 4x too much

Page 28: Computing on Jetstream: Streaming Analytics In the Wide-Area

INTRODUCING A CONTROLLER

• Introduce a controller for each network connection that determines which degradations to apply

• Degradation policies for each controller

• Policy no longer constrained by operator topology

Diagram labels: Incoming data; Coarsening Operator; Sampling Operator; Network; Controller; Sending 4x too much; Drop 75% of data!

Page 29: Computing on Jetstream: Streaming Analytics In the Wide-Area

DEGRADATION

Type                    Mergeability  Errors      Predictable Size Savings
Dimension Coarsening    Yes*          Resolution  No
Consistent Sampling     Yes           Sampling    Yes
Local Filtering         No            Sampling    Yes
Multi-round Filtering   No            None        No
Aggregate Approx.       Depends       Depends     Depends

Page 30: Computing on Jetstream: Streaming Analytics In the Wide-Area

MERGEABILITY IS NONTRIVIAL

• Can’t cleanly unify data at arbitrary degradation

• Degradation operators need to have fixed levels

Every 5: 01 - 05, 06 - 10, 11 - 15, 16 - 20, 21 - 25, 26 - 30

Every 10: 01 - 10, 11 - 20, 21 - 30

Every 6: 01 - 06, 07 - 12, 13 - 18, 19 - 24, 25 - 30

Every 30: 01 - 30 ??
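The window sizes on this slide (5, 6, 10 and 30) show why arbitrary levels do not merge. A small sketch, assuming windows are aligned to a common origin and that a finer window can only be folded into a coarser one whose size it divides; both helper names are hypothetical.

```python
import math


def can_roll_up(fine: int, coarse: int) -> bool:
    """True if aligned windows of size `fine` tile windows of size `coarse`."""
    return coarse % fine == 0


def smallest_common_window(a: int, b: int) -> int:
    """Smallest window size that windows of size a and b both tile exactly."""
    return a * b // math.gcd(a, b)


print(can_roll_up(5, 10))            # True
print(can_roll_up(6, 10))            # False
print(smallest_common_window(5, 6))  # 30
```

Data coarsened every 5 and data coarsened every 6 can only be unified at 30, which is the kind of loss that fixed, mutually compatible levels avoid.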

Page 31: Computing on Jetstream: Streaming Analytics In the Wide-Area

INTERFACING WITH THE CONTROLLER

Diagram labels: Incoming data; Coarsening Operator; Sampling Operator; Network; Controller; Sending 4x too much

Operator: Shrinking data by 50%. Possible levels: [0%, 50%, 75%, 95%, …]

Controller: Go to level 75%
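The exchange between operator and controller can be sketched as a level-selection function. This is an illustration under assumptions: the operator advertises its levels as fractions of data removed, the controller knows the overload factor, and `choose_level` is a hypothetical name. Only the levels the slide lists explicitly are used; the slide's list continues with "…".

```python
def choose_level(levels, overload_factor: float) -> float:
    """Pick the mildest advertised level that removes enough data.

    levels: reduction fractions the operator supports, e.g. 0.75 = drop 75%.
    overload_factor: how many times more data is sent than the link carries.
    """
    if overload_factor <= 0:
        raise ValueError("overload_factor must be positive")
    needed = max(0.0, 1.0 - 1.0 / overload_factor)
    for level in sorted(levels):
        if level >= needed:
            return level
    return max(levels)  # nothing is strong enough: use the strongest level


advertised = [0.0, 0.5, 0.75, 0.95]
print(choose_level(advertised, 4.0))  # 0.75
```

With "sending 4x too much" the needed reduction is 75%, matching the controller's "Go to level 75%".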

Page 32: Computing on Jetstream: Streaming Analytics In the Wide-Area

A PLANNER FOR POLICY

Query planners: Query + Data Distribution => Execution Plan

Why not do this for degradation policy?

What is the Query? For us the policy affects the data ingestion => Affects all subsequent Queries

Planning: All Potential Queries + Data Distribution => Policy

Page 33: Computing on Jetstream: Streaming Analytics In the Wide-Area

EXPERIMENTAL SETUP

80 nodes on VICCI testbed in US and Germany

Princeton

Policy: Drop data if insufficient BW

Page 34: Computing on Jetstream: Streaming Analytics In the Wide-Area

WITHOUT ADAPTATION

Bandwidth Shaping

Page 35: Computing on Jetstream: Streaming Analytics In the Wide-Area

WITH ADAPTATION

Bandwidth Shaping

Page 36: Computing on Jetstream: Streaming Analytics In the Wide-Area

COMPOSITE POLICIES

Page 37: Computing on Jetstream: Streaming Analytics In the Wide-Area

OPERATING ON DISPERSED DATA

Diagram labels: Dispersed Databases (label repeated four times)

Page 38: Computing on Jetstream: Streaming Analytics In the Wide-Area

CUBE DIMENSIONS

Diagram axes:

Time: 01:01:00, 01:01:01

URL: foo.com/r, foo.com/q, bar.com/n, bar.com/m

Page 39: Computing on Jetstream: Streaming Analytics In the Wide-Area

CUBE AGGREGATES

Count Requests

Max Latency

Cell: bar.com/m, 01:01:01

Page 40: Computing on Jetstream: Streaming Analytics In the Wide-Area

CUBE ROLLUP

Diagram axes:

Time: 01:01:00

URL: foo.com/r, foo.com/q, bar.com/n, bar.com/m

Rollup cells: bar.com/*, foo.com/*

Page 41: Computing on Jetstream: Streaming Analytics In the Wide-Area

FULL HIERARCHY

Cells: (5,90) (3,75) (8,199) (21,40)

Time: 01:01:00

URL: foo.com/r, foo.com/q, bar.com/n, bar.com/m

Rollups: (8,90) (29,199)

(37,199) at URL: *, Time: 01:01:01
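The numbers on this slide are consistent with each cell holding (request count, max latency), rolled up by summing counts and taking the maximum latency. The sketch below checks that reading. Which tuple belongs to which URL is an assumption based on the order of the labels, and `merge` and `rollup_by_domain` are hypothetical names.

```python
from functools import reduce


def merge(a, b):
    """Combine two (count, max_latency) aggregates."""
    return (a[0] + b[0], max(a[1], b[1]))


def rollup_by_domain(cells):
    """Roll URL cells up to domain/* cells."""
    out = {}
    for url, agg in cells.items():
        key = url.split("/", 1)[0] + "/*"
        out[key] = merge(out[key], agg) if key in out else agg
    return out


# Assumed cell-to-URL assignment, following the label order on the slide.
cells = {"foo.com/r": (5, 90), "foo.com/q": (3, 75),
         "bar.com/n": (8, 199), "bar.com/m": (21, 40)}
domains = rollup_by_domain(cells)
print(domains)  # {'foo.com/*': (8, 90), 'bar.com/*': (29, 199)}
print(reduce(merge, domains.values()))  # (37, 199)
```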

Page 42: Computing on Jetstream: Streaming Analytics In the Wide-Area

RICH STRUCTURE

Cells: (5,90) (3,75) (8,199) (21,40)

Time: 01:01:00, 01:01:01, 01:01:58, 01:01:59

URL: foo.com/r, foo.com/q, bar.com/n, bar.com/m

Cell  URL        Time
A     bar.com/*  01:01:01
B     *          01:01:01
C     foo.com/*  01:01:01
D     foo.com/r  01:01:*
E     foo.com/*  01:01:*

Page 43: Computing on Jetstream: Streaming Analytics In the Wide-Area

TWO KINDS OF AGGREGATION

1. Rollups – Across Dimensions

2. Inserts – Across Sources

The data cube model constrains the system to use the same aggregate function for both.

Constraint: no queries on tuple arrival order

Makes reasoning easier!
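A sketch of why one shared aggregate function with no dependence on arrival order makes reasoning easier: if the function is commutative and associative, inserts from different sources give the same cube in any order. The (count, max latency) aggregate and all names below are illustrative assumptions, not JetStream's API.

```python
from itertools import permutations


def merge(a, b):
    """Commutative, associative merge of (count, max_latency) aggregates."""
    return (a[0] + b[0], max(a[1], b[1]))


def insert_all(tuples):
    """Insert (key, aggregate) tuples into an initially empty cube."""
    cube = {}
    for key, agg in tuples:
        cube[key] = merge(cube[key], agg) if key in cube else agg
    return cube


# Tuples for the same cell arriving from different sources.
arrivals = [
    (("01:01:01", "foo.com/r"), (2, 90)),
    (("01:01:01", "foo.com/r"), (3, 40)),
    (("01:01:01", "foo.com/q"), (1, 75)),
]
results = [insert_all(order) for order in permutations(arrivals)]
print(all(r == results[0] for r in results))  # True
```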

Page 44: Computing on Jetstream: Streaming Analytics In the Wide-Area

AN EXAMPLE QUERY

Diagram labels: File Read Operator; Parse Log File; Local Storage; Query Every 10 s (this chain is shown twice); Central Storage; Site A; Site B; Site C

Page 45: Computing on Jetstream: Streaming Analytics In the Wide-Area

SUBSCRIBERS

• Extract data from cubes to send downstream

• Control latency vs completeness tradeoff

Diagram labels: File Read Operator; Parse Log File; Local Storage; Query Every 10 s; Site A

Page 46: Computing on Jetstream: Streaming Analytics In the Wide-Area

SUBSCRIBER API

• Notified of every tuple inserted into cube

• Can slice and rollup cubes

Possible policies:

• Wait for all upstream nodes to contribute

• Wait for a timer to go off
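The two policies can be combined in a small sketch: emit once every upstream node has contributed, or once a timer expires, whichever comes first. The class and method names are hypothetical and do not reflect JetStream's subscriber interface; time is passed in explicitly so the example stays deterministic.

```python
class WindowSubscriber:
    """Decides when a window of cube data is ready to send downstream."""

    def __init__(self, upstream, timeout_s: float, opened_at: float):
        self.expected = set(upstream)          # nodes that should contribute
        self.seen = set()                      # nodes that have contributed
        self.deadline = opened_at + timeout_s  # latest time to wait until

    def on_insert(self, source: str) -> None:
        """Called for every tuple inserted into the cube."""
        self.seen.add(source)

    def should_emit(self, now: float) -> bool:
        """Complete (all upstream seen) or late (timer went off)."""
        return self.seen >= self.expected or now >= self.deadline


sub = WindowSubscriber({"site-a", "site-b"}, timeout_s=10.0, opened_at=0.0)
sub.on_insert("site-a")
print(sub.should_emit(now=5.0))   # False: site-b missing, timer not expired
print(sub.should_emit(now=10.0))  # True: timer went off
```

A shorter timeout favors latency; waiting for all upstream nodes favors completeness.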

Page 47: Computing on Jetstream: Streaming Analytics In the Wide-Area

FUTURE WORK

• Reliability

• Individual queries
  – Statistical methods
  – Multi-round protocols

• Currently working on improving top-k

• Fairness that gives best data quality

Thanks for listening!