Adaptive Dataflow Joe Hellerstein UC Berkeley
Date posted: 21-Dec-2015

Page 1: Adaptive Dataflow Joe Hellerstein UC Berkeley. Overview Trends Driving Adaptive Dataflow Lessons –networking flow control, event programming, app-level.

Adaptive Dataflow

Joe Hellerstein UC Berkeley

Page 2

Overview

• Trends Driving Adaptive Dataflow
• Lessons
  – networking
    • flow control, event programming, app-level routing
  – query processing
    • distributed & “shared nothing” parallel query systems
• Adaptive Dataflow
  – Rivers, Eddies
• Telegraph
  – Facts and Figures on the Web (FFW)
  – sensor-based information systems
  – software traces and adaptive distributed systems
• FFW Questions

Page 3

Recent Trends

• 1990s: shift in the focus of academic CS
  – driving example applications changed
  – everybody on the information bandwagon
• 1990s: tightening of bottlenecks
  – data growth double Moore’s Law
• 90’s systems R&D: Parallel & Distributed Information Services
  – infocentric, multi-user, highly available
  – “shared-nothing” clusters, not “parallelism” a la 1989
  – distributed data, not “distributed OS” a la 1989

Page 4

Systems Research: Up or Down

• UP: Global Federations
  – internet services as procedure calls
    • with fees and lawyers!
  – B2B e-commerce has this problem today
    • Cohera examples
• DOWN: Sensor/Actuator Networks
  – UPC codes? Clickstream? Smart dust!
  – HUGE, noisy data volumes
  – new architectures, major challenges

Page 5

Core Technology Not There Yet

• Key component: dataflow
  – The plumbing is coming
    • XML/http, WML/WAP, etc. give lowest-common-denominator (LCD) communication
    • glue at the boundaries
    • ho hum
  – Systems challenge: move the data
    • efficiently
    • robustly
    • intelligently
  – Language challenge too
    • programming and debugging tools
    • interfaces and economic models

Page 6

What’s So Hard Here?

• Volatile regime
  – Data flows unpredictably from sources
  – Code performs unpredictably along flows
  – Continuous volatility due to many decentralized systems
• Lots of choices
  – Choice of services
  – Choice of machines
  – Choice of info: sensor fusion, data reduction, etc.
  – Order of operation
• Maintenance
  – Federated world
  – Partial failure is the common case
• Adaptivity required!

Page 7

A Networking Problem!?

• Networks do dataflow!
• Significant history of adaptive techniques
  – E.g. TCP congestion control
  – E.g. routing
• But traditionally much lower function
  – Ship bitstreams
  – Minimal, fixed code
• Lately, moving up the foodchain?
  – app-level routing
  – active networks
  – politics of growth
    • assumption of complexity = assumption of liability

Page 8

Networking Code as Dataflow?

• States & Events, Not Threads
  – Asynchronous events natural to networks
  – State machines in protocol specification and system code
  – Low-overhead, spreading to big systems
  – Totally different programming style
    • remaining area of hacker machismo
• Eventflow optimization
  – Can’t eventflow be adaptively optimized like dataflow?
  – Why didn’t that happen years ago?
  – Hold this thought

Page 9

Query Plans are Dataflow Too

• Programming model: iterators
  – old idea, widely used in DB query processing
  – object with three methods:
    • Init(), GetNext(), Close()
  – input/output types
  – query plan: graph of iterators
    • pipelining: iterators that return results before children Close()
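The iterator model named on this slide can be sketched in a few lines. This is a minimal illustration, not code from the talk: the class names `Scan` and `Select`, the `run` driver, and the sample rows are all assumptions, and `None` is used here as an assumed end-of-stream signal.

```python
from typing import Callable, Iterable, Optional


class Scan:
    """Leaf iterator: yields rows from an in-memory table (hypothetical source)."""

    def __init__(self, rows: Iterable[dict]):
        self._rows = rows
        self._it = None

    def init(self) -> None:
        self._it = iter(self._rows)

    def get_next(self) -> Optional[dict]:
        # None signals end of stream (an assumption of this sketch).
        return next(self._it, None)

    def close(self) -> None:
        self._it = None


class Select:
    """Pipelining filter: pulls from its child one row at a time."""

    def __init__(self, child, predicate: Callable[[dict], bool]):
        self._child = child
        self._predicate = predicate

    def init(self) -> None:
        self._child.init()

    def get_next(self) -> Optional[dict]:
        while (row := self._child.get_next()) is not None:
            if self._predicate(row):
                return row  # result returned before the child is closed
        return None

    def close(self) -> None:
        self._child.close()


def run(plan) -> list:
    """Drive a plan from its root: init, drain with get_next, then close."""
    plan.init()
    out = []
    while (row := plan.get_next()) is not None:
        out.append(row)
    plan.close()
    return out


rows = [{"id": 1, "qty": 5}, {"id": 2, "qty": 50}, {"id": 3, "qty": 500}]
plan = Select(Scan(rows), lambda r: r["qty"] >= 50)
result = run(plan)
```

Because every operator exposes the same three methods, a plan is just a graph of such objects, and the root never needs to know what sits beneath it.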

Page 10

Distributed/Parallel Databases?

• Query plans across machines
• Bloom filters, query optimization minimize bandwidth
  – Send lossily compressed signatures, not just data
  – Model network, disk, CPU costs in dataflow optimization
  – A “Distributed DB” contribution
• App-level optimization … code to data or data to code
  – DB research highest up the food chain
• Data partitioning, query opt. parallelizes dataflow
  – Pipeline & partition parallelism: natural!
  – Model resource consumption and response time
  – A “Parallel DB” contribution
• Challenge: move down the foodchain to serve all
  – Biggest problem: limited adaptivity
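The "lossily compressed signatures" bullet refers to Bloom-filter joins. A rough sketch of the idea follows, assuming two hypothetical sites A and B, made-up `orders` and `customers` tables, and arbitrary filter parameters (1024 bits, 3 hashes); none of these come from the talk.

```python
import hashlib


class BloomFilter:
    """Bit array with k hash probes: false positives possible, no false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


# Site A holds orders; site B holds customers (hypothetical tables).
orders = [(101, "c1"), (102, "c7"), (103, "c3")]
customers = [(f"c{i}", f"name{i}") for i in range(1000)]

# Site A ships only the filter, not its join keys.
signature = BloomFilter()
for _, cust_id in orders:
    signature.add(cust_id)

# Site B ships only customers that might join; a few false positives may slip through.
shipped = [c for c in customers if signature.might_contain(c[0])]

# Site A finishes the join exactly, which discards any false positives.
by_id = dict(shipped)
joined = [(oid, cid, by_id[cid]) for oid, cid in orders if cid in by_id]
```

The bandwidth saving comes from site B sending the few rows in `shipped` instead of all of `customers`; the final exact join keeps the answer correct despite the lossy signature.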

Page 11

Adaptive Systems: General Flavor

Repeat:
1. Observe (model) environment
2. Use observation to choose behavior
3. Take action

Page 12

Adaptive Dataflow in DBs: History

• Rich But Unacknowledged History
  – Codd's data independence predicated on adaptivity!
    • adapt opaquely to changing schema and storage
  – Query optimization does it!
    • statistics-driven optimization
    • key differentiator between DBMSs and other systems

Page 13

Adaptivity in Current DBs

• Limited & coarse grain

Repeat:
1. Observe (model) environment
  – runstats (once per week!!): model changes in data
2. Use observation to choose behavior
  – query optimization: fixes a single static query plan
3. Take action
  – query execution: blindly follow plan

Page 14

Adaptive Query Processing Work

  – Late Binding: Dynamic, Parametric [HP88,GW89,IN+92,GC94,AC+96,LP97]
  – Per Query: Mariposa [SA+96], ASE [CR94]
  – Competition: RDB [AZ96]
  – Inter-Op: [KD98], Tukwila [IF+99]
  – Query Scrambling: [AF+96,UFA98]
• Survey: Hellerstein, Franklin, et al., DE Bulletin 2000

[Figure: systems placed along an axis labeled “Frequency of Adaptivity”. Labels: System R, Late Binding, Per Query, Competition & Sampling, Inter-Operator, Query Scrambling, Eddies, Ingres DECOMP, Future Work]

Page 15

Some Solutions We’re Focusing On

• Rivers
  – Adaptive partitioning of work across machines
• Eddies
  – Adaptive ordering of pipelined operations
• Quality of Service
  – Online aggregation & data reduction: CONTROL
  – MUST have app-semantics
  – Often may want user interaction
    • UI models of temporal interest
• Data Dissemination
  – Adaptively choosing what to send, what to cache

Page 16

Dataflow Parallelism in DBs

• Volcano: “exchange” iterator [Graefe]
  – encapsulate exchange logic in an iterator
  – not in the dataflow system
  – Box-and-arrow programming can ignore parallelism

Page 17

River

• We built the world’s fastest sorting machine
  – On the “NOW”: 100 Sun workstations + SAN
  – Only beat the record under ideal conditions
    • No such thing in practice!
• River: adaptive dataflow on clusters
  – One main idea: Distributed Queues
    • adaptive exchange operator
  – Simplifies management and programming
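To see why a shared queue adapts where static partitioning does not, here is a toy single-process simulation. It is not River's implementation: the per-record costs, the record count, and both function names are assumptions chosen for illustration.

```python
import heapq


def static_partition_time(num_records: int, costs: list) -> float:
    """Equal shares assigned up front; the slowest consumer sets the finish time."""
    share = num_records / len(costs)
    return max(share * cost for cost in costs)


def distributed_queue_time(num_records: int, costs: list):
    """Consumers pull one record at a time from a shared queue whenever they are free.

    costs[i] is the assumed time consumer i needs per record.
    Returns (finish_time, records_handled_per_consumer).
    """
    handled = [0] * len(costs)
    # Heap of (time_when_free, consumer_id); every consumer starts free at t=0.
    free_at = [(0.0, i) for i in range(len(costs))]
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(num_records):
        t, i = heapq.heappop(free_at)   # next consumer to become free
        done = t + costs[i]
        handled[i] += 1
        finish = max(finish, done)
        heapq.heappush(free_at, (done, i))
    return finish, handled


costs = [1.0, 1.0, 4.0]   # third machine is 4x slower (a perturbed node, say)
static_time = static_partition_time(900, costs)
adaptive_time, handled = distributed_queue_time(900, costs)
```

Under these assumed costs the static split finishes at time 1200 because the slow machine holds a full third of the work, while the pull-based queue finishes at time 400 with the slow machine handling only 100 of the 900 records.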

Page 18

River

Page 19

Multi-Operator Query Plans

• Deal with pipelines of commutative operators
• Adapt at finer granularity than current DBMSs

Page 20

Continuous Adaptivity: Eddies

• A pipelining tuple-routing iterator
  – just like join or sort or exchange
• Works great with other pipelining operators
  – like Ripple Joins, online reordering, etc.

[Figure: Eddy]

Avnur & Hellerstein, SIGMOD 2000

Page 21

Continuous Adaptivity: Eddies

• How to order and reorder operators over time
  – based on performance, economic/admin feedback
• Vs. River:
  – River optimizes each operator “horizontally”
  – Eddies optimize a pipeline “vertically”

[Figure: Eddy]

Page 22

Continuous Adaptivity: Eddies

• Adjusts flow adaptively
  – Tuples routed through ops in different orders
  – Visit each op once before output
• Naïve routing policy:
  – All ops fetch from eddy as fast as possible
    • A la River
  – Turns out, doesn’t quite work
    • Only measures rate of work, not benefit
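The routing rule on this slide (each tuple visits every operator once, in an order chosen per tuple) can be sketched as follows. This is an illustration under assumptions, not the Eddies implementation: operators are modeled as simple filter predicates, and `random_policy` is a placeholder routing policy rather than the backpressure-driven naive policy the slide describes.

```python
import random
from typing import Callable, Dict, List


def eddy(tuples, ops: Dict[str, Callable[[dict], bool]], choose, seed: int = 0) -> List[dict]:
    """Route each tuple through every operator exactly once, in a per-tuple order.

    ops maps an operator name to a filter predicate (a stand-in for any
    commutative pipelined operator). choose(pending, rng) is the routing
    policy: it picks which not-yet-visited operator sees the tuple next.
    """
    rng = random.Random(seed)
    output = []
    for tup in tuples:
        pending = set(ops)           # operators this tuple has not visited yet
        alive = True
        while pending and alive:
            name = choose(sorted(pending), rng)
            pending.remove(name)
            alive = ops[name](tup)   # a rejected tuple leaves the dataflow
        if alive:
            output.append(tup)       # visited every operator: emit
    return output


def random_policy(pending, rng):
    """Placeholder policy: pick any eligible operator uniformly at random."""
    return rng.choice(pending)


def fixed_policy(pending, rng):
    """Static plan for comparison: always the alphabetically first operator."""
    return pending[0]


ops = {
    "even": lambda t: t["x"] % 2 == 0,
    "small": lambda t: t["x"] < 10,
    "pos": lambda t: t["x"] > 0,
}
data = [{"x": x} for x in range(-5, 20)]
routed = eddy(data, ops, random_policy)
static = eddy(data, ops, fixed_policy)
```

Because the operators commute, both policies produce the same result set; what a routing policy changes is how much work is spent on tuples that end up rejected.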

Page 23

An Aside: n-Arm Bandits

• A little machine learning problem:
  – Each arm pays off differently
  – Explore? Or Exploit?
    • Sometimes want to randomly choose an arm
    • Usually want to go with the best
    • If probabilities are stationary, dampen exploration over time
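One common way to realize "explore sometimes, exploit usually, dampen exploration over time" is an epsilon-greedy strategy with a decaying exploration rate. The sketch below is an assumed illustration, not a method stated in the talk; the payoff probabilities and the `50.0 / t` decay schedule are arbitrary choices.

```python
import random


def epsilon_greedy(payoff_probs, pulls: int, seed: int = 0):
    """Play a Bernoulli n-armed bandit with exploration that decays as 1/t."""
    rng = random.Random(seed)
    n = len(payoff_probs)
    counts = [0] * n      # times each arm was pulled
    means = [0.0] * n     # running mean payoff per arm
    for t in range(1, pulls + 1):
        epsilon = min(1.0, 50.0 / t)      # assumed schedule: explore less over time
        if rng.random() < epsilon:
            arm = rng.randrange(n)        # explore: random arm
        else:
            arm = max(range(n), key=lambda a: means[a])   # exploit: best so far
        reward = 1.0 if rng.random() < payoff_probs[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean
    return counts, means


counts, means = epsilon_greedy([0.2, 0.5, 0.8], pulls=5000)
```

Decaying exploration only makes sense when payoffs are stationary, as the slide notes. In a volatile dataflow the payoffs drift, which is why the eddy needs some form of forgetting instead.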

Page 24

Eddies with Lottery Scheduling

• Operator gets 1 ticket when it takes a tuple
  – Favor operators that run fast (low cost)
• Operator loses a ticket when it returns a tuple
  – Favor operators with high rejection rate
    • Low selectivity
• Lottery Scheduling:
  – When two ops vie for the same tuple, hold a lottery
  – Never let any operator go to zero tickets
    • Support occasional random “exploration”
• Set up “inflation” (forgetting) to adapt over time
  – E.g. tix’ = oldtix + newtix
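The ticket rules on this slide can be sketched as a small simulation. It follows the slide's bullets loosely and is not the published Eddies code: operators are assumed to be filter predicates, and the `window` and `decay` parameters are hypothetical knobs for the forgetting step (the slide only gives the form tix’ = oldtix + newtix).

```python
import random


def lottery_eddy(tuples, ops, seed=0, window=100, decay=0.5):
    """Eddy with ticket-based routing.

    ops: name -> predicate. Every `window` tuples the ticket history is
    aged: old = decay * old + new (decay and window are assumptions).
    Returns (output_tuples, how_often_each_op_was_routed_first).
    """
    rng = random.Random(seed)
    old = {name: 0.0 for name in ops}    # aged ticket history
    new = {name: 0.0 for name in ops}    # tickets earned in the current window
    routed_first = {name: 0 for name in ops}
    output = []

    def tickets(name):
        # Floor of one ticket so no operator is ever starved of tuples.
        return max(1.0, old[name] + new[name])

    for i, tup in enumerate(tuples, start=1):
        pending = sorted(ops)
        first = True
        alive = True
        while pending and alive:
            weights = [tickets(n) for n in pending]
            name = rng.choices(pending, weights=weights)[0]   # hold the lottery
            pending.remove(name)
            if first:
                routed_first[name] += 1
                first = False
            new[name] += 1          # gains a ticket for taking the tuple
            alive = ops[name](tup)
            if alive:
                new[name] -= 1      # loses it when the tuple is returned
        if alive:
            output.append(tup)
        if i % window == 0:         # forgetting step
            for n in ops:
                old[n] = decay * old[n] + new[n]
                new[n] = 0.0
    return output, routed_first


ops = {
    "selective": lambda t: t % 10 == 0,    # rejects 90% of tuples
    "permissive": lambda t: t % 10 != 9,   # rejects 10% of tuples
}
out, first = lottery_eddy(range(10000), ops)
```

An operator's net ticket gain equals the number of tuples it rejected, so the lottery drifts toward sending tuples to the more selective operator first, while the one-ticket floor keeps the other operator in play.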

Page 25

Promising!

• Initial performance results
• Ongoing work on proofs of convergence
  – have analysis for constrained case

Page 26

To Be Continued

• Tune & formalize policy
• Competitive eddies
  – Source & Join selection
  – Requires duplicate management
• Parallelism
  – Eddies + Rivers?
• Reliability
  – Long-running flows
  – Rivers + RAID-style computation

[Figure: Eddy. Labels: R1, R2, R3, S1, S2, S3, hash, block, index1, index2]

Page 27

To Be Continued, cont.

• What about wide area?
  – data reduction
  – sensor fusion
  – asynchronous communication
• Continuous queries
  – events
  – disconnected operation
• Lower-level eventflow?
  – can eddies, rivers, etc. be brought to bear on programming?

Page 28

Telegraph: An Adaptive Dataflow System

• An adaptive dataflow system
• Currently cluster + http
  – Rivers and Eddies
  – Web wrappers
• Sensor nets next
• Target applications
  – Facts and Figures on the Web (FFW)
  – Distributed Introspection Services
  – Sensor Stream Services

w/ Mike Franklin, Sirish Chandrasekaran, Amol Deshpande, Nick Lanham, Sam Madden, VijayShankar Raman, Mehul Shah

Page 29

Facts & Figures on the Web

• “Deep” Web
  – “Hidden” Web, “Dark Matter”
• More interesting: Facts & figures, not text
  – “search” is not the main problem
    • “search” was always easy
    • ranking often not apropos to facts
  – combine, transform, summarize, analyze

Page 30

FFW Election 2000

• Campaign Finance Drill-down
  – Bush/Gore donations
    • Personal and industrial
  – Industry data
  – Neighborhood data
  – Personal data
  – Historical voting data
• Live demo, online aggregation

http://ffw.cs.berkeley.edu

Page 31

Web Research Revisited

• Crawling, Caching, Relationship Graphs
  – Transitive Closure
  – The Berkeley Bindings
    • & graph analysis?
  – Form identification & APIs
  – “Semantic” caching
• Socio-Techno-Legal Issues
  – Privacy: Statistical DBs + Federation
    • WhitePages |x| WhoIs |x| DoubleClick |x| CDC Wonder
  – Stats in the wrong hands
  – Accuracy of derived results
  – Intellectual property
• Etc!

Page 32

Summary

• Adaptive software systems must happen
  – federation & scaling require it
  – systems and stats must marry
• Dataflow programming natural
  – for many applications
  – best hope for large-scale apps
• Terrific nexus of research
  – DB, Networking, Learning/Stat
  – Lots of work to be done!
• Drop by the Telegraph FFW demo!
  – http://ffw.cs.berkeley.edu

Page 33

Backup slides

• The rest of the slides are backup to answer questions…

Page 34

Prior Progress in DB Adaptivity

• Per-query adaptivity
  – E.g. Mariposa [Sto95]
    • 1st distributed DBMS to consider scalability
    • economic APIs for federation, give limited adaptivity too
• One-time intra-query
  – DEC Rdb competition [AZ96]
  – Sampling [lots]
• Intra-query, inter-operator
  – “Query Scrambling”: reoptimize in face of delays [UFA98]
  – Kabra/DeWitt ‘98: dam the flow, reoptimize downstream