Transcript
Page 1: Differential Dataflow (and the Naiad system)

Differential Dataflow

(and the Naiad system)

Frank McSherry, Derek G. Murray, Rebecca Isaacs, Michael Isard

Microsoft Research, Silicon Valley

Page 2: Differential Dataflow (and the Naiad system)

Data-parallel dataflow

[Figure: data-parallel dataflow. Numbered input records are grouped by key (k1, k2, k3) and produce outputs A to E.]

Page 6: Differential Dataflow (and the Naiad system)

Data-parallel dataflow

Simple systems (Hadoop, Dryad) process entire collections.

1. Incremental updates. (StreamInsight, Incoop)
2. Fixed point iteration. (Datalog, Rex, Nephele)
3. Prioritized computation. (PrIter)

Hard to compose, for non-trivial reasons. (IVM rec-queries)

e.g. Maintaining the Strongly Connected Components of a social graph as edges continually arrive/depart.

Page 7: Differential Dataflow (and the Naiad system)

Naiad

Data-parallel compute engine using differential dataflow.

C#/LINQ programming model:
• arbitrarily nested loops,
• incremental updates,
• prioritization,
• …
• fully composable.

Trades memory for performance:
Data-parallelism to scale memory.

Page 8: Differential Dataflow (and the Naiad system)

Using Naiad

1. Programmer writes a declarative Naiad program.

[Figure: dataflow graph with inputs Edges and Labels, a loop body containing ⋈, ∪, and Min, and an Output.]

Page 9: Differential Dataflow (and the Naiad system)

Using Naiad

1. Programmer writes a declarative Naiad program.

    // produces a (name, label) pair for each node in the input graph.
    public static Collection<Node> DirectedReachability(this Collection<Edge> edges)
    {
        // start each node in the graph with itself as a label
        var nodes = edges.Select(x => new Node(x.src, x.src))
                         .Distinct();

        // repeatedly update labels to the minimum of the labels of neighbors
        return nodes.FixedPoint(x => x.Join(edges, n => n.name, e => e.src,
                                            (n, e) => new Node(e.dst, n.label))
                                      .Concat(nodes)
                                      .Min(n => n.name, n => n.label));
    }
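The label propagation in DirectedReachability can be sketched outside Naiad. The following is a minimal Python sketch, not the Naiad API: `fixed_point` and `directed_reachability` are illustrative names, collections are plain Python sets, and the whole collection is recomputed on every iteration instead of being updated incrementally. It mirrors the slide in seeding labels only from edge sources.

```python
def fixed_point(step, start):
    """Apply `step` repeatedly until the collection stops changing."""
    current = start
    while True:
        updated = step(current)
        if updated == current:
            return current
        current = updated


def directed_reachability(edges):
    """Return {name: label}; label is the smallest source name that reaches `name`."""
    # Start each source node with itself as a label (Select + Distinct on the slide).
    nodes = {(src, src) for src, _ in edges}

    def step(labelled):
        # Join: push every node's current label along its outgoing edges.
        proposals = {(dst, label)
                     for name, label in labelled
                     for src, dst in edges
                     if src == name}
        # Concat with the initial labels, then keep the minimum label per name.
        best = {}
        for name, label in proposals | nodes:
            if name not in best or label < best[name]:
                best[name] = label
        return set(best.items())

    return dict(fixed_point(step, nodes))


# Hypothetical input: a 3-cycle (1, 2, 3) with one edge leading out to node 4.
edges = {(1, 2), (2, 3), (3, 1), (3, 4)}
labels = directed_reachability(edges)
```

Here every node ends up labelled 1, because node 1 reaches all the others. The loop stops once an iteration leaves the label set unchanged, which is the fixed point the slide's FixedPoint operator expresses.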

Page 11: Differential Dataflow (and the Naiad system)

Using Naiad

2. Program is compiled to a cyclic dataflow graph.

Page 13: Differential Dataflow (and the Naiad system)

Using Naiad

3. Graph is distributed across independent workers.
4. Computation stays resident, with interactive access.

    var edges = new InputCollection<Edge>();

    var labels = edges.DirectedReachability();

    labels.Subscribe(x => ProcessLabels(x));

    while (!inputStream.Closed())
        edges.OnNext(inputStream.GetNext());
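The subscribe-then-push pattern above can be illustrated with a toy stand-in. This Python sketch is an assumption-laden illustration, not Naiad: `ResidentReachability` and `min_labels` are hypothetical names, node names are assumed to be numbers, and each update recomputes all labels from scratch (Naiad would update them incrementally). What it does show is the shape of the interaction: state stays resident between inputs, and subscribers are told only which labels changed.

```python
def min_labels(edges):
    """Label each node with the smallest source name that reaches it."""
    labels = {src: src for src, _ in edges}
    changed = True
    while changed:
        changed = False
        for src, dst in edges:
            if labels[src] < labels.get(dst, float("inf")):
                labels[dst] = labels[src]
                changed = True
    return labels


class ResidentReachability:
    """Keeps its input and output between updates and notifies subscribers."""

    def __init__(self):
        self.edges = set()
        self.labels = {}
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def on_next(self, edge):
        self.edges.add(edge)
        new_labels = min_labels(self.edges)  # naive: full recomputation
        changed = {name: label for name, label in new_labels.items()
                   if self.labels.get(name) != label}
        self.labels = new_labels
        for callback in self.subscribers:
            callback(changed)  # report only the labels that changed


seen = []
computation = ResidentReachability()
computation.subscribe(seen.append)
for edge in [(2, 3), (3, 4), (1, 2)]:
    computation.on_next(edge)
```

After the three edges arrive, `seen` holds one change set per input. The last one relabels every node to 1, since the new edge (1, 2) lets node 1 reach the whole chain.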

Page 14: Differential Dataflow (and the Naiad system)

Incremental Dataflow

Data-parallel operators can operate on differences:

Collection : { ( record, count ) }

[Figure: X → Operator → Y]

Page 24: Differential Dataflow (and the Naiad system)

Incremental Dataflow

Data-parallel operators can operate on differences:

Difference : { ( record, delta ) }

[Figure: a sequence of input differences dX flows into Operator, which emits a sequence of output differences dY.]

Up until this point, this is all old news.
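To make the (record, delta) idea concrete, here is a minimal Python sketch of one operator driven by differences. The names `apply_diff`, `distinct`, and `IncrementalDistinct` are illustrative and not part of Naiad; collections are represented as {record: count} dicts and differences as {record: delta} dicts. The operator keeps the accumulated input X as state, so that from each dX it can emit the dY that makes Y + dY equal the operator applied to X + dX.

```python
def apply_diff(collection, diff):
    """Return collection + diff as a new {record: count} dict, dropping zero counts."""
    result = dict(collection)
    for record, delta in diff.items():
        count = result.get(record, 0) + delta
        if count:
            result[record] = count
        else:
            result.pop(record, None)
    return result


def distinct(collection):
    """Batch operator: every record present in the input appears once in the output."""
    return {record: 1 for record, count in collection.items() if count > 0}


class IncrementalDistinct:
    """The same operator, driven by differences: consumes dX, emits dY."""

    def __init__(self):
        self.x = {}  # accumulated input collection X

    def on_diff(self, dx):
        dy = {}
        for record, delta in dx.items():
            before = self.x.get(record, 0)
            after = before + delta
            # +1 if the record appears, -1 if it disappears, 0 if presence is unchanged
            change = (after > 0) - (before > 0)
            if change:
                dy[record] = change
        self.x = apply_diff(self.x, dx)
        return dy


op = IncrementalDistinct()
x, y = {}, {}
for dx in [{"a": 2, "b": 1}, {"a": -1, "c": 1}, {"a": -1, "b": -1}]:
    dy = op.on_diff(dx)
    x, y = apply_diff(x, dx), apply_diff(y, dy)
    assert y == distinct(x)  # incremental output matches recomputation from scratch
```

Note that the second difference removes one copy of "a" without changing the output for "a", so the emitted dY mentions only "c". The work done is proportional to the size of dX, not the size of X.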

Page 25: Differential Dataflow (and the Naiad system)

Differential Dataflow

Data-parallel operators can operate on differences:

Difference : { ( record, delta, version ) }

[Figure: a sequence of input differences dX flows into Operator, which emits a sequence of output differences dY.]

Page 34: Differential Dataflow (and the Naiad system)

Differential Dataflow

Data-parallel operators can operate on differences:

Difference : { ( record, delta, version ) }

Important: A version can be more than just an integer.

[Figure: Operator with several rows of input differences dX and output differences dY.]

Page 37: Differential Dataflow (and the Naiad system)

Differential Dataflow

Data-parallel operators can operate on differences:

Difference : { ( record, delta, lattice ) }

[Figure: Operator surrounded by a grid of input differences dX and output differences dY.]
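One way to read "a version can be more than just an integer" is sketched below in Python. The choice of (epoch, iteration) pairs as versions is an assumption made for illustration, as are the names `leq`, `join`, and `collection_at` and the sample trace. The sketch shows the core bookkeeping: the collection at a version is the sum of all differences at versions less than or equal to it under a partial order, and two versions can be incomparable.

```python
def leq(a, b):
    """Product partial order on versions: a <= b iff every coordinate is <=."""
    return all(x <= y for x, y in zip(a, b))


def join(a, b):
    """Least upper bound of two versions in the lattice."""
    return tuple(max(x, y) for x, y in zip(a, b))


def collection_at(differences, version):
    """Sum the deltas of all differences whose version is <= `version`."""
    counts = {}
    for record, delta, v in differences:
        if leq(v, version):
            counts[record] = counts.get(record, 0) + delta
    return {record: count for record, count in counts.items() if count != 0}


# Hypothetical trace; versions are assumed to be (epoch, iteration) pairs.
differences = [
    ("a", +1, (0, 0)),
    ("b", +1, (0, 1)),  # appears in a later iteration of epoch 0
    ("c", +1, (1, 0)),  # appears with the input change of epoch 1
    ("b", -1, (1, 1)),  # correction needed only where both changes meet
]

at_later_iteration = collection_at(differences, (0, 1))
at_later_epoch = collection_at(differences, (1, 0))
at_both = collection_at(differences, join((0, 1), (1, 0)))
```

Versions (0, 1) and (1, 0) are incomparable, so neither collection is derived from the other. Both contribute to the collection at their join (1, 1), where the stored correction for "b" applies.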

Page 39: Differential Dataflow (and the Naiad system)

Empirical Efficacy

[Figure: differences (size of dX), on a log scale from 1 to 1,000,000, against inner iterations 1 to 29. Recovered series labels: baseline, incremental.]

Page 40: Differential Dataflow (and the Naiad system)

Strongly Connected Components

Nested fixed-point computation.

Two inner loops re-use existing DirectedReachability() query.

The entire computation is also automatically incrementalized.

Declarative program uses 23 LOC.

Page 41: Differential Dataflow (and the Naiad system)

Strongly Connected Components

    // repeatedly remove edges until fixed point.
    public static Collection<Edge> SCC(this Collection<Edge> edges)
    {
        return edges.FixedPoint(y => y.TrimAndTranspose()
                                      .TrimAndTranspose());
    }

    // retain edges whose endpoints are reached by the same nodes.
    public static Collection<Edge> TrimAndTranspose(this Collection<Edge> edges)
    {
        var labels = edges.DirectedReachability();

        return edges.Join(labels, x => x.src, y => y.name, (x, y) => x.Label1(y))
                    .Join(labels, x => x.dst, y => y.name, (x, y) => x.Label2(y))
                    .Where(x => x.label1 == x.label2)
                    .Select(x => new Edge(x.dst, x.src));
    }
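The nested fixed point can be traced on a small graph with a plain Python sketch. This is not the Naiad program: the function names are illustrative, everything is recomputed from scratch, and, as an assumption that goes beyond the slide, labels are seeded from both endpoints of every edge so that a node with no outgoing edges keeps its own name as a candidate label.

```python
def fixed_point(step, start):
    """Apply `step` repeatedly until the collection stops changing."""
    current = start
    while True:
        updated = step(current)
        if updated == current:
            return current
        current = updated


def directed_reachability(edges):
    """Map each node to the smallest name among the nodes that reach it (itself included)."""
    # Assumption: seed labels from both endpoints, not only from edge sources.
    labels = {node: node for edge in edges for node in edge}
    changed = True
    while changed:
        changed = False
        for src, dst in edges:
            if labels[src] < labels[dst]:
                labels[dst] = labels[src]
                changed = True
    return labels


def trim_and_transpose(edges):
    """Keep edges whose endpoints share a label, then reverse them."""
    labels = directed_reachability(edges)
    return {(dst, src) for src, dst in edges if labels[src] == labels[dst]}


def scc(edges):
    """Trim forwards, then backwards, until no more edges are removed."""
    return fixed_point(lambda y: trim_and_transpose(trim_and_transpose(y)), set(edges))


# Hypothetical graph: cycle 1-2-3, cycle 4-5, and edges 3->4 and 5->6 between components.
graph = {(1, 2), (2, 3), (3, 1), (3, 4), (4, 5), (5, 4), (5, 6)}
kept = scc(graph)
```

On this graph the forward pass keeps every edge, because node 1 reaches everything. The pass over the transposed graph then drops (3, 4) and (5, 6), leaving only the edges inside the two cycles.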

Page 42: Differential Dataflow (and the Naiad system)

Streaming SCC on Twitter

CDFs for 24-hour windowed SCC of @mention graph.

Page 43: Differential Dataflow (and the Naiad system)

Concluding Comments

The generality of differential dataflow allows Naiad to arrange computation more naturally and efficiently.

Better re-use of previous work, by changing “previous”.
Millisecond-scale updates for complex computations.
Enables new and richer program patterns.

ex: SCC, also graph coloring, partitioning, …

Bringing declarative data-parallel closer to imperative.

Page 44: Differential Dataflow (and the Naiad system)

Naiad Status

Public code release available at project page:

http://research.microsoft.com/naiad/
http://bigdataatsvc.wordpress.com/

Code release is C#: Windows (.NET), Linux, OS X (Mono).

Come see our poster and demo, processing tweets.

Page 45: Differential Dataflow (and the Naiad system)

Questions?

𝑓 ∞