
DMSN 2011
Cagri Balkesen & Nesime Tatbul

Scalable Data Partitioning Techniques for Parallel Sliding Window Processing over Data Streams

Talk Outline

• Intro & Motivation
• Stream Partitioning Techniques
– Basic window partitioning
– Batch partitioning
– Pane-based partitioning
• Ring-based Query Evaluation
• Experimental Evaluation
• Conclusions & Future Work


Intro & Motivation

[Figure: a DSMS (Data Stream Management System)]

Architectural Overview

• Classical Split-Merge pattern from parallel DBs
• Adjustable parallelism level, d
• QoS constraints on maximum latency & order

[Figure: input stream → Split node (split stage) → d parallel query nodes, each running the query → Merge node (merge stage) → output stream; example QoS: latency < 5 seconds, disorder < 3 tuples]
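A rough illustration of the pattern (not the actual Borealis operators; the function names, the round-robin routing, and the identity query are assumptions of this sketch):

    def split(stream, d):
        """Split stage: partition the input stream across d query nodes."""
        partitions = [[] for _ in range(d)]
        for i, t in enumerate(stream):
            partitions[i % d].append(t)  # placeholder round-robin routing
        return partitions

    def query(partition):
        """One of the d query nodes: run the continuous query on its
        partition (the identity here)."""
        return list(partition)

    def merge(results):
        """Merge stage: combine the node outputs into one output stream.
        A real merge enforces the QoS bounds (e.g., latency < 5 seconds,
        disorder < 3 tuples) instead of fully sorting."""
        return sorted(t for r in results for t in r)

    d = 3  # adjustable parallelism level
    assert merge(query(p) for p in split(range(10), d)) == list(range(10))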

Related Work: How to Partition?

• Content-sensitive
– Flux: fault-tolerant, load-balancing Exchange [1,2]
– Uses the query's group-by values to partition
– Needs explicit load balancing due to skewed data
• Content-insensitive
– GDSM: window-based parallelization (fixed-size tumbling windows) [3]
– Win-Distribute: partition at window boundaries
– Win-Split: partition each window into equi-length subwindows
• The problem:
– How to handle sliding windows?
– How to handle queries without a group-by, or with only a few groups?

[1] Flux: An Adaptive Partitioning Operator for Continuous Query Systems. ICDE '03.
[2] Highly-Available, Fault-Tolerant, Parallel Dataflows. SIGMOD '04.
[3] Customizable Parallel Execution of Scientific Stream Queries. VLDB '05.

Stream Partitioning Techniques

• Chunk the stream into independently processable units
– Window-aware splitting of the stream
• Each window has an id & tuples are marked with
– (first-winid, last-winid, is-win-closer)
• Tuples are replicated for each of their windows (see the sketch below)
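A minimal sketch of this marking, assuming count-based windows over 1-based tuple positions; the helper name mark is illustrative, while the three fields come from the slide:

    import math

    def mark(p, w, s):
        """Window-aware marking of the tuple at 1-based position p for
        count-based sliding windows of size w and slide s, where window i
        (i >= 1) covers positions [(i-1)*s + 1, (i-1)*s + w]."""
        last_winid = (p - 1) // s + 1
        first_winid = max(1, math.ceil((p - w) / s) + 1)
        is_win_closer = p >= w and (p - w) % s == 0
        return (first_winid, last_winid, is_win_closer)

    # w = 6, s = 2 as in the figure below: t6 closes W1 and belongs to
    # W1..W3, so the split replicates it to w/s = 3 partitions.
    assert mark(6, 6, 2) == (1, 3, True)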

Approach 1: Basic Sliding Window Partitioning

w = 6 units, s = 2 units, replication = w/s = 6/2 = 3

[Figure: stream t1 t2 ... t10; overlapping windows W1-W4; Split replicates tuples and routes W1 to Node1, W2 to Node2, W3 to Node3, ...]

Approach 1: Basic Sliding Window Partitioning (cont.)

The problem with basic sliding window partitioning:
• Tuples belong to many windows, depending on the slide
• Excessive replication of tuples, once per window
• Increased output data volume at the split

[Figure: same example as above (w = 6 units, s = 2 units, replication = 6/2 = 3)]

Approach 2: Batch-based Partitioning

• Batch several consecutive windows together to reduce replication
• A "batch-window" has size wb = w + (B-1)*s and slide sb = B*s
– All the tuples in a batch go to the same partition
– Only tuples overlapping between batches are replicated
• Replication is reduced to wb/sb partitions per tuple instead of w/s (see the sketch below)

[Figure: stream t1 t2 ... t10; windows w1-w8 (w = 3, s = 1) grouped into batches B1, B2 with B = 3, so wb = 5, sb = 3; replication drops from 3 to 5/3]

Definitions:
w : window size
s : slide size
B : batch size
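A sketch of the batch-window arithmetic, using the wb and sb formulas above; batch_params and batch_ids are illustrative names:

    from math import ceil

    def batch_params(w, s, B):
        """Batch-window size and slide from the slide: wb = w + (B-1)*s,
        sb = B*s."""
        return w + (B - 1) * s, B * s

    def batch_ids(p, w, s, B):
        """Batches that must receive the tuple at 1-based position p.
        Batches behave like sliding windows of size wb and slide sb, so the
        same first/last computation as for basic window partitioning applies."""
        wb, sb = batch_params(w, s, B)
        last = (p - 1) // sb + 1
        first = max(1, ceil((p - wb) / sb) + 1)
        return list(range(first, last + 1))

    # w = 3, s = 1, B = 3 as in the figure: wb = 5, sb = 3. Tuple t4 overlaps
    # batches B1 and B2 and is replicated; t3 goes to B1 only.
    assert batch_params(3, 1, 3) == (5, 3)
    assert batch_ids(4, 3, 1, 3) == [1, 2] and batch_ids(3, 3, 1, 3) == [1]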

The Panes Technique [1]

• Divide overlapping windows into disjoint panes
• Reduce cost by sub-aggregation and sharing
• Each window has w/gcd(w,s) panes of size gcd(w,s)
• The query is decomposed into pane-level (PLQ) and window-level (WLQ) queries

[Figure: windows w1-w5 sliding over disjoint panes p1 p2 ... p8; consecutive windows share panes]

[1] No Pane, No Gain: Efficient Evaluation of Sliding Window Aggregates over Data Streams. SIGMOD Record '05.
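A small sketch of the decomposition for AVG (an illustrative aggregate; the PLQ/WLQ split is the technique's, the function names are not):

    from math import gcd

    def plq_avg(pane_tuples):
        """Pane-level query (PLQ): sub-aggregate one pane into a
        (sum, count) pair."""
        return sum(pane_tuples), len(pane_tuples)

    def wlq_avg(pane_results):
        """Window-level query (WLQ): combine one window's pane results
        into the final AVG."""
        total = sum(sm for sm, _ in pane_results)
        count = sum(ct for _, ct in pane_results)
        return total / count

    w, s = 6, 4              # window of 6 tuples sliding by 4 tuples
    p = gcd(w, s)            # pane size: 2 tuples
    panes_per_win = w // p   # 3 panes per window

    stream = [3, 1, 4, 1, 5, 9, 2, 6]
    panes = [plq_avg(stream[i:i + p]) for i in range(0, len(stream), p)]
    # Window 1 = panes 1..3 and window 2 = panes 3..5: pane 3 is shared
    # between them and is not recomputed.
    assert wlq_avg(panes[0:panes_per_win]) == sum(stream[0:w]) / w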

Approach 3: Pane-based Partitioning

• Mark each tuple with its pane-id + win-id
– Treat panes as tumbling windows with wp = sp = gcd(w,s)
• Route tuples to a node based on pane-id (see the sketch below)
• Nodes compute the PLQ over their pane tuples
• Combine all PLQ results of a window to form the WLQ result
– Needs an organized topology of nodes
– We propose organizing the nodes in a ring
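A minimal routing sketch; the modulo placement here is a placeholder assumption, since the actual node assignment is the ring organization described next:

    from math import gcd

    def route(p, w, s, d):
        """Pane-based partitioning: the tuple at 1-based position p goes to
        exactly one node, so nothing is replicated. Panes are tumbling
        windows with wp = sp = gcd(w, s)."""
        wp = gcd(w, s)
        pane_id = (p - 1) // wp + 1
        return pane_id, (pane_id - 1) % d  # placeholder pane-to-node mapping

    # w = 6, s = 2: panes of gcd(6, 2) = 2 tuples; t5 and t6 share pane 3
    # and therefore land on the same node.
    assert route(5, 6, 2, 3) == route(6, 6, 2, 3) == (3, 2)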

Ring-based Query Evaluation

• High degree of pipelined result sharing among nodes
• Organized communication topology

Example: w = 6 tuples, s = 4 tuples; pane size p = gcd(6, 4) = 2 tuples

[Figure: Split routes consecutive pane blocks from the input source to Node1, Node2, Node3 arranged in a ring (Node1: P1-P3, P8-P9, ...; Node2: P4-P5, P10-P11, ...; Node3: P6-P7, P12-P13, ...); partial results R3, R5, R7, R9, R11, R13 are forwarded around the ring so that the window results W1, W2, W3, ... can be completed and sent to Merge]
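A sketch of this example in code; window_panes and owner are illustrative names, and the round-robin owner mirrors the figure's assignment:

    # w = 6, s = 4 tuples with pane size gcd(6, 4) = 2, so each window spans
    # ww = 3 panes and slides by sw = 2 panes; d = 3 nodes form the ring.
    ww, sw, d = 3, 2, 3

    def window_panes(i):
        """Pane ids making up window i (1-based)."""
        start = (i - 1) * sw + 1
        return set(range(start, start + ww))

    def owner(i):
        """Round-robin assignment of windows to the d ring nodes (0-based)."""
        return (i - 1) % d

    # Each node computes PLQ results for its local panes, receives the
    # partial it is missing from its ring predecessor, and forwards the
    # partial its successor's window shares (e.g., R3 from Node1 completes
    # window 2).
    for i in range(1, 4):
        shared = sorted(window_panes(i) & window_panes(i + 1))
        print(f"window {i} on node {owner(i)}: panes {sorted(window_panes(i))}, "
              f"forwards partial over pane(s) {shared} to node {owner(i + 1)}")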

Assignment of Windows and Panes to Nodes

• All the pane results a node needs arrive only from its predecessor
• The pane results a node sends to its successor are only its local panes
– Each node is assigned n consecutive windows
– n is the minimum value for which the two properties above hold (checked directly in the sketch below)

Definitions:
ww : window size in # of panes
sw : slide size in # of panes
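The exact minimality condition for n is not spelled out above, so this sketch finds the smallest n by checking the two stated properties directly; it simplifies the ring to a chain of virtual nodes (the real ring maps them cyclically onto d physical nodes):

    def local_panes(n, ww, sw, num_nodes):
        """Panes routed to each virtual node: the panes of its n consecutive
        windows that no earlier node has already received."""
        seen, result = set(), []
        for k in range(num_nodes):
            panes = set()
            for i in range(k * n + 1, (k + 1) * n + 1):  # windows of node k
                start = (i - 1) * sw + 1
                panes |= set(range(start, start + ww))
            result.append(panes - seen)
            seen |= panes
        return result

    def predecessor_only(n, ww, sw, num_nodes=6):
        """Do all non-local panes of every node's windows come from its
        immediate predecessor, as required above?"""
        loc = local_panes(n, ww, sw, num_nodes)
        for k in range(1, num_nodes):
            panes = set()
            for i in range(k * n + 1, (k + 1) * n + 1):
                start = (i - 1) * sw + 1
                panes |= set(range(start, start + ww))
            if not panes <= loc[k] | loc[k - 1]:
                return False
        return True

    # ww = 3, sw = 2 (the w = 6, s = 4 example): one window per node suffices.
    assert min(n for n in range(1, 10) if predecessor_only(n, 3, 2)) == 1
    # A longer window, ww = 5 with sw = 2, forces n = 2.
    assert min(n for n in range(1, 10) if predecessor_only(n, 5, 2)) == 2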

Flexible Result Merging

• Fully-ordered (k = 0)
• FIFO
• k-ordered: k-ordering constraint [1]; a certain amount of disorder is allowed
– Defn: for any tuple s, every tuple s' that arrives at least k+1 tuples after s satisfies s'.A ≥ s.A

[1] Exploiting k-Constraints to Reduce Memory Overhead in Continuous Queries over Data Streams. ACM TODS '04.
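Under this constraint a merge stage can restore full order with a bounded buffer: holding back k+1 tuples is enough, because a still-unseen tuple smaller than everything buffered would have had to arrive within k tuples of the oldest buffered one, which has already passed. A minimal sketch, assuming a key function extracts the ordering attribute A:

    import heapq

    def k_ordered_merge(stream, k, key=lambda t: t):
        """Fully sort a k-ordered stream on the fly with a buffer of
        k+1 tuples."""
        heap = []
        for n, t in enumerate(stream):
            heapq.heappush(heap, (key(t), n, t))  # n breaks ties on equal keys
            if len(heap) > k + 1:                 # k+1 buffered tuples suffice
                yield heapq.heappop(heap)[2]
        while heap:                               # drain at end of stream
            yield heapq.heappop(heap)[2]

    # A 2-ordered stream: no tuple is more than 2 positions out of place.
    assert list(k_ordered_merge([2, 1, 3, 5, 4, 6], k=2)) == [1, 2, 3, 4, 5, 6]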

Experimental Evaluation

• Techniques implemented in Borealis
• Workload adapted from the Linear Road Benchmark
– Slightly modified segment statistics queries
– Basic aggregation functions with different window/slide ratios

Scalability of Split Operator

• Pane-based partitioning: cost & throughput remain constant regardless of the overlap ratio
• Window- & batch-based partitioning: cost increases and throughput decreases as overlap increases
• The excessive replication of window-based partitioning is reduced by batching

[Plot: maximum input rate (tuples/second) vs. window-size/slide ratio (window overlap)]

Scalability of Partitioning Techniques

• Pane-based partitioning scales close to linearly until the split is saturated
– Per-tuple cost is constant
• Window- & batch-based partitioning suffer from extremely high replication
– The split is not saturated, but throughput scales very slowly

* w/s = overlap ratio = 100

Summary & Conclusions

• Pane-based partitioning is the partitioning technique of choice
– Avoids tuple replication
– Incurs less overhead in split and aggregate
– Scales close to linearly

[Figure: recap of the three techniques: 1) window-based, 2) batch-based, 3) pane-based]

Ongoing & Future Work

• Generalization of the framework
• Support for adaptivity at runtime
• Extending the complexity of query plans
• Extending the performance analysis & experiments

Thank You!
