Post on 02-Jan-2016
PIPES: A Resource Adaptive Data Stream Management System
Bernhard Seeger, Philipps-University Marburg, Germany
Research supported by the German Research Society (DFG) grant Se 553/4-2
2
Information Landscape
[Figure: information landscape with input and output flowing through several DBMS and file system instances, plus a DSMS]
3
Outline
Motivation and problem definition
Sliding Windows
Query Processing in PIPES
  Data Stream Model
  Logical Operators
  Algebraic Query Optimization
  Physical Operators
  Runtime Environment
Dynamic Plan Migration
Conclusions
4
Example Application
Traffic monitoring
Data format
  HighwayStream( lane, speed, length, timestamp )
Continuous dataflow streams
  Variable stream rates
  Time + location dependence
Queries
  Continuous, long-running
  “At which measuring stations of the highway has the average speed of vehicles been below 15 m/s over the last 15 minutes?”
5
Data Streams
Continuously Arriving Sequence of Records
  time as an integral component
Autonomous Data Sources
  sensors, mobile devices, software agents, …
Important Type of Data
  miniaturization of hardware
  ubiquitous networks
6
Requirements
Declarative Query Language
  Expressive like (Temporal) SQL
    join of data streams according to time
    combination of data streams with persistent databases
  assigns meaning to data
  query results as a data stream
Publish/Subscribe Paradigm
  Subscribe: users register new queries
  Publish: continuous report of results
Quality of Service (QoS)
  e.g. at least one record per second
Scalability
  number of data sources
  number of subscribed queries
7
Stream Query Processing
Similar to Traditional DBMS
1. Queries expressed in CQL
   SQL-like query language
2. Logical Query Plan
   algebra with "relational" operators
3. Query Optimization
   algebraic rules
   simple, but accurate cost model
4. Physical Query Plan
   select physical operators
5. Processing of the Query
8
What is special about PIPES?
PIPES provides an Infrastructure for DSMS
  DSMS = Data Stream Management System
  PIPES = Public Infrastructure for Processing and Exploring Data Streams
Differences to DBMS
  Semantics is borrowed from Temporal Databases
    Expressiveness
    Query Optimization
  Data-Driven Query Processing
    Publish/Subscribe
  Adaptive Runtime Environment
    Dynamic assignment of resources at runtime
    Scalability, QoS
  Continuous Optimization of Queries
    Plan migration
    Scalability, QoS
10
2. Sliding Windows
Requirement of Users
  no impact of outdated data on our result
  integration of different streams according to time
Moving Temporal Windows
  Finite subsequence of an infinite stream
  Query processing is restricted to the most recent data
  Important for an expressive and efficient query processing
Options
  Count-based windows
    FIFO queue of size w
  Time-based windows
    t: time stamp of an element
    t + w + 1: end of the validity of an element
11
Problem: Determinism
Data-driven Processing
Count-based Windows: w = 2
Non-Determinism: the result of a query depends on scheduling
Example: Symmetric Join
[Figure: animation of a symmetric join over inputs a1, a2, a3 and b1, b2, b3; the two result sequences shown are a3b1 a3b2 a2b3 a3b3 and a1b3 a2b3 a3b2 a3b3]
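The scheduling dependence of count-based windows can be illustrated with a minimal sketch. It assumes a symmetric join without a join predicate (every pair matches) and one FIFO window of size w = 2 per input; the function name `symmetric_join` and the two schedules are made up for illustration and are not meant to reproduce the exact result sequences of the slide.

```python
from collections import deque


def symmetric_join(schedule, w=2):
    """Symmetric join over two count-based windows of size w.

    `schedule` is the order in which the scheduler delivers elements;
    names starting with "a" belong to input A, all others to input B.
    """
    windows = {"a": deque(maxlen=w), "b": deque(maxlen=w)}
    results = []
    for elem in schedule:
        side = elem[0]
        other = "b" if side == "a" else "a"
        windows[side].append(elem)           # FIFO: the oldest element drops out
        for partner in windows[other]:       # probe the opposite window
            pair = (elem, partner) if side == "a" else (partner, elem)
            results.append(pair)
    return results


# Same inputs, two different schedules.
a_first = symmetric_join(["a1", "a2", "a3", "b1", "b2", "b3"])
interleaved = symmetric_join(["a1", "b1", "a2", "b2", "a3", "b3"])

print(sorted(set(a_first)))
print(sorted(set(interleaved)))
```

The two runs produce different result sets, because the content of a count-based window at probing time depends on how the arrivals of A and B are interleaved.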
12
Temporal Windows in CQL
SELECT sectionID
FROM ( SELECT AVG(speed) AS avgSpeed, 1 AS sectionID
       FROM HighwayStream1 [Range 15 minutes]
       UNION ALL
       …
       UNION ALL
       SELECT AVG(speed) AS avgSpeed, 20 AS sectionID
       FROM HighwayStream20 [Range 15 minutes] )
WHERE avgSpeed < 15;
“At which measuring stations of the highway has the average speed of vehicles been below 15 m/s over the last 15 minutes ?”
14
3. Query Processing in PIPES
Data Stream Model
  Input Streams
    Autonomous source
  Logical Streams
    Semantics
  Physical Streams
    Implementation of the semantics, but more expressive
15
Input Streams
Sequence of Records
  Arbitrary, but fixed schema
  No limitation to the relational model
Records with timestamps
  Temporally ordered
Schema: HighwayStream( short lane, float speed, float length, Timestamp timestamp )
Input Stream:
  (5; 18.28; 5.27; 5:00:08)
  (2; 21.33; 4.62; 5:01:32)
  (4; 19.69; 9.97; 5:02:16)
  …
16
Physical Stream
PIPES: Time Intervals instead of Points
  Validity of an element e
    Processing of e restricted to its time interval
    Removal of invalid records
Sequence of tuples (e, [tS, tE))
  Ordered by tS and tE
  ((5; 18.28; 5.27; 5:00:08), [5:00:08, 5:00:09))
  ((2; 21.33; 4.62; 5:01:32), [5:01:32, 5:01:33))
  ((4; 19.69; 9.97; 5:02:16), [5:02:16, 5:02:17))
  …
Transformation: input stream → physical stream
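The transformation from an input stream to a physical stream can be sketched as follows. This is an illustration only: it assumes timestamps are represented as seconds since midnight and that the finest time granule is one second, which matches the intervals of the example; `hms` and `to_physical_stream` are hypothetical names, not part of PIPES.

```python
def hms(hours, minutes, seconds):
    """Seconds since midnight (assumed timestamp representation)."""
    return hours * 3600 + minutes * 60 + seconds


def to_physical_stream(input_stream, granule=1):
    """Attach the half-open validity interval [t, t + granule) to each record.

    The timestamp is assumed to be the last attribute of the record.
    """
    for record in input_stream:
        t = record[-1]
        yield (record, (t, t + granule))


highway_stream = [
    (5, 18.28, 5.27, hms(5, 0, 8)),
    (2, 21.33, 4.62, hms(5, 1, 32)),
    (4, 19.69, 9.97, hms(5, 2, 16)),
]
physical_stream = list(to_physical_stream(highway_stream))
```

Each record keeps its attributes and is valid for exactly one granule until a window operator extends that validity.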
17
Data Stream Operators
Window Operator
Relational Operators
  "relational" algebra on data streams
    projection
    selection
    Cartesian product
    union
    difference
  temporal extension of operators
18
Window Operator
Purpose
  Extension of the validity of an element by w time units
Overlap of windows of elements
  Elements need to be processed together
Window: w = 15 minutes
  (e1, [5:00:08, 5:15:09))
  (e2, [5:01:32, 5:16:33))
  (e3, [5:02:16, 5:17:17))
  …
[Figure: sliding window of 15 minutes; an element with start time tS is valid for w + 1 time units, until tS + 1 + w]
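A minimal sketch of the window operator, assuming intervals are pairs of seconds since midnight and that elements enter the operator with a validity of one second; the function name `window` and the integer representation are assumptions made for this illustration.

```python
def window(physical_stream, w):
    """Extend the validity of every element by w time units.

    An element (e, [tS, tE)) becomes (e, [tS, tE + w)).
    """
    for element, (ts, te) in physical_stream:
        yield (element, (ts, te + w))


# e1 is valid during [5:00:08, 5:00:09), i.e. [18008, 18009) in seconds.
stream = [("e1", (18008, 18009)), ("e2", (18092, 18093)), ("e3", (18136, 18137))]
windowed = list(window(stream, w=15 * 60))
```

With w = 15 minutes, e1 becomes valid during [18008, 18909), which corresponds to [5:00:08, 5:15:09) in the example.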
19
Relational Stream Operators
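Snapshot-Reducibility
  Snapshot
    Mapping of a physical stream to a non-temporal relation
    Relation comprises all valid elements at point t
[Figure: commuting diagram; taking the snapshot at t of the input streams S1, …, Sn gives the relations R1, …, Rn; the relational stream operator maps S1, …, Sn to Sout, the relational operator maps R1, …, Rn to Rout, and Rout is the snapshot of Sout at t]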
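Snapshot-reducibility can be checked mechanically for a simple operator. The sketch below does this for a selection; it assumes a physical stream is a list of (element, (tS, tE)) pairs with integer time, and all function names are hypothetical.

```python
def snapshot(stream, t):
    """Non-temporal relation of all elements that are valid at time t."""
    return [e for e, (ts, te) in stream if ts <= t < te]


def select_stream(stream, predicate):
    """Stream selection: keeps the validity interval untouched."""
    return [(e, interval) for e, interval in stream if predicate(e)]


def select_relation(relation, predicate):
    """Ordinary relational selection."""
    return [e for e in relation if predicate(e)]


def is_snapshot_reducible(stream, predicate, times):
    """Check that snapshot and selection commute at every given time."""
    return all(
        snapshot(select_stream(stream, predicate), t)
        == select_relation(snapshot(stream, t), predicate)
        for t in times
    )


speeds = [(18.28, (8, 909)), (21.33, (92, 993)), (12.5, (136, 1037))]


def slow(speed):
    return speed < 15


print(is_snapshot_reducible(speeds, slow, range(0, 1100)))
```

The check only samples the given points in time, so it is a test of the property for one operator and one stream, not a proof.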
20
Query Optimization
Application of Well-known Rules from Temporal Databases
  Slivinskas, Jensen, Snodgrass (ICDE 2000): Query Plans for Conventional and Temporal Queries Involving Duplicates and Ordering
  many rules directly applicable to streams
  conventional + temporal rules
Basis for Effective Query Optimization
21
Steps
1) Query
2) Logical Query Plan
3) Query Optimization
4) Physical Query Plan
“At which measuring stations of the highway has the average speed of vehicles been below 15 m/s over the last 15 minutes?”
SELECT sectionID
FROM ( SELECT AVG(speed) AS avgSpeed, 1 AS sectionID
       FROM HighwayStream1 [Range 15 minutes]
       UNION ALL
       …
       UNION ALL
       SELECT AVG(speed) AS avgSpeed, 20 AS sectionID
       FROM HighwayStream20 [Range 15 minutes] )
WHERE avgSpeed < 15;
Operators of the query plan (from sink to source):
  Map: projection on sectionID
  Filter: avgSpeed < 15
  Union: merge of data streams
  Aggregation: average speed (avgSpeed)
  Map: projection on speed, assigning sectionID
  Window: w = 15 minutes
22
Physical Operators
Stateless Operators
  Processing of an element is independent of the previous ones
  Examples: filter, map
Stateful Operators
  Processing of an element depends on previous elements
  Restrict to elements in sliding window
  Explicit management of status
  Examples: join, aggregation
23
Data-driven Joins
Input
  streams A and B and a sliding window of size w
  join predicate P
Output
  records ((a,b), [tS,tE))
    P(a,b) holds
    overlapping intervals of a and b
[Figure: the result (a,b) is valid during [tS, tE), the overlap of the intervals of a and b]
24
Methodology
Adaptation of Sweepline Technique
  tA = start time of last element of A
  tB = start time of last element of B
Status for each input
  Status of A: elements of A with end time ≥ tB
  Status of B: elements of B with end time ≥ tA
Continuous Processing
[Figure: each element arriving from A or B is inserted into its own status (StatusA, StatusB) and then used for probing and reorganisation of the opposite status]
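The insert, reorganise and probe cycle of the data-driven join can be sketched as below. This is a simplified illustration, not the PIPES implementation: the status is a plain list rather than an efficient data structure, output ordering is not enforced, and because intervals are half-open the sketch discards an element as soon as its end time is less than or equal to the latest start time of the opposite input. The names `Element` and `symmetric_interval_join` are hypothetical.

```python
from typing import NamedTuple


class Element(NamedTuple):
    value: object
    ts: int  # start of validity (inclusive)
    te: int  # end of validity (exclusive)


def symmetric_interval_join(arrivals, predicate):
    """Join two streams of interval-stamped elements.

    `arrivals` yields ("A" or "B", Element) in processing order; each input
    is assumed to be ordered by start time.
    """
    status = {"A": [], "B": []}
    results = []
    for side, elem in arrivals:
        other = "B" if side == "A" else "A"
        status[side].append(elem)                       # insertion
        # Reorganisation: opposite elements that end at or before the new
        # start time cannot overlap with this or any later element of `side`.
        status[other] = [o for o in status[other] if o.te > elem.ts]
        for o in status[other]:                         # probing
            a, b = (elem, o) if side == "A" else (o, elem)
            ts, te = max(a.ts, b.ts), min(a.te, b.te)   # interval intersection
            if ts < te and predicate(a.value, b.value):
                results.append(((a.value, b.value), ts, te))
    return results


arrivals = [
    ("A", Element("a1", 0, 5)),
    ("B", Element("b1", 3, 8)),
    ("A", Element("a2", 6, 11)),
    ("B", Element("b2", 12, 17)),
]
joined = symmetric_interval_join(arrivals, lambda a, b: True)
```

In this run b2 finds an empty status for A, since both a1 and a2 have expired before its start time.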
25
Runtime Environment of PIPES
[Figure: runtime architecture with sources, the query graph inside PIPES, and sinks]
27
4. Plan Migration
Re-Optimization of Query Plans at Runtime
  Identification of poorly performing subgraphs in the query graph
Plan Migration
  Substitution of the old plan by a new one
Requirements
  Preservation of snapshot-reducibility
  Continuous production of results
  Short migration time
28
Example
[Figure: query graph over the sources R, S, T, U with the sinks C1 and C2]
29
Semantic Problems
Duplicates
  Parallel insertion of new elements into both plans
Loss of Results
  Exclusive insertion of new elements into the new plan
30
Approach in PIPES
Assumptions
  Streams A and B
  Window of length w
  Equivalent query plans Pold and Pnew
Earliest split time: tsplit = max {tA, tB} + w
Splitting of the input at split time tsplit
31
Approach in PIPES
Production of Results
  Acceptance of all results received from the old plan Pold
  Selection of results received from the new plan Pnew
    Acceptance only if start time > tsplit
[Figure: the inputs A and B each pass through a Split operator that feeds Pold and Pnew]
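The two rules stated for the migration, the earliest split time and the acceptance of results, can be sketched as follows. The sketch covers only these two rules; how the Split operators route elements to the plans is not modelled. Results are assumed to be (payload, tS, tE) triples, and both function names are hypothetical.

```python
def earliest_split_time(t_a, t_b, w):
    """tsplit = max{tA, tB} + w, with tA, tB the latest start times seen."""
    return max(t_a, t_b) + w


def merge_plan_outputs(old_results, new_results, t_split):
    """Accept every result of the old plan, and a result of the new plan
    only if its start time lies after the split time."""
    accepted = list(old_results)
    accepted.extend(r for r in new_results if r[1] > t_split)
    return accepted


t_split = earliest_split_time(t_a=100, t_b=104, w=15)
old_results = [("x", 110, 120)]
new_results = [("y", 118, 125), ("z", 121, 130)]
merged = merge_plan_outputs(old_results, new_results, t_split)
```

Here the result "y" of the new plan is dropped because it starts before the split time, which is the rule that prevents duplicates.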
32
Properties
Method is broadly applicable
  Arbitrary plans
  Many data streams
  Different window sizes
Migration Time
  Worst case: w time units
34
5. Conclusions
Applications
  Traffic management
  Alarming systems
  Observation of production lines
Basic ideas of stream processing in PIPES
  Temporal Databases
  Data-driven query processing
  Adaptivity at runtime
  Continuous optimization at runtime
Dynamic Plan Migration
  Broadly applicable approach
35
Current Work
Problems
  Cost models for optimization
    New techniques
  Strategies for adaptation
    Memory
    CPU
    QoS
  Runtime environment
    Real-time applications
Real applications for DSMS
  Observation of patients in hospitals
  Processing of sensor data
Coupling of PIPES and commercial products
36
Related Work
Abadi, Carney, Cetintemel et al.: Aurora: A new model and architecture for data stream management. The VLDB Journal, 12(2):120-139, 2003.
Arasu, Babu, and Widom: The CQL continuous query language: Semantic foundations and query execution. Technical Report 2003-67, Stanford University, 2003.
Tucker, Maier, Sheard, and Fegaras: Exploiting punctuation semantics in continuous data streams. IEEE Trans. Knowledge and Data Eng., 15(3):555-568, 2003.
Law, Wang, and Zaniolo: Query languages and data models for database sequences and data streams. In VLDB, pages 492-503, 2004.
37
Papers on PIPES/XXL
Michael Cammert, Jürgen Krämer, Bernhard Seeger, Sonny Vaupel: An Approach to Adaptive Memory Management in Data Stream Systems. To appear in Proc. ICDE 2006.
Michael Cammert, Christoph Heinz, Jürgen Krämer, Bernhard Seeger: Sortierbasierte Joins über Datenströmen. BTW 2005, Karlsruhe, Germany, March 2-4.
Björn Blohsfeld, Christoph Heinz, Bernhard Seeger: Maintaining Nonparametric Estimators over Data Streams. BTW 2005, Karlsruhe, Germany, March 2-4.
Christoph Heinz, Bernhard Seeger: Wavelet Density Estimators over Data Streams (Extended Abstract). ACM Symposium on Applied Computing, Santa Fe, New Mexico, 2005.
Michael Cammert, Christoph Heinz, Jürgen Krämer, Bernhard Seeger: Anfrageverarbeitung auf Datenströmen. Datenbank-Spektrum 11: 5-13, 2004.
Jürgen Krämer, Bernhard Seeger: PIPES - A Public Infrastructure for Processing and Exploring Data Streams. Proc. SIGMOD 2004 (Demo).
Jochen Van den Bercken, Björn Blohsfeld, Jens-Peter Dittrich, Jürgen Krämer, Tobias Schäfer, Martin Schneider, Bernhard Seeger: XXL - A Library Approach to Supporting Efficient Implementations of Advanced Database Queries. In Proc. of the Conf. on Very Large Databases (VLDB), 39-48, September 2001.
38
Future Work
Query optimization
  Adequate cost model
    Not only stream rates
    Runtime statistics: delays, memory usage, etc.
Static query optimization
  Multi-query optimization
  Subquery sharing
Dynamic query optimization
  Detection of suitable subgraphs
  Plan migration at runtime
Temporal aspects
  Coalesce
Thank you!
Any questions?
For more information check our website:
http://dbs.mathematik.uni-marburg.de/Home/Research/Projects/PIPES
40
Reorganization
Restriction of memory usage
Which elements can be discarded from internal data structures?
  All elements where tE ≤ min_j tSj
  tSj: latest start timestamp of input stream j
Why?
  Ordering invariant: no temporal overlap with future stream elements
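The discard rule can be written as a short sketch. It assumes the internal state is a list of (element, tS, tE) triples with half-open intervals, so an element is kept only while its end time is greater than the smallest of the latest start timestamps; `reorganize` is a hypothetical name.

```python
def reorganize(state, latest_starts):
    """Drop every element that can no longer overlap with future input.

    `latest_starts` holds, per input stream j, the start timestamp of the
    most recent element seen on that stream.
    """
    bound = min(latest_starts)
    return [(e, ts, te) for (e, ts, te) in state if te > bound]


state = [("x", 0, 5), ("y", 2, 9), ("z", 6, 12)]
state = reorganize(state, latest_starts=[7, 5])
```

Because every input is ordered by start time, no future element can start before the bound, so the dropped element "x" cannot contribute to any further result.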
41
Aggregation
Incremental computation
Efficient implementation
  Aggregation segment-tree
  Amortized logarithmic costs per element
[Figure: example for Sum; a new element is inserted into the tree T holding the current state (aggregates), followed by reorganization]
42
Outline
Motivation and problem definition
Query formulation
Our temporal approach
  Stream types
  Logical query plans
  Query optimization
  Physical query plans
  Query execution
Exploration of Data Streams
Conclusions
43
Exploration of Data Streams
Example
  Estimation of selectivity during runtime of continuous range queries:
    select * from Stream S
    where S.measure between min and max
Our Approach
  Exploit the density p of the distribution
    Represents all information about the distribution
    Suitable for estimating the selectivity of multiple queries
  Selectivity: $\int_{\mathit{min}}^{\mathit{max}} p(x)\,dx$
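As a rough illustration of the selectivity integral, the sketch below replaces the unknown density p by the empirical distribution of a sample, so that the integral from min to max becomes the fraction of sample values inside the range. This is a stand-in chosen for brevity, not the estimator used in PIPES; `estimate_selectivity` is a hypothetical name.

```python
from bisect import bisect_left, bisect_right


def estimate_selectivity(sample, lo, hi):
    """Fraction of sample values v with lo <= v <= hi.

    Approximates the integral of the density over [lo, hi] by the
    difference of the empirical CDF at the two bounds.
    """
    ordered = sorted(sample)
    inside = bisect_right(ordered, hi) - bisect_left(ordered, lo)
    return inside / len(ordered)


sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
selectivity = estimate_selectivity(sample, lo=3, hi=7)
```

One estimate of the distribution serves any number of range queries, since only the bounds change between calls.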
44
Requirement
Problem
  Density is unknown
Adaptation of a non-parametric density estimation technique
  Kernels
  Wavelets
  Sampling and CDF
Requirements
  Low resource consumption (memory and CPU)
  Memory and CPU adaptive
    Increasing memory size → higher accuracy
  Valid estimation at each point in time
  Adapt to a changing distribution
45
Reservoir Sampling
CDF is built on top of the iid samples
Disadvantages
  Estimation relies on a few elements
  No advantage from an increasing memory
Advantage
  Low processing overhead
[Figure: data stream (…, 34, …, 5, …, 12, 4, …, 27, …) from position 0 to j, with the samples 12, 5, 27, 34, 4 kept in main memory]
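Reservoir sampling keeps a uniform sample of fixed size m over a stream of unknown length. The sketch below uses the classic replacement scheme (often called Algorithm R); the slide does not say which variant is used, so this choice and the name `reservoir_sample` are assumptions.

```python
import random


def reservoir_sample(stream, m, rng=None):
    """Return a uniform random sample of at most m items from `stream`."""
    rng = rng or random.Random()
    reservoir = []
    for j, item in enumerate(stream):
        if j < m:
            reservoir.append(item)      # fill the reservoir first
        else:
            i = rng.randint(0, j)       # uniform over the j + 1 items seen so far
            if i < m:                   # happens with probability m / (j + 1)
                reservoir[i] = item     # replace a random slot
    return reservoir


sample = reservoir_sample(range(1000), m=5, rng=random.Random(42))
```

The cost per element is constant, which is the low processing overhead noted above, while the sample size stays fixed at m regardless of the stream length.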
46
Blockwise Estimation
Stream is transformed into blocks
  For simplicity: blocks are of the same size
Idea
  Estimation of the first k blocks is available
  Compute the estimation of k+1 blocks iteratively
Example (Average)
  $avg_{k+1} = \frac{k}{k+1}\, avg_k + \frac{1}{k+1}\, avg^{act}_{k+1}$
Generalization for density functions
  Straightforward extension
  Problem: violates the requirement of limited memory
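The iterative update for the average can be sketched as follows, assuming equally sized blocks as stated above; `blockwise_average` is a hypothetical name.

```python
def blockwise_average(blocks):
    """Yield the average over the first k + 1 blocks after each block.

    Only the previous estimate and the average of the current block are
    needed: avg_{k+1} = (k * avg_k + avg_act_{k+1}) / (k + 1).
    """
    avg = 0.0
    for k, block in enumerate(blocks):
        block_avg = sum(block) / len(block)
        avg = (k * avg + block_avg) / (k + 1)
        yield avg


estimates = list(blockwise_average([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```

For the average the state is a single number; the memory problem named above arises only when the same scheme is applied to density estimates, whose representation grows with every block.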
47
Cumulative-Compressed Estimation
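Compression
  Cubic splines
Weighting strategies
Amortized cost for updates: O(log M)
$\hat{f}_{k+1}(x) = \mathrm{compress}\big((1 - \omega_{k+1})\,\hat{f}_k(x) + \omega_{k+1}\,\hat{s}_{k+1}(x)\big)$
[Figure: the sample in main memory (12, 5, 27, 34, 4) is combined with the current estimator at time k to obtain the estimator at time k+1]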
48
Experimental Comparison
Streaming data from a real traffic data set
Arithmetic weights
Memory size: 5000