Early Profile Pruning on XML-aware Publish-Subscribe Systems
Mirella M. Moro, Petko Bakalov, Vassilis J. Tsotras
University of California
VLDB 2007
Presented by Lee Jae-won (SNU)
Copyright 2006 by CEBT
Introduction
Publish-subscribe applications (pub-sub) are an important class of asynchronous content-based dissemination systems
Notification websites
– Users can subscribe to events of interest and get automatic notification when relevant events arrive in the system
Events are announced with a message generated outside the system by third-party applications referred to as publishers
These messages are then selectively delivered to interested subscribers that have announced their interest by submitting profiles
IDS Lab. Seminar - 2Center for E-Business Technology
Introduction
Architecture of a pub-sub system
Matching process
– Finding (filtering) which messages satisfy which profile subscriptions
Introduction
Three predominant strategies to design the matching process
Standard relational approach
– Translating profiles and messages to the Relational Model
– The matching can be expressed as join operations
Index techniques
– Aggregating the profiles using some indexing techniques
– The matching process reads the input message and traverses the index to select the profiles that the message satisfies
Finite State Machine (FSM)
– Existing FSM approaches evaluate documents top-down
– This paper proposes a bottom-up approach
An XML document has its more selective elements located at its leaves
Bottom-up Filtering FSM (BUFF)
Changing the order in which the document is evaluated also requires changing the order in which the query is evaluated
For query 1
– NFA executes transitions to states 1, 2 and 3 eleven times before reaching the final state 4
– BUFF executes transitions to states 1, 2 and 3 only once, then reaches the final state 4
BUFF – Automaton Matching Process
Algorithm
Keep a runtime stack (RS) that stores the current document path being processed
For each opening tag of the XML document, the respective element e is pushed onto RS
For each closing tag, an element e is popped from RS
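The stack handling above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the paper's automaton: the function names `document_matches` and `path_matches_bottom_up` are hypothetical, the query is assumed to be a plain list of tag names connected by child steps, and the check is run whenever a new element is pushed. The point is that the query is compared against RS starting from the deepest element, i.e. bottom-up.

```python
import io
import xml.etree.ElementTree as ET

def path_matches_bottom_up(rs, query_path):
    """Compare the query with RS starting from the deepest (most recent) element."""
    if len(query_path) > len(rs):
        return False
    # Walk both sequences from the leaf end towards the root.
    return all(q == e for q, e in zip(reversed(query_path), reversed(rs)))

def document_matches(xml_text, query_path):
    """Return True if some document path ends with query_path (assumed semantics)."""
    rs = []  # RS: runtime stack holding the current document path
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            rs.append(elem.tag)   # opening tag: push element e onto RS
            if path_matches_bottom_up(rs, query_path):
                return True
        else:
            rs.pop()              # closing tag: pop element e from RS
    return False
```

For instance, `document_matches("<a><b><c/></b><d/></a>", ["b", "c"])` would report a match, while the path `["a", "c"]` would not, because `c` is not a direct child of `a`.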
BUFF – Automaton Matching Process
Example
When a final state (in this case, 4) is reached, the algorithm terminates
– The document and the query (BUFF) are matched
Bounding-based XML Filtering (BoXFilter)
BoXFilter
An XML filtering approach based on an index-based filtering technique
BoXFilter Core Module
Prufer Sequence – Background
Example
Initially, vertex 1 is the leaf with the smallest label
– So it is removed first and “4” is put in the Prufer sequence
Vertices 2 and 3 are removed next, so “4” is added twice more
Vertex 4 is now a leaf and has the smallest label
– So it is removed
– We append “5” to the sequence
– Only two vertices remain, so the process stops after n-2 removals
The tree’s sequence is 4445
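The construction in this example can be reproduced with a short Python sketch. The slide's figure is not available here, so the edge list is an assumption: vertices 1, 2 and 3 attached to 4, 4 attached to 5, and a sixth vertex (labelled 6 here) attached to 5, which is consistent with two vertices remaining at the end. The function name `prufer_sequence` is illustrative.

```python
def prufer_sequence(edges):
    """Build the Prufer sequence of a labelled tree given as (u, v) edge pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seq = []
    while len(adj) > 2:
        # Pick the leaf with the smallest label.
        leaf = min(v for v, nbrs in adj.items() if len(nbrs) == 1)
        (neighbour,) = adj.pop(leaf)   # a leaf has exactly one neighbour
        adj[neighbour].discard(leaf)
        seq.append(neighbour)          # record the neighbour of the removed leaf
    return seq

# Assumed tree for the example (the edge 5-6 is a guess at the lost figure).
example_edges = [(1, 4), (2, 4), (3, 4), (4, 5), (5, 6)]
print(prufer_sequence(example_edges))
```

With the assumed edges, the removals happen in the order 1, 2, 3, 4, giving the sequence 4, 4, 4, 5 described above.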
BoXFilter
Sequence Envelope
The BoXFilter index tree is based on the concept of a Sequence Envelope
Assume that all profile sequences have the same length l
Consider a set of k Prufer sequences of profiles, S_1 … S_k
We derive two new sequences (of length l) called the lower and the upper bound, or L and U respectively
– $L_i = \min(S_{1i}, \dots, S_{ki})$
– $U_i = \max(S_{1i}, \dots, S_{ki})$
Example
– L = ABCABABABAB
– U = DEDEEEDEDED
– $\forall i,\ \forall j \in \{1, \dots, k\}:\ L_i \le S_{ji} \le U_i$
SE ≡ (L, U)
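As a minimal sketch of the definition above, the envelope can be computed position by position. The three input sequences are made up for illustration and are not the profiles from the slides; the function name `sequence_envelope` is likewise hypothetical. Symbols are compared by ordinary character order.

```python
def sequence_envelope(sequences):
    """Return (L, U): per-position minimum and maximum over equal-length sequences."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length l")
    # zip(*sequences) yields, for each position i, the symbols S_1i ... S_ki.
    lower = "".join(min(column) for column in zip(*sequences))
    upper = "".join(max(column) for column in zip(*sequences))
    return lower, upper

# Hypothetical profile sequences of length 3.
L, U = sequence_envelope(["ABC", "DBA", "BEC"])
print(L, U)  # ABA DEC
```

Every input sequence then lies between L and U at each position, which is the property the index relies on for pruning.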
BoXFilter
Sequence Envelope
Filtering Algorithms
We assume that there are 8 profiles submitted to the system
These are represented in XML format
The profiles are transformed into Prufer sequences
These sequences are then inserted into the BoXFilter tree
Filtering Algorithms
For each index node, the figure shows the sequence envelope that contains all envelopes from its children
Input Document D = ABCFABABABABF
Filtering Algorithm
Sequential Processing
Input Document D = ABCFABABABABF
We start by comparing document D to the sequence envelope at the root of the index tree
Since there is a match, we examine the root’s three child nodes
We have a subsequence match only with the first child (node 2)
– So we ignore the subtrees at the second and third child nodes
We examine the children of node 2, which are leaf nodes
We do subsequence matching between the document and each of the strings in these leaf nodes
There is a match between the document and leaf number 1
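The traversal described above can be sketched as a recursive descent that prunes a subtree as soon as the document fails its envelope. This is a simplified illustration, not the paper's algorithm: it assumes a document "matches" an envelope when some subsequence of the document fits position by position between L and U, and the `Node` class, the function names and the tiny two-leaf index are all hypothetical (the slides' 8-profile tree is not reproduced).

```python
from dataclasses import dataclass, field

def fits_envelope(doc, lower, upper):
    """True if some subsequence of doc lies position-wise within [lower, upper]."""
    i = 0
    for ch in doc:
        # Greedily take the earliest symbol that fits the next envelope position.
        if i < len(lower) and lower[i] <= ch <= upper[i]:
            i += 1
    return i == len(lower)

@dataclass
class Node:
    lower: str
    upper: str
    children: list = field(default_factory=list)
    profiles: list = field(default_factory=list)  # sequences stored at leaf nodes

def filter_document(doc, node):
    """Return the profile sequences matched by doc, pruning failed subtrees."""
    if not fits_envelope(doc, node.lower, node.upper):
        return []  # early pruning: no profile below this node can match
    if not node.children:
        # Leaf: exact subsequence test against each stored profile sequence.
        return [p for p in node.profiles if fits_envelope(doc, p, p)]
    matched = []
    for child in node.children:
        matched.extend(filter_document(doc, child))
    return matched

# Hypothetical index: two leaves under one root, all sequences of length 3.
leaf1 = Node("ABC", "ADC", profiles=["ABC", "ADC"])
leaf2 = Node("EEA", "GEB", profiles=["EEA", "GEB"])
root = Node("ABA", "GEC", children=[leaf1, leaf2])

print(filter_document("ABCFABABABABF", root))  # ['ABC']
```

In this toy index the document passes the root and `leaf1` envelopes, while `leaf2` is discarded without examining its profiles, which mirrors the way the second and third children are skipped in the walkthrough.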
Batch Processing
The matching process is performed by joining the BoXFilter tree with a document tree (the documents are also organized in BoXFilter form)
Experiments
Setup
Datasets with 1,000, 10,000 and 100,000 small documents (up to 8 KB each)
Queries were specified for those datasets considering paths with 3 to 10 elements
Varying the Number of Documents
Experiments
Varying the Number of Queries
The running time of NFA and BUFF grows linearly with the number of documents and queries evaluated
BoXFilter shows constant performance for a relatively small number of queries
Experiments
Varying the Selectivity
Selectivity means how many documents satisfy any of the queries
Experiments
Batching BoXFilter
B-BoXFilter is advantageous when the time spent on the extra step (tree creation) is less than the time saved in matching
Conclusion
We proposed an FSM-based approach (BUFF)
It evaluates the documents in a bottom-up order
We introduced the idea of early profile pruning and proposed a sequence-based index (BoXFilter)
It allows queries to be pruned very efficiently
First, documents and queries are transformed into sequences and grouped into envelopes
Then, the queries can be pruned by evaluating the lower and upper bounds of their envelopes
– This is the first time that the concept of envelopes is employed for XML query processing