Early Profile Pruning on XML-aware Publish- Subscribe Systems Mirella M. Moro, Petko Bakalov,...


Early Profile Pruning on XML-aware Publish-Subscribe Systems

Mirella M. Moro, Petko Bakalov, Vassilis J. Tsotras

University of California

VLDB 2007

Presented by Lee Jae-won (SNU)


Copyright 2006 by CEBT

Introduction

Publish-subscribe applications (pub-sub) are an important class of asynchronous content-based dissemination systems

Notification websites

– Users can subscribe to events of interest and get automatic notification when relevant events arrive in the system

Events are announced with a message generated outside of the system by third-party applications referred to as publishers

These messages are then selectively delivered to interested subscribers that have announced their interest by submitting profiles

IDS Lab. Seminar - 2Center for E-Business Technology


Introduction

Architecture of a pub-sub system

Matching process

– Finding (filtering) which messages satisfy which profile subscriptions


Introduction

Three predominant strategies to design the matching process

Standard relational approach

– Translating profiles and messages to the Relational Model

– The matching can be expressed as join operations

Index techniques

– Aggregating the profiles using some indexing techniques

– The matching reads the input message and traverses the index in order to select the profiles it satisfies

Finite State Machine (FSM)

– Top-down approach

– This paper proposes a bottom-up approach

An XML document has its more selective elements located at its leaves


Bottom-up Filtering FSM (BUFF)

Changing the order in which the document is evaluated also requires changing the order in which the query is evaluated

For query 1

– NFA executes transitions to state 1, 2 and 3 eleven times before achieving the final state 4

– BUFF executes transitions to state 1, 2 and 3 only once, then achieves the final state 4


BUFF – Automaton Matching Process

Algorithm

Keep a runtime stack (RS) that stores the current document path being processed

For each opening tag of the XML document, the respective element e is pushed onto RS

For each closing tag, an element e is popped from RS
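
The stack handling above can be sketched in Python. This is only an illustration of the push/pop discipline, not the BUFF automaton from the paper: the function name `bottom_up_path_matches`, the restriction to child-axis paths, and the choice to compare the path from its last element backwards at each closing tag are all assumptions made for this example.

```python
import io
import xml.etree.ElementTree as ET


def bottom_up_path_matches(xml_text, query_path):
    """Return the document paths whose tail equals query_path.

    query_path is a list of tag names (child axis only), e.g. ["b", "c"].
    Hypothetical helper: it mimics the runtime stack (RS) only, not the FSM.
    """
    runtime_stack = []  # RS: tags on the current root-to-element path
    hits = []
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, element in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            # Opening tag: push the element onto RS.
            runtime_stack.append(element.tag)
            continue
        # Closing tag: compare the path bottom-up, last query step first.
        if len(runtime_stack) >= len(query_path) and all(
            tag == step
            for tag, step in zip(reversed(runtime_stack), reversed(query_path))
        ):
            hits.append("/".join(runtime_stack))
        # Closing tag: pop the element from RS.
        runtime_stack.pop()
    return hits


doc = "<a><b><c/></b><d><c/></d></a>"
print(bottom_up_path_matches(doc, ["b", "c"]))  # ['a/b/c']
```

Comparing from the leaf end means a path is rejected as soon as its last (usually most selective) element differs from the query.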


BUFF – Automaton Matching Process

Example

When a final state (in this case, 4) is reached, the algorithm terminates

– Document & query (BUFF) are matched


Bounding-based XML Filtering (BOXFILTER)

BoxFilter

An XML filtering approach based on index-based filtering techniques

BoxFilter Core Module


Prufer Sequence – Background

Example

Initially, vertex 1 is the leaf with the smallest label

– So it is removed first and “4” is put in the Prufer sequence

Vertices 2 and 3 are removed next, so “4” is added twice more

Vertex 4 is now a leaf and has the smallest label

– So it is removed

– We append “5” to the sequence

– We are left with only two vertices, so the process stops (the sequence has n-2 entries)

The tree’s sequence is 4445
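
The removal steps above can be reproduced with a short Python sketch of the classic Prufer construction. The edge list is an assumption: the slide's figure is not in the transcript, so the tree below (vertices 1, 2 and 3 attached to 4, 4 attached to 5, and a sixth vertex attached to 5) is simply one tree consistent with the described steps. The function name `prufer_sequence` is also hypothetical.

```python
import heapq
from collections import defaultdict


def prufer_sequence(edges):
    """Return the Prufer sequence of a labeled tree given as (u, v) edges."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    # Min-heap of current leaves, so the smallest-labeled leaf goes first.
    leaves = [v for v, ns in neighbors.items() if len(ns) == 1]
    heapq.heapify(leaves)

    sequence = []
    for _ in range(len(neighbors) - 2):  # stop when two vertices remain
        leaf = heapq.heappop(leaves)
        parent = neighbors[leaf].pop()  # the leaf's only neighbor
        sequence.append(parent)
        neighbors[parent].discard(leaf)
        if len(neighbors[parent]) == 1:  # the neighbor just became a leaf
            heapq.heappush(leaves, parent)
    return sequence


# Assumed tree, chosen to be consistent with the steps on this slide.
edges = [(1, 4), (2, 4), (3, 4), (4, 5), (5, 6)]
print(prufer_sequence(edges))  # [4, 4, 4, 5]
```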


BoxFilter

Sequence Envelope

BoxFilter index tree is based on the concept of a Sequence Envelope

Assume that all profile sequences have the same length l

Consider a set of k Prufer sequences of profiles, S1 … Sk

We derive two new sequences (with length l) called the upper and the lower bound, or U and L respectively

– $L_i = \min(S_{1,i}, \dots, S_{k,i})$

– $U_i = \max(S_{1,i}, \dots, S_{k,i})$

Example

– L = ABCABABABAB

– U = DEDEEEDEDED

– $\forall i,\ \forall j \in \{1, \dots, k\}:\ L_i \le S_{j,i} \le U_i$

SE ≡ (L, U)
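
The envelope definition above can be illustrated with a minimal Python sketch. The three profile strings are made up for this example (the sequences behind the slide's L and U are not given in the transcript), symbols are compared alphabetically, and the function name `sequence_envelope` is an assumption.

```python
def sequence_envelope(sequences):
    """Return (L, U), the position-wise minimum and maximum of the sequences."""
    if not sequences:
        raise ValueError("at least one sequence is required")
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length l")
    columns = list(zip(*sequences))  # columns[i] holds the i-th symbol of each S_j
    lower = "".join(min(column) for column in columns)
    upper = "".join(max(column) for column in columns)
    return lower, upper


# Hypothetical profile sequences, all of length 4.
profiles = ["ABCD", "BBAE", "ADCA"]
lower, upper = sequence_envelope(profiles)
print(lower, upper)  # ABAA BDCE

# Every sequence lies inside the envelope at every position.
assert all(
    lo <= symbol <= hi
    for s in profiles
    for lo, symbol, hi in zip(lower, s, upper)
)
```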


BoxFilter

Sequence Envelope


Filtering Algorithms

We assume that there are 8 profiles submitted to the system

These are represented in XML format

The profiles are transformed into Prufer sequences

Then these are inserted into the BoXFilter tree


Filtering Algorithms

For each index node, the figure shows the sequence envelope that contains all envelopes from its children

Input Document D = ABCFABABABABF


Filtering Algorithm

Sequential Processing

Input Document D = ABCFABABABABF

We start by comparing the sequence of document D with the sequence envelope stored at the root of the BoXFilter tree

Since there is a match we examine the root’s three children nodes

We have a subsequence matching only with the first child (node 2)

– So we ignore the subtrees at the second and the third children nodes

We examine the children of node 2, which are now leaf nodes

We do subsequence matching between the document and each of the strings in these leaf nodes

There is a match between the document and leaf number 1
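
The sequential traversal above can be sketched as follows. Everything except the document string D is an assumption for illustration: the `Node` class, the four made-up profiles, the two-level tree, and the greedy test used to decide whether a document can still match something inside an envelope. The paper's actual matching test may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    lower: str
    upper: str
    children: List["Node"] = field(default_factory=list)
    profile: Optional[str] = None  # set on leaf nodes only


def leaf(profile):
    # A leaf's envelope collapses to the profile sequence itself.
    return Node(profile, profile, profile=profile)


def inner(children):
    # An index node's envelope contains the envelopes of all its children.
    lower = "".join(min(col) for col in zip(*(c.lower for c in children)))
    upper = "".join(max(col) for col in zip(*(c.upper for c in children)))
    return Node(lower, upper, children=children)


def envelope_may_match(document, lower, upper):
    """Greedy check: pick symbols of document, in order, with the i-th pick
    inside [lower[i], upper[i]]. For a leaf this is a subsequence test."""
    pos = 0
    for lo, hi in zip(lower, upper):
        while pos < len(document) and not (lo <= document[pos] <= hi):
            pos += 1
        if pos == len(document):
            return False
        pos += 1
    return True


def filter_document(node, document):
    if not envelope_may_match(document, node.lower, node.upper):
        return []  # prune this node and its whole subtree
    matches = [node.profile] if node.profile is not None else []
    for child in node.children:
        matches.extend(filter_document(child, document))
    return matches


# Hypothetical profiles; only D comes from the slide.
root = inner([inner([leaf("ABF"), leaf("AFC")]),
              inner([leaf("XYZ"), leaf("XZZ")])])
D = "ABCFABABABABF"
print(filter_document(root, D))  # ['ABF']
```

In this toy tree the second index node is rejected by its envelope alone, so its two leaves are never compared against the document.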

Batch Processing

The matching process is performed by joining the BoXFilter tree and the document tree (documents organized in the form of a BoXFilter tree)


Experiments

Setup

Datasets with 1,000, 10,000 and 100,000 small documents (up to 8 KB)

Queries were specified for those datasets considering paths with 3 to 10 elements

Varying the Number of Documents


Experiments

Varying the Number of Queries

NFA and BUFF scale linearly with the number of documents and queries evaluated

BoXFilter has a constant performance for a relatively small number of queries


Experiments

Varying the Selectivity

Selectivity measures how many documents satisfy at least one of the queries


Experiments

Batching BoXFilter

B-BoXFilter is advantageous when the time spent for the extra step (tree creation) is less than the benefit in the matching time


Conclusion

We proposed an FSM-based approach (BUFF)

It evaluates the documents in a bottom-up order

We introduced the idea of early profile pruning and proposed a sequence-based index (BoXFilter)

It allows queries to be pruned very efficiently

First, documents and queries are transformed into sequences and grouped into envelopes

Then, the queries can be pruned by evaluating the lower and upper bounds of their envelopes

– This is the first time that the concept of envelopes is employed for XML query processing
