SEQUENCE INDEXING SCHEMES

Roman Čížek Erasmus 2687, Nelly Vouzoukidou MET601

INTRODUCTION

Graph indexes: precise for path queries (only a few methods support twig queries)

Sequence indexing schemes: top-down or bottom-up

XML documents and XML queries are represented as structure-encoded sequences

Both path and twig queries are supported

TOP-DOWN SEQUENCE INDEXES: VIST

VIST – VIRTUAL SUFFIX TREE

A top-down sequence index

Represents XML documents and XML queries as structure-encoded sequences

Querying XML data is equivalent to finding subsequence matches

Avoids expensive join operations

Provides a unified index on both content and structure

Supports dynamic index updates

Uses B+Trees, which are supported by DBMSs

DTD OF PURCHASE RECORDS

<!ELEMENT purchases (purchase*)>
<!ELEMENT purchase (seller, buyer)>
<!ATTLIST seller ID ID location CDATA name CDATA>
<!ELEMENT seller (item*)>
<!ATTLIST buyer ID ID location CDATA name CDATA>
<!ELEMENT item (item*)>
<!ATTLIST item name CDATA manufacturer CDATA>

A SINGLE PURCHASE RECORD

PREORDER SEQUENCE OF XML

Use capital letters to represent names of elements/attributes

Use a hash function h() to encode attribute values into integers

v1 = h(“dell”), v2 = h(“ibm”)

Preorder sequence of the XML purchase record example: PSNv1IMv2Nv3IMv4INv5Lv6BLv7Nv8

Isomorphic trees may produce different preorder sequences

The DTD schema embodies a linear order of all elements/attributes

Without a DTD, use lexicographical order

STRUCTURE-ENCODED SEQUENCE

Definition: A structure-encoded sequence, derived from a prefix traversal of a semi-structured XML document, is a sequence of (symbol, prefix) pairs:

D = (a_1, p_1), (a_2, p_2), …, (a_n, p_n)

where a_i represents a node in the XML document tree (of which a_1, …, a_n is the preorder sequence), and p_i is the path from the root node to node a_i.

STRUCTURE-ENCODED SEQUENCE

D = (P,ϵ),(S,P),(N,PS),(v1,PSN),(I,PS),(M,PSI),(v2,PSIM),(N,PSI),(v3,PSIN),(I,PSI),(M,PSII),(v4,PSIIM),(I,PS),(N,PSI),(v5,PSIN),(L,PS),(v6,PSL),(B,P),(L,PB),(v7,PBL),(N,PB),(v8,PBN)
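The definition can be sketched in a few lines of Python. This is only an illustration: the nested `(symbol, children)` tuple layout, the helper `leaf`, and the use of the placeholder symbols `v1`…`v8` instead of real hash values are assumptions made for the example, not something defined on the slides.

```python
def structure_encode(node, prefix=""):
    """Return the (symbol, prefix) pairs of a tree in preorder.

    `node` is a (symbol, children) tuple (an assumed layout); the empty
    string plays the role of the empty prefix of the root.
    """
    symbol, children = node
    pairs = [(symbol, prefix)]
    for child in children:
        # The prefix of a child is the path from the root down to its parent.
        pairs.extend(structure_encode(child, prefix + symbol))
    return pairs


def leaf(symbol):
    """Hypothetical helper: a node without children."""
    return (symbol, [])


# The purchase record whose sequence D is listed on the slide.
purchase = ("P", [
    ("S", [
        ("N", [leaf("v1")]),
        ("I", [("M", [leaf("v2")]),
               ("N", [leaf("v3")]),
               ("I", [("M", [leaf("v4")])])]),
        ("I", [("N", [leaf("v5")])]),
        ("L", [leaf("v6")]),
    ]),
    ("B", [("L", [leaf("v7")]), ("N", [leaf("v8")])]),
])

D = structure_encode(purchase)
preorder = "".join(symbol for symbol, _ in D)
```

Joining only the symbols of `D` gives back the preorder sequence of the record, and each pair carries the root-to-parent path as its prefix.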

XML QUERIES IN GRAPH FORM

XML QUERIES IN PATH EXPRESSION AND SEQUENCE FORM

Each query is given as a path expression, followed by its structure-encoded sequence

Q1: /Purchase/Seller/Item/Manufacturer
    (P,ϵ)(S,P)(I,PS)(M,PSI)

Q2: /Purchase[Seller[Loc = v5]]/Buyer[Loc = v7]
    (P,ϵ)(S,P)(L,PS)(v5,PSL)(B,P)(L,PB)(v7,PBL)

Q3: /Purchase/*[Loc = v5]
    (P,ϵ)(L,P*)(v5,P*L)

Q4: /Purchase//Item[Manufacturer = v3]
    (P,ϵ)(I,P//)(M,P//I)(v3,P//IM)

QUERYING XML THROUGH STRUCTURE-ENCODED SEQUENCE MATCHING

Querying XML is equivalent to finding (non-contiguous) subsequence matches

Most structural XML queries can be performed through direct subsequence matching

Exception: a branch has multiple identical child nodes

Q5 = /A[B/C]/B/D yields two different sequences:

(A,ϵ)(B,A)(C,AB)(B,A)(D,AB)
(A,ϵ)(B,A)(D,AB)(B,A)(C,AB)

Find the matches for each sequence separately and take the union of the results

We may find false matches if the indexed documents contain branches with identical child nodes; in that case we ask multiple queries and compute the set difference of their results

If the query contains a large number of identical child nodes under a branch, we can disassemble the query tree into multiple trees and use join operations to combine their results
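A minimal sketch of non-contiguous subsequence matching over (symbol, prefix) pairs, assuming exact pairs only (no `*` or `//` wildcards). The function name `is_subsequence` and the second, non-matching query are made up for illustration; the document sequence is the purchase record D from the earlier slide.

```python
def is_subsequence(query, doc):
    """True if `query` occurs in `doc` in order, gaps allowed."""
    remaining = iter(doc)
    # `pair in remaining` advances the iterator until the pair is found,
    # so every later pair has to be found after the previous one.
    return all(pair in remaining for pair in query)


# Structure-encoded sequence D of the purchase record ("" stands for ϵ).
D = [("P", ""), ("S", "P"), ("N", "PS"), ("v1", "PSN"), ("I", "PS"),
     ("M", "PSI"), ("v2", "PSIM"), ("N", "PSI"), ("v3", "PSIN"),
     ("I", "PSI"), ("M", "PSII"), ("v4", "PSIIM"), ("I", "PS"),
     ("N", "PSI"), ("v5", "PSIN"), ("L", "PS"), ("v6", "PSL"),
     ("B", "P"), ("L", "PB"), ("v7", "PBL"), ("N", "PB"), ("v8", "PBN")]

# Q1: /Purchase/Seller/Item/Manufacturer
q1 = [("P", ""), ("S", "P"), ("I", "PS"), ("M", "PSI")]

# Illustrative query /Purchase/Buyer/Item (the buyer has no items in D).
q_no_match = [("P", ""), ("B", "P"), ("I", "PB")]

q1_matches = is_subsequence(q1, D)
no_match = is_subsequence(q_no_match, D)
```

The linear scan only demonstrates the matching condition; the index structures on the following slides exist to avoid scanning every document sequence.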

ALGORITHMS

Naïve algorithm

RIST – Relationships Indexed Suffix Tree

ViST – Virtual Suffix Tree

NAÏVE ALGORITHM: SUFFIX-TREE-LIKE STRUCTURE

Doc1: (P,ϵ)(S,P)(N,PS)(v1,PSN)(L,PS)(v2,PSL)
Doc2: (P,ϵ)(B,P)(L,PB)(v2,PBL)
Q1: (P,ϵ)(B,P)(L,PB)(v2,PBL)
Q2: (P,ϵ)(L,P*)(v2,P*L)

D-ANCESTORSHIP AND S-ANCESTORSHIP

D-Ancestorship: ancestor-descendant relationships in the original XML tree

Element (S,P) is a D-Ancestor of (L,PS)

S-Ancestorship: ancestor-descendant relationships in the suffix tree

Element (v1,PSN) is an S-Ancestor of (L,PS)

NAÏVE SEARCH: A NAÏVE ALGORITHM BASED ON SUFFIX TREES

RIST – INDEX CONSTRUCTION

S-Ancestorship requires additional information

Label each suffix tree node x with a pair <n_x, size_x>

n_x is the prefix traversal order of x in the suffix tree

size_x is the total number of descendants of x in the suffix tree

Given x labeled <n_x, size_x> and y labeled <n_y, size_y>, x is an S-Ancestor of y if n_y ∈ (n_x, n_x + size_x]

Construct the B+Trees:

Insert the suffix tree nodes into the D-Ancestorship B+Tree, using (Symbol, Prefix) as keys

All nodes x inserted with the same (Symbol, Prefix) are indexed by an S-Ancestorship B+Tree, using the n_x values of their labels as keys
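The labeling rule can be tried out on the two documents of the naïve algorithm slide. The sketch below is an assumption-laden toy: the suffix-tree-like structure is a plain nested `dict` (a trie of sequences), a virtual root numbered 0 is added, and no B+Tree is built. The names `build_trie`, `label_trie`, and `is_s_ancestor` are hypothetical.

```python
def build_trie(sequences):
    """Insert each sequence of (symbol, prefix) pairs into a nested-dict trie."""
    root = {}
    for seq in sequences:
        node = root
        for pair in seq:
            node = node.setdefault(pair, {})
    return root


def label_trie(trie):
    """Map each trie path to its <n, size> label.

    n is the preorder number of the node, size its number of descendants.
    """
    labels = {}
    counter = 0

    def visit(node, path):
        nonlocal counter
        n = counter
        counter += 1
        for pair, child in node.items():
            visit(child, path + (pair,))
        labels[path] = (n, counter - n - 1)

    visit(trie, ())
    return labels


def is_s_ancestor(x, y):
    """x is an S-Ancestor of y if n_y lies in (n_x, n_x + size_x]."""
    (n_x, size_x), (n_y, _) = x, y
    return n_x < n_y <= n_x + size_x


doc1 = [("P", ""), ("S", "P"), ("N", "PS"), ("v1", "PSN"), ("L", "PS"), ("v2", "PSL")]
doc2 = [("P", ""), ("B", "P"), ("L", "PB"), ("v2", "PBL")]

labels = label_trie(build_trie([doc1, doc2]))
v1_node = labels[tuple(doc1[:4])]   # trie node reached by ... (v1, PSN)
l_node = labels[tuple(doc1[:5])]    # trie node reached by ... (L, PS)
b_node = labels[tuple(doc2[:2])]    # trie node reached by (P, ϵ)(B, P)
```

With these labels the S-Ancestor test is a pair of integer comparisons, which is what makes it suitable as a range lookup on the n_x keys.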

THE RIST INDEX STRUCTURE

SEARCH: NON-CONTIGUOUS SUBSEQUENCE MATCHING USING B+TREE

VIST – VIRTUAL SUFFIX TREE

Dynamic virtual suffix tree labeling

Semantic and statistical clues

Dynamic scope allocation without clues

DYNAMIC SCOPE ALLOCATION

The number of child nodes of x is λ

We allocate 1/λ of the remaining scope to x’s first child

Dynamic scope allocation with λ = 2
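A rough sketch of the allocation rule with λ = 2. The slide only states the rule for the first child; applying the same 1/λ rule to each later child, using integer division, and reserving the first number of a scope for the node itself are all assumptions of this example, and `allocate_child_scopes` is a hypothetical name.

```python
def allocate_child_scopes(start, size, num_children, lam=2):
    """Give each child 1/lam of the scope that is still unallocated.

    A scope is modeled as (start, size): the node itself takes `start`
    (an assumption), and its descendants share the remaining numbers.
    """
    scopes = []
    next_start = start + 1
    remaining = size - 1
    for _ in range(num_children):
        child_size = remaining // lam
        if child_size < 1:
            # The scope is used up; a real index would have to handle this.
            raise OverflowError("scope exhausted")
        scopes.append((next_start, child_size))
        next_start += child_size
        remaining -= child_size
    return scopes


child_scopes = allocate_child_scopes(start=0, size=1025, num_children=3)
```

Each child receives half of what is left, so the scopes shrink geometrically (512, 256, 128 in this run) and room always remains for children that are inserted later.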

DYNAMIC SCOPE OF A SUFFIX TREE NODE

SUBSCOPE(PARENT, E): CREATE A SUBSCOPE WITHIN THE PARENT SCOPE FOR E

INSERTION INDEX

Doc1 = (P,ϵ)(S,P)(N,PS)(v1,PSN)(L,PS)(v2,PSL)
Doc2 = (P,ϵ)(S,P)(L,PS)(v2,PSL)

INDEX AN XML DOCUMENT

EXPERIMENTS - SAMPLE QUERIES

Query | Path Expression | Dataset
Q1 | /inproceedings/title | DBLP
Q2 | /book/author[text=‘David’] | DBLP
Q3 | /*/author[text=‘David’] | DBLP
Q4 | //author[text=‘David’] | DBLP
Q5 | /book[key=‘books/bc/MaierW88’]/author | DBLP
Q6 | /site//item[location=‘US’]/mail/date[text=‘12/15/1999’] | XMARK
Q7 | /site//person/*/city[text=‘Pocatello’] | XMARK
Q8 | //closed_auction[*[person=‘person1’]]/date[text=‘12/15/1999’] | XMARK

COMPARING INDEXING METHODS

[Figure: query times of the compared indexing methods, in seconds]

INDEX STRUCTURE

DBLP (301 MB of data)

XMARK (52 MB of data)

CONCLUSION

XML documents and queries are represented as structure-encoded sequences

Queries are answered by sequence matching

Expensive join operations are avoided

Top-down scope allocation method

Index structure: B+Tree

PRIX: PRUFER SEQUENCES FOR INDEXING XML

Rao & Moon (2006) proposed a new method for indexing XML documents using sequences

It uses the same idea as the ViST index:

The XML tree is transformed into a sequence and saved in the database

Each query is also transformed into a sequence

The answer to the query is obtained by performing subsequence matching

MOTIVATION: TWIG QUERIES AND WILDCARDS

Like ViST, PRIX tries to efficiently answer twig queries as well as queries containing wildcards (‘*’ for any element and ‘//’ for descendant-or-self)

Twig query, XPath: P/Q[T]/S (P has child Q; Q has children T and S)

Query with wildcards, XPath: P//Q/S (P has descendant Q; Q has child S)

MOTIVATION: PROBLEMS IN VIST INDEX

Memory requirements: in the worst case, ViST requires O(N^2) space to index the document

Example: a chain document in which A contains B, B contains C, C contains D, and D contains E

<A> <B> <C> <D> <E> </E> </D> </C> </B> </A>

D = (A, ε), (B, A), (C, AB), (D, ABC), (E, ABCD)

Elements at height k appear k times
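The quadratic growth is easy to check on the chain example. The sketch below builds the structure-encoded sequence of a chain and sums the prefix lengths; `chain_sequence` and `total_prefix_length` are hypothetical helper names, and counting one unit per prefix symbol is a simplifying assumption about how space is measured.

```python
def chain_sequence(labels):
    """Structure-encoded sequence of a chain: each node is the only child
    of the previous one, so its prefix is all the labels before it."""
    return [(label, "".join(labels[:i])) for i, label in enumerate(labels)]


def total_prefix_length(seq):
    """Number of symbols stored in all prefixes of the sequence."""
    return sum(len(prefix) for _, prefix in seq)


seq = chain_sequence("ABCDE")
stored = total_prefix_length(seq)          # 0 + 1 + 2 + 3 + 4
n = len(seq)
closed_form = n * (n - 1) // 2             # N(N-1)/2, i.e. O(N^2)
```

For the five-node chain the prefixes already hold 10 symbols for 5 nodes, and the total follows N(N-1)/2 as the chain gets deeper.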

False positives: in many cases, query processing in ViST results in false alarms

Query, XPath: P/Q[T]/S

Q = (P, ϵ) (Q, P) (T, PQ) (S, PQ)

Doc1: P has children Q and R; Q has children T and S; R has children U and T

Doc1 = (P, ϵ) (Q, P) (T, PQ) (S, PQ) (R, P) (U, PR) (T, PR)

Doc2: P has two children labeled Q; the first Q has child T, the second Q has child S

Doc2 = (P, ϵ) (Q, P) (T, PQ) (Q, P) (S, PQ)

Q is a subsequence of both Doc1 and Doc2, but in Doc2 the T and S nodes belong to different Q nodes, so Doc2 is a false alarm
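The false alarm can be reproduced with a small experiment. The code below enumerates every way the query occurs as a subsequence and then applies a parent check; that check is just one possible way to expose the problem, not a step that the slides attribute to ViST. All function names are hypothetical, the brute-force enumeration is only meant for toy inputs, and wildcards are not handled.

```python
from itertools import combinations


def parent_positions(seq):
    """Position of each node's parent within a structure-encoded sequence.

    In preorder, the parent is the nearest earlier pair whose own path
    (prefix + symbol) equals this node's prefix; the root has no parent.
    """
    parents = []
    for i, (_, prefix) in enumerate(seq):
        parent = None
        for j in range(i - 1, -1, -1):
            symbol_j, prefix_j = seq[j]
            if prefix_j + symbol_j == prefix:
                parent = j
                break
        parents.append(parent)
    return parents


def embeddings(doc, query):
    """All index tuples at which `query` occurs as a subsequence of `doc`."""
    return [idx for idx in combinations(range(len(doc)), len(query))
            if all(doc[i] == q for i, q in zip(idx, query))]


def is_structural_match(doc, query, idx):
    """True if every query edge maps onto a real parent-child edge of doc."""
    doc_parent = parent_positions(doc)
    query_parent = parent_positions(query)
    return all(qp is None or doc_parent[idx[k]] == idx[qp]
               for k, qp in enumerate(query_parent))


query = [("P", ""), ("Q", "P"), ("T", "PQ"), ("S", "PQ")]
doc1 = [("P", ""), ("Q", "P"), ("T", "PQ"), ("S", "PQ"),
        ("R", "P"), ("U", "PR"), ("T", "PR")]
doc2 = [("P", ""), ("Q", "P"), ("T", "PQ"), ("Q", "P"), ("S", "PQ")]

hits1 = embeddings(doc1, query)
hits2 = embeddings(doc2, query)
real1 = [idx for idx in hits1 if is_structural_match(doc1, query, idx)]
real2 = [idx for idx in hits2 if is_structural_match(doc2, query, idx)]
```

Both documents pass the subsequence filter, but only the Doc1 embedding survives the parent check: in Doc2 the matched S hangs under the second Q, while the matched T hangs under the first.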

False negatives: correctly answering a twig query depends on the order in which the branches are created

Doc: P has children F and N; F has child T; N has child G

Doc = (P, ϵ) (F, P) (T, PF) (N, P) (G, PN)

Query, XPath: P[N]/F

Q = (P, ϵ) (N, P) (F, P)

Q is not a subsequence of Doc, because N precedes F in Q but follows it in Doc, so the match is missed

PRIX ARCHITECTURE

INDEXING AND QUERYING IN PRIX

Indexing:

The first step is to take an XML document as input and convert it into a sequence

This is achieved using Prufer sequences

The sequence is saved in the database in a way equivalent to the one used in ViST

It is a virtual trie implemented with B+Trees

Querying:

Queries are also transformed into trees and then into Prufer sequences

The query sequence is looked up in the document sequence and all matching subsequences are retrieved

After this initial filtering, three refinement phases follow

INDEXING XML DOCUMENTS

The first step is to transform the XML document into the equivalent XML tree

Notice that both elements and text values are represented as nodes (the same holds for attributes)

The tree is not saved in the database

<A> <B></B> <B> <C> D </C> <C> <F/> <E/> </C> </B> </A>

Tree: A has two children labeled B; the first B is a leaf; the second B has two children labeled C; the first C has child D; the second C has children F and E

Then the Prufer sequence is created from the XML tree

A Prufer sequence is a method proposed by Prufer (1918) that constructs a one-to-one correspondence between a labeled tree and a sequence

Numbered tree: A(8) has children B(1) and B(7); B(7) has children C(3) and C(6); C(3) has child D(2); C(6) has children F(4) and E(5)

Prufer sequence: 8, 3, 7, 6, 6, 7, 8

Prufer sequences can only be created from trees with numerical labels, with each node having a unique number

Since the XML tree contains string labels (the names of elements etc.), we add an additional numeric label to each node

We use post-order traversal to number the nodes

The Prufer sequence can be extracted for any labeling of the tree, but post-order numbering has properties that make the querying process easier

[Figure: initial labeling, showing the XML tree before and after post-order numbering]

Finding the Prufer sequence

The algorithm to find the Prufer sequence is the following:

Find the leaf with the smallest number and delete it

Append the number of its parent to the sequence

Repeat until only one node is left

In the PRIX index, two sequences are held:

The actual Prufer sequence, holding the numbers of the nodes, called the Numbered Prufer Sequence (NPS)

The corresponding sequence, holding the actual labels of the nodes of the XML tree, called the Labeled Prufer Sequence (LPS)
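The construction can be sketched directly from these steps. The code below numbers a tree in post-order and then repeatedly deletes the smallest leaf; the `(label, children)` tuple layout and the function names are assumptions of this example, and the quadratic leaf search is chosen for readability rather than speed.

```python
def postorder_number(tree):
    """Number the nodes of a (label, children) tree in post-order.

    Returns two dicts keyed by node number: labels and parents.
    """
    labels, parent = {}, {}
    counter = 0

    def visit(node):
        nonlocal counter
        label, children = node
        child_numbers = [visit(child) for child in children]
        counter += 1
        labels[counter] = label
        for child_number in child_numbers:
            parent[child_number] = counter
        return counter

    visit(tree)
    return labels, parent


def prufer_sequences(tree):
    """Return (NPS, LPS): delete the smallest leaf, record its parent."""
    labels, parent = postorder_number(tree)
    child_count = {number: 0 for number in labels}
    for par in parent.values():
        child_count[par] += 1
    alive = set(labels)
    nps, lps = [], []
    while len(alive) > 1:
        leaf = min(n for n in alive if child_count[n] == 0)
        par = parent[leaf]
        nps.append(par)
        lps.append(labels[par])
        alive.remove(leaf)
        child_count[par] -= 1
    return nps, lps


# Document tree used on the slides.
doc_tree = ("A", [("B", []),
                  ("B", [("C", [("D", [])]),
                         ("C", [("F", []), ("E", [])])])])
doc_nps, doc_lps = prufer_sequences(doc_tree)

# Query tree for the XPath query A[B/C]/D/E/F.
query_tree = ("A", [("B", [("C", [])]),
                    ("D", [("E", [("F", [])])])])
query_nps, query_lps = prufer_sequences(query_tree)
```

Running it on the document tree and on the query tree for A[B/C]/D/E/F reproduces the NPS and LPS values shown on the slides.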

Example on the numbered tree, deleting the leaves 1, 2, 3, … in turn:

After deleting node 1: NPS: 8; LPS: A
After deleting node 2: NPS: 8, 3; LPS: A, C
After deleting node 7: NPS: 8, 3, 7, 6, 6, 7, 8; LPS: A, C, B, C, C, B, A

Properties

Both NPS and LPS have length N-1 (where N is the total number of nodes), because we delete one node at a time until only one node is left

The i-th node deleted is always the node numbered i; this lets us find the edges of the tree (that is, the mapping from the NPS to the tree)

LPS does not contain any leaves

NPS: 8, 3, 7, 6, 6, 7, 8
LPS: A, C, B, C, C, B, A
Deleted node: 1, 2, 3, 4, 5, 6, 7

Indexes held in the database are:

The LPS (Labeled Prufer Sequence)

The NPS (Numbered Prufer Sequence)

The mapping between the number and the XML label of the leaves of the tree

NPS: 8, 3, 7, 6, 6, 7, 8
LPS: A, C, B, C, C, B, A
Leaves mapping: 1 → B, 2 → D, 4 → F, 5 → E
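Because the i-th deleted node is node i, the NPS alone is enough to recover the edges and to tell which node numbers are leaves. A small sketch, assuming post-order numbering as on the slides; `edges_from_nps` and `leaf_numbers` are hypothetical names.

```python
def edges_from_nps(nps):
    """(child, parent) edges: the i-th entry of the NPS is the parent of node i."""
    return [(child, parent) for child, parent in enumerate(nps, start=1)]


def leaf_numbers(nps):
    """Leaves are the node numbers that never appear as a parent in the NPS."""
    total_nodes = len(nps) + 1
    return sorted(set(range(1, total_nodes + 1)) - set(nps))


nps = [8, 3, 7, 6, 6, 7, 8]
edges = edges_from_nps(nps)
leaves = leaf_numbers(nps)
```

The leaf numbers obtained this way (1, 2, 4, 5) are exactly the nodes whose labels are missing from the LPS, which is why the separate leaves mapping has to be stored.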

QUERYING

When a query arrives, it is also transformed into a Prufer sequence

Then an initial filtering is performed

The results of the initial filtering go through three more refinement phases to obtain the correct answer to the query

QUERYING: TRANSFORMING A QUERY TO A PRUFER SEQUENCE

The same process as for documents is followed

For instance, for the XPath query A[B/C]/D/E/F the query tree is: A has children B and D; B has child C; D has child E; E has child F

The NPS and LPS are:

NPS(Q) = 2, 6, 4, 5, 6
LPS(Q) = B, A, E, D, A

QUERYING: FILTERING BY SEQUENCE MATCHING

Suppose we have the following XML tree (T) of the document:

NPS(T) = 15, 3, 7, 6, 6, 7, 15, 9, 15, 13, 13, 13, 14, 15
LPS(T) = A, C, B, C, C, B, A, C, A, E, E, E, D, A

To find the correct results for the given query, we find the occurrences of LPS(Q) as a subsequence of LPS(T)

“A subsequence is any string that can be obtained by deleting zero or more symbols from a given string”

Each matching subsequence represents a possible solution in the tree

LPS(T) = A, C, B, C, C, B, A, C, A, E, E, E, D, A
LPS(Q) = B, A, E, D, A

12 subsequences are found in total, while only 4 are correct

[Figure: document tree T and query tree Q, with the candidate matches highlighted one at a time]
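The count of 12 candidate subsequences can be checked with the standard dynamic program for counting subsequence occurrences. This only reproduces the filtering step; deciding which of the candidates are correct needs the refinement phases and the tree itself, which this sketch does not model. The function name is hypothetical.

```python
def count_subsequence_matches(text, pattern):
    """Number of distinct position tuples at which `pattern` occurs
    as a (non-contiguous) subsequence of `text`."""
    # ways[j] = number of ways to match the first j pattern symbols so far.
    ways = [1] + [0] * len(pattern)
    for symbol in text:
        # Go right to left so each text symbol is used at most once per match.
        for j in range(len(pattern), 0, -1):
            if pattern[j - 1] == symbol:
                ways[j] += ways[j - 1]
    return ways[len(pattern)]


lps_t = ["A", "C", "B", "C", "C", "B", "A", "C", "A", "E", "E", "E", "D", "A"]
lps_q = ["B", "A", "E", "D", "A"]

candidates = count_subsequence_matches(lps_t, lps_q)
```

The result agrees with a count by hand: 2 choices for B, 2 usable choices for the following A, 3 for E, and 1 each for D and the final A.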

To find the path in the tree that is represented by a sequence found while filtering, we use NPS(T)

Recall that the edges can be retrieved using the positions in NPS(T)

Position = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
NPS(T) = 15, 3, 7, 6, 6, 7, 15, 9, 15, 13, 13, 13, 14, 15
LPS(T) = A, C, B, C, C, B, A, C, A, E, E, E, D, A

QUERYING: REFINEMENT STEPS

Despite the filtering, some false positives remain in the results

To eliminate these false positives there are 3 refinement steps, namely:

Refinement by connectedness

Refinement by structure

Refinement by matching leaf nodes

QUERYING: FALSE NEGATIVES

A false negative can appear in the same case as in the ViST index

The subsequence filtering relies on the assumption that the query branches come in the “correct” order

Document: P has children F and N; F has child T; N has child G

Query: P has children N and F

The solution proposed by Rao and Moon is to test the query in all possible permutations of the branches and then return the union as the answer to the query

N branches → N! permutations

Their main argument is that queries usually have a small number of branches
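The permutation idea can be sketched as follows. For brevity the example uses the structure-encoded sequences from the ViST false-negative slide rather than Prufer sequences, so it illustrates the principle and is not PRIX itself; the helper names and the `(symbol, children)` tuple layout are assumptions.

```python
from itertools import permutations


def encode(node, prefix=""):
    """Structure-encoded sequence of a (symbol, children) tree, in preorder."""
    symbol, children = node
    pairs = [(symbol, prefix)]
    for child in children:
        pairs += encode(child, prefix + symbol)
    return pairs


def is_subsequence(query, doc):
    remaining = iter(doc)
    return all(pair in remaining for pair in query)


def match_any_order(root_symbol, branches, doc):
    """Try every ordering of the branches and report whether any one matches."""
    return any(is_subsequence(encode((root_symbol, list(order))), doc)
               for order in permutations(branches))


# Document: P has children F and N; F has child T; N has child G.
doc = [("P", ""), ("F", "P"), ("T", "PF"), ("N", "P"), ("G", "PN")]

# Query P[N]/F with its branches in the order they were written.
branches = [("N", []), ("F", [])]
as_written = is_subsequence(encode(("P", branches)), doc)
any_order = match_any_order("P", branches, doc)

# A query with three branches has 3! = 6 orderings to test.
orderings_for_three = len(list(permutations(["N", "F", "D"])))
```

The query misses the document in its written order but matches once the branches are swapped, at the cost of testing N! orderings.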

[Figure: three of the six branch permutations of a query tree with nodes P, N, F, D, and S; the other three permutations are omitted]

EXPERIMENTS

VIST VS PRIX: EXPERIMENTS

1.8 GHz Pentium IV processor

512 MB RAM, running Solaris 8

40 GB EIDE disk drive (stores data and indexes)

Compiled with GNU g++ version 2.95.3

Buffer pool size: 2000 pages of 8 KB each

[Figures: experimental results for the DBLP, SWISSPROT, and TREEBANK datasets]

VIST VS PRIX

O(N^2)

QUESTIONS?

THANK YOU!!