Indexing Dataspaces
Presenter: Aviv Alon
Seminar in Databases (236826)
Dataspaces are collections of heterogeneous and partially unstructured data.
Dataspaces
Dataspaces: why do we need them?
Looking for an architect with good reviews and cheap materials?
Return “Architect B” as the matching instance.
Consider queries that are keyword-based but also structure-aware.
How to effectively query and search a dataspace
Main Problem
An inverted list where each row represents a keyword and each column represents a data item from the data sources.
Indexing Heterogeneous Data
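A minimal sketch of a plain inverted list, to make the row/column picture concrete. The instance texts below are hypothetical toy data loosely following the example used in the slides; `build_inverted_list` is an illustrative name, not part of the presented system.

```python
from collections import defaultdict

# Hypothetical toy data: instance id -> all text found in that instance.
documents = {
    "a1": "Birch clustering",
    "c1": "Sigmod 1996",
    "p1": "Tian",
    "p3": "Jie Tian Jeff",
}

def build_inverted_list(docs):
    """Map each lowercased keyword (a row) to the instances (columns) containing it."""
    index = defaultdict(set)
    for instance, text in docs.items():
        for keyword in text.lower().split():
            index[keyword].add(instance)
    return index

index = build_inverted_list(documents)
print(sorted(index["tian"]))  # ['p1', 'p3']
```

Note that the row for "tian" says nothing about which attribute the keyword came from; that is the gap the extended inverted lists address.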
We model the data as a set of triples. Each triple is either of the form
(instance, attribute, value), for example: (“Architect B”, name, ‘Shalom’),
or of the form (instance, association, instance), for example: (“Architect B”, worksWith, “Architect A”).
Indexing Heterogeneous Data
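The triple model can be sketched as a small data structure. The `Triple` type, its field names, and the two helper functions are assumptions made for illustration; only the two example triples come from the slide.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str                  # instance id
    label: str                    # attribute or association name
    obj: str                      # attribute value, or target instance id
    is_association: bool = False  # False: (instance, attribute, value)

triples = [
    Triple("Architect B", "name", "Shalom"),
    Triple("Architect B", "worksWith", "Architect A", is_association=True),
]

def attributes_of(instance, triple_base):
    """Return the attribute/value pairs recorded for one instance."""
    return {t.label: t.obj for t in triple_base
            if t.subject == instance and not t.is_association}

def neighbors_of(instance, triple_base):
    """Return the instances that this instance points to through associations."""
    return [t.obj for t in triple_base
            if t.subject == instance and t.is_association]
```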
We also model:
Indexing Heterogeneous Data
Person instances: p1, p2, p3
Article instance: a1
Conference instance: c1
Example:
Attributes firstName, lastName and nickName are sub-attributes of name.
Association contactAuthor is a sub-association of author.
Set of predicates of the form (v, {K1, ..., Kn})
◦ v - an attribute or association label
◦ {K1, ..., Kn} - a keyword set
Predicate queries
Example 1: (title, ‘Birch’)
attribute predicate
association predicate
Example 2: (publishedIn, ‘1996 Sigmod’)
Set of keywords K1, ... , Kn
◦ relevant instance
◦ associated instances
Neighborhood keyword queries
Example: ‘Birch’
[Figure: the relevant instance and its associated instances]
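A sketch of what a neighborhood keyword query returns, evaluated directly on the triples without any index. It assumes that an instance is relevant when one of its attribute values contains at least one of the keywords, and that associations count in both directions; both are assumptions, and the toy triples are hypothetical.

```python
# Hypothetical toy triple base.
attribute_triples = [
    ("a1", "title", "Birch"),
    ("p1", "name", "Tian"),
    ("p2", "name", "Raghu"),
    ("c1", "name", "Sigmod"),
]
association_triples = [
    ("a1", "author", "p1"),
    ("a1", "author", "p2"),
    ("a1", "publishedIn", "c1"),
]

def neighborhood_query(keywords):
    """Return (relevant instances, associated instances) for a keyword set."""
    wanted = {k.lower() for k in keywords}
    relevant = {inst for inst, _, value in attribute_triples
                if wanted & set(value.lower().split())}
    associated = set()
    for src, _, dst in association_triples:
        if src in relevant:
            associated.add(dst)
        if dst in relevant:
            associated.add(src)
    return relevant, associated - relevant
```

For the keyword ‘Birch’ this returns a1 as the relevant instance and p1, p2, c1 as associated instances.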
Build a separate index for each attribute to support structured queries on structured data.
◦ Con: significant overhead to the index structure
Create an inverted list to support keyword search on unstructured data.
◦ Con: does not allow specifications on structure
Existing methods
Capture both text values and structural information using an extended inverted list.
The index augments the text terms in the inverted list with labels denoting the structural aspects of the data such as attribute tags and associations between data items.
Proposed solution
Inverted Lists - Example
We cannot tell that “Tian” occurs as p1’s name and p3’s lastName.
Indexing Attributes
◦ Attribute inverted lists (ATIL)
Indexing Associations
◦ Attribute-association inverted lists (AAIL)
Indexing hierarchies
◦ Attribute inverted lists with duplication (Dup-ATIL)
◦ Attribute inverted lists with hierarchies (Hier-ATIL)
◦ Hybrid attribute inverted list (Hybrid-ATIL)
Indexing structure outline
Attribute inverted lists (ATIL): whenever the keyword k appears in a value of attribute a, there is a row in the inverted list for k//a//.
Indexing Attributes
keyword = 1996
attribute = year
                a1  c1  p1  p2  p3
1996//year//     0   1   0   0   0
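A minimal sketch of building an ATIL from attribute triples. The triples are hypothetical toy data, `build_atil` is an illustrative name, and lowercasing the keyword while keeping the attribute name as written is an assumption.

```python
from collections import defaultdict

# Hypothetical (instance, attribute, value) triples.
attribute_triples = [
    ("c1", "year", "1996"),
    ("c1", "name", "Sigmod"),
    ("p1", "name", "Tian"),
    ("p3", "lastName", "Tian"),
]

def build_atil(triples):
    """Create one row 'k//a//' per keyword k occurring in a value of attribute a."""
    index = defaultdict(set)
    for instance, attribute, value in triples:
        for keyword in value.lower().split():
            index[f"{keyword}//{attribute}//"].add(instance)
    return index

atil = build_atil(attribute_triples)
columns = ["a1", "c1", "p1", "p2", "p3"]
row = [int(c in atil["1996//year//"]) for c in columns]
print(row)  # [0, 1, 0, 0, 0]
```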
To answer an attribute predicate query (A, {K1, ..., Kn}), we need to search for {K1//A//, ..., Kn//A//}.
Example: (lastName, ‘Tian’)
“tian//lastName//”
Attribute inverted lists (ATIL)
The search will yield p3
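Answering an attribute predicate against an ATIL can be sketched as one lookup per keyword. The index contents are hypothetical, and the sketch assumes an instance matches when any of the keywords matches; replace the union with an intersection if all keywords must match.

```python
# Hypothetical ATIL rows: 'keyword//attribute//' -> instances.
atil = {
    "tian//name//": {"p1"},
    "tian//lastName//": {"p3"},
    "jie//firstName//": {"p3"},
}

def answer_attribute_predicate(index, attribute, keywords):
    """Look up K//A// for every keyword K and merge the matching instances."""
    matches = set()
    for keyword in keywords:
        matches |= index.get(f"{keyword.lower()}//{attribute}//", set())
    return matches
```

With this index, (lastName, ‘Tian’) yields p3, while (name, ‘Tian’) yields only p1 because the sub-attribute lastName is not consulted.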
Attribute-association inverted lists (AAIL):
Indexing Associations
keyword = Birch
association = authoredPaper with p1, p2
                          a1  c1  p1  p2  p3
Birch//authoredPaper//     0   0   1   1   0
To answer an association predicate query (R, {K1, ..., Kn}), we need to search for {K1//R//, ..., Kn//R//}.
Example: (author, ‘Raghu’)
“raghu//author//”
Attribute-association Inverted lists (AAIL)
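A sketch of how the association rows might be built. It follows one reading of the Birch example: a row k//R// lists the instances that have an R association to an instance whose attribute values contain k. That reading, the toy triples, and the function name are assumptions.

```python
from collections import defaultdict

# Hypothetical toy triples.
attribute_triples = [
    ("a1", "title", "Birch"),
    ("p1", "name", "Tian"),
    ("p2", "name", "Raghu"),
]
association_triples = [
    ("p1", "authoredPaper", "a1"),
    ("p2", "authoredPaper", "a1"),
    ("a1", "author", "p1"),
    ("a1", "author", "p2"),
]

def build_association_rows(attrs, assocs):
    """Row 'k//R//' holds every source of an R association whose target contains k."""
    keywords_of = defaultdict(set)
    for instance, _, value in attrs:
        keywords_of[instance].update(value.lower().split())
    index = defaultdict(set)
    for source, association, target in assocs:
        for keyword in keywords_of[target]:
            index[f"{keyword}//{association}//"].add(source)
    return index

rows = build_association_rows(attribute_triples, association_triples)
print(sorted(rows["birch//authoredPaper//"]))  # ['p1', 'p2']
```

Under the same reading, the row raghu//author// contains a1, the article that has an author named Raghu.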
For the query (name, ‘Tian’), we wish to return instances p1 and p3, rather than only p1.
Indexing hierarchies
To answer the query (name, ‘Tian’), we can search for:
“tian//name// OR tian//firstName// OR tian//lastName// OR tian//nickName//”
A Naïve method
Can be very expensive!
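The naive method can be sketched as expanding the attribute into itself plus all of its descendants and issuing one lookup per expanded attribute. The hierarchy map, index contents, and function names are hypothetical; the lookup counter is only there to show where the cost comes from.

```python
# Hypothetical hierarchy: attribute -> direct sub-attributes.
sub_attributes = {"name": ["firstName", "lastName", "nickName"]}

# Hypothetical ATIL rows.
atil = {
    "tian//name//": {"p1"},
    "tian//lastName//": {"p3"},
}

def expand(attribute, hierarchy):
    """Return the attribute followed by all of its descendants."""
    result = [attribute]
    for child in hierarchy.get(attribute, []):
        result.extend(expand(child, hierarchy))
    return result

def naive_query(index, attribute, keyword, hierarchy):
    """OR together one lookup per attribute in the expanded list."""
    matches, lookups = set(), 0
    for attr in expand(attribute, hierarchy):
        lookups += 1
        matches |= index.get(f"{keyword.lower()}//{attr}//", set())
    return matches, lookups
```

Here (name, ‘Tian’) needs four lookups to find p1 and p3; the number of lookups grows with the size of the sub-attribute tree.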
Attribute inverted lists with duplication (Dup-ATIL):
Indexing Attributes
attribute = name, sub-attribute = nickName
                    a1  c1  p1  p2  p3
Jeff//name//         0   0   0   0   1
Jeff//nickName//     0   0   0   0   1
Attribute inverted lists with duplication (Dup-ATIL)
Index with Duplication
To answer an attribute predicate query (A, {K1, ..., Kn}), we need to search for {K1//A//, ..., Kn//A//}.
Example: (name, ‘Tian’)
“tian//name//”
Attribute inverted lists with duplication (Dup-ATIL)
The search will yield both p3 and p1
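A minimal sketch of the duplication idea: each keyword is indexed under its own attribute and again under every ancestor attribute. The `parent` map, the toy triples, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical hierarchy: sub-attribute -> parent attribute.
parent = {"firstName": "name", "lastName": "name", "nickName": "name"}

# Hypothetical toy triples.
attribute_triples = [
    ("p1", "name", "Tian"),
    ("p3", "firstName", "Jie"),
    ("p3", "lastName", "Tian"),
    ("p3", "nickName", "Jeff"),
]

def ancestors_and_self(attribute):
    """Return the attribute followed by its chain of ancestors."""
    chain = [attribute]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain

def build_dup_atil(triples):
    """Duplicate every keyword occurrence into the rows of all ancestor attributes."""
    index = defaultdict(set)
    for instance, attribute, value in triples:
        for keyword in value.lower().split():
            for attr in ancestors_and_self(attribute):
                index[f"{keyword}//{attr}//"].add(instance)
    return index

dup_atil = build_dup_atil(attribute_triples)
print(sorted(dup_atil["tian//name//"]))  # ['p1', 'p3']
```

The four triples produce six rows here instead of the four a plain ATIL would hold, which illustrates the size cost of duplication.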
Pro: simple query answering
Con: may considerably expand the size of the index because of the duplication, especially when:
◦ there are long paths from the root attribute to the leaf attributes
◦ most values in the triple base belong to leaf attributes
Dup-ATIL (cont.)
Attribute inverted lists with hierarchies (Hier-ATIL):
Index with Hierarchy Path
attribute = name, sub-attribute = nickName
                          a1  c1  p1  p2  p3
Jeff//name//nickName//     0   0   0   0   1
To answer an attribute predicate query (A, {K1, ..., Kn}), we need to search for {K1//a0//...//am//*, ..., Kn//a0//...//am//*}
Example:(name, ‘Tian’)
“tian//name//*”
Attribute inverted lists with hierarchies (Hier-ATIL)
The search will yield both p3 and p1
a0 // ... //am : the hierarchy path for attribute A
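A sketch of Hier-ATIL with a prefix search over sorted row keys. The `parent` map, the toy triples, and the function names are hypothetical; the binary search via `bisect` is one possible way to locate the first row with the prefix, not necessarily what a real indexing system does.

```python
import bisect

# Hypothetical hierarchy: sub-attribute -> parent attribute.
parent = {"firstName": "name", "lastName": "name", "nickName": "name"}

attribute_triples = [
    ("p1", "name", "Tian"),
    ("p3", "firstName", "Jie"),
    ("p3", "lastName", "Tian"),
    ("p3", "nickName", "Jeff"),
]

def hierarchy_path(attribute):
    """Return the path a0//...//am from the root attribute down to this one."""
    path = [attribute]
    while path[0] in parent:
        path.insert(0, parent[path[0]])
    return "//".join(path)

def build_hier_atil(triples):
    """Index each keyword once, under the full hierarchy path of its attribute."""
    index = {}
    for instance, attribute, value in triples:
        for keyword in value.lower().split():
            key = f"{keyword}//{hierarchy_path(attribute)}//"
            index.setdefault(key, set()).add(instance)
    return index

def prefix_search(index, prefix):
    """Union all rows whose key starts with the prefix; also count the rows read."""
    keys = sorted(index)
    start = bisect.bisect_left(keys, prefix)
    matches, rows_read = set(), 0
    for key in keys[start:]:
        if not key.startswith(prefix):
            break
        matches |= index[key]
        rows_read += 1
    return matches, rows_read

hier_atil = build_hier_atil(attribute_triples)
```

Searching for the prefix tian//name// reads two rows (tian//name// and tian//name//lastName//) and returns p1 and p3.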
Pro: does not increase the number of indexed keywords (although it can lengthen many of them)
◦ real indexing systems typically record a keyword only by the difference from its previous keyword
Con: a predicate query is answered by transforming it into a prefix search, which can be more expensive than a keyword search.
Hier-ATIL (cont.)
Dup-ATIL is more suitable for the cases where a keyword occurs in many attributes with common ancestors
Hier-ATIL is more suitable for the cases where a keyword occurs in only a few attributes with common ancestors
Hybrid indexing combines the strengths of both methods
Hybrid Index – Why?
Hybrid attribute inverted list (Hybrid-ATIL): an inverted list that can answer any prefix search by reading no more than t rows.
Hybrid Index
                           a1  c1  p1  p2  p3
Jeff//name//nickName//      0   0   0   0   1
Jie//name//firstName//      0   0   0   0   1
Tian//name////              0   0   1   0   1   (summary row)
Tian//name//lastName//      0   0   0   0   1
Tian//name//lastName// is shadowed by the Tian//name// summary row.
To answer a prefix query of the form k//a0//...//am//*, we look at all the rows with prefix k//a0//...//am//, except those shadowed by summary rows.
Example: (name, ‘Tian’), t=1
“tian//name//*”
Hybrid Index
The prefix search is answered after reading 1 row and yields both p1 and p3.
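A simplified sketch of the hybrid idea, not the construction from the talk: start from Hier-ATIL rows and add a summary row for every keyword/attribute-path prefix whose prefix search would otherwise read more than t rows. The summary key (the prefix followed by an extra //), the function names, and the choice to keep shadowed rows in the index and bypass them at query time are all assumptions.

```python
# Hypothetical Hier-ATIL rows.
hier_atil = {
    "jeff//name//nickName//": {"p3"},
    "jie//name//firstName//": {"p3"},
    "tian//name//": {"p1"},
    "tian//name//lastName//": {"p3"},
}

def attribute_prefixes(key):
    """'k//a0//a1//' -> ['k//a0//', 'k//a0//a1//']."""
    parts = key.rstrip("/").split("//")
    return ["//".join(parts[:m]) + "//" for m in range(2, len(parts) + 1)]

def build_hybrid(hier_index, t):
    """Add a summary row wherever a prefix search would read more than t rows."""
    hybrid = {key: set(rows) for key, rows in hier_index.items()}
    prefixes = {p for key in hier_index for p in attribute_prefixes(key)}
    for prefix in prefixes:
        covered = [key for key in hier_index if key.startswith(prefix)]
        if len(covered) > t:
            hybrid[prefix + "//"] = set().union(*(hier_index[k] for k in covered))
    return hybrid

def hybrid_prefix_search(hybrid, prefix):
    """Read the summary row if one exists, otherwise scan the (at most t) plain rows."""
    summary = hybrid.get(prefix + "//")
    if summary is not None:
        return set(summary), 1
    matches, rows_read = set(), 0
    for key in sorted(hybrid):
        if key.startswith(prefix) and not key.endswith("////"):
            matches |= hybrid[key]
            rows_read += 1
    return matches, rows_read

hybrid = build_hybrid(hier_atil, t=1)
```

With t=1, only the prefix tian//name// covers more than one row, so a single summary row tian//name//// is added and the query (name, ‘Tian’) is answered by reading that one row.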
We build the Keyword Inverted List (KIL), which is essentially a Hybrid-AAIL
Example: “Birch”, t=1
“birch//*”
Neighborhood Keyword Queries
The prefix search is answered after reading 1 row and yields p1, p2, a1, c1.
Associations between disparate items on the desktop:
◦ LaTeX and BibTeX files
◦ Word documents
◦ PowerPoint presentations
◦ emails and contacts
◦ webpages in the web cache
The instances and associations are stored in an RDF file. The size of the file is 52.4MB.
Experimental Evaluation
[Figure panels: attribute clauses with no sub-attributes; attribute clauses with sub-attributes; association clauses]
Observations about the results
105,320 object
300,354 attribute
468,402 association
predicate query: 15.2 ms
neighborhood keyword query: 224.3 ms
(with no more than 5 keywords)
Answering queries using the KIL was very efficient!
Answering queries with or without sub-attributes consumed a similar amount of time
Effectiveness of hybrid indexing
Compared with KIL (on average):
The Naïve method:
◦ query-answering time increased by a factor of 15.9
XML Index (SepIL):
◦ query-answering time increased by a factor of 2
Comparison of methods
Main Contributions:
◦ An indexing method that is designed to support flexible querying over dataspaces
◦ Extended inverted lists to capture both text and structure of data
Future Work:
◦ Extend the index to support value heterogeneity and investigate appropriate ranking algorithms
Conclusions
THE END
Questions?