# Web Data Management Indexes. In this lecture Indexes –XSet –Region algebras –Indexes for...

Date posted: 02-Jan-2016


### Transcript of Web Data Management Indexes. In this lecture Indexes –XSet –Region algebras –Indexes for...

Web Data Management: Indexes

In this lecture

- Indexes
  - XSet
  - Region algebras
  - Indexes for Arbitrary Semistructured Data
  - Dataguides
  - T-indexes
  - Index Fabric

Resources

- Index Structures for Path Expressions, by Milo and Suciu, in ICDT'99
- XSet description: http://www.openhealth.org/XSet/
- Data on the Web, by Abiteboul, Buneman, Suciu: section 8.2

The problem

- Input: large, irregular data graph
- Output: index structure for evaluating regular path expressions

The data

- Semistructured data instance = a large graph

The queries

Regular expressions (using Lorel-like syntax):

- `SELECT X FROM (Bib.*.author).(lastname|firstname).Abiteboul X`
- `SELECT X FROM part._*.supplier.name X`
  - Evaluating this requires traversing the data from the root and returning all nodes X reachable by a path matching the given path expression.
- `SELECT X FROM part._*.supplier: {name: X, address: Philadelphia}`
  - Needs an index on values to narrow the search to the parts of the database that contain the string Philadelphia.

Analyzing the problem

- What kind of data?
  - tree data (XML): easier to index
  - graph data: used in more complex applications
- What kind of queries?
  - restricted regular expressions (e.g. XPath): may be more efficient
  - arbitrary regular expressions: rarely encountered in practice

XSet: a simple index for XML

- Part of the Ninja project at Berkeley
- Example XML data: (figure not reproduced in this transcript)
- Each node = a hashtable
- Each entry = list of pointers to data nodes (not shown)

XSet: Efficient query evaluation

Queries, and whether the index evaluates them efficiently:

- (R1) `SELECT X FROM part.name X`: yes
- (R2) `SELECT X FROM part.supplier.name X`: yes
- (R3) `SELECT X FROM *.supplier.name X`: maybe
- (R4) `SELECT X FROM part.*.subpart.name X`: maybe

Evaluation:

- To evaluate R1, look for part in the root hash table h1, follow the link to table h2, then look for name.
- R4: following part leads to h2; traverse all nodes in the index (corresponding to *), then continue with the path subpart.name. Thus, the entire subtree dominated by h2 is explored. This is efficient if the index is small and fits in memory.
- R3: the leading wildcard forces us to consider all nodes in the index tree, resulting in less efficient computation than for R4.
- Can index the index itself: retrieve all hash tables that contain a supplier entry, then continue a normal search from there.
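The lookup strategy described for R1 to R4 can be sketched with nested hash tables. This is a minimal illustration, not the XSet implementation: the class `IndexNode`, the functions `build_index`, `subtree` and `evaluate`, and the sample data are all hypothetical names chosen for this sketch, and `*` is taken to match any sequence of zero or more labels.

```python
from typing import Dict, Iterator, List, Sequence, Tuple


class IndexNode:
    """One index node: a hash table from tag name to child index node."""

    def __init__(self) -> None:
        self.children: Dict[str, "IndexNode"] = {}
        self.extent: List[str] = []  # ids of the data nodes with this label path


def build_index(data: Sequence[Tuple[str, List[str]]]) -> IndexNode:
    """Build the index tree from (data node id, label path from root) pairs."""
    root = IndexNode()
    for node_id, path in data:
        cur = root
        for label in path:
            cur = cur.children.setdefault(label, IndexNode())
        cur.extent.append(node_id)
    return root


def subtree(node: IndexNode) -> Iterator[IndexNode]:
    """All index nodes at or below `node`; this is what a `*` step visits."""
    yield node
    for child in node.children.values():
        yield from subtree(child)


def evaluate(root: IndexNode, steps: List[str]) -> List[str]:
    """Evaluate a path such as ["part", "*", "subpart", "name"]."""
    frontier = [root]
    for step in steps:
        if step == "*":
            reached = [n for f in frontier for n in subtree(f)]
        else:
            reached = [f.children[step] for f in frontier if step in f.children]
        # Drop duplicates that nested wildcards can introduce.
        frontier = list({id(n): n for n in reached}.values())
    return sorted({d for n in frontier for d in n.extent})


# Hypothetical sample data (not the data from the slides).
DATA = [
    ("n1", ["part", "name"]),
    ("n2", ["part", "supplier", "name"]),
    ("n3", ["part", "subpart", "name"]),
    ("n4", ["part", "subpart", "subpart", "name"]),
    ("n5", ["supplier", "name"]),
]
INDEX = build_index(DATA)
R1 = evaluate(INDEX, ["part", "name"])
R3 = evaluate(INDEX, ["*", "supplier", "name"])
R4 = evaluate(INDEX, ["part", "*", "subpart", "name"])
```

R1 costs two hash lookups, while R3 and R4 call `subtree`, which is where the "maybe" cost comes from: R4 visits only the subtree under part, R3 visits the whole index.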

Region Algebras

- structured text = text with tags (like XML)
- powerful indexing techniques [Baeza-Yates, Gonnet, Navarro, Salminen, Tompa, etc.]
- New Oxford English Dictionary
- critical limitation: ordered data only (like text)
  - Assume: data given as an XML text file, with an implicit ordering in the file.
- less critical limitation: restricted regular expressions

Region Algebras: Definitions

- data = sequence of characters $[c_1 c_2 c_3 \ldots]$
- region = segment of the text in a file
  - representation: $(x, y) = [c_x, c_{x+1}, \ldots, c_y]$, where $x$ is the start position and $y$ the end position of the region
- region set = a set of regions such that any two regions are either disjoint or one is included in the other
  - example: all regions (may be nested)
- Tree data: each node defines a region, and each set of nodes defines a region set
  - example: region p2 consists of the text under p2; the set {p2, s2, s1} is a region set with three regions
- Representation of a region set: (example figure not reproduced in this transcript)
- region algebra = operators on region sets; `s1 op s2` defines a new region set

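Region algebra: some operators

- $s_1 \text{ intersect } s_2 = \{ r \mid r \in s_1,\ r \in s_2 \}$
- $s_1 \text{ included } s_2 = \{ r \mid r \in s_1,\ \exists r' \in s_2,\ r \subseteq r' \}$
- $s_1 \text{ including } s_2 = \{ r \mid r \in s_1,\ \exists r' \in s_2,\ r \supseteq r' \}$
- $s_1 \text{ parent } s_2 = \{ r \mid r \in s_1,\ \exists r' \in s_2,\ r \text{ is a parent of } r' \}$
- $s_1 \text{ child } s_2 = \{ r \mid r \in s_1,\ \exists r' \in s_2,\ r \text{ is a child of } r' \}$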

Examples:

- included = {s1, s2, s3, s5}
- including = {p2, p3}
- child = {n1, n3, n12}
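The operator definitions can be tried out directly on sets of `(start, end)` pairs. The sketch below is a naive quadratic version for illustration only. The function names, the extra `universe` argument (the set of all regions in the document, needed to decide what a direct parent is) and the sample positions are assumptions of this sketch, not part of the lecture.

```python
from typing import Set, Tuple

Region = Tuple[int, int]  # (start position, end position)


def contains(outer: Region, inner: Region) -> bool:
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def intersect(s1: Set[Region], s2: Set[Region]) -> Set[Region]:
    return s1 & s2


def included(s1: Set[Region], s2: Set[Region]) -> Set[Region]:
    return {r for r in s1 if any(contains(r2, r) for r2 in s2)}


def including(s1: Set[Region], s2: Set[Region]) -> Set[Region]:
    return {r for r in s1 if any(contains(r, r2) for r2 in s2)}


def is_parent(r: Region, rp: Region, universe: Set[Region]) -> bool:
    """r is the parent of rp: r strictly contains rp with nothing in between."""
    if r == rp or not contains(r, rp):
        return False
    return not any(
        u not in (r, rp) and contains(r, u) and contains(u, rp) for u in universe
    )


def parent(s1: Set[Region], s2: Set[Region], universe: Set[Region]) -> Set[Region]:
    return {r for r in s1 if any(is_parent(r, r2, universe) for r2 in s2)}


def child(s1: Set[Region], s2: Set[Region], universe: Set[Region]) -> Set[Region]:
    return {r for r in s1 if any(is_parent(r2, r, universe) for r2 in s2)}


# Hypothetical document layout: one part containing a name, a supplier
# (with its own name) and a subpart (with its own name).
PART = {(0, 100)}
SUPPLIER = {(20, 60)}
SUBPART = {(65, 95)}
NAME = {(5, 15), (25, 35), (70, 80)}
UNIVERSE = PART | SUPPLIER | SUBPART | NAME

names_in_part = included(NAME, PART)
names_of_part = child(NAME, PART, UNIVERSE)
names_of_supplier = child(NAME, SUPPLIER, UNIVERSE)
```

Here `included` returns all three name regions, while `child` keeps only the name whose direct parent is the part region.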

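Efficient computation of Region Algebra Operators

Example: `s1 included s2`, where

- $s_1 = \{(x_1, x_1'), (x_2, x_2'), \ldots\}$
- $s_2 = \{(y_1, y_1'), (y_2, y_2'), \ldots\}$

(i.e. assume each consists of disjoint regions, listed in order of start position)

Algorithm, scanning both lists with indexes i and j:

- if $x_i < y_j$ then i := i + 1
- else if $x_i' > y_j'$ then j := j + 1
- otherwise: print $(x_i, x_i')$, do i := i + 1

This can be done in sub-linear time when one of the region sets is very small.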
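The merge-style scan can be written out as a short function. This is a sketch under the same assumptions as the slide (each list holds pairwise disjoint regions) plus one more that the scan relies on: both lists are sorted by start position. The name `included_merge` and the sample lists are hypothetical.

```python
from typing import List, Tuple

Region = Tuple[int, int]  # (start position, end position)


def included_merge(s1: List[Region], s2: List[Region]) -> List[Region]:
    """Regions of s1 that lie inside some region of s2, in one linear pass."""
    i = j = 0
    result: List[Region] = []
    while i < len(s1) and j < len(s2):
        x_start, x_end = s1[i]
        y_start, y_end = s2[j]
        if x_start < y_start:
            i += 1  # s1[i] starts before s2[j], so no remaining s2 region covers it
        elif x_end > y_end:
            j += 1  # s1[i] reaches past s2[j]; try the next region of s2
        else:
            result.append(s1[i])  # y_start <= x_start and x_end <= y_end
            i += 1
    return result


S1 = [(1, 2), (5, 6), (12, 13), (20, 30)]
S2 = [(4, 8), (10, 15), (25, 40)]
HITS = included_merge(S1, S2)
```

Each step advances `i` or `j`, so the scan takes at most `len(s1) + len(s2)` iterations.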

From path expressions to region expressions

Use region algebra operators to answer regular path expressions.

Only restricted forms of regular path expressions can be translated into region algebra operators: expressions of the form $R_1.R_2.\ldots.R_n$, where each $R_i$ is either a label constant or the Kleene closure `*`.

Region expressions correspond to simple XPath expressions:

| Path expression | Region expression |
| --- | --- |
| `part.name` | `name child (part child root)` |
| `part.supplier.name` | `name child (supplier child (part child root))` |
| `*.supplier.name` | `name child supplier` |
| `part.*.subpart.name` | `name child (subpart included (part child root))` |
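The four translations follow one pattern: a label directly after another label becomes `child`, a label after `*` becomes `included`, and a leading `*` drops the reference to root. A minimal sketch of that pattern follows. The function name `to_region_expr` is hypothetical, and the handling of cases the slide does not list (such as a trailing `*`, which is rejected here) is an assumption of this sketch.

```python
def to_region_expr(path: str) -> str:
    """Translate R1.R2...Rn (labels and `*` only) into a region expression."""
    expr = "root"
    after_star = False
    for step in path.split("."):
        if step == "*":
            after_star = True
            continue
        if after_star and expr == "root":
            # Every region lies inside root, so "included root" adds nothing.
            expr = step
        else:
            op = "included" if after_star else "child"
            context = f"({expr})" if " " in expr else expr
            expr = f"{step} {op} {context}"
        after_star = False
    if after_star:
        raise ValueError("a trailing * is not handled by this sketch")
    return expr


EXAMPLES = [
    "part.name",
    "part.supplier.name",
    "*.supplier.name",
    "part.*.subpart.name",
]
TRANSLATED = {p: to_region_expr(p) for p in EXAMPLES}
```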

From path expressions to region expressions: answering more complex queries

The query

`SELECT X FROM *.subpart: {name: X, *.supplier.address: Philadelphia}`

translates into the following region algebra expression:

`name child (subpart including (supplier parent (address intersect Philadelphia)))`

Here Philadelphia denotes a region set consisting of all regions corresponding to the word Philadelphia in the text. Such a region set can be computed dynamically using a full text index.

Indexes for Arbitrary Semistructured Data

A semistructured data instance that is a DAG

- The data represents employees and projects in a company.
- Two kinds of employees: programmers and statisticians
- Three kinds of links to projects: leads, workson, consultants
- Index graph: a reduced graph that summarizes all paths from the root in the data graph
- Example: node p1. The paths from the root to p1 are labeled with the following five sequences:
  - project
  - employee.leads
  - employee.workson
  - programmer.employee.leads
  - programmer.employee.workson
- Node p2: the paths from the root to p2 are labeled by the same five sequences
- p1 and p2 are language-equivalent

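For each node $x$ in the data graph $G$:

$$L_x = \{ w \mid \exists \text{ a path from the root to } x \text{ labeled } w \}$$

Language equivalence and equivalence classes:

$$\forall x, y:\quad x \equiv y \iff L_x = L_y \qquad [x] = \{ y \mid x \equiv y \}$$

The index graph $I$:

$$\text{Nodes}(I) = \{ [x] \mid x \in \text{nodes}(G) \}$$

$$\text{Edges}(I) = \{ [x] \xrightarrow{a} [y] \mid x' \in [x],\ y' \in [y],\ x' \xrightarrow{a} y' \in \text{edges}(G) \}$$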
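For a small acyclic data graph, the construction can be carried out literally: enumerate each node's set of root-path label sequences, group nodes with equal sets, and lift the edges to the groups. The sketch below does exactly that. It assumes a DAG (so each language is finite), it can take exponential time, and the function name `build_language_index` and the sample graph are hypothetical rather than the graph from the slides.

```python
from collections import defaultdict
from functools import lru_cache
from typing import FrozenSet, List, Set, Tuple

Edge = Tuple[str, str, str]  # (source node, label, target node)


def build_language_index(edges: List[Edge], root: str):
    """Return (index nodes, index edges) under language equivalence."""
    incoming = defaultdict(list)
    nodes: Set[str] = {root}
    for src, label, dst in edges:
        incoming[dst].append((src, label))
        nodes.update((src, dst))

    @lru_cache(maxsize=None)
    def language(x: str) -> FrozenSet[Tuple[str, ...]]:
        """L_x: all label sequences of paths from the root to x (DAG only)."""
        words = {()} if x == root else set()
        for src, label in incoming[x]:
            words |= {w + (label,) for w in language(src)}
        return frozenset(words)

    groups = defaultdict(set)
    for x in nodes:
        groups[language(x)].add(x)
    class_of = {x: frozenset(g) for g in groups.values() for x in g}
    index_nodes = set(class_of.values())
    index_edges = {(class_of[s], a, class_of[d]) for s, a, d in edges}
    return index_nodes, index_edges


# Hypothetical data graph: two employees, three projects.
EDGES = [
    ("root", "employee", "e1"), ("root", "employee", "e2"),
    ("e1", "leads", "p1"), ("e2", "leads", "p2"),
    ("e1", "workson", "p2"), ("e2", "workson", "p1"),
    ("root", "project", "p1"), ("root", "project", "p2"),
    ("root", "project", "p3"), ("e1", "leads", "p3"),
]
INDEX_NODES, INDEX_EDGES = build_language_index(EDGES, "root")
```

In this sample, e1 and e2 collapse into one index node, as do p1 and p2, so the ten data edges shrink to six index edges.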

We have the following equivalences:

- $e_1 \equiv e_2$
- $e_3 \equiv e_4 \equiv e_5$
- $p_1 \equiv p_2$
- $p_3 \equiv p_4$
- $p_5 \equiv p_6 \equiv p_7$

Computing path expression queries:

- Compute the query on I and obtain a set of index nodes
- Compute the union of all their extents

Example: `SELECT X FROM statistician.employee.(leads|consults): X`

- Returns index nodes h8, h9. Their extents are [p5, p6, p7] and [p8], respectively; result set = [p5, p6, p7, p8]
- Always: $\text{size}(I) \le \text{size}(G)$
- Efficient when I can be stored in main memory
- Checking $x \equiv y$ is expensive.
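The two-step evaluation (run the query on the index, then union the extents) can be sketched as follows. Only the extents of h8 and h9 come from the slide. The index edges, the node names h1, h3 and h5, and the function name `evaluate_on_index` are assumptions made up for this sketch, since the index figure is not part of the transcript. Each query step is given as a set of allowed labels, which covers the alternation `(leads|consults)`.

```python
from typing import Dict, List, Set, Tuple

IndexEdge = Tuple[str, str, str]  # (source index node, label, target index node)


def evaluate_on_index(
    index_edges: List[IndexEdge],
    extents: Dict[str, List[str]],
    root: str,
    steps: List[Set[str]],
) -> Tuple[Set[str], Set[str]]:
    """Follow the steps on the index, then union the extents that were reached."""
    frontier = {root}
    for labels in steps:
        frontier = {
            dst for src, a, dst in index_edges if src in frontier and a in labels
        }
    result: Set[str] = set()
    for h in frontier:
        result |= set(extents.get(h, []))
    return frontier, result


# Hypothetical fragment of an index graph (edges are assumed, extents of
# h8 and h9 are the ones quoted on the slide).
INDEX_EDGES = [
    ("h1", "statistician", "h3"),
    ("h3", "employee", "h5"),
    ("h5", "leads", "h8"),
    ("h5", "consults", "h9"),
]
EXTENTS = {"h8": ["p5", "p6", "p7"], "h9": ["p8"]}

QUERY = [{"statistician"}, {"employee"}, {"leads", "consults"}]
HIT_NODES, RESULT = evaluate_on_index(INDEX_EDGES, EXTENTS, "h1", QUERY)
```

The traversal touches only index nodes; data nodes appear only in the final union, which is why the approach pays off when the index fits in main memory.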

Use bisimulation $\approx_b$ instead of language equivalence $\equiv$.

Fact: $\forall x, y:\ x \approx_b y \Rightarrow x \equiv y$

Use the same construction, but $[u]$ now refers to $\approx_b$ instead of $\equiv$.

Bisimulation: Let DB be a data graph. A relation $\approx$ is a bisimulation on the reversed graph (i.e. all edges have their direction reversed) if the following conditions hold:

1. If $x \approx y$ and $x$ is a root, then so is $y$.
2. Conversely, if $x \approx y$ and $y$ is a root, then so is $x$.
3. If $x \approx y$, then for any edge $x' \xrightarrow{a} x$ there exists an edge $y' \xrightarrow{a} y$ such that $x' \approx y'$.
4. Conversely, if $x \approx y$, then for any edge $y' \xrightarrow{a} y$ there exists an edge $x' \xrightarrow{a} x$ such that $x' \approx y'$.
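The coarsest relation satisfying the four conditions can be computed by naive partition refinement: start from the split root / non-root, then repeatedly split blocks by the set of (label, block of predecessor) pairs on incoming edges until nothing changes. The sketch below is a simple, unoptimized version of that idea rather than the algorithm used in the lecture; `backward_bisimulation` and the sample graph are hypothetical names.

```python
from collections import defaultdict
from typing import FrozenSet, List, Set, Tuple

Edge = Tuple[str, str, str]  # (source node, label, target node)


def backward_bisimulation(edges: List[Edge], root: str) -> Set[FrozenSet[str]]:
    """Equivalence classes of the coarsest bisimulation on the reversed graph."""
    incoming = defaultdict(set)
    nodes: Set[str] = {root}
    for src, label, dst in edges:
        incoming[dst].add((src, label))
        nodes.update((src, dst))

    # Conditions 1 and 2: the root can only be related to a root.
    block = {x: int(x == root) for x in nodes}
    while True:
        # Conditions 3 and 4: compare labeled incoming edges up to current blocks.
        signature = {
            x: (block[x], frozenset((a, block[p]) for p, a in incoming[x]))
            for x in nodes
        }
        ids = {}
        refined = {x: ids.setdefault(signature[x], len(ids)) for x in nodes}
        if len(ids) == len(set(block.values())):
            break  # no block was split: the partition is stable
        block = refined

    classes = defaultdict(set)
    for x in nodes:
        classes[block[x]].add(x)
    return {frozenset(c) for c in classes.values()}


# Hypothetical graph: z1 and z2 are both reached by exactly the words
# a.c and b.c, but through differently shaped predecessors.
EDGES = [
    ("root", "a", "m1"), ("root", "b", "m1"), ("m1", "c", "z1"),
    ("root", "a", "m2"), ("root", "b", "m3"),
    ("m2", "c", "z2"), ("m3", "c", "z2"),
    ("root", "a", "m4"),
]
CLASSES = backward_bisimulation(EDGES, "root")
```

In this sample m2 and m4 are bisimilar, while z1 and z2 are language-equivalent but not bisimilar. That illustrates the direction of the fact above: bisimilarity implies language equivalence, not the other way around, so the bisimulation index can be larger.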

DataGuides

- Goldman & Widom [VLDB 97]
- graph data
- arbitrary regular expressions

Definition: given a semistructured data instance DB, a DataGuide for DB is a graph G s.t.:

- every path in DB also occurs in G
- every path in G occurs in DB
- every path in G is unique

Example: (DataGuide figure not reproduced in this transcript)

Multiple DataGuides for the same data: (figure not reproduced in this transcript)

Definition: Let $w$, $w'$ be two words (i.e. word queries) and $G$ a graph. Then $w \equiv_G w'$ if $w(G) = w'(G)$.

Definition: $G$ is a strong DataGuide for a database $DB$ if $\equiv_G$ is the same as $\equiv_{DB}$.

Example: G1 is a strong DataGuide; G2 is not strong:

- $\text{person.project} \not\equiv_{DB} \text{dept.project}$
- $\text{person.project} \equiv_{G_2} \text{dept.project}$

Constructing the strong DataGuide G:

- Nodes(G) = {{root}}
- Edges(G) = ∅
- while changes do
  - choose s in Nodes(G), a in Labels
  - add s' = {y | x in s, (x -a-> y) in Edges(DB)} to Nodes(G)
  - add (s -a-> s') to Edges(G)

Use a hash table for Nodes(G).

This is precisely the powerset automaton construction.
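The loop can be made concrete with a worklist and frozen sets as hashable DataGuide nodes. This is a sketch of the powerset construction under a few assumptions: edges are given as `(source, label, target)` triples, empty target sets are skipped, and the function name `strong_dataguide` and the sample database are hypothetical.

```python
from collections import defaultdict, deque
from typing import FrozenSet, List, Set, Tuple

Edge = Tuple[str, str, str]  # (source node, label, target node)
GuideNode = FrozenSet[str]   # a DataGuide node is a set of DB nodes


def strong_dataguide(edges: List[Edge], root: str):
    """Return (Nodes(G), Edges(G)) of the strong DataGuide of the database."""
    out = defaultdict(lambda: defaultdict(set))
    for src, label, dst in edges:
        out[src][label].add(dst)

    start: GuideNode = frozenset({root})
    nodes: Set[GuideNode] = {start}  # a Python set plays the hash table role
    guide_edges: Set[Tuple[GuideNode, str, GuideNode]] = set()
    worklist = deque([start])
    while worklist:
        s = worklist.popleft()
        # Group the targets of all edges leaving any member of s by label.
        targets = defaultdict(set)
        for x in s:
            for label, ys in out.get(x, {}).items():
                targets[label] |= ys
        for label, ys in targets.items():
            t = frozenset(ys)
            guide_edges.add((s, label, t))
            if t not in nodes:
                nodes.add(t)
                worklist.append(t)
    return nodes, guide_edges


# Hypothetical database: two persons and one department sharing projects.
DB_EDGES = [
    ("root", "person", "p1"), ("root", "person", "p2"),
    ("root", "dept", "d1"),
    ("p1", "project", "j1"), ("p2", "project", "j2"),
    ("d1", "project", "j1"),
]
GUIDE_NODES, GUIDE_EDGES = strong_dataguide(DB_EDGES, "root")
```

In this sample, person.project leads to the node {j1, j2} and dept.project to {j1}. The two words have different target sets in the database, so the strong DataGuide keeps them on separate nodes.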

How large are the dataguides?

- if DB is a tree, then size(G)