(Meeting Overview)
![Page 1: (Meeting Overview)](https://reader035.fdocuments.net/reader035/viewer/2022062801/56814401550346895db09612/html5/thumbnails/1.jpg)
(Meeting Overview)
Arjen P. de Vries, Georgina Ramirez, Johan List
Djoerd Hiemstra, Vojkan Mihajlovic, Mila Boldareva,
Maurice van Keulen
![Page 2: (Meeting Overview)](https://reader035.fdocuments.net/reader035/viewer/2022062801/56814401550346895db09612/html5/thumbnails/2.jpg)
Overview
• Cirquid Goals
• Multi-model DBMS Architecture
• Region algebras
  • For XML path traversal?
  • For ranking in IR?
• GALAX Architecture (+example)
![Page 3: (Meeting Overview)](https://reader035.fdocuments.net/reader035/viewer/2022062801/56814401550346895db09612/html5/thumbnails/3.jpg)
Goals
• Develop an efficient and flexible system that integrates information retrieval and data retrieval
  • ‘structure + content’
• Two parts:
  • Database architecture (Arjen & Djoerd)
  • Optimization (Henk Ernst)
![Page 4: (Meeting Overview)](https://reader035.fdocuments.net/reader035/viewer/2022062801/56814401550346895db09612/html5/thumbnails/4.jpg)
Example Query
FOR $article IN document("collection.xml")//article
WHERE $article/text() about ‘Willem-Alexander dating Maxima’
AND $article[@language = ‘English’]
AND $article[@pub-date between ‘31-1-2003’ and ‘1-3-2003’]
RETURN <result>$article</result>
![Page 5: (Meeting Overview)](https://reader035.fdocuments.net/reader035/viewer/2022062801/56814401550346895db09612/html5/thumbnails/5.jpg)
Basic Assumption
• A coupled IR+DB system architecture is neither desirable nor efficient
• Possible alternatives:
  • Express entire combined algorithms in the DBMS query language
  • Exploit the DBMS extension mechanism for IR
  • Flexible and transparent integration of IR in the query engine
![Page 6: (Meeting Overview)](https://reader035.fdocuments.net/reader035/viewer/2022062801/56814401550346895db09612/html5/thumbnails/6.jpg)
Multi-model DBMS Architecture
[Figure: three-layer architecture]
• Conceptual Layer
• Logical Layer
• Physical Layer
• Components labelled in the figure: X-Path, LM IR, Suffix Array, Staircase-Join, …
![Page 7: (Meeting Overview)](https://reader035.fdocuments.net/reader035/viewer/2022062801/56814401550346895db09612/html5/thumbnails/7.jpg)
Cirquid Focus
• X-Path extension and IR Language Modeling extension
  • Suitable for collection-based processing
  • Maintain data independence
  • Based on region algebra
![Page 8: (Meeting Overview)](https://reader035.fdocuments.net/reader035/viewer/2022062801/56814401550346895db09612/html5/thumbnails/8.jpg)
Node index
<section>: (1, 1:23, 0) (1, 8:22, 1) (1, 14:21, 2) …
<title>: (1, 2:7, 1) (1, 9:13, 2) (1, 15:20, 3) …
Word index
“information”: (1, 3, 2) …
“retrieval”: (1, 4, 2) …
Example document (on the slide, each token carries its position, 1 to 23):
<section> <title> Information Retrieval Using RDBMS </title> <section> <title> Beyond Simple Translation </title> <section> <title> Extension of IR Features </title> </section> </section> </section>
Region operators: containment, direct containment, tight containment, proximity
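The index entries on this slide can be reproduced with a small sketch. It assumes whitespace tokenization, numbers every tag and word from 1, and reads each entry as (document, position or start:end, nesting depth). The function names and the reading of "direct containment" as "exactly one level deeper" are illustrative assumptions, not definitions taken from the slides.

```python
from collections import defaultdict

DOC = ("<section> <title> Information Retrieval Using RDBMS </title> "
       "<section> <title> Beyond Simple Translation </title> "
       "<section> <title> Extension of IR Features </title> "
       "</section> </section> </section>")


def build_indexes(text, doc_id=1):
    """Build (node_index, word_index) from a whitespace-tokenized document."""
    node_index = defaultdict(list)  # tag  -> [(doc, start, end, depth)]
    word_index = defaultdict(list)  # word -> [(doc, position, depth)]
    stack = []                      # currently open tags as (tag, start)
    for pos, tok in enumerate(text.split(), start=1):
        if tok.startswith("</"):
            tag, start = stack.pop()
            # depth = number of regions still open around this one
            node_index[tag].append((doc_id, start, pos, len(stack)))
        elif tok.startswith("<"):
            stack.append((tok.strip("<>"), pos))
        else:
            word_index[tok.lower()].append((doc_id, pos, len(stack)))
    for regions in node_index.values():
        regions.sort()
    return node_index, word_index


def contains(region, word):
    """Containment: the word position lies strictly inside the region."""
    doc, start, end, _ = region
    wdoc, pos, _ = word
    return doc == wdoc and start < pos < end


def directly_contains(region, word):
    """Assumed reading of direct containment: contained, one level deeper."""
    return contains(region, word) and word[2] == region[3] + 1


node_index, word_index = build_indexes(DOC)
print(node_index["section"])       # regions of all <section> elements
print(word_index["information"])   # positions of the word "information"
```

With this representation, a containment test needs only integer comparisons on the two indexes, which is what makes region algebra attractive for collection-based processing.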
![Page 9: (Meeting Overview)](https://reader035.fdocuments.net/reader035/viewer/2022062801/56814401550346895db09612/html5/thumbnails/9.jpg)
XML Indexing
[Figure: example XML tree with element nodes A, B, C, D, E and text nodes T1, T2, T3, T4, shown in two layouts]
![Page 10: (Meeting Overview)](https://reader035.fdocuments.net/reader035/viewer/2022062801/56814401550346895db09612/html5/thumbnails/10.jpg)
[Figure: the example tree with element nodes A, B, C, D, E and text nodes T1, T2, T3, T4]
Node index
     OID  S    E    P
A    0    0    13   -
B    1    1    3    1
C    2    4    12   1
D    3    5    8    2
E    4    10   11   2
…    …    …    …    …
Word index
T1   2
T2   6
T3   7
T4   9
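A minimal sketch of how the start (S) and end (E) columns support path traversal: a node is a descendant of another when its interval is nested inside the other's. The values are copied from the tables above; the P column is left out, and the function names are illustrative assumptions.

```python
# name -> (OID, S, E), copied from the node index above
NODE_INDEX = {
    "A": (0, 0, 13),
    "B": (1, 1, 3),
    "C": (2, 4, 12),
    "D": (3, 5, 8),
    "E": (4, 10, 11),
}

# word -> position, copied from the word index above
WORD_INDEX = {"T1": 2, "T2": 6, "T3": 7, "T4": 9}


def descendants(name):
    """Element nodes whose [S, E] interval is nested inside that of `name`."""
    _, s, e = NODE_INDEX[name]
    return sorted(n for n, (_, ns, ne) in NODE_INDEX.items()
                  if s < ns and ne < e)


def words_in(name):
    """Words whose position falls inside the [S, E] interval of `name`."""
    _, s, e = NODE_INDEX[name]
    return sorted(w for w, pos in WORD_INDEX.items() if s < pos < e)


print(descendants("C"))  # element nodes below C
print(words_in("D"))     # words contained in D
```

Both lookups are range comparisons over the index tables, so they can be expressed as joins in a relational engine without walking the tree.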
![Page 11: (Meeting Overview)](https://reader035.fdocuments.net/reader035/viewer/2022062801/56814401550346895db09612/html5/thumbnails/11.jpg)
LM IR on Regions
• Extend region representation with a probability value
• Extend DB with rules for how the probabilities are computed
  • E.g.: P(A ranked_containing B) = count(#B in A) / count(* in A)
• Background model [W3C flawed?]
  • Prob(A ranked_containing B in collection C)
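The ratio above can be sketched in a few lines. The region probability follows the slide's count ratio; mixing it with a collection-wide background model by linear interpolation, and the weight `lam`, are assumptions chosen for illustration, since the slides do not say how the background model is combined.

```python
def p_region(term, tokens):
    """count(#term in region) / count(* in region); 0.0 for an empty region."""
    if not tokens:
        return 0.0
    return tokens.count(term) / len(tokens)


def p_smoothed(term, region_tokens, collection_tokens, lam=0.5):
    """Assumed linear interpolation of region and background probabilities."""
    return (lam * p_region(term, region_tokens)
            + (1.0 - lam) * p_region(term, collection_tokens))


# Hypothetical data: one title region and the words of the whole collection.
region = ["information", "retrieval", "using", "rdbms"]
collection = region + ["beyond", "simple", "translation",
                       "extension", "of", "ir", "features"]

print(p_region("retrieval", region))
print(p_smoothed("retrieval", region, collection))
```

The background term keeps a region from scoring zero merely because one query word is missing from it.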
![Page 12: (Meeting Overview)](https://reader035.fdocuments.net/reader035/viewer/2022062801/56814401550346895db09612/html5/thumbnails/12.jpg)
Issues
• Tokenization etc. part of schema?!
• ‘Content independence’ through declarative specification of RM?
  • Define term-prob ::=
    FOR $n in //*
    LET $rtf = count($n/text() contains Q),
        $rlen = count($n/text())
    RETURN <p>$rtf/$rlen</p>
![Page 13: (Meeting Overview)](https://reader035.fdocuments.net/reader035/viewer/2022062801/56814401550346895db09612/html5/thumbnails/13.jpg)
More Issues
• Adjacency? Proximity? Tag name???
• Region representation?
  • Pre-post? Stretched pre-post?
  • Byte offset?
• Reduce cost of materialization of results by scanning original collection file?
  • Allows direct use of suffix array… but is it efficient? For what queries?
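To make the suffix-array option concrete, here is a naive sketch that indexes the raw collection text and returns the offsets of a pattern by binary search. The construction shown is the simple sort-all-suffixes approach, which is far too slow for a real collection, and the sample text is hypothetical. Offsets are character offsets, which equal byte offsets only for single-byte encodings.

```python
def build_suffix_array(text):
    """Naive construction: sort all suffix start offsets lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_offsets(text, suffix_array, pattern):
    """Return the sorted offsets at which `pattern` occurs in `text`."""
    m = len(pattern)

    def prefix(k):
        start = suffix_array[k]
        return text[start:start + m]

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


# Hypothetical collection fragment.
TEXT = ("<title>Information Retrieval</title>"
        "<title>Extension of IR Features</title>")
SA = build_suffix_array(TEXT)
print(find_offsets(TEXT, SA, "<title>"))
```

Because the hits are offsets into the original file, results could be materialized by reading the file at those positions, which is the trade-off the slide asks about.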
![Page 14: (Meeting Overview)](https://reader035.fdocuments.net/reader035/viewer/2022062801/56814401550346895db09612/html5/thumbnails/14.jpg)
System Development Plan
• Focus on query plan generation
  • All the way from conceptual to physical!
  • Inspiration sources: Moa and RAM
  • Generate for both MonetDB and ‘normal’ RDBMS; also X-100?
• Initial goal
  • Tijah – be pragmatic, must handle INEX 2003!
• Integrate with existing XQuery processor:
  • Galax Open Source implementation
  • Investigate also Konstanz system
![Page 15: (Meeting Overview)](https://reader035.fdocuments.net/reader035/viewer/2022062801/56814401550346895db09612/html5/thumbnails/15.jpg)
• Galax project, started in 2000 at Bell Labs. http://db.bell-labs.com/galax/
• Implements (most of):
  • XQuery 1.0 and XPath 2.0 Data Model
  • XQuery 1.0 and XPath 2.0 Functions and Operators
  • XQuery 1.0: An XML Query Language
  • XML Query Use Cases
  • XML Schema Part 1: Structures & Part 2: Datatypes
• A typed implementation: static & dynamic
• A functional implementation (O’Caml).
![Page 16: (Meeting Overview)](https://reader035.fdocuments.net/reader035/viewer/2022062801/56814401550346895db09612/html5/thumbnails/16.jpg)
Galax Architecture (+example)
EXAMPLE
• Use case: Relational
• XQuery: Return the item number and the description of all the bicycles.
<result>
  {
    for $i in $items//item_tuple
    where contains($i/description, "Bicycle")
    return
      <item_tuple>
        {$i/itemno}
        {$i/description}
      </item_tuple>
  }
</result>
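For readers who want to check what the query returns, the following sketch evaluates the same selection and projection with Python's standard `xml.etree.ElementTree`. The sample items document is hypothetical and only mimics the shape the query expects; it is not the data of the actual use case.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical input standing in for $items.
ITEMS_XML = (
    "<items>"
    "<item_tuple><itemno>1001</itemno>"
    "<description>Red Bicycle</description><offered_by>U01</offered_by></item_tuple>"
    "<item_tuple><itemno>1002</itemno>"
    "<description>Motorcycle</description><offered_by>U02</offered_by></item_tuple>"
    "<item_tuple><itemno>1003</itemno>"
    "<description>Racing Bicycle</description><offered_by>U03</offered_by></item_tuple>"
    "</items>"
)


def bicycles(items_root):
    """Mirror the query: keep itemno and description of matching items."""
    result = ET.Element("result")
    # iter() visits all descendants, like $items//item_tuple
    for item in items_root.iter("item_tuple"):
        description = item.find("description")
        if description is None or "Bicycle" not in (description.text or ""):
            continue
        out = ET.SubElement(result, "item_tuple")
        for name in ("itemno", "description"):
            for child in item.findall(name):
                out.append(copy.deepcopy(child))
    return result


result = bicycles(ET.fromstring(ITEMS_XML))
print(ET.tostring(result, encoding="unicode"))
```

The loop evaluates one item at a time, which is the iterator-style processing that the later slides contrast with set-oriented execution.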
![Page 17: (Meeting Overview)](https://reader035.fdocuments.net/reader035/viewer/2022062801/56814401550346895db09612/html5/thumbnails/17.jpg)
Galax Architecture (+example)
[Figure: Parsing Layer]
• Components: XQuery Parser, XML Parser
• Inputs: XQuery Expression, XML Schema Description, XML Document
• Outputs: XQuery AST, XML Schema AST
![Page 18: (Meeting Overview)](https://reader035.fdocuments.net/reader035/viewer/2022062801/56814401550346895db09612/html5/thumbnails/18.jpg)
Galax Architecture (+example)
[Figure: the Parsing Layer as on the previous slide, extended with a Mapping Layer]
Mapping Layer
• XQuery Mapping to the Core: produces the XQuery Core Internal Structure
• Type System Mapping: produces the XQuery Type System Internal Structure
![Page 19: (Meeting Overview)](https://reader035.fdocuments.net/reader035/viewer/2022062801/56814401550346895db09612/html5/thumbnails/19.jpg)
Galax Architecture (+example)
[Figure: the Parsing Layer and Mapping Layer as on the previous slides, extended with a (Static) Evaluation Layer]
(Static) Evaluation Layer
• Static Type Checker
  • Static Error for non well-typed queries
  • Type of Query Result
Type of the example query result:
element result {
  element item_tuple {
    element itemno {xsd:int},
    element description {xsd:string}
  }*
}
![Page 20: (Meeting Overview)](https://reader035.fdocuments.net/reader035/viewer/2022062801/56814401550346895db09612/html5/thumbnails/20.jpg)
Galax Architecture (+example)
Normalized Expression (XQuery Core)
element result {
  for $i in (
    glx:distinct-docorder(
      (let $glx:sequence := (glx:distinct-docorder(($items)))
       return
         let $glx:last := (fn:count(($glx:sequence)))
         return
           for $glx:dot at $glx:position in ($glx:sequence)
           return
             glx:distinct-docorder(
               (let $glx:sequence := (glx:distinct-docorder((descendant-or-self::node())))
                return
                  let $glx:last := (fn:count(($glx:sequence)))
                  return
                    for $glx:dot at $glx:position in ($glx:sequence)
                    return child::item_tuple)))))
  return
    if (fn:boolean((let $glx:v1 := (fn:data((glx:distinct-docorder((let $glx:sequence := (glx:distinct-docorder(($i))
    …
![Page 21: (Meeting Overview)](https://reader035.fdocuments.net/reader035/viewer/2022062801/56814401550346895db09612/html5/thumbnails/21.jpg)
Algebra
• At a logical level, not at the physical.
• Use of regular-expression types.
• Iteration construct based on the notion of monad.
• Notation similar to path navigation in XPath.
![Page 22: (Meeting Overview)](https://reader035.fdocuments.net/reader035/viewer/2022062801/56814401550346895db09612/html5/thumbnails/22.jpg)
Algebra: some operators
• Projection: book0 / author
• Iteration: for b in bib0/book do book [b/author,b/title]
• Selection: where e1 then e2
• Aggregation: avg, count, max, min, sum.
• Joins: nested for loops
• Structural Recursion: match p
case b: …
case c: …
else …
![Page 23: (Meeting Overview)](https://reader035.fdocuments.net/reader035/viewer/2022062801/56814401550346895db09612/html5/thumbnails/23.jpg)
Some Optimization Rules
• Goal:
  • To eliminate unnecessary FOR or MATCH expressions
  • To enable other optimizations by reordering or distributing computations
• Some rules:
  • FOR simplification
    • for v in () do e → ()
    • for v in e do v → e
    • for v in (e1, e2) do e3 → (for v in e1 do e3), (for v in e2 do e3)
  • IF simplification: if cexpr1 then cexpr2 else cexpr3
    • cexpr1 := true → cexpr2
    • cexpr1 := false → cexpr3
  • LET simplification: let $v := Expr1 return Expr2
    • used_count $v Expr2 => 0 → Expr2
    • used_count $v Expr2 => 1 → Expr2 [ Expr1 / $v ]
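The rewrite rules can be tried out on a toy expression tree. This is a minimal sketch, not Galax's implementation: expressions are nested tuples in a made-up encoding, the sequence rule is generalized from two items to any number, and variable names are assumed to be unique so that shadowing never occurs.

```python
# Hypothetical encoding: ("seq", e1, ...), ("var", name), ("const", value),
# ("for", v, source, body), ("if", cond, then, other),
# ("let", v, bound, body), ("call", fname, arg, ...)
EMPTY = ("seq",)  # the empty sequence ()


def used_count(var, expr):
    """Number of references to `var` in `expr` (assumes unique names)."""
    if expr == ("var", var):
        return 1
    if isinstance(expr, tuple):
        return sum(used_count(var, sub) for sub in expr[1:])
    return 0


def substitute(expr, var, replacement):
    """Expr[replacement / var]: replace each reference to `var`."""
    if expr == ("var", var):
        return replacement
    if isinstance(expr, tuple):
        return (expr[0],) + tuple(substitute(sub, var, replacement)
                                  for sub in expr[1:])
    return expr


def simplify(expr):
    """Apply the FOR, IF and LET simplifications bottom-up."""
    if not isinstance(expr, tuple):
        return expr
    expr = (expr[0],) + tuple(simplify(sub) for sub in expr[1:])
    op = expr[0]
    if op == "for":
        _, v, source, body = expr
        if source == EMPTY:                      # for v in () do e -> ()
            return EMPTY
        if body == ("var", v):                   # for v in e do v -> e
            return source
        if source[0] == "seq" and len(source) > 2:
            # distribute the loop over the items of the sequence
            return simplify(("seq",) + tuple(("for", v, item, body)
                                             for item in source[1:]))
    elif op == "if":
        _, cond, then, other = expr
        if cond == ("const", True):
            return then
        if cond == ("const", False):
            return other
    elif op == "let":
        _, v, bound, body = expr
        uses = used_count(v, body)
        if uses == 0:
            return body
        if uses == 1:
            return substitute(body, v, bound)
    return expr


example = ("let", "x", ("var", "items"),
           ("for", "i", ("var", "x"), ("var", "i")))
print(simplify(example))
```

In the example, the inner loop first collapses to its source, after which the single use of `x` is inlined, leaving only the reference to `items`.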
![Page 24: (Meeting Overview)](https://reader035.fdocuments.net/reader035/viewer/2022062801/56814401550346895db09612/html5/thumbnails/24.jpg)
Galax Architecture (+example)
Optimized Normalized Expression (XQuery Core)
element result {
  for $i in (glx:distinct-docorder((let $glx:dot := ($items) return for $glx:dot in (descendant-or-self::node()) return child::item_tuple)))
  return
    if ( fn:contains((fn:data((glx:distinct-docorder((let $glx:dot := ($i) return child::description))))),("Bicycle")) )
    then (
      element item_tuple {
        glx:distinct-docorder((let $glx:dot := ($i) return child::itemno)),
        text { "" },
        glx:distinct-docorder((let $glx:dot := ($i) return child::description))}
    )
    else (())
}
![Page 25: (Meeting Overview)](https://reader035.fdocuments.net/reader035/viewer/2022062801/56814401550346895db09612/html5/thumbnails/25.jpg)
Galax Architecture (+example)
[Figure: the complete architecture; the Parsing Layer, Mapping Layer and Static Type Checker are as on the earlier slides]
(Dynamic) Evaluation Layer
• Query Processor: produces the Data Model Query Result
• XML Document path: XML Parser, XML AST, XML Data Model Loader (with Validation), XML Data Model Internal Structure
• Annotations on the figure: OUR MAPPING, OUR QP
![Page 26: (Meeting Overview)](https://reader035.fdocuments.net/reader035/viewer/2022062801/56814401550346895db09612/html5/thumbnails/26.jpg)
Road Ahead
• But…
  • Goal, again, is to be mainly pragmatic first
  • Deeper research starts after initial QP generator has been bootstrapped from existing system
• Risk:
  • Too much engineering
  • Algebra in Galax not suited for optimization
![Page 27: (Meeting Overview)](https://reader035.fdocuments.net/reader035/viewer/2022062801/56814401550346895db09612/html5/thumbnails/27.jpg)
Research issues
• Should the ‘semi-structured semi-monad’ algebra be adapted to enable more set-oriented processing?
• Does the IR application give rise to new physical operators???