TopX @ INEX '05
Martin Theobald, Ralf Schenkel, Gerhard Weikum
Max Planck Institute for Informatics, Saarbrücken
An Efficient and Versatile Query Engine for TopX Search — INEX '05
//article[//sec[about(.//, "XML retrieval")] //par[about(.//, "native XML database")]] //bib[about(.//item, "W3C")]
[Figure: two example document trees (article elements with sec, par, title, bib, and item children) whose text snippets — e.g. "Native XML database systems can store schemaless data ...", "Proc. Query Languages Workshop, W3C, 1998" — match the three about() conditions of the query]
TopX: Efficient XML-IR [VLDB ’05]
Extend top-k query processing algorithms for sorted lists [Buckley ’85; Güntzer, Balke & Kießling ’00; Fagin ‘01] to XML data
Non-schematic, heterogeneous data sources
Combined inverted index for content & structure
Avoid full index scans, postpone expensive random accesses to large disk-resident data structures
Exploit cheap disk space for redundant indexing
Goal: Efficiently retrieve the best results of a similarity query
Data Model
Simplified XML model, disregarding IDRef & XLink/XPointer
Redundant full-contents: per-element term frequencies ftf(ti, e) over the element's full content
Pre/postorder labels for each tag-term pair
<article>
  <title>XML-IR</title>
  <abs>IR techniques for XML</abs>
  <sec>
    <title>Clustering on XML</title>
    <par>Evaluation</par>
  </sec>
</article>
[Figure: the example tree with (pre, post) labels per element — article = (1, 6) — and the stemmed full-content string of each element, e.g. "xml ir ir technique xml clustering xml evaluation" for article, "clustering xml evaluation" for sec, "evaluation" for par]
ftf("xml", article1) = 3
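The data model can be sketched in code. The following is a minimal Python illustration (not the actual TopX implementation) that assigns pre/postorder labels in one depth-first pass and computes a full-content term frequency ftf(t, e) over an element's entire subtree; `label`, `visit`, and `ftf` are names invented here.

```python
import re
import xml.etree.ElementTree as ET

doc = """<article><title>XML-IR</title><abs>IR techniques for XML</abs>
<sec><title>Clustering on XML</title><par>Evaluation</par></sec></article>"""

def label(root):
    """Return {element: (pre, post)} via one depth-first traversal."""
    labels, pre, post = {}, [0], [0]
    def visit(e):
        pre[0] += 1
        my_pre = pre[0]          # preorder rank: assigned on entry
        for child in e:
            visit(child)
        post[0] += 1             # postorder rank: assigned on exit
        labels[e] = (my_pre, post[0])
    visit(root)
    return labels

def ftf(term, e):
    """Full-content term frequency: occurrences of term anywhere below e."""
    text = " ".join(e.itertext()).lower()
    return len(re.findall(r"\b" + re.escape(term) + r"\b", text))

root = ET.fromstring(doc)
labels = label(root)
print(labels[root])          # (1, 6): the root gets pre = 1, largest post
print(ftf("xml", root))      # 3: "xml" occurs 3 times in article's full content
```

With these labels, the pre/post pairs of all descendants of an element lie strictly inside its own pair, which is what the structural joins later exploit.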
Full-Content Scoring Model
Full-content scores cast into an Okapi-BM25 probabilistic model with element-specific parameterization
Basic scoring idea within the IR-style family of TF*IDF ranking functions

Per-element statistics:
tag     | N         | avglength | k1   | b
article | 12,223    | 2,903     | 10.5 | 0.75
sec     | 96,709    | 413       | 10.5 | 0.75
par     | 1,024,907 | 32        | 10.5 | 0.75
fig     | 109,230   | 13        | 10.5 | 0.75
Additional static score mass c for relaxable structural conditions
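The per-tag statistics plug into a standard Okapi BM25 formula. The sketch below is a hedged illustration of such an element-level score — the exact TopX formula (e.g. its idf smoothing) may differ in details, and `bm25_score`, `STATS`, and the parameter names are invented here.

```python
import math

# Per-tag statistics (N, avglength, k1, b) taken from the table above.
STATS = {
    "article": (12_223, 2_903, 10.5, 0.75),
    "sec":     (96_709,   413, 10.5, 0.75),
    "par":   (1_024_907,   32, 10.5, 0.75),
    "fig":     (109_230,   13, 10.5, 0.75),
}

def bm25_score(tag, ftf_t_e, ef_t, length_e):
    """Standard BM25 score of one (tag, term) condition for element e.
    ftf_t_e : full-content term frequency of the term in e
    ef_t    : number of elements of this tag type containing the term
    length_e: full-content length of e in terms
    """
    N, avglen, k1, b = STATS[tag]
    idf = math.log((N - ef_t + 0.5) / (ef_t + 0.5))        # rarity of the term
    K = k1 * ((1 - b) + b * length_e / avglen)             # length normalization
    return idf * ftf_t_e * (k1 + 1) / (K + ftf_t_e)        # saturating tf
```

For example, `bm25_score("sec", 3, 5000, 413)` scores a sec element of average length with three term occurrences; the score grows (with diminishing returns) in ftf.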
Inverted Block-Index for Content & Structure

Inverted index over tag-term pairs (full-contents)
Benefits from increased selectivity of combined tag-term pairs
Accelerates child-or-descendant axis, e.g., sec//"clustering"

sec[clustering]:
eid | docid | score | pre | post | max-score
46  | 2     | 0.9   | 2   | 15   | 0.9
9   | 2     | 0.5   | 10  | 8    | 0.9
171 | 5     | 0.85  | 1   | 20   | 0.85
84  | 3     | 0.1   | 1   | 12   | 0.1

title[xml]:
eid | docid | score | pre | post | max-score
216 | 17    | 0.9   | 2   | 15   | 0.9
72  | 3     | 0.8   | 10  | 8    | 0.8
51  | 2     | 0.5   | 4   | 12   | 0.5
671 | 31    | 0.4   | 12  | 23   | 0.4

par[evaluation]:
eid | docid | score | pre | post | max-score
3   | 1     | 1.0   | 1   | 21   | 1.0
28  | 2     | 0.8   | 8   | 14   | 0.8
182 | 5     | 0.75  | 3   | 7    | 0.75
96  | 4     | 0.75  | 6   | 4    | 0.75

Sequential block-scans:
Re-order elements in descending order of (max-score, docid, score) per list
Fetch all tag-term pairs per doc in one sequential block-access
docid limits the range of in-memory structural joins
Stored as inverted files or database tables (B+-tree indexes)
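The block ordering can be sketched as follows — an illustrative in-memory version (the real index lives in inverted files or B+-trees): each list is sorted by (max-score descending, docid, score descending), so all elements of one document form a contiguous block.

```python
from collections import defaultdict

# Toy postings for one tag-term list, as (eid, docid, score, pre, post)
# tuples, matching the sec[clustering] table above.
postings = {
    "sec[clustering]": [
        (46, 2, 0.9, 2, 15), (9, 2, 0.5, 10, 8),
        (171, 5, 0.85, 1, 20), (84, 3, 0.1, 1, 12),
    ],
}

def block_order(plist):
    """Sort a posting list by (max-score desc, docid, score desc)."""
    maxscore = defaultdict(float)        # best score per document in this list
    for eid, docid, score, pre, post in plist:
        maxscore[docid] = max(maxscore[docid], score)
    return sorted(plist, key=lambda p: (-maxscore[p[1]], p[1], -p[2]))

for p in block_order(postings["sec[clustering]"]):
    print(p)   # doc2's two elements come first, as one contiguous block
```

Because a whole document is fetched in one sequential block access, the structural joins over its elements can run entirely in memory.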
Navigational Index

Additional navigational index:
Non-redundant element directory
Supports element paths and branching path queries
Random accesses using (docid, tag) as key
Schema-oblivious indexing & querying

sec:
eid | docid | pre | post
46  | 2     | 2   | 15
9   | 2     | 10  | 8
171 | 5     | 1   | 20
84  | 3     | 1   | 12

title:
eid | docid | pre | post
216 | 17    | 2   | 15
72  | 3     | 10  | 8
51  | 2     | 4   | 12
671 | 31    | 12  | 23

par:
eid | docid | pre | post
3   | 1     | 1   | 21
28  | 2     | 8   | 14
182 | 5     | 3   | 7
96  | 4     | 6   | 4
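The pre/postorder labels stored in both indexes make structural predicates cheap: an ancestor-descendant test is just two comparisons, shown here as a minimal sketch.

```python
def is_ancestor(anc, desc):
    """anc, desc: (pre, post) labels of two elements in the same document.
    anc is an ancestor of desc iff it starts earlier and finishes later."""
    return anc[0] < desc[0] and anc[1] > desc[1]

# From the tables: in docid 2, sec element 46 has (pre, post) = (2, 15)
# and title[xml] element 51 has (4, 12).
print(is_ancestor((2, 15), (4, 12)))   # True: the title lies below the sec
print(is_ancestor((4, 12), (2, 15)))   # False
```

This is the test the in-memory structural joins evaluate pairwise within a document block.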
TopX Query Processing
Adapt the Threshold Algorithm (TA) paradigm [Buckley & Lewit, SIGIR '85; Güntzer et al., VLDB '00; Fagin et al., PODS '01]:
Focus on inexpensive sequential/sorted accesses
Postpone expensive random accesses

Candidate d = connected sub-pattern with element ids and scores
Incrementally evaluate path constraints using pre/postorder labels
In-memory structural joins (nested-loop, staircase, or holistic twig joins)

Upper/lower score guarantees per candidate; remember the set of evaluated dimensions E(d):
worstscore(d) = Σ_{i ∈ E(d)} score(ti, e)
bestscore(d) = worstscore(d) + Σ_{i ∉ E(d)} high_i

Early threshold termination with candidate queuing: stop if bestscore(d) ≤ min-k (the worst score in the current top-k) for every queued candidate d

Extensions:
Batching of sorted accesses & efficient queue management
Cost model for random access scheduling
Probabilistic candidate pruning for approximate top-k results [VLDB '04]
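The TA-style loop can be sketched as follows. This toy version (`topx_sketch` is a name invented here) handles only content conditions with round-robin sorted accesses and the worstscore/bestscore bookkeeping plus min-k termination; it omits random-access scheduling, batching, and the structural joins.

```python
def topx_sketch(lists, k):
    """lists: one posting list per query condition, each a list of
    (docid, score) pairs sorted by descending score."""
    worst = {}                        # worstscore(d): sum of seen scores
    seen = {}                         # E(d): conditions evaluated for d
    high = [l[0][1] for l in lists]   # high_i: bound for each unseen tail
    pos = [0] * len(lists)

    def best(d):
        # bestscore(d) = worstscore(d) + sum of high_i for i not in E(d)
        return worst[d] + sum(h for i, h in enumerate(high) if i not in seen[d])

    while any(p < len(l) for p, l in zip(pos, lists)):
        for i, l in enumerate(lists):            # one round of sorted accesses
            if pos[i] < len(l):
                docid, s = l[pos[i]]
                pos[i] += 1
                worst[docid] = worst.get(docid, 0.0) + s
                seen.setdefault(docid, set()).add(i)
                high[i] = l[pos[i]][1] if pos[i] < len(l) else 0.0
        ranked = sorted(worst, key=worst.get, reverse=True)
        mink = worst[ranked[k - 1]] if len(ranked) >= k else 0.0
        # stop early: neither queued candidates nor unseen docs can beat min-k
        if all(best(d) <= mink for d in ranked[k:]) and sum(high) <= mink:
            break
    return [(d, worst[d]) for d in sorted(worst, key=worst.get, reverse=True)[:k]]
```

At termination the returned top-k set is final even though some of its worstscores may still be partial; a full implementation would resolve the remaining dimensions by scheduled random accesses.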
TopX Query Processing By Example

sec[clustering]:
eid | docid | score | pre | post
46  | 2     | 0.9   | 2   | 15
9   | 2     | 0.5   | 10  | 8
171 | 5     | 0.85  | 1   | 20
84  | 3     | 0.1   | 1   | 12

title[xml]:
eid | docid | score | pre | post
216 | 17    | 0.9   | 2   | 15
72  | 3     | 0.8   | 10  | 8
51  | 2     | 0.5   | 4   | 12
671 | 31    | 0.4   | 12  | 23

par[evaluation]:
eid | docid | score | pre | post
3   | 1     | 1.0   | 1   | 21
28  | 2     | 0.8   | 8   | 14
182 | 5     | 0.75  | 3   | 7
96  | 4     | 0.75  | 6   | 4

[Figure: animated run of the algorithm — round-robin sorted accesses over the three lists; the candidate queue tracks (worstscore, bestscore) per document, e.g. doc2 starts at worst=0.9/best=2.9 and ends at worst=2.2/best=2.2, while a pseudo-candidate with worst=0.0 bounds all unseen documents; the top-2 threshold min-2 rises from 0.0 to 0.5, 0.9, and finally 1.6, after which every remaining candidate has bestscore ≤ min-2 and the run stops with doc2 (score 2.2) and doc5 (score 1.6) as the top-2 results]
CO.Thorough
Element-granularity
Turn query into pseudo CAS query using “//*”
No post-filtering on specific element types
nxCG@10 = 0.0379 (rank 22 of 55)
MAP = 0.008 (rank 37 of 55)
Old INEX_eval: MAP=0.058 (rank 3)
COS.Fetch&Browse
Document-granularity
Rank documents according to their best target element
Strict evaluation of support & target elements
Return all target elements per doc using the document score (no overlap)
MAP = 0.0601 (rank 4 of 19)
SSCAS
Element-granularity with strict support & target elements (no overlap)
nxCG@10 = 0.45 (ranks 1 & 2 of 25)
MAP = 0.0322 & 0.0272 (ranks 1 & 6 )
Top-k Efficiency
[Chart: #SA + #RA (0 to 12,000,000) as a function of k for Join&Sort, StructIndex+, StructIndex, BenProbe, and MinProbe]

Results for k = 10, ε = 0.0 (unless noted):
TopX – MinProbe: CPU sec = 0.03, #SA = 635,507, #RA = 64,807
TopX – BenProbe: CPU sec = 0.07, #SA = 723,169, #RA = 84,424
TopX – BenProbe (k = 1,000): CPU sec = 0.35, #SA = 1,902,427, #RA = 882,929, P@k = 0.03, relPrec = 1.00
StructIndex: CPU sec = 0.37, #SA = 761,970, #RA = 325,068
StructIndex+: CPU sec = 1.87, #SA = 5,074,384, #RA = 77,482, P@10 = 0.34, relPrec = 1.00
Join&Sort: CPU sec = 0.26, #SA = 109,122,318
Probabilistic Pruning
[Chart: relPrec, P@10, and MAP, plus #SA + #RA (0 to 800,000), as a function of the pruning parameter ε for TopX – MinProbe]

TopX – MinProbe, k = 10:
ε    | #SA     | #RA    | CPU sec | P@10 | relPrec
0.00 | 635,507 | 64,807 | 0.03    | 0.34 | 1.00
0.25 | 392,395 | 56,952 | 0.05    | 0.34 | 0.77
0.50 | 231,109 | 48,963 | 0.02    | 0.31 | 0.65
0.75 | 102,118 | 42,174 | 0.01    | 0.33 | 0.51
1.00 | 36,936  | 35,327 | 0.01    | 0.30 | 0.38
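The pruning idea can be illustrated with a toy predicate. TopX [VLDB '04] estimates the probability that a candidate's final score exceeds min-k from convolutions of score distributions; the sketch below instead assumes, purely for illustration, a uniform distribution on [0, high_i] for each unevaluated dimension and estimates the probability by Monte Carlo (`keep_candidate` is a name invented here).

```python
import random

def keep_candidate(worstscore, highs_unseen, mink, eps, trials=20_000, seed=42):
    """Keep candidate d iff the estimated P[final score of d > min-k] >= eps.
    highs_unseen: high_i bounds for the conditions not yet evaluated for d."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # draw a hypothetical completion of d's score in each unseen dimension
        total = worstscore + sum(rng.uniform(0.0, h) for h in highs_unseen)
        hits += total > mink
    return hits / trials >= eps        # prune when P[score > min-k] < eps

print(keep_candidate(0.5, [0.4, 0.3], mink=0.6, eps=0.1))  # True: likely to beat min-k
print(keep_candidate(0.1, [0.2], mink=0.9, eps=0.1))       # False: cannot reach min-k
```

Raising ε prunes candidates earlier, which is exactly the #SA/#RA reduction the table shows, at the price of lower relPrec.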
Conclusions & Ongoing Work
Efficient and versatile TopX query processor:
Extensible framework for text, semi-structured & structured data

Probabilistic extensions:
Probabilistic cost model for random access scheduling
Very good precision/runtime ratio for probabilistic candidate pruning

Full NEXI support:
Phrase matching, mandatory terms "+", negation "-", attributes "@"
Query weights (e.g., relevance feedback, ontological similarities)

Scalability:
Optimized for runtime, exploits cheap disk space (redundancy factor 4-5 for INEX)
Participated in the TREC Terabyte Efficiency Task

Dynamic and self-tuning query expansions [SIGIR '05]:
Incrementally merges inverted lists for a set of active expansions

Vague Content & Structure (VCAS) queries (maybe next year..)