XSEarch XML Search Engine Jonathan MAMOU October 2002.


XSEarch: XML Search Engine

Jonathan MAMOU

October 2002

Motivation

XML
• Getting popular
• Allows meta-data to be embedded into documents
• Data-centric view: exchange format for structured data (meta-data)
• Document-centric view: content (text, meta-data)

Querying data and meta-data

Example page (amazing.com): Buy our Classic Children’s books.
• One Fish Two Fish by John Meyer & Peter Smith. Costs Only: $7.95
• Goodnight Moon by Margaret Brown. Costs Only: $10.55
• Brown Bear by Bill Martin Jr. Costs Only: $6.00

<bookinfo>
  <book>
    <title>One Fish Two Fish</title>
    <author>John Meyer</author>
    <author>Peter Smith</author>
    <price>7.95</price>
  </book>
  <book>
    <title>Goodnight Moon</title>
    <author>Margaret Brown</author>
    <price>10.55</price>
  </book>
  ....
</bookinfo>

A query

Find titles and prices of books by ‘Meyer’ or ‘Smith’

IR Approach

How to deal with tags?
• Discard all tags
  • Simplicity
  • Loss of information (structure), lower retrieval performance
• Keep tags as keywords

How to write the query?
• “Title price book author Meyer Smith”

IR Approach (cont’d)

• Can’t specify that Meyer and Smith are the authors
• Can’t specify that title, price and author belong to the same book
• Can’t specify the desired output (i.e., titles, prices)

Database approach

FOR $b IN document(“bib.xml”)//book
WHERE $b/author contains ‘Meyer’ OR $b/author contains ‘Smith’
RETURN <result>
         <title> $b/title </title>
         <price> $b/price </price>
       </result>

• Difficult for naive users
• Requires knowledge of document structure
• Dependent on document structure

Our Goal

• Combine IR and database techniques: tags + text
• Simple language
• Logical structure, not physical
• Require knowledge of tag names, not structure
• Queries should work even if structure changes
• Rank results

Framework

Tree Representation

bookinfo
  book
    title: Just Lost
    author: Mercy Meyer
    author: Gina Meyer
    price: $5.75
  book
    title: Brown Bear
    price: $13.95

We need to find tuples of related title and price nodes.

Another Tree Representation

[Figure: a bookinfo tree with author nodes (name "Dr. Meyer", name "M. Brown") and book nodes (title "Goodnight Moon"; title "One Fish Two Fish", price "$12.50"; title "Cat in the Hat", price "$14.95").]

Similar document, but with different hierarchical structure from the previous.

We need to find tuples of related title, author and price nodes.

Interconnection

Consider a title and price node.

Intuition: The nodes belong to different book entities.

[Figure: the bookinfo tree (book: title "Just Lost", name "Mercy Meyer", name "Gina Meyer", price "$5.75"; book: title "Brown Bear", price "$13.95") with a title node and a price node circled and the lowest common ancestor of the circled nodes marked.]

Interconnection (cont’d)

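Intuition: The nodes belong to the same book entity.

[Figure: the same bookinfo tree (book: title "Just Lost", name "Mercy Meyer", name "Gina Meyer", price "$5.75"; book: title "Brown Bear", price "$13.95") with a title node and a price node circled and the lowest common ancestor of the circled nodes marked.]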


Relationship tree

• Nodes n1, n2
• n: their lowest common ancestor
• Tn: the subtree rooted at n
• The relationship tree of n1, n2 is the tree obtained by pruning from Tn all nodes other than n1, n2 that are not ancestors of n1, n2; what remains is the two paths from n down to n1 and to n2
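The definition can be sketched in a few lines of Python. This is only an illustration: the tree is assumed to be given as a hypothetical `parent` dictionary (child id to parent id, `None` for the root), and the node ids are made up to resemble the book example.

```python
def path_to_root(node, parent):
    """Return [node, parent(node), ..., root]."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def relationship_tree(n1, n2, parent):
    """Return (lowest common ancestor, set of nodes in the relationship tree)."""
    up1 = path_to_root(n1, parent)
    up2 = path_to_root(n2, parent)
    on_up2 = set(up2)
    # First node on n1's upward path that is also an ancestor-or-self of n2.
    lca = next(n for n in up1 if n in on_up2)
    # Keep only the two paths from the lowest common ancestor down to n1 and n2.
    nodes = set(up1[: up1.index(lca) + 1]) | set(up2[: up2.index(lca) + 1])
    return lca, nodes


# Hypothetical ids: two books under one bookinfo root.
parent = {"bookinfo": None, "book1": "bookinfo", "book2": "bookinfo",
          "title1": "book1", "price1": "book1",
          "title2": "book2", "price2": "book2"}

lca_same, tree_same = relationship_tree("title1", "price1", parent)
lca_diff, tree_diff = relationship_tree("title1", "price2", parent)
```

For a title and price of the same book the lowest common ancestor is the book node; for nodes of different books it is the root, and the relationship tree then contains both book nodes.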

Interconnection

We say that n1, n2 are interconnected if
• the relationship tree does not contain 2 distinct nodes with the same label, or
• the relationship tree contains exactly one pair of distinct nodes with the same label, and this pair consists of n1, n2
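A direct, unoptimized reading of this definition might look as follows. The `parent` and `label` dictionaries and all node ids are assumptions made for the sketch, not part of XSEarch itself.

```python
from collections import Counter


def ancestors_or_self(node, parent):
    """Return [node, parent(node), ..., root]."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def interconnected(n1, n2, parent, label):
    up1, up2 = ancestors_or_self(n1, parent), ancestors_or_self(n2, parent)
    on_up2 = set(up2)
    lca = next(n for n in up1 if n in on_up2)
    # Relationship tree: the two paths from the lowest common ancestor.
    tree = set(up1[: up1.index(lca) + 1]) | set(up2[: up2.index(lca) + 1])
    counts = Counter(label[n] for n in tree)
    repeated = [lab for lab, c in counts.items() if c > 1]
    if not repeated:
        return True  # no two distinct nodes share a label
    # Otherwise exactly one pair may share a label, and it must be (n1, n2).
    return (n1 != n2 and len(repeated) == 1 and counts[repeated[0]] == 2
            and label[n1] == label[n2] == repeated[0])


# Hypothetical node ids for a tree shaped like the slides' book example.
parent = {"bookinfo": None, "b1": "bookinfo", "b2": "bookinfo",
          "t1": "b1", "a1": "b1", "a2": "b1", "p1": "b1",
          "n1": "a1", "n2": "a2", "t2": "b2", "p2": "b2"}
label = {"bookinfo": "bookinfo", "b1": "book", "b2": "book",
         "t1": "title", "a1": "author", "a2": "author", "p1": "price",
         "n1": "name", "n2": "name", "t2": "title", "p2": "price"}
```

With this data, a title and price of the same book are interconnected, a title and price of different books are not (two book nodes appear in the relationship tree), and two author siblings are interconnected because they are the only pair sharing a label.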

All-Pairs Interconnection

A set of nodes is all-pairs interconnected if every pair of nodes in the set is interconnected

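Star interconnection

[Figure: the bookinfo tree in which the first book has title "Just Lost", two author nodes each with a name child ("Mercy Meyer", "Gina Meyer") and price "$5.75"; the second book has title "Brown Bear" and price "$13.95".]

The 2 names are not interconnected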

Star Interconnection (cont’d)

A set of nodes is star interconnected if all the nodes in the set are interconnected to the same node
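The difference between the two notions can be illustrated with a small sketch. Both checks take an arbitrary `connected(a, b)` predicate; the toy relation below is hypothetical and only mirrors the example of a title and two names that are not interconnected with each other. The sketch also assumes that the common "center" node is itself a member of the set, which the slide does not state explicitly.

```python
from itertools import combinations


def all_pairs_interconnected(nodes, connected):
    """Every pair of nodes in the set is interconnected."""
    return all(connected(a, b) for a, b in combinations(nodes, 2))


def star_interconnected(nodes, connected):
    """Some node of the set (the center) is interconnected with all the others.

    Assumption: the center is taken from the set itself.
    """
    return any(all(connected(c, n) for n in nodes if n != c) for c in nodes)


# Toy symmetric relation: the title is interconnected with each name,
# but the two names are not interconnected with each other.
links = {frozenset(p) for p in [("title", "name1"), ("title", "name2")]}


def connected(a, b):
    return frozenset((a, b)) in links


nodes = ["title", "name1", "name2"]
is_all_pairs = all_pairs_interconnected(nodes, connected)
is_star = star_interconnected(nodes, connected)
```

Here the set is star interconnected (with the title as center) but not all-pairs interconnected, so the star notion is the weaker of the two.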

Search terms, Search query

Search Term (l,k)
• l: label (context)
• k: keyword

Search Query AND:L1 OR:L2
• L1, L2: lists of search terms
• E.g.: AND:(title,)(price,) OR:(author,Meyer)(author,Smith)

Answer AND:N1 OR:N2
• N1, N2 are lists of nodes
• Matching between N1, N2 and L1, L2
• N1 and N2 are interconnected

All all-pairs answers are star answers
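As a rough illustration of the query model, the sketch below represents search terms and tests whether a single node matches a term. The matching rule is an assumption made for the example (the label must equal the node's tag when given, and the keyword must occur as a word of the node's text when given); the slides do not spell out the exact rule.

```python
from typing import NamedTuple, Optional


class SearchTerm(NamedTuple):
    label: Optional[str]    # l: the context (tag name), None if unspecified
    keyword: Optional[str]  # k: the keyword, None if unspecified


def matches(term, node_label, node_text):
    """Assumed rule: both the label and the keyword, when present, must match."""
    label_ok = term.label is None or term.label == node_label
    keyword_ok = (term.keyword is None
                  or term.keyword.lower() in node_text.lower().split())
    return label_ok and keyword_ok


# AND:(title,)(price,) OR:(author,Meyer)(author,Smith)
and_terms = [SearchTerm("title", None), SearchTerm("price", None)]
or_terms = [SearchTerm("author", "Meyer"), SearchTerm("author", "Smith")]
```

An author node with text "Mercy Meyer" matches the first OR term but not the second, while any title node matches `(title,)` regardless of its text.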

Maximal answer

[Figure: the bookinfo tree (book: title "Just Lost", author "Mercy Meyer", author "Gina Meyer", price "$5.75"; book: title "Brown Bear", price "$13.95").]

Example

(title,) (price,) (author,Meyer)

Find matchings of title, author and price to the nodes in the tree

[Figure: table of matchings with columns title, author and price; a null entry appears in the table.]

Computing answers

All-pairs
• Determining whether the set of answers is empty is NP-complete
• If L1 is empty, computing the set of answers is polynomial in the size of input and output

Star
• Computing the set of answers is polynomial in the size of input and output

Ranking results

Unstructured
• Keyword weight (tfilf)
• Tag weight
• Result size

Structured
• Node distance
• Ancestor-descendant

Keyword Weight

• Compute the weight of a keyword k within a given node n
• Variation of tfidf, one of the metrics of the Vector Space Model (classical model in IR)

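Keyword Weight (cont’d)

• Term Frequency (tf): number of appearances of k within n, relative to the most frequent keyword k’ in n
  tf(k,n) = occ(k,n) / max occ(k’,n)
• Inverse Leaf Frequency (ilf): inverse frequency of k among all the leaves in the corpus
  ilf(k) = log(1 + N/Nk), where N is the number of leaves in the corpus and Nk the number of leaves containing k
• W(k,n) = tf(k,n) * ilf(k)
• Normalized per leaf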
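The formulas above can be tried out with a short sketch. It assumes each leaf is already tokenized into a list of lowercase words, takes N as the number of leaves and Nk as the number of leaves containing k, and leaves out the final per-leaf normalization step; the sample leaves are made up.

```python
import math
from collections import Counter


def tf(keyword, leaf_words):
    """occ(k, n) divided by the count of the most frequent keyword in n."""
    counts = Counter(leaf_words)
    return counts[keyword] / max(counts.values())


def ilf(keyword, leaves):
    """log(1 + N / Nk); assumes the keyword occurs in at least one leaf."""
    n_k = sum(1 for words in leaves if keyword in words)
    return math.log(1 + len(leaves) / n_k)


def weight(keyword, leaf_words, leaves):
    return tf(keyword, leaf_words) * ilf(keyword, leaves)


# Hypothetical corpus of four text leaves.
leaves = [["brown", "bear"], ["goodnight", "moon"],
          ["brown", "brown", "bear"], ["just", "lost"]]
w = weight("bear", leaves[2], leaves)
```

In the third leaf "bear" occurs once and the most frequent word twice, so tf is 1/2; "bear" occurs in 2 of the 4 leaves, so ilf is log(1 + 4/2) = log 3.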

Tag Weight

• Give weight to tags according to their importance
• E.g. give more weight to <title> than to <abstract>

Result Size

• Number of search terms appearing in the result (OR part)

Ranking: Structured

• Node distance: size of the relationship tree
• Ancestor-descendant relationship: “more” interconnected

System overview

[Figure: XSEarch overview. Offline: the Indexer processes the XML corpus with logical hierarchy. Online: Search takes a query and returns results.]

Document Location Array

• Generate a unique id, did, for each document
• Associate each did with the physical location of the corresponding document

Logical structure of the corpus

Node Encoding Array

• Generate for each interior node an id, nid
• Node encoding, defined recursively:
  • Node encoding of its parent
  • Index of the node among its siblings
  • E.g.: 13.8.1.9
• Associate each nid with its node encoding
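One benefit of such an encoding is that ancestor tests and lowest common ancestors reduce to prefix operations on the encodings. The sketch below is an illustration only: the nested-tuple tree format, the 1-based sibling index and the root encoding "1" are assumptions, not taken from the slides.

```python
def encode(tree, prefix="1"):
    """tree is (label, [children]); yields (encoding, label) in document order."""
    label, children = tree
    yield prefix, label
    for i, child in enumerate(children, start=1):
        # Encoding of a node = encoding of its parent + its index among siblings.
        yield from encode(child, f"{prefix}.{i}")


def is_ancestor(a, b):
    """True if the node encoded as a is a proper ancestor of the node encoded as b."""
    return b.startswith(a + ".")


def lowest_common_ancestor(a, b):
    """Longest common prefix of the two encodings, component by component."""
    common = []
    for x, y in zip(a.split("."), b.split(".")):
        if x != y:
            break
        common.append(x)
    return ".".join(common)


doc = ("bookinfo", [("book", [("title", []), ("author", []), ("price", [])]),
                    ("book", [("title", []), ("price", [])])])
encodings = list(encode(doc))
```

Comparing whole components (note the trailing "." in `is_ancestor`) avoids treating "1.1" as an ancestor of "1.10.3".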

Node Label Array

• Associate each nid with its label

Inverted Tag Index

For each tag, keep
• posting list: list of nodes labeled with this tag
• weight

tag → Nid1, Nid2, Nid3

Inverted Keyword Index

For each keyword (kw), keep
• posting list: list of leaves containing this keyword
• weight of the kw within the leaf (tfilf)

kw → (Nid1,w1), (Nid2,w2), (Nid3,w3)
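Both inverted indexes can be sketched with plain dictionaries of posting lists. The input format (one triple per node with its tag and precomputed keyword weights) and the sample weights are hypothetical, and the per-tag weight is left out for brevity.

```python
from collections import defaultdict


def build_indexes(nodes):
    """nodes: iterable of (nid, tag, {keyword: weight}) triples."""
    tag_index = defaultdict(list)      # tag -> [nid, ...]
    keyword_index = defaultdict(list)  # keyword -> [(nid, weight), ...]
    for nid, tag, weights in nodes:
        tag_index[tag].append(nid)
        for kw, w in weights.items():
            keyword_index[kw].append((nid, w))
    return tag_index, keyword_index


# Made-up nodes and weights, for illustration only.
nodes = [(1, "title", {"brown": 0.5, "bear": 0.5}),
         (2, "author", {"brown": 0.9}),
         (3, "title", {"moon": 0.7})]
tag_index, keyword_index = build_indexes(nodes)
```

A search term (title, brown) could then be answered by intersecting the posting list of the tag with the node ids in the posting list of the keyword.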

Node Interconnection Matrix

• Element ij contains 1 if ni and nj are interconnected, 0 otherwise
• n*n symmetric sparse matrix
• Dynamic programming

Alternative
• Hash set: keep only interconnected nodes
• Key: pair (ni, nj)
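The hash-set alternative might be sketched as below. The class name and its methods are made up for this example; the only idea taken from the slide is that a pair (ni, nj) is stored as a key when, and only when, the two nodes are interconnected.

```python
class InterconnectionSet:
    """Stores only interconnected pairs; an absent pair plays the role of a 0 entry."""

    def __init__(self):
        self._pairs = set()

    @staticmethod
    def _key(ni, nj):
        # The relation is symmetric, so store each pair in a canonical order.
        return (ni, nj) if ni <= nj else (nj, ni)

    def add(self, ni, nj):
        self._pairs.add(self._key(ni, nj))

    def __contains__(self, pair):
        return self._key(*pair) in self._pairs

    def __len__(self):
        return len(self._pairs)


index = InterconnectionSet()
index.add(7, 2)
index.add(2, 7)   # same pair, stored once
index.add(3, 5)
```

Compared with the n*n matrix, memory use grows with the number of interconnected pairs rather than with the square of the number of nodes.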

Interconnection

• Let n be the number of nodes
• It is possible to determine whether n1 and n2 are interconnected in O(n) time
• It is possible to determine interconnection of all pairs in O(n^2)
• Offline/Online computation

Interconnection

for (i = size-1; i >= 0; i--)
  for (j = i+1; j < size; j++)
    if i is an ancestor of j
      // iChild: the child of i on the path to j; jFather: the parent of j
      connected(i,j) = connected(iChild,j) AND connected(i,jFather) AND
                       labelIChild != labelJ AND labelI != labelJFather
  for (j = i+1; j < size; j++)
    if i is not an ancestor of j
      // iFather: the parent of i
      connected(i,j) = connected(i,jFather) AND connected(iFather,j) AND
                       labelI != labelJFather AND labelIFather != labelJ
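The two recurrences can be checked on a small tree with the following sketch. It evaluates them by memoized recursion rather than by the explicit loop order of the slide, and it adds a base case that the slide does not show: a node is taken to be interconnected with itself and with its parent, which follows from the definition. The array-based tree and the node numbering are assumptions for the example.

```python
from functools import lru_cache


def make_connected(parent, label):
    """parent[i] is the parent index of node i (None for the root); label[i] its tag."""

    def is_ancestor(a, b):
        """True if a is a proper ancestor of b."""
        b = parent[b]
        while b is not None:
            if b == a:
                return True
            b = parent[b]
        return False

    def child_towards(a, b):
        """Child of a on the path from a down to its descendant b."""
        while parent[b] != a:
            b = parent[b]
        return b

    @lru_cache(maxsize=None)
    def connected(i, j):
        if i == j or parent[i] == j or parent[j] == i:
            return True  # assumed base case: same node or parent-child pair
        if is_ancestor(j, i):
            i, j = j, i  # normalize so that an ancestor, if any, comes first
        if is_ancestor(i, j):
            c, f = child_towards(i, j), parent[j]
            return (connected(c, j) and connected(i, f)
                    and label[c] != label[j] and label[i] != label[f])
        fi, fj = parent[i], parent[j]
        return (connected(i, fj) and connected(fi, j)
                and label[i] != label[fj] and label[fi] != label[j])

    return connected


# Nodes in document order:
# 0 bookinfo, 1 book, 2 title, 3 author, 4 name, 5 author, 6 name, 7 price,
# 8 book, 9 title, 10 price
parent = [None, 0, 1, 1, 3, 1, 5, 1, 0, 8, 8]
label = ["bookinfo", "book", "title", "author", "name", "author", "name",
         "price", "book", "title", "price"]
connected = make_connected(parent, label)
```

Each recursive call shortens the tree distance between the two nodes by one, so the recursion terminates, and the cache ensures every pair is evaluated at most once.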

Demo