XSEarch: A Semantic Search Engine for XML

Sara Cohen, Jonathan Mamou, Yaron Kanza, Yehoshua Sagiv

The Hebrew University of Jerusalem

Presented by Deniz Kasap & Sarp Baran Özkan

XSEarch: an XML Search Engine

Goal:

Find the “relevant” XML fragments,

given tag names and keywords

Introduction

• It is becoming increasingly popular to publish data on the Web in the form of XML documents.

• Current search engines, which are an indispensable tool for finding HTML documents, have two main drawbacks when it comes to searching for XML documents.

– It is not possible to pose queries that explicitly refer to XML tags.

– Search engines return references (i.e., links) to documents and not to specific fragments thereof. This is problematic, since large XML documents may contain thousands of elements storing many pieces of information that are not necessarily related to each other.

Excerpt from the XML Version of DBLP

<proceedings>

<inproceedings>

<author>Moshe Y. Vardi</author>

<title>Querying Logical Databases</title>

</inproceedings>

<inproceedings>

<author>Victor Vianu</author>

<title>A Web Odyssey: From Codd to XML</title>

</inproceedings>

</proceedings>

A Search Example

Find papers by Vianu on the topic of “logical databases”

How can we find such papers?

Attempt 1: Standard Search Engine

A document containing some of the three query terms is considered a result.

The DBLP excerpt above contains the three query terms. Hence, it is returned by a standard search engine. BUT it does not contain any paper on “logical databases” by Vianu:

– The first inproceedings fragment (Vardi, “Querying Logical Databases”) does not represent a paper by Vianu.

– The second inproceedings fragment (Vianu, “A Web Odyssey: From Codd to XML”) does not represent a paper about logical databases.

The document is not relevant to the query.

This does not work!!!

• Since a reference to a whole XML document is usually not a useful answer, the granularity of the search should be refined.

• Instead of returning an entire document, an XML search engine should return fragments of XML documents.

• A query language for XML, such as XQuery, can be used to extract data from XML documents.

• However, such a query language is not an alternative to an XML search engine for several reasons.

– The syntax of XQuery is more complicated than the syntax of a standard search query. Hence, it is not appropriate for a naive user.

– Extensive knowledge of the document structure is required in order to correctly formulate a query. Thus, queries must be formulated on a per-document basis.

– XQuery lacks any mechanism for ranking answers.

Attempt 2: XML Query Language

for $i in doc("bib.xml")//inproceedings
where $i/author[contains(., "Vianu")]
  and contains($i/title, "Logical")
  and contains($i/title, "Databases")
return <result>{ $i/author }{ $i/title }</result>

This does work, BUT

• Complicated syntax

• Extensive knowledge of the document structure required to write the query

• No mechanism for ranking results

Our Requirements from the Search Tool

• A simple syntax that can be used by naive users

• Search results should include XML fragments and not necessarily full documents

• The XML fragments in an answer should be semantically related

– For example, a paper and an author should be in an answer only if the paper was written by this author

• Search results should be ranked

• Search results should be returned in “reasonable” time

• The design and implementation of XSEarch involved several challenges.

– A syntax that is suitable for a naive user was designed.

– The theoretical results were adapted so that XSEarch always returns semantically related fragments as answers.

– Answers are highly relevant to the keywords of the query.

– A suitable ranking mechanism that takes into account both the degree of the semantic relationship and the relevance of the keywords has been developed.

– Index structures and evaluation algorithms that allow the system to deal efficiently with large documents have been developed.

– The implementation of XSEarch is extensible in the sense that it can easily accommodate different types of semantic relationships.

Query Syntax

• The query language of a standard search engine is simply a list of keywords.

• Keywords with a plus (+) sign must appear in a satisfying document, whereas keywords without a plus sign may or may not appear in a satisfying document (but the appearance of such keywords is desirable).

• The query language of XSEarch is a simple extension of the language described above. In addition to keywords, the user can specify labels and keyword-label combinations that must or may appear in a satisfying document.

• A search term may have a plus sign prepended, in which case it is a required term. Otherwise, it is an optional term.

• We use t, t1, t2, etc., as an abstract notation for required and optional terms.

• A query has the form Q(S) where S = t1,...,tm is a sequence of required and optional search terms.

Formally, a search term has one of the forms

l:k, l: or :k

where l is a label and k is a keyword.

Example

• Find papers by Vianu on the topic of “logical databases”

logical +database inproceedings: author:Vianu

– Appearance of logical in the fragment increases the rank of this fragment

– The keyword database must appear in the fragment

– Appearance of the tag inproceedings in the fragment increases the rank of this fragment

– Appearance of Vianu under the tag author in the fragment increases the rank of this fragment

Note that the different document fragments matching these query terms must be “semantically related”
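The term syntax above can be illustrated with a small parser. This is only a sketch under stated assumptions: the names `SearchTerm`, `parse_term` and `parse_query` are hypothetical, and a bare keyword such as `logical` is assumed to be shorthand for `:logical`, as the example query suggests.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SearchTerm:
    label: Optional[str]    # l in "l:k" or "l:"; None for ":k"
    keyword: Optional[str]  # k in "l:k" or ":k"; None for "l:"
    required: bool          # True when the term has a leading "+"


def parse_term(text: str) -> SearchTerm:
    """Parse one search term such as '+database', 'inproceedings:' or 'author:Vianu'."""
    required = text.startswith("+")
    if required:
        text = text[1:]
    if ":" in text:
        label, keyword = text.split(":", 1)
    else:
        # Assumption: a bare keyword k is treated like ":k".
        label, keyword = "", text
    if not label and not keyword:
        raise ValueError("a search term needs a label, a keyword, or both")
    return SearchTerm(label or None, keyword or None, required)


def parse_query(query: str) -> List[SearchTerm]:
    """Split a query on whitespace and parse each term."""
    return [parse_term(t) for t in query.split()]


terms = parse_query("logical +database inproceedings: author:Vianu")
for term in terms:
    print(term)
```

Only `database` comes out as a required term here; the other three terms are optional and would merely raise the rank of a fragment that matches them.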

Query Semantics

• This section presents the semantics of our queries.

• In order to satisfy a query Q, each of the required terms in Q must be satisfied.

• In addition, the elements satisfying Q must be meaningfully related.

XSEarch: author:Vianu title:

Good Result!

<inproceedings>

<author>Victor Vianu</author>

<title>A Web Odyssey: From Codd to XML</title>

</inproceedings>

The title and author elements ARE semantically related: both belong to the same inproceedings element.

Bad Result!

An answer that pairs <author>Victor Vianu</author> with a title taken from a different inproceedings element.

The title and author elements ARE NOT semantically related.

Satisfaction of a Search Term

• XML documents are modeled as trees in the standard fashion.

– Each interior node is associated with a label and each leaf node is associated with a sequence of keywords.

– If k is a keyword in the sequence associated with n, we say that n contains k.

• In Figure 1 there is a tree that represents a small portion of the Sigmod Record.

• We will refer to this tree as Tsr

• Let n be an interior node in a tree T.

• We say that n satisfies the search term

– l:k if n is labeled with l and has a descendant that contains the keyword k.

– l: if n is labeled with l.

– :k if n has a leaf child that contains the keyword k.

• Example: In the tree Tsr,

– node number 14 satisfies :Kempster.

– node number 9 satisfies authors:Kempster.

– node 9 does not satisfy :Kempster, position: or :position.
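The three satisfaction rules can be sketched in code. This is a minimal illustration, not the system's implementation: the `Node` class and the function names are hypothetical, leaves are modeled as nodes without a label, and the two-node tree at the end is a made-up stand-in for nodes 9 and 14 of Tsr (the figure itself is not reproduced here).

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Node:
    label: Optional[str] = None                        # interior nodes carry a label
    keywords: List[str] = field(default_factory=list)  # leaf nodes carry keywords
    children: List["Node"] = field(default_factory=list)

    @property
    def is_leaf(self) -> bool:
        return self.label is None


def leaves_below(n: Node) -> Iterator[Node]:
    """Yield every leaf descendant of n."""
    for child in n.children:
        if child.is_leaf:
            yield child
        else:
            yield from leaves_below(child)


def satisfies(n: Node, label: Optional[str], keyword: Optional[str]) -> bool:
    """Check whether interior node n satisfies the term l:k, l: or :k."""
    if label is None and keyword is None:
        raise ValueError("a search term needs a label, a keyword, or both")
    if n.is_leaf:
        return False                      # only interior nodes satisfy terms
    if label is not None and n.label != label:
        return False
    if keyword is None:
        return True                       # form "l:"
    if label is not None:                 # form "l:k": any descendant leaf
        return any(keyword in leaf.keywords for leaf in leaves_below(n))
    # form ":k": the keyword must sit in a leaf that is a direct child
    return any(c.is_leaf and keyword in c.keywords for c in n.children)


# Hypothetical stand-ins for nodes 14 (author) and 9 (authors) of Tsr.
author = Node(label="author", children=[Node(keywords=["Kempster"])])
authors = Node(label="authors", children=[author])

print(satisfies(author, None, "Kempster"))        # :Kempster on the author node
print(satisfies(authors, "authors", "Kempster"))  # authors:Kempster
print(satisfies(authors, None, "Kempster"))       # :Kempster on the authors node
```

The difference between `l:k` and `:k` is the reason the last call fails: the keyword is a descendant of the authors node but not in one of its leaf children.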

Meaningfully Related Sets of Nodes

• Let T be a tree and R be a binary, reflexive and symmetric relationship on the nodes in T.

• We assume that R contains pairs of nodes that are meaningfully related.

• We present two different ways to extend R to arbitrary sets of nodes.

• A set of nodes N is all-pairs R-related if (n1,n2) is in R for every pair of nodes n1, n2 in N.

• This states that a set of nodes is meaningfully related if every pair of nodes in the set is meaningfully related.

• N is star R-related if there is a node n* ∈ N such that the pair (n*,n) is in R for all nodes n ∈ N.

• This states that the nodes of a set are meaningfully related if all these nodes are meaningfully related to a node in the set.

• Depending on the structure of the documents, either the all-pairs relationship or the star relationship may be more appropriate.
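The two extensions of R can be written down directly. In this sketch the relation is passed in as a predicate `related(a, b)` that is assumed to be reflexive and symmetric; the function names and the toy relation are illustrative only.

```python
from itertools import combinations
from typing import Callable, Hashable, Iterable

Related = Callable[[Hashable, Hashable], bool]


def all_pairs_related(nodes: Iterable[Hashable], related: Related) -> bool:
    """True if every pair of nodes in the set is in R."""
    return all(related(a, b) for a, b in combinations(list(nodes), 2))


def star_related(nodes: Iterable[Hashable], related: Related) -> bool:
    """True if some node n* in the set is related to all nodes in the set."""
    nodes = list(nodes)
    return any(all(related(center, n) for n in nodes) for center in nodes)


# Toy relation: node 1 is related to 2 and to 3, but 2 and 3 are unrelated.
pairs = {(1, 2), (1, 3)}


def related(a: int, b: int) -> bool:
    return a == b or (a, b) in pairs or (b, a) in pairs


print(star_related([1, 2, 3], related))       # node 1 acts as the center
print(all_pairs_related([1, 2, 3], related))  # fails on the pair (2, 3)
```

Every all-pairs related non-empty set is also star related, so the star variant is the more permissive of the two.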

Query Answers

• Let Q(t1,…,tm) be a query.

• A sequence N = n1,…,nm of nodes and null values is an all-pairs R-answer for Q if the nodes in N are all-pairs R-related and, for all 1 ≤ i ≤ m:

– ni is not the null value if ti is a required term;

– ni satisfies ti if it is not the null value.

• Similarly, N is a star R-answer when the nodes in N are star R-related.

• We use:

– Ans_{a,R}(Q) to denote the set of all-pairs R-answers for the query Q over a tree T,

– Ans_{s,R}(Q) to denote the set of star R-answers for Q over T, and

– MaxAns_{a,R}(Q) to denote the set of maximal answers in Ans_{a,R}(Q).

The Interconnection Relationship

• We present a relation which can be used to determine whether a pair of nodes is meaningfully related.

• Let T be a tree and let n1 and n2 be nodes in T.

• The shortest undirected path between n1 and n2 consists of the paths from the lowest common ancestor of n1 and n2 to n1 and n2.

• We denote the tree consisting of these two paths as T|n1,n2.

• This tree describes the relationship between the nodes n1 and n2.

• For example in Tsr, the tree T|8,13 consists of the nodes 7, 8, 9, 12 and 13.

Relationship Trees

[Figure: the relationship tree of n1, n2, …, nk, rooted at the lowest common ancestor of n1, n2, …, nk]

Our “Semantic Relation”: Interconnection

• n1,..., nk are interconnected if either

– relationship tree of n1,..., nk does not contain two nodes with the same label, or

– the only nodes with the same label in the relationship tree of n1,..., nk, are among n1,..., nk
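The interconnection test can be sketched as follows. This is an illustration of the definition, not the system's algorithm: the `Node` class with a parent pointer and all function names are hypothetical, and all nodes are assumed to belong to the same document tree. The example tree mirrors the DBLP excerpt shown earlier, with a second author added to one paper.

```python
from collections import Counter
from typing import List, Optional, Set


class Node:
    def __init__(self, label: str, parent: Optional["Node"] = None):
        self.label = label
        self.parent = parent


def path_to_root(n: Optional[Node]) -> List[Node]:
    """Return n, its parent, ..., up to the root."""
    path = []
    while n is not None:
        path.append(n)
        n = n.parent
    return path


def relationship_tree(nodes: List[Node]) -> Set[Node]:
    """Nodes on the paths from the lowest common ancestor down to each given node."""
    paths = [path_to_root(n) for n in nodes]
    common = set(paths[0]).intersection(*paths[1:])
    lca = next(n for n in paths[0] if n in common)  # lowest node shared by all paths
    tree: Set[Node] = set()
    for path in paths:
        for n in path:
            tree.add(n)
            if n is lca:
                break
    return tree


def interconnected(nodes: List[Node]) -> bool:
    """True if every repeated label in the relationship tree belongs to a given node."""
    tree = relationship_tree(nodes)
    chosen = set(nodes)
    counts = Counter(n.label for n in tree)
    return all(counts[n.label] == 1 or n in chosen for n in tree)


proceedings = Node("proceedings")
paper1 = Node("inproceedings", proceedings)
author1, title1 = Node("author", paper1), Node("title", paper1)
paper2 = Node("inproceedings", proceedings)
author2, title2 = Node("author", paper2), Node("title", paper2)
coauthor2 = Node("author", paper2)

print(interconnected([author2, title2]))     # same paper
print(interconnected([author2, title1]))     # different papers
print(interconnected([author2, coauthor2]))  # same label, same paper
```

In the second call the relationship tree contains two inproceedings nodes that are not among the chosen nodes, which is exactly what the definition rules out.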

Example (1)

[Figure: document tree. The root proceedings has two inproceedings children: one with author “Moshe Y. Vardi” and title “Querying Logical Databases”, the other with author “Victor Vianu” and title “A Web Odyssey: From Codd to XML”. The relationship tree of the circled nodes is rooted at their lowest common ancestor.]

Circled nodes belong to different inproceedings entities. They ARE NOT interconnected!

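Example (2)

[Figure: document tree. The root proceedings has two inproceedings children, each with an author and a title child; the authors are “Moshe Y. Vardi” and “Victor Vianu”, and the title “A Web Odyssey: From Codd to XML” is shown. The relationship tree of the circled nodes is highlighted.]

Circled nodes belong to the same inproceedings entity. They ARE interconnected!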

Example (3)

[Figure: document tree. The root proceedings has two inproceedings children; one of them has the title “Queries and Computation on the Web” and two author children, “Victor Vianu” and “Serge Abiteboul”. The relationship tree of the circled nodes is highlighted.]

Circled nodes belong to the same inproceedings entity, but are labeled with the same tag.

They ARE interconnected.

Example 1 of Query Semantics

• Consider the query Q1 defined as

Q1(+title:, author:).

– The query Q1 finds pairs of titles and authors belonging to the same article.

– Only tuples where the title is non-null will be returned.

– The answers created for Tsr are

(8,10), (8,12), (8,14), (17,18) and (25, null)

Example 2 of Query Semantics

The answers for Q1 over this document would consist of

(6,3) and (6,4)

Query Processing

• Document fragments are extracted using the interconnection index and other indices

• Extracted fragments are returned ranked by the estimated relevance

[Figure: query processing architecture. Components: User, Query Processor, Ranker, Indexer, Index Repository (indices) and XML files.]

Ranking Factors

Several factors increase the rank of a result

• Similarity between query and result

• Weight of labels appearing in the result

• Characteristics of result tree

Query and Result Similarity

TFILF

– Extension of TFIDF, classical in IR

– Term Frequency: number of occurrences of a query term in a fragment

– Inverse Leaf Frequency: decreases with the leaf frequency of a query term, i.e., with the number of leaves containing the term divided by the number of leaves in the corpus

• tf: term frequency of keyword k in a leaf node nl

• ilf: inverse leaf frequency of keyword k

• TFILF is the product of tf and ilf
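The slide's formulas for tf and ilf did not survive, so the sketch below fills them in with assumptions modeled on common TF-IDF variants: tf is the number of occurrences of the keyword in the leaf, normalized by the most frequent keyword in that leaf, and ilf is log(1 + number of leaves / number of leaves containing the keyword). The function names and the tiny corpus are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def tf(keyword: str, leaf: Sequence[str]) -> float:
    """Assumed form: occurrences of keyword, normalized by the leaf's top count."""
    counts = Counter(leaf)
    if not counts:
        return 0.0
    return counts[keyword] / max(counts.values())


def ilf(keyword: str, leaves: List[Sequence[str]]) -> float:
    """Assumed form: log(1 + |leaves| / |leaves containing keyword|)."""
    containing = sum(1 for leaf in leaves if keyword in leaf)
    if containing == 0:
        return 0.0
    return math.log(1 + len(leaves) / containing)


def tfilf(keyword: str, leaf: Sequence[str], leaves: List[Sequence[str]]) -> float:
    """Product of tf and ilf, as stated on the slide."""
    return tf(keyword, leaf) * ilf(keyword, leaves)


# Each leaf is the keyword sequence of one text node (lowercased for the example).
leaves = [
    ["querying", "logical", "databases"],
    ["a", "web", "odyssey", "from", "codd", "to", "xml"],
    ["victor", "vianu"],
    ["moshe", "y", "vardi"],
]

print(tfilf("xml", leaves[1], leaves))
print(tfilf("xml", leaves[0], leaves))
```

A keyword that occurs in few leaves gets a large ilf, so matching it contributes more to the similarity between query and result than matching a common word.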

Weight of Labels

• Some labels are considered more important than others

– Text under an element labeled with title is more “important” than text under an element labeled with section

• Label weights can be

– system generated

– user defined

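Relationship between Nodes

• Size of the relationship tree: a small fragment indicates that its nodes are closer, and thus, probably, “more related”

Query: article: title:XML

[Figure: two fragments matching the query. In the first, title (containing XML) is a child of article (2 nodes). In the second, title (containing XML) lies under a section inside article (3 nodes).]

The 2-node fragment will obtain a higher rank.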

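Relationship between Nodes

• An ancestor-descendant relationship between a pair of nodes in a fragment indicates a “strong relation” between these nodes

Query: section: title:XML

[Figure: two fragments within an article, each containing a section node and a title node with the keyword XML. In one of them the section node is an ancestor of the title node.]

The fragment in which the section node is an ancestor of the title node will obtain a higher rank.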

Combining the Factors

Given a query Q and an answer N, we use the measures

• sim(Q,N),

• tsize(N)

• and anc-des(N)

to determine the ranking of the answer. We experimented with the following combination of factors by varying the values of α, β and γ:

\frac{sim(Q,N)^{\alpha}}{tsize(N)^{\beta}} \times (1 + \gamma \times anc\text{-}des(N))
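The combination can be computed as in the following sketch. The function name `rank` and the default values α = β = γ = 1 are assumptions made for illustration; the slides only say that the three parameters were varied, and the input values below are made up.

```python
import math


def rank(sim: float, tsize: int, anc_des: float,
         alpha: float = 1.0, beta: float = 1.0, gamma: float = 1.0) -> float:
    """sim^alpha / tsize^beta * (1 + gamma * anc_des); tsize must be positive."""
    if tsize <= 0:
        raise ValueError("tsize must be positive")
    return (sim ** alpha) / (tsize ** beta) * (1 + gamma * anc_des)


# Same similarity, different relationship-tree sizes: the smaller tree ranks higher.
small = rank(sim=0.8, tsize=2, anc_des=0)
large = rank(sim=0.8, tsize=3, anc_des=0)
print(small, large)

# An ancestor-descendant relationship boosts the rank of an otherwise equal answer.
print(rank(sim=0.8, tsize=2, anc_des=1))
```

Raising β penalizes large relationship trees more strongly, while γ controls how much an ancestor-descendant relationship is rewarded.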

System Implementation

The architecture of the XSEarch system is depicted in the following figure:

[Figure: XSEarch architecture. Components: User Interface, Query Processor, Ranker, Indexer, Index Repository (indices) and XML files.]

• The basic flow of information is as follows:

– The user enters a query using a browser.

– The Search-Query Processor parses the query into a list of search terms.

– The Index Repository is used to find nodes that satisfy the search terms and to find whether pairs of nodes are interconnected.

• It responds by checking the stored indices.

• If these indices do not contain sufficient information, the Indexer is used to augment the current indices.

• Once the relevant information is returned to the Search-Query Processor, it creates the answers, which are ranked, sorted and then returned.

– The Indexer creates several different indices in the Index Repository based on a set of XML documents.

• We focus on the most important and novel index structures:

– The interconnection index

– The path index

• The interconnection index allows for rapid checking of the interconnection relationship.

• The path index allows us to create answers with a higher estimated ranking first.

Dynamic Offline Interconnection Indexing

• Checking for interconnection of nodes online is expensive.

– Hence, we first build, offline, a node-interconnection index that stores information about the interconnection relationship between each pair of nodes.

– This requires solving the following problem:

• Given a document T, for all pairs of nodes n and n’ in T, determine whether n and n’ are interconnected.

– The algorithm that solves this problem is based on the following lemma:

• Lemma (Interconnection Characterization)

– Let T be a document and let n and n’ be nodes in T.

– If n is an ancestor of n’, then n and n’ are interconnected if and only if the following hold:

• The parent of n’ is strongly-interconnected with n;

• The child of n on the path to n’ is strongly-interconnected with n’.

– If n is not an ancestor of n’ and n’ is not an ancestor of n, then n and n’ are interconnected if and only if the following hold:

• The parent of n’ is strongly-interconnected with n;

• The parent of n is strongly-interconnected with n’.

• In the XSEarch system, we have explored the possibilities of storing the node-interconnection index in either a hashtable or a symmetric matrix.

• When implemented as a hashtable, the node-interconnection index contains pairs of ids of interconnected nodes.

• When implemented as a symmetric matrix, the node-interconnection index contains a boolean value for each pair of nodes, indicating whether they are interconnected or not.

• A comparison of the time and space efficiency of these structures is given in the experimental results.
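The two storage options can be contrasted with a small sketch. The class names `HashIndex` and `MatrixIndex` are hypothetical, node ids are assumed to be integers 0..n-1, and the matrix is stored as a lower triangle because the relation is symmetric; the actual Java data structures are not described on the slides.

```python
from typing import Set, Tuple


class HashIndex:
    """Hashtable variant: stores only the pairs of ids that are interconnected."""

    def __init__(self) -> None:
        self._pairs: Set[Tuple[int, int]] = set()

    def add(self, a: int, b: int) -> None:
        self._pairs.add((min(a, b), max(a, b)))  # normalize the unordered pair

    def interconnected(self, a: int, b: int) -> bool:
        return (min(a, b), max(a, b)) in self._pairs


class MatrixIndex:
    """Symmetric-matrix variant: one boolean per unordered pair of node ids."""

    def __init__(self, node_count: int) -> None:
        # Lower triangle including the diagonal: n * (n + 1) / 2 entries.
        self._cells = bytearray(node_count * (node_count + 1) // 2)

    @staticmethod
    def _position(a: int, b: int) -> int:
        hi, lo = max(a, b), min(a, b)
        return hi * (hi + 1) // 2 + lo

    def add(self, a: int, b: int) -> None:
        self._cells[self._position(a, b)] = 1

    def interconnected(self, a: int, b: int) -> bool:
        return self._cells[self._position(a, b)] == 1


hash_index = HashIndex()
matrix_index = MatrixIndex(node_count=6)
for index in (hash_index, matrix_index):
    index.add(2, 5)
    print(index.interconnected(5, 2), index.interconnected(2, 4))
```

The hashtable grows with the number of interconnected pairs, whereas the matrix always needs space quadratic in the number of nodes but answers a lookup with plain index arithmetic.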

Dynamic Online Interconnection Indexing

• Offline computation of the node-interconnection index may be expensive.

• In order to amortize the cost of computing this index over the queries received, we have also considered an online indexing method.

• When indexing online, for each pair of nodes n and n’, we compute the section of the node interconnection index corresponding to T|n,n’

• We use a hashtable to store the part of the index that has already been computed at any given moment.

• The hashtable contains a boolean value for each pair of nodes whose interconnection has already been checked.

• The boolean value indicates whether the nodes are interconnected or not.

• During query processing, usually only a small part of the node-interconnection index will be created, thus the slowdown in response time is not large.

• In addition, queries tend to be similar in the parts of the document that they must access.

• Therefore, even after many queries have been evaluated, it is likely for the node-interconnection index to be only partially computed.

• This speeds up execution time when loading the index into main memory.
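The online approach amounts to caching interconnection checks as queries request them. The sketch below is a simplification under stated assumptions: the class name `OnTheFlyIndex` is hypothetical, the expensive check is passed in as a function, and only the requested pair is cached, whereas the slides describe computing the whole section of the index that corresponds to T|n,n’.

```python
from typing import Callable, Dict, List, Tuple


class OnTheFlyIndex:
    """Hashtable of booleans for the pairs whose interconnection was already checked."""

    def __init__(self, check: Callable[[int, int], bool]) -> None:
        self._check = check                       # expensive test, assumed given
        self._cache: Dict[Tuple[int, int], bool] = {}

    def interconnected(self, a: int, b: int) -> bool:
        key = (min(a, b), max(a, b))              # the relation is symmetric
        if key not in self._cache:
            self._cache[key] = self._check(*key)  # computed once, then reused
        return self._cache[key]

    def computed_pairs(self) -> int:
        return len(self._cache)


calls: List[Tuple[int, int]] = []


def slow_check(a: int, b: int) -> bool:
    """Stand-in for the real interconnection test; the rule itself is arbitrary."""
    calls.append((a, b))
    return (a + b) % 2 == 0


index = OnTheFlyIndex(slow_check)
first = index.interconnected(3, 1)
second = index.interconnected(1, 3)   # served from the cache
print(first, second, len(calls), index.computed_pairs())
```

Because queries tend to touch the same parts of a document, most lookups hit the cache after a warm-up period and the index stays small.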

Experimental Results

Hardware and Software Used

• Language: Java

• Processor: 1.6 GHz Pentium 4

• RAM: 2 GB (limited to 1.46 GB by JVM)

• OS: Windows XP

Interconnection Index

• Is built offline

• Allows for checking interconnection between two nodes, during query processing, in O(1) time

• We have two implementations

– as a hash table

– as a symmetric matrix

• The Indexer is responsible for building the Interconnection Index

Choosing the Implementation for the Interconnection Index

• We have experimented with the two implementations of the interconnection index

1. IIH: the index is a hash table

2. IIM: the index is a symmetric matrix

• We compare the two implementations

– Cost of building the index

– Cost of query processing, i.e., using the index

• IIM is better than IIH, because of the additional overhead of hashing

Time For Building Indices

XML corpus   Size (KB)   Number of nodes   IIM time (ms)   IIH time (ms)
Dream        146         3,360             29              36
Hamlet       281         6,635             114             185
Sigmod       704         21,246            1,552           1,729
Mondial      1,198       49,422            6,231           7,837

• Both implementations are reasonable

On the Fly Indexing (OFI)

• Fully building the indices as a preprocessing step before querying is expensive in memory for “huge” corpora!

– It is also expensive in time because of the additional overhead of using virtual memory

• Instead, compute the interconnection index incrementally, on the fly, during query processing for each pair that must be checked

– By how much will query processing be slowed down?

Time For Building Indices: Comparing IIH, IIM, OFI

XML corpus   Size (KB)   Number of nodes   OFI time (ms)   IIH time (ms)   IIM time (ms)
Dream        146         3,360             0.6             36              29
Hamlet       281         6,635             1.1             185             114
Sigmod       704         21,246            2.2             1,729           1,552
Mondial      1,198       49,422            10.0            7,837           6,231

For these corpora, the OFI time is at most 10 ms.

This is actually the time needed to build all the indices other than the interconnection index.

Query Execution Time

• We generated 1000 random queries for the Sigmod Record corpus

• Each query had:

– At most 3 optional search terms

– At most 3 required search terms

• We checked time with IIH, IIM and OFI

IIH/IIM: Query Processing Time

[Figure: number of queries plotted against query processing time in milliseconds, with one series for IIH and one for IIM; recovered axis ticks: 0.1, 1, 10, 100, 1000.]

• Note: Logarithmic scale

• Both approaches lead to similar results

• Average run time for queries: 35 ms

OFI: Query Processing Time

[Figure: number of queries (logarithmic axis, ticks 1, 10, 100, 1000) plotted against query processing time in seconds (logarithmic axis, ticks 0.01 to 100000). Recovered bar labels, in order: 513, 144, 199, 96, 23, 17, 7, 1.]

• More than 50% of the queries were processed in under 10 ms

• After processing the 1000 queries, 0.75% of all pairs of nodes had been checked for interconnection

– Space saved in main memory

• Slowdown in response time not too large!

• Locality property: queries tend to be similar in the parts of the document that they may access

How Good are the Results?

• We measured recall and precision for the query:

– Find papers written by Buneman that contain the keyword database in the title

• We tried two different queries that reflect different amounts of user knowledge

– Kw: +Buneman +database (classical search engine query)

– Tag-kw: +author:Buneman +title:database

• Corpus: Sigmod, DBLP

• We computed the "correct answers" using XQuery

Precision and Recall

• Recall = (number of correct returned answers) / (number of correct answers)

– Perfect recall, i.e., XSEarch returns all the correct answers

• Precision at n = (number of correct answers in the first n returned answers) / n
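Both measures are straightforward to compute from a ranked result list and a set of correct answers. The function names and the answer ids in this sketch are hypothetical and are not taken from the experiments.

```python
from typing import Hashable, Iterable, Sequence


def recall(returned: Iterable[Hashable], correct: Iterable[Hashable]) -> float:
    """Correct returned answers divided by all correct answers."""
    correct_set = set(correct)
    if not correct_set:
        return 1.0  # convention chosen here: nothing to find means nothing was missed
    return len(correct_set & set(returned)) / len(correct_set)


def precision_at(n: int, returned: Sequence[Hashable],
                 correct: Iterable[Hashable]) -> float:
    """Correct answers among the first n returned answers, divided by n."""
    if n <= 0:
        raise ValueError("n must be positive")
    correct_set = set(correct)
    return sum(1 for answer in returned[:n] if answer in correct_set) / n


# Made-up ranked result list and ground truth.
returned = ["p1", "p7", "p2", "p9", "p3"]
correct = {"p1", "p2", "p3"}

print(recall(returned, correct))
print(precision_at(1, returned, correct))
print(precision_at(5, returned, correct))
```

Precision at n rewards ranking correct answers early, which is why it is reported separately for n = 5, 10 and 20.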

Precision at 5, 10 and 20

[Figure: precision (axis from 0.5 to 1) at 5, 10 and 20 for the keyword-only (KW) and keyword-plus-tag (KW+TAG) queries, over the DBLP and Sigmod corpora.]

• Sigmod: Perfect precision

• DBLP: 0.8/0.9 for the query containing only keywords

• Combining tags and keywords leads to perfect precision

Related Work

• Numerous query languages for XML have been developed.

– For example, the XQuery working group is considering how to add full-text search features and ranking to XQuery. Such capabilities have already been added to various XML query languages. But these languages are not suitable for naive users, since the query syntax is always complex.

– A recent related work is the XRANK system for keyword searching in XML documents.

Conclusions

• The main contribution of this paper is in laying the foundations for a semantic search engine over XML documents.

• XSEarch returns semantically related fragments, ranked by estimated relevance.

• This system is extensible and can easily accommodate different types of relationships between nodes.

• We have shown that it is possible to combine these qualities with an efficient, scalable and modular system.

• Thus, XSEarch can be seen as a general framework for semantic searching in XML documents.

• Efficient index structures

– IIM/IIH for “small” documents

– OFI for “big” documents

• Efficient evaluation algorithms

– Dynamic algorithm for computing interconnection

• Extensible implementation

– The system can easily accommodate different types of semantic relations between nodes, other than interconnection

Thank You.

Questions?
