CIKM 2005 1 Finding and Approximating Top-k Answers in Keyword Proximity Search Benny Kimelfeld...



CIKM 2005

Finding and Approximating Top-k Answers in Keyword Proximity Search

Benny Kimelfeld and Yehoshua Sagiv

The Selim and Rachel Benin School of Engineering and Computer Science

The Hebrew University of Jerusalem

Finding and Approximating Top-k Answers in Keyword Proximity Search (PODS'06)

Keyword Proximity Search (KPS)

A paradigm for data extraction

Data have varying degrees of structure

– Relational databases, XML, Web sites

Queries are sets of keywords

– No structural constraints

The Goal:

Extract meaningful parts of data w.r.t. the keywords

Querying Structure & Content by Keywords

Keywords appear in different parts of the data

Answers show occurrences of keywords, as well as the associations among these occurrences

Proximity of the keywords in the answer indicates a close (strong) semantic association among them

[Figure: a keyword search for "Vardi Databases", shown with data fragments whose nodes are labeled journal, name, article, author, and title and whose values include Databases and Vardi]

Past Work on KPS (Keyword Proximity Search)

• DataSpot (SIGMOD 1998)

• Information Units (WWW 2001)

• BANKS (ICDE 2002, VLDB 2005)

• DISCOVER (VLDB 2002)

• DBXplorer (ICDE 2002)

• XKeyword (ICDE 2003)

• …

The Goal of this Paper

Devise efficient algorithms for finding high-quality answers in keyword proximity search

Contents

Introduction

Formal Setting

The Main Results

Enumerating in the Exact Order

Enumerating in an Approximate Order

Conclusion and Future Work

Formal Setting

Data Graphs

[Figure: an example data graph whose node labels include company, supplies, supply, product, supplier, papers, A4, coffee, president, Cohen, department, Summers, manager, Paris, and hq]

Structural and keyword nodes

Edges may have weights

– Weak relationships are penalized by high weights

Queries

Q = { Summers, Cohen, coffee }

[Figure: the example data graph from the previous slide]

Queries are sets of keywords from the data graph

Query Answers

[Figure: the example data graph, here with customer nodes in place of supplier nodes]

An answer is a directed subtree of the data graph

Contains all keywords of the query

Has no redundant edges (and nodes)

The keywords of the query are the leaves

The root has two or more children
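The conditions above can be checked mechanically. The following is a minimal sketch, assuming a candidate answer is given as a list of (parent, child) edges and the query has at least two keywords; the function name `is_answer` and the example edges are hypothetical and only borrow node labels from the example graph.

```python
from collections import defaultdict

def is_answer(edges, keywords):
    """Check the listed conditions for a directed tree given as (parent, child) edges."""
    children = defaultdict(list)
    parents = defaultdict(list)
    nodes = set()
    for parent, child in edges:
        children[parent].append(child)
        parents[child].append(parent)
        nodes.update((parent, child))

    # A directed tree has exactly one root, and every other node has one parent.
    roots = [n for n in nodes if not parents[n]]
    if len(roots) != 1 or any(len(parents[n]) > 1 for n in nodes):
        return False
    root = roots[0]

    # Every node must be reachable from the root (rules out detached cycles).
    seen, stack = {root}, [root]
    while stack:
        for child in children[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    if seen != nodes:
        return False

    # The leaves are exactly the query keywords, so no branch is redundant.
    leaves = {n for n in nodes if not children[n]}
    if leaves != set(keywords):
        return False

    # The root has two or more children.
    return len(children[root]) >= 2


# Hypothetical edges, for illustration only.
tree = [("company", "president"), ("president", "Cohen"),
        ("company", "department"), ("department", "manager"),
        ("manager", "Summers")]
print(is_answer(tree, {"Cohen", "Summers"}))
print(is_answer([("hq", "company")] + tree, {"Cohen", "Summers"}))
```

The second call adds a root with a single child, which is the kind of redundant edge the last condition rules out.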

Ranking: Inversely Proportional to Weight

[Figure: three answers for the keywords Vardi and databases, drawn as trees with weighted edges (weights such as 1, 1.5, and 5) over nodes labeled dblp, article, title, cite, and references; the answers carry the labels 2, 1, and 3]

rank(A) = (weight(A))^(-1)

Smaller subtrees represent closer associations
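As a small illustration of the ranking, the sketch below assumes that the weight of an answer is the sum of its edge weights and takes the rank as its inverse. The edge lists and their weights are made up for the example and are not read off the figure.

```python
def weight(tree_edges):
    """Weight of an answer: the sum of its edge weights (assumed definition)."""
    return sum(w for _, _, w in tree_edges)

def rank(tree_edges):
    """rank(A) = 1 / weight(A): lighter trees rank higher."""
    return 1.0 / weight(tree_edges)

# Hypothetical answers as (parent, child, weight) triples.
close = [("article", "Vardi", 1.0), ("article", "title", 1.0),
         ("title", "databases", 1.0)]
loose = [("dblp", "article1", 5.0), ("article1", "Vardi", 1.0),
         ("dblp", "article2", 5.0), ("article2", "title", 1.0),
         ("title", "databases", 1.0)]

for name, tree in sorted([("close", close), ("loose", loose)],
                         key=lambda item: rank(item[1]), reverse=True):
    print(name, weight(tree), rank(tree))
```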

Enumerating in Exact (Ranked) Order

[Figure: a sequence of answers drawn as small trees over nodes A, B, C]

If one answer precedes another in the enumeration, then the weight of the first ≤ the weight of the second

Top-k answers: the first k answers of the enumeration

Enumerating in a C-Approximate Order

[Figure: a sequence of answers drawn as small trees over nodes A, B, C]

If one answer precedes another in the enumeration, then the weight of the first ≤ C × the weight of the second

C-Approximation of the Top-k Answers (Fagin et al., PODS’01)

C may be a function of G and Q
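The condition can be stated as a check on the sequence of answer weights. The sketch below reads it as: whenever one answer precedes another, the weight of the earlier one is at most C times the weight of the later one. The function name is hypothetical, and C is passed as a plain number rather than as a function of G and Q.

```python
def is_c_approximate_order(weights, c):
    """True if every earlier weight is at most c times every later weight."""
    largest_so_far = float("-inf")
    for w in weights:
        # Comparing against the largest earlier weight covers all earlier answers.
        if largest_so_far > c * w:
            return False
        largest_so_far = max(largest_so_far, w)
    return True

print(is_c_approximate_order([3, 4, 5, 9], 1))  # exact order is the case c = 1
print(is_c_approximate_order([4, 3, 5], 1))
print(is_c_approximate_order([4, 3, 5], 2))
```

With c = 1 the check accepts only sequences in non-decreasing weight, which is the exact ranked order.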

Polynomial Delay

Yardstick of efficiency: polynomial delay

[Figure: answers generated one after another]

Polynomial time between generating successive answers

There can be exponentially many answers even for 2 keywords, so it is inefficient to generate all answers and then sort them

The Main Results

Top Answers are Steiner Trees

• Finding the top answer in KPS (a.k.a. the Steiner-tree problem) is intractable

– Therefore, one cannot enumerate all answers in ranked order with polynomial delay

• However, the top answer can be found efficiently under data complexity

– That is, the number of keywords is fixed

• Approximations can be found efficiently under query-and-data complexity

– There is a lot of work on Steiner-tree approximations

So What Can Be Done?

Can answers of KPS be enumerated in the exact order with polynomial delay, under data complexity?

Can approximations of Steiner trees be used for efficiently enumerating in an approximate order (while preserving the approximation ratio)?

Our Results

Theorem 1: Under data complexity, answers of KPS can be enumerated in the exact order with polynomial delay

Our Results (cont’d)

Theorem 2: Under query-and-data complexity, given an efficient C-approximation for finding Steiner trees, one can enumerate with polynomial delay in a (C+1)-approximate order

The Meaning of the Results

KPS is tractable under data complexity

Under query-and-data complexity, an efficient enumeration in an approximate order can be done with almost the same ratios as Steiner trees

All results on Steiner trees can be applied to KPS

Existing approaches to KPS are heuristics

– Exponential delay in the worst case

– No provable nontrivial approximation ratios

From a theoretical point of view, using heuristics is not the only option

Enumerating in the Exact Order

Lawler’s Method

• We use the technique of Lawler (1972), which is an iterative method for finding the top-k answers

• Each iteration generates the next answer by finding the top answer under constraints

• Lawler’s method is designed for general (discrete) optimization problems

• When applying it to a specific problem, one needs to deal with the following two issues

Two Problems to Solve

1. What exactly are the constraints? (That is, how can we apply Lawler’s method so that the constraints make it possible to find top answers efficiently?)

2. How can we efficiently find the top answer under constraints?

Solving the First Problem

Constraints are subtrees of the graph

• Pairwise node disjoint

• Their leaves are exactly the keywords of the query

An answer satisfies the constraints if it contains all the subtrees (i.e., it is a supertree)

[Figure: example trees over nodes labeled A, B, C, E, F, G]

Two Problems to Solve (One Left)

1. What exactly are the constraints? (That is, how can we apply Lawler’s method in a way that the constraints enable finding the top answer efficiently?)

2. How can we efficiently find the top answer under constraints?

Formulation of the Second Problem

Input: constraints (node-disjoint subtrees, keywords as leaves)

Objective: A minimal answer satisfying the constraints (i.e., containing all the subtrees)

Next, an algorithm that solves “almost” this problem, namely:

(Almost the same) Objective:

A minimal supertree satisfying the constraints

Finding a Minimal Supertree

Input: G, T (constraints, i.e., subtrees)

1. Collapse each of the subtrees of T into a node

2. Find a Steiner tree T' of the collapsed subtrees

3. Restore the collapsed subtrees in T'

(more details in the proceedings…)
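The three steps can be sketched on a toy instance. This is only an illustration under simplifying assumptions that are not from the talk: the graph is undirected (the data graphs in the talk are directed), the Steiner tree is found by brute force over subsets of nodes, and each constraint is passed as a (nodes, edges) pair. All function names are hypothetical.

```python
from itertools import combinations

def mst(nodes, edges):
    """Kruskal's algorithm; returns (weight, chosen edges) or None if disconnected."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    total, chosen = 0, []
    for u, v, w in sorted(edges, key=lambda e: e[2]):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            total += w
            chosen.append((u, v, w))
    return (total, chosen) if len(chosen) == len(nodes) - 1 else None

def steiner_tree(nodes, edges, terminals):
    """Brute force: try every set of extra nodes, keep the lightest spanning tree."""
    terminals = set(terminals)
    optional = list(set(nodes) - terminals)
    best = None
    for r in range(len(optional) + 1):
        for extra in combinations(optional, r):
            keep = terminals | set(extra)
            inside = [e for e in edges if e[0] in keep and e[1] in keep]
            tree = mst(keep, inside)
            if tree is not None and (best is None or tree[0] < best[0]):
                best = tree
    return best

def minimal_supertree(edges, subtrees):
    """edges: (u, v, weight) triples; subtrees: node-disjoint (nodes, edges) pairs."""
    # 1. Collapse: every node of subtree i is renamed to the supernode ("T", i).
    name = {n: ("T", i) for i, (sub_nodes, _) in enumerate(subtrees) for n in sub_nodes}
    lightest = {}  # collapsed node pair -> (cu, cv, u, v, w) of the lightest edge
    for u, v, w in edges:
        cu, cv = name.get(u, u), name.get(v, v)
        if cu == cv:
            continue  # the edge lies inside one collapsed subtree
        key = frozenset((cu, cv))
        if key not in lightest or w < lightest[key][4]:
            lightest[key] = (cu, cv, u, v, w)
    terminals = {("T", i) for i in range(len(subtrees))}
    collapsed_nodes = terminals | {c for key in lightest for c in key}
    collapsed_edges = [(cu, cv, w) for cu, cv, _, _, w in lightest.values()]
    # 2. Find a Steiner tree of the collapsed subtrees.
    found = steiner_tree(collapsed_nodes, collapsed_edges, terminals)
    if found is None:
        return None
    # 3. Restore: map collapsed edges back and put the subtree edges back in.
    restored = [lightest[frozenset((a, b))][2:] for a, b, _ in found[1]]
    for _, sub_edges in subtrees:
        restored.extend(sub_edges)
    return sum(w for _, _, w in restored), restored

# Toy instance (hypothetical): one two-node constraint and one single-node constraint.
graph = [("x", "a", 1), ("r", "x", 2), ("r", "b", 2),
         ("a", "b", 10), ("x", "y", 1), ("y", "b", 1)]
constraints = [({"x", "a"}, [("x", "a", 1)]), ({"b"}, [])]
print(minimal_supertree(graph, constraints))
```

The brute-force Steiner step is exponential in the number of non-terminal nodes, so it stands in for whichever exact or approximate Steiner-tree algorithm is plugged in.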

This is not Enough!

Input: constraints (node-disjoint subtrees, keywords as leaves)

Objective: A minimal answer satisfying the constraints (i.e., containing all the subtrees)

(Almost the same) Objective: A minimal supertree satisfying the constraints

Not the same!

Query Answers Revisited

[Figure: the example data graph]

An answer is a directed subtree of the data graph

Contains all keywords of the query

Has no redundant edges (and nodes)

Keywords are the leaves

The root has two or more children

An Example

[Figure: two trees over nodes A, B, C, D, captioned "The minimal supertree satisfying the constraints" and "The minimal answer satisfying the constraints"]

This edge is redundant! But it cannot be removed since it is a constraint!

The minimal answer can be completely different from the minimal supertree

Furthermore, there can be no answer even if there is a supertree

What if We Remove Edges of Constraints?

• What if we first generate a minimal supertree and then, as long as the root has only one child, remove the root (until an answer is obtained)?

• The constraints are violated, leading to a failure of Lawler’s method!

• That is,

– Some answers will be duplicated

– While other answers will not be generated at all

Our Approach

[Figure: the constraints are transformed into new constraints, and a minimal supertree of the new constraints yields an answer; illustrated on a graph with nodes A to H]

The root of this subtree has more than one child and it must be the root of the answer

This Process is Repeated

[Figure: the constraints are transformed in several different ways, and a minimal supertree is computed for each transformed set of constraints; illustrated on a graph with nodes A to H]

Up to 2^(#keywords) times (fixed & usually fewer)

The best is the final answer

About the Transformation

• The details of the exact transformation and the proof of correctness are intricate

• All can be found in the proceedings…

This concludes the algorithm for enumerating in the exact order

A Different View: Chain of Reductions

Enumerating answers in ranked order

↓ Adapting Lawler’s method

Finding the top answer under constraints

↓ Transformation of constraints

Finding minimal supertrees

↓ Collapse and restore

Finding Steiner trees

Enumerating in an Approximate Order

Modifying the Chain of Reductions

Enumeration in an approximate order

Finding approximate answers under constraints

Finding approximations of minimal supertrees

Finding approximations of Steiner trees

Similar

Similar

Completely different!

Exact Order Revisited

[Figure: the repeated transform and minimal-supertree process of the exact-order algorithm, on a graph with nodes A to H]

Up to 2^(#keywords) times: we cannot allow it under query-and-data complexity!

The Algorithm

[Figure: two subtrees computed from the constraints, on a graph with nodes A to H]

A C-approximation of the minimal supertree (collapse and restore): ≤ C times the optimum

A minimal answer for 3 or fewer constraints (the algorithm for the exact order): ≤ 1 times the optimum

Combine the Subtrees

[Figure: the two subtrees are combined into one subgraph, on a graph with nodes A to H]

A C-approximation of the minimal supertree (collapse and restore): ≤ C times the optimum

A minimal answer for 3 or fewer constraints (the algorithm for the exact order): ≤ 1 times the optimum

The combined subgraph contains an answer: ≤ (C+1) times the optimum
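The (C+1) factor is the sum of the two bounds. As a sketch, write S for the C-approximate minimal supertree, M for the minimal answer for 3 or fewer constraints, w for the weight, and OPT for the weight of the optimum that both bounds refer to. These symbols are introduced here for illustration, and edge weights are assumed to be non-negative.

```latex
w(S \cup M) \;\le\; w(S) + w(M) \;\le\; C \cdot \mathrm{OPT} + 1 \cdot \mathrm{OPT} \;=\; (C+1) \cdot \mathrm{OPT}
```

Since the combined subgraph contains an answer, that answer weighs no more than the combined subgraph, hence at most (C+1) times the optimum.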

Conclusion and Future Work

Keyword Proximity Search

• A common paradigm for keyword search over structured databases

• In the formal model:

– Data are directed and weighted graphs

– Queries are sets of keywords (i.e., nodes) from the data graph

– Query answers are non-redundant subtrees containing the keywords of the query

• The goal is to find the top-k answers, where the rank is inversely proportional to the weight

• A stronger goal: enumeration with polynomial delay

Our Results

• Under data complexity, answers can be enumerated in the exact ranked order with polynomial delay

• Under query-and-data complexity, every efficient C-approximation to the Steiner-tree problem yields an algorithm for enumerating answers with polynomial delay in a (C+1)-approximate order

Our Chain of Reductions

Enumerating answers in sorted order

↓ Lawler’s approach

Finding the top answer under constraints

↓ The intricate part …

Finding minimal supertrees

↓ Subtree Collapse/Restore

Finding Steiner trees

Other Variants of KPS

Our algorithms can be adapted to other popular variants of KPS

Undirected Variant

[Figure: the example data graph]

Answers are undirected trees

Strong Variant

[Figure: the example data graph]

Answers are undirected trees and keywords are leaves

Open Problems

• Can we improve the space efficiency of our algorithms?

• Some ranking functions (e.g., height) are easier than weight when looking for the top answer (no constraints), but

– The chain of reductions doesn’t work

– The complexity of finding the top answer under constraints is unknown

• Can our results hold for richer queries that also have structural constraints?

Implementation Considerations

• Bottlenecks: Steiner-tree algorithms and approximations

• Thin graphs allow in-memory execution of our algorithms, even for large XML documents (e.g., DBLP)

• New and intuitive ranking functions that are easier to implement efficiently

Related Work: Order vs. Efficiency

[Figure: a spectrum of enumeration orders running from more desirable to more efficient: Exact Order, Approximate Order, Heuristic Order (no approximation guaranteed), No Order. The labels "This work", "(Queries have a fixed size)", and "Past work" mark positions on the spectrum.]

Thank you.

Questions?

Illustration of Lawler’s Method

Lawler’s Method (1972)

1. Find the Top Answer

In principle, at this point we should find the second-best answer

But Instead…

2. Partition the Remaining Answers

Each partition is defined by a distinct set of constraints

3. Find the Top of each Set

4. Find the Second Answer

The second answer is the best among all the top answers in the partitions

5. Further Divide the Chosen Partition

And so on…
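The five steps above can be sketched on a toy problem. The following is a minimal illustration, not the authors' implementation: it assumes that an "answer" is a set of exactly m items from a weighted list and that lower total weight is better. The names `best_subset` and `lawler_top_k` are hypothetical. The only property the sketch relies on is that the best answer under inclusion and exclusion constraints can be computed directly.

```python
import heapq
from itertools import count


def best_subset(weights, m, include, exclude):
    """Lightest m-subset that contains `include` and avoids `exclude` (or None)."""
    free = sorted(
        (i for i in range(len(weights)) if i not in include and i not in exclude),
        key=lambda i: (weights[i], i),
    )
    need = m - len(include)
    if need < 0 or need > len(free):
        return None  # the partition is empty
    chosen = frozenset(include) | frozenset(free[:need])
    return sum(weights[i] for i in chosen), chosen


def lawler_top_k(weights, m, k):
    """Enumerate the k lightest m-subsets in ranked order (toy setting)."""
    tie = count()  # unique tie-breaker so the heap never compares sets
    queue = []
    first = best_subset(weights, m, frozenset(), frozenset())  # step 1
    if first is not None:
        heapq.heappush(queue, (first[0], next(tie), first[1], frozenset(), frozenset()))
    results = []
    while queue and len(results) < k:
        # Step 4: the next answer is the best among the tops of all partitions.
        cost, _, answer, inc, exc = heapq.heappop(queue)
        results.append((cost, sorted(answer)))
        # Steps 2 and 5: split the chosen partition into disjoint sub-partitions.
        fixed = set(inc)
        for item in sorted(answer - inc):
            sub_inc = frozenset(fixed)
            sub_exc = exc | {item}
            best = best_subset(weights, m, sub_inc, sub_exc)  # step 3
            if best is not None:
                heapq.heappush(queue, (best[0], next(tie), best[1], sub_inc, sub_exc))
            fixed.add(item)
    return results
```

For example, `lawler_top_k([1, 2, 3, 4], 2, 3)` returns three pairs whose total weights are 3, 4 and 5. The priority queue holds one entry per partition, so each new answer costs one pop plus one constrained optimization per sub-partition.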

Adapting Lawler’s Method

Our Constraints

Inclusion constraints

• Node-disjoint subtrees of the data graph
• All the leaves are keywords
• An answer must contain all the subtrees

Exclusion constraints

• Edges of the data graph
• An answer must not contain any of the edges

[Figure: example inclusion and exclusion constraints, with labels A, B, C, D]

Partitioning a Partition (cont)

Current partition: inclusion constraints I, exclusion constraints E, top answer A

edges(A) \ I = {e1,…,ek}

Sub-partitions (inclusion, exclusion):

A0: I, E ∪ {e1}
A1: I ∪ {e1}, E ∪ {e2}
A2: I ∪ {e1,e2}, E ∪ {e3}
A3: I ∪ {e1,e2,e3}, E ∪ {e4}
…
Ak-1: I ∪ {e1,…,ek-1}, E ∪ {ek}
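The splitting scheme on this slide can be written out as a short sketch. It assumes the simplified view used here, in which both I and E are sets of edges and an edge is a (source, target) pair. The function name `split_partition` is hypothetical.

```python
def split_partition(inclusion, exclusion, answer_edges):
    """Return the (inclusion, exclusion) pairs of the sub-partitions.

    `answer_edges` lists the edges of the top answer A in a fixed order.
    Sub-partition i keeps e1..ei and forbids e(i+1), so the sub-partitions
    are pairwise disjoint and none of them contains A itself.
    """
    new_edges = [e for e in answer_edges if e not in inclusion]  # edges(A) \ I
    parts = []
    for i, edge in enumerate(new_edges):
        sub_inclusion = frozenset(inclusion) | frozenset(new_edges[:i])
        sub_exclusion = frozenset(exclusion) | {edge}
        parts.append((sub_inclusion, sub_exclusion))
    return parts
```

Every answer of the current partition other than A omits at least one edge of A, and the first omitted edge (in the fixed order) decides which sub-partition the answer falls into.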

Generating Constraints (intuition)

[Figure: a sequence of panels, each showing the elements A B C D E]

Constraints (subtrees/edges) are obtained from existing constraints of the current partition and the top answer

Collapsing Subtrees

Collapsing a Subtree

1. Remove All Edges and Internal Nodes

Only the root is left

2. Remove Incoming Edges of Internal Nodes

3. Add Outgoing Edges to the Root

An edge that emanates from an internal node becomes an outgoing edge of the root

More Details

• When adding an outgoing edge (r,u) to the root, the weight of (r,u) is the minimal weight among all the edges from the collapsed subtree to u

• When restoring a subtree, each outgoing edge (r,u) of the root is replaced with an (arbitrary) original edge from the restored subtree to u, with the same weight

• Incoming edges of internal nodes of the subtree are never restored
  – Such edges cannot participate in G-supertrees
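The collapsing steps and the minimal-weight rule can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the data graph is a dictionary mapping a directed edge (u, v) to its weight, every subtree node other than the root is treated as an internal node, and the name `collapse_subtree` is hypothetical. The returned `origin` map records which original edge each surviving edge came from, which is the information needed to restore the subtree later.

```python
def collapse_subtree(edges, root, subtree_nodes):
    """Collapse a subtree into its root.

    Returns (collapsed_edges, origin), where origin[(r, u)] is an original
    edge of minimal weight that the collapsed edge (r, u) stands for.
    """
    removed = set(subtree_nodes) - {root}  # assumption: all non-root nodes
    collapsed = {}
    origin = {}
    for (u, v), w in edges.items():
        if v in removed:
            continue  # subtree edges and incoming edges of internal nodes
        src = root if u in removed else u  # re-attach outgoing edges to the root
        if src == v:
            continue  # an edge back into the root would become a self-loop
        key = (src, v)
        if key not in collapsed or w < collapsed[key]:
            collapsed[key] = w  # keep the minimal weight
            origin[key] = (u, v)
    return collapsed, origin
```

For instance, if the subtree {r, a, k} has the outgoing edges (a, u) with weight 5 and (r, u) with weight 7, the collapsed graph contains a single edge (r, u) of weight 5, and `origin` maps it back to (a, u).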