Page 1

Metasearch
Mathematics of Knowledge and Search Engines: Tutorials @ IPAM, 9/13/2007

Zhenyu (Victor) Liu
Software Engineer, Google Inc.
vicliu@google.com

Page 2

Roadmap
- The problem
- Database content modeling
- Database selection
- Summary

Page 3

Metasearch – the problem

[Diagram: a user issues the query "applied mathematics" to a Metasearch Engine; the engine forwards the query to multiple underlying text databases and returns merged search results]

Page 4

Subproblems
- Database content modeling: how does a Metasearch engine "perceive" the content of each database?
- Database selection: selectively issue the query to the "best" databases
- Query translation: different databases have different query formats, e.g., "a+b" / "a AND b" / "title:a AND body:b" / etc.
- Result merging: the query "applied mathematics" returns top-10 results from both science.com and nature.com; how to present them?

Page 5

Database content modeling and selection: a simplified example
- A "content summary" of each database
- Selection based on the # of matching docs
- Assuming independence between words

Database 1 (total #: 10,000 documents):

  Word w        # of documents that use w   Pr(w)
  applied       4000                        0.4
  mathematics   2500                        0.25

  10,000 x 0.4 x 0.25 = 1000 documents match "applied mathematics"

Database 2 (total #: 60,000 documents):

  Word w        # of documents that use w   Pr(w)
  applied       200                         0.00333
  mathematics   300                         0.005

  60,000 x 0.00333 x 0.005 ≈ 1 document matches "applied mathematics"

Since 1000 > 1, database 1 is selected.
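The selection arithmetic above can be sketched in a few lines of Python (the helper `estimated_matches` is hypothetical, not part of any Metasearch system):

```python
def estimated_matches(total_docs, df, query_words):
    """Estimate the # of docs matching ALL query words, assuming
    the query words appear independently in the database."""
    est = total_docs
    for w in query_words:
        est *= df.get(w, 0) / total_docs  # Pr(w) = df(w) / |db|
    return est

db1 = estimated_matches(10_000, {"applied": 4000, "mathematics": 2500},
                        ["applied", "mathematics"])
db2 = estimated_matches(60_000, {"applied": 200, "mathematics": 300},
                        ["applied", "mathematics"])
# db1: 10,000 * 0.4 * 0.25 = 1000 estimated matches; db2: about 1
```

The engine would then route "applied mathematics" to database 1, since its estimate dominates.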

Page 6

Roadmap
- The problem
- Database content modeling
- Database selection
- Summary

Page 7

Database content modeling: a spectrum of approaches
- Replicate the entire text database: most storage-demanding; requires a fully cooperative database
- Download part of a text database: more storage-demanding; works with a non-cooperative database
- Obtain a full content summary: less storage-demanding; requires a fully cooperative database
- Approximate the content summary via sampling: least storage-demanding; works with a non-cooperative database

Page 8

Replicate the entire database
- E.g., www.google.com/patents, a replica of the entire USPTO patent document database

Page 9

Download a non-cooperative database
- Objective: download as much as possible
- Basic idea: "probing" (querying with short queries such as "applied" or "mathematics") and downloading all results
- Practically, one can only issue a fixed # of probes (e.g., 1000 queries per day)

[Diagram: the Metasearch Engine issues probe queries through the search interface of a text database]

Page 10

Harder than the "set-coverage" problem
- All docs in a database db form the universe, assuming all docs are equal
- Each probe (e.g., "applied", "mathematics") corresponds to a subset
- Find the least # of subsets (probes) that covers db, or the max coverage with a fixed # of subsets (probes)
- NP-complete; the greedy algorithm is proved to be the best-possible P-time approximation algorithm
- Unlike standard set coverage, the cardinality of each subset (the # of matching docs for each probe) is unknown!
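For intuition, here is the classic greedy max-coverage heuristic in the idealized setting where each probe's subset is fully known, which, as the slide notes, is exactly what a real Metasearch engine does not have; the probe words and doc ids are made up for illustration:

```python
def greedy_max_coverage(subsets, budget):
    """Pick up to `budget` probes maximizing the # of covered docs.
    `subsets`: dict mapping a probe word -> set of matching doc ids.
    Classic greedy: repeatedly take the probe with max cardinality gain."""
    covered, chosen = set(), []
    for _ in range(budget):
        word = max(subsets, key=lambda w: len(subsets[w] - covered))
        if not subsets[word] - covered:
            break  # no remaining probe adds new docs
        chosen.append(word)
        covered |= subsets[word]
    return chosen, covered

subsets = {
    "applied":     {1, 2, 3, 4},
    "mathematics": {3, 4, 5, 6, 7},
    "research":    {1, 5},
}
chosen, covered = greedy_max_coverage(subsets, budget=2)
# picks "mathematics" first (5 new docs), then "applied" (2 new docs)
```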

Page 11

Pseudo-greedy algorithms [NPC05]
- Greedy set-coverage: choose subsets with the max "cardinality gain"
- When the cardinality of subsets is unknown:
  - Assume subset cardinalities are proportionally the same across databases, e.g., build a reference database from Web pages crawled from the Internet and rank single words by their frequency
  - Or start with certain "seed" queries and adaptively choose query words from within the docs returned; the choice of probing words then varies from database to database

Page 12

An adaptive method
- D(wi): the subset of docs returned by probing with word wi
- With w1, w2, …, wn already issued, choose the next probing word

  w(n+1) = argmax over w not yet used as a probe [ |D(w)| − |D(w) ∩ (D(w1) ∪ … ∪ D(wn))| ]

- Rewritten as |db|·Pr(w(n+1)) − |db|·Pr(w(n+1) ∧ (w1 ∨ … ∨ wn))
- Pr(w): the probability of w appearing in a doc of db
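A minimal sketch of this selection rule, assuming we approximate the second term |D(w) ∩ (D(w1) ∪ … ∪ D(wn))| by counting w's occurrences in the docs already downloaded; all names and numbers here are illustrative:

```python
def next_probe_word(candidates, est_pr, db_size, downloaded):
    """Pick the next probe word w maximizing the expected # of NEW docs:
    |db| * Pr(w) (estimated total matches for w) minus the matches
    already present in the downloaded sample.
    `downloaded`: doc id -> set of words appearing in that doc."""
    def expected_new_docs(w):
        already = sum(1 for words in downloaded.values() if w in words)
        return db_size * est_pr.get(w, 0.0) - already
    return max(candidates, key=expected_new_docs)

# docs retrieved by earlier probes ("applied", "mathematics")
downloaded = {1: {"applied", "science"}, 2: {"applied"}, 3: {"mathematics"}}
est_pr = {"science": 0.02, "theory": 0.015}  # interpolated Pr(w) estimates
word = next_probe_word(["science", "theory"], est_pr,
                       db_size=1000, downloaded=downloaded)
# "science": 1000*0.02 - 1 = 19 expected new docs; "theory": 15
```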

Page 13

An adaptive method (cont'd)
- How to estimate P̃r(w) for a candidate word w?
- Zipf's law: Pr(w) = α(R(w)+β)^(−γ), where R(w) is the rank of w in descending order of Pr(w)
- Assume the relative ranking of w1, w2, …, wn and the other words is the same in the downloaded subset as in db
- Fit the Zipf curve to the known Pr(w) values of w1, w2, …, wn, then interpolate P̃r(w) for the other words off the fitted curve

[Figure: single words ranked by Pr(w) in the downloaded documents; the known Pr(w) values of w1, …, wn anchor the fitted Zipf curve, and P̃r(w) of the remaining words is read off the curve]
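A simplified version of this fit, assuming β = 0 so Zipf's law reduces to log Pr(w) = log α − γ·log R(w) and ordinary least squares applies; the sample data is synthetic:

```python
import math

def fit_zipf(known):
    """Least-squares fit of log Pr = log(alpha) - gamma * log(rank).
    `known`: list of (rank, pr) pairs for words with known Pr(w).
    Simplified Zipf with beta fixed at 0."""
    xs = [math.log(r) for r, _ in known]
    ys = [math.log(p) for _, p in known]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    alpha = math.exp(my - slope * mx)
    return alpha, -slope  # gamma = -slope

def interpolate_pr(alpha, gamma, rank):
    """Read the estimated Pr(w) of an unqueried word off the fitted curve."""
    return alpha * rank ** (-gamma)

# Perfectly Zipfian synthetic data: Pr = 0.4 / rank
alpha, gamma = fit_zipf([(1, 0.4), (2, 0.2), (4, 0.1)])
pr_rank3 = interpolate_pr(alpha, gamma, 3)  # interpolated Pr at rank 3
```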

Page 14

Obtain an exact content summary C(db) for a database db
- Statistics about words in db, e.g., df (document frequency):

  w             df
  mathematics   2500
  applied       4000
  research      1000

- Standards and proposals for cooperative databases to follow in exporting C(db):
  - STARTS [GCM97]: initiated by Stanford; attracted the main search-engine players by 1997: Fulcrum, Infoseek, PLS, Verity, WAIS, Excite, etc.
  - SDARTS [GIG01]: initiated by Columbia U.

Page 15

Approximate the content summary
- Objective: C̃(db) of a database db, with high vocabulary coverage & high accuracy
- Basic idea: probing and downloading sample docs [CC01]
- Example, with df as the content-summary statistic:
  1. Pick a single word as the query, probe the database
  2. Download a fraction of the results, e.g., top-k
  3. If the terminating condition is unsatisfied, go to 1
  4. Output <w, d̃f> pairs based on the sample docs downloaded

Page 16

Vocabulary coverage
- Can a small sample of docs cover the vocabulary of a big database?
- Yes, based on Heaps' law [Hea78]: |W| = Kn^β
  - n: # of words scanned
  - W: set of distinct words encountered
  - K: constant, typically in [10, 100]
  - β: constant, typically in [0.4, 0.6]
- Empirically verified [CC01]
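A quick illustration of why Heaps' law makes sampling viable, with K and β assumed at mid-range values (30 and 0.5, within the typical intervals above): vocabulary grows only with the square root of the words scanned.

```python
def heaps_vocab(n_words, K=30, beta=0.5):
    """Heaps' law: |W| = K * n^beta distinct words after scanning n words.
    K and beta are assumed mid-range values, not measured constants."""
    return K * n_words ** beta

small_scan = heaps_vocab(1_000_000)    # 1M-word sample
large_scan = heaps_vocab(100_000_000)  # a scan 100x larger
# with beta = 0.5, a 100x larger scan yields only 10x more distinct words,
# so a modest sample already covers much of the vocabulary
```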

Page 17

Estimate document frequency
- How to identify the d̃f of a word in the entire database?
  - w used as a query during sampling: its df is typically revealed in the search results
  - w' appearing only in the sampled docs: its d̃f needs to be estimated from the doc sample
- Apply Zipf's law & interpolate [IG02]:
  1. Rank the words w and w' based on their frequency in the sample
  2. Curve-fit based on the true df of the query words w
  3. Interpolate the estimated d̃f of each w' onto the fitted curve

Page 18

What if db changes over time?
- So do its content summary C(db) and C̃(db) [INC05]
- Empirical study:
  - 152 Web databases, a snapshot downloaded weekly, for 1 year
  - df as the statistics measure
  - Kullback-Leibler divergence as the "change" measure between the "latest" snapshot and the snapshot of time t ago
- db does change! How do we model the change? When should we resample and get a new C̃(db)?

[Figure: Kullback-Leibler divergence grows with t]
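A minimal sketch of the change measure, treating each content summary as a word distribution; the smoothing constant `eps` and the example summaries are assumptions, not values from the study:

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """KL divergence D(p || q) between two word distributions,
    each a dict word -> probability. `eps` smooths words unseen in q."""
    return sum(pw * math.log(pw / q.get(w, eps))
               for w, pw in p.items() if pw > 0)

# two illustrative snapshots of a content summary, as word distributions
old = {"applied": 0.40, "mathematics": 0.25, "research": 0.35}
new = {"applied": 0.30, "mathematics": 0.30, "research": 0.40}
change = kl_divergence(new, old)  # grows as the database drifts
```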

Page 19

Model the change
- KLdb(t): the KL divergence between the current C̃(db) and the C̃(db, t) of time t ago
- T: the time when KLdb(t) exceeds a pre-specified threshold τ
- Applying principles of Survival Analysis:
  - Survival function: Sdb(t) = 1 − Pr(T ≤ t)
  - Hazard function: hdb(t) = −(dSdb(t)/dt) / Sdb(t)
- How to compute hdb(t), and then Sdb(t)?

Page 20

Learn the hdb(t) of database change
- Cox proportional-hazards regression model:

  ln(hdb(t)) = ln(hbase(t)) + β1·x1 + …, where each xi is a predictor variable

- Predictors:
  - Pre-specified threshold τ
  - Web domain of db (".com", ".edu", ".gov", ".org", "others"): 5 binary "domain variables"
  - ln(|db|)
  - avg KLdb(1 week) measured in the training period
  - …

Page 21

Train the Cox model
- A stratified Cox model is applied:
  - The domain variables didn't satisfy the Cox proportional-hazards assumption
  - Stratify on each domain, i.e., train a separate hbase(t) / Sbase(t) per domain
- Training Sbase(t) for each domain: assume a Weibull distribution, Sbase(t) = e^(−λt^γ)
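The Weibull survival function and its inverse can be sketched directly; the λ and γ below are illustrative, not trained values from the study:

```python
import math

def weibull_survival(t, lam, gamma):
    """S(t) = exp(-lam * t^gamma): probability that the database has
    NOT yet changed beyond threshold tau by time t."""
    return math.exp(-lam * t ** gamma)

def time_to_survival(s_target, lam, gamma):
    """Invert S(t) = s_target: t = (-ln(s_target) / lam)^(1/gamma)."""
    return (-math.log(s_target) / lam) ** (1 / gamma)

# illustrative parameters (lam, gamma assumed, not fitted)
s_at_4 = weibull_survival(4, lam=0.088, gamma=0.8)
t_half = time_to_survival(0.5, lam=0.088, gamma=0.8)  # time until S drops to 0.5
```

With γ = 1 the curve reduces to the exponential distribution; the study's fitted γ values below 1 give a heavier tail.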

Page 22

Training result
- γ ranges in (0.57, 1.08); Sbase(t) is not an exponential distribution

[Figure: fitted Sbase(t) curves plotted against t]

Page 23

Training result (cont'd)

  predictor          β value
  ln(|db|)           0.094
  avg KLdb(1 week)   6.762
  τ                  -1.305

- A larger db takes less time for KLdb(t) to exceed τ
- Databases that change faster during a short period are more likely to change later on

Page 24

How to use the trained model?
- The model gives Sdb(t): the likelihood that db "has not changed much" by time t
- An update policy periodically resamples each db
- Intuitively, maximize Σdb Sdb(t); more precisely, maximize the time-averaged sum

  S̄ = lim(t→∞) (1/t) ∫[0,t] [ Σdb Sdb(t') ] dt'

- A policy: {fdb}, where fdb is the update frequency of db, e.g., 2/week
- Subject to practical constraints, e.g., a total update cap per week

Page 25

Derive an optimal update policy
- Find {fdb} that maximizes S̄ under the constraint Σdb fdb = F, where F is a global frequency limit
- Solvable by the Lagrange-multiplier method
- Sample results:

  db                 λ       F = 4/week   F = 15/week
  tomshardware.com   0.088   1/46         1/5
  usps.com           0.023   1/34         1/12
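As a sketch, the Lagrange-multiplier solution can be approximated numerically by greedily handing out small frequency increments to the database with the largest marginal gain in time-averaged survival. This assumes exponential change (γ = 1), for which a db resampled every 1/f time units has closed-form average survival (f/λ)(1 − e^(−λ/f)); the λ values reuse the slide's sample, but the resulting frequencies are not the paper's:

```python
import math

def avg_survival(lam, f):
    """Time-averaged survival of a db resampled every 1/f time units,
    assuming exponential change S(t) = exp(-lam * t)."""
    if f == 0:
        return 0.0
    return (f / lam) * (1 - math.exp(-lam / f))

def allocate(lams, total_f, step=0.01):
    """Greedily hand out frequency in `step` increments to the db with
    the largest marginal gain; since avg_survival is concave in f, this
    numerically approximates the Lagrange-multiplier optimum of
    max sum avg_survival(lam_i, f_i) s.t. sum f_i = total_f."""
    freqs = [0.0] * len(lams)
    for _ in range(int(round(total_f / step))):
        gains = [avg_survival(l, f + step) - avg_survival(l, f)
                 for l, f in zip(lams, freqs)]
        freqs[gains.index(max(gains))] += step
    return freqs

# fast-changing db (lam = 0.088) vs slow-changing db (lam = 0.023)
freqs = allocate([0.088, 0.023], total_f=1.0)
# the fast-changing database ends up with the larger update frequency
```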

Page 26

Roadmap
- The problem
- Database content modeling
- Database selection
- Summary

Page 27

Database selection
- Select the databases to issue a given query to
- Necessary when the Metasearch engine does not have an entire replica of each database; most likely it has a content summary only
- Reduces the query load in the entire system
- Formalization:
  - Query q = <w1, …, wm>, databases db1, …, dbn
  - Rank the databases according to their "relevancy score" r(dbi, q) to query q

Page 28

Relevancy score: several definitions
- # of matching docs in db
- Similarity between q and the top docs returned by db
  - Typically vector-space similarity (dot product) between q and a doc
  - Sum / avg of the similarities of the top-k docs of each db, e.g., top-10
  - Sum / avg of the similarities of the top docs of each db exceeding a similarity threshold
- Relevancy of db as judged by users
  - Explicit relevance feedback
  - User click-behavior data

Page 29

Estimating r(db,q)
- Typically, r(db, q) is unavailable
- Estimate r̃(db, q) based on C(db) or C̃(db)

Page 30

Estimating r(db,q), example 1 [GGT99]
- r(db, q): the # of matching docs in db
- Independence assumption: the query words w1, …, wm appear independently in db
- Estimator:

  r̃(db, q) = |db| × ∏ over wj ∈ q of ( df(db, wj) / |db| )

- df(db, wj): the document frequency of wj in db; could be d̃f(db, wj) from C̃(db)

Page 31

Estimating r(db,q), example 2 [GGT99]
- r(db, q) = Σ over {d ∈ db | sim(d, q) > l} of sim(d, q)
  - d: a doc in db
  - sim(d, q): the vector dot product between d and q, each word in d and q weighted with a common tf-idf weighting
  - l: a pre-specified threshold

Page 32

Estimating r(db,q), example 2 (cont'd)
- Content summary C(db) required:
  - df(db, w): document frequency
  - v(db, w) = Σ over d ∈ db of the weight of w in d's vector
- <v(db, w1), v(db, w2), …>: the "centroid" of the entire db viewed as a "cluster of doc vectors"

Page 33

Estimating r(db,q), example 2 (cont'd)
- For l = 0, the sum of all q-doc similarity values of db:

  r̃(db, q) = r(db, q) = Σ over d ∈ db of sim(d, q) = <v(q, w1), v(q, w2), …> · <v(db, w1), v(db, w2), …>

  - v(q, w): the weight of w in the query vector
- What about l > 0?
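The l = 0 case is a plain dot product between the query vector and the database centroid; the weights below are made up for illustration:

```python
def centroid_estimate(q_weights, db_centroid):
    """r(db, q) for l = 0: dot product of the query vector with the
    database 'centroid' vector <v(db, w1), v(db, w2), ...>."""
    return sum(w * db_centroid.get(term, 0.0)
               for term, w in q_weights.items())

q = {"applied": 1.0, "mathematics": 2.0}           # v(q, w), hypothetical
centroid = {"applied": 300.0, "mathematics": 150.0,
            "research": 80.0}                      # v(db, w), hypothetical
r = centroid_estimate(q, centroid)  # 1.0*300 + 2.0*150 = 600
```

In this special case the estimate is exact; the correlation and disjointness scenarios on the next slide handle l > 0.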

Page 34

Estimating r(db,q), example 2 (cont'd)
- Assume a uniform weight of w among all docs using w, i.e., the weight of w in any doc = v(db, w) / df(db, w)
- Highly-correlated query words scenario: if df(db, wi) < df(db, wj), every doc using wi also uses wj
  - Sort the words in q s.t. df(db, w1) ≤ df(db, w2) ≤ … ≤ df(db, wm)
  - r̃(db, q) = Σ(i=1…p) v(q, wi)·v(db, wi) + df(db, wp)·[ Σ(j=p+1…m) v(q, wj)·v(db, wj) / df(db, wj) ]
    where p is determined by some criteria [GGT99]
- Disjoint query words scenario: no doc using wi uses wj
  - r̃(db, q) = Σ over { i = 1…m | df(db, wi) > 0 ∧ v(q, wi)·v(db, wi) / df(db, wi) > l } of v(q, wi)·v(db, wi)

Page 35

Estimating r(db,q), example 2 (cont'd)
- Rankings of databases based on r̃(db, q) have been empirically evaluated [GGT99]

Page 36

A probabilistic model for errors in estimation [LLC04]
- Any estimation makes errors
- An error (observed) distribution exists for each db; the distribution of db1 ≠ the distribution of db2
- Definition of error (relative):

  err(db, q) = ( r(db, q) − r̃(db, q) ) / r̃(db, q)

Page 37

Modeling the errors: a motivating experiment
- dbPMC: PubMed Central, www.pubmedcentral.nih.gov
- Two disjoint query sets Q1 and Q2 (healthcare related): |Q1| = |Q2| = 1000, Q1 ∩ Q2 = ∅
- Compute err(dbPMC, q) for each sample query q ∈ Q1 or Q2
- The error probability distribution over Q1 closely matches the one over Q2; further verified through statistical tests (Pearson χ2)

[Figure: two near-identical histograms of err(dbPMC, q), one for q ∈ Q1 and one for q ∈ Q2]

Page 38

Implications of the experiment
- On a text database, sample queries exhibit similar error behavior
- One can sample a database and summarize the error behavior into an Error Distribution (ED)
- Use the ED to predict the error for a future, unseen query
- Sampling-size study [LLC04]: a few hundred sample queries are good enough

Page 39

From an Error Distribution (ED) to a Relevancy Distribution (RD)
- Database: db1. Query: qnew
- The ED for db1 (from sampling): err(db1, qnew) = −50% / 0% / +50% with probability 0.1 / 0.5 / 0.4
- An existing estimation method gives r̃(db1, qnew) = 1000
- By definition, r(db1, qnew) = ( 1 + err(db1, qnew) ) × r̃(db1, qnew)
- The resulting RD for r(db1, qnew): 500 / 1000 / 1500 with probability 0.1 / 0.5 / 0.4
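The ED-to-RD transformation, plus the pairwise comparison used by RD-based selection, can be sketched as follows (distribution values taken from the slide; `prob_greater` is a hypothetical helper that assumes independent discrete RDs):

```python
def relevancy_distribution(r_est, error_dist):
    """Turn an Error Distribution {relative error: prob} into a Relevancy
    Distribution {relevancy value: prob} via r = (1 + err) * r_est."""
    return {(1 + e) * r_est: p for e, p in error_dist.items()}

def prob_greater(rd_a, rd_b):
    """Pr(A > B) for two independent discrete relevancy distributions."""
    return sum(pa * pb for a, pa in rd_a.items()
               for b, pb in rd_b.items() if a > b)

ed1 = {-0.50: 0.1, 0.0: 0.5, +0.50: 0.4}   # ED for db1, from sampling
rd1 = relevancy_distribution(1000, ed1)     # {500: 0.1, 1000: 0.5, 1500: 0.4}
```

`prob_greater(rd1, rd2)` is then the RD-based criterion for ranking two candidate databases.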

Page 40

RD-based selection
- db1: r̃(db1, qnew) = 1000; ED: err(db1, qnew) = −50% / 0% / +50% with probability 0.1 / 0.5 / 0.4; RD: r(db1, qnew) = 500 / 1000 / 1500 with probability 0.1 / 0.5 / 0.4
- db2: r̃(db2, qnew) = 650; ED: err(db2, qnew) = 0% / +100% with probability 0.1 / 0.9; RD: r(db2, qnew) = 650 / 1300 with probability 0.1 / 0.9
- Estimation-based ranking: db1 > db2
- RD-based ranking: db1 < db2 ( Pr(db1 < db2) = 0.85 )

Page 41

Correctness metric
- Terminology:
  - DBk: the k databases returned by some method
  - DBtopk: the actual answer
- How correct is DBk compared to DBtopk?
  - Absolute correctness: Cora(DBk) = 1 if DBk = DBtopk, 0 otherwise
  - Partial correctness: Corp(DBk) = |DBk ∩ DBtopk| / k
- Cora(DBk) = Corp(DBk) for k = 1
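The two metrics in code form (the database names are placeholders):

```python
def cor_absolute(db_k, db_topk):
    """1 if the selected set exactly equals the true top-k, else 0."""
    return 1.0 if set(db_k) == set(db_topk) else 0.0

def cor_partial(db_k, db_topk):
    """Fraction of the selected databases that belong to the true top-k:
    |DBk ∩ DBtopk| / k."""
    return len(set(db_k) & set(db_topk)) / len(db_k)

selected = ["db2", "db5", "db7"]
truth    = ["db2", "db3", "db7"]
abs_cor  = cor_absolute(selected, truth)  # 0.0: sets differ
part_cor = cor_partial(selected, truth)   # 2/3: two of three are correct
```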

Page 42

Effectiveness of RD-based selection
- 20 healthcare-related text databases on the Web
- Q1 (training, 1000 queries) to learn the ED of each database
- Q2 (testing, 1000 queries) to test the correctness of database selection

                                   k = 1            k = 3            k = 3
                                   Avg(Cora)        Avg(Cora)        Avg(Corp)
  Estimation-based selection
  (term-independence estimator)    0.471            0.301            0.699
  RD-based selection               0.651 (+38.2%)   0.478 (+58.8%)   0.815 (+30.9%)

Page 43

Probing to improve correctness
- RD-based selection (k = 1, continuing the example):

  0.85 = Pr(db2 > db1)
       = Pr({db2} = DBtop1)
       = 1·Pr({db2} = DBtop1) + 0·Pr({db2} ≠ DBtop1)
       = E[Cora({db2})]

- Probe dbi: contact dbi to obtain its exact relevancy
- RDs: db1 = 500 / 1000 / 1500 with probability 0.1 / 0.5 / 0.4; db2 = 650 / 1300 with probability 0.1 / 0.9
- After probing db1 and observing r(db1, q) = 500: E[Cora({db2})] = Pr(db2 > db1) = 1

Page 44

Computing the expected correctness
- Expected absolute correctness:

  E[Cora(DBk)] = 1·Pr(Cora(DBk) = 1) + 0·Pr(Cora(DBk) = 0)
               = Pr(Cora(DBk) = 1)
               = Pr(DBk = DBtopk)

- Expected partial correctness:

  E[Corp(DBk)] = Σ(0 ≤ l ≤ k) (l/k)·Pr(Corp(DBk) = l/k)
               = Σ(0 ≤ l ≤ k) (l/k)·Pr(|DBk ∩ DBtopk| = l)

Page 45

Adaptive probing algorithm: APro
- Maintain the RDs of the probed databases (db1, …, dbi) and the unprobed ones (dbi+1, …, dbn)
- Is there any DBk with E[Cor(DBk)] ≥ t, where t is a user-specified correctness threshold?
  - YES: return this DBk
  - NO: probe the next database dbi+1 and repeat

Page 46

Which database to probe?
- A greedy strategy, with stopping condition E[Cor(DBk)] ≥ t: once probed, which database leads to the highest E[Cor(DBk)]?
- Suppose we will probe db3, whose RD has possible values ra, rb, rc:
  - if r(db3, q) = ra, max E[Cor(DBk)] = 0.85
  - if r(db3, q) = rb, max E[Cor(DBk)] = 0.8
  - if r(db3, q) = rc, max E[Cor(DBk)] = 0.9
- Probe the database that leads to the largest "expected" max E[Cor(DBk)], averaging over the outcomes of its RD

Page 47

Effectiveness of adaptive probing
- 20 healthcare-related text databases on the Web
- Q1 (training, 1000 queries) to learn the RD of each database
- Q2 (testing, 1000 queries) to test the correctness of database selection

[Figure: three plots of average correctness vs. # of databases probed (0 to 5): avg Cora for k = 1, avg Cora for k = 3, and avg Corp for k = 3; adaptive probing (APro) dominates the term-independence estimator in all three]

Page 48

The "lazy TA problem"
- The same problem, generalized & "humanized"
- After the final exam, the TA wants to find out the top-scoring students
- The TA is "lazy" and doesn't want to score all the exam sheets
- Input: every student's score as a known distribution, observed from previous quizzes and mid-term exams
- Output: a scoring strategy that maximizes the correctness of the "guessed" top-k students

Page 49

Further study of this problem [LSC05]
- Proves that greedy probing is optimal in special cases
- More interesting factors to be explored:
  - "Optimal" probing strategy in general cases
  - Non-uniform probing cost
  - Time-variant distributions

Page 50

Roadmap
- The problem
- Database content modeling
- Database selection
- Summary

Page 51

Summary
- Metasearch: a challenging problem
- Database content modeling
  - Sampling enhanced by proper application of Zipf's law and Heaps' law
  - Content change modeled using Survival Analysis
- Database selection
  - Estimation of database relevancy based on assumptions
  - A probabilistic framework that models the error as a distribution
  - "Optimal" probing strategy for a collection of distributions as input

Page 52

References
- [CC01] J.P. Callan and M. Connell, "Query-Based Sampling of Text Databases," ACM Trans. on Information Systems, 19(2), 2001
- [GCM97] L. Gravano, C-C. K. Chang, H. Garcia-Molina, A. Paepcke, "STARTS: Stanford Proposal for Internet Meta-searching," in Proc. of the ACM SIGMOD Int'l Conf. on Management of Data, 1997
- [GGT99] L. Gravano, H. Garcia-Molina, A. Tomasic, "GlOSS: Text-Source Discovery over the Internet," ACM Trans. on Database Systems, 24(2), 1999
- [GIG01] N. Green, P. Ipeirotis, L. Gravano, "SDLIP + STARTS = SDARTS: A Protocol and Toolkit for Metasearching," in Proc. of the Joint Conf. on Digital Libraries (JCDL), 2001
- [Hea78] H.S. Heaps, Information Retrieval: Computational and Theoretical Aspects, Academic Press, 1978
- [IG02] P. Ipeirotis, L. Gravano, "Distributed Search over the Hidden Web: Hierarchical Database Sampling and Selection," in Proc. of the 28th VLDB Conf., 2002

Page 53

References (cont'd)
- [INC05] P. Ipeirotis, A. Ntoulas, J. Cho, L. Gravano, "Modeling and Managing Content Changes in Text Databases," in Proc. of the 21st IEEE Int'l Conf. on Data Eng. (ICDE), 2005
- [LLC04] Z. Liu, C. Luo, J. Cho, W.W. Chu, "A Probabilistic Approach to Metasearching with Adaptive Probing," in Proc. of the 20th IEEE Int'l Conf. on Data Eng. (ICDE), 2004
- [LSC05] Z. Liu, K.C. Sia, J. Cho, "Cost-Efficient Processing of Min/Max Queries over Distributed Sensors with Uncertainty," in Proc. of the ACM Annual Symposium on Applied Computing, 2005
- [NPC05] A. Ntoulas, P. Zerfos, J. Cho, "Downloading Hidden Web Content," in Proc. of the Joint Conf. on Digital Libraries (JCDL), June 2005