
Downloading Textual Hidden-Web Content Through Keyword Queries

Alexandros Ntoulas Petros Zerfos Junghoo Cho

University of California, Los Angeles

Computer Science Department

{ntoulas, pzerfos, cho}@cs.ucla.edu

JCDL, June 8th 2005


April 10, 2023

Motivation

I would like to buy a used ’98 Ford Taurus

Technical specs?

Reviews?

Classifieds?

Vehicle history?

Google?


Why can’t we use a search engine?

Search engines today employ crawlers that find pages by following links

Many useful pages are available only after issuing queries (e.g. Classifieds, USPTO, PubMed, LoC, …)

Search engines cannot reach such pages: there are no links to them (Hidden-Web)

In this talk: how can we download Hidden-Web content?


Outline

Interacting with Hidden-Web sites

Algorithms for selecting queries for the Hidden-Web sites

Experimental evaluation of our algorithms


Interacting with Hidden-Web pages (1)

1. The user issues a query through a query interface

Example query: liver


Interacting with Hidden-Web pages (2)

1. The user issues a query through a query interface

2. A result list is presented to the user

(Figure: result list page)


Interacting with Hidden-Web pages (3)

1. The user issues a query through a query interface

2. A result list is presented to the user

3. The user selects and views the “interesting” results


Querying a Hidden-Web site

Procedure

while (there are available resources) do
    (1) select a query to send to the site
    (2) send the query and acquire the result list
    (3) download the pages
done
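The procedure above can be sketched in Python as follows. This is only an illustration: the site is simulated by a dictionary (`TOY_SITE`), one unit of budget is charged per query, and `first_unissued` is a hypothetical placeholder policy, none of which come from the talk.

```python
from typing import Callable, Dict, Set

def crawl(site: Dict[str, Set[int]],
          select_query: Callable[[Set[str], Set[int]], str],
          budget: int) -> Set[int]:
    """Issue queries until the budget is spent; return the downloaded page ids."""
    downloaded: Set[int] = set()
    issued: Set[str] = set()
    while budget > 0 and len(issued) < len(site):
        query = select_query(issued, downloaded)  # (1) select a query
        issued.add(query)
        results = site.get(query, set())          # (2) send it, get the result list
        downloaded |= results                     # (3) download the pages
        budget -= 1                               # assumed: one resource unit per query
    return downloaded

# Toy stand-in for a Hidden-Web site: keyword -> ids of the pages containing it.
TOY_SITE = {"liver": {1, 2, 3}, "heart": {3, 4}, "kidney": {5}}

def first_unissued(issued: Set[str], downloaded: Set[int]) -> str:
    """Hypothetical policy: the next keyword in alphabetical order."""
    return sorted(k for k in TOY_SITE if k not in issued)[0]

pages = crawl(TOY_SITE, first_unissued, budget=2)
```

The interesting part is `select_query`: the rest of the talk is about choosing it well.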


How should we select the queries? (1)

S: set of pages in the Web site (pages as points)

$q_i$: set of pages returned if we issue query $q_i$ (queries as circles)


How should we select the queries? (2)

Find the queries (circles) that cover the maximum number of pages (points)

Equivalent to the set-covering problem in graph theory


Challenges during query selection

In practice we don’t know which pages will be returned by which queries (the $q_i$ are unknown)

Even if we did know the $q_i$, the set-covering problem is NP-hard

We will present approximation algorithms for the query selection problem

We will assume single-keyword queries
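For intuition, here is the classic greedy heuristic for covering problems, written as a small Python sketch. It assumes the result set of every query is known in advance, which the slide above notes is not the case in practice, so it is a baseline illustration rather than the algorithm presented in this talk. `RESULTS` is made-up data.

```python
def greedy_cover(query_results, max_queries):
    """Repeatedly pick the query that adds the most not-yet-covered pages."""
    covered, chosen = set(), []
    for _ in range(max_queries):
        best = max(query_results, key=lambda q: len(query_results[q] - covered))
        if not query_results[best] - covered:
            break  # no remaining query adds a new page
        chosen.append(best)
        covered |= query_results[best]
    return chosen, covered

# Made-up result sets: query -> ids of the pages it returns.
RESULTS = {"a": {1, 2, 3, 4}, "b": {3, 4, 5}, "c": {5, 6}, "d": {1}}

chosen, covered = greedy_cover(RESULTS, max_queries=2)
```

Here "c" is preferred over "b" in the second round because, after "a", it contributes two new pages rather than one.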


Some background (1)

Assumption: when we issue query $q_i$ to a Web site, all pages containing $q_i$ are returned

$P(q_i)$: fraction of pages from the site we get back after issuing $q_i$

Example: $q$ = liver

No. of docs in DB: 10,000

No. of docs containing liver: 3,000

$P(\text{liver}) = 0.3$


Some background (2)

$P(q_1 \land q_2)$: fraction of pages containing both $q_1$ and $q_2$ (intersection of $q_1$ and $q_2$)

$P(q_1 \lor q_2)$: fraction of pages containing either $q_1$ or $q_2$ (union of $q_1$ and $q_2$)

Cost and benefit:

How much benefit do we get out of a query?

How costly is it to issue a query?
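The three quantities defined so far (single query, intersection, union) can be checked on a toy collection. The documents in `DOCS` are invented for illustration; the last line of the block computes the inclusion-exclusion identity that the adaptive algorithm relies on later.

```python
# Toy collection: document id -> set of keywords it contains (made-up data).
DOCS = {
    1: {"liver", "cancer"},
    2: {"liver"},
    3: {"heart", "cancer"},
    4: {"heart"},
    5: {"kidney"},
}

def matching(keyword):
    """Ids of the documents that contain the keyword."""
    return {doc_id for doc_id, words in DOCS.items() if keyword in words}

def fraction(doc_ids):
    """Fraction of the whole collection represented by doc_ids."""
    return len(doc_ids) / len(DOCS)

p_liver = fraction(matching("liver"))
p_cancer = fraction(matching("cancer"))
p_both = fraction(matching("liver") & matching("cancer"))    # intersection
p_either = fraction(matching("liver") | matching("cancer"))  # union

# Inclusion-exclusion: P(a or b) = P(a) + P(b) - P(a and b)
identity_gap = abs(p_either - (p_liver + p_cancer - p_both))
```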


Cost function

The cost to issue a query and download the Hidden-Web pages:

$c_q$: query cost

$c_r$: cost for retrieving a result item

$c_d$: cost for downloading a document

$$\mathrm{Cost}(q_i) = c_q + c_r P(q_i) + c_d P(q_i)$$

(1) $c_q$: cost for issuing a query

(2) $c_r P(q_i)$: cost for retrieving a result item times no. of results

(3) $c_d P(q_i)$: cost for retrieving a doc times no. of docs
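As a quick numeric illustration, the cost function can be written as a one-line Python function. The default constants are the values used in the PubMed cost experiment later in the talk (100, 100 and 10,000); treating them as defaults here is an assumption made only for the example.

```python
def query_cost(p_q, c_q=100.0, c_r=100.0, c_d=10_000.0):
    """Cost(q) = c_q + c_r * P(q) + c_d * P(q), with P(q) a fraction in [0, 1]."""
    return c_q + c_r * p_q + c_d * p_q

# The "liver" example from the background slide: P(liver) = 0.3
liver_cost = query_cost(0.3)   # 100 + 30 + 3000, i.e. about 3130
```

With these constants the download term dominates, so a query that returns many documents is expensive even though issuing it is cheap.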


Problem formalization

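Find the set of queries $q_1, \dots, q_n$ which maximizes

$$P(q_1 \lor \dots \lor q_n)$$

under the constraint

$$\sum_{i=1}^{n} \mathrm{Cost}(q_i) \le t$$

where $t$ is the total amount of resources available for the crawl.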


Query selection algorithms

Random: Select a query randomly from a precompiled list (e.g. a dictionary)

Frequency-based: Select a query from a precompiled list based on frequency (e.g. a corpus previously downloaded from the Web)

Adaptive: Analyze previously downloaded pages to determine “promising” future queries
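The two non-adaptive policies are simple enough to sketch directly. The function names, the word list and the corpus counts below are all hypothetical; the only point is that neither policy looks at the pages downloaded so far.

```python
import random

def random_policy(candidates, issued, rng):
    """Random: pick any not-yet-issued word from a precompiled list."""
    remaining = [word for word in candidates if word not in issued]
    return rng.choice(remaining)

def frequency_policy(corpus_counts, issued):
    """Frequency-based: pick the most frequent not-yet-issued word."""
    remaining = {w: c for w, c in corpus_counts.items() if w not in issued}
    return max(remaining, key=remaining.get)

# Made-up word counts from some previously downloaded corpus.
COUNTS = {"the": 50, "data": 20, "liver": 7}

next_by_frequency = frequency_policy(COUNTS, issued={"the"})
next_by_chance = random_policy(list(COUNTS), issued={"the"}, rng=random.Random(0))
```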


Adaptive query selection

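Assume we have issued $q_1, \dots, q_{i-1}$.

To find a promising query $q_i$ we need to estimate $P(q_1 \lor \dots \lor q_{i-1} \lor q_i)$:

$$P((q_1 \lor \dots \lor q_{i-1}) \lor q_i) = P(q_1 \lor \dots \lor q_{i-1}) + P(q_i) - P(q_1 \lor \dots \lor q_{i-1})\, P(q_i \mid q_1 \lor \dots \lor q_{i-1})$$

$P(q_1 \lor \dots \lor q_{i-1})$: known (by counting), since we have issued $q_1, \dots, q_{i-1}$

$P(q_i \mid q_1 \lor \dots \lor q_{i-1})$: can be measured by counting occurrences of $q_i$ within the pages returned by $q_1, \dots, q_{i-1}$

What about $P(q_i)$?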
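The decomposition can be verified on a small example. The sketch below uses a made-up site of 10 pages and, unlike a real crawl, assumes the true value of P(q_i) is available, so the estimate and the exact union coincide.

```python
def estimate_union(p_prev_union, p_qi, p_qi_given_prev):
    """P(prev or q_i) = P(prev) + P(q_i) - P(prev) * P(q_i | prev)."""
    return p_prev_union + p_qi - p_prev_union * p_qi_given_prev

SITE_SIZE = 10                 # made-up site with pages 0..9
downloaded = {0, 1, 2, 3, 4}   # pages returned by the queries issued so far
matches_qi = {3, 4, 5, 6}      # pages that contain the candidate query q_i

p_prev = len(downloaded) / SITE_SIZE
p_qi = len(matches_qi) / SITE_SIZE   # unknown in practice; must be estimated
p_qi_given_prev = len(matches_qi & downloaded) / len(downloaded)  # by counting

estimate = estimate_union(p_prev, p_qi, p_qi_given_prev)
exact = len(downloaded | matches_qi) / SITE_SIZE
```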


Estimating P(qi)

Independence estimator: $P(q_i) \approx P(q_i \mid q_1 \lor \dots \lor q_{i-1})$

Zipf estimator [IG02]:

Rank queries based on frequency of occurrence and fit a power law distribution

Use the fitted distribution to estimate $P(q_i)$
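Both estimators can be sketched in a few lines. The power-law fit below is a plain least-squares line in log-log space, which is only one simple way to fit such a distribution and is not claimed to be the procedure of [IG02]; the ranks and fractions are synthetic.

```python
import math

def independence_estimate(count_in_downloaded, num_downloaded):
    """Assume P(q_i) is the same as its fraction among the downloaded pages."""
    return count_in_downloaded / num_downloaded

def fit_power_law(ranks, fractions):
    """Least-squares fit of log f = log a - b * log r; returns (a, b)."""
    xs = [math.log(r) for r in ranks]
    ys = [math.log(f) for f in fractions]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

def zipf_estimate(rank, a, b):
    """Estimated fraction for the term at the given frequency rank."""
    return a * rank ** (-b)

# Synthetic points that follow f = 0.5 / r exactly.
a, b = fit_power_law([1, 2, 4, 8], [0.5, 0.25, 0.125, 0.0625])
p_rank_16 = zipf_estimate(16, a, b)
```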


Query selection algorithm

foreach $q_i$ in [potential queries] do
    $P_{new}(q_i) = P(q_1 \lor \dots \lor q_{i-1} \lor q_i) - P(q_1 \lor \dots \lor q_{i-1})$
    Estimate $\mathrm{Efficiency}(q_i) = \dfrac{P_{new}(q_i)}{\mathrm{Cost}(q_i)}$
done
return $q_i$ with maximum $\mathrm{Efficiency}(q_i)$
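A minimal Python sketch of this selection step follows. It assumes each candidate already comes with an estimate of P(q) and of P(q | previous queries); the numbers in `ESTIMATES` and the cost constants are illustrative, not measured values.

```python
def cost(p_q, c_q=100.0, c_r=100.0, c_d=10_000.0):
    """Cost(q) = c_q + c_r * P(q) + c_d * P(q); constants are illustrative."""
    return c_q + c_r * p_q + c_d * p_q

def select_next_query(estimates, p_prev_union):
    """Return the candidate with the highest P_new(q) / Cost(q)."""
    best_query, best_efficiency = None, float("-inf")
    for query, (p_q, p_q_given_prev) in estimates.items():
        # P_new(q) = P(prev or q) - P(prev) = P(q) - P(prev) * P(q | prev)
        p_new = p_q - p_prev_union * p_q_given_prev
        efficiency = p_new / cost(p_q)
        if efficiency > best_efficiency:
            best_query, best_efficiency = query, efficiency
    return best_query

# candidate -> (estimated P(q), estimated P(q | previous queries)); made up.
ESTIMATES = {
    "liver": (0.30, 0.50),
    "heart": (0.20, 0.10),
    "the": (0.90, 0.95),
}

chosen = select_next_query(ESTIMATES, p_prev_union=0.5)
```

In this example "the" would add the most new pages, but "heart" wins because it adds new pages at a much lower cost.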


Other practical issues

Efficient calculation of $P(q_i \mid q_1 \lor \dots \lor q_{i-1})$

Selection of the initial query

Crawling sites that limit the number of results (e.g. DMOZ returns up to 10,000 results)

Please refer to our paper for the details


Experimental evaluation

Applied our algorithms to 4 different sites

Hidden-Web site                 No. of documents    Limit on the no. of results
PubMed medical library          ~13 million         no limit
Books section of Amazon         ~4.2 million        32,000
DMOZ: Open Directory Project    ~3.8 million        10,000
Arts section of DMOZ            ~429,000            10,000


Policies

Random-16K: pick query randomly from the 16,000 most popular terms

Random-1M: pick query randomly from the 1,000,000 most popular terms

Frequency-based: pick query based on frequency of occurrence

Adaptive


Coverage of policies

What fraction of the Web sites can we download by issuing queries?

Study $P(q_1 \lor \dots \lor q_i)$ as $i$ increases
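The coverage metric is the cumulative fraction of the site downloaded after each query. A small sketch, with made-up result sets and a made-up site size:

```python
def coverage_curve(result_sets, site_size):
    """Fraction of the site covered after each successive query."""
    covered, curve = set(), []
    for results in result_sets:
        covered |= results
        curve.append(len(covered) / site_size)
    return curve

# Result sets of three successive queries on a hypothetical 4-page site.
curve = coverage_curve([{1, 2}, {2, 3}, {3}], site_size=4)
```

The curve never decreases, and it flattens when a query returns only pages that were already downloaded, as the third query does here.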


Coverage of policies for PubMed

Adaptive gets ~80% with ~83 queries

Frequency needs 103 for the same coverage


Coverage of policies for DMOZ (whole)

Adaptive outperforms others


Coverage of policies for DMOZ (arts)

Adaptive performs best in topic-specific texts


Other experiments

Impact of the initial query

Impact of the various parameters of the cost function

Crawling sites that limit the number of results (e.g. DMOZ returns up to 10,000 results)

Please refer to our paper for the details


Related work

Issuing queries to databases

Acquire language model [CCD99]

Estimate fraction of the Web indexed [LG98]

Estimate relative size and overlap of indexes [BB98]

Build multi-keyword queries that can return a large number of documents [BF04]

Harvesting approaches/cooperative databases (OAI [LS01], DP9 [LMZN02])


Conclusion

An adaptive algorithm for issuing queries to Hidden-Web sites

Our algorithm is highly efficient (downloaded >90% of a site with ~100 queries)

Allows users to tap into unexplored information on the Web

Allows the research community to download, mine, study, understand the Hidden-Web


References

[IG02] P. Ipeirotis, L. Gravano. Distributed search over the hidden web: Hierarchical database sampling and selection. VLDB 2002.

[CCD99] J. Callan, M.E. Connell, A. Du. Automatic discovery of language models for text databases. SIGMOD 1999.

[LG98] S. Lawrence, C.L. Giles. Searching the World Wide Web. Science 280(5360):98-100, 1998.

[BB98] K. Bharat, A. Broder. A technique for measuring the relative size and overlap of public web search engines. WWW 1998.

[BF04] L. Barbosa, J. Freire. Siphoning hidden-web data through keyword-based interfaces.

[LS01] C. Lagoze, H. Van de Sompel. The Open Archives Initiative: Building a low-barrier interoperability framework. JCDL 2001.

[LMZN02] X. Liu, K. Maly, M. Zubair, M.L. Nelson. DP9: An OAI Gateway Service for Web Crawlers. JCDL 2002.

Thank you!

Questions?


Impact of the initial query

Does it matter what the first query is?

Crawled PubMed with queries:

data (1,344,999 results)

information (308,474 results)

return (29,707 results)

pubmed (695 results)


Impact of the initial query

Algorithm converges regardless of initial query


Incorporating the document download cost

$\mathrm{Cost}(q_i) = c_q + c_r P(q_i) + c_d P_{new}(q_i)$

Crawled PubMed with $c_q = 100$, $c_r = 100$, $c_d = 10{,}000$


Incorporating document download cost

Adaptive uses resources more efficiently

Document cost is a significant portion of the cost


Can we get all the results back ?

…


Downloading from sites limiting the number of results (1)

Site returns $q_i'$ instead of $q_i$

For $q_{i+1}$ we need to estimate $P(q_{i+1} \mid q_1 \lor \dots \lor q_i)$


Downloading from sites limiting the number of results (2)

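$$P(q_{i+1} \mid q_1 \lor \dots \lor q_i) = \frac{1}{P(q_1 \lor \dots \lor q_i)} \Big[ P(q_{i+1} \land (q_1 \lor \dots \lor q_{i-1})) + P(q_{i+1} \land q_i) - P(q_{i+1} \land q_i \land (q_1 \lor \dots \lor q_{i-1})) \Big]$$

Assuming $q_i'$ is a random sample of $q_i$:

$$\frac{P(q_{i+1} \land q_i)}{P(q_{i+1} \land q_i')} = \frac{P(q_i)}{P(q_i')}$$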
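Under the random-sample assumption, the fraction of pages matching both q_{i+1} and q_i can be obtained by rescaling what is observed in the truncated result list: P(q_{i+1} and q_i) is approximately P(q_{i+1} and q_i') * P(q_i) / P(q_i'). The sketch below applies this with invented numbers, and it additionally assumes the site reports the total number of matches for q_i, so that P(q_i) is known even though only part of the list is returned.

```python
def estimate_joint(p_next_and_sample, p_qi, p_sample):
    """Rescale the overlap seen in the truncated list q_i' to the full q_i."""
    return p_next_and_sample * p_qi / p_sample

SITE_SIZE = 100_000      # made-up site size
total_matches = 5_000    # pages matching q_i (assumed to be reported by the site)
returned = 1_000         # pages actually returned (the result limit)
overlap_in_sample = 200  # returned pages that also contain q_{i+1}

p_joint = estimate_joint(
    p_next_and_sample=overlap_in_sample / SITE_SIZE,
    p_qi=total_matches / SITE_SIZE,
    p_sample=returned / SITE_SIZE,
)
```

The estimate scales the 200 observed overlaps up by the factor 5,000 / 1,000, giving roughly 1,000 pages, or about 1% of the site.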


Impact of the limit of results

How does the limit of results affect our algorithms?

Crawled DMOZ but restricted the algorithms to 1,000 results instead of 10,000


DMOZ with a result cap of 1,000

Adaptive still outperforms frequency-based
