Google’s Deep-Web Crawl By Jayant Madhavan, David Ko, Lucja Kot, Vignesh Ganapathy, Alex...

Google’s Deep-Web Crawl

By Jayant Madhavan, David Ko, Lucja Kot, Vignesh Ganapathy, Alex Rasmussen, and Alon Halevy

August 30, 2008

Speaker: Sahana Chiwane

Introduction

Deep-Web: Content hidden behind HTML forms, accessible only by submitting the form with valid input values

Deep-Web crawling approaches:

- Vertical search engines: search engines for specific domains (a data integration solution). A mediator form is built for each domain, with semantic mappings between the data sources and the mediator.

- Surfacing the Deep-Web: pre-computing form submissions and indexing the resulting pages.

Challenges in Surfacing

Predicting the correct input combinations (Query Templates)

Predicting the appropriate values for text inputs

Contributions For Surfacing

Informativeness test: evaluates query templates based on the distinctness of the web pages generated by form submissions

Algorithm to identify suitable query templates

Algorithm to predict appropriate input values for text boxes

Query Template Selection

Challenges

- Determine templates of correct dimension

- Determine & discard presentation inputs

Key concept

Informative Template (T):

A template T is informative if

(number of distinct signatures returned by the queries generated from T) / (number of form submissions on T) >= distinctness_fraction

where distinctness_fraction is 0.2.

The dimension (number of inputs) of a template is limited to at most 3.
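The informativeness test above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `signature` function (a hash of normalized page text) and the `fetch` callback are assumptions introduced here, since the slides do not say how signatures are computed.

```python
import hashlib
from typing import Callable, Iterable

DISTINCTNESS_FRACTION = 0.2  # threshold quoted on the slide


def signature(page_text: str) -> str:
    """Hypothetical page signature: hash of whitespace-normalized, lowercased text."""
    normalized = " ".join(page_text.split()).lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def is_informative(submissions: Iterable[dict],
                   fetch: Callable[[dict], str],
                   threshold: float = DISTINCTNESS_FRACTION) -> bool:
    """Return True if the submissions of one template yield enough distinct pages.

    `submissions` are the input-value assignments generated from a template;
    `fetch` (an assumed callback) submits the form and returns the page text.
    """
    pages = [fetch(s) for s in submissions]
    if not pages:
        return False
    distinct = len({signature(p) for p in pages})
    return distinct / len(pages) >= threshold
```

A template whose submissions all return the same page (for example, an input that is ignored by the site) scores 1 / n and fails the test once n exceeds 5.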

Experimental Results

• Template selection based on the informativeness test generates fewer URLs, and the number of URLs scales linearly with the size of the underlying database, as shown in the graph.

CARTESIAN: all possible URLs

TRIPLE: templates with three binding inputs
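To make the two baselines concrete, the sketch below counts the URLs each would generate for a form whose inputs are all select menus. It assumes CARTESIAN means one URL per combination of values across all inputs, and TRIPLE means one URL per value combination for every choice of three inputs; the function names and menu sizes are made up for illustration.

```python
from itertools import combinations
from math import prod
from typing import Sequence


def cartesian_url_count(menu_sizes: Sequence[int]) -> int:
    """URLs when every input is bound at once (assumed reading of CARTESIAN)."""
    return prod(menu_sizes)


def triple_url_count(menu_sizes: Sequence[int]) -> int:
    """URLs summed over all templates with exactly three binding inputs."""
    return sum(prod(combo) for combo in combinations(menu_sizes, 3))


sizes = [10, 5, 4, 3, 2]  # hypothetical number of options per select menu
print(cartesian_url_count(sizes))  # 1200
print(triple_url_count(sizes))     # 864
```

Both counts grow quickly with the number and size of the inputs, which is why a further filter on templates is needed.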

Experimental Results

The table above shows that limiting the template dimension to 3 and applying the informativeness test keeps the number of URLs tested growing linearly.
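One way to combine the dimension cap with the informativeness test is an incremental search: test single-input templates first, then extend only the informative ones by one input at a time. The sketch below is an illustration under that assumption, not the paper's exact algorithm; `select_templates` and the `is_informative` callback are hypothetical names.

```python
from typing import Callable, FrozenSet, Iterable, List

MAX_DIMENSION = 3  # dimension cap quoted on the slides


def select_templates(inputs: Iterable[str],
                     is_informative: Callable[[FrozenSet[str]], bool],
                     max_dimension: int = MAX_DIMENSION) -> List[FrozenSet[str]]:
    """Grow templates one input at a time, keeping only informative ones."""
    names = list(inputs)
    candidates = {frozenset([name]) for name in names}
    selected: List[FrozenSet[str]] = []
    dimension = 1
    while candidates and dimension <= max_dimension:
        # Sort for a deterministic evaluation order.
        informative = [t for t in sorted(candidates, key=sorted)
                       if is_informative(t)]
        selected.extend(informative)
        # Only informative templates are extended to the next dimension.
        candidates = {t | {name}
                      for t in informative
                      for name in names if name not in t}
        dimension += 1
    return selected


# Toy oracle: any template that binds the presentation input is uninformative.
chosen = select_templates(["make", "model", "sort_by"],
                          lambda t: "sort_by" not in t)
print([sorted(t) for t in chosen])
```

In this toy run the presentation input `sort_by` is never selected, because binding it only reorders the same results and so fails the test.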

Input Values

Challenges

- Determine generic & typed inputs

- Determine candidate keywords and value selection

Key concept

Finite selection: try all values.

Typed text box: matches a known collection of types (cities, zip codes, price [low/high], dates, etc.). The type whose values yield the highest distinctness_fraction indicates the input's type.

Generic text box: obtain a seed set of query words by parsing the form itself, issue queries, mine the result pages for high-importance words to add to the set, and iterate (iterative probing).
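The iterative probing loop for generic text boxes might look like the sketch below. It is a simplified illustration: raw word frequency stands in for whatever importance measure is actually used, and `search`, `extract_words`, `rounds`, and `per_round` are hypothetical names and parameters.

```python
import re
from collections import Counter
from typing import Callable, List


def extract_words(text: str) -> List[str]:
    """Lowercased alphabetic tokens of three or more letters."""
    return re.findall(r"[a-z]{3,}", text.lower())


def iterative_probing(seed_text: str,
                      search: Callable[[str], str],
                      rounds: int = 3,
                      per_round: int = 5) -> List[str]:
    """Collect candidate keywords by probing the form and mining result pages.

    `seed_text` is text parsed from the form page; `search` (assumed callback)
    submits one keyword and returns the result page text.
    """
    keywords: List[str] = []
    seen = set()
    # Seed set: the first few distinct words found on the form itself.
    frontier = list(dict.fromkeys(extract_words(seed_text)))[:per_round]
    for _ in range(rounds):
        if not frontier:
            break
        counts: Counter = Counter()
        for word in frontier:
            seen.add(word)
            keywords.append(word)
            counts.update(extract_words(search(word)))
        # Frequency is a stand-in for the "high importance" criterion.
        frontier = [w for w, _ in counts.most_common() if w not in seen][:per_round]
    return keywords
```

The returned list is only a candidate pool; the final values would still need to be filtered, for example by how many distinct result pages they produce.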

Generic Input Results

The table below shows the number of records retrieved and the number of URLs generated, against the estimated database size, which suggests that ISIT has superior coverage.

first: records on the result page when using only the text box.

select: records on the result page when using only select menus.

first++: records on the result page and on the pages linked from it when using only the text box.

Detecting Input Type Results

The table below shows that the vast majority of the algorithm's type recognitions are correct

Each entry records the results of applying a particular type recognizer (rows, e.g., city-us) on inputs whose names match different patterns (columns, e.g., *city*, *date*).
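A type recognizer of this kind can be sketched as follows: submit sample values of each known type to the input and pick the type whose submissions produce the most distinct pages. The sample values, the `submit` callback, and the `min_fraction` cutoff of 0.5 are assumptions made for this illustration and do not come from the slides.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical sample values per type; a real system would use larger lists.
TYPE_SAMPLES: Dict[str, List[str]] = {
    "city-us": ["Boston", "Chicago", "Seattle", "Denver", "Austin"],
    "zip-us": ["02139", "60601", "98101", "80202", "73301"],
}


def distinctness(values: List[str], submit: Callable[[str], str]) -> float:
    """Fraction of distinct result pages over the submitted values."""
    pages = [submit(v) for v in values]
    return len(set(pages)) / len(pages) if pages else 0.0


def detect_type(submit: Callable[[str], str],
                samples: Dict[str, List[str]] = TYPE_SAMPLES,
                min_fraction: float = 0.5) -> Optional[str]:
    """Return the best-scoring type, or None if no type clears the cutoff."""
    scores = {name: distinctness(vals, submit) for name, vals in samples.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_fraction else None


# Toy form that only understands alphabetic city names.
def fake_city_box(value: str) -> str:
    return f"results for {value}" if value.isalpha() else "no results"


print(detect_type(fake_city_box))  # city-us
```

An input that returns the same page for every sample of every type gets no type assigned and would fall back to the generic text box handling.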

Research Directions

Crawl subsets of Deep-Web sites to maximize traffic and coverage and to reduce crawler load

Develop heuristics to identify common data types to enable vertical searching

Forms submitted through POST need to be surfaced

Ranks of the web sites to be considered

Include form submission through Javascript

Include dependencies between input values
