Google’s Deep-Web Crawl
By Jayant Madhavan, David Ko, Lucja Kot, Vignesh Ganapathy,
Alex Rasmussen, and Alon Halevy
August 30, 2008
Speaker: Sahana Chiwane
Introduction
Deep-Web: Content hidden behind HTML forms that can be accessed only by submitting the form with valid input values
Deep-Web crawling approaches:
Vertical Search Engines: Search engines for specific domains (a data integration solution). Requires a mediator form for each domain and semantic mappings between the data sources and the mediator.
Surfacing the Deep-Web: Pre-computing form submissions and indexing the resulting pages
Challenges in Surfacing
Predicting the correct input combinations (Query Templates)
Predicting the appropriate values for text inputs
Contributions For Surfacing
Informativeness Test : To evaluate query templates based on distinctness of the web pages generated via form submission
Algorithm to identify suitable query templates
Algorithm to predict appropriate input values for text boxes
Query Template Selection
Challenges
- Determine templates of correct dimension
- Determine & discard presentation inputs
Key concept
Informative Template (T):
(number of distinct signatures returned by the queries generated from T) / (number of form submissions on T) >= distinctness_fraction,
where distinctness_fraction is 0.2
The dimension (number of inputs) of a template is limited to <= 3.
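The informativeness test above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `page_signature` is a hypothetical stand-in (a hash of normalized page text), and the function names are assumptions made for the example. Only the ratio and the 0.2 threshold come from the slide.

```python
import hashlib
import re


def page_signature(page_text: str) -> str:
    """Hypothetical signature: hash of lowercased, whitespace-normalized text."""
    normalized = re.sub(r"\s+", " ", page_text).strip().lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def distinctness_fraction(pages: list[str]) -> float:
    """Distinct signatures divided by the number of form submissions."""
    if not pages:
        return 0.0
    signatures = {page_signature(page) for page in pages}
    return len(signatures) / len(pages)


def is_informative(pages: list[str], threshold: float = 0.2) -> bool:
    """A template is informative if its result pages are distinct enough."""
    return distinctness_fraction(pages) >= threshold
```

Here `pages` holds one result page per form submission generated from the template. A template whose submissions all return the same error page scores 1 / (number of submissions) and is rejected once enough submissions are tried.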
Experimental Results
• Template selection based on the informativeness test generates fewer URLs, and the number of URLs scales linearly with the size of the underlying database, as shown in the graph.
CARTESIAN: all possible URLs
TRIPLE: templates with three binding inputs
Experimental Results
The table above shows that limiting the template dimension to 3 and applying the informativeness test keeps the number of URLs tested growing only linearly
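To see why capping the dimension matters, the following sketch counts candidate submissions for a hypothetical form. It assumes that a Cartesian crawl binds every input at once and that a dimension-limited crawl tries every subset of at most three inputs; the function names and the example form are illustrative only, and the informativeness test would prune the second count further.

```python
from itertools import combinations
from math import prod


def cartesian_submissions(value_counts: list[int]) -> int:
    """Submissions needed when every input is bound at once."""
    return prod(value_counts)


def bounded_submissions(value_counts: list[int], max_dimension: int = 3) -> int:
    """Submissions over all templates that bind at most max_dimension inputs."""
    total = 0
    for dimension in range(1, max_dimension + 1):
        for subset in combinations(value_counts, dimension):
            total += prod(subset)
    return total


# Hypothetical form: 8 select menus with 10 options each.
form = [10] * 8
print(cartesian_submissions(form))  # 100000000
print(bounded_submissions(form))    # 58880
```

The second figure is 8·10 + 28·100 + 56·1000: the number of ways to pick one, two, or three of the eight inputs, each multiplied by the value combinations for the chosen inputs.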
Input Values
Challenges
- Determine generic & typed inputs
- Determine candidate keywords and value selection
Key concept
Finite selection: try all values.
Typed text box: known collection of types (cities, zip codes, price [low/high], dates, etc.). The candidate type that yields the highest distinctness_fraction indicates the input's type.
Generic text box: obtain a seed set of query words by parsing the form itself. Issue queries and mine the result pages for high-importance words to add to the set, then iterate (Iterative Probing).
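The iterative probing loop for generic text boxes might look roughly like the sketch below. Several parts are assumptions made for illustration: `submit` is a hypothetical callback that returns result pages for a keyword, word importance is approximated by how many result pages contain the word (the slide does not specify the scoring), and `max_rounds` and `per_round` are invented limits.

```python
import re
from collections import Counter
from typing import Callable


def tokenize(text: str) -> list[str]:
    """Lowercase alphabetic tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z]+", text.lower())


def iterative_probing(
    seed_words: list[str],
    submit: Callable[[str], list[str]],
    max_rounds: int = 3,
    per_round: int = 5,
) -> list[str]:
    """Grow a keyword set by mining result pages of earlier probes."""
    keywords = list(dict.fromkeys(seed_words))  # de-duplicate, keep order
    tried: set[str] = set()
    for _ in range(max_rounds):
        pending = [word for word in keywords if word not in tried]
        if not pending:
            break
        page_counts: Counter = Counter()
        for word in pending:
            tried.add(word)
            for page in submit(word):
                # Count each word once per page (stand-in importance score).
                page_counts.update(set(tokenize(page)))
        ranked = sorted(page_counts, key=lambda w: (-page_counts[w], w))
        new_words = [w for w in ranked if w not in keywords][:per_round]
        keywords.extend(new_words)
    return keywords


# Toy "database" standing in for a site behind a search box.
RECORDS = ["red honda civic", "blue honda accord", "blue toyota camry"]


def toy_search(word: str) -> list[str]:
    return [record for record in RECORDS if word in tokenize(record)]


print(iterative_probing(["civic"], toy_search))
```

Starting from a single seed word, each round reaches records that the previous keywords could not, which is the coverage gain the technique aims for.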
Generic Input Results
The table below shows the number of records retrieved and the number of URLs generated against the estimated database size, which suggests that ISIT has superior coverage.
first: records on the result page when using only the text box.
select: records on the result page when using only select menus.
first++: records on the result page and the pages linked from it when using only the text box.
Detecting Input Type Results
The table below shows that the vast majority of the algorithm's type recognitions are correct
Each entry records the results of applying a particular type recognizer (rows, e.g., city-us) on inputs whose names match different patterns (columns, e.g., *city*, *date*).
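The typed-input idea from the Key concept slide (probe the text box with sample values of each known type and compare distinctness fractions) can be sketched as follows. This is an assumption-laden illustration: `submit` is a hypothetical callback returning the result page for a value, raw page text stands in for a page signature, and the sample values and the toy store locator are invented for the example.

```python
from typing import Callable, Optional


def recognize_type(
    samples_by_type: dict[str, list[str]],
    submit: Callable[[str], str],
) -> tuple[Optional[str], float]:
    """Return the candidate type whose sample values give the most distinct pages."""
    best_type, best_fraction = None, 0.0
    for type_name, values in samples_by_type.items():
        if not values:
            continue
        pages = [submit(value) for value in values]
        # Raw page text is used as the signature in this sketch.
        fraction = len(set(pages)) / len(pages)
        if fraction > best_fraction:
            best_type, best_fraction = type_name, fraction
    return best_type, best_fraction


# Toy store locator that only understands US zip codes.
STORES = {"94043": "Mountain View", "10001": "New York", "60601": "Chicago"}


def toy_store_locator(value: str) -> str:
    if value in STORES:
        return f"Stores near {value}: {STORES[value]}"
    return "No results found"


SAMPLES = {
    "zip-us": ["94043", "10001", "60601"],
    "city-us": ["Boston", "Seattle", "Denver"],
}

print(recognize_type(SAMPLES, toy_store_locator))
```

City names all return the same "no results" page, so their distinctness fraction is low, while zip codes return a different page each time and the recognizer selects that type.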
Research Directions
Crawl subsets of the Deep-Web sites to maximize traffic and coverage and reduce crawler load
Develop heuristics to identify common data types to enable vertical searching
Forms submitted through POST need to be surfaced
Ranks of the web sites to be considered
Include form submission through Javascript
Include dependencies between input values