1 Deep web 1/26/2016 Jianguo Lu. 2 What is deep web Also called hidden web, invisible web –In...


Deep web

05/03/23

Jianguo Lu


What is deep web

• Also called hidden web, invisible web

– In contrast to surface web

• Content is dynamically generated by a search interface. The search interface can be

– HTML form

– Web service

• Content in general is stored in a database

• Usually not indexed by a search engine

– This is why the surface web is sometimes defined as the part of the web accessible to a search engine

• http://www.osti.gov/fedsearch

Deep web vs. surface web

• Bergman, Michael K. (August 2001). "The Deep Web: Surfacing Hidden Value". The Journal of Electronic Publishing 7 (1).


How large is deep web

• http://www.juanico.co.il/Main%20frame%20-%20English/Issues/Information%20systems.htm


Deep and surface web may overlap

• Some content hidden behind an HTML form or web service can also be available in normal html pages

• Some search engines try to index part of the deep web

– Google is also crawling the deep web

– Madhavan, Jayant; David Ko, Łucja Kot, Vignesh Ganapathy, Alex Rasmussen, Alon Halevy (2008). Google’s Deep-Web Crawl. VLDB

– Only a very small portion of deep web is indexed


Why is there a deep web

• Not everything can be in surface web, for many reasons…

• Some pages are generated on the fly

– There are pages that are generated by a specific request, e.g.,

– books in a library,

– historical weather data,

– newspaper archives,

– all the accounts/members in Flickr/Twitter/Facebook… web sites

– There would be too many items if they were represented as web pages

– It is easier to save them in a database instead of providing them as static web pages

– Some pages are the result of integration from various databases

• Content is not restricted to text or html. Can be image, pdf, software, music, books, etc. E.g.,

– All the paintings in a museum

– Books in a library

• May be password protected

• But still, we wish the content were searchable…


Deep web crawling

• Crawl and index the deep web so that hidden data can be surfaced

• Unlike the surface web, there are no hyperlinks to follow

• Two tasks

– Find deep web data sources, i.e., html forms, web services

– Accessing the deep web: A survey, B He, M Patel, Z Zhang, KCC Chang - Communications of the ACM, 2007

– Given a data source, download the data from this data source

• We focus on the second task


Crawling a deep web data source

• The only interface is an html form or a web service

– If the data is hidden by an HTML form

– Fill the forms

– Select and send appropriate queries

– Ntoulas, Alexandros; Petros Zerfos, and Junghoo Cho (2005). Downloading Hidden Web Content. UCLA Computer Science.

– Yan Wang, Jianguo Lu, Jessica Chen: Crawling Deep Web Using a New Set Covering Algorithm. ADMA 2009: 326-337.

– Jianguo Lu, Yan Wang, Jie Liang, Jessica Chen, Jiming Liu: An Approach to Deep Web Crawling by Sampling. Web Intelligence 2008: 718-724

– Extract relevant data from the returned HTML page

– If the data is hidden by a web service

– Select and send appropriate queries

– Form filling and data extraction are exempted

• It also attracts public interest

– Wright, Alex (2009-02-22). "Exploring a 'Deep Web' That Google Can’t Grasp". New York Times.


Deep web crawling is not a trivial task

• It is not easy to obtain all the data

• Query quota

• Return limit

• More importantly, high overlapping

Query dictionary: c:\research\dictionaries\newsgroup_dict_10000_random.txt

Data source: filetype:xsd, n=212000

Query        matches   total   distinct   new
maven        176       176     176        176
disarm       1         185     177        1
sudanese     34        219     211        34
profession   59        278     270        59
compete      2         280     272        2
…
windsor      101       56939   16199
bosch        8         56947   16203      4
cliff        32        56979   16203      0
pursuit      14        56993   16204      1
konstantin   3         56996   16204      0

The overlapping rate is 56996/16204 ≈ 3.5 when the percentage is only 16204/212000 ≈ 0.076


The problem

• Minimize the cost while retrieving most of the data

– Some people try to minimize the number of queries, while we minimize the total number of documents retrieved

• Minimize the OR (Overlapping Rate) while reaching a high Hit Rate (HR)

– S(qj , DB) : set of results of the query qj on database DB.


Random queries

• What is the cost if random queries are sent?

• The answer depends on the assumptions of a data source

• In the diagram, models in the lower layers are more difficult to crawl

Model   All the matched documents returned?   Each document has equal probability of being matched?
M0      Yes                                   Yes
M0r     No                                    Yes
Mh      Yes                                   No
Mhr     No                                    No


Notations

• N: the actual number of documents in a data source;

• t: the number of queries that are sent to a data source;

• $m_j$: the number of matched documents for query j, $1 \le j \le t$.

– $n = \sum_{j=1}^{t} m_j$ is the sample size, i.e., the total number of matched documents;

• $u_j$: the number of new documents retrieved by query j, $1 \le j \le t$.

• $M_i = \sum_{j<i} u_j$ is the total number of unique documents that are retrieved before query i.

– Note that $M_1 = 0$ and $M_2 = m_1$. Let $M = M_{t+1}$ denote the total number of distinct documents that are retrieved by all the queries in the estimation process;

• $d_i$: the number of duplicate documents retrieved by query i. $d_i + u_i = m_i$;

• k: the maximal number of documents returned by a ranked data source, even if there are $m_j > k$ matches.

• OR = n/M: the Overlapping Rate up to the t-th query, i.e., the ratio between the sample size and the number of distinct documents;

• P = M/N: the percentage of the documents that have been sampled, i.e., the ratio between the number of distinct documents and the actual size.
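The notation can be made concrete with a small sketch. It assumes every query returns all of its matches (no return limit k) and that each result is available as a set of document IDs; the function name `crawl_statistics` and the toy document IDs are illustrative only.

```python
from typing import Dict, Iterable, List, Set


def crawl_statistics(results: Iterable[Set[str]], total_docs: int) -> List[Dict[str, float]]:
    """Track m_j, u_j, d_j, n, M, OR and P after each query.

    Assumes all matched documents are returned for every query.
    """
    seen: Set[str] = set()   # distinct documents retrieved so far
    n = 0                    # sample size so far: sum of m_j
    rows = []
    for j, matched in enumerate(results, start=1):
        m_j = len(matched)
        u_j = len(matched - seen)   # new documents
        d_j = m_j - u_j             # duplicates, so d_j + u_j = m_j
        seen |= matched
        n += m_j
        M = len(seen)
        rows.append({
            "query": j, "m": m_j, "u": u_j, "d": d_j, "n": n, "M": M,
            "OR": n / M if M else float("nan"),   # overlapping rate
            "P": M / total_docs,                  # fraction of the source sampled
        })
    return rows


# Toy data (hypothetical): three queries against a source with N = 10 documents.
rows = crawl_statistics([{"a", "b", "c"}, {"b", "c", "d"}, {"e"}], total_docs=10)
for row in rows:
    print(row)
```

In this toy run the second query contributes one new document and two duplicates, so OR rises above 1 while P grows only slightly.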


Example of crawling process

[Figure: P (vertical axis, 0 to 0.14) plotted against OR (horizontal axis, 1 to 1.2)]

query   mi   ri   ui   di   Mi   ni   OR      P
q1      3    3    3    0    0    3    1       0.005
q2      26   26   25   1    3    29   1.036   0.047
q3      37   30   25   5    28   59   1.113   0.088
q4      23   23   16   7    53   82   1.188   0.115
                            69

• Suppose N=600, limit=30
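The OR and P columns of this example can be recomputed from the per-query counts. The sketch below reads ri as the number of documents actually returned (capped by the limit of 30) and accumulates it into ni, which is what the ni column appears to do; that reading is an assumption.

```python
from itertools import accumulate

N = 600                       # data source size assumed in the example
returned = [3, 26, 30, 23]    # r_i: documents returned per query (limit 30)
new_docs = [3, 25, 25, 16]    # u_i: previously unseen documents per query

n_cum = list(accumulate(returned))   # n_i: cumulative sample size
M_cum = list(accumulate(new_docs))   # distinct documents after each query

for i, (n_i, M_i) in enumerate(zip(n_cum, M_cum), start=1):
    print(f"q{i}: n={n_i} M={M_i} OR={n_i / M_i:.3f} P={M_i / N:.3f}")
```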


Model M0

• Assumptions

– All the matched documents are returned

– Each document has equal probability of being matched

• Result

$$P = 1 - OR^{-2.1}$$

– Jianguo Lu, Dingding Li, Estimating Deep Web Data Source Size by Capture-Recapture Method, Information Retrieval. Springer.


Model M0

• The more accurate formula for the relationship between P and OR is

$$OR = -\frac{\ln(1-P)}{P}$$

• Conclusion: In model M0, it is not difficult to crawl a data source at all

• In most cases OR will be higher than what is calculated by the above formula

– Because M0 is the simplest model

P       OR
0.1     1.053605
0.2     1.115718
0.3     1.188916
0.4     1.277064
0.5     1.386294
0.6     1.527151
0.7     1.719961
0.8     2.011797
0.9     2.558428
0.95    3.153402
0.99    4.651687
0.999   6.91467
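The table can be regenerated from OR = -ln(1 - P) / P, and the formula can be checked against a small simulation. The simulation is a simplification, not the authors' experiment: it treats every capture as a single document drawn uniformly with replacement, which is one way to satisfy the M0 assumptions. The function names are illustrative.

```python
import math
import random
from typing import Tuple


def or_m0(p: float) -> float:
    """Overlapping rate predicted for model M0: OR = -ln(1 - P) / P."""
    return -math.log(1.0 - p) / p


def simulate_m0(total_docs: int, draws: int, seed: int = 0) -> Tuple[float, float]:
    """Draw documents uniformly with replacement; return (P, OR)."""
    rng = random.Random(seed)
    distinct = {rng.randrange(total_docs) for _ in range(draws)}
    return len(distinct) / total_docs, draws / len(distinct)


for p in (0.1, 0.5, 0.9, 0.99):
    print(f"P={p}: OR={or_m0(p):.6f}")

# Simplified simulation: 100,000 uniform draws from 100,000 documents.
p_sim, or_sim = simulate_m0(total_docs=100_000, draws=100_000)
print(f"simulated P={p_sim:.3f} OR={or_sim:.3f}, formula OR={or_m0(p_sim):.3f}")
```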


Model M0 vs Mh

• The blue line is drawn using equation

P=1-OR^(-2.1)

• Several real data sets show a different trend

• Why?


Model Mh

• Assumptions:

– Each document has unequal probability of being matched by a query

– All matched documents are returned

• h means heterogeneity in catch probability

• Mh was first proposed in the capture-recapture method

– Originally developed in ecology, to estimate the population of wild animals

– Process: capture a group of animals, mark and release them; capture another group of animals, mark and release them again; …

Capture frequency of news groups documents by queries

(A) is the scatter plot when documents are selected by queries. In total 13,600 documents are retrieved. (B) is the first 100 captures in Figure (A). (C) is the histogram of (A). (D) is the log-log plot of (C).


Model Mh

• The empirical result is

$$P = 1 - OR^{-1.1}$$

• Obtained by linear regression

P          OR
0          1
0.533484   2
0.701347   3
0.782362   4
0.829732   5
0.860674   6
0.882404   7
0.898468   8
0.910806   9
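A short sketch of the empirical relation P = 1 - OR^(-1.1) and its inverse, which gives the overlapping rate needed to reach a target coverage. The exponent is exposed as a parameter `alpha` for illustration; the function names are not from the slides.

```python
def p_from_or(overlap_rate: float, alpha: float = 1.1) -> float:
    """Coverage P reached at a given overlapping rate: P = 1 - OR^(-alpha)."""
    return 1.0 - overlap_rate ** (-alpha)


def or_from_p(p: float, alpha: float = 1.1) -> float:
    """Inverse relation: overlapping rate needed to reach coverage P."""
    return (1.0 - p) ** (-1.0 / alpha)


for overlap_rate in range(1, 10):
    print(f"OR={overlap_rate}: P={p_from_or(overlap_rate):.6f}")
```

Calling `or_from_p` with the M0 exponent (`alpha=2.1`) instead of 1.1 shows how much cheaper the same coverage is when documents are matched with equal probability.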


File size distributions


Measuring heterogeneity

• Coefficient of Variation (CV)

• Assume that the documents in the data source have different but fixed probabilities of being captured, i.e.,

$p = \{p_1, p_2, \ldots, p_n\}$, $\sum_j p_j = 1$.

Scatter plots for various CVs. 200 random numbers in the range from 1 to 20,000 are generated from a Pareto distribution.
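As a rough illustration of the measure, the sketch below computes CV as the standard deviation of the capture probabilities divided by their mean. Using the population standard deviation is an assumption; the slides do not show the exact formula. The two probability vectors are made-up examples.

```python
import statistics
from typing import Sequence


def coefficient_of_variation(probabilities: Sequence[float]) -> float:
    """CV = standard deviation / mean (population standard deviation assumed)."""
    mean = statistics.fmean(probabilities)
    return statistics.pstdev(probabilities) / mean


uniform = [0.25, 0.25, 0.25, 0.25]   # homogeneous capture probabilities
skewed = [0.7, 0.1, 0.1, 0.1]        # one document is far easier to capture

print(coefficient_of_variation(uniform))
print(coefficient_of_variation(skewed))
```

A CV of zero corresponds to the equal-probability assumption of M0; larger values indicate stronger heterogeneity.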


Measuring heterogeneity

$$P = 1 - OR^{-\alpha}$$

Relationship between CV (γ) and α

[Figure: plot with axis labels CV, α, and P]


Model M0r

• Assumptions

– Only top k documents are returned

– Each document has equal probability of being matched

– Documents have static ranking

Model M0r

• When k and m are fixed for every query

$$P = \frac{k}{m}\left(1 - OR^{-2.1}\right)$$

• Not a practical assumption

Model Mhr

• Assumptions

– Only top k documents are returned

– Documents have unequal probability of being matched

– Documents have static ranking

• When k and m are fixed, we have

$$P = \frac{k}{m}\left(1 - OR^{-1.1}\right)$$
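The ranked-model formulas share one shape, P = (k/m)(1 - OR^(-α)), with α = 2.1 for M0r and α = 1.1 for Mhr. A minimal sketch, assuming k ≤ m and using an illustrative function name, shows the consequence: coverage can never exceed k/m, no matter how much overlap is accepted.

```python
def p_ranked(overlap_rate: float, k: int, m: int, alpha: float) -> float:
    """Coverage under a return limit: P = (k / m) * (1 - OR^(-alpha)).

    Assumes k <= m, i.e. every query matches at least k documents.
    """
    return (k / m) * (1.0 - overlap_rate ** (-alpha))


k, m = 10, 20   # values used in the model comparison on the slides
for overlap_rate in (1, 2, 5, 10, 100):
    m0r = p_ranked(overlap_rate, k, m, alpha=2.1)
    mhr = p_ranked(overlap_rate, k, m, alpha=1.1)
    print(f"OR={overlap_rate}: M0r P={m0r:.4f}  Mhr P={mhr:.4f}")
```

With k = 10 and m = 20 both curves flatten below 0.5, which matches the cut-off reported for M0r in the model comparison.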


Evolution of the models

• Comparison of models M0, Mh, M0r, and Mhr. 1000 documents are sorted according to their file size in decreasing order. 600 documents are selected in the four models, including the duplicates. k = 10;m = 20.

• Subplot M0 shows that all the documents are retrieved uniformly.

• Subplot Mh shows that large documents are preferred, but most of the documents can be eventually sampled.

• Subplot M0r exhibits a clear cut around the 500th document. Beyond this line there are almost no documents retrieved.

• Mhr is the compound of M0r and Mh.


Selecting queries

• We have learnt the cost when random queries are issued.

• Can we select the queries to reduce the cost?

• Which models can this be applied to?

– Mh or Mhr?


Select queries

• Incremental approach

– Method

1. Send a query to download matched documents;

2. While (not most of the documents downloaded)

Analyze the downloaded documents to select the next most appropriate query;

Send the query to download documents;

– Alexandros Ntoulas, Petros Zerfos, and Junghoo Cho, Downloading Textual Hidden Web Content through Keyword Queries. JCDL, 2005. 100-109.

– Disadvantages

– Need to download many (almost all) documents

– Crawler may only need to know the URL, not the entire documents

• Sampling based approach

– Jianguo Lu, Yan Wang, Jie Liang, Jessica Chen, Jiming Liu: An Approach to Deep Web Crawling by Sampling. Web Intelligence 2008: 718-724


Sampling based approach

• The queries are selected from a sample set of documents

• In contrast to incremental approach

• Steps

– Send a few random queries to TotalDB;

– Obtain the matched documents and construct the SampleDB;

– Analyse all the documents in SampleDB, construct QueryPool;

– Use set covering algorithms to select the Queries;

– Send Queries to TotalDB to retrieve documents.

• Can the queries cover most of the data source?

• Can a low OR in SampleDB be projected to TotalDB?

• Does SampleDB need to be very large?
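The five steps can be sketched end to end on a toy in-memory corpus. Everything concrete here is an assumption made for illustration: TotalDB is a dictionary from document ID to its set of terms, a query is a single term that matches every document containing it (no return limit), plain greedy stands in for the set covering step, and HR is taken as the fraction of TotalDB that ends up retrieved.

```python
import random
from typing import Dict, List, Set, Tuple

Corpus = Dict[str, Set[str]]   # document ID -> terms in the document


def greedy_cover(doc_terms: Corpus, pool: List[str]) -> List[str]:
    """Pick terms from pool until every document coverable by the pool is covered."""
    postings = {t: {d for d, terms in doc_terms.items() if t in terms} for t in pool}
    uncovered = set().union(*postings.values())
    chosen = []
    while uncovered:
        best = max(postings, key=lambda t: len(postings[t] & uncovered))
        chosen.append(best)
        uncovered -= postings[best]
    return chosen


def sampling_crawl(total_db: Corpus, seed_queries: List[str], pool_size: int,
                   rng: random.Random) -> Tuple[List[str], float, float]:
    # Steps 1-2: send a few queries and build SampleDB from the matches.
    seeds = set(seed_queries)
    sample_db = {d: terms for d, terms in total_db.items() if terms & seeds}
    # Step 3: build QueryPool from the vocabulary seen in SampleDB.
    vocabulary = sorted(set().union(*sample_db.values()))
    query_pool = rng.sample(vocabulary, min(pool_size, len(vocabulary)))
    # Step 4: select queries that cover SampleDB (plain greedy here).
    queries = greedy_cover(sample_db, query_pool)
    # Step 5: send the selected queries to TotalDB; measure HR and OR.
    retrieved, n = set(), 0
    for q in queries:
        matched = {d for d, terms in total_db.items() if q in terms}
        n += len(matched)
        retrieved |= matched
    hit_rate = len(retrieved) / len(total_db)
    overlap_rate = n / len(retrieved) if retrieved else float("nan")
    return queries, hit_rate, overlap_rate


# Toy corpus (hypothetical): 200 documents, each with 5 of 50 possible terms.
rng = random.Random(42)
terms = [f"t{i}" for i in range(50)]
total_db = {f"d{i}": set(rng.sample(terms, 5)) for i in range(200)}
queries, hit_rate, overlap_rate = sampling_crawl(
    total_db, seed_queries=["t0", "t1", "t2"], pool_size=30, rng=rng)
print(f"{len(queries)} queries, HR={hit_rate:.2f}, OR={overlap_rate:.2f}")
```

The printed HR and OR speak to the first two questions above for this toy corpus only; they say nothing about real data sources.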


Hypothesis 1: vocabulary learnt from sample can cover most of the documents in TotalDB

• Impact of sample size on HR. The queries are selected from SampleDB and cover above 99% of the documents in SampleDB. The HR in the plot is obtained when those queries are sent to the TotalDB. The relative query pool size is 20.


Hypothesis 2: low OR in SampleDB can be projected to TotalDB

• Comparison of our method on the four corpora with queries selected randomly from the sample.

• Sample size is 3000.

• Relative query pool size is 20.

• Our method achieves a much smaller OR when HR is high.


Hypothesis 3: both the sample size and query pool size do not need to be very large


Hypothesis 3 (continued)


Set covering problem

• Given a universe U and a family of subsets S={S1, S2, …, Sn} of U.

• A cover is a subfamily of S whose union is U.

• Let J={1,2,…,n}. J* ⊆ J is a cover if

$$\bigcup_{j \in J^*} S_j = U$$

• Set covering decision problem: the input is a pair (S,U) and an integer k; the question is whether there is a set covering of size k or less.

• Set covering optimization problem: the input is a pair (S,U), and the task is to find a set covering which uses the fewest sets.

• The decision version of set covering is NP-complete, and the optimization version of set cover is NP-hard.


Set covering example

[Figure: term-document matrix with terms t1, t2, t3 as rows and documents d1, d2, d3 as columns]

Suppose each row represents a term and each column represents a document. If the cell (i,j) is 1, term i can retrieve document j, or term i covers document j.


Set covering algorithms

• An optimal solution is hard to obtain within polynomial time

• Various approximation algorithms have been developed

– Greedy

– A classical algorithm

– Weighted greedy

– Developed for our particular application

– Yan Wang, Jianguo Lu, Jessica Chen: Crawling Deep Web Using a New Set Covering Algorithm. ADMA 2009: 326-337.

– Genetic algorithm

– Clustering

– …


Greedy algorithm

Number of new elements


Greedy algorithm may not be able to find the optimal solution

• There can be two solutions

– If the first set selected is t1, then the solution is

– {t1, t2}

– The cost is 4

– If the first selection is t2, then the solution is

– {t2, t3}

– The cost is 3.


Weighted greedy algorithm

[Figure: queries q1 to q5 drawn as sets over documents 1 to 9]

     d1   d2   d3   d4   d5   d6   d7   d8   d9
q1   0    0    1    0    1    1    0    1    0
q2   0    0    0    0    0    1    0    1    0
q3   1    1    1    1    0    0    0    0    1
q4   0    1    0    0    0    1    1    0    1
q5   0    0    1    1    1    0    0    1    1


One solution obtained by greedy algorithm

[Figure: three diagrams of the selection, first q5, then q5 and q4, then q5, q4 and q3]

Initial matrix (q5 is selected first):

     d1   d2   d3   d4   d5   d6   d7   d8   d9   df   new   new/df
q1   0    0    1    0    1    1    0    1    0    4    4     1
q2   0    0    0    0    0    1    0    1    0    2    2     1
q3   1    1    1    1    0    0    0    0    1    5    5     1
q4   0    1    0    0    0    1    1    0    1    4    4     1
q5   0    0    1    1    1    0    0    1    1    5    5     1

After q5 is selected (q4 is selected next):

     d1   d2   d3   d4   d5   d6   d7   d8   d9   df   new   new/df
q1   0    0    0    0    0    1    0    0    0    4    1     0.25
q2   0    0    0    0    0    1    0    0    0    2    1     0.5
q3   1    1    0    0    0    0    0    0    0    5    2     0.4
q4   0    1    0    0    0    1    1    0    0    4    3     0.75
q5   0    0    0    0    0    0    0    0    0    5    0     0

After q4 is selected (q3 is selected last):

     d1   d2   d3   d4   d5   d6   d7   d8   d9   df   new   new/df
q1   0    0    0    0    0    0    0    0    0    4    0     0
q2   0    0    0    0    0    0    0    0    0    2    0     0
q3   1    0    0    0    0    0    0    0    0    5    1     0.2
q4   0    0    0    0    0    0    0    0    0    4    0     0
q5   0    0    0    0    0    0    0    0    0    5    0     0

Total cost is 5+4+5=14
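The selection above can be replayed with a minimal greedy sketch on the same 5×9 matrix. Each step picks the query that covers the most still-uncovered documents, and the cost is the sum of df over the chosen queries. The `order` argument is an illustrative device for breaking ties: q3 and q5 both cover five new documents at the first step, and the slides pick q5.

```python
from typing import Dict, List, Set, Tuple

# The query-document matrix from the slides, one set of documents per query.
MATRIX: Dict[str, Set[str]] = {
    "q1": {"d3", "d5", "d6", "d8"},
    "q2": {"d6", "d8"},
    "q3": {"d1", "d2", "d3", "d4", "d9"},
    "q4": {"d2", "d6", "d7", "d9"},
    "q5": {"d3", "d4", "d5", "d8", "d9"},
}


def greedy_cover(sets: Dict[str, Set[str]], order: List[str]) -> Tuple[List[str], int]:
    """Greedy set cover; ties go to the query that appears first in `order`."""
    uncovered = set().union(*sets.values())
    chosen = []
    while uncovered:
        best = max(order, key=lambda q: len(sets[q] & uncovered))
        chosen.append(best)
        uncovered -= sets[best]
    cost = sum(len(sets[q]) for q in chosen)   # total documents downloaded
    return chosen, cost


print(greedy_cover(MATRIX, ["q5", "q4", "q3", "q2", "q1"]))  # tie broken toward q5
print(greedy_cover(MATRIX, ["q1", "q2", "q3", "q4", "q5"]))  # tie broken toward q3
```

Breaking the first tie toward q5 reproduces the cost of 14 shown above, while breaking it toward q3 happens to reach a cost of 13, so plain greedy is sensitive to tie-breaking on this matrix.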


Solution obtained by weighted greedy algorithm

[Figure: diagrams of the selection, first q4, then q4 and q3, then q4, q3 and q1]

Each document is weighted by 1 divided by the number of queries that cover it; entries shown as 0.3 stand for 1/3, which is the value used in the weight column.

Initial weights (q4 has the largest w/df and is selected first):

     d1   d2    d3    d4    d5    d6    d7   d8    d9    weight   w/df
q1   0    0     0.3   0     0.5   0.3   0    0.3   0     1.5      0.375
q2   0    0     0     0     0     0.3   0    0.3   0     0.667    0.3333
q3   1    0.5   0.3   0.5   0     0     0    0     0.3   2.667    0.5333
q4   0    0.5   0     0     0     0.3   1    0     0.3   2.167    0.5417
q5   0    0     0.3   0.5   0.5   0     0    0.3   0.3   2        0.4

After q4 is selected (q3 is selected next):

     d1   d2    d3    d4    d5    d6    d7   d8    d9    weight   w/df
q1   0    0     0.3   0     0.5   0     0    0.3   0     1.167    0.2917
q2   0    0     0     0     0     0     0    0.3   0     0.333    0.1667
q3   1    0     0.3   0.5   0     0     0    0     0     1.833    0.3667
q4   0    0     0     0     0     0     0    0     0     0        0
q5   0    0     0.3   0.5   0.5   0     0    0.3   0     1.667    0.3333

After q3 is selected (q1 is selected last):

     d1   d2    d3    d4    d5    d6    d7   d8    d9    weight   w/df
q1   0    0     0     0     0.5   0     0    0.3   0     0.833    0.2083
q2   0    0     0     0     0     0     0    0.3   0     0.333    0.1667
q3   0    0     0     0     0     0     0    0     0     0        0
q4   0    0     0     0     0     0     0    0     0     0        0
q5   0    0     0     0     0.5   0     0    0.3   0     0.833    0.1667

Total cost is 4+5+4=13
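The weighted selection can be sketched in the same way. The weighting rule is inferred from the tables rather than stated in the text: each document gets a fixed weight of 1 divided by the number of queries covering it, and each step picks the query with the largest sum of uncovered-document weights divided by its df. The cited ADMA 2009 paper may define further details that this sketch omits.

```python
from typing import Dict, List, Set, Tuple

# The query-document matrix from the slides, one set of documents per query.
MATRIX: Dict[str, Set[str]] = {
    "q1": {"d3", "d5", "d6", "d8"},
    "q2": {"d6", "d8"},
    "q3": {"d1", "d2", "d3", "d4", "d9"},
    "q4": {"d2", "d6", "d7", "d9"},
    "q5": {"d3", "d4", "d5", "d8", "d9"},
}


def weighted_greedy_cover(sets: Dict[str, Set[str]]) -> Tuple[List[str], int]:
    """Weighted greedy set cover; assumes every query covers at least one document."""
    universe = set().union(*sets.values())
    # Static weight: documents covered by few queries are worth more.
    weight = {d: 1.0 / sum(d in s for s in sets.values()) for d in universe}
    uncovered = set(universe)
    chosen = []
    while uncovered:
        def score(q: str) -> float:
            # Weight of the still-uncovered documents, normalised by df.
            return sum(weight[d] for d in sets[q] & uncovered) / len(sets[q])
        best = max(sets, key=score)
        chosen.append(best)
        uncovered -= sets[best]
    cost = sum(len(sets[q]) for q in chosen)   # total documents downloaded
    return chosen, cost


print(weighted_greedy_cover(MATRIX))
```

On this matrix the first step has no tie (q4 scores about 0.5417 against 0.5333 for q3), so the result does not depend on iteration order.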


Review

• Deep web crawling

• Random queries and models

• Sampling based crawling

• Set covering algorithms

– Greedy

– Weighted greedy

– Clustering

– Genetic

– Currently they are for model Mh

– What is the solution for Model Mhr?

• For Model Mhr, we need to predict the term frequencies of the terms in TotalDB