Crawling the Infinite Web (WAW 2004 Rome)
Outline Introduction Models Experiments Summary
Crawling the Infinite Web: Five Levels are Enough
Ricardo Baeza-Yates and Carlos Castillo
Center for Web Research, www.cwr.cl
WAW 2004
R. Baeza-Yates and C. Castillo Center for Web Research
Crawling the Infinite Web
1 Introduction
2 Models
3 Experiments
4 Summary
Introduction
Dynamic page: “a page which is created on request”
Dynamic pages with links to other dynamic pages
Malicious: loops and/or near-duplicates
Legitimate: recommendation systems, calendars, iterative algorithms, etc.
The number of pages on the Web can be considered infinite
Conflicting interests
Web site administrator: would like to have all of the Web site indexed
Search engine administrator: would like to use the available network and storage capacity efficiently
Search engine user: would like to find what they are looking for
Our approach
Users do not go very deep inside Web sites
If something is important it has to be easily reachable
We will download only a few levels of each Web site
How many levels?
How much do you lose?
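The idea of downloading only a few levels can be sketched as a breadth-first crawl with a depth cap. This is a minimal illustration, not the authors' crawler: `crawl_to_depth`, `get_links`, and the toy site with an endless calendar are all hypothetical names and data introduced here, and `get_links` stands in for actually fetching and parsing a page.

```python
from collections import deque


def crawl_to_depth(get_links, home, max_depth):
    """Breadth-first crawl that never follows links out of level max_depth.

    get_links is a hypothetical callable standing in for "download the page
    and extract its links"; it returns the list of pages linked from a page.
    Returns a dict mapping each discovered page to its level (clicks from home).
    """
    level = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        if level[page] == max_depth:
            continue  # deepest allowed level: keep the page, ignore its links
        for target in get_links(page):
            if target not in level:
                level[target] = level[page] + 1
                queue.append(target)
    return level


def get_links(page):
    """Hypothetical site: a home page, one static page, an endless calendar."""
    if page == "/":
        return ["/about", "/cal/0"]
    if page.startswith("/cal/"):
        n = int(page.rsplit("/", 1)[1])
        return ["/", f"/cal/{n + 1}"]
    return ["/"]


pages = crawl_to_depth(get_links, "/", max_depth=5)
print(len(pages), max(pages.values()))
```

Although the toy calendar links to a new page forever, the crawl terminates because pages at the last allowed level are stored but not expanded.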
Models
Navigating a tree ≈ Moving through levels
Actions
Possible actions at a given level
Type of models we study
There is a set of atomic actions: $A = \{\text{next}, \text{start/jump}, \text{back}, \text{stay}, \text{prev}, \text{fwd}\}$
$\Pr(\text{action} \mid \ell)$ is the probability of taking an action at level $\ell$
$\sum_{\text{action} \in A} \Pr(\text{action} \mid \ell) = 1$
The probability $\Pr(\text{next} \mid \ell)$ is constant
Stationary distribution → how much time users spend at each level
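To make the link between action probabilities and time spent per level concrete, here is a small sketch that finds a stationary distribution by repeatedly applying a transition matrix. The function name `stationary_distribution` and the three-level toy matrix are assumptions made for illustration; the matrix is not one of the models from the paper.

```python
def stationary_distribution(P, iterations=1000):
    """Approximate the stationary distribution of a row-stochastic matrix P.

    P[i][j] is the probability of moving from level i to level j.
    Starts with all users at level 0 and applies P repeatedly; this settles
    for chains that are irreducible and aperiodic, as the toy chain below is.
    """
    n = len(P)
    pi = [1.0] + [0.0] * (n - 1)
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


# Hypothetical 3-level chain: from every level the user goes one level deeper
# with probability 0.5 (staying put at the last level) or returns to level 0.
P = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.5],
    [0.5, 0.0, 0.5],
]

pi = stationary_distribution(P)
print([round(x, 3) for x in pi])
```

Each entry of `pi` is the long-run fraction of time a user following these toy probabilities spends at that level.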
Model A
Forwards and backwards one level at a time
Birth and death process
Model B
Back to first level
Birth and death process with extinction
Model C
Back to any previous level
Birth and death process with extinction and disaster?
Cumulative probability of levels 0 . . . k
Based on solutions given in the paper
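As a rough illustration of how such a cumulative curve can be evaluated, the sketch below assumes one simple reading of the "back to first level" model: from any level the user goes one level deeper with constant probability q and otherwise returns to level 0. Under that assumption the stationary probability of level l is (1 - q) * q**l, so levels 0..k together hold 1 - q**(k + 1). The function names and the example value q = 0.5 are hypothetical, and the paper's own solutions may be stated differently.

```python
def cumulative_probability(q, k):
    """Stationary mass on levels 0..k, assuming x_l = (1 - q) * q**l."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    return 1.0 - q ** (k + 1)


def levels_needed(q, coverage):
    """Smallest k such that levels 0..k hold at least `coverage` of the mass."""
    if not 0.0 < coverage < 1.0:
        raise ValueError("coverage must lie strictly between 0 and 1")
    k = 0
    while cumulative_probability(q, k) < coverage:
        k += 1
    return k


# Hypothetical example: q = 0.5 and a 90% coverage target.
k = levels_needed(0.5, 0.9)
print(k, cumulative_probability(0.5, k))
```

Because the assumed distribution is geometric, the number of levels needed grows only logarithmically as the coverage target approaches 1.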
Experiments
Anonymized access logs for 13 Web sites
Educational - Commercial - Reference - Organization - Blogs
Analysis of access logs to extract ≈ 250,000 user sessions
Distribution of visits per level
Model fitting
Code  Type                 Country  Model  q     Error
E1    Educational          Chile    B      0.51  0.88%
E2    Educational          Spain    B      0.51  2.29%
E3    Educational          US       B      0.64  0.72%
C1    Commercial           Chile    B      0.55  0.39%
C2    Commercial           Chile    B      0.62  5.17%
R1    Reference            Chile    B      0.54  2.96%
R2    Reference            Chile    B      0.59  2.75%
O1    Organization         Italy    C      0.35  2.27%
O2    Organization         US       B      0.62  2.31%
OB1   Organization + Blog  Chile    B      0.65  2.07%
OB2   Organization + Blog  Chile    B      0.72  0.35%
B1    Blog                 Chile    C      0.79  0.88%
B2    Blog                 Chile    C      0.63  1.01%
Observed distribution of transitions
Level  Obs.    Next   Start  Jump   Back   Stay   Prev
0      247985  0.457  –      0.527  –      0.008  –
1      120482  0.459  –      0.332  0.185  0.017  –
2      70911   0.462  0.111  0.235  0.171  0.014  –
3      42311   0.497  0.065  0.186  0.159  0.017  0.069
4      27129   0.514  0.057  0.157  0.171  0.009  0.088
5      17544   0.549  0.048  0.138  0.143  0.009  0.108
6      10296   0.555  0.037  0.133  0.155  0.009  0.106
7      6326    0.596  0.033  0.135  0.113  0.006  0.113
8      4200    0.637  0.024  0.104  0.127  0.006  0.096
9      2782    0.663  0.015  0.108  0.113  0.006  0.089
10     2089    0.662  0.037  0.084  0.120  0.005  0.086
Pr(next) is not constant: it grows with depth, so a user who has already spent some time in a Web site is more likely to spend some more
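The observation counts in the table can also be turned into a cumulative share of visits per level. The sketch below copies the Obs. and Next columns from the table; the helper name `cumulative_share` is hypothetical, and the shares are relative to levels 0 to 10 only, since deeper levels are not listed on the slide.

```python
# Obs. and Next columns for levels 0..10, copied from the table above.
observations = [247985, 120482, 70911, 42311, 27129, 17544,
                10296, 6326, 4200, 2782, 2089]
next_prob = [0.457, 0.459, 0.462, 0.497, 0.514, 0.549,
             0.555, 0.596, 0.637, 0.663, 0.662]


def cumulative_share(counts):
    """Fraction of all listed observations that fall at level k or shallower."""
    total = sum(counts)
    running = 0
    shares = []
    for count in counts:
        running += count
        shares.append(running / total)
    return shares


shares = cumulative_share(observations)

# First level at which at least 90% of the listed observations are covered.
first_level_90 = next(level for level, s in enumerate(shares) if s >= 0.9)

for level, (share, p_next) in enumerate(zip(shares, next_prob)):
    print(f"level {level:2d}  cumulative share {share:.3f}  Pr(next) {p_next:.3f}")
print("90% of listed observations reached at level", first_level_90)
```

With these counts the 90% mark is reached at level 4, while Pr(next) rises from 0.457 at level 0 to 0.662 at level 10.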
PageRank and depth
Cumulative PageRank by levels in the Chilean Web
Correlation of PageRank and depth is low at deeper levels
Summary
90% of the visits are within 4-5 clicks of the home page, except in blogs
Simple models try to explain this behavior
In the paper: explicit methodology, closed-form solutions to the models, references
Open problems
A model which better fits empirical data
Analyzing blogs
Analyzing the textual content of pages to decide when to stop
Relationship of this with the spam detection problem
Try adaptive strategies: what factors affect the desired crawling depth in a Web site?
There are other ways of defining which pages to download from an infinite set
Questions and comments . . .