The Mobile Web is Structurally Different Apoorva Jindal USC Chris Crutchfield MIT Samir Goel Google Inc Ravi Jain Google Inc Ravi Kolluri Google Inc

Page 1: The Mobile Web is Structurally Different

Apoorva Jindal, USC
Chris Crutchfield, MIT
Samir Goel, Google Inc
Ravi Jain, Google Inc
Ravi Kolluri, Google Inc

Page 2: The Mobile Web?

- Web pages designed for consumption on mobile wireless devices
  - CHTML, XHTML, WML
  - All other pages are referred to as the fixed web
- Becoming more important
  - Better devices
  - Better networks
  - Cheaper plans
- Different from the fixed web?
  - Smaller pages
  - Fewer hyperlinks
  - Fewer images

Page 3: Structurally?

- Web graph
  - pages ↔ nodes
  - hyperlinks ↔ edges
- Properties of this graph
  - In-degree distribution
  - Out-degree distribution
  - Strongly connected component size distribution
  - …
- Importance
  - Used in basic algorithms to implement search
    - Crawling
    - Ranking the web pages
- Studied in detail for the fixed web

INFOCOM 2008
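The page-to-node, hyperlink-to-edge mapping can be illustrated with a minimal Python sketch. The edge list `links` and the helper `degree_counts` are hypothetical names invented for this illustration; nothing here comes from the authors' tooling. The average printed at the end is edges per node, which is only one plausible reading of "average node degree".

```python
from collections import defaultdict

# Hypothetical toy web graph: each (source, target) pair is one hyperlink.
links = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("d", "c")]

def degree_counts(edges):
    """Return (in_degree, out_degree) dicts covering every page seen."""
    in_deg = defaultdict(int)
    out_deg = defaultdict(int)
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
        # Touch both endpoints so pages with zero in- or out-links still appear.
        in_deg[src] += 0
        out_deg[dst] += 0
    return dict(in_deg), dict(out_deg)

in_deg, out_deg = degree_counts(links)
avg_out = sum(out_deg.values()) / len(out_deg)  # edges per node

print(in_deg)
print(out_deg)
print(avg_out)
```

The in-degree and out-degree distributions are then just histograms of the values in these two dictionaries.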

Page 4: Bow-tie Structure [Broder et al 2000]

Model to describe the structure of the fixed web.
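As a rough illustration of the bow-tie model, the sketch below splits a toy graph into the largest strongly connected component (SCC), the pages that can reach it (IN), the pages reachable from it (OUT), and everything else. It is a naive, quadratic-time sketch suitable only for tiny graphs. The function names, the toy edge list, and the single `OTHER` bucket (Tendrils and Disconnected lumped together) are assumptions made for this example, not the COSIN tools or the authors' code.

```python
from collections import defaultdict, deque

def reachable(start, adj):
    """Breadth-first search: every node reachable from `start`, inclusive."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def bow_tie(edges):
    """Split nodes into SCC, IN, OUT and OTHER (tendrils + disconnected)."""
    fwd, bwd, nodes = defaultdict(set), defaultdict(set), set()
    for src, dst in edges:
        fwd[src].add(dst)
        bwd[dst].add(src)
        nodes.update((src, dst))
    if not nodes:
        return {"SCC": set(), "IN": set(), "OUT": set(), "OTHER": set()}
    # Largest SCC, found naively: SCC(v) = forward-reach(v) & backward-reach(v).
    core, assigned = set(), set()
    for v in sorted(nodes):
        if v in assigned:
            continue
        comp = reachable(v, fwd) & reachable(v, bwd)
        assigned |= comp
        if len(comp) > len(core):
            core = comp
    seed = next(iter(core))
    out = reachable(seed, fwd) - core   # reachable from the core
    in_ = reachable(seed, bwd) - core   # can reach the core
    other = nodes - core - in_ - out
    return {"SCC": core, "IN": in_, "OUT": out, "OTHER": other}

# Hypothetical toy graph: i -> core {a, b, c} -> o, a tendril t, and a
# disconnected pair x -> y.
toy_edges = [("i", "a"), ("a", "b"), ("b", "c"), ("c", "a"),
             ("c", "o"), ("i", "t"), ("x", "y")]
parts = bow_tie(toy_edges)
print(parts)
```

A production analysis would use a linear-time SCC algorithm (Tarjan or Kosaraju) and separate Tendrils from Disconnected components.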

Page 5: Methodology

- Google’s mobile web index, June 2007
  - CHTML
  - XHTML + WML
- Webbase 2001
- Google’s fixed web index, July 2007
- In-degree & out-degree distributions
  - Tools based on MapReduce
  - Use [Clauset et al 2006] to infer the power-law coefficient
- Determine bow-tie structure properties
  - Use COSIN tools [Donato et al 2004]
  - Limitations
    - Cannot handle Google fixed web 2007 at page level
    - Collapse all pages in a domain to one node
    - Use tools based on MapReduce
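To make the "infer the power-law coefficient" step concrete, here is a minimal sketch of a standard maximum-likelihood estimate for a discrete power law, using the common approximation alpha ≈ 1 + n / Σ ln(x_i / (x_min − 0.5)). Whether the authors used exactly this approximation, and how they chose `x_min`, is not stated on the slides; the function name `power_law_alpha` and the fixed default `x_min=1` are assumptions for illustration.

```python
import math

def power_law_alpha(degrees, x_min=1):
    """Approximate MLE of the exponent of a discrete power law.

    Only samples >= x_min are used. x_min must be a positive integer;
    here it is a fixed assumption rather than something fitted from data.
    """
    tail = [d for d in degrees if d >= x_min]
    if not tail:
        raise ValueError("no samples at or above x_min")
    log_sum = sum(math.log(d / (x_min - 0.5)) for d in tail)
    return 1.0 + len(tail) / log_sum

# Hypothetical degree samples: a heavy tail versus a fast fall-off.
heavy = power_law_alpha([1, 1, 2, 4, 8, 16])
light = power_law_alpha([1, 1, 1, 1, 1, 2])
print(heavy, light)
```

A larger coefficient means the distribution falls off faster, so the second sample yields the larger estimate.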

Page 6: Page-level Graph properties – Degree Distributions

Corpus       Avg node degree   In-degree coefficient   Out-degree coefficient
XHTML+WML    3.75              2.00                    3.49
CHTML        5.06              1.99                    4.06
Webbase      7.0               2.1                     2.7

(Coefficients are those of the fitted power-law distribution.)

- Mobile web is sparser
- CHTML lies between XHTML+WML and fixed web
- Out-degree distribution falls off faster for mobile web

Page 7: Page-level Graph properties – Bow-tie structure

Corpus       SCC      IN       OUT      Tendrils   Disconnected
XHTML+WML    10.5%    18%      10.4%    18.3%      42.7%
CHTML        22%      25.9%    14.2%    22%        15.8%
Webbase      33%      11%      39%      13%        4%

- Mobile web
  - Smaller SCC
  - Larger IN and smaller OUT
  - Bigger Disconnected + Tendrils
- Connectivity: Fixed Web > CHTML > XHTML/WML
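One way to read the bow-tie figures: by the definitions of the bow-tie components, a crawl that starts from a page inside the SCC and only follows hyperlinks forward can reach the SCC and OUT, and nothing else. The short sketch below adds those two columns for each corpus. The percentages are copied from the slide; the dictionary layout and the function name `reachable_from_core` are assumptions for illustration.

```python
# Page-level bow-tie percentages as given on the slide.
bow_tie_pct = {
    "XHTML+WML": {"SCC": 10.5, "IN": 18.0, "OUT": 10.4,
                  "Tendrils": 18.3, "Disconnected": 42.7},
    "CHTML":     {"SCC": 22.0, "IN": 25.9, "OUT": 14.2,
                  "Tendrils": 22.0, "Disconnected": 15.8},
    "Webbase":   {"SCC": 33.0, "IN": 11.0, "OUT": 39.0,
                  "Tendrils": 13.0, "Disconnected": 4.0},
}

def reachable_from_core(row):
    """Share of pages reachable by following links forward from the SCC."""
    return round(row["SCC"] + row["OUT"], 1)

reach = {corpus: reachable_from_core(row) for corpus, row in bow_tie_pct.items()}
for corpus, pct in reach.items():
    print(f"{corpus}: {pct}% reachable from a seed in the SCC")
```

On these figures a single well-connected seed covers a far smaller share of the mobile corpora than of Webbase, which is consistent with the connectivity ordering stated on the slide.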

Page 8: Language Properties

- Sub-graph of pages that share a common trait
  - Like keyword, location.
  - Called Thematically Unified Clusters (TUCs).
  - In the fixed web, they retain the structural properties of the entire graph.
- Mobile web?

Corpus   Language   Fraction of nodes
XHTML    Chinese    42.6%
XHTML    English    22.3%
XHTML    Russian    13.4%
XHTML    French     3.4%
XHTML    German     2.3%
CHTML    Japanese   92.3%
CHTML    English    5.9%

Corpus       SCC      IN     OUT      Tendrils   Disconnected
XHTML+WML    10.5%    18%    10.4%    18.3%      42.7%
Chinese      13%      22%    9%       14%        42%
English      2%       3%     7%       25%        63%
Russian      22%      40%    8%       11%        19%

- Japanese is not studied separately: its properties are the same as CHTML's
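A language sub-graph of this kind can be sketched as an induced sub-graph: keep the pages carrying a given language label and only the hyperlinks whose two endpoints are both kept. Whether the authors built their language sub-graphs exactly this way is an assumption; the labels, page names, and the function `induced_subgraph` below are hypothetical.

```python
def induced_subgraph(edges, labels, wanted):
    """Return (nodes, edges) restricted to pages labelled `wanted`."""
    keep = {page for page, label in labels.items() if label == wanted}
    sub_edges = [(s, d) for s, d in edges if s in keep and d in keep]
    return keep, sub_edges

# Hypothetical pages with language labels and the links between them.
labels = {"p1": "zh", "p2": "zh", "p3": "en", "p4": "en"}
edges = [("p1", "p2"), ("p2", "p3"), ("p3", "p1"), ("p4", "p3")]

zh_nodes, zh_edges = induced_subgraph(edges, labels, "zh")
en_nodes, en_edges = induced_subgraph(edges, labels, "en")
print(zh_nodes, zh_edges)
print(en_nodes, en_edges)
```

Cross-language links are dropped, so a page that is well connected in the full graph can become isolated in its language sub-graph. Returning the node set separately keeps such isolated pages countable when the bow-tie fractions are computed.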

Page 9: Domain-level Graph Properties

- Domain-level graph
  - Collapse all nodes for a domain into a single super-node
  - Compare mobile web 2007 and fixed web 2007
- Advantages
  - Allows us to understand the differences at a much coarser level
  - Allows us to compare present-day fixed and mobile webs

Corpus           Avg node degree   SCC      IN       OUT      Tendrils + Disconn.
XHTML+WML        3.91              40.6%    40.7%    2.73%    15.9%
CHTML            5.56              83%      16.4%    0.22%    0.36%
Fixed web 2007   35.75             93.9%    5.62%    0.4%     0.03%

- Observations
  - Domain-level graphs are better connected.
  - XHTML+WML has a much larger Disconnected component.
  - CHTML properties lie between XHTML+WML and fixed web.
  - Structural differences between the domain-level fixed web and mobile web are the same as the differences between the page-level fixed web and mobile web.
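The collapse from a page-level graph to a domain-level graph can be sketched as follows. The slides do not say how a "domain" is derived from a URL, so this example assumes the URL hostname is the domain, merges parallel links between the same pair of domains, and drops links that stay inside one domain. The function name `collapse_to_domains` and the example URLs are hypothetical.

```python
from urllib.parse import urlsplit

def collapse_to_domains(page_edges):
    """Map page-level hyperlinks to a set of domain-level edges."""
    domain_edges = set()
    for src, dst in page_edges:
        s = urlsplit(src).hostname
        d = urlsplit(dst).hostname
        # Skip unparsable URLs and links that stay within one domain.
        if s and d and s != d:
            domain_edges.add((s, d))
    return domain_edges

# Hypothetical page-level links.
page_edges = [
    ("http://a.example/1", "http://a.example/2"),
    ("http://a.example/1", "http://b.example/"),
    ("http://a.example/2", "http://b.example/x"),
    ("http://b.example/", "http://c.example/"),
]
domain_edges = collapse_to_domains(page_edges)
print(sorted(domain_edges))
```

Four page-level links shrink to two domain-level edges here, which is the kind of size reduction that makes a very large index tractable.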

Page 10: Application: Impact on Crawling

- Crawling is resource-intensive. Efficiency is important.
- Higher level of disconnectedness
  - Need a larger and more diverse seed set
- Covering the IN component requires special care
- Depth-first strategy risks spending a disproportionate amount of time in Tendrils and Disconnected components
- Different languages have different levels of disconnectedness
  - Require a larger seed set for English pages than Russian pages
  - Crawl depth can be reduced for Russian sub-graph
- Sparseness can also give an advantage
  - Chances of encountering the same page again during a crawl are smaller
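The seed-set point can be illustrated with a small breadth-first crawl over a toy graph. This is a sketch under stated assumptions, not a description of any production crawler: the graph, the page names, and the functions `crawl` and `coverage` are hypothetical.

```python
from collections import defaultdict, deque

def crawl(seeds, edges):
    """Breadth-first crawl: return every page reachable from the seeds."""
    adj = defaultdict(list)
    for src, dst in edges:
        adj[src].append(dst)
    seen = set(seeds)
    frontier = deque(seeds)
    while frontier:
        page = frontier.popleft()
        for nxt in adj[page]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen

# Hypothetical bow-tie-shaped graph: IN page i, core {a, b, c}, OUT page o,
# a tendril t hanging off i, and a disconnected pair x -> y.
toy_edges = [("i", "a"), ("a", "b"), ("b", "c"), ("c", "a"),
             ("c", "o"), ("i", "t"), ("x", "y")]
all_pages = {p for edge in toy_edges for p in edge}

def coverage(seeds):
    """Fraction of all pages discovered when crawling from `seeds`."""
    return len(crawl(seeds, toy_edges)) / len(all_pages)

print(coverage(["a"]))            # seed in the core only
print(coverage(["a", "i"]))       # add a seed in IN
print(coverage(["a", "i", "x"]))  # add a seed in the disconnected part
```

Each extra seed in a different component raises coverage, which is why a graph with larger IN, Tendrils and Disconnected components calls for a larger and more diverse seed set.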

Page 11: Conclusions

- Mobile web graph is structurally different
  - Sparser, more disconnected
  - Smaller SCC and OUT
- CHTML properties lie between XHTML+WML and fixed web
- Surprising preponderance of Chinese pages
- English sub-graph extremely disconnected

Page 12: Future Work

- Only a first step
  - Results motivate the need for a deeper and more extensive analysis
- Propose alternatives to the bow-tie model for the mobile web
- Better understanding of language sub-graphs
- Quantitatively characterize the impact of differences in structure on different search algorithms