Focused Crawling for Structured Data

Robert Meusel, Peter Mika, and Roi Blanco


Markup Languages in HTML Pages

<html>
…
<body>
…
<div id="main-section" class="performance left" data-sku="M17242_580">
<h1> Predator Instinct FG Fußballschuh </h1>
<div>
<meta content="EUR">
<span data-sale-price="219.95">219,95</span>
…
</body>
</html>

HTML pages directly embed markup languages to annotate items using different vocabularies.

<html>
…
<body>
…
<div id="main-section" class="performance left" data-sku="M17242_580" itemscope itemtype="http://schema.org/Product">
<h1 itemprop="name"> Predator Instinct FG Fußballschuh </h1>
<div itemscope itemtype="http://schema.org/Offer" itemprop="offers">
<meta itemprop="priceCurrency" content="EUR">
<span itemprop="price" data-sale-price="219.95">219,95</span>
…
</body>
</html>

_:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Product> .
_:node1 <http://schema.org/Product/name> "Predator Instinct FG Fußballschuh"@de .
_:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Offer> .
_:node1 <http://schema.org/Offer/price> "219,95"@de .
_:node1 <http://schema.org/Offer/priceCurrency> "EUR" .
…
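The step from annotated HTML to property/value pairs can be illustrated with a minimal sketch that uses only the Python standard library. It is a simplification, not a full implementation of the Microdata extraction algorithm and not the extractor used for the WebDataCommons corpus. The class name `ItempropCollector` and its attributes are hypothetical, chosen for illustration.

```python
from html.parser import HTMLParser


class ItempropCollector(HTMLParser):
    """Collects itemtype URLs and (itemprop, value) pairs from a page.

    Simplified on purpose: nested items are not linked to their parent,
    and only `content` attributes and element text are used as values.
    """

    VOID_TAGS = {"meta", "link", "img", "br", "hr", "input"}

    def __init__(self):
        super().__init__()
        self.types = []   # itemtype URLs in document order
        self.pairs = []   # extracted (property, value) pairs
        self._open = []   # stack of [tag, property or None, text parts]

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "itemtype" in attributes:
            self.types.append(attributes["itemtype"])
        prop = attributes.get("itemprop")
        if tag in self.VOID_TAGS:
            # Void elements such as <meta> carry their value in `content`.
            if prop and "content" in attributes:
                self.pairs.append((prop, attributes["content"]))
            return
        # An element that opens a nested item has no literal value here.
        if "itemscope" in attributes:
            prop = None
        self._open.append([tag, prop, []])

    def handle_data(self, data):
        for entry in self._open:
            if entry[1]:
                entry[2].append(data)

    def handle_endtag(self, tag):
        if tag not in [entry[0] for entry in self._open]:
            return  # stray closing tag, ignore it
        while self._open:
            open_tag, prop, parts = self._open.pop()
            if prop:
                self.pairs.append((prop, "".join(parts).strip()))
            if open_tag == tag:
                break


SAMPLE = """
<div itemscope itemtype="http://schema.org/Product">
<h1 itemprop="name"> Predator Instinct FG Fußballschuh </h1>
<div itemscope itemtype="http://schema.org/Offer" itemprop="offers">
<meta itemprop="priceCurrency" content="EUR">
<span itemprop="price" data-sale-price="219.95">219,95</span>
</div>
</div>
"""

collector = ItempropCollector()
collector.feed(SAMPLE)
collector.close()
```

After feeding the sample, `collector.types` holds the two schema.org types and `collector.pairs` holds the name, currency, and price values. A semantic parser in a crawler only needs to know whether such annotations are present, which is a much cheaper check than full extraction.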

Meusel, Mika, Blanco: Focused Crawling for Structured Data @ CIKM 2014, Shanghai


Deployment of Markup Languages

14% of all sites use markup languages to annotate their data (as of 2013) [Meusel2014]

• Broad topical variety, from articles and products to recipes [Bizer2013]

• Multiple strong drivers pushing the deployment

• Search engine companies' Schema.org initiative

• Open Graph Protocol used by Facebook


http://link.springer.com/chapter/10.1007/978-3-319-11964-9_18


http://schema.org/


Motivation

• Existing datasets/crawls do not focus on structured data

• Common Crawl Foundation uses PageRank and Breadth-First Search

• Datasets extracted from these corpora, such as the WebDataCommons corpus, are likely to miss large amounts of data [Meusel2014]

• Structured information

• Hundreds of millions of pages

• Up-to-date information

• Publicly available




Main Idea

• Adapting the idea of focused crawling

• Similarities:

• Evaluation of content based on an objective function

• Differences:

• Typically focused by topic, not quality/amount of data collected

• Because of that, typically no direct feedback about crawled pages available

Possibility to incorporate the feedback directly into our system to improve classification of newly discovered URLs.



Online Learning for Focused Crawling

• Capability to incorporate real-time feedback

• Improves performance

• Adapts to concept drift

• Possible features

• URL-based features, mainly tokens from the URL string itself

• Features describing information from the parent(s) of the URL

• Features describing information from the siblings of the URL

• Free open-source software available (e.g. Massive Online Analysis Library by Bifet et al.)


http://moa.cms.waikato.ac.nz/
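To make the idea of online learning on URL-based features concrete, here is a minimal sketch in Python. It uses plain logistic regression trained by stochastic gradient descent as a stand-in; it is not the MOA library and not necessarily the classifier used by the authors. The names `url_tokens` and `OnlineUrlClassifier`, the learning rate, and the example URLs are assumptions made for illustration. Parent and sibling features are left out.

```python
import math
import re


def url_tokens(url):
    """Split a URL string into lowercase alphanumeric tokens."""
    return [token for token in re.split(r"[^a-z0-9]+", url.lower()) if token]


class OnlineUrlClassifier:
    """Logistic regression over URL tokens, updated one example at a time."""

    def __init__(self, learning_rate=0.5):
        self.weights = {}   # token -> weight
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, url):
        """Return the estimated probability that the URL leads to a relevant page."""
        z = self.bias + sum(self.weights.get(t, 0.0) for t in set(url_tokens(url)))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, url, is_relevant):
        """Incorporate feedback for one crawled URL (one gradient step)."""
        error = (1.0 if is_relevant else 0.0) - self.predict(url)
        self.bias += self.learning_rate * error
        for token in set(url_tokens(url)):
            self.weights[token] = self.weights.get(token, 0.0) + self.learning_rate * error


classifier = OnlineUrlClassifier()
untrained_score = classifier.predict("http://shop.example/product/1")

# Feedback arrives page by page while crawling; there is no separate training phase.
for _ in range(20):
    classifier.update("http://shop.example/product/123", True)
    classifier.update("http://forum.example/thread/456", False)

product_score = classifier.predict("http://shop.example/product/789")
forum_score = classifier.predict("http://forum.example/thread/42")
```

Because every crawled page yields a label, the model keeps adjusting as the crawl reaches new parts of the Web, which is how an online learner can follow concept drift.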


Exploration vs. Exploitation

• Decision/Classification is based on gathered knowledge

• Knowledge can be incomplete

• Too few pages crawled

• Knowledge can become invalid

• Reaching a part of the Web with different behavior

Selecting the page with the highest confidence for supporting our objective might not always be the best choice.



Bandit-Based Selection

• Bin each URL to the host it belongs to

• Each host represents one bandit

• Calculate the expected score for each bandit based on a scoring function

• Select the degree of randomness λ

• λ between 0 and 1

• For each turn draw a random number z

• z > λ: select the bandit with highest score

• else: select a random bandit
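The selection rule above can be sketched in a few lines of Python. The function name `select_host`, the dictionary of per-host scores, and the host names are hypothetical; the scoring function that produces the scores is treated as given.

```python
import random


def select_host(host_scores, lam, rng=random):
    """Pick the next host (bandit) to crawl from.

    host_scores: dict mapping host name -> expected score
    lam: degree of randomness, between 0 and 1
    rng: source of randomness, injectable for reproducible runs
    """
    if not host_scores:
        raise ValueError("no hosts to choose from")
    hosts = list(host_scores)
    z = rng.random()              # random number in [0, 1)
    if z > lam:
        # Exploit: take the bandit with the highest expected score.
        return max(hosts, key=host_scores.get)
    # Explore: take any bandit, regardless of its score.
    return rng.choice(hosts)


scores = {"shop.example": 0.9, "blog.example": 0.4, "forum.example": 0.1}
greedy_picks = {select_host(scores, 0.0, random.Random(0)) for _ in range(50)}
random_pick = select_host(scores, 1.0, random.Random(0))
```

With λ = 0 the rule is purely greedy, and with λ = 1 every turn is a random draw. Values in between trade exploitation of known good hosts against exploration of hosts about which little is known.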



Scoring Functions

Incorporate knowledge in score calculation for bandit/host:

• Best Score (Pure classification-based selection)

• Negative Absolute Bad

• Success Rate

• Absolute Good · Best Score

• Success Rate · Best Score

• Thompson Sampling
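The slide only names the scoring functions, so the following Python sketch shows one plausible reading of each and should not be taken as the authors' exact definitions. It assumes that each host keeps a count of crawled pages that met the objective (`good`), a count of those that did not (`bad`), and the highest classifier confidence among its queued URLs (`best_score`). The class `HostStats`, the handling of hosts without crawled pages, and the uniform Beta prior in the Thompson sampling variant are all assumptions.

```python
import random
from dataclasses import dataclass


@dataclass
class HostStats:
    good: int = 0             # crawled pages that met the objective
    bad: int = 0              # crawled pages that did not
    best_score: float = 0.0   # highest classifier confidence among queued URLs


def best_score(host):
    """Pure classification-based selection."""
    return host.best_score


def negative_absolute_bad(host):
    """Penalize hosts by the number of irrelevant pages seen so far."""
    return -host.bad


def success_rate(host):
    """Share of relevant pages; 0.0 for unseen hosts (an assumption)."""
    total = host.good + host.bad
    return host.good / total if total else 0.0


def absolute_good_times_best_score(host):
    return host.good * host.best_score


def success_rate_times_best_score(host):
    return success_rate(host) * host.best_score


def thompson_sample(host, rng=random):
    """Draw from a Beta posterior over the success rate (uniform prior assumed)."""
    return rng.betavariate(host.good + 1, host.bad + 1)


example = HostStats(good=3, bad=1, best_score=0.8)
sampled = thompson_sample(example, random.Random(42))
```

Any of these functions can provide the expected score per bandit that the selection step compares across hosts.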



System Workflow

[Figure: system workflow. Components: Online Classifier, Bandits, Crawler, URL Parser, Semantic Parser. Flow labels: Seeds, URLs, Classified URL, URL, HTML Page, Feedback.]



Setup for Experiments

• Data originates from the Common Crawl Corpus 2012

• Including over 3.5 billion HTML pages

• Extracted a subset of 5.5 million linked pages

• Including 450k different hosts

• Identified all pages within the subset containing at least one markup language (using the WebDataCommons corpus)

• 27.5% of all pages


http://commoncrawl.org/

http://webdatacommons.org/structureddata/index.html


Experiment Description

Measure: Number of relevant pages retrieved within the first 1 million pages crawled.

1. Online vs. batch-based classification with 100K, 250K, and 1M pages

2. Pure online classification vs. enhanced with bandit-based selection (λ=0)

3. Improvements with different λ

4. Improvements with decaying λ



Results: Online vs. Offline

• Both methods outperform Breadth-First Search (BFS)

• Static approach: 340K

• Adaptive approach: 539K

[Figure: percentage of relevant pages (y-axis) against fetched web pages (x-axis)]



Results: Pure Online Classification vs. +Bandit-based

• Success-rate-based scoring functions show the most promising results

• Negative absolute bad scoring performs like BFS

• Success rate function: 628K

• Pure online-classification: 539K

[Figure: percentage of relevant pages (y-axis) against fetched web pages (x-axis)]



Results: λ > 0

• Including randomness does not seem to have an effect

• A beneficial effect of λ > 0 is visible, e.g., for the success rate function within the first 400K crawled pages

[Figure: percentage of relevant pages (y-axis) against fetched web pages (x-axis)]



Results: Decaying λ

Decaying λ over time means reducing the randomness as more pages are crawled.

• Success rate function with decaying λ = 0.5: 673K

• Static λ: 628K

[Figure: percentage of relevant pages (y-axis) against fetched web pages (x-axis)]
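The slides do not state how λ is decayed, so the following Python sketch only illustrates the general idea with an exponential schedule. The function name `decayed_lambda`, the half-life parameter, and its default value are assumptions, not the schedule used in the experiments.

```python
def decayed_lambda(initial_lambda, pages_crawled, half_life=100_000):
    """Return the degree of randomness after a number of crawled pages.

    The value halves every `half_life` pages, so the crawler explores a lot
    early on and relies increasingly on its gathered knowledge later.
    """
    if not 0.0 <= initial_lambda <= 1.0:
        raise ValueError("initial_lambda must be between 0 and 1")
    return initial_lambda * 0.5 ** (pages_crawled / half_life)


start = decayed_lambda(0.5, 0)
after_one_half_life = decayed_lambda(0.5, 100_000)
after_one_million = decayed_lambda(0.5, 1_000_000)
```

The returned value would replace the fixed λ in the bandit selection step at every turn.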



Adaptation to a More Specific Objective

• General objective is narrowed down to:

• Pages that use the markup language Microdata and

• include at least five marked-up statements

• Example:

1. A page including information about a movie

2. The movie has the name Se7en

3. with a rating of 8.7 out of 10

4. and it was released in 1995

5. This information is maintained by imdb.com



Results: Adaptation to a More Specific Objective

• 3.5% of pages include such information

• In general: Observation of beneficial effects using our approach

• Static λ = 0.2: 120K

• Decaying λ = 0.5: 108K

[Figure: percentage of relevant pages (y-axis) against fetched web pages (x-axis)]



Conclusion

• Improvement of 26% over the pure online classification-based selection strategy for the general objective

• Improvement of 66% for the more specific objective

• Success-rate-based scoring functions show the most promising results for both objectives



Open Challenges

• Extend the approach so that results from one bandit inform the other bandits (contextual bandits)

• Introduce a more fine-grained grading of the crawled pages (multi-class problem)

• Take into account the quality of the gathered information (besides richness)

• Adapt the process to traditional topical focused crawling

• Publishing of code and data to the community



More Information

• Paper accepted at ACM International Conference on Information and Knowledge Management in Shanghai, China

• ACM Digital Library: Focused Crawling for Structured Data

• Detailed Descriptions and Source Code:

• Anthelion Webpage

• Datasets:

• Common Crawl Foundation Corpora

• WebDataCommons Corpora


http://dl.acm.org/citation.cfm?id=2661902

http://webdatacommons.org/structureddata/anthelion/

http://commoncrawl.org/the-data/get-started/

http://webdatacommons.org/structureddata/
