Focused Crawling for Structured Data


Transcript of Focused Crawling for Structured Data

Page 1: Focused Crawling for Structured Data

Focused Crawling for Structured Data

Robert Meusel, Peter Mika, and Roi Blanco

Page 2: Focused Crawling for Structured Data

Markup Languages in HTML Pages

<html>
<body>
<div id="main-section" class="performance left" data-sku="M17242_580">
  <h1>Predator Instinct FG Fußballschuh</h1>
  <div>
    <meta content="EUR">
    <span data-sale-price="219.95">219,95</span>
  </div>
</div>
</body>
</html>

HTML pages embed markup languages directly to annotate items using different vocabularies.

<html>
<body>
<div id="main-section" class="performance left" data-sku="M17242_580" itemscope itemtype="http://schema.org/Product">
  <h1 itemprop="name">Predator Instinct FG Fußballschuh</h1>
  <div itemscope itemtype="http://schema.org/Offer" itemprop="offers">
    <meta itemprop="priceCurrency" content="EUR">
    <span itemprop="price" data-sale-price="219.95">219,95</span>
  </div>
</div>
</body>
</html>

_:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Product> .
_:node1 <http://schema.org/Product/name> "Predator Instinct FG Fußballschuh"@de .
_:node2 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Offer> .
_:node2 <http://schema.org/Offer/price> "219,95"@de .
_:node2 <http://schema.org/Offer/priceCurrency> "EUR" .
…

Meusel, Mika, Blanco: Focused Crawling for Structured Data @ CIKM 2014, Shanghai

Page 3: Focused Crawling for Structured Data

Deployment of Markup Languages

14% of all sites use markup languages to annotate their data (as of 2013) [Meusel2014]

• Broad topical variety, from articles and products to recipes [Bizer2013]

• Multiple strong drivers push the deployment

• The search engine companies' Schema.org initiative

• The Open Graph Protocol used by Facebook

Page 4: Focused Crawling for Structured Data

Motivation

• Existing datasets/crawls do not focus on structured data

• Common Crawl Foundation uses PageRank and Breadth-First Search

• Datasets extracted from these corpora, such as the WebDataCommons corpus, are likely to miss large amounts of data [Meusel2014]

• Structured information

• Hundreds of millions of pages

• Up-to-date information

• Publicly available

Page 5: Focused Crawling for Structured Data

Main Idea

• Adapting the idea of focused crawling

• Similarities:

• Evaluation of content based on an objective function

• Differences:

• Focused crawling typically targets a topic, not the quality or amount of data collected

• As a result, direct feedback about crawled pages is typically not available

In our setting, feedback can be incorporated directly into the system to improve the classification of newly discovered URLs.

Page 6: Focused Crawling for Structured Data

Online Learning for Focused Crawling

• Capability to incorporate real-time feedback

• Improves performance

• Adapts to concept drift

• Possible features

• URL-based features, mainly tokens from the URL string itself

• Features describing information from the parent(s) of the URL

• Features describing information from the siblings of the URL

• Free open-source software is available (e.g., the Massive Online Analysis library by Bifet et al.)
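
As a rough illustration of online learning on URL-based features, the sketch below updates a classifier after every fetched page. It is not the authors' setup: the slides point to the Massive Online Analysis library, whereas this sketch swaps in a plain logistic regression trained by stochastic gradient descent. The names `url_tokens` and `OnlineUrlClassifier`, the tokenizer, and the learning rate are assumptions made for illustration only.

```python
import math
import re


def url_tokens(url):
    """Split a URL into lowercase alphanumeric tokens (an assumed, simple tokenizer)."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


class OnlineUrlClassifier:
    """Logistic regression over URL tokens, updated one example at a time."""

    def __init__(self, learning_rate=0.5):
        self.learning_rate = learning_rate  # assumed value, not from the slides
        self.weights = {}                   # one weight per URL token
        self.bias = 0.0

    def predict(self, url):
        """Return the estimated probability that the URL leads to a relevant page."""
        z = self.bias + sum(self.weights.get(t, 0.0) for t in set(url_tokens(url)))
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, url, relevant):
        """Incorporate feedback from a fetched page (relevant is True or False)."""
        error = (1.0 if relevant else 0.0) - self.predict(url)
        self.bias += self.learning_rate * error
        for t in set(url_tokens(url)):
            self.weights[t] = self.weights.get(t, 0.0) + self.learning_rate * error
```

In a crawl loop, `predict` would score newly discovered URLs and `update` would be called once the fetched page has been checked for markup, so the model can follow concept drift without retraining from scratch.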

Page 7: Focused Crawling for Structured Data

Exploration vs. Exploitation

• Decision/Classification is based on gathered knowledge

• Knowledge can be incomplete

• Crawled too few pages

• Knowledge can become invalid

• Reaching a part of the Web with different behavior

Selecting the page with the highest confidence of supporting our objective might not always be the best choice.

Page 8: Focused Crawling for Structured Data

Bandit-Based Selection

• Group URLs by the host they belong to

• Each host represents one bandit

• Calculate the expected score for each bandit based on a scoring function

• Select the degree of randomness λ

• λ between 0 and 1

• In each turn, draw a random number z between 0 and 1

• z > λ: select the bandit with the highest score

• else: select a random bandit
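
The selection rule above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the function names `group_by_host` and `select_bandit` are hypothetical, z is assumed to be drawn uniformly from [0, 1), and the per-host scores are assumed to be computed elsewhere by a scoring function.

```python
import random
from urllib.parse import urlparse


def group_by_host(urls):
    """Group URLs by host; each host represents one bandit."""
    bandits = {}
    for url in urls:
        bandits.setdefault(urlparse(url).netloc, []).append(url)
    return bandits


def select_bandit(scores, lam, rng=random):
    """Pick a host: the best-scoring one if z > lam, otherwise a random one."""
    z = rng.random()  # assumed: uniform in [0, 1)
    if z > lam:
        return max(scores, key=scores.get)  # exploit
    return rng.choice(list(scores))         # explore
```

With λ = 0 the rule almost always exploits the best-scoring host, and with λ = 1 it always picks a host at random, so λ directly controls the trade-off between exploitation and exploration.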

Page 9: Focused Crawling for Structured Data

Scoring Functions

Incorporate knowledge in score calculation for bandit/host:

• Best Score (Pure classification-based selection)

• Negative Absolute Bad

• Success Rate

• Absolute Good · Best Score

• Success Rate · Best Score

• Thompson Sampling
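
The slide names the scoring functions without defining them. The sketch below shows one plausible reading, to make the differences concrete; every definition in it is an assumption. In particular, the `HostStats` class and its field names are hypothetical, the smoothing in `success_rate` is an illustrative choice, and Thompson sampling is shown in its common Beta-posterior form rather than in whatever exact form the authors used.

```python
import random


class HostStats:
    """Per-host (per-bandit) knowledge; class and field names are illustrative."""

    def __init__(self, good=0, bad=0, best_score=0.0):
        self.good = good              # crawled pages that met the objective
        self.bad = bad                # crawled pages that did not
        self.best_score = best_score  # highest classifier confidence among queued URLs


def best_score(h):
    # Pure classification-based selection.
    return h.best_score


def negative_absolute_bad(h):
    # Hosts with many irrelevant pages are penalized.
    return -h.bad


def success_rate(h):
    # Assumed smoothing: unseen hosts start at 0.5.
    return (h.good + 1) / (h.good + h.bad + 2)


def absolute_good_times_best_score(h):
    return h.good * h.best_score


def success_rate_times_best_score(h):
    return success_rate(h) * h.best_score


def thompson_sampling(h, rng=random):
    # Draw from a Beta posterior over the host's success probability.
    return rng.betavariate(h.good + 1, h.bad + 1)
```

Each function maps a host's statistics to a single number, so any of them can be plugged into the bandit selection step as the expected score of a host.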

Page 10: Focused Crawling for Structured Data

System Workflow

[Diagram: components Online Classifier, Bandits, Crawler, URL Parser, Semantic Parser; flows labeled Seeds, URLs, Classified URL, URL, HTML Page, Feedback]

Page 11: Focused Crawling for Structured Data

Setup for Experiments

• Data originates from the Common Crawl Corpus 2012

• Including over 3.5 billion HTML pages

• Extracted a subset of 5.5 million linked pages

• Including 450K different hosts

• Identified all pages within the subset containing at least one markup language (using the WebDataCommons corpus)

• 27.5% of all pages

Page 12: Focused Crawling for Structured Data

Experiment Description

Measure: Number of relevant pages retrieved within the first 1 million pages crawled.

1. Online vs. batch-based classification with 100K, 250K, and 1M pages

2. Pure online classification vs. enhanced with bandit-based selection (λ=0)

3. Improvements with different λ

4. Improvements with decaying λ

Page 13: Focused Crawling for Structured Data

Results: Online vs. Offline

• Both methods outperform Breadth-First Search (BFS)

• Static (batch) approach: 340K relevant pages

• Adaptive (online) approach: 539K relevant pages

[Figure: percentage of relevant pages (y-axis) vs. fetched web pages (x-axis)]

Page 14: Focused Crawling for Structured Data

Results: Pure Online Classification vs. +Bandit-based

• Success-rate-based scoring functions show the most promising results

• Negative absolute bad scoring performs like BFS

• Success rate function: 628K

• Pure online classification: 539K

[Figure: percentage of relevant pages (y-axis) vs. fetched web pages (x-axis)]

Page 15: Focused Crawling for Structured Data

Results: λ > 0

• Including randomness does not seem to have an overall effect

• A beneficial effect of λ > 0 is visible, e.g., for the success rate function within the first 400K crawled pages

[Figure: percentage of relevant pages (y-axis) vs. fetched web pages (x-axis)]

Page 16: Focused Crawling for Structured Data

Results: Decaying λ

Decaying λ over time reduces the randomness as more pages are crawled.

• Success rate function with decaying λ = 0.5: 673K

• Static λ: 628K
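
The slides do not state how λ decays. As one possible schedule, the sketch below halves λ after a fixed number of crawled pages; the exponential form, the function name `decayed_lambda`, and the `half_life` default are all assumptions made for illustration.

```python
def decayed_lambda(initial_lambda, pages_crawled, half_life=100_000):
    """Return lambda after `pages_crawled` pages, halving every `half_life` pages.

    The exponential schedule and the default half-life are assumptions;
    the slides only say that lambda decays over time.
    """
    return initial_lambda * 0.5 ** (pages_crawled / half_life)
```

Starting from an initial λ of 0.5, this schedule explores often at the beginning of the crawl, when little is known about the hosts, and relies more and more on the learned scores later on.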

[Figure: percentage of relevant pages (y-axis) vs. fetched web pages (x-axis)]

Page 17: Focused Crawling for Structured Data

Adaptation to a More Specific Objective

• General objective is narrowed down to:

• Pages that use the markup language Microdata and

• include at least five marked-up statements

• Example:

1. A page including information about a movie

2. The movie has the name Se7en

3. with a rating of 8.7 out of 10

4. and it was released in 1995

5. This information is maintained by imdb.com
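
A check for this narrower objective could look like the sketch below. It is only an approximation under stated assumptions: it counts `itemtype` and `itemprop` attributes as a proxy for marked-up statements instead of running a full Microdata extractor, and the names `MicrodataCounter` and `is_relevant` are hypothetical.

```python
from html.parser import HTMLParser


class MicrodataCounter(HTMLParser):
    """Counts itemtype and itemprop attributes as a rough proxy for statements."""

    def __init__(self):
        super().__init__()
        self.statements = 0

    def handle_starttag(self, tag, attrs):
        # Attribute names arrive lowercased; values are not needed for counting.
        for name, _value in attrs:
            if name in ("itemtype", "itemprop"):
                self.statements += 1


def is_relevant(html, min_statements=5):
    """True if the page carries at least `min_statements` Microdata statements."""
    counter = MicrodataCounter()
    counter.feed(html)
    counter.close()
    return counter.statements >= min_statements
```

The boolean result is the kind of feedback signal that can be returned to the classifier and the bandits after each fetched page.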

Page 18: Focused Crawling for Structured Data

Results: Adaptation to a More Specific Objective

• 3.5% of pages include such information

• In general, our approach shows beneficial effects

• Static λ = 0.2: 120K

• Decaying λ = 0.5: 108K

[Figure: percentage of relevant pages (y-axis) vs. fetched web pages (x-axis)]

Page 19: Focused Crawling for Structured Data

Conclusion

• Improvement of 26% over a pure online classification-based selection strategy for the general objective

• Improvement of 66% for the more specific objective

• Success-rate-based scoring functions show the most promising results for both objectives

Page 20: Focused Crawling for Structured Data

Open Challenges

• Extend the approach so that results from one bandit inform the other bandits (contextual bandits)

• Introduce a finer-grained grading of the crawled pages (multi-class problem)

• Take into account the quality of the gathered information (besides richness)

• Adapt the process to traditional topical focused crawling

• Publish code and data for the community

Page 21: Focused Crawling for Structured Data

More Information

• Paper accepted at the ACM International Conference on Information and Knowledge Management (CIKM 2014) in Shanghai, China

• ACM Digital Library: Focused Crawling for Structured Data

• Detailed Descriptions and Source Code:

• Anthelion Webpage

• Datasets:

• Common Crawl Foundation Corpora

• WebDataCommons Corpora
