large-scale web mining - O'Reilly Mediaassets.en.oreilly.com/1/event/75/Large scale web...

Copyright (c) 2012 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden. Web Mining Strata 2012 1 photo by: i_pinz, flickr Scale Unlimited Tuesday, February 28, 12



Welcome to Web Mining!

This class is a tutorial on large-scale web mining

Topics covered

Overview of web mining

Web crawling - broad & focused

Text mining - extracting value

Hands-on lab

Tips and traps


Meet Your Instructor

Ken Krugler - direct from Nevada City, California

Founder of TransPac Software, Krugle, Bixo Labs/Scale Unlimited

Developer of Bixo web mining toolkit

Committer on Apache Tika

Developer and trainer for Hadoop, Solr and Cascading

Actively web mining for six years


Agenda

9:00am - Overview

9:30am - Web Crawling

10:00am - Text Mining

10:30am - Break

11:00am - Web Mining Lab

11:45am - Lab Review

12:00pm - Summary

12:15pm - Q&A


photo by: exfordy, flickr

Overview


Key Questions

Which of the three types of web mining are we focusing on today?

What makes web pages “noisy”?


What is web mining?

Extracting useful information from the World-wide Web

Web structure - link graph analysis

Web usage - server logs

173.255.195.185 - - [05/Sep/2011:06:03:56 -0600] "GET /feed/ HTTP/1.1" 200 166 "-"

67.124.22.71 - - [05/Sep/2011:06:03:58 -0600] "GET /summary/ HTTP/1.1" 200 809 "-"

89.105.44.90 - - [05/Sep/2011:06:04:02 -0600] "GET /feedx/ HTTP/1.1" 404 0 "-"

Web content - text and images
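As a small illustration of web usage mining, the three server log lines shown for the "Web usage - server logs" bullet can be parsed with a regular expression. This is only a sketch: the pattern assumes the exact field layout of those sample lines, and the function names are made up for this example.

```python
import re
from collections import Counter

# Pattern for the sample access-log lines (common log format plus a referrer field).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+) "(?P<referrer>[^"]*)"'
)

def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    record["size"] = int(record["size"])
    return record

def count_statuses(lines):
    """Count responses per HTTP status code, skipping unparseable lines."""
    records = (parse_log_line(line) for line in lines)
    return Counter(r["status"] for r in records if r is not None)

# The three sample lines from the slide.
SAMPLE_LINES = [
    '173.255.195.185 - - [05/Sep/2011:06:03:56 -0600] "GET /feed/ HTTP/1.1" 200 166 "-"',
    '67.124.22.71 - - [05/Sep/2011:06:03:58 -0600] "GET /summary/ HTTP/1.1" 200 809 "-"',
    '89.105.44.90 - - [05/Sep/2011:06:04:02 -0600] "GET /feedx/ HTTP/1.1" 404 0 "-"',
]
```

Calling `count_statuses(SAMPLE_LINES)` tallies the two successful requests and the one 404, which is the kind of reduction usage mining starts from.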

Web Content Mining

Analyzing data from web pages

Typically three types of page processing

Unstructured - get rid of “boilerplate” text, analyze sentiment

Semi-structured - find names of people with phone numbers

Structured - find hotel name, address, phone number, reviews

Plus inter-document analysis

Clustering


Crawling versus Mining

Web mining combines...

web crawling - finding & fetching content

data mining - extracting useful information

Both fields are broad and deep - for example

optimal crawling strategies

machine learning for page classification

automatically extracting structured data


What is “Large Scale”?

More than what you can handle with one server

Many single-server solutions for mining web pages

Harder when you include text analytics

And (almost) impossible when you get to 100M+ pages

So you need some kind of distributed processing framework


Key Aspects of Web Mining

Crawling - finding the “good stuff”

Extracting - getting the “right data”

Processing - turning bytes into bucks


Finding the “good stuff”

Often feels like “needle in a haystack”

E.g. even 100M pages is 0.1% of total web

Need to optimize time + cost per useful result

Can’t afford to waste time on pages that aren’t useful

And each page has cost to data provider


Getting the “right data”

Scale and precision are in opposition

One area of one site can be precision-processed

All areas of 50M domains means you have to be general

Pages are noisy

Ads

Boilerplate (navigation, etc)

SEO


Processing the results

1TB data file has very little value

Actually less value than a small file that can be opened & viewed

Has to be turned into something with value

Often processing is considered part of web mining

Reduction - turning petabytes into pie charts

Indexing - being able to search the data

Analytics - clustering, training models for recommenders


Q & A

Which of the three types of web mining are we focusing on today?

What makes web pages “noisy”?


Web Crawling

photo by: Graham Racher, flickr


Key Questions

What are three general types of web crawls?

What can make it hard to accurately score a page?


What is “web crawling”?

Includes fetching pages, of course

But also has aspect of spider crawling over a web

Extracting outlinks to discover new pages

Which means parsing the fetched content

Managing state of the crawl

And all of the implicit rules

Robots exclusion protocol

User agent

Request rate
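The implicit rules listed above (robots exclusion, user agent, request rate) can be checked with Python's standard `urllib.robotparser` module. The sketch below parses a robots.txt body held in a string so it runs without network access; the robots.txt content, the `example-crawler` user agent and the example.com URLs are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content and user agent name, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""
USER_AGENT = "example-crawler"

def build_parser(robots_txt):
    """Parse a robots.txt body that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

def may_fetch(parser, url, user_agent=USER_AGENT):
    """True if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)

PARSER = build_parser(ROBOTS_TXT)

# Seconds to wait between requests, or None if the site does not say.
REQUEST_DELAY = PARSER.crawl_delay(USER_AGENT)
```

A crawler would typically call `may_fetch` before each request and sleep for `REQUEST_DELAY` seconds (or a polite default) between requests to the same host.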


Types of web crawls

Broad

Few or no limits to what domains/pages to process

Typically what people think of - Googlebot, bingbot, Baiduspider, ...

Focused

Uses page scoring -> outlinks to guess at quality of unfetched pages

Often has whitelist of domains to avoid traps

Domain

For a limited number of domains

Typically for precise extraction of data


The “don’t crawl” crawl

Leverage other people’s crawl data

Can be faster, cheaper

Reduces load on servers

Public datasets

Common Crawl

Wikipedia - use data dump!

Commercial providers

Spinner, InfoChimps


Crawling Solutions

General rule - don’t roll your own!

Easy to make something simple

Hard to make something scalable, robust, efficient

Open source options

Java - Nutch, Heritrix, Bixo, Droids

Python - http://scrapy.org/

PHP - http://astellar.com/php-crawler/


What makes it hard?

Web mining breaks the implicit contract with web sites

You often aren’t creating an index that drives traffic to them

So why should they let you use bandwidth & server cycles?

The web is a nearly infinite set of edge cases

Every possible problem will occur, with a broad enough crawl

And not everybody plays nice

Link farms/honeypots, malicious sites, angry webmasters

Plus you have to be able to work at scale


Scaling Solutions

Needs to be reliable, scalable, fault tolerant

Single server can fetch lots of pages

But scaling is issue with post-processing

Several options

Hadoop - Nutch, Bixo

Custom queuing system - Heritrix, Droids

Storm - scalable queuing


Focused Crawling 101

How to maximize results while minimizing cost

aka Finding Good Stuff Fast

Only crawl pages that you think are likely to be good

Reduces cost through

Less time spent fetching worthless pages

Lower bandwidth/CPU/storage costs

Fewer angry webmasters


Focused Crawl Details

Seed URLs - Good starting point

URL State - DB of all known URLs

Page Score - “Quality” of page

Link Score - Page Score/outlinks

Fetched Pages - Saved results
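The pieces named on this slide (seed URLs, URL state, page score, link score, fetched pages) can be wired together roughly as follows. This is a minimal in-memory sketch, not how Bixo or any other toolkit does it: `fetch` and `score_page` are hypothetical caller-supplied functions, the URL state is a plain dict rather than a database, and seeds are given an arbitrary starting score of 1.0.

```python
import heapq

def focused_crawl(seed_urls, fetch, score_page, max_pages):
    """Fetch the highest-scoring known URLs first.

    fetch(url) returns (text, outlinks); score_page(text) returns a number.
    Both are stand-ins for a real fetcher and a real page scorer.
    """
    url_state = {url: 1.0 for url in seed_urls}   # URL State: best link score per known URL
    queue = [(-1.0, url) for url in seed_urls]     # max-heap via negated scores
    heapq.heapify(queue)
    fetched = {}                                   # Fetched Pages: url -> page score

    while queue and len(fetched) < max_pages:
        _, url = heapq.heappop(queue)
        if url in fetched:
            continue                               # stale queue entry, already processed
        text, outlinks = fetch(url)
        page_score = score_page(text)
        fetched[url] = page_score
        if not outlinks:
            continue
        link_score = page_score / len(outlinks)    # Link Score = Page Score / outlinks
        for link in outlinks:
            if link in fetched:
                continue
            # Keep only the best score seen so far for each unfetched URL.
            if link_score > url_state.get(link, float("-inf")):
                url_state[link] = link_score
                heapq.heappush(queue, (-link_score, link))
    return fetched
```

Because unfetched URLs inherit a share of the score of the page that linked to them, outlinks of high-quality pages are fetched before outlinks of poor ones, and the `max_pages` budget is spent where results are most likely.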


Finding Seed URLs

Broad

List of all registered domains - Complete, but big (100M+)

DMOZ - lots of spam/porn

Alexa/Quantcast “top sites” list - top 1M US sites by traffic

Wikipedia - use outlink dump if possible

Tweets - with filtering, e.g. Gnip, DataSift

Using search

Manually entering URLs - slow, but curated

Using API - faster, typically limited, can have junk

Narrow


Scoring Pages

Analyze text on page

Typically means tokenizing text

“The sport of ultimate is..” => “the”, “sport”, “of”, “ultimate”, “is”, ...

Simple term-based

Count occurrences of all phrases, good phrases, bad phrases

Calculate ratios of counts: good/all - bad/all = score
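The term-based score above can be sketched in a few lines. For simplicity this version works on single-word tokens rather than multi-word phrases, which is an assumption of the example; the function names and the good/bad term sets are illustrative only.

```python
import re

def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def term_score(text, good_terms, bad_terms):
    """Return good/all - bad/all, or 0.0 for a page with no tokens."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    good = sum(1 for token in tokens if token in good_terms)
    bad = sum(1 for token in tokens if token in bad_terms)
    return good / len(tokens) - bad / len(tokens)

# Illustrative term lists; a real crawl would tune these for its topic.
GOOD_TERMS = {"sport", "ultimate", "disc"}
BAD_TERMS = {"golf", "timeshare"}
```

The score ranges from -1.0 (every token is a bad term) to 1.0 (every token is a good term), so a single threshold can separate pages worth following from pages to skip.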


Scoring using SVM

SVM = support vector machine

Trained using “documents” that have features, and a class

“good” : “ultimate”, “ultimate frisbee”, “disc”, “sport”, “throw”, “run”

“bad” : “golf”, “timeshare”, “aardvark”, “potato”

Creates a statistical model

Divides all training documents into separate classes

Used to give an unknown document a class
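To make the SVM idea concrete, here is a toy linear SVM trained by sub-gradient descent on the hinge loss, using the "good" and "bad" features from the slide. It is a teaching sketch under stated assumptions, not what the class or any toolkit uses: the four training documents are invented, features are simple presence flags, and a real project would normally rely on an established SVM library.

```python
def featurize(features, vocabulary):
    """Presence vector: 1.0 if the vocabulary term occurs in the document."""
    present = set(features)
    return [1.0 if term in present else 0.0 for term in vocabulary]

def train_linear_svm(vectors, labels, epochs=100, learning_rate=0.1, reg=0.01):
    """Minimise hinge loss plus an L2 penalty. Labels are +1 (good) or -1 (bad)."""
    weights = [0.0] * len(vectors[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            margin = y * (sum(w * xi for w, xi in zip(weights, x)) + bias)
            # L2 regularisation shrinks the weights a little on every step.
            weights = [w * (1.0 - learning_rate * reg) for w in weights]
            if margin < 1.0:
                # Document is inside the margin or misclassified: push it out.
                weights = [w + learning_rate * y * xi for w, xi in zip(weights, x)]
                bias += learning_rate * y
    return weights, bias

def classify(features, vocabulary, weights, bias):
    """Give an unknown document a class based on which side of the boundary it falls."""
    x = featurize(features, vocabulary)
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return "good" if score >= 0 else "bad"

# Invented training documents built from the slide's feature lists.
TRAINING = [
    (["ultimate", "disc", "throw"], 1),
    (["ultimate frisbee", "sport", "run"], 1),
    (["golf", "timeshare"], -1),
    (["aardvark", "potato"], -1),
]
VOCABULARY = sorted({f for features, _ in TRAINING for f in features})
VECTORS = [featurize(features, VOCABULARY) for features, _ in TRAINING]
LABELS = [label for _, label in TRAINING]
WEIGHTS, BIAS = train_linear_svm(VECTORS, LABELS)
```

After training, `classify(["disc", "sport", "run"], VOCABULARY, WEIGHTS, BIAS)` assigns an unseen mix of features to a class, which is the "give an unknown document a class" step from the slide.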


Challenges with Scoring

How do you decide that a page is “good”?

Might be mostly graphics with few words

Could be a definition of the term

Min threshold for amount of real content

Detecting link farms with fake content


Chrome, cruft, and boilerplate

Navigational links

Sidebar elements

Ads, SEO links

Can use Boilerpipe & other “cleaners”


Expanding the crawl frontier

Have to parse the page to find outlinks

Need to normalize links

http://scaleunlimited.com == http://www.scaleunlimited.com

http://www.scaleunlimited.com == http://www.scaleunlimited.com/

Skipping links to low-value pages

Links to images, pdf files, other binary types (using suffix)

Links to DB-generated pages
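The normalization and skipping rules above might look like this with the standard `urllib.parse` module. Several choices here are assumptions of the sketch rather than rules from the slides: stripping a leading `www.` treats both hosts as the same site, the suffix list is only a sample, and using "has a query string" as the test for DB-generated pages is a crude heuristic.

```python
from urllib.parse import urlsplit, urlunsplit

# Sample suffixes for binary content; extend as needed.
SKIP_SUFFIXES = (".jpg", ".jpeg", ".png", ".gif", ".pdf", ".zip", ".mp3", ".mp4")

def normalize_url(url):
    """Map equivalent spellings of a URL to one canonical form."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]            # assumption: www and bare host serve the same site
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    path = parts.path or "/"       # an empty path and "/" name the same resource
    # Drop the fragment, since it never reaches the server.
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))

def is_low_value(url):
    """Guess from the suffix or query string that a link is not worth fetching."""
    parts = urlsplit(url)
    if parts.path.lower().endswith(SKIP_SUFFIXES):
        return True
    return bool(parts.query)       # crude proxy for DB-generated pages
```

With this, all three spellings of the Scale Unlimited home page shown above collapse to the same key, so the URL state holds one entry instead of three.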


Focused Domain Crawl

Very specific, explicit crawl of one domain

Typically involves discovery of target content pages

Often uses URL patterns to synthesize links

Page X in site has list of products: a, b, c, d...

Product pages are <domain>/product/a or b or c or d...
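Synthesizing links from a URL pattern, as described above, can be as simple as the following sketch. The function name and the shop.example.com domain are hypothetical; only the `<domain>/product/<id>` pattern comes from the slide.

```python
from urllib.parse import quote

def synthesize_product_urls(domain, product_ids):
    """Build <domain>/product/<id> for every id found on the listing page."""
    base = domain.rstrip("/") + "/product/"
    # Percent-encode each id so spaces or slashes cannot break the path.
    return [base + quote(str(product_id), safe="") for product_id in product_ids]

# Hypothetical domain and ids, standing in for values scraped from "Page X".
PRODUCT_URLS = synthesize_product_urls("http://shop.example.com", ["a", "b", "c", "d"])
```

The resulting URLs go straight into the URL state as details pages, without waiting for the crawler to discover them by following links.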


Discovery vs Extraction

Focused Domain Crawl has two distinct phases

Crawling to discover details pages

Fetching/processing details pages

Often phases are co-mingled, for efficiency

Need to track what kind of page in the URL State DB


Goby Crawl Example

http://www.goby.com has information on lots of attractions

http://www.goby.com/boston-ma has list of categories

http://www.goby.com/<category>--near--<city>-<state>

Often need to paginate listing pages, to get all details links


Q & A

What are three general types of web crawls?

What can make it hard to accurately score a page?


Data Extraction

photo by: Graham Racher, flickr


Key Questions

What are the three general approaches for data extraction?

Why might you want to detect the language of a page?


You’ve got a page, now what?

Time to extract the data you need

Three attributes of extraction, pick any two

Broad - across lots of domains and page formats

Precise - very specific types of data

Accurate - low error rate

Three general approaches

Unstructured (broad, accurate) - "just text"

Semi-structured (broad, precise) - finding meaning in text

Structured (precise, accurate) - getting exactly the data you need


Common Tasks - Cleaning

The HTML needs to be cleaned up

Lots of messy data, especially when hand-edited

Even HTML (2.0? 3.2? 4.01?) should be converted to XHTML

Various libraries help with "cleaning" the HTML

TagSoup, NekoHTML, HtmlCleaner

Note that end result won’t match original text


Common Tasks - Charset

You get bytes back from the web server

You need a charset to convert bytes to characters

HTTP response header - "Content-Type: text/html; charset=UTF-8"

HTML meta tag - <meta http-equiv="Content-Type" content="..." />

Analysis of text - byte sequence statistics

Several packages support this

Tika, ICU
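A rough sketch of how the first two signals might be combined, assuming the declared charset is checked in header-then-meta order with UTF-8 as a fallback. The function names, the 4096-byte scan window and the fallback are assumptions; the third signal (byte sequence statistics) is not attempted here and is better left to a package such as Tika or ICU.

```python
import re

# Matches charset=... inside a <meta> tag in the raw bytes of the page.
META_CHARSET = re.compile(rb"""<meta[^>]+charset\s*=\s*["']?([\w-]+)""", re.I)

def charset_from_header(content_type):
    """Pull the charset out of a Content-Type header value, if present."""
    match = re.search(r"charset\s*=\s*\"?([\w-]+)", content_type or "", re.I)
    return match.group(1) if match else None

def detect_charset(content_type, body, default="utf-8"):
    """Try the HTTP header, then the HTML meta tag, then a fallback."""
    charset = charset_from_header(content_type)
    if charset is None:
        # Only scan the start of the document, where <meta> normally lives.
        match = META_CHARSET.search(body[:4096])
        charset = match.group(1).decode("ascii") if match else None
    return (charset or default).lower()

def decode_page(content_type, body):
    """Convert the response bytes to text using the detected charset."""
    charset = detect_charset(content_type, body)
    try:
        return body.decode(charset, errors="replace")
    except LookupError:   # the server declared a charset Python does not know
        return body.decode("utf-8", errors="replace")
```

Declared charsets are sometimes wrong, so a production crawler would likely cross-check them against the statistical guess rather than trust them outright.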


Common Tasks - Link Extraction

Needed to have a crawl - where new links come from

Means you need XHTML so you can parse the markup

Not just <a href="xxx">

img, frame, iframe, link, map, area
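A small sketch of pulling links from more than just anchor tags, using Python's tolerant html.parser instead of a cleaned XHTML tree (an assumption made to keep the example self-contained). The tag-to-attribute table is also an assumption; map carries no URL itself, so its links come from the area elements inside it.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Which attribute carries the URL for each tag of interest.
LINK_ATTRS = {"a": "href", "area": "href", "link": "href",
              "img": "src", "frame": "src", "iframe": "src"}

class LinkExtractor(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        wanted = LINK_ATTRS.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))

def extract_links(base_url, html):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```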


Common Tasks - Boilerplate

For unstructured and semi-structured

Can improve the quality of results

Especially important for machine learning

Boilerplate text can dramatically skew statistics

Creates a noisier signal
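The slide does not name a boilerplate-removal technique, so the following is just one simple heuristic, swapped in for illustration: drop any text line that repeats on more than half of the pages fetched from the same site. The function name and the 0.5 threshold are assumptions.

```python
from collections import Counter

def strip_boilerplate(pages, max_share=0.5):
    """Drop text lines that appear on more than max_share of a site's pages."""
    counts = Counter()
    for text in pages:
        # Count each distinct line once per page.
        counts.update({line.strip() for line in text.splitlines() if line.strip()})
    limit = max_share * len(pages)
    cleaned = []
    for text in pages:
        kept = [line.strip() for line in text.splitlines()
                if line.strip() and counts[line.strip()] <= limit]
        cleaned.append("\n".join(kept))
    return cleaned
```

Navigation bars and footers repeat across a site, so they fall out, while page-specific text survives for the statistics.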


Common Tasks - Language

Often used for filtering or alternative processing

Target audience is only interested in Spanish

I need to tokenize Japanese differently

Clustering improves when it's segmented by language

Multiple signals for selecting language, same as charset detection

HTTP response header - Content-Language: es

HTML meta tag - <meta http-equiv="Content-Language" content="es" />

HTML tag attributes - <html lang="es">

Analysis of text - ngram statistics, short words
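A toy sketch of combining these signals, assuming declared values are preferred and short-word statistics are the fallback. The word lists are tiny, hand-picked assumptions; real detectors use much larger n-gram profiles, and declared languages are often wrong, so treat the ordering as illustrative only.

```python
# Assumption: tiny hand-picked short-word lists, for illustration only.
SHORT_WORDS = {
    "en": {"the", "and", "of", "to", "is", "in"},
    "es": {"el", "la", "de", "que", "y", "en", "los"},
}

def language_from_text(text):
    """Guess the language by counting common short words."""
    tokens = text.lower().split()
    scores = {lang: sum(t in words for t in tokens)
              for lang, words in SHORT_WORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def detect_language(header_lang, html_lang, text):
    """Prefer declared signals, then fall back to text statistics."""
    for declared in (header_lang, html_lang):
        if declared:
            return declared.split("-")[0].lower()   # "es-MX" -> "es"
    return language_from_text(text)
```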


Unstructured Extraction

Goal is extracting text, without much additional processing

Often has a few fields, from HTML

Title - from <head><title>The title of my page</title></head>

Description - from <meta name="description" content="ultimate frisbee" />

Body - from <body>...all elements that contain text, like <p>...</body>
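The three fields above could be pulled out roughly like this, again with the standard-library html.parser rather than a cleaned XHTML tree. The class and function names are assumptions, and script and style content is skipped on the assumption that it is not wanted as body text.

```python
from html.parser import HTMLParser

class PageText(HTMLParser):
    TRACKED = {"title", "body", "script", "style"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.body_parts = []
        self._open = []   # tracked tags that are currently open

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            if (attrs.get("name") or "").lower() == "description":
                self.description = attrs.get("content") or ""
        elif tag in self.TRACKED:
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag in self._open:
            self._open.remove(tag)

    def handle_data(self, data):
        text = data.strip()
        if not text or "script" in self._open or "style" in self._open:
            return
        if "title" in self._open:
            self.title += text
        elif "body" in self._open:
            self.body_parts.append(text)

def extract_fields(html):
    """Return the title, description and body text of a page."""
    parser = PageText()
    parser.feed(html)
    parser.close()
    return {"title": parser.title, "description": parser.description,
            "body": " ".join(parser.body_parts)}
```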


Semi-structured Extraction

Goal is finding structured data in random text

Can be applied broadly, since it's not (very) format-specific

Accuracy suffers, because of breadth of input data formats

Beware the academic algorithm

Examples of what does work...

Easy patterns: telephone numbers, dates

Microformats: hCalendar, hCard, hReview, ...

Natural Language Processing (NLP): named entities
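As an example of an "easy pattern", here is a regular expression sketch for telephone numbers. It assumes North American style numbers only, such as (530) 555-0100 or 530-555-0100; other countries and formats would need their own patterns.

```python
import re

# Assumption: North American 10-digit numbers with common separators.
PHONE = re.compile(r"(?<!\d)(?:\(\d{3}\)\s*|\d{3}[-.\s])\d{3}[-.\s]\d{4}(?!\d)")

def find_phone_numbers(text):
    """Return every phone-number-like substring found in the text."""
    return PHONE.findall(text)
```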


Structured Extraction

Precise extraction of specific types of data

Typically limited to one area of one site

Often handled with XPath, and maybe regular expressions

//div[@id='<id of target div>']/p

div and span are beautiful tags

Commonly used with CSS

Which means they are (more) stable
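The XPath pattern above can be tried with Python's xml.etree.ElementTree, which supports a limited XPath subset. This sketch assumes the page has already been cleaned into well-formed XHTML without an XML namespace; the function name and the sample id are hypothetical.

```python
import xml.etree.ElementTree as ET

def extract_paragraphs(xhtml, div_id):
    """Return the text of each <p> directly inside the div with the given id."""
    root = ET.fromstring(xhtml)
    # './/' searches all descendants of the root for the target div.
    paragraphs = root.findall(f".//div[@id='{div_id}']/p")
    # itertext() also collects text inside nested tags such as <b>.
    return ["".join(p.itertext()).strip() for p in paragraphs]
```

A page that declares the XHTML namespace would need namespace-qualified names in the path, or a fuller XPath engine.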


How to figure out XPath

Firebug is your friend

Plug-in for Firefox

Will show you the full XPath for each element

Note that browsers will re-write HTML (e.g. tbody element)

The DOM you see is often generated with Javascript


XPath demo

Firebug

XPath tool


Dealing with Javascript

Required if page generates target content using JS

Forces you to use Firebug or equivalent to inspect the DOM

Options for processing include...

HtmlUnit

qt-webkit

headless Mozilla


Javascript challenges

10x slower than just loading the page text

Good way to make a webmaster angry

Lots of extra load on server

Can skew website stats

Often has issues

Pages that work in FF or IE but not HtmlUnit

Pages that cause HtmlUnit to hang


Q & A

What are the three general approaches for data extraction?

Why might you want to detect the language of a page?


Web Crawling Lab


Key Questions

Were you able to build and run the code locally?

Were you able to run a crawl in Elastic MapReduce?

Were you able to improve the focused crawl?


ImageFinder Details

Find images about Ultimate Frisbee

Focused crawl

Fixed list of seed URLs

Positive & negative terms used to score pages

Extract images from page
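To make the positive and negative term idea concrete, here is one possible scoring function. It is an illustrative assumption, not the scoring that the ImageFinder lab code uses: the score is the net count of positive minus negative term hits, divided by the page's word count.

```python
import re

def score_page(text, positive_terms, negative_terms):
    """Score = (positive hits - negative hits) / total words; 0.0 if no words."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    positives = sum(w in positive_terms for w in words)
    negatives = sum(w in negative_terms for w in words)
    return (positives - negatives) / len(words)
```

Dividing by the word count keeps long pages from winning on length alone; a focused crawl would then follow outlinks from the highest-scoring pages first.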


Running ImageFinder

Can run locally, with restrictions

Only one fetcher thread

Only 5 pages/loop

Can run in Hadoop cluster

Amazon Elastic MapReduce

CrawlRunner uploads job jar, creates “Job Flow Step”

Limited to 100 pages/loop, 2 loops

Will take up to 10 minutes to run loops


Running in Elastic MapReduce

Watch your job via http://strata.scaleunlimited.com:9100/

Your job name will include your username

Results get added to searchable index

http://strata.scaleunlimited.com/solr/strata/

Search for student:<username> to find your results


Solr Results Issue

Note links vs. images

These are "image" URLs that actually point to pages

Double-bonus on exercise...fix this problem :)


Lab Details

Code is in strata-web-mining folder you downloaded

Details of code in strata-web-mining/doc/README-Description

Instructions are in strata-web-mining/doc/README-Lab

Please follow the lab steps carefully

Missing a step will cause pain and suffering later


Lab Exercises

First goal is to build code and run locally

Next is to build code and run in real cluster

Then you get to try to optimize the focused crawl

And (if you're fast) try finding images for a different topic


Trouble-shooting & Timing

I'll be walking around - raise your hand if you need help

But with 100+ people, I'll be talking fast :)

We've got an hour (or more) before summary/Q&A

Have fun...


Q & A

Were you able to build and run the code locally?

Were you able to run a crawl in Elastic MapReduce?

Were you able to improve the focused crawl?


Summary & QA


Key Challenges

Complexity of large scale web crawling

Challenges with extracting the right data

Extra work to turn results into value


Ethical Crawling

Always have a real, valid, informative user agent name

Always honor the robot exclusion protocol - robots.txt

Limit your crawl rate - parallelism, crawl delay, pages/day

Immediately comply with blacklisting and data removal requests
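A minimal sketch of honoring robots.txt and a crawl delay with Python's standard urllib.robotparser, assuming the robots.txt content has already been fetched. The user agent name and the one-second default delay are hypothetical; pick a name that identifies your crawler and points to contact information.

```python
from urllib.robotparser import RobotFileParser

# Assumption: hypothetical agent name, for illustration only.
USER_AGENT = "example-crawler"

def build_rules(robots_txt):
    """Parse robots.txt content that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

def fetch_delay(parser, default_delay=1.0):
    """Seconds to wait between requests to one host."""
    delay = parser.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default_delay
```

Before each fetch, the crawler would call can_fetch(USER_AGENT, url) on the parsed rules and sleep for fetch_delay seconds between requests to the same host.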


Avoiding getting blocked

Follow all ethical crawling guidelines

Gradually ramp up your crawl rate

Gives webmasters time to complain before it's a serious problem

Avoid Javascript if at all possible

Don't follow form links

Grovel shamelessly


Resources

Hadoop - http://hadoop.apache.org

Cascading - http://www.cascading.org

Bixo - http://openbixo.org

Web Data Mining by Bing Liu


Question?

I might have answers

[email protected]

@kkrugler
