Web Crawling and Data Gathering with Apache Nutch

Apache Nutch Presentation by Steve Watt at Data Day Austin 2011

Transcript of Web Crawling and Data Gathering with Apache Nutch

Page 1: Apache Nutch - Web Crawling and Data Gathering

Steve Watt - @wattsteve, IBM Big Data Lead, Data Day Austin

Page 2: Topics

Introduction

The Big Data Analytics Ecosystem

Load Tooling

How is Crawl data being used?

Web Crawling - Considerations

Apache Nutch Overview

Apache Nutch Crawl Lifecycle, Setup and Demos

Page 3: The Offline (Analytics) Big Data Ecosystem

[Diagram: Web Content and Your Content flow through Load Tooling into Hadoop; downstream, Data Catalogs, Analytics Tooling and Export Tooling support the Find, Analyze, Visualize and Consume stages]

Page 4: Load Tooling - Data Gathering Patterns and Enablers

Web Content

– Downloading – Amazon Public DataSets / InfoChimps

– Stream Harvesting – Collecta / Roll-your-own (Twitter4J)

– API Harvesting – Roll your own (Facebook REST Query)

– Web Crawling – Nutch

Your Content

– Copy from FileSystem

– Load from Database - SQOOP (see the sketch after this list)

– Event Collection Frameworks - Scribe and Flume
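
As an example of the "Load from Database - SQOOP" item above, a hedged sketch of a Sqoop import into HDFS; the JDBC URL, table name and target directory are placeholders, not values from the talk:

sqoop import --connect jdbc:mysql://dbhost/eventsdb --table attendees --target-dir /data/attendees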

Page 5: How is Crawl data being used?

Build your own search engine

– Built-in Lucene indexes for querying

– Solr integration for multi-faceted search

Analytics

– Selective filtering and extraction with data from a single provider

– Joining datasets from multiple providers for further analytics

– Event Portal example: Is Austin really a startup town?

Extension of the mashup paradigm - “Content Providers cannot predict how their data will be re-purposed”

Page 6: Web Crawling - Considerations

Robots.txt

Facebook lawsuit against API Harvester

“No Crawling without written approval” in Mint.com Terms of Use

What if the web had as many crawlers as Apache Web Servers?

Page 7: Apache Nutch – What is it?

Apache Nutch Project – nutch.apache.org

– Hadoop + Web Crawler + Lucene

A Hadoop-based web crawler? How does that work?

Page 8: Apache Nutch Overview

Seeds and Crawl Filters

Crawl Depths

Fetch Lists and Partitioning

Segments - Segment Reading using Hadoop

Indexing / Lucene

Web Application for Querying

Page 9: Apache Nutch - Web Application

Page 10: Crawl Lifecycle

[Diagram: the crawl cycle - Inject → Generate → Fetch → CrawlDB Update, repeated to the configured depth; then LinkDB, Index, Dedup and Merge]

Page 11: Single Process Web Crawling

Page 12: Single Process Web Crawling

- Create the seed file and copy it into a “urls” directory

- Export JAVA_HOME

- Edit the conf/crawl-urlfilter.txt regex to constrain the crawl (usually by domain)

- Edit conf/nutch-site.xml and specify an http.agent.name (examples of both edits follow this list)

- bin/nutch crawl urls -dir crawl -depth 2
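
For the two conf edits above, a hedged illustration; the agent name and the example.com domain are placeholders, not values from the talk. In conf/nutch-site.xml:

  <property>
    <name>http.agent.name</name>
    <value>MyTestCrawler</value>
  </property>

And in conf/crawl-urlfilter.txt, a rule that keeps the crawl inside a single domain:

  +^http://([a-z0-9]*\.)*example.com/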

D E M O

Page 13: Distributed Web Crawling

Page 14: Distributed Web Crawling

- The Nutch distribution is overkill if you already have a Hadoop cluster. It's also not how you really integrate with Hadoop these days, but there is some history to consider. The Nutch wiki has a distributed setup guide.

- Why orchestrate your crawl?

- How?

– Create the seed file and copy it into a “urls” directory, then copy the directory up to HDFS (see the example after this list)

– Edit the conf/crawl-urlfilter.txt regex to constrain the crawl (usually by domain)

– Copy conf/nutch-site.xml, conf/nutch-default.xml, conf/nutch-conf.xml and conf/crawl-urlfilter.txt to the Hadoop conf directory

– Restart Hadoop so the new files are picked up in the classpath
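
For the seed-list step above, a hedged example of the copy up to HDFS; the local and HDFS paths are illustrative:

bin/hadoop fs -put urls urls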

Page 15: Distributed Web Crawling

- Code Review: org.apache.nutch.crawl.Crawl

- Orchestrated Crawl Example (Step 1 - Inject):

bin/hadoop jar nutch-1.2.0.job org.apache.nutch.crawl.Injector crawl/crawldb urls
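
The later phases can be orchestrated the same way. A hedged sketch of the remaining steps, assuming the Nutch 1.2 class names and argument order used by the bin/nutch script; <segment> stands for the timestamped directory the Generator creates under crawl/segments, and one Generate/Fetch/Parse/Update round runs per crawl depth before link inversion:

bin/hadoop jar nutch-1.2.0.job org.apache.nutch.crawl.Generator crawl/crawldb crawl/segments

bin/hadoop jar nutch-1.2.0.job org.apache.nutch.fetcher.Fetcher crawl/segments/<segment>

bin/hadoop jar nutch-1.2.0.job org.apache.nutch.parse.ParseSegment crawl/segments/<segment>

bin/hadoop jar nutch-1.2.0.job org.apache.nutch.crawl.CrawlDb crawl/crawldb crawl/segments/<segment>

bin/hadoop jar nutch-1.2.0.job org.apache.nutch.crawl.LinkDb crawl/linkdb crawl/segments/<segment>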

D E M O

Page 16: Segment Reading

Page 17: Segment Readers

The SegmentReader class is not all that useful. But here it is anyway:

– bin/nutch readseg -list crawl/segments/20110128170617

– bin/nutch readseg -dump crawl/segments/20110128170617 dumpdir

What you really want to do is process each crawled page in MapReduce as an individual record:

– SequenceFileInputFormatters over Nutch HDFS segments FTW

– The RecordReader returns Content objects as the values

Code Walkthrough
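
A minimal sketch of such a segment-reading job, assuming Nutch 1.2 and the classic org.apache.hadoop.mapred API; the class names, job name and per-page logic are illustrative, not the code walked through in the talk:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.nutch.protocol.Content;

// Runs over a segment's content directory, e.g. crawl/segments/20110128170617/content,
// and processes one fetched page per map() call.
public class SegmentContentReader {

  // Each input record is one crawled page: key = URL, value = Nutch Content object.
  public static class PageMapper extends MapReduceBase
      implements Mapper<Text, Content, Text, Text> {
    public void map(Text url, Content content,
                    OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
      // Trivial per-page processing: emit each page's content type.
      out.collect(url, new Text(content.getContentType()));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf job = new JobConf(SegmentContentReader.class);
    job.setJobName("read-nutch-segment");
    // SequenceFileInputFormat reads the MapFiles under <segment>/content as (Text, Content) records.
    job.setInputFormat(SequenceFileInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // segment content dir
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir
    job.setMapperClass(PageMapper.class);
    job.setNumReduceTasks(0);                               // map-only: one output line per page
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    JobClient.runJob(job);
  }
}

The map() body is the place to put whatever per-page filtering or extraction the analysis needs.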

D E M O

Page 18: Thanks

Questions?

Steve Watt - [email protected]

Twitter: @wattsteve

Blog: stevewatt.blogspot.com

austinhug.blogspot.com