Web Scale Crawling with Apache Nutch

Julien [email protected]

Berlin Buzzwords 08/06/11

2 / 30 DigitalPebble Ltd

Based in Bristol (UK)

Specialised in Text Engineering
– Web Crawling
– Natural Language Processing
– Information Retrieval
– Data Mining

Strong focus on Open Source & Apache ecosystem

User | Contributor | Committer
– Nutch, SOLR, Lucene
– Tika
– GATE, UIMA
– Mahout
– Behemoth

3 / 30 Outline

Overview
Features
Data Structures
Use cases
What's new in Nutch 1.3
Nutch 2.0
GORA
Conclusion

4 / 30 Nutch?

“Distributed framework for large scale web crawling”
– but does not have to be large scale at all
– or even on the web (file-protocol)

Based on Apache Hadoop

Indexing and Search

Open Source – Apache 2.0 License

5 / 30 Short history

2002/2003 : Started by Doug Cutting & Mike Cafarella

2004 : sub-project of Lucene @Apache

2005 : MapReduce implementation in Nutch

– 2006 : Hadoop sub-project of Lucene @Apache

2006/7 : Parser and MimeType in Tika

– 2008 : Tika sub-project of Lucene @Apache

May 2010 : Top-level project (TLP) at Apache

June 2011 (?) : Nutch 1.3

Q4 2011 (?) : Nutch 2.0

6 / 30 In a Nutch Shell (1.3)

Step by step :

1) Inject → populates CrawlDB from seed list

2) Generate → selects URLs to fetch in segment

3) Fetch → fetches URLs from segment

4) Parse → parses content (text + metadata)

5) UpdateDB → updates CrawlDB (new URLs, new status...)

6) InvertLinks → builds Webgraph

7) SOLRIndex → sends docs to SOLR

8) SOLRDedup → removes duplicate docs based on signature

Repeat steps 2 to 8

Or use the all-in-one 'nutch crawl' command
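
The core of the cycle above (inject, generate, fetch, parse, updatedb) can be sketched as a toy in-memory simulation. This is only an illustration under stated assumptions, not Nutch code: the `WEB` dictionary, the function names and the status strings are made up for the example, and fetching and parsing are collapsed into a dictionary lookup.

```python
# Toy, in-memory sketch of the inject / generate / fetch+parse / updatedb loop.
# WEB, the function names and the status strings are hypothetical.

WEB = {  # url -> outlinks found when the page is "parsed"
    "http://a.example/": ["http://a.example/1", "http://b.example/"],
    "http://a.example/1": ["http://a.example/"],
    "http://b.example/": [],
}

def inject(crawldb, seeds):
    """Populate the CrawlDB from the seed list."""
    for url in seeds:
        crawldb.setdefault(url, "unfetched")

def generate(crawldb, top_n):
    """Select up to top_n unfetched URLs: the fetch list of the new segment."""
    return sorted(u for u, s in crawldb.items() if s == "unfetched")[:top_n]

def fetch_and_parse(segment):
    """Return {url: outlinks}, with None for a URL that could not be fetched."""
    return {url: WEB.get(url) for url in segment}

def updatedb(crawldb, parsed):
    """Record the new status of fetched URLs and add newly discovered ones."""
    for url, outlinks in parsed.items():
        crawldb[url] = "fetched" if outlinks is not None else "failed"
        for link in outlinks or []:
            crawldb.setdefault(link, "unfetched")

def crawl(seeds, rounds, top_n=10):
    """Inject once, then repeat generate / fetch / parse / update."""
    crawldb = {}
    inject(crawldb, seeds)
    for _ in range(rounds):
        segment = generate(crawldb, top_n)
        if not segment:  # nothing left to fetch
            break
        updatedb(crawldb, fetch_and_parse(segment))
    return crawldb
```

Calling `crawl(["http://a.example/"], rounds=1)` leaves the two discovered URLs in the "unfetched" state, which is what the next generate step would pick up.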

7 / 30 Frontier expansion

Manual “discovery”
– Adding new URLs by hand, “seeding”

Automatic discovery of new resources (frontier expansion)
– Not all outlinks are equally useful, so control is needed
– Requires content parsing and link extraction

[Figure: the frontier growing outwards from the seed over iterations i = 1, 2, 3]

[Slide courtesy of A. Bialecki]
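
A minimal sketch of frontier expansion with a control on outlinks, assuming a hypothetical link graph `LINKS` in place of real fetching and parsing. The "keep only links on the seed's host" predicate is just one possible control, chosen for illustration.

```python
from urllib.parse import urlparse

# Hypothetical link graph standing in for parsed pages and their outlinks.
LINKS = {
    "http://site.example/": ["http://site.example/a", "http://ads.example/x"],
    "http://site.example/a": ["http://site.example/b", "http://site.example/"],
    "http://site.example/b": ["http://site.example/c"],
}

def same_host_as(seed):
    """Build a control predicate that keeps outlinks on the seed's host."""
    host = urlparse(seed).hostname
    return lambda url: urlparse(url).hostname == host

def expand(seed, iterations, keep):
    """Return the newly discovered URLs for each iteration i = 1..iterations."""
    seen = {seed}
    frontier = [seed]
    history = []
    for _ in range(iterations):
        next_frontier = []
        for url in frontier:
            for link in LINKS.get(url, []):
                # Skip URLs already known and those rejected by the control.
                if link not in seen and keep(link):
                    seen.add(link)
                    next_frontier.append(link)
        history.append(next_frontier)
        frontier = next_frontier
    return history
```

With the same-host predicate the link to `ads.example` is never followed; passing `lambda url: True` instead lets the frontier grow across hosts.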

9 / 30 An extensible framework

Endpoints
– Protocol
– Parser
– HtmlParseFilter
– ScoringFilter (used in various places)
– URLFilter (ditto)
– URLNormalizer (ditto)
– IndexingFilter

Plugins
– Activated with parameter 'plugin.includes'
– Implement one or more endpoints
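
As an illustration of plugin activation, the sketch below treats 'plugin.includes' as a regular expression matched against plugin identifiers. The expression and the list of available plugins are shortened, illustrative values, and the whole-string matching rule is an assumption about how the parameter is interpreted.

```python
import re

# Illustrative value only; a real configuration usually lists more plugins.
PLUGIN_INCLUDES = r"protocol-http|urlfilter-regex|parse-(html|tika)|scoring-opic"

# Hypothetical set of plugins found on disk.
AVAILABLE = ["protocol-http", "protocol-ftp", "urlfilter-regex",
             "parse-html", "parse-tika", "scoring-opic", "scoring-link"]

def active_plugins(available, includes):
    """Keep the plugin ids that match the whole 'includes' expression."""
    pattern = re.compile(includes)
    return [pid for pid in available if pattern.fullmatch(pid)]
```

Here `active_plugins(AVAILABLE, PLUGIN_INCLUDES)` drops `protocol-ftp` and `scoring-link`, since neither id matches the expression.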

10 / 30 Features

Fetcher
– Multi-threaded fetcher
– Follows robots.txt
– Groups URLs per hostname / domain / IP
– Limits the number of URLs per round of fetching
– Default values are polite but can be made more aggressive

Crawl Strategy
– Breadth-first but can be depth-first
– Configurable via custom scoring plugins

Scoring
– OPIC (On-line Page Importance Calculation) by default
– LinkRank
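
The fetcher's grouping of URLs per hostname and the per-round limit can be sketched as follows. This is a simplified illustration, not the Nutch implementation: the function name and the two limits (`max_per_host`, `max_total`) are hypothetical parameters.

```python
from collections import defaultdict
from urllib.parse import urlparse

def build_fetch_queues(urls, max_per_host, max_total):
    """Group URLs by hostname, capping each host and the round as a whole."""
    queues = defaultdict(list)
    total = 0
    for url in urls:
        if total >= max_total:  # the round is full
            break
        host = urlparse(url).hostname
        if len(queues[host]) < max_per_host:  # this host still has room
            queues[host].append(url)
            total += 1
    return dict(queues)
```

A polite fetcher would then take one URL at a time from each queue and wait between two requests to the same host; raising the limits or shortening that wait is what makes a crawl more aggressive.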

11 / 30 Features (cont.)

Protocols
– Http, file, ftp, https

Scheduling
– Specified or adaptive

URL filters
– Regex, FSA, TLD domain, prefix, suffix

URL normalisers
– Default, regex
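
A regex URL filter can be sketched as an ordered list of accept / reject rules. The rule syntax below ('+' accepts, '-' rejects, first matching rule wins, unmatched URLs are rejected) and the rules themselves are assumptions made for this example rather than a copy of any shipped configuration.

```python
import re

# Illustrative rules: '+' accepts, '-' rejects, '#' starts a comment.
RULES = r"""
# skip some binary resources
-\.(gif|jpg|png|zip)$
# skip URLs with query strings
-[?=]
# accept anything else under example.org
+^https?://([a-z0-9-]+\.)*example\.org/
"""

def parse_rules(text):
    """Turn the rule text into a list of (accept?, compiled pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line[0] not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules

def accept(url, rules):
    """First matching rule decides; a URL matching no rule is rejected."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False
```

Because the first match wins, the order of the rules matters: the reject rules are listed before the broad accept rule.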

12 / 30 Features (cont.)

Other plugins
– CreativeCommons
– Feeds
– Language Identification
– Rel tags
– Arbitrary Metadata

Indexing to SOLR
– Bespoke schema

Parsing with Apache Tika
– But some legacy parsers as well

14 / 30 Data Structures

MapReduce jobs => I/O : Hadoop [Sequence|Map]Files

CrawlDB => status of known pages

CrawlDB
MapFile : <Text,CrawlDatum>
    byte status; [fetched? Unfetched? Failed? Redir?]
    long fetchTime;
    byte retries;
    int fetchInterval;
    float score = 1.0f;
    byte[] signature = null;
    long modifiedTime;
    org.apache.hadoop.io.MapWritable metaData;

Input of : generate - index
Output of : inject - update
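
To make the CrawlDB record more concrete, here is a rough Python rendering of the CrawlDatum fields listed above, with a plain dictionary keyed by URL standing in for the MapFile. The numeric status codes, the time units and the `mark_fetched` helper are assumptions introduced for the example.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical status codes; the slide only names the states.
UNFETCHED, FETCHED, FAILED, REDIRECTED = 1, 2, 3, 4

@dataclass
class CrawlDatum:
    status: int = UNFETCHED
    fetch_time: int = 0          # when the page is next due (epoch ms, assumed)
    retries: int = 0
    fetch_interval: int = 0      # assumed to be in seconds
    score: float = 1.0
    signature: Optional[bytes] = None
    modified_time: int = 0
    meta_data: dict = field(default_factory=dict)

def mark_fetched(crawldb, url, now_ms, signature):
    """Update one entry after a successful fetch and schedule the next one."""
    datum = crawldb[url]
    datum.status = FETCHED
    datum.signature = signature
    datum.retries = 0
    datum.fetch_time = now_ms + datum.fetch_interval * 1000
    return datum
```

For example, `crawldb = {"http://example.org/": CrawlDatum(fetch_interval=3600)}` followed by `mark_fetched(crawldb, "http://example.org/", 1000, b"\x01")` marks the page as fetched and makes it due again one hour later.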

15 / 30 Data Structures 2

Segment
/crawl_generate/ → SequenceFile<Text,CrawlDatum>
/crawl_fetch/ → MapFile<Text,CrawlDatum>
/content/ → MapFile<Text,Content>
/crawl_parse/ → SequenceFile<Text,CrawlDatum>
/parse_data/ → MapFile<Text,ParseData>
/parse_text/ → MapFile<Text,ParseText>

Segment => round of fetching
Identified by a timestamp

Can have multiple versions of a page in different segments

16 / 30 Data Structures 3

LinkDB
MapFile : <Text,Inlinks>
Inlinks : HashSet <Inlink>
Inlink :
    String fromUrl
    String anchor

Output of : invertlinks
Input of : SOLRIndex

linkDB => storage for Web Graph
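
Inverting links amounts to turning "page → outlinks" into "page → inlinks". A minimal sketch, assuming the parsed outlinks are available as a dictionary of `(to_url, anchor)` pairs; the function name and the input shape are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple

class Inlink(NamedTuple):
    from_url: str
    anchor: str

def invert_links(outlinks):
    """Map {from_url: [(to_url, anchor), ...]} to {to_url: {Inlink, ...}}."""
    linkdb = defaultdict(set)
    for from_url, links in outlinks.items():
        for to_url, anchor in links:
            linkdb[to_url].add(Inlink(from_url, anchor))
    return dict(linkdb)
```

Storing the inlinks as a set mirrors the HashSet above: the same (source, anchor) pair is only kept once per target page.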

18 / 30 Use cases

Crawl for Search Systems
– Web wide or vertical
– Single node to large clusters
– Legacy Lucene-based search or SOLR

… but not necessarily
– NLP (e.g. Sentiment Analysis)
– ML, Classification / Clustering
– Data Mining
– MAHOUT / UIMA / GATE
– Use Behemoth as glueware (http://github.com/jnioche/behemoth)

SimilarPages.com
– Large cluster on Amazon EC2 (up to 400 nodes)
– Fetched & parsed 3 billion pages
– 10+ billion pages in crawlDB (~100TB data)
– 200+ million lists of similarities
– No indexing / search involved

20 / 30 NUTCH 1.3

Transition between 1.x and 2.0

http://svn.apache.org/repos/asf/nutch/branches/branch-1.3/

1.3-RC3 => imminent

Removed Lucene-based indexing and search webapp

– delegate indexing / search remotely to SOLR

– change of focus : “Web search application” → “Crawler”

Removed deprecated parse plugins

– delegate most parsing to Tika

Separate local / distributed runtimes

Ivy-based dependency management

21 / 30 NUTCH 2.0

Became trunk in 2010

Same features as 1.3
– delegation to SOLR, TIKA, etc...

Moved to table-based architecture
– Wealth of NoSQL projects in last 2 years

Preliminary version known as NutchBase (Doğacan Güney)

Moved storage layer to subproject in incubator → GORA

22 / 30 GORA

http://incubator.apache.org/gora/

ORM for NoSQL databases
– and limited SQL support

Serialization with Apache AVRO

Object-to-datastore mappings (backend-specific)

Backend implementations
– HBase
– Cassandra
– SQL
– Memory

0.1 released in April 2011

23 / 30 GORA (cont.)

Atomic operations
– Get
– Put
– Delete

Querying
– Execute
– deleteByQuery

Wrappers for Apache Hadoop
– GORAInput|OutputFormat
– GORAMapper|Reducer
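
The atomic and query operations listed above can be pictured with a toy in-memory store. This is not the GORA API: the class, the method signatures and the key-range form of the query are assumptions chosen to keep the sketch short.

```python
class MemoryStore:
    """Toy key/value store; method names loosely mirror the operations above."""

    def __init__(self):
        self._rows = {}

    def put(self, key, obj):
        self._rows[key] = obj

    def get(self, key):
        """Return the stored object, or None if the key is unknown."""
        return self._rows.get(key)

    def delete(self, key):
        """Remove one row; return True if it existed."""
        if key in self._rows:
            del self._rows[key]
            return True
        return False

    def execute(self, start_key=None, end_key=None):
        """Return (key, obj) pairs with start_key <= key <= end_key, in key order."""
        return [(k, self._rows[k]) for k in sorted(self._rows)
                if (start_key is None or k >= start_key)
                and (end_key is None or k <= end_key)]

    def delete_by_query(self, start_key=None, end_key=None):
        """Remove every row selected by the key range; return how many."""
        keys = [k for k, _ in self.execute(start_key, end_key)]
        for k in keys:
            del self._rows[k]
        return len(keys)
```

Updating a page then means a single `put` on its row, rather than rewriting a whole file-based structure.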

24 / 30 Benefits for Nutch

Storage still distributed and replicated

but one big table
– status, metadata, content, text → one place

Simplified logic in Nutch
– Simpler code for updating / merging information

More efficient
– No need to read / write entire structure to update records
– e.g. update step in 1.x

Easier interaction with other resources
– Third-party code just needs to use GORA and schema

25 / 30 Status Nutch 2.0

Beta stage

– debugging / testing required

Compare performance of GORA backends

Need to update documentation / WIKI

Enthusiasm from community

GORA – next great project coming out of Nutch?

26 / 30 Future

Delegate code to crawler-commons (http://code.google.com/p/crawler-commons/)
– Fetcher / protocol handling
– Robots.txt parsing
– URL normalisation / filtering

New functionalities
– Sitemap
– Canonical tag
– More indexers (e.g. ElasticSearch) + pluggable indexers?

Definitive move to 2.0?
– Contribute backends and functionalities to GORA

28 / 30 Where to find out more?

Project page : http://nutch.apache.org/
Wiki : http://wiki.apache.org/nutch/
Mailing lists :
– [email protected]
– [email protected]

Chapter in 'Hadoop: The Definitive Guide' (T. White)
– Understanding Hadoop is essential anyway...

Support / consulting :
– http://wiki.apache.org/nutch/Support
– [email protected]

29 / 30 Questions?
