Large Scale Crawling with Apache Nutch and Friends


This is the title of a presentation

Large Scale Crawling with Apache Nutch and friends...

Julien [email protected]

LUCENE/SOLR REVOLUTION EU 2013

I'll be talking about large scale web crawling and more specifically about Apache Nutch, which is an open source project based on Hadoop

About myself

DigitalPebble Ltd, Bristol (UK)

Specialised in Text Engineering

Web Crawling

Natural Language Processing

Information Retrieval

Machine Learning

Strong focus on Open Source & Apache ecosystem

VP Apache Nutch

User | Contributor | Committer

Tika

SOLR, Lucene

GATE, UIMA

Mahout

Behemoth

A few words about myself just before I start... What I mean by Text Engineering is a variety of activities ranging from web crawling to machine learning. What makes the identity of DigitalPebble is its strong focus on open source and the Apache ecosystem. The main projects I am involved in are listed here.

Outline

Overview

Installation and setup

Main steps

Nutch 2.x

Future developments

Nutch?

Distributed framework for large scale web crawling (but does not have to be large scale at all)

Based on Apache Hadoop

Apache TLP since May 2010

Indexing and Search by Apache SOLR

Note that I mention crawling and not web search: Nutch is used not only for search. It used to do indexing and search using Lucene, but now delegates this to SOLR.

A bit of history

2002/2003 : Started by Doug Cutting & Mike Cafarella

2005 : MapReduce implementation in Nutch

2006 : Hadoop sub-project of Lucene @Apache

2006/7 : Parser and MimeType in Tika

2008 : Tika sub-project of Lucene @Apache

May 2010 : TLP project at Apache

Sept 2010 : Storage abstraction in Nutch 2.x

2012 : Gora TLP @Apache

Recent Releases

[Timeline : releases from 06/09 to 06/13 — 1.0, 1.1, 1.2, 1.3, 1.4, 1.5.1, 1.6, 1.7 on the 1.x branch; 2.0, 2.1, 2.2.1 on the 2.x branch; plus trunk]

Why use Nutch?

Features

Index with SOLR / ES / CloudSearch

PageRank implementation

Loads of existing plugins

Can easily be extended / customised

Usual reasons

Open source with a business-friendly license, mature, community, ...

Scalability

Tried and tested on very large scale

Standard Hadoop


Use cases

Crawl for search

Generic or vertical

Index and Search with SOLR et al.

Single node to large clusters on Cloud

but also

Data Mining

NLP (e.g. Sentiment Analysis)

ML

Use Behemoth as glueware
(https://github.com/DigitalPebble/behemoth)

with MAHOUT / UIMA / GATE

Customer cases

Specificity (Verticality)

Size

BetterJobs.com (CareerBuilder)

Single server

Aggregates content from job portals

Extracts and normalizes structure (description, requirements, locations)

~2M pages total

Feeds SOLR index

SimilarPages.com

Large cluster on Amazon EC2 (up to 400 nodes)

Fetched & parsed 3 billion pages

10+ billion pages in crawlDB (~100TB data)

200+ million lists of similarities

No indexing / search involved

CommonCrawl

http://commoncrawl.org/

Open repository of web crawl data

2012 dataset : 3.83 billion docs

ARC files on Amazon S3

Using Nutch 1.7

A few modifications to Nutch code : https://github.com/Aloisius/nutch

Next release imminent

Outline

Overview

Installation and setup

Main steps

Nutch 2.x

Future developments

Installation

http://nutch.apache.org/downloads.html

1.7 => src and bin distributions

2.2.1 => src only

'ant clean runtime'

runtime/local => local mode (test and debug)

runtime/deploy => job jar for Hadoop + scripts

Binary distribution for 1.x == runtime/local
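
As a sketch, assuming the 1.7 source distribution (file names are illustrative for other versions), a build session looks like this:

  # download and unpack the source distribution
  tar xzf apache-nutch-1.7-src.tar.gz
  cd apache-nutch-1.7
  # build both runtimes
  ant clean runtime
  # runtime/local  => local mode (test and debug)
  # runtime/deploy => job jar for Hadoop + scripts
  ls runtime/local runtime/deploy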

Configuration and resources

Changes in $NUTCH_HOME/conf

Need recompiling with 'ant runtime'

Local mode => can be made directly in runtime/local/conf

Specify configuration in nutch-site.xml

Leave nutch-default.xml alone!

At least :

http.agent.name WhateverNameDescribesMyMightyCrawler
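
For illustration, a minimal nutch-site.xml overriding just that mandatory property (Nutch uses Hadoop-style XML configuration; the agent value is yours to choose):

  <?xml version="1.0"?>
  <configuration>
    <!-- identifies your crawler to the sites it fetches; mandatory -->
    <property>
      <name>http.agent.name</name>
      <value>WhateverNameDescribesMyMightyCrawler</value>
    </property>
  </configuration>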

Running it!

bin/crawl script : typical sequence of steps

bin/nutch : individual Nutch commands

Inject / generate / fetch / parse / update ...

Local mode : great for testing and debugging

Recommended : deploy + Hadoop (pseudo-)distributed mode => parallelism

MapReduce UI to monitor crawl, check logs, counters
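
As an illustration, a whole crawl in local mode can be driven by the all-in-one script; the argument list below matches the 1.7-era script but may differ between releases, so check the script's usage message:

  # seed dir, crawl dir, SOLR URL, number of rounds
  bin/crawl urls/ crawl/ http://localhost:8983/solr/ 2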

Monitor Crawl with MapReduce UI

Counters and logs

Outline

Overview

Installation and setup

Main steps

Nutch 2.x

Future developments

Typical Nutch Steps

Inject : populates CrawlDB from seed list

Generate : selects URLs to fetch in a segment

Fetch : fetches URLs from the segment

Parse : parses content (text + metadata)

UpdateDB : updates CrawlDB (new URLs, new status...)

InvertLinks : builds the webgraph (LinkDB)

Index : sends docs to [SOLR | ES | CloudSearch | ...]

Sequence of batch operations

Or use the all-in-one crawl script

Repeat steps 2 to 7

Same in 1.x and 2.x

Main steps in Nutch. More actions are available. The bin scripts are shell wrappers around Hadoop commands.
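
A sketch of one round issued by hand with the individual commands (local mode, 1.x; the directory names are conventional rather than mandated, and options should be checked against bin/nutch's usage output):

  bin/nutch inject crawl/crawldb urls                        # 1. seed the CrawlDB
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000 # 2. select URLs into a new segment
  s=`ls -d crawl/segments/2* | tail -1`                      #    newest segment
  bin/nutch fetch $s                                         # 3. fetch its URLs
  bin/nutch parse $s                                         # 4. parse content
  bin/nutch updatedb crawl/crawldb $s                        # 5. update the CrawlDB
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments     # 6. build the LinkDB
  bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb $s  # 7. index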

Main steps from a data perspective

[Diagram : Seed List feeds the CrawlDB; each Generate/Fetch/Parse round produces a Segment; InvertLinks builds the LinkDB]

A segment contains :

/crawl_generate/

/crawl_fetch/

/content/

/crawl_parse/

/parse_data/

/parse_text/


Frontier expansion

Manual discovery

Adding new URLs by hand, seeding

Automatic discovery of new resources (frontier expansion)

Not all outlinks are equally useful => control

Requires content parsing and link extraction

[Diagram : frontier expansion from the seed through iterations i = 1, 2, 3]

[Slide courtesy of A. Bialecki]

An extensible framework

Endpoints

Protocol

Parser

HtmlParseFilter (a.k.a ParseFilter in Nutch 2.x)

ScoringFilter (used in various places)

URLFilter (ditto)

URLNormalizer (ditto)

IndexingFilter

IndexWriter (NEW IN 1.7!)

Plugins

Activated with parameter 'plugin.includes'

Implement one or more endpoints

Endpoints are called in various places: URL filters and normalisers in a lot of places, same for scoring filters.
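
To make the endpoint idea concrete, here is a minimal sketch of a URLFilter: it returns the URL to keep it, or null to reject it. The class and package names are made up for illustration; a real plugin would also need a plugin.xml descriptor and an entry in 'plugin.includes'.

  // Hypothetical example plugin, for illustration only
  package com.example.nutch;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.net.URLFilter;

  public class NoPdfURLFilter implements URLFilter {

    private Configuration conf;

    // keep everything except URLs ending in .pdf
    @Override
    public String filter(String urlString) {
      return urlString.toLowerCase().endsWith(".pdf") ? null : urlString;
    }

    @Override
    public void setConf(Configuration conf) { this.conf = conf; }

    @Override
    public Configuration getConf() { return conf; }
  }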

Features

Fetcher

Multi-threaded fetcher

Queues URLs per hostname / domain / IP

Limits the number of URLs per round of fetching

Default values are polite but can be made more aggressive

Crawl Strategy

Breadth-first but can be depth-first

Configurable via custom ScoringFilters

Scoring

OPIC (On-line Page Importance Calculation) by default

LinkRank

The fetcher is multithreaded but polite.
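
For example, politeness is tuned through properties such as these in nutch-site.xml (names taken from the 1.x nutch-default.xml; verify against your version):

  <property>
    <name>fetcher.server.delay</name>
    <value>5.0</value> <!-- seconds between requests to the same queue -->
  </property>
  <property>
    <name>fetcher.threads.fetch</name>
    <value>10</value> <!-- total number of fetcher threads -->
  </property>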

Features (cont.)

Protocols

Http, file, ftp, https

Respects robots.txt directives

Scheduling

Fixed or adaptive

URL filters

Regex, FSA, TLD, prefix, suffix

URL normalisers

Default, regex

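The regex URL filter, for instance, reads its rules from conf/regex-urlfilter.txt: each line starts with '+' (accept) or '-' (reject) followed by a regex, and the first match wins. A hedged sample (the domain is illustrative):

  # skip common binary extensions
  -\.(gif|jpg|png|zip|gz)$
  # restrict the crawl to a single domain
  +^https?://([a-z0-9-]+\.)*example\.com/
  # reject everything else
  -.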

Features (cont.)

Other plugins

CreativeCommons

Feeds

Language Identification

Rel tags

Arbitrary Metadata

Pluggable indexing

SOLR | ES etc...

Parsing with Apache Tika

Hundreds of formats supported

But some legacy parsers as well

Indexing

Apache SOLR

schema.xml in conf/

SOLR 3.4

JIRA issue for SOLRCloud : https://issues.apache.org/jira/browse/NUTCH-1377

ElasticSearch

Version 0.90.1

AWS CloudSearch

WIP : https://issues.apache.org/jira/browse/NUTCH-1517

Easy to build your own

Text, DB, etc...
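
As a sketch of what "build your own" means: an IndexWriter implementation receives documents and sends them wherever you like. The method set below is reconstructed from the 1.7 interface (open / write / delete / update / commit / close / describe), so treat the exact signatures and the getFieldValue accessor as assumptions and check org.apache.nutch.indexer.IndexWriter in your tree.

  // Hypothetical backend that dumps documents to stdout, illustration only
  public class StdoutIndexWriter implements org.apache.nutch.indexer.IndexWriter {
    private org.apache.hadoop.conf.Configuration conf;

    public void open(org.apache.hadoop.mapred.JobConf job, String name) {}

    public void write(org.apache.nutch.indexer.NutchDocument doc) {
      // assumed accessor: first value of a named field
      System.out.println(doc.getFieldValue("url") + "\t" + doc.getFieldValue("title"));
    }

    public void delete(String key) {}
    public void update(org.apache.nutch.indexer.NutchDocument doc) {}
    public void commit() {}
    public void close() {}
    public String describe() { return "StdoutIndexWriter : prints documents"; }

    public void setConf(org.apache.hadoop.conf.Configuration conf) { this.conf = conf; }
    public org.apache.hadoop.conf.Configuration getConf() { return conf; }
  }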

Typical Nutch document

Some of the fields (IndexingFilters in plugins or core code) :

url

content

title

anchor

site

boost

digest

segment

host

type

Configurable ones

meta tags (keywords, description etc...)

arbitrary metadata
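
For reference, the field definitions in the schema.xml shipped in conf/ look roughly like this (types and attributes simplified here for illustration):

  <field name="url"     type="string" stored="true"  indexed="true"/>
  <field name="content" type="text"   stored="false" indexed="true"/>
  <field name="title"   type="text"   stored="true"  indexed="true"/>
  <field name="host"    type="string" stored="false" indexed="true"/>
  <field name="digest"  type="string" stored="true"  indexed="false"/>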

Outline

Overview

Installation and setup

Main steps

Nutch 2.x

Future developments

NUTCH 2.x

2.0 released in July 2012

2.2.1 in July 2013

Common features with 1.x

MapReduce, Tika, delegation to SOLR, etc...

Moved to 'big table'-like architecture

Wealth of NoSQL projects in last few years

Abstraction over storage layer : Apache GORA

Apache GORA

http://gora.apache.org/

ORM for NoSQL databases

and limited SQL support + file based storage

Serialization with Apache AVRO

Object-to-datastore mappings (backend-specific)

DataStore implementations

Current version 0.3

Accumulo

Cassandra

HBase

Avro

DynamoDB

SQL (broken)

AVRO Schema => Java code

{"name": "WebPage", "type": "record", "namespace": "org.apache.nutch.storage", "fields": [ {"name": "baseUrl", "type": ["null", "string"] }, {"name": "status", "type": "int"}, {"name": "fetchTime", "type": "long"}, {"name": "prevFetchTime", "type": "long"}, {"name": "fetchInterval", "type": "int"}, {"name": "retriesSinceFetch", "type": "int"}, {"name": "modifiedTime", "type": "long"}, {"name": "protocolStatus", "type": { "name": "ProtocolStatus", "type": "record", "namespace": "org.apache.nutch.storage", "fields": [ {"name": "code", "type": "int"}, {"name": "args", "type": {"type": "array", "items": "string"}}, {"name": "lastModified", "type": "long"} ] }},[]

Mapping file (backend-specific, e.g. HBase)

DataStore operations

Basic operations

get(K key)

put(K key, T obj)

delete(K key)

Querying

execute(Query query) => Result

deleteByQuery(Query query)

Wrappers for Apache Hadoop

GoraInput|OutputFormat

GoraRecordReader|Writer

GoraMapper|Reducer
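
A minimal sketch of the operations above against Nutch's generated WebPage class; the DataStoreFactory call reflects Gora 0.3-era code and the reversed-URL key convention used by Nutch 2.x, both of which should be verified locally:

  import org.apache.gora.store.DataStore;
  import org.apache.gora.store.DataStoreFactory;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.storage.WebPage;

  public class GoraSketch {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // backend chosen via gora.properties; factory signature assumed
      DataStore<String, WebPage> store =
          DataStoreFactory.getDataStore(String.class, WebPage.class, conf);

      WebPage page = new WebPage();            // generated from the AVRO schema
      page.setStatus(1);
      store.put("com.example:http/", page);    // Nutch 2.x keys are reversed URLs
      WebPage back = store.get("com.example:http/");
      store.delete("com.example:http/");
      store.flush();
      store.close();
    }
  }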

GORA in Nutch

AVRO schema provided and Java code pre-generated

Mapping files provided for backends

can be modified if necessary

Need to rebuild to get dependencies for backend

hence source-only distribution of Nutch 2.x

http://wiki.apache.org/nutch/Nutch2Tutorial

What does this mean for Nutch?

Benefits

Storage still distributed and replicated

but one big table : status, metadata, content, text in one place

no more segments

Resume-able fetch and parse steps

Easier interaction with other resources

Third-party code just needs to use GORA and the schema

Simplify the Nutch code

Potentially faster (e.g. update step)

What does this mean for Nutch?

Drawbacks

More stuff to install and configure

Higher hardware requirements

Current performance :-(

http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html

N2+HBase : 2.7x slower than 1.x

N2+Cassandra : 4.4x slower than 1.x

due mostly to the GORA layer : not inherent to HBase or Cassandra

https://issues.apache.org/jira/browse/GORA-119 : filtered scans

Not all backends provide data locality!

Not as stable as Nutch 1.x

2.x Work in progress

Stabilise backend implementations

GORA-HBase most reliable

Synchronize features with 1.x

e.g. missing LinkRank equivalent (GSoC 2013 : use Apache Giraph)

No pluggable indexers yet (NUTCH-1568)

Filter-enabled scans (GORA-119)

=> no need to de-serialize the whole dataset

Outline

Overview

Installation and setup

Main steps

Nutch 2.x

Future developments

Future

New functionalities

Support for SOLRCloud

Sitemap (from CrawlerCommons library)

Canonical tag

Generic deduplication (NUTCH-656)

1.x and 2.x to coexist in parallel

2.x not yet a replacement for 1.x

Move to new MapReduce API

Use Nutch on Hadoop 2.x

More delegation

Great deal done in recent years (SOLR, Tika)

Share code with crawler-commons (http://code.google.com/p/crawler-commons/)

Fetcher / protocol handling

URL normalisation / filtering

Delegate PageRank-like computations to a graph library

Apache Giraph

Should be more efficient + less code to maintain

Longer term

Hadoop 2.x & YARN

Convergence of batch and streaming

Storm / Samza / Storm-YARN / ...

End of 100% batch operations ?

Fetch and parse as streaming ?

Always be fetching

Generate / update / pagerank remain batch

See https://github.com/DigitalPebble/storm-crawler

Where to find out more?

Project page : http://nutch.apache.org/

Wiki : http://wiki.apache.org/nutch/

Mailing lists : [email protected]

[email protected]

Chapter in 'Hadoop: The Definitive Guide' (T. White)

Understanding Hadoop is essential anyway...

Support / consulting : http://wiki.apache.org/nutch/Support

Questions

?
