Large Scale Crawling with Apache Nutch and Friends
Transcript of Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and friends...
Julien Nioche, [email protected]
LUCENE/SOLR REVOLUTION EU 2013
I'll be talking about large-scale document processing and, more specifically, about Behemoth, an open-source project based on Hadoop.
About myself
DigitalPebble Ltd, Bristol (UK)
Specialised in Text Engineering
Web Crawling
Natural Language Processing
Information Retrieval
Machine Learning
Strong focus on Open Source & Apache ecosystem
VP Apache Nutch
User | Contributor | Committer:
Tika
SOLR, Lucene
GATE, UIMA
Mahout
Behemoth
A few words about myself just before I start... What I mean by Text Engineering is a variety of activities ranging from... What makes the identity of DigitalPebble is... The main projects I am involved in are...
Outline
Installation and setup
Main steps
Nutch 2.x
Future developments
Nutch?
Distributed framework for large scale web crawling (but does not have to be large scale at all)
Based on Apache Hadoop
Apache TLP since May 2010
Indexing and Search delegated to SOLR
Note that I mention crawling and not web search: Nutch is not used only for search. It used to do indexing and search using Lucene but now delegates this to SOLR.
A bit of history
2002/2003 : Started by Doug Cutting & Mike Cafarella
2005 : MapReduce implementation in Nutch
2006 : Hadoop sub-project of Lucene @Apache
2006/7 : Parser and MimeType in Tika
2008 : Tika sub-project of Lucene @Apache
May 2010 : TLP project at Apache
Sept 2010 : Storage abstraction in Nutch 2.x
2012 : Gora TLP @Apache
Recent Releases
[Timeline: releases from 06/09 to 06/13; 1.0, 1.1, 1.2, 1.3, 1.4, 1.5.1, 1.6, 1.7 on the 1.x branch; 2.0, 2.1, 2.2.1 on the 2.x branch; plus trunk]
Why use Nutch?
Features: index with SOLR / ES / CloudSearch
PageRank implementation
Loads of existing plugins
Can easily be extended / customised
Usual reasons: open source with a business-friendly license, mature, community, ...
Scalability: tried and tested on very large scale
Standard Hadoop
Use cases
Crawl for search: generic or vertical
Index and search with SOLR et al.
Single node to large clusters on Cloud
but also: Data Mining
NLP (e.g. Sentiment Analysis)
ML
Use Behemoth as glueware (https://github.com/DigitalPebble/behemoth) with MAHOUT / UIMA / GATE
Customer cases
Specificity (Verticality)
Size
BetterJobs.com (CareerBuilder): single server
Aggregates content from job portals
Extracts and normalizes structure (description, requirements, locations)
~2M pages total
Feeds SOLR index
SimilarPages.com: large cluster on Amazon EC2 (up to 400 nodes)
Fetched & parsed 3 billion pages
10+ billion pages in crawlDB (~100TB data)
200+ million lists of similarities
No indexing / search involved
CommonCrawl
http://commoncrawl.org/
Open repository of web crawl data
2012 dataset : 3.83 billion docs
ARC files on Amazon S3
Using Nutch 1.7 with a few modifications to the Nutch code (https://github.com/Aloisius/nutch)
Next release imminent
Installation
http://nutch.apache.org/downloads.html
1.7 => src and bin distributions
2.2.1 => src only
'ant clean runtime'
runtime/local => local mode (test and debug)
runtime/deploy => job jar for Hadoop + scripts
Binary distribution for 1.x == runtime/local
Configuration and resources
Changes in $NUTCH_HOME/conf need recompiling with 'ant runtime'
Local mode => can be made directly in runtime/local/conf
Specify configuration in nutch-site.xml
Leave nutch-default.xml alone!
At least :
http.agent.name WhateverNameDescribesMyMightyCrawler
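Put together, a minimal nutch-site.xml might look like the sketch below; only the http.agent.name property is shown, using the placeholder value from the slide:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Mandatory: identify your crawler to the sites it fetches -->
  <property>
    <name>http.agent.name</name>
    <value>WhateverNameDescribesMyMightyCrawler</value>
  </property>
</configuration>
```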
Running it!
bin/crawl script : typical sequence of steps
bin/nutch : individual Nutch commands (inject / generate / fetch / parse / updatedb ...)
Local mode : great for testing and debugging
Recommended : deploy + Hadoop (pseudo-)distributed mode for parallelism
MapReduce UI to monitor crawl, check logs, counters
[Screenshot: monitoring a crawl with the MapReduce UI, counters and logs]
Typical Nutch Steps
Inject: populates CrawlDB from seed list
Generate: selects URLs to fetch in a segment
Fetch: fetches URLs from segment
Parse: parses content (text + metadata)
UpdateDB: updates CrawlDB (new URLs, new status...)
InvertLinks: builds WebGraph
Index: sends docs to [SOLR | ES | CloudSearch | ]
Sequence of batch operations
Or use the all-in-one crawl script
Repeat steps 2 to 7
Same in 1.x and 2.x
Main steps in Nutch; more actions available; shell wrappers around Hadoop commands.
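Run by hand, one cycle of the batch sequence above might look like the sketch below; the crawl/ paths and the SOLR URL are placeholders, and exact command arguments vary by Nutch version (check bin/nutch usage):

```
bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments
s=`ls -d crawl/segments/2* | tail -1`
bin/nutch fetch $s
bin/nutch parse $s
bin/nutch updatedb crawl/crawldb $s
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch solrindex http://localhost:8983/solr crawl/crawldb -linkdb crawl/linkdb $s
```

The bin/crawl script wraps essentially this loop, repeating the generate-to-index steps.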
Main steps from a data perspective
Seed List => CrawlDB
Segment:
/crawl_generate/
/crawl_fetch/
/content/
/crawl_parse/
/parse_data/
/parse_text/
LinkDB
Frontier expansion
Manual discovery: adding new URLs by hand, seeding
Automatic discovery of new resources (frontier expansion): not all outlinks are equally useful - control
Requires content parsing and link extraction
[Diagram: frontier expanding from the seed over iterations i = 1, 2, 3]
[Slide courtesy of A. Bialecki]
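The expansion loop can be sketched in plain Java. The link graph below is made up for illustration, and a real crawl would bound the frontier with scoring and URL filters rather than a bare round limit:

```java
import java.util.*;

public class FrontierExpansion {
    // Toy link graph: page -> outlinks (hypothetical URLs, not from the talk)
    static final Map<String, List<String>> LINKS = Map.of(
        "http://seed.example/", List.of("http://a.example/", "http://b.example/"),
        "http://a.example/", List.of("http://c.example/"),
        "http://b.example/", List.of("http://c.example/", "http://d.example/"),
        "http://c.example/", List.of(),
        "http://d.example/", List.of());

    // Breadth-first expansion from the seeds, up to maxRounds generate/fetch cycles
    static Set<String> crawl(List<String> seeds, int maxRounds) {
        Set<String> crawlDb = new LinkedHashSet<>(seeds);   // all discovered URLs
        List<String> frontier = new ArrayList<>(seeds);     // URLs to fetch this round
        for (int i = 0; i < maxRounds && !frontier.isEmpty(); i++) {
            List<String> next = new ArrayList<>();
            for (String url : frontier) {
                // "parse" the fetched page and extract its outlinks
                for (String out : LINKS.getOrDefault(url, List.of())) {
                    if (crawlDb.add(out)) {   // newly discovered -> schedule next round
                        next.add(out);
                    }
                }
            }
            frontier = next;
        }
        return crawlDb;
    }

    public static void main(String[] args) {
        System.out.println(crawl(List.of("http://seed.example/"), 1).size()); // seed + direct outlinks
        System.out.println(crawl(List.of("http://seed.example/"), 2).size()); // plus round-2 discoveries
    }
}
```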
An extensible framework
Endpoints:
Protocol
Parser
HtmlParseFilter (a.k.a. ParseFilter in Nutch 2.x)
ScoringFilter (used in various places)
URLFilter (ditto)
URLNormalizer (ditto)
IndexingFilter
IndexWriter (NEW IN 1.7!)
Plugins: activated with parameter 'plugin.includes'
Implement one or more endpoints
Endpoints are called in various places: URL filters and normalisers in a lot of places; same for scoring filters.
Features
Fetcher: multi-threaded
Queues URLs per hostname / domain / IP
Limits the number of URLs per round of fetching
Default values are polite but can be made more aggressive
Crawl strategy: breadth-first but can be depth-first
Configurable via custom ScoringFilters
Scoring: OPIC (On-line Page Importance Calculation) by default
LinkRank
Fetcher: multithreaded but polite
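The per-host queueing can be sketched as below. This is not Nutch's actual Fetcher code; queuesByHost and the per-host cap are illustrative names for the idea of grouping URLs by hostname and limiting how many each host contributes to one fetch round:

```java
import java.net.URI;
import java.util.*;

public class FetchQueues {
    // Partition URLs into per-host queues, as a polite fetcher would,
    // capping how many URLs each host contributes to a single round.
    static Map<String, List<String>> queuesByHost(List<String> urls, int maxPerHost) {
        Map<String, List<String>> queues = new LinkedHashMap<>();
        for (String url : urls) {
            String host = URI.create(url).getHost();
            List<String> q = queues.computeIfAbsent(host, h -> new ArrayList<>());
            if (q.size() < maxPerHost) q.add(url);   // overflow waits for a later round
        }
        return queues;
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
            "http://a.example/1", "http://a.example/2", "http://a.example/3",
            "http://b.example/1");
        Map<String, List<String>> q = queuesByHost(urls, 2);
        System.out.println(q.get("a.example").size()); // capped at 2
        System.out.println(q.get("b.example").size()); // 1
    }
}
```

Queueing per domain or per IP works the same way, just with a different key function.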
Features (cont.)
Protocols: http, file, ftp, https
Respects robots.txt directives
Scheduling: fixed or adaptive
URL filters: regex, FSA, TLD, prefix, suffix
URL normalisers: default, regex
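A regex URL filter of this kind can be sketched in plain Java. The ordered include/exclude, first-match-wins convention mirrors Nutch's regex-urlfilter.txt, but the class and the rules below are made up for illustration:

```java
import java.util.List;
import java.util.regex.Pattern;

public class RegexUrlFilter {
    // Ordered include/exclude rules; the first matching rule decides.
    record Rule(boolean include, Pattern pattern) {}

    static final List<Rule> RULES = List.of(
        new Rule(false, Pattern.compile("\\.(gif|jpg|png|css|js)$")), // skip images/assets
        new Rule(false, Pattern.compile("[?*!@=]")),                  // skip dynamic URLs
        new Rule(true,  Pattern.compile("^https?://")));              // accept the rest

    /** Returns the URL if accepted, or null if it is filtered out. */
    static String filter(String url) {
        for (Rule r : RULES) {
            if (r.pattern().matcher(url).find()) {
                return r.include() ? url : null;
            }
        }
        return null; // no rule matched -> reject
    }

    public static void main(String[] args) {
        System.out.println(filter("http://example.com/page.html")); // accepted
        System.out.println(filter("http://example.com/logo.png"));  // null (rejected)
        System.out.println(filter("http://example.com/a?b=c"));     // null (rejected)
    }
}
```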
Features (cont.)
Other plugins: CreativeCommons
Feeds
Language Identification
Rel tags
Arbitrary Metadata
Pluggable indexing: SOLR | ES etc...
Parsing with Apache Tika: hundreds of formats supported
But some legacy parsers as well
Indexing
Apache SOLR: schema.xml in conf/
SOLR 3.4
JIRA issue for SOLRCloud: https://issues.apache.org/jira/browse/NUTCH-1377
ElasticSearch: version 0.90.1
AWS CloudSearch: WIP, https://issues.apache.org/jira/browse/NUTCH-1517
Easy to build your own: text, DB, etc...
Typical Nutch document
Some of the fields (IndexingFilters in plugins or core code):
url
content
title
anchor
site
boost
digest
segment
host
type
Configurable ones: meta tags (keywords, description etc...)
arbitrary metadata
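A document carrying these fields, posted to SOLR, might look like the sketch below; all the values are made up, and the exact field set depends on which IndexingFilters are enabled:

```xml
<add>
  <doc>
    <field name="url">http://example.com/page.html</field>
    <field name="title">Example page</field>
    <field name="content">Plain text extracted by the parser...</field>
    <field name="host">example.com</field>
    <field name="site">example.com</field>
    <field name="type">text/html</field>
    <field name="boost">1.0</field>
  </doc>
</add>
```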
NUTCH 2.x
2.0 released in July 2012
2.2.1 in July 2013
Common features with 1.x: MapReduce, Tika, delegation to SOLR, etc...
Moved to a 'big table'-like architecture: wealth of NoSQL projects in the last few years
Abstraction over the storage layer: Apache GORA
Apache GORA
http://gora.apache.org/
ORM for NoSQL databases, with limited SQL support and file-based storage
Serialization with Apache AVRO
Object-to-datastore mappings (backend-specific)
DataStore implementations
Current version 0.3
Accumulo
Cassandra
HBase
Avro
DynamoDB
SQL (broken)
AVRO Schema => Java code
{"name": "WebPage", "type": "record", "namespace": "org.apache.nutch.storage",
 "fields": [
   {"name": "baseUrl", "type": ["null", "string"]},
   {"name": "status", "type": "int"},
   {"name": "fetchTime", "type": "long"},
   {"name": "prevFetchTime", "type": "long"},
   {"name": "fetchInterval", "type": "int"},
   {"name": "retriesSinceFetch", "type": "int"},
   {"name": "modifiedTime", "type": "long"},
   {"name": "protocolStatus", "type": {
     "name": "ProtocolStatus", "type": "record", "namespace": "org.apache.nutch.storage",
     "fields": [
       {"name": "code", "type": "int"},
       {"name": "args", "type": {"type": "array", "items": "string"}},
       {"name": "lastModified", "type": "long"}
     ]
   }}
 ]
}
Mapping file (backend-specific, e.g. HBase)
DataStore operations
Basic operations:
get(K key)
put(K key, T obj)
delete(K key)
Querying:
execute(Query query) => Result
deleteByQuery(Query query)
Wrappers for Apache Hadoop:
GoraInput|OutputFormat
GoraRecordReader|Writer
GoraMapper|Reducer
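The shape of that contract can be illustrated with a toy map-backed store. This is not GORA's actual implementation (which serializes to a real backend via AVRO); it only mimics the get/put/delete/execute surface, with a Predicate standing in for a Query:

```java
import java.util.*;
import java.util.function.Predicate;

// Hypothetical in-memory illustration of a DataStore-style contract.
public class MiniDataStore<K, T> {
    private final Map<K, T> rows = new LinkedHashMap<>();

    public T get(K key)           { return rows.get(key); }
    public void put(K key, T obj) { rows.put(key, obj); }
    public boolean delete(K key)  { return rows.remove(key) != null; }

    // Stand-in for execute(Query): scan all rows, keep the matches
    public List<T> execute(Predicate<T> query) {
        List<T> result = new ArrayList<>();
        for (T row : rows.values()) if (query.test(row)) result.add(row);
        return result;
    }

    public static void main(String[] args) {
        MiniDataStore<String, Integer> store = new MiniDataStore<>();
        store.put("com.example/page1", 200);  // key: reversed-host style, value: HTTP status
        store.put("com.example/page2", 404);
        System.out.println(store.get("com.example/page1"));       // 200
        System.out.println(store.execute(s -> s == 404).size());  // 1
        store.delete("com.example/page2");
        System.out.println(store.execute(s -> s == 404).size());  // 0
    }
}
```

GORA's Hadoop wrappers then expose such a store as MapReduce input and output, one record per key.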
GORA in Nutch
AVRO schema provided and Java code pre-generated
Mapping files provided for backends; can be modified if necessary
Need to rebuild to get dependencies for the backend, hence the source-only distribution of Nutch 2.x
http://wiki.apache.org/nutch/Nutch2Tutorial
What does this mean for Nutch?
Benefits
Storage still distributed and replicated
but one big table: status, metadata, content, text in one place
no more segments
Resume-able fetch and parse steps
Easier interaction with other resources: third-party code just needs to use GORA and the schema
Simplify the Nutch code
Potentially faster (e.g. update step)
What does this mean for Nutch?
Drawbacks
More stuff to install and configure; higher hardware requirements
Current performance :-( see http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html
N2 + HBase : 2.7x slower than 1.x
N2 + Cassandra : 4.4x slower than 1.x
due mostly to the GORA layer : not inherent to HBase or Cassandra
https://issues.apache.org/jira/browse/GORA-119 filtered scans
Not all backends provide data locality!
Not as stable as Nutch 1.x
2.x Work in progress
Stabilise backend implementations (GORA-HBase most reliable)
Synchronize features with 1.x, e.g. missing LinkRank equivalent (GSoC 2013: use Apache Giraph)
No pluggable indexers yet (NUTCH-1568)
Filter-enabled scans (GORA-119) => no need to de-serialize the whole dataset
Future
New functionalities: support for SOLRCloud
Sitemap (from CrawlerCommons library)
Canonical tag
Generic deduplication (NUTCH-656)
1.x and 2.x to coexist in parallel; 2.x not yet a replacement for 1.x
Move to the new MapReduce API: use Nutch on Hadoop 2.x
More delegation
Great deal done in recent years (SOLR, Tika)
Share code with crawler-commons (http://code.google.com/p/crawler-commons/): fetcher / protocol handling
URL normalisation / filtering
Move PageRank-like computations to a graph library (Apache Giraph)
Should be more efficient + less code to maintain
Longer term
Hadoop 2.x & YARN
Convergence of batch and streaming: Storm / Samza / Storm-YARN / ...
End of 100% batch operations? Fetch and parse as streaming?
Always be fetching
Generate / update / pagerank remain batch
See https://github.com/DigitalPebble/storm-crawler
Where to find out more?
Project page : http://nutch.apache.org/
Wiki : http://wiki.apache.org/nutch/
Mailing lists : [email protected]
Chapter in 'Hadoop: The Definitive Guide' (T. White); understanding Hadoop is essential anyway...
Support / consulting : http://wiki.apache.org/nutch/Support
Questions?