A quick introduction to Storm Crawler

Page 1: A quick introduction to Storm Crawler

A quick introduction to Storm Crawler

Julien Nioche [email protected]

@digitalpebble

ApacheCon EU 2014 - Budapest

Page 2: A quick introduction to Storm Crawler

About myself

DigitalPebble Ltd, Bristol (UK)

Specialised in Text Engineering
– Web Crawling
– Natural Language Processing
– Information Retrieval
– Machine Learning

Strong focus on Open Source & the Apache ecosystem

PMC Chair of Apache Nutch

User | Contributor | Committer
– Tika
– SOLR, Lucene
– GATE, UIMA
– Mahout
– Behemoth

Page 3: A quick introduction to Storm Crawler

What is it?

Collection of resources (SDK) for building web crawlers on Apache Storm

https://github.com/DigitalPebble/storm-crawler

Artefacts available from Maven Central

Apache License v2

– Scalable
– Low latency
– Easily extensible

Page 4: A quick introduction to Storm Crawler

What it is not

A ready-to-use, feature-complete, recursive web crawler
– Might be something like that as a separate project using S/C later

e.g. no PageRank or explicit ranking of pages
– Build your own

No fancy UI, dashboards, etc.
– Build your own

Page 5: A quick introduction to Storm Crawler

Comparison with Nutch

Nutch is batch-driven: little control over when URLs are fetched
– Potential issue for use cases that need sessions
– latency++

Fetching is only one of the steps in Nutch
– SC: 'always be fetching' (Ken Krugler); better use of resources

Makes it even more flexible
– Typical case: a few custom classes (at least a Topology), the rest are just dependencies and standard S/C components

Not as ready-to-use as Nutch: it's an SDK

Would not have existed without it
– Borrowed code and concepts

Page 6: A quick introduction to Storm Crawler

Overview of resources

https://www.flickr.com/photos/dipster1/1403240351/

Page 7: A quick introduction to Storm Crawler

FetcherBolt

Multi-threaded

Polite
– Puts incoming tuples into internal queues based on IP / domain / hostname
– Sets delay between requests from the same queue
– Respects robots.txt

Protocol-neutral
– Protocol implementations are pluggable
– HTTP implementation taken from Nutch

Output
– String URL
– byte[] content
– HashMap<String, String[]> metadata
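
As a rough sketch of how a downstream bolt might consume that output (the field names "url", "content", "metadata" and the pre-1.0 backtype.storm package names are assumptions here, not taken from the slides; check the declareOutputFields() of the FetcherBolt version you use):

```java
import java.util.HashMap;
import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Hypothetical downstream bolt consuming the FetcherBolt output described above.
public class ContentSizeBolt extends BaseRichBolt {

    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        // field names are an assumption; adjust to the actual declared fields
        String url = tuple.getStringByField("url");
        byte[] content = tuple.getBinaryByField("content");
        HashMap<String, String[]> metadata =
                (HashMap<String, String[]>) tuple.getValueByField("metadata");

        // example processing: record the fetched size as an extra metadata entry
        metadata.put("fetch.size", new String[] { String.valueOf(content.length) });

        collector.emit(tuple, new Values(url, content, metadata));
        collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("url", "content", "metadata"));
    }
}
```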

Page 8: A quick introduction to Storm Crawler

ParserBolt

Based on Apache Tika

Supports most commonly used doc formats
– HTML, PDF, DOC etc.

Calls ParseFilters on the document
– e.g. scrape info with XPathFilter

Calls URLFilters on outlinks
– e.g. normalize and/or blacklist URLs based on regular expressions

Output
– String URL
– byte[] content
– HashMap<String, String[]> metadata
– String text
– Set<String> outlinks
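
To make the outlink handling concrete, here is a small sketch of the kind of normalisation and blacklisting a URLFilter typically performs; the class and method are hypothetical and do not reproduce the actual storm-crawler URLFilter interface:

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.regex.Pattern;

// Hypothetical helper illustrating what a URLFilter does to outlinks;
// not the storm-crawler URLFilter interface itself.
public class OutlinkFilter {

    // blacklist outlinks pointing at binary/static resources
    private static final Pattern BLACKLIST =
            Pattern.compile("\\.(jpg|png|gif|css|js)$", Pattern.CASE_INSENSITIVE);

    /** Returns a normalized URL, or null if the outlink should be discarded. */
    public static String filter(String outlink) {
        try {
            URL u = new URL(outlink);
            if (!u.getProtocol().startsWith("http")) {
                return null; // keep only http(s) links
            }
            if (BLACKLIST.matcher(u.getPath()).find()) {
                return null; // matches the blacklist regexp
            }
            // simple normalisation: lowercase the host, drop the fragment
            String port = u.getPort() == -1 ? "" : ":" + u.getPort();
            String query = u.getQuery() == null ? "" : "?" + u.getQuery();
            return u.getProtocol() + "://" + u.getHost().toLowerCase()
                    + port + u.getPath() + query;
        } catch (MalformedURLException e) {
            return null; // unparsable outlink
        }
    }
}
```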

Page 9: A quick introduction to Storm Crawler

Other resources

ElasticSearchBolt
– Sends fields to ElasticSearch for indexing
– (deprecated by resources in elasticsearch-hadoop?)

URLPartitionerBolt
– Generates a key based on the hostname / domain / IP of the URL
– Output:
  • String URL
  • String key
  • String metadata
– Useful for fieldsGrouping
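
A minimal wiring sketch of that fieldsGrouping, assuming the partitioner declares a field named "key" and that the storm-crawler bolts live under com.digitalpebble.storm.crawler.bolt (package names vary across versions); the component ids and the spout class are made up:

```java
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

// assumed storm-crawler package; adjust to the release you use
import com.digitalpebble.storm.crawler.bolt.FetcherBolt;
import com.digitalpebble.storm.crawler.bolt.URLPartitionerBolt;

// Sketch: group on the partitioner's "key" field so that all URLs from the
// same host/domain/IP end up on the same FetcherBolt instance.
public class WiringSketch {

    public static TopologyBuilder build() {
        TopologyBuilder builder = new TopologyBuilder();

        builder.setSpout("spout", new SeedURLSpout()); // hypothetical spout emitting (url, metadata)
        builder.setBolt("partitioner", new URLPartitionerBolt())
               .shuffleGrouping("spout");
        builder.setBolt("fetcher", new FetcherBolt(), 4)
               .fieldsGrouping("partitioner", new Fields("key"));

        return builder;
    }
}
```

Grouping on the generated key rather than on the raw URL is what lets each FetcherBolt instance enforce politeness with its internal per-host queues.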

Page 10: A quick introduction to Storm Crawler

Other resources

ConfigurableTopology
– Overrides config with local YAML file
– Simple switch for running in local mode
– Abstract class to be extended

Simple Spouts (for testing)
– FileSpout / RandomURLSpout

Various Metrics-related stuff
– Including a MetricsConsumer for https://www.librato.com/

FetchQueue package
– BlockingURLSpout and ShardedQueue abstraction
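
The deck does not show FileSpout or RandomURLSpout themselves; the sketch below, assuming the (url, metadata) output convention used by the other components, illustrates what such a testing spout can look like. The class name and seed URLs are invented for the example:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

// Minimal testing spout in the spirit of FileSpout/RandomURLSpout:
// walks over a fixed list of seed URLs and emits (url, metadata) tuples.
public class SeedURLSpout extends BaseRichSpout {

    private SpoutOutputCollector collector;
    private List<String> seeds;
    private int index = 0;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        this.seeds = Arrays.asList("http://example.com/", "http://example.org/");
    }

    @Override
    public void nextTuple() {
        if (index >= seeds.size()) {
            return; // nothing left to emit
        }
        String url = seeds.get(index++);
        // use the URL as the message id so failed tuples can be replayed
        collector.emit(new Values(url, new HashMap<String, String[]>()), url);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("url", "metadata"));
    }
}
```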

Page 11: A quick introduction to Storm Crawler

Integrate it!

Write the Spout for your use case
– Will work fine with the existing resources as long as it emits a URL and metadata

Typical scenario
– Group URLs to fetch into separate external queues based on host or domain (AWS SQS, Apache Kafka)
– Write a Spout for it and throttle with topology.max.spout.pending (see the sketch below)
– So that politeness can be enforced without tuples timing out and failing
– Parse and extract
– Send new URLs to the queues

Can use various forms of persistence for URLs
– ElasticSearch, DynamoDB, HBase, etc.
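
A minimal sketch of the throttling part of this scenario, using standard Storm configuration setters; the values are illustrative, not recommendations:

```java
import backtype.storm.Config;

// Cap the number of in-flight tuples per spout task (topology.max.spout.pending)
// and give slow, polite fetches enough time before Storm replays them.
public class ThrottlingConfig {

    public static Config build() {
        Config conf = new Config();
        // at most 250 URLs in flight per spout task
        conf.setMaxSpoutPending(250);
        // allow up to 5 minutes before a tuple times out and is replayed
        conf.setMessageTimeoutSecs(300);
        return conf;
    }
}
```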

Page 12: A quick introduction to Storm Crawler

Some use cases (prototype stage)

Processing of streams of data (natural fit for Storm)
– http://www.weborama.com

Monitoring of a finite set of URLs
– http://www.ontopic.io (more on them later)
– http://www.shopstyle.com : scraping + indexing

One-off non-recursive crawling
– http://www.stolencamerafinder.com/ : scraping + indexing

Recursive crawler
– WIP

Page 13: A quick introduction to Storm Crawler

What's next?

All-in-one crawler project built on SC
– Also a good example of how to use SC

Additional Parse/URLFilters

More tests and documentation

A nice logo (this is an invitation)

A better name?

Page 14: A quick introduction to Storm Crawler

Questions

?
