Crawl the entire web in 10 minutes...and just 100€
-
Upload
danny-linden -
Category
Technology
-
view
460 -
download
0
Transcript of Crawl the entire web in 10 minutes...and just 100€
Crawl the entire web
in 10 minutes...
Copyright ©: 2015 OnPage.org GmbH
Using AWS-EMR, AWS-S3, PIG, CommonCrawl
...and just 100 €
Since 2011 in Munich
Work at OnPage.org
Interested in Webcrawling and BigData Frameworks
Build low cost scalable BigData solutions
About Me
Twitter: @danny_munich
Facebook: https://www.facebook.com/danny.linden2
E-mail: [email protected]
Do you want to build your own Search-
Engine?
- High Hardware / Cloud Costs
- Nutch needs ~ 1 Hour for 1 million URLs
- You want to crawl > 1 Billion URLs
Don‘t Crawl!
- Use Common-Crawl : https://commoncrawl.org
- Non-Profit-Organisation
- ~Monthly over 2 Billions Crawled URLs
- Over 1.000 TB total since 2009
- URL seeding list from Blekko: https://blekko.com
Don‘t Crawl! – Use Common Crawl!
- Scalably stored on Amazon AWS S3
- Hadoop compatible format powered by Archive.org (Wayback Machine)
- Partitionable with S3 Object Prefix possibility
- 100MB-1GB file Sizes (gzip) -> Hadoop size
Choose the right format
- WARC (Raw HTML): 1.000 MB
- WAT (Meta data as JSON) : 450 MB
- WET (Plain Text): 150 MB
Processing
- Pure Hadoop with MapReduce
- Input Classes: http://commoncrawl.org/the-data/get-started/
Processing
- High Level ETL-Layer like PIG: http://pig.apache.org
- Example Stuff :
- https://github.com/norvigaward/warcexamples
- https://github.com/mortardata/mortar-examples
- https://github.com/matpalm/common-crawl
PIG Example
REGISTER file:/home/hadoop/lib/pig/piggybank.jar
DEFINE FileLoaderClass org.commoncrawl.pig.ArcLoader();
%default INPUT_PATH "s3://aws-publicdatasets/common-crawl/crawl-002/2010/09/25/0/1285398*.arc.gz";
-- %default INPUT_PATH "s3://aws-publicdatasets/common-crawl/crawl-002/2010/09/";
%default OUTPUT_PATH "s3://example-bucket/out";
pages = LOAD '$INPUT_PATH'
USING FileLoaderClass
AS (url, html);
meta_titles = FOREACH pages GENERATE url, REGEX_EXTRACT(html, '<title>(.*)</title>', 1) AS meta_title;
filtered = FILTER meta_titles BY meta_title IS NOT NULL;
STORE filtered INTO '$OUTPUT_PATH' USING PigStorage('\t');
Hadoop & PIG on AWS
- Support new Hadoop releases
- PIG Integration
- Replace HDFS with S3
- Easy UI to start quickly
- Pay per Hour to scale as much as posible
That‘s it!
Customer:
Twitter: @danny_munich
Facebook: https://www.facebook.com/danny.linden2
E-mail: [email protected]
And: We are hiring!
https://de.onpage.org/about/jobs/