Hadoop ecosystem framework n hadoop in live environment

Hadoop ecosystem framework Hadoop in live environment

- Ashish Agrawal

http://in.linkedin.com/pub/ashish-agrawal/3/73a/a8b

Outline

Introduction to HADOOP & Distributed FileSystems

Architecture of Hadoop Ecosystem (Hbase/Pig) & setting up Hadoop Single/Multiple node cluster

Introduction to MapReduce & running sample programs on Hadoop

Hadoop ecosystem framework - Hadoop Hadoop ecosystem framework - Hadoop in live environmentin live environment

Hadoop Ecosystem

HDFS Map Reduce Hbase Pig Hive Mahout Zookeeper

HDFS Architecture

Map Reduce Flow

By Ricky Ho

HBase Architecture

Job Scheduler

CronJobs

Chain Map Recude

Azkaban By LinkedIn

Oozie by Yahoo!

Overview of Oozie Manage data processing jobs

Offers scalable data oriented service

Manages dependencies between jobs

Support job execution in topological order

Provides time & event driven triggering mechanism

Overview of Oozie

Supports map reduce, pig, filesystem, java applications, even map reduce streaming and pipes as action nodes

Action nodes are connected through dependency edges

Decision, fork and join nodes are used as flow control operations

Overview of Oozie

Actions and decisions depends upon properties of job, hadoop counters or file/directory status

A workflow application contains definition file for workflow, jar files, native and third party libraries, resource file and pig scripts

Oozie vs Azkaban

Oozie can be restarted from point of failure but azkaban does not

Oozie keeps flow in DB while azkaban keeps in memory

Azkaban fixes execution path before starting job while Oozie allows decision nodes to decide

Azkaban does not support event trigger Azkaban is used for simpler work flow

Chain MR Chains the multiple mapper classes in single

map task which saves lots of I/O The output of immediate previous mapper is fed

as input to current mapper The output of last mapper is written as task

output Supports passing key/value pairs to next maps

by reference to save [de]serialization time ChainReducer supports to chain multiple

mapper classes after reducer within reducer task

Oozie Flow

Start Map reduce Fork

MR Streaming

Pig

Join

Decision

MR Pipes

Java

FileSystemEnd

Performance Tuning Parameters

Network bandwidth – Gigabytes Nw Disk throughput – SCSI Drives Memory usage – ECC RAM CPU overhead for thread handling HDFS block size Max number of requests allowed in progress Per user file descriptors – needs to be set high Running the balancer

Performance Tuning Parameters

Sufficient space for temp directory Compressed data storage Speculative data execution Use of combiner function – Associative &

commulative Selection of Job scheduler : FIFO/Capacity/Fair Number of mappers : larger files are preferred Number of reducers : Slightly less than #nodes

Performance Tuning Parameters Compression of intermediate data from

Mappers sort size (io.sort.mb) – larger if mapper has to

write large data Sort factor (io.sort.factor) – set high for larger

jobs (#input files can be merged at once) mapred.reduce.parallel.copies - higher for large

jobs dfs.namenode.handler.count &

dfs.datanode.handler.count – high for large cluster

Tips

Use an appropriate MapReduce language Java : Speed, control and binary data. Working with

existing libraries. Pipes : Working with existing C++ libraries Streaming : Writing MR in scripting languages Dumbo (Python), happy(Jython), Wukong (Ruby) Pig, Hive, Cascading : For nested data, joins etc

Thumb Rule : Pure Java for large, recurring jobs, Hive for SQL style analysis and Pig/Streaming for ad-hoc analysis.

Tips

Few Larger files are preferred over many smaller files

Report Progress For CPU intensive job, increase the

mapred.task.timeout (default 10 mins) Use Distributed cache

To make data available to all mappers/reducers. For example keeping look up hash map

Used to make auxiliary jars available among mappers/reducers

Tips

Use SequenceFile and MapFile Splittable. Unlike other compressable format, they

are map reduce job friendly and each map gets an independent split to work on

Compressible. By using block compression you get the benefits of compression (use less disk space, faster to read and write), while keeping the file splittable still.

Compact. SequenceFiles are usually used with Hadoop Writable objects, which have a pretty compact format.

A MapFile is an indexed SequenceFile, useful for if you want to do look-ups by key.

Mahout (Machine learning library)

Collaborative Filtering User and Item based recommenders K-Means, Fuzzy K-Means clustering Mean Shift clustering Dirichlet process clustering Latent Dirichlet Allocation Singular value decomposition Parallel Frequent Pattern mining Complementary Naive Bayes classifier

Different minds Different interpretation

http://www.youtube.com/watch?v=9izUKE5bN0U

Hadoop in live environment Google Yahoo Amazon LinkedIn Facebook StumbleUpon Nokia Last.fm Clickable

@Google

Google uses it for

indexing the web

computing PageRank

processing geographic information in Google Maps

clustering news articles,

machine translation

Google Trends etc

@Google An Example :

403,152 TB (terabytes) data

394 machines were allocated

Completion time is 6 minutes and a half.

Google indexing system uses 20TB data

Bigtable (Hbase) is used for many Google

products such as Orkut, Finance etc.

Sawzall is used for massive log processing

@Yahoo!

The Two Quadrillionth Bit of π is 0! One of the largest computations took 23 days of

wall clock time and 503 years of CPU time on a 1000-node cluster

Yahoo! Has 4000 nodes in hadoop cluster

Following slides have been taken from opencirrus summit 2009

Open Cirrus Summit 2009

Hadoop is critical to Yahoo’s business

Ads Optimization

Content Optimization

Search Index

Content Feed Processing

Machine Learning

(e.g. Spam filters)

• When you visit yahoo, you are interacting with data processed with Hadoop!


Tremendous Impact on Productivity

• Makes Developers & Scientists more productive– Key computations solved in days and not months– Projects move from research to production in days– Easy to learn, even our rocket scientists use it!

• The major factors– You don’t need to find new hardware to experiment– You can work with all your data!– Production and research based on same framework– No need for R&D to do IT (it just works)

Open Cirrus Summit 2009 28

Search & Advertising SciencesHadoop Applications: Search Assist™

Before Hadoop After Hadoop

Time 26 days 20 minutes

Language C++ Python

Development Time 2-3 weeks 2-3 days

• Database for Search Assist™ is built using Hadoop. • 3 years of log-data• 20-steps of map-reduce


Largest Hadoop Clusters in the Universe

• 25,000+ nodes (~200,000 cores)– Clusters of up to 4,000 nodes

• 4 Tiers of clusters– Development, Testing and QA (~10%)– Proof of Concepts and Ad-Hoc work (~10%)

• Runs the latest version of Hadoop – currently 0.20

– Science and Research (~60%)• Runs more stable versions

– Production (~20%)• Currently Hadoop 0.18.3


Large Hadoop-Based Applications

2008 2009Webmap ~70 hours runtime

~300 TB shuffling~200 TB output1480 nodes

~73 hours runtime~490 TB shuffling~280 TB output2500 nodes

Sort benchmarks(Jim Gray contest)

1 Terabyte sorted•209 seconds•900 nodes

1 Terabyte sorted•62 seconds, 1500 nodes1 Petabyte sorted•16.25 hours, 3700 nodes

Largest cluster 2000 nodes•6PB raw disk•16TB of RAM•16K CPUs

4000 nodes•16PB raw disk•64TB of RAM•32K CPUs•(40% faster CPUs too)

@Facebook Claims to have the largest single Hadoop

cluster in the world Have multiple clusters at separate data

centers Largest warehouse cluster currently spans

3000 of machines Scan around 2 petabytes per day 300 people throughout the company query

this warehouse every month

@Facebook

Facebook ”messages” uses the Hbase in prod Collects click logs in near real time from web

servers and stream them directly into Hadoop clusters

Medium-term archiving of MySQL databases Fast backup and recovery from data stored in

Hadoop File System Reduces maintenance and deployment costs for

archiving petabyte size datasets.

@Nokia Started using hadoop in August 2009 in search

analytics team Started with 15 machines as part of cluster To analyse large scale search logs for various

analytics purposes Search relevance calculation Duplicate places handling, data cleaning Fuzzy query parsing and tagging for spelling

correction and lookahead suggestion model

@Clickable

Using Hbase, HDFS, Map reduce for various purposes such as data storage, analytics, reportings and recommendations

7 machines cluster for production

Used Hbase to address continous data updates from networks or any other user action at our end.

@Stumbleupon

Log early, log often, log everything

No piece of data is too small or too noisy to be used in future

Uses for apache log file processing and session analysis, spam detection

@Stumbleupon

Uses Scribe to collect data directly into HDFS where it is reviewed and processed by number of systems

Uses MR to extract data from logs for click counts

Uses for search index updates, thumbnail creation and recommendation systems

Questions?

Hadoop ecosystem framework n hadoop in live environment

Technology

Transcript of Hadoop ecosystem framework n hadoop in live environment