Hadoop ecosystem framework and Hadoop in live environment
Date posted: 13-Sep-2014
Category: Technology
![Page 1: Hadoop ecosystem framework n hadoop in live environment](https://reader033.fdocuments.net/reader033/viewer/2022061104/54138c708d7f7284698b4659/html5/thumbnails/1.jpg)
Hadoop ecosystem framework Hadoop in live environment
- Ashish Agrawal
![Page 2: Hadoop ecosystem framework n hadoop in live environment](https://reader033.fdocuments.net/reader033/viewer/2022061104/54138c708d7f7284698b4659/html5/thumbnails/2.jpg)
Outline
Introduction to HADOOP & Distributed FileSystems
Architecture of Hadoop Ecosystem (Hbase/Pig) & setting up Hadoop Single/Multiple node cluster
Introduction to MapReduce & running sample programs on Hadoop
Hadoop ecosystem framework - Hadoop in live environment
![Page 3: Hadoop ecosystem framework n hadoop in live environment](https://reader033.fdocuments.net/reader033/viewer/2022061104/54138c708d7f7284698b4659/html5/thumbnails/3.jpg)
Hadoop Ecosystem
HDFS, MapReduce, HBase, Pig, Hive, Mahout, ZooKeeper
![Page 4: Hadoop ecosystem framework n hadoop in live environment](https://reader033.fdocuments.net/reader033/viewer/2022061104/54138c708d7f7284698b4659/html5/thumbnails/4.jpg)
HDFS Architecture
![Page 5: Hadoop ecosystem framework n hadoop in live environment](https://reader033.fdocuments.net/reader033/viewer/2022061104/54138c708d7f7284698b4659/html5/thumbnails/5.jpg)
Map Reduce Flow
By Ricky Ho
![Page 6: Hadoop ecosystem framework n hadoop in live environment](https://reader033.fdocuments.net/reader033/viewer/2022061104/54138c708d7f7284698b4659/html5/thumbnails/6.jpg)
HBase Architecture
![Page 7: Hadoop ecosystem framework n hadoop in live environment](https://reader033.fdocuments.net/reader033/viewer/2022061104/54138c708d7f7284698b4659/html5/thumbnails/7.jpg)
Job Scheduler
CronJobs
Chain MapReduce
Azkaban By LinkedIn
Oozie by Yahoo!
![Page 8: Hadoop ecosystem framework n hadoop in live environment](https://reader033.fdocuments.net/reader033/viewer/2022061104/54138c708d7f7284698b4659/html5/thumbnails/8.jpg)
Overview of Oozie
Manages data processing jobs
Offers a scalable, data-oriented service
Manages dependencies between jobs
Supports job execution in topological order
Provides time- and event-driven triggering mechanisms
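To illustrate what "job execution in topological order" means, here is a minimal Python sketch. It is not Oozie's API (Oozie workflows are defined in XML); the job names and the dependency map are hypothetical and exist only for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical jobs: each key maps to the set of jobs it depends on.
dependencies = {
    "clean_logs": set(),
    "sessionize": {"clean_logs"},
    "click_counts": {"clean_logs"},
    "report": {"sessionize", "click_counts"},
}


def execution_order(deps):
    """Return one valid order in which every job runs after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Any order produced this way starts with `clean_logs` and ends with `report`; the two middle jobs are independent of each other, so a scheduler is free to run them in parallel.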
![Page 9: Hadoop ecosystem framework n hadoop in live environment](https://reader033.fdocuments.net/reader033/viewer/2022061104/54138c708d7f7284698b4659/html5/thumbnails/9.jpg)
Overview of Oozie
Supports map reduce, pig, filesystem, java applications, even map reduce streaming and pipes as action nodes
Action nodes are connected through dependency edges
Decision, fork and join nodes are used as flow control operations
![Page 10: Hadoop ecosystem framework n hadoop in live environment](https://reader033.fdocuments.net/reader033/viewer/2022061104/54138c708d7f7284698b4659/html5/thumbnails/10.jpg)
Overview of Oozie
Actions and decisions depend on job properties, Hadoop counters or file/directory status
A workflow application contains the workflow definition file, jar files, native and third-party libraries, resource files and Pig scripts
![Page 11: Hadoop ecosystem framework n hadoop in live environment](https://reader033.fdocuments.net/reader033/viewer/2022061104/54138c708d7f7284698b4659/html5/thumbnails/11.jpg)
Oozie vs Azkaban
An Oozie workflow can be restarted from the point of failure, but an Azkaban flow cannot
Oozie keeps the flow state in a database, while Azkaban keeps it in memory
Azkaban fixes the execution path before starting a job, while Oozie lets decision nodes choose it at run time
Azkaban does not support event triggers
Azkaban is used for simpler workflows
![Page 12: Hadoop ecosystem framework n hadoop in live environment](https://reader033.fdocuments.net/reader033/viewer/2022061104/54138c708d7f7284698b4659/html5/thumbnails/12.jpg)
Chain MR
Chains multiple mapper classes in a single map task, which saves a lot of I/O
The output of the immediately preceding mapper is fed as input to the current mapper
The output of the last mapper is written as the task output
Supports passing key/value pairs to the next mapper by reference to save [de]serialization time
ChainReducer supports chaining multiple mapper classes after the reducer, within the reducer task
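The chaining idea can be sketched in plain Python. This is a conceptual model, not Hadoop's ChainMapper API: the three mapper functions and the sample records are hypothetical, and the point is only that intermediate pairs are handed from one mapper to the next in memory, with nothing written between stages.

```python
def tokenize(key, line):
    """First mapper: split a line into (word, 1) pairs."""
    for word in line.split():
        yield word, 1


def lowercase(key, value):
    """Second mapper: normalise the key."""
    yield key.lower(), value


def drop_stop_words(key, value):
    """Third mapper: filter out a few hypothetical stop words."""
    if key not in {"the", "a", "of"}:
        yield key, value


def run_chain(mappers, records):
    """Feed each record through the mappers in order, inside one task."""
    for key, value in records:
        pairs = [(key, value)]
        for mapper in mappers:
            # Output of the previous mapper is the input of the current one.
            pairs = [out for k, v in pairs for out in mapper(k, v)]
        # Only the last mapper's output leaves the task.
        yield from pairs


records = [(0, "The cat"), (8, "a Hat of the cat")]
chained = list(run_chain([tokenize, lowercase, drop_stop_words], records))
print(chained)  # [('cat', 1), ('hat', 1), ('cat', 1)]
```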
![Page 13: Hadoop ecosystem framework n hadoop in live environment](https://reader033.fdocuments.net/reader033/viewer/2022061104/54138c708d7f7284698b4659/html5/thumbnails/13.jpg)
Oozie Flow
[Figure: example Oozie workflow with the nodes Start, Map reduce, Fork, MR Streaming, Pig, Join, Decision, MR Pipes, Java, FileSystem and End]
![Page 14: Hadoop ecosystem framework n hadoop in live environment](https://reader033.fdocuments.net/reader033/viewer/2022061104/54138c708d7f7284698b4659/html5/thumbnails/14.jpg)
Performance Tuning Parameters
Network bandwidth: Gigabit network
Disk throughput: SCSI drives
Memory usage: ECC RAM
CPU overhead for thread handling
HDFS block size
Max number of requests allowed in progress
Per-user file descriptors: need to be set high
Running the balancer
![Page 15: Hadoop ecosystem framework n hadoop in live environment](https://reader033.fdocuments.net/reader033/viewer/2022061104/54138c708d7f7284698b4659/html5/thumbnails/15.jpg)
Performance Tuning Parameters
Sufficient space for the temp directory
Compressed data storage
Speculative execution
Use of a combiner function: must be associative and commutative
Selection of job scheduler: FIFO/Capacity/Fair
Number of mappers: larger files are preferred
Number of reducers: slightly fewer than the number of nodes
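The combiner requirement can be made concrete with a small Python simulation of word count. It is a sketch, not Hadoop code: the two input splits are made up, and `combine` stands in for a combiner that applies the same summation as the reducer. Because addition is associative and commutative, pre-aggregating on each mapper changes how many pairs are shuffled but not the final result.

```python
from collections import Counter


def map_words(line):
    """Mapper: emit (word, 1) for every word in the split."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Combiner: sum counts locally on one mapper's output."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


def reduce_counts(all_pairs):
    """Reducer: sum counts across all mappers."""
    totals = Counter()
    for word, count in all_pairs:
        totals[word] += count
    return dict(totals)


splits = ["a b a a", "b a b"]          # hypothetical input splits
raw = [map_words(s) for s in splits]
combined = [combine(p) for p in raw]

shuffled_without = sum(len(p) for p in raw)        # pairs sent with no combiner
shuffled_with = sum(len(p) for p in combined)      # pairs sent with a combiner
result_without = reduce_counts(pair for p in raw for pair in p)
result_with = reduce_counts(pair for p in combined for pair in p)

print(shuffled_without, shuffled_with)  # 7 4
print(result_with == result_without)    # True
```

An operation such as averaging does not have this property, so it cannot be used as a combiner without restructuring (for example, by carrying sums and counts separately).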
![Page 16: Hadoop ecosystem framework n hadoop in live environment](https://reader033.fdocuments.net/reader033/viewer/2022061104/54138c708d7f7284698b4659/html5/thumbnails/16.jpg)
Performance Tuning Parameters
Compression of intermediate data from mappers
Sort buffer size (io.sort.mb): larger if the mapper has to write a large amount of data
Sort factor (io.sort.factor): set high for larger jobs (number of input files that can be merged at once)
mapred.reduce.parallel.copies: higher for large jobs
dfs.namenode.handler.count & dfs.datanode.handler.count: high for large clusters
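Hadoop reads such properties from `<configuration>` XML files. The following Python sketch renders a few of the properties named above into that layout. The values are illustrative assumptions, not recommendations, and `mapred.compress.map.output` is included on the assumption that it is the switch behind "compression of intermediate data" in the Hadoop versions these slides describe.

```python
import xml.etree.ElementTree as ET

# Illustrative values only: assumptions for the sketch, not tuning advice.
tuning = {
    "io.sort.mb": "200",
    "io.sort.factor": "50",
    "mapred.reduce.parallel.copies": "20",
    "mapred.compress.map.output": "true",
}


def to_site_xml(properties):
    """Render name/value pairs as a Hadoop-style <configuration> document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


site_xml = to_site_xml(tuning)
print(site_xml)
```

The `dfs.*` handler counts belong to the HDFS daemons rather than to MapReduce jobs, so they would normally live in a separate HDFS configuration file.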
![Page 17: Hadoop ecosystem framework n hadoop in live environment](https://reader033.fdocuments.net/reader033/viewer/2022061104/54138c708d7f7284698b4659/html5/thumbnails/17.jpg)
Tips
Use an appropriate MapReduce language
Java : Speed, control and binary data. Working with existing libraries.
Pipes : Working with existing C++ libraries
Streaming : Writing MR in scripting languages
Dumbo (Python), Happy (Jython), Wukong (Ruby)
Pig, Hive, Cascading : For nested data, joins etc.
Rule of thumb : Pure Java for large, recurring jobs, Hive for SQL-style analysis and Pig/Streaming for ad-hoc analysis.
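As a sketch of what the Streaming option looks like in a scripting language, the functions below implement word count over tab-separated text lines, which is the convention Hadoop Streaming uses between stages. The function names and sample input are hypothetical, and the last lines only simulate the sort-and-shuffle step locally.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum counts per word; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local simulation of map -> sort -> reduce on made-up input.
local_result = list(reducer(sorted(mapper(["a b a", "b c"]))))
print(local_result)  # ['a\t2', 'b\t2', 'c\t1']
```

In a real Streaming job, each function would be wrapped in its own script that reads `sys.stdin` and prints to standard output, and the framework would perform the sort between the two.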
![Page 18: Hadoop ecosystem framework n hadoop in live environment](https://reader033.fdocuments.net/reader033/viewer/2022061104/54138c708d7f7284698b4659/html5/thumbnails/18.jpg)
Tips
Few larger files are preferred over many smaller files
Report progress
For CPU-intensive jobs, increase mapred.task.timeout (default 10 minutes)
Use the distributed cache
To make data available to all mappers/reducers, for example a lookup hash map
To make auxiliary jars available to mappers/reducers
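The preference for a few large files follows from simple arithmetic. The sketch below assumes a 64 MB block size and an input format that creates one map task per block-sized split and never combines files; both are assumptions about the default setup, and the file sizes are made up.

```python
import math


def map_tasks(file_sizes_mb, block_size_mb=64):
    """Estimate map tasks: one per block-sized split, at least one per file."""
    return sum(max(1, math.ceil(size / block_size_mb)) for size in file_sizes_mb)


# The same 10,000 MB of input, stored two different ways.
many_small = map_tasks([1] * 10_000)   # 10,000 files of 1 MB each
few_large = map_tasks([1_000] * 10)    # 10 files of 1,000 MB each

print(many_small, few_large)  # 10000 160
```

Under these assumptions the small-file layout needs over sixty times as many map tasks for the same data, and each task pays its own startup and scheduling cost.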
![Page 19: Hadoop ecosystem framework n hadoop in live environment](https://reader033.fdocuments.net/reader033/viewer/2022061104/54138c708d7f7284698b4659/html5/thumbnails/19.jpg)
Tips
Use SequenceFile and MapFile
Splittable. Unlike many other compressed formats, they are MapReduce-friendly: each map gets an independent split to work on.
Compressible. With block compression you get the benefits of compression (less disk space, faster reads and writes) while keeping the file splittable.
Compact. SequenceFiles are usually used with Hadoop Writable objects, which have a compact format.
A MapFile is an indexed SequenceFile, useful if you want to do look-ups by key.
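The idea of an indexed, sorted file can be sketched in a few lines of Python. This is a conceptual model only, not the MapFile on-disk format or API: `TinyMapFile`, its `index_interval` parameter and the sample records are all hypothetical. Records are kept sorted by key, a sparse index remembers every few keys, and a lookup jumps to the nearest indexed position before scanning forward.

```python
from bisect import bisect_right


class TinyMapFile:
    """Sorted records plus a sparse key index (conceptual sketch)."""

    def __init__(self, records, index_interval=2):
        self.data = sorted(records)
        # Keep the key and position of every index_interval-th record.
        self.index = [
            (self.data[i][0], i) for i in range(0, len(self.data), index_interval)
        ]

    def get(self, key):
        index_keys = [k for k, _ in self.index]
        slot = bisect_right(index_keys, key) - 1
        if slot < 0:
            return None  # key sorts before the first record
        start = self.index[slot][1]
        # Scan forward from the indexed position until the key is passed.
        for k, v in self.data[start:]:
            if k == key:
                return v
            if k > key:
                break
        return None


mf = TinyMapFile([("d", 4), ("a", 1), ("c", 3), ("b", 2), ("e", 5)])
print(mf.get("d"), mf.get("z"))  # 4 None
```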
![Page 20: Hadoop ecosystem framework n hadoop in live environment](https://reader033.fdocuments.net/reader033/viewer/2022061104/54138c708d7f7284698b4659/html5/thumbnails/20.jpg)
Mahout (Machine learning library)
Collaborative Filtering
User and Item based recommenders
K-Means, Fuzzy K-Means clustering
Mean Shift clustering
Dirichlet process clustering
Latent Dirichlet Allocation
Singular value decomposition
Parallel Frequent Pattern mining
Complementary Naive Bayes classifier
![Page 21: Hadoop ecosystem framework n hadoop in live environment](https://reader033.fdocuments.net/reader033/viewer/2022061104/54138c708d7f7284698b4659/html5/thumbnails/21.jpg)
Different minds Different interpretation
http://www.youtube.com/watch?v=9izUKE5bN0U
![Page 22: Hadoop ecosystem framework n hadoop in live environment](https://reader033.fdocuments.net/reader033/viewer/2022061104/54138c708d7f7284698b4659/html5/thumbnails/22.jpg)
Hadoop in live environment
Google
Yahoo
Amazon
LinkedIn
Facebook
StumbleUpon
Nokia
Last.fm
Clickable
![Page 23: Hadoop ecosystem framework n hadoop in live environment](https://reader033.fdocuments.net/reader033/viewer/2022061104/54138c708d7f7284698b4659/html5/thumbnails/23.jpg)
Google uses MapReduce for
indexing the web
computing PageRank
processing geographic information in Google Maps
clustering news articles
machine translation
Google Trends etc.
![Page 24: Hadoop ecosystem framework n hadoop in live environment](https://reader033.fdocuments.net/reader033/viewer/2022061104/54138c708d7f7284698b4659/html5/thumbnails/24.jpg)
@Google
An example:
403,152 TB (terabytes) of data
394 machines were allocated
Completion time was six and a half minutes
The Google indexing system uses 20 TB of data
Bigtable (the design HBase is modeled on) is used for many Google products such as Orkut, Finance etc.
Sawzall is used for massive log processing
![Page 25: Hadoop ecosystem framework n hadoop in live environment](https://reader033.fdocuments.net/reader033/viewer/2022061104/54138c708d7f7284698b4659/html5/thumbnails/25.jpg)
@Yahoo!
The Two Quadrillionth Bit of π is 0!
One of the largest computations took 23 days of wall clock time and 503 years of CPU time on a 1000-node cluster
Yahoo! has 4000 nodes in its Hadoop cluster
The following slides are taken from the Open Cirrus Summit 2009
![Page 26: Hadoop ecosystem framework n hadoop in live environment](https://reader033.fdocuments.net/reader033/viewer/2022061104/54138c708d7f7284698b4659/html5/thumbnails/26.jpg)
Open Cirrus Summit 2009
Hadoop is critical to Yahoo’s business
Ads Optimization
Content Optimization
Search Index
Content Feed Processing
Machine Learning
(e.g. Spam filters)
• When you visit Yahoo!, you are interacting with data processed with Hadoop!
![Page 27: Hadoop ecosystem framework n hadoop in live environment](https://reader033.fdocuments.net/reader033/viewer/2022061104/54138c708d7f7284698b4659/html5/thumbnails/27.jpg)
Tremendous Impact on Productivity
• Makes Developers & Scientists more productive
– Key computations solved in days and not months
– Projects move from research to production in days
– Easy to learn, even our rocket scientists use it!
• The major factors
– You don’t need to find new hardware to experiment
– You can work with all your data!
– Production and research based on same framework
– No need for R&D to do IT (it just works)
![Page 28: Hadoop ecosystem framework n hadoop in live environment](https://reader033.fdocuments.net/reader033/viewer/2022061104/54138c708d7f7284698b4659/html5/thumbnails/28.jpg)
Search & Advertising Sciences
Hadoop Applications: Search Assist™

| | Before Hadoop | After Hadoop |
|---|---|---|
| Time | 26 days | 20 minutes |
| Language | C++ | Python |
| Development Time | 2-3 weeks | 2-3 days |

• Database for Search Assist™ is built using Hadoop
• 3 years of log data
• 20 steps of map-reduce
![Page 29: Hadoop ecosystem framework n hadoop in live environment](https://reader033.fdocuments.net/reader033/viewer/2022061104/54138c708d7f7284698b4659/html5/thumbnails/29.jpg)
Largest Hadoop Clusters in the Universe
• 25,000+ nodes (~200,000 cores)
– Clusters of up to 4,000 nodes
• 4 Tiers of clusters
– Development, Testing and QA (~10%)
– Proof of Concepts and Ad-Hoc work (~10%): runs the latest version of Hadoop, currently 0.20
– Science and Research (~60%): runs more stable versions
– Production (~20%): currently Hadoop 0.18.3
![Page 30: Hadoop ecosystem framework n hadoop in live environment](https://reader033.fdocuments.net/reader033/viewer/2022061104/54138c708d7f7284698b4659/html5/thumbnails/30.jpg)
Large Hadoop-Based Applications
| | 2008 | 2009 |
|---|---|---|
| Webmap | ~70 hours runtime, ~300 TB shuffling, ~200 TB output, 1480 nodes | ~73 hours runtime, ~490 TB shuffling, ~280 TB output, 2500 nodes |
| Sort benchmarks (Jim Gray contest) | 1 Terabyte sorted: 209 seconds, 900 nodes | 1 Terabyte sorted: 62 seconds, 1500 nodes; 1 Petabyte sorted: 16.25 hours, 3700 nodes |
| Largest cluster | 2000 nodes: 6PB raw disk, 16TB of RAM, 16K CPUs | 4000 nodes: 16PB raw disk, 64TB of RAM, 32K CPUs (40% faster CPUs too) |
![Page 31: Hadoop ecosystem framework n hadoop in live environment](https://reader033.fdocuments.net/reader033/viewer/2022061104/54138c708d7f7284698b4659/html5/thumbnails/31.jpg)
@Facebook
Claims to have the largest single Hadoop cluster in the world
Has multiple clusters at separate data centers
Largest warehouse cluster currently spans 3000 machines
Scans around 2 petabytes per day
300 people throughout the company query this warehouse every month
![Page 32: Hadoop ecosystem framework n hadoop in live environment](https://reader033.fdocuments.net/reader033/viewer/2022061104/54138c708d7f7284698b4659/html5/thumbnails/32.jpg)
Facebook "Messages" uses HBase in production
Collects click logs in near real time from web servers and streams them directly into Hadoop clusters
Medium-term archiving of MySQL databases
Fast backup and recovery from data stored in the Hadoop file system
Reduces maintenance and deployment costs for archiving petabyte-size datasets
![Page 33: Hadoop ecosystem framework n hadoop in live environment](https://reader033.fdocuments.net/reader033/viewer/2022061104/54138c708d7f7284698b4659/html5/thumbnails/33.jpg)
@Nokia
Started using Hadoop in August 2009 in the search analytics team
Started with a 15-machine cluster
Analyses large-scale search logs for various analytics purposes
Search relevance calculation
Duplicate places handling, data cleaning
Fuzzy query parsing and tagging for spelling correction and the lookahead suggestion model
![Page 34: Hadoop ecosystem framework n hadoop in live environment](https://reader033.fdocuments.net/reader033/viewer/2022061104/54138c708d7f7284698b4659/html5/thumbnails/34.jpg)
@Clickable
Uses HBase, HDFS and MapReduce for various purposes such as data storage, analytics, reporting and recommendations
7-machine cluster for production
Uses HBase to handle continuous data updates from networks or any other user action at our end
![Page 35: Hadoop ecosystem framework n hadoop in live environment](https://reader033.fdocuments.net/reader033/viewer/2022061104/54138c708d7f7284698b4659/html5/thumbnails/35.jpg)
@Stumbleupon
Log early, log often, log everything
No piece of data is too small or too noisy to be used in future
Uses Hadoop for Apache log file processing, session analysis and spam detection
![Page 36: Hadoop ecosystem framework n hadoop in live environment](https://reader033.fdocuments.net/reader033/viewer/2022061104/54138c708d7f7284698b4659/html5/thumbnails/36.jpg)
@Stumbleupon
Uses Scribe to collect data directly into HDFS, where it is reviewed and processed by a number of systems
Uses MR to extract data from logs for click counts
Uses Hadoop for search index updates, thumbnail creation and recommendation systems
![Page 37: Hadoop ecosystem framework n hadoop in live environment](https://reader033.fdocuments.net/reader033/viewer/2022061104/54138c708d7f7284698b4659/html5/thumbnails/37.jpg)
Questions?