MapReduce and Hadoop
Transcript of MapReduce and Hadoop
MapReduce and Hadoop
- SALIL NAVGIRE
Big Data Explosion
• 90% of today's data was created in the last 2 years
• Data volume doubles roughly every 18 months (often likened to Moore's law)
• YouTube: 13 million hours and 700 billion views in 2010
• Facebook: 20TB/day (compressed)
• CERN/LHC: 40TB/day (15PB/year)
• Many more examples
Solution: Scalability
How?
Divide and Conquer
Challenges!
• How do we assign units of work to the workers?
• What if there are more units of work than workers?
• What if the workers need to share intermediate incomplete data?
• How do we aggregate such intermediate data?
• How do we know when all workers have completed their assignments?
• What if some workers failed?
History
• 2000: Apache Lucene: batch index updates and sort/merge with an on-disk index
• 2002: Apache Nutch: distributed, scalable open source web crawler
• 2003-04: Google publishes the GFS and MapReduce papers
• 2006: Apache Hadoop: open source Java implementation of GFS and MapReduce to solve Nutch's scaling problem; later becomes a standalone project
What is MapReduce?
• A programming model for distributing a task across multiple nodes
• Used to develop solutions that will process large amounts of data in a parallelized fashion in clusters of computing nodes
• Original MapReduce paper by Google
• Features of MapReduce:
  • Fault tolerance
  • Status and monitoring tools
  • A clean abstraction for programmers
MapReduce Execution Overview
[Diagram: the user program forks a master and a pool of workers; the master assigns map tasks and reduce tasks. Map workers read input splits (Split 0, 1, 2), process them, and write intermediate results to local disk. Reduce workers remote-read and sort that intermediate data, then write the final results to Output File 0 and Output File 1.]
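The flow in the diagram can be sketched as a single-process simulation (illustrative only; real Hadoop runs map and reduce tasks on separate workers coordinated by the master):

```python
from collections import defaultdict

def run_mapreduce(splits, map_fn, reduce_fn):
    """Simulate the map -> shuffle/sort -> reduce pipeline in one process."""
    # Map phase: each input split is processed independently
    # (in Hadoop, by separate map workers assigned by the master).
    intermediate = []
    for split in splits:
        intermediate.extend(map_fn(split))
    # Shuffle/sort phase: group intermediate values by key
    # (the "remote read, sort" step in the diagram).
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: aggregate each key's values into the final output.
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}
```

For example, counting words with `map_fn = lambda s: [(w, 1) for w in s.split()]` and `reduce_fn = lambda k, vs: sum(vs)` reproduces the classic word-count job.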
Hadoop Components
• HDFS (Storage): self-healing, high-bandwidth clustered storage
• MapReduce (Processing): fault-tolerant distributed processing
HDFS Architecture
HDFS Basics
• HDFS is a filesystem written in Java
• Sits on top of a native filesystem
• Provides redundant storage for massive amounts of data
• Runs on commodity hardware
HDFS Data
• Data is split into blocks and stored on multiple nodes in the cluster
• Each block is usually 64 MB or 128 MB
• Each block is replicated multiple times
• Replicas stored on different data nodes
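A rough sketch of the arithmetic implied above, using the 128 MB block size and a 3x replication factor (the common Hadoop defaults; both are configurable):

```python
import math

def hdfs_storage(file_size_mb, block_size_mb=128, replication=3):
    """Estimate block count and raw cluster storage for one file."""
    # Number of blocks the file is split into (the last block may be partial).
    blocks = math.ceil(file_size_mb / block_size_mb)
    # Raw storage consumed across the cluster once every block
    # has been replicated onto `replication` different DataNodes.
    raw_mb = file_size_mb * replication
    return blocks, raw_mb
```

So a 1 GB file becomes 8 blocks and consumes about 3 GB of raw cluster storage.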
Two Types of Nodes
• Master Nodes
• Slave Nodes
Master Node
• NameNode
  • only 1 per cluster
  • metadata server and database
  • SecondaryNameNode helps with some housekeeping
• JobTracker
  • only 1 per cluster
  • job scheduler
Slave Nodes
• DataNodes
  • 1-4,000 per cluster
  • block data storage
• TaskTrackers
  • 1-4,000 per cluster
  • task execution
NameNode
• A single NameNode stores all metadata and manages block replication and read/write access to files
• Filenames, locations on DataNodes of each block, owner, group, etc.
• All information maintained in RAM for fast lookup
Secondary NameNode
• Performs memory-intensive administrative functions for the NameNode
• Should run on a separate machine
DataNodes
• DataNodes store file contents
• Different blocks of the same file will be stored on different DataNodes
• Same block is stored on three (or more) DataNodes for redundancy
Word Count Example
• Input: text files
• Output: a single file containing (Word <TAB> Count)
• Map Phase: generates (Word, Count) pairs
  • [{a,1}, {b,1}, {a,1}] [{a,2}, {b,3}, {c,5}] [{a,3}, {b,1}, {c,1}]
• Reduce Phase: for each word, calculates the aggregate count
  • [{a,7}, {b,5}, {c,6}]
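A minimal sketch of the reduce phase over the slide's intermediate pairs (plain Python for illustration, not the actual Hadoop Java API):

```python
from collections import defaultdict

# Intermediate (word, count) pairs emitted by the three map tasks,
# exactly as listed on the slide.
map_outputs = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 2), ("b", 3), ("c", 5)],
    [("a", 3), ("b", 1), ("c", 1)],
]

def reduce_word_counts(map_outputs):
    """Aggregate per-word counts across all map task outputs."""
    totals = defaultdict(int)
    for pairs in map_outputs:
        for word, count in pairs:
            totals[word] += count
    return dict(totals)
```

Summing the pairs above yields {a: 7, b: 5, c: 6}, matching the slide's reduce output.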
Typical Cluster
• 3-4,000 commodity servers
• Each server:
  • 2x quad-core CPUs
  • 16-24 GB RAM
  • 4-12 TB disk space
• 20-30 servers per rack
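A back-of-the-envelope capacity estimate for such a cluster, under assumed values (8 TB of disk per server, the mid-range of the figures above, and HDFS's default 3x replication; overhead and temporary space are ignored):

```python
def usable_capacity_tb(servers=4000, disk_tb_per_server=8, replication=3):
    """Estimate HDFS capacity usable for unique data, in TB."""
    # Raw disk across the cluster, divided by the replication factor,
    # since every block is stored `replication` times.
    return servers * disk_tb_per_server / replication
```

With these assumptions, a 4,000-node cluster holds roughly 32 PB raw, or about 10.7 PB of unique data.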
When Should I Use It?
Good choice for jobs that can be broken into parallel subtasks:
• Indexing/analysis of log files
• Sorting of large data sets
• Image processing/machine learning
Bad choice for serial or low-latency jobs:
• Real-time processing
• Processing-intensive tasks with little data
• Replacing MySQL
Who uses Hadoop?