Hadoop Tutorial with @techmilind
-
Upload
emc-academic-alliance -
Category
Technology
-
view
107 -
download
1
description
Transcript of Hadoop Tutorial with @techmilind
![Page 1: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/1.jpg)
Hadoop Overview & Architecture
Milind Bhandarkar Chief Scientist, Machine Learning Platforms,
Greenplum, A Division of EMC (Twitter: @techmilind)
![Page 2: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/2.jpg)
About Me • http://www.linkedin.com/in/milindb
• Founding member of Hadoop team at Yahoo! [2005-2010]
• Contributor to Apache Hadoop since v0.1
• Built and led Grid Solutions Team at Yahoo! [2007-2010]
• Parallel Programming Paradigms [1989-today] (PhD cs.illinois.edu)
• Center for Development of Advanced Computing (C-DAC), National Center for Supercomputing Applications (NCSA), Center for Simulation of Advanced Rockets, Siebel Systems, Pathscale Inc. (acquired by QLogic), Yahoo!, LinkedIn, and EMC-Greenplum
![Page 3: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/3.jpg)
2
Agenda • Motivation
• Hadoop
• Map-Reduce
• Distributed File System
• Hadoop Architecture
• Next Generation MapReduce
• Q & A
![Page 4: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/4.jpg)
3
Hadoop At Scale���(Some Statistics)
• 40,000 + machines in 20+ clusters
• Largest cluster is 4,000 machines
• 170 Petabytes of storage
• 1000+ users
• 1,000,000+ jobs/month
![Page 5: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/5.jpg)
EVERY CLICK BEHIND
![Page 6: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/6.jpg)
Hadoop Workflow
![Page 7: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/7.jpg)
Who Uses Hadoop ?
![Page 8: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/8.jpg)
7
Why Hadoop ?
![Page 9: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/9.jpg)
Big Datasets���(Data-Rich Computing theme proposal. J. Campbell, et al, 2007)
![Page 10: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/10.jpg)
Cost Per Gigabyte���(http://www.mkomo.com/cost-per-gigabyte)
![Page 11: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/11.jpg)
Storage Trends ���(Graph by Adam Leventhal, ACM Queue, Dec 2009)
![Page 12: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/12.jpg)
11
Motivating Examples
![Page 13: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/13.jpg)
Yahoo! Search Assist
![Page 14: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/14.jpg)
13
Search Assist • Insight: Related concepts appear close
together in text corpus
• Input: Web pages
• 1 Billion Pages, 10K bytes each
• 10 TB of input data
• Output: List(word, List(related words))
![Page 15: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/15.jpg)
// Code Samples
14
// Input: List(URL, Text)foreach URL in Input : Words = Tokenize(Text(URL)); foreach word in Tokens : Insert (word, Next(word, Tokens)) in Pairs; Insert (word, Previous(word, Tokens)) in Pairs;// Result: Pairs = List (word, RelatedWord)Group Pairs by word;// Result: List (word, List(RelatedWords)foreach word in Pairs : Count RelatedWords in GroupedPairs;// Result: List (word, List(RelatedWords, count))foreach word in CountedPairs : Sort Pairs(word, *) descending by count; choose Top 5 Pairs;// Result: List (word, Top5(RelatedWords))
Search Assist
![Page 16: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/16.jpg)
People You May Know
![Page 17: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/17.jpg)
16
People You May Know • Insight: You might also know Joe Smith if a
lot of folks you know, know Joe Smith
• if you don’t know Joe Smith already
• Numbers:
• 100 MM users
• Average connections per user is 100
![Page 18: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/18.jpg)
// Code Samples
17
// Input: List(UserName, List(Connections)) foreach u in UserList : // 100 MM foreach x in Connections(u) : // 100 foreach y in Connections(x) : // 100 if (y not in Connections(u)) : Count(u, y)++; // 1 Trillion Iterations Sort (u,y) in descending order of Count(u,y); Choose Top 3 y; Store (u, {y0, y1, y2}) for serving;
People You May Know
![Page 19: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/19.jpg)
18
Performance
• 101 Random accesses for each user
• Assume 1 ms per random access
• 100 ms per user
• 100 MM users
• 100 days on a single machine
![Page 20: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/20.jpg)
19
MapReduce Paradigm
![Page 21: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/21.jpg)
20
Map & Reduce
• Primitives in Lisp (& Other functional languages) 1970s
• Google Paper 2004
• http://labs.google.com/papers/mapreduce.html
![Page 22: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/22.jpg)
21
Output_List = Map (Input_List)
Square (1, 2, 3, 4, 5, 6, 7, 8, 9, 10) =(1, 4, 9, 16, 25, 36,49, 64, 81, 100)
Map
![Page 23: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/23.jpg)
22
Output_Element = Reduce (Input_List)
Sum (1, 4, 9, 16, 25, 36,49, 64, 81, 100) = 385
Reduce
![Page 24: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/24.jpg)
23
Parallelism
• Map is inherently parallel
• Each list element processed independently
• Reduce is inherently sequential
• Unless processing multiple lists
• Grouping to produce multiple lists
![Page 25: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/25.jpg)
// Example
24
// Input: http://hadoop.apache.org Pairs = Tokenize_And_Pair ( Text ( Input ) )
Search Assist Map
![Page 26: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/26.jpg)
25
// Input: GroupedList (word, GroupedList(words))CountedPairs = CountOccurrences (word, RelatedWords)
Output = {(hadoop, apache, 7) (hadoop, DFS, 3) (hadoop, streaming, 4) (hadoop, mapreduce, 9) ...}
Search Assist Reduce
![Page 27: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/27.jpg)
26
Issues with Large Data
• Map Parallelism: Chunking input data
• Reduce Parallelism: Grouping related data
• Dealing with failures & load imbalance
![Page 28: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/28.jpg)
![Page 29: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/29.jpg)
28
Apache Hadoop
• January 2006: Subproject of Lucene
• January 2008: Top-level Apache project
• Stable Version: 1.0.3
• Latest Version: 2.0.0 (Alpha)
![Page 30: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/30.jpg)
29
Apache Hadoop
• Reliable, Performant Distributed file system
• MapReduce Programming framework
• Ecosystem: HBase, Hive, Pig, Howl, Oozie, Zookeeper, Chukwa, Mahout, Cascading, Scribe, Cassandra, Hypertable, Voldemort, Azkaban, Sqoop, Flume, Avro ...
![Page 31: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/31.jpg)
30
Problem: Bandwidth to Data
• Scan 100TB Datasets on 1000 node cluster
• Remote storage @ 10MB/s = 165 mins
• Local storage @ 50-200MB/s = 33-8 mins
• Moving computation is more efficient than moving data
• Need visibility into data placement
![Page 32: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/32.jpg)
31
Problem: Scaling Reliably • Failure is not an option, it’s a rule !
• 1000 nodes, MTBF < 1 day
• 4000 disks, 8000 cores, 25 switches, 1000 NICs, 2000 DIMMS (16TB RAM)
• Need fault tolerant store with reasonable availability guarantees
• Handle hardware faults transparently
![Page 33: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/33.jpg)
32
Hadoop Goals
• Scalable: Petabytes (1015 Bytes) of data on thousands on nodes
• Economical: Commodity components only
• Reliable
• Engineering reliability into every application is expensive
![Page 34: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/34.jpg)
33
Hadoop MapReduce
![Page 35: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/35.jpg)
34
Think MapReduce
• Record = (Key, Value)
• Key : Comparable, Serializable
• Value: Serializable
• Input, Map, Shuffle, Reduce, Output
![Page 36: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/36.jpg)
// Code Samples
35
cat /var/log/auth.log* | \ grep “session opened” | cut -d’ ‘ -f10 | \sort | \uniq -c > \~/userlist
Seems Familiar ?
![Page 37: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/37.jpg)
36
Map
• Input: (Key1, Value1)
• Output: List(Key2, Value2)
• Projections, Filtering, Transformation
![Page 38: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/38.jpg)
37
Shuffle
• Input: List(Key2, Value2)
• Output
• Sort(Partition(List(Key2, List(Value2))))
• Provided by Hadoop
![Page 39: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/39.jpg)
38
Reduce
• Input: List(Key2, List(Value2))
• Output: List(Key3, Value3)
• Aggregation
![Page 40: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/40.jpg)
39
Hadoop Streaming • Hadoop is written in Java
• Java MapReduce code is “native” • What about Non-Java Programmers ?
• Perl, Python, Shell, R
• grep, sed, awk, uniq as Mappers/Reducers
• Text Input and Output
![Page 41: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/41.jpg)
40
Hadoop Streaming • Thin Java wrapper for Map & Reduce Tasks
• Forks actual Mapper & Reducer
• IPC via stdin, stdout, stderr
• Key.toString() \t Value.toString() \n
• Slower than Java programs
• Allows for quick prototyping / debugging
![Page 42: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/42.jpg)
// Code Samples
41
$ bin/hadoop jar hadoop-streaming.jar \ -input in-files -output out-dir \ -mapper mapper.sh -reducer reducer.sh# mapper.shsed -e 's/ /\n/g' | grep .# reducer.shuniq -c | awk '{print $2 "\t" $1}'
Hadoop Streaming
![Page 43: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/43.jpg)
42
Hadoop Distributed File System (HDFS)
![Page 44: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/44.jpg)
43
HDFS
• Data is organized into files and directories
• Files are divided into uniform sized blocks (default 128MB) and distributed across cluster nodes
• HDFS exposes block placement so that computation can be migrated to data
![Page 45: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/45.jpg)
44
HDFS
• Blocks are replicated (default 3) to handle hardware failure
• Replication for performance and fault tolerance (Rack-Aware placement)
• HDFS keeps checksums of data for corruption detection and recovery
![Page 46: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/46.jpg)
45
HDFS
• Master-Worker Architecture
• Single NameNode
• Many (Thousands) DataNodes
![Page 47: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/47.jpg)
46
HDFS Master���(NameNode)
• Manages filesystem namespace
• File metadata (i.e. “inode”) • Mapping inode to list of blocks + locations
• Authorization & Authentication
• Checkpoint & journal namespace changes
![Page 48: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/48.jpg)
47
Namenode
• Mapping of datanode to list of blocks
• Monitor datanode health
• Replicate missing blocks
• Keeps ALL namespace in memory
• 60M objects (File/Block) in 16GB
![Page 49: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/49.jpg)
48
Datanodes • Handle block storage on multiple volumes &
block integrity
• Clients access the blocks directly from data nodes
• Periodically send heartbeats and block reports to Namenode
• Blocks are stored as underlying OS’s files
![Page 50: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/50.jpg)
HDFS Architecture
![Page 51: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/51.jpg)
Example: Unigrams
• Input: Huge text corpus
• Wikipedia Articles (40GB uncompressed)
• Output: List of words sorted in descending order of frequency
![Page 52: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/52.jpg)
$ cat ~/wikipedia.txt | \sed -e 's/ /\n/g' | grep . | \sort | \uniq -c > \~/frequencies.txt$ cat ~/frequencies.txt | \# cat | \sort -n -k1,1 -r |# cat > \~/unigrams.txt
Unigrams
![Page 53: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/53.jpg)
mapper (filename, file-contents):for each word in file-contents: emit (word, 1)
reducer (word, values):
sum = 0for each value in values: sum = sum + valueemit (word, sum)
MR for Unigrams
![Page 54: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/54.jpg)
mapper (word, frequency):emit (frequency, word)
reducer (frequency, words):
for each word in words: emit (word, frequency)
MR for Unigrams
![Page 55: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/55.jpg)
public static class MapClass extends MapReduceBaseimplements Mapper<LongWritable, Text, Text, IntWritable> {
public void map(LongWritable key, Text value,
OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
String line = value.toString(); StringTokenizer itr = new StringTokenizer(line); while (itr.hasMoreTokens()) { Text word = new Text(itr.nextToken()); output.collect(word, new IntWritable(1)); }}
}
Unigrams: Java Mapper
![Page 56: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/56.jpg)
public static class Reduce extends MapReduceBaseimplements Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key,Iterator<IntWritable> values,
OutputCollector<Text,IntWritable> output, Reporter reporter) throws IOException { int sum = 0; while (values.hasNext()) { sum += values.next().get(); } output.collect(key, new IntWritable(sum));}
}
Unigrams: Java Reducer
![Page 57: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/57.jpg)
public void run(String inputPath, String outputPath) throws Exception{
JobConf conf = new JobConf(WordCount.class);conf.setJobName("wordcount");conf.setMapperClass(MapClass.class);
conf.setReducerClass(Reduce.class); FileInputFormat.addInputPath(conf, new Path(inputPath));
FileOutputFormat.setOutputPath(conf, new Path(outputPath));JobClient.runJob(conf);
}
Unigrams: Driver
![Page 58: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/58.jpg)
Configuration • Unified Mechanism for
• Configuring Daemons
• Runtime environment for Jobs/Tasks
• Defaults: *-default.xml
• Site-Specific: *-site.xml
• final parameters
![Page 59: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/59.jpg)
<configuration><property> <name>mapred.job.tracker</name> <value>head.server.node.com:9001</value></property><property> <name>fs.default.name</name> <value>hdfs://head.server.node.com:9000</value></property><property><name>mapred.child.java.opts</name><value>-Xmx512m</value><final>true</final></property>
....</configuration>
Example
![Page 60: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/60.jpg)
Running Hadoop Jobs
![Page 61: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/61.jpg)
[milindb@gateway ~]$ hadoop jar \$HADOOP_HOME/hadoop-examples.jar wordcount \/data/newsarchive/20080923 /tmp/newsoutinput.FileInputFormat: Total input paths to process : 4mapred.JobClient: Running job: job_200904270516_5709mapred.JobClient: map 0% reduce 0%mapred.JobClient: map 3% reduce 0%mapred.JobClient: map 7% reduce 0%....mapred.JobClient: map 100% reduce 21%mapred.JobClient: map 100% reduce 31%mapred.JobClient: map 100% reduce 33%mapred.JobClient: map 100% reduce 66%mapred.JobClient: map 100% reduce 100%mapred.JobClient: Job complete: job_200904270516_5709
Running a Job
![Page 62: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/62.jpg)
mapred.JobClient: Counters: 18mapred.JobClient: Job Countersmapred.JobClient: Launched reduce tasks=1mapred.JobClient: Rack-local map tasks=10mapred.JobClient: Launched map tasks=25mapred.JobClient: Data-local map tasks=1mapred.JobClient: FileSystemCountersmapred.JobClient: FILE_BYTES_READ=491145085mapred.JobClient: HDFS_BYTES_READ=3068106537mapred.JobClient: FILE_BYTES_WRITTEN=724733409mapred.JobClient: HDFS_BYTES_WRITTEN=377464307
Running a Job
![Page 63: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/63.jpg)
mapred.JobClient: Map-Reduce Frameworkmapred.JobClient: Combine output records=73828180mapred.JobClient: Map input records=36079096mapred.JobClient: Reduce shuffle bytes=233587524mapred.JobClient: Spilled Records=78177976mapred.JobClient: Map output bytes=4278663275mapred.JobClient: Combine input records=371084796mapred.JobClient: Map output records=313041519mapred.JobClient: Reduce input records=15784903
Running a Job
![Page 64: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/64.jpg)
// Code Samples
JobTracker WebUI
![Page 65: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/65.jpg)
JobTracker Status
![Page 66: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/66.jpg)
Jobs Status
![Page 67: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/67.jpg)
Job Details
![Page 68: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/68.jpg)
Job Counters
![Page 69: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/69.jpg)
Job Progress
![Page 70: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/70.jpg)
All Tasks
![Page 71: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/71.jpg)
Task Details
![Page 72: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/72.jpg)
Task Counters
![Page 73: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/73.jpg)
Task Logs
![Page 74: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/74.jpg)
MapReduce Dataflow
![Page 75: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/75.jpg)
MapReduce
![Page 76: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/76.jpg)
Job Submission
![Page 77: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/77.jpg)
Initialization
![Page 78: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/78.jpg)
Scheduling
![Page 79: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/79.jpg)
Execution
![Page 80: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/80.jpg)
Map Task
![Page 81: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/81.jpg)
Sort Buffer
![Page 82: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/82.jpg)
Reduce Task
![Page 83: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/83.jpg)
82
Next Generation MapReduce
![Page 84: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/84.jpg)
MapReduce Today (Courtesy: Arun Murthy, Hortonworks)
![Page 85: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/85.jpg)
84
Why ? • Scalability Limitations today
• Maximum cluster size: 4000 nodes
• Maximum Concurrent tasks: 40,000
• Job Tracker SPOF
• Fixed map and reduce containers (slots)
• Punishes pleasantly parallel apps
![Page 86: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/86.jpg)
85
Why ? (contd) • MapReduce is not suitable for every
application
• Fine-Grained Iterative applications
• HaLoop: Hadoop in a Loop
• Message passing applications
• Graph Processing
![Page 87: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/87.jpg)
86
Requirements
• Need scalable cluster resources manager
• Separate scheduling from resource management
• Multi-Lingual Communication Protocols
![Page 88: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/88.jpg)
87
Bottom Line
• @techmilind #mrng (MapReduce, Next Gen) is in reality, #rmng (Resource Manager, Next Gen)
• Expect different programming paradigms to be implemented
• Including MPI (soon)
![Page 89: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/89.jpg)
Architecture (Courtesy: Arun Murthy, Hortonworks)
![Page 90: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/90.jpg)
89
The New World • Resource Manager
• Allocates resources (containers) to applications
• Node Manager
• Manages containers on nodes
• Application Master
• Specific to paradigm e.g. MapReduce application master, MPI application master etc
![Page 91: Hadoop Tutorial with @techmilind](https://reader034.fdocuments.net/reader034/viewer/2022051515/54c654224a7959e8238b45a5/html5/thumbnails/91.jpg)
90
Container
• In current terminology: A Task Slot
• Slice of the node’s hardware resources
• #of cores, virtual memory, disk size, disk and network bandwidth etc
• Currently, only memory usage is sliced