Introduction To Hadoop
Kenneth Heafield
Google Inc
January 14, 2008
Example code from Hadoop 0.13.1 used under the Apache License Version 2.0
and modified for presentation. Except as otherwise noted, the content of this
presentation is licensed under the Creative Commons Attribution 2.5 License.
Kenneth Heafield (Google Inc) Introduction To Hadoop January 14, 2008 1 / 12
Outline
1 Word Count Code: Mapper, Reducer, Main
2 How it Works: Serialization, Data Flow
3 Lab
Word Count Code Mapper
Mapper
< “wikipedia.org”, “The Free” > → < “The”, 1 >, < “Free”, 1 >
public void map(WritableComparable key, Writable value,
                OutputCollector output, Reporter reporter)
    throws IOException {
  String line = ((Text) value).toString();
  StringTokenizer itr = new StringTokenizer(line);
  Text word = new Text();
  while (itr.hasMoreTokens()) {
    word.set(itr.nextToken());
    output.collect(word, new IntWritable(1));
  }
}
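The map step can be tried without a cluster. The following is a minimal sketch in plain Java, assuming a hypothetical class `MapSketch` that returns a list of pairs in place of Hadoop's `OutputCollector`; it only illustrates the tokenize-and-emit logic and is not the Hadoop API.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical stand-in for the map step; no Hadoop classes are used.
class MapSketch {
    // Emits one <word, 1> pair per whitespace-separated token, in input order.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> output = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            output.add(new SimpleEntry<>(itr.nextToken(), 1));
        }
        return output;
    }

    public static void main(String[] args) {
        // Prints [The=1, Free=1]
        System.out.println(map("The Free"));
    }
}
```

The input key (“wikipedia.org” on the slide) is ignored here, just as the slide's `map` never reads `key`.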
Word Count Code Reducer
Reducer
< “The”, 1 >, < “The”, 1 > → < “The”, 2 >
public void reduce(WritableComparable key, Iterator values,
                   OutputCollector output, Reporter reporter)
    throws IOException {
  int sum = 0;
  while (values.hasNext()) {
    sum += ((IntWritable) values.next()).get();
  }
  output.collect(key, new IntWritable(sum));
}
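A rough plain-Java sketch of the reduce step follows. The class `ReduceSketch` and its `run` method are hypothetical: `run` stands in for the framework, grouping values by key and calling `reduce` once per key, which is an assumption about behavior made only for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the reduce step; no Hadoop classes are used.
class ReduceSketch {
    // Sums every count seen for one key, like the slide's reduce().
    static int reduce(String key, Iterator<Integer> values) {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        return sum;
    }

    // Plays the framework's role: group values by key, then reduce each group.
    static Map<String, Integer> run(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue().iterator()));
        }
        return result;
    }
}
```

Feeding it the pairs < “The”, 1 >, < “The”, 1 > yields a single entry mapping “The” to 2, matching the slide.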
Word Count Code Main
Main
public static void main(String[] args) throws IOException {
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");
  conf.setMapperClass(MapClass.class);
  conf.setCombinerClass(ReduceClass.class);
  conf.setReducerClass(ReduceClass.class);
  conf.setNumMapTasks(new Integer(40));
  conf.setNumReduceTasks(new Integer(30));
  conf.setInputPath(new Path("/shared/wikipedia_small"));
  conf.setOutputPath(new Path("/user/kheafield/word_count"));
  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);
  JobClient.runJob(conf);
}
How it Works Serialization
Types
Purpose
Simple serialization for keys, values, and other data
Interface Writable
Read and write binary format
Convert to String for text formats
WritableComparable adds sorting order for keys
Example Implementations
ArrayWritable is only Writable
BooleanWritable
IntWritable sorts in increasing order
Text holds a String
How it Works Serialization
A Writable
public class IntPairWritable implements Writable {
  public int first;
  public int second;
  public void write(DataOutput out) throws IOException {
    out.writeInt(first);
    out.writeInt(second);
  }
  public void readFields(DataInput in) throws IOException {
    first = in.readInt();
    second = in.readInt();
  }
  public int hashCode() { return first + second; }
  public String toString() {
    return Integer.toString(first) + "," +
           Integer.toString(second);
  }
}
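The `write`/`readFields` pair can be exercised with only `java.io`. Below is a minimal sketch, assuming a hypothetical class `IntPairRoundTrip` that copies the two methods but does not implement Hadoop's `Writable`; the `serialize` and `deserialize` helpers are illustrative names, not part of any library.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical copy of the pair's binary format, without the Hadoop interface.
class IntPairRoundTrip {
    int first;
    int second;

    void write(DataOutput out) throws IOException {
        out.writeInt(first);
        out.writeInt(second);
    }

    void readFields(DataInput in) throws IOException {
        first = in.readInt();
        second = in.readInt();
    }

    // Writes the pair into an in-memory buffer: two 4-byte ints, 8 bytes total.
    static byte[] serialize(IntPairRoundTrip pair) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            pair.write(new DataOutputStream(bytes));
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Rebuilds a pair by reading the fields back in the order they were written.
    static IntPairRoundTrip deserialize(byte[] data) {
        try {
            IntPairRoundTrip pair = new IntPairRoundTrip();
            pair.readFields(new DataInputStream(new ByteArrayInputStream(data)));
            return pair;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The point of the sketch is that `readFields` must read fields in exactly the order `write` emitted them, since the binary format carries no field names.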
How it Works Serialization
WritableComparable Method
public int compareTo(Object other) {
  IntPairWritable o = (IntPairWritable) other;
  if (first < o.first) return -1;
  if (first > o.first) return 1;
  if (second < o.second) return -1;
  if (second > o.second) return 1;
  return 0;
}
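The ordering above compares `first`, then breaks ties on `second`. A small sketch of the same ordering in plain Java follows, assuming a hypothetical class `SortablePair` that implements `java.lang.Comparable` rather than Hadoop's `WritableComparable`, and uses `Integer.compare` in place of the explicit branches.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical pair with the same ordering as the slide's compareTo.
class SortablePair implements Comparable<SortablePair> {
    final int first;
    final int second;

    SortablePair(int first, int second) {
        this.first = first;
        this.second = second;
    }

    // Orders by first, then by second when the first fields are equal.
    @Override
    public int compareTo(SortablePair o) {
        int byFirst = Integer.compare(first, o.first);
        return byFirst != 0 ? byFirst : Integer.compare(second, o.second);
    }

    @Override
    public String toString() {
        return first + "," + second;
    }

    // Returns a sorted copy, leaving the input list untouched.
    static List<SortablePair> sorted(List<SortablePair> input) {
        List<SortablePair> copy = new ArrayList<>(input);
        Collections.sort(copy);
        return copy;
    }
}
```

Sorting the pairs (2,1), (1,5), (1,3) this way gives (1,3), (1,5), (2,1).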
How it Works Data Flow
Data Flow
Default Flow
1 Mappers read from HDFS
2 Map output is partitioned by key and sent to Reducers
3 Reducers sort input by key
4 Reduce output is written to HDFS
[Figure: 1 HDFS (Input) → 2 Mapper (Map) → 3 Reducer (Sort, Reduce) → 4 HDFS (Output), drawn as two parallel pipelines]
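Steps 2 and 3 of the default flow can be sketched in plain Java. The class `ShuffleSketch` is hypothetical, and each word is assumed to carry a count of 1. The partition formula (non-negative hash modulo the number of reducers) is the one commonly attributed to Hadoop's default hash partitioner; treat it here as an assumption rather than a statement about version 0.13.1.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical model of partitioning map output and sorting it per reducer.
class ShuffleSketch {
    // Picks a reducer for a key: clear the sign bit, then take the remainder.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Sends every <word, 1> pair to its reducer's bucket.
    // Each bucket is a TreeMap, so its keys come out in sorted order.
    static List<TreeMap<String, List<Integer>>> shuffle(List<String> words, int numReducers) {
        List<TreeMap<String, List<Integer>>> buckets = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            buckets.add(new TreeMap<>());
        }
        for (String word : words) {
            buckets.get(partition(word, numReducers))
                   .computeIfAbsent(word, k -> new ArrayList<>())
                   .add(1);
        }
        return buckets;
    }
}
```

Because the reducer index depends only on the key, all pairs for the same word land in the same bucket, which is what lets a single `reduce` call see every count for that word.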
How it Works Data Flow
Combiners
Concept
Add counts at Mapper before sending to Reducer.
Word count takes 6 minutes with combiners and 14 minutes without.
Implementation
Mapper caches output and periodically calls Combiner
Input to Combine may be from Map or Combine
Combiner uses the same interface as Reducer
[Figure: Mapper (Input, Map, Cache, Combine) feeding two Sort → Reduce → Output pipelines]
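A sketch of why summing works as a combiner, in plain Java with a hypothetical class `CombineSketch`: the cached map output is collapsed to one pair per word before it would be sent to a reducer, and because addition is associative and commutative, the output of one combine pass can safely be fed into another.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of a combiner that sums counts per word.
class CombineSketch {
    // Collapses cached pairs to one <word, partial sum> pair per distinct word.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> cached) {
        Map<String, Integer> combined = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : cached) {
            combined.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return combined;
    }
}
```

Three cached pairs < “The”, 1 >, < “The”, 1 >, < “Free”, 1 > shrink to two, so fewer pairs cross the network. Passing that result back in together with a new < “The”, 1 > gives “The” a total of 3, illustrating that combiner input may come from Map or from an earlier Combine.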
Lab
Exercises
Recommended: Word Count
Get word count running.
Bigrams
Count bigrams and unigrams efficiently.
Capitalization
With what probability is a word capitalized?
Indexer
In what documents does each word appear? Where in the documents?
Lab
Instructions
1 Log in to the cluster successfully (and set your password).
2 Get Eclipse installed, so you can build Java code.
3 Install the Hadoop plugin for Eclipse so you can deploy jobs to the cluster.
4 Set up your Eclipse workspace from a template that we provide.
5 Run the word counter example over the Wikipedia data set.