Team3: Xiaokui Shu, Ron Cohen. CS5604 at Virginia Tech, December 6, 2010.
Hadoop
Team3: Xiaokui Shu, Ron Cohen
[email protected] [email protected]
at Virginia Tech
December 6, 2010
Content
Introduction: Hadoop, MapReduce
Working With Hadoop: Environment, MapReduce Programming
Summary
Introduction :: Hadoop
A software framework for distributed applications
The user programs against it, like a super-library
Built-in solutions; solutions depend on this framework
Inspired by Google's MapReduce and Google File System (GFS) papers
Introduction :: Hadoop
Who uses Hadoop?
A9.com (Amazon): Amazon's product search indices
Adobe: 30 nodes running HDFS, Hadoop and HBase
Baidu: handles about 3000 TB per week
Facebook: stores copies of internal log and dimension data sources
Also Last.fm, LinkedIn, IBM, Yahoo!, Google, ...
Introduction :: Hadoop
Hadoop Common HDFS MapReduce ZooKeeper
Introduction :: Hadoop :: IR
Connections to the IR book:
Ch. 4 Index construction: Distributed indexing (4.4)
Ch. 20 Web crawling and indexes: Distributed crawler (20.2), Distributed indexing (20.3)
Introduction :: MapReduce
A software framework for distributed computing
Massive amounts of data, simple processing requirements
Portability across a variety of platforms: clusters, CMP/SMP, GPGPU
Introduced by Google
Introduction :: MapReduce
Cited from MapReduce: Simplified Data Processing on Large Clusters
Introduction :: MapReduce
Map:    map(k1, v1) -> list(k2, v2)
Reduce: reduce(k2, list(v2)) -> list(v3)
Hadoop MapReduce:
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
Introduction :: MapReduce Ex Source
$ cat file01
Hello World Bye World
$ cat file02
Hello Hadoop Goodbye Hadoop
Introduction :: MapReduce Ex Map Output
For file01:
< Hello, 1> < World, 1> < Bye, 1> < World, 1>
For file02:
< Hello, 1> < Hadoop, 1> < Goodbye, 1> < Hadoop, 1>
Introduction :: MapReduce Ex Reduce Output
< Bye, 1> < Goodbye, 1> < Hadoop, 2> < Hello, 2> < World, 2>
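To make the map, shuffle, and reduce phases of the example above concrete, here is a plain-Java sketch with no Hadoop dependency; the class and method names (WordCountSim, map, reduce) are illustrative, not Hadoop API:

```java
import java.util.*;

public class WordCountSim {
    // Map phase: emit a (word, 1) pair for each token, as the slides' Mapper does.
    static List<Map.Entry<String, Integer>> map(String document) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : document.split("\\s+")) {
            out.add(new AbstractMap.SimpleEntry<>(w, 1));
        }
        return out;
    }

    // Shuffle + reduce phase: group values by key (sorted, as Hadoop delivers
    // them to reducers) and sum each key's partial counts.
    static SortedMap<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        intermediate.addAll(map("Hello World Bye World"));       // file01
        intermediate.addAll(map("Hello Hadoop Goodbye Hadoop")); // file02
        // Prints {Bye=1, Goodbye=1, Hadoop=2, Hello=2, World=2},
        // matching the reduce output shown above.
        System.out.println(reduce(intermediate));
    }
}
```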
Introduction :: MapReduce
More input: more mappers
Combiner function after Map
More reducers: partition function before Reduce
Focus on Map & Reduce
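The partition function decides which reducer receives each intermediate key. Hadoop's default HashPartitioner uses the key's hash masked to non-negative, modulo the number of reducers; a minimal sketch of that formula (PartitionSketch is an illustrative name, not a Hadoop class):

```java
public class PartitionSketch {
    // Same formula as Hadoop's default HashPartitioner: mask off the sign bit
    // of the key's hash, then take it modulo the number of reduce tasks.
    // All pairs with the same key land on the same reducer.
    static int getPartition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        for (String w : new String[] {"Bye", "Goodbye", "Hadoop", "Hello", "World"}) {
            System.out.println(w + " -> reducer " + getPartition(w, 2));
        }
    }
}
```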
Working With Hadoop :: Env
Hadoop programs in Java (C++ also supported)
Runs in 3 modes:
Local (Standalone) Mode
Pseudo-Distributed Mode
Fully-Distributed Mode
Our instance on the IBM cloud is set up in Pseudo-Distributed Mode
Working With Hadoop
Process:
1. Start the Hadoop service
2. Prepare input
3. Write your MapReduce program
4. Compile your program
5. Run your application with Hadoop
Working With Hadoop :: Env
Initialize the filesystem:
$ bin/hadoop namenode -format
Start the Hadoop service:
$ bin/start-all.sh
Prepare input:
$ bin/hadoop fs -put localdir hinputdir
You can also use -get, -rm, -cat with fs
Working With Hadoop :: Env
Compile your program & create a jar:
$ javac -classpath ${HADOOP}-core.jar -d wordcount_classes WordCount.java
$ jar -cvf wordcount.jar -C wordcount_classes/ .
Run your application with Hadoop:
$ bin/hadoop jar wordcount.jar org.myorg.WordCount hinputdir houtputdir
Working With Hadoop :: Prog

void map(String name, String document):
  // name: document name
  // document: document contents
  for each word w in document:
    EmitIntermediate(w, "1");

void reduce(String word, Iterator partialCounts):
  // word: a word
  // partialCounts: a list of aggregated partial counts
  int result = 0;
  for each pc in partialCounts:
    result += ParseInt(pc);
  Emit(AsString(result));
Cited from Wikipedia
Working With Hadoop :: Prog

public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}
Working With Hadoop :: Prog

public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
Working With Hadoop :: Prog Configurations & Main class
Leave other work for the Hadoop MapReduce Framework
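The slides stop at the Mapper and Reducer classes; a driver along the lines of the Hadoop 0.20 MapReduce tutorial, assuming the Map and Reduce classes shown above are nested in a WordCount class, would look like this (job configuration only; it needs a Hadoop installation to run):

```java
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {
  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    // Types of the final <k3, v3> output pairs.
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class); // combiner reuses the reducer logic
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    // hinputdir and houtputdir are passed on the command line.
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}
```

The framework handles input splitting, shuffling, sorting, and output writing, which is what "leave other work for the Hadoop MapReduce Framework" refers to.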
Summary
Hadoop introduction
Connections to the IR book
MapReduce overview, e.g. WordCount
Environment configuration
Writing your MapReduce application
References
Hadoop Project
http://hadoop.apache.org/
MapReduce in Hadoop
http://hadoop.apache.org/common/docs/r0.20.0/mapred_tutorial.html
MapReduce: Simplified Data Processing on Large Clusters
http://portal.acm.org/citation.cfm?id=1327452.1327492&coll=GUIDE&dl=&idx=J79&part=magazine&WantType=Magazines&title=Communications%20of%20the%20ACM
Hadoop Single-Node Setup
http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html
Who uses Hadoop
http://wiki.apache.org/hadoop/PoweredBy
Thank You!