An Introduction to the World of Hadoop

MapReduce / Hadoop for Scientific Data Mining Hadoop = Open Source MapReduce Wider World of Hadoop An Introduction to the World of Hadoop Applications to Scientific Data Mining Gordon Rios [email protected] Cork Constraint Computation Centre (4C) University College Cork October 29, 2010 Gordon Rios Introduction to Hadoop

description

A presentation on Hadoop for scientific researchers given at Universitat Rovira i Virgili in Catalonia, Spain in October 2010. http://etseq.urv.cat/seminaris/seminars/3/

Transcript of An Introduction to the World of Hadoop

Page 1: An Introduction to the World of Hadoop

MapReduce / Hadoop for Scientific Data Mining
Hadoop = Open Source MapReduce
Wider World of Hadoop

An Introduction to the World of Hadoop
Applications to Scientific Data Mining

Gordon Rios

[email protected]

Cork Constraint Computation Centre (4C)

University College Cork

October 29, 2010

Gordon Rios Introduction to Hadoop

Page 2: An Introduction to the World of Hadoop

Outline

1 MapReduce / Hadoop for Scientific Data Mining
    Objectives for the Talk
    MapReduce as Simplified Parallel Computing
    Thinking in Terms of MapReduce

2 Hadoop is Open Source MapReduce
    Basics of Hadoop
    Hadoop Examples
    Developing Production Systems with Hadoop

3 Wider World of Hadoop
    Ad Hoc Analysis with Hadoop
    Further Reading

Page 4: An Introduction to the World of Hadoop

Objectives

At the end of this talk I want you to have ideas for how to apply MapReduce to your domains and confidence that Hadoop is a good way to do it...

- Introduce thinking in terms of MapReduce and why it’s a good idea
- Introduce Hadoop as an open source implementation of MapReduce
- Present a detailed example of using the Hadoop streaming API for a scientific data mining task
- Discuss higher level notions for performing ad hoc analysis and building systems with Hadoop

Page 9: An Introduction to the World of Hadoop

MapReduce

MapReduce is distributed computing where we take advantage of data locality to push the computation to the data...

Distributed computing: clusters of computers with local memory and disk (network intensive for big data)

Parallel computing: multiple CPUs processing over shared memory and filesystem

If we can decompose the problem into independent map and reduce tasks we can achieve “easy” parallelism with MapReduce...

1 Map works independently to convert input data to key-value pairs...

2 Reduce works independently on all values for a given key and transforms them to a single output set (possibly even just the empty set ∅) per key...

Now, let’s expand that a bit...
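As a rough illustration of the two steps, here is a minimal single-process sketch in Python. The names map_fn, reduce_fn and run_mapreduce are made up for this example and are not part of Hadoop; the in-memory grouping merely stands in for what a real cluster does across machines.

```python
from collections import defaultdict


def map_fn(record):
    """Map: turn one input record into (key, value) pairs, independently of all other records."""
    for word in record.split():
        yield (word.lower(), 1)


def reduce_fn(key, values):
    """Reduce: collapse all values seen for one key into a single output."""
    return (key, sum(values))


def run_mapreduce(records, map_fn, reduce_fn):
    """Toy driver: group mapper output by key, then reduce each key on its own."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return [reduce_fn(key, groups[key]) for key in sorted(groups)]


if __name__ == "__main__":
    print(run_mapreduce(["to be or not to be"], map_fn, reduce_fn))
    # prints [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

Because map_fn looks at one record at a time and reduce_fn looks at one key at a time, either step could be spread over many machines without changing the result.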

Page 15: An Introduction to the World of Hadoop

Basic Elements of MapReduce

MapReduce is distributed sort with specific places to insert application logic...

- an input reader: read work data W from the file system [1] and produce a set of splits S: W → S
- a Map function: (S) → (K, V)
- a combiner function: a mapper optimization...
- a partition function: partition [2] keys k ∈ K to reducers: K → R
- a compare function cmp(k_i, k_j): sort the keys presented to each reducer
- a Reduce function: reduce the output from all mappers for a particular key k to another set of values w_k for that key: (k, V) → (k, w_k)
- an output writer: write output to the file system [1]

[1] A distributed file system (DFS) for stability and scale
[2] The default: hash keys modulo the number of reducers
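The elements listed above can be sketched as a toy, single-process pipeline. Everything here is an assumption made for illustration: the function names are invented, word counting is used as the application logic, and zlib.crc32 stands in for whatever hash a real framework uses in its default partitioner.

```python
import zlib
from collections import Counter, defaultdict


def read_splits(lines, split_size):
    """Input reader: cut the work data W into splits S."""
    return [lines[i:i + split_size] for i in range(0, len(lines), split_size)]


def map_split(split):
    """Map: one split -> list of (key, value) pairs."""
    return [(word, 1) for line in split for word in line.split()]


def combine(pairs):
    """Combiner: pre-aggregate one mapper's output before it is shuffled."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return list(local.items())


def partition(key, num_reducers):
    """Partition: hash of the key modulo the number of reducers."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reduce_key(key, values):
    """Reduce: all values for one key -> (key, result)."""
    return (key, sum(values))


def run_job(lines, split_size=2, num_reducers=2):
    # Shuffle: route each combined pair to the reducer chosen by partition().
    reducer_inputs = [defaultdict(list) for _ in range(num_reducers)]
    for split in read_splits(lines, split_size):
        for key, value in combine(map_split(split)):
            reducer_inputs[partition(key, num_reducers)][key].append(value)
    # Compare: each reducer sees its keys in sorted order.
    # Output writer: one list per reducer, like one output file per reducer.
    return [[reduce_key(k, grouped[k]) for k in sorted(grouped)]
            for grouped in reducer_inputs]
```

Calling run_job(["a b a", "b c", "a c c"]) returns one sorted list of (word, count) pairs per reducer; which reducer a given word lands on depends only on its hash.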

Page 17: An Introduction to the World of Hadoop

Examples of Map and Reduce

Let’s start with a few examples of Map...

Word Count: read in a stream of text (e.g. a document or a set of documents) and emit each word as a key with a value of 1

Inverted Index: read in a stream of documents and emit each word as a key and the document ID as the value

Max Temperature: read in formatted data and emit year as a key with temperature as the value

Mean Rain Precipitation: read in daily data and emit (year-month, lat, long) as a key with precipitation as the value

Reduce in these cases simply applies a count, list, max, or average, respectively, to the set of values for each key. [Dean, Ghemawat, 2008; Wikipedia, 2010; White, 2011]
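Two of these examples can be sketched in a few lines of Python. This is only an illustration: the function names, the tiny in-memory documents, and the "year,temperature" record layout are all assumptions made up for the sketch, not the formats used in the cited sources.

```python
from collections import defaultdict


def map_inverted_index(doc_id, text):
    """Emit each word as the key and the document ID as the value."""
    for word in text.split():
        yield (word, doc_id)


def map_max_temperature(record):
    """Emit the year as the key and the temperature as the value.

    Assumes a made-up "year,temperature" record layout.
    """
    year, temperature = record.split(",")
    yield (year, float(temperature))


def group_by_key(pairs):
    """Stand-in for the shuffle: collect all values for each key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


# Inverted index: the reduce step is "list" (here a sorted list of unique IDs).
docs = {"d1": "hadoop map reduce", "d2": "map side join"}
index_pairs = [pair for doc_id, text in docs.items()
               for pair in map_inverted_index(doc_id, text)]
index = {word: sorted(set(ids)) for word, ids in group_by_key(index_pairs).items()}

# Max temperature: the reduce step is "max".
readings = ["1949,11.1", "1950,2.2", "1949,7.8", "1950,-1.1"]
temp_pairs = [pair for record in readings for pair in map_max_temperature(record)]
max_temp = {year: max(values) for year, values in group_by_key(temp_pairs).items()}
```

Only the mapper and the per-key reduction change between the two jobs; the grouping in the middle is identical.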

Page 18: An Introduction to the World of Hadoop

Visualizing Word Count

source: Chris Wensel from http://www.cascading.org

Page 20: An Introduction to the World of Hadoop

Engineering Intermezzo

This is how easy it is to get Hadoop installed... given that you have Java 6 installed already...

Get Hadoop: http://hadoop.apache.org/

% tar xzf hadoop-x.y.z.tar.gz
% export HADOOP_INSTALL=BUILD_DIR/hadoop-x.y.z
% export PATH=$PATH:$HADOOP_INSTALL/bin

Page 21: An Introduction to the World of Hadoop

MapReduce with Hadoop and the streaming library

Now, let’s take a closer look at how Hadoop implements MapReduce, from [White, 2011]...

Page 22: An Introduction to the World of Hadoop

Hadoop Streaming Library

We’ll focus on the streaming library as it’s the most natural for scientific or technical computing... let’s look at the Definitive Guide’s weather example...

Page 24: An Introduction to the World of Hadoop

Hadoop Book Examples

More examples from Hadoop: The Definitive Guide, 2nd Edition (Hadoop 20.1) http://www.hadoopbook.com/ ... here’s how to install and try them for yourself...

- Install Git: http://git-scm.com/
- Visit GitHub for book code: http://github.com/tomwhite/hadoop-book/
- Check out the code examples from The Definitive Guide

% cd BUILD_DIR
% git clone http://github.com/tomwhite/hadoop-book.git hadoop-book

Page 25: An Introduction to the World of Hadoop

Example: ECA Mean Precipitation

Let’s compute mean precipitation at over 2,000 weather stations and make some graphics. There are 2,186 files with a median of 21,875 lines each, a minimum of 1,025 and a maximum of 78,090.

ECA Daily Data

The ECA dataset contains series of daily observations at meteorological stations throughout Europe and the Mediterranean. Part of the dataset is freely available for non-commercial research. To download this public data select one of the options below. Note that a gridded version with daily temperature and precipitation fields is also available. source: http://eca.knmi.nl/dailydata/index.php

File Format

FILE FORMAT (MISSING VALUE CODE = -9999):

01-06 STAID : Station identifier
08-13 SOUID : Source identifier
15-22 DATE  : Date YYYYMMDD
24-28 RR    : Precipitation amount in 0.1 mm
30-34 Q_RR  : quality code for RR (0='valid'; 1='suspect'; 9='missing')
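To make the record layout concrete, here is a hedged sketch of a parser for one data line. It assumes the five fields are separated by commas (which is how the mapper in this talk splits them); the Observation type, the parse_record name, and the sample lines in the usage note are invented for illustration.

```python
from typing import NamedTuple, Optional

MISSING = -9999  # missing value code from the file format description


class Observation(NamedTuple):
    staid: str              # station identifier
    souid: str              # source identifier
    date: str               # YYYYMMDD
    rr_mm: Optional[float]  # precipitation in mm, or None when missing
    quality: int            # 0 = valid, 1 = suspect, 9 = missing


def parse_record(line):
    """Parse one comma-separated data line; return None for anything else."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != 5:
        return None
    staid, souid, date, rr, q_rr = fields
    try:
        rr_value = int(rr)
        quality = int(q_rr)
    except ValueError:
        return None  # header line or malformed record
    # RR is stored in units of 0.1 mm, so divide by 10 to get millimetres.
    rr_mm = None if rr_value == MISSING else rr_value / 10.0
    return Observation(staid, souid, date, rr_mm, quality)
```

For a made-up line such as "   229,100522,20090115,   34,    0" this yields station "229", date "20090115", 3.4 mm of precipitation and quality code 0; a line carrying -9999 yields rr_mm set to None.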

Page 26: An Introduction to the World of Hadoop

Example: ECA Mean Precipitation

Scientific Data Mining: use the Hadoop streaming library and manually pipeline MapReduce jobs together as needed...

- Write Hadoop scripts in Python in two steps
- Test with cat data | map.py | sort | reduce.py > output (not shown)
- Process data into individual files for each time period (Year/Month) of interest using the Hadoop streaming library (local mode)
- Call R in batch mode to produce image files
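The local test in the second step can also be mimicked inside a single Python process, which is handy for unit-testing a mapper and reducer before touching Hadoop. The sketch below is an assumption-laden stand-in: local_streaming_run, mapper and reducer are hypothetical names, and the "station,value" input is made up. It follows the streaming convention of tab-separated key and value on each line.

```python
def local_streaming_run(input_lines, mapper, reducer):
    """Mimic `cat data | map.py | sort | reduce.py` in one process."""
    mapped = [out for line in input_lines for out in mapper(line)]
    mapped.sort()  # stands in for Unix `sort` and for Hadoop's shuffle/sort
    return list(reducer(mapped))


def mapper(line):
    """Toy mapper: "station,value" -> "station<TAB>value"."""
    station, value = line.strip().split(",")
    yield "%s\t%s" % (station, value)


def reducer(sorted_lines):
    """Toy reducer: sum the values of each run of identical keys."""
    last_key, total = None, 0
    for line in sorted_lines:
        key, value = line.split("\t")
        if last_key is not None and key != last_key:
            yield "%s\t%d" % (last_key, total)
            total = 0
        last_key = key
        total += int(value)
    if last_key is not None:
        yield "%s\t%d" % (last_key, total)
```

The reducer relies only on equal keys arriving next to each other, which is exactly what the sort step guarantees, so the same function body works when fed from sys.stdin under Hadoop streaming.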

Page 27: An Introduction to the World of Hadoop

ECA Mean Precipitation: Step One

map_one.py

import sys

# BEGIN_DATE, END_DATE (YYYYMMDD strings) and latlons (station id -> (lat, lon))
# are used below, but their definitions are not shown on the slide.

def lat_lon_to_coord(s):
    # convert "deg:min:sec" to decimal degrees
    d, m, sec = map(float, s.split(":"))
    # take the sign from the text so that e.g. "-00:30:00" stays negative
    sign = -1 if s.strip().startswith("-") else 1
    x = abs(d) + m / 60.0 + sec / 3600.0
    return float(sign * x)

for line in sys.stdin:
    # flds = (staid, souid, date, rr, q_rr)
    flds = line.strip().split(",")
    if len(flds) != 5:
        continue
    staid = flds[0].strip()  # station id
    date = flds[2].strip()   # YYYYMMDD
    if date < BEGIN_DATE or date > END_DATE:
        continue
    rr = flds[3].strip()     # precipitation in 0.1 mm
    q_rr = flds[4].strip()   # quality code "0" = valid
    lat, lon = latlons.get(staid, (None, None))
    if q_rr == '0' and (lat is not None) and (lon is not None):
        # key = (yyyymm, lat, lon), value = precipitation
        print("%s,%.4f,%.4f\t%s" % (date[0:6], lat, lon, rr))

Page 28: An Introduction to the World of Hadoop

ECA Mean Precipitation: Step One (cont)

reduce_one.py

import sys

(last_key, x, n) = (None, 0.0, 0)
for line in sys.stdin:
    (key, val) = line.strip().split("\t")
    if last_key and last_key != key:  # time to emit reduced value
        if n > 0:
            print("%s\t%.2f" % (last_key, x / n))
        x = 0.0
        n = 0
    # accumulate the sum and count for the current key
    # (we just want data for the year 2009)
    (last_key, x, n) = (key, x + float(val), n + 1)

# emit the mean for the final key
if last_key:
    if n > 0:
        print("%s\t%.2f" % (last_key, x / n))
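The same "emit when the key changes" pattern can be written more compactly with itertools.groupby, which groups consecutive items sharing a key. This is an alternative sketch, not the script used in the talk; parse and mean_by_key are hypothetical names, and like the reducer above it assumes the input lines arrive sorted by key.

```python
from itertools import groupby


def parse(lines):
    """Split each "key<TAB>value" line into (key, float value)."""
    for line in lines:
        key, value = line.rstrip("\n").split("\t")
        yield key, float(value)


def mean_by_key(lines):
    """Yield (key, mean of values) for each run of consecutive identical keys."""
    for key, group in groupby(parse(lines), key=lambda pair: pair[0]):
        values = [value for _, value in group]
        yield key, sum(values) / len(values)
```

To use it as a streaming reducer, iterate over mean_by_key(sys.stdin) and print each result as "%s\t%.2f" % (key, mean).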

Page 29: An Introduction to the World of Hadoop

ECA Mean Precipitation: Step Two

Map ((yyyymm,lat,lon),mean) -> (yyyymm, (lat,lon,mean))

map_two.py

import sys

for line in sys.stdin:
    yyyymm_lat_lon, mean_precip = line.strip().split("\t")
    yyyymm, lat, lon = yyyymm_lat_lon.strip().split(",")
    # new key = yyyymm, new value = "lat lon mean"
    print("%s\t%s %s %s" % (yyyymm, lat, lon, mean_precip))

Empty reduce: just write to a local file (hack since we’re running locally)

reduce_two.py

import sys

# write_file(key, values) is called below, but its definition is not shown on the slide.

last_key = None
values = []
for line in sys.stdin:
    (key, val) = line.strip().split("\t")
    if last_key and last_key != key:  # time to emit reduced value
        write_file(last_key, values)
        values = []
    last_key = key
    values.append(val)  # val is a string with three values: lat lon mean

if last_key:
    write_file(last_key, values)
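The slide does not show write_file. One plausible shape, offered purely as an assumption, is a helper that writes each month's values to a file named after the key, one "lat lon mean" string per line. The ".dat" suffix, the out_dir parameter, and the sample values below are all made up for this sketch.

```python
import os
import tempfile


def write_file(key, values, out_dir="."):
    """Hypothetical helper: write one line per value to <out_dir>/<key>.dat."""
    path = os.path.join(out_dir, "%s.dat" % key)
    with open(path, "w") as handle:
        for value in values:
            handle.write(value + "\n")
    return path


# Small demonstration with made-up values, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = write_file("200901",
                      ["41.1187 1.2453 35.20", "48.8900 2.3508 51.75"],
                      out_dir=tmp)
    with open(path) as handle:
        written = handle.read().splitlines()
```

Writing whitespace-separated columns keeps the files easy to load with standard table readers in other tools.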

Page 31: An Introduction to the World of Hadoop

Example: ECA Mean Precipitation

Step One: input -> (yyyymm,lat,lon), mean precip

% hadoop jar /Users/gordon/build/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar \
    -input /Downloads/tarragona/ECA_blend_all_data/RR_STAID00* \
    -output output \
    -mapper /Desktop/tmp/tarragona/python/map_one.py \
    -reducer /Desktop/tmp/tarragona/python/reduce_one.py
10/10/25 19:34:31 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
...
10/10/25 20:07:36 INFO streaming.StreamJob: Job complete: job_local_0001
10/10/25 20:07:36 INFO streaming.StreamJob: Output: output

Step Two: (yyyymm,lat,lon), mean precip -> files(yyyymm)

% hadoop jar /Users/gordon/build/hadoop/contrib/streaming/hadoop-0.20.2-streaming.jar \
    -input output/part-00000 \
    -output output_two \
    -mapper /Desktop/tmp/tarragona/python/map_two.py \
    -reducer /Desktop/tmp/tarragona/python/reduce_two.py
10/10/25 20:41:43 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
...
10/10/25 20:41:47 INFO streaming.StreamJob: Job complete: job_local_0001
10/10/25 20:41:47 INFO streaming.StreamJob: Output: output_two

Page 33: An Introduction to the World of Hadoop

Batch Processing in R

And, after a little batch processing with R...

batch-graphics.R

library(fields)
files <- c("200901.dat", "200902.dat", "200903.dat",
           "200904.dat", "200905.dat", "200906.dat",
           "200907.dat", "200908.dat", "200909.dat",
           "200910.dat", "200911.dat", "200912.dat")
i <- 1
for (f in files) {
    mat <- read.table(f)
    names(mat) <- c("lat", "long", "precip")
    png(filename=paste("precip-", i, ".png", sep=""), height=480, width=480)
    quilt.plot(mat$long, mat$lat, mat$precip, ncol=100, nrow=100,
               ylim=c(22.0, 79.0), xlim=c(-52.0, 72.0),
               col=two.colors(256, start="wheat", end="darkblue", middle="blue"),
               zlim=c(0, 410), add.legend=T, cex.lab=0.6)
    points(1.2453, 41.1187, pch=1)
    text(1.2453, 41.1187, "tarragona", cex=0.8, pos=1)
    points(2.35083, 48.89, pch=1)
    text(2.3508, 48.89, "paris", cex=0.8, pos=4)
    points(12.4823, 41.8955, pch=1)
    text(12.4823, 41.8955, "rome", cex=0.8, pos=4)
    dev.off()
    i <- i + 1
}

Pages 34-45: An Introduction to the World of Hadoop

ECA Precipitation 2009 Month: 1 through Month: 12 [figures: one precipitation map per slide]

Page 46: An Introduction to the World of Hadoop

Summary of What We Did

We worked through a complete example, but that’s not all, since with very little additional work we can...

- Test the scripts in pseudo-distributed mode locally on our own machine
- Run the job on a compute cluster remotely
- Run the job in the cloud with EC2, treating their system as just another remote cluster
- Run the job with Amazon’s Elastic MapReduce (http://aws.amazon.com/elasticmapreduce/), which allows you to pay for exactly as much computing as you use

See [White, 2011] for complete details on how to run in these different modes...

Page 48: An Introduction to the World of Hadoop

Systems Development APIs

And you can build production systems with Hadoop in either Java or C++...

Full-featured Java API for Hadoop

Pipes is the C++ API for Hadoop MapReduce

Cascading is an API for developing general data processing systems that incorporate MapReduce (http://www.cascading.org)

Cascading

Cascading allows developers to model very complex data flows at a higher level than Map and Reduce, and then automatically generate and visualize the dozens or perhaps even hundreds of necessary Hadoop jobs as a graph.
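To make the idea of a flow planned as a graph of jobs concrete, here is a toy sketch. It is not Cascading's API (Cascading is a Java library); the step names and the `plan` helper are hypothetical, and the sketch only shows how a dependency graph can be turned into a valid execution order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical flow: each step maps to the set of steps whose output it needs.
flow = {
    "parse_stations": set(),
    "parse_readings": set(),
    "join": {"parse_stations", "parse_readings"},
    "monthly_mean": {"join"},
    "export": {"monthly_mean"},
}


def plan(dependencies):
    """Return the steps in an order where every step follows its inputs."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for position, step in enumerate(plan(flow), start=1):
        print(position, step)
```

In a real system each node would correspond to one or more Hadoop jobs, and independent nodes (here the two parse steps) could run in parallel.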

Ad Hoc Analysis

What’s missing? Sometimes you need to run fast ad hoc queries... can we do that in a scalable way?

Pig: “Pig is a scripting language for exploring large datasets” [White, 2011] (Yahoo!)

Hive: provides an SQL interface for running ad hoc queries and other data processing tasks for SQL analysts (Facebook)

HBase: column-oriented database along the lines of Google’s Bigtable database (Powerset)

Hypertable: GPL clone of Google’s Bigtable database written in C++ (Zvents)

Google’s Bigtable database is described in [Chang, Dean, Ghemawat, et al., 2008]
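The Bigtable paper describes its data model as a sparse, sorted map indexed by row key, column key, and timestamp. The sketch below is an in-memory toy of that model only, under stated assumptions: the class name `ToyTable`, its methods, and the example keys are hypothetical, and it is not the HBase or Hypertable client API.

```python
from collections import defaultdict


class ToyTable:
    """In-memory sketch of the model: (row, column, timestamp) -> value."""

    def __init__(self):
        # row key -> column key -> {timestamp: value}
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, timestamp, value):
        """Store one timestamped version of a cell."""
        self._rows[row][column][timestamp] = value

    def get(self, row, column):
        """Return the most recent value for a cell, or None if absent."""
        versions = self._rows.get(row, {}).get(column, {})
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self, start_row, stop_row):
        """Yield (row, column, latest value) for start_row <= row < stop_row,
        visiting rows in sorted key order."""
        for row in sorted(self._rows):
            if start_row <= row < stop_row:
                for column in sorted(self._rows[row]):
                    yield row, column, self.get(row, column)


if __name__ == "__main__":
    table = ToyTable()
    table.put("station:0001", "obs:precip", 20091101, 2.0)
    table.put("station:0001", "obs:precip", 20091102, 4.5)
    table.put("station:0002", "obs:precip", 20091101, 0.0)
    print(table.get("station:0001", "obs:precip"))
    print(list(table.scan("station:0001", "station:0002")))
```

Keeping rows in sorted key order is what makes range scans cheap in this family of systems, so the choice of row key largely determines which ad hoc queries are fast.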

Interesting Application Frameworks with Hadoop

Here are a few examples of frameworks, in development or already available, that use Hadoop as a platform...

Apache Mahout: ambitious project to implement popular machine learning algorithms and recommenders with Hadoop (http://mahout.apache.org/)

Graph: Jake Hofman from Yahoo Research has released some of his work on large scale network analysis with Hadoop, with prototype code (http://github.com/jhofman/icwsm2010_tutorial). Also see [Vassilvitskii, 2010] for related graph analysis research.

Application to GIS: Nathan Kerr’s M.S. thesis, with lots of details on how to do GIS with Hadoop (http://www.nathankerr.com/projects/parallel-gis-processing/alternative_approaches_to_parallel_gis_processing.html)

Further Reading

White, T. Hadoop: The Definitive Guide, 2nd Edition. O’Reilly Media, Inc., Sebastopol, CA, 2011.

Sanderson, D. Programming Google App Engine. O’Reilly Media, Inc., Sebastopol, CA, 2009.

Murty, J. Programming Amazon Web Services. O’Reilly Media, Inc., Sebastopol, CA, 2008.

Dean, J. and Ghemawat, S. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.

Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. E. Bigtable: a distributed storage system for structured data. OSDI ’06: Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation, USENIX Assoc., Berkeley, CA, 2006.

MapReduce on Wikipedia. http://en.wikipedia.org/wiki/MapReduce

Vassilvitskii, S. XXL Graph Algorithms. Hadoop Summit 2010. http://developer.yahoo.com/events/hadoopsummit2010/
