Writing Hadoop Jobs in Scala using Scalding


Talk that I gave at #BcnDevCon13 about using Scalding and the strong points of using Scala for Big Data processing.


Writing Hadoop Jobs in Scala using Scalding @tonicebrian

How much storage can $100 buy you?

1980: 1 photo
1990: 5 songs
2000: 7 movies
2010: 600 movies / 170,000 songs / 5 million photos

From single drives… to clusters…

Data Science

“A mathematician is a device for turning coffee into theorems”

Alfréd Rényi

…and a data scientist is a device for turning data into insights.

Hadoop = Map/Reduce + Distributed File System

Map/Reduce is the programming model; the Distributed File System is the storage.

Word Count

Raw:
Hello cruel world
Say hello! Hello!

Map:
hello 1, cruel 1, world 1
say 1, hello 2

Reduce:
hello 3, cruel 1, world 1, say 1

Raw → Map → Reduce → Result
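Before any Hadoop enters the picture, the same stages can be played through on plain Scala collections; a sketch for intuition only (the tokenizer regex here is my own, not from the slides):

val raw = List("Hello cruel world", "Say hello! Hello!")

// Map phase: tokenize each line and emit a (word, 1) pair per token.
val mapped = raw.flatMap { line =>
  line.toLowerCase.replaceAll("[^a-z\\s]", "").split("\\s+").map(word => (word, 1))
}

// Reduce phase: group the pairs by word and sum the counts.
val counted = mapped.groupBy(_._1).map { case (word, ones) => (word, ones.map(_._2).sum) }
// counted == Map("hello" -> 3, "cruel" -> 1, "world" -> 1, "say" -> 1)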

4 Main Characteristics of Scala

• JVM
• Statically Typed
• Object Oriented
• Functional Programming

def map[B](f: (A) ⇒ B): List[B]
Builds a new collection by applying a function to all elements of this list.

def reduce[A1 >: A](op: (A1, A1) ⇒ A1): A1
Reduces the elements of this list using the specified associative binary operator.
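On an ordinary Scala List the pair looks like this (a throwaway REPL-style example):

val lengths = List("hello", "cruel", "world").map(word => word.length)
// lengths == List(5, 5, 5)

val total = lengths.reduce((a, b) => a + b)
// total == 15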

Recap

• Map/Reduce: a programming paradigm that employs concepts from Functional Programming
• Scala: a functional language that runs on the JVM
• Hadoop: an open-source implementation of Map/Reduce on the JVM

So in what language is Hadoop implemented?

The Result?

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
  }
}


High level approaches

• SQL
• Data Transformations

input_lines = LOAD 'myfile.txt' AS (line:chararray);
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
filtered_words = FILTER words BY word MATCHES '\\w+';
word_groups = GROUP filtered_words BY word;
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';

-- myscript.pig
REGISTER myudfs.jar;
A = LOAD 'student_data' AS (name: chararray, age: int, gpa: float);
B = FOREACH A GENERATE myudfs.UPPER(name);
DUMP B;

package myudfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.util.WrappedIOException;

public class UPPER extends EvalFunc<String> {
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0)
      return null;
    try {
      String str = (String) input.get(0);
      return str.toUpperCase();
    } catch (Exception e) {
      throw WrappedIOException.wrap("Caught exception processing input row ", e);
    }
  }
}

Pig for the data flow, Java for the user defined functions (UDFs).

package impatient;

import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexFilter;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.Scheme;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class Main {
  public static void main( String[] args ) {
    String docPath = args[ 0 ];
    String wcPath = args[ 1 ];

    Properties properties = new Properties();
    AppProps.setApplicationJarClass( properties, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

    // create source and sink taps
    Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
    Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

    // specify a regex operation to split the "document" text lines into a token stream
    Fields token = new Fields( "token" );
    Fields text = new Fields( "text" );
    RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
    // only returns "token"
    Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

    // determine the word counts
    Pipe wcPipe = new Pipe( "wc", docPipe );
    wcPipe = new GroupBy( wcPipe, token );
    wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef()
      .setName( "wc" )
      .addSource( docPipe, docTap )
      .addTailSink( wcPipe, wcTap );

    // write a DOT file and run the flow
    Flow wcFlow = flowConnector.connect( flowDef );
    wcFlow.writeDOT( "dot/wc.dot" );
    wcFlow.complete();
  }
}

WordCount in Cascading

Good parts
• Data Flow Programming Model
• User Defined Functions

Bad parts
• Still Java
• Objects for Flows

package com.twitter.scalding.examples

import com.twitter.scalding._

class WordCountJob(args : Args) extends Job(args) {
  TextLine( args("input") )
    .flatMap('line -> 'word) { line : String => tokenize(line) }
    .groupBy('word) { _.size }
    .write( Tsv( args("output") ) )

  // Split a piece of text into individual words.
  def tokenize(text : String) : Array[String] = {
    // Lowercase each word and remove punctuation.
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
  }
}
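Scalding also has a Typed API that carries element types through the whole pipeline; the job above uses the older fields-based API. A sketch along the lines of the word count in the Scalding README:

import com.twitter.scalding._

class TypedWordCountJob(args : Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))
    .flatMap(line => tokenize(line).toIterator)
    .groupBy(word => word)   // use each word as the key
    .size                    // the size of each group is its count
    .write(TypedTsv[(String, Long)](args("output")))

  def tokenize(text : String) : Array[String] =
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
}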

TDD Cycle

Red → Green → Refactor

Broader view

Unit Testing → Acceptance Testing → Continuous Deployment → … Lean Startup

Big Data Big Speed

A typical day working with Hadoop

Is Scalding of any help here?

0 Size of code
1 Types
2 Unit Testing
3 Local execution

1 Types

An extra cycle

Compilation Phase → Unit Testing → Acceptance Testing → Continuous Deployment → Lean Startup

Static type-checking makes you a better programmer™

Fail-fast with type errors

(Int,Int,Int,Int) vs TypedPipe[(Meters,Miles,Celsius,Fahrenheit)]

With bare Ints, mixing units slips through silently:

val w = 5
val x = 5
val y = 5
val z = 5

w + x + y + z = 20

With dedicated unit types, the compiler catches the mistake:

val w = Meters(5)
val x = Miles(5)
val y = Celsius(5)
val z = Fahrenheit(5)

w + x + y + z => type error
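The slides leave the unit types undefined; a minimal sketch of what they could look like in plain Scala (the names mirror the slide, the definitions are hypothetical):

// Hypothetical wrapper types: only same-unit addition is defined.
case class Meters(value: Int) {
  def +(other: Meters): Meters = Meters(value + other.value)
}
case class Miles(value: Int) {
  def +(other: Miles): Miles = Miles(value + other.value)
}
case class Celsius(value: Int)
case class Fahrenheit(value: Int)

val w = Meters(5)
val x = Miles(5)

val ok = w + Meters(2)   // Meters(7)
// w + x                 // does not compile: type mismatch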

2 Unit Testing

How do you test a distributed algorithm without a distributed platform?

Source / Tap

Scalding jobs read and write through Sources and Taps, so a test can swap the real ones for in-memory buffers.

// Scalding
import com.twitter.scalding._
import org.specs._

class WordCountTest extends Specification with TupleConversions {
  "A WordCount job" should {
    JobTest("com.snowplowanalytics.hadoop.scalding.WordCountJob").
      arg("input", "inputFile").
      arg("output", "outputFile").
      source(TextLine("inputFile"), List("0" -> "hack hack hack and hack")).
      sink[(String, Int)](Tsv("outputFile")) { outputBuffer =>
        val outMap = outputBuffer.toMap
        "count words correctly" in {
          outMap("hack") must be_==(4)
          outMap("and") must be_==(1)
        }
      }.
      run.
      finish
  }
}

3 Local Execution

HDFS vs Local

> run-main com.twitter.scalding.Tool MyJob --local

> run-main com.twitter.scalding.Tool MyJob --hdfs

SBT as a REPL
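The whole edit-test-run loop can live inside a single sbt session; a sketch using standard sbt commands (MyJob is a placeholder):

> ~test                                             // re-run the test suite on every save
> run-main com.twitter.scalding.Tool MyJob --local  // try the job against local files
> console                                           // Scala REPL with the project classpath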

More Scalding goodness

• Algebird
• Matrix library
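Algebird supplies algebraic abstractions (semigroups, monoids) that Scalding can use when aggregating; a small standalone sketch of its Map monoid (plain Algebird, no Hadoop involved):

import com.twitter.algebird.Monoid

// Word counts from two shards: the Map monoid merges them,
// summing the values of keys that appear in both.
val shard1 = Map("hello" -> 2, "world" -> 1)
val shard2 = Map("hello" -> 1, "say" -> 1)

val merged = Monoid.plus(shard1, shard2)
// merged == Map("hello" -> 3, "world" -> 1, "say" -> 1)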

Be functional. Questions?