Why Hadoop MapReduce Needs Scala: An Introduction to Scoobi and Scalding


Transcript of Why Hadoop MapReduce Needs Scala: An Introduction to Scoobi and Scalding

Page 1

@agemooij

A Look at Scoobi and Scalding: Scala DSLs for Hadoop

Why Hadoop Needs Scala: Scoobi and Scalding

Page 2

Obligatory “About Me” Slide

Page 3

Hadoop Rocks!

Page 4

But programming Hadoop kinda sucks!

Page 5

"Hello World": Word Count using Hadoop MapReduce

Page 6

• Split lines into words
• Turn each word into a Pair(word, 1)
• Group by word (?)
• For each word, sum the 1s to get the total

A sketch of these steps against the Hadoop MapReduce API is shown below.
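The Java code from the original slide isn't in the transcript; the following sketch shows those four steps against Hadoop's MapReduce API, written in Scala (the language used for the other examples here). The class names TokenizerMapper and SumReducer are illustrative; the Mapper, Reducer, Context, and Writable types are Hadoop's standard org.apache.hadoop.mapreduce API.

import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Mapper, Reducer}
import scala.collection.JavaConverters._

// Map phase: split lines into words and emit a (word, 1) pair for each word
class TokenizerMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one  = new IntWritable(1)
  private val word = new Text()

  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
    value.toString.split("""\s+""").filter(_.nonEmpty).foreach { w =>
      word.set(w)
      context.write(word, one)
    }
}

// Reduce phase: the framework has already grouped the pairs by word; sum the 1s per word
class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit =
    context.write(key, new IntWritable(values.asScala.map(_.get).sum))
}

On top of these two classes you still need driver code (a Job with input/output paths and formats) before anything runs on the cluster, which is what the callouts on the next page refer to.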

Page 7

• Low-level glue code
• Lots of small, unintuitive Mapper and Reducer classes
• Lots of Hadoop intrusiveness (Context, Writables, Exceptions, etc.)
• Separate driver code that actually runs the job on the cluster

Page 8

This does not make me a happy Hadoop developer!

Especially for things that are a little bit more complicated than counting words

• Unintuitive, invasive programming model
• Hard to compose/chain jobs into real, more complicated programs
• Lots of low-level boilerplate code
• Branching, Joins, CoGroups, etc. hard to implement

Page 9

What Are the Alternatives?

Page 10

Counting Words using Apache Pig

Nice! Already a lot better, but anything more complex gets hard pretty fast.

Handy for quick exploration of data!

Pig is hard to customize/extend. And the same goes for Hive.

Page 11

package cascadingtutorial.wordcount;

// Imports for the Cascading 1.x API that this example uses
import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.Scheme;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.Lfs;
import cascading.tap.Tap;
import cascading.tuple.Fields;

/**
 * Wordcount example in Cascading
 */
public class Main
{
  public static void main( String[] args )
  {
    String inputPath  = args[0];
    String outputPath = args[1];

    Scheme inputScheme  = new TextLine(new Fields("offset", "line"));
    Scheme outputScheme = new TextLine();

    // Use an HDFS tap for URI-style paths, a local-filesystem tap otherwise
    Tap sourceTap = inputPath.matches("^[^:]+://.*") ?
                      new Hfs(inputScheme, inputPath) :
                      new Lfs(inputScheme, inputPath);

    Tap sinkTap = outputPath.matches("^[^:]+://.*") ?
                    new Hfs(outputScheme, outputPath) :
                    new Lfs(outputScheme, outputPath);

    // Split each line into words, group by word, and count per group
    Pipe wcPipe = new Each("wordcount",
                           new Fields("line"),
                           new RegexSplitGenerator(new Fields("word"), "\\s+"),
                           new Fields("word"));

    wcPipe = new GroupBy(wcPipe, new Fields("word"));
    wcPipe = new Every(wcPipe, new Count(), new Fields("count", "word"));

    Properties properties = new Properties();
    FlowConnector.setApplicationJarClass(properties, Main.class);

    // Connect source, sink, and pipe assembly into a flow and run it
    Flow parsedLogFlow = new FlowConnector(properties)
                           .connect(sourceTap, sinkTap, wcPipe);
    parsedLogFlow.start();
    parsedLogFlow.complete();
  }
}

Pipes & Filters

Not very intuitive

Lots of boilerplate code

Very powerful!Record Model

Strange new abstraction

Joins & CoGroups

Page 12

Meh... I’m lazy

I want more power with less work!

Page 13

How would we count words in plain Scala?

(My current language of choice)

Page 14
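The slide's Scala snippet isn't preserved in the transcript; a minimal sketch of counting words in plain Scala, using ordinary collections (the file name is just an example), might look like this:

// Count words in a local file with nothing but the standard library
val lines = scala.io.Source.fromFile("input.txt").getLines()

val counts: Map[String, Int] =
  lines
    .flatMap(_.split("""\s+"""))                  // split lines into words
    .filter(_.nonEmpty)
    .toList
    .groupBy(identity)                            // group by word
    .map { case (word, ws) => (word, ws.size) }   // count the occurrences per word

counts.toSeq.sortBy(-_._2).take(10).foreach(println)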

Nice! Familiar, intuitive. What if...?

Page 15

But that code doesn’t scale to my cluster!

Or does it?

Meanwhile at Google...

Page 16

Introducing Scoobi & ScaldingScala DSLs for Hadoop MapReduce

Scalding5%

Scoobi95%

NOTE: My relative familiarity with either platform:

Page 17

http://github.com/nicta/scoobi

A Scala library that implements a higher-level programming model for Hadoop MapReduce

Page 18

Counting Words using Scoobi

• Split lines into words
• Turn each word into a Pair(word, 1)
• Group by word
• For each word, sum the 1s to get the total
• Persisting actually runs the code on the cluster

A sketch of the corresponding Scoobi code is shown below.
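This sketch follows the word count example from the Scoobi documentation of that period; ScoobiApp, persist, and the combine signature follow the 0.x API and may differ slightly from the code on the slide.

import com.nicta.scoobi.Scoobi._

// Word count as a Scoobi job: DList methods instead of Mapper/Reducer classes
object WordCount extends ScoobiApp {
  def run() {
    val lines: DList[String] = fromTextFile(args(0))

    val counts: DList[(String, Int)] =
      lines
        .flatMap(_.split("""\s+"""))          // split lines into words
        .map(word => (word, 1))               // turn each word into a Pair(word, 1)
        .groupByKey                           // group by word
        .combine((a: Int, b: Int) => a + b)   // for each word, sum the 1s

    // persisting triggers compilation into MapReduce job(s) and runs them on the cluster
    persist(toTextFile(counts, args(1)))
  }
}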

Page 19

Scoobi is...

• A distributed collections abstraction:

• Distributed collection objects abstract data in HDFS

• Methods on these objects abstract map/reduce operations

• Programs manipulate distributed collections objects

• Scoobi turns these manipulations into MapReduce jobs

• Based on Google’s FlumeJava / Cascades

• A source code generator (it generates Java code!)

• A job plan optimizer

• Open sourced by NICTA

• Written in Scala (W00t!)

Page 20

DList[T]

• Abstracts storage of data and files on HDFS

• Calling methods on DList objects to transform and manipulate them abstracts the mapper, combiner, sort-and-shuffle, and reducer phases of MapReduce

• Persisting a DList triggers compilation of the graph into one or more MR jobs and their execution

• Very familiar: like standard Scala Lists

• Strongly typed

• Parameterized with rich types and Tuples

• Easy list manipulation using typical higher order functions like map, flatMap, filter, etc.

Page 21

DList[T]
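The DList code from this slide isn't in the transcript. Below is a small, hypothetical example (the Visit type, the input format, and the 10-second threshold are all invented) of the strongly typed, collection-like manipulation described on the previous page. Depending on the Scoobi version, a case class element type may need an implicit WireFormat (for example via a helper such as mkCaseWireFormat), even though case classes are supported out of the box, as the next pages discuss.

import com.nicta.scoobi.Scoobi._

object VisitStats extends ScoobiApp {
  // An ordinary Scala case class used as the element type of a DList
  case class Visit(userId: Long, url: String, durationMs: Int)
  // Depending on the Scoobi version you may need something like:
  // implicit val visitFormat = mkCaseWireFormat(Visit, Visit.unapply _)

  def run() {
    // Parse tab-separated "userId<TAB>url<TAB>durationMs" lines into typed records
    val visits: DList[Visit] =
      fromTextFile(args(0)).map { line =>
        val Array(id, url, ms) = line.split("\t")
        Visit(id.toLong, url, ms.toInt)
      }

    // Familiar higher-order functions on typed elements; Scoobi plans the MR jobs
    val longVisitsPerUser: DList[(Long, Int)] =
      visits
        .filter(_.durationMs > 10000)
        .map(v => (v.userId, 1))
        .groupByKey
        .combine((a: Int, b: Int) => a + b)

    persist(toTextFile(longVisitsPerUser, args(1)))
  }
}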

Page 22

IO
• Can read/write text files, Sequence files, and Avro files
• Can influence sorting (raw, secondary)

Serialization
• Serialization of custom types through Scala type classes and WireFormat[T]
• Scoobi implements WireFormat[T] for primitive types, strings, tuples, Option[T], Either[T], Iterable[T], etc.
• Out-of-the-box support for serialization of Scala case classes

Page 23

IO/Serialization I
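The code for this slide isn't in the transcript either. As a hedged sketch of the IO side: reading text and persisting the same DList to several formats. Scoobi 0.4 added Avro and Sequence file support (see page 25); the function names toTextFile, toSequenceFile, and toAvroFile follow the Scoobi docs of that era and may differ between versions.

import com.nicta.scoobi.Scoobi._

object IoExample extends ScoobiApp {
  def run() {
    // Build (word, length) pairs, a tuple type Scoobi can serialize out of the box
    val pairs: DList[(String, Int)] =
      fromTextFile(args(0)).flatMap(_.split("""\s+""")).map(w => (w, w.length))

    // Persist the same DList to text, SequenceFile, and Avro outputs in one go
    persist(
      toTextFile(pairs, args(1) + "/text"),
      toSequenceFile(pairs, args(1) + "/seq"),
      toAvroFile(pairs, args(1) + "/avro")
    )
  }
}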

Page 24

IO/Serialization II

For normal (i.e. non-case) classes
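For these, the slide presumably showed a hand-written type-class instance. A sketch, assuming the WireFormat shape (toWire/fromWire over DataOutput/DataInput) shown in the Scoobi documentation of that era; the Trade class is invented, and package paths may differ slightly between Scoobi versions.

import java.io.{DataInput, DataOutput}
import com.nicta.scoobi.Scoobi._

object TradeFormats {
  // A normal (non-case) class we want to put into a DList
  class Trade(val time: Long, val symbol: String, val price: Double)

  // Hand-written WireFormat instance so Scoobi knows how to (de)serialize Trade
  implicit val tradeWireFormat: WireFormat[Trade] = new WireFormat[Trade] {
    def toWire(t: Trade, out: DataOutput): Unit = {
      out.writeLong(t.time)
      out.writeUTF(t.symbol)
      out.writeDouble(t.price)
    }
    def fromWire(in: DataInput): Trade =
      new Trade(in.readLong(), in.readUTF(), in.readDouble())
  }
}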

Page 25

Further Info

http://nicta.github.com/scoobi/

Mailing list: …@googlegroups.com

Version 0.4 released today (!)
• Avro, Sequence Files
• Materialized DObjects
• DList reduction methods (product, min, etc.)
• Vastly improved testing support
• Less overhead
• Much more

Page 26

Scalding!

http://github.com/twitter/scalding

A Scala library that implements a higher-level programming model for Cascading (rather than directly for Hadoop MapReduce)

Page 27

Counting Words using Scalding
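Again the slide's code isn't in the transcript; this sketch is close to the fields-based word count in the Scalding README of that time (the input/output argument names are conventions, not requirements):

import com.twitter.scalding._

// Word count with Scalding's fields-based API on top of Cascading
class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))                        // read lines; each line lands in the 'line field
    .flatMap('line -> 'word) { line: String =>   // split lines into words
      line.split("""\s+""").filter(_.nonEmpty)
    }
    .groupBy('word) { _.size }                   // group by word and count occurrences
    .write(Tsv(args("output")))                  // write tab-separated (word, count) rows
}

Such a job is typically launched through com.twitter.scalding.Tool (or the scald.rb script) with --hdfs or --local plus the --input and --output arguments.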

Page 28

Scalding is...

• A distributed collections abstraction

• A wrapper around Cascading (i.e. no source code generation)

• Based on the same record model (i.e. named fields)

• Less strongly typed

• Uses Kryo Serialization

• Used by Twitter in production

• Written in Scala (W00t!)

Page 30

How do they compare?

• Different approaches, similar power
• Small feature differences, which will even out over time
• Scoobi gets a little closer to idiomatic Scala
• Twitter is definitely a bigger fish than NICTA, so Scalding gets all the attention
• Both open sourced (last year)
• Scoobi has better docs!

Page 31

Which one should I use?

Ehm... I’m extremely prejudiced!

Page 32

Questions?