Writing Hadoop Jobs in Scala using Scalding


Talk that I gave at #BcnDevCon13 about using Scalding and the strong points of using Scala for Big Data processing.


Writing Hadoop Jobs in Scala using Scalding @tonicebrian

How much storage can $100 buy you?

1980: 1 photo
1990: 5 songs
2000: 7 movies
2010: 600 movies / 170,000 songs / 5 million photos

From single drives… to clusters…

Data Science

“A mathematician is a device for turning coffee into theorems”

Alfréd Rényi

…and a data scientist is a device for turning data into insights.

Hadoop = Map/Reduce + Distributed File System

Map/Reduce is the programming model; the Distributed File System is the storage.

Word Count

Raw:
Hello cruel world
Say hello! Hello!

Map:
hello 1, cruel 1, world 1
say 1, hello 2

Reduce:
hello 3, cruel 1, world 1, say 1

Raw → Map → Reduce → Result
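Before any Hadoop enters the picture, the same stages can be played through on plain Scala collections; a sketch for intuition only (the tokenizer regex here is my own, not from the slides):

val raw = List("Hello cruel world", "Say hello! Hello!")

// Map phase: tokenize each line and emit a (word, 1) pair per token.
val mapped = raw.flatMap { line =>
  line.toLowerCase.replaceAll("[^a-z\\s]", "").split("\\s+").map(word => (word, 1))
}

// Reduce phase: group the pairs by word and sum the counts.
val counted = mapped.groupBy(_._1).map { case (word, ones) => (word, ones.map(_._2).sum) }
// counted == Map("hello" -> 3, "cruel" -> 1, "world" -> 1, "say" -> 1)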

4 Main Characteristics of Scala

• JVM
• Statically Typed
• Object Oriented
• Functional Programming

def map[B](f: (A) ⇒ B): List[B]
Builds a new collection by applying a function to all elements of this list.

def reduce[A1 >: A](op: (A1, A1) ⇒ A1): A1
Reduces the elements of this list using the specified associative binary operator.
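On an ordinary Scala List the pair looks like this (a throwaway REPL-style example):

val lengths = List("hello", "cruel", "world").map(word => word.length)
// lengths == List(5, 5, 5)

val total = lengths.reduce((a, b) => a + b)
// total == 15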

Recap

• Map/Reduce: a programming paradigm that employs concepts from Functional Programming
• Scala: a functional language that runs on the JVM
• Hadoop: an open-source implementation of Map/Reduce on the JVM

So in what language is Hadoop implemented?

The Result?

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
  }
}


High level approaches

• SQL
• Data Transformations

input_lines = LOAD 'myfile.txt' AS (line:chararray);
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
filtered_words = FILTER words BY word MATCHES '\\w+';
word_groups = GROUP filtered_words BY word;
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';

-- myscript.pig
REGISTER myudfs.jar;
A = LOAD 'student_data' AS (name: chararray, age: int, gpa: float);
B = FOREACH A GENERATE myudfs.UPPER(name);
DUMP B;

package myudfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.util.WrappedIOException;

public class UPPER extends EvalFunc<String> {
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0)
      return null;
    try {
      String str = (String) input.get(0);
      return str.toUpperCase();
    } catch (Exception e) {
      throw WrappedIOException.wrap("Caught exception processing input row ", e);
    }
  }
}

Pig for the data flow, Java for the user defined functions (UDFs).

package impatient;

import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexFilter;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.Scheme;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class Main {
  public static void main( String[] args ) {
    String docPath = args[ 0 ];
    String wcPath = args[ 1 ];

    Properties properties = new Properties();
    AppProps.setApplicationJarClass( properties, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

    // create source and sink taps
    Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
    Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

    // specify a regex operation to split the "document" text lines into a token stream
    Fields token = new Fields( "token" );
    Fields text = new Fields( "text" );
    RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
    // only returns "token"
    Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

    // determine the word counts
    Pipe wcPipe = new Pipe( "wc", docPipe );
    wcPipe = new GroupBy( wcPipe, token );
    wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef()
      .setName( "wc" )
      .addSource( docPipe, docTap )
      .addTailSink( wcPipe, wcTap );

    // write a DOT file and run the flow
    Flow wcFlow = flowConnector.connect( flowDef );
    wcFlow.writeDOT( "dot/wc.dot" );
    wcFlow.complete();
  }
}

WordCount in Cascading

Good parts
• Data Flow Programming Model
• User Defined Functions

Bad parts
• Still Java
• Objects for Flows

package com.twitter.scalding.examples

import com.twitter.scalding._

class WordCountJob(args : Args) extends Job(args) {
  TextLine( args("input") )
    .flatMap('line -> 'word) { line : String => tokenize(line) }
    .groupBy('word) { _.size }
    .write( Tsv( args("output") ) )

  // Split a piece of text into individual words.
  def tokenize(text : String) : Array[String] = {
    // Lowercase each word and remove punctuation.
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
  }
}
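Scalding also has a Typed API that carries element types through the whole pipeline; the job above uses the older fields-based API. A sketch along the lines of the word count in the Scalding README:

import com.twitter.scalding._

class TypedWordCountJob(args : Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))
    .flatMap(line => tokenize(line).toIterator)
    .groupBy(word => word)   // use each word as the key
    .size                    // the size of each group is its count
    .write(TypedTsv[(String, Long)](args("output")))

  def tokenize(text : String) : Array[String] =
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
}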

TDD Cycle

Red → Green → Refactor

Broader view

Unit Testing → Acceptance Testing → Continuous Deployment → … Lean Startup

Big Data Big Speed

A typical day working with Hadoop

Is Scalding of any help here?

0 Size of code
1 Types
2 Unit Testing
3 Local execution

1 Types

An extra cycle

Compilation Phase → Unit Testing → Acceptance Testing → Continuous Deployment → Lean Startup

Static type-checking makes you a better programmer™

Fail-fast with type errors

(Int,Int,Int,Int) vs TypedPipe[(Meters,Miles,Celsius,Fahrenheit)]

With bare Ints, mixing units slips through silently:

val w = 5
val x = 5
val y = 5
val z = 5

w + x + y + z = 20

With dedicated unit types, the compiler catches the mistake:

val w = Meters(5)
val x = Miles(5)
val y = Celsius(5)
val z = Fahrenheit(5)

w + x + y + z => type error
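The slides leave the unit types undefined; a minimal sketch of what they could look like in plain Scala (the names mirror the slide, the definitions are hypothetical):

// Hypothetical wrapper types: only same-unit addition is defined.
case class Meters(value: Int) {
  def +(other: Meters): Meters = Meters(value + other.value)
}
case class Miles(value: Int) {
  def +(other: Miles): Miles = Miles(value + other.value)
}
case class Celsius(value: Int)
case class Fahrenheit(value: Int)

val w = Meters(5)
val x = Miles(5)

val ok = w + Meters(2)   // Meters(7)
// w + x                 // does not compile: type mismatch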

2 Unit Testing

How do you test a distributed algorithm without a distributed platform?

Source / Tap

Scalding jobs read and write through Sources and Taps, so a test can swap the real ones for in-memory buffers.

// Scalding
import com.twitter.scalding._
import org.specs._

class WordCountTest extends Specification with TupleConversions {
  "A WordCount job" should {
    JobTest("com.snowplowanalytics.hadoop.scalding.WordCountJob").
      arg("input", "inputFile").
      arg("output", "outputFile").
      source(TextLine("inputFile"), List("0" -> "hack hack hack and hack")).
      sink[(String, Int)](Tsv("outputFile")) { outputBuffer =>
        val outMap = outputBuffer.toMap
        "count words correctly" in {
          outMap("hack") must be_==(4)
          outMap("and") must be_==(1)
        }
      }.
      run.
      finish
  }
}

3 Local Execution

HDFS vs Local

> run-main com.twitter.scalding.Tool MyJob --local

> run-main com.twitter.scalding.Tool MyJob --hdfs

SBT as a REPL
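The whole edit-test-run loop can live inside a single sbt session; a sketch using standard sbt commands (MyJob is a placeholder):

> ~test                                             // re-run the test suite on every save
> run-main com.twitter.scalding.Tool MyJob --local  // try the job against local files
> console                                           // Scala REPL with the project classpath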

More Scalding goodness

• Algebird
• Matrix library
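Algebird supplies algebraic abstractions (semigroups, monoids) that Scalding can use when aggregating; a small standalone sketch of its Map monoid (plain Algebird, no Hadoop involved):

import com.twitter.algebird.Monoid

// Word counts from two shards: the Map monoid merges them,
// summing the values of keys that appear in both.
val shard1 = Map("hello" -> 2, "world" -> 1)
val shard2 = Map("hello" -> 1, "say" -> 1)

val merged = Monoid.plus(shard1, shard2)
// merged == Map("hello" -> 3, "world" -> 1, "say" -> 1)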

Be functional. Questions?