Introduction to Spark with Scala

Date posted: 15-Jul-2015

Category: Engineering

Himanshu Gupta, Software Consultant, Knoldus Software LLP

Who am I?

Himanshu Gupta (@himanshug735)

Software Consultant at Knoldus Software LLP

Spark & Scala enthusiast

Agenda

What is Spark?

Why do we need Spark?

Brief introduction to RDD

Brief introduction to Spark Streaming

How to install Spark?

Demo

What is Apache Spark?

A fast and general engine for large-scale data processing, with libraries for SQL, streaming, and advanced analytics.

Spark History

2009: Project begins at UCB AMPLab

2010: Open sourced

2013: Apache Incubator; Spark Summit 2013

2014: Apache top-level project; Spark Summit 2014

2015: Data Frames

Other milestone: Cloudera support

Spark Stack

Img src - http://spark.apache.org/

Fastest Growing Open Source Project

Img src - https://databricks.com/blog/2015/03/31/spark-turns-five-years-old.html

Why do we need Spark?

Code Size

Img src - http://spark-summit.org/wp-content/uploads/2013/10/Zaharia-spark-summit-2013-matei.pdf

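Word Count Example

Hadoop MapReduce (Java):

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Both classes are nested inside the job's driver class (not shown).
// Mapper: emit (word, 1) for every token in a line.
public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

// Reducer: sum the counts collected for each word.
public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Spark (Scala):

// spark is assumed to be an existing SparkContext.
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")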

Daytona GraySort Record: data to sort, 100 TB

Img src - http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015

Hadoop (2013):

2100 nodes

72 minutes

Spark (2014):

206 nodes

23 minutes

Runs Everywhere

Img src - http://spark.apache.org/

Who is using Apache Spark?

Img src - http://www.slideshare.net/datamantra/introduction-to-apache-spark-45062010

Brief Introduction to RDD

RDD stands for Resilient Distributed Dataset

A fault-tolerant, distributed collection of objects.

In Spark, all work is expressed in one of the following ways:

Creating new RDD(s)

Transforming existing RDD(s)

Calling operations on RDD(s)
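
The three kinds of work listed above can be sketched in a few lines. This is a minimal illustration, not part of the original slides: the object name RddBasics, the method evenSquares, the app name "rdd-basics", and the input range 1 to 10 are all assumptions made for the example. It assumes Spark core is on the classpath and runs in local mode.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example object; the names are illustrative only.
object RddBasics {
  def evenSquares(sc: SparkContext): Array[Int] = {
    // 1. Create a new RDD from a local collection.
    val numbers = sc.parallelize(1 to 10)
    // 2. Transform existing RDDs; each call returns another RDD.
    val squares = numbers.map(n => n * n)
    val even = squares.filter(_ % 2 == 0)
    // 3. Call an operation that returns a result to the driver.
    even.collect()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("rdd-basics")
    val sc = new SparkContext(conf)
    try println(evenSquares(sc).mkString(", ")) // prints 4, 16, 36, 64, 100
    finally sc.stop()
  }
}
```

Keeping the RDD logic in a method that takes the SparkContext as a parameter makes it easy to call the same code from a test or from the Spark shell, where a context already exists.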

Example (RDD)

import org.apache.spark.{SparkConf, SparkContext}

// This is the Spark Configuration.
// An application name is set as well, because SparkContext will not start without one;
// the name "WordCount" is an arbitrary choice.
val master = "local"
val conf = new SparkConf().setMaster(master).setAppName("WordCount")

// This is the Spark Context.
val sc = new SparkContext(conf)

// Extract lines from text file.
val lines = sc.textFile("demo.txt")

// Map lines to words (map): each word becomes a (word, 1) pair.
val words = lines.flatMap(_.split(" ")).map((_, 1))

// Word Count RDD (groupBy): add up the 1s for each word.
val wordCountRDD = words.reduceByKey(_ + _)

// collect starts the computation and returns an Array of (word, count) pairs.
val wordCount = wordCountRDD.collect

Transformation: map, groupBy (the flatMap, map, and reduceByKey calls above)

Action: collect
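
The Transformation/Action split means that map and reduceByKey only describe work, and nothing runs until an action such as collect is called. As a rough analogy in plain Scala (this is not Spark code), an Iterator behaves in a similar way: map is recorded lazily and toList forces evaluation. The object name LazyDemo and the counter evaluated are hypothetical, introduced only for this sketch.

```scala
// Plain-Scala analogy for lazy transformations; no Spark required.
object LazyDemo {
  // Returns (calls made before forcing, calls made after forcing, result).
  def run(): (Int, Int, List[Int]) = {
    var evaluated = 0
    // "Transformation": Iterator.map is lazy, so the function has not run yet.
    val doubled = Iterator(1, 2, 3).map { n => evaluated += 1; n * 2 }
    val before = evaluated
    // "Action": toList walks the iterator and runs the function on each element.
    val result = doubled.toList
    (before, evaluated, result)
  }

  def main(args: Array[String]): Unit =
    println(run()) // prints (0,3,List(2, 4, 6))
}
```

In Spark the same idea applies across a cluster: deferring work until an action lets the engine see the whole chain of transformations before it runs anything.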

Brief Introduction to Spark Streaming

Img src - http://spark.apache.org/

In