Apache Spark: Moving on from Hadoop

Víctor Sánchez Anguix
Universitat Politècnica de València
MSc. in Artificial Intelligence, Pattern Recognition, and Digital Image
Course 2014/2015

Hadoop is unbeatable (?)

[Figure: Google Trends]

What is Apache Spark?

➢ Open source cluster computing
➢ Distributed disk → Distributed memory
➢ Created at UC Berkeley in 2009
➢ Last major release: Dec 2014
➢ https://spark.apache.org/

Resilient Distributed Dataset (RDD)

➢ Core concept in Spark
➢ Distributed collection of objects in memory
➢ Operations on RDDs run in parallel
➢ Read from file, distributed file system, or parallelize an existing collection

Resilient Distributed Dataset (RDD)

➢ RDDs are fault tolerant
➢ Spark maintains the DAG of operations that produces an RDD, so lost chunks can be recomputed
➢ We can cache RDDs to save computations

Spark Architecture

[Figure: Spark architecture diagram]
➢ Spark context: interacts with the cluster
➢ Driver program: main program, coordinates tasks
➢ Cluster manager: assigns resources
➢ Executors: carry out tasks, manage RDD chunks

Driver program

➢ Main application for a Spark script
➢ Creates Spark context and coordinates executors
➢ Executes instructions in Java/Python/Scala
➢ ONLY parallelizes operations on RDDs

Programming in Spark

➢ We will stick to Scala
➢ Functional programming
➢ Completely integrated with Java
➢ Shorter code due to Scala abstractions

Apache Spark: Moving on from Hadoop. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

Programming in Spark

➢ We have an interactive shell

# Local mode: the number in brackets is the number of cores to use
spark-shell --master local[4]

# Use resources from a YARN cluster, like Hadoop
spark-shell --master yarn-client

Programming in Spark

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.2.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_25)
Type in expressions to have them evaluated.
Type :help for more information.
15/02/02 11:43:03 INFO repl.SparkILoop: Created spark context..
Spark context available as sc.

Spark context

➢ Special type of object
➢ Interacts with distributed resources:
  ○ Reads data
  ○ Adds resources (e.g., jars, files) to the cluster
  ○ Creates RDDs
  ○ etc.
➢ In the Spark shell, it is created automatically and is available as sc

Loading a text file

➢ Load from local filesystem

// val declares a final variable: its content does not change
val Students = sc.textFile( "file:///home/victor.sanchez/students.tsv" )

// The result is an RDD of Strings
Students: org.apache.spark.rdd.RDD[String] = file://home/victor.sanchez/students.tsv MappedRDD[1] at textFile at <console>:12

➢ Load from HDFS

val Students = sc.textFile( "hdfs://localhost/user/victor.sanchez/students.tsv" )

What's in my dataset?

➢ Take n elements from RDD

// The argument is the number of elements to take
val x = Students.take( 3 )

// The result is an Array of Strings. It is LOCAL: it resides in the driver (master)
x: Array[String] = Array(1 John Doe M 18, 2 Mary Doe F 20, 3 Lara Croft F 25)

➢ Get whole RDD into driver

// If a method takes no arguments, there is no need for ()
val x = Students.collect

x: Array[String] = Array(1 John Doe M 18, 2 Mary Doe F 20, 3 Lara Croft F 25, 4 Sherlock Holmes M 36, 5 John Watson M 38, 6 Sarah Kerrigan F 21, 7 Bruce Wayne M 32, 8 Tony Stark M 33, 9 Princess Peach F 21, 10 Peter Parker M 23)

Can I go the reverse way?

➢ Parallelize collection from driver

// Array creation
val myArray = Array( 1, 2, 3, 4, 5 )
myArray: Array[Int] = Array(1, 2, 3, 4, 5)

val myArrayPar = sc.parallelize( myArray )
myArrayPar: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[4] at parallelize at <console>:14

➢ Broadcast a variable (only sent once)

val x = 6
val xBroad = sc.broadcast( x )
xBroad: org.apache.spark.broadcast.Broadcast[Int] = Broadcast(4)

Basic operations on RDDs

➢ Map: Project or generate new data

// For each l in Students, generate its split
val StudentsF = Students.map( l => l.split( "\t", -1 ) )
StudentsF: org.apache.spark.rdd.RDD[Array[String]] = MappedRDD[6] at map

StudentsF.take( 2 )
res6: Array[Array[String]] = Array(Array(1, John, Doe, M, 18), Array(2, Mary, Doe, F, 20))

➢ It really takes an anonymous function as arg:

// Input parameters go to the left of =>
(l:String) => l.split( "\t", -1 )
(x:Int,y:Int) => x+y


Basic operations on RDDs

➢ Map: Project or generate new data

// A named function can be passed instead of an anonymous one
def splitWrapped(line:String) : Array[String] = { line.split( "\t", -1 ) }
val StudentsF = Students.map( splitWrapped )
StudentsF: org.apache.spark.rdd.RDD[Array[String]] = MappedRDD[6] at map

StudentsF.take( 2 )
res6: Array[Array[String]] = Array(Array(1, John, Doe, M, 18), Array(2, Mary, Doe, F, 20))



➢ Generate a new RDD for students with a new field indicating whether the student is under 25 years old
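One possible approach to this exercise, sketched on a plain Scala collection so that it runs without a cluster. The object name `Under25Sketch`, the helper `withUnder25`, and the two sample rows are assumptions made for illustration; in the Spark shell the same function could be passed to `Students.map`.

```scala
object Under25Sketch {
  // Hypothetical sample rows in the students.tsv layout: id, name, surname, gender, age
  val lines: Seq[String] = Seq("1\tJohn\tDoe\tM\t18", "3\tLara\tCroft\tF\t25")

  // Split a line and append "true" or "false" depending on whether age < 25
  def withUnder25(line: String): Array[String] = {
    val fields = line.split("\t", -1)
    fields :+ (fields(4).toInt < 25).toString
  }

  def main(args: Array[String]): Unit =
    lines.map(withUnder25).foreach(f => println(f.mkString(",")))
}
```

The new field is kept as a String so that every record remains an Array[String], like the rest of the examples.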


Basic operations on RDDs

➢ Foreach: Perform operation over each object
➢ Does not return a new RDD!

StudentsF.foreach( x => print( x( 1 ) + " " ) )

John Bruce Tony Princess Peter
15/02/03 09:17:04 INFO executor.Executor: Finished task 1.0 in stage 13.0 (TID 27). 1693 bytes result sent to driver
Mary Lara Sherlock John Sarah
15/02/03 09:17:04 INFO executor.Executor: Finished task 0.0 in stage 13.0 (TID 26). 1693 bytes result sent to driver

Basic operations on RDDs

➢ Filter: keep the elements fulfilling a condition

// The argument is an anonymous function; toInt converts the id to an integer
val StudentsFilt = StudentsF.filter( s => s( 0 ).toInt > 3 )
StudentsFilt: org.apache.spark.rdd.RDD[Array[String]] = FilteredRDD[13] at filter at <console>:16

StudentsFilt.take( 2 )
res13: Array[Array[String]] = Array(Array(4, Sherlock, Holmes, M, 36), Array(5, John, Watson, M, 38))

Basic operations on RDDs

➢ Distinct: only different objects

val StudentsDis = StudentsF.map( s => s( 3 ) ).distinct
StudentsDis: org.apache.spark.rdd.RDD[String] = MappedRDD[18] at distinct at <console>:16

StudentsDis.take( 2 )
res16: Array[String] = Array(F, M)

Basic operations on RDDs

➢ Fold: Reduce all objects to a single object
➢ Beware, the dummy value is applied more than once: once per partition and once more when the partition results are merged

// dummyStudent is the starting left operand; acc is the left operand
val dummyStudent = Array( "12", "Clark", "Kent", "M", "25" )
val StudentsFold = StudentsF.fold( dummyStudent )( (acc,value) => { if ( value( 4 ).toInt > acc( 4 ).toInt) value else acc } )
StudentsFold: Array[String] = Array(5, John, Watson, M, 38)

val StudentsFold = StudentsF.fold( dummyStudent )( (acc,value) => { Array( "[" + acc( 0 ) + "-" + value( 0 ) + "]" , acc( 1 ), acc( 2 ), acc( 3 ), acc( 4 ) ) } )
StudentsFold: Array[String] = Array([[12-[[[[12-7]-8]-9]-10]]-[[[[[[12-1]-2]-3]-4]-5]-6]], Clark, Kent, M, 0)
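The repeated dummy value can be reproduced without a cluster. The following is a minimal sketch that imitates fold on explicit partitions with plain Scala collections; `FoldZeroSketch` and `partitionedFold` are illustrative names, not Spark API, and the partitioning is an assumption chosen for the example.

```scala
object FoldZeroSketch {
  // Imitates fold over explicit partitions: the zero value starts the fold
  // inside every partition and once more when the partial results are merged.
  def partitionedFold[T](partitions: Seq[Seq[T]], zero: T)(op: (T, T) => T): T = {
    val perPartition = partitions.map(p => p.foldLeft(zero)(op))
    perPartition.foldLeft(zero)(op)
  }

  def main(args: Array[String]): Unit = {
    val parts = Seq(Seq(1, 2, 3), Seq(4, 5))
    println(partitionedFold(parts, 0)(_ + _))  // neutral zero: plain sum
    println(partitionedFold(parts, 10)(_ + _)) // non-neutral zero is added three times
  }
}
```

With two partitions the zero value takes part three times, which is why the dummy should be neutral for the operation (0 for a sum, for instance).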

Basic operations on RDDs

➢ Reduce: Reduce all objects to a single object

// The argument is a binary operator; it must be commutative and associative
val StudentsRed = StudentsF.map( s => s( 4 ).toInt ).reduce( _ + _ )
StudentsRed: Int = 267
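Why the operator has to be commutative and associative can be illustrated with plain Scala collections. In this sketch, `ReduceOrderSketch`, `partitionedReduce`, and the two hypothetical ways of splitting the same numbers are assumptions for illustration only: each partition is reduced first, and the partial results are reduced afterwards.

```scala
object ReduceOrderSketch {
  // Reduce every non-empty partition, then reduce the partial results
  def partitionedReduce(partitions: Seq[Seq[Int]])(op: (Int, Int) => Int): Int =
    partitions.filter(_.nonEmpty).map(_.reduce(op)).reduce(op)

  // The same four numbers, split into partitions in two different ways
  val splitA: Seq[Seq[Int]] = Seq(Seq(1, 2), Seq(3, 4))
  val splitB: Seq[Seq[Int]] = Seq(Seq(1), Seq(2, 3, 4))

  def main(args: Array[String]): Unit = {
    // Addition: same result for both splits
    println(partitionedReduce(splitA)(_ + _) == partitionedReduce(splitB)(_ + _))
    // Subtraction: the result depends on how the data was split
    println(partitionedReduce(splitA)(_ - _) == partitionedReduce(splitB)(_ - _))
  }
}
```

Since the user does not normally control how elements are distributed or in which order partial results are merged, an operator such as subtraction gives results that cannot be relied on.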

Basic operations on RDDs

➢ Max:

val StudentsMax = StudentsF.map( s => s( 4 ).toInt ).max
StudentsMax: Int = 38

➢ Min:

val StudentsMin = StudentsF.map( s => s( 4 ).toInt ).min
StudentsMin: Int = 18

Basic operations on RDDs

➢ Count:

val StudentsCount = StudentsF.count
StudentsCount: Long = 10

➢ CountByValue: Count repetitions of elements

val StudentsCount = StudentsF.map( s => s( 3 ) ).countByValue
StudentsCount: scala.collection.Map[String,Long] = Map(M -> 6, F -> 4)

➢ Count the number of students that are female
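A possible way to approach this exercise, shown on a plain Scala collection. `CountFemaleSketch` and `isFemale` are illustrative names, and the three rows are a hypothetical excerpt of the dataset; on the RDD, the same predicate could be used as `StudentsF.filter( isFemale ).count`.

```scala
object CountFemaleSketch {
  // Hypothetical excerpt of the split rows: id, name, surname, gender, age
  val rows: Seq[Array[String]] = Seq(
    Array("1", "John", "Doe", "M", "18"),
    Array("2", "Mary", "Doe", "F", "20"),
    Array("3", "Lara", "Croft", "F", "25"))

  // The gender is the fourth field (index 3)
  def isFemale(s: Array[String]): Boolean = s(3) == "F"

  // Number of rows that satisfy the predicate
  val females: Int = rows.count(isFemale)

  def main(args: Array[String]): Unit = println(females)
}
```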


Basic operations on RDDs

➢ Sample:

// Arguments: with replacement, and fraction
val StudentsSample = StudentsF.sample( true, 0.5 )
StudentsSample: org.apache.spark.rdd.RDD[Array[String]] = PartitionwiseSampledRDD[33]

StudentsSample.take( 3 )
res18: Array[Array[String]] = Array(Array(1, John, Doe, M, 18), Array(2, Mary, Doe, F, 20), Array(2, Mary, Doe, F, 20))

➢ RandomSplit: Splits into random RDDs

// The argument holds the weight of each resulting split
val StudentsSplit = StudentsF.randomSplit( Array( 0.8, 0.2 ) )
StudentsSplit: Array[org.apache.spark.rdd.RDD[Array[String]]] = Array(PartitionwiseSampledRDD[46]

StudentsSplit( 0 ).collect
res26: Array[Array[String]] = Array(Array(1, John, Doe, M, 18), Array(2, Mary, Doe, F, 20), Array(5, John, Watson, M, 38), Array(6, Sarah, Kerrigan, F, 21), Array(9, Princess, Peach, F, 21), Array(10, Peter, Parker, M, 23))

Basic operations on RDDs

➢ SortBy: Sort elements according to value

// The function returns the value to sort by
val StudentsSorted = StudentsF.sortBy( x => x( 4 ) )

Array(Array(1, John, Doe, M, 18), Array(2, Mary, Doe, F, 20), Array(6, Sarah, Kerrigan, F, 21), ...

➢ Top: Get largest elements

// The argument is the number k of elements to select
val StudentsTop = StudentsF.map( s => s( 4 ) ).top( 3 )
StudentsTop: Array[String] = Array(38, 36, 33)
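Both examples above compare ages as Strings, which works here because every age in the sample has two digits. A short plain-Scala sketch of what could go wrong otherwise; `SortKeySketch` and the age values are hypothetical and chosen only to expose the difference.

```scala
object SortKeySketch {
  // Hypothetical ages with different numbers of digits
  val ages: Seq[String] = Seq("38", "9", "100", "25")

  // Lexicographic order: strings are compared character by character
  val asStrings: Seq[String] = ages.sorted

  // Numeric order: convert to Int in the sort key
  val asNumbers: Seq[String] = ages.sortBy(_.toInt)

  def main(args: Array[String]): Unit = {
    println(asStrings.mkString(", "))
    println(asNumbers.mkString(", "))
  }
}
```

Under that assumption, converting inside the key function, as in `sortBy( x => x( 4 ).toInt )`, keeps the numeric order regardless of the number of digits.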

Basic operations on RDDs

➢ Union: Two RDDs into one

val StudentsUnder25 = StudentsF.filter( s => s( 4 ).toInt < 25 )
val StudentsOver30 = StudentsF.filter( s => s( 4 ).toInt > 30 )
val StudentsUnion = StudentsOver30.union( StudentsUnder25 )
StudentsUnion: org.apache.spark.rdd.RDD[Array[String]] = UnionRDD[75]

Array[Array[String]] = Array(Array(4, Sherlock, Holmes, M, 36), Array(5, John, Watson, M, 38), Array(7, Bruce, Wayne, M, 32), Array(8, Tony, Stark, M, 33), Array(1, John, Doe, M, 18), Array(2, Mary, Doe, F, 20), Array(6, Sarah, Kerrigan, F, 21), Array(9, Princess, Peach, F, 21), Array(10, Peter, Parker, M, 23))

Basic operations on RDDs

➢ Intersection: Common elements in two RDDs

val StudentsUnder35 = StudentsF.map( s => s( 4 ).toInt ).filter( a => a < 35 )
val StudentsOver25 = StudentsF.map( s => s( 4 ).toInt ).filter( a => a > 25 )
val StudentsIntersect = StudentsUnder35.intersection( StudentsOver25 )
StudentsIntersect: org.apache.spark.rdd.RDD[Int] = MappedRDD[92]

res31: Array[Int] = Array(32, 33)

Basic operations on RDDs

➢ Subtract: Elements in one RDD that are not in the other

val StudentsUnder35 = StudentsF.map( s => s( 4 ).toInt ).filter( a => a < 35 )
val StudentsOver25 = StudentsF.map( s => s( 4 ).toInt ).filter( a => a > 25 )
val StudentsSub = StudentsUnder35.subtract( StudentsOver25 )
StudentsSub: org.apache.spark.rdd.RDD[Int] = MappedRDD[12]

res0: Array[Int] = Array(18, 20, 21, 21, 23, 25)

Pair RDDs

➢ Tuples in Scala:

// Tuple creation
val myTuple = ( 13, "Bob", "Squarepants", "M", 10 )

// Access fields
myTuple._1
res6: Int = 13

➢ Pair RDDs → RDDs with tuples (key, value)

val PairStudents = StudentsF.map( s => ( s( 3 ), s ) )

PairStudents.take( 3 )
res8: Array[(String, Array[String])] = Array((M,Array(1, John, Doe, M, 18)), (F,Array(2, Mary, Doe, F, 20)), (F,Array(3, Lara, Croft, F, 25)))

Operations on Pair RDDs

➢ Join:

// Prepare key, value structure
val PairStudentsId = StudentsF.map( s => ( s( 0 ), s ) )
val PairGrades = GradesF.map( g => ( g( 0 ), g ) )

// Output is (key, (value1, value2))
val StudentGrades = PairStudentsId.join( PairGrades )
StudentGrades: org.apache.spark.rdd.RDD[(String, (Array[String], Array[String]))]

StudentGrades.take( 3 )
res13: Array[(String, (Array[String], Array[String]))] = Array((4,(Array(4, Sherlock, Holmes, M, 36),Array(4, Math, 2.3))), (4,(Array(4, Sherlock, Holmes, M, 36),Array(4, Biology, 6.7))), (4,(Array(4, Sherlock, Holmes, M, 36),Array(4, Engineering, 8.0))))

Operations on Pair RDDs

➢ Left Join:

val auxRDD = sc.parallelize( Array(Array("0","Dummy","Student","M","10"), Array("1","John","Doe","M","18" ) ) )
val auxPairRDD = auxRDD.map( a => ( a( 0 ), a ) )

// Option = None or a value
val auxGrades = auxPairRDD.leftOuterJoin( PairGrades )
auxGrades: org.apache.spark.rdd.RDD[(String, (Array[String], Option[Array[String]]))] = FlatMappedValuesRDD[34]

// [Ljava.lang.String;@30d4fbf is the string representation of the non-empty value in the Option
auxGrades.take( 2 )
res23: Array[(String, (Array[String], Option[Array[String]]))] = Array((0,(Array(0, Dummy, Student, M, 10),None)), (1,(Array(1, John, Doe, M, 18),Some([Ljava.lang.String;@30d4fbf))))

Operations on Pair RDDs

➢ Left Join (cont):

// Unwrap the Option. The if has no else branch, so an empty Option becomes () and the field type is Any
val auxGrades = auxPairRDD.leftOuterJoin( PairGrades ).map( p => ( p._1, ( p._2._1, if(!p._2._2.isEmpty) p._2._2.get ) ) )

auxGrades.take( 2 )
res26: Array[(String, (Array[String], Any))] = Array((0,(Array(0, Dummy, Student, M, 10),())), (1,(Array(1, John, Doe, M, 18),Array(1, Math, 5.6))))
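The `Any` in the result type above comes from the `if` without an `else` branch. As a hedged alternative, the Option could be unwrapped with `getOrElse` or with pattern matching, which keeps a precise type. The sketch below uses plain Scala values shaped like one joined record; `OptionSketch`, `Joined`, `gradeOrEmpty`, and `describe` are illustrative names, not part of the original slides.

```scala
object OptionSketch {
  // Shape of one leftOuterJoin value: (student fields, optional grade fields)
  type Joined = (Array[String], Option[Array[String]])

  // Replace a missing grade with an empty array; the type stays Array[String]
  def gradeOrEmpty(v: Joined): (Array[String], Array[String]) =
    (v._1, v._2.getOrElse(Array.empty[String]))

  // Pattern matching on Some / None to build a readable description
  def describe(v: Joined): String = v match {
    case (student, Some(grade)) => student(1) + ": " + grade.mkString(",")
    case (student, None)        => student(1) + ": no grades"
  }

  def main(args: Array[String]): Unit = {
    println(describe((Array("0", "Dummy", "Student", "M", "10"), None)))
    println(describe((Array("1", "John", "Doe", "M", "18"), Some(Array("1", "Math", "5.6")))))
  }
}
```

Either function could be applied to the values of the joined RDD, for example inside a `map` over its (key, value) pairs.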

Operations on Pair RDDs

➢ reduceByKey: To single object by key

// { case (k,v) => ... } uses pattern matching; the braces allow more than one line in the anonymous function
val RedKeys = PairStudents.map({case (k,v) => ( k, v( 4 ).toInt ) }).reduceByKey( _ + _ )

// The result is an RDD
RedKeys: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[84]

RedKeys.take( 2 )
res33: Array[(String, Int)] = Array((F,87), (M,180))

Operations on Pair RDDs

➢ foldByKey: To single object by key

// (0) is the left parameter: the initial value of a
val foldedKeys = PairStudents.map({case(k,v) => (k,v(4).toInt)}).foldByKey(0)((a,b) => Math.max(a,b))
foldedKeys: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[92]

res30: Array[(String, Array[String])] = Array((F,Array(2, Mary, Doe, F, 20)), (F,Array(3, Lara, Croft, F, 25)), (F,Array(6, Sarah, Kerrigan, F, 21)))


Operations on Pair RDDs

➢ groupByKey: group values with same key

val groupedKeys = PairStudents.groupByKey
groupedKeys: org.apache.spark.rdd.RDD[(String, Iterable[Array[String]])] = ShuffledRDD[93]

groupedKeys.collect
res35: Array[(String, Iterable[Array[String]])] = Array((F,CompactBuffer([Ljava.lang.String;@31788c16, [Ljava.lang.String;@613511b9, [Ljava.lang.String;@631eba8a, [Ljava.lang.String;@7668ecdc)), (M,CompactBuffer([Ljava.lang.String;@62969c3f, [Ljava.lang.String;@dec1eaa, [Ljava.lang.String;@8d1320a, [Ljava.lang.String;@5e2c330b, [Ljava.lang.String;@27cb477a, [Ljava.lang.String;@12c1aeff)))

// String representation of Iterable[Array[String]]
groupedKeys.map({case (k,v)=>(k,v.map( x => "("+ x.mkString(",")+ ")" ) ) }).take(2)
res40: Array[(String, Iterable[String])] = Array((F,List((2,Mary,Doe,F,20), (3,Lara,Croft,F,25), (6,Sarah,Kerrigan,F,21), (9,Princess,Peach,F,21))), (M,List((1,John,Doe,M,18), (4,Sherlock,Holmes,M,36), (5,John,Watson,M,38), (7,Bruce,Wayne,M,32), (8,Tony,Stark,M,33), (10,Peter,Parker,M,23))))

➢ Really useful for AI and ML
➢ For loop example

// var declares a non-final variable, so it can be reassigned inside the loop
var StudentsLoop = StudentsF.map( s => (s(0).toInt,s(1),s(2)) )

for( i <- 1 to 10 ){
  StudentsLoop = StudentsLoop.map( {case (id,name,surname) => (id+1,name,surname)} )
}

res43: Array[(Int, String, String)] = Array((11,John,Doe), (12,Mary,Doe), (13,Lara,Croft), (14,Sherlock,Holmes), (15,John,Watson), (16,Sarah,Kerrigan), (17,Bruce,Wayne), (18,Tony,Stark), (19,Princess,Peach), (20,Peter,Parker))

Caching

➢ Persist RDDs in memory or on disk
➢ Other levels of persistence:
  ○ MEMORY_ONLY, MEMORY_AND_DISK,

import org.apache.spark.storage.StorageLevel

GradesF.persist( StorageLevel.MEMORY_AND_DISK )

Saving RDDs

➢ Store to local or HDFS

// Trick to convert Array[String] properly to String: mkString
PairStudents.map(x =>( x._1,"("+ x._2.mkString( "," )+")")).saveAsTextFile( "file:///home/victor.sanchez/res" )

PairStudents.map(x =>( x._1,"("+ x._2.mkString( "," )+")")).saveAsTextFile( "hdfs:///user/victor.sanchez/res" )

Scripting

// Package
package es.upv.dsic.iarfid.haia

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext._

// Singleton object
object mySparkScript {

  // Support methods
  def average(data: Iterable[Double]): Double = { data.reduceLeft( _ + _ )/data.size }

  // Main method
  def main( args: Array[String] ): Unit = {
    val sc = new SparkContext( ( new SparkConf() ).setAppName( "MY SPARK SCRIPT" ) )
    // Program arguments: args( 0 ) is the input path, args( 1 ) is the output path
    val Grades = sc.textFile( args( 0 ) ).map( l => l.split( "\t", -1 ) ).map( g => ( g( 1 ), g( 2 ).toDouble ) )
    val GradesGr = Grades.groupByKey.map( g => ( g._1, average( g._2 ) ) )
    GradesGr.saveAsTextFile( args( 1 ) )
  }
}

Compiling Spark code

➢ Scala code is compiled to Java bytecode
➢ sbt is a build tool for Scala and Java projects
➢ sbt can help us manage our dependencies
➢ A Spark cluster needs a fat jar; sbt assembly can build it

Spark project example

[Figure: project directory tree, with the following callouts]
➢ Main .sbt file. Scala code to compile your Scala source!
➢ Plugins needed by sbt to compile your source
➢ Your project source file
➢ Extra libraries
➢ Output jar for your project
➢ How to compile your main .sbt
➢ Test sources
➢ Additional files for your jar
➢ Your project code

Main sbt file example

import AssemblyKeys._

assemblySettings

name := "haia"

version := "1.0"

scalaVersion := "2.10.4"

organization := "es.upv"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.1.0" % "provided"
)

jarName in assembly := {
  name.value + ".jar"
}

outputPath in assembly := {
  file( "target/" + (jarName in assembly).value )
}

mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
  {
    case PathList("javax", "servlet", xs @ _*) => MergeStrategy.first
    case PathList(ps @ _*) if ps.last endsWith ".html" => MergeStrategy.first
    case "unwanted.txt" => MergeStrategy.discard
    case PathList( "META-INF", ".*pom.properties" ) => MergeStrategy.first
    case x => old(x)
  }
}

Generating a fat jar

➢ Fat jar?
  ○ A jar with all of the jar files it depends on
  ○ Workers need all dependencies
  ○ sbt-assembly plugin can generate fat jars
➢ Generating a fat jar:

sbt assembly

How to execute Spark code from jar

spark-submit --class es.upv.dsic.iarfid.haia.mySparkScript --master yarn-cluster target/haia.jar hdfs:///user/victor.sanchez/grades.tsv hdfs:///user/victor.sanchez/spark_submit_ex

➢ --class: singleton object to execute
➢ target/haia.jar: fat jar file
➢ The two hdfs:/// paths: program parameters

Exercise: Multi-point simulated annealing

➢ Simulated annealing → Optimization method
➢ Multi-point → Exploring from different points
➢ Function to optimize:

Single point simulated annealing
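As a reminder of the general technique, here is a minimal sketch of single-point simulated annealing in plain Scala. It is not the exercise's reference solution: the objective `cost`, the Gaussian neighbour move, the geometric cooling schedule, and all names (`AnnealingSketch`, `anneal`, `t0`, `cooling`) are assumptions made for illustration, since the function to optimize is not reproduced here.

```scala
import scala.util.Random

object AnnealingSketch {
  // Hypothetical objective to minimize; replace with the exercise's function
  def cost(x: Double): Double = (x - 3.0) * (x - 3.0)

  // Single-point simulated annealing; returns the best point visited
  def anneal(start: Double, steps: Int, t0: Double, cooling: Double, rng: Random): Double = {
    var current = start
    var best = start
    var temperature = t0
    for (_ <- 1 to steps) {
      val candidate = current + rng.nextGaussian() // random neighbour of the current point
      val delta = cost(candidate) - cost(current)
      // Always accept improvements; accept worse moves with probability exp(-delta / T)
      if (delta <= 0 || rng.nextDouble() < math.exp(-delta / temperature)) current = candidate
      if (cost(current) < cost(best)) best = current
      temperature *= cooling // geometric cooling schedule
    }
    best
  }

  def main(args: Array[String]): Unit =
    println(anneal(start = 10.0, steps = 2000, t0 = 10.0, cooling = 0.99, rng = new Random(42)))
}
```

For the multi-point version, one option would be to parallelize a collection of starting points and map a function like `anneal` over it, creating a separate random generator for each starting point.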

Time to work!

A final piece of advice on Spark

➢ Spark is still quite a novel technology
➢ Unexpected out-of-memory exceptions
➢ Memory issues are difficult to debug in Spark
➢ Avoid out-of-memory scenarios:
  ○ Use object serialization (Java or Kryo)
  ○ Choose data structures wisely
  ○ Increase parallelism (spark.default.parallelism)
  ○ Avoid groupBy operations → reduceBy
  ○ More memory for shuffle (spark.shuffle.spill=false or higher spark.shuffle.memoryFraction)
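The advice about preferring reduceBy over groupBy can be illustrated with the per-subject average from the scripting example. The sketch below uses plain Scala collections and hypothetical grades; `AverageByKeySketch` and `merge` are illustrative names. Instead of collecting every value of a key before averaging, it keeps one small (sum, count) pair per key and merges pairs.

```scala
object AverageByKeySketch {
  // Hypothetical (subject, grade) pairs
  val grades: Seq[(String, Double)] = Seq(("Math", 2.0), ("Math", 6.0), ("Biology", 7.0))

  // Merge two partial (sum, count) pairs; this operation is associative and commutative
  def merge(a: (Double, Int), b: (Double, Int)): (Double, Int) = (a._1 + b._1, a._2 + b._2)

  // Running (sum, count) per key; the values of a key are never held together in memory
  val sums: Map[String, (Double, Int)] =
    grades.foldLeft(Map.empty[String, (Double, Int)]) { case (acc, (k, v)) =>
      acc.updated(k, merge(acc.getOrElse(k, (0.0, 0)), (v, 1)))
    }

  // Final average per key
  val averages: Map[String, Double] = sums.map { case (k, (sum, n)) => k -> sum / n }

  def main(args: Array[String]): Unit = println(averages)
}
```

On a pair RDD such as `Grades`, a comparable pipeline could be written as `Grades.mapValues( v => ( v, 1 ) ).reduceByKey( merge ).mapValues( { case ( s, n ) => s / n } )`, so that only (sum, count) pairs travel through the shuffle.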

Hadoop ecosystem vs Spark

Hadoop ecosystem:
➢ Disk based parallelization
➢ No looping
➢ More mature project
➢ Many organizations use it

Spark:
➢ Memory based parallelization
➢ Looping (nice for AI and ML)
➢ Still in its initial steps
➢ Changing all Hadoop code has a cost

Extra information

➢ http://spark.apache.org/
➢ Learning Spark: Lightning-Fast Big Data Analysis. Holden Karau et al. Ed. O'Reilly
➢ StackOverflow