Apache Spark: Moving on from Hadoop


Transcript of Apache Spark: Moving on from Hadoop

Page 1: Apache Spark: Moving on from Hadoop

Apache Spark: Moving on from Hadoop

Víctor Sánchez Anguix, Universitat Politècnica de València

MSc in Artificial Intelligence, Pattern Recognition, and Digital Image

Course 2014/2015

Page 2: Apache Spark: Moving on from Hadoop


Hadoop is unbeatable (?)

Page 3: Apache Spark: Moving on from Hadoop

Hadoop is unbeatable (?)

https://spark.apache.org/


Page 4: Apache Spark: Moving on from Hadoop

Hadoop is unbeatable (?)

Google Trends


Page 5: Apache Spark: Moving on from Hadoop

Hadoop is unbeatable (?)

http://www.datanami.com/2014/03/06/apache_spark_3_real-world_use_cases/


Page 6: Apache Spark: Moving on from Hadoop

What is Apache Spark?

➢ Open source cluster computing

➢ Distributed disk → Distributed memory

➢ Created at UC Berkeley in 2009

➢ Last major release: Dec 2014

https://spark.apache.org/


Page 7: Apache Spark: Moving on from Hadoop

Resilient Distributed Dataset (RDD)

➢ Core concept in Spark

➢ Distributed collection of objects in memory

➢ Operations on RDDs run in parallel

➢ Read from file, distributed file system, or parallelize existing collection


Page 8: Apache Spark: Moving on from Hadoop

Resilient Distributed Dataset (RDD)

➢ RDDs are fault tolerant

➢ Spark maintains a DAG of the operations that produced an RDD (see the sketch below)

➢ We can cache RDDs to save computations

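A quick way to see that DAG from the shell is the standard toDebugString method on RDDs; a minimal sketch (the file path is hypothetical, and the exact output format varies across Spark versions):

val lines = sc.textFile( "file:///tmp/students.tsv" )
val fields = lines.map( l => l.split( "\t", -1 ) )
val adults = fields.filter( s => s( 4 ).toInt >= 18 )
println( adults.toDebugString )   // prints the lineage chain: FilteredRDD <- MappedRDD <- HadoopRDD
// If a partition is lost, Spark replays exactly this chain of operations to rebuild it.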

Page 9: Apache Spark: Moving on from Hadoop

Spark Architecture

https://spark.apache.org/

The diagram on this slide shows the standard Spark architecture, one annotation per component:

➢ Spark context: interacts with the cluster
➢ Driver program: the main program, coordinates tasks
➢ Cluster manager: assigns resources
➢ Executors: carry out tasks and manage RDD chunks

Page 10: Apache Spark: Moving on from Hadoop

Driver program

➢ Main application for a Spark script

➢ Creates Spark context and coordinates executors

➢ Executes instructions in Java/Python/Scala

➢ ONLY operations on RDDs are parallelized; everything else runs locally on the driver (see the sketch below)

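A minimal sketch of that distinction, using the shell's sc; plain Scala collection operations run only inside the driver's JVM, while the same operation on an RDD is distributed across the executors:

val localAges = Array( 18, 20, 25 )
val doubledLocal = localAges.map( _ * 2 )    // ordinary Scala: runs on the driver only

val distAges = sc.parallelize( localAges )
val doubledDist = distAges.map( _ * 2 )      // RDD operation: runs in parallel on the executors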

Page 11: Apache Spark: Moving on from Hadoop

Programming in Spark

➢ We will stick to Scala

➢ Functional programming

➢ Fully interoperable with Java

➢ Shorter code due to Scala abstractions


Page 12: Apache Spark: Moving on from Hadoop

Programming in Spark

➢ We have an interactive shell

spark-shell --master local[4]      # local[4]: the number of cores to use in local mode

spark-shell --master yarn-client   # use resources from a YARN cluster (e.g., a Hadoop cluster)

Page 13: Apache Spark: Moving on from Hadoop

Programming in Spark

Welcome to

____ __

/ __/__ ___ _____/ /__

_\ \/ _ \/ _ `/ __/ '_/

/___/ .__/\_,_/_/ /_/\_\ version 1.2.0

/_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_25)

Type in expressions to have them evaluated.

Type :help for more information.

15/02/02 11:43:03 INFO repl.SparkILoop: Created spark context..

Spark context available as sc.

scala>


Page 14: Apache Spark: Moving on from Hadoop

Spark context

➢ Special type of object

➢ Interacts with distributed resources:
○ Read data
○ Add resources (e.g., jars, files) to the cluster
○ Create RDDs
○ etc.

➢ In the Spark shell, it is automatically created as sc


Page 15: Apache Spark: Moving on from Hadoop

Loading a text file

➢ Load from the local filesystem:

val Students = sc.textFile( "file:///home/victor.sanchez/students.tsv" )

Students: org.apache.spark.rdd.RDD[String] = file://home/victor.sanchez/students.tsv MappedRDD[1] at textFile at <console>:12

(The result is an RDD of Strings; val marks the variable as final, so its reference does not change.)

➢ Load from HDFS:

val Students = sc.textFile( "hdfs://localhost/user/victor.sanchez/students.tsv" )

Page 16: Apache Spark: Moving on from Hadoop

What’s in my dataset?

➢ Take n elements from the RDD:

val x = Students.take( 3 )   // argument: the number of elements to take

x: Array[String] = Array(1 John Doe M 18, 2 Mary Doe F 20, 3 Lara Croft F 25)

(The result is a LOCAL Array of Strings: it resides in the driver, not in the cluster.)

➢ Get the whole RDD into the driver:

val x = Students.collect   // if a method takes no arguments, the () can be omitted

x: Array[String] = Array(1 John Doe M 18, 2 Mary Doe F 20, 3 Lara Croft F 25, 4 Sherlock Holmes M 36, 5 John Watson M 38, 6 Sarah Kerrigan F 21, 7 Bruce Wayne M 32, 8 Tony Stark M 33, 9 Princess Peach F 21, 10 Peter Parker M 23)

Page 17: Apache Spark: Moving on from Hadoop

Can I go the reverse way?

➢ Parallelize a collection from the driver:

val myArray = Array( 1, 2, 3, 4, 5 )   // plain array creation, local to the driver

myArray: Array[Int] = Array(1, 2, 3, 4, 5)

val myArrayPar = sc.parallelize( myArray )

myArrayPar: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[4] at parallelize at <console>:14

➢ Broadcast a variable (only sent once to each node):

val x = 6

val xBroad = sc.broadcast( x )

xBroad: org.apache.spark.broadcast.Broadcast[Int] = Broadcast(4)

Page 18: Apache Spark: Moving on from Hadoop

Basic operations on RDDs

➢ Map: Project or generate new data

➢ It really takes an anonymous function as arg:


val StudentsF = Students.map( l => l.split( "\t", -1 ) )

StudentsF: org.apache.spark.rdd.RDD[Array[String]] = MappedRDD[6] at map

StudentsF.take( 2 )

res6: Array[Array[String]] = Array(Array(1, John, Doe, M, 18), Array(2, Mary, Doe, F, 20))

For each l in Students, this generates its split. The general anonymous function syntax is input parameters => output:

(l:String) => l.split( "\t", -1 )

(x:Int, y:Int) => x + y

Page 19: Apache Spark: Moving on from Hadoop

Basic operations on RDDs

➢ Map: Project or generate new data


def splitWrapped( line: String ): Array[String] = { line.split( "\t", -1 ) }   // input: a line of text; output: its fields

val StudentsF = Students.map( splitWrapped )

StudentsF: org.apache.spark.rdd.RDD[Array[String]] = MappedRDD[6] at map

StudentsF.take( 2 )

res6: Array[Array[String]] = Array(Array(1, John, Doe, M, 18), Array(2, Mary, Doe, F, 20))


Page 20: Apache Spark: Moving on from Hadoop

Exercise

➢ Generate a new RDD of students with an extra field indicating whether the student is under 25 years old (one possible solution is sketched below)

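One possible solution sketch (mine, not the deck's official answer), assuming the StudentsF RDD of split fields from the previous slides:

val StudentsU25 = StudentsF.map( s => s :+ ( s( 4 ).toInt < 25 ).toString )   // append "true"/"false" as a new field

StudentsU25.take( 2 )
// res: Array[Array[String]] = Array(Array(1, John, Doe, M, 18, true), Array(2, Mary, Doe, F, 20, true))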

Page 21: Apache Spark: Moving on from Hadoop

Basic operations on RDDs

➢ Foreach: Perform operation over each object

➢ Does not return a new RDD!


StudentsF.foreach( x => print( x( 1 ) + " " ) )

John Bruce Tony Princess Peter 15/02/03 09:17:04 INFO executor.Executor: Finished task 1.0 in stage 13.0 (TID 27). 1693 bytes result sent to driver Mary Lara Sherlock John Sarah 15/02/03 09:17:04 INFO executor.Executor: Finished task 0.0 in stage 13.0 (TID 26). 1693 bytes result sent to driver

(The names are interleaved with executor log lines because the print runs on the executors, not on the driver, so the output order is not deterministic.)

Page 22: Apache Spark: Moving on from Hadoop

Basic operations on RDDs

➢ Filter: keep only the elements fulfilling a condition


val StudentsFilt = StudentsF.filter( s => s( 0 ).toInt > 3 )   // anonymous function; toInt converts the String field to an integer

StudentsFilt: org.apache.spark.rdd.RDD[Array[String]] = FilteredRDD[13] at filter at <console>:16

StudentsFilt.take( 3 )

res13: Array[Array[String]] = Array(Array(4, Sherlock, Holmes, M, 36), Array(5, John, Watson, M, 38), Array(6, Sarah, Kerrigan, F, 21))

Page 23: Apache Spark: Moving on from Hadoop

Apache Spark: Moving on from Hadoop. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

➢ Distinct: only different objects


val StudentsDis = StudentsF.map( s => s( 3 ) ).distinct

StudentsDis: org.apache.spark.rdd.RDD[String] = MappedRDD[18] at distinct at <console>:16

StudentsDis.take( 2 )

res16: Array[String] = Array(F, M)

Page 24: Apache Spark: Moving on from Hadoop

Apache Spark: Moving on from Hadoop. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

➢ Fold: Reduce all objects to a single object

➢ Beware: the dummy (zero) value is applied more than once, once per partition plus once when merging (see the numeric sketch below)


val dummyStudent = Array( "12", "Clark", "Kent", "M", "25" )

val StudentsFold = StudentsF.fold( dummyStudent )( (acc,value) => { if ( value( 4 ).toInt > acc( 4 ).toInt) value else acc } )

StudentsFold: Array[String] = Array(5, John, Watson, M, 38)

(dummyStudent is the starting left operand; in each step, acc is the left operand and value the right one.)

val StudentsFold = StudentsF.fold( dummyStudent )( (acc,value) => { Array( "[" + acc( 0 ) + "-" + value( 0 ) + "]" , acc( 1 ), acc( 2 ), acc( 3 ), acc( 4 ) ) } )

StudentsFold: Array[String] = Array([[12-[[[[12-7]-8]-9]-10]]-[[[[[[12-1]-2]-3]-4]-5]-6]], Clark, Kent, M, 0)
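A minimal numeric sketch of the same pitfall (my own example, not from the slide); the zero value is applied once per partition and once more when the partial results are merged:

val nums = sc.parallelize( 1 to 10, 4 )   // force 4 partitions
val total = nums.fold( 100 )( _ + _ )
// total = 55 + 100 * (4 partitions + 1 final merge) = 555, not 155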

Page 25: Apache Spark: Moving on from Hadoop

Apache Spark: Moving on from Hadoop. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

➢ Reduce: Reduce all objects to a single object


val StudentsRed = StudentsF.map( s => s( 4 ).toInt ).reduce( _ + _ )

StudentsRed: Int = 267

(The argument _ + _ is a binary operator, which must be commutative and associative.)

Page 26: Apache Spark: Moving on from Hadoop

Apache Spark: Moving on from Hadoop. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

➢ Max:

➢ Min:


val StudentsMax = StudentsF.map( s => s( 4 ).toInt ).max

StudentsMax: Int = 38

val StudentsMin = StudentsF.map( s => s( 4 ).toInt ).min

StudentsMin: Int = 18

Page 27: Apache Spark: Moving on from Hadoop

Apache Spark: Moving on from Hadoop. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

➢ Count: count the elements in the RDD

➢ CountByValue: Count repetitions of elements


val StudentsCount = StudentsF.count

StudentsCount: Long = 10

val StudentsCount = StudentsF.map( s => s( 3 ) ).countByValue

StudentsCount: scala.collection.Map[String,Long] = Map(M -> 6, F -> 4)

Page 28: Apache Spark: Moving on from Hadoop

Apache Spark: Moving on from Hadoop. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

➢ Count the number of students that are female (one possible solution is sketched below)

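One possible solution sketch, again assuming StudentsF; either filter plus count, or a lookup in the countByValue map:

val numFemale = StudentsF.filter( s => s( 3 ) == "F" ).count
// numFemale: Long = 4

val byGender = StudentsF.map( s => s( 3 ) ).countByValue
val numFemale2 = byGender( "F" )   // 4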

Page 29: Apache Spark: Moving on from Hadoop

Apache Spark: Moving on from Hadoop. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

➢ Sample: take a random sample of the elements

➢ RandomSplit: Splits into random RDDs


val StudentsSample = StudentsF.sample( true, 0.5 )   // arguments: with replacement, and the fraction to sample

StudentsSample: org.apache.spark.rdd.RDD[Array[String]] =PartitionwiseSampledRDD[33]

StudentsSample.take( 3 )

res18: Array[Array[String]] = Array(Array(1, John, Doe, M, 18), Array(2, Mary, Doe, F, 20), Array(2, Mary, Doe, F, 20))

val StudentsSplit = StudentsF.randomSplit( Array( 0.8, 0.2 ) )   // weights for each resulting partition

StudentsSplit: Array[org.apache.spark.rdd.RDD[Array[String]]] = Array(PartitionwiseSampledRDD[46]

StudentsSplit( 0 ).collect

res26: Array[Array[String]] = Array(Array(1, John, Doe, M, 18), Array(2, Mary, Doe, F, 20), Array(5, John, Watson, M, 38), Array(6, Sarah, Kerrigan, F, 21), Array(9, Princess, Peach, F, 21), Array(10, Peter, Parker, M, 23))


Page 30: Apache Spark: Moving on from Hadoop

Apache Spark: Moving on from Hadoop. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

➢ SortBy: Sort elements according to value

➢ Top: Get largest elements


val StudentsSorted = StudentsF.sortBy( x => x( 4 ) )   // the value to sort by (a String here, so the order is lexicographic)

StudentsSorted.collect

Array(Array(1, John, Doe, M, 18), Array(2, Mary, Doe, F, 20), Array(6, Sarah, Kerrigan, F, 21), ...

val StudentsTop = StudentsF.map( s => s( 4 ) ).top( 3 )   // argument: k, the number of largest elements to select

StudentsTop: Array[String] = Array(38, 36, 33)


Page 31: Apache Spark: Moving on from Hadoop

Apache Spark: Moving on from Hadoop. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

➢ Union: Two RDDs into one


val StudentsUnder25 = StudentsF.filter( s => s( 4 ).toInt < 25 )

val StudentsOver30 = StudentsF.filter( s => s( 4 ).toInt > 30 )

val StudentsUnion = StudentsOver30.union( StudentsUnder25 )

StudentsUnion: org.apache.spark.rdd.RDD[Array[String]] = UnionRDD[75]

StudentsUnion.collect

Array[Array[String]] = Array(Array(4, Sherlock, Holmes, M, 36), Array(5, John, Watson, M, 38), Array(7, Bruce, Wayne, M, 32), Array(8, Tony, Stark, M, 33), Array(1, John, Doe, M, 18), Array(2, Mary, Doe, F, 20), Array(6, Sarah, Kerrigan, F, 21), Array(9, Princess, Peach, F, 21), Array(10, Peter, Parker, M, 23))

Page 32: Apache Spark: Moving on from Hadoop

Apache Spark: Moving on from Hadoop. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

➢ Intersection: Common elements in two RDDs


val StudentsUnder35 = StudentsF.map( s => s( 4 ).toInt ).filter( a => a < 35 )

val StudentsOver25 = StudentsF.map( s => s( 4 ).toInt ).filter( a => a > 25 )

val StudentsIntersect = StudentsUnder35.intersection( StudentsOver25 )

StudentsIntersect: org.apache.spark.rdd.RDD[Int] = MappedRDD[92]

StudentsIntersect.collect

res31: Array[Int] = Array(32, 33)

Page 33: Apache Spark: Moving on from Hadoop

Apache Spark: Moving on from Hadoop. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

➢ Subtract: the elements of one RDD that are not in the other


val StudentsUnder35 = StudentsF.map( s => s( 4 ).toInt ).filter( a => a < 35 )

val StudentsOver25 = StudentsF.map( s => s( 4 ).toInt ).filter( a => a > 25 )

val StudentsSub = StudentsUnder35.subtract( StudentsOver25 )

StudentsSub: org.apache.spark.rdd.RDD[Int] = MappedRDD[12]

StudentsSub.collect

res0: Array[Int] = Array(18, 20, 21, 21, 23, 25)

Page 34: Apache Spark: Moving on from Hadoop

Apache Spark: Moving on from Hadoop. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

➢ Tuples in Scala:

➢ Pair RDDs → RDDs with tuples (key, value)


val myTuple = ( 13, "Bob", "Squarepants", "M", 10 )   // tuple creation

myTuple._1   // access fields with ._1, ._2, ...

res6: Int = 13

val PairStudents = StudentsF.map( s => ( s( 3 ), s ) )

PairStudents.take( 3 )

res8: Array[(String, Array[String])] = Array((M,Array(1, John, Doe, M, 18)), (F,Array(2, Mary, Doe, F, 20)), (F,Array(3, Lara, Croft, F, 25)))

Page 35: Apache Spark: Moving on from Hadoop

Apache Spark: Moving on from Hadoop. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

➢ Join: combine two Pair RDDs on their keys


val PairStudentsId = StudentsF.map( s => ( s( 0 ), s ) )   // prepare the (key, value) structure

val PairGrades = GradesF.map( g => ( g( 0 ), g ) )

val StudentGrades = PairStudentsId.join( PairGrades )

StudentGrades: org.apache.spark.rdd.RDD[(String, (Array[String], Array[String]))]

StudentGrades.take( 3 )

res13: Array[(String, (Array[String], Array[String]))] = Array((4,(Array(4, Sherlock, Holmes, M, 36),Array(4, Math, 2.3))), (4,(Array(4, Sherlock, Holmes, M, 36),Array(4, Biology, 6.7))), (4,(Array(4, Sherlock, Holmes, M, 36),Array(4, Engineering, 8.0))))

(The output is (key, (value1, value2)).)

Page 36: Apache Spark: Moving on from Hadoop

Apache Spark: Moving on from Hadoop. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

➢ Left join: keep all the keys from the left RDD


val auxRDD = sc.parallelize( Array( Array( "0", "Dummy", "Student", "M", "10" ), Array( "1", "John", "Doe", "M", "18" ) ) )

val auxPairRDD = auxRDD.map( a => ( a( 0 ), a ) )

val auxGrades = auxPairRDD.leftOuterJoin( PairGrades )

auxGrades: org.apache.spark.rdd.RDD[(String, (Array[String], Option[Array[String]]))] = FlatMappedValuesRDD[34]

auxGrades.take( 2 )

res23: Array[(String, (Array[String], Option[Array[String]]))] = Array((0,(Array(0, Dummy, Student, M, 10),None)), (1,(Array(1, John, Doe, M, 18),Some([Ljava.lang.String;@30d4fbf))))

(An Option is either None or a value wrapped in Some; [Ljava.lang.String;@30d4fbf is just Java's default String representation of the non-empty Array value inside the Option.)

Page 37: Apache Spark: Moving on from Hadoop

Apache Spark: Moving on from Hadoop. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

➢ Left Join (cont):


val auxGrades = auxPairRDD.leftOuterJoin( PairGrades ).map( p => ( p._1, ( p._2._1, if ( !p._2._2.isEmpty ) p._2._2.get ) ) )   // an if without else yields () for the None case, hence the Any in the result type

auxGrades.take( 2 )

res26: Array[(String, (Array[String], Any))] = Array((0,(Array(0, Dummy, Student, M, 10),())), (1,(Array(1, John, Doe, M, 18),Array(1, Math, 5.6))))
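A cleaner alternative sketch (mine, not the slide's) that keeps a uniform value type by using Option.getOrElse instead of an if without else:

val auxGrades2 = auxPairRDD.leftOuterJoin( PairGrades ).map( { case ( k, ( student, gradeOpt ) ) =>
  ( k, ( student, gradeOpt.getOrElse( Array[String]() ) ) )   // a None becomes an empty Array[String]
} )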

Page 38: Apache Spark: Moving on from Hadoop

Apache Spark: Moving on from Hadoop. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

➢ reduceByKey: reduce the values to a single object per key


val RedKeys = PairStudents.map({case (k,v) => ( k, v( 4 ).toInt ) }).reduceByKey( _ + _ )

RedKeys: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[84]

RedKeys.take( 2 )

res33: Array[(String, Int)] = Array((F,87), (M,180))

(Braces allow more than one line in the anonymous function, and case gives pattern matching on the (key, value) tuple. Unlike countByValue, the result here is still an RDD.)

Page 39: Apache Spark: Moving on from Hadoop

Apache Spark: Moving on from Hadoop. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

➢ foldByKey: fold the values to a single object per key


val foldedKeys = PairStudents.map( { case ( k, v ) => ( k, v( 4 ).toInt ) } ).foldByKey( 0 )( ( a, b ) => Math.max( a, b ) )   // arguments: the starting left operand (0), then the function

foldedKeys: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[92]

foldedKeys.collect

res30: Array[(String, Int)] = Array((F,25), (M,38))


Page 40: Apache Spark: Moving on from Hadoop

Apache Spark: Moving on from Hadoop. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

➢ groupByKey: group values with same key


val groupedKeys = PairStudents.groupByKey

groupedKeys: org.apache.spark.rdd.RDD[(String, Iterable[Array[String]])] = ShuffledRDD[93]

groupedKeys.collect

res35: Array[(String, Iterable[Array[String]])] = Array((F,CompactBuffer([Ljava.lang.String;@31788c16, [Ljava.lang.String;@613511b9, [Ljava.lang.String;@631eba8a, [Ljava.lang.String;@7668ecdc)), (M,CompactBuffer([Ljava.lang.String;@62969c3f, [Ljava.lang.String;@dec1eaa, [Ljava.lang.String;@8d1320a, [Ljava.lang.String;@5e2c330b, [Ljava.lang.String;@27cb477a, [Ljava.lang.String;@12c1aeff)))

groupedKeys.map( { case ( k, v ) => ( k, v.map( x => "(" + x.mkString( "," ) + ")" ) ) } ).take( 2 )

res40: Array[(String, Iterable[String])] = Array((F,List((2,Mary,Doe,F,20), (3,Lara,Croft,F,25), (6,Sarah,Kerrigan,F,21), (9,Princess,Peach,F,21))), (M,List((1,John,Doe,M,18), (4,Sherlock,Holmes,M,36), (5,John,Watson,M,38), (7,Bruce,Wayne,M,32), (8,Tony,Stark,M,33), (10,Peter,Parker,M,23))))

(mkString builds a readable String representation of each Array; the raw collect above only shows the default representation of Iterable[Array[String]].)

Page 41: Apache Spark: Moving on from Hadoop

Looping

➢ Really useful for AI and ML

➢ For loop example

var StudentsLoop = StudentsF.map( s => ( s( 0 ).toInt, s( 1 ), s( 2 ) ) )   // var: a non-final variable that can be reassigned

for( i <- 1 to 10 ) {   // looping!
  StudentsLoop = StudentsLoop.map( { case ( id, name, surname ) => ( id + 1, name, surname ) } )
}

StudentsLoop.collect

res43: Array[(Int, String, String)] = Array((11,John,Doe), (12,Mary,Doe), (13,Lara,Croft), (14,Sherlock,Holmes), (15,John,Watson), (16,Sarah,Kerrigan), (17,Bruce,Wayne), (18,Tony,Stark), (19,Princess,Peach), (20,Peter,Parker))



Page 43: Apache Spark: Moving on from Hadoop

Apache Spark: Moving on from Hadoop. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

➢ Persist RDDs in memory and/or on disk

➢ Other levels of persistence:
○ MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, etc.


PairStudents.cache

import org.apache.spark.storage.StorageLevel

GradesF.persist( StorageLevel.MEMORY_AND_DISK )

Page 44: Apache Spark: Moving on from Hadoop

Apache Spark: Moving on from Hadoop. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image

➢ Store to the local filesystem or to HDFS


PairStudents.map(x =>( x._1,"("+ x._2.mkString( "," )+")")).saveAsTextFile( "file:///home/victor.sanchez/res" )

PairStudents.map(x =>( x._1,"("+ x._2.mkString( "," )+")")).saveAsTextFile( "hdfs:///user/victor.sanchez/res" )

(mkString is a trick to convert the Array[String] value properly to a String.)

Page 45: Apache Spark: Moving on from Hadoop

Scripting

package es.upv.dsic.iarfid.haia                                  // package

import org.apache.spark.SparkContext                             // imports
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext._

object mySparkScript {                                           // singleton object

  // support methods
  def average( data: Iterable[Double] ): Double = { data.reduceLeft( _ + _ ) / data.size }

  // main method; args holds the program arguments
  def main( args: Array[String] ) {

    val sc = new SparkContext( ( new SparkConf() ).setAppName( "MY SPARK SCRIPT" ) )

    val Grades = sc.textFile( args( 0 ) ).map( l => l.split( "\t", -1 ) ).map( g => ( g( 1 ), g( 2 ).toDouble ) )

    val GradesGr = Grades.groupByKey.map( g => ( g._1, average( g._2 ) ) )

    GradesGr.saveAsTextFile( args( 1 ) )
  }
}

Page 46: Apache Spark: Moving on from Hadoop


Compiling Spark code

➢ Scala code is compiled to Java Byte code

➢ sbt is a build tool that compiles Scala and Java code

➢ sbt can help us manage our dependencies

➢ Running on a Spark cluster requires a fat jar, and sbt assembly can build one!

Page 47: Apache Spark: Moving on from Hadoop


Spark project example

build.sbt          Main .sbt file: how to compile your project
lib/               Extra libraries
project/           How to compile your main .sbt
    plugins.sbt    Plugins needed by sbt to compile your source
    build.scala    Scala code to compile your Scala source
src/               Your project source files
    main/          Your project code
        resources/ Additional files for your jar
    test/          Test sources
target/            Output jar for your project
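For completeness, a minimal project/plugins.sbt sketch that pulls in the sbt-assembly plugin used on the next slides; the version number is an assumption consistent with the 0.x-series syntax in this deck's build file:

// project/plugins.sbt
addSbtPlugin( "com.eed3si9n" % "sbt-assembly" % "0.11.2" )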

Page 48: Apache Spark: Moving on from Hadoop

Main sbt file example

import AssemblyKeys._

assemblySettings

name := "haia"

version := "1.0"

scalaVersion := "2.10.4"

organization := "es.upv"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.1.0" % "provided"
)

jarName in assembly := { name.value + ".jar" }

outputPath in assembly := { file( "target/" + (jarName in assembly).value ) }

mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
  {
    case PathList("javax", "servlet", xs @ _*) => MergeStrategy.first
    case PathList(ps @ _*) if ps.last endsWith ".html" => MergeStrategy.first
    case "unwanted.txt" => MergeStrategy.discard
    case PathList( "META-INF", ".*pom.properties" ) => MergeStrategy.first
    case x => old(x)
  }
}

Page 49: Apache Spark: Moving on from Hadoop

Generating a fat jar

➢ Fat jar?
○ A jar bundling all of the jar files it depends on
○ Workers need all the dependencies
○ The sbt-assembly plugin can generate fat jars

➢ Generating a fat jar:

sbt assembly

Page 50: Apache Spark: Moving on from Hadoop

How to execute Spark code from a jar

spark-submit --class es.upv.dsic.iarfid.haia.mySparkScript --master yarn-cluster target/haia.jar hdfs:///user/victor.sanchez/grades.tsv hdfs:///user/victor.sanchez/spark_submit_ex

(--class names the singleton object to execute; target/haia.jar is the fat jar; the two HDFS paths are the program parameters.)

Page 51: Apache Spark: Moving on from Hadoop

Exercise: Multi-point simulated annealing

➢ Simulated annealing → Optimization method

➢ Multi-point → exploring from several different starting points

➢ Function to optimize:


Page 52: Apache Spark: Moving on from Hadoop

Single point simulated annealing

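The single-point algorithm appears on the original slide as a figure. As a hint for the exercise, a hedged sketch of the multi-point variant on Spark; the objective f, the neighbour function, and the cooling schedule below are placeholders of my own choosing:

import scala.util.Random

def f( x: Double ): Double = x * x                          // hypothetical function to minimize
def neighbour( x: Double, r: Random ): Double = x + r.nextGaussian()

def anneal( start: Double ): Double = {                     // single-point simulated annealing
  val r = new Random()
  var current = start
  var temp = 100.0
  while ( temp > 0.01 ) {
    val candidate = neighbour( current, r )
    val delta = f( candidate ) - f( current )
    if ( delta < 0 || r.nextDouble() < Math.exp( -delta / temp ) ) current = candidate
    temp *= 0.95                                            // cooling schedule
  }
  current
}

// Multi-point: parallelize the starting points, anneal each one on the executors, keep the best
val starts = sc.parallelize( Seq.fill( 100 )( Random.nextDouble() * 200 - 100 ) )
val best = starts.map( anneal ).reduce( ( a, b ) => if ( f( a ) < f( b ) ) a else b )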

Page 53: Apache Spark: Moving on from Hadoop


Time to work!

Page 54: Apache Spark: Moving on from Hadoop

Some final advice on Spark

➢ Spark is still a quite novel technology

➢ Unexpected out-of-memory exceptions can occur

➢ Memory issues are difficult to debug in Spark

➢ To avoid out-of-memory scenarios (a configuration sketch follows below):
○ Use object serialization (Java or Kryo)
○ Choose data structures wisely
○ Increase parallelism (spark.default.parallelism)
○ Avoid groupBy operations → prefer reduceBy
○ Give more memory to the shuffle (spark.shuffle.spill=false or a higher spark.shuffle.memoryFraction)

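A minimal configuration sketch along those lines for a Spark 1.x application; the property keys are standard, but the concrete values are illustrative assumptions, not recommendations from the slide:

import org.apache.spark.{ SparkConf, SparkContext }

val conf = new SparkConf()
  .setAppName( "MY TUNED SCRIPT" )
  .set( "spark.serializer", "org.apache.spark.serializer.KryoSerializer" )   // Kryo object serialization
  .set( "spark.default.parallelism", "64" )                                  // increase parallelism
  .set( "spark.shuffle.memoryFraction", "0.4" )                              // more memory for the shuffle
val sc = new SparkContext( conf )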

Page 55: Apache Spark: Moving on from Hadoop

Hadoop ecosystem vs Spark

Hadoop ecosystem:

➢ Disk-based parallelization

➢ No looping

➢ More mature project

➢ Many organizations use it

Spark:

➢ Memory-based parallelization

➢ Looping (nice for AI and ML)

➢ Still taking its first steps

➢ Migrating all existing Hadoop code has a cost

Page 56: Apache Spark: Moving on from Hadoop


Extra information

➢ http://spark.apache.org/

➢ Learning Spark: Lightning-Fast Big Data Analysis. Holden Karau et al. O’Reilly

➢ StackOverflow