Introduction to Apache Spark


Description

These are my slides from the ebiznext workshop "Introduction to Apache Spark". Please download the code sources from https://github.com/MohamedHedi/SparkSamples

Transcript of Introduction to Apache Spark

Page 1: Introduction to Apache Spark

INTRODUCTION TO

APACHE SPARK

Mohamed Hedi Abidi - Software Engineer @ebiznext

@mh_abidi

Page 2: Introduction to Apache Spark

CONTENT

Spark Introduction

Installation

Spark-Shell

SparkContext

RDD

Persistence

Simple Spark Apps

Deployment

Spark SQL

Spark GraphX

Spark MLlib

Spark Streaming

Spark & Elasticsearch

Page 3: Introduction to Apache Spark

INTRODUCTION

An open source data analytics cluster computing framework

In-memory data processing

Up to 100x faster than Hadoop MapReduce for in-memory workloads

Supports MapReduce

Page 4: Introduction to Apache Spark

INTRODUCTION

Handles batch, interactive, and real-time within a single framework

Page 5: Introduction to Apache Spark

INTRODUCTION

Programming at a higher level of abstraction: faster, easier development

Page 6: Introduction to Apache Spark

INTRODUCTION

Highly accessible through standard APIs built in Java, Scala, Python, or SQL (for interactive queries), and a rich set of machine learning libraries

Compatibility with the existing Hadoop v1 (SIMR) and 2.x (YARN) ecosystems so companies can leverage their existing infrastructure.

Page 7: Introduction to Apache Spark

INSTALLATION

Install JDK 1.7+, Scala 2.10.x, sbt 0.13.7, Maven 3.0+

Download and unzip Apache Spark 1.1.0 sources

Or clone the development version:

git clone git://github.com/apache/spark.git

Run Maven to build Apache Spark

mvn -DskipTests clean package

Launch Apache Spark standalone REPL

[spark_home]/bin/spark-shell

Go to the Spark UI at

http://localhost:4040

Page 8: Introduction to Apache Spark

SPARK-SHELL

we’ll run Spark’s interactive shell… within the “spark” directory, run:

./bin/spark-shell

then from the “scala>” REPL prompt, let’s create some data…

scala> val data = 1 to 10000

create an RDD based on that data…

scala> val distData = sc.parallelize(data)

then use a filter to select values less than 10…

scala> distData.filter(_ < 10).collect()

Page 9: Introduction to Apache Spark

SPARKCONTEXT

The first thing a Spark program must do is to create a SparkContext object, which tells Spark how to access a cluster.

In the shell for either Scala or Python, this is the sc variable, which is created automatically

Other programs must use a constructor to instantiate a new SparkContext:

val conf = new SparkConf().setAppName(appName).setMaster(master)
new SparkContext(conf)
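
Outside the shell, a minimal standalone application skeleton looks roughly like this (MyApp and the local[2] master are placeholder choices, not part of the original slides):

import org.apache.spark.{SparkConf, SparkContext}

object MyApp {
  def main(args: Array[String]): Unit = {
    // appName shows up in the UI; the master URL selects the cluster (here: 2 local threads)
    val conf = new SparkConf().setAppName("MyApp").setMaster("local[2]")
    val sc = new SparkContext(conf)
    // ... build and transform RDDs with sc ...
    sc.stop()   // release resources when done
  }
}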

Page 10: Introduction to Apache Spark

RDDS

Resilient Distributed Datasets (RDDs) are the primary abstraction in Spark – an immutable, distributed collection of data, partitioned across the machines of a cluster

There are currently two types:

Parallelized collections: take an existing Scala collection and run functions on it in parallel

External datasets: Spark can create distributed datasets from any storage source supported by Hadoop, including the local file system, HDFS, Cassandra, HBase, Amazon S3, etc.

Page 11: Introduction to Apache Spark

RDDS

Parallelized collections:

scala> val data = Array(1, 2, 3, 4, 5)

data: Array[Int] = Array(1, 2, 3, 4, 5)

scala> val distData = sc.parallelize(data)

distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[5] at parallelize at <console>:14

External datasets:

scala> val distFile = sc.textFile("README.md")

distFile: org.apache.spark.rdd.RDD[String] = README.md MappedRDD[7] at textFile at <console>:12

Page 12: Introduction to Apache Spark

RDDS

Two types of operations on RDDs:

transformations and actions

A transformation is a lazy (not computed immediately) operation on an RDD that yields another RDD

An action is an operation that triggers a computation, returns a value back to the Master, or writes to a stable storage system
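
A minimal spark-shell sketch of the difference (illustrative, not from the slides): the map below is only recorded; nothing runs until the collect() action forces evaluation.

scala> val squares = sc.parallelize(1 to 5).map(x => x * x)   // transformation: lazy, returns a new RDD
scala> squares.collect()                                      // action: triggers the computation
Result: Array[Int] = Array(1, 4, 9, 16, 25)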

Page 13: Introduction to Apache Spark

RDDS : COMMONLY USED TRANSFORMATIONS

Transformation & Purpose Example & Result

filter(func) – Purpose: new RDD by selecting those data elements on which func returns true

scala> val rdd = sc.parallelize(List("ABC", "BCD", "DEF"))
scala> val filtered = rdd.filter(_.contains("C"))
scala> filtered.collect()
Result: Array[String] = Array(ABC, BCD)

map(func) – Purpose: return a new RDD by applying func to each data element

scala> val rdd = sc.parallelize(List(1,2,3,4,5))
scala> val times2 = rdd.map(_*2)
scala> times2.collect()
Result: Array[Int] = Array(2, 4, 6, 8, 10)

flatMap(func) – Purpose: similar to map, but func returns a Seq instead of a value. For example, mapping a sentence into a Seq of words

scala> val rdd = sc.parallelize(List("Spark is awesome", "It is fun"))
scala> val fm = rdd.flatMap(str => str.split(" "))
scala> fm.collect()
Result: Array[String] = Array(Spark, is, awesome, It, is, fun)

Page 14: Introduction to Apache Spark

RDDS : COMMONLY USED TRANSFORMATIONS

Transformation & Purpose Example & Result

reduceByKey(func, [numTasks]) – Purpose: aggregate the values of a key using a function. "numTasks" is an optional parameter to specify the number of reduce tasks

scala> val word1 = fm.map(word => (word, 1))
scala> val wrdCnt = word1.reduceByKey(_ + _)
scala> wrdCnt.collect()
Result: Array[(String, Int)] = Array((is,2), (It,1), (awesome,1), (Spark,1), (fun,1))

groupByKey([numTasks]) – Purpose: convert (K, V) to (K, Iterable<V>)

scala> val cntWrd = wrdCnt.map { case (word, count) => (count, word) }
scala> cntWrd.groupByKey().collect()
Result: Array[(Int, Iterable[String])] = Array((1,ArrayBuffer(It, awesome, Spark, fun)), (2,ArrayBuffer(is)))

distinct([numTasks]) – Purpose: eliminate duplicates from the RDD

scala> fm.distinct().collect()
Result: Array[String] = Array(is, It, awesome, Spark, fun)

Page 15: Introduction to Apache Spark

RDDS : COMMONLY USED ACTIONS

Action & Purpose Example & Result

count() – Purpose: get the number of data elements in the RDD

scala> val rdd = sc.parallelize(List('A','B','C'))
scala> rdd.count()
Result: Long = 3

collect() – Purpose: get all the data elements in the RDD as an Array

scala> val rdd = sc.parallelize(List('A','B','C'))
scala> rdd.collect()
Result: Array[Char] = Array(A, B, C)

reduce(func) – Purpose: aggregate the data elements in the RDD using a function that takes two arguments and returns one

scala> val rdd = sc.parallelize(List(1,2,3,4))
scala> rdd.reduce(_ + _)
Result: Int = 10

take(n) – Purpose: fetch the first n data elements of the RDD. Computed by the driver program

scala> val rdd = sc.parallelize(List(1,2,3,4))
scala> rdd.take(2)
Result: Array[Int] = Array(1, 2)

Page 16: Introduction to Apache Spark

RDDS : COMMONLY USED ACTIONS

Action & Purpose Example & Result

foreach(func) – Purpose: execute func for each data element in the RDD. Usually used to update an accumulator or to interact with external systems

scala> val rdd = sc.parallelize(List(1,2))
scala> rdd.foreach(x => println("%s*10=%s".format(x, x*10)))
Result:
1*10=10
2*10=20

first() – Purpose: retrieve the first data element of the RDD. Similar to take(1)

scala> val rdd = sc.parallelize(List(1,2,3,4))
scala> rdd.first()
Result: Int = 1

saveAsTextFile(path) – Purpose: write the content of the RDD to a text file or a set of text files on the local file system/HDFS

scala> val hamlet = sc.textFile("readme.txt")
scala> hamlet.filter(_.contains("Spark")).saveAsTextFile("filtered")
Result:
…/filtered$ ls
_SUCCESS  part-00000  part-00001

Page 17: Introduction to Apache Spark

RDDS :

For a more detailed list of actions and transformations, please refer to:

http://spark.apache.org/docs/latest/programming-guide.html#transformations

http://spark.apache.org/docs/latest/programming-guide.html#actions

Page 18: Introduction to Apache Spark

PERSISTENCE

Spark can persist (or cache) a dataset in memory across operations

Each node stores in memory any slices of it that it computes and reuses them in other actions on that dataset – often making future actions more than 10x faster

The cache is fault-tolerant: if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it
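
A minimal spark-shell sketch (assuming a local README.md, as in the earlier examples):

scala> val lines = sc.textFile("README.md")
scala> val sparkLines = lines.filter(_.contains("Spark")).cache()   // mark the RDD for in-memory caching
scala> sparkLines.count()   // first action computes the RDD and caches its partitions
scala> sparkLines.count()   // later actions reuse the cached partitions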

Page 19: Introduction to Apache Spark

PERSISTENCE

Page 20: Introduction to Apache Spark

PERSISTENCE

Page 21: Introduction to Apache Spark

PERSISTENCE : STORAGE LEVEL

Storage Level Purpose

MEMORY_ONLY (default level) Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed.

MEMORY_AND_DISK Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.

MEMORY_ONLY_SER Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.

MEMORY_AND_DISK_SER Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.

DISK_ONLY Store the RDD partitions only on disk.

MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.

Same as the levels above, but replicate each partition on two cluster nodes.
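
A non-default level is selected through persist() rather than cache(); a minimal sketch (README.md is just a placeholder input):

import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("README.md")
rdd.persist(StorageLevel.MEMORY_AND_DISK)   // spill to disk instead of recomputing
rdd.count()                                 // first action materializes and persists the RDD
rdd.unpersist()                             // evict it manually when no longer needed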

Page 22: Introduction to Apache Spark

SIMPLE SPARK APPS : WORDCOUNT

Download project from github:

https://github.com/MohamedHedi/SparkSamples

sbt

compile

assembly

WordCount.scala:

val logFile = args(0)
val conf = new SparkConf().setAppName("WordCount")
val sc = new SparkContext(conf)
val logData = sc.textFile(logFile, 2).cache()
val numApache = logData.filter(line => line.contains("apache")).count()
val numSpark = logData.filter(line => line.contains("spark")).count()
println("Lines with apache: %s, Lines with spark: %s".format(numApache, numSpark))

Page 23: Introduction to Apache Spark

SPARK-SUBMIT

./bin/spark-submit

--class <main-class>

--master <master-url>

--deploy-mode <deploy-mode>

--conf <key>=<value>

... # other options

<application-jar>

[application-arguments]

Page 24: Introduction to Apache Spark

SPARK-SUBMIT : LOCAL MODE

./bin/spark-submit

--class com.ebiznext.spark.examples.WordCount

--master local[4]

--deploy-mode client

--conf <key>=<value>

... # other options

.\target\scala-2.10\SparkSamples-assembly-1.0.jar

.\ressources\README.md

Page 25: Introduction to Apache Spark

CLUSTER MANAGER TYPES

Spark supports three cluster managers:

Standalone – a simple cluster manager included with Spark that makes it easy to set up a cluster.

Apache Mesos – a general cluster manager that can also run Hadoop MapReduce and service applications.

Hadoop YARN – the resource manager in Hadoop 2.

Page 26: Introduction to Apache Spark

MASTER URLS

Master URL Meaning

local Run Spark locally with one worker thread (no parallelism at all)

local[K] Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).

local[*] Run Spark locally with as many worker threads as logical cores on your machine.

spark://HOST:PORT Connect to the given Spark standalone cluster master. Default master port: 7077

mesos://HOST:PORT Connect to the given Mesos cluster. Default Mesos port: 5050

yarn-client Connect to a YARN cluster in client mode. The cluster location will be found based on the HADOOP_CONF_DIR variable.

yarn-cluster Connect to a YARN cluster in cluster mode. The cluster location will be found based on HADOOP_CONF_DIR.

Page 27: Introduction to Apache Spark

SPARK-SUBMIT : STANDALONE CLUSTER

./sbin/start-master.sh (Windows users: spark-class.cmd org.apache.spark.deploy.master.Master)

Go to the master’s web UI

Page 28: Introduction to Apache Spark

SPARK-SUBMIT : STANDALONE CLUSTER

Connect workers to the master:

./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT

Go to the master’s web UI

Page 29: Introduction to Apache Spark

SPARK-SUBMIT : STANDALONE CLUSTER

./bin/spark-submit
--class com.ebiznext.spark.examples.WordCount
--master spark://localhost:7077
.\target\scala-2.10\SparkSamples-assembly-1.0.jar
.\ressources\README.md

Page 30: Introduction to Apache Spark

SPARK SQL

Shark is being migrated to Spark SQL

Spark SQL blurs the lines between RDDs and relational tables

val conf = new SparkConf().setAppName("SparkSQL")val sc = new SparkContext(conf)val peopleFile = args(0)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)import sqlContext._

// Define the schema using a case class.case class Person(name: String, age: Int)

// Create an RDD of Person objects and register it as a table.val people = sc.textFile(peopleFile).map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))

people.registerAsTable("people")

// SQL statements can be run by using the sql methods provided by sqlContext.val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

// The results of SQL queries are SchemaRDDs and support all the normal RDD operations.// The columns of a row in the result can be accessed by ordinal.teenagers.map(t => "Name: " + t(0)).collect().foreach(println)

Page 31: Introduction to Apache Spark

SPARK GRAPHX

GraphX is the new (alpha) Spark API for graphs and graph-parallel computation.

GraphX extends the Spark RDD by introducing the Resilient Distributed Property Graph

case class Peep(name: String, age: Int)

val vertexArray = Array(
  (1L, Peep("Kim", 23)), (2L, Peep("Pat", 31)),
  (3L, Peep("Chris", 52)), (4L, Peep("Kelly", 39)),
  (5L, Peep("Leslie", 45)))

val edgeArray = Array(
  Edge(2L, 1L, 7), Edge(2L, 4L, 2),
  Edge(3L, 2L, 4), Edge(3L, 5L, 3),
  Edge(4L, 1L, 1), Edge(5L, 3L, 9))

val conf = new SparkConf().setAppName("SparkGraphx")
val sc = new SparkContext(conf)
val vertexRDD: RDD[(Long, Peep)] = sc.parallelize(vertexArray)
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)
val g: Graph[Peep, Int] = Graph(vertexRDD, edgeRDD)

val results = g.triplets.filter(t => t.attr > 7)
for (triplet <- results.collect) {
  println(s"${triplet.srcAttr.name} loves ${triplet.dstAttr.name}")
}

Page 32: Introduction to Apache Spark

SPARK MLLIB

MLlib is Spark’s scalable machine learning library consisting of common learning algorithms and utilities.

Use cases :

Recommendation Engine

Content classification

Ranking

Algorithms

Classification and regression: linear regression, decision trees, naive Bayes

Collaborative filtering: alternating least squares (ALS)

Clustering: k-means (see the MLlib sketch below)
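
A minimal sketch using MLlib's built-in KMeans (the input file kmeans_data.txt, holding space-separated numeric features, is a placeholder):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("kmeans_data.txt")
val parsed = data.map(line => Vectors.dense(line.split(' ').map(_.toDouble))).cache()

val model = KMeans.train(parsed, 2, 20)    // k = 2 clusters, 20 iterations
val wssse = model.computeCost(parsed)      // within-set sum of squared errors
println("Within Set Sum of Squared Errors = " + wssse)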

Page 33: Introduction to Apache Spark

SPARK MLLIB

SparkKMeans.scala

val sparkConf = new SparkConf().setAppName("SparkKMeans")val sc = new SparkContext(sparkConf)val lines = sc.textFile(args(0))val data = lines.map(parseVector _).cache()val K = args(1).toIntval convergeDist = args(2).toDoubleval kPoints = data.takeSample(withReplacement = false, K, 42).toArrayvar tempDist = 1.0while (tempDist > convergeDist) {val closest = data.map(p => (closestPoint(p, kPoints), (p, 1)))val pointStats = closest.reduceByKey { case ((x1, y1), (x2, y2)) => (x1 + x2, y1 + y2) }val newPoints = pointStats.map { pair =>(pair._1, pair._2._1 * (1.0 / pair._2._2))

}.collectAsMap()tempDist = 0.0for (i <- 0 until K) {tempDist += squaredDistance(kPoints(i), newPoints(i))

}for (newP <- newPoints) yield {kPoints(newP._1) = newP._2

}println("Finished iteration (delta = " + tempDist + ")")

}println("Final centers:")kPoints.foreach(println)sc.stop()

Page 34: Introduction to Apache Spark

SPARK STREAMING

Spark Streaming extends the core API to allow high-throughput, fault-tolerant stream processing of live data streams

Data can be ingested from many sources: Kafka, Flume, Twitter, ZeroMQ, TCP sockets…

Results can be pushed out to filesystems, databases, live dashboards…

Spark’s MLlib algorithms and graph processing algorithms can be applied to data streams

Page 35: Introduction to Apache Spark

SPARK STREAMING

val ssc = new StreamingContext(sparkConf, Seconds(10))

Create a StreamingContext by providing the configuration and batch duration
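
A minimal sketch of a complete streaming job on top of that context (the socket source on localhost:9999 is a placeholder, e.g. fed by `nc -lk 9999`):

val lines = ssc.socketTextStream("localhost", 9999)    // one DStream of text lines per batch
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map((_, 1)).reduceByKey(_ + _)
wordCounts.print()          // print a sample of each batch's counts

ssc.start()                 // start receiving and processing data
ssc.awaitTermination()      // block until the streaming job is stopped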

Page 36: Introduction to Apache Spark

TWITTER - SPARK STREAMING - ELASTICSEARCH

1. Twitter access

2. Streaming from Twitter:

val sparkConf = new SparkConf().setAppName("TwitterPopularTags")
sparkConf.set("es.index.auto.create", "true")
val ssc = new StreamingContext(sparkConf, Seconds(10))

// Read the four OAuth keys from a file and set the system properties so that
// the Twitter4j library used by the Twitter stream can generate OAuth credentials.
val keys = ssc.sparkContext.textFile(args(0), 2).cache()
val Array(consumerKey, consumerSecret, accessToken, accessTokenSecret) = keys.take(4)
System.setProperty("twitter4j.oauth.consumerKey", consumerKey)
System.setProperty("twitter4j.oauth.consumerSecret", consumerSecret)
System.setProperty("twitter4j.oauth.accessToken", accessToken)
System.setProperty("twitter4j.oauth.accessTokenSecret", accessTokenSecret)

val stream = TwitterUtils.createStream(ssc, None)
val hashTags = stream.flatMap(status => status.getText.split(" ").filter(_.startsWith("#")))

val topCounts10 = hashTags.map((_, 1))
  .reduceByKeyAndWindow(_ + _, Seconds(10))
  .map { case (topic, count) => (count, topic) }
  .transform(_.sortByKey(false))

Page 37: Introduction to Apache Spark

TWITTER - SPARK STREAMING - ELASTICSEARCH

3. Index in Elasticsearch

Adding the elasticsearch-spark jar to build.sbt:

libraryDependencies += "org.elasticsearch" % "elasticsearch-spark_2.10" % "2.1.0.Beta3"

Writing an RDD to Elasticsearch:

val sparkConf = new SparkConf().setAppName(appName).setMaster(master)
sparkConf.set("es.index.auto.create", "true")

import org.elasticsearch.spark._   // brings saveToEs into scope

val apache = Map("hashtag" -> "#Apache", "count" -> 10)
val spark = Map("hashtag" -> "#Spark", "count" -> 15)

val rdd = ssc.sparkContext.makeRDD(Seq(apache, spark))
rdd.saveToEs("spark/hashtag")
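
Reading the documents back as an RDD uses the same dependency; a sketch (esRDD also comes from the org.elasticsearch.spark._ import):

val hashtags = ssc.sparkContext.esRDD("spark/hashtag")   // RDD of (documentId, map of fields)
hashtags.collect().foreach(println)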