Introduction to Apache Spark Ecosystem


Transcript of Introduction to Apache Spark Ecosystem

Page 1: Introduction to Apache Spark Ecosystem

Using Apache Spark

Bojan Babic

Page 2: Introduction to Apache Spark Ecosystem

Apache Spark

spark.incubator.apache.org

github.com/apache/incubator-spark

[email protected]

Page 3: Introduction to Apache Spark Ecosystem

The Spark Community

Page 4: Introduction to Apache Spark Ecosystem

INTRODUCTION TO APACHE SPARK

Page 5: Introduction to Apache Spark Ecosystem

What is Spark?

A fast and expressive cluster computing system compatible with Apache Hadoop.

Efficient
• General execution graphs
• In-memory storage

Usable
• Rich APIs in Java, Scala, Python
• Interactive shell

Page 6: Introduction to Apache Spark Ecosystem

Today’s talk

Page 7: Introduction to Apache Spark Ecosystem

Key Concepts

Resilient Distributed Datasets (RDDs)
• Collections of objects spread across a cluster, stored in RAM or on disk
• Built through parallel transformations
• Automatically rebuilt on failure
• Roughly comparable to a partition in the Hadoop world, or to a dataframe elsewhere

Operations
• Transformations (e.g. map, filter, groupBy)
• Actions (e.g. count, collect, save) – illustrated in the sketch below
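A minimal sketch of the distinction, assuming the shell's SparkContext sc and a hypothetical log path: transformations only describe a new RDD (they are lazy), while actions trigger the computation and return a value.

val lines  = sc.textFile("/tmp/app.log")             // hypothetical input path
val errors = lines.filter(l => l.contains("ERROR"))  // transformation: nothing runs yet
val n      = errors.count()                          // action: computes and returns a Long
errors.take(5).foreach(println)                      // action: brings 5 lines to the driver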

Page 8: Introduction to Apache Spark Ecosystem

Working With RDDs

[Diagram: a chain of RDDs linked by transformations, ending in an action that returns a value]

val textFile = sc.textFile("/var/log/hadoop/somelog")
val linesWithSpark = textFile.filter(l => l.contains("Spark"))

linesWithSpark.count()    // action: e.g. 74
linesWithSpark.first()    // action: e.g. "# Apache Spark"

Page 9: Introduction to Apache Spark Ecosystem

Spark (batch) example

Load error messages from a log into memory, then interactively search for various patterns.

val sc = new SparkContext(config)
val lines = sc.textFile("hdfs://...")                    // base RDD
val messages = lines.filter(l => l.startsWith("ERROR"))
  .map(e => e.split("\t")(2))                            // transformed RDD
messages.cache()

messages.filter(s => s.contains("mongo")).count()        // action
messages.filter(s => s.contains("500")).count()          // action

Page 10: Introduction to Apache Spark Ecosystem

Scaling Down

Page 11: Introduction to Apache Spark Ecosystem

Spark Streaming example

val ssc = new StreamingContext(conf, Seconds(10))        // 10-second mini-batches
val kafkaParams = Map[String, String](...)
val messages = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Map(topic -> threadsPerTopic),
  storageLevel = StorageLevel.MEMORY_ONLY_SER)
val revenue = messages
  .map(_._2)                                             // createStream yields (key, value) pairs
  .filter(m => m.contains("[PURCHASE]"))
  .map(p => ("revenue", p.split("\t")(4).toDouble))      // key the amounts so reduceByKey can sum them
  .reduceByKey(_ + _)

ssc.start()
ssc.awaitTermination()

Page 12: Introduction to Apache Spark Ecosystem

Fault Recovery

RDDs track lineage information that can be used to efficiently recompute lost data

val msgs = textFile.filter(l => l.contains("ERROR"))
  .map(e => e.split("\t")(2))

[Lineage diagram: HDFS file → filter(func = startsWith(...)) → filtered RDD → map(func = split(...)) → mapped RDD]
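As an illustrative aside (not on the slide), the lineage of the msgs RDD above can be printed with RDD.toDebugString, which shows the chain of parent RDDs Spark would use to recompute lost partitions:

// Prints the recorded lineage, from the mapped RDD back to the HDFS file
println(msgs.toDebugString)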

Page 13: Introduction to Apache Spark Ecosystem

JOB EXECUTION

Page 14: Introduction to Apache Spark Ecosystem

Software Components

• Spark runs as a library in your program (1 instance per app)

• Runs tasks locally or on a cluster
  – Mesos, YARN or standalone mode

• Accesses storage systems via the Hadoop InputFormat API
  – Can use HBase, HDFS, S3, …

[Architecture diagram: your application holds a SparkContext and local threads, talks to a cluster manager, which runs Spark executors on workers backed by HDFS or other storage]

Page 15: Introduction to Apache Spark Ecosystem

Task Scheduler

• General task graphs

• Automatically pipelines functions

• Data locality aware

• Partitioning aware to avoid shuffles

[Task graph diagram: RDDs A–F connected by map, filter, groupBy and join, grouped into Stages 1–3; legend: cached partition, RDD]

Page 16: Introduction to Apache Spark Ecosystem

Advanced Features

• Controllable partitioning
  – Speed up joins against a dataset

• Controllable storage formats
  – Keep data serialized for efficiency, replicate to multiple nodes, cache on disk

• Shared variables: broadcasts, accumulators (see the sketch below)
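A minimal sketch of the shared-variable APIs in Spark 1.x; the lookup map, the input RDD and the field layout are made up for illustration:

// Broadcast: ship a read-only lookup table to every executor once
val countryNames = sc.broadcast(Map("rs" -> "Serbia", "us" -> "United States"))

// Accumulator: a counter that workers add to and only the driver reads
val badRecords = sc.accumulator(0)

val records = sc.textFile("/tmp/events.tsv")           // hypothetical TSV input
val enriched = records.map { line =>
  val code = line.split("\t")(0)
  if (!countryNames.value.contains(code)) badRecords += 1
  (code, countryNames.value.getOrElse(code, "unknown"))
}

enriched.count()              // run an action so the accumulator actually gets updated
println(badRecords.value)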

Page 17: Introduction to Apache Spark Ecosystem

WORKING WITH SPARK

Page 18: Introduction to Apache Spark Ecosystem

Using the Shell

Launching:

spark-shell
pyspark (IPYTHON=1)

Modes:

MASTER=local ./spark-shell               # local, 1 thread
MASTER=local[2] ./spark-shell            # local, 2 threads
MASTER=spark://host:port ./spark-shell   # cluster

Page 19: Introduction to Apache Spark Ecosystem

SparkContext

• Main entry point to Spark functionality

• Available in the shell as the variable sc
• In standalone programs, you'd make your own (see later for details, and the sketch below)
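A minimal standalone sketch of creating your own SparkContext (the app name and master URL are placeholders; this is the Spark 1.x API used throughout the talk):

import org.apache.spark.{SparkConf, SparkContext}

object MyApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("my-app")        // placeholder application name
      .setMaster("local[2]")       // or spark://host:port, yarn, mesos://...
    val sc = new SparkContext(conf)

    println(sc.parallelize(1 to 100).sum())   // trivial job to check the setup

    sc.stop()
  }
}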

Page 20: Introduction to Apache Spark Ecosystem

RDD Operators

• map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin
• reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip
• sample, take, first, partitionBy, mapWith, pipe, save, ... (see the sketch below)
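A short sketch exercising a few of these operators on small, made-up pair RDDs (names and data are illustrative only):

val clicks    = sc.parallelize(Seq(("home", 3), ("cart", 1), ("home", 2)))
val pageNames = sc.parallelize(Seq(("home", "Home page"), ("cart", "Shopping cart")))

val totals = clicks.reduceByKey(_ + _)      // ("home", 5), ("cart", 1)
val joined = totals.join(pageNames)         // ("home", (5, "Home page")), ...

joined.collect().foreach(println)           // action: bring the results to the driver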

Page 21: Introduction to Apache Spark Ecosystem

Add Spark to Your Project

• Scala: add the dependency to build.sbt

libraryDependencies ++= Seq(
  ("org.apache.spark" %% "spark-core" % "1.2.0")
    .exclude("org.mortbay.jetty", "servlet-api")
  ...
)

• Make sure you exclude overlapping libs

Page 22: Introduction to Apache Spark Ecosystem

and add Spark-Streaming

• Scala: add the dependency to build.sbt

libraryDependencies += "org.apache.spark" % "spark-streaming_2.11" % "1.2.0"

Page 23: Introduction to Apache Spark Ecosystem

PageRank Performance

Page 24: Introduction to Apache Spark Ecosystem

Other Iterative Algorithms

[Chart: time per iteration (s)]

Page 25: Introduction to Apache Spark Ecosystem

CONCLUSION

Page 26: Introduction to Apache Spark Ecosystem

Conclusion

• Spark offers a rich API to make data analytics fast: both fast to write and fast to run

• Achieves up to 100x speedups in real applications

• Growing community with 25+ companies contributing

Page 27: Introduction to Apache Spark Ecosystem

Thanks!!!

Disclosure: some parts of this presentation were inspired by Matei Zaharia's AMP meeting presentation in Berkeley.