Introduction to Apache Spark Ecosystem


Using Apache Spark

Bojan Babic

Apache Spark

spark.incubator.apache.org

github.com/apache/incubator-spark

user@spark.incubator.apache.org

The Spark Community

INTRODUCTION TO APACHE SPARK

What is Spark?

Fast and expressive cluster computing system compatible with Apache Hadoop

Efficient
• General execution graphs
• In-memory storage

Usable
• Rich APIs in Java, Scala, Python
• Interactive shell

Today’s talk

Key Concepts

Resilient Distributed Datasets
• Collections of objects spread across a cluster, stored in RAM or on disk
• Built through parallel transformations
• Automatically rebuilt on failure
• In Hadoop workloads an RDD corresponds to a partitioned dataset; elsewhere it is comparable to a dataframe

Operations
• Transformations (e.g. map, filter, groupBy)
• Actions (e.g. count, collect, save)
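Transformations are lazy: they only describe a computation, and nothing runs until an action is called. A minimal sketch, assuming an existing SparkContext sc:

val nums  = sc.parallelize(1 to 1000)    // build an RDD from a local collection
val evens = nums.filter(_ % 2 == 0)      // transformation: returns immediately, no job runs
evens.count()                            // action: triggers the job, returns 500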

Working With RDDs

[Diagram: RDDs flow through transformations into new RDDs; an action produces a value]

val textFile = sc.textFile("/var/log/hadoop/somelog")
val linesWithSpark = textFile.filter(l => l.contains("Spark"))
linesWithSpark.count()   // 74
linesWithSpark.first()   // # Apache Spark

Spark (batch) example

Load error messages from a log into memory, then interactively search for various patterns

val sc = new SparkContext(config)
val lines = sc.textFile("hdfs://...")                  // base RDD
val messages = lines.filter(l => l.startsWith("ERROR"))
  .map(e => e.split("\t")(2))                          // transformed RDD
messages.cache()

messages.filter(s => s.contains("mongo")).count()      // action
messages.filter(s => s.contains("500")).count()

Scaling Down

Spark Streaming example

val ssc = new StreamingContext(conf, Seconds(10))      // mini-batch interval of 10 seconds
val kafkaParams = Map[String, String](...)
val messages = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Map(topic -> threadsPerTopic),
  storageLevel = StorageLevel.MEMORY_ONLY_SER)

val revenue = messages.map(_._2)                       // the Kafka stream yields (key, value) pairs
  .filter(m => m.contains("[PURCHASE]"))
  .map(p => (p.split("\t")(0), p.split("\t")(4).toDouble))   // (key, amount); field indices are illustrative
  .reduceByKey(_ + _)
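Nothing is computed until the streaming context is started; a minimal sketch of the remaining driver boilerplate (printing the result is just for illustration):

revenue.print()          // emit each mini-batch's totals to the driver log
ssc.start()              // start receiving from Kafka and processing mini-batches
ssc.awaitTermination()   // block the driver until the streaming job is stopped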

Fault Recovery

RDDs track lineage information that can be used to efficiently recompute lost data

val msgs = textFile.filter(l => l.contains("ERROR"))
  .map(e => e.split("\t")(2))

[Lineage diagram: HDFS File → filter(func = startsWith(...)) → Filtered RDD → map(func = split(...)) → Mapped RDD]

JOB EXECUTION

Software Components

• Spark runs as a library in your program (1 instance per app)

• Runs tasks locally or on a cluster – Mesos, YARN, or standalone mode

• Accesses storage systems via the Hadoop InputFormat API – can use HBase, HDFS, S3, …

[Architecture diagram: your application creates a SparkContext, which runs local threads and talks to a cluster manager; the cluster manager launches Spark executors on workers, which read from HDFS or other storage]

Task Scheduler

• General task graphs

• Automatically pipelines functions

• Data locality aware

• Partitioning aware to avoid shuffles

[DAG diagram: RDDs A–F combined via map, filter, groupBy, and join, split into Stages 1–3; legend: RDD, cached partition]

Advanced Features

• Controllable partitioning – speed up joins against a dataset

• Controllable storage formats – keep data serialized for efficiency, replicate to multiple nodes, cache on disk

• Shared variables: broadcasts, accumulators (see the sketch below)
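A minimal sketch of these features, assuming an existing SparkContext sc; the dataset names and paths are illustrative:

import org.apache.spark.SparkContext._   // pair-RDD operations (needed on Spark 1.2)
import org.apache.spark.HashPartitioner
import org.apache.spark.storage.StorageLevel

// Controllable partitioning: pre-partition a large pair RDD once so repeated joins against it avoid re-shuffling it
val users  = sc.textFile("hdfs://.../users.tsv").map(l => (l.split("\t")(0), l))
val events = sc.textFile("hdfs://.../events.tsv").map(l => (l.split("\t")(0), l))
val partitionedUsers = users.partitionBy(new HashPartitioner(100))
  .persist(StorageLevel.MEMORY_ONLY_SER)        // controllable storage: keep it serialized in memory
val joined = partitionedUsers.join(events)      // the pre-partitioned side is not shuffled again

// Shared variables
val countryNames = sc.broadcast(Map("US" -> "United States"))   // read-only value shipped once per executor
val badRecords   = sc.accumulator(0)                            // counter that tasks add to, read on the driver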

WORKING WITH SPARK

Using the Shell

Launching:

spark-shell
pyspark (IPYTHON=1)

Modes:

MASTER=local ./spark-shell              # local, 1 thread
MASTER=local[2] ./spark-shell           # local, 2 threads
MASTER=spark://host:port ./spark-shell  # cluster

SparkContext

• Main entry point to Spark functionality

• Available in shell as variable sc
• In standalone programs, you'd make your own (see the sketch below; more details later)
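In a standalone program, a minimal sketch of creating your own context (the app name and master URL are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MyApp")
  .setMaster("local[2]")          // or spark://host:port, yarn-client, mesos://...
val sc = new SparkContext(conf)
// ... build and run RDD operations with sc ...
sc.stop()                         // release resources when done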

RDD Operators

• map • filter • groupBy • sort • union • join • leftOuterJoin • rightOuterJoin

• reduce • count • fold • reduceByKey • groupByKey • cogroup • cross • zip

• sample • take • first • partitionBy • mapWith • pipe • save ...
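A few of these operators chained together, as a minimal sketch (assuming an existing SparkContext sc; the data is illustrative):

import org.apache.spark.SparkContext._   // pair-RDD operations (needed on Spark 1.2)

val pairs  = sc.parallelize(Seq(("a", 1), ("b", 5), ("a", 3)))
val labels = sc.parallelize(Seq(("a", "alpha"), ("b", "beta")))

pairs.filter { case (_, v) => v > 1 }     // transformation
     .reduceByKey(_ + _)                  // transformation on (key, value) pairs
     .join(labels)                        // ("a", (3, "alpha")), ("b", (5, "beta"))
     .collect()                           // action: bring the results to the driver

pairs.take(2)    // action: first two elements
pairs.count()    // action: 3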

Add Spark to Your Project

• Scala: add dependency to build.sbt

libraryDependencies ++= Seq(
  ("org.apache.spark" %% "spark-core" % "1.2.0").
    exclude("org.mortbay.jetty", "servlet-api")
  ...
)

• make sure you exclude overlapping libs

and add Spark Streaming

• Scala: add dependency to build.sbt

libraryDependencies += "org.apache.spark" % "spark-streaming_2.11" % "1.2.0"

PageRank Performance

Other Iterative Algorithms

[Charts: time per iteration (s)]

CONCLUSION

Conclusion

• Spark offers a rich API to make data analytics fast: both fast to write and fast to run

• Achieves 100x speedups in real applications

• Growing community with 25+ companies contributing

Thanks!!!

Disclosure: some parts of this presentation were inspired by Matei Zaharia's AMP meetup talk in Berkeley.