Intro to Apache Spark


Transcript of Intro to Apache Spark

Page 1: Intro to Apache Spark

Intro to Apache Spark™

By: Robert Sanders

Page 2: Intro to Apache Spark


Agenda

• What is Apache Spark?
• Apache Spark Ecosystem
• MapReduce vs. Apache Spark
• Core Spark (RDD API)
• Apache Spark Concepts
• Spark SQL (DataFrame and Dataset API)
• Spark Streaming
• Use Cases
• Next Steps

Page 3: Intro to Apache Spark


Robert Sanders

• Big Data Manager, Engineer, Architect, etc.
• Work for Clairvoyant LLC
• 5+ years of Big Data experience
• Certified Apache Spark Developer
• Email: [email protected]
• LinkedIn: https://www.linkedin.com/in/robert-sanders-61446732

Page 4: Intro to Apache Spark


What is Apache Spark?

• Open source data processing engine that runs on a cluster
  • https://github.com/apache/spark
  • Distributed under the Apache License
• Provides a number of libraries for batch, streaming, and other forms of processing
• Very fast in-memory processing engine
• Primarily written in Scala
• Support for Java, Scala, Python, and R
• Versions
  • Most used version: 1.6.x
  • Latest version: 2.0

Page 5: Intro to Apache Spark


Apache Spark Ecosystem

• Apache Spark
  • RDDs
• Spark SQL
  • Once known as "Shark" before it was completely integrated into Spark
  • For SQL, structured, and semi-structured data processing
• Spark Streaming
  • Processing of live data streams
• MLlib/ML
  • Machine learning algorithms
• GraphX
  • Graph processing

Apache Spark, Apache Spark Ecosystem: http://spark.apache.org/images/spark-stack.png

Page 6: Intro to Apache Spark


MapReduce (Hadoop)

Michele Usuelli, Example of MapReduce: http://xiaochongzhang.me/blog/wp-content/uploads/2013/05/MapReduce_Work_Structure.png

Page 7: Intro to Apache Spark


MapReduce Bottlenecks and Improvements

• Bottlenecks
  • MapReduce is a very I/O-heavy operation
  • The Map phase needs to read from disk and then write back out
  • The Reduce phase needs to read from disk and then write back out
• How can we improve it?
  • RAM is becoming very cheap and abundant
  • Use RAM for in-memory data sharing (see the sketch below)
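As a rough illustration of in-memory data sharing (a sketch only; the path and filter are made up, and RDDs are introduced later in the deck):

Scala
// Cache an intermediate dataset in RAM so two jobs don't each re-read it from disk.
val logs = sc.textFile("/path/to/logs")                 // hypothetical input path
val errors = logs.filter(_.contains("ERROR")).cache()   // keep in memory after the first computation

val totalErrors = errors.count()                             // first action: reads from disk, fills the cache
val timeouts = errors.filter(_.contains("timeout")).count()  // second action: served from memory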

Page 8: Intro to Apache Spark


MapReduce vs. Spark (Performance) (Cont.)

• Daytona Gray 100 TB sorting results
  • https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html

                  MapReduce Record    Spark Record        Spark Record (1 PB)
Data Size         102.5 TB            100 TB              1000 TB
# Nodes           2100                206                 190
# Cores           50400 physical      6592 virtualized    6080 virtualized
Elapsed Time      72 mins             23 mins             234 mins
Sort rate         1.42 TB/min         4.27 TB/min         4.27 TB/min
Sort rate/node    0.67 GB/min         20.7 GB/min         22.5 GB/min

Page 9: Intro to Apache Spark


Running Spark Jobs

• Shell
  • Shell for running Scala code:
    $ spark-shell
  • Shell for running Python code:
    $ pyspark
  • Shell for running R code:
    $ sparkR
• Submitting (Java, Scala, Python, R):
  $ spark-submit --class {MAIN_CLASS} [OPTIONS] {PATH_TO_FILE} {ARG0} {ARG1} … {ARGN}

Page 10: Intro to Apache Spark


SparkContext

• A Spark program first creates a SparkContext object
  • The Spark Shell automatically creates a SparkContext as the sc variable
• Tells Spark how and where to access a cluster
• Use the SparkContext to create RDDs (see the sketch below)
• Documentation
  • https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext
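As a minimal sketch of the points above (the application name, master URL, and data are placeholders, not from the slides):

Scala
import org.apache.spark.{SparkConf, SparkContext}

// In spark-shell a SparkContext is already created for you as `sc`.
// In a standalone application you build one yourself and tell it how to reach a cluster.
val conf = new SparkConf()
  .setAppName("intro-to-spark")   // hypothetical application name
  .setMaster("local[*]")          // local mode here; a cluster URL in production
val sc = new SparkContext(conf)

val numbers = sc.parallelize(1 to 100)   // the SparkContext is the entry point for creating RDDs
println(numbers.count())

sc.stop()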

Page 11: Intro to Apache Spark


Spark Architecture

Apache Spark, Cluster Mode Overview: http://spark.apache.org/docs/latest/img/cluster-overview.png

Page 12: Intro to Apache Spark


RDDs

• Primary abstraction used by Apache Spark
• Resilient Distributed Dataset
  • Fault-tolerant
  • Collection of elements that can be operated on in parallel
  • Distributed collection of data from any source
• Contained in an RDD (these pieces can be inspected from the API, as sketched below):
  • Set of dependencies on parent RDDs
    • Lineage (Directed Acyclic Graph – DAG)
  • Set of partitions
    • Atomic pieces of the dataset
  • A function for computing the RDD based on its parents
  • Metadata about its partitioning scheme and data placement
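A short sketch of inspecting those pieces from the RDD API (the file path is hypothetical):

Scala
val lines = sc.textFile("/path/to/file.txt")
val pairs = lines.map(line => (line.length, line))

println(pairs.partitions.length)   // the set of partitions
println(pairs.dependencies)        // dependencies on the parent RDD
println(pairs.toDebugString)       // the lineage (DAG) of this RDD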

Page 13: Intro to Apache Spark


RDDs (Cont.)

• RDDs are immutable
  • Allows for more effective fault tolerance
• Intended to support abstract datasets while also maintaining MapReduce properties like automatic fault tolerance, locality-aware scheduling, and scalability
• The RDD API is built to resemble the Scala Collections API
• Programming Guide
  • http://spark.apache.org/docs/latest/quick-start.html

Page 14: Intro to Apache Spark


RDDs (Cont.)

• Lazy Evaluation
  • Spark waits for an action to be called before distributing the work to the worker nodes (see the sketch below)

Surendra Pratap Singh - To The New, Working with RDDs: http://www.tothenew.com/blog/wp-content/uploads/2015/02/580x402xSpark.jpg.pagespeed.ic.KZMzgXwkwB.jpg
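A minimal sketch of lazy evaluation (hypothetical file path):

Scala
val lines = sc.textFile("/path/to/file.txt")   // nothing is read yet
val upper = lines.map(_.toUpperCase)           // transformations only record the lineage
val n = upper.count()                          // count() is an action: the job is scheduled and run now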

Page 15: Intro to Apache Spark


Create RDD

• Can only be created using the SparkContext or by adding a transformation to an existing RDD
• Using the SparkContext:
  • Parallelized Collections – take an existing collection and run functions on it in parallel

    rdd = sc.parallelize(["some", "list", "to", "parallelize"], [numTasks])

  • File Datasets – run functions on each record of a file in the Hadoop Distributed File System or any other storage system supported by Hadoop

    rdd = sc.textFile("/path/to/file", [numTasks])
    rdd = sc.objectFile("/path/to/file", [numTasks])

Page 16: Intro to Apache Spark


API (Overview)

Berkeley.edu, Transformations and Actions: http://www.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-82.pdf

Page 17: Intro to Apache Spark


Word Count Example

Scala

val textFile = sc.textFile("/path/to/file.txt")
val counts = textFile.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
counts.saveAsTextFile("/path/to/output")

Python

text_file = sc.textFile("/path/to/file.txt")
counts = text_file.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("/path/to/output")

Page 18: Intro to Apache Spark


Word Count Example (Java 7)

JavaRDD<String> textFile = sc.textFile("/path/to/file.txt");
JavaRDD<String> words = textFile.flatMap(new FlatMapFunction<String, String>() {
  public Iterable<String> call(String line) {
    return Arrays.asList(line.split(" "));
  }
});
JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
  public Tuple2<String, Integer> call(String word) {
    return new Tuple2<String, Integer>(word, 1);
  }
});
JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
  public Integer call(Integer a, Integer b) {
    return a + b;
  }
});

counts.saveAsTextFile("/path/to/output");

Page 19: Intro to Apache Spark


Word Count Example (Java 8)

JavaRDD<String> textFile = sc.textFile("/path/to/file.txt");
JavaPairRDD<String, Integer> counts = textFile
    .flatMap(line -> Arrays.asList(line.split(" ")))
    .mapToPair(w -> new Tuple2<String, Integer>(w, 1))
    .reduceByKey((a, b) -> a + b);
counts.saveAsTextFile("/path/to/output");

Page 20: Intro to Apache Spark


RDD Lineage Graph

val textFile = sc.textFile("/path/to/file.txt")
val counts = textFile.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
counts.toDebugString

res1: String =
(1) ShuffledRDD[7] at reduceByKey at <console>:23 []
 +-(1) MapPartitionsRDD[6] at map at <console>:23 []
    |  MapPartitionsRDD[5] at flatMap at <console>:23 []
    |  /path/to/file.txt MapPartitionsRDD[3] at textFile at <console>:21 []
    |  /path/to/file.txt HadoopRDD[2] at textFile at <console>:21 []

Page 21: Intro to Apache Spark


RDD Persistence

• Each node stores any partitions of the RDD that it computes in memory and reuses them in other actions on that dataset
• After marking an RDD to be persisted, the first time the dataset is computed in an action, it will be kept in memory on the nodes
• Allows future actions to be much faster (often by more than 10x) since you're not re-computing the data every time you perform an action
• If the data is too big to be cached, it will spill to disk and performance will gradually degrade
  • Least Recently Used (LRU) replacement policy

Page 22: Intro to Apache Spark


RDD Persistence (Storage Levels)

Storage Level – Meaning

• MEMORY_ONLY – Store the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.
• MEMORY_AND_DISK – Store the RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
• MEMORY_ONLY_SER – Store the RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
• MEMORY_AND_DISK_SER – Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.
• DISK_ONLY – Store the RDD partitions only on disk.
• MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. – Same as the levels above, but replicate each partition on two cluster nodes.

Page 23: Intro to Apache Spark


RDD Persistence APIs

rdd.persist()
rdd.persist(StorageLevel)
• Persist this RDD with the default storage level (MEMORY_ONLY)
• You can override the StorageLevel for fine-grained control over persistence

rdd.cache()
• Persists the RDD with the default storage level (MEMORY_ONLY)

rdd.checkpoint()
• The RDD will be saved to a file inside the checkpoint directory set with SparkContext#setCheckpointDir("/path/to/dir")
• Used for RDDs with long lineage chains and wide dependencies, since they would be expensive to re-compute

rdd.unpersist()
• Marks the RDD as non-persistent and/or removes all blocks of it from memory and disk

A short usage sketch of these APIs follows.
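A minimal sketch of the persistence APIs above (paths are hypothetical):

Scala
import org.apache.spark.storage.StorageLevel

sc.setCheckpointDir("/path/to/checkpoint/dir")

val events = sc.textFile("/path/to/events").map(_.toLowerCase)

events.persist(StorageLevel.MEMORY_AND_DISK)   // override the default MEMORY_ONLY level
events.checkpoint()                            // mark for checkpointing before any job runs on the RDD

events.count()   // first action: computes the RDD, caches it, and writes the checkpoint
events.count()   // later actions reuse the cached/checkpointed data instead of recomputing it

events.unpersist()                             // remove the cached blocks from memory and disk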

Page 24: Intro to Apache Spark


Fault Tolerance

• RDDs contain lineage graphs (coarse-grained updates/transformations) that help rebuild partitions that were lost
• Only the lost partitions of an RDD need to be recomputed upon failure
• They can be recomputed in parallel on different nodes without having to roll back the entire application
• Also lets the system tolerate slow nodes (stragglers) by running a backup copy of the troubled task
  • The original process on the straggling node is killed when the new process completes
• Cached/checkpointed partitions are also used to re-compute lost partitions if they are available

Page 25: Intro to Apache Spark


Spark SQL

• Spark module for structured data processing
• The most popular Spark module in the ecosystem
• It is highly recommended to use the DataFrame or Dataset API because of the performance benefits
• Runs SQL/HiveQL queries, optionally alongside or replacing existing Hive deployments
• Use the SQLContext to perform operations
  • Run SQL queries
  • Use the DataFrame API
  • Use the Dataset API
• White Paper
  • http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf
• Programming Guide
  • https://spark.apache.org/docs/latest/sql-programming-guide.html

Page 26: Intro to Apache Spark


SQLContext

• Used to create DataFrames and Datasets
• The Spark Shell automatically creates a SQLContext as the sqlContext variable
• Implementations
  • SQLContext
  • HiveContext
    • An instance of the Spark SQL execution engine that integrates with data stored in Hive
• Documentation
  • https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SQLContext
• As of Spark 2.0, use SparkSession instead (see the sketch below)
  • https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SparkSession
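A minimal sketch of obtaining the Spark SQL entry point in both versions (file paths and the application name are illustrative):

Scala
// Spark 1.6: wrap the existing SparkContext (spark-shell already exposes this as `sqlContext`).
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val people = sqlContext.read.json("/path/to/people.json")

// Spark 2.0+: SparkSession subsumes SQLContext and HiveContext.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .appName("intro-to-spark-sql")   // hypothetical application name
  .enableHiveSupport()             // optional, for HiveContext-style integration
  .getOrCreate()
val people2 = spark.read.json("/path/to/people.json")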

Page 27: Intro to Apache Spark


DataFrame API

• A distributed collection of rows organized into named columns
  • You know the names of the columns and their data types
  • Like Pandas and R data frames
• Unlike RDDs, DataFrames keep track of their schema and support various relational operations that lead to more optimized execution
  • Catalyst Optimizer (see the sketch below)
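A small sketch of schema awareness and relational-style operations (the file and column names are made up):

Scala
val users = sqlContext.read.json("/path/to/users.json")

users.printSchema()                 // a DataFrame knows its column names and types

users.select("name", "age")         // relational operations the Catalyst optimizer can plan efficiently
     .filter(users("age") > 21)
     .groupBy("name")
     .count()
     .show()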

Page 29: Intro to Apache Spark


DataFrame API (SQL Queries)

• One use of Spark SQL is to execute SQL queries written using either a basic SQL syntax or HiveQL

Scala
val df = sqlContext.sql("<SQL>")

Python
df = sqlContext.sql("<SQL>")

Java
Dataset<Row> df = sqlContext.sql("<SQL>");

Page 30: Intro to Apache Spark


DataFrame API (DataFrame Reader and Writer)

DataFrameReader
val df = sqlContext.read
  .format("json")
  .option("samplingRatio", "0.1")
  .load("/path/to/file.json")

DataFrameWriter
df.write
  .format("parquet")
  .mode("append")
  .partitionBy("year")
  .saveAsTable("table_name")

Page 31: Intro to Apache Spark


DataFrame API

SQL Statement:

SELECT name, avg(age)
FROM people
GROUP BY name

Can be written as:

Scala
import org.apache.spark.sql.functions.avg
sqlContext.table("people")
  .groupBy("name")
  .agg(avg("age"))
  .collect()

Python
from pyspark.sql.functions import avg
sqlContext.table("people") \
  .groupBy("name") \
  .agg(avg("age")) \
  .collect()

Java
import static org.apache.spark.sql.functions.*;
Row[] output = sqlContext.table("people")
  .groupBy("name")
  .agg(avg(col("age")))
  .collect();

Page 32: Intro to Apache Spark


DataFrame API (UDFs)

Scala
import org.apache.spark.sql.functions.udf
val castToInt = udf[Int, String](someStr => someStr.toInt)

val df = sqlContext.table("users")
val newDF = df.withColumn("birth_year_int", castToInt(df("birth_year")))

Python
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

castToInt = udf(lambda someStr: int(someStr), IntegerType())

df = sqlContext.table("users")
newDF = df.withColumn("birth_year_int", castToInt(df.birth_year))

Java
UDF1<String, Integer> castToInt = new UDF1<String, Integer>() {
  public Integer call(final String someStr) throws Exception {
    return Integer.valueOf(someStr);
  }
};
sqlContext.udf().register("castToInt", castToInt, DataTypes.IntegerType);

Dataset<Row> df = sqlContext.table("users");
Dataset<Row> newDF = df.withColumn("birth_year_int", callUDF("castToInt", col("birth_year")));

Page 33: Intro to Apache Spark


Dataset API

• Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs with the benefits of Spark SQL’s optimized execution engine

• Use the SQLContext
• A DataFrame is simply a type alias of Dataset[Row]
• Support
  • The unified Dataset API can be used in both Scala and Java
  • Python does not yet have support for the Dataset API
• Easily convert between DataFrame and Dataset

Page 34: Intro to Apache Spark


Dataset API

Scala
import sqlContext.implicits._
case class Person(name: String, age: Long)
val df = sqlContext.read.json("people.json")
val ds: Dataset[Person] = df.as[Person]

Python
Not supported

Java
public static class Person implements Serializable {
  private String name;
  private long age;
  /* Getters and Setters */
}
Encoder<Person> personEncoder = Encoders.bean(Person.class);
Dataset<Row> df = sqlContext.read().json("people.json");
Dataset<Person> ds = df.as(personEncoder);

Page 35: Intro to Apache Spark


Spark Streaming

• Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams

Databricks, Spark Streaming: http://spark.apache.org/docs/latest/streaming-programming-guide.html

Page 36: Intro to Apache Spark


Spark Streaming (Cont.)

• Works off a micro-batch architecture
  • Polling every X seconds = Batch Interval
• Use the StreamingContext to create DStreams
  • DStream = Discretized Stream
  • Collection of discrete batches
  • Represented as a series of RDDs
    • One for each Block Interval in the Batch Interval
• Programming Guide
  • https://spark.apache.org/docs/latest/streaming-programming-guide.html

Databricks, Spark Streaming: http://spark.apache.org/docs/latest/streaming-programming-guide.html

Page 37: Intro to Apache Spark


Spark Streaming Example

• Use netcat to stream data from a TCP socket

$ nc -lk 9999

Scala
import org.apache.spark._
import org.apache.spark.streaming._

val ssc = new StreamingContext(sc, Seconds(5))

val lines = ssc.socketTextStream("localhost", 9999)
val wordCounts = lines.flatMap(line => line.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)

wordCounts.print()

ssc.start()
ssc.awaitTermination()

Python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 5)

lines = ssc.socketTextStream("localhost", 9999)
wordCounts = lines.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)

wordCounts.pprint()

ssc.start()
ssc.awaitTermination()

Page 38: Intro to Apache Spark


Spark Streaming Example (Java)

import java.util.Arrays;
import org.apache.spark.*;
import org.apache.spark.api.java.function.*;
import org.apache.spark.streaming.*;
import org.apache.spark.streaming.api.java.*;
import scala.Tuple2;

JavaStreamingContext jssc = new JavaStreamingContext(sc, Durations.seconds(5));
JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);

JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
  public Iterable<String> call(String line) {
    return Arrays.asList(line.split(" "));
  }
});
JavaPairDStream<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
  public Tuple2<String, Integer> call(String word) {
    return new Tuple2<String, Integer>(word, 1);
  }
});
JavaPairDStream<String, Integer> wordCounts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
  public Integer call(Integer a, Integer b) {
    return a + b;
  }
});

wordCounts.print();
jssc.start();
jssc.awaitTermination();

Page 39: Intro to Apache Spark


Spark Streaming Dangers

• Spark Streaming processes one batch at a time
• If the processing of each batch takes longer than the Batch Interval, you could see issues
  • Back pressure
  • Buffering
  • Eventually you'll see the stream crash
(one mitigation is sketched below)
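One common mitigation, not covered on the slide, is to cap the ingest rate so each batch finishes within the Batch Interval; a minimal sketch, assuming Spark 1.5+ where the backpressure setting is available (the application name and rate are placeholders):

Scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Let Spark throttle receivers dynamically so processing time stays under the batch interval.
val conf = new SparkConf()
  .setAppName("streaming-backpressure-sketch")           // hypothetical application name
  .set("spark.streaming.backpressure.enabled", "true")   // dynamic rate control (Spark 1.5+)
  .set("spark.streaming.receiver.maxRate", "10000")      // optional hard cap: records/sec per receiver

val ssc = new StreamingContext(conf, Seconds(5))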

Page 40: Intro to Apache Spark


Use Case #1 – Streaming

• Ingest data from RabbitMQ into Hadoop using Spark Streaming

Page 41: Intro to Apache Spark


Use Case #2 – ETL

• Perform ETL with Spark

Page 42: Intro to Apache Spark


Learn More (Courses and Videos)

• MapR Academy
  • http://learn.mapr.com/dev-360-apache-spark-essentials
• edX
  • https://www.edx.org/course/introduction-apache-spark-uc-berkeleyx-cs105x
  • https://www.edx.org/course/distributed-machine-learning-apache-uc-berkeleyx-cs120x
  • https://www.edx.org/course/big-data-analysis-apache-spark-uc-berkeleyx-cs110x
  • https://www.edx.org/xseries/data-science-engineering-apache-spark
• Coursera
  • https://www.coursera.org/learn/big-data-analysys
• Apache Spark YouTube
  • https://www.youtube.com/channel/UCRzsq7k4-kT-h3TDUBQ82-w
• Spark Summit
  • https://spark-summit.org/2016/schedule/

Page 43: Intro to Apache Spark

Interested in learning more about SparkSQL?

Well, here's an additional Desert Code Camp session to attend: Getting Started with SparkSQL

Presenter: Avinash Ramineni
Room: AH-1240
Time: 4:45 PM – 5:45 PM

Page 44: Intro to Apache Spark


References

• https://en.wikipedia.org/wiki/Apache_Spark
• http://spark.apache.org/news/spark-wins-daytona-gray-sort-100tb-benchmark.html
• https://www.datanami.com/2016/06/08/apache-spark-adoption-numbers/
• http://www.cs.berkeley.edu/~matei/papers/2011/tr_spark.pdf
• http://training.databricks.com/workshop/itas_workshop.pdf
• https://spark.apache.org/docs/latest/api/scala/index.html
• https://spark.apache.org/docs/latest/programming-guide.html
• https://github.com/databricks/learning-spark

Page 45: Intro to Apache Spark

Q&A