Apache Spark, the Next Generation Cluster Computing


Apache Spark - The Next Generation Cluster Computing

Ivan Lozić, 04/25/2017

Ivan Lozić, software engineer & entrepreneur

Scala & Spark, C#, Node.js, Swift

Web page: www.deegloo.com
E-Mail: ilozic@gmail.com

LinkedIn: https://www.linkedin.com/in/ilozic/

Zagreb, Croatia

Contents

● Apache Spark and its relation to Hadoop MapReduce
● What makes Apache Spark run fast
● How to use Spark's rich API to build batch ETL jobs
● Streaming capabilities
● Structured streaming

3

Apache Hadoop

4

Apache Hadoop

● Open source framework for distributed storage and processing
● Origins are in the project "Nutch" back in 2002 (Cutting, Cafarella)
● In 2006, Yahoo! created Hadoop, based on GFS and MapReduce
● Based on the MapReduce programming model
● Fundamental assumption: all the modules are built to handle hardware failures automatically
● Clusters are built of commodity hardware

5

6

Apache Spark

7

Motivation

● Hardware - CPU compute bottleneck
● Users - democratise access to data and improve usability
● Applications - necessity to build near real-time big data applications

8

Apache Spark

● Open source, fast and expressive cluster computing framework designed for big data analytics
● Compatible with Apache Hadoop
● Developed at UC Berkeley's AMPLab in 2009 and donated to the Apache Software Foundation in 2013
● Original author - Matei Zaharia
● Databricks Inc. - the company behind Apache Spark

9

Apache Spark

● General distributed computing engine which unifies:
○ SQL and DataFrames
○ Real-time streaming (Spark Streaming)
○ Machine learning (Spark ML/MLlib)
○ Graph processing (GraphX)

10

Apache Spark

● Runs everywhere - standalone, EC2, Hadoop YARN, Apache Mesos
● Reads and writes from/to (see the sketch below):
○ File/Directory
○ HDFS/S3
○ JDBC
○ JSON
○ CSV
○ Parquet
○ Cassandra, HBase, ...
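A minimal sketch of reading and writing a few of these sources through the DataFrame reader/writer API; the parquet path, the JDBC connection string and the table name are placeholders, not taken from the slides:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("data-sources-sketch")
  .master("local[*]")
  .getOrCreate()

// CSV with a header row (the SFPD file used later in the deck)
val csvDF = spark.read
  .option("header", "true")
  .csv("/data/SFPD_Incidents_2003.csv")

// Parquet directory
val parquetDF = spark.read.parquet("/data/warehouse/categories.parquet")

// JDBC table (requires the JDBC driver on the classpath)
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://dbhost:1433;databaseName=incidents")
  .option("dbtable", "dbo.Incidents")
  .load()

// Writing goes through the symmetric DataFrameWriter API
csvDF.write.mode("overwrite").parquet("/data/warehouse/incidents.parquet")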

11

Apache Spark - architecture

12

source: Databricks

Word count - MapReduce vs Spark

13

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

  public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}

val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

Hadoop ecosystem

14

Who uses Apache Spark?

15

Core data abstractions

16

Resilient Distributed Dataset

● RDDs are partitioned collections of objects - the building blocks of Spark
● Immutable and provide fault tolerant computation
● Two types of operations (see the sketch below):
1. Transformations - map, reduce, sort, filter, groupBy, ...
2. Actions - collect, count, take, first, foreach, saveToCassandra, ...
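A minimal sketch of the two kinds of operations, assuming a SparkContext named sc and a hypothetical input file; the transformations only describe the computation, and nothing runs until the action is called:

val lines = sc.textFile("/data/words.txt")   // transformation - nothing is read yet
val words = lines.flatMap(_.split(" "))      // transformation - lazy
val longWords = words.filter(_.length > 3)   // transformation - lazy
val count = longWords.count()                // action - builds the DAG and runs the job
println(s"long words: $count")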

17

RDD

● Types of operations are based on the Scala collection API
● Transformations are lazily evaluated constituents of a DAG (Directed Acyclic Graph)
● Actions invoke DAG creation and the actual computation (see the snippet below)
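To inspect the lineage/DAG that the lazy transformations have recorded, RDDs expose toDebugString; using the longWords RDD from the sketch above:

// Prints the chain of transformations (the lineage) Spark will execute for this RDD
println(longWords.toDebugString)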

18

RDD

19

Data shuffling

● Sending data over the network
● Slow - should be minimized as much as possible!
● Typical example - groupByKey (slow) vs reduceByKey (faster); see the sketch below
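A minimal sketch of the difference, reusing the words RDD from the earlier sketch; both produce the same counts, but reduceByKey pre-aggregates on every partition before the shuffle:

val pairs = words.map(word => (word, 1))

// groupByKey: every single (word, 1) record is shuffled across the network, then summed
val slowCounts = pairs.groupByKey().mapValues(_.sum)

// reduceByKey: values are combined per partition first (map-side combine), so far less data moves
val fastCounts = pairs.reduceByKey(_ + _)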

20

RDD - the problems

● They express the how better than the what
● Operations and data types inside closures are a black box for Spark - Spark cannot make optimizations

21

// byCommaButNotUnderQuotes is a regex (defined elsewhere) that splits on commas outside of quotes
val category = spark.sparkContext
  .textFile("/data/SFPD_Incidents_2003.csv")
  .map(line => line.split(byCommaButNotUnderQuotes)(1))
  .filter(cat => cat != "Category")

Structure (Structured APIs)

22

SparkSQL

23

● Originally named "Shark" - to enable HiveQL queries
● As of Spark 2.0 - SQL 2003 support

category.toDF("categoryName").createOrReplaceTempView("category")

spark.sql(""" SELECT categoryName, count(*) AS Count FROM category GROUP BY categoryName ORDER BY 2 DESC""").show(5)

DataFrame

● Higher level abstraction (DSL) for manipulating data
● Distributed collection of rows organized into named columns
● Modeled after the Pandas DataFrame
● A DataFrame has a schema (something an RDD is missing)

24

val categoryDF = category.toDF("categoryName")

categoryDF
  .groupBy("categoryName")
  .count()
  .orderBy($"Count".desc)
  .show(5)

DataFrame

25

Structured APIs error-check comparison

26

source: Databricks

Dataset

● Extension to DataFrame
● Type-safe
● DataFrame = Dataset[Row]

27

case class Incident(Category: String, DayOfWeek: String)

val incidents = spark
  .read
  .option("header", "true")
  .csv("/data/SFPD_Incidents_2003.csv")
  .select("Category", "DayOfWeek")
  .as[Incident]

val days = Array("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")

val histogram = incidents.groupByKey(_.Category).mapGroups {
  case (category, daysOfWeek) =>
    val buckets = new Array[Int](7)
    daysOfWeek.map(_.DayOfWeek).foreach { dow =>
      buckets(days.indexOf(dow)) += 1
    }
    (category, buckets)
}

What makes Spark fast?

28

In memory computation

Hadoop MapReduce:
● Fault tolerance is achieved by using HDFS
● It is easy to spend 90% of the time on disk I/O alone

29

[Diagram: an iterative job on MapReduce - input → iter. 1 → iter. 2 → ..., with an HDFS read and an HDFS write around every iteration]

Spark:
● Fault tolerance is provided by building a lineage of transformations (see the caching sketch below)
● Data is not replicated

[Diagram: the same iterative job on Spark - input → iter. 1 → iter. 2 → ..., with intermediate results kept in memory]
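A minimal sketch of keeping a working set in memory across passes; the CSV path comes from the other slides, while the naive split on "," (which ignores quoted commas) is a simplification for illustration:

// cache() keeps the parsed partitions in memory after the first action,
// so subsequent passes skip the disk/HDFS read entirely
val parsed = spark.sparkContext
  .textFile("/data/SFPD_Incidents_2003.csv")
  .map(_.split(","))
  .cache()

val total = parsed.count()                               // first pass: reads the file and fills the cache
val thefts = parsed.filter(_.contains("THEFT")).count()  // second pass: served from memory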

Catalyst - query optimizer

30

source: Databricks

● Applies transformations to convert an unoptimized query plan into an optimized one

Project Tungsten

● Improve Spark execution memory and CPU efficiency by:
○ Performing explicit memory management instead of relying on JVM objects (Dataset encoders)
○ Generating code on the fly to fuse multiple operators into one (whole-stage codegen)
○ Introducing cache-aware computation
○ Using an in-memory columnar format

● Bringing Spark closer to the bare metal

31

Dataset encoders

● Encoders translate between domain objects and Spark's internal format (see the sketch below)

32

source: Databricks
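A minimal sketch of getting an encoder explicitly; in everyday code import spark.implicits._ supplies them implicitly, and Encoders.product derives one from a case class:

import org.apache.spark.sql.{Encoder, Encoders}

// Same shape as the Incident case class used on the neighbouring slides
case class Incident(Category: String, DayOfWeek: String)

// The encoder translates Incident objects to and from Spark's internal binary (Tungsten) row format
val incidentEncoder: Encoder[Incident] = Encoders.product[Incident]

// The schema Spark derives from the case class fields
incidentEncoder.schema.printTreeString()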

Dataset encoders

● Encoders bridge objects with data sources

33

{ "Category": "THEFT", "IncidntNum": "150060275", "DayOfWeek": "Saturday"}

case class Incident(IncidntNum: Int, Category: String, DayOfWeek: String)

Dataset benchmark

Space efficiency

34

source: Databricks

Dataset benchmark

Serialization/deserialization performance

35

source: Databricks

Whole stage codegen

● Fuse the operators together
● Generate code on the fly
● The idea: generate specialized code, as if it were written manually, to be fast

Result: Spark 2.0 is 10x faster than Spark 1.6

36

Whole stage codegen

37

SELECT COUNT(*) FROM store_sales WHERE ss_item_sk=1000

Whole stage codegen

Volcano iterator model

38

Whole stage codegen

What if we asked an intern to write this in C#?

39

long count = 0;
foreach (var ss_item_sk in store_sales) {
    if (ss_item_sk == 1000)
        count++;
}

Volcano vs Intern

40

Volcano

Intern

source: Databricks

Volcano vs Intern

41

Developing ETL with Spark

42

Choose your favorite IDE

43

Define Spark job entry point

44

import org.apache.spark.sql.SparkSession

object IncidentsJob {

  def main(args: Array[String]): Unit = {

    val spark = SparkSession.builder()
      .appName("Incidents processing job")
      .config("spark.sql.shuffle.partitions", "16")
      .master("local[4]")
      .getOrCreate()

    // { spark transformations and actions... }

    System.exit(0)
  }
}

Create build.sbt file

45

lazy val root = (project in file("."))
  .settings(
    organization := "com.mycompany",
    name := "spark.job.incidents",
    version := "1.0.0",
    scalaVersion := "2.11.8",
    mainClass in Compile := Some("com.mycompany.spark.job.incidents.main")
  )

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.1" % "provided",
  "org.apache.spark" %% "spark-sql" % "2.0.1" % "provided",
  "org.apache.spark" %% "spark-streaming" % "2.0.1" % "provided",
  "com.microsoft.sqlserver" % "sqljdbc4" % "4.0"
)

Create application (fat) jar file

$ sbt compile

$ sbt test

$ sbt assembly (sbt-assembly plugin)
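A minimal sketch of wiring up the plugin; the plugin version and the merge strategy are assumptions that commonly work for Spark jobs (the "provided" Spark dependencies above are left out of the fat jar by sbt-assembly by default):

// project/plugins.sbt (plugin version is an assumption)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.5")

// build.sbt - resolve duplicate META-INF entries when dependency jars are merged
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case _                             => MergeStrategy.first
}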

46

Submit job via spark-submit command

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]
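A concrete invocation sketch for the job defined above; the master URL, deploy mode, assembly jar name and the input path argument are assumptions:

./bin/spark-submit \
  --class com.mycompany.spark.job.incidents.main \
  --master spark://spark-master:7077 \
  --deploy-mode cluster \
  --conf spark.sql.shuffle.partitions=16 \
  spark.job.incidents-assembly-1.0.0.jar \
  /data/SFPD_Incidents_2003.csv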

47

Example workflow

48

[Workflow diagram, roughly:]
1. Pull content from the code repository
2. Take the build number (331)
3. Build & test, producing the job artifact job331.jar
4. Copy the artifact to the cluster
5. Create/schedule job job331 (HTTP), with a notification
6. spark-submit job331

Spark Streaming

49

Apache Spark streaming

● Scalable, fault tolerant streaming system
● Receivers receive data streams and chop them into batches
● Spark processes the batches and pushes out the results

50

● Input: Files, Socket, Kafka, Flume, Kinesis...

Apache Spark streaming

51

def main(args: Array[String]): Unit = {
  val conf = new SparkConf()
    .setMaster("local[2]")
    .setAppName("Incidents processing job - Stream")

  val ssc = new StreamingContext(conf, Seconds(1))

  // Topics and kafkaParams are defined elsewhere in the job
  val topics = Set(
    Topics.Incident
    // ... other topics elided on the original slide
  )

  val directKafkaStream = KafkaUtils.createDirectStream[Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder](
    ssc, kafkaParams, topics)

  // process batches: take the message value, decode it to text and split into words
  directKafkaStream.map(pair => new String(pair._2)).flatMap(_.split(" "))...

  // Start the computation
  ssc.start()
  ssc.awaitTermination()

  System.exit(0)
}

Apache Spark streaming

● Integrates with the rest of the ecosystem:
○ Combine batch and stream processing
○ Combine machine learning with streaming
○ Combine SQL with streaming

52

Structured streaming

53

[Alpha version in Spark 2.1]


Structured streaming (continuous apps)

● High-level streaming API built on DataFrames
● The Catalyst optimizer creates an incremental execution plan

● Unifies streaming, interactive and batch queries

● Supports multiple sources and sinks

● E.g. aggregate data in a stream, then serve it using JDBC (see the sketch below)
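A minimal sketch of such a continuous aggregation, reusing the incident CSV layout from the earlier slides; the source directory, the checkpoint location and the console sink are assumptions for illustration:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder()
  .appName("category-counts")
  .master("local[2]")
  .getOrCreate()

val schema = new StructType()
  .add("Category", StringType)
  .add("DayOfWeek", StringType)

// Every CSV file dropped into the directory becomes part of one unbounded stream
val incidents = spark.readStream
  .option("header", "true")
  .schema(schema)
  .csv("/data/source")

// Running count per category, maintained incrementally as new files arrive
val counts = incidents.groupBy("Category").count()

val query = counts.writeStream
  .outputMode("complete")    // emit the full updated aggregate on each trigger
  .format("console")
  .option("checkpointLocation", "/data/checkpoints/category-counts")
  .start()

query.awaitTermination()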

54

Structured streaming key idea

The simplest way to perform streaming analytics is not having to reason about streaming.

55

Structured streaming

56

Structured streaming

● Reusing same API

57

finite:

val categories = spark
  .read
  .option("header", "true")
  .schema(schema)
  .csv("/data/source")
  .select("Category")

infinite:

val categories = spark
  .readStream
  .option("header", "true")
  .schema(schema)
  .csv("/data/source")
  .select("Category")

Structured streaming

● Reusing same API

58

finite:

categories
  .write
  .format("parquet")
  .save("/data/warehouse/categories.parquet")

infinite:

categories
  .writeStream
  .format("parquet")
  .start("/data/warehouse/categories.parquet")
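In practice the file (parquet) sink also needs a checkpoint location before start() succeeds; a minimal sketch, with the checkpoint path as an assumption:

categories
  .writeStream
  .format("parquet")
  .option("checkpointLocation", "/data/checkpoints/categories")
  .start("/data/warehouse/categories.parquet")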

Structured streaming

59

Useful resources

● Spark home page: https://spark.apache.org/
● Spark summit page: https://spark-summit.org/
● Apache Spark Docker image: https://github.com/dylanmei/docker-zeppelin
● SFPD Incidents: https://data.sfgov.org/Public-Safety/Police-Department-Incidents/tmnf-yvry

60

Thank you for your attention!

61

References

62

● Michael Armbrust - Structuring Spark: DataFrames, Datasets and Streaming - https://spark-summit.org/2016/events/structuring-spark-dataframes-datasets-and-streaming/
● Apache Parquet - https://parquet.apache.org/
● Spark Performance: What's Next - https://spark-summit.org/east-2016/events/spark-performance-whats-next/
● Avoid GroupByKey - https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html