Strata NYC 2015 - What's coming for the Spark community

What’s New in the Spark Community

Patrick Wendell | @pwendell

About Me

Co-Founder of Databricks
Founding committer of Apache Spark at U.C. Berkeley
Today, manage Spark effort @ Databricks

About Databricks

Team donated Spark to ASF in 2013; primary maintainers of Spark today
Hosted analytics stack based on Apache Spark
Managed clusters, notebooks, collaboration, and third-party apps:

Today’s Talk

Quick overview of Apache Spark
Technical roadmap directions
Community and ecosystem trends

What is your familiarity with Spark?

1.  Not very familiar with Spark – only at a very high level.
2.  Understand the components/uses well, but I’ve never written code.
3.  I’ve written Spark code for a POC or production use case.

“Spark is the Taylor Swift of big data software.” - Derrick Harris, Fortune

…

Apache Spark Engine

Spark Core

Streaming, SQL and DataFrame, MLlib, GraphX

Unified engine across diverse workloads & environments

Scale out, fault tolerant
Python, Java, Scala, and R APIs
Standard libraries

This Talk

“What’s new” in Spark? And what’s coming?
Two parts: Technical roadmap and community developments

“The future is already here — it's just not very evenly distributed.” - William Gibson

Technical Directions

Spark Technical Directions

Higher-level APIs
    Make developers more productive

Performance of key execution primitives
    Shuffle, sorting, hashing, and state management

Pluggability and extensibility
    Make it easy for other projects to integrate with Spark

Higher-Level APIs

Making Spark accessible to data scientists, engineers, statisticians…

Computing an Average: MapReduce vs Spark

// Mapper: emit the second tab-separated field under a single shared key.
private IntWritable one = new IntWritable(1);
private IntWritable output = new IntWritable();

protected void map(LongWritable key, Text value, Context context)
    throws IOException, InterruptedException {
  String[] fields = value.toString().split("\t");
  output.set(Integer.parseInt(fields[1]));
  context.write(one, output);
}

// Reducer: sum and count the values, then write the average.
IntWritable one = new IntWritable(1);
DoubleWritable average = new DoubleWritable();

protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
    throws IOException, InterruptedException {
  int sum = 0;
  int count = 0;
  for (IntWritable value : values) {
    sum += value.get();
    count++;
  }
  average.set(sum / (double) count);
  context.write(key, average);
}

# Split each line into fields, then carry (sum, count) pairs per key.
data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [float(x[1]), 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()

Computing an Average with Spark

# Split each line into fields, then carry (sum, count) pairs per key.
data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [float(x[1]), 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()
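The (sum, count) trick in the RDD version can be traced without a cluster. The following is a minimal plain-Python sketch of the same three steps, not Spark code: the sample lines and the function name average_by_key are made up for illustration.

```python
# Hypothetical sample input: tab-separated "key<TAB>value" lines.
lines = ["a\t1", "b\t4", "a\t3", "b\t6", "a\t5"]


def average_by_key(lines):
    # "map" step: turn each line into (key, (value, 1)).
    pairs = []
    for line in lines:
        fields = line.split("\t")
        pairs.append((fields[0], (float(fields[1]), 1)))

    # "reduceByKey" step: add sums and counts pairwise within each key.
    totals = {}
    for key, (value, count) in pairs:
        if key in totals:
            running_sum, running_count = totals[key]
            totals[key] = (running_sum + value, running_count + count)
        else:
            totals[key] = (value, count)

    # Final "map" step: divide each sum by its count.
    return {key: s / c for key, (s, c) in totals.items()}


averages = average_by_key(lines)
print(averages)
```

Carrying the count next to the sum is what lets the reduce step combine partial results in any order, which is the property reduceByKey relies on.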

Computing an Average with DataFrames

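from pyspark.sql.functions import avg

# The grouping column "name" is kept in the result alongside avg(age).
sqlCtx.table("people") \
    .groupBy("name") \
    .agg(avg("age")) \
    .collect()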

Spark DataFrame API

Explicit data model and schema
Selecting columns and filtering
Aggregation (count, sum, average, etc)
User defined functions
Joining different data sources
Statistical functions and easy plotting
Python, Scala, Java, and R

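from pyspark.sql.functions import avg

# The grouping column "name" is kept in the result alongside avg(age).
sqlCtx.table("people") \
    .groupBy("name") \
    .agg(avg("age")) \
    .collect()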

Ask more of your framework!

                             MapReduce   Spark   Spark + DataFrames
Fault tolerance              yes         yes     yes
Data distribution            yes         yes     yes
Set operators                -           yes     yes
Operator DAG                 -           yes     yes
Caching                      -           yes     yes
Schema management            -           -       yes
Relational semantics         -           -       yes
Logical plan optimization    -           -       yes
Storage push down and opt.   -           -       yes
Analytic operations          -           -       yes
…

Other high-level APIs

ML Pipelines

[Diagram: datasets ds0 to ds3 flowing through the pipeline stages tokenizer, hashingTF, and lr.model, with lr shown alongside]

SparkR

> faithful <- read.df("faithful.json", "json")
> head(filter(faithful, faithful$waiting < 50))
##   eruptions waiting
## 1     1.750      47
## 2     1.750      47
## 3     1.867      48

Performance Initiatives

Project Tungsten – improving runtime efficiency of key internals
Everything else – IO optimizations, dynamic plan re-writing

Project Tungsten: The CPU Squeeze

            2010            2015
Storage     50+MB/s (HDD)   500+MB/s (SSD)   10X
Network     1Gbps           10Gbps           10X
CPU         ~3GHz           ~3GHz            :(

Project Tungsten
Code generation for CPU efficiency

Code generation on by default and using Janino [SPARK-7956]
Beef up built-in UDF library (added ~100 UDFs with code gen)

AddMonths ArrayContains Ascii Base64 Bin BinaryMathExpression CheckOverflow CombineSets Contains CountSet Crc32 DateAdd

DateDiff DateFormatClass DateSub DayOfMonth DayOfYear Decode Encode EndsWith Explode Factorial FindInSet FormatNumber FromUTCTimestamp

FromUnixTime GetArrayItem GetJsonObject GetMapValue Hex InSet InitCap IsNaN IsNotNull IsNull LastDay Length Levenshtein

Like Lower MakeDecimal Md5 Month MonthsBetween NaNvl NextDay Not PromotePrecision Quarter RLike Round

Second Sha1 Sha2 ShiftLeft ShiftRight ShiftRightUnsigned SortArray SoundEx StartsWith StringInstr StringRepeat StringReverse StringSpace

StringSplit StringTrim StringTrimLeft StringTrimRight TimeAdd TimeSub ToDate ToUTCTimestamp TruncDate UnBase64 UnaryMathExpression Unhex UnixTimestamp
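To give a feel for why code generation helps with CPU efficiency, here is a small Python analogy. It is a sketch only: Spark generates Java source and compiles it with Janino, whereas this example turns a toy expression tree into Python source and compiles it with eval. The tuple-based tree format and the names interpret, to_source, and compile_expr are all assumptions made for illustration.

```python
# Toy expression tree: ("col", name), ("lit", value), ("add" | "mul", left, right).
expr = ("add", ("mul", ("col", "price"), ("col", "qty")), ("lit", 5))


def interpret(node, row):
    # Interpreted evaluation: walks the tree again for every row.
    kind = node[0]
    if kind == "col":
        return row[node[1]]
    if kind == "lit":
        return node[1]
    left = interpret(node[1], row)
    right = interpret(node[2], row)
    return left + right if kind == "add" else left * right


def to_source(node):
    # Code generation: walk the tree once and emit a flat source expression.
    kind = node[0]
    if kind == "col":
        return "row[%r]" % node[1]
    if kind == "lit":
        return repr(node[1])
    op = "+" if kind == "add" else "*"
    return "(%s %s %s)" % (to_source(node[1]), op, to_source(node[2]))


def compile_expr(node):
    # Compile the generated source into a function of one row.
    return eval("lambda row: " + to_source(node))


rows = [{"price": 2, "qty": 3}, {"price": 10, "qty": 1}]
evaluate = compile_expr(expr)
interpreted = [interpret(expr, r) for r in rows]
generated = [evaluate(r) for r in rows]
print(to_source(expr), interpreted, generated)
```

The generated function does the tree walk once at compile time, so the per-row work is a single flat expression with no branching on node types.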

Project Tungsten

Binary processing for memory management (all data types):
External sorting with managed memory
External hashing with managed memory

[Figure: Managed Memory HashMap in Tungsten. An array of (hc, ptr) entries points into a memory page that holds key/value records packed one after another.]
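The idea behind the figure, hash-code and pointer entries that reference key/value records packed into a contiguous page, can be sketched in a few lines of Python. This is a toy under stated assumptions, not Tungsten's actual layout: the class name PagedLongMap, the fixed 16-byte record format, the single page, and the linear probing scheme are all made up for illustration.

```python
import struct

# Assumed record layout: an 8-byte signed key followed by an 8-byte signed value.
RECORD = struct.Struct("<qq")


class PagedLongMap:
    """Toy fixed-capacity map from int keys to int values, stored as packed bytes."""

    def __init__(self, capacity=16):
        self.page = bytearray()         # the "memory page": packed key/value records
        self.slots = [None] * capacity  # each used slot holds (hash code, page offset)
        self.size = 0

    def put(self, key, value):
        hc = hash(key)
        i = hc % len(self.slots)
        while self.slots[i] is not None:
            slot_hc, offset = self.slots[i]
            if slot_hc == hc and RECORD.unpack_from(self.page, offset)[0] == key:
                RECORD.pack_into(self.page, offset, key, value)  # overwrite in place
                return
            i = (i + 1) % len(self.slots)  # linear probing
        if self.size >= len(self.slots) - 1:
            raise OverflowError("toy map is full; no resizing in this sketch")
        offset = len(self.page)
        self.page += RECORD.pack(key, value)  # append the record to the page
        self.slots[i] = (hc, offset)
        self.size += 1

    def get(self, key):
        hc = hash(key)
        i = hc % len(self.slots)
        while self.slots[i] is not None:
            slot_hc, offset = self.slots[i]
            if slot_hc == hc:  # compare hash codes before touching the page
                k, v = RECORD.unpack_from(self.page, offset)
                if k == key:
                    return v
            i = (i + 1) % len(self.slots)
        return None


m = PagedLongMap()
m.put(1, 10)
m.put(17, 20)  # lands in the same slot as key 1 and probes to the next one
m.put(1, 11)   # updates the existing record instead of appending
print(m.get(1), m.get(17), m.get(2), len(m.page))
```

Keeping the hash code next to the pointer lets most failed probes be rejected without reading the record bytes, and storing records as raw bytes avoids creating one object per key and value.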

Where are we going?

Language frontend: Python, Java/Scala, R, SQL, …
DataFrame Logical Plan
Tungsten backend: JVM, LLVM, GPU, NVRAM, …

[Diagram: Python, SQL, R, Streaming, and Advanced Analytics sit on top of DataFrame, which sits on top of Tungsten Execution]

Pluggability: Rich IO Support

df = sqlContext.read \
    .format("json") \
    .option("samplingRatio", "0.1") \
    .load("/home/michael/data.json")

df.write \
    .format("parquet") \
    .mode("append") \
    .partitionBy("year") \
    .saveAsTable("fasterData")

Unified interface to reading/writing data in a variety of formats

Large Number of IO Integrations

Spark SQL’s Data Source API can read and write DataFrames using a variety of formats.

[Figure: logos of built-in and external data sources, including JSON and JDBC, and more…]

Find more sources at http://spark-packages.org/

Deployment Integrations

Technical Directions

Early on, the focus was: Can Spark be an engine that is faster and easier to use than Hadoop MapReduce?

Today the question is:

Can Spark & its ecosystem make big data as easy as little data?

Community/User Growth

Who is the “Spark Community”?

thousands of users

… hundreds of developers

… dozens of distributors

Getting a better vantage point

Databricks survey - feedback from more than 1,400 users

Community trends: Library & package ecosystem

Strata NY 2014: Widespread use of core RDD API
Today: Most use built-in and community libraries

51% of users use 3 or more libraries

Spark Packages

Strata NY 2014: Didn’t exist
Today: > 100 community packages

> ./bin/spark-shell --packages databricks/spark-avro:0.2

Spark Packages

API Extensions: Clojure API, Spark Kernel, Zeppelin Notebook, Indexed RDD

Deployment Utilities: Google Compute, Microsoft Azure, Spark Jobserver

Data Sources: Redshift, Avro, CSV, Elastic Search, MongoDB

Increasing storage options

Strata NY 2014: IO primarily through Hadoop InputFormat API
January 2015: Spark adds native storage API
Today: Well over 20 natively integrated storage bindings

Cassandra, ElasticSearch, MongoDB, Avro, Parquet, ORC, HBase, Redshift, SAP, CSV, Cloudant, Oracle, JDBC, SequoiaDB, Couchbase…

Deployment environments

Strata NY 2014: Traction in the Hadoop community

Today: Growth beyond Hadoop… increasingly public cloud

51% of respondents run Spark in public cloud

Wrapping it up

Spark has grown and developed quickly in the last year!
Looking forward, expect:
-  Engineering effort on higher-level APIs and performance
-  A broader surrounding ecosystem
-  The unexpected

Where to learn more about Spark?

SparkHub community portal
Spark Summit conference - https://spark-summit.org/
Massive online course (edX): Databricks Spark training
Books:

Questions?
