Strata NYC 2015 - What's coming for the Spark community

What’s New in the Spark Community Patrick Wendell | @pwendell


Page 1: Strata NYC 2015 - What's coming for the Spark community

What’s New in the Spark Community

Patrick Wendell | @pwendell

Page 2: Strata NYC 2015 - What's coming for the Spark community

About Me

Co-Founder of Databricks

Founding committer of Apache Spark at U.C. Berkeley

Today, manage Spark effort @ Databricks

Page 3: Strata NYC 2015 - What's coming for the Spark community

About Databricks

Team donated Spark to ASF in 2013; primary maintainers of Spark today

Hosted analytics stack based on Apache Spark

Managed clusters, notebooks, collaboration, and third party apps:

Page 4: Strata NYC 2015 - What's coming for the Spark community

Today’s Talk

Quick overview of Apache Spark

Technical roadmap directions

Community and ecosystem trends

Page 5: Strata NYC 2015 - What's coming for the Spark community

What is your familiarity with Spark?

1.  Not very familiar with Spark – only very high level.

2.  Understand the components/uses well, but I’ve never written code.

3.  I’ve written Spark code for a POC or production use case.

Page 6: Strata NYC 2015 - What's coming for the Spark community

“Spark is the Taylor Swift of big data software.” - Derrick Harris, Fortune

Page 7: Strata NYC 2015 - What's coming for the Spark community

Apache Spark Engine

Spark Core

Streaming, SQL and DataFrame, MLlib, GraphX

Unified engine across diverse workloads & environments

Scale out, fault tolerant

Python, Java, Scala, and R APIs

Standard libraries

Page 8: Strata NYC 2015 - What's coming for the Spark community

This Talk

“What’s new” in Spark? And what’s coming? Two parts: Technical roadmap and community developments

“The future is already here — it's just not very evenly distributed.” - William Gibson

Page 9: Strata NYC 2015 - What's coming for the Spark community

Technical Directions

Page 10: Strata NYC 2015 - What's coming for the Spark community

Spark Technical Directions

Higher-level APIs: make developers more productive

Performance of key execution primitives: shuffle, sorting, hashing, and state management

Pluggability and extensibility: make it easy for other projects to integrate with Spark

Page 11: Strata NYC 2015 - What's coming for the Spark community

Spark Technical Directions

Higher-level APIs: make developers more productive

Performance of key execution primitives: shuffle, sorting, hashing, and state management

Pluggability and extensibility: make it easy for other projects to integrate with Spark

Page 12: Strata NYC 2015 - What's coming for the Spark community

Higher-Level APIs

Making Spark accessible to data scientists, engineers, statisticians…

Page 13: Strata NYC 2015 - What's coming for the Spark community

Computing an Average: MapReduce vs Spark

MapReduce (Java):

// Mapper: emit every value under a single key
private IntWritable one = new IntWritable(1);
private IntWritable output = new IntWritable();

protected void map(LongWritable key, Text value, Context context)
    throws IOException, InterruptedException {
  String[] fields = value.toString().split("\t");
  output.set(Integer.parseInt(fields[1]));
  context.write(one, output);
}

// Reducer: sum and count the values, then emit the average
IntWritable one = new IntWritable(1);
DoubleWritable average = new DoubleWritable();

protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
    throws IOException, InterruptedException {
  int sum = 0;
  int count = 0;
  for (IntWritable value : values) {
    sum += value.get();
    count++;
  }
  average.set(sum / (double) count);
  context.write(key, average);
}

Spark (Python):

# Split each line into fields, then carry [sum, count] per key
data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [float(x[1]), 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()


Page 14: Strata NYC 2015 - What's coming for the Spark community

Computing an Average with Spark

# Split each line into fields, then carry [sum, count] per key
data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [float(x[1]), 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()
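The pipeline on this slide carries a [sum, count] pair per key because an average cannot be merged pairwise, while sums and counts can. As a minimal sketch of that idea without a cluster, the following pure-Python stand-in mimics the three steps; `average_by_key` and the sample lines are illustrative assumptions, not part of Spark or of the talk.

```python
def average_by_key(lines):
    """Pure-Python stand-in for the map / reduceByKey / map pipeline."""
    # map: parse each tab-separated line into (key, [value, 1])
    pairs = [(f[0], [float(f[1]), 1])
             for f in (line.split("\t") for line in lines)]

    # reduceByKey: merge [sum, count] pairs that share a key
    merged = {}
    for key, (value, count) in pairs:
        if key in merged:
            merged[key] = [merged[key][0] + value, merged[key][1] + count]
        else:
            merged[key] = [value, count]

    # map: divide each sum by its count
    return {key: total / count for key, (total, count) in merged.items()}


# Hypothetical sample input: name, then age, separated by a tab
lines = ["alice\t30", "bob\t20", "alice\t40"]
averages = average_by_key(lines)
print(averages)  # {'alice': 35.0, 'bob': 20.0}
```

Because the merge step only adds sums and counts, partial results can be combined in any order, which is the property a distributed reduce needs.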


Page 15: Strata NYC 2015 - What's coming for the Spark community

Computing an Average with DataFrames

 

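from pyspark.sql.functions import avg

# Grouping columns are kept in the result, so only the aggregate is passed to agg()
sqlCtx.table("people") \
    .groupBy("name") \
    .agg(avg("age")) \
    .collect()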

 


Page 16: Strata NYC 2015 - What's coming for the Spark community

Spark DataFrame API

Explicit data model and schema

Selecting columns and filtering

Aggregation (count, sum, average, etc)

User defined functions

Joining different data sources

Statistical functions and easy plotting

Python, Scala, Java, and R

 

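from pyspark.sql.functions import avg

# Grouping columns are kept in the result, so only the aggregate is passed to agg()
sqlCtx.table("people") \
    .groupBy("name") \
    .agg(avg("age")) \
    .collect()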

Page 17: Strata NYC 2015 - What's coming for the Spark community

Ask more of your framework!

MapReduce: fault tolerance, data distribution

Spark: fault tolerance, data distribution, set operators, operator DAG, caching

Spark + DataFrames: fault tolerance, data distribution, set operators, operator DAG, caching, schema management, relational semantics, logical plan optimization, storage push down and opt., analytic operations

Page 18: Strata NYC 2015 - What's coming for the Spark community

Other high-level APIs

ML Pipelines

[Diagram: datasets ds0, ds1, ds2, ds3 connected by the stages tokenizer, hashingTF, and lr.model; lr is the estimator that yields lr.model]

SparkR

> faithful <- read.df("faithful.json", "json")
> head(filter(faithful, faithful$waiting < 50))
##   eruptions waiting
## 1     1.750      47
## 2     1.750      47
## 3     1.867      48

   

Page 19: Strata NYC 2015 - What's coming for the Spark community

Spark Technical Directions

Higher-level APIs: make developers more productive

Performance of key execution primitives: shuffle, sorting, hashing, and state management

Pluggability and extensibility: make it easy for other projects to integrate with Spark

Page 20: Strata NYC 2015 - What's coming for the Spark community

Performance Initiatives

Project Tungsten – improving runtime efficiency of key internals

Everything else – IO optimizations, dynamic plan re-writing

Page 21: Strata NYC 2015 - What's coming for the Spark community

Project Tungsten: The CPU Squeeze

          2010             2015
Storage   50+MB/s (HDD)    500+MB/s (SSD)   10X
Network   1Gbps            10Gbps           10X
CPU       ~3GHz            ~3GHz            :(

Page 22: Strata NYC 2015 - What's coming for the Spark community

Project Tungsten: Code generation for CPU efficiency

Code generation on by default and using Janino [SPARK-7956]

Beef up built-in UDF library (added ~100 UDFs with code gen)

AddMonths ArrayContains Ascii Base64 Bin BinaryMathExpression CheckOverflow CombineSets Contains CountSet Crc32 DateAdd

DateDiff DateFormatClass DateSub DayOfMonth DayOfYear Decode Encode EndsWith Explode Factorial FindInSet FormatNumber FromUTCTimestamp

FromUnixTime GetArrayItem GetJsonObject GetMapValue Hex InSet InitCap IsNaN IsNotNull IsNull LastDay Length Levenshtein

Like Lower MakeDecimal Md5 Month MonthsBetween NaNvl NextDay Not PromotePrecision Quarter RLike Round

Second Sha1 Sha2 ShiftLeft ShiftRight ShiftRightUnsigned SortArray SoundEx StartsWith StringInstr StringRepeat StringReverse StringSpace

StringSplit StringTrim StringTrimLeft StringTrimRight TimeAdd TimeSub ToDate ToUTCTimestamp TruncDate UnBase64 UnaryMathExpression Unhex UnixTimestamp
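To make the code generation idea concrete, here is a toy Python analog. It is only a sketch: Spark generates Java source and compiles it with Janino, as the slide says, whereas this example builds a Python lambda from a small expression tree. The tuple-based tree format and the names `interpret`, `to_source`, and `generate` are assumptions made up for illustration.

```python
import operator

# Hypothetical expression tree: ("col", name) | ("lit", value) | (op, left, right)
OPS = {"+": operator.add, "*": operator.mul, "<": operator.lt}


def interpret(expr, row):
    """Tree-walking evaluation: dispatches on the node type again for every row."""
    kind = expr[0]
    if kind == "col":
        return row[expr[1]]
    if kind == "lit":
        return expr[1]
    return OPS[kind](interpret(expr[1], row), interpret(expr[2], row))


def to_source(expr):
    """Render the tree as a Python expression string."""
    kind = expr[0]
    if kind == "col":
        return "row[%r]" % expr[1]
    if kind == "lit":
        return repr(expr[1])
    return "(%s %s %s)" % (to_source(expr[1]), kind, to_source(expr[2]))


def generate(expr):
    """Compile the tree once into a single function, then reuse it for all rows."""
    source = "lambda row: " + to_source(expr)
    return eval(compile(source, "<generated>", "eval")), source


# age + 1 < 50
expr = ("<", ("+", ("col", "age"), ("lit", 1)), ("lit", 50))
compiled, source = generate(expr)
rows = [{"age": 30}, {"age": 49}, {"age": 60}]
results = [compiled(r) for r in rows]
print(source)   # lambda row: ((row['age'] + 1) < 50)
print(results)  # [True, False, False]
```

The interpreted path pays for the tree walk on every row; the generated function pays for it once, at compile time, which is the saving code generation is after.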

Page 23: Strata NYC 2015 - What's coming for the Spark community

Project Tungsten

Binary processing for memory management (all data types):

External sorting with managed memory

External hashing with managed memory

[Figure: Managed Memory HashMap in Tungsten. An array of (hc, ptr) entries points into memory pages that hold key/value pairs.]
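The layout in the figure can be approximated in a few lines of Python. This is a rough sketch under stated assumptions, not Tungsten's implementation: `PageHashMap`, its record layout, and the fixed capacity are all hypothetical, and a single `bytearray` stands in for a managed memory page. The point it illustrates is that the table itself holds only a hash code and a pointer, while keys and values live as raw bytes in the page.

```python
import struct


class PageHashMap:
    """Toy open-addressing map: slots hold (hash code, offset); records live in a bytearray."""

    def __init__(self, capacity=16):
        self.capacity = capacity
        self.slots = [None] * capacity  # each slot: (hash code, offset into the page)
        self.page = bytearray()         # stand-in for one managed memory page

    def _append(self, key, value):
        # Record layout: key length, value length, key bytes, value bytes
        offset = len(self.page)
        self.page += struct.pack("<II", len(key), len(value)) + key + value
        return offset

    def _read(self, offset):
        klen, vlen = struct.unpack_from("<II", self.page, offset)
        start = offset + 8
        key = bytes(self.page[start:start + klen])
        value = bytes(self.page[start + klen:start + klen + vlen])
        return key, value

    def put(self, key, value):
        h = hash(key)
        i = h % self.capacity
        for _ in range(self.capacity):
            slot = self.slots[i]
            # Empty slot, or same key: point the slot at a freshly appended record.
            # An overwritten record simply stays behind as garbage in this sketch.
            if slot is None or (slot[0] == h and self._read(slot[1])[0] == key):
                self.slots[i] = (h, self._append(key, value))
                return
            i = (i + 1) % self.capacity  # linear probing
        raise MemoryError("map is full")

    def get(self, key):
        h = hash(key)
        i = h % self.capacity
        for _ in range(self.capacity):
            slot = self.slots[i]
            if slot is None:
                return None
            # Compare the stored hash code first; touch the page only on a match
            if slot[0] == h and self._read(slot[1])[0] == key:
                return self._read(slot[1])[1]
            i = (i + 1) % self.capacity
        return None


m = PageHashMap()
m.put(b"spark", b"1.5")
m.put(b"tungsten", b"fast")
print(m.get(b"spark"))  # b'1.5'
```

Keeping the hash code next to the pointer lets most failed probes be rejected without reading the record bytes at all.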

Page 24: Strata NYC 2015 - What's coming for the Spark community

Where are we going?

Language frontend: Python, Java/Scala, R, SQL, …

DataFrame Logical Plan

Tungsten backend: LLVM, JVM, GPU, NVRAM

Page 25: Strata NYC 2015 - What's coming for the Spark community

[Diagram: Python, SQL, R, Streaming, and Advanced Analytics sit on top of DataFrame, which runs on Tungsten Execution]

Page 26: Strata NYC 2015 - What's coming for the Spark community

Spark Technical Directions

Higher-level APIs: make developers more productive

Performance of key execution primitives: shuffle, sorting, hashing, and state management

Pluggability and extensibility: make it easy for other projects to integrate with Spark

Page 27: Strata NYC 2015 - What's coming for the Spark community

Pluggability: Rich IO Support

df = sqlContext.read \
    .format("json") \
    .option("samplingRatio", "0.1") \
    .load("/home/michael/data.json")

df.write \
    .format("parquet") \
    .mode("append") \
    .partitionBy("year") \
    .saveAsTable("fasterData")

Unified interface to reading/writing data in a variety of formats
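One way to picture a unified, format-keyed interface is a small registry that maps a format name to a reader and a writer. The sketch below is a pure-Python analogy only and is not Spark's Data Source API; `register`, `read`, `write`, and the two sample formats are hypothetical names chosen for illustration.

```python
import csv
import io
import json

# Hypothetical registries: format name -> function
READERS = {}
WRITERS = {}


def register(fmt, reader, writer):
    """Plug a new format in without touching the read/write entry points."""
    READERS[fmt] = reader
    WRITERS[fmt] = writer


def read(fmt, text):
    return READERS[fmt](text)


def write(fmt, rows):
    return WRITERS[fmt](rows)


def _read_json(text):
    # One JSON object per line
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def _write_json(rows):
    return "\n".join(json.dumps(row, sort_keys=True) for row in rows)


def _read_csv(text):
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def _write_csv(rows):
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=sorted(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


register("json", _read_json, _write_json)
register("csv", _read_csv, _write_csv)

# Same two calls regardless of format: read as JSON, write as CSV
rows = read("json", '{"name": "a", "year": 2015}\n{"name": "b", "year": 2014}')
csv_text = write("csv", rows)
print(csv_text)
```

Callers only ever name a format, so a third party can add one by calling `register` from its own module, which is the kind of pluggability the slide describes.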

Page 28: Strata NYC 2015 - What's coming for the Spark community

Large Number of IO Integrations

Spark SQL’s Data Source API can read and write DataFrames using a variety of formats.

[Figure: logos of built-in and external data sources, including { JSON } and JDBC, and more…]

Find more sources at http://spark-packages.org/

Page 29: Strata NYC 2015 - What's coming for the Spark community

Deployment Integrations

Page 30: Strata NYC 2015 - What's coming for the Spark community

Technical Directions

Early on, the focus was: Can Spark be an engine that is faster and easier to use than Hadoop MapReduce?

Today the question is:

Can Spark & its ecosystem make big data as easy as little data?

Page 31: Strata NYC 2015 - What's coming for the Spark community

Community/User Growth

Page 32: Strata NYC 2015 - What's coming for the Spark community

Who is the “Spark Community”?

thousands of users

… hundreds of developers

… dozens of distributors

Page 33: Strata NYC 2015 - What's coming for the Spark community

Getting a better vantage point

Databricks survey - feedback from more than 1,400 users

Page 34: Strata NYC 2015 - What's coming for the Spark community

Community trends: Library & package ecosystem

Strata NY 2014: Widespread use of core RDD API

Today: Most use built-in and community libraries

51% of users use 3 or more libraries

Page 35: Strata NYC 2015 - What's coming for the Spark community

Spark Packages

Strata NY 2014: Didn’t exist

Today: > 100 community packages

> ./bin/spark-shell --packages databricks/spark-avro:0.2

Page 36: Strata NYC 2015 - What's coming for the Spark community

Spark Packages

API Extensions: Clojure API, Spark Kernel, Zeppelin Notebook, Indexed RDD

Deployment Utilities: Google Compute, Microsoft Azure, Spark Jobserver

Data Sources: Redshift, Avro, CSV, Elastic Search, MongoDB

Page 37: Strata NYC 2015 - What's coming for the Spark community

Increasing storage options

Strata NY 2014: IO primarily through Hadoop InputFormat API

January 2015: Spark adds native storage API

Today: Well over 20 natively integrated storage bindings

Cassandra, ElasticSearch, MongoDB, Avro, Parquet, ORC, HBase, Redshift, SAP, CSV, Cloudant, Oracle, JDBC, SequoiaDB, Couchbase…

Page 38: Strata NYC 2015 - What's coming for the Spark community

Deployment environments

Strata NY 2014: Traction in the Hadoop community

Today: Growth beyond Hadoop… increasingly public cloud

51% of respondents run Spark in public cloud

Page 39: Strata NYC 2015 - What's coming for the Spark community

Wrapping it up

Spark has grown and developed quickly in the last year!

Looking forward, expect:

-  Engineering effort on higher-level APIs and performance

-  A broader surrounding ecosystem

-  The unexpected

Page 40: Strata NYC 2015 - What's coming for the Spark community

Where to learn more about Spark?

SparkHub community portal

Spark Summit conference - https://spark-summit.org/

Massive online course (edX): Databricks Spark training

Books:

Page 41: Strata NYC 2015 - What's coming for the Spark community

Questions?