Strata NYC 2015 - What's coming for the Spark community


What’s New in the Spark Community

Patrick Wendell | @pwendell

About Me

- Co-founder of Databricks
- Founding committer of Apache Spark at U.C. Berkeley
- Today, manage Spark effort @ Databricks

About Databricks

- Team donated Spark to ASF in 2013; primary maintainers of Spark today
- Hosted analytics stack based on Apache Spark
- Managed clusters, notebooks, collaboration, and third-party apps

Today’s Talk

- Quick overview of Apache Spark
- Technical roadmap directions
- Community and ecosystem trends

What is your familiarity with Spark?

1. Not very familiar with Spark – only very high level.
2. Understand the components/uses well, but I've never written code.
3. I've written Spark code for a POC or production use case of Spark.

“Spark is the Taylor Swift of big data software.” - Derrick Harris, Fortune

Apache Spark Engine

[Diagram: Spark Core, with Streaming, SQL and DataFrame, MLlib, and GraphX libraries on top]

Unified engine across diverse workloads & environments
- Scale out, fault tolerant
- Python, Java, Scala, and R APIs
- Standard libraries

This Talk

“What’s new” in Spark? And what’s coming? Two parts: Technical roadmap and community developments

“The future is already here — it's just not very evenly distributed.” - William Gibson

Technical Directions

Spark Technical Directions

- Higher level APIs: make developers more productive
- Performance of key execution primitives: shuffle, sorting, hashing, and state management
- Pluggability and extensibility: make it easy for other projects to integrate with Spark


Higher Level APIs

Making Spark accessible to data scientists, engineers, statisticians…

Computing an Average: MapReduce vs Spark

private IntWritable one = new IntWritable(1);
private IntWritable output = new IntWritable();

protected void map(
    LongWritable key,
    Text value,
    Context context) {
  String[] fields = value.toString().split("\t");
  output.set(Integer.parseInt(fields[1]));
  context.write(one, output);
}

IntWritable one = new IntWritable(1);
DoubleWritable average = new DoubleWritable();

protected void reduce(
    IntWritable key,
    Iterable<IntWritable> values,
    Context context) {
  int sum = 0;
  int count = 0;
  for (IntWritable value : values) {
    sum += value.get();
    count++;
  }
  average.set(sum / (double) count);
  context.write(key, average);
}

data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()


Computing an Average with Spark

data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()


Computing an Average with DataFrames

 

from pyspark.sql.functions import avg

sqlCtx.table("people") \
    .groupBy("name") \
    .agg(avg("age")) \
    .collect()

 


Spark DataFrame API

- Explicit data model and schema
- Selecting columns and filtering
- Aggregation (count, sum, average, etc.)
- User-defined functions
- Joining different data sources
- Statistical functions and easy plotting
- Python, Scala, Java, and R
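A minimal PySpark sketch of these features follows. It is not from the original deck; the tiny in-memory datasets and column names are illustrative stand-ins for real tables.

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import avg, col, udf
from pyspark.sql.types import StringType

sc = SparkContext(appName="dataframe-sketch")
sqlCtx = SQLContext(sc)

# Tiny illustrative datasets standing in for real tables.
people = sqlCtx.createDataFrame(
    [("Ann", 34, "NYC"), ("Bob", 17, "SF"), ("Carol", 28, "NYC")],
    ["name", "age", "city"])
cities = sqlCtx.createDataFrame([("NYC", "NY"), ("SF", "CA")], ["city", "state"])

# Column selection, filtering, and aggregation.
adults = people.filter(col("age") >= 18).select("name", "age", "city")
adults.groupBy("city").agg(avg("age")).show()

# A user-defined function (interpreted Python, unlike the built-in functions).
shout = udf(lambda s: s.upper(), StringType())
adults.select(shout(col("name")).alias("name_upper")).show()

# Joining different sources (here, two in-memory DataFrames) on a shared column.
adults.join(cities, "city").show()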


 


Ask more of your framework!
- MapReduce: fault tolerance, data distribution
- Spark adds: set operators, operator DAG, caching
- Spark + DataFrames adds: schema management, relational semantics, logical plan optimization, storage push down and optimization, analytic operations

Other high level APIs

ML Pipelines and SparkR

[Diagram: ML pipeline - tokenizer, hashingTF, and logistic regression (lr) stages chained over datasets ds0 -> ds1 -> ds2 -> ds3, producing lr.model]
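The diagram corresponds to the standard text-classification pipeline from the ML Pipelines documentation. A minimal sketch follows; the tiny training set and parameter values are illustrative, while the stage names mirror the diagram.

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

sc = SparkContext(appName="ml-pipeline-sketch")
sqlCtx = SQLContext(sc)

# ds0: a tiny illustrative training set with "text" and "label" columns.
ds0 = sqlCtx.createDataFrame(
    [("spark is great", 1.0), ("hadoop mapreduce", 0.0)], ["text", "label"])

# Stages mirror the diagram: tokenizer -> hashingTF -> logistic regression (lr).
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)

pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# fit() runs the intermediate transforms (ds1, ds2 in the diagram) and trains lr.model;
# transform() then adds predictions (ds3).
model = pipeline.fit(ds0)
ds3 = model.transform(ds0)

The SparkR snippet below shows the other high level API from this slide: R users get the same DataFrame operations on the Spark engine.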

> faithful <- read.df("faithful.json", "json")
> head(filter(faithful, faithful$waiting < 50))
##   eruptions waiting
## 1     1.750      47
## 2     1.750      47
## 3     1.867      48

   

Spark Technical Directions

- Higher level APIs: make developers more productive
- Performance of key execution primitives: shuffle, sorting, hashing, and state management
- Pluggability and extensibility: make it easy for other projects to integrate with Spark

Performance Initiatives

- Project Tungsten: improving runtime efficiency of key internals
- Everything else: IO optimizations, dynamic plan re-writing

Project Tungsten: The CPU Squeeze

          2010             2015              Change
Storage   50+ MB/s (HDD)   500+ MB/s (SSD)   10x
Network   1 Gbps           10 Gbps           10x
CPU       ~3 GHz           ~3 GHz            no change

Project Tungsten: Code generation for CPU efficiency

- Code generation on by default and using Janino [SPARK-7956]
- Beef up built-in UDF library (added ~100 UDFs with code gen)

AddMonths, ArrayContains, Ascii, Base64, Bin, BinaryMathExpression, CheckOverflow, CombineSets, Contains, CountSet, Crc32, DateAdd, DateDiff, DateFormatClass, DateSub, DayOfMonth, DayOfYear, Decode, Encode, EndsWith, Explode, Factorial, FindInSet, FormatNumber, FromUTCTimestamp, FromUnixTime, GetArrayItem, GetJsonObject, GetMapValue, Hex, InSet, InitCap, IsNaN, IsNotNull, IsNull, LastDay, Length, Levenshtein, Like, Lower, MakeDecimal, Md5, Month, MonthsBetween, NaNvl, NextDay, Not, PromotePrecision, Quarter, RLike, Round, Second, Sha1, Sha2, ShiftLeft, ShiftRight, ShiftRightUnsigned, SortArray, SoundEx, StartsWith, StringInstr, StringRepeat, StringReverse, StringSpace, StringSplit, StringTrim, StringTrimLeft, StringTrimRight, TimeAdd, TimeSub, ToDate, ToUTCTimestamp, TruncDate, UnBase64, UnaryMathExpression, Unhex, UnixTimestamp
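Many of these expressions are exposed through pyspark.sql.functions, so calling them keeps the work in generated code rather than an interpreted Python udf. A hedged sketch follows; the DataFrame contents and column names are illustrative.

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import col, levenshtein, from_unixtime, soundex

sc = SparkContext(appName="builtin-functions-sketch")
sqlCtx = SQLContext(sc)

# A tiny illustrative DataFrame; the column names are arbitrary.
df = sqlCtx.createDataFrame(
    [("kitten", "sitting", "Robert", 1443650400)], ["a", "b", "name", "ts"])

# Each call maps to one of the code-generated expressions listed above
# (Levenshtein, FromUnixTime, SoundEx).
df.select(
    levenshtein(col("a"), col("b")).alias("edit_distance"),
    from_unixtime(col("ts")).alias("event_time"),
    soundex(col("name")).alias("name_soundex")
).show()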

Project Tungsten

Binary processing for memory management (all data types):
- External sorting with managed memory
- External hashing with managed memory
(a small configuration sketch follows the diagram below)

[Diagram: Managed Memory HashMap in Tungsten - an array of (hash code, pointer) entries referencing key/value records laid out in managed memory pages]
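A minimal configuration sketch for experimenting with these code paths, assuming Spark 1.5, where the spark.sql.tungsten.enabled flag is on by default; treat the flag name as an assumption rather than part of the original deck.

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = (SparkConf()
        .setAppName("tungsten-toggle")
        # Assumed Spark 1.5 flag: set to "false" to fall back to the
        # pre-Tungsten sort/hash code paths when benchmarking.
        .set("spark.sql.tungsten.enabled", "true"))

sc = SparkContext(conf=conf)
sqlCtx = SQLContext(sc)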

Where are we going?

[Diagram: language frontends (Python, Java/Scala, R, SQL, ...) compile to a DataFrame logical plan, which a Tungsten backend can target to the JVM, LLVM, GPUs, or NVRAM]

[Diagram: Tungsten Execution underpins the DataFrame layer, which in turn supports Python, SQL, R, Streaming, and Advanced Analytics]

Spark Technical Directions

- Higher level APIs: make developers more productive
- Performance of key execution primitives: shuffle, sorting, hashing, and state management
- Pluggability and extensibility: make it easy for other projects to integrate with Spark

Pluggability: Rich IO Support

df = sqlContext.read \
    .format("json") \
    .option("samplingRatio", "0.1") \
    .load("/home/michael/data.json")

df.write \
    .format("parquet") \
    .mode("append") \
    .partitionBy("year") \
    .saveAsTable("fasterData")

Unified interface to reading/writing data in a variety of formats

Large Number of IO Integrations

Spark SQL’s Data Source API can read and write DataFrames using a variety of formats.
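Third-party sources plug into the same read/write interface. A hedged sketch using the community spark-csv package from a PySpark shell (format name and options per that package's README; the file paths are hypothetical):

# Read a CSV file through the external spark-csv data source,
# then write it back out as Parquet through the built-in source.
df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("/path/to/events.csv"))

df.write.format("parquet").save("/path/to/events_parquet")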


[Logos: built-in sources such as { JSON } and JDBC, plus external sources, and more…]

Find more sources at http://spark-packages.org/

Deployment Integrations

Technical Directions

Early on, the focus was: Can Spark be an engine that is faster and easier to use than Hadoop MapReduce?

Today the question is:

Can Spark & its ecosystem make big data as easy as little data?

Community/User Growth

Who is the “Spark Community”?

thousands of users

… hundreds of developers

… dozens of distributors

Getting a better vantage point

Databricks survey - feedback from more than 1,400 users

Community trends: Library & package ecosystem

- Strata NY 2014: widespread use of core RDD API
- Today: most use built-in and community libraries

51% of users use 3 or more libraries

Spark Packages

- Strata NY 2014: didn't exist
- Today: > 100 community packages

> ./bin/spark-shell --packages databricks/spark-avro:0.2
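Once the spark-avro package is on the classpath (for example via a --packages flag as above), it registers as another data source format. A minimal sketch from a PySpark shell started the same way (file paths hypothetical):

# The external package is used exactly like a built-in format.
df = sqlContext.read.format("com.databricks.spark.avro").load("/path/to/episodes.avro")
df.write.format("com.databricks.spark.avro").save("/path/to/episodes-copy")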

Spark Packages

- API extensions: Clojure API, Spark Kernel, Zeppelin Notebook, IndexedRDD
- Deployment utilities: Google Compute, Microsoft Azure, Spark Jobserver
- Data sources: Redshift, Avro, CSV, Elasticsearch, MongoDB

Increasing storage options

- Strata NY 2014: IO primarily through the Hadoop InputFormat API
- January 2015: Spark adds a native storage API
- Today: well over 20 natively integrated storage bindings

Cassandra, Elasticsearch, MongoDB, Avro, Parquet, ORC, HBase, Redshift, SAP, CSV, Cloudant, Oracle, JDBC, SequoiaDB, Couchbase…
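As one example, the built-in JDBC binding reads a relational table straight into a DataFrame. A hedged sketch from a PySpark shell; the connection URL, table, columns, and credentials are hypothetical, and the JDBC driver must already be on the classpath.

# Built-in JDBC data source.
jdbc_df = (sqlContext.read
           .format("jdbc")
           .option("url", "jdbc:postgresql://dbhost:5432/sales")
           .option("dbtable", "orders")
           .option("user", "report")
           .option("password", "secret")
           .load())

jdbc_df.groupBy("region").count().show()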

Deployment environments

Strata NY 2014: Traction in the Hadoop community

Today: Growth beyond Hadoop… increasingly public cloud

51% of respondents run Spark in public cloud

Wrapping it up

Spark has grown and developed quickly in the last year! Looking forward, expect:
- Engineering effort on higher level APIs and performance
- A broader surrounding ecosystem
- The unexpected

Where to learn more about Spark?

- SparkHub community portal
- Spark Summit conference: https://spark-summit.org/
- Massive online course (edX): Databricks Spark training
- Books

Questions?