Apache Spark 2.0
-
Matei Zaharia ([email protected], @matei_zaharia)
Apache Spark 2.0
-
Apache Spark 2.0
Next major release, coming out this month. An unstable preview release is available at spark.apache.org
Remains highly compatible with Apache Spark 1.X
Over 2000 patches from 280 contributors!
-
Apache Spark Philosophy
1. Unified engine: support end-to-end applications
2. High-level APIs: easy to use, rich optimizations
3. Integrate broadly: storage systems, libraries, etc.
-
New in 2.0
Structured API improvements: DataFrame, Dataset, SparkSession (sketch below)
Structured Streaming
MLlib model export
MLlib R bindings
SQL 2003 support
Scala 2.12 support

Broader Community
Deep learning libraries (Baidu, Yahoo!, Berkeley, ...)
GraphFrames
PyData integration
Reactive streams
C# bindings: Mobius
JS bindings: EclairJS

Built on a common interface of RDDs & DataFrames
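The structured-API item above mentions SparkSession, the single entry point added in 2.0. As a point of reference, a minimal sketch of creating one (the application name is arbitrary):

from pyspark.sql import SparkSession

# SparkSession is the unified entry point introduced in Spark 2.0;
# it subsumes the separate SQLContext and HiveContext from 1.x.
spark = (SparkSession.builder
         .appName("spark-2.0-demo")   # hypothetical app name
         .getOrCreate())

# The underlying SparkContext remains reachable for RDD code.
sc = spark.sparkContext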
-
Deep Dive: Structured APIs
events = spark.read.json("/logs")
stats = events.join(users).groupBy("loc", "status").avg("duration")
errors = stats.where(stats.status == "ERR")
DataFrame API → Optimized plan → Specialized code
(Diagram: logical plan with READ logs, READ users → JOIN → AGG → FILTER, compiled into generated code)
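The path from DataFrame API to optimized plan can be inspected directly with explain(); a minimal sketch building on the query above (it assumes users is another DataFrame sharing a join column with events):

# Print the parsed, analyzed, optimized logical plans and the physical
# plan that Catalyst produces for the query defined above.
errors.explain(True)

# Nothing executes until an action runs; count() triggers the optimized plan.
errors.count()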
-
New in 2.0
Whole-stage code generation: fuse across multiple operators (sketch below)
(Chart: scan throughput in rows/s; Spark 1.6: 14M rows/s vs Spark 2.0; Parquet in 1.6: 11M rows/s vs Parquet in 2.0)
Optimized input / output: Apache Parquet + built-in cache
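Whole-stage code generation is visible in the physical plan; a minimal sketch, unrelated to the specific benchmark numbers above and reusing the spark session from the earlier sketch:

# Sum the ids of 10 million generated rows.
df = spark.range(10 * 1000 * 1000).selectExpr("sum(id)")

# In Spark 2.0, operators fused by whole-stage code generation appear
# in the physical plan marked with '*' under a WholeStageCodegen node.
df.explain()

df.show()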
-
Structured Streaming
High-level streaming API built on DataFrames: event time, windowing, sessions, sources & sinks (sketch below)
Also supports interactive & batch queries: aggregate data in a stream, then serve using JDBC; change queries at runtime; build and apply ML models
Not just streaming: continuous applications
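As a rough illustration of the features listed above (event time, windowing, a source and a sink), here is a hedged sketch using the Structured Streaming API as released in 2.0; the schema, path, and column names are hypothetical:

from pyspark.sql.functions import window
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

# Hypothetical event schema; streaming file sources need one up front.
schema = StructType([
    StructField("time", TimestampType()),
    StructField("status", StringType()),
    StructField("latency", DoubleType()),
])

# Read a directory of JSON files as an unbounded stream.
events = spark.readStream.format("json").schema(schema).load("s3://logs")

# Event-time windowed aggregation: average latency per 10-minute window and status.
stats = events.groupBy(window("time", "10 minutes"), "status").avg("latency")

# Keep the running aggregate in an in-memory table so it can be served
# with ordinary batch / SQL queries while the stream is running.
query = (stats.writeStream
         .outputMode("complete")
         .format("memory")
         .queryName("stats")
         .start())

spark.sql("SELECT * FROM stats").show()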
-
Structured Streaming API
Apache Spark 1.X: static DataFrames
Apache Spark 2.0: infinite DataFrames
Single API
-
logs = ctx.read.format("json").load("s3://logs")

(logs.groupBy("userid", "hour").avg("latency")
     .write.format("jdbc")
     .save("jdbc:mysql://..."))
Example: Batch App
-
logs = ctx.read.format("json").stream("s3://logs")

(logs.groupBy("userid", "hour").avg("latency")
     .write.format("jdbc")
     .startStream("jdbc:mysql://..."))
Example: Continuous App
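The two examples above use the 2.0-preview method names (stream and startStream). In the API that shipped in Spark 2.0, the same continuous app is written with readStream / writeStream and start(); a minimal sketch, with a console sink standing in for the JDBC serving layer on the slide (Structured Streaming in 2.0 has no built-in JDBC sink) and a hypothetical schema:

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Hypothetical schema matching the slide's columns.
schema = StructType([
    StructField("userid", StringType()),
    StructField("hour", StringType()),
    StructField("latency", DoubleType()),
])

# Same aggregation as the batch example, but over an unbounded stream.
logs = spark.readStream.format("json").schema(schema).load("s3://logs")

query = (logs.groupBy("userid", "hour").avg("latency")
         .writeStream
         .outputMode("complete")   # emit the full updated aggregate on each trigger
         .format("console")        # stand-in sink; the slide serves results over JDBC
         .start())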
-
More Details in Conference Talks
Engine: Structuring Spark, Structured Streaming, deep dives
ML: SparkR, MLlib 2.0, new algorithms
Other: deep learning, GraphFrames, Solr, Cassandra, ...
Try 2.0-preview at spark.apache.org
-
Growing the Community
New initiatives from Databricks
-
"The largest challenge in applying big data is the skills gap."
StackOverflow Developer Survey
-
Databricks Community Edition
Free version of Databricks
Interactive tutorials
Apache Spark and data science libraries
Visualization & de…
GA today! databricks.com
-
Massive Open Online Courses
Free 5-course series on big data with Apache Spark: dbricks.co/mooc16
Introduction to Apache Spark
Distributed Machine Learning with Apache Spark
Advanced Apache Spark for Data Science and Data Engineering
Advanced Machine Learning with Apache Spark
-
Demo: Michael Armbrust