Apache Spark 2.0
-
Matei Zaharia ([email protected], @matei_zaharia)
Apache Spark 2.0
-
Apache Spark 2.0
Next major release, coming out this month. An unstable preview release is available at spark.apache.org
Remains highly compatible with Apache Spark 1.X
Over 2000 patches from 280 contributors!
-
Apache Spark Philosophy
1. Unified engine: support end-to-end applications
2. High-level APIs: easy to use, rich optimizations
3. Integrate broadly: storage systems, libraries, etc.
-
New in 2.0
Structured API improvements: DataFrame, Dataset, SparkSession (sketch below)
Structured Streaming
MLlib model export
MLlib R bindings
SQL 2003 support
Scala 2.12 support

Broader Community
Deep learning libraries (Baidu, Yahoo!, Berkeley, ...)
GraphFrames
PyData integration
Reactive streams
C# bindings: Mobius
JS bindings: EclairJS

Built on a common interface of RDDs & DataFrames
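The structured-API item above mentions SparkSession, the single entry point added in 2.0. As a point of reference, a minimal sketch of creating one (the application name is arbitrary):

from pyspark.sql import SparkSession

# SparkSession is the unified entry point introduced in Spark 2.0;
# it subsumes the separate SQLContext and HiveContext from 1.x.
spark = (SparkSession.builder
         .appName("spark-2.0-demo")   # hypothetical app name
         .getOrCreate())

# The underlying SparkContext remains reachable for RDD code.
sc = spark.sparkContext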
-
Deep Dive: Structured APIs
events = spark.read.json("/logs")
stats = events.join(users).groupBy("loc", "status").avg("duration")
errors = stats.where(stats.status == "ERR")
DataFrame API → Optimized plan → Specialized code
(Diagram: logical plan with READ logs, READ users → JOIN → AGG → FILTER, compiled into generated code)
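The path from DataFrame API to optimized plan can be inspected directly with explain(); a minimal sketch building on the query above (it assumes users is another DataFrame sharing a join column with events):

# Print the parsed, analyzed, optimized logical plans and the physical
# plan that Catalyst produces for the query defined above.
errors.explain(True)

# Nothing executes until an action runs; count() triggers the optimized plan.
errors.count()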
-
New in 2.0
Whole-stage code generation: fuse across multiple operators (sketch below)
(Chart: scan throughput in rows/s; Spark 1.6: 14M rows/s vs Spark 2.0; Parquet in 1.6: 11M rows/s vs Parquet in 2.0)
Optimized input / output: Apache Parquet + built-in cache
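Whole-stage code generation is visible in the physical plan; a minimal sketch, unrelated to the specific benchmark numbers above and reusing the spark session from the earlier sketch:

# Sum the ids of 10 million generated rows.
df = spark.range(10 * 1000 * 1000).selectExpr("sum(id)")

# In Spark 2.0, operators fused by whole-stage code generation appear
# in the physical plan marked with '*' under a WholeStageCodegen node.
df.explain()

df.show()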
-
Structured Streaming
High-level streaming API built on DataFrames: event time, windowing, sessions, sources & sinks (sketch below)
Also supports interactive & batch queries: aggregate data in a stream, then serve using JDBC; change queries at runtime; build and apply ML models
Not just streaming: continuous applications
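As a rough illustration of the features listed above (event time, windowing, a source and a sink), here is a hedged sketch using the Structured Streaming API as released in 2.0; the schema, path, and column names are hypothetical:

from pyspark.sql.functions import window
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

# Hypothetical event schema; streaming file sources need one up front.
schema = StructType([
    StructField("time", TimestampType()),
    StructField("status", StringType()),
    StructField("latency", DoubleType()),
])

# Read a directory of JSON files as an unbounded stream.
events = spark.readStream.format("json").schema(schema).load("s3://logs")

# Event-time windowed aggregation: average latency per 10-minute window and status.
stats = events.groupBy(window("time", "10 minutes"), "status").avg("latency")

# Keep the running aggregate in an in-memory table so it can be served
# with ordinary batch / SQL queries while the stream is running.
query = (stats.writeStream
         .outputMode("complete")
         .format("memory")
         .queryName("stats")
         .start())

spark.sql("SELECT * FROM stats").show()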
-
Structured Streaming API
Apache Spark 1.X: static DataFrames
Apache Spark 2.0: infinite DataFrames
Single API
-
logs = ctx.read.format("json").load("s3://logs")

(logs.groupBy("userid", "hour").avg("latency")
     .write.format("jdbc")
     .save("jdbc:mysql://..."))
Example: Batch App
-
logs = ctx.read.format("json").stream("s3://logs")

(logs.groupBy("userid", "hour").avg("latency")
     .write.format("jdbc")
     .startStream("jdbc:mysql://..."))
Example: Continuous App
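The two examples above use the 2.0-preview method names (stream and startStream). In the API that shipped in Spark 2.0, the same continuous app is written with readStream / writeStream and start(); a minimal sketch, with a console sink standing in for the JDBC serving layer on the slide (Structured Streaming in 2.0 has no built-in JDBC sink) and a hypothetical schema:

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Hypothetical schema matching the slide's columns.
schema = StructType([
    StructField("userid", StringType()),
    StructField("hour", StringType()),
    StructField("latency", DoubleType()),
])

# Same aggregation as the batch example, but over an unbounded stream.
logs = spark.readStream.format("json").schema(schema).load("s3://logs")

query = (logs.groupBy("userid", "hour").avg("latency")
         .writeStream
         .outputMode("complete")   # emit the full updated aggregate on each trigger
         .format("console")        # stand-in sink; the slide serves results over JDBC
         .start())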
-
More Details in Conference Talks
Engine: Structuring Spark, Structured Streaming, deep dives
ML: SparkR, MLlib 2.0, new algorithms
Other: deep learning, GraphFrames, Solr, Cassandra, ...
Try 2.0-preview at spark.apache.org
-
Growing the Community
New initiatives from Databricks
-
"The largest challenge in applying big data is the skills gap."
StackOverflow Developer Survey
-
Databricks Community Edition
Free version of Databricks
Interactive tutorials
Apache Spark and data science libraries
Visualization & de…
GA today! databricks.com
-
Massive Open Online Courses
Free 5-course series on big data with Apache Spark: dbricks.co/mooc16
Introduction to Apache Spark
Distributed Machine Learning with Apache Spark
Advanced Apache Spark for Data Science and Data Engineering
Advanced Machine Learning with Apache Spark
-
Demo: Michael Armbrust