What’s new in Spark 2.0?files.meetup.com/19070069/161129 - Michael... · ©2016 Couchbase Inc. 3...


What’s new in Spark 2.0?


Spark Overview

Apache Spark is a fast and general engine for large-scale data processing.


Spark 2.0

• Largely compatible with 1.x
• Simplifies API
• 2000 patches from 280 contributors

http://www.slideshare.net/SparkSummit/simplifying-big-data-applications-with-apache-spark-20


Spark 2.0

• Structured API Improvements
• Whole-stage code generation
• Structured Streaming
• Simpler Setup
• SQL 2003 Support
• MLlib enhancements
• Enhanced R support
• …


Structured API Improvements

• Dataset (typed) and DataFrame (untyped) are now unified
  • DataFrame == Dataset<Row>
• Also used for Structured Streaming

https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html


Whole-Stage Codegen

• Second-generation Tungsten engine
• Departing from the Volcano Iterator Model
• Also: vectorization for more efficient batch processing

https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html
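To make the contrast concrete, here is a minimal plain-Python sketch (not Spark code) of a filter-then-sum query. It is written first as chained iterator operators in the Volcano style, where every row costs one `next()` call per operator, and then as the single fused loop that whole-stage code generation aims to emit. The class and function names are hypothetical and chosen only for illustration.

```python
from typing import Callable, Iterable, Iterator, Optional


class Scan:
    """Leaf operator: hands out one row per next() call, None when exhausted."""

    def __init__(self, rows: Iterable[int]) -> None:
        self._rows: Iterator[int] = iter(rows)

    def next(self) -> Optional[int]:
        return next(self._rows, None)


class Filter:
    """Pulls rows from its child until one passes the predicate."""

    def __init__(self, child: Scan, predicate: Callable[[int], bool]) -> None:
        self._child = child
        self._predicate = predicate

    def next(self) -> Optional[int]:
        while True:
            row = self._child.next()
            if row is None or self._predicate(row):
                return row


def volcano_sum(rows: Iterable[int]) -> int:
    """Iterator model: the aggregate pulls rows through the operator tree."""
    plan = Filter(Scan(rows), lambda r: r % 2 == 0)
    total = 0
    row = plan.next()
    while row is not None:
        total += row
        row = plan.next()
    return total


def fused_sum(rows: Iterable[int]) -> int:
    """Fused form: scan, filter and aggregate collapsed into one loop."""
    total = 0
    for row in rows:
        if row % 2 == 0:
            total += row
    return total
```

Both functions return the same result; the fused version simply avoids the per-row calls between operators, which is the overhead the slide refers to when it says Spark 2.0 departs from the iterator model.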


http://www.slideshare.net/SparkSummit/simplifying-big-data-applications-with-apache-spark-20


Structured Streaming (Experimental)

• Tackling “continuous applications”
• Integrated API with batch jobs
• Better interaction with storage systems
• Rich integration into the rest of Spark


https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html


Simpler Setup

• SparkSession subsumes SQLContext, HiveContext, …
• One common entry point


SQL 2003 Support

• Supports the SQL 2003 standard
• Reworked native SQL parser
• Native DDL command implementations
• All kinds of subqueries now supported
• Can now run all 99 TPC-DS queries


MLlib Enhancements

• DataFrame as primary ML API
• Model Persistence
  • Support for all language APIs in Spark: Scala, Java, Python & R
  • Support for nearly all ML algorithms in the DataFrame-based API
  • Support for single models and full Pipelines, both unfitted (a “recipe”) and fitted (a result)
  • Distributed storage using an exchangeable format


Q&A

Thanks!


Couchbase & Spark


Ecommerce runs on Couchbase

6 of the top 10 ecommerce companies in the United States

• Online shopping
• Omni-channel services


Travel runs on Couchbase

3 of the top 3 global distribution systems worldwide

3 of the top 10 airlines


Online Video Streaming runs on Couchbase

6 of the top 10 North American and European broadcast television companies


Sports & Casino Gaming runs on Couchbase

6 of the top 10 online sports and casino gaming companies


Financial Services run on Couchbase

3 of the top 3 credit reporting companies


Image: DaHorvath, http://up.picr.de/23770402by.jpg


Why Spark and Couchbase

Overview & Use-Cases


Use Cases

Operations:
• Catalog
• Personalization
• Mobile applications

Analytics:
• Recommendations
• Predictive analytics
• Fraud detection

Diagram label: CB


Use Case: Operationalize Analytics / ML

Diagram labels: Hadoop; ML Model; Data Warehouse; Training Data; CB; Model; Online Data; Serving; Predictions


Adapted from: Databricks, Not Your Father’s Database, https://www.brighttalk.com/webcast/12891/196891


Use Case: Data Integration

Diagram labels: RDBMS; S3; HDFS; ES; NoSQL


Standalone Deployment


Side-by-Side Deployment


Access Patterns

From Spark to Couchbase and Back Again


• Key-Value: Fetch/Store by Document ID
• N1QL Query: Fetch by Criteria (“SQL”)
• Map-Reduce Views: Materialized Indexes (Aggregation)
• Streaming: Mutation Streams for Processing
• Full Text: Search on Freeform Text


Couchbase Data Partitioning


Data Locality

• RDD location hints based on the cluster map
• Not available for N1QL or Views
  • Round robin: can’t give location hints
  • Backend is scatter-gather with 1 node responding
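The idea behind location hints can be sketched in plain Python: hash a document ID to a partition, look the partition up in a cluster map, and report the owning node as the preferred location for the work on that document. The CRC32-modulo hash, the partition count of 1024, the round-robin map and the node names below are all assumptions made for illustration; the connector's real hashing and cluster-map format may differ.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 1024  # assumed partition count, for illustration only


def partition_for(doc_id, num_partitions=NUM_PARTITIONS):
    """Map a document ID to a partition with a simplified CRC32 hash."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_partitions


def build_cluster_map(nodes, num_partitions=NUM_PARTITIONS):
    """Hypothetical cluster map: partitions spread round-robin over nodes."""
    return [nodes[p % len(nodes)] for p in range(num_partitions)]


def group_by_preferred_node(doc_ids, cluster_map):
    """Group document IDs by the node that owns their partition.

    Each group could then become one partition of an RDD whose
    preferred location is that node.
    """
    groups = defaultdict(list)
    for doc_id in doc_ids:
        node = cluster_map[partition_for(doc_id, len(cluster_map))]
        groups[node].append(doc_id)
    return dict(groups)
```

A query service that answers from a single, round-robin-selected node offers no such mapping from data to node, which is why the slide notes that hints are unavailable for N1QL and Views.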


N1QL Query

• N1QL is a SQL service with JSON extensions
• Uses Couchbase’s Global Secondary Indexes
• Can run on any nodes within the cluster
• Nodes with differing services can be added and removed as needed, on the fly


Couchbase Query Architecture

Diagram labels: Data Service; Projector & Router; Query Service; Index Service; Supervisor (index maintenance & scan coordinator); Index #1 to Index #4; Query Processor (cbq-engine); Bucket #1, Bucket #2; DCP Stream; ForestDB Storage Engine


Spark SQL Sources

• TableScan: Scan all of the data and return it.
• PrunedScan: Scan an index that matches only the data relevant to the query at hand.
• PrunedFilteredScan: Scan an index that matches only the data relevant to the query at hand.
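The three scan levels differ in how much work is pushed down to the data source. In Spark's data source API they correspond to `buildScan()`, `buildScan(requiredColumns)` and `buildScan(requiredColumns, filters)`. As a rough plain-Python illustration (not the Spark API itself), the functions below mimic the three levels over an in-memory list. The `ROWS` sample data, the function names and the equality-only filter format are hypothetical.

```python
# Hypothetical sample documents, used only for this sketch.
ROWS = [
    {"name": "Alpha Air", "callsign": "ALPHA", "type": "airline"},
    {"name": "Beta Field", "callsign": None, "type": "airport"},
    {"name": "Gamma Jet", "callsign": "GAMMA", "type": "airline"},
]


def table_scan(rows):
    """TableScan level: return every row with every column."""
    return [dict(row) for row in rows]


def pruned_scan(rows, required_columns):
    """PrunedScan level: return every row, but only the requested columns."""
    return [{col: row.get(col) for col in required_columns} for row in rows]


def pruned_filtered_scan(rows, required_columns, filters):
    """PrunedFilteredScan level: apply filters first, then prune columns.

    `filters` is a list of (column, value) equality pairs.
    """
    matching = [
        row for row in rows
        if all(row.get(col) == value for col, value in filters)
    ]
    return pruned_scan(matching, required_columns)
```

For example, `pruned_filtered_scan(ROWS, ["name"], [("type", "airline")])` returns only the `name` field of the two airline rows, so less data has to leave the source than with a full table scan.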


Predicate Conversion
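Predicate conversion means translating the filters that Spark pushes down into a WHERE clause the database can evaluate itself. The sketch below shows the idea in plain Python for a handful of comparison operators. The tuple-based filter format, the function names and the quoting rules (backticks for identifiers, doubled single quotes in string literals) are assumptions for illustration, not the connector's actual implementation.

```python
# Hypothetical mapping from filter kinds to query operators.
OPERATORS = {"eq": "=", "gt": ">", "gte": ">=", "lt": "<", "lte": "<="}


def quote_identifier(name):
    """Wrap a field or bucket name in backticks; reject embedded backticks."""
    if "`" in name:
        raise ValueError("identifier must not contain a backtick: %r" % name)
    return "`" + name + "`"


def literal(value):
    """Render a Python value as a query literal (assumed escaping rules)."""
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return str(value)
    return "'" + str(value).replace("'", "''") + "'"


def to_where_clause(filters):
    """Join (operator, column, value) filters with AND."""
    parts = [
        "%s %s %s" % (quote_identifier(col), OPERATORS[op], literal(val))
        for op, col, val in filters
    ]
    return " AND ".join(parts)


def build_query(bucket, columns, filters):
    """Combine pruned columns and converted predicates into one query string."""
    select = ",".join(quote_identifier(col) for col in columns)
    query = "SELECT %s FROM %s" % (select, quote_identifier(bucket))
    where = to_where_clause(filters)
    if where:
        query += " WHERE " + where
    return query
```

Calling `build_query("travel-sample", ["name", "callsign"], [("eq", "type", "airline")])` yields a string of the same shape as the generated query in the Schema Inference log output in this deck.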


Schema Inference


N1QLRelation:28 - Inferring schema from bucket travel-sample with query 'SELECT META(`travel-sample`).id as `META_ID`, `travel-sample`.* FROM `travel-sample` WHERE `type` = 'airline' LIMIT 1000'

N1QLRelation:28 - Executing generated query: 'SELECT `name`,`callsign` FROM `travel-sample` WHERE `type` = 'airline''
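The first log line suggests that the schema is inferred by sampling documents (here with LIMIT 1000) and deriving field names and types from the sample. A minimal plain-Python sketch of that idea follows. The type names and the rule for conflicting types (widen long to double, otherwise fall back to string) are assumptions, not the connector's documented behavior.

```python
def type_name(value):
    """Map a decoded JSON value to a simple type name."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "struct"
    return "string"


def merge_types(left, right):
    """Reconcile the types seen for one field across two documents."""
    if left == right:
        return left
    if left == "null":
        return right
    if right == "null":
        return left
    if {left, right} == {"long", "double"}:
        return "double"
    return "string"  # assumed fallback when types conflict


def infer_schema(documents):
    """Build a field-to-type mapping from a sample of documents."""
    schema = {}
    for doc in documents:
        for field, value in doc.items():
            seen = type_name(value)
            schema[field] = merge_types(schema[field], seen) if field in schema else seen
    return schema
```

Fields that appear in only some of the sampled documents still end up in the schema, which is one reason a larger sample gives a more complete result.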


DCP and Spark Streaming

Diagram labels: Replica; Indexing


Structured Streaming Source

Diagram labels: DCP Stream; Unbounded Table

Adapted from https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
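The unbounded-table view treats every change arriving on the stream as a new row appended to a table, with the query result updated incrementally as rows arrive. A small plain-Python sketch of that model, using a running count per document type, is shown below. The mutation dictionaries and the class name are hypothetical and reflect neither the DCP wire format nor the Spark API.

```python
from collections import Counter


class RunningTypeCount:
    """Incrementally maintained 'GROUP BY type, COUNT(*)' over a stream."""

    def __init__(self):
        self.table = []          # the conceptually unbounded input table
        self.result = Counter()  # the continuously updated query result

    def process_batch(self, mutations):
        """Append one batch of mutations and update the result in place."""
        for mutation in mutations:
            self.table.append(mutation)
            self.result[mutation["type"]] += 1
        return dict(self.result)
```

Feeding two batches in sequence returns a growing result each time, without rescanning earlier rows:

```python
query = RunningTypeCount()
query.process_batch([{"id": "a1", "type": "airline"}])
query.process_batch([{"id": "a2", "type": "airline"}, {"id": "p1", "type": "airport"}])
```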


(Un)Structured Streaming?


Structured Streaming Sink


Couchbase Spark Connector 1.2.1

• Spark 1.6.x support, including Datasets
• DCP Flow Control
• Enhanced Java APIs


Couchbase Spark Connector 2.0.0

• Spark 2.0.x Support
• Enhanced DCP Client
• Experimental Structured Streaming


Resources

• Spark Packages: https://spark-packages.org/package/couchbase/couchbase-spark-connector
• Docs: http://docs.couchbase.com
• Source: https://github.com/couchbase/couchbase-spark-connector
• Bugs: https://issues.couchbase.com/browse/SPARKC


Q&A

Thanks!