Cassandra and Spark SQL

Transcript
Page 1: Cassandra and Spark SQL

You don't need Functional Programming for Fun!

Cassandra and SparkSQL

Page 2: Cassandra and Spark SQL

Russell (left) and Cara (right)

• Software Engineer
• Spark-Cassandra Integration since Spark 0.9
• Cassandra since Cassandra 1.2
• 2 Year Scala Convert
• Still not comfortable talking about Monads in public

@Evanfchan

Page 3: Cassandra and Spark SQL

A Story in 3 Parts

• Why SparkSQL? • The Spark SQL Thrift Server • Writing SQL for Spark

Page 4: Cassandra and Spark SQL

You have lots of options. Why Spark SQL?

• Scala? • Java?

Page 5: Cassandra and Spark SQL

Spark is A Powerful Analytics Tool Built on Scala

Distributed Analytics Platform with In Memory Capabilities

Lots of new concepts: RDDs, DataSets, Streaming, Serialization, Functional Programming

Page 6: Cassandra and Spark SQL

Functional Programming Is Awesome

Side-effect Free Functions

Monads

Easy Parallelization

Anonymous Functions

Scala

Async Models

Type Matching

rdd.map(y => y+1)

Endofunctors

Page 7: Cassandra and Spark SQL

Functional Programming can be Hard

blah-blah blah

Blah

Easy blahilization

baaaaah

blahala

Asybc blah

Blah blahhing

rdd.map(y => y+1)

Aren't Endofunctors from Ghostbusters?

Endofunctors

Page 8: Cassandra and Spark SQL

Practical considerations when devoting time to a new Project.

Usually Me: Compile Time Type Safety! Catalyst! Tungsten! We get to learn all sorts of fun new things! SBT is probably great!

Less Excitable Dev: We ship next week

Page 9: Cassandra and Spark SQL

Spark SQL Provides A Familiar and Easy API

Use SQL to access the Power of Spark

Page 10: Cassandra and Spark SQL

Spark SQL Provides A Familiar and Easy API

Catalyst

Codegen! Optimization!

Predicate Pushdowns

Distributed Work

SQL

Page 11: Cassandra and Spark SQL

It still takes Scala/Java/Python/… Code.

import org.apache.spark.sql.cassandra._

val df = spark
  .read
  .cassandraFormat("tab", "ks")
  .load

df.createTempView("tab")
spark.sql("SELECT * FROM tab").show

+---+---+---+
|  k|  c|  v|
+---+---+---+
|  1|  1|  1|
|  1|  2|  2|

Let me color code that by parts I like vs parts I don't like.

Page 12: Cassandra and Spark SQL

It still takes Scala/Java/Python/… Code.

import org.apache.spark.sql.cassandra._

val df = spark
  .read
  .cassandraFormat("tab", "ks")
  .load

df.createTempView("tab")
spark.sql("SELECT * FROM tab").show

+---+---+---+
|  k|  c|  v|
+---+---+---+
|  1|  1|  1|
|  1|  2|  2|

Also, your import has an underscore in it..

Page 13: Cassandra and Spark SQL

For exploration we have the Spark-SQL Shell

spark-sql> SELECT * FROM ks.tab;
1 2 2
1 3 3

Page 14: Cassandra and Spark SQL

For exploration we have the Spark-SQL Shell

spark-sql> SELECT * FROM ks.tab;
1 2 2
1 3 3

SparkSession

Page 15: Cassandra and Spark SQL

For exploration we have the Spark-SQL Shell

spark-sql> SELECT * FROM ks.tab;
1 2 2
1 3 3

SparkSession

Executor Executor Executor Executor Executor

Page 16: Cassandra and Spark SQL

Not really good for multiple users

spark-sql> SELECT * FROM ks.tab;
1 2 2
1 3 3

SparkSession

Executor Executor Executor Executor Executor

Page 17: Cassandra and Spark SQL

Enter Spark Thrift Server

Spark SQL Thrift Server

Executor Executor Executor Executor Executor

JDBC Client JDBC Client JDBC Client

Page 18: Cassandra and Spark SQL

The Spark SQL Thrift Server is a Spark Application

• Built on HiveServer2
• Single Spark Context
• Clients Communicate with it via JDBC
• Can use all SparkSQL
• Fair Scheduling
• Clients can share Cached Resources
• Security

Page 20: Cassandra and Spark SQL

Fair Scheduling is Sharing

FIFO

Time

Page 24: Cassandra and Spark SQL

Fair Scheduling is Sharing

FIFO

FAIR

Time
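To make the FIFO versus FAIR picture concrete, here is a toy simulation in plain Python (not Spark code). The function name `simulate`, the job sizes, and the slot count are made up for illustration. Spark's real fair scheduler, enabled with `spark.scheduler.mode=FAIR`, also has pools and weights that this sketch ignores.

```python
def simulate(jobs, slots, mode):
    """Return {job_name: step_at_which_it_finished} on a toy cluster.

    jobs  : list of (name, task_count) in submission order
    slots : how many tasks the cluster can run per time step (must be >= 1)
    mode  : "FIFO" or "FAIR"
    """
    if slots < 1:
        raise ValueError("slots must be at least 1")
    remaining = dict(jobs)  # dicts preserve insertion (submission) order
    finished = {}
    step = 0
    while remaining:
        step += 1
        free = slots
        if mode == "FIFO":
            # Earlier jobs take every slot they can use before later jobs get any.
            for name in remaining:
                run = min(free, remaining[name])
                remaining[name] -= run
                free -= run
        else:
            # FAIR: hand out slots one at a time, round-robin across active jobs.
            while free > 0 and any(remaining.values()):
                for name in remaining:
                    if free > 0 and remaining[name] > 0:
                        remaining[name] -= 1
                        free -= 1
        # Record and drop every job that ran out of tasks in this step.
        for name in [n for n, left in remaining.items() if left == 0]:
            finished[name] = step
            del remaining[name]
    return finished


jobs = [("big", 8), ("small", 2)]
print(simulate(jobs, 2, "FIFO"))  # {'big': 4, 'small': 5}
print(simulate(jobs, 2, "FAIR"))  # {'small': 2, 'big': 5}
```

Under FIFO the small query waits behind the big one. Under FAIR it finishes at step 2 and the big query is delayed by one step, which is the sharing trade-off the slide title describes.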

Page 27: Cassandra and Spark SQL

A Single Context can Share Cached Data

Spark SQL Thrift Server

Executor Executor Executor Executor Executor

cache TABLE today select * from ks.tab where date = today;

Page 28: Cassandra and Spark SQL

A Single Context can Share Cached Data

Spark SQL Thrift Server

Executor Executor Executor Executor Executor

CACHED CACHED CACHED CACHED CACHED

cache TABLE today select * from ks.tab where date = today;

Page 29: Cassandra and Spark SQL

A Single Context can Share Cached Data

Spark SQL Thrift Server

Executor Executor Executor Executor Executor

CACHED CACHED CACHED CACHED CACHED

cache TABLE today select * from ks.tab where date = today;

SELECT * from TODAY where age > 5

Page 30: Cassandra and Spark SQL

How to use it

Starts from the command line and can use all Spark Submit Args
• ./sbin/start-thriftserver.sh
• dse spark-sql-thriftserver start

starting org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

Page 31: Cassandra and Spark SQL

How to use it

Starts from the command line and can use all Spark Submit Args
• ./sbin/start-thriftserver.sh
• dse spark-sql-thriftserver start

Use with all of your favorite Spark Packages like the Spark Cassandra Connector!

--packages com.datastax.spark:spark-cassandra-connector_2.11:2.0.2 --conf spark.cassandra.connection.host=127.0.0.1
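If the launch is scripted, the command line above can be assembled from a package list and a dictionary of settings. This is only a sketch: the helper name `thriftserver_command` is hypothetical, and it builds the argument list without actually starting anything.

```python
import shlex


def thriftserver_command(packages, conf, script="./sbin/start-thriftserver.sh"):
    """Build the argv list for the start script, using spark-submit style args."""
    argv = [script]
    if packages:
        # --packages takes a single comma-separated list of Maven coordinates.
        argv += ["--packages", ",".join(packages)]
    for key, value in conf.items():
        # Each setting becomes its own --conf key=value pair.
        argv += ["--conf", f"{key}={value}"]
    return argv


argv = thriftserver_command(
    packages=["com.datastax.spark:spark-cassandra-connector_2.11:2.0.2"],
    conf={"spark.cassandra.connection.host": "127.0.0.1"},
)
print(shlex.join(argv))
```

The resulting list could be handed to `subprocess.run(argv)` on a machine where the Spark distribution is installed.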

Page 32: Cassandra and Spark SQL

Hive? Wait, I thought we were Doing Spark

starting org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

Why does it say Hive everywhere?

• Built on HiveServer2

Page 33: Cassandra and Spark SQL

A Brief History of the Spark Thrift Server

• Thrift? • Hive?

Page 34: Cassandra and Spark SQL

They are not the Same

Cassandra Thrift Hive Thrift

Page 35: Cassandra and Spark SQL

Have you heard of the "Ship of Theseus"?

Time for a quick history

More Greek stuff ..

Page 36: Cassandra and Spark SQL

When you replace all the parts of a thing Does it Remain the Same?

Greek Boat

Page 37: Cassandra and Spark SQL

When you replace all the parts of a thing Does it Remain the Same?

SharkServer

Hive Parser

Hive Optimization

Map-Reduce

Spark Execution

JDBC Results

Page 38: Cassandra and Spark SQL

When you replace all the parts of a thing Does it Remain the Same?

SharkServer ThriftServer

Hive Parser

Map-Reduce

Spark Execution

JDBC Results

Hive Optimization

Page 39: Cassandra and Spark SQL

When you replace all the parts of a thing Does it Remain the Same?

ThriftServer

Hive Parser

Catalyst

Schema RDDs

Spark Execution

JDBC Results

Page 40: Cassandra and Spark SQL

When you replace all the parts of a thing Does it Remain the Same?

ThriftServer

Hive Parser

Catalyst

Dataframes

Spark Execution

JDBC Results

Page 41: Cassandra and Spark SQL

When you replace all the parts of a thing Does it Remain the Same?

ThriftServer

Hive Parser

Catalyst

DataSets

Spark Execution

JDBC Results

Page 42: Cassandra and Spark SQL

When you replace all the parts of a thing Does it Remain the Same?

ThriftServer

Spark Parser

Catalyst

DataSets

Spark Execution

JDBC Results

Page 43: Cassandra and Spark SQL

Almost all Spark now

ThriftServer

Spark Parser

Catalyst

DataSets

Spark Execution

JDBC Results

Page 44: Cassandra and Spark SQL

Connecting with Beeline (JDBC Client)

./bin/beeline
dse beeline
!connect jdbc:hive2://localhost:10000

Even More Hive!

Page 45: Cassandra and Spark SQL

Connect Tableau to Cassandra

Page 46: Cassandra and Spark SQL

The Full JDBC/ODBC Ecosystem Can Connect to ThriftServer

Page 47: Cassandra and Spark SQL

Incremental Collect - Because some BI Tools are Mean

SELECT * FROM TABLE

Spark SQL Thrift Server

ALL THE DATA

Page 48: Cassandra and Spark SQL

Incremental Collect - Because some BI Tools are Mean

SELECT * FROM TABLE

Spark SQL Thrift Server

ALL THE DATA

OOM

Page 49: Cassandra and Spark SQL

Incremental Collect - Because some BI Tools are Mean

SELECT * FROM TABLE

Spark SQL Thrift Server

spark.sql.thriftServer.incrementalCollect=true

ALL THE DATA

Spark Partition 1 Spark Partition 2 Spark Partition 3
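The difference between the two pictures can be sketched with a toy model in plain Python (not Spark code). The function name `peak_rows_buffered` and the partition sizes are invented for illustration. The model only captures the idea on the slides: with incremental collect the server holds one Spark partition at a time instead of ALL THE DATA.

```python
def peak_rows_buffered(partitions, incremental):
    """Return (rows_delivered, peak_rows_held_by_server) for a toy result fetch."""
    delivered, peak = 0, 0
    if incremental:
        for part in partitions:
            buffer = list(part)        # fetch one partition at a time
            peak = max(peak, len(buffer))
            delivered += len(buffer)   # rows go to the client, then the buffer is dropped
    else:
        # Materialize every partition before returning anything.
        buffer = [row for part in partitions for row in part]
        peak = len(buffer)
        delivered = len(buffer)
    return delivered, peak


partitions = [range(0, 1000), range(1000, 2000), range(2000, 3000)]
print(peak_rows_buffered(partitions, incremental=False))  # (3000, 3000)
print(peak_rows_buffered(partitions, incremental=True))   # (3000, 1000)
```

The same rows reach the client either way. Only the peak held in memory changes, which is what keeps a `SELECT * FROM TABLE` from a BI tool from ending in an OOM.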

Page 53: Cassandra and Spark SQL

Getting things done with SQL

• Registering Sources
• Writing to Tables
• Examining Query Plans
• Debugging Predicate pushdowns
• Caching Views

Page 54: Cassandra and Spark SQL

Registering Sources using SQL

CREATE TEMPORARY VIEW words
USING format.goes.here
OPTIONS (key "value")

Page 55: Cassandra and Spark SQL

Registering Sources using SQL

CREATE TEMPORARY VIEW words
USING org.apache.spark.sql.cassandra
OPTIONS (table "tab", keyspace "ks")

Not a single monad…

Page 56: Cassandra and Spark SQL

Registering Sources using SQL

CREATE TEMPORARY VIEW words
USING org.apache.spark.sql.cassandra
OPTIONS (table "tab", keyspace "ks")

CassandraSourceRelation
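When many tables need registering, the statement can be generated rather than typed. A minimal sketch, assuming a hypothetical helper named `create_temp_view_sql`; it only builds the SQL string and does no escaping, so it is suitable for trusted names and values only.

```python
def create_temp_view_sql(view, source_format, options):
    """Build a CREATE TEMPORARY VIEW statement from a dict of source options."""
    # Each option is rendered as: key "value"
    opts = ", ".join(f'{key} "{value}"' for key, value in options.items())
    return f"CREATE TEMPORARY VIEW {view} USING {source_format} OPTIONS ({opts})"


statement = create_temp_view_sql(
    "words",
    "org.apache.spark.sql.cassandra",
    {"table": "tab", "keyspace": "ks"},
)
print(statement)
```

The string can then be sent through any client that talks to the Thrift Server, such as Beeline or a JDBC connection.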

Page 57: Cassandra and Spark SQL

We Can Still Use a HiveMetaStore

DSE auto registers C* Tables in a C* based Metastore

MetaStore Thrift Server

Page 58: Cassandra and Spark SQL

Writing DataFrames using SQL

INSERT INTO arrow SELECT * FROM words;

CassandraSourceRelation words read

CassandraSourceRelation arrow write

Page 59: Cassandra and Spark SQL

Explain to Analyze Query Plans

EXPLAIN SELECT * FROM arrow WHERE C > 2;
Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@6069193a [k#18,c#19,v#20] PushedFilters: [IsNotNull(c), GreaterThan(c,2)], ReadSchema: struct<k:int,c:int,v:int>

We can analyze the inside of the Catalyst just like with Scala/Java/…

Page 60: Cassandra and Spark SQL

Predicates get Pushed Down Automatically

EXPLAIN SELECT * FROM arrow WHERE C > 2;
Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@6069193a [k#18,c#19,v#20] PushedFilters: [IsNotNull(c), GreaterThan(c,2)], ReadSchema: struct<k:int,c:int,v:int>

CassandraSourceRelation Filter [GreaterThan(c,2)]

Page 61: Cassandra and Spark SQL

Predicates get Pushed Down Automatically

EXPLAIN SELECT * FROM arrow WHERE C > 2;
Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@6069193a [k#18,c#19,v#20] PushedFilters: [IsNotNull(c), GreaterThan(c,2)], ReadSchema: struct<k:int,c:int,v:int>

CassandraSourceRelation

Filter [GreaterThan(c,2)]

Internal Request to Cassandra: CQL SELECT * FROM ks.bat WHERE C > 2

Automatic Pushdowns!

Page 62: Cassandra and Spark SQL

Common Cases where Predicates Don't Push

SELECT * from troubles WHERE c < '2017-05-27'
*Filter (cast(c#76 as string) < 2017-05-27)
+- *Scan CassandraSourceRelation@53e82b30 [k#75,c#76,v#77] PushedFilters: [IsNotNull(c)], ReadSchema: struct<k:int,c:date,v:int>

Why is my date clustering column not being pushed down?

Page 63: Cassandra and Spark SQL

Common Cases where Predicates Don't Push

CassandraSourceRelation Filter [LessThan(c,'2017-05-27')]

SELECT * from troubles WHERE c < '2017-05-27'
*Filter (cast(c#76 as string) < 2017-05-27)
+- *Scan CassandraSourceRelation@53e82b30 [k#75,c#76,v#77] PushedFilters: [IsNotNull(c)], ReadSchema: struct<k:int,c:date,v:int>

Page 64: Cassandra and Spark SQL

Common Cases where Predicates Don't Push

CassandraSourceRelation ReadSchema: struct<k:int,c:date,v:int>

Filter [LessThan(c,'2017-05-27')]

Date != String

SELECT * from troubles WHERE c < '2017-05-27'
*Filter (cast(c#76 as string) < 2017-05-27)
+- *Scan CassandraSourceRelation@53e82b30 [k#75,c#76,v#77] PushedFilters: [IsNotNull(c)], ReadSchema: struct<k:int,c:date,v:int>

Page 65: Cassandra and Spark SQL

Make Sure we Cast Correctly

EXPLAIN SELECT * from troubles WHERE c < cast('2017-05-27' as date);
*Scan C*Relation PushedFilters: [IsNotNull(c), LessThan(c,2017-05-27)]

CassandraSourceRelation ReadSchema: struct<k:int,c:date,v:int>

Filter [LessThan(c,Date('2017-05-27'))]

Date == Date

Page 66: Cassandra and Spark SQL

Make Sure we Cast Correctly

EXPLAIN SELECT * from troubles WHERE c < cast('2017-05-27' as date);
*Scan C*Relation PushedFilters: [IsNotNull(c), LessThan(c,2017-05-27)]

CassandraSourceRelation ReadSchema: struct<k:int,c:date,v:int>

Filter [LessThan(c,Date('2017-05-27'))]

Automatic Pushdowns!
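The Date versus String behaviour can be mimicked with a toy planner in plain Python. This is an illustration of the two plans shown on the slides, not Spark's actual coercion logic, which is more involved and can differ between versions. The names `plan_filter` and `TYPE_NAMES` are hypothetical.

```python
from datetime import date

# Spark-style names for the Python types used in this toy model (illustrative only).
TYPE_NAMES = {str: "string", date: "date"}


def plan_filter(column_type, literal):
    """Split `c < literal` into filters pushed to the source and filters left in Spark."""
    if isinstance(literal, column_type):
        # Types agree: the whole comparison can be handed to the data source.
        return {"pushed": ["IsNotNull(c)", f"LessThan(c,{literal})"], "spark_side": []}
    # Types differ: the column is cast instead, and the source cannot evaluate that cast.
    cast_to = TYPE_NAMES[type(literal)]
    return {"pushed": ["IsNotNull(c)"], "spark_side": [f"cast(c as {cast_to}) < {literal}"]}


print(plan_filter(date, "2017-05-27"))       # string literal: filter stays in Spark
print(plan_filter(date, date(2017, 5, 27)))  # date literal: filter is pushed down
```

In the first call only `IsNotNull(c)` reaches the source, matching the plan where the date column was cast to a string. In the second call the `LessThan` filter is pushed, matching the plan produced after the explicit cast to date.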

Page 67: Cassandra and Spark SQL

DSE Search Automatic Pushdowns!

EXPLAIN SELECT * from troubles WHERE v < 6;
*Scan C*Relation PushedFilters: [IsNotNull(v), LessThan(v,6)]

CassandraSourceRelation ReadSchema: struct<k:int,c:date,v:int>

Solr_Query

Page 68: Cassandra and Spark SQL

DSE Search Automatic Pushdowns!

Page 69: Cassandra and Spark SQL

DSE Search Automatic Pushdowns!

Count Happens in the Index

DSE Continuous Paging

Page 70: Cassandra and Spark SQL

Cache a whole table

CassandraSourceRelation InMemoryRelation

CACHE TABLE ks.tab;
explain SELECT * FROM ks.tab;
== Physical Plan ==
InMemoryTableScan [k#0,c#1,v#2]
:  +- InMemoryRelation StorageLevel(disk, memory, deserialized, 1 replicas), `ks`.`tab`
:     :  +- *Scan CassandraSourceRelation

Page 71: Cassandra and Spark SQL

Uncache

CassandraSourceRelation

UNCACHE TABLE ks.tab;
explain SELECT * FROM ks.tab;
== Physical Plan ==
*Scan CassandraSourceRelation

Page 72: Cassandra and Spark SQL

Cache a fraction of Data

CassandraSourceRelation

CACHE TABLE somedata SELECT * FROM ks.tab WHERE c > 2;
explain SELECT * from somedata;
== Physical Plan ==
InMemoryTableScan
:  +- InMemoryRelation `somedata`
:     :  +- *Scan CassandraSourceRelation PushedFilters: [IsNotNull(c), GreaterThan(c,2)]

Filter [GreaterThan(c,2)]

InMemoryRelation somedata

Page 73: Cassandra and Spark SQL

Let this be a starting point

• https://github.com/datastax/spark-cassandra-connector
• https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md
• https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-thrift-server.html
• http://docs.datastax.com/en/dse/5.1/dse-dev/datastax_enterprise/spark/sparkSqlThriftServer.html
• https://spark.apache.org/docs/latest/sql-programming-guide.html#distributed-sql-engine
• https://www.datastax.com/dev/blog/dse-5-1-automatic-optimization-of-spark-sql-queries-using-dse-search
• https://www.datastax.com/dev/blog/dse-continuous-paging-tuning-and-support-guide

Page 74: Cassandra and Spark SQL

Thank You.

http://www.russellspitzer.com/
@RussSpitzer

Come chat with us at DataStax Academy: https://academy.datastax.com/slack