Cassandra and Spark SQL

Posted 18-Mar-2018 · Category: Software

Transcript of Cassandra and Spark SQL

  • You don't need Functional Programming for Fun!

    Cassandra and SparkSQL

  • Russell (left) and Cara (right), Software Engineers

    Spark-Cassandra Integration since Spark 0.9

    Cassandra since Cassandra 1.2

    2-year Scala convert; still not comfortable talking about Monads in public

    @Evanfchan

  • A Story in 3 Parts

    Why Spark SQL?
    The Spark SQL Thrift Server
    Writing SQL for Spark

  • You have lots of options, so why Spark SQL?

    Scala? Java?

  • Spark is A Powerful Analytics Tool Built on Scala

    Distributed Analytics Platform with In Memory Capabilities

    Lots of new concepts: RDDs, DataSets, Streaming, Serialization, Functional Programming

  • Functional Programming Is Awesome

    Side-effect Free Functions

    Monads

    Easy Parallelization

    Anonymous Functions

    Scala

    Async Models

    Type Matching

    rdd.map(y => y+1)

    Endofunctors

  • Functional Programming can be Hard

    blah-blah blah

    Blah

    Easy blahilization

    baaaaah

    blahala

    Asybc blah

    Blah blahhing

    rdd.map(y => y+1)

    Aren't Endofunctors from Ghostbusters?

    Endofunctors

  • Practical considerations when devoting time to a new project

    Usually Me: "Compile Time Type Safety! Catalyst! Tungsten! We get to learn all sorts of fun new things! SBT is probably great!"

    Less Excitable Dev: "We ship next week."

  • Spark SQL Provides A Familiar and Easy API

    Use SQL to access the Power of Spark

  • Spark SQL Provides A Familiar and Easy API

    Catalyst

    Codegen! Optimization!

    Predicate Pushdowns

    Distributed Work

    SQL

  • It still takes Scala/Java/Python code.

    import org.apache.spark.sql.cassandra._

    val df = spark
      .read
      .cassandraFormat("tab", "ks")
      .load

    df.createTempView("tab")
    spark.sql("SELECT * FROM tab").show

    +---+---+---+
    |  k|  c|  v|
    +---+---+---+
    |  1|  1|  1|
    |  1|  2|  2|

    Let me color code that by parts I like vs parts I don't like.

  • It still takes Scala/Java/Python code.

    import org.apache.spark.sql.cassandra._

    val df = spark
      .read
      .cassandraFormat("tab", "ks")
      .load

    df.createTempView("tab")
    spark.sql("SELECT * FROM tab").show

    +---+---+---+
    |  k|  c|  v|
    +---+---+---+
    |  1|  1|  1|
    |  1|  2|  2|

    Also, your import has an underscore in it...

  • For exploration we have the Spark-SQL Shell

    spark-sql> SELECT * FROM ks.tab;
    1 2 2
    1 3 3

  • For exploration we have the Spark-SQL Shell

    spark-sql> SELECT * FROM ks.tab;
    1 2 2
    1 3 3

    SparkSession

  • For exploration we have the Spark-SQL Shell

    spark-sql> SELECT * FROM ks.tab;
    1 2 2
    1 3 3

    SparkSession

    Executor Executor Executor Executor Executor

  • Not really good for multiple users

    spark-sql> SELECT * FROM ks.tab;
    1 2 2
    1 3 3

    SparkSession

    Executor Executor Executor Executor Executor

  • Enter Spark Thrift Server

    Spark Sql Thrift Server

    Executor Executor Executor Executor Executor

    JDBC Client JDBC ClientJDBC Client

  • The Spark Sql Thrift Server is a Spark Application

    Built on HiveServer2
    Single Spark Context
    Clients communicate with it via JDBC
    Can use all of Spark SQL
    Fair Scheduling
    Clients can share cached resources
    Security


  • Fair Scheduling is Sharing

    FIFO

    Time


  • Fair Scheduling is Sharing

    FIFO

    FAIR

    Time

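Fair scheduling in the Thrift Server is driven by named scheduler pools. As a sketch, a JDBC session can opt into a pool with a SET statement; the property name comes from Spark's Thrift Server documentation, and the pool name "accounting" is a placeholder for a pool defined in your fair-scheduler allocation file:

```sql
-- Run this session's queries in a named fair-scheduler pool
-- ("accounting" is a placeholder pool name)
SET spark.sql.thriftserver.scheduler.pool=accounting;
```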

  • Single Context can Share Cached Data

    Spark Sql Thrift Server

    Executor Executor Executor Executor Executor

    CACHE TABLE today AS SELECT * FROM ks.tab WHERE date = today;

  • Single Context can Share Cached Data

    Spark Sql Thrift Server

    Executor Executor Executor Executor Executor
    CACHED CACHED CACHED CACHED CACHED

    CACHE TABLE today AS SELECT * FROM ks.tab WHERE date = today;

  • Single Context can Share Cached Data

    Spark Sql Thrift Server

    Executor Executor Executor Executor Executor
    CACHED CACHED CACHED CACHED CACHED

    CACHE TABLE today AS SELECT * FROM ks.tab WHERE date = today;

    SELECT * FROM today WHERE age > 5;
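Putting the build above together, a minimal two-client session might look like this; `ks.tab`, `date`, and `age` come from the slides, and the date literal is a placeholder:

```sql
-- Client A caches today's slice once on the shared context
CACHE TABLE today AS SELECT * FROM ks.tab WHERE date = '2018-03-18';

-- Client B, connected to the same Thrift Server, reads from the cache
SELECT * FROM today WHERE age > 5;
```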

  • How to use it

    Starts from the command line and can use all Spark Submit args:

    ./sbin/start-thriftserver.sh

    dse spark-sql-thriftserver start

    starting org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

  • How to use it

    Starts from the command line and can use all Spark Submit args:

    ./sbin/start-thriftserver.sh

    dse spark-sql-thriftserver start

    Use with all of your favorite Spark Packages like the Spark Cassandra Connector!

    --packages com.datastax.spark:spark-cassandra-connector_2.11:2.0.2
    --conf spark.cassandra.connection.host=127.0.0.1

  • Hive? Wait, I thought we were doing Spark

    starting org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    Why does it say Hive everywhere?

    Built on HiveServer2

  • A Brief History of the Spark Thrift Server

    Thrift? Hive?

  • They are not the Same

    Cassandra Thrift Hive Thrift

  • Have you heard of the "Ship of Theseus"?

    Time for a quick history

    More Greek stuff...

  • When you replace all the parts of a thing Does it Remain the Same?

    Greek Boat

  • When you replace all the parts of a thing Does it Remain the Same?

    SharkServer

    Hive Parser

    Hive Optimization

    Map-Reduce

    Spark Execution

    JDBC Results

  • When you replace all the parts of a thing Does it Remain the Same?

    SharkServer ThriftServer

    Hive Parser

    Map-Reduce

    Spark Execution

    JDBC Results

    Hive Optimization

  • When you replace all the parts of a thing Does it Remain the Same?

    ThriftServer

    Hive Parser

    Catalyst

    Schema RDDs

    Spark Execution

    JDBC Results

  • When you replace all the parts of a thing Does it Remain the Same?

    ThriftServer

    Hive Parser

    Catalyst

    Dataframes

    Spark Execution

    JDBC Results

  • When you replace all the parts of a thing Does it Remain the Same?

    ThriftServer

    Hive Parser

    Catalyst

    DataSets

    Spark Execution

    JDBC Results

  • When you replace all the parts of a thing Does it Remain the Same?

    ThriftServer

    Spark Parser

    Catalyst

    DataSets

    Spark Execution

    JDBC Results

  • Almost all Spark now

    ThriftServer

    Spark Parser

    Catalyst

    DataSets

    Spark Execution

    JDBC Results

  • Connecting with Beeline (JDBC Client)

    ./bin/beeline

    dse beeline

    !connect jdbc:hive2://localhost:10000

    Even More Hive!

    hive2://localhost:10000

  • Connect Tableau to Cassandra

  • The Full JDBC/ODBC Ecosystem Can Connect to ThriftServer

  • Incremental Collect - Because some BI Tools are Mean

    SELECT * FROM TABLE

    Spark Sql Thrift Server

    ALL THE DATA

  • Incremental Collect - Because some BI Tools are Mean

    SELECT * FROM TABLE

    Spark Sql Thrift Server

    ALL THE DATA

    OOM

  • Incremental Collect - Because some BI Tools are Mean

    SELECT * FROM TABLE

    Spark Sql Thrift Server

    spark.sql.thriftServer.incrementalCollect=true

    ALL THE DATA

    Spark Partition 1 Spark Partition 2 Spark Partition 3

  • Incremental Collect - Because some BI Tools are Mean

    SELECT * FROM TABLE

    Spark Sql Thrift Server

    spark.sql.thriftServer.incrementalCollect=true

    ALL THE DATA

    Spark Partition 1

    Spark Partition 2 Spark Partition 3

  • Incremental Collect - Because some BI Tools are Mean

    SELECT * FROM TABLE

    Spark Sql Thrift Server

    spark.sql.thriftServer.incrementalCollect=true

    ALL THE DATA

    Spark Partition 1 Spark Partition 2

    Spark Partition 3

  • Incremental Collect - Because some BI Tools are Mean

    SELECT * FROM TABLE

    Spark Sql Thrift Server

    spark.sql.thriftServer.incrementalCollect=true

    ALL THE DATA

    Spark Partition 1 Spark Partition 2 Spark Partition 3
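The flag shown in the build above can also be set per session from any JDBC client, so a single misbehaving BI tool can be isolated without restarting the server (a sketch using the property name from the slide):

```sql
-- Stream results back partition by partition instead of
-- collecting the whole result set on the Thrift Server
SET spark.sql.thriftServer.incrementalCollect=true;
```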

  • Getting things done with SQL

    Registering Sources
    Writing to Tables
    Examining Query Plans
    Debugging Predicate Pushdowns
    Caching Views

  • Registering Sources using SQL

    CREATE TEMPORARY VIEW words
    USING format.goes.here
    OPTIONS (key "value")

  • Registering Sources using SQL

    CREATE TEMPORARY VIEW words
    USING org.apache.spark.sql.cassandra
    OPTIONS (table "tab", keyspace "ks")

    Not a single monad

  • Registering Sources using SQL

    CREATE TEMPORARY VIEW words
    USING org.apache.spark.sql.cassandra
    OPTIONS (table "tab", keyspace "ks")

    CassandraSourceRelation

  • We Can Still Use a HiveMetaStore

    DSE auto registers C* Tables in a C* based Metastore

    MetaStore Thrift Server
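With a metastore available, a source can be registered persistently instead of as a temporary view, so it survives across sessions (a sketch reusing the connector format and the tab/ks names from the earlier slides):

```sql
-- Persistent registration: recorded in the metastore,
-- visible to later sessions without re-registering
CREATE TABLE words
USING org.apache.spark.sql.cassandra
OPTIONS (table "tab", keyspace "ks");
```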

  • Writing DataFrames using SQL

    INSERT INTO arrow SELECT * FROM words;

    CassandraSourceRelation words read

    CassandraSourceRelation arrow write

  • Explain to Analyze Query Plans

    EXPLAIN SELECT * FROM arrow WHERE