Apache Cassandra and Spark: You Got the Lighter, Let's Start the Fire


  • © 2013 DataStax Confidential. Do not distribute without consent.

    @PatrickMcFadin

    Patrick McFadin, Chief Evangelist for Apache Cassandra

    Apache Cassandra and Spark: You got the lighter, let's spark the fire

  • 6 years. How's it going?

  • Cassandra 3.0 & 3.1

    Spring and Fall

  • Cassandra is:

    Shared nothing
    Masterless, peer-to-peer
    Great scaling story
    Resilient to failure

  • Cassandra for Applications

APACHE CASSANDRA

  • A Data Ocean, Lake, or Pond

    An In-Memory Database

    A Key-Value Store

    A magical database unicorn that farts rainbows

  • Apache Spark

  • Apache Spark

    10x faster on disk, 100x faster in memory than Hadoop MR
    Works out of the box on EMR
    Fault-tolerant distributed datasets
    Batch, iterative, and streaming analysis
    In-memory and on-disk storage
    Integrates with most file and storage options

    Up to 100x faster (2-10x on disk)

    2-5x less code

  • Spark Components

    Spark Core
    Spark SQL (structured data)
    Spark Streaming (real-time)
    MLlib (machine learning)
    GraphX (graph processing)

  • org.apache.spark.rdd.RDD: Resilient Distributed Dataset (RDD)

    Created through transformations on data (map, filter, ...) or on other RDDs
    Immutable, Partitioned, Reusable

  • RDD Operations

    Transformations: similar to the Scala collections API; produce new RDDs
    filter, flatMap, map, distinct, groupBy, union, zip, reduceByKey, subtract

    Actions: require materialization of the records to generate a value
    collect: Array[T], count, fold, reduce, ...
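    To make the distinction concrete, here is a minimal, self-contained sketch (plain Spark, no Cassandra yet; the local master setting is just for trying it out). Transformations only describe new RDDs; nothing is computed until an action forces materialization:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(
      new SparkConf(true).setAppName("rdd-ops-demo").setMaster("local[2]"))

    val lines = sc.parallelize(Seq("bar bar foo", "foo baz"))

    // Transformations: lazily define new RDDs, nothing runs yet
    val counts = lines
      .flatMap(_.split(" "))      // split lines into words
      .map(word => (word, 1))     // pair each word with a count of 1
      .reduceByKey(_ + _)         // sum counts per word

    // Actions: force materialization and return values to the driver
    counts.count()                // Long = 3
    counts.collect()              // Array((bar,2), (foo,2), (baz,1)), order may vary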

  • RDD Operations

    [Diagram: Transformations produce new RDDs; Actions materialize results, feeding analytic and search workloads]

  • Cassandra and Spark

  • Cassandra & Spark: A Great Combo

    Datastax: spark-cassandra-connector: https://github.com/datastax/spark-cassandra-connector

    Both are Easy to Use

    Spark Can Help You Bridge Your Hadoop and Cassandra Systems

    Use Spark Libraries, Caching on-top of Cassandra-stored Data

    Combine Spark Streaming with Cassandra Storage

  • Spark On Cassandra

    Server-side filters (WHERE clauses), as sketched below
    Cross-table operations (JOIN, UNION, etc.)
    Data locality-aware (speed)
    Data transformation, aggregation, etc.
    Natural time series integration
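    A hedged sketch of the first two points, assuming the connector's where() pushdown and a hypothetical test.users table with an indexed role column; the CQL predicate runs server-side in Cassandra, and the resulting RDDs combine like any other:

    // Server-side filter: Cassandra evaluates the predicate, so only
    // matching rows reach Spark (table and columns are hypothetical)
    val admins = sc.cassandraTable("test", "users").where("role = ?", "admin")

    // Cross-table operation in Spark: union two Cassandra-backed RDDs
    val everyone = admins.union(sc.cassandraTable("test", "former_users"))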

  • Apache Spark and Cassandra Open Source Stack

    [Diagram: Spark layered on top of Cassandra]

  • Spark Cassandra Connector

  • Spark Cassandra Connector

    * Cassandra tables exposed as Spark RDDs
    * Read from and write to Cassandra
    * Mapping of C* tables and rows to Scala objects (sketched below)
    * All Cassandra types supported and converted to Scala types
    * Server-side data selection
    * Virtual Nodes support
    * Use with Scala or Java
    * Compatible with Spark 1.1.0, Cassandra 2.1 & 2.0
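    A minimal sketch of the row-to-object mapping, assuming the connector's typed cassandraTable[T] API and the test.words table defined a few slides below; column names are matched to case class field names:

    case class WordCount(word: String, count: Int)

    // Each Cassandra row becomes a WordCount instance
    val words = sc.cassandraTable[WordCount]("test", "words")
    words.first // WordCount(bar,30)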

  • Type Mapping

    CQL Type       Scala Type
    ascii          String
    bigint         Long
    boolean        Boolean
    counter        Long
    decimal        BigDecimal, java.math.BigDecimal
    double         Double
    float          Float
    inet           java.net.InetAddress
    int            Int
    list           Vector, List, Iterable, Seq, IndexedSeq, java.util.List
    map            Map, TreeMap, java.util.HashMap
    set            Set, TreeSet, java.util.HashSet
    text, varchar  String
    timestamp      Long, java.util.Date, java.sql.Date, org.joda.time.DateTime
    timeuuid       java.util.UUID
    uuid           java.util.UUID
    varint         BigInt, java.math.BigInteger

    * nullable values are represented as Option
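    A short sketch of these conversions in practice, using the CassandraRow getters; the generic get[Option[T]] accessor for nullable cells is an assumption based on recent connector versions:

    val row = sc.cassandraTable("test", "words").first

    row.getString("word")         // String = bar
    row.getInt("count")           // Int = 30
    row.get[Option[Int]]("count") // Option[Int] = Some(30); None for a null cell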

  • Connecting to Cassandra

    // Import Cassandra-specific functions on SparkContext and RDD objects
    import org.apache.spark.{SparkConf, SparkContext}
    import com.datastax.driver.spark._

    // Spark connection options
    val conf = new SparkConf(true)
      .setMaster("spark://192.168.123.10:7077")
      .setAppName("cassandra-demo")
      .set("cassandra.connection.host", "192.168.123.10") // initial contact
      .set("cassandra.username", "cassandra")
      .set("cassandra.password", "cassandra")

    val sc = new SparkContext(conf)

  • Accessing Data

    CREATE TABLE test.words (word text PRIMARY KEY, count int);

    INSERT INTO test.words (word, count) VALUES ('bar', 30);
    INSERT INTO test.words (word, count) VALUES ('foo', 20);

    // Use table as RDD
    val rdd = sc.cassandraTable("test", "words")
    // rdd: CassandraRDD[CassandraRow] = CassandraRDD[0]

    rdd.toArray.foreach(println)
    // CassandraRow[word: bar, count: 30]
    // CassandraRow[word: foo, count: 20]

    rdd.columnNames // Stream(word, count)
    rdd.size // 2

    val firstRow = rdd.first
    // firstRow: CassandraRow = CassandraRow[word: bar, count: 30]
    firstRow.getInt("count") // Int = 30

    * Accessing the table above as an RDD

  • Saving Data

    val newRdd = sc.parallelize(Seq(("cat", 40), ("fox", 50)))
    // newRdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[2]

    newRdd.saveToCassandra("test", "words", Seq("word", "count"))

    SELECT * FROM test.words;

     word | count
    ------+-------
      bar |    30
      foo |    20
      cat |    40
      fox |    50

    (4 rows)

    * RDD above saved to Cassandra
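    The same saveToCassandra call also accepts RDDs of case class objects (reusing the WordCount class from the mapping sketch above); a hedged sketch, assuming field names match column names so the column list can be inferred:

    case class WordCount(word: String, count: Int)

    // Columns are inferred from the case class field names
    sc.parallelize(Seq(WordCount("owl", 60), WordCount("elk", 70)))
      .saveToCassandra("test", "words")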

  • Spark Cassandra Connector

    https://github.com/datastax/spark-cassandra-connector

    [Diagram: Cassandra keyspace/table exposed to Spark as RDD[CassandraRow] and RDD[Tuples]]

    Bundled and Supported with DSE 4.5!

  • Spark Cassandra Connector uses the DataStax Java Driver to Read from and Write to C*

    Each Spark Executor maintains a connection to the C* cluster through the DataStax Java Driver.

    RDDs are read into different splits based on sets of tokens (e.g. tokens 1-1000, tokens 1001-2000) that together cover the full token range.

    [Diagram: Spark Executors reading token-range splits from the C* cluster]
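    One small way to see those splits from the Spark side; partitions is standard RDD API, and for a Cassandra-backed RDD the partition count reflects how the connector grouped token ranges (exact numbers depend on connector version and cluster layout):

    val rdd = sc.cassandraTable("test", "words")

    // One Spark partition per token-range split created by the connector
    rdd.partitions.length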

  • Co-locate Spark and C* for Best Performance

    [Diagram: Spark Master and Spark Workers running on the same nodes as the C* cluster]

    Running Spark Workers on the same nodes as your C* cluster will save network hops when reading and writing.

  • Analytics Workload Isolation

    A mixed-load Cassandra cluster split into two data centers: a Cassandra-only DC serving the online app, and a Cassandra + Spark DC serving the analytical app.
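    One way to express that split, sketched as a CQL statement executed through the connector's session helper; the keyspace name, DC names, and replica counts are assumptions for illustration, and the import path follows current connector packaging rather than the older one shown earlier:

    import com.datastax.spark.connector.cql.CassandraConnector

    CassandraConnector(sc.getConf).withSessionDo { session =>
      // NetworkTopologyStrategy places replicas per data center,
      // so both the online DC and the analytics DC hold the data
      session.execute(
        """CREATE KEYSPACE IF NOT EXISTS test
          |WITH replication = {'class': 'NetworkTopologyStrategy',
          |                    'online_dc': 3, 'analytics_dc': 3}""".stripMargin)
    }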

  • Data Locality

  • Example 1: Weather Station

    Weather station collects data
    Cassandra stores in sequence
    Application reads in sequence

  • Data Model

    Weather Station Id and Time are unique
    Store as many as needed

    CREATE TABLE temperature (
      weather_station text,
      year int,
      month int,
      day int,
      hour int,
      temperature double,
      PRIMARY KEY (weather_station, year, month, day, hour)
    );

    INSERT INTO temperature (weather_station, year, month, day, hour, temperature)
    VALUES ('10010:99999', 2005, 12, 1, 7, -5.6);

    INSERT INTO temperature (weather_station, year, month, day, hour, temperature)
    VALUES ('10010:99999', 2005, 12, 1, 8, -5.1);

    INSERT INTO temperature (weather_station, year, month, day, hour, temperature)
    VALUES ('10010:99999', 2005, 12, 1, 9, -4.9);

    INSERT INTO temperature (weather_station, year, month, day, hour, temperature)
    VALUES ('10010:99999', 2005, 12, 1, 10, -5.3);
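    As a bridge back to Spark, a hedged sketch of reading this table through the connector; the keyspace name "test" is an assumption, and where() pushes the station/day predicate down to Cassandra:

    // Average temperature for station 10010:99999 on 2005-12-01;
    // the predicate is evaluated server-side in Cassandra
    val day = sc.cassandraTable("test", "temperature")
      .where("weather_station = ? AND year = ? AND month = ? AND day = ?",
             "10010:99999", 2005, 12, 1)

    val temps = day.map(_.getDouble("temperature"))
    val avg = temps.sum / temps.count // (-5.6 - 5.1 - 4.9 - 5.3) / 4 = -5.225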

  • Primary key relationship

    PRIMARY KEY (weather_station, year, month, day, hour)

    Partition Key: weather_station
    Clustering Columns: year, month, day, hour

    Partition 10010:99999 stores its rows sorted by the clustering columns:

    2005:12:1:7  2005:12:1:8  2005:12:1:9  2005:12:1:10
    -5.6         -5.1         -4.9         -5.3

  • Partition keys

    10010:99999 -> Murmur3 Hash -> Token = 7224631062609997448

    722266:13850 -> Murmur3 Hash -> Token = -6804302034103043898

    INSERT INTO temperature (weather_station, year, month, day, hour, temperature)
    VALUES ('10010:99999', 2005, 12, 1, 7, -5.6);

    INSERT INTO temperature (weather_station, year, month, day, hour, temperature)
    VALUES ('722266:13850', 2005, 12, 1, 7, -5.6);

    Consistent hash: a 64-bit number between -2^63 and 2^63 - 1

  • Partition keys

    10010:99999 -> Murmur3 Hash -> Token = 15

    722266:13850 -> Murmur3 Hash -> Token = 77

    INSERT INTO temperature (weather_station, year, month, day, hour, temperature)
    VALUES ('10010:99999', 2005, 12, 1, 7, -5.6);

    INSERT INTO temperature (weather_station, year, month, day, hour, temperature)
    VALUES ('722266:13850', 2005, 12, 1, 7, -5.6);

    For this example, let's make it a reasonable number
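    To make the next slides concrete, a toy Scala sketch of how those simplified tokens map onto the four-node ring used below, with primary ranges 00-25, 26-50, 51-75, 76-100; the node addresses follow the diagrams, everything else is illustrative:

    // Primary ranges from the DC1 diagram
    val ring = Seq(
      ("10.0.0.1", 0 to 25),
      ("10.0.0.2", 26 to 50),
      ("10.0.0.3", 51 to 75),
      ("10.0.0.4", 76 to 100))

    // With RF=3, a token lands on its primary node plus the next two on the ring
    def replicas(token: Int): Seq[String] = {
      val i = ring.indexWhere(_._2.contains(token))
      (0 until 3).map(k => ring((i + k) % ring.size)._1)
    }

    replicas(15) // Seq(10.0.0.1, 10.0.0.2, 10.0.0.3)
    replicas(77) // Seq(10.0.0.4, 10.0.0.1, 10.0.0.2)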

  • Writes & WAN replication

    DC1: RF=3

    Node      Primary  Replica  Replica
    10.0.0.1  00-25    76-100   51-75
    10.0.0.2  26-50    00-25    76-100
    10.0.0.3  51-75    26-50    00-25
    10.0.0.4  76-100   51-75    26-50

    DC2: RF=3

    Node       Primary  Replica  Replica
    10.10.0.1  00-25    76-100   51-75
    10.10.0.2  26-50    00-25    76-100
    10.10.0.3  51-75    26-50    00-25
    10.10.0.4  76-100   51-75    26-50

    A client inserts data with Partition Key = 15 into DC1; the write goes to the local replicas via asynchronous local replication, and asynchronous WAN replication carries it to DC2.

  • Locality

    Same ring layout as above: DC1 and DC2, each with RF=3.

    A client in each data center issues Get Data with Partition Key = 15 and is served by the local replicas owning that token (10.0.0.1/2/3 in DC1, 10.10.0.1/2/3 in DC2), so reads never have to cross the WAN.

  • Data Locality

    weatherstation_id=10