Analytics with Spark and Cassandra

download Analytics with Spark and Cassandra

If you can't read please download the document

  • date post

    15-Jan-2017
  • Category

    Technology

  • view

    911
  • download

    2

Embed Size (px)

Transcript of Analytics with Spark and Cassandra

  • @tlberglund

    andSparkCassandra

  • @tlberglund

    andSparkCassandra

  • Cassandra

  • You have too much data It changes a lot The system is always on

    When?

  • Distributed database Transactional/online Low latency, high velocity

    What :

  • CREATE TABLE credit_card_transactions( card_number VARCHAR, transaction_time TIMESTAMP, amount DECIMAL(10,2), merchant_id VARCHAR, country VARCHAR, PRIMARY KEY(card_number, transaction_time DESC));

    Data Model

  • CREATE TABLE credit_card_transactions( card_number VARCHAR, transaction_time TIMESTAMP, amount DECIMAL(10,2), merchant_id VARCHAR, country VARCHAR, PRIMARY KEY(card_number, transaction_time DESC));

    Partition Key

  • CREATE TABLE credit_card_transactions( card_number VARCHAR, transaction_time TIMESTAMP, amount DECIMAL(10,2), merchant_id VARCHAR, country VARCHAR, PRIMARY KEY(card_number, transaction_time DESC));

    Partition Key

  • CREATE TABLE credit_card_transactions( card_number VARCHAR, transaction_time TIMESTAMP, amount DECIMAL(10,2), merchant_id VARCHAR, country VARCHAR, PRIMARY KEY(card_number, transaction_time DESC));

    Clustering Column

  • CREATE TABLE credit_card_transactions( card_number VARCHAR, transaction_time TIMESTAMP, amount DECIMAL(10,2), merchant_id VARCHAR, country VARCHAR, PRIMARY KEY(card_number, transaction_time DESC));

    Clustering Column

  • CREATE TABLE credit_card_transactions( card_number VARCHAR, transaction_time TIMESTAMP, amount DECIMAL(10,2), merchant_id VARCHAR, country VARCHAR, PRIMARY KEY(card_number, transaction_time DESC));

    Primary Key

  • Data Modelcard_number xctn_time amount country merchant

    54448389

    45321191

    54446027

    45322189

    13:58

    13:59

    13:59

    14:02

    54446027 14:03

    54448389 14:03

    $12.42

    $50.38

    $3.02

    $12.42

    23.78

    $98.03

    US

    US

    US

    US

    ES

    US

    101

    101

    102

    198

    352

    158

  • card_number xctn_time amount country merchant

    54448389

    45321191

    54446027

    45322189

    13:58

    13:59

    13:59

    14:02

    54446027 14:03

    54448389 14:03

    $12.42

    $50.38

    $3.02

    $12.42

    23.78

    $98.03

    US

    US

    US

    US

    ES

    US

    101

    101

    102

    198

    352

    158

    Partition Key

  • card_number xctn_time amount country merchant

    54448389

    45321191

    54446027

    45322189

    13:58

    13:59

    13:59

    14:02

    54446027 14:03

    54448389 14:03

    $12.42

    $50.38

    $3.02

    $12.42

    23.78

    $98.03

    US

    US

    US

    US

    ES

    US

    101

    101

    102

    198

    352

    158

    Clustering Column

  • card_number xctn_time amount country merchant

    54448389

    45321191

    54446027

    45322189

    13:58

    13:59

    13:59

    14:02

    54446027 14:03

    54448389 14:03

    $12.42

    $50.38

    $3.02

    $12.42

    23.78

    $98.03

    US

    US

    US

    US

    ES

    US

    101

    101

    102

    198

    352

    158

    Primary Key

  • card_number xctn_time amount country merchant

    54448389

    45321191

    54446027

    45322189

    13:58

    13:59

    13:59

    14:02

    54446027 14:03

    54448389 14:03

    $12.42

    $50.38

    $3.02

    $12.42

    23.78

    $98.03

    US

    US

    US

    US

    ES

    US

    101

    101

    102

    198

    352

    158Partition

  • card_number xctn_time amount country merchant

    54448389

    45321191

    54446027

    45322189

    13:58

    13:59

    13:59

    14:02

    54446027 14:03

    54448389 14:03

    $12.42

    $50.38

    $3.02

    $12.42

    23.78

    $98.03

    US

    US

    US

    US

    ES

    US

    101

    101

    102

    198

    352

    158

    Partition

  • card_number xctn_time amount country merchant

    54448389

    45321191

    54446027

    45322189

    13:58

    13:59

    13:59

    14:02

    54446027 14:03

    54448389 14:03

    $12.42

    $50.38

    $3.02

    $12.42

    23.78

    $98.03

    US

    US

    US

    US

    ES

    US

    101

    101

    102

    198

    352

    158

    Partition

  • card_number xctn_time amount country merchant

    54448389

    45321191

    54446027

    45322189

    13:58

    13:59

    13:59

    14:02

    54446027 14:03

    54448389 14:03

    $12.42

    $50.38

    $3.02

    $12.42

    23.78

    $98.03

    US

    US

    US

    US

    ES

    US

    101

    101

    102

    198

    352

    158

    Partition

  • 2000

    4000

    6000

    8000

    A000

    C000

    E000

    0000Partitioning

    5444838914:03 $98.03 US 158

    f(x) 5D93

  • 2000

    4000

    6000

    8000

    A000

    C000

    E000

    0000Partitioning

    5444838914:03 $98.03 US 158

    5D93

  • 2000

    4000

    6000

    8000

    A000

    C000

    E000

    0000Partitioning

    5444838914:03 $98.03 US 158

    f(x) 5D93

    54448389

  • card_number xctn_time amount country merchant

    54448389

    45321191

    54446027

    45322189

    13:58

    13:59

    13:59

    14:02

    54446027 14:03

    54448389 14:03

    $12.42

    $50.38

    $3.02

    $12.42

    23.78

    $98.03

    US

    US

    US

    US

    ES

    US

    101

    101

    102

    198

    352

    158

    Partition

    Partitioning

  • Clumps of related data Always on one server Accessed by consistent hashing

    Partitions

  • Design for low-latency, high-throughput reads and writes

    Inflexible ad-hoc queries Very little aggregation Very little secondary indexing

    Consequences

  • Spark

  • You have too much data You dont know the questions You dont like MapReduce

    Youre tired of Hadoop lol

    When do I use it?

  • Distributed, functional In-memory Real-time Next-generation MapReduce

    What are they saying?

  • Paradigm :Store collections in Resilient Distributed Datasets, or RDDs

    Transform RDDs into other RDDs

  • Paradigm :Store collections in Resilient Distributed Datasets, or RDDs

    Transform RDDs into other RDDs

    Aggregate RDDs with actions

  • Paradigm :Store collections in Resilient Distributed Datasets, or RDDs

    Aggregate RDDs with actions

    Transform RDDs into other RDDs

    Keep track of a graph of transformations and

    actions

  • Paradigm :Store collections in Resilient Distributed Datasets, or RDDs

    Keep track of a graph of transformations and

    actionsTransform RDDs into

    other RDDsAggregate RDDs with

    actionsDistribute all of this

    storage and computation

  • Paradigm :Store collections in Resilient Distributed Datasets, or RDDs

    Distribute all of this storage and computation

    Transform RDDs into other RDDs

    Aggregate RDDs with actions

    Keep track of a graph of transformations and

    actions

  • 2000

    4000

    6000

    8000

    A000

    C000

    E000

    0000Architecture

    Spark Client Spark

    Context

    Driver

  • Architecture

    Spark Client Spark

    ContextJob

  • Architecture

    Spark Client Spark

    Context

    Job

    Cluster Manager

  • Architecture

    Spark Client Spark

    Context

    Cluster Manager

    JobTaskTaskTaskTask

  • Architecture

    Spark Client Spark

    Context

    Cluster Manager

    Job

    TaskTask

    Task

    Task

  • Worker NodeExecutor

    Task Task

    Cache

  • Worker NodeExecutor

    Task Task

    Cache

    Executor

    Task Task

    CacheRDD RDD

    RDD

    RDDRDD

    RDD

  • Whats an RDD?

  • Whats an RDD?

    This is

  • Bigger than a computer

    Whats an RDD?

  • Read from an input source?

    Whats an RDD?

  • The output of a pure

    function?

    Whats an RDD?

  • Immutable

    Whats an RDD?

  • Typed

    Whats an RDD?

  • Ordered

    Whats an RDD?

  • Functions comprise a

    graph

    Whats an RDD?

  • Lazily Evaluated

    Whats an RDD?

  • Collection of Things

    Whats an RDD?

  • Distributed, how?

  • Distributed, how?

    5444 5676 0686 8389: List(xn-1, xn-2, xn-3)

    4532 4569 7030 1191: List(xn-4, xn-5, xn-6)

    5444 5607 6517 6027: List(xn-7, xn-8, xn-9)

    4532 4577 0122 2189: List(xn-10, xn-11, xn-12)

    Hash these

  • val conf = new SparkConf(true) .set("spark.cassandra.connection.host", 10.0.3.74)

    val sc = new SparkContext(spark://127.0.0.1:7077, music, conf) val transactions = sc. // cassandra stuff .partitionBy(new HashPartitioner(25)).persist() } }

  • ConnectorCassandra

  • import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.SparkContext._ import com.datastax.spark.connector._

    object ScalaProgram { def main(args: Array[String]) {

    val conf = new SparkConf(tru