PySpark Cassandra - Amsterdam Spark Meetup


  • PySpark Cassandra: Analytics with Cassandra and PySpark


  • Frens Jan Rumph

    Database and processing architect at Target Holding

    Contact me at: frens.jan.rumph@target-holding.nl

  • Target Holding

    Machine learning company

    Time series: prediction, search, anomaly detection, ...

    Text: search, matching (e.g. jobs and resumes), ...

    Markets: media, human resources, infrastructure (energy, waterworks, ...), health

  • PySpark Cassandra

    Technology background: Cassandra, Spark and PySpark

    PySpark Cassandra: introduction, features and use cases, getting started, operators and examples

  • Technology background: Cassandra, Spark and PySpark

  • Cassandra

    Distributed database

    Originated at Facebook, roots in Amazon Dynamo

  • Cassandra Query Language

    Main 'user interface' of Cassandra, with a SQL feel (tables with rows)

    DML: INSERT INTO ..., SELECT FROM ..., UPDATE ..., DELETE FROM ...

    DDL: CREATE TABLE ..., CREATE INDEX ...

    Column types: numbers, strings, etc., collections (lists, sets and maps), counters

  • Distribution and replication

    Under the hood a distributed map of ordered maps (with some updates in C* 3)

    Consistent hashing: keys are usually 'placed on the ring' through hashing

    Replication along the ring

    Image by DataStax

  • Local data structures and 2i

    Memtables and SSTables, ordered within a partition on the clustering columns

    Various caches, various indices

    Materialized views: maintained manually before C* 3.0, similar to normal tables

    Secondary indices: scatter-gather model, 'normal' or 'search'

  • Spark

    Distributed data processing engine

    'Doesn't touch disk if it doesn't have to'

    Layers on top of data sources: HDFS, Cassandra, Elasticsearch, JDBC, ...

  • Resilient Distributed Dataset

    Partitioned (and distributed) collection of rows

    Part of a computational graph: an RDD has linkage to its 'source' RDD(s)

    DataFrame / DataSet: stronger typing and declarative querying layered on top

  • Transformations and actions

    Narrow transformations are 'data-local': map, filter, ...

    Wide transformations aren't: join, sort, reduce, group, ...

    Actions read results back to the driver or write results to disk, a database, ... (see the sketch below)

    Image by Apache
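
    A minimal sketch of the distinction on a Cassandra-backed RDD; the keyspace, table and column names ('sensor', 'val') are made up for illustration, and RowFormat is assumed to be importable from pyspark_cassandra as used on the tuning slide later on:

    from pyspark_cassandra import RowFormat

    # Narrow transformations (map, filter) are data-local; the wide
    # transformation (reduceByKey) shuffles data across the cluster; the
    # action (collect) ships the result back to the driver.
    rows = sc.cassandraTable('keyspace', 'events', row_format=RowFormat.DICT)
    pairs = rows.map(lambda r: (r['sensor'], r['val']))   # narrow
    clean = pairs.filter(lambda kv: kv[1] is not None)    # narrow
    totals = clean.reduceByKey(lambda a, b: a + b)        # wide (shuffle)
    print(totals.collect())                               # action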

  • Distribution / topology

    The driver: where your app 'lives'

    The cluster manager coordinates resources: standalone, Mesos or YARN

    The workers: where the work happens

    Image by Apache

  • PySpark

    Wrapper around the Java APIs

    JVM used for data shipping when working with Python RDDs; 'query language' when working with DataFrames (see the sketch below)

    CPython interpreters act as (extra) executors: essentially the multiprocessing model, but distributed; CPython executors are forked per job (not per application)
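
    A hedged sketch of the DataFrame route, which goes through the Datastax connector's Spark SQL data source rather than pyspark-cassandra's RDD API; the keyspace, table and column names are placeholders:

    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)

    # The JVM-side connector plans and runs the query; only the selected
    # result rows cross over to the CPython side.
    df = (sqlContext.read
          .format('org.apache.spark.sql.cassandra')
          .options(keyspace='keyspace', table='table')
          .load())
    df.select('col1').filter(df.col2 > 10).show()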

  • Pickle

    Object serialization, shipped with Python

    Pickle is used for messaging between the CPython interpreters and the JVM: cPickle / cloudpickle in CPython, Py4J in the JVM

  • PySpark Cassandra

  • PySpark Cassandra

    Developed at Target Holding: we use a lot of Python and Cassandra, with Spark as the option for processing

    Built on the Spark Cassandra Connector: Datastax provides the Spark Cassandra Connector, but the Python + Cassandra link was missing, hence PySpark Cassandra

  • Features and use cases (PySpark Cassandra)

  • Features

    Distributed C* table scanning into RDDs

    Writing RDDs and DStreams to C*

    Joining RDDs and DStreams with C* tables (a combined sketch follows below)
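
    A combined sketch of these three features under stated assumptions: the keyspace, table and key names are placeholders, and joinWithCassandraTable is the join helper as exposed by pyspark-cassandra (the exact name may differ between versions):

    import pyspark_cassandra

    # 1. Scan a C* table into an RDD
    users = sc.cassandraTable('shop', 'users')

    # 2. Write an RDD of dicts back to C*
    sc.parallelize([dict(user_id=1, name='x')]) \
      .saveToCassandra('shop', 'users')

    # 3. Join an RDD of keys against a C* table; the RDD elements are used
    #    as lookup keys into shop.orders
    orders = sc.parallelize([dict(user_id=1), dict(user_id=2)]) \
               .joinWithCassandraTable('shop', 'orders')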

  • Use cases

    Perform bulk 'queries' you normally can't (C* doesn't do group by or join), that would take a prohibitive amount of time otherwise, or just because it's easy once it's set up

    Data wrangling ('cooking' features, etc.)

    As a stream processing platform

  • Use cases at Target Holding (PySpark Cassandra)

  • Media metadata processing

    We disambiguated close to 90 million 'authorships' using names, contact details and publication keywords

    In order to build analytical applications for a large publisher of scientific journals

    Spark / PySpark Cassandra for data wrangling

  • Earthquake monitoring

    We are building a monitoring system for use with many low-cost vibration sensors

    Spark / PySpark Cassandra for: enriching the event stream, saving the event stream, bulk processing, anomaly detection

  • Processing time series data

    We collect time series data in various fields: energy (electricity and gas usage), music (tracking online music and video portals)

    Spark / PySpark Cassandra for: data wrangling, rolling up data, bulk forecasting, anomaly detection

  • Getting started (PySpark Cassandra)

  • Getting started

    'Homepage': github.com/TargetHolding/pyspark-cassandra

    Available on: spark-packages.org/package/TargetHolding/pyspark-cassandra

    Also read: github.com/datastax/spark-cassandra-connector

  • Compatibility

    Spark 1.5 and 1.6 (older versions were supported in the past)

    Cassandra 2.1.5, 2.2 and 3

    Python 2.7 and 3

  • High-level overview

    Read from and write to C* using Spark as a co-located distributed processing platform

    Image by DataStax

  • Software setup

    Application

    PySpark Cassandra, Python part: on PySpark, in CPython

    PySpark Cassandra, Scala part: on the Datastax Spark Cassandra Connector and Spark, in the JVM

    Cassandra: in its own JVM

  • Submit script

    spark-submit \
        --packages TargetHolding/pyspark-cassandra:0.3.5 \
        --conf spark.cassandra.connection.host=cas1,cas2,cas3 \
        --master spark://spark-master:7077 \
        yourscript.py

    import ...

    conf = SparkConf()
    sc = CassandraSparkContext(conf=conf)

    # your script

  • PySpark shell

    IPYTHON_OPTS=notebook PYSPARK_DRIVER_PYTHON=ipython pyspark \
        --packages TargetHolding/pyspark-cassandra:0.3.5 \
        --conf ... ...

    import pyspark_cassandra

  • Operators and examples (PySpark Cassandra)

  • Operators

    Scan, project (select), filter (where), limit, etc., count

    'Spanning', join, save

  • Scan

    cassandraTable() scans C*: determine the basic token ranges, group them into partitions taking size and location into account, then execute (concurrent) CQL queries against C*

    rows = sc.cassandraTable('keyspace', 'table')

  • Scan

    Basically executes this query many times:

    SELECT columns
    FROM keyspace.table
    WHERE token(pk) > ? AND token(pk) < ? [filter]
    ORDER BY ...
    LIMIT ...
    ALLOW FILTERING

  • Scan

    Quite tunable if necessary:

    sc.cassandraTable('keyspace', 'table',
        row_format=RowFormat.DICT,  # ROW, DICT or TUPLE
        split_count=1000,           # number of partitions (splits)
        split_size=100000,          # size of a partition
        fetch_size=1000,            # query page size
        consistency_level='ALL',
        metrics_enabled=True)

  • Project / select

    To make things go a little faster, select only the columns you need. This saves in communication: C* -> Spark JVM -> CPython

    sc.cassandraTable(...).select('col1', 'col2', ...)

  • Types (CQL to Python)

    CQL         Python
    ascii       unicode string
    bigint      long
    blob        bytearray
    boolean     boolean
    counter     int, long
    decimal     decimal
    double      float
    float       float
    inet        str
    int         int
    set         set
    list        list
    text        unicode string
    timestamp   datetime.datetime
    timeuuid    uuid.UUID
    varchar     unicode string
    varint      long
    uuid        uuid.UUID
    UDT         pyspark_cassandra.UDT
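
    As an illustration of these mappings, a hedged sketch of what a scanned row looks like on the Python side; the table 'events' and its columns are hypothetical, and RowFormat is assumed to be importable from pyspark_cassandra as on the tuning slide:

    import uuid
    import datetime

    from pyspark_cassandra import RowFormat

    # With row_format=RowFormat.DICT each row arrives as a plain dict whose
    # values already carry the Python types from the table above.
    row = sc.cassandraTable('keyspace', 'events', row_format=RowFormat.DICT).first()
    assert isinstance(row['id'], uuid.UUID)             # uuid      -> uuid.UUID
    assert isinstance(row['stamp'], datetime.datetime)  # timestamp -> datetime.datetime
    assert isinstance(row['tags'], set)                 # set<text> -> set
    assert isinstance(row['val'], float)                # double    -> float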

  • Key by primary key

    Cassandra RDDs can be keyed by primary key, yielding an RDD of key-value pairs

    Keying by partition key is not yet supported

    sc.cassandraTable(...).by_primary_key()

  • Filter / where

    Clauses on primary keys, clustering columns or secondary indices can be pushed down, if a where with ALLOW FILTERING works in CQL

    Otherwise resort to RDD.filter or DF.filter

    sc.cassandraTable(...).where(
        'col2 > ?', datetime.now() - timedelta(days=14)
    )

  • Combine with 2i

    With the Cassandra Lucene index:

    sc.cassandraTable(...).where('lucene = ?', '''{
        filter: {
            field: "loc", type: "geo_bbox",
            min_latitude: 53.217, min_longitude: 6.521,
            max_latitude: 53.219, max_longitude: 6.523
        }
    }''')

  • Limit, take and first

    limit() limits the number of rows per query; there are at least as many queries as there are token ranges

    take(n) takes at most n rows from the RDD, applying limit to make it just a tad bit faster

    sc.cassandraTable(...).limit(1)...

    sc.cassandraTable(...).take(3)

    sc.cassandraTable(...).first()

  • Push down count

    cassandraCount() pushes count(*) queries down to C*: counting happens per partition and is then reduced

    For when all you want to do is count records in C*; doesn't force caching

    sc.cassandraTable(...).cassandraCount()

  • Spanning

    Wide rows in C* are retrieved in order, are consecutive and don't cross partition boundaries

    spanBy() is like groupBy() for wide rows (see the sketch below)

    sc.cassandraTable(...).spanBy('doc_id')
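
    A hedged sketch of consuming the spans, assuming a hypothetical table partitioned by doc_id and that each span arrives as a (key, iterable of rows) pair:

    # Count the rows per doc_id span; spanBy yields one (doc_id, rows) pair
    # per wide row, with the rows in clustering order.
    spans = sc.cassandraTable('keyspace', 'table').spanBy('doc_id')
    counts = spans.map(lambda span: (span[0], sum(1 for _ in span[1])))
    print(counts.take(5))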

  • Save

    Save any PySpark RDD to C*, as long as it consists of dicts, tuples or Rows

    rdd.saveToCassandra('keyspace', 'table', ...)

  • Save

    rows = [dict(
        key = k,
        stamp = datetime.now(),
        val = random() * 10,
        tags = ['a', 'b', 'c'],
        options = dict(foo='bar', baz='qux'),
    ) for k in ('x', 'y', 'z')]

    rdd = sc.parallelize(rows)

    rdd.saveToCassandra('keyspace', 'table')

  • Save

    rdd.saveToCassandra(...,
        columns=('col1', 'col2'),    # the columns to save / how to interpret the elements in a tuple
        row_format=RowFormat.DICT,   # RDD format hint
        keyed=True,                  # W