Spark/Zeppelin/Cassandra at Synchrotron


  • Spark/Cassandra/Zeppelin for particle accelerator metrics storage and aggregation

    DuyHai DOAN, Apache Cassandra Evangelist

  • Who Am I?

    Duy Hai DOAN, Apache Cassandra Evangelist
    talks, meetups, confs
    open-source projects (Achilles, Apache Zeppelin, ...)
    OSS Cassandra point of contact

    duy_hai.doan@datastax.com
    @doanduyhai

  • The HDB++ project

    What is Synchrotron?
    HDB++ project presentation
    Why Spark, Cassandra and Zeppelin?

  • What is Synchrotron?

    A particle accelerator (electrons)

    Electron beams used for crystallography analysis of:
    materials
    molecular biology

  • What is Synchrotron? (image slides)

  • The HDB++ project

    A sub-project of TANGO, a software toolkit to connect, control/monitor and integrate sensor devices

    HDB++ = the new TANGO event-driven archiving system: historically used MySQL, now stores data into Cassandra

  • The HDB++ project (image slides, as of Sept 2015)

  • The HDB++ GUI (screenshot slides)

  • The HDB++ hardware specs (image slide)

  • Q & A

  • The HDB++ Cassandra data model

  • Metrics table

    CREATE TABLE hdb.att_scalar_devshort_ro (
        att_conf_id timeuuid,
        period text,
        data_time timestamp,
        data_time_us int,
        error_desc text,
        insert_time timestamp,
        insert_time_us int,
        quality int,
        recv_time timestamp,
        recv_time_us int,
        value_r int,
        PRIMARY KEY ((att_conf_id, period), data_time, data_time_us)
    );
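    The composite partition key (att_conf_id, period) means a read should pin both columns. A minimal sketch of reading one such partition from Spark with the spark-cassandra-connector RDD API; the connection host, attribute id and period format are illustrative assumptions:

    import com.datastax.spark.connector._
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("hdb-read-sketch")
      .set("spark.cassandra.connection.host", "127.0.0.1") // assumed host
    val sc = new SparkContext(conf)

    // Placeholder timeuuid; use a real att_conf_id in practice
    val attConfId = com.datastax.driver.core.utils.UUIDs.timeBased()

    // The where() clause is pushed down to Cassandra, so only one partition is scanned
    val rows = sc.cassandraTable("hdb", "att_scalar_devshort_ro")
      .select("data_time", "value_r")
      .where("att_conf_id = ? AND period = ?", attConfId, "2016-06-28") // period format assumed yyyy-MM-dd

    rows.take(10).foreach(println)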

  • Statistics table

    CREATE TABLE hdb.stat_scalar_devshort_ro (
        att_conf_id text,
        type_period text,  // HOUR, DAY, MONTH, YEAR
        period text,       // yyyy-MM-dd:HH, yyyy-MM-dd, yyyy-MM, yyyy
        count_distinct_error bigint,
        count_error bigint,
        count_point bigint,
        value_r_max int,
        value_r_min int,
        value_r_mean double,
        value_r_sd double,
        PRIMARY KEY ((att_conf_id, type_period), period)
    );

  • Statistics table

    INSERT INTO hdbtest.stat_scalar_devshort_ro(att_conf_id, type_period, period, value_r_mean)
    VALUES(xxxx, 'DAY', '2016-06-28', 123.456);

    INSERT INTO hdbtest.stat_scalar_devshort_ro(att_conf_id, type_period, period, value_r_mean)
    VALUES(xxxx, 'HOUR', '2016-06-28:01', 123.456);

    INSERT INTO hdbtest.stat_scalar_devshort_ro(att_conf_id, type_period, period, value_r_mean)
    VALUES(xxxx, 'MONTH', '2016-06', 123.456);

    // Request by period of time
    SELECT * FROM hdbtest.stat_scalar_devshort_ro
    WHERE att_conf_id = xxx
      AND type_period = 'DAY'
      AND period > '2016-06-20'
      AND period < '2016-06-28';

  • Q & A

  • The Spark jobs

  • Source code

    // Map the Cassandra table to a Spark SQL DataFrame
    val devShortRoTable = sqlContext
      .read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("table" -> "att_scalar_devshort_ro", "keyspace" -> "hdbtest"))
      .load()

    // Register it so it can be queried with SQL
    devShortRoTable.registerTempTable("att_scalar_devshort_ro")

  • Source code

    // Aggregate one day of raw data points into per-attribute statistics
    val devShortRo = sqlContext.sql(s"""
      SELECT "DAY" AS type_period,
             att_conf_id,
             period,
             count(att_conf_id) AS count_point,
             count(error_desc) AS count_error,
             count(DISTINCT error_desc) AS count_distinct_error,
             min(value_r) AS value_r_min,
             max(value_r) AS value_r_max,
             avg(value_r) AS value_r_mean,
             stddev(value_r) AS value_r_sd
      FROM att_scalar_devshort_ro
      WHERE period = "${day}"
      GROUP BY att_conf_id, period""")

  • Source code

    // Write the daily statistics back to Cassandra.
    // Append issues plain inserts; since Cassandra writes are upserts,
    // re-running the job for a day simply replaces that day's stats rows.
    devShortRo.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("table" -> "stat_scalar_devshort_ro", "keyspace" -> "hdbtest"))
      .mode(SaveMode.Append)
      .save()

  • Demo: Zeppelin

  • Zeppelin visualisation (export as iframe)

  • Q & A

  • Spark/Cassandra/Zeppelin tricks and traps

    Zeppelin/Spark/Cassandra
    Spark/Cassandra


  • Zeppelin/Spark/Cassandra

    Zeppelin build modes: standard, or with the Spark-Cassandra connector built in (Maven profile -Pcassandra-spark-1.x)

    Spark run modes: local, or against a stand-alone Spark cluster co-located with Cassandra

  • Zeppelin/Spark/Cassandra

    Zeppelin build mode standard + Spark run mode local: you must add the Spark-Cassandra connector as a dependency of the Spark interpreter (see the sketch below)
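    A minimal sketch of that dependency declaration, assuming Spark 1.6 on Scala 2.10 (adjust the coordinate to your Spark/Scala versions): in the Zeppelin Spark interpreter settings, declare the connector artifact

    com.datastax.spark:spark-cassandra-connector_2.10:1.6.0

    Zeppelin then resolves it from the configured repositories at interpreter start-up.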

  • Zeppelin/Spark/Cassandra

    Zeppelin build mode standard + Spark run mode local: on Spark interpreter init, all declared dependencies are fetched from the declared repositories (default = Maven Central + local Maven repo). Beware of corporate firewalls!

    Where are the downloaded dependencies (jars) stored?

  • Zeppelin/Spark/Cassandra

    Zeppelin build mode standard + Spark run mode cluster: Zeppelin uses the spark-submit command; the Spark interpreter is run by bin/interpreter.sh

  • Zeppelin/Spark/Cassandra

    Zeppelin build mode standard + Spark run mode cluster: run at least ONCE in local mode, so that Zeppelin can download the dependencies into its local repo (zeppelin.interpreter.localRepo)

  • Zeppelin/Spark/Cassandra

    Zeppelin build mode with connector + Spark run mode local or cluster: runs smoothly, because all Spark-Cassandra connector dependencies are merged into the interpreter/spark/dep/zeppelin-spark-dependencies-x.y.z.jar fat jar during the build

  • Zeppelin/Spark/Cassandra

    OSS Spark: you must add the Spark-Cassandra connector dependencies in conf/spark-env.sh, otherwise jobs fail with:

    ...
    Caused by: java.lang.NoClassDefFoundError: com/datastax/driver/core/ConsistencyLevel

  • Zeppelin/Spark/Cassandra

    OSS Spark: you must provide all transitive dependencies of the Spark-Cassandra connector, either in conf/spark-env.sh or with the spark-submit --packages groupId:artifactId:version option (example below)
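    For example, a minimal sketch of such a submission; the connector coordinate, job class and jar name are illustrative assumptions. --packages resolves the connector and its transitive dependencies from Maven Central:

    spark-submit \
      --packages com.datastax.spark:spark-cassandra-connector_2.10:1.6.0 \
      --class DailyStatsJob \
      daily-stats.jar 2016-06-28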

  • Zeppelin/Spark/Cassandra

    DSE Spark: runs smoothly, because the Spark-Cassandra connector dependencies are already embedded in the package ($DSE_HOME/resources/spark/lib)

  • Spark/Cassandra

    Spark deploy modes (spark-submit --deploy-mode): client or cluster

    Zeppelin deploys using client mode by default

  • Spark/Cassandra

    Spark client deploy mode (the default):
    needs to ship all driver program dependencies to the workers (network intensive)
    suitable for REPLs (Spark shell, Zeppelin)
    suitable for one-shot jobs/testing

  • Spark/Cassandra

    Spark cluster deploy mode:
    the driver program runs on a worker node
    all driver program dependencies must be reachable by any worker: usually they are stored in HDFS, but they can live on the local FS of every worker
    suitable for recurrent jobs
    needs a consistent build & deploy process for your jobs (sketch below)
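    A minimal sketch of a cluster-mode submission, assuming a stand-alone Spark master and a job jar published to HDFS; the master URL, class name and paths are hypothetical:

    spark-submit \
      --master spark://spark-master:7077 \
      --deploy-mode cluster \
      --class DailyStatsJob \
      hdfs:///jobs/daily-stats.jar 2016-06-28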

  • Spark/Cassandra

    The job fails when using spark-submit but succeeds in Zeppelin, with the error: value stddev not found. The failing statement is the daily aggregation query shown earlier (it uses stddev(value_r)).

  • Spark/Cassandra

    Indeed, Zeppelin uses a Hive context by default, and in Spark 1.x the stddev function comes from Hive, so a job built around a plain SQLContext does not have it.

    Fix: create a HiveContext in the stand-alone job too (sketch below)
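    A minimal sketch of that fix, assuming Spark 1.x with the spark-hive module on the classpath; the app name and connection host are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val conf = new SparkConf()
      .setAppName("hdb-daily-stats")
      .set("spark.cassandra.connection.host", "127.0.0.1") // assumed host
    val sc = new SparkContext(conf)

    // Use a HiveContext, as Zeppelin does by default, instead of a plain
    // SQLContext, so that stddev(...) resolves in the aggregation query
    val sqlContext = new HiveContext(sc)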

  • Q & A

  • Cassandra Summit 2016, September 7-9, San Jose, CA

    Get 15% off with code: DoanDuy15

    Cassandrasummit.org

  • Thank You

    @doanduyhai
    duy_hai.doan@datastax.com
    https://academy.datastax.com/