DataStax - Reporting and Analytics on Apache Cassandra

Reporting and Analytics on Apache Cassandra, Big Data Paris 2016. Victor Coustenoble, Solutions Engineer, DataStax, [email protected], @vizanalytics


Reporting and Analytics on Apache Cassandra

Big Data Paris 2016

Victor Coustenoble, Solutions Engineer, [email protected], @vizanalytics


Agenda

• DataStax & Apache Cassandra

• Reporting and Analytics

• DataStax Enterprise Analytics

• Architectures

©2014 DataStax Confidential. Do not distribute without consent. 2


DataStax & Apache Cassandra


Apache Cassandra™

• Massively scalable, Open Source, NoSQL, distributed database built for modern, mission-critical online applications

• Written in Java; a hybrid of Amazon Dynamo and Google BigTable

• Masterless with no single point of failure

• Distributed and data center aware

• 100% uptime

• Predictable scaling

• High Performance

• Multi Data Center

• Time Series

• Tunable Consistency

• Simple to Operate

• CQL language

• OpsCenter / DevCenter


BigTable: http://research.google.com/archive/bigtable-osdi06.pdf

Dynamo: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf


Cassandra Data Access

CQL via cqlsh (command line), DevCenter (development environment), or drivers

• Drivers on Cassandra native protocol

• CQL COPY command

• Import/export tools for bulk loading

• Connectors in ETL solutions (Talend, Informatica)

• Via analytics layers Spark and Hadoop

• Via ODBC/JDBC drivers


CQL – Cassandra Query Language


• Data types: BLOB, UUID, TIMEUUID, User Defined Types …

• User Defined Functions, User Defined Aggregates

• Materialized Views

• Collections: Map, List, Set

• TTL (Time-To-Live) at column level

• Counters

• Lightweight Transactions (LWT): resolve race conditions with IF NOT EXISTS

• Batch statements

• Secondary indexes

• Very similar to RDBMS SQL syntax

• Core DML and DDL commands supported: INSERT, UPDATE, DELETE, SELECT, CREATE, GRANT …

INSERT INTO sporty_league (team_name, player_name, jersey) VALUES ('PSG', 'Zlatan', 10);

SELECT player_name AS nom_joueur FROM sporty_league WHERE team_name = 'PSG';

DevCenter


Reporting & Analytics


Real-Time / Operational Analytics Use Cases

Recommendation Engine

Internet of Things

Fraud Detection

Risk Analysis

Buyer Behaviour Analytics

Telematics, Logistics

Business Intelligence

Infrastructure Monitoring


How to do analytics on Cassandra data?

Remember …

Cassandra = NO JOIN, NO GROUP BY, filter on primary key only

2 solutions:

• CQL with predictable queries

• Joins and aggregations on the fly:

Server level => needs a distributed processing framework: Hadoop or Spark

Client level => possible but risky!
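
To make the "client level" option concrete, here is a minimal Python sketch. It assumes the rows have already been fetched from Cassandra into plain dictionaries; the sample data and the helper `client_side_join_and_count` are hypothetical and not part of any DataStax driver. The point is that the join and the GROUP BY both run in client memory, which is why this approach becomes risky as data volume grows.

```python
from collections import defaultdict

# Hypothetical rows, as if returned by two separate CQL SELECT statements.
players = [
    {"team_name": "PSG", "player_name": "Zlatan", "jersey": 10},
    {"team_name": "PSG", "player_name": "Thiago", "jersey": 2},
    {"team_name": "OM", "player_name": "Steve", "jersey": 30},
]
teams = [
    {"team_name": "PSG", "city": "Paris"},
    {"team_name": "OM", "city": "Marseille"},
]


def client_side_join_and_count(players, teams):
    """Join players to teams on team_name, then count players per city."""
    # In-memory lookup: every team row has to fit in client RAM.
    city_by_team = {t["team_name"]: t["city"] for t in teams}
    counts = defaultdict(int)
    for p in players:  # every player row was transferred to the client
        city = city_by_team.get(p["team_name"])
        if city is not None:
            counts[city] += 1
    return dict(counts)


result = client_side_join_and_count(players, teams)
print(result)
```

With a server-level framework such as Spark, the same join and count would be distributed across the cluster instead of being pulled through a single client.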


Reporting and Dashboard


• Static and operational dashboards and reports created for a specific Cassandra application.

• CQL, Solr queries and DataStax drivers

• KPIs and aggregations pre-calculated by a scheduled batch or on the fly during insert.
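
As an illustration of pre-calculating a KPI "on the fly during insert", here is a minimal Python sketch. The class `SalesStore` is a hypothetical in-memory stand-in for two Cassandra tables (a detail table and a pre-aggregated table); the names and the schema are assumptions, not something taken from the slides. Amounts are integers (for example, cents) to keep the arithmetic exact.

```python
from collections import defaultdict


class SalesStore:
    """In-memory stand-in for two Cassandra tables (hypothetical schema)."""

    def __init__(self):
        self.raw_events = []                   # detail table
        self.daily_revenue = defaultdict(int)  # pre-aggregated KPI table

    def insert_sale(self, day, product, amount):
        # Write the raw event and update the KPI in the same code path,
        # so a dashboard never has to aggregate at read time.
        self.raw_events.append((day, product, amount))
        self.daily_revenue[(day, product)] += amount

    def revenue(self, day, product):
        # Single-key lookup, analogous to a CQL query by primary key.
        return self.daily_revenue.get((day, product), 0)


store = SalesStore()
store.insert_sale("2016-03-07", "book", 1200)
store.insert_sale("2016-03-07", "book", 800)
store.insert_sale("2016-03-08", "book", 500)
print(store.revenue("2016-03-07", "book"))
```

The read side stays a predictable key lookup; the cost of aggregation is paid once per write instead of once per dashboard refresh.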


BI & Data Visualization tools


For BI and data visualization tools such as Tableau Software, Power BI, QlikView, Excel, …

• DataStax ODBC driver

SQL joins and aggregations are executed at client level!

• Spark ODBC driver (from Databricks or Microsoft)

SQL is translated into Spark jobs and executed at server level


Tableau Software


• DataStax ODBC Driver

• Databricks Spark ODBC Driver for SparkSQL

Live SQL queries to Spark, or data extracts on the local client

http://www.datastax.com/dev/blog/tableau-spark-cassandra


Power BI Desktop


Support for On-Prem Spark distributions

“The new data source in this month’s release is support for On-Prem Spark distributions. Last month, we added support for Microsoft Azure HDInsight Spark, and this month we’re expanding to other Spark distributions. This new connector can be found under the “Other” category in the “Get Data” dialog.”

http://blogs.msdn.com/b/powerbi/archive/2015/09/23/44-new-features-in-the-power-bi-desktop-september-update.aspx

Microsoft Spark ODBC Driver


Notebook


Run code (Spark or CQL) from a Web browser

Notebooks like Zeppelin, Spark Notebook, Jupyter

For example Zeppelin:

• Examples available for Cassandra

• CQL language interpreter

• https://github.com/doanduyhai/incubator-zeppelin


DataStax Enterprise Analytics


DataStax Enterprise (DSE)

“An enterprise operational database platform with real-time search and analytics”

• Cassandra Certified: a certified, supported, enterprise-ready version of Apache Cassandra.

• Security: enterprise security with on-disk encryption, audit trails, and external authentication (Kerberos, LDAP/AD).

• In-Memory: option to also store data in memory for even faster access.

• Management Services: automated administration services (repair, backup, alerts, …) and performance monitoring.

• Visual Admin: the “OpsCenter” visual tool for monitoring and administration.

• Support: 24x7 support with hot fixes and performance reviews.


Integrated Search

• Search on Cassandra data through tight integration of the Solr and Lucene engines

• Facets, filters, geospatial search, full-text search, joins, etc.

• Real-time search and indexing operations

• Search queries from CQL and the Solr REST API

• Distributed and replicated Solr indexes, masterless architecture


Analytics and Data Transformation

• Deep integration of Apache Spark with Cassandra

• Spark = distributed processing: “in-memory Map/Reduce”, multi-threaded …

• GraphX, MLlib (machine learning), SparkSQL, Spark Streaming, SparkR

• Spark JDBC server, Spark Job Server

• Solr integration

• DataStax / Databricks partnership: Spark support


Spark Use Cases


Load data from various sources

Analytics (join, aggregate, transform, …)

Sanitize, validate, normalize data

Schema migration, data conversion


Real-Time or Batch Processing

Detection, alerts, data enrichment

Batch Processing

Machine Learning

Pre-computed aggregates, model computation

No ETL


Workload Isolation


No ETL


DataStax Enterprise 5


DataStax Enterprise Graph

“DSE Graph is a scalable graph database solution for Web and mobile applications that need to manage highly connected data”

Origin: the open source Titan project

DSE Graph integrated into DSE:

• Tight integration with Cassandra

• OLAP and graph analytics with Apache Spark

• OLTP with Apache Solr support for search

• Monitoring from OpsCenter

• No need for additional nodes or clusters

• No external process, same JVM

• Uses and supports the TinkerPop framework


“Real-Time Analytics” Architectures


Kappa Architecture

Spark, Mesos, Akka, Cassandra, Kafka

• Everything is done in the streaming layer (unlike a Lambda architecture)

• SMACK is an example of a Kappa implementation
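
The Kappa idea can be sketched in a few lines of Python. This is a toy model under stated assumptions: `EventLog` is a hypothetical stand-in for a Kafka topic, and the dictionary returned by `build_view` stands in for a Cassandra serving table. None of these names come from the slides or from a real library. It shows one stream-processing code path, with reprocessing done by replaying the same log rather than by a separate batch layer.

```python
from collections import defaultdict


class EventLog:
    """Append-only log, standing in for a Kafka topic (hypothetical)."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def read_from(self, offset=0):
        # Consumers can re-read the log from any offset.
        return iter(self._events[offset:])


def build_view(events):
    """The single processing code path: count readings per sensor."""
    view = defaultdict(int)
    for sensor_id, _value in events:
        view[sensor_id] += 1
    return dict(view)


log = EventLog()
for event in [("s1", 20.5), ("s2", 19.0), ("s1", 21.0)]:
    log.append(event)

# Serving view; in a SMACK stack this would live in a Cassandra table.
view = build_view(log.read_from(0))

# Reprocessing = replaying the same log through the same code,
# instead of maintaining a separate batch layer as in Lambda.
replayed = build_view(log.read_from(0))
print(view, replayed == view)
```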


Example of a SMACK architecture at ING

Spark, Mesos, Akka, Cassandra, Kafka

http://www.natalinobusa.com/2015/11/why-is-smack-stack-all-rage-lately.html

http://www.slideshare.net/natalinobusa/realtime-anomaly-detection-with-spark-mllib-akka-and-cassandra

http://www.slideshare.net/patrickmcfadin/laying-down-the-smack-on-your-data-pipelines


Real-Time Analytics in IoT


DataStax Enterprise + Hadoop

• Write Intensive: Internet of Things, activity logs for fraud and recommendation, messages

• Read Intensive: catalogue, playlist, recommendation, fraud alert, personalization

• Operational Search, Dashboard and Reporting

• Offline Applications: historical analysis, complex analytics, self-service BI

• Data Warehouse / Data Lake, Hadoop cluster, Computation Engine


Functional use cases

Messaging

Collections / Playlists

Fraud detection

Recommendation / Personalization

Internet of Things / Sensor data


More information

• DataStax: http://www.datastax.com

• Downloads: http://www.datastax.com/download

• Documentation: http://www.datastax.com/docs

• Developer Blog: http://www.datastax.com/dev/blog

• Academy: https://academy.datastax.com/

• Community Site: http://planetcassandra.org/


Thank you

We power the big data apps that transform business.
