Reporting and Analytics on Apache Cassandra
Big Data Paris 2016
Victor Coustenoble, Solutions Engineer, @vizanalytics
Agenda
• DataStax & Apache Cassandra
• Reporting and Analytics
• DataStax Enterprise Analytics
• Architectures
©2014 DataStax Confidential. Do not distribute without consent.
DataStax & Apache Cassandra
Apache Cassandra™
• Massively scalable, Open Source, NoSQL, distributed database built for modern, mission-critical online applications
• Written in Java; a hybrid of Amazon Dynamo and Google BigTable
• Masterless with no single point of failure
• Distributed and data center aware
• 100% uptime
• Predictable scaling
• High Performance
• Multi Data Center
• Time Series
• Tunable Consistency
• Simple to Operate
• CQL language
• OpsCenter / DevCenter
BigTable: http://research.google.com/archive/bigtable-osdi06.pdf
Dynamo: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
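The "Tunable Consistency" bullet above can be made concrete with a little arithmetic. The sketch below is illustrative only and is not Cassandra code: it assumes a replication factor `rf` and the usual rule that a read is guaranteed to overlap the latest acknowledged write when read replicas + write replicas > replication factor. The function names are hypothetical.

```python
def quorum(rf: int) -> int:
    """Number of replicas that form a quorum for replication factor rf."""
    return rf // 2 + 1


def is_strongly_consistent(rf: int, write_replicas: int, read_replicas: int) -> bool:
    """True when every read set overlaps every write set (R + W > RF)."""
    return read_replicas + write_replicas > rf


rf = 3
print(quorum(rf))  # 2
print(is_strongly_consistent(rf, quorum(rf), quorum(rf)))  # True: QUORUM write + QUORUM read
print(is_strongly_consistent(rf, 1, 1))  # False: ONE write + ONE read
```

Lower levels on either side trade this overlap guarantee for latency and availability, which is the tuning the slide refers to.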
Cassandra Data Access
CQL via cqlsh (command line), DevCenter (development environment), or drivers
• Drivers on Cassandra native protocol
• CQL COPY command
• Import/export tools for bulk loading
• Connectors in ETL solutions (Talend, Informatica)
• Via analytics layers Spark and Hadoop
• Via ODBC/JDBC drivers
CQL – Cassandra Query Language
• Data types: BLOB, UUID, TIMEUUID, User Defined Types …
• User Defined Functions, User Defined Aggregates
• Materialized Views
• Collections: Map, List, Set
• TTL (Time-To-Live) at column level
• Counters
• Lightweight Transactions (LWT): resolve race conditions with conditional writes such as IF NOT EXISTS
• Batch statements
• Secondary indexes
• Very similar to RDBMS SQL syntax
• Core DML and DDL commands supported: INSERT, UPDATE, DELETE, SELECT, CREATE, GRANT …
INSERT INTO sporty_league (team_name, player_name, jersey) VALUES ('PSG', 'Zlatan', 10);
SELECT player_name AS nom_joueur FROM sporty_league WHERE team_name = 'PSG';
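To illustrate the race that a lightweight transaction with IF NOT EXISTS is meant to resolve, here is a minimal in-memory sketch. It models the semantics only and is not driver code: the `Table` class and its `insert_if_not_exists` method are hypothetical, the player labels are placeholders, and a lock stands in for the coordination Cassandra performs between replicas.

```python
import threading


class Table:
    """Toy in-memory table keyed by primary key (hypothetical, for illustration)."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()

    def insert_if_not_exists(self, key, row):
        """Insert row only if key is absent; return True when the write was applied."""
        with self._lock:  # the existence check and the write happen as one atomic step
            if key in self._rows:
                return False
            self._rows[key] = row
            return True

    def get(self, key):
        return self._rows.get(key)


table = Table()
results = []


def register(player):
    # Two clients try to claim jersey 10 for the same team at the same time.
    applied = table.insert_if_not_exists(("PSG", 10), {"player_name": player})
    results.append((player, applied))


threads = [threading.Thread(target=register, args=(name,)) for name in ("Player A", "Player B")]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Exactly one of the two inserts is applied; the other is rejected.
print(sorted(applied for _, applied in results))  # [False, True]
```

Without the atomic check-and-write, both clients could see the key as absent and the second write would silently overwrite the first.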
DevCenter
Reporting & Analytics
Real-Time / Operational Analytics Use Cases
Recommendation Engine
Internet of Things
Fraud Detection
Risk Analysis
Buyer Behaviour Analytics
Telematics, Logistics
Business Intelligence
Infrastructure Monitoring
…
How to do analytics on Cassandra data?
Remember …
Cassandra = no JOIN, no GROUP BY, filtering on the primary key only
Two solutions:
• CQL with predictable queries
• Joins and aggregations on the fly:
Server level => needs a distributed processing framework: Hadoop or Spark
Client level => possible but risky!
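A minimal sketch of the "client level" option above, using plain Python lists in place of query results. The `players` and `teams` rows are made-up sample data; in a real application they would come from two separate CQL queries through a driver. It illustrates why this route is risky: every row has to be transferred to, and held by, the client before the join and the aggregation can run.

```python
from collections import defaultdict

# Made-up rows standing in for the results of two separate CQL queries.
players = [
    {"team_name": "PSG", "player_name": "Player A", "goals": 10},
    {"team_name": "PSG", "player_name": "Player B", "goals": 5},
    {"team_name": "OM", "player_name": "Player C", "goals": 7},
]
teams = [
    {"team_name": "PSG", "city": "Paris"},
    {"team_name": "OM", "city": "Marseille"},
]


def goals_by_city(players, teams):
    """Hash join on team_name, then GROUP BY city with SUM(goals), all in client memory."""
    city_of = {t["team_name"]: t["city"] for t in teams}  # build side of the join
    totals = defaultdict(int)
    for p in players:  # probe side: every player row has crossed the network
        city = city_of.get(p["team_name"])
        if city is not None:
            totals[city] += p["goals"]
    return dict(totals)


print(goals_by_city(players, teams))  # {'Paris': 15, 'Marseille': 7}
```

With three rows this is harmless; with millions of rows, client memory and network transfer become the bottleneck, which is why the server-level option relies on a distributed framework.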
Reporting and Dashboard
• Static and operational dashboards and reports created for a specific Cassandra application.
• CQL, Solr queries and DataStax drivers
• KPIs and aggregations pre-calculated by scheduled batch jobs or on the fly at insert time.
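The "on the fly at insert time" approach above can be sketched as follows. This is a toy in-memory model under stated assumptions, not Cassandra code: the `EventStore` class, its fields, and the sample values are hypothetical. The idea shown is that each write also updates a pre-aggregated KPI row, so a dashboard reads one row by key instead of aggregating at query time.

```python
from collections import defaultdict


class EventStore:
    """Toy store keeping raw events plus a pre-aggregated KPI table (hypothetical)."""

    def __init__(self):
        self.events = []  # raw rows, as they would be inserted
        self.daily_totals = defaultdict(lambda: {"count": 0, "amount": 0})

    def insert(self, day, sensor, amount):
        """Write the raw event and update the KPI row for (day, sensor) in the same call."""
        self.events.append((day, sensor, amount))
        kpi = self.daily_totals[(day, sensor)]
        kpi["count"] += 1
        kpi["amount"] += amount

    def read_kpi(self, day, sensor):
        """Dashboard read: a single lookup by key, no scan or GROUP BY at query time."""
        kpi = self.daily_totals.get((day, sensor))
        return dict(kpi) if kpi else {"count": 0, "amount": 0}


store = EventStore()
store.insert("2016-03-07", "s1", 20)
store.insert("2016-03-07", "s1", 30)
store.insert("2016-03-07", "s2", 5)
print(store.read_kpi("2016-03-07", "s1"))  # {'count': 2, 'amount': 50}
```

The trade-off is that the set of KPIs has to be known in advance, which matches the "predictable queries" constraint stated earlier.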
BI & Data Visualization tools
For BI and data visualization tools such as Tableau Software, Power BI, QlikView, Excel, …
• DataStax ODBC driver
SQL joins and aggregations are executed at the client level!
• Spark ODBC driver (from Databricks or Microsoft)
SQL is translated into Spark jobs and executed at the server level
Tableau Software
• DataStax ODBC Driver
• Databricks Spark ODBC Driver for SparkSQL
Live SQL queries to Spark, or data extracts to the local client
http://www.datastax.com/dev/blog/tableau-spark-cassandra
Power BI Desktop
Support for On-Prem Spark distributions
“The new data source in this month’s release is support for On-Prem Spark distributions. Last month, we added support for Microsoft Azure HDInsight Spark, and this month we’re expanding to other Spark distributions.
This new connector can be found under the “Other” category in the “Get Data” dialog.”
http://blogs.msdn.com/b/powerbi/archive/2015/09/23/44-new-features-in-the-power-bi-desktop-september-update.aspx
Microsoft Spark ODBC Driver
Notebook
Run code (Spark or CQL) from a Web browser
Notebooks like Zeppelin, Spark Notebook, Jupyter
For example Zeppelin:
• Examples available for Cassandra
• CQL language interpreter
• https://github.com/doanduyhai/incubator-zeppelin
DataStax Enterprise Analytics
DataStax Enterprise (DSE)
“An enterprise operational database platform with real-time search and analytics”
• Cassandra Certified: a certified, supported, enterprise-ready version of Apache Cassandra.
• Security: enterprise security with on-disk encryption, audit trails, and external authentication (Kerberos, LDAP/AD).
• In-Memory: option to also store data in memory for even faster access.
• Management Services: automated administration services (repair, backup, alerts, …) and performance monitoring.
• Visual Admin: the “OpsCenter” visual tool for monitoring and administration.
• Support: 24x7 support with hot fixes and performance reviews.
Integrated Search
• Search over Cassandra data through tight integration of the Solr and Lucene engines
• Facets, filters, geospatial search, full-text search, joins, etc.
• Real-time search and indexing operations
• Search queries from CQL and the Solr REST API
• Distributed and replicated Solr indexes, masterless architecture
Analytics and Data Transformation
• Deep integration of Apache Spark with Cassandra
• Spark = distributed processing: “in-memory Map/Reduce”, multi-threaded …
• GraphX, MLlib (machine learning), SparkSQL, Spark Streaming, SparkR
• Spark JDBC server, Spark Job Server
• Solr integration
• DataStax / Databricks partnership: Spark support
Spark use cases
• Load data from various sources
• Analytics (join, aggregate, transform, …)
• Sanitize, validate, normalize data
• Schema migration, data conversion
• Real-time or batch processing
• Detection, alerts, data enrichment
• Batch processing
• Machine learning
• Pre-computed aggregates, model computation
• No ETL
Workloads Isolation
No ETL
DataStax Enterprise 5
DataStax Enterprise Graph
“DSE Graph is a scalable graph database solution for web and mobile applications that need to manage highly connected data”
Origin: the open source Titan project
DSE Graph integrated in DSE:
• Tight integration with Cassandra
• OLAP and graph analytics with Apache Spark
• OLTP with Apache Solr support for search
• Monitoring from OpsCenter
• No need for additional nodes or clusters
• No external process, same JVM
• Uses and supports the TinkerPop framework
Architectures for “Real-Time Analytics”
Kappa Architecture
Spark, Mesos, Akka, Cassandra, Kafka
• Everything is done in the streaming layer (unlike a Lambda architecture)
• SMACK is one example of a Kappa implementation
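A minimal sketch of the Kappa idea, using in-memory stand-ins rather than the real components. The `Log` class plays the role Kafka has in SMACK and a `Counter` plays the role of the Cassandra serving table; all names and sample events are hypothetical. It shows the defining property: there is a single processing path, and reprocessing is a replay of the log through that same path rather than a separate batch layer.

```python
from collections import Counter


class Log:
    """Append-only event log (stand-in for Kafka)."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def read_from(self, offset):
        return self._events[offset:]


def process(view, event):
    """The single processing path: update the serving view for one event."""
    view[event["sensor"]] += event["value"]


def build_view(log, offset=0):
    """Replay the log from offset through the same code used for live events."""
    view = Counter()
    for event in log.read_from(offset):
        process(view, event)
    return view


log = Log()
live_view = Counter()  # serving table (stand-in for Cassandra)
sample_events = (
    {"sensor": "s1", "value": 2},
    {"sensor": "s2", "value": 5},
    {"sensor": "s1", "value": 3},
)
for event in sample_events:
    log.append(event)
    process(live_view, event)  # live path

# Reprocessing is a replay of the log, not a separate batch layer.
rebuilt = build_view(log)
print(rebuilt == live_view)  # True
print(dict(rebuilt))  # {'s1': 5, 's2': 5}
```

In a Lambda architecture the rebuilt view would come from a second, batch code path that has to be kept in agreement with the streaming one.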
Example of a SMACK architecture at ING
Spark, Mesos, Akka, Cassandra, Kafka
http://www.natalinobusa.com/2015/11/why-is-smack-stack-all-rage-lately.html
http://www.slideshare.net/natalinobusa/realtime-anomaly-detection-with-spark-mllib-akka-and-cassandra
http://www.slideshare.net/patrickmcfadin/laying-down-the-smack-on-your-data-pipelines
Real-time analytics in IoT
• Write intensive: Internet of Things, activity logs for fraud and recommendation, messages
• Read intensive: catalogue, playlist, recommendation, fraud alert, personalization
• Operational search, dashboard and reporting
• Offline applications: historical analysis, complex analytics, self-service BI
• Data warehouse / data lake, Hadoop cluster, computation engine
DataStax Enterprise + Hadoop
Functional use cases
• Messaging
• Collections / Playlists
• Fraud detection
• Recommendation / Personalization
• Internet of Things / Sensor data
More information
• DataStax: http://www.datastax.com
• Downloads: http://www.datastax.com/download
• Documentation: http://www.datastax.com/docs
• Developer Blog: http://www.datastax.com/dev/blog
• Academy: https://academy.datastax.com/
• Community Site: http://planetcassandra.org/
Thank you
We power the big data apps that transform business.