DataStax Cassandra + Spark Streaming


Overview of DataStax Enterprise

Cassandra and Spark Streaming

Event-Driven Architecture and Real-Time Analytics
Victor Coustenoble, Solutions Engineer
OCTO Technology Breakfast (Petit Déjeuner OCTO Technology), 14/04/

Agenda

  • Cassandra / DataStax
  • Spark / Spark Streaming
  • Architecture / business use cases
  • Demonstrations

DataStax delivers a platform built on the Apache Cassandra database, designed specifically for the performance and availability demands of Internet of Things, web, and mobile applications. It gives enterprises a secure, always-available database that remains simple to administer even for large-scale deployments, in a single data center, across multiple data centers, or in the cloud.

Common use cases

  • Messaging
  • Collections / Playlists
  • Fraud detection
  • Recommendation / Personalization
  • Connected objects / Sensor data

DataStax

  • Founded in April 2010
  • Key figures (as shown on the slide): ~35, 500+, 400+
  • Offices: Santa Clara, Austin, New York, London, Paris, Sydney


Key Takeaway: Introduce the company, our incredible growth and global presence, that we are in about 25% of the FORTUNE 100, and the fact that many of the online and mobile applications you already use every day are actually built on DataStax.

Talk Track: DataStax, the leading distributed database technology, delivers Apache Cassandra to the world's most innovative companies such as Netflix, Rackspace, Pearson Education and Constant Contact. DataStax is built to be agile, always-on, and predictably scalable to any size.

We were founded in April 2010, so we are a little over 4 years old. We are headquartered in Santa Clara, California, and have offices in Austin, TX; New York; London, England; and Sydney, Australia. We now have over 330 employees; this number will reach well over 400 by the end of our fiscal year (Jan 31 2015) and double by the end of FY16.

Currently 25% of the Fortune 100 use us. Our success has been built on our customers' success, and today we have over 500 customers worldwide, in over 40 countries. The logos you see here are ones that you are already using every day.

These applications are all built on DataStax and Apache Cassandra.

So how have we come so far in such a short time?

Straightening the road

RELATIONAL DATABASES → DataStax
  • SQL → CQL
  • Management tools → OpsCenter / DevCenter
  • Integration → DSE for search & analytics
  • Security → Security
  • 30 years ecosystem → Support, consulting & training

In fact, DataStax's mission is to free you from these uncertainties and smooth your way along this new road. To that end, we offer a DML/DDL language called CQL, very close to the SQL your teams already master, and complete administration and monitoring tools,

So, what DataStax is doing is trying to straighten that bend in the road. We provide things like CQL, and management tools called DevCenter and OpsCenter. DataStax Enterprise provides integration with analytics and search capabilities, and we do it all within a secure environment. We also provide consultants and training courses, including free virtual training, to help get you up to speed.


Apache Cassandra

  • No master-slave (peer-to-peer), no Single Point of Failure (No SPOF)
  • Distributed, with data center support
  • 100% available (replication)
  • Massively scalable
  • Linear scaling under load
  • High performance (reads AND writes)
  • Multi data center
  • Time series
  • Multi-model
  • Simple to operate
  • CQL language (like SQL)
  • OpsCenter / DevCenter tools

Cassandra is designed to handle big data workloads across multiple data centers with no single point of failure, providing enterprises with continuous availability without compromising performance. It uses aspects of Dynamo's partitioning and replication and a log-structured data model similar to Bigtable's.

It takes its distribution algorithm from Dynamo and its data model from Bigtable.
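To make the Dynamo-style partitioning and replication mentioned above more concrete, here is a minimal plain-Python sketch of a hash ring. It is an illustration only and not Cassandra's implementation: the `Ring` class, the SHA-256-based `token` function, the node names, and the replication factor of 3 are all assumptions chosen for the example.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a key to a position on the ring (SHA-256 is an arbitrary choice here)."""
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy hash ring: every node owns one token and there is no master node."""

    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = min(replication_factor, len(nodes))
        # Sort nodes by token so the ring can be walked clockwise.
        self.entries = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.entries]

    def replicas(self, partition_key: str):
        """Return the owner of the key plus the next nodes clockwise."""
        start = bisect.bisect_left(self.tokens, token(partition_key)) % len(self.entries)
        return [
            self.entries[(start + i) % len(self.entries)][1]
            for i in range(self.replication_factor)
        ]


# Hypothetical four-node cluster
ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
placement = ring.replicas("user:ewa")
print(placement)
```

Because any node can compute the same placement from the key alone, no coordinator has to be consulted, which is the property behind the "no single point of failure" claim in the notes.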

Cassandra is a reinvented database that is lightning fast and always on, ideal for today's online applications where relational databases like Oracle can't keep up. This means that in today's world, Cassandra stores and processes real-time information with fast, predictable performance and built-in fault tolerance.

DataStax Enterprise

Diagram labels: Confidence in use, Enterprise features, DataStax Enterprise

DataStax supports the open source community and enterprises

DataStax employs the chair of the Apache project and develops 80+% of the Apache Cassandra code.

Open Source / Community
  • DataStax Community Edition
  • Simple DataStax OpsCenter
  • DataStax DevCenter
  • DataStax Drivers / Connectors
  • Online documentation
  • Online training
  • Mailing lists and forums

Enterprise Software
  • DataStax Enterprise Edition
  • Certified Cassandra
  • In-Memory
  • Integrated analytics (Hadoop, Spark)
  • Integrated search (Solr)
  • Enterprise security
  • Advanced DataStax OpsCenter
  • Automatic administration services
  • Expert support
  • Help and consulting
  • Professional training

DataStax is the company that delivers Cassandra to the enterprise.

First, we take the open source software and put it through rigorous quality assurance tests, including a 1000-node scalability test. We certify it and provide the world's most comprehensive support, training and consulting for Cassandra so that you can get up and running quickly.

But that isn't all DataStax does.

We also build additional software features on top of Cassandra, including security, search and analytics, and we provide in-memory capabilities that don't come with the open source Cassandra product. We also provide management services to help visualize your nodes, plan your capacity and repair issues automatically. Finally, we provide developer tools and drivers as well as monitoring tools. DataStax is the commercial company behind Apache Cassandra plus a whole host of additional software and services.

Why Spark + Cassandra?

Operational / Real-Time Analytics

2014 DataStax Confidential. Do not distribute without consent.

  • Data enrichment
  • Integrity constraints
  • Threshold breach detection
  • Batch processing
  • Machine learning
  • Pre-computed aggregates
  • KPI creation
  • Predictive analytics

Diagram labels: Data, Processing, Stream

Does this simple architecture look familiar to you? Lambda architecture (Nathan Marz).

Cassandra needs a distributed processing framework

  • For queries independent of the data model
  • For cross-table operations (JOIN, UNION)
  • For complex analyses (machine learning)
  • For transformations and aggregations
  • For stream processing

Spark = Distributed Processing

  • In-memory Map/Reduce, multi-threaded, caching
  • Deep integration of Spark with Cassandra
  • DataStax / Databricks partnership
  • 10x to 100x faster than Hadoop MapReduce

Diagram labels: Replication, Cassandra, Operational application, Spark nodes

The SDK of Big Data

Spark use cases for Cassandra

  • Load data from various sources
  • Analytics (join, aggregate, transform, …)
  • Sanitize, validate, normalize data
  • Schema migration, data conversion

DUYHAI

Spark

  • Fast, distributed, scalable and fault-tolerant cluster compute system
  • Enables low latency with complex analytics
  • Developed in 2009 at UC Berkeley AMPLab, open sourced in 2010, and became a top-level Apache project in February 2014

Spark: Conceptual Representation

Diagram labels: RDD (×4), Transformations, Action, Value

    import sys
    # 1. Load the input file as an RDD of lines (sc is an existing SparkContext)
    lines = sc.textFile(sys.argv[1])
    # 2. Transformations: split into words, pair each word with 1, sum per word
    counts = lines.flatMap(lambda s: s.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda x, y: x + y)
    # 3. Action: write the result to the output path
    counts.saveAsTextFile(sys.argv[2])

Resilient Distributed Datasets (RDDs): Spark's datasets

  • Fault-tolerant collection of elements that enables parallel processing
  • Transformations and Actions are executed against RDDs
  • Can persist in memory, on disk, or both
  • Can be partitioned to control parallel processing
  • Can be reused

Spark + Cassandra Components

  • Shark or Spark SQL: Structured
  • Spark Streaming: Real-time
  • MLlib: Machine learning
  • GraphX: Graph
  • Spark (general execution engine)
  • Cassandra

Shark is Hive compatible: you can run the same application on Shark. Shark integration is only on DSE; otherwise you have to wait for Spark SQL.

Separate projects: Shark is a totally different project. Spark SQL has borrowed from Shark.

Both promising to be Hive compatible
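The split between lazy transformations and a final action, described in the RDD slide above, can be mimicked without a cluster. The sketch below is plain Python and not the Spark API: the helper names `flat_map`, `map_each`, and `reduce_by_key` are hypothetical stand-ins for the RDD methods of the word count, and the two input lines are made up. Generators keep every step lazy; only the final `dict(...)` call, playing the role of the action, forces evaluation.

```python
def flat_map(records, fn):
    """Lazy transformation: yield every element of fn(record) for each record."""
    for record in records:
        yield from fn(record)


def map_each(records, fn):
    """Lazy transformation: apply fn to each record."""
    for record in records:
        yield fn(record)


def reduce_by_key(pairs, fn):
    """Lazy transformation: merge values per key once iteration starts."""
    totals = {}
    for key, value in pairs:
        totals[key] = fn(totals[key], value) if key in totals else value
    yield from totals.items()


lines = ["to be or not to be", "to stream or to batch"]
words = flat_map(lines, lambda s: s.split(" "))     # nothing has run yet
pairs = map_each(words, lambda word: (word, 1))     # still lazy
counts = reduce_by_key(pairs, lambda x, y: x + y)   # still lazy

result = dict(counts)                               # the "action": forces evaluation
print(result)
```

In Spark the same chain is additionally split across partitions and executors; this sketch only shows the order of evaluation.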

Cassandra Spark Connector

Diagram labels: C* (×4), Spark Executor, C* Java Driver, Spark-Cassandra Connector, User Application

Cassandra Spark Connector

  • Cassandra tables exposed as Spark RDDs
  • Loading data from Cassandra into Spark
  • Writing data from Spark to Cassandra
  • Object mapper: maps Cassandra tables to Scala/Java objects
  • Conversion of Cassandra types to Scala/Java types
  • Data selection and filtering at the Cassandra level
  • Scala, Java, and Python APIs

Reading data

    import com.datastax.spark.connector._  // adds cassandraTable and saveToCassandra

    val table_rdd = sc
      .cassandraTable[CassandraRow]("db", "tweets")  // row representation; keyspace "db", table "tweets"
      .select("user_name", "message")                // server-side column selection
      .where("user_name = ?", "ewa")                 // server-side row selection

Writing data

    CREATE TABLE test.words (word TEXT PRIMARY KEY, count INT);

    val collection_rdd = sc.parallelize(Seq(("foo", 2), ("bar", 5)))

    collection_rdd.saveToCassandra("test", "words", SomeColumns("word", "count"))

    cqlsh:test> select * from words;
     word | count
    ------+-------
      bar |     5
      foo |     2
    (2 rows)

  • I want continuous results from a data stream
  • I want a guarantee that my messages are processed only once

DStream (Discretized Stream)

Continuous stream of micro-batches for:

  • Complex processing with minimal effort
  • Computations on streams over a small time interval
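The micro-batch idea can be illustrated in plain Python. This is a sketch under stated assumptions and not the Spark Streaming API: the `discretize` helper, the `(timestamp, value)` event shape, the sample hashtags, and the 2-second batch interval are all hypothetical.

```python
from collections import Counter
from itertools import groupby


def discretize(events, batch_interval):
    """Group time-ordered (timestamp, value) events into consecutive micro-batches."""
    for batch_index, group in groupby(events, key=lambda e: int(e[0] // batch_interval)):
        yield batch_index, [value for _, value in group]


# Hypothetical event stream: (seconds since start, hashtag)
events = [(0.2, "#spark"), (1.1, "#cassandra"), (1.9, "#spark"),
          (2.4, "#spark"), (3.7, "#dse"), (4.1, "#cassandra")]

# The same small computation is applied to every micro-batch
for batch_index, batch in discretize(events, batch_interval=2):
    print(batch_index, dict(Counter(batch)))
```

Each batch is an ordinary small collection, so the per-batch computation can reuse the same code as a batch job. Note that this sketch skips intervals that contain no events.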

A transformation on DS