StratioDeep: an Integration Layer Between Spark and Cassandra - Spark Summit 2013

description

We present StratioDeep, an integration layer between the Spark distributed computing framework and Cassandra, a distributed NoSQL database. Cassandra brings together the distributed-systems technologies of Dynamo and the data model of Google’s BigTable. Like Dynamo, Cassandra is eventually consistent and based on a P2P model without a single point of failure. Like BigTable, Cassandra provides a ColumnFamily-based data model that is richer than typical key/value systems. For these reasons, Cassandra (C*) is one of the most popular NoSQL databases, but one of its handicaps is that the schema has to be modeled on the queries that will be executed, because C* is oriented to search by key. Integrating C* and Spark gives us a system that combines the best of both worlds. Existing integrations between the two systems are not satisfactory: they basically provide an HDFS abstraction layer over C*. We believe this solution is not efficient because it introduces significant overhead between the two systems. The purpose of our work has been to provide a much lower-level integration that not only performs better but also opens Cassandra to a wide range of new use cases thanks to the power of the Spark distributed computing framework. We’ve already deployed this solution in real applications for diverse clients: pattern detection, log mining, fraud detection, sentiment analysis and financial transaction analysis. In addition, this integration is the building block for our novel Lambda architecture based entirely on Cassandra. To complete the integration, we provide a seamless extension to the Cassandra Query Language (CQL). CQL is oriented to key-based search and, as such, is not a good choice for queries that move a huge amount of data. We’ve extended CQL in order to provide a user-friendly interface. This is a new approach for batch processing over C*.
It consists of an abstraction layer that translates custom CQL queries into Spark jobs and delegates the complexity of distributing the query over the underlying cluster of commodity machines to Spar

Transcript of StratioDeep: an Integration Layer Between Spark and Cassandra - Spark Summit 2013

Page 1: StratioDeep: an Integration Layer Between Spark and Cassandra - Spark Summit 2013

StratioDeep: an integration layer between Spark and Cassandra

Page 2: StratioDeep: an Integration Layer Between Spark and Cassandra - Spark Summit 2013
Page 3: StratioDeep: an Integration Layer Between Spark and Cassandra - Spark Summit 2013

Our customers

#StratioBD

Page 4: StratioDeep: an Integration Layer Between Spark and Cassandra - Spark Summit 2013

StratioDeep

An efficient data mining solution

“Two and two are four?

Sometimes… Sometimes they are five.”

G. Orwell

Page 5: StratioDeep: an Integration Layer Between Spark and Cassandra - Spark Summit 2013

Why we use Cassandra

Page 6: StratioDeep: an Integration Layer Between Spark and Cassandra - Spark Summit 2013

One User – Lots of data

Case A

Page 7: StratioDeep: an Integration Layer Between Spark and Cassandra - Spark Summit 2013

Many users – Little data

Case B

Page 8: StratioDeep: an Integration Layer Between Spark and Cassandra - Spark Summit 2013

Many users – Lots of data

Case C

Page 9: StratioDeep: an Integration Layer Between Spark and Cassandra - Spark Summit 2013

Why we also need Spark

• In Cassandra, you need to design the schema with the query in mind
• Every other type of query is either very inefficient or impossible to resolve
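The point above can be illustrated with a minimal sketch. It is plain Java with no Cassandra dependency, and every name in it (`KeyedTableSketch`, `Visit`, `findByCountry`, and so on) is hypothetical. It only assumes that a key-oriented store behaves like a map from partition key to rows, so a lookup by key reads one partition while any other query has to read everything.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for a table partitioned by user id.
// It is not Cassandra code; it only illustrates key-oriented access.
class KeyedTableSketch {
    // One stored row: the partition key plus a non-key attribute.
    static final class Visit {
        final String userId;
        final String country;

        Visit(String userId, String country) {
            this.userId = userId;
            this.country = country;
        }
    }

    private final Map<String, List<Visit>> byUserId = new HashMap<>();
    private int rowsRead = 0; // rows the most recent query had to read

    void insert(Visit v) {
        byUserId.computeIfAbsent(v.userId, k -> new ArrayList<>()).add(v);
    }

    // The query the schema was designed for: reads only the rows under one key.
    List<Visit> findByUserId(String userId) {
        List<Visit> rows = byUserId.getOrDefault(userId, new ArrayList<>());
        rowsRead = rows.size();
        return rows;
    }

    // A query the schema was not designed for: every row is read and filtered.
    List<Visit> findByCountry(String country) {
        List<Visit> result = new ArrayList<>();
        rowsRead = 0;
        for (List<Visit> partition : byUserId.values()) {
            for (Visit v : partition) {
                rowsRead++;
                if (v.country.equals(country)) {
                    result.add(v);
                }
            }
        }
        return result;
    }

    int rowsReadByLastQuery() {
        return rowsRead;
    }

    // Small hypothetical data set used by main and by the checks.
    static KeyedTableSketch sample() {
        KeyedTableSketch t = new KeyedTableSketch();
        t.insert(new Visit("alice", "ES"));
        t.insert(new Visit("alice", "US"));
        t.insert(new Visit("bob", "ES"));
        t.insert(new Visit("carol", "IT"));
        return t;
    }

    public static void main(String[] args) {
        KeyedTableSketch t = sample();
        System.out.println(t.findByUserId("alice").size() + " rows, read " + t.rowsReadByLastQuery());
        System.out.println(t.findByCountry("ES").size() + " rows, read " + t.rowsReadByLastQuery());
    }
}
```

In this toy model the full scan is exactly the kind of work that a distributed engine such as Spark can spread across a cluster, which is the motivation the slide gives for pairing the two systems.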

Page 10: StratioDeep: an Integration Layer Between Spark and Cassandra - Spark Summit 2013

Challenge Accepted

Page 11: StratioDeep: an Integration Layer Between Spark and Cassandra - Spark Summit 2013
Page 12: StratioDeep: an Integration Layer Between Spark and Cassandra - Spark Summit 2013

StratioDeep features (I)

• Supports CQL3 features
• Use of secondary indexes
• Small codebase (fewer bugs)

Page 13: StratioDeep: an Integration Layer Between Spark and Cassandra - Spark Summit 2013

StratioDeep features (II)

Provides a Java-friendly API:
• Developers map Column Families to custom serializable POJOs
• StratioDeep wraps the complexity of performing Spark calculations directly over the user-provided POJOs
• SQL-like Domain Specific Language
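As a rough idea of what mapping a Column Family to a serializable POJO could look like, here is a minimal sketch in plain Java. It does not use the StratioDeep API, which the slides do not show. The class `TweetSketch`, its fields, and the `fromRow` helper are all hypothetical, and a row is assumed to arrive as a simple map from column name to value.

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical POJO standing in for one row of a "tweets" column family.
// It is Serializable so that a framework such as Spark could ship
// instances between nodes.
class TweetSketch implements Serializable {
    private static final long serialVersionUID = 1L;

    private String author;
    private int retweets;

    public TweetSketch() {
    }

    public String getAuthor() {
        return author;
    }

    public void setAuthor(String author) {
        this.author = author;
    }

    public int getRetweets() {
        return retweets;
    }

    public void setRetweets(int retweets) {
        this.retweets = retweets;
    }

    // Hypothetical mapping step: a row given as column name -> value
    // becomes a typed object that application code can work with.
    static TweetSketch fromRow(Map<String, Object> row) {
        TweetSketch t = new TweetSketch();
        t.setAuthor((String) row.get("author"));
        t.setRetweets((Integer) row.get("retweets"));
        return t;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        row.put("author", "alice");
        row.put("retweets", 3);
        TweetSketch t = fromRow(row);
        System.out.println(t.getAuthor() + " " + t.getRetweets());
    }
}
```

A real integration layer would presumably derive this mapping from the schema or from annotations rather than from hand-written casts; the sketch only shows the shape of the idea.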

Page 14: StratioDeep: an Integration Layer Between Spark and Cassandra - Spark Summit 2013

Stratio DSL (I)

SQL-like domain specific language:
• Built on top of Spark’s API
• SQL + Linq abstractions
• Single interface to all Stratio platform modules
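The slides do not show the DSL syntax, so the following is only a loose analogy for what "SQL + Linq abstractions" might feel like: a fluent, LINQ-style query written with standard Java streams. It is not the Stratio DSL, and `FluentQuerySketch`, `LogLine`, and `errorsPerHost` are hypothetical names chosen for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain Java streams used as a stand-in for a fluent, LINQ-style query.
// This is not the Stratio DSL; all names here are hypothetical.
class FluentQuerySketch {
    static final class LogLine {
        final String level;
        final String host;

        LogLine(String level, String host) {
            this.level = level;
            this.host = host;
        }
    }

    // Roughly: SELECT host, COUNT(*) FROM logs WHERE level = 'ERROR' GROUP BY host
    static Map<String, Long> errorsPerHost(List<LogLine> logs) {
        return logs.stream()
                .filter(l -> l.level.equals("ERROR"))
                .collect(Collectors.groupingBy(l -> l.host, TreeMap::new, Collectors.counting()));
    }

    // Small hypothetical data set used by main and by the checks.
    static List<LogLine> sample() {
        return Arrays.asList(
                new LogLine("ERROR", "node1"),
                new LogLine("INFO", "node1"),
                new LogLine("ERROR", "node2"),
                new LogLine("ERROR", "node1"));
    }

    public static void main(String[] args) {
        System.out.println(errorsPerHost(sample()));
    }
}
```

The same filter-then-group shape maps naturally onto distributed operators, which is presumably why a DSL of this kind can be built on top of Spark's API.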

Page 15: StratioDeep: an Integration Layer Between Spark and Cassandra - Spark Summit 2013

Stratio DSL (II)

Stratio RT extension
• Built on top of Spark Streaming API

Stratio BUS extension
• Registration of new channels/consumers/producers

Cross-module integration with StratioMeta
• Lets us create flows of data between StratioDeep and StratioRT
• Materialized views, live queries, alerts, etc.

Page 16: StratioDeep: an Integration Layer Between Spark and Cassandra - Spark Summit 2013

Use case A Use case C

Conclusion

Page 17: StratioDeep: an Integration Layer Between Spark and Cassandra - Spark Summit 2013

THANKS

Luca Rosellini @luca_rosellini

Alvaro Agea @alvaroagea