Advanced search and Top-K queries in Cassandra
Andrés de la Peña @a_de_la_pena
Apache Cassandra Meetup 2015
Who are we?
• Stratio is a Big Data company
• Founded in 2013
• Commercially launched in 2014
• 70+ employees in Madrid
• Office in San Francisco
• Certified Spark distribution
CONTENTS
1. Introduction to Cassandra
2. Cassandra query methods
3. Stratio Lucene-based 2i implementation
4. Integrating Lucene 2i with Apache Spark
Apache Cassandra overview
Tunable consistency: Trade-offs between consistency and latency are tunable. C* favors high availability and partition tolerance over consistency; strong consistency can be achieved, but there is no row locking.
Incremental scalability: Nodes added to a cluster increase throughput in a predictable, linear fashion.
The best of Dynamo & Bigtable: Combines the partitioning and replication of Amazon’s Dynamo with the log-structured data model of Google’s Bigtable.
Decentralized: P2P architecture without a master node or single point of failure.
Apache Cassandra operators
Cassandra query methods
[Figure: primary key, secondary indexes and token ranges compared on two axes, throughput and expressiveness]
Primary key queries
• O(1) node lookup for partition key
• Range slices for clustering key
• Usually requires denormalization
[Diagram: the CLIENT sends a partition key (apena) and a clustering key range to a cluster of Node 1, Node 2 and Node 3. The partitions aagea, dhiguero and apena hold date-prefixed body columns (2014-04-06:body to 2014-04-11:body), for example apena 2014-04-10:body "When you.."]
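As a rough illustration of the two access paths on this slide, the Python sketch below routes a partition key to a node and then slices a sorted partition by clustering key. Everything in it is an assumption made for illustration: the three-node `RING`, the MD5-based `token` function (a stand-in, not Cassandra's partitioner) and the in-memory `partition` list. The lookup cost depends on the number of nodes, not on the amount of data stored.

```python
import bisect
import hashlib

# Hypothetical ring: each node owns all tokens up to and including its own token.
RING = [(2**62, "node1"), (2**63, "node2"), (2**64 - 1, "node3")]


def token(partition_key: str) -> int:
    """Hash a partition key to a 64-bit token (MD5 is only a stand-in here)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def owner(partition_key: str) -> str:
    """Locate the owning node by binary search over the sorted ring tokens."""
    tokens = [t for t, _ in RING]
    index = bisect.bisect_left(tokens, token(partition_key))
    return RING[index][1]


def range_slice(rows, start, end):
    """Return rows whose clustering key is in [start, end]; rows are sorted by key."""
    keys = [key for key, _ in rows]
    lo = bisect.bisect_left(keys, start)
    hi = bisect.bisect_right(keys, end)
    return rows[lo:hi]


# Illustrative partition, loosely based on the "apena" rows in the diagram.
partition = [
    ("2014-04-06", "The cautious…"),
    ("2014-04-10", "When you.."),
    ("2014-04-11", "When you do…"),
]
recent = range_slice(partition, "2014-04-10", "2014-04-11")
```

Because rows inside a partition are kept sorted by clustering key, the slice is a contiguous read, which is why this query method is fast but pushes the data model toward denormalization.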
Secondary index queries
• Inverted index
• Mitigates denormalization
• Queries may involve all C* nodes
• Queries limited to a single column
[Diagram: the CLIENT queries the C* nodes, each of which holds its own 2i local column family]
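A minimal sketch of the idea, assuming a toy in-memory model rather than Cassandra's actual storage: each node keeps an inverted index over its own rows only, so a query on the indexed column has to be sent to every node. The `Node` class and `query_all` function are hypothetical names.

```python
from collections import defaultdict


class Node:
    """Toy node holding its own rows and a local inverted index on one column."""

    def __init__(self, name, indexed_column):
        self.name = name
        self.indexed_column = indexed_column
        self.rows = {}                  # primary key -> row
        self.index = defaultdict(set)   # indexed value -> primary keys on this node

    def insert(self, key, row):
        self.rows[key] = row
        self.index[row[self.indexed_column]].add(key)

    def search(self, value):
        """Look up the local index only; other nodes' data is invisible here."""
        return [self.rows[key] for key in sorted(self.index.get(value, ()))]


def query_all(nodes, value):
    """Fan the query out to every node, since any of them may hold matches."""
    results = []
    for node in nodes:
        results.extend(node.search(value))
    return results


nodes = [Node("n1", "username"), Node("n2", "username"), Node("n3", "username")]
nodes[0].insert(1, {"id": 1, "username": "apena"})
nodes[1].insert(2, {"id": 2, "username": "aagea"})
nodes[2].insert(3, {"id": 3, "username": "apena"})
hits = query_all(nodes, "apena")
```

The index is keyed on a single column, which is the limitation the slide points out: a condition on a second column cannot be answered from this structure.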
Token range queries
• Used by MapReduce frameworks such as Hadoop or Spark
• All kinds of queries are possible
• Low throughput
• Ad-hoc queries
• Batch processing
• Materialized views
[Diagram: the CLIENT submits query = function(all data) to the Spark master, which reads from every C* node]
Combining 2i with MapReduce
• Expressiveness while avoiding full scans
• Still limited to one indexed column per query
[Diagram: the CLIENT talks to the Spark master, which reads each C* node through its secondary index]
What do we miss from 2i indexes?
MORE EXPRESSIVENESS
• Range queries
• Multivariable search
• Full text search
• Sorting by fields
• Top-k queries
What do we like from the existing 2i?
ITS ARCHITECTURE
• Each node indexes its own data
• The index implementations do not need to be distributed
• Can be created after design and ingestion
• Natural extension point
Thinking about a custom secondary index implementation…
WHY NOT USE LUCENE?
Why we like Lucene
• Proven stable and fast indexing solution
• Expressive queries
- Multivariable, ranges, full text, sorting, top-k, etc.
• Mature distributed search solutions built on top of it
- Solr, ElasticSearch
• Can be fully embedded in application code
• Published under the Apache License
HOW IT WORKS
CREATE TABLE tweets (
    id bigint,
    createdAt timestamp,
    message text,
    userid bigint,
    username text,
    PRIMARY KEY (userid, createdAt, id)
);
ALTER TABLE tweets ADD lucene TEXT;
Create index
• Built in the background at any moment
• Real time updates
• Mapping eases ETL
• Language aware
CREATE CUSTOM INDEX tweets_idx ON twitter.tweets (lucene)
USING 'com.stratio.index.RowIndex'
WITH OPTIONS = {
    'refresh_seconds' : '60',
    'schema' : '{
        default_analyzer : "EnglishAnalyzer",
        fields : {
            createdat : {type : "date", pattern : "yyyy-MM-dd"},
            message   : {type : "text", analyzer : "EnglishAnalyzer"},
            userid    : {type : "string"},
            username  : {type : "string"}
        }
    }'
};
Filtering query
SELECT * FROM tweets WHERE lucene = '{
    filter : {type : "match", field : "text", value : "cassandra"}
}' LIMIT 10;
[Diagram: the CLIENT searches for 10 rows across the C* nodes and their Lucene indexes. One node finds 6, the next finds 4, and the query is done without needing the third node.]
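The early-exit behaviour in the diagram can be sketched as follows. This is a simplified model under stated assumptions: nodes are plain Python lists, they are visited one after another, and `filtered_search` is a hypothetical helper. How the real coordinator schedules requests to nodes is not described in the slides.

```python
def filtered_search(nodes, predicate, limit):
    """Collect up to `limit` matching rows, stopping as soon as enough are found.

    Returns the rows and the number of nodes that had to be visited.
    """
    results = []
    visited = 0
    for rows in nodes:
        if len(results) >= limit:
            break  # "We are done": remaining nodes are never asked
        visited += 1
        matches = [row for row in rows if predicate(row)]
        results.extend(matches[: limit - len(results)])
    return results, visited


# Three hypothetical nodes holding 6, 4 and 5 matching rows plus some noise.
node_a = ["cassandra rocks"] * 6 + ["unrelated"] * 3
node_b = ["cassandra meetup"] * 4 + ["unrelated"] * 2
node_c = ["cassandra talk"] * 5
rows, visited = filtered_search(
    [node_a, node_b, node_c], lambda text: "cassandra" in text, limit=10
)
```

Since a filter carries no relevance order, any 10 matching rows are a valid answer, which is what allows the search to stop after the second node.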
Top-k query
SELECT * FROM tweets WHERE lucene = '{
    query : {type : "match", field : "text", value : "cassandra"}
}' LIMIT 5;
[Diagram: the CLIENT asks every C* node to search its Lucene index for its top 5. The nodes find 5, 4 and 5 rows, and the 14 candidates are merged into the best 5.]
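The "merge 14 to best 5" step is a scatter-gather top-k. A minimal Python sketch, assuming each hit is a `(score, row_id)` pair and that scores from different nodes are directly comparable (an assumption; the slides do not say how scores are normalized across nodes). The function names are hypothetical.

```python
import heapq


def local_top_k(scored_rows, k):
    """Each node returns at most its k best hits, highest score first."""
    return heapq.nlargest(k, scored_rows, key=lambda hit: hit[0])


def merge_top_k(per_node_hits, k):
    """The coordinator merges all candidates and keeps the k best overall."""
    candidates = [hit for hits in per_node_hits for hit in hits]
    return heapq.nlargest(k, candidates, key=lambda hit: hit[0])


# Hypothetical scored hits on three nodes.
node_a = [(0.9, "a1"), (0.8, "a2"), (0.5, "a3"), (0.4, "a4"), (0.1, "a5"), (0.05, "a6")]
node_b = [(0.95, "b1"), (0.7, "b2"), (0.3, "b3"), (0.2, "b4")]
node_c = [(0.85, "c1"), (0.6, "c2"), (0.55, "c3"), (0.45, "c4"), (0.35, "c5"), (0.02, "c6")]

partials = [local_top_k(node, 5) for node in (node_a, node_b, node_c)]
best = merge_top_k(partials, 5)
```

Unlike the filtering case, every node has to answer, because the globally best rows could all live on the last node asked. Asking each node for its own top k is enough, since no row outside a node's local top k can be in the global top k.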
Queries can be as complex as you want
SELECT * FROM tweets WHERE lucene = '{
    filter : {
        type : "boolean",
        must : [
            {type : "range", field : "time", lower : "2014/04/25"},
            {type : "boolean", should : [
                {type : "prefix", field : "user", value : "a"},
                {type : "wildcard", field : "user", value : "*b*"},
                {type : "match", field : "user", value : "fast"}
            ]}
        ]
    },
    sort : {
        fields : [
            {field : "time", reverse : true},
            {field : "user", reverse : false}
        ]
    }
}' LIMIT 10000;
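Nested conditions like these are easier to compose programmatically than by hand. The sketch below builds the same structure from small Python helpers and serializes it into a CQL string. The helper names are hypothetical, and `json.dumps` emits quoted keys, whereas the slides show unquoted ones; whether the index accepts both forms is an assumption to verify against the plugin.

```python
import json


def match(field, value):
    return {"type": "match", "field": field, "value": value}


def prefix(field, value):
    return {"type": "prefix", "field": field, "value": value}


def wildcard(field, value):
    return {"type": "wildcard", "field": field, "value": value}


def range_from(field, lower):
    return {"type": "range", "field": field, "lower": lower}


def boolean(must=(), should=()):
    """Combine conditions: `must` clauses are ANDed, `should` clauses are ORed."""
    clause = {"type": "boolean"}
    if must:
        clause["must"] = list(must)
    if should:
        clause["should"] = list(should)
    return clause


def to_cql(table, column, search, limit):
    """Embed the search as a CQL string literal, doubling any single quotes."""
    payload = json.dumps(search).replace("'", "''")
    return f"SELECT * FROM {table} WHERE {column} = '{payload}' LIMIT {limit};"


search = {
    "filter": boolean(must=[
        range_from("time", "2014/04/25"),
        boolean(should=[prefix("user", "a"), wildcard("user", "*b*"), match("user", "fast")]),
    ]),
    "sort": {"fields": [
        {"field": "time", "reverse": True},
        {"field": "user", "reverse": False},
    ]},
}
cql = to_cql("tweets", "lucene", search, 10000)
```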
Some implementation details
NO MAINTENANCE REQUIRED
• A Lucene document per CQL row, and a Lucene field per indexed column
• SortingMergePolicy keeps the index sorted in the same way that C* does
• Index commits synchronized with column family flushes
• Segment merges synchronized with column family compactions
LUCENE AND SPARK
Integrating Lucene & Spark
Split friendly: it supports searches within a token range
SELECT * FROM tweets WHERE lucene = '{
    filter : {type : "match", field : "text", value : "cassandra"}
}'
AND TOKEN(userid) > 253653456456
AND TOKEN(userid) <= 3456467456756
LIMIT 10000;
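A sketch of how a MapReduce-style client could use this: cut the token ring into contiguous ranges and issue one index-filtered query per range, so each task scans only its own slice. The assumptions here are a signed 64-bit token space (as with a Murmur3-style partitioner), the hypothetical helpers `split_ring` and `split_query`, and `TOKEN(userid)`, chosen because `userid` is the partition key of the `tweets` table.

```python
MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1


def split_ring(num_splits):
    """Cut the token space into contiguous (lower, upper] ranges of similar size."""
    span = MAX_TOKEN - MIN_TOKEN
    bounds = [MIN_TOKEN + span * i // num_splits for i in range(num_splits + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def split_query(search, lower, upper, limit=10000):
    """Build the per-split query: same Lucene filter, restricted to one token range."""
    return (
        f"SELECT * FROM tweets WHERE lucene = '{search}' "
        f"AND TOKEN(userid) > {lower} AND TOKEN(userid) <= {upper} "
        f"LIMIT {limit};"
    )


search = '{filter : {type : "match", field : "text", value : "cassandra"}}'
splits = split_ring(4)
queries = [split_query(search, lower, upper) for lower, upper in splits]
```

Each range excludes its lower bound and includes its upper bound, so adjacent splits never overlap. The very lowest token is not covered by the first range; whether that matters depends on the partitioner in use.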
Integrating Lucene & Spark
Paging friendly: it supports starting queries at a certain point
SELECT * FROM tweets WHERE lucene = '{
    filter : {type : "match", field : "text", value : "cassandra"}
}'
AND userid = 3543534
AND createdAt > '2011-02-03 04:05+0000'
LIMIT 5000;
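The paging pattern amounts to re-issuing the query with the last clustering value seen as the new starting point. A minimal sketch, assuming an in-memory list sorted by `created_at` in place of the real table; `fetch_page` stands in for one CQL query and both function names are hypothetical.

```python
def fetch_page(rows, after, page_size):
    """Stand-in for one query: rows with created_at > after, up to page_size."""
    matching = [row for row in rows if after is None or row["created_at"] > after]
    return matching[:page_size]


def iterate_pages(rows, page_size):
    """Yield successive pages, resuming each query after the last row returned."""
    cursor = None
    while True:
        page = fetch_page(rows, cursor, page_size)
        if not page:
            return
        yield page
        cursor = page[-1]["created_at"]


# Hypothetical rows of one partition, already in clustering order.
tweets = [
    {"id": 1, "created_at": "2011-02-03 04:05"},
    {"id": 2, "created_at": "2011-02-03 04:06"},
    {"id": 3, "created_at": "2011-02-04 10:00"},
    {"id": 4, "created_at": "2011-02-05 09:30"},
    {"id": 5, "created_at": "2011-02-06 18:45"},
]
pages = list(iterate_pages(tweets, page_size=2))
```

One caveat of this simplified cursor: if two rows share the same `created_at`, a strict `>` on that column alone could skip the second one at a page boundary, so a complete cursor would also need the remaining clustering column (`id`).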
Integrating Lucene & Spark
• Computes over large amounts of data
• Avoids systematic full scans
• Reduces the amount of data to be processed
• Filtering push-down
[Diagram: the CLIENT talks to the Spark master, which reads each C* node through its Lucene index]
WHEN TO USE INDEXES AND WHEN TO USE FULL SCAN
Index performance in Spark
[Chart: time against records returned, comparing a full scan with Lucene 2i]
DEMO Lucene indexes in C*
Conclusions
• Added new query methods
- Multivariable queries (AND, OR, NOT)
- Range queries (>, >=, <, <=) and regular expressions
- Full text queries (match, phrase, fuzzy...)
• Top-k query support
- Lucene scoring formula
- Sort by field values
• Compatible with MapReduce frameworks
• Preserves Cassandra’s functionality
It’s open source
github.com/stratio/stratio-cassandra
• Published as a fork of Apache Cassandra
• Apache License Version 2.0
stratio.github.io/crossdata
• Apache License Version 2.0
github.com/stratio/deep-spark
• Apache License Version 2.0