Advanced search and Top-K queries in Cassandra

Andrés de la Peña [email protected] @a_de_la_pena

Apache Cassandra Meetup 2015

•  Stratio is a Big Data Company

•  Founded in 2013

•  Commercially launched in 2014

•  70+ employees in Madrid

•  Office in San Francisco

•  Certified Spark distribution


Who are we?

CONTENTS

1. Introduction to Cassandra
2. Cassandra query methods
3. Stratio Lucene-based 2i implementation
4. Integrating Lucene 2i with Apache Spark


Tunable consistency: Tradeoffs between consistency and latency are tunable. C* favors high availability and partition tolerance over consistency; strong consistency can be achieved, but there is no row locking.

Incremental scalability: Nodes added to a cluster increase throughput in a predictable, linear fashion.

The best of Dynamo & Bigtable: Combines the partitioning and replication of Amazon’s Dynamo with the log-structured data model of Google’s Bigtable.

Decentralized: P2P architecture without a master node or single point of failure.

Apache Cassandra overview

Apache Cassandra operators


[Diagram: the three query methods (primary key, secondary indexes, token ranges) compared by throughput and expressiveness]



•  O(1) node lookup for partition key
•  Range slices for clustering key
•  Usually requires denormalization

Primary key queries

[Diagram: the CLIENT sends a partition key (apena) and a clustering key range to a three-node cluster (Node 1, Node 2, Node 3). Partitions aagea, dhiguero and apena store rows clustered by date, such as 2014-04-10:body "When you.." and 2014-04-11:body "When you do…".]
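
A minimal Python sketch of the two primary key operations above, under stated assumptions: the `Ring` class, the `token` and `range_slice` helpers, the node tokens and the sample rows are all hypothetical, and MD5 stands in for the hash of Cassandra's real partitioner. It only illustrates that the owning node is derived from the partition key alone, and that rows inside a partition are kept sorted by clustering key so a range is a contiguous slice.

```python
import bisect
import hashlib


def token(partition_key: str) -> int:
    """Map a partition key to a ring position (MD5 is a stand-in hash)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    """Toy token ring: each node owns the tokens up to and including its own."""

    def __init__(self, node_tokens):
        self._entries = sorted((t, name) for name, t in node_tokens.items())
        self._tokens = [t for t, _ in self._entries]

    def node_for(self, partition_key: str) -> str:
        # First node whose token is >= the key's token, wrapping around.
        i = bisect.bisect_left(self._tokens, token(partition_key))
        return self._entries[i % len(self._entries)][1]


def range_slice(rows, start, end):
    """Rows are (clustering_key, value) pairs sorted by clustering key."""
    keys = [k for k, _ in rows]
    lo = bisect.bisect_left(keys, start)
    hi = bisect.bisect_right(keys, end)
    return rows[lo:hi]


# Hypothetical cluster and partition contents.
ring = Ring({"Node 1": 2**62, "Node 2": 2**63, "Node 3": 2**64 - 1})
owner = ring.node_for("apena")

apena_rows = [
    ("2014-04-06", "first tweet"),
    ("2014-04-10", "second tweet"),
    ("2014-04-11", "third tweet"),
]
selected = range_slice(apena_rows, "2014-04-10", "2014-04-11")
```

Here `owner` is the single node to contact for the `apena` partition, and `selected` holds the rows whose clustering key falls inside the requested range.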





[Diagram: the CLIENT queries several C* nodes, each holding its own 2i local column family]


Secondary indexes queries

•  Inverted index
•  Mitigates denormalization
•  Queries may involve all C* nodes
•  Queries limited to a single column
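
The points above can be sketched in Python as follows. This is a toy model, not Cassandra's implementation: `NodeIndex`, `query_all_nodes` and the sample rows are hypothetical names introduced for illustration. Each node keeps an inverted index over one column of its own rows, so a query by value has to be sent to every node and the partial results concatenated.

```python
from collections import defaultdict


class NodeIndex:
    """One node's local inverted index over a single column."""

    def __init__(self, column):
        self.column = column
        self.rows = {}                 # primary key -> row
        self.index = defaultdict(set)  # column value -> primary keys

    def insert(self, key, row):
        self.rows[key] = row
        self.index[row[self.column]].add(key)

    def lookup(self, value):
        # Only this node's own rows are visible here.
        return [self.rows[k] for k in sorted(self.index.get(value, ()))]


def query_all_nodes(nodes, value):
    """Scatter the lookup to every node and gather the partial results."""
    results = []
    for node in nodes:
        results.extend(node.lookup(value))
    return results


# Hypothetical data spread over three nodes, indexed by username.
nodes = [NodeIndex("username") for _ in range(3)]
nodes[0].insert(1, {"id": 1, "username": "apena", "message": "hello"})
nodes[1].insert(2, {"id": 2, "username": "aagea", "message": "hi"})
nodes[2].insert(3, {"id": 3, "username": "apena", "message": "bye"})

found = query_all_nodes(nodes, "apena")
```

Because the index covers a single column, a condition on a second column would need another index or client-side filtering.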





[Diagram: a Spark master reading from three C* nodes]

Token range queries

•  Used by MapReduce frameworks such as Hadoop or Spark
•  All kinds of queries are possible
•  Low throughput
•  Ad-hoc queries
•  Batch processing
•  Materialized views

[Diagram: the CLIENT submits a query expressed as query = function(all data)]


Combining 2i with MapReduce

•  Adds expressiveness while avoiding full scans
•  Still limited to one indexed column per query

[Diagram: the CLIENT and a Spark master reading from three C* nodes through each node's secondary index]


MORE EXPRESSIVENESS

What do we miss from 2i indexes?

•  Range queries
•  Multivariable search
•  Full text search
•  Sorting by fields
•  Top-k queries


ITS ARCHITECTURE

What do we like about the existing 2i?

•  Each node indexes its own data
•  The index implementations do not need to be distributed
•  Can be created after design and ingestion
•  Natural extension point


Thinking about a custom secondary index implementation…

WHY NOT USE LUCENE?


Why we like Lucene

•  Proven stable and fast indexing solution
•  Expressive queries
   - Multivariable, ranges, full text, sorting, top-k, etc.
•  Mature distributed search solutions built on top of it
   - Solr, ElasticSearch
•  Can be fully embedded in application code
•  Published under the Apache License


HOW IT WORKS


CREATE TABLE tweets (
    id bigint,
    createdAt timestamp,
    message text,
    userid bigint,
    username text,
    PRIMARY KEY (userid, createdAt, id)
);

ALTER TABLE tweets ADD lucene TEXT;

Create index

•  Can be built in the background at any moment
•  Real time updates
•  Mapping eases ETL
•  Language aware


CREATE CUSTOM INDEX tweets_idx ON twitter.tweets (lucene)
USING 'com.stratio.index.RowIndex'
WITH OPTIONS = {
    'refresh_seconds' : '60',
    'schema' : '{
        default_analyzer : "EnglishAnalyzer",
        fields : {
            createdat : {type : "date", pattern : "yyyy-MM-dd"},
            message   : {type : "text", analyzer : "EnglishAnalyzer"},
            userid    : {type : "string"},
            username  : {type : "string"}
        }
    }'
};


SELECT * FROM tweets
WHERE lucene = '{ filter : {type : "match", field : "text", value : "cassandra"} }'
LIMIT 10;

Filtering query

[Diagram: the CLIENT searches for 10 rows through the Lucene indexes of the C* nodes. One node finds 6 and the next finds 4, so the search is done without contacting the remaining node.]
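
The filtering flow in this slide can be approximated with a short Python sketch. It is a simplified, sequential model with hypothetical names (`filtered_search`, the sample rows); the real coordinator logic is not described in the slides. Since a filter imposes no ranking, any matching rows will do, so collection can stop as soon as the LIMIT is reached.

```python
def filtered_search(nodes, predicate, limit):
    """Collect matching rows node by node, stopping once `limit` is reached.

    `nodes` is a list of per-node row lists. Returns the collected rows and
    the number of nodes that had to be contacted.
    """
    collected = []
    contacted = 0
    for node_rows in nodes:
        if len(collected) >= limit:
            break  # enough rows: the remaining nodes are never asked
        contacted += 1
        remaining = limit - len(collected)
        matches = [row for row in node_rows if predicate(row)]
        collected.extend(matches[:remaining])
    return collected, contacted


# Hypothetical node contents: 6 matches on the first node, 7 on the second.
node_1 = [f"cassandra tip {i}" for i in range(6)] + ["spark note"]
node_2 = [f"cassandra talk {i}" for i in range(7)]
node_3 = ["cassandra meetup"]

rows, contacted = filtered_search(
    [node_1, node_2, node_3], lambda row: "cassandra" in row, limit=10
)
```

With this data the first node contributes 6 rows and the second the remaining 4, so the third node is not contacted.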


Top-k query

SELECT * FROM tweets
WHERE lucene = '{ query : {type : "match", field : "text", value : "cassandra"} }'
LIMIT 5;

[Diagram: the CLIENT asks every C* node to search its Lucene index for the top 5. The nodes find 5, 4 and 5 rows, and the 14 candidates are merged into the best 5.]
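
Unlike a filter, a top-k query has to consult every node, because the best rows may live anywhere. The following Python sketch shows the scatter and merge step under stated assumptions: `local_top_k`, `merge_top_k` and the scores are hypothetical, and the scores stand in for whatever relevance value each node's Lucene index assigns.

```python
import heapq


def local_top_k(scored_rows, k):
    """Each node returns at most its own k best (score, row_id) pairs."""
    return heapq.nlargest(k, scored_rows, key=lambda pair: pair[0])


def merge_top_k(per_node_results, k):
    """The coordinator merges all candidates and keeps the global best k."""
    candidates = [pair for result in per_node_results for pair in result]
    return heapq.nlargest(k, candidates, key=lambda pair: pair[0])


# Hypothetical relevance scores on three nodes.
node_a = [(0.91, "a1"), (0.40, "a2"), (0.35, "a3"), (0.20, "a4"), (0.10, "a5"), (0.05, "a6")]
node_b = [(0.88, "b1"), (0.75, "b2"), (0.30, "b3"), (0.15, "b4")]
node_c = [(0.95, "c1"), (0.60, "c2"), (0.55, "c3"), (0.50, "c4"), (0.45, "c5")]

partials = [local_top_k(rows, 5) for rows in (node_a, node_b, node_c)]
best = merge_top_k(partials, 5)
```

The nodes return 5, 4 and 5 candidates, and the merge keeps the 5 with the highest scores. Asking each node for k rows is sufficient because no row outside a node's local top k can appear in the global top k.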


SELECT * FROM tweets
WHERE lucene = '{
    filter : {
        type : "boolean",
        must : [
            {type : "range", field : "time", lower : "2014/04/25"},
            {type : "boolean", should : [
                {type : "prefix", field : "user", value : "a"},
                {type : "wildcard", field : "user", value : "*b*"},
                {type : "match", field : "user", value : "fast"}
            ]}
        ]
    },
    sort : {
        fields : [
            {field : "time", reverse : true},
            {field : "user", reverse : false}
        ]
    }
}'
LIMIT 10000;

Queries can be as complex as you want
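
Nested conditions like these are easier to assemble programmatically than by hand. The Python sketch below builds the same structure from small helper functions and serializes it into a CQL statement. The helpers (`match`, `prefix`, `wildcard`, `range_from`, `boolean`, `to_cql`) are hypothetical and not part of any Stratio client library. The output uses standard JSON with quoted keys; whether the index accepts that form as well as the unquoted keys shown in the slides is an assumption.

```python
import json


def match(field, value):
    return {"type": "match", "field": field, "value": value}


def prefix(field, value):
    return {"type": "prefix", "field": field, "value": value}


def wildcard(field, value):
    return {"type": "wildcard", "field": field, "value": value}


def range_from(field, lower):
    return {"type": "range", "field": field, "lower": lower}


def boolean(must=(), should=()):
    condition = {"type": "boolean"}
    if must:
        condition["must"] = list(must)
    if should:
        condition["should"] = list(should)
    return condition


def to_cql(table, search, limit):
    # Single quotes inside a CQL string literal are escaped by doubling them.
    payload = json.dumps(search).replace("'", "''")
    return f"SELECT * FROM {table} WHERE lucene = '{payload}' LIMIT {limit};"


search = {
    "filter": boolean(must=[
        range_from("time", "2014/04/25"),
        boolean(should=[
            prefix("user", "a"),
            wildcard("user", "*b*"),
            match("user", "fast"),
        ]),
    ]),
    "sort": {"fields": [
        {"field": "time", "reverse": True},
        {"field": "user", "reverse": False},
    ]},
}

statement = to_cql("tweets", search, 10000)
```

The resulting `statement` string can then be passed to whichever CQL driver the application already uses.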


NO MAINTENANCE REQUIRED

Some implementation details

•  A Lucene document per CQL row, and a Lucene field per indexed column
•  SortingMergePolicy keeps the index sorted in the same way that C* does
•  Index commits synchronized with column family flushes
•  Segment merges synchronized with column family compactions


LUCENE AND SPARK


Split friendly: it supports searches within a token range

SELECT * FROM tweets
WHERE lucene = '{ filter : {type : "match", field : "text", value : "cassandra"} }'
AND TOKEN(userid) > 253653456456
AND TOKEN(userid) <= 3456467456756
LIMIT 10000;

Integrating Lucene & Spark
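
To illustrate why token range support matters for Spark, here is a small Python sketch that cuts the token ring into contiguous splits and emits one restricted query per split, so that each task only touches its own slice of the data. It is a sketch under assumptions: the ring bounds are those of a signed 64-bit token space, `split_token_ring` and `split_query` are hypothetical helpers, and `TOKEN(userid)` is used because `userid` is the partition key of the `tweets` table. A real connector would also align splits with the token ranges each node owns.

```python
MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1


def split_token_ring(num_splits):
    """Cut the ring into contiguous (lower, upper] ranges of similar size."""
    step = (MAX_TOKEN - MIN_TOKEN) // num_splits
    bounds = [MIN_TOKEN + i * step for i in range(num_splits)] + [MAX_TOKEN]
    return list(zip(bounds[:-1], bounds[1:]))


def split_query(lucene_filter, lower, upper, limit=10000):
    """Build the query one task would run for its token range."""
    return (
        "SELECT * FROM tweets "
        f"WHERE lucene = '{lucene_filter}' "
        f"AND TOKEN(userid) > {lower} AND TOKEN(userid) <= {upper} "
        f"LIMIT {limit};"
    )


lucene_filter = '{ filter : {type : "match", field : "text", value : "cassandra"} }'
splits = split_token_ring(4)
queries = [split_query(lucene_filter, lo, hi) for lo, hi in splits]
```

Each entry in `queries` combines the Lucene filter with a token restriction, so the filter is evaluated by the index on the node while the split boundaries keep the tasks disjoint.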


SELECT * FROM tweets
WHERE lucene = '{ filter : {type : "match", field : "text", value : "cassandra"} }'
AND userid = 3543534
AND createdAt > '2011-02-03 04:05+0000'
LIMIT 5000;

Paging friendly: it supports starting queries at a certain point




[Diagram: the CLIENT and a Spark master querying three C* nodes, each with its own Lucene index]

•  Compute large amounts of data
•  Avoid systematic full scan
•  Reduces the amount of data to be processed
•  Filtering push-down


WHEN TO USE INDEXES AND WHEN TO USE FULL SCAN


Index performance in Spark

[Chart: time versus records returned, comparing full scan with Lucene 2i]


DEMO Lucene indexes in C*


Conclusions

•  Added new query methods

- Multivariable queries (AND, OR, NOT)

- Range queries (>, >=, <, <=) and regular expressions

- Full text queries (match, phrase, fuzzy...)

•  Top-k query support

- Lucene scoring formula

- Sort by field values

•  Compatible with MapReduce frameworks

•  Preserves Cassandra’s functionality

It’s open source

github.com/stratio/stratio-cassandra
•  Published as a fork of Apache Cassandra
•  Apache License Version 2.0

stratio.github.io/crossdata
•  Apache License Version 2.0

github.com/stratio/deep-spark
•  Apache License Version 2.0


Advanced search and Top-K queries in Cassandra

Andrés de la Peña [email protected] @a_de_la_pena