Advanced search and Top-K queries in Cassandra

Andrés de la Peña [email protected] @a_de_la_pena, Apache Cassandra Meetup 2015

Transcript of Advanced search and Top-K queries in Cassandra

Page 1: Advanced search and Top-K queries in Cassandra

Advanced search and Top-K queries in Cassandra

Andrés de la Peña [email protected] @a_de_la_pena

Apache Cassandra Meetup 2015

Page 2: Advanced search and Top-K queries in Cassandra

Who are we?

•  Stratio is a Big Data company
•  Founded in 2013
•  Commercially launched in 2014
•  70+ employees in Madrid
•  Office in San Francisco
•  Certified Spark distribution

Page 3: Advanced search and Top-K queries in Cassandra

CONTENTS

1  Introduction to Cassandra
2  Cassandra query methods
3  Stratio Lucene based 2i implementation
4  Integrating Lucene 2i with Apache Spark

Page 4: Advanced search and Top-K queries in Cassandra

Apache Cassandra overview

•  Tunable consistency: tradeoffs between consistency and latency are tunable. C* favors high availability and partition tolerance over consistency; strong consistency can be achieved, but there is no row locking.
•  Incremental scalability: nodes added to a cluster increase throughput in a predictable, linear fashion.
•  The best of Dynamo & Bigtable: combines the partitioning and replication of Amazon’s Dynamo with the log-structured data model of Google’s Bigtable.
•  Decentralized: P2P architecture without a master node or single point of failure.

Page 5: Advanced search and Top-K queries in Cassandra

Apache Cassandra operators

Page 6: Advanced search and Top-K queries in Cassandra

Cassandra query methods

[Diagram: primary key, secondary indexes and token ranges compared by throughput and expressiveness]

Page 7: Advanced search and Top-K queries in Cassandra

Primary key queries

•  O(1) node lookup for partition key
•  Range slices for clustering key
•  Usually requires denormalization

[Diagram: a CLIENT addresses a three-node ring (Node 1, Node 2, Node 3) by partition key and clustering key range. Partitions aagea, dhiguero and apena hold cells keyed by date, from 2014-04-06:body to 2014-04-11:body; one labeled cell is apena 2014-04-10:body, "When you.."]
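The two lookups on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-node ring, the 32-bit token space and the MD5-based `token_of` function are hypothetical stand-ins (Cassandra's default partitioner is Murmur3 and real clusters use many more tokens), and the row bodies are placeholders. The point is that the owning node is computed from the partition key alone, and that a clustering key range is a contiguous slice of a sorted partition.

```python
import bisect
import hashlib

TOKEN_SPACE = 2**32  # toy token space; an assumption for this sketch

# Hypothetical ring: each node owns the tokens up to and including its own token.
RING_TOKENS = [TOKEN_SPACE // 3, 2 * TOKEN_SPACE // 3, TOKEN_SPACE - 1]
RING_NODES = ["Node 1", "Node 2", "Node 3"]


def token_of(partition_key: str) -> int:
    """Hash the partition key to a token (MD5 here, not Cassandra's Murmur3)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def node_for(partition_key: str) -> str:
    """Find the owning node from the key alone, without asking other nodes."""
    index = bisect.bisect_left(RING_TOKENS, token_of(partition_key))
    return RING_NODES[index]


def slice_range(rows, low, high):
    """Return rows whose clustering key is in [low, high]; rows are sorted by key."""
    keys = [key for key, _ in rows]
    start = bisect.bisect_left(keys, low)
    end = bisect.bisect_right(keys, high)
    return rows[start:end]


# Illustrative partition: clustering key (a date) and a placeholder body.
partition = [
    ("2014-04-06", "body a"),
    ("2014-04-07", "body b"),
    ("2014-04-08", "body c"),
    ("2014-04-10", "body d"),
    ("2014-04-11", "body e"),
]

owner = node_for("apena")
recent = slice_range(partition, "2014-04-08", "2014-04-10")
```

Because the slice only works on the sort order chosen at table design time, a query on any other column needs another table, which is where the denormalization in the last bullet comes from.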

Page 8: Advanced search and Top-K queries in Cassandra

Cassandra query methods

[Diagram: primary key, secondary indexes and token ranges compared by throughput and expressiveness]

Page 9: Advanced search and Top-K queries in Cassandra

Secondary indexes queries

•  Inverted index
•  Mitigates denormalization
•  Queries may involve all C* nodes
•  Queries limited to a single column

[Diagram: a CLIENT and three C* nodes, each with its own 2i local column family]
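As a rough illustration of the bullets on this slide, the following Python sketch models each node with a local inverted index on a single column. The `Node` class and `query_all_nodes` function are hypothetical names invented for this example, not Cassandra internals, and updates and deletes are ignored. It shows why a query by indexed value has to be sent to every node: the index lives next to the data, so the coordinator cannot tell in advance which nodes hold matches.

```python
from collections import defaultdict


class Node:
    """A toy C* node holding rows plus a local inverted index on one column."""

    def __init__(self, indexed_column):
        self.indexed_column = indexed_column
        self.rows = {}                 # primary key -> row dict
        self.index = defaultdict(set)  # column value -> set of primary keys

    def insert(self, key, row):
        # Inserts only; overwriting or deleting rows is out of scope here.
        self.rows[key] = row
        self.index[row[self.indexed_column]].add(key)

    def lookup(self, value):
        return [self.rows[key] for key in sorted(self.index.get(value, ()))]


def query_all_nodes(nodes, value):
    """The coordinator cannot know which nodes hold matches, so it asks each one."""
    results = []
    for node in nodes:
        results.extend(node.lookup(value))
    return results


nodes = [Node("username") for _ in range(3)]
nodes[0].insert(1, {"id": 1, "username": "apena"})
nodes[1].insert(2, {"id": 2, "username": "aagea"})
nodes[2].insert(3, {"id": 3, "username": "apena"})

matches = query_all_nodes(nodes, "apena")
```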

Page 10: Advanced search and Top-K queries in Cassandra

Cassandra query methods

[Diagram: primary key, secondary indexes and token ranges compared by throughput and expressiveness]

Page 11: Advanced search and Top-K queries in Cassandra

Token range queries

•  Used by MapReduce frameworks such as Hadoop or Spark
•  All kinds of queries are possible
•  Low throughput
•  Ad-hoc queries
•  Batch processing
•  Materialized views

query = function(all data)

[Diagram: CLIENT, Spark master and three C* nodes]

Page 12: Advanced search and Top-K queries in Cassandra

Combining 2i with MapReduce

•  Expressiveness without full scans
•  Still limited by one indexed column per query

[Diagram: CLIENT, Spark master and three C* nodes, each with a secondary index]

Page 13: Advanced search and Top-K queries in Cassandra

What do we miss from 2i indexes?

MORE EXPRESSIVENESS

•  Range queries
•  Multivariable search
•  Full text search
•  Sorting by fields
•  Top-k queries

Page 14: Advanced search and Top-K queries in Cassandra

What do we like from the existing 2i?

ITS ARCHITECTURE

•  Each node indexes its own data
•  The index implementations do not need to be distributed
•  Can be created after design and ingestion
•  Natural extension point

Page 15: Advanced search and Top-K queries in Cassandra

Thinking of a custom secondary index implementation…

WHY NOT USE LUCENE?

Page 16: Advanced search and Top-K queries in Cassandra

Why we like Lucene

•  Proven stable and fast indexing solution
•  Expressive queries
   - Multivariable, ranges, full text, sorting, top-k, etc.
•  Mature distributed search solutions built on top of it
   - Solr, ElasticSearch
•  Can be fully embedded in application code
•  Published under the Apache License

Page 17: Advanced search and Top-K queries in Cassandra

HOW IT WORKS

Page 18: Advanced search and Top-K queries in Cassandra

Create index

•  Built in the background at any moment
•  Real time updates
•  Mapping eases ETL
•  Language aware

    CREATE TABLE tweets (
        id bigint,
        createdAt timestamp,
        message text,
        userid bigint,
        username text,
        PRIMARY KEY (userid, createdAt, id)
    );

    ALTER TABLE tweets ADD lucene TEXT;

    CREATE CUSTOM INDEX tweets_idx ON twitter.tweets (lucene)
    USING 'com.stratio.index.RowIndex'
    WITH OPTIONS = {
        'refresh_seconds' : '60',
        'schema' : '{
            default_analyzer : "EnglishAnalyzer",
            fields : {
                createdat : {type : "date", pattern : "yyyy-MM-dd"},
                message   : {type : "text", analyzer : "EnglishAnalyzer"},
                userid    : {type : "string"},
                username  : {type : "string"}
            }
        }'
    };

Page 19: Advanced search and Top-K queries in Cassandra

Filtering query

    SELECT * FROM tweets
    WHERE lucene = '{ filter : {type : "match", field : "text", value : "cassandra"} }'
    LIMIT 10;

[Diagram: the CLIENT searches for 10 rows across three C* nodes, each with its own Lucene index. One node finds 6 rows and another finds 4: "We are done!"]
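The flow in this slide's diagram (search 10, found 6, found 4, done) can be sketched as follows. This is a simplified model, not the actual coordinator code: `filtered_search` is a hypothetical function, the per-node match lists are made up, and visiting nodes strictly one after another is an assumption made for clarity. Since a filter does not rank results, any 10 matching rows satisfy the query, so the search can stop as soon as the limit is reached.

```python
def filtered_search(node_matches, limit):
    """Visit nodes one at a time and stop as soon as `limit` rows are collected."""
    collected = []
    visited = 0
    for matches in node_matches:
        if len(collected) >= limit:
            break
        visited += 1
        collected.extend(matches[: limit - len(collected)])
    return collected, visited


# Hypothetical per-node matches mirroring the slide: 6 rows, 4 rows, and a third node.
node_matches = [
    [f"n1-row{i}" for i in range(6)],
    [f"n2-row{i}" for i in range(4)],
    [f"n3-row{i}" for i in range(7)],
]

rows, visited = filtered_search(node_matches, limit=10)
```

With these inputs the third node is never contacted, because the first two already supply the 10 requested rows.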

Page 20: Advanced search and Top-K queries in Cassandra

Top-k query

    SELECT * FROM tweets
    WHERE lucene = '{ query : {type : "match", field : "text", value : "cassandra"} }'
    LIMIT 5;

[Diagram: the CLIENT searches the top 5 on three C* nodes, each with its own Lucene index. The nodes find 5, 4 and 5 rows, and the 14 candidates are merged to the best 5]
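The merge step in this slide can be illustrated with a short Python sketch. The scores and row names are invented, and `top_k_merge` is a hypothetical helper rather than the implementation's own code. It relies on a simple property: every row in the global top 5 must also be in the local top 5 of the node that stores it, so collecting at most 5 candidates per node and keeping the best 5 of the union gives the correct answer. Unlike a filter, a ranked query therefore has to consult every node.

```python
import heapq


def top_k_merge(per_node_hits, k):
    """Merge each node's local top-k (score, row) hits into the global best k."""
    all_hits = [hit for hits in per_node_hits for hit in hits]
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[0])


# Hypothetical scored hits: the nodes return 5, 4 and 5 candidates, 14 in total.
per_node_hits = [
    [(0.91, "a1"), (0.80, "a2"), (0.72, "a3"), (0.40, "a4"), (0.15, "a5")],
    [(0.95, "b1"), (0.60, "b2"), (0.33, "b3"), (0.10, "b4")],
    [(0.88, "c1"), (0.85, "c2"), (0.50, "c3"), (0.20, "c4"), (0.05, "c5")],
]

best = top_k_merge(per_node_hits, k=5)
```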

Page 21: Advanced search and Top-K queries in Cassandra

Queries can be as complex as you want

    SELECT * FROM tweets WHERE lucene = '{
        filter : {
            type : "boolean",
            must : [
                {type : "range", field : "time", lower : "2014/04/25"},
                {type : "boolean", should : [
                    {type : "prefix", field : "user", value : "a"},
                    {type : "wildcard", field : "user", value : "*b*"},
                    {type : "match", field : "user", value : "fast"}
                ]}
            ]
        },
        sort : {
            fields : [
                {field : "time", reverse : true},
                {field : "user", reverse : false}
            ]
        }
    }' LIMIT 10000;

Page 22: Advanced search and Top-K queries in Cassandra

Some implementation details

•  A Lucene document per CQL row, and a Lucene field per indexed column
•  SortingMergePolicy keeps the index sorted in the same way that C* does
•  Index commits synchronized with column family flushes
•  Segment merges synchronized with column family compactions

NO MAINTENANCE REQUIRED

Page 23: Advanced search and Top-K queries in Cassandra

LUCENE AND SPARK

Page 24: Advanced search and Top-K queries in Cassandra

Integrating Lucene & Spark

Split friendly: it supports searches within a token range

    SELECT * FROM tweets
    WHERE lucene = '{ filter : {type : "match", field : "text", value : "cassandra"} }'
    AND TOKEN(userid) > 253653456456
    AND TOKEN(userid) <= 3456467456756
    LIMIT 10000;
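To show how a MapReduce-style job could use this, here is a hedged Python sketch that cuts the token ring into contiguous ranges and builds one search per range. The helper names `split_token_ring` and `split_query` are hypothetical, the token bounds are those of the Murmur3 partitioner, and the query text assumes that `userid` alone is the partition key, as in the `CREATE TABLE` statement shown earlier in the deck. A real connector would normally align splits with the ranges each node owns rather than cutting the ring evenly.

```python
MIN_TOKEN = -(2**63)   # Murmur3 partitioner token bounds
MAX_TOKEN = 2**63 - 1


def split_token_ring(num_splits):
    """Cut the token ring into contiguous (start, end] ranges."""
    span = MAX_TOKEN - MIN_TOKEN
    bounds = [MIN_TOKEN + span * i // num_splits for i in range(num_splits + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def split_query(search_json, start, end, limit=10000):
    """Build the CQL for one split; table and column names follow the slides."""
    # The range is open at `start`, so the first split does not cover MIN_TOKEN
    # itself; a real connector has to handle that edge explicitly.
    return (
        f"SELECT * FROM tweets WHERE lucene = '{search_json}' "
        f"AND TOKEN(userid) > {start} AND TOKEN(userid) <= {end} "
        f"LIMIT {limit};"
    )


search = '{ filter : {type:"match", field:"text", value:"cassandra"} }'
splits = split_token_ring(4)
queries = [split_query(search, start, end) for start, end in splits]
```

Each generated statement can be run by a separate worker, so the index lookup and the parallelism of the batch framework are combined.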

Page 25: Advanced search and Top-K queries in Cassandra

Integrating Lucene & Spark

Paging friendly: it supports starting queries at a certain point

    SELECT * FROM tweets
    WHERE lucene = '{ filter : {type : "match", field : "text", value : "cassandra"} }'
    AND userid = 3543534
    AND createdAt > '2011-02-03 04:05+0000'
    LIMIT 5000;
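The paging pattern behind this query can be sketched in Python as a loop that restarts each request after the last clustering key seen. `fetch_page` stands in for running the CQL statement with a new `createdAt >` bound; both function names and the sample rows are hypothetical, and the sketch assumes that `created_at` values are unique within the partition (otherwise the remaining clustering columns would also have to be part of the restart condition).

```python
def fetch_page(rows, after, page_size):
    """Return up to page_size rows whose created_at is strictly greater than `after`."""
    page = [row for row in rows if after is None or row["created_at"] > after]
    return page[:page_size]


def fetch_all(rows, page_size):
    """Read everything page by page, restarting each query after the last row seen."""
    result, after = [], None
    while True:
        page = fetch_page(rows, after, page_size)
        if not page:
            return result
        result.extend(page)
        after = page[-1]["created_at"]


# Hypothetical matches for one userid, already sorted by clustering key.
rows = [{"created_at": f"2011-02-{day:02d}"} for day in range(1, 8)]
pages = fetch_all(rows, page_size=3)
```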

Page 26: Advanced search and Top-K queries in Cassandra

Integrating Lucene & Spark

•  Computes over large amounts of data
•  Avoids systematic full scans
•  Reduces the amount of data to be processed
•  Filtering push-down

[Diagram: CLIENT, Spark master and three C* nodes, each with a Lucene index]

Page 27: Advanced search and Top-K queries in Cassandra

WHEN TO USE INDEXES AND WHEN TO USE FULL SCAN

Page 28: Advanced search and Top-K queries in Cassandra

Index performance in Spark

[Plot: time versus records returned, comparing full scan and Lucene 2i]

Page 29: Advanced search and Top-K queries in Cassandra

DEMO Lucene indexes in C*

Page 30: Advanced search and Top-K queries in Cassandra

Conclusions

•  Added new query methods

- Multivariable queries (AND, OR, NOT)

- Range queries (>, >=, <, <=) and regular expressions

- Full text queries (match, phrase, fuzzy...)

•  Top-k query support

- Lucene scoring formula

- Sort by field values

•  Compatible with MapReduce frameworks

•  Preserves Cassandra’s functionality

Page 31: Advanced search and Top-K queries in Cassandra

It’s open source

github.com/stratio/stratio-cassandra
•  Published as a fork of Apache Cassandra
•  Apache License Version 2.0

stratio.github.io/crossdata
•  Apache License Version 2.0

github.com/stratio/deep-spark
•  Apache License Version 2.0
