C* Summit EU 2013: Top-K Queries in Realtime with Cassandra and Intravert

#CassandraEU Top-k queries in real-time with Cassandra and Intravert Jonathan Halliday, JBoss [email protected] Rui Vieira, Newcastle University [email protected]

description

Speakers: Jonathan Halliday, Core Developer at JBoss, and Rui Vieira, Postgrad Student at Newcastle University

Video: http://www.youtube.com/watch?v=SRejy08zM7Y&list=PLqcm6qE9lgKLoYaakl3YwIWP4hmGsHm5e&index=15

Performing ranking queries to find the most relevant documents, most popular URLs, etc. on huge datasets is trivial, if you're willing to wait a while for the answers. For those with less time to waste, this session describes techniques for performing such queries efficiently. We'll describe the ranking queries problem, outline the Cassandra CQL3 data structures and code that can be used to solve it, and describe the trade-offs available. We also describe Intravert, an innovative server-side programming solution for Cassandra, and show how it can be used to reduce network usage and improve performance by filtering data closer to the source.

Transcript of C* Summit EU 2013: Top-K Queries in Realtime with Cassandra and Intravert

Page 1

Top-k queries in real-time with Cassandra and Intravert

Jonathan Halliday, JBoss, [email protected]

Rui Vieira, Newcastle University, [email protected]

Page 2

What is Top-k?

Page 3

What is Top-k?

Page 4

Top-k queries

• Rank matching results for the term(s)
  – We don't really care about the scoring algorithm

• Application: text search
  – Documents containing the search words

• Application: log analysis
  – Popular URLs in the time period

Page 5

Yawn?

• SELECT document_id, score
  FROM data
  WHERE term='top-k'
  ORDER BY score DESC, document_id
  LIMIT 100

• Lunch time!

Page 6

Not so fast...

• SELECT document_id, score
  FROM data
  WHERE term IN ('top-k', 'algorithm')
  GROUP BY document_id
  ORDER BY score DESC, document_id
  LIMIT 100

Page 7

Distributed Top-k

• We have a lot of data

• It's spread out

• We need to combine a subset efficiently

• Map/Reduce to the rescue!
  – HiveQL, Stinger, Impala, Hawq

• Easy! But not fast

Page 8

'real-time'

• Web pages, not control systems

• Performance, not timeliness

• Pre-compute as much as possible
  – Scores for each term

• Assemble pre-computed fragments at query time
  – 'group by'

Page 9

Naive method

foreach (term in searchTerms) {
    SELECT ... FROM ... WHERE ...
}

• Handle group by in the application code

• Inefficient: transfers ALL the data for each term, even low scores
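
The naive method above can be sketched in a few lines. This is only an illustration under stated assumptions: the in-memory `INDEX` dict and the `fetch_all` helper are hypothetical stand-ins for one unbounded SELECT per term, and scores are assumed to combine by summation.

```python
import heapq
from collections import defaultdict

# Hypothetical stand-in for the data table: term -> rows of (doc_id, score).
INDEX = {
    "top-k": [("d1", 90), ("d2", 75), ("d3", 2)],
    "algorithm": [("d2", 80), ("d4", 60), ("d1", 1)],
}


def fetch_all(term):
    """Simulates an unbounded SELECT for one term: every row comes back."""
    return list(INDEX.get(term, []))


def naive_top_k(terms, k):
    """One query per term, then the 'group by' is done in application code."""
    totals = defaultdict(int)
    rows_transferred = 0
    for term in terms:
        rows = fetch_all(term)
        rows_transferred += len(rows)  # low scores are transferred too
        for doc_id, score in rows:
            totals[doc_id] += score
    # Order by total score descending, then doc_id, and keep the first k.
    top = heapq.nsmallest(k, totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return top, rows_transferred
```

With the sample data, a top-2 query for both terms transfers all six rows even though only two documents are returned.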

Page 10

How much data is enough?

• Data is stored keyed (i.e. sorted) by { term, score DESC, doc_id } or { time_period, score DESC, url }

• Partition keys IN the query params
  – We can filter efficiently

• Can we range limit on score?
  – Avoid going into the long tail
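
Because rows within a partition are already ordered by score descending, a range limit on score amounts to reading a prefix of the partition and stopping at the first row below the bound. A minimal sketch, assuming a hypothetical in-memory `PARTITION` list in place of a real partition; in CQL3 the equivalent would presumably be a `score >= ?` restriction on the first clustering column.

```python
from itertools import takewhile

# Hypothetical rows of one partition, in clustering order (score DESC, doc_id).
PARTITION = [("d7", 95), ("d2", 80), ("d9", 80), ("d4", 12), ("d1", 3)]


def read_above(partition, min_score):
    """Return the leading rows whose score is >= min_score.

    Relies on the partition being sorted by score descending, so the scan
    can stop at the first row below the bound and never visits the long tail.
    """
    return list(takewhile(lambda row: row[1] >= min_score, partition))
```

For example, `read_above(PARTITION, 80)` returns the first three rows and leaves the two low-scoring rows unread.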

Page 11

Bring on the clever algorithms

• Smart people thought about this problem already...

• ...but not in quite the same context
  – WAN distributed logs from CDNs

• Identify, adapt and reuse existing solutions
  – Faster and less risky than starting over

Page 12

Inside a clever algorithm

• Fetch a little bit of data

• Look at it, decide how much more we need

• Fetch some more

• Rinse and repeat
  – But not too many times

Page 13

Desirable Characteristics

• Fixed number of communication rounds is key

• Generality is good
  – Cope with any distribution of data

• So is flexibility
  – Tune for different use cases

Page 14

Meet the candidates

Three-Phase Uniform Threshold (TPUT)
'Efficient Top-K Query Calculation in Distributed Networks', Stanford/Princeton, 2004

Hybrid Threshold
'Efficient Processing of Distributed Top-k Queries', UCSB, 2005

KLEE
'KLEE: a framework for distributed top-k query algorithms', Max-Planck Institute, 2005
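
To make the "fixed number of rounds" idea concrete, here is a rough in-memory sketch of the three phases of TPUT as described in the paper cited above. It is not the speakers' implementation: each peer is modelled as a plain dict of item to score, scores are assumed non-negative and combined by summation, at least one peer is assumed, and the helper names are made up for this example.

```python
import heapq
from collections import defaultdict


def kth_largest(values, k):
    """k-th largest value, or 0 when fewer than k values are available."""
    top = heapq.nlargest(k, values)
    return top[-1] if len(top) == k else 0


def partial_sums(seen):
    """Sum, per item, the scores the coordinator has received so far."""
    sums = defaultdict(int)
    for rows in seen:
        for item, score in rows.items():
            sums[item] += score
    return sums


def tput(peers, k):
    """peers: non-empty list of {item: non-negative score}, one per peer."""
    m = len(peers)
    # Phase 1: every peer sends its local top-k.
    seen = [dict(heapq.nlargest(k, p.items(), key=lambda kv: kv[1]))
            for p in peers]
    tau1 = kth_largest(partial_sums(seen).values(), k)
    threshold = tau1 / m
    # Phase 2: every peer sends all items scoring at least the threshold.
    for peer, rows in zip(peers, seen):
        rows.update({i: s for i, s in peer.items() if s >= threshold})
    sums = partial_sums(seen)
    tau2 = kth_largest(sums.values(), k)
    # An unseen score at a peer is below the threshold, so substituting the
    # threshold gives an upper bound; prune items that cannot reach tau2.
    candidates = [i for i in sums
                  if sum(rows.get(i, threshold) for rows in seen) >= tau2]
    # Phase 3: fetch exact scores for the surviving candidates only.
    exact = {i: sum(p.get(i, 0) for p in peers) for i in candidates}
    return heapq.nsmallest(k, exact.items(), key=lambda kv: (-kv[1], kv[0]))
```

The number of rounds is always three, whatever the data distribution; what varies is how many rows cross the threshold in phase 2 and how many candidates survive into phase 3.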

Page 15

Implementation Issues

• Algorithms assume server side code execution

• Limitations of CQL3 add some round trips, increase network I/O

• Previous performance comparisons of algorithms may no longer be valid

Page 16

Data Transfer vs. k

Page 17

Execution Time vs. k

Page 18

Execution Time vs. peers

Page 19

Page 20

YMMV

• Test with your own data

• Test with your own hardware

• Hybrid Threshold for exact top-k
  – Intravert optional

• KLEE for tunable approximate top-k
  – Inefficient without Intravert
  – Requires metadata

Page 21

Intravert

• Cassandra++
  – Embed and extend the existing server
  – Based on Vert.x

• JSON over HTTP, REST API
  – Yup, Virgil did that already

• Multiple commands per call, chain operations with REFs

Page 22

Intravert

• Server side code execution
  – Groovy (for now; Vert.x is polyglot)

• Filter result sets

• Write path triggers
  – C* 2.0 has CASSANDRA-1311

• Run Groovy scripts on the server
  – Easier than extending the Thrift API
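
The benefit of filtering result sets on the server can be illustrated without any Intravert specifics. The sketch below is plain Python and does not use or imitate the Intravert API; both functions are hypothetical and simply count how many rows would cross the network in each case.

```python
def client_side_filter(rows, predicate):
    """Every row is shipped to the client, which then discards most of them."""
    transferred = list(rows)
    kept = [row for row in transferred if predicate(row)]
    return kept, len(transferred)


def server_side_filter(rows, predicate):
    """The predicate runs next to the data, so only matching rows are shipped."""
    kept = [row for row in rows if predicate(row)]
    return kept, len(kept)
```

Both functions return the same rows; only the transfer count differs, and the gap grows with the length of the low-scoring tail.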

Page 23

Intravert

• Good trade-off between power and operational complexity

• More complex development cycle
  – Not easy to move code between client and server

• Client not topology aware
  – 'run x on each node' not possible

Page 24

Back to the clever algorithms

• Intravert server side execution enables cleaner, more efficient implementation

• Reduces network round trips

• Some dev and ops complexity increase

• Less complexity than custom server deployment
  – Reuse existing tools

Page 25

Pre-aggregation

• For text search, can't predict common term sets

• For time periods, can predict contiguous periods

• Pre-calculate the rollups
  – Hours, days, weeks, months
  – Reduces number of terms (peers) to group at query time
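
As a rough illustration of how rollups cut the number of peers, the sketch below covers a contiguous range of hourly buckets with whole-day rollups wherever a full day fits. The bucket numbering (hours counted from an arbitrary epoch, days as hour // 24) and the function name are assumptions made for this example; weeks and months would follow the same pattern.

```python
HOURS_PER_DAY = 24


def cover_with_rollups(start_hour, end_hour):
    """Cover [start_hour, end_hour) with as few partition keys as possible.

    Returns a list of ("day", day_index) and ("hour", hour_index) keys,
    using a pre-computed day rollup whenever a whole aligned day fits.
    """
    keys = []
    hour = start_hour
    while hour < end_hour:
        if hour % HOURS_PER_DAY == 0 and hour + HOURS_PER_DAY <= end_hour:
            keys.append(("day", hour // HOURS_PER_DAY))
            hour += HOURS_PER_DAY
        else:
            keys.append(("hour", hour))
            hour += 1
    return keys
```

A query over hours 22 to 73 then groups six keys (four hourly buckets and two day rollups) rather than 52 hourly ones.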

Page 26

Really clever algorithms

• Hierarchical node topology
  – Map to Cassandra ring: same node may own multiple keys (peers != nodes)

• Budget constrained approximate top-k
  – Get as close as possible with the allowable time and I/O constraints

• Fault tolerance
  – Approximation given available nodes

Page 27

Questions?

Or email us:

Jonathan Halliday, JBoss, [email protected]

Rui Vieira, Newcastle University, [email protected]