C* Summit EU 2013: Top-K Queries in Realtime with Cassandra and Intravert
#CassandraEU
Top-k queries in real-time with Cassandra and Intravert
Jonathan Halliday, [email protected]
Rui Vieira, Newcastle [email protected]
#CassandraEU
What is Top-k ?
#CassandraEU
Top-k queries
• Rank matching results for the term(s)
– We don't really care about the scoring algorithm
• Application: text search
– Documents containing the search words
• Application: log analysis
– Popular URLs in the time period
#CassandraEU
yawn ?
• SELECT document_id, score FROM data WHERE term = 'top-k' ORDER BY score DESC, document_id LIMIT 100
• Lunch time!
#CassandraEU
Not so fast...
• SELECT document_id, SUM(score) AS score FROM data WHERE term IN ('top-k', 'algorithm') GROUP BY document_id ORDER BY score DESC, document_id LIMIT 100
#CassandraEU
Distributed Top-k
• We have a lot of data
• It's spread out
• We need to combine a subset efficiently
• Map/Reduce to the rescue!
– HiveQL, Stinger, Impala, Hawq
• Easy! But not fast
#CassandraEU
'real-time'
• Web pages, not control systems
• Performance, not Timeliness
• Pre-compute as much as possible
– scores for each term
• Assemble pre-computed fragments at query time
– 'group by'
#CassandraEU
Naive method
foreach(term in searchTerms) {
    SELECT ... FROM ... WHERE ...
}
• Handle group by in the application code
• Inefficient – transfers ALL the data for each term, even low scores
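The naive method can be sketched in a few lines of Python. This is only an illustration: `ROWS` and `fetch_all` are hypothetical stand-ins for one SELECT per term, and the scores are made up. It shows that every row for every term crosses the network before the application does the 'group by'.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for the rows one SELECT per term would return.
ROWS = {
    "top-k": [("doc1", 90), ("doc2", 50), ("doc3", 10)],
    "algorithm": [("doc2", 80), ("doc3", 70), ("doc4", 20)],
}

def fetch_all(term):
    """Stand-in for one query per term; returns every (doc_id, score) row."""
    return ROWS.get(term, [])

def naive_top_k(search_terms, k):
    """Fetch everything, then group by doc_id and rank in application code."""
    totals = defaultdict(int)
    rows_transferred = 0
    for term in search_terms:
        for doc_id, score in fetch_all(term):
            totals[doc_id] += score
            rows_transferred += 1
    # Same ordering as the SQL: score DESC, then document id.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k], rows_transferred
```

Calling `naive_top_k(["top-k", "algorithm"], 2)` transfers all six rows even though only two results are wanted, which is the inefficiency the slide points at.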
#CassandraEU
How much data is enough?
• Data is stored keyed (i.e. sorted) by { term, score DESC, doc_id }
or { time_period, score DESC, Url }
• Partition keys IN the query params
– We can filter efficiently
• Can we range limit on score?
– Avoid going into the long tail
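A small sketch of why the sort order matters, assuming a hypothetical `PARTITION` list that stands in for one term's rows already ordered by score DESC, doc_id. With that layout, a range limit on score is just a prefix read that stops at the first row below the cut-off.

```python
from itertools import takewhile

# Hypothetical rows for one partition (one term), already sorted by
# score DESC, doc_id, mirroring the { term, score DESC, doc_id } key layout.
PARTITION = [("doc7", 95), ("doc2", 80), ("doc9", 80), ("doc4", 30), ("doc1", 5)]

def fetch_above(rows, min_score):
    """Return the sorted prefix with score >= min_score.

    Reading stops at the first row below the threshold, so the long tail
    of low scores is never touched.
    """
    return list(takewhile(lambda row: row[1] >= min_score, rows))
```

For example, `fetch_above(PARTITION, 80)` reads three rows and never looks at `doc4` or `doc1`.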
#CassandraEU
Bring on the clever algorithms
• Smart People thought about this problem already...
• ...but not in quite the same context
– WAN distributed logs from CDNs
• Identify, adapt and reuse existing solutions
– faster and less risky than starting over
#CassandraEU
Inside a clever algorithm
• Fetch a little bit of data
• Look at it, decide how much more we need
• Fetch some more
• Rinse and repeat
– but not too many times.
#CassandraEU
Desirable Characteristics
• Fixed number of communication rounds is key
• Generality is good
– Cope with any distribution of data
• So is flexibility
– Tune for different use cases
#CassandraEU
Meet the candidates
Three-Phase Uniform Threshold (TPUT): 'Efficient Top-K Query Calculation in Distributed Networks', Stanford/Princeton, 2004
Hybrid Threshold: 'Efficient Processing of Distributed Top-k Queries', UCSB, 2005
KLEE: 'KLEE: a framework for distributed top-k query algorithms', Max-Planck Institute, 2005
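To make the first candidate concrete, here is a minimal single-process sketch of the three TPUT phases as they are commonly described. It is not the speakers' implementation. It assumes each peer is a plain dict of item to non-negative score, and the function and variable names are illustrative only.

```python
import heapq
from collections import defaultdict

def kth_largest(values, k):
    """k-th largest value, or 0 when fewer than k values exist."""
    top = heapq.nlargest(k, values)
    return top[-1] if len(top) == k else 0

def tput(peers, k):
    """peers: list of dicts mapping item -> non-negative score.

    Returns the exact top-k by summed score using three fixed rounds.
    """
    m = len(peers)
    partial = defaultdict(float)   # partial sums known to the coordinator
    seen = defaultdict(set)        # which peers have reported each item

    # Phase 1: each peer sends its local top-k.
    for i, peer in enumerate(peers):
        for item, score in heapq.nlargest(k, peer.items(), key=lambda kv: kv[1]):
            partial[item] += score
            seen[item].add(i)
    threshold = kth_largest(partial.values(), k) / m

    # Phase 2: each peer sends every not-yet-sent item scoring >= threshold.
    for i, peer in enumerate(peers):
        for item, score in peer.items():
            if score >= threshold and i not in seen[item]:
                partial[item] += score
                seen[item].add(i)
    tau2 = kth_largest(partial.values(), k)
    # A peer that stayed silent about an item holds less than `threshold` for it.
    candidates = [item for item in partial
                  if partial[item] + threshold * (m - len(seen[item])) >= tau2]

    # Phase 3: fetch exact scores for the surviving candidates only.
    exact = {item: sum(peer.get(item, 0) for peer in peers) for item in candidates}
    return sorted(exact.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
```

The number of rounds is fixed at three regardless of the data, and the uniform threshold is what keeps phase 2 out of the long tail.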
#CassandraEU
Implementation Issues
• Algorithms assume server side code execution
• Limitations of CQL3 add some round trips, increase network I/O
• Previous performance comparisons of algorithms may no longer be valid
#CassandraEU
Data Transfer vs. k
#CassandraEU
Execution Time vs. k
#CassandraEU
Execution Time vs. peers
#CassandraEU
YMMV
• Test with your own data
• Test with your own hardware
• Hybrid Threshold for exact top-k
– Intravert optional
• KLEE for tunable approximate top-k
– Inefficient without Intravert
– Requires metadata
#CassandraEU
Intravert
• Cassandra++
– Embed and extend the existing server
– Based on Vert.x
• JSON over HTTP, REST API
– yup, Virgil did that already
• Multiple commands per call, chain operations with REFs
#CassandraEU
Intravert
• Server side code execution
– Groovy (for now; Vert.x is polyglot)
• Filter result sets
• Write path triggers
– C* 2.0 has CASSANDRA-1311
• Run Groovy scripts on the server
– Easier than extending the Thrift API
#CassandraEU
Intravert
• Good trade-off between power and operational complexity
• More complex development cycle
– Not easy to move code between client and server
• Client not topology aware
– 'run x on each node' not possible
#CassandraEU
Back to the clever algorithms
• Intravert server side execution enables cleaner, more efficient implementation
• Reduces network round trips
• Some dev and ops complexity increase
• Less complexity than custom server deployment
– Reuse existing tools
#CassandraEU
Pre-aggregation
• For text search, can't predict common term sets
• For time periods, can predict contiguous periods
• Pre-calculate the rollups
– Hours, days, weeks, months
– Reduces number of terms (peers) to group at query time
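As a rough illustration of the rollup idea, the sketch below folds hourly hit counts into daily ones so that a query over one day reads a single pre-aggregated bucket instead of grouping 24 hourly ones. The data layout (a dict of hour to `Counter`) and the function names are assumptions made for the example, not part of the talk.

```python
from collections import Counter, defaultdict
from datetime import datetime

def rollup_by_day(hourly):
    """hourly: dict mapping an hour-aligned datetime -> Counter of url -> hits.

    Returns a dict mapping date -> Counter with the hourly counts summed.
    """
    daily = defaultdict(Counter)
    for hour, counts in hourly.items():
        daily[hour.date()].update(counts)  # Counter.update adds counts
    return daily

def top_k(counts, k):
    """Rank one pre-aggregated bucket: hits DESC, then url."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
```

The same folding step could be repeated from days to weeks and months, trading extra writes for fewer peers at query time.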
#CassandraEU
Really clever algorithms
• Hierarchical node topology
– Map to Cassandra ring: same node may own multiple keys (peers != nodes)
• Budget constrained approximate top-k
– Get as close as possible with the allowable time and I/O constraints
• Fault tolerance
– Approximation given available nodes
#CassandraEU
Questions?
Or email us:
Jonathan Halliday, [email protected]
Rui Vieira, Newcastle [email protected]