Analytics with Cassandra, Spark & MLLib - Cassandra Essentials Day
Cassandra Data Maintenance with Spark
Uploaded by datastax-academy
Transcript of Cassandra Data Maintenance with Spark
import com.datastax.spark.connector._

def create_fake_record(num: Int) =
  (num, 1453389992000L + num, s"My Token $num", s"My Session Data$num")

sc.parallelize(1 to 1000000)
  .map(create_fake_record)
  .repartitionByCassandraReplica("maintdemo", "user_visits", 10)
  .saveToCassandra("maintdemo", "user_visits")
THREE BASIC PATTERNS
• Read - Transform - Write (1:1) - .map()
• Read - Transform - Write (1:m) - .flatMap()
• Read - Filter - Delete (m:1) - it’s complicated
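A minimal sketch of the first two patterns on a plain Scala collection; with the connector the same .map/.flatMap calls apply to an RDD read via sc.cassandraTable(...) and written back with .saveToCassandra(...). The row shapes and values here are illustrative.

```scala
val rows = Seq(("alice", 3), ("bob", 5))

// Read - Transform - Write (1:1): exactly one output row per input row
val doubled = rows.map { case (user, visits) => (user, visits * 2) }

// Read - Transform - Write (1:m): zero or more output rows per input row
val expanded = rows.flatMap { case (user, visits) =>
  (1 to visits).map(n => (user, n))
}
```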
DELETES ARE TRICKY
• Keep tombstones in mind
• Select the records you want to delete, then loop over them and issue deletes through the driver
• OR select the records you want to keep, rewrite them, then delete the partitions they lived in with a write timestamp IN THE PAST… so the rewritten rows survive the delete
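A sketch of the select-then-delete pattern. Only the statement-building step is shown; the table name maintdemo.user_visits and its (userid, last_access) key are assumptions for illustration. In a real job the rows come from a filtered sc.cassandraTable(...) read, and the statements are executed through the driver session (e.g. inside CassandraConnector(sc.getConf).withSessionDo { ... }), preferably as prepared statements rather than interpolated strings.

```scala
// Illustrative row shape for rows selected for deletion
case class Visit(userid: String, lastAccess: Long)

val toDelete = Seq(Visit("alice", 100L), Visit("bob", 200L))

// One DELETE per selected record; in production, prepare once and bind values
val deletes = toDelete.map { v =>
  s"DELETE FROM maintdemo.user_visits WHERE userid = '${v.userid}' AND last_access = ${v.lastAccess}"
}
```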
PREDICATE PUSHDOWN
• Use Cassandra-level filtering at every opportunity
• With DSE, benefit from predicate pushdown to solr_query
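A hedged sketch of the difference: the connector's .where() becomes part of the CQL SELECT, so filtering happens in Cassandra before rows leave the node, while RDD .filter() pulls the whole table into Spark first. Keyspace, table, and column names are illustrative.

```scala
// Cutoff for the pushed-down predicate (illustrative value)
val cutoff = 1453389992000L
val whereClause = s"last_access > $cutoff"

// Filters in Spark, AFTER reading the whole table:
//   sc.cassandraTable("maintdemo", "user_visits")
//     .filter(row => row.getLong("last_access") > cutoff)

// Filters in Cassandra, BEFORE data leaves the node:
//   sc.cassandraTable("maintdemo", "user_visits").where(whereClause)
```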
TIPS & TRICKS
• .spanBy( partition key ) - work on one Cassandra partition at a time
• .repartitionByCassandraReplica()
• tune spark.cassandra.output.throughput_mb_per_sec to throttle writes
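The throttle is a per-core cap in MB/s set on the Spark configuration. A sketch, with an illustrative value; the map below stands in for the .set(...) calls you would make on a SparkConf when building the context.

```scala
// Settings to apply via SparkConf.set(...) when constructing the SparkContext;
// "5" caps connector writes at 5 MB/s per core (illustrative value)
val settings = Map(
  "spark.cassandra.output.throughput_mb_per_sec" -> "5"
)
```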
USE CASE : TRIM USER HISTORY
• Cassandra Data Model: PRIMARY KEY( userid, last_access )
• Keep last X records
• .spanBy( partitionKey ), then flatMap over each partition's Seq, filtering down to the records to keep
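The trim logic can be sketched on plain collections; groupBy below stands in for what .spanBy(row => row.getString("userid")) yields per Cassandra partition. Rows are assumed to be (userid, last_access) pairs, and "newest" means largest last_access. The surviving rows would then be rewritten and the old partition deleted as described above.

```scala
val keepLast = 3
val history: Seq[(String, Long)] = Seq(
  ("alice", 1L), ("alice", 2L), ("alice", 3L), ("alice", 4L), ("alice", 5L),
  ("bob", 1L), ("bob", 2L)
)

// Per user (per Cassandra partition), keep the newest keepLast records;
// everything else is a candidate for deletion
val toDrop = history
  .groupBy(_._1)
  .toSeq
  .flatMap { case (_, rows) => rows.sortBy(_._2).dropRight(keepLast) }
```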
USE CASE: PUBLISH DATA
• Cassandra Data Model: publish_date field
• filter by date, map to a new RDD matching the destination table, saveToCassandra()
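The publish pattern, sketched on plain collections with an assumed source row shape (id, publish_date, body); with the connector this is the same filter/map chain on an RDD, ending in .saveToCassandra(...) against the destination table.

```scala
// Illustrative source row; column names are assumptions
case class Draft(id: Int, publishDate: Long, body: String)

val now = 1000L
val drafts = Seq(Draft(1, 500L, "a"), Draft(2, 2000L, "b"))

val published = drafts
  .filter(_.publishDate <= now)   // only rows whose publish_date has passed
  .map(d => (d.id, d.body))       // shape matching the destination table
```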