Lambda at Weather Scale - Cassandra Summit 2015

Lambda at Weather Scale Robbie Strickland

Page 1: Lambda at Weather Scale - Cassandra Summit 2015

Lambda at Weather Scale Robbie Strickland

Page 2: Lambda at Weather Scale - Cassandra Summit 2015

Who Am I?

Robbie Strickland
Director of Engineering, [email protected]
@rs_atl

Page 3: Lambda at Weather Scale - Cassandra Summit 2015

Who Am I?
● Contributor to C* community since 2010
● DataStax MVP 2014/15
● Author, Cassandra High Availability
● Founder, ATL Cassandra User Group

Page 4: Lambda at Weather Scale - Cassandra Summit 2015

About TWC

● ~30 billion API requests per day
● ~120 million active mobile users
● #3 most active mobile user base
● ~360 PB of traffic daily
● Most weather data comes from us

Page 5: Lambda at Weather Scale - Cassandra Summit 2015

Use Case
● Billions of events per day
  ○ Web/mobile beacons
  ○ Logs
  ○ Weather conditions + forecasts
  ○ etc.
● Keep data forever

Page 6: Lambda at Weather Scale - Cassandra Summit 2015

Use Case
● Efficient batch + streaming analysis
● Self-serve data science
● BI / visualization tool support

Page 7: Lambda at Weather Scale - Cassandra Summit 2015

Architecture

Page 8: Lambda at Weather Scale - Cassandra Summit 2015

Attempt[0] Architecture

[Architecture diagram. Labels, grouped by layer:
 Sources (Streaming Sources, Batch Sources): Events, 3rd Party, Other DBs, S3
 Ingestion: RESTful Enqueue service, Kafka, Custom Ingestion Pipeline, ETL
 Storage and Processing: Streaming, Stream Processing
 Data Access: SQL
 Consumers: Operational Analytics, Business Analytics, Executive Dashboards, Data Discovery, Data Science, 3rd Party System Integration]

Page 9: Lambda at Weather Scale - Cassandra Summit 2015

Attempt[0] Data Model

CREATE TABLE events (
  timebucket bigint,
  timestamp bigint,
  eventtype varchar,
  eventid varchar,
  platform varchar,
  userid varchar,
  version int,
  appid varchar,
  useragent varchar,
  eventdata varchar,
  tags set<varchar>,
  devicedata map<varchar, varchar>,
  PRIMARY KEY ((timebucket, eventtype), timestamp, eventid)
) WITH CACHING = 'none'
  AND COMPACTION = { 'class' : 'DateTieredCompactionStrategy' };

Page 10: Lambda at Weather Scale - Cassandra Summit 2015

Attempt[0] Data Model (same events table as on page 9)

Event payload (eventdata) == schema-less JSON

Page 11: Lambda at Weather Scale - Cassandra Summit 2015

Attempt[0] Data Model (same events table as on page 9)

Partitioned by time bucket + type:
  PRIMARY KEY ((timebucket, eventtype), timestamp, eventid)
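
A minimal sketch of how a (timebucket, eventtype) partition key could be derived from an event timestamp. The slides do not say how wide a bucket is or what unit the timestamp column uses, so the one-hour width, the millisecond input, and the function names are all assumptions made for illustration.

```python
# Assumption: buckets are one hour wide. The slides do not state the real width.
BUCKET_SECONDS = 3600


def time_bucket(timestamp_ms: int, bucket_seconds: int = BUCKET_SECONDS) -> int:
    """Return the start (epoch seconds) of the bucket containing timestamp_ms."""
    seconds = timestamp_ms // 1000
    return seconds - (seconds % bucket_seconds)


def partition_key(timestamp_ms: int, event_type: str) -> tuple:
    """Build the composite partition key (timebucket, eventtype)."""
    return (time_bucket(timestamp_ms), event_type)
```

Every event of one type that falls in the same hour lands in the same partition, so the bucket width directly bounds how large a partition can grow.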

Page 12: Lambda at Weather Scale - Cassandra Summit 2015

Attempt[0] Data Model (same events table as on page 9)

Time-series data good fit for DTCS:
  COMPACTION = { 'class' : 'DateTieredCompactionStrategy' }

Page 13: Lambda at Weather Scale - Cassandra Summit 2015

Attempt[0] tl;dr
● C* everywhere
● Streaming data via custom ingest process
● Kafka backed by RESTful service
● Batch data via Informatica
● Spark SQL through ODBC
● Schema-less event payload
● Date-tiered compaction

Page 15: Lambda at Weather Scale - Cassandra Summit 2015

Attempt[0] Lessons
● Batch loading large data sets into C* is silly
● … and expensive
● … and using Informatica to do it is SLOW
● Kafka + REST services == unnecessary
● No viable open source C* Hive driver
● DTCS is broken (see CASSANDRA-9666)

Page 16: Lambda at Weather Scale - Cassandra Summit 2015

Attempt[0] Lessons
● Schema-less == bad:
  ○ Must parse JSON to extract key data
  ○ Expensive to analyze by event type
  ○ Cannot tune by event type

Page 17: Lambda at Weather Scale - Cassandra Summit 2015

Attempt[1] Architecture

[Architecture diagram. Labels, grouped by layer:
 Sources (Streaming Sources, Batch Sources): Events, 3rd Party, Other DBs, S3
 Ingestion: Amazon SQS, Custom Ingestion Pipeline, ETL
 Data Lake: Long Term Raw Storage; Short Term Storage and Big Data Processing
 Processing: Streaming, Stream Processing
 Data Access: SQL
 Consumers: Operational Analytics, Business Analytics, Executive Dashboards, Data Discovery, Data Science, 3rd Party System Integration]

Page 18: Lambda at Weather Scale - Cassandra Summit 2015

Attempt[1] Data Model
● Each event type gets its own table
● Tables individually tuned based on workload
● Schema applied at ingestion:
  ○ We’re reading everything anyway
  ○ Makes subsequent analysis much easier
  ○ Allows us to filter junk early
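
Applying a schema at ingestion could look roughly like the sketch below. It is not the speaker's pipeline: the SCHEMAS mapping, the "pageview" fields, and apply_schema are hypothetical names chosen for illustration. The idea is only that each raw JSON event is parsed once, checked against the schema for its type, and dropped early if it is junk.

```python
import json

# Hypothetical per-event-type schemas: required field -> expected Python type.
SCHEMAS = {
    "pageview": {"eventid": str, "userid": str, "timestamp": int, "url": str},
}


def apply_schema(raw: str):
    """Parse one raw JSON event; return (event_type, row), or None for junk."""
    try:
        event = json.loads(raw)
    except json.JSONDecodeError:
        return None  # not valid JSON
    if not isinstance(event, dict):
        return None  # valid JSON, but not an object
    event_type = event.get("eventtype")
    if not isinstance(event_type, str):
        return None  # missing or malformed type tag
    schema = SCHEMAS.get(event_type)
    if schema is None:
        return None  # unknown event type
    row = {}
    for field, expected in schema.items():
        value = event.get(field)
        if not isinstance(value, expected):
            return None  # missing field or wrong type
        row[field] = value
    return event_type, row
```

The returned event type selects the destination table, and the row contains only the typed columns, so later analysis does not need to re-parse JSON.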

Page 19: Lambda at Weather Scale - Cassandra Summit 2015

Attempt[1] tl;dr
● Use C* for streaming data
  ○ Rolling time window (TTL depends on type)
  ○ Real-time access to events
  ○ Data locality makes Spark jobs faster
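
One way a per-type rolling window could be expressed is a table-level default TTL, generated from a small mapping. The slides do not say whether TTLs were set per table or per write, so this is an assumption; the event type names, the TTL values, and create_table_cql are hypothetical. The default_time_to_live table option itself is standard CQL.

```python
# Hypothetical retention windows per event type, in seconds.
TTL_SECONDS = {
    "pageview": 30 * 86400,  # assumed: 30 days
    "app_log": 7 * 86400,    # assumed: 7 days
}


def create_table_cql(event_type: str, columns_cql: str, primary_key_cql: str) -> str:
    """Render a CREATE TABLE statement whose rows expire after the type's TTL."""
    ttl = TTL_SECONDS[event_type]
    return (
        f"CREATE TABLE {event_type} ({columns_cql}, "
        f"PRIMARY KEY ({primary_key_cql})) "
        f"WITH default_time_to_live = {ttl}"
    )


print(create_table_cql(
    "app_log",
    "timebucket bigint, timestamp bigint, eventid varchar, message varchar",
    "(timebucket), timestamp, eventid",
))
```

Because each event type has its own table, the retention window becomes a per-table setting rather than something every writer has to remember.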

Page 20: Lambda at Weather Scale - Cassandra Summit 2015

Attempt[1] tl;dr
● Everything else in S3
  ○ Batch data loads (mostly logs)
  ○ Daily C* backups
  ○ Stored as Parquet
  ○ Cheap, scalable long-term storage
  ○ Easy access from Spark
  ○ Easy to share internally & externally
  ○ Open source Hive support

Page 21: Lambda at Weather Scale - Cassandra Summit 2015

Attempt[1] tl;dr
● Kafka replaced by SQS:
  ○ Scalable & reliable
  ○ Already fronted by a RESTful interface
  ○ Nearly free to operate (nothing to manage)
  ○ Robust security model
  ○ One queue per event type/platform
  ○ Built-in monitoring

Page 22: Lambda at Weather Scale - Cassandra Summit 2015

Attempt[1] tl;dr
● STCS in lieu of DTCS (and LCS)
  ○ Because it’s bulletproof
  ○ Partitions spanning sstables is acceptable
  ○ Testing Time-Window compaction (thanks Jeff Jirsa)

Page 24: Lambda at Weather Scale - Cassandra Summit 2015

Fine Print
● Use C* >= 2.1.8
  ○ CASSANDRA-9637 - fixes Spark input split computation
  ○ CASSANDRA-9549 - fixes memory leak
  ○ CASSANDRA-9436 - exposes rpc/broadcast addresses for Spark/cloud environments
● Version incompatibilities abound (check sbt file for Spark-Cassandra connector)

Page 25: Lambda at Weather Scale - Cassandra Summit 2015

Fine Print
● Two main Spark clusters:
  ○ Co-located with C* for heavy analysis
    ■ Predictable load
    ■ Efficient C* access
  ○ Self-serve in same DC but not co-located
    ■ Unpredictable load
    ■ Favors mining S3 data
    ■ Isolated from production jobs

Page 26: Lambda at Weather Scale - Cassandra Summit 2015

Data Modeling

Page 27: Lambda at Weather Scale - Cassandra Summit 2015

Partitioning
● Opposite strategy from “normal” C* modeling
  ○ Model for good parallelism
  ○ … not for single-partition queries
● Avoid shuffling for most cases
  ○ Shuffles occur when NOT grouping by partition key
  ○ Partition for your most common grouping
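
The shuffle point can be illustrated with a toy model in plain Python. This is not Spark and not the connector's API: each dictionary entry stands in for one C* partition, and the data and function names are invented for the example.

```python
from collections import Counter

# Toy data: partition key (event type) -> rows (platform, userid) stored in it.
partitions = {
    "pageview": [("ios", "u1"), ("android", "u2"), ("ios", "u3")],
    "click": [("ios", "u1"), ("web", "u4")],
}


def count_by_partition_key(parts):
    """Each result is computed from a single partition; nothing moves between them."""
    return {key: len(rows) for key, rows in parts.items()}


def count_by_platform(parts):
    """Grouping by a non-key column: every partition yields partial counts
    that must then be merged across partitions, which is what a shuffle does."""
    partials = [Counter(platform for platform, _ in rows) for rows in parts.values()]
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)
```

Grouping by the partition key finishes inside each partition, while grouping by platform needs the cross-partition merge. On a real cluster that merge means moving data over the network, which is why the slide recommends partitioning for the most common grouping.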

Page 28: Lambda at Weather Scale - Cassandra Summit 2015

Secondary Indexes
● Useful for C*-level filtering
● Reduces Spark workload and RAM footprint
● Low cardinality is still the rule

Page 29: Lambda at Weather Scale - Cassandra Summit 2015

Secondary Indexes (Client Access)

Page 30: Lambda at Weather Scale - Cassandra Summit 2015

Secondary Indexes (with Spark)

Page 31: Lambda at Weather Scale - Cassandra Summit 2015

Full-text Indexes
● Enabled via Stratio-Lucene custom index (https://github.com/Stratio/cassandra-lucene-index)
● Great for C*-side filters
● Same access pattern as secondary indexes

Page 32: Lambda at Weather Scale - Cassandra Summit 2015

Full-text Indexes

CREATE CUSTOM INDEX email_index ON emails (lucene)
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
  'refresh_seconds' : '1',
  'schema' : '{
    fields : {
      id      : {type : "integer"},
      user    : {type : "string"},
      subject : {type : "text", analyzer : "english"},
      body    : {type : "text", analyzer : "english"},
      time    : {type : "date", pattern : "yyyy-MM-dd hh:mm:ss"}
    }
  }'
};

Page 33: Lambda at Weather Scale - Cassandra Summit 2015

Full-text Indexes

SELECT * FROM emails WHERE lucene = '{
  filter : {type : "range", field : "time", lower : "2015-05-26 20:29:59"},
  query  : {type : "phrase", field : "subject", values : ["test"]}
}';

SELECT * FROM emails WHERE lucene = '{
  filter : {type : "range", field : "time", lower : "2015-05-26 18:29:59"},
  query  : {type : "fuzzy", field : "subject", value : "thingy", max_edits : 1}
}';

Page 34: Lambda at Weather Scale - Cassandra Summit 2015

Caution: WIDE ROWS

Page 35: Lambda at Weather Scale - Cassandra Summit 2015

Wide Rows
● It only takes one to ruin your day
● Monitor cfstats for max partition bytes
● Use toppartitions to find hot keys

Page 36: Lambda at Weather Scale - Cassandra Summit 2015

Avoid Nulls
● Nulls are deletes
● Deletes create tombstones
● Don’t write nulls!
● Beware of nulls in prepared statements
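
One way to keep nulls out of writes is to build each INSERT from only the columns that actually have values, rather than binding None to a fixed prepared statement. The sketch below is an illustration, not the speaker's code; build_insert and the sample row are hypothetical, and the table and column names are assumed to be trusted identifiers, never user input.

```python
def build_insert(table: str, row: dict):
    """Return (cql, values) covering only the columns whose value is not None."""
    present = {col: val for col, val in row.items() if val is not None}
    if not present:
        raise ValueError("row has no non-null columns")
    columns = ", ".join(present)
    placeholders = ", ".join("?" for _ in present)
    cql = f"INSERT INTO {table} ({columns}) VALUES ({placeholders})"
    return cql, list(present.values())


cql, values = build_insert(
    "events", {"eventid": "e1", "userid": None, "platform": "ios"}
)
print(cql)     # INSERT INTO events (eventid, platform) VALUES (?, ?)
print(values)  # ['e1', 'ios']
```

The trade-off is one prepared statement per distinct combination of columns, which is usually a small, cacheable set when each event type has its own table.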

Page 37: Lambda at Weather Scale - Cassandra Summit 2015

Data Exploration

Page 38: Lambda at Weather Scale - Cassandra Summit 2015

Data Warehouse Paradigm - Old

Ingest → Model → Transform → Design → Visualize

Page 39: Lambda at Weather Scale - Cassandra Summit 2015

Data Warehouse Paradigm - New

Ingest → Explore → Analyze → Deploy → Visualize

Page 40: Lambda at Weather Scale - Cassandra Summit 2015

Visualization
● Critical to understanding your data
● Reduced time to visualization
● … from >1 month to minutes (!!)
● Waterfall to agile

Page 41: Lambda at Weather Scale - Cassandra Summit 2015

Zeppelin
● Open source Spark notebook
● Interpreters for Scala, Python, Spark SQL, CQL, Hive, Shell, & more
● Data visualizations
● Scheduled jobs

Page 42: Lambda at Weather Scale - Cassandra Summit 2015

Zeppelin

Page 43: Lambda at Weather Scale - Cassandra Summit 2015

Zeppelin

Page 44: Lambda at Weather Scale - Cassandra Summit 2015

Zeppelin

Page 45: Lambda at Weather Scale - Cassandra Summit 2015

Final Thoughts

Page 46: Lambda at Weather Scale - Cassandra Summit 2015

Should I use DSE?
● Open source culture?
● On-staff C* expert(s)?
● Willingness to contribute/fix stuff?
● Moderate degree of risk is acceptable?
● Need/desire for latest features?
● Need/desire to control tool versions?
● Don’t have the budget for licensing?

Page 47: Lambda at Weather Scale - Cassandra Summit 2015

We’re Hiring!

Robbie [email protected]