Lambda at Weather Scale by Robbie Strickland

Posted on 16-Apr-2017



Who Am I?

Robbie Strickland
Director of Engineering, Analytics
rstrickland@weather.com
@rs_atl

An IBM Business

Who Am I?

• Contributor to C* community since 2010
• DataStax MVP 2014/15
• Author, Cassandra High Availability
• Founder, ATL Cassandra User Group


About TWC

~30 billion API requests per day

~120 million active mobile users

#3 most active mobile user base

~360 PB of traffic daily

Most weather data comes from us

Use Case

Billions of events per day (~1.3M per sec):
• Web/mobile beacons
• Logs
• Weather conditions + forecasts
• etc.

Keep data forever

Use Case

• Efficient batch + streaming analysis
• Self-serve data science
• BI / visualization tool support

Architecture

Attempt[0] Architecture

[Architecture diagram; recoverable labels, grouped:]
• Streaming sources (events, 3rd party) → RESTful enqueue service → Kafka → custom ingestion pipeline (stream processing)
• Batch sources (S3, other DBs, 3rd party) → ETL
• Storage and processing
• Data access: SQL, streaming
• Consumers: operational analytics, business analytics, executive dashboards, data discovery, data science, 3rd party system integration

Attempt[0] Data Model

CREATE TABLE events (
  timebucket bigint,
  timestamp bigint,
  eventtype varchar,
  eventid varchar,
  platform varchar,
  userid varchar,
  version int,
  appid varchar,
  useragent varchar,
  eventdata varchar,
  tags set<varchar>,
  devicedata map<varchar, varchar>,
  PRIMARY KEY ((timebucket, eventtype), timestamp, eventid)
) WITH CACHING = 'none'
AND COMPACTION = { 'class' : 'DateTieredCompactionStrategy' };

• Event payload == schema-less JSON
• Partitioned by time bucket + type
• Time-series data is a good fit for DTCS
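The slides don't say how the `timebucket` partition key is derived; a minimal sketch, assuming hourly buckets computed from the event's epoch-millisecond timestamp:

```python
from datetime import datetime, timezone

BUCKET_SECONDS = 3600  # assumption: hourly partitions; the real bucket size isn't in the deck

def time_bucket(ts_millis: int, bucket_seconds: int = BUCKET_SECONDS) -> int:
    """Truncate an epoch-millisecond event timestamp to its bucket start (epoch seconds)."""
    return (ts_millis // 1000) // bucket_seconds * bucket_seconds

# All events in the same hour (of the same type) land in one partition:
ts = int(datetime(2017, 4, 16, 10, 42, 7, tzinfo=timezone.utc).timestamp() * 1000)
print(time_bucket(ts))
```

Any deterministic truncation works; the point is that (timebucket, eventtype) bounds partition size by time window and event type.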

Attempt[0] tl;dr

• C* everywhere
• Streaming data via custom ingest process
• Kafka backed by RESTful service
• Batch data via Informatica
• Spark SQL through ODBC
• Schema-less event payload
• Date-tiered compaction

Attempt[0] Lessons

• Batch loading large data sets into C* is silly… and expensive… and using Informatica to do it is SLOW
• Kafka + REST services == unnecessary
• No viable open source C* Hive driver
• DTCS is broken (see CASSANDRA-9666)
• Schema-less == bad:
  - Must parse JSON to extract key data
  - Expensive to analyze by event type
  - Cannot tune by event type

Attempt[1] Architecture

[Architecture diagram; recoverable labels, grouped:]
• Streaming sources (events, 3rd party) → Amazon SQS → custom ingestion pipeline (stream processing)
• Batch sources (S3, other DBs, 3rd party) → ETL
• Long term raw storage (data lake)
• Short term storage and big data processing
• Data access: SQL, streaming
• Consumers: operational analytics, business analytics, executive dashboards, data discovery, data science, 3rd party system integration

Attempt[1] Data Model

• Each event type gets its own table
• Tables individually tuned based on workload
• Schema applied at ingestion:
  - We're reading everything anyway
  - Makes subsequent analysis much easier
  - Allows us to filter junk early

Attempt[1] tl;dr

Use C* for streaming data:
• Rolling time window (TTL depends on type)
• Real-time access to events
• Data locality makes Spark jobs faster

Attempt[1] tl;dr

Everything else in S3:
• Batch data loads (mostly logs)
• Daily C* backups
• Stored as Parquet
• Cheap, scalable long-term storage
• Easy access from Spark
• Easy to share internally & externally
• Open source Hive support

Attempt[1] tl;dr

Kafka replaced by SQS:
• Scalable & reliable
• Already fronted by a RESTful interface
• Nearly free to operate (nothing to manage)
• Robust security model
• One queue per event type/platform
• Built-in monitoring
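The deck doesn't show the enqueue code; a minimal sketch of what producing one event to a per-type SQS queue might look like. The message field names (`eventtype`, `platform`, `data`) and queue URL are assumptions, not TWC's actual format:

```python
import json

def build_event_message(event_type: str, platform: str, payload: dict) -> dict:
    """Build the kwargs for an SQS SendMessage call for one event.

    Field names are hypothetical; the deck only says there is one queue
    per event type/platform.
    """
    return {
        "MessageBody": json.dumps(
            {"eventtype": event_type, "platform": platform, "data": payload}
        ),
    }

msg = build_event_message("beacon", "ios", {"lat": 33.7})

# Actually sending requires boto3 and AWS credentials (not run here):
# import boto3
# sqs = boto3.client("sqs")
# sqs.send_message(QueueUrl=QUEUE_URL, **msg)
```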

Attempt[1] tl;dr

DTCS replaced by Time-Window Compaction:
• Developed by Jeff Jirsa at CrowdStrike
• Groups similar timestamps/expirations together
• Simply deletes expired sstables
• Improved stability & throughput
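A sketch of what switching the events table to TWCS might look like. The window size is an assumption; the option names are those of the upstream strategy later merged via CASSANDRA-9666 (on C* 2.1/2.2 it shipped as a separate CrowdStrike jar with a different class path):

```sql
-- Hypothetical: switch a per-event-type table from DTCS to TWCS.
ALTER TABLE events WITH COMPACTION = {
  'class' : 'TimeWindowCompactionStrategy',
  'compaction_window_unit' : 'DAYS',   -- assumption: daily windows
  'compaction_window_size' : 1
};
```

With TTL'd time-series data, each window's sstables expire together, so whole sstables can simply be dropped.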

Fine Print

Use C* >= 2.1.8:
• CASSANDRA-9637 - fixes Spark input split computation
• CASSANDRA-9549 - fixes memory leak
• CASSANDRA-9436 - exposes rpc/broadcast addresses for Spark/cloud environments

Version incompatibilities abound (check sbt file for Spark-Cassandra connector)

Fine Print

Two main Spark clusters:
• Co-located with C* for heavy analysis
  - Predictable load
  - Efficient C* access
• Self-serve in same DC but not co-located
  - Unpredictable load
  - Favors mining S3 data
  - Isolated from production jobs

Data Modeling

Partitioning

Opposite strategy from “normal” C* modeling:
• Model for good parallelism… not for single-partition queries
• Avoid shuffling for most cases
  - Shuffles occur when NOT grouping by partition key
  - Partition for your most common grouping
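The shuffle-avoidance point above can be sketched locally. In the deck's model the partition key is (timebucket, eventtype); any aggregation keyed by that pair never needs to combine rows from different partitions, which is what lets the Spark-Cassandra connector serve it without a shuffle. A toy illustration (plain Python, no Spark):

```python
from collections import defaultdict

# Toy events; partition key is (timebucket, eventtype) as in the deck's schema.
events = [
    {"timebucket": 3600, "eventtype": "beacon", "userid": "a"},
    {"timebucket": 3600, "eventtype": "beacon", "userid": "b"},
    {"timebucket": 3600, "eventtype": "log", "userid": "a"},
]

# Counting by partition key: every row contributing to a given count already
# shares that key, i.e. lives in one C* partition on one node, so no
# cross-node data movement is needed.
counts = defaultdict(int)
for e in events:
    counts[(e["timebucket"], e["eventtype"])] += 1

print(dict(counts))  # {(3600, 'beacon'): 2, (3600, 'log'): 1}
```

Grouping by anything else (say, `userid`) would require gathering rows across partitions, which is exactly the shuffle the slides warn about.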

Secondary Indexes

• Useful for C*-level filtering
• Reduces Spark workload and RAM footprint
• Low cardinality is still the rule

Secondary Indexes (Client Access)

Secondary Indexes (with Spark)

Full-text Indexes

• Enabled via Stratio-Lucene custom index (https://github.com/Stratio/cassandra-lucene-index)
• Great for C*-side filters
• Same access pattern as secondary indexes

Full-text Indexes

CREATE CUSTOM INDEX email_index ON emails (lucene)
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
  'refresh_seconds' : '1',
  'schema' : '{
    fields : {
      id      : {type : "integer"},
      user    : {type : "string"},
      subject : {type : "text", analyzer : "english"},
      body    : {type : "text", analyzer : "english"},
      time    : {type : "date", pattern : "yyyy-MM-dd hh:mm:ss"}
    }
  }'
};

Full-text Indexes

SELECT * FROM emails WHERE lucene = '{
  filter : {type:"range", field:"time", lower:"2015-05-26 20:29:59"},
  query  : {type:"phrase", field:"subject", values:["test"]}
}';

SELECT * FROM emails WHERE lucene = '{
  filter : {type:"range", field:"time", lower:"2015-05-26 18:29:59"},
  query  : {type:"fuzzy", field:"subject", value:"thingy", max_edits:1}
}';
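Hand-writing the search expression string is error-prone; a small sketch of building it with `json.dumps` so the JSON handed to the `lucene` pseudo-column is always well-formed (the driver call at the end is illustrative and not run):

```python
import json

def lucene_expr(filter_clause: dict, query_clause: dict) -> str:
    """Serialize a Stratio-Lucene search expression for the 'lucene' pseudo-column."""
    return json.dumps({"filter": filter_clause, "query": query_clause})

expr = lucene_expr(
    {"type": "range", "field": "time", "lower": "2015-05-26 20:29:59"},
    {"type": "phrase", "field": "subject", "values": ["test"]},
)

# With the DataStax Python driver (not run here):
# session.execute("SELECT * FROM emails WHERE lucene = %s", [expr])
```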

Caution: Wide Rows

• It only takes one to ruin your day
• Monitor cfstats for max partition bytes
• Use toppartitions to find hot keys

Avoid Nulls

• Nulls are deletes
• Deletes create tombstones
• Don’t write nulls!
• Beware of nulls in prepared statements
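A minimal sketch of the "don't write nulls" rule: drop None-valued columns before binding the insert, so unset fields are simply omitted rather than written as tombstones. The statement-building at the end is illustrative, not TWC's actual ingest code:

```python
def strip_nulls(row: dict) -> dict:
    """Drop None-valued columns so the INSERT never writes tombstones."""
    return {k: v for k, v in row.items() if v is not None}

row = {"eventid": "e1", "platform": "ios", "useragent": None, "appid": None}
clean = strip_nulls(row)
print(clean)  # {'eventid': 'e1', 'platform': 'ios'}

# Bind only the surviving columns (shape is illustrative):
# cols = ", ".join(clean)
# marks = ", ".join(["%s"] * len(clean))
# session.execute(f"INSERT INTO events ({cols}) VALUES ({marks})", list(clean.values()))
```

This matters most with prepared statements, where binding None for an unset column writes a null (a delete) instead of leaving the column alone.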

Data Exploration

Data Warehouse Paradigm - Old

Ingest → Model → Transform → Design → Visualize

Data Warehouse Paradigm - New

Ingest → Explore → Analyze → Deploy → Visualize

Visualization

• Critical to understanding your data
• Reduced time to visualization… from >1 month to minutes (!!)
• Waterfall to agile

Zeppelin

• Open source Spark notebook
• Interpreters for Scala, Python, Spark SQL, CQL, Hive, Shell, & more
• Data visualizations
• Scheduled jobs


Future Work

FiloDB

• Low latency time-series aggregations using Spark + Cassandra/in-memory storage
• Space efficient – similar to Parquet
• SQL queries using ODBC/JDBC

Direct to Parquet

• Stream to Parquet directly
• Eliminate interim storage
• Currently in R&D

We’re Hiring!

Robbie Strickland
rstrickland@weather.com