High Throughput Analytics with Cassandra & Azure


Charles Lamanna, Principal Dev Lead (@clamanna)

MetricsHub: keep cloud services up and running for the lowest possible cost

• Live Status
• Cost Awareness
• Alerts and Notifications
• Actions and Scaling

2000+ customers in 6 months

[Chart: customer count growing from 0 to 2000+ between 10/18/2012 and 6/25/2013]

Storing data: 200M data points per hour; 80,000 data points per second (peak)

Planning for huge data ingestion rates: requires high-scale, real-time data

• 1,000 data points per minute per VM
• 12 data points per endpoint per minute

Aggregate, analyze and take actions based on this data stream (in near real-time)

Must be cheap, scalable and reliable
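Sanity-checking those rates (a back-of-envelope sketch; the arithmetic is mine, not from the deck):

# Rough check on the ingestion figures quoted above.
points_per_hour = 200_000_000
avg_per_sec = points_per_hour / 3600          # ~55,556 points/sec on average
peak_per_sec = 80_000                         # quoted peak
print(f"avg {avg_per_sec:,.0f}/sec; peak is {peak_per_sec / avg_per_sec:.2f}x avg")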

Evaluated several technologies:

• Aggregation in memory: good performance, bad COGS
• Rolling tables for aggregates: good tooling/support, hard to scale
• Aggregation on write: easy to scale and good COGS

Cassandra upside

• Scales fluidly: grows horizontally (double the nodes, double the capacity); add/remove nodes with no downtime
• Highly available: no single point of failure; replication factor (i.e. hot copies) is just a config switch (see the sketch below)

… and by the way:

• Little-to-no operations cost: new nodes take minutes to set up; nodes just keep running for months on end
• “Aggregate on write” – no jobs required! Distributed counters make it easy to do aggregates on write
• …and a nice kicker: *great* perf / COGS in Azure
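That config switch really is a single statement at keyspace creation. A minimal sketch using the DataStax Python driver; the keyspace name (metrics), contact address, and RF of 3 are illustrative assumptions, not details from the deck:

from cassandra.cluster import Cluster

# Connect to any node; the driver discovers the rest of the ring.
session = Cluster(["10.0.0.4"]).connect()  # placeholder address

# Replication factor ("hot copies") is just part of the keyspace definition.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS metrics
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")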

Architecture: 68 virtual machines (PaaS and IaaS)

[Diagram: end-user web browsers, monitored customer resources (e.g. websites, SQL databases) and monitored virtual machines / endpoints connect to the service tier (Portal Web Role, 3 instances; Web API Web Role, 8 instances; Jobs Worker Role, 24 instances; all PaaS), backed by Table Storage, Blob storage, SQL Database and a Cassandra VM Cluster (32 XL instances, IaaS), with data replicated in multiple datacenters.]

Avoiding state

• Application logic / code all lives on stateless machines

• Keeps it simple: decreases human operations cost

• Use Azure PaaS offerings (Web and Worker roles)


Azure Cloud Services (PaaS)

• Scale horizontally (grew from 1 to 30+ instances)

• Managed by the platform (patched; coordinated recycling; failover; etc.)

• 1 click deployment from Visual Studio (with automatic load balancer swaps)


Cassandra Cluster (IaaS)

• Maintains all state for metrics / time-series data
• 32 XL Linux virtual machines: 8 “pods” of 4 nodes
• Each pod exposed via a single endpoint

Exposing the pods

• Each pod of 4 nodes has a single load-balanced endpoint
• Clients (on our stateless roles) treat the endpoints as a pool (see the sketch below)
• Blacklist and skip an endpoint if it starts producing a lot of errors
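A minimal sketch of such a pool (all names and thresholds here are illustrative, not MetricsHub's actual code): pick a healthy endpoint at random, and temporarily blacklist one once its error count crosses a threshold.

import random
import time

class EndpointPool:
    """Pool of pod endpoints with simple error-based blacklisting."""

    def __init__(self, endpoints, max_errors=10, blacklist_seconds=300):
        self.endpoints = list(endpoints)
        self.max_errors = max_errors
        self.blacklist_seconds = blacklist_seconds
        self.errors = {ep: 0 for ep in self.endpoints}
        self.blacklisted_until = {ep: 0.0 for ep in self.endpoints}

    def pick(self):
        now = time.time()
        healthy = [ep for ep in self.endpoints if self.blacklisted_until[ep] <= now]
        # If everything is blacklisted, fall back to the full list rather than fail.
        return random.choice(healthy or self.endpoints)

    def report_error(self, endpoint):
        self.errors[endpoint] += 1
        if self.errors[endpoint] >= self.max_errors:
            # Skip this pod for a while, then give it another chance.
            self.blacklisted_until[endpoint] = time.time() + self.blacklist_seconds
            self.errors[endpoint] = 0

    def report_success(self, endpoint):
        self.errors[endpoint] = 0

# One load-balanced endpoint per pod (addresses are placeholders).
pool = EndpointPool([f"pod{i}.example.internal" for i in range(8)])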

Where does the data go?

• Data files are on 16 mounted network-backed disks (*not* ephemeral disks)

• Data disks are geo-replicated (3 copies local; 1 remote) for “free” DR

• Azure data disks offer great throughput (VMs end up CPU bound)

Our Column Families (CQL 3)

CREATE TABLE oneminute (
  rk text,
  ck text,
  cnt counter,
  sum counter,
  PRIMARY KEY (rk, ck)
);

Updating values… Real-time “average” values at any granularity, for any time window:

UPDATE oneminute   -- likewise tenminute / oneday
SET sum = sum + {sample_value},
    cnt = cnt + 1
WHERE rk = '{customer+metric}'
  AND ck = '{tags_and_timestamp}';
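From a client, the “aggregate on write” path is one counter update per granularity table. A sketch assuming the DataStax Python driver and a keyspace named metrics (both assumptions); table and key shapes follow the slides:

from cassandra.cluster import Cluster

session = Cluster(["10.0.0.4"]).connect("metrics")  # placeholder address / keyspace

def record_sample(customer_metric, tags_and_timestamp, value):
    # Bump sum and cnt in every granularity table; in a real system the
    # timestamp component of ck would be truncated to each table's bucket size.
    for table in ("oneminute", "tenminute", "oneday"):
        session.execute(
            f"UPDATE {table} SET sum = sum + %s, cnt = cnt + 1 "
            "WHERE rk = %s AND ck = %s",
            (value, customer_metric, tags_and_timestamp),
        )

record_sample("customer1|cpu_percent", "web01|2013-06-25T12:01", 42)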

Reading values…

*ONE* round trip to fetch a metric over time (e.g. CPU over past week):

SELECT * FROM oneminute
WHERE rk = '{customer_name}'
  AND ck < '{metric_path_start}'
  AND ck >= '{metric_path_end}'
ORDER BY ck DESC;
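The per-bucket average then falls out of sum / cnt on the client side (continuing the hypothetical sketch above; key values are made up):

# One round trip, then divide the counters to get averages per time bucket.
rows = session.execute(
    "SELECT ck, sum, cnt FROM oneminute "
    "WHERE rk = %s AND ck >= %s AND ck < %s ORDER BY ck DESC",
    ("customer1|cpu_percent", "web01|2013-06-18", "web01|2013-06-25"),
)
averages = {row.ck: row.sum / row.cnt for row in rows if row.cnt}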

Some hard lessons…

• Static private IPs are a must; otherwise, reboots / outages can confuse the cluster when nodes come back up

• Monitor performance carefully; once you tip over, it is hard to rebalance the cluster and add new nodes

• Fit the cluster to the platform: in Azure, match the Upgrade Domains / Fault Domains to preserve uptime during service maintenance / hardware failure
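One common way to fit the cluster to the platform (a sketch of the idea, not necessarily MetricsHub's approach): surface each VM's Azure Fault Domain as a Cassandra rack, so the snitch keeps replicas off machines that share a failure unit.

# Hypothetical bootstrap step run on each Cassandra VM.
# GossipingPropertyFileSnitch reads dc= and rack= from cassandra-rackdc.properties.
fault_domain = 1  # on a real VM this would be discovered from the Azure runtime

with open("/etc/cassandra/cassandra-rackdc.properties", "w") as f:
    f.write("dc=azure-us-east\n")         # datacenter name is an assumption
    f.write(f"rack=FD{fault_domain}\n")   # one rack per fault domain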

Single node tests…

4 disks, RAID 0, no read cache:

Workload (% write)    Ops/sec    Latency median    Latency 95th    Latency 99th
100%                  20018      1.5               3.7             7.9
75%                   8361       85.9              376.6           584.8
25%                   5412       459.9             759.1           940.1

4 disks, RAID 0, read cache (three runs per workload):

Workload (% write)    Ops/sec    Latency median    Latency 95th    Latency 99th
100%                  19208      1.5               3.8             7.9
                      18543      1.5               3.6             7.9
                      18563      1.4               3.6             8.2
75%                   7112       195.9             595.8           1099.6
                      7581       168.9             589.5           985.2
                      5149       256.5             774.0           1402.9
25%                   15358      23.0              110.2           309.1
                      3742       279.2             563.0           789.7
                      15376      22.1              98.8            293.3

[Chart: JBOD vs RAID 0 ops/sec for a read-heavy workload; y-axis 0–7000 ops/sec]

Workload (% write)    Ops/sec    Latency median    Latency 95th    Latency 99th
100%                  13638      1.9               4.9             24.0
75%                   3239       11.2              687.0           1099.3
25%                   1825       243.6             687.0           808.7

Multi-node load tests…

• 4 Nodes; RF = 3 (Quorum)
• 8 Disks, RAID 0

QUESTIONS & ANSWERS

Charles Lamanna
Charles.Lamanna@Microsoft.com
@clamanna