ScyllaDB deck, July 2017
Dor Laor, CEO, co-founder
ScyllaDB - Achieve No-Compromise Performance and Availability
Today we will cover:
+ Intro: Who we are, what we do, who uses it
+ Why ScyllaDB
+ Why should you care
+ How we made design decisions to achieve no-compromise performance and availability
Introduction
+ Founded by KVM hypervisor creators
+ Q2 2014 - Pivot to the database world
+ Q3 2015 - Decloak during Cassandra Summit 2015, Beta
+ Q1 2016 - General Availability
+ Q3 2016 - First Scylla Summit: 100+ attendees
+ Q1 2017 - Completed B round; $25MM in funding
+ HQs: Palo Alto, CA; Herzliya, Israel
+ 42+ employees, hiring!
What we do: Scylla, towards the best NoSQL
+ > 1 million OPS per node
+ < 1ms 99% latency
+ Auto tuned
+ Scale up and out
+ Open source
+ Large community (piggyback on Cassandra)
+ Blends into the ecosystem: Spark, Presto, time series, search, ...
Where is Scylla deployed?
Today we will cover:
+ Intro: Who we are, what we do, who uses it
+ Why we started ScyllaDB
+ Why should you care
+ How we made design decisions to achieve no-compromise performance and availability
Why ?@#$%$$%^?
Scylla benchmark by Samsung
[Chart: throughput in op/s]
Why we started Scylla (cont.)
+ Originally it was about performance/efficiency only
+ Over time, we understood we can deliver more:
  + SLA between background and foreground tasks
  + Work well on any given hardware (back pressure)
  + Deliver consistent, low 99th-percentile latency
  + Reduction in admin effort
  + Low latency in the face of failures (hot cache load balancing)
  + High observability
             Cassandra                                Scylla
Throughput:  Cannot utilize multi-core efficiently    Scales linearly - shard-per-core
Latency:     High due to Java and the JVM's GC        Low and consistent - own cache
Complexity:  Intricate tuning and configuration       Auto-tuned, dynamic scheduling
Admin:       Maintenance impacts performance          SLA guarantee for admin vs. serving
+ Outbrain is the world’s largest content discovery platform.
+ Over 557 million unique visitors from across the globe.
+ 250 billion personalized content recommendations every month.
Outbrain: Cassandra plus Memcached
Case study: Document column family
+ First read from memcached, go to Cassandra on misses.
+ Pain:
  + Stale data from cache
  + Complexity
  + Cold cache -> C* gets full volume
[Diagram: Read Microservice / Write Process]
Scylla/Cassandra side-by-side deployment
+ Writes are written in parallel to C* and Scylla
+ Reads are done in parallel:
  + Memcached + Cassandra
  + Scylla (no cache at all)
[Diagram: Read Microservice / Write Process]
Scylla (w/o cache) vs Cassandra + Memcached

                  Scylla        Cassandra                      Diff
Requests/Minute   12,000,000    500,000 (memcached handles     24X
                                the other 11,500,000)
AVG Latency       4 ms          8 ms                           2X
Max Latency       8 ms          35 ms                          3.5X
Hardware          9 machines    30+9 machines                  4.3X
[Charts: Cassandra's latency vs Scylla's latency]
What does it mean for a non-Cassandra user?
+ Throughput, latency, and scale benefits
+ Wide range of big data integration: KairosDB, Spark, JanusGraph, Presto, Kafka, Elastic
+ Best HA/DR in the industry
+ Stop using caches in front of the database
+ Consolidate HBase, Redis, MySQL, Mongo and others
Assorted Quotes
Today we will cover:
+ Intro: Who we are, what we do, who uses it
+ Why we started ScyllaDB
+ Why should you care
+ How we made design decisions to achieve no-compromise performance and availability
Design decisions: #1 The trivials
+ SSTable file format
+ Configuration file format
+ CQL language
+ CQL native protocol
+ JMX management protocol
+ Management command line
Design decisions: #2 Compatibility
Double cluster - migration w/o downtime
[Diagram: App writes through a CQL proxy to both Cassandra and Scylla]
Design decisions: #3 All things async
[Diagram: Threads vs Shards]
Design decisions: #4 Shard per core
SCYLLA DB: Network comparison

Traditional stack (Cassandra):
+ Kernel TCP/IP with a kernel scheduler feeding per-thread queues into the NIC queues
+ Lock contention, cache contention, NUMA unfriendly

Seastar's sharded stack (Scylla):
+ Userspace TCP/IP on DPDK; the kernel isn't involved
+ Per-core task scheduler with SMP queues straight to the NIC queues
+ No contention, linear scaling, NUMA friendly
Scylla has its own task scheduler

Traditional stack:
+ A thread is a function pointer
+ A stack is a byte array from 64k to megabytes
+ Context switch cost is high; large stacks pollute the caches

Scylla's stack:
+ A promise is a pointer to an eventually computed value
+ A task is a pointer to a lambda function
+ No sharing; millions of parallel events
+ One scheduler per CPU running promise/task chains
SCYLLA IS DIFFERENT
❏ DMA
❏ Log-structured merge tree
❏ DB-aware cache
❏ Userspace I/O scheduler
❏ NUMA friendly
❏ Log-structured allocator
❏ Zero copy
❏ Thread per core
❏ Lock-free
❏ Task scheduler
❏ Reactor programming
❏ C++14
❏ Multi-queue
❏ Poll mode
❏ Userspace TCP/IP
[Chart: Scylla vs C* latency]
Design decision: #5 Unified cache vs C* cache plethora

Cassandra: Key cache, Row cache, On-heap/Off-heap, Linux page cache, SSTables - complex tuning
Scylla: Unified cache over SSTables

+ Parasitic rows: the Linux page cache works in 4k SSTable pages, so caching 300b of your data still costs a full 4kb page (300b/4kb useful)
+ Page-fault path through the kernel: page fault -> suspend thread -> initiate I/O -> context switch -> I/O completes -> interrupt -> context switch -> map page -> resume thread
Cassandra Streaming configuration
Design decisions: #6 I/O scheduler

Scylla I/O scheduling: Query, Commitlog, and Compaction each get their own queue in front of a userspace I/O scheduler, which feeds the disk.
+ Maintains the max useful disk concurrency
+ Excess I/O waits in Scylla's queues, not queued in the FS/device
I/O scheduler result
Seastar scheduler
[Diagram: Memtable, Compaction, Query, Repair, and Commitlog feed the Seastar scheduler; a Compaction Backlog Monitor and a Memory Monitor adjust their priorities; resources: SSD, WAN, CPU]
Design Decision: #8..#10 Workload conditioning
Workload conditioning in practice
+ Disk can't keep up: workload conditioning will figure out the right request rate
Today we will cover:
+ Intro: Who we are, what we do, who uses it
+ Why we started ScyllaDB
+ How we made design decisions to achieve no-compromise performance and availability
Upcoming releases
+ Enterprise release, based on 1.6
+ 1.7 - May 2017
  + Counters
  + New intra-node sharding algorithm
  + SStableloader from 2.2/3.x
  + Debian
+ 2.0 - August 2017
  + Materialized views
  + Execution blocks (CPU cache optimization which boosts performance)
  + Partial row cache (for wide-row streaming)
Scylla Beyond Cassandra
[Diagram: core database scaling vertically and horizontally]
Q&A
Resources
slideshare.net/ScyllaDB
[email protected] (@DorLaor)
[email protected] (@AviKivity)
@scylladb
http://bit.ly/2oHAfok
youtube.com/c/scylladb
scylladb.com/blog
NoSQL GOES NATIVE
Thank you.