Cloudius Systems presents - Percona · Cloudius Systems presents: NoSQL goes NATIVE Avi Kivity...
Transcript of Cloudius Systems presents - Percona · Cloudius Systems presents: NoSQL goes NATIVE Avi Kivity...
Cloudius Systems presents:
NoSQL goes NATIVEAvi Kivity (@AviKivity)
Percona Live 2016
Capable of 1,000,000 operations per secondPER NODE
With predictable, low latenciesCompatible with Apache Cassandra
Scylla: A new NoSQL Database
BACKGROUND
SQL: Structured, no scale
Document store: No structureSome scale
Column store: Some structureScale outAwesome HA/DR
Key-value: SimpleScaleNot a real DB
YESSCALE UP OR SCALE OUT?
CPUs AND DISKS ARE NETWORK DEVICES
THE POWER OFCASSANDRA AT THE SPEED OF REDIS
SOLUTION: SCYLLA DB
AWESOME REDUNDANCY & HA
+ Multi DC
+ Spark
+ CQL
+ Auto sharding
+ Wide rows
RESULTS: THROUGHPUT/SCALE UP
LATENCY - AVG
LATENCY - P99
FULLY COMPATIBLE
SELF TUNING
No need to guess thread pool sizes
No need to guess compaction/streaming bandwidth
No fiddling with heap sizes, on/off heap
No endless tweaking of Garbage Collector parameters
WHAT WOULD YOU DO WITH 1 MILLION TPS?
Shrink your cluster by a factor of X10
Handle 10X traffic spikes on Black Friday
Model your data instead of use Cassandra as K/V
Get the most out of your data - Run more queries
Maintain your clusters while serving
Stop using caches in front of the database
TECHNOLOGY:HOW IT WORKS
SeaStar
Before: Thread model After: SeaStar shards
SCYLLA IS QUITE DIFFERENT
Shard-per-core, no locks, no threads, zero-copy
Reactor programing with C++14
Our own efficient,DB-aware cache, not using Linux page cache
Better storage engine than cassandra
Max out all HW resources - NUMA friendly, multiqueue NICs, etc
SCYLLA DB: NETWORK COMPARISON
Kernel
Cassandra
TCP/IPScheduler
queuequeuequeuequeuequeue
threads
NIC
Queues
Kernel
Traditional stack SeaStar’s sharded stack
Memory
Application
TCP/I
P
Task Scheduler
queuequeuequeuequeuequeue
smp queue
NIC
Queue
DPDK
Kernel
(isn’t
involved)
Userspace
Application
TCP/I
P
Task Scheduler
queuequeuequeuequeuequeue
smp queue
NIC
Queue
DPDK
Kernel
(isn’t
involved)
Userspace
Application
TCP/I
P
Task Scheduler
queuequeuequeuequeuequeue
smp queue
NIC
Queue
DPDK
Kernel
(isn’t
involved)
Userspace
Core
Database
TCP/I
P
Task Scheduler
queuequeuequeuequeuequeue
smp queue
NIC
Queue
DPDK
Kernel
(isn’t
involved)
Userspace
Scylla has its own task schedulerTraditional stack Scylla’s stack
Promise
Task
Promise
Task
Promise
Task
Promise
Task
CPU
Promise
Task
Promise
Task
Promise
Task
Promise
Task
CPU
Promise
Task
Promise
Task
Promise
Task
Promise
Task
CPU
Promise
Task
Promise
Task
Promise
Task
Promise
Task
CPU
Promise
Task
Promise
Task
Promise
Task
Promise
Task
CPU
Promise is a
pointer to
eventually
computed value
Task is a
pointer to a
lambda function
Scheduler
CPU
Scheduler
CPU
Scheduler
CPU
Scheduler
CPU
Scheduler
CPU
Threa
d
Stack
Threa
d
Stack
Threa
d
Stack
Threa
d
Stack
Threa
d
Stack
Threa
d
Stack
Threa
d
Stack
Threa
d
Stack
Thread is a
function pointer
Stack is a byte
array from 64k
to megabytes
Unified cacheCassandra Scylla
Key cache
Row cache
On-heap /Off-heap
Linux page cache
SSTables
Unified cache
SSTables
TuningParasitic rowsPage faults
total, 14825839, 25002, 25002, 25002, 0.5, 0.3, 0.5, 5.0, 12.8, 22.2, 592.6, 0.00076
total, 14851605, 24980, 24980, 24980, 0.5, 0.3, 0.6, 6.7, 12.9, 21.6, 593.6, 0.00076
total, 14877443, 25004, 25004, 25004, 0.5, 0.3, 0.5, 6.5, 17.5, 38.8, 594.7, 0.00076
total, 14903361, 25017, 25017, 25017, 0.5, 0.3, 0.5, 6.9, 29.9, 39.9, 595.7, 0.00076
total, 14927655, 23553, 23553, 23553, 4.6, 0.3, 34.3, 66.8, 203.0, 255.9, 596.7, 0.00076
total, 14956055, 26384, 26384, 26384, 5.0, 0.4, 27.2, 53.9, 81.5, 99.9, 597.8, 0.00077
total, 14981910, 24987, 24987, 24987, 0.5, 0.3, 0.7, 6.2, 13.5, 25.0, 598.8, 0.00077
total, 15007673, 25003, 25003, 25003, 0.4, 0.3, 0.5, 3.7, 12.5, 24.5, 599.9, 0.00077
total, 15033484, 25006, 25006, 25006, 0.4, 0.3, 0.5, 3.8, 12.4, 32.8, 600.9, 0.00077
total, 15059256, 25004, 25004, 25004, 0.4, 0.3, 0.5, 2.2, 14.9, 33.0, 601.9, 0.00076
total, 15085126, 24994, 24994, 24994, 0.4, 0.3, 0.5, 2.4, 10.4, 19.4, 603.0, 0.00076
total, 15110948, 24988, 24988, 24988, 0.5, 0.3, 0.6, 4.1, 10.3, 19.9, 604.0, 0.00076
Compatibility (and speed): Repair
Jan 13 17:33:44 server-02.localdomain scylla[7730]: [shard 27] stream_session - [Stream #cbbf8dc1-ba1b-11e5-a51a-00000000000b ID#0]
Creating new streaming plan for repair-in
Jan 13 17:33:44 server-02.localdomain scylla[7730]: [shard 27] stream_session - [Stream #cbbf8dc1-ba1b-11e5-a51a-00000000000b ID#0]
Received streaming plan for repair-in
Jan 13 17:33:45 server-02.localdomain scylla[7730]: [shard 27] stream_session - [Stream #cbc02a01-ba1b-11e5-a51a-00000000000b ID#0]
Creating new streaming plan for repair-in
Jan 13 17:33:45 server-02.localdomain scylla[7730]: [shard 27] stream_session - [Stream #cbc02a01-ba1b-11e5-a51a-00000000000b ID#0]
Received streaming plan for repair-in
I/O Scheduling
Query
Commitlog
Compaction
Queue
Queue
Queue
I/O
SchedulerDisk
STATUS
1.0 Released
Major CQL feature support
Good match for upcoming Non Volatile Memory technology
High availability with gossip and flexible replication
Runs everywhere: Physical, virtual, containers
Can be integrated with microservices with its own httpd
❏Build a community❏Core database improvements❏VERTICAL: Spark, Solr, distributed SQL engines❏HORIZONTAL: Microservice integration, more
protocols
WHAT’S NEXT?
SCYLLA, NoSQL GOES NATIVEThank you.
Basic model
■ Futures
■ Promises
■ Continuations
F-P-C defined: Future
A future is a result of a computation
that may not be available yet. ■ Data buffer from the network
■ Timer expiration
■ Completion of a disk write
■ Result computation that requires the values from one or
more other futures.
F-P-C defined: Promise
A promise is an object or function
that provides you with a future, with
the expectation that it will fulfil the
future.
F-P-C defined: Continuation
A continuation is a computation that
is executed when a future becomes
ready (yielding a new future).
Basic future/promise
future<int> get(); // promises an int will be produced eventuallyfuture<> put(int) // promises to store an int
void f() {get().then([] (int value) {
put(value + 1).then([] {std::cout << "value stored successfully\n";
});});
}
Chaining
future<int> get(); // promises an int will be produced eventuallyfuture<> put(int) // promises to store an int
void f() {get().then([] (int value) {
return put(value + 1);}).then([] {
std::cout << "value stored successfully\n";});
}
Parallelism
void f() {
std::cout << "Sleeping... " << std::flush;
using namespace std::chrono_literals;
sleep(200ms).then([] { std::cout << "200ms " << std::flush; });
sleep(100ms).then([] { std::cout << "100ms " << std::flush; });
sleep(1s).then([] { std::cout << "Done.\n"; engine_exit(); });
}
Zero copy friendly
future<temporary_buffer> connected_socket::read(size_t n);
■ temporary_buffer points at driver-provided pages if possible
■ discarded after use
Zero copy friendly (2)
future<size_t>connected_socket::write(temporary_buffer);
■ Future becomes ready when TCP window allows sending more data (usually immediately)
■ temporary_buffer discarded after data is ACKed■ can call delete[] or decrement a reference count
Dual Networking Stack
Networking API
Seastar (native) Stack POSIX (hosted) stack
Linux kernel (sockets)
User-space TCP/IP
Interface layer
DPDK
Virtio Xen
igb ixgb
Disk I/O
■ Zero copy using Linux AIO and O_DIRECT■ Some operations using worker threads (open()
etc.)■ Plans for direct NVMe support