Running NoSQL natively on flash
Thomas Rochner
Topics – NoSQL Amsterdam 2013
1. What are we building?
2. Why are we building it?
3. OpenNVM
4. Use Cases
5. Where are we headed?
Fusion-io First Mover Milestones
Timeline, 2006–2013: mission to consolidate memory and storage; ioMemory technology unveiled; first products launched; 1 million IOPS; IBM Quicksilver; Dell strategic investment; HP OEMs products; IBM OEMs products; Samsung strategic investment; Dell OEMs products; VSL introduced; IPO on NYSE; ioTurbine acquired; ioDrive2 announced; Supermicro OEMs products; 1 billion IOPS; 2,500 customers; >120 channel and alliance partners; ioFX announced; OpenNVM introduced; Cisco OEMs products; ioScale announced at Open Compute Summit; NexGen acquired.
Fusion-io Accelerates
Workloads accelerated include analytics, search, messaging (MQ), databases (Informix), virtualization (KVM), HPC (GPFS), big data, security/logging, collaboration (Lotus), development, web (LAMP), caching, and workstation use.
Direct Acceleration – ioMemory Solutions Platform
• Up to 3.0TB of capacity
• Up to 2.4TB of capacity per x8 PCI Express slot
• Up to 10.24TB to maximize performance for large data sets
• Up to 1,650GB of workstation acceleration for digital content creation
• Up to 1.2TB for maximum performance density
• Up to 3.2TB of low-latency, high-performance flash per PCI Express slot
• Mezzanine form factor
Segments served: small to medium enterprise, enterprise, hyperscale, and workstation, all with flash-aware applications.
Comprehensive Solution Portfolio
Segments: workstation (single user), hyperscale (scale out), enterprise (scale up).
Workloads: visual computing, digital content, web apps, big data, SaaS, databases, server virtualization, virtual desktop infrastructure, mixed workloads.
Topics – NoSQL Amsterdam 2013
1. What are we building?
2. Why are we building it?
3. OpenNVM
4. Use Cases
5. Where are we headed?
SLOW STORAGE LEADS TO IDLE CAPACITY
37% of servers are massively underutilized¹, and the typical server is idle 80% of the time, because the performance gap between CPUs and storage continues to grow². According to Moore's Law, processing performance doubles every 18 months, while storage performance has not kept pace.
Chart: relative performance of CPUs vs. storage, 1985–2010, showing the growing performance gap.
¹ Source: IDC's Server Workloads 2010, July 2010
² Source: Taming the Power Hungry Data Center, Fusion-io white paper
Flash Architectures
Spinning media: a design over 150 years old. An SSD still treats flash memory like a disk.
Diagram: "flash as disk" sends application I/O from the host (app, OS, CPU, DRAM) through a RAID controller and SAS/SATA to the drive's controller and NAND; "flash as memory" attaches the NAND data path to the host directly over PCIe.
Cut-through Architecture and VSL
▸ Sophisticated architecture: maximum performance
▸ Intelligent software: advanced features
Diagram: applications and databases issue I/O through the kernel file system to the Virtual Storage Layer (VSL), which drives ioMemory over PCIe. The ioMemory virtualization tables live in host DRAM alongside operating system and application memory; the host CPU and its cores issue commands to the ioDrive's data-path controller, and data transfers run across wide channels and banks of flash.
Balanced Performance Affects Throughput
▸ SSD: queuing behind slow writes causes latency spikes.
▸ ioScale/ioMemory: balanced read/write performance delivers consistent throughput.
ioDrive2 specs – visit fusionio.com
Topics – NoSQL Amsterdam 2013
1. What are we building?
2. Why are we building it?
3. OpenNVM
4. Use Cases
5. Where are we headed?
NoSQL Software Challenges
• Keeping NoSQL software simple while adding data persistence
• Transforming in-memory structures to block I/O
• Tiering data between DRAM and persistent storage
• Keeping latency low with data persistence
• Scaling up
OpenNVM – http://opennvm.github.io
• Native programming interfaces
• Access flash as memory
• Eliminate legacy software layers
• Simplify application authoring
• Accelerate time-to-market
NVM Software Interfaces
▸ Industry-first, direct API access to non-volatile memory's unique characteristics.
▸ OpenNVM was introduced to help developers:
• Write less code to create high-performing apps
• Tap into performance not available with conventional I/O access to SSDs
• Reduce operating costs by decreasing RAM while increasing NVM
Direct Access to Non-Volatile Memory Is Now Emerging
▸ Developers are beginning to manipulate data directly in non-volatile memory (NVM) without converting to basic block I/O.
Flash memory evolution
Diagram: five stages of increasingly native access.
• Traditional SSDs: application → OS block I/O → file system → block layer → SAS/SATA → RAID controller → flash translation layer → read/write.
• ioMemory™ with conventional I/O: application → OS block I/O → file system → block layer → VSL™ (an expanded flash translation layer) → read/write.
• ioMemory™ as transparent cache: the same conventional block path, with ioTurbine™ caching over VSL™ (local or remote over the network) → read/write.
• ioMemory™ with direct-access I/O: application → user-defined I/O API libraries → direct-access I/O API libraries → NVMFS (NVM filesystem) → I/O primitives → VSL™ → read/write.
• ioMemory™ with memory semantics: application → user-defined memory API libraries → memory semantics API libraries → NVMFS → memory primitives → VSL™ → CPU load/store.
Example: Key-Value Store API Library
• Keys are 1–128B; values are 64b–1MB.
• kv_put() stores a pair; kv_get() or kv_batch_get() reads it back, with the value returned through a single I/O operation regardless of value size.
• Keys are hashed into a sparse address space to simplify collision management.
• A key expiration timer marks a KV pair for VSL garbage collection.
• kv_get_current() and kv_next() iterate through each KV pair in a pool of related keys (pool A, pool B, pool C, …).
• Writes are wrapped in an atomic transaction envelope.
• The application issues calls to the Key-Value Store API library, which sits on top of the Virtual Storage Layer.
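For illustration only, here is a minimal, self-contained sketch of how an application might use such a key-value API. The kv_put()/kv_get() names follow the slide, but the signatures are hypothetical, and the bodies below merely simulate the interface with an in-memory table so the example compiles and runs; the shipping OpenNVM key-value library defines its own prototypes and performs each put and get as a single flash I/O.

/* Hypothetical key-value API in the style of the slide's kv_put()/kv_get().
 * The stub bodies simulate the interface with a tiny in-memory table; the
 * real library persists each pair to flash in one atomic I/O. */
#include <stdio.h>
#include <string.h>

typedef int kv_pool_t;                      /* a pool groups related keys */

#define MAX_PAIRS 16
static struct { char key[128]; char val[1024]; int used; } store[MAX_PAIRS];

static int kv_find(const char *key)
{
    for (int i = 0; i < MAX_PAIRS; i++)
        if (store[i].used && strcmp(store[i].key, key) == 0)
            return i;
    return -1;
}

static int kv_put(kv_pool_t pool, const char *key, const char *val,
                  unsigned expiry_secs)
{
    (void)pool; (void)expiry_secs;          /* expiry would mark the pair for GC */
    int i = kv_find(key);
    for (int j = 0; i < 0 && j < MAX_PAIRS; j++)
        if (!store[j].used) i = j;
    if (i < 0) return -1;
    snprintf(store[i].key, sizeof store[i].key, "%s", key);
    snprintf(store[i].val, sizeof store[i].val, "%s", val);
    store[i].used = 1;
    return 0;                               /* real library: one atomic flash write */
}

static int kv_get(kv_pool_t pool, const char *key, char *out, size_t out_len)
{
    (void)pool;
    int i = kv_find(key);
    if (i < 0) return -1;
    snprintf(out, out_len, "%s", store[i].val);
    return 0;                               /* real library: one read I/O, any value size */
}

int main(void)
{
    kv_pool_t sessions = 0;
    char out[1024];

    kv_put(sessions, "user:1234", "{\"cart\":3}", 3600);
    if (kv_get(sessions, "user:1234", out, sizeof out) == 0)
        printf("user:1234 -> %s\n", out);
    return 0;
}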
Key-Value Store API Library Benchmarks (native KV get/put vs. raw reads/writes)
Charts: GETs/s and PUTs/s versus thread count (up to ~140 threads) for 512B, 4KB, 16KB, and 64KB values, plus performance relative to a raw ioDrive (512B key GET vs. 1KB FIO read; 512B key PUT vs. 1KB FIO write).
Test system: 1U HP blade server with 16 GB RAM, 8 CPU cores (Intel Xeon X5472 @ 3.00GHz), and a single 1.2 TB ioDrive2 Mono.
SIGNIFICANTLY MORE FUNCTIONALITY WITH NEGLIGIBLE PERFORMANCE COST
Key-Value Store API Library Benefits
• 95% of raw-device performance: smarter media that natively understands a key-value I/O interface, with lock-free updates, crash recovery, and no additional metadata overhead.
• Up to 3x capacity increase: dramatically reduces over-provisioning with coordinated garbage collection and automated key expiry.
• 3x throughput on the same SSD: early benchmarks against memcached with BerkeleyDB persistence show up to 3x improvement.
NVMFS – Eliminating Duplicate Logic
Diagram: with ext/btrfs/xfs, the Linux VFS sits above the kernel block layer, the device driver, and its sector mapping, and both the filesystem and the driver keep their own metadata; with NVMFS, the Linux VFS calls driver primitives directly, leaving a single copy of the mapping metadata.
NVMFS – Benefits in Eliminating Duplicate Logic
Lines of code per file system:
• NVMFS: 6,879
• ReiserFS: 19,996
• ext4: 25,837
• btrfs: 51,925
• XFS: 63,230
NVMFS: Native Flash Filesystem
▸ File system convenience
▸ Raw block performance
▸ No compromise necessary
Chart: bandwidth (MiB/s) versus I/O size, raw block vs. NVMFS, 8 threads.
NVMFS: Consistent Low Latency
Chart: sample write latency (µs) of Fusion-io NVMFS vs. ext4 over 1,000 sequential 512-byte writes with O_DIRECT.
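A comparison like this can be approximated with a small microbenchmark. The sketch below assumes Linux and a device/filesystem that accepts 512-byte O_DIRECT I/O; the file path is only an example. It times 1,000 sequential 512-byte direct writes the way the slide describes.

/* Time 512-byte O_DIRECT sequential writes; run it once on each
 * filesystem under test (e.g. an NVMFS mount, then an ext4 mount). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/mnt/test/latency.dat"; /* example path */
    const size_t io_size = 512;
    const int iterations = 1000;

    int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, 4096, io_size)) { close(fd); return 1; } /* O_DIRECT needs aligned buffers */
    memset(buf, 0xab, io_size);

    for (int i = 0; i < iterations; i++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (pwrite(fd, buf, io_size, (off_t)i * io_size) != (ssize_t)io_size) {
            perror("pwrite");
            break;
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        long us = (t1.tv_sec - t0.tv_sec) * 1000000L +
                  (t1.tv_nsec - t0.tv_nsec) / 1000L;
        printf("%d %ld\n", i, us);          /* sample index, latency in µs */
    }

    free(buf);
    close(fd);
    return 0;
}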
MySQL: NVMFS and Atomic Writes
Chart: MySQL transaction throughput on XFS vs. NVMFS with atomic I/O, comparing double-write disabled (non-ACID), double-write enabled (ACID), and atomic writes (ACID). Atomic writes on NVMFS deliver 50% more ACID transactions.
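To see why atomic writes let InnoDB drop its double-write buffer: if the device guarantees a multi-sector write is all-or-nothing, a 16KB page can be written in place with a single call and can never be torn by a crash. The sketch below uses nvm_atomic_pwritev(), a hypothetical name standing in for an OpenNVM-style atomic write primitive; its stub body falls back to an ordinary, non-atomic pwritev() so the example compiles and runs.

/* Sketch: writing a 16KB database page as one atomic unit, the property
 * that lets MySQL/InnoDB turn off its double-write buffer.
 * nvm_atomic_pwritev() is a HYPOTHETICAL stand-in, not a real API. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#define PAGE_SIZE 16384

static int nvm_atomic_pwritev(int fd, const struct iovec *iov, int iovcnt,
                              off_t offset)
{
    /* Real primitive: all-or-nothing persistence of the whole vector.
     * Stub: an ordinary pwritev(), which is NOT atomic. */
    size_t total = 0;
    for (int i = 0; i < iovcnt; i++)
        total += iov[i].iov_len;
    return pwritev(fd, iov, iovcnt, offset) == (ssize_t)total ? 0 : -1;
}

static int write_page_atomically(int fd, const uint8_t *page, off_t page_no)
{
    struct iovec iov = { .iov_base = (void *)page, .iov_len = PAGE_SIZE };
    /* One atomic call replaces the double-write sequence:
     * write page to the double-write area, fsync, write to its final
     * location, fsync. */
    return nvm_atomic_pwritev(fd, &iov, 1, page_no * PAGE_SIZE);
}

int main(void)
{
    uint8_t page[PAGE_SIZE];
    memset(page, 0x5a, sizeof page);

    int fd = open("pages.dat", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    int rc = write_page_atomically(fd, page, 0);
    printf("page write %s\n", rc == 0 ? "ok" : "failed");
    close(fd);
    return rc == 0 ? 0 : 1;
}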
Range of Memory-Access Semantics
• Extended Memory (volatile): transparently extends DRAM onto flash, extending application virtual memory.
• Checkpointed Memory (volatile with non-volatile checkpoints): a region of application virtual memory with the ability to preserve snapshots to a flash namespace.
• Auto-Commit Memory™ (non-volatile): a region of application memory automatically persisted to non-volatile memory and recoverable after a system failure.
OS Swap vs. Extended Memory
OS swap mechanism (system memory backed by non-volatile storage: disks, SSDs, etc.):
• Originally designed as a last resort to prevent OOM (out-of-memory) failures
• Never tuned for high-performance demand paging
• Never tuned for multi-threaded apps
• Poor performance, e.g. < 30 MB/sec throughput
Extended Memory mechanism (system memory backed by NV memory, used volatilely):
• No application code changes required
• Designed to migrate hot pages to DRAM and cold pages to ioMemory
• Tuned to run natively on flash (leverages its native characteristics)
• Tuned for multi-threaded apps
• 10–15x throughput improvement over standard OS swap
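For intuition only, the volatile "extended memory" idea can be approximated with plain POSIX by mapping a large file on a flash mount into the address space and letting the kernel demand-page it. This is not the Fusion-io Extended Memory SDK, and the mount point and sizes below are examples; the product does this transparently and with the flash-aware tuning the slide lists.

/* Rough stand-in for flash-backed extended memory on a 64-bit Linux host:
 * map a large file on an ioMemory/flash mount and let demand paging keep
 * hot pages in DRAM while cold pages live on flash. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/mnt/iomemory/extmem.bin";   /* example mount point */
    const size_t size = 64ULL * 1024 * 1024 * 1024;  /* 64 GB, beyond typical DRAM */

    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (ftruncate(fd, (off_t)size) != 0) { perror("ftruncate"); return 1; }

    /* The mapping behaves like ordinary memory to the application. */
    char *mem = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (mem == MAP_FAILED) { perror("mmap"); return 1; }

    /* Touch a few scattered pages as if they were an in-memory structure. */
    for (size_t off = 0; off < size; off += size / 8)
        mem[off] = (char)(off & 0xff);

    munmap(mem, size);
    close(fd);
    return 0;
}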
Topics – NoSQL Amsterdam 2013
1. What are we building?
2. Why are we building it?
3. OpenNVM
4. Use Cases
5. Where are we headed?
DRAM Dictates NoSQL Scaling
▸ Key design principle: working set < DRAM
Chart: cost of DRAM modules in dollars for 4GB, 8GB, 16GB, and 32GB modules; larger modules cost disproportionately more ($ → $$$$$$).
When do we scale out?
▸ A typical server…
• CPU cores: 32 with HT
• Memory: 128 GB
…is your working set > 128GB?
Is there a better way?
▸ With NoSQL databases, we tend to scale out for DRAM.
▸ Combined resources:
• CPU cores: 96
• Memory: 384 GB
▸ More cores than needed to serve reads and writes.
Three Deployment Options
1. All flash
2. Data placement (CASSANDRA-2749)
3. Use logical data centers
Cassandra with All-Flash Storage
Step 1: Mount ioMemory at /var/lib/cassandra/data
Step 2:
Data Placement
▸ https://issues.apache.org/jira/browse/CASSANDRA-2749
• Thanks Marcus!
▸ Takes advantage of the filesystem hierarchy
▸ Use mount points to pin keyspaces or column families to flash:
• /var/lib/cassandra/data/{Keyspace}/{CF}
▸ Use flash for high-performance needs, disk for capacity needs
Data Centers for Storage Control
Diagram: a single Cassandra cluster split into logical data centers: DC1 (interactive requests), DC2 (Hadoop MR jobs), and DC3 (high-density replicas), trading off performance (high, medium, low) against capacity per node (DC3 highest).
Rethinking Cassandra I/O
Diagrams based on the Cassandra write path (http://www.datastax.com/docs/1.2/dml/about_writes), first with flash inserted into the path, then with the memtable replaced by a "flashtable" on flash.
Accelerating Cassandra With Flash
Cassandra + NAND flash accelerator. We have been talking with the Cassandra community about how we can solve some of these problems, so stay tuned.
Real-World Cassandra on Fusion-io

Spotify? Spotify!
▸ Over 24 million active users
▸ Over 20 million songs available globally
▸ Over 6 million paying subscribers
▸ Over 1 billion playlists created
▸ Over $500 million paid to rights-holders
▸ Over 850 employees
▸ Over 250 developers
▸ Available in: 28 countries - USA, UK, Australia, New Zealand, Germany, Sweden, Finland, Norway, Denmark, France, Spain, Austria, Belgium, Switzerland, The Netherlands, Ireland, Luxembourg, Italy, Poland, Portugal, Mexico, Singapore, Hong Kong, Malaysia, Lithuania, Latvia, Estonia and Iceland.
Cassandra at Spotify
• Over 24 clusters and quickly growing
• Containing over 300 nodes
• Distributed over 4 data centers around the world
• Our main solution for scalable storage
Why flash?
• It changes everything; going from spinning disks to flash is a step change
• Cassandra is page-cache bound; flash moves the scaling limit from memory to flash
• Allows us to both consolidate and scale our clusters at the same time
• Developers can focus on delivering products instead of optimizing for I/O

Why Fusion-io?
• Why attach flash to a legacy platform?
• It turns out that it's easier to get installed
• The developer kit allows direct access to flash
• Performance
Early results
• 3–4x consolidation factor
• 3–6x reduction in latency
• Forcing SSTables into memory is no longer needed
• ROI so far is 2.2x
• Consolidation limited by Cassandra 1.1
One ioDrive vs. a 10-disk RAID-0 – MongoDB
ioDrive2 (1.2TB) vs. 10 x 7200 RPM SAS disks in a RAID-0 md array
YCSB Details
• 2 x 8-core 2.9GHz Xeon, 64GB RAM, 2 x 10GbE
• XFS
Test workloads:
• Workload A: 50/50% read/write mix
• Workload B: 95/5% read/write mix
• Workload C: 100% read
• Workload F: 50/50% read/read-modify-write mix
Result charts for each workload:
• Workload A – 50/50% read/write mix
• Workload B – 95/5% read/write mix
• Workload C – 100% read
• Workload F – 50/50% read/read-modify-write mix
YCSB: Lowest read latency wins
Chart: average read latency in milliseconds (0–180 ms scale) for Workload A, Workload B, Workload F (update), and Workload F (read-modify-write), Fusion-io vs. disk; Fusion-io is 20x–40x faster across the workloads (30x, 27x, 20x, and 40x).
YCSB: Lowest update latency wins
Chart: average update latency in milliseconds (0–450 ms scale) for the same workloads, Fusion-io vs. disk; Fusion-io is 4x–18x faster across the workloads (18x, 4x, 14x, and 16x).
Architectural impact of Fusion-io
• Avoid scaling out for DRAM
▸ Nodes can handle higher transaction load
▸ Terabytes of low-latency persistent storage on single nodes
▸ Potentially avoid sharding
• Use less DRAM per node
▸ Lower-cost servers
▸ DRAM available for applications
Topics – NoSQL Amsterdam 2013
1. What are we building?
2. Why are we building it?
3. OpenNVM
4. Use Cases
5. Where are we headed?
API Specs Posted at opennvm.github.io
▸ Early access to OpenNVM API specs and technical documentation (limited enrollment during the early-access phase)
▸ http://opennvm.github.io
• Write less code to create high-performing apps
• Tap into performance not available with conventional I/O access to SSDs
• Reduce operating costs by decreasing RAM while increasing NVM
Direct access to NVM is for developers whose software retrieves and stores data.
Open Interfaces and Open Source
• NVM Primitives: Open Interface
• NVMFS: Open Source, POSIX Interface
• NVM API Libraries: Open Source, Open Interface
• INCITS SCSI (T10) active standards proposals:
▸ SBC-4 SPC-5 Atomic-Write
http://www.t10.org/cgi-bin/ac.pl?t=d&f=11-229r6.pdf
▸ SBC-4 SPC-5 Scattered writes, optionally atomic
http://www.t10.org/cgi-bin/ac.pl?t=d&f=12-086r3.pdf
▸ SBC-4 SPC-5 Gathered reads, optionally atomic
http://www.t10.org/cgi-bin/ac.pl?t=d&f=12-087r3.pdf
• SNIA NVM-Programming TWG active member
Catalyst for top industry players to accelerate the pursuit of NVM programming

…and resonating through the industry
Flash memory evolution
The same five-stage progression shown earlier, here with directCache™ providing the transparent-cache layer.
Comprehensive Customer Success
30+ case studies at http://fusionio.com/casestudies
Customer segments: financials, manufacturing/government, web technology, and retail.
Highlights include 40x faster data warehouse queries, 15x query processing throughput, 30x faster database replication, 5x faster data analysis, and 15x faster queries, from customers including the world's leading Q&A site.
fusionio.com | REDEFINE WHAT'S POSSIBLE
THANK YOU!