Cassandra at Arkivum

Richard Lowe, Principal Engineer

Arkivum

[email protected]

About Arkivum

• We offer a safe, secure archive service for digital data

• We use data archiving expertise to keep data for the long term: for years, decades or forever

• Our service allows our customers to meet their compliance needs and asset retention goals whilst focusing on their core business

© Arkivum Limited, 2012


Our architecture


• Gateway appliance is installed at customer site running our software, talking across WAN using secure VPN to our software in our data centres (DCs)

• File data is encrypted and stored on a variety of storage media, including SSD, hard disk and tape

• Focus is on maintaining long-term data integrity, not low latency or high availability

Legacy design

• Original code used an SQL database

• Our knowledge was biased towards RDBMS

• Normalization, JDBC, ACID, mature platform

• The software design assumed SQL

• Indexes and ad-hoc queries gave basic search functionality for relatively little extra effort

Relational model of a file system

CREATE TABLE files (
    file_id       VARCHAR NOT NULL PRIMARY KEY,
    parent_id     VARCHAR NOT NULL,
    name          VARCHAR NOT NULL,
    size          BIGINT DEFAULT 0,
    created_date  DATETIME DEFAULT CURRENT_TIMESTAMP,
    modified_date DATETIME DEFAULT CURRENT_TIMESTAMP,
    owner_uid     INT DEFAULT 0,
    owner_gid     INT DEFAULT 0,
    file_mode     INT DEFAULT 493,  -- 493 decimal = 0755 octal (rwxr-xr-x)
    file_attr     INT DEFAULT 0,
    UNIQUE(parent_id, name)
);

Relational model of a file system

Get a file by id

SELECT * FROM files WHERE
    file_id = 'f90b3e92-0e96-482f-b4e5-f1ca071f26d6';

List all files in a particular directory

SELECT * FROM files WHERE
    parent_id = 'e98eaaaa-07a6-4ffa-bd21-f3975529718b';

List all files modified in April 2010 and sort by size

SELECT * FROM files WHERE
    modified_date >= '2010-04-01'
    AND modified_date < '2010-05-01'
ORDER BY size DESC;
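
The three relational queries can be tried end to end with a minimal sketch. It assumes an in-memory SQLite database as a stand-in for the original RDBMS (the slides do not name the product), a trimmed-down `files` table, and made-up sample rows; the helper names `get_file`, `list_directory` and `modified_in_april_2010` are hypothetical.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the original SQL database.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE files (
        file_id       TEXT NOT NULL PRIMARY KEY,
        parent_id     TEXT NOT NULL,
        name          TEXT NOT NULL,
        size          INTEGER DEFAULT 0,
        modified_date TEXT DEFAULT CURRENT_TIMESTAMP,
        UNIQUE (parent_id, name)
    )
""")

# Made-up sample rows, for illustration only.
conn.executemany(
    "INSERT INTO files (file_id, parent_id, name, size, modified_date) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("f1", "dir-a", "report.doc", 2048, "2010-04-02 09:00:00"),
        ("f2", "dir-a", "scan.tiff", 900000, "2010-04-20 17:30:00"),
        ("f3", "dir-b", "notes.txt", 120, "2010-05-03 08:15:00"),
    ],
)

def get_file(file_id):
    """Primary-key lookup: returns a (name,) tuple or None."""
    return conn.execute(
        "SELECT name FROM files WHERE file_id = ?", (file_id,)
    ).fetchone()

def list_directory(parent_id):
    """All file names in one directory, ordered by name."""
    cur = conn.execute(
        "SELECT name FROM files WHERE parent_id = ? ORDER BY name", (parent_id,)
    )
    return [row[0] for row in cur]

def modified_in_april_2010():
    """Ad-hoc range query with a sort on a non-key column."""
    cur = conn.execute(
        "SELECT name FROM files "
        "WHERE modified_date >= '2010-04-01' AND modified_date < '2010-05-01' "
        "ORDER BY size DESC"
    )
    return [row[0] for row in cur]
```

The third query is the one that is easy in SQL and awkward in Cassandra: it filters on a range and sorts on a different column, neither of which is the key.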



Why Cassandra?

• Scalability

  • Meets our need to scale to billions of records

  • Designed for high-availability, high-throughput environments

• Replication

  • Data safety is paramount to us

  • Cassandra replication is a really strong feature

• Stability

  • Well supported and used worldwide in high-profile, high-end production systems

Cassandra model of a file system

Approach 1: Pretend we're using a relational database

• Use column families as if they're tables

• Use CQL because it's like SQL

• Create secondary indexes for everything in case we want to query on it later

Files CF

Row key    Column name    Value type
file_id    "parent_id"    UUID
           "name"         UTF8
           "size"         Long
           "modified"     Long
           "accessed"     Long
           "gid"          Long
           "uid"          Long
           "mode"         Long


Approach 1 doesn't work

• Column families are not tables

• CQL looks like SQL, but isn't

SELECT * FROM Files WHERE modified > '2010-03-31';

• Secondary indexes aren't cheap

• Can't sort based on column values, only on column names
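
The point about column names versus column values can be illustrated with a toy model. This is plain Python, not a Cassandra client: it assumes a row is simply a list of (column name, value) pairs kept sorted by name, and the helpers `slice_columns` and `columns_with_value` are hypothetical.

```python
from bisect import bisect_left, bisect_right

# Toy model (an assumption, not a Cassandra API): one row is a list of
# (column_name, value) pairs kept sorted by column name.
row = sorted({
    "gid": 100,
    "mode": 493,
    "modified": 1270000000,
    "name": "report.doc",
    "size": 4096,
    "uid": 1000,
}.items())

names = [name for name, _ in row]

def slice_columns(start, finish):
    """Columns whose names fall in [start, finish]; relies on the name ordering."""
    lo = bisect_left(names, start)
    hi = bisect_right(names, finish)
    return row[lo:hi]

def columns_with_value(value):
    """Values are not ordered, so matching on a value scans every column."""
    return [(n, v) for n, v in row if v == value]
```

A slice over names is a cheap binary search in this model, whereas anything keyed on a value degenerates into a scan, which is why sorting or filtering by `modified` does not come for free.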




Approach 2: Use composite types and blobs

• Serialize file record and store as single object instead of multiple values

• Use actual values as part of composite column name, so we can search and sort based on them

Files CF

Row key                 Column name (composite)                       Value
(parent_id, file_id)    (name, size, mtime, atime, gid, uid, mode)    file_blob


Approach 2 doesn't work either

• Need to know all the values for a composite to query based on it - otherwise it means a range query, which is expensive

file_exists = len(list(files_cf.get_range(
    start=CompositeType(MIN_UUID, file_id),
    finish=CompositeType(MAX_UUID, file_id),
    row_count=1))) == 1

• Sorting compares the entire composite, not each field

[CompositeType('apples', 6), CompositeType('bananas', 2),
 CompositeType('oranges', 5), CompositeType('pears', 4)]
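
Both limitations can be reproduced with ordinary Python tuples, which compare component by component from the left in much the same way. The sketch below assumes tuples as a stand-in for composite column names; `slice_by_first` and `find_by_second` are hypothetical helpers, not part of any Cassandra client.

```python
from bisect import bisect_left, bisect_right

# Assumption: plain tuples stand in for composite column names.
columns = sorted([("pears", 4), ("apples", 6), ("oranges", 5), ("bananas", 2)])

def slice_by_first(columns, first):
    """Contiguous slice: every composite whose first component equals `first`."""
    lo = bisect_left(columns, (first,))
    hi = bisect_right(columns, (first, float("inf")))
    return columns[lo:hi]

def find_by_second(columns, second):
    """Matches on the second component are scattered, so this is a full scan."""
    return [c for c in columns if c[1] == second]

# The stored order follows the first component only.
second_components = [c[1] for c in columns]
```

Here `second_components` comes out as `[6, 2, 5, 4]`: the composites are sorted, but the numeric field is not, so a query that knows only the trailing component cannot use the ordering.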




Approach 3: De-normalize

• Look at the most common queries and optimize for those

• Most lookups should require just a single get or slice query

• Speed vs. space: do we really care if a record is stored twice?

Files CF

Row key    Column name    Value
file_id    "file"         file_blob

Directories CF

Row key      Column name    Value
parent_id    name           file_blob


Approach 3 works

Get a file by id

file = unpackFile(
    files_cf.get(key=file_id, columns=['file']))

List all files in a particular directory

files = unpackFiles(list(
    directories_cf.get(key=directory_id)))
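
The write side of this design is not shown on the slides. A minimal in-memory sketch of it follows, assuming plain dictionaries in place of the two column families and JSON as the blob format (the real serialization is not stated); `pack_file`, `unpack_file`, `put_file`, `get_file` and `list_directory` are hypothetical names.

```python
import json

# Assumption: dictionaries stand in for the two column families.
files_cf = {}        # file_id   -> {"file": blob}
directories_cf = {}  # parent_id -> {name: blob}

def pack_file(record):
    """Serialize a file record to bytes (JSON is an assumed format)."""
    return json.dumps(record, sort_keys=True).encode("utf-8")

def unpack_file(blob):
    return json.loads(blob.decode("utf-8"))

def put_file(record):
    """Store the same blob twice: once per lookup path."""
    blob = pack_file(record)
    files_cf[record["file_id"]] = {"file": blob}
    directories_cf.setdefault(record["parent_id"], {})[record["name"]] = blob

def get_file(file_id):
    """Single get on the Files CF."""
    return unpack_file(files_cf[file_id]["file"])

def list_directory(directory_id):
    """Single row read on the Directories CF; columns come back in name order."""
    row = directories_cf.get(directory_id, {})
    return [unpack_file(row[name]) for name in sorted(row)]

put_file({"file_id": "f2", "parent_id": "d1", "name": "scan.tiff", "size": 900000})
put_file({"file_id": "f1", "parent_id": "d1", "name": "report.doc", "size": 2048})
```

The cost of the duplication is that every write, rename or delete has to touch both structures, which is the space and write overhead traded for single-read lookups.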



Lessons learned

• CQL isn't necessarily the easiest or best interface

• Break the golden rule

• Composites are useful under limited circumstances

• Avoid wide rows, they can lead to pain

• Should focus on queries that are most important

• Post-processing or Map/Reduce can be used to meet needs of less common queries
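
As an illustration of the post-processing option, the earlier "modified in April 2010, sorted by size" query can be answered client-side once the records have been fetched and unpacked. The sketch below assumes records are dictionaries with `mtime` and `size` fields; the function name `modified_between` and the sample data are made up.

```python
from datetime import datetime

def modified_between(records, start, end):
    """Keep records with start <= mtime < end, largest size first."""
    hits = [r for r in records if start <= r["mtime"] < end]
    return sorted(hits, key=lambda r: r["size"], reverse=True)

# Hypothetical unpacked file records, for illustration only.
records = [
    {"name": "report.doc", "size": 2048, "mtime": datetime(2010, 4, 2, 9, 0)},
    {"name": "scan.tiff", "size": 900000, "mtime": datetime(2010, 4, 20, 17, 30)},
    {"name": "notes.txt", "size": 120, "mtime": datetime(2010, 5, 3, 8, 15)},
]

april = modified_between(records, datetime(2010, 4, 1), datetime(2010, 5, 1))
```

This only scales while the candidate set fits comfortably on the client; beyond that, a batch job over the whole data set is the more realistic route.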



Cassandra and network usage

Figure: network usage on a 10 Mbit connection, replicating to 2 nodes


So how can it be used on a slow WAN?

• Tune down the message and packet size

rpc_send_buff_size_in_bytes
rpc_recv_buff_size_in_bytes
thrift_framed_transport_size_in_mb
thrift_max_message_length_in_mb

• Be prepared for higher failure rates when things get busy

rpc_timeout_in_ms

• Use an additional cache layer to reduce network I/O
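
The slides do not describe the cache layer itself. One possible shape is a small read-through LRU cache in front of whatever function fetches a record over the WAN; the class name `ReadThroughCache`, its parameters and the default size are all assumptions made for this sketch.

```python
from collections import OrderedDict

class ReadThroughCache:
    """Serve repeat reads locally; fall back to `fetch` on a miss."""

    def __init__(self, fetch, max_entries=1024):
        self._fetch = fetch            # callable doing the remote read
        self._max = max_entries
        self._entries = OrderedDict()  # least recently used entry first
        self.misses = 0

    def get(self, key):
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        self.misses += 1
        value = self._fetch(key)            # the only network round trip
        self._entries[key] = value
        if len(self._entries) > self._max:
            self._entries.popitem(last=False)  # evict least recently used
        return value

    def invalidate(self, key):
        """Call after a write so the next read goes back to the store."""
        self._entries.pop(key, None)
```

It would wrap the read path, for example `cache = ReadThroughCache(fetch_record)` where `fetch_record` is whatever function performs the remote get. A cache like this trades freshness for bandwidth, so writes need to invalidate or update the affected keys.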





Figure: network usage on a 10 Mbit connection, replicating to 2 nodes, after tuning



Cassandra replication is better than a DIY alternative

Configuring Cassandra is key

Cassandra has lots of configuration options.

Taking time to understand and tweak them is worth the effort. Leaving them as default probably won't give the best results.

Determine custom policies for how often to compact, repair, scrub, etc. as these depend on the profile of the data being stored.

Future work

Continuing to scale our systems to cope with growing load and data volumes

Adding additional search capabilities

Applying analytics to better understand how people are using our service to store petabytes of data

Summary

• Arkivum provides a guaranteed service for long-term data archive

• We've transitioned our data model from RDBMS to Cassandra

• Our Cassandra deployment is multi-DC, multi-site across WAN

• Future tasks include improving search and using analytics


Questions?

[email protected]

www.arkivum.com

Cassandra cheat sheet

