CFS: Cassandra-backed storage for Hadoop

Posted on 29-Nov-2014


CFS: Cassandra-backed storage for Hadoop
Nick Bailey (@nickmbailey), nick@datastax.com

©2012 DataStax

Motivation


Help me Cassandra, you’re my only hope


Cassandra
• Distributed architecture
• No SPOF
• Scalable
• Real-time data
• No ad-hoc query support


Cassandra, why can’t you...


...do the things Hadoop was built for.


Cassandra + Hadoop = <3


The Solution
• InputFormat/OutputFormat
• Unfortunately, still need a DFS
• Run tasktrackers/datanodes locally
  • Data locality FTW!
• Run namenode/jobtracker somewhere
• Since Cassandra 0.6 (the dark ages)


Ok, but what about these parts that suck...


Do not want...
• Multiple Hadoop stacks?
• SPOF?
• 3 JVMs?


CFS


Cassandra Data model in 1 minute


Column Families
• Column Family ~= Table
• Row Key + columns
• Columns are sparse
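The sparse, schema-light model above can be sketched with plain Python dicts (row key → sparse dict of columns). This is an illustration of the shape only, not the Cassandra API:

```python
# Sketch: a static column family as a dict of rows, each row a sparse
# dict of columns. Rows need not share the same set of columns.
users = {
    "nickmbailey": {"password": "*", "name": "Nick"},
    "zznate": {"password": "*", "name": "Nate", "phone": "512-7777"},
}

# "Sparse" means a missing column simply isn't stored for that row:
print("phone" in users["nickmbailey"])  # False: no phone column here
print(users["zznate"]["phone"])         # 512-7777
```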


Static - Users Column Family


Row Key       Columns
nickmbailey   password: *   name: Nick
zznate        password: *   name: Nate   phone: 512-7777


SELECT * FROM Users WHERE name = 'Nick';

Secondary Indexes


Dynamic - Friends


Row Key       Columns
nickmbailey   zznate:   thobbs:
zznate        jbeiber:   thobbs:   steve_watt:
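In this dynamic column family the column names themselves carry the data, so a row can grow a new column per friend. A minimal Python sketch of that shape (illustrative only, column values left empty as on the slide):

```python
# Sketch: a "dynamic" (wide-row) column family. Each friend is a column
# NAME; the column values are empty placeholders here.
friends = {
    "nickmbailey": {"zznate": "", "thobbs": ""},
    "zznate": {"jbeiber": "", "thobbs": "", "steve_watt": ""},
}

# Listing a user's friends is just reading that row's column names:
print(sorted(friends["zznate"]))  # ['jbeiber', 'steve_watt', 'thobbs']
```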


So what about CFS...


Simple...


CF: inode
• Essentially, namenode replacement
• File metadata


CF: inode
• Row Key = UUID
• Allows for file renames
• Secondary indexes for file browsing
• Columns:


Column        Value
filename      /home/nick/data.txt
parent_path   /home/nick/
attributes    nick:nick:777
TimeUUID1     <block metadata>
TimeUUID2     <block metadata>
TimeUUID3     <block metadata>
...
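The inode row can be sketched in Python. Field names follow the slide; `uuid1()` stands in for the TimeUUID column names (an assumption for illustration, not the CFS wire format):

```python
import uuid

# Sketch: one inode row keyed by a random UUID, so a rename only rewrites
# metadata and never touches the file contents.
inode_key = uuid.uuid4()
inode_row = {
    "filename": "/home/nick/data.txt",
    "parent_path": "/home/nick/",
    "attributes": "nick:nick:777",
    # One time-based UUID column per block, pointing into the sblocks CF:
    uuid.uuid1(): "<block metadata>",
    uuid.uuid1(): "<block metadata>",
}

# A rename touches only this metadata row; the block rows are untouched:
inode_row["filename"] = "/home/nick/renamed.txt"
```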


CF: sblocks
• Essentially, datanode replacement
• Stores actual contents of files
• Each row is an HDFS block
• Row Key = Block ID


Column      Value
TimeUUID1   <compressed file data>
TimeUUID2   <compressed file data>
TimeUUID3   <compressed file data>
...
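The sblocks layout can be sketched the same way: one row per block, one compressed sub-block per column. The time-ordered column sequence is approximated here with an ordered list (names and sizes are illustrative, not the real format):

```python
import uuid
import zlib

# Sketch: one sblocks row per HDFS-style block; each column holds one
# compressed sub-block of file data, named by a time-based UUID so the
# columns come back in write order (approximated with an ordered list).
block_id = uuid.uuid4()  # row key = block ID, referenced by the inode row
sblocks_row = [
    (uuid.uuid1(), zlib.compress(b"subblock 1 data")),
    (uuid.uuid1(), zlib.compress(b"subblock 2 data")),
]

# Reassembling the block means decompressing the columns in order:
data = b"".join(zlib.decompress(v) for _, v in sblocks_row)
print(data)  # b'subblock 1 datasubblock 2 data'
```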


Writes
• Write file metadata
• Split into blocks
  • Still controlled by ‘dfs.block.size’
  • Also ‘cfs.local.subblock.size’
• Read in a block
  • Split into sub-blocks
• Update inode, sblocks
• Rinse, repeat
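The splitting step above can be sketched as follows. The `split` helper and the tiny sizes are hypothetical stand-ins for `dfs.block.size` and `cfs.local.subblock.size` (real defaults are far larger):

```python
# Sketch of the write path's splitting step: carve the file into blocks,
# then carve each block into sub-blocks.
def split(data: bytes, size: int) -> list[bytes]:
    return [data[i:i + size] for i in range(0, len(data), size)]

payload = b"x" * 10
blocks = split(payload, 4)                 # pretend dfs.block.size = 4
subblocks = [split(b, 2) for b in blocks]  # pretend subblock size = 2

print([len(b) for b in blocks])            # [4, 4, 2]
print([len(sb) for sb in subblocks])       # [2, 2, 1] sub-blocks per block
```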


Reads
• Check for file in inode
• Determine appropriate blocks
• Request blocks via Thrift
• If data is local...
  • ...get location on local filesystem
• If data is remote...
  • ...get actual file content via Thrift
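A rough sketch of that local/remote decision. All names here (`read_block`, `remote_fetch`) are hypothetical, not the real CFS client API:

```python
# Sketch of the read path: a locally replicated block yields a path on the
# local filesystem to read directly; a remote block comes back as bytes
# over Thrift (modeled here as a plain callable).
def read_block(block_id, local_blocks, remote_fetch):
    if block_id in local_blocks:
        path = local_blocks[block_id]  # location on the local filesystem
        with open(path, "rb") as f:
            return f.read()
    return remote_fetch(block_id)      # actual file content "via Thrift"

# Usage with a fake remote store standing in for the Thrift call:
remote = {"b1": b"remote bytes"}
print(read_block("b1", {}, remote.__getitem__))  # b'remote bytes'
```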


What Else?
• Current implementation: 1.0.4
• Hadoop configuration:

  <property>
    <name>fs.cfs.impl</name>
    <value>com.datastax.bdp.hadoop.cfs.CassandraFileSystem</value>
  </property>

• Supports HDFS append()
• Immutability makes things easy
• See the first incarnation: https://github.com/riptano/brisk


Want a job?

nick@datastax.com

Questions?