Hadoop and Cassandra at Rackspace

Making Massive Manageable:

Hadoop and Cassandra (at Rackspace)

Big Data Workshop

Stu Hood (@stuhood) – Technical Lead, Rackspace

April 23rd 2010

My, what a large dataset you have...

Processing 3 TB/day of logs

Using Hadoop/Pig

And the sticking points?

“How fast can we provision machines?”

“How do we get data on/off the cluster?”

“How do we add structure?”

MapReduce

Distributed processing methodology

Adapt a problem to MapReduce

Scale forever

Crunch almost anything

Typically adding structure to unstructured data

Also great for structured

Graph processing

Machine learning
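To make "adding structure to unstructured data" concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases. It is not Hadoop code: the log format (host, path, status), the sample lines, and all function names are assumptions made for illustration.

```python
from collections import defaultdict


def map_log_line(line):
    """Map step: parse one raw log line and emit (status, 1)."""
    parts = line.split()
    if len(parts) == 3:  # silently skip malformed lines
        _host, _path, status = parts
        yield status, 1


def reduce_counts(key, values):
    """Reduce step: sum every count emitted for one key."""
    return key, sum(values)


def run_mapreduce(records, mapper, reducer):
    """Single-process stand-in for the map, shuffle, and reduce phases."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # shuffle: group values by key
    return dict(reducer(k, vs) for k, vs in sorted(groups.items()))


# Hypothetical sample input; real logs would be read from the cluster.
LOG_LINES = [
    "web1 /index.html 200",
    "web2 /missing 404",
    "web1 /about.html 200",
    "garbage",
]

status_counts = run_mapreduce(LOG_LINES, map_log_line, reduce_counts)
print(status_counts)  # {'200': 2, '404': 1}
```

On a real cluster the mapper and reducer run on many machines and the framework performs the shuffle, which is why a problem adapted to this shape can keep scaling.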

“You want to use how many clients?”

Need to store structured inputs/outputs

Solution needs to

Support arbitrary number of clients

Preferably provide locality

Possibly provide 'web' latency

Solutions of varying quality

Sharding the RDBMS

shard n. - A horizontal partition in a database

Example: Sharding by userid

Provided by ORM?

Fixed partitions: manual rebalancing

Developing from scratch?

Adding/removing nodes

Handling failover

As a library? As a middle tier?
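The "fixed partitions: manual rebalancing" problem can be sketched in a few lines. The example assumes the simplest possible scheme, hashing the userid modulo the shard count; the function names and the user id range are hypothetical.

```python
import hashlib


def shard_for(user_id, num_shards):
    """Pick a shard by hashing the user id into a fixed number of partitions."""
    digest = hashlib.md5(str(user_id).encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_shards


def moved_fraction(user_ids, old_shards, new_shards):
    """Fraction of users whose shard changes when the shard count changes."""
    moved = sum(
        1
        for uid in user_ids
        if shard_for(uid, old_shards) != shard_for(uid, new_shards)
    )
    return moved / len(user_ids)


user_ids = range(10_000)  # hypothetical user ids
fraction = moved_fraction(user_ids, old_shards=4, new_shards=5)
print(f"{fraction:.0%} of users move when going from 4 to 5 shards")
```

With modulo placement a key stays put only when its hash gives the same remainder for both shard counts, so roughly four in five users change shards here. Adding a single node therefore means moving most of the data by hand.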

Leaving data in Hadoop

Storage in Map/SequenceFile

Serialized with Thrift/Avro/Protocol Buffers

No random access

High latency

Storing in HBase/Hypertable

Column stores implemented on Hadoop

Modeled after Google's Bigtable

Multiple points of failure

Namenode

Master

High (almost non-web) latency

And the newest contender...

Standing on the shoulders of: Amazon Dynamo

No node in the cluster is special

No special roles

No scaling bottlenecks

No single point of failure

Techniques

Gossip

Eventual consistency
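A toy simulation shows how gossip leads to eventual consistency without any special node. It is only a sketch under assumed rules (push-pull exchange with one random peer per round, highest version wins), not Dynamo's or Cassandra's actual protocol; the key name and node count are made up.

```python
import random


def merge(local, remote):
    """Keep, per key, whichever (version, value) pair has the higher version."""
    for key, (version, value) in remote.items():
        if key not in local or local[key][0] < version:
            local[key] = (version, value)


def gossip_round(states, rng):
    """Every node exchanges state with one randomly chosen peer."""
    for i in range(len(states)):
        j = rng.choice([n for n in range(len(states)) if n != i])
        merge(states[j], states[i])  # push our state to the peer
        merge(states[i], states[j])  # pull the peer's state back


def rounds_until_converged(states, rng, max_rounds=1000):
    """Gossip until every node holds identical state; return the round count."""
    for round_no in range(1, max_rounds + 1):
        gossip_round(states, rng)
        if all(state == states[0] for state in states):
            return round_no
    raise RuntimeError("did not converge")


rng = random.Random(42)  # seeded so runs are repeatable
states = [{} for _ in range(8)]  # eight nodes, all initially empty
states[0]["node3.status"] = (1, "up")
states[5]["node3.status"] = (2, "down")  # newer version, should win everywhere
rounds = rounds_until_converged(states, rng)
print(f"converged after {rounds} round(s): {states[0]}")
```

No node coordinates the exchange, so any node can fail without stopping the spread of information. The cost is that nodes may briefly disagree until the rounds complete.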

Standing on the shoulders of: Google Bigtable

“Column family” data model

Range queries for rows:

Scan rows in order

Memtable/SSTable structure

Always writes sequentially to disk

Bloom filters to minimize random reads

Trounces B-Trees for big data

Linear insert performance

Logarithmic growth for reads
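The memtable/SSTable structure and the role of Bloom filters can be sketched in memory. This is an illustration under simplifying assumptions (no commit log, no compaction, lists standing in for on-disk files), and every class name and parameter below is hypothetical rather than taken from Bigtable or Cassandra.

```python
import bisect
import hashlib


class BloomFilter:
    """Probabilistic set: may give false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class SSTable:
    """Immutable, sorted run of (key, value) pairs plus its Bloom filter."""

    def __init__(self, sorted_items):
        self.keys = [k for k, _ in sorted_items]
        self.values = [v for _, v in sorted_items]
        self.bloom = BloomFilter()
        for key in self.keys:
            self.bloom.add(key)

    def get(self, key):
        if not self.bloom.might_contain(key):
            return None  # definitely absent: skip the "disk" lookup
        i = bisect.bisect_left(self.keys, key)  # binary search in sorted keys
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None


class Store:
    """Writes go to the memtable; full memtables are flushed as SSTables."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.sstables = []  # oldest first, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # One sequential write of the sorted contents; never updated in place.
        self.sstables.append(SSTable(sorted(self.memtable.items())))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest data wins
            value = table.get(key)
            if value is not None:
                return value
        return None


store = Store(memtable_limit=2)
for key, value in [("alice", 1), ("bob", 2), ("alice", 3), ("carol", 4), ("dave", 5)]:
    store.put(key, value)
print(store.get("alice"), store.get("bob"), store.get("dave"), store.get("zed"))
# 3 2 5 None
```

Inserts only ever append a new sorted run, which is where the sequential-write behaviour comes from. Reads may have to consult several runs, and the Bloom filter lets most of those checks end before touching the run itself.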

Enter Cassandra

Hybrid of ancestors

Adopts listed features

And adds:

A sweet logo!

Pluggable partitioning

Multi datacenter support

Pluggable locality awareness

Datamodel improvements
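What "pluggable partitioning" buys can be sketched with a small token ring whose key-to-token function is swappable. This is an illustrative model, not Cassandra's partitioner interface; the `Ring` class, node names, and tokens are assumptions.

```python
import bisect
import hashlib


class Ring:
    """Token ring: a key belongs to the first node whose token is >= its token."""

    def __init__(self, node_tokens, token_for_key):
        self.entries = sorted((token, node) for node, token in node_tokens.items())
        self.tokens = [token for token, _ in self.entries]
        self.token_for_key = token_for_key  # the pluggable part

    def node_for(self, key):
        i = bisect.bisect_left(self.tokens, self.token_for_key(key))
        return self.entries[i % len(self.entries)][1]  # wrap around the ring


def md5_token(key):
    """Hash partitioning: spreads keys evenly but destroys key order."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


def identity_token(key):
    """Order-preserving partitioning: the key itself is the token."""
    return key


MAX_TOKEN = 2**128 - 1  # largest possible MD5 value
hash_ring = Ring(
    {"A": MAX_TOKEN // 3, "B": 2 * (MAX_TOKEN // 3), "C": MAX_TOKEN},
    md5_token,
)
ordered_ring = Ring({"A": "h", "B": "p", "C": "z"}, identity_token)

keys = ["apple", "banana", "kiwi", "melon", "quince", "tomato"]
print([ordered_ring.node_for(k) for k in keys])  # ['A', 'A', 'B', 'B', 'C', 'C']
print([hash_ring.node_for(k) for k in keys])     # placement follows the hash
```

The order-preserving variant keeps neighbouring keys on the same node, which is what makes in-order row scans practical. The hash variant trades that away for an even spread of load.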

Enter Cassandra

Project status

Open sourced by Facebook in 2008 (Facebook no longer active in the project)

Apache License

Graduated to Apache TLP February 2010

Major releases: 0.3 through 0.6 (0.7 in two months)

cassandra.apache.org

Enter Cassandra

The code base

Java, Apache Ant, Git/SVN

5+ committers from 3+ companies

Known deployments at:

Cloudkick, Digg, Mahalo, SimpleGeo, Twitter, Rackspace, Reddit

Performance

Like peanut butter with jelly

Apache Cassandra 0.6:

MapReduce input support out of the box

Locality information partially exposed

Hadoop InputFormat

Pig LoadFunc

Hadoop + Cassandra at RAX

Multiple Hadoop clusters deployed

Smaller Cassandra deployments

Preparing for large scale Cassandra deployment

In the pipeline

MapReduce output support

Adding an OutputFormat with locality information

Improving locality for Hadoop inputs

Getting started

http://cassandra.apache.org/

Read "Getting Started"... Roughly:

Start one node

Test/develop app, editing node config as necessary

Launch cluster by starting more nodes with chosen config

Thanks!

Big Data Workshop

Participants!

Questions?

References

Brandon Williams's perf tests

http://racklabs.com/~bwilliam/cassandra/04vs05vs06.png

Hadoop/Cassandra Integration

http://issues.apache.org/jira/browse/CASSANDRA-342
