Hadoop and Cassandra at Rackspace

Making Massive Manageable: Hadoop and Cassandra (at Rackspace) Big Data Workshop Stu Hood (@stuhood) – Technical Lead, Rackspace April 23rd 2010

Transcript of Hadoop and Cassandra at Rackspace

Page 1: Hadoop and Cassandra at Rackspace

Making Massive Manageable:

Hadoop and Cassandra (at Rackspace)

Big Data Workshop

Stu Hood (@stuhood) – Technical Lead, Rackspace

April 23rd 2010

Page 2: Hadoop and Cassandra at Rackspace

My, what a large dataset you have...

Processing 3 TB/day of logs

Using Hadoop/Pig

And the sticking points?

“How fast can we provision machines?”

“How do we get data on/off the cluster?”

“How do we add structure?”

Page 3: Hadoop and Cassandra at Rackspace

MapReduce

Distributed processing methodology

Adapt a problem to MapReduce

Scale forever

Crunch almost anything

Typically adding structure to unstructured data

Logs

Also great for structured

Graph processing

Machine learning
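As an illustration of "adding structure to unstructured data", here is a minimal single-process sketch of the map, shuffle, and reduce phases. It is not Hadoop code: the log format (path followed by status code) and the function names are assumptions made for the example.

```python
from collections import defaultdict

def map_phase(line):
    """Map: parse one raw log line and emit (status_code, 1)."""
    parts = line.split()
    if len(parts) >= 2:          # silently skip malformed lines
        yield parts[1], 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)

def run_job(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Hypothetical log lines: "<path> <status>"
LOGS = [
    "/index.html 200",
    "/missing 404",
    "/index.html 200",
    "malformed",
]

if __name__ == "__main__":
    print(run_job(LOGS))
```

In a real cluster the map and reduce calls run on many machines and the shuffle moves data over the network; the shape of the three functions stays the same.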

Page 4: Hadoop and Cassandra at Rackspace

“You want to use how many clients?”

Need to store structured inputs/outputs

Solution needs to

Support arbitrary number of clients

Preferably provide locality

Possibly provide 'web' latency

Page 5: Hadoop and Cassandra at Rackspace

Solutions of varying quality

Sharding the RDBMS

shard n. - A horizontal partition in a database

Example: Sharding by userid

Provided by ORM?

Fixed partitions: manual rebalancing

Developing from scratch?

Adding/removing nodes

Handling failover

As a library? As a middle tier?
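To make the "fixed partitions: manual rebalancing" pain concrete, here is a small sketch of sharding by userid with a hash modulo the shard count. The helper names and the choice of MD5 are assumptions for illustration, not something taken from a particular ORM.

```python
import hashlib

def shard_for(userid, num_shards):
    """Fixed partitioning: stable hash of the key, modulo the shard count."""
    digest = hashlib.md5(str(userid).encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_shards

def moved_fraction(userids, old_shards, new_shards):
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(
        1 for u in userids
        if shard_for(u, old_shards) != shard_for(u, new_shards)
    )
    return moved / len(userids)

if __name__ == "__main__":
    users = [f"user{i}" for i in range(10000)]   # hypothetical userids
    print("user42 lives on shard", shard_for("user42", 4))
    print("fraction moved going from 4 to 5 shards:",
          moved_fraction(users, 4, 5))
```

With this scheme, adding a single shard relocates most keys, which is why adding/removing nodes and failover become the hard parts of a home-grown solution.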

Page 6: Hadoop and Cassandra at Rackspace

Solutions of varying quality

Leaving data in Hadoop

Storage in Map/SequenceFile

Serialized with Thrift/Avro/Protocol Buffers

No random access

High latency

Page 7: Hadoop and Cassandra at Rackspace

Solutions of varying quality

Storing in HBase/Hypertable

Column stores implemented on Hadoop

Modeled after Google's Bigtable

Multiple points of failure

Namenode

Master

High (almost non-web) latency

Page 8: Hadoop and Cassandra at Rackspace

And the newest contender...

Page 9: Hadoop and Cassandra at Rackspace

Standing on the shoulders of: Amazon Dynamo

No node in the cluster is special

No special roles

No scaling bottlenecks

No single point of failure

Techniques

Gossip

Eventual consistency
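A toy simulation can show how gossip lets every node learn about the whole cluster with no special coordinator. This is a sketch under simplifying assumptions (in-memory nodes, a seeded random peer choice, a heartbeat counter per node); it is not the protocol Dynamo or Cassandra actually implement, and the class and function names are invented for the example.

```python
import random

class Node:
    def __init__(self, name):
        self.name = name
        self.view = {name: 1}    # node name -> highest heartbeat seen

    def merge(self, other_view):
        """Keep the newest heartbeat known for every node."""
        for name, heartbeat in other_view.items():
            if heartbeat > self.view.get(name, 0):
                self.view[name] = heartbeat

    def exchange(self, peer):
        """Push-pull: both sides end up with the union of their views."""
        snapshot = dict(self.view)
        self.merge(peer.view)
        peer.merge(snapshot)

def gossip_round(nodes, rng):
    """Every node talks to one randomly chosen other node."""
    for node in nodes:
        peer = rng.choice([n for n in nodes if n is not node])
        node.exchange(peer)

def converged(nodes):
    return all(n.view == nodes[0].view for n in nodes)

def run_until_converged(nodes, rng, max_rounds=100):
    """Return the number of rounds needed, or None if the limit is hit."""
    for round_number in range(1, max_rounds + 1):
        gossip_round(nodes, rng)
        if converged(nodes):
            return round_number
    return None

if __name__ == "__main__":
    cluster = [Node(f"n{i}") for i in range(8)]
    rounds = run_until_converged(cluster, random.Random(42))
    print("views agree after", rounds, "rounds")
```

No node plays a special role here: each one runs the same code, and the views agree eventually rather than immediately.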

Page 10: Hadoop and Cassandra at Rackspace

Standing on the shoulders of: Google Bigtable

“Column family” data model

Range queries for rows:

Scan rows in order

Memtable/SSTable structure

Always writes sequentially to disk

Bloom filters to minimize random reads

Trounces B-Trees for big data

Linear insert performance

Logarithmic growth for reads
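The memtable/SSTable idea can be sketched in a few dozen lines: writes land in an in-memory table, which is flushed as a sorted, immutable run, and a Bloom filter per run lets reads skip runs that cannot contain the key. Everything here (class names, filter sizing, the tiny flush threshold, lists standing in for files) is an assumption for illustration; it leaves out the commit log, compaction, and deletes.

```python
import bisect
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0                      # a Python int used as a bit array

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        """False means definitely absent; True means possibly present."""
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

class SSTable:
    """Immutable sorted run, standing in for one sequentially written file."""
    def __init__(self, sorted_items):
        self.keys = [k for k, _ in sorted_items]
        self.values = [v for _, v in sorted_items]
        self.bloom = BloomFilter()
        for k in self.keys:
            self.bloom.add(k)

    def get(self, key):
        if not self.bloom.might_contain(key):
            return None                    # skip the (simulated) disk read
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None

class Store:
    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.sstables = []
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Write the memtable out as one sorted run and start a fresh one."""
        self.sstables.append(SSTable(sorted(self.memtable.items())))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):   # newest run wins
            value = table.get(key)
            if value is not None:
                return value
        return None
```

Because a flush writes a whole sorted run at once, inserts never require seeking to update data in place, which is the contrast with B-Trees drawn above.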

Page 11: Hadoop and Cassandra at Rackspace

Enter Cassandra

Hybrid of ancestors

Adopts listed features

And adds:

A sweet logo!

Pluggable partitioning

Multi datacenter support

Pluggable locality awareness

Datamodel improvements
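"Pluggable partitioning" means the function that maps a row key to a position on the ring can be swapped. The sketch below contrasts a hash-based partitioner with an order-preserving one; the `Ring` class, the token layout, and the function names are assumptions for illustration and do not mirror Cassandra's internal classes.

```python
import hashlib
from bisect import bisect_left

def random_token(key):
    """Hash-based: spreads keys evenly but destroys key order."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")

def order_preserving_token(key):
    """Order-preserving: the key is its own token, so range scans work."""
    return key

class Ring:
    def __init__(self, partitioner, node_tokens):
        # node_tokens maps token -> node name; each node owns the range
        # of tokens that ends at its own token.
        self.partitioner = partitioner
        self.tokens = sorted(node_tokens)
        self.nodes = dict(node_tokens)

    def node_for(self, key):
        token = self.partitioner(key)
        i = bisect_left(self.tokens, token)
        if i == len(self.tokens):
            i = 0                          # wrap around the ring
        return self.nodes[self.tokens[i]]

if __name__ == "__main__":
    ordered = Ring(order_preserving_token,
                   {"g": "node1", "p": "node2", "z": "node3"})
    for key in ["apple", "melon", "zebra"]:
        print(key, "->", ordered.node_for(key))

    step = 2 ** 128 // 3
    hashed = Ring(random_token,
                  {step: "node1", 2 * step: "node2", 2 ** 128 - 1: "node3"})
    for key in ["apple", "melon", "zebra"]:
        print(key, "->", hashed.node_for(key))
```

The trade-off is the usual one: the order-preserving choice keeps adjacent keys on the same node for ordered scans, while the hash-based choice balances load without any knowledge of the key distribution.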

Page 12: Hadoop and Cassandra at Rackspace

Enter Cassandra

Project status

Open sourced by Facebook in 2008 (Facebook no longer active in the project)

Apache License

Graduated to Apache TLP February 2010

Major releases: 0.3 through 0.6 (0.7 in two months)

cassandra.apache.org

Page 13: Hadoop and Cassandra at Rackspace

Enter Cassandra

The code base

Java, Apache Ant, Git/SVN

5+ committers from 3+ companies

Known deployments at:

Cloudkick, Digg, Mahalo, SimpleGeo, Twitter, Rackspace, Reddit

Page 14: Hadoop and Cassandra at Rackspace

Performance

Page 15: Hadoop and Cassandra at Rackspace

Like peanut butter with jelly

Apache Cassandra 0.6:

MapReduce input support out of the box

Locality information partially exposed

Hadoop InputFormat

Pig LoadFunc

Page 16: Hadoop and Cassandra at Rackspace

Hadoop + Cassandra at RAX

Multiple Hadoop clusters deployed

Smaller Cassandra deployments

Preparing for large scale Cassandra deployment

Page 17: Hadoop and Cassandra at Rackspace

In the pipeline

MapReduce output support

Adding an OutputFormat with locality information

Improving locality for Hadoop inputs

Page 18: Hadoop and Cassandra at Rackspace

Getting started

http://cassandra.apache.org/

Read "Getting Started"... Roughly:

Start one node

Test/develop app, editing node config as necessary

Launch cluster by starting more nodes with chosen config

Page 19: Hadoop and Cassandra at Rackspace

Thanks!

Big Data Workshop

Participants!

Page 20: Hadoop and Cassandra at Rackspace

Questions?

Page 21: Hadoop and Cassandra at Rackspace

References

Brandon Williams's perf tests

http://racklabs.com/~bwilliam/cassandra/04vs05vs06.png

Hadoop/Cassandra Integration

http://issues.apache.org/jira/browse/CASSANDRA-342