Cassandra Hadoop Best Practices by Jeremy Hanna

Hadoop + Cassandra Best Practices Thursday, June 6, 13

Transcript of Cassandra Hadoop Best Practices by Jeremy Hanna

Page 1: Cassandra Hadoop Best Practices by Jeremy Hanna

Hadoop + Cassandra Best Practices

Thursday, June 6, 13

Some Background

• Hadoop support since early 2010

• MapReduce/Pig works with any Hadoop 1.x distribution.

• Hive is a neatly integrated piece of DataStax Enterprise (DSE)

• Data locality just like with HDFS

• Cassandra can handle ~200 column families (CFs)

Setup

• Analytics-specific datacenter

• Configure replication (specific to each keyspace and datacenter)

• Isolated reads at CL.LOCAL_QUORUM

• Writes will be replicated

• Same best practices as with Hadoop alone
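A minimal sketch of the replication and consistency points above, assuming a keyspace replicated to one datacenter for the live application and one for analytics. The keyspace name `events`, the datacenter names `Cassandra` and `Analytics`, and the replica counts are illustrative assumptions, not values from the talk. The quorum rule (half the local replicas, rounded down, plus one) is the standard definition.

```python
def local_quorum(replication_factor):
    """Replicas in one datacenter that must answer a LOCAL_QUORUM request."""
    return replication_factor // 2 + 1


def create_keyspace_cql(keyspace, replicas_per_dc):
    """Build a CREATE KEYSPACE statement that sets replication per datacenter."""
    options = ["'class': 'NetworkTopologyStrategy'"]
    options += ["'%s': %d" % (dc, rf) for dc, rf in sorted(replicas_per_dc.items())]
    return "CREATE KEYSPACE %s WITH replication = {%s};" % (keyspace, ", ".join(options))


# Hypothetical layout: three replicas for the application, two for analytics.
replicas = {"Cassandra": 3, "Analytics": 2}
statement = create_keyspace_cql("events", replicas)
quorums = {dc: local_quorum(rf) for dc, rf in replicas.items()}

print(statement)
print(quorums)
```

With this layout a write is stored in both datacenters, while a Hadoop job reading at LOCAL_QUORUM in the analytics datacenter waits only on replicas there, which is what keeps the analytic reads off the application nodes.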

Vanilla Hadoop

• Co-locate task trackers and data nodes with Cassandra nodes (data locality)

• Workload isolation by configuring a separate Cassandra datacenter
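To illustrate why co-location matters, here is a deliberately simplified sketch of locality-aware task placement. It is not Hadoop's scheduler: the function `assign_splits`, the host names, and the split names are all hypothetical. It shows only the idea that a split can be processed locally when a task tracker runs on a host that holds a replica of that split's data.

```python
def assign_splits(splits, task_trackers):
    """Return {split: (host, is_local)}, preferring a host that owns the data."""
    fallback = sorted(task_trackers)[0]  # arbitrary remote choice for the sketch
    assignments = {}
    for name, replica_hosts in splits.items():
        local_hosts = [host for host in replica_hosts if host in task_trackers]
        if local_hosts:
            assignments[name] = (local_hosts[0], True)
        else:
            assignments[name] = (fallback, False)
    return assignments


def local_count(assignments):
    """Number of splits that were placed on a host holding their data."""
    return sum(1 for _, is_local in assignments.values() if is_local)


# Hypothetical token-range splits and the Cassandra hosts that store them.
splits = {
    "range-1": ["node1", "node2"],
    "range-2": ["node2", "node3"],
    "range-3": ["node3", "node1"],
}

colocated = assign_splits(splits, {"node1", "node2", "node3"})
separate = assign_splits(splits, {"hadoop1", "hadoop2"})

print(local_count(colocated), local_count(separate))
```

When the task trackers share hosts with the Cassandra nodes, every split in this toy cluster can be read locally. With a separate set of Hadoop hosts, every read crosses the network.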

Planning

• MapReduce over full column family

• Model data accordingly

• Add more column families

• Can use a secondary index, but with caution
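The planning points above can be sketched with plain dictionaries standing in for column families. This is an in-memory illustration under assumed data, not Cassandra or Hadoop code: `events`, `purchases`, and `count_by_country` are hypothetical names. The point is that a job reads every row of its input column family, so a narrower column family modeled for the job means fewer rows scanned.

```python
from collections import Counter

# Hypothetical wide column family: every event the application records.
events = {
    "e1": {"type": "click", "country": "US"},
    "e2": {"type": "purchase", "country": "US"},
    "e3": {"type": "click", "country": "DE"},
    "e4": {"type": "purchase", "country": "DE"},
    "e5": {"type": "click", "country": "US"},
}

# Hypothetical extra column family holding only what the purchase report needs.
purchases = {
    key: {"country": columns["country"]}
    for key, columns in events.items()
    if columns["type"] == "purchase"
}


def count_by_country(column_family, keep=lambda columns: True):
    """Scan every row of the input and count the kept rows per country."""
    rows_read = 0
    counts = Counter()
    for columns in column_family.values():
        rows_read += 1
        if keep(columns):
            counts[columns["country"]] += 1
    return dict(counts), rows_read


from_events = count_by_country(events, keep=lambda c: c["type"] == "purchase")
from_purchases = count_by_country(purchases)

print(from_events)
print(from_purchases)
```

Both runs produce the same counts, but the second reads two rows instead of five. On a real cluster that difference is the size of the full scan.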

Execution

• Project and select early in your workflow

• Store common intermediate datasets (in CFS/HDFS)

• Bulk loader output format excels
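As a rough sketch of the first two execution points, the snippet below selects rows and projects columns before any aggregation, then reuses the trimmed dataset for two different results. The record layout and the helper names `select` and `project` are assumptions made for illustration. In a real workflow the trimmed dataset would be written to CFS or HDFS rather than held in a list.

```python
from collections import defaultdict

# Hypothetical raw rows; "payload" stands in for wide columns the job never needs.
rows = [
    {"user": "a", "status": "ok", "bytes": 120, "payload": "x" * 50},
    {"user": "b", "status": "error", "bytes": 300, "payload": "x" * 50},
    {"user": "a", "status": "ok", "bytes": 80, "payload": "x" * 50},
    {"user": "c", "status": "ok", "bytes": 50, "payload": "x" * 50},
]


def select(records, predicate):
    """Keep only the records the job cares about."""
    return (record for record in records if predicate(record))


def project(records, fields):
    """Keep only the columns the job cares about."""
    return ({field: record[field] for field in fields} for record in records)


# Select and project first, then keep the result as a shared intermediate dataset.
trimmed = list(project(select(rows, lambda r: r["status"] == "ok"), ("user", "bytes")))

bytes_by_user = defaultdict(int)
requests_by_user = defaultdict(int)
for record in trimmed:
    bytes_by_user[record["user"]] += record["bytes"]
    requests_by_user[record["user"]] += 1

print(dict(bytes_by_user))
print(dict(requests_by_user))
```

Everything after the first step handles three small records instead of four wide ones, and the second aggregation does not have to repeat the filtering.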

Use Cases

• Typical Hadoop tasks

• Validate data

• Fix data

• Bootstrap a new column family from existing data
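The validate, fix, and bootstrap use cases can be sketched as three passes over a column family, again with dictionaries standing in for the real thing. The `users` data, the validation rules, and the derived `users_by_country` layout are hypothetical examples, not taken from the talk.

```python
# Hypothetical column family with some inconsistent rows.
users = {
    "u1": {"email": "Ann@Example.com", "country": "us"},
    "u2": {"email": "bob@example.com", "country": "DE"},
    "u3": {"email": "", "country": "US"},
}


def validate(columns):
    """Return a list of problems found in one row (empty if the row is clean)."""
    problems = []
    if not columns["email"]:
        problems.append("missing email")
    if columns["country"] != columns["country"].upper():
        problems.append("country not upper-case")
    return problems


def fix(columns):
    """Return a normalized copy of one row."""
    fixed = dict(columns)
    fixed["email"] = columns["email"].lower()
    fixed["country"] = columns["country"].upper()
    return fixed


# Pass 1: validate, reporting only the rows that have problems.
report = {}
for key, columns in users.items():
    problems = validate(columns)
    if problems:
        report[key] = problems

# Pass 2: fix every row.
fixed_users = {key: fix(columns) for key, columns in users.items()}

# Pass 3: bootstrap a new column family keyed by country from the fixed data.
users_by_country = {}
for key, columns in fixed_users.items():
    users_by_country.setdefault(columns["country"], []).append(key)

print(report)
print(users_by_country)
```

Each pass is a full scan, which is the kind of work that fits a MapReduce job over a column family. A missing email cannot be repaired automatically, so it stays in the report for follow-up.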

Thank you

• Jeremy Hanna

• @jeromatron (Twitter and IRC)

[email protected]

• Ping me if you have any questions
