Cassandra Hadoop Best Practices by Jeremy Hanna
Hadoop + Cassandra Best Practices
Thursday, June 6, 13
Some Background
• Hadoop support since early 2010
• MapReduce and Pig work with any Hadoop 1.x distribution
• Hive is a neatly integrated piece of DSE
• Data locality just like with HDFS
• Cassandra can handle ~200 column families (CFs)
Setup
• Analytics specific datacenter
• Configure replication (per keyspace and per datacenter)
• Isolated reads at CL.LOCAL_QUORUM
• Writes will be replicated
• Same best practices as with Hadoop alone
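As a rough sketch of the replication and consistency points above: the helper below builds a CQL CREATE KEYSPACE statement with NetworkTopologyStrategy and computes how many replicas a LOCAL_QUORUM read needs inside one datacenter. The keyspace name "events" and the datacenter names "Realtime" and "Analytics" are assumptions for illustration; real datacenter names come from the cluster's snitch configuration.

```python
def create_keyspace_cql(keyspace, replicas_per_dc):
    """Build a CREATE KEYSPACE statement with one replication factor per datacenter."""
    parts = ["'class': 'NetworkTopologyStrategy'"]
    # Sort datacenter names so the generated statement is deterministic.
    for dc, rf in sorted(replicas_per_dc.items()):
        parts.append("'{}': {}".format(dc, rf))
    return "CREATE KEYSPACE {} WITH replication = {{{}}};".format(
        keyspace, ", ".join(parts)
    )


def local_quorum(local_rf):
    """Replicas that must answer a LOCAL_QUORUM request in the local datacenter."""
    return local_rf // 2 + 1


# Hypothetical layout: 3 replicas for the realtime DC, 2 for the analytics DC.
statement = create_keyspace_cql("events", {"Realtime": 3, "Analytics": 2})
print(statement)
print(local_quorum(2))  # replicas needed for a read in the analytics DC
```

Because LOCAL_QUORUM only counts replicas in the coordinator's own datacenter, analytics reads issued this way do not wait on the realtime nodes, while writes still reach both datacenters through replication.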
Vanilla Hadoop
• Co-locate task trackers and data nodes with Cassandra nodes (data locality)
• Workload isolation with separate Cassandra datacenter configured
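To illustrate why co-location matters, here is a toy model of locality-aware task placement. It is not Hadoop's actual scheduler; the function name, host names, and data shapes are all assumptions. Each input split lists the hosts that hold a replica of its data, and a task counts as local only when a task tracker runs on one of those hosts.

```python
def assign_splits(splits, task_trackers):
    """Map each split to (host, is_local).

    splits: dict of split id -> list of hosts holding a replica of that data.
    task_trackers: set of hosts that run a task tracker.
    """
    fallback = sorted(task_trackers)[0]  # arbitrary remote choice for this toy model
    assignment = {}
    for split_id, replica_hosts in splits.items():
        local_hosts = [h for h in replica_hosts if h in task_trackers]
        if local_hosts:
            # A tracker sits next to the data: the read stays on one machine.
            assignment[split_id] = (local_hosts[0], True)
        else:
            # No co-located tracker: the data must cross the network.
            assignment[split_id] = (fallback, False)
    return assignment


# Hypothetical cluster: trackers run on node1 and node2 only.
splits = {"s1": ["node1", "node3"], "s2": ["node3", "node4"]}
print(assign_splits(splits, {"node1", "node2"}))
```

In this model, running a task tracker on every Cassandra node in the analytics datacenter means every split has at least one local candidate.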
Planning
• MapReduce over full column family
• Model data accordingly
• Add more column families
• Can use secondary index, but use caution
Execution
• Project and select early in your workflow
• Store common intermediate datasets (in CFS/HDFS)
• Bulk loader output format excels
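A minimal sketch of "project and select early", written as an in-memory imitation of a MapReduce job rather than real Hadoop or Pig code. The column names (country, user_id, amount) and the filter are assumptions. The point is that the mapper drops unwanted rows and unneeded columns before anything reaches the shuffle, so later stages handle less data.

```python
from collections import defaultdict


def mapper(row):
    """Select and project as early as possible, inside the map step."""
    # Select: discard rows the job does not need.
    if row.get("country") != "US":
        return
    # Project: emit only the two columns the reducer uses.
    yield row["user_id"], row["amount"]


def run_job(rows):
    """Imitate map, shuffle, and a summing reduce in memory."""
    shuffled = defaultdict(list)
    for row in rows:
        for key, value in mapper(row):
            shuffled[key].append(value)
    return {key: sum(values) for key, values in shuffled.items()}


rows = [
    {"user_id": "a", "country": "US", "amount": 5, "notes": "large unused column"},
    {"user_id": "a", "country": "US", "amount": 7, "notes": "large unused column"},
    {"user_id": "b", "country": "DE", "amount": 9, "notes": "large unused column"},
]
print(run_job(rows))
```

In Pig the same idea corresponds to placing FILTER and a narrowing FOREACH ... GENERATE directly after the LOAD.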
Use Cases
• Typical Hadoop tasks
• Validate data
• Fix data
• Bootstrap a new column family from existing data
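The validate and bootstrap use cases can be pictured with a small in-memory sketch. Plain dictionaries stand in for column families here, and the "email" column, the validation rule, and the function name are assumptions for illustration only. One pass over the existing rows reports invalid ones and builds a new lookup keyed by a different column.

```python
def bootstrap_by_email(users_cf):
    """Scan every row once; return (new column family keyed by email, invalid row keys).

    users_cf: dict of user id -> dict of column name -> value.
    """
    by_email = {}
    invalid = []
    for user_id, columns in users_cf.items():
        email = columns.get("email")
        # Validate: record rows with a missing or malformed email.
        if not email or "@" not in email:
            invalid.append(user_id)
            continue
        # Bootstrap: write the row into the new layout, keyed by email.
        by_email[email.lower()] = {"user_id": user_id}
    return by_email, invalid


users = {
    "u1": {"email": "Ann@example.com"},
    "u2": {"email": "not-an-email"},
    "u3": {},
}
new_cf, bad_rows = bootstrap_by_email(users)
print(new_cf)
print(sorted(bad_rows))
```

In a real job the scan would be the map phase over the source column family, and the writes would go through a Cassandra output format instead of a dictionary.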
Thank you
• Jeremy Hanna
• @jeromatron (Twitter and IRC)
• Ping me if you have any questions