Hadoop and Cassandra at Rackspace

Date posted: 29-Jan-2018

Transcript of Hadoop and Cassandra at Rackspace

  1. Making Massive Manageable: Hadoop and Cassandra (at Rackspace). Big Data Workshop. Stu Hood (@stuhood), Technical Lead, Rackspace. April 23rd 2010
  2. My, what a large dataset you have...
    • Processing 3 TB/day of logs
    • Using Hadoop/Pig
    • And the sticking points?
      • How fast can we provision machines?
      • How do we get data on/off the cluster?
      • How do we add structure?
  3. MapReduce
    • Distributed processing methodology
      • Adapt a problem to MapReduce
      • Scale forever
      • Crunch almost anything
    • Typically adding structure to unstructured data
      • Logs
    • Also great for structured data
      • Graph processing
      • Machine learning
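To make "adding structure to unstructured data" concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases applied to raw log lines. The log format (timestamp, host, status) and every function name are assumptions made for illustration; this is not the Hadoop API, and a real job would run the same steps across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_log_line(line):
    """Map step: parse one raw log line and emit (key, value) pairs."""
    parts = line.split()
    if len(parts) < 3:
        return []  # skip malformed lines instead of failing the job
    _timestamp, host, status = parts[:3]
    return [((host, status), 1)]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: collapse all values for one key into one structured record."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence on an iterable of log lines."""
    mapped = chain.from_iterable(map_log_line(line) for line in lines)
    return dict(reduce_counts(k, v) for k, v in shuffle(mapped).items())


# Hypothetical sample input in the assumed "timestamp host status" format.
SAMPLE_LOGS = [
    "2010-04-23T10:00:00 web1 200",
    "2010-04-23T10:00:01 web1 500",
    "2010-04-23T10:00:02 web2 200",
    "2010-04-23T10:00:03 web1 200",
    "garbage",
]

if __name__ == "__main__":
    for key, count in sorted(run_job(SAMPLE_LOGS).items()):
        print(key, count)
```

Because each map call looks at one line and each reduce call looks at one key, both phases can be split across machines without coordination, which is where the "scale forever" claim comes from.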
  4. You want to use how many clients?
    • Need to store structured inputs/outputs
    • Solution needs to:
      • Support arbitrary number of clients
      • Preferably provide locality
      • Possibly provide 'web' latency
  5. Solutions of varying quality
    • Sharding the RDBMS
      • shard, n.: a horizontal partition in a database
        • Example: Sharding by userid
      • Provided by ORM?
        • Fixed partitions: manual rebalancing
      • Developing from scratch?
        • Adding/removing nodes
        • Handling failover
        • As a library? As a middle tier?
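A rough sketch of why fixed partitions force manual rebalancing: if rows are assigned to shards by userid modulo the shard count, adding a single shard reassigns most users. The modulo scheme and the function names are assumptions for illustration; the slides do not say which sharding function was in use.

```python
def shard_for(user_id, num_shards):
    """Assumed scheme: pick a shard by user id modulo the shard count."""
    return user_id % num_shards


def moved_fraction(user_ids, old_shards, new_shards):
    """Fraction of users whose shard changes when the shard count changes."""
    moved = sum(
        1
        for user_id in user_ids
        if shard_for(user_id, old_shards) != shard_for(user_id, new_shards)
    )
    return moved / len(user_ids)


if __name__ == "__main__":
    users = range(1000)
    # Growing from 4 to 5 shards relocates most rows under this scheme.
    print(moved_fraction(users, 4, 5))
```

Every relocated user is a row that has to be copied between databases while the application keeps serving traffic, which is the work the "adding/removing nodes" and "handling failover" bullets refer to.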
  6. Solutions of varying quality
    • Leaving data in Hadoop
      • Storage in MapFile/SequenceFile
        • Serialized with Thrift/Avro/Protocol Buffers
      • No random access
      • High latency
  7. Solutions of varying quality
    • Storing in HBase/Hypertable
      • Column stores implemented on Hadoop
        • Modeled after Google's Bigtable
      • Multiple points of failure
        • Namenode
        • Master
      • High (almost non-web) latency
  8. And the newest contender...
  9. Standing on the shoulders of: Amazon Dynamo
    • No node in the cluster is special
      • No special roles
      • No scaling bottlenecks
      • No single point of failure
    • Techniques
      • Gossip
      • Eventual consistency
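The two techniques named above can be illustrated with a toy simulation: each node repeatedly exchanges state with a randomly chosen peer, and every replica eventually holds the same data without any coordinator. This is a sketch under stated assumptions, not Dynamo's or Cassandra's actual protocol. The `Node` class, the push-pull exchange, and the integer version counter (a stand-in for whatever conflict-resolution scheme a real system uses) are all hypothetical.

```python
import random


class Node:
    """A replica holding key -> (version, value); no node has a special role."""

    def __init__(self, name):
        self.name = name
        self.state = {}

    def put(self, key, value, version):
        self.state[key] = (version, value)

    def merge(self, other_state):
        """Keep the entry with the higher version for every key seen."""
        for key, (version, value) in other_state.items():
            if key not in self.state or self.state[key][0] < version:
                self.state[key] = (version, value)


def gossip_round(nodes, rng):
    """Each node exchanges state with one random peer (push-pull)."""
    for node in nodes:
        peer = rng.choice([n for n in nodes if n is not node])
        node.merge(peer.state)
        peer.merge(node.state)


def converged(nodes):
    """True when every replica holds identical state."""
    return all(n.state == nodes[0].state for n in nodes)


def run_until_converged(nodes, rng, max_rounds=1000):
    """Gossip until all replicas agree; return the number of rounds used."""
    for round_no in range(1, max_rounds + 1):
        gossip_round(nodes, rng)
        if converged(nodes):
            return round_no
    raise RuntimeError("replicas did not converge")


if __name__ == "__main__":
    cluster = [Node(f"n{i}") for i in range(8)]
    cluster[0].put("k", "old", version=1)
    cluster[5].put("k", "new", version=2)  # a later write lands on another node
    rounds = run_until_converged(cluster, random.Random(42))
    print(rounds, cluster[7].state)
```

Between the write and convergence, different nodes can return different values for the same key. That window is what "eventual" consistency means.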
  10. Standing on the shoulders of: Google Bigtable
    • Column family data model
    • Range queries for rows:
      • Scan rows in order
    • Memtable/SSTable structure
      • Always writes sequentially to disk
      • Bloom filters to minimize random reads
      • Trounces B-Trees for big data
        • Linear insert performance
        • Logarithmic growth for reads
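The memtable/SSTable structure can be sketched in a few dozen lines: writes go to an in-memory table, a full memtable is flushed as one sorted immutable run, and each run carries a Bloom filter so that reads can skip runs that cannot contain the key. This is a simplified illustration, not Cassandra's or Bigtable's implementation. In-memory lists stand in for on-disk files, there is no commit log or compaction, and all class names and sizes are assumptions.

```python
import bisect
import hashlib


class BloomFilter:
    """Probabilistic set: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class SSTable:
    """Immutable sorted run, written once in key order (sequential I/O)."""

    def __init__(self, sorted_items):
        self.keys = [k for k, _ in sorted_items]
        self.values = [v for _, v in sorted_items]
        self.bloom = BloomFilter()
        for key in self.keys:
            self.bloom.add(key)

    def get(self, key):
        if not self.bloom.might_contain(key):
            return None  # skip the lookup entirely: no random read needed
        i = bisect.bisect_left(self.keys, key)  # binary search within the run
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None


class Store:
    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.sstables = []
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Write the memtable out as one sorted run and start a fresh one."""
        self.sstables.append(SSTable(sorted(self.memtable.items())))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest run wins
            value = table.get(key)
            if value is not None:
                return value
        return None
```

Inserts never modify an existing run, so write cost does not depend on how much data is already stored. Reads pay for a search within each run the Bloom filter does not rule out.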
  11. Enter Cassandra
    • Hybrid of ancestors
      • Adopts listed features
    • And adds:
      • A sweet logo!
      • Pluggable partitioning
      • Multi-datacenter support
        • Pluggable locality awareness
      • Data model improvements
  12. Enter Cassandra
    • Project status
      • Open sourced by Facebook in 2008 (Facebook no longer active)
      • Apache License
      • Graduated to Apache TLP February 2010
      • Major releases: 0.3 through 0.6 (0.7 in two months)
    • cassandra.apache.org
  13. Enter Cassandra
    • The code base
      • Java, Apache Ant, Git/SVN
      • 5+ committers from 3+ companies
    • Known deployments at:
      • Cloudkick, Digg, Mahalo, SimpleGeo, Twitter, Rackspace, Reddit
  14. Performance
  15. Like peanut butter with jelly
    • Apache Cassandra 0.6:
    • MapReduce input support out of the box
      • Locality information partially exposed
      • Hadoop InputFormat
      • Pig LoadFunc
  16. Hadoop + Cassandra at RAX
    • Multiple Hadoop clusters deployed
    • Smaller Cassandra deployments
    • Preparing for large scale Cassandra deployment
  17. In the pipeline
    • MapReduce output support
      • Adding an OutputFormat with locality information
    • Improving locality for Hadoop inputs
  18. Getting started
    • http://cassandra.apache.org/
    • Read "Getting Started"... Roughly:
      • Start one node
      • Test/develop app, editing node config as necessary
      • Launch cluster by starting more nodes with chosen config
  19. Thanks, Big Data Workshop participants!
  20. Questions?
  21. References
    • Brandon Williams's perf tests
      • http://racklabs.com/~bwilliam/cassandra/04vs05vs06.png
    • Hadoop/Cassandra Integration
      • http://issues.apache.org/jira/browse/CASSANDRA-342