Talk about Apache Cassandra, TWJUG 2011

Transcript of Talk about Apache Cassandra, TWJUG 2011

2. Outline
  • Overview
  • Architecture Overview
  • Partitioning and Replication
  • Data Consistency

3. Overview
  • Distributed
      • Data partitioned among all nodes
  • Extremely Scalable
      • Add new node = Add more capacity
      • Easy to add new node
  • Fault tolerant
      • All nodes are the same
      • Read/Write anywhere
      • Automatic data replication

4. Overview
  • High Performance
  • Schema-less (not completely true)
      • Need to provide basic settings for each column family.

5. Architecture Overview
  • Keyspace
      • Where the replication strategy and replication factor are defined
      • RDBMS synonym: Database
  • Column family
      • Standard (recommended) or Super
      • Lots of settings can be defined
      • RDBMS synonym: Table
  • Row/Record
      • Indexed by key. Columns might be indexed as well
      • Column names are sorted based on the comparator
      • Each column has its own timestamp

6. Architecture Overview
  • Standard CF (recommended; super columns could be replaced by composite columns)
        {
          Key1: {
            column1: value,
            column2: value
          },
          Key2: {
            column1: value,
            column2: value
          }
        }
  • Super CF
        {
          Key1: {
            super_column1: {
              subColumn1: value,
              subColumn2: value
            },
            super_column2: {
              subColumn1: value,
              subColumn2: value
            }
          },
          Key2: {
            super_column1: {
              subColumn1: value,
              subColumn2: value
            }
          }
        }

7. Architecture Overview
  • Commit log
      • Used to capture write activities. Data durability is assured.
  • Memtable
      • Used to store the most recent write activities.
  • SSTable
      • When a memtable is flushed to disk, it becomes an SSTable.

8. Architecture Overview
  • Data write path
      • Data -> Commit log -> Memtable -> (flushed) -> SSTable

9. Architecture Overview
  • Data read path
      • Search the row cache. If the result is not empty, return it; no further action is needed.
      • If there is no hit in the row cache, try to get the data from the memtable(s) and SSTable(s), collate the results, and return.

10. Partitioning and Replication
In Cassandra, the total data managed by the cluster is represented as a circular space, or ring. The ring is divided into ranges equal to the number of nodes, with each node being responsible for one or more ranges of the overall data.
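As a rough illustration of the ring description above, the following Python sketch places nodes at evenly spaced positions on a ring and lists the range each one would own. The ring size of 2**127, the helper names, and the convention that a node owns the range ending at its own position are assumptions made for illustration, not details taken from the slides.

```python
RING_SIZE = 2**127  # assumption: size of the token space for MD5-based partitioning


def evenly_spaced_positions(node_count):
    """Return one ring position per node, spaced evenly around the ring."""
    return [i * RING_SIZE // node_count for i in range(node_count)]


def owned_ranges(positions):
    """Map each position to the (start, end] range its node would own.

    Assumption: a node owns everything after the previous node's position
    up to and including its own position, wrapping around the ring.
    """
    ordered = sorted(positions)
    ranges = {}
    for index, position in enumerate(ordered):
        previous = ordered[index - 1]  # index 0 wraps around to the last node
        ranges[position] = (previous, position)
    return ranges


positions = evenly_spaced_positions(4)
ranges = owned_ranges(positions)
for position, (start, end) in ranges.items():
    print(f"node at {position} owns ({start}, {end}]")
```

With four nodes, each node ends up responsible for a quarter of the ring; adding a node means recomputing the positions so the ranges stay balanced.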
Before a node can join the ring, it must be assigned a token. The token determines the node's position on the ring and the range of data it is responsible for.

11. Partitioning
  • Data is inserted and assigned a row key in a column family.
        {boris: {first name: boris, last name: Yen}}
  • Data is placed on a node based on its row key.
  • [Figure: ring diagram with one node marked "boris is inserted here"]

12. Partitioning Strategies
  • Random Partitioning
      • This is the default and recommended strategy. Partition data as evenly as possible across all nodes using an MD5 hash of every column family row key
  • Ordered Partitioning
      • Store column family row keys in sorted order across all nodes in the cluster.
      • Sequential writes can cause hot spots
      • More administrative overhead to load balance the cluster
      • Uneven load balancing for multiple column families

13. Setting up Data Partitioning
  • The data partitioning strategy is controlled via the partitioner option inside the cassandra.yaml file
  • Once a cluster is initialized with a partitioner option, it cannot be changed without reloading all of the data in the cluster.

14. Replication
  • To ensure fault tolerance and no single point of failure, you can replicate one or more copies of every row across nodes in the cluster
  • Replication is controlled by the replication factor and replication strategy parameters of a keyspace
  • Replication factor controls how many copies of a row should be stored in the cluster
  • Replication strategy controls how the data is replicated.

15. Replication
  • RF=3
  • Data is inserted and assigned a row key in a column family.
        {boris: {first name: boris, last name: Yen}}
  • A copy of the row is replicated across various nodes based on the assigned replication factor
  • [Figure: ring diagram with three nodes marked "boris is inserted here"]

16. Replication Strategies
Simple Strategy: Place the original row on a node determined by the partitioner. Additional replica rows are placed on the next nodes clockwise in the ring.
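The Simple Strategy placement can be sketched in Python as follows. This is a simplified model, not Cassandra's implementation: the way the MD5 digest is mapped to a token, the 2**127 ring size, the node names, and the function names are all illustrative assumptions.

```python
import hashlib
from bisect import bisect_left

RING_SIZE = 2**127  # assumption: token space used for this sketch


def token_for(row_key):
    """Hash a row key to a ring position using MD5 (simplified mapping)."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


def simple_strategy_replicas(row_key, ring, replication_factor):
    """Pick replica nodes by walking the ring clockwise.

    ring is a list of (token, node_name) pairs sorted by token.
    The first replica is the node whose token is the first one at or
    after the key's token; the rest are the next nodes clockwise.
    """
    tokens = [token for token, _ in ring]
    start = bisect_left(tokens, token_for(row_key)) % len(ring)  # wrap past the last token
    count = min(replication_factor, len(ring))
    return [ring[(start + step) % len(ring)][1] for step in range(count)]


# Hypothetical four-node ring with evenly spaced tokens.
ring = [(i * RING_SIZE // 4, f"node{i}") for i in range(4)]
replicas = simple_strategy_replicas("boris", ring, 3)
print(replicas)
```

The sketch ignores racks and data centers entirely, which is the defining simplification of this strategy.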
Network Topology Strategy: Allows replication between different racks in a data center and/or between multiple data centers. The original row is placed according to the partitioner. Additional replica rows in the same data center are then placed by walking the ring clockwise until a node in a different rack from the previous replica is found. If there is no such node, additional replicas will be placed in the same rack.

17. Replication - Network Topology Strategy
  • RF={DC1:2, DC2:2}

18. Replication Mechanics
  • Cassandra uses a snitch to define how nodes are grouped together within the overall network topology, such as rack and data center groupings.
  • The snitch is defined in cassandra.yaml

19. Replication Mechanics - Snitches
  • Simple Snitch
      • The default, used for the simple replication strategy
  • Rack Inferring Snitch
      • Infers the topology of the network by analyzing the node IP addresses. This snitch assumes that the second octet identifies the data center where a node is located, and the third octet identifies the rack
  • Property File Snitch
      • Determines the location of nodes by referring to a user-defined file
  • EC2 Snitch
      • For deployments on Amazon EC2 only

20. Data Consistency
  • Cassandra supports tunable data consistency
  • Choose between strong and eventual consistency depending on the need
  • Can be done on a per-operation basis, and for both reads and writes.
  • Handles multi-data center operations

21. Consistency Level for Writes
  • Any: A write must succeed on any available node (hint included)
  • One: A write must succeed on any node responsible for that row (either primary or replica)
  • Quorum: A write must succeed on a quorum of replica nodes (RF/2 + 1)
  • Local_Quorum: A write must succeed on a quorum of replica nodes in the same data center as the coordinator node.
  • Each_Quorum: A write must succeed on a quorum of replica nodes in all data centers
  • All: A write must succeed on all replica nodes for a row key

22. Consistency Level for Reads
  • One: Reads from the closest node holding the data
  • Quorum: Returns a result from a quorum of servers with the most recent timestamp for the data
  • Local_Quorum: Returns a result from a quorum of servers with the most recent timestamp for the data in the same data center as the coordinator node
  • Each_Quorum: Returns a result from a quorum of servers with the most recent timestamp in all data centers
  • All: Returns a result from all replica nodes for a row key

23. Built-in Consistency Repair Features
  • Read Repair
      • When a read is done, the coordinator node compares, in the background, the data from all the remaining replicas that own the row. If they are inconsistent, it issues writes to the out-of-date replicas to update the row.
  • Anti-Entropy Node Repair
  • Hinted Handoff

24. What is New in 1.0
  • Column Family Compression
      • 2x-4x reduction in data size
      • 25-35% performance improvement on reads
      • 5-10% performance improvement on writes
  • Improved Memory and Disk Space Management
      • Off-heap row cache
      • Storage engine self-tuning
      • Faster disk space reclamation
  • Tunable Compaction Strategy
      • Supports a LevelDB-style compaction algorithm that can be enabled on a per-column family basis.

25. What is New in 1.0
  • Cassandra Windows Service
  • Improved Write Consistency and Performance
      • Hint data is stored more efficiently
      • Coordinator nodes no longer need to wait for the failure detector to mark a node as down before saving hints for unresponsive nodes.
      • Running a full node repair to reconcile missed writes is not necessary. Full node repair is only necessary after simultaneous multi-node failures or after losing a node entirely
      • Default read repair probability has been reduced from 100% to 10%

26. Anti-Patterns
  • Non-Sun JVM
  • CommitLog+Data on the same disk
      • Does not apply to SSDs or EC2
  • Oversized JVM heaps
      • 6-8 GB is good
      • 10-12 GB is possible and in some circumstances correct
      • 16 GB == max JVM heap size
      • > 16 GB => badness

27. Anti-Patterns
  • Large batch mutations
      • Timeout => entire mutation must be retried => wasted work
      • Keep the batch mutations to 10-100 (this really depends on the HW)
  • Ordered partitioner
      • Creates hot spots
      • Requires extra care from operators
  • Cassandra auto selection of tokens
      • Always specify your initial token.

28. Anti-Patterns
  • Super Column
      • 10-15 percent performance penalty on reads and writes
      • Easier/better to use composite columns
  • Read before write
  • Winblows

29. Want to Learn More
  • P.S. Most of the content in this presentation actually comes from the websites above

30. Q&A

31. We are hiring people
  • If you are interested in what we are doing, please contact us.