Apache Cassandra in Bangalore - Cassandra Internals and Performance

Date posted: 01-Dec-2014

Description: Slides from http://www.meetup.com/Apache-Cassandra/events/108524582/

Transcript of Apache Cassandra in Bangalore - Cassandra Internals and Performance

1. BANGALORE CASSANDRA UG, APRIL 2013: CASSANDRA INTERNALS & PERFORMANCE. Aaron Morton, @aaronmorton, www.thelastpickle.com. Licensed under a Creative Commons Attribution-NonCommercial 3.0 New Zealand License.

2. Architecture, Code
3. Cassandra Architecture: Clients, APIs, Cluster Aware, Cluster Unaware, Disk
4. Cassandra Cluster Architecture: Clients; Node 1 and Node 2, each with APIs, Cluster Aware, Cluster Unaware, Disk
5. Dynamo Cluster Architecture: Clients; Node 1 and Node 2, each with APIs, Dynamo, Database, Disk
6. Architecture: API, Dynamo, Database
7. API Transports: Thrift, Native Binary, Read Line, RMI
8. Thrift Transport
    // Custom TServer implementations
    o.a.c.thrift.CustomTThreadPoolServer
    o.a.c.thrift.CustomTNonBlockingServer
    o.a.c.thrift.CustomTHsHaServer
9. API Transports: Thrift, Native Binary, Read Line, RMI
10. Native Binary Transport: Beta in Cassandra 1.2. Uses Netty 3.5. Enabled with start_native_transport (disabled by default).
11. o.a.c.transport.Server.run()
    // Set up the Netty server
    new ExecutionHandler()
    new NioServerSocketChannelFactory()
    ServerBootstrap.setPipelineFactory()
12. o.a.c.transport.Message.Dispatcher.messageReceived()
    // Process message from client
    ServerConnection.validateNewMessage()
    Request.execute()
    ServerConnection.applyStateTransition()
    Channel.write()
13. o.a.c.transport.messages
    CredentialsMessage()
    EventMessage()
    ExecuteMessage()
    PrepareMessage()
    QueryMessage()
    ResultMessage()
    (And more...)
14. Messages: Defined in the Native Binary Protocol, $SRC/doc/native_protocol.spec
15. API Services: JMX, CLI, Thrift, CQL 3
16. JMX Management Beans: Spread around the code base. Interfaces named *MBean.
17. JMX Management Beans: Registered with names such as org.apache.cassandra.db:type=StorageProxy
18. API Services: JMX, CLI, Thrift, CQL 3
19. o.a.c.cli.CliMain.main()
    // Connect to server to read input
    this.connect()
    this.evaluateFileStatements()
    this.processStatementInteractive()
20. CLI Grammar: ANTLR Grammar, $SRC/src/java/o/a/c/cli/CLI.g
21. o.a.c.cli.CliClient.executeCLIStatement()
    // Process statement
    CliCompiler.compileQuery() # ANTLR
    switch (tree.getType()) case...
22. API Services: JMX, CLI, Thrift, CQL 3
23.
    o.a.c.thrift.CassandraServer
    // Implements Thrift Interface
    // Access control
    // Input validation
    // Mapping to/from Thrift and internal types
24. Thrift Interface: Thrift IDL, $SRC/interface/cassandra.thrift
25. o.a.c.thrift.CassandraServer.get_slice()
    // Get columns for one row
    Tracing.begin()
    ClientState cState = state()
    cState.hasColumnFamilyAccess()
    multigetSliceInternal()
26. CassandraServer.multigetSliceInternal()
    // Get columns for many rows
    ThriftValidation.validate*()
    // Create ReadCommands
    getSlice()
27. CassandraServer.getSlice()
    // Process ReadCommands
    // Return Thrift types
    readColumnFamily()
    thriftifyColumnFamily()
28. CassandraServer.readColumnFamily()
    // Process ReadCommands
    // Return ColumnFamilies
    StorageProxy.read()
29. API Services: JMX, CLI, Thrift, CQL 3
30. o.a.c.cql3.QueryProcessor
    // Prepares and executes CQL3 statements
    // Used by Thrift & Native transports
    // Access control
    // Input validation
    // Returns transport.ResultMessage
31. CQL3 Grammar: ANTLR Grammar, $SRC/o.a.c.cql3/Cql.g
32. o.a.c.cql3.statements.ParsedStatement
    // Subclasses generated by ANTLR
    // Tracks bound term count
    // Prepare CQLStatement
    prepare()
33. o.a.c.cql3.statements.CQLStatement
    checkAccess(ClientState state)
    validate(ClientState state)
    execute(ConsistencyLevel cl, QueryState state, List variables)
34. o.a.c.cql3.functions.Function
    argsType()
    returnType()
    execute(List parameters)
35. statements.SelectStatement.RawStatement
    // Implements ParsedStatement
    // Input validation
    prepare()
36. statements.SelectStatement.execute()
    // Create ReadCommands
    StorageProxy.read()
37. Architecture: API, Dynamo, Database
38. Dynamo Layer: o.a.c.service, o.a.c.net, o.a.c.dht, o.a.c.locator, o.a.c.gms, o.a.c.stream
39. o.a.c.service.StorageProxy
    // Cluster wide storage operations
    // Select endpoints & check CL available
    // Send messages to Stages
    // Wait for response
    // Store Hints
40. o.a.c.service.StorageService
    // Ring operations
    // Track ring state
    // Start & stop ring membership
    // Node & token queries
41.
    o.a.c.service.IResponseResolver
    preprocess(MessageIn message)
    resolve() throws DigestMismatchException
    RowDigestResolver
    RowDataResolver
    RangeSliceResponseResolver
42. Response Handlers / Callback
    implements IAsyncCallback
    response(MessageIn msg)
43. o.a.c.service.ReadCallback.get()
    // Wait for blockfor & data
    condition.await(timeout, TimeUnit.MILLISECONDS)
    throw ReadTimeoutException()
    resolver.resolve()
44. o.a.c.service.StorageProxy.fetchRows()
    getLiveSortedEndpoints()
    new RowDigestResolver()
    new ReadCallback()
    MessagingService.sendRR()
    ---------------------------------------
    ReadCallback.get() # blocking
    catch (DigestMismatchException ex)
    catch (ReadTimeoutException ex)
45. Dynamo Layer: o.a.c.service, o.a.c.net, o.a.c.dht, o.a.c.locator, o.a.c.gms, o.a.c.stream
46. o.a.c.net.MessagingService.Verb
    MUTATION
    READ
    REQUEST_RESPONSE
    TREE_REQUEST
    TREE_RESPONSE
    (And more...)
47. o.a.c.net.MessagingService.verbHandlers
    new EnumMap(Verb.class)
48. o.a.c.net.IVerbHandler
    doVerb(MessageIn message, String id);
49. o.a.c.net.MessagingService.verbStages
    new EnumMap(MessagingService.Verb.class)
50. o.a.c.net.MessagingService.receive()
    runnable = new MessageDeliveryTask(message, id, timestamp);
    StageManager.getStage(message.getMessageType());
    stage.execute(runnable);
51. o.a.c.net.MessageDeliveryTask.run()
    // If droppable and rpc_timeout
    MessagingService.incrementDroppedMessages(verb);
    MessagingService.getVerbHandler(verb)
    verbHandler.doVerb(message, id)
52. Dynamo Layer: o.a.c.service, o.a.c.net, o.a.c.dht, o.a.c.locator, o.a.c.gms, o.a.c.stream
53. o.a.c.dht.IPartitioner
    getToken(ByteBuffer key)
    getRandomToken()
    LocalPartitioner
    RandomPartitioner
    Murmur3Partitioner
54. o.a.c.dht.Token
    compareTo(Token o)
    BytesToken
    BigIntegerToken
    LongToken
55. Dynamo Layer: o.a.c.service, o.a.c.net, o.a.c.dht, o.a.c.locator, o.a.c.gms, o.a.c.stream
56. o.a.c.locator.IEndpointSnitch
    getRack(InetAddress endpoint)
    getDatacenter(InetAddress endpoint)
    sortByProximity(InetAddress address, List addresses)
    SimpleSnitch
    PropertyFileSnitch
    Ec2MultiRegionSnitch
57.
    o.a.c.locator.AbstractReplicationStrategy
    getNaturalEndpoints(RingPosition searchPosition)
    calculateNaturalEndpoints(Token searchToken, TokenMetadata tokenMetadata)
    SimpleStrategy
    NetworkTopologyStrategy
58. o.a.c.locator.TokenMetadata
    BiMultiValMap tokenToEndpointMap
    BiMultiValMap bootstrapTokens
    Set leavingEndpoints
59. Dynamo Layer: o.a.c.service, o.a.c.net, o.a.c.dht, o.a.c.locator, o.a.c.gms, o.a.c.stream
60. o.a.c.gms.VersionedValue
    // VersionGenerator.getNextVersion()
    public final int version;
    public final String value;
61. o.a.c.gms.ApplicationState
    STATUS
    LOAD
    SCHEMA
    DC
    RACK
    (And more...)
62. o.a.c.gms.HeartBeatState
    // VersionGenerator.getNextVersion();
    private int generation;
    private int version;
63. o.a.c.gms.Gossiper.GossipTask.run()
    // SYN -> ACK -> ACK2
    makeRandomGossipDigest()
    new GossipDigestSyn()
    // Use MessagingService.sendOneWay()
    Gossiper.doGossipToLiveMember()
    Gossiper.doGossipToUnreachableMember()
    Gossiper.doGossipToSeed()
64. gms.GossipDigestSynVerbHandler.doVerb()
    Gossiper.examineGossiper()
    new GossipDigestAck()
    MessagingService.sendOneWay()
65. gms.GossipDigestAckVerbHandler.doVerb()
    Gossiper.notifyFailureDetector()
    Gossiper.applyStateLocally()
    Gossiper.makeGossipDigestAck2Message()
66. gms.GossipDigestAck2VerbHandler.doVerb()
    Gossiper.notifyFailureDetector()
    Gossiper.applyStateLocally()
67. Architecture: API Layer, Dynamo Layer, Database Layer
68. Database Layer: o.a.c.concurrent, o.a.c.db, o.a.c.cache, o.a.c.io, o.a.c.trace
69. o.a.c.concurrent.StageManager
    stages = new EnumMap(Stage.class);
    getStage(Stage stage)
70. o.a.c.concurrent.Stage
    READ
    MUTATION
    GOSSIP
    REQUEST_RESPONSE
    ANTI_ENTROPY
    (And more...)
71. Database Layer: o.a.c.concurrent, o.a.c.db, o.a.c.cache, o.a.c.io, o.a.c.trace
72. o.a.c.db.Table
    // Keyspace
    open(String table)
    getColumnFamilyStore(String cfName)
    getRow(QueryFilter filter)
    apply(RowMutation mutation, boolean writeCommitLog)
73.
    o.a.c.db.ColumnFamilyStore
    // Column Family
    getColumnFamily(QueryFilter filter)
    getTopLevelColumns(...)
    apply(DecoratedKey key, ColumnFamily columnFamily, SecondaryIndexManager.Updater indexer)
74. o.a.c.db.IColumnContainer
    addColumn(IColumn column)
    remove(ByteBuffer columnName)
    ColumnFamily
    SuperColumn
75. o.a.c.db.ISortedColumns
    addColumn(IColumn column, Allocator allocator)
    removeColumn(ByteBuffer name)
    ArrayBackedSortedColumns
    AtomicSortedColumns
    TreeMapBackedSortedColumns
76. o.a.c.db.Memtable
    put(DecoratedKey key, ColumnFamily columnFamily, SecondaryIndexManager.Updater indexer)
    flushAndSignal(CountDownLatch latch, Future context)
77. Memtable.FlushRunnable.writeSortedContents()
    // SSTableWriter
    createFlushWriter()
    // Iterate through rows & CFs in order
    writer.append()
78. o.a.c.db.ReadCommand
    getRow(Table table)
    SliceByNamesReadCommand
    SliceFromReadCommand
79. o.a.c.db.IDiskAtomFilter
    getMemtableColumnIterator(...)
    getSSTableColumnIterator(...)
    IdentityQueryFilter
    NamesQueryFilter
    SliceQueryFilter
80. Some query performance...
81. Today: Write Path, Read Path
82. memtable_flush_queue_size test...
    m1.xlarge Cassandra node
    m1.xlarge client node
    1 CF with 6 Secondary Indexes
    1 Client Thread
    10,000 Inserts, 100 Columns per Row
    1100 bytes per Column
83. CF write latency and memtable_flush_queue_size... [Chart: latency in microseconds (axis 0 to 1,200) at the 85th, 95th, 99th and 100th percentiles; series memtable_flush_queue_size=7 and memtable_flush_queue_size=1]
84. Request latency and memtable_flush_queue_size... [Chart: latency in microseconds (axis 0 to 5,000,000) at the 85th, 95th, 99th and 100th percentiles; series memtable_flush_queue_size=7 and memtable_flush_queue_size=1]
85. durable_writes test...
    10,000 Inserts, 50 Columns per Row
    50 bytes per Column
86. Request latency and durable_writes (1 client)... [Chart: latency in microseconds (axis 0 to 7,000) at the 85th, 95th and 99th percentiles; series enabled and disabled]
87. Request latency and durable_writes (10 clients)... enableddisable