#GeodeSummit - Apex & Geode: In-memory streaming, storage & analytics
In-Memory Streaming, Storage & Analytics
Apache Apex + Apache Geode
Thomas Weise, Ashish Tadose
Apex Features
• In-memory stream processing
• Partitioning and scaling out
• Windowing support (temporal)
• Stateful fault tolerance, operability
• Processing guarantees
• Compute locality
• Dynamic updates
Apex Platform Overview
Application Programming Model
§ A stream is a sequence of data tuples
§ An operator takes one or more input streams, performs computations, and emits one or more output streams
– Each operator is your custom business logic in Java, or a built-in operator from our open source library
– An operator has many instances that run in parallel, and each instance is single-threaded
§ A Directed Acyclic Graph (DAG) is made up of operators and streams
– Iterative processing is supported
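The programming model above can be illustrated with a small toy in plain Java. This is only a sketch: `ToyOperator`, `to`, `sink` and `ToyDagDemo` are hypothetical names invented for illustration and are not the Apex API. Each operator wraps a piece of business logic, processes one tuple at a time on a single thread, and emits its result to the operators connected downstream.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

// Toy model only: hypothetical types, NOT the Apex API.
final class ToyOperator<I, O> {
    private final Function<I, O> logic;                       // the custom business logic
    private final List<Consumer<O>> downstream = new ArrayList<>();

    ToyOperator(Function<I, O> logic) {
        this.logic = logic;
    }

    // Connect this operator's output stream to the input of the next operator.
    <R> ToyOperator<O, R> to(ToyOperator<O, R> next) {
        downstream.add(next::process);
        return next;
    }

    // Attach a terminal consumer, standing in for an output operator.
    void sink(Consumer<O> consumer) {
        downstream.add(consumer);
    }

    // Process one input tuple and emit the result on every output stream.
    void process(I tuple) {
        O out = logic.apply(tuple);
        for (Consumer<O> c : downstream) {
            c.accept(out);
        }
    }
}

final class ToyDagDemo {
    // Builds the DAG trim -> length -> sink and pushes the input tuples through it.
    static List<Integer> run(List<String> input) {
        ToyOperator<String, String> trim = new ToyOperator<>(String::trim);
        ToyOperator<String, Integer> length = new ToyOperator<>(String::length);
        List<Integer> results = new ArrayList<>();
        trim.to(length).sink(results::add);
        for (String tuple : input) {
            trim.process(tuple);
        }
        return results;
    }
}
```

In a real engine the operators would run as parallel instances in separate containers and the streams would cross process boundaries; the toy keeps everything in one thread to show only the shape of the graph.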
[Figure: Directed Acyclic Graph (DAG) of operators connected by streams of tuples]
Apache Apex-Malhar
Apex Native Hadoop Integration
• YARN is the resource manager
• HDFS is used for storing any persistent state
Apex Fault Tolerance
• Operator state is checkpointed to a persistent store
– Automatically performed by the engine, no additional work needed by the operator
– In case of failure, operators are restarted from the checkpointed state
– Frequency configurable per operator
– Asynchronous and distributed by default
– Default store is HDFS
• Automatic detection and recovery of failed operators
– Heartbeat mechanism
• Buffering mechanism to ensure replay of data from the recovered point so that there is no loss of data
• Application master state is checkpointed
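A minimal sketch of the checkpoint-and-replay idea, in plain Java. All names here (`CheckpointedSum`, `CheckpointDemo`, `runWithFailure`) are hypothetical and only model the behaviour; they are not engine classes. The operator keeps a running sum as its state, the state is saved every few tuples, and after a simulated failure the operator is restored from the last checkpoint while the upstream buffer replays the tuples that followed it.

```java
import java.util.List;

// Toy model only: hypothetical names, not Apex engine classes.
final class CheckpointedSum {
    private long sum;                 // in-memory operator state
    private int offset;               // number of tuples processed so far
    private long checkpointedSum;     // state saved at the last checkpoint
    private int checkpointedOffset;   // input position covered by that checkpoint

    void process(long tuple) {
        sum += tuple;
        offset++;
    }

    void checkpoint() {
        checkpointedSum = sum;
        checkpointedOffset = offset;
    }

    // Simulate a restart: restore the checkpointed state and report
    // the position from which the buffer has to replay.
    int recover() {
        sum = checkpointedSum;
        offset = checkpointedOffset;
        return offset;
    }

    long sum() {
        return sum;
    }
}

final class CheckpointDemo {
    // 'buffer' plays the role of the upstream buffer that can replay tuples.
    // The operator fails once, just before processing the tuple at index 'failAfter'.
    static long runWithFailure(List<Long> buffer, int checkpointEvery, int failAfter) {
        CheckpointedSum op = new CheckpointedSum();
        boolean failed = false;
        int i = 0;
        while (i < buffer.size()) {
            if (!failed && i == failAfter) {
                failed = true;
                i = op.recover();     // rewind to the checkpointed position
                continue;
            }
            op.process(buffer.get(i));
            i++;
            if (i % checkpointEvery == 0) {
                op.checkpoint();
            }
        }
        return op.sum();
    }
}
```

Because the state and the replay position are restored together, the final sum is the same as in a run without failure.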
Apex Processing Semantics
At-least-once
• On recovery, data will be replayed from a previous checkpoint
– No messages lost
– Default, suitable for most applications
• Can be used to ensure data is written once to a store
– Transactions with meta information, rewinding output, feedback from an external entity, idempotent operations
At-most-once
• On recovery, the latest data is made available to the operator
– Useful where data loss is acceptable and the latest data is sufficient
Exactly-once
– At-least-once + idempotency + transactional mechanisms (operator logic) to achieve end-to-end exactly-once behavior
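To make the role of idempotency concrete, here is a small sketch in plain Java; `IdempotentSinkDemo` and its methods are hypothetical names used only for illustration. Under at-least-once delivery a tuple that arrived after the last checkpoint is delivered again on recovery. An append-only sink records it twice, while a sink that writes by key ends up in the same state as if every tuple had been delivered exactly once.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: hypothetical names for illustration.
final class IdempotentSinkDemo {
    // A delivery sequence as seen by the sink under at-least-once semantics:
    // "order-2" is delivered again after a simulated recovery.
    static List<String[]> deliveriesWithReplay() {
        List<String[]> deliveries = new ArrayList<>();
        deliveries.add(new String[] {"order-1", "10"});
        deliveries.add(new String[] {"order-2", "20"});
        deliveries.add(new String[] {"order-2", "20"});   // replayed after recovery
        deliveries.add(new String[] {"order-3", "30"});
        return deliveries;
    }

    // Non-idempotent sink: every delivery is appended, so the replay creates a duplicate.
    static int appendSinkSize(List<String[]> deliveries) {
        List<String> log = new ArrayList<>();
        for (String[] tuple : deliveries) {
            log.add(tuple[0] + "=" + tuple[1]);
        }
        return log.size();
    }

    // Idempotent sink: writing the same key and value again leaves the store unchanged.
    static Map<String, String> idempotentSink(List<String[]> deliveries) {
        Map<String, String> store = new LinkedHashMap<>();
        for (String[] tuple : deliveries) {
            store.put(tuple[0], tuple[1]);
        }
        return store;
    }
}
```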
IMC Benefits for Apex
• Data flow in-memory, no disk
• Incremental recovery: buffer server
• In-memory data for querying data
Streaming meets In-Memory Data Grid
Apex + Geode Integration
Completed
• Operator checkpointing in Geode
• Output operator to store tuples in a Geode region
Proposed
• Geode output operator with transactional support
• Ingest data from Geode into the Apex DAG
• Distributed cache operator
• Scan operator for parallel query execution & result retrieval
Operator Checkpointing in Geode
Apex operator checkpointing in an IMDG (Geode store)
• Checkpointing is an essential mechanism to ensure fault tolerance
• Apex checkpoints operator state to HDFS
• Slower HDFS checkpointing hurts application performance
• Checkpointing in Geode ensures that application performance is not impacted
• Geode has better latency for write operations than HDFS
Implementation: GeodeStorageAgent
https://issues.apache.org/jira/browse/APEXCORE-283
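The shape of such a storage agent can be sketched in plain Java. This is an assumption-laden toy, not the code behind APEXCORE-283: `InMemoryStorageAgentSketch` is a hypothetical class, a `ConcurrentHashMap` stands in for the Geode region, and Java serialization stands in for whatever serializer the engine actually uses. The point it illustrates is that a checkpoint is just a serialized snapshot keyed by operator id and window id, so any fast key-value store can hold it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy model only: the map below is a stand-in for a Geode region.
final class InMemoryStorageAgentSketch {
    private final Map<String, byte[]> region = new ConcurrentHashMap<>();

    private static String key(int operatorId, long windowId) {
        return operatorId + "/" + windowId;
    }

    // Serialize the operator state and store it under (operatorId, windowId).
    void save(Object state, int operatorId, long windowId) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(state);
            }
            region.put(key(operatorId, windowId), bytes.toByteArray());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Load the snapshot taken at the given window, or null if none exists.
    Object load(int operatorId, long windowId) {
        byte[] bytes = region.get(key(operatorId, windowId));
        if (bytes == null) {
            return null;
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // Drop a checkpoint that is no longer needed for recovery.
    void delete(int operatorId, long windowId) {
        region.remove(key(operatorId, windowId));
    }
}
```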
Data Streams to Geode Store
Apex output operator to write to a Geode store
• Apex output operator: egress data from the Apex DAG to an external store
• Store incoming tuples in binary/POJO format in a Geode region
• Geode efficient query integration: OQL
• Geode regions support data replication, overflow to disk, persistence & many eviction strategies
Implementation: GeodeStore, GeodePOJOPutOperator, AbstractGeodePutOperator
https://malhar.atlassian.net/projects/MLHR/issues/MLHR-1942
Geode Transactional Writes
Apex output operator to write to a Geode store with transactions
• An Apex DAG uses a TransactionableStore to guarantee that records are written exactly once, e.g. JdbcTransactionalStore
• Geode provides transaction support for efficient and safe coordinated operations
• A Geode store using transactions guarantees that records are written exactly once
• A put operator backed by a Geode transactional store can help achieve exactly-once semantics
Implementation: GeodeWindowStore as TransactionableStore
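One common way to get exactly-once writes from a transactional store is to commit the id of the last written window in the same transaction as the data, and to skip any replayed window whose id is not newer. The sketch below shows that pattern in plain Java. `ToyTransactionalStore` and `ExactlyOnceWriter` are hypothetical names; a synchronized method stands in for a real Geode transaction, and none of this is taken from the proposed GeodeWindowStore.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for a transactional store: the records and the committed
// window id change together inside one synchronized "transaction".
final class ToyTransactionalStore {
    private final List<String> records = new ArrayList<>();
    private long committedWindowId = -1;

    synchronized long committedWindowId() {
        return committedWindowId;
    }

    synchronized void commit(long windowId, List<String> batch) {
        records.addAll(batch);
        committedWindowId = windowId;
    }

    synchronized int size() {
        return records.size();
    }
}

// Toy output operator: buffers the tuples of a window and writes them at the window end.
final class ExactlyOnceWriter {
    private final ToyTransactionalStore store;
    private final List<String> pending = new ArrayList<>();
    private long currentWindowId;

    ExactlyOnceWriter(ToyTransactionalStore store) {
        this.store = store;
    }

    void beginWindow(long windowId) {
        currentWindowId = windowId;
        pending.clear();
    }

    void process(String tuple) {
        pending.add(tuple);
    }

    void endWindow() {
        // A window at or below the committed id was already written before
        // the failure, so a replay of it must not be written again.
        if (currentWindowId <= store.committedWindowId()) {
            return;
        }
        store.commit(currentWindowId, pending);
    }
}
```

The design choice is that the store itself remembers how far the writer got, so the writer needs no extra state to detect a replay after a restart.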
Streaming Geode Data in Apex
Apex input operator to read from a Geode store
• Apex input operators ingest data from external sources into the Apex DAG
• Geode provides versatile and reliable event distribution to provide real-time updates to data
• Use case: an Apex operator to stream async events from Geode into the DAG
• Callback events reduce polling cycles over the network
Implementation: GeodeRegionStreamOperator receives newly added tuples and emits them into the DAG
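A sketch of the callback-driven input pattern, in plain Java. `ToyRegion` and `RegionStreamSketch` are hypothetical names: the toy region stands in for a Geode region with a registered listener, and `emitTuples` stands in for the method an engine would call on the operator thread. The listener only enqueues events, and the operator drains the queue when it is asked for tuples, so no polling round trip to the store is needed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Consumer;

// Toy stand-in for a region that notifies listeners about newly added entries.
final class ToyRegion {
    private final List<Consumer<String>> listeners = new ArrayList<>();

    void addListener(Consumer<String> listener) {
        listeners.add(listener);
    }

    void put(String key, String value) {
        for (Consumer<String> listener : listeners) {
            listener.accept(key + "=" + value);
        }
    }
}

// Toy input operator: events arrive through callbacks and are handed
// to the DAG when the engine asks for tuples.
final class RegionStreamSketch {
    private final ConcurrentLinkedQueue<String> events = new ConcurrentLinkedQueue<>();

    RegionStreamSketch(ToyRegion region) {
        region.addListener(events::add);   // callback thread only enqueues
    }

    // Drains everything received since the previous call, in arrival order.
    List<String> emitTuples() {
        List<String> emitted = new ArrayList<>();
        String event;
        while ((event = events.poll()) != null) {
            emitted.add(event);
        }
        return emitted;
    }
}
```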
Geode Cache Operator
Apex Geode cache operator
• Geode provides efficient events & notifications
• Register interest: update local copies
• Continuous query
– Receive a notification when the query condition is met on the server
– E.g. SELECT * FROM /tradeOrder t WHERE t.price > 100.00
• Use the Geode event notification framework to maintain & invalidate the cache
Implementation: GeodeCacheOperator maintains a consistent cache based on the subscribed key set/query
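The cache maintenance described above can be sketched in plain Java. `ContinuousQueryCache` and `onEvent` are hypothetical names, and a `DoublePredicate` stands in for the server-side query condition (here the `price > 100.00` condition from the example query). Each notification either refreshes the local copy or invalidates it, so the cache always mirrors the entries that currently satisfy the condition.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.DoublePredicate;

// Toy model only: a local cache kept consistent by event notifications.
final class ContinuousQueryCache {
    private final Map<String, Double> cache = new HashMap<>();
    private final DoublePredicate condition;   // stand-in for the query's WHERE clause

    ContinuousQueryCache(DoublePredicate condition) {
        this.condition = condition;
    }

    // Called for every create/update/destroy event; a null price means the
    // entry was destroyed on the server.
    void onEvent(String orderId, Double price) {
        if (price != null && condition.test(price)) {
            cache.put(orderId, price);         // entry matches: refresh the local copy
        } else {
            cache.remove(orderId);             // entry no longer matches: invalidate
        }
    }

    boolean contains(String orderId) {
        return cache.containsKey(orderId);
    }

    int size() {
        return cache.size();
    }
}
```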
Geode Scan Operator
Apex Geode scan operator
• Function execution provides parallel query execution
• MapReduce-like execution: concurrent execution on members; results are collected from members & sent to the caller
• Use case: a streaming application depending on a large scan result from an external store
Implementation: GeodeQueryOperator executes data-dependent queries on a distributed region and emits results into the DAG
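The MapReduce-like execution can be approximated with a thread pool in plain Java. This is a sketch under loud assumptions: `ParallelScanSketch` is a hypothetical class, each inner list stands in for the data held by one cluster member, and a local thread stands in for function execution on that member. The filter runs concurrently per member and the caller collects the partial results.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

// Toy model only: threads stand in for members of a distributed region.
final class ParallelScanSketch {
    static List<Integer> scan(List<List<Integer>> members, Predicate<Integer> filter) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, members.size()));
        try {
            // "Map" phase: each member filters its own data concurrently.
            List<Future<List<Integer>>> futures = new ArrayList<>();
            for (List<Integer> memberData : members) {
                Callable<List<Integer>> task = () -> {
                    List<Integer> matches = new ArrayList<>();
                    for (Integer value : memberData) {
                        if (filter.test(value)) {
                            matches.add(value);
                        }
                    }
                    return matches;
                };
                futures.add(pool.submit(task));
            }
            // "Reduce" phase: the caller collects the partial results in member order.
            List<Integer> result = new ArrayList<>();
            for (Future<List<Integer>> future : futures) {
                result.addAll(future.get());
            }
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```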
Join the Apache Geode Community!
• Check out: http://geode.incubator.apache.org
• Subscribe: [email protected]
• Download: http://geode.incubator.apache.org/releases/
Questions?
Thank You