In-Memory Computing, Storage & Analysis: Apache Apex + Apache Geode

In-Memory Computing, Storage & Analysis Apache Apex + Apache Geode Sandeep Deshmukh Ashish Tadose

Transcript of In-Memory Computing, Storage & Analysis: Apache Apex + Apache Geode

Page 1: In-Memory Computing, Storage & Analysis: Apache Apex + Apache Geode

In-Memory Computing, Storage & Analysis: Apache Apex + Apache Geode

Sandeep Deshmukh, Ashish Tadose

Page 2: In-Memory Computing, Storage & Analysis: Apache Apex + Apache Geode

Project Status

Apex in Apache Incubation Stage

Mentor List
• Ted Dunning: Apache Member, MapR
• Alan Gates: Apache Member, Hortonworks
• Taylor Goetz: Apache Member, Hortonworks
• Justin Mclean: Apache Member, Class Software
• Chris Nauroth: Apache Member, Hortonworks
• Hitesh Shah: Apache Member, Hortonworks

Page 3: In-Memory Computing, Storage & Analysis: Apache Apex + Apache Geode

Apache Apex (Incubating) Committer List

• Open-sourced in July 2015
• Over 50 committers already… and growing…

Page 4: In-Memory Computing, Storage & Analysis: Apache Apex + Apache Geode

Apex Platform Overview: Enterprise Edition

Page 5: In-Memory Computing, Storage & Analysis: Apache Apex + Apache Geode

Application Programming Model

Directed Acyclic Graph (DAG)

• A Stream is a sequence of data tuples
• An Operator takes one or more input streams, performs computations & emits one or more output streams
• Each Operator is YOUR custom business logic in Java, or a built-in operator from our open source library
• An Operator has many instances that run in parallel, and each instance is single-threaded
• A Directed Acyclic Graph (DAG) is made up of operators and streams

[Figure: operators connected by streams of tuples, ending in an output stream]
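The vocabulary on this slide (tuples, streams, operators, a graph wiring them together) can be sketched in a few lines of plain Java. This is a toy model, not the Apache Apex API: the `MiniDag` class and its `Operator`, `map`, `filter`, `connect` and `run` names are all hypothetical, and it only chains operators linearly on one thread rather than building a general DAG or running parallel instances.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Predicate;

/** Toy model of the slide's vocabulary; NOT the Apache Apex API. */
final class MiniDag {

    /** An operator receives one tuple at a time and may emit zero or more output tuples. */
    interface Operator<I, O> {
        void process(I tuple, Consumer<O> emit);
    }

    /** Operator that transforms each tuple one-to-one. */
    static <I, O> Operator<I, O> map(Function<I, O> fn) {
        return (tuple, emit) -> emit.accept(fn.apply(tuple));
    }

    /** Operator that drops tuples failing the predicate. */
    static <T> Operator<T, T> filter(Predicate<T> keep) {
        return (tuple, emit) -> {
            if (keep.test(tuple)) {
                emit.accept(tuple);
            }
        };
    }

    /** Connects two operators with a stream: every tuple emitted upstream is fed downstream. */
    static <A, B, C> Operator<A, C> connect(Operator<A, B> upstream, Operator<B, C> downstream) {
        return (tuple, emit) -> upstream.process(tuple, mid -> downstream.process(mid, emit));
    }

    /** Pushes a finite input stream through the pipeline and collects the output stream. */
    static <I, O> List<O> run(List<I> input, Operator<I, O> pipeline) {
        List<O> out = new ArrayList<>();
        for (I tuple : input) {
            pipeline.process(tuple, out::add);
        }
        return out;
    }

    /** Demo pipeline: parse lines, keep even numbers, square them. */
    static List<Integer> demo(List<String> lines) {
        Operator<String, Integer> parse = map(Integer::parseInt);
        Operator<Integer, Integer> evens = filter(n -> n % 2 == 0);
        Operator<Integer, Integer> square = map(n -> n * n);
        return run(lines, connect(connect(parse, evens), square));
    }
}
```

Each operator here sees one tuple at a time and keeps no shared state, which mirrors the single-threaded-instance rule on the slide; a real platform adds parallel instances, windowing and fault tolerance on top of this shape.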

Page 6: In-Memory Computing, Storage & Analysis: Apache Apex + Apache Geode

Apex Component Overview

[Figure: component diagram. A Hadoop edge node hosts the CLI and the DT RTS Management Server (REST API). One Hadoop node runs the Apex App Master in a YARN container. The other Hadoop nodes run YARN containers that act as streaming containers, each executing operators (Op1, Op2, Op3) on threads (Thread 1 … Thread N). Some components are marked as part of the Community Edition.]

Page 7: In-Memory Computing, Storage & Analysis: Apache Apex + Apache Geode

Apex Features…

• Native Hadoop Integration
• Partitioning and Scaling out
• Advanced Windowing Support
• Stateful Fault-tolerance
• Processing Semantics
• Compute Locality
• Dynamic updates
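Of the features listed, partitioning is the easiest to illustrate in isolation. The sketch below routes tuples to one of several partitions by key hash, lets each partition count its own keys, and then merges the partial counts. It is plain Java written for illustration only: `KeyPartitioning` and its methods are hypothetical names and do not reflect how Apex configures or implements partitioning.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative key-based partitioning; NOT Apex's partitioning API. */
final class KeyPartitioning {

    /** Picks the partition for a key; floorMod keeps the index non-negative. */
    static int partitionFor(String key, int partitions) {
        return Math.floorMod(key.hashCode(), partitions);
    }

    /** Splits a stream of keys into per-partition sub-streams. */
    static List<List<String>> split(List<String> keys, int partitions) {
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i < partitions; i++) {
            out.add(new ArrayList<>());
        }
        for (String k : keys) {
            out.get(partitionFor(k, partitions)).add(k);
        }
        return out;
    }

    /** Each partition counts its own keys; the partial counts are then merged. */
    static Map<String, Integer> countPartitioned(List<String> keys, int partitions) {
        Map<String, Integer> merged = new HashMap<>();
        for (List<String> part : split(keys, partitions)) {
            Map<String, Integer> partial = new HashMap<>();
            for (String k : part) {
                partial.merge(k, 1, Integer::sum);
            }
            partial.forEach((k, v) -> merged.merge(k, v, Integer::sum));
        }
        return merged;
    }
}
```

Because the same key always hashes to the same partition, each partition can keep its state privately, and adding partitions spreads the keys over more instances.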

Page 8: In-Memory Computing, Storage & Analysis: Apache Apex + Apache Geode

Apache Apex-Malhar

Page 9: In-Memory Computing, Storage & Analysis: Apache Apex + Apache Geode

IMC Components in Apex

• Processing data in-motion
• Preventing data-loss – buffer server
• In-memory data stores for querying data

Page 10: In-Memory Computing, Storage & Analysis: Apache Apex + Apache Geode

Why In-Memory Computing?

[Figure: typical latencies]

Page 11: In-Memory Computing, Storage & Analysis: Apache Apex + Apache Geode

Why In-Memory Computing?

• In-memory computing will have a long-term, disruptive impact by radically changing users' expectations, application design principles, products' architectures and vendors' strategies
• RAM is the new disk, disk the new tape
• In-memory computing is the future of computing... it offers massive [benefits] not only in TCO reduction but across all four value dimensions: performance, process innovation, simplification and flexibility.

Page 12: In-Memory Computing, Storage & Analysis: Apache Apex + Apache Geode

In-Memory Data Grid (IMDG)

What are IMDGs?
• IMDGs host data in memory and distribute it across a cluster of commodity servers
• The main access patterns are key/value access, MapReduce, various forms of HPC-like processing, and limited distributed querying and indexing capabilities

Why are they important?
• Performance – using RAM is faster than using disk.
• Extremely high availability of data – by keeping it in memory and in a highly distributed cluster.
• Data structure – using a key/value store allows greater flexibility for the application developer. The object store is similar in interface to a typical concurrent hash map.
• Scalable data partitioning
• Transactional ACID support
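A minimal sketch can make the "concurrent hash map spread over a cluster" idea concrete. In the plain-Java toy below, each "node" is just a `ConcurrentHashMap` in the same process, keys are assigned to a primary node by hash, and one redundant copy goes to the next node. `ToyDataGrid` and all of its methods are hypothetical; a real IMDG adds networking, membership, rebalancing and transactions, none of which are modelled here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Toy in-process "grid": each node is a ConcurrentHashMap. Not a real IMDG client. */
final class ToyDataGrid<K, V> {
    private final List<ConcurrentMap<K, V>> nodes = new ArrayList<>();

    ToyDataGrid(int nodeCount) {
        if (nodeCount < 1) {
            throw new IllegalArgumentException("need at least one node");
        }
        for (int i = 0; i < nodeCount; i++) {
            nodes.add(new ConcurrentHashMap<>());
        }
    }

    /** Index of the primary owner of a key, chosen by hash. */
    int primary(K key) {
        return Math.floorMod(key.hashCode(), nodes.size());
    }

    /** Index of the node holding the redundant copy (the next node in the ring). */
    private int backup(K key) {
        return (primary(key) + 1) % nodes.size();
    }

    /** Writes to the primary and to the backup node. */
    void put(K key, V value) {
        nodes.get(primary(key)).put(key, value);
        nodes.get(backup(key)).put(key, value);
    }

    /** Reads from the primary, falling back to the redundant copy. */
    V get(K key) {
        V v = nodes.get(primary(key)).get(key);
        return v != null ? v : nodes.get(backup(key)).get(key);
    }

    /** Simulates losing one node's memory. */
    void failNode(int index) {
        nodes.get(index).clear();
    }
}
```

With three nodes, clearing the primary node of a key still leaves its value readable from the backup, which is the availability argument from the slide in miniature.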

Page 13: In-Memory Computing, Storage & Analysis: Apache Apex + Apache Geode
Page 14: In-Memory Computing, Storage & Analysis: Apache Apex + Apache Geode

High Level Architecture - Geode

Page 15: In-Memory Computing, Storage & Analysis: Apache Apex + Apache Geode

Geode Features

Core Features
• Linear scalability & latency-minimizing data distribution
• Performance-optimized persistence - high availability & durability
• Configurable consistency - region types {partitioned, replicated & local}
• Distributed transactions
• Cluster resilience & failover

Advanced Features
• Server Function Execution - send computation to the data
• Asynchronous Events - deliver events to a receiver without impacting the write path
• Continuous Queries & Client subscriptions - useful for refreshing the client cache
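"Send computation to data" means running a function where each partition lives and returning only the small per-partition results, instead of pulling all entries to the caller. The sketch below imitates that with plain Java: partitions are ordinary maps, and a thread pool stands in for the servers. `SendComputeToData`, `executeOnPartitions` and `totalOf` are hypothetical names; this is not Geode's function execution API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Illustration of running a function next to each data partition; NOT Geode's API. */
final class SendComputeToData {

    /** Applies fn to every partition in parallel and gathers one result per partition. */
    static <K, V, R> List<R> executeOnPartitions(List<Map<K, V>> partitions,
                                                 Function<Map<K, V>, R> fn) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, partitions.size()));
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (Map<K, V> partition : partitions) {
                Callable<R> task = () -> fn.apply(partition);
                futures.add(pool.submit(task));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> f : futures) {
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for partition results", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("function failed on a partition", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    /** Sums all values: each partition computes a local sum, the caller adds the partials. */
    static long totalOf(List<Map<String, Long>> partitions) {
        Function<Map<String, Long>, Long> localSum =
                p -> p.values().stream().mapToLong(Long::longValue).sum();
        long total = 0;
        for (Long partial : executeOnPartitions(partitions, localSum)) {
            total += partial;
        }
        return total;
    }
}
```

Only one number per partition crosses back to the caller, which is why this pattern scales better than fetching every entry when the partitions are large.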

Page 17: In-Memory Computing, Storage & Analysis: Apache Apex + Apache Geode

Application Patterns

• Caching for speed and scale – Read-through, Write-through, Write-behind
• Geode as the OLTP system of record – Data in-memory for low latency, on disk for durability
• Parallel compute engine
• Real-time analytics
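The three caching patterns named on this slide differ only in when the backing store is touched. The following plain-Java sketch shows all three against a stand-in store that counts its reads and writes. `CachePatterns` and `BackingStore` are hypothetical names, the class is single-threaded for brevity, and none of this is Geode's cache loader or writer API.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Read-through, write-through and write-behind in miniature; not thread-safe. */
final class CachePatterns {

    /** Stand-in for a slow backing store (database, disk, ...). */
    static final class BackingStore {
        final Map<String, String> rows = new HashMap<>();
        int reads = 0;
        int writes = 0;

        String load(String key) {
            reads++;
            return rows.get(key);
        }

        void save(String key, String value) {
            writes++;
            rows.put(key, value);
        }
    }

    private final Map<String, String> cache = new HashMap<>();
    private final Queue<String> dirtyKeys = new ArrayDeque<>();
    private final BackingStore store;

    CachePatterns(BackingStore store) {
        this.store = store;
    }

    /** Read-through: a miss loads from the store and populates the cache. */
    String get(String key) {
        String v = cache.get(key);
        if (v == null) {
            v = store.load(key);
            if (v != null) {
                cache.put(key, v);
            }
        }
        return v;
    }

    /** Write-through: the store is updated synchronously before the call returns. */
    void putWriteThrough(String key, String value) {
        store.save(key, value);
        cache.put(key, value);
    }

    /** Write-behind: only the cache is updated now; the store write is queued. */
    void putWriteBehind(String key, String value) {
        cache.put(key, value);
        dirtyKeys.add(key);
    }

    /** Drains queued writes to the store; returns how many were flushed. */
    int flush() {
        int flushed = 0;
        String key;
        while ((key = dirtyKeys.poll()) != null) {
            store.save(key, cache.get(key));
            flushed++;
        }
        return flushed;
    }
}
```

Write-behind gives the lowest write latency but leaves a window in which the store lags the cache, so it suits data that can tolerate a short delay before becoming durable.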

Page 18: In-Memory Computing, Storage & Analysis: Apache Apex + Apache Geode

Geode Reads With Consistent Latency and CPU

• Scaled from 256 clients and 2 servers to 1280 clients and 10 servers
• Partitioned region with redundancy and 1K data size

[Chart: speedup, latency (ms) and CPU % plotted against the number of server hosts (2 to 10)]

Page 19: In-Memory Computing, Storage & Analysis: Apache Apex + Apache Geode

Geode 3.5-4.5X Faster Than Cassandra for YCSB

Page 20: In-Memory Computing, Storage & Analysis: Apache Apex + Apache Geode

Roadmap

• HDFS persistence
• Off-heap storage
• Lucene indexes
• Spark integration
• Cloud Foundry service

…and other ideas from the Geode community!

Page 21: In-Memory Computing, Storage & Analysis: Apache Apex + Apache Geode

Streaming meets In-Memory Data Grid

Page 22: In-Memory Computing, Storage & Analysis: Apache Apex + Apache Geode

Apex + Geode

Apex operator checkpointing in Geode store
• Better latency for checkpoint operations than HDFS checkpointing
• Makes the Apex DAG a complete in-memory pipeline
• https://issues.apache.org/jira/browse/APEXCORE-283
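The checkpointing idea can be sketched as a small store that serializes an operator's state and keeps the bytes in an in-memory map keyed by operator and window. Everything below is an assumption made for illustration: the `InMemoryCheckpoints` class, its `save`/`load`/`delete` methods and the `operatorId/windowId` key format are hypothetical and are not taken from APEXCORE-283. The store is written against `java.util.concurrent.ConcurrentMap`; here it is backed by a local `ConcurrentHashMap`, whereas the slide's proposal would put a Geode region in that role.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical checkpoint store keeping serialized operator state in a key/value map. */
final class InMemoryCheckpoints {
    private final ConcurrentMap<String, byte[]> store;

    InMemoryCheckpoints(ConcurrentMap<String, byte[]> store) {
        this.store = store;
    }

    /** Assumed key layout: one entry per (operator, window) pair. */
    private static String key(int operatorId, long windowId) {
        return operatorId + "/" + windowId;
    }

    /** Serializes the state and stores the bytes; later changes to the object are not seen. */
    void save(int operatorId, long windowId, Serializable state) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(state);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        store.put(key(operatorId, windowId), bytes.toByteArray());
    }

    /** Returns the deserialized state, or null if no checkpoint exists for that window. */
    Object load(int operatorId, long windowId) {
        byte[] data = store.get(key(operatorId, windowId));
        if (data == null) {
            return null;
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("checkpointed class not on classpath", e);
        }
    }

    /** Drops a checkpoint that is no longer needed for recovery. */
    void delete(int operatorId, long windowId) {
        store.remove(key(operatorId, windowId));
    }
}
```

Saving a snapshot per window means recovery can restart an operator from the last saved window rather than from the beginning of the stream; keeping the snapshot in memory is what removes the HDFS write from the checkpoint path.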

Write Apex data streams to Geode store
• Apex output operator implementation which writes data to a Geode region
• Use cases
  • Ingest streaming data in Geode for further processing
  • Store data processed by the Apex pipeline in the Geode store to serve user queries
• https://malhar.atlassian.net/projects/MLHR/issues/MLHR-1942
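An output operator of this kind is essentially a sink that upserts each incoming key/value tuple into the region. The sketch below shows that shape in plain Java. `RegionOutputOperator`, `process` and `demoQuery` are hypothetical names and do not describe the operator tracked in MLHR-1942. The sink is written against `java.util.concurrent.ConcurrentMap`, which Geode's `Region` interface extends, so a local `ConcurrentHashMap` can stand in for the region here.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical sink operator; the actual Malhar operator's API may differ. */
final class RegionOutputOperator<K, V> {
    private final ConcurrentMap<K, V> region;
    private long tuplesWritten = 0;

    RegionOutputOperator(ConcurrentMap<K, V> region) {
        this.region = region;
    }

    /** Called once per incoming tuple: upsert the key/value pair into the region. */
    void process(Map.Entry<K, V> tuple) {
        region.put(tuple.getKey(), tuple.getValue());
        tuplesWritten++;
    }

    long tuplesWritten() {
        return tuplesWritten;
    }

    /** Demo: stream (word, count) tuples into a region, then serve a query from it. */
    static Integer demoQuery(String word) {
        ConcurrentMap<String, Integer> region = new ConcurrentHashMap<>();
        RegionOutputOperator<String, Integer> sink = new RegionOutputOperator<>(region);
        List<Map.Entry<String, Integer>> stream = List.of(
                Map.entry("apex", 1), Map.entry("geode", 1), Map.entry("apex", 2));
        for (Map.Entry<String, Integer> tuple : stream) {
            sink.process(tuple);
        }
        return region.get(word);
    }
}
```

Because later tuples overwrite earlier ones for the same key, a query against the region always sees the most recent value the pipeline produced, which matches the "serve user queries" use case on the slide.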

Page 23: In-Memory Computing, Storage & Analysis: Apache Apex + Apache Geode

Questions???

Thank You…