Automated Sharded MongoDB Deployment and Benchmarking for Big Data Analysis
Gregor von Laszewski
Intelligent Systems Engineering Department
Acknowledgement
• This study has been conducted as part of the I524 class with the topic Big Data and Software Projects.
• The class used the following resources:
  • Students' computers
  • FutureSystems (DSC @ Indiana University), a continuation of FutureGrid (NSF)
  • Chameleon Cloud (NSF): Project CH-818664, KVM
  • Jetstream (NSF)
• Some students also elected to use:
  • AWS
  • Azure
• All resources, as far as we know, were provided to us for free.
Outline
• Motivation for the project
  • IU educates data scientists
• Sharded MongoDB deployment
• Benchmarks
• Usage observations
• Conclusion
  • Why we did not do a large-scale study …
  • Implication for future classes …
Data Scientist
Analysis
• Statistics
• Machine Learning
• Optimization
Programming
• Python, JavaScript
• Distributed Computing
• Cloud Programming
Infrastructure
• Cloud Computing
• Distributed Systems
• DevOps
Visualization
• Basic Skills
• Customize for Data Set
Domain Knowledge
Communication
• Paper write-up
• Online Publication
• Requires integrated knowledge in several key areas. We use a project that addresses:
  • Communication
  • Analysis
  • Visualization
  • Programming
  • Infrastructure
  • Domain knowledge
• Education programs need to address all of them.
Sharded MongoDB Deployments
Continuous Improvement vs. Continuous Deployment via DevOps
[Figure: continuous improvement cycle with the stages design & modification, Cloudmesh script, deployment, data, execution, verification]
• DevOps is integrated.
• Leads to improvement when targeting not only the application but also the deployment environment.
Cloudmesh Shell – Make Booting Simple
$ emacs cloudmesh.yaml
$ cms default cloud=NAME
$ cms default image=NAME
$ cms default flavor=NAME
$ cms vm boot
$ cms vm login
$ cms vm delete
• cloudmesh.yaml
• Preparedefaults
• Boot
• Login
• Management …
Cloudmesh Shell – Manage Hybrid Clouds
$ cms aws boot
$ cms vm boot
$ cms default cloud=chameleon
$ cms vm boot
$ cms default cloud=IUCloud
$ cms vm boot
• Boot Cloud A
• Boot Cloud B
• Boot Cloud C
Cloudmesh Shell – Create a Hadoop Cluster
$ cm default cloud=chameleon
$ cm cluster define --count=10 --flavor=m1.large
$ cm hadoop define spark
$ cm hadoop sync    # ~30 sec
$ cm hadoop deploy  # ~7 min
• Set cloud
• Define cluster
• Define Hadoop cluster
• Sync definition to db
• Deploy the cluster
Cloudmesh Shell – Create a Hadoop Cluster
$ cm default cloud=IUCloud
$ cm cluster define --count=10 --flavor=m1.large
$ cm nist fingerprint  # ~30 min
• Set cloud
• Define cluster
• Run NIST use case
Additional resources: https://github.com/cloudmesh/classes/blob/master/docs/source/notebooks/fingerprint_matching.ipynb
MongoDB Features
• Document-oriented NoSQL database
  • JSON-like documents
  • Specified through schemas
• Cross-platform compatible
• Free, open source
• NoSQL = data that is modeled in means other than the tabular relations used in relational databases.
• Ad-hoc queries
• Indexing
• Replication
• Load balancing with sharding
• File storage
• Aggregation
• Server-side JavaScript
• Capped collections
MongoDB - Sharding
• User selects a shard key that determines how the data in a collection will be distributed.
  • Data is split into ranges (based on the shard key)
  • and distributed across multiple shards.
• (a) A shard is a master with one or more slaves.
• (b) Or the shard key can be hashed to map to a shard, allowing even data distribution.
• MongoDB can run over multiple servers,
  • balancing the load
  • duplicating data for fault tolerance
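The two ways of mapping a shard key to a shard described above can be sketched in a few lines of Python. This is only an illustration of the idea, not MongoDB's internal implementation: the function names, the split points, and the use of MD5 with a modulo are assumptions made for the example.

```python
import bisect
import hashlib


def range_shard(key, boundaries):
    """Ranged sharding: return the index of the range that contains `key`.

    `boundaries` is a sorted list of split points (hypothetical values);
    shard 0 holds keys below boundaries[0], shard 1 the next range, etc.
    """
    return bisect.bisect_right(boundaries, key)


def hashed_shard(key, num_shards):
    """Hashed sharding: hash the key first, then map the hash to a shard.

    MD5 plus modulo is an assumption for illustration only.
    """
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


if __name__ == "__main__":
    keys = list(range(1000, 1012))      # monotonically increasing keys
    boundaries = [2000, 4000]           # three ranges -> three shards
    print("ranged:", [range_shard(k, boundaries) for k in keys])
    print("hashed:", [hashed_shard(k, 3) for k in keys])
```

With increasing keys, the ranged variant sends every document to the same shard until a split point is crossed, while the hashed variant scatters neighbouring keys, which is the "even data distribution" the slide refers to.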
Benchmarks on Clouds
• Three clouds were selected for deployment:
  • Chameleon Cloud
  • FutureSystems
  • Jetstream
• Goal
  • Within the allocation limitations of a class, compare the performance of multiple clouds by varying a number of parameters.
• Scripted deployments
  • We developed an automated, scripted deployment and benchmarking process.
  • The cloud name is passed as a parameter.
  • Customization for the deployment of MongoDB is passed via the command line.
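As a rough sketch of how such a command-line interface could look, the following Python snippet parses a cloud name plus the MongoDB layout. It is not the class's actual script: the parser, the long option names, and the defaults are hypothetical. Only the short flags -c, -m, -s, and -r and the three cloud names are taken from this deck.

```python
import argparse

# Cloud names from the slides; the lowercase spelling is an assumption.
CLOUDS = ("chameleon", "futuresystems", "jetstream")


def build_parser():
    """Build a hypothetical parser for a sharded MongoDB deployment script."""
    parser = argparse.ArgumentParser(
        description="Deploy and benchmark a sharded MongoDB cluster (sketch)."
    )
    parser.add_argument("cloud", choices=CLOUDS, help="target cloud")
    parser.add_argument("-c", "--config", type=int, default=1,
                        help="number of config servers")
    parser.add_argument("-m", "--mongos", type=int, default=1,
                        help="number of mongos routers")
    parser.add_argument("-s", "--shards", type=int, default=1,
                        help="number of shards")
    parser.add_argument("-r", "--replicas", type=int, default=1,
                        help="number of replicas")
    return parser


def parse_deployment(argv):
    """Return the requested deployment as a plain dictionary."""
    return vars(build_parser().parse_args(argv))


if __name__ == "__main__":
    # Layout of Deployment B on Chameleon Cloud
    print(parse_deployment(["chameleon", "-c", "3", "-m", "2", "-s", "3", "-r", "3"]))
```

Keeping the cloud as a positional argument and the layout as optional flags means the same invocation can be repeated on each cloud with only the first argument changed.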
Cloud Comparison
                 FutureSystems   Chameleon    Jetstream
CPU              Xeon E5-2670    Xeon X5550   Haswell E-2680
Cores            1024            1008         7680
Speed            2.66 GHz        2.3 GHz      2.5 GHz
RAM              3072 GB         5376 GB      40 TB
Storage          335 TB          2 TB         2 TB
Deployment year  2010            Early 2015   OS2016
Flavor and OS
• Ubuntu 16.04 LTS (Xenial Xerus) operating system.
• Flavors differ slightly between clouds; we used the most similar ones:
  • m1.medium on Chameleon Cloud
  • m1.medium on FutureSystems
  • m1.small on Jetstream
    • Jetstream flavors have more resources than those on Chameleon and FutureSystems
    • Storage is lower on Jetstream
Cloud          Flavor     VCPU  RAM (GB)  Size (GB)
Chameleon      m1.medium  2     4         40
FutureSystems  m1.medium  2     4         40
Jetstream      m1.small   2     4         20
Requirements
• Resource requirements
  • 60 users -> VM hours were limited.
• Capability requirements
  • Creation of VMs and the execution of our applications within these VMs.
• Monitoring requirements
  • Monitoring and benchmarking were conducted by hand without need for specialized services.
• New software created
  • Improved the cloudmesh client software [5][6][7], essential to the success of the class.
• Performance comparison
  • We have conducted a significant performance comparison among all clouds.
Deployments
• Deployment A
  • A simple deployment with only one of each component being created.
• Deployment B
  • Variation in config servers and shards, and an additional Mongos instance.
• Deployment C
  • Focus on high performance.
  • 9 shards, no replication
Deployment  Config  Mongos  Shards  Replicas  Seconds
A           1       1       1       1         330
B           3       2       3       3         1059
C           1       1       9       1         [email protected]
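To get a feel for how large each deployment is, one can count the processes it has to start. The sketch below assumes that the Replicas column gives the number of members per shard, so that a deployment runs config + mongos + shards × replicas processes. That reading is an assumption, not something the slides state.

```python
# (config servers, mongos, shards, replicas) for deployments A, B, C
DEPLOYMENTS = {
    "A": (1, 1, 1, 1),
    "B": (3, 2, 3, 3),
    "C": (1, 1, 9, 1),
}


def component_count(config, mongos, shards, replicas):
    """Total processes, assuming `replicas` members in every shard."""
    return config + mongos + shards * replicas


if __name__ == "__main__":
    for name, layout in DEPLOYMENTS.items():
        print(name, component_count(*layout))
```

Under this assumption Deployment B starts the most processes, which is at least consistent with it taking roughly three times as long as Deployment A to deploy.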
Varying Other Deployment Times
Config Servers -c  Mongos -m  Shards -s  Replicas -r  Time in Seconds
5                  1          1          1            534
1                  5          1          1            556
1                  1          5          1            607
1                  1          1          5            524
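One way to read the table above is to compare each run with Deployment A (one of each component, 330 seconds) and divide the extra time by the four additional instances. The sketch below does that arithmetic. Treating Deployment A as a comparable baseline is an assumption, since the slides do not say the runs were made under identical conditions.

```python
BASELINE_SECONDS = 330  # Deployment A: one of each component

# (varied component, instance count, measured seconds) from the table above
RUNS = [
    ("config servers", 5, 534),
    ("mongos", 5, 556),
    ("shards", 5, 607),
    ("replicas", 5, 524),
]


def extra_seconds_per_instance(count, seconds, baseline=BASELINE_SECONDS):
    """Extra deployment time per additional instance beyond the first."""
    return (seconds - baseline) / (count - 1)


if __name__ == "__main__":
    for name, count, seconds in RUNS:
        per_instance = extra_seconds_per_instance(count, seconds)
        print(f"{name}: {per_instance:.2f} s per additional instance")
```

On this reading, adding shards is the most expensive change and adding replicas the cheapest, though the differences are small compared with the baseline itself.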
Data
• Major League Baseball PITCHf/x data obtained by using the program Baseball on a Stick (BBOS).
• BBOS is a Python program created by "willkoky" and hosted on sourceforge.net, which extracts data from mlb.com and loads it into a MySQL database.
• Data was captured locally to the default MySQL database and then extracted to a CSV file.
• The CSV file was imported.
  • Contains 5,508,014 rows and 61 columns. 1.58 GB in size uncompressed.
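The figures above give a rough idea of how small each record is. The following sketch computes the average row and field size, assuming that 1 GB means 10^9 bytes (the slides do not say which convention is used).

```python
ROWS = 5_508_014
COLUMNS = 61
SIZE_BYTES = 1.58e9  # assumption: 1 GB = 10**9 bytes


def average_row_bytes(size_bytes=SIZE_BYTES, rows=ROWS):
    """Average uncompressed CSV bytes per row."""
    return size_bytes / rows


def average_field_bytes(size_bytes=SIZE_BYTES, rows=ROWS, columns=COLUMNS):
    """Average uncompressed CSV bytes per field, separators included."""
    return size_bytes / (rows * columns)


if __name__ == "__main__":
    print(f"{average_row_bytes():.1f} bytes per row")
    print(f"{average_field_bytes():.2f} bytes per field")
```

Each row is only a few hundred bytes, so the benchmark exercises many small documents rather than a few large ones.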
Version Comparison: 3.2 vs 3.4 MapReduce (Chameleon Cloud)
MapReduce
Result: no major differences between versions 3.2 and 3.4
Figure 1: Find Command - Sharding Test
• Chameleon – Jetstream
  • Same
• FutureSystems
  • Acceptable results with higher number of shards
Figure 3: MapReduce - Sharding Test
• Chameleon – Jetstream
  • Same
• FutureSystems
  • Significantly worse
Figure 4: Find Command - Replication Test
• Replication
  • Chameleon Cloud seems to perform slightly better
  • FutureSystems performs surprisingly well
Figure 5: Mongoimport Command - Replication Test
• Chameleon – Jetstream
  • Same
• FutureSystems
  • Significantly worse
Figure 6: MapReduce - Replication Test
• Chameleon
  • Slightly better than Jetstream
• FutureSystems
  • Significantly worse
Conclusion
• Jetstream and Chameleon Cloud are essentially the same.
  • In some instances Chameleon Cloud performs slightly better
  • (disks/network …)
• As expected, FutureSystems is an older machine and does not perform as well.
  • For some queries FutureSystems is surprisingly good
• Experiments were limited by the number of node hours for 60 students in class.
  • After the class was over, there was no time to run larger examples
  • It is not obvious for a teacher when to give larger allocations to a student who performs well.
  • Allocation process broken
  • FutureSystems allocation process is superior