Database Internals: A Deep Dive into How Distributed Data Systems Work
Alex Petrov
Database Internals by Alex Petrov
Copyright © 2019 Oleksandr Petrov. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: [email protected].
Acquisitions Editor: Mike Loukides
Development Editor: Michele Cronin
Production Editor: Christopher Faucher
Copyeditor: Kim Cofer
Proofreader: Sonia Saruba
Indexer: Judith McConville
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest
October 2019: First Edition
Revision History for the First Edition
2019-09-12: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781492040347 for release details.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Database Internals, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
The views expressed in this work are those of the author, and do not represent the publisher’s views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-492-04034-7
[MBP]
Dedication
To Pieter Hintjens, from whom I got my first ever signed book: an inspiring distributed systems programmer, author, philosopher, and friend.
Preface
Distributed database systems are an integral part of most businesses and the vast majority of software applications. These applications provide logic and a user interface, while database systems take care of data integrity, consistency, and redundancy.
Back in 2000, if you were to choose a database, you would have just a few options, and most of them would be within the realm of relational databases, so differences between them would be relatively small. Of course, this does not mean that all databases were completely the same, but their functionality and use cases were very similar.
Some of these databases have focused on horizontal scaling (scaling out), that is, improving performance and increasing capacity by running multiple database instances acting as a single logical unit: Gamma Database Machine Project, Teradata, Greenplum, Parallel DB2, and many others. Today, horizontal scaling remains one of the most important properties that customers expect from databases. This can be explained by the rising popularity of cloud-based services. It is often easier to spin up a new instance and add it to the cluster than to scale vertically (scaling up) by moving the database to a larger, more powerful machine. Migrations can be long and painful, potentially incurring downtime.
Around 2010, a new class of eventually consistent databases started appearing, and terms such as NoSQL, and later, big data grew in popularity. Over the last 15 years, the open source community, large internet companies, and database vendors have created so many databases and tools that it’s easy to get lost trying to understand use cases, details, and specifics.
The Dynamo paper [DECANDIA07], published by the team at Amazon in 2007, had so much impact on the database community that within a short period it inspired many variants and implementations. The most prominent of them were Apache Cassandra, created at Facebook; Project Voldemort, created at LinkedIn; and Riak, created by former Akamai engineers.
Today, the field is changing again: after the time of key-value stores, NoSQL, and eventual consistency, we have started seeing more scalable and performant databases, able to execute complex queries with stronger consistency guarantees.
Audience of This Book
In conversations at technical conferences, I often hear the same question: “How can I learn more about database internals? I don’t even know where to start.” Most of the books on database systems do not go into details of storage engine implementation, and cover the access methods, such as B-Trees, on a rather high level. There are very few books that cover more recent concepts, such as different B-Tree variants and log-structured storage, so I usually recommend reading papers.
Everyone who reads papers knows that it’s not that easy: you often lack context, the wording might be ambiguous, there’s little or no connection between papers, and they’re hard to find. This book contains concise summaries of important database systems concepts and can serve as a guide for those who’d like to dig in deeper, or as a cheat sheet for those already familiar with these concepts.
Not everyone wants to become a database developer, but this book will help people who build software that uses database systems: software developers, reliability engineers, architects, and engineering managers.
If your company depends on any infrastructure component, be it a database, a messaging queue, a container platform, or a task scheduler, you have to read the project change-logs and mailing lists to stay in touch with the community and be up-to-date with the most recent happenings in the project. Understanding terminology and knowing what’s inside will enable you to yield more information from these sources and use your tools more productively to troubleshoot, identify, and avoid potential risks and bottlenecks. Having an overview and a general understanding of how database systems work will help in case something goes wrong. Using this knowledge, you’ll be able to form a hypothesis, validate it, find the root cause, and present it to other project maintainers.
This book is also for curious minds: for the people who like learning things without immediate necessity, those who spend their free time hacking on something fun, creating compilers, writing homegrown operating systems, text editors, computer games, learning programming languages, and absorbing new information.
The reader is assumed to have some experience with developing backend systems and working with database systems as a user. Having some prior knowledge of different data structures will help to digest material faster.
Why Should I Read This Book?
We often hear people describing database systems in terms of the concepts and algorithms they implement: “This database uses gossip for membership propagation” (see Chapter 12), “They have implemented Dynamo,” or “This is just like what they’ve described in the Spanner paper” (see Chapter 13). Or, if you’re discussing the algorithms and data structures, you can hear something like “ZAB and Raft have a lot in common” (see Chapter 14), “Bw-Trees are like the B-Trees implemented on top of log structured storage” (see Chapter 6), or “They are using sibling pointers like in Blink-Trees” (see Chapter 5).
We need abstractions to discuss complex concepts, and we can’t have a discussion about terminology every time we start a conversation. Having shortcuts in the form of common language helps us to move our attention to other, higher-level problems.
One of the advantages of learning the fundamental concepts, proofs, and algorithms is that they never grow old. Of course, there will always be new ones, but new algorithms are often created after finding a flaw or room for improvement in a classical one. Knowing the history helps to understand differences and motivation better.
Learning about these things is inspiring. You see the variety of algorithms, see how our industry was solving one problem after the other, and get to appreciate that work. At the same time, learning is rewarding: you can almost feel how multiple puzzle pieces move together in your mind to form a full picture that you will always be able to share with others.
Scope of This Book
This is neither a book about relational database management systems nor about NoSQL ones, but about the algorithms and concepts used in all kinds of database systems, with a focus on a storage engine and the components responsible for distribution.
Some concepts, such as query planning, query optimization, scheduling, the relational model, and a few others, are already covered in several great textbooks on database systems. Some of these concepts are usually described from the user’s perspective, but this book concentrates on the internals. You can find some pointers to useful literature in the Part II Conclusion and in the chapter summaries. In these books you’re likely to find answers to many database-related questions you might have.
Query languages aren’t discussed, since there’s no single common language among the database systems mentioned in this book.
To collect material for this book, I studied over 15 books, more than 300 papers, countless blog posts, source code, and the documentation for several open source databases. The rule of thumb for whether or not to include a particular concept in the book was the question: “Do the people in the database industry and research circles talk about this concept?” If the answer was “yes,” I added the concept to the long list of things to discuss.
Structure of This Book
There are some examples of extensible databases with pluggable components (such as [SCHWARZ86]), but they are rather rare. At the same time, there are plenty of examples where databases use pluggable storage. Similarly, we rarely hear database vendors talking about query execution, while they are very eager to discuss the ways their databases preserve consistency.
The most significant distinctions between database systems are concentrated around two aspects: how they store and how they distribute the data. (Other subsystems can at times also be of importance, but are not covered here.) The book is arranged into parts that discuss the subsystems and components responsible for storage (Part I) and distribution (Part II).
Part I discusses node-local processes and focuses on the storage engine, the central component of the database system and one of the most significant distinctive factors. First, we start with the architecture of a database management system and present several ways to classify database systems based on the primary storage medium and layout.
We continue with storage structures and try to understand how disk-based structures are different from in-memory ones, introduce B-Trees, and cover algorithms for efficiently maintaining B-Tree structures on disk, including serialization, page layout, and on-disk representations. Later, we discuss multiple variants to illustrate the power of this concept and the diversity of data structures influenced and inspired by B-Trees.
Last, we discuss several variants of log-structured storage, commonly used for implementing file and storage systems, motivation, and reasons to use them.
Part II is about how to organize multiple nodes into a database cluster. We start with the importance of understanding the theoretical concepts for building fault-tolerant distributed systems, how distributed systems are different from single-node applications, and which problems, constraints, and complications we face in a distributed environment.
After that, we dive deep into distributed algorithms. Here, we start with algorithms for failure detection, helping to improve performance and stability by noticing and reporting failures and avoiding the failed nodes. Since many algorithms discussed later in the book rely on understanding the concept of leadership, we introduce several algorithms for leader election and discuss their suitability.
As one of the most difficult things in distributed systems is achieving data consistency, we discuss concepts of replication, followed by consistency models, possible divergence between replicas, and eventual consistency. Since eventually consistent systems sometimes rely on anti-entropy for convergence and gossip for data dissemination, we discuss several anti-entropy and gossip approaches. Finally, we discuss logical consistency in the context of database transactions, and finish with consensus algorithms.
It would’ve been impossible to write this book without all the research and publications. You will find many references to papers and publications in the text, in square brackets with monospace font; for example, [DECANDIA07]. You can use these references to learn more about related concepts in more detail.
After each chapter, you will find a summary section that contains material for further study, related to the content of the chapter.
Conventions Used in This Book
The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.
TIP
This element signifies a tip or suggestion.
NOTE
This element signifies a general note.
WARNING
This element indicates a warning or caution.
Using Code Examples
This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Database Internals by Alex Petrov (O’Reilly). Copyright 2019 Oleksandr Petrov, 978-1-492-04034-7.”
If you feel your use of code examples falls outside fair use or the permission given above, [email protected].
O’Reilly Online Learning
NOTE
For almost 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.
Our unique network of experts and innovators share their knowledge and expertise through books, articles, conferences, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, please visit http://oreilly.com.
How to Contact Us
Please address comments and questions concerning this book to the publisher:
O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)
We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://bit.ly/database-internals.
To comment or ask technical questions about this book, [email protected].
For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia
Acknowledgments
This book wouldn’t have been possible without the hundreds of people who have worked hard on research papers and books, which have been a source of ideas and inspiration, and have served as references for this book.
I’d like to say thank you to all the people who reviewed manuscripts and provided feedback, making sure that the material in this book is correct and the wording is precise: Dmitry Alimov, Peter Alvaro, Carlos Baquero, Jason Brown, Blake Eggleston, Marcus Eriksson, Francisco Fernández Castaño, Heidi Howard, Vaidehi Joshi, Maximilian Karasz, Stas Kelvich, Michael Klishin, Predrag Knežević, Joel Knighton, Eugene Lazin, Nate McCall, Christopher Meiklejohn, Tyler Neely, Maxim Neverov, Marina Petrova, Stefan Podkowinski, Edward Ribiero, Denis Rytsov, Kir Shatrov, Alex Sorokoumov, Massimiliano Tomassi, and Ariel Weisberg.
Of course, this book wouldn’t have been possible without support from my family: my wife Marina and my daughter Alexandra, who have supported me every step of the way.
Part I. Storage Engines
The primary job of any database management system is reliably storing data and making it available for users. We use databases as a primary source of data, helping us to share it between the different parts of our applications. Instead of finding a way to store and retrieve information and inventing a new way to organize data every time we create a new app, we use databases. This way we can concentrate on application logic instead of infrastructure.
Since the term database management system (DBMS) is quite bulky, throughout this book we use more compact terms, database system and database, to refer to the same concept.
Databases are modular systems and consist of multiple parts: a transport layer accepting requests, a query processor determining the most efficient way to run queries, an execution engine carrying out the operations, and a storage engine (see “DBMS Architecture”).
The storage engine (or database engine) is a software component of a database management system responsible for storing, retrieving, and managing data in memory and on disk, designed to capture a persistent, long-term memory of each node [REED78]. While databases can respond to complex queries, storage engines look at the data more granularly and offer a simple data manipulation API, allowing users to create, update, delete, and retrieve records. One way to look at this is that database management systems are applications built on top of storage engines, offering a schema, a query language, indexing, transactions, and many other useful features.
For flexibility, both keys and values can be arbitrary sequences of bytes with no prescribed form. Their sorting and representation semantics are defined in higher-level subsystems. For example, you can use int32 (32-bit integer) as a key in one of the tables, and ascii (ASCII string) in the other; from the storage engine perspective both keys are just serialized entries.
Storage engines such as BerkeleyDB, LevelDB and its descendant RocksDB, LMDB and its descendant libmdbx, Sophia, HaloDB, and many others were developed independently from the database management systems they’re now embedded into. Using pluggable storage engines has enabled database developers to bootstrap database systems using existing storage engines, and concentrate on the other subsystems.
At the same time, clear separation between database system components opens up an opportunity to switch between different engines, potentially better suited for particular use cases. For example, MySQL, a popular database management system, has several storage engines, including InnoDB, MyISAM, and RocksDB (in the MyRocks distribution). MongoDB allows switching between WiredTiger, In-Memory, and the (now-deprecated) MMAPv1 storage engines.
Comparing Databases
Your choice of database system may have long-term consequences. If there’s a chance that a database is not a good fit because of performance problems, consistency issues, or operational challenges, it is better to find out about it earlier in the development cycle, since it can be nontrivial to migrate to a different system. In some cases, it may require substantial changes in the application code.
Every database system has strengths and weaknesses. To reduce the risk of an expensive migration, you can invest some time before you decide on a specific database to build confidence in its ability to meet your application’s needs.
Trying to compare databases based on their components (e.g., which storage engine they use, how the data is shared, replicated, and distributed, etc.), their rank (an arbitrary popularity value assigned by consultancy agencies such as ThoughtWorks or database comparison websites such as DB-Engines or Database of Databases), or implementation language (C++, Java, or Go, etc.) can lead to invalid and premature conclusions. These methods can be used only for a high-level comparison and can be as coarse as choosing between HBase and SQLite, so even a superficial understanding of how each database works and what’s inside it can help you land a more weighted conclusion.
Every comparison should start by clearly defining the goal, because even the slightest bias may completely invalidate the entire investigation. If you’re searching for a database that would be a good fit for the workloads you have (or are planning to facilitate), the best thing you can do is to simulate these workloads against different database systems, measure the performance metrics that are important for you, and compare results. Some issues, especially when it comes to performance and scalability, start showing only after some time or as the capacity grows. To detect potential problems, it is best to have long-running tests in an environment that simulates the real-world production setup as closely as possible.
Simulating real-world workloads not only helps you understand how the database performs, but also helps you learn how to operate and debug it, and find out how friendly and helpful its community is. Database choice is always a combination of these factors, and performance often turns out not to be the most important aspect: it’s usually much better to use a database that slowly saves the data than one that quickly loses it.
To compare databases, it’s helpful to understand the use case in great detail and define the current and anticipated variables, such as:
Schema and record sizes
Number of clients
Types of queries and access patterns
Rates of the read and write queries
Expected changes in any of these variables
Knowing these variables can help to answer the following questions:
Does the database support the required queries?
Is this database able to handle the amount of data we’re planning to store?
How many read and write operations can a single node handle?
How many nodes should the system have?
How do we expand the cluster given the expected growth rate?
What is the maintenance process?
Having these questions answered, you can construct a test cluster and simulate your workloads. Most databases already have stress tools that can be used to reconstruct specific use cases. If there’s no standard stress tool to generate realistic randomized workloads in the database ecosystem, it might be a red flag. If something prevents you from using default tools, you can try one of the existing general-purpose tools, or implement one from scratch.
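As a rough illustration of what implementing a stress tool from scratch might start from, the sketch below generates a randomized mix of reads and writes from a few of the variables listed earlier (number of operations, read/write ratio, size of the key space). Everything here is an assumption made for the example: `run_workload` is a hypothetical function, and a plain Python dict stands in for a real database client. A real tool would also need concurrency, realistic key distributions, and latency measurements.

```python
import random


def run_workload(store: dict, operations: int, read_ratio: float,
                 key_space: int, seed: int = 42) -> dict:
    """Issue a randomized read/write mix against `store` and return simple counters."""
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    stats = {"reads": 0, "writes": 0, "read_misses": 0}
    for _ in range(operations):
        key = rng.randrange(key_space)
        if rng.random() < read_ratio:
            stats["reads"] += 1
            if key not in store:
                # A read for a key that was never written.
                stats["read_misses"] += 1
        else:
            store[key] = rng.getrandbits(32)
            stats["writes"] += 1
    return stats
```

For example, `run_workload({}, operations=10_000, read_ratio=0.9, key_space=1_000)` approximates a read-heavy workload; changing `read_ratio` and `key_space` lets you model the expected changes in these variables over time.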
If the tests show positive results, it may be helpful to familiarize yourself with the database code. Looking at the code, it is often useful to first understand the parts of the database, how to find the code for different components, and then navigate through those. Having even a rough idea about the database codebase helps you better understand the log records it produces, its configuration parameters, and helps you find issues in the application that uses it and even in the database code itself.
It’d be great if we could use databases as black boxes and never have to take a look inside them, but the practice shows that sooner or later a bug, an outage, a performance regression, or some other problem pops up, and it’s better to be prepared for it. If you know and understand database internals, you can reduce business risks and improve chances for a quick recovery.
One of the popular tools used for benchmarking, performance evaluation, and comparison is Yahoo! Cloud Serving Benchmark (YCSB). YCSB offers a framework and a common set of workloads that can be applied to different data stores. Just like anything generic, this tool should be used with caution, since it’s easy to make wrong conclusions. To make a fair comparison and make an educated decision, it is necessary to invest enough time to understand the real-world conditions under which the database has to perform, and tailor benchmarks accordingly.
TPC-C BENCHMARK
The Transaction Processing Performance Council (TPC) has a set of benchmarks that database vendors use for comparing and advertising performance of their products. TPC-C is an online transaction processing (OLTP) benchmark, a mixture of read-only and update transactions that simulate common application workloads.
This benchmark concerns itself with the performance and correctness of executed concurrent transactions. The main performance indicator is throughput: the number of transactions the database system is able to process per minute. Executed transactions are required to preserve ACID properties and conform to the set of properties defined by the benchmark itself.
This benchmark does not concentrate on any particular business segment, but provides an abstract set of actions important for most of the applications for which OLTP databases are suitable. It includes several tables and entities such as warehouses, stock (inventory), customers and orders, specifying table layouts, details of transactions that can be performed against these tables, the minimum number of rows per table, and data durability constraints.
This doesn’t mean that benchmarks can be used only to compare databases. Benchmarks can be useful to define and test details of the service-level agreement, understand system requirements, plan capacity, and more. The more knowledge you have about the database before using it, the more time you’ll save when running it in production.
Choosing a database is a long-term decision, and it’s best to keep track of newly released versions, understand what exactly has changed and why, and have an upgrade strategy. New releases usually contain improvements and fixes for bugs and security issues, but may introduce new bugs, performance regressions, or unexpected behavior, so testing new versions before rolling them out is also critical. Checking how database implementers were handling upgrades previously might give you a good idea about what to expect in the future. Past smooth upgrades do not guarantee that future ones will be as smooth, but complicated upgrades in the past might be a sign that future ones won’t be easy, either.
Understanding Trade-Offs
As users, we can see how databases behave under different conditions, but when working on databases, we have to make choices that influence this behavior directly.
Designing a storage engine is definitely more complicated than just implementing a textbook data structure: there are many details and edge cases that are hard to get right from the start. We need to design the physical data layout and organize pointers, decide on the serialization format, understand how data is going to be garbage-collected, how the storage engine fits into the semantics of the database system as a whole, figure out how to make it work in a concurrent environment, and, finally, make sure we never lose any data, under any circumstances.
Not only are there many things to decide upon, but most of these decisions involve trade-offs. For example, if we save records in the order they were inserted into the database, we can store them quicker, but if we retrieve them in their lexicographical order, we have to re-sort them before returning results to the client. As you will see throughout this book, there are many different approaches to storage engine design, and every implementation has its own upsides and downsides.
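The insertion-order example can be sketched with two tiny in-memory stores. This is only an illustration of the trade-off under simplifying assumptions (everything fits in a Python list, no disk, no concurrency); the class names `InsertionOrderStore` and `SortedStore` are made up for this sketch and are not taken from any real storage engine.

```python
import bisect


class InsertionOrderStore:
    """Keeps records in arrival order: cheap writes, sorting deferred to reads."""

    def __init__(self) -> None:
        self._records: list[tuple[str, str]] = []

    def insert(self, key: str, value: str) -> None:
        # Appending does not look at existing records at all.
        self._records.append((key, value))

    def sorted_records(self) -> list[tuple[str, str]]:
        # Every ordered read pays for a full re-sort.
        return sorted(self._records, key=lambda record: record[0])


class SortedStore:
    """Keeps records ordered by key: more work per write, ordered reads are direct."""

    def __init__(self) -> None:
        self._keys: list[str] = []
        self._values: list[str] = []

    def insert(self, key: str, value: str) -> None:
        # Find the position by binary search, then shift later entries to make room.
        position = bisect.bisect_right(self._keys, key)
        self._keys.insert(position, key)
        self._values.insert(position, value)

    def sorted_records(self) -> list[tuple[str, str]]:
        # Data is already in lexicographical key order.
        return list(zip(self._keys, self._values))
```

Both stores return the same result from `sorted_records()`; what differs is where the ordering cost is paid, on the write path or on the read path.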
When looking at different storage engines, we discuss their benefits and shortcomings. If there was an absolutely optimal storage engine for every conceivable use case, everyone would just use it. But since it does not exist, we need to choose wisely, based on the workloads and use cases we’re trying to facilitate.
There are many storage engines, using all sorts of data structures, implemented in different languages, ranging from low-level ones, such as C, to high-level ones, such as Java. All storage engines face the same challenges and constraints. To draw a parallel with city planning, it is possible to build a city for a specific population and choose to build up or build out. In both cases, the same number of people will fit into the city, but these approaches lead to radically different lifestyles. When building the city up, people live in apartments and population density is likely to lead to more traffic in a smaller area; in a more spread-out city, people are more likely to live in houses, but commuting will require covering larger distances.
Similarly, design decisions made by storage engine developers make them better suited for different things: some are optimized for low read or write latency, some try to maximize density (the amount of stored data per node), and some concentrate on operational simplicity.
You can find complete algorithms that can be used for the implementation and other additional references in the chapter summaries. Reading this book should make you well equipped to work productively with these sources and give you a solid understanding of the existing alternatives to concepts described there.
1 The service-level agreement (or SLA) is a commitment by the service provider about the quality of provided services. Among other things, the SLA can include information about latency, throughput, jitter, and the number and frequency of failures.
Chapter 1. Introduction and Overview
Database management systems can serve different purposes: some are used primarily for temporary hot data, some serve as a long-lived cold storage, some allow complex analytical queries, some only allow accessing values by the key, some are optimized to store time-series data, and some store large blobs efficiently. To understand differences and draw distinctions, we start with a short classification and overview, as this helps us to understand the scope of further discussions.
Terminology can sometimes be ambiguous and hard to understand without a complete context. For example, it is easy to confuse column and wide column stores, which have little or nothing to do with each other, or to be unsure how clustered and nonclustered indexes relate to index-organized tables. This chapter aims to disambiguate these terms and find their precise definitions.
We start with an overview of database management system architecture (see “DBMS Architecture”), and discuss system components and their responsibilities. After that, we discuss the distinctions among the database management systems in terms of a storage medium (see “Memory- Versus Disk-Based DBMS”), and layout (see “Column- Versus Row-Oriented DBMS”).
These two groups do not present a full taxonomy of database management systems and there are many other ways they’re classified. For example, some sources group DBMSs into three major categories:
Online transaction processing (OLTP) databases
These handle a large number of user-facing requests and transactions. Queries are often predefined and short-lived.
Online analytical processing (OLAP) databases
These handle complex aggregations. OLAP databases are often used for analytics and data warehousing, and are capable of handling complex, long-running ad hoc queries.
Hybrid transactional and analytical processing (HTAP)
These databases combine properties of both OLTP and OLAP stores.
There are many other terms and classifications: key-value stores, relational databases, document-oriented stores, and graph databases. These concepts are not defined here, since the reader is assumed to have a high-level knowledge and understanding of their functionality. Because the concepts we discuss here are widely applicable and are used in most of the mentioned types of stores in some capacity, complete taxonomy is not necessary or important for further discussion.
Since Part I of this book focuses on the storage and indexing structures, we need to understand the high-level data organization approaches, and the relationship between the data and index files (see “Data Files and Index Files”).
Finally, in “Buffering, Immutability, and Ordering”, we discuss three techniques widely used to develop efficient storage structures and how applying these techniques influences their design and implementation.
DBMS Architecture
There’s no common blueprint for database management system design. Every database is built slightly differently, and component boundaries are somewhat hard to see and define. Even if these boundaries exist on paper (e.g., in project documentation), in code seemingly independent components may be coupled because of performance optimizations, handling edge cases, or architectural decisions.
Sources that describe database management system architecture (for example, [HELLERSTEIN07], [WEIKUM01], [ELMASRI11], and [MOLINA08]), define components and relationships between them differently. The architecture presented in Figure 1-1 demonstrates some of the common themes in these representations.
Database management systems use a client/server model, where database system instances (nodes) take the role of servers, and application instances take the role of clients.
Client requests arrive through the transport subsystem. Requests come in the form of queries, most often expressed in some query language. The transport subsystem is also responsible for communication with other nodes in the database cluster.
Figure 1-1. Architecture of a database management system
Upon receipt, the transport subsystem hands the query over to a query processor, which parses, interprets, and validates it. Later, access control checks are performed, as they can be done fully only after the query is interpreted.
The parsed query is passed to the query optimizer, which first eliminates impossible and redundant parts of the query, and then attempts to find the most efficient way to execute it based on internal statistics (index cardinality, approximate intersection size, etc.) and data placement (which nodes in the cluster hold the data and the costs associated with its transfer). The optimizer handles both relational operations required for query resolution, usually presented as a dependency tree, and optimizations, such as index ordering, cardinality estimation, and choosing access methods.
The query is usually presented in the form of an execution plan (or query plan): a sequence of operations that have to be carried out for its results to be considered complete. Since the same query can be satisfied using different execution plans that can vary in efficiency, the optimizer picks the best available plan.
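The last step, picking the best available plan among several that satisfy the same query, can be sketched as choosing the candidate with the lowest estimated cost. This is a deliberately simplified illustration: the `Plan` type, the `example_plans` helper, and especially the cost formulas are assumptions invented for the example, and real optimizers rely on far richer statistics and cost models.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """A candidate execution plan: an ordered sequence of operations plus a cost estimate."""
    name: str
    steps: tuple[str, ...]
    estimated_cost: float


def example_plans(total_rows: int, matching_rows: int) -> list[Plan]:
    """Build two hypothetical plans for a single-table lookup with made-up cost formulas."""
    full_scan = Plan(
        name="full scan",
        steps=("read every row", "filter by predicate"),
        estimated_cost=float(total_rows),  # assumption: cost grows with table size
    )
    index_lookup = Plan(
        name="index lookup",
        steps=("search index", "fetch matching rows"),
        # assumption: logarithmic index search plus one fetch per matching row
        estimated_cost=math.log2(max(total_rows, 2)) + matching_rows,
    )
    return [full_scan, index_lookup]


def choose_plan(candidates: list[Plan]) -> Plan:
    """Return the candidate with the lowest estimated cost."""
    if not candidates:
        raise ValueError("at least one candidate plan is required")
    return min(candidates, key=lambda plan: plan.estimated_cost)
```

With these toy formulas, a selective predicate favors the index, while a predicate matching nearly every row makes the full scan the cheaper choice, which is why the statistics mentioned above matter.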
The execution plan is handled by the execution engine, which collects the results of the execution of local and remote operations. Remote execution can involve writing and reading data to and from other nodes in the cluster, and replication.
Local queries (coming directly from clients or from other nodes) are executed by the storage engine. The storage engine has several components with dedicated responsibilities:
Transaction manager
This manager schedules transactions and ensures they cannot leave the database in a logically inconsistent state.
Lock manager
This manager holds locks on the database objects for the running transactions, ensuring that concurrent operations do not violate physical data integrity.
Access methods (storage structures)
These manage access to data and its organization on disk. Access methods include heap files and storage structures such as B-Trees (see “Ubiquitous B-Trees”) or LSM Trees (see “LSM Trees”).
Buffer manager
This manager caches data pages in memory (see “Buffer Management”).
Recovery manager
This manager maintains the operation log and restores the system state in case of a failure (see “Recovery”).
Together, transaction and lock managers are responsible for concurrency control (see “Concurrency Control”): they guarantee logical and physical data integrity while ensuring that concurrent operations are executed as efficiently as possible.
Memory-VersusDisk-BasedDBMSDatabasesystemsstoredatainmemoryandondisk.In-memorydatabasemanagementsystems(sometimescalledmainmemoryDBMS)storedataprimarilyinmemoryandusethediskforrecoveryandlogging.Disk-basedDBMSholdmostofthedataondiskandusememoryforcachingdiskcontentsorasatemporarystorage.Bothtypesofsystemsusethedisktoacertainextent,butmainmemorydatabasesstoretheircontentsalmostexclusivelyinRAM.
Accessingmemoryhasbeenandremainsseveralordersofmagnitudefasterthanaccessingdisk, soitiscompellingtousememoryastheprimarystorage,anditbecomesmoreeconomicallyfeasibletodosoasmemorypricesgodown.However,RAMpricesstillremainhighcomparedtopersistentstoragedevicessuchasSSDsandHDDs.
Main memory database systems are different from their disk-based counterparts not only in terms of a primary storage medium, but also in which data structures, organization, and optimization techniques they use.
Databases using memory as a primary data store do this mainly because of performance, comparatively low access costs, and access granularity. Programming for main memory is also significantly simpler than doing so for the disk. Operating systems abstract memory management and allow us to think in terms of allocating and freeing arbitrarily sized memory chunks. On disk, we have to manage data references, serialization formats, freed memory, and fragmentation manually.
The main limiting factors on the growth of in-memory databases are RAM volatility (in other words, lack of durability) and costs. Since RAM contents are not persistent, software errors, crashes, hardware failures, and power outages can result in data loss. There are ways to ensure durability, such as uninterruptible power supplies and battery-backed RAM, but they require additional hardware resources and operational expertise. In practice, it all comes down to the fact that disks are easier to maintain and have significantly lower prices.
The situation is likely to change as the availability and popularity of Non-Volatile Memory (NVM) [ARULRAJ17] technologies grow. NVM storage reduces or completely eliminates (depending on the exact technology) asymmetry between read and write latencies, further improves read and write performance, and allows byte-addressable access.
Durability in Memory-Based Stores
In-memory database systems maintain backups on disk to provide durability and prevent loss of the volatile data. Some databases store data exclusively in memory, without any durability guarantees, but we do not discuss them in the scope of this book.
Before the operation can be considered complete, its results have to be written to a sequential log file. We discuss write-ahead logs in more detail in “Recovery”. To avoid replaying complete log contents during startup or after a crash, in-memory stores maintain a backup copy. The backup copy is maintained as a sorted disk-based structure, and modifications to this structure are often asynchronous (decoupled from client requests) and applied in batches to reduce the number of I/O operations. During recovery, database contents can be restored from the backup and logs.
Log records are usually applied to the backup in batches. After a batch of log records is processed, the backup holds a database snapshot for a specific point in time, and log contents up to this point can be discarded. This process is called checkpointing. It reduces recovery times by keeping the disk-resident database as up-to-date with the log entries as possible, without requiring clients to block until the backup is updated.
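The logging and checkpointing steps above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how any particular database implements it: the class name, the file names, and the JSON record format are all hypothetical, and the checkpoint here runs synchronously for brevity, whereas the text describes it as asynchronous and batched.

```python
import json
import os
import tempfile


class InMemoryStore:
    """Toy in-memory key/value store with a sequential log and a backup copy."""

    def __init__(self, directory):
        self.log_path = os.path.join(directory, "wal.log")         # hypothetical name
        self.backup_path = os.path.join(directory, "backup.json")  # hypothetical name
        self.data = {}
        self.recover()

    def put(self, key, value):
        # The write is appended to the log before it is considered complete.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps([key, value]) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self.data[key] = value

    def checkpoint(self):
        # Write a sorted snapshot, then discard the log contents it covers.
        snapshot = dict(sorted(self.data.items()))
        tmp_path = self.backup_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as backup:
            json.dump(snapshot, backup)
            backup.flush()
            os.fsync(backup.fileno())
        os.replace(tmp_path, self.backup_path)
        open(self.log_path, "w").close()

    def recover(self):
        # Restore from the backup first, then replay the remaining log records.
        self.data = {}
        if os.path.exists(self.backup_path):
            with open(self.backup_path, encoding="utf-8") as backup:
                self.data = json.load(backup)
        if os.path.exists(self.log_path):
            with open(self.log_path, encoding="utf-8") as log:
                for line in log:
                    key, value = json.loads(line)
                    self.data[key] = value


with tempfile.TemporaryDirectory() as directory:
    store = InMemoryStore(directory)
    store.put("a", 1)
    store.checkpoint()
    store.put("b", 2)                    # present only in the log
    restored = InMemoryStore(directory)  # simulates a restart
    recovered = dict(restored.data)
```

After the simulated restart, `recovered` holds both keys: `a` comes from the backup and `b` from replaying the log, so only the records written since the last checkpoint had to be replayed.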
NOTE
It is unfair to say that the in-memory database is the equivalent of an on-disk database with a huge page cache (see “Buffer Management”). Even though pages are cached in memory, serialization format and data layout incur additional overhead and do not permit the same degree of optimization that in-memory stores can achieve.
Disk-based databases use specialized storage structures, optimized for disk access. In memory, pointers can be followed comparatively quickly, and random memory access is significantly faster than random disk access. Disk-based storage structures often have a form of wide and short trees (see “Trees for Disk-Based Storage”), while memory-based implementations can choose from a larger pool of data structures and perform optimizations that would otherwise be impossible or difficult to implement on disk [MOLINA92]. Similarly, handling variable-size data on disk requires special attention, while in memory it’s often a matter of referencing the value with a pointer.
For some use cases, it is reasonable to assume that an entire dataset is going to fit in memory. Some datasets are bounded by their real-world representations, such as student records for schools, customer records for corporations, or inventory in an online store. Each record takes up no more than a few KB, and the number of records is limited.
Column- Versus Row-Oriented DBMS
Most database systems store a set of data records, consisting of columns and rows in tables. A field is the intersection of a column and a row: a single value of some type. Fields belonging to the same column usually have the same data type. For example, if we define a table holding user records, all names would be of the same type and belong to the same column. A collection of values that belong logically to the same record (usually identified by the key) constitutes a row.
One of the ways to classify databases is by how the data is stored on disk: row- or column-wise. Tables can be partitioned either horizontally (storing values belonging to the same row together), or vertically (storing values belonging to the same column together). Figure 1-2 depicts this distinction: (a) shows the values partitioned column-wise, and (b) shows the values partitioned row-wise.
Figure 1-2. Data layout in column- and row-oriented stores
Examples of row-oriented database management systems are abundant: MySQL, PostgreSQL, and most of the traditional relational databases. The two pioneer open source column-oriented stores are MonetDB and C-Store (C-Store is an open source predecessor to Vertica).
Row-Oriented Data Layout
Row-oriented database management systems store data in records or rows. Their layout is quite close to the tabular data representation, where every row has the same set of fields. For example, a row-oriented database can efficiently store user entries, holding names, birth dates, and phone numbers:
| ID | Name  | Birth Date  | Phone Number   |
| 10 | John  | 01 Aug 1981 | +1 111 222 333 |
| 20 | Sam   | 14 Sep 1988 | +1 555 888 999 |
| 30 | Keith | 07 Jan 1984 | +1 333 444 555 |
This approach works well for cases where several fields constitute the record (name, birth date, and a phone number) uniquely identified by the key (in this example, a monotonically incremented number). All fields representing a single user record are often read together. When creating records (for example, when the user fills out a registration form), we write them together as well. At the same time, each field can be modified individually.
Since row-oriented stores are most useful in scenarios when we have to access data by row, storing entire rows together improves spatial locality² [DENNING68].
Because data on a persistent medium such as a disk is typically accessed block-wise (in other words, a minimal unit of disk access is a block), a single block will contain data for all columns. This is great for cases when we’d like to access an entire user record, but makes queries accessing individual fields of multiple user records (for example, queries fetching only the phone numbers) more expensive, since data for the other fields will be paged in as well.
Column-Oriented Data Layout
Column-oriented database management systems partition data vertically (i.e., by column) instead of storing it in rows. Here, values for the same column are stored contiguously on disk (as opposed to storing rows contiguously as in the previous example). For example, if we store historical stock market prices, price quotes are stored together. Storing values for different columns in separate files or file segments allows efficient queries by column, since they can be read in one pass rather than consuming entire rows and discarding data for columns that weren’t queried.
Column-oriented stores are a good fit for analytical workloads that compute aggregates, such as finding trends, computing average values, etc. Processing complex aggregates can be used in cases when logical records have multiple fields, but some of them (in this case, price quotes) have different importance and are often consumed together.
From a logical perspective, the data representing stock market price quotes can still be expressed as a table:
| ID | Symbol | Date        | Price     |
| 1  | DOW    | 08 Aug 2018 | 24,314.65 |
| 2  | DOW    | 09 Aug 2018 | 24,136.16 |
| 3  | S&P    | 08 Aug 2018 | 2,414.45  |
| 4  | S&P    | 09 Aug 2018 | 2,232.32  |
However, the physical column-based database layout looks entirely different. Values belonging to the same column are stored closely together:
Symbol: 1:DOW; 2:DOW; 3:S&P; 4:S&P
Date:   1:08 Aug 2018; 2:09 Aug 2018; 3:08 Aug 2018; 4:09 Aug 2018
Price:  1:24,314.65; 2:24,136.16; 3:2,414.45; 4:2,232.32
To reconstruct data tuples, which might be useful for joins, filtering, and multirow aggregates, we need to preserve some metadata on the column level to identify which data points from other columns each value is associated with. If you do this explicitly, each value will have to hold a key, which introduces duplication and increases the amount of stored data. Some column stores use implicit identifiers (virtual IDs) instead and use the position of the value (in other words, its offset) to map it back to the related values [ABADI13].
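As a minimal sketch of the virtual-ID idea, the stock quote table can be split into one Python list per column, with the position in each list acting as the implicit identifier. The names `columns` and `reconstruct` are made up for illustration, and real column stores keep each column in files or file segments rather than in-memory lists.

```python
# Row-wise view of the price quote table.
names = ("id", "symbol", "date", "price")
rows = [
    (1, "DOW", "08 Aug 2018", 24314.65),
    (2, "DOW", "09 Aug 2018", 24136.16),
    (3, "S&P", "08 Aug 2018", 2414.45),
    (4, "S&P", "09 Aug 2018", 2232.32),
]

# Column-wise view: one list per column, values kept in the same order.
columns = dict(zip(names, (list(column) for column in zip(*rows))))


def reconstruct(columns, offset):
    """Rebuild a tuple from the values stored at the same offset in every column."""
    return tuple(columns[name][offset] for name in names)


# An aggregate over one column touches only that column's values.
average_price = sum(columns["price"]) / len(columns["price"])

third_quote = reconstruct(columns, 2)
```

Because the offset doubles as the identifier, no key has to be stored next to each value; the trade-off is that every column must preserve the same ordering.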
During the last several years, likely due to a rising demand to run complex analytical queries over growing datasets, we’ve seen many new column-oriented file formats such as Apache Parquet, Apache ORC, RCFile, as well as column-oriented stores, such as Apache Kudu, ClickHouse, and many others [ROY12].
Distinctions and Optimizations
It is not sufficient to say that distinctions between row and column stores are only in the way the data is stored. Choosing the data layout is just one of the steps in a series of possible optimizations that columnar stores are targeting.
Reading multiple values for the same column in one run significantly improves cache utilization and computational efficiency. On modern CPUs, vectorized instructions can be used to process multiple data points with a single CPU instruction³ [DREPPER07].
Storing values that have the same data type together (e.g., numbers with other numbers, strings with other strings) offers a better compression ratio. We can use different compression algorithms depending on the data type and pick the most effective compression method for each case.
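To make the compression argument concrete, here is a small sketch using run-length encoding. The choice of run-length encoding is an assumption made for illustration (the text does not name a specific algorithm); it happens to suit columns with long runs of repeated values, such as the Symbol column above.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse each run of equal adjacent values into a (value, run_length) pair."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]


symbols = ["DOW", "DOW", "S&P", "S&P"]
encoded = rle_encode(symbols)
decoded = rle_decode(encoded)
```

In a row-oriented layout the symbols would be interleaved with dates and prices, so equal values would rarely be adjacent and this encoding would save little.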
To decide whether to use a column- or a row-oriented store, you need to understand your access patterns. If the read data is consumed in records (i.e., most or all of the columns are requested) and the workload consists mostly of point queries and range scans, the row-oriented approach is likely to yield better results. If scans span many rows, or compute aggregates over a subset of columns, it is worth considering a column-oriented approach.
Wide Column Stores
Column-oriented databases should not be mixed up with wide column stores, such as Bigtable or HBase, where data is represented as a multidimensional map, columns are grouped into column families (usually storing data of the same type), and inside each column family, data is stored row-wise. This layout is best for storing data retrieved by a key or a sequence of keys.
A canonical example from the Bigtable paper [CHANG06] is a Webtable. A Webtable stores snapshots of web page contents, their attributes, and the relations among them at a specific timestamp. Pages are identified by the reversed URL, and all attributes (such as page content and anchors, representing links between pages) are identified by the timestamps at which these snapshots were taken. In a simplified way, it can be represented as a nested map, as Figure 1-3 shows.
Figure 1-3. Conceptual structure of a Webtable
Data is stored in a multidimensional sorted map with hierarchical indexes: we can locate the data related to a specific web page by its reversed URL and its contents or anchors by the timestamp. Each row is indexed by its row key. Related columns are grouped together in column families (contents and anchor in this example), which are stored on disk separately. Each column inside a column family is identified by the column key, which is a combination of the column family name and a qualifier (html, cnnsi.com, my.look.ca in this example). Column families store multiple versions of data by timestamp. This layout allows us to quickly locate the higher-level entries (web pages, in this case) and their parameters (versions of content and links to the other pages).
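The conceptual structure can be written down as a nested Python dictionary: row key, then column family, then qualifier, then timestamp. This is only a sketch of the logical model; the row key, timestamps, and cell values are illustrative placeholders, `latest` is a hypothetical helper, and a plain dictionary does not keep keys sorted the way the real map does.

```python
# row key -> column family -> qualifier -> timestamp -> value
webtable = {
    "com.cnn.www": {
        "contents": {
            "html": {3: "<html>v1", 5: "<html>v2", 6: "<html>v3"},
        },
        "anchor": {
            "cnnsi.com": {9: "CNN"},
            "my.look.ca": {8: "CNN.com"},
        },
    },
}


def latest(table, row_key, family, qualifier):
    """Return (timestamp, value) for the most recent version of one cell."""
    versions = table[row_key][family][qualifier]
    timestamp = max(versions)
    return timestamp, versions[timestamp]


newest_contents = latest(webtable, "com.cnn.www", "contents", "html")
anchors = sorted(webtable["com.cnn.www"]["anchor"])
```

Locating a page is one lookup by row key; its versions and links are then reached through the column family and qualifier.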
While it is useful to understand the conceptual representation of wide column stores, their physical layout is somewhat different. A schematic representation of the data layout in column families is shown in Figure 1-4: column families are stored separately, but in each column family, the data belonging to the same key is stored together.
Figure 1-4. Physical structure of a Webtable
Data Files and Index Files
The primary goal of a database system is to store data and to allow quick access to it. But how is the data organized? Why do we need a database management system and not just a bunch of files? How does file organization improve efficiency?
Database systems do use files for storing the data, but instead of relying on filesystem hierarchies of directories and files for locating records, they compose files using implementation-specific formats. The main reasons to use specialized file organization over flat files are:
Storage efficiency
Files are organized in a way that minimizes storage overhead per stored data record.
Access efficiency
Records can be located in the smallest possible number of steps.
Update efficiency
Record updates are performed in a way that minimizes the number of changes on disk.
Database systems store data records, consisting of multiple fields, in tables, where each table is usually represented as a separate file. Each record in the table can be looked up using a search key. To locate a record, database systems use indexes: auxiliary data structures that allow them to efficiently locate data records without scanning an entire table on every access. Indexes are built using a subset of fields identifying the record.
A database system usually separates data files and index files: data files store data records, while index files store record metadata and use it to locate records in data files. Index files are typically smaller than the data files. Files are partitioned into pages, which typically have the size of a single or multiple disk blocks. Pages can be organized as sequences of records or as slotted pages (see “Slotted Pages”).
New records (insertions) and updates to the existing records are represented by key/value pairs. Most modern storage systems do not delete data from pages explicitly. Instead, they use deletion markers (also called tombstones), which contain deletion metadata, such as a key and a timestamp. Space occupied by the records shadowed by their updates or deletion markers is reclaimed during garbage collection, which reads the pages, writes the live (i.e., nonshadowed) records to the new place, and discards the shadowed ones.
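The garbage collection pass described above can be sketched as follows. This is a simplified illustration: pages are modeled as Python lists of `(key, value)` records in write order, write order stands in for timestamps, and the `TOMBSTONE` sentinel and the function name are hypothetical.

```python
TOMBSTONE = object()  # sentinel standing in for a deletion marker


def garbage_collect(pages):
    """Read pages oldest-first and return one new page holding only live records."""
    latest = {}
    for page in pages:
        for key, value in page:
            latest[key] = value  # a later record shadows any earlier one
    live = [(key, value) for key, value in latest.items() if value is not TOMBSTONE]
    return sorted(live, key=lambda record: record[0])


pages = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("b", TOMBSTONE), ("c", 4)],
]
compacted = garbage_collect(pages)
```

Here the first version of `a` is shadowed by its update and `b` is shadowed by its tombstone, so only the latest `a` and `c` are written to the new page.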
Data Files
Data files (sometimes called primary files) can be implemented as index-organized tables (IOT), heap-organized tables (heap files), or hash-organized tables (hashed files).
Records in heap files are not required to follow any particular order, and most of the time they are placed in a write order. This way, no additional work or file reorganization is required when new pages are appended. Heap files require additional index structures, pointing to the locations where data records are stored, to make them searchable.
In hashed files, records are stored in buckets, and the hash value of the key determines which bucket a record belongs to. Records in the bucket can be stored in append order or sorted by key to improve lookup speed.
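A hash-organized layout with key-sorted buckets might look like the sketch below. The bucket count, the helper names, and the use of CRC32 as the hash function are assumptions chosen for illustration (CRC32 is used only because it gives the same result on every run); buckets are in-memory lists rather than file segments.

```python
import bisect
import zlib

NUM_BUCKETS = 4  # hypothetical bucket count
buckets = [[] for _ in range(NUM_BUCKETS)]


def bucket_for(key):
    """The hash value of the key determines the bucket."""
    return zlib.crc32(key.encode("utf-8")) % NUM_BUCKETS


def insert(key, value):
    """Keep each bucket sorted by key; overwrite the record if the key exists."""
    bucket = buckets[bucket_for(key)]
    position = bisect.bisect_left(bucket, (key,))
    if position < len(bucket) and bucket[position][0] == key:
        bucket[position] = (key, value)
    else:
        bucket.insert(position, (key, value))


def lookup(key):
    """Binary-search the single bucket the key hashes to."""
    bucket = buckets[bucket_for(key)]
    position = bisect.bisect_left(bucket, (key,))
    if position < len(bucket) and bucket[position][0] == key:
        return bucket[position][1]
    return None


insert("john", "+1 111 222 333")
insert("sam", "+1 555 888 999")
insert("john", "+1 333 444 555")
```

Sorting inside the bucket is what allows the binary search; with append order, `lookup` would have to scan the bucket instead.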
Index-organized tables (IOTs) store data records in the index itself. Since records are stored in key order, range scans in IOTs can be implemented by sequentially scanning its contents.
Storing data records in the index allows us to reduce the number of disk seeks by at least one, since after traversing the index and locating the searched key, we do not have to address a separate file to find the associated data record.
When records are stored in a separate file, index files hold data entries, uniquely identifying data records and containing enough information to locate them in the data file. For example, we can store file offsets (sometimes called row locators), locations of data records in the data file, or bucket IDs in the case of hash files. In index-organized tables, data entries hold actual data records.
Index Files
An index is a structure that organizes data records on disk in a way that facilitates efficient retrieval operations. Index files are organized as specialized structures that map keys to locations in data files where the records identified by these keys (in the case of heap files) or primary keys (in the case of index-organized tables) are stored.
An index on a primary (data) file is called the primary index. However, in most cases we can also assume that the primary index is built over a primary key or a set of keys identified as primary. All other indexes are called secondary.
Secondary indexes can point directly to the data record, or simply store its primary key. A pointer to a data record can hold an offset to a heap file or an index-organized table. Multiple secondary indexes can point to the same record, allowing a single data record to be identified by different fields and located through different indexes. While primary index files hold a unique entry per search key, secondary indexes may hold several entries per search key [MOLINA08].
If the order of data records follows the search key order, this index is called clustered (also known as clustering). Data records in the clustered case are usually stored in the same file or in a clustered file, where the key order is preserved. If the data is stored in a separate file, and its order does not follow the key order, the index is called nonclustered (sometimes called unclustered).
Figure 1-5 shows the difference between the two approaches:
a) An index-organized table, where data records are stored directly in the index file.
b) An index file storing the offsets and a separate file storing data records.
Figure 1-5. Storing data records in an index file versus storing offsets to the data file (index segments shown in white; segments holding data records shown in gray)
NOTE
Index-organized tables store information in index order and are clustered by definition. Primary indexes are most often clustered. Secondary indexes are nonclustered by definition, since they’re used to facilitate access by keys other than the primary one. Clustered indexes can be either index-organized or have separate index and data files.
Many database systems have an inherent and explicit primary key, a set of columns that uniquely identify the database record. In cases when the primary key is not specified, the storage engine can create an implicit primary key (for example, MySQL InnoDB adds a new auto-increment column and fills in its values automatically).
This terminology is used in different kinds of database systems: relational database systems (such as MySQL and PostgreSQL), Dynamo-based NoSQL stores (such as Apache Cassandra and Riak), and document stores (such as MongoDB). There can be some project-specific naming, but most often there’s a clear mapping to this terminology.
Primary Index as an Indirection
There are different opinions in the database community on whether data records should be referenced directly (through file offset) or via the primary key index.⁴
Both approaches have their pros and cons and are better discussed in the scope of a complete implementation. By referencing data directly, we can reduce the number of disk seeks, but have to pay a cost of updating the pointers whenever the record is updated or relocated during a maintenance process. Using indirection in the form of a primary index allows us to reduce the cost of pointer updates, but has a higher cost on a read path.
Updating just a couple of indexes might work if the workload mostly consists of reads, but this approach does not work well for write-heavy workloads with multiple indexes. To reduce the costs of pointer updates, instead of payload offsets, some implementations use primary keys for indirection. For example, MySQL InnoDB uses a primary index and performs two lookups: one in the secondary index, and one in a primary index when performing a query [TARIQ11]. This adds an overhead of a primary index lookup instead of following the offset directly from the secondary index.
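The trade-off can be illustrated with plain dictionaries standing in for the data file and the indexes. All names here are hypothetical and the structures are deliberately simplified: a secondary index on the name field either stores record offsets directly or stores primary keys and goes through the primary index.

```python
heap = {0: (10, "John"), 1: (20, "Sam")}  # offset -> record (id, name)
primary_index = {10: 0, 20: 1}            # primary key -> offset
by_name_direct = {"John": 0, "Sam": 1}    # secondary key -> offset
by_name_via_pk = {"John": 10, "Sam": 20}  # secondary key -> primary key


def lookup_direct(name):
    # One index lookup, then straight to the record.
    return heap[by_name_direct[name]]


def lookup_via_primary(name):
    # Two index lookups: secondary index first, then the primary index.
    return heap[primary_index[by_name_via_pk[name]]]


def relocate(primary_key, new_offset):
    """Move a record, as a maintenance process might, and fix up the pointers."""
    old_offset = primary_index[primary_key]
    heap[new_offset] = heap.pop(old_offset)
    primary_index[primary_key] = new_offset
    # Every index that stores raw offsets has to be rewritten as well.
    for name, offset in by_name_direct.items():
        if offset == old_offset:
            by_name_direct[name] = new_offset
    # by_name_via_pk needs no change: the primary key did not move.


relocate(10, 5)
```

After the relocation, the offset-based index had to be touched, while the index that stores primary keys stayed as it was; with many secondary indexes, that difference is repeated for each of them.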
Figure 1-6 shows how the two approaches are different:
a) Two indexes reference data entries directly from secondary index files.
b) A secondary index goes through the indirection layer of a primary index to locate the data entries.
Figure 1-6. Referencing data tuples directly (a) versus using a primary index as indirection (b)
It is also possible to use a hybrid approach and store both data file offsets and primary keys. First, you check if the data offset is still valid and pay the extra cost of going through the primary key index if it has changed, updating the index file after finding a new offset.
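A rough sketch of the hybrid lookup, again with dictionaries in place of files: each secondary index entry carries both a cached offset and the primary key. The validity check used here (comparing the primary key of whatever record sits at the cached offset) is an assumption made for illustration; an actual implementation may detect stale offsets differently.

```python
# The record with primary key 10 was relocated from offset 0 to offset 7.
heap = {7: {"id": 10, "name": "John"}}         # offset -> record
primary_index = {10: 7}                        # primary key -> offset
secondary = {"John": {"offset": 0, "pk": 10}}  # cached offset is now stale


def lookup(name):
    entry = secondary[name]
    record = heap.get(entry["offset"])
    if record is None or record["id"] != entry["pk"]:
        # Stale offset: pay for the primary index lookup, then repair the entry.
        entry["offset"] = primary_index[entry["pk"]]
        record = heap[entry["offset"]]
    return record


found = lookup("John")
```

The first lookup after a relocation pays for the extra primary index access; later lookups follow the repaired offset directly.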
Buffering, Immutability, and Ordering
A storage engine is based on some data structure. However, these structures do not describe the semantics of caching, recovery, transactionality, and other things that storage engines add on top of them.
In the next chapters, we will start the discussion with B-Trees (see “Ubiquitous B-Trees”) and try to understand why there are so many B-Tree variants, and why new database storage structures keep emerging.
Storage structures have three common variables: they use buffering (or avoid using it), use immutable (or mutable) files, and store values in order (or out of order). Most of the distinctions and optimizations in storage structures discussed in this book are related to one of these three concepts.
Buffering
This defines whether or not the storage structure chooses to collect a certain amount of data in memory before putting it on disk. Of course, every on-disk structure has to use buffering to some degree, since the smallest unit of data transfer to and from the disk is a block, and it is desirable to write full blocks. Here, we’re talking about avoidable buffering, something storage engine implementers choose to do. One of the first optimizations we discuss in this book is adding in-memory buffers to B-Tree nodes to amortize I/O costs (see “Lazy B-Trees”). However, this is not the only way we can apply buffering. For example, two-component LSM Trees (see “Two-component LSM Tree”), despite their similarities with B-Trees, use buffering in an entirely different way, and combine buffering with immutability.
Mutability (or immutability)
This defines whether or not the storage structure reads parts of the file, updates them, and writes the updated results at the same location in the file. Immutable structures are append-only: once written, file contents are not modified. Instead, modifications are appended to the end of the file. There are other ways to implement immutability. One of them is copy-on-write (see “Copy-on-Write”), where the modified page, holding the updated version of the record, is written to the new location in the file, instead of its original location. Often the distinction between LSM and B-Trees is drawn as immutable against in-place update storage, but there are structures (for example, “Bw-Trees”) that are inspired by B-Trees but are immutable.
Ordering
This is defined as whether or not the data records are stored in the key order in the pages on disk. In other words, the keys that sort closely are stored in contiguous segments on disk. Ordering often defines whether or not we can efficiently scan the range of records, not only locate the individual data records. Storing data out of order (most often, in insertion order) opens up some write-time optimizations. For example, Bitcask (see “Bitcask”) and WiscKey (see “WiscKey”) store data records directly in append-only files.
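To see how immutability and insertion ordering combine, consider this minimal append-only store. It is a generic sketch and not the actual format of Bitcask or WiscKey: the class name and the `key=value` record format are made up, an in-memory byte buffer stands in for the file, and keys are assumed not to contain `=` or newlines.

```python
import io


class AppendOnlyLog:
    """Records are only ever appended; an in-memory map tracks the latest offsets."""

    def __init__(self):
        self.file = io.BytesIO()  # stands in for an append-only file on disk
        self.key_offsets = {}     # key -> (offset, length) of the newest record

    def put(self, key, value):
        record = f"{key}={value}\n".encode("utf-8")
        offset = self.file.seek(0, io.SEEK_END)
        self.file.write(record)   # earlier versions are never overwritten
        self.key_offsets[key] = (offset, len(record))

    def get(self, key):
        offset, length = self.key_offsets[key]
        self.file.seek(offset)
        line = self.file.read(length).decode("utf-8")
        return line.rstrip("\n").split("=", 1)[1]


log = AppendOnlyLog()
log.put("k1", "v1")
log.put("k2", "v2")
log.put("k1", "v3")  # the update is appended; the old record stays in the file
```

Writes are sequential and need no reorganization, but records sit in insertion order, so scanning a key range means looking up each key separately, and the shadowed `k1` record occupies space until garbage collection removes it.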
Of course, a brief discussion of these three concepts is not enough to show their power, and we’ll continue this discussion throughout the rest of the book.
Summary
In this chapter, we’ve discussed the architecture of a database management system and covered its primary components.
To highlight the importance of disk-based structures and their difference from in-memory ones, we discussed memory- and disk-based stores. We came to the conclusion that disk-based structures are important for both types of stores, but are used for different purposes.
To understand how access patterns influence database system design, we discussed column- and row-oriented database management systems and the primary factors that set them apart from each other. To start a conversation about how the data is stored, we covered data and index files.
Lastly, we introduced three core concepts: buffering, immutability, and ordering. We will use them throughout this book to highlight properties of the storage engines that use them.
FURTHER READING
If you’d like to learn more about the concepts mentioned in this chapter, you can refer to the following sources:
Database architecture
Hellerstein, Joseph M., Michael Stonebraker, and James Hamilton. 2007. “Architecture of a Database System.” Foundations and Trends in Databases 1, no. 2 (February): 141-259. https://doi.org/10.1561/1900000002.
Column-oriented DBMS
Abadi, Daniel, Peter Boncz, Stavros Harizopoulos, Stratos Idreos, and Samuel Madden. 2013. The Design and Implementation of Modern Column-Oriented Database Systems. Hanover, MA: Now Publishers Inc.
In-memory DBMS
Faerber, Franz, Alfons Kemper, and Per-Åke Larson. 2017. Main Memory Database Systems. Hanover, MA: Now Publishers Inc.
1 You can find a visualization and comparison of disk, memory access latencies, and many other relevant numbers over the years at https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html.
2 Spatial locality is one of the Principles of Locality, stating that if a memory location is accessed, its nearby memory locations will be accessed in the near future.
3 Vectorized instructions, or Single Instruction Multiple Data (SIMD), describes a class of CPU instructions that perform the same operation on multiple data points.
4 The original post that has stirred up the discussion was controversial and one-sided, but you can refer to the presentation comparing MySQL and PostgreSQL index and storage formats, which references the original source as well.