Database Internals


Database Internals: A Deep Dive into How Distributed Data Systems Work

Alex Petrov


Database Internals by Alex Petrov

Copyright © 2019 Oleksandr Petrov. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: [email protected].

Acquisitions Editor: Mike Loukides

Development Editor: Michele Cronin

Production Editor: Christopher Faucher

Copyeditor: Kim Cofer

Proofreader: Sonia Saruba

Indexer: Judith McConville

Interior Designer: David Futato

Cover Designer: Karen Montgomery

Illustrator: Rebecca Demarest

October 2019: First Edition

Revision History for the First Edition

2019-09-12: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781492040347 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Database Internals, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the author, and do not represent the publisher’s views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-492-04034-7

[MBP]


Dedication

To Pieter Hintjens, from whom I got my first ever signed book: an inspiring distributed systems programmer, author, philosopher, and friend.


Preface

Distributed database systems are an integral part of most businesses and the vast majority of software applications. These applications provide logic and a user interface, while database systems take care of data integrity, consistency, and redundancy.

Back in 2000, if you were to choose a database, you would have just a few options, and most of them would be within the realm of relational databases, so differences between them would be relatively small. Of course, this does not mean that all databases were completely the same, but their functionality and use cases were very similar.

Some of these databases have focused on horizontal scaling (scaling out), that is, improving performance and increasing capacity by running multiple database instances acting as a single logical unit: Gamma Database Machine Project, Teradata, Greenplum, Parallel DB2, and many others. Today, horizontal scaling remains one of the most important properties that customers expect from databases. This can be explained by the rising popularity of cloud-based services. It is often easier to spin up a new instance and add it to the cluster than to scale vertically (scaling up) by moving the database to a larger, more powerful machine. Migrations can be long and painful, potentially incurring downtime.

Around 2010, a new class of eventually consistent databases started appearing, and terms such as NoSQL, and later, big data grew in popularity. Over the last 15 years, the open source community, large internet companies, and database vendors have created so many databases and tools that it’s easy to get lost trying to understand use cases, details, and specifics.

The Dynamo paper [DECANDIA07], published by the team at Amazon in 2007, had so much impact on the database community that within a short period it inspired many variants and implementations. The most prominent of them were Apache Cassandra, created at Facebook; Project Voldemort, created at LinkedIn; and Riak, created by former Akamai engineers.

Today, the field is changing again: after the time of key-value stores, NoSQL, and eventual consistency, we have started seeing more scalable and performant databases, able to execute complex queries with stronger consistency guarantees.

Audience of This Book

In conversations at technical conferences, I often hear the same question: “How can I learn more about database internals? I don’t even know where to start.” Most of the books on database systems do not go into details of storage engine implementation, and cover the access methods, such as B-Trees, on a rather high level. There are very few books that cover more recent concepts, such as different B-Tree variants and log-structured storage, so I usually recommend reading papers.

Everyone who reads papers knows that it’s not that easy: you often lack context, the wording might be ambiguous, there’s little or no connection between papers, and they’re hard to find. This book contains concise summaries of important database systems concepts and can serve as a guide for those who’d like to dig in deeper, or as a cheat sheet for those already familiar with these concepts.

Not everyone wants to become a database developer, but this book will help people who build software that uses database systems: software developers, reliability engineers, architects, and engineering managers.

If your company depends on any infrastructure component, be it a database, a messaging queue, a container platform, or a task scheduler, you have to read the project changelogs and mailing lists to stay in touch with the community and be up-to-date with the most recent happenings in the project. Understanding terminology and knowing what’s inside will enable you to extract more information from these sources and use your tools more productively to troubleshoot, identify, and avoid potential risks and bottlenecks. Having an overview and a general understanding of how database systems work will help in case something goes wrong. Using this knowledge, you’ll be able to form a hypothesis, validate it, find the root cause, and present it to other project maintainers.

This book is also for curious minds: for the people who like learning things without immediate necessity, those who spend their free time hacking on something fun, creating compilers, writing homegrown operating systems, text editors, computer games, learning programming languages, and absorbing new information.


The reader is assumed to have some experience with developing backend systems and working with database systems as a user. Having some prior knowledge of different data structures will help to digest material faster.

Why Should I Read This Book?

We often hear people describing database systems in terms of the concepts and algorithms they implement: “This database uses gossip for membership propagation” (see Chapter 12), “They have implemented Dynamo,” or “This is just like what they’ve described in the Spanner paper” (see Chapter 13). Or, if you’re discussing the algorithms and data structures, you can hear something like “ZAB and Raft have a lot in common” (see Chapter 14), “Bw-Trees are like the B-Trees implemented on top of log-structured storage” (see Chapter 6), or “They are using sibling pointers like in B+-Trees” (see Chapter 5).

We need abstractions to discuss complex concepts, and we can’t have a discussion about terminology every time we start a conversation. Having shortcuts in the form of common language helps us to move our attention to other, higher-level problems.

One of the advantages of learning the fundamental concepts, proofs, and algorithms is that they never grow old. Of course, there will always be new ones, but new algorithms are often created after finding a flaw or room for improvement in a classical one. Knowing the history helps to understand differences and motivation better.

Learning about these things is inspiring. You see the variety of algorithms, see how our industry was solving one problem after the other, and get to appreciate that work. At the same time, learning is rewarding: you can almost feel how multiple puzzle pieces move together in your mind to form a full picture that you will always be able to share with others.

Scope of This Book

This is neither a book about relational database management systems nor about NoSQL ones, but about the algorithms and concepts used in all kinds of database systems, with a focus on a storage engine and the components responsible for distribution.

Some concepts, such as query planning, query optimization, scheduling, the relational model, and a few others, are already covered in several great textbooks on database systems. Some of these concepts are usually described from the user’s perspective, but this book concentrates on the internals. You can find some pointers to useful literature in the Part II Conclusion and in the chapter summaries. In these books you’re likely to find answers to many database-related questions you might have.

Query languages aren’t discussed, since there’s no single common language among the database systems mentioned in this book.

To collect material for this book, I studied over 15 books, more than 300 papers, countless blog posts, source code, and the documentation for several open source databases. The rule of thumb for whether or not to include a particular concept in the book was the question: “Do the people in the database industry and research circles talk about this concept?” If the answer was “yes,” I added the concept to the long list of things to discuss.

Structure of This Book

There are some examples of extensible databases with pluggable components (such as [SCHWARZ86]), but they are rather rare. At the same time, there are plenty of examples where databases use pluggable storage. Similarly, we rarely hear database vendors talking about query execution, while they are very eager to discuss the ways their databases preserve consistency.

The most significant distinctions between database systems are concentrated around two aspects: how they store and how they distribute the data. (Other subsystems can at times also be of importance, but are not covered here.) The book is arranged into parts that discuss the subsystems and components responsible for storage (Part I) and distribution (Part II).

Part I discusses node-local processes and focuses on the storage engine, the central component of the database system and one of the most significant distinctive factors. First, we start with the architecture of a database management system and present several ways to classify database systems based on the primary storage medium and layout.

We continue with storage structures and try to understand how disk-based structures are different from in-memory ones, introduce B-Trees, and cover algorithms for efficiently maintaining B-Tree structures on disk, including serialization, page layout, and on-disk representations. Later, we discuss multiple variants to illustrate the power of this concept and the diversity of data structures influenced and inspired by B-Trees.

Last, we discuss several variants of log-structured storage, commonly used for implementing file and storage systems, along with the motivation and reasons to use them.

Part II is about how to organize multiple nodes into a database cluster. We start with the importance of understanding the theoretical concepts for building fault-tolerant distributed systems, how distributed systems are different from single-node applications, and which problems, constraints, and complications we face in a distributed environment.

After that, we dive deep into distributed algorithms. Here, we start with algorithms for failure detection, helping to improve performance and stability by noticing and reporting failures and avoiding the failed nodes. Since many algorithms discussed later in the book rely on understanding the concept of leadership, we introduce several algorithms for leader election and discuss their suitability.

As one of the most difficult things in distributed systems is achieving data consistency, we discuss concepts of replication, followed by consistency models, possible divergence between replicas, and eventual consistency. Since eventually consistent systems sometimes rely on anti-entropy for convergence and gossip for data dissemination, we discuss several anti-entropy and gossip approaches. Finally, we discuss logical consistency in the context of database transactions, and finish with consensus algorithms.

It would’ve been impossible to write this book without all the research and publications. You will find many references to papers and publications in the text, in square brackets with monospace font; for example, [DECANDIA07]. You can use these references to learn more about related concepts in more detail.

After each chapter, you will find a summary section that contains material for further study, related to the content of the chapter.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic

Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width

Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

TIP

This element signifies a tip or suggestion.

NOTE

This element signifies a general note.

WARNING

This element indicates a warning or caution.

Using Code Examples

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Database Internals by Alex Petrov (O’Reilly). Copyright 2019 Oleksandr Petrov, 978-1-492-04034-7.”

If you feel your use of code examples falls outside fair use or the permission given above, [email protected].


O’Reilly Online Learning

NOTE

For almost 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.

Our unique network of experts and innovators share their knowledge and expertise through books, articles, conferences, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, please visit http://oreilly.com.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.

1005 Gravenstein Highway North

Sebastopol, CA 95472

800-998-9938 (in the United States or Canada)

707-829-0515 (international or local)

707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://bit.ly/database-internals.

To comment or ask technical questions about this book, [email protected].


For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.

Find us on Facebook: http://facebook.com/oreilly

Follow us on Twitter: http://twitter.com/oreillymedia

Watch us on YouTube: http://www.youtube.com/oreillymedia

Acknowledgments

This book wouldn’t have been possible without the hundreds of people who have worked hard on research papers and books, which have been a source of ideas and inspiration, and have served as references for this book.

I’d like to say thank you to all the people who reviewed manuscripts and provided feedback, making sure that the material in this book is correct and the wording is precise: Dmitry Alimov, Peter Alvaro, Carlos Baquero, Jason Brown, Blake Eggleston, Marcus Eriksson, Francisco Fernández Castaño, Heidi Howard, Vaidehi Joshi, Maximilian Karasz, Stas Kelvich, Michael Klishin, Predrag Knežević, Joel Knighton, Eugene Lazin, Nate McCall, Christopher Meiklejohn, Tyler Neely, Maxim Neverov, Marina Petrova, Stefan Podkowinski, Edward Ribiero, Denis Rytsov, Kir Shatrov, Alex Sorokoumov, Massimiliano Tomassi, and Ariel Weisberg.

Of course, this book wouldn’t have been possible without support from my family: my wife Marina and my daughter Alexandra, who have supported me every step of the way.


Part I. Storage Engines

The primary job of any database management system is reliably storing data and making it available for users. We use databases as a primary source of data, helping us to share it between the different parts of our applications. Instead of finding a way to store and retrieve information and inventing a new way to organize data every time we create a new app, we use databases. This way we can concentrate on application logic instead of infrastructure.

Since the term database management system (DBMS) is quite bulky, throughout this book we use more compact terms, database system and database, to refer to the same concept.

Databases are modular systems and consist of multiple parts: a transport layer accepting requests, a query processor determining the most efficient way to run queries, an execution engine carrying out the operations, and a storage engine (see “DBMS Architecture”).

The storage engine (or database engine) is a software component of a database management system responsible for storing, retrieving, and managing data in memory and on disk, designed to capture a persistent, long-term memory of each node [REED78]. While databases can respond to complex queries, storage engines look at the data more granularly and offer a simple data manipulation API, allowing users to create, update, delete, and retrieve records. One way to look at this is that database management systems are applications built on top of storage engines, offering a schema, a query language, indexing, transactions, and many other useful features.

For flexibility, both keys and values can be arbitrary sequences of bytes with no prescribed form. Their sorting and representation semantics are defined in higher-level subsystems. For example, you can use int32 (32-bit integer) as a key in one of the tables, and ascii (ASCII string) in the other; from the storage engine perspective both keys are just serialized entries.
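To make the record-level API and the opaque-bytes idea more concrete, here is a minimal Python sketch. It assumes a hypothetical in-memory engine: the class ToyStorageEngine and the helpers encode_int32_key and encode_ascii_key are illustrative names, not part of any real storage engine. The point is only that the engine stores and returns bytes, while a higher layer decides how typed keys are serialized.

```python
import struct
from typing import Dict, Optional


class ToyStorageEngine:
    """Hypothetical in-memory engine: keys and values are opaque bytes."""

    def __init__(self) -> None:
        self._records: Dict[bytes, bytes] = {}

    def put(self, key: bytes, value: bytes) -> None:
        # Create or update a record; the engine never interprets the bytes.
        self._records[key] = value

    def get(self, key: bytes) -> Optional[bytes]:
        # Retrieve a record, or None if the key is absent.
        return self._records.get(key)

    def delete(self, key: bytes) -> None:
        # Deleting a missing key is a no-op in this sketch.
        self._records.pop(key, None)


# Higher-level subsystems decide how typed keys become bytes.
def encode_int32_key(key: int) -> bytes:
    return struct.pack(">i", key)  # 4-byte big-endian signed integer


def encode_ascii_key(key: str) -> bytes:
    return key.encode("ascii")


engine = ToyStorageEngine()
engine.put(encode_int32_key(42), b"row for user 42")
engine.put(encode_ascii_key("alice"), b"row for alice")
```

Both tables end up as byte-string keys inside the same engine; any ordering or type semantics would have to be supplied by the layer that chose the encoding.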


Storage engines such as BerkeleyDB, LevelDB and its descendant RocksDB, LMDB and its descendant libmdbx, Sophia, HaloDB, and many others were developed independently from the database management systems they’re now embedded into. Using pluggable storage engines has enabled database developers to bootstrap database systems using existing storage engines, and concentrate on the other subsystems.

At the same time, clear separation between database system components opens up an opportunity to switch between different engines, potentially better suited for particular use cases. For example, MySQL, a popular database management system, has several storage engines, including InnoDB, MyISAM, and RocksDB (in the MyRocks distribution). MongoDB allows switching between WiredTiger, In-Memory, and the (now-deprecated) MMAPv1 storage engines.
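The idea of pluggable engines can be sketched in a few lines of Python. Everything here is hypothetical: the Engine protocol, the two toy engines, and the ENGINES registry are made up for illustration and do not reflect how MySQL or MongoDB actually select their engines. The sketch only shows that, once the interface is fixed, the rest of the database can stay unchanged when the engine is swapped.

```python
from typing import Callable, Dict, List, Optional, Protocol, Tuple


class Engine(Protocol):
    """Minimal interface every pluggable engine in this sketch must offer."""

    def put(self, key: bytes, value: bytes) -> None: ...

    def get(self, key: bytes) -> Optional[bytes]: ...


class HashEngine:
    """Keeps the latest value per key in a hash map."""

    def __init__(self) -> None:
        self._data: Dict[bytes, bytes] = {}

    def put(self, key: bytes, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: bytes) -> Optional[bytes]:
        return self._data.get(key)


class LogEngine:
    """Appends every write to a log; the most recent entry wins on read."""

    def __init__(self) -> None:
        self._log: List[Tuple[bytes, bytes]] = []

    def put(self, key: bytes, value: bytes) -> None:
        self._log.append((key, value))

    def get(self, key: bytes) -> Optional[bytes]:
        for logged_key, value in reversed(self._log):
            if logged_key == key:
                return value
        return None


# Hypothetical registry: the engine is chosen by name at startup.
ENGINES: Dict[str, Callable[[], Engine]] = {"hash": HashEngine, "log": LogEngine}


class Database:
    """Higher-level subsystems talk only to the Engine interface."""

    def __init__(self, engine_name: str) -> None:
        self._engine = ENGINES[engine_name]()

    def set(self, key: bytes, value: bytes) -> None:
        self._engine.put(key, value)

    def get(self, key: bytes) -> Optional[bytes]:
        return self._engine.get(key)
```

The two engines make different trade-offs (cheap reads versus cheap appends), yet Database("hash") and Database("log") behave identically from the caller's point of view.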


Comparing Databases

Your choice of database system may have long-term consequences. If there’s a chance that a database is not a good fit because of performance problems, consistency issues, or operational challenges, it is better to find out about it earlier in the development cycle, since it can be nontrivial to migrate to a different system. In some cases, it may require substantial changes in the application code.

Every database system has strengths and weaknesses. To reduce the risk of an expensive migration, you can invest some time before you decide on a specific database to build confidence in its ability to meet your application’s needs.

Trying to compare databases based on their components (e.g., which storage engine they use, how the data is shared, replicated, and distributed, etc.), their rank (an arbitrary popularity value assigned by consultancy agencies such as ThoughtWorks or database comparison websites such as DB-Engines or Database of Databases), or implementation language (C++, Java, or Go, etc.) can lead to invalid and premature conclusions. These methods can be used only for a high-level comparison and can be as coarse as choosing between HBase and SQLite, so even a superficial understanding of how each database works and what’s inside it can help you land a more weighted conclusion.

Every comparison should start by clearly defining the goal, because even the slightest bias may completely invalidate the entire investigation. If you’re searching for a database that would be a good fit for the workloads you have (or are planning to facilitate), the best thing you can do is to simulate these workloads against different database systems, measure the performance metrics that are important for you, and compare results. Some issues, especially when it comes to performance and scalability, start showing only after some time or as the capacity grows. To detect potential problems, it is best to have long-running tests in an environment that simulates the real-world production setup as closely as possible.

Simulating real-world workloads not only helps you understand how the database performs, but also helps you learn how to operate, debug, and find out how friendly and helpful its community is. Database choice is always a combination of these factors, and performance often turns out not to be the most important aspect: it’s usually much better to use a database that slowly saves the data than one that quickly loses it.

To compare databases, it’s helpful to understand the use case in great detail and define the current and anticipated variables, such as:

Schema and record sizes

Number of clients

Types of queries and access patterns

Rates of the read and write queries

Expected changes in any of these variables

Knowing these variables can help to answer the following questions:

Does the database support the required queries?

Is this database able to handle the amount of data we’re planning to store?

How many read and write operations can a single node handle?

How many nodes should the system have?

How do we expand the cluster given the expected growth rate?

What is the maintenance process?
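The variables and questions above can be turned into rough arithmetic before any test cluster exists. The sketch below is a back-of-the-envelope estimate under stated assumptions, not a sizing method from this book: it assumes that reads are served by one replica, that each write is applied on every replica, that load adds up linearly, and that some headroom is kept free. The function names, parameters, and numbers are all hypothetical; per-node capacities should come from your own measurements.

```python
import math


def project_rate(current_rate: float, monthly_growth: float, months: int) -> float:
    """Compound a request rate forward, e.g. monthly_growth=0.05 for 5% per month."""
    return current_rate * (1.0 + monthly_growth) ** months


def estimate_node_count(
    read_ops_per_sec: float,
    write_ops_per_sec: float,
    node_read_capacity: float,
    node_write_capacity: float,
    replication_factor: int = 3,
    headroom: float = 0.25,
) -> int:
    """Rough node count under the linear-load assumptions stated above."""
    # Fraction of "one node's worth" of work needed for each kind of operation.
    read_load = read_ops_per_sec / node_read_capacity
    write_load = write_ops_per_sec * replication_factor / node_write_capacity
    # Keep part of every node idle to absorb spikes and maintenance work.
    usable_fraction = 1.0 - headroom
    needed = math.ceil((read_load + write_load) / usable_fraction)
    # A cluster cannot hold N replicas on fewer than N nodes.
    return max(replication_factor, needed)


# Hypothetical workload: 20,000 reads/s and 5,000 writes/s today.
nodes_today = estimate_node_count(20_000, 5_000, 10_000, 5_000)
```

Re-running the estimate with project_rate applied to both rates gives a first answer to the cluster-expansion question, which real load tests should then confirm or correct.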

Having these questions answered, you can construct a test cluster and simulate your workloads. Most databases already have stress tools that can be used to reconstruct specific use cases. If there’s no standard stress tool to generate realistic randomized workloads in the database ecosystem, it might be a red flag. If something prevents you from using default tools, you can try one of the existing general-purpose tools, or implement one from scratch.
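If you do end up implementing a stress tool from scratch, its core loop can be quite small. The following Python sketch is only an illustration of the shape of such a tool: it drives a plain dict standing in for a database client (an assumption made so that the example is self-contained), and the function name run_workload and its parameters are hypothetical. A real tool would issue requests over the network from many concurrent clients and run for much longer.

```python
import random
import time
from typing import Dict, List


def run_workload(store: Dict[bytes, bytes], operations: int,
                 read_ratio: float, key_space: int, seed: int = 0) -> Dict[str, float]:
    """Run a randomized read/write mix and return simple timing statistics."""
    if operations <= 0:
        raise ValueError("operations must be positive")
    rng = random.Random(seed)  # fixed seed makes the generated workload repeatable
    latencies: List[float] = []
    reads = 0
    started = time.perf_counter()
    for _ in range(operations):
        key = str(rng.randrange(key_space)).encode("ascii")
        op_started = time.perf_counter()
        if rng.random() < read_ratio:
            store.get(key)  # read path
            reads += 1
        else:
            store[key] = rng.getrandbits(64 * 8).to_bytes(64, "big")  # write path
        latencies.append(time.perf_counter() - op_started)
    elapsed = time.perf_counter() - started
    latencies.sort()
    p99_index = min(len(latencies) - 1, int(0.99 * len(latencies)))
    return {
        "operations": float(operations),
        "reads": float(reads),
        "ops_per_second": operations / elapsed if elapsed > 0 else float("inf"),
        "p50_seconds": latencies[len(latencies) // 2],
        "p99_seconds": latencies[p99_index],
    }


# 80% reads, 20% writes over 1,000 distinct keys.
stats = run_workload({}, operations=10_000, read_ratio=0.8, key_space=1_000)
```

Reporting percentiles rather than only an average matters because tail latency is often where performance problems first become visible.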

If the tests show positive results, it may be helpful to familiarize yourself with the database code. Looking at the code, it is often useful to first understand the parts of the database, how to find the code for different components, and then navigate through those. Having even a rough idea about the database codebase helps you better understand the log records it produces, its configuration parameters, and helps you find issues in the application that uses it and even in the database code itself.

It’d be great if we could use databases as black boxes and never have to take a look inside them, but practice shows that sooner or later a bug, an outage, a performance regression, or some other problem pops up, and it’s better to be prepared for it. If you know and understand database internals, you can reduce business risks and improve chances for a quick recovery.

One of the popular tools used for benchmarking, performance evaluation, and comparison is Yahoo! Cloud Serving Benchmark (YCSB). YCSB offers a framework and a common set of workloads that can be applied to different data stores. Just like anything generic, this tool should be used with caution, since it’s easy to draw wrong conclusions. To make a fair comparison and an educated decision, it is necessary to invest enough time to understand the real-world conditions under which the database has to perform, and tailor benchmarks accordingly.

TPC-C BENCHMARK

The Transaction Processing Performance Council (TPC) has a set of benchmarks that database vendors use for comparing and advertising performance of their products. TPC-C is an online transaction processing (OLTP) benchmark, a mixture of read-only and update transactions that simulate common application workloads.

This benchmark concerns itself with the performance and correctness of executed concurrent transactions. The main performance indicator is throughput: the number of transactions the database system is able to process per minute. Executed transactions are required to preserve ACID properties and conform to the set of properties defined by the benchmark itself.

This benchmark does not concentrate on any particular business segment, but provides an abstract set of actions important for most of the applications for which OLTP databases are suitable. It includes several tables and entities such as warehouses, stock (inventory), customers and orders, specifying table layouts, details of transactions that can be performed against these tables, the minimum number of rows per table, and data durability constraints.

This doesn’t mean that benchmarks can be used only to compare databases. Benchmarks can be useful for defining and testing details of the service-level agreement,¹ understanding system requirements, capacity planning, and more. The more knowledge you have about the database before using it, the more time you’ll save when running it in production.

Choosing a database is a long-term decision, and it’s best to keep track of newly released versions, understand what exactly has changed and why, and have an upgrade strategy. New releases usually contain improvements and fixes for bugs and security issues, but may introduce new bugs, performance regressions, or unexpected behavior, so testing new versions before rolling them out is also critical. Checking how database implementers were handling upgrades previously might give you a good idea about what to expect in the future. Past smooth upgrades do not guarantee that future ones will be as smooth, but complicated upgrades in the past might be a sign that future ones won’t be easy, either.


Understanding Trade-Offs

As users, we can see how databases behave under different conditions, but when working on databases, we have to make choices that influence this behavior directly.

Designing a storage engine is definitely more complicated than just implementing a textbook data structure: there are many details and edge cases that are hard to get right from the start. We need to design the physical data layout and organize pointers, decide on the serialization format, understand how data is going to be garbage-collected, how the storage engine fits into the semantics of the database system as a whole, figure out how to make it work in a concurrent environment, and, finally, make sure we never lose any data, under any circumstances.

Not only are there many things to decide upon, but most of these decisions involve trade-offs. For example, if we save records in the order they were inserted into the database, we can store them quicker, but if we retrieve them in their lexicographical order, we have to re-sort them before returning results to the client. As you will see throughout this book, there are many different approaches to storage engine design, and every implementation has its own upsides and downsides.
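The insertion-order example can be illustrated with two toy stores. This is a simplified sketch, assuming in-memory Python lists rather than on-disk structures, and the class names are hypothetical. It is meant only to show where each design pays its cost: one on ordered reads, the other on writes.

```python
import bisect
from typing import List


class InsertionOrderStore:
    """Appends are cheap, but ordered reads must sort first."""

    def __init__(self) -> None:
        self._keys: List[bytes] = []

    def insert(self, key: bytes) -> None:
        self._keys.append(key)  # amortized O(1)

    def scan_sorted(self) -> List[bytes]:
        return sorted(self._keys)  # O(n log n) on every ordered read


class SortedStore:
    """Keeps keys ordered on write, so ordered reads are a plain copy."""

    def __init__(self) -> None:
        self._keys: List[bytes] = []

    def insert(self, key: bytes) -> None:
        bisect.insort(self._keys, key)  # O(n): later elements must shift

    def scan_sorted(self) -> List[bytes]:
        return list(self._keys)  # already in lexicographical order


fast_writes = InsertionOrderStore()
fast_scans = SortedStore()
for key in (b"pear", b"apple", b"fig"):
    fast_writes.insert(key)
    fast_scans.insert(key)
```

Both stores return the same ordered result; which one is preferable depends on whether the workload is dominated by writes or by ordered scans.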

When looking at different storage engines, we discuss their benefits and shortcomings. If there were an absolutely optimal storage engine for every conceivable use case, everyone would just use it. But since it does not exist, we need to choose wisely, based on the workloads and use cases we’re trying to facilitate.

There are many storage engines, using all sorts of data structures, implemented in different languages, ranging from low-level ones, such as C, to high-level ones, such as Java. All storage engines face the same challenges and constraints. To draw a parallel with city planning, it is possible to build a city for a specific population and choose to build up or build out. In both cases, the same number of people will fit into the city, but these approaches lead to radically different lifestyles. When building the city up, people live in apartments and population density is likely to lead to more traffic in a smaller area; in a more spread-out city, people are more likely to live in houses, but commuting will require covering larger distances.


Similarly, design decisions made by storage engine developers make them better suited for different things: some are optimized for low read or write latency, some try to maximize density (the amount of stored data per node), and some concentrate on operational simplicity.

You can find complete algorithms that can be used for the implementation and other additional references in the chapter summaries. Reading this book should make you well equipped to work productively with these sources and give you a solid understanding of the existing alternatives to concepts described there.

¹ The service-level agreement (or SLA) is a commitment by the service provider about the quality of provided services. Among other things, the SLA can include information about latency, throughput, jitter, and the number and frequency of failures.


Chapter 1. Introduction and Overview

Database management systems can serve different purposes: some are used primarily for temporary hot data, some serve as a long-lived cold storage, some allow complex analytical queries, some only allow accessing values by the key, some are optimized to store time-series data, and some store large blobs efficiently. To understand differences and draw distinctions, we start with a short classification and overview, as this helps us to understand the scope of further discussions.

Terminology can sometimes be ambiguous and hard to understand without a complete context. Examples include the distinction between column and wide column stores, which have little or nothing to do with each other, and the way clustered and nonclustered indexes relate to index-organized tables. This chapter aims to disambiguate these terms and find their precise definitions.

We start with an overview of database management system architecture (see “DBMS Architecture”), and discuss system components and their responsibilities. After that, we discuss the distinctions among the database management systems in terms of a storage medium (see “Memory- Versus Disk-Based DBMS”), and layout (see “Column- Versus Row-Oriented DBMS”).

These two groups do not present a full taxonomy of database management systems and there are many other ways they’re classified. For example, some sources group DBMSs into three major categories:

Online transaction processing (OLTP) databases

These handle a large number of user-facing requests and transactions. Queries are often predefined and short-lived.

Online analytical processing (OLAP) databases

These handle complex aggregations. OLAP databases are often used for analytics and data warehousing, and are capable of handling complex, long-running ad hoc queries.

Hybrid transactional and analytical processing (HTAP)

These databases combine properties of both OLTP and OLAP stores.
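The difference between the first two categories is easiest to see in the shape of the queries. The sketch below uses Python’s built-in sqlite3 module purely because it needs no setup; SQLite is not being presented as an OLAP system, and the orders table and its contents are made up for illustration. The first function is a short, predefined lookup by key (OLTP-style), and the second is an aggregation over the whole table (OLAP-style).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (id, customer, amount) VALUES (?, ?, ?)",
    [(1, "alice", 10.0), (2, "bob", 25.0), (3, "alice", 5.0)],
)


def fetch_order(order_id: int):
    """OLTP-style: a short, predefined query touching one record by key."""
    return conn.execute(
        "SELECT customer, amount FROM orders WHERE id = ?", (order_id,)
    ).fetchone()


def revenue_by_customer():
    """OLAP-style: an aggregation that scans and groups many records."""
    return conn.execute(
        "SELECT customer, SUM(amount) FROM orders "
        "GROUP BY customer ORDER BY customer"
    ).fetchall()
```

On three rows the two queries cost about the same; the distinction starts to matter when the lookup still touches one record but the aggregation has to read millions.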

There are many other terms and classifications: key-value stores, relational databases, document-oriented stores, and graph databases. These concepts are not defined here, since the reader is assumed to have a high-level knowledge and understanding of their functionality. Because the concepts we discuss here are widely applicable and are used in most of the mentioned types of stores in some capacity, complete taxonomy is not necessary or important for further discussion.

Since Part I of this book focuses on the storage and indexing structures, we need to understand the high-level data organization approaches, and the relationship between the data and index files (see “Data Files and Index Files”).

Finally, in “Buffering, Immutability, and Ordering”, we discuss three techniques widely used to develop efficient storage structures and how applying these techniques influences their design and implementation.

DBMS Architecture

There’s no common blueprint for database management system design. Every database is built slightly differently, and component boundaries are somewhat hard to see and define. Even if these boundaries exist on paper (e.g., in project documentation), in code seemingly independent components may be coupled because of performance optimizations, handling edge cases, or architectural decisions.

Sources that describe database management system architecture (for example, [HELLERSTEIN07], [WEIKUM01], [ELMASRI11], and [MOLINA08]) define components and relationships between them differently. The architecture presented in Figure 1-1 demonstrates some of the common themes in these representations.

Database management systems use a client/server model, where database system instances (nodes) take the role of servers, and application instances take the role of clients.

Client requests arrive through the transport subsystem. Requests come in the form of queries, most often expressed in some query language. The transport subsystem is also responsible for communication with other nodes in the database cluster.

Figure 1-1. Architecture of a database management system

Upon receipt, the transport subsystem hands the query over to a query processor, which parses, interprets, and validates it. Later, access control checks are performed, as they can be done fully only after the query is interpreted.

The parsed query is passed to the query optimizer, which first eliminates impossible and redundant parts of the query, and then attempts to find the most efficient way to execute it based on internal statistics (index cardinality, approximate intersection size, etc.) and data placement (which nodes in the cluster hold the data and the costs associated with its transfer). The optimizer handles both relational operations required for query resolution, usually presented as a dependency tree, and optimizations, such as index ordering, cardinality estimation, and choosing access methods.

The query is usually presented in the form of an execution plan (or query plan): a sequence of operations that have to be carried out for its results to be considered complete. Since the same query can be satisfied using different execution plans that can vary in efficiency, the optimizer picks the best available plan.

The execution plan is handled by the execution engine, which collects the results of the execution of local and remote operations. Remote execution can involve writing and reading data to and from other nodes in the cluster, and replication.

Local queries (coming directly from clients or from other nodes) are executed by the storage engine. The storage engine has several components with dedicated responsibilities:

Transaction manager

This manager schedules transactions and ensures they cannot leave the database in a logically inconsistent state.

Lock manager

This manager holds locks on the database objects for the running transactions, ensuring that concurrent operations do not violate physical data integrity.

Access methods (storage structures)

These manage access to data and its organization on disk. Access methods include heap files and storage structures such as B-Trees (see “Ubiquitous B-Trees”) or LSM Trees (see “LSM Trees”).

Buffer manager

This manager caches data pages in memory (see “Buffer Management”).

Recovery manager

This manager maintains the operation log and restores the system state in case of a failure (see “Recovery”).


Together, transaction and lock managers are responsible for concurrency control (see “Concurrency Control”): they guarantee logical and physical data integrity while ensuring that concurrent operations are executed as efficiently as possible.

Memory- Versus Disk-Based DBMS

Database systems store data in memory and on disk. In-memory database management systems (sometimes called main memory DBMS) store data primarily in memory and use the disk for recovery and logging. Disk-based DBMS hold most of the data on disk and use memory for caching disk contents or as a temporary storage. Both types of systems use the disk to a certain extent, but main memory databases store their contents almost exclusively in RAM.

Accessing memory has been and remains several orders of magnitude faster than accessing disk,¹ so it is compelling to use memory as the primary storage, and it becomes more economically feasible to do so as memory prices go down. However, RAM prices still remain high compared to persistent storage devices such as SSDs and HDDs.

Main memory database systems are different from their disk-based counterparts not only in terms of a primary storage medium, but also in which data structures, organization, and optimization techniques they use.

Databases using memory as a primary data store do this mainly because of performance, comparatively low access costs, and access granularity. Programming for main memory is also significantly simpler than doing so for the disk. Operating systems abstract memory management and allow us to think in terms of allocating and freeing arbitrarily sized memory chunks. On disk, we have to manage data references, serialization formats, freed memory, and fragmentation manually.

The main limiting factors on the growth of in-memory databases are RAM volatility (in other words, lack of durability) and costs. Since RAM contents are not persistent, software errors, crashes, hardware failures, and power outages can result in data loss. There are ways to ensure durability, such as uninterruptible power supplies and battery-backed RAM, but they require additional hardware resources and operational expertise. In practice, it all comes down to the fact that disks are easier to maintain and have significantly lower prices.


The situation is likely to change as the availability and popularity of Non-Volatile Memory (NVM) [ARULRAJ17] technologies grow. NVM storage reduces or completely eliminates (depending on the exact technology) asymmetry between read and write latencies, further improves read and write performance, and allows byte-addressable access.

Durability in Memory-Based Stores

In-memory database systems maintain backups on disk to provide durability and prevent loss of the volatile data. Some databases store data exclusively in memory, without any durability guarantees, but we do not discuss them in the scope of this book.

Before the operation can be considered complete, its results have to be written to a sequential log file. We discuss write-ahead logs in more detail in “Recovery”. To avoid replaying complete log contents during startup or after a crash, in-memory stores maintain a backup copy. The backup copy is maintained as a sorted disk-based structure, and modifications to this structure are often asynchronous (decoupled from client requests) and applied in batches to reduce the number of I/O operations. During recovery, database contents can be restored from the backup and logs.

Log records are usually applied to backup in batches. After the batch of log records is processed, backup holds a database snapshot for a specific point in time, and log contents up to this point can be discarded. This process is called checkpointing. It reduces recovery times by keeping the disk-resident database most up-to-date with log entries without requiring clients to block until the backup is updated.
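The interplay of the sequential log, the backup copy, and checkpointing can be illustrated with a small sketch. It is a simplification under stated assumptions rather than a description of any particular database: the class `InMemoryStore`, the file names, and the JSON record format are all hypothetical, keys are assumed to be strings, and the checkpoint is taken synchronously from memory instead of being applied asynchronously from log batches.

```python
import json
import os


class InMemoryStore:
    """Toy in-memory key/value store backed by a log and a snapshot (hypothetical)."""

    def __init__(self, directory):
        self.log_path = os.path.join(directory, "store.log")
        self.snapshot_path = os.path.join(directory, "store.snapshot")
        self.data = {}
        self._recover()

    def put(self, key, value):
        # The write is appended to the sequential log and synced to disk
        # before the in-memory state changes.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self.data[key] = value

    def checkpoint(self):
        # Write a sorted snapshot to a temporary file, then swap it in atomically.
        tmp_path = self.snapshot_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as snapshot:
            json.dump(dict(sorted(self.data.items())), snapshot)
            snapshot.flush()
            os.fsync(snapshot.fileno())
        os.replace(tmp_path, self.snapshot_path)
        # Log contents up to this point are covered by the snapshot: discard them.
        open(self.log_path, "w").close()

    def _recover(self):
        # Restore from the snapshot first, then replay the remaining log records.
        if os.path.exists(self.snapshot_path):
            with open(self.snapshot_path, encoding="utf-8") as snapshot:
                self.data = json.load(snapshot)
        if os.path.exists(self.log_path):
            with open(self.log_path, encoding="utf-8") as log:
                for line in log:
                    record = json.loads(line)
                    self.data[record["key"]] = record["value"]
```

Creating a second `InMemoryStore` over the same directory simulates a restart: only the records logged after the last checkpoint are replayed. Replaying a log record twice is harmless here because each record simply overwrites the value for its key.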

NOTE

It is unfair to say that the in-memory database is the equivalent of an on-disk database with a huge page cache (see “Buffer Management”). Even though pages are cached in memory, serialization format and data layout incur additional overhead and do not permit the same degree of optimization that in-memory stores can achieve.

Disk-based databases use specialized storage structures, optimized for disk access. In memory, pointers can be followed comparatively quickly, and random memory access is significantly faster than random disk access. Disk-based storage structures often have a form of wide and short trees (see “Trees for Disk-Based Storage”), while memory-based implementations can choose from a larger pool of data structures and perform optimizations that would otherwise be impossible or difficult to implement on disk [MOLINA92]. Similarly, handling variable-size data on disk requires special attention, while in memory it’s often a matter of referencing the value with a pointer.

For some use cases, it is reasonable to assume that an entire dataset is going to fit in memory. Some datasets are bounded by their real-world representations, such as student records for schools, customer records for corporations, or inventory in an online store. Each record takes up not more than a few KB, and their number is limited.

Column- Versus Row-Oriented DBMS

Most database systems store a set of data records, consisting of columns and rows in tables. A field is the intersection of a column and a row: a single value of some type. Fields belonging to the same column usually have the same data type. For example, if we define a table holding user records, all names would be of the same type and belong to the same column. A collection of values that belong logically to the same record (usually identified by the key) constitutes a row.

One of the ways to classify databases is by how the data is stored on disk: row- or column-wise. Tables can be partitioned either horizontally (storing values belonging to the same row together), or vertically (storing values belonging to the same column together). Figure 1-2 depicts this distinction: (a) shows the values partitioned column-wise, and (b) shows the values partitioned row-wise.


Figure 1-2. Data layout in column- and row-oriented stores

Examples of row-oriented database management systems are abundant: MySQL, PostgreSQL, and most of the traditional relational databases. The two pioneer open source column-oriented stores are MonetDB and C-Store (C-Store is an open source predecessor to Vertica).

Row-Oriented Data Layout

Row-oriented database management systems store data in records or rows. Their layout is quite close to the tabular data representation, where every row has the same set of fields. For example, a row-oriented database can efficiently store user entries, holding names, birth dates, and phone numbers:

| ID | Name  | Birth Date  | Phone Number   |
| 10 | John  | 01 Aug 1981 | +1 111 222 333 |
| 20 | Sam   | 14 Sep 1988 | +1 555 888 999 |
| 30 | Keith | 07 Jan 1984 | +1 333 444 555 |

This approach works well for cases where several fields constitute the record (name, birth date, and a phone number) uniquely identified by the key (in this example, a monotonically incremented number). All fields representing a single user record are often read together. When creating records (for example, when the user fills out a registration form), we write them together as well. At the same time, each field can be modified individually.

Since row-oriented stores are most useful in scenarios when we have to access data by row, storing entire rows together improves spatial locality [DENNING68].²

Because data on a persistent medium such as a disk is typically accessed block-wise (in other words, a minimal unit of disk access is a block), a single block will contain data for all columns. This is great for cases when we’d like to access an entire user record, but makes queries accessing individual fields of multiple user records (for example, queries fetching only the phone numbers) more expensive, since data for the other fields will be paged in as well.

Column-Oriented Data Layout

Column-oriented database management systems partition data vertically (i.e., by column) instead of storing it in rows. Here, values for the same column are stored contiguously on disk (as opposed to storing rows contiguously as in the previous example). For example, if we store historical stock market prices, price quotes are stored together. Storing values for different columns in separate files or file segments allows efficient queries by column, since they can be read in one pass rather than consuming entire rows and discarding data for columns that weren’t queried.

Column-oriented stores are a good fit for analytical workloads that compute aggregates, such as finding trends, computing average values, etc. Processing complex aggregates can be used in cases when logical records have multiple fields, but some of them (in this case, price quotes) have different importance and are often consumed together.

From a logical perspective, the data representing stock market price quotes can still be expressed as a table:

| ID | Symbol | Date        | Price     |
| 1  | DOW    | 08 Aug 2018 | 24,314.65 |
| 2  | DOW    | 09 Aug 2018 | 24,136.16 |
| 3  | S&P    | 08 Aug 2018 | 2,414.45  |
| 4  | S&P    | 09 Aug 2018 | 2,232.32  |

However, the physical column-based database layout looks entirely different. Values belonging to the same column are stored closely together:

Symbol: 1:DOW; 2:DOW; 3:S&P; 4:S&P
Date:   1:08 Aug 2018; 2:09 Aug 2018; 3:08 Aug 2018; 4:09 Aug 2018
Price:  1:24,314.65; 2:24,136.16; 3:2,414.45; 4:2,232.32

To reconstruct data tuples, which might be useful for joins, filtering, and multirow aggregates, we need to preserve some metadata on the column level to identify which data points from other columns each value is associated with. If you do this explicitly, each value will have to hold a key, which introduces duplication and increases the amount of stored data. Some column stores use implicit identifiers (virtual IDs) instead and use the position of the value (in other words, its offset) to map it back to the related values [ABADI13].
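A minimal sketch of the virtual ID idea, using the stock quotes from the table above: each column is kept as its own list, and the position of a value in that list serves as the implicit identifier. The helper names `to_columns`, `reconstruct`, and `average_price` are made up for this illustration, and Python lists stand in for on-disk column files.

```python
rows = [
    (1, "DOW", "08 Aug 2018", 24314.65),
    (2, "DOW", "09 Aug 2018", 24136.16),
    (3, "S&P", "08 Aug 2018", 2414.45),
    (4, "S&P", "09 Aug 2018", 2232.32),
]


def to_columns(rows, names):
    """Partition rows vertically: one list per column, aligned by position."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def reconstruct(columns, offset):
    """Rebuild a tuple from the values stored at the same offset in every column."""
    return tuple(values[offset] for values in columns.values())


def average_price(columns, symbol):
    """Aggregate over two columns only; the other columns are never touched."""
    prices = [
        price
        for current, price in zip(columns["symbol"], columns["price"])
        if current == symbol
    ]
    return sum(prices) / len(prices)


columns = to_columns(rows, ["id", "symbol", "date", "price"])
third_row = reconstruct(columns, 2)  # offset 2 maps back to the row with ID 3
```

No key is stored next to individual values: the offset alone ties the symbol, date, and price of one record together, which is what avoids the duplication mentioned above.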

During the last several years, likely due to a rising demand to run complex analytical queries over growing datasets, we’ve seen many new column-oriented file formats such as Apache Parquet, Apache ORC, and RCFile, as well as column-oriented stores, such as Apache Kudu, ClickHouse, and many others [ROY12].

Distinctions and Optimizations

It is not sufficient to say that distinctions between row and column stores are only in the way the data is stored. Choosing the data layout is just one of the steps in a series of possible optimizations that columnar stores are targeting.

Reading multiple values for the same column in one run significantly improves cache utilization and computational efficiency. On modern CPUs, vectorized instructions can be used to process multiple data points with a single CPU instruction³ [DREPPER07].

Storing values that have the same data type together (e.g., numbers with other numbers, strings with other strings) offers a better compression ratio. We can use different compression algorithms depending on the data type and pick the most effective compression method for each case.
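As one illustration of why same-typed, column-wise values compress well, the sketch below applies run-length encoding to a column with many repeated neighbors. Run-length encoding is chosen here only as a simple example; the text does not prescribe a specific algorithm, and the function names are hypothetical.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse runs of equal adjacent values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in runs for _ in range(count)]


symbol_column = ["DOW", "DOW", "DOW", "S&P", "S&P"]
encoded = rle_encode(symbol_column)
```

A column of symbols sorted or grouped by symbol shrinks to one pair per distinct run, whereas the same values interleaved with dates and prices in a row layout would offer no such runs.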

To decide whether to use a column- or a row-oriented store, you need to understand your access patterns. If the read data is consumed in records (i.e., most or all of the columns are requested) and the workload consists mostly of point queries and range scans, the row-oriented approach is likely to yield better results. If scans span many rows, or compute aggregates over a subset of columns, it is worth considering a column-oriented approach.

Wide Column Stores

Column-oriented databases should not be mixed up with wide column stores, such as Bigtable or HBase, where data is represented as a multidimensional map, columns are grouped into column families (usually storing data of the same type), and inside each column family, data is stored row-wise. This layout is best for storing data retrieved by a key or a sequence of keys.

A canonical example from the Bigtable paper [CHANG06] is a Webtable. A Webtable stores snapshots of web page contents, their attributes, and the relations among them at a specific timestamp. Pages are identified by the reversed URL, and all attributes (such as page content and anchors, representing links between pages) are identified by the timestamps at which these snapshots were taken. In a simplified way, it can be represented as a nested map, as Figure 1-3 shows.

Figure 1-3. Conceptual structure of a Webtable

Data is stored in a multidimensional sorted map with hierarchical indexes: we can locate the data related to a specific web page by its reversed URL and its contents or anchors by the timestamp. Each row is indexed by its row key. Related columns are grouped together in column families (contents and anchor in this example), which are stored on disk separately. Each column inside a column family is identified by the column key, which is a combination of the column family name and a qualifier (html, cnnsi.com, my.look.ca in this example). Column families store multiple versions of data by timestamp. This layout allows us to quickly locate the higher-level entries (web pages, in this case) and their parameters (versions of content and links to the other pages).
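The conceptual structure can be sketched as a nested map keyed by row key, column family, qualifier, and timestamp. The dictionary below is a simplified illustration: the row key, timestamps, and cell values are placeholders in the spirit of the Bigtable paper example, the `read` helper is hypothetical, and a plain `dict` does not provide the sorted order a real wide column store maintains.

```python
# row key -> column family -> qualifier -> timestamp -> value
webtable = {
    "com.cnn.www": {
        "contents": {
            "html": {3: "<html>...v1", 5: "<html>...v2", 6: "<html>...v3"},
        },
        "anchor": {
            "cnnsi.com": {9: "CNN"},
            "my.look.ca": {8: "CNN.com"},
        },
    },
}


def read(table, row_key, family, qualifier, timestamp=None):
    """Return the newest version, or the newest one at or before `timestamp`.

    Raises ValueError if no version is old enough.
    """
    versions = table[row_key][family][qualifier]
    eligible = [t for t in versions if timestamp is None or t <= timestamp]
    return versions[max(eligible)]


latest_page = read(webtable, "com.cnn.www", "contents", "html")
older_page = read(webtable, "com.cnn.www", "contents", "html", timestamp=5)
```

Locating a page is a lookup by row key, and each further level (family, qualifier, timestamp) narrows the result, which mirrors the hierarchical indexing described above.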

While it is useful to understand the conceptual representation of wide column stores, their physical layout is somewhat different. A schematic representation of the data layout in column families is shown in Figure 1-4: column families are stored separately, but in each column family, the data belonging to the same key is stored together.

Figure 1-4. Physical structure of a Webtable

Data Files and Index Files

The primary goal of a database system is to store data and to allow quick access to it. But how is the data organized? Why do we need a database management system and not just a bunch of files? How does file organization improve efficiency?

Database systems do use files for storing the data, but instead of relying on filesystem hierarchies of directories and files for locating records, they compose files using implementation-specific formats. The main reasons to use specialized file organization over flat files are:

Storage efficiency

Files are organized in a way that minimizes storage overhead per stored data record.

Access efficiency

Records can be located in the smallest possible number of steps.

Update efficiency

Record updates are performed in a way that minimizes the number of changes on disk.


Database systems store data records, consisting of multiple fields, in tables, where each table is usually represented as a separate file. Each record in the table can be looked up using a search key. To locate a record, database systems use indexes: auxiliary data structures that allow them to efficiently locate data records without scanning an entire table on every access. Indexes are built using a subset of fields identifying the record.

A database system usually separates data files and index files: data files store data records, while index files store record metadata and use it to locate records in data files. Index files are typically smaller than the data files. Files are partitioned into pages, which typically have the size of a single or multiple disk blocks. Pages can be organized as sequences of records or as slotted pages (see “Slotted Pages”).

New records (insertions) and updates to the existing records are represented by key/value pairs. Most modern storage systems do not delete data from pages explicitly. Instead, they use deletion markers (also called tombstones), which contain deletion metadata, such as a key and a timestamp. Space occupied by the records shadowed by their updates or deletion markers is reclaimed during garbage collection, which reads the pages, writes the live (i.e., nonshadowed) records to the new place, and discards the shadowed ones.
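The shadowing and garbage collection described above can be sketched as follows. This is a simplified model under stated assumptions: pages are plain lists of `(key, value, timestamp)` tuples, a sentinel object named `TOMBSTONE` stands in for a deletion marker, and the newest timestamp wins. None of these names come from a real storage engine.

```python
TOMBSTONE = object()  # hypothetical sentinel playing the role of a deletion marker


def garbage_collect(pages):
    """Return the live records from `pages`, dropping shadowed ones.

    Each page is a list of (key, value, timestamp) tuples. A record is
    shadowed when a newer record or tombstone exists for the same key.
    """
    latest = {}
    for page in pages:
        for key, value, timestamp in page:
            if key not in latest or timestamp >= latest[key][1]:
                latest[key] = (value, timestamp)
    # Only live records are written to the new place; tombstones are discarded too.
    return [
        (key, latest[key][0], latest[key][1])
        for key in sorted(latest)
        if latest[key][0] is not TOMBSTONE
    ]


pages = [
    [("a", 1, 1), ("b", 2, 2)],
    [("a", 10, 3), ("b", TOMBSTONE, 4), ("c", 3, 5)],
]
compacted = garbage_collect(pages)
```

Here the update to `a` shadows its first version, the tombstone removes `b` entirely, and `c` is carried over unchanged.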

Data Files

Data files (sometimes called primary files) can be implemented as index-organized tables (IOT), heap-organized tables (heap files), or hash-organized tables (hashed files).

Records in heap files are not required to follow any particular order, and most of the time they are placed in a write order. This way, no additional work or file reorganization is required when new pages are appended. Heap files require additional index structures, pointing to the locations where data records are stored, to make them searchable.

In hashed files, records are stored in buckets, and the hash value of the key determines which bucket a record belongs to. Records in the bucket can be stored in append order or sorted by key to improve lookup speed.
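A hashed file can be approximated in memory as a fixed number of buckets, with records inside each bucket kept sorted by key. The class below is a sketch only: `HashedFile` and its methods are hypothetical names, lists stand in for on-disk buckets, keys are assumed to be strings, and `zlib.crc32` is used simply because it gives the same hash value on every run.

```python
import bisect
import zlib


class HashedFile:
    """Toy hash-organized table: the key's hash selects the bucket."""

    def __init__(self, bucket_count=4):
        self.buckets = [[] for _ in range(bucket_count)]

    def _bucket(self, key):
        # crc32 is deterministic across runs, unlike the built-in hash() for str.
        return self.buckets[zlib.crc32(key.encode("utf-8")) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        # (key,) sorts just before any (key, value) entry with the same key.
        position = bisect.bisect_left(bucket, (key,))
        if position < len(bucket) and bucket[position][0] == key:
            bucket[position] = (key, value)  # overwrite the existing record
        else:
            bucket.insert(position, (key, value))  # keep the bucket sorted by key

    def get(self, key):
        bucket = self._bucket(key)
        position = bisect.bisect_left(bucket, (key,))
        if position < len(bucket) and bucket[position][0] == key:
            return bucket[position][1]
        return None
```

Keeping each bucket sorted allows a binary search within the bucket; storing records in append order instead would make `put` cheaper at the cost of a linear scan in `get`.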

Index-organized tables (IOTs) store data records in the index itself. Since records are stored in key order, range scans in IOTs can be implemented by sequentially scanning its contents.

Storing data records in the index allows us to reduce the number of disk seeks by at least one, since after traversing the index and locating the searched key, we do not have to address a separate file to find the associated data record.

When records are stored in a separate file, index files hold data entries, uniquely identifying data records and containing enough information to locate them in the data file. For example, we can store file offsets (sometimes called row locators), locations of data records in the data file, or bucket IDs in the case of hash files. In index-organized tables, data entries hold actual data records.
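The relationship between a heap file and an index holding file offsets can be sketched with an in-memory byte buffer standing in for the data file. The `HeapFile` class and the length-prefixed record format are assumptions made for this example, not a format used by any specific database.

```python
import io
import struct


class HeapFile:
    """Toy heap file: records are appended in write order, in no key order."""

    def __init__(self):
        self.file = io.BytesIO()  # stands in for a file on disk

    def append(self, payload: bytes) -> int:
        """Append a length-prefixed record and return its file offset."""
        offset = self.file.seek(0, io.SEEK_END)
        self.file.write(struct.pack("<I", len(payload)) + payload)
        return offset

    def read(self, offset: int) -> bytes:
        """Read the record that starts at `offset`."""
        self.file.seek(offset)
        (length,) = struct.unpack("<I", self.file.read(4))
        return self.file.read(length)


heap = HeapFile()
index = {}  # data entries: search key -> file offset (row locator)
for key, name in [(30, "Keith"), (10, "John"), (20, "Sam")]:
    index[key] = heap.append(name.encode("utf-8"))

record = heap.read(index[10])
```

The records sit in the buffer in the order they were written (30, 10, 20); only the index makes them searchable by key. In an index-organized table, the dictionary values would hold the records themselves instead of offsets.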

Index Files

An index is a structure that organizes data records on disk in a way that facilitates efficient retrieval operations. Index files are organized as specialized structures that map keys to locations in data files where the records identified by these keys (in the case of heap files) or primary keys (in the case of index-organized tables) are stored.

An index on a primary (data) file is called the primary index. However, in most cases we can also assume that the primary index is built over a primary key or a set of keys identified as primary. All other indexes are called secondary.

Secondary indexes can point directly to the data record, or simply store its primary key. A pointer to a data record can hold an offset to a heap file or an index-organized table. Multiple secondary indexes can point to the same record, allowing a single data record to be identified by different fields and located through different indexes. While primary index files hold a unique entry per search key, secondary indexes may hold several entries per search key [MOLINA08].

If the order of data records follows the search key order, this index is called clustered (also known as clustering). Data records in the clustered case are usually stored in the same file or in a clustered file, where the key order is preserved. If the data is stored in a separate file, and its order does not follow the key order, the index is called nonclustered (sometimes called unclustered).


Figure 1-5 shows the difference between the two approaches:

a) Data records are stored directly in the index file.

b) The index file stores offsets that point to data records held in a separate data file.

Figure 1-5. Storing data records in an index file versus storing offsets to the data file (index segments shown in white; segments holding data records shown in gray)

NOTE

Index-organized tables store information in index order and are clustered by definition. Primary indexes are most often clustered. Secondary indexes are nonclustered by definition, since they’re used to facilitate access by keys other than the primary one. Clustered indexes can either be index-organized or have separate index and data files.

Many database systems have an inherent and explicit primary key, a set of columns that uniquely identify the database record. In cases when the primary key is not specified, the storage engine can create an implicit primary key (for example, MySQL InnoDB adds a new auto-increment column and fills in its values automatically).

This terminology is used in different kinds of database systems: relational database systems (such as MySQL and PostgreSQL), Dynamo-based NoSQL stores (such as Apache Cassandra and Riak), and document stores (such as MongoDB). There can be some project-specific naming, but most often there’s a clear mapping to this terminology.

Primary Index as an Indirection

There are different opinions in the database community on whether data records should be referenced directly (through file offset) or via the primary key index.⁴

Both approaches have their pros and cons and are better discussed in the scope of a complete implementation. By referencing data directly, we can reduce the number of disk seeks, but have to pay a cost of updating the pointers whenever the record is updated or relocated during a maintenance process. Using indirection in the form of a primary index allows us to reduce the cost of pointer updates, but has a higher cost on a read path.

Updating just a couple of indexes might work if the workload mostly consists of reads, but this approach does not work well for write-heavy workloads with multiple indexes. To reduce the costs of pointer updates, instead of payload offsets, some implementations use primary keys for indirection. For example, MySQL InnoDB uses a primary index and performs two lookups: one in the secondary index, and one in a primary index when performing a query [TARIQ11]. This adds an overhead of a primary index lookup instead of following the offset directly from the secondary index.

Figure 1-6 shows how the two approaches are different:

a) Two indexes reference data entries directly from secondary index files.

b) A secondary index goes through the indirection layer of a primary index to locate the data entries.

Figure 1-6. Referencing data tuples directly (a) versus using a primary index as indirection (b)


It is also possible to use a hybrid approach and store both data file offsets and primary keys. First, you check if the data offset is still valid and pay the extra cost of going through the primary key index if it has changed, updating the index file after finding a new offset.
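The hybrid approach can be sketched as a secondary index whose entries carry both a primary key and a cached offset. Everything in the example is a hypothetical simplification: the `Table` class, the dictionaries standing in for the data file and the two indexes, and the assumption that offsets are never reused, which is what makes the staleness check valid.

```python
class Table:
    """Toy table with a primary index and a hybrid secondary index on `name`."""

    def __init__(self):
        self.data_file = {}      # offset -> (primary key, record)
        self.primary_index = {}  # primary key -> current offset
        self.name_index = {}     # name -> (primary key, cached offset)
        self.next_offset = 0     # offsets only grow, so they are never reused

    def _write(self, primary_key, record):
        offset = self.next_offset
        self.next_offset += 1
        self.data_file[offset] = (primary_key, record)
        return offset

    def insert(self, primary_key, record):
        offset = self._write(primary_key, record)
        self.primary_index[primary_key] = offset
        self.name_index[record["name"]] = (primary_key, offset)

    def relocate(self, primary_key):
        """Move a record, as a maintenance process might; only the primary index is updated."""
        _, record = self.data_file.pop(self.primary_index[primary_key])
        self.primary_index[primary_key] = self._write(primary_key, record)

    def find_by_name(self, name):
        primary_key, cached_offset = self.name_index[name]
        entry = self.data_file.get(cached_offset)
        if entry is not None and entry[0] == primary_key:
            return entry[1], "direct"  # the cached offset is still valid
        # Stale offset: pay for the primary index lookup and refresh the entry.
        offset = self.primary_index[primary_key]
        self.name_index[name] = (primary_key, offset)
        return self.data_file[offset][1], "via primary index"
```

Dropping the cached offset from `name_index` gives the pure indirection scheme, and dropping the primary key gives direct referencing, in which case `relocate` would have to update every secondary index that points at the record.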

Buffering, Immutability, and Ordering

A storage engine is based on some data structure. However, these structures do not describe the semantics of caching, recovery, transactionality, and other things that storage engines add on top of them.

In the next chapters, we will start the discussion with B-Trees (see “Ubiquitous B-Trees”) and try to understand why there are so many B-Tree variants, and why new database storage structures keep emerging.

Storage structures have three common variables: they use buffering (or avoid using it), use immutable (or mutable) files, and store values in order (or out of order). Most of the distinctions and optimizations in storage structures discussed in this book are related to one of these three concepts.

Buffering

This defines whether or not the storage structure chooses to collect a certain amount of data in memory before putting it on disk. Of course, every on-disk structure has to use buffering to some degree, since the smallest unit of data transfer to and from the disk is a block, and it is desirable to write full blocks. Here, we’re talking about avoidable buffering, something storage engine implementers choose to do. One of the first optimizations we discuss in this book is adding in-memory buffers to B-Tree nodes to amortize I/O costs (see “Lazy B-Trees”). However, this is not the only way we can apply buffering. For example, two-component LSM Trees (see “Two-component LSM Tree”), despite their similarities with B-Trees, use buffering in an entirely different way, and combine buffering with immutability.

Mutability (or immutability)

This defines whether or not the storage structure reads parts of the file, updates them, and writes the updated results at the same location in the file. Immutable structures are append-only: once written, file contents are not modified. Instead, modifications are appended to the end of the file. There are other ways to implement immutability. One of them is copy-on-write (see “Copy-on-Write”), where the modified page, holding the updated version of the record, is written to the new location in the file, instead of its original location. Often the distinction between LSM and B-Trees is drawn as immutable against in-place update storage, but there are structures (for example, “Bw-Trees”) that are inspired by B-Trees but are immutable.

Ordering

This is defined as whether or not the data records are stored in the key order in the pages on disk. In other words, the keys that sort closely are stored in contiguous segments on disk. Ordering often defines whether or not we can efficiently scan the range of records, not only locate the individual data records. Storing data out of order (most often, in insertion order) opens up for some write-time optimizations. For example, Bitcask (see “Bitcask”) and WiscKey (see “WiscKey”) store data records directly in append-only files.
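The effect of ordering on range scans can be shown with a short sketch that compares records kept in key order with records kept in insertion order. The function names are hypothetical, and Python lists stand in for contiguous segments on disk.

```python
import bisect


def range_scan_sorted(sorted_records, low, high):
    """Records in key order: binary-search the start, then read contiguously."""
    start = bisect.bisect_left(sorted_records, (low,))
    result = []
    for key, value in sorted_records[start:]:
        if key > high:
            break  # everything after this point is out of range
        result.append((key, value))
    return result


def range_scan_unordered(records, low, high):
    """Records in insertion order: every record has to be examined."""
    matches = [(key, value) for key, value in records if low <= key <= high]
    return sorted(matches, key=lambda record: record[0])


insertion_order = [(5, "e"), (1, "a"), (3, "c"), (2, "b"), (4, "d")]
key_order = sorted(insertion_order, key=lambda record: record[0])
```

Both functions return the same records for a given range, but the sorted variant stops as soon as it passes the upper bound, while the unordered variant scans the whole list. In exchange, appending to `insertion_order` needs no reorganization at write time.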

Of course, a brief discussion of these three concepts is not enough to show their power, and we’ll continue this discussion throughout the rest of the book.

Summary

In this chapter, we’ve discussed the architecture of a database management system and covered its primary components.

To highlight the importance of disk-based structures and their difference from in-memory ones, we discussed memory- and disk-based stores. We came to the conclusion that disk-based structures are important for both types of stores, but are used for different purposes.

To understand how access patterns influence database system design, we discussed column- and row-oriented database management systems and the primary factors that set them apart from each other. To start a conversation about how the data is stored, we covered data and index files.

Lastly, we introduced three core concepts: buffering, immutability, and ordering. We will use them throughout this book to highlight properties of the storage engines that use them.


FURTHER READING

If you’d like to learn more about the concepts mentioned in this chapter, you can refer to the following sources:

Database architecture

Hellerstein, Joseph M., Michael Stonebraker, and James Hamilton. 2007. “Architecture of a Database System.” Foundations and Trends in Databases 1, no. 2 (February): 141-259. https://doi.org/10.1561/1900000002.

Column-oriented DBMS

Abadi, Daniel, Peter Boncz, Stavros Harizopoulos, Stratos Idreaos, and Samuel Madden. 2013. The Design and Implementation of Modern Column-Oriented Database Systems. Hanover, MA: Now Publishers Inc.

In-memory DBMS

Faerber, Franz, Alfons Kemper, and Per-Åke Larson. 2017. Main Memory Database Systems. Hanover, MA: Now Publishers Inc.

1 You can find a visualization and comparison of disk, memory access latencies, and many other relevant numbers over the years at https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html.

2 Spatial locality is one of the Principles of Locality, stating that if a memory location is accessed, its nearby memory locations will be accessed in the near future.

3 Vectorized instructions, or Single Instruction Multiple Data (SIMD), describes a class of CPU instructions that perform the same operation on multiple data points.

4 The original post that has stirred up the discussion was controversial and one-sided, but you can refer to the presentation comparing MySQL and PostgreSQL index and storage formats, which references the original source as well.