Ingestion, Indexing and Retrieval of High-Velocity Multidimensional Sensor Data on a Single Node
Juan A. Colmenares, Reza Dorrigiv and Daniel Waddington
Seminar Series
Department of Computer Science, University of California, Irvine
January 12, 2018
Samsung Research America

Disclaimer
• No part of this presentation necessarily represents the views and opinions of my current employer and research collaborators.
• This material was presented at the 2017 IEEE Int'l Conference on Big Data (IEEE BigData).

Multidimensional Data Sources
• Mobile Devices
• Vehicles
• Data Centers
• Power Grid
• Smart Appliances

Multidimensional Data
• Spatial-temporal data – [time, longitude, latitude, speed, …]
• Sensor data – [time, voltage, current, temp, …]
• Logs – [time, response latency, result count, …]
Example (NYC Taxi Data): id: 28379, time: 2015/12/04-11:52:21.134, latitude: 40.77, longitude: -73.89, occupants: 3, speed: 43.2 mph
Record: [f1, f2, f3, f4, …, f(N-1), f(N)] (with numerical indexing fields)
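
As a rough illustration of what "numerical indexing fields" can mean in practice, the sketch below turns the NYC taxi example into a tuple of floats that could serve as a point in a multidimensional index. The function names, the choice of fields, and the assumption that the timestamp is in UTC are all hypothetical; the slides do not specify how MDDS encodes records.

```python
from datetime import datetime, timezone

def parse_record(line: str) -> dict:
    """Split 'key:value,key:value,...' into a dict of raw strings."""
    rec = {}
    for item in line.split(","):
        # Split on the first ':' only, because the time value contains ':'.
        key, value = item.split(":", 1)
        rec[key.strip()] = value.strip()
    return rec

def to_index_vector(rec: dict) -> tuple:
    """Map a parsed record to floats: (time, latitude, longitude, occupants, speed)."""
    ts = datetime.strptime(rec["time"], "%Y/%m/%d-%H:%M:%S.%f")
    # Assumption for this sketch: the timestamp is interpreted as UTC.
    epoch = ts.replace(tzinfo=timezone.utc).timestamp()
    return (
        epoch,
        float(rec["latitude"]),
        float(rec["longitude"]),
        float(rec["occupants"]),
        float(rec["speed"].replace("mph", "")),
    )

line = ("id:28379,time:2015/12/04-11:52:21.134,latitude:40.77,"
        "longitude:-73.89,occupants:3,speed:43.2mph")
vector = to_index_vector(parse_record(line))
```

Converting every indexing field to a number up front keeps later range comparisons uniform across dimensions.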

Demands for High Ingestion Rates in Industrial IoT
• Some IoT apps require
  – Ingesting millions of recs/sec
  – Processing queries on recently ingested and historical data
• Example
  – Telemetry of power distribution systems with micro-phasor measurement units (μPMU) [1, 2, 3]
  – 512+ samples/sec of AC voltages and currents, and other variables
[1] UC Berkeley, LBNL et al. http://pqubepmu.com/
[2] Pinte et al. Low voltage micro-phasor measurement unit (μPMU). PECI 2015.
[3] Andersen et al. DISTIL: Design and implementation of a scalable synchrophasor data processing system. IEEE SmartGridComm 2015.

New DBMSs to Meet High Ingestion Requirements
• Traditional DBMS
  – Optimized for read-heavy workloads
  – Offer very low ingest rates (<300K recs/s)
• New time series databases
  – Gorilla [VLDB 2015], BTrDB [FAST 2016]
• New OLAP systems
  – Druid [SIGMOD 2014], VOLAP [CLUSTER 2016], Cubrick [VLDB 2016]
• Scale horizontally
• Sub-second query responses
• Some operate in-memory
• Low per-node ingestion rates

Key Question
Can we build a single-node multidimensional datastore able to:
(1) sustain ingestion rates much higher than those of individual nodes of existing DBMS
(2) while still offering similar query performance?
• To answer it:
  – Adopt a simple design to streamline ingestion
  – Conduct an experimental study confined to
    • Records with numerical indexing fields
    • Range queries

Separate Nodes for Ingestion and Storage
[Architecture diagram. Labels: Data Stream; Data Ingestion Nodes, DI Node (1) through DI Node (N); Interim Storage; Data Storage Nodes, DS Node (1) through DS Node (M); Permanent Storage; Queries; Query Broker Node]
Similar to Druid's design [SIGMOD 2014]

R-Tree
Family variants: R*-tree, R+-tree, Hilbert R-tree, X-tree
Source: Wikipedia

K-dimensional (kd) Tree
K-d tree decomposition for the point set (2,3), (5,4), (9,6), (4,7), (8,1), (7,2)
Source: Wikipedia
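
To make the decomposition concrete, here is a minimal sketch of a k-d tree built by median splitting, with an orthogonal range search, applied to the point set on this slide. It is a textbook construction and not taken from the MDDS code; the names `Node`, `build` and `range_search` are chosen only for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]

@dataclass
class Node:
    point: Point
    axis: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def build(points: Sequence[Point], depth: int = 0) -> Optional[Node]:
    """Build a k-d tree by splitting at the median, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return Node(
        ordered[mid],
        axis,
        build(ordered[:mid], depth + 1),
        build(ordered[mid + 1:], depth + 1),
    )

def range_search(node: Optional[Node], lo: Point, hi: Point,
                 out: Optional[List[Point]] = None) -> List[Point]:
    """Collect all points p with lo[d] <= p[d] <= hi[d] in every dimension d."""
    if out is None:
        out = []
    if node is None:
        return out
    if all(l <= c <= h for l, c, h in zip(lo, node.point, hi)):
        out.append(node.point)
    a = node.axis
    # Descend only into subtrees that can still intersect the query range.
    if lo[a] <= node.point[a]:
        range_search(node.left, lo, hi, out)
    if node.point[a] <= hi[a]:
        range_search(node.right, lo, hi, out)
    return out

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
root = build(points)
inside = range_search(root, (4, 1), (8, 5))
```

With this point set the root splits on x at (7,2), and its children (5,4) and (9,6) split on y.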

Two-Level Indexing Scheme
1. An R*-Tree indexes data segments (bounding boxes)
2. A KD-Tree in each segment indexes individual records
[Diagram labels: R*-Tree, in memory (Level 1); KD-Tree (Level 2); Serialized Data Segments (with the records)]
Bounding Box = $\{d_{1,\min}, d_{1,\max}, \ldots, d_{3,\min}, d_{3,\max}\}$
Similar to EMINC [CloudDB 2009]

Two-Level Indexing Scheme: Range Query
[Diagram: a range query marked with steps 1, 2 and 3 across the R*-Tree, in memory (Level 1), the KD-Tree (Level 2), and the Data Segments (with the records)]
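
One plausible reading of the two-level lookup is sketched below: first prune segments by their bounding boxes, then search only inside the segments that survive. To stay short and self-contained, the sketch uses a plain list of bounding boxes as a stand-in for the in-memory R*-tree and a linear scan as a stand-in for the per-segment KD-tree; both substitutions, and the class and method names, are assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]
Box = Tuple[Point, Point]  # (per-dimension minima, per-dimension maxima)

def bounding_box(records: Sequence[Point]) -> Box:
    dims = range(len(records[0]))
    return (
        tuple(min(r[d] for r in records) for d in dims),
        tuple(max(r[d] for r in records) for d in dims),
    )

def intersects(box: Box, lo: Point, hi: Point) -> bool:
    bmin, bmax = box
    return all(bmin[d] <= hi[d] and lo[d] <= bmax[d] for d in range(len(lo)))

class TwoLevelIndex:
    def __init__(self) -> None:
        # Level 1: stand-in for the in-memory R*-tree over segment bounding boxes.
        self.level1: List[Tuple[Box, int]] = []
        # Stand-in for the stored data segments and their records.
        self.segments: Dict[int, List[Point]] = {}

    def add_segment(self, records: Sequence[Point]) -> int:
        seg_id = len(self.segments)
        self.segments[seg_id] = list(records)
        self.level1.append((bounding_box(records), seg_id))
        return seg_id

    def range_query(self, lo: Point, hi: Point) -> List[Point]:
        hits: List[Point] = []
        for box, seg_id in self.level1:
            # Level 1: skip segments whose bounding box misses the query range.
            if not intersects(box, lo, hi):
                continue
            # Level 2: linear scan here; a KD-tree would narrow this further.
            for rec in self.segments[seg_id]:
                if all(l <= v <= h for l, v, h in zip(lo, rec, hi)):
                    hits.append(rec)
        return hits

index = TwoLevelIndex()
index.add_segment([(0, 0, 0), (1, 1, 1)])
index.add_segment([(5, 5, 5), (6, 6, 6)])
result = index.range_query((4, 4, 4), (5.5, 5.5, 5.5))
```

The point of the first level is that segments whose boxes miss the query are never read at all, which is what keeps read amplification down.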

Data Segment
[Diagram labels: Packed KD-Tree (Serialized); Data Records]

Record Descriptor

Data Ingestion Procedure
• Steps 1–5 are performed only in memory

Multi-Dimensional Datastore (MDDS)
• μ-batches
• Thread Parallelism (Chunks processed independently)
• Data accessible to queries from memory before becoming persistent
• Concurrent Queries (while ingesting data)
• Exploit data locality
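
The idea of cutting the input into μ-batches and processing each chunk independently can be sketched with a thread pool, as below. This is only an illustration of the pattern: the chunk size, the worker count, and the dictionary layout of a segment are hypothetical, and in CPython the global interpreter lock means threads will not speed up pure-Python work the way native threads would in a system like MDDS.

```python
from concurrent.futures import ThreadPoolExecutor

def chunked(records, chunk_size):
    """Yield consecutive chunks (μ-batches) of at most chunk_size records."""
    for i in range(0, len(records), chunk_size):
        yield records[i:i + chunk_size]

def build_segment(chunk):
    """Turn one chunk into a segment: its records plus their bounding box."""
    dims = range(len(chunk[0]))
    box = (
        tuple(min(r[d] for r in chunk) for d in dims),
        tuple(max(r[d] for r in chunk) for d in dims),
    )
    return {"bounding_box": box, "records": list(chunk)}

def ingest(records, chunk_size=4, workers=2):
    """Process chunks independently; no chunk depends on another chunk's result."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input chunks.
        return list(pool.map(build_segment, chunked(records, chunk_size)))

records = [(float(i), float(i % 3)) for i in range(10)]
segments = ingest(records, chunk_size=4)
```

Because each chunk is self-contained, the resulting segments can be handed to queries as soon as they are built, without waiting for the rest of the stream.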

Evaluation
Systems
• Percona Server (enhanced MySQL) with storage engines XtraDB, MyISAM, and TokuDB
• SQLite3
• Druid [SIGMOD 2014]
Datasets and queries
• NYC Taxi Trip
  – ~169M records
  – 10 numerical fields (out of 14)
  – 16 randomly generated queries on 1 km × 1 km areas
• US NOAA's Global Historical Climatology Network - Daily (GHCN-Daily)
  – First 100M records
  – 6 numerical fields (out of 7)
  – 10 meaningful handcrafted queries (e.g., the average snow depth for Mount McKinley in Alaska)
Test platform: Dell PowerEdge R720 Server
• Two 2.50-GHz Intel Xeon processors (20 hardware threads), 64GB of RAM, and an Intel 750 400GB SSD with ext4 file system.
• Ubuntu 14.04 LTS (Linux kernel 3.13.0-71).
Details of experimental setup at: https://arxiv.org/abs/1707.00825

Test Queries on NYC Taxi Data: Randomly Generated

Test Queries on GHCN-Daily Data: Meaningful Handcrafted

Characterization of Data Segmentation Schemes
• Uniformly Random Scheme (very simple)
  – Records assigned to data segments chosen uniformly at random
• Kd-tree partitioning based scheme
  – Tries to create well-populated segments with small overlap among their bounding hyperrectangles
  – Our hypothesis
    • It limits read amplification and improves query performance (but not quite ☹!)
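
A small sketch of the uniformly random scheme, together with a way to count overlaps among the segments' bounding hyperrectangles, is shown below. The function names and the fixed seed are assumptions for the example; the slides do not say how MDDS draws segments or measures overlap.

```python
import random
from itertools import combinations

def random_segments(points, num_segments, seed=0):
    """Assign each record to a segment chosen uniformly at random."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    segments = [[] for _ in range(num_segments)]
    for p in points:
        segments[rng.randrange(num_segments)].append(p)
    return [s for s in segments if s]  # drop segments that stayed empty

def bounding_box(segment):
    dims = range(len(segment[0]))
    return (
        tuple(min(p[d] for p in segment) for d in dims),
        tuple(max(p[d] for p in segment) for d in dims),
    )

def boxes_overlap(a, b):
    (alo, ahi), (blo, bhi) = a, b
    return all(alo[d] <= bhi[d] and blo[d] <= ahi[d] for d in range(len(alo)))

def count_overlaps(segments):
    """Number of segment pairs whose bounding boxes intersect."""
    boxes = [bounding_box(s) for s in segments]
    return sum(boxes_overlap(a, b) for a, b in combinations(boxes, 2))

points = [(float(x), float(y)) for x in range(10) for y in range(10)]
segments = random_segments(points, num_segments=4)
overlaps = count_overlaps(segments)
```

With random assignment every segment tends to span most of the data space, so most pairs of bounding boxes intersect; a spatial partitioning scheme aims to bring that count down.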

KD-tree Partitioning Based Scheme
(1) Bulk Loading: Chunk of Records (μ-batch) → KD-Tree
(2) KD-Tree Partitioning: KD-Tree → Partitioned KD-Tree
  • Traverses the tree in depth-first pre-order, grouping the records based on the number of nodes in the subtrees (with enough records)
(3) Assembly (Serialization): Partitioned KD-Tree → Data Segment
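
The sketch below shows a simplified version of kd-tree based partitioning: the chunk is split recursively at the median, left half before right half, and any subtree holding at most `max_records` records becomes one segment. It is a stand-in for the grouping rule described on the slide, whose exact criteria are not given; `max_records` and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]

def kd_partition(points: Sequence[Point], max_records: int,
                 depth: int = 0) -> List[List[Point]]:
    """Split at the median along cycling axes; small subtrees become segments."""
    if len(points) <= max_records:
        return [list(points)]
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    # Visit the left half first, then the right half (depth-first order).
    return (kd_partition(ordered[:mid], max_records, depth + 1)
            + kd_partition(ordered[mid:], max_records, depth + 1))

def bounding_box(segment: Sequence[Point]):
    dims = range(len(segment[0]))
    return (
        tuple(min(p[d] for p in segment) for d in dims),
        tuple(max(p[d] for p in segment) for d in dims),
    )

chunk = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
segments = kd_partition(chunk, max_records=2)
boxes = [bounding_box(s) for s in segments]
```

Because each split separates records along one axis, segments produced this way cover compact regions, in contrast to random assignment, where each segment spans nearly the whole space.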

Comparison between Segmentation Schemes: Ingestion Throughput

Comparison between Segmentation Schemes: Number of Overlaps among Segments
• Fewer overlaps for kd-tree partitioning, except in one case highlighted in the chart

Comparison between Segmentation Schemes: Query Performance
• Couldn't validate our hypothesis
• The kd-tree partitioning scheme does not yield better query performance

Single-Threaded / Bulk Loading Ingestion
• 11x w/ binary data
• 2x w/ CSV data

Ingestion Thread Scaling and Influence of Queries
• Percona Server, SQLite & Druid reported 35K, 30K, and 55K recs/s, respectively.
• 230x in the multithreaded scenario
• 160x overall
• 27x w/ CSV data
• Chart annotation: -20%

Query Response Times for NYC Taxi Data
• Queries on 1 km² areas with ranges in time, trip duration and passenger count.
• Percona Server and SQLite with a single multicolumn index.
• MDDS performs comparably to or better than Percona Server in 12 queries.
• It outperforms SQLite in Q7 and Q14.
• It outperforms Druid in Q6-Q16 (on 3- to 5-dimensional ranges).
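
To give a feel for what such a query looks like as a multidimensional range, the sketch below builds a 5-dimensional box (time, latitude, longitude, trip duration, passenger count) around a 1 km × 1 km square. The field order, the function names, and the use of roughly 111.32 km per degree of latitude are assumptions for illustration; the actual test queries are described in the paper, not here.

```python
import math

KM_PER_DEG_LAT = 111.32  # approximate length of one degree of latitude

def square_km_box(lat, lon, side_km=1.0):
    """Return ((lat_lo, lon_lo), (lat_hi, lon_hi)) for a square centred at (lat, lon)."""
    half_lat = (side_km / 2) / KM_PER_DEG_LAT
    # A degree of longitude shrinks with the cosine of the latitude.
    half_lon = (side_km / 2) / (KM_PER_DEG_LAT * math.cos(math.radians(lat)))
    return (lat - half_lat, lon - half_lon), (lat + half_lat, lon + half_lon)

def range_query_box(lat, lon, time_range, duration_range, passengers_range):
    """Build (lo, hi) corner points of a 5-dimensional range query."""
    (lat_lo, lon_lo), (lat_hi, lon_hi) = square_km_box(lat, lon)
    lo = (time_range[0], lat_lo, lon_lo, duration_range[0], passengers_range[0])
    hi = (time_range[1], lat_hi, lon_hi, duration_range[1], passengers_range[1])
    return lo, hi

# Hypothetical query: one hour of trips lasting 5 to 15 minutes with 1 to 3 passengers.
lo, hi = range_query_box(40.77, -73.89, (0, 3600), (300, 900), (1, 3))
```

Dropping the duration or passenger bounds from the tuples gives the lower-dimensional variants of the same query.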

Query Response Times for GHCN-Daily Data
• 10 meaningful queries (e.g., average snow depth for Mt. McKinley in Alaska).
• Percona Server and SQLite with multiple indices tailored to the queries.
• As expected, the RDBMSs outperform MDDS across all queries.
• MDDS outperforms Druid in half of the queries (with 3+ dimensional ranges).

Storage Footprint (in GB)
MDDS occupies
• 20-42% less storage space than the RDBMSs
• Up to 2x the space used by Druid (w/ heavy data compression)

Conclusions
• Developed a multidimensional datastore able to
  – Ingest high-velocity sensor data
  – Offer decent query performance
• Showed potential for significant reductions in the number of cluster nodes required to ingest high-velocity sensor data
• Compared a random scheme and a kd-tree partitioning based scheme for data segmentation
  – Kd-tree partitioning scheme produced less overlap between data segments, but did not yield better query performance
  – The random scheme is very simple and faster
    • Our first choice
Thanks
Questions?