Alexander Ponomarenko 2nd International Scientific Conference … · 2016-09-25 · [Y. Malkov, A....
Transcript of Alexander Ponomarenko 2nd International Scientific Conference … · 2016-09-25 · [Y. Malkov, A....
AlgorithmsforBuildingHighlyScalableDistributedDataStorages
AlexanderPonomarenko2ndInternationalScientificConference“SCIENCEOFTHEFUTURE”
September20-232016,Kazan,Russia.
HierarchyvsHeterarchyData
HierarchyvsHeterarchyData
HierarchyvsHeterarchyData
HierarchyvsHeterarchyData
DHTprotocolsandimplementations
• Aeropike• ApacheCassandra• BATONOverlay• MainlineDHT-StandardDHTusedbyBitTorrent(basedonKademlia)• CAN(ContentAddressableNetwork)• Chord• Koorde• Kademlia• Pastry• P-Grid• Riak• Tapestry• TomP2P• Voldemort
6
ApplicationsemployingDHTs
• BTDigg:BitTorrentDHTsearchengine• cjdns:rou\ngengineformesh-basednetworks• CloudSNAP:adecentralizedwebapplica\ondeploymentpla]orm• Codeen:webcaching• CoralContentDistribu\onNetwork• FAROO:peer-to-peerWebsearchengine• Freenet:acensorship-resistantanonymousnetwork• GlusterFS:adistributedfilesystemusedforstoragevirtualiza\on• GNUnet:Freenet-likedistribu\onnetworkincludingaDHTimplementa\on• Hazelcast:Open-sourcein-memorydatagrid• I2P:Anopen-sourceanonymouspeer-to-peernetwork.• I2P-Bote:serverlesssecureanonymouse-mail.• JXTA:open-sourceP2Ppla]orm• OracleCoherence:anin-memorydatagridbuiltontopofaJavaDHTimplementa\on• Retroshare:aFriend-to-friendnetwork[17]• YaCy:adistributedsearchengine• Tox:aninstantmessagingsystemintendedtofunc\onasaSkypereplacement• Twister:amicrobloggingpeer-to-peerpla]orm• PerfectDark:apeer-to-peerfile-sharingapplica\onfromJapan
7
StructuredPeer-to-PeerNetworks:ChordProtocol
Searchingofkey54staringfrom«N8».Routingtableofnode«N8»
Eachnode,n,maintainsaroutingtablewith(atmost)mentries,calledthefingertable.Thei-thentryinthetableatnodencontainstheidentityofthefirstnode,s,thatsucceedsnbyatleast2^(i-1)ontheidentifiercircle,i.e.,s=successor(n+2^(i-1)),where1<=i<=m
Distancefunction:d(x,y)=(y–x)mod2^m
StructuredPeer-to-PeerNetworks:Kademlia
IdentifierspaceofKademlia
MaymounkovP.,MazieresD.Kademlia:Apeer-to-peerinformationsystembasedonthexormetric//Peer-to-PeerSystems.–SpringerBerlinHeidelberg,2002.–С.53-65.
Distancefunction:d(x,y)=xxory
toOvercomeDHTDisadvantages
10
• DHTusesverysimpledistancefunctions• Hashingdestroyssemanticofthedata• It’shardtoperformcomplexqueries
Usenearestneighboursearchinhighdimensionalmetricspaceinsteadofexactsearch
11
Let–domain-distancefunc1onwhichsa1sfiesproper1es:
– strictposi1veness:d(x,y)>0�x≠y,– symmetry:d(x,y)=d(y,x),– reflexivity:d(x,x)=0,– triangleinequality:d(x,y)+d(y,z)≥d(x,z).
NearestNeighborSearch
GivenafinitesetX={p1,…,pn}ofnpointsinsomemetricspace(D,d),needtobuildadatastructureonXsothatforagivenquerypointq∈Donecanfindapointp∈Xwhichminimizesd(p,q)withasfewdistancecomputa<onsaspossible
);0[: +∞→× RDDdD
12
ExamplesofDistanceFunc3ons
• LpMinkovskidistance(forvectors)• L1–city-blockdistance
• L2–Euclideandistance
• L∞ –infinity
• Editdistance(forstrings)• minimalnumberofinser3ons,dele3onsandsubs3tu3ons• d(‘applica3on’,‘applet’)=6
• Jaccard’scoefficient(forsetsA,B)
∑=
−=n
iii yxyxL
11 ||),(
( )∑=
−=n
iii yxyxL
1
22 ),(
ii
n
iyxyxL −=
=∞ max),(
1
( )∪∩BA
BABAd −=1,
13
2(| ( 1, 2) | | ( 1, 2) |)( 1, 2)(| ( 1) | | ( 1) |) (| ( 2) | | ( 2) |)
V G G E G Gsim G GV G E G V G E G
+=
+ ⋅ +
( 1, 2) 1 ( 1, 2)d G G sim G G= −
MaxCommonSubgraphSimilarity
Kleinberg’sNavigableSmallWorld
[KleinbergJ.Thesmall-worldphenomenon:Analgorithmicperspective//Proceedingsofthethirty-secondannualACMsymposiumonTheoryofcomputing.–ACM,2000.–С.163-170.]
VoroNet,RayNet:AscalableobjectnetworkbasedonVoronoitessellations
BeaumontO.etal.VoroNet:AscalableobjectnetworkbasedonVoronoitessellations.–2006.
Distancefunction: 2 22),( yxyxd +=
MetrizedSmallWorldAlgorithmu=1
u=2
u=3+
+=
Navigablesmallworld
“Toplevel”– first(oldest)elements
“Bottom”level- allelements
u=log(N)
u=log(N)-1
queryelement
R1 R2
[Y.Malkov,A.Ponomarenko,A.Logvinov,andV.Krylov,“Scalabledistributedalgorithmforapproximatenearestneighborsearchprobleminhighdimensionalgeneralmetricspaces,”inSimilaritySearchandApplications.Springer,2012,pp.32–147][Y.Malkov,A.Ponomarenko,A.Logvinov,andV.Krylov,“Approximatenearestneighboralgorithmbasedonnavigablesmallworldgraphs,”InformationSystems,vol.45,2014,pp.61–68.][PonomarenkoA.Query-BasedImprovementProcedureandSelf-AdaptiveGraphConstructionAlgorithmforApproximateNearestNeighborSearch//InternationalConferenceonSimilaritySearchandApplications.–SpringerInternationalPublishing,2015.–С.314-319.]
Booleannon-linearprogrammingformulationforoptimalgraphstructure
17
0.6
0.5
0.52 0.52
0.4
0.3
0.42
0.42 0.42
0.41
0.2
0.2
0.20.2
0.41
0.3
Query
EntryPoint
Searchbygreedyalgorithm
Constructionalgorithm
0.5
0.8
0.90.4
0.7 0.3
0.7
0.90.2
Datasets
• CoPhIR(L2)isthecollectionof208-dimensionalvectorsextractedfromimagesinMPEG7format.
• SIFTisapartoftheTexMexdatasetcollectionavailablehttp://corpus-texmex.irisa.frIthasonemillion128-dimensionalvectors.EachvectorcorrespondstodescriptorextractedfromimagedatausingScaleInvariantFeatureTransformation(SIFT)
• Unfi64issyntheticdatasetof64-dimensionalvectors.Thevectorsweregeneratedrandomly,independentlyanduniformlyintheunithypercube.
Performanceofa10-NNsearchfor:plotsinthesamecolumncorrespondtothesamedataset
2L
[PonomarenkoA.etal.ComparativeAnalysisofDataStructuresforApproximateNearestNeighborSearch//DATAANALYTICS2014,TheThirdInternationalConferenceonDataAnalytics.–2014.–С.125-130.]
KL-divergence: ∑=i
ii y
xxyxd log),( Final16,Final64,andFinal256:aresetsof0.5milliontopic
histogramsgeneratedusingtheLatentDirichletAllocation(LDA).
Wikipediadataset
1, 2, ,( , ,..., )j j j n jd w w w=
1, 2, ,( , ,..., )q q n qq w w w=
, ,1
2 2, ,
1 1
( , )|| || || ||
n
i j i qj i
j n nj
i j i qi i
w wd qsim d q
d qw w
=
= =
⋅= =
⋅
∑
∑ ∑
VectorSpaceModel
Wikipedia(cosinesimilarity):isadatasetthatcontains3.2millionvectorsrepresentedinasparseformat.Thissethasanextremelyhighdimensionality(morethan100thousandelements).Yet,thevectorsaresparse:Onaverageonlyabout600elementsarenon-zero.
Dis
tanc
e C
ompu
tatio
ns
0
450000
900000
1350000
1800000
Number of Elements0 1000000 2000000 3000000 4000000
perm. vp-tree perm. incr. sort. msw perm. inv. index
ScalingofmethodsonWikipediadataset
Wikipediaisdatasetthatcontains3.2millionvectorsrepresentedinasparseformat.EachvectorcorrespondstothefrequencytermvectoroftheWikipediapageextractedusingthegensimlibrary.Thissethasanextremelyhighdimensionality(morethan100thousandelements).
Recall=0.9
Dis
tanc
e C
ompu
tatio
ns
0
10000
20000
30000
40000
Number of objects10000 100000 1000000 10000000
ScalingofMSWdatastructure
Summingup• Algorithmisverysimple• Algorithmusesonlydistancevaluesbetweentheobjects,makingitsuitablefor
arbitraryspaces.• Proposeddatastructurehasnorootelement.• Alloperations(additionandsearch)useonlylocalinformationandcanbeinitiated
fromanyelementthatwaspreviouslyaddedtothestructure.• Accuracyoftheapproximatesearchcanbetunedwithoutrebuildingdatastructure• Algorithmhighscalablebothin
sizeanddatadimensionality
Goodbaseforbuildingmanyreal-worldextremedatasetsizehighdimensionalitysimilaritysearchapplications
27
hwps://github.com/searchivarius/NonMetricSpaceLibhwps://github.com/aponom84/MetrizedSmallWorld
SourceCode
28
Questions?
29
Questions?
WhyCERNdoesn’tuseDHT?