PROJECT PROPOSALS: COMMUNITY DETECTION …torlone.dia.uniroma3.it/bigdata/A5-Community.pdfWhat is...
![Page 2: PROJECT PROPOSALS: COMMUNITY DETECTION …torlone.dia.uniroma3.it/bigdata/A5-Community.pdfWhat is Entity Resolution (ER)? •Input data: modeled as a graph. •Graph node = data record.](https://reader030.fdocuments.net/reader030/viewer/2022041023/5ed4739464cb9d0fda746e10/html5/thumbnails/2.jpg)
PROJECT 1: COMMUNITY DETECTION
![Page 3: PROJECT PROPOSALS: COMMUNITY DETECTION …torlone.dia.uniroma3.it/bigdata/A5-Community.pdfWhat is Entity Resolution (ER)? •Input data: modeled as a graph. •Graph node = data record.](https://reader030.fdocuments.net/reader030/viewer/2022041023/5ed4739464cb9d0fda746e10/html5/thumbnails/3.jpg)
What is Community Detection?
• What is Social Network Analysis?
• Community detection: discovering groups in a network where individuals’ group memberships are not explicitly given
Network Analysis is the keyword for the 21st century.
Researchers, politicians and the general public all talk about social networks.
![Page 4: PROJECT PROPOSALS: COMMUNITY DETECTION …torlone.dia.uniroma3.it/bigdata/A5-Community.pdfWhat is Entity Resolution (ER)? •Input data: modeled as a graph. •Graph node = data record.](https://reader030.fdocuments.net/reader030/viewer/2022041023/5ed4739464cb9d0fda746e10/html5/thumbnails/4.jpg)
Subjectivity of Community Definition
Each connected component is a community
A densely-connected community
The definition of a community can be subjective.
![Page 5: PROJECT PROPOSALS: COMMUNITY DETECTION …torlone.dia.uniroma3.it/bigdata/A5-Community.pdfWhat is Entity Resolution (ER)? •Input data: modeled as a graph. •Graph node = data record.](https://reader030.fdocuments.net/reader030/viewer/2022041023/5ed4739464cb9d0fda746e10/html5/thumbnails/5.jpg)
Node-Centric Community Detection
• Node-Centric Community: each node in a group satisfies certain properties
• Sample properties:
  • Complete mutuality
    • cliques
  • Reachability of members
    • k-clique, k-clan, k-club
  • Nodal degrees
    • k-plex, k-core
  • Relative frequency of within-outside ties
![Page 6: PROJECT PROPOSALS: COMMUNITY DETECTION …torlone.dia.uniroma3.it/bigdata/A5-Community.pdfWhat is Entity Resolution (ER)? •Input data: modeled as a graph. •Graph node = data record.](https://reader030.fdocuments.net/reader030/viewer/2022041023/5ed4739464cb9d0fda746e10/html5/thumbnails/6.jpg)
Complete Mutuality: Cliques
• Clique: a maximal complete subgraph, i.e. a set of nodes that are all adjacent to each other and that cannot be extended with a further node
• NP-hard to find the maximum clique in a network
• A straightforward implementation to find cliques is very expensive in time complexity
Nodes 5, 6, 7 and 8 form a clique
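To make the cost of clique enumeration concrete, here is a minimal sketch of the classic Bron-Kerbosch algorithm with pivoting. It is a textbook technique, not necessarily the method used in the cited work, and the example graph (a 4-clique on nodes 5 to 8 plus a few extra edges) is a hypothetical one, not the figure from the slide.

```python
from typing import Dict, Iterable, Iterator, Set, Tuple

Graph = Dict[int, Set[int]]


def build_graph(edges: Iterable[Tuple[int, int]]) -> Graph:
    """Build an undirected adjacency-set graph from an edge list."""
    graph: Graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def maximal_cliques(graph: Graph) -> Iterator[Set[int]]:
    """Yield every maximal clique (Bron-Kerbosch with pivoting)."""

    def expand(r: Set[int], p: Set[int], x: Set[int]) -> Iterator[Set[int]]:
        # r: current clique, p: candidates to add, x: already explored nodes
        if not p and not x:
            yield set(r)
            return
        # Pivot on the node with most neighbours among the candidates,
        # so that fewer branches need to be explored.
        pivot = max(p | x, key=lambda v: len(graph[v] & p))
        for v in list(p - graph[pivot]):
            yield from expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}
            x = x | {v}

    yield from expand(set(), set(graph), set())


# Hypothetical example graph: nodes 5, 6, 7, 8 form a clique.
EDGES = [(5, 6), (5, 7), (5, 8), (6, 7), (6, 8), (7, 8), (4, 5), (4, 6), (1, 4)]
g = build_graph(EDGES)
cliques = sorted(sorted(c) for c in maximal_cliques(g))
```

The worst-case running time is still exponential, since a graph can contain exponentially many maximal cliques; pivoting only prunes redundant branches.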
![Page 7: PROJECT PROPOSALS: COMMUNITY DETECTION …torlone.dia.uniroma3.it/bigdata/A5-Community.pdfWhat is Entity Resolution (ER)? •Input data: modeled as a graph. •Graph node = data record.](https://reader030.fdocuments.net/reader030/viewer/2022041023/5ed4739464cb9d0fda746e10/html5/thumbnails/7.jpg)
Enumerating all Maximal Cliques [CDMPT16]
![Page 8: PROJECT PROPOSALS: COMMUNITY DETECTION …torlone.dia.uniroma3.it/bigdata/A5-Community.pdfWhat is Entity Resolution (ER)? •Input data: modeled as a graph. •Graph node = data record.](https://reader030.fdocuments.net/reader030/viewer/2022041023/5ed4739464cb9d0fda746e10/html5/thumbnails/8.jpg)
Clique is Very Strict
• Clique is a very strict definition
  • Vertices of a clique are at distance 1 from each other
  • Diameter of the induced subgraph is 1
  • Min-degree of the induced subgraph is s-1 (s = clique size)
• Normally, relaxations of cliques are used as the definition of communities
• Clique relaxations include:
  • k-clique: vertices with distance* no greater than k from each other
  • k-club/k-clan: subgraphs of diameter no greater than k
  • k-plex: subgraphs of min-degree no less than s-k
• 1-clique = 1-club = 1-clan = 1-plex = clique
• (*) distance is computed on the input graph and paths can contain “external” edges
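The relaxations above can be checked directly from their definitions. The following is a minimal sketch, assuming an undirected graph stored as adjacency sets; the function names and the example graph are hypothetical. The checks test only the defining property of a node set, not its maximality.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency-set graph from an edge list."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def is_k_plex(graph, nodes, k):
    """True if every member is adjacent to at least len(nodes) - k other members."""
    nodes = set(nodes)
    return all(len(graph[v] & nodes) >= len(nodes) - k for v in nodes)


def bfs_distances(graph, source):
    """Shortest-path distances from source, measured on the whole input graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def is_k_clique(graph, nodes, k):
    """True if all members are within distance k of each other in the input graph."""
    nodes = list(nodes)
    for u in nodes:
        dist = bfs_distances(graph, u)
        if any(dist.get(v, float("inf")) > k for v in nodes):
            return False
    return True


# Hypothetical example graph: a 4-clique on 5..8, node 4 tied to 5 and 6, node 1 tied to 4.
g = build_graph([(5, 6), (5, 7), (5, 8), (6, 7), (6, 8), (7, 8), (4, 5), (4, 6), (1, 4)])
```

In this graph `{1, 5, 6}` satisfies the 2-clique distance bound only because the paths from 1 pass through node 4, which lies outside the set. This is the "external edges" caveat noted above.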
![Page 9: PROJECT PROPOSALS: COMMUNITY DETECTION …torlone.dia.uniroma3.it/bigdata/A5-Community.pdfWhat is Entity Resolution (ER)? •Input data: modeled as a graph. •Graph node = data record.](https://reader030.fdocuments.net/reader030/viewer/2022041023/5ed4739464cb9d0fda746e10/html5/thumbnails/9.jpg)
Enumerating large k-plexes [CFMPT17]
![Page 10: PROJECT PROPOSALS: COMMUNITY DETECTION …torlone.dia.uniroma3.it/bigdata/A5-Community.pdfWhat is Entity Resolution (ER)? •Input data: modeled as a graph. •Graph node = data record.](https://reader030.fdocuments.net/reader030/viewer/2022041023/5ed4739464cb9d0fda746e10/html5/thumbnails/10.jpg)
Enumerating large k-plexes [CFMPT17]
![Page 11: PROJECT PROPOSALS: COMMUNITY DETECTION …torlone.dia.uniroma3.it/bigdata/A5-Community.pdfWhat is Entity Resolution (ER)? •Input data: modeled as a graph. •Graph node = data record.](https://reader030.fdocuments.net/reader030/viewer/2022041023/5ed4739464cb9d0fda746e10/html5/thumbnails/11.jpg)
Graph Databases
• Store data as nodes and relationships
• Database full of linked nodes
![Page 12: PROJECT PROPOSALS: COMMUNITY DETECTION …torlone.dia.uniroma3.it/bigdata/A5-Community.pdfWhat is Entity Resolution (ER)? •Input data: modeled as a graph. •Graph node = data record.](https://reader030.fdocuments.net/reader030/viewer/2022041023/5ed4739464cb9d0fda746e10/html5/thumbnails/12.jpg)
Sample Graph DB
• AllegroGraph
• Bitsy
• Cayley
• GraphBase
• Graphd
• HyperGraphDB
• IBM System G
• imGraph
• InfiniteGraph
• InfoGrid
• Neo4j
• Sparksee/DEX
• Trinity
• TurboGraph
![Page 13: PROJECT PROPOSALS: COMMUNITY DETECTION …torlone.dia.uniroma3.it/bigdata/A5-Community.pdfWhat is Entity Resolution (ER)? •Input data: modeled as a graph. •Graph node = data record.](https://reader030.fdocuments.net/reader030/viewer/2022041023/5ed4739464cb9d0fda746e10/html5/thumbnails/13.jpg)
Sample Graph DB queries
• Pattern matching query
  • Nodes with first name “James”
• Adjacency query
  • Nodes that James knows directly
  • I.e., are adjacent to James in the knows relationship
• Reachability query
  • Nodes that James knows, directly or indirectly
  • I.e., are reachable from James in the knows relationship
• Graph analytical query
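The first three query types can be sketched on an in-memory graph. This is only an illustration of their semantics in plain Python, not the query language of any of the databases listed; the `people` and `knows` data and the function names are hypothetical.

```python
from collections import deque

# Hypothetical property graph: node properties plus a directed "knows" relationship.
people = {
    1: {"first_name": "James"},
    2: {"first_name": "Mary"},
    3: {"first_name": "Ann"},
    4: {"first_name": "Bob"},
}
knows = {1: {2}, 2: {3}, 3: set(), 4: {1}}


def match_first_name(people, name):
    """Pattern matching: nodes whose first_name property equals name."""
    return sorted(n for n, props in people.items() if props["first_name"] == name)


def adjacent(knows, node):
    """Adjacency: nodes that node knows directly."""
    return sorted(knows[node])


def reachable(knows, start):
    """Reachability: nodes reachable from start by following knows edges (BFS)."""
    seen = set()
    queue = deque(knows[start])
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        queue.extend(knows[node])
    return sorted(seen)
```

Here James (node 1) knows Mary directly, and reaches Ann only through Mary, so the adjacency and reachability answers differ.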
![Page 14: PROJECT PROPOSALS: COMMUNITY DETECTION …torlone.dia.uniroma3.it/bigdata/A5-Community.pdfWhat is Entity Resolution (ER)? •Input data: modeled as a graph. •Graph node = data record.](https://reader030.fdocuments.net/reader030/viewer/2022041023/5ed4739464cb9d0fda746e10/html5/thumbnails/14.jpg)
Single SQL query for # connected components (for FUN)
http://stackoverflow.com/questions/33465859/a-number-of-connected-components-of-a-graph-in-sql
![Page 15: PROJECT PROPOSALS: COMMUNITY DETECTION …torlone.dia.uniroma3.it/bigdata/A5-Community.pdfWhat is Entity Resolution (ER)? •Input data: modeled as a graph. •Graph node = data record.](https://reader030.fdocuments.net/reader030/viewer/2022041023/5ed4739464cb9d0fda746e10/html5/thumbnails/15.jpg)
Neo4j query for # connected components
• http://172.17.0.21:7474/service/mazerunner/analysis/connected_components/FOLLOWS
• Via Mazerunner
  • REST API
  • https://github.com/neo4j-contrib/neo4j-mazerunner
  • Integrates Apache Spark, GraphX and Neo4j for big-scale graph analysis
• GraphX: Apache Spark's API for graphs and graph-parallel computation
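For comparison with the database-side approaches, the number of connected components can also be computed in memory with a union-find structure. This is a minimal single-machine sketch, not how Mazerunner or GraphX compute it; the function name and sample data are hypothetical.

```python
def count_connected_components(nodes, edges):
    """Count connected components of an undirected graph with union-find."""
    parent = {v: v for v in nodes}

    def find(v):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    components = len(parent)
    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v  # merge the two components
            components -= 1
    return components


# Hypothetical FOLLOWS edges, treated as undirected for this purpose.
n_components = count_connected_components(range(1, 7), [(1, 2), (2, 3), (4, 5)])
```

Each edge merges at most two components, so the count starts at the number of nodes and decreases by one per successful merge.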
![Page 16: PROJECT PROPOSALS: COMMUNITY DETECTION …torlone.dia.uniroma3.it/bigdata/A5-Community.pdfWhat is Entity Resolution (ER)? •Input data: modeled as a graph. •Graph node = data record.](https://reader030.fdocuments.net/reader030/viewer/2022041023/5ed4739464cb9d0fda746e10/html5/thumbnails/16.jpg)
Performance
![Page 17: PROJECT PROPOSALS: COMMUNITY DETECTION …torlone.dia.uniroma3.it/bigdata/A5-Community.pdfWhat is Entity Resolution (ER)? •Input data: modeled as a graph. •Graph node = data record.](https://reader030.fdocuments.net/reader030/viewer/2022041023/5ed4739464cb9d0fda746e10/html5/thumbnails/17.jpg)
Summary and open problems
• Network Analysis is the keyword for the 21st century
  • Researchers, politicians and the general public all talk about social networks.
• Problems:
  • Communities
  • Analysis of Structure & Social Space
• Technologies:
  • Graph DB
  • Big Data technologies
![Page 18: PROJECT PROPOSALS: COMMUNITY DETECTION …torlone.dia.uniroma3.it/bigdata/A5-Community.pdfWhat is Entity Resolution (ER)? •Input data: modeled as a graph. •Graph node = data record.](https://reader030.fdocuments.net/reader030/viewer/2022041023/5ed4739464cb9d0fda746e10/html5/thumbnails/18.jpg)
PROJECT 2: ENTITY RESOLUTION
![Page 19: PROJECT PROPOSALS: COMMUNITY DETECTION …torlone.dia.uniroma3.it/bigdata/A5-Community.pdfWhat is Entity Resolution (ER)? •Input data: modeled as a graph. •Graph node = data record.](https://reader030.fdocuments.net/reader030/viewer/2022041023/5ed4739464cb9d0fda746e10/html5/thumbnails/19.jpg)
What is Entity Resolution (ER)?
• Input data: modeled as a graph.
  • Graph node = data record.
  • Graph edge label = probability that the record pair represents the same entity.
• Output: a set of clusters, each of which corresponds to an entity.
  • 2 nodes are in the same cluster iff their records represent the same entity.
• Traditional problems [EIV07, GM12].
  • Pairwise match: what is the probability that two records match?
  • Clustering: how to partition records into an unknown # of entities?
  • Blocking: how to perform ER in sub-quadratic time?
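As an illustration of the blocking idea, the sketch below groups records by a cheap key and generates candidate pairs only inside each block. The blocking key (first letter of the last token) and the sample records are assumptions chosen for the example, not taken from the cited references.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, block_key):
    """Return the record-id pairs that share a blocking key."""
    blocks = defaultdict(list)
    for record_id, record in records.items():
        blocks[block_key(record)].append(record_id)
    pairs = []
    for ids in blocks.values():
        # Only records inside the same block are ever compared.
        pairs.extend(combinations(sorted(ids), 2))
    return sorted(pairs)


# Hypothetical records: id -> name string.
records = {1: "Jon Smith", 2: "John Smith", 3: "Mary Jones", 4: "M. Jones", 5: "Jane Smyth"}
pairs = candidate_pairs(records, lambda name: name.split()[-1][0].lower())
```

With five records there are ten possible pairs, but only four survive blocking here. The saving depends entirely on blocks staying small, and a poor key can also separate records that truly match.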
![Page 20: PROJECT PROPOSALS: COMMUNITY DETECTION …torlone.dia.uniroma3.it/bigdata/A5-Community.pdfWhat is Entity Resolution (ER)? •Input data: modeled as a graph. •Graph node = data record.](https://reader030.fdocuments.net/reader030/viewer/2022041023/5ed4739464cb9d0fda746e10/html5/thumbnails/20.jpg)
What is ER Using an Oracle?
• Input data: modeled as a graph.
• Output: a set of clusters = entities.
• Formal problem [WL+13, VBD14, FSS16]:
  • Given an oracle that can correctly answer whether a record pair is a match, what is an optimal strategy for asking oracle queries so as to minimize the number of queries needed to resolve the entire graph?
• Motivation: reduce the crowdsourcing cost of ER for a dataset.
![Page 21: PROJECT PROPOSALS: COMMUNITY DETECTION …torlone.dia.uniroma3.it/bigdata/A5-Community.pdfWhat is Entity Resolution (ER)? •Input data: modeled as a graph. •Graph node = data record.](https://reader030.fdocuments.net/reader030/viewer/2022041023/5ed4739464cb9d0fda746e10/html5/thumbnails/21.jpg)
What is Online ER Using an Oracle?
• Formal problem [FSS16]:
  • Given an oracle that can correctly answer whether a record pair is a match, what is an optimal strategy for asking oracle queries so as to maximize progressive recall wrt the sequence of oracle queries?
  • Progressive recall = area under the “recall vs query sequence” curve.
• Motivation: limited resolution time, early user termination.
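One way to read "area under the recall vs query sequence curve" is as a discrete sum of the recall reached after each query. The sketch below follows that reading; it is an assumption made for illustration, and [FSS16] may normalise the quantity differently. The function name and the two sample sequences are hypothetical.

```python
def progressive_recall(resolved_after_each_query, total_matching_pairs):
    """Discrete area under the recall curve.

    resolved_after_each_query[t] is the number of matching pairs known
    (answered or inferred) after query t + 1.
    """
    return sum(r / total_matching_pairs for r in resolved_after_each_query)


# Two hypothetical strategies that both reach full recall after 3 queries
# on a dataset with 3 matching pairs.
early = progressive_recall([1, 3, 3], 3)   # finds most matches early
late = progressive_recall([0, 1, 3], 3)    # finds them only at the end
```

Both strategies end with the same recall, but the first accumulates a larger area, which is what matters if the user stops early.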
![Page 22: PROJECT PROPOSALS: COMMUNITY DETECTION …torlone.dia.uniroma3.it/bigdata/A5-Community.pdfWhat is Entity Resolution (ER)? •Input data: modeled as a graph. •Graph node = data record.](https://reader030.fdocuments.net/reader030/viewer/2022041023/5ed4739464cb9d0fda746e10/html5/thumbnails/22.jpg)
Example: DB of Handwritten Characters
• Data from the Vatican Secret Archives
  • Registri Vaticani: papal letters from the 13th century.
• Linkage problem: entities = characters.
![Page 23: PROJECT PROPOSALS: COMMUNITY DETECTION …torlone.dia.uniroma3.it/bigdata/A5-Community.pdfWhat is Entity Resolution (ER)? •Input data: modeled as a graph. •Graph node = data record.](https://reader030.fdocuments.net/reader030/viewer/2022041023/5ed4739464cb9d0fda746e10/html5/thumbnails/23.jpg)
Example: DB of Handwritten Characters
![Page 24: PROJECT PROPOSALS: COMMUNITY DETECTION …torlone.dia.uniroma3.it/bigdata/A5-Community.pdfWhat is Entity Resolution (ER)? •Input data: modeled as a graph. •Graph node = data record.](https://reader030.fdocuments.net/reader030/viewer/2022041023/5ed4739464cb9d0fda746e10/html5/thumbnails/24.jpg)
Strategy 1: Edge Ordering [WL+13]
• Optimal strategy needs to ask N - K + (K choose 2) oracle queries (N records, K entities).
  • Takes advantage of (matching and non-matching) transitivity.
• EO: ask oracle queries in decreasing order of edge probability.
  • Can grow multiple clusters and sub-clusters in parallel.
  • Worst-case approximation ratio of O(N) [VBD14].
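The interplay between edge ordering and transitivity can be sketched as follows. This is a simplified illustration under the assumption of a perfect oracle, not the implementation from [WL+13]; the records, probabilities and function names are hypothetical.

```python
def edge_ordering(probabilities, oracle):
    """Resolve records by querying pairs in decreasing probability order.

    probabilities: dict mapping (u, v) -> match probability.
    oracle(u, v): returns True iff u and v are the same entity.
    Returns (clusters, number_of_oracle_queries).
    """
    parent = {}
    for u, v in probabilities:
        parent.setdefault(u, u)
        parent.setdefault(v, v)

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    negatives = []  # record pairs the oracle declared non-matching
    queries = 0
    ordered = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    for (u, v), _ in ordered:
        root_u, root_v = find(u), find(v)
        if root_u == root_v:
            continue  # match inferred by positive transitivity
        if any({find(a), find(b)} == {root_u, root_v} for a, b in negatives):
            continue  # non-match inferred by negative transitivity
        queries += 1
        if oracle(u, v):
            parent[root_u] = root_v
        else:
            negatives.append((u, v))

    clusters = {}
    for x in parent:
        clusters.setdefault(find(x), set()).add(x)
    return list(clusters.values()), queries


# Hypothetical data: N = 5 records, K = 2 entities.
truth = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B"}
probabilities = {
    ("a1", "a2"): 0.9, ("a2", "a3"): 0.8, ("b1", "b2"): 0.7, ("a1", "a3"): 0.6,
    ("a1", "b1"): 0.3, ("a2", "b1"): 0.2, ("a3", "b2"): 0.2,
    ("a1", "b2"): 0.1, ("a2", "b2"): 0.1, ("a3", "b1"): 0.1,
}
clusters, queries = edge_ordering(probabilities, lambda u, v: truth[u] == truth[v])
```

With these well-behaved probabilities the sketch asks N - K + (K choose 2) = 5 - 2 + 1 = 4 queries. When low-probability pairs turn out to be matches, edge ordering can ask many more, which is where the worst-case ratio comes from.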
![Page 25: PROJECT PROPOSALS: COMMUNITY DETECTION …torlone.dia.uniroma3.it/bigdata/A5-Community.pdfWhat is Entity Resolution (ER)? •Input data: modeled as a graph. •Graph node = data record.](https://reader030.fdocuments.net/reader030/viewer/2022041023/5ed4739464cb9d0fda746e10/html5/thumbnails/25.jpg)
Strategy 2: Node Ordering [VBD14]
• Optimal strategy needs to ask N - K + (K choose 2) oracle queries (N records, K entities).
  • Takes advantage of (matching and non-matching) transitivity.
• NO: process nodes in decreasing order of their expected cluster sizes.
  • Ask oracle queries to already-processed nodes in decreasing order of edge probability.
  • Can grow similar-sized clusters (but not sub-clusters) in parallel.
  • Worst-case approximation ratio of O(K) [VBD14].
![Page 26: PROJECT PROPOSALS: COMMUNITY DETECTION …torlone.dia.uniroma3.it/bigdata/A5-Community.pdfWhat is Entity Resolution (ER)? •Input data: modeled as a graph. •Graph node = data record.](https://reader030.fdocuments.net/reader030/viewer/2022041023/5ed4739464cb9d0fda746e10/html5/thumbnails/26.jpg)
Oracle Strategy for Progressive Recall
• Edge ordering: use a benefit metric instead of edge probability.
  • Iteratively query the oracle with the pair (u,v) having the highest value of b_e(u,v).
  • Initially, the edge with the highest value of p(u,v) is queried.
  • Subsequently, can query a lower-probability, higher-benefit edge.
![Page 27: PROJECT PROPOSALS: COMMUNITY DETECTION …torlone.dia.uniroma3.it/bigdata/A5-Community.pdfWhat is Entity Resolution (ER)? •Input data: modeled as a graph. •Graph node = data record.](https://reader030.fdocuments.net/reader030/viewer/2022041023/5ed4739464cb9d0fda746e10/html5/thumbnails/27.jpg)
Strategy 3: Hybrid Ordering [FSS16]
• Hybrid ordering: use node ordering, then edge ordering.
  • Iteratively: select the node u with the highest value of b_n(u), then query the oracle with (u,v), v ∈ C, in decreasing order of b_n(u,C).
  • Heuristic: use a threshold on the benefit b_n(u,C).
  • Finally, process non-inferable edges (u,v) in decreasing order of b_e(u,v).
![Page 28: PROJECT PROPOSALS: COMMUNITY DETECTION …torlone.dia.uniroma3.it/bigdata/A5-Community.pdfWhat is Entity Resolution (ER)? •Input data: modeled as a graph. •Graph node = data record.](https://reader030.fdocuments.net/reader030/viewer/2022041023/5ed4739464cb9d0fda746e10/html5/thumbnails/28.jpg)
Errors in Oracle Answers
• Input data: modeled as a graph.
• Output: a set of noisy clusters.
• Formal problem:
  • Given an oracle that answers whether a record pair is a match with some error probability, what is an optimal strategy for asking oracle queries so as to minimize the number of queries needed to resolve the entire graph while maximizing precision?
![Page 29: PROJECT PROPOSALS: COMMUNITY DETECTION …torlone.dia.uniroma3.it/bigdata/A5-Community.pdfWhat is Entity Resolution (ER)? •Input data: modeled as a graph. •Graph node = data record.](https://reader030.fdocuments.net/reader030/viewer/2022041023/5ed4739464cb9d0fda746e10/html5/thumbnails/29.jpg)
Example: DB of Handwritten Characters
![Page 30: PROJECT PROPOSALS: COMMUNITY DETECTION …torlone.dia.uniroma3.it/bigdata/A5-Community.pdfWhat is Entity Resolution (ER)? •Input data: modeled as a graph. •Graph node = data record.](https://reader030.fdocuments.net/reader030/viewer/2022041023/5ed4739464cb9d0fda746e10/html5/thumbnails/30.jpg)
Errors and Graph Cuts
• Cut: a partition of the nodes (vertices) of a graph into two disjoint subsets.
• Cut-set: the set of edges that have one endpoint in each subset of the partition.
• What would you trust more?
![Page 31: PROJECT PROPOSALS: COMMUNITY DETECTION …torlone.dia.uniroma3.it/bigdata/A5-Community.pdfWhat is Entity Resolution (ER)? •Input data: modeled as a graph. •Graph node = data record.](https://reader030.fdocuments.net/reader030/viewer/2022041023/5ed4739464cb9d0fda746e10/html5/thumbnails/31.jpg)
Errors and Graph Cuts
• Cut: a partition of the nodes (vertices) of a graph into two disjoint subsets.
• Cut-set: the set of edges that have one endpoint in each subset of the partition.
• Formal problem:
  • Build graphs with large cuts using as few edges as possible
  • So-called expander graphs
• Technical contribution: prove that the output graph consists of expanders
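The cut-set definition translates directly into code. The sketch below computes the cut-set of a given partition on a hypothetical graph made of two triangles joined by a single edge; the function name and data are assumptions for illustration.

```python
def cut_set(edges, subset):
    """Edges with exactly one endpoint in subset (the other is in the complement)."""
    subset = set(subset)
    return [(u, v) for u, v in edges if (u in subset) != (v in subset)]


# Hypothetical graph: triangles {1, 2, 3} and {4, 5, 6} joined by the edge (3, 4).
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
bridge_cut = cut_set(edges, {1, 2, 3})
single_node_cut = cut_set(edges, {1})
```

A cluster whose two halves are held together by a single positive answer, as with the edge (3, 4) here, is fragile if that answer can be wrong. Large cuts mean that several answers would all have to be wrong to merge records incorrectly.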
![Page 32: PROJECT PROPOSALS: COMMUNITY DETECTION …torlone.dia.uniroma3.it/bigdata/A5-Community.pdfWhat is Entity Resolution (ER)? •Input data: modeled as a graph. •Graph node = data record.](https://reader030.fdocuments.net/reader030/viewer/2022041023/5ed4739464cb9d0fda746e10/html5/thumbnails/32.jpg)
Strategy 4: Hybrid Ordering with Expanders
• Hybrid ordering with expanders: use node ordering, assigning a node to a cluster only if more than K answers are positive, then edge ordering.
![Page 33: PROJECT PROPOSALS: COMMUNITY DETECTION …torlone.dia.uniroma3.it/bigdata/A5-Community.pdfWhat is Entity Resolution (ER)? •Input data: modeled as a graph. •Graph node = data record.](https://reader030.fdocuments.net/reader030/viewer/2022041023/5ed4739464cb9d0fda746e10/html5/thumbnails/33.jpg)
Summary and open problems
• Formal study of maximizing progressive recall in online ER.
  • Problem is NP-complete.
• Formal study of maximizing progressive recall and precision in the presence of errors in oracle answers.
• Open problems:
  • Design robust, online strategies for errors in oracle answers.
  • Design a more powerful interface for queries than pairwise.
  • Scalability (e.g. blocking)