25 May 2016, AAFD & SFC’16
The Veracity of Big Data
Pierre Senellart
23 April 2013, Dow Jones (cnn.com)
Twitter feed of Associated Press hacked
Algorithmic trading systems reacting to tweets
3 / 38 AAFD & SFC’16 Pierre Senellart
The Four Vs of Big Data
Volume: Data volumes beyond what traditional data management systems can handle (from TB to PB to EB)
Variety: Very diverse forms of data (text, multimedia, graphs, structured data), very diverse organization of data
Velocity: Data produced or changing at high speed (LHC: 100,000,000 collisions / second), more than can be stored
Veracity: Data quality very diverse; imprecise, imperfect, untrustworthy information
Uncertain data is everywhere
Numerous sources of uncertain data:
Measurement errors
Data integration from contradicting sources
Imprecise mappings between heterogeneous schemas
Imprecise automatic processes (information extraction, classification, natural language processing, etc.)
Imperfect human judgment
Lies, opinions, rumors
Uncertainty in Web information extraction
Never-ending Language Learning (NELL, CMU), http://rtw.ml.cmu.edu/rtw/kbbrowser/
Google Squared (discontinued), screenshot from (Fink et al. 2011)
| Subject | Predicate | Object | Confidence |
|---|---|---|---|
| Elvis Presley | diedOnDate | 1977-08-16 | 97.91% |
| Elvis Presley | isMarriedTo | Priscilla Presley | 97.29% |
| Elvis Presley | influences | Carlo Wolff | 96.25% |
YAGO, http://www.mpi-inf.mpg.de/yago-naga/yago
(Suchanek et al. 2007)
Dealing with Uncertainty
Three main research questions:
How to estimate the veracity of a source, or of a piece of information? ⇒ truth finding
How to ensure the provenance of a piece of information, to know where it comes from? ⇒ provenance management
How to efficiently process uncertain data at scale? ⇒ probabilistic database management systems
Outline
Introduction
Truth Finding: Setting, Model, Experiments
Probabilistic Databases
Conclusion
Motivating Example
What are the capital cities of European countries?
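| Source | France | Italy | Poland | Romania | Hungary |
|---|---|---|---|---|---|
| Alice | Paris | Rome | Warsaw | Bucharest | Budapest |
| Bob | ? | Rome | Warsaw | Bucharest | Budapest |
| Charlie | Paris | Rome | Katowice | Bucharest | Budapest |
| David | Paris | Rome | Bratislava | Budapest | Sofia |
| Eve | Paris | Florence | Warsaw | Budapest | Sofia |
| Fred | Rome | ? | ? | Budapest | Sofia |
| George | Rome | ? | ? | ? | Sofia |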
Voting
Information: redundancy

| Source | France | Italy | Poland | Romania | Hungary |
|---|---|---|---|---|---|
| Alice | Paris | Rome | Warsaw | Bucharest | Budapest |
| Bob | ? | Rome | Warsaw | Bucharest | Budapest |
| Charlie | Paris | Rome | Katowice | Bucharest | Budapest |
| David | Paris | Rome | Bratislava | Budapest | Sofia |
| Eve | Paris | Florence | Warsaw | Budapest | Sofia |
| Fred | Rome | ? | ? | Budapest | Sofia |
| George | Rome | ? | ? | ? | Sofia |
| Frequency | P. 0.67, R. 0.33 | R. 0.80, F. 0.20 | W. 0.60, K. 0.20, B. 0.20 | Buch. 0.50, Bud. 0.50 | Bud. 0.43, S. 0.57 |
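The frequency row can be recomputed from the answers with a few lines of Python. This is only a sketch of plain voting on the slide's example; the names `COUNTRIES`, `ANSWERS` and `vote_frequencies` are illustrative and not taken from the talk, and `None` is assumed to stand for a "?" entry.

```python
from collections import Counter

COUNTRIES = ["France", "Italy", "Poland", "Romania", "Hungary"]

# One row per source; None stands for "?" (no answer given).
ANSWERS = {
    "Alice":   ["Paris", "Rome", "Warsaw", "Bucharest", "Budapest"],
    "Bob":     [None, "Rome", "Warsaw", "Bucharest", "Budapest"],
    "Charlie": ["Paris", "Rome", "Katowice", "Bucharest", "Budapest"],
    "David":   ["Paris", "Rome", "Bratislava", "Budapest", "Sofia"],
    "Eve":     ["Paris", "Florence", "Warsaw", "Budapest", "Sofia"],
    "Fred":    ["Rome", None, None, "Budapest", "Sofia"],
    "George":  ["Rome", None, None, None, "Sofia"],
}


def vote_frequencies(answers):
    """Return, per country, the share of non-missing answers for each city."""
    result = {}
    for col, country in enumerate(COUNTRIES):
        counts = Counter(
            row[col] for row in answers.values() if row[col] is not None
        )
        total = sum(counts.values())
        result[country] = {city: n / total for city, n in counts.items()}
    return result


frequencies = vote_frequencies(ANSWERS)
for country, dist in frequencies.items():
    print(country, {city: round(p, 2) for city, p in dist.items()})
```

On this data, plain voting ties on Romania and prefers Sofia for Hungary, which is what motivates taking source trustworthiness into account.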
Evaluating Trustworthiness of Sources
Information: redundancy, trustworthiness of sources (= average frequency of predicted correctness)

| Source | France | Italy | Poland | Romania | Hungary | Trust |
|---|---|---|---|---|---|---|
| Alice | Paris | Rome | Warsaw | Bucharest | Budapest | 0.60 |
| Bob | ? | Rome | Warsaw | Bucharest | Budapest | 0.58 |
| Charlie | Paris | Rome | Katowice | Bucharest | Budapest | 0.52 |
| David | Paris | Rome | Bratislava | Budapest | Sofia | 0.55 |
| Eve | Paris | Florence | Warsaw | Budapest | Sofia | 0.51 |
| Fred | Rome | ? | ? | Budapest | Sofia | 0.47 |
| George | Rome | ? | ? | ? | Sofia | 0.45 |
| Frequency weighted by trust | P. 0.70, R. 0.30 | R. 0.82, F. 0.18 | W. 0.61, K. 0.19, B. 0.20 | Buch. 0.53, Bud. 0.47 | Bud. 0.46, S. 0.54 | |
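One reading of this slide that reproduces its numbers: a source's trust is the average plain-voting frequency of the answers it gave, and each answer is then re-scored by the summed trust of the sources supporting it. The sketch below follows that reading; function and variable names are hypothetical, and `None` is assumed to encode "?".

```python
COUNTRIES = ["France", "Italy", "Poland", "Romania", "Hungary"]

# One row per source; None stands for "?" (no answer given).
ANSWERS = {
    "Alice":   ["Paris", "Rome", "Warsaw", "Bucharest", "Budapest"],
    "Bob":     [None, "Rome", "Warsaw", "Bucharest", "Budapest"],
    "Charlie": ["Paris", "Rome", "Katowice", "Bucharest", "Budapest"],
    "David":   ["Paris", "Rome", "Bratislava", "Budapest", "Sofia"],
    "Eve":     ["Paris", "Florence", "Warsaw", "Budapest", "Sofia"],
    "Fred":    ["Rome", None, None, "Budapest", "Sofia"],
    "George":  ["Rome", None, None, None, "Sofia"],
}


def weighted_frequencies(answers, trust):
    """Per country, share of the total trust backing each proposed city."""
    result = {}
    for col, country in enumerate(COUNTRIES):
        weights = {}
        for source, row in answers.items():
            if row[col] is not None:
                weights[row[col]] = weights.get(row[col], 0.0) + trust[source]
        total = sum(weights.values())
        result[country] = {city: w / total for city, w in weights.items()}
    return result


def source_trust(answers, frequencies):
    """Trust of a source = average frequency of the answers it gave."""
    trust = {}
    for source, row in answers.items():
        scores = [
            frequencies[country][row[col]]
            for col, country in enumerate(COUNTRIES)
            if row[col] is not None
        ]
        trust[source] = sum(scores) / len(scores)
    return trust


# Uniform trust gives plain voting; one trust update then re-weights the votes.
plain = weighted_frequencies(ANSWERS, {source: 1.0 for source in ANSWERS})
trust = source_trust(ANSWERS, plain)
weighted = weighted_frequencies(ANSWERS, trust)

for source, value in trust.items():
    print(source, round(value, 2))
```

A single re-weighting step is not enough to overturn the Hungary vote here: Sofia still leads Budapest.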
Iterative Fixpoint Computation
Information: redundancy, trustworthiness of sources with iterative fixpoint computation

| Source | France | Italy | Poland | Romania | Hungary | Trust |
|---|---|---|---|---|---|---|
| Alice | Paris | Rome | Warsaw | Bucharest | Budapest | 0.65 |
| Bob | ? | Rome | Warsaw | Bucharest | Budapest | 0.63 |
| Charlie | Paris | Rome | Katowice | Bucharest | Budapest | 0.57 |
| David | Paris | Rome | Bratislava | Budapest | Sofia | 0.54 |
| Eve | Paris | Florence | Warsaw | Budapest | Sofia | 0.49 |
| Fred | Rome | ? | ? | Budapest | Sofia | 0.39 |
| George | Rome | ? | ? | ? | Sofia | 0.37 |
| Frequency weighted by trust | P. 0.75, R. 0.25 | R. 0.83, F. 0.17 | W. 0.62, K. 0.20, B. 0.19 | Buch. 0.57, Bud. 0.43 | Bud. 0.51, S. 0.49 | |
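The iteration can be sketched as alternating the two updates (answer frequencies from trust, trust from answer frequencies) for a fixed number of rounds. This is an illustrative loop under assumed update rules, not necessarily the exact procedure behind the slide: the number of rounds, the uniform starting trust and all names are assumptions, so the figures it prints need not match the table digit for digit.

```python
COUNTRIES = ["France", "Italy", "Poland", "Romania", "Hungary"]

# One row per source; None stands for "?" (no answer given).
ANSWERS = {
    "Alice":   ["Paris", "Rome", "Warsaw", "Bucharest", "Budapest"],
    "Bob":     [None, "Rome", "Warsaw", "Bucharest", "Budapest"],
    "Charlie": ["Paris", "Rome", "Katowice", "Bucharest", "Budapest"],
    "David":   ["Paris", "Rome", "Bratislava", "Budapest", "Sofia"],
    "Eve":     ["Paris", "Florence", "Warsaw", "Budapest", "Sofia"],
    "Fred":    ["Rome", None, None, "Budapest", "Sofia"],
    "George":  ["Rome", None, None, None, "Sofia"],
}


def weighted_frequencies(answers, trust):
    """Per country, share of the total trust backing each proposed city."""
    result = {}
    for col, country in enumerate(COUNTRIES):
        weights = {}
        for source, row in answers.items():
            if row[col] is not None:
                weights[row[col]] = weights.get(row[col], 0.0) + trust[source]
        total = sum(weights.values())
        result[country] = {city: w / total for city, w in weights.items()}
    return result


def source_trust(answers, frequencies):
    """Trust of a source = average frequency of the answers it gave."""
    trust = {}
    for source, row in answers.items():
        scores = [
            frequencies[country][row[col]]
            for col, country in enumerate(COUNTRIES)
            if row[col] is not None
        ]
        trust[source] = sum(scores) / len(scores)
    return trust


def iterate(answers, rounds=50):
    """Alternate the two updates, starting from uniform trust (an assumption)."""
    trust = {source: 1.0 for source in answers}
    for _ in range(rounds):
        frequencies = weighted_frequencies(answers, trust)
        trust = source_trust(answers, frequencies)
    return trust, weighted_frequencies(answers, trust)


final_trust, final_frequencies = iterate(ANSWERS)
for source, value in final_trust.items():
    print(source, round(value, 2))
```

The design point is that trust and answer scores are defined in terms of each other, so neither can be computed in one pass; iterating from a neutral starting point is the usual heuristic.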
Context and problem
Context:
Set of sources stating facts
(Possible) functional dependencies between facts
Fully unsupervised setting: we do not assume any information on truth values of facts or inherent trust in sources
Problem: determine which facts are true and which facts are false
Real-world applications: query answering, source selection, data quality assessment on the web, making good use of the wisdom of crowds
General Model
Set of facts $F = \{f_1, \dots, f_n\}$
Examples: “Paris is capital of France”, “Rome is capital of France”, “Rome is capital of Italy”
Set of views (= sources) $V = \{V_1, \dots, V_m\}$, where a view is a partial mapping from $F$ to {T, F}
Example: $\neg$ “Paris is capital of France” $\wedge$ “Rome is capital of France”
Objective: find the most likely real world $W$ given $V$, where the real world is a total mapping from $F$ to {T, F}
Example: “Paris is capital of France” $\wedge$ $\neg$ “Rome is capital of France” $\wedge$ “Rome is capital of Italy” $\wedge$ ...
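As a rough illustration of these definitions, a view can be held as a dictionary that simply omits the facts it says nothing about, while a real world assigns a truth value to every fact. The dictionaries and helper names below are hypothetical and only mirror the slide's examples.

```python
FACTS = [
    "Paris is capital of France",
    "Rome is capital of France",
    "Rome is capital of Italy",
]

# A view is a partial mapping: facts it does not mention are absent.
view_1 = {"Paris is capital of France": False, "Rome is capital of France": True}
view_2 = {"Paris is capital of France": True, "Rome is capital of Italy": True}

# A candidate real world is a total mapping: every fact gets a truth value.
world = {
    "Paris is capital of France": True,
    "Rome is capital of France": False,
    "Rome is capital of Italy": True,
}


def is_total(mapping, facts):
    """True when the mapping assigns a value to every fact."""
    return all(fact in mapping for fact in facts)


def errors(view, candidate_world):
    """Facts on which the view contradicts the candidate world."""
    return [fact for fact, value in view.items() if candidate_world[fact] != value]


print(is_total(world, FACTS), is_total(view_1, FACTS))
print(errors(view_1, world))
print(errors(view_2, world))
```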
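Generative Probabilistic Model (Galland et al. 2010)
For each view $V_i$ and fact $f_j$:
With probability $\varphi(V_i)\varphi(f_j)$, the view says nothing about the fact (?)
With probability $1 - \varphi(V_i)\varphi(f_j)$, the view states a value for the fact: $\neg W(f_j)$ with probability $\varepsilon(V_i)\varepsilon(f_j)$, or $W(f_j)$ with probability $1 - \varepsilon(V_i)\varepsilon(f_j)$
$\varphi(V_i)\varphi(f_j)$: probability that $V_i$ “forgets” $f_j$
$\varepsilon(V_i)\varepsilon(f_j)$: probability that $V_i$ “makes an error” on $f_j$
Number of parameters: $n + 2(n + m)$
Size of data: $\tilde{\varphi} n m$ with $\tilde{\varphi}$ the average forget rate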
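To make the generative reading concrete, here is a small sampler: for each fact, a view first "forgets" it with probability $\varphi(V_i)\varphi(f_j)$ and otherwise reports the wrong value with probability $\varepsilon(V_i)\varepsilon(f_j)$. This is a sketch only; the function names, the fact labels `f1`..`f3` and the parameter values are made up for illustration.

```python
import random


def generate_statement(truth, phi_view, phi_fact, eps_view, eps_fact, rng):
    """Sample what one view says about one fact whose real value is `truth`.

    Returns None when the view "forgets" the fact, otherwise a bool.
    """
    if rng.random() < phi_view * phi_fact:
        return None            # forgotten: the view is silent on this fact
    if rng.random() < eps_view * eps_fact:
        return not truth       # error: the view states the opposite value
    return truth               # correct statement


def generate_view(world, phi_view, eps_view, phi_fact, eps_fact, rng):
    """Sample a whole view from a real world (a total mapping fact -> bool)."""
    return {
        fact: generate_statement(
            truth, phi_view, phi_fact[fact], eps_view, eps_fact[fact], rng
        )
        for fact, truth in world.items()
    }


# Hypothetical parameters, chosen only to exercise the sampler.
world = {"f1": True, "f2": False, "f3": True}
phi_fact = {"f1": 0.2, "f2": 0.5, "f3": 0.9}
eps_fact = {"f1": 0.1, "f2": 0.3, "f3": 0.6}

rng = random.Random(0)
view = generate_view(world, 0.5, 0.4, phi_fact, eps_fact, rng)
print(view)
```

Truth finding is the inverse task: given only sampled views like `view`, recover `world` and the per-source and per-fact parameters.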
Obvious Approach
Method: use this generative model to find the most likely parameters given the data
Invert the generative model to compute the probability of a set of parameters given the data
Not practically applicable:
Non-linearity of the model and Boolean parameter $W(f_j)$ ⇒ equations for inverting the generative model are very complex
Large number of parameters ($n$ and $m$ can both be quite large) ⇒ any exponential technique is impractical
⇒ Heuristic fixpoint algorithms (many have been proposed!)
Hubdub (1/2)
http://www.hubdub.com/
357 questions, 1 to 20 answers, 473 participants
Hubdub (2/2)
| Method | Number of errors (no post-filtering) | Number of errors (with post-filtering) |
|---|---|---|
| Voting | 278 | 292 |
| Counting | 340 | 327 |
| TruthFinder (Yin et al. 2007) | 458 | 274 |
| 3-Estimates (Galland et al. 2010) | 272 | 270 |
General-Knowledge Quiz (1/2)
http://www.madore.org/~david/quizz/quizz1.html
17 questions, 4 to 14 answers, 601 participants
General-Knowledge Quiz (2/2)
| Method | Number of errors (no post-filtering) | Number of errors (with post-filtering) |
|---|---|---|
| Voting | 11 | 6 |
| Counting | 12 | 6 |
| TruthFinder (Yin et al. 2007) | 78 | 77 |
| 3-Estimates (Galland et al. 2010) | 9 | 0 |
Many variations...
Modeling of various real-world phenomena:
Sources copying each other (Dong et al. 2010)
Complex source dependencies (Pochampally et al. 2014)
Similarity between attribute values (Yin et al. 2008)
Correlated group of attributes (Ba et al. 2015)
Heterogeneous data types (Q. Li et al. 2014)
...
See extensive evaluations of different techniques (X. Li et al. 2012; Waguih and Berti-Equille 2014). General problem far from being solved!
Outline
Introduction
Truth Finding
Probabilistic Databases: Uncertainty Management, Querying Probabilistic Databases
Conclusion
Different types of uncertainty
Two dimensions:
Different types:
Unknown value: NULL in an RDBMS
Alternative between several possibilities: either A or B or C
Imprecision on a numeric value: a sensor gives a value that is an approximation of the actual value
Confidence in a fact as a whole: cf. information extraction
Structural uncertainty: the schema of the data itself is uncertain
Qualitative (NULL) or quantitative (95%, low-confidence, etc.) uncertainty
Managing uncertainty
Objective: Not to pretend this imprecision does not exist, and manage it as rigorously as possible throughout a long, automatic and human, potentially complex, process.
Especially:
Represent all different forms of uncertainty
Use probabilities to represent quantitative information on the confidence in the data
Query data and retrieve uncertain results
Allow adding, deleting, modifying data in an uncertain way
Bonus (if possible): Keep as well lineage/provenance information, so as to ensure traceability
Why probabilities?
Not the only option: fuzzy set theory (Galindo et al. 2005), Dempster-Shafer theory (Zadeh 1986)
Mathematically rich theory, nice semantics with respect to traditional database operations (e.g., joins)
Some applications already generate probabilities (e.g., statistical information extraction or natural language probabilities)
In other cases, we “cheat” and pretend that (normalized) confidence scores are probabilities: see this as a first-order approximation
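Tuple-independent databases (TID)

| S | | Probability |
|---|---|---|
| a | a | 1 |
| b | v | 0.5 |
| b | w | 0.2 |

[Figure: S drawn as a graph, with edge labels 1, 0.5 and 0.2]

This TID instance represents the following probability distribution:

| Probability | Tuples of S in this possible world |
|---|---|
| $0.5 \times 0.2$ | (a, a), (b, v), (b, w) |
| $0.5 \times (1 - 0.2)$ | (a, a), (b, v) |
| $(1 - 0.5) \times 0.2$ | (a, a), (b, w) |
| $(1 - 0.5) \times (1 - 0.2)$ | (a, a) |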
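The distribution over possible worlds can be enumerated directly: each tuple is kept or dropped independently, and a world's probability is the product of the corresponding factors. The sketch below does this for the relation S of the slide; `TID_S` and `possible_worlds` are illustrative names.

```python
from itertools import product

# Each tuple of S with its marginal probability, as on the slide.
TID_S = {("a", "a"): 1.0, ("b", "v"): 0.5, ("b", "w"): 0.2}


def possible_worlds(tid):
    """Yield (world, probability) pairs; a world is a frozenset of tuples."""
    tuples = list(tid)
    for present in product([True, False], repeat=len(tuples)):
        probability = 1.0
        for t, keep in zip(tuples, present):
            probability *= tid[t] if keep else 1.0 - tid[t]
        if probability > 0:  # skip worlds that drop a certain tuple
            yield frozenset(t for t, keep in zip(tuples, present) if keep), probability


worlds = dict(possible_worlds(TID_S))
for world, probability in worlds.items():
    print(sorted(world), round(probability, 2))
```

With n uncertain tuples there are up to 2^n worlds, so this enumeration is only useful for tiny instances such as this one.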
Query evaluation on probabilistic instances
We want to evaluate the probability of a query on a TID instance
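$q: \exists x\, \exists y\; R(x) \wedge S(x, y) \wedge T(y)$

| R | Probability |
|---|---|
| a | 1 |
| b | 0.4 |
| c | 0.6 |

| S | | Probability |
|---|---|---|
| a | a | 1 |
| b | v | 0.5 |
| b | w | 0.2 |

| T | Probability |
|---|---|
| v | 0.3 |
| w | 0.7 |
| b | 1 |

The query is true iff R(b) is present and one of the following holds:
S(b, v) and T(v) are present
S(b, w) and T(w) are present
→ Probability: $0.4 \times \bigl(1 - (1 - 0.5 \times 0.3) \times (1 - 0.2 \times 0.7)\bigr)$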
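The closed-form probability can be cross-checked by brute force: enumerate every possible world of R, S and T, keep those where the query holds, and sum their probabilities. This is a sketch for this small instance only; the helper names are hypothetical, and the enumeration is exponential in the number of tuples.

```python
from itertools import product

R = {("a",): 1.0, ("b",): 0.4, ("c",): 0.6}
S = {("a", "a"): 1.0, ("b", "v"): 0.5, ("b", "w"): 0.2}
T = {("v",): 0.3, ("w",): 0.7, ("b",): 1.0}


def query_holds(r, s, t):
    """q: there exist x, y with R(x), S(x, y) and T(y) all present."""
    return any((x,) in r and (y,) in t for (x, y) in s)


def query_probability():
    """Sum the probabilities of all possible worlds in which q holds."""
    facts = [("R", key, p) for key, p in R.items()]
    facts += [("S", key, p) for key, p in S.items()]
    facts += [("T", key, p) for key, p in T.items()]
    total = 0.0
    for present in product([True, False], repeat=len(facts)):
        weight = 1.0
        world = {"R": set(), "S": set(), "T": set()}
        for (relation, key, p), keep in zip(facts, present):
            weight *= p if keep else 1.0 - p
            if keep:
                world[relation].add(key)
        if weight > 0 and query_holds(world["R"], world["S"], world["T"]):
            total += weight
    return total


brute_force = query_probability()
closed_form = 0.4 * (1 - (1 - 0.5 * 0.3) * (1 - 0.2 * 0.7))
print(round(brute_force, 4), round(closed_form, 4))
```

Both computations give 0.1076 on this instance; the pair (a, a) never contributes because T contains no tuple a.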
![Page 65: The Veracity of Big Data - Pierre Senellart · 5/25/2016 · 3 / 38 AAFD & SFC’16 Pierre Senellart The Four Vs of Big Data Volume:Datavolumesbeyondwhatismanageablebytraditional](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed37e22847f87317f77c401/html5/thumbnails/65.jpg)
33 / 38 AAFD & SFC’16 Pierre Senellart
Complexity of probabilistic query evaluation (PQE)
What is the data complexity of probabilistic query evaluation on TID, depending on the class Q of queries and the class I of instances?
Existing dichotomy result (Dalvi and Suciu 2012):
- Q is the class of (unions of) conjunctive queries; I is the class of all TID instances
- There is a class $S \subseteq Q$ of safe queries
- PQE is in PTIME for any $q \in S$ on all instances
- PQE is #P-hard for any $q \in Q \setminus S$ on all instances
- $q : \exists x\, y\; R(x) \land S(x, y) \land T(y)$ is unsafe!

Is there a smaller class I such that PQE is tractable for a larger Q?
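To give a feel for what tractability of a safe query means, the sketch below evaluates a simpler query, $q' : \exists x\, y\; R(x) \land S(x, y)$, in time polynomial in the data by multiplying probabilities of independent events. Everything here is an assumption made for illustration: $q'$ does not appear in the slides (it is used as a textbook-style example of a safe query), the data reuses only the b and c rows of R and the b rows of S from the earlier example, and the function names are hypothetical. The result is checked against brute-force enumeration.

```python
from itertools import product
from math import prod

# Illustrative data: a subset of the tuples from the earlier example.
R = {"b": 0.4, "c": 0.6}
S = {("b", "v"): 0.5, ("b", "w"): 0.2}

def safe_plan_probability(R, S):
    """P(exists x, y: R(x) and S(x, y)) via independent projections."""
    p_no_x = 1.0
    for x, p_r in R.items():
        # Probability that x has at least one S-successor.
        p_some_y = 1.0 - prod(1.0 - p for (x2, _), p in S.items() if x2 == x)
        # Different values of x involve disjoint sets of tuples,
        # so the corresponding events are independent.
        p_no_x *= 1.0 - p_r * p_some_y
    return 1.0 - p_no_x

def brute_force_probability(R, S):
    """Same probability, by summing over all possible worlds."""
    facts = [("R", k, p) for k, p in R.items()]
    facts += [("S", k, p) for k, p in S.items()]
    total = 0.0
    for kept in product([True, False], repeat=len(facts)):
        prob = prod(p if k else 1.0 - p for (_, _, p), k in zip(facts, kept))
        r_present = {key for (rel, key, _), k in zip(facts, kept) if k and rel == "R"}
        s_present = {key for (rel, key, _), k in zip(facts, kept) if k and rel == "S"}
        if any(x in r_present for x, _ in s_present):
            total += prob
    return total

p_safe = safe_plan_probability(R, S)
p_brute = brute_force_probability(R, S)
print(p_safe, p_brute)
```

Both functions return 0.24 on this data (up to rounding). No such decomposition into independent events is available for the unsafe query q, because each S(x, y) tuple ties together an R tuple and a T tuple.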
![Page 71: The Veracity of Big Data - Pierre Senellart · 5/25/2016 · 3 / 38 AAFD & SFC’16 Pierre Senellart The Four Vs of Big Data Volume:Datavolumesbeyondwhatismanageablebytraditional](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed37e22847f87317f77c401/html5/thumbnails/71.jpg)
34 / 38 AAFD & SFC’16 Pierre Senellart
Trees and treelike instances
Idea: let I be the treelike instances (constant bound on treewidth)
- Trees have treewidth 1
- Cycles have treewidth 2
- k-cliques and (k − 1)-grids have treewidth k − 1

→ Known results (Courcelle 1990): I: treelike instances; Q: monadic second-order queries
→ non-probabilistic QE is in linear time
→ Does this extend to probabilistic QE?
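The treewidth values above can be made concrete with a small checker for tree decompositions. This is a sketch under stated assumptions, not material from the talk: the function names and the example decompositions are illustrative, and the checker assumes without verifying that `tree_edges` forms a tree. A valid decomposition of width w only shows that the treewidth is at most w.

```python
def is_tree_decomposition(vertices, edges, bags, tree_edges):
    """Check the three tree-decomposition conditions.

    bags maps a node id of the decomposition to a set of graph vertices;
    tree_edges is assumed (not checked) to form a tree over those node ids.
    """
    # 1. Every vertex appears in some bag.
    if any(all(v not in bag for bag in bags.values()) for v in vertices):
        return False
    # 2. Every edge is contained in some bag.
    if any(all(not {u, v} <= bag for bag in bags.values()) for u, v in edges):
        return False
    # 3. The bags containing a given vertex form a connected subtree.
    for v in vertices:
        nodes = {n for n, bag in bags.items() if v in bag}
        start = next(iter(nodes))
        seen, stack = {start}, [start]
        while stack:
            n = stack.pop()
            for a, b in tree_edges:
                for src, dst in ((a, b), (b, a)):
                    if src == n and dst in nodes and dst not in seen:
                        seen.add(dst)
                        stack.append(dst)
        if seen != nodes:
            return False
    return True

def width(bags):
    """Width of a decomposition: largest bag size minus one."""
    return max(len(bag) for bag in bags.values()) - 1

# A path a - b - c (a tree): two bags of size 2, so width 1.
path_bags = {0: {"a", "b"}, 1: {"b", "c"}}
path_ok = is_tree_decomposition(
    ["a", "b", "c"], [("a", "b"), ("b", "c")], path_bags, [(0, 1)])

# A 4-cycle 1 - 2 - 3 - 4 - 1: two bags of size 3, so width 2.
cycle_edges = [(1, 2), (2, 3), (3, 4), (4, 1)]
cycle_bags = {0: {1, 2, 3}, 1: {1, 3, 4}}
cycle_ok = is_tree_decomposition([1, 2, 3, 4], cycle_edges, cycle_bags, [(0, 1)])

# Putting each cycle edge in its own bag along a path is not valid:
# the two bags containing vertex 1 are not connected.
bad_bags = {0: {1, 2}, 1: {2, 3}, 2: {3, 4}, 3: {4, 1}}
bad_ok = is_tree_decomposition(
    [1, 2, 3, 4], cycle_edges, bad_bags, [(0, 1), (1, 2), (2, 3)])

print(path_ok, width(path_bags), cycle_ok, width(cycle_bags), bad_ok)
```

The last example shows why this particular width-1 attempt fails on the cycle: connecting the bags in a path breaks the connectivity condition for vertex 1.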
![Page 90: The Veracity of Big Data - Pierre Senellart · 5/25/2016 · 3 / 38 AAFD & SFC’16 Pierre Senellart The Four Vs of Big Data Volume:Datavolumesbeyondwhatismanageablebytraditional](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed37e22847f87317f77c401/html5/thumbnails/90.jpg)
35 / 38 AAFD & SFC’16 Pierre Senellart
Dichotomy for PQE
An instance-based dichotomy result:
Upper bound (Amarilli et al. 2015). For I the treelike instances and Q the MSO queries
→ PQE is in linear time modulo arithmetic costs
- Also for expressive provenance representations
- Also with bounded-treewidth correlations

Lower bound (Amarilli et al. 2016). For any unbounded-treewidth family I and Q the FO queries
→ PQE is #P-hard under RP reductions, assuming:
- High-treewidth instances in I are easily constructible
- Signature arity is 2 (graphs)
![Page 93: The Veracity of Big Data - Pierre Senellart · 5/25/2016 · 3 / 38 AAFD & SFC’16 Pierre Senellart The Four Vs of Big Data Volume:Datavolumesbeyondwhatismanageablebytraditional](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed37e22847f87317f77c401/html5/thumbnails/93.jpg)
36 / 38 AAFD & SFC’16 Pierre Senellart
Application: Efficient querying of uncertain graphs (Maniu et al. 2014)
[Figure: example of an uncertain graph with probability-annotated labels such as "1: 0.75" and "2: 0.06"; elements labeled (α) to (ζ)]
Problem: Optimize query evaluation on probabilistic graphs
Challenge: Real graph data is not treelike
Methodology: Build partial tree decompositions and use different query evaluation techniques on treelike parts and on the rest of the data
![Page 94: The Veracity of Big Data - Pierre Senellart · 5/25/2016 · 3 / 38 AAFD & SFC’16 Pierre Senellart The Four Vs of Big Data Volume:Datavolumesbeyondwhatismanageablebytraditional](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed37e22847f87317f77c401/html5/thumbnails/94.jpg)
37 / 38 AAFD & SFC’16 Pierre Senellart
Outline
Introduction
Truth Finding
Probabilistic Databases
Conclusion
![Page 95: The Veracity of Big Data - Pierre Senellart · 5/25/2016 · 3 / 38 AAFD & SFC’16 Pierre Senellart The Four Vs of Big Data Volume:Datavolumesbeyondwhatismanageablebytraditional](https://reader033.fdocuments.net/reader033/viewer/2022042307/5ed37e22847f87317f77c401/html5/thumbnails/95.jpg)
38 / 38 AAFD & SFC’16 Pierre Senellart
Conclusion
The real world is uncertain
Tools we use to process the real world introduce uncertainty
Need for principled methods to:
- Estimate uncertainty (veracity, truthfulness...) of information
- Properly manage the confidence (probability, level of certainty...) in the information
- Keep information on the provenance of data
Thank you.