Probabilistic answers to relational queries (PARQ)
Octavian Udrea, Yu Deng, Edward Hung, V. S. Subrahmanian
Content
- Motivation and goals
- Running example
- Technical preliminaries
- CPO model
- CPO integration
- CPO inference algorithms
- Experimental results
- Ongoing work
Motivation
- Query algebras do not take semantics into account when computing answers
- Data is not always precise
  - Ambiguity, insufficient information
- Goal: use probabilistic ontologies to improve query answer recall and quality
The probabilistic solution
- Compute and return answers with high probability (> p_thr)
- Keep probabilities hidden from the user
- Problems
  - How do we assign a probability to each data item?
  - How do we choose p_thr?
Concepts
- Constraint probabilistic ontologies
  - Is-a graph with edges labeled with probabilities
  - Including conditional probabilities
  - Disjoint decompositions
- Ontologies associated with terms in a data source
  - Attributes in a relation/XML
  - Propositional entities in text sources
Running example
Email fragment: “Ed Masters opposed the new marketing policy during the board meeting... Eric claimed Ed was not aware of the situation in the financial unit...”
[Figure: is-a graph of the example ontology, rooted at OrganizationUnit. Its subclasses form two disjoint decompositions (nodes marked "d"): {Committee, Board, Team, Department} and {Legal, Executive, Financial, Marketing}. Lower-level classes: JudicialBoard, FinancialReviewBoard, AuditingCommittee, Board of Directors, Sales Department. Every edge carries a probability label (values between 0.1 and 0.95); two edges also carry a conditional probability: 0.5, <(hasSubject marketing), 0.65> and 0.5, <(isPresent Ed_Masters), 0.75>.]
Example: decompositions
[Figure: the running-example ontology, repeated to point out its disjoint decompositions (the nodes marked "d").]
Example: probability labels
[Figure: the running-example ontology, repeated to point out the probability labels on its edges.]
Example: conditional probabilities
[Figure: the running-example ontology, repeated to point out its two conditional probability labels: 0.5, <(hasSubject marketing), 0.65> and 0.5, <(isPresent Ed_Masters), 0.75>.]
Running example: Sample queries
“Ed Masters opposed the new marketing policy during the board meeting... Eric claimed Ed was not aware of the situation in the financial unit...”
- What type of board meeting is being discussed?
  - Since Ed Masters is present, there is a 75% probability it is a Board of Directors meeting
- What type of financial unit is referenced?
  - Since the subject is marketing policy, there is a 65% probability it is the Financial Review Board
Technical preliminaries: POB
- POB schema: $(C, \Rightarrow, me, \wp)$
  - $C$ is a finite set of classes
  - $(C, \Rightarrow)$ is a directed acyclic graph
  - $me$ produces clusters (disjoint decompositions) for each node
    - $me(\text{OrganizationUnit})$ = {{Committee, Board, Team, Department}, {Legal, Executive, Financial, Marketing}}
  - $\wp$ maps each edge in $(C, \Rightarrow)$ to a positive rational number in [0,1], such that
    $$\forall c \in C,\ \forall L \in me(c): \quad \sum_{d \in L} \wp(d, c) \le 1$$
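The cluster condition on a POB (within every cluster, the edge probabilities sum to at most 1) can be checked mechanically. The following is a minimal sketch, not the authors' implementation: the function name `cluster_sums_ok` and the dictionary encoding are assumptions, and the pairing of label values with classes is assumed for illustration only.

```python
from fractions import Fraction


def cluster_sums_ok(me, wp):
    """Check that the labels of every cluster sum to at most 1.

    me maps a class c to its list of clusters (each a list of subclasses).
    wp maps an edge (d, c), read "d is a subclass of c", to its probability.
    """
    for c, clusters in me.items():
        for cluster in clusters:
            if sum(wp[(d, c)] for d in cluster) > 1:
                return False
    return True


# Illustrative values only: which label belongs to which class is assumed.
me = {
    "OrganizationUnit": [
        ["Committee", "Board", "Team", "Department"],
        ["Legal", "Executive", "Financial", "Marketing"],
    ]
}
labels = {
    "Committee": "0.1", "Board": "0.5", "Team": "0.1", "Department": "0.2",
    "Legal": "0.1", "Executive": "0.4", "Financial": "0.4", "Marketing": "0.1",
}
# Exact rationals avoid floating-point noise when a cluster sums to exactly 1.
wp = {(d, "OrganizationUnit"): Fraction(p) for d, p in labels.items()}
```

Rational arithmetic is used because the schema asks for rational labels and a cluster may sum to exactly 1.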
Back to the example
[Figure: the running-example ontology, repeated for reference.]
Constraint probabilities
- Simple constraints: $A_i \in D$, with $D \subseteq dom(A_i)$
  - Only for entities NOT represented in the current ontology
- Nil constraint: $D = dom(A_i)$
- Constraint probabilities: a pair $(p, \gamma)$, with $p$ in [0,1] and $\gamma$ a conjunction of simple constraints
Labeling
- Labeling should not be arbitrary
  - Invalid labeling may lead to time-consuming consistency algorithms
  - And to ambiguity in interpreting query answers
- Valid labeling:
  - No constraint refers to the entities associated with this ontology
  - There is exactly one nil constraint probability on each edge
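The two validity conditions translate directly into a check over edge labels. This is a hedged sketch under stated assumptions: the `ConstraintProbability` class, the encoding of a nil constraint as an empty conjunction, and the attribute name `unitType` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConstraintProbability:
    p: float
    # Conjunction of simple constraints, each an (attribute, allowed_values)
    # pair. Assumption: the nil constraint is encoded as the empty conjunction.
    constraints: tuple = ()

    def is_nil(self):
        return len(self.constraints) == 0


def is_valid_labeling(labels, own_attributes):
    """labels maps an edge to its list of constraint probabilities.

    own_attributes holds the attributes this ontology is associated with;
    no constraint may refer to them.
    """
    for cps in labels.values():
        # Exactly one nil constraint probability on each edge.
        if sum(cp.is_nil() for cp in cps) != 1:
            return False
        for cp in cps:
            if not 0.0 <= cp.p <= 1.0:
                return False
            if any(attr in own_attributes for attr, _ in cp.constraints):
                return False
    return True


edge = ("Board of Directors", "Board")
present = (("isPresent", frozenset({"Ed_Masters"})),)
labels = {edge: [ConstraintProbability(0.5),
                 ConstraintProbability(0.75, present)]}
```

With this encoding, an edge labeled `0.5, <(isPresent Ed_Masters), 0.75>` carries one nil and one conditional constraint probability.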
The CPO model
- CPO: $(C, \Rightarrow, me, \varphi)$
  - $C$ is a finite set of classes
  - $(C, \Rightarrow)$ is a directed acyclic graph
  - $me$ produces clusters (disjoint decompositions) for each node
  - $\varphi$ is a valid labeling for $(C, \Rightarrow)$
- Note there is no condition on the probabilities... yet!
CPO enhanced data sources
- Associate CPOs with some attributes of a relation.
- Associate CPOs with elements in an XML data store.
- Associate CPOs with some keywords for text files.
CPO_k
- At most k probabilities on each edge
- CPO_1 is a POB
Answering queries
“Ed Masters opposed the new marketing policy during the board meeting... Eric claimed Ed was not aware of the situation in the financial unit...”
- What type of board meeting is being discussed?
  - Since Ed Masters is present, there is a 75% probability it is a Board of Directors meeting
- Goal: associate probabilities with possible answers.
Probability path
[Figure: the running-example ontology, repeated to illustrate a probability path.]
Probability path
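- $c \leadsto_p d$ if:
  - $c \Rightarrow c_1 \Rightarrow c_2 \Rightarrow \dots \Rightarrow c_k \Rightarrow d$
  - $f$ is a function defined on the chain, with $f(x, y) \in \varphi(x, y)$
    - $f$ selects one probability on each edge
  - $p = \prod_{x \Rightarrow y} p_{x,y}$, where $p_{x,y}$ is the probability of the pair $f(x, y)$
- $f(c \leadsto_p d)$ is the set of constraints selected by $f$, along with the probabilities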
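A probability path can be sketched in code, assuming the probability of a path is the product of the probabilities selected on its edges. The selection rule used here (prefer the most specific conditional constraint satisfied by the query context, otherwise fall back to the nil constraint) is an assumption made for illustration, as are the function names and the 0.5 label on the Board to OrganizationUnit edge.

```python
import math


def select(cps, context):
    """Pick one (p, constraints) pair from the labels of a single edge.

    cps is a list of (p, constraints) pairs; constraints is a dict mapping an
    attribute to its required value, and {} stands for the nil constraint.
    """
    satisfied = [
        (p, g) for p, g in cps
        if g and all(context.get(a) == v for a, v in g.items())
    ]
    if satisfied:
        # Assumed rule: the constraint with the most conjuncts wins.
        return max(satisfied, key=lambda cp: len(cp[1]))
    return next((p, g) for p, g in cps if not g)


def path_probability(chain_labels, context):
    """Multiply the probabilities selected along the chain c => ... => d."""
    return math.prod(select(cps, context)[0] for cps in chain_labels)


# Edge Board of Directors => Board, labeled 0.5, <(isPresent Ed_Masters), 0.75>.
bod_to_board = [(0.5, {}), (0.75, {"isPresent": "Ed_Masters"})]
# Edge Board => OrganizationUnit; the value 0.5 is assumed for illustration.
board_to_unit = [(0.5, {})]
```

With Ed Masters present, the single-edge path picks the conditional label 0.75, which matches the 75% figure in the sample query; without that context it falls back to 0.5.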
CPO consistency
- CPO $(C, \Rightarrow, me, \varphi)$
- An arbitrary universe of objects $O$
- Interpretation ε is a mapping from $C$ to $2^O$
- ε is a taxonomic model iff:
  - We assign objects to each class
  - Objects cannot be shared between classes in the same cluster
  - => edges imply subset relations on the sets of objects assigned to each class
  - If A => B is labeled with probability p, at least p percent of objects in A are also assigned to B
CPO consistency (cont’d)
- A CPO is consistent iff it has a taxonomic probabilistic model
- Deciding if a CPO is consistent is NP-complete
  - The weight formula satisfiability problem.
- A non-deterministic algorithm for consistency checking is straightforward.
Consistency approach
- Identify a subclass of CPOs for which we can check consistency
- Two parts:
  - Pseudoconsistency: this was done for POBs
  - Well-structuredness: particular to CPOs
Pseudoconsistent CPO
- CPO $(C, \Rightarrow, me, \varphi)$
- No two classes in the same cluster have a common subclass
- The graph is rooted
- Any two distinct immediate subclasses of c either:
  - Have no common subclass
  - Have a greatest common subclass different from them
- No cycles
- If c inherits from multiple clusters, all paths from descendants of c to the root go through c
Pseudoconsistency
[Figure: the running-example ontology, repeated as an illustration of pseudoconsistency.]
Weight factor
- A set P of non-nil constraint probabilities
- If P is the empty set, wf(P) = 0
- If P = {(p,γ)}, wf(P) = p
- wf(P U Q) = wf(P) + wf(Q) - wf(P) * wf(Q)
- Intuitive meaning: how many objects from class A do I have to assign to class B to satisfy the constraints?
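The recursive definition of the weight factor can be folded over the probabilities one at a time. The sketch below is illustrative only: the name `weight_factor` is hypothetical, and it takes just the probabilities p of the pairs (p, γ), ignoring the constraints themselves.

```python
from functools import reduce


def weight_factor(probabilities):
    """Fold wf(P U Q) = wf(P) + wf(Q) - wf(P) * wf(Q) over singleton sets.

    Starts from wf(empty set) = 0 and adds one constraint probability at a
    time, so a singleton {(p, gamma)} yields p.
    """
    return reduce(lambda acc, p: acc + p - acc * p, probabilities, 0.0)
```

Because a + b - ab = 1 - (1 - a)(1 - b), the result equals one minus the product of the complements (1 - p), so it never exceeds 1 and does not depend on the order in which the probabilities are combined.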
More weight factors
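- CPO $(C, \Rightarrow, me, \varphi)$, with $c \Rightarrow d$ an edge
- We write: $\varphi(c, d) = \{(p_0, true)\} \cup \varphi'(c, d)$
- We define: $w(c, d) = \max(p_0, \mathit{wf}(\varphi'(c, d)))$
- Result: the conditions of the taxonomic interpretation can be satisfied by selecting at most $w(c, d) \cdot |O_d|$ objects from d into c.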
Well-structured CPO
- Conditional constraints on edges from the same cluster must be disjoint
  - Otherwise, it is impossible to compute a weight factor for the cluster edges.
- The sum of the weight factors for edges in a cluster is ≤ 1
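The sum condition can be checked once each edge has a weight. The sketch below assumes the edge weight is w(c,d) = max(p0, wf(φ'(c,d))), with p0 the nil probability and φ' the conditional constraint probabilities. It checks only the sum condition, not disjointness of the constraints. Function names and the grouping of the example labels into one cluster are assumptions.

```python
def weight_factor(probabilities):
    """wf of a set of non-nil constraint probabilities (0 for the empty set)."""
    wf = 0.0
    for p in probabilities:
        wf = wf + p - wf * p
    return wf


def edge_weight(p0, conditional_ps):
    """w(c, d) = max(p0, wf(conditional constraint probabilities))."""
    return max(p0, weight_factor(conditional_ps))


def cluster_weights_ok(cluster_edges, tol=1e-9):
    """cluster_edges lists (p0, conditional_ps) for each edge of one cluster."""
    total = sum(edge_weight(p0, cond) for p0, cond in cluster_edges)
    return total <= 1 + tol


# Assumed to sit in one cluster: 0.5, <(isPresent Ed_Masters), 0.75>
# and 0.1, <NOT (isPresent Ed_Masters), 0.2>.
cluster = [(0.5, [0.75]), (0.1, [0.2])]
```

Here the two edge weights are 0.75 and 0.2, so the cluster passes; adding a third edge with nil probability 0.15 would push the sum past 1.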
Well-structuredness
[Figure: the running-example ontology, now with a second conditional label, 0.1, <NOT (isPresent Ed_Masters), 0.2>, alongside 0.5, <(isPresent Ed_Masters), 0.75>.]
Consistent CPOs revisited
- A pseudoconsistent and well-structured CPO is consistent
  - Pseudoconsistency accounts for most of the conditions in the taxonomic interpretation
  - Well-structuredness accounts for the assignment of objects to subclasses
Consistency checking algorithm
- Pseudoconsistency is O(n^2 e) and well-structuredness is O(n^2 k^2)
  - n: the number of classes
  - e: the number of edges
  - k: the order of the CPO
- Algorithm based on:
  - Topological sort
  - Dijkstra and derivatives
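As an illustration of the topological-sort ingredient, the sketch below checks just two of the pseudoconsistency conditions: that the graph is rooted and has no cycles. It is not the authors' algorithm. It uses Kahn's algorithm, reads "rooted" as "exactly one class without a superclass" (an assumption), and assumes every edge endpoint is listed in `classes`.

```python
from collections import defaultdict, deque


def is_rooted_dag(classes, edges):
    """edges are (sub, sup) pairs, read "sub => sup" (sub is-a sup)."""
    parents = defaultdict(set)
    children = defaultdict(set)
    for sub, sup in edges:
        parents[sub].add(sup)
        children[sup].add(sub)

    roots = [c for c in classes if not parents[c]]
    if len(roots) != 1:
        return False

    # Kahn's algorithm from the root downward: a class is released once all
    # of its superclasses have been visited. Classes on a cycle never are.
    remaining = {c: len(parents[c]) for c in classes}
    queue = deque(roots)
    visited = 0
    while queue:
        c = queue.popleft()
        visited += 1
        for child in children[c]:
            remaining[child] -= 1
            if remaining[child] == 0:
                queue.append(child)
    return visited == len(classes)


classes = ["OrganizationUnit", "Board", "Board of Directors"]
edges = [("Board", "OrganizationUnit"), ("Board of Directors", "Board")]
```

The pass is linear in the number of classes and edges; the remaining pseudoconsistency conditions (common subclasses, cluster inheritance) need further checks not shown here.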
CPO enhanced algebras
- CPO enhanced algebras formally defined for:
  - Relational data sources
  - XML data stores
  - Selection, projection, product, join, etc.
- Ongoing work: RDF enhanced query algebra
  - Directly related to RDF extraction from text.
CPO integration: motivation
[Figure: two CPOs side by side.
ACME corp. CPO: OrganizationUnit with subclasses Board, Committee, Executive, Financial; below them FinancialReviewBoard, AuditingCommittee, Board of Directors. Labels include 0.5, <(hasSubject marketing), 0.65> and 0.5, <(isPresent Ed_Masters), 0.75>.
EVIL corp. CPO: OrganizationUnit with subclasses Board, Team, Department, Management, Financial, Marketing; below them FO Board, Board of Directors, Sales Department. Labels include 0.3, <(isPresent Ed_Masters), 0.75>.]
Equality constraints: Management :=: Financial, FinancialReviewBoard :=: FO Board
Email from ACME corp. to EVIL corp.: “During your last FO board meeting, the rising costs of quality assurance were not addressed. We would like to include this in our next auditing committee meeting...”
Merging CPOs
- Two scenarios:
  - One data source that refers to similar entities but from different application domains.
    - Example: ACME-EVIL correspondence
  - Queries across multiple data sources
    - Example: two different CPOs associated with distinct relations during a join query.
Interoperation constraints
- Since the CPOs being merged refer to similar entities, some classes may be equivalent
  - Equality constraints c1:=:c2
- Possibility: immediate subclassing constraints
  - Not really used, hardly feasible
The integration problem
- Two CPOs S1 = (C1, =>1, me1, φ1), S2 = (C2, =>2, me2, φ2)
- Set of interoperation constraints I
- An integration witness is another CPO S = (C, =>, me, φ) that satisfies S1, S2 and I
Integration witness
- Every class c in C1 U C2
  - Appears in C, OR
  - c:=:d appears in I and d ∈ C
  - i.e. no classes get “lost”
- Similarly, no edges are lost
- No constraints are lost
- If the same constraint appears on the same edge in both CPOs, take a probability p between the two probabilities
Integration witness
- Immediate subclassing constraints add edges to S
- No cluster can be split as a result of merging
- S is pseudoconsistent and well-structured (if it’s not, it’s of no use)
  - Open problem: if it is not, how can we minimally change it such that it has these properties?
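The "no classes get lost" condition on an integration witness is easy to test in isolation. The sketch below covers only that one condition (not edges, constraints, clusters, or consistency). The function name and the symmetric reading of `c :=: d` are assumptions.

```python
def classes_preserved(c1, c2, merged, equalities):
    """Check that every class of C1 U C2 survives in the merged class set.

    equalities is an iterable of (c, d) pairs meaning c :=: d; the relation
    is treated as symmetric here, which is an assumption.
    """
    merged = set(merged)
    equal_to = {}
    for c, d in equalities:
        equal_to.setdefault(c, set()).add(d)
        equal_to.setdefault(d, set()).add(c)

    for c in set(c1) | set(c2):
        if c in merged:
            continue
        # c itself is absent, so some class equal to it must be present.
        if not (equal_to.get(c, set()) & merged):
            return False
    return True


acme = {"OrganizationUnit", "Board", "FinancialReviewBoard"}
evil = {"OrganizationUnit", "Board", "FO Board"}
merged = {"OrganizationUnit", "Board", "FO Board"}
interop = [("FinancialReviewBoard", "FO Board")]
```

In this example FinancialReviewBoard is absent from the merged set but is covered by the equality constraint with FO Board, so the check passes.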
CPOmerge algorithm
- CPOmerge produces an integration witness if one exists
- O(n^3): costly
- In practice, much more efficient through:
  - Caching
  - Some properties are preserved if the original ontologies are pseudoconsistent and well-structured
Who writes the interop constraints?
- User: not feasible
- How to infer them?
  - Intuitive solution: if enough neighbors are in equality constraints, then infer that the respective nodes should be equivalent.
  - But we still need some equivalence constraints to get started: use lexical distance
  - How many neighbors are “enough”?
ICI – Simple solution
- Neighbor: parent, immediate child, sibling from the same cluster
- We define
  $$p_e(c, d) = \frac{2 n_e}{n_c + n_d + 2}$$
  - $n_e$: number of neighbors in equality constraints
  - $n_c$, $n_d$: number of neighbors of c, d
- Why? Number of equal neighbors / total number of neighbors (including self). Always < 1
- ICI algorithm: if $p_e$ exceeds a threshold, assume they are equal
  - Start with lexical distance
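The ICI score can be computed from two neighbor sets and the equality constraints known so far. This sketch assumes the score is 2·n_e / (n_c + n_d + 2), where n_e counts pairs of neighbors (one of c, one of d) already linked by an equality constraint. The function name, the pair-counting rule, and the class names are hypothetical.

```python
def equality_score(neighbors_c, neighbors_d, equalities):
    """p_e(c, d) = 2 * n_e / (n_c + n_d + 2).

    neighbors_c, neighbors_d: sets of neighbors (parents, immediate children,
    same-cluster siblings) of c and d.
    equalities: iterable of (x, y) pairs meaning x :=: y, in either order.
    """
    known = {frozenset(pair) for pair in equalities}
    n_e = sum(
        1
        for x in neighbors_c
        for y in neighbors_d
        if frozenset((x, y)) in known
    )
    # The "+ 2" counts c and d themselves among the neighbors.
    return 2 * n_e / (len(neighbors_c) + len(neighbors_d) + 2)


neighbors_c = {"a1", "a2", "a3"}
neighbors_d = {"b1", "b2"}
equalities = [("a1", "b1"), ("b2", "a2")]
score = equality_score(neighbors_c, neighbors_d, equalities)
```

With two matched pairs among three and two neighbors, the score is 4/7; c and d would be inferred equal only if that exceeds the chosen threshold.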
Give me a CPO…
- Very little work so far on probabilistic ontologies.
  - Nothing resembling CPOs around
- How do we infer them:
  - How do we build disjoint decompositions?
  - How do we infer probabilities?
Building disjoint decompositions
- Take regular ontologies from the Web
  - Many sources: daml.org, SchemaWeb, OntoBroker
- Modify CPOmerge to ignore labeling
- The merge result will contain disjoint decompositions
- Equality constraints can be inferred through ICI
Infer probabilities – simple methods
- Simple methods:
  - Distribute probabilities uniformly within each cluster
  - For each cluster L in me(c), d=>c,
    $$p_{d,c} = \frac{e^{Dist(d,c)}}{\sum_{e \Rightarrow c} e^{Dist(e,c)}}$$
  - For any distance function (lexical or otherwise)
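Both simple methods amount to normalising weights over sibling classes. The sketch below shows a uniform split and an exponential weighting of a distance function. All names are hypothetical, and the toy distance is an assumption. The exponent is applied to the distance exactly as passed in, so to give closer classes larger probabilities, pass a negated distance.

```python
import math


def uniform_labels(cluster):
    """Give every class of a cluster the same probability, 1 / |cluster|."""
    return {d: 1 / len(cluster) for d in cluster}


def exp_normalized_labels(parent, children, dist):
    """p[d] = exp(dist(d, parent)) / sum over e of exp(dist(e, parent))."""
    weights = {d: math.exp(dist(d, parent)) for d in children}
    total = sum(weights.values())
    return {d: w / total for d, w in weights.items()}


def toy_dist(a, b):
    # Hypothetical stand-in for a lexical distance: difference in name length.
    return abs(len(a) - len(b)) / 10


cluster = ["Committee", "Board", "Team", "Department"]
uniform = uniform_labels(cluster)
weighted = exp_normalized_labels("OrganizationUnit", cluster, toy_dist)
```

Either way the probabilities over the normalised set sum to 1, so the requirement that a cluster's labels sum to at most 1 holds by construction.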
Advanced methods
- Probabilistic relational models with structural uncertainty
  - Work by Dr. Getoor et al.
- Classification approach
  - Feature extraction determines entities of interest
  - Create conditional probabilities on those entities
- User feedback approach
  - General, applicable to any of the above (ongoing work)
Experimental setup
- Java implementation
- CPO enhanced relational DB
  - Movies database maintained by Dr. Wiederhold
  - IMDB data
- IMDB to estimate recall
- Classifications from the Web to build initial CPO
Consistency check & inference
[Figure: running time [s] (axis 0 to 5) versus ontology size [no. of classes] (45 to 342); series: consistency, CPO inference.]
Recall
[Figure: recall (axis 0 to 0.9) versus ontology size [no. of classes] (45 to 342); series: threshold 0.6, threshold 0.7, threshold 0.8, threshold 0.9, relational.]
Precision
[Figure: precision (axis 0 to 1.2) versus ontology size [no. of classes] (45 to 342); series: precision for p = 0.6, 0.7, 0.8, 0.9, and relational.]
Answer quality
[Figure: quality, defined as SQRT(Precision*Recall) (axis 0.6 to 0.8), versus ontology size [no. of classes] (0 to 400); series: quality for p = 0.6, 0.7, 0.8, 0.9, and relational.]
Query running time
[Figure: running time [s] (axis 0 to 4.5) versus ontology size [no. of classes] (45 to 342); series: running time for p = 0.6, 0.7, 0.8, 0.9, and relational.]
ICI quality
[Figure: quality (axis 0.5 to 0.8) versus epsilon (0 to 1.2); series: join quality, relational join quality.]
Bottom line
- Clear improvement in query answer quality
  - Some time penalty, but reasonable
- Very little user intervention
- CPOs are suited for a wide variety of data sources
  - Potentially, they can be used to convey semantics across heterogeneous data sources
Current experimental setup
- DBLP data
  - Over 60 years of scientific publications
  - XML data set
- CPOs from complex ontologies
  - DBLP classification
  - ACM classification of subjects
Goals (1)
- Determine the efficiency of advanced CPO inference methods
- Experimentally determine the best approach in terms of minimizing user feedback
Goals (2)
- Use CPOs with RDF databases
  - For extracting RDF from text as a means of using semantic information
  - For answering queries from RDF databases
  - Benefits:
    - Probabilistic model is clearly formalized
    - Proven improvement in answer quality
- Experimentally determine what the probability threshold may be for various domains
Thank you