Probabilistic answers to relational queries (PARQ) Octavian Udrea Yu Deng Edward Hung V. S....

Probabilistic answers to relational queries (PARQ)

Octavian Udrea, Yu Deng, Edward Hung, V. S. Subrahmanian

Content

Motivation and goals
Running example
Technical preliminaries
CPO model
CPO integration
CPO inference algorithms
Experimental results
Ongoing work

Motivation

Query algebras do not take semantics into account when computing answers

Data is not always precise: ambiguity, insufficient information

Goal: Use probabilistic ontologies to improve query answer recall and quality

The probabilistic solution

Compute and return answers with high probability (> p_thr)

Keep probabilities hidden from the user

Problems:
How do we assign a probability to each data item?
How do we choose p_thr?

Concepts

Constraint probabilistic ontologies
Is-a graph with edges labeled with probabilities
Including conditional probabilities
Disjoint decompositions

Ontologies associated with terms in a data source
Attributes in a relation/XML
Propositional entities in text sources

Running example

Email fragment: “Ed Masters opposed the new marketing policy during the board meeting...Eric claimed Ed was not aware of the situation in the financial unit...”

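[Figure: example CPO rooted at OrganizationUnit, with disjoint decompositions (nodes marked d), including {Committee, Board, Team, Department} and {Legal, Executive, Financial, Marketing}. Lower-level classes include JudicialBoard, FinancialReviewBoard, AuditingCommittee, Board of Directors and Sales Department. Edges carry probability labels between 0.1 and 0.95; two edges also carry conditional labels: 0.5, <(hasSubject marketing), 0.65> and 0.5, <(isPresent Ed_Masters), 0.75>.]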

Example: decompositions


[Figure: the example CPO from the running example, illustrating its disjoint decompositions (nodes marked d).]


Example: probability labels


[Figure: the example CPO from the running example, illustrating the probability labels on its edges.]


Example: conditional probabilities


[Figure: the example CPO from the running example, illustrating the conditional probabilities on its edges.]


Running example: Sample queries

“Ed Masters opposed the new marketing policy during the board meeting...Eric claimed Ed was not aware of the situation in the financial unit...”

What type of board meeting is being discussed?
Since Ed Masters is present, there is a 75% probability it is a board of directors meeting.

What type of financial unit is referenced?
Since the subject is marketing policy, there is a 65% probability it is the Financial Review Board.
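
The two answers above amount to a lookup on an edge label: use the conditional probability when its condition holds in the text, otherwise fall back to the unconditional one. The following is a minimal Python sketch of that idea, not the authors' implementation. It assumes that the labels 0.5, <(isPresent Ed_Masters) 0.75> and 0.5, <(hasSubject marketing), 0.65> sit on the edges Board of Directors => Board and FinancialReviewBoard => Financial (inferred from the answers, not stated explicitly); `EdgeLabel` and `edge_probability` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class EdgeLabel:
    """Unconditional probability plus optional conditional probabilities."""
    unconditional: float
    # Maps a condition such as ("isPresent", "Ed_Masters") to a probability.
    conditional: dict = field(default_factory=dict)


def edge_probability(label: EdgeLabel, facts: set) -> float:
    """Return the first conditional probability whose condition is among
    the known facts, or the unconditional probability if none applies."""
    for condition, probability in label.conditional.items():
        if condition in facts:
            return probability
    return label.unconditional


# Assumed placement of the two conditional labels from the running example.
directors_to_board = EdgeLabel(0.5, {("isPresent", "Ed_Masters"): 0.75})
review_board_to_financial = EdgeLabel(0.5, {("hasSubject", "marketing"): 0.65})

# Facts extracted (by hand here) from the email fragment.
email_facts = {("isPresent", "Ed_Masters"), ("hasSubject", "marketing")}

p_directors = edge_probability(directors_to_board, email_facts)
p_review_board = edge_probability(review_board_to_financial, email_facts)
p_no_context = edge_probability(directors_to_board, set())
```

With the email facts, the lookup selects 0.75 and 0.65; without any context it falls back to 0.5.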

Technical preliminaries: POB

POB schema $(C, \Rightarrow, me, \wp)$:
C is a finite set of classes
$(C, \Rightarrow)$ is a directed acyclic graph
me produces clusters (disjoint decompositions) for each node
me(OrganizationUnit) = {{Committee, Board, Team, Department}, {Legal, Executive, Financial, Marketing}}
$\wp$ maps each edge in $(C, \Rightarrow)$ to a positive rational number in [0,1]
$\forall c \in C,\ \forall L \in me(c):\ \sum_{d \in L} \wp(d, c) \le 1$
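
As a rough illustration of the schema, the sketch below stores me and the edge labels in plain dictionaries and checks that the labels of each cluster sum to at most 1. It is a minimal sketch under stated assumptions: the eight numbers are those printed under the root in the example figure, but their pairing with individual classes is assumed, and `violating_clusters` is a hypothetical helper name.

```python
from fractions import Fraction

# me(c): the clusters (disjoint decompositions) of each class c.
me = {
    "OrganizationUnit": [
        ["Committee", "Board", "Team", "Department"],
        ["Legal", "Executive", "Financial", "Marketing"],
    ],
}

# Edge labels keyed by (subclass, superclass). The pairing of numbers with
# classes is an assumption made for illustration.
values = ["0.1", "0.5", "0.1", "0.2", "0.1", "0.4", "0.4", "0.1"]
classes = [d for cluster in me["OrganizationUnit"] for d in cluster]
labels = {(d, "OrganizationUnit"): Fraction(v) for d, v in zip(classes, values)}


def violating_clusters(me, labels):
    """Return the (class, cluster) pairs whose labels sum to more than 1."""
    return [
        (c, cluster)
        for c, clusters in me.items()
        for cluster in clusters
        if sum(labels[(d, c)] for d in cluster) > 1
    ]


schema_ok = not violating_clusters(me, labels)
```

Exact rationals (`Fraction`) are used because the labels are rational numbers and a cluster may sum to exactly 1, which floating-point addition can overshoot.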

Back to the example


[Figure: the example CPO from the running example, shown again to illustrate the schema definitions.]


Constraint probabilities

Simple constraints: $A_i \in D$, with $D \subseteq dom(A_i)$
Only for entities NOT represented in the current ontology
Nil constraint: $D = dom(A_i)$

Constraint probabilities: pair $(p, \gamma)$, with p in [0,1] and $\gamma$ a conjunction of simple constraints

Labeling

Labeling should not be arbitrary
Invalid labeling may lead to time-consuming consistency algorithms
And to ambiguity in interpreting query answers

Valid labeling:
No constraint refers to the entities associated with this ontology
There is exactly one nil constraint probability on each edge
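
The two validity conditions can be checked mechanically. The sketch below is one possible encoding, not the authors' data model: a constraint is a dictionary from attribute to allowed values (read as a conjunction of simple constraints), the empty dictionary stands for the nil constraint, and `unitType` is a hypothetical name for the attribute this ontology is associated with.

```python
def is_valid_labeling(labels, ontology_attributes):
    """labels maps an edge to a list of (probability, constraint) pairs.

    A constraint is a dict {attribute: allowed_values}; the empty dict
    stands for the nil constraint.
    """
    for entries in labels.values():
        nil_count = sum(1 for _, constraint in entries if not constraint)
        if nil_count != 1:
            return False  # each edge needs exactly one nil constraint probability
        for _, constraint in entries:
            if any(attr in ontology_attributes for attr in constraint):
                return False  # constraint refers to this ontology's own entities
    return True


edge = ("Board of Directors", "Board")
own_attributes = {"unitType"}  # hypothetical attribute tied to this ontology

good = {edge: [(0.5, {}), (0.75, {"isPresent": {"Ed_Masters"}})]}
two_nils = {edge: [(0.5, {}), (0.6, {})]}
self_reference = {edge: [(0.5, {}), (0.75, {"unitType": {"Board"}})]}

results = [is_valid_labeling(l, own_attributes) for l in (good, two_nils, self_reference)]
```

Only the first labeling passes; the second has two nil constraint probabilities on one edge and the third constrains the ontology's own attribute.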

The CPO model

CPO $(C, \Rightarrow, me, \wp)$:
C is a finite set of classes
$(C, \Rightarrow)$ is a directed acyclic graph
me produces clusters (disjoint decompositions) for each node
$\wp$ is a valid labeling for $(C, \Rightarrow)$

Note there is no condition on the probabilities... yet!

CPO enhanced data sources

Associate CPOs with some attributes of a relation.

Associate CPOs with elements in an XML data store.

Associate CPOs with some keywords for text files.

CPO_k: at most k probabilities on each edge

CPO_1 is a POB

Answering queries

“Ed Masters opposed the new marketing policy during the board meeting...Eric claimed Ed was not aware of the situation in the financial unit...”

What type of board meeting is being discussed?
Since Ed Masters is present, there is a 75% probability it is a board of directors meeting.

Goal: Associate probabilities with possible answers.

Probability path


[Figure: the example CPO from the running example, illustrating a probability path between two classes.]


Probability path

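Probability path $c \leadsto_p d$:
c => c1 => c2 => … => ck => d
f is a function defined on the chain
f selects one probability on each edge: $f(x, y) \in \wp(x, y)$
$p = \prod_{x \Rightarrow y} p_{x,y}$, where $f(x, y) = (p_{x,y}, \gamma_{x,y})$
$\gamma_f(c \leadsto_p d)$ is the set of constraints selected by f along with the probabilities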
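
A probability path can be sketched as a walk along the chain that picks one labeled entry per edge. The code below is a minimal illustration under two assumptions: the selected edge probabilities are combined by multiplication, and the labels and their placement on the chain Board of Directors => Board => OrganizationUnit are taken from the running example for illustration only. `select` and `probability_path` are hypothetical names.

```python
import math

# Edge labels: lists of (probability, constraint); None is the nil constraint.
labels = {
    ("Board of Directors", "Board"): [(0.5, None), (0.75, "isPresent Ed_Masters")],
    ("Board", "OrganizationUnit"): [(0.5, None)],
}


def select(facts):
    """Build f: pick the first satisfied conditional entry, else the nil entry."""
    def f(x, y):
        entries = labels[(x, y)]
        for p, constraint in entries:
            if constraint is not None and constraint in facts:
                return p, constraint
        return next(entry for entry in entries if entry[1] is None)
    return f


def probability_path(chain, f):
    """Return (p, constraints): the product of the selected probabilities
    (assumed combination rule) and the set of non-nil constraints selected."""
    selected = [f(x, y) for x, y in zip(chain, chain[1:])]
    p = math.prod(prob for prob, _ in selected)
    constraints = {c for _, c in selected if c is not None}
    return p, constraints


chain = ["Board of Directors", "Board", "OrganizationUnit"]
p_with, used_with = probability_path(chain, select({"isPresent Ed_Masters"}))
p_without, used_without = probability_path(chain, select(set()))
```

Different choices of f give different paths over the same chain, which is why the selected constraints are returned together with the probability.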

CPO consistency

CPO $(C, \Rightarrow, me, \wp)$
An arbitrary universe of objects O
Interpretation ε is a mapping from C to $2^O$

ε is a taxonomic model iff:
We assign objects to each class
Objects cannot be shared between classes in the same cluster
=> edges imply subset relations on the sets of objects assigned to each class
If A => B is labeled with probability p, at least a fraction p of the objects in B are also assigned to A

CPO consistency (cont’d)

A CPO is consistent iff it has a taxonomic probabilistic model

Deciding if a CPO is consistent is NP-complete
The weight formula satisfiability problem.
A non-deterministic algorithm for consistency checking is straightforward.

Consistency approach

Identify a subclass of CPOs for which we can check consistency

Two parts:
Pseudoconsistency – this was done for POBs
Well-structuredness – particular to CPOs

Pseudoconsistent CPO

CPO $(C, \Rightarrow, me, \wp)$
No two classes in the same cluster have a common subclass
The graph is rooted
Any two distinct immediate subclasses of c either:
Have no common subclass
Have a greatest common subclass different from them
No cycles
If c inherits from multiple clusters, all paths from descendants of c to the root go through c
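
Three of these conditions (no cycles, a single root, and no common subclass within a cluster) are plain graph checks. The sketch below shows one way to test them with Python's standard library; it is not the authors' algorithm, the remaining conditions are omitted, and the small hierarchy (including the class `TaskForce`) is hypothetical.

```python
from graphlib import CycleError, TopologicalSorter
from itertools import combinations


def is_acyclic(parents):
    """parents maps each class to the classes it directly inherits from."""
    try:
        tuple(TopologicalSorter(parents).static_order())
    except CycleError:
        return False
    return True


def is_rooted(parents):
    """Exactly one class has no superclass."""
    nodes = set(parents) | {p for ps in parents.values() for p in ps}
    return sum(1 for n in nodes if not parents.get(n)) == 1


def subclasses(parents, c):
    """c together with all of its direct and indirect subclasses."""
    found, frontier = {c}, [c]
    while frontier:
        node = frontier.pop()
        for child, ps in parents.items():
            if node in ps and child not in found:
                found.add(child)
                frontier.append(child)
    return found


def clusters_share_no_subclass(parents, me):
    """No two classes of the same cluster have a common subclass."""
    return all(
        subclasses(parents, a).isdisjoint(subclasses(parents, b))
        for clusters in me.values()
        for cluster in clusters
        for a, b in combinations(cluster, 2)
    )


# Hypothetical hierarchy: FinancialReviewBoard inherits from two clusters.
parents = {
    "Board": ["OrganizationUnit"],
    "Team": ["OrganizationUnit"],
    "Financial": ["OrganizationUnit"],
    "FinancialReviewBoard": ["Board", "Financial"],
}
me = {"OrganizationUnit": [["Board", "Team"], ["Financial"]]}
checks = (is_acyclic(parents), is_rooted(parents), clusters_share_no_subclass(parents, me))

# Adding a class under two members of the same cluster breaks the last check.
broken = dict(parents, TaskForce=["Board", "Team"])
broken_check = clusters_share_no_subclass(broken, me)
```

Inheriting from classes in different clusters, as FinancialReviewBoard does here, is allowed; only a shared subclass inside one cluster is rejected.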

Pseudoconsistency

[Figure: the example CPO from the running example, illustrating pseudoconsistency.]


Weight factor

A set P of non-nil constraint probabilities
If P is the empty set, wf(P) = 0
If P = {(p,γ)}, wf(P) = p
wf(P ∪ Q) = wf(P) + wf(Q) – wf(P) * wf(Q)

Intuitive meaning: how many objects from class A do I have to assign to class B in order to satisfy the constraints?
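
The recursive definition can be folded over the probabilities one at a time. A minimal Python sketch (the function takes only the probabilities of the non-nil constraint probabilities; the sample values are illustrative):

```python
from functools import reduce


def wf(probabilities):
    """Weight factor of a collection of non-nil constraint probabilities.

    Starts from wf(empty set) = 0 and applies
    wf(P U Q) = wf(P) + wf(Q) - wf(P) * wf(Q) one element at a time.
    """
    return reduce(lambda acc, p: acc + p - acc * p, probabilities, 0.0)


empty = wf([])
single = wf([0.2])
pair = wf([0.5, 0.5])
```

Since a + b - a*b = 1 - (1 - a)(1 - b), the result equals 1 minus the product of the (1 - p) terms, so it does not depend on the order in which the probabilities are combined.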

More weight factors

CPO $(C, \Rightarrow, me, \wp)$, c => d an edge
We write: $\wp(c, d) = \{(p_0, true)\} \cup \wp'(c, d)$
We define: $w(c, d) = \max(p_0, w_f(\wp'(c, d)))$
Result: the conditions of the taxonomic interpretation can be satisfied by selecting at most $w(c, d) \cdot |O_d|$ objects from d into c.

Well-structured CPO

Conditional constraints on edges from the same cluster must be disjoint
Otherwise, it is impossible to compute a weight factor for the cluster edges.
The sum of the weight factors for edges in a cluster is ≤ 1
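
The second condition can be checked numerically once each edge's weight is known. The sketch below assumes the edge weight is the larger of the nil probability and the weight factor of the conditional probabilities, as defined on the weight-factor slides; it checks only the sum condition, not the disjointness of constraints, and all numbers are hypothetical.

```python
from functools import reduce


def wf(probabilities):
    """Weight factor: combine probabilities with a + b - a * b, starting at 0."""
    return reduce(lambda acc, p: acc + p - acc * p, probabilities, 0.0)


def edge_weight(nil_probability, conditional_probabilities):
    """w = max(p0, wf(conditional probabilities)) for a single edge."""
    return max(nil_probability, wf(conditional_probabilities))


def cluster_sum_ok(cluster, tolerance=1e-9):
    """cluster maps each subclass to (nil probability, conditional probabilities)."""
    total = sum(edge_weight(p0, conds) for p0, conds in cluster.values())
    return total <= 1 + tolerance


# Hypothetical cluster: weights 0.6 + 0.1 + 0.2 = 0.9.
fits = {"A": (0.5, [0.6]), "B": (0.1, []), "C": (0.2, [])}
# Raising one conditional probability pushes the sum to 0.75 + 0.1 + 0.2 = 1.05.
overflows = {"A": (0.5, [0.75]), "B": (0.1, []), "C": (0.2, [])}

fits_ok = cluster_sum_ok(fits)
overflows_ok = cluster_sum_ok(overflows)
```

The second cluster shows how a conditional probability can break well-structuredness even though the nil probabilities alone sum to less than 1.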

Well-structuredness

[Figure: the example CPO from the running example, with one edge labeled 0.1, <NOT (IsPresent Ed_Masters) 0.2>, illustrating well-structuredness.]


Consistent CPOs revisited

A pseudoconsistent and well-structured CPO is consistent
Pseudoconsistency accounts for most of the conditions in the taxonomic interpretation
Well-structuredness accounts for the assignment of objects to subclasses

Consistency checking algorithm

Pseudoconsistency is O(n^2 e) and well-structuredness is O(n^2 k^2)
n – number of classes
e – number of edges
k – the order of the CPO

Algorithm based on:
Topological sort
Dijkstra and derivatives

CPO enhanced algebras

CPO enhanced algebras formally defined for:
Relational data sources
XML data stores
Selection, projection, product, join, etc.

Ongoing work: RDF enhanced query algebra
Directly related to RDF extraction from text.

CPO integration: motivation

[Figure: two CPOs side by side, each rooted at OrganizationUnit and carrying probability labels on its edges. ACME corp. CPO: classes Board, Committee, Executive, Financial, FinancialReviewBoard, AuditingCommittee, Board of Directors. EVIL corp. CPO: classes Board, Team, Department, Management, Financial, Marketing, FO Board, Board of Directors, Sales Department.]

Management :=: Financial
FinancialReviewBoard :=: FO Board

Email from ACME corp. to EVIL corp.: “During your last FO board meeting, the rising costs of quality assurance were not addressed. We would like to include this in our next auditing committee meeting....”

Merging CPOs

Two scenarios:
One data source that refers to similar entities but from different application domains.
Example: ACME – EVIL correspondence
Queries across multiple data sources
Example: Two different CPOs associated with distinct relations during a join query.

Interoperation constraints

Since the CPOs being merged refer to similar entities, some classes may be equivalent
Equality constraints c1:=:c2

Possibility: immediate subclassing constraints
Not really used – hardly feasible

The integration problem

Two CPOs S1 = (C1, =>1, me1, φ1), S2 = (C2, =>2, me2, φ2)

Set of interoperation constraints I
An integration witness is another CPO S = (C, =>, me, ℘) that satisfies S1, S2 and I

Integration witness

Every class c in C1 ∪ C2:
Appears in C, OR
c:=:d appears in I and d ∈ C
i.e. no classes get “lost”

Similarly, no edges are lost
No constraints are lost

If the same constraint appears on the same edge in both CPOs, take a probability p between the two probabilities
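
The "no classes get lost" condition is easy to test once the merged class set is known. The sketch below is an illustration rather than part of CPOmerge; it treats `:=:` as symmetric (an assumption), uses small subsets of the ACME and EVIL class names, and `lost_classes` is a hypothetical helper.

```python
def lost_classes(c1, c2, merged, equalities):
    """Classes of C1 U C2 that neither appear in the merged class set nor
    are declared equal (c :=: d) to a class that does."""
    def kept(c):
        if c in merged:
            return True
        partners = {b for a, b in equalities if a == c}
        partners |= {a for a, b in equalities if b == c}
        return any(d in merged for d in partners)

    return {c for c in c1 | c2 if not kept(c)}


acme = {"OrganizationUnit", "Board", "FinancialReviewBoard", "AuditingCommittee"}
evil = {"OrganizationUnit", "Board", "FO Board", "Management"}
equalities = [("FinancialReviewBoard", "FO Board")]

# FO Board is represented by FinancialReviewBoard in the merged class set.
merged = {"OrganizationUnit", "Board", "FinancialReviewBoard",
          "AuditingCommittee", "Management"}

nothing_lost = lost_classes(acme, evil, merged, equalities)
management_lost = lost_classes(acme, evil, merged - {"Management"}, equalities)
```

Analogous checks would be needed for edges and constraints; they are omitted here.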

Integration witness

Immediate subclassing constraints add edges to S

No cluster can be split as a result of merging

S is pseudoconsistent and well-structured (if it’s not, it’s of no use)
Open problem: If it is not, how can we minimally change it such that it has these properties?

CPOmerge algorithm

CPOmerge produces an integration witness if one exists

O(n^3) – costly
In practice, much more efficient through:
Caching
Some properties are preserved if the original ontologies are pseudoconsistent and well-structured

Who writes the interop constraints?

User – not feasible
How to infer them?
Intuitive solution: If enough neighbors are in equality constraints, then infer that the respective nodes should be equivalent.
But we still need some equivalence constraints to get started – use lexical distance
How many neighbors are “enough”?

ICI – Simple solution

Neighbor: parent, immediate child, sibling from the same cluster

We define $p_e(c, d) = \frac{2 n_e}{n_{c,d} + 2}$

n_e – number of neighbors in equality constraints
n_{c,d} – number of neighbors of c, d
Why? Number of equal neighbors / total number of neighbors (including self). Always < 1
ICI algorithm: if p_e exceeds a threshold, assume they are equal
Start with lexical distance

Give me a CPO…

Very little work so far on probabilistic ontologies.
Nothing resembling CPOs around

How do we infer them:
How do we build disjoint decompositions?
How do we infer probabilities?

Building disjoint decompositions

Take regular ontologies from the Web
Many sources: daml.org, SchemaWeb, OntoBroker
Modify CPOmerge to ignore labeling
The merge result will contain disjoint decompositions
Equality constraints can be inferred through ICI

Infer probabilities – simple methods

Simple methods:
Distribute probabilities uniformly within each cluster
For each cluster L in me(c), d => c: $p_{d,c} = \frac{e^{Dist(d,c)}}{\sum_{e \Rightarrow c} e^{Dist(e,c)}}$
For any distance function (lexical or otherwise)
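
The first simple method, uniform distribution within each cluster, can be sketched in a few lines. This is a minimal illustration using the clusters of OrganizationUnit from the running example; `uniform_labels` is a hypothetical name, and only the nil (unconditional) probability of each edge is produced.

```python
from fractions import Fraction


def uniform_labels(me):
    """Give every class of a cluster the same probability, 1 / cluster size.

    me maps a class c to its clusters; the result is keyed by
    (subclass, superclass).
    """
    return {
        (d, c): Fraction(1, len(cluster))
        for c, clusters in me.items()
        for cluster in clusters
        for d in cluster
    }


me = {
    "OrganizationUnit": [
        ["Committee", "Board", "Team", "Department"],
        ["Legal", "Executive", "Financial", "Marketing"],
    ],
}
labels = uniform_labels(me)
cluster_sums = [
    sum(labels[(d, "OrganizationUnit")] for d in cluster)
    for cluster in me["OrganizationUnit"]
]
```

By construction every cluster sums to exactly 1, so the resulting labels respect the bound of at most 1 per cluster.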

Advanced methods

Probabilistic relational models with structural uncertainty
Work by Dr. Getoor et al.

Classification approach
Feature extraction determines entities of interest
Create conditional probabilities on those entities

User feedback approach
General, applicable to any of the above (ongoing work)

Experimental setup

Java implementation
CPO enhanced relational DB

Movies database maintained by Dr. Wiederhold

IMDB data
IMDB to estimate recall
Classifications from the Web to build initial CPO

Consistency check & inference

[Figure: running time [s] (axis 0 to 5) versus ontology size [no. of classes] (45 to 342); series: consistency, CPO inference.]

Recall

[Figure: recall (axis 0 to 0.9) versus ontology size [no. of classes] (45 to 342); series: threshold 0.6, threshold 0.7, threshold 0.8, threshold 0.9, relational.]

Precision

[Figure: precision (axis 0 to 1.2) versus ontology size (45 to 342 classes); series: Precision p:0.6, Precision p:0.7, Precision p:0.8, Precision p:0.9, Precision relational.]

Answer quality

[Figure: quality, SQRT(Precision*Recall) (axis 0.6 to 0.8), versus a horizontal axis running from 0 to 400; series: Quality p:0.6, Quality p:0.7, Quality p:0.8, Quality p:0.9, Quality relational.]
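
The quality measure on the vertical axis, SQRT(Precision*Recall), is the geometric mean of precision and recall. A minimal sketch of the computation; the input values are made up for illustration and are not experimental results.

```python
import math


def quality(precision, recall):
    """Answer quality as the geometric mean of precision and recall."""
    return math.sqrt(precision * recall)


# Illustrative values only: high precision cannot fully offset low recall.
balanced = quality(0.6, 0.6)
skewed = quality(0.9, 0.4)
```

Both illustrative cases come out at roughly 0.6, even though the second has much higher precision, which is the behavior one expects from a geometric mean.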

Query running time

[Figure: query running time [s] (axis 0 to 4.5) versus ontology size (45 to 342 classes); series: Running time p:0.6, Running time p:0.7, Running time p:0.8, Running time p:0.9, Running time relational.]

ICI quality

[Figure: quality (axis 0.5 to 0.8) versus Epsilon (axis 0 to 1.2); series: Join quality, Relational join quality.]

Bottom line

Clear improvement in query answer quality
Some time penalty, but reasonable

Very little user intervention
CPOs are suited for a wide variety of data sources
Potentially, they can be used to convey semantics across heterogeneous data sources

Current experimental setup

DBLP data
Over 60 years of scientific publications
XML data set

CPOs from complex ontologies
DBLP classification
ACM classification of subjects

Goals (1)

Determine the efficiency of advanced CPO inference methods

Experimentally determine the best approach in terms of minimizing user feedback

Goals (2)

Use CPOs with RDF databases
For extracting RDF from text as a means of using semantic information
For answering queries from RDF databases

Benefits:
Probabilistic model is clearly formalized
Proven improvement in answer quality

Experimentally determine what the probability threshold may be for various domains

Thank you
