Page 1: Webdamlog and Contradictions

Webdamlog and Contradictions

Daniel Deutch

Tel Aviv University

Joint work with

Serge Abiteboul, Meghyn Bienvenu, Victor Vianu

Page 2: Webdamlog and Contradictions

Motivation

• In a distributed setting, contradictions and uncertainty naturally arise.

• Due to:
  – Different/contradictory opinions
  – Different viewpoints
  – Partial information
  – …

Page 3: Webdamlog and Contradictions

Example: Where is Alice

• Consider an IsIn(Person,City,Peer) relation
  – “Peer believes Person is in City”

• There is a natural Functional Dependency {Peer, Person} → City

• Now consider a datalog rule
  IsIn(Person,City,p) :- IsIn(Person,City,p’), Friend(p,p’)

• How to combine the contradictory opinions of two friends on the location of Alice?

• How to do so if the opinions are uncertain?
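As a small illustration of the functional dependency {Peer, Person} → City, the following Python sketch groups IsIn facts by (person, peer) and reports every group holding more than one city. The function name `fd_violations`, the tuple layout (person, city, peer) and the sample facts are assumptions made for this example only.

```python
from collections import defaultdict

def fd_violations(facts, lhs=(0, 2), rhs=1):
    """Return the (person, peer) keys that are associated with more than one city.

    Each fact is a (person, city, peer) tuple; lhs/rhs are positions in it.
    """
    seen = defaultdict(set)
    for fact in facts:
        key = tuple(fact[i] for i in lhs)
        seen[key].add(fact[rhs])
    return {key: cities for key, cities in seen.items() if len(cities) > 1}

# Hypothetical opinions: Ben has copied two contradictory beliefs about Alice.
is_in = [
    ("Alice", "Paris", "Peter"),
    ("Alice", "London", "Tom"),
    ("Alice", "Paris", "Ben"),
    ("Alice", "London", "Ben"),
]
print(fd_violations(is_in))
```

Only the key (Alice, Ben) is reported here, since Peter and Tom each hold a single opinion.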

Page 4: Webdamlog and Contradictions

Roadmap

• Centralized non-deterministic semantics
  – For Datalog in presence of FDs
  – We study properties of the semantics, computational and representation issues

• Quantifying non-determinism with probabilities
  – Studying computation of probabilities and explanation of answers

• Distributed settings

Page 5: Webdamlog and Contradictions

Centralized case

• For the centralized case we use the datalog syntax

• Standard (safe) datalog rules
  R(X1,…,Xn) :- R1(X11,…,X1m), …, Rk(Xk1,…,Xks)

• Functional Dependencies of the form R: 1,2 → 3

• We will change the datalog semantics to account for FDs; the resulting language is DatalogFD

Page 6: Webdamlog and Contradictions

First Semantics

• Non-deterministic inflationary fact-at-a-time semantics

• Re-define the immediate consequence operator such that
  – A fact is derived only if it does not contradict other facts already in the database

• A possible world is a maximal consequence

• Simple “stubborn” semantics

Page 7: Webdamlog and Contradictions

Example

Program
  IsIn(X,Y,P) :- Friend(P,P’), IsIn(X,Y,P’)
  IsIn(Carol,Y,P) :- IsIn(Alice,Y,P)

Database
  IsIn(Alice,Paris,Peter), IsIn(Carol,London,Tom), Friend(Ben,Tom), Friend(Ben,Peter)

IsIn(Alice,Paris,Peter) => IsIn(Alice,Paris,Ben) => IsIn(Carol,Paris,Ben)

IsIn(Carol,London,Tom) => IsIn(Carol,London,Ben)

In either case, IsIn(Alice,Paris,Ben) will be derived
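The fact-at-a-time behaviour on this example can be explored with a brute-force Python sketch. It hard-codes the two rules and the FD {Peer, Person} → City, tries every order of adding one non-contradicting fact at a time, and collects the maximal results. The names `one_step`, `consistent` and `nfat_worlds` are made up for this illustration, and exhaustive enumeration is only feasible because the example is tiny.

```python
# Ground data from the example; IsIn facts are (person, city, peer) triples.
FRIEND = {("Ben", "Tom"), ("Ben", "Peter")}
BASE = frozenset({("Alice", "Paris", "Peter"), ("Carol", "London", "Tom")})

def one_step(is_in):
    """All IsIn facts derivable by one rule application and not yet present."""
    out = set()
    for p, q in FRIEND:  # IsIn(X,Y,P) :- Friend(P,P'), IsIn(X,Y,P')
        out |= {(x, y, p) for (x, y, peer) in is_in if peer == q}
    # IsIn(Carol,Y,P) :- IsIn(Alice,Y,P)
    out |= {("Carol", y, peer) for (x, y, peer) in is_in if x == "Alice"}
    return out - is_in

def consistent(fact, is_in):
    """FD check: no other city is recorded for the same person and peer."""
    x, y, p = fact
    return not any(x2 == x and p2 == p and y2 != y for (x2, y2, p2) in is_in)

def nfat_worlds(is_in=BASE):
    """Try every order of fact-at-a-time derivation; return the maximal results."""
    candidates = [f for f in one_step(is_in) if consistent(f, is_in)]
    if not candidates:
        return {is_in}
    worlds = set()
    for fact in candidates:
        worlds |= nfat_worlds(is_in | {fact})
    return worlds

worlds = nfat_worlds()
for w in worlds:
    print(sorted(w))
```

The enumeration yields two possible worlds, one placing Carol in London and one placing her in Paris according to Ben, and both contain IsIn(Alice,Paris,Ben).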

Page 8: Webdamlog and Contradictions

Set-at-a-time Semantics

• Idea: the immediate consequence operator now selects a maximal consistent subset of the new facts that can be derived in one step of derivation
  – Still inflationary: “old” facts always stay.

• The semantics gives “priority” to more “direct” derivations
  – Intuitive, especially in a distributed setting

• Operational

• Two types of non-determinism in nfat
  – Data non-determinism (choice between contradicting facts)
  – Control non-determinism (choice of order of rule activation)

• In nsat only the first type remains

Page 9: Webdamlog and Contradictions

Example

Program
  IsIn(X,Y,P) :- Friend(P,P’), IsIn(X,Y,P’)
  IsIn(Carol,Y,P) :- IsIn(Alice,Y,P)

Database
  IsIn(Alice,Paris,Peter), IsIn(Carol,London,Tom), Friend(Ben,Tom), Friend(Ben,Peter)

IsIn(Alice,Paris,Peter) => IsIn(Alice,Paris,Ben) => IsIn(Carol,Paris,Ben)

IsIn(Carol,London,Tom) => IsIn(Carol,London,Ben)

IsIn(Alice,Paris,Ben) will be derived
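A set-at-a-time step on the same example can be sketched in Python as follows. At each step the sketch computes all newly derivable facts, discards those contradicting the current database, and then picks one city per (person, peer) key among the remaining candidates, which is one way to enumerate the maximal consistent subsets for this particular FD. The helper names are hypothetical and the code is an illustration of the idea, not the construction used in the talk.

```python
from itertools import product

FRIEND = {("Ben", "Tom"), ("Ben", "Peter")}
BASE = frozenset({("Alice", "Paris", "Peter"), ("Carol", "London", "Tom")})

def one_step(is_in):
    """All IsIn facts derivable by one rule application and not yet present."""
    out = set()
    for p, q in FRIEND:  # IsIn(X,Y,P) :- Friend(P,P'), IsIn(X,Y,P')
        out |= {(x, y, p) for (x, y, peer) in is_in if peer == q}
    # IsIn(Carol,Y,P) :- IsIn(Alice,Y,P)
    out |= {("Carol", y, peer) for (x, y, peer) in is_in if x == "Alice"}
    return out - is_in

def nsat_worlds(is_in=BASE):
    """Enumerate fixpoints where each step adds a maximal consistent set of new facts."""
    taken = {(x, p) for (x, _, p) in is_in}  # keys whose city is already fixed
    groups = {}
    for x, y, p in one_step(is_in):
        if (x, p) not in taken:              # otherwise the new fact contradicts an old one
            groups.setdefault((x, p), []).append((x, y, p))
    if not groups:
        return {is_in}
    worlds = set()
    for choice in product(*groups.values()):  # one city per free (person, peer) key
        worlds |= nsat_worlds(is_in | set(choice))
    return worlds

worlds = nsat_worlds()
for w in worlds:
    print(sorted(w))
```

Here IsIn(Carol,London,Ben) is derived in the first step, so the longer derivation of IsIn(Carol,Paris,Ben) is blocked in the second step and a single possible world remains.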

Page 10: Webdamlog and Contradictions

Expressive Power

• Thm: nsat is strictly stronger
  – We can simulate a program under the nfat semantics, using a program under the nsat semantics
  – But not the converse

• Thm: datalogfd with nsat captures NDB-PTIME
  – Queries computable by a nondeterministic TM for which every computation is in PTIME

Page 11: Webdamlog and Contradictions

(Tuple) Possibility and Certainty

Possibility

                 nfat            nsat
Non-recursive    PTIME           NP-complete
Recursive        NP-complete     NP-complete

Certainty

                 nfat            nsat
Non-recursive    CoNP-complete   CoNP-complete
Recursive        CoNP-complete   CoNP-complete

Page 12: Webdamlog and Contradictions

Representation System

A      B       Conditions
Alice  London  X
Carol  London  X
Alice  Paris   NOT(X)
Carol  Paris   NOT(X)

Concrete c-tables: variables appear only in the condition boolean formulas

General c-tables: variables may also appear in entries

Non-deterministic semantics = set of possible worlds

Can we capture them with compact (i.e. PSIZE) c-tables?
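To make the reading of a concrete c-table explicit, the Python sketch below enumerates all truth assignments of the boolean variables and keeps, for each assignment, the rows whose condition holds. The encoding of conditions as Python functions and the name `possible_worlds` are assumptions chosen for this example.

```python
from itertools import product

# The concrete c-table from the slide: (A, B) rows with a condition over variable X.
C_TABLE = [
    (("Alice", "London"), lambda v: v["X"]),
    (("Carol", "London"), lambda v: v["X"]),
    (("Alice", "Paris"), lambda v: not v["X"]),
    (("Carol", "Paris"), lambda v: not v["X"]),
]

def possible_worlds(table, variables):
    """One world per valuation: the set of rows whose condition evaluates to true."""
    worlds = set()
    for values in product([False, True], repeat=len(variables)):
        valuation = dict(zip(variables, values))
        worlds.add(frozenset(row for row, cond in table if cond(valuation)))
    return worlds

for world in possible_worlds(C_TABLE, ["X"]):
    print(sorted(world))
```

The table has four rows and one variable but represents two worlds: both persons in London, or both in Paris.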

Page 13: Webdamlog and Contradictions

Efficient Representation!

• Theorem: Given a (possibly recursive) datalogfd program P and input instance I, one can compute in PTIME (w.r.t. |I|) a concrete c-table C such that:
  – C encodes the fixpoint possible worlds + the empty relation

• This holds both for the nsat and nfat semantics
  – Different constructions, using formulas (with negation) to compactly encode the possible derivations

• This holds also if instead of a certain instance I, we start with an (arbitrary) c-table

Page 14: Webdamlog and Contradictions

Probabilistic Semantics

• Probabilistic counterpart introduced for both nfat and nsat semantics over a prob. database
  – Choose a possible world for base facts
  – Repeatedly and uniformly choose one possible set of rule instantiations,
    • Applying the nfat/nsat immediate consequence operator
  – This defines a distribution over fixpoint possible worlds

• Allows capturing voting

• Extensions allow associating probabilities with rules

• Can we compute the probability that a tuple appears in a fixpoint world?

Page 15: Webdamlog and Contradictions

Probability Computation

• Thm: Even if input instances are tuple-independent and the query is non-recursive and safe:
  – Computing exact tuple probability is #P-hard
  – Even with one FD per relation
  I.e. FDs introduce a novel hardness

• Thm: PTIME absolute approximation exists for the general case
  – Relative approximation is hard
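The slides do not say how the absolute approximation is obtained. One standard route to an additive guarantee is Monte Carlo sampling with a Hoeffding bound, sketched below in Python under that assumption: drawing n ≥ ln(2/δ) / (2ε²) independent fixpoint worlds gives an estimate within ε of the true tuple probability with probability at least 1 − δ. The sampler `toy_sampler` is a hypothetical stand-in for a procedure that runs the probabilistic semantics once.

```python
import math
import random

def sample_size(eps, delta):
    """Number of samples for additive error eps with confidence 1 - delta (Hoeffding)."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

def estimate(sample_world, fact, eps=0.05, delta=0.01, rng=None):
    """Fraction of sampled fixpoint worlds that contain the given fact."""
    rng = rng or random.Random(0)
    n = sample_size(eps, delta)
    hits = sum(fact in sample_world(rng) for _ in range(n))
    return hits / n

def toy_sampler(rng):
    """Hypothetical sampler: Ben's belief about Carol is chosen uniformly."""
    city = rng.choice(["London", "Paris"])
    return {("Alice", "Paris", "Ben"), ("Carol", city, "Ben")}

print(sample_size(0.05, 0.01))
print(estimate(toy_sampler, ("Carol", "London", "Ben")))
```

An additive guarantee of this kind says little about tuples with very small probability, which is consistent with relative approximation being the harder problem.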

Page 16: Webdamlog and Contradictions

Probabilistic Representation System

• In pc-tables, probabilities are associated with boolean variables

• Theorem: For non-recursive case, one FD per relation:

– We can capture possible worlds with their probabilities via a pc-table

– Even if starting from a pc-table instance

• General case is open.

Page 17: Webdamlog and Contradictions

Top-k Supports

• Top-k (minimal) subsets of facts that are most likely to occur in conjunction with Q (given Q)

• The problem is PTIME with no recursion, no FDs, tuple-independent DBs
  – Compute a DNF for the result and rank clauses by their individual probabilities.

• Either FDs (even one FD) or recursion (even linear recursion) lead to #P-hardness
  – Even approximation is hard.

• But surprisingly, PTIME exact solution for Transitive Closure program

• Future work to identify classes of easy inputs, practically efficient heuristics.
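For the easy case above (no recursion, no FDs, tuple-independent database), the ranking step can be sketched in Python as follows. Each DNF clause is a set of base facts, and under tuple independence its probability is the product of its facts' probabilities. Conditioning on Q divides every score by the same constant, so the order is unchanged. The function name and the sample lineage are assumptions for illustration.

```python
import heapq
import math

def top_k_supports(dnf, prob, k):
    """Rank DNF clauses (sets of independent base facts) by their probability."""
    scored = [(math.prod(prob[f] for f in clause), sorted(clause)) for clause in dnf]
    return heapq.nlargest(k, scored)

# Hypothetical lineage of a query answer Q and tuple probabilities.
prob = {"a": 0.9, "b": 0.5, "c": 0.8, "d": 0.2}
dnf = [{"a", "b"}, {"c"}, {"a", "d"}]

for score, clause in top_k_supports(dnf, prob, 2):
    print(clause, score)
```

In this sample the clause {c} ranks first and {a, b} second.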

Page 18: Webdamlog and Contradictions

Influence

• A tuple is necessary if without it, Q cannot be derived
    • PTIME for the recursive case

• A tuple is relevant if it is necessary in conjunction with some subset
    • PTIME for non-recursive
    • NP-complete for recursive

• We can further quantify influence as the change in answer probability when removing the fact

• Top-k facts based on their influence
  – Exact top-k is NP-hard even with no recursion, no FD
  – Approximate top-k possible in PTIME for the general case
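The notions of necessity and influence can be illustrated on a small monotone DNF lineage with the Python sketch below. It assumes tuple independence and no FDs, reads "necessary" as "occurs in every clause of the lineage", and computes probabilities by brute-force enumeration of worlds, so it is exponential and meant only to make the definitions concrete. All names and numbers are hypothetical.

```python
from itertools import product

def answer_prob(dnf, prob):
    """P(Q) for a monotone DNF over independent facts, by enumerating all worlds."""
    facts = sorted(prob)
    total = 0.0
    for bits in product([False, True], repeat=len(facts)):
        world = {f for f, b in zip(facts, bits) if b}
        weight = 1.0
        for f, b in zip(facts, bits):
            weight *= prob[f] if b else 1 - prob[f]
        if any(clause <= world for clause in dnf):
            total += weight
    return total

def influence(dnf, prob, fact):
    """Change in answer probability when the fact is removed from the database."""
    without = dict(prob, **{fact: 0.0})
    return answer_prob(dnf, prob) - answer_prob(dnf, without)

def is_necessary(dnf, fact):
    """Assumed reading: Q cannot be derived without a fact that is in every clause."""
    return all(fact in clause for clause in dnf)

dnf = [{"a", "b"}, {"a", "c"}]
prob = {"a": 0.5, "b": 0.5, "c": 0.5}
for f in sorted(prob):
    print(f, is_necessary(dnf, f), influence(dnf, prob, f))
```

In this sample, removing a drops the answer probability to zero, while removing b or c only lowers it.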

Page 19: Webdamlog and Contradictions

The distributed setting

• We next extend the model to a distributed setting

• A quick overview of some problems that are of interest in this setting
  – There are many others

Page 20: Webdamlog and Contradictions

Webdamlog basics

• Alphabet
  – Peer and relation names

• Schema
  – A set of peer IDs
  – Disjoint sets of extensional & intensional relations of the form m@p (with m a relation constant, p a peer ID)
  – A typing function defines the arity and sorts of components for each such relation

• Facts are of the form m@p(u)

Page 21: Webdamlog and Contradictions

Webdamlog basics (cont.)

• Rules are of the form
  Mn+1@Qn+1(Un+1) :- M1@Q1(U1), ..., Mn@Qn(Un)
  where Mi are relation terms, Qi are peer terms, Ui are tuples of terms

• We focus on local and deductive rules, i.e.
  – (body) at peer p, Qi = p for 1≤i≤n
  – (head) Mn+1@Qn+1(Un+1) is extensional

Page 22: Webdamlog and Contradictions

Webdamlog basics (example)

IsIn@$P($X,$Y) :- Friend@p($P), IsIn@p($X,$Y)

IsIn@p($X,$Y) :- baseIsIn@p($X,$Y)

Page 23: Webdamlog and Contradictions

Webdamlog basics (semantics)

• A local semantics to be used at each peer is defined and then induces a global semantics based on moves and runs.

• In our restricted case, with standard datalog semantics for each peer
  – A move of a peer p is:
    • Computing the fixpoint for its program
    • Alerting other peers of the derived facts that concern them
  – A run is a sequence of moves which satisfies fairness, i.e. each peer p is invoked infinitely many times.

Page 24: Webdamlog and Contradictions

With Contradictions

• With respect to webdamlog we change local semantics of peers to be nsat/nfat
  – When activated, a peer runs the semantics until saturation
  – The obtained system is (I,R,F) for (Initial instance, Rules, FDs)

• A subtlety:
  – Note that non-deterministic choices are made upon derivation at a peer p, but then the facts are added to a peer q
  – So we need to explicitly make sure that subsequent choices at p are consistent
  – We use a “memory” that p keeps throughout the run

Page 25: Webdamlog and Contradictions

Translation to the centralized case

• Given a distributed system (I,P,F), the centralized system is (Ic,Pc,Fc)
  – Ic is the union of all peers’ instances
  – Pc is all possible instantiations of the peer variables in the rules of P, with concrete peers (only instantiations respecting the typing and arity constraints)
  – Fc is the union of all Functional Dependencies

• If F is empty (no FDs) the systems are “equivalent”

• But with FDs:
  – Theorem: There exists a webdamlog system such that the sets of nsat possible worlds for the original and centralized systems are not contained in each other

Page 26: Webdamlog and Contradictions

Probabilities and Voting

IsIn@$P($X,$Y) :- Follower@p($P), IsIn@p($X,$Y)

IsIn@p($X,$Y) :- baseIsIn@p($X,$Y)

Prob. Semantics: Uniform choice of peer to move, prob. local semantics for moves

Proposition: For acyclic networks, the probability of a peer inferring a fact is exactly its relative support at followed peers

We can also use probabilistic base facts to weigh opinions
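The quantity named in the proposition, the relative support of a fact at the followed peers, can be computed directly, as in the Python sketch below. It assumes each followed peer holds exactly one opinion about the person in question. The peer names and opinions are made up. The sketch only evaluates the formula and does not simulate the probabilistic run.

```python
from collections import Counter
from fractions import Fraction

def relative_support(opinions):
    """Map each city to the fraction of followed peers that believe it."""
    counts = Counter(opinions.values())
    total = sum(counts.values())
    return {city: Fraction(n, total) for city, n in counts.items()}

# Hypothetical opinions about Alice at the three peers that some peer follows.
opinions = {"Tom": "London", "Peter": "Paris", "Dan": "Paris"}
print(relative_support(opinions))
```

With two of three followed peers placing Alice in Paris, the proposition gives probability 2/3 for inferring Paris and 1/3 for London in an acyclic network.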

Page 27: Webdamlog and Contradictions

Distributed Sampling

• The idea is that each peer chooses a possible world for the base facts

• And then simply executes the semantics
  – Making probabilistic choices along the way

• Some new subtleties in the procedure
  – Cooperation is needed for initiating the samples

• As well as in the convergence proof
  – Due to order of peer invocation

Page 28: Webdamlog and Contradictions

Related Work

• Datalog with negation, nondeterminism

• Witness

• Repair and probabilistic repair

• Integrity constraints via rules in Data Exchange

• Distributed Datalog

• Probabilistic and Incomplete Databases

Page 29: Webdamlog and Contradictions

Conclusion

• We have studied data management in presence of contradictions

• Defined semantics in the centralized and distributed case

• Provided a probabilistic modeling of the uncertainty that arises

• Studied computational problems in these contexts

• Future work: Open questions, optimizations and implementation issues, additional semantics.

Page 30: Webdamlog and Contradictions

Thank you!

Page 31: Webdamlog and Contradictions

Proof theory

• A proof tree is a tree labeled with facts, such that leaves are labeled with facts from the original database.

• In presence of FDs, we require that the facts in the tree nodes do not violate any FD.

• Theorem: The possible facts are exactly those that have proof trees

Page 32: Webdamlog and Contradictions

Example

C :- R(a,0),R(a,1),

R(a,0) :- A,

R(a,1) :- B

DB = {A,B}, FD in R: 1→2

There are proof trees for both R(a,0) and R(a,1) but not for C

Page 33: Webdamlog and Contradictions

Connection to Datalog with negation

• The nsat semantics is non-deterministic, while datalog is deterministic.

• We can “encode” the non-determinism in a new prefer relation deciding for every non-deterministic choice, the preference between facts.

• Then we can simulate nsat via inflationary datalog with negation and with the prefer relation

• Details omitted

Page 34: Webdamlog and Contradictions

Possibility and Certainty

• As the semantics is non-deterministic, there are multiple possible fixpoint states
  – Referred to as possible worlds

• Given an input instance:
  – A fact is possible if it appears in some fixpoint state
  – A fact is certain if it appears in all fixpoint states
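Once the set of possible worlds is available, these two definitions reduce to a union and an intersection, as the short Python sketch below shows. The two worlds are hypothetical and hand-written. Enumerating the worlds explicitly is generally exponential, so this illustrates the definitions rather than an efficient decision procedure.

```python
def possible_facts(worlds):
    """Facts that appear in at least one fixpoint world."""
    return set().union(*worlds)

def certain_facts(worlds):
    """Facts that appear in every fixpoint world (worlds must be non-empty)."""
    return set.intersection(*(set(w) for w in worlds))

# Two hypothetical fixpoint worlds that disagree on where Carol is, according to Ben.
worlds = [
    {("Alice", "Paris", "Ben"), ("Carol", "London", "Ben")},
    {("Alice", "Paris", "Ben"), ("Carol", "Paris", "Ben")},
]
print(sorted(possible_facts(worlds)))
print(sorted(certain_facts(worlds)))
```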

Page 35: Webdamlog and Contradictions

Observations

1. An nsat possible world is an nfat possible world.

2. The converse of 1. does not hold in general.

3. All possible nsat facts are possible nfat facts

4. Certain nfat facts are certain nsat facts

5. The above inclusions may be strict

Page 36: Webdamlog and Contradictions

C-tables

• A c-table is an incomplete instance (variables may occur in tuple entries), with conditions associated with tuples
  – Conditions are boolean combinations of equality predicates over variables and values

• A concrete c-table is one where
  – No variables appear in tuples
  – Conditions are boolean formulas over variables

Page 37: Webdamlog and Contradictions

Same Possible Worlds? NO

FD on S@p

The program at p
  A@p :-
  B@p :- A@p
  C@p :- B@p
  S@p(0) :- C@p
  D@q :-
  S@p(1) :- E@p

The program at q
  E@p :- D@q
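The difference between the two settings can be replayed with a small Python sketch. It rests on several assumptions. The program is read as the rules A@p, B@p :- A@p, C@p :- B@p, S@p(0) :- C@p, D@q and S@p(1) :- E@p at peer p, plus E@p :- D@q at peer q. The FD on S@p is read as "S@p(0) and S@p(1) contradict each other". A peer's move is modelled as running its own rules to saturation over a shared fact set, ignoring message delivery details and the per-peer memory. No step of this example needs a non-deterministic choice, so the sketch refuses to continue if one would arise.

```python
# Ground rules as (head, body) pairs; facts are plain strings.
P_RULES = [
    ("A@p", ()), ("B@p", ("A@p",)), ("C@p", ("B@p",)),
    ("S@p(0)", ("C@p",)), ("D@q", ()), ("S@p(1)", ("E@p",)),
]
Q_RULES = [("E@p", ("D@q",))]
CONFLICTS = [{"S@p(0)", "S@p(1)"}]  # assumed reading of "FD on S@p"

def blocked(fact, db):
    """True if the fact contradicts a fact that is already in the database."""
    return any(fact in group and (group - {fact}) & db for group in CONFLICTS)

def saturate(db, rules):
    """Set-at-a-time steps until no new non-contradicting fact can be added."""
    db = set(db)
    while True:
        new = {h for h, body in rules if h not in db and all(b in db for b in body)}
        new = {f for f in new if not blocked(f, db)}
        if any(len(group & new) > 1 for group in CONFLICTS):
            raise NotImplementedError("a non-deterministic choice would be needed")
        if not new:
            return db
        db |= new

# Centralized: all rules fire together, so S@p(1) is reached before S@p(0).
centralized = saturate(set(), P_RULES + Q_RULES)

# Distributed: p saturates first and commits to S@p(0) before q ever produces E@p.
distributed = set()
for peer_rules in (P_RULES, Q_RULES, P_RULES, Q_RULES):
    distributed = saturate(distributed, peer_rules)

print(sorted(centralized))
print(sorted(distributed))
```

Under these assumptions the centralized run ends with S@p(1) and the distributed run with S@p(0), so neither set of possible worlds contains the other.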

Page 38: Webdamlog and Contradictions

Same Possible Worlds?

• We could use nfat (rather than nsat) as a local semantics, but this would not generate all nfat worlds of the centralized case

• Of course, a semantics that runs a single nfat step at each peer activation would capture all.
  – But not very realistic to assume communication after every step

Page 39: Webdamlog and Contradictions

Introducing probabilities

• So far we have non-deterministic semantics

• We next turn to a probabilistic semantics where the non-deterministic choices are associated with probabilities
  – We will show how this allows capturing voting

• We will also capture uncertainty on base facts with probabilities
  – This captures incomplete as well as weighted knowledge