Webdamlog and Contradictions

  • Date posted: 14-Jan-2016

Webdamlog and Contradictions. Daniel Deutch, Tel Aviv University. Joint work with Serge Abiteboul, Meghyn Bienvenu, Victor Vianu.

Transcript of Webdamlog and Contradictions

  • Webdamlog and Contradictions

    Daniel Deutch, Tel Aviv University

    Joint work with Serge Abiteboul, Meghyn Bienvenu, Victor Vianu

  • Motivation

    In a distributed setting, contradictions and uncertainty naturally arise, due to:

      • Different or contradictory opinions
      • Different viewpoints
      • Partial information

  • Example: Where is Alice

    Consider a relation IsIn(Person, City, Peer): Peer believes Person is in City.

    There is a natural functional dependency {Peer, Person} → City.

    Now consider a datalog rule: IsIn(Person, City, p) :- IsIn(Person, City, p'), Friend(p, p')

    How do we combine the contradictory opinions of two friends on the location of Alice?

    How do we do so if the opinions are uncertain?
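
To make the functional dependency concrete, here is a minimal Python sketch of a violation check. The tuple layout (person, city, peer), the function name `violates_fd` and the sample facts are illustrative assumptions, not part of the original slides.

```python
from typing import Dict, Iterable, Tuple

Fact = Tuple[str, str, str]  # (person, city, peer), as in IsIn(Person, City, Peer)


def violates_fd(facts: Iterable[Fact]) -> bool:
    """Return True if two facts share (peer, person) but name different cities."""
    city_of: Dict[Tuple[str, str], str] = {}
    for person, city, peer in facts:
        key = (peer, person)
        # setdefault stores the first city seen for this key and returns it afterwards.
        if city_of.setdefault(key, city) != city:
            return True
    return False


# Hypothetical data: Ben has adopted two conflicting opinions about Alice.
conflicting = [("Alice", "Paris", "Ben"), ("Alice", "London", "Ben")]
# Different peers may disagree without violating the FD.
compatible = [("Alice", "Paris", "Peter"), ("Alice", "London", "Tom")]

print(violates_fd(conflicting))
print(violates_fd(compatible))
```

Two peers holding different opinions is fine; the conflict only appears once a single peer, for example through the friend rule, would have to hold both.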

  • Roadmap

    Centralized non-deterministic semantics

      • For Datalog in the presence of FDs
      • We study properties of the semantics, and computational and representation issues

    Quantifying non-determinism with probabilities

      • Studying computation of probabilities and explanation of answers

    Distributed settings

  • Centralized case

    For the centralized case we use the datalog syntax

    Standard (safe) datalog rules: R(X1,...,Xn) :- R1(X11,...,X1m), ..., Rk(Xk1,...,Xks)

    Functional dependencies of the form R: 1,2 → 3

    We will change the datalog semantics to account for FDs

    DatalogFD

  • First Semantics

    Non-deterministic inflationary fact-at-a-time semantics (nfat)

    Re-define the immediate consequence operator such that a fact is derived only if it does not contradict other facts already in the database

    A possible world is a maximal consequence

    Simple stubborn semantics

  • Example

    Program:

      IsIn(X,Y,P) :- Friend(P,P'), IsIn(X,Y,P')
      IsIn(Carol,Y,P) :- IsIn(Alice,Y,P)

    Database: IsIn(Alice,Paris,Peter), IsIn(Carol,London,Tom), Friend(Ben,Tom), Friend(Ben,Peter)

    IsIn(Alice,Paris,Peter)=>IsIn(Alice,Paris,Ben)=>IsIn(Carol,Paris,Ben)

    IsIn(Carol,London,Tom)=>IsIn(Carol,London,Ben)

    In either case, IsIn(Alice,Paris,Ben) will be derived
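
The fact-at-a-time behaviour on this example can be sketched in Python as follows. This is only an illustration under stated assumptions: the first rule is read as "a peer P adopts the IsIn facts of any peer listed as its friend", the two rules are hard-coded rather than parsed, and the names `consequences`, `contradicts` and `nfat_worlds` are hypothetical. The sketch explores every order of adding one non-contradicting fact and collects the states where nothing more can be added.

```python
from typing import FrozenSet, Set, Tuple

Fact = Tuple[str, str, str]  # (person, city, peer)

# Database of the example; Friend(P, Q) is read as "Q is a friend of P".
FRIEND = {("Ben", "Tom"), ("Ben", "Peter")}
BASE: FrozenSet[Fact] = frozenset(
    {("Alice", "Paris", "Peter"), ("Carol", "London", "Tom")}
)


def consequences(state: FrozenSet[Fact]) -> Set[Fact]:
    """Facts derivable in one step by the two example rules and not yet in state."""
    out: Set[Fact] = set()
    for person, city, peer in state:
        # IsIn(X,Y,P) :- Friend(P,Q), IsIn(X,Y,Q)
        out.update((person, city, p) for p, q in FRIEND if q == peer)
        # IsIn(Carol,Y,P) :- IsIn(Alice,Y,P)
        if person == "Alice":
            out.add(("Carol", city, peer))
    return out - state


def contradicts(state: FrozenSet[Fact], fact: Fact) -> bool:
    """FD {Peer, Person} -> City: same peer and person, different city."""
    person, city, peer = fact
    return any(x == person and p == peer and y != city for x, y, p in state)


def nfat_worlds(start: FrozenSet[Fact]) -> Set[FrozenSet[Fact]]:
    """Explore every order of adding one non-contradicting fact at a time."""
    worlds, seen, stack = set(), set(), [start]
    while stack:
        state = stack.pop()
        if state in seen:
            continue
        seen.add(state)
        candidates = [f for f in consequences(state) if not contradicts(state, f)]
        if not candidates:
            worlds.add(state)  # nothing more can be added: a possible world
        stack.extend(state | {f} for f in candidates)
    return worlds


worlds = nfat_worlds(BASE)
for derived in sorted(sorted(w - BASE) for w in worlds):
    print(derived)
```

On this database the exploration ends in two possible worlds, which differ only in where Ben believes Carol is; both contain IsIn(Alice,Paris,Ben).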

  • Set-at-a-time Semantics

    Idea: the immediate consequence operator now selects a maximal consistent subset of the new facts that can be derived in one derivation step.

    Still inflationary: old facts always stay.

    The semantics gives priority to more direct derivations.

      • Intuitive, especially in a distributed setting
      • Operational

    Two types of non-determinism in nfat:

      • Data non-determinism (choice between contradicting facts)
      • Control non-determinism (choice of order of rule activation)

    In nsat (the set-at-a-time semantics), only the first type remains.

  • Example

    Program:

      IsIn(X,Y,P) :- Friend(P,P'), IsIn(X,Y,P')
      IsIn(Carol,Y,P) :- IsIn(Alice,Y,P)

    Database: IsIn(Alice,Paris,Peter), IsIn(Carol,London,Tom), Friend(Ben,Tom), Friend(Ben,Peter)

    IsIn(Alice,Paris,Peter)=>IsIn(Alice,Paris,Ben)=>IsIn(Carol,Paris,Ben)

    IsIn(Carol,London,Tom)=>IsIn(Carol,London,Ben)

    IsIn(Alice,Paris,Ben) will be derived
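
A set-at-a-time step on the same example can be sketched in Python as below. As before, this is a hedged illustration: the rules are hard-coded, the first rule is read as "a peer adopts the facts of its friends", and `nsat_worlds` is a hypothetical name. With the single FD {Peer, Person} → City, a maximal consistent subset of the new facts amounts to picking one city for every (peer, person) pair that has no city yet.

```python
from itertools import product
from typing import Dict, FrozenSet, List, Set, Tuple

Fact = Tuple[str, str, str]  # (person, city, peer)

FRIEND = {("Ben", "Tom"), ("Ben", "Peter")}
BASE: FrozenSet[Fact] = frozenset(
    {("Alice", "Paris", "Peter"), ("Carol", "London", "Tom")}
)


def consequences(state: FrozenSet[Fact]) -> Set[Fact]:
    """Facts derivable in one step by the two example rules and not yet in state."""
    out: Set[Fact] = set()
    for person, city, peer in state:
        out.update((person, city, p) for p, q in FRIEND if q == peer)
        if person == "Alice":
            out.add(("Carol", city, peer))
    return out - state


def nsat_worlds(start: FrozenSet[Fact]) -> Set[FrozenSet[Fact]]:
    """Each step adds one maximal consistent subset of all one-step consequences."""
    worlds, seen, stack = set(), set(), [start]
    while stack:
        state = stack.pop()
        if state in seen:
            continue
        seen.add(state)
        decided = {(peer, person) for person, _, peer in state}
        groups: Dict[Tuple[str, str], List[Fact]] = {}
        for fact in sorted(consequences(state)):
            key = (fact[2], fact[0])
            # A new fact whose key is already decided must name another city,
            # so it contradicts an old fact and is dropped (old facts stay).
            if key not in decided:
                groups.setdefault(key, []).append(fact)
        if not groups:
            worlds.add(state)  # fixpoint reached
            continue
        # One city per undecided key: every choice is a maximal consistent subset.
        for choice in product(*groups.values()):
            stack.append(state | frozenset(choice))
    return worlds


worlds = nsat_worlds(BASE)
for derived in sorted(sorted(w - BASE) for w in worlds):
    print(derived)
```

Here the first step already fixes Ben's belief about Carol through the direct derivation from Tom, so the longer derivation via Alice arrives too late. This illustrates the priority given to more direct derivations and the absence of control non-determinism.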

  • Expressive Power

    Thm: nsat is strictly stronger.

      • We can simulate a program under the nfat semantics using a program under the nsat semantics
      • But not the converse

    Thm: datalogfd with nsat captures NDB-PTIME.

      • Queries computable by a nondeterministic TM for which every computation is in PTIME

  • (Tuple) Possibility and Certainty

    Possibility

    Certainty

  • Representation System

    Concrete c-tables: variables appear only in the conditions (Boolean formulas)

    General c-tables: variables may also appear in entries

    Non-deterministic semantics = set of possible worlds

    Can we capture it with compact (i.e. PSIZE) c-tables?
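
As an illustration of how a concrete c-table stands for a set of possible worlds, the following Python sketch attaches a Boolean condition to each tuple and enumerates all valuations. The table contents, the variable name `b` and the function `possible_worlds` are assumptions made for this example; the table is hand-written and is not the output of any construction from the talk.

```python
from itertools import product
from typing import Callable, Dict, FrozenSet, List, Set, Tuple

Fact = Tuple[str, str, str]
Condition = Callable[[Dict[str, bool]], bool]


def possible_worlds(
    table: List[Tuple[Fact, Condition]], variables: List[str]
) -> Set[FrozenSet[Fact]]:
    """One world per valuation: keep exactly the tuples whose condition holds."""
    worlds = set()
    for values in product([False, True], repeat=len(variables)):
        valuation = dict(zip(variables, values))
        worlds.add(frozenset(t for t, cond in table if cond(valuation)))
    return worlds


# Hypothetical concrete c-table over one Boolean variable b.
table = [
    (("Alice", "Paris", "Ben"), lambda v: True),        # present in every world
    (("Carol", "London", "Ben"), lambda v: v["b"]),     # present when b is true
    (("Carol", "Paris", "Ben"), lambda v: not v["b"]),  # present when b is false
]

worlds = possible_worlds(table, ["b"])
for world in sorted(sorted(w) for w in worlds):
    print(world)
```

The two mutually exclusive conditions encode the choice between contradicting facts with a single variable, which is the kind of compactness the question above is after.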

  • Efficient Representation!

    Theorem: Given a (possibly recursive) datalogfd program P and input instance I, one can compute in PTIME (w.r.t. |I|) a concrete c-table C such that C encodes the fixpoint possible worlds + the empty relation.

    This holds both for the nsat and nfat semantics.

    Different constructions, using formulas (with negation) to compactly encode the possible derivations

    This holds also if instead of a certain instance I, we start with an (arbitrary) c-table

  • Probabilistic Semantics

    A probabilistic counterpart is introduced for both the nfat and nsat semantics over a probabilistic database:

      • Choose a possible world for the base facts
      • Repeatedly and uniformly choose one possible set of rule instantiations, applying the nfat/nsat immediate consequence operator

    This defines a distribution over fixpoint possible worlds.

    It allows capturing voting.

    Extensions allow associating probabilities with rules.

    Can we compute the probability that a tuple appears in a fixpoint world?

  • Probability Computation

    Thm: Even if input instances are tuple-independent and the query is non-recursive and safe:

      • Computing exact tuple probability is #P-hard
      • Even with one FD per relation

    I.e. FDs introduce a novel hardness

    Thm: PTIME absolute approximation exists for the general case.

    Relative approximation is hard.
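
The slides do not say how the absolute approximation is obtained. One standard route, shown here purely as an assumption, is Monte Carlo sampling: draw random runs of the probabilistic nfat semantics and report the observed frequency, with the sample size chosen by Hoeffding's inequality. The sketch below hard-codes the two rules of the earlier IsIn example, treats the base facts as certain, and picks uniformly among applicable rule instantiations; `instantiations`, `sample_world`, `estimate` and `samples_needed` are hypothetical names.

```python
import math
import random
from typing import List, Set, Tuple

Fact = Tuple[str, str, str]  # (person, city, peer)

FRIEND = [("Ben", "Tom"), ("Ben", "Peter")]
BASE = [("Alice", "Paris", "Peter"), ("Carol", "London", "Tom")]


def instantiations(state: Set[Fact]) -> List[Fact]:
    """Heads of applicable rule instantiations, one list entry per instantiation."""
    heads: List[Fact] = []
    for person, city, peer in sorted(state):
        # IsIn(X,Y,P) :- Friend(P,Q), IsIn(X,Y,Q)
        heads.extend((person, city, p) for p, q in FRIEND if q == peer)
        # IsIn(Carol,Y,P) :- IsIn(Alice,Y,P)
        if person == "Alice":
            heads.append(("Carol", city, peer))
    city_of = {(p, x): y for x, y, p in state}
    # Keep heads that are new and respect the FD {Peer, Person} -> City.
    return [h for h in heads
            if h not in state and city_of.get((h[2], h[0]), h[1]) == h[1]]


def sample_world(rng: random.Random) -> Set[Fact]:
    """One random nfat run: add uniformly chosen derivable facts until saturation."""
    state = set(BASE)
    while True:
        options = instantiations(state)
        if not options:
            return state
        state.add(rng.choice(options))


def estimate(fact: Fact, n: int, seed: int = 0) -> float:
    """Fraction of n sampled fixpoint worlds that contain the fact."""
    rng = random.Random(seed)
    return sum(fact in sample_world(rng) for _ in range(n)) / n


def samples_needed(eps: float, delta: float) -> int:
    """Hoeffding bound: absolute error at most eps with probability >= 1 - delta."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))


n = samples_needed(0.05, 0.05)
print(n, estimate(("Carol", "London", "Ben"), n))
```

The number of samples depends only on the error and confidence parameters, not on the size of the database, which is why sampling gives an absolute (additive) guarantee but no relative one for very small probabilities.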

  • Probabilistic Representation System

    In pc-tables, probabilities are associated with Boolean variables

    Theorem: For non-recursive case, one FD per relation:

    We can capture possible worlds with their probabilities via a pc-table

    Even if starting from a pc-table instance

    General case is open.

  • Top-k Supports

    Top-k (minimal) subsets of facts that are most likely to occur in conjunction with Q (given Q).

    The problem is in PTIME with no recursion, no FDs, and tuple-independent DBs:

      • Compute a DNF for the result and rank clauses by their individual probabilities

    Either FDs (even one FD) or recursion (even linear recursion) leads to #P-hardness.

      • Even approximation is hard

    But, surprisingly, there is a PTIME exact solution for the Transitive Closure program.

    Future work: identify classes of easy inputs and practically efficient heuristics.
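
For the easy case (no recursion, no FDs, tuple-independent facts), ranking DNF clauses by their individual probabilities can be sketched as follows. The fact names, probabilities and the function `top_k_supports` are hypothetical; the DNF is assumed to be given, and each clause's probability is the product of its facts' probabilities because the facts are independent.

```python
from math import prod
from typing import Dict, FrozenSet, List, Tuple


def top_k_supports(
    dnf: List[FrozenSet[str]], probs: Dict[str, float], k: int
) -> List[Tuple[float, FrozenSet[str]]]:
    """Rank DNF clauses (candidate supports) by the product of fact probabilities."""
    scored = [(prod(probs[f] for f in clause), clause) for clause in dnf]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical tuple-independent database and DNF lineage of a query answer Q.
probs = {"a": 0.9, "b": 0.5, "c": 0.8, "d": 0.3}
dnf = [frozenset({"a", "b"}), frozenset({"c"}), frozenset({"a", "d"})]

for probability, clause in top_k_supports(dnf, probs, k=2):
    print(round(probability, 3), sorted(clause))
```

The clause scores are computed independently of each other, so the ranking needs only one pass over the DNF plus a sort.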

  • Influence

    A tuple is necessary if, without it, Q cannot be derived.

      • PTIME for the recursive case

    A tuple is relevant if it is necessary in conjunction with some subset.

      • PTIME for non-recursive
      • NP-complete for recursive

    We can further quantify influence as the change in answer probability when removing the fact.

    Top-k facts based on their influence:

      • Exact top-k is NP-hard even with no recursion, no FD
      • Approximate top-k is possible in PTIME for the general case
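
The notions of a necessary tuple and of influence can be illustrated on a small monotone DNF lineage. The sketch below is a brute-force illustration under assumptions (no FDs, no recursion, tuple-independent facts, hypothetical names and probabilities); it enumerates all subsets of facts, so it is exponential and not one of the PTIME procedures mentioned above.

```python
from itertools import product
from typing import Dict, FrozenSet, List


def answer_probability(dnf: List[FrozenSet[str]], probs: Dict[str, float]) -> float:
    """Sum the weights of all sets of present facts that satisfy some clause."""
    facts = sorted(probs)
    total = 0.0
    for bits in product([False, True], repeat=len(facts)):
        present = {f for f, b in zip(facts, bits) if b}
        if any(clause <= present for clause in dnf):
            weight = 1.0
            for f, b in zip(facts, bits):
                weight *= probs[f] if b else 1.0 - probs[f]
            total += weight
    return total


def is_necessary(fact: str, dnf: List[FrozenSet[str]]) -> bool:
    """Without the fact Q cannot be derived: it occurs in every clause."""
    return all(fact in clause for clause in dnf)


def influence(fact: str, dnf: List[FrozenSet[str]], probs: Dict[str, float]) -> float:
    """Drop in answer probability when the fact is removed (probability set to 0)."""
    without = {**probs, fact: 0.0}
    return answer_probability(dnf, probs) - answer_probability(dnf, without)


# Hypothetical lineage: Q holds if (a and b) or (a and c).
probs = {"a": 0.9, "b": 0.5, "c": 0.8}
dnf = [frozenset({"a", "b"}), frozenset({"a", "c"})]

ranking = sorted(probs, key=lambda f: influence(f, dnf, probs), reverse=True)
print(ranking, [is_necessary(f, dnf) for f in ranking])
```

In this example the necessary fact also has the largest influence, since removing it drops the answer probability to zero.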

  • The distributed setting

    We next extend the model to a distributed setting

    A quick overview of some problems that are of interest in this setting

    There are many others

  • Webdamlog basics

    Alphabet

      • Peer and relation names

    Schema

      • A set of peer IDs
      • Disjoint sets of extensional & intensional relations of the form m@p (with m a relation constant, p a peer ID)
      • A typing function defines the arity and sorts of components for each such relation

    Facts are of the form m@p(u)

  • Webdamlog basics (cont.)

    Mn+1@Qn+1(Un+1) :- M1@Q1(U1), ..., Mn@Qn(Un)

    where Mi are relation terms, Qi are peer terms, Ui are tuples of terms.

    We focus on local and deductive rules, i.e.:

      • (body) At p, Qi = p for 1 ≤ i ≤ n
      • (head) Mn+1@Qn+1(Un+1) is extensional

  • Webdamlog basics (example)

    IsIn@$P($X,$Y) :- Friend@p($P), IsIn@p($X,$Y)

    IsIn@p($X,$Y) :- baseIsIn@p($X,$Y)

  • Webdamlog basics (semantics)

    A local semantics to be used at each peer is defined; it then induces a global semantics based on moves and runs.

    In our restricted case, we use standard datalog semantics for each peer.

    A move of a peer p consists of:

      • Computing the fixpoint for its program
      • Alerting other peers of the derived facts that concern them

    A run is a sequence of moves that satisfies fairness, i.e. each peer p is invoked infinitely many times.
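
Moves and runs can be sketched in Python for the two example rules of the previous slide (a peer copies baseIsIn into IsIn and pushes its IsIn facts to every peer in its Friend relation). This is a simplified illustration, not the Webdamlog engine: there are no FDs, the local fixpoint is trivial because the local rule is non-recursive, and a round-robin schedule that stops after a round without changes stands in for an infinite fair run. The peer data and the names `new_peer`, `move` and `run` are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

Pair = Tuple[str, str]  # (person, city)
Peer = Dict[str, set]


def new_peer(base: Iterable[Pair] = (), friends: Iterable[str] = ()) -> Peer:
    """Local state of one peer: its relations baseIsIn, IsIn and Friend."""
    return {"baseIsIn": set(base), "IsIn": set(), "Friend": set(friends)}


def move(p: str, peers: Dict[str, Peer]) -> bool:
    """One move of peer p; returns True if any IsIn relation grew."""
    me = peers[p]
    before = sum(len(q["IsIn"]) for q in peers.values())
    # Local fixpoint: IsIn@p($X,$Y) :- baseIsIn@p($X,$Y)
    me["IsIn"] |= me["baseIsIn"]
    # Alert other peers: IsIn@$P($X,$Y) :- Friend@p($P), IsIn@p($X,$Y)
    for friend in me["Friend"]:
        peers[friend]["IsIn"] |= me["IsIn"]
    return sum(len(q["IsIn"]) for q in peers.values()) > before


def run(peers: Dict[str, Peer]) -> int:
    """Round-robin schedule; returns the number of rounds that changed something."""
    rounds = 0
    # The list (not a generator) makes every peer move in every round.
    while any([move(p, peers) for p in sorted(peers)]):
        rounds += 1
    return rounds


# Hypothetical network: Peter and Tom both push their facts to Ben.
peers = {
    "Peter": new_peer(base=[("Alice", "Paris")], friends=["Ben"]),
    "Tom": new_peer(base=[("Carol", "London")], friends=["Ben"]),
    "Ben": new_peer(),
}
rounds = run(peers)
print(rounds, sorted(peers["Ben"]["IsIn"]))
```

Round-robin invokes every peer in every round, so any finite prefix of it can be extended to a fair run.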

  • With Contradictions

    With respect to Webdamlog, we change the local semantics of peers to be nsat/nfat.

    When activated, a peer runs the semantics until saturation.

    The obtained system is (I,R,F), for (Initial instance, Rules, FDs).

    A subtlety: non-deterministic choices are made upon derivation at a peer p, but the facts are then added to a peer q.

      • So we need to explicitly make sure that subsequent choices at p are consistent
      • We use a memory that p keeps throughout the run

  • Translation to the centralized case

    Given a distributed system (I,P,F), the centralized system is (Ic,Pc,Fc):

      • Ic is the union of all peers' instances
      • Pc is all possible instantiations of the peer variables in the rules of P with concrete peers (only instantiations respecting the typing and arity constraints)
      • Fc is the union of all functional dependencies

    If F is empty (no FDs), the systems are equivalent.

    But with FDs:

    Theorem: There exists a Webdamlog system such that the sets of nsat possible worlds for the original and the centralized system are not contained in each other.

  • Probabilities and Voting

    IsIn@$P($X,$Y) :- Follower@p($P), IsIn@p($X,$Y)

    IsIn@p($X,$Y) :- baseIsIn@p($X,$Y)

    Prob. Semantics: Uniform choice of peer to move, prob. local semantics for moves

    Proposition: For acyclic networks, the probability of a peer inferring a fact is exactly its relative support at followed peers

    We can also use probabilistic base facts to weigh opinions
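
The proposition can be checked by hand on the simplest acyclic network: one follower and a few followed peers, each with a single certain opinion. The Python sketch below makes two assumptions that are not stated on the slide: "relative support" is read as the fraction of followed peers asserting the fact, and the follower keeps whichever opinion reaches it first (the stubborn, fact-at-a-time behaviour), with all invocation orders of the followed peers equally likely. The peer names and opinions are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations
from typing import Dict


def relative_support(opinions: Dict[str, str]) -> Dict[str, Fraction]:
    """Fraction of followed peers asserting each city."""
    counts = Counter(opinions.values())
    return {city: Fraction(n, len(opinions)) for city, n in counts.items()}


def first_arrival_probability(opinions: Dict[str, str]) -> Dict[str, Fraction]:
    """Probability that each city is adopted when the first mover's opinion wins."""
    orders = list(permutations(sorted(opinions)))
    wins = Counter(opinions[order[0]] for order in orders)
    return {city: Fraction(n, len(orders)) for city, n in wins.items()}


# Hypothetical opinions of the followed peers about where Alice is.
opinions = {"Peter": "Paris", "Tom": "London", "Ann": "Paris"}

print(relative_support(opinions))
print(first_arrival_probability(opinions))
```

Both computations agree on this input, which is the sense in which the semantics captures voting: an opinion held by two of three followed peers is adopted with probability 2/3.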

  • Distributed Sampling

    The idea is that each peer chooses a possible world for the base facts

    And then simply executes the semantics

      • Making probabilistic choices along the way

    Some new subtleties in the procedure

      • Cooperation is needed for initiating the samples
      • As well as in the convergence proof
      • Due to order of peer invocation

  • Related Work

    Datalog with negation, nondeterminism

    Witness

    Repair and probabilistic repair

    Integrity constraints via rules in Data Exchange

    Distributed Datalog

    Probabilistic and Incomplete Databases

  • Conclusion

    We have studied data management in the presence of contradictions

    Defined semantics in the centralized and distributed case