Web Site Verification and Repair: an Abductive Logic Programming Tool


P. Mancarella¹, G. Terreni¹, and F. Toni²

¹ Dipartimento di Informatica, Università di Pisa, Italy
Email: {paolo,terreni}@di.unipi.it

² Department of Computing, Imperial College London, UK
Email: [email protected]

Abstract. We present the CIFFWEB system, an innovative tool for the verification and repair of web sites, relying upon abductive logic programming. The system allows the user to define rules that a web site should fulfill and uses abductive reasoning for checking their fulfillment and, if needed, for suggesting repairs to those (parts of the) web sites responsible for violating them. The rules are expressed by using (a fragment of) the rule-based semi-structured query language Xcerpt. Then, we give a mapping from Xcerpt rules into abductive logic programs, that can be executed by means of the general-purpose CIFF proof procedure for abductive logic programming with constraints. Finally, we illustrate how a similar technique can be used effectively for actually repairing errors in web sites. The CIFFWEB system is a prototype system composed of (1) the CIFF System 4.0 and (2) a translator which, given checking rules expressed in Xcerpt and XML/XHTML pages, computes an abductive logic program with constraints that can be run under CIFF 4.0.

1 Introduction

The exponential growth of the WWW raises the question of how to maintain and repair web sites automatically, in particular when the designers of these sites require them to exhibit certain properties at both the structural and the data level. The capability of maintaining and repairing web sites is also important to ensure the success of the Semantic Web vision. As the Semantic Web relies upon the definition and the maintenance of consistent data schemas (XML/XMLSchema, RDF/RDFSchema, OWL and so on [1]), tools for reasoning over such schemas (and possibly extending the reasoning to multiple web pages) show great promise.

This paper introduces a tool, referred to as CIFFWEB, for verifying and repairing web sites against sets of requirements which have to be fulfilled by a web site instance. For lack of space, we focus on the verification features of CIFFWEB and only illustrate the repair capabilities of the tool.

We define an expressive characterization of rules for checking web sites' errors by using (a fragment of) the well-known semi-structured data query language Xcerpt [2, 3]. In contrast to other semi-structured query languages (like XQuery [4] and XPath [5]), which all propose a path-oriented approach for querying semi-structured data, Xcerpt is a rule-based language which relies upon a (partial) pattern matching mechanism, allowing complex queries to be expressed easily in a natural and human-tailored syntax. Xcerpt shares many features of logic programming, for example its use of variables as place-holders and unification. However, to the best of our knowledge, it lacks (1) a clear semantics for negation constructs and (2) an implemented tool for running Xcerpt programs/evaluating Xcerpt queries. A by-product of this paper is the provision of both (1) and (2) for a fragment of Xcerpt, namely the subset of this language that we adopt for expressing web checking rules.

We formally map the chosen fragment of Xcerpt for expressing checking rules into abductive logic programs with constraints that can be given as input to the general-purpose CIFF proof procedure [6]. This proof procedure is an extension of the IFF abductive proof procedure [7], allowing a more sophisticated handling of variables and integrating a constraint solver for arithmetical operations. CIFF is proven sound with respect to the three-valued completion semantics [8] and has been implemented in the CIFF System 4.0 [9]. By mapping web checking rules onto abductive logic programs with constraints and deploying CIFF for determining fulfillment (or identifying violation) of the rules, we inherit the soundness properties of CIFF, thus obtaining a sound concrete tool for web checking.

Since CIFF is a general-purpose logic programming procedure, it is not able to handle directly semi-structured data of the kind Xcerpt does. Thus, we also define, as part of the CIFFWEB tool, a simple translation of XML/XHTML into a representation of the web pages suitable for CIFF, and we rely upon that representation in defining the translation function for the web checking rules. At the end of the translation process the CIFF System 4.0 can be successfully used to reason upon the (translation of the) web checking rules, finding those XML/XHTML instances not fulfilling the rules, and representing errors as abducibles in abductive logic programs.

Abductive reasoning can be fully exploited not only for identifying errors but also for repairing them. In this respect, abducibles may represent not only an error instance fired by an XML/XHTML instance violating a rule r (for checking) but also possible modifications (repairs) to that XML/XHTML data such that both r is fulfilled and no other rules are violated. For lack of space, in this paper we only show some examples of repair which illustrate the use of CIFFWEB, leaving a more formal characterization as future work.

2 Motivation

Searching the web, it is easy to encounter web pages containing errors in their structure and/or their data. The main goal of this section is to categorize some typical errors and define a rule pattern for checking these errors.

We argue that, in most cases, considering an XML/XHTML web site instance, the errors can be divided into two main categories: structural errors and content-related (data) errors. Structural errors are those errors regarding the presence and/or absence of tag elements and relations amongst tag elements in the pages. For example, if a tag tag1 is intended to be a child of a tag tag2, the occurrence in the web site of a tag1 instance outside the scope of a tag2 instance is a structural error. Data errors, instead, are about the data content of tag elements. For example, a tag3 could be required to hold a number greater than 100.

To better exemplify the types of error we consider, we present here a very simple XML web site instance of a theatre company. The site is composed of two pages representing a list of shows produced by the company and the list of directors of the company, given below.

%%% showindex.xml
<showlist>
  <show>
    <showname>Epiloghi</showname>
    <dir_by>dir1</dir_by>
  </show>
  <show>
    <showname>Black Comedy</showname>
  </show>
  <show>
    <showname>Confusioni</showname>
    <dir_by>dir3</dir_by>
  </show>
</showlist>

%%% directorindex.xml
<directorlist>
  <director>dir1</director>
  <director>dir2</director>
  <director>dir3</director>
  <director>dir2</director>
</directorlist>

We could specify a number of rules which any web site instance should fulfill. For example, we could specify that a well-formed show tag in the first page must contain both a showname tag element and a dir_by tag element as its children. In this case we have a structural error, due to the lack of a dir_by tag element in the second show of the list. Moreover, we could specify that two director tags in the second page must not contain the same data. In this case we have a data error due to the double occurrence of the dir2 data.

Requirements (and thus errors) can involve more than one web page. For example, a possible requirement for the theatre company specification may be that

each director must direct at least one show

The above requirement can lead to content-related errors which involve both pages. There is a simple way to check this requirement: for each director tag data in the directorindex page, if there does not exist matching data of a dir_by tag in the showindex page then a data-content error is detected. This is the case for dir2 in the earlier example.

In all examples in this section, we have assumed that an error instance is fired by a piece of XML/XHTML data which does not fulfill a certain requirement (or specification). We will first formalize any such requirement as a web checking rule. For each such rule we need both a condition part and an error part. For each instance of the considered data such that the condition part is matched, an error instance needs to be returned. We will then introduce web repair rules to suggest modifications to the data, instead of simply detecting an error, whenever a requirement is not fulfilled. As we will see by means of examples, repair rules can be obtained by modifying checking rules. However, since repairing is a much more difficult task than checking, a formal characterization and a clear methodology for writing web repair rules are still under development.
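The three requirements discussed in this section can also be checked by hand-written code, which gives a baseline for what the declarative rules of the following sections have to express. The Python sketch below is not part of CIFFWEB: it hard-codes the two example pages and the three requirements using the standard xml.etree.ElementTree module. The function name check_site and the message "show without dir_by" are illustrative assumptions; the other two messages are those used by the checking rules later in the paper.

```python
import xml.etree.ElementTree as ET
from collections import Counter

SHOW_INDEX = """
<showlist>
  <show><showname>Epiloghi</showname><dir_by>dir1</dir_by></show>
  <show><showname>Black Comedy</showname></show>
  <show><showname>Confusioni</showname><dir_by>dir3</dir_by></show>
</showlist>
"""

DIRECTOR_INDEX = """
<directorlist>
  <director>dir1</director>
  <director>dir2</director>
  <director>dir3</director>
  <director>dir2</director>
</directorlist>
"""


def check_site(show_xml: str, director_xml: str) -> list[tuple[str, str]]:
    """Return (offending item, message) pairs for the three example requirements."""
    shows = ET.fromstring(show_xml)
    directors = ET.fromstring(director_xml)
    errors = []

    # Structural requirement: every show has a showname and a dir_by child.
    for show in shows.findall("show"):
        name = show.findtext("showname", default="?")
        for required in ("showname", "dir_by"):
            if show.find(required) is None:
                errors.append((name, f"show without {required}"))

    # Data requirement: no two director tags hold the same data.
    names = [d.text.strip() for d in directors.findall("director")]
    for name, occurrences in Counter(names).items():
        if occurrences > 1:
            errors.append((name, "double director occurrence"))

    # Cross-page requirement: each director must direct at least one show.
    directing = {d.text.strip() for d in shows.iter("dir_by")}
    for name in dict.fromkeys(names):  # drop duplicates, keep page order
        if name not in directing:
            errors.append((name, "director w/out show"))

    return errors


ERRORS = check_site(SHOW_INDEX, DIRECTOR_INDEX)
```

Every new requirement needs new procedural code in this style, and nothing in it suggests how to repair a violation. This is the gap that a declarative, rule-based formulation is meant to close.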

3 Web checking rules

In order to formalize and characterize requirements, such as the ones expressed in natural language in the previous section, as web checking rules we first need a formal language. Our choice is to use the Xcerpt [2, 3] language: a deductive, rule-based query language for semi-structured data which allows for direct access to XML data in a very natural way. Our characterization of web checking rules can be accommodated straightforwardly in a fragment of the Xcerpt language. Here, we give some background notions about the Xcerpt fragment we use.³

An Xcerpt program is composed of a GOAL part and a FROM part. The FROM part provides access to the sources (XML files or other sources) via (partial) pattern matching among terms, while the GOAL part reassembles the results of the query into new terms. Variables can be used within either part and act as placeholders (as in logic programming).

As an example, the requirement, in the context of our earlier example pages, that each director must appear at most once in the director list [Rule1] can be expressed in Xcerpt as follows:

GOAL
  all err [ var Dir1, var Dir2, "double director occurrence" ]
FROM
  in { resource {"file:directorindex.xhtml"},
       directorlist {{
         director {{ var Dir1 }},
         director {{ var Dir2 }}
       }}
  } where { Dir1 = Dir2 }
END

The main Xcerpt statement we use in the GOAL part is the all t statement (where t is a term), indicating that each possible instance of t satisfying the FROM part gives rise to a new instance of t returned by the GOAL part. In our methodology for writing web checking rules, all t will always be all err, where err stands for "error".

³ The full Xcerpt language is much more expressive than the fragment we adopt here for expressing web checking and repair rules. A full treatment of Xcerpt is outside the scope of this paper. For further information about Xcerpt see [2].

In the FROM part, an access to a resource is wrapped within an in statement. Multiple accesses must be connected by and, indicating that all subqueries have to succeed in order to make the whole query succeed. The main Xcerpt query terms t which we use in our work are: (1) double curly brackets, i.e. t{{ }}, denoting partial term specification of t; the order of the subterms of t within the curly brackets is irrelevant; (2) variables, expressed by var followed by an identifier (variable name); values for variables can be strings and numeric values; (3) a where statement for expressing constraints through standard operators like =, \=, <, >, <=; (4) subterms of the form without t denoting subterm negation, illustrated within the following formulation of the requirement each director directs at least one show [Rule2]:

GOAL
  all err [ var DirName, "director w/out show" ]
FROM
  and {
    in { resource {"file:directorindex.xhtml"},
         director {{ var DirName }}
    },
    in { resource {"file:showindex.xhtml"},
         showlist {{
           without show {{ dir_by {{ var DirName }} }}
         }}
    }
  }
END

The without construct is only applicable to subterms t that do not occur at root level in the underlying web pages (in our example, showlist and directorlist occur at root level); all variables that occur within a without have to appear elsewhere outside the without; and, finally, without subterms cannot be nested.

4 An Xcerpt-like grammar for positive web checking rules

To simplify the presentation, we first focus our attention on positive web checking rules, i.e. rules in which negation (the without statement in Xcerpt syntax) does not occur.

In defining the syntax, we distinguish between basic syntactic categories and structural syntactic categories. The basic categories are used to define the basic components of the rules, namely variables, constants (strings and numbers), XML tags and constraints (e.g. X > Y). The structural categories, instead, are the main parts of the Xcerpt rules, as we have seen in the previous examples: the query part, the error part, the in construct and so on.

For each syntactic category, we avoid stating explicitly the syntactic rules for sequences of elements derived from it. We use instead the following notational convention. Given a syntactic category C, C? denotes the syntactic category for sequences of elements derived from C. The metarule for C? is the following:

C? ::= C, C? | ε        (ε denotes the empty string)

The grammar for the basic categories is the following:

Const      ::= String | Number
Tag        ::= String
VarName    ::= String
Var        ::= var VarName
VarOrConst ::= Var | Const
Rel        ::= < | > | ≤ | ≥ | = | \=
Constraint ::= Var Rel VarOrConst

Notice that the Tag and the VarName categories simply generate strings. However, we keep them as distinct categories to improve the readability of the syntactic rules for the structural categories.

In the second part of the grammar we define the main constructs of the web checking rules. Recall that a web checking rule is composed of two main parts: a query part and an error part. The error part gives the error specification, while the query part is a conjunction of queries represented by an and wrapping a list of in constructs. Each element of the conjunction (i.e. each in construct) expresses a query involving a specific resource. At the end of the whole conjunction a set of constraints can be expressed by the where construct.

CheckRule ::= GOAL Error FROM Query END
Error     ::= all err [ VarOrConst? , String ]
Query     ::= InPart Where
InPart    ::= In | and In?
In        ::= in { { Resource } Term }
Resource  ::= resource { file: String }
Term      ::= Tag {{ VarOrConst? Term? }}
Where     ::= where { Constraint? } | ε

We denote by α ∈ C that α is a string derived from the syntactic category C.
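The grammar above maps naturally onto a small abstract syntax tree. The following Python sketch is one possible encoding, given only to make the categories concrete; the class and field names (Var, Term, Constraint, In, CheckRule, term_vars) are illustrative assumptions and are not taken from the CIFFWEB implementation. Sequences (the C? categories) are represented as tuples.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Var:                       # Var ::= var VarName
    name: str


Const = Union[str, int, float]   # Const ::= String | Number
VarOrConst = Union[Var, Const]   # VarOrConst ::= Var | Const
RELATIONS = {"<", ">", "<=", ">=", "=", "\\="}


@dataclass(frozen=True)
class Term:                      # Term ::= Tag {{ VarOrConst? Term? }}
    tag: str
    data: tuple = ()             # sequence of VarOrConst
    subterms: tuple = ()         # sequence of Term


@dataclass(frozen=True)
class Constraint:                # Constraint ::= Var Rel VarOrConst
    left: Var
    rel: str
    right: VarOrConst


@dataclass(frozen=True)
class In:                        # In ::= in { { Resource } Term }
    resource: str
    term: Term


@dataclass(frozen=True)
class CheckRule:                 # CheckRule ::= GOAL Error FROM Query END
    error_args: tuple            # the VarOrConst? part of Error
    message: str                 # the String part of Error
    ins: tuple                   # InPart: one or more In
    where: tuple = ()            # Where: zero or more Constraint


def term_vars(term: Term) -> list[str]:
    """Variable names occurring in a term, in left-to-right order."""
    found = [d.name for d in term.data if isinstance(d, Var)]
    for sub in term.subterms:
        found.extend(term_vars(sub))
    return found


# [Rule1] from Section 3 encoded with the classes above.
RULE1 = CheckRule(
    error_args=(Var("Dir1"), Var("Dir2")),
    message="double director occurrence",
    ins=(In("directorindex.xhtml",
            Term("directorlist", subterms=(
                Term("director", data=(Var("Dir1"),)),
                Term("director", data=(Var("Dir2"),)),
            ))),),
    where=(Constraint(Var("Dir1"), "=", Var("Dir2")),),
)
```

For example, term_vars(RULE1.ins[0].term) collects the variables of the query pattern, which is the kind of information a translator needs when checking that the error part only mentions variables bound by the query.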

5 The (positive) translation process

In this section we show formally how to translate the positive web checking rules into abductive logic programs suitable for the CIFF proof procedure. This section is organized as follows: first we give a brief background about abductive logic programs, then we show how we represent the XML/XHTML pages in a useful way for CIFF, then we show the formal translation function for the rules. Finally we give some examples of translations.

5.1 Abductive logic programming with constraints

An abductive logic program with constraints (ALPC) consists of (1) a constraint logic program, referred to as the theory, namely a set of clauses of the form A ← L1 ∧ ... ∧ Lm, where each Li is a literal (an ordinary or abducible atom, its negation, or a constraint atom), A is an ordinary atom, and all variables are implicitly universally quantified from the outside; (2) a finite set of abducible predicates, which do not occur in the conclusion A of any clause in the theory; and (3) a finite set of integrity constraints (ICs), namely implications of the form L1 ∧ ... ∧ Lm → A1 ∨ ... ∨ An, where each Li is a literal and each Aj is an (ordinary, abducible, constraint or false) atom, and all variables are implicitly universally quantified from the outside. The theory provides definitions for ordinary predicates, while constraint atoms are evaluated within an underlying structure, as in conventional constraint logic programming.

A query is a conjunction of literals (whose variables are implicitly existentially quantified). An answer to a query specifies which instances of the abducible predicates should hold so that both (some instance of) the query is entailed by the constraint logic program extended with the abducibles and the ICs are satisfied.

The CIFF procedure computes such answers, with respect to the notion of entailment given by the 3-valued completion [8]. It operates with a presentation of the theory as a set of iff-definitions, which are obtained by the (selective) completion of all predicates defined in the theory except for the abducible and the constraint predicates. CIFF returns one of three possible outputs: (1) an abductive answer to the query (a set of possibly non-ground abducible atoms and a set of constraints on the variables of the query and of the abducible atoms); (2) a failure, indicating that there is no answer; and (3) an undefined answer, indicating that a critical part of the input is not allowed. Allowedness relates to input formulas with certain quantification patterns for which the concept of a (finite) answer cannot be defined [6].

A CIFF computation starts from a node composed of the query plus the ICs. Then the CIFF procedure repeatedly applies a set of rewrite rules (see [6]) to the nodes in order to update them and to construct a proof search tree. A node containing a goal false is called a failure node. If all branches in a derivation terminate with failure nodes, then the procedure is said to fail (no answer to the query). A non-failure (allowed) node to which no more proof rules apply can be used to extract an abductive answer.

The CIFF System 4.0 is a Prolog implementation of CIFF, available at

www.di.unipi.it/~terreni/research.php.

The system takes as input a list of files representing an ALPC. The abducible predicates Pred are declared explicitly in the form abducible(Pred), while the operators representing clauses and ICs are respectively :- (Prolog-like syntax) and implies. Within clauses and integrity constraints, negative literals are represented in the form not(Atom), constraint atoms are represented in the form #> #< #=..., and disequality atoms are represented as A\==B.

5.2 XML representation

To pave the way to writing, in a format suitable for the CIFF system, specification rules for properties of web sites, we first provide a suitable representation of web site data. CIFF cannot directly handle XML data and XML structure, hence we propose a translation whereby, for each tag element of the original XML file, an atom pg_el is generated having the following form:

pg_el(ID,TagName,IDFather)

where TagName is the name of the tag element, whereas ID and IDFather represent the unique identifiers for the tag element and its father. They are needed for keeping information about the structure of the XML page.

Furthermore, data inside a tag element is represented by an atom of the form:

data_el(ID,Data,IDFather)

The following is the translation of the XML pages seen in Section 2:

xml_pg('showindex.xml',0).
pg_el(1,showlist,0).
pg_el(2,show,1).
pg_el(3,showname,2).
data_el(4,'Epiloghi',3).
pg_el(5,dir_by,2).
data_el(6,'dir1',5).
pg_el(7,show,1).
pg_el(8,showname,7).
data_el(9,'Black Comedy',8).
pg_el(10,show,1).
pg_el(11,showname,10).
data_el(12,'Confusioni',11).
pg_el(13,dir_by,10).
data_el(14,'dir3',13).

xml_pg('directorindex.xml',15).
pg_el(16,directorlist,15).
pg_el(17,director,16).
data_el(18,'dir1',17).
pg_el(19,director,16).
data_el(20,'dir2',19).
pg_el(21,director,16).
data_el(22,'dir3',21).

For each page, we also store a fact of the form xml_pg(FileName,ID), holding the name of the file and the unique identifier of the page.
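The representation described in this subsection can be produced by a pre-order traversal of each page that hands out consecutive identifiers. The Python sketch below is a minimal illustration of that idea, not the translator shipped with CIFFWEB: the function name page_facts is an assumption, and the sketch only handles the simple case used in the example, where an element contains either text or child elements but not both.

```python
import xml.etree.ElementTree as ET
from itertools import count
from typing import Iterator

SHOW_INDEX = """<showlist>
  <show><showname>Epiloghi</showname><dir_by>dir1</dir_by></show>
  <show><showname>Black Comedy</showname></show>
  <show><showname>Confusioni</showname><dir_by>dir3</dir_by></show>
</showlist>"""

DIRECTOR_INDEX = """<directorlist>
  <director>dir1</director><director>dir2</director>
  <director>dir3</director><director>dir2</director>
</directorlist>"""


def page_facts(file_name: str, xml_text: str, ids: Iterator[int]) -> list[str]:
    """Translate one XML page into xml_pg/2, pg_el/3 and data_el/3 facts."""
    page_id = next(ids)
    facts = [f"xml_pg('{file_name}',{page_id})."]

    def visit(element: ET.Element, father_id: int) -> None:
        element_id = next(ids)
        facts.append(f"pg_el({element_id},{element.tag},{father_id}).")
        text = (element.text or "").strip()
        if text:  # data inside the tag gets its own identifier
            facts.append(f"data_el({next(ids)},'{text}',{element_id}).")
        for child in element:
            visit(child, element_id)

    visit(ET.fromstring(xml_text), page_id)
    return facts


IDS = count(0)  # one shared counter keeps identifiers unique across pages
FACTS = (page_facts("showindex.xml", SHOW_INDEX, IDS)
         + page_facts("directorindex.xml", DIRECTOR_INDEX, IDS))
```

With the shared counter, the show page receives identifiers 0 to 14 and the director page starts at 15, as in the facts listed above. The fourth director element of the director page from Section 2 then receives identifiers 23 and 24.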

5.3 Positive translation function

We are now ready to show the formal translation function for positive web checking rules, mapping them from Xcerpt-like syntax to abductive logic programs. We define inductively two functions, namely T⟦α⟧ and T′⟦α⟧(id), which, given a string α derivable from one of the syntactic categories defined in the previous section, return a string representing a part of an abductive logic program.⁴ The abductive logic program corresponding to a whole web checking rule is obtained by applying the function T⟦R⟧ to a rule R ∈ CheckRule. This abductive logic program with constraints is then suitable as input for CIFF. In particular, the translation of a CheckRule gives rise to a single CIFF integrity constraint.

The two translation functions differ in that T′⟦α⟧(id) has an extra argument, id, representing a number. Intuitively, a call of the form T′⟦α⟧(id) amounts to translating the element α which is a component of another element, uniquely identified by id. This identifier id is needed in the translation of α to keep the correspondence between α and the element it is part of. In the sequel we assume that the auxiliary function

newID()

generates a fresh variable whenever called. This function will be used whenever a new identifier id is needed to uniquely identify the element being translated.

For brevity we omit the specification of the translation of the basic syntactic elements, since they basically remain unchanged (e.g. the translation of a string α representing a variable name is α itself).

In Xcerpt, variables are used as in logic programming, that is, they can be shared between distinct parts of a rule by using the same name. In the translation, variable names are simply left unchanged, so that sharing is maintained.

We use a top-down approach in defining the translation functions. Hence, the first rule is for the CheckRule category.

⁴ As usual, we use ⟦·⟧ to highlight the syntactic arguments of the translation functions.

CheckRule
T⟦GOAL Err FROM Query END⟧ =
    let Head = T⟦Err⟧
    and Body = T⟦Query⟧
    in
    Body implies Head.

As we can see, the result is a string representing a CIFF integrity constraint: the body is the translation of the query part and the head is the translation of the error part. Intuitively, each instance of the XML data matching the body will fire an instance of the head representing the corresponding error. We will see that the predicate used in the Head of such a constraint is abducible, hence firing the constraint will amount to abducing its head, constructed as follows:

Error
T⟦all err [Vars, Msg]⟧ = [abd_err([Vars],Msg)]

The predicate abd_err/2 is defined as an abducible predicate. The first argument is the list of variables occurring in the error part of the rule, the second argument is simply the error message.

For the query component, the translation produces the conjunction of the translations of the InPart component and of the Where component of the query.

Query
T⟦InPart Where⟧ =
    let Conjunction = T⟦InPart⟧
    and Constraints = T⟦Where⟧
    in
    [Conjunction, Constraints]

The constraints in the Where part of a query are simply translated into the same conjunction of CIFF constraints: we omit this part for space reasons.

The translation of the InPart of a query results in the conjunction of the translations of the single components.

In-1
T⟦in {{resource {file: Res} Term}}⟧ = T⟦Term⟧

In-2
T⟦In, InRest⟧ =
    let Conjunct = T⟦In⟧
    and OtherConjuncts = T⟦InRest⟧
    in
    Conjunct, OtherConjuncts

The following rules translate terms in Term, and produce the single conjuncts in the conjunction constituting the body of the integrity constraint corresponding to a checking rule. Each term specifies a tag and hence the translation amounts to producing an instance of the pg_el predicate, possibly in conjunction with other conjuncts corresponding to the translation of the subtags. A fresh variable is generated through the newID() function, to uniquely represent the tag being translated. Note the usage of the function T′⟦·⟧(·) for the translation of the data elements (variables and constants) and of the subtags of the tag being translated.


Term-1
T⟦tagName {{VarConstSeq, TermSeq}}⟧ =
    let Id = newID()
    and DataElements = T′⟦VarConstSeq⟧(Id)
    and SubTerms = T′⟦TermSeq⟧(Id)
    and Ids = IdVars(DataElements) ∪ IdVars(SubTerms)
    and Constraints = all_distinct(Ids)
    in
    pg_el(Id, tagName, _), DataElements, SubTerms, Constraints

In the above rule, we have introduced two new auxiliary functions:
- IdVars(X), which is assumed to return the set of all Ids generated by the newID() function occurring in its argument X;
- all_distinct(Vars), which is assumed to generate the constraint

$\bigwedge_{x, y \in Vars,\; x \neq y} x \neq y$

Term-2.1
T′⟦var VarName⟧(Id) =
    let Id′ = newID()
    in
    data_el(Id′, VarName, Id)

Term-2.2
T′⟦Const⟧(Id) =
    let Id′ = newID()
    in
    data_el(Id′, Const, Id)

Term-3
T′⟦subTagName {{VarConstSeq, TermSeq}}⟧(Id) =
    let Id′ = newID()
    and DataElements = T′⟦VarConstSeq⟧(Id′)
    and SubTerms = T′⟦TermSeq⟧(Id′)
    in
    pg_el(Id′, subTagName, Id), DataElements, SubTerms

Term-4
T′⟦X, Xs⟧(Id) =
    let Conjunct = T′⟦X⟧(Id)
    and OtherConjuncts = T′⟦Xs⟧(Id)
    in
    Conjunct, OtherConjuncts

Cases Term-2.1 and Term-2.2 produce instances of the data_el/3 predicate. Notice that the translation amounts to producing an instance data_el(Id′, El, Id), where El is the actual data element, Id′ is the unique identifier assigned to it and Id is the unique identifier of the term the data element is part of. In the case Term-3 we have a translation similar to case Term-1, the only difference being that, in case Term-1, the third argument of pg_el/3 is the anonymous variable, whereas in case Term-3 it is the unique identifier of the term containing the subterm being translated. Finally, the case Term-4 corresponds to the translation of a sequence X, Xs in VarOrConst? or in Term?.
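The translation rules of this subsection can be read as a recursive procedure over the term structure. The Python sketch below illustrates that reading under several assumptions: the classes Var and Term and the functions translate and integrity_constraint are hypothetical names, disequalities are printed in the CIFF-style #\= notation, and where constraints are passed in as ready-made strings. For the all_distinct step the sketch collects only the identifiers of the direct children of the root term and emits one disequality per unordered pair. This is one possible reading of the Term-1 rule, chosen here because it yields a single disequality for the root term of [Rule1]; the rule text could also be read as collecting every generated identifier.

```python
from dataclasses import dataclass
from itertools import combinations, count


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Term:
    tag: str
    data: tuple = ()       # variables and constants inside the tag
    subterms: tuple = ()   # nested terms


def translate(term: Term) -> list[str]:
    """Conjuncts for a root-level term (rule Term-1)."""
    ids = count(1)

    def new_id() -> str:
        return f"ID{next(ids)}"

    def t_prime(item, father: str):
        """Conjuncts for a component of `father`, plus the component's id."""
        ident = new_id()
        if isinstance(item, Term):                          # Term-3
            conjuncts = [f"pg_el({ident},{item.tag},{father})"]
            for child in item.data + item.subterms:         # Term-4
                conjuncts += t_prime(child, ident)[0]
            return conjuncts, ident
        value = item.name if isinstance(item, Var) else repr(item)
        return [f"data_el({ident},{value},{father})"], ident  # Term-2.1/2.2

    root = new_id()
    conjuncts, child_ids = [f"pg_el({root},{term.tag},_)"], []
    for child in term.data + term.subterms:
        child_conjuncts, ident = t_prime(child, root)
        conjuncts += child_conjuncts
        child_ids.append(ident)
    # all_distinct over the direct children (assumption, see lead-in)
    conjuncts += [f"{a} #\\= {b}" for a, b in combinations(child_ids, 2)]
    return conjuncts


def integrity_constraint(term, where, error_vars, message) -> str:
    """Rules CheckRule, Error and Query for a rule with a single in part."""
    body = ", ".join(translate(term) + list(where))
    head = f'abd_err([{",".join(error_vars)}],"{message}")'
    return f"[{body}] implies [{head}]."


ROOT1 = Term("directorlist", subterms=(
    Term("director", data=(Var("Dir1"),)),
    Term("director", data=(Var("Dir2"),)),
))
IC1 = integrity_constraint(ROOT1, ["Dir1 = Dir2"], ["Dir1", "Dir2"],
                           "double director occurrence")
```

Because identifiers are generated depth-first, each director element is numbered before its data element, so the two director elements of ROOT1 receive ID2 and ID4 and their data elements ID3 and ID5.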

Example 1. Let us consider the web checking rule [Rule1] for the theatre company web site, seen in Section 3. To simplify the presentation of the translation results, let us identify the main syntactic components of the rule:

err1   = all err [ var Dir1, var Dir2, "double director occurrence" ]
query1 = in1 , where1
in1    = in { resource {"file:directorindex.xml"} root1 }
root1  = directorlist {{ director {{var Dir1}}, director {{var Dir2}} }}
where1 = where {Dir1 = Dir2}

Let us apply the translation function T to Rule1.

T⟦Rule1⟧  = T⟦query1⟧ implies T⟦err1⟧.
T⟦err1⟧   = abd_err([Dir1,Dir2],"double director occurrence")
T⟦query1⟧ = [T⟦in1⟧, T⟦where1⟧]
T⟦in1⟧    = T⟦root1⟧
T⟦where1⟧ = Dir1 = Dir2
T⟦root1⟧  = pg_el(ID1,directorlist,_), ID2 #\= ID4,
            pg_el(ID2,director,ID1), data_el(ID3,Dir1,ID2),
            pg_el(ID4,director,ID1), data_el(ID5,Dir2,ID4)

Summarizing, we have the following translation:

[pg_el(ID1,directorlist,_), ID2 #\= ID4,
 pg_el(ID2,director,ID1), data_el(ID3,Dir1,ID2),
 pg_el(ID4,director,ID1), data_el(ID5,Dir2,ID4), Dir1 = Dir2]
implies
[abd_err([Dir1,Dir2], "double director occurrence")].
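To see what the integrity constraint of Example 1 detects on the example site, its body can be matched against the ground facts of the director page. The Python sketch below does this with a naive backtracking matcher. It is only an illustration of when the constraint fires: CIFF itself works by rewriting nodes of a proof tree and can return non-ground abducibles together with constraints, none of which is modelled here. The names V, match and solutions are hypothetical, and the facts with identifiers 23 and 24 are an assumed encoding of the fourth director element of the page in Section 2.

```python
class V(str):
    """A logic variable; V("_") plays the role of the anonymous variable."""


FACTS = [
    ("pg_el", 16, "directorlist", 15),
    ("pg_el", 17, "director", 16), ("data_el", 18, "dir1", 17),
    ("pg_el", 19, "director", 16), ("data_el", 20, "dir2", 19),
    ("pg_el", 21, "director", 16), ("data_el", 22, "dir3", 21),
    ("pg_el", 23, "director", 16), ("data_el", 24, "dir2", 23),  # assumed ids
]

# Body of the integrity constraint of Example 1, without its constraints.
BODY = [
    ("pg_el", V("ID1"), "directorlist", V("_")),
    ("pg_el", V("ID2"), "director", V("ID1")),
    ("data_el", V("ID3"), V("Dir1"), V("ID2")),
    ("pg_el", V("ID4"), "director", V("ID1")),
    ("data_el", V("ID5"), V("Dir2"), V("ID4")),
]


def match(pattern, fact, env):
    """Extend env so that pattern equals the ground fact, or return None."""
    if len(pattern) != len(fact):
        return None
    env = dict(env)
    for p, f in zip(pattern, fact):
        if isinstance(p, V):
            if p != "_" and env.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return env


def solutions(body, facts, env=None):
    """Yield every variable binding that matches all atoms of the body."""
    env = {} if env is None else env
    if not body:
        yield env
        return
    for fact in facts:
        extended = match(body[0], fact, env)
        if extended is not None:
            yield from solutions(body[1:], facts, extended)


def constraints_hold(env) -> bool:
    # ID2 #\= ID4 and Dir1 = Dir2
    return env["ID2"] != env["ID4"] and env["Dir1"] == env["Dir2"]


ERRORS = [
    ("abd_err", (env["Dir1"], env["Dir2"]), "double director occurrence")
    for env in solutions(BODY, FACTS)
    if constraints_hold(env)
]
```

The constraint fires twice for the duplicated dir2, once for each ordering of the two director elements, since the rule body is symmetric in Dir1 and Dir2.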

6 Adding negation to the translation process

The mapping of web checking rules that also contain negative parts, i.e. without statements in Xcerpt syntax, is a bit more complicated. The resulting abductive logic program is no longer a single integrity constraint: it also includes a set of clauses which define new (fresh) predicates needed for handling the negations correctly. The grammar must also be extended to cover the without construct, on which we impose the following restrictions: (1) a without cannot appear at the top level of an in statement, (2) without statements cannot be nested within each other, and (3) variables appearing at any depth in the scope of a without must also appear elsewhere outside a without.

On the whole, the structure of a rule is essentially the same as before. All the grammar rules regarding basic constructs remain unchanged, as do all other grammar rules which do not directly concern negation. In the following we show only the grammar rules for the newly introduced syntactic categories and for the syntactic categories which need to be changed.

Term    ::= Tag {{ VarOrConst? Term? Without? }}
Without ::= without PosTerm | without VarOrConst
PosTerm ::= Tag {{ VarOrConst? PosTerm? }}

6.1 Translation function

As anticipated, the ALPC obtained by translating a whole rule is no more asingle CIFF integrity constraint but it contains also a set of clauses. Hence weneed to update the translation function T because now it must both handle thenew grammar rules and take into account a different output, i.e. T could nomore return a straight string but it has to return a couple 〈 Conjs ; Clauses〉 where Conjs is a string, representing a conjunction, which will augment thebody of the integrity constraints exactly as for the output of the positive transla-tion function, and Clauses is a set of definitions for fresh predicates introducedto address negation. Intuitively only those parts of a rule within the scope of awithout statement augment the Clauses component while the other rules leavesit unchanged. Hence, apart of having a different codomain the translation func-tion T operates as the old translation function T over most positive syntacticalcategories.As for positive translation we need a further translation function TI for handlingthe unique identifiers of the XML/XHTML elements. To simplify the presenta-tion we do in the sequel an abuse of notation denoting again as c the translationof a basic constructs c ∈ C. I.e. we avoid to represent formally the translation ofbasic constructs which would be 〈 c ; � 〉, but we treat them as if it were stringsas in positive rules.

We again follow a top-down approach in presenting the function, hence the first rule is for the CheckRule syntactic category.

CheckRule
T⟦GOAL Err FROM Query END⟧ =
    let 〈 Head ; ∅ 〉 = T⟦Err⟧
    and 〈 Body ; Clauses 〉 = T⟦Query⟧
    in 〈 Body implies Head. ; Clauses 〉

As we can see, the output of this rule represents the whole abductive logic program with constraints, as for the positive translation function. However, the ALPC is now composed of an integrity constraint and a set of clauses. The latter component is only augmented by the query part of the web checking rule and in particular, as we will see, only by the translation of the without statements.
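The pair-valued output can be illustrated with a small Python sketch. It is only an approximation under stated assumptions: the names `Result`, `combine` and `check_rule` are hypothetical, conjuncts are kept as a list of strings rather than one string, and the sample inputs are made up to mirror the shape of the translation outputs.

```python
from typing import List, Set, Tuple

# A translation result: (conjuncts for the constraint body, clauses for fresh predicates).
Result = Tuple[List[str], Set[str]]


def combine(first: Result, rest: Result) -> Result:
    """Sequence case: concatenate the conjuncts and take the union of the clause sets."""
    conjs1, clauses1 = first
    conjs2, clauses2 = rest
    return conjs1 + conjs2, clauses1 | clauses2


def check_rule(head: Result, query: Result) -> Tuple[str, Set[str]]:
    """CheckRule case: build 'Body implies Head.' and pass the clauses through."""
    head_conjs, _ = head  # the error part never contributes clauses
    body_conjs, clauses = query
    constraint = f"[{', '.join(body_conjs)}] implies [{', '.join(head_conjs)}]."
    return constraint, clauses


# Hypothetical inputs: one positive conjunct and one negated conjunct with its clause.
positive: Result = (["pg_el(ID1,show,_)"], set())
negative: Result = (["not(pred_1(ID1))"], {"pred_1(ID1) :- pg_el(ID2,dir_by,ID1)."})
head: Result = (["abd_err([],'show without a director')"], set())

constraint, clauses = check_rule(head, combine(positive, negative))
print(constraint)
print(clauses)
```

Only the negated conjunct contributes a clause, so the clause set of the whole rule is exactly what the without statements produced.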

In the following list of rules, the only difference between a rule and the corresponding rule of the positive translation function is its codomain. We add further comments only where there are other differences.


Error
T⟦all err[Vars, Msg]⟧ = 〈 [abd_err([Vars],Msg)] ; ∅ 〉

Query
T⟦InPart Where⟧ =
    let 〈 Conjunction ; Clauses 〉 = T⟦InPart⟧
    and 〈 Constraints ; ∅ 〉 = T⟦Where⟧
    in 〈 [Conjunction, Constraints] ; Clauses 〉

Where
T⟦where Constr⟧ = 〈 Constr ; ∅ 〉

In-1
T⟦in{{resource{file : Res} Term}}⟧ = T⟦Term⟧

In-2
T⟦In, InRest⟧ =
    let 〈 Conjunct ; Clauses 〉 = T⟦In⟧
    and 〈 OtherConjuncts ; OtherClauses 〉 = T⟦InRest⟧
    in 〈 Conjunct, OtherConjuncts ; Clauses ∪ OtherClauses 〉

Term-1
T⟦tagName {{VarConstSeq, TermSeq, WithoutSeq}}⟧ =
    let Id = newID()
    and 〈 DataElements ; ∅ 〉 = T′⟦VarConstSeq⟧(Id)
    and 〈 SubTerms_Conjs ; SubTerms_Clauses 〉 = T′⟦TermSeq⟧(Id)
    and Ids = IdVars(DataElements) ∪ IdVars(SubTerms_Conjs)
    and Constraints = all_distinct(Ids)
    and 〈 Without_Conjs ; Without_Clauses 〉 = T′⟦WithoutSeq⟧(Id)
    in 〈 pg_el(Id, tagName, _), DataElements, SubTerms_Conjs, Constraints, Without_Conjs ;
         Without_Clauses ∪ SubTerms_Clauses 〉

Term-2.1
T′⟦var VarName⟧(Id) =
    let Id′ = newID()
    in 〈 data_el(Id′, VarName, Id) ; ∅ 〉

Term-2.2
T′⟦Const⟧(Id) =
    let Id′ = newID()
    in 〈 data_el(Id′, Const, Id) ; ∅ 〉

Term-3
T′⟦tagName {{VarConstSeq, TermSeq, WithoutSeq}}⟧(Id) =
    let Id′ = newID()
    and 〈 DataElements ; ∅ 〉 = T′⟦VarConstSeq⟧(Id′)
    and 〈 SubTerms_Conjs ; SubTerms_Clauses 〉 = T′⟦TermSeq⟧(Id′)
    and Ids = IdVars(DataElements) ∪ IdVars(SubTerms_Conjs)
    and Constraints = all_distinct(Ids)
    and 〈 Without_Conjs ; Without_Clauses 〉 = T′⟦WithoutSeq⟧(Id′)
    in 〈 pg_el(Id′, tagName, Id), DataElements, SubTerms_Conjs, Constraints, Without_Conjs ;
         Without_Clauses ∪ SubTerms_Clauses 〉

Term-4
T′⟦X, Xs⟧(Id) =
    let 〈 Conjunct ; Clauses 〉 = T′⟦X⟧(Id)
    and 〈 OtherConjuncts ; OtherClauses 〉 = T′⟦Xs⟧(Id)
    in 〈 Conjunct, OtherConjuncts ; Clauses ∪ OtherClauses 〉

Case Term-1 must now take into account the new syntax of the Term syntactic category, in particular the Without statements which can be added at the end of the subterm specification. These are the statements which augment the Clauses part of the output. The other cases for the Term category must take Without statements into account in the same way.

The next translation rules are for the new syntactic categories, namely the Without and PosTerm categories.

Without-1
T′⟦without PosTerm⟧(Id) =
    let Pred = newPred()
    and Vs = vars(PosTerm)
    and 〈 PosTerm_Conjs ; ∅ 〉 = T′⟦PosTerm⟧(Id)
    in 〈 not(Pred(Vs,Id)) ; {Pred(Vs,Id) :- PosTerm_Conjs.} 〉

Without-2
T′⟦without VarConst⟧(Id) =
    let Pred = newPred()
    and V = vars(VarConst)
    and 〈 DataElement ; ∅ 〉 = T′⟦VarConst⟧(Id)
    in 〈 not(Pred(V,Id)) ; {Pred(V,Id) :- DataElement.} 〉


In the above rules, we have introduced two new auxiliary functions:
- newPred(), which returns a fresh predicate name of unspecified arity;
- vars(t), which returns, as an element vs ∈ VarConstSeq, all the distinct variables appearing in t at any depth.

Cases Without-1 and Without-2 both create a new (fresh) predicate Pred for handling negation. The definition of Pred is given by PosTerm_Conjs which is, roughly speaking, the evaluation of the PosTerm inside the without. Note that the PosTerm category does not augment the Clauses part of the output, because without statements cannot be nested. A singleton containing the definition of Pred so obtained augments the Clauses part of the output. The Conjs part, instead, is augmented by a single conjunct of the following form:

not(Pred(Args))

where the arguments Args are the ID of the parent of the without statement and all the variables which appear inside it. In this way, we preserve the variable sharing between the integrity constraint and the clause outside it: when the clause defining Pred is evaluated, the arguments of Pred ensure that the variable bindings from the integrity constraint are kept. The case Without-3, which follows, is a standard rule for handling sequences.

Without-3
T′⟦W, Ws⟧(Id) =
    let 〈 Conjunct ; Clauses 〉 = T′⟦W⟧(Id)
    and 〈 OtherConjuncts ; OtherClauses 〉 = T′⟦Ws⟧(Id)
    in 〈 Conjunct, OtherConjuncts ; Clauses ∪ OtherClauses 〉

Finally, we have the rules for handling the PosTerm category.

PosTerm-1
T′⟦tagName {{VarConstSeq, PosTermSeq}}⟧(Id) =
    let Id′ = newID()
    and 〈 DataElements ; ∅ 〉 = T′⟦VarConstSeq⟧(Id′)
    and 〈 SubPosTerms_Conjs ; ∅ 〉 = T′⟦PosTermSeq⟧(Id′)
    and Ids = IdVars(DataElements) ∪ IdVars(SubPosTerms_Conjs)
    and Constraints = all_distinct(Ids)
    in 〈 pg_el(Id′, tagName, Id), DataElements, SubPosTerms_Conjs, Constraints ; ∅ 〉

PosTerm-2
T′⟦X, Xs⟧(Id) =
    let 〈 Conjunct ; ∅ 〉 = T′⟦X⟧(Id)
    and 〈 OtherConjuncts ; ∅ 〉 = T′⟦Xs⟧(Id)
    in 〈 Conjunct, OtherConjuncts ; ∅ 〉


Case PosTerm-1 is very similar to Term-1. The only difference is that no without statement can appear in the subterm specification at any depth, and thus the Clauses part of the output is not affected by this syntactic category. Case PosTerm-2 is again a standard translation rule for handling sequences.
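The Term, Without and PosTerm cases can be approximated by the following Python sketch. It is a simplification under explicit assumptions, not the CIFFWEB translator: the `Term` and `Translator` names are hypothetical, constants and the `without VarOrConst` form are left out, the all_distinct constraints are omitted, and results are lists of strings rather than a string and a set.

```python
import itertools
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Term:
    tag: str
    variables: List[str] = field(default_factory=list)    # `var X` items
    children: List["Term"] = field(default_factory=list)  # positive subterms
    withouts: List["Term"] = field(default_factory=list)  # bodies of `without`


def variables_of(t: Term) -> List[str]:
    """Distinct variables of a positive term at any depth, in first-occurrence order."""
    seen: List[str] = []
    for name in t.variables + [v for c in t.children for v in variables_of(c)]:
        if name not in seen:
            seen.append(name)
    return seen


class Translator:
    def __init__(self) -> None:
        self._ids = itertools.count(1)    # plays the role of newID()
        self._preds = itertools.count(1)  # plays the role of newPred()

    def new_id(self) -> str:
        return f"ID{next(self._ids)}"

    def new_pred(self) -> str:
        return f"pred_{next(self._preds)}"

    def term(self, t: Term, parent: str = "_") -> Tuple[List[str], List[str]]:
        """Return (conjuncts, clauses) for a term whose parent has the given ID."""
        ident = self.new_id()
        conjs = [f"pg_el({ident},{t.tag},{parent})"]
        clauses: List[str] = []
        for name in t.variables:
            conjs.append(f"data_el({self.new_id()},{name},{ident})")
        for child in t.children:
            sub_conjs, sub_clauses = self.term(child, ident)
            conjs += sub_conjs
            clauses += sub_clauses
        for body in t.withouts:
            neg_conjs, neg_clauses = self.without(body, ident)
            conjs += neg_conjs
            clauses += neg_clauses
        return conjs, clauses

    def without(self, body: Term, parent: str) -> Tuple[List[str], List[str]]:
        """Negated atom for the constraint body plus the clause defining the fresh predicate."""
        pred = self.new_pred()
        head = f"{pred}({','.join(variables_of(body) + [parent])})"
        body_conjs, _ = self.term(body, parent)  # a PosTerm never yields clauses
        return [f"not({head})"], [f"{head} :- {', '.join(body_conjs)}."]


# show {{ showname {{ var ShowName }}, without dir_by {{ }} }}
rule = Term("show",
            children=[Term("showname", variables=["ShowName"])],
            withouts=[Term("dir_by")])
conjs, clauses = Translator().term(rule)
print(", ".join(conjs))
print(clauses[0])
```

Passing the parent ID (and any shared variables) as arguments of the fresh predicate is what links the clause back to the bindings made in the integrity constraint.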

Example 2. Let us consider again the theatre company web site. We want to express that each show has at least one director [Rule3]. The Xcerpt representation is as follows:

GOAL
  all error [ var ShowName, "show without a director"]
FROM
  in { resource {"file:showindex.xml"},
    show {{
      showname {{var ShowName}},
      without dir_by {{ }}
    }}
  }
END

Let us identify the main syntactic components of the rule:

err3 = all error [ var ShowName, "show without a director"]
query3 = in {resource {"file:showindex.xml"} root3 }
root3 = show {{ showname {{var ShowName}} wout3 }}
wout3 = without dir_by {{ }}

The rule, and in particular its root3 element, is translated as follows (note that the output is now composed of two parts: the conjuncts part and the clauses part, the latter resulting from wout3).

T⟦Rule3⟧ =
    let 〈 query_conjs ; query_clauses 〉 = T⟦query3⟧
    and 〈 err_conjs ; ∅ 〉 = T⟦err3⟧
    in 〈 [query_conjs implies err_conjs]. ; query_clauses 〉

T⟦err3⟧ = 〈 [abd_err([ShowName],"show without a director")] ; ∅ 〉

T⟦query3⟧ = T⟦root3⟧

T⟦root3⟧ =
    let 〈 wout_conjs ; wout_clauses 〉 = T′⟦wout3⟧(ID1)
    in 〈 pg_el(ID1,show,_), pg_el(ID2,showname,ID1),
         data_el(ID3,ShowName,ID2), wout_conjs ; wout_clauses 〉

Finally, we translate the wout3 part by adding a negated atom (of a fresh predicate pred_1) to the body of the integrity constraint, and we define pred_1 with the conjuncts derived from the term specification inside the without statement. The variables used as arguments for pred_1 keep the right bindings outside the integrity constraint.

T′⟦wout3⟧(ID1) = 〈 not(pred_1(ID1)) ; pred_1(ID1) :- pg_el(ID4,dir_by,ID1) 〉


Summarizing, the output is the following ALPC:

[pg_el(ID1,show,_), pg_el(ID2,showname,ID1),
 data_el(ID3,ShowName,ID2), not(pred_1(ID1))]
implies
[abd_err([ShowName],"show without a director")].

pred_1(ID1) :- pg_el(ID4,dir_by,ID1).
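To see what this ALPC computes, the sketch below evaluates the Rule3 constraint naively in Python over a handful of facts. It is not how CIFF works (CIFF is a general abductive proof procedure), and the facts are hypothetical: only the show name 'Black Comedy' is taken from the paper, while all IDs and the other show are made up for illustration.

```python
from typing import List, Set, Tuple

# Hypothetical facts in the style of the page translation:
# pg_el(Id, Tag, ParentId) and data_el(Id, Value, ParentId).
PG_EL: Set[Tuple[int, str, int]] = {
    (1, "showlist", 0),
    (2, "show", 1), (3, "showname", 2), (4, "dir_by", 2),
    (5, "show", 1), (6, "showname", 5),
}
DATA_EL: Set[Tuple[int, str, int]] = {
    (7, "Some Other Show", 3),
    (8, "Black Comedy", 6),
}


def pred_1(show_id: int) -> bool:
    """pred_1(ShowId) :- pg_el(_, dir_by, ShowId)."""
    return any(tag == "dir_by" and parent == show_id for _, tag, parent in PG_EL)


def violations() -> List[Tuple[str, List[str], str]]:
    """Every binding of the constraint body yields one abd_err atom."""
    found = []
    for id1, tag1, _ in PG_EL:
        if tag1 != "show":
            continue
        for id2, tag2, parent2 in PG_EL:
            if tag2 != "showname" or parent2 != id1:
                continue
            for _, name, parent3 in DATA_EL:
                if parent3 == id2 and not pred_1(id1):
                    found.append(("abd_err", [name], "show without a director"))
    return sorted(found)


print(violations())
```

With these facts the only binding that satisfies the body is the show named 'Black Comedy', so a single abd_err atom is produced.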

7 Analysis

In this section we outline how our translation provides a semantics for the fragment of Xcerpt we have adopted for defining web checking rules, and how CIFFWEB provides a sound mechanism for performing the checking, by virtue of the soundness of CIFF for abductive logic programming.

Given a web site W, let X(W) be the result of applying the translation given in section 5.2 to W. X(W) is a set of ground unit clauses. Given a set of web checking rules R, let T⟦R⟧ = {T⟦r⟧ | r ∈ R}. T⟦R⟧ is an abductive logic program with constraints 〈P, A, I〉 such that A consists of all atoms in the predicate abd_err, and P and I are given as in section 6 (see footnote 5). Trivially, 〈P ∪ X(W), A, I〉 is an ALPC. Note that the abducible predicate abd_err only occurs in the conclusions of integrity constraints in I.

We say that the rules in R are fulfilled in W if and only if P ∪ X(W) |= I, where |= stands for consequence with respect to the three-valued completion. Namely, there exists no i ∈ I such that P ∪ X(W) |= b, where b is the conjunction of premises of a ground instance of i. We also say that the rules in R are violated in W if and only if P ∪ X(W) ⊭ I and there exists ∆ ⊆ A such that P ∪ X(W) ∪ ∆ |= I.

Intuitively, since abducible atoms can appear only in the head of a rule, the need for abducible atoms (i.e. a non-empty ∆) to satisfy the integrity constraints means that some web checking rule has been violated. Furthermore, the abducible atoms in ∆ represent those violations.

By soundness of CIFF [10], we obtain the following results: if CIFFWEB succeeds with an empty query returning an empty answer, then all rules in R are fulfilled in W; if CIFFWEB succeeds with an empty query returning a non-empty answer, then some rule in R is violated.

8 Running the System

To run the CIFFWEB system, one needs to specify all the web checking rules in Xcerpt syntax and the XML/XHTML pages. The system compiles all the rules and the sources, giving rise to the corresponding abductive logic program with constraints. That ALPC can be used directly as input for the CIFF System 4.0. Detailed information on how to run the CIFFWEB System can be found at www.di.unipi.it/~terreni/research.php.

Running the system with the three rules and the two XML pages seen above, the following abductive answer is produced:

[abd_err(['dir2','dir2'],'double director occurrence'),
 abd_err(['Black Comedy'],'show without a director'),
 abd_err(['dir2'],'director w/out a show')].

This answer correctly represents (1) the fact that 'dir2' appears twice in the director list, (2) that 'dir2' does not direct any show and finally (3) that the show 'Black Comedy' has been inserted without a director associated with it. As we can see from the results, the CIFFWEB system conjoins all the web checking rules, and the abducibles in the answer correspond to all the error instances of all the rules.

5 T actually returns the representation of the ALPC in the format required by CIFF System 4.0. We assume here the logical, rather than system, version of this ALPC.

9 Web Repair Rules

We argue that abductive reasoning can be exploited not only for detecting errors in a web site, but also for repairing them, by defining more complex abductive logic programs. The idea is to modify checking rules by adding new clauses and integrity constraints which abductively suggest how to repair the site instead of simply detecting errors.

In practice, the system should identify actions for repairing the web site in the form of new elements to be integrated into the data. Obviously, new problems arise, as the introduced data could interfere with the other checking/repairing rules. This means that a simple Xcerpt interpreter is not enough to evaluate web repair rules: we need a computational tool which takes into account all the possible interconnections among rules. This is exactly what CIFF does.

For expressing repair rules as ALPCs we introduce two new abducible predicates, namely abd_pg_el and abd_data_el, which intuitively represent the "abducible versions" of the pg_el and data_el predicates. The key difference is that atoms of the abducible versions can be dynamically introduced into the framework, giving a way of fixing the errors in the web pages.

Let us consider again the requirement that each show has at least one director [Rule3]. The ALPC obtained after the translation process is the following:

pred_3(ID0) :- pg_el(ID3,dir_by,ID0).

[pg_el(ID0,show,_), pg_el(ID1,showname,ID0),
 data_el(ID2,ShowName,ID1), not(pred_3(ID0))]
implies
[abd_err([ShowName],'show without a director')].

Instead of returning an error by means of an abd_err atom, we could add a new director as a child of the show tag concerned. This can be done by replacing the head of the integrity constraint with [repair_3(ID0)], where repair_3(ID0) is defined as follows:

repair_3(ID) :- abd_pg_el(_,dir_by,ID).


In this way CIFF tries to add a dir_by element inside the show ID0 instead of reporting an error. Indeed, the answer returned by the system is:

abd_pg_el(_A,dir_by,18)

The above is a very simple example which gives an idea of the possibilities of abductive reasoning for repairing purposes. A characterization of web repair rules and a formal methodology for obtaining them from web checking rules are still work in progress. It is also not entirely clear whether an Xcerpt-like language could accommodate web repair rules in a straightforward way.

Indeed, there are many issues to take into account when new atoms are abduced and added to the framework. In particular, web checking rules must be aware that tag elements or data elements could be added to the framework, in order to check whether those additions imply further violations.
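The repair idea, and the need to re-check after an addition, can be mimicked by a short Python sketch. It is only an illustration under assumptions: the facts are hypothetical (show ID 18 is chosen to echo the answer above), the function names are made up, the unbound element ID is represented by the string "_", and the naive search below stands in for the abductive reasoning that CIFF actually performs.

```python
from typing import List, Set, Tuple

Fact = Tuple[object, str, int]  # (Id, Tag, ParentId)

# Hypothetical page facts: show 18 has a name but no dir_by child.
PG_EL: Set[Fact] = {(17, "showlist", 0), (18, "show", 17), (19, "showname", 18)}


def has_director(facts: Set[Fact], show_id: object) -> bool:
    """True if some dir_by element has the given show as its parent."""
    return any(tag == "dir_by" and parent == show_id for _, tag, parent in facts)


def suggest_repairs(facts: Set[Fact]) -> List[Tuple[str, str, int]]:
    """One abd_pg_el(_, dir_by, ShowId) per show that lacks a director."""
    return sorted(
        ("_", "dir_by", ident)
        for ident, tag, _ in facts
        if tag == "show" and not has_director(facts, ident)
    )


repairs = suggest_repairs(PG_EL)
repaired = PG_EL | set(repairs)      # add the abduced elements to the page facts
print(repairs)
print(suggest_repairs(repaired))     # re-check: nothing left to repair
```

The second call is the point of the design: any abduced element has to be fed back into the checks, since an addition that repairs one rule could in principle violate another.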

10 Related Work and Conclusions

In this paper we have illustrated the CIFFWEB system for verifying and repairing XML/XHTML web sites by using abductive logic programming, and in particular the CIFF [6] proof procedure as computational counterpart. Despite the exponential growth of the WWW and the success of the Semantic Web, there is limited work on specifying, verifying and repairing web sites at a semantic level. Notable exceptions, at least for specifying and checking web sites, are works such as [11] (which mainly inspired our work), [12] and [13]; the XLINKIT framework [14]; and the GVERDI-R system [15, 16].

The work most closely related to ours is the GVERDI-R system [15, 16], which verifies web sites against rules written in an ad-hoc language that allows both correctness and completeness rules to be specified. The GVERDI-R web rule language relies upon a (partial) pattern-matching mechanism very similar to that of Xcerpt, and its expressiveness is comparable to that of our Xcerpt fragment. However, we argue that our use of negation constructs in the query part of the rules allows for somewhat more expressiveness. Furthermore, it seems that queries like [Rule1], seen in section 3, are difficult to express in the GVERDI-R language, due to the ordering imposed by its language specification on the subterm specification of a query.

Another key difference is that the GVERDI-R system relies upon an ad-hoc computational counterpart for its framework, while our system relies upon the general-purpose CIFF abductive proof procedure. This is an important issue because, while the GVERDI-R system had to define both the language and its formal properties, the CIFFWEB system inherits everything from the CIFF proof procedure, which has a well-defined syntax and has been proved sound with respect to the three-valued completion semantics [10].

Conversely, the GVERDI-R system allows the use of functions for managing strings and data in a non-straightforward way, e.g. by matching strings against regular expressions or using arithmetic functions on numbers. While the CIFFWEB system deals with arithmetic functions thanks to the underlying integrated constraint solver, it lacks support for other types of functions. As pointed out in [15], this is an important feature for a web verification tool.


Another clear drawback of our system is the absence of a GUI, which would allow for better usability, as in systems like GVERDI-R and XLINKIT.

In section 9 we have shown, by means of examples, how the CIFFWEB system can be used for repairing purposes. Even though it has been presented informally, we claim that this is a novel contribution since, to our knowledge, there is not much literature on this topic. We have not found a framework which is able to propose repairing actions in a way similar to our examples in section 9. We argue that the full exploitation of the abductive reasoning allowed by the CIFF proof procedure, in particular its capability of handling variables in a constructive way, can be very interesting for repairing purposes. However, as said throughout the paper, we are still studying how to formally characterize web repair rules and how to define them in an Xcerpt-like (or similar) syntax, keeping in mind that this is by no means a straightforward task.

The GVERDI-R system seems to take some steps in that direction, as pointed out in [16], but a repairing specification and a concrete tool are still work in progress. The XLINKIT system also seems to be capable of some form of repairing actions, but they are limited to broken links in web sites.

Our work uses Xcerpt [2, 3] for expressing checking rules, which are then translated into ALPCs. Xcerpt is a query language which offers much more expressivity in its rules, both in the query part and in the construct part. For our purposes, there are a number of interesting features of Xcerpt which could be integrated into our framework. One of them is the possibility of expressing aggregates, as aggregated data is a very typical content of a web page.

References

1. Antoniou, G., van Harmelen, F.: A Semantic Web Primer. MIT Press (2004)

2. Bry, F., Schaffert, S.: The XML query language Xcerpt: Design principles, examples, and semantics (2002)

3. Bry, F., Furche, T., Linse, B.: Data model and query constructs for versatile web query languages: State-of-the-art and challenges for Xcerpt. In: Proc. PPSWR06. (2006) 90–104

4. Boag, S., Chamberlin, D., Fernández, M.F., Florescu, D., Robie, J., Siméon, J.: XQuery 1.0: An XML query language (2007)

5. Clark, J., DeRose, S.: XML Path language (XPath) version 1.0 (1999)

6. Endriss, U., Mancarella, P., Sadri, F., Terreni, G., Toni, F.: The CIFF proof procedure for abductive logic programming with constraints. In: Proc. JELIA 2004. (2004)

7. Fung, T.H., Kowalski, R.A.: The IFF proof procedure for abductive logic programming. Journal of Logic Programming 33(2) (1997) 151–165

8. Kunen, K.: Negation in logic programming. Journal of LP 4 (1987) 231–245

9. Mancarella, P., Sadri, F., Terreni, G., Toni, F.: Programming applications in CIFF. In: Proc. LPNMR 2007, to appear. (2007)

10. Endriss, U., Mancarella, P., Sadri, F., Terreni, G., Toni, F.: The CIFF proof procedure: Definition and soundness results. Technical Report 2004/2, Department of Computing, Imperial College London (May 2004)

11. Toni, F.: Automated information management via abductive logic agents. Journal of Telematics and Informatics 18(1) (2001) 89–104


12. Despeyroux, T., Trousse, B.: Semantic verification of web sites using natural semantics. In: Proc. CC-BMIA'00. (2000)

13. Fernandez, M., Florescu, D., Levy, A., Suciu, D.: Verifying integrity constraints on web sites. In: Proc. IJCAI 1999. (1999)

14. Capra, L., Emmerich, W., Finkelstein, A., Nentwich, C.: Xlinkit: a consistency checking and smart link generation service. ACM Transac. on IT 2 (2002) 151–185

15. Alpuente, M., Ballis, D., Falaschi, M.: A rewriting-based framework for web sites verification. Electronic Notes in Theoretical Computer Science 124 (2005) 41–61

16. Ballis, D., Romero, D.: Fixing web sites using correction strategies. In: Proc. WWV'06. (2006)
