A New SAT Encoding of Earley Parsinguu.diva-portal.org/smash/get/diva2:788985/FULLTEXT01.pdf · A...

42
IT 15 004 Examensarbete 15 hp Januari 2015 A New SAT Encoding of Earley Parsing Tobias Neil Institutionen för informationsteknologi Department of Information Technology

Transcript of A New SAT Encoding of Earley Parsinguu.diva-portal.org/smash/get/diva2:788985/FULLTEXT01.pdf · A...

Page 1: A New SAT Encoding of Earley Parsinguu.diva-portal.org/smash/get/diva2:788985/FULLTEXT01.pdf · A New SAT Encoding of Earley Parsing Tobias Neil While the Boolean satisfiability problem

IT 15 004

Examensarbete 15 hpJanuari 2015

A New SAT Encoding of Earley Parsing

Tobias Neil

Institutionen för informationsteknologiDepartment of Information Technology

Page 2: A New SAT Encoding of Earley Parsinguu.diva-portal.org/smash/get/diva2:788985/FULLTEXT01.pdf · A New SAT Encoding of Earley Parsing Tobias Neil While the Boolean satisfiability problem
Page 3: A New SAT Encoding of Earley Parsinguu.diva-portal.org/smash/get/diva2:788985/FULLTEXT01.pdf · A New SAT Encoding of Earley Parsing Tobias Neil While the Boolean satisfiability problem

Teknisk- naturvetenskaplig fakultet UTH-enheten Besöksadress: Ångströmlaboratoriet Lägerhyddsvägen 1 Hus 4, Plan 0 Postadress: Box 536 751 21 Uppsala Telefon: 018 – 471 30 03 Telefax: 018 – 471 30 00 Hemsida: http://www.teknat.uu.se/student

Abstract

A New SAT Encoding of Earley Parsing

Tobias Neil

While the Boolean satisfiability problem (SAT) lies in NP, prodigious work in SATsolvers has allowed for its use in modeling a multitude of practical problems. Stating aproblem in SAT can be cumbersome though and so the demand for SAT encodingsarises, providing a means to formulate problems or parts of problems in a moreintuitive environment. Several algorithms have been proposed in the past to encodecontext-free grammars as SAT formulae, allowing for the comprehensive constructionof many interesting constraints such as at-most k constraints or such ones pertainingto language syntax. In 2011 a new algorithm was proposed, differing from previousones in it being based on Earley parsing instead of CYK parsing. Although itperformed well for interesting groups of grammars it was later found to behaveincorrectly for certain inputs. This thesis discusses the flaws in said algorithm,presents a revision of it and argues for the altered algorithm's correctness. Thealterations come with a price, however, and both the running time and output sizecomplexities suffer more-than-quadratic blowup. Since no empirical tests have beenperformed as of yet, it is still unclear what impact this blowup will have on practicalinstances.

Tryckt av: Reprocentralen ITCIT 15 004Examinator: Olle GällmoÄmnesgranskare: Lars-Henrik ErikssonHandledare: Tjark Weber

Page 4: A New SAT Encoding of Earley Parsinguu.diva-portal.org/smash/get/diva2:788985/FULLTEXT01.pdf · A New SAT Encoding of Earley Parsing Tobias Neil While the Boolean satisfiability problem
Page 5: A New SAT Encoding of Earley Parsinguu.diva-portal.org/smash/get/diva2:788985/FULLTEXT01.pdf · A New SAT Encoding of Earley Parsing Tobias Neil While the Boolean satisfiability problem

Contents

1 Introduction 71.1 The SAT Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71.2 Previous Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . 71.3 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2 Background 82.1 Grammars . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82.2 Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102.3 Propositional Formulae, Satisfiability . . . . . . . . . . . . . . . . . 12

3 Old Encoding 143.1 Informal Description . . . . . . . . . . . . . . . . . . . . . . . . . . 143.2 Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143.3 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153.4 Flaws . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

4 A New Encoding 184.1 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184.2 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194.3 Translation into CNF . . . . . . . . . . . . . . . . . . . . . . . . . . 21

5 Output Size 235.1 Number of variables . . . . . . . . . . . . . . . . . . . . . . . . . . . 235.2 Number of clauses . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235.3 Comparison with previous encodings . . . . . . . . . . . . . . . . . 24

6 Correctness 246.1 Induction proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256.2 Production Trees and Derivation Sets . . . . . . . . . . . . . . . . . 266.3 The Relation � and Derivation Sets . . . . . . . . . . . . . . . . . 286.4 Proof . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

7 Discussion 38

8 Conclusion 388.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 388.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

5

Page 6: A New SAT Encoding of Earley Parsinguu.diva-portal.org/smash/get/diva2:788985/FULLTEXT01.pdf · A New SAT Encoding of Earley Parsing Tobias Neil While the Boolean satisfiability problem

9 Appendix 419.1 Logical derivation from Section 4.2 . . . . . . . . . . . . . . . . . . 419.2 Logical derivation from Section 6 . . . . . . . . . . . . . . . . . . . 42

6

Page 7: A New SAT Encoding of Earley Parsinguu.diva-portal.org/smash/get/diva2:788985/FULLTEXT01.pdf · A New SAT Encoding of Earley Parsing Tobias Neil While the Boolean satisfiability problem

1 Introduction

1.1 The SAT Problem

The Boolean satisfiability problem or SAT is the decision problem whether ornot there is any assignment to the Boolean variables of a formula φ such that φevaluates to true. It is a much studied one: In 1971 it was the first problem to beproven to be NP-complete [1], and as such many problems within NP have beenformulated as SAT. Furthermore, dedicated work spanning the last half-centuryhas resulted in some powerful tools to subjugate the theoretically hard problem (ifP 6= NP there is no polynomial-time algorithm to solve SAT) into one that canbe practically solved even for large inputs.

Context-free grammars provide an intuitive way of expressing many useful pat-terns, e.g. syntax for both natural and programming languages, palindromes orat-most k structures (does the string w ∈ {a, b}∗ have at most k occurrences ofthe symbol a?). While parsing a given word, that is deciding whether or not it canbe generated by a grammar G, can be done in polynomial time for context-freegrammars there are reasons why we would want to encode this decision as a SATinput: It can be used to describe constraint problems, either on its own or as apart of a larger problem, modeling a subset of the desired constraints (such as theabove mentioned at most k).

1.2 Previous Encoding

In 2011 Giannaros proposed an encoding of context-free grammars into the SATproblem [6]. The encoding produced good performance results, especially for prac-tically occurring grammars such as those describing program language syntaxes.However, the encoding algorithm is not altogether sound, and will produce faultyoutput for some grammars. These flaws are presented in detail in Section 3.4.There are other encoding algorithms as well, most notably Axelsson et al from2008 [7] and Quimper and Walsh’s from 2007 [8]. What separates these from thealgorithm laid out by Giannaros is that they mimic the methods of the CYK pars-ing algorithm whereas Giannaros’ uses the parsing techniques introduced by J.Earley in his parsing algorithm from 1970 [3].

1.3 Contribution

The purpose of this paper is thus to reform Giannaros’ algorithm so as to haveit produce only correct output. In Section 4 I propose an altered version of theencoding that overcomes the problems described in 3.4. A proof for the correctnessof the algorithm is given, but is coarsely written with several unproven assump-

7

Page 8: A New SAT Encoding of Earley Parsinguu.diva-portal.org/smash/get/diva2:788985/FULLTEXT01.pdf · A New SAT Encoding of Earley Parsing Tobias Neil While the Boolean satisfiability problem

tions. The algorithm is presented in a somewhat naive way but some optimizationis hinted at; as of now the size of the output is very large, and while no em-pirical studies have been made it is highly unlikely that the promising results ofGiannaros’ faulty encoding carry over to this altered version.

2 Background

The reader is assumed to be familiar with basic complexity notation (Θ(n), O(n))and with Boolean algebra. If not, there are numerous textbooks covering thesetopics in various levels of depth. For the purposes of this paper a introductorylevel of understanding is sufficient. I recommend the standard work Introductionto Algorithms [5] for an introduction to complexity analysis and Language, Proofand Logic [4] for Boolean algebra.

2.1 Grammars

Grammars provide a powerful way to express structure in data and even com-putation itself. They can be used to find substrings in texts without knowingexactly what is being sought, being able to match not only on content but alsoon form. Some grammars are used to check whether a program is syntacticallycorrect before attempting compilation, some are powerful enough to describe allthe workings of a computer’s processor. They come in different classes, buildinga hierarchy of expressive potential where a more powerful class of grammars candescribe everything a less powerful class can, but not vice versa. This paper re-volves around the context-free grammars, popular uses of which include modelingnatural language grammars, syntax checks for compilers and structuring researchdata in biotechnology.

2.1.1 Languages

Definition 1. (Strings) A string over an alphabet Σ of symbols is a sequencew = w1w2...wk s.t. ∀i : 1 ≤ i ≤ k : wi ∈ Σ.

Definition 2. (Languages) A set of strings is called a language.

Much like sets over numbers, languages can be defined ostensively, i.e. by nam-ing every member of the set, or by specifying a property that is common to themembers but not to any non-members. In the case of the set of even numbers be-tween 0 and 9, this could either be described by {0, 2, 4, 6, 8} or by {n ∈ N | 0 ≡ nmod 2, 0 ≤ n ≤ 9}. When specifying languages, this discerning property is oftena grammar.

8

Page 9: A New SAT Encoding of Earley Parsinguu.diva-portal.org/smash/get/diva2:788985/FULLTEXT01.pdf · A New SAT Encoding of Earley Parsing Tobias Neil While the Boolean satisfiability problem

Definition 3. (Grammars) A grammar is a quadruple (N,Σ, P, S).

• N is a finite set on non-terminal symbols

• Σ is a finite set of terminal symbols

• P is a finite set of production rules on the form (Σ ∪ N)∗N(Σ ∪ N)∗ →(Σ ∪N)∗

• S ∈ N is a unique root symbol.

From here on, terminal symbols will be denoted by lowercase roman letters(a, b, ...), non-terminals by uppercase roman letters (A,B, ...) and sequences (Σ ∪N)∗ by lowercase Greek letters (α, β, ...). We say that β can be derived in one stepfrom α, α ⇒ β, iff there is a production rule A → γ ∈ P such that α = α′Aα′′

and β = α′γα′′. A finite series of one-step derivations, α ⇒ α1 ⇒ ... ⇒ αk ⇒ βcan be written as

α⇒∗ βWe can then use the grammar G to define the set L over an alphabet Σ as L =L(G) = {w ∈ Σ∗ | S ⇒∗G w}, i.e. the set of strings over Σ that can be derivedfrom the start symbol S according to the rules in grammar G. We say that Ggenerates the language L(G).

2.1.2 Context-Free Grammars

One class of grammars (and languages generated by them) is the context-free gram-mars. They have the property that all production rules in P are of the form

N → (N ∪ Σ)∗

A subset of context-free grammars are those written in Chomsky normal form,named after the linguist and philosopher Noam Chomsky. The Chomsky normalform grammars separate non-terminal right-hand sides from terminal right-handsides. Each of their production rules follows one of these three schemes:

N → N ×N

N → Σ

S → ε

The last production rule scheme is necessary in order to allow for the emptystring to be a member of the generated language. Every CFG can be written inChomsky normal form, although this might cause a quadratic blowup of the sizeof the grammar [10].

9

Page 10: A New SAT Encoding of Earley Parsinguu.diva-portal.org/smash/get/diva2:788985/FULLTEXT01.pdf · A New SAT Encoding of Earley Parsing Tobias Neil While the Boolean satisfiability problem

Backus-Naur Form Grammars (especially context-free grammars) are com-monly written in the Backus-Naur Form (BNF). For a grammar where S ⇒∗Aba,A⇒∗ Aa,A⇒∗ ε we write:

S → Aba

A→ Aa | ε

This grammar G would generate the language L(G) = {ba, aba, aaba, ...}.

2.1.3 The Size of a Grammar

When discussing algorithms operating on grammars it is useful to relate the per-formance of said algorithms to some metric on the complexity of the grammars- intuitively, the CFG describing the syntax of the English language ought tobe ”bigger” than the example grammar in the previous section. One might betempted to define such a metric as a sum or product of the sizes of the threecomponents N,Σ, P . However, the alphabet of terminal symbols, or the set ofnon-terminals, could be arbitrary large and still not affect the potential memberstrings of the language. As a case in point, the alphabet employed in the examplegrammar {ba, aba, aaba, ...} may very well include the symbols c1, ..., c100. A muchmore reasonable measure of grammar size would then be limited to the length ofthe production rules in P .

Definition 4. (Size of a Grammar) The size of a grammar G is defined as thesum of the length of its production rules:

|G| =∑

(A→α)∈P

|Aα|

2.2 Parsing

The objective of a parser is two-fold: Firstly, given a string w and a grammar G,to answer whether w is a member of the language generated by G, i.e. w ∈ L(G).Secondly - if G indeed recognizes w - to construct a parse tree of the string. Sucha tree shows which production rules are employed in order to derive w from S, andin what order. Figure 2.2 shows a parse tree for a parenthesis matching grammarS → SS | (S) | ε and the string ()(()).

2.2.1 Earley

Earley’s algorithm from 1970 [3] is a parsing algorithm for context-free grammars,taking as its inputs a grammar G = (N,Σ, P, S) and a string w and in its most

10

Page 11: A New SAT Encoding of Earley Parsinguu.diva-portal.org/smash/get/diva2:788985/FULLTEXT01.pdf · A New SAT Encoding of Earley Parsing Tobias Neil While the Boolean satisfiability problem

S

S

( S

ε

)

S

( S

( S

ε

)

)

Figure 1: A parse tree

basic form decides whether or not w ∈ L(G). The algorithm scans the inputstring from left to right, constructing intermediate parsing states as it goes. Thesestates are derived from the production rules of the grammar, marking how much ofthe rule’s right-hand side has already been successfully matched against w. SuchEarley items are of the form (A → α · β, j), where the dot denotes how muchof the right-hand side has been matched thus far. The items are contained insets X0 to X|w|, where (A → α · β, j) ∈ Xi means that (A → αβ) ∈ P and thatα⇒∗ wj+1...wi. The integer j points back to the Earley item set where the currentmatching began, e.g. in the item (A → α · β, j) ∈ Xi, it points to the the set Xj

of which the yet-to-be-matched item (A→ ·αβ, j) is a member.For simplicity, and without loss of generality, a new start symbol S ′, along

with the production rule S ′ → S, is added to the grammar. Thus we can concludethat if (S ′ → S·, 0) ∈ X|w| then S ⇒∗ w1...w|w| = w i.e. w is recognized by thegrammar.

The sets X1 to X|w| are initialized as empty, and X0 as containing only theEarley item (S ′ → ·S, 0). Then, for each i : 0 ≤ i ≤ |w|, Xi is populated byapplying these three rules repeatedly, until no more items can be added:

• Predictor : ∀(A→ α ·Bβ, j) ∈ Xi : ∀B → γ ∈ P add (B → ·γ, i) to Xi

• Scanner (if i < |w|): ∀(A → α · wi+1β, j) ∈ Xi add (A → αwi+1 · β, j) toXi+1.

• Completer : ∀(A→ γ·, j) ∈ Xi : ∀(B → α ·Aβ′, k) ∈ Xj add (B → αA · β, k)to Xi.

The original paper suggests the use of look ahead strings. However, this paperuses a revised algorithm proposed by Aycock and Horspool [2] where look aheadstrings are not employed.

11

Page 12: A New SAT Encoding of Earley Parsinguu.diva-portal.org/smash/get/diva2:788985/FULLTEXT01.pdf · A New SAT Encoding of Earley Parsing Tobias Neil While the Boolean satisfiability problem

2.2.2 Ambiguity

Definition 5. (Ambiguous grammars) A grammar that, for some string in itsgenerated language, can yield more than one parse tree is called ambiguous.

Definition 6. (Infinitely ambiguous grammars) A grammar that, for some stringin its generated language, can yield an infinite number of parse trees is calledinfinitely ambiguous.

The parenthesis matching in Figure 2.2 is an example of an infinitely ambiguousgrammar, since any occurrence of S can be used to derive S ⇒ SS ⇒ εS = S anarbitrary amount of times, thus creating arbitrary large parse trees for any stringin its language.

2.3 Propositional Formulae, Satisfiability

A propositional formula consists of Boolean variables z ∈ Z and constants (>(true) and ⊥ (false)) put together with Boolean connectives, such as AND (∧)and OR (∨). Depending on what values are being assigned to the variables in aformula it can evaluate to either > or ⊥.

Definition 7. (Propositional Formulae) We define a propositional formula induc-tively:

• ⊥ is a propositional formula

• > is a propositional formula

• Any Boolean variable z ∈ Z is a propositional formula

• If A is a propositional formula, then ¬A is a propositional formula.

• If A and B are propositional formulae, then A∧B is a propositional formula.

• If A and B are propositional formulae, then A∨B is a propositional formula.

• If A and B are propositional formulae, then A ⇒ B is a propositional for-mula.

• If A and B are propositional formulae, then A ⇔ B is a propositional for-mula.

Each Boolean variable can be interpreted as either > or ⊥. We can for exampleinterpret the variables A and B as A = > and B = ⊥, or A = B = ⊥. Given aset Z ′ of Boolean variables there are 2|Z

′| number of assignments, specifying theBoolean value for each variable.

12

Page 13: A New SAT Encoding of Earley Parsinguu.diva-portal.org/smash/get/diva2:788985/FULLTEXT01.pdf · A New SAT Encoding of Earley Parsing Tobias Neil While the Boolean satisfiability problem

Definition 8. (Boolean Assignment) An assignment σ is a mapping Z 7→ {>,⊥}.

The evaluation of a formula for some assignment σ is the value of the formula, asdictated by the rules of the logical connectives, when all the variables are replacedwith the constants they correspond to under σ.

Definition 9. (Evaluation) The evaluation function evalσ maps, given an assign-ment σ, a propositional formula φ to either > or ⊥. It is defined inductively:

• evalσ(⊥) = ⊥

• evalσ(>) = >

• evalσ(z) = σ(z)

• evalσ(¬A) = > iff evalσ(A) = ⊥

• evalσ(A ∧B) = > iff evalσ(A) = > and evalσ(B) = >

• evalσ(A ∨B) = > iff evalσ(A) = > or evalσ(B) = >

• evalσ(A⇒ B) = > iff evalσ(A) = ⊥ or evalσ(B) = >

• evalσ(A⇔ B) = > iff evalσ(A) = evalσ(B)

Definition 10. (Satisfiable) A propositional formula φ is satisfiable iff there is atleast one Boolean assignment σ s.t. evalσ(φ) = >.

SAT is the decision problem whether or not a formula is satisfiable. It wasthe first problem to be proven NP-complete. As such, the number of clauses aswell as the number of variables outputted by the encoding algorithm is of practicalinterest.

AMO - ALO In the coming sections we will have a need for a propositional logicencoding of constraints such as ”Exactly one of these variables must be assignedto >”. For this purpose we introduce the following two concepts:

• When at most one (AMO) of n ≥ 1 variables can be true.

• When at least one (ALO) of n ≥ 1 variables must be true.

When combined, these constraints can be used to model the above mentioned”Exactly one...”-criterion. ALO of a set {z1, ..., zn} ⊂ Z is trivially implementedas a disjunction of the variables:

ALO({z1, ..., zn}) =∨

1≤i≤n

zi (1)

13

Page 14: A New SAT Encoding of Earley Parsinguu.diva-portal.org/smash/get/diva2:788985/FULLTEXT01.pdf · A New SAT Encoding of Earley Parsing Tobias Neil While the Boolean satisfiability problem

AMO is a bit more complex, but can for example be represented by a binomialencoding :

AMO({z1, ..., zn}) =∧

1≤i≤n−1

∧i+1≤j≤n

¬zi ∨ ¬zj (2)

This concludes the background theory required of the reader in order to follow theupcoming discussions and proofs.

3 Old Encoding

3.1 Informal Description

In [6] the Earley algorithm is used to generate logical formulae reflecting the mem-bership of strings for a given grammar. Instead of looking at any one specificstring, the algorithm produces a formula that is satisfiable for any string of a cer-tain length in the grammar’s language. The algorithm first creates an extendedversion of the Earley sets, in the paper called Earley spine sets. The reason forthis is that Giannaros’ algorithm has no particular string in mind when perform-ing the encoding, and so the scanner rule must be applicable on a larger set ofEarley items. When the spine sets are fully constructed the algorithm explicitlylays out the dependencies among the Earley items, stating for each item underwhat conditions it would be constructed in the original algorithm.

3.2 Encoding

Given a CFG G = (N,Σ, S,→), where N is the set of non-terminals, Σ the al-phabet, S ∈ N the start symbol and → is the set of production rules, and a wordlength n, the encoding aims to output a Boolean formula JGKn, such that

∃σ : evalσ(JGKn) = > ⇔ ∃w ∈ Σ∗ :: |w| = n ∧ w ∈ L(G)

i.e. JGKn is satisfiable iff there is some string in L(G) of length n.We first produce the Earley spine sets Y0 to Yn by running the Earley algorithm

described in 2.2.1, but with the scanner rule modified so as to accept any terminalsymbol. We introduce two types of variables - letter variables and Earley itemvariables. For all i : 1 ≤ i ≤ n and a ∈ Σ, the letter variable wi(a) is true iffwi = a. The Earley item variable Xi(A→ α ·β, j) has a slightly more complicatedsemantics. It will not be described in detail here, but can be thought of as havingthe interpretation that Xi contains the Earley item (A→ α · β, j). This notion ishowever not true in all cases and should rather be used as an informal description.

14

Page 15: A New SAT Encoding of Earley Parsinguu.diva-portal.org/smash/get/diva2:788985/FULLTEXT01.pdf · A New SAT Encoding of Earley Parsing Tobias Neil While the Boolean satisfiability problem

JGKn is initialized as the formula Xn(S ′ → S·, 0) and is then incrementallyexpanded by applying the following rules to every Earley item in every Earleyspine set.

For all i : 0 ≤ i ≤ n and (A→ α · β, j) ∈ Yi:

• If α = α′a, add as a conjunct to JGKn:

Xi(A→ α′a · β, j)⇔ Xi−1(A→ α′ · aβ, j) ∧ wi(a) (3)

• If α = ε and A 6= S ′, add as a conjunct to JGKn:

Xi(A→ ·β, j)⇔∨

(B→α′·Aβ′,k)∈Yi

Xi(B → α′ · Aβ, k) (4)

• If α = α′B, add as a conjunct to JGKn:

Xi(A→ α′B · β, j)⇔∨

(B→γ·,k)∈Yis.t. (B→γ,k)6=(A→α′Bβ,j)

Xi(B → γ·, k)∧Xk(A→ α′ ·Bβ, j)

(5)

• If Xi(A→ α · β, j) = X0(S′ → ·S, 0), add as a conjunct to JGKn:

X0(S′ → ·S, 0) (6)

• We also need to ensure that every letter variable wi maps to exactly onesymbol, i.e. that for any index i, there is exactly one symbol a such thatwi(a)⇔ > ∧

1≤i≤n

AMO{wi(s) | s ∈ Σ} ∧ ALO{wi(s) | s ∈ Σ} (7)

3.3 Example

In [6] the following example run on the infinitely ambiguous grammar

S → SS | a | ε

is done to illustrate the point of the exclusion condition (B → γ, k) 6= (A →α′Bβ, j) in Rule 5. We will parse the string w = aa. The spine sets Y0, Y1, Y2 are:

15

Page 16: A New SAT Encoding of Earley Parsinguu.diva-portal.org/smash/get/diva2:788985/FULLTEXT01.pdf · A New SAT Encoding of Earley Parsing Tobias Neil While the Boolean satisfiability problem

Y0 Y1 Y2(S → ·, 0) (S → ·, 1) (S → ·, 2)(S → ·SS, 0) (S → ·SS, 1) (S → ·SS, 2)(S → S · S, 0) (S → S · S, 0) (S → S · S, 0)(S → SS·, 0) (S → S · S, 1) (S → S · S, 1)(S → ·a, 0) (S → SS·, 0) (S → S · S, 2)(S ′ → ·S, 0) (S → SS·, 1) (S → SS·, 0)(S ′ → S·, 0) (S → ·a, 1) (S → SS·, 2)

(S → a·, 1) (S → SS·, 1)(S → a·, 0) (S → SS·, 2)(S ′ → S·, 0) (S → ·a, 2)

(S → a·, 1)(S ′ → S·, 0)

Figure 2: Spine sets Y0 to Y2

When we come to the Earley item (S → S ·S, 0) ∈ Y2 we have to apply Rule 5.Since its pointer is 0 we need to find every item (B → γ·, k) ∈ Yi such that thereis an item (S → ·SS, 0) ∈ Yk. The only such item is (S → SS·, 0). However, if welook at the encoding of (S → SS·, 0) we see that this item in turn could dependon (S → S · S, 0). We thus get the following implications:

X2(S → S · S, 0)⇒ X2(S → SS·, 0) ∧X0(S → ·SS, 0)

∧X2(S → SS·, 0)⇒ (X2(S → ·, 2) ∧X2(S → S · S, 0)) ∨ P1

{Transitivity} ⇒

X2((S → S · S, 0)⇒ (X2(S → ·, 2) ∧X2(S → S · S, 0)) ∨ P) ∧X0(S → ·SS, 0)

{Distribution of ∧} ⇒

X2(S → S·S, 0)⇒ (X2(S → ·, 2)∧X2(S → S·S, 0)∧X0(S → ·SS, 0))∨(P∧X0(S → ·SS, 0))

{Or-simplification} ⇒

X2(S → S · S, 0)⇒ (X0(S → ·SS, 0) ∧X2(S → ·, 2)) ∨Q2

This resulting clause is intuitively problematic, since the truth of the statement(S → S · S, 0) ∈ X2, i.e. ”The right-hand sequence S can be expanded into thesymbol sequence w1w2”, is not related to any letter variable in that range, nor

1P is the remainder of the RHS disjunction, after X2(S → ·, 2) ∧X2(S → S · S, 0) has beenfactored out.

2Q⇔ (P ∧X0(S → ·SS, 0))

16

Page 17: A New SAT Encoding of Earley Parsinguu.diva-portal.org/smash/get/diva2:788985/FULLTEXT01.pdf · A New SAT Encoding of Earley Parsing Tobias Neil While the Boolean satisfiability problem

any Earley item variable implying any statement about that range, but can besatisfied by the truth of the variables X0(S → ·SS, 0) and X2(S → ·, 2) alone. Wemust therefore exclude self reference of this kind, although we will soon show thatthe exclusion condition of Rule 5 will not be enough.

3.4 Flaws

3.4.1 Semantics

In the previous encoding the Earley item variable Xi(A→ α · β, j) had the inter-pretation (A→ α ·β, j) ∈ Xi, where Xi is the i:th Earley set after a completed runof an Earley parse. Looking at the item variable X2(S → S ·S, 0) in Section 3.3 wecan see that this interpretation is faulty: by applying the completer rule on items(S → ·SS, 0) ∈ X0 and (S → SS·, 0) ∈ X2 we add the item (S → S · S, 0) to X2.Yet, the encoding sets the variable X2(S → S · S, 0) to false. Clearly something isamiss in the semantics of the variables.

3.4.2 Indirect Recursions

The encoding rule reflecting the completer rule prevents the dependence of imme-diate recursions, as in the example of X2(S → S · S, 0), but does not prevent anyindirect recursions. As an example, the production rules used in Section 3.3 canbe extended with the trivial rule S → S. We can now circumvent the recursionprevention by the derivation

X2(S → S · S, 0)⇒ X2(S → S·, 0) ∧X0(S → ·SS, 0)

∧X2(S → S·, 0)⇒ X2(S → SS·, 0) ∧X0(S → ·S, 0) ∧ P3

∧X2(S → SS·, 0)⇒ X2(S → ·, 2) ∧X2(S → S · S, 0) ∧Q

{Several steps, not shown for the sake of brevity} ⇒

X2(S → S · S, 0)⇒ (X0(S → ·SS, 0) ∧X0(S → ·S, 0) ∧X2(S → ·, 2))∨

(X0(S → ·SS, 0) ∧X0(S → ·S, 0) ∧Q)∨

X0(S → ·SS, 0) ∧ P

As we can see in the first part of the disjunction, we have once again come upona self satisfying clause.

3P and Q are the remainders of the RHS disjunctions, see the footnotes of Section 3.3

17

Page 18: A New SAT Encoding of Earley Parsinguu.diva-portal.org/smash/get/diva2:788985/FULLTEXT01.pdf · A New SAT Encoding of Earley Parsing Tobias Neil While the Boolean satisfiability problem

4 A New Encoding

I will now present a new encoding of context-free grammars. It is in many aspectslike Giannaros’ encoding, but resolves the problem of circular dependencies. Thegeneral idea is that we augment the production rules with the transitive, non-reflexive relation A� B, informally interpreted as A depends on B. The alteredencoding is then executed much like that of Giannaros’, without the exclusioncondition of the completer rule encoding.

4.1 The Algorithm

The following encoding will, given a context-free grammar G = (N,Σ, P, S) andword length n, output the Boolean formula JGKn, such that

∃σ : evalσ(JGKn) = > ⇔ ∃w ∈ Σ∗ :: |w| = n ∧ w ∈ L(G)

i.e. JGKn is satisfiable iff there is some string in G of length n. We first construct amore general version of Earley sets, the spine sets, Y0 to Yn. The difference lies inthat while the scanner rule used to construct Earley sets only applies to a terminalif it matches a symbol in the input string, the scanner rule for the spine sets isalways applicable when a terminal is encountered.

We introduce letter variables, wi(a) for all 1 ≤ i ≤ n and a ∈ Σ, where wi(a)means that wi = a. We also introduce Earley set variables, where Xi(A→ α ·β, j)means that the Earley item (A → α · β, j) can be added to Xi in a non-recursiveway. Stylized letters P ,Q,R denote Earley items.

JGKn is initialized as the formula Xn(S ′ → S·, 0) and is then incrementallyexpanded by applying the following rules to every Earley item in every Earleyspine set.

For all i : 0 ≤ i ≤ n and all P ∈ Yi:

• If P = (A→ αa · β, j), add as a conjunct to JGKn

Xi(P)⇔ Xi−1(Q) ∧ wi(a) ∧ (Pi � Qi−1) (8)

Where Q = (A→ α · aβ, j)

• If P = (A→ ·β, i) and A 6= S ′, add as a conjunct to JGKn:

Xi(P)⇔∨

Q=(B→α′·Aβ′,k)∈Yi

Xi(Q) ∧ (Pi � Qi) (9)

18

Page 19: A New SAT Encoding of Earley Parsinguu.diva-portal.org/smash/get/diva2:788985/FULLTEXT01.pdf · A New SAT Encoding of Earley Parsing Tobias Neil While the Boolean satisfiability problem

• If P = (A→ αB · β, j), add as a conjunct to JGKn:

Xi(P)⇔∨

Q=(B→γ·,k)∈Yi

Xi(Q) ∧Xk(R) ∧ (Pi)� Qk) ∧ (Pi � Ri) (10)

Where R = (A→ α′ ·Bβ′, j)

• If Xi(P) = X0(S′ → ·S, 0), add as a conjunct to JGKn:

X0(S′ → ·S, 0) (11)

AMO - ALO As with the previous encoding, we need to ensure that there isexactly one true letter variable wi for each index i. We therefore introduce AMO-and ALO constraints to JGKn:∧

1≤i≤n

AMO{wi(s) | s ∈ Σ} ∧ ALO{wi(s) | s ∈ Σ} (12)

Properties of � We also need to encode the transitivity and non-reflexivity ofthe relation �. Here is presented a naive solution to the problem; one obviousoptimization would be to keeping track of which dependencies are actually createdin the previous phase. For the sake of brevity, this will not be described in thispaper.For all i : 0 ≤ i ≤ n and all P ∈ Yi:

For all i : 0 ≤ j ≤ n and all Q ∈ Yj:For all i : 0 ≤ k ≤ n and all R ∈ Yk, add as a conjunct to JGKn:

(Pi � Qj) ∧ (Q � Rk)⇒ (Pi � Rk) (13)

For all i : 0 ≤ i ≤ n and all P ∈ Yi, add as a conjunct to JGKn:

¬(Pi � Pi) (14)

4.2 Example

Let us revisit the case of the extended grammar from Section 3.4, S → SS | S |a | ε, and the item (S → S · S, 0) ∈ Y2. We will now have the following clauses:

P (S → S · S, 0)Q (S → S·, 0)R (S → ·SS, 0)S (S → SS·, 0)T (S → ·S, 0)U (S → ·, 2)

19

Page 20: A New SAT Encoding of Earley Parsinguu.diva-portal.org/smash/get/diva2:788985/FULLTEXT01.pdf · A New SAT Encoding of Earley Parsing Tobias Neil While the Boolean satisfiability problem

Our encoding gives us the following clauses, where Q and S are the rest of thedisjunctions of Rule 10.

X2(P)⇒ X2(Q) ∧X0(R) ∧ (P2 � Q2) ∧ (P2 � R0)

X2(Q)⇒ X2(S) ∧X0(T ) ∧ (Q2 � S2) ∧ (Q2 � T0) ∨Q

X2(S)⇒ X2(U) ∧X2(P) ∧ (S2 � U2) ∧ (S2 � P2) ∨ S

{See appendix} ⇒

X2(P)⇒ (X0(R)∧X0(T )∧X2(U)∧X2(P)∧(P2 � Q2)∧(Q2 � S2)∧(S2 � P2)∧R4)∨

(X0(R) ∧X0(T ) ∧ P5 ∧ S)∨

(X0(R) ∧ (P2 � Q2) ∧ (P2 � R0) ∧Q)

{Rule 13 and repeated application of modus ponens} ⇒

X2(P)⇒ (X0(R) ∧X0(T ) ∧X2(U) ∧X2(P) ∧ (P2 � P2) ∧ R∨

(X0(R) ∧X0(T ) ∧ P ∧ S)∨

(X0(R) ∧ (P2 � Q2) ∧ (P2 � R0) ∧Q)

{¬(P2 � P2) from Rule 14 } ⇒

X2(P)⇒ ⊥∨

(X0(R) ∧X0(T ) ∧ P ∧ S)∨

(X0(R) ∧ (P2 � Q2) ∧ (P2 � R0) ∧Q)

{or-simplification} ⇒

X2(P)⇒ (X0(R) ∧X0(T ) ∧ P ∧ S)∨

(X0(R) ∧ (P2 � Q2) ∧ (P2 � R0) ∧Q)

4R⇔ (P2 � R0) ∧ (Q2 � T0) ∧ (S2 � U2)5P⇔ (P2 � Q2) ∧ (P2 � R0) ∧ (Q2 � S2) ∧ (Q2 � T0)

20

Page 21: A New SAT Encoding of Earley Parsinguu.diva-portal.org/smash/get/diva2:788985/FULLTEXT01.pdf · A New SAT Encoding of Earley Parsing Tobias Neil While the Boolean satisfiability problem

Not only does this preclude any circular satisfaction of X2(P), but also opens upthe possibility of the variable being set to true, if any of the remaining dependen-cies can be set to true given the additional (P2 � Wi)-statements, guaranteeingthrough the Rules 13 and 14 that any such dependency could not in turn implyX2(P).

4.3 Translation into CNF

The new encoding is slightly more tricky to translate into CNF than that presentedby Gianarros. If we are content with only establishing ⇒, i.e. the necessity of theexistence of certain Earley items for the LHS item to exist6, we can get away withfewer clauses, but Rules 9 and 10 will still suffer from blowups in clauses andnumber of variables.

The scanner encoding, Rule 8, is easily transformed into CNF:

A⇒ B ∧ C ∧D ⇔ (A⇒ B) ∧ (A⇒ C) ∧ (A⇒ D)

Applying this to Rule 8, where P = (A → αa · β, j) and Q = (A → α · aβ, j),yields clauses:

Xi(P)⇒ Xi−1(Q)

Xi(P)⇒ wi(a)

Xi(P)⇒ (Pi � Qi−1)

When translating Rules 9 and 10 we run in to the problem of them being dis-junctions of conjunctions - rather the opposite of the conjunctive normal form.A straightforward translation would cause an exponential blowup in the numberof clauses and we elect instead to use the Tseitin transformation [9], where anauxiliary variable is added for every conjunction. The Tseitin transformation ofthe following formula

A⇒∨

1≤i≤n

xi1 ∧ ... ∧ ximi

requires us to introduce the variables v1, ..., vn, and then the following clauses:

For all i : 1 ≤ i ≤ n:For all j : 1 ≤ j ≤ mi:

Add vi ⇒ xij to JGKn6Giannaros provides good arguments for the viability of such an approach in Section 3.1.3

in [6], but as with the encoding in general no proof is provided for the correctness of thissimplification.

21

Page 22: A New SAT Encoding of Earley Parsinguu.diva-portal.org/smash/get/diva2:788985/FULLTEXT01.pdf · A New SAT Encoding of Earley Parsing Tobias Neil While the Boolean satisfiability problem

Finally, we add the clause

A⇒∨

1≤i≤n

vi

The Tseitin transformation gives us the tools to convert the encoding of the scannerand completer rules into CNF. For Rule 9 we get, for Xi(P) = Xi(A→ ·β, i)

Algorithm 1 Tseitin translation of Rule 9

1: V ← ∅2: for Q = (B → α′ · Aβ′, k) ∈ Yi do3: Introduce a new variable v4: Add clause v ⇒ Xi(Q) to JGKn5: Add clause v ⇒ (Pi � Qi) to JGKn6: Add variable v to V7: Add clause Xi(P)⇒

∨v∈V

v to JGKn

To encode Rule 10 for Xi(P) = Xi(A → αB · β, j), we perform an analogousoperation:

Algorithm 2 Tseitin translation of Rule 10

1: V ← ∅2: for Q = (B → α′ · Aβ′, k) ∈ Yi do3: Introduce a new variable v4: Add clause v ⇒ Xi(Q) to JGKn5: Add clause v ⇒ Xk(R) to JGKn . Where R = (A→ α′ ·Bβ′, j)6: Add clause v ⇒ (Pi � Qk) to JGKn7: Add clause v ⇒ (Pi � Ri) to JGKn8: Add variable v to V9: Add clause Xi(P)⇒

∨v∈V

v to JGKn

The ALO part of 12 is already in CNF as one single clause, and translatingAMO is trivially done by separating the conjunctions into O(n2) clauses of theform ¬zi ∨ ¬zj.

Having transformed all clauses of JGKn into disjunctions, we can now use theresulting formula as an input to most major SAT-solvers.

22

Page 23: A New SAT Encoding of Earley Parsinguu.diva-portal.org/smash/get/diva2:788985/FULLTEXT01.pdf · A New SAT Encoding of Earley Parsing Tobias Neil While the Boolean satisfiability problem

5 Output Size

There are two major candidates for the establishment of an upper bound on num-ber of clauses and variables, and runtime complexity: The application of Rules8 through 10 when written in CNF, and Rule 13. Rules 11, 12 and 14 producean obviously smaller number of variables and clauses, and are therefore excludedfrom this comparison.

5.1 Number of variables

In Section 4 we define a naive implementation of the transitive relationship wherewe first for every Earley item in every spine set introduce one variable for everyEarley item in every spine set. From [6] we get that the number of Earley itemsin all of the spine set, i.e. |Y | =

∑0≤i≤n

|Yi| is O(|G2|n2). The number of variables

introduced by the transitivity of the �-relation is therefore

|Y |2 = O((|G|2n2)2) = O(|G|4n4)

The number of variables created when translating the formula into CNF is alsobounded by O(|G|4n4) - in fact, an even lower bound can be established: Trans-lating either Rules 9 and 10 for a variable Xi(P) creates at most |Yi| number ofnew variables, and Rule 8 produces none (if we assume the letter variables to bealready instantiated). For all |Yi| items in a given set Yi, we therefore get an up-per bound O(|Yi|2) on variables introduced by the above rules. The total sum ofvariables created through the translation into CNF is thus

O(∑

0≤i≤n

|Yi|2) = O(|Y |2−∑

0≤i≤n

∑0≤j≤n,i 6=j

|Yi||Yj|) = O(|G|4n4−∑

0≤i≤n

∑0≤j≤n,i 6=j

|Yi||Yj|)

And so, JGKn can be asserted to have O(|G|4n4) variables.

5.2 Number of clauses

Encoding the transitivity of � requires one clause to be added for every triplet(x,y,z ), where x,y,z are Earley item variables. Since we have |Y | Earley itemvariables, there are |Y |3 such triplets and the encoding of Rule 13 therefore adds

|Y |3 = O((|G|2n2)3) = O(|G|6n6)

clauses to JGKn.The number of clauses introduced when translating Rules 9 and 10 into CNF is

linear in the number of Earley item variables appearing in the RHS disjunction: For

23

Page 24: A New SAT Encoding of Earley Parsinguu.diva-portal.org/smash/get/diva2:788985/FULLTEXT01.pdf · A New SAT Encoding of Earley Parsing Tobias Neil While the Boolean satisfiability problem

Rule 9 we add 2 clauses per variable in addition to the final clause Xi(P)⇒∨v∈V

v;

in translating Rule 10 we add 4. Since the RHS disjunction for an application ofRules 9 and 10 on a variable Xi(P) range over the items in Yi, the encoding of sucha rule will produce O(|Yi|) new clauses. Since each instance of Rule 8 translationadds a constant number of clauses, the number of clauses added by an arbitraryrule encoding for an item Xi(P) is O(|Yi|). The number of clauses added for allvariables in Yi is then O(|Yi|2), which results in a grand total of

O(∑

0≤i≤n

|Yi|2) = O(|G|4n4 −∑

0≤i≤n

∑0≤j≤n,i6=j

|Yi||Yj|)

new clauses added to JGKn. Since the upper bound on clauses added from Rule13 also gives an upper bound on those created by the CNF-translation the finalbound on clauses for JGKn is O(|G|6n6).

5.3 Comparison with previous encodings

Quimper and Walsh’s encoding [8] produces outputs with O(|G|2n3) clauses andthe encoding presented by Axelsson et al. [7] produces an output with O(|G|n3)clauses. These are both significantly lower bounds than O(|G|6n6), but sincethe state of the art SAT-solvers employ a multitude of heuristics and micro-optimizations, and display a general air of black magic, actual performance isnot easily deduced from theoretical ditto. That being said, the discrepancy be-tween the output sizes of previous encodings and those of that proposed here isquite overwhelming. It would be reasonable to think that these theoretical resultscarry over to the practical realm.

6 Correctness

Caveat Lector This section is written in a much less lucid style than the restof the paper. It has been tacked on in a late stage of the process, and the authorhas had very little time to polish the language and the disposition; the steps takenin the proof are possibly too large to hold up to a reasonable standard of rigor.Several assumptions are made and never proven to hold. It is to be regarded as abonus feature and nothing essential will be lost, should the reader choose to skip toSection 8

We have shown specific examples of where the new encoding succeeds when theold one did not, and the augmented algorithm sure seems like it is guarding itselffrom any sort of ”empty implications”. The algorithm is however still not provencorrect, nor shall a complete formal proof be given here, but we will at least present

24

Page 25: A New SAT Encoding of Earley Parsinguu.diva-portal.org/smash/get/diva2:788985/FULLTEXT01.pdf · A New SAT Encoding of Earley Parsing Tobias Neil While the Boolean satisfiability problem

a fairly detailed sketch of a proof. This sketch could - and should - be fleshed outand rigorously described at some later point, but will in the meantime give usstrong justifications for believing the algorithm to be correct.

6.1 Induction proofs

We will start by defining a general sort of induction that we will later use toshow the relationship between the assignments of Earley item variables and themembership of their corresponding Earley items in the Earley sets created in asuccessful parse.

Definition 11. (Well-founded Relation) A relation ≺ is well-founded iff it does notcontain any infinite descending chains, i.e. there is no infinite sequence x1, x2, ...s.t. xn+1 ≺ xn for all n ∈ N.

For example, the relation < over the natural numbers is well-founded, since nochain can go below 0, but the same relation over the set of all integers is not.

Theorem 1. (Well-founded Induction) Given a well-founded relation ≺ on X, thetheorem of well-founded induction states that if we can show that for any x ∈ X,the fact that the property P (y) holds for all y ∈ X s.t. y ≺ x implies that it alsoholds for P (x), then that property holds for every member of X:

(∀x ∈ X : (∀y ∈ X : (y ≺ x⇒ P (y)))⇒ P (x))⇒ (∀x ∈ X : P (x))

This tells us that if we can find a partial order over our Earley item variables,preferably one that reflects on the production rules in Section 4, and can showthat the desired property (i.e. the above mentioned connection between Earleyitem variables and Earley items in a run of the parser) by necessity carries overfrom the right-hand side to the left-hand side in all the rules, we have shown thatthe property holds for all Earley item variables. The theorem is proven in [11](Chapter 3).

Previously we said that we want to relate Earley item variable Xi(P) to themembership of the corresponding Earley item and Earley set, P ∈ Xi. However,that membership query could change during the course of a parse; initially, almostall Earley sets are empty and it’s only later that they are incrementally expanded.When asking whether or not (S ′ → S·, 0) ∈ Xn, we obviously do not mean inan arbitrary intermediate state of the parsing. We therefore need to define theconcept of the sets in a final state of the parsing algorithm.

Definition 12. (Final State) A run of the Earley parsing algorithm proposed in[2] is in its final state when no Earley item can be added to any set.

25

Page 26: A New SAT Encoding of Earley Parsinguu.diva-portal.org/smash/get/diva2:788985/FULLTEXT01.pdf · A New SAT Encoding of Earley Parsing Tobias Neil While the Boolean satisfiability problem

Definition 13. (Final State Set) We define Ew,Gi as the set Xi in the final state

of a parse on string w and grammar G.

Note that the above definitions also allow us to specify which string we parsedin order to obtain the Earley set. This is useful to us since the encoding does notoperate on a single string, but rather on all strings in L(G) of a certain length.

6.2 Production Trees and Derivation Sets

We shall now attempt a definition of a well-founded relation. First we introducethe concept of production trees. We can see them informally as acyclic, directedgraphs of item nodes Ni(P)W indicating how an item can be added to a set, verymuch in the same way as the rules in Section 4. Note that the same Earley itemvariable may exist in several different locations of the graph; we call these differentoccurrences instances of Earley item variables.

Since we want the graph to be acyclic, we need to exclude edges running to anitem node higher up in the tree. We do this by introducing derivation sets to eachnode that record what nodes lie along the path down to it from the root node. Wethen stipulate the following invariant7:

A production tree rooted at node Ni(P)W has no node Nj(Q)P ,Qj ∈ W (15)

Both the node set N and the edge set E of the production tree is definedinductively:

• Nn(S ′ → S·, 0)∅ ∈ N

• If Ni(P)W ∈ N,P = (A → αa · β, j) then if Qi−1 6∈ W , where Q = (A →α · aβ, j)

Ni−1(Q)W∪{Pi} ∈ N, (16)

(Ni(P)W , Ni−1(Q)W∪{Pi}) ∈ E

• If Ni(P)W ∈ N,P = (A → ·β, i) then for all Q = (B → α · Aβ, j) ∈ Yi s.t.Qi 6∈ W

Ni(Q)W∪{Pi} ∈ N, (17)

(Ni(P)W , Ni−1(Q)W∪{Pi}) ∈ E7For the sake of brevity the invariant remains unproven, but it’s quite clear that it holds for

the base case, and that the edges maintain it.

26

Page 27: A New SAT Encoding of Earley Parsinguu.diva-portal.org/smash/get/diva2:788985/FULLTEXT01.pdf · A New SAT Encoding of Earley Parsing Tobias Neil While the Boolean satisfiability problem

• If Ni(P)W ∈ N,P = (A→ αB · β, j) then for all Q = (B → γ·, k) ∈ Yi andall R = (A→ α ·Bβ, j) ∈ Yk s.t. Qi,Rk 6∈ W

Ni(Q)W∪{Pi} ∈ N, (18)

Nk(R)W∪{Pi} ∈ N,(Ni(P)W , Ni(Q)W∪{Pi}) ∈ E,(Ni(P)W , Nk(R)W∪{Pi}) ∈ E

The fact that the derivation sets are strictly increasing when we traverse the edgesof the graph shows, along with the invariant (15) and the fact that the derivationsets can not grow infinitely, that the paths are finitely long. We can thereforeconclude that the graph is acyclic (since any cycles would allow for infinite paths).Furthermore, we can at last define a well-founded relation on the set {(P , i,W ) |P ∈ Y, 0 ≤ i ≤ n,W ∈ Y ∗}.

Definition 14. (The relation ≺) (Q, j, U) ≺ (P , i,W ) iff there is a path in T fromNi(P)W to Nj(Q)U

We also want to relate the concept of derivation sets to the items and sets inan actual Earley parse. If an item P can be added to some set Xi, then naturallyit can be done so without P to be in Xi in the first place - if we already haveP ∈ Xi then we have no need for further rule applications in order for P ∈ Xi

to be true. This also extends to the production children of P ∈ Xi: If P can beadded to Xi through either of the memberships Q1 ∈ Xj1, ...,Qf ∈ Xjf , then wecan be assured that there is some Qg ∈ Xjg, 1 ≤ g ≤ f for which P ∈ Xi is not aprerequisite, lest we shall arrive at a contradiction. With this in mind, we revisitthe Ew,G

i -notation previously defined, but with derivation sets added as a taboolist of item/set pairs:

Definition 15. We define Ew,G,Wi as the i:th final state set of a modified Earley

parse, where the rules are changed so that P cannot be added to set Xi if Pi ∈ W :

• Predictor : ∀(A → α · Bβ, j) ∈ Xi : ∀(B → γ) ∈ P : If (B → ·γ, i)i 6∈ Wthen add (B → ·γ, i) to Xi)

• Scanner (if i < |w|): ∀(A→ α ·wi+1β, j) ∈ Xi : If (A→ αwi+1 ·β, j)i+1 6∈ Wthen add (A→ αwi+1 · β, j) to Xi+1.

• Completer : ∀(A→ γ·, j) ∈ Xi : ∀(B → α·Aβ′, k) ∈ Xj : If (B → αA·β, k)i 6∈W then add (B → αA · β, k) to Xi.

Note that if W = ∅ then the modified rules won’t exclude any items and Ew,G,Wi =

Ew,Gi for all w,G, i.

27

Page 28: A New SAT Encoding of Earley Parsinguu.diva-portal.org/smash/get/diva2:788985/FULLTEXT01.pdf · A New SAT Encoding of Earley Parsing Tobias Neil While the Boolean satisfiability problem

6.3 The Relation � and Derivation Sets

The transitive relation � makes the encoding reflect how derivation sets prop-agate downwards in the production tree: If we interpret Xj(Q) ∧ (Pi � Qj)as ”this instance of Xj(Q) has Xi(P) in its derivation set”, we can clearly seehow the derivation sets accumulate with successive applications of modus ponens:No matter what encoding rule we apply on a left-hand side variable Xi(P), eachdisjunction Xj(Q) on the right-hand side will not only come with the propertyXi(P) � Xj(Q) but will also, through the application of modus ponens and thetransitive property of�, inherit the derivation set of the instance it was producedfrom. We can write any of the rules of the form

Xi(P)⇔ Pc ∧Xj(Q) ∧ (Pi � Qj) ∨ Pd

where Pc is the conjunctive part in the disjunction: wi(a) for some symbol a inthe case of Rule 8, > in the case of Rule 9 and Xk(R) for some k,R in the caseof Rule 10. Pd is the remaining disjunction; for Rule 8 this is ⊥. We therefore getthe equivalence

Xi(P) ∧∧Sl∈W

(Sl � Pi)

Pc ∧Xj(Q) ∧ (Pi � Qj) ∧∧Sl∈W

(Sl � Pi) ∨ (Pd ∧∧Sl∈W

(Sl � Pi))

Pc ∧Xj(Q) ∧∧

Sl∈W∪{Pi}

(Sl � Qj) ∨ (Pd ∧∧Sl∈W

(Sl � Qj))

In the first disjunction we can see that {Pi} has been added to the original deriva-tion set W .

6.4 Proof

We start by defining a lemma stating that if an instance of a disjunct on the right-hand side of some rule application is not in the production tree of an instance of theleft-hand side variable, then that disjunct is equivalent to ⊥. The lemma remainsunproven due to the limited scope of the thesis, but should be fairly self-evident.

Lemma 1. 8 Given the rule

Xi(P)⇔ (Pc ∧Xj(Q) ∧ (Pi � Qj)) ∨ Pd8This lemma remains unproven for the sake of brevity, but ought to be fairly self-evident.

28

Page 29: A New SAT Encoding of Earley Parsinguu.diva-portal.org/smash/get/diva2:788985/FULLTEXT01.pdf · A New SAT Encoding of Earley Parsing Tobias Neil While the Boolean satisfiability problem

(Q, j,W ∪ {Pi}) 6≺ (P , i,W ) implies that

(Xi(P) ∧∧Sl∈W

Sl � Pi)⇔ ⊥∨ (Pd ∧∧Sl∈W

Sl � Pi)

Lemma 2. P ∈ Ew,G,Wi implies that there is a pair (Q, j) s.t. (Q, j,W ∪ {Pi}) ≺

(P , i,W ).

Proof for Lemma 2 First off: Pi 6∈ W ; otherwise P could not be added to Xi

for any parse run (by the definition of the Ew,G,Wi -structure). Secondly, there must

be at least one Qj among the potential producers of Pi that can be produced with-out the item/set pairs in W : If not, then we obviously cannot add P to Xi withoutthe same pairs. Therefore there must be some potential producing item/set pairQj (or two such, should we deal with an instance of the completer rule) such thatQj 6∈ W . Rules 16 to 18 tell us that we then have an edge (Ni(P)W , Nj(QW∪{Pi}).The definition of ≺ tells us then that (Q, j,W ∪ {Pi}) ≺ (P , i,W ).

We will use induction over the order established by ≺ to show the relationshipbetween satisfiability of logical formulae and membership of their correspondingEarley item/set pairs. We shall, for the induction part of the proof, exclude theclause Xn(S ′ → S·, 0) from JGKn and reintroduce it further on. We call the reducedformula JG′Kn.

Theorem 2. For all P , i,W :

∃σ : evalσ(Xi(P) ∧∧Qj∈W

(Qj � Pi)) = > ∧ evalσ(JG′Kn) = >

⇔∃w ∈ Σ∗ : P ∈ Ew,G,W

i

Proof Assume that the above property holds for all (R, k, U), s.t. (R, k, U) ≺(P , i,W ). We will show that it holds for (P , i,W ) by first showing how, for eachrule used to construct the production tree, the above property will be preservedand then showing that it holds for all minimal elements in the order establishedby ≺.

6.4.1 Scanner Rule (8)

If P = (A → αa · β, j) then Q = (A → α · aβ, j). In order to show that theproperty persists when we move upwards in the production tree by the applicationof Rule 16, we need to prove that

(Q, i− 1,W ∪ {Pi}) ≺ (P , i,W ) ∧ ∃w ∈ Σ∗ : Q ∈ Ew,G,W∪{Pi}i−1

29

Page 30: A New SAT Encoding of Earley Parsinguu.diva-portal.org/smash/get/diva2:788985/FULLTEXT01.pdf · A New SAT Encoding of Earley Parsing Tobias Neil While the Boolean satisfiability problem

⇔ (19)

∃w ∈ Σ∗ : P ∈ Ew,G,Wi

and that

(Q, i−1,W∪{Pi}) ≺ (P , i,W )∧∃σ : evalσ(Xi−1(Q)∧∧

Sl∈W∪{Pi}

(Sl � Qi−1)) = >∧evalσ(JG′Kn) = >

⇔ (20)

∃σ : evalσ(Xi(P) ∧∧Sl∈W

(Sl � Pi)) = > ∧ evalσ(JG′Kn) = >

If (Q, i− 1,W ∪ {Pi}) ≺ (P , i,W ) then we either have

∃w ∈ Σ∗ : Q ∈ Ew,G,W∪{P}i−1

or¬∃w ∈ Σ∗ : Q ∈ Ew,G,W∪{P}

i−1

20 ⇒ In the first case we know that at some point in the parse of a word w,Q ∈ Xi−1, and that a scanner rule application on the same item will add P to theset Xi. We also know that Pi 6∈ W , since (P , i,W ) is the root of a subtree and theinvariant 14 thus prohibits Pi ∈ W . Therefore, if we can add Q to Xi−1 withoutthe use of the item/set pairs in W ∪ {Pi}, then we can add P to Xi without theuse of the item/set pairs in W . We can therefore conclude that P ∈ Ew,G,W

i .

20 ⇐ The other case is quite straightforward: If there is no word w for which(A→ α · aβ, j) ∈ E(w,G,W∪{Pi})

i−1 then there is no item to apply the scanner rule to

in order to produce (A→ αa · β, j) ∈ E(w,G,W )i .

19 ⇒ Looking at the logical formula, we will have to apply Rule 8 and so getthe equivalence

Xi(P) ∧∧Sl∈W

(Sl � Pi)

⇔Xi−1(Q) ∧

∧Sl∈W∪{Pi}

(Sl � Qi−1) ∧ wi(a)

Naturally, if there is no σ, such that

evalσ(Xi−1(Q) ∧∧

Sl∈W∪{Pi}

(Sl � Qi−1)) = >

30

Page 31: A New SAT Encoding of Earley Parsinguu.diva-portal.org/smash/get/diva2:788985/FULLTEXT01.pdf · A New SAT Encoding of Earley Parsing Tobias Neil While the Boolean satisfiability problem

then nor can there be a σ satisfying that formula with the constraint wi(a) added,since evalσ(A ∧ B) = > if and only if both evalσ(A) = > and evalσ(B) = >,and any σ satisfying the more constrained formula would also satisfy the lessconstrained one - contrary to our assumption.

We can thus conclude that

¬∃σ : evalσ(Xi−1(Q) ∧∧

Sl∈W∪{Pi}

(Sl � Qi−1)) = > ∧ evalσ(JG′Kn) = >

¬∃σ : evalσ(Xi(P) ∧∧Qj∈W

(Qj � Pi)) = > ∧ evalσ(JG′Kn) = >

19 ⇐ If, on the other hand, there is a σ satisfying the less constrained formula,then we only need to show that any such σ also can satisfy the constraint wi(a)to show that

∃σ : evalσ(Xi−1(Q) ∧∧

Sl∈W∪{Pi}

(Sl � Qi−1)) = > ∧ evalσ(JG′Kn) = >

∃σ : evalσ(Xi(P) ∧∧Sl∈W

(Sl � Pi)) = > ∧ evalσ(JG′Kn) = >

The only reason that evalσ(wi(a)) = ⊥ is if

(Xi(P) ∧∧Sl∈W

(Sl � Pi))⇒ wi(b), b 6= a

since the AMO constraint is the only constraint keeping us from setting all wk(c)to >. However, no movement upwards (that is applying the rules from right toleft) in the production tree requires us to set any letter variables to true9; thisonly happens when we move downwards in the tree. We thus need to show that noseries of left-to-right rule applications will cause the undesired implication. Sincewe only mention wi(b) (for some b) when the variable on the left-hand side is on theform Xi(R), and the only left-to-right implication of the above mentioned Xi(P)...is Xi−1(Q)... the implications would have to be on the form

(Xi(P)...)⇒ (Xi−1(Q)...)

(Xi−1(Q)...)⇒ (Xi(R)...)

9Once again, as the scope of this thesis is limited this statement lacks formal proof

31

Page 32: A New SAT Encoding of Earley Parsinguu.diva-portal.org/smash/get/diva2:788985/FULLTEXT01.pdf · A New SAT Encoding of Earley Parsing Tobias Neil While the Boolean satisfiability problem

(Xi(R)...)⇒ (wi(b)...)

But since m ≤ n for all (S,m, U) ≺ (T , n, V ), we cannot reach (R, i,W ′) from(Q, i − 1,W ∪ {Pi}). We therefore have no conflicting letter variable wi(b), andare free to set wi(a) to >.

We have yet to show what happens when (Q, i− 1,W ∪ {Pi}) 6≺ (P , i,W ), i.e.we’re dealing with the scanner rule: Lemma 1 tells us that

(Xi(P) ∧∧Sl∈W

Sl � Pi)⇔ ⊥∨ (Pd ∧∧Sl∈W

Sl � Pi)

and so Xi(P) ∧∧Sl∈W (Sl � Pi) is not satisfiable, since Pd ⇔ ⊥ for the scanner

case.Likewise, if (Q, i− 1,W ∪ {Pi}) 6≺ (P , i,W ) it can only be because Qi−1 ∈ W

(this being the only criterion for (Q, i−1,W∪{Pi}) ≺ (P , i,W )). If we then modifythe parser rules so as not to add the items in W ∪{Pi} to their corresponding sets,and Qi ∈ W , we will at no point have Q ∈ Xi for any parse run. We can thereforeconclude that Q 6∈ Ew,G,W∪{Pi}

i−1 and so we have no item to apply the scanner rule

to in order to add P to Ew,G,Wi .

We have now shown both A⇒ B and ¬A⇒ ¬B, which is the same as A⇔ Bfor both the parsing semantics and the encoding. Using our induction assumptionwe can conclude that for P = (A→ αa · β, j), the property is preserved:

∃σ : evalσ(Xi(P) ∧∧Sl∈W

(Sl � Pi)) = > ∧ evalσ(JG′Kn) = >

(Q, i−1,W∪{Pi}) ≺ (P , i,W )∧∃σ : evalσ(Xi−1(Q)∧∧

Sl∈W∪{Pi}

(Sl � Qi−1)) = >∧evalσ(JG′Kn) = >

(Q, i− 1,W ∪ {Pi}) ≺ (P , i,W ) ∧ ∃w ∈ Σ∗ : Q ∈ Ew,G,W∪{Pi}i−1

∃w ∈ Σ∗ : P ∈ Ew,G,Wi

6.4.2 Predictor Rule (9)

In order to show that the property persists when we move upwards in the produc-tion tree by the application of Rule 17, we need to prove that for P = (A→ ·γ, i):

∃σ : evalσ(Xi(P) ∧∧Sl∈W

(Sl � Pi)) = > ∧ evalσ(JG′Kn) = >

32

Page 33: A New SAT Encoding of Earley Parsinguu.diva-portal.org/smash/get/diva2:788985/FULLTEXT01.pdf · A New SAT Encoding of Earley Parsing Tobias Neil While the Boolean satisfiability problem

⇔ (21)

∃σ,Q : (Q, i,W ∪ {Pi}) ≺ (P , i,W ) ∧ evalσ(Xi(Q) ∧∧

Sl∈W∪{Pi}

(Sl � Qi)) = >

∧ evalσ(JG′Kn) = >

Where Q = (B → α · Aβ, j)

and that

∃w ∈ Σ∗ : P ∈ Ew,G,Wi

⇔ (22)

∃w ∈ Σ∗,Q : (Q, i,W ∪ {Pi}) ≺ (P , i,W ) ∧ |w| = n ∧Q ∈ Ew,G,W∪{Pi}i

Where Q = (B → α · Aβ, j)

We will begin with noting that Q ∈ Yi for all Q = (B → α · Aβ, j) such that(Q, i,W ∪ {Pi}) ≺ (P , i,W ). This is easily realized from the way we constructthe production tree, more precisely how Rule 17 is formulated: (R, i,W ∪{Pi}) ≺(P , i,W )} for all R ∈ Yi such that Ri 6∈ W .

21 ⇐ Rule 9 gives us that

Xi(P) ∧∧Sl∈W

(Sl � Pi)

⇔∨Q=(B→α·Aβ,j)∈Yi

(Xi(Q) ∧∧

Sl∈W∪{Pi}

(Sl � Qi)))

Since all Q = (B → α · Aβ, j) such that (Q, i,W ∪ {Pi}) ≺ (P , i,W ) are in Yi,any σ satisfying

Xi(Q) ∧∧

Sl∈W∪{Pi}

(Sl � Qi)

will also satisfy ∨Q=(B→α·Aβ,j)∈Yi

(Xi(Q) ∧∧

Sl∈W∪{Pi}

(Sl � Qi)))

We therefore only need to find one such pair of Q and σ in order to know thatXi(P)... can be satisfied:

33

Page 34: A New SAT Encoding of Earley Parsinguu.diva-portal.org/smash/get/diva2:788985/FULLTEXT01.pdf · A New SAT Encoding of Earley Parsing Tobias Neil While the Boolean satisfiability problem

∃σ,Q : (Q, i,W ∪ {Pi}) ≺ (P , i,W ) ∧ evalσ(Xi(Q) ∧∧

Sl∈W∪{Pi}

(Sl � Qi)) = >

∧ evalσ(JG′Kn) = >

∃σ : evalσ(Xi(P) ∧∧Sl∈W

(Sl � Pi)) = > ∧ evalσ(JG′Kn) = >

Where Q = (B → α · Aβ, j)

21 ⇒ In the opposite case, where no satisfying σ/Q-pair can be found, the right-hand side of the equivalence (and consequently the left-hand side) will have to relyon those disjuncts

Xi(R) ∧∧

Sl∈W∪{Pi}

(Sl � Qi)

where (R, i, U) 6≺ (P , i,W ). But Lemma 1 tells us that these disjuncts will resultin ⊥ - which is intrinsically unsatisfiable - when introduced in the context ofXi(P) ∧

∧Sl∈W (Sl � Pi). Since the right-hand side of the equality cannot be

satisfied by neither those disjuncts in the production tree of (P , i,W ), nor those notin it, the absence of satisfying σ/Q-pairs will result in the absence of a satisfyingσ for the (P , i,W )-clause, or:

¬∃σ,Q : (Q, i,W ∪ {Pi}) ≺ (P , i,W ) ∧ evalσ(Xi(Q) ∧∧

Sl∈W∪{Pi}

(Sl � Qi)) = >

∧ evalσ(JG′Kn) = >

¬∃σ : evalσ(Xi(P) ∧∧Sl∈W

(Sl � Pi)) = > ∧ evalσ(JG′Kn) = >

Where Q = (B → α · Aβ, j)

22 ⇐ If we, in parsing some word w, can add an item Q = (B → α · Bβ, j) tothe set Xi without the use of the item/set pairs in W ∪ {Pi}, then relaxing thetaboo constraint to just W and applying the predictor rule (9) on Q will add Pto Xi, and we thus get P ∈ Ew,G,W

i .

34

Page 35: A New SAT Encoding of Earley Parsinguu.diva-portal.org/smash/get/diva2:788985/FULLTEXT01.pdf · A New SAT Encoding of Earley Parsing Tobias Neil While the Boolean satisfiability problem

22 ⇒ If we assume that there is no w ∈ Σ∗ and Q = (B → α · Aβ, j) such that

(Q, i,W ∪ {Pi}) ≺ (P , i,W ) ∧ |w| = n ∧ Q ∈ Ew,G,W∪{Pi}i , then the occurrence of

P = (A → ·γ, i) ∈ Xi in the parsing of any w must have been produced by someR = (B → α · Aβ, j) ∈ Xi for the same w, where (R, i, U) 6≺ (P , i,W ). In orderto ensure that no item/set pair in W has been employed in producing P ∈ Xi,this constraint must obviously be fulfilled by the producing item R ∈ Xi. Inaddition, there must be at least some item/set pair producing P ∈ Xi which is inturn producible without employing Pi, lest P ∈ Xi should be dependent on itself.Thus, the most lax constraint we can put on R ∈ Xi is that the item/set pairs inW ∪ {Pi} have not been employed in its production. This in turn means that Umust encompass all those pairs, i.e. W ∪{Pi} ⊆ U . We also note that all the itemswe can add to a set Xi while ignoring the pairs in V ∪ V ′ we also can add to Xi

if we relax the constraint to only ignore the pairs in V , thus Ew,G,V ∪V ′

i ⊆ Ew,G,Vi .

We use this observation to state that if R ∈ Ew,G,Ui and W ∪ {Pi} ⊆ U , then

Ew,G,Ui ⊆ E

w,G,W∪{Pi}i and consequently R ∈ E

w,G,W∪{Pi}i . But looking at Rule

17, the only reason (R, i, U) 6≺ (P , i,W ) would be that Ri ∈ W , and so R by thedefinition of the derivation sets cannot be added to Xi, and so R 6∈ Ew,G,W∪{Pi}.If R cannot be added to Xi without employing the item/set pairs in W ∪ {Pi},then nor can it be used if we want to add P to Xi without employing the pairs inW . Thus, if

¬∃w ∈ Σ∗,Q : (Q, i,W ∪ {Pi}) ≺ (P , i,W ) ∧ |w| = n ∧Q ∈ Ew,G,W∪{Pi}i

Where Q = (B → α · Aβ, j)and since no R ∈ Ew,G,U

i such that (R, i, U) 6≺ (P , i,W ) can be used to constructP ∈ Ew,G,W

i , we have¬∃w ∈ Σ∗ : P ∈ Ew,G,W

i

Once again, we have shown that A ⇒ B and ¬A ⇒ ¬B for both the Earley setmembership and the encoding rules. We therefore conclude that for P = (A →·γ, i):

∃σ : evalσ(Xi(P) ∧∧Sl∈W

(Sl � Pi)) = > ∧ evalσ(JG′Kn) = >

⇔∃σ,Q : (Q, i,W ∪ {Pi}) ≺ (P , i,W ) ∧ evalσ(Xi(Q) ∧

∧Sl∈W∪{Pi}

(Sl � Qi)) = >

∧ evalσ(JG′Kn) = >Where Q = (B → α · Aβ, j)

35

Page 36: A New SAT Encoding of Earley Parsinguu.diva-portal.org/smash/get/diva2:788985/FULLTEXT01.pdf · A New SAT Encoding of Earley Parsing Tobias Neil While the Boolean satisfiability problem

∃w ∈ Σ∗,Q : (Q, i,W ∪ {Pi}) ≺ (P , i,W ) ∧ |w| = n ∧Q ∈ Ew,G,W∪{Pi}i

Where Q = (B → α · Aβ, j)

∃w ∈ Σ∗ : P ∈ Ew,G,Wi

6.4.3 Completer Rule (10)

The proof for the completer rule is so analogous to that for the predictor rule thatit will not be presented here.

6.4.4 Base cases

There are two possible situations for which we cannot further extend the produc-tion tree from Ni(P)W ; either there are no producers of the node - this happensfor (P , i) = ((S ′ → ·S, 0), 0) - or all the potential producers of Pi are in W . Inthe first case, (S ′ → ·S, 0) is always a member of Ew,G,W

0 iff (S ′ → ·S, 0) 6∈ W andX0(S

′ → ·S, 0) ∧∧Sl∈W (Sl � (S ′ → ·S, 0)0) is true iff (S ′ → ·S, 0) 6∈ W , since

X0(S′ → ·S, 0) is given as true in JGKn.

In the second case Lemma 1 shows that every disjunct

Pc ∧Xj(Q) ∧∧

Sl∈W∪{Pi}

(Sl � Qj)

on the right-hand side of

Xi(P) ∧∧Sl∈W

(Sl � Pi)

((Pc ∧Xj(Q) ∧∧

Sl∈W∪{Pi}

(Sl � Qj)) ∨ ((Pd ∧∧Sl∈W

Sl � Pi) ∨ R)

will turn out to be false, rendering the left-hand side unsatisfiable.Similarly, taking the inverse of Lemma 2 gives us that

¬∃(Q, j, U) : (Q, j, U) ≺ (P , i,W )⇒ P 6∈ Ew,G,Wi

and so the property holds for all the base cases.

36

Page 37: A New SAT Encoding of Earley Parsinguu.diva-portal.org/smash/get/diva2:788985/FULLTEXT01.pdf · A New SAT Encoding of Earley Parsing Tobias Neil While the Boolean satisfiability problem

6.4.5 In conclusion

We have shown that for all P , i,W :

∃σ : evalσ(Xi(P) ∧∧Qj∈W

(Qj � Pi)) = > ∧ evalσ(JG′Kn) = >

⇔∃w ∈ Σ∗ : P ∈ Ew,G,W

i

We defined JG′Kn as the formula JGKn produced by the encoding rules in Section4, without the clause Xn(S ′ → S·, 0), that is

JGKn ⇔ JG′Kn ∧Xn(S ′ → S·, 0)

We can extend the right-hand side with an empty conjunction, since∧Sl∈∅

(Sl � Tj)⇔ >

and thus get

JGKn ⇔ JG′Kn ∧Xn(S ′ → S·, 0) ∧∧Sl∈∅

(Sl � Xn(S ′ → S·, 0))

From this follows that∃σ : evalσ(JGKn) = >

⇔∃σ : evalσ(JG′Kn ∧Xn(S ′ → S·, 0) ∧

∧Sl∈∅

(Sl � Xn(S ′ → S·, 0))) = >

⇔∃w ∈ Σ∗ : (S ′ → S·, 0) ∈ Ew,G,∅

n

And since for all w,G, i : Ew,G,∅i = Ew,G, i.e. if W = ∅ then none of the parsing

rules will be modified and the set Ew,G,Wi is the same as it would be on a normal

run of the Earley parse. The statement (S ′ → S·, 0) ∈ Ew,Gn tells us that there is

some word w, the parsing of which will in its final state have (S ′ → S·, 0) ∈ Xn,which in turn tells us that ∃w ∈ Σ∗ : w ∈ L(G) ∧ |w| = n. We therefore get thedesired equality

∃σ : evalσ(JGKn) = >⇔

∃w ∈ Σ∗ : (S ′ → S·, 0) ∈ Ew,G,∅n

⇔∃w ∈ Σ∗ : w ∈ L(G) ∧ |w| = n

That is, JGKn is satisfiable if and only if there is a word w in the language generatedby G of length n. �

37

Page 38: A New SAT Encoding of Earley Parsinguu.diva-portal.org/smash/get/diva2:788985/FULLTEXT01.pdf · A New SAT Encoding of Earley Parsing Tobias Neil While the Boolean satisfiability problem

7 Discussion

A major shortcoming of this thesis is lack of a fully fleshed out proof. If indeedthere ever was a point to this thesis, it would be to prove that an encoding algo-rithm based on Earley parsing can work. As it stands now the proof sketch mightbe intuitively convincing, but is certainly not unassailable. This and the absenceof empirical data are due to a shortage of time largely caused by the author hav-ing followed too many red herrings and having spent too much time searching forother, better solutions: The current solution was in fact conceived at an earlystage, but due to its large output size the author looked for ways to a priori weedout any circular dependencies. At one point the project revolved solely aroundchartering the dependencies in the variable graph. Ultimately this approach wasscrapped since it failed to yield any improvements regarding run time or outputsize, and the current solution was finally decided upon. Another result of thesetime-consuming diversions is the rather lengthy and cumbersome style in whichthe proof sketch has been written. In the words of Pascal: I would have written ashorter letter, but I did not have the time.

8 Conclusion

8.1 Summary

A SAT encoding presented by Giannaros in [6] has been scrutinized, and a flawhas been found and discussed in Section 3.4. A new encoding aimed at remedyingsaid flaw is suggested in Section 4, and through the example in Section 4.2 itis reasonable to believe that the new encoding is successful in this; furthermorea proof sketch is presented. This, however, rests on a handful of lemmas that,though intuitively acceptable, remain unproven.

A way to algorithmically translate the encoding formula into CNF is laid out,and an upper bound for the sizes of the resulting formulae is established in Section5. It is found to be thus: O(|G|6n6) number of clauses, with O(|G|4n4) numberof variables. In Section 5.3 this result is compared with two previous encodings -that of Quimper and Walsh [8] and that of Axelsson et al [7] - and is found to givea worse theoretical bound on the output size. No empirical tests have been made,but an argument is made for the guess that the poor theoretical performance ofthe new encoding also yields a poor practical performance.

Three goals for future work are stipulated in Section 8.2, addressing the prob-lems of the large upper bound on clause size as well as the lack of correctness proofand empirical data.

38

Page 39: A New SAT Encoding of Earley Parsinguu.diva-portal.org/smash/get/diva2:788985/FULLTEXT01.pdf · A New SAT Encoding of Earley Parsing Tobias Neil While the Boolean satisfiability problem

8.2 Future Work

Theoretical performance As the results in Section 5 shows, the most pressingoptimization required for this algorithm to be of any practical use is that of findinga leaner way to represent the transitivity of the �-relation. The optimizationhinted at in Section 4 would be a reasonable first step, but there is no doubtmore work to be done here. An alternative approach would be to only encode theproduction trees introduced in Section 6 - this should yield strictly less variablesthan the current method.

Actual performance As has been discussed in previous sections, the correla-tion between theoretical and actual performance is rather weak. In [6] Giannarospresents test data for a variety of grammars - some of theoretical interest (likethe infinite ambiguous grammar shown in 2.2) and some of practical applicability(Java syntax, at most k-grammars). Anyone endeavoring to run empirical testson this encoding would do well in studying the results presented therein.

Proof The proof, such as it is, rests on several unproven assumptions, mostnotably Lemma 1 and the invariant (14). The current proof also omits how theletter variables are related to letters in a parsed word.

References

[1] S. Cook The complexity of theorem-proving procedures Proceedings of the 3rdAnnual ACM Symposium on Theory of Computing, 1971

[2] J. Aycock and R. N. Horspool, Practical Earley parsing, The Computer Jour-nal, vol. 45, no. 6, 2002

[3] J. Earley, An efficient context-free parsing algorithm. Communications of theACM, vol. 13, no. 2, 1970

[4] J. Barwise and J. Etchemendy Language, Proof and Logic CSLI Publications,2002

[5] Cormen, Leiserson, Rivest, Stein Introduction to Algorithms The MIT Press,2009

[6] P.A. Giannaros, A new SAT encoding of context-free grammars. Master’s thesis,University of Cambridge, submitted June 15th 2011

39

Page 40: A New SAT Encoding of Earley Parsinguu.diva-portal.org/smash/get/diva2:788985/FULLTEXT01.pdf · A New SAT Encoding of Earley Parsing Tobias Neil While the Boolean satisfiability problem

[7] R. Axelsson, K. Heljanko, and M. Lange, Analyzing context-free grammarsusing an incremental SAT solver Lecture Notes in Computer Science, vol.5126, pp. 410 - 422, 2008

[8] C. G. Quimper and T. Walsh, Decomposing global grammar constraints Prin-ciples and Practice of Constraint Programming, pp. 590 - 604, 2007

[9] A. Biere, M. Heule, H. van Maaren and T. Walsh, Handbook of SatisfiabilityIOS Press, 2009

[10] M. Lange and H. Leiss, To CNF or not to CNF? An efficient yet presentableversion of the CYK algorithm, Informatica Didactica, vol. 8, 2009

[11] G. Winskel The Formal Semantics of Programming Languages MIT Press,2001

40

Page 41: A New SAT Encoding of Earley Parsinguu.diva-portal.org/smash/get/diva2:788985/FULLTEXT01.pdf · A New SAT Encoding of Earley Parsing Tobias Neil While the Boolean satisfiability problem

9 Appendix

9.1 Logical derivation from Section 4.2

X2(P)⇔ X2(Q) ∧X0(R) ∧ (P2 � Q2) ∧ (P2 � R0)

X2(Q)⇔ X2(S) ∧X0(T ) ∧ (Q2 � S2) ∧ (Q2 � T0) ∨Q

X2(S)⇔ X2(U) ∧X2(P) ∧ (S2 � U2) ∧ (S2 � P2) ∨ S

We apply equivalence substitution on the last two clauses

X2(P)⇔ X2(Q) ∧X0(R) ∧ (P2 � Q2) ∧ (P2 � R0)

X2(Q)⇔ ((X2(U) ∧X2(P) ∧ (S2 � U2) ∧ (S2 � P2) ∨ S)∧

X0(T ) ∧ (Q2 � S2) ∧ (Q2 � T0)) ∨Q

...and again on the remaining two clauses

X2(P)⇔ (((X2(U) ∧X2(P) ∧ (S2 � U2) ∧ (S2 � P2) ∨ S)∧

X0(T ) ∧ (Q2 � S2) ∧ (Q2 � T0)) ∨Q)

∧X0(R) ∧ (P2 � Q2) ∧ (P2 � R0)

We then distribute the last conjunction

X2(P)⇔ (X0(R) ∧ (P2 � Q2) ∧ (P2 � R0)∧

(X2(U) ∧X2(P) ∧ (S2 � U2) ∧ (S2 � P2) ∨ S)∧

X0(T ) ∧ (Q2 � S2) ∧ (Q2 � T0))∨

(X0(R) ∧ (P2 � Q2) ∧ (P2 � R0) ∧Q)

We use commutativity and associativity of ∧ for some rearranging

X2(P)⇔ ((X0(R)∧ (P2 � Q2)∧ (P2 � R0)∧X0(T )∧ (Q2 � S2)∧ (Q2 � T0))∧

(X2(U) ∧X2(P) ∧ (S2 � U2) ∧ (S2 � P2) ∨ S)∨

(X0(R) ∧ (P2 � Q2) ∧ (P2 � R0) ∧Q)

And then distribute the first clause over the disjunction (X2(U)... ∨ S)

X2(P)⇔ (X0(R) ∧ (P2 � Q2) ∧ (P2 � R0) ∧X0(T ) ∧ (Q2 � S2)∧

(Q2 � T0) ∧X2(U) ∧X2(P) ∧ (S2 � U2) ∧ (S2 � P2))∨

41

Page 42: A New SAT Encoding of Earley Parsinguu.diva-portal.org/smash/get/diva2:788985/FULLTEXT01.pdf · A New SAT Encoding of Earley Parsing Tobias Neil While the Boolean satisfiability problem

(X0(R) ∧ (P2 � Q2) ∧ (P2 � R0) ∧X0(T ) ∧ (Q2 � S2) ∧ (Q2 � T0)) ∧ S)∨(X0(R) ∧ (P2 � Q2) ∧ (P2 � R0) ∧Q)

We do some reshuffling within the disjuncts and use the shortenings P10and R11,and so get

X2(P)⇔ (X0(R)∧X0(T )∧X2(U)∧X2(P)∧(P2 � Q2)∧(Q2 � S2)∧(S2 � P2)∧R)∨

(X0(R) ∧X0(T ) ∧ P ∧ S)∨(X0(R) ∧ (P2 � Q2) ∧ (P2 � R0) ∧Q)

9.2 Logical derivation from Section 6

Xk(R)⇔ (Xi(P) ∧∧Wl∈A

Wl � Pi) ∨ R

Xi(P)⇔ Pc ∧Xj(Q) ∧ (Pi � Qj) ∨ PdAnd by modus ponens we get

Xk(R)⇔ ((Pc ∧Xj(Q) ∧ (Pi � Qj) ∨ Pd) ∧∧Wl∈A

Wl � Pi) ∨ R

Which we can expand to

Xk(R)⇔ ((Pc ∧Xj(Q) ∧ (Pi � Qj) ∧∧Wl∈A

Wl � Pi) ∨ (Pd ∧∧Wl∈A

Wl � Pi)) ∨R

We distribute ”and”

Xk(R)⇔ ((Pc ∧Xj(Q) ∧ (Pi � Qj) ∧∧Wl∈A

((Wl � Pi) ∧ (Pi � Qj))) ∨ S12

Rule 13 (transitivity of �), along with some application of modus ponens, givesus

Xk(R)⇔ ((Pc ∧Xj(Q) ∧ (Pi � Qj) ∧∧Wl∈A

(Wl � Qj)) ∨ S

Which we can simplify to

Xk(R)⇔ ((Pc ∧Xj(Q) ∧∧

Wl∈A∪{Pi}

(Wl � Qj)) ∨ S

10P⇔ (P2 � Q2) ∧ (P2 � R0) ∧ (Q2 � S2) ∧ (Q2 � T0)11R⇔ (P2 � R0) ∧ (Q2 � T0) ∧ (S2 � U2)12S⇔ (Pd ∧

∧Wl∈AWl � Pi) ∨ R

42