Rule-Based Parsing of Morphologically Rich Languages


MASARYK UNIVERSITY

FACULTY OF INFORMATICS

Rule-Based Parsing of Morphologically Rich Languages

DISSERTATION THESIS PROPOSAL

Miloš Jakubíček

Advisor: doc. PhDr. Karel Pala, CSc.

Brno, January 18th 2012

Advisor's signature:

Contents

1 Introduction
   1.1 Syntax in Natural Language Processing: an Overview
2 State of the art in Syntax and Parsing
   2.1 Syntax and Parsing: the Theory and the Practice
   2.2 Syntax
      2.2.1 Phrase structure syntax
      2.2.2 Dependency syntax
      2.2.3 Phrase structure vs. dependency syntax
         Discontinuity and non-projectivity
         Coordinating dependencies
         Conclusions
      2.2.4 Advanced Syntactic Formalisms
         Head-Driven Phrase Structure Grammar
         Lexical Functional Grammar
         Tree-Adjoining Grammar
         Combinatory Categorial Grammar
         Link Grammar
         Conclusions
   2.3 Parsing
      2.3.1 Rule-Based Parsing
      2.3.2 Statistical Parsing
      2.3.3 Partial Parsing (Chunking)
      2.3.4 The Purpose of Parsing
      2.3.5 Parsing Evaluation
      2.3.6 Criticism of Statistical Parsing
   2.4 Parsing Systems
      2.4.1 English and multi-language parsers
      2.4.2 Czech parsers
3 Aims of the thesis
   3.1 Strong Lexicalization
   3.2 Grammar Stratification Exploiting Competing PCFG Rules
   3.3 Metagrammar Development with Focus on Rich Morphology
   3.4 Evaluation on Particular Applications
   3.5 Thesis Outcomes
   3.6 Schedule
4 Achieved results
   4.1 Improving Parsing Accuracy of Existing Systems
   4.2 Exploiting Parsers in Particular Applications
   4.3 Evaluation of Parsing Results
   4.4 Alternative Parsing Systems
5 Author's publications

Chapter 1

Introduction

1.1 Syntax in Natural Language Processing: an Overview

This thesis proposal focuses on syntactic analysis (parsing), a research area within the scope of natural language processing (NLP), a well-established research domain at the interface between computer science and linguistics dealing with natural languages. The field bears various other names, such as computational linguistics, human language processing and others, all of which are used interchangeably in the following text, referring to a field that aims at narrowing the still wide gulf between natural language and computer systems.

The bold goals of NLP can be seen in two respects. In a rather theoretical one, they consist of formal understanding and modelling of natural language, including its analysis and synthesis (based on separate linguistic levels), language acquisition, reasoning and other cognitive processes. Or they can be seen with regard to particular tasks, as defined by particular tools, applications and systems for information extraction and retrieval, machine translation, question answering, computational lexicography, corpus linguistics etc.

Where do syntax and syntactic analysis fit into this view? In theory, they are part of any tool and any language processing that goes beyond morphology (word forms). In practice this is, however, not true most of the time, for reasons such as the low accuracy and high complexity of many parsing solutions. There are successful NLP applications (e. g. in the area of information retrieval or machine translation) that use no or only very limited syntactic processing. Many other tools intentionally bypass any syntactic processing and rather use simple rule-based or statistical methods that rely on morphology only. Nevertheless, syntactic information might be of great use for a large variety of NLP applications. In this thesis proposal I try to shed some light on the state of the art of syntax and parsing technologies, together with my conclusions on what the next steps in this domain should be. I focus on the syntactic processing of morphologically rich inflective languages, especially on Czech.


The structure of this text is as follows: Chapter 2 contains an overview of the current state of the art in syntax and parsing, Chapter 3 outlines the aims of my thesis, Chapter 4 summarizes the results achieved so far and finally Chapter 5 provides an overview of my publications.


Chapter 2

State of the art in Syntax and Parsing

2.1 Syntax and Parsing: the Theory and the Practice

How do syntax and parsing stand to each other? Like theory to practice. Starting with the era of Noam Chomsky [Chomsky, 1965], linguists have intensively studied the formal properties of natural language syntax: how words combine into phrases, phrases into clauses, and how clauses finally build up a sentence; how morphology is propagated into syntax and what the interface between syntax and semantics is; to what extent syntactic properties can be universal among natural languages. Independently, advances in computer technology brought practical parsing techniques, focusing primarily on formal languages – to parse communication protocols and programming languages.

In this chapter I present a very brief overview of the most influential theories of syntax and of parsing methods and systems.

2.2 Syntax

2.2.1 Phrase structure syntax

The modern era of formal investigation of natural language syntax was initiated by Noam Chomsky's Syntactic Structures [Chomsky, 1957]. Chomsky laid down the principles of phrase structure syntax and the related formal definition of transformational generative grammar – some of whose principles, as he claimed, shall be universal and innate – and further evolved the theory, finally ending up with the Minimalist Program [Chomsky, 1995]. Even though his theories have later been criticised [Johnson and Lappin, 1997, Sampson, 2001, Sampson, 2005] as missing empirical evidence, and may be considered overcome to a large extent, he was one of the most influential syntacticians of the 20th century. And while I believe that language is far more than just a grammar, the notion of a phrase structure (constituent), corresponding to a formal grammatical derivation, will certainly survive as a concept that can be practically used in combination with other language description methods.

In the Chomsky hierarchy, natural languages have been mostly assigned to the class of context-free languages, described by a context-free grammar (CFG). The question whether CFGs have enough expressive power to describe natural languages was often disputed in linguistic circles with regard to English in the 70's and 80's, and it was claimed by many that this is not the case, mostly without any empirical evidence or with flawed proofs. A very good overview of this topic can be found in [Pullum and Gazdar, 1982].

Although it has later been shown that a few phenomena in some particular languages (e. g. Swiss German1, see [Shieber, 1985]) indeed exceed the expressivity of CFGs even in terms of weak generative capacity (i. e. string equivalence), and despite the fact that contextual phenomena2 generally play an important role in natural language processing, CFGs remain the core of many grammar-based systems, often extended to Probabilistic CFGs (PCFG) – for well-known practical reasons, such as parsing time in O(n3) (see e. g. [Younger, 1967]).
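The O(n3) bound mentioned above comes from chart-based algorithms such as CYK [Younger, 1967]. As an illustrative sketch only – the toy grammar and sentence are invented for this example and are not taken from any system discussed here – a minimal CYK recognizer for a CFG in Chomsky normal form can look as follows:

```python
from itertools import product

# Toy CFG in Chomsky normal form (invented for illustration).
binary_rules = {("NP", "VP"): {"S"},
                ("Det", "N"): {"NP"},
                ("V", "NP"): {"VP"}}
lexical_rules = {"the": {"Det"}, "dog": {"N"}, "cat": {"N"}, "saw": {"V"}}

def cyk_recognize(words):
    """CYK recognition in O(n^3): chart[i][j] holds all non-terminals
    that derive the span words[i:j]."""
    n = len(words)
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] = set(lexical_rules.get(w, ()))
    for width in range(2, n + 1):            # span length
        for i in range(n - width + 1):       # span start
            j = i + width
            for k in range(i + 1, j):        # split point
                for pair in product(chart[i][k], chart[k][j]):
                    chart[i][j] |= binary_rules.get(pair, set())
    return "S" in chart[0][n]

print(cyk_recognize("the dog saw the cat".split()))  # True
```

The three nested loops over span length, span start and split point give the cubic behaviour; the massive ambiguity of natural language shows up as multiple non-terminals per chart cell.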

2.2.2 Dependency syntax

In contrast to phrase structure syntax, which places adjacency (constituency) as the dominant syntactic principle, dependency syntax relies on an entirely different notion – as the name suggests, the dependency, a binary relation between words. The work on dependency syntax is historically attributed to Tesnière [Tesnière, 1959] and has been further independently developed by multiple linguistic groups, resulting in a range of dependency syntax theories, among the most notable definitely being Hudson's Word Grammar [Hudson, 1984], Mel'cuk's Meaning-Text Theory (MTT) [Mel'cuk, 1988], and for Czech the work done by Šmilauer [Šmilauer, 1969], later partially adopted by the Prague linguistic circle and exploited in the Functional Generative Description (FGD) [Sgall et al., 1986].

Despite large theoretical differences among all these theories, the underlying idea is pretty much the same: syntactic structure is described by assigning each word – the dependent – to its syntactic head, i. e. a word that the dependent is governed by, together with a further specification of this relation, usually called the dependency (or edge, arc) label.
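This head-plus-label description translates directly into a very small data structure. In the following sketch the sentence and the label set are purely illustrative; each word stores the 1-based index of its head, with 0 marking the artificial root:

```python
# A dependency analysis as described above: every word is assigned the
# 1-based index of its head (0 = artificial root) plus a label.
# Sentence and label set are purely illustrative.
words  = ["Peter", "reads", "a", "book"]
heads  = [2, 0, 4, 2]
labels = ["Sb", "Pred", "Atr", "Obj"]

def dependents(h):
    """All words directly governed by the word at 1-based position h."""
    return [words[i] for i, head in enumerate(heads) if head == h]

print(dependents(2))  # ['Peter', 'book']
```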

1. Which is a dialect, not to be confused with Swiss Standard German.
2. Which have absolutely nothing to do with the exact definition of context-free languages, of course.


2.2.3 Phrase structure vs. dependency syntax

Before moving forward to other, more advanced theories and grammatical frameworks, some of them going far beyond the borders of syntax into semantics, I would like to do away with the question whether one of these two antagonistic theories – phrase structure and dependency – is a clear winner that outperforms its counterpart. It simply is not. Both of these theories have their Achilles' heels: the cases listed below, where the principles they rely on do not seem to work very well.

Discontinuity and non-projectivity

In the case of phrase structure syntax, this concerns discontinuous phrases. They are expressible in dependency grammar, where this phenomenon gained a new name – non-projectivity – and (at least in Czech) it has been paid far more attention [Hajicová et al., 2004, Havelka, 2007] than it deserves in my eyes. In the example of the Prague Dependency Treebank (PDT) [Hajic, 1998], it can be seen that most of the non-projective constructions result just from the need to adjust the related dependency tree to the particular theory (FGD), with no application-driven justification that the non-projective tree would be syntactically more informative than any of its possible projective (but theoretically less valid) variants. Non-projectivity in this sense becomes just a property of the given theory and formalism.

Even with such an extensive concept of non-projectivity, after building the first version of the PDT, called PDT 1.0, Hajic et al. reported in Chapter 2 of [Hajic et al., 1998] that only 1.8 % of all dependencies in PDT 1.0 were non-projective, corresponding to almost 80 % of all sentences having no non-projective dependency and more than 93.8 % of sentences having at most one non-projective dependency.

Czech is often claimed [McDonald et al., 2005, Hall and Novák, 2005] to be one of those languages where, due to the free word order, non-projectivity almost enforces using dependency syntax, but taking the above rough calculations into consideration, one comes very quickly to the conclusion that true non-projectivity is rather rare and occurs only in archaic or poetic language, or in the case of movement of some constituent due to topic-focus articulation.
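The figures above rest on the usual definition of non-projectivity: an edge is non-projective if some word lying strictly between the dependent and its head is not dominated by that head. A hypothetical check over the head-index representation, assuming the input encodes a well-formed tree, can be sketched as:

```python
def nonprojective_edges(heads):
    """Positions (1-based) of dependents whose edge to their head is
    non-projective, i.e. some word strictly between the dependent and
    the head is not dominated by the head. `heads` is assumed to encode
    a tree: heads[i] is the head of word i+1, 0 marking the root."""

    def dominated_by(h, w):
        while w != 0:            # climb from w towards the root
            if w == h:
                return True
            w = heads[w - 1]
        return False

    bad = []
    for dep in range(1, len(heads) + 1):
        head = heads[dep - 1]
        if head == 0:
            continue
        lo, hi = min(dep, head), max(dep, head)
        if any(not dominated_by(head, w) for w in range(lo + 1, hi)):
            bad.append(dep)
    return bad

print(nonprojective_edges([3, 0, 2, 2]))  # [1]  (edge 1->3 crosses word 2)
print(nonprojective_edges([2, 0, 2]))     # []   (projective tree)
```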

This claim can also be supported by the detailed analysis of non-projective dependencies in the PDT provided in [Hajicová et al., 2004], where at least 70.5 % of the non-projective dependencies3 could have a very straightforward projective variant (within the dependency notation) or a continuous phrase structure representation.

3. Namely the following categories in the referred paper: A1 – function words (21 %), A2 – prepositional groups with a focus sensitive particle (28 %), A3 – noun phrases with both preceding and succeeding genitive numerals (0.6 %), B4 – numerals with dislocated dependent (1.3 %), B6 – relatives and interrogatives (1.6 %) and B8 – particles (18 %).

To conclude: I consider non-projectivity to be a valid issue in processing the syntax of natural languages (for some of them more serious than for others), but not an argument that would rule out phrase structure parsing.

Coordinating dependencies

From the opposite direction, there are valid objections that for some constructions (primarily coordinations) it is very hard to imagine what kind of dependency relation there should be (e. g. between a conjunction and the two members it is coordinating). Again, from the practical point of view this does not play a role at all, at least as long as one is able to reconstruct the coordinated structure.

Conclusions

From what was said, can one conclude that these two theories, and the particular formalisms behind them, are more or less the same? Definitely not. The fact that both of them have their pitfalls does not make them more "same" in any way. Formally, they are mutually transformable, and it is a fairly easy exercise to write an algorithm that will convert a phrase structure tree to its structurally equivalent dependency counterpart and vice versa. But this is not so easy anymore when it comes to the dependency labels and the bottlenecks of both formalisms that have been mentioned, i. e. non-projectivity and coordinations. Moreover, what does not get converted is the information value entailed in the tree: about constituency/constituents in the former case, about mutual dependencies in the latter one.
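The "fairly easy exercise" of converting a phrase structure tree into dependencies can be sketched via head percolation: every phrase designates one child as its head, and the lexical heads of the remaining children become dependents of the head child's lexical head. The head table and the tree below are invented for illustration:

```python
# Which child label carries the head of each phrase (hypothetical table).
head_child = {"S": "VP", "NP": "N", "VP": "V"}

def lexical_head(node, deps):
    """Return the lexical head (word, position) of `node`, appending a
    (dependent word, head word) pair for every non-head child to `deps`."""
    label, children = node
    if isinstance(children, int):          # leaf: (word, position)
        return (label, children)
    heads = [lexical_head(c, deps) for c in children]
    if len(children) == 1:                 # preterminal: its word is the head
        return heads[0]
    child_labels = [c[0] for c in children]
    head = heads[child_labels.index(head_child[label])]
    for h in heads:
        if h != head:
            deps.append((h[0], head[0]))
    return head

tree = ("S", [("NP", [("Peter", 1)]),
              ("VP", [("V", [("reads", 2)]),
                      ("NP", [("Det", [("a", 3)]),
                              ("N", [("book", 4)])])])])
deps = []
print(lexical_head(tree, deps)[0], deps)
# reads [('a', 'book'), ('book', 'reads'), ('Peter', 'reads')]
```

Note how the dependency labels are exactly what this conversion cannot supply: the output records who depends on whom, but the nature of each relation has to come from elsewhere.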

To sum up: both formalisms express different phenomena, and from a practical point of view it may vary which one is more useful (or say, more easily usable) depending on the particular application. My personal feeling is that the dependency notion is somewhat simpler to deal with, though this depends heavily on the label set that is used to describe the dependency relations.

The lesson for parsing that comes out of this? If possible, a parser should be able to produce both phrase structure and dependency output.



2.2.4 Advanced Syntactic Formalisms

In the following I provide a very brief overview of advanced syntactic formalisms that adopt phrase structure or dependency syntax, and finally present conclusions with regard to their usability.

Head-Driven Phrase Structure Grammar

Head-driven phrase structure grammar (HPSG) [Pollard and Sag, 1994] is a formalism that adopts basic phrase structure syntax and builds on the Generalized phrase structure grammar (GPSG) [Gazdar, 1985]. In contrast to plain phrase structure grammar, HPSG specifies immediate dominance and linear precedence separately, and the grammar is not derivational but unification-based – it exploits a sophisticated feature hierarchy attached to the phrase elements, called a typed feature structure and formalized as a bounded complete partial order. HPSG is lexicalized (words and phrases are handled differently) and goes beyond syntax in that it also specifies semantic categories (such as agent, patient etc.).
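The unification-based core can be illustrated on drastically simplified feature structures represented as nested dictionaries. This is a sketch only: real HPSG feature structures are typed and may share substructures (reentrancy), both of which are deliberately omitted here:

```python
def unify(a, b):
    """Recursively unify two feature structures represented as nested
    dicts; atomic values must match exactly. Returns None on failure."""
    if not isinstance(a, dict) or not isinstance(b, dict):
        return a if a == b else None
    out = dict(a)
    for feat, val in b.items():
        if feat in out:
            u = unify(out[feat], val)
            if u is None:          # feature clash: unification fails
                return None
            out[feat] = u
        else:
            out[feat] = val
    return out

np = {"CAT": "NP", "AGR": {"NUM": "sg"}}
v  = {"AGR": {"NUM": "sg", "PER": "3"}}
print(unify(np, v))                       # {'CAT': 'NP', 'AGR': {'NUM': 'sg', 'PER': '3'}}
print(unify(np, {"AGR": {"NUM": "pl"}}))  # None (number clash)
```

Agreement constraints – exactly the kind of information a morphologically rich language supplies in abundance – are enforced by the failure case.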

Lexical Functional Grammar

Lexical functional grammar (LFG) [Bresnan, 2001] also inherits the basic phrase structure notion, renamed within LFG to c-structure (constituent structure), and adds a second level represented by a so-called f-structure (feature structure). As the name suggests, it is also lexicalized, and it uses the notion of a function to describe the feature structure, hence functional.

Tree-Adjoining Grammar

Tree-adjoining grammar (TAG) [Joshi and Schabes, 1997] differs from both dependency and phrase structure syntax in that the basic elements are not words or non-terminals, but partially specified syntax trees that are subject to several admissible tree operations. As a result, TAG has a higher generative capacity than a context-free grammar, but lower than a context-sensitive grammar. The class of languages such a system may generate is usually called mildly context-sensitive languages, at the cost that practical parsing algorithms have been proven to have O(n6) time complexity [Satta, 1994]. The formalism later evolved into Lexicalized Tree-Adjoining Grammar (LTAG), with the elementary trees assigned a lexical anchor (i. e. a word).


Combinatory Categorial Grammar

Combinatory categorial grammar (CCG) [Steedman, 2000] is a formalism that also uses a phrase structure representation, but is not derivation-based; instead it operates using composition of functions (combinatory operators). Its formal background is adopted from the typed lambda calculus. It has been proven that CCG is weakly equivalent (in terms of generating the same string language) to TAG [Vijay-Shanker and Weir, 1994].

Link Grammar

Link grammar (LG) [Sleator and Temperley, 1993] represents an interesting attempt to merge the advantages of both phrase structure and dependency syntax. The result of LG parsing is also a binary relation on words, but not in terms of a head-dependent relation; rather, the links are left-right directional, have a given length, and may even have a constituent interpretation.

Conclusions

The very brief overview presented above serves just to demonstrate the general movement of syntactic theories over the past two decades towards ever more complex theories, trying to extend the expressive power of the formalisms and to cover semantics and syntax in one package, both being plausible from the point of view of numerous linguistic schools. This holds for all of them except for Link Grammar, and I believe it is not just a coincidence that it is Link Grammar that has found the most applications, as is shown in Section 2.4.

I find it unfortunate that the development of various syntactic theories was not driven by existing ones proving practically insufficient for the purposes of NLP, as this contributed to an unnecessary pulverization of efforts in parsing, and even to obscuring the purpose of parsing to some extent, as is discussed in the next section.

2.3 Parsing

2.3.1 Rule-Based Parsing

Rule-based parsing of natural languages basically originates in the good old times when the foundations of the theory of formal languages were laid down: give me a grammar and a sentence, and I will tell you whether it is a well-formed sentence according to the grammar, and what a derivation tree would look like. The massive ambiguity of natural languages required multiple adjustments of the core algorithms that were used to parse simple and deterministic communication protocols and programming languages. This was a tempting task for computer science, which made a great contribution to parsing algorithms, using standard techniques such as dynamic programming and memoization, and resulting in various practically usable algorithms for efficient parsing of context-free grammars (see e. g. [Kay, 1986]).

There are, however, considerable related questions, theoretical as well as practical: given the variety of natural languages, is it feasible to describe them manually by rules? Indeed, the development and maintenance of hand-written grammars turned out to be a hard job that needs to be addressed properly, e. g. by using meta-grammars (see [Kadlec and Horák, 2005, Debusmann et al., 2003]), and is still considered to be one of the main counterarguments against rule-based parsing.

2.3.2 Statistical Parsing

Issues related to grammar development and the rise of statistical NLP in the past decade led to a predictable outcome: learning a grammar from existing corpora with syntactic annotation (see e. g. [Brill, 1993, Belz, 2002]), or skipping a grammar (formalized as such) entirely by producing statistical models that can be directly used for parsing (see e. g. [Charniak, 2000]). Statistical methods have proven to be the most promising solution in many areas of NLP (tagging, machine translation etc.), and they are the solution in speech processing, so why not use them for parsing?
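In its simplest form, learning a grammar from a treebank is mere counting: every local tree contributes one rule occurrence, and rule probabilities are relative frequencies per left-hand side. The two-tree "treebank" below is invented for illustration:

```python
from collections import Counter, defaultdict

def rules(node):
    """Yield (parent, children) rules from a tree written as
    (label, [children]) with bare strings as leaves."""
    label, children = node
    kids = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, kids)
    for c in children:
        if not isinstance(c, str):
            yield from rules(c)

def estimate_pcfg(treebank):
    """Maximum-likelihood PCFG: P(A -> beta) = count(A -> beta) / count(A)."""
    counts = Counter(r for tree in treebank for r in rules(tree))
    by_lhs = defaultdict(int)
    for (lhs, _), c in counts.items():
        by_lhs[lhs] += c
    return {rule: c / by_lhs[rule[0]] for rule, c in counts.items()}

toy = [("S", [("NP", ["she"]), ("VP", [("V", ["runs"])])]),
       ("S", [("NP", ["she"]), ("VP", [("V", ["sees"]), ("NP", ["him"])])])]
pcfg = estimate_pcfg(toy)
print(pcfg[("VP", ("V",))])  # 0.5 (one of the two VP expansions)
```

Real systems add smoothing, lexicalization and markovization on top of this counting core, but the estimate itself stays a relative frequency.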

Syntactically annotated corpora (treebanks) have been prepared for a growing number of languages, serving as a source of training, development and evaluation data. Concurrently, the efforts in parsing over the past ten years have been mostly oriented towards statistical parsing (as can be seen in the overview of parsing systems later).

At this point I will draw attention to the purposes of parsing, and then get back to the question whether statistical modelling is the solution for parsing as well.

2.3.3 Partial Parsing (Chunking)

Standalone research has been conducted in the direction of partial parsing, i. e. identifying either flat structures (phrases) or phrase subtrees in the given sentence (see e. g. [Abney, 1996]). Various instances of this task, such as NP bracketing (see e. g. [Vadas and Curran, 2007]) and NP chunking (see e. g. [Sha and Pereira, 2003]), have been studied. Partial analysis may definitely be of great use for particular applications, but in this proposal I primarily focus on full parsing. Nevertheless, I am convinced that partial analysis (or some sort of phrasal structures) should be a legitimate additional output of standard (full) parsers.

2.3.4 The Purpose of Parsing

Before moving on to my conclusions about rule-based versus statistical parsing, it is necessary to finally get to the bottom of the problem and declare what I view as the purpose of parsing. It is usually defined as "recovering the structure of a sentence", but this speaks rather about what the parser does, not about what its results should be used for. I am not going to give any less vague definition, but will focus on the purposes of parsing instead.

For many computational linguists, parsing corresponds to producing some sort of a structure that fits and confirms a particular theory of syntax (or of language in general). While I am not saying that this might not be an appealing linguistic task, that is not the view I share at all. I see the purpose of parsing and parsers in terms of standard tools for NLP that do not represent a final goal as such, but should contribute to improving other applications and serve many tasks.

This issue is closely related to how parsing results should be represented. I am confident that the most used representation, namely a tree (be it a phrasal or a dependency one), is not one that would be directly usable by many applications.

In general I have a rather skeptical attitude towards how much parsing and parsers fulfil this purpose nowadays. And I think it is this fact – that parsers are primarily not developed to serve in multiple particular applications – together with a general uncertainty about the representation of results, that makes me conclude that parsers are hard to employ in existing applications and do not serve their particular needs well. Of course, the obvious counterargument is that this approach might very well deteriorate into a situation where each application requires a principally different parser, and indeed does parsing from scratch and hence necessarily partially reinvents the wheel. I think this is a valid statement that may even come true, but that is just the way it is. On the other hand, I take this as a present challenge to elaborate on what would be a useful common representation of syntax that would be easily accessible by multiple applications, and I address this in my thesis aims.

The current state of parsing applications can also be seen reflected in proceedings and journal articles: the word parser is mentioned in 7,232 of the (by the date) 21,066 papers available within the ACL Anthology,4 a collection of papers from the NLP domain over the past 20 years. But looking for phrases matching the regular expression (used|using|employ|employing|exploit|exploiting) a? parser, one gets only 133 results. This is of course a very rough and non-rigorous estimate; nevertheless, the message is clear: most papers mentioning parsers talk about parsing, but not about using parsers.

And even if they talk about using parsers, it is not unlikely to find paperscautiously mentioning the truth:

“Our results do not yet indicate that parsing is beneficial to . . . ” [Musillo and Merlo, 2006, p. 104]

“There is no doubt that collocation extraction should be based on syntactic preprocessing of the source corpora (simply because collocations often have syntactic structure), but the evaluation presented in the book is not very convincing.” [Pecina, 2011, p. 633]

In [McCarthy and Navigli, 2007], word sense disambiguation was described as a “task in need of an application”, and I am afraid that this applies to a large extent to parsing as well. Until parsing becomes primarily application-driven, it will be subject to Anja Belz's proposition [Belz, 2009], used as the title of one of the Last Words articles in the Computational Linguistics journal: That's nice . . . what can you do with it? A citation from the same source is also going to serve as my conclusion on the purposes of parsing:

“If we don’t include application purpose in task definitions then not only do we not know which applications (if indeed any) systems are good for, we also don’t know whether the task definition (including output representations) is appropriate for the application purpose we have in mind.”

[Belz, 2009, p. 113]

2.3.5 Parsing Evaluation

Proper evaluation is a crucial part of the development of any NLP tool. Investigating how parsers are evaluated, it turns out that, except for very rare but notable exceptions, most parsers are evaluated in a similar manner to many other NLP tools, namely by comparison to some gold standard parsing output handcrafted manually by human annotators. I am strongly convinced that this approach, though widely accepted, has severe theoretical as well as practical drawbacks, which I list below.

4. http://aclweb.org/anthology-new/

The theory and the practice
A general theoretical remark follows up on the discussion of the purposes of parsing above: when aiming at practical applications, it is hard to justify that a particular syntactic theory and representation is the syntax, and that by approximating it one is going to build a tool that will be useful for various applications.

Inter-annotator agreement
Another, even more important theoretical issue is the quality of the gold standard data one is compared to. Even though this problem has been investigated from the statistical point of view, resulting in measures such as Cohen's and Fleiss' Kappa (see [Cohen et al., 1960, Carletta, 1996, Fleiss, 1981]; a very good overview is also given in [Artstein and Poesio, 2008]) that take into account the possibility of mutual agreement just by chance, I do think that the overall implications of this process in general (mainly with regard to the consistency of manually annotated data) are still underestimated, partially also deliberately, as they might invalidate many prestigious contributions to the field.
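For two annotators, the chance-corrected agreement mentioned above, Cohen's kappa, is kappa = (P_o - P_e) / (1 - P_e), where P_o is the observed agreement and P_e the agreement expected by chance from each annotator's label distribution. A small sketch with invented annotation labels:

```python
from collections import Counter

def cohens_kappa(ann1, ann2):
    """Cohen's kappa: observed agreement corrected for the agreement
    expected by chance from the two annotators' label distributions."""
    assert len(ann1) == len(ann2)
    n = len(ann1)
    p_obs = sum(a == b for a, b in zip(ann1, ann2)) / n
    c1, c2 = Counter(ann1), Counter(ann2)
    p_exp = sum(c1[k] * c2[k] for k in c1) / (n * n)
    return (p_obs - p_exp) / (1 - p_exp)

# Two (invented) annotators labelling five words.
a = ["Sb", "Obj", "Obj", "Sb", "Adv"]
b = ["Sb", "Obj", "Sb", "Sb", "Adv"]
print(round(cohens_kappa(a, b), 4))  # 0.6875: 0.8 raw agreement, corrected down
```

The raw agreement of 0.8 shrinks once chance agreement (here P_e = 0.36) is factored out, which is exactly why kappa is preferred over plain percent agreement.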

Inter-annotator agreement (IAA) for syntactic annotation is, to the best of the author's knowledge, not known for the most used English syntactically annotated corpus (treebank), namely the Penn Treebank [Marcus et al., 1993]; the same applies to the already mentioned largest Czech treebank, the PDT. Some figures are available for the German NEGRA and TIGER treebanks [Brants, 2000, Brants and Hansen, 2002], unfortunately only in terms of an f-score that is claimed to be an appropriate measure of IAA, computed on syntactic nodes that have been identified identically by two annotators.

An extremely interesting contribution5 to this topic is the work done by [Sampson and Babarczy, 2008], who tried to set an upper bound on human IAA, i. e. tried to answer the question of how high human mutual agreement on syntactic annotation can go. They show that even extraordinarily skilled professionals with many years of experience with the annotation scheme achieve an average IAA of 95 %, and that more precise annotation instructions do not help to achieve higher IAA.

5. Unfortunately largely unknown to a wide audience.


Finally, there is also a rather philosophical question regarding IAA that I leave unanswered: if it should be a scientific measure in Popper's sense [Popper, 1959], under what conditions is it falsifiable?

Similarity metrics
The last counterargument to evaluation against treebanks is a rather practical one: what is a proper similarity metric for a syntactic tree (or any non-flat structure)? There are several established similarity metrics for this purpose. For phrase structure trees, it is mainly the PARSEVAL metric [Abney et al., 1991] and the Leaf-Ancestor Assessment (LAA) metric [Sampson, 2000], which has later been argued to be more appropriate than PARSEVAL in [Sampson and Babarczy, 2003]. For dependency structures, plain dependency precision (in terms of correctly identified dependencies) can be used, either with or without taking dependency labels into consideration.
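The unlabeled and labeled variants of dependency precision mentioned above (commonly called attachment scores) are straightforward to compute; the gold and system analyses below are invented for illustration:

```python
def attachment_scores(gold, system):
    """Unlabeled and labeled attachment scores for one sentence, given
    parallel lists of (head, label) pairs, one pair per word."""
    n = len(gold)
    uas = sum(g[0] == s[0] for g, s in zip(gold, system)) / n
    las = sum(g == s for g, s in zip(gold, system)) / n
    return uas, las

gold   = [(2, "Sb"), (0, "Pred"), (2, "Obj")]
system = [(2, "Sb"), (0, "Pred"), (1, "Obj")]
print(attachment_scores(gold, system))  # both 2/3: one head is wrong
```

Note that the two numbers treat every edge as equally important, which is precisely the weakness discussed next.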

The root of this problem lies, however, in the fact that any general similarity metric necessarily suffers from the issue that not all parts of the syntactic annotation are equally important: whether an interjection was misattached in the tree is much less important than whether the subject of the sentence was. Therefore any such similarity metric cannot answer the question of how bad the result is, only how structurally different it is. To avoid this, some very complex weighting of the different parts of the structure would be necessary, which would be very hard to define.

Promising attempts to compare parser performance on a particular task have been made for English. In [Miyao et al., 2009], parsers were set to extract protein-protein interactions from biomedical texts, and it is not surprising that their performance did not correspond at all to their f-score rank on the Wall Street Journal portion of the Penn Treebank. A similar attempt was made when parsers' contribution to machine translation was evaluated recently in [Katz-Brown et al., 2011], coming to the same conclusion. Both of these provide empirical evidence for the claim that evaluating parser performance on treebanks is not appropriate and does not correlate with the primary end applications parsers were designed for.

It is clear that evaluating parsers on particular applications is much more difficult than just running a test suite that computes similarity scores against some gold data set. Moreover, one would need to evaluate against as many applications as possible to avoid overfitting the parser to


one very narrowly defined particular application. Nevertheless I consider this to be the necessary step to move parsers actually closer to their final tasks, and I address this issue in the aims of my thesis as well.

2.3.6 Criticism of Statistical Parsing

In the following I would like to return to the question of the appropriateness of statistical parsing and, to make it clear right at the beginning, sum up my arguments against statistical parsing as one of the state-of-the-art methods. First of all, this is of course not meant to discourage the use of statistical methods as such. I do believe that the progress in statistical NLP over the past decade was (and will continue to be) an extremely important contribution. Statistical methods have proven to be very useful for many NLP tasks, so why not for parsing?

To explain, let us compare parsing with one of the tasks where statistics is at the core of any serious solution to the problem, namely morphological tagging. Tagging has all the properties that make it a good candidate for statistical modelling: it is a clearly defined task with well-measurable solutions and high inter-annotator agreement, and its output is directly usable and used by many applications. In other words: we know what we want6.

Parsing, on the other hand, lacks almost all of these properties: as discussed, its definition is tightly bound to a particular theory and syntactic representation, parsing accuracy is hard to evaluate precisely and often relies on language resources with unknown inter-annotator agreement, and parsing output is frequently not directly usable by the end application as such.

Now what does a statistical system do? It basically tries to mimic some data. But this principle crucially relies on the assumption that what we “ape” is known to be correct, to be exactly what we want. I am convinced that this is, unfortunately, not the case for parsing nowadays, where one often does not know what the result should look like in order to be easily usable, and the reliability of the training data is at least questionable.

What we need are parsers – flexible tools – that can be easily adjusted for a variety of quite different tasks. This is a strong advantage of rule-based parsing: one can easily modify and accommodate the parser to new tasks. A “statistical” answer would be: well, let's annotate some new data. But this is a problem not only from a practical point of view but, as was shown in the discussion of parsing evaluation, mainly from a theoretical one (especially

6. And even for such a task the inter-annotator agreement turns out to be an issue, see [Manning, 2011].


in the case of the development of new annotation schemata).

I consider this inappropriateness of statistical parsing to be a real misfortune, because statistical parsing has, on the other hand, two very important advantages: an implicit probabilistic nature and lexicalization. These two make it win the f-score competitions on the Penn Treebank, and neither of them is straightforward to implement in rule-based parsing.

Nevertheless I believe that – at least for now – our efforts should be concentrated on rule-based methods, trying to minimize their disadvantages (such as grammar development and maintenance) and to incorporate the advantages of statistical parsing, with the primary aim of making actual use of parsing. Only then can we revisit the approach of statistical parsing.

2.4 Parsing Systems

In this section I provide a short list of existing parsing systems and their relation to the presented syntactic formalisms. The list has no ambition to be exhaustive, since numerous such systems exist; I mention only those that have multi-language scope (or that I consider worth mentioning for other reasons) and those available for Czech in particular.

2.4.1 English and multi-language parsers

• MST parser [McDonald et al., 2006] is a graph-based statistical dependency parser. The name MST is an abbreviation of Maximum Spanning Tree, since the parser first builds a dependency graph from which it later generates dependency trees. It has been trained for Czech too (on the Prague Dependency Treebank) – see [McDonald et al., 2005].

• MaltParser [Nivre, 2009] is a transition-based statistical dependency parser. It comes with pre-trained models for English, French and Swedish (but has been tested with a large variety of languages).

• RASP (Robust Accurate Statistical Parser) [Briscoe et al., 2006] is a wide-coverage rule-based parser of English including a tokeniser, tagger and lemmatiser, with phrase structure trees and grammatical relations as output formats.

• Stanford parser [Klein and Manning, 2003] is a statistical parser with both dependency and phrase structure output. Besides English, it has been adapted for Chinese, German, Arabic, Italian, Bulgarian and Portuguese.


• Link Grammar Parser [Grinberg et al., 1995] is a rule-based parser that builds on the Link Grammar and produces the related linked structure. It is provided with grammars and lexicons for English, German, French and Lithuanian. It has been adopted by the open-source community that develops the AbiWord editor7 and is exploited as a grammar checker in the editor.

• Enju [Sagae et al., 2007] is a rule-based HPSG parser primarily developed for English, providing phrase structure and predicate-argument structure as output.

• PET [Callmeier, 2000] is another HPSG parser providing a wide range of grammars including English, German, Chinese, Japanese, Spanish, Portuguese, Korean, Modern Greek, Norwegian and others, most of them originating in a grammar development environment called LKB – Lexical Knowledge Builder.

• XDG [Debusmann, 2006] is a dependency metagrammar formalism accompanied by a “solver” and sample grammars for a number of languages, including a prototype for Czech.

• C&C Parser [Clark and Curran, 2007] is a statistical parser for English exploiting a combinatory categorial grammar automatically extracted from the Penn Treebank.

• Collins’ Parser [Collins, 1997] is a statistical parser developed by Michael Collins. Among other languages, it has also been trained for Czech on the PDT.

• Charniak’s Parser [Charniak, 2000] is a statistical parser developed by Eugene Charniak and also trained for Czech on the PDT.

2.4.2 Czech parsers

statistical dependency parsers

A number of statistical parsing systems have been trained on the Prague Dependency Treebank, either in direct collaboration with people at the Prague Institute of Formal and Applied Linguistics at Charles University, or within the scope of the CoNLL shared tasks in 2006, 2007 and 20098. So far the best parsing system from CoNLL 2009 was the one

7. See http://www.abisource.com/projects/link-grammar/.
8. See http://ufal.mff.cuni.cz/czech-parsing/ for an overview.


described in [Gesmundo et al., 2009], achieving a labeled dependency precision of 80.38 %.

SET parser

The SET parsing system [Kovář et al., 2009] is a notable exception that goes in the opposite direction to the mainstream of syntactic processing – instead of introducing yet more complex syntactic theories and parsing systems, it tries to perform parsing using simple ranking of finite state patterns. It is still being developed in the NLP Centre at Masaryk University and has recently been applied to noun phrase chunking [Grác et al., 2010].

Synt parser

The Synt parsing system [Kadlec, 2007] has been developed over the past decade and also originates in the NLP Centre at Masaryk University. Since this is the one I plan to use primarily during the work on my thesis, I describe the system design in more detail.

The parsing schema is based on a context-free backbone: Synt performs a stochastic agenda-based head-driven chart analysis using the provided grammar for Czech, employs several disambiguating techniques and offers phrase-structure as well as dependency and chunk output.

An overview of the parsing workflow of Synt is given in Figure 2.1. First, the input sentence is morphologically analyzed if the annotation is not provided along with it. As the next step, a basic head-driven CFG analysis is performed. All parsing results are collected in the so-called chart structure which is built up during the analysis. On top of the basic CFG analysis, we build a new graph structure called the forest of values. This structure originates from applying contextual actions (especially morphological agreements and tests) on the resulting chart, and from the implementation point of view it is more convenient for post-processing than the original chart.
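Synt's head-driven agenda-based analysis is considerably more elaborate, but the general idea of a chart collecting all analyses of all spans can be illustrated by a minimal CKY-style recognizer (a sketch only; the grammar, lexicon and data structures are illustrative assumptions, not Synt's internals):

```python
from collections import defaultdict
from itertools import product

def cky_chart(words, lexicon, rules):
    """Minimal CKY-style chart recognizer: chart[(i, j)] holds all
    non-terminals covering words[i:j]. Synt's actual analysis is
    head-driven and keeps back-pointers; this only sketches the idea
    of a chart accumulating every analysis of every span."""
    n = len(words)
    chart = defaultdict(set)
    for i, w in enumerate(words):
        chart[(i, i + 1)] |= lexicon[w]          # pre-terminal edges
    for span in range(2, n + 1):                 # longer spans bottom-up
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):            # split point
                for b, c in product(chart[(i, k)], chart[(k, j)]):
                    for lhs, rhs in rules:
                        if rhs == (b, c):
                            chart[(i, j)].add(lhs)
    return chart

lexicon = {"pes": {"N"}, "štěká": {"V"}}         # "the dog barks"
rules = [("S", ("N", "V"))]
chart = cky_chart(["pes", "štěká"], lexicon, rules)
assert "S" in chart[(0, 2)]                      # sentence accepted
```

In Synt, the contextual actions then operate on such a completed chart, producing the forest of values described above.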

To prevent the maintenance problems which a hand-written grammar is susceptible to, the grammar used in Synt is edited and developed in the form of a small meta-grammar (a detailed description is available in [Kadlec and Horák, 2005]) which at this time contains only 239 rules. From this meta-grammar a full grammar can be automatically derived (having 3,867 expanded rules). In addition, each rule can be associated with several attributes, such as morphological agreement actions, head markers etc.
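The actual metagrammar formalism is described in [Kadlec and Horák, 2005]; purely to illustrate how a compact template can expand into many plain CFG rules, consider the following sketch (the template syntax with a `?c` agreement variable is an illustrative invention, not Synt's notation):

```python
CASES = ["nom", "gen", "dat", "acc", "voc", "loc", "ins"]  # 7 Czech cases

def expand(meta_rule):
    """Expand one metagrammar template into plain CFG rules by
    instantiating the agreement variable ?c with every case value.
    The formalism here is illustrative, not Synt's actual syntax."""
    lhs, rhs = meta_rule
    expanded = []
    for case in CASES:
        subst = lambda s: s.replace("?c", case)
        expanded.append((subst(lhs), tuple(subst(x) for x in rhs)))
    return expanded

# one template:  np(?c) -> adj(?c) noun(?c)   (adjective-noun agreement)
rules = expand(("np(?c)", ("adj(?c)", "noun(?c)")))
assert len(rules) == 7
assert ("np(acc)", ("adj(acc)", "noun(acc)")) in rules
```

One template thus stands for seven case-specific rules; with more agreement dimensions (number, gender), the same mechanism explains how 239 meta-rules can yield thousands of expanded rules.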



Figure 2.1: Parsing workflow of the Synt parser


Chapter 3

Aims of the thesis

In the previous chapter I summarized the state of the art in syntax and parsing, and my views on where further development in this field should be directed. I argued why statistical parsing, despite having many important practical advantages, is not well suited to improve current parsing systems in terms of their usability for practical applications.

Turning to rule-based methods, the menu of parsing systems for morphologically rich languages, such as Czech, is not large – rather the opposite. The Link Grammar Parser, in my opinion one of the most promising practical rule-based solutions, has only very limited support for incorporating morphology into its rules (it suggests preprocessing the input with a morphological analyzer and using separate morphemes as grammar non-terminals, which is not a viable solution at all).

Therefore my goal is to extend the Synt parsing system in such a way that it will be suitable for use with a number of morphologically rich languages, with performance competitive with the state-of-the-art statistical and rule-based parsing systems that are designed primarily for analytical languages such as English. To achieve this, the following key principles will be used to enhance the performance and reduce the ambiguity of the parsing system.

3.1 Strong Lexicalization

As I already mentioned, implicit lexicalization is one of the main advantages of a statistical parsing system, which can therefore easily capture various kinds of irregularities that occur on the syntactic level. Successful lexicalization models have been developed for rule-based parsers of analytic languages (see e. g. [Carroll and Rooth, 1996]). Hence I intend to develop a similar scheme appropriate for morphologically rich languages within the Synt parsing system.


3.2 Grammar Stratification Exploiting Competing PCFG Rules

During the development of the Czech metagrammar of Synt, it turned out that besides having probabilities assigned to the grammar rules, it would be handy to specify that some of them are mutually exclusive, in the sense that one of them has absolute precedence over the other. This becomes more and more important as one wants to achieve wide coverage of the parser not only on written newspaper texts, but also on different kinds of written text (such as web blogs) and possibly also conversational texts (i. e. speech records). My plan is to continue developing the present concept of a probabilistic CFG with rules stratified into different levels [Jakubíček, 2011], and to explore its usability for other morphologically rich languages.
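A minimal sketch of the intended behaviour (the `(level, probability)` encoding of competing rules is an illustrative assumption, not the actual Synt data format): among rules competing for the same span, only those on the most preferred stratum survive, and probabilities rank only the survivors.

```python
def select_rules(candidates):
    """Among competing rules, keep only those on the most preferred
    stratum (lowest level number) and order them by probability.
    Each candidate is an illustrative (level, probability, name)
    triple; a lower level has absolute precedence over a higher one."""
    best = min(level for level, _, _ in candidates)
    survivors = [r for r in candidates if r[0] == best]
    return sorted(survivors, key=lambda r: -r[1])

# three competing analyses of the same span: level 0 beats level 1
# regardless of probability, then probability ranks within level 0
candidates = [
    (1, 0.9, "dispreferred rule"),
    (0, 0.3, "preferred rule A"),
    (0, 0.7, "preferred rule B"),
]
ranked = select_rules(candidates)
assert [r[2] for r in ranked] == ["preferred rule B", "preferred rule A"]
```

The design point is that stratification expresses hard linguistic preferences that probabilities alone cannot: a highly probable rule on a lower-priority stratum can never outrank a licensed rule on a higher-priority one.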

3.3 Metagrammar Development with Focus on Rich Morphology

The metagrammar formalism used in Synt has proven to be a very useful concept that significantly speeds up grammar development and makes its maintenance substantially easier. However, it currently focuses only on Czech, and the contextual actions that may be assigned to each grammar rule are hard-coded in the parsing system. In order to enable easy porting of the parser to other languages, it is necessary to extend the metagrammar formalism to the morphological level so that, when extending the parser to a new language, a unified description of a morphological attributive system will be directly usable in the metagrammar.

3.4 Evaluation on Particular Applications

As was demonstrated in the discussion of inter-annotator agreement, it is crucial for current parsing systems to be evaluated on particular applications. Therefore I propose the following evaluation methodology to be used during further parsing development:

• evaluation of contribution to a morphological tagging system

As I discuss in the next chapter, it has been shown that Synt can significantly reduce the space of ambiguous morphological annotations produced by a morphological analyzer. Our current experiments show that this leads to increased accuracy of a statistical tagger of Czech (connected to the parser in a simple pipeline), especially in handling long-distance dependencies that are hard for a statistical system to capture


in the case of languages with rich morphology, due to data sparseness. Hence, the added value of a parser in a tagging pipeline can serve as one of the evaluation measures.

• evaluation of contribution to the accuracy of word sketches

The concept of word sketches (see [Kilgarriff and Tugwell, 2001]), “one-page summaries of a word's grammatical and collocational behaviour”, has since its introduction been successfully used for many lexicographical purposes in numerous languages, including Czech (see [Rychlý and Smrž, 2004]), and is one of the core components of the Sketch Engine system [Kilgarriff et al., 2004]. Currently, word sketches are computed using a statistical model over regular expressions in the corpus query language (CQL, see [Jakubíček et al., 2010]), a robust but, from the theoretical point of view, certainly suboptimal description mechanism. The contribution of parsing to word sketches has been shown in an experimental setup (see [Horák et al., 2009]) and I plan to use it as an evaluation metric for Synt as well.

• evaluation of contribution to a text correction system

For many languages, including Czech, text correction systems that go beyond the lexical level (i. e. spell checking) could profit from some syntactic preprocessing. In [Jakubíček and Horák, 2010] we showed how Synt, accompanied by a few simple post-processing steps, can be used for the correction of Czech punctuation. Even though there are vast differences in the codification of written language, I believe that if similar phenomena can be found for other morphologically rich languages too, a text correction system has a valid place in the test suite.

• evaluation against a treebank of modest size but high IAA

Despite the disadvantages of evaluation against treebanks, I am convinced that it is useful to perform this as well, simply because such an evaluation may provide further evidence on how much it correlates with application-driven measurements. For this purpose I am going to lead a small team of annotators that will produce a rather small syntactically annotated corpus – the upper-bound size of the corpus will be 5,000 sentences, but crucially only sentences on which there is absolute (i. e. 100 %) agreement among all the annotators will be included. As a side effect of this process, we will be able to provide more insight into where annotators disagree


on Czech syntactic phenomena and compare this to the work done by [Sampson and Babarczy, 2008] for English.

3.5 Thesis Outcomes

To sum up, the goal of the thesis is to develop a rule-based parsing system whose design will make it possible to extend it in a straightforward way to other morphologically rich languages. The proposed evaluation measurements are largely transferable to other languages. I assume the availability of a tagger, and rely on the fact that word sketch grammars are now available for a number of inflective languages (Czech, Slovak, Polish and Russian). The text correction system is just a simple post-processing of the parser output; the availability of treebanks for other languages is questionable but not crucial for the evaluation.

First experiments will be performed with Slovak, as a proof of concept, since it is a language that is very similar to Czech on the syntactic level, but requires different handling on the morphological level. Following that, one more inflective language (probably Polish or Russian) will be chosen to extend the parsing system (in cooperation with native-speaker linguists who will be in charge of the grammar writing).

The practical outcome of the thesis will consist of a lexicalized wide-coverage rule-based parser suitable for morphologically rich languages, Czech in the first place. In the end, the following three components of the system will be language-dependent and will need to be adapted when porting the parser to a new language:

• stratified metagrammar

• description of morphological attributes

• description of lexical (tag and word) mapping to grammar pre-terminals (i. e. non-terminals immediately producing surface words)

Besides practical results in the form of parser development and extensions, several theoretical aspects will be investigated simultaneously. These involve further investigation of the inter-annotator agreement on syntactic structures and how evaluation on a treebank correlates with the contribution to improving particular applications. Last but not least, I would like to focus on the question of what is an appropriate representation of parsing results, and I believe that exploiting the parser in the proposed applications might give more insight into this topic.


3.6 Schedule

spring 2012   improving the concept of the stratified metagrammar;
              evaluation of the contribution to a Czech statistical tagger

autumn 2012   lexicalization of the parser;
              evaluation of the contribution to the word sketches;
              porting to and evaluation of Slovak;
              coordinating the work on a high-IAA treebank

spring 2013   evaluation of the contribution to the word sketches;
              coordinating the work on a high-IAA treebank;
              porting to and evaluation of another inflective language

autumn 2013   evaluation of the contribution to a text correction system;
              evaluation on the high-IAA treebank

spring 2014   completing the thesis and its submission


Chapter 4

Achieved results

The aims described in this thesis proposal follow on from previous work that I have accomplished either on my own or in collaboration with my colleagues, mainly Vojtěch Kovář and Aleš Horák. In this chapter I sum up the results achieved so far; Chapter 5 provides a complete list of publications. Those related to this topic are referred to in the following by their respective number in that list, and two publications are also attached to this thesis proposal.

Our research interests in parsing have gone in several directions, each of which I summarize in a separate section below.

4.1 Improving Parsing Accuracy of Existing Systems

In 2009 we started the work on exploiting the verb valency information available in the VerbaLex lexicon of Czech verb valencies in the Synt parser. We showed that the parser and the lexicon can coexist in a mutually beneficial symbiosis – first, we used the parser to measure the lower bound of the valency lexicon's coverage on common Czech texts. In [8] we estimated the coverage of the lexicon to be at least 83.6 %, which was further motivation to employ its information in the process of parsing to reduce standard parsing ambiguities such as PP-attachment.

An implementation in the form of an extension of the Synt parser followed in 2010 as the topic of my master's thesis [14]. It confirmed that verb valency frame information can be used for the stated goals, but is not easy to incorporate into a probabilistic parser, owing to massively occurring language phenomena such as ellipsis that decrease the number of valency arguments present. The work on incorporating the valency information continued and is currently submitted as a conference paper (under review).

In 2011 I started elaborating on the rule-level mechanism present at that time in the metagrammar of the Synt parser and extended this formalism, previously used only at the level of the basic PCFG analysis, to the whole analysis including morphosyntactic contextual actions, which resulted in


                                  unambiguous            ambiguous
value                           before      after      before        after

# of sentences                       3,779                   1,820
# of not accepted sent.            114         73          88           43
median trees count                  24          6       6,144          480
average trees count           12,601.5      144.6   6,768,409.0  127,525.9
LAA Best                        0.9128     0.9131      0.8643       0.8776
position of LAA Best                19          8          43           36
LAA First                       0.8784     0.8833      0.8141       0.8263

Table 4.1: Evaluation of the forest-based non-local pruning, separated for sentences with unambiguous and ambiguous morphological annotation. LAA stands for the leaf-ancestor assessment tree similarity metric (LAA Best is the best score among the first 100 trees, the next row shows the position of this tree, and the last row is the LAA of the first tree).

a significant reduction of the ambiguity of Synt's results without harming either the parser's precision or its coverage.

The contribution of this approach has been evaluated on a phrasal treebank (see [7]) of about 5,500 Czech sentences and the results are presented in Table 4.1. A detailed description of this method together with the results has been published in [1].

Finally, in 2011 we conducted a series of changes to the Czech morphological tagset that made it easier to incorporate the morphological annotation into the process of parsing; these are described in [12].

4.2 Exploiting Parsers in Particular Applications

In 2010 we employed the Synt parser in a system for the automatic correction of Czech punctuation. Czech has very complex codified rules for using punctuation in formal texts, and our experience with students at the Faculty of Informatics showed that punctuation errors are among the most common – in the error corpus that is being continuously developed from the texts they have to submit, these errors account for about 21 % and thereby represent the second most frequent category.

For this task we used the parser as a predictor/generator – our motivation was as follows: if the parser is capable of parsing (detecting) punctuation in the right places, what about letting it predict the position of punctuation? We augmented the metagrammar in Synt so that it allowed empty


productions for the rules capturing punctuation and enhanced the handling of coordinations (where most of the punctuation occurs). This turned out to be a promising solution – after accompanying the parser's output with several simple post-processing steps, we were able to predict punctuation placement with both precision and recall over 80 %. A detailed description of this method was published in [3].
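The prediction idea can be illustrated with a toy sketch (the `parses_ok` oracle stands in for the full parser, and the example grammar constraint is invented; the real method uses empty productions inside Synt's metagrammar rather than brute-force re-parsing):

```python
def predict_commas(tokens, parses_ok):
    """Toy model of parser-driven punctuation prediction: propose a
    comma at every inter-token position and keep those positions at
    which the (simulated) grammar licenses the resulting sentence."""
    positions = []
    for i in range(1, len(tokens)):
        candidate = tokens[:i] + [","] + tokens[i:]
        if parses_ok(candidate):
            positions.append(i)
    return positions

# pretend grammar: a comma is licensed only before the conjunction
# "ale" ("but"), mimicking a Czech punctuation rule
parses_ok = lambda toks: "," in toks and toks[toks.index(",") + 1] == "ale"
tokens = ["přišel", "ale", "nezůstal"]     # "he came but did not stay"
assert predict_commas(tokens, parses_ok) == [1]
```

Encoding the comma as an empty production, as done in Synt, achieves the same effect in a single parse instead of one parse per candidate position.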

In 2008, during the work on my bachelor's thesis, I extended the output formats of Synt with phrasal chunks, which can be used for shallow syntactic annotation in text corpora; this work was later published in [2]. Alongside these changes I implemented backward propagation of the results of various morphosyntactic actions (morphological agreements and tests) bound to the metagrammar rules used in Synt, and observed that it can significantly reduce the morphological ambiguity when parsing sentences with morphologically ambiguous annotation.
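The effect of such backward propagation can be illustrated by a toy pruning function (the set-based data format and the Brno-style tag strings are illustrative assumptions, not the actual implementation):

```python
def prune_tags(ambiguous, parses):
    """Keep for each token only the tags used by at least one
    successful parse - a toy model of propagating agreement results
    back to the morphological level; the data format is illustrative."""
    pruned = []
    for i, tags in enumerate(ambiguous):
        licensed = {parse[i] for parse in parses}
        pruned.append(tags & licensed or tags)   # never empty a token
    return pruned

# token 0 is ambiguous between nominative (c1) and accusative (c4);
# all surviving parses use only the nominative reading
ambiguous = [{"k1gFnSc1", "k1gFnSc4"}, {"k5eAaI"}]
parses = [("k1gFnSc1", "k5eAaI"), ("k1gFnSc1", "k5eAaI")]
assert prune_tags(ambiguous, parses) == [{"k1gFnSc1"}, {"k5eAaI"}]
```

Intersecting the analyzer's tag sets with the readings licensed by the surviving parses is exactly the kind of reduction that later helps the downstream tagger, as described in the next paragraph.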

In 2011 I started re-evaluating the contribution of such morphological pruning by Synt in connection with a Czech tagger [Šmerk, 2004]. Preliminary results (to be published) show that, depending on the morphological category (part of speech/kind, case, gender etc.), it can remove up to 10 % of the tagger errors.

During the years 2009–2011 I collaborated on the incorporation of the Synt system into a framework for the logical analysis of natural language in Transparent Intensional Logic. The work in progress on this topic has been published in [10] and [11].

4.3 Evaluation of Parsing Results

In 2009, following an evaluation of the SET system on the Prague Dependency Treebank (PDT), we conducted a preliminary analysis of annotation errors in that treebank, which was published in [9].

In 2010 we proposed NP chunking as one of the evaluation methods for the comparison of parsing systems. A simple annotation schema of short (bare) noun, prepositional and verb phrases was handcrafted and the work was published in [5].

4.4 Alternative Parsing Systems

In 2009 and 2010 I collaborated with Vojtěch Kovář on the development of a new simple parsing tool based on the ranking of finite state patterns, the SET parser. The system was described in detail in [4].


Chapter 5

Author’s publications

1. Jakubíček, Miloš. Effective Parsing Using Competing CFG Rules. In Habernal, Matoušek. Proceedings of Text, Speech and Dialogue 2011. Berlin, Heidelberg : Springer Verlag, 2011. p. 115–122. ISBN 978-3-642-23537-5.
Contribution: 100 %.

2. Jakubíček, Miloš - Horák, Aleš - Kovář, Vojtěch. Mining Phrases from Syntactic Analysis. In Text, Speech, Dialogue 2009. Berlin, Heidelberg : Springer Verlag, 2009. p. 124–130. ISBN 978-3-642-04207-2.
Contribution: 80 %, design and implementation of the system.

3. Jakubíček, Miloš - Horák, Aleš. Punctuation Detection with Full Syntactic Parsing. Research in Computing Science, Special issue: Natural Language Processing and its Applications, Mexico : Instituto Politécnico Nacional, 46, March 2010, p. 335–343. ISSN 1870-4069. 2010.
Contribution: 70 %, implementation and evaluation of the system.

4. Kovář, Vojtěch - Horák, Aleš - Jakubíček, Miloš. Syntactic Analysis Using Finite Patterns: A New Parsing System for Czech. In Human Language Technology. Challenges for Computer Science and Linguistics. Berlin/Heidelberg : Springer, 2011. p. 161–171. ISBN 978-3-642-20094-6.
Contribution: 20 %, co-implementation of the system with Vojtěch Kovář.

5. Jakubíček, Miloš - Kovář, Vojtěch - Grác, Marek. Through Low-Cost Annotation to Reliable Parsing Evaluation. In PACLIC 24: Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation. Tokyo : Waseda University, 2010. p. 555–562. ISBN 978-4-905166-00-9.
Contribution: 25 %, co-design and co-evaluation of the approach with the other authors.


6. Jakubíček, Miloš - Rychlý, Pavel - Kilgarriff, Adam - McCarthy, Diana. Fast Syntactic Searching in Very Large Corpora for Many Languages. In PACLIC 24: Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation. Tokyo : Waseda University, 2010. p. 741–747. ISBN 978-4-905166-00-9.
Contribution: 75 %, implementation and evaluation of the system.

7. Kovář, Vojtěch - Jakubíček, Miloš. Test Suite for the Czech Parser Synt. In Proceedings of Recent Advances in Slavonic Natural Language Processing 2008. Brno : Masaryk University, 2008. p. 63–70. ISBN 978-80-210-4741-9.
Contribution: 50 %, co-implementation and co-evaluation with Vojtěch Kovář.

8. Jakubíček, Miloš - Kovář, Vojtěch - Horák, Aleš. Measuring Coverage of a Valency Lexicon using Full Syntactic Analysis. In RASLAN 2009 : Recent Advances in Slavonic Natural Language Processing. Brno : Masaryk University, 2009. p. 75–79. ISBN 978-80-210-5048-8.
Contribution: 80 %, design, implementation and evaluation of the procedure.

9. Kovář, Vojtěch - Jakubíček, Miloš. Prague Dependency Treebank Annotation Errors: A Preliminary Analysis. In RASLAN 2009 : Recent Advances in Slavonic Natural Language Processing. Brno : Masaryk University, 2009. p. 101–108. ISBN 978-80-210-5048-8.
Contribution: 50 %, co-evaluation with Vojtěch Kovář.

10. Kovář, Vojtěch - Horák, Aleš - Jakubíček, Miloš. How to Analyze Natural Language with Transparent Intensional Logic? In Proceedings of Recent Advances in Slavonic Natural Language Processing 2010. Brno : Masaryk University, 2010. p. 69–76. ISBN 978-80-7399-246-0.
Contribution: 20 %, co-design with the other authors.

11. Horák, Aleš - Jakubíček, Miloš - Kovář, Vojtěch. Analyzing Time-Related Clauses in Transparent Intensional Logic. In Horák, Rychlý. Proceedings of Recent Advances in Slavonic Natural Language Processing 2011. Brno : Tribun EU, 2011. p. 3–9. ISBN 978-80-263-0077-9.
Contribution: 20 %, co-design with the other authors.

12. Jakubíček, Miloš - Kovář, Vojtěch - Šmerk, Pavel. Czech Morphological Tagset Revisited. In Horák, Rychlý. Proceedings of Recent Advances in Slavonic Natural Language Processing 2011. Brno : Tribun EU, 2011. p.


29–42. ISBN 978-80-263-0077-9.
Contribution: 33 %, co-design with the other authors.

13. Bušta, Jan - Hlaváčková, Dana - Jakubíček, Miloš - Pala, Karel. Classification of Errors in Text. In RASLAN 2009 : Recent Advances in Slavonic Natural Language Processing. Brno : Masaryk University, 2009. p. 109–119. ISBN 978-80-210-5048-8.
Contribution: 25 %, co-design of the annotation scheme.

14. Jakubíček, Miloš. Enhancing Czech Parsing with Complex Valency Frames. Master's thesis. Brno : Masaryk University, 2010.
Contribution: 100 %.

15. Jakubíček, Miloš - Rychlý, Pavel - Kilgarriff, Adam - McCarthy, Diana. Fast Syntactic Searching in Very Large Corpora for Many Languages. In PACLIC 24: Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation. Tokyo : Waseda University, 2010. p. 741–747. ISBN 978-4-905166-00-9.
Contribution: 75 %, co-implementation with Pavel Rychlý.

16. Jakubíček, Miloš - Kovář, Vojtěch. CzechParl: Corpus of Stenographic Protocols from the Czech Parliament. In Proceedings of Recent Advances in Slavonic Natural Language Processing 2010. Brno : Masaryk University, 2010. p. 41–46. ISBN 978-80-7399-246-0.
Contribution: 75 %, development of the corpus.

17. Jakubíček, Miloš - Bušta, Jan - Hlaváčková, Dana - Pala, Karel. Classification of Errors in Text. In RASLAN 2009 : Recent Advances in Slavonic Natural Language Processing. Brno : Masaryk University, 2009. p. 109–119. ISBN 978-80-210-5048-8.
Contribution: 25 %, co-design and evaluation with the other authors.

18. Kovář, Vojtěch - Horák, Aleš - Jakubíček, Miloš. Power Networks Dialogs - Enhancing Domain-Specific Text Processing Techniques and Resources. In Proceedings of ELNET 2008. Ostrava : Faculty of Electrical Engineering and Computer Science, VŠB - Technical University of Ostrava, 2008. p. 72–80. ISBN 978-80-248-1875-7.
Contribution: 25 %, evaluation of the system.

19. Bušta, Jan - Jakubíček, Miloš. Building of Corpus Based E-learning Materials for Czech. In SCO 2009 : Sharable Content Objects. Brno : Masaryk University, 2009. p. 144–149. ISBN 978-80-210-4878-2.
Contribution: 50 %, co-design and co-implementation with Jan Bušta.

20. Kovář, Vojtěch - Jakubíček, Miloš - Bušta, Jan. Czech Vulgarisms in Text Corpora. In After Half a Century of Slavonic Natural Language Processing. Brno : Tribun EU s.r.o., 2009. p. 141–145. ISBN 978-80-7399-815-8.
Contribution: 25 %, co-evaluation with other authors.

Bibliography

[Abney, 1996] Abney, S. (1996). Partial Parsing via Finite-State Cascades. Natural Language Engineering, 2(4):337–344.

[Abney et al., 1991] Abney, S., Flickenger, S., Gdaniec, C., Grishman, C., Harrison, P., Hindle, D., Ingria, R., Jelinek, F., Klavans, J., Liberman, M., et al. (1991). Procedure for Quantitatively Comparing the Syntactic Coverage of English Grammars. In Proceedings of the Workshop on Speech and Natural Language, pages 306–311. Association for Computational Linguistics.

[Artstein and Poesio, 2008] Artstein, R. and Poesio, M. (2008). Inter-Coder Agreement for Computational Linguistics. Computational Linguistics, 34(4):555–596.

[Belz, 2002] Belz, A. (2002). Learning Grammars for Different Parsing Tasks by Partition Search. In Proceedings of the 19th International Conference on Computational Linguistics, Volume 1, pages 1–7. Association for Computational Linguistics.

[Belz, 2009] Belz, A. (2009). That's Nice... What Can You Do With It? Computational Linguistics, 35(1):111–118.

[Brants and Hansen, 2002] Brants, S. and Hansen, S. (2002). Developments in the TIGER Annotation Scheme and Their Realization in the Corpus. In Proceedings of the Third Conference on Language Resources and Evaluation (LREC 2002), pages 1643–1649.

[Brants, 2000] Brants, T. (2000). Inter-Annotator Agreement for a German Newspaper Corpus. In Second International Conference on Language Resources and Evaluation (LREC-2000), pages 69–76.

[Bresnan, 2001] Bresnan, J. (2001). Lexical-Functional Syntax. Wiley-Blackwell.

[Brill, 1993] Brill, E. (1993). Automatic Grammar Induction and Parsing Free Text: A Transformation-Based Approach. In Proceedings of the Workshop on Human Language Technology, pages 237–242. Association for Computational Linguistics.

[Briscoe et al., 2006] Briscoe, T., Carroll, J., and Watson, R. (2006). The Second Release of the RASP System. In Proceedings of the COLING/ACL on Interactive Presentation Sessions, pages 77–80. Association for Computational Linguistics.

[Callmeier, 2000] Callmeier, U. (2000). PET – a Platform for Experimentation with Efficient HPSG Processing Techniques. Natural Language Engineering, 6(1):99–107.

[Carletta, 1996] Carletta, J. (1996). Assessing Agreement on Classification Tasks: the Kappa Statistic. Computational Linguistics, 22(2):249–254.

[Carroll and Rooth, 1996] Carroll, G. and Rooth, M. (1996). Valence Induction with a Head-Lexicalized PCFG. In Proceedings of the 3rd Conference on Empirical Methods in Natural Language Processing (EMNLP 3).

[Charniak, 2000] Charniak, E. (2000). A Maximum-Entropy-Inspired Parser. In Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, pages 132–139. Morgan Kaufmann Publishers Inc.

[Chomsky, 1957] Chomsky, N. (1957). Syntactic Structures. Mouton.

[Chomsky, 1965] Chomsky, N. (1965). Aspects of the Theory of Syntax, volume 119. The MIT Press.

[Chomsky, 1995] Chomsky, N. (1995). The Minimalist Program. Current Studies in Linguistics series. The MIT Press.

[Clark and Curran, 2007] Clark, S. and Curran, J. (2007). Wide-Coverage Efficient Statistical Parsing with CCG and Log-Linear Models. Computational Linguistics, 33(4):493–552.

[Cohen et al., 1960] Cohen, J. et al. (1960). A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement, 20(1):37–46.

[Collins, 1997] Collins, M. (1997). Three Generative, Lexicalised Models for Statistical Parsing. In Proceedings of the Eighth Conference on European Chapter of the Association for Computational Linguistics, pages 16–23. Association for Computational Linguistics.

[Debusmann, 2006] Debusmann, R. (2006). Extensible Dependency Grammar: A Modular Grammar Formalism Based On Multigraph Description. PhD thesis, Saarland University.

[Debusmann et al., 2003] Debusmann, R., Duchier, D., et al. (2003). A Meta-Grammatical Framework for Dependency Grammar. In ACL’03.

[Fleiss, 1981] Fleiss, J. (1981). The Measurement of Interrater Agreement.Statistical Methods for Rates and Proportions, 2:212–236.

[Gazdar, 1985] Gazdar, G. (1985). Generalized Phrase Structure Grammar.Harvard University Press.

[Gesmundo et al., 2009] Gesmundo, A., Henderson, J., Merlo, P., and Titov, I. (2009). A Latent Variable Model of Synchronous Syntactic-Semantic Parsing for Multiple Languages. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning: Shared Task, pages 37–42. Association for Computational Linguistics.

[Grác et al., 2010] Grác, M., Jakubíček, M., and Kovář, V. (2010). Through Low-Cost Annotation to Reliable Parsing Evaluation. In PACLIC 24 : Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation, pages 555–562.

[Grinberg et al., 1995] Grinberg, D., Lafferty, J., and Sleator, D. (1995). A Robust Parsing Algorithm for Link Grammars. In Proceedings of the Fourth International Workshop on Parsing Technologies.

[Hajič, 1998] Hajič, J. (1998). Building a Syntactically Annotated Corpus: The Prague Dependency Treebank. In Issues of Valency and Meaning, pages 106–132, Prague. Karolinum.

[Hajič et al., 1998] Hajič, J., Brill, E., Collins, M., Hladká, B., Jones, D., Kuo, C., Ramshaw, L., Schwartz, O., Tillmann, C., and Zeman, D. (1998). Core Natural Language Processing Technology Applicable to Multiple Languages. The Workshop 98 Final Report. Technical report, Center for Language and Speech Processing, Johns Hopkins University, Baltimore, Maryland. [online] http://www.clsp.jhu.edu/ws98/projects/nlp/report/.

[Hajičová et al., 2004] Hajičová, E., Havelka, J., Sgall, P., Veselá, K., and Zeman, D. (2004). Issues of Projectivity in the Prague Dependency Treebank. Prague Bulletin of Mathematical Linguistics, 81:5–22.

[Hall and Novák, 2005] Hall, K. and Novák, V. (2005). Corrective Modeling for Non-Projective Dependency Parsing. In Proceedings of the Ninth International Workshop on Parsing Technology, pages 42–52. Association for Computational Linguistics.

[Havelka, 2007] Havelka, J. (2007). Beyond Projectivity: Multilingual Evaluation of Constraints and Measures on Non-Projective Structures. In Annual Meeting of the Association for Computational Linguistics, volume 45, page 608.

[Horák et al., 2009] Horák, A., Rychlý, P., and Kilgarriff, A. (2009). Czech Word Sketch Relations with Full Syntax Parser. After Half a Century of Slavonic Natural Language Processing, pages 101–112.

[Hudson, 1984] Hudson, R. (1984). Word Grammar. Blackwell Oxford.

[Jakubíček et al., 2010] Jakubíček, M., Kilgarriff, A., McCarthy, D., and Rychlý, P. (2010). Fast Syntactic Searching in Very Large Corpora for Many Languages. In PACLIC, volume 24, pages 741–747.

[Jakubíček, 2011] Jakubíček, M. (2011). Effective Parsing Using Competing CFG Rules. In Proceedings of Text, Speech and Dialogue 2011, pages 115–122, Berlin, Heidelberg.

[Jakubíček and Horák, 2010] Jakubíček, M. and Horák, A. (2010). Punctuation Detection with Full Syntactic Parsing. Special Issue: Natural Language Processing and its Applications, page 335.

[Johnson and Lappin, 1997] Johnson, D. and Lappin, S. (1997). A Critique of the Minimalist Program. Linguistics and Philosophy, 20(3):273–333.

[Joshi and Schabes, 1997] Joshi, A. and Schabes, Y. (1997). Tree-Adjoining Grammars. Handbook of Formal Languages, vol. 3: Beyond Words. Springer-Verlag New York, Inc., New York, NY.

[Kadlec, 2007] Kadlec, V. (2007). Syntactic Analysis of Natural Languages Based on Context-Free Grammar Backbone. PhD thesis, Faculty of Informatics, Masaryk University, Brno.

[Kadlec and Horák, 2005] Kadlec, V. and Horák, A. (2005). New Meta-grammar Constructs in Czech Language Parser Synt. In Lecture Notes in Computer Science. Springer Berlin / Heidelberg.

[Katz-Brown et al., 2011] Katz-Brown, J., Petrov, S., McDonald, R., Och, F., Talbot, D., Ichikawa, H., Seno, M., and Kazawa, H. (2011). Training a Parser for Machine Translation Reordering. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 183–192, Edinburgh, Scotland, UK. Association for Computational Linguistics.

[Kay, 1986] Kay, M. (1986). Algorithm Schemata and Data Structures in Syntactic Processing. Morgan Kaufmann Publishers Inc.

[Kilgarriff et al., 2004] Kilgarriff, A., Rychlý, P., Smrž, P., and Tugwell, D.(2004). The Sketch Engine. Information Technology, 105:116.

[Kilgarriff and Tugwell, 2001] Kilgarriff, A. and Tugwell, D. (2001). Word Sketch: Extraction and Display of Significant Collocations for Lexicography. In Proceedings of the ACL Workshop on Collocation: Computational Extraction, Analysis and Exploitation.

[Klein and Manning, 2003] Klein, D. and Manning, C. (2003). Accurate Unlexicalized Parsing. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, Volume 1, pages 423–430. Association for Computational Linguistics.

[Kovář et al., 2009] Kovář, V., Horák, A., and Jakubíček, M. (2009). Syntactic Analysis as Pattern Matching: The SET Parsing System. In Proceedings of the 4th Language & Technology Conference, pages 100–104, Poznań, Poland.

[Manning, 2011] Manning, C. (2011). Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics? Computational Linguistics and Intelligent Text Processing, pages 171–189.

[Marcus et al., 1993] Marcus, M., Marcinkiewicz, M., and Santorini, B. (1993). Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

[McCarthy and Navigli, 2007] McCarthy, D. and Navigli, R. (2007). SemEval-2007 Task 10: English Lexical Substitution Task. In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval-2007), pages 48–53.

[McDonald et al., 2006] McDonald, R., Lerman, K., and Pereira, F. (2006). Multilingual Dependency Analysis with a Two-Stage Discriminative Parser. In Proceedings of the Tenth Conference on Computational Natural Language Learning, pages 216–220. Association for Computational Linguistics.

[McDonald et al., 2005] McDonald, R., Pereira, F., Ribarov, K., and Hajič, J. (2005). Non-Projective Dependency Parsing Using Spanning Tree Algorithms. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, HLT '05, pages 523–530, Stroudsburg, PA, USA. Association for Computational Linguistics.

[Mel’čuk, 1988] Mel’čuk, I. (1988). Dependency Syntax: Theory and Practice. State University of New York Press.

[Miyao et al., 2009] Miyao, Y., Sagae, K., Saetre, R., Matsuzaki, T., and Tsujii, J. (2009). Evaluating Contributions of Natural Language Parsers to Protein-Protein Interaction Extraction. Bioinformatics, 25(3):394–400.

[Musillo and Merlo, 2006] Musillo, G. and Merlo, P. (2006). Accurate Parsing of the Proposition Bank. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers, pages 101–104. Association for Computational Linguistics.

[Nivre, 2009] Nivre, J. (2009). Non-Projective Dependency Parsing in Expected Linear Time. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1, pages 351–359. Association for Computational Linguistics.

[Pecina, 2011] Pecina, P. (2011). Review: Syntax-Based Collocation Extraction, Violeta Seretan (University of Geneva). Berlin: Springer (Text, Speech and Language Technology series, volume 44), 2011; ISBN 978-94-007-0133-5. Computational Linguistics, (Early Access):1–3.

[Pollard and Sag, 1994] Pollard, C. and Sag, I. (1994). Head-Driven Phrase Structure Grammar. University of Chicago Press.

[Popper, 1959] Popper, K. (1959). The Logic of Scientific Discovery.

[Pullum and Gazdar, 1982] Pullum, G. and Gazdar, G. (1982). Natural Languages and Context-Free Languages. Linguistics and Philosophy, 4(4):471–504.

[Rychlý and Smrž, 2004] Rychlý, P. and Smrž, P. (2004). Manatee, Bonito and Word Sketches for Czech. In Second International Conference on Corpus Linguistics.

[Sagae et al., 2007] Sagae, K., Miyao, Y., and Tsujii, J. (2007). HPSG Parsing with Shallow Dependency Constraints. In Annual Meeting of the Association for Computational Linguistics, volume 45, page 624.

[Sampson, 2000] Sampson, G. (2000). A Proposal for Improving the Measurement of Parse Accuracy. International Journal of Corpus Linguistics, 5(1):53–68.

[Sampson, 2001] Sampson, G. (2001). Empirical Linguistics. Open Linguistics Series. Continuum.

[Sampson, 2005] Sampson, G. (2005). The ’Language Instinct’ Debate. Open Linguistics Series. Continuum.

[Sampson and Babarczy, 2003] Sampson, G. and Babarczy, A. (2003). A Test of the Leaf-Ancestor Metric for Parse Accuracy. Natural Language Engineering, 9(4):365–380.

[Sampson and Babarczy, 2008] Sampson, G. and Babarczy, A. (2008). Definitional and Human Constraints on Structural Annotation of English. Natural Language Engineering, 14(4):471–494.

[Satta, 1994] Satta, G. (1994). Tree-Adjoining Grammar Parsing and Boolean Matrix Multiplication. Computational Linguistics, 20:173–191.

[Sgall et al., 1986] Sgall, P., Hajičová, E., Panevová, J., and Mey, J. (1986). The Meaning of the Sentence in Its Semantic and Pragmatic Aspects. D. Reidel.

[Sha and Pereira, 2003] Sha, F. and Pereira, F. (2003). Shallow Parsing with Conditional Random Fields. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 134–141. Association for Computational Linguistics.

[Shieber, 1985] Shieber, S. (1985). Evidence Against the Context-Freeness ofNatural Language. Linguistics and Philosophy, 8(3):333–343.

[Sleator and Temperley, 1993] Sleator, D. and Temperley, D. (1993). Parsing English with a Link Grammar. In Third International Workshop on Parsing Technologies.

[Šmerk, 2004] Šmerk, P. (2004). Unsupervised Learning of Rules for Morphological Disambiguation. In Lecture Notes in Artificial Intelligence 3206, Proceedings of Text, Speech and Dialogue 2004, pages 211–216, Berlin. Springer-Verlag.

[Šmilauer, 1969] Šmilauer, V. (1969). Novočeská skladba: Vysokoškolská příručka. Edice Vysokoškolské příručky. SPN, Český Těšín.

[Steedman, 2000] Steedman, M. (2000). The Syntactic Process. MIT Press.

[Tesnière, 1959] Tesnière, L. (1959). Éléments de syntaxe structurale. Préf. de Jean Fourquet. C. Klincksieck.

[Vadas and Curran, 2007] Vadas, D. and Curran, J. (2007). Adding Noun Phrase Structure to the Penn Treebank. In Annual Meeting of the Association for Computational Linguistics, volume 45, page 240.

[Vijay-Shanker and Weir, 1994] Vijay-Shanker, K. and Weir, D. J. (1994). The Equivalence of Four Extensions of Context-Free Grammars. Theory of Computing Systems, 27:511–546. DOI 10.1007/BF01191624.

[Younger, 1967] Younger, D. (1967). Recognition and Parsing of Context-Free Languages in Time n³. Information and Control, 10(2):189–208.
