
Characteristic   A     B     C     D     E     F     G
SEXTANT          yes   yes   yes   yes   yes   yes   yes/no

Table 9.1: Linguistic characteristics which can be detected by the SEXTANT Parser. See Table 9.2 for an explanation of the letter codes.

Code   Explanation
A      Verbs recognised
B      Nouns recognised
C      Compounds recognised
D      Phrase Boundaries recognised
E      Predicate-Argument Relations identified
F      Prepositional Phrases attached
G      Coordination/Gapping analysed

Table 9.2: Letter codes used in Tables 9.1 and 9.5.3.

In a first phase of testing, we used our original SEXTANT parser on the IPSM'95 test beds. Due to time constraints, we only evaluated the parser results on the first 130 sentences of the LOTUS corpus. For each sentence, we drew up a list of the relations that we thought the parser should return, from the list given in Figure 9.7, and compared those to the actual relations returned by the parser. For this evaluation we used the parser output format shown in Figure 9.8. Over these 130 sentences, we calculated that the original parser returned 432 correct binary dependency relations and 186 incorrect ones, while 248 were missing. In terms of percentages, this means that, under the evaluation conditions stated above, the original parser had a precision of 70% (432/618) and a recall of 64% (432/680) for binary dependency relations. Remember that these precision and recall rates are only for binary relations and cannot be directly compared to the much harder problem of finding the correct hierarchical parse of the entire sentence.

Tables 9.1 and 9.5.3 use an IPSM-wide evaluation scheme whose codes appear in Table 9.2. (We were only able to evaluate our Phase III parser completely using these criteria.) In these tables a number of linguistic characteristics are identified with the letters A to G. Our interpretation of these characteristics follows. The letters A and B indicate whether verbs and nouns are correctly identified. In our case, since we use a tagger that makes these choices before parsing, these characteristics evaluate the accuracy of this tagger. Table 9.5.3 indicates that the tagger was functioning as well as reported in Cutting, Kupiec, Pedersen & Sibun (1992).


Letter C corresponds to compounds recognised. By this we understood both proper names, such as "Ami Pro", as discussed in Section 9.2.1.2, and proper attachment of noun compounds (marked NN in Figure 9.7) during parsing, such as in "insertion point". We counted the parser as successful if the attachment was correctly indicated. This does not answer the terminological question of whether all noun-noun modifications are compounds.

For "phrase boundaries", indicated with the letter D, we noted whether our maximal noun chains and verb chains end and begin where we expected them to. For example, in the fragment "is at the beginning of the text you want to select" we expected our parser to divide the text into four chains: "is", "at the beginning of the text", "you", and "want".

For predicate-argument relations (E) we counted the relations marked as SUBJ and DOBJ in Figure 9.7.

The letter F concerns prepositional phrase attachment. Our parser attaches prepositional phrases to preceding nouns or verbs, and in the case of a prepositional phrase following a verb, it will attach the prepositional phrase ambiguously to both. We counted an attachment as correct when one of these was right, supposing that some ulterior processing would decide between them using semantic knowledge not available to the parser. Here are some examples of errors: in "Select the text to be copied in the concordance window..." the parser produced "copy in window" rather than "select in window". In "certain abbreviations may work at one prompt but not at another" the parser produced only "work at prompt" but not "work at another". This was counted as one success and one failure.

9.4 Analysis I: Original Grammar, Original Vocabulary

The original grammar was developed principally with newspaper text and scientific writing in mind, i.e. linguistic peculiarities linked to dialogue and questions were not treated. The technical documentation text treated here has two distinguishing characteristics: frequent use of the imperative verb form, e.g. "Press ENTER to ...", and common use of lists, e.g. in describing different ways to perform a given action. These two characteristics violate the idealized view of balanced verb-noun phrases described in Section 9.2.2. Both characteristics prompted the parser modification described below in Section 9.6.

In order to evaluate the original parser on this text, a decision must be made about what an ideal parser would return. In keeping with the idea of an industrial parser, we adopt the stringent requirement that the ideal parser should only return one parse for each sentence.


Our parser does not return a global tree structure but rather draws binary labeled relations between sentence elements. Some elements, such as introductory prepositional phrases, are left unattached, cf. Fidditch (Hindle, 1993). We decided therefore to use as evaluation criteria the six binary relations shown in Figure 9.7. These relations are only a minimal set of the types of relations that one would ask of a parser. For example, relations such as those between adjectives and verbs, e.g. ready to begin, are missing, as are those between words and multi-word structures.

         Number   Accept   Reject   % Accept   % Reject
Dynix    20       20       0        100        0
Lotus    20       20       0        100        0
Trados   20       20       0        100        0
Total    60       60       0        100        0

Table 9.3.1: Phase I acceptance and rejection rates for the SEXTANT Parser.

         Total Time     Average Time    Average Time
         to Parse (s)   to Accept (s)   to Reject (s)
Dynix    5              0.3             0.0
Lotus    5              0.3             0.0
Trados   5              0.3             0.0
Total    15             0.3             0.0

Table 9.4.1: Phase I parse times for the SEXTANT Parser using a SPARC 20 with 192 Megabytes. The first column gives the total time to attempt a parse of each sentence.

9.5 Analysis II: Original Grammar, Additional Vocabulary

No additional vocabulary was added to the system. This phase was empty.

9.6 Analysis III: Altered Grammar, Additional Vocabulary


         Number   Accept   Reject   % Accept   % Reject
Dynix    20       20       0        100        0
Lotus    20       20       0        100        0
Trados   20       20       0        100        0
Total    60       60       0        100        0

Table 9.3.3: Phase III acceptance and rejection rates for the SEXTANT Parser.

         Total Time     Average Time    Average Time
         to Parse (s)   to Accept (s)   to Reject (s)
Dynix    5              0.3             0.0
Lotus    5              0.3             0.0
Trados   5              0.3             0.0
Total    15             0.3             0.0

Table 9.4.3: Phase III parse times for the SEXTANT Parser using a SPARC 20 with 192 MB. The first column gives the total time to attempt a parse of each sentence.

Char.     A     B     C     D     E     F     G     Avg.
Dynix     98%   98%   75%   93%   83%   54%   31%   76%
Lotus     97%   97%   92%   91%   79%   57%    8%   74%
Trados    99%   97%   83%   88%   70%   65%    0%   72%
Average   98%   97%   83%   91%   77%   59%   13%   74%

Table 9.5.3: Phase III analysis of the ability of the SEXTANT Parser to recognise certain linguistic characteristics in an utterance. For example, the column marked 'A' gives for each set of utterances the percentage of verbs occurring in them which could be recognised. The full set of codes is itemised in Table 9.2.

The principal parser change was the introduction of heuristics to recognize conjunctive lists. The original parser took a simple view of the sentence as a sequence of noun phrases and prepositional phrases interspersed with verbal chains. The technical text used as a test bed had a great number of conjunctive lists, e.g., "With a mouse, you can use the vertical scroll arrows, scroll bar, or scroll box on the right side of the screen to go forward or backward in a document by lines, screens, or pages.".

In order to treat these constructions, the parser was modified in four ways. First, commas were reintroduced into the text, and were allowed to appear within a nominal chain. Commas and all other punctuation had been stripped out in the original parser.


Secondly, rather than having only one head marked, nominal chains were allowed to have multiple heads. A head could appear before a comma or a conjunction, as well as at the end of the chain. Third, an additional pass went through all the nominal chains in the sentence and split a chain at commas if the chain did not contain a conjunction. Fourth, the search for subjects and objects of verb chains was modified to allow for multiple attachments. These modifications took one and a half man-days to implement and test.

A number of errors found in the original parser were due to tagging errors. For the modified parser, we decided to retag five words: press, use, and remains were forced to be verbs, bar(s) was forced to be a noun, and the tag for using was changed to a preposition. Other words caused problems, such as select, which was often tagged as an adjective, or toggle, whose only possible tag was a verb, but these were not changed. Since the tagger does not use lexical probabilities, words like copy and hold sometimes appeared as nouns, eliminating all possible binary relations that would have derived from the verb, as well as introducing incorrect nominal relations. This might be treated by reworking the tagger output using corpus-specific data. This was not done here except for the five words mentioned. The results obtained must be regarded in light of this minimalist approach.

When these changes were included in the parser, the same 130 sentences from LOTUS were reparsed, and the results from this modified parser were recomputed. (The parser modifications consisted only in incorporating limited list processing rather than any other corpus-specific treatment.) The results are given in Table 9.6, which shows an improvement in precision to 83% (550/666) and an improvement in recall to 81% (550/680).

                  Correct   Incorrect   Missing   Precision   Recall
Original Parser   432       186         248       70%         64%
Modified Parser   550       116         130       83%         81%

Table 9.6: Predicate-argument recognition improvement from the Phase I parser to the Phase III parser, which no longer ignores commas and which identifies nominal list structures.
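As a quick check on these figures, precision and recall can be recomputed directly from the correct/incorrect/missing counts in Table 9.6: precision is the share of returned relations that are correct, and recall is the share of expected relations that are found. The short sketch below is illustrative arithmetic only, not part of the original evaluation tooling; the counts are simply those reported above.

def precision_recall(correct, incorrect, missing):
    # Returned relations = correct + incorrect; expected relations = correct + missing.
    returned = correct + incorrect
    expected = correct + missing
    return correct / returned, correct / expected

# Counts over the 130 LOTUS sentences (Table 9.6).
for label, counts in [("Original parser", (432, 186, 248)),
                      ("Modified parser", (550, 116, 130))]:
    p, r = precision_recall(*counts)
    print(f"{label}: precision {p:.0%}, recall {r:.0%}")
# Original parser: precision 70%, recall 64%
# Modified parser: precision 83%, recall 81%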


9.7 Converting Parse Tree to Dependency Notation

The output of the SEXTANT parser as shown in Figure 9.6 is not a parse tree but a forest of labeled binary trees. Each output line with a number greater than zero in the eighth column can be converted into a binary labeled tree using columns nine and beyond. For example, the tenth line (as numbered in column seven), corresponding to "ignore" in this figure, shows that the subject (SUBJ) of "ignore" is the word in line eight, "Workbench". The "sentence" in line thirteen is marked as the direct object (DOBJ) of the word "ignore" in line ten. The word "sentence" in line nineteen is in a binary tree labeled "IOBJ-from" with the word "move" in line fifteen. These three examples could be rewritten as:

subject(Workbench, ignore)
direct-object(ignore, sentence)
verb-prep(move, from, sentence)

9.8 Summary of Findings

In all, 1.5 man-weeks was spent on this task. The principal strengths of this parser are its robustness (it always returns a parse), its generality (no domain-specific information in the lexicon), and its speed (all 600 sentences are parsed in under 41 seconds of CPU time on a Sparc 20).

Its weaknesses are many. It cannot deal with unmarked embedded clauses, i.e. in "... the changes you make appear ...", "make appear" is not recognized as two verbal chains. Tagging errors cannot be recovered, so errors in tagging are propagated throughout the parse (though this is a general problem in any system in which parsing and tagging are independent). Being designed to work on declarative sentences, it can misread the subject of imperative sentences (the tagger was trained on untagged text with few imperatives and does not itself return an imperative verb tag, so imperative verbs are marked as active verbs, infinitives or nouns, especially following conjunctions). There is no attempt to identify the subject of an infinitive verb phrase which is not part of an active verb chain; e.g., in "... you want to go ...", "you" will be identified as a subject of "go", but not in "you will be able to go". Progressive verb phrases are succinctly handled by seeing if a noun precedes the verb and calling it a subject if it does. Questions are not treated. Gaps are not recognized or filled; many words are simply thrown away, e.g. adverbs. No relations are created between adjectives and verbs, e.g. "ready to begin" yields nothing.


Being word based, the parser provides no relations between a word and a larger syntactic unit, such as between "refer" and a title as in "refer to Understanding formatting".

In other words, the level of analysis returned by this parser is of little utility for higher linguistic tasks, for example automatic translation, that require a more complete analysis of the sentence. It might serve, however, for lower-level tasks such as terminology and subcategorization extraction, or for information retrieval.

9.9 References

Abney, S. (1991). Parsing by Chunks. In R. Berwick, S. Abney & C. Tenny (Eds.), Principle-Based Parsing. Dordrecht, The Netherlands: Kluwer Academic Publishers.
Chanod, J. P. (1996). Rules and Constraints in a French Finite-State Grammar (Technical Report). Meylan, France: Rank Xerox Research Centre, January.
Cutting, D., Kupiec, J., Pedersen, J., & Sibun, P. (1992). A Practical Part-of-Speech Tagger. Proceedings of the Third Conference on Applied Natural Language Processing, April 1992.
COMPASS (1995). Adapting Bilingual Dictionaries for Online Comprehension Assistance (Deliverable, LRE Project 62-080). Luxembourg, Luxembourg: Commission of the European Communities.
Debili, F. (1982). Analyse Syntaxico-Semantique Fondee sur une Acquisition Automatique de Relations Lexicales-Semantiques. Ph.D. Thesis, University of Paris XI.
Francis, W. N., & Kučera, H. (1982). Frequency Analysis of English. Boston, MA: Houghton Mifflin Company.
Gibson, E., & Pearlmutter, N. (1993). A Corpus-Based Analysis of Constraints on PP Attachments to NPs (Report). Pittsburgh, PA: Carnegie Mellon University, Department of Philosophy.
Grefenstette, G. (1994). Explorations in Automatic Thesaurus Discovery. Boston, MA: Kluwer Academic Press.
Grefenstette, G. (1996). Light Parsing as Finite State Filtering. Proceedings of the Workshop 'Extended Finite State Models of Language', European Conference on Artificial Intelligence, ECAI'96, Budapest University of Economics, Budapest, Hungary, 11-12 August 1996.
Grefenstette, G., & Schulze, B. M. (1995). Designing and Evaluating Extraction Tools for Collocations in Dictionaries and Corpora (Deliverable D-3a, MLAP Project 93-19: Prototype Tools for Extracting Collocations from Corpora). Luxembourg, Luxembourg: Commission of the European Communities.


Grefenstette, G., & Tapanainen, P. (1994). What is a Word, What is a Sentence? Problems of Tokenization. Proceedings of the 3rd Conference on Computational Lexicography and Text Research, COMPLEX'94, Budapest, Hungary, 7-10 July.
Hindle, D. (1993). A Parser for Text Corpora. In B. T. S. Atkins & A. Zampolli (Eds.), Computational Approaches to the Lexicon. Oxford, UK: Clarendon Press.
Karttunen, L. (1994). Constructing Lexical Transducers. Proceedings of the 15th International Conference on Computational Linguistics, COLING'94, Kyoto, Japan.
Karttunen, L., Kaplan, R. M., & Zaenen, A. (1992). Two-Level Morphology with Composition. Proceedings of the 14th International Conference on Computational Linguistics, COLING'92, Nantes, France, 23-28 August 1992, 141-148.
Voutilainen, A., Heikkila, J., & Anttila, A. (1992). A Lexicon and Constraint Grammar of English. Proceedings of the 14th International Conference on Computational Linguistics, COLING'92, Nantes, France, 23-28 August 1992.


10 Using a Dependency Structure Parser without any Grammar Formalism to Analyse a Software Manual Corpus

Christopher H. A. Ting
Peh Li Shiuan
National University of Singapore
DSO

(Address: Computational Science Programme, Faculty of Science, National University of Singapore, Lower Kent Ridge Road, Singapore 0511. Tel: +65 373 2016, Fax: +65 775 9011, Email: [email protected]. C. Ting is grateful to Dr How Khee Yin for securing a travel grant to attend the IPSM'95 Workshop. He also thanks Dr Paul Wu and Guo Jin of the Institute of System Science, National University of Singapore, for a discussion based on a talk given by Liberman (1993).)

10.1 Introduction

When one designs algorithms for the computer to parse sentences, one is asking the machine to determine the syntactic relationships among the words in the sentences. Despite considerable progress in the linguistic theories of parsing and grammar formalisms, the problem of having a machine automatically parse natural language texts with high accuracy and efficiency is still not satisfactorily solved (Black, 1993). Most existing, state-of-the-art parsing systems rely on grammar rules (of a certain language) expressed in a certain formal grammar formalism, such as the (generalized) context-free grammar formalism, the tree-adjoining grammar formalism and so on. The grammar rules are written and coded by linguists over many years. Alternatively, they can be "learned" from an annotated corpus (Carroll & Charniak, 1992; Magerman, 1994). In any case, the underlying assumption of using a particular grammar formalism is that most, if not all, of the syntactic constructs of a certain language can be expressed by it.


But is it really true? Is grammar formalism indispensable to parsing algorithms?

In this work, we propose a novel, hybrid approach to parsing that does not rely on any grammar formalism (Ting, 1995a; Ting, 1995b; Peh & Ting, 1995). Based on an enhanced hidden Markov model (eHMM), we built a parser, DESPAR, that produces a dependency structure for an input sentence. DESPAR is designed to be modular. It aims to be:

- accurate;
- capable of handling vocabulary of unlimited size;
- capable of processing sentences written in various styles;
- robust;
- easy for non-linguists to build, fine-tune and maintain;
- fast.

The linguistic characteristics of DESPAR are tabulated in Table 10.1. The verbs and nouns are recognized by the part-of-speech tagger, the compounds by a noun phrase parser which is also statistical in nature, the phrase boundaries by a segmentation module, and the attachment and coordination by the synthesis module. The predicate-argument structure is analyzed by a rule-based module which reads out the subject, the object, the surface subject, the logical subject and the indirect object from the noun phrases and adjectives governed by the verbs. In other words, these rules are based on the parts of speech of the nodes governed by the predicates, and their positions relative to the predicates.

Characteristic   A     B     C     D     E     F     G
DESPAR           yes   yes   yes   yes   yes   yes   yes

Table 10.1: Linguistic characteristics which can be detected by DESPAR. See Table 10.2 for an explanation of the letter codes.

Code   Explanation
A      Verbs recognised
B      Nouns recognised
C      Compounds recognised
D      Phrase Boundaries recognised
E      Predicate-Argument Relations identified
F      Prepositional Phrases attached
G      Coordination/Gapping analysed

Table 10.2: Letter codes used in Tables 10.1, 10.4.1 and 10.4.2.

We tested DESPAR on a large amount of unrestricted text. In all cases, it never failed to generate a parse forest, and to single out a most likely parse tree from the forest. DESPAR was also tested on the IPSM'95 software manual corpus. In the following, we present the analysis results after a brief description of the parsing system (Section 10.2) and the parser evaluation criteria (Section 10.3).

10.2 Description of Parsing System

In 1994, we were given the task of building a wide-coverage parser. Not knowing any grammar formalism, we wondered whether the conventional wisdom that "parsing is about applying grammar rules expressed in a certain grammar formalism" was an absolute truth. We began to explore ways of re-formulating the problem of parsing so that we could build a parser without having to rely on any formal grammar formalism.

By coincidence, we came to know of the success of using HMMs to build a statistical part-of-speech tagger (Charniak, Hendrickson, Jacobson & Perkowitz, 1993; Merialdo, 1994), and of Liberman's idea of viewing dependency parsing as some kind of "tagging" (Liberman, 1993).


If dependency parsing were not unlike tagging, then would it not be a good idea to model it as some hidden Markov process? The only worry, of course, was whether the Markov assumption really holds for parsing. What about long-distance dependency? How accurate could it be? Does it require a large annotated corpus of the order of a few million words? Nobody seemed to know the answers to these questions.

We then began to work out a few hidden Markov models for this purpose. Through a series of empirical studies, we found that it was possible, and practical, to view dependency parsing as a kind of tagging, provided one used an enhanced HMM. This is an HMM aided by a dynamic context algorithm and the enforcement of dependency axioms. Based on this enhanced model, we constructed DESPAR, a parser of dependency structure. DESPAR takes as input a sentence of tokenized words, and produces a most likely dependency structure (Mel'čuk, 1987). In addition, it also captures other possible dependency relationships among the words. Should one decide to unravel the syntactic ambiguities, one can return to the forest of parse trees and select the second most likely one, and so on.

An advantage of analysing dependency structure is probably the relative ease of extracting the predicate-argument structure from the parse tree. The version of dependency structure that we use is motivated by the need to enable non-professional linguists to participate in annotating the corpus. The building of DESPAR involves the following:



- obtain a large part-of-speech (POS) corpus;
- develop a statistical POS tagger;
- invent an unknown word module for the POS tagger;
- develop a computational version of dependency structure;
- build up a small corpus of dependency structures;
- invent a hidden Markov theory to model the statistical properties of the dependency structures;
- invent a dynamic contextual string matching algorithm for generating possible and most probable parses;
- incorporate syntactic constraints into the statistical engine;
- develop a rule-based segmentation module and synthesis module to divide-and-conquer the parsing problem.

An overview of the flow of processing is illustrated in Figure 10.1.

Figure 10.1: An overview of the flow of processing in DESPAR. Given a tokenized sentence, "He cleaned a chair at the canteen", it is first tagged as "PP VB DT NN IN DT NN". These codes correspond to the parts of speech of the words in the sentence. Based on these parts of speech, the computer is to arrive at the dependency structure of the sentence. The pronoun PP, the noun NN and the preposition IN are linked (directly) to the verb VB as they are the subject, the object and the head of the prepositional phrase respectively. They are said to depend on the verb as they provide information about the action "cleaned". The result of parsing is a dependency parse tree of the sentence.



Figure 10.2: An illustration of the enhanced HMM. After tagging, the dynamic context algorithm searches for the likely governors for each part of speech. The axioms of the dependency structure are employed to prune away invalid candidates. Once all the statistical parameters are estimated, we use the Viterbi algorithm to pick the dependency tree with the maximum likelihood from the forest.



Figure 10.3: DESPAR operating in the divide-and-conquer mode. After the part-of-speech tagging, it performs noun phrase parsing and submits each noun phrase for dependency parsing. Then it disambiguates whether each comma is prosodic, logical conjunctive or clausal conjunctive. The string of parts of speech is segmented and each segment is parsed accordingly. The parsed segments are then synthesized to yield the final analysis.

The corpus-based, statistical approach to building the parser has served us well. By now, the effectiveness of the corpus-based, statistical approach is well documented in the literature. The attractive feature of this approach is that it has a clear-cut division between the language-dependent elements and the inferencing operations. Corpora, being the data or collections of examples used by people in the day-to-day usage of the language, together with the design of the tags and annotation symbols, are the linguistic inputs for building up the system. Statistical tools such as the hidden Markov model (HMM), the Viterbi algorithm (Forney, 1973) and so on are the language-independent components. If one has built a parser for English using English corpora, then using the same statistical tools one can build another parser for French, if some French corpora are available. The main modules of our system are described below.


Figure 10.4: The parse tree produced by DESPAR for sentence L8 in Analysis I, 'Scrolling changes the display but does not move the insertion point .'. All the words were attached correctly.

Part-of-Speech Tagger

With an annotated Penn Treebank version of the Brown Corpus and the Wall Street Journal (WSJ) Corpus (Marcus, Marcinkiewicz, & Santorini, 1993), we developed a statistical tagger based on a first-order (bigram) HMM (Charniak, Hendrickson, Jacobson & Perkowitz, 1993). Our tagger gets its statistics from the Brown Corpus' 52,527 sentences (1.16 million words) and the WSJ Corpus' 126,660 sentences (2.64 million words).
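A minimal, self-contained sketch of such a bigram HMM tagger is given below. It is an illustrative reconstruction of the standard technique, estimating P(tag_i | tag_{i-1}) and P(word | tag) from a tagged corpus and recovering the best tag sequence with the Viterbi algorithm; it is not DESPAR's actual code, and the tiny training corpus and smoothing constants are invented for the example.

import math
from collections import defaultdict

def train(tagged_sentences):
    """Count bigram tag transitions and word emissions from a tagged corpus."""
    trans, emit = defaultdict(int), defaultdict(int)
    tags = set()
    for sent in tagged_sentences:               # sent = [(word, tag), ...]
        prev = "<s>"
        for word, tag in sent:
            trans[(prev, tag)] += 1
            emit[(tag, word.lower())] += 1
            tags.add(tag)
            prev = tag
    return trans, emit, sorted(tags)

def viterbi(words, trans, emit, tags, vocab_size=50000):
    """Most likely tag sequence under the bigram HMM, with add-one smoothing."""
    t_tot, e_tot = defaultdict(int), defaultdict(int)
    for (p, t), c in list(trans.items()):
        t_tot[p] += c
    for (t, w), c in list(emit.items()):
        e_tot[t] += c

    def log_t(p, t):   # smoothed log P(t | p)
        return math.log((trans[(p, t)] + 1) / (t_tot[p] + len(tags)))

    def log_e(t, w):   # smoothed log P(w | t)
        return math.log((emit[(t, w.lower())] + 1) / (e_tot[t] + vocab_size))

    score = [{t: log_t("<s>", t) + log_e(t, words[0]) for t in tags}]
    back = [{}]
    for i, w in enumerate(words[1:], start=1):
        score.append({})
        back.append({})
        for t in tags:
            best = max(tags, key=lambda p: score[i - 1][p] + log_t(p, t))
            score[i][t] = score[i - 1][best] + log_t(best, t) + log_e(t, w)
            back[i][t] = best
    seq = [max(tags, key=lambda t: score[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        seq.append(back[i][seq[-1]])
    return list(reversed(seq))

toy_corpus = [[("He", "PP"), ("cleaned", "VB"), ("a", "DT"), ("chair", "NN")],
              [("He", "PP"), ("cleaned", "VB"), ("the", "DT"), ("canteen", "NN")]]
print(viterbi(["He", "cleaned", "the", "chair"], *train(toy_corpus)))
# ['PP', 'VB', 'DT', 'NN']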


166 Ting, PehFigure 10.5: The parse tree produced by DESPAR for sen-tence T125 in Analysis I, `If the Workbench cannot �nd any fuzzymatch, it will display a corresponding message ( \No match " ) inthe lower right corner of its status bar and you will be presentedwith an empty yellow target �eld.'. In this sentence, the tokenizermade a mistake in not detaching " from No. The word "No wasan unknown word to the tagger and it was tagged as proper nounby the unknown word module. The tokens ( and empty were at-tached wrongly by DESPAR. Though these were minor mistakes,they were counted as errors because they did not match theirrespective counterparts in the annotated corpus.Unknown Word ModuleTo make the tagger and parser robust against unknown words, we de-signed an unknown word module based on the statistical distribution ofrare words in the training corpus. During run-time, the dynamic contextalgorithm estimates the conditional probability of the POS given thatthe unknown word occurs in the context of a string of POSs of knownwords. Then we apply the Viterbi algorithm again to disambiguate the
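The rare-word idea can be made concrete as follows. This is a simplified stand-in for the module described above; the actual dynamic context algorithm additionally conditions on the surrounding string of known POSs, which is not reproduced here. The tag distribution of unknown words is approximated by the tag distribution of words that were rare in the training corpus, and that distribution then serves as the unknown token's emission probabilities in the Viterbi pass.

from collections import Counter

def unknown_word_tag_distribution(tagged_sentences, max_count=1):
    """Approximate P(tag | unknown word) by the tags of rare training words."""
    word_freq = Counter(w.lower() for sent in tagged_sentences for w, _ in sent)
    rare_tags = Counter(t for sent in tagged_sentences
                          for w, t in sent if word_freq[w.lower()] <= max_count)
    total = sum(rare_tags.values())
    return {tag: count / total for tag, count in rare_tags.items()}

# At run time, an out-of-vocabulary token is assigned this distribution in place of
# its (unseen) emission probabilities, and the Viterbi algorithm then lets the
# surrounding context of known POSs decide the final tag.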


Figure 10.6: The parse tree produced by DESPAR for sentence L113 in Analysis I, 'To move or copy text between documents .'. Here, the tagger tagged 'copy' wrongly as a noun, which was fatal to the noun phrase parser and the dependency parser.

Computational Dependency Structure

With the aim of maintaining consistency in annotating the corpus, we standardized a set of conventions for the annotators to follow. For instance, we retain the surface form of the verb and make other words which provide the tense information of the verb depend on it (examples of these are the modals "will", "can" etc., the copula, the infinitive indicator "to" and so on). The conventions for the dependency relationships of punctuation, delimiters, dates, names etc. are also spelled out.

Dependency Parser

We manually annotated a small corpus (2,000 sentences) of dependency structures, and used it to estimate the statistical parameters of a first-order enhanced HMM for parsing.


Figure 10.7: The parse tree produced by DESPAR for sentence D84 in Analysis I, 'For example , you can use an accelerated search command to perform an author authority search or a title keyword search .'. The highlight of this example is the attachment of 'perform'. In our version of computational dependency structure, this is a fully correct parse, because we can re-order the sentence as 'For example , to perform an author authority search or a title keyword search you can use an accelerated search command .'.

The key idea is to view parsing as if one were tagging the dependency "codes" for each word in the sentence. The dependency structure of a sentence can be represented or coded by two equivalent schemes (see the appendix). These dependency codes now become the states of the HMM. To reduce the perplexity, we also use the dynamic context algorithm, which estimates the conditional probabilities of the dependency codes given the contexts of POSs during run time. And to enhance the performance of the system, we also make use of the axioms of dependency structures to throw out invalid candidate governors and to constrain the possible dependency parse trees.
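The sketch below illustrates these two ingredients in a deliberately simplified form: each word is "tagged" with a code (g, o), the POS of its governor and the governor's relative offset, and complete assignments of codes are filtered by two basic dependency axioms (exactly one root, and every word must reach the root without cycles). The candidate enumeration shown here is only a placeholder for the dynamic context algorithm, and the exhaustive filtering stands in for the Viterbi search that DESPAR actually uses to rank the surviving trees.

from itertools import product

def candidate_codes(pos_tags, i, max_offset=3):
    """Candidate (g, o) codes for word i: governor POS and relative offset."""
    cands = [(None, 0)]                           # (None, 0) marks the sentence root
    for o in range(-max_offset, max_offset + 1):
        j = i + o
        if o != 0 and 0 <= j < len(pos_tags):
            cands.append((pos_tags[j], o))
    return cands

def satisfies_axioms(codes):
    """Axioms: exactly one root, and every word reaches the root (no cycles)."""
    heads = [i + o if g is not None else -1 for i, (g, o) in enumerate(codes)]
    if heads.count(-1) != 1:
        return False
    for start in range(len(heads)):
        seen, j = set(), start
        while heads[j] != -1:                     # climb towards the root
            if j in seen:
                return False                      # a cycle never reaches the root
            seen.add(j)
            j = heads[j]
    return True

pos = ["PP", "VB", "DT", "NN"]                    # "He cleaned a chair"
trees = [c for c in product(*(candidate_codes(pos, i) for i in range(len(pos))))
         if satisfies_axioms(c)]
# Each surviving code assignment corresponds to one dependency tree; the eHMM
# scores them with P(code | POS context) and returns the most likely one.
print(len(trees), trees[0])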


It is worthwhile to remark that the dynamic context algorithm and the language-independent axioms are critical in the dependency "tagging" approach to parsing. The HMM aided by the dynamic context algorithm and the axioms, called the enhanced HMM (eHMM), is the novel statistical inference technique of DESPAR.

Noun Phrase Parser

Using the eHMM, we have also succeeded in building an (atomic) noun phrase parser (Ting, 1995b). A key point in our method is a representation scheme for noun phrases which enables the problem of noun phrase parsing to be formulated also as a statistical tagging process by the eHMM. The noun phrase parser requires only 2,000 sentences for estimating the statistical parameters; no grammar rules or pattern templates are needed. Our experimental results show that it achieves 96.3% on the WSJ Corpus.

Divide-and-Conquer Module

The divide-and-conquer module is designed to enhance the effectiveness of the parser by simplifying complex sentences before parsing. It partitions complex sentences into simple segments, and each segment is parsed separately. The rule-based segmentation module decides where to segment based on the outcome of a disambiguation process (Peh & Ting, 1995). The noun phrase bracketing provided by the noun phrase parser is also used in this module. Finally, a rule-based synthesizer glues together all the segments' parse trees to yield the overall parse of the original complex sentence.

The working mode of the parser is illustrated in Figure 10.2 and Figure 10.3.

All the program code of the parser system was written in-house in Unix C. Currently, the parser system runs on an SGI 5.3 with the following configuration:

- one 150 MHz IP19 Processor
- CPU: MIPS R4400 Processor Chip Revision: 5.0
- FPU: MIPS R4010 Floating Point Chip Revision: 0.0
- Data cache size: 16 Kbytes
- Instruction cache size: 16 Kbytes
- Secondary unified instruction/data cache size: 1 Mbyte
- Main memory size: 256 Mbytes, 1-way interleaved

10.3 Parser Evaluation Criteria

Since dependency parsing in our approach is about assigning a dependency code to each word, we can evaluate the accuracy of the parser's outputs in the same way as we evaluate the performance of the tagger.


We evaluate the performance of DESPAR at the word level and at the sentence level, defined as follows:

- word level: a word is said to be tagged correctly if the answer given by the system matches exactly that of the annotated corpus.
- sentence level: a sentence is said to be recognized correctly if the parse tree given by the system matches exactly that of the annotated corpus.

The word level is a somewhat lenient evaluation criterion. At the other extreme, the sentence level is very stringent. It favours short sentences and discriminates against long sentences. If there is just one tag of a word in the sentence that does not match exactly that of the annotated corpus, the whole sentence is deemed to be analyzed wrongly by the system. It may be an over-stringent criterion, because a sentence may have more than one acceptable parse. Scoring a low value at the sentence level is no indication that the parser is useless.

The accuracy of the noun phrase parser is evaluated according to the exact match of the beginning and the ending of noun phrase brackets with those in the annotated corpus. For example, if the computer returns [w1 w2] [w3 w4] w5 w6 [w7 w8] and the sentence in the annotated corpus is [w1 w2 w3 w4] w5 w6 [w7 w8], then there are two wrong noun phrases and one correct one.

These measures account for the consistency between the system's outputs and the human annotation. It could happen that the tag of a particular word was annotated wrongly. As a result, even though the system produces the correct result, it is counted as wrong, because it does not match the tag in the corpus. We estimate the corpora to be contaminated with 3 to 6% "noise". These measures therefore give a lower bound on how well a system can perform in terms of producing the really correct outputs.
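The three measures can be stated compactly as follows. This is hypothetical helper code for illustration, not the scoring script used in the experiments; noun phrase brackets are written as (start, end) word positions in the spirit of the w1...w8 example above.

def word_level_accuracy(system_codes, gold_codes):
    """Fraction of words whose tag or dependency code matches the annotation exactly."""
    hits = sum(s == g for s, g in zip(system_codes, gold_codes))
    return hits / len(gold_codes)

def sentence_level_accuracy(system_parses, gold_parses):
    """Fraction of sentences whose whole parse matches the annotation exactly."""
    hits = sum(s == g for s, g in zip(system_parses, gold_parses))
    return hits / len(gold_parses)

def noun_phrase_score(system_brackets, gold_brackets):
    """Count system brackets whose start and end both match a gold bracket."""
    gold = set(gold_brackets)
    correct = sum(b in gold for b in system_brackets)
    return correct, len(system_brackets) - correct    # (correct, wrong)

# The example from the text: [w1 w2] [w3 w4] w5 w6 [w7 w8] scored against the
# annotated [w1 w2 w3 w4] w5 w6 [w7 w8] gives one correct and two wrong noun phrases.
print(noun_phrase_score([(1, 2), (3, 4), (7, 8)], [(1, 4), (7, 8)]))   # (1, 2)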


10.4 Analysis I: Original Grammar, Original Vocabulary

After having received the 600 sentences in 3 files from the organizers of the IPSM'95 Workshop, we tokenized the sentences; namely, we used a computer program to detach the punctuation marks, quotation marks, parentheses and so on from the words, and the isolated tokens were retained in the sentences. This was the only pre-processing we did.

Then we annotated the 600 sentences in a bootstrapping manner. The IPSM'95 Corpus so obtained (see Appendix A for a sample) becomes the standard for checking against the outputs of our tagger, noun phrase parser and dependency structure parser. (These, together with the unknown word module and the divide-and-conquer module, form a total system called DESPAR.) Though we spared no effort to make sure that the Corpus was free of errors, we estimate the IPSM'95 Corpus to contain 2 to 4% noise.

For Analysis I, the POS tagger was trained on the Penn Treebank's Brown Corpus and the Wall Street Journal Corpus, while the noun phrase parser and the dependency structure parser were trained on a small subset of it.

         Number   Accept   Reject   % Accept   % Reject
Dynix    20       20       0        100        0
Lotus    20       20       0        100        0
Trados   20       20       0        100        0
Total    60       60       0        100        0

Table 10.2.1: Phase I acceptance and rejection rates for DESPAR.

         Total Time     Average Time    Average Time
         to Parse (s)   to Accept (s)   to Reject (s)
Dynix    209            10.5            N.A.
Lotus    185            9.3             N.A.
Trados   224            11.2            N.A.
Total    618            10.3            N.A.

Table 10.3.1: Phase I parse times for DESPAR. The first column gives the total time (seconds) to parse the 20 sentences in each file. The last column is not applicable (N.A.) to DESPAR.

Char.     A      B     C     D     E     F     G      Avg.
Dynix     98%    97%   97%   98%   84%   85%   80%    91%
Lotus     96%    96%   90%   92%   83%   61%   67%    84%
Trados    100%   98%   94%   99%   86%   70%   100%   92%
Average   98%    97%   94%   96%   84%   72%   82%    89%

Table 10.4.1: Phase I analysis of the ability of DESPAR to recognise certain linguistic characteristics in an utterance. For example, the column marked 'A' gives for each set of utterances the percentage of verbs occurring in them which could be recognised. The full set of codes is itemised in Table 10.2.


Dynix                         Error   Total   Accuracy
POS (word level)              10      343     97.1%
Dependency (word level)       44      343     87.2%
IN (word level)               4       27      85.2%
CC (word level)               2       10      80.0%
POS (sentence level)          9       20      55.0%
Dependency (sentence level)   15      20      25.0%

Lotus                         Error   Total   Accuracy
POS (word level)              13      289     95.5%
Dependency (word level)       51      289     82.4%
IN (word level)               10      26      61.5%
CC (word level)               4       12      66.7%
POS (sentence level)          10      20      50.0%
Dependency (sentence level)   14      20      30.0%

Trados                        Error   Total   Accuracy
POS (word level)              11      389     97.2%
Dependency (word level)       56      389     85.6%
IN (word level)               14      47      70.2%
CC (word level)               0       4       100.0%
POS (sentence level)          9       20      55.0%
Dependency (sentence level)   16      20      20.0%

Table 10.5.1: A detailed breakdown of the performance of DESPAR at the word level and the sentence level for Analysis I.

For each test sentence, DESPAR will always select one parse tree out of the forest generated by the dynamic context algorithm. The selection of one parse tree is carried out by the Viterbi algorithm in the enhanced HMM framework. In this sense, all the sentences can be recognized (i.e. parsed), although not exactly as those in the annotated corpus. DESPAR is absolutely robust; it produces parses for sentences which contain grammatical errors, even for random strings of words. (We designed DESPAR not with the intention of using it as a grammar checker. Rather, we wanted DESPAR to be very robust. Currently we use it for a machine translation project, with other natural language applications in the pipeline.)

Figure 10.4 shows a parse tree of the test sentence L8, which was analysed correctly at the sentence level by DESPAR, although some of the words were tagged wrongly by the part-of-speech tagger.

Figure 10.5 shows that DESPAR is tolerant of some errors in the tokenization. The current version of DESPAR uses only the parts of speech to perform dependency parsing.


If the tagger tags wrongly, the dependency parser is still able to parse correctly in the cases of mistaking a noun for a proper noun, a noun for an adjective, and so on. However, if a noun is tagged wrongly as a verb or vice versa, the analysis by DESPAR is usually unacceptable, as in Figure 10.6.

Figure 10.7 gives a flavour of the computational version of dependency structure we adopt.

A detailed summary of the performance of DESPAR at the word level and the sentence level is given in Table 10.5.1.


10.5 Analysis II: Original Grammar, Additional Vocabulary

For Analysis II, we re-trained our POS tagger. The training corpora are the Brown Corpus, the WSJ Corpus, and the IPSM'95 Corpus itself. That was all we did to incorporate "additional vocabulary". We did not re-train the noun phrase parser or the dependency parser. We also did not use the lists of technical terms distributed by the organizers. The only difference with Analysis I is that now all the words are known to the tagger.

We ran the tagger and the parsers on the 3 files again as we did for Analysis I. The results are tabulated below.

         Number   Accept   Reject   % Accept   % Reject
Dynix    20       20       0        100        0
Lotus    20       20       0        100        0
Trados   20       20       0        100        0
Total    60       60       0        100        0

Table 10.2.2: Phase II acceptance and rejection rates for DESPAR.

         Total Time     Average Time    Average Time
         to Parse (s)   to Accept (s)   to Reject (s)
Dynix    208            10.4            N.A.
Lotus    185            9.3             N.A.
Trados   225            11.3            N.A.
Total    618            10.3            N.A.

Table 10.3.2: Phase II parse times for DESPAR. The first column gives the total time (seconds) to parse the 20 sentences in each file. The last column is not applicable (N.A.) to DESPAR.

Char.     A      B     C     D     E     F     G      Avg.
Dynix     98%    98%   97%   98%   86%   86%   80%    92%
Lotus     96%    97%   95%   94%   85%   62%   67%    85%
Trados    100%   99%   94%   99%   86%   70%   100%   93%
Average   98%    98%   95%   97%   86%   73%   83%    90%

Table 10.4.2: Phase II analysis of the ability of DESPAR to recognise certain linguistic characteristics in an utterance. For example, the column marked 'A' gives for each set of utterances the percentage of verbs occurring in them which could be recognised. The full set of codes is itemised in Table 10.2.

Dynix                         Error   Total   Accuracy
POS (word level)              9       343     97.4%
Dependency (word level)       40      343     88.3%
IN (word level)               4       28      85.7%
CC (word level)               2       10      80.0%
POS (sentence level)          8       20      60.0%
Dependency (sentence level)   15      20      25.0%

Lotus                         Error   Total   Accuracy
POS (word level)              9       289     96.9%
Dependency (word level)       51      289     82.4%
IN (word level)               10      26      61.5%
CC (word level)               4       12      66.7%
POS (sentence level)          7       20      65.0%
Dependency (sentence level)   14      20      30.0%

Trados                        Error   Total   Accuracy
POS (word level)              9       389     97.7%
Dependency (word level)       55      389     85.9%
IN (word level)               14      47      70.2%
CC (word level)               0       4       100.0%
POS (sentence level)          8       20      60.0%
Dependency (sentence level)   16      20      20.0%

Table 10.5.2: A detailed breakdown of the performance of DESPAR at the word level and the sentence level for Analysis II.

Since most of the mistakes which the tagger made were not fatal, in the sense that there were only isolated instances where a verb was mistaken for a noun and vice versa, the performance of DESPAR in Analysis II was not significantly different from that in Analysis I.


While the tagger registered a hefty 20% error reduction, the dependency parser only improved by a 3% error reduction. These figures show that one need not feed DESPAR with additional vocabulary for it to perform reasonably well. The unknown word module, though making mistakes in tagging unknown nouns as proper nouns and so on, is sufficient for the approach we take in tackling the parsing problem.

A detailed summary of the performance of DESPAR at the word level and the sentence level is given in Table 10.5.2.

10.6 Analysis III: Altered Grammar, Additional Vocabulary

Analysis III was not carried out on DESPAR.

10.7 Converting Parse Tree to Dependency Notation

The issue of conversion to dependency notation was not addressed for the DESPAR system.

10.8 Summary of Findings

The results show that one can analyse the dependency structure of a sentence without using any grammar formalism; the problem of dependency parsing can be formulated as a process of tagging the dependency codes. When tested on the IPSM'95 Corpus, DESPAR is able to produce a parse for each sentence with an accuracy of 85% at the word level.

Its performance can be improved by having a collocation module to pre-process the sentence before submitting it to DESPAR for analysis. To attain higher accuracy, it is also desirable to have some module that can process dates, times, addresses, names etc.

As no formal grammar formalism is used, it is relatively easy to maintain the parser system by simply providing it with more corpora. Our current corpus for training the eHMM has only 2,000 sentences. If a dependency-code corpus of the order of millions of sentences in size were available, it would be interesting to see how far the enhanced HMM, namely HMM + dynamic context + dependency structure axioms, can go. Another dimension for improvement is to further develop the statistical inference engine, the eHMM. The current eHMM is based on first-order (i.e., bigram) state transitions.


We expect the system to do better if we use second-order (trigram) transitions and other adaptive models to tune the statistical parameters.

In conclusion, we remark that deviation from the established Chomskyan mode of thinking is both fruitful and useful in opening up a new avenue for creating a practical parser. We also show that it is feasible to model dependency structure parsing with a hidden Markov model supported by a dynamic context algorithm and the incorporation of dependency axioms.

10.9 References

Black, E. (1993). Parsing English By Computer: The State Of The Art (Internal report). Kansai Science City, Japan: ATR Interpreting Telecommunications Research Laboratories.
Carroll, G., & Charniak, E. (1992). Two Experiments On Learning Probabilistic Dependency Grammars From Corpora (TR CS-92-16). Providence, RI: Brown University, Department of Computer Science.
Charniak, E., Hendrickson, C., Jacobson, N., & Perkowitz, M. (1993). Equations for Part-of-Speech Tagging. Proceedings of AAAI'93, 784-789.
Forney, D. (1973). The Viterbi Algorithm. Proceedings of the IEEE, 61, 268-278.
Liberman, M. (1993). How Hard Is Syntax. Talk given in Taiwan.
Magerman, D. (1994). Natural Language Parsing As Statistical Pattern Recognition. Ph.D. Thesis, Stanford University.
Marcus, M. P., Marcinkiewicz, M. A., & Santorini, B. (1993). Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19, 313-330.
Mel'čuk, I. A. (1987). Dependency Syntax: Theory and Practice. Stony Brook, NY: State University of New York Press.
Merialdo, B. (1994). Tagging English Text With A Probabilistic Model. Computational Linguistics, 20, 155-171.
Peh, L. S., & Ting, C. (1995). Disambiguation of the Roles of Commas and Conjunctions in Natural Language Processing (Proceedings of the NUS Inter-Faculty Seminar). Singapore: National University of Singapore.
Ting, C. (1995a). Hybrid Approach to Natural Language Processing (Technical Report). Singapore: DSO.
Ting, C. (1995b). Parsing Noun Phrases with Enhanced HMM (Proceedings of the NUS Inter-Faculty Seminar). Singapore: National University of Singapore.


Appendix A: Samples of dependency structures in the IPSM'95 Corpus

#D77
IN   If            1   -->  10   -
PP   you           2   -->  3    [   SUB
VBP  need          3   -->  1    ]
DT   the           4   -->  5    [
NP   BIB           5   -->  3    +
NN   information   6   -->  3    +   OBJ
,    ,             7   -->  6    ]
PP   you           8   -->  10   [   SUB
MD   can           9   -->  10   ]
VB   create        10  -->  22   -
DT   a             11  -->  14   [
VBN  saved         12  -->  14   +
NN   BIB           13  -->  14   +
NN   list          14  -->  10   +   OBJ
CC   and           15  -->  10   ]
VB   print         16  -->  15   -
PP   it            17  -->  16   [
RP   out           18  -->  16   ]
WRB  when          19  -->  16   -
PP   you           20  -->  21   [   SUB
VBP  like          21  -->  19   ]
.    .             22  -->  0    -

#L30
NP   Ami          1   -->  2    [
NP   Pro          2   -->  3    +   SUB
VBZ  provides     3   -->  10   ]
JJ   additional   4   -->  6    [
NN   mouse        5   -->  6    +
NNS  shortcuts    6   -->  3    +   OBJ
IN   for          7   -->  6    ]
VBG  selecting    8   -->  7    -
NN   text         9   -->  8    [   OBJ
.    .            10  -->  0    ]

#T18
DT   The       1   -->  5    [
"    "         2   -->  3    +
NN   View      3   -->  5    +
"    "         4   -->  3    +
NN   tab       5   -->  11   +
IN   in        6   -->  5    ]
NP   Word      7   -->  8    [
POS  's        8   -->  9    +
NNS  Options   9   -->  10   +
VBP  dialog    10  -->  6    ]
.    .         11  -->  0    -


The sentences are taken one each from the 3 files, Dynix, Lotus and Trados respectively. The first field in each line is the part of speech (see Marcus, Marcinkiewicz & Santorini, 1993, for an explanation of the notation symbols), the second field is the word, the third is the serial number of the word, and the fourth is an arrow denoting the attachment of the word to the fifth field, which is the serial number of its governor. For example, in the first sentence, the word number 1, "If", is attached to word number 10, "create". It is apparent that there is a one-to-one mapping from this scheme of representing the dependency structure to the dependency parse tree. As a convention, the end-of-sentence punctuation is attached to word number 0, which means that it does not depend on anything in the sentence. This scheme of coding the dependency structure of the sentence makes it easy for a human to annotate or verify; one just needs to edit the serial number of the governor of each word.

The sixth field of each line is the atomic noun phrase symbol associated with each location in the sentence. There are 5 noun phrase symbols, defined as follows:

[ : start of noun phrase.
] : end of noun phrase.
" : end and start of two adjacent noun phrases.
+ : inside the noun phrase.
- : outside the noun phrase.

A conventional and perhaps more intuitive view of these symbols is obtained by writing the sentence horizontally, shifting each symbol by half a word to the left, and omitting the + and -. For example: [ Ami Pro ] provides [ additional mouse shortcuts ] for selecting [ text ] .

The seventh field of each line is the argument of the predicate to which the word is attached. We use SUB to indicate the subject, OBJ the object, S_SUB the surface subject, L_SUB the logical subject, and I_OBJ the indirect object.

The dependency structure coded in this manner is equivalent to a dependency tree. Another way of representing the same structure is via the tuple (g, o), where g is the POS of the governor and o is the relative offset of the governor. In other words, instead of using the serial number, the governor of each word is represented as (g, o). We use (g, o) as the state of the eHMM when parsing the dependency structure.
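As an illustration of how directly this coding can be manipulated, the sketch below (hypothetical helper code, not part of DESPAR) reads lines in the format shown above, recovers each word's governor from the serial numbers, and re-expresses the structure as (g, o) tuples.

def parse_annotation(lines):
    """Split lines of the form: POS word serial --> governor np-symbol [argument]."""
    rows = []
    for line in lines:
        fields = line.split()
        pos, word, serial, _, gov = fields[:5]
        rows.append({"pos": pos, "word": word,
                     "serial": int(serial), "gov": int(gov),
                     "np": fields[5] if len(fields) > 5 else None,
                     "arg": fields[6] if len(fields) > 6 else None})
    return rows

def to_g_o_states(rows):
    """Re-code each word's governor as (g, o): governor POS and relative offset."""
    pos_of = {r["serial"]: r["pos"] for r in rows}
    states = []
    for r in rows:
        if r["gov"] == 0:                     # attached to word 0: depends on nothing
            states.append((None, 0))
        else:
            states.append((pos_of[r["gov"]], r["gov"] - r["serial"]))
    return states

fragment = ["NP Ami 1 --> 2 [",
            "NP Pro 2 --> 3 + SUB",
            "VBZ provides 3 --> 10 ]",
            ". . 10 --> 0 ]"]
print(to_g_o_states(parse_annotation(fragment)))
# [('NP', 1), ('VBZ', 1), ('.', 7), (None, 0)]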


11 Using the TOSCA Analysis System to Analyse a Software Manual Corpus

Nelleke Oostdijk
University of Nijmegen

(Address: Nelleke Oostdijk, Dept. of Language and Speech, University of Nijmegen, P.O. Box 9103, 6500 HD Nijmegen, The Netherlands. Tel: +31 24 3612765, Fax: +31 24 3615939, Email: [email protected]. The text in the introductory sections of this paper, describing the TOSCA analysis system, is an adapted version of Aarts, van Halteren and Oostdijk (1996). Thanks are due to Hans van Halteren for his help in preparing the final version of this paper.)

11.1 Introduction

The TOSCA analysis system was developed by the TOSCA Research Group for the linguistic annotation of corpora. It was designed to facilitate the creation of databases that could be explored for the purpose of studying English grammar and lexis. Such databases can serve a dual use. On the one hand they can provide material for descriptive studies, on the other they can serve as a testing ground for linguistic hypotheses. In either case, the annotation should meet the standard of the state of the art in the study of English grammar, and should therefore exhibit the same level of detail and sophistication. Also, the descriptive notions and terminology should be in line with state-of-the-art descriptions of the English language.

In the TOSCA approach, the linguistic annotation of corpora is viewed as a two-stage process. The first stage is the tagging stage, in which each word is assigned to a word class, while additional semantico-syntactic information may be provided in the form of added features as appropriate. The parsing stage is concerned with the structural analysis of the tagged utterances. Analysis is carried out by means of a grammar-based parser.


The analysis system is an ambitious one in two respects. Not only should the analysis results conform to the current standards in descriptive linguistics, but it is required that for each corpus sentence the database should contain only the one analysis that is contextually appropriate. It will be clear that whatever contextual (i.e. semantic, pragmatic and extra-linguistic) knowledge is called upon, input from a human analyst is needed. This is the major reason why the analysis system calls for interaction between computer and human analyst. However, for reasons of consistency, human input should be minimized. Since consistency is mainly endangered if the human analyst takes the initiative in the analysis process, it is better to have the linguist only react to prompts given by automatic processes, by asking him to choose from a number of possibilities presented by the machine. This happens in two places in the TOSCA system: once after the tagging stage (tag selection) and once after parsing (parse selection).

Finally, it should be pointed out that one requirement that often plays a role in discussions of automatic analysis systems has not been mentioned: the robustness of the system. In our view, in analysis systems aiming at the advancement of research in descriptive linguistics the principle of robustness should play a secondary role. A robust system will try to come up with some sort of analysis even for (often marginally linguistic) structures that cannot be foreseen. Precisely in such cases we think that it should not be left to the system to come up with an answer, but that control should be passed to the linguist.

So far the TOSCA analysis system has been successfully applied in the analysis of (mostly written, British English) corpus material that originated from a range of varieties, including fiction as well as non-fiction. Through a cyclic process of testing and revising, the formal grammar underlying the parser, which was initially conceived on the basis of knowledge from the literature as well as intuitive knowledge about the language, has developed into a comprehensive grammar of the English language. It is our contention that, generally speaking, for those syntactic structures that belong to the core of the language, the grammar has reached a point where there is little room for improvement. (Remaining lacunae in the grammar often concern linguistically more marginal structures. While their description is not always unproblematic, quite frequently it is also unclear whether the grammar should describe them at all; see Aarts, 1991.) However, as novel domains are being explored, structures are encountered that so far were relatively underrepresented in our material. It is especially with these structures that the grammar is found to show omissions. Since the system has not before been applied to the domain of computer manuals, and instructive types of text in general have been underrepresented in our test materials, the experiment reported on in this paper can be considered a highly informative exercise.


Characteristic    A    B    C    D    E    F    G
TOSCA            yes  yes  yes  yes  yes  yes  yes

Table 11.1: Linguistic characteristics which can be detected by TOSCA. See Table 11.2 for an explanation of the letter codes.

Code  Explanation
A     Verbs recognised
B     Nouns recognised
C     Compounds recognised
D     Phrase Boundaries recognised
E     Predicate-Argument Relations identified
F     Prepositional Phrases attached
G     Coordination/Gapping analysed

Table 11.2: Letter codes used in Tables 11.1 and 11.5.1.

11.2 Description of Parsing System

In the next section we first give an outline of the TOSCA analysis environment, after which two of its major components, viz. the tagger and the parser, are described in more detail.

11.2.1 The TOSCA Analysis Environment

As was observed above, the TOSCA system provides the linguist with an integrated analysis environment for the linguistic enrichment of corpora. In principle the system is designed to process a single utterance from beginning to end. The successive stages that are distinguished in the annotation process are: raw text editing, tokenization and automatic tagging, tag selection, automatic parsing, selection of the contextually appropriate analysis, and inspection of the final analysis. For each of these steps the linguist is provided with menu options, while there are further options for writing comments and for moving from one utterance to the next. However, it is not necessary to use the environment interface for all the steps in the process. Since the intermediate results have a well-defined format, it is possible to use other software for specific steps. A much used time-saver is a support program which starts with a complete sample in raw text form, tags it and splits it into individual utterances, so that the linguist can start at the tag selection stage.


During the tagging stage each word in the text is assigned a tag indicating its word class and possibly a number of morphological and/or semantico-syntactic features. The form of the tags is conformant with the constituent labels in the syntactic analysis and is a balance between succinctness and readability. The set of word classes and features is a compromise between what is needed by the parser and what was felt could be easily handled by linguists.

After the automatic tagger has done its work, the linguist can use the tag selection program to make the final selection of tags. He is presented with a two-column list, showing the words and the proposed tags. A mouse click on an inappropriate tag calls up a list of alternative tags for that word from which the correct tag can then be selected. In the case of ditto tags, selection of one part of the tag automatically selects the other part(s). If the contextually appropriate tag is not among the list of alternatives, it is possible to select (through a sequence of menus) any tag from the tag set. Since this is most likely to occur for unknown words, a special Top Ten list of the most frequent tags for such unknown words is also provided. The tag selection program can also be used to add, change or remove words in the utterance in case errors in the raw text are still discovered at this late stage. Finally, the program allows the insertion of syntactic markers (see below) into the tagged utterance.

During the automatic parsing stage, all possible parses for the utterance, given the selected tags and markers, are determined. The output of this stage, then, is a set of parse trees. Before they are presented to the linguist for selection, these parse trees are transformed into analysis trees which contain only linguistically relevant information. The human linguist then has to check whether the contextually appropriate analysis is present and mark it for selection. Since it is not easy to spot the difference between two trees and because the number of analyses is sometimes very large, it is impractical to present the trees one by one to the linguist. Instead, the set of analysis trees is collapsed into a shared forest. The differences between the individual trees are represented as local differences at specific nodes in the shared forest. This means that selection of a single analysis is reduced to a small number of (local) decisions.

For this selection process a customized version of the Linguistic DataBase program (see below) is used. The tree viewer of this program shows a single analysis, in which the selection points are indicated with tilde characters, as shown in Figure 11.1. The linguist can focus on these selection points and receives detailed information on the exact differences between the choices at these points, concerning function, category and attributes of the constituent itself and the functions, categories and boundaries of its immediate constituents. In most cases this information is sufficient for making the correct choice. Choices at selection points are never final; it is always possible to return to a selection point and change one's choice. Only when the current set of choices is pronounced to represent the contextually appropriate parse are the choices fixed.
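The shared forest idea can be pictured with a small sketch. The following is illustrative only - the names and the representation are invented for this example, and this is not the Linguistic DataBase program or its format: the analyses of an utterance are stored in one structure, the points at which they differ become explicit choice points, and selecting one analysis amounts to a few local decisions.

    # Illustrative sketch only: a shared forest in which competing analyses differ
    # at local choice points. Names and representation are invented for this example.
    from dataclasses import dataclass, field
    from typing import List, Optional, Union

    @dataclass
    class Node:
        function: str                                   # e.g. "SU", "V", "CS"
        category: str                                   # e.g. "NP", "VP", "AJP"
        children: List["Item"] = field(default_factory=list)
        word: str = ""                                  # filled in at the leaves

    @dataclass
    class ChoicePoint:
        """A selection point: alternative ways of analysing the same stretch of text."""
        alternatives: List[List["Item"]]
        chosen: Optional[int] = None                    # set by the linguist

    Item = Union[Node, ChoicePoint]

    def resolve(item, decisions):
        """Turn (part of) a shared forest into plain nodes by resolving each choice point."""
        if isinstance(item, ChoicePoint):
            item.chosen = decisions.pop(0)
            resolved = []
            for child in item.alternatives[item.chosen]:
                resolved.extend(resolve(child, decisions))
            return resolved
        node = Node(item.function, item.category, word=item.word)
        for child in item.children:
            node.children.extend(resolve(child, decisions))
        return [node]

    # Two readings of "He was worried about his father" (cf. Figure 11.1) share
    # everything except one constituent, so one local decision picks the analysis:
    forest = Node("UTT", "S", [
        Node("SU", "NP", word="He"),
        Node("V", "VP", word="was"),
        ChoicePoint([
            [Node("CS", "AJP", word="worried about his father")],
            [Node("CS", "AJP", word="worried"), Node("A", "PP", word="about his father")],
        ]),
    ])
    analysis = resolve(forest, [1])[0]   # choose the second reading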


[Figure 11.1: An analysis selection screen. The screen shows the utterance "He was worried about his father." with a selection point (marked by tilde characters) at the constituent following "was". The two candidate analyses offered at this point differ in whether "about his father" is taken as part of the subject complement (CS: worried_father) or as a separate adverbial (CS: worried, A: about_father); a command line at the bottom offers scroll, focus, ambiguity, view, help and exit options.]

The resulting analysis is stored in a binary format for use with the standard version of the Linguistic DataBase system (cf. van Halteren & van den Heuvel, 1990) and in an ASCII/SGML format for interchange with other database systems.

If an utterance cannot be parsed, either because it contains constructions which are not covered by the grammar or because its complexity causes time or space problems for the parser, a tree editor can be used to manually construct a parse. Restrictions within the tree editor and subsequent checks ensure that the hand-crafted tree adheres to the grammar as much as possible.

11.2.2 The Tagger

The first step of the tagging process is tokenization, i.e. the identification of the individual words in the text and the separation of the text into utterances. The tokenizer is largely rule-based, using knowledge about English punctuation and capitalization conventions. In addition, statistics about e.g. abbreviations and sentence-initial words are used to help in the interpretation of tokenization ambiguities. Where present, a set of markup symbols is recognized by the tokenizer and, if possible, used to facilitate tagging (e.g. the text unit separator <#>).
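As a rough illustration of this kind of rule-based utterance splitting - a sketch under simplifying assumptions, not the TOSCA tokenizer; the abbreviation list and the separator symbol are invented for the example - consider:

    # Sketch only: splitting text into utterances with crude punctuation and
    # capitalization rules plus an explicit text unit separator. The abbreviation
    # list and the separator symbol are assumptions made for this example.
    ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "cf.", "viz."}
    UNIT_SEPARATOR = "<#>"

    def split_utterances(text):
        tokens = text.split()
        utterances, current = [], []
        for i, tok in enumerate(tokens):
            if tok == UNIT_SEPARATOR:                  # markup symbol: forced boundary
                if current:
                    utterances.append(current)
                current = []
                continue
            current.append(tok)
            nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
            # A final period ends an utterance unless it belongs to a known
            # abbreviation or the next word is not capitalized.
            if tok.endswith(".") and tok.lower() not in ABBREVIATIONS \
                    and (not nxt or nxt[0].isupper()):
                utterances.append(current)
                current = []
        if current:
            utterances.append(current)
        return utterances

    # Two utterances, the second introduced by an explicit separator:
    print(split_utterances("Select the text. <#> Press ENTER to continue."))

In the actual system, as noted above, statistical information about abbreviations and sentence-initial words would arbitrate the cases that such crude rules cannot decide.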


Next, each word is examined by itself and a list of all possible tags for that word is determined. For most words, this is done by looking them up in a lexicon. In our system we do not have a distinct morphological analysis. Instead we use a lexicon in which all possible word forms are listed. The wordform lexicon has been compiled using such diverse resources as tagged corpora, machine readable dictionaries and expertise gained in the course of years. It currently contains about 160,000 wordform-tag pairs, covering about 90,000 wordforms.

Even with a lexicon of this size, though, there is invariably a non-negligible number of words in the text which are not covered. Rather than allowing every possible tag for such words (and shifting the problem to subsequent components), we attempt to provide a more restricted list of tags. This shortlist is based on specific properties of the word, such as the type of its first character (upper case, lower case, number, etc.) and the final characters of the word. For example, an uncapitalized word ending in -ly can be assumed to be a general adverb. The statistics on such property-tag combinations are based on suffix morphology and on the tagging of hapax legomena in corpus material.

The last step in the tagging process is the determination of the contextually most likely tag. An initial ordering of the tags for each word is given by the probability of the word-tag pair, derived from its frequency in tagged corpora. This ordering is then adjusted by examining the direct context, the possible tags of the two preceding and the two following words. The final statistical step is a Markov-like calculation of the 'best path' through the tags, i.e. the sequence of tags for which the compound probability of tag transitions is the highest. The latter two steps are both based on statistical information on tag combinations found in various tagged corpora. The choice of the most likely tag is not done purely by statistical methods, however. The final word is given to a rule-based component, which tries to correct observed systematic errors of the statistical components.

The tagset we employed in the analysis of the three computer manuals reported on here consists of around 260 tags, with a relatively high degree of ambiguity: when we compare our tagset to other commonly used tagsets such as the Brown tagset (which has only 86 tags), we find that with the Brown tagset the number of tags for a given word ranges from 1 to 7 and fully 60% of the words or tokens in the text appear to be unambiguous, while with our tagset the number of tags ranges from 1 to 33 and only 29% of the words are unambiguous.

11.2.3 The Parser

The TOSCA parser is grammar-based, i.e. the parser is derived from a formal grammar. The rules of the grammar are expressed in terms of rewrite rules, using a type of two-level grammar, viz. Affix Grammar over Finite Lattices (AGFL) (Nederhof & Koster, 1993).


This formalism and the parser generator that is needed for the automatic conversion of a grammar into a parser were developed at the Computer Science Department of the University of Nijmegen.[4] The parser is a top-down left corner recursive backup parser.

In our experience, linguistic rules can be expressed in the AGFL formalism rather elegantly: the two levels in the grammar each play a distinctive role and contribute towards the transparency of the description and resulting analyses. An outline of the overall structure is contained in the first level rules, while further semantico-syntactic detail is found on the second level. Thus generalizations remain apparent and are not obscured by the large amount of detail. Some example rules in AGFL are given in Figure 11.2.

    c VP VERB PHRASE (cat operator, complementation, finiteness, INDIC, voice):
       f OP OPERATOR (cat operator, finiteness, INDIC),
       n FURTHER VERBAL OPTIONS (cat operator, complementation, INDIC, voice1),
       n establish voice (cat operator, voice, voice1).

    n FURTHER VERBAL OPTIONS (cat operator, INTR, INDIC, ACTIVE): .

    n FURTHER VERBAL OPTIONS (cat operator, complementation, INDIC, voice):
       n A ADVERBIAL OPTION,
       n next aux expected (cat operator, next cat),
       n expected finiteness (cat operator, finiteness),
       f AVB AUXILIARY VERB (next cat, finiteness, INDIC),
       n FURTHER VERBAL OPTIONS (next cat, complementation, INDIC, voice1),
       n establish voice (next cat, voice1, voice);

       n A ADVERBIAL OPTION,
       n expected finiteness (cat operator, finiteness),
       f MVB MAIN VERB (complementation1, finiteness, INDIC, voice),
       n establish voice (cat operator, ACTIVE, voice),
       n reduced compl when passive (voice, complementation1, complementation).

Figure 11.2: Some example rules in AGFL.[5]

[4] AGFL comes with a Grammar Workbench (GWB), which supports the development of grammars, while it also checks their consistency. The AGFL formalism does not require any special hardware. The parser generator, OPT, is relatively small and runs on regular SPARC systems and MS-DOS machines (386 and higher). Required hard disk space on the latter type of machine is less than 1 MB. AGFL was recently made available via FTP and WWW. The address of the FTP site is ftp://hades.cs.kun.nl/pub/agfl/ and the URL of the AGFL home page is http://www.cs.kun.nl/agfl/
[5] As parse trees are transformed into analysis trees, the prefixes that occur in these rules are used to filter out the linguistically relevant information. The prefix f identifies the function of a constituent, c its category, while n is used to indicate any non-terminals that should not be included in the analysis tree.
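By way of a rough analogy only - this is not AGFL syntax and not the TOSCA grammar, and both the auxiliary ordering and the names are simplified assumptions - the effect of predicate rules such as 'next aux expected' and 'establish voice' in Figure 11.2 can be pictured as a check on chains of verbal elements:

    # Toy illustration only, not AGFL and not the TOSCA grammar: affix-like values
    # constrain which verbal element may follow which, and a "voice" value is
    # established along the way. The ordering below is a simplified assumption.
    NEXT_ALLOWED = {
        "modal": ["perf", "prog", "pass", "main"],
        "perf":  ["prog", "pass", "main"],
        "prog":  ["pass", "main"],
        "pass":  ["main"],
    }

    def establish_voice(cat, incoming_voice):
        # A passive auxiliary makes the phrase passive; otherwise the voice is kept.
        return "passive" if cat == "pass" else incoming_voice

    def verb_phrase_voice(chain):
        """Accept a chain of verbal elements such as ["modal", "pass", "main"] only if
        every transition is allowed (the role played here by 'next aux expected'), and
        return the resulting voice ('establish voice'); None means the chain is rejected."""
        voice = "active"
        for current, nxt in zip(chain, chain[1:]):
            if nxt not in NEXT_ALLOWED.get(current, []):
                return None
            voice = establish_voice(nxt, voice)
        return voice

    print(verb_phrase_voice(["modal", "pass", "main"]))   # "passive", e.g. "may be selected"
    print(verb_phrase_voice(["pass", "modal", "main"]))   # None: ordering violated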


In these rules the first level describes the (indicative) verb phrase in terms of an operator which may be followed by further verbal elements, i.e. auxiliary verbs and/or a main verb. The first level description is augmented with the so-called affix level. The affixes that are associated with the verb phrase record what type of auxiliary verb realizes the function of operator (e.g. modal, perfective, progressive or passive), what complementation (objects and/or complements) can be expected to occur with the verb phrase, whether the verb phrase is finite or non-finite, and whether it is active or passive. The predicate rules that are given in small letters (as opposed to the other first level rules, for which capital letters are used) are rules that are used to impose restrictions on or to effect the generation or analysis of a particular affix value elsewhere. For example, the predicate rule 'next aux expected' describes the co-occurrence restrictions that hold with regard to subsequent auxiliary verbs.

The objective of the formalized description is to give a full and explicit account of the structures that are found in English, ideally in terms of notions that are familiar to most linguists. As such, the formal grammar is interesting in its own right. The descriptive model that is employed in the case of the TOSCA parser is based on that put forward by Aarts and Aarts (1982), which is closely related to the system found in Quirk, Greenbaum, Leech and Svartvik (1972). This system is based on immediate constituent structure and the rank hierarchy. Basically, the three major units of description are the word, the phrase and the clause/sentence. As was said above, words are assigned to word classes and provided with any number of features, which may be morphological, syntactic or semantic in character. They form the building blocks of phrases; in principle, each word class can function as the head of a phrase. Every unit of description that functions in a superordinate constituent receives a function label for its relation to the other elements within that constituent. On the level of the phrase we find function labels like head and modifier, on the level of the clause/sentence we find sentence functions like subject and object. The relational concepts that are expressed in function labels are essentially of three types: subordination and superordination (e.g. in all headed constituents), government (e.g. in prepositional phrases and in verb complementation), and concatenation (in compounding on word level, in apposition and in coordination).

The analysis result is presented to the linguist in the form of a labelled tree. Unlike the tree diagrams usually found in linguistic studies, the trees grow from left to right. With each node at least function and category information is associated, while for most nodes more detailed semantico-syntactic information is provided as well. An example analysis is given in Figure 11.3.


[Figure 11.3: Example analysis. The figure shows the left-to-right analysis tree for the utterance "'It's wonderful how Friary gets back on the job,' he said, by way of finding some casual remark.", with function-category labels such as SU,NP and V,VP at the nodes, further attributes in lower case, and the lexical elements at the leaves.]

The information that is found at the nodes of the tree is as follows: function-category pairs of labels are given in capital letters (e.g. "SU,NP" describes the constituent as a subject that is realized by a noun phrase), while further detail is given in lower case. Lexical elements are found at the leaves of the tree.
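This labelling convention can be mirrored in a small sketch - an invented representation for illustration, not the Linguistic DataBase format: each node carries a function-category pair in capitals plus lower-case attribute features, and lexical elements appear only at the leaves.

    # Invented representation for illustration; not the Linguistic DataBase format.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class AnalysisNode:
        function: str                                         # e.g. "SU", "NPHD"
        category: str                                         # e.g. "NP", "N"
        attributes: List[str] = field(default_factory=list)   # e.g. ["prop", "sing"]
        children: List["AnalysisNode"] = field(default_factory=list)
        word: str = ""                                        # only at the leaves

        def label(self):
            attrs = "(" + ",".join(self.attributes) + ")" if self.attributes else ""
            return f"{self.function},{self.category}{attrs}"

    # "Friary" as the head of the subject noun phrase in Figure 11.3:
    subject = AnalysisNode("SU", "NP", children=[
        AnalysisNode("NPHD", "N", ["prop", "sing"], word="Friary"),
    ])
    print(subject.label())              # SU,NP
    print(subject.children[0].label())  # NPHD,N(prop,sing)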


The present grammar comprises approximately 1900 rewrite rules (in AGFL) and has an extensive coverage of the syntactic structures that occur in the English language. It describes structures with unmarked word order, as well as structures such as cleft, existential and extraposed sentences, verbless clauses, interrogatives, imperatives, and clauses and sentences in which subject-verb inversion occurs and/or in which an object or complement has been preposed. Furthermore, a description is provided for instances of direct speech, which includes certain marked and highly elliptical clause structures, enclitic forms, as well as some typical discourse elements (e.g. formulaic expressions, connectives, reaction signals). A small number of constructions has not yet been accounted for. These are of two types. First, there are constructions the description of which is problematic given the current descriptive framework. An example of this type of construction is constituted by clauses/sentences showing subject-to-object [1] or object-to-object raising [2].

[1] Who do you think should be awarded the prize?
[2] Who do you think they should award the prize?

The traditional analysis of these structures is arrived at by postulating some kind of deep structure. In terms of the present descriptive framework, however, only one clausal level is considered at a time, while the function that is associated with a constituent denotes the role that this constituent plays within the current clause or sentence. The analysis of [1] is problematic since Who, which by a deep structure analysis can be shown to be the subject of the embedded object clause, occurs as a direct object in the matrix clause or sentence. In [2] Who must be taken to be the indirect object of award in the object clause. On the surface, however, it appears as a direct object of think in the matrix clause. Second, there are constructions that occur relatively infrequently and whose description can only be given once we have gained sufficient insight into their nature, form and distribution of occurrence, or which, so far, have simply been overlooked. In handbooks on English grammar the description of these structures often remains implicit or is omitted altogether.

The parser is designed to produce at least the contextually appropriate analysis for each utterance. But apart from the desired analysis, additional analyses may be produced, each of which is structurally correct but which in the given context cannot be considered appropriate. Therefore, parse selection constitutes a necessary step in the analysis process. In instances where the parser produces a single analysis the linguist checks whether this is indeed the contextually appropriate one, while in the case of multiple analyses the linguist must select the one he finds appropriate. Although it has been a principled choice to yield all possible analyses for a given utterance, rather than merely yielding the one (statistically or otherwise) most probable analysis, the overgeneration of analyses has in practice proved to be rather costly.


Therefore, in order to have the parser operate more efficiently (in terms of both computer time and space) as well as to facilitate parse selection, a number of provisions have been made to reduce the amount of ambiguity. Prior to parsing, the boundaries of certain constituents must be indicated by means of syntactic markers. This is partly done by hand, partly automatically. For example, the linguist must insert markers for constituents like conjoins, appositives, parenthetic clauses, vocatives and noun phrase postmodifiers. As a result, certain alternative parses are prohibited and will not play a role in the parsing process. In a similar fashion an automatic lookahead component contributes to the efficiency of the parsing process. In the tagged and syntactically marked utterance, lookahead symbols are inserted automatically. These are of two types: they either indicate the type of sentence (declarative, interrogative, exclamatory or imperative), or they indicate the potential beginnings of subclauses. In effect these lookahead symbols tell the parser to bypass parts of the grammar. Since, in spite of the nonambiguous tag assignment of an utterance, its syntactic marking and the insertion of lookahead symbols, the analysis result may still be ambiguous, a (rule-based) filter component has been developed which facilitates parse selection by filtering out (upon completion of the parsing process) intuitively less likely analyses. For a given utterance, for example, analyses in which marked word order has been postulated are discarded automatically when there is an analysis in which unmarked word order is observed.

The selection of the contextually appropriate analysis for a given utterance is generally fairly straightforward. Moreover, since a formal grammar underlies the parser, consistency in the analyses is also warranted, that is, up to a point: in some instances, equally appropriate analyses are offered. It is with these instances that the linguist has to be consistent in the selection he makes from one utterance to the next. For example, the grammar allows quantifying pronouns to occur as predeterminer, but also as postdeterminer. As predeterminers they precede items like the article, while as postdeterminers they typically follow the article. While there is no ambiguity when the article is present, a quantifying pronoun by itself yields an ambiguous analysis.

11.3 Parser Evaluation Criteria

For the IPSM'95 workshop that was held at the University of Limerick in May 1995 participants were asked to carry out an analysis of the IPSM'95 Corpus of Software Manuals. The corpus comprised some 600 sentences. The material was to be subjected to analysis under varying circumstances.


In the experiment three phases were distinguished and for each phase the material had to be re-analysed. During phase I of the experiment the system made use of its original grammar and vocabulary, while in phases II and III changes to the lexicon and grammar were permitted. The findings of the participants were discussed at the workshop. In order to facilitate comparisons between the different systems, for the present paper participants were instructed to report on their findings on the basis of only a small subset (60 sentences) of the original corpus. In our view the number of sentences to be included is unfortunately small: as our results show, the success rate is very much influenced by chance. While on the basis of the subset of 60 sentences we can claim the success rate to be 91.7% (on average), the results over the full corpus are less favourable (88.3% on average). Therefore, in order to provide a more accurate account of the system and its performance, we have decided to include not only our findings for the subset of 60 sentences, but also those for the full corpus.

The structure of the remainder of this paper, then, is as follows: Sections 11.4, 11.5 and 11.6 describe the procedure that was followed in each of the different phases of the experiment. In Section 11.7 a description is given of the way in which output from the TOSCA system can be converted to a dependency structure. A summary of our findings in the experiment is given in Section 11.8. Section 11.9 lists the references.

11.4 Analysis I: Original Grammar, Original Vocabulary

For the first analysis of the material we ran the TOSCA system without any adaptations on an MS/DOS PC with a 486DX2/66 processor and 16 Mb of RAM. Each of the three samples (hereafter referred to as LOTUS, DYNIX and TRADOS) was tagged and then further processed. As a preparatory step to the tagging stage, we inserted text unit separators to help the tokenizer to correctly identify the utterance boundaries in the texts.[6] This action was motivated by the observation that the use of punctuation and capitalization in these texts does not conform to common practice: in a great many utterances there is no final punctuation, while the use of quotes and (esp. word-initial) capital letters appears to be abundant. These slightly adapted versions of the raw texts were then submitted to the tagger. As a result of the (automatic) tagging stage the texts were divided into utterances, and with each of the tokens in an utterance the contextually most likely tag was associated.

[6] In fact, at the beginning of each new line in the original text a text unit separator was inserted.


         Number  Accept  Reject  % Accept  % Reject
Dynix      20      20       0      100.0      0.0
Lotus      20      18       2       90.0     10.0
Trados     20      20       0      100.0      0.0
Total      60      58       2       96.7      3.3

Table 11.3.1: Phase I acceptance and rejection rates for TOSCA. The machine used was a 486DX2/66 MS/DOS PC. The figures presented here are the number of utterances for which the system produces an analysis (not necessarily appropriate).

         Total Time     Average Time    Average Time
         to Parse (s)   to Accept (s)   to Reject (s)
Dynix         258           12.9
Lotus        1065           59.1             0.5
Trados        673           33.7
Total        1996           34.4             0.2

Table 11.4.1: Phase I parse times for TOSCA. The machine used was a 486DX2/66 MS/DOS PC with 1.6 Mb of memory.

Char.      A     B     C     D     E     F     G    Avg.
Dynix    100%  100%  100%  100%  100%  100%  100%   100%
Lotus     85%   92%   94%   91%   88%   92%  100%    92%
Trados   100%  100%  100%  100%  100%  100%  100%   100%
Avg.      95%   97%   98%   97%   96%   97%  100%    97%

Table 11.5.1: Phase I analysis of the ability of TOSCA to recognise certain linguistic characteristics in an utterance. For example, the column marked 'A' gives for each set of utterances the percentage of verbs occurring in them which could be recognised. The full set of codes is itemised in Table 11.2.

A characterization of the three texts in terms of the language variety exemplified, the number of words, tokens, and utterances and the mean utterance length is given in Table 11.6.2. The material in the selected subset can be characterized as in Table 11.6.1.

Since the parser will only produce the correct analysis (i.e. the contextually appropriate one) when provided with non-ambiguous and fully correct input, tag correction constitutes a necessary next step in the analysis process.


         Number  Accept  Reject  % Accept  % Reject
Dynix      20      20       0      100.0      0.0
Lotus      20      17       2       85.0     10.0
Trados     20      18       0       90.0      0.0
Total      60      55       2       91.7      3.3

Table 11.3.1a: Phase I acceptance and rejection rates for TOSCA. The machine used was a 486DX2/66 MS/DOS PC. The figures presented here are the number of sentences for which the contextually appropriate analysis is produced.

Therefore, after the texts had been tagged, we proceeded by checking - and where necessary correcting - the tagging of each utterance and inserting the required syntactic markers. As had been expected, a great many utterances required tag correction and syntactic marker insertion. With regard to the need for tag correction, the texts did not greatly differ: for approximately 20-25 per cent of the utterances a fully correct tagging was obtained, while in the remainder of the utterances minimally one tag needed to be corrected.[7] Syntactic marker insertion was required with the majority of utterances. In this respect the LOTUS text was least problematic: in 35 per cent of the utterances no syntactic markers were required (cf. the DYNIX and the TRADOS texts, in which 32.5 and 25 per cent respectively could remain without syntactic markers). More interesting, however, are the differences between the texts when we consider the nature of the syntactic markers that were inserted. While in all three texts coordination appeared to be a highly frequent phenomenon, which explains the frequent insertion of the conjoin marker, in the LOTUS text only one other type of syntactic marker was used, viz. the end-of-noun-phrase-postmodifier marker. In the DYNIX and TRADOS texts syntactic markers were also used to indicate apposition and the occurrence of the noun phrase as adverbial.

After an utterance has been correctly tagged and the appropriate syntactic markers have been inserted, it can be submitted to the parser. As the analyst hands back control to the system, the words are stripped from the utterance and the tags together with the syntactic markers are put to the automatic lookahead component, which inserts two types of lookahead symbol: one that indicates the type of the utterance (d = declarative, m = imperative), and one that indicates possible beginnings of (sub)clauses (#). Upon completion of the insertion of lookahead symbols, the parser is then called upon. In Figure 11.4 an example is given of an utterance as it occurs (1) in its original format and (2) in its tagged and syntactically marked format, including the lookahead symbols that were added by the automatic lookahead component.

[7] The success rate reported here is by TOSCA standards: a tag is only considered to be correct if it is the one tag that we would assign. All other taggings are counted as erroneous, even though they may be very close to being correct (as for example the tagging of a noun as N(prop, sing) instead of N(com, sing), or vice versa), or subject to discussion (monotransitive vs. complex transitive). The major problem for the tagger was constituted by the compound tokens that occurred in the texts. The list of terms provided here was of little use since it contained many items that in terms of our descriptive model are not words but (parts of) phrases.
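Figure 11.4 below shows an actual utterance in this input format. As a rough illustration of the transformation just described - a sketch with hypothetical names, not the TOSCA implementation - the step can be pictured as follows:

    # Sketch with hypothetical names, not the TOSCA implementation: strip the words
    # from a tagged, syntactically marked utterance and add the lookahead symbols
    # before handing the result to the parser.
    def to_parser_input(tagged, utterance_type="d", subclause_starts=()):
        """tagged: list of (word, tag-or-marker) pairs, markers having an empty word;
        utterance_type: 'd' (declarative) or 'm' (imperative);
        subclause_starts: positions at which a (sub)clause may begin ('#')."""
        symbols = [utterance_type]
        for position, (word, tag) in enumerate(tagged):
            if position in subclause_starts:
                symbols.append("#")
            symbols.append(tag)                    # the word itself is discarded
        return " ".join(symbols)

    # Loosely modelled on the opening of Figure 11.4 ("The level you specify determines ..."):
    tagged = [
        ("The", "ART(def)"), ("level", "N(com,sing)"),
        ("you", "PRON(pers)"), ("specify", "V(montr,pres)"),
        ("", "MARK(enppo)"),
        ("determines", "V(montr,pres)"),
    ]
    print(to_parser_input(tagged, "d", subclause_starts={2}))
    # d ART(def) N(com,sing) # PRON(pers) V(montr,pres) MARK(enppo) V(montr,pres)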


                     Lotus           Dynix           Trados
Language Variety     Am. English     Am. English     Eur. English
                     with Am.        with Am.        with Am.
                     Spelling        Spelling        Spelling
No. of Words         256             293             340
No. of Tokens        296             340             386
No. of Utterances    20              20              20
Mean Utt. Length
(In No. of Tokens)   14.8            17.0            19.3

Table 11.6.1: Characterization of the texts (subset of 60 sentences).

                     Lotus           Dynix           Trados
Language Variety     Am. English     Am. English     Eur. English
                     with Am.        with Am.        with Am.
                     Spelling        Spelling        Spelling
No. of Words         2477            2916            3609
No. of Tokens        2952            3408            4221
No. of Utterances    200             200             207
Mean Utt. Length
(In No. of Tokens)   14.7            17.0            20.4

Table 11.6.2: Characterization of the texts (original IPSM'95 Corpus).

11.4.1 Efficacy of the Parser

In the light of our experiences with the analysis of other types of text, the efficacy of the parser with this particular text type was somewhat disappointing. While the success rate (i.e. the percentage of utterances for which a contextually appropriate analysis is yielded) for fiction texts ranges from 94 to 98 per cent, the overall success rate with the three texts under investigation is 88.3% on average, ranging from 85 per cent for the DYNIX text to 91.5 per cent for the LOTUS text. Note that if we only take the subset into account, the success rate is 91.7 per cent on average, ranging from 85 per cent for the LOTUS text to 100 per cent for the DYNIX text. A breakdown of the analysis results is given in Table 11.7.2 (for the original, full IPSM'95 Corpus) and Table 11.7.1 (for the subset of 60 utterances).


Original format of the utterance:

    The level you specify determines the number of actions or levels Ami Pro can reverse.

Input format for the parser:

    d ART(def) N(com,sing) # PRON(pers) V(montr,pres) MARK(enppo) V(montr,pres)
    # ART(def) N(com,sing) PREP(ge) MARK(bcj) N(com,plu) MARK(ecj) CONJUNC(coord)
    MARK(bcj) N(com,plu) MARK(ecj) MARK(enppo) # N(prop,sing) AUX(modal,pres)
    V(montr,infin) MARK(enppo) PUNC(per)

Figure 11.4: Example utterance. For explanation of codes see Figure 11.5.

    ART(def)           definite article
    AUX(modal, pres)   modal auxiliary, present tense
    CONJUNC(coord)     coordinating conjunction
    N(com, sing)       singular common noun
    N(prop, sing)      singular proper noun
    PREP(ge)           general preposition
    PRON(pers)         personal pronoun
    PUNC(per)          punctuation, period
    V(montr, pres)     monotransitive verb, present tense
    V(montr, infin)    monotransitive verb, infinitive
    MARK(bcj)          beginning-of-conjoin marker
    MARK(ecj)          end-of-conjoin marker
    MARK(enppo)        end-of-noun-phrase postmodifier marker

Figure 11.5: Explanation of codes used in Figure 11.4. With the tags, capitalized abbreviations indicate word class categories, while features are given between brackets (using small letters). The syntactic markers are labelled MARK, while their nature is indicated by means of the information given between brackets.

As is apparent from the breakdown in Table 11.7.2, the success rate, especially in the case of the DYNIX and TRADOS texts, is very much negatively influenced by the percentage of utterances for which the parsing stage did not yield a conclusive result.


                  Lotus            Dynix            Trados
# Analyses     # Utts  % of     # Utts  % of     # Utts  % of
                       Utts             Utts             Utts
Parse Failure     2    10.0        0     0.0        0     0.0
Erroneous         1     5.0        0     0.0        2    10.0
Inconclusive      0     0.0        0     0.0        0     0.0
1                 5    25.0        7    35.0        3    15.0
2                 5    25.0        3    15.0        4    20.0
3                 0     0.0        2    10.0        1     5.0
4                 2    10.0        2    10.0        2    10.0
5                 1     5.0        0     0.0        1     5.0
6                 0     0.0        0     0.0        1     5.0
> 6               4    20.0        6    30.0        6    30.0

Table 11.7.1: Breakdown of analysis results (success rate and degree of ambiguity; subset of 60 sentences).

                  Lotus            Dynix            Trados
# Analyses     # Utts  % of     # Utts  % of     # Utts  % of
                       Utts             Utts             Utts
Parse Failure     9     4.5       13     6.5        5     2.4
Erroneous         6     3.0        5     2.5        8     3.9
Inconclusive      2     1.0       12     6.0       11     5.3
1                50    25.0       71    35.5       52    25.1
2                53    26.5       23    11.5       40    19.3
3                 7     3.5        9     4.5        8     3.9
4                15     7.5       16     8.0       17     8.2
5                 5     2.5        1     0.5        5     2.4
6                 9     4.5       13     6.5        8     3.9
> 6              44    22.0       37    18.5       53    25.6

Table 11.7.2: Breakdown of analysis results (success rate and degree of ambiguity; original IPSM'95 Corpus).

For up to 6 per cent of the utterances (in the DYNIX text) parsing had to be abandoned after the allotted computer time or space had been exhausted.[8] If we were to correct the success rate with the percentage of utterances for which no conclusive result could be obtained under the present conditions, assuming that a PC with a faster processor and more disk space would alleviate our problems, the result is much more satisfactory, as is shown in Table 11.8.

[8] Here it should be observed that the problem did not occur while parsing the utterances of the subset.
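The degree-of-ambiguity summaries reported below (Tables 11.9.1 and 11.9.2) follow directly from breakdowns such as Table 11.7.1. The following minimal sketch shows the tabulation for the Lotus column of the subset, with the figures taken from Table 11.7.1:

    # Deriving cumulative degree-of-ambiguity figures (cf. Table 11.9.1) from the
    # Lotus subset column of Table 11.7.1. Keys are numbers of analyses per
    # utterance (or a failure category); values are utterance counts.
    lotus_breakdown = {"fail": 2, "erroneous": 1, "inconclusive": 0,
                       1: 5, 2: 5, 3: 0, 4: 2, 5: 1, 6: 0, ">6": 4}

    def cumulative_percentage(breakdown, up_to):
        total = sum(breakdown.values())                       # 20 utterances
        covered = sum(breakdown.get(n, 0) for n in range(1, up_to + 1))
        return 100.0 * covered / total

    print(cumulative_percentage(lotus_breakdown, 1))   # 25.0 (single analysis)
    print(cumulative_percentage(lotus_breakdown, 2))   # 50.0 (one or two analyses)
    print(cumulative_percentage(lotus_breakdown, 3))   # 50.0 (one to three analyses)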


              Lotus    Dynix    Trados
TOSCA(1)      91.5%    85.0%    88.4%
TOSCA(2)      92.5%    91.5%    93.7%

Table 11.8: Success rate in parsing the texts (original IPSM'95 Corpus; TOSCA(1) gives the strict success rate, while TOSCA(2) gives the corrected success rate).

The degree of ambiguity of the three texts appears much higher than that observed in parsing other types of text. For example, with fiction texts on average for approximately 48 per cent of the utterances the parser produces a single analysis, some 63 per cent of the utterances receive one or two analyses, and 69 per cent receive up to three analyses. For the computer manual texts these figures are as given in Tables 11.8 and 11.9.1.

An examination of the utterances for which the parser failed to yield an analysis revealed the following facts:

- All parse failures in the LOTUS text and half the failures in the DYNIX text could be ascribed to the fact that the grammar underlying the parser does not comprise a description of structures showing raising. Typical examples are:

  [3] Move the mouse pointer until the I-beam is at the beginning of the text you want to select.
  [4] Select the text you want to move or copy.
  [5] The type of search you wish to perform.

- The parser generally fails to yield an analysis for structures that do not constitute proper syntactic categories in terms of the descriptive model that is being employed; occasionally the parser will come up with an analysis, which, as is to be expected, is always erroneous. For example, in the DYNIX text the parser failed in the analysis of the following list items:

  [6] From any system menu by entering "S" or an accelerated search command.
  [7] From the item entry prompt in Checkin or Checkout by entering ".S".

- Apart from the two types of structure described above, there did not appear to be any other systematic cause for failure. Especially in the TRADOS text it would seem that parse failure should be ascribed to omissions in the grammar, more particularly where the description of apposition is concerned (in combination with coordination).


No. of Analyses    Lotus    Dynix    Trados
Single             25.0%    35.0%    15.0%
One or two         50.0%    50.0%    35.0%
One to three       50.0%    60.0%    40.0%

Table 11.9.1: Degree of ambiguity (subset of 60 sentences).

No. of Analyses    Lotus    Dynix    Trados
Single             25.0%    35.5%    25.1%
One or two         51.5%    37.0%    44.4%
One to three       55.0%    41.5%    48.3%

Table 11.9.2: Degree of ambiguity (original IPSM'95 Corpus).

The percentage of utterances for which the parser did not yield the contextually appropriate analysis (i.e. where only erroneous analyses were produced) was relatively high when compared to our experiences with other types of text. On examination we found that this was only in part due to omissions in the grammar. A second factor appeared to be overspecification: in an attempt to reduce the amount of ambiguity as much as possible, a great many restrictions were formulated with regard to the co-occurrence of consecutive and coordinating categories. Some of these now proved to be too severe. An additional factor was constituted by the filter component that comes into operation upon completion of the parsing process and which, as was explained above, serves to automatically filter out intuitively less likely analyses. In a number of instances the parser would yield the correct analysis, but this would then be discarded in favour of one or more analyses that, at least by the assumptions underlying the filter component, were considered to be more probable.

Omissions in the grammar were found to include the following:

- The grammar does not describe the possible realization of the prepositional complement by means of a wh-clause; this explains the erroneous analysis of utterances such as [8] and [9]:

  [8] The following are suggestions on how to proceed when using the Translator's Workbench together with Word for Windows 6.0.
  [9] This has an influence on how your Translation Memory looks and how the target-language sentences are transferred from Translation Memory to the yellow target field in WinWord 6.0 in the case of a fuzzy match.


             Lotus            Dynix            Trados
# CPU      # Utts  % of     # Utts  % of     # Utts  % of
Secs               Utts             Utts             Utts
t ≤ 5         15    75.0       15    75.0       14    70.0
5 ≤ t ≤ 10     1     5.0        0     0.0        1     5.0
10 ≤ t ≤ 15    0     0.0        1     5.0        1     5.0
15 ≤ t ≤ 20    1     5.0        1     5.0        0     0.0
20 ≤ t ≤ 25    0     0.0        1     5.0        0     0.0
25 ≤ t ≤ 30    0     0.0        0     0.0        0     0.0
t > 30         3    15.0        2    10.0        4    20.0

Table 11.10.1: Breakdown of parsing times (subset of 60 sentences).

             Lotus            Dynix            Trados
# CPU      # Utts  % of     # Utts  % of     # Utts  % of
Secs               Utts             Utts             Utts
t ≤ 5        114    57.0      110    55.0      114    55.1
5 ≤ t ≤ 10    28    14.0       17     8.5       24    11.6
10 ≤ t ≤ 15    7     3.5       12     6.0        9     4.4
15 ≤ t ≤ 20    5     2.5        5     2.5        4     1.9
20 ≤ t ≤ 25    2     1.0        3     1.5        9     4.4
25 ≤ t ≤ 30    8     4.0        1     0.5        4     1.9
t > 30        38    19.0       52    26.0       43    20.8

Table 11.10.2: Breakdown of parsing times (original IPSM'95 Corpus).

- The grammar does not describe the realization of an object complement by means of a wh-clause; this explains the erroneous analysis of utterances such as [10]:

  [10] Place the insertion point where you want to move the text.

A typical example of an utterance for which only an erroneous analysis was yielded as a result of overspecification is given in [11]:

  [11] This is a sample sentence that should be translated, formatted in the paragraph style "Normal".

While the grammar describes relative clauses and zero clauses as possible realizations of the function noun phrase postmodifier, restrictions have been formulated with regard to the co-occurrence of consecutive postmodifying categories.


One of the assumptions underlying these restrictions is that zero clauses always precede relative clauses, an assumption which here is shown to be incorrect.

The filter component typically failed with utterances for which it was assumed that an analysis postulating a sentence as the category realizing the utterance was more probable than one in which the utterance was taken to be realized by a prepositional phrase. For example, while on the basis of the rules contained in the grammar the correct analysis was produced for the utterance in [12], it was discarded and only the erroneous analysis remained.

  [12] From any system menu that has "Search" as a menu option.

11.4.2 Efficiency of the Parser

The parsing times recorded in Tables 11.10.2 and 11.10.1 are the times it took the parser to parse the utterance after it had been properly tagged and syntactically marked.

At first glance there do not appear to be any major differences between the three texts: the proportion of utterances for which a result is yielded within 5 seconds is similar for all three texts. However, the proportion of utterances for which it takes the parser more than 30 seconds to produce a result is much higher in the DYNIX text than it is in the other two texts (26 per cent vs. 19.0 and 20.8 per cent respectively). As we saw above, the DYNIX text differs from the other two texts in other respects as well: it has the highest percentage of utterances for which no conclusive result is obtained, while the percentage of successfully parsed utterances that receive a single analysis stands out as well. It would appear that an explanation for the observed differences may be sought in the fact that the length of utterances in the DYNIX text varies a great deal, a fact which does not emerge from the mean length of the utterances recorded above (cf. Table 11.6.2). Although the relationship between the length of utterances on the one hand and the success rate and efficiency in parsing on the other is not straightforward, the complexity of utterances is generally found to be greater with longer utterances, so that even when the analysis result is not extremely ambiguous, the amount of ambiguity that plays a role during the parsing process may be problematic.

11.4.3 Results

This section will present the linguistic characteristics of the TOSCA system and the results based on the subset of the original corpus. As was described in Section 11.2.3, the TOSCA parser is based upon a formal grammar.


This grammar has an extensive coverage and describes most linguistic structures. The descriptive model is essentially a constituency model, in which the word, the phrase and the clause/sentence form the major units of description. Each constituent is labelled for both its function and its category. The relational concepts that are expressed in function labels are of three types: subordination and superordination, government, and concatenation. Table 11.1 lists some of the constituents that the parser can in principle recognise.

The following observations are in order:

1. The TOSCA system employs a two-stage analysis model in which a tagging stage precedes the parsing stage. Each tag that results from the (automatic) tagging is checked and if necessary corrected before the parser is applied. Therefore, while the parser recognises verbs, nouns and compounds (they occur after all as terminal symbols in the formal grammar from which the parser is derived), any ambiguity that arises at the level of the word (token) is actually resolved beforehand, during the tagging stage and subsequent tag selection.

2. PP-attachment and the analysis of coordinations and instances of gapping are not problematic in the TOSCA system, due to the fact that the user of the system must insert syntactic markers with certain constituents. Thus the conjoins in coordinations are marked, as are prepositional phrases (and indeed all categories) that function as noun phrase postmodifiers.

The TOSCA parser is fairly liberal. We permit the parser to overgenerate, while at the same time we aim to produce for any given (acceptable) utterance at least the contextually appropriate analysis. For the present subset, we are not entirely successful (cf. Table 11.3.1). As many as 58 of the 60 utterances (96.7%) receive some kind of analysis; for 55 utterances (91.7%) the contextually appropriate analysis is actually present (cf. Table 11.3.1a).

The two utterances for which the parser fails to produce an analysis are both instances of raising. Raising also explains one of the instances in which no appropriate analysis was obtained.

In Table 11.4.1 the performance of the parser is given in terms of the total time (in CPU seconds) that it took the parser to attempt to produce a parse for each of the utterances. In the third column (headed "avg. time to accept") the average time is listed that it took the parser to produce a parse when the input was found to be parsable, while the fourth column ("avg. time to reject") lists the average time that was required to determine that the input could not be parsed. The average times are hardly representative: for example, in the Lotus text there