Imperial College of Science, Technology and Medicine (University of London)

Department of Computing

An Interface To LDOCE: Towards The Construction Of A Machine-Tractable Dictionary

By Arash Eshghi

Submitted in partial fulfillment of the requirements for the MSc in Advanced Computing and the Diploma of Imperial College of Science, Technology and Medicine.

September 2005

Abstract

Computational linguistics is becoming ever more important in AI. It relies heavily on resources of world and lexical knowledge such as machine-readable dictionaries (MRDs). We would like to be able to extract the lexical knowledge required by Natural Language Processing (NLP) applications from MRDs. Some of this knowledge is explicitly encoded in MRDs and is not difficult to extract. There is also a lot of implicit semantic information in MRDs, the extraction of which is a far more complex and involved process. Whether such extraction is feasible has still not been completely established.

This project aims to describe a Prolog interface to the machine-readable version of the Longman Dictionary of Contemporary English (LDOCE). The dictionary contains various explicitly encoded linguistic information which we would like to be able to extract. The LDOCE is a proprietary resource, which makes a non-proprietary interface such as the one described here almost imperative.

We also try to tag the sense definitions of LDOCE using a probabilistic parser developed at Stanford University, as an experiment to test whether the grammar used in the parser fits the sense definitions of LDOCE. Parsing the sense definitions in LDOCE is a small first step towards the construction of a Machine-Tractable Dictionary (MTD), which would in turn facilitate the extraction of the implicit world and lexical knowledge in the dictionary.

Acknowledgments

I would like to thank the following people for their inspiration, guidance and support throughout the project period, without which this project would not have been possible:

● Jim Cunningham at Imperial College, for supervising the project and providing priceless support, guidance and advice.

● Jaspreet Shaheed at Imperial College, for the inspiration he has provided me throughout and his ever timely and invaluable help in developing some of the ideas involved in the project.


Table of Contents

1. INTRODUCTION AND BACKGROUND
    1.1. The Development of Objectives
    1.2. Corpora, Machine-Readability and Annotation
        1.2.1. Text Encoding and annotation
        1.2.2. Formats of Annotation/Encoding
        1.2.3. Text Encoding Initiative (TEI)
        1.2.4. Standard Generalized Markup Language (SGML)
        1.2.5. Linguistic Annotations
    1.3. Phrase Structure And Parsing
        1.3.1. Phrase Structure Grammars
        1.3.2. Context Free Grammars
        1.3.3. Phrase Structure Ambiguity
        1.3.4. Probabilistic Context Free Grammars (PCFG)
    1.4. Semantic Annotation
        1.4.1. Word Sense Disambiguation
        1.4.2. Dictionary Based Disambiguation
    1.5. The LDOCE
    1.6. Machine-Tractable Dictionaries (The ultimate goal)
    1.7. Parsing Dictionary Definitions

2. THE PROLOG INTERFACE TO LDOCE
    2.1. Removing The Redundancies: The reparse_term predicate
    2.2. The Prolog Interface
        2.2.1. The morphological information contained in the dictionary
        2.2.2. Extracting Inflected Forms: the inflected_forms predicate
        2.2.3. Generating Regular Inflected Forms: the inflections predicate
        2.2.4. Part of Speech extraction: the extract_pos predicate
        2.2.5. Entry Extraction: the find_entry predicate
        2.2.6. Extracting Sense Definitions: the extract_senses predicate

3. THE STANFORD PARSER
    3.1. The Penn Tree Bank Project
    3.2. The Prolog interface to the Stanford Parser
        3.2.1. The PrologTermCreator class
        3.2.2. The parse predicate
        3.2.3. The parse_and_bracket_entry predicate
        3.2.4. Example Parse of an LDOCE entry
    3.3. An overall assessment of the Stanford Parser

4. EVALUATION
    4.1. The Prolog interface
    4.2. A Machine Tractable LDOCE?
    4.3. Recommendations for Future Work

5. BIBLIOGRAPHY AND REFERENCES

A. CODE LISTING
    A.1. The Prolog Interface To LDOCE
    A.2. Interface To the Stanford Parser

B. Penn POS TagSet


1. Introduction And Background

1.1. The Development of Objectives

The initial goal of the project was an automatic system for 'revealing' the implicit information contained in a piece of raw text by explicit markup, in a manner which is as far as reasonable consistent with the style used for the Longman Dictionary of Contemporary English (LDOCE).

The project is clearly in the area of Natural Language Processing. The description above was quite vague at the time and had to be investigated by reading the corresponding theoretical background on tagging and markup. The objectives of the project were later modified slightly to better suit the specific context of Machine-Readable Dictionaries (MRDs) in general and the LDOCE in particular. The modified objectives can be stated as follows:

1. The implementation of a Prolog interface for the LDOCE, which would essentially be the continuation of the work done by Natasha Yarwood on the same dictionary. The interface would facilitate the extraction of the linguistic features encoded explicitly in the dictionary for each entry. The work done by Yarwood consists mainly of an analysis of the overall structure of the dictionary, namely the extraction of a Document Type Definition for the SGML markup used in this dictionary. The work also verifies that the SWI-Prolog SGML parser is compatible with the dictionary and that it can be used to parse it. Her report has been valuable, since many assumptions would otherwise have had to be made without any real grounds in order to implement the predicates accessing the explicitly marked-up information in the SGML files provided.

2. The LDOCE is only partially marked up, in the sense that neither the word-sense definitions nor the examples used to describe them are tagged in any way. The ultimate goal is to produce a fully tagged (part-of-speech and phrase structure/parse), cross-referenced version of the dictionary (this is very close to the notion of a Machine-Tractable Dictionary, which will be introduced later in the report). This project serves as a very small first step; it should also give a measure of the task at hand while producing a sketch of what needs to be done. The motivation for this goal will be stated clearly in later sections, having given some background on the LDOCE.

1.2. Corpora, Machine readability and Annotation

Nowadays the term corpus almost always refers to machine-readable text. Machine-readable corpora have several advantages over the original written or spoken format.

These advantages include the following:

● They may be searched and manipulated in ways which are not possible with any other format. Machine-readable corpora facilitate, for instance, the search for all instances of a particular word for, say, frequency analysis. This is simply not feasible with corpora in book form unless they are indexed.

● Machine-readable corpora can be enriched with additional information (e.g. textual, non-textual and linguistic) far more easily and quickly.
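
The first advantage can be illustrated with a minimal Python sketch. The sample text and the function names word_frequencies and find_instances are assumptions made for illustration only; they are not part of the LDOCE interface described in this report.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count how often each lower-cased word form occurs in a raw text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


def find_instances(text, word):
    """Return the character offset of every whole-word occurrence of `word`."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [match.start() for match in pattern.finditer(text)]


# Hypothetical two-sentence "corpus" used only for this illustration.
sample = "The children ate the cake. The cake was good."
frequencies = word_frequencies(sample)
print(frequencies["the"], frequencies["cake"])
print(find_instances(sample, "cake"))
```

Both operations take a single pass over the text, which is exactly what is impractical with an unindexed printed corpus.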

1.2.1. Text encoding / annotation

Corpora in general exist in two forms: unannotated (raw or plain text) or annotated (accompanied, in a structured manner, by different types of extra information about the text, e.g. linguistic information). In any piece of raw text there are several kinds of implicit linguistic information. These can be made explicit with the help of annotation. This includes, for example, data about parts of speech, which is only accessible in a piece of raw text through a knowledge of, say, English grammar. In an annotated corpus the verb 'goes', for instance, might appear as 'goes_VVZ', indicating that it is a third person singular present tense (Z) form of a lexical verb (VV). Such annotation makes it a lot faster and easier to retrieve and analyse such information.

1.2.2. Formats of Annotation / Encoding

There is currently no internationally accepted standard for representing additional information in text. Over the years several approaches to corpus encoding and annotation have been adopted, some of which have lasted longer than others. Work in the area is currently progressing towards establishing such an international standard. One of the longer-lasting annotation schemes is known as COCOA. The conventions of this scheme have been applied to several corpora, including the Longman-Lancaster Corpus and the Helsinki Corpus, to include additional textual information. Put very simply, the scheme uses balanced angled brackets which contain the additional textual information in the form of a variable name and its instantiation. An example follows:

<A CHARLES DICKENS>

where 'A' is the name of the variable standing for 'author' and 'Charles Dickens' is the name of the author of the document being annotated. However, this scheme is very informal and can only account for a limited range of textual information, such as authors, dates and titles. Clearly we need a far more general and formal approach if we are to reach an international standard which can facilitate the annotation of any type of information. Work in the TEI is making such an end possible.
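
As a rough illustration of how little structure a COCOA reference carries, the following Python sketch reads such references into a dictionary of variable names and instantiations. The regular expression and the function name parse_cocoa are assumptions for this example; real COCOA-encoded corpora may follow additional conventions that are not handled here.

```python
import re

# A COCOA reference: '<', a variable name, whitespace, its instantiation, '>'.
COCOA_REFERENCE = re.compile(r"<\s*(\w+)\s+([^<>]*?)\s*>")


def parse_cocoa(text):
    """Map each COCOA variable name found in `text` to its instantiation."""
    return {name: value for name, value in COCOA_REFERENCE.findall(text)}


print(parse_cocoa("<A CHARLES DICKENS> It was the best of times"))
```

The result is a flat name/value mapping, which makes the limitation noted above concrete: there is no way to nest or type the information.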

1.2.3. Text Encoding Initiative (TEI)

The main objective of the TEI is to provide standardized implementations for machine-readable text encoding and exchange. It uses an existing form of markup known as SGML (Standard Generalized Markup Language). The reason for employing this markup scheme is that it is very general, formal and already recognized as an international markup standard. A number of other markup languages are derivatives of SGML: HTML is an application of SGML, and XML (Extensible Markup Language) is a simplified subset of it. The TEI's own contribution is a set of guidelines as to how this standard is to be used for the special purpose of corpus encoding. The Longman Dictionary of Contemporary English also employs SGML. We will be dealing with the same encoding scheme in the work on the LDOCE.

1.2.4. Standard Generalised Mark-up Language (SGML)

“A GENTLE INTRODUCTION:

SGML is an international standard for the description of marked-up electronic text. More exactly, SGML is a metalanguage, that is, a means of formally describing a language, in this case, a markup language. Before going any further we should define these terms.

Historically, the word markup has been used to describe annotation or other marks within a text intended to instruct a compositor or typist how a particular passage should be printed or laid out. Examples include wavy underlining to indicate boldface, special symbols for passages to be omitted or printed in a particular font and so forth. As the formatting and printing of texts was automated, the term was extended to cover all sorts of special markup codes inserted into electronic texts to govern formatting, printing, or other processing.

Generalizing from that sense, we define markup, or (synonymously) encoding, as any means of making explicit an interpretation of a text. At a banal level, all printed texts are encoded in this sense: punctuation marks, use of capitalization, disposition of letters around the page, even the spaces between words, might be regarded as a kind of markup, the function of which is to help the human reader determine where one word ends and another begins, or how to identify gross structural features such as headings or simple syntactic units such as dependent clauses or sentences. Encoding a text for computer processing is in principle, like transcribing a manuscript from scriptio continua, a process of making explicit what is conjectural or implicit, a process of directing the user as to how the content of the text should be interpreted.

By markup language we mean a set of markup conventions used together for encoding texts. A markup language must specify what markup is allowed, what markup is required, how markup is to be distinguished from text, and what the markup means. SGML provides the means for doing the first three; documentation such as the TEI Guidelines is required for the last.”[20]

In TEI each individual text is conceived of as consisting of two parts: a header and the text. The header contains information about the text such as author, title, etc. This information includes any feature system declarations.

The annotation process depends upon two basic devices: tags and entity references. Texts are assumed to be made up of elements. Each element can be any unit of text: a word, a sentence, a paragraph and so on. These elements are marked using SGML tags indicated by balanced pairs of angled brackets. A start tag at the beginning of an element is represented by a pair of angled brackets containing the annotation string: < ... >. An end tag at the end of an element contains a slash character followed by the annotation string: </ ... >. HTML tags are all special cases of SGML tags.

Entity references, on the other hand, are delimited by the characters & and ;. They are basically shorthand ways of encoding detailed information within a text. The shorthand form which is contained in the text refers outwards to a feature system declaration (FSD) in the document header, which contains all the relevant information in full TEI tag-based markup. Take the shorthand code 'vvd' as an example. This is a shorthand code used in part-of-speech tagging (more on this later). The first v says that the word is a verb, the second says that it is a lexical verb and the d says that it is the past tense form. This code is used in the form of an entity reference in the following example:

killed&vvd;

The entity reference in this example might be an outward reference to the following feature system declaration:

<fs id=vvd type=word-form>
<f name=verb-class><sym value=verb>
<f name=base><sym value=lexical>
<f name=verb-form><sym value=past>
</fs>
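
The lookup from an entity reference to its feature system declaration can be sketched in Python as follows. The dictionary FEATURE_DECLARATIONS stands in for the document header and contains only the 'vvd' declaration given above; the function name expand_entity and the data layout are assumptions made for this illustration.

```python
import re

# Stand-in for the feature system declarations in the document header.
# Only the 'vvd' entry is taken from the text; the layout is an assumption.
FEATURE_DECLARATIONS = {
    "vvd": {"verb-class": "verb", "base": "lexical", "verb-form": "past"},
}

# A word immediately followed by an entity reference, e.g. killed&vvd;
ENTITY_TAGGED_WORD = re.compile(r"(\w+)&(\w+);")


def expand_entity(token):
    """Split a token such as 'killed&vvd;' and look up the features of its code."""
    match = ENTITY_TAGGED_WORD.fullmatch(token)
    if match is None:
        raise ValueError("not an entity-tagged word: " + repr(token))
    word, code = match.groups()
    return word, FEATURE_DECLARATIONS[code]


print(expand_entity("killed&vvd;"))
```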

This is equivalent to looking up a code in a code book. The code book here is the document header. In a pre-TEI corpus such as the Lancaster/IBM Spoken English Corpus, the same word plus annotation might look like this:

killed_vvd

The user would then be able to look the code up in a table listing all the codes used in the annotated corpus:

vvd past tense form of a lexical verb

The overall structure of a text in TEI (SGML) is based on the notion of a document type definition (DTD). This is a formal representation of what elements the text contains and how they can be combined. It also contains a set of entity declarations, for example for the representation of non-standard characters. The TEI has already defined standard DTDs for basic text types such as poems, letters and so on. The DTD can be used by an SGML parser to check whether the document is TEI-conformant.

1.2.5. Linguistic Annotations

Linguistic annotations are of particular interest in this project. The information which is added through encoding has, in this case, to do with various linguistic features of the text. Certain kinds of linguistic annotation, which involve the attachment of special codes to words in order to indicate particular features, are often known as 'tagging' rather than annotation. The following is a list of some of the most common types of linguistic annotation (tagging):

● Parsing

● Part of speech tagging

● Lemmatisation

● Semantic annotation

1.2.5.1. Lemmatisation

Lemmatisation is basically the reduction of the words in a corpus to their respective lexemes (the lexical entries in the dictionary from which they have been derived). Thus, for example, the words 'went', 'goes' and 'gone' would all be annotated with the verb 'go', which is said to be their lemma.

1.2.5.2. Part-of-Speech Tagging

This is the most basic type of corpus annotation, sometimes also known as grammatical tagging or morphosyntactic annotation. The aim of part-of-speech tagging is to assign to each lexical unit (word) a code indicating its part of speech (e.g. singular common noun, past participle, etc.). Part-of-speech information is an essential foundation for further forms of analysis. These include syntactic parsing and semantic field annotation. Part-of-speech annotation is a basic first step in the disambiguation of homographs. For instance, the word 'boot' can be interpreted both as a noun and as a verb, and part-of-speech tagging is able to distinguish between the two. However, it will not be able to distinguish between the senses of 'boot' as a verb meaning 'to kick' and 'to start up a computer'.

The following is an example of part-of-speech tagging from the British National Corpus (BNC) using the TEI (SGML) entity references delimited by & and ; :

Perdita&NN1-NP0;,&PUN; covering&VVG; the&AT0; bottom&NN1; of&PRF; the&AT0; lorries&NN2; with&PRP; straw&NN1; to&TO0; protect&VVI; the&AT0; ponies&NN2;'&POS; feet.&PUN;
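
Text tagged in this way is straightforward to process mechanically. The following Python sketch, given purely as an illustration, splits a fragment of the example above into (word, tag) pairs and then selects the nouns. The regular expression and the function name read_tagged are assumptions; they are not taken from the BNC documentation.

```python
import re

# A run of non-space characters followed by an entity reference such as &NN2;
TAGGED_TOKEN = re.compile(r"([^\s&;]+)&([A-Z0-9-]+);")


def read_tagged(text):
    """Return the (word, tag) pairs found in entity-tagged text."""
    return TAGGED_TOKEN.findall(text)


fragment = "the&AT0; lorries&NN2; with&PRP; straw&NN1;"
pairs = read_tagged(fragment)
print(pairs)

# In this fragment every tag beginning with 'NN' marks a noun.
nouns = [word for word, tag in pairs if tag.startswith("NN")]
print(nouns)
```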


Part-of-speech tagging is today by far the most common form of annotation. One reason is that the process can be automated with high accuracy and without any manual intervention, since the part of speech of any word in a sentence is highly predictable from the surrounding context, given a very basic knowledge of the language in question.

NOTE: Most of the above background on encoding and annotation has been drawn from [21] and [22]. Please refer to these books for more detailed descriptions.

1.3. Phrase Structure and Parsing

Generally speaking, the position of a word in a sentence is not arbitrary: syntax imposes constraints on where a word can occur in a sentence. Moreover, the words in a sentence are not simply assembled together as a sequence of parts of speech. Instead they are organized into phrases, which are groupings of words that are treated as single units. These units are called the constituents of a sentence. Constituents can be detected by their ability to occur in various positions and by their uniform syntactic possibilities for expansion. These syntactic possibilities are described using phrase-structure grammars.

1.3.1. Phrase structure grammars

The regularities of word order are often captured by means of rewrite rules. These rules have the general form: Category → Category*.

Such a rule states that the symbol on the left-hand side of the arrow can be rewritten as (replaced by) the sequence of symbols on the right. The Category above is itself a sentence constituent. Noun Phrases, Verb Phrases and Prepositional Phrases are common examples of such categories. To produce a sentence of the language, we start with the start symbol 'S'. What follows is an example of a simple set of rewrite rules:

S  → NP VP        AT  → the
NP → AT NNS       NNS → children
NP → AT NN        NNS → students
NP → NP PP        NNS → mountains
VP → VP PP        VBD → slept
VP → VBD          VBD → ate
VP → VBD NP       VBD → saw
PP → IN NP        IN  → in
                  IN  → of
                  NN  → cake

Each of the rules in the right-hand column above rewrites a syntactic category (part of speech) into a word. This part of the grammar is often separated off as the lexicon. Taken together, the rules mean that each syntactic category, e.g. NP, can be rewritten as one or more other syntactic categories or words. The rewriting, however, depends solely on the category in question and not on any surrounding context. The following are examples of sentence derivations using the above grammar:

● The children slept:

S
→ NP VP
→ AT NNS VBD
→ The children slept

● The children ate the cake:

S
→ NP VP
→ AT NNS VBD NP
→ AT NNS VBD AT NN
→ The children ate the cake
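
The derivations above can be reproduced mechanically. The following Python sketch applies a given sequence of rewrite rules, always rewriting the leftmost non-terminal, and records every intermediate step. It is meant only as an illustration: the function name derive is an assumption, and because it rewrites one symbol at a time it shows more intermediate steps than the condensed derivations above.

```python
# Non-terminal symbols of the example grammar (phrase categories and parts of speech).
NONTERMINALS = {"S", "NP", "VP", "PP", "AT", "NNS", "NN", "VBD", "IN"}


def derive(start, rule_sequence):
    """Apply each (lhs, rhs) rule to the leftmost non-terminal; return all steps."""
    form = [start]
    steps = [" ".join(form)]
    for lhs, rhs in rule_sequence:
        # Position of the leftmost symbol that can still be rewritten.
        position = next(i for i, symbol in enumerate(form) if symbol in NONTERMINALS)
        if form[position] != lhs:
            raise ValueError("cannot apply a rule for " + lhs + " to " + form[position])
        form[position:position + 1] = rhs
        steps.append(" ".join(form))
    return steps


rules_for_sentence = [
    ("S", ["NP", "VP"]),
    ("NP", ["AT", "NNS"]),
    ("AT", ["the"]),
    ("NNS", ["children"]),
    ("VP", ["VBD"]),
    ("VBD", ["slept"]),
]
steps = derive("S", rules_for_sentence)
for step in steps:
    print(step)
```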

We can represent such structures as trees, the leaves of which are referred to as terminal symbols. These terminal symbols comprise the lexicon. The rest of the tree nodes are labeled with syntactic categories, which are the non-terminal symbols of the grammar. The tree representation of the first derivation above follows:


                S
            /       \
        NP            VP
      /    \           |
    AT      NNS       VBD
     |       |         |
    The   children   slept

Another useful way of representing such derivations is called (labeled) bracketing, in which sets of brackets delimit the syntactic categories (constituents). The above tree is represented in this manner as follows:

[S [NP [AT The] [NNS children]] [VP [VBD slept]]]

This is the representation which will be used to phrase-structure-tag the senses of words in the LDOCE, the reason being that it maps directly onto a valid Prolog term.
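
To make the correspondence concrete, the following Python sketch renders one parse tree both as a labeled bracketing and as a Prolog-style term. The nested-tuple encoding of the tree and the shape of the Prolog term (lower-cased category names as functors, quoted words as atoms) are assumptions chosen for this illustration; they are not necessarily the representation produced by the interface described later in this report.

```python
def to_bracketing(tree):
    """Render a (label, child, ...) tuple tree as a labeled bracketing."""
    if isinstance(tree, str):
        return tree
    label, *children = tree
    return "[" + label + " " + " ".join(to_bracketing(c) for c in children) + "]"


def to_prolog_term(tree):
    """Render the same tree as a Prolog-style term (assumed layout)."""
    if isinstance(tree, str):
        # Quote words so that capitalised ones are atoms, not Prolog variables.
        return "'" + tree.replace("'", "''") + "'"
    label, *children = tree
    return label.lower() + "(" + ", ".join(to_prolog_term(c) for c in children) + ")"


tree = ("S", ("NP", ("AT", "The"), ("NNS", "children")), ("VP", ("VBD", "slept")))
print(to_bracketing(tree))
print(to_prolog_term(tree))
```

Category labels are lower-cased because an identifier beginning with a capital letter would be read as a variable in Prolog.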

The process of reconstructing a derivation that gives rise to a given sentence and is valid according to some grammar is known as parsing. The constructed tree (derivation) is known as a parse of that sentence.

Note here that all nodes exactly one level above a leaf in the tree representation are labeled with the part of speech of the word labeling the leaf. Put another way, we could say that phrase-structure tagging subsumes part-of-speech tagging if done as described above.

1.3.2. Context-Free Grammars

The syntactic model described in the previous section adopts what is referred to as a context-free view of language. The term "context-free" comes from the fact that a non-terminal such as an NP can always be replaced by what appears on the right-hand side of an NP expansion rule, regardless of the context in which it occurs. Subject-verb agreement (the fact that the subject and verb of a sentence agree in number and person) is an example of a linguistic phenomenon which a simple context-free grammar such as the one above is unable to account for. More generally, syntactic dependencies between two words occurring far apart in a sentence, i.e. within two different sentence constituents, are not captured by such a context-free model.

1.3.3. Phrase structure ambiguity

More often than not, there are many different syntactic derivations that give rise to a single sentence. Parsers based on a comprehensive English grammar will usually find hundreds of parses for a sentence. This phenomenon is known as phrase-structure or syntactic ambiguity. Attachment ambiguity is a common type of syntactic ambiguity. It occurs with phrases that could have been generated by two different nodes. Consider, for example, the grammar in Section 1.3.1. With this grammar there are two ways to attach the prepositional phrase (PP) 'with a spoon' in the following sentence:

The children ate the cake with a spoon.

The PP can be generated as a child of the verb phrase (VP) or as a child of one of the noun phrases (NP).

Different attachments give rise to different meanings. The attachment to the verb phrase makes a statement about the instrument that the children are using to eat the cake, whereas the noun phrase attachment is a descriptive statement about the cake (the cake with a spoon). Note here that syntactic ambiguities should be dealt with before any meaning can be assigned to the sentence in question. Each parse would yield a different interpretation of the sentence, so a choice has to be made between all valid parses. This is done via statistical methods. Put simply, just to give an idea at this stage, we would very often want the most probable parse of a sentence. One way to assign probabilities to the different parses of a single sentence is to describe the language in terms of a Probabilistic Context-Free Grammar (PCFG), which is introduced very briefly in the next section. Please note that there are other probabilistic models describing syntax and word order in a language. However, the Stanford Parser is based on a learned PCFG, so a short introduction is necessary.
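
The ambiguity can be checked mechanically by counting the parses that the example grammar licenses. The Python sketch below does this with a memoised recursion over spans. It is an illustration only: the words 'with', 'a' and 'spoon' are not in the lexicon of Section 1.3.1 and have been added here as an assumption (as IN, AT and NN respectively), and the function name count_parses is likewise hypothetical.

```python
from functools import lru_cache

# Phrase-structure rules of the example grammar, split by arity.
BINARY_RULES = {
    "S": [("NP", "VP")],
    "NP": [("AT", "NNS"), ("AT", "NN"), ("NP", "PP")],
    "VP": [("VP", "PP"), ("VBD", "NP")],
    "PP": [("IN", "NP")],
}
UNARY_RULES = {"VP": ["VBD"]}

# Lexicon; 'a', 'spoon' and 'with' are additions assumed for this example.
LEXICON = {
    "AT": {"the", "a"},
    "NNS": {"children", "students", "mountains"},
    "NN": {"cake", "spoon"},
    "VBD": {"slept", "ate", "saw"},
    "IN": {"in", "of", "with"},
}


def count_parses(sentence, category="S"):
    """Count the distinct parse trees of `sentence` rooted in `category`."""
    words = tuple(sentence.lower().split())

    @lru_cache(maxsize=None)
    def count(cat, start, end):
        total = 0
        # A part of speech covering exactly one word.
        if end - start == 1 and words[start] in LEXICON.get(cat, ()):
            total += 1
        # Unary rewrites cover the same span.
        for child in UNARY_RULES.get(cat, ()):
            total += count(child, start, end)
        # Binary rewrites: try every split point inside the span.
        for left, right in BINARY_RULES.get(cat, ()):
            for split in range(start + 1, end):
                total += count(left, start, split) * count(right, split, end)
        return total

    return count(category, 0, len(words))


print(count_parses("The children slept"))
print(count_parses("The children ate the cake with a spoon"))
```

Under these assumptions the first sentence has a single parse, while the second has two, one for each attachment of the PP.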

The above should serve as a sufficient general introduction to Context-Free Grammars, although I will have to come back to the concept of parsing later on when I assess the applicability of the Stanford Lexicalized Parser to the sense definitions in LDOCE.


1.3.4. Probabilistic Context-Free Grammars

In the previous sections on Context-Free Grammars we talked about rewrite rules and the way they can be applied to derive the "grammatical" sentences of a particular language. Here I will introduce a system based on CFGs for the syntactic description of a language in terms of similar rewrite rules, whereby it becomes possible to assign a probability to each valid derivation of a sentence in ambiguous cases.

A Probabilistic Context-Free Grammar is basically a CFG with probabilities added to the rules, indicating how probable different rewritings are. For instance, when rewriting an NP (noun phrase) there is usually more than one possible rewrite. After we have derived the parse trees for a single sentence, the probability of each parse can be calculated by multiplying the probabilities of the rewrite rules used whenever a choice has been made between the rewritings (P(A & B) = P(A) * P(B) when A and B are independent events; the rule choices are treated as independent precisely because the model is context-free). This means that at each node we have a probability assigned to each subtree at that node.
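
The calculation can be illustrated with a short Python sketch that scores the two parses of 'The children ate the cake with a spoon'. The rule probabilities below are invented for the purpose of the example (they are not estimated from any corpus and have nothing to do with the grammar learned by the Stanford Parser), and lexical rewrites are given probability 1 as a simplification, since both parses use exactly the same words.

```python
import math

# Invented rule probabilities; those sharing a left-hand side sum to 1.
RULE_PROBABILITIES = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("AT", "NNS")): 0.4,
    ("NP", ("AT", "NN")): 0.4,
    ("NP", ("NP", "PP")): 0.2,
    ("VP", ("VP", "PP")): 0.3,
    ("VP", ("VBD",)): 0.2,
    ("VP", ("VBD", "NP")): 0.5,
    ("PP", ("IN", "NP")): 1.0,
}


def parse_probability(tree):
    """Multiply the probabilities of all phrase-structure rules used in `tree`."""
    label, *children = tree
    if all(isinstance(child, str) for child in children):
        return 1.0  # lexical rewrite: probability 1 by assumption
    rhs = tuple(child[0] for child in children)
    probability = RULE_PROBABILITIES[(label, rhs)]
    for child in children:
        probability *= parse_probability(child)
    return probability


subject = ("NP", ("AT", "the"), ("NNS", "children"))
cake = ("NP", ("AT", "the"), ("NN", "cake"))
with_spoon = ("PP", ("IN", "with"), ("NP", ("AT", "a"), ("NN", "spoon")))

# PP attached to the verb phrase (eating with a spoon).
vp_attachment = ("S", subject, ("VP", ("VP", ("VBD", "ate"), cake), with_spoon))
# PP attached to the noun phrase (the cake with a spoon).
np_attachment = ("S", subject, ("VP", ("VBD", "ate"), ("NP", cake, with_spoon)))

print(parse_probability(vp_attachment))
print(parse_probability(np_attachment))
```

With these invented numbers the verb-phrase attachment comes out as the more probable parse; different probabilities could of course reverse the preference.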

For a more detailed background on grammars and probability theory see [3].

1.4. Semantic Annotation

There are generally two identifiable types of semantic annotation:

1. the marking of semantic relationships between items in the text, for example the agents or patients of a particular action (this is often known as thematic semantics)

2. the marking of semantic features of words in a text, essentially the annotation of word senses in one form or another.

The second of the above becomes very important when we are dealing with the construction of a cross-referenced version of the LDOCE. It relies mostly on the notion of automatic word-sense disambiguation. An overall description of this process follows.


1.4.1. Word Sense disambiguation

The problem to be solved here is that most words have various meanings or senses. These words are thus ambiguous if given out of context, i.e. there are a number of ways in which they can be interpreted. The task of disambiguation is therefore to determine which of the word's senses is invoked in a particular use of the word. In general this is done by looking at the context in which the word has been used.

A word is assumed to have a finite number of discrete senses, given by a dictionary, thesaurus, or any other form of reference. The task at hand is to automate the disambiguation of the words in context, in the sense that the computer programme in charge will make a forced choice between the senses of the word. Typically, however, a word has various related senses and it is thus unclear where and when to draw a line between them. As an example consider the following senses of the word 'title':

● Name/heading of a book, statute, work of art or music etc.

● Material at the start of a film

● The right of legal ownership (of land)

● The document that is evidence of this right

● An appellation of respect attached to a person's name

One approach is simply to define the senses of a word as the meanings given to it in a particular dictionary. However, since dictionaries differ in many ways in the number and kinds of senses that they list, this approach can become scientifically problematic. The groupings of word senses in dictionaries often seem arbitrary. For example, the above list distinguishes as two senses a right of legal title to property and a document that shows that right. This has not been done for the other uses of the word, even though the same ambiguity exists, for instance, when talking about the title of a painting. Consider the following sentence:

This work doesn't have a title.

This could mean either that the work has not been given a title in the first place by the artist, or that the placard displaying the title in the exhibition is missing. One can also notice here that the second definition can be thought of as a special case of the first. It is quite common for dictionaries to list senses that are really special cases of others if the sense is frequently used. The conclusion to be drawn here is that for most words we cannot always draw a hard line between the senses of that word.

As was very briefly mentioned in the part-of-speech tagging section above, another form of ambiguity to be dealt with arises when a word is used as different parts of speech. Clearly, if a word is used as a noun it has a different sense (meaning) than when it is used as a verb. Part-of-speech tagging can therefore be viewed as a word disambiguation problem. The two have, however, been distinguished in the NLP context, partly because of the different natures of the problems at hand and partly because of the methods which have been used to approach them. The part of speech of a word can be determined by looking at the structure of the other words close by. This is far from true in the case of sense disambiguation, since we need to look at the broader context (distant words) in order to disambiguate a word. Conversely, distant words are quite useless when it comes to part-of-speech tagging.

There are in general three main disambiguation methods each of which will be brieflyexplained:

● Supervised Disambiguation● Dictionarybased Disambiguation● Unsupervised Disambiguation

The first and the last of the above involve the idea of learning and training sets.Disambiguation can be thought of as a classification problem. The general idea is that theprogramme having been fed a training set will be able to classify new data. Thedistinction between Supervised and Unsupervised learning(involving the first and the lastof the above) is that in the case of Supervised learning the training set has already beenclassified(in our case the training set would be a semantictagged(disambiguated) corpus)whereas in Unsupervised learning the training set is not already classified.

In the statistical NLP domain it is quite expensive to produce semantically tagged training data for Supervised learning. Therefore Unsupervised Disambiguation algorithms will need to make use of various knowledge sources such as dictionaries or more richly structured data.

1.4.2. Dictionary-based Disambiguation

If we have no information about the sense categorization of specific instances of a word, we can fall back on a general characterization of the senses. This general method uses the definition of senses in dictionaries and thesauri. Three types of information have been used. Lesk (1986) (whose algorithm I will state in this section) exploits the sense definitions in the dictionary directly. Yarowsky (1992) shows how to apply the semantic categorization of words (in a thesaurus) to the semantic categorization and disambiguation of contexts. In Dagan and Itai's method (1994), translations of the different senses are extracted from a bilingual dictionary and their distribution in a foreign language corpus is analysed for disambiguation.

Lesk's Disambiguation method

Lesk (1986) starts from the simple idea that a word's dictionary definitions are likely to be good indicators for the sense they define. Suppose that two of the definitions of cone are as follows:

1. A mass of ovule-bearing or pollen-bearing scales or bracts in the trees of the pine family or in cycads that are arranged usually on a somewhat elongated axis,

2. Something that resembles a cone in shape: as . . . a crisp cone-shaped wafer for holding ice cream.

If either tree or ice occurs in the same context as cone, then chances are that the occurrence belongs to the sense whose definition contains that word: Sense 1 for tree and Sense 2 for ice.

Let $D_1, \ldots, D_K$ be the dictionary definitions of the senses $S_1, \ldots, S_K$ of the ambiguous word $w$, each represented as the bag of words occurring in the definition, and let $E_{v_j}$ be the dictionary definition of a word $v_j$ occurring in the context of use $c$ of $w$, represented as the bag of words occurring in the definition of $v_j$. (If $S_{j_1}, \ldots, S_{j_L}$ are the senses of $v_j$, then $E_{v_j} = \bigcup_{i} D_{j_i}$; we simply ignore sense distinctions for the words $v_j$ that occur in the context of $w$.) Lesk's algorithm can then be described as follows:

1. comment: Given: context $c$

2. for all senses $S_k$ of $w$ do

3.     $\mathrm{score}(S_k) = \mathrm{overlap}(D_k, \bigcup_{v_j \in c} E_{v_j})$

4. end

5. choose $s'$ such that $s' = \arg\max_{S_k} \mathrm{score}(S_k)$

For the overlap function we can just count the number of common words in the definition $D_k$ of sense $S_k$ and the union $\bigcup_{v_j \in c} E_{v_j}$ of the definitions of the words $v_j$ in the context.

Information of this sort derived from the dictionary will not by itself produce high quality word sense disambiguation. Lesk reports accuracies between 50% and 70%.
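The algorithm above can be sketched in Prolog as follows. This is only a minimal illustration: the definition/3 facts are a hypothetical toy dictionary (they are not taken from LDOCE), and the predicate names context_bag/2, overlap/3 and lesk/3 are assumptions introduced here.

```prolog
:- use_module(library(lists)).

% definition(?Word, ?SenseNumber, ?BagOfWords): hypothetical toy dictionary.
definition(cone, 1, [mass, scales, bracts, trees, pine, family, axis]).
definition(cone, 2, [something, shape, crisp, wafer, holding, ice, cream]).
definition(tree, 1, [tall, plant, wood, trees, pine]).
definition(ice,  1, [frozen, water, ice, cream]).

% context_bag(+Context, -Bag): union of the definitions of all context
% words, ignoring the sense distinctions of those words.
context_bag(Context, Bag) :-
    findall(W,
            ( member(V, Context), definition(V, _, D), member(W, D) ),
            Ws),
    sort(Ws, Bag).

% overlap(+Definition, +Bag, -N): number of distinct words in common.
overlap(Definition, Bag, N) :-
    sort(Definition, DSet),
    intersection(DSet, Bag, Common),
    length(Common, N).

% lesk(+Word, +Context, -BestSense): the sense with the highest overlap.
% Ties are broken in favour of the higher sense number (standard order).
lesk(Word, Context, BestSense) :-
    context_bag(Context, Bag),
    findall(Score-S,
            ( definition(Word, S, D), overlap(D, Bag, Score) ),
            Scored),
    Scored \== [],
    max_member(_-BestSense, Scored).
```

With these toy facts, the query lesk(cone, [tree], S) selects sense 1 and lesk(cone, [ice], S) selects sense 2, mirroring the cone example given earlier.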

Some background on LDOCE and machine-tractability is needed now to motivate the second objective of the project.

For more detailed information on Word-Sense Disambiguation see [3].

1.5. The LDOCE

The LDOCE is the Longman Dictionary of Contemporary English. This is a dictionary for learners of English as a second language, which means that the dictionary provides the linguistic features of words in a more explicit manner than a normal monolingual dictionary; this is quite useful for any NLP application. The dictionary is organized as a list of entries, each of which is basically a collection of sense definitions. Each entry has a head which encodes the word being defined. Each sense definition is a collection of definitions, examples and other information relating to that particular sense of the head word (or phrase). The LDOCE has been created using the British National Corpus, the Longman Lancaster corpus and the Longman corpus of Learners' English.

I have been working on the latest digital version of this dictionary (LDOCE3). The explicit linguistic features in the dictionary have been marked up using SGML. Previous versions used LISP for markup. The dictionary is a proprietary resource and comes with very limited documentation. The work done by Natasha Yarwood on the SGML markup structure used has proved very useful, particularly in the creation of the Prolog interface to LDOCE3.

The following is the entry for the word 'myth' in LDOCE3 (just to give an idea of what the dictionary looks like in SGML form):

<Entry>
  <Head><HWD>myth</HWD><PronCodes><PRON>mIT</PRON></PronCodes><POS>n</POS></Head>
  <Sense>
    <GRAM>C,U</GRAM><ACTIV>BELIEVE</ACTIV>
    <DEF>an idea or story that many people believe, but which is not true</DEF>
    <EXAMPLE>the myth of male superiority</EXAMPLE>
    <EXAMPLE>Most people think that bats are blind, but in fact this is a myth.</EXAMPLE>
    <ColloExa><COLLO>popular myth</COLLO><GLOSS>one that a lot of people believe</GLOSS><EXAMPLE>Contrary to popular myth, there is no evidence that long jail sentences really deter young offenders.</EXAMPLE></ColloExa>
    <ColloExa><COLLO>explode/dispel a myth</COLLO><GLOSS>prove that it is not true</GLOSS></ColloExa>
  </Sense>
  <Sense>
    <GRAM>C</GRAM><FIELD>RM</FIELD>
    <DEF>an ancient story, especially one invented in order to explain natural or historical events</DEF>
    <EXAMPLE>the myth of Orpheus</EXAMPLE>
  </Sense>
  <Sense>
    <GRAM>U</GRAM><FIELD>RM</FIELD>
    <DEF>this kind of ancient story in general</DEF>
    <EXAMPLE>the giants of myth and fairy-tale</EXAMPLE>
  </Sense>
</Entry>

What makes the dictionary particularly well suited for the second objective of this project is that the dictionary has a 2000 word defining vocabulary, i.e. all the sense definitions in the dictionary have been written using this defining vocabulary. Guo in [1] divides the vocabulary in LDOCE into the controlled and the non-controlled. Controlled means the same as defining here.

According to the analysis of the dictionary by Plate (1988), the number of senses for words belonging to the controlled vocabulary is extremely high. (This is a relatively old analysis; however, the difference between the old figures and the new should be negligible.) Although there are only about 2166 words in the controlled vocabulary, over 24000 of the 74000 senses defined are senses of these words. In terms of semantic ambiguity the controlled vocabulary is roughly six times as ambiguous as the non-controlled vocabulary: each word in the controlled vocabulary has about 12 senses on average, whereas the rest of the words have roughly 2.

Guo in [1] notes four different types of words and word senses in the context of sense definition. Each of the following categories is a subset of the preceding one:

1. Any word contained in LDOCE is an LDOCE word. Any word sense of any LDOCE word is an LDOCE sense.

2. Controlled words are words from the list of the controlled vocabulary given at the back of the LDOCE dictionary. All the word senses of the controlled words defined in LDOCE are the controlled senses.

3. Defining words are words that are used to define the meanings of all the controlled words in their sense definitions. Note that not every controlled word is used in the definitions of the meanings of other controlled words. Defining senses are individual word senses of the defining words that are used in the definitions of the meanings of the controlled words.

4. Seed senses are natural semantic primitives to be derived from this work on the dictionary. The words that the seed senses are senses of are called the seed words. [1]


The above categorization becomes relevant when we examine the notion of a Machine-Tractable Dictionary (as opposed to a Machine-Readable Dictionary). The second objective of the project will also be motivated (set in context) when the overall process of the creation of a machine-tractable dictionary out of something like the LDOCE is described.

1.6. Machine-Tractable dictionaries (The ultimate goal)

The essence of tractability is that a machine-tractable dictionary comprises a computational representation of the contents of a machine-readable dictionary that embodies both the computer science understanding of a database and the computational linguistic understanding of a parsed and semantically disambiguated text [1].

It was after some time that I realized that what is required in the second objective of this project, namely a fully tagged and cross-referenced version of the dictionary, is actually a “tractable” version of LDOCE3.

A fully tagged version of the dictionary would mean that we need, one way or another, to parse all the sense definitions in the dictionary. As you will see later in this report, I have used the Stanford LexParser to parse the sense definitions and the examples of an entry (merely as an experiment).

In order to produce a cross-referenced version of the dictionary, it is absolutely imperative that we know which sense of each word used in some other word's sense definition is being invoked. This is essentially a process of word sense disambiguation. The leaves of the parse tree (the words used in the definition of a sense) would have to be replaced by the parse tree of the definition of the word sense being invoked in that particular context. Clearly such nesting cannot go on infinitely. The seed senses (as defined in the previous section) would be the end point of the nesting process, i.e. if a seed sense is being invoked, then we wouldn't replace that particular leaf with a parse tree.

Guo in [1] describes the MTD construction of the LDOCE dictionary as two main processes: the process of reduction, which is comprised of two other subprocesses, and the process of composition:

The reduction process:

1. The determination of the defining senses

2. The derivation of the seed senses


The compositional process:

Involves a process of machine learning with the natural set of semantic primitives (seed senses) obtained through the reduction process to compose formalized sense entries of the MTD called Fregean Formulae (FF).

Definition: An FF (Fregean Formula) is an MTD definition of a word sense. It is a two-place predicate where the first argument is a word sense being defined and the second a parse tree of its sense definition. Hanging at each terminal symbol of the parse tree is a primitive, or another FF in the case of a non-primitive leaf [1].
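The nesting that this definition implies can be sketched as a small Prolog program. Everything here is a hypothetical illustration rather than Guo's actual representation: the seed/1 and def_tree/2 facts are invented toy data, the tree labels are arbitrary, and ff/2 is simply one possible encoding of the two-place predicate.

```prolog
:- use_module(library(apply)).

% seed(?Sense): hypothetical seed senses (semantic primitives).
seed(person1).
seed(sick1).
seed(help1).

% def_tree(?Sense, ?Tree): hypothetical parse trees of sense definitions.
% Every node is Label(Child) or Label(ListOfChildren); leaves are atoms.
def_tree(doctor2,  np([n(person1), vp([v(help1), adj(sick1), n(patient1)])])).
def_tree(patient1, np([adj(sick1), n(person1)])).

% ff_for(+Sense, -FF): build the nested formula for a sense.
ff_for(Sense, ff(Sense, Expanded)) :-
    def_tree(Sense, Tree),
    expand(Tree, Expanded).

% expand(+Tree, -Expanded): seed leaves are kept, every other leaf is
% replaced by the FF of its own definition.
expand(Leaf, Leaf) :-
    atom(Leaf), seed(Leaf), !.
expand(Leaf, FF) :-
    atom(Leaf), !,
    ff_for(Leaf, FF).
expand(Tree, Expanded) :-
    Tree =.. [Label, Arg],
    (   is_list(Arg)
    ->  maplist(expand, Arg, Args),
        Expanded =.. [Label, Args]
    ;   expand(Arg, NewArg),
        Expanded =.. [Label, NewArg]
    ).
```

Calling ff_for(doctor2, FF) leaves person1, help1 and sick1 untouched and replaces the leaf patient1 by ff(patient1, ...). Note that this sketch would not terminate on circular definitions, which is exactly why a fixed set of seed senses is needed as the end point of the nesting.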

The figure at the end of this section is an example of an FF defining the second sense of the word 'doctor'.

Having derived the seed senses (the semantic primitives in terms of which all the other sense definitions are to be represented; this is by itself not an easy task), we need to parse and semantically disambiguate the controlled sense definitions (part of the compositional process). According to Guo in [1], in order to be able to semantically disambiguate word senses of the dictionary definitions we need somehow to encode the semantic preferences associated with all possible pairs of semantic primitives. The methods he proposes to achieve this all involve the hand-coding of these preferences.

The next step is what is referred to by Guo as the bootstrapping process. This is the process of parsing dictionary definitions and acquiring semantic preference information between the set of semantic primitives derived previously. The result of this process is the same dictionary definitions, but this time with explicit word sense numbers attached to the words in the definition. These are the senses that have been invoked in the context of that particular sense definition.

According to Guo's own evaluation of his work on the LDOCE, the derivation process (the first two steps) has successfully been carried out. The last two steps of the construction, however, have only been demonstrated to be feasible using implemented examples.

Please note that the above descriptions of the processes involved in the construction of an MTD are extremely general and brief. For further details please refer to [1].

In this project we are simply experimenting on the LDOCE sense definitions with a probabilistic parser (the Stanford Parser), to establish whether the parser can be used for the purpose of parsing the sense definitions in any future work involving the dictionary.


FF(doctor2, a person whose profession is to attend to sick people or animals).

[Figure: the Fregean Formula for doctor2, drawn as a parse tree whose leaves are the sense-tagged words a1, person1, whose2, profession1, is2, to29, attend_to1, sick1, people1, or2 and animals1. The leaves whose2, whom1 and attend_to1 are themselves expanded into nested FFs; attend_to1, for instance, is expanded into the leaves to29, give5, help12 and to9.]


1.7. Parsing dictionary sense definitions

Generally speaking, the sense definitions in a dictionary can be parsed more easily than any form of naturally occurring text. Definition texts in a dictionary are a highly specialized form of natural language with a more predictable structure. This specialized form can be exploited to derive from a dictionary what are referred to as defining formulae, in order to determine and simplify the semantic relationships between the words used in a sense definition. Defining formulae are the recurring patterns common to all sense definitions. Some general examples are that taxonomical relations between nouns are suggested by the pattern “Any-NP” (NP is noun phrase here), set membership relations are indicated by the pattern “A-member-of-NP”, and the pattern “Act-of-V-ing” (V being a verb) in noun definitions is an indicator of action verbs.
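As a rough illustration of how such defining formulae could be recognised, the three patterns mentioned above can be written as Prolog clauses over a tokenised definition. This is a sketch under stated assumptions: the predicate relation/2, the relation names taxonomy/1, member_of/1 and action/1, and the representation of a definition as a list of lower-case atoms are all hypothetical, and stripping "ing" does not always yield a true verb stem (e.g. 'running' gives 'runn').

```prolog
% relation(+DefinitionTokens, -Relation)

% "Any-NP": suggests a taxonomical relation to the noun phrase.
relation([any|NP], taxonomy(NP)) :-
    NP \== [].

% "A-member-of-NP": indicates set membership.
relation([a, member, of|NP], member_of(NP)) :-
    NP \== [].

% "Act-of-V-ing": indicates an action verb; the verb is approximated by
% removing the "ing" suffix from the third token.
relation([act, of, VIng|_], action(Verb)) :-
    atom_concat(Verb, ing, VIng),
    Verb \== ''.
```

For example, relation([act, of, walking], R) gives R = action(walk), and relation([a, member, of, the, clergy], R) gives R = member_of([the, clergy]).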


2. The Prolog Interface to LDOCE

This part of the project is basically the continuation of the work done by Natasha Yarwood very recently. She has produced the SGML tag hierarchy of the LDOCE SGML files. She has also shown that the SWI-Prolog SGML parser can be used to parse the SGML dictionary files into Prolog terms. The work done has primarily involved one of the dictionary SGML files (A.sgm). Quite rightly, however, the results have been generalized to cover all the SGML files. I have used this information to extract some of the explicitly marked up linguistic information provided in the dictionary.

The SWI-Prolog SGML parser is able to parse the dictionary files. However, because the parser is general-purpose and assumes a fully featured SGML document, the parse it produces contains a number of redundancies. I am reparsing the result to remove them. They are as follows:

● When parsing any of the SGML files, the parser produces a list of similar error reports. The error is: “Bad close-element tag found "#PCDATA"”. According to Yarwood, the <#PCDATA> tags represent a placeholder for parsed character data (character data that the parser does not ignore). It is possible that the occurrence of these tags in the SGML files flags the occurrence of required, as opposed to optional, tags, or that the script that was used to generate the SGML files inserted tags for empty elements. These would therefore have to be removed in the reparse.

● “The outputted Prolog term contains a hierarchically organized list of elements. At the highest level in the hierarchy is the element <ldoce3> followed by the element <entry>. The element <entry> contains the deep and surface structure encoding of linguistic features and their representation.” [23] I am removing the <ldoce3> element in the reparse.

● “The outputted Prolog term contains the element <piccal> which occurs at the same level as the element <entry>. The element <piccal> does not appear to contain embedded tags. In the absence of documentation on the markup of the SGML files the meaning of the [content] held between the open and close <piccal> tags is not clear.” [23] I am also removing the occurrences of the <piccal> tags, since they don't seem to have any relevant significance.

● The outputted Prolog term inserts an empty list, ‘[]’, after each tag element. This looks to be consistent for each tag element in the A.sgm file. It is also consistent with the rest of the SGML files, since none of the tags have any attributes. I am also removing this empty list in the reparse.

2.1. Removing the redundancies: The reparse_term predicate

As mentioned in the previous section, I am reparsing the result of the SWI-Prolog SGML parse to remove the redundancies mentioned. This is done using the reparse_term predicate, which itself uses two other helper predicates, namely process_entrylist and process_entry:

reparse_term(File, Result) :-
    load_sgml_file(File, SWI_Term),       % initial parse by the SGML library
    [Ldoce] = SWI_Term,                   % the single top-level <ldoce3> element
    Ldoce =.. [_, _, _, EntryList],       % element(ldoce3, [], EntryList)
    process_entrylist(EntryList, Result).

The above predicate uses the built-in load_sgml_file predicate to produce the initial parse of the SGML file. A list of element(entry,[],SubtagList) terms is then extracted from the parse. This list is passed to the process_entrylist predicate for reparsing, which unifies its second argument with the reparse of the entry list passed to it as the first argument. The recursion shown in the listing that follows preserves the original order of the elements.

The following is a description of the recursive reparse of the extracted list using the two helper predicates:

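% process_entrylist(+List, -Reparsed)
process_entrylist([], []).
% A single value that is not an element (e.g. text) is kept as it is.
process_entrylist([Value], [Value]) :-
    Value \= element(_, _, _), !.
process_entrylist([Entry|Rest], Result) :-
    (   ( Entry == '<#PCDATA>' ; Entry = element(piccal, _, _) )
    ->  % redundant item: skip it
        process_entrylist(Rest, Result)
    ;   process_entrylist(Rest, RestProcessed),
        process_entry(Entry, ParsedEntry),
        Result = [ParsedEntry|RestProcessed]
    ).

% process_entry(+Entry, -Parsed)
process_entry(Entry, Parsed) :-
    (   Entry = element(Tag, _, SubTagList)
    ->  process_entrylist(SubTagList, SubResult),
        Parsed =.. [Tag, SubResult]       % element(Tag,[],Subs) ==> Tag(Subs)
    ;   Parsed = Entry                    % plain text is left unchanged
    ).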

The process_entrylist predicate goes through all the elements of the entry list one by one. For each element of the list we have the following two cases:

● The element is either of the form element(piccal,_,_) or of the form <#PCDATA>, in which case we simply ignore that element and jump to the next element in the list.

● Otherwise we are dealing with a valid dictionary tag. The tag together with all of its subtags is then passed to the process_entry predicate, which unifies its second argument with the reparsed representation of the tag passed to it as the first argument. The rest of the elements in the list are then recursively processed.

The process_entry predicate extracts the tag name and the contents of the tag (the list of subtags, etc.) through unification. It then uses the process_entrylist predicate recursively to parse the contents of the tag, transforming every entry of the form element(TagName,[],Contents) to an entry of the form TagName(ParsedContents), i.e. the following transformation is performed on all entries:

element(TagName,[],Contents) ===> TagName(RecursivelyParsedContents)

The functionality of the reparse_term predicate is illustrated in the following example use of the predicate:

Consider the following dictionary entry before the reparse:

element(entry, [], [
  element(head, [], [element(hwd, [], [tag]), element(homnum, [], [1]), element(proncodes, [], [element(pron, [], [t{g])]), element(pos, [], [n])]),
  element(sense, [], [element(gram, [], [C]), element(field, [], [D]), element(def, [], [a small piece of paper, plastic etc, fixed to something to show what it is, who owns it, what it costs etc]), element(colloexa, [], [element(collo, [], [name/identification/price tag]), element(example, [], [Where's the price tag on this dress?])])]),
  element(sense, [], [element(gram, [], [U]), element(field, [], [DG]), element(def, [], [a children's game in which one player chases and tries to touch the others])]),
  element(sense, [], [element(gram, [], [C]), element(field, [], [SLG]), element(def, [], [a phrase such as `isn't it?', `won't it?', or `does she?', added to the end of a sentence to make it a question or to ask you to agree with it])]),
  element(sense, [], [element(gram, [], [C]), element(field, [], [DC]), element(def, [], [a metal or plastic point at the end of a piece of string or, element(nondv, [], [element(refhwd, [], [shoelace])]), that prevents it from splitting])])
])

This is transformed to the following entry after the reparse:

entry([
  head([hwd([tag]), homnum([1]), proncodes([pron([t{g])]), pos([n])]),
  sense([gram([C]), field([D]), def([a small piece of paper, plastic etc, fixed to something to show what it is, who owns it, what it costs etc]), colloexa([collo([name/identification/price tag]), example([Where's the price tag on this dress?])])]),
  sense([gram([U]), field([DG]), def([a children's game in which one player chases and tries to touch the others])]),
  sense([gram([C]), field([SLG]), def([a phrase such as `isn't it?', `won't it?', or `does she?', added to the end of a sentence to make it a question or to ask you to agree with it])]),
  sense([gram([C]), field([DC]), def([a metal or plastic point at the end of a piece of string or, nondv([refhwd([shoelace])]), that prevents it from splitting])])
])
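The same transformation can be tried without access to the dictionary files by applying it to a small hand-written term. The following is a compact variant written with exclude/3 and maplist/3 rather than explicit recursion; it is only a sketch, the names simplify/2 and redundant/1 are hypothetical, and it assumes (as the text above does) that the redundant items appear as the atom '<#PCDATA>' and as element(piccal,_,_) terms.

```prolog
:- use_module(library(apply)).

% redundant(+Node): items that are dropped in the reparse.
redundant('<#PCDATA>').
redundant(element(piccal, _, _)).

% simplify(+Node, -Simple): element(Tag, _, Kids) becomes Tag(SimpleKids);
% any other node (plain text) is kept unchanged.
simplify(element(Tag, _, Kids), Simple) :- !,
    exclude(redundant, Kids, Kept),
    maplist(simplify, Kept, SimpleKids),
    Simple =.. [Tag, SimpleKids].
simplify(Text, Text).
```

For instance, simplify(element(head, [], [element(hwd, [], [myth]), element(pos, [], [n])]), T) binds T to head([hwd([myth]), pos([n])]).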

2.2. The Prolog interface

Now that we have a neat representation of the dictionary files we can try to extract the different linguistic information contained in the dictionary. The first predicate in need of description is the top-level find_entry predicate, which takes any word and finds the corresponding entry in the dictionary files. The word searched for here doesn't have to be a stem. It can be any valid inflected form of a stem. This predicate makes use of a number of other predicates in the interface, since it needs to find the entry based possibly on an inflected form of a stem.

2.2.1. The morphological information contained in the dictionary

In order to extract an entry corresponding to a particular word from the dictionary we need to know the stem of that word. There is very limited morphological information contained in the dictionary. However, the information contained is sufficient for the extraction of an entry. The main SGML deep structure tag which encodes inflectional information in the dictionary is <Inflections>. It seems that this tag has the following relevant surface structure subtags:

● <INFLECTYPE>: This dictionary tag encloses the type of the inflection being stated inside the <Inflections> tag that encloses it. The information enclosed can be any one of: 'plural', 'past tense', 'past participle' etc.


● <SPELLING>: The tag that encloses the actual inflected form of the root word. For the dictionary entry of the word 'bring', for instance, there would be a <SPELLING> tag containing 'brought', which is both the past tense and the past participle of 'bring'.

However, this information is included only for the words that have irregular inflections. Irregular here is in the sense that the inflection is not in accordance with any of the usual inflection rules. The entry for the word 'doctor', for instance, would definitely not contain any inflectional information.

The derivation of the stem from an inflected form is not trivial, i.e. it is difficult to derive, for instance, the root form 'entry' from the plural form 'entries'. Nevertheless, this is a solved problem for which there is an algorithm outlined in [15] by Jurafsky and Martin. It is easy, however, to derive (not in the linguistic sense of the word) all the inflected forms of a root word. This way, for each entry we can produce the set of all its inflections depending on its part of speech (part-of-speech information is also included in the dictionary) and check whether the word whose entry we are looking for is a member of that set.

2.2.2. Extracting inflected forms: The inflected_forms predicate

The predicate implemented for the extraction of the inflected forms of an entry, if any, is the inflected_forms(+EntryList, InflectedForms, +ListSoFar) predicate. It takes a list of dictionary entries and unifies its second argument with a list of all the inflected forms found within the entries passed to it as the first argument. The third argument should be the empty list '[]' when the predicate is invoked from an outside context. A description of the predicate follows:

inflected_forms([], Sofar, Sofar).
inflected_forms([Entry|Rest], List, Sofar) :-
    Entry = entry(TagList),
    process_tag_list(TagList, Result, []),    % forms found in this entry
    union(Sofar, Result, NewSofar),           % merge with the forms so far
    inflected_forms(Rest, List, NewSofar).

This predicate makes use of a helper predicate called process_tag_list, which extracts the inflectional information from an entry's subtag list. Having invoked this predicate on the entry's subtag list, the union is taken between the result of the process_tag_list invocation and the inflected forms found so far. The rest of the entry list is then processed recursively.

The process_tag_list(+EntrySubTagList, InflectionalForms, +InflectionalFormsSoFar) predicate should be called with the third argument as the empty list [] from an outside context.

process_tag_list([], Sofar, Sofar).
% A single item that is not a tag (e.g. text) contributes nothing.
process_tag_list([Single], Sofar, Sofar) :-
    \+ (Single =.. [_, _]), !.
process_tag_list([Tag|Rest], Result, Sofar) :-
    (   Tag =.. [TagName, TagList]
    ->  (   TagName == inflections
        ->  get_spellings(TagList, List),
            append(Sofar, List, NewSofar),
            process_tag_list(Rest, Result, NewSofar)
        ;   % any other tag: search its subtags, then the rest of the list
            process_tag_list(TagList, Result1, Sofar),
            process_tag_list(Rest, Result, Result1)
        )
    ;   % not a tag: skip it
        process_tag_list(Rest, Result, Sofar)
    ).

The predicate proceeds as follows:

For each element in the main list (first argument):

● If the element is a tag, extract the tag name and the subtag list through unification. Then:

  ● If the tag is <Inflections>, use the get_spellings(+InflectionalSubtags, Spellings) helper predicate to extract the spellings included within the <Inflections> subtag, append the result to the inflected forms so far, and call the same predicate recursively on the rest of the main list with the new InflectionalFormsSoFar value.

  ● Otherwise call the same predicate recursively on the subtag list, to extract the inflected forms within the list, and then again call the same predicate to process the rest of the elements in the main list, setting the InflectionalFormsSoFar value to what was returned as the inflectional forms within the subtag list.

● Otherwise (the element is not a tag) call the same predicate recursively on the rest of the main list, without changing the InflectionalFormsSoFar value.

The base cases should be clear.

2.2.3. Generating regular inflected forms: the inflections predicate

We need here to be able to generate the regular inflected forms of a word, given its part of speech. Three different predicates, invoked by the inflections predicate, have been created, each of which handles the rules corresponding to one part of speech, namely:

● noun_inflections(+Word,InflectedForms)

● verb_inflections(+Word,InflectedForms)

● adjective_inflections(+Word,InflectedForms)

The implementations of these predicates don't demand much elaboration. They are basically running versions of the following sets of rules:

Noun Inflections (the Plural and the Possessive forms):

● If the word ends in an 's', add 'es' at the end of the word to produce the plural form and the single quote character to produce the possessive.

● If the word ends in a 'y' with a consonant as the preceding character, remove the 'y' and add 'ies' at the end of the result, to produce the plural form. Add a single quote followed by 's' at the end of the whole word to produce the possessive.

● If the word ends in any of the character sequences 'ch', 'sh', 'x' or 'z', add 'es' at the end of the word to produce the plural and a single quote followed by 's' to produce the possessive.

● Otherwise add an 's' at the end of the word to produce the plural and a single quote followed by 's' to produce the possessive.


Verb Inflections (the Gerund, the Past Tense and the Third Person Singular forms):

● If the verb ends in 'e', add 'd' to produce the Past Tense and 's' to get the Third Person Singular form. Remove the 'e' and add 'ing' to get the Gerund.

● If the verb ends in 'y', add 'ing' to produce the Gerund. Remove the 'y' and add 'ies' to produce the Third Person Singular and 'ied' to produce the Past Tense.

● Otherwise add 'ing' for the Gerund, 'ed' for the Past Tense and 's' for the Third Person Singular.

Adjective Inflections (the Comparative and Superlative forms):

● If the adjective ends in 'y', remove the 'y' and add 'ier' and 'iest' to get the Comparative and Superlative forms respectively. Alternative forms, produced by simply adding 'er' and 'est' at the end of the word, are also generated in this case.

● If the adjective ends in 'e', add 'r' and 'st' for the Comparative and Superlative forms respectively.

● Otherwise just add 'er' and 'est'.

The inflections predicate has the form inflections(+Word, +POS, ListOfInflections), where POS is one of {n, v, adj}.
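To make the rule sets above more concrete, the noun rules can be sketched as follows. This is not the listing of the actual noun_inflections predicate: noun_forms/3 and vowel/1 are hypothetical names, and the sketch assumes that the 'y' rule applies only when the preceding character is a consonant (entry → entries, but boy → boys).

```prolog
:- use_module(library(lists)).

vowel(C) :- memberchk(C, [a, e, i, o, u]).

% noun_forms(+Noun, -Plural, -Possessive)
noun_forms(Noun, Plural, Possessive) :-
    atom_chars(Noun, Chars),
    (   append(_, [s], Chars)
    ->  % ends in 's': bus -> buses, bus'
        atom_concat(Noun, es, Plural),
        atom_concat(Noun, '\'', Possessive)
    ;   append(Stem, [C, y], Chars), \+ vowel(C)
    ->  % consonant + 'y': entry -> entries, entry's
        append(Stem, [C, i, e, s], PluralChars),
        atom_chars(Plural, PluralChars),
        atom_concat(Noun, '\'s', Possessive)
    ;   member(Suffix, [[c, h], [s, h], [x], [z]]),
        append(_, Suffix, Chars)
    ->  % ends in 'ch', 'sh', 'x' or 'z': church -> churches
        atom_concat(Noun, es, Plural),
        atom_concat(Noun, '\'s', Possessive)
    ;   % default: doctor -> doctors, doctor's
        atom_concat(Noun, s, Plural),
        atom_concat(Noun, '\'s', Possessive)
    ).
```

The verb and adjective rules could be written in the same style, with one branch of the conditional per rule and the default rule last.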

2.2.4. Part of speech extraction: The extract_POS predicate

This predicate takes a list of entries and extracts the parts of speech included within them, unifying its second argument with a list of the extracted parts of speech. It has the general form: extract_POS(+EntryList, POSList).

extract_POS([], []).
extract_POS([Entry|List], Resultlist) :-
    extractpos(Entry, POS),              % parts of speech of this entry
    extract_POS(List, POSList),          % parts of speech of the rest
    union(POS, POSList, Resultlist).

The part of speech of every entry inside the list is extracted using the extractpos helper predicate, which takes a single entry as argument and unifies its second argument with the parts of speech of the entry. The extract_POS predicate is then called recursively on the rest of the entry list. The result list is then unified with the union of the two results.

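extractpos(Entry, POS) :-
    Entry = entry([head(L)|_]),          % <Head> is assumed to be the first subtag
    pos(L, POS, []).

% pos(+HeadSubTags, -POSList, +Sofar): collect the contents of all <POS> tags.
pos([], Sofar, Sofar).
pos([Tag|T], POS, Sofar) :-
    (   Tag = pos([SPOS])
    ->  pos(T, POS, [SPOS|Sofar])
    ;   pos(T, POS, Sofar)
    ).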

An assumption is made here that the <POS> tag is contained within the <Head> deep structure tag. The two predicates above should be self-explanatory.

We are now in a position to implement the predicate which is able to find the dictionary entry that corresponds to some given word.

2.2.5. Entry Extraction: the find_entry predicate

Given the full list of entries found in a dictionary file, we would like to be able to find the entries corresponding to a given word. The method used here is to look at the entries one by one while extracting their head words. The inflections predicate mentioned in previous sections is then invoked on the head word, and the inflected_forms predicate on the whole entry to extract the irregular inflections, if any. Taking the union of the two sets, we check if the given word belongs to the union. If so, we have found a corresponding entry. What follows is the code listing of the find_entry predicate and the helper predicates, plus short descriptions of each:


find_entry(Dictionary_Folder, Word, Result) :-
    char_at(Word, 0, Char),                   % first letter of the word
    convert_to_lowercase(Char, LChar),
    atom_concat(LChar, '.sgm', File),         % e.g. 'a.sgm'
    atom_concat(Dictionary_Folder, File, Absolute_Path),
    reparse_term(Absolute_Path, Dic),
    find_entry1(Dic, Word, Result).

The above is the top-level predicate, which takes the folder in which the dictionary SGML files reside, as an atom, and then, depending on the first letter of the given word, constructs an absolute path to the corresponding dictionary file. The file is then parsed and the resulting Prolog term passed to find_entry1(+Dic, +Word, EntryList), which calls the four-place find_entry2 predicate.

find_entry1(Dic, Word, Result) :-
    find_entry2(Dic, Word, [], Result).

find_entry2([], _, ListSoFar, ListSoFar).
find_entry2([Entry|Rest], Word, ListSoFar, Result) :-
    (   root_form(Entry, Word)
    ->  find_entry2(Rest, Word, [Entry|ListSoFar], Result)
    ;   (   ListSoFar = []
        ->  % nothing found yet: keep searching
            find_entry2(Rest, Word, ListSoFar, Result)
        ;   % first non-matching entry after a match: stop searching
            find_entry2([], Word, ListSoFar, Result)
        )
    ).

The above predicate simply goes through the entries of the parsed dictionary file one by one. Using the root_form(+Entry,+Word) predicate it checks whether the entry corresponds to the word given. The root_form predicate is an implementation of the process explained at the beginning of this section. It uses the predicates for part-of-speech extraction and for the generation and extraction of regular and irregular inflected forms respectively:

root_form(Entry, Word) :-
    entry([head([hwd([W])|_])|_]) = Entry,         % W is the head word
    list_all_inflections(W, POS, Inflections1),    % regular inflected forms
    inflected_forms([Entry], Inflections2, []),    % irregular forms in the entry
    union(Inflections1, Inflections2, U),
    member(Word, U).

2.2.6. Extracting sense definitions: The extract_senses predicate

In this section, the implementation of a top-level predicate for the extraction of LDOCE word sense definitions is described. The senses in LDOCE are stated inside <Sense> deep structure tags. The <Sense> tag seems to appear at two different tag levels:

1. It can be a direct subtag of an <Entry> tag

2. It can be a direct subtag of a <PhrVbEntry> tag, which is itself a direct subtag of <Entry>

In the second case above, the lexical unit being defined is a phrasal verb created using the entry head word and some other words or particles. It is important that we recognize the exact lexical unit being defined by each sense definition. The actual sense definitions are enclosed by the <DEF> tag. The following assumptions have had to be made regarding the overall structure of sense definitions:

● The <DEF> tag can only appear at the tag levels described below:

1. <Entry> ... <Sense> ... <DEF> ... </DEF> ... </Sense> ... </Entry>

2. <Entry> ... <Sense> ... <Subsense> ... <DEF> ... </DEF> ... </Subsense> ... </Sense> ... </Entry>

3. <Entry> ... <PhrVbEntry> ... <Sense> ... <DEF> ... </DEF> ... </Sense> ... </PhrVbEntry> ... </Entry>

4. <Entry> ... <PhrVbEntry> ... <Sense> ... <Subsense> ... <DEF> ... </DEF> ... </Subsense> ... </Sense> ... </PhrVbEntry> ... </Entry>

● There is exactly one <DEF> tag at any of the levels mentioned above.


The above assumptions are consistent with Yarwood's analysis of the structure of the SGML dictionary files.

The lexical unit being defined by any <DEF> tag is either enclosed within <LEXUNIT> tags at the same level, or within the <PHRVBHWD> tag, which is itself a direct subtag of <Head> (the <Head> tag in this case is a direct subtag of <PhrVbEntry>). Regarding the lexical units being defined, the assumptions are:

● <LEXUNIT> is always a direct subtag of a <Sense> tag.

● The <Sense> tag enclosing a <LEXUNIT> tag is a direct subtag of <Entry>.

● If a <Sense> tag doesn't contain <LEXUNIT> and it is a direct subtag of <Entry>, then the lexical unit being defined is the main entry head word.

Taking into account all the above assumptions, I will now describe the implemented predicates handling sense definitions. At the top level we have the extract_senses predicate:

extract_senses(DicFolder,Word,Senses) :-
    find_entry(DicFolder,Word,Entries),
    member(Entry,Entries),
    extract_sense_phrvb_tags(Entry,SenseList),
    extract_defs(SenseList,[meaning(Word,[])],Senses).

The predicate takes the folder containing the dictionary files and the word whose senses are to be extracted. It then unifies its third argument with a list of sense definitions. Elements of this list are Prolog terms of the form meaning(LEXUNIT, [Definitions]), where LEXUNIT denotes the lexical unit being defined and the second argument is a list of the definitions given in the dictionary for that lexical unit.

The extract_senses predicate makes use of two helper predicates which do most of the processing. The extract_sense_phrvb_tags predicate creates a list of all <Sense> and <PhrVbEntry> tags contained within the corresponding entry, together with all their descendants (subtags). (The implementation of this predicate is trivial.) The extract_defs predicate in turn takes this list and extracts each member's lexical unit and definitions, unifying its third argument with a list of all lexical units, each attached to a list of its definitions as mentioned earlier. This predicate takes a second argument which is used to maintain the list gathered so far through the recursive process:

extract_defs([],AllSenses,AllSenses).

extract_defs([phrvbentry(L)|Rest],[meaning(Word,Senses)|OtherSenses],AllSenses) :-
    L=[head([phrvbhwd(Phrvb)|_])|_],
    extract_sense_phrvb_tags(phrvbentry(L),PhrvbSenses),
    extract_phrvb_senses(PhrvbSenses,Defs),
    flatten(Phrvb,PhrvbFlat),
    extract_defs(Rest,
                 [meaning(Word,Senses)|[meaning(PhrvbFlat,Defs)|OtherSenses]],
                 AllSenses
                ).

extract_defs([sense(L)|Rest],[meaning(Word,Senses)|OtherSenses],AllSenses) :-
    extract_lexunit(L,LEXUNIT) ->
    (
        extract_d(L,Defs),
        extract_defs(Rest,
                     [meaning(Word,Senses)|[meaning(LEXUNIT,Defs)|OtherSenses]],AllSenses
                    )
    )
    ;
    (
        extract_d(L,Defs),
        extract_defs(Rest,
                     [meaning(Word,[Defs|Senses])|OtherSenses],AllSenses)
    ).

The extract_defs predicate proceeds as follows (this is just the general idea):

For each element in the given sense list (these are either <Sense> tags or <PhrVbEntry> tags) do:

● If the element is a <Sense> tag, extract the lexical unit contained inside it using the extract_lexunit helper predicate.

● If there was a <LEXUNIT> tag present, use the extract_d helper predicate to extract the contents of all the <DEF> tags present within the <Sense> tag being considered. In this case there is either only one <DEF> tag, or else each is contained within a <Subsense> tag (the extract_d predicate deals with this). Process recursively the rest of the elements in the main list.

● Otherwise, if there was no <LEXUNIT> tag present (in which case the extract_lexunit predicate fails), use the main entry head word as the lexical unit being defined. Extract the <DEF> tags similarly using the extract_d predicate. Process recursively the rest of the elements in the main list.

● Otherwise, if the element is a <PhrVbEntry> tag, the lexical unit (this is contained within the <PHRVBHWD> tag) is extracted through unification. The list of <Sense> tags contained within the <PhrVbEntry> tag is then extracted using the extract_sense_phrvb_tags predicate. The definitions contained within these are then extracted using the extract_phrvb_senses predicate. The rest of the main list is then processed recursively.

The flatten predicate:

There is another helper predicate which is used in the above process, namely flatten(+DEFList, -FlattenedDefinition), which takes the list contained inside a <DEF> tag and unifies its second argument with the full definition contained within the <DEF> tag, as an atomic sentence. <DEF> tags in general are allowed to have other subtags, including the <ROMAN> and the <REFHWD> tags. A parsed <DEF> tag (the Prolog term representing it) has the following general form (abstracting from the order of elements within the enclosed list):

def([sentence part, nondv([refhwd([word])]), ..., roman([contents]), rest of sentence]).

The list contained within should be flattened in the sense that we need a full atomic sentence which is the concatenation of all the elements in the list (if the element is a subtag we need its contents). To make this clear, consider the following example:

<DEF>air that is thin is more difficult to breathe than usual because it has less <NonDV><REFHWD>oxygen</REFHWD></NonDV> in it</DEF>

yields

“air that is thin is more difficult to breathe than usual because it has less oxygen in it”
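One way this flattening might be written is sketched below. The name flatten_def/2 is hypothetical, chosen to avoid suggesting that this is the flatten predicate of the project, whose handling of spacing may differ. The sketch assumes that every subtag is a one-argument term wrapping a list, as in the general form above, and it joins the text parts with single spaces.

```prolog
% Sketch only: flatten_def/2 is a hypothetical name, not the project's flatten/2.
flatten_def(Parts, Sentence) :-
    def_words(Parts, Words),
    atomic_list_concat(Words, ' ', Sentence).

% def_words(+Parts, -Atoms): collect the text leaves from left to right.
def_words([], []).
def_words([Part|Rest], Words) :-
    (   atomic(Part)
    ->  Words = [Part|RestWords]
    ;   Part =.. [_Tag, SubParts],          % a subtag such as nondv([...])
        def_words(SubParts, SubWords),
        append(SubWords, RestWords, Words)
    ),
    def_words(Rest, RestWords).
```

For example, flatten_def(['it has less', nondv([refhwd([oxygen])]), 'in it'], S) binds S to 'it has less oxygen in it'. A term that is neither atomic nor a one-argument subtag makes the sketch fail rather than guess at its contents.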

An example use of extract_senses

Consider the following small and parsed dictionary entry for the word “tag” as a noun:

entry([

head([hwd([tag]),

homnum([1]),

proncodes([pron([t{g])]),

pos([n])]),


sense([

gram([C]),

field([D]),

def([a small piece of paper, plastic etc, fixed to something to show what it is, who owns it, what it costs etc]),

colloexa([collo([name/identification/price tag]),

example([Where's the price tag on this dress?])])

]),

sense([

gram([U]),

field([DG]),

def([a children's game in which one player chases and tries to touch the others])

]),

sense([

gram([C]),

field([SLG]),

def([a phrase such as `isn't it?', `won't it?', or `does she?', added to the end of a sentence to make it a question or to ask you to agree with it])

]),

sense([

gram([C]),

field([DC]),

def([a metal or plastic point at the end of a piece of string or, nondv([refhwd([shoelace])]), that prevents it from splitting])

])

])

The above entry would yield the following when given to the extract_senses predicate:

[

meaning(tag, [

' a metal or plastic point at the end of a piece of string or shoelace \n that prevents it from splitting',

' a phrase such as `isn\'t it?\', `won\'t it?\', or `does she?\', added to the end of a sentence to make it a question or to ask you to agree with it',

' a children\'s game in which one player chases and tries to touch the others',

' a small piece of paper, plastic etc, fixed to something to show what it is, who owns it, what it costs etc'])

]

For bigger entries there are usually more meaning terms within the resulting list, each giving a lexical unit and a corresponding list of its definitions.

3. The Stanford parser

The Stanford Parser package contains implementations of three parsers for natural language text: an accurate unlexicalized probabilistic context-free grammar (PCFG) parser, a lexical dependency parser, and a factored, lexicalized PCFG parser, which does joint inference over the first two. With a well-engineered grammar (as supplied for English), the unlexicalized PCFG parser is claimed to be fast and accurate and to require much less memory. In many real-world uses, lexical preferences are unavailable or inaccurate across domains or genres, and in those cases it will perform just as well as a lexicalized parser. However, the factored parser will sometimes provide greater accuracy through knowledge of lexical dependencies. Using the dependency parser by itself is not very useful (its accuracy is much lower). The parser has been trained on the Penn Treebank, a brief description of which follows shortly.

The parser can also be used for part-of-speech tagging. In general, the leaves of the produced tree are the terminal symbols of the grammar, which in this case are the actual words in the sense definitions of the dictionary. The parent node of any word (terminal symbol of the grammar) is in our case the part of speech of that word. The parts of speech produced for each word of a sentence (a sense definition in our case) are likely to be more accurate than the parse tree itself.
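Reading the parts of speech off a parse can be sketched as follows, assuming that a parse tree is held as a nested Prolog term such as 'NP'('DT'(the),'NN'(boss)), in which a tag node has a single atomic argument, the word. The predicate name pos_tags/2 is hypothetical and is not part of the interface described in this report.

```prolog
% pos_tags(+Tree, -Pairs): Pairs is the list of Word-Tag pairs, left to right.
% Sketch only: pos_tags/2 is a hypothetical name.
pos_tags(Tree, [Word-Tag]) :-
    Tree =.. [Tag, Word],
    atomic(Word), !.                  % a tag node sitting directly above a word
pos_tags(Tree, Pairs) :-
    Tree =.. [_Label|Children],
    maplist(pos_tags, Children, PairLists),
    append(PairLists, Pairs).         % append/2 concatenates a list of lists
```

For the term 'S'('NP'('AT'('The'),'NNS'(children)),'VP'('VBD'(slept))) the result is ['The'-'AT', children-'NNS', slept-'VBD'].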

No matter which of the previously mentioned parsers we use to tag the definitions in the dictionary, it is not possible to trust the parses, the reason being that lexical preferences in the dictionary are either not taken into account (in the case of the unlexicalized PCFG parser) or inaccurate (in the case of the factored parser).

3.1. The Penn Treebank Project

The Penn Treebank Project annotates naturally occurring text for linguistic structure. Skeletal parses are produced showing rough syntactic and semantic information: a bank of linguistic trees. The texts are also part-of-speech tagged. The corpora annotated in the project include the Wall Street Journal, the Brown Corpus and the Switchboard corpus of telephone conversations. This database of linguistic trees has been used to train the Stanford Parser (supervised learning).

The tag set used, which is the same as the tag set used by the Stanford parser, is included in appendix B.

3.2. The Prolog interface to the Stanford parser

Using the JPL package in SWI-Prolog, an interface between my Prolog program and the parser has been implemented to allow the Java-based parser to be invoked from within SWI-Prolog. JPL makes it possible to create Java objects and invoke their methods.

3.2.1. The PrologTermCreator Class

The output of the parser is a Tree object, which provides a recursive tree representation of a sentence parse. I have therefore had to port this representation to Prolog. For this purpose a small Java class has been implemented containing a single main method. What follows is a description of the relatively simple recursive algorithm used to generate a valid Prolog term as a String object from the parse Tree:

createPrologTerm(Tree)::String :

Takes a Stanford Tree object as its argument and generates as output a String object containing a valid Prolog term, which is the bracketed representation (as described in the background section) of the given parse tree. The following pseudocode describes the algorithm used:

1. Extract the label of the current node as Label

2. Double the occurrences of the single quote character in Label:

E.g. “don't” is converted to “don''t”

3. Place Label inside single quotes:

E.g. “NP” is converted to “ 'NP' ”

4. If the current node has children, add an open bracket to the end of the string generated so far.

E.g. “ 'NP' ” becomes “ 'NP'( ”

5. For each child of the current node do:

1. Call the same method recursively on the child and add the returned string to the end of the string generated so far

2. If the current child is NOT the last child:

Add the comma character ',' to the end of the string generated so far

6. If the current node has children, add the close bracket character ')' to the end of the string generated so far

7. Return as result the generated string

The following is an example of the method's usage. The parse tree

S
    NP
        AT    The
        NNS   children
    VP
        VBD   slept

yields: 'S'('NP'('AT'('The'),'NNS'('children')),'VP'('VBD'('slept')))

which is a valid Prolog term.
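The class itself is written in Java, but the same serialisation steps can be sketched in Prolog for illustration. The sketch assumes a hypothetical tree representation node(Label, Children) with atomic labels, which is not the Stanford Tree class, and the predicate names are invented here. It does not handle backslashes inside labels.

```prolog
% Sketch only: node(Label, Children) is a hypothetical tree representation.
tree_term_text(node(Label, Children), Text) :-
    quoted_label(Label, Quoted),
    (   Children == []
    ->  Text = Quoted                                   % leaf: no brackets
    ;   maplist(tree_term_text, Children, ChildTexts),
        atomic_list_concat(ChildTexts, ',', Args),      % commas between children
        atomic_list_concat([Quoted, '(', Args, ')'], Text)
    ).

% Double every single quote in the label, then wrap it in single quotes.
quoted_label(Label, Quoted) :-
    atomic_list_concat(Pieces, '\'', Label),            % split at single quotes
    atomic_list_concat(Pieces, '\'\'', Doubled),        % rejoin with doubled quotes
    atomic_list_concat(['\'', Doubled, '\''], Quoted).
```

Reading the generated text back with term_to_atom/2 gives a quick check that it is a valid term: for the tree of "The children slept", the text converts to the term 'S'('NP'('AT'('The'),'NNS'(children)),'VP'('VBD'(slept))).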

3.2.2. The parse predicate

The main Prolog predicate in charge of the parsing is parse(+Sentence,-PrologTerm), which takes a sentence as a Prolog atom and produces the bracketed representation of the Stanford tree corresponding to the parse of that sentence.

parse(Sentence,PrologTerm) :-
    load_parser(ParserRef),
    load_prolog_term_creator(PTC),
    parse_sentence(Sentence,ParserRef,PTC,PrologTerm).

The load_parser predicate above is invoked to produce the reference to a Stanford lexparser object. The apply method is then invoked on this object reference with the sentence passed as argument, using the built-in jpl_call predicate in SWI's JPL package.

The load_prolog_term_creator predicate produces a reference to a PrologTermCreator object. The implementation of the PrologTermCreator class was described in the previous section. The actual parsing process is done inside the parse_sentence predicate:

parse_sentence(Sentence,ParserRef,PrologTermCreator,PrologTerm) :-
    jpl_call(ParserRef,apply,[Sentence],Parse),
    jpl_call(PrologTermCreator,createPrologTerm,[Parse],PrologParse),
    string_to_atom(PrologParse,PrologAtom),
    atom_to_term(PrologAtom,PrologTerm,_).

The createPrologTerm(Tree) method is called with the parse Tree object passed as argument. The result is a String which is then converted to a Prolog atom, which is in turn converted into a Prolog term. The created term is then unified with the fourth argument of the predicate, namely PrologTerm.

3.2.3. The parse_and_bracket_entry(+DicEntry,-ParsedDicEntry) Predicate

The predicate takes an SGML-parsed dictionary entry (these are the entries produced by the reparse_term predicate described in previous sections) as the first argument and unifies its second argument with the same dictionary entry with all the sense definitions and examples replaced with the corresponding bracketed representations of their parse trees. The Stanford parser is used to parse the definitions. This is done through unification and recursion using two other helper predicates, bracket_tag_list and bracket_tag. The process is described below:

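parse_and_bracket_entry(entry(TagList), entry(BracketedEntryList)) :-
    % The two loading calls are missing from the listing and are
    % reconstructed from the description that follows it.
    load_parser(ParserRef),
    load_prolog_term_creator(PTC),
    bracket_tag_list(ParserRef,PTC,TagList,BracketedEntryList).

bracket_tag_list(_,_,[],[]).
bracket_tag_list(ParserRef,
                 PTC,
                 [Tag|Rest],
                 [BracketedTag|Bracketed1]) :-
    bracket_tag_list(ParserRef,PTC,Rest,Bracketed1),
    bracket_tag(ParserRef,PTC,Tag,BracketedTag).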

The parser and the PrologTermCreator references are created using the corresponding predicates explained in the previous section, after which point they are used throughout the process to parse and handle parse trees. The main predicate in charge of the top-level recursive navigation of a dictionary entry is:

bracket_tag_list(+ParserReference, +PrologTermCreatorRef, +EntrySubtagList, -ParsedEntrySubtagList).

As mentioned in previous sections on the structure of the reparse of the SWI SGML parse, each dictionary entry is of the form entry(subtag list). The above predicate goes recursively through the subtag list, invoking the bracket_tag predicate for every subtag. This predicate in turn unifies its last argument with the Stanford-parsed version of that particular subtag. The parsed subtag is then added to the result list (ParsedEntrySubtagList).

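bracket_tag(ParserRef,PTC,def(L),def(BracketedSentence)) :-
    !,
    flatten(L,Sentence),
    parse_sentence(Sentence,ParserRef,PTC,BracketedSentence).

bracket_tag(ParserRef,PTC,example(L),example(BracketedSentence)) :-
    !,
    % The body of this clause is missing from the listing. It is
    % reconstructed from the statement below that examples are treated
    % exactly like definitions.
    flatten(L,Sentence),
    parse_sentence(Sentence,ParserRef,PTC,BracketedSentence).

bracket_tag(ParserRef,PTC,Tag,BracketedTag) :-
    atomic(Tag) ->
    (
        BracketedTag=Tag
    );
    (
        Tag=..[TagName,TagList],
        bracket_tag_list(ParserRef,PTC,TagList,BracketedTagList),
        BracketedTag=..[TagName,BracketedTagList]
    ).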

Inside any tag we have a list which contains one or more elements. We have the following three cases:

● The tag is def(list), which contains a sense definition that has to be parsed. The list in this case contains either a single atomic sentence or a number of subsentences, possibly enclosed by other tags like <ROMAN>. The list has to be flattened using the flatten predicate (described in previous sections) for the full sentence to be produced.

● The tag is example(list), which contains an example use of a definition and is treated exactly the same as the previous case.

● Otherwise it is either an atomic element with no subtags (the base case of the recursion), in which case we simply unify the result with the same atomic element, leaving it intact, or it is a tag with a list of other subtags. In the latter case we recursively invoke the bracket_tag_list predicate to parse that list, unifying the result with the parsed form of the subtag list.
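The three cases can be exercised without the Java parser by passing the transformation in as a parameter. The following is a minimal sketch under that assumption: map_definitions/3 and count_parts/2 are hypothetical names, and count_parts is only a stand-in for flattening and parsing.

```prolog
% map_definitions(+Transform, +Tag, -NewTag)
% Transform is called as call(Transform, Parts, Replacement) on the list
% inside every def(...) and example(...) tag. Sketch only: both predicate
% names below are hypothetical.
map_definitions(Transform, Tag, NewTag) :-
    (   atomic(Tag)
    ->  NewTag = Tag                                    % base case: plain text
    ;   Tag =.. [Name, Parts],
        memberchk(Name, [def, example])
    ->  call(Transform, Parts, Replacement),
        NewTag =.. [Name, Replacement]
    ;   Tag =.. [Name, SubTags]
    ->  maplist(map_definitions(Transform), SubTags, NewSubTags),
        NewTag =.. [Name, NewSubTags]
    ).

% Stand-in for the parser: report how many text parts the tag contained.
count_parts(Parts, n(N)) :- length(Parts, N).
```

Calling map_definitions(count_parts, sense([def([a,b]), gramexa([example([c])])]), R) binds R to sense([def(n(2)), gramexa([example(n(1))])]), leaving every tag other than def and example structurally unchanged.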

3.2.4. An example parse of a small LDOCE entry

The following shows the result of invoking the parse_and_bracket_entry predicate on the first entry for the word 'toady', where the definitions and examples have been replaced by the corresponding bracketed representations of their parse trees:

Original Entry:

entry([head([hwd([toady]), homnum([2]), pos([v]), gram([I])]), sense([activ([PRETEND]),

def(

[to pretend to like an important person so that they will help you]


), gramexa([propform([+ to]),

example(

[toadying to the boss]

)])])])

Parsed Entry:

entry([head([hwd([toady]), homnum([2]), pos([v]), gram([I])]), sense([activ([PRETEND]),

def(

ROOT

(FRAG

(S

(VP (TO (to))

(VP (VB (pretend))

(S

(VP (TO (to))

(VP (VB (like))

(NP (DT (an)) (JJ (important)) (NN (person)))

(SBAR (RB (so)) (IN (that))

(S

(NP (PRP (they)))

(VP (MD (will))

(VP (VB (help))

(NP (PRP (you))))))))))))))

), gramexa([propform([+ to]),

example(

ROOT

(S

(VP (VBG (toadying))

(PP (TO (to))

(NP (DT (the)) (NN (boss))))))))])])])


3.3. An overall assessment of the Stanford parser

● More often than not, LDOCE definitions aren't complete sentences, in the sense that they are usually fragments, noun phrases or verb phrases describing a particular sense or usage of a word. They are however always grammatically correct. The Stanford parser is able to handle these cases, as can be seen in the small example given in the previous section.

● It has proved very difficult to know whether the phrase structure ambiguities are being resolved correctly by the Stanford Parser. The question here is whether the parse forced by Stanford's English PCFG grammar is indeed the correct one in terms of the intended semantics of the definition. Most of the parses are expected to be correct. The parser has been tested on several dictionary definitions, and it does seem to produce the correct parse in each case. Nevertheless it cannot be stated for certain that the parses produced are all accurate and that the lexical preferences used in the Stanford lexicalized factored parser fit the LDOCE definitions.

● It was mentioned at the beginning of this chapter that the parser can be applied as a POS tagger. Although the produced phrase structures are prone to error in ambiguous cases, the parts of speech are unlikely to be inaccurate.


4. Evaluation

This project did not start with clear initial objectives, which showed from the beginning that it was mainly research-based. The objectives became clearer as the project progressed. It involved extensive research in the area of MRDs in general and LDOCE in particular. Most of the background in the various areas that had to be studied was completely new, including Machine Readable Dictionaries and statistical (mainly corpus-based) methods of parsing and disambiguation at both the syntactic and semantic levels. The primary previous work on LDOCE (mostly by Guo) has been studied, and the overall process of creating an MTD from LDOCE briefly outlined. I will be restating the problem of MTD construction in other terms in the next sections.

4.1. The Prolog interface

A small parser has been written to reparse the output of SWI-Prolog's built-in SGML parser, in order to remove the redundancies mentioned in Yarwood's report.

A Prolog interface to the machine-readable version of the Longman Dictionary of Contemporary English has been successfully implemented, which facilitates the extraction of the explicitly encoded linguistic features in LDOCE. The implemented Prolog predicates serve as a base for any interface which might be written for the dictionary in the future. It is still generally difficult for an end user to use the MRD, since the interface is in Prolog. There are still many explicitly encoded features in the dictionary without a corresponding predicate for their extraction. However, the predicates already written make it straightforward to implement these.

4.2. A Machine-Tractable LDOCE?

Research has been carried out to try to produce a measure of the major task of creating a Machine-Tractable Dictionary out of the LDOCE. Put in simplest terms, this is a process of disambiguation at the syntactic and semantic levels. We would like to create a fully parsed, part-of-speech tagged and cross-referenced version of the LDOCE, i.e. we would like to replace each sense definition of the LDOCE with the Fregean formula representing it. To conclude I will restate the problem in other terms:


Consider the following relation on all senses in LDOCE:

DEFINING(sense1, sense2): if sense1 is invoked in the definition of sense2.

Considering this relation as an enormous directed graph, we'd like to discover the edges between the nodes (the senses). But without any processing, we do have an initial idea of where these edges might be. If for instance some word 'A' is used in the definition of sense1 of word 'B', then we know that there is an edge going from one of the senses of word 'A' to sense1. The DEFINING relation above is a partial order on the senses of the dictionary.
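As a small illustration, the relation can be written down as Prolog facts and queried for everything a sense depends on. The facts below are toy data invented for this sketch, with senses written as Word/SenseNumber; they are not taken from LDOCE. The recursive predicate terminates only if the graph has no cycles, which is what treating DEFINING as a partial order assumes.

```prolog
% Hypothetical toy data: defining(Sense1, Sense2) holds if Sense1 is
% invoked in the definition of Sense2.
defining(game/1,   tag/2).
defining(player/1, tag/2).
defining(person/1, player/1).

% depends_on(?Sense, ?Other): Other is used, directly or indirectly,
% in the definition of Sense.
depends_on(Sense, Other) :-
    defining(Other, Sense).
depends_on(Sense, Other) :-
    defining(Mid, Sense),
    depends_on(Mid, Other).
```

The query findall(O, depends_on(tag/2, O), L) collects game/1, player/1 and person/1: the first two are direct edges, and person/1 is reached through player/1.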

As mentioned in the background section, the LDOCE has been designed to have a limited defining vocabulary. However, this does not mean that the exact word sense being invoked in some other sense definition is guaranteed to be defined under the entry for the word used in that definition. Nevertheless we should be able to find the sense closest to the actual meaning of a word used.

Parsing the sense definitions is a required step (though far from trivial) before any further processing. The steps to produce an MTD have already been briefly outlined in the background section on Machine-Tractable Dictionaries.

In the hope of contributing to the process of MTD construction, the Stanford parser has been applied and has proved able to handle the sense definitions of LDOCE. A Prolog interface to the Java-based parser has also been implemented, which makes it possible to parse the dictionary entries (the definitions within them) while handling both the SGML entries and the parses as Prolog terms. We are now able to invoke the implemented predicate on the whole dictionary to find and parse a dictionary entry and return as a result the entry with all its sense definitions and examples replaced by their corresponding parse trees.

The parses produced by the Stanford Parser, though still in need of further validation, are very unlikely to prove useless as far as creating an MTD is concerned.

4.3. Recommendations For Future Work

● The development of a more user-friendly front end to the Prolog interface developed in this project, possibly web-based (HTML). The Prolog interface itself would have to be extended, since there aren't predicates present for all the explicitly encoded information in the dictionary. This step is however straightforward with the already implemented predicates as a base.

● Improving the efficiency of the find_entry predicate using the algorithm outlined by Jurafsky and Martin in [15], in order to derive the stem from a given inflected form (we are doing this the other way around in this project, i.e. extracting the root form from a dictionary entry, generating its regular inflected forms and comparing them to the word we are looking for).

● Further investigation into the inner workings of the Stanford Parser, in order to provide an answer to the question of whether the parser (the methods and the grammars used) suits the definitions in LDOCE (in terms of ambiguity resolution).

● Further research on how we might be able to disambiguate the word senses in the dictionary definitions, enabling us to produce a cross-referenced version of the dictionary. Prior to this, the semantic primitives in LDOCE (the seed senses) would have to be derived, so that we can express all “meaning” in LDOCE in terms of these semantic primitives. Work has been done in this area by Guo [1], whose insights can surely be applied to achieve the objectives. The last and most involved steps of the MTD construction process described by Guo have merely been demonstrated to be feasible, and have not yet been applied to the whole dictionary, which leaves a lot of room for the future continuation of his work on the LDOCE.


6. Bibliography and References

[1] Cheng-Ming Guo (ed), “Machine Tractable Dictionaries: Design and Construction”, Ablex Publishing Corporation, Norwood, New Jersey, 1995.

[2] “LDOCE3 NLP Database” [LDOCE3 documentation], Version 1.2, February 1997, Addison Wesley Longman, 1997.

[3] Christopher D. Manning and Hinrich Schütze, “Foundations of Statistical Natural Language Processing”

[4] Joan M Smith and Robert Stutely, “SGML: The User's Guide to ISO 8879”, Ellis Horwood Ltd, 1988.

[5] Alfred V. Aho, Ravi Sethi and Jeffrey D. Ullman, “Compilers: Principles, Techniques and Tools”

[6] Jeremy Valentine Pitt, “Compositional Grammars and Natural Language Processing”

[7] Neil Bradley, “Anatomy of an SGML Document” in GCA SGML 96. GCA, 1996.

[8] “Using SGML for Linguistic Analysis: The Case of the BNC” in GCA 1996, GCA, 1996.

[9] The Stanford NLP group Website [http://www-nlp.stanford.edu]

[10] Michael A. Covington, “Natural Language Processing for Prolog Programmers”, Prentice-Hall, 1994.

[11] Dan Klein and Christopher D. Manning. 2002. Fast Exact Inference with a Factored Model for Natural Language Parsing. In Advances in Neural Information Processing Systems 15 (NIPS 2002), December 2002.

[12] Dan Klein and Christopher D. Manning. 2003. Accurate Unlexicalized Parsing. Proceedings of the 41st Meeting of the Association for Computational Linguistics.

[13] George A. Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine Miller, “Introduction to WordNet: An On-line Lexical Database”, 1993.

[14] Princeton University Website [http://wordnet.princeton.edu/doc].

[15] Daniel Jurafsky and James H. Martin, “Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition”, Prentice-Hall, 2000.

[16] Michael A. Covington, “Tokenization using DCG Rules”, Artificial Intelligence Center, The University of Georgia, Athens, Georgia 30602-7415 U.S.A., 2000 April 21 [http://www.ai.uga.edu/mc/projpaper.pdf]

[17] Longmans Website: [http://www.longman.com/dictionaries/research/dictres.html]

[18] SWI-Prolog Website [http://www.swi-prolog.org]

[19] The Penn Treebank Project Website: [http://www.cis.upenn.edu/~treebank/]

[20] A Gentle Introduction to SGML: [http://www.w3c.org]

[21] Tony McEnery and Andrew Wilson, Corpus Linguistics (Second Edition)

[22] Roland Hausser, Foundations of Computational Linguistics

[23] Natasha Yarwood, An interface to LDOCE: A machine readable dictionary

[24] Beatrice Santorini, Bracketing Guidelines for The Penn Treebank Project


A. Code Listing

This appendix includes the code listings for the Prolog interface to LDOCE described in section 2.2 and the interface to the Stanford Parser described in section 3.2.

A.1. The Prolog interface to LDOCE

%======================================== The parser ========

write_list_to_file(List,Target) :-
    tell(Target),
    write_to_file(List),
    told.

write_to_file([]).
write_to_file([H|T]) :-
    write_term(H,[quoted(true)]),writeln('.'),
    write_to_file(T).

write_term_to_file(Term,Target) :-
    tell(Target),
    write(Term),
    told.

%

reparse_term(File,Result) :-
    load_sgml_file(File,SWI_Term),
    [Ldoce]=SWI_Term,
    Ldoce=..[_,_,_,EntryList],
    process_entrylist(EntryList,Result).

process_entrylist([],[]).
process_entrylist([Value],[Value]) :- Value\=element(_,_,_),!.
process_entrylist([Entry|Rest],Result) :-
    (Entry=='<#PCDATA>';Entry=element(piccal,_,_)) ->
        process_entrylist(Rest,Result)
    ;
        process_entrylist(Rest,RestProcessed),
        process_entry(Entry,ParsedEntry),
        Result=[ParsedEntry|RestProcessed].

process_entry(Entry,Parsed) :-
    Entry=element(Tag,_,SubTagList) ->
        process_entrylist(SubTagList,SubResult),
        Parsed=..[Tag,SubResult]
    ;
        Parsed=Entry,!.

reverse_list(A,B) :- rev(A,[],B).

rev([],Sofar,Sofar).
rev([H|T],Sofar,R) :-
    rev(T,[H|Sofar],R).

%

%================= Entry Extraction =============================

convert_to_lowercase(Char,Result) :-
    atom_codes(Char,[Code]),
    (
        (Code=<122,Code>=97) ->
            Result=Char
        ;
            NewCode is Code+32,
            char_code(Result,NewCode)
    ).

find_entry(Dictionary_Folder,Word,Result) :-
    char_at(Word,0,Char),
    convert_to_lowercase(Char,LChar),
    atom_concat(LChar,'.sgm',File),
    atom_concat(Dictionary_Folder,File,Absolute_Path),
    reparse_term(Absolute_Path,Dic),
    find_entry1(Dic,Word,Result).

find_entry1(Dic,Word,Result) :- find_entry2(Dic,Word,[],Result).

find_entry2([],_,ListSoFar,ListSoFar).
find_entry2([Entry|Rest],Word,ListSoFar,Result) :-
    (root_form(Entry,Word)) ->
        find_entry2(Rest,Word,[Entry|ListSoFar],Result)
    ;
    (
        ListSoFar=[] ->
            find_entry2(Rest,Word,ListSoFar,Result)
        ;
            find_entry2([],Word,ListSoFar,Result)
    ).

root_form(Entry,Word) :-
    % Part-of-speech extraction. This call is missing from the listing and
    % is reconstructed from the description in section 2.2.
    extract_POS([Entry],POS),
    entry([(head([(hwd([W]))|_]))|_])=Entry,
    list_all_inflections(W,POS,Inflections1),
    inflected_forms([Entry],Inflections2,[]),
    union(Inflections1,Inflections2,U),
    member(Word,U).

%Inflected forms

list_all_inflections(_,[],[]).
list_all_inflections(W,[P|Rest],Inflections) :-
    inflections(W,P,List),
    list_all_inflections(W,Rest,Inflections0),
    union(List,Inflections0,Inflections).

equal_case_insensitive(S1,S2) :-
    string_to_list(S1,L1),
    string_to_list(S2,L2),
    compare_list(L1,L2).

compare_list([],[_|_]) :- fail.
compare_list([_|_],[]) :- fail.
compare_list([],[]).
compare_list([H1|L1],[H2|L2]) :-
    % Character codes match if they are equal or differ by 32 (the case
    % offset). Written with is/2 in place of the listing's abs((H1-H2),N).
    N is abs(H1-H2),
    (N==0;N==32) ->
        compare_list(L1,L2)
    ;
        fail.

inflected_forms([],Sofar,Sofar).
inflected_forms([Entry|Rest],List,Sofar) :-
    Entry=entry(TagList),
    process_tag_list(TagList,Result,[]),
    append(Sofar,Result,NewSofar),
    inflected_forms(Rest,List,NewSofar).

process_tag_list([],Sofar,Sofar).
process_tag_list([Single],Sofar,Sofar) :- \+(Single=..[_,_]),!.
process_tag_list([Tag|Rest],Result,Sofar) :-
    Tag=..[TagName,TagList] ->
    (
        TagName==inflections ->
        (
            get_spellings(TagList,List),
            append(Sofar,List,NewSofar),
            process_tag_list(Rest,Result,NewSofar)
        )
        ;
        (
            process_tag_list(TagList,Result1,Sofar),
            process_tag_list(Rest,Result,Result1)
        )
    );
    process_tag_list(Rest,Result,Sofar).

get_spellings([spelling(List)|_],List) :- !.
get_spellings([_|T],List) :-
    get_spellings(T,List).

%POS extraction

extract_POS([],[]).
extract_POS([Entry|List],Resultlist) :-
    % This call is missing from the listing. It is reconstructed from the
    % extractpos/2 helper defined below.
    extractpos(Entry,POS),
    extract_POS(List,POSList),
    union(POS,POSList,Resultlist).

extractpos(Entry,POS) :-
    Entry=entry([head(L)|_]),
    pos(L,POS,[]).

pos([],Sofar,Sofar).
pos([Tag|T],POS,Sofar) :-
    Tag=pos([SPOS])
    ->
    (
        pos(T,POS,[SPOS|Sofar])
    )
    ;
    (
        pos(T,POS,Sofar)
    ).

%Generating regular inflections

inflections(Word, v, [Word|Inflections]):- !,verb_inflections(Word,Inflections).

inflections(Word, n, [Word|Inflections]):- !,noun_inflections(Word,Inflections).

inflections(Word, adj, [Word|Inflections]):- !,adjective_inflections(Word,Inflections).

inflections(Word,_,[Word]).

adjective_inflections(Word,Inflections):-
    last_chars(Word,1, Chars1),
    (
        (
            % final -y: generate both the -ier/-iest and the -er/-est forms
            Chars1=y,!,
            atom_length(Word,N),
            M is N-1,
            sub_atom(Word,0,M,_,Sub),
            atom_concat(Sub,ier,Comparative),
            atom_concat(Sub,iest,Superlative),
            atom_concat(Word,er,Comparative1),
            atom_concat(Word,est,Superlative1),
            Inflections=[Comparative,Superlative,Comparative1,Superlative1]
        );
        (
            % final -e: add -r/-st
            Chars1=e,!,
            atom_concat(Word,r,Comparative),
            atom_concat(Word,st,Superlative),
            Inflections=[Comparative,Superlative]
        );
        (
            % default: add -er/-est
            atom_concat(Word,er,Comparative),
            atom_concat(Word,est,Superlative),
            Inflections=[Comparative,Superlative]
        )
    ).

noun_inflections(Word, [Plural,Possessive]):-
    atom_length(Word,1),!,
    atom_concat(Word,s,Plural),
    atom_concat(Word,'\'s', Possessive).

noun_inflections(Word, [Plural, Possessive]):-
    last_chars(Word, 1, Chars1),
    last_chars(Word,2,Chars2),
    atom_length(Word,L),
    (
        (
            Chars1=s,!,
            %plural
            atom_concat(Word,es,Plural),
            %possessive
            atom_concat(Word,'\'', Possessive)
        );
        (
            Chars1=y,
            L1 is L-1,
            L2 is L-2,
            char_at(Word,L2,Penul),
            \+is_vowel(Penul),!,
            sub_atom(Word,0,L1,_,Sub),
            %plural
            atom_concat(Sub,ies,Plural),
            %possessive
            atom_concat(Word,'\'s',Possessive)
        );
        (
            (Chars2=ch;Chars1=x;Chars1=z;Chars2=sh),!,
            %plural
            atom_concat(Word,es,Plural),
            %possessive (line lost in extraction; reconstructed from the branch above)
            atom_concat(Word,'\'s',Possessive)
        );
        (
            atom_concat(Word,s,Plural),
            %possessive (line lost in extraction; reconstructed from the branch above)
            atom_concat(Word,'\'s',Possessive)
        )
    ).

verb_inflections(Word, [Past,Gerund,Third]):-
    last_chars(Word, 1, Chars1),
    % Reconstructed line (lost in extraction): L is needed below.
    atom_length(Word,L),
    (
        (
            Chars1=e,!,
            %past tense
            atom_concat(Word,d,Past),
            %gerund
            L1 is L-1,
            % Reconstructed line (lost in extraction): drop the final e.
            sub_atom(Word,0,L1,_,Sub),
            atom_concat(Sub,ing,Gerund),
            %third person
            atom_concat(Word,s,Third)
        );
        (
            Chars1=y,
            L1 is L-1,
            L2 is L-2,
            char_at(Word,L2,Penul),
            \+is_vowel(Penul),!,
            % Reconstructed line (lost in extraction): drop the final y.
            sub_atom(Word,0,L1,_,Sub),
            %past tense
            atom_concat(Sub,ied,Past),
            %gerund
            atom_concat(Word,ing,Gerund),
            %third person
            atom_concat(Sub,ies,Third)
        );
        (
            %past tense
            atom_concat(Word,ed,Past),
            %gerund
            atom_concat(Word,ing,Gerund),
            %third person
            atom_concat(Word,s,Third)
        )
    ).

is_vowel(i).

is_vowel(o).

is_vowel(u).

is_vowel(e).

is_vowel(a).


char_at(Word,I,Char):-
    sub_atom(Word,I,1,_,Char).

last_chars(Word,N,Chars):-
    atom_length(Word,Length),
    L is Length-N,
    sub_atom(Word,L,N,_,Chars).

%Extracting the senses

extract_senses(DicFolder, Word, Senses):-
    find_entry(DicFolder,Word,Entries),
    member(Entry,Entries),
    extract_sense_phrvb_tags(Entry,SenseList),
    extract_defs(SenseList,[meaning(Word,[])|[]],Senses).

extract_sense_phrvb_tags(entry(Entry_subtags),Senses):-
    search_tags(Entry_subtags,Senses).
    %reverse_list(Senses1,Senses).

extract_sense_phrvb_tags(phrvbentry(Subtags),Senses):-
    search_tags(Subtags,Senses).

search_tags([],[]).

search_tags([phrvbentry(Subtags)|Rest],[phrvbentry(Subtags)|Senses]):-
    !,search_tags(Rest,Senses).

search_tags([sense(Subtags)|Rest],[sense(Subtags)|Senses]):-
    !,search_tags(Rest,Senses).

search_tags([_|Rest],Senses):-
    search_tags(Rest,Senses).

%Defs=[meaning(LEXUNIT,[definitions go here]),meaning(LEXUNIT,[definitions go here]), . . . ]

extract_defs([],AllSenses,AllSenses).

extract_defs([phrvbentry(L)|Rest],[meaning(Word,Senses)|OtherSenses],AllSenses):-
    L=[head([phrvbhwd(Phrvb)|_])|_],
    extract_sense_phrvb_tags(phrvbentry(L),PhrvbSenses),
    extract_phrvb_senses(PhrvbSenses,Defs),
    flatten(Phrvb,PhrvbFlat),
    extract_defs(Rest,
        [meaning(Word,Senses)|[meaning(PhrvbFlat,Defs)|OtherSenses]],
        AllSenses
    ).

extract_defs([sense(L)|Rest],[meaning(Word,Senses)|OtherSenses],AllSenses):-
    extract_lexunit(L,LEXUNIT)->
    (
        extract_d(L, Defs),
        extract_defs(Rest,
            [meaning(Word,Senses)|[meaning(LEXUNIT,Defs)|OtherSenses]],AllSenses
        )
    )
    ;
    (
        extract_d(L, Defs),
        extract_defs(Rest,
            [meaning(Word,[Defs|Senses])|OtherSenses],AllSenses)
    ).

extract_phrvb_senses([],[]).

extract_phrvb_senses([sense(L)|Rest],[Def|Defs]):-
    extract_d(L,Def),
    extract_phrvb_senses(Rest,Defs).

extract_lexunit([],_):- fail.

extract_lexunit([lexunit(LEXUNIT)|_],LEXUNITFlat):-
    !,
    flatten(LEXUNIT,LEXUNITFlat).

extract_lexunit([_|Tags],LEXUNIT):- extract_lexunit(Tags,LEXUNIT).

extract_d([],[]).

extract_d([def(Def)|_],DefFlat):- !,flatten(Def,DefFlat).

extract_d([subsense(Tags)|Rest],[Def|Defs]):-
    !,
    extract_subsense_def(Tags,Def),
    extract_d(Rest,Defs).

extract_d([_|Rest],Defs):- extract_d(Rest,Defs).

extract_subsense_def([],_):- fail.

extract_subsense_def([def(Def)|_],DefFlat):- !,flatten(Def,DefFlat).

extract_subsense_def([_|Rest],Def):- extract_subsense_def(Rest,Def).

flatten([], '').

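flatten([DefPart|Rest], Sentence):-
    flatten(Rest,Sen),
    (
        (
            DefPart=nondv([refhwd([Word])|_]),!,
            atom_concat(' ',Word,Spaced),
            atom_concat(Spaced,Sen,Sentence)
        );
        (
            DefPart=roman([R]),!,
            atom_concat(' ',R,Spaced),
            % Reconstructed line (lost in extraction), following the first branch.
            atom_concat(Spaced,Sen,Sentence)
        );
        (
            atom_concat(' ',DefPart,Spaced),
            % Reconstructed line (lost in extraction), following the first branch.
            atom_concat(Spaced,Sen,Sentence)
        )).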


A.2. Interface to the Stanford Parser

:- [parser].

parse_sentence(Sentence,ParserRef,PrologTermCreator,PrologTerm):-
    jpl_call(ParserRef,apply,[Sentence],Parse),
    jpl_call(PrologTermCreator,createPrologTerm,[Parse],PrologParse),
    string_to_atom(PrologParse,PrologAtom),
    atom_to_term(PrologAtom,PrologTerm,_).

load_prolog_term_creator(PrologTermCreatorRef):-
    jpl_new('PrologTermCreator',[],PrologTermCreatorRef).

load_parser(ParserRef):-
    jpl_new('edu.stanford.nlp.parser.lexparser.LexicalizedParser',
        ['../stanfordparser20050721/englishPCFG.ser.gz'],ParserRef
    ).

load_printer(TreePrint):-
    jpl_new('edu.stanford.nlp.trees.TreePrint',['wordsAndTags'],TreePrint).

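parse(Sentence,PrologTerm):-
    % The two loader calls were lost in extraction; they are reconstructed
    % from the predicates defined above, which bind ParserRef and PTC.
    load_parser(ParserRef),
    load_prolog_term_creator(PTC),
    parse_sentence(Sentence,ParserRef,PTC,PrologTerm).

parse_and_bracket_entry(entry(TagList), entry(BracketedEntryList)):-
    % Reconstructed as in parse/2: the original loader lines were lost.
    load_parser(ParserRef),
    load_prolog_term_creator(PTC),
    bracket_tag_list(ParserRef, PTC, TagList, BracketedEntryList).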


bracket_tag_list(_,_,[],[]).

bracket_tag_list(ParserRef,
        PTC,
        [Tag|Rest],
        [BracketedTag|Bracketed1]):-
    bracket_tag_list(ParserRef,PTC,Rest,Bracketed1),
    bracket_tag(ParserRef,PTC,Tag,BracketedTag).

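% The bodies of the def and example clauses were lost in extraction.
% The two goals after each cut are an assumed reconstruction: flatten the
% tag content into a sentence, then parse it.
bracket_tag(ParserRef,PTC,def(L),def(BracketedSentence)):-
    !,
    flatten(L,Sentence),
    parse_sentence(Sentence,ParserRef,PTC,BracketedSentence).

bracket_tag(ParserRef,PTC,example(L),example(BracketedSentence)):-
    !,
    flatten(L,Sentence),
    parse_sentence(Sentence,ParserRef,PTC,BracketedSentence).

bracket_tag(ParserRef,PTC,Tag,BracketedTag):-
    atomic(Tag)->
    (
        BracketedTag=Tag
    );
    (
        Tag=..[TagName,TagList],
        bracket_tag_list(ParserRef,PTC,TagList,BracketedTagList),
        BracketedTag=..[TagName,BracketedTagList]
    ).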


B. The Penn POS Tag-Set

Tag:    Description:                                Example:

CC      coordinating conjunction                    and
CD      cardinal number                             1, third
DT      determiner                                  the
EX      existential there                           there is
FW      foreign word                                d'oeuvre
IN      preposition/subordinating conjunction       in, of, like
JJ      adjective                                   green
JJR     adjective, comparative                      greener
JJS     adjective, superlative                      greenest
LS      list marker                                 1)
MD      modal                                       could, will
NN      noun, singular or mass                      table
NNS     noun, plural                                tables
NNP     proper noun, singular                       John
NNPS    proper noun, plural                         Vikings
PDT     predeterminer                               both the boys
POS     possessive ending                           friend's
PRP     personal pronoun                            I, he, it
PRP$    possessive pronoun                          my, his
RB      adverb                                      however
RBR     adverb, comparative                         better
RBS     adverb, superlative                         best
RP      particle                                    give up
TO      to                                          to go, to him
UH      interjection                                uhhuhhuhh
VB      verb, base form                             take
VBD     verb, past tense                            took
VBG     verb, gerund/present participle             taking
VBN     verb, past participle                       taken
VBP     verb, sing. present, non-3rd person         take
VBZ     verb, 3rd person sing. present              takes
WDT     wh-determiner                               which
WP      wh-pronoun                                  who, what
WP$     possessive wh-pronoun                       whose
WRB     wh-adverb                                   where, when

Please refer to [24] for the Penn phrase-structure tags used by the Stanford Parser.

