
Bar Ilan University

The Department of Computer Science

Using Semantic Knowledge for

Coreference Resolution

by

Chen Erez

Submitted in partial fulfillment of the requirements for the Master's

Degree in the Department of Computer Science, Bar-Ilan University

Ramat Gan, Israel June 2009, Sivan 5769


This work was carried out under the supervision of Dr. Ido Dagan

The Department of Computer Science

Bar-Ilan University

Israel


Acknowledgements

This thesis has been accomplished with the help of a number of

people, and I wish to express my heartfelt thanks to them.

I am grateful to Dr. Ido Dagan at Bar-Ilan University for his

supervision of the thesis. It has been a great pleasure working with him, and I

have learned much.

I would like to thank Shachar Mirkin, Eyal Shnarch and Roy Bar Haim

for their guidance and helpful comments throughout the work.

I would like to thank our NLP group for their support.

I would like to thank my family and friends for their understanding,
support and encouragement.


Table of Contents

Abstract
1. Introduction
2. Background and related work
3. Coreference resolution using semantic features
3.1 Baseline coreference resolution system
3.2 Additional semantic features
4. Results and analysis
4.1 The dataset
4.2 Evaluation measures
4.3 Experimental setup
4.4 Results
4.5 Detailed analysis
5. Conclusion and future work
6. References
7. Appendixes
7.1 Decision trees example


Abstract

Coreference resolution is the process of matching pairs of natural

language expressions that refer to the same entity in the real world. For

example, consider the sentence: "Yesterday Danny had a birthday party and
he got a lot of presents". 'Danny' and 'he' refer to the same entity in the real
world; that is, Danny got a lot of presents.

This thesis deals mainly with coreference resolution between two noun

phrases in a document. The goal is to determine if some noun phrase, called

the anaphor, refers to a preceding noun phrase in the text, called the

antecedent. In this thesis, we focus on a supervised machine learning approach

for coreference resolution, using classifiers to determine whether two noun

phrases are coreferent or not. After classifying all candidate coreference pairs,

we group together all those noun phrases in the text that were found as

referring to the same real world entity, based on the equivalence relation

induced by the pair-wise classification (due to the transitivity of the

coreference relation). This process is called coreference chaining, and the set

of noun phrases in the text that refer to the same real world entity are called a

coreference chain.

Learning approaches for coreference resolution try to improve

performance mainly by enhancing the feature set of the coreference classifier.

Soon et al. (2001) and Ng and Cardie (2002) introduced coreference resolution
systems with multiple simple lexical, grammatical and positional features.

However, their feature sets lack semantic knowledge: specifically,
recognizing whether the antecedent-anaphor candidate pair is related
semantically, for example through hypernymy or synonymy, which is highly
relevant for coreference resolution.

This thesis presents new enhancements for coreference resolution,

adding semantic knowledge to the machine learning approach. The main

idea is using semantic resources in order to recognize relevant semantic

relations between the antecedent-anaphor candidates. The semantic

resources we use are the web, queried with Google through a pattern-based
approach, WordNet and Wikipedia. For each of these resources we

have created a new feature group.

We demonstrate that semantic knowledge is relevant for the

coreference resolution task by applying the coreference resolution system
with a different group of features each time, comparing the results and
presenting the learnt decision trees. We also give a thorough error
analysis: we analyze each resource and describe its usefulness and its
types of errors, revealing many directions for further research.


1. Introduction

The Coreference Resolution Task

Coreference resolution is the process of matching pairs of natural

language expressions that refer to the same entity in the real world. In the

process of identifying (resolving) coreference, we use the linguistic concept of

anaphora. Anaphora resolution is the process of finding the antecedent (or the

referent) of an anaphoric noun phrase. An anaphoric noun phrase (the

anaphor) is a noun phrase that refers to a previous noun phrase (the

antecedent), which has already appeared in the text. For example, consider
the following sentences:

"John gave Michael the Harry Potter book. He told Michael that the

book is marvelous."

In the above example we mark two noun phrases, John and He. John is

the antecedent and He is the anaphor. They corefer, that is, they refer to the

same real world entity (John). Other coreferring noun phrases are the two

mentions of Michael and the noun phrase 'the Harry Potter book' which is the

antecedent of the noun phrase 'the book'.

The next step is to create coreference chains. A coreference chain is an

equivalence class, because the coreference relation is reflexive, symmetric and transitive.

For instance, a coreference chain with three noun phrases includes an

anaphoric noun phrase A, which refers to an anaphoric noun phrase B (B is the

referent of A), where B refers to a noun phrase C (C is the referent of B).

Obviously, A, B and C all refer to the same real world entity.

The Importance of Coreference Resolution in NLP

Coreference resolution is a key task in many natural language

processing applications such as question answering and information retrieval.

Let us demonstrate the benefit of having a coreference resolution system as a

part of a question answering system:

Generally, the question-answering (QA) task is concerned with finding

an answer to a natural language question from a large text collection, or


determining that an answer cannot be found. The following example concerns

a typical question, and the benefit of a coreference resolution system:

Question: Where was Mozart born?

A part of the retrieved paragraph (identified as talking about Mozart):

"Mozart was a great musician … the musician was born in Salzburg…"

If we correctly identify Mozart as the antecedent of the musician, this assists in

answering the question.

The machine learning approach for coreference

In our research we focus on a supervised machine learning approach,

using classifiers to determine whether two noun phrases are coreferent. When

referring to the coreference resolution task as a classification problem, we

refer to a binary classification problem: coreferent or not-coreferent.

A coreference classification system includes the following steps and

definitions: First, we preprocess the text and perform various grammatical-
syntactic analyses, including the marking of noun phrases. Second, we prepare

a list of noun phrase pairs, where each pair contains an anaphor candidate and

an antecedent candidate. This list of noun phrase pairs defines our

classification instances. Third, we extract different features describing the

anaphor candidate, the antecedent candidate and different relationships

between them. This step is called feature extraction, and the different features

make up the feature set. The representation of an anaphor and an antecedent

noun phrase pair by these features is called a feature vector. We denote the

feature vector of a noun phrase pair as a coreference instance. Finally, a

classification algorithm creates a classification model, by tuning different

weights for these features, giving them preferences, or combining them

together to create decision trees. The created model is referred to as a

coreference classifier. The task of a trained coreference classifier is to

determine whether a coreference instance is coreferent or not.
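To make these steps concrete, the following minimal sketch shows the pairwise classification step. It uses scikit-learn's decision tree as a stand-in for the classifiers discussed later in this thesis, and the three features and their values are purely illustrative:

```python
# A toy sketch of pair-wise coreference classification.
# DecisionTreeClassifier stands in for C4.5/J48; the features
# (string match, number agreement, sentence distance) are an
# illustrative subset of a real feature set.
from sklearn.tree import DecisionTreeClassifier

# Each row is the feature vector of one <antecedent, anaphor> pair:
# [string_match, number_agreement, sentence_distance]
X_train = [
    [1, 1, 0],  # "Mozart" .. "Mozart"       -> coreferent
    [0, 1, 1],  # "Mozart" .. "the musician" -> coreferent
    [0, 0, 2],  # "Mozart" .. "the cities"   -> not coreferent
    [0, 1, 5],  # distant pair, no match     -> not coreferent
]
y_train = [1, 1, 0, 0]  # 1 = COREFERENT, 0 = NON-COREFERENT

classifier = DecisionTreeClassifier().fit(X_train, y_train)

# Classify a new coreference instance, e.g. "Danny" .. "he":
print(classifier.predict([[0, 1, 1]]))  # -> [1]
```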


Using semantic resources for coreference resolution

Many pairs of coreferring noun phrases have a semantic relationship.

For example, the words 'pet' and 'dog' have a hypernym relation

and are likely to be coreferent in a text like:

"Danny has a dog. He likes his pet very much"

Thus, extracting the semantic relation of the candidate antecedent-

anaphor pair seems to be useful for determining whether the candidate

pair is coreferent or not. More specifically, the most relevant semantic

relations are hypernymy and synonymy. The presence or absence of a strong
semantic relation between a pair of words helps to decide whether they
corefer.

Therefore, in this research we investigate three significant semantic

resources for extracting hypernym and synonym relations: Google (applying
the pattern-based approach), WordNet and Wikipedia.

The pattern-based module aims to identify joint occurrences of two

words within particular patterns, which typically indicate concrete semantic

relationships. For example, using the pattern 'NP1 and other NP2' (where NP

stands for Noun Phrase), a hyponymy relationship between copper and goods

can be identified from the sentence: “The Egyptians came hither from the land

of the Blacks bringing gold, which they exchange for copper and other

goods”. Thus, in our research we apply this module to each
antecedent-anaphor candidate pair, measuring the occurrences of each pair

with each of our patterns, described in section 3.2.1.1.

WordNet1 is a large lexical database of English, in which nouns, verbs,

adjectives and adverbs are grouped into sets of cognitive synonyms (synsets),

each expressing a distinct concept. Synsets are interlinked by means of

semantic and lexical relations. Most synsets are connected to other synsets via

1 http://wordnet.princeton.edu/


a number of semantic relations. These relations vary based on the type of

word, and include hypernym and synonym. In addition, Snow et al. (2006)

presented a probabilistic model for taxonomy induction which considers as

features paths in parse trees between related taxonomy nodes. They show that

the best performing taxonomy they acquired was the one adding 400,000

hyponyms to WordNet. Using these two resources seems to be significant for

extracting relevant semantic relations, thus creating new semantic features for

the coreference classifier, as described in 3.2.2. Notice that in the basic

features of Soon et al. (2001), WordNet is used for finding the semantic class of the
noun phrases (e.g., 'person', 'organization', 'date', 'time', 'money'). Thus,
they only created a binary feature which determines whether the first senses of
a candidate pair are in the same semantic class or not, but did not check for
semantic relations such as synonymy or hypernymy.

Wikipedia1 is another resource we used. Utilizing Wikipedia is based

on the work of Shnarch and Dagan (2009), for extracting lexical rules from

Wikipedia by examining the context, links and titles of Wikipedia pages. They

present the extraction of a large scale rule base from Wikipedia designed to

cover a wide scope of the lexical reference relation. They examine the

potential of definition sentences as a source for lexical reference rules, since

when writing a concept definition, one aims to formulate a concise text that

includes the most characteristic aspects of the defined concept. They show that

a definition is a promising source for reference relations between the defined

concept and the definition terms. In addition, they extract lexical reference

rules from Wikipedia redirect and hyperlink relations. As their work is

dedicated to extracting lexical reference rules, we have utilized their method

for the coreference resolution task by adding semantic features, specifically
whether a candidate pair has a hypernym relation extracted from Wikipedia

(see section 3.2.3).

In this thesis we show that using only the semantic features achieves

results close to the base features of a standard coreference resolution system

(Soon et al., 2001). In this case, the built tree (Appendix 1) includes semantic

1 http://www.wikipedia.org/


features from all used resources, thus demonstrating that all semantic features

are relevant to the coreference resolution task. Furthermore, using
all semantic features together obtains a higher score than using each of
the semantic feature groups alone.

Although using both the base-system features and the semantic
features did not obtain a higher F-measure, we give a thorough error
analysis: we analyze each resource and describe its usefulness and its
types of errors, revealing many directions for further research.

The rest of the thesis is outlined as follows: Section 2 gives background

about the coreference resolution task, its definition and importance, and related
work using the semantic resources. Section 3 explains in detail our
coreference application, including the way we utilize our new semantic
features. Section 4 describes our experiments and the obtained results and
gives a detailed error analysis concerning each of our semantic resources.

Section 5 concludes and presents suggested future work.


2. Background and related work

In this section we describe the work done in different areas related
to our work. First, we describe the coreference resolution task
(2.1) and BART (2.1.1), the baseline toolkit for coreference resolution
which we have used. Next, we describe works for coreference resolution
using the web (2.2). Then, we describe the pattern-based module for
lexical entailment acquisition (2.3) and the extraction of lexical reference
rules from Wikipedia (2.5), since we applied these two works for
coreference resolution.

2.1 Coreference resolution using a machine learning approach

In the machine learning based approach for coreference resolution

(Soon et al., 2001; Ng & Cardie, 2002a; Kehler et al., 2004; Yang et al.,

2004 and others), we relate to candidate <anaphor, antecedent> pairs,

where our task is to classify them as COREFERENT or NON-

COREFERENT. For each pair we collect different characteristics, called

the feature set, which relate to the coreference relationship. Using these

features, a machine learning classification algorithm trains and builds the

classifier on a training set that contains annotated examples of coreference

relations. The training and testing instances are typically created

following the method of Soon et al. (2001). We create a positive training

instance from each pair of adjacent coreferent mentions <Pi,Pj>. Negative
training instances are obtained by pairing the anaphor <Pj> with any
noun phrase occurring between the antecedent <Pi> and the anaphor <Pj>.

During testing each text is processed from left to right: each noun phrase is
paired as anaphor <Pj> with each preceding <Pk> from right to left, until a

pair labeled as coreferent is output, or the beginning of the document is

reached. That is because we assume there is at most one antecedent. The

main machine-learning classification algorithms, which were used for the

coreference resolution task, are decision trees (Quinlan 1993), maximum

entropy (Berger et al. 1996) and RIPPER (Cohen, 1995).
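The following minimal sketch illustrates this instance-creation scheme and the closest-first decoding at test time; `mentions`, `entity_of` and `classify` are hypothetical inputs (the mentions of a document in textual order, their gold entity ids, and a trained pairwise classifier):

```python
# A sketch of the Soon et al. (2001) instance-creation scheme and the
# closest-first decoding at test time.

def make_training_instances(mentions, entity_of):
    """Positive: each anaphor with its closest coreferent antecedent;
    negative: the anaphor paired with every mention in between."""
    instances = []
    for j in range(1, len(mentions)):
        # the closest preceding coreferent mention, if any
        i = next((k for k in range(j - 1, -1, -1)
                  if entity_of[k] == entity_of[j]), None)
        if i is None:
            continue  # mention j has no antecedent
        instances.append((mentions[i], mentions[j], "COREFERENT"))
        for k in range(i + 1, j):  # intervening mentions
            instances.append((mentions[k], mentions[j], "NON-COREFERENT"))
    return instances

def resolve(mentions, classify):
    """Pair each anaphor with preceding mentions from right to left and
    stop at the first pair the classifier labels as coreferent."""
    links = {}
    for j in range(1, len(mentions)):
        for k in range(j - 1, -1, -1):
            if classify(mentions[k], mentions[j]):
                links[j] = k  # at most one antecedent per anaphor
                break
    return links
```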


Soon et al. (2001) implemented one of the first simple domain-

independent machine learning coreference resolution systems. They built

a coreference resolution system based on a C4.5 decision tree classifier,

which uses only twelve features. Their features include grammatical

agreement features, syntactical features, positional features, string match

features, a proper name match normalizer feature, and a WordNet

semantic feature. In the next subsection we describe a modular toolkit,

which implements this feature set.

2.1.1 The BART toolkit1

BART is a modular toolkit for coreference resolution, which we use in
our research as the baseline coreference resolution application (more details in
section 3.1).

BART implements the twelve features of Soon et al. (2001), described in
Table 1. BART reaches 64.8% F1 on the MUC6 corpus and 62.9% F1
on the MUC7 corpus (for a description of the MUC6 and MUC7 corpora, see
subsection 4.1).

Num | Name | Values
1 | STRING MATCH | TRUE if REi and REj have the same spelling; else FALSE.
2 | ALIAS | TRUE if one RE is an alias of the other; else FALSE.
3 | I PRONOUN | TRUE if REi is a pronoun; else FALSE.
4 | J PRONOUN | TRUE if REj is a pronoun; else FALSE.
5 | J DEF | TRUE if REj starts with the; else FALSE.
6 | J DEM | TRUE if REj starts with this, that, these, or those; else FALSE.
7 | NUMBER | TRUE if REi and REj agree in number; else FALSE.
8 | GENDER | UNKNOWN if either REi or REj have an undefined gender; TRUE if both are defined and agree; else FALSE.
9 | PROPER NAME | TRUE if both REi and REj are proper names; else FALSE.
10 | APPOSITIVE | TRUE if REj is in apposition with REi; else FALSE.
11 | WN CLASS | UNKNOWN if either REi or REj have an undefined WordNet semantic class2; TRUE if both have a defined one and it is the same; else FALSE.
12 | DISTANCE | The number of sentences between REi and REj.

Table 1 – the features implemented in the BART toolkit

1 An open source version of BART is available from http://www.sfs.uni-tuebingen.de/˜versley/BART/.
2 Semantic classes such as "person", "organization", "location", "date", "time", etc.; see subsection 3.2.2.2.

BART's design provides effective separation of concerns across

several tasks, including engineering new features that use different

sources of knowledge, designing improved or specialized preprocessing

methods, and improving the way that coreference resolution is mapped to

a machine learning problem.

BART's architecture includes four phases: preprocessing, feature
extraction, learning, and training/testing. Preprocessing consists of marking up

noun chunks and named entities, as well as additional information such as

part-of-speech tags and merging this information into markables that are the

starting point for the noun chunk mentions used by the coreference resolution.

In the feature extraction phase, each candidate pair of anaphor and antecedent

candidate is represented as a PairInstance object, which is enriched with

classification features by feature extractors which are realized as separate

classes, allowing for their independent development. For the learning, the

module uses the functionality of the WEKA machine learning toolkit. In the

training phase, the pairs that are to be used as training examples have to be

selected in a process of sample selection, whereas in the testing phase, it has to

be decided which pairs are to be given to the decision function and how to

group mentions into equivalence classes given the classifier decisions.
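The following sketch illustrates this separation of concerns; the class and method names are illustrative only, not BART's actual Java API:

```python
# A sketch of BART-style separation of concerns: each feature
# extractor is an independent class that enriches a PairInstance-like
# object. All names here are illustrative.

class PairInstance:
    """One <antecedent, anaphor> candidate pair and its features."""
    def __init__(self, antecedent, anaphor):
        self.antecedent, self.anaphor = antecedent, anaphor
        self.features = {}

class StringMatchExtractor:
    def extract(self, pair):
        pair.features["STRING_MATCH"] = (
            pair.antecedent.lower() == pair.anaphor.lower())

# New knowledge sources plug in as additional extractor classes,
# developed independently of the rest of the pipeline:
extractors = [StringMatchExtractor()]
pair = PairInstance("Mozart", "the musician")
for extractor in extractors:
    extractor.extract(pair)
print(pair.features)  # {'STRING_MATCH': False}
```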

2.2 Using the web for coreference resolution

Most of the works using the machine learning approach build a

classifier trained on a corpus such as MUC or ACE. However, the limited size
of these corpora leads to data sparseness. To solve this problem, we propose
using the web, which is the largest corpus available. Although the web has
already been used for coreference resolution, it has been used only for two
specific types of coreference: pronoun anaphora resolution (Yang,
2005) and other-anaphora (Modjeska et al., 2003), as explained
below.

Pronoun resolution

Pronoun resolution is a subtask of coreference resolution, in which

the antecedent must be found only for a pronoun anaphor.

In the work of Yang (2005), for each pair of an anaphor and a candidate
antecedent, predicate-argument statistics are obtained by querying a web
search engine such as Google or AltaVista. In their method, three
relationships, possessive-noun, subject-verb
and verb-object, are considered. For these three types of predicate-argument

relationships, queries are constructed for each candidate antecedent NPcandi, in

the forms of “NPcandi VP” (for subject-verb), “VP NPcandi” (for verb-object),

and “NPcandi’s NP” or “NP of NPcandi” (for possessive-noun).

For example, consider the sentence:

"Several experts suggested that IBM’s accounting grew much more

liberal since the mid 1980s as its business turned sour."

15

Page 16: 2 – Related work - u.cs.biu.ac.ilu.cs.biu.ac.il/~nlp/wp-content/uploads/chen-thesis_final.…  · Web viewThese relations vary based on the type of word, ... The alumni director

For the pronoun “its” and the candidate “IBM”, the two generated

queries are “business of IBM” and “IBM’s business”. To reduce data

sparseness, in an initial query only the nominal or verbal heads of the

candidate antecedent are retained. Also, each named entity (such as

company, person, location etc') is replaced by the corresponding common

noun. (e.g, “IBM’s business” OR “company’s business” and “business of

IBM” OR “business of company”).

The semantic compatibility of the candidate with the anaphor
could be represented simply in terms of frequency:

(1) StatSem(candi, ana) = count(candi, ana)

where count(candi, ana) is the hit count of the queries returned by the
search engine. Alternatively, it can be represented as the conditional
probability P(ana | candi), where the count of the pair is divided by the count
of the single candidate in the corpus. That is:

(2) StatSem(candi, ana) = count(candi, ana)/count(candi)

where count(candi) is the hit count of the query formed with only the
head of the candidate candi. In this way, the statistics do not negatively
bias candidates having lower frequency.

Thus, the values of Equation 1 and Equation 2 are used for

creating new features for the coreference classifier. Their study shows that

the semantic compatibility obtained from the web significantly improves

the resolution of neutral pronouns.
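A minimal sketch of these two statistics is given below; `hits` is a hypothetical wrapper around a search-engine API that returns the hit count for a query:

```python
# A sketch of the two web-count statistics above. `hits(query)` is a
# hypothetical function returning the hit count of a query; real
# engines impose their own syntax and quotas.

def stat_sem_freq(candidate, query_template, hits):
    """Equation (1): raw co-occurrence count, e.g. the hit count of
    the query "IBM's business" for the candidate IBM."""
    return hits(query_template.format(cand=candidate))

def stat_sem_cond(candidate, query_template, hits):
    """Equation (2): the pair count normalised by the candidate's own
    frequency, so low-frequency candidates are not penalised."""
    pair_count = hits(query_template.format(cand=candidate))
    candidate_count = hits(candidate)
    return pair_count / candidate_count if candidate_count else 0.0

# Example usage with the possessive-noun relationship of the text:
# stat_sem_cond("IBM", '"{cand}\'s business"', hits)
```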

Other anaphora

Other anaphora is a subtask of coreference resolution, aiming to

find the 'other-anaphors' - that is, the referential noun phrases with the

modifiers 'other' or 'another'. For example, consider the following

sentences:


(1) An exhibition of American design and architecture opened in

September in Moscow and will travel to eight other Soviet cities.

(2) The alumni director of a Big Ten university: “I’d love to see

sports cut back and so would a lot of my counterparts at other

schools"

In example 1, 'other Soviet cities' refers to 'other than Moscow', and in

example 2, 'other schools' refers to 'other than the above mentioned Big

Ten university'.

Modjeska et al. (2003) present a machine learning approach to

other anaphora, using a Naive Bayes (NB) classifier. They show the

benefit of integrating the Web frequency counts obtained for syntactic

patterns specific to other-anaphora as an additional feature into the NB

algorithm.

They used the following pattern for other anaphora:

1. (N1 {sg} OR N1 {pl}) and other N2 {pl}

For common noun antecedents, they instantiate the pattern by

substituting N1 with each possible antecedent from set A, and N2 with the

anaphor, because normally N1 is a hyponym of N2 in (1), and the
antecedent is a hyponym of the anaphor. For example, an instantiated

pattern in the sentence (2) above is: "(university OR universities) and

other schools"

For NE antecedents they also instantiate (1) by substituting N1

with the NE category of the antecedent, and N2 with the anaphor.

In addition, for NE antecedents they used the pattern:

2. N1 and other N2 {pl}

where N1 is instantiated with the original antecedent and N2 with
the anaphor. For example, in the sentence:

"Will Quinlan had not inherited a damaged retinoblastoma suppressor
gene and, therefore, faced no more risk than other children"

the instantiation gives:

"Will Quinlan and other children"


They submit these instantiations as queries to the Google search

engine, and the resulting frequencies are then used for calculating new features

for the NB classifier. The new features raise the other-anaphora F-

measure from 45.5% to 56.9%.

2.3 The pattern-based module for relation extraction

In their work, Mirkin and Dagan (2006) use a pattern-based

approach for lexical entailment acquisition. Their general pattern-based

extraction module receives as input a set of lexical-syntactic patterns (as

in Table 1) and either a target term or a candidate pair of terms. It then

searches the web for occurrences of the patterns with the input term(s).

1 NP1 such as NP2

2 Such NP1 as NP2

3 NP1 or other NP2

4 NP1 and other NP2

5 NP1 ADV known as NP2

6 NP1 especially NP2

7 NP1 like NP2

8 NP1 including NP2

9 NP1-sg is (a OR an) NP2-sg

10 NP1-sg (a OR an) NP2-sg

11 NP1-pl are NP2-pl

Table 1: The patterns Mirkin and Dagan (2006) used for lexical

entailment acquisition

A small set of queries is created for each pattern-terms

combination, in order to retrieve as much relevant data with as few
queries as possible. Each pattern has two variable slots to be instantiated

by candidate terms for the sought relation.


In their research, the extraction module can be used in two modes:

(a) receiving a single target term as input and searching for instantiations

of the other variable to identify candidate related terms; (b) receiving a
pair of terms and searching pattern instances with both terms, in order to

validate and collect information about the relationship between the terms.

Google provides a useful tool for these purposes, as it allows using

a wildcard which might match either an uninstantiated term or optional

words such as modifiers.

For example, the query

"such ** as *** (war OR wars)"

is one of the queries created for the input pattern such NP1 as NP2 and

the input target term war, allowing new terms to match the first pattern

variable.

For the candidate entailment pair war- struggle, the first variable is

instantiated as well. The corresponding query would be:

"such *(struggle OR struggles) as *** (war OR wars)”.

The automatically constructed queries, covering the possible

combinations of multiple wildcards, are submitted to Google1 and a

specified number of snippets are downloaded. The snippets are processed

using a word splitter and a sentence splitter2, and the sentences are
processed with the OpenNLP3 POS tagger and NP chunker. Then,

pattern-specific regular expressions are used to extract relationships from

the chunked sentences, by verifying that the instantiated pattern indeed

occurs in the sentence and identifying variable instantiations.

In our research, we apply a similar method of the pattern-based
module to coreference resolution (details in section 3.2.1).

1 http://www.google.com/apis/
2 Available from the University of Illinois at Urbana-Champaign, http://l2r.cs.uiuc.edu/~cogcomp/tools.php
3 www.opennlp.sourceforge.net/


2.4 Semantic features for coreference resolution

In this subsection we focus on the use of semantic features for the

coreference classifier. We give examples for using semantic features from

the work of Ponzetto and Strube (2006).

2.4.1 WordNet

Ponzetto and Strube (2006) enrich the semantic information

available to the coreference classifier by using semantic similarity

measures based on the WordNet taxonomy (Pedersen et al., 2004). The

measures they use include path length based measures (Rada et al.,

1989;Wu & Palmer, 1994; Leacock & Chodorow, 1998), as well as ones

based on information content (Resnik, 1995; Jiang & Conrath, 1997; Lin,

1998).

In their work, the measures are obtained by computing the

similarity scores between the head lemma of each potential antecedent-

anaphor pair. In order to overcome the sense disambiguation problem,

they factorize over all possible sense pairs: given a candidate pair, they

take the cross product of each antecedent and anaphor sense to form pairs

of synsets. For each similarity measure they compute the similarity score

for all synset pairs, and create the following features:

WN SIMILARITY BEST: the highest similarity score over all
senses of the antecedent and anaphor.

WN SIMILARITY AVG: the average similarity score over all
senses of the antecedent and anaphor.

Pairs containing a noun which cannot be mapped to WordNet synsets are

assumed to have a null similarity measure.
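The following sketch computes two such features with NLTK's WordNet interface, using path similarity as one representative measure (Ponzetto and Strube combine several); it assumes the NLTK WordNet data is installed:

```python
# A sketch of the WN SIMILARITY BEST/AVG features: factorize over all
# noun-sense pairs of the two heads and take the max and the average.
from itertools import product
from nltk.corpus import wordnet as wn  # requires nltk wordnet data

def wn_similarity_features(antecedent_head, anaphor_head):
    pairs = list(product(wn.synsets(antecedent_head, pos=wn.NOUN),
                         wn.synsets(anaphor_head, pos=wn.NOUN)))
    scores = [s1.path_similarity(s2) or 0.0 for s1, s2 in pairs]
    if not scores:  # a noun not in WordNet: null similarity
        return {"WN_SIMILARITY_BEST": 0.0, "WN_SIMILARITY_AVG": 0.0}
    return {"WN_SIMILARITY_BEST": max(scores),
            "WN_SIMILARITY_AVG": sum(scores) / len(scores)}

print(wn_similarity_features("dog", "pet"))
```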

2.4.2 Wikipedia

Wikipedia is a multilingual Web-based free-content encyclopedia.

The English version, as of 14 February 2006, contains 971,518 articles

with 16.8 million internal hyperlinks, thus providing a broad-coverage
knowledge resource. In addition, it also provides a taxonomy by

means of the category feature: articles can be placed in one or more

categories, which are further categorized to provide a category taxonomy.

In practice, the taxonomy is not designed as a strict hierarchy or tree of

categories, but allows multiple categorization schemes to co-exist

simultaneously. Because each article can appear under more than one

category, and each category can appear in more than one parent category,

the categories do not form a tree structure, but a more general directed

graph.

Ponzetto and Strube (2006) used Wikipedia as follows: given the

candidate referring expressions <Pi> and <Pj> they pull the Wikipedia

pages they refer to. This is accomplished by querying the page titled as

the head lemma. They follow all redirects and check for disambiguation

pages, i.e. pages for ambiguous entries which contain links only (e.g.

Lincoln). If a disambiguation page is hit, they first get all the hyperlinks in

the page. If a link containing the other queried noun is found (i.e. a link

containing president in the Lincoln page), the linked page (President of

the United States) is returned; otherwise the first article linked in the

disambiguation page is returned. Given a candidate coreference pair

<Pi,Pj>, the related Wikipedia pages <PREi, PREj> they point to,

retrieved by querying pages with titles <TREi, TREj> , they extract the

following features:

I/J GLOSS CONTAINS: U if no Wikipedia page titled TREi/j is

available. T if the first paragraph of text of PREi/j contains

TREj/i ; else F.

I/J RELATED CONTAINS: U if no Wikipedia page titled TREi/j

is available. T if at least one Wikipedia hyperlink of PREi/j

contains TREj/i ; else F.

I/J CATEGORIES CONTAINS: U if no Wikipedia page titled as

TREi/j is available. T if the list of categories PREi/j belongs to

contains TREj/i ; else F.


GLOSS OVERLAP: the overlap score between the first paragraph

of text of PREi and PREj. Following Banerjee & Pedersen (2003),
it is computed by summing m^2 over the n phrasal m-word overlaps.

In addition, they used three further features based on the
Wikipedia category graph, following Rada et al. (1989).

Thus, most of their features are based on the Wikipedia category
graph, differently from the work of Shnarch and Dagan (2009) (described in the
next subsection), on which we based our Wikipedia features for the
coreference classifier (Section 3.2.3).
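As an illustration of the gloss-based features above, here is a minimal sketch of I/J GLOSS CONTAINS; `first_paragraph` is a hypothetical lookup that returns the first paragraph of the Wikipedia page with a given title, or None if no such page exists:

```python
# A sketch of the I/J GLOSS CONTAINS feature with its three values.

def gloss_contains(title_i, title_j, first_paragraph):
    gloss = first_paragraph(title_i)
    if gloss is None:
        return "U"  # no Wikipedia page titled TREi
    return "T" if title_j.lower() in gloss.lower() else "F"
```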

2.5 Using Wikipedia for extracting lexical reference rules

A most common need in applied semantic inference is to infer the

meaning of a target term from other terms in a text. For example, a Question

Answering system may infer the answer to a question regarding luxury cars

from a text mentioning Bentley, which provides a concrete reference to the

sought meaning.

Aiming to capture such lexical inferences, Shnarch and Dagan (2009)
followed Glickman et al. (2006), who coined the term lexical reference (LR)

to denote references in text to the specific meaning of a target term. They

analyzed the dataset of the First Recognizing Textual Entailment Challenge

(Dagan et al., 2006), which includes examples drawn from seven different

application scenarios. It was found that an entailing text indeed includes a

concrete reference to practically every term in the entailed (inferred) sentence.

Thus, the goal of Shnarch and Dagan (2009) is to utilize the broad knowledge of

Wikipedia to extract a knowledge base of lexical reference rules. Each

Wikipedia article provides a definition for the concept denoted by the title of


the article. As the most concise definition they take the first sentence of each

article, following Kazama and Torisawa (2007).

Since a concept definition usually employs more general terms than the

defined concept (Ide and Jean, 1993), the concept title is more likely to refer to

terms in its definition rather than vice versa. Therefore the title of the

Wikipedia article is taken as the left side of the constructed rule while an

extracted definition term is taken as its right side. As Wikipedia’s titles are

mostly noun phrases, the terms they extract as the right sides are the nouns and

noun phrases in the definition.

Their methods for extracting rules from Wikipedia are:

Be-Comp: They identify the 'IS-A' pattern in the definition sentence

by extracting nominal complements of the verb ‘be’, taking them as the

right side of a rule whose left side is the article title.

All-N: The Be-Comp extraction method yields mostly hypernym

relations, which do not exploit the full range of lexical references

within the concept definition. Therefore, they further create rules for all
head nouns and base noun phrases within the definition.

Title Parenthesis: A common convention in Wikipedia to disambiguate

ambiguous titles is adding a descriptive term in parenthesis at the end

of the title, as in The Siren (Musical), The Siren (sculpture) and Siren

(amphibian). From such titles they extract rules in which the

descriptive term inside the parenthesis is the right side and the rest of

the title is the left side.

Redirect: As in any dictionary or encyclopedia, Wikipedia contains

Redirect links that direct different search queries to the same article,

which has a canonical title. For instance, there are 86 different queries

that redirect the user to United States (e.g. U.S.A., America, Yankee

land). Redirect links are hand coded, specifying that both terms refer to

the same concept. They therefore generate a bidirectional entailment rule

for each redirect link.

Link: Wikipedia texts contain hyperlinks to articles. For each link they
generate a rule whose LHS is the linking text and RHS is the title of the


linked article. In this case they generate a directional rule since links do

not necessarily connect semantically equivalent entities.
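As a minimal illustration, the following sketch implements the string manipulation behind two of these methods, Title Parenthesis and Redirect, assuming titles and redirect pairs are already available as plain strings:

```python
# A sketch of two of the extraction methods above.
import re

def title_parenthesis_rule(title):
    """'Siren (amphibian)' -> ('Siren', 'amphibian'): the descriptive
    term in parentheses becomes the right side of the rule."""
    m = re.match(r"^(.*\S)\s*\(([^)]+)\)$", title)
    return (m.group(1), m.group(2)) if m else None

def redirect_rules(query, canonical_title):
    """Redirects are hand coded, so a bidirectional rule is generated."""
    return [(query, canonical_title), (canonical_title, query)]

print(title_parenthesis_rule("Siren (amphibian)"))  # ('Siren', 'amphibian')
print(redirect_rules("U.S.A.", "United States"))
```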

Based on this work and the described extraction methods, we apply new

features from Wikipedia, as described in section 3.2.3.


3. Coreference resolution using semantic features

In this section we describe the application of a range of semantic

resources for coreference resolution and the new features we have created

for the coreference classifier.

3.1 Baseline coreference resolution system

As our baseline system we use the BART toolkit, which
implements the twelve features used by Soon et al. (2001), described in 2.1.1.

These features include syntactical features, grammatical agreement

features, positional features, string match features, a proper name match
feature, and a basic WordNet semantic feature (section 2.1.1). As this

toolkit is very modular, we could use it as the baseline coreference

system, and add our new semantic features to the toolkit.

In addition, as the machine learning approach for coreference

resolution includes building a classifier, in our research we tried
several classification algorithms implemented in Weka1. The main
algorithms we used are J482 and SVM3.

In the next subsections we describe the new semantic features we added to

the BART toolkit (3.2).

3.2 Additional semantic features

In this subsection we describe the semantic features we developed

for the coreference resolution classifier: the pattern-based module using

Google (3.2.1), WordNet features (3.2.2), and Wikipedia features (3.2.3).

All of these features aim to test the semantic relations between

1 See: http://www.cs.waikato.ac.nz/ml/weka/

2 J48: http://grb.mnsu.edu/grbts/doc/manual/J48_Decision_Trees.html

3 SVM: http://en.wikipedia.org/wiki/Support_vector_machine


the word pairs in the coreference classifier. The presence or absence of a strong
semantic relation between a pair of words helps to decide whether these
words corefer.

3.2.1 The web-patterns-based module

In this subsection we first describe the module structure and its

steps (3.2.1.1) and then we explain the way of applying the module for

creating the new coreference features (3.2.1.2).

3.2.1.1 The module structure

Following Mirkin and Dagan (2006), the main idea of the pattern-

based module (2.3) is to use the web, the largest corpus available, for
searching specific patterns in it. The module includes four main steps

performed for every pair of terms, as listed below. Next we give the

details of each of the steps.

1. Choosing the patterns

2. Query creation

3. Snippet processing - downloading, cleaning, filtering and syntactic

processing

4. Using regular expressions defined for the patterns

1. Choosing the patterns

In the web-patterns based approach we start with pre-defined

patterns, according to the research specific task – which is, in our case,

coreference resolution.

Many co-referring words (antecedent – anaphor pair) have a

hypernym relation. For example: Sony and Company, TWA and airline,

etc. In other words, having a hypernym relation is indicative of a
coreference relation between the words.


Therefore, the patterns we used are patterns suitable for the

hypernym relation. The list of the patterns and an example sentence are

given in Table 1.

Num | Pattern | Example sentence
1 | NP1 such as NP2 | Scientists such as Einstein
2 | Such NP1 as NP2 | Such scientists as Einstein
3 | NP1 or other NP2 | Einstein or other scientists
4 | NP1 and other NP2 | Einstein and other scientists
5 | NP1 ADV known as NP2 | Einstein known as an important scientist
6 | NP1 especially NP2 | Scientists, especially Einstein
7 | NP1 like NP2 | Scientists like Einstein
8 | NP1 including NP2 | Scientists including Einstein

Table 1 – the patterns we used for the coreference resolution task, with example sentences

2. Query Creation

The query construction method must support queries with two

terms and a variable. When designing the query construction module, we

must consider that terms may contain several words and that queries are

an expensive resource, since the number of queries per day one can

submit to a commercial search engine is limited.

Queries were submitted to Google’s search engine through Google API1.

Therefore, the queries are constructed using the engine’s syntax, while

complying with some restrictions posed by the API package, such as a

limit to the maximal number of words per query. We use a feature of the

search engine which allows using an asterisk instead of any single word

(some stop words are excluded from this count, such as a, the etc.). In our

research, up to one consecutive asterisk is supported.

1 http://www.google.com/apis/

In addition, the query is constructed with the singular and the plural form

of the terms, depending on the pattern.

For example, for the pair TWA-airline, here are some of the created queries:

"(airline OR airlines) such as * TWA"

"(airline OR airlines) especially * TWA"

"(airline OR airlines) including * TWA"

"TWA and other * (airline OR airlines)"

3. Snippet Processing

For each query submitted to the search engine, we download a

predefined number of snippets. Snippets are used for a practical reason –

we do not need to download and process the entire document for a single

instance of the pattern we’re after. The drawback in using snippets is that

many times they contain partial sentences, decreasing the accuracy of the

syntactic processing.

Each downloaded snippet is cleaned from HTML tags and is

converted to plain text format. Then, using a word splitter and a sentence

segmenter from The University of Illinois at Urbana-Champaign1, we

tokenize the snippets and split them into sentences. Each sentence that

does not contain the target terms is deleted. All duplicate sentences are

deleted as well. Then, using OpenNLP2 Part of Speech Tagger and NP-

Chunker, we processed each of the sentences. Here’s an example of a

sentence retrieved for Sony and Company, processed up to shallow

parsing:

1 http://l2r.cs.uiuc.edu/~cogcomp/tools.php

2 www.opennlp.sourceforge.net/

[NP Sony,/NNP while/IN ] [VP well/RB known/VBN ] [PP

as/IN ] [NP THE/DT company/NN ] [NP that/WDT ] [VP

has/VBZ made/VBN ] [NP broadcast/NN television/NN

camerass,/NN ] [VP is/VBZ ] [NP the/DT owner/NN ] [PP

of/IN ] [NP the/DT old/JJ Minolta/NNP camera/NN

company,/JJ

4. Using regular expressions defined for the patterns

After processing the snippets, pattern-specific regular expressions

are used over the chunked sentences to verify that the instantiated pattern

indeed occurs in the sentence.

Table 1 shows how the patterns’ expressions were constructed from
smaller building blocks, and Table 2 lists the regular expressions used by our
method in order to construct the extraction patterns grammar. The

extraction stage, applied on chunked sentences, was designed to trade off

precision and recall in the extraction by handling some of the common

chunker errors, while not attempting to cover all cases. When compiling

the regular expressions we had in mind the idea that when using the web it

might not be necessary to extract information from complex text

structures, but rather settle for simpler text while relying on the scale and

redundancy of the web (Etzioni, 2004).

Num | Pattern Name | Pattern
1 | NP1 such as NP2 | NPWithPossCommaExp suchAsExp NPListExp
2 | Such NP1 as NP2 | suchExp asExp NPListExp
3 | NP1 or other NP2 | NPListExp orExp otherNPExp
4 | NP1 and other NP2 | NPListExp andExp otherNPExp
5 | NP1 ADV known as NP2 | NPWithPossCommaExp advKnownAsExp NPExp
6 | NP1 especially NP2 | NPWithPossCommaExp especiallyNPExp
7 | NP1 like NP2 | NPWithPossCommaExp likeExp NPListExp
8 | NP1 including NP2 | NPWithPossCommaExp includingExp NPListExp

Table 1 – the patterns we use, specified by sub-patterns (continued in Table 2)

Num | Pattern | Regular expression
1 | NPExp | <NP [^>]* >
2 | suchAsExp | <PP such/JJ as/IN >( :/:)?
3 | suchExp | <NP (s|S)uch/JJ [^>]*>
4 | asExp | <PP as/IN >
5 | NPListEx1 | NPExp (( ,/, NPExp ){0,10} (and|or)/CC NPExp )?
6 | NPandNPExp | <NP [^>]* (and|or)/CC [^>]* >
7 | NPListEx2 | (NPExp ,/, )* NPandNPExp
8 | NPListExp | ((NPListEx1)|(NPListEx2))
9 | NPWithCommaExp | <NP [^>]* > ,/,
10 | NPWithPossCommaExp | ((NPWithCommaExp)|(NPExp))
11 | includingExp | <(PP|VP) including/VBG >
12 | especiallyNPExp1 | (and/CC)?<ADVP especially/RB > NPExp
13 | especiallyNPExp2 | (and/CC)?<NP especially/RB [^>]* >
14 | especiallyNPExp | ((especiallyNPExp1)|(especiallyNPExp2))
15 | likeExp | <PP like/IN >
16 | advExp | (([^/]*/RB)|(<ADVP [^/]*/RB >))
17 | advKnownAsExp | ((<VP (advExp)?known/VBN >)|((advExp )?<VP known/VBN >)) <PP as/IN >
18 | otherNPExp | <NP other/JJ [^>]* >

Table 2 – the sub-patterns and their regular expressions
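As an illustration of this step, the following sketch applies a simplified version of the 'NP1 and other NP2' pattern to a chunked sentence; the bracketed chunk notation follows the Sony example above, and the expressions are simplified relative to Table 2:

```python
# A sketch of step 4: a pattern-specific regular expression applied to
# a chunked sentence, built from simplified NPExp and otherNPExp blocks.
import re

NP_EXP = r"\[NP [^\]]*\]"
OTHER_NP_EXP = r"\[NP other/JJ [^\]]*\]"
PATTERN = re.compile(NP_EXP + r" and/CC " + OTHER_NP_EXP)

chunked = "[NP Einstein/NNP] and/CC [NP other/JJ scientists/NNS]"
print(bool(PATTERN.search(chunked)))  # True: the pattern occurs
```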

3.2.1.2 Pattern-based feature extraction

After applying the module as described above, we use its output

for creating new features for the coreference classifier.

We tried two sets of features, binary features and numeric features:

In the numeric mode, the value of each feature is the number of snippets

retrieved by its pattern. In the binary mode, the value of each feature is

true when there is at least one snippet returned by its pattern.

For example, the feature values for the pair 'TWA – airline' are described in
Table 3.

Feature name | Numeric value | Binary value
NP1 such as NP2 | 11 | 1
Such NP1 as NP2 | 0 | 0
NP1 or other NP2 | 3 | 1
NP1 and other NP2 | 7 | 1
NP1 ADV known as NP2 | 0 | 0
NP1 especially NP2 | 0 | 0
NP1 like NP2 | 1 | 1
NP1 including NP2 | 6 | 1

Table 3 – feature values for the pair 'TWA – airline'

In addition to the features created from the patterns, we used two

additional global features that look at the overall occurrences and not only at
the occurrences of each pattern separately:


1. 'SUM OF ALL PATTERNS' – the sum of all 8 values of
the features from the patterns. Formally:

Score1(w1,w2) = Σ_P count(w1,w2,P)

where count(w1,w2,P) is the number of snippets returned
for the word pair (w1,w2) with the specific pattern P
(which is actually the value of the numeric feature for
pattern P).

2. 'NUMBER OF UNIQUE PATTERNS' – the number of
features whose value is greater than 0. Formally:

Score2(w1,w2) = |{P : count(w1,w2,P) > 0}|

where count(w1,w2,P) is as described above.

Both features seem relevant because they consider the results from all

patterns. While feature (1) considers the number of matches from all

patterns, feature (2) relates to the number of different patterns which

retrieved a match.

Continuing the example of Table 3, the value of the 'SUM OF ALL
PATTERNS' feature is 11+3+7+1+6 = 28 and the value of 'NUMBER OF
UNIQUE PATTERNS' is 5.
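The following sketch shows how per-pattern snippet counts translate into the numeric, binary and global features; the pattern names are abbreviated for readability:

```python
# A sketch of turning per-pattern snippet counts into classifier
# features: numeric, binary, and the two global features.

def pattern_features(counts):
    """`counts` maps pattern name -> number of snippets retrieved,
    e.g. the TWA-airline values of Table 3."""
    feats = {}
    for pattern, n in counts.items():
        feats[pattern + "_NUM"] = n
        feats[pattern + "_BIN"] = int(n > 0)
    feats["SUM_OF_ALL_PATTERNS"] = sum(counts.values())
    feats["NUMBER_OF_UNIQUE_PATTERNS"] = sum(n > 0 for n in counts.values())
    return feats

counts = {"such_as": 11, "such_X_as": 0, "or_other": 3, "and_other": 7,
          "known_as": 0, "especially": 0, "like": 1, "including": 6}
feats = pattern_features(counts)
print(feats["SUM_OF_ALL_PATTERNS"], feats["NUMBER_OF_UNIQUE_PATTERNS"])  # 28 5
```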

Using Synonyms:

In order to increase recall, we use the words' synonyms from
WordNet1: for each word in each pair we find the first synonym of the
first sense.


Then, we create queries not only for the original pair, but also for the

words' synonyms.

For example: if the input is the pair 'acquisition - transaction', we create

queries for the pairs 'acquisition - transaction' and 'acquisition – dealing'

since 'dealing' is a synonym in the first sense of 'transaction', while

'acquisition' does not have a synonym in the first sense.

Then, the value of each feature for the original pair is the sum of
the values over the derived pairs (including the original pair).
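A minimal sketch of the synonym lookup, using NLTK's WordNet interface (an illustrative equivalent of the lookup described above):

```python
# First-synonym-of-first-sense lookup via NLTK's WordNet interface
# (requires the NLTK WordNet data to be installed).
from nltk.corpus import wordnet as wn

def first_synonym(word):
    """Return the first synonym of the first (most frequent) noun
    sense of `word`, or None if there is no distinct synonym."""
    synsets = wn.synsets(word, pos=wn.NOUN)
    if not synsets:
        return None
    for lemma in synsets[0].lemma_names():  # first sense only
        if lemma.lower() != word.lower():
            return lemma.replace("_", " ")  # first distinct synonym
    return None

print(first_synonym("transaction"))  # 'dealing'
print(first_synonym("acquisition"))  # None: no synonym in the first sense
```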

3.2.2 WordNet1 feature

3.2.2.1 WordNet as a lexical tool

WordNet is another resource for finding synonym and
hypernym relations, which are, as said before, relevant for coreference.

For the WordNet feature we used WordNet 3.0 and the

Snow 400k resource (Snow et al., 2006), which is a statistical extension of

WordNet. WordNet is a large lexical database of English, developed

under the direction of George A. Miller. Nouns, verbs, adjectives and

adverbs are grouped into sets of cognitive synonyms (synsets), each

expressing a distinct concept. Synsets are interlinked by means of

conceptual-semantic and lexical relations. Thus, for each pair of words in

our classifier instances, we check whether the words appear in a hypernym
relation. More statistics about WordNet 3.0 can be found in Table 1.

In addition, Snow et al. (2006) presented a probabilistic model for

taxonomy induction which considers as features paths in parse trees

between related taxonomy nodes. They show that the best performing

taxonomy was the one adding 400,000 hyponyms to WordNet.

POS | Unique strings | Synsets | Total word-sense pairs
Noun | 117,798 | 82,115 | 146,312
Verb | 11,529 | 13,767 | 25,047
Adjective | 21,479 | 18,156 | 30,002
Adverb | 4,481 | 3,621 | 5,580
Totals | 155,287 | 117,659 | 206,941

Table 1 – WordNet 3.0 statistics

1 http://wordnet.princeton.edu/

3.2.2.2 WordNet feature

WordNet is utilized within one of the twelve features of Soon et al.
(2001). This feature is called the 'Semantic Class Agreement' feature

(SEMCLASS): Its possible values are True, False, or Unknown. They

defined the following semantic classes: "female," "male," "person,"

"organization," "location," "date," "time," "money, "percent," and

"object." These semantic classes are arranged in a simple IS-A hierarchy.

Each of the "female" and "male" semantic classes is a subclass of the

semantic class "person," while each of the semantic classes

"organization," "location," "date," "time," "money," and "percent" is a

subclass of the semantic class "object." Each of these defined semantic

classes is then mapped to a WordNet synset. For example, "male" is

mapped to the second sense of the noun “male” in WordNet; "location" is

mapped to the first sense of the noun “location”, and so on.

In addition, they assume that the semantic class for every markable

extracted is the first sense of the head noun of that markable. Since

WordNet orders the senses of a noun by their frequency, this is equivalent

to choosing the most frequent sense as the semantic class. If the selected

semantic class of a markable is a subclass of one of the defined semantic

classes C then the semantic class of the markable is considered to be C;

else its semantic class is "Unknown." The semantic classes of markables i

and j are in agreement if one is the parent of the other (e.g., “chairman”

with semantic class "person" and “Mr. Lim” with semantic class "male"),


or if they are the same (e.g., “Mr. Lim” and “he”, both of semantic class

"male"). The value returned for such cases is “True”. If the semantic

classes of i and j are not the same (e.g., “IBM” with semantic class

"Organization" and “Mr. Lim” with semantic class "Male"), the returned

value is “False”. If either semantic class is "Unknown" then the head noun

strings of both markables are compared. If they are the same, return True;

else return Unknown.

In our research we used WordNet in order to find hypernym
and synonym relations between the anaphor and antecedent. Our
additional feature can take 3 possible values: true, false and unknown.

1. True – if WordNet contains a hypernym path between the two words
in the pair.

For example, given the pair 'dog' and 'animal', WordNet contains
the path:

dog -> domesticated animal -> animal

Another example is for the pair CEO and person. In this case, the path is:

chief executive officer -> corporate executive -> executive ->
administrator -> leader -> person

We consider a maximum of 10 words in the path (considering running time
and relation relevancy).

2. False – if the two words appear in WordNet, but WordNet does not
include a hypernym path between them.

3. Unknown – if at least one word from the pair does not appear in

WordNet, or if one word of the pair is a pronoun.

Separating case number 2 (False) from case number
3 (Unknown) aims to treat the sparseness of WordNet: the fact that the
two words appear in WordNet without having the appropriate relation is
more indicative of the pair not being coreferent than the fact that the words
do not appear in WordNet.
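The following sketch illustrates this three-valued feature with NLTK's WordNet interface; it climbs hypernym links breadth-first from all noun senses of the first word, which approximates, but need not exactly match, the lookup used in the thesis:

```python
# A sketch of the three-valued WordNet hypernym-path feature.
from nltk.corpus import wordnet as wn

def wordnet_feature(w1, w2, max_steps=10):
    syns1 = wn.synsets(w1, pos=wn.NOUN)
    syns2 = wn.synsets(w2, pos=wn.NOUN)
    if not syns1 or not syns2:
        return "unknown"  # word missing from WordNet (or a pronoun)
    targets = set(syns2)
    frontier = set(syns1)
    for _ in range(max_steps + 1):
        if frontier & targets:
            return "true"  # a hypernym path within max_steps exists
        frontier = {h for s in frontier for h in s.hypernyms()}
        if not frontier:
            break
    return "false"  # both words in WordNet, but no path found

print(wordnet_feature("dog", "animal"))  # true
```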


3.2.3 Wikipedia features

For the Wikipedia feature, we utilized the method of Shnarch (2009), which extracts Lexical Reference Rules from Wikipedia (subsection 2.5), for coreference resolution.

As already mentioned, the hypernym relation is highly relevant for coreference resolution. The work of Shnarch (2009) (subsection 2.5) generates rules from Wikipedia, many of which are hypernym rules (in addition to other types of lexical references). For example, the Be-Comp extraction method yields mostly hypernym relations. Thus, in our research, we use a corresponding feature which can take three possible values:

1. True – where the pair of words (w1, w2) corresponds to a rule from Wikipedia, extracted by any of the extraction methods described in 2.5 (Be-Comp, All-N, Title Parenthesis, Redirect and Link).

For example, consider the pair (Clinton, president). The first sentence of the page titled Bill Clinton is:

William Jefferson "Bill" Clinton (born William Jefferson Blythe III, August 19, 1946) served as the 42nd President of the United States from 1993 to 2001.

Thus, the rule 'Clinton -> President' is extracted, and the value of the feature is True.

2. False – where the pair of words (w1, w2) does not correspond to any rule from Wikipedia.

3. Unknown – where at least one word of the checked pair is a pronoun.
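A minimal sketch of the feature computation, assuming the Shnarch (2009) rules have already been extracted offline into a simple set of (lhs, rhs) string pairs; the rule store and pronoun list shown here are purely illustrative:

    PRONOUNS = {"he", "she", "it", "they", "him", "her", "them"}  # illustrative

    # Hypothetical rule store, extracted offline from Wikipedia by the Be-Comp,
    # All-N, Title Parenthesis, Redirect and Link methods of subsection 2.5.
    WIKI_RULES = {("clinton", "president"), ("minn", "minnesota")}

    def wikipedia_feature(w1, w2, rules=WIKI_RULES):
        """Return 'True'/'False'/'Unknown' for the Wikipedia rule feature."""
        if w1.lower() in PRONOUNS or w2.lower() in PRONOUNS:
            return "Unknown"
        return "True" if (w1.lower(), w2.lower()) in rules else "False"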


4. Results and analysis

In this section, the dataset and annotation scheme of our experiments are described (4.1), and the evaluation measures are detailed (4.2). Next, we describe the experimental setup (4.3), our obtained results (4.4) and a detailed analysis (4.5).

4.1 The dataset

MUC (Message Understanding Conference) was a series of conferences, supported by DARPA, that aimed to develop technologies for information extraction. Coreference resolution is considered one of the important layers of the information extraction task.

The coreference annotation can be considered a kind of hyperlinked version of the text, where the links connect all the mentions of a given entity. To assist in building and evaluating such systems, MUC gathered a set of articles annotated with pairwise coreference links between anaphors and antecedents, thus creating texts annotated with coreference chains. These chains are used to train and test a coreference resolution system. In our evaluation, we use one of the standard coreference corpora: the MUC 7 (MUC 7, 1998) dataset.

MUC 7 – Statistics and Structure

The MUC corpus, and the MUC 7 coreference dataset in particular, are composed exclusively of newswire articles. The following is an example of a short annotated text from MUC 7:

At the White House, shortly before departing for New York, <Clinton> said the deaths of the <Force crew members> were painful to <him> because <they> worked for…


The noun phrases which take part in a coreference relationship are marked with angle brackets. There are two coreference relationships in this paragraph, between the noun phrases:

1. Clinton and him
2. Force crew members and they

The MUC 7 dataset includes fifty documents annotated for coreference, divided into 30 training documents, referred to as dry-run documents, and 20 test documents, referred to as formal documents. A summary of statistics for the training and test sets is given in Table 1.

                                      All database   Train set   Test set   Avg. per doc
Num of docs                                 50           30          20
Num of paragraphs                          635          258         377          12.7
Num of sentences                          1033          620         413         20.66
Num of tokens                            25583        15331       10252        511.66
Num of coreference nodes                  3780         2259        1521          75.6
Num of coreference links                  2738         1624        1114         54.76
Num of coreference groups/clusters         977          603         374         19.54

Table 1 – MUC 7 statistics. Coreference nodes are noun phrases participating in a coreference relation, coreference links are the links annotated between the anaphor and the antecedent, and coreference groups are the chains created by the coreference links.

4.2 Evaluation measures

In this subsection we describe how the results of the coreference resolution system are evaluated, following the MUC scoring metric (Vilain et al., 1995).


4.2.1 Introduction

As mentioned previously, the coreference clustering task aims to group all the coreference links into clusters, where each cluster represents the mentions in the document of a single entity in the real world. When evaluating the performance of the coreference clusters, we need to compare the clusters which are implicitly created by the gold standard key, i.e., the annotated coreference links, with the ones created by the response, i.e., the output of our coreference resolution system. The key links constitute a minimal spanning set, since each noun phrase in the annotation can have at most one coreference link (as an anaphor, referring to some other noun phrase), and each noun phrase can be referred to only once (as an antecedent). On the other hand, in the case of a cluster from the response, an anaphor does not necessarily have a unique antecedent, so the response links are not a minimal spanning set. Consequently, a cluster might be created by different sets of links, depending on its source: the key or the response. Therefore, the idea of MUC's scoring scheme is not to compare the links specified in a key cluster with the links of a response cluster, but rather to compare the nodes constituting a key cluster with the nodes constituting a response cluster.

The scoring scheme produces three evaluation measures: precision, recall and the F-measure. In the next subsection we formally define these measures.

4.2.2 Evaluating Recall and Precision


In order to evaluate recall and precision, let us first define the following:

C – a cluster of coreferring nodes.

e(C) – the minimal number of edges needed to span the cluster C; this number equals |C| - 1.

Let S_1, ..., S_n be the clusters of the key and R_1, ..., R_m the clusters of the response. Recall_i, the recall for the key cluster S_i, is then:

    Recall_i = \frac{\sum_j e(S_i \cap R_j)}{e(S_i)}

That is, the number of edges needed to span the common nodes of the key cluster and the response clusters, divided by the number of edges needed to span the key cluster. This is the fraction of the correct links found in the response out of the edges needed to span the cluster in the key.


Extending this measure to all key clusters of the entire test set, we get:

    Recall = \frac{\sum_i \sum_j e(S_i \cap R_j)}{\sum_i e(S_i)}

In addition, we define Precision_i as the precision for the response cluster R_i:

    Precision_i = \frac{\sum_j e(R_i \cap S_j)}{e(R_i)}

That is, the number of edges needed to span the common nodes of the response cluster and the key clusters, divided by the number of edges needed to span the response cluster. This is the fraction of the correct links found in the response cluster out of the edges needed to span it.

Extending this measure to all response clusters of the entire test set, we get:

    Precision = \frac{\sum_i \sum_j e(R_i \cap S_j)}{\sum_i e(R_i)}

For example, consider the key clusters S_1 = {1, 2}, S_2 = {3, 4, 5} and the response clusters R_1 = {1, 2, 3}, R_2 = {4, 5, 6}. Then:

    Recall = \frac{e(\{1,2\}) + e(\{3\}) + e(\{4,5\})}{e(S_1) + e(S_2)} = \frac{1 + 0 + 1}{1 + 2} = \frac{2}{3}

    Precision = \frac{e(\{1,2\}) + e(\{3\}) + e(\{4,5\})}{e(R_1) + e(R_2)} = \frac{1 + 0 + 1}{2 + 2} = \frac{2}{4}
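The computation above can be written compactly. The following is a small sketch of the MUC scorer over clusterings given as lists of mention-id sets (this is an illustration under our notation, not the official scorer implementation):

    def muc_score(key, response):
        """MUC recall/precision (Vilain et al., 1995); key and response are
        lists of sets of mention ids."""
        def spanning_ratio(clusters, other):
            num = den = 0
            for c in clusters:
                den += len(c) - 1                    # e(C) = |C| - 1
                # edges spanning the pieces c is cut into by the other
                # clustering; mentions missing from `other` act as
                # singleton pieces contributing 0 edges
                num += sum(len(c & o) - 1 for o in other if c & o)
            return num / den
        return spanning_ratio(key, response), spanning_ratio(response, key)

    key = [{1, 2}, {3, 4, 5}]
    response = [{1, 2, 3}, {4, 5, 6}]
    print(muc_score(key, response))   # (0.666..., 0.5), matching 2/3 and 2/4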


F1 is a standard measure which combines the precision and recall measures. In the coreference clustering evaluation we use F1, defined here in the usual way: F1 is the harmonic mean of recall and precision.
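That is,

    F_1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}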

4.3 Experimental setup

In this subsection we describe our experimental setting: choosing a subset of the training/test documents from MUC (4.3.1), the way our new features are applied (4.3.2), and their use within the BART toolkit (4.3.3).

4.3.1 Choosing documents from MUC

In our experiments we used 15 documents from the training set and 5 documents from the test set of MUC, due to the running time and the number of queries we needed to submit to Google. For the 20 documents we used, we obtained over 40,000 pairs of words participating in the training or test instances of the coreference classifier. For each pair we check 8 patterns, and for each pattern we create a query with 0 or 1 asterisks. Thus, we submitted around 400,000 queries to Google.

For each query we processed up to the first 30 snippets; overall, we had to process over 1.5 million snippets (some queries produced no results or fewer than 30 snippets), and we ran our regular expressions over each processed snippet. For these reasons, we decided to randomly choose 20 documents for the training and test sets. To apply the module to all documents, one should use a faster regular-expression matching method and a faster chunker, or, alternatively, use fewer queries and fewer snippets, which might harm the recall.

4.3.2 Applying our new features

The first step after choosing the documents is applying our new features (later used within the BART toolkit). This was done in three parts, described below: the pattern-based module, the WordNet feature and the Wikipedia feature.

1. The pattern-based module is applied to the pairs of words participating in the training or test instances of the coreference classifier. At the end of the pattern-based module's processing, we have a database table containing, for each word pair, the values of the pattern-based features (3.2.1). This table is used when running the BART toolkit (4.3.3) during training and testing. As mentioned above (section 4.3.1), we save the feature values for over 40,000 pairs of words; that is because, as described in 3.2.1.2, we also include the synonyms of the words appearing in the dataset.

In query creation, we allow only one asterisk on either side of the pattern words, and we consider only the first 30 returned snippets (see the query-generation sketch after this list).

2. As described in 3.2.2, for the WordNet feature we use WordNet 3.0 and Snow 400k. For each pair of words in the training or test set, we check whether WordNet includes a hypernym path of length at most 10. Since a longer path indicates less relatedness, and because of running time, we had to set a threshold; a length of 10 seems to suffice. Moreover, there is clearly no use in checking pairs that include pronouns, so they get the value Unknown.

3. As described in section 3.2.3, for the Wikipedia feature we used the work of Shnarch (2009). For each pair of words in the training or test set, we check whether the method extracted a reference rule. We used all of the extraction methods described in section 2.5 (Be-Comp, All-N, Title Parenthesis, Redirect and Link).
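As referenced in item 1 above, the sketch below illustrates the query-creation step: the exact-phrase query plus single-asterisk (Google wildcard) variants for one word pair. The pattern list shown is an illustrative subset of the eight patterns of subsection 3.2.1, not the full set:

    PATTERNS = ["known as", "and other", "such as"]  # illustrative subset of the 8

    def make_queries(w1, w2):
        """Exact-phrase Google queries for one pair: 0 or 1 wildcards per
        query, with the wildcard allowed on either side of the pattern words."""
        queries = []
        for pat in PATTERNS:
            queries.append(f'"{w1} {pat} {w2}"')    # no asterisk
            queries.append(f'"{w1} * {pat} {w2}"')  # asterisk before the pattern
            queries.append(f'"{w1} {pat} * {w2}"')  # asterisk after the pattern
        return queries

    # make_queries("TWA", "airlines")[3]  ->  '"TWA * and other airlines"'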

4.3.3 Running the BART toolkit

We added our new features to the BART toolkit. Thus, we could run BART with 24 features: twelve base features, ten pattern-based features, one WordNet feature and one Wikipedia feature.

In our experiments, we ran a different subset of features each time, in order to check their contribution. The subsets were:

1. Base features
2. Web patterns + WordNet feature + Wikipedia feature
3. Base + one of our new feature groups at a time
4. Each of our feature groups alone
5. All 24 features

The classifier algorithms used are J48 (1) and SVM (2), as implemented in Weka.

(1) http://grb.mnsu.edu/grbts/doc/manual/J48_Decision_Trees.html
(2) http://en.wikipedia.org/wiki/Support_vector_machine

4.4 Results

Even though we did not manage to improve BART's original F-measure results, we arrived at some interesting findings. In this subsection we present the results in detail. Table 2 presents comparative results for the coreference resolution task using the selected test/training documents and J48. The following subsections provide further explanation of the results and a detailed error analysis.

Method                                             Recall   Precision   F1
Base features                                      0.622     0.483     0.538
Patterns-based + WN + Wikipedia                    0.612     0.401     0.509
Patterns-based features (using synonyms)           0.623     0.311     0.461
WN feature                                         0.503     0.412     0.485
Wikipedia feature                                  0.537     0.423     0.497
Base features + Patterns-based (using synonyms)    0.621     0.421     0.517
Base + WordNet                                     0.619     0.430     0.522
Base + Wikipedia                                   0.621     0.425     0.518
All features                                       0.621     0.430     0.521

Table 2 – Results using the SMO classifier / J48 decision tree on the selected test documents.

Results details

When analyzing Table 2, one can notice the following aspects:

Using only our new features (patterns-based + WN + Wikipedia) achieves an F-measure less than 3 points below that of the base features. Thus, utilizing semantic features does seem relevant for coreference resolution. Moreover, in this case the learned decision tree (Appendix A) includes features from all of the semantic resources, demonstrating that all resources are relevant for classification. This also affects the results: when applying all of the feature groups, we get a higher score than from each group separately.

The web-patterns extension using synonyms (described in 3.2.1.2) does increase recall, compared to not using the extension, without harming precision. For example, the synonym pair 'acquisition - dealing' was an extension of the coreference pair 'acquisition - transaction', which helped increase the recall. Precision remained stable since we used only the first synonym of the first WordNet sense. Moreover, we noticed that in many cases using a hypernym for the extension would also help. For example, in the case of 'Sony - company', the pair 'Sony - organization' would be a good extension, where organization is a hypernym of company. However, including hypernyms would most likely harm precision significantly.

By applying only one semantic group at a time, we could analyze the quality of each resource. While with the pattern-based features we get relatively high recall but low precision, with the Wikipedia or WordNet features we achieve lower recall but higher precision. This is explained by the error analysis we performed for each of our features (subsection 4.5).

Adding the semantic features to the base features (which include grammatical agreement features, syntactic features, positional features, string match features and a proper name match normalizer feature) does not achieve higher results. In the next subsection (4.5) we give the error analysis of the semantic features; the errors described there explain this outcome. In the error analysis and in section 5 we also suggest possible solutions, which may improve the results in future work.

Learned decision trees

Using decision trees is useful for inspecting the classifier model, since they are quite easy to understand. The learned decision tree models and their explanation are given in Appendix A. These trees show that, when using only our features, the built tree includes four features: the WordNet feature, the Wikipedia feature, and the two features from the pattern-based module, 'All Patterns' and 'Unique Patterns' (described in 3.2.1.2). In addition, when using the base features, the first feature checked is the string match feature, then the alias feature, etc. The total number of features used is 9.


4.5 Detailed analysis

Below is a detailed error analysis for each of our new feature groups. We first describe the pattern-based feature errors (4.5.1), then the WordNet errors (4.5.2) and the Wikipedia errors (4.5.3).

4.5.1 Pattern-based features

The pattern-based module features indeed provide an indication of coreference for many coreferent word pairs. A sample of such pairs: TWA -> Airlines, Bush -> President, Officer -> Position, Period -> year, Issues -> news. However, in this subsection we describe the pattern-based feature errors. We analyze two types of errors: false positive errors, where snippets were retrieved for non-coreferent pairs, and false negative errors, where the system did not retrieve any snippets for coreferent pairs.

False positive errors

Analyzing the retrieved snippets, we noticed that 40% of them are false positives. In this subsection we describe the classes of 'wrong' snippets, i.e., snippets retrieved for non-coreferent words. In order to determine the types of errors, we analyzed 500 randomly selected snippets. For each group of errors we describe the error type, give examples, and analyze potential ways of solving the error. For each group, the number of occurrences is also reported.

1. Description / clause

Word1: place; Word2: post
Snippet: [NP Interviews/NNS ] [PP for/IN ] [NP posts/NNS ] [VP taking/VBG ] [NP place/NN ] and/CC [NP other/JJ posts/NNS ] [VP advertised./VBN ] [NP Bid/NNP ] for…

Word1: news; Word2: area
Snippet: [VP send/VB ] [NP you/PRP ] [PP updates/NNS ] [PP on/IN ] [NP security/NN issues/NNS ] [PP in/IN ] [NP our/PRP$ area/NN ] and/CC [NP other/JJ news/NN ] [PP about/IN ] …

Word1: year; Word2: bank
Snippet: [Madison/NNP Bank/NNP Ltd.,/NNP Established/NNP ] [PP in/IN ] [NP the/DT year/NN 1966,/CD ] [VP known/VBN ] [PP as/IN ] [NP "AMCO/NNP BANK"/NNP ] …

Error explanation: In the pattern-based module, the first noun in the pattern may be the last noun of a clause in the snippet, while the correct noun for the pattern is another noun in the clause: the head of the clause.

Consider the example of the pair 'News - Area'. The found pattern is 'area and other news'. However, notice that 'area' is part of the noun phrase 'security issues in our area'. Thus, the correct extraction of the pattern should be 'News - issues' rather than 'News - Area'.

Possible solution: Using full parsing would solve the problem, since we could check that the first word of the pair does not appear as a modifier in the snippet (see the sketch below).

Number of occurrences: 48
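As a rough illustration of the proposed fix, the sketch below uses a dependency parse to reject matches where the candidate noun hangs off another noun as a prepositional object or modifier. The use of spaCy and the en_core_web_sm model is an assumption of this sketch (the thesis pipeline used a chunker, not a dependency parser):

    import spacy

    nlp = spacy.load("en_core_web_sm")   # assumed installed for this sketch

    def heads_its_phrase(snippet_text, word):
        """Accept a pattern match only if `word` heads its noun phrase, i.e.
        it is not a prepositional object or a modifier inside a larger NP,
        e.g. 'area' in 'security issues in our area'."""
        for tok in nlp(snippet_text):
            if tok.text.lower() == word.lower():
                return tok.dep_ not in ("pobj", "compound", "amod", "nmod")
        return False

    # heads_its_phrase("security issues in our area and other news", "area")
    #   -> False (reject: 'area' is the object of 'in')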

2. False reference of the pattern

Word1: Bush; Word2: Taxes
Snippet: [NP democrats/NNS ] [VP would/MD lower/VB ] [ taxes/NNS ] [PP like/IN ] [NP BUSH/NNP ] [VP wanted ,perhaps/VBZ ] [NP we/PRP ] [VP would/MD have seen/VB ] [ADJP many/JJ ] [PP of/IN ] [NP those/DT jobs/NNS ] [ADVP back/RB ]

Word1: Bush; Word2: Taxes
Snippet: [NP He/PRP ] [NP deficit/NN ] [VP spent/VBD ] [PP like/IN ] [NP Bush/NNP ] and/CC [NP he/PRP ] [VP cuts/VBZ ] [NP taxes/NNS ] [PP like/IN ] [NP Bush/NNP ] [VP to/TO get/VB ] [NP us/PRP ] [PP out/IN ] [PP of/IN ] [NP those numbers./DT]

Word1: Bush; Word2: Taxes
Snippet: [NP John/NNP McCain/NNP ] [VP will/MD lower/VB ] [NP taxes/NNS ] [PP like/IN ] [NP Bush/NNP ] [VP did?/VBD ]

Error explanation: In the pattern-based module, the first noun in the pattern can be the last noun of a verb phrase.

Consider the first example, for the pair 'Bush - Taxes'. The found pattern is 'taxes like Bush'. However, notice that 'like' relates to the verb phrase 'lower taxes'.

Possible solution: Using full parsing would solve the problem, since we could check that the second word of the pair is not part of such a clause.

Number of occurrences: 27

3. Separation of proper name

Word1: Section; Word2: Act
Snippet: [NP Government/NN Code/NNP Section/NNP] [VP known/VBN ] [PP as/IN ] [NP the/DT Maddy/NNP Act,/NNP ] [VP requires/VBZ ] [PP to/TO ] ..

Word1: Week; Word2: Market
Snippet: [NP Independent/NNP Film/NNP Week,/NNP ] [ADVP formerly/RB ] [VP known/VBN ] [PP as/IN ] [NP IFP/NNP Market/NNP ] [VP |/VBD ]

Word1: Trust; Word2: Bank
Snippet: [NP Arizona/NNP Bank/NNP and/CC Trust/NNP ] [VP (formerly/RB known/VBN] [PP as/IN] [NP Bank/NNP of/IN the/DT Southwest)/NNP] …

Error explanation: In this group, the first or second noun we query in Google is part of a proper noun in the snippet, so the snippet is correct only when considering the whole noun phrase.

Consider the example for the pair 'Week - Market'. The pattern is correct only when considering the full proper name 'Independent Film Week' and the full proper name 'IFP Market'. Otherwise, obviously, week is not coreferent with market.

Possible solution: Ignoring the snippet when we match only part of a proper name would handle such examples. This can be done by changing the regular expressions to eliminate these cases (a sketch follows below).

However, by doing so we would miss some cases in which we should not ignore the snippet. For example, the pair (bank, place) is a coreference pair in our annotation, so we would not want to miss the snippet:

[NP 20/CD Churchill/NNP Place/NNP],[VP also/RB known/VBN ] [PP as/IN ] [NP Street/NNP Bank/NNP ],…

though both words of the pair are part of a proper name in the snippet.

Number of occurrences: 65
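A hedged sketch of the tightening suggested above, written here as a post-filter over the POS-tagged snippet tokens rather than inside the regular expressions: reject a match when the queried word sits inside a longer proper-name chunk (an adjacent NNP token on either side). As noted, this filter also discards some good snippets:

    def inside_proper_name(tagged, i):
        """tagged: list of (word, pos) pairs from the chunked snippet;
        i: index of the matched query word."""
        if tagged[i][1] != "NNP":
            return False
        before = i > 0 and tagged[i - 1][1] == "NNP"
        after = i + 1 < len(tagged) and tagged[i + 1][1] == "NNP"
        return before or after   # e.g. 'Week' inside 'Independent Film Week'

    # inside_proper_name([('IFP', 'NNP'), ('Market', 'NNP')], 1)  -> True (reject)
    # inside_proper_name([('the', 'DT'), ('market', 'NN')], 1)    -> False (keep)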

4. Incorrect head is checked

In this group, the word extracted from the text, which we use in the query, is not the correct head of the noun phrase in the text.

Word1: Unit; Word2: Vice
Snippet: [NP The/DT special/JJ investigations/NNS unit/NN ] [ADVP also/RB ] [VP known/VBN ] [PP as/IN ] [NP Vice/NNP and/CC Narcotics/NNP ] [VP is/VBZ involved/VBN ] [NP inscovert/JJ investigations/NNS ] [PP of/IN ] [NP illegal/JJ activity/NN and/CC enforcement/NN ] [PP of/IN ] [NP liquor/NN codes./NN ]

Word1: Inc; Word2: Effect
Snippet: [NP NetLearn/NNP Ventures,/NNP Inc./NNP ] [VP formerly/RB known/VBN ] [PP as/IN ] [NP Net/JJ Effect,/NN ] [VP is/VBZ ] [NP an/DT IT/NNP consulting company/NN ] [VP established/VBN ] [PP in/IN ] [NP 2001./CD]

Error explanation: When creating the query to Google, we consider the pattern and the pair of nouns which are the heads of the noun phrases in the text. In some cases, we extract a wrong head. For example, in some cases we get 'Inc' as the noun-phrase head in the query. Clearly, searching Google with too small a head can retrieve wrong snippets.

Possible solution: To get the head of the noun phrase we used the implementation in BART; trying additional tools for extracting noun-phrase heads is probably needed. Another option is to submit to Google more than just the head of the noun phrase, though submitting the whole noun phrase would cause very low recall.

Number of occurrences: 31

5. Potentially co-referring words

In this group, the pair of words may corefer in some context, so the retrieved snippets make sense. However, the two words do not corefer in the specific example given in our corpus.

Word1: Post; Word2: Chairman
Snippet: [NP Dr/NNP Sardar/NNP Singh/NNP Johl,/NNP ] [NP a/DT well-known/JJ economist/NN ] and/CC [NP former/JJ vice-chairman/NN ] [PP of/IN ] [NP the planning/DT board/NN ] [VP has/VBZ held/VBN ] [NP many/JJ illustrious/JJ posts/NNS ] [PP like/IN ] [NP chairman/NN ] [PP of/IN ] [NP Commission/NNP fors/NNS ]

Word1: Sony; Word2: Company
Snippet: [NP Sony,/NNP while/IN ] [VP well/RB known/VBN ] [PP as/IN ] [NP THE/DT company/NN ] [NP that/WDT ] [VP has/VBZ made/VBN ] [NP broadcast/NN television/NN camerass,/NN ] [VP is/VBZ ] [NP the/DT owner/NN ] [PP of/IN ] [NP the/DT old/JJ Minolta/NNP camera/NN company,/JJ ] [VP known/VBN ] [PP as/IN ] [NP a/DT superb/NN ]

Word1: Heads; Word2: Post
Snippet: [ADVP (a)/LS ] [NP how/WRB many/JJ departmental/JJ heads/NNS ] and/CC [NP other/JJ senior/JJ posts/NNS ] [VP have/VBP been/VBN appointed/VBN ] [PP from outside/IN ] [PP of/IN ] [NP the/DT Island/NNP ] [PP during/IN ] [NP the/DT last/JJ ten/CD years?/NN ]

Error explanation: In this group the retrieved snippets are not wrong; they certainly make sense in some contexts. For example, 'Sony - Company' can actually be coreferent. However, the words are not coreferent under the specific annotation of our corpus, i.e., 'company' refers to a different company in the text.

Possible solution: This is probably the hardest problem of the pattern-based module: it does not consider the context of the words. This group shows that finding patterns without relating to the context of the examined text can cause errors.

Number of occurrences: 19

The distribution of the errors is shown in Figure 1.

Figure 1: False positive distribution by error group

False negative errors

Another analysis we performed examined the false negatives, i.e., pairs of coreferent words for which no snippets were retrieved. We analyzed about 100 false negative cases, that is, 100 pairs which did not yield any retrieved snippets even though they are coreferent pairs.

We found the following groups:

1. Named entity which does not appear on the web

Although the web is the largest corpus available, in some cases it does not include a specific entity or number. For example: 20000$ -> sum, Mischinski -> director, etc.

Number of occurrences: 36

2. Pattern-matching errors on the snippets

As described in 3.2.1, the last step of our pattern-based module is pattern matching between the parsed snippet and a specific regular expression. In some cases, even though the snippet is correct, it does not pass the pattern-matching step, for example because of a parsing error.

Number of occurrences: 45

3. Too complex an expression / understanding the context

In some cases, the pair of words corefer only in a very specific context, so the relation cannot be found by a specific pattern of the pattern-based module. For example, the pair scale -> issue (when talking about a two-tier wage scale and the major outstanding issue) and the pair filing -> application (when talking about the new filing and a serious application).

Number of occurrences: 17

The percentages of the errors are shown in Figure 2.

Figure 2: False negative distribution

4.5.2 WordNet feature

WordNet seems to be an important tool for coreference resolution. Pairs of coreferent words such as position -> job and merger -> integrating are recognized by the feature as coreferring word pairs. Yet, in this subsection we analyze the false positive and false negative errors of the feature.

False positive errors

When examining the false positives, we noticed that the main problem is the large number of senses that WordNet includes for many words in our corpus (senses are sorted by frequency, from most to least frequent). For example, WordNet does have a path between the words 'offer' and 'attempt', which are unlikely to corefer; the path is derived from the last of the 4 senses of 'offer'. Another example is 'sale' and 'income': these words relate to the same topic but are unlikely to corefer. In this case, the path is derived from the last of the 6 senses of 'sale'.


Moreover, as with the pattern-based features, we noticed the error of potentially co-referring words which are not coreferent in our corpus.

False negative errors

The main problem of WordNet that causes false negatives is its sparseness. This mainly shows up in:

Specific named entities, where we miss pairs such as John Krieger -> a spokesman, Bush -> President, 12.3.1997 -> today, etc.

Numbers, where we miss pairs such as 20000$ -> sum, 1987 -> the year, etc.

Moreover, in many cases the words appear in WordNet but have no hypernym path, mainly because the coreference requires understanding the context. This includes pairs such as the women -> the members, or the pair scale -> issue, which corefer only in a specific context (talking about a 'two-tier wage scale' and the 'major outstanding issue').

4.5.3 Wikipedia feature

Wikipedia seems to be another resource likely to contribute to the coreference resolution task. By processing Wikipedia pages we extract synonyms and hypernyms; for example, the alias Minn is recognized as connected to Minnesota. Yet, in this subsection we analyze the false positive and false negative errors of the Wikipedia feature.


False positive errors

For the false positive analysis, we examined 80 false positive pairs. We noticed that the main problem of the feature is extracting semantic relations other than synonyms and hypernyms. For example, we sometimes get a derivation relation, as in the pair 'employment' and 'employee'. This happens because the first sentence of the Wikipedia page for 'employment' is:

"Employment is a contract between two parties, one being the employer and the other being the employee"

Another relation is meronymy, as in the pair 'price' and 'service'. This happens because the first sentence of the Wikipedia page for 'price' is:

"Price in economics and business is the result of an exchange and from that trade we assign a numerical monetary value to a good, service or asset."

Moreover, as with the previous semantic features, we noticed the error of potentially co-referring words which are not coreferent in our corpus, for example the pair '1987' and 'year', and the pair 'yesterday' and 'the day'.

False negative errors

As with WordNet, when using Wikipedia we suffer from its sparseness. Though many named entities have a Wikipedia page (such as IBM, George W. Bush, Minnesota), in many cases the extracted relation is not correct, as in the pair 'John Krieger' and 'spokesman'. Wikipedia's sparseness also shows for common nouns, as for the pair 'talks' and 'negotiations'.


Another problem is the number of senses a word can have. For example, Wikipedia does contain a page for 'party', but misses the pair 'parties' and 'sides', since the page relates to another sense of 'party'.


5. Conclusions and future work

Coreference resolution is a significant task in text understanding, and can serve as a subtask of many other natural language processing applications, such as question answering and information extraction. In this thesis we have presented a coreference resolution application which includes new semantic features, drawing on varied semantic resources: WordNet, Wikipedia and Google (via the pattern-based module).

As a baseline coreference application, we used the BART system, which implements twelve features, including grammatical agreement features, syntactic features, positional features, string match features, a proper name match normalizer feature, and a WordNet semantic feature. However, these features largely lack semantic knowledge.

Using our semantic resources, we aimed to identify the semantic relation between the words of a candidate antecedent-anaphor pair, where the most relevant relations are synonymy and hypernymy. Having these relations between the candidate antecedent and anaphor increases the likelihood of coreference between the pair of words.

We applied the coreference resolution application with a different group of features each time. Applying it using only our new semantic features (without the base features of a standard coreference resolution system (Soon, 2001)), we get an F1 score which is close to the F1 obtained with the base features. We showed that in this case the features appearing in the decision tree come from all the semantic resources, that is, all feature groups are relevant. This also affects the results: when applying all of the feature groups, we get a higher score than from each group separately. Thus, the results indicate that the semantic features are relevant for the coreference resolution task. Also, when applying the pattern-based module, we applied the synonyms extension (subsection 3.2.1.2), thus increasing recall without harming precision. Next, we tried using both the baseline features and the semantic features; in this case, the semantic features did not increase the F1 score.


We reported a detailed error analysis for each of the semantic resources used. We grouped the errors into main types, explaining each group with examples and reporting its frequency. Our analysis reveals many directions for future research. For example, for the pattern-based module, we note that full parsing of the retrieved snippets is needed to discard cases where the patterns do not relate to the checked pair of words. Also, querying not only the heads of the noun phrases but also larger parts of the noun phrases should be examined, since we sometimes extract too small heads, thus retrieving unwanted snippets. As for WordNet, we showed that searching for a hypernym path between the candidate antecedent-anaphor pair can be noisy when the path involves uncommon senses of the words; therefore, considering only paths through the common sense or senses of the words should be explored. As for Wikipedia, noise should be reduced by ignoring unwanted relations, mainly derivations and meronyms.


References

Banerjee, S. and T. Pedersen. 2003. Extended gloss overlap as a measure of semantic relatedness. In Proceedings of IJCAI-03, pages 805–810.

Dagan, Ido, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL recognising textual entailment challenge. In Lecture Notes in Computer Science, volume 3944, pages 177–190.

Etzioni, Oren, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates. 2004. Web-scale information extraction in KnowItAll. In Proceedings of WWW-04, New York, NY, USA.

Mirkin, Shachar, Ido Dagan, and Maayan Geffet. 2006. Integrating pattern-based and distributional similarity methods for lexical entailment acquisition. In Proceedings of the COLING-ACL Poster Sessions, pages 579–586.

Modjeska, N., K. Markert, and M. Nissim. 2003. Using the web in machine learning for other-anaphora resolution. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing.

Ng, V. and C. Cardie. 2002. Improving machine learning approaches to coreference resolution. In Proceedings of ACL 2002, pages 104–111.

Ponzetto, Simone Paolo and Michael Strube. 2006. Exploiting semantic role labeling, WordNet and Wikipedia for coreference resolution. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, New York, NY, 4–9 June 2006, pages 192–199.

Quinlan, J. R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA.

Shnarch, Eyal, Libby Barak, and Ido Dagan. 2009. Extracting lexical reference rules from Wikipedia. In Proceedings of ACL 2009.

Snow, Rion, Daniel Jurafsky, and Andrew Y. Ng. 2006. Semantic taxonomy induction from heterogenous evidence. In Proceedings of COLING-ACL 2006, pages 801–808.

Soon, W., H. Ng, and D. Lim. 2001. A machine learning approach to coreference resolution of noun phrases. Computational Linguistics, 27(4):521–544.

Versley, Y., S. Ponzetto, M. Poesio, V. Eidelman, A. Jern, J. Smith, X. Yang, and A. Moschitti. 2008. BART: A modular toolkit for coreference resolution. In Companion Volume of the Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, June 16–18, 2008.

Vilain, M., J. Burger, J. Aberdeen, D. Connolly, and L. Hirschman. 1995. A model-theoretic coreference scoring scheme. In Proceedings of the Sixth Message Understanding Conference (MUC-6).


Appendix A – Decision tree examples and explanation

Tree 1 – Base features and pattern-based features (details below)

[Decision tree diagram. Features appearing in the tree: String match, Alias, Unique patterns, Appositive, Both proper name, Number, All patterns, Pronoun anaphor, Gender, Number agreement and Sentence distance. Edges carry the feature values (True/False/Unknown or numeric thresholds); leaves give the True/False classification.]


Tree 2 – Pattern-based features + WordNet feature + Wikipedia feature (details below)

[Decision tree diagram. Features appearing in the tree: Wikipedia, WordNet, All patterns and Unique patterns. Edges carry the feature values (True/False/Unknown or numeric thresholds); leaves give the True/False classification.]


Trees explanation

In the given trees, the nodes are the feature names and the edges are the feature values. Each leaf of the tree specifies the pair classification: true for a coreferent pair and false for a non-coreferent pair.

Tree 1 is the model built for the base features together with our pattern-based features. In this tree we start by checking the string match feature, then the alias feature, continuing with the gender and unique-patterns features, etc. Notice that two of our pattern-based features appear in the tree, indicating that the pattern-based features are indeed indicative of a coreference relation.

Tree 2 is the model built only for the new semantic features: the pattern-based features, the WordNet feature and the Wikipedia feature. Notice that the features appearing in the tree come from all the semantic resources, indicating that all feature groups are relevant. This also affects the results: when applying all of the feature groups, we get a higher score than from each group separately.


Abstract

Anaphora is a common phenomenon in speech and in text. It occurs when two expressions in a text refer to the same entity in the real world. In other words, the goal of the task is to determine whether some noun phrase, called the anaphor, refers to a noun phrase preceding it in the document, called the antecedent. For example, in the sentence "Danny had a birthday party and he got a lot of presents", the words 'Danny' and 'he' refer to the same entity, Danny; that is, Danny is the one who got the presents.

In this work we use machine learning methods and treat this problem as a classification problem: for each pair of words in the text, various features are collected, for example, the number of words separating them and agreement in number (singular/plural) and gender. Using these features, a classifier is built that decides whether an anaphoric relation holds between a pair of words or not. After the classification stage of all the noun-phrase pairs that hold an anaphoric relation, all such pairs appearing in the text that refer to the same real-world entity are gathered. This process is called anaphora clustering, and the anaphors and noun phrases in the text that refer to the same real-world entity are called an anaphora cluster.

Learning-based approaches to anaphora resolution attempt to improve the performance of anaphora resolution systems mainly by extending the set of features collected for the candidate word pairs of the anaphora classifier (Soon (2001), Ng and Cardie (2002)). However, this feature set lacks semantic knowledge about the word pairs, for example, identifying a semantic relation between the candidate pair, such as a synonym or hypernym relation.

The goal of this work is therefore to present a method for using semantic knowledge for the anaphora problem, where this knowledge is used to identify semantic relations between the word pairs. The existence of such a relation between a pair of words provides an indication for the existence or absence of an anaphoric relation. The semantic tools we used are Google (via the pattern-based module), WordNet and Wikipedia. For each semantic tool we created a new group of features for the anaphora classifier.

In this work we show that semantic knowledge is indeed relevant for anaphora resolution, by running the anaphora system with a different group of features each time, comparing the results, and presenting the learned decision trees. In addition, we present an in-depth analysis of the results for each of the tools, examining the advantages, disadvantages and error types of each. This error analysis reveals many directions for future research.


This work was carried out under the supervision of Dr. Ido Dagan of the Department of Computer Science, Bar-Ilan University.


Bar-Ilan University
The Department of Computer Science

Using Semantic Knowledge for Coreference Resolution

Chen Erez

This work is submitted in partial fulfillment of the requirements for the Master's Degree in the Department of Computer Science, Bar-Ilan University

Ramat-Gan, Israel                June 2009, Sivan 5769
