
Computers and the Humanities 35: 273–297, 2001. © 2001 Kluwer Academic Publishers. Printed in the Netherlands.


Automatic Extraction of Collocations From Korean Text

SEONHO KIM, JUNTAE YOON and MANSUK SONG
Department of Computer Science, College of Engineering, Yonsei University, Seoul, 120-749, Korea

Abstract. In this paper, we propose a statistical method to automatically extract collocations from a Korean POS-tagged corpus. Since a large portion of language is represented by collocation patterns, collocational knowledge provides a valuable resource for NLP applications. One difficulty of collocation extraction is that Korean has a partially free word order, which also appears in collocations. In this work, we exploit four statistics, 'frequency', 'randomness', 'convergence', and 'correlation', in order to take into account the flexible word order of Korean collocations. We separate meaningful bigrams using an evaluation function based on the four statistics and extend the bigrams to n-gram collocations using a fuzzy relation. Experiments show that this method works well for Korean collocations.

Key words: α-cover, collocations, convergence, correlation, interrupted bigram, randomness

1. Introduction

A large proportion of natural language is represented by collocation patterns. Thus, much work has been done on the automatic extraction of collocations, and the notion of collocation has been defined in various ways depending on the application of interest. The general notion of collocation is the way that some words occur regularly whenever a certain word is used. Such combinations are usually unpredictable from syntactic and semantic features.

In some of the computational and statistical literature, a collocation is defined as a sequence of consecutive words with a special behavior that has the characteristics of a syntactic and semantic unit (Choueka et al., 1983). However, in much research, a phrase is also regarded as a collocation even if it is not consecutive. Furthermore, Church and Hanks (1989) included in collocations words that are strongly associated in meaning but do not occur in a common grammatical unit or in a particular order.

From a computational point of view, collocations include light verbs, phrasal verbs, proper names, terminological expressions and grammatical patterns. They often have a specialized meaning or are idiomatic, but they need not be (Manning and Schütze, 1999).

Collocational knowledge is essential for many NLP applications such as lexical parsing, language generation, machine translation, and information retrieval. For example, we often cannot translate in word-by-word fashion in machine translation. Due to idiosyncratic differences between two languages, when translating a source language into a target language, we need something more than syntactic structure and semantic representation. In this case, collocations provide a basis for choosing the corresponding lexical items.

Despite the importance of collocational knowledge, it is usually not available in manually compiled dictionaries. Our main objective in this paper is to automatically retrieve collocations with a broad coverage that are helpful for NLP applications. Therefore, the notion of collocation here is closer to lexical and grammatical patterns. That is, it is more in line with combinations of words which occur together more frequently than by chance, as defined by Benson et al. (1986).

To some extent, finding common word combinations in large-scale corpora looks easy. However, it is hard to identify the central tendency of the distribution of collocations because the expressions take various word forms of arbitrary length. In addition, since criteria for judging collocations are often ambiguous, selecting meaningful patterns is difficult. In particular, Korean allows arguments to be freely scrambled, and this flexibility of word order makes it more difficult to identify meaningful collocation patterns.

In order to deal with the free word order of Korean, we present the following four statistics: 'frequency', 'convergence', 'randomness', and 'correlation'. For each morpheme, our model first investigates the positional frequency distribution of all possible bigrams that occur together within a specific window. Therefore, both adjacent and interrupted word sequences are retrieved. The term 'interrupted bigram' here refers to a bigram whose two morphemes are separated by an arbitrary number of other morphemes. Next, we extract meaningful bigrams using the four statistics, and the meaningful bigrams are extended to n-gram collocations.1

2. Related Works

As mentioned before, the notion of collocation has been defined variously according to applications. Nevertheless, most authors in the computational and statistical literature agree that collocations have particular statistical distributions, so that the component words cannot be considered independent of each other (Cruse, 1986).

Choueka et al. (1983) viewed a collocation as a sequence of consecutive words that frequently occur together. However, many collocations involve words that may be separated by other words. Church and Hanks (1989) defined a collocation as a word pair that occurs together more often than expected, and included strongly associated word pairs in collocations even if they do not occur in a grammatical unit. In order to evaluate the lexical association of two words, they used mutual information. As a result, the extracted word pairs may not be directly related, and both interrupted and uninterrupted bigrams were retrieved.

Haruno et al. (1996) extended collocations up to n words using mutual information. It is a well-known problem that mutual information overestimates the probabilities of rare events. For this reason, Breidt (1993) used the t-score to find significant verb-noun combinations.

Martin (1983) reported that a ten-word window could cover more than 95% of significant English collocations. Under this assumption, Smadja (1993) collected all possible bigrams that co-occur within a ten-word window. He defined four properties of a collocation as being an arbitrary, domain dependent, recurrent, and cohesive lexical cluster. The lexical strength of a bigram was computed using a 'z-score' and its offset distribution was analyzed using 'spread', a sort of variance measure. If the offsets are randomly distributed, the two words are considered to occur together by coincidence.

Nagao and Mori (1994) retrieved consecutive n-grams for an arbitrarily large number of n. However, it is hard to find a proper n, and a lot of fragments are produced. Besides, adjacent n-grams are insufficient to capture the various patterns of collocations. Shimohata et al. (1997) applied an entropy measure to filter out the fragments obtained by the above n-gram model. They measured a degree of disorder based on the distribution of the neighboring words that appear around a string; strings with a high disorder degree are accepted as consecutive n-gram collocations. This disorder measure is efficient in eliminating wrong fragments, but it cannot deal with interrupted collocations. Ikehara et al. (1996) also extended the method presented by Nagao and Mori (1994). They produced interrupted collocations by combining uninterrupted substrings; in the extraction of interrupted collocations, substrings that partially overlap one another were removed.

Dunning (1993) presented a likelihood ratio test which works well for both rare and common words. In order to prove the efficacy of the likelihood ratio test, he retrieved significant bigrams which are highly associated in text. If words do not form a collocation, they are independent of one another. To check the independence of two words A and B, he tests whether the distribution of A given B is the same as the distribution of A given not B.

Lin (1998) used a parser to extract dependency triples from a corpus and separated collocations from the dependency triples using mutual information. Most works do not make a distinction between compositional and non-compositional collocations. Lin (1999) retrieved non-compositional phrases based on the idea that if an object appears only with one verb in a large corpus, the combination has an idiomatic nature.

Lee et al. (1995) automatically retrieved meaningful interrupted bigrams from Korean POS-tagged corpora using several filtering measures, but more than 90% of the results were consecutive bigrams.

In sum, lexical measures such as simple frequency, z-score, chi-square, t-test, likelihood ratio, relative frequency ratio, and mutual information have identified various properties of collocations and yielded useful results. However, most of them work well only for fixed phrases.


In fact, the component words of many collocations have a flexible relationship to one another. For example, a collocation involving 'make' and 'decision' appears as 'to make a choice or a decision', 'make his decision', 'make my own decision', 'make their decision' and so on, which we call flexible patterns. Flexible patterns are common in a free word order language such as Korean. Moreover, as shown in Table I, the frequency of co-occurrence is not always an adequate criterion for judging collocations. In this paper, we propose a method to extract collocational patterns which are flexible and comprise infrequent words.

3. Input Format

In this section, we describe an input format which is appropriate for representing the structure and linguistic content of Korean.

Above all, we will explain some characteristics of Korean. First, Korean is an agglutinative language. A sentence in Korean consists of a series of syntactic units called eojeol. An eojeol is often composed of a content word and function words. Tense markers, clausal connectives, particles and so forth are contained in an eojeol. Thus, one or more words in English often correspond to an eojeol, i.e. a couple of morphemes. For instance, the phrase 'to the school' in English corresponds to the single eojeol ' (haggyo-ro, school-to)' in Korean.

Second, functional categories such as postpositions, endings, the copula, auxiliary verbs and particles are highly developed in Korean. From a collocational viewpoint, function words are important in producing collocations since they determine syntactic structures. In addition, phrases such as ' (eul/E-su/N-iss/V, can)' and ' (e/P-ddareu/V-a/E, according to)' operate as function words and form collocations. For these reasons, we employ a morpheme-based model which extracts collocational patterns from a POS-tagged3 corpus.

Another characteristic is that Korean is a free word order and head-final language. A head follows its dependents, and the positions of its arguments are free. Thus, words in a collocation occur in text more flexibly than in other languages. This means a large volume of samples is required to estimate accurate probabilities. We avoid this problem by adopting an interrupted bigram model, which is also efficient in accounting for the flexibility of word order. In addition, similar to Xtract, we use the relative positions of co-occurring words (Smadja, 1993).

To construct the interrupted bigram model, the frequency distribution of co-occurrences with a given morpheme is represented by a co-occurrence matrix (CM). We first define the CM based on the structure of (1). Let (m1, . . ., mn) be the list of morphemes co-occurring with a given morpheme m. The CM represents the co-occurrence frequencies of (m, mi) with respect to each position; that is, each column in the CM corresponds to an offset between m and mi. For example, f1,−2 indicates the frequency with which m1 appears on the left side of m at a distance of 2. Since every head follows its modifiers in Korean, the ten morphemes on the left side of a focal morpheme are taken as the collocational window.

$$X_m = \begin{pmatrix}
f_{1,-10} & f_{1,-9} & \cdots & f_{1,-1} \\
f_{2,-10} & f_{2,-9} & \cdots & f_{2,-1} \\
\vdots & \vdots & \ddots & \vdots \\
f_{n,-10} & f_{n,-9} & \cdots & f_{n,-1}
\end{pmatrix} \qquad (1)$$
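To make the construction of the CM concrete, the following sketch collects positional co-occurrence counts from a morpheme sequence, assuming a window of ten morphemes to the left of each focal morpheme as described above. The function name 'build_cm' and the input format are illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict

WINDOW = 10  # number of morphemes considered to the left of the focal morpheme

def build_cm(morphemes):
    """Positional co-occurrence counts: cm[m][mi][offset] = frequency,
    where offset in -10..-1 is the position of mi relative to m."""
    cm = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for i, m in enumerate(morphemes):
        for j in range(max(0, i - WINDOW), i):
            cm[m][morphemes[j]][j - i] += 1  # offset is negative: mi precedes m
    return cm

# toy usage with an English stand-in for a morpheme sequence
cm = build_cm(["the", "dog", "chased", "the", "cat"])
print(dict(cm["chased"]["dog"]))  # {-1: 1}
```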

To represent the collocational properties of the bigrams of a given morpheme m, we transform the CM into a property matrix T(Xm), in which every bigram of m is represented by the following four statistics: VFrequency, VConvergence, VRandomness, and VCorrelation. We explain these statistics in Section 5.

$$T(X_m) = \begin{pmatrix}
V_{1F} & V_{1C} & V_{1R} & V_{1CR} \\
V_{2F} & V_{2C} & V_{2R} & V_{2CR} \\
\vdots & \vdots & \vdots & \vdots \\
V_{nF} & V_{nC} & V_{nR} & V_{nCR}
\end{pmatrix} \qquad (2)$$

4. Types of Collocations

In this section, we describe the collocation patterns of Korean. We divide Korean collocations into the following categories: idiomatic expressions, syntactic collocations, and morphological collocations.

Idiomatic expressions are non-compositional phrases, which can again be divided into two classes: (1) idioms and (2) semantic collocations.

An idiom does not follow the compositionality principle; it has a meaning which is totally different from the original definitions of its components. Moreover, it requires a specific sentence structure or lexical combination pattern to possess the idiomatic meaning. That is, the components of an idiom cannot be substituted by other words.

On the other hand, the original meaning of a semantic collocation is somewhat changed by the combination of words. The components can be replaced with other words or modified by other phrases more freely than in idioms. For instance, each word in the phrase ' (sog'eul, heart/OBJ) (tae'uda, burn)' partially keeps its original meaning, but the proper interpretation of the phrase is 'make someone worry'. In this case, ' (sog'eul, heart/OBJ)' can be modified by another word such as ' (nae, my)'.

Syntactic collocations are subdivided into three classes: (1) case frames, (2) selectional restrictions and (3) terminology.


Table I. Collocational patterns

In almost all cases, fairly strict semantic/grammatical restrictions hold between a verb and its noun phrases. A verb takes a particular syntactic structure which is specified by its case frame. For instance, ' (jada, sleep)' requires an object and ' (juda, give)' takes two noun phrases. In addition, the verb ' (jada, sleep)' requires the cognate object ' (jam'eul, sleep/OBJ)'. This is related to the legal combinations of senses that can co-occur, which is called selectional restriction.

Collocations extracted from technical domains correspond to technical terms or terminological phrases. In Korean, they are almost always combinations of nouns.

Morphological collocations correspond to multiple function words or multi-word units which appear in text as a consecutive word group and are used as one unit. For example, ' (e ddara, according to)' consists of three morphemes but represents one meaning.

Table I shows some examples of collocations and their frequency counts. Idiomatic expressions and morphological collocations are structured in rigid ways, whereas the other types of collocations are structured in flexible ways. Table II shows the basic syntactic patterns of collocations.

5. Algorithm

In this section, we explain the four statistics that represent the properties of collocations and an algorithm to retrieve meaningful collocations. Collocation extraction proceeds in two steps. First, we make use of the four statistics to separate meaningful bigrams.


Table II. Basic syntactic patterns

Type     Relationship between                         Representation
A::N     A noun and its modifier                      Adnominal noun
J::N     A noun and its adjective modifier            Adjective-adnominal ending noun
N::N     A noun and its nominal modifier              Noun-adnominal postposition noun
                                                      Noun noun
N::V     A verb and its complement                    Noun-subjective postposition verb
                                                      Noun-objective postposition verb
                                                      Noun-adverbial postposition verb
AD::V    A verb and its adverbial modifier            Adverb verb
N::J     An adjective and its complement              Noun-subjective postposition adjective
AD::J    An adjective and its adverbial modifier      Adverb adjective

Next, the meaningful bigrams are extended to n-gram collocations using a fuzzy compatibility relation.

First of all, we describe prerequisites for explaining the four statistics. Empirically, it has been shown that the frequency distribution of bigrams is approximated by a Weibull distribution like (3). This means that there exist many rare bigrams in text. Therefore, we exclude very rare bigrams using the median m such that P{X ≥ m} ≥ 1/2 for a frequency distribution X. If the median is less than 3, we take 3 as the median value. This filtering affects the computation of the four statistics.

$$F(x) = 1 - e^{-\alpha x^{\beta}}, \qquad 0 < x < \infty, \text{ where } \alpha > 0,\ \beta > 0 \qquad (3)$$
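As a small sketch of this pruning step, assuming the bigram totals have been collected in a dictionary, the cut-off can be computed as the median of the totals, floored at 3. 'prune_rare' is a hypothetical helper, not the authors' code.

```python
import statistics

def prune_rare(bigram_totals, floor=3):
    """Drop bigrams whose total frequency falls below the median,
    where the median itself is floored at 3 (the paper's heuristic)."""
    cutoff = max(statistics.median(bigram_totals.values()), floor)
    return {bg: f for bg, f in bigram_totals.items() if f >= cutoff}

totals = {("mi", "m1"): 1, ("mi", "m2"): 7, ("mi", "m3"): 2, ("mi", "m4"): 12}
print(prune_rare(totals))  # the median is 4.5, so only m2 and m4 survive
```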

For further discussion, suppose a sample space $S_{m_i}$, whose cardinality is n, with respect to the possible bigrams of a morpheme $m_i$. Consider a bigram $(m_i, m_k)$ with the 'JP' tag pattern and its frequency distribution $(f_{ik,-10}, f_{ik,-9}, \ldots, f_{ik,-1})$, where J refers to an adjective morpheme and P to a postposition morpheme. In Table III, $f_{ikp}$ denotes the frequency of the bigram when the offset between $m_i$ and $m_k$ in the text is p, and $f_{ik+}$ denotes $\sum_{p=-10}^{-1} f_{ikp}$. Also, $(f_{i+,-10|JP}, f_{i+,-9|JP}, \ldots, f_{i+,-1|JP})$ denotes the distribution of frequency counts of all bigrams of $m_i$ that have the JP tag pattern. From now on, we use the bigram $(m_i, m_k)$ with the 'JP' tag pattern to demonstrate our algorithm.

In our problem, we do not know the distribution of the words which constitute collocations. When the distribution of a random sample is unknown, rather than attempting to make inferences about the distribution in its entirety, we often try to make inferences about properties of the distribution that are described by suitably defined measures. A measure that does not depend on the unknown parameters of the distribution but only on the samples is called a statistic (Ross, 1987). We now define four statistics related to the properties of collocations.


Table III. All interrupted bigrams of $m_i$ with the 'JP' tag relation

Word pair      Tag pattern   Total frequency   Positional distribution
(m_i, m_1)     (J, P)        f_{i1+}           f_{i1,-10}  f_{i1,-9}  ...  f_{i1,-1}
(m_i, m_2)     (J, P)        f_{i2+}           f_{i2,-10}  f_{i2,-9}  ...  f_{i2,-1}
...
(m_i, m_k)     (J, P)        f_{ik+}           f_{ik,-10}  f_{ik,-9}  ...  f_{ik,-1}
...
(m_i, m_n)     (J, P)        f_{in+}           f_{in,-10}  f_{in,-9}  ...  f_{in,-1}
Total                        f_{i++|JP}        f_{i+,-10|JP}  f_{i+,-9|JP}  ...  f_{i+,-1|JP}

5.1. PROPERTIES

The distributional properties of collocations which we consider are mainly related to the frequency and positional information of a word pair. As mentioned before, the relationship between position and collocation is very complicated in Korean.

1. $V_f$: Benson et al. (1986) define a collocation as a recurrent word combination. A simple way of finding collocations in text is therefore to use frequency counts: any frequently recurring pair is a candidate collocation. For this purpose, we introduce the $V_f$ statistic as in (4). Taking the example of a bigram $(m_i, m_k)$ with the JP tag pattern, $V_f$ is computed from the mean $\bar{f}_{iJP}$ and standard deviation $\sigma_{iJP}$ as follows (see the sketches after each statistic; none of them is the authors' implementation):

$$V_f = \frac{f_{ik+} - \bar{f}_{iJP}}{\sigma_{iJP}}, \quad \text{where } \bar{f}_{iJP} = \frac{\sum_{l=1}^{n} f_{il+}}{n} = \frac{f_{i++|JP}}{n}, \quad \sigma_{iJP} = \sqrt{\frac{\sum_{l=1}^{n} (f_{il+} - \bar{f}_{iJP})^2}{n}} \qquad (4)$$
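A direct transcription of (4) in Python, with a guard against zero variance added as a defensive assumption:

```python
import math

def v_frequency(totals, k):
    """V_f for the k-th bigram: the z-score of its total frequency f_{ik+}
    against all bigrams of the same morpheme and tag pattern.

    totals: list of totals f_{il+}, l = 1..n, for one tag pattern."""
    n = len(totals)
    mean = sum(totals) / n
    sd = math.sqrt(sum((f - mean) ** 2 for f in totals) / n)
    return (totals[k] - mean) / sd if sd > 0 else 0.0

print(v_frequency([2, 3, 40, 5, 2], 2))  # the frequent third bigram scores high
```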

2. $V_c$: The words in a collocation are lexically related within a syntactic structure. However, it is actually hard to decide the range of words related to a given word without an accurate syntactic analysis of the sentence. That is, simply extracting all co-occurrences within a large window could generate many unrelated bigrams, even though it takes the flexible word order into account. $V_c$ is introduced to evaluate the relatedness between words (morphemes).

Intuitively, if two words tend to co-occur at a specific offset, we can assume that they appear in a grammatical unit or in a particular order, which reflects a syntactic constraint. In a free word order language, whether the offset of two words is two or three makes little difference in judging whether a bigram is meaningful. For example, consider the two frequency distributions (0,1,0,0,0,1,0,0,1,0) and (0,0,0,1,1,1,0,0,0,0). The variances of the two distributions are the same, but (0,0,0,1,1,1,0,0,0,0) is expected to be more informative; in fact, under a free word order language, it is intuitively similar to (0,0,0,0,3,0,0,0,0,0).

For this purpose, we measure convergence at each position. To handle the flexibility of word order, a spatial mask (1/2, 1, 1/2) is used when measuring convergence at each position; that is, the convergence value of a bigram at a specific position is influenced by the frequency counts of its neighbors. We assume that the two words of a bigram are related to each other if they have a high convergence value at some specific position over the distribution. Such a bigram then serves as a candidate for a meaningful bigram.

The smoothed count $m_{ikp}$ at the p-th position is computed as follows:

$$m_{ikp} = \begin{cases}
(4f_{ik1} + 3f_{ik2} + f_{ik3})/4 & p = 1 \\
(f_{ik,p-1} + 2f_{ikp} + f_{ik,p+1})/2 & p = 2, \ldots, 9 \\
(f_{ik8} + 3f_{ik9} + 4f_{ik10})/4 & p = 10
\end{cases} \qquad (5)$$

$\max_p m_{ikp} / f_{ik+}$ seems to represent the convergence of $(m_i, m_k)$, but it is deficient. For example, the frequency distribution (0,1,1,1,0,3,2,0,0,0) is less informative than (0,0,3,0,0,3,2,0,0,0). Accordingly, $n'$ was introduced as a penalty factor:

$$V_c = \max_{p=1,2,\ldots,10} \frac{m_{ikp}}{\sqrt{n'}\, f_{ik+}} \qquad (6)$$

In (6), $n'$ is the number of positions m such that $f_{ikm} \ne 0$ for $1 \le m \le 10$. We avoid an excessive influence of $n'$ by taking its square root.
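A sketch of (5) and (6), assuming the positional distribution is given as a list of ten counts (a hypothetical helper, not the paper's code):

```python
import math

def v_convergence(dist):
    """V_c: smooth the positional distribution with the (1/2, 1, 1/2) mask,
    take the highest smoothed peak, and penalize scattered distributions
    by sqrt(n'), the root of the number of occupied positions."""
    f = dist  # counts at positions p = 1..10
    smoothed = [(4 * f[0] + 3 * f[1] + f[2]) / 4]
    smoothed += [(f[p - 1] + 2 * f[p] + f[p + 1]) / 2 for p in range(1, 9)]
    smoothed.append((f[7] + 3 * f[8] + 4 * f[9]) / 4)
    total, n_prime = sum(f), sum(1 for x in f if x != 0)
    if total == 0:
        return 0.0
    return max(smoothed) / (math.sqrt(n_prime) * total)

# the converged distribution scores higher than the scattered one
print(v_convergence([0, 0, 0, 1, 1, 1, 0, 0, 0, 0]))  # ~0.385
print(v_convergence([0, 1, 0, 0, 0, 1, 0, 0, 1, 0]))  # ~0.192
```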

3. $V_r$: To judge whether a bigram is meaningful, we also make use of the randomness of its distribution. If the occurrences of a pair are randomly distributed over positions, the pair is unlikely to be meaningful. One way of checking randomness is to measure how much a given distribution differs from the uniform distribution.

In (7), $\bar{f}_{ik}$ indicates the expected frequency count of $(m_i, m_k)$ at each position under the assumption that the pair occurs at random positions. Consequently, $|f_{ikp} - \bar{f}_{ik}| / \bar{f}_{ik}$ is the error rate of the observed frequency count at position p under this assumption. The differences between the observed and expected frequency counts are summed over all positions in $V_r$; if the value is large, the distribution is not random. Note that the expected frequency count used as the denominator of $V_r$ is computed only from the corresponding row of the CM.

$$V_r = \sum_{p=1}^{10} \left( \frac{f_{ikp} - \bar{f}_{ik}}{\bar{f}_{ik}} \right)^2 \qquad (7)$$
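A sketch of (7) under the same list-of-counts assumption, where the uniform expectation is the row total divided by the window size:

```python
def v_randomness(dist):
    """V_r: squared relative error between the observed positional counts
    and the uniform expectation f_{ik+}/10, summed over positions."""
    expected = sum(dist) / len(dist)
    if expected == 0:
        return 0.0
    return sum(((f - expected) / expected) ** 2 for f in dist)

print(v_randomness([0, 0, 0, 9, 0, 0, 0, 0, 0, 0]))  # peaked: 90.0
print(v_randomness([1, 1, 1, 1, 1, 1, 1, 1, 1, 1]))  # uniform: 0.0
```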


Figure 1. Frequency distributions of some pairs with PJ or PV tag pattern.

4. $V_{cr}$: According to Figure 1, bigrams with the same or similar syntactic tag patterns have similarly shaped positional distributions. Thus, we assume that if the frequency distribution of a bigram follows that of the bigrams with the same tag pattern, then the bigram is meaningful. In order to check the structural similarity of two distributions, we use a correlation measure. In general, the correlation coefficient measures the strength of the linear association between two variables.

Figure 1 shows the frequency distributions of the tag patterns PJ and PV, where PJ refers to postposition-adjective (predicative use) and PV to postposition-verb. They have sharp peaks at the first and third offsets, which indicates that a word whose part of speech is postposition has a high probability of occurring in the first or third position before a predicate.

In the case of a bigram $(m_i, m_k)$, the correlation between $(f_{ik,-10}, f_{ik,-9}, \ldots, f_{ik,-1})$ and $(f_{i+,-10|JP}, f_{i+,-9|JP}, \ldots, f_{i+,-1|JP})$ is computed. Let the former be x and the latter y. The correlation is computed by standardizing x and y: let $\tilde{x}$ and $\tilde{y}$ be the vectors of deviations from the means, i.e. $\tilde{x}_i = x_i - \bar{x}$ and $\tilde{y}_i = y_i - \bar{y}$, and let $x^* = \tilde{x}/\sigma_x$ and $y^* = \tilde{y}/\sigma_y$. Then the correlation $V_{cr}$ is:

$$V_{cr} = \frac{x^{*\prime} y^{*}}{10} \qquad (8)$$
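Since (8) is the dot product of two standardized 10-vectors divided by 10, it is the Pearson correlation; a sketch:

```python
import math

def v_correlation(x, y):
    """V_cr: Pearson correlation between a bigram's positional distribution x
    and the aggregate distribution y of all bigrams sharing its tag pattern."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    if sx == 0 or sy == 0:
        return 0.0
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n * sx * sy)

pattern = [40, 2, 35, 3, 1, 1, 0, 0, 0, 0]  # aggregate: peaks at offsets 1 and 3
print(v_correlation([8, 0, 7, 1, 0, 0, 0, 0, 0, 0], pattern))  # close to 1
```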


Table IV. Correlations between factors

         Vf        Vc        Vr        Vcr
Vf       1.0
Vc      −0.495     1.0
Vr      −0.203     0.506     1.0
Vcr      0.252    −0.278    −0.002    1.0

5.2. EVALUATION FUNCTION

So far, we have explained the four statistics that represent the properties of collocations. In this section, we describe how to separate meaningful bigrams. In order to find significant bigrams, we could apply the statistics one by one to the set of bigrams retrieved from the corpus. However, when the properties were applied sequentially, many useful bigrams were filtered out, since each property has a different filtering behavior according to its mathematical characteristics.

Instead of separating meaningful bigrams with the four statistics independently, we use an evaluation function which can represent the correlations among the statistics. In this section, we analyze the correlations among the four measures and describe how to build an evaluation function for finding meaningful bigrams.

Table IV shows an example of the inter-correlations among the four statistics Vf, Vr, Vc, and Vcr. The values in the table were computed over the bigrams with the JP tag pattern among the bigrams of ' (ani, be not)'. The table indicates that the measures are not overly dependent on each other, but they do have redundant parts.

If a measure is highly correlated with other measures, it has a redundant part to be eliminated. Since each measure explains one aspect of the properties of collocations, it is not easy to judge which factors are more reliable in determining statistical significance. Hence, instead of using the measures directly as filters, we use a linear weighted sum of them for finding meaningful bigrams. We construct an evaluation function which reflects the correlations among the measures in the following way.

First of all, we standardize the four measures in order to calculate the degrees of relationship among them. The standardization adjusts the value range of each measure according to its variability. The degree of relationship between two measures is given by $C_{measure1,measure2} = \{correlation(measure1, measure2)\}^+$, where $x^+ = x$ if $x \ge 0$ and 0 otherwise. The evaluation function (9) reflects these degrees of relationship between the measures.

$$f(V_f, V_r, V_c, V_{cr}) = V_f + \phi_r V_r + \phi_c V_c + \phi_{cr} V_{cr} \qquad (9)$$


$$\begin{aligned}
\phi_r &= (1 - C_{V_r,V_f})\left(1 - \tfrac{a}{2}\, C_{V_r,V_c}\right)\left(1 - \tfrac{a}{2}\, C_{V_r,V_{cr}}\right) \\
\phi_c &= (1 - C_{V_c,V_f})\left(1 - \tfrac{a}{2}\, C_{V_c,V_r}\right)\left(1 - \tfrac{a}{2}\, C_{V_c,V_{cr}}\right) \\
\phi_{cr} &= (1 - C_{V_{cr},V_f})\left(1 - \tfrac{a}{2}\, C_{V_{cr},V_c}\right)\left(1 - \tfrac{a}{2}\, C_{V_{cr},V_r}\right)
\end{aligned}
\qquad \text{where } a = 2 - \frac{2}{\sqrt{3}} \qquad (10)$$

In (10), a is a compensation constant which makes the maximum value 1: with $a = 2 - 2/\sqrt{3}$, each factor $1 - \frac{a}{2}C$ has a minimum of $1/\sqrt{3}$, so the product of the two a-factors of each coefficient has a minimum of 1/3, attained when the corresponding correlations are all 1. In addition, each coefficient reaches its maximum value of 1 when $C_{V_f,V_r}$, $C_{V_f,V_c}$, and $C_{V_f,V_{cr}}$ are all 0. That is, the less the measures are correlated, the closer the coefficients $\phi_r$, $\phi_c$, and $\phi_{cr}$ approach 1.

As shown in (9) and (10), we treat $V_f$ as the main factor in the discovery of collocations. Each coefficient $\phi$ indicates how much the corresponding property influences the evaluation of meaningful bigrams. For example, in the formula for $\phi_r$, $\frac{a}{2}C_{V_r,V_c}$ is the correlation between the randomness and convergence factors, so $1 - \frac{a}{2}C_{V_r,V_c}$ means that the correlation with convergence is excluded from the randomness factor. Consequently, $\phi_r$ is the influence of pure randomness on the evaluation.

After computing the coefficients, we find meaningful bigrams using the evaluation function (9). We accept a bigram as meaningful if the value computed by the evaluation function is greater than or equal to 0.5. This threshold was chosen experimentally from the data set: the evaluation function gave good results with a threshold of 0.5, though for noun morphemes a higher threshold, e.g. 0.9, gave better results. Figure 2 shows the top 15 bigrams of ' (ani, not)' ranked according to the evaluation function, which agrees with our expectation. As Figure 2 shows, our system is efficient in discovering meaningful bigrams that occur only a few times in text. We also investigated the correlation coefficients with respect to this morpheme; in that case, the coefficients in the evaluation function are φr ≈ 0.432, φc ≈ 0.490, φcr ≈ 0.371. This means that, when evaluating whether a bigram is meaningful, the three other statistics together have 1.284 times as much influence as the frequency statistic does. The values of the coefficients differ according to the base morpheme.
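Putting the pieces together, the following is a sketch of (9) and (10) under one reading of the reconstruction above (correlations clipped at zero, each non-frequency statistic discounted by its correlations with Vf and, scaled by a/2, with the other two statistics). The helper names and input values are illustrative; only the correlations from Table IV and the 0.5 threshold come from the text.

```python
A = 2 - 2 / 3 ** 0.5  # compensation constant from (10)

def clip(c):
    """C_{m1,m2} = {correlation(m1, m2)}^+ : negatives are clipped to 0."""
    return max(c, 0.0)

def phi(c_f, c_other1, c_other2):
    """One coefficient of (10): discount a statistic by its correlation
    with V_f and, more mildly (scaled by a/2), with the other two."""
    return (1 - clip(c_f)) * (1 - A * clip(c_other1) / 2) * (1 - A * clip(c_other2) / 2)

def evaluate(vf, vr, vc, vcr, c):
    """Evaluation function (9); c[(i, j)] holds correlation(V_i, V_j)."""
    phi_r = phi(c["rf"], c["rc"], c["rcr"])
    phi_c = phi(c["cf"], c["rc"], c["ccr"])
    phi_cr = phi(c["crf"], c["ccr"], c["rcr"])
    return vf + phi_r * vr + phi_c * vc + phi_cr * vcr

# correlations from Table IV (bigrams of 'ani' with the JP tag pattern)
c = {"rf": -0.203, "rc": 0.506, "rcr": -0.002,
     "cf": -0.495, "ccr": -0.278, "crf": 0.252}
score = evaluate(0.3, 0.4, 0.5, 0.2, c)
print(score, score >= 0.5)  # accept the bigram if the score reaches 0.5
```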

5.3. EXTENDING TO N-GRAMS

In general, collocations consist not only of pairs of words but also of more than two words. In our system, meaningful bigrams are therefore extended to collocations composed of n words. In this section, we describe the extension algorithm; we use the longest α-covers as n-gram collocations.

Figure 2. Top 15 bigrams of ' ' (not) with the 'JP' tag pattern according to the evaluation function.

According to the definitions of Kjellmer (1995) and Cowie (1981), one word in a collocation can predict the rest of the words, or a very limited number of the remaining words, and there exists a high degree of cohesion among the words of a collocation. In order to evaluate the cohesion of words based on this definition, we define a fuzzy compatibility relation which satisfies symmetry and reflexivity. In this paper, we regard a cluster of meaningful bigrams defined by this relation as an n-gram collocation.

First, we define a fuzzy compatibility relation R on X in order to find cohesive clusters of the meaningful bigrams, where X is the set of meaningful bigrams. In general, a fuzzy binary relation R(X, X) is represented in matrix form using a membership function µA that maps elements of a set A into real numbers in [0,1], i.e. µA: A → [0,1]. Here, the set A is a subset of X such that <x, y> ∈ R for all x, y ∈ A.

Suppose the meaningful bigrams of $m_i$ are $x_1$, $x_2$, $x_3$. Then the fuzzy relation R is represented by a membership matrix like (11), where $x_{ij}$ denotes the value of the membership function $\mu_A(x_i, x_j)$.

$$R = \begin{pmatrix}
x_{11} & x_{12} & x_{13} \\
x_{21} & x_{22} & x_{23} \\
x_{31} & x_{32} & x_{33}
\end{pmatrix} \qquad (11)$$

In other words, the membership function computes the possibility that an element of X belongs to a set A. In our problem, a set A can be interpreted as a cohesive set, and we use the membership function to compute a degree of cohesion. For the membership function, we consider two metrics: (1) the Kullback-Leibler distance, which is called relative entropy, and (2) the Dice coefficient.

The relative entropy is used to measure the dissimilarity of two distributions; that is, it tells us how close two meaningful bigrams are. Given two probability mass functions p(x) and q(x), their relative entropy is:

$$D(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)} \qquad (12)$$


Then, the membership function can be defined by

$$\mu_A(x, y) = \begin{cases}
D(p(y|x) \,\|\, p(x|y)) = p(y|x)\,(\log p(x|y) - \log p(y|x)) & \text{if } p(y|x) \le p(x|y) \\
D(p(x|y) \,\|\, p(y|x)) = p(x|y)\,(\log p(y|x) - \log p(x|y)) & \text{if } p(x|y) \le p(y|x)
\end{cases} \qquad (13)$$

$$p(y|x) = \frac{|x \cap y|}{|x|}, \qquad p(x|y) = \frac{|x \cap y|}{|y|}$$

On the other hand, the Dice coefficient is a similarity metric measuring how exclusively x and y co-occur when they appear in the corpus, defined as

$$\mu_A(x, y) = \frac{2\,|x \cap y|}{|x| + |y|} \qquad (14)$$

In formulae (13) and (14), |x| and |y| are the numbers of concordances containing the bigrams x and y, respectively, and |x ∩ y| is the number of times the two meaningful bigrams x and y appear in the same concordance within a given distance.
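Both membership functions therefore reduce to three concordance counts; a sketch under the case analysis of (13) as reconstructed above (hypothetical helpers):

```python
import math

def mu_dice(nx, ny, nxy):
    """Dice membership (14): nx = |x|, ny = |y|, nxy = |x ∩ y|."""
    return 2 * nxy / (nx + ny) if nx + ny else 0.0

def mu_relative_entropy(nx, ny, nxy):
    """Relative-entropy membership (13): weight the log-odds gap by the
    smaller of the two conditional probabilities p(y|x) and p(x|y)."""
    if nxy == 0:
        return 0.0
    p_y_x, p_x_y = nxy / nx, nxy / ny
    if p_y_x <= p_x_y:
        return p_y_x * (math.log(p_x_y) - math.log(p_y_x))
    return p_x_y * (math.log(p_y_x) - math.log(p_x_y))

# a '(leul, mani)'-style case: a rare word with a high joint-to-marginal ratio
print(mu_dice(120, 4, 3))              # ~0.048: penalized despite cohesion
print(mu_relative_entropy(120, 4, 3))  # ~0.085: noticeably larger
```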

Both membership functions thus compute a degree of cohesion and are related to the lexical association of two meaningful bigrams. Next, the fuzzy compatibility classes of the meaningful bigrams are obtained; these classes correspond to the n-gram collocations extended from the meaningful bigrams. The outline of constructing the fuzzy compatibility classes is as follows. We first apply the fuzzy binary relation to the meaningful bigrams of a given morpheme, so that the fuzzy relation R(X, X) is represented by a membership matrix like (11). Second, we accept the compatibility classes defined in terms of a specified membership degree α as n-gram collocations.

If a relation is reflexive, symmetric, and transitive, it is called an equivalence relation or a similarity relation. If, on the other hand, a relation is only reflexive and symmetric, it is called a compatibility relation or a quasi-equivalence relation. In our case, since the relation R(X, X) we define does not satisfy transitivity, it is a fuzzy compatibility relation. This means that an element of X can belong to multiple compatibility classes.

We can partition the meaningful bigrams into a set of bigram clusters according to a degree of relatedness, which corresponds to a value of the membership function. Given A, the set of elements whose membership values are at least α is called the α-cover of A:

$$A_\alpha = \{x \in X \mid \mu_A(x) \ge \alpha\} \qquad (15)$$

An $A_\alpha$ on a fuzzy compatibility relation is called an α-compatibility class and can likewise be defined in terms of a specific membership degree α. The classes formed by the levels of the relation can be interpreted as groups of elements that are similar to each other. In addition, a family of compatibility classes is called an α-cover of X, and the α-cover partitions X.
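Because the relation is reflexive and symmetric but not transitive, the α-compatibility classes can be computed as the maximal cliques of the graph obtained by cutting the membership matrix at level α. A sketch using basic Bron-Kerbosch enumeration (all names hypothetical):

```python
def compatibility_classes(elements, mu, alpha):
    """Maximal alpha-compatibility classes of a reflexive, symmetric fuzzy
    relation: maximal cliques of the graph with an edge wherever the
    membership degree reaches alpha."""
    adj = {x: {y for y in elements if y != x and mu(x, y) >= alpha}
           for x in elements}
    classes = []

    def expand(r, p, x):  # basic Bron-Kerbosch without pivoting
        if not p and not x:
            classes.append(r)
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(elements), set())
    return classes

# toy membership degrees between four meaningful bigrams b1..b4
m = {("b1", "b2"): 0.4, ("b2", "b3"): 0.35, ("b1", "b3"): 0.3, ("b3", "b4"): 0.1}
mu = lambda a, b: m.get((a, b), m.get((b, a), 0.0))
print(compatibility_classes(["b1", "b2", "b3", "b4"], mu, 0.3))
# -> [{'b1', 'b2', 'b3'}, {'b4'}] (order may vary)
```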


Figure 3. Meaningful bigrams and all α-compatibility classes of ‘ ’.

To demonstrate the extension process, we take ' (sinda, wear)' as an example morpheme.4 As a consequence of the first step, Figure 3 shows the meaningful bigrams of ' '. In the next step, the membership degrees of the meaningful bigrams under the fuzzy compatibility relation are calculated using the Dice and relative entropy measures. Figure 3 also shows the membership degrees over X × X and the α-covers.

Note that the Dice measure cannot handle a bigram pair such as '( (leul, object case), (mani, much))'. In that case, although the common frequency of 3 is relatively high for the low-frequency word ' ' (much), the Dice coefficient gives a very low score. We achieved much better results with relative entropy than with the Dice coefficient. However, if α-covers are considered with respect to all α values in the level set instead of a specific value, the two membership functions produce almost identical results.

Figure 4 shows the longest n-gram collocations of ' '. Here, the order of the components of the n-gram collocations is decided by the concordances in which they appear; accordingly, various orders are possible. These figures illustrate the usefulness of this algorithm.

In this example, we selected α-covers at the α level of 0.20 for Dice and 0.30 for relative entropy. This α level can be changed according to the application using the collocations; in some applications, information about the meaningful bigrams alone may be enough.

Figure 4. The longest n-gram collocations of ' '.

6. Evaluation

Our test data consists of 8.5 million morphemes selected from the Yonsei corpus.5

First, we examined Xtract's results using the z-score (strength) and variance (spread) as shown in (16). For this purpose, we modified Xtract to deal with morpheme-based Korean text.

$$\text{strength} = \frac{freq_i - \bar{f}}{\sigma} \ge k_0, \qquad
\text{spread} = \frac{\sum_{j=1}^{10} (p_i^j - \bar{p}_i)^2}{10} \ge U_0, \qquad
p_i^j \ge \bar{p}_i + k_1 \sqrt{U_i} \qquad (16)$$

We briefly describe the process of obtaining meaningful collocations. Smadja (1993) assumed that the words of a collocation should co-occur in a relatively rigid way because of syntactic constraints. Therefore, bigrams that frequently occur at specific positions were viewed as meaningful for collocations, and among them, bigrams with low frequencies and flat peaks over positions were filtered out. In (16), $p_i^j$ denotes the frequency of bigram i at position j.

Table V shows the meaningful bigrams of ' (masi, drink)' retrieved by Xtract. As seen in the table, there is no pair containing functional morphemes or nominals. This is due to the position-based 'spread' measure, which over-filters bigrams; the 'spread' statistic is not suitable for a free word order language such as Korean. As a result, many useful bigrams were missed.

Furthermore, when compiling meaningful bigrams into n-gram collocations, Xtract yields too many long n-grams, as shown in Table VI, and most of the results were incorrect. The main reason for the many long word sequences is that, in the compiling process, Xtract simply keeps the words in the concordances whose probability of occupying a specific position is greater than a threshold. Therefore, if the number of concordances for a meaningful bigram is small, many erroneous long sequences are produced as the result of the n-gram extension.

Table V. Meaningful bigrams of ' ' (drink) extracted by Xtract

Table VI. n-gram collocations of ' ' (drink) retrieved by Xtract

Due to the structural differences between the two languages, Xtract has limitations in handling the free word order of Korean, although it showed good results in the discovery of English collocations.

Other experiments were conducted on 250 morphemes. They were randomlyselected from the morphemes that occurred at least 50 times in the test corpus.


About 75% of them were predicative morphemes. Our system found 773 meaningful pairs out of a total of 8,064 interrupted bigrams. In the second stage, 3,490 disjoint α-compatibility classes, which correspond to cohesive clusters of the meaningful bigrams, were generated. Finally, 698 n-gram collocations were retrieved by removing the fragments that are subsumed by longer classes. That is, approximately 8.06% of the bigrams turned out to be meaningful, and each morpheme had on average about 12 n-gram collocations.

To demonstrate the usefulness of the results, Tables VII and VIII show some examples of n-gram collocations whose headwords are nominals and predicates. When the head morpheme is one of { (gachi, value), (geomsaeg, retrieval), (gwajeong, process), (gwangye, relation), (saeob, business), (saneob, industry), (jeonhwa, telephone)}, a total of 114 n-gram collocations, shown in Table VII, are found to be NN-type terminology by our algorithm. Table VIII shows the list of 218 n-gram collocations with respect to the predicative morphemes { (masi, drink), (ggeul, draw), (ggeutna, be over), (nanu, divide), (nah, bear), (dalu, treat), (daeha, face), (byeonha, change), (sseu, use/write/wear/bitter), (olaga, go up), (ilg, read), (jeog, write/little), (hwaginha, confirm), (heureu, flow), (ta, ride/burn)}.

Generally, the coverage of collocation discovery is very difficult to measure. One possibility is to compare the extracted patterns with the entries listed in a manually compiled dictionary. However, since there is no existing database or dictionary of Korean collocations, nor a term bank, to compare against, the evaluation of accuracy and coverage must rely on indirect tasks or subjective judgment.

The inspection of sample output shows that the algorithm works well. However, a formal evaluation of its accuracy and coverage remains future work, and the retrieved collocations should also be evaluated through their use in NLP applications.

As another experiment, we applied our algorithm to English. We used a 10-word window (−5 ∼ 5), and (17) was used instead of (7) as a more accurate criterion:

$$V_r = \sum_{p=-5}^{5} \left( \frac{f_{ikp} - f_{ik}^{tot_p}}{f_{ik}^{tot_p}} \right)^2, \qquad
f_{ik}^{tot_p} = f_{ik+} \left( \frac{\sum_{j=1}^{n} f_{ijp}}{\sum_{q=-5}^{5} \sum_{j=1}^{n} f_{ijq}} \right) \qquad (17)$$
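In (17) the expectation at each position is proportional to the corpus-wide positional marginal rather than uniform. A sketch, assuming the window yields ten positional counts with offset 0 excluded (a hypothetical helper):

```python
def v_randomness_positional(dist, marginal):
    """V_r variant (17): compare each observed count f_{ikp} with an
    expectation proportional to the positional marginal of the corpus."""
    total, grand = sum(dist), sum(marginal)
    score = 0.0
    for f, m in zip(dist, marginal):
        expected = total * m / grand
        if expected > 0:
            score += ((f - expected) / expected) ** 2
    return score

marginal = [50, 80, 120, 300, 900, 900, 300, 120, 80, 50]  # corpus totals per offset
print(v_randomness_positional([0, 0, 0, 2, 9, 1, 0, 0, 0, 0], marginal))
```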

Table IX shows the results of Xtract and our algorithm: the meaningful bigrams of 'industry' retrieved from the sample corpus distributed with Xtract. Since the sample text is subject-specific and small, some incorrect bigrams are extracted as meaningful ones, and the list differs from the collocation entries for 'industry' in the BBI Combinatory Dictionary of English. However, as Table X shows, we cannot achieve broad coverage for NLP applications with the dictionary lists alone.

As demonstrated in Table IX, Xtract retrieved mostly bigrams with the NN (noun-noun) tag pattern, whereas bigrams with various grammatical patterns appear in the results of our algorithm. While the phrase 'in the industry' frequently occurs in the sample text, it was not reflected in the results of Xtract.

Table VII. NN-type collocations

Table VIII. Collocations

Table IX. A comparison of the meaningful bigrams of Xtract and our algorithm

Xtract                                            Our approach
Bigram                    POS rel.   Distance     Bigram                    POS rel.
Forest industry           NN         1            Any industry              ND
Lumber industry           NN         1            The industry              NA
New industry              NJ         1            Industry and              NC
Transportation industry   NN         1            Of industry               NI
Textile industry          NN         1            In industry               NI
U.S industry              NN         1            Our industry              NP
Industry product          NN         −2           New industry              NJ
Potato industry           NN         2            Industry is               NB
Forestry industry         NN         1            Transportation industry   NN
Chip industry             NN         1            Textile industry          NN
Industry only             NR         −3           Potato industry           NN
Industry not              NR         5, −2        Lumber industry           NN
Industry percent          NN         −5           Forestry industry         NN
Demand industry           NN         3            Forest industry           NN
Air industry              NN         2            Chip industry             NN
American industry         NN         2            Industry 's               NAUX
Industry well             NR         4, −4        Industry product          NN
Industry are              NB         −1           Trucking industry         NN
                                                  U.S industry              NN
                                                  Canadian industry         NN
                                                  American industry         NN
                                                  Industry said             NV
                                                  Industry allow            NV
                                                  Industry attract          NV

For another comparison, we applied the log likelihood ratio λ to collocation discovery (Dunning, 1993). This ratio is known to be appropriate for rare words; we do not describe the details of the computation of log λ here.

Table XI shows the twenty bigrams of 'industry' ranked highest according to the log likelihood ratio. Various tag patterns are found, but most of the resulting interrupted bigrams are also included in our results.

In sum, the comparisons with these approaches show that both high precision and broad coverage can be achieved with our algorithm. However, the evaluation function and the statistics for the properties of collocations open up many avenues for future work.


Table X. Collocations in the BBI Combinatory Dictionary of English for the word industry

1  To build up, develop (an) ∼
2  (An) ∼ springs up
3  A basic, key ∼
4  A cottage; defense; high-tech ∼
5  Heavy; light ∼
6  The film; machine-tool; meat-packing; pharmaceutical; steel; textile; tourist, travel; trucking (esp. AE) ∼
7  Smokestack ('old, obsolete') ∼
8  (Misc.) government often regulates ∼; a branch of ∼

Table XI. Bigrams of 'industry' with the highest scores according to Dunning's likelihood test

Consecutive bigram        −2 log λ     Interrupted bigram        −2 log λ
Forest industry            177.97      The industry               829.68
Lumber industry            126.24      Forest industry            177.97
The industry                96.99      In industry                143.72
Transportation industry     74.38      Of industry                136.86
Industry 's                 45.06      Lumber industry            126.24
Textile industry            40.45      Our industry               108.37
New industry                37.56      Canadian industry          100.54
Trucking industry           35.64      Industry is                 97.927
Industry is                 31.65      Transportation industry     81.88
Industry the                28.74      Industry and                65.02
Our industry                27.88      Industry to                 53.18
Canadian industry           25.84      American industry           52.93
U.S industry                24.06      Industry its                52.55
Shingle industry            21.96      Industry in                 51.935
Manufacturing industry      21.96      Industry has                48.04
An industry                 20.61      Industry's                  45.06
Forestry industry           20.41      To industry                 44.65
Airline industry            18.98      Trucking industry           43.34
Steel industry              18.07      Industry which              40.53
Industry has                17.93      Textile industry            40.45


7. Conclusion

We presented a method for extracting meaningful collocations from Korean text. We defined four statistics for the properties of Korean collocations, retrieved meaningful bigrams based on an evaluation function, and extended them into n-grams by producing α-compatibility classes.

Our approach is effective in dealing with flexible word order and covers various patterns of Korean collocations, such as case frames, multiple function words, selectional restrictions, semantic phrases, compound nouns, and idioms. The examples and tables show that both high precision and broad coverage can be achieved with our algorithm. However, the evaluation function and the threshold used in the evaluation need further study.

Notes

1 n-gram collocations here can either be consecutive morphemes or be separated by other words.
2 We used, as the tag set for input, 12 tags, i.e. N, J, V, P, D, E, T, O, C, A, S and X, which represent Noun, adJective, Verb, Postposition, aDverb, Ending, pre-ending (marking Tense), cOpula, Conjunction, Auxiliary verb, Suffix and others, respectively.
3 We used MORANY, the Korean morphological analyzer of Yonsei University, whose accuracy is approximately 96.0% (Yoon et al., 1999).
4 It means 'put on (wear or take on)' in English, but in Korean it is used only for shoes or socks.
5 The Yonsei balanced corpus consists of 40 million eojeols and was constructed for making a Korean dictionary.

References

Benson, M., E. Benson and R. Ilson. The BBI Combinatory Dictionary of English: A Guide to Word Combinations. Amsterdam and Philadelphia: John Benjamins, 1986.
Breidt, E. "Extraction of V-N Collocations from Text Corpora: A Feasibility Study for German". In Proceedings of the 1st ACL Workshop on Very Large Corpora, 1993.
Choueka, Y., T. Klein and E. Neuwitz. "Automatic Retrieval of Frequent Idiomatic and Collocational Expressions in a Large Corpus". Journal for Literary and Linguistic Computing, 4 (1983), 34–38.
Church, K. and P. Hanks. "Word Association Norms, Mutual Information, and Lexicography". Computational Linguistics, 16(1) (1989), 22–29.
Cowie, A.P. "The Treatment of Collocations and Idioms in Learner's Dictionaries". Applied Linguistics, 2(3) (1981), 223–235.
Cruse, D.A. Lexical Semantics. Cambridge: Cambridge University Press, 1986.
Dunning, T. "Accurate Methods for the Statistics of Surprise and Coincidence". Computational Linguistics, 19(1) (1993), 61–74.
Haruno, M., S. Ikehara and T. Yamazaki. "Learning Bilingual Collocations by Word-Level Sorting". In Proceedings of the 16th COLING, 1996, pp. 525–530.
Ikehara, S., S. Shirai and H. Uchino. "A Statistical Method for Extracting Uninterrupted and Interrupted Collocations". In Proceedings of the 16th COLING, 1996, pp. 574–579.
Kjellmer, G. A Mint of Phrases: Corpus Linguistics. Longman, 1995, pp. 111–127.
Klir, G.J. and B. Yuan. Fuzzy Sets and Fuzzy Logic: Theory and Applications. Prentice-Hall, 1995.
Lee, K.J., J.-H. Kim and G.C. Kim. "Extracting Collocations from Tagged Corpus in Korean". Proceedings of the 22nd Korean Information Science Society, 2 (1995), 623–626.
Lin, D. "Extracting Collocations from Text Corpora". In Proceedings of the First Workshop on Computational Terminology. Montreal, Canada, 1998.
Lin, D. "Automatic Identification of Non-compositional Phrases". In Proceedings of the 37th Annual Meeting of the ACL, 1999, pp. 317–324.
Manning, C.D. and H. Schütze. Foundations of Statistical Natural Language Processing. Cambridge, MA: The MIT Press, 1999.
Martin, W. and V.P. Sterkenburg. Lexicography: Principles and Practice, 1983.
Nagao, M. and S. Mori. "A New Method of n-Gram Statistics for Large Number of n and Automatic Extraction of Words and Phrases from Large Text Data of Japanese". In Proceedings of the 15th COLING, 1994, pp. 611–615.
Ross, S.M. Introduction to Probability and Statistics for Engineers and Scientists. John Wiley & Sons, 1987.
Shimohata, S., T. Sugio and J. Nagata. "Retrieving Collocations by Co-occurrences and Word Order Constraints". In Proceedings of the 35th Annual Meeting of the ACL, 1997, pp. 476–481.
Smadja, F. "Retrieving Collocations from Text: Xtract". Computational Linguistics, 19(1) (1993), 143–177.
Smadja, F., K. McKeown and V. Hatzivassiloglou. "Translating Collocations for Bilingual Lexicons: A Statistical Approach". Computational Linguistics, 22(1) (1996), 1–38.
Yoon, J., C. Lee, S. Kim and M. Song. "Morphological Analysis Based on Lexical Database Extracted from Corpus". In Proceedings of Hangul and Korean Information Processing, 1999.