Ste
fan
Eve
rt,
IMS
- U
ni
Stu
ttg
art
Bri
git
te K
ren
n,
ÖF
AI
Wie
n
IMS
Computational Approaches to Lexical
Collocations
Stefan Evert
IMS
Stuttgart
Brigitte Krenn
ÖFAI
Vienna
Ste
fan
Eve
rt,
IMS
- U
ni
Stu
ttg
art
Bri
git
te K
ren
n,
ÖF
AI
Wie
n
IMS
Collocations
• What kind of animal are they?
• What are they good for?
• How can we find them?
• How likely are we to find them?
Ste
fan
Eve
rt,
IMS
- U
ni
Stu
ttg
art
Bri
git
te K
ren
n,
ÖF
AI
Wie
n
IMS
Schedule
• Monday: Introduction to Collocations
• Tuesday: Extraction of Coocurrence Data
• Wednesday: Association Measures (AMs)
• Thursday: Evaluation of AMs
• Friday: Significance of Result Differences
Ste
fan
Eve
rt,
IMS
- U
ni
Stu
ttg
art
Bri
git
te K
ren
n,
ÖF
AI
Wie
n
IMS
CollocationsTerminology & Definitions
• Firth's Notion of Collocation
``Meaning by collocation is an abstraction at the syntagmatic level and is not directly concerned with the conceptual or idea approach to the meaning of words.''
``One of the meanings of night is its collocability with dark, and of dark, of course, its collocation with night.'‘ (Firth 1957)
Ste
fan
Eve
rt,
IMS
- U
ni
Stu
ttg
art
Bri
git
te K
ren
n,
ÖF
AI
Wie
n
IMS
CollocationsTerminology & Definitions
• Choueka's Notion of Collocation
"A collocation is defined as a sequence of two or more consecutive words, that has characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning cannot be derived directly from the meaning or connotation of its components." (Choueka 1988)
Ste
fan
Eve
rt,
IMS
- U
ni
Stu
ttg
art
Bri
git
te K
ren
n,
ÖF
AI
Wie
n
IMS
CollocationsTerminology & Definitions
Summing up, we have 2 different views1. Firth: collocations as lexical proximities in text2. Choueka: collocations as syntactic and semantic units, semantic
irregularity
Ste
fan
Eve
rt,
IMS
- U
ni
Stu
ttg
art
Bri
git
te K
ren
n,
ÖF
AI
Wie
n
IMS
Terminology
• Idiomspreferably used in the English literature, e.g. Bar-Hillel:55, Hockett:58, Katz;Postal:63, Healey:68,Makkai:72.
Ste
fan
Eve
rt,
IMS
- U
ni
Stu
ttg
art
Bri
git
te K
ren
n,
ÖF
AI
Wie
n
IMS
Terminology
• Phraseological Units, (Ge.: Phraseologismen) a widely used generic term in the German literature, e.g. BurgerEA:82, Fleischer:82.
Ste
fan
Eve
rt,
IMS
- U
ni
Stu
ttg
art
Bri
git
te K
ren
n,
ÖF
AI
Wie
n
IMS
Terminology
• Light-verb Constructions, Support-verb Constructionsrefer to very particular phenomena, cross-categorisation with idioms
Ste
fan
Eve
rt,
IMS
- U
ni
Stu
ttg
art
Bri
git
te K
ren
n,
ÖF
AI
Wie
n
IMS
Terminology
Terms stemming from Computational Linguistics, e.g.
• multi-word lexemes, e.g. Tschichold:97, BreidtEA:96.
• multi-word expressions, e.g. Segond;Tapanainen:95
• non-compositional compounds, e.g. Melamed:97
Ste
fan
Eve
rt,
IMS
- U
ni
Stu
ttg
art
Bri
git
te K
ren
n,
ÖF
AI
Wie
n
IMS
Terminology
• Collocation e.g, red herring, kick the bucket, dark-night, dog-bark
• Base, Collocateelements if a collocation,the base selects its collocates,it is not always clear to define what the base is,e.g. red-herring, kick-bucket
Ste
fan
Eve
rt,
IMS
- U
ni
Stu
ttg
art
Bri
git
te K
ren
n,
ÖF
AI
Wie
n
IMS
Terminology
• Collocation Phrasegrammatical unit which is constituted by a base and its collocates or by two or more collocates, e.g. a red herring, kick the bucket, unexpectedly kick the bucket
Ste
fan
Eve
rt,
IMS
- U
ni
Stu
ttg
art
Bri
git
te K
ren
n,
ÖF
AI
Wie
n
IMS
Summing up
• Terminology and definitions are influenced by • different linguistic traditions• computational linguistic
applications
Ste
fan
Eve
rt,
IMS
- U
ni
Stu
ttg
art
Bri
git
te K
ren
n,
ÖF
AI
Wie
n
IMS
Summing up
• The phenomena covered are manifold:• lexical proximities in texts• syntactic and semantic units• semantic irregularity• syntactic rigidity
Ste
fan
Eve
rt,
IMS
- U
ni
Stu
ttg
art
Bri
git
te K
ren
n,
ÖF
AI
Wie
n
IMS
Observations: Generativity
• Collocations range from completely fixed to syntactically flexible constructions.
• Syntactic restrictions usually coincide with semantic restrictions and thus are indicators for the degree of lexicalization of a particular word combination.
Ste
fan
Eve
rt,
IMS
- U
ni
Stu
ttg
art
Bri
git
te K
ren
n,
ÖF
AI
Wie
n
IMS
Observations: Generativity
• Particular word combinations are associated with specific restrictions that cannot be inferred from standard rules of grammar and thus need to be stored together with the collocation.
Ste
fan
Eve
rt,
IMS
- U
ni
Stu
ttg
art
Bri
git
te K
ren
n,
ÖF
AI
Wie
n
IMS
Observations: Recurrence
• Within corpora, the proportion of collocations is larger among highly recurrent word combination than among infrequent ones.
• Recurrence is an effect of lexicalization.
Ste
fan
Eve
rt,
IMS
- U
ni
Stu
ttg
art
Bri
git
te K
ren
n,
ÖF
AI
Wie
n
IMS
Observations: Idiomaticity
• Semantic opacity is not sufficient for the definition of collocations as there exists a variety of conventionalized word combinations that range from• fully compositional ones like Hut
aufsetzen (`put on a hat'), Jacke anziehen (`put on a jacket')
to • semantically opaque ones like ins Gras
beissen (`bite into the grass' literal meaning, `die' idiomatic meaning).
Ste
fan
Eve
rt,
IMS
- U
ni
Stu
ttg
art
Bri
git
te K
ren
n,
ÖF
AI
Wie
n
IMS
Observations: Words, Multi-words or Phrases
• Collocations can be • word level phenomena • phrase level phenomena
(collocation phrase)
• Collocation phrases consist of the lexically determined words (collocates) only or contain additional lexically underspecified material.
Ste
fan
Eve
rt,
IMS
- U
ni
Stu
ttg
art
Bri
git
te K
ren
n,
ÖF
AI
Wie
n
IMS
Word-level Collocations
• Adjective- and Adverb-Like Collocations• nichts desto trotz (`nonetheless') adverb • fix und fertig (`exhausted') adjective
• Preposition-Like Collocations• im Lauf(e), im Zuge (`during')• an Hand (`with the help of')
Ste
fan
Eve
rt,
IMS
- U
ni
Stu
ttg
art
Bri
git
te K
ren
n,
ÖF
AI
Wie
n
IMS
Word-level Collocations
• Noun-Like Collocations• Rotes Kreuz (Red Cross)• Wiener Sängerknaben (Vienna choir
boys)• Hinz und Kunz (`every Tom, Dick and
Harry')
• Sequences where the nouns are duplicated• Schulter an Schulter (shoulder to
shoulder), • Kopf an Kopf (neck and neck)
Ste
fan
Eve
rt,
IMS
- U
ni
Stu
ttg
art
Bri
git
te K
ren
n,
ÖF
AI
Wie
n
IMS
Phrase-level Collocations
• Modal constructions• sich (nicht) lumpen lassen (`to splash
out')
• Verb-object combinations• übers Ohr hauen (`take somebody for a
ride')• unter die Lupe nehmen (`take a close
look at')• zum Vorschein bringen (`bring something
to the light')• des Weges kommen (`to approach')• Lügen strafen (`prove somebody a liar')
Ste
fan
Eve
rt,
IMS
- U
ni
Stu
ttg
art
Bri
git
te K
ren
n,
ÖF
AI
Wie
n
IMS
Phrase-level Collocations
• Copula constructions• guten Glaubens sein (`be in good faith')• auf Draht sein (`be on the ball')
• Proverbs• Morgenstund hat Gold im Mund
(morning hour has gold in the mouth, ‘the early bid catches the worm’)
• wissen, wo der Barthel den Most holt (know where the Barthel the cider fetches, `know every trick in the book')
Ste
fan
Eve
rt,
IMS
- U
ni
Stu
ttg
art
Bri
git
te K
ren
n,
ÖF
AI
Wie
n
IMS
Summing up
• Structural dependency • the collocates of a collocation are
syntactic dependents, thus knowledge of syntactic structure is a precondition for accurate collocation identification.
• Syntactic context • may help to discriminate literal and
collocational readings, see for instance im Lauf, im Zug where a genitive to the right is a strong indicator for collocational reading.
Ste
fan
Eve
rt,
IMS
- U
ni
Stu
ttg
art
Bri
git
te K
ren
n,
ÖF
AI
Wie
n
IMS
Summing up
• Markedness • morphologically or syntactically
marked constructions like seemingly incomplete syntactic structure or archaic e-suffix are suitable indicators for collocations, see im Laufe, im Zuge for e-suffix and zu Recht, an Hand for incomplete syntactic structures.
Ste
fan
Eve
rt,
IMS
- U
ni
Stu
ttg
art
Bri
git
te K
ren
n,
ÖF
AI
Wie
n
IMS
Summing up
• Single-word versus multi-word units • single-word occurrences of word
combinations indicate word-level collocations, see for instance zu Recht, zurecht.
• Syntactic rigidity• is an important indicator for
collocations see for instance Hinz und Kunz, an und für sich, fix und fertig, Kopf an Kopf.
Ste
fan
Eve
rt,
IMS
- U
ni
Stu
ttg
art
Bri
git
te K
ren
n,
ÖF
AI
Wie
n
IMS
Identification of Collocations: Operationalizable Criteria
• over proportionally high recurrence of collocational word combinations compared to noncollocational word combinations in corpora
• lexical determination of the collocates of a collocation
• collocations constitute grammatical units
• grammatical restrictions in the collocation phrases
Ste
fan
Eve
rt,
IMS
- U
ni
Stu
ttg
art
Bri
git
te K
ren
n,
ÖF
AI
Wie
n
IMS
Positional N-Grams
• numerical span• typically AMs work on bi-grams
, i.e., all <wi, wj> pairs within a certain span are considered
• <wi,wi> pairs are not considered!
• e.g.: wi-3 ... wi-1 wi wi+1 ... wi+3
(span size is 6)
Ste
fan
Eve
rt,
IMS
- U
ni
Stu
ttg
art
Bri
git
te K
ren
n,
ÖF
AI
Wie
n
IMS
Positional N-Grams
• grammatical span• e.g. we assume that collocational
relations do not exceed sentence boundaries
i.e., span size = sentence size
Ste
fan
Eve
rt,
IMS
- U
ni
Stu
ttg
art
Bri
git
te K
ren
n,
ÖF
AI
Wie
n
IMS
Positional N-Grams:Problems
• huge amounts of cooccurrence data need to be managed this requires special algorithms
(e.g. Yamamoto,Church 2001)
Ste
fan
Eve
rt,
IMS
- U
ni
Stu
ttg
art
Bri
git
te K
ren
n,
ÖF
AI
Wie
n
IMS
Positional N-Grams:Problems
• definition of span size is crucial If the span size is kept small, it is
unlikely to properly cover nonadjacent collocates of structurally flexible collocations.
Enlarging the span size leads to an increase of candidate collocations including an increase of noisy data which need to be discarded in a further processing step.
Ste
fan
Eve
rt,
IMS
- U
ni
Stu
ttg
art
Bri
git
te K
ren
n,
ÖF
AI
Wie
n
IMS
Positional N-Grams:Problems
High amount of noise in the collocation data due to
• inappropriate span size
Ste
fan
Eve
rt,
IMS
- U
ni
Stu
ttg
art
Bri
git
te K
ren
n,
ÖF
AI
Wie
n
IMS
Positional N-Grams:Problems
High amount of noise in the collocation data due to
• over-proportional frequency of function words within texts use stop word lists
Ste
fan
Eve
rt,
IMS
- U
ni
Stu
ttg
art
Bri
git
te K
ren
n,
ÖF
AI
Wie
n
IMS
Positional N-Grams:Problems
High amount of noise in the collocation data due to
• insensitivity to punctuation use a sentence as the largest unit within
which the collocates of a collocation may occur
Ste
fan
Eve
rt,
IMS
- U
ni
Stu
ttg
art
Bri
git
te K
ren
n,
ÖF
AI
Wie
n
IMS
Positional N-Grams:Problems
High amount of noise in the collocation data due to
• insensitivity to parts-of-speech knowing parts-of-speech allows a large
number of syntactically invalid n-grams to be excluded beforehand
Ste
fan
Eve
rt,
IMS
- U
ni
Stu
ttg
art
Bri
git
te K
ren
n,
ÖF
AI
Wie
n
IMS
Positional N-Grams:Problems
High amount of noise in the collocation data due to
• insensitivity to syntactic structure further improvement of the
appropriateness of the collocation candidates selected is achieved by the availability of structural and/or dependency information
Ste
fan
Eve
rt,
IMS
- U
ni
Stu
ttg
art
Bri
git
te K
ren
n,
ÖF
AI
Wie
n
IMS
Proposal
• (Partial) replacement of • positional n-grams
by • relational n-grams
• What does it imply?
• In which cases do we want/need it?
Ste
fan
Eve
rt,
IMS
- U
ni
Stu
ttg
art
Bri
git
te K
ren
n,
ÖF
AI
Wie
n
IMS
From Positional to Relational N-Grams
• positional with part-of-speech, eg. (Evert,Kermes 2003)• <Adj,N> pairs within certain span
Ste
fan
Eve
rt,
IMS
- U
ni
Stu
ttg
art
Bri
git
te K
ren
n,
ÖF
AI
Wie
n
IMS
From Positional to Relational N-Grams
• positional with part-of-speech and lexical knowledge, e.g. Breidt 93• sentence final <noun,past
participle> pairs • past participles must be part of a
list of 16 German support verbs e.g. kommen, bringen, stehen, stellen, ...
Ste
fan
Eve
rt,
IMS
- U
ni
Stu
ttg
art
Bri
git
te K
ren
n,
ÖF
AI
Wie
n
IMS
From Positional to Relational N-Grams
• partially relational, e.g. Krenn 2000• extraction of PN-combinations from
PPs• extraction of main verbs• combination of PN-pairs and verbs
co-occurring in a sentence
Ste
fan
Eve
rt,
IMS
- U
ni
Stu
ttg
art
Bri
git
te K
ren
n,
ÖF
AI
Wie
n
IMS
From Positional to Relational N-Grams
• Result• a theoretical maximum of PNV
combinations, i.e., • verbs are duplicated in sentences
that contain more than one PP,• PPs are duplicated in sentences
where more than one main verb is found.
This has effects on counting!
Ste
fan
Eve
rt,
IMS
- U
ni
Stu
ttg
art
Bri
git
te K
ren
n,
ÖF
AI
Wie
n
IMS
Relational N-Grams
Assumption
The collocates of a collocation are syntactically related!
Ste
fan
Eve
rt,
IMS
- U
ni
Stu
ttg
art
Bri
git
te K
ren
n,
ÖF
AI
Wie
n
IMS
Relational N-Grams
Requirements
• part-of-speech tagging
• partial parsing (phrase chunking)
• full parsingEvert,Kermes 2003
Ste
fan
Eve
rt,
IMS
- U
ni
Stu
ttg
art
Bri
git
te K
ren
n,
ÖF
AI
Wie
n
IMS
Relational N-Grams
Problems
• full parsing of free text is error prone
• there are cases where collocation extraction from fully parsed text is less accurate
Ste
fan
Eve
rt,
IMS
- U
ni
Stu
ttg
art
Bri
git
te K
ren
n,
ÖF
AI
Wie
n
IMS
Summary
• Use application dependent notion of collocation as opposed to Firth/Choueka
• Extract collocations from relational data (relational n-grams)
• Interested in collocation phrases and their collocates
• Consider grammatically homogenous data (e.g. Adj-N, PP-V)
• Recurrence/cooccurrence frequency is main criterion for collocation extraction (statistical approaches)
Top Related