Albert Gatt Corpora and Statistical Methods Lecture 3
Slide 2
Morphology and productivity Part 2
Slide 3
Morphology
Many languages have multiple word forms related to a single base form (root form).
Lexeme = base form from which related forms are produced.
Three classes of productive morphological processes:
  Inflection
  Derivation
  Compounding
Slide 4
Inflection
Addition of prefixes and suffixes that:
  leave core meaning intact
  leave grammatical category intact
  add/alter some features of meaning (especially relevant to syntax)
Examples:
  -s to form plural nouns
  -ed to form past tense
Slide 5
Derivation
Addition of prefixes and suffixes which:
  result in a more radical change in meaning
  often result in change of syntactic category
Examples:
  English -ly (ADJ → ADV): wide-ly
  English -en (ADJ → V): weak-en
  English -able (V → ADJ): accept-able
Slide 6
Compounding
Combination of two independent words into a new word.
NB:
  new word can be orthographically one or several words
  can cause recognisable changes in phonology
  new compound has a new meaning (not necessarily 100% compositional)
Example: English N-N compounds
  disk drive, mad cow disease, credit crunch
Slide 7
Regular vs. irregular
Inflectional and derivational rules often have exceptions.
E.g. past tense in English:
  regular: -ed suffix
  irregular: bring → brought, ring → rang, etc.
Sub-regularities are observable:
  -ing/-ink verbs in English seem to display a particular pattern: rang, sank, etc.
Slide 8
Productive vs. non-productive
Some morphological processes or categories seem to have greater potential to form new words than others:
  e.g. English -able, -ness
  compare to English -th: warmth, strength (much less productive)
Slide 9
Classical approaches to productivity
Jackendoff (1975): unproductive rules are called redundancy rules:
  e.g. warmth is listed in the English speaker's (mental) lexicon as a single word
  the redundancy rule captures the knowledge that it can be split into warm+th
  the rule as such isn't really active, i.e. forms are not produced online
Contrast with productive rules:
  e.g. many adjectives with -able are produced online, not stored
Slide 10
Features of classical approaches
1. They rely on a binary distinction (un/productive).
2. Productive rules are typically regular, and sub-regularities are not considered much (Dressler 2003).
3. Most of these approaches do not look at corpus data.
Related psycholinguistic model: Pinker's (1997) dual-route model of morphological processing.
Slide 11
Corpus-based approaches
View productivity as a gradable phenomenon:
  some forms become ingrained through frequent usage
  a category can still be productive to some extent
  productivity is estimated in terms of a category's potential to produce new forms
Can account for sub-regularities:
  productivity of a category is due to many factors, including analogy to existing words
Slide 12
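The continuum
Productive processes tend to:
  be compositional
  result in a lot of new words
[Diagram: a continuum running from "productive morphological process" to "lexicalised word", with ADJ+ness → Noun towards the productive end and ADJ+th → Noun towards the lexicalised end.]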
Slide 13
Practical application (I)
No finite lexicon can contain all words of a language at a certain time.
  productive processes can be exploited to parse new/unseen lexical items
  this is helped by the compositionality of productive processes
Can also help to distinguish creative neologism from systematic rule application. Compare:
  well-defined, well-intentioned, well-specified: lots of adjectives with a well- prefix
  YouTube: a one-off
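
The idea of using productive, compositional processes to analyse unseen words can be sketched roughly as follows. This is only a minimal illustration, not a real morphological parser: the toy lexicon, the rule table and the function name `analyse` are all hypothetical, and the sketch ignores spelling changes (so it would not relate happiness to happy).

```python
# Hypothetical toy lexicon: base form -> syntactic category.
LEXICON = {"accept": "V", "read": "V", "weak": "ADJ", "good": "ADJ"}

# Hypothetical suffix rules: (suffix, category of base, category of result).
SUFFIX_RULES = [
    ("able", "V", "ADJ"),
    ("ness", "ADJ", "N"),
    ("en", "ADJ", "V"),
]


def analyse(word):
    """Return (base, suffix, category) for a word, or None if no analysis is found."""
    # Listed words are looked up directly.
    if word in LEXICON:
        return (word, None, LEXICON[word])
    # Unlisted words are analysed compositionally: strip a known suffix
    # and check that the remainder is a listed base of the right category.
    for suffix, base_cat, result_cat in SUFFIX_RULES:
        if word.endswith(suffix):
            base = word[: -len(suffix)]
            if LEXICON.get(base) == base_cat:
                return (base, suffix, result_cat)
    return None
```

Under these assumptions, `analyse("readable")` yields an analysis as read + -able even though readable is not in the lexicon, whereas a one-off such as `analyse("youtube")` receives no analysis.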
Slide 14
Practical application (II)
Polarity/sentiment analysis: the aim is to identify the overall positive/negative slant of a text concerning a topic.
Moilanen and Pulman (2008) obtain improvements by considering adjectives formed with well- vs. -infested etc.
Slide 15
Theoretical implications
Raises interesting questions about the relationship between corpus-based measures and psycholinguistic data.
Likelihood of a morphological process being applied depends on style, genre, speech community.
Can give an indication of language change over time (some processes are fossilised, others become more productive).
Slide 16
Statistical measures of productivity (Baayen 2006)
Slide 17
What we need
A measure of productivity of a process/category C should reflect:
  our intuitions about how frequently we encounter C
  how easily native speakers can form new words using C
Is it easier to produce a noun with -th (like warmth) or one with -ness (like goodness)?
Slide 18
Realised productivity (RP)
Given a morphological category C, RP gives a rough indication of the past utility of C in forming new words.
Measured as the number of distinct types formed using C in a corpus of size N.
E.g. regular past tense -ed displays many more types than sub-regular forms such as keep-kept/sleep-slept.
Slide 19
Realised productivity cont'd
Why types, not tokens?
  Productive processes have lots of types which are hapaxes, or are very infrequent.
  Words formed from irregular processes tend to be very frequent.
Some limitations:
  a high RP for a category does not imply that it will keep forming lots of new words
  RP is heavily dependent on corpus size
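
A minimal sketch of counting RP, and of why types rather than tokens are counted. The toy corpus is invented for illustration, and category membership is approximated by a crude suffix test (the helper `ends_with` is an assumption of this sketch; a real study would need proper morphological analysis, since e.g. "with" also ends in th).

```python
def ends_with(suffix):
    """Crude, assumed membership test: a word belongs to the category if it ends in the suffix."""
    return lambda word: word.endswith(suffix) and len(word) > len(suffix)


def realised_productivity(tokens, in_category):
    """RP: number of distinct types in the corpus that belong to the category."""
    return len({tok for tok in tokens if in_category(tok)})


def token_count(tokens, in_category):
    """Number of tokens (running words) in the corpus that belong to the category."""
    return sum(1 for tok in tokens if in_category(tok))


# Invented toy corpus: a few frequent -th nouns, several rare -ness nouns.
toy_corpus = ["warmth"] * 5 + ["strength"] * 4 + ["goodness", "kindness", "sadness", "happiness"]

rp_ness = realised_productivity(toy_corpus, ends_with("ness"))  # 4 types
rp_th = realised_productivity(toy_corpus, ends_with("th"))      # 2 types
tokens_ness = token_count(toy_corpus, ends_with("ness"))        # 4 tokens
tokens_th = token_count(toy_corpus, ends_with("th"))            # 9 tokens
```

In this toy corpus -th has more tokens than -ness, but fewer types, so a token count would rank the two categories the wrong way round.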
Slide 20
Expanding productivity (P*)
P* gives a rough indication of the rate of expansion of C.
Focuses on the number of hapaxes produced using C in the corpus.
Aka hapax-conditioned productivity.
NB: P* is still heavily dependent on corpus size!
P* = (no. of types formed using C which occur only once in N tokens) / (no. of hapaxes in the corpus)
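
The ratio for P* can be sketched as below. The function name, the toy corpus and the suffix-based membership test are assumptions made for illustration only.

```python
from collections import Counter


def expanding_productivity(tokens, in_category):
    """P*: hapaxes belonging to the category, divided by all hapaxes in the corpus."""
    freq = Counter(tokens)
    hapaxes = [word for word, n in freq.items() if n == 1]
    if not hapaxes:
        return 0.0  # convention chosen for this sketch: no hapaxes, no expansion
    return sum(1 for word in hapaxes if in_category(word)) / len(hapaxes)


# Invented toy corpus. Its hapaxes are: goodness, kindness, cat, dog.
toy_corpus = ["the", "the", "warmth", "warmth", "goodness", "kindness", "cat", "dog"]

# Crude, assumed membership tests based on the word ending.
p_star_ness = expanding_productivity(toy_corpus, lambda w: w.endswith("ness"))
p_star_th = expanding_productivity(toy_corpus, lambda w: w.endswith("th"))
```

Here two of the four hapaxes end in -ness, so `p_star_ness` is 0.5, while warmth occurs twice and contributes nothing, so `p_star_th` is 0.0.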
Slide 21
Potential productivity (P)
Gives an indication of how likely a category C is to form new words in future,
i.e. whether C may already be saturated.
Aka category-conditioned productivity.
P = (no. of types in C which occur only once in corpus of N tokens) / (no. of tokens of category C)
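
A minimal sketch of the ratio for P, assuming a tokenised corpus and a crude suffix-based membership test (both the function name and the toy data are invented for illustration).

```python
from collections import Counter


def potential_productivity(tokens, in_category):
    """P: hapaxes of the category, divided by the number of tokens of the category."""
    freq = Counter(tok for tok in tokens if in_category(tok))
    n_tokens = sum(freq.values())
    if n_tokens == 0:
        return 0.0  # convention chosen for this sketch: category not attested
    n_hapaxes = sum(1 for n in freq.values() if n == 1)
    return n_hapaxes / n_tokens


# Invented toy corpus.
toy_corpus = ["warmth"] * 3 + ["strength"] * 2 + ["goodness", "goodness", "kindness", "sadness"]

# -ness: 2 hapaxes (kindness, sadness) out of 4 tokens.
p_ness = potential_productivity(toy_corpus, lambda w: w.endswith("ness"))
# -th: no hapaxes out of 5 tokens.
p_th = potential_productivity(toy_corpus, lambda w: w.endswith("th"))
```

Note that the denominator is the number of tokens of the category itself, not the size of the whole corpus.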
Slide 22
Some more on P
Unlike RP and P*, P is not very sensitive to corpus size as such.
However, it is very sensitive to the frequency of the category:
  e.g. if C is realised only once in a corpus of size N, then P = 1!
Recent empirical work has shown that RP and P* correlate very strongly, but both exhibit a weak correlation with P (Vegnaduzzo 2009):
  pattern non-X has high RP and P*, but low P
  pattern X-ish has low RP and P*, but high P
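
How the three measures can pull apart is easiest to see when they are computed side by side from per-type frequencies. The sketch below uses two invented categories, A and B; the frequency lists and the corpus-wide hapax count of 100 are made up for illustration and are not taken from Vegnaduzzo (2009).

```python
def measures(type_frequencies, corpus_hapaxes):
    """Compute RP, P* and P for one category.

    type_frequencies: one token count per type of the category (must be non-empty).
    corpus_hapaxes: number of hapaxes in the whole corpus (must be positive).
    """
    n_types = len(type_frequencies)
    n_tokens = sum(type_frequencies)
    n_hapaxes = sum(1 for n in type_frequencies if n == 1)
    return {
        "RP": n_types,
        "P*": n_hapaxes / corpus_hapaxes,
        "P": n_hapaxes / n_tokens,
    }


# Invented data: A is a large, frequent category; B is small and rare.
category_a = [50, 40, 30, 20, 10, 5, 1, 1, 1, 1]  # 10 types, 159 tokens, 4 hapaxes
category_b = [2, 1, 1]                            # 3 types, 4 tokens, 2 hapaxes

a = measures(category_a, corpus_hapaxes=100)
b = measures(category_b, corpus_hapaxes=100)

# Extreme case: a category realised exactly once in the corpus.
single = measures([1], corpus_hapaxes=100)
```

With these invented counts, A scores higher than B on both RP and P*, but lower on P; the category realised only once gets P = 1, which shows how sensitive P is to the frequency of the category.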
Slide 23
In graphics (after Baayen 2006)
[Figure: growth curve for a specific category. x-axis: corpus size; y-axis: no. of types. The slope of the tangent represents the growth rate.]
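
A growth curve of this kind can be approximated by reading the corpus token by token and recording how many distinct types of the category have been seen so far. This is a rough sketch: the function name, the `step` parameter and the toy corpus are assumptions, and category membership is again approximated by a crude suffix test.

```python
def growth_curve(tokens, in_category, step):
    """Return (corpus size, no. of category types seen so far), sampled every `step` tokens."""
    seen = set()
    curve = []
    for i, tok in enumerate(tokens, start=1):
        if in_category(tok):
            seen.add(tok)
        if i % step == 0:
            curve.append((i, len(seen)))
    return curve


# Invented toy corpus.
toy_corpus = ["goodness", "the", "kindness", "goodness", "a", "sadness"]
curve = growth_curve(toy_corpus, lambda w: w.endswith("ness"), step=2)

# Finite-difference approximation of the slope between successive sample points.
slopes = [
    (types2 - types1) / (size2 - size1)
    for (size1, types1), (size2, types2) in zip(curve, curve[1:])
]
```

On a realistic corpus the slopes would be expected to flatten as the category approaches saturation; the toy corpus here is far too small to show that.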
Slide 24
P vs. RP and P*
A category can have low RP and P*, but high P.
  High P corresponds to the ease with which new words can be formed using the category.
Even though a category has high RP, it may have reached saturation, and so have low P.
Slide 25
The psycholinguistic connection
1. Rule vs. direct access:
  To produce a word (e.g. illegal), you can either retrieve it directly from storage, or apply the rule on the fly.
  Evidence suggests that the frequency of the base form vs. the derived form is related to which of the two alternatives applies.
Slide 26
The psycholinguistic connection
2. Complexity-based affix ordering:
  Corpus research: more productive affixes follow less productive ones in word formation.
  It seems that more highly predictable (low productivity) affixes are processed first.
  High productivity may also imply less likelihood of entering into further derivational processes.
Slide 27
Works cited
S. Vegnaduzzo (2009). Morphological productivity rankings of complex adjectives. Proc. NAACL-HLT Workshop on Computational Approaches to Linguistic Creativity.
K. Moilanen and S. Pulman (2008). The good, the bad and the unknown: Morphosyllabic sentiment tagging of unseen words. Proc. ACL 2008.
Baayen 2006: linked from web page.