School of Computing
FACULTY OF ENGINEERING
Tokenization and Morphology
Eric Atwell, Language Research Group
(with thanks to Katja Markert, Marti Hearst, and other contributors)
Reminder
The main areas of linguistics
Rationalism: language models based on expert introspection
Empiricism: models via machine-learning from a corpus
Corpus: text selected by language, genre, domain, …
Brown, LOB, BNC, Penn Treebank, MapTask, CCA, …
Corpus Annotation: text headers, PoS, parses, …
Corpus size is no. of words – depends on tokenisation
We can count word tokens, word types, type-token distribution
Lexeme/lemma is the “root form”, vs. its inflections (be vs. am/is/was…)
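As a rough illustration of the token/type distinction, the sketch below counts both with a naive whitespace split. The function name `token_type_counts` and the sample sentence are made up for this example; real corpus counts depend on the tokenisation scheme chosen.

```python
from collections import Counter

def token_type_counts(text):
    """Count word tokens and word types using a naive whitespace split."""
    tokens = text.lower().split()   # naive: punctuation stays attached to words
    counts = Counter(tokens)        # maps each type to its frequency
    return len(tokens), len(counts), counts

n_tokens, n_types, dist = token_type_counts("the cat sat on the mat")
print(n_tokens, "tokens,", n_types, "types")
print(dist.most_common(3))          # the type-token distribution, most frequent first
```

Here "the" occurs twice, so the sentence has more tokens than types.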
What’s a word?
How many words do you find in the following short text?
What is the biggest/smallest plausible answer to this question?
What problems do you encounter?
It’s a shame that our data-base is not up-to-date. It is a shame that um, data base A costs $2300.50 and that database B costs $5000. All databases cost far too much.
Time: 3 minutes
Counting words: tokenization
Tokenisation is a processing step where the input text is automatically divided into units called tokens, where each token is either a word, a number, or a punctuation mark…
So, word count can ignore numbers, punctuation marks (?)
Word: Continuous alphanumeric characters delineated by whitespace.
Whitespace: space, tab, newline.
BUT dividing at spaces is too simple: It’s, data base
Another approach is to use regular expressions to specify which substrings are valid words.
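To see why dividing at whitespace is too simple, here is a small sketch that applies Python's `str.split()` to the exercise text from the "What's a word?" slide (typed with a plain apostrophe). It only illustrates the problem; it is not a recommended tokenizer.

```python
text = ("It's a shame that our data-base is not up-to-date. "
        "It is a shame that um, data base A costs $2300.50 "
        "and that database B costs $5000. All databases cost far too much.")

tokens = text.split()   # split on runs of spaces, tabs, and newlines
print(len(tokens))
print(tokens)
```

Punctuation stays glued to words ("up-to-date.", "um,"), "data base" becomes two tokens while "data-base" and "database" are one each, and prices count as words.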
Regular expressions for tokenization
• wordr = r'(\w+)'
• hyphen = r'(\w+\-\s?\w+)'
• E.g. data-base; allows for a space after the hyphen
• apostrophe = r'(\w+\'\w+)'
• E.g. isn’t
• numbers = r'((\$|#)?\d+(\.\d+)?%?)'
• Still needs to handle large numbers with commas
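One possible way to combine patterns like these into a single tokenizer is sketched below with Python's `re` module. The combined pattern, the comma handling for large numbers, and the name `tokenize` are assumptions for illustration, not the lecture's own code. Unlike the hyphen pattern above, this sketch does not allow a space after the hyphen.

```python
import re

# Alternatives are tried left to right, so the more specific
# patterns must come before the plain-word fallback.
TOKEN_RE = re.compile(r"""
    [$\#]?\d+(?:,\d{3})*(?:\.\d+)?%?   # numbers, optionally with commas, decimals, $ or %
  | \w+(?:-\w+)+                       # hyphenated words such as data-base, up-to-date
  | \w+'\w+                            # word-internal apostrophe such as It's, isn't
  | \w+                                # plain words
  | [^\w\s]                            # any other single non-space character
""", re.VERBOSE)

def tokenize(text):
    """Return the list of tokens matched by TOKEN_RE, in order."""
    return TOKEN_RE.findall(text)

print(tokenize("It's a shame that data-base A costs $2,300.50."))
```

Whether punctuation marks should be kept as tokens, as here, or dropped is a design choice that changes the word count.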
Some Tokenization Issues
Sentence Boundaries
• Punctuation, eg quotation marks around sentences?
• Periods – end of line or not?
Proper Names
• What to do about
• “New York-New Jersey train”?
• “California Governor Arnold Schwarzenegger”?
Contractions
• That’s Fred’s jacket’s pocket.
• I’m doing what you’re saying “Don’t do!”.
Jabberwocky Analysis
This is nonsense … or is it?
This is not English … but it’s much more like English than it is like French or German or Chinese or …
Why do we pretty much understand the words?
Jabberwocky Analysis
Why do we pretty much understand the words?
We recognize combinations of morphemes.
• Chortled - Laugh in a breathy, gleeful way; (Definition from Oxford American Dictionary) A combination of "chuckle" and "snort."
• Galumphing - Moving in a clumsy, ponderous, or noisy manner. Perhaps a blend of "gallop" and "triumph." (Definition from Oxford American Dictionary)
Activity:
• Make up a word whose meaning can be inferred from the morphemes that you used.
Jabberwocky Analysis
Why do we pretty much understand the words?
• Surrounding English words strongly indicate the parts-of-speech of the nonsense words.
• toves: probably can perform an action
(because they did gyre and gimble)
• wabe: is probably a place.
(they did … in the wabe)
http://assets.cambridge.org/052185/542X/excerpt/052185542X_excerpt.pdf
Jabberwocky Analysis
• Surrounding English words strongly indicate the parts-of-speech of the nonsense words.
• It’s similar in the French Translation:
Example from http://www.departments.bucknell.edu/linguistics/lectures/05lect02.html
Morphology
Morphology:
• The study of the way words are built up from smaller meaning units.
Morphemes:
• The smallest meaningful unit in the grammar of a language.
Contrasts:
• Derivational vs. Inflectional
• Regular vs. Irregular
• Concatenative vs. Templatic (root-and-pattern)
A useful resource:
• Glossary of linguistic terms by Eugene Loos
• http://www.sil.org/linguistics/GlossaryOfLinguisticTerms/contents.htm
Examples (English)
“unladylike”
• 3 morphemes, 4 syllables
un- ‘not’
lady ‘(well behaved) female adult human’
-like ‘having the characteristics of’
• Can’t break any of these down further without distorting the meaning of the units
“technique”
• 1 morpheme, 2 syllables
“dogs”
• 2 morphemes, 1 syllable
-s, a plural marker on nouns
Morpheme Definitions
Root
• The portion of the word that:
• is common to a set of derived or inflected forms, if any, when all affixes are removed
• is not further analyzable into meaningful elements
• carries the principal portion of meaning of the words
Stem
• The root or roots of a word, together with any derivational affixes, to which inflectional affixes are added.
Affix
• A bound morpheme that is joined before, after, or within a root or stem.
Clitic
• A morpheme that functions syntactically like a word, but does not appear as an independent phonological word
• Arabic: al in Al-Qaeda (definite particle)
• English: ’s in Hal’s (genitive particle)
Inflectional vs. Derivational
Word Classes
• Parts of speech: noun, verb, adjective, etc.
• Word class dictates how a word combines with morphemes to form new words
Inflection:
• Variation in the form of a word, typically by means of an affix, that expresses a grammatical contrast.
• Doesn’t change the word class
• Usually produces a predictable, nonidiosyncratic change of meaning.
• run -> runs | running | ran
Derivation:
• The formation of a new word or inflectable stem from another word or stem.
• compute -> computer -> computerization
Inflectional Morphology
Adds:
• tense, number, person, mood, aspect
Word class doesn’t change
Word serves new grammatical role
Examples
• come is inflected for person and number:
The pizza guy comes at noon.
• las and rojas are inflected for agreement with manzanas in grammatical gender by -a and in number by -s
las manzanas rojas (‘the red apples’)
Derivational Morphology
Word class changes: verb → noun, noun → adjective, etc.
Nominalization (formation of nouns from other parts of speech, primarily verbs in English):
• computerization
• appointee
• killer
• fuzziness
Formation of adjectives (primarily from nouns)
• computational
• clueless
• embraceable
Difficult cases:
• building: from which word class and sense of “build”?
Concatenative Morphology
Morpheme+Morpheme+Morpheme+…
Stems: also called lemma, base form, root, lexeme
• hope+ing → hoping; hop+ing → hopping
Affixes
• Prefixes: anti-, dis- in antidisestablishmentarianism
• Suffixes: -ment, -arian, -ism in antidisestablishmentarianism
• Infixes: hingi (borrow) – humingi (borrower) in Tagalog
• Circumfixes: sagen (say) – gesagt (said) in German
Agglutinative Languages
• uygarlaştıramadıklarımızdanmışsınızcasına
• uygar+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casına
• Behaving as if you are among those whom we could not cause to become civilized
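The hope+ing / hop+ing contrast from the stems bullet shows that concatenation also needs spelling rules. Below is a toy sketch with two such rules, e-deletion and consonant doubling. The function `add_ing` and its conditions are simplifying assumptions and do not cover English spelling in general.

```python
VOWELS = set("aeiou")
NO_DOUBLING = VOWELS | set("wxy")   # final letters that are never doubled in this toy rule

def add_ing(stem):
    """Attach -ing using two toy English spelling rules (not exhaustive)."""
    # e-deletion: hope + ing -> hoping (but keep 'ee', as in see -> seeing)
    if len(stem) > 2 and stem.endswith("e") and not stem.endswith("ee"):
        return stem[:-1] + "ing"
    # consonant doubling for three-letter consonant-vowel-consonant stems: hop -> hopping
    if (len(stem) == 3
            and stem[0] not in VOWELS
            and stem[1] in VOWELS
            and stem[2] not in NO_DOUBLING):
        return stem + stem[-1] + "ing"
    return stem + "ing"

for stem in ("hope", "hop", "walk", "see"):
    print(stem, "+ ing ->", add_ing(stem))
```

A morphological analyser has to run such rules in reverse, which is why "hoping" and "hopping" must map back to different stems.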
Templatic Morphology
Roots and Patterns
• Example: Hebrew or Arabic or Amharic (spoken in Ethiopia)
• Root:
• Consists of 3 consonants CCC
• Carries basic meaning
• Template:
• Gives the ordering of consonants and vowels
• Specifies semantic information about the verb
• Active, passive, middle voice
• Example (Hebrew):
• lmd (to learn or study)
• CaCaC -> lamad (he studied)
• CiCeC -> limed (he taught)
• CuCaC -> lumad (he was taught)
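The root-and-pattern idea can be sketched in a few lines: each C slot in the template is filled with the next consonant of the root. The helper name `apply_template` is hypothetical, and the sketch ignores everything a real analyser for Hebrew or Arabic would need (script, vowel omission, irregular roots).

```python
def apply_template(root, template):
    """Fill each 'C' slot in the template with the next root consonant."""
    if template.count("C") != len(root):
        raise ValueError("template must have one C slot per root consonant")
    consonants = iter(root)
    return "".join(next(consonants) if slot == "C" else slot for slot in template)

for template in ("CaCaC", "CiCeC", "CuCaC"):
    print(template, "->", apply_template("lmd", template))
```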
Morphological Analysis Tools
Porter stemmer:
• A simple approach: just hack off the end of the word!
• Frequently used, especially for Information Retrieval, but results are pretty ugly!
porter.demo()
Original *****************************
Pierre Vinken , 61 years old , will join the board as a nonexecutive
director Nov. 29 . Mr. Vinken is chairman of Elsevier N.V. , the Dutch
publishing group . Rudolph Agnew , 55 years old and former chairman of
Consolidated Gold Fields PLC , was named a nonexecutive director of
this British industrial conglomerate . A form of asbestos once used to
make Kent cigarette filters has caused a high percentage of cancer
deaths among a group of workers exposed to it more than 30 years ago ,
researchers reported .
Results *******************************
Pierr Vinken , 61 year old , will join the board as a nonexecut
director Nov. 29 . Mr. Vinken is chairman of Elsevi N.V. , the Dutch
publish group . Rudolph Agnew , 55 year old and former chairman of
Consolid Gold Field PLC , wa name a nonexecut director of thi British
industri conglomer . A form of asbesto onc use to make Kent cigarett
filter ha caus a high percentag of cancer death among a group of
worker expos to it more than 30 year ago , research report .
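Some of the ugliness in the results above (wa, thi, ha) comes from the very first step of the Porter algorithm, which strips plural-style endings without consulting a dictionary. The sketch below implements step 1a only, not the full stemmer. The guard that leaves words of one or two letters untouched is an assumption added here so that words such as "is" pass through unchanged.

```python
def porter_step_1a(word):
    """Step 1a of the Porter algorithm: plural-style suffix rules only."""
    if len(word) <= 2:
        return word           # assumption: leave very short words alone
    if word.endswith("sses"):
        return word[:-2]      # SSES -> SS
    if word.endswith("ies"):
        return word[:-2]      # IES -> I
    if word.endswith("ss"):
        return word           # SS -> SS
    if word.endswith("s"):
        return word[:-1]      # S -> (nothing)
    return word

for w in ("years", "was", "this", "has", "caresses", "is"):
    print(w, "->", porter_step_1a(w))
```

Because the rule looks only at the final letters, "was" and "this" are treated exactly like the plurals "years" and "deaths".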
Morphological Analysis Tools
WordNet’s morphy()
• A slightly more sophisticated approach
• Use an understanding of inflectional morphology
• Uses a set of Rules of Detachment
• Use an Exception List for irregulars
• Handle collocations in a special way
• Do the transformation, compare the result to the WordNet dictionary
• If the transformation produces a real word, then keep it, else use the original word.
• For more details, see
• http://wordnet.princeton.edu/man/morphy.7WN.html
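The procedure described above (exception list, then rules of detachment checked against a dictionary, else keep the original word) can be sketched as follows. This is a toy illustration, not WordNet's actual code or data: `LEXICON`, `EXCEPTIONS`, `RULES`, and `toy_morphy` are all made-up stand-ins, and collocations are not handled.

```python
# Toy stand-ins for WordNet's dictionary and exception list (illustrative only).
LEXICON = {"dog", "run", "corpus", "church", "fly", "hope"}
EXCEPTIONS = {"corpora": "corpus", "ran": "run", "running": "run"}

# (suffix, replacement) detachment rules, tried in order.
RULES = [("ies", "y"), ("es", ""), ("s", ""),
         ("ing", "e"), ("ing", ""), ("ed", "e"), ("ed", "")]

def toy_morphy(word):
    """Return a base form for `word`, or the word itself if none is found."""
    if word in EXCEPTIONS:                      # irregular forms first
        return EXCEPTIONS[word]
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            candidate = word[:-len(suffix)] + replacement
            if candidate in LEXICON:            # keep only real words
                return candidate
    return word                                 # fall back to the original

for w in ("dogs", "churches", "flies", "hoping", "corpora", "blorps"):
    print(w, "->", toy_morphy(w))
```

The dictionary check is what separates this approach from plain suffix stripping: "blorps" is left alone because "blorp" is not a known word.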
Some morphy() output
>>> wntools.morphy('dogs')
'dog'
>>> wntools.morphy('running', pos='verb')
'run'
>>> wntools.morphy('corpora')
'corpus'
>>>
Morphological Analysis Tools
Very sophisticated programs have been developed
Use a technique called Two-Level Phonology
• Has been applied to numerous languages
Best known: PC-KIMMO
• After Kimmo Koskenniemi, based in part on work by Lauri Karttunen in 1983
• Uses:
• A rules file which specifies the alphabet and the phonological (or spelling) rules,
• A lexicon file which lists lexical items and encodes morphotactic constraints.
• http://www.sil.org/pckimmo/
Commercial versions are available
• Inxight’s LinguistX, based on technology developed by Kaplan and others from Xerox PARC (or at least it used to be)
Morphological Analysis Tools
“cheat”: store all variants in a dictionary database, eg
CatVar:
• Categorial Variation Database
• “A database of clusters of uninflected words (lexemes) and their categorial (i.e. part-of-speech) variants.”
• Example: the developing cluster: (develop(V), developer(N), developed(AJ), developing(N), developing(AJ), development(N)).
http://clipdemos.umiacs.umd.edu/catvar
based on published dictionaries: LDOCE, CELEX, OALD++, PROPOSEL ...
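The "cheat" amounts to a table lookup. A minimal sketch, using only the one cluster quoted above; the data structure and the function `variants` are assumptions for illustration and say nothing about how CatVar itself is stored or queried.

```python
from collections import defaultdict

# The single cluster quoted on the slide; a real database holds far more.
CLUSTERS = [
    [("develop", "V"), ("developer", "N"), ("developed", "AJ"),
     ("developing", "N"), ("developing", "AJ"), ("development", "N")],
]

# Index every word form to the cluster(s) it belongs to.
INDEX = defaultdict(list)
for cluster in CLUSTERS:
    for word, _pos in cluster:
        if cluster not in INDEX[word]:
            INDEX[word].append(cluster)

def variants(word, pos=None):
    """Return the categorial variants of `word`, optionally filtered by POS tag."""
    found = [entry for cluster in INDEX.get(word, []) for entry in cluster]
    return [w for w, p in found if pos is None or p == pos]

print(variants("developer"))        # every form in the cluster
print(variants("developer", "N"))   # nouns only
```

No rules are needed at lookup time, but every word and every new language needs its own hand-built entries.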
MorphoChallenge
One problem with rule-based systems (PC-KIMMO) or dictionary-lookup systems: porting to new languages
In principle, Unsupervised Machine Learning (UML) could learn from any language data set, by finding recurring patterns which correspond to roots, prefixes, suffixes
MorphoChallenge is a contest to find the best UML morphological analyser
http://www.cis.hut.fi/morphochallenge2005/
http://www.cis.hut.fi/morphochallenge2007/
http://www.cis.hut.fi/morphochallenge2008/
Atwell, Roberts: Combinatory Hybrid Elementary Analysis of Text http://www.cis.hut.fi/morphochallenge2005/P07_Atwell.pdf
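To give a flavour of "finding recurring patterns" without any language-specific rules, the sketch below counts word endings whose removal leaves another word in the same word list. It is a deliberately naive heuristic written for this example; it is not the method of any MorphoChallenge entry, and the function name and `max_len` parameter are assumptions.

```python
from collections import Counter

def candidate_suffixes(words, max_len=4):
    """Count endings whose removal leaves another word from the list."""
    vocab = set(words)
    counts = Counter()
    for word in vocab:
        for i in range(1, len(word)):
            stem, suffix = word[:i], word[i:]
            if len(suffix) <= max_len and stem in vocab:
                counts[suffix] += 1
    return counts

words = ["walk", "walks", "walked", "walking",
         "talk", "talks", "talked", "talking"]
print(candidate_suffixes(words).most_common())
```

On this tiny list the endings -s, -ed, and -ing each recur with two different stems, which is the kind of evidence an unsupervised learner can build on.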
Arabic morphological analysis
Arabic is particularly challenging - different script, infixes, vowels may be left out in written Arabic …
Leeds researcher Majdi Sawalha: online analysis tool http://www.comp.leeds.ac.uk/sawalha/
Sawalha, Majdi; Atwell, Eric (2010). Fine-Grain Morphological Analyzer and Part-of-Speech Tagger for Arabic Text. In: Proceedings of the Language Resources and Evaluation Conference LREC 2010, 17-23 May 2010, Valletta, Malta.
http://www.comp.leeds.ac.uk/sawalha/sawalha10lrecB.pdf
Reminder
Tokenization - by whitespace, regular expressions
Problems: It’s, data-base, New York, …
Jabberwocky shows we can break words into morphemes
Morpheme types: root/stem, affix, clitic
Derivational vs. Inflectional
Regular vs. Irregular
Concatenative vs. Templatic (root-and-pattern)
Morphological analysers: Porter stemmer, Morphy, PC-KIMMO
Morphology by lookup: CatVar, CELEX, OALD++
Unsupervised Machine Learning: MorphoChallenge