SIMS 290-2: Applied Natural Language Processing
Marti Hearst
Sept 8, 2004
Today
Tokenizing using Regular Expressions
Elementary Morphology
Frequency Distributions in NLTK
Modified from Dorr and Habash (after Jurafsky and Martin)
Tokenizing in NLTK
The Whitespace Tokenizer doesn’t work very well
What are some of the problems?
NLTK provides an easy way to incorporate regexes into your tokenizer
Uses Python’s regex package (re)
http://docs.python.org/lib/re-syntax.html
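To make the problems concrete, here is a minimal sketch that uses plain Python `str.split()` as a stand-in for a whitespace tokenizer. That stand-in is an assumption for illustration (it is not NLTK's own class), and the sample sentence is made up.

```python
# Whitespace tokenization sketch: str.split() stands in for a whitespace tokenizer.
text = "This is the girl's depart- ment store. It costs $3.50!"

tokens = text.split()
print(tokens)

# Tokens that end in a non-alphanumeric character show the main problems:
# punctuation stays glued to words, and a line-break hyphenation is cut in two.
glued = [tok for tok in tokens if not tok[-1].isalnum()]
print(glued)
```

Splitting on whitespace alone cannot separate "store." into a word and a period, and it cannot rejoin "depart-" with "ment".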
Regexes for Tokenizing
Build up your recognizer piece by piece
– Make a string of regexes combined with ORs
– Put each one in a group (surrounded by parens)
Things to recognize:
– URLs
– Words with hyphens in them
– Words in which hyphens should be removed (end-of-line hyphens)
– Numerical terms
– Words with apostrophes
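One way to handle end-of-line hyphens is to rejoin them before tokenizing. The sketch below assumes that a hyphen followed by whitespace always marks a line-break hyphenation, which is a simplification. The function name `join_line_hyphens` is hypothetical.

```python
import re

def join_line_hyphens(text):
    # Assumption: a hyphen followed by whitespace is an end-of-line break,
    # so the hyphen and the whitespace are dropped and the halves are joined.
    return re.sub(r"(\w)-\s+(\w)", r"\1\2", text)

print(join_line_hyphens("the depart- ment store sells well-known brands"))
```

Ordinary hyphenated words such as "well-known" are left alone because no whitespace follows the hyphen. Phrases like "pre- and post-processing" would be joined incorrectly under this assumption.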
Regexes for Tokenizing
Here are some I put together:
url = r'((http:\/\/)?[A-Za-z]+(\.[A-Za-z]+){1,3}(\/)?(:\d+)?)'
» Allows port number but no argument variables.
hyphen = r'(\w+\-\s?\w+)'
» Allows for a space after the hyphen
apostro = r'(\w+\'\w+)'
numbers = r'((\$|#)?\d+(\.)?\d+%?)'
» Does not yet handle large numbers with commas
punct = r'([^\w\s]+)'
wordr = r'(\w+)'
A nice Python trick:
regexp = string.join([url, hyphen, apostro, numbers, wordr, punct],"|")
– Makes one string in which a “|” goes in between each substring
Regexes for Tokenizing
More code:
import string
from nltk.token import *
from nltk.tokenizer import *
t = Token(TEXT='This is the girl\'s depart- ment store.')
regexp = string.join([url, hyphen, apostro, numbers, wordr, punct],"|")
RegexpTokenizer(regexp,SUBTOKENS='WORDS').tokenize(t)
print t['WORDS']
[<This>, <is>, <the>, <girl's>, <depart- ment>, <store>, <.>]
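The same combined pattern can be tried with nothing but Python's standard `re` module. The sketch below is a Python 3 rendering that does not depend on any NLTK version. The helper name `tokenize` is hypothetical, and `"|".join(...)` plays the role of `string.join(..., "|")`.

```python
import re

# The component patterns from the slides, unchanged.
url = r'((http:\/\/)?[A-Za-z]+(\.[A-Za-z]+){1,3}(\/)?(:\d+)?)'
hyphen = r'(\w+\-\s?\w+)'
apostro = r'(\w+\'\w+)'
numbers = r'((\$|#)?\d+(\.)?\d+%?)'
punct = r'([^\w\s]+)'
wordr = r'(\w+)'

# One big alternation; earlier alternatives win when several could match.
regexp = "|".join([url, hyphen, apostro, numbers, wordr, punct])

def tokenize(text):
    # group(0) is the whole match; re.findall would return tuples of
    # sub-groups here because the pattern contains capturing parentheses.
    return [m.group(0) for m in re.finditer(regexp, text)]

print(tokenize("This is the girl's depart- ment store."))
# ['This', 'is', 'the', "girl's", 'depart- ment', 'store', '.']
```

The order of the alternatives matters: `wordr` is placed after the more specific patterns so that "girl's" is not cut off at "girl".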
Tokenization Issues
Sentence Boundaries
– Include parens around sentences?
– What about quotation marks around sentences?
– Periods: end of line or not?
– We’ll study this in detail in a couple of weeks.
Proper Names
– What to do about “New York-New Jersey train”?
– What to do about “California Governor Arnold Schwarzenegger”?
Clitics and Contractions
Morphology
Morphology:
– The study of the way words are built up from smaller meaning units.
Morphemes:
– The smallest meaningful unit in the grammar of a language.
Contrasts:
– Derivational vs. Inflectional
– Regular vs. Irregular
– Concatenative vs. Templatic (root-and-pattern)
A useful resource:
– Glossary of linguistic terms by Eugene Loos
– http://www.sil.org/linguistics/GlossaryOfLinguisticTerms/contents.htm
Examples (English)
“unladylike”
– 3 morphemes, 4 syllables
– un- ‘not’
– lady ‘(well behaved) female adult human’
– -like ‘having the characteristics of’
– Can’t break any of these down further without distorting the meaning of the units
“technique”
– 1 morpheme, 2 syllables
“dogs”
– 2 morphemes, 1 syllable
– -s, a plural marker on nouns
Morpheme Definitions
Root: the portion of the word that
– is common to a set of derived or inflected forms, if any, when all affixes are removed
– is not further analyzable into meaningful elements
– carries the principal portion of meaning of the words
Stem: the root or roots of a word, together with any derivational affixes, to which inflectional affixes are added.
Affix: a bound morpheme that is joined before, after, or within a root or stem.
Clitic: a morpheme that functions syntactically like a word, but does not appear as an independent phonological word
– Spanish: un beso, las aguas
– English: Hal’s (genitive marker)
Inflectional vs. Derivational
Word Classes
– Parts of speech: noun, verb, adjective, etc.
– Word class dictates how a word combines with morphemes to form new words
Inflection: variation in the form of a word, typically by means of an affix, that expresses a grammatical contrast.
– Doesn’t change the word class
– Usually produces a predictable, nonidiosyncratic change of meaning.
Derivation: the formation of a new word or inflectable stem from another word or stem.
Inflectional Morphology
Adds: tense, number, person, mood, aspect
Word class doesn’t change
Word serves new grammatical role
Examples
– come is inflected for person and number: The pizza guy comes at noon.
– las and rojas are inflected for agreement with manzanas in grammatical gender by -a and in number by -s: las manzanas rojas (‘the red apples’)
Derivational Morphology
Nominalization (formation of nouns from other parts of speech, primarily verbs in English):
– computerization
– appointee
– killer
– fuzziness
Formation of adjectives (primarily from nouns):
– computational
– clueless
– embraceable
Difficult cases:
– building: from which sense of “build”?
A resource:
– CatVar: Categorial Variation Database
– http://clipdemos.umiacs.umd.edu/catvar
Concatenative Morphology
Morpheme+Morpheme+Morpheme+…
Stems: also called lemma, base form, root, lexeme
– hope+ing → hoping
– hop → hopping
Affixes
– Prefixes: anti-, dis- (antidisestablishmentarianism)
– Suffixes: -ment, -arian, -ism (antidisestablishmentarianism)
– Infixes: hingi (borrow) – humingi (borrower) in Tagalog
– Circumfixes: sagen (say) – gesagt (said) in German
Agglutinative Languages
– uygarlaştıramadıklarımızdanmışsınızcasına
– uygar+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casına
– Behaving as if you are among those whom we could not cause to become civilized
Templatic Morphology
Roots and Patterns
Example: Hebrew verbs
Root:
– Consists of 3 consonants CCC
– Carries basic meaning
Template:
– Gives the ordering of consonants and vowels
– Specifies semantic information about the verb (active, passive, middle voice)
Example: lmd (to learn or study)
– CaCaC -> lamad (he studied)
– CiCeC -> limed (he taught)
– CuCaC -> lumad (he was taught)
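The root-and-pattern idea can be sketched in a few lines: each `C` slot in the template is filled with the next root consonant, and the template's vowels are copied through. The function name `apply_template` is hypothetical, and the sketch ignores everything about Hebrew beyond the transliterated example above.

```python
def apply_template(root, template):
    # Each "C" in the template consumes one consonant of the root, in order.
    if template.count("C") != len(root):
        raise ValueError("template slots and root consonants must match")
    consonants = iter(root)
    return "".join(next(consonants) if ch == "C" else ch for ch in template)

for template in ("CaCaC", "CiCeC", "CuCaC"):
    print(template, "->", apply_template("lmd", template))
```

The root carries the basic meaning and the template changes around it, which is why the morphemes cannot simply be concatenated.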
Nouns and Verbs (in English)
Nouns have simple inflectional morphology
– cat
– cat+s, cat+’s
Verbs have more complex morphology
Nouns and Verbs (in English)
Nouns
– Have simple inflectional morphology
– Cat/Cats
– Mouse/Mice, Ox/Oxen, Goose/Geese
Verbs
– More complex morphology
– Walk/Walked
– Go/Went, Fly/Flew
Regular (English) Verbs
Morphological Form Classes    Regularly Inflected Verbs
Stem                          walk      merge     try      map
-s form                       walks     merges    tries    maps
-ing form                     walking   merging   trying   mapping
Past form or -ed participle   walked    merged    tried    mapped
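The spelling changes in the table (e-deletion in merging, y to ie in tries, consonant doubling in mapping) can be written as a few rules. The sketch below is a deliberately small rule set that reproduces the four verbs in the table. The function names and the doubling heuristic are assumptions, and the heuristic over-applies to longer stems such as visit.

```python
VOWELS = "aeiou"

def doubles_final(stem):
    # Heuristic: consonant-vowel-consonant ending (final not w, x, or y).
    return (len(stem) >= 3
            and stem[-1] not in VOWELS + "wxy"
            and stem[-2] in VOWELS
            and stem[-3] not in VOWELS)

def s_form(stem):
    if stem.endswith("y") and stem[-2] not in VOWELS:
        return stem[:-1] + "ies"          # try -> tries
    if stem.endswith(("s", "x", "z", "ch", "sh")):
        return stem + "es"
    return stem + "s"

def ing_form(stem):
    if stem.endswith("e") and not stem.endswith("ee"):
        return stem[:-1] + "ing"          # merge -> merging
    if doubles_final(stem):
        return stem + stem[-1] + "ing"    # map -> mapping
    return stem + "ing"

def ed_form(stem):
    if stem.endswith("e"):
        return stem + "d"                 # merge -> merged
    if stem.endswith("y") and stem[-2] not in VOWELS:
        return stem[:-1] + "ied"          # try -> tried
    if doubles_final(stem):
        return stem + stem[-1] + "ed"     # map -> mapped
    return stem + "ed"

for verb in ("walk", "merge", "try", "map"):
    print(verb, s_form(verb), ing_form(verb), ed_form(verb))
```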
Irregular (English) Verbs
Morphological Form Classes    Irregularly Inflected Verbs
Stem                          eat       catch      cut
-s form                       eats      catches    cuts
-ing form                     eating    catching   cutting
Past form                     ate       caught     cut
-ed participle                eaten     caught     cut
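Irregular forms cannot be produced by rule, so a common approach is an exception table that is consulted before any regular rule. The sketch below assumes a hypothetical `IRREGULAR` table holding just the three verbs above and a deliberately naive "+ed" fallback.

```python
# Exception table: stem -> (past form, -ed participle).
IRREGULAR = {
    "eat": ("ate", "eaten"),
    "catch": ("caught", "caught"),
    "cut": ("cut", "cut"),
}

def past_forms(stem):
    # Look up exceptions first; fall back to a naive regular rule.
    if stem in IRREGULAR:
        return IRREGULAR[stem]
    regular = stem + "ed"
    return (regular, regular)

print(past_forms("eat"))
print(past_forms("walk"))
```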
“To love” in Spanish
Syntax and Morphology
Phrase-level agreement
– Subject-Verb: John studies hard (STUDY+3SG)
– Noun-Adjective: Las vacas hermosas
Sub-word phrasal structures
– שבספרינו
– ש+ב+ספר+ים+נו
– That+in+book+PL+Poss:1PL
– Which are in our books
Phonology and Morphology
Script Limitations
Spoken English has 14 vowels
– heed hid hayed head had hoed hood who’d hide how’d taught Tut toy enough
English Alphabet has 5
– Use vowel combinations: far fair fare
– Consonantal doubling (hopping vs. hoping)
Computational Morphology
Approaches
– Lexicon only
– Rules only
– Lexicon and Rules: Finite-state Automata, Finite-state Transducers
Systems
– WordNet’s morphy
– PCKimmo: named after Kimmo Koskenniemi, much work done by Lauri Karttunen, Ron Kaplan, and Martin Kay; accurate but complex; http://www.sil.org/pckimmo/
– Two-level morphology: commercial version available from InXight Corp.
Background
– Chapter 3 of Jurafsky and Martin
– A short history of Two-Level Morphology: http://www.ling.helsinki.fi/~koskenni/esslli-2001-karttunen/
Porter Stemmer
Discount morphology
– So not all that accurate
Uses a series of cascaded rewrite rules
– ATIONAL -> ATE (relational -> relate)
– ING -> ε if stem contains vowel (motoring -> motor)
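The two rewrite rules above can be sketched as a tiny cascade in which the output of one rule is the input to the next. This is only an illustration of the rule style, not the Porter algorithm itself. The name `toy_stem` is hypothetical.

```python
import re

def has_vowel(s):
    return re.search(r"[aeiou]", s) is not None

def toy_stem(word):
    # Rule 1: ATIONAL -> ATE
    if word.endswith("ational"):
        word = word[:-len("ational")] + "ate"
    # Rule 2: ING -> (nothing), but only if the remaining stem contains a vowel
    if word.endswith("ing") and has_vowel(word[:-3]):
        word = word[:-3]
    return word

for w in ("relational", "motoring", "sing"):
    print(w, "->", toy_stem(w))
```

The vowel condition is what keeps "sing" from being reduced to "s".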
Porter Stemmer
Step 4: Derivational Morphology I: Multiple Suffixes
(m>0) ATIONAL -> ATE    relational -> relate
(m>0) TIONAL -> TION    conditional -> condition
                        rational -> rational
(m>0) ENCI -> ENCE      valenci -> valence
(m>0) ANCI -> ANCE      hesitanci -> hesitance
(m>0) IZER -> IZE       digitizer -> digitize
(m>0) ABLI -> ABLE      conformabli -> conformable
(m>0) ALLI -> AL        radicalli -> radical
(m>0) ENTLI -> ENT      differentli -> different
(m>0) ELI -> E          vileli -> vile
(m>0) OUSLI -> OUS      analogousli -> analogous
(m>0) IZATION -> IZE    vietnamization -> vietnamize
(m>0) ATION -> ATE      predication -> predicate
(m>0) ATOR -> ATE       operator -> operate
(m>0) ALISM -> AL       feudalism -> feudal
(m>0) IVENESS -> IVE    decisiveness -> decisive
(m>0) FULNESS -> FUL    hopefulness -> hopeful
(m>0) OUSNESS -> OUS    callousness -> callous
(m>0) ALITI -> AL       formaliti -> formal
(m>0) IVITI -> IVE      sensitiviti -> sensitive
(m>0) BILITI -> BLE     sensibiliti -> sensible
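The condition (m>0) refers to Porter's measure of the stem: writing the stem as [C](VC)^m[V], with C a run of consonants and V a run of vowels, m counts the VC pairs. The sketch below computes m and applies the rules from this step in the order listed, stopping at the first suffix that matches. The names `measure`, `RULES`, and `apply_step` are hypothetical, and the first-match strategy is an assumption that happens to agree with longest-match for this rule list.

```python
import re

VOWELS = "aeiou"

def is_consonant(word, i):
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # y counts as a vowel when it follows a consonant
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    # Map the stem to a C/V shape and count vowel-run + consonant-run pairs.
    shape = "".join("C" if is_consonant(stem, i) else "V" for i in range(len(stem)))
    return len(re.findall(r"V+C+", shape))

RULES = [
    ("ational", "ate"), ("tional", "tion"), ("enci", "ence"), ("anci", "ance"),
    ("izer", "ize"), ("abli", "able"), ("alli", "al"), ("entli", "ent"),
    ("eli", "e"), ("ousli", "ous"), ("ization", "ize"), ("ation", "ate"),
    ("ator", "ate"), ("alism", "al"), ("iveness", "ive"), ("fulness", "ful"),
    ("ousness", "ous"), ("aliti", "al"), ("iviti", "ive"), ("biliti", "ble"),
]

def apply_step(word):
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            # Rewrite only if the remaining stem has measure > 0;
            # either way, no further rule in this step is tried.
            return stem + replacement if measure(stem) > 0 else word
    return word

for w in ("relational", "conditional", "rational", "sensibiliti"):
    print(w, "->", apply_step(w))
```

"rational" is left unchanged because removing ATIONAL leaves the stem "r", whose measure is 0.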
Porter Stemmer
Errors of Omission (related pairs the stemmer fails to conflate)
European         Europe
analysis         analyzes
matrices         matrix
noise            noisy
explain          explanation
Errors of Commission (unrelated pairs the stemmer wrongly conflates)
organization     organ
doing            doe
generalization   generic
numerical        numerous
university       universe
Computational Morphology
WORD      STEM (+FEATURES)*
cats      cat +N +PL
cat       cat +N +SG
cities    city +N +PL
geese     goose +N +PL
ducks     (duck +N +PL) or (duck +V +3SG)
merging   merge +V +PRES-PART
caught    (catch +V +PAST-PART) or (catch +V +PAST)
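The table above can be read as the input/output behavior of a morphological analyzer: one surface word in, one or more (stem, features) analyses out. A minimal sketch, assuming a hand-written table and the hypothetical names `ANALYSES`, `analyze`, and `show`:

```python
# Surface word -> list of (stem, features) analyses; ambiguity means more than one entry.
ANALYSES = {
    "cats": [("cat", ("+N", "+PL"))],
    "cat": [("cat", ("+N", "+SG"))],
    "cities": [("city", ("+N", "+PL"))],
    "geese": [("goose", ("+N", "+PL"))],
    "ducks": [("duck", ("+N", "+PL")), ("duck", ("+V", "+3SG"))],
    "merging": [("merge", ("+V", "+PRES-PART"))],
    "caught": [("catch", ("+V", "+PAST-PART")), ("catch", ("+V", "+PAST"))],
}

def analyze(word):
    # Unknown words get no analysis in this sketch.
    return ANALYSES.get(word, [])

def show(analysis):
    stem, features = analysis
    return " ".join((stem,) + features)

for word in ("cats", "ducks", "caught"):
    print(word, "=>", " or ".join(show(a) for a in analyze(word)))
```

Words such as "ducks" return two analyses; choosing between them needs context, for example from a part-of-speech tagger.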
Lexicon-only Morphology
acclaim acclaim $N$
acclaim acclaim $V+0$
acclaimed acclaim $V+ed$
acclaimed acclaim $V+en$
acclaiming acclaim $V+ing$
acclaims acclaim $N+s$
acclaims acclaim $V+s$
acclamation acclamation $N$
acclamations acclamation $N+s$
acclimate acclimate $V+0$
acclimated acclimate $V+ed$
acclimated acclimate $V+en$
acclimates acclimate $V+s$
acclimating acclimate $V+ing$
• The lexicon lists all surface level and lexical level pairs
• No rules …
• Analysis/Generation is easy
• Very large for English
• What about Arabic or Turkish or Chinese?
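With a lexicon-only approach, analysis and generation are both plain table lookups over the same list of pairs. The sketch below uses a few of the acclaim entries; the names `LEXICON`, `analysis`, and `generation` are hypothetical.

```python
from collections import defaultdict

# (surface form, lexical form, tag) triples copied from the listing above.
LEXICON = [
    ("acclaim", "acclaim", "N"),
    ("acclaim", "acclaim", "V+0"),
    ("acclaimed", "acclaim", "V+ed"),
    ("acclaimed", "acclaim", "V+en"),
    ("acclaiming", "acclaim", "V+ing"),
    ("acclaims", "acclaim", "N+s"),
    ("acclaims", "acclaim", "V+s"),
]

analysis = defaultdict(list)   # surface -> all (lexical form, tag) readings
generation = {}                # (lexical form, tag) -> surface

for surface, lexical, tag in LEXICON:
    analysis[surface].append((lexical, tag))
    generation[(lexical, tag)] = surface

print(analysis["acclaimed"])
print(generation[("acclaim", "V+ing")])
```

Every inflected form needs its own entry, which is why the table grows very large and becomes impractical for languages with richer morphology.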
For Next Week
Software status:
– Software on 3 lab machines, more coming
Lecture on Monday Sept 13:
– Part of speech tagging
For Wed Sept 15
– Do exercises 1-3 in Tutorial 2 (Tokenizing)
– Do the following exercises from Tutorial 3 (Tagging): 1a-h, 2, 3, 4, 5a-b
– Turn them in online (I’ll have something available for this by then)