Lecture 9: Part of Speech
Kai-Wei Chang
CS @ University of Virginia
Course webpage: http://kwchang.net/teaching/NLP16
CS6501 Natural Language Processing
This lecture
- Parts of speech (POS)
- POS Tagsets
Parts of Speech
- Traditional parts of speech
  - ~8 of them
POS examples
- N    noun         chair, bandwidth, pacing
- V    verb         study, debate, munch
- ADJ  adjective    purple, tall, ridiculous
- ADV  adverb       unfortunately, slowly
- P    preposition  of, by, to
- PRO  pronoun      I, me, mine
- DET  determiner   the, a, that, those
Parts of Speech
- A.k.a. parts-of-speech, lexical categories, word classes, morphological classes, lexical tags...
- Lots of debate within linguistics about the number, nature, and universality of these
POS Tagging
- The process of assigning a part-of-speech to each word in a collection (sentence).

WORD   tag
the    DET
koala  N
put    V
the    DET
keys   N
on     P
the    DET
table  N
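The word/tag table above can be held in code as a list of (word, tag) pairs. The following is a minimal sketch, assuming a hypothetical toy lexicon (`LEXICON`) built only from this slide's example and a hypothetical helper `tag_sentence`; a real tagger has to pick tags in context rather than look them up.

```python
# Hypothetical toy lexicon: one tag per word, copied from the slide's example.
LEXICON = {
    "the": "DET", "koala": "N", "put": "V",
    "keys": "N", "on": "P", "table": "N",
}

def tag_sentence(words, lexicon, unknown_tag="N"):
    """Return (word, tag) pairs; unseen words get `unknown_tag` (an assumption)."""
    return [(w, lexicon.get(w.lower(), unknown_tag)) for w in words]

tagged = tag_sentence("the koala put the keys on the table".split(), LEXICON)
for word, tag in tagged:
    print(f"{word}\t{tag}")
```

A pure lookup like this gives every occurrence of a word the same tag, which is exactly what breaks down for words with more than one POS.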
Why is POS Tagging Useful?
- First step of a vast number of practical tasks
- Parsing
  - Need to know if a word is an N or V before you can parse
- Information extraction
  - Finding names, relations, etc.
- Speech synthesis/recognition
  - OBject  obJECT
  - OVERflow  overFLOW
  - DIScount  disCOUNT
  - CONtent  conTENT
- Machine Translation
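The speech synthesis bullet can be made concrete with a small lookup keyed on (word, tag). This is only a sketch: the table `STRESS` and the function `stressed_form` are hypothetical, and the assignment of each stress pattern to a tag (noun vs. verb, and noun vs. adjective for "content") is an assumption added here, since the slide lists the pairs without tags.

```python
# Hypothetical table: stress pattern chosen by POS tag.
# The tag assigned to each form is an assumption, not stated on the slide.
STRESS = {
    ("object", "N"): "OBject",
    ("object", "V"): "obJECT",
    ("overflow", "N"): "OVERflow",
    ("overflow", "V"): "overFLOW",
    ("discount", "N"): "DIScount",
    ("discount", "V"): "disCOUNT",
    ("content", "N"): "CONtent",
    ("content", "ADJ"): "conTENT",
}

def stressed_form(word, tag):
    """Return the stress-marked form, or the word unchanged if not in the table."""
    return STRESS.get((word.lower(), tag), word)

print(stressed_form("object", "N"))  # noun reading
print(stressed_form("object", "V"))  # verb reading
```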
Open and Closed Classes
- Closed class: a small fixed membership
  - Prepositions: of, in, by, …
  - Pronouns: I, you, she, mine, his, them, …
  - Usually function words (short common words which play a role in grammar)
- Open class: new ones can be created
  - English has 4: Nouns, Verbs, Adjectives, Adverbs
  - Many languages have these 4, but not all!
Open Class Words
- Nouns
  - Proper nouns (Boulder, Granby, Eli Manning)
  - Common nouns (the rest)
  - Count nouns and mass nouns
    - Count: have plurals, get counted: goat/goats, one goat, two goats
    - Mass: don’t get counted (snow, salt, communism) (*two snows)
- Verbs
  - In English, have morphological affixes (eat/eats/eaten)
Closed Class Words
Examples:
- prepositions: on, under, over, …
- particles: up, down, on, off, …
- determiners: a, an, the, …
- pronouns: she, who, I, …
- conjunctions: and, but, or, …
- auxiliary verbs: can, may, should, …
- numerals: one, two, three, third, …
Prepositions from CELEX
CELEX: online dictionary. Frequency counts are from the COBUILD 16-million-word corpus.
English Particles
Conjunctions
Choosing a Tagset
- Could pick very coarse tagsets
  - N, V, Adj, Adv, Other
- More commonly used set is finer grained
  - E.g., “Penn TreeBank tagset”, 45 tags: PRP$, WRB, WP$, VBG
  - Brown corpus, 87 tags
- Prague Dependency Treebank (Czech)
  - 4452 tags
  - AAFP3----3N----: (nejnezajímavějším) Adj Regular Feminine Plural….Superlative [Hajic 2006, VMC tutorial]
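The Czech tag above is positional: each character encodes one morphological feature, which is how the tagset grows to thousands of tags. The sketch below splits such a tag into named positions. The field names in `FIELDS` are an assumption about the 15-position scheme and are not taken from the slide; `decode_positional_tag` is a hypothetical helper.

```python
# Assumed field names for the 15 positions of a Prague positional tag.
FIELDS = [
    "POS", "SubPOS", "Gender", "Number", "Case",
    "PossGender", "PossNumber", "Person", "Tense", "Grade",
    "Negation", "Voice", "Reserve1", "Reserve2", "Variant",
]

def decode_positional_tag(tag):
    """Map each set position ('-' means unset) to its assumed field name."""
    if len(tag) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} characters, got {len(tag)}")
    return {name: ch for name, ch in zip(FIELDS, tag) if ch != "-"}

decoded = decode_positional_tag("AAFP3----3N----")
for name, value in decoded.items():
    print(f"{name}: {value}")
```

Under these assumed names, the slide's gloss lines up with the set positions: adjective (POS), regular (SubPOS), feminine (Gender), plural (Number), and superlative (Grade).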
Penn TreeBank POS Tagset
Using the Penn Tagset
- The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
Universal Tag set
- ~12 different tags
  - NOUN, VERB, ADJ, ADV, PRON, DET, ADP, NUM, CONJ, PRT, “.”, X
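To show how a fine-grained tagset collapses into the universal one, the sketch below converts the Penn-tagged example sentence from the previous slide. The mapping `PENN_TO_UNIVERSAL` is a partial, assumed table covering only the tags in that sentence, and `parse_tagged` and `to_universal` are hypothetical helpers.

```python
# Partial, assumed mapping: covers only the Penn tags in the example sentence.
PENN_TO_UNIVERSAL = {
    "DT": "DET", "JJ": "ADJ", "NN": "NOUN", "NNS": "NOUN",
    "VBD": "VERB", "IN": "ADP", ".": ".",
}

def parse_tagged(text):
    """Split 'word/TAG' tokens on the last slash so './.' parses correctly."""
    return [tuple(tok.rsplit("/", 1)) for tok in text.split()]

def to_universal(pairs, mapping, fallback="X"):
    """Replace each Penn tag; tags missing from the mapping become `fallback`."""
    return [(word, mapping.get(tag, fallback)) for word, tag in pairs]

penn = parse_tagged(
    "The/DT grand/JJ jury/NN commented/VBD on/IN a/DT "
    "number/NN of/IN other/JJ topics/NNS ./."
)
universal = to_universal(penn, PENN_TO_UNIVERSAL)
print(" ".join(f"{w}/{t}" for w, t in universal))
```

Note that NN and NNS both map to NOUN, so the singular/plural distinction is lost in the coarser tagset.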
POS Tagging vs. Word clustering
- Words often have more than one POS: back
  - The back door = JJ
  - On my back = NN
  - Win the voters back = RB
  - Promised to back the bill = VB

These examples are from Dekang Lin
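One way to see why a single cluster per word is not enough is to collect the set of tags each word receives. The sketch below does this for a handful of observations; the four "back" entries come from the examples above, while the tags for "door" and "bill" are assumptions added for contrast.

```python
from collections import defaultdict

# (word, tag) observations. The "back" tags are from the slide;
# the tags for "door" and "bill" are assumed for illustration.
observations = [
    ("back", "JJ"), ("back", "NN"), ("back", "RB"), ("back", "VB"),
    ("door", "NN"), ("bill", "NN"),
]

# Collect every tag seen with each word.
tags_by_word = defaultdict(set)
for word, tag in observations:
    tags_by_word[word].add(tag)

# Words seen with more than one tag need context to disambiguate.
ambiguous = {w: sorted(t) for w, t in tags_by_word.items() if len(t) > 1}
print(ambiguous)
```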
How Hard is POS Tagging?
POS tag sequences
- Some tag sequences are more likely to occur than others
- POS Ngram view: https://books.google.com/ngrams/graph?content=_ADJ_+_NOUN_%2C_ADV_+_NOUN_%2C+_ADV_+_VERB_
Existing methods often model POS tagging as a sequence tagging problem
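The claim that some tag sequences are more likely than others can be checked by counting tag bigrams. The sketch below is a toy illustration, assuming a single training sequence (the tags of "the koala put the keys on the table") and a hypothetical helper `transition_prob`; it is not a full sequence tagger.

```python
from collections import Counter

# Toy data: the tag sequence of "the koala put the keys on the table".
tag_sequences = [["DET", "N", "V", "DET", "N", "P", "DET", "N"]]

bigrams = Counter()      # counts of (previous tag, current tag)
prev_counts = Counter()  # how often each tag appears as the previous tag
for tags in tag_sequences:
    for prev, cur in zip(tags, tags[1:]):
        bigrams[(prev, cur)] += 1
        prev_counts[prev] += 1

def transition_prob(prev, cur):
    """Maximum-likelihood estimate of P(cur | prev); 0.0 if prev was never seen."""
    if prev_counts[prev] == 0:
        return 0.0
    return bigrams[(prev, cur)] / prev_counts[prev]

print(transition_prob("DET", "N"))
print(transition_prob("N", "V"))
```

Estimates like these are the kind of evidence a sequence model can combine with per-word evidence when choosing tags.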
Evaluation
- How many words in the unseen test data can be tagged correctly?
- Usually evaluated on Penn Treebank
  - State of the art ~97%
  - Trivial baseline (most likely tag) ~94%
  - Human performance ~97%
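The trivial baseline and the accuracy metric can be sketched in a few lines. The data below is a made-up toy set, not the Penn Treebank, so the resulting number says nothing about the ~94% figure; `train_baseline` and `accuracy` are hypothetical helpers, and tagging unknown words as NN is an assumption.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sents):
    """For each word, remember the tag it was seen with most often."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word.lower()][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def accuracy(model, tagged_sents, unknown_tag="NN"):
    """Fraction of test tokens whose predicted tag matches the gold tag."""
    correct = total = 0
    for sent in tagged_sents:
        for word, gold in sent:
            pred = model.get(word.lower(), unknown_tag)  # unknown words: assumed NN
            correct += pred == gold
            total += 1
    return correct / total

# Toy, made-up training and test data (Penn-style tags).
train_sents = [
    [("the", "DT"), ("back", "JJ"), ("door", "NN")],
    [("on", "IN"), ("my", "PRP$"), ("back", "NN")],
    [("my", "PRP$"), ("back", "NN"), ("hurts", "VBZ")],
]
test_sents = [[("on", "IN"), ("the", "DT"), ("back", "JJ"), ("door", "NN")]]

model = train_baseline(train_sents)
print(accuracy(model, test_sents))
```

Here the baseline always tags "back" as NN, its most frequent training tag, so it misses the adjective reading in the test sentence.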