CPSC 7373: Artificial Intelligence
Lecture 13: Natural Language Processing
Jiang Bian, Fall 2012, University of Arkansas at Little Rock
Natural Language Processing
• Understanding natural languages:
  – Philosophically: we humans have defined ourselves in terms of our ability to speak with and understand each other.
  – Application-wise: we want to be able to talk to computers.
  – Learning: we want computers to be smarter, and to learn human knowledge from textbooks.
Language Models
• Two types of language models:
  – Sequences of letters/words:
    • Probabilistic: the probability of a sequence, P(word1, word2, …)
    • Mostly word-based
    • Learned from data
  – Trees and abstract structures of words:
    • Logical: L = {S1, S2, …}
    • Abstraction: trees/categories
    • Hand-coded
• Example parse tree: [S [NP [Name Sam]] [VP [Verb slept]]]
Bag of Words
• A bag rather than a sequence
• Unigram, Naïve Bayes model:
  – Each individual word is treated as a separate factor, conditionally independent of all the other words.
• It is also possible to take the sequence into account.
[Illustration: the words “HONK IF YOU LOVE THE BAG OF WORDS MODEL” shown in scrambled order]
Probabilistic Models
• P(w1 w2 w3 … wn) = P(W1:n) = ∏i P(wi | w1:i-1)
• Markov Assumption:
  – The effect of one variable on another is local;
  – the i-th word depends only on its previous k words:
    P(wi | w1:i-1) = P(wi | wi-k:i-1)
  – For a first-order Markov model: P(wi | wi-1)
• Stationary Assumption:
  – The probability of each variable is the same across positions;
  – i.e., a word's probability depends only on its surrounding words, not on which sentence it occurs in:
    P(wi | wi-1) = P(wj | wj-1)
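The two assumptions above can be sketched in a few lines. This is a minimal first-order Markov (bigram) model trained by maximum likelihood on a hypothetical toy corpus; a real model would use a much larger corpus and smoothing.

```python
from collections import Counter

# Hypothetical toy corpus standing in for real training data.
corpus = "i want to eat chinese food . i want to eat lunch .".split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def p_bigram(w, prev):
    """MLE estimate P(w | prev) = count(prev, w) / count(prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

def p_sequence(words):
    """First-order Markov chain: P(w1..wn) ~= prod_i P(wi | wi-1)."""
    p = 1.0
    for prev, w in zip(words, words[1:]):
        p *= p_bigram(w, prev)
    return p
```

Note how stationarity is built in: `p_bigram` uses the same table regardless of where in the sentence the pair occurs.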
Applications of Language Models
• Classification (e.g., spam)
• Clustering (e.g., news stories)
• Input correction (spelling, segmentation)
• Sentiment analysis (e.g., product reviews)
• Information retrieval (e.g., web search)
• Question answering (e.g., IBM’s Watson)
• Machine translation (e.g., Chinese to English)
• Speech recognition (e.g., Apple’s Siri)
N-gram Model
• An n-gram is a contiguous sequence of n items from a given sequence of text or speech.
• Language Models (LM):
  – Unigrams, Bigrams, Trigrams, …
• Applications:
  – Speech recognition / data compression
    • Predict the next word
  – Information Retrieval
    • Retrieved documents are ranked by the probability of the query under each document’s language model: P(Q | Md)
N-gram examples
• S = “I saw the red house”
  – Unigram:
    P(S) = P(I, saw, the, red, house) = P(I) P(saw) P(the) P(red) P(house)
  – Bigram (Markov assumption):
    P(S) = P(I|<s>) P(saw|I) P(the|saw) P(red|the) P(house|red) P(</s>|house)
  – Trigram:
    P(S) = P(I|<s>,<s>) P(saw|<s>,I) P(the|I,saw) P(red|saw,the) P(house|the,red) P(</s>|red,house)
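The padding pattern in these factorizations generalizes: an n-gram model pads with n-1 start symbols and one end symbol, and conditions each word on the previous n-1 tokens. A small sketch (the function name and representation are my own) that lists the conditional factors for any n:

```python
def ngram_factors(words, n):
    """List the factors P(w | context) of an n-gram factorization,
    padding with (n-1) start symbols <s> and one end symbol </s>."""
    padded = ["<s>"] * (n - 1) + words + ["</s>"]
    factors = []
    for i in range(n - 1, len(padded)):
        context = tuple(padded[i - n + 1:i])  # previous n-1 tokens
        factors.append((padded[i], context))
    return factors
```

For the slide's sentence, `ngram_factors(["I","saw","the","red","house"], 2)` yields six factors, from P(I|<s>) through P(</s>|house), matching the bigram expansion above.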
How do we train these models?
• Very large corpora: collections of text and speech
  – Shakespeare
  – Brown Corpus
  – Wall Street Journal
  – AP newswire
  – Hansards
  – TIMIT
  – DARPA/NIST text/speech corpora (Call Home, Call Friend, ATIS, Switchboard, Broadcast News, Broadcast Conversation, TDT, Communicator)
  – TRAINS, Boston Radio News Corpus
A Simple Bigram Example
• Estimate the likelihood of the sentence “I want to eat Chinese food.”
  – P(I want to eat Chinese food) = P(I|<start>) P(want|I) P(to|want) P(eat|to) P(Chinese|eat) P(food|Chinese) P(<end>|food)
• What do we need to calculate these likelihoods?
  – Bigram probabilities for each word pair in the sentence
  – Calculated from a large corpus
Early Bigram Probabilities from BERP

  Eat on        .16    Eat Thai       .03
  Eat some      .06    Eat breakfast  .03
  Eat lunch     .06    Eat in         .02
  Eat dinner    .05    Eat Chinese    .02
  Eat at        .04    Eat Mexican    .02
  Eat a         .04    Eat tomorrow   .01
  Eat Indian    .04    Eat dessert    .007
  Eat today     .03    Eat British    .001

  <start> I     .25    Want to        .65
  <start> I’d   .06    Want a         .05
  <start> Tell  .04    Want some      .04
  <start> I’m   .02    Want Thai      .01

  I want        .32    To eat         .26
  I would       .29    To have        .14
  I don’t       .08    To spend       .09
  I have        .04    To be          .02

  British food  .60    British restaurant  .15
  British cuisine  .01    British lunch    .01
• P(I want to eat British food) = P(I|<start>) P(want|I) P(to|want) P(eat|to) P(British|eat) P(food|British)
  = .25 × .32 × .65 × .26 × .001 × .60 ≈ .0000081
  – Suppose P(<end>|food) = .2?
  – How would we calculate P(I want to eat Chinese food)?
• The probabilities roughly capture “syntactic” facts and “world knowledge”:
  – eat is often followed by an NP
  – British food is not too popular
• N-gram models can be trained by counting and normalization
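The chain of multiplications above is easy to check mechanically. This sketch plugs in the bigram probabilities from the BERP table on this slide and multiplies the factors (the end-of-sentence factor is left out, as in the slide's calculation):

```python
# Bigram probabilities taken from the BERP table above.
p = {("<start>", "I"): .25, ("I", "want"): .32, ("want", "to"): .65,
     ("to", "eat"): .26, ("eat", "British"): .001, ("British", "food"): .60}

# P(I want to eat British food) under the bigram model.
prob = 1.0
for bigram in [("<start>", "I"), ("I", "want"), ("want", "to"),
               ("to", "eat"), ("eat", "British"), ("British", "food")]:
    prob *= p[bigram]
# prob comes out to about 8.1e-06
```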
Early BERP Bigram Counts

            I   want    to   eat  Chinese  food  lunch
  I         8   1087     0    13       0      0      0
  Want      3      0   786     0       6      8      6
  To        3      0    10   860       3      0     12
  Eat       0      0     2     0      19      2     52
  Chinese   2      0     0     0       0    120      1
  Food     19      0    17     0       0      0      0
  Lunch     4      0     0     0       0      1      0
Early BERP Bigram Probabilities
• Normalization: divide each row’s counts by the appropriate unigram count for wn-1:
  P(wi | wi-1) = freq(wi-1, wi) / freq(wi-1)

  Unigram counts:  I 3437   Want 1215   To 3256   Eat 938   Chinese 213   Food 1506   Lunch 459

• Computing the bigram probability of “I I”:
  – C(I, I) / C(I in all contexts)
  – P(I|I) = 8 / 3437 = .0023
• Maximum Likelihood Estimation (MLE): relative frequency
What do we learn about the language?
• What's being captured with…
  – P(want | I) = .32
  – P(to | want) = .65
  – P(eat | to) = .26
  – P(food | Chinese) = .56
  – P(lunch | eat) = .055
• What about…
  – P(I | I) = .0023       I I I I want
  – P(I | want) = .0025    I want I want
  – P(I | food) = .013     the kind of food I want is …
Approximating Shakespeare
• Generating sentences with random unigrams…
  – Every enter now severally so, let
  – Hill he late speaks; or! a more to leg less first you enter
• With bigrams…
  – What means, sir. I confess she? then all sorts, he is trim, captain.
  – Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry.
• Trigrams…
  – Sweet prince, Falstaff shall die.
  – This shall forbid it should be branded, if renown made it empty.
• Quadrigrams…
  – What! I will go seek the traitor Gloucester.
  – Will you not tell me who I am?
  – What’s coming out here looks like Shakespeare because it is Shakespeare
• Note: as we increase the value of N, the accuracy of an n-gram model increases, since the choice of the next word becomes increasingly constrained.
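Generating such sentences is just a random walk over the trained model. A minimal bigram-sampling sketch, using a hypothetical one-line corpus in place of the Shakespeare text:

```python
import random
from collections import defaultdict

random.seed(0)

# Hypothetical tiny corpus standing in for the Shakespeare text.
text = "to be or not to be that is the question".split()

# For each word, collect the words observed to follow it.
successors = defaultdict(list)
for prev, w in zip(text, text[1:]):
    successors[prev].append(w)

def generate(start, length):
    """Sample each next word from those that followed the current word."""
    out = [start]
    for _ in range(length - 1):
        choices = successors.get(out[-1])
        if not choices:
            break  # dead end: the last word never had a successor
        out.append(random.choice(choices))
    return " ".join(out)

sentence = generate("to", 8)
```

Every adjacent pair in the output is an observed bigram, which is exactly why bigram text looks locally plausible but globally incoherent.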
N-Gram Training Sensitivity
• If we repeated the Shakespeare experiment but trained our n-grams on a Wall Street Journal corpus, what would we get?
• Note: this question has major implications for corpus selection and design.

WSJ is not Shakespeare: Sentences Generated from WSJ
Probabilistic Letter Models
• The probability of a sequence of letters.
• What can we do with letter models?
  – Language identification, e.g., among EN, DE, FR, ES, AZ:
    • “Hello, World” (EN)
    • “Guten Tag, Welt” (DE)
    • “Salam Dunya” (AZ)
Language Identification
Bigram Model: the most frequent letter bigrams of three languages — which column is English, which German, which Azerbaijani?

  #   A    B    C
  1   TH   EN   IN
  2   TE   ER   AN
  3   OU   CH   ƏR
  4   AN   DE   LA
  5   ER   EI   IR
  6   IN   IN   AR
Language Identification
Trigram Model: which column is English, which German, which Azerbaijani?

  #        A       B       C
  P(the)   1.1%    0.03%   0.00%
  P(der)   0.06%   0.68%   0.00%
  P(rba)   0.00%   0.01%   0.53%
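A letter-bigram language identifier can be sketched in a few lines. This is a simplification of the slide's model: it treats the character bigrams of the input as independent (a unigram-of-bigrams model with add-one smoothing) rather than chaining them, and the mini-corpora are hypothetical stand-ins for real training text.

```python
import math
from collections import Counter

def char_bigrams(text):
    text = " " + text.lower() + " "  # pad so word boundaries count
    return [text[i:i + 2] for i in range(len(text) - 1)]

def train(samples):
    counts = Counter(char_bigrams(" ".join(samples)))
    return counts, sum(counts.values())

def score(text, model, vocab=1000):
    """Log-probability of text's character bigrams, add-one smoothed."""
    counts, total = model
    return sum(math.log((counts[bg] + 1) / (total + vocab))
               for bg in char_bigrams(text))

# Hypothetical mini-corpora; real language ID needs far more data.
en = train(["hello world", "this is a file full of english words"])
de = train(["hallo welt", "dies ist eine datei voll von deutschen worten"])

def identify(text):
    return max([("EN", en), ("DE", de)], key=lambda m: score(text, m[1]))[0]
```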
Classification

  People           Places               Drugs
  Steve Jobs       San Francisco        Lipitor
  Bill Gates       Palo Alto            Prevacid
  Andy Grove       Stern Grove          Zoloft
  Larry Page       San Mateo            Zocor
  Andrew Ng        Santa Cruz           Plavix
  Jennifer Widom   New York             Protonix
  Daphne Koller    New Jersey           Celebrex
  Noah Goodman     Jersey City          Zyrtec
  Julie Zelinski   South San Francisco  Aggrenox

Possible classifiers: Naïve Bayes, k-Nearest Neighbor, Support Vector Machine, Logistic Regression… or the gzip command?
Gzip
• EN: “Hello world! This is a file full of English words…”
• DE: “Hallo Welt! Dies ist eine Datei voll von deutschen Worten…”
• AZ: “Salam Dunya! Bu fayl AzƏrbaycan tam sozlƏr…”
• Given a new piece of text to be classified, append it to each language file, compress, and pick the language whose compressed file is smallest:

  (echo `cat new EN | gzip | wc -c` EN; \
   echo `cat new DE | gzip | wc -c` DE; \
   echo `cat new AZ | gzip | wc -c` AZ) \
  | sort -n | head -1
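The same trick works in Python via the standard `gzip` module: text that shares letter statistics (and substrings) with a corpus adds few extra compressed bytes when appended to it. The corpora below are hypothetical toy strings, repeated so gzip has something to work with.

```python
import gzip

# Hypothetical toy corpora; real classification would use large files.
corpora = {
    "EN": "hello world this is a file full of english words " * 20,
    "DE": "hallo welt dies ist eine datei voll von deutschen worten " * 20,
}

def gzip_score(candidate, corpus):
    """Extra compressed bytes needed to append `candidate` to `corpus`."""
    base = len(gzip.compress(corpus.encode()))
    combined = len(gzip.compress((corpus + " " + candidate).encode()))
    return combined - base

def classify(candidate):
    """Pick the language whose corpus grows least when the text is added."""
    return min(corpora, key=lambda lang: gzip_score(candidate, corpora[lang]))
```

This is essentially a compression-based approximation of a letter model: gzip's back-references and Huffman codes implicitly encode the corpus's letter statistics.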
Segmentation
• Given a sequence of characters, how do we break it up into meaningful segments?
  – e.g., 羽西中国新锐画家大奖 (Chinese is written without spaces between words)
• Written English has spaces between words:
  – e.g., “words have spaces”
  – but not in speech recognition,
  – or in URLs: choosespain.com
    • Choose Spain
    • Chooses pain

Segmentation
• The best segmentation is the one that maximizes the joint probability of the segmentation:
  – S* = argmax P(w1:n) = argmax ∏i P(wi | w1:i-1)
  – Markov assumption: S* ≈ argmax ∏i P(wi | wi-1)
  – Naïve Bayes assumption (words don’t depend on each other): S* ≈ argmax ∏i P(wi)
Segmentation
• “nowisthetime”: 12 letters
  – How many possible segmentations?
    • n-1
    • (n-1)^2
    • (n-1)!
    • 2^(n-1)  (each of the n-1 gaps between letters either is or is not a break)
• Naïve Bayes assumption:
  – S* = argmax over splits s = f + r of P(f) P(S*(r)), where f is the first word and r is the rest
  – 1) Computationally easy
  – 2) Learning is easier: it’s easier to estimate the unigram probabilities
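The recursion S* = argmax over f + r of P(f) P(S*(r)) is a natural fit for memoization, since the same suffix r recurs in many splits. A sketch under the Naïve Bayes assumption, with hypothetical unigram probabilities (a real segmenter learns these from a corpus):

```python
from functools import lru_cache

# Hypothetical unigram probabilities; learned from data in practice.
P1 = {"now": .003, "is": .01, "the": .05, "time": .002,
      "no": .004, "i": .01, "he": .005}

def Pword(w):
    return P1.get(w, 1e-10)  # tiny floor probability for unknown words

@lru_cache(maxsize=None)
def segment(text):
    """Return (probability, word list) of the best segmentation:
    S* = argmax over first/rest splits of P(first) * P(segment(rest))."""
    if not text:
        return 1.0, []
    best_p, best_seg = 0.0, [text]
    for i in range(1, len(text) + 1):
        first, rest = text[:i], text[i:]
        p_rest, seg_rest = segment(rest)
        p = Pword(first) * p_rest
        if p > best_p:
            best_p, best_seg = p, [first] + seg_rest
    return best_p, best_seg
```

On the slide's example, `segment("nowisthetime")` recovers the split "now is the time", because any split containing an unknown word is penalized by the 1e-10 floor.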
Best Segmentation
• S* = argmax s=f+r P(f) P(S*(r))
• “nowisthetime”

  f      r       P(f)      P(S*(r))
  n      owis…   .000001   10^-19
  no     wis…    .004      10^-13
  now    is…     .003      10^-10
  nowi   st…     -         10^-18
Segmentation Examples
• Trained on a 4-billion-word corpus, e.g.:
  – “Baseratesoughtto”
    • Base rate sought to
    • Base rates ought to
  – “smallandinsignificant”
    • small and in significant
    • small and insignificant
  – “Ginormousego”
    • G in or mouse go
    • Ginormous ego
• What can we do to improve?
  1) More data?
  2) Weaken the Markov assumption?
  3) Smoothing?
Spelling Correction
• Given a misspelled word w, find the best correction:
  – C* = argmaxc P(c|w)
  – Bayes theorem: C* = argmaxc P(w|c) P(c)
    • P(c): from data counts
    • P(w|c): from spelling correction data

Spelling Data
• c: w => P(w|c)
  – pulse: pluse
  – elegant: elagent, elligit
  – second: secand, sexeon, secund, seconnd, seond, sekon
  – sailed: saled, saild
  – blouse: boludes
  – thunder: thounder
  – cooking: coking, chocking, kooking, cocking
  – fossil: fosscil
• We cannot enumerate all the common misspelling cases.
  – Instead, use letter-based (edit) models, e.g., ul:lu
Correction Example
• w = “thew” => P(w|c) P(c)

  w      c      w|c     P(w|c)     P(c)         10^9 × P(w|c)P(c)
  thew   the    ew|e    0.000007   .02          144
  thew   thew           0.95       0.00000009   90
  thew   thaw   e|a     0.001      0.0000007    0.7
  thew   threw  h|hr    0.000008   0.000004     0.03
  thew   thwe   ew|we   0.000003   0.00000004   0.0001
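The table's noisy-channel scoring can be reproduced directly. The probabilities below are the ones from this slide (restricted to three candidates); a real corrector would generate candidates with an edit model and estimate both tables from data.

```python
# Language model P(c) and channel model P(w|c), taken from the slide's table.
P_c = {"the": .02, "thaw": .0000007, "thew": .00000009}
P_w_given_c = {("thew", "the"): .000007,
               ("thew", "thaw"): .001,
               ("thew", "thew"): .95}

def correct(w):
    """Noisy channel correction: C* = argmax_c P(w|c) P(c)."""
    candidates = [c for (word, c) in P_w_given_c if word == w]
    return max(candidates, key=lambda c: P_w_given_c[(w, c)] * P_c[c])
```

As in the table, the high prior P(the) outweighs the high channel probability of the no-change candidate "thew", so "the" wins.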
Sentence Structure
• P(Fed raises interest rates) = ???
• Two possible parse trees:
  – [S [NP [N Fed]] [VP [V raises] [NP [N interest] [N rates]]]]
  – [S [NP [N Fed] [N raises]] [VP [V interest] [NP [N rates]]]]
Context Free Grammar Parsing
• Sentence structure trees are constructed according to a grammar.
• A grammar is a list of rules, e.g.:
  – S -> NP VP
  – NP -> N | D N (determiners, e.g., the, a) | NN | NNN (e.g., mortgage interest rates)
  – VP -> V NP | V | V NP NP (e.g., give me the money)
  – N -> interest | Fed | rates | raises
  – V -> interest | rates | raises
  – D -> the | a
Ambiguity
How many parsing options do I have???
• The Fed raises interest rates
• The Fed raises raises
• Raises raises interest raises

Ambiguity
How many parsing options do I have???
• The Fed raises interest rates (2)
  – The Fed (NP) raises (V) interest rates (NP)
  – The Fed raises (NP) interest (V) rates (NP)
• The Fed raises raises (1)
  – The Fed (NP) raises (V) raises (NP)
• Raises raises interest raises (4)
  – Raises (NP) raises (V) interest raises (NP)
  – Raises (NP) raises (V) interest (NP) raises (NP)
  – Raises raises (NP) interest (V) raises (NP)
  – Raises raises interest (NP) raises (V)
Problems and Solutions
Problems:                                                          T   F
  Is it easy to omit good parses?
  Is it easy to include bad parses?
  Are trees unobservable?
Solutions:                                                         T   F
  A probabilistic view of the trees?
  Consider word associations in the trees?
  Make the grammar unambiguous (like in programming languages)?

Problems and Solutions (answers)
Problems:                                                          T   F
  Is it easy to omit good parses?                                  X
  Is it easy to include bad parses?                                X
  Are trees unobservable?                                          X
Solutions:                                                         T   F
  A probabilistic view of the trees?                               X
  Consider word associations in the trees?                         X
  Make the grammar unambiguous (like in programming languages)?        X
Problems of writing grammars
• Natural languages are messy, unorganized things that evolved throughout human history in a variety of contexts.
• It is inherently hard to specify a set of grammar rules that covers all possibilities without introducing errors.
• Ambiguity is the “enemy”…
Probabilistic Context-Free Grammar
• S -> NP VP (1)
  – NP -> N (.3) | DN (.2) | NN (.2) | NNN (.1)
  – VP -> V NP (.4) | V (.4) | V NP NP (.2)
  – N -> interest (.3) | Fed (.3) | rates (.3) | raises (.1)
  – V -> interest (.1) | rates (.3) | raises (.6)
  – D -> the (.5) | a (.5)
Probabilistic Context-Free Grammar
• S -> NP VP (1)
  – NP -> N (.3) | DN (.2) | NN (.2) | NNN (.1)
  – VP -> V NP (.4) | V (.4) | V NP NP (.2)
  – N -> interest (.3) | Fed (.3) | rates (.3) | raises (.1)
  – V -> interest (.1) | rates (.3) | raises (.6)
  – D -> the (.5) | a (.5)
• Parse of “Fed raises interest rates”:
  [S (1) [NP (.3) [N Fed (.3)]] [VP (.4) [V raises (.6)] [NP (.2) [N interest (.3)] [N rates (.3)]]]]
• P(tree) = 1 × .3 × .3 × .4 × .6 × .2 × .3 × .3 = .0003888 ≈ 0.039%
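The probability of a PCFG parse is just the product of the probabilities of the rules it uses, which is easy to compute over a nested-tuple tree. The rule table below is the slide's grammar (restricted to the rules this parse uses); the tuple representation is my own.

```python
# Rule probabilities from the slide's grammar.
rule_p = {("S", ("NP", "VP")): 1.0,
          ("NP", ("N",)): .3, ("NP", ("N", "N")): .2,
          ("VP", ("V", "NP")): .4,
          ("N", ("Fed",)): .3, ("N", ("interest",)): .3,
          ("N", ("rates",)): .3, ("V", ("raises",)): .6}

# Tree as nested tuples: (label, child, child, ...); leaves are strings.
tree = ("S",
        ("NP", ("N", "Fed")),
        ("VP", ("V", "raises"),
               ("NP", ("N", "interest"), ("N", "rates"))))

def tree_prob(t):
    """P(tree) = product of the probabilities of every rule used."""
    if isinstance(t, str):
        return 1.0  # a leaf word contributes no rule
    label, children = t[0], t[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = rule_p[(label, rhs)]
    for c in children:
        p *= tree_prob(c)
    return p
```

Running `tree_prob(tree)` reproduces the slide's product 1 × .3 × .3 × .4 × .6 × .2 × .3 × .3 ≈ .00039.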
Probabilistic Context-Free Grammar
• S -> NP VP (1)
  – NP -> N (.3) | DN (.2) | NN (.2) | NNN (.1)
  – VP -> V NP (.4) | V (.4) | V NP NP (.2)
  – N -> interest (.3) | Fed (.3) | rates (.3) | raises (.1)
  – V -> interest (.1) | rates (.3) | raises (.6)
  – D -> the (.5) | a (.5)
• Two parses of “Raises raises interest rates”:
  – [S [NP [N Raises]] [VP [V raises] [NP [N interest] [N rates]]]]   P() = ???%
  – [S [NP [N Raises] [N raises]] [VP [V interest] [NP [N rates]]]]   P() = ???%
Probabilistic Context-Free Grammar
• S -> NP VP (1)
  – NP -> N (.3) | DN (.2) | NN (.2) | NNN (.1)
  – VP -> V NP (.4) | V (.4) | V NP NP (.2)
  – N -> interest (.3) | Fed (.3) | rates (.3) | raises (.1)
  – V -> interest (.1) | rates (.3) | raises (.6)
  – D -> the (.5) | a (.5)
• Two parses of “Raises raises interest rates”:
  – [S [NP [N Raises]] [VP [V raises] [NP [N interest] [N rates]]]]   P() = .012%
  – [S [NP [N Raises] [N raises]] [VP [V interest] [NP [N rates]]]]   P() = .00072%
Statistical Parsing
• S -> NP VP (1)
  – NP -> N (.3) | DN (.2) | NN (.2) | NNN (.1)
  – VP -> V NP (.4) | V (.4) | V NP NP (.2)
  – N -> interest (.3) | Fed (.3) | rates (.3) | raises (.1)
  – V -> interest (.1) | rates (.3) | raises (.6)
  – D -> the (.5) | a (.5)
• Where do these probabilities come from?
  – Training on a large annotated corpus
    • e.g., the Penn Treebank Project (1990), which annotates naturally occurring text for linguistic structure:

The Penn Treebank Project
  ( (S
      (NP-SBJ (NN Stock-market) (NNS tremors) )
      (ADVP-TMP (RB again) )
      (VP (VBD shook)
        (NP (NN bond) (NNS prices) ) (, ,)
        (SBAR-TMP (IN while)
          (S
            (NP-SBJ (DT the) (NN dollar) )
            (VP (VBD turned)
              (PRT (RP in) )
              (NP-PRD (DT a) (VBN mixed) (NN performance) )))))
      (. .) ))
Resolving Ambiguity
• Ambiguity:
  – Syntactic: more than one possible structure for the same string of words.
    • e.g., “We need more intelligent leaders.”
      – need more, or more intelligent?
  – Lexical (homonymy): a word form has more than one meaning.
    • e.g., “Did you see the bat?”
    • e.g., “Where is the bank?”
“The boy saw the man with the telescope”
• Two parses, depending on where the PP “with the telescope” attaches:
  – [S [NP [Det The] [N boy]] [VP [V saw] [NP [Det the] [N man] [PP [P with] [NP [Det the] [N telescope]]]]]]
    (the man has the telescope)
  – [S [NP [Det The] [N boy]] [VP [V saw] [NP [Det the] [N man]] [PP [P with] [NP [Det the] [N telescope]]]]]
    (the seeing was done with the telescope)
Lexicalized PCFG
• CFG:
  – rules: VP -> V NP NP
• PCFG:
  – P(VP -> V NP NP | lhs = VP) = .2
• LPCFG:
  – P(VP -> V NP NP | V = ‘gave’) = .25
    • e.g., “Gave me the knife”
  – P(VP -> V NP NP | V = ‘said’) = .0001
    • e.g., “I said my piece”
  – P(VP -> V | V = ‘quake’) =
  – P(VP -> V NP | V = ‘quake’) = 0.0001
    • i.e., dictionaries list quake as a verb used without an object
    • ??? but on the web, “quake the earth” has 595,000 Google results ???
LPCFG
• “The boy saw the man with the telescope”
  – P(NP -> NP PP | H(NP) = man, PP = with/telescope)
• “The boy saw the man with the telescope”
  – P(VP -> V NP PP | V = saw, H(NP) = man, PP = with/telescope)
• These probabilities are hard to estimate, since we are conditioning on quite a few specific words.
  – Back-off: instead of conditioning on H(NP) = man, we can condition on any head noun.
Parsing into a Tree
• S -> NP VP (1)
  – NP -> N (.3) | DN (.2) | NN (.2) | NNN (.1)
  – VP -> V NP (.4) | V (.4) | V NP NP (.2)
  – N -> interest (.3) | Fed (.3) | rates (.3) | raises (.1)
  – V -> interest (.1) | rates (.3) | raises (.6)
  – D -> the (.5) | a (.5)
• “Fed raises interest rates”: first tag each word with its candidate categories (Fed: N; raises: N or V; interest: N or V; rates: N or V), then build NPs, VPs, and S bottom-up:
  [S [NP [N Fed]] [VP [V raises] [NP [N interest] [N rates]]]]
Machine Translation
• Translation can operate at different levels of Vauquois’ pyramid:
  – Word:      Yo lo haré mañana. => I will do it tomorrow.
  – Phrase:    Yo lo haré mañana. => I will do it tomorrow.
  – Tree:      Yo lo haré mañana. => I will do it tomorrow.  (map the source parse [S [NP] [VP]] to a target tree)
  – Semantics: Yo lo haré mañana. => Action: doing + Time: tomorrow => I will do it tomorrow.
Phrase-based Translation Model
• The models define probabilities over inputs, e.g.:

  Morgen    fliege       ich   nach Kanada   zur Konferenz
  Tomorrow  I will fly   to the conference   in Canada

• Segmentation: what is the probability of a specific phrase segmentation of both languages?
• Translation: what is the probability of a foreign phrase being translated as a particular English phrase?
• Distortion: what is the probability of a word/phrase changing its ordering?
• Is the translated English text the best sentence? (from the language model)
Statistical Machine Translation
• Components: translation model, language model, decoder
  – Foreign/English parallel text --(statistical analysis)--> translation model
  – English text --(statistical analysis)--> language model
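The components above combine in a noisy-channel decoder: among candidate English sentences e, pick the one maximizing P(f | e) · P(e), i.e. translation-model score times language-model score. A toy sketch with hypothetical probability tables standing in for trained models (a real decoder also searches over candidates rather than enumerating them):

```python
# Hypothetical P(f | e) for the foreign sentence f = "Yo lo haré mañana."
translation_model = {
    "I will do it tomorrow .": .4,
    "I it will do tomorrow .": .4,   # faithful but unnatural word order
    "Tomorrow will I do it .": .2,
}
# Hypothetical language model P(e): prefers fluent English.
language_model = {
    "I will do it tomorrow .": .001,
    "I it will do tomorrow .": .00001,
    "Tomorrow will I do it .": .0002,
}

def decode(candidates):
    """Noisy channel: e* = argmax_e P(f | e) * P(e)."""
    return max(candidates,
               key=lambda e: translation_model[e] * language_model[e])
```

Note how the language model breaks the tie between the two equally faithful translations in favor of the fluent one.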
Machine Translation Algorithm