6.891: Lecture 7 (September 29, 2003): Log-Linear Models
Overview
• Anatomy of a tagger
• An alternative: hidden Markov model taggers
Our Goal
Training set:
1 Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ./.
2 Mr./NNP Vinken/NNP is/VBZ chairman/NN of/IN Elsevier/NNP N.V./NNP ,/, the/DT Dutch/NNP publishing/VBG group/NN ./.
3 Rudolph/NNP Agnew/NNP ,/, 55/CD years/NNS old/JJ and/CC chairman/NN of/IN Consolidated/NNP Gold/NNP Fields/NNP PLC/NNP ,/, was/VBD named/VBN a/DT nonexecutive/JJ director/NN of/IN this/DT British/JJ industrial/JJ conglomerate/NN ./.
...
38,219 It/PRP is/VBZ also/RB pulling/VBG 20/CD people/NNS out/IN of/IN Puerto/NNP Rico/NNP ,/, who/WP were/VBD helping/VBG Huricane/NNP Hugo/NNP victims/NNS ,/, and/CC sending/VBG them/PRP to/TO San/NNP Francisco/NNP instead/RB ./.
• From the training set, induce a function or "program" that maps new sentences to their tag sequences.
Our Goal (continued)
• A test data sentence:
  Influential members of the House Ways and Means Committee introduced legislation that would restrict how the new savings-and-loan bailout agency can raise capital , creating another potential obstacle to the government 's sale of sick thrifts .
• Should be mapped to underlying tags:
  Influential/JJ members/NNS of/IN the/DT House/NNP Ways/NNP and/CC Means/NNP Committee/NNP introduced/VBD legislation/NN that/WDT would/MD restrict/VB how/WRB the/DT new/JJ savings-and-loan/NN bailout/NN agency/NN can/MD raise/VB capital/NN ,/, creating/VBG another/DT potential/JJ obstacle/NN to/TO the/DT government/NN 's/POS sale/NN of/IN sick/JJ thrifts/NNS ./.
• Our goal is to minimize the number of tagging errors on sentences not seen in the training set.
Some Data Structures
• A word is a symbol that is a member of the word set V
  e.g., V = set of all possible ASCII strings
• A tag is a symbol that is a member of the tag set T
  e.g., T = {NN, NNS, JJ, IN, Vt, ...}
• A word sequence S is an array of words
  S.length = number of words in the array
  S(j) = the j'th word in the array
• A tag sequence T is an array of tags
  T.length = number of tags in the array
  T(j) = the j'th tag in the array
• The training data D is an array of paired word/tag sequences
  D.length = number of sequences in the training data
  D.S_i = the i'th word sequence in the training data
  D.T_i = the i'th tag sequence in the training data
  Note: D.S_i.length = D.T_i.length for all i
• A parameter vector W is an array of reals (doubles)
  W.length = number of parameters
  W_k = the k'th parameter value
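The lecture leaves these data structures abstract; as one possible concretization, here is a minimal Python sketch (type and field names are illustrative, not from the lecture):

```python
from dataclasses import dataclass, field
from typing import List

# A word sequence S and a tag sequence T are just lists of strings.
# Note the lecture uses 1-based indexing (S(j) is the j'th word); Python lists are 0-based.
WordSequence = List[str]
TagSequence = List[str]

@dataclass
class TrainingData:
    """The training data D: paired word/tag sequences (D.S_i and D.T_i)."""
    word_seqs: List[WordSequence] = field(default_factory=list)
    tag_seqs: List[TagSequence] = field(default_factory=list)

    def __len__(self) -> int:
        # D.length = number of sequences; word_seqs[i] and tag_seqs[i] have equal length
        return len(self.word_seqs)

# A parameter vector W is a flat array of doubles; W_k is the k'th parameter value.
ParameterVector = List[float]
```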
FAQ Segmentation: McCallum et al.

<head> X-NNTP-POSTER: NewsHound v1.33
<head>
<head> Archive name: acorn/faq/part2
<head> Frequency: monthly
<head>

<question> 2.6) What configuration of serial cable should I use
<answer>
<answer> Here follows a diagram of the necessary connections
<answer> programs to work properly. They are as far as I know t
<answer> agreed upon by commercial comms software developers fo
<answer>
<answer> Pins 1, 4, and 8 must be connected together inside
<answer> is to avoid the well known serial port chip bugs. The

• Tags = <head>, <question>, <answer>
• Words = entire sentences, e.g., "is to avoid the well known serial port chip bugs. The"
Three Functions
• TRAIN is the training algorithm. TRAIN(D) returns a parameter vector W.
• TAGGER is the function which tags a new sentence. TAGGER(S, W) returns a tag sequence T.
  ⇒ All of the information from the training set is captured in the parameter vector W.
• PROB returns a probability distribution over the possible tags, i.e.,
  PROB(S, j, t_{-2}, t_{-1}, W) = P(· | S, j, t_{-2}, t_{-1}, W)
Three Functions
• TRAIN and PROB can be implemented with log-linear models.
• How do we implement TAGGER(S, W), given access to PROB?
Log-Linear Taggers: Independence Assumptions
• The input sentence S, with length n = S.length, has |T|^n possible tag sequences.
• Each tag sequence T has a conditional probability
  P(T | S) = prod_{j=1..n} P(T(j) | S, j, T(1) ... T(j-1))     (chain rule)
           = prod_{j=1..n} P(T(j) | S, j, T(j-2), T(j-1))      (independence assumptions)
• We have a black-box PROB which can compute the P(T(j) | S, j, T(j-2), T(j-1)) terms.
• How do we compute TAGGER(S, W), where
  TAGGER(S, W) = argmax_{T ∈ T^n} P(T | S)
               = argmax_{T ∈ T^n} log P(T | S)
The Viterbi Algorithm
• Define n = S.length, i.e., the length of the sentence
• Define a dynamic programming table
  π[i, t_{-2}, t_{-1}] = maximum log probability of a tag sequence ending in tags t_{-2}, t_{-1} at position i
• Our goal is to calculate max_{t_{-2}, t_{-1} ∈ T} π[n, t_{-2}, t_{-1}]
The Viterbi Algorithm: Recursive Definitions
• Base case:
  π[0, *, *] = log 1 = 0
  π[0, t_{-2}, t_{-1}] = log 0 = −∞ for all other t_{-2}, t_{-1}
  Here * is a special tag padding the beginning of the sentence.
• Recursive case: for i = 1 ... S.length, for all t_{-2}, t_{-1},
  π[i, t_{-2}, t_{-1}] = max_{t ∈ T ∪ {*}} { π[i−1, t, t_{-2}] + log P(t_{-1} | S, i, t, t_{-2}) }
  Backpointers allow us to recover the max probability sequence:
  BP[i, t_{-2}, t_{-1}] = argmax_{t ∈ T ∪ {*}} { π[i−1, t, t_{-2}] + log P(t_{-1} | S, i, t, t_{-2}) }
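As an illustration, here is a minimal Python sketch of this recursion. The interface log_prob(S, i, t2, t1, t), returning log P(t | S, i, t2, t1), is a hypothetical stand-in for the PROB black box; all names are illustrative.

```python
import math
from itertools import product

def viterbi(S, tags, log_prob):
    """Viterbi decoding for a trigram tagger.

    S        : list of words
    tags     : the tag set T
    log_prob : log_prob(S, i, t2, t1, t) = log P(tag t at position i | S, i,
               tag t2 at position i-2, tag t1 at position i-1)
    Returns the highest-probability tag sequence.
    """
    n = len(S)
    if n == 0:
        return []
    STAR = "*"                                  # padding tag for the start of the sentence
    prev_tags = list(tags) + [STAR]             # tags allowed at positions before 1

    pi = {(0, STAR, STAR): 0.0}                 # pi[i, a, b]: best log prob ending in tags a, b at i
    bp = {}                                     # backpointers

    for i in range(1, n + 1):
        for a, b in product(prev_tags, tags):   # a = tag at i-1, b = tag at i
            candidates = [(pi[(i - 1, t, a)] + log_prob(S, i, t, a, b), t)
                          for t in prev_tags if (i - 1, t, a) in pi]
            if candidates:
                pi[(i, a, b)], bp[(i, a, b)] = max(candidates)

    # best final tag pair, then follow backpointers to recover the full sequence
    a, b = max((k[1:] for k in pi if k[0] == n), key=lambda ab: pi[(n, *ab)])
    seq = [a, b]
    for i in range(n, 2, -1):
        seq.insert(0, bp[(i, seq[0], seq[1])])
    return seq[-n:]
```

The nested loops over (a, b, t) mirror the O(n|T|^3) running time discussed on the next slide.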
The Viterbi Algorithm: Running Time
• O(n|T|^3) time to calculate log P(t_{-1} | S, i, t, t_{-2}) for all i, t, t_{-2}, t_{-1}.
• n|T|^2 entries in π to be filled in.
• O(|T|) time to fill in one entry (assuming O(1) time to look up log P(t_{-1} | S, i, t, t_{-2})).
• ⇒ O(n|T|^3) time overall.
Coming Next...
• How to implement PROB and TRAIN
A New Data Structure: Sparse Binary Arrays
• A sparse array A is an array of integers
  A.length = length of the array
  A(k) = the k'th integer in the array
• For example, the binary vector
  100010000010010
  is represented as
  A.length = 4
  A(1) = 1, A(2) = 5, A(3) = 11, A(4) = 14
Why the Need for Sparse Binary Arrays?
From last lecture:
  φ_1(h, t) = 1 if current word w_i is "base" and t = Vt, 0 otherwise
  φ_2(h, t) = 1 if current word w_i ends in "ing" and t = VBG, 0 otherwise
  ...
  φ_1(⟨JJ, DT, ⟨Hispaniola, ...⟩, 6⟩, Vt) = 1
  φ_2(⟨JJ, DT, ⟨Hispaniola, ...⟩, 6⟩, Vt) = 0
  ...
  φ(⟨JJ, DT, ⟨Hispaniola, ...⟩, 6⟩, Vt) = 1001011001001100110
Operations on Sparse Binary Arrays
• INSERT(A, 12):
  before: A.length = 4
          A(1) = 1, A(2) = 5, A(3) = 11, A(4) = 14
  after:  A.length = 5
          A(1) = 1, A(2) = 5, A(3) = 11, A(4) = 14, A(5) = 12
• Inner products with a parameter vector: IP(A, W) returns the value of the inner product W · A
• Adding a sparse vector to a parameter vector: ADD(W, A, c) sets W ← W + c · A
• FEATUREVECTOR(S, j, t_{-2}, t_{-1}, t) returns a sparse feature array
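A minimal sketch of these operations in Python, assuming a sparse array is simply a list of feature indices and W is a list of doubles (0-based indices here, unlike the 1-based indices above):

```python
def insert(A, k):
    """INSERT(A, k): append feature index k to the sparse array."""
    A.append(k)

def ip(A, W):
    """IP(A, W): inner product of a sparse binary vector A with W."""
    return sum(W[k] for k in A)

def add(W, A, c):
    """ADD(W, A, c): set W <- W + c * A for a sparse binary vector A."""
    for k in A:
        W[k] += c
```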
Implementing FEATUREVECTOR
• Intermediate step: map a history/tag pair to a set of feature strings
  Hispaniola/NNP quickly/RB became/VB an/DT important/JJ base/Vt from which Spain expanded its empire into the rest of the Western Hemisphere .
  e.g., Ratnaparkhi's features:
  "TAG=Vt;WORD=base"
  "TAG=Vt;TAG-1=JJ"
  "TAG=Vt;TAG-1=JJ;TAG-2=DT"
  "TAG=Vt;SUFF1=e"
  "TAG=Vt;SUFF2=se"
  "TAG=Vt;SUFF3=ase"
  "TAG=Vt;WORD-1=important"
  "TAG=Vt;WORD+1=from"
Implementing FEATUREVECTOR
• Next step: map strings to integers through a hash table
  Hispaniola/NNP quickly/RB became/VB an/DT important/JJ base/Vt from which Spain expanded its empire into the rest of the Western Hemisphere .
  e.g., Ratnaparkhi's features:
  "TAG=Vt;WORD=base" → 1315
  "TAG=Vt;TAG-1=JJ" → 17
  "TAG=Vt;TAG-1=JJ;TAG-2=DT" → 32908
  "TAG=Vt;SUFF1=e" → 459
  "TAG=Vt;SUFF2=se" → 1000
  "TAG=Vt;SUFF3=ase" → 1509
  "TAG=Vt;WORD-1=important" → 1806
  "TAG=Vt;WORD+1=from" → 300
• In this case, the sparse array is:
  A.length = 8; A(1...8) = {1315, 17, 32908, 459, 1000, 1509, 1806, 300}
Implementing FEATUREVECTOR

FEATUREVECTOR(S, j, t_{-2}, t_{-1}, t)
  A.length = 0
  String = "TAG=t;WORD=S(j)"
  tmp = hash(String)
  INSERT(A, tmp)
  String = "TAG=t;TAG-1=t_{-1}"
  tmp = hash(String)
  INSERT(A, tmp)
  String = "TAG=t;TAG-1=t_{-1};TAG-2=t_{-2}"
  tmp = hash(String)
  INSERT(A, tmp)
  Return A
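One simple way to realize the hash step in Python is a dictionary that assigns each new feature string the next free integer during training (an illustrative choice, not necessarily how Ratnaparkhi's tagger does it):

```python
feature_index = {}          # feature string -> integer id, grown during training

def hash_feature(s, grow=True):
    """Map a feature string to an integer id."""
    if s not in feature_index:
        if not grow:
            return None     # unseen feature at test time: ignore it
        feature_index[s] = len(feature_index)
    return feature_index[s]

def feature_vector(S, j, t2, t1, t, grow=True):
    """FEATUREVECTOR(S, j, t_-2, t_-1, t) for the three templates above (j is 1-based)."""
    strings = [
        f"TAG={t};WORD={S[j - 1]}",
        f"TAG={t};TAG-1={t1}",
        f"TAG={t};TAG-1={t1};TAG-2={t2}",
    ]
    ids = [hash_feature(s, grow) for s in strings]
    return [k for k in ids if k is not None]
```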
Implementing PROB

PROB(S, j, t_{-2}, t_{-1}, W)
  Z = 0
  For each t ∈ T
    A = FEATUREVECTOR(S, j, t_{-2}, t_{-1}, t)
    ip = IP(A, W)
    Score(t) = exp(ip)
    Z = Z + Score(t)
  For each t ∈ T
    Prob(t) = Score(t)/Z
  Return Prob
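A direct Python transcription of PROB, taking the tag set and feature function as arguments (for example, the feature_vector sketch above). One practical note: a real implementation would subtract the maximum score before exponentiating, as done here, to avoid overflow.

```python
import math

def prob(S, j, t2, t1, W, tags, feature_vector):
    """PROB(S, j, t_-2, t_-1, W): a distribution over the tags in `tags`."""
    scores = {}
    for t in tags:
        A = feature_vector(S, j, t2, t1, t)
        scores[t] = sum(W[k] for k in A)                  # inner product IP(A, W)
    m = max(scores.values())                              # subtract max for numerical stability
    exp_scores = {t: math.exp(s - m) for t, s in scores.items()}
    Z = sum(exp_scores.values())
    return {t: v / Z for t, v in exp_scores.items()}
```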
Implementing TRAIN
• Possible training algorithms: iterative scaling, conjugate gradient methods
• Possible loss functions: log-likelihood, log-likelihood plus a Gaussian prior
• All of these methods are iterative
• The methods all require the following functions of D and W at each iteration:
  L(W) = sum_i log P(y_i | x_i, W)                      (log-likelihood)
  H    = sum_i φ(x_i, y_i)                              (empirical counts)
  E(W) = sum_i sum_y P(y | x_i, W) φ(x_i, y)            (expected counts)
• Next: implement STATISTICS(D, W), which returns L(W), E(W), H
Implementing STATISTICS

STATISTICS(D, W)
  H = 0, E = 0, L = 0
  m = D.length
  for i = 1 ... m
    n = D.S_i.length
    for j = 1 ... n
      t = D.T_i(j)
      t_{-1} = D.T_i(j−1)
      t_{-2} = D.T_i(j−2)
      (with t_{-1}, t_{-2} taken to be the padding tag * when j−1 or j−2 is less than 1)
      A = FEATUREVECTOR(D.S_i, j, t_{-2}, t_{-1}, t)
      ADD(H, A, 1)
      Prob = PROB(D.S_i, j, t_{-2}, t_{-1}, W)
      L = L + log Prob(t)
      for each t' ∈ T
        A = FEATUREVECTOR(D.S_i, j, t_{-2}, t_{-1}, t')
        p = Prob(t')
        ADD(E, A, p)
  Return (L, E, H)
Implementing TRAIN with Iterative Scaling

TRAIN(D)
  W = 0
  DO:
    (L, E, H) = STATISTICS(D, W)
    for k = 1 ... W.length
      W_k = W_k + (1/C) log(H_k / E_k)
  UNTIL:
    L does not change significantly
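A sketch of this loop in Python. The slide leaves the constant C unspecified; in generalized iterative scaling it is usually taken to be at least the maximum number of features active on any single training example. The statistics argument stands in for the STATISTICS function above; all names are illustrative.

```python
import math

def train_iterative_scaling(D, num_features, C, statistics, tol=1e-4, max_iters=100):
    """TRAIN(D) by iterative scaling, given statistics(D, W) -> (L, E, H)."""
    W = [0.0] * num_features
    prev_L = float("-inf")
    for _ in range(max_iters):
        L, E, H = statistics(D, W)
        for k in range(num_features):
            if H[k] > 0 and E[k] > 0:                 # skip features with zero counts
                W[k] += (1.0 / C) * math.log(H[k] / E[k])
        if abs(L - prev_L) < tol:                     # UNTIL L does not change significantly
            break
        prev_L = L
    return W
```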
Summary
Implemented the following functions:
• FEATUREVECTOR(S, j, t_{-2}, t_{-1}, t) maps a history/tag pair to a sparse array
• PROB(S, j, t_{-2}, t_{-1}, W) returns a probability distribution over tags
• STATISTICS(D, W) maps a training set and parameter vector to (L, E(W), H)
• TRAIN(D) returns optimal parameters W
• TAGGER(S, W) returns the most likely tag sequence under parameters W
Case Studies
• Ratnaparkhi's part-of-speech tagger
• McCallum et al.'s work on FAQ segmentation
FAQ Segmentation: McCallum et al.

<head> X-NNTP-POSTER: NewsHound v1.33
<head>
<head> Archive name: acorn/faq/part2
<head> Frequency: monthly
<head>

<question> 2.6) What configuration of serial cable should I use
<answer>
<answer> Here follows a diagram of the necessary connections
<answer> programs to work properly. They are as far as I know t
<answer> agreed upon by commercial comms software developers fo
<answer>
<answer> Pins 1, 4, and 8 must be connected together inside
<answer> is to avoid the well known serial port chip bugs. The
FAQ Segmentation: Line Features
begins-with-number, begins-with-ordinal, begins-with-punctuation, begins-with-question-word, begins-with-subject, blank, contains-alphanum, contains-bracketed-number, contains-http, contains-non-space, contains-number, contains-pipe, contains-question-mark, ends-with-question-mark, first-alpha-is-capitalized, indented-1-to-4, indented-5-to-10, more-than-one-third-space, only-punctuation, prev-is-blank, prev-begins-with-ordinal, shorter-than-30
FAQ Segmentation: McCallum et al.

<head> X-NNTP-POSTER: NewsHound v1.33
<head>
<head> Archive name: acorn/faq/part2
<head> Frequency: monthly
<head>

<question> 2.6) What configuration of serial cable should I use

Here follows a diagram of the necessary connections
programs to work properly. They are as far as I know t
agreed upon by commercial comms software developers fo

Pins 1, 4, and 8 must be connected together inside
is to avoid the well known serial port chip bugs. The

⇒ "tag=question;prev=head;begins-with-number"
  "tag=question;prev=head;contains-alphanum"
  "tag=question;prev=head;contains-nonspace"
  "tag=question;prev=head;contains-number"
  "tag=question;prev=head;prev-is-blank"
A Third Case Study: Tagging Arabic
NOUN_PROP lwng
NOUN_PROP byt$
PUNC -LRB-
DET+NOUN_PROP+NSUFF_FEM_PL AlwlAyAt
DET+ADJ+NSUFF_FEM_SG AlmtHdp
...
VERB_PERFECT tgyr
PREP fy
NOUN+NSUFF_FEM_SG HyAp
DET+NOUN Almt$rd
NOUN_PROP styfn
NOUN_PROP knt
A Third Case Study: Tagging Arabic
NOUN_PROP lwng
NOUN_PROP byt$
PUNC -LRB-
DET+NOUN_PROP+NSUFF_FEM_PL AlwlAyAt
DET+NOUN_PROP+NSUFF_FEM_PL AlwlAyAt
⇒
tag=DET+NOUN_PROP+NSUFF_FEM_PL;word=AlwlAyAt
tag-contains=DET;word=AlwlAyAt
tag-contains=DET;pref=Al
tag-contains=DET;pref=Alw
tag-contains=NOUN;word=AlwlAyAt
tag-contains=DET;tag-2-contains=NOUN
Overview
• Anatomy of a tagger
• An alternative: hidden Markov model taggers
Log-Linear Taggers: Independence Assumptions
• Each tag sequence T has a conditional probability
  P(T | S) = prod_{j=1..n} P(T(j) | S, j, T(1) ... T(j-1))     (chain rule)
           = prod_{j=1..n} P(T(j) | S, j, T(j-2), T(j-1))      (independence assumptions)
• TAGGER involves a search for the most likely tag sequence:
  TAGGER(S, W) = argmax_{T ∈ T^n} log P(T | S)
Hidden Markov Models
• Model P(T, S) rather than P(T | S)
• Then
  P(T | S) = P(T, S) / sum_{T'} P(T', S)
  TAGGER(S, W) = argmax_{T ∈ T^n} log P(T | S)
               = argmax_{T ∈ T^n} log P(T, S)
How to model P(T, S)?
  P(T, S) = prod_{j=1..n} [ P(T_j | S_1 ... S_{j-1}, T_1 ... T_{j-1}) × P(S_j | S_1 ... S_{j-1}, T_1 ... T_j) ]   (chain rule)
          = prod_{j=1..n} [ P(T_j | T_{j-2}, T_{j-1}) × P(S_j | T_j) ]                                           (independence assumptions)
Why the Name?
  P(T, S) = [ prod_{j=1..n} P(T_j | T_{j-2}, T_{j-1}) ] × [ prod_{j=1..n} P(S_j | T_j) ]
  The first product is the hidden Markov chain; in the second, the S_j's are observed.
How to model P(T, S)?
  Hispaniola/NNP quickly/RB became/VB an/DT important/JJ base/Vt from which Spain expanded its empire into the rest of the Western Hemisphere .
  "Score" for tag Vt: P(Vt | DT, JJ) × P(base | Vt)
Smoothed Estimation

  P(Vt | DT, JJ) = λ_1 × Count(DT, JJ, Vt) / Count(DT, JJ)
                 + λ_2 × Count(JJ, Vt) / Count(JJ)
                 + λ_3 × Count(Vt) / Count()

  P(base | Vt) = Count(Vt, base) / Count(Vt)
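These estimates are easy to compute from n-gram counts; here is a small Python sketch, assuming count dictionaries keyed by tag tuples and fixed interpolation weights λ_1 + λ_2 + λ_3 = 1 (the lecture does not say how the λ's are chosen):

```python
def smoothed_trigram(t2, t1, t, counts, lambdas):
    """Interpolated estimate of P(t | t_-2, t_-1).

    counts: n-gram counts keyed by tag tuples, e.g. counts[("DT", "JJ", "Vt")],
            counts[("JJ", "Vt")], counts[("Vt",)], and counts[()] = total number of tags.
    """
    l1, l2, l3 = lambdas

    def ratio(num_key, den_key):
        den = counts.get(den_key, 0)
        return counts.get(num_key, 0) / den if den > 0 else 0.0

    return (l1 * ratio((t2, t1, t), (t2, t1))
            + l2 * ratio((t1, t), (t1,))
            + l3 * ratio((t,), ()))

def emission(word, t, word_tag_counts, tag_counts):
    """Maximum-likelihood estimate of P(word | tag) = Count(tag, word) / Count(tag)."""
    return word_tag_counts.get((t, word), 0) / tag_counts[t]
```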
Dealing with Low-Frequency Words
• Step 1: Split the vocabulary into two sets
  Frequent words = words occurring ≥ 5 times in training
  Low-frequency words = all other words
• Step 2: Map low-frequency words into a small, finite set, depending on prefixes, suffixes, etc.
Dealing with Low-Frequency Words: An Example
[Bikel et al. 1999] "An Algorithm that Learns What's in a Name"

Word class               Example                  Intuition
twoDigitNum              90                       Two-digit year
fourDigitNum             1990                     Four-digit year
containsDigitAndAlpha    A8956-67                 Product code
containsDigitAndDash     09-96                    Date
containsDigitAndSlash    11/9/89                  Date
containsDigitAndComma    23,000.00                Monetary amount
containsDigitAndPeriod   1.00                     Monetary amount, percentage
othernum                 456789                   Other number
allCaps                  BBN                      Organization
capPeriod                M.                       Person name initial
firstWord                first word of sentence   No useful capitalization information
initCap                  Sally                    Capitalized word
lowercase                can                      Uncapitalized word
other                    ,                        Punctuation marks, all other words
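A rough Python rendering of this mapping, approximating the table with simple tests; the exact rules of Bikel et al. may differ, and the is_first_word flag is an assumption about how "firstWord" is detected.

```python
import re

def word_class(word, is_first_word=False):
    """Map a low-frequency word to one of the classes in the table above (approximate)."""
    if re.fullmatch(r"\d{2}", word): return "twoDigitNum"
    if re.fullmatch(r"\d{4}", word): return "fourDigitNum"
    if re.search(r"\d", word):
        if re.search(r"[A-Za-z]", word): return "containsDigitAndAlpha"
        if "-" in word: return "containsDigitAndDash"
        if "/" in word: return "containsDigitAndSlash"
        if "," in word: return "containsDigitAndComma"
        if "." in word: return "containsDigitAndPeriod"
        return "othernum"
    if word.isalpha() and word.isupper(): return "allCaps"
    if re.fullmatch(r"[A-Z]\.", word): return "capPeriod"
    if is_first_word: return "firstWord"
    if word[:1].isupper(): return "initCap"
    if word.islower(): return "lowercase"
    return "other"
```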
Dealing with Low-Frequency Words: An Example

Profits/NA soared/NA at/NA Boeing/SC Co./CC ,/NA easily/NA topping/NA forecasts/NA on/NA Wall/SL Street/CL ,/NA as/NA their/NA CEO/NA Alan/SP Mulally/CP announced/NA first/NA quarter/NA results/NA ./NA

⇓

firstword/NA soared/NA at/NA initCap/SC Co./CC ,/NA easily/NA lowercase/NA forecasts/NA on/NA initCap/SL Street/CL ,/NA as/NA their/NA CEO/NA Alan/SP initCap/CP announced/NA first/NA quarter/NA results/NA ./NA

NA = No entity
SC = Start Company
CC = Continue Company
SL = Start Location
CL = Continue Location
...
The Viterbi Algorithm: Recursive Definitions
• Base case:
  π[0, *, *] = log 1 = 0
  π[0, t_{-2}, t_{-1}] = log 0 = −∞ for all other t_{-2}, t_{-1}
  Here * is a special tag padding the beginning of the sentence.
• Recursive case: for i = 1 ... S.length, for all t_{-2}, t_{-1},
  π[i, t_{-2}, t_{-1}] = max_{t ∈ T ∪ {*}} { π[i−1, t, t_{-2}] + log PROB(S, i, t, t_{-2}, t_{-1}) }
  Backpointers allow us to recover the max probability sequence:
  BP[i, t_{-2}, t_{-1}] = argmax_{t ∈ T ∪ {*}} { π[i−1, t, t_{-2}] + log PROB(S, i, t, t_{-2}, t_{-1}) }
• The only difference is that PROB(S, i, t, t_{-2}, t_{-1}) now returns
  P(t_{-1} | t, t_{-2}) × P(S_i | t_{-1})
  rather than
  P(t_{-1} | S, i, t, t_{-2})
Pros and Cons
• Hidden Markov model taggers are very simple to train (just compile counts from the training corpus)
• They perform relatively well (over 90% performance on named entities)
• The main difficulty is modeling P(word | tag), which can be very difficult if "words" are complex
FAQ Segmentation: McCallum et al.

<head> X-NNTP-POSTER: NewsHound v1.33
<head>
<head> Archive name: acorn/faq/part2
<head> Frequency: monthly
<head>

<question> 2.6) What configuration of serial cable should I use
<answer>
<answer> Here follows a diagram of the necessary connections
<answer> programs to work properly. They are as far as I know t
<answer> agreed upon by commercial comms software developers fo
<answer>
<answer> Pins 1, 4, and 8 must be connected together inside
<answer> is to avoid the well known serial port chip bugs. The
FAQ Segmentation: McCallum et al.

<question> 2.6) What configuration of serial cable should I use

• First solution for P(word | tag):
  P("2.6) What configuration of serial cable should I use" | question)
    = P("2.6)" | question)
    × P(What | question)
    × P(configuration | question)
    × P(of | question)
    × P(serial | question)
    × ...
• i.e., have a language model for each tag
FAQ Segmentation: McCallum et al.
begins-with-number, begins-with-ordinal, begins-with-punctuation, begins-with-question-word, begins-with-subject, blank, contains-alphanum, contains-bracketed-number, contains-http, contains-non-space, contains-number, contains-pipe, contains-question-mark, ends-with-question-mark, first-alpha-is-capitalized, indented-1-to-4, indented-5-to-10, more-than-one-third-space, only-punctuation, prev-is-blank, prev-begins-with-ordinal, shorter-than-30
FAQ Segmentation: McCallum et al.
• Second solution: first map each sentence to a string of features:
  <question> 2.6) What configuration of serial cable should I use
  ⇒
  <question> begins-with-number contains-alphanum contains-nonspace contains-number prev-is-blank
• Use a language model again:
  P("2.6) What configuration of serial cable should I use" | question)
    = P(begins-with-number | question)
    × P(contains-alphanum | question)
    × P(contains-nonspace | question)
    × P(contains-number | question)
    × P(prev-is-blank | question)
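Under this second solution, the "language model" for a tag is just a product of per-feature probabilities over the features that fire on the line, as on the slide. A minimal sketch (names are illustrative):

```python
import math

def line_log_prob(line_features, tag, feature_probs):
    """Approximate log P(line | tag) as a sum of log probabilities of the firing features.

    line_features : set of feature names that fire on the line,
                    e.g. {"begins-with-number", "contains-alphanum", ...}
    feature_probs : feature_probs[tag][f] = estimate of P(feature f fires | tag)
    """
    return sum(math.log(feature_probs[tag][f]) for f in line_features)
```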
FAQ Segmentation: McCallum et al.

Method         COAP    SegPrec   SegRec
ME-Stateless   0.520   0.038     0.362
TokenHMM       0.865   0.276     0.140
FeatureHMM     0.941   0.413     0.529
MEMM           0.965   0.867     0.681