LING 388: Language and Computers Sandiway Fong Lecture 24: 11/17.
LING/C SC/PSYC 438/538 Lecture 24 Sandiway Fong 1.
LING/C SC/PSYC 438/538
Lecture 24
Sandiway Fong
Administrivia
• Homeworks 7 and 8 graded
Last Time
1. Homework 8 review
2. CKY algorithm for parsing context-free grammars
– Chomsky Normal Form (CNF)
– 2D table
w1          w2          w3
[0,1] x1    [0,2] y     [0,3] s s
            [1,2] x2    [1,3] z
                        [2,3] x3

s --> y, x3.
s --> x1, z.
s --> y, z.
y --> x1, x2.
z --> x2, x3.
x1 --> [w1].
x2 --> [w2].
x3 --> [w3].
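The 2D table for this toy grammar can be filled mechanically. Below is a minimal sketch of a CKY-style table filler in Python, assuming the grammar is already in Chomsky Normal Form; the names `BINARY`, `LEXICAL` and `cky` are illustrative and not from the lecture.

```python
from collections import defaultdict

# Rules transcribed from the toy CNF grammar above.
BINARY = [("s", "y", "x3"), ("s", "x1", "z"), ("s", "y", "z"),
          ("y", "x1", "x2"), ("z", "x2", "x3")]
LEXICAL = [("x1", "w1"), ("x2", "w2"), ("x3", "w3")]

def cky(words):
    """Return a table mapping spans (i, j) to the nonterminals covering them.

    A nonterminal is appended once per (rule, split point) that produces it,
    so a cell can list the same symbol more than once.
    """
    n = len(words)
    table = defaultdict(list)
    # Width-1 spans come from the lexical rules.
    for i, w in enumerate(words):
        for parent, word in LEXICAL:
            if word == w:
                table[i, i + 1].append(parent)
    # Wider spans combine two adjacent, already-filled spans.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for parent, left, right in BINARY:
                    if left in table[i, k] and right in table[k, j]:
                        table[i, j].append(parent)
    return table

table = cky(["w1", "w2", "w3"])
print(table[0, 3])
```

Cell [0,3] ends up with s twice, once via x1 + z and once via y + x3. The rule s --> y, z never fires because y and z cannot cover adjacent spans of a three-word input.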
Administrivia
• TCEs are fully online (paper TCEs are history)
https://tce.oirps.arizona.edu/TCEOnline
Spelling Errors
• Sections 3.10 and 5.9
Spelling Errors
• Textbook cites (Kukich, 1992):
– Non-word detection (easiest)
• graffe (giraffe)
– Isolated-word (context-free) error correction
• graffe (giraffe, …)
• graffed (gaffed, …)
• by definition cannot correct when the error word is a valid word
– Context-dependent error detection and correction (hardest)
• your an idiot → you’re an idiot
• Their is → there is
• (Microsoft Word corrects this by default)
Spelling Errors
• OCR
– visual similarity
• h→b, e→c, jump→jurnps
• Typing
– keyboard distance
• small→smsll, spell→spel;
• Graffiti (many HCI studies)
– stroke similarity
• Common error characters are: V, T, 4, L, E, Q, K, N, Y, 9, P, G, X
• Two stroke characters: B, D, P (error: two characters)
• Cognitive Errors
– bad spellers
• separate→seperate
correct
– textbook section 5.9
• Kernighan et al. (correct)
– take typo t (not a word)
• mutate t minimally by deleting, inserting, substituting or transposing (swapping) a letter
• look up “mutated t” in a dictionary
• candidates are “mutated t” that are real words
– example (5.2)
• t = acress
• C = {actress, cress, caress, access, across, acres, acres}
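The candidate-generation step can be sketched as follows. This is an illustrative Python version, not the original correct program; `single_edits`, `candidates` and the tiny `DICTIONARY` are assumptions made for the example.

```python
import string

def single_edits(t, alphabet=string.ascii_lowercase):
    """All strings one deletion, insertion, substitution or transposition away from t."""
    splits = [(t[:i], t[i:]) for i in range(len(t) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    inserts = [a + ch + b for a, b in splits for ch in alphabet]
    substitutes = [a + ch + b[1:] for a, b in splits if b
                   for ch in alphabet if ch != b[0]]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    return set(deletes + inserts + substitutes + transposes)

def candidates(t, dictionary):
    """Mutations of t that are real words according to the dictionary."""
    return sorted(w for w in single_edits(t) if w in dictionary)

# Hypothetical mini-dictionary; "acre" and "stress" are more than one edit away.
DICTIONARY = {"actress", "cress", "caress", "access", "across", "acres",
              "acre", "stress"}
print(candidates("acress", DICTIONARY))
# ['access', 'acres', 'across', 'actress', 'caress', 'cress']
```

Using a set means acres shows up once here, whereas the slide lists it twice because two different single edits lead to it.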
correct
• formula
– $\hat{c} = \arg\max_{c \in C} P(t \mid c)\, P(c)$ (Bayesian Inference)
– C = {actress, cress, caress, access, across, acres, acres}
• Prior: P(c)
– estimated using frequency information over a large corpus (N words)
– $P(c) = \mathrm{freq}(c)/N$
– $P(c) = (\mathrm{freq}(c) + 0.5)/(N + 0.5V)$
• avoid zero counts (non-occurrences)
• (add fractional part 0.5)
• add-one-style smoothing, with 0.5 in place of 1
• V is vocabulary size of corpus
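A small sketch of the smoothed prior, to show that a word with zero count still gets a non-zero probability. The counts in `COUNTS` are a made-up toy corpus, and `smoothed_prior` is an illustrative name.

```python
def smoothed_prior(freq_c, n_words, vocab_size, k=0.5):
    """P(c) = (freq(c) + k) / (N + k*V), with k = 0.5 as on the slide."""
    return (freq_c + k) / (n_words + k * vocab_size)

# HYPOTHETICAL toy counts, for illustration only.
COUNTS = {"across": 6, "access": 3, "caress": 1, "cress": 0}
N = sum(COUNTS.values())   # corpus size in words
V = len(COUNTS)            # vocabulary size

for word, freq in COUNTS.items():
    print(word, round(smoothed_prior(freq, N, V), 4))
```

When the counts cover the whole vocabulary, the smoothed probabilities still sum to 1, since adding 0.5 to each of V counts adds exactly 0.5V to the denominator.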
correct
• Likelihood: P(t|c), probability of typo t given candidate word c
– using some corpus of errors
– compute following 4 confusion matrices (each a 26 x 26 matrix, a–z by a–z)
– del[x,y] = freq(correct xy mistyped as x)
– ins[x,y] = freq(correct x mistyped as xy)
– sub[x,y] = freq(correct x mistyped as y)
– trans[x,y] = freq(correct xy mistyped as yx)
– P(t|c) = del[x,y]/f(xy) if c related to t by deletion of y
– P(t|c) = ins[x,y]/f(x) if c related to t by insertion of y, etc.
• Very hard to collect this data
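To make the combination of likelihood and prior concrete, here is a minimal sketch that scores two candidates for t = acress. Every number below is a made-up placeholder, not a corpus statistic, and only the two likelihood cases spelled out above (deletion and insertion) are implemented; all names are illustrative.

```python
# HYPOTHETICAL counts, for illustration only.
DEL = {("c", "t"): 100}      # del[x,y]: correct "xy" mistyped as "x"
INS = {("e", "s"): 500}      # ins[x,y]: correct "x" mistyped as "xy"
F = {"ct": 1_000_000, "e": 10_000_000}       # f(xy) and f(x)
PRIOR = {"actress": 2e-5, "acres": 6e-5}     # P(c), also hypothetical

def likelihood(kind, x, y):
    """P(t|c) for a single deletion or insertion, following the slide's formulas."""
    if kind == "del":
        return DEL.get((x, y), 0) / F[x + y]
    if kind == "ins":
        return INS.get((x, y), 0) / F[x]
    raise ValueError("only 'del' and 'ins' are sketched here")

# How each candidate c is related to the typo "acress".
CANDIDATES = {
    "actress": ("del", "c", "t"),   # t deleted after c
    "acres": ("ins", "e", "s"),     # s inserted after e
}

def best_candidate(cands):
    """Pick the candidate c maximizing P(t|c) * P(c)."""
    return max(cands, key=lambda c: likelihood(*cands[c]) * PRIOR[c])

print(best_candidate(CANDIDATES))
```

With these placeholder numbers acres scores higher than actress; different counts could easily reverse the ranking, which is why the quality of the error corpus matters.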
correct
• example
– t = acress
– best candidate $\hat{c}$ = acres (44%)
• despite all the math
• wrong result for
– was called a stellar and versatile acress
• what does Microsoft Word use?
– was called a stellar and versatile acress
– (Microsoft Word: corrected here)
Part 2
• Another algorithm using dynamic programming:
– Minimum Edit Distance
• Textbook: section 3.11
• File: eds.xls
Minimum Edit Distance
• general string comparison
• edit operations are insertion, deletion and substitution
• not just limited to distance defined by a single operation away
• we can ask how different string a is from string b by the minimum edit distance
Minimum Edit Distance
• applications
– could be used for multi-typo correction
– used in Machine Translation Evaluation (MTEval)
– example
• Source: 生産工程改善について
• Translations:
• (Standard) For improvement of the production process
• (MT-A) About a production process betterment
• (MT-B) About the production process improvement
• method
– compute edit distance between MT-A and Standard and MT-B and Standard in terms of word insertion/substitution etc.
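The method can be sketched by running edit distance over words instead of letters. This is a minimal illustration that assumes unit cost for word insertion, deletion and substitution and simple whitespace tokenization; `word_edit_distance` is a name chosen for the example, not an MTEval tool.

```python
def word_edit_distance(hyp, ref):
    """Unit-cost edit distance between two sentences, counted in words."""
    h, r = hyp.lower().split(), ref.lower().split()
    D = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        D[i][0] = i                      # delete every word of hyp
    for j in range(len(r) + 1):
        D[0][j] = j                      # insert every word of ref
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            sub = 0 if h[i - 1] == r[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,        # delete a word
                          D[i][j - 1] + 1,        # insert a word
                          D[i - 1][j - 1] + sub)  # keep or substitute
    return D[-1][-1]

STANDARD = "For improvement of the production process"
MT_A = "About a production process betterment"
MT_B = "About the production process improvement"

print(word_edit_distance(MT_A, STANDARD))  # 5
print(word_edit_distance(MT_B, STANDARD))  # 4
```

Under these assumptions MT-B is one edit closer to the Standard translation than MT-A.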
Minimum Edit Distance
• cost models
– Levenshtein
• insertion, deletion and substitution all have unit cost
– Levenshtein (alternate)
• insertion, deletion have unit cost
• substitution is twice as expensive
• substitution = one insert followed by one delete
– Typewriter
• insertion, deletion and substitution all have unit cost
• modified by key proximity
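The first two cost models differ only in the substitution cost, so one function with cost parameters covers both. The sketch below is illustrative (the name `min_edit_distance` and its parameters are assumptions); the Typewriter model would additionally need a key-distance table, which is not sketched here.

```python
def min_edit_distance(s, t, ins_cost=1, del_cost=1, sub_cost=1):
    """Minimum cost of turning s into t, keeping only two table rows in memory."""
    prev = [j * ins_cost for j in range(len(t) + 1)]   # row for the empty prefix of s
    for i in range(1, len(s) + 1):
        cur = [i * del_cost]                           # turning s[:i] into the empty string
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (0 if s[i - 1] == t[j - 1] else sub_cost)
            cur.append(min(prev[j] + del_cost,         # delete s[i-1]
                           cur[j - 1] + ins_cost,      # insert t[j-1]
                           diag))                      # match or substitute
        prev = cur
    return prev[-1]

print(min_edit_distance("leader", "adapter"))              # Levenshtein: 4
print(min_edit_distance("leader", "adapter", sub_cost=2))  # Levenshtein (alternate): 5
```

With substitution at cost 2, a substitution is never cheaper than a delete plus an insert, so the alternate model gives the same number as counting insertions and deletions only.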
Minimum Edit Distance
• Dynamic Programming
– divide-and-conquer
• to solve a problem we divide it into sub-problems
– sub-problems may be repeated
• don’t want to re-solve a sub-problem the 2nd time around
– idea: put solutions to sub-problems in a table
• and just look up the solution 2nd time around, thereby saving time
• memoization
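Memoization can be illustrated with a top-down recursive edit distance whose sub-problem results are cached. This is a sketch in Python rather than a spreadsheet, assuming unit cost for insert and delete and cost 2 for substitution; `edit_distance_memo` is an illustrative name.

```python
from functools import lru_cache

def edit_distance_memo(s, t, sub_cost=2):
    """Return (distance, number of sub-problems actually solved)."""
    solved = 0

    @lru_cache(maxsize=None)          # the "table" of solved sub-problems
    def d(i, j):
        nonlocal solved
        solved += 1                   # counts cache misses only
        if i == 0:
            return j                  # insert the first j letters of t
        if j == 0:
            return i                  # delete the first i letters of s
        diag = d(i - 1, j - 1) + (0 if s[i - 1] == t[j - 1] else sub_cost)
        return min(d(i - 1, j) + 1, d(i, j - 1) + 1, diag)

    return d(len(s), len(t)), solved

distance, solved = edit_distance_memo("leader", "adapter")
print(distance, solved)
```

Each pair (i, j) is solved at most once, so the work is bounded by the table size (7 x 8 = 56 cells for this pair) instead of growing exponentially with the string lengths.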
we’ll use a spreadsheet…
Minimum Edit Distance
• Consider a simple case: xy ⇄ yx
• Minimum # of operations:
• insert and delete
• cost = 2
• Minimum # of operations:
• swap
• cost = ?
Minimum Edit Distance
• Generally
Minimum Edit Distance
• Programming Practice: could be easily implemented in Perl
Minimum Edit Distance Computation
• Or in Microsoft Excel, file: eds.xls (on course webpage)
$ in a cell reference means don’t change when copied from cell to cell
e.g. in C$1, 1 stays the same; in $A3, A stays the same
Minimum Edit Distance
• Task: transform string s1..si into string t1..tj
– each sn and tn are letters
– string s is of length i, t is of length j
• Example:
– s = leader, t = adapter
– i = 6, j = 7
– Let’s say you’re allowed just three operations: (1) delete a letter, (2) insert a letter, or (3) substitute a letter for another letter
– What is one possible way to generate t from s?
Minimum Edit Distance
• Example:
– s = leader, t = adapter
– What is one possible way to generate t from s?
– leader
– ↕ ↕
– adapter
– cost is 2 deletes and 3 inserts, total 5 operations
– Question: is this the minimum possible?
• Simplest method: leader ◄ leade ◄ lead ◄ lea ◄ le ◄ l ◄ (empty) ◄ a ◄ ad ◄ ada ◄ adap ◄ adapt ◄ adapte ◄ adapter
– cost: 13 operations
Minimum Edit Distance
      0   1   2   3   4   5   6   7
          a   d   a   p   t   e   r
0
1 l
2 e
3 a
4 d
5 e
6 r
• cell (2,3): cost of transforming le into ada
• cell (6,7): cost of transforming leader into adapter
• cell (3,0): cost of transforming lea into (empty)
• cell (0,4): cost of transforming (empty) into adap
Minimum Edit Distance
      0   1   2   3   4   5   6   7
          a   d   a   p   t   e   r
0
1 l
2 e
3 a
4 d
5 e                           k
6 r                               k
• cell (5,6): cost of transforming leade into adapte is k
• ➡︎ one diagonal step to cell (6,7): the final r matches r, so transforming leader into adapter also costs k
Minimum Edit Distance
      0   1   2   3   4   5   6   7
          a   d   a   p   t   e   r
0
1 l
2 e               k k+1
3 a
4 d
5 e
6 r
• cell (2,3): cost of transforming le into ada is k
• ➡︎ cell (2,4): cost of transforming le into adap, reached from cell (2,3) by inserting p, is k+1
l e
a d a p
Minimum Edit Distance
      0   1   2   3   4   5   6   7
          a   d   a   p   t   e   r
0
1 l                   k
2 e                 k+1
3 a
4 d
5 e
6 r
• cell (1,4): cost of transforming l into adap is k
• ➡︎ cell (2,4): reached from cell (1,4) by deleting e, cost k+1
l e
a d a p
Minimum Edit Distance
      0   1   2   3   4   5   6   7
          a   d   a   p   t   e   r
0
1 l               k
2 e                 k+2
3 a
4 d
5 e
6 r
• cell (1,3): cost of transforming l into ada is k
• ➡︎ cell (2,4): reached diagonally from cell (1,3), cost k+2, assuming the cost of swapping e for p is 2
l e
a d a p
Minimum Edit Distance
        0     1     2     3     4     5     6     7
              a     d     a     p     t     e     r
0
1 l                    k1,3  k1,4
2 e                    k2,3     ?
3 a
4 d
5 e
6 r
• cell (2,4): minimum of the three costs to get here in one step (➡︎ from cells (2,3), (1,4) and (1,3))
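The "minimum of the three costs" rule can be written as a recurrence. This is a sketch under the assumptions used in these slides (unit cost for insertion and deletion, cost 2 for substituting one letter for another); the symbol D(i,j) for the value in cell (i,j) is notation introduced here.

```latex
D(i,j) = \min
\begin{cases}
D(i-1,j) + 1 & \text{delete } s_i \\
D(i,j-1) + 1 & \text{insert } t_j \\
D(i-1,j-1) + c(s_i,t_j) & \text{match or substitute}
\end{cases}
\qquad
c(s_i,t_j) =
\begin{cases}
0 & \text{if } s_i = t_j \\
2 & \text{otherwise}
\end{cases}
```

For cell (2,4), the letters e and p differ, so the three candidates are k1,4 + 1, k2,3 + 1 and k1,3 + 2, and the cell gets the smallest of them.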
Minimum Edit Distance
      0   1   2   3   4   5   6   7
          a   d   a   p   t   e   r
0     0
1 l   1
2 e   2
3 a   3
4 d   4
5 e   5
6 r   6
• cell (3,0): cost of transforming lea into (empty)
• ➡︎ cost of le = cost of l, plus the cost of deleting the e
Minimum Edit Distance
      0   1   2   3   4   5   6   7
          a   d   a   p   t   e   r
0     0   1   2   3   4   5   6   7
1 l   1
2 e   2
3 a   3
4 d   4
5 e   5
6 r   6
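Filling the first row and first column can be sketched as below, assuming unit cost for each insertion and deletion. `init_table` is an illustrative name, and `None` marks cells that are not filled in yet.

```python
def init_table(s, t, ins_cost=1, del_cost=1):
    """Create the table and fill only row 0 and column 0."""
    D = [[None] * (len(t) + 1) for _ in range(len(s) + 1)]
    D[0][0] = 0                                   # (empty) into (empty)
    for i in range(1, len(s) + 1):
        D[i][0] = D[i - 1][0] + del_cost          # delete one more letter of s
    for j in range(1, len(t) + 1):
        D[0][j] = D[0][j - 1] + ins_cost          # insert one more letter of t
    return D

D = init_table("leader", "adapter")
print(D[0])                       # [0, 1, 2, 3, 4, 5, 6, 7]
print([row[0] for row in D])      # [0, 1, 2, 3, 4, 5, 6]
```

Every other cell depends on its left, upper and upper-left neighbours, so these two borders are all that is needed before the rest can be filled in.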
Minimum Edit Distance
      0   1   2   3   4   5   6   7
          a   d   a   p   t   e   r
0     0   1   2   3   4   5   6   7
1 l   1   2
2 e   2
3 a   3
4 d   4
5 e   5
6 r   6
• ➡︎ ➡︎ ➡︎ cell (1,1) = 2: minimum over the three one-step moves into it
Minimum Edit Distance
      0   1   2   3   4   5   6   7
          a   d   a   p   t   e   r
0     0   1   2   3   4   5   6   7
1 l   1
2 e   2
3 a   3
4 d   4                   5   6
5 e   5                   6   5
6 r   6
• ➡︎ cell (5,6) = 5: diagonal step from cell (4,5), no added cost because e matches e
Minimum Edit Distance
      0   1   2   3   4   5   6   7
          a   d   a   p   t   e   r
0     0   1   2   3   4   5   6   7
1 l   1
2 e   2               6   7
3 a   3               5   6
4 d   4
5 e   5
6 r   6
• ➡︎ cell (3,5) = 6: step right from cell (3,4), adding 1 for inserting t
Minimum Edit Distance
      0   1   2   3   4   5   6   7
          a   d   a   p   t   e   r
0     0   1   2   3   4   5   6   7
1 l   1
2 e   2
3 a   3
4 d   4
5 e   5                   6   5
6 r   6                   7   6
• ➡︎ cell (6,6) = 6: step down from cell (5,6), adding 1 for deleting r
Minimum Edit Distance
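The whole table for leader and adapter can be filled by applying the "minimum of three" rule to every cell, row by row. The following is a minimal Python sketch, assuming unit cost for insertion and deletion and cost 2 for substitution as in the slides above; it is not the eds.xls spreadsheet, and `edit_distance_table` and `show` are illustrative names.

```python
def edit_distance_table(s, t, ins_cost=1, del_cost=1, sub_cost=2):
    """Return the full table D, where D[i][j] is the cost of s[:i] into t[:j]."""
    rows, cols = len(s) + 1, len(t) + 1
    D = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        D[i][0] = D[i - 1][0] + del_cost
    for j in range(1, cols):
        D[0][j] = D[0][j - 1] + ins_cost
    for i in range(1, rows):
        for j in range(1, cols):
            diag = D[i - 1][j - 1] + (0 if s[i - 1] == t[j - 1] else sub_cost)
            D[i][j] = min(D[i - 1][j] + del_cost,   # step down: delete s[i-1]
                          D[i][j - 1] + ins_cost,   # step right: insert t[j-1]
                          diag)                     # diagonal: match or substitute
    return D

def show(D, s, t):
    """Print the table with row and column labels, spreadsheet style."""
    print("    " + " ".join(f"{ch:>2}" for ch in " " + t))
    for i, row in enumerate(D):
        label = s[i - 1] if i else " "
        print(f"{i} {label} " + " ".join(f"{v:>2}" for v in row))

D = edit_distance_table("leader", "adapter")
show(D, "leader", "adapter")
print("distance:", D[6][7])   # 5
```

The bottom-right cell comes out as 5, so the 2 deletes and 3 inserts found by hand earlier are in fact the minimum under this cost model.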