Speech & NLP
Levenshtein Edit Distance (LED)
Vladimir Kulyukin
www.vkedco.blogspot.com
Outline
● Levenshtein Edit Distance
  - Definition
  - Examples
  - Recursive Computation
  - Dynamic Programming Computation
Levenshtein Edit Distance
Minimum Edit Distance
● Suppose we have two strings: source and target
● Suppose we have a finite set of operations (edit_ops) that can be used to transform source to target
● Each operation has a cost
● A Minimum Edit Distance is a metric that measures the minimum total cost of transforming source to target
Strings as Prefix Sequences
Any string can be viewed as a sequence of prefixes
1. s = '', then the prefix sequence is <''>
2. s = 'a', then the prefix sequence is <'', 'a'>
3. s = 'ab', then the prefix sequence is <'', 'a', 'ab'>
In general, if s = c1c2...cn, then the prefix sequence is <'', 'c1', 'c1c2', ..., s>
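The prefix sequence can be sketched in a few lines of Python. The function name prefix_sequence is an illustrative assumption, not something defined on the slides.

```python
def prefix_sequence(s: str) -> list[str]:
    """Return every prefix of s, from the empty string up to s itself."""
    # s[:k] is the prefix of length k; k runs from 0 to len(s) inclusive.
    return [s[:k] for k in range(len(s) + 1)]


print(prefix_sequence(""))    # ['']
print(prefix_sequence("a"))   # ['', 'a']
print(prefix_sequence("ab"))  # ['', 'a', 'ab']
```

A string of length n has n + 1 prefixes, which is why the cost tables later in the deck have one more row and column than the strings have characters.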
Definition
● Levenshtein edit distance (LED) is one of the best-known metrics for measuring similarity between two character sequences
● The metric is named after Vladimir Levenshtein, who introduced it in 1965
● Given two strings, source and target, LED is defined as the minimum number of edit operations (aka edits) needed to transform source to target
Edit Operations (AKA Edits)
● The standard edit operations, aka edits, are insertion, deletion, & substitution
● Assume pt and ps are legal positions in target and source, respectively
● Insertion – a character at position pt in target is inserted into source at position ps
● Deletion – a character is deleted from source at position ps
● Substitution – a character at position pt in target is substituted for a character at position ps in source
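The three edits can be illustrated on Python strings. This is a minimal sketch: the helper names insert, delete, and substitute are hypothetical, and positions are assumed to be 1-based, as in the examples later in the deck.

```python
def insert(source: str, ps: int, ch: str) -> str:
    """Insert ch into source so that it ends up at 1-based position ps."""
    return source[:ps - 1] + ch + source[ps - 1:]


def delete(source: str, ps: int) -> str:
    """Delete the character at 1-based position ps from source."""
    return source[:ps - 1] + source[ps:]


def substitute(source: str, ps: int, ch: str) -> str:
    """Replace the character at 1-based position ps in source with ch."""
    return source[:ps - 1] + ch + source[ps:]


print(insert("a", 2, "b"))        # ab
print(delete("abc", 2))           # ac
print(substitute("abc", 2, "x"))  # axc
```

Here ch plays the role of the character at position pt in target.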
Edit Costs
● The standard edit operations have associated costs
● The costs are application dependent, and are typically positive integers
● For example, the costs of insertion, deletion, and substitution can all be set to 1
● In some contexts, the cost of substitution is set to 2
String Transformation Cost
CT(s, t) is the numerical cost of transforming source string s to target string t
Tabulating Transformation Costs
Rows are labeled with the characters of source and columns with the characters of target; each cell holds CT for the corresponding pair of prefixes.

                      TARGET
              ''   c1   c2   c3   c4   c5   …   cn
SOURCE   ''
         c1
         c2
         …
         cm

● Row 0 holds CT('', ''), CT('', 'c1'), CT('', 'c1c2'), CT('', 'c1c2c3'), …
● Column 0 holds CT('', ''), CT('c1', ''), CT('c1c2', ''), CT('c1c2c3', ''), …
● The bottom-right cell holds CT('c1...cm', 'c1...cn'), the cost of transforming the whole source to the whole target
Transforming Empty Source to Target

         ''   c1   c2   c3   c4   c5   …   cn
    ''   0    1i   2i   3i   4i   5i   …   ni
    c1
    c2
    …
    cm

The only way to transform an empty source to some target is to insert 0 or more characters into it (ki is the cost of inserting k characters).
Transforming Source to Empty Target

         ''   c1   c2   c3   c4   c5   …   cn
    ''   0
    c1   1d
    c2   2d
    c3   3d
    …    …
    cm   md

The only way to transform some source to an empty target is to delete 0 or more characters from it (kd is the cost of deleting k characters).
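The two boundary cases can be sketched as a table initialization in Python. The function name init_cost_table is an assumption for illustration, and the interior cells are left at 0 as placeholders to be filled in later.

```python
def init_cost_table(source: str, target: str,
                    ins_cost: int = 1, del_cost: int = 1) -> list[list[int]]:
    """Build a (len(source)+1) x (len(target)+1) table with row 0 and column 0 filled."""
    m, n = len(source), len(target)
    ct = [[0] * (n + 1) for _ in range(m + 1)]
    # Row 0: empty source to the first c characters of target costs c insertions.
    for c in range(1, n + 1):
        ct[0][c] = c * ins_cost
    # Column 0: the first r characters of source to empty target costs r deletions.
    for r in range(1, m + 1):
        ct[r][0] = r * del_cost
    return ct


for row in init_cost_table("ab", "abc"):
    print(row)
# [0, 1, 2, 3]
# [1, 0, 0, 0]
# [2, 0, 0, 0]
```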
Examples
Example 01
Let insertion cost = deletion cost = substitution cost = 1.
Let source = '' and target = 'ab'.
How can we transform source to target?

         ''   a    b
    ''   0    1    2

Row 0 is filled left to right: CT('', '') = 0, CT('', 'a') = 1, CT('', 'ab') = 2.
- insert 'a' at position 1 in source at cost 1;
- insert 'b' at position 2 in source at cost 1.
So, LED('', 'ab') = 2.
Example 02
Let insertion cost = deletion cost = substitution cost = 1.
Let source = 'ab' and target = ''.
How can we transform source to target?
- delete 'a' at position 1 in source at cost 1;
- delete 'b' at position 2 in source at cost 1.
So, LED('ab', '') = 2.
Example 03
Let insertion cost = deletion cost = substitution cost = 1.
Let source = 'abc' and target = 'ac'.
- match 'a' at position 1 in source with 'a' at position 1 in target at cost 0;
- delete 'b' at position 2 in source at cost 1;
- match 'c' at position 3 in source with 'c' at position 2 in target at cost 0.
So, LED('abc', 'ac') = 1.
Recursive LED Algorithm
Specification
LevEdDist(source, target, ins_cost, del_cost, sub_cost)
- source – source string
- target – target string
- ins_cost – cost of insertion
- del_cost – cost of deletion
- sub_cost – cost of substitution
LevEdDist(source, target, ins_cost, del_cost, sub_cost) returns the Levenshtein distance, i.e., the total cost of the edits, together with a sequence of edits that converts source to target
Base Case
    from copy import copy

    # InsertOperator and DeleteOper are the edit-operation classes used on the slides; their definitions are not shown
    def LevEdDist(source_str, target_str, edit_ops, ins_cost=1, del_cost=1, sub_cost=1):
        # 1. compute lengths of source and target strings
        target_len, source_len = len(target_str), len(source_str)
        # 2. edit_ops is a list of edit operations that is destructively modified, so work on a copy
        edit_ops_copy = copy(edit_ops)
        if source_len == 0:
            # 3. if source is empty, insert all target characters into it
            for c in target_str:
                edit_ops_copy.append(InsertOperator(c, ins_cost))
            return (target_len * ins_cost, edit_ops_copy)
        if target_len == 0:
            # 4. if target is empty, delete all characters from source
            for c in source_str:
                edit_ops_copy.append(DeleteOper('del', c, del_cost))
            return (source_len * del_cost, edit_ops_copy)
Recursion
● Two strings source[0, m] and target[0, n]
● If the character at source[m] is the same as the character at target[n], add the cost of 0 to the value of the recursive call LevEdDist on source[0, m-1] and target[0, n-1]
● If these characters are not the same, compute the costs of deletion, insertion, and substitution, and choose the minimum cost
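Recursion: Three Recursive Calls
    # deletion: transform source without its last character to target
    dc_cost, dc_edit_ops = LevEdDist(source_str[0:source_len-1], target_str,
                                     edit_ops, ins_cost, del_cost, sub_cost)
    # insertion: transform source to target without its last character
    ic_cost, ic_edit_ops = LevEdDist(source_str, target_str[0:target_len-1],
                                     edit_ops, ins_cost, del_cost, sub_cost)
    # substitution: transform source without its last character to target without its last character
    sc_cost, sc_edit_ops = LevEdDist(source_str[0:source_len-1], target_str[0:target_len-1],
                                     edit_ops, ins_cost, del_cost, sub_cost)
    # add the cost of the corresponding edit to each result and choose the minimum cost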
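The base case and the three recursive calls can be combined into one self-contained sketch. It is not the author's implementation: the name lev_ed_dist is hypothetical, and plain tuples (operation, character, cost) stand in for the InsertOperator and DeleteOper classes, whose definitions are not shown on the slides.

```python
def lev_ed_dist(source, target, ins_cost=1, del_cost=1, sub_cost=1):
    """Return (cost, edits) for transforming source to target."""
    # Base cases: one string is empty.
    if not source:
        return len(target) * ins_cost, [("ins", c, ins_cost) for c in target]
    if not target:
        return len(source) * del_cost, [("del", c, del_cost) for c in source]

    s_last, t_last = source[-1], target[-1]
    args = (ins_cost, del_cost, sub_cost)

    # Last characters match: recurse on both strings shortened by one, at cost 0.
    if s_last == t_last:
        cost, ops = lev_ed_dist(source[:-1], target[:-1], *args)
        return cost, ops + [("match", s_last, 0)]

    # Deletion: transform source without its last character, then delete that character.
    dc_cost, dc_ops = lev_ed_dist(source[:-1], target, *args)
    dc = (dc_cost + del_cost, dc_ops + [("del", s_last, del_cost)])

    # Insertion: reach target without its last character, then insert that character.
    ic_cost, ic_ops = lev_ed_dist(source, target[:-1], *args)
    ic = (ic_cost + ins_cost, ic_ops + [("ins", t_last, ins_cost)])

    # Substitution: shorten both strings, then replace s_last with t_last.
    sc_cost, sc_ops = lev_ed_dist(source[:-1], target[:-1], *args)
    sc = (sc_cost + sub_cost, sc_ops + [("sub", s_last + "->" + t_last, sub_cost)])

    # Choose the branch with the minimum cost.
    return min(dc, ic, sc, key=lambda pair: pair[0])


cost, edits = lev_ed_dist("abc", "ac")
print(cost)   # 1
print(edits)  # [('match', 'a', 0), ('del', 'b', 1), ('match', 'c', 0)]
```

Without memoization the same subproblems are solved repeatedly, so the running time grows exponentially with the string lengths.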
LED Computation with Dynamic Programming
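Computing CT(r, c)
1. Construct an (m+1) x (n+1) table CT, where m = len(source) and n = len(target)
2. Fill row 0: CT[0, c] = c * ins_cost
3. Fill column 0: CT[r, 0] = r * del_cost
4. Then, for r = 1..m and c = 1..n, CT[r, c] = min{ CT[r-1, c-1] + (0 if source[r] == target[c], else sub_cost), CT[r-1, c] + del_cost, CT[r, c-1] + ins_cost }, where source[r] and target[c] are the r-th and c-th characters
5. CT[m, n] is the final (and minimal!) cost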
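The table-filling steps can be sketched in Python as follows. The function name led_table is an assumption. The table has one extra row and column for the empty prefixes, and the diagonal step costs 0 when the two characters match.

```python
def led_table(source, target, ins_cost=1, del_cost=1, sub_cost=1):
    """Return the full cost table; ct[r][c] is the cost for the prefixes of lengths r and c."""
    m, n = len(source), len(target)
    ct = [[0] * (n + 1) for _ in range(m + 1)]
    # Row 0: insert c characters into an empty source.
    for c in range(1, n + 1):
        ct[0][c] = c * ins_cost
    # Column 0: delete r characters to reach an empty target.
    for r in range(1, m + 1):
        ct[r][0] = r * del_cost
    # Interior cells: minimum over substitution/match, deletion, insertion.
    for r in range(1, m + 1):
        for c in range(1, n + 1):
            diag = 0 if source[r - 1] == target[c - 1] else sub_cost
            ct[r][c] = min(ct[r - 1][c - 1] + diag,
                           ct[r - 1][c] + del_cost,
                           ct[r][c - 1] + ins_cost)
    return ct


for row in led_table("abc", "ac"):
    print(row)
# [0, 1, 2]
# [1, 0, 1]
# [2, 1, 1]
# [3, 2, 1]
```

The bottom-right entry, 1, agrees with Example 03. Python strings are 0-indexed, so the r-th character of source is source[r - 1].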
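LED Properties
● LED is minimal: CT[m, n] is the lowest total cost over all edit sequences that transform source to target
● LED is correct
● LED can be computed with only 2 rows of the CT table in memory at a time
● An optimal sequence of edits can be recovered from the full CT table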
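The two-row property can be sketched as follows: each row of CT depends only on the row directly above it, so only the previous and current rows need to be stored. The function name led_two_rows is an assumption for illustration.

```python
def led_two_rows(source, target, ins_cost=1, del_cost=1, sub_cost=1):
    """Return the edit cost while keeping only two rows of the cost table."""
    # prev starts as row 0: the cost of inserting the first c characters of target.
    prev = [c * ins_cost for c in range(len(target) + 1)]
    for r, s_ch in enumerate(source, start=1):
        # Column 0 of the current row: the cost of deleting r characters.
        curr = [r * del_cost]
        for c, t_ch in enumerate(target, start=1):
            diag = 0 if s_ch == t_ch else sub_cost
            curr.append(min(prev[c - 1] + diag,      # substitution or match
                            prev[c] + del_cost,      # deletion
                            curr[c - 1] + ins_cost)) # insertion
        prev = curr
    return prev[-1]


print(led_two_rows("abc", "ac"))          # 1
print(led_two_rows("kitten", "sitting"))  # 3
```

This variant returns only the cost. Recovering an optimal sequence of edits requires tracing back through the full table, so it is not possible from the two rows alone.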