Speech & NLP (Fall 2014): Levenshtein Edit Distance

Speech & NLP: Levenshtein Edit Distance (LED). Vladimir Kulyukin, www.vkedco.blogspot.com

Transcript of Speech & NLP (Fall 2014): Levenshtein Edit Distance

Page 1: Speech & NLP (Fall 2014): Levenshtein Edit Distance

Speech & NLP

Levenshtein Edit Distance (LED)

Vladimir Kulyukin

www.vkedco.blogspot.com

Page 2: Speech & NLP (Fall 2014): Levenshtein Edit Distance

Outline

● Levenshtein Edit Distance
- Definition
- Examples
- Recursive Computation
- Dynamic Programming Computation

Page 3: Speech & NLP (Fall 2014): Levenshtein Edit Distance

Levenshtein Edit Distance

Page 4: Speech & NLP (Fall 2014): Levenshtein Edit Distance

Minimum Edit Distance

● Suppose we have two strings: source and target
● Suppose we have a finite set of operations (edit_ops) that can be used to transform source to target
● Each operation has a cost
● A Minimum Edit Distance is a metric that measures the minimum total cost of transforming source to target

Page 5: Speech & NLP (Fall 2014): Levenshtein Edit Distance

Strings as Prefix Sequences

Any string can be viewed as a sequence of prefixes:

1. s = '', then the prefix sequence is <''>
2. s = 'a', then the prefix sequence is <'', 'a'>
3. s = 'ab', then the prefix sequence is <'', 'a', 'ab'>

In general, if s = c1c2...cn, then the prefix sequence is <'', 'c1', 'c1c2', ..., s>
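The prefix view can be sketched in a few lines of Python. The function name prefix_sequence is made up for this illustration and is not part of the slides.

```python
def prefix_sequence(s):
    """Return all prefixes of s, from the empty string up to s itself."""
    return [s[:k] for k in range(len(s) + 1)]


print(prefix_sequence(''))    # ['']
print(prefix_sequence('a'))   # ['', 'a']
print(prefix_sequence('ab'))  # ['', 'a', 'ab']
```

A string of length n has n + 1 prefixes, which is why a cost table indexed by prefixes needs one more row and column than the string lengths.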

Page 6: Speech & NLP (Fall 2014): Levenshtein Edit Distance

Definition

● Levenshtein edit distance (LED) is one of the best-known metrics for measuring similarity between two character sequences (the smaller the distance, the more similar the sequences)
● The metric is named after Vladimir Levenshtein, who introduced it in 1965
● Given two strings, source and target, LED is defined as the minimum number of edit operations (aka edits) needed to transform source to target

Page 7: Speech & NLP (Fall 2014): Levenshtein Edit Distance

Edit Operations (AKA Edits)

● The standard edit operations, aka edits, are insertion, deletion, & substitution
● Assume pt and ps are legal positions in target and source, respectively
● Insertion – the character at position pt in target is inserted into source at position ps
● Deletion – the character at position ps is deleted from source
● Substitution – the character at position pt in target is substituted for the character at position ps in source
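As a rough illustration, the three edits can be written as small string functions. The names insert, delete, and substitute, and the choice of 1-based positions (as in the examples later in the deck), are assumptions of this sketch.

```python
def insert(source, ps, ch):
    """Insert ch into source so that it lands at 1-based position ps."""
    return source[:ps - 1] + ch + source[ps - 1:]


def delete(source, ps):
    """Delete the character at 1-based position ps of source."""
    return source[:ps - 1] + source[ps:]


def substitute(source, ps, ch):
    """Replace the character at 1-based position ps of source with ch."""
    return source[:ps - 1] + ch + source[ps:]


print(insert('ac', 2, 'b'))       # abc
print(delete('abc', 2))           # ac
print(substitute('abc', 2, 'x'))  # axc
```

Python strings are immutable, so each function returns a new string rather than modifying source in place.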

Page 8: Speech & NLP (Fall 2014): Levenshtein Edit Distance

Edit Costs

● The standard edit operations have associated costs
● The costs are application dependent, and are typically positive integers
● For example, the costs of insertion, deletion, and substitution can all be set to 1
● In some contexts, the cost of substitution is set to 2
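The effect of the cost settings can be seen by pricing a fixed list of edits. This is only a sketch: script_cost and the operation names 'ins', 'del', 'sub', and 'match' are invented for the illustration. One common reading of the substitution cost of 2 (an assumption here, since the slide gives no reason) is that a substitution is treated as a deletion followed by an insertion.

```python
def script_cost(ops, ins_cost=1, del_cost=1, sub_cost=1):
    """Sum the costs of a list of operation names; a match costs 0."""
    cost_of = {'ins': ins_cost, 'del': del_cost, 'sub': sub_cost, 'match': 0}
    return sum(cost_of[op] for op in ops)


# Two ways to turn 'ab' into 'ac'.
one_substitution = ['match', 'sub']           # keep 'a', replace 'b' with 'c'
delete_then_insert = ['match', 'del', 'ins']  # keep 'a', delete 'b', insert 'c'

print(script_cost(one_substitution))              # 1
print(script_cost(delete_then_insert))            # 2
print(script_cost(one_substitution, sub_cost=2))  # 2
```

With sub_cost=2 both scripts cost the same, so a substitution is never cheaper than the delete-and-insert pair.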

Page 9: Speech & NLP (Fall 2014): Levenshtein Edit Distance

String Transformation Cost

CT(s, t) is the numerical cost of transforming source string s to target string t

Page 10: Speech & NLP (Fall 2014): Levenshtein Edit Distance

Tabulating Transformation Costs

SOURCE\TARGET | '' | c1 | c2 | c3 | c4 | c5 | … | cn
''            |
c1            |
c2            |
…             |
cm            |

Page 11: Speech & NLP (Fall 2014): Levenshtein Edit Distance

CT('', ''): the cell in SOURCE row '' and TARGET column '' of the cost table

Page 12: Speech & NLP (Fall 2014): Levenshtein Edit Distance

CT('', 'c1'): the cell in SOURCE row '' and TARGET column c1 of the cost table

Page 13: Speech & NLP (Fall 2014): Levenshtein Edit Distance

CT('', 'c1c2'): the cell in SOURCE row '' and TARGET column c2 of the cost table

Page 14: Speech & NLP (Fall 2014): Levenshtein Edit Distance

CT('', 'c1c2c3'): the cell in SOURCE row '' and TARGET column c3 of the cost table

Page 15: Speech & NLP (Fall 2014): Levenshtein Edit Distance

CT('c1', ''): the cell in SOURCE row c1 and TARGET column '' of the cost table

Page 16: Speech & NLP (Fall 2014): Levenshtein Edit Distance

CT('c1c2', ''): the cell in SOURCE row c2 and TARGET column '' of the cost table

Page 17: Speech & NLP (Fall 2014): Levenshtein Edit Distance

CT('c1c2c3', ''): the cell in SOURCE row c3 and TARGET column '' of the cost table

Page 18: Speech & NLP (Fall 2014): Levenshtein Edit Distance

CT('c1...cm', 'c1...cn'): the cell in SOURCE row cm and TARGET column cn of the cost table

Page 19: Speech & NLP (Fall 2014): Levenshtein Edit Distance

Transforming Empty Source to Target

SOURCE\TARGET | '' | c1 | c2 | c3 | c4 | c5 | … | cn
''            | 0  | 1i | 2i | 3i | 4i | 5i | … | ni
c1            |
c2            |
…             |
cm            |

The only way to transform an empty source to some target is to insert 0 or more characters into it (ki is the cost of inserting k characters, i.e., k times the insertion cost i)

Page 20: Speech & NLP (Fall 2014): Levenshtein Edit Distance

Transforming Source to Empty Target

SOURCE\TARGET | '' | c1 | c2 | c3 | c4 | c5 | … | cn
''            | 0
c1            | 1d
c2            | 2d
c3            | 3d
…             | …
cm            | md

The only way to transform some source to an empty target is to delete 0 or more characters from it (kd is the cost of deleting k characters, i.e., k times the deletion cost d)
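The two boundary cases (empty source, empty target) can be sketched as the initialization of a cost table. The function name init_cost_table and the list-of-lists layout are assumptions of this sketch; interior cells are left at 0 as placeholders only.

```python
def init_cost_table(source, target, ins_cost=1, del_cost=1):
    """Build an (m+1) x (n+1) table with only row 0 and column 0 filled in."""
    m, n = len(source), len(target)
    ct = [[0] * (n + 1) for _ in range(m + 1)]
    # Row 0: empty source, so inserting c target characters costs c * ins_cost.
    for c in range(1, n + 1):
        ct[0][c] = c * ins_cost
    # Column 0: empty target, so deleting r source characters costs r * del_cost.
    for r in range(1, m + 1):
        ct[r][0] = r * del_cost
    return ct


for row in init_cost_table('ab', 'abc'):
    print(row)
# [0, 1, 2, 3]
# [1, 0, 0, 0]
# [2, 0, 0, 0]
```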

Page 21: Speech & NLP (Fall 2014): Levenshtein Edit Distance

Examples

Page 22: Speech & NLP (Fall 2014): Levenshtein Edit Distance

Example 01

Let insertion cost = deletion cost = substitution cost = 1.

Let source = '' and target = 'ab'.

How can we transform source to target?

Page 23: Speech & NLP (Fall 2014): Levenshtein Edit Distance

Example 01

SOURCE\TARGET | '' | a | b
''            |

Page 24: Speech & NLP (Fall 2014): Levenshtein Edit Distance

Example 01

SOURCE\TARGET | '' | a | b
''            | 0

CT('', '') = 0

Page 25: Speech & NLP (Fall 2014): Levenshtein Edit Distance

Example 01

SOURCE\TARGET | '' | a | b
''            | 0  | 1

CT('', 'a') = 1

Page 26: Speech & NLP (Fall 2014): Levenshtein Edit Distance

Example 01

SOURCE\TARGET | '' | a | b
''            | 0  | 1 | 2

CT('', 'ab') = 2

Page 27: Speech & NLP (Fall 2014): Levenshtein Edit Distance

Example 01

- insert 'a' at position 1 in source at cost 1;
- insert 'b' at position 2 in source at cost 1.

So, LED('', 'ab') = 2.

Page 28: Speech & NLP (Fall 2014): Levenshtein Edit Distance

Example 02

Let insert cost = delete cost = substitute cost = 1. Let source = 'ab' and target = ''.

How can we transform source to target?

Page 29: Speech & NLP (Fall 2014): Levenshtein Edit Distance

Example 02

- delete 'a' at position 1 in source at cost 1;
- delete 'b' at position 2 in source at cost 1.

So, LED('ab', '') = 2.

Page 30: Speech & NLP (Fall 2014): Levenshtein Edit Distance

Example 03

Let insert_cost = delete_cost = substitute_cost = 1. Let source = 'abc' and target = 'ac'.

- match 'a' at position 1 in source with 'a' at position 1 in target at cost 0;
- delete 'b' at position 2 in source at cost 1;
- match 'c' at position 3 in source with 'c' at position 2 in target at cost 0.

So, LED('abc', 'ac') = 1.

Page 31: Speech & NLP (Fall 2014): Levenshtein Edit Distance

Recursive LED Algorithm

Page 32: Speech & NLP (Fall 2014): Levenshtein Edit Distance

Specification

LevEdDist(source, target, ins_cost, del_cost, sub_cost)

- source – source string

- target – target string

- ins_cost – cost of insertion

- del_cost – cost of deletion

- sub_cost – cost of substitution

LevEdDist(source, target, ins_cost, del_cost, sub_cost) returns a sequence of edits that converts source to target and the Levenshtein distance, i.e., the total cost of the edits

Page 33: Speech & NLP (Fall 2014): Levenshtein Edit Distance

Base Case

from copy import copy

def LevEdDist(source_str, target_str, edit_ops, ins_cost=1, del_cost=1, sub_cost=1):
    # 1. compute lengths of source and target strings
    target_len, source_len = len(target_str), len(source_str)
    # 2. edit_ops is a list of edit operations; work on a copy
    #    so that the caller's list is not destructively modified
    edit_ops_copy = copy(edit_ops)
    if source_len == 0:
        # 3. if source is empty, insert all target characters into it
        for c in target_str:
            edit_ops_copy.append(InsertOperator(c, ins_cost))
        return (target_len * ins_cost, edit_ops_copy)
    if target_len == 0:
        # 4. if target is empty, delete all characters from source
        for c in source_str:
            edit_ops_copy.append(DeleteOper('del', c, del_cost))
        return (source_len * del_cost, edit_ops_copy)

Page 34: Speech & NLP (Fall 2014): Levenshtein Edit Distance

Recursion

● Two strings source[0, m] and target[0, n]
● If the character at source[m] equals the character at target[n], add the cost of 0 to the value of the recursive call LevEdDist on source[0, m-1] and target[0, n-1]
● If these characters are not the same, compute the costs of deletion, insertion, and substitution, and choose the minimum cost

Page 35: Speech & NLP (Fall 2014): Levenshtein Edit Distance

Recursion: Three Recursive Calls

dc_cost, dc_edit_ops = LevEdDist(source_str[0:source_len-1], target_str, edit_ops, ins_cost, del_cost, sub_cost)

ic_cost, ic_edit_ops = LevEdDist(source_str, target_str[0:target_len-1], edit_ops, ins_cost, del_cost, sub_cost)

sc_cost, sc_edit_ops = LevEdDist(source_str[0:source_len-1], target_str[0:target_len-1], edit_ops, ins_cost, del_cost, sub_cost)

Choose the minimum of dc_cost + del_cost, ic_cost + ins_cost, and sc_cost + sub_cost
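Putting the base cases and the three recursive calls together gives roughly the following. This is a sketch under stated assumptions, not the author's code: the name lev_ed_dist is made up, edits are plain tuples instead of the InsertOperator and DeleteOper objects from the slides, and the edit list is built from return values rather than passed in as edit_ops.

```python
def lev_ed_dist(source, target, ins_cost=1, del_cost=1, sub_cost=1):
    """Return (cost, edits) for transforming source into target.

    edits is a left-to-right list of tuples: ('ins', ch), ('del', ch),
    ('sub', old, new), or ('match', ch).
    """
    if not source:  # empty source: insert every target character
        return len(target) * ins_cost, [('ins', ch) for ch in target]
    if not target:  # empty target: delete every source character
        return len(source) * del_cost, [('del', ch) for ch in source]

    s_last, t_last = source[-1], target[-1]
    costs = (ins_cost, del_cost, sub_cost)

    # Deletion: drop the last source character, transform the rest.
    dc_cost, dc_edits = lev_ed_dist(source[:-1], target, *costs)
    # Insertion: produce all but the last target character, then insert it.
    ic_cost, ic_edits = lev_ed_dist(source, target[:-1], *costs)
    # Substitution or match: align the two last characters.
    sc_cost, sc_edits = lev_ed_dist(source[:-1], target[:-1], *costs)

    if s_last == t_last:
        diagonal = (sc_cost, sc_edits + [('match', s_last)])
    else:
        diagonal = (sc_cost + sub_cost, sc_edits + [('sub', s_last, t_last)])

    candidates = [
        (dc_cost + del_cost, dc_edits + [('del', s_last)]),
        (ic_cost + ins_cost, ic_edits + [('ins', t_last)]),
        diagonal,
    ]
    # Compare on cost only; ties go to the first candidate in the list.
    return min(candidates, key=lambda pair: pair[0])


print(lev_ed_dist('', 'ab'))     # (2, [('ins', 'a'), ('ins', 'b')])
print(lev_ed_dist('abc', 'ac'))  # (1, [('match', 'a'), ('del', 'b'), ('match', 'c')])
```

Each call spawns three more calls on overlapping prefixes, so the running time grows exponentially with the string lengths.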

Page 36: Speech & NLP (Fall 2014): Levenshtein Edit Distance

LED Computation with Dynamic Programming

Page 37: Speech & NLP (Fall 2014): Levenshtein Edit Distance

Computing CT(r, c)

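1. Construct an (m+1) x (n+1) table CT, with rows 0..m for the prefixes of source and columns 0..n for the prefixes of target
2. Fill row 0
3. Fill column 0
4. Then, for r > 0 and c > 0, CT[r, c] = min{ CT[r-1, c-1] + (0 if the r-th source character equals the c-th target character, sub_cost otherwise), CT[r-1, c] + del_cost, CT[r, c-1] + ins_cost }
5. CT[m, n] is the final (and minimal!) cost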
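The table-filling steps can be sketched as follows. The function name led_table is made up for this illustration; the table has one extra row and column for the empty prefixes, and the diagonal step costs 0 when the two characters match.

```python
def led_table(source, target, ins_cost=1, del_cost=1, sub_cost=1):
    """Fill and return the full cost table CT as a list of rows."""
    m, n = len(source), len(target)
    ct = [[0] * (n + 1) for _ in range(m + 1)]
    for c in range(1, n + 1):  # row 0: insertions only
        ct[0][c] = c * ins_cost
    for r in range(1, m + 1):  # column 0: deletions only
        ct[r][0] = r * del_cost
    for r in range(1, m + 1):
        for c in range(1, n + 1):
            # Python strings are 0-based, so the r-th character is source[r - 1].
            diag = 0 if source[r - 1] == target[c - 1] else sub_cost
            ct[r][c] = min(ct[r - 1][c - 1] + diag,
                           ct[r - 1][c] + del_cost,
                           ct[r][c - 1] + ins_cost)
    return ct


table = led_table('abc', 'ac')
for row in table:
    print(row)
# [0, 1, 2]
# [1, 0, 1]
# [2, 1, 1]
# [3, 2, 1]
print(table[-1][-1])  # 1
```

Each cell is computed once from three neighbors, so the work is proportional to m times n rather than exponential as in the plain recursion.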

Page 38: Speech & NLP (Fall 2014): Levenshtein Edit Distance

LED Properties

● LED is minimal
● LED is correct
● LED can be computed with only 2 rows of the CT table
● An optimal sequence of edits can be recovered from the CT table
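The last two properties can be sketched in code. Both function names (led_two_rows, recover_edits) and the tuple format for edits are assumptions of this illustration. The first function keeps only the previous and current rows; the second fills the whole table and walks back from CT[m, n] to read off one optimal edit sequence.

```python
def led_two_rows(source, target, ins_cost=1, del_cost=1, sub_cost=1):
    """Return the edit distance while storing only two table rows."""
    prev = [c * ins_cost for c in range(len(target) + 1)]  # row 0
    for r, s_ch in enumerate(source, start=1):
        curr = [r * del_cost]  # column 0 of row r
        for c, t_ch in enumerate(target, start=1):
            diag = 0 if s_ch == t_ch else sub_cost
            curr.append(min(prev[c - 1] + diag,
                            prev[c] + del_cost,
                            curr[c - 1] + ins_cost))
        prev = curr
    return prev[-1]


def recover_edits(source, target, ins_cost=1, del_cost=1, sub_cost=1):
    """Return (cost, edits), with edits read back from the full table."""
    m, n = len(source), len(target)
    ct = [[0] * (n + 1) for _ in range(m + 1)]
    for c in range(1, n + 1):
        ct[0][c] = c * ins_cost
    for r in range(1, m + 1):
        ct[r][0] = r * del_cost
        for c in range(1, n + 1):
            diag = 0 if source[r - 1] == target[c - 1] else sub_cost
            ct[r][c] = min(ct[r - 1][c - 1] + diag,
                           ct[r - 1][c] + del_cost,
                           ct[r][c - 1] + ins_cost)

    edits, r, c = [], m, n
    while r > 0 or c > 0:
        if r > 0 and c > 0:
            same = source[r - 1] == target[c - 1]
            diag = 0 if same else sub_cost
            if ct[r][c] == ct[r - 1][c - 1] + diag:
                if same:
                    edits.append(('match', source[r - 1]))
                else:
                    edits.append(('sub', source[r - 1], target[c - 1]))
                r, c = r - 1, c - 1
                continue
        if r > 0 and ct[r][c] == ct[r - 1][c] + del_cost:
            edits.append(('del', source[r - 1]))
            r -= 1
        else:
            edits.append(('ins', target[c - 1]))
            c -= 1
    edits.reverse()  # the walk runs from the end of the strings to the start
    return ct[m][n], edits


print(led_two_rows('kitten', 'sitting'))  # 3
print(recover_edits('abc', 'ac'))
# (1, [('match', 'a'), ('del', 'b'), ('match', 'c')])
```

The two-row version saves memory but cannot recover the edits, because the walk back needs every row of the table.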