NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

Natural Language Processing: Levenshtein Edit Distance (LED) & Skip Trie Matching (STM). Vladimir Kulyukin, www.vkedco.blogspot.com

Transcript of NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

Page 1: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

Natural Language Processing

Levenshtein Edit Distance (LED) &

Skip Trie Matching (STM)

Vladimir Kulyukin

www.vkedco.blogspot.com

Page 2: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

Outline

● Levenshtein Edit Distance (LED)
  – Definition
  – Recursive Computation
  – Dynamic Programming Computation

● Skip Trie
  – Background
  – Trie & Skip Trie
  – Skip Trie Matching

Page 3: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

Levenshtein Edit Distance

Page 4: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

Minimum Edit Distance

● Suppose we have two strings: source and target

● Suppose we have a finite set of operations (edit_ops) that can be used to transform source to target

● Each operation has a cost

● A Minimum Edit Distance is a metric that measures the minimum total cost of transforming source to target

Page 5: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

Strings as Prefix Sequences

Any string can be viewed as a sequence of prefixes

1. If s = '', then the prefix sequence is <''>
2. If s = 'a', then the prefix sequence is <'', 'a'>
3. If s = 'ab', then the prefix sequence is <'', 'a', 'ab'>

In general, if s = c1c2...cn, then the prefix sequence is <'', 'c1', 'c1c2', ..., s>
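The prefix view can be sketched in a few lines of Python. The function name `prefix_sequence` is made up for this illustration and does not come from the slides.

```python
from typing import List


def prefix_sequence(s: str) -> List[str]:
    """Return all prefixes of s, from the empty string up to s itself."""
    # s[:k] is the prefix of length k, for k = 0, 1, ..., len(s)
    return [s[:k] for k in range(len(s) + 1)]


print(prefix_sequence("ab"))
```

A string of length n has n + 1 prefixes, which is why the cost tables on the following slides have one more row and column than the strings have characters.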

Page 6: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

Definition

● Levenshtein edit distance (LED) is one of the best-known metrics for measuring how different two character sequences are

● The metric is named after Vladimir Levenshtein, who introduced it in 1965

● Given two strings, source and target, LED is defined as the minimum number of edit operations (aka edits) needed to transform source to target

Page 7: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

Edit Operations (AKA Edits)

● The standard edit operations, aka edits, are insertion, deletion, & substitution

● Assume pt and ps are legal positions in target and source, respectively

● Insertion – a character at position pt in target is inserted into source at position ps

● Deletion – a character is deleted from source at position ps

● Substitution – a character at position pt in target is substituted for a character at position ps in source

Page 8: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

Edit Costs

● The standard edit operations have associated costs

● The costs are application dependent, and are typically positive integers

● For example, the costs of insertion, deletion, and substitution can all be set to 1

● In some contexts, the cost of substitution is set to 2 (substitution can be viewed as a deletion followed by an insertion)

Page 9: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

String Transformation Cost

CT(s1, s2) = numerical cost of transforming source string s1 to target string s2

Page 10: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

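Tabulating Transformation Costs

Rows are indexed by the prefixes of SOURCE ('', c1, c2, …, cm); columns are indexed by the prefixes of TARGET ('', c1, c2, …, cn).

                 TARGET
                 ''   c1   c2   c3   c4   c5   …   cn
SOURCE   ''
         c1
         c2
         …
         cm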

Page 11: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

CT('', '')

[Cost table: the cell at SOURCE row '', TARGET column '']

Page 12: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

CT('', 'c1')

[Cost table: the cell at SOURCE row '', TARGET column c1]

Page 13: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

CT('', 'c1c2')

[Cost table: the cell at SOURCE row '', TARGET column c2]

Page 14: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

CT('', 'c1c2c3')

[Cost table: the cell at SOURCE row '', TARGET column c3]

Page 15: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

CT('c1', '')

[Cost table: the cell at SOURCE row c1, TARGET column '']

Page 16: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

CT('c1c2', '')

[Cost table: the cell at SOURCE row c2, TARGET column '']

Page 17: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

CT('c1c2c3', '')

[Cost table: the cell at SOURCE row c3, TARGET column '']

Page 18: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

CT('c1...cm', 'c1...cn')

[Cost table: the cell at SOURCE row cm, TARGET column cn]

Page 19: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

Transforming Empty Source to Target

        ''   c1   c2   c3   c4   c5   …   cn
''      0    i1   i2   i3   i4   i5   …   in
c1
c2
…
cm

The only way to transform an empty source to some target is to insert 0 or more characters into it (ik is the cost of inserting k characters)

Page 20: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

Transforming Source to Empty Target

        ''   c1   c2   c3   c4   c5   …   cn
''      0
c1      d1
c2      d2
…       …
cm      dm

The only way to transform some source to an empty target is to delete 0 or more characters from it (dk is the cost of deleting k characters)
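The two boundary cases (row 0 holds insertion costs, column 0 holds deletion costs) can be sketched as follows. The function name `init_cost_table` and the use of `None` for cells that are not filled yet are choices made for this illustration, not something the slides specify.

```python
from typing import List, Optional


def init_cost_table(source: str, target: str,
                    ins_cost: int = 1, del_cost: int = 1) -> List[List[Optional[int]]]:
    """Build an (m+1) x (n+1) table with only row 0 and column 0 filled."""
    m, n = len(source), len(target)
    ct: List[List[Optional[int]]] = [[None] * (n + 1) for _ in range(m + 1)]
    ct[0][0] = 0  # empty source to empty target costs nothing
    for c in range(1, n + 1):
        ct[0][c] = c * ins_cost  # i_c: insert the first c target characters
    for r in range(1, m + 1):
        ct[r][0] = r * del_cost  # d_r: delete the first r source characters
    return ct


for row in init_cost_table("ab", "abc"):
    print(row)
```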

Page 21: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

Examples

Page 22: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

Example 01 Let insertion cost = deletion cost = substitution cost = 1.

Let source = '' and target = 'ab'.

How can we transform source to target?

Page 23: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

Example 01

        ''   a   b
''

Page 24: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

Example 01

        ''   a   b
''      0

Page 25: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

Example 01

        ''   a   b
''      0    1

Page 26: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

Example 01

        ''   a   b
''      0    1   2

Page 27: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

Example 01

- insert 'a' at position 1 in source at cost 1;
- insert 'b' at position 2 in source at cost 1;

So, LED('', 'ab') = 2.

Page 28: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

Example 02

Let insert cost = delete cost = substitute cost = 1. Let source = 'ab' and target = ''.

How can we transform source to target?

Page 29: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

Example 02

- delete 'a' at position 1 in source at cost 1;
- delete 'b' at position 2 in source at cost 1;

So, LED('ab', '') = 2.

Page 30: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

Example 03

Let insert cost = delete cost = substitute cost = 1. Let source = 'abc' and target = 'ac'.

- match 'a' at position 1 in source with 'a' at position 1 in target at cost 0;
- delete 'b' at position 2 in source at cost 1;
- match 'c' at position 3 in source with 'c' at position 2 in target at cost 0.

So, LED('abc', 'ac') = 1.

Page 31: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

Recursive LED Algorithm

Page 32: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

Specification

LevEdDist(source, target, ins_cost, del_cost, sub_cost)

- source – source string

- target – target string

- ins_cost – cost of insertion

- del_cost – cost of deletion

- sub_cost – cost of substitution

LevEdDist(source, target, ins_cost, del_cost, sub_cost) returns a sequence of edits that converts source to target, together with the Levenshtein distance, i.e., the total cost of those edits

Page 33: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

Pseudo Code

LED(source_str, target_str, edit_ops, edit_cost, ins_cost=1, del_cost=1, sub_cost=1):
    # 1. compute lengths of source and target strings
    target_len, source_len = len(target_str), len(source_str)
    # 2. edit_ops is a list of edit operations; work on a copy so that
    #    the caller's list is not destructively modified
    edit_ops_copy = copy(edit_ops)
    if source_len == 0:
        # 3. if source is empty, insert all target characters into it
        for c in target_str:
            edit_ops_copy.append(new InsOper(c, ins_cost))
        return edit_cost + target_len * ins_cost, edit_ops_copy
    if target_len == 0:
        # 4. if target is empty, delete all characters from source
        for c in source_str:
            edit_ops_copy.append(new DelOper(c, del_cost))
        return edit_cost + source_len * del_cost, edit_ops_copy

Page 34: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

Recursion

● If the character at position source_len-1 in source is the same as the character at position target_len-1 in target, set the current cost to 0 (this is a character match, which can be viewed as substituting the character in the source with the same character from the target)

● Match is a zero-cost substitution

● If these characters are not the same, compute the costs of deletion, insertion and substitution, and choose the minimum cost

Page 35: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

Pseudo Code: Three Recursive Calls

# choose deletion and recurse
dc_cost, dc_edit_ops = LED(source_str[0:source_len-1], target_str,
                           edit_ops, edit_cost,
                           ins_cost=ins_cost, del_cost=del_cost, sub_cost=sub_cost)

# choose insertion and recurse
ic_cost, ic_edit_ops = LED(source_str, target_str[0:target_len-1],
                           edit_ops, edit_cost,
                           ins_cost=ins_cost, del_cost=del_cost, sub_cost=sub_cost)

# choose substitution and recurse
sc_cost, sc_edit_ops = LED(source_str[0:source_len-1], target_str[0:target_len-1],
                           edit_ops, edit_cost,
                           ins_cost=ins_cost, del_cost=del_cost, sub_cost=sub_cost)

Page 36: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

Pseudo Code: Choosing Minimal Edit Sequence

# add the cost of the last edit to each recursive result and take the minimum
is_match = target_str[target_len-1] == source_str[source_len-1]
dc_total = dc_cost + del_cost
ic_total = ic_cost + ins_cost
sc_total = sc_cost + (0 if is_match else sub_cost)
min_cost = min(dc_total, ic_total, sc_total)

if min_cost == dc_total:
    edit_ops_copy = copy(dc_edit_ops)
    # add a new delete operator
    edit_ops_copy.append(new DelOper(source_str[source_len-1], del_cost))
else if min_cost == ic_total:
    edit_ops_copy = copy(ic_edit_ops)
    # add a new insertion operator
    edit_ops_copy.append(new InsOper(target_str[target_len-1], ins_cost))
else:
    edit_ops_copy = copy(sc_edit_ops)
    if is_match:
        # if the characters are the same, then there is a match
        edit_ops_copy.append(new MatchOper(target_str[target_len-1], source_str[source_len-1], 0))
    else:
        edit_ops_copy.append(new SubOper(target_str[target_len-1], source_str[source_len-1], sub_cost))

return min_cost, edit_ops_copy
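The recursive scheme can be sketched as runnable Python. This is an illustration under stated assumptions, not the slides' implementation: edits are plain tuples `(operation, characters, cost)` instead of operator objects, the recursion works on prefix lengths rather than string slices, and memoization via `functools.lru_cache` is an addition here (without it the three-way recursion takes exponential time).

```python
from functools import lru_cache
from typing import List, Tuple

Edit = Tuple[str, str, int]  # (operation, character(s), cost) -- hypothetical format


def led_recursive(source: str, target: str, ins_cost: int = 1,
                  del_cost: int = 1, sub_cost: int = 1) -> Tuple[int, List[Edit]]:
    """Return (minimum cost, one optimal edit sequence) for source -> target."""

    @lru_cache(maxsize=None)  # memoization: an addition, not in the slides
    def solve(i: int, j: int) -> Tuple[int, Tuple[Edit, ...]]:
        # Transform the prefix source[:i] into the prefix target[:j].
        if i == 0:  # empty source: insert every target character
            return j * ins_cost, tuple(("ins", ch, ins_cost) for ch in target[:j])
        if j == 0:  # empty target: delete every source character
            return i * del_cost, tuple(("del", ch, del_cost) for ch in source[:i])
        s_ch, t_ch = source[i - 1], target[j - 1]
        dc_cost, dc_ops = solve(i - 1, j)      # deletion
        ic_cost, ic_ops = solve(i, j - 1)      # insertion
        sc_cost, sc_ops = solve(i - 1, j - 1)  # substitution or match
        if s_ch == t_ch:
            last: Edit = ("match", s_ch, 0)    # match = zero-cost substitution
        else:
            last = ("sub", s_ch + "->" + t_ch, sub_cost)
        candidates = [
            (sc_cost + last[2], sc_ops + (last,)),
            (dc_cost + del_cost, dc_ops + (("del", s_ch, del_cost),)),
            (ic_cost + ins_cost, ic_ops + (("ins", t_ch, ins_cost),)),
        ]
        return min(candidates, key=lambda cand: cand[0])

    cost, ops = solve(len(source), len(target))
    return cost, list(ops)


print(led_recursive("abc", "ac"))
```

When several edit sequences share the minimum cost, this sketch returns the first candidate in the order substitution/match, deletion, insertion; that tie-breaking order is arbitrary.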

Page 37: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

LED Computation with

Dynamic Programming

Page 38: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

Computing CT(r, c)

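Let m be the length of source and n the length of target.

1. Construct an (m+1) x (n+1) table CT
2. Fill row 0
3. Fill column 0
4. Then CT[r, c] = min{ CT[r-1, c-1] + sub_cost, CT[r-1, c] + del_cost, CT[r, c-1] + ins_cost }, where the substitution term is CT[r-1, c-1] + 0 when the r-th source character matches the c-th target character
5. CT[m, n] is the final (and minimal!) cost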
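A minimal sketch of the table-filling procedure in Python follows. The function name `led_dp` is made up for this illustration, and the table has one extra row and column so that the empty prefixes get their own cells.

```python
from typing import List


def led_dp(source: str, target: str, ins_cost: int = 1,
           del_cost: int = 1, sub_cost: int = 1) -> List[List[int]]:
    """Fill and return the full cost table; the answer is in the last cell."""
    m, n = len(source), len(target)
    ct = [[0] * (n + 1) for _ in range(m + 1)]
    for c in range(1, n + 1):
        ct[0][c] = c * ins_cost            # row 0: insertions only
    for r in range(1, m + 1):
        ct[r][0] = r * del_cost            # column 0: deletions only
    for r in range(1, m + 1):
        for c in range(1, n + 1):
            # a match is a zero-cost substitution
            step = 0 if source[r - 1] == target[c - 1] else sub_cost
            ct[r][c] = min(ct[r - 1][c - 1] + step,   # substitute / match
                           ct[r - 1][c] + del_cost,   # delete from source
                           ct[r][c - 1] + ins_cost)   # insert into source
    return ct


table = led_dp("abc", "ac")
for row in table:
    print(row)
print("LED =", table[-1][-1])
```

Each cell is computed once from three neighbours, so the run time is proportional to (m+1)(n+1).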

Page 39: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

Side Notes

● LED is a minimal distance

● LED is a correct minimal distance

● LED can be computed with only 2 rows of the table

● An optimal sequence of edits can be recovered from the CT table
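The two-row observation can be illustrated as follows: each row of CT depends only on the row above it, so only the previous and the current row need to be kept. The function name `led_two_rows` is made up for this sketch. Note that keeping just two rows gives the cost only; recovering an edit sequence needs the full table.

```python
def led_two_rows(source: str, target: str, ins_cost: int = 1,
                 del_cost: int = 1, sub_cost: int = 1) -> int:
    """Compute the edit cost while storing only two rows of the cost table."""
    # prev is row 0: cost of building each target prefix by insertions
    prev = [c * ins_cost for c in range(len(target) + 1)]
    for r, s_ch in enumerate(source, start=1):
        cur = [r * del_cost]  # column 0: delete the first r source characters
        for c, t_ch in enumerate(target, start=1):
            step = 0 if s_ch == t_ch else sub_cost
            cur.append(min(prev[c - 1] + step,     # substitute / match
                           prev[c] + del_cost,     # delete
                           cur[c - 1] + ins_cost)) # insert
        prev = cur
    return prev[-1]


print(led_two_rows("abc", "ac"))
```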

Page 40: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

Skip Trie & Skip Trie Matching

Page 41: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

Motivation

● According to the U.S. Department of Agriculture, U.S. residents have increased their caloric intake by 523 calories per day since 1970

● Mismanaged diets are estimated to account for 30-35% of cancer and diabetes cases

● A major contributor to the increased caloric intake is the consumer's inability (and sometimes unwillingness) to read & understand nutrition labels

● Nutrition information is rarely available to blind and visually impaired individuals

Page 42: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

Critical Barriers

● Manual nutrition intake recording is time-consuming and error-prone, especially on smartphones

● Automated, real-time nutrition information extraction & analysis is weak or nonexistent

● Nutrition decision support
  – is not context-sensitive;
  – does not couple consumers with dieticians;
  – is not integrated with PHRs or ODLs

Page 43: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

Persuasive NUTrition Management System (PNUTS)

Page 44: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

R&D Road to PNUTS

[Timeline: RoboCart (2003-05), ShopTalk (2006-08), ShopMobile I (2008-10), ShopMobile II (2010-12), PNUTS (2013-Now)]

Page 45: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

PNUTS Architecture

[Architecture diagram. Labeled components: Consumer/Patient, Nutritionist, Coach, Cloud, Inference Engine, OCR, Image Analysis]

Page 46: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

Vision-Based Nutrition Information Extraction in PNUTS

[Pipeline diagram. Labeled stages and data: Image, Nutrition Label Localizer, Table, Line Segmentor, Lines, OCR, TEXT]

Page 47: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

OCR Engine Accuracy Evaluation

● Two hundred images of nutrition label text chunks

● Three categories used to categorize accuracy:
  – Complete: OCRed characters are identical to image text
  – Partial: at least one OCRed character is missing or misrecognized
  – Garbled: either empty string is returned or all OCRed characters are misrecognized

Page 48: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

OCR Engine Accuracy

                      Complete      Partial       Garbled
Tesseract on Device   146 (73%)     36 (18%)      18 (9%)
GOCR on Device        42 (21%)      23 (11.5%)    135 (67.5%)
Tesseract on Server   158 (79%)     23 (11.5%)    19 (9.5%)
GOCR on Server        58 (28.99%)   56 (28%)      90 (45%)

Page 49: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

OCR Engine Speed in Milliseconds

                      Run 1    Run 2    Run 3    Run 4    Run 5    AVG/Sample   AVG/Image
Tesseract on Device   128238   101438   101643   109678   103205   110439.6     552.1
GOCR on Device        50349    47746    48964    52450    48247    49019.6      245
Tesseract on Server   38958    38061    37850    9891     39032    38289.6      191
GOCR on Server        21253    20842    20195    21182    20520    20763.3      103.8

Page 50: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

OCR Error Types

● Error Classification (Kukich 1992)
  – Non-words: 'polassium' vs. 'potassium'
  – Real-words: 'fats' vs. 'facts'

● State of the Art Error Correction:
  – N-Gram
  – Levenshtein Edit Distance (LED)
  – Both algorithms are implemented in Apache Lucene

Page 51: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

Big O Analysis

● LED – O(m*n²), where m is the number of entries in the dictionary and n is the size of the input

● N-Gram – O(n), where n is the size of the input if the dictionary is implemented as a hash with constant lookup
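The LED bound can be read off a brute-force correction loop: one quadratic-time table fill per dictionary entry, repeated for all m entries. The sketch below is an illustration only (it is not Apache Lucene's implementation); the names `led` and `closest_by_led` are made up, unit costs are assumed, and dictionary words are assumed to be about as long as the input.

```python
from typing import List


def led(source: str, target: str) -> int:
    """Unit-cost Levenshtein distance, two-row dynamic programming."""
    prev = list(range(len(target) + 1))
    for r, s_ch in enumerate(source, start=1):
        cur = [r]
        for c, t_ch in enumerate(target, start=1):
            step = 0 if s_ch == t_ch else 1
            cur.append(min(prev[c - 1] + step, prev[c] + 1, cur[c - 1] + 1))
        prev = cur
    return prev[-1]


def closest_by_led(word: str, dictionary: List[str]) -> List[str]:
    """Return the dictionary entries with the smallest distance to word."""
    if not dictionary:
        return []
    # m distance computations, each roughly n * n steps
    scored = [(led(word, entry), entry) for entry in dictionary]
    best = min(dist for dist, _ in scored)
    return [entry for dist, entry in scored if dist == best]


print(closest_by_led("polassium", ["potassium", "sodium", "calcium"]))
```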

Page 53: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

Trie Data Structure

● Tries are popular on mobile platforms for word completion due to space efficiency

● Worst-case lookup is O(n) where n is the length of the input string

● Efficient storage compared to hash table
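A minimal trie sketch in Python shows why lookup takes at most one step per input character. The class and method names (`TrieNode`, `Trie`, `insert`, `contains`) are chosen for this illustration and are not taken from the slides.

```python
from typing import Dict


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}  # one edge per character
        self.is_word = False                       # True if a word ends here


class Trie:
    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, word: str) -> None:
        node = self.root
        for ch in word:
            # words with a common prefix share the nodes of that prefix
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def contains(self, word: str) -> bool:
        node = self.root
        for ch in word:            # at most len(word) steps
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word


trie = Trie()
for w in ["ACID", "ACRE", "FAT"]:
    trie.insert(w)
print(trie.contains("ACID"), trie.contains("ACI"))
```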

Page 54: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

Skip Trie Matching

● The Skip Trie Matching (STM) algorithm is based on the idea that the trie data structure can be used to find the closest dictionary matches to misspelled words

● It is assumed that the dictionary of words is stored as a trie

● The only parameter in STM is the skip distance – a non-negative integer that defines the maximum number of misrecognized characters allowed in a misspelled word

Page 55: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

STM Basic Steps

● Process the input string character by character

● At the trie's current node, find the child character that matches the input's current character

● If a match is found, recurse to that node and consume the input's character

● If no match is found, recurse on each child node after incrementing the skip distance and without consuming the input's current character

● Details and pseudocode are in Kulyukin, Vanka & Wang (2013); see References

Page 56: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

STM Example

Suppose that the OCR engine recognizes the string 'ACID' as 'ACIR' and the trie dictionary has the word 'ACID' as a character path.
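The 'ACIR' example can be traced with a small Python sketch of the skip idea. This is one reading of the basic steps, not the authors' pseudocode: all names (`TrieNode`, `build_trie`, `skip_trie_match`, `max_skips`) are hypothetical, and the sketch assumes that a skip pairs the unmatched input character with a child's character and moves past both, so that suggestions keep the input's length. The published algorithm may handle this step differently.

```python
from typing import Dict, Iterable, List


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_word = False


def build_trie(words: Iterable[str]) -> TrieNode:
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root


def skip_trie_match(root: TrieNode, text: str, max_skips: int) -> List[str]:
    """Collect dictionary words reachable from text with at most max_skips skips."""
    suggestions: List[str] = []

    def walk(node: TrieNode, i: int, skips: int, path: str) -> None:
        if i == len(text):
            if node.is_word:
                suggestions.append(path)
            return
        child = node.children.get(text[i])
        if child is not None:
            # greedy step: a matching child is followed and no other is tried
            walk(child, i + 1, skips, path + text[i])
        elif skips < max_skips:
            # no match: spend one skip and try every child (assumption: the
            # unmatched input character is treated as misrecognized)
            for ch, nxt in sorted(node.children.items()):
                walk(nxt, i + 1, skips + 1, path + ch)

    walk(root, 0, 0, "")
    return suggestions


trie = build_trie(["ACID", "ACRE", "FAT"])
print(skip_trie_match(trie, "ACIR", max_skips=1))
```

In this sketch 'A', 'C' and 'I' match along the path, 'R' has no matching child under 'ACI', and one skip leads to 'D', which completes the word 'ACID'. With `max_skips=0` nothing is suggested.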

Page 61: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

Back to Big O Analysis

● LED – O(m*n²), where m is the number of entries in the dictionary and n is the size of the input

● N-Gram – O(n), where n is the size of the input if the dictionary is implemented as a hash with constant lookup

● STM – O(n log|Σ|), where n is the size of the input and |Σ| is the size of the alphabet

Page 62: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

LED, N-Gram, STM Accuracy & Speed

The results in the table below were obtained on a sample of 600 texts OCRed with Tesseract

                             STM    N-Gram   LED
Run Time (in milliseconds)   20     51       51
Recall                       15%    9%       8%

Page 63: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

STM Limitations

● Since STM is greedy, it cannot find all possible suggestions (not a limitation if a vocabulary is limited but a limitation in general)

● Current implementation finds matches only of the same length as the misspelled input

● STM cannot correct real-word errors

Page 64: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

Conclusions

● On the tested samples OCRed with Tesseract, STM ran faster and was more accurate than Apache Lucene's implementations of N-Gram & LED

● On the tested samples, Tesseract was more accurate than GOCR

● On the tested samples, GOCR ran faster than Tesseract

Page 65: NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

References

1. Levenshtein, V. "Binary Codes Capable of Correcting Deletions, Insertions, and Reversals." Soviet Physics Doklady 10: 707–10, 1966.

2. Kukich, K. "Techniques for Automatically Correcting Words in Text." ACM Computing Surveys, Vol. 24, No. 4, Dec. 1992.

3. Kulyukin, V., Vanka, A., Wang, H. "Skip Trie Matching: A Greedy Algorithm for Real-Time OCR Error Correction on Smartphones." International Journal of Digital Information and Wireless Communication (IJDIWC): 3(3): 56-65, 2013. ISSN: 2225-658X.