NLP (Fall 2013): Levenshtein Edit Distance & Skip Trie

Natural Language Processing

Levenshtein Edit Distance (LED) & Skip Trie Matching (STM)

Vladimir Kulyukin


http://www.linkedin.com/pub/vladimir-kulyukin/23/2a2/150

http://www.vkedco.blogspot.com/

Outline

● Levenshtein Edit Distance (LED)
  – Definition
  – Recursive Computation
  – Dynamic Programming Computation
● Skip Trie
  – Background
  – Trie & Skip Trie
  – Skip Trie Matching

Levenshtein Edit Distance

Minimum Edit Distance

● Suppose we have two strings: source and target
● Suppose we have a finite set of operations (edit_ops) that can be used to transform source to target
● Each operation has a cost
● A Minimum Edit Distance is a metric that measures the minimum total cost of transforming source to target

Strings as Prefix Sequences

Any string can be viewed as a sequence of prefixes

1. s = '', then the prefix sequence is <''>
2. s = 'a', then the prefix sequence is <'', 'a'>
3. s = 'ab', then the prefix sequence is <'', 'a', 'ab'>

In general, if s = c1c2...cn, then the prefix sequence is <'', 'c1', 'c1c2', ..., s>
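The prefix view can be sketched in a few lines of Python. The function name prefix_sequence is hypothetical and is used here only for illustration.

```python
def prefix_sequence(s: str) -> list[str]:
    """Return every prefix of s, from the empty string up to s itself."""
    return [s[:k] for k in range(len(s) + 1)]


print(prefix_sequence(""))    # ['']
print(prefix_sequence("ab"))  # ['', 'a', 'ab']
```

A string of length n has n + 1 prefixes, which is why a cost table indexed by prefixes needs one more row and column than the string lengths.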

Definition

● Levenshtein edit distance (LED) is a metric, one of the best known, that measures similarity between two character sequences

● The metric is named after Vladimir Levenshtein who discovered this metric in 1965

● Given two strings, source and target, LED is defined as the minimum number of edit operations (aka edits) to transform source to target

http://en.wikipedia.org/wiki/Vladimir_Levenshtein

Edit Operations (AKA Edits)

● The standard edit operations, aka edits, are insertion, deletion, & substitution
● Assume pt and ps are legal positions in target and source, respectively
● Insertion – a character at position pt in target is inserted into source at position ps
● Deletion – a character is deleted from source at position ps
● Substitution – a character at position pt in target is substituted for a character at position ps in source

Edit Costs

● The standard edit operations have associated costs

● The costs are application dependent, and are typically positive integers

● For example, the costs of insertion, deletion, and substitution can all be set to 1

● In some contexts, the substitution cost is set to 2 (substitution can be viewed as a deletion followed by an insertion)

String Transformation Cost

CT(s1, s2) = numerical cost of transforming source string s1 to target string s2

Tabulating Transformation Costs

The costs are tabulated with one row per prefix of SOURCE ('', c1, c2, …, cm) and one column per prefix of TARGET ('', c1, c2, c3, c4, c5, …, cn). Each cell holds the cost of transforming the row's source prefix into the column's target prefix:

● Row 0 holds CT('', ''), CT('', 'c1'), CT('', 'c1c2'), CT('', 'c1c2c3'), …
● Column 0 holds CT('', ''), CT('c1', ''), CT('c1c2', ''), CT('c1c2c3', ''), …
● The bottom-right cell holds CT('c1...cm', 'c1...cn')

Transforming Empty Source to Target

SOURCE \ TARGET | '' | c1 | c2 | c3 | c4 | c5 | … | cn
'' | 0 | i1 | i2 | i3 | i4 | i5 | … | in

The only way to transform empty source to some target is to insert 0 or more characters into it (ik is the cost of inserting k characters)

Transforming Source to Empty Target

SOURCE \ TARGET | ''
'' | 0
c1 | d1
c2 | d2
c3 | d3
… | …
cm | dm

The only way to transform some source to empty target is to delete 0 or more corresponding characters from it (dk is the cost of deleting k characters)
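The two boundary cases can be sketched in Python as follows. This is an illustration under one assumption that the slides do not state explicitly: every insertion costs ins_cost and every deletion costs del_cost regardless of the character, so ik = k * ins_cost and dk = k * del_cost. The function name base_cases is hypothetical.

```python
def base_cases(source: str, target: str, ins_cost: int = 1, del_cost: int = 1):
    """Return (row 0, column 0) of the cost table CT.

    Assumes a fixed cost per insertion and per deletion.
    """
    # Row 0: empty source -> target prefix of length k needs k insertions.
    row0 = [k * ins_cost for k in range(len(target) + 1)]
    # Column 0: source prefix of length k -> empty target needs k deletions.
    col0 = [k * del_cost for k in range(len(source) + 1)]
    return row0, col0


print(base_cases("", "ab"))  # ([0, 1, 2], [0])
print(base_cases("ab", ""))  # ([0], [0, 1, 2])
```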

Examples

Example 01

Let insertion cost = deletion cost = substitution cost = 1.

Let source = '' and target = 'ab'.

How can we transform source to target?

SOURCE \ TARGET | '' | a | b
'' | 0 | 1 | 2

- insert 'a' at position 1 in source at cost 1;
- insert 'b' at position 2 in source at cost 1;

So, LED('', 'ab') = 2.

Example 02

Let insert cost = delete cost = substitute cost = 1. Let source = 'ab' and target = ''.

How can we transform source to target?

- Delete 'a' at position 1 in source at cost 1;
- Delete 'b' at position 2 in source at cost 1;

So, LED('ab', '') = 2.

Example 03

Let insert cost = delete cost = substitute cost = 1. Let source = 'abc' and target = 'ac'.

- match 'a' at position 1 in source with 'a' at position 1 in target at cost 0;
- delete 'b' at position 2 in source at cost 1;
- match 'c' at position 3 in source with 'c' at position 2 in target at cost 0.

So, LED('abc', 'ac') = 1.

Recursive LED Algorithm

Specification

LevEdDist(source, target, ins_cost, del_cost, sub_cost)

- source – source string

- target – target string

- ins_cost – cost of insertion

- del_cost – cost of deletion

- sub_cost – cost of substitution

LevEdDist(source, target, ins_cost, del_cost, sub_cost) returns a sequence of edits that converts source to target, together with the Levenshtein distance, i.e., the total cost of those edits

Pseudo Code

LED(source_str, target_str, edit_ops, edit_cost, ins_cost=1, del_cost=1, sub_cost=1):
    # 1. compute lengths of source and target strings
    target_len, source_len = len(target_str), len(source_str)
    # 2. work on a copy so that the caller's edit_ops list is not destructively modified
    edit_ops_copy = copy(edit_ops)
    if source_len == 0:
        # 3. if source is empty, insert all target characters into it
        for c in target_str:
            edit_ops_copy.append(new InsOper(c, ins_cost))
        return edit_cost + target_len * ins_cost, edit_ops_copy
    if target_len == 0:
        # 4. if target is empty, delete all characters from source
        for c in source_str:
            edit_ops_copy.append(new DelOper(c, del_cost))
        return edit_cost + source_len * del_cost, edit_ops_copy

Recursion

● If the character at position source_len-1 in source is the same as the character at position target_len-1 in target, set the current cost to 0 (this is a character match, which can be viewed as substituting the character in the source with the same character in the target)

● Match is a zero-cost substitution

● If these characters are not the same, compute the costs of deletion, insertion and substitution, and choose the minimum cost

Pseudo Code: Three Recursive Calls

# choose deletion and recurse
dc_cost, dc_edit_ops = LED(source_str[0:source_len-1], target_str, edit_ops, edit_cost, ins_cost=ins_cost, del_cost=del_cost, sub_cost=sub_cost)

# choose insertion and recurse
ic_cost, ic_edit_ops = LED(source_str, target_str[0:target_len-1], edit_ops, edit_cost, ins_cost=ins_cost, del_cost=del_cost, sub_cost=sub_cost)

# choose substitution and recurse
sc_cost, sc_edit_ops = LED(source_str[0:source_len-1], target_str[0:target_len-1], edit_ops, edit_cost, ins_cost=ins_cost, del_cost=del_cost, sub_cost=sub_cost)

Pseudo Code: Choosing Minimal Edit Sequence

# the cost of each choice is the cost of its recursive call plus the cost of its last edit
last_sub_cost = 0 if target_str[target_len-1] == source_str[source_len-1] else sub_cost
dc_total = dc_cost + del_cost
ic_total = ic_cost + ins_cost
sc_total = sc_cost + last_sub_cost
min_cost = min(dc_total, ic_total, sc_total)

if min_cost == dc_total:
    edit_ops_copy = copy(dc_edit_ops)
    # add a new delete operator
    edit_ops_copy.append(new DelOper(source_str[source_len-1], del_cost))
else if min_cost == ic_total:
    edit_ops_copy = copy(ic_edit_ops)
    # add a new insertion operator
    edit_ops_copy.append(new InsOper(target_str[target_len-1], ins_cost))
else:
    edit_ops_copy = copy(sc_edit_ops)
    if target_str[target_len-1] == source_str[source_len-1]:
        # if the characters are the same, then there is a match
        edit_ops_copy.append(new MatchOper(target_str[target_len-1], source_str[source_len-1], 0))
    else:
        edit_ops_copy.append(new SubOper(target_str[target_len-1], source_str[source_len-1], sub_cost))

return min_cost, edit_ops_copy
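A minimal runnable Python sketch of the recursive computation follows. It is an illustration, not the course's implementation: edits are plain tuples rather than operator objects, the name led_recursive is hypothetical, and the three candidates are compared only after the cost of their final edit has been added. On ties it prefers match/substitution, then deletion, then insertion, which is an arbitrary choice.

```python
def led_recursive(source, target, ins_cost=1, del_cost=1, sub_cost=1):
    """Return (cost, edits) for transforming source into target.

    Plain recursion without memoization, so the running time is exponential.
    """
    if not source:  # empty source: insert every target character
        return len(target) * ins_cost, [("ins", c) for c in target]
    if not target:  # empty target: delete every source character
        return len(source) * del_cost, [("del", c) for c in source]

    s_last, t_last = source[-1], target[-1]
    args = (ins_cost, del_cost, sub_cost)
    dc_cost, dc_ops = led_recursive(source[:-1], target, *args)
    ic_cost, ic_ops = led_recursive(source, target[:-1], *args)
    sc_cost, sc_ops = led_recursive(source[:-1], target[:-1], *args)

    is_match = s_last == t_last
    last_sub_cost = 0 if is_match else sub_cost  # a match is a zero-cost substitution
    candidates = [
        (sc_cost + last_sub_cost,
         sc_ops + [("match" if is_match else "sub", s_last, t_last)]),
        (dc_cost + del_cost, dc_ops + [("del", s_last)]),
        (ic_cost + ins_cost, ic_ops + [("ins", t_last)]),
    ]
    # min() returns the first candidate with the lowest total cost.
    return min(candidates, key=lambda pair: pair[0])


cost, edits = led_recursive("abc", "ac")
print(cost)   # 1
print(edits)
```

Because the same pairs of prefixes are recomputed many times, this version is only practical for short strings.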

LED Computation with

Dynamic Programming

Computing CT(r, c)

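1. Construct an (m+1) x (n+1) table CT, where m is the length of source and n is the length of target (row 0 and column 0 correspond to the empty prefix '')
2. Fill row 0
3. Fill column 0
4. Then, for r ≥ 1 and c ≥ 1, CT[r, c] = min{ CT[r-1, c-1] + s, CT[r-1, c] + del_cost, CT[r, c-1] + ins_cost }, where s = 0 if the r-th character of source matches the c-th character of target and s = sub_cost otherwise
5. CT[m, n] is the final (and minimal!) cost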
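The table-filling steps can be sketched in Python as follows. The sketch assumes the table has one extra row and column for the empty prefix, that a character match counts as a zero-cost substitution, and that insertion and deletion costs do not depend on the character. The name led_table is hypothetical.

```python
def led_table(source, target, ins_cost=1, del_cost=1, sub_cost=1):
    """Return the full cost table CT; CT[m][n] is the edit distance."""
    m, n = len(source), len(target)
    ct = [[0] * (n + 1) for _ in range(m + 1)]
    for c in range(1, n + 1):          # row 0: insertions only
        ct[0][c] = ct[0][c - 1] + ins_cost
    for r in range(1, m + 1):          # column 0: deletions only
        ct[r][0] = ct[r - 1][0] + del_cost
    for r in range(1, m + 1):
        for c in range(1, n + 1):
            # Strings are 0-indexed, so the r-th character is source[r - 1].
            s = 0 if source[r - 1] == target[c - 1] else sub_cost
            ct[r][c] = min(
                ct[r - 1][c - 1] + s,         # match or substitution
                ct[r - 1][c] + del_cost,      # deletion
                ct[r][c - 1] + ins_cost,      # insertion
            )
    return ct


for row in led_table("abc", "ac"):
    print(row)
# [0, 1, 2]
# [1, 0, 1]
# [2, 1, 1]
# [3, 2, 1]
```

The bottom-right value, 1, agrees with Example 03. Each cell is computed once, so the work is proportional to the number of cells in the table.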

Side Notes

● LED is a minimal distance
● LED is a correct minimal distance
● LED can be computed with only 2 rows of the table
● An optimal sequence of edits can be recovered from the CT table
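The two-row observation can be sketched as follows: each cell depends only on the row above and on the cell to its left, so only the previous and current rows need to be kept. The name led_two_rows is hypothetical, and the sketch assumes a match costs 0.

```python
def led_two_rows(source, target, ins_cost=1, del_cost=1, sub_cost=1):
    """Return the edit distance while storing only two table rows."""
    prev = [c * ins_cost for c in range(len(target) + 1)]  # row 0
    for r, s_char in enumerate(source, start=1):
        curr = [r * del_cost]                              # column 0 of row r
        for c, t_char in enumerate(target, start=1):
            s = 0 if s_char == t_char else sub_cost
            curr.append(min(prev[c - 1] + s,         # match or substitution
                            prev[c] + del_cost,      # deletion
                            curr[c - 1] + ins_cost)) # insertion
        prev = curr
    return prev[-1]


print(led_two_rows("abc", "ac"))  # 1
```

The trade-off is that the discarded rows are exactly what is needed to recover an optimal sequence of edits, so this variant returns the cost only.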

Skip Trie & Skip Trie Matching


Motivation

● According to the U.S. Department of Agriculture, U.S. residents have increased their caloric intake by 523 calories per day since 1970

● Mismanaged diets are estimated to account for 30-35% of cancer and diabetes cases

● A major contributor to the increased caloric intake is the consumer's inability (and sometimes unwillingness) to read & understand nutrition labels

● Nutrition information is rarely available to blind and visually impaired individuals


Critical Barriers

● Manual nutrition intake recording is time-consuming and error-prone, especially on smartphones

● Automated, real-time nutrition information extraction & analysis is weak or nonexistent

● Nutrition decision support
  – is not context-sensitive;
  – does not couple consumers with dieticians;
  – is not integrated with PHRs or ODLs


Persuasive NUTrition Management System (PNUTS)


R&D Road to PNUTS

RoboCart (2003-05), ShopTalk (2006-08), ShopMobile I (2008-10), ShopMobile II (2010-12), PNUTS (2013-Now)


PNUTS Architecture

[Figure: PNUTS architecture. Labels: Nutritionist, Coach, Cloud, Consumer/Patient, Inference Engine, OCR, Image Analysis]


Vision-Based Nutrition Information Extraction in PNUTS

[Figure: vision-based nutrition information extraction. Labels: Image, Nutrition Label Localizer, Table, Line Segmentor, Lines, OCR, TEXT]

OCR Engine Accuracy Evaluation

● Two hundred images of nutrition label text chunks
● Three categories used to categorize accuracy:
  – Complete: OCRed characters are identical to image text
  – Partial: at least one OCRed character is missing or misrecognized
  – Garbled: either empty string is returned or all OCRed characters are misrecognized


OCR Engine Accuracy

Engine | Complete | Partial | Garbled
Tesseract on Device | 146 (73%) | 36 (18%) | 18 (9%)
GOCR on Device | 42 (21%) | 23 (11.5%) | 135 (67.5%)
Tesseract on Server | 158 (79%) | 23 (11.5%) | 19 (9.5%)
GOCR on Server | 58 (28.99%) | 56 (28%) | 90 (45%)


OCR Engine Speed in Milliseconds

Engine | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | AVG/Sample | AVG/Image
Tesseract on Device | 128238 | 101438 | 101643 | 109678 | 103205 | 110439.6 | 552.1
GOCR on Device | 50349 | 47746 | 48964 | 52450 | 48247 | 49019.6 | 245
Tesseract on Server | 38958 | 38061 | 37850 | 9891 | 39032 | 38289.6 | 191
GOCR on Server | 21253 | 20842 | 20195 | 21182 | 20520 | 20763.3 | 103.8


OCR Error Types

● Error Classification (Kukich 1992)
  – Non-words: 'polassium' vs. 'potassium'
  – Real-words: 'fats' vs. 'facts'
● State of the Art Error Correction:
  – N-Gram
  – Levenshtein Edit Distance (LED)
  – Both algorithms are implemented in Apache Lucene


Big O Analysis

● LED – O(m*n^2), where m is the number of entries in the dictionary and n is the size of the input

● N-Gram – O(n), where n is the size of the input if the dictionary is implemented as a hash with constant lookup


Skip Trie Matching


Trie Data Structure

● Tries are popular on mobile platforms for word completion due to space efficiency

● Worst-case lookup is O(n) where n is the length of the input string

● Efficient storage compared to hash table
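A minimal trie can be sketched in Python as follows. The class and method names (TrieNode, Trie, insert, contains) are hypothetical, and the dictionary-per-node layout is one of several possible representations.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # maps a character to the child TrieNode
        self.is_word = False  # True if a dictionary word ends at this node


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def contains(self, word):
        """Follow one child link per input character: O(n) for input length n."""
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word


trie = Trie(["ACID", "ACIDS", "FAT"])
print(trie.contains("ACID"))  # True
print(trie.contains("ACI"))   # False: a prefix of a word, but not a word
print(trie.contains("ACIR"))  # False
```

Words that share a prefix, such as 'ACID' and 'ACIDS', share the nodes for that prefix, which is where the space saving comes from.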


Skip Trie Matching

● The Skip Trie Matching (STM) algorithm is based on the idea that the trie data structure can be used to find the closest dictionary matches to misspelled words

● It is assumed that the dictionary of words is stored as a trie

● The only parameter in STM is the skip distance – a non-negative integer that defines the maximum number of misrecognized characters allowed in a misspelled word


STM Basic Steps

● Process the input string character by character
● At the trie's current node, find the child character that matches the input's current character
● If a match is found, recurse to that node and consume the input's character
● If no match is found, recurse on each child node after incrementing the skip distance and without consuming the input's current character

● Details and pseudocode are in this paper

http://www.vkedco.blogspot.com/2013/05/skip-trie-matching-for-real-time-ocr.html


STM Example

Suppose that the OCR engine recognizes the string 'ACID' as 'ACIR' and the trie dictionary has the word 'ACID' as a character path.
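The 'ACIR' example can be traced with the rough Python sketch below. It is not the authors' pseudocode (that is in the referenced paper) and it rests on several assumptions: the trie is built from nested dictionaries, the names build_trie, skip_trie_match, and END are hypothetical, and the skip budget is counted down instead of a skip counter being incremented. Most importantly, on a mismatch the sketch moves past the misrecognized input character when it descends to a child, so every suggestion has the same length as the input; that handling of the input character is an interpretation, not something the slides spell out.

```python
END = None  # end-of-word marker key; never collides with a character key


def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def skip_trie_match(node, text, skips_left, prefix=""):
    """Return dictionary words reachable from node with at most skips_left skips."""
    if not text:
        return [prefix] if END in node else []
    ch, rest = text[0], text[1:]
    if ch in node:
        # Greedy step: when the character matches, only that child is followed.
        return skip_trie_match(node[ch], rest, skips_left, prefix + ch)
    if skips_left == 0:
        return []
    suggestions = []
    for child_ch, child in node.items():
        if child_ch is not END:
            # Skip: try each child in place of the misrecognized character.
            suggestions += skip_trie_match(child, rest, skips_left - 1,
                                           prefix + child_ch)
    return suggestions


trie = build_trie(["ACID", "ACRE", "FAT"])
print(skip_trie_match(trie, "ACIR", 1))  # ['ACID']
print(skip_trie_match(trie, "ACIR", 0))  # []
print(skip_trie_match(trie, "ACRD", 1))  # ['ACRE']
```

The last call shows the greedy behavior: 'ACID' is also one character away from 'ACRD', but the matcher commits to the matching 'R' branch and never explores the 'I' branch.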




Back to Big O Analysis

● LED – O(m*n^2), where m is the number of entries in the dictionary and n is the size of the input

● N-Gram – O(n), where n is the size of the input if the dictionary is implemented as a hash with constant lookup

● STM – O(n log|Σ|), where n is the size of the input and |Σ| is the size of the alphabet


LED, N-Gram, STM Accuracy & Speed

The results in the table below were obtained on a sample of 600 texts OCRed with Tesseract

Measure | STM | N-Gram | LED
Run Time (in milliseconds) | 20 | 51 | 51
Recall | 15% | 9% | 8%


STM Limitations

● Since STM is greedy, it cannot find all possible suggestions (not a limitation if a vocabulary is limited but a limitation in general)

● Current implementation finds matches only of the same length as the misspelled input

● STM cannot correct real-word errors


Conclusions

● On the tested samples OCRed with Tesseract, STM ran faster and was more accurate than Apache Lucene's implementations of N-Gram & LED

● On the tested samples, Tesseract was more accurate than GOCR

● On the tested samples, GOCR ran faster than Tesseract


References

1. Levenshtein, V. (1966). "Binary Codes Capable of Correcting Deletions, Insertions, and Reversals." Soviet Physics Doklady 10: 707–10. http://profs.sci.univr.it/~liptak/ALBioinfo/files/levenshtein66.pdf

2. Kukich, K. (1992). "Techniques for Automatically Correcting Words in Text." ACM Computing Surveys, Vol. 24, No. 4, Dec. 1992. http://www.devl.fr/docs/these/bibli/Kukich1992Techniqueforautomatically.pdf

3. Kulyukin, V., Vanka, A., Wang, H. (2013). "Skip Trie Matching: A Greedy Algorithm for Real-Time OCR Error Correction on Smartphones." International Journal of Digital Information and Wireless Communication (IJDIWC) 3(3): 56-65. ISSN: 2225-658X. http://www.slideshare.net/VladimirKulyukin/skip-trie-matching-a-greedy

