Approximate string comparators

1

Approximate string comparators

TvungenOne, 2012-06-15Lars Marius Garshol, <[email protected]>http://twitter.com/larsga

2

Approximate string comparators?

• Basically, measures of the similarity between two strings

• Useful in situations where exact match is insufficient– record linkage– search– ...

• Many of these are slow: O(n2)

3

Levenshtein

• Also known as edit distance• Measures the number of edit operations

necessary to turn s1 into s2• Edit operations are– insert a character– remove a character– substitute a character

4

Levenshtein example

• Levenshtein -> Löwenstein– Levenstein (remove ‘h’)– Lövenstein (substitute ‘ö’)– Löwenstein (substitute ‘w’)

• Edit distance = 3

5

Weighted Levenshtein

• Not all edit operations are equal• Substituting “i” for “e” is a smaller edit

than substituting “o” for “k”• Weighted Levenshtein evaluates each

edit operation as a number 0.0-1.0• Difficult to implement– weights are also language-dependent

6

Jaro-Winkler

• Developed at the US Bureau of the Census

• For name comparisons– not well suited to long strings– best if given name/surname are separated

• Exists in a few variants– originally proposed by Winkler– then modified by Jaro– a few different versions of modifications etc

7

Jaro-Winkler definition

• Formula:– m = number of matching characters– t = number of transposed characters

• A character from string s1 matches s2 if the same character is found in s2 less then half the length of the string away

• Levenshtein ~ Löwenstein = 0.8• Axel ~ Aksel = 0.783

8

Jaro-Winkler variant

9

Soundex

• A coarse schema for matching names by sound– produces a key from the name– names match if key is the same

• In common use in many places– Nav’s person register uses it for search– built-in in many databases– ...

10

Soundex definition

11

Examples

• soundex(“Axel”) = ‘A240’• soundex(“Aksel”) = ‘A240’• soundex(“Levenshtein”) = ‘L523’• soundex(“Löwenstein”) = ‘L152’

12

Metaphone

• Developed by Lawrence Philips• Similar to Soundex, but much more

complex– both more accurate and more sensitive

• Developed further into Double Metaphone

• Metaphone 3.0 also exists, but only available commercially

13

Metaphone examples

• metaphone(“Axel”) = ‘AKSL’• metaphone(“Aksel”) = ‘AKSL’• metaphone(“Levenshtein”) = ‘LFNX’• metaphone(“Löwenstein”) = ‘LWNS’

14

Dice coefficient

• A similarity measure for sets– set can be tokens in a string– or characters in a string

• Formula:

15

TFIDF

• Compares strings as sets of tokens– a la Dice coefficient

• However, takes frequency of tokens in corpus into account– this matches how we evaluate matches

mentally

• Has done well in evaluations– however, can be difficult to evaluate– results will change as corpus changes

16

More comparators

• Smith-Waterman– originated in DNA sequencing

• Q-grams distance– breaks string into sets of pieces of q characters– then does set similarity comparison

• Monge-Elkan– similar to Smith-Waterman, but with affine gap

distances– has done very well in evaluations– costly to evaluate

• Many, many more– ...

Approximate string comparators

Technology

Transcript of Approximate string comparators