Approximate string comparators

16
1 Approximate string comparators TvungenOne, 2012-06-15 Lars Marius Garshol, <[email protected]> http://twitter.com/larsga

description

A quick overview of some common approximate string comparators used in record linkage.

Transcript of Approximate string comparators

Page 1: Approximate string comparators

1

Approximate string comparators

TvungenOne, 2012-06-15Lars Marius Garshol, <[email protected]>http://twitter.com/larsga

Page 2: Approximate string comparators

2

Approximate string comparators?

• Basically, measures of the similarity between two strings

• Useful in situations where exact match is insufficient– record linkage– search– ...

• Many of these are slow: O(n2)

Page 3: Approximate string comparators

3

Levenshtein

• Also known as edit distance• Measures the number of edit operations

necessary to turn s1 into s2• Edit operations are– insert a character– remove a character– substitute a character

Page 4: Approximate string comparators

4

Levenshtein example

• Levenshtein -> Löwenstein– Levenstein (remove ‘h’)– Lövenstein (substitute ‘ö’)– Löwenstein (substitute ‘w’)

• Edit distance = 3

Page 5: Approximate string comparators

5

Weighted Levenshtein

• Not all edit operations are equal• Substituting “i” for “e” is a smaller edit

than substituting “o” for “k”• Weighted Levenshtein evaluates each

edit operation as a number 0.0-1.0• Difficult to implement– weights are also language-dependent

Page 6: Approximate string comparators

6

Jaro-Winkler

• Developed at the US Bureau of the Census

• For name comparisons– not well suited to long strings– best if given name/surname are separated

• Exists in a few variants– originally proposed by Winkler– then modified by Jaro– a few different versions of modifications etc

Page 7: Approximate string comparators

7

Jaro-Winkler definition

• Formula:– m = number of matching characters– t = number of transposed characters

• A character from string s1 matches s2 if the same character is found in s2 less then half the length of the string away

• Levenshtein ~ Löwenstein = 0.8• Axel ~ Aksel = 0.783

Page 8: Approximate string comparators

8

Jaro-Winkler variant

Page 9: Approximate string comparators

9

Soundex

• A coarse schema for matching names by sound– produces a key from the name– names match if key is the same

• In common use in many places– Nav’s person register uses it for search– built-in in many databases– ...

Page 10: Approximate string comparators

10

Soundex definition

Page 11: Approximate string comparators

11

Examples

• soundex(“Axel”) = ‘A240’• soundex(“Aksel”) = ‘A240’• soundex(“Levenshtein”) = ‘L523’• soundex(“Löwenstein”) = ‘L152’

Page 12: Approximate string comparators

12

Metaphone

• Developed by Lawrence Philips• Similar to Soundex, but much more

complex– both more accurate and more sensitive

• Developed further into Double Metaphone

• Metaphone 3.0 also exists, but only available commercially

Page 13: Approximate string comparators

13

Metaphone examples

• metaphone(“Axel”) = ‘AKSL’• metaphone(“Aksel”) = ‘AKSL’• metaphone(“Levenshtein”) = ‘LFNX’• metaphone(“Löwenstein”) = ‘LWNS’

Page 14: Approximate string comparators

14

Dice coefficient

• A similarity measure for sets– set can be tokens in a string– or characters in a string

• Formula:

Page 15: Approximate string comparators

15

TFIDF

• Compares strings as sets of tokens– a la Dice coefficient

• However, takes frequency of tokens in corpus into account– this matches how we evaluate matches

mentally

• Has done well in evaluations– however, can be difficult to evaluate– results will change as corpus changes

Page 16: Approximate string comparators

16

More comparators

• Smith-Waterman– originated in DNA sequencing

• Q-grams distance– breaks string into sets of pieces of q characters– then does set similarity comparison

• Monge-Elkan– similar to Smith-Waterman, but with affine gap

distances– has done very well in evaluations– costly to evaluate

• Many, many more– ...