Two Different Approximate String Matching Problems and Their Algorithms
Approximate text matching with the stringdist package
Approximate text matching with the stringdist package
Mark van der Loo
Statistics Netherlands
useR! 2014
markvanderloo.eu @MarkPJvanderLoo

The stringdist package
Fuzzy dictionary lookup:
amatch: fuzzy matching equivalent of match
ain: fuzzy matching equivalent of %in%
String metrics:
stringdist: pairwise distances
stringdistmatrix: distance matrix
qgrams: compute q-gram profile
Design “philosophy”:
Create interfaces that resemble base R (e.g. match, adist, nchar, agrep)
Mark van der Loo Approximate text matching with the stringdist package

Dictionary lookup
> match("leia", c("leela","leia"))
[1] 2
> match("liea", c("leela","leia"))
[1] NA
> amatch("liea", c("leela","leia"), maxDist=1)
[1] 2
> "liea" %in% c("leela","leia")
[1] FALSE
> ain("liea", c("leela","leia"), maxDist=1)
[1] TRUE
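
To make the idea of a fuzzy lookup concrete, here is a minimal base-R sketch. It is not the package's implementation: `fuzzy_match` and `fuzzy_in` are hypothetical helpers, and they rely on base R's `adist` (generalized Levenshtein distance) as a stand-in metric. Because Levenshtein distance counts a transposition as two edits, the 'liea' example above would not match at `maxDist = 1` with this sketch; it matches in `amatch` because that function's default metric also counts transpositions.

```r
# Hypothetical helper: nearest-entry lookup in the spirit of amatch(),
# using base R's adist() (Levenshtein) as the distance. Illustration only.
fuzzy_match <- function(x, table, maxDist = 1) {
  d <- adist(x, table)  # one row per element of x, one column per table entry
  vapply(seq_len(nrow(d)), function(i) {
    j <- which.min(d[i, ])  # index of the closest entry (first one on ties)
    if (length(j) == 1 && d[i, j] <= maxDist) j else NA_integer_
  }, integer(1))
}

# Hypothetical helper: fuzzy counterpart of %in%, built on fuzzy_match().
fuzzy_in <- function(x, table, maxDist = 1) {
  !is.na(fuzzy_match(x, table, maxDist))
}

dictionary <- c("leela", "leia")
result <- fuzzy_match(c("leia", "lea", "han"), dictionary, maxDist = 1)
print(result)  # "leia" matches exactly, "lea" is one insertion away, "han" has no match
```

Returning the index of the closest entry, and `NA` when nothing is within `maxDist`, mirrors the way `match` reports positions, which is the design principle stated earlier in the talk.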

String distances
Implemented in the package:
edit-based distances
q-gram based distances
heuristic distances
Review papers:
L. Boytsov (2011). ACM Journal of Experimental Algorithmics 16, 1–86.
G. Navarro (2001). ACM Computing Surveys 33, 31–88.
stringdist paper:
M.P.J. van der Loo (2014). The stringdist package for approximate string matching. The R Journal 6 xx-xx.

Edit-based distances
Definition:
Count the minimum number of (weighted) basic operations that turn string s into string t.
Allowed operations per distance:
Distance              substitution  deletion  insertion  transposition
Hamming               yes           no        no         no
LCS                   no            yes       yes        no
Levenshtein           yes           yes       yes        no
OSA                   yes           yes       yes        yes*
Damerau-Levenshtein   yes           yes       yes        yes
* Substrings may be edited only once.
> stringdist('leia','liea',method='hamming')
[1] 2
> stringdist('leia','liea',method='dl')
[1] 1
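
The difference between these distances can be sketched in a few lines of base R. The functions `hamming` and `osa` below are hypothetical, unweighted textbook versions written for illustration; they are not the package's C code. Base R's `adist` supplies the Levenshtein distance for comparison.

```r
# Hamming: substitutions only, so it is defined for equal-length strings.
hamming <- function(s, u) {
  a <- strsplit(s, "")[[1]]
  b <- strsplit(u, "")[[1]]
  if (length(a) != length(b)) return(Inf)
  sum(a != b)
}

# Optimal string alignment (OSA): Levenshtein plus transposition of
# adjacent characters, where no substring is edited more than once.
osa <- function(s, u) {
  a <- strsplit(s, "")[[1]]
  b <- strsplit(u, "")[[1]]
  n <- length(a)
  m <- length(b)
  d <- matrix(0, n + 1, m + 1)  # d[i + 1, j + 1]: distance between prefixes
  d[, 1] <- 0:n
  d[1, ] <- 0:m
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- if (a[i] == b[j]) 0 else 1
      d[i + 1, j + 1] <- min(d[i, j + 1] + 1,   # deletion
                             d[i + 1, j] + 1,   # insertion
                             d[i, j] + cost)    # substitution or match
      if (i > 1 && j > 1 && a[i] == b[j - 1] && a[i - 1] == b[j]) {
        # transposition of two adjacent characters
        d[i + 1, j + 1] <- min(d[i + 1, j + 1], d[i - 1, j - 1] + 1)
      }
    }
  }
  d[n + 1, m + 1]
}

d_hamming <- hamming("leia", "liea")       # two substitutions
d_lv <- adist("leia", "liea")[1, 1]        # Levenshtein: also two edits
d_osa <- osa("leia", "liea")               # one transposition
print(c(hamming = d_hamming, levenshtein = d_lv, osa = d_osa))
```

For this pair, only the metrics that allow transposition bring the distance down from 2 to 1, which is why the choice of metric matters for typing errors such as swapped letters.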

q-gram based distances
Definition:
Any (vector) distance between two q-gram profiles.
Example: sliding a window of width q = 2 over "banana" gives the q-gram profile
Q:  ba  an  na
x:   1   2   2
Jaccard distance: 1 − |Q1 ∩ Q2| / |Q1 ∪ Q2|
> stringdist('leia','leela', method='jaccard',q=2)
[1] 0.8333333
Cosine distance: 1 − cos(x∠y)
> stringdist('leia','leela', method='cosine',q=2)
[1] 0.7113249
> qgrams(x = 'leia',y = 'leela',q=2)
  le ei ia la el ee
x  1  1  1  0  0  0
y  1  0  0  1  1  1
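
The q-gram profiles and the two distances quoted for 'leia' and 'leela' can be reproduced by hand in base R. This is a sketch for illustration only: `qgrams_of` and `profile_of` are hypothetical helpers, not functions from the package, and the column order of the profile may differ from the `qgrams` output.

```r
# All overlapping substrings of width q (hypothetical helper).
qgrams_of <- function(s, q = 2) {
  if (nchar(s) < q) return(character(0))
  starts <- seq_len(nchar(s) - q + 1)
  substring(s, starts, starts + q - 1)
}

# Count vector of a string's q-grams over a fixed set of q-grams.
profile_of <- function(s, grams, q = 2) {
  as.vector(table(factor(qgrams_of(s, q), levels = grams)))
}

a <- qgrams_of("leia")    # "le" "ei" "ia"
b <- qgrams_of("leela")   # "le" "ee" "el" "la"
grams <- union(a, b)
x <- profile_of("leia", grams)
y <- profile_of("leela", grams)

# Jaccard distance: one minus shared q-grams over all distinct q-grams.
jaccard <- 1 - length(intersect(a, b)) / length(grams)
# Cosine distance: one minus the cosine of the angle between the profiles.
cosine <- 1 - sum(x * y) / sqrt(sum(x^2) * sum(y^2))

print(round(c(jaccard = jaccard, cosine = cosine), 7))
```

The two strings share one 2-gram ("le") out of six distinct ones, so the Jaccard distance is 1 − 1/6 ≈ 0.8333333, and the cosine distance is 1 − 1/√12 ≈ 0.7113249, in agreement with the values shown for `stringdist`.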

Heuristic distances: Jaro-Winkler
Definition:
It’s complicated :-).
Intended for human-typed name/address data.
> stringdist('liea','leia',method='jw',p=0.1)
[1] 0.075
Ranges from 0 (equal) to 1 (dissimilar).
0 ≤ p ≤ 0.25: emphasis on first 4 characters.
p = 0: Jaro distance.
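
For readers who want to see what "complicated" amounts to, the following base-R sketch follows the common textbook definition of the Jaro-Winkler distance. The function `jaro_winkler` is hypothetical and written for illustration; the package's implementation may differ in details such as weighting options.

```r
jaro_winkler <- function(s, u, p = 0.1) {
  a <- strsplit(s, "")[[1]]
  b <- strsplit(u, "")[[1]]
  if (length(a) == 0 && length(b) == 0) return(0)
  if (length(a) == 0 || length(b) == 0) return(1)

  # Characters match if equal and at most `window` positions apart.
  window <- max(floor(max(length(a), length(b)) / 2) - 1, 0)
  used_b <- rep(FALSE, length(b))
  a_hits <- integer(0)
  b_hits <- integer(0)
  for (i in seq_along(a)) {
    lo <- max(1, i - window)
    hi <- min(length(b), i + window)
    if (lo > hi) next
    for (j in lo:hi) {
      if (!used_b[j] && a[i] == b[j]) {
        used_b[j] <- TRUE
        a_hits <- c(a_hits, i)
        b_hits <- c(b_hits, j)
        break
      }
    }
  }
  m <- length(a_hits)
  if (m == 0) return(1)

  # Half the number of matched characters that appear in a different order.
  transpositions <- sum(a[a_hits] != b[sort(b_hits)]) / 2
  jaro_dist <- 1 - (m / length(a) + m / length(b) + (m - transpositions) / m) / 3

  # Winkler adjustment: reward a common prefix of up to 4 characters.
  prefix <- 0
  for (i in seq_len(min(4, length(a), length(b)))) {
    if (a[i] == b[i]) prefix <- prefix + 1 else break
  }
  jaro_dist * (1 - prefix * p)
}

d_jw <- jaro_winkler("liea", "leia", p = 0.1)
d_jaro <- jaro_winkler("liea", "leia", p = 0)
print(c(jaro_winkler = d_jw, jaro = d_jaro))
```

For 'liea' and 'leia' all four characters match with one transposition, giving a Jaro distance of 1/12; the shared one-character prefix then scales it by 1 − 0.1 to 0.075, the value shown above.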

Character encoding
Image from https://code.google.com/p/tworsekey/
> stringdist('ö','o')
[1] 1 # Replace one symbol
> stringdist('ö','o',useBytes=TRUE)
[1] 2 # delete one byte, replace another (utf-8)
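
The gap between the two results comes from how many bytes a character occupies. The base-R sketch below assumes, for illustration, that the non-ASCII character is "ö" (U+00F6); any character that takes two bytes in UTF-8 would behave the same way.

```r
# Assumption for illustration: the accented character is "ö" (U+00F6).
o_umlaut <- enc2utf8("\u00f6")

n_chars <- nchar(o_umlaut, type = "chars")   # counted as one symbol
n_bytes <- nchar(o_umlaut, type = "bytes")   # stored as two bytes in UTF-8
raw_bytes <- charToRaw(o_umlaut)             # the bytes themselves: c3 b6

print(c(chars = n_chars, bytes = n_bytes))
print(raw_bytes)
print(charToRaw("o"))                        # a single byte: 6f
```

Seen as symbols, the two strings differ by one substitution. Seen as bytes, a two-byte sequence has to be turned into one byte that equals neither of them, which costs one deletion and one substitution.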

Missing values
Name: Value Six
Missing since: 30-06-2014
Last seen at: input.csv
If you have seen Value Six or know someone who has, please contact your local statistician at +31 415 926 53

Handling missing values
> NA == NA
[1] NA
> adist(NA, NA)
[1] NA
> stringdist(NA, NA)
[1] NA
> match(NA, NA)
[1] 1 # <- note the useR’s OMGWTFBBQ right there
> amatch(NA, NA)
[1] 1 # <- ok, at least we’re consistent
> amatch(NA, NA, matchNA=FALSE)
[1] NA
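
Base R offers a comparable switch for exact matching, which may help explain what `matchNA=FALSE` is modelled on. The sketch below uses only base R; the variable names are chosen for illustration.

```r
# Comparing missing values yields "unknown", not TRUE.
na_equal <- NA == NA

# match() nevertheless treats NA as a value that can be looked up ...
na_match <- match(NA, NA)

# ... unless NA is declared incomparable, in which case it never matches.
na_nomatch <- match(NA, NA, incomparables = NA)

# The same behaviour for character data.
lookup <- c(NA, "leia")
chr_match <- match(c("leia", NA), lookup)
chr_nomatch <- match(c("leia", NA), lookup, incomparables = NA_character_)

print(list(na_equal, na_match, na_nomatch, chr_match, chr_nomatch))
```

Whether a missing key should match a missing dictionary entry depends on the application, so it is useful that both base R and `amatch` let the caller decide.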

Parallelization
For a single call:
> stringdistmatrix(a,b,ncores=4)
Or, define your own cluster:
> cl <- makeCluster(4)
> stringdistmatrix(a, b, cluster=cl)
> stringdistmatrix(c, d, cluster=cl)
> stopCluster(cl)

Performance
stringdist(method='lv'):
About 30% faster than adist
About 2 times faster than RecordLinkage
When comparing strings of 5 to 25 characters

Summary
Nine different string metrics; core in C99 (sorry Dirk :-) )
Approximate dictionary lookup
Proper handling of encoding and missing values
Fast
Parallelization built in
Thank you for your attention!
t : @markpjvanderloo
w : markvanderloo.eu