An Open-source Similar-name Finder
dallan-quass -
![Page 2: An Open-source Similar-name Finder](https://reader035.fdocuments.net/reader035/viewer/2022062514/558605e6d8b42a90638b48b1/html5/thumbnails/2.jpg)
What's the problem?
![Page 3: An Open-source Similar-name Finder](https://reader035.fdocuments.net/reader035/viewer/2022062514/558605e6d8b42a90638b48b1/html5/thumbnails/3.jpg)
People can't spell unusual names
Maybe a piece of mail addressed to Solverg Quast?
Solverg Quast, 5934 Phoenix Ave., Shoreview, MN 55126
Johnston Bros., 1256 Bristol St., Mapleton, MN 55126
Should be: Solveig Quass
![Page 4: An Open-source Similar-name Finder](https://reader035.fdocuments.net/reader035/viewer/2022062514/558605e6d8b42a90638b48b1/html5/thumbnails/4.jpg)
People use nicknames
John
Johnny
Jack
![Page 5: An Open-source Similar-name Finder](https://reader035.fdocuments.net/reader035/viewer/2022062514/558605e6d8b42a90638b48b1/html5/thumbnails/5.jpg)
Transcribers make typos
Jhon
![Page 6: An Open-source Similar-name Finder](https://reader035.fdocuments.net/reader035/viewer/2022062514/558605e6d8b42a90638b48b1/html5/thumbnails/6.jpg)
Most of our ancestors didn't know how to read or write
signature
![Page 7: An Open-source Similar-name Finder](https://reader035.fdocuments.net/reader035/viewer/2022062514/558605e6d8b42a90638b48b1/html5/thumbnails/7.jpg)
What does it matter?
![Page 8: An Open-source Similar-name Finder](https://reader035.fdocuments.net/reader035/viewer/2022062514/558605e6d8b42a90638b48b1/html5/thumbnails/8.jpg)
How do you find records?
Johnny Snith
John Smith
![Page 9: An Open-source Similar-name Finder](https://reader035.fdocuments.net/reader035/viewer/2022062514/558605e6d8b42a90638b48b1/html5/thumbnails/9.jpg)
How do you match people?
John Smith
Johnny Smithe
![Page 10: An Open-source Similar-name Finder](https://reader035.fdocuments.net/reader035/viewer/2022062514/558605e6d8b42a90638b48b1/html5/thumbnails/10.jpg)
Not a new problem
![Page 11: An Open-source Similar-name Finder](https://reader035.fdocuments.net/reader035/viewer/2022062514/558605e6d8b42a90638b48b1/html5/thumbnails/11.jpg)
Lots of solutions
Soundex
NYSIIS
Double Metaphone
Refined Soundex
Daitch-Mokotoff
Caverphone
Levenshtein
Jaro-Winkler
Monge-Elkan
Needleman-Wunsch
Smith-Waterman
![Page 12: An Open-source Similar-name Finder](https://reader035.fdocuments.net/reader035/viewer/2022062514/558605e6d8b42a90638b48b1/html5/thumbnails/12.jpg)
No Bullseye
![Page 13: An Open-source Similar-name Finder](https://reader035.fdocuments.net/reader035/viewer/2022062514/558605e6d8b42a90638b48b1/html5/thumbnails/13.jpg)
Why is this so hard?
![Page 14: An Open-source Similar-name Finder](https://reader035.fdocuments.net/reader035/viewer/2022062514/558605e6d8b42a90638b48b1/html5/thumbnails/14.jpg)
How similar are two names?
We’re neighbors
John
Jonny
Joe
I don’t know those guys
![Page 15: An Open-source Similar-name Finder](https://reader035.fdocuments.net/reader035/viewer/2022062514/558605e6d8b42a90638b48b1/html5/thumbnails/15.jpg)
First approach: Coders
Soundex
NYSIIS
Double Metaphone
Refined Soundex
Daitch-Mokotoff
Caverphone
General approach
Combine repeated letters
Remove vowels (except maybe for leading)
Unite similar-sounding letters
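The three steps of the general approach can be sketched in a few lines of Python. This is a toy coder for illustration only: the letter groups and the name `toy_code` are assumptions, loosely modeled on Soundex, and do not reproduce the exact rules of any coder listed above. It also applies the steps in a slightly different order (unite, combine, remove) so that pairs like "ck" collapse to one letter.

```python
import re

# Hypothetical letter groups, loosely modeled on Soundex.
# Real coders (Soundex, NYSIIS, ...) each define their own rules.
GROUPS = {
    "bfpv": "b",
    "cgjkqsxz": "c",
    "dt": "d",
    "mn": "m",
}
SIMILAR = {letter: rep for letters, rep in GROUPS.items() for letter in letters}


def toy_code(name: str) -> str:
    """Return a crude phonetic code for a name (illustration only)."""
    s = re.sub(r"[^a-z]", "", name.lower())
    if not s:
        return ""
    # Unite similar-sounding letters.
    s = "".join(SIMILAR.get(ch, ch) for ch in s)
    # Combine repeated letters.
    s = re.sub(r"(.)\1+", r"\1", s)
    # Remove vowels (and h, w, y), except for a leading one.
    return s[0] + re.sub(r"[aeiouyhw]", "", s[1:])


for name in ["Jim", "John", "Jane", "Johan", "Johannes"]:
    print(name, toy_code(name))
```

With these toy groups, Jim, John, Jane and Johan all receive the same code while Johannes gets a different one, which shows how coarse a coder can be: lookups are fast (one code per name), but unrelated names get lumped together and related ones can be split apart.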
![Page 16: An Open-source Similar-name Finder](https://reader035.fdocuments.net/reader035/viewer/2022062514/558605e6d8b42a90638b48b1/html5/thumbnails/16.jpg)
First approach: Coders
Jim
John
Jane
Johan
Johannes
![Page 17: An Open-source Similar-name Finder](https://reader035.fdocuments.net/reader035/viewer/2022062514/558605e6d8b42a90638b48b1/html5/thumbnails/17.jpg)
Second approach: Distance functions
Levenshtein
Jaro-Winkler
Monge-Elkan
Needleman-Wunsch
Smith-Waterman
General approach
Align sequences of letters
Score based upon the number of matches, transpositions, differences
Monge-Elkan considers similar-sounding letters
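As a concrete instance of the align-and-score idea, here is a minimal sketch of plain Levenshtein distance, where every insertion, deletion and substitution costs 1. The function name is chosen for this example; the other distance functions listed above score alignments differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-letter insertions, deletions and
    substitutions needed to turn a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


print(levenshtein("jon", "john"))
print(levenshtein("jhon", "john"))
```

Each comparison takes time proportional to the product of the two name lengths, and a search has to compare the query against every candidate name, which is why this family of methods is hard to use directly for lookups in a large index.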
![Page 18: An Open-source Similar-name Finder](https://reader035.fdocuments.net/reader035/viewer/2022062514/558605e6d8b42a90638b48b1/html5/thumbnails/18.jpg)
Second approach: Distance functions
Jim
John
Jane
Johan
Johannes
Better results, but doesn't scale well
![Page 19: An Open-source Similar-name Finder](https://reader035.fdocuments.net/reader035/viewer/2022062514/558605e6d8b42a90638b48b1/html5/thumbnails/19.jpg)
Can we do better?
![Page 20: An Open-source Similar-name Finder](https://reader035.fdocuments.net/reader035/viewer/2022062514/558605e6d8b42a90638b48b1/html5/thumbnails/20.jpg)
Warning: Machine learning ahead!
![Page 21: An Open-source Similar-name Finder](https://reader035.fdocuments.net/reader035/viewer/2022062514/558605e6d8b42a90638b48b1/html5/thumbnails/21.jpg)
Thank you Ancestry!
Ancestry.com paid someone to label 100,000 pairs of names
Name pairs were drawn from actual matching records at Ancestry
Labeled name pairs have been made freely available
![Page 22: An Open-source Similar-name Finder](https://reader035.fdocuments.net/reader035/viewer/2022062514/558605e6d8b42a90638b48b1/html5/thumbnails/22.jpg)
A closer look at Levenshtein
Jon
John
Bohn
-1
-1
![Page 23: An Open-source Similar-name Finder](https://reader035.fdocuments.net/reader035/viewer/2022062514/558605e6d8b42a90638b48b1/html5/thumbnails/23.jpg)
Maximize your expectations
Expectation Maximization Algorithm
Expectation step: calculate the expected value of a function
Maximization step: find parameters that maximize the expected value
Iterate until convergence
Jon
John
Bohn
high cost
low cost
Weighted Edit Distance
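A weighted edit distance can be sketched as follows. The costs below are hand-picked assumptions for illustration (inserting or deleting an "h" is made cheap); in the approach described on the slide the costs would be learned from labeled name pairs rather than set by hand, and that learning step is not shown here.

```python
def weighted_edit_distance(a, b, indel_cost=None, sub_cost=None):
    """Edit distance with per-letter costs.

    indel_cost maps a letter to the cost of inserting or deleting it;
    sub_cost maps a (letter, letter) pair to the cost of substituting one
    for the other. Anything not listed costs 1.0. Both tables are
    hypothetical inputs for this sketch.
    """
    indel_cost = indel_cost or {}
    sub_cost = sub_cost or {}

    def indel(c):
        return indel_cost.get(c, 1.0)

    def sub(x, y):
        if x == y:
            return 0.0
        return sub_cost.get((x, y), sub_cost.get((y, x), 1.0))

    # First row: build each prefix of b from the empty string.
    prev = [0.0]
    for cb in b:
        prev.append(prev[-1] + indel(cb))
    for ca in a:
        curr = [prev[0] + indel(ca)]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + indel(ca),        # delete ca
                curr[j - 1] + indel(cb),    # insert cb
                prev[j - 1] + sub(ca, cb),  # substitute or match
            ))
        prev = curr
    return prev[-1]


cheap_h = {"h": 0.2}  # assumed cost, not a learned value
print(weighted_edit_distance("jon", "john", indel_cost=cheap_h))
print(weighted_edit_distance("john", "bohn", indel_cost=cheap_h))
```

With these assumed costs, Jon and John end up much closer to each other than John and Bohn, even though plain Levenshtein scores both pairs identically.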
![Page 24: An Open-source Similar-name Finder](https://reader035.fdocuments.net/reader035/viewer/2022062514/558605e6d8b42a90638b48b1/html5/thumbnails/24.jpg)
Learn to classify
Positive and negative examples
Features
Coders
Distance functions
Weighted edit distance
Learn weights
several algorithms to choose from
Results in a vector
Threshold separates matches from non-matches
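The scoring side of such a classifier can be sketched like this. Everything specific here is an assumption for illustration: the two features, the weight values, the logistic form and the 0.5 threshold. A real system would use more features (coders, several distance functions, weighted edit distance) and would learn the weights from the positive and negative examples instead of hard-coding them.

```python
import math


def levenshtein(a, b):
    """Plain edit distance, used here as one feature."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def features(a, b):
    """Feature vector for a name pair; the leading 1.0 is a bias term."""
    same_first = 1.0 if a[:1] == b[:1] else 0.0
    longest = max(len(a), len(b)) or 1
    similarity = 1.0 - levenshtein(a, b) / longest
    return [1.0, same_first, similarity]


# Hypothetical weights; in practice these are learned from labeled pairs.
WEIGHTS = [-4.0, 1.0, 5.0]
THRESHOLD = 0.5


def match_probability(a, b):
    z = sum(w * x for w, x in zip(WEIGHTS, features(a, b)))
    return 1.0 / (1.0 + math.exp(-z))


def is_match(a, b):
    return match_probability(a, b) >= THRESHOLD


for pair in [("jon", "john"), ("john", "bohn"), ("john", "jane")]:
    print(pair, round(match_probability(*pair), 3), is_match(*pair))
```

Raising the threshold trades recall for precision, and lowering it does the opposite.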
![Page 25: An Open-source Similar-name Finder](https://reader035.fdocuments.net/reader035/viewer/2022062514/558605e6d8b42a90638b48b1/html5/thumbnails/25.jpg)
Wait, isn't this just another distance function?
Distance functions don't scale, right?
![Page 26: An Open-source Similar-name Finder](https://reader035.fdocuments.net/reader035/viewer/2022062514/558605e6d8b42a90638b48b1/html5/thumbnails/26.jpg)
Right
![Page 27: An Open-source Similar-name Finder](https://reader035.fdocuments.net/reader035/viewer/2022062514/558605e6d8b42a90638b48b1/html5/thumbnails/27.jpg)
Back to the basics
| x | f(x) |
| --- | --- |
| -5 | -1 |
| -3 | 4.5 |
| 0 | 7 |
| 2 | 3.5 |
| 4 | 2 |
![Page 28: An Open-source Similar-name Finder](https://reader035.fdocuments.net/reader035/viewer/2022062514/558605e6d8b42a90638b48b1/html5/thumbnails/28.jpg)
Long tail
![Page 29: An Open-source Similar-name Finder](https://reader035.fdocuments.net/reader035/viewer/2022062514/558605e6d8b42a90638b48b1/html5/thumbnails/29.jpg)
Long tail
200,000 Surnames
70,000 Given names
≤ 1/5,000,000 names
![Page 30: An Open-source Similar-name Finder](https://reader035.fdocuments.net/reader035/viewer/2022062514/558605e6d8b42a90638b48b1/html5/thumbnails/30.jpg)
Long tail
Use distance function with table here
Use coder here
![Page 31: An Open-source Similar-name Finder](https://reader035.fdocuments.net/reader035/viewer/2022062514/558605e6d8b42a90638b48b1/html5/thumbnails/31.jpg)
Result: Table initialized by a function
Dallan: Dalana Daleen Dalen Dalin … Talan Tallon
Ryan: Aaran Aran Arrin … Rian Riana ...
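One way to picture the table-plus-coder idea: run the expensive similarity function once, offline, over the common names and store the result as a table, then fall back to a cheap coder for rare names that are not in the table. The sketch below is an assumption-laden illustration; `crude_code`, the edit-distance cutoff of 2 and the tiny name lists are placeholders, not the project's actual function, coder or data.

```python
from collections import defaultdict
from itertools import combinations


def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def build_table(common_names, is_similar):
    """Precompute variants for every common name (quadratic, done once)."""
    table = {name: set() for name in common_names}
    for a, b in combinations(common_names, 2):
        if is_similar(a, b):
            table[a].add(b)
            table[b].add(a)
    return table


def crude_code(name):
    """Placeholder coder: first letter plus the remaining consonants."""
    return name[0] + "".join(c for c in name[1:] if c not in "aeiouyhw")


def build_code_index(names):
    index = defaultdict(set)
    for name in names:
        index[crude_code(name)].add(name)
    return index


def find_variants(name, table, code_index):
    """Table lookup for common names, coder lookup for everything else."""
    if name in table:
        return table[name]
    return code_index.get(crude_code(name), set()) - {name}


# Tiny made-up name lists, lowercase and non-empty.
common = ["dallan", "dalen", "dalin", "ryan", "rian"]
rare = ["solveig", "solvieg", "quass"]
table = build_table(common, lambda a, b: edit_distance(a, b) <= 2)
code_index = build_code_index(rare)

print(sorted(find_variants("dallan", table, code_index)))
print(sorted(find_variants("solveig", table, code_index)))
```

Because the result is a plain table, lookups cost a single dictionary access, and individual entries can be added or removed by hand without touching the function that produced them.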
![Page 32: An Open-source Similar-name Finder](https://reader035.fdocuments.net/reader035/viewer/2022062514/558605e6d8b42a90638b48b1/html5/thumbnails/32.jpg)
A nice thing about tables...
Dallan: Dalana Daleen Dalen Dalin … Talan Tallon
Ryan: Aaran Aran Arrin … Rian Riana ...
![Page 33: An Open-source Similar-name Finder](https://reader035.fdocuments.net/reader035/viewer/2022062514/558605e6d8b42a90638b48b1/html5/thumbnails/33.jpg)
Add to the table
Nicknames
BehindTheName.com
The New American Dictionary of Baby Names by Leslie Dunkling and William Gosling
A Dictionary of Surnames by Patrick Hanks and Flavia Hodges
WeRelate community
![Page 34: An Open-source Similar-name Finder](https://reader035.fdocuments.net/reader035/viewer/2022062514/558605e6d8b42a90638b48b1/html5/thumbnails/34.jpg)
Thank you BehindTheName.com!
Fascinating Family Trees for given names
![Page 35: An Open-source Similar-name Finder](https://reader035.fdocuments.net/reader035/viewer/2022062514/558605e6d8b42a90638b48b1/html5/thumbnails/35.jpg)
Result
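Given names

| Approach | Precision (%) | Recall (%) |
| --- | --- | --- |
| Soundex | 97 | 65 |
| Our approach | 97 | 74 |

28% decrease in false negatives

Surnames

| Approach | Precision (%) | Recall (%) |
| --- | --- | --- |
| Soundex | 89 | 68 |
| Our approach | 89 | 77 |

28% decrease in false negatives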
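The link between recall and false negatives can be checked with a little arithmetic. The sketch below assumes the precision and recall figures are percentages, that the first pair of numbers in each group belongs to Soundex and the second to the new approach, and that the false-negative rate is 100 minus recall over the same set of true matches. The helper name is made up for this example.

```python
def false_negative_reduction(recall_before, recall_after):
    """Relative drop in false negatives when recall (in %) improves."""
    fn_before = 100 - recall_before
    fn_after = 100 - recall_after
    return (fn_before - fn_after) / fn_before


print(f"surnames:    {false_negative_reduction(68, 77):.1%}")
print(f"given names: {false_negative_reduction(65, 74):.1%}")
```

Under these assumptions the surname figures work out to about 28%, matching the slide. The same arithmetic on the rounded given-name figures gives roughly 26%, so the 28% quoted for given names may be based on unrounded values.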
![Page 36: An Open-source Similar-name Finder](https://reader035.fdocuments.net/reader035/viewer/2022062514/558605e6d8b42a90638b48b1/html5/thumbnails/36.jpg)
Who is using it?
![Page 37: An Open-source Similar-name Finder](https://reader035.fdocuments.net/reader035/viewer/2022062514/558605e6d8b42a90638b48b1/html5/thumbnails/37.jpg)
WeRelate.org
![Page 38: An Open-source Similar-name Finder](https://reader035.fdocuments.net/reader035/viewer/2022062514/558605e6d8b42a90638b48b1/html5/thumbnails/38.jpg)
Continuous improvement
![Page 39: An Open-source Similar-name Finder](https://reader035.fdocuments.net/reader035/viewer/2022062514/558605e6d8b42a90638b48b1/html5/thumbnails/39.jpg)
Continuous improvement
![Page 40: An Open-source Similar-name Finder](https://reader035.fdocuments.net/reader035/viewer/2022062514/558605e6d8b42a90638b48b1/html5/thumbnails/40.jpg)
Community oversight
![Page 41: An Open-source Similar-name Finder](https://reader035.fdocuments.net/reader035/viewer/2022062514/558605e6d8b42a90638b48b1/html5/thumbnails/41.jpg)
How do I use it?
Source code and table available on Github: http://github.com/DallanQ/Names
Search: Normalize, Index, Search
Score, Eval, Service
![Page 42: An Open-source Similar-name Finder](https://reader035.fdocuments.net/reader035/viewer/2022062514/558605e6d8b42a90638b48b1/html5/thumbnails/42.jpg)
Roadmap
Jan 2011 Open-source project created
Jan 2011 Implemented at WeRelate
Feb 2011 Announce at RootsTech
Continued improvements
![Page 43: An Open-source Similar-name Finder](https://reader035.fdocuments.net/reader035/viewer/2022062514/558605e6d8b42a90638b48b1/html5/thumbnails/43.jpg)
Future work
![Page 44: An Open-source Similar-name Finder](https://reader035.fdocuments.net/reader035/viewer/2022062514/558605e6d8b42a90638b48b1/html5/thumbnails/44.jpg)
Future work
Reduce the number of name variants to look up
Look up multiple codes: Refined Soundex?
Cluster names: Mahout?
Remove “chaff” variants from common names
![Page 45: An Open-source Similar-name Finder](https://reader035.fdocuments.net/reader035/viewer/2022062514/558605e6d8b42a90638b48b1/html5/thumbnails/45.jpg)
Conclusion
Images appearing on these slides are copyrighted by the contributors to http://commons.wikimedia.org and are used under license
Thank you Ancestry.com and BehindTheName.com!!!
Identifying name variants is hard
But getting it right is pretty important
names are at the core of genealogical research
Open source algorithm is now freely available
http://github.com/DallanQ/Names
28% reduction in false negatives compared to Soundex
continuous improvement
Hopefully others will benefit from this effort
goal is to improve genealogical searches across the Web
![Page 46: An Open-source Similar-name Finder](https://reader035.fdocuments.net/reader035/viewer/2022062514/558605e6d8b42a90638b48b1/html5/thumbnails/46.jpg)