Matching Dirty Data - Code4Lib · Matching Dirty Data Yet another wheel Jeff Sherwood, Programmer....
Matching Dirty Data
Yet another wheel
Jeff Sherwood, Programmer.
Anjanette Young, Systems Librarian.
University of Washington, Libraries.
Goal
DSpace Repository
Ingest metadata and PDFs for ETDs received from UMI into a DSpace repository.
Electronic Theses & Dissertations
Sources Output
MARC Fields
UMI Records
=001 (Filename)
=520 (Abstract)
III Records
=001 (OCLC number)
=100 (Author)
=245 (Title)
=260 (Date published)
=502 (type and date)
=695 (Department)
=941 (Local identifier)
dublin_core.xml
<dublin_core>
  <dcvalue element="identifier" qualifier="other">iii[941]</dcvalue>
  <dcvalue element="title" qualifier="none">iii[245][a][b]</dcvalue>
  <dcvalue element="contributor" qualifier="author">iii[100][a][b][c]</dcvalue>
  <dcvalue element="description" qualifier="abstract">umi[520][a]</dcvalue>
  <dcvalue element="subject" qualifier="other">iii[655][a][x]</dcvalue>
</dublin_core>
MARC Loader . . . No.
|||0|0| | |0|n|G|0|@ov_action="o"
|||0|0| | |0|n|G|0|@ov_protect="b=V0123456789d(690,695:d) hn(590:d)y(099,249,852,856:d)y(910,925, 980,981)F26"
035|001 |+|0|0|b|o|0|y|N|0|%001(start="1-9",char="!-~")
245||+|0|0|b|t|0|y|N|0|%bracket="h"
500-599||+|0|0|b|n|0|y|N|0|
600-651||-w|0|0|b|d|0|y|N|0|
653-657||+|0|0|b|d|0|y|N|0|
690-699||-w|0|0|b|d|0|y|N|0|
700-715||-w|0|0|b|b|0|y|N|0|
730-740||-w|0|0|b|f|0|y|N|0|
Matching overview
Ham-fisted Method
1. Exact Title + Exact Author
2. Exact Title + Shortened Author
Cool Math Method
Calculate Similarity of Title
Calculate Similarity of Author
1. Exact Title + Fuzzy Author
2. Fuzzy Title + Fuzzy Author
3. Fuzzy Title or Fuzzy Author
Pymarc - the MARC Hammer
umi_dict = {
    "Alaskan Bootlegger": {"author": "Leon Kania", "umi_count": 1},
    # title2_value: {"author": author2_value, "umi_count": index2},
    # . . .
}
iii_dict = {
    "Alaskan Bootlegger": {"author": "Leon W. Kania", "iii_count": 9},
    # title2_value: {"author": author2_value, "iii_count": index2},
    # . . .
}
Exact title + exact author
# Exact Title
# Create sets out of the dictionary keys
umi_set = set(umi_dict.keys())
iii_set = set(iii_dict.keys())

# Find the intersection of the sets.
title_match = umi_set & iii_set

# Verify the intersection with Exact Author
for x in title_match:
    if umi_dict[x]["author"] == iii_dict[x]["author"]:
        pass  # . . . do stuff.
Exact title + Truncated author
def shortenAuthorName(name):
    # Leon W. Kania -> [Leon, W., Kania]
    namelist = str(name).split()
    if len(namelist) > 2:
        shortname = "%s %s" % (namelist[0], namelist[-1])
    else:
        shortname = name
    return shortname
"If you break three spokes, it is time for a rebuild"
Charles Hadrann, "Hadrann Wheelcraft Method – Part 1 Lacing"
Rogues Gallery
Use of crown length to define stem form :: segmented taper equation
USE OF CROWN LENGTH TO DEFINE STEM FORM: SEGMENTED TAPER EQUATION (DOUGLAS FIR)
Towards an understanding of seismic performance of three-dimensional structures: Stability and reliability
Towards an understanding of seismic performance of 3D structures :: stability & reliability
Hoekstra, Hopi Danielle Elisabeth
Hoekstra, Danielle E
Arnason, Halldor
Halldór Árnason
Levenshtein Edit Distance
Edit distance is the number of operations required to transform one string of characters into another.
How many steps to turn
kitten into sitting?
3
kitten ➔ sitten (k changes to s)
sitten ➔ sittin (e changes to i)
sittin ➔ sitting (insert g)
LD is Always...
≥ difference in string lengths
≤ length of the longer string
= 0 if the strings are identical
Similarity Score
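The formula shown on this slide did not survive in the transcript. As a minimal sketch, assuming the common normalization of dividing the edit distance by the length of the longer string, a score between 0 and 1 could be computed as below. The names `levenshtein` and `similarity` are illustrative and not taken from the talk; the packages listed under Resources provide faster C implementations.

```python
def levenshtein(a, b):
    """Return the Levenshtein edit distance between strings a and b."""
    # previous[j] is the distance between the prefix of a processed so far
    # and the first j characters of b; start with the empty prefix of a.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, 1):
        current = [i]
        for j, ch_b in enumerate(b, 1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed normalization: 1.0 for identical strings, lower is less alike."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - float(levenshtein(a, b)) / longest


print(levenshtein("kitten", "sitting"))           # 3
print(round(similarity("kitten", "sitting"), 3))  # 0.571
```

Because the distance never exceeds the length of the longer string, the score under this assumption stays between 0 and 1, which makes it possible to apply one threshold to titles of very different lengths.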
Optimizations
Reduce the Search Space
"A stochastic model of cyclical interaction processes"
All titles
Reduce the Search Space
Identify Stopwords
the: 24587
for: 7643
with: 3323
effects: 1958
evaluation: 1073
...
hypoxic: 1
reduplication: 1
picaresque: 1
emperador: 1
heteroduplex: 1
Throw out common words in titles
Keep the rarer ones
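A word-frequency table like the one on this slide can be produced with a simple counter. The sketch below is illustrative only: the helper names, the tokenizing regular expression (which ignores accented characters) and the `min_count` threshold are assumptions, not details from the talk.

```python
import re
from collections import Counter


def word_counts(titles):
    """Count how often each lowercased word occurs across all titles."""
    counts = Counter()
    for title in titles:
        counts.update(re.findall(r"[a-z0-9]+", title.lower()))
    return counts


def find_stopwords(titles, min_count):
    """Treat every word seen at least min_count times as a stopword."""
    return {word for word, n in word_counts(titles).items() if n >= min_count}


titles = [
    "A stochastic model of cyclical interaction processes",
    "A stochastic model of clan systems",
    "Stochastic models for DNA sequence data",
]
print(sorted(find_stopwords(titles, min_count=2)))
# ['a', 'model', 'of', 'stochastic']
```

On three titles even "stochastic" crosses the threshold, so the cutoff would need tuning against the full set of titles before any word is discarded.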
Reduce the Search Space
Extract Significant Words
"Stochastic models for DNA sequence data"
stochastic dna sequence
Reduce the Search Space
rec = {'title': 'Stochastic models...',}
index['stochastic'].append(rec)
index['dna'].append(rec)
index['sequence'].append(rec)
Reduce the Search Space
index['stochastic']
{'title': "Stochastic models for DNA sequence data", ...}
{'title': "A stochastic model of clan systems", ...}
{'title': "A stochastic model of cyclical interaction processes", ...}
{'title': "Stochastic reliability models for maintained systems", ...}
{'title': "Uniform approximation and almost periodicity of doubly stochastic operators", ...}
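The indexing steps on the preceding slides can be sketched end to end as follows. This is a sketch under stated assumptions rather than the presenters' code: the `STOPWORDS` set is a hypothetical hand-picked list, and the helper names `significant_words`, `build_index` and `candidates` are invented for illustration.

```python
import re
from collections import defaultdict

# Hypothetical stopword list; the talk derives its list from word counts.
STOPWORDS = {"a", "and", "for", "of", "the", "model", "models", "data"}


def significant_words(title):
    """Lowercase the title, split it into words, and drop stopwords."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return [w for w in words if w not in STOPWORDS]


def build_index(records):
    """Map each significant title word to the records containing it."""
    index = defaultdict(list)
    for rec in records:
        for word in set(significant_words(rec["title"])):
            index[word].append(rec)
    return index


def candidates(index, title):
    """Collect records sharing at least one significant word with title."""
    found = []
    for word in significant_words(title):
        for rec in index.get(word, []):
            if rec not in found:
                found.append(rec)
    return found


records = [
    {"title": "Stochastic models for DNA sequence data"},
    {"title": "A stochastic model of clan systems"},
    {"title": "Uniform approximation and almost periodicity "
              "of doubly stochastic operators"},
]
index = build_index(records)
print(significant_words("Stochastic models for DNA sequence data"))
# ['stochastic', 'dna', 'sequence']
print(len(index["stochastic"]))  # 3
# The query title below is made up for illustration.
print(len(candidates(index, "Models of DNA sequence evolution")))  # 1
```

Only the records returned by `candidates` need an expensive fuzzy comparison, which is the point of reducing the search space.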
Normalize Names
Hoekstra, Hopi Danielle Elisabeth
Hoekstra, Danielle E
Normalize Names
Hoekstra, H
Hoekstra, D
Normalize Names
Arnason, Halldor
Halldór Árnason
Normalize Names
Arnason, H
Árnason, H
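The examples above suggest reducing each name to a surname plus first initial before comparing. A minimal sketch of that idea follows; the function name `normalize_name` and the rule that the last word of an uninverted name is the surname are assumptions. As on the slide, accents are left untouched, so the remaining difference between "Arnason" and "Árnason" is left for the fuzzy comparison.

```python
def normalize_name(name):
    """Reduce a personal name to 'Surname, I' (surname plus first initial)."""
    name = name.strip()
    if "," in name:
        # Inverted form: "Surname, Given names"
        surname, _, rest = name.partition(",")
        surname = surname.strip()
        given = rest.split()
    else:
        # Direct form: assume the last word is the surname.
        parts = name.split()
        if not parts:
            return ""
        surname, given = parts[-1], parts[:-1]
    if not given:
        return surname
    return "%s, %s" % (surname, given[0][0])


print(normalize_name("Hoekstra, Hopi Danielle Elisabeth"))  # Hoekstra, H
print(normalize_name("Hoekstra, Danielle E"))               # Hoekstra, D
print(normalize_name("Arnason, Halldor"))                   # Arnason, H
print(normalize_name("Halldór Árnason"))                    # Árnason, H
```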
Improvements
Jaro-Winkler Algorithm
What's a "match"?
Two characters match if they are a reasonable distance from one another as defined by:
Example
s1 = Martha
s2 = Marhta
Jaro-Winkler works best for short strings
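The formulas from the Jaro-Winkler slides are not preserved in this transcript. In the standard formulation, two equal characters count as a match when their positions differ by no more than floor(max(|s1|, |s2|) / 2) - 1, and the Winkler step boosts the score for a shared prefix. The sketch below follows that textbook definition; the scaling factor of 0.1 and the prefix cap of 4 are the conventional defaults and are assumed here, not taken from the talk.

```python
def jaro(s1, s2):
    """Return the Jaro similarity of s1 and s2 (0.0 to 1.0)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and this close together.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    m = float(matches)
    t = out_of_order // 2  # transpositions
    return (m / len1 + m / len2 + (m - t) / m) / 3.0


def jaro_winkler(s1, s2, scaling=0.1, max_prefix=4):
    """Boost the Jaro score for strings that share a common prefix."""
    score = jaro(s1, s2)
    prefix = 0
    for c1, c2 in zip(s1, s2):
        if c1 != c2 or prefix == max_prefix:
            break
        prefix += 1
    return score + prefix * scaling * (1.0 - score)


print(round(jaro("Martha", "Marhta"), 4))          # 0.9444
print(round(jaro_winkler("Martha", "Marhta"), 4))  # 0.9611
```

For "Martha" and "Marhta" all six characters match and one pair (t and h) is transposed, and the shared prefix "Mar" lifts the final score.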
Resources
Levenshtein & Jaro-Winkler
Background
http://en.wikipedia.org/wiki/Levenshtein_distance
http://en.wikipedia.org/wiki/Jaro-Winkler_distance
Code
http://pypi.python.org/pypi/editdist/0.1
http://pypi.python.org/pypi/python-Levenshtein/0.10.1
Miscellaneous
String Comparison Tutorial
http://bit.ly/ZGSmF
SecondString - Java text analysis library
http://secondstring.sourceforge.net/
MarcXimiL - MARC de-duping package
http://marcximil.sourceforge.net/
http://snurl.com/uggtn