Cleaning data with Google Refine
![Page 1: Cleaning data with Google Refine](https://reader034.fdocuments.net/reader034/viewer/2022052522/554dad91b4c905ff7a8b4fb2/html5/thumbnails/1.jpg)
Get yourself ready
• Search for ‘Google Refine download’ or go to http://code.google.com/p/google-refine/wiki/Downloads
• Download and install Google Refine
• Download the data at http://bit.ly/nqbIaI
• Open Google Refine - it should open in a browser at http://127.0.0.1:3333/
Saturday, 15 October 2011
![Page 2: Cleaning data with Google Refine](https://reader034.fdocuments.net/reader034/viewer/2022052522/554dad91b4c905ff7a8b4fb2/html5/thumbnails/2.jpg)
Google Refine: cleaning data
Paul Bradshaw OnlineJournalismBlog.com, Twitter.com/paulbradshaw
![Page 3: Cleaning data with Google Refine](https://reader034.fdocuments.net/reader034/viewer/2022052522/554dad91b4c905ff7a8b4fb2/html5/thumbnails/3.jpg)
In a nutshell...
• Getting rid of common data problems
• ‘Clustering’ data to clean up multiple names for the same thing
• Manual tidying
![Page 4: Cleaning data with Google Refine](https://reader034.fdocuments.net/reader034/viewer/2022052522/554dad91b4c905ff7a8b4fb2/html5/thumbnails/4.jpg)
The basics
Common transforms
![Page 5: Cleaning data with Google Refine](https://reader034.fdocuments.net/reader034/viewer/2022052522/554dad91b4c905ff7a8b4fb2/html5/thumbnails/5.jpg)
What can you do with Google Refine?
• Clean common data problems: wrong format, inconsistent case, HTML, spaces, etc.
• Use algorithms to find similar items
• Use APIs and GREL to add new data
![Page 6: Cleaning data with Google Refine](https://reader034.fdocuments.net/reader034/viewer/2022052522/554dad91b4c905ff7a8b4fb2/html5/thumbnails/6.jpg)
"Because we take the time to clean the data, we are able to do lobbying stories no other news organisation can do."
David Donald, Center for Public Integrity
![Page 7: Cleaning data with Google Refine](https://reader034.fdocuments.net/reader034/viewer/2022052522/554dad91b4c905ff7a8b4fb2/html5/thumbnails/7.jpg)
Time spent now...
• Humans collect data
• Humans enter data
• Human error
![Page 8: Cleaning data with Google Refine](https://reader034.fdocuments.net/reader034/viewer/2022052522/554dad91b4c905ff7a8b4fb2/html5/thumbnails/8.jpg)
...Saves time later
• Different words for the same thing
• Double spaces, punctuation
• Wrong data type
• Mistyped
• Duplicate entries
• Default entries (1/1/00)
![Page 9: Cleaning data with Google Refine](https://reader034.fdocuments.net/reader034/viewer/2022052522/554dad91b4c905ff7a8b4fb2/html5/thumbnails/9.jpg)
First!
• Save some copies of the raw data
• Work on a new copy
• Save versions as you go so you can revert
• Note: Docs limited to 200,000 cells/256 cols; older versions of Excel limited to 65,536 rows
![Page 10: Cleaning data with Google Refine](https://reader034.fdocuments.net/reader034/viewer/2022052522/554dad91b4c905ff7a8b4fb2/html5/thumbnails/10.jpg)
Cleaning methods
• Group by term to see duplications
• Find & replace double spaces, etc.
• Select column/row & check data type
• Sort to find unusually large/small values, and neighbouring misspellings
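The same checks can be tried outside Refine. Below is a minimal Python sketch, not part of the original slides: the sample values and the helper names `collapse_spaces` and `group_by_term` are made up for illustration.

```python
import re
from collections import Counter


def collapse_spaces(value: str) -> str:
    """Trim the ends and replace runs of whitespace with a single space."""
    return re.sub(r"\s+", " ", value).strip()


def group_by_term(values):
    """Count how often each distinct term appears, most frequent first."""
    return Counter(values).most_common()


# Hypothetical sample column, not taken from the workshop dataset.
raw = [
    "Ministry of  Justice",
    "Ministry of Justice",
    "Home Office",
    "Home Ofice",
    " Home Office",
]

cleaned = [collapse_spaces(v) for v in raw]
counts = group_by_term(cleaned)

# Sorting alphabetically puts near-identical spellings next to each other,
# which makes a typo like "Home Ofice" easier to spot by eye.
neighbours = sorted(set(cleaned))

for term, n in counts:
    print(n, term)
print(neighbours)
```

Grouping shows which terms are duplicated once the spacing is fixed; the sorted list is what you would scan for misspellings.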
![Page 11: Cleaning data with Google Refine](https://reader034.fdocuments.net/reader034/viewer/2022052522/554dad91b4c905ff7a8b4fb2/html5/thumbnails/11.jpg)
Never publish a name from data without running a background check
Check.
![Page 12: Cleaning data with Google Refine](https://reader034.fdocuments.net/reader034/viewer/2022052522/554dad91b4c905ff7a8b4fb2/html5/thumbnails/12.jpg)
Edit cells > Common transforms
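To get a feel for what the common transforms do, here is a rough Python approximation. It is a sketch only: the function names are invented for this example and are not Refine's or GREL's, and the behaviour of Refine's own transforms may differ in edge cases.

```python
import html
import re


def trim(value: str) -> str:
    """Remove leading and trailing whitespace."""
    return value.strip()


def collapse_whitespace(value: str) -> str:
    """Replace consecutive whitespace with a single space."""
    return re.sub(r"\s+", " ", value)


def unescape_html(value: str) -> str:
    """Turn HTML entities such as &amp; back into plain characters."""
    return html.unescape(value)


def to_titlecase(value: str) -> str:
    """Capitalise the first letter of each word."""
    return value.title()


def to_number(value: str):
    """Convert text to a number where possible; otherwise leave it unchanged."""
    for convert in (int, float):
        try:
            return convert(value)
        except ValueError:
            pass
    return value


def clean(value: str) -> str:
    """Apply the text transforms one after another."""
    for step in (trim, collapse_whitespace, unescape_html, to_titlecase):
        value = step(value)
    return value


# Hypothetical messy cell value, for illustration only.
print(clean("  smith &amp;   jones LTD "))
print(to_number("42"), to_number("3.5"), to_number("n/a"))
```

Applying the steps in a fixed order mirrors clicking through the transforms one at a time on a column.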
![Page 13: Cleaning data with Google Refine](https://reader034.fdocuments.net/reader034/viewer/2022052522/554dad91b4c905ff7a8b4fb2/html5/thumbnails/13.jpg)
Facets
![Page 14: Cleaning data with Google Refine](https://reader034.fdocuments.net/reader034/viewer/2022052522/554dad91b4c905ff7a8b4fb2/html5/thumbnails/14.jpg)
Facets, Edit cells
• Edit cells > Common transforms > Unescape HTML entities
• Edit cells > Cluster and edit
• Edit cells > Split multi-valued cells
• Facet > Text facet
• Export...
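As an illustration of splitting multi-valued cells and then faceting, here is a small Python sketch. The rows, column names and helper functions are hypothetical, and the sketch copies the other fields onto each new row, which may not match how Refine lays out the extra rows.

```python
from collections import Counter


def split_multi_valued(rows, column, separator=","):
    """Return one row per value found in `column`, copying the other fields."""
    result = []
    for row in rows:
        for part in row[column].split(separator):
            new_row = dict(row)
            new_row[column] = part.strip()
            result.append(new_row)
    return result


def text_facet(rows, column):
    """Count how many rows carry each distinct value, like a text facet."""
    return Counter(row[column] for row in rows)


# Hypothetical rows, for illustration only.
rows = [
    {"story": "A", "tags": "health, crime"},
    {"story": "B", "tags": "crime"},
]

split_rows = split_multi_valued(rows, "tags")
facet = text_facet(split_rows, "tags")

for value, count in facet.most_common():
    print(value, count)
```

Splitting first matters: a facet on the unsplit column would treat "health, crime" as a single value and hide the fact that "crime" appears twice.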
![Page 15: Cleaning data with Google Refine](https://reader034.fdocuments.net/reader034/viewer/2022052522/554dad91b4c905ff7a8b4fb2/html5/thumbnails/15.jpg)
Clustering
An intelligent helper
![Page 16: Cleaning data with Google Refine](https://reader034.fdocuments.net/reader034/viewer/2022052522/554dad91b4c905ff7a8b4fb2/html5/thumbnails/16.jpg)
Algorithms
• Fingerprint: looks for items made up of the same words once case, punctuation and word order are ignored, e.g. “John Smith,” and “Smith, John”
• Double-metaphone: looks for similar sounds, e.g. “Horowitz” and “Horowicz”
• PPM: partial matches - try increasing the radius to find more matches
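The fingerprint idea can be sketched in a few lines of Python. This is a simplified illustration, not Refine's actual implementation (which does more normalisation), and the `fingerprint` and `cluster` names and the sample values are assumptions made for this example.

```python
import string
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Build a key that ignores case, punctuation, word order and repeated words."""
    value = value.strip().lower()
    value = value.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(value.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values that share a fingerprint; keep groups with more than one spelling."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


# Hypothetical sample values, for illustration only.
names = ["John Smith,", "Smith, John", "Jane Doe"]

print(fingerprint("John Smith,"))
print(cluster(names))
```

Both spellings of the name reduce to the same key, so they land in one cluster, while "Jane Doe" is left alone.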
![Page 17: Cleaning data with Google Refine](https://reader034.fdocuments.net/reader034/viewer/2022052522/554dad91b4c905ff7a8b4fb2/html5/thumbnails/17.jpg)
Algorithms
• Nearest neighbor: looks for shared clusters of characters, e.g. “Johnson” and “Johnsons”
• Levenshtein: counts the number of edits needed to change one item into another, e.g. “New York” -> “newyork” = 3 edits
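The edit count in the Levenshtein example can be checked with a short Python sketch of the standard dynamic-programming calculation. The function name is an assumption for this example; Refine's own code is not shown in the slides.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions and substitutions needed to turn `a` into `b`."""
    # previous[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or keep
                )
            )
        previous = current
    return previous[-1]


print(levenshtein("New York", "newyork"))   # 3: N -> n, drop the space, Y -> y
print(levenshtein("Johnson", "Johnsons"))   # 1: add the final "s"
```

The distance is case-sensitive here, which is why the two capital letters count as edits; lowercasing both values first would leave only the missing space.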
![Page 18: Cleaning data with Google Refine](https://reader034.fdocuments.net/reader034/viewer/2022052522/554dad91b4c905ff7a8b4fb2/html5/thumbnails/18.jpg)
Just a helper...
• Check and tick to apply the cleanup - click ‘Browse this cluster’ to see it in more detail.
• Research to check whether there are 2 people with the same name
• Will not spot abbreviations, e.g. MOJ vs Ministry of Justice
![Page 19: Cleaning data with Google Refine](https://reader034.fdocuments.net/reader034/viewer/2022052522/554dad91b4c905ff7a8b4fb2/html5/thumbnails/19.jpg)
![Page 20: Cleaning data with Google Refine](https://reader034.fdocuments.net/reader034/viewer/2022052522/554dad91b4c905ff7a8b4fb2/html5/thumbnails/20.jpg)
Links
• Delicious.com/paulb/kiev11
• Delicious.com/paulb/googlerefine
• OnlineJournalismBlog.com/tag/google-refine