Accessing Historical Data en masse
Ian Milligan Assistant Professor
Hello!
• Who am I?
• Ian Milligan (Assistant Professor, University of Waterloo)
• Works on Canadian, digital, youth, and web archives.
• @ianmilligan1
• Slides will all be available at http://ianmilligan.ca/getting-data/, along with links to tutorials and data.
Why gather digitized historical data en masse?
• It can let you grab data from across the globe for minimal extra effort;
• When material is digitized, it can save time and effort (no more right-clicking);
• It can let you explore extremely large datasets to find patterns, inferences, etc. in bodies of sources that you couldn't otherwise read!
Pitfalls?
• Digitization has proceeded unevenly: it requires institutional money and support, so it replicates the holdings of elite and Western institutions;
• We may not know how it works: Optical Character Recognition (OCR) for plain text, collection biases, etc.
Pitfalls?
[Chart: average number of appearances per year, 1997-2010, for the Globe, Star, Telegram, Gazette, and Citizen. Captions: "Gap between appearance and usage in ProQuest dissertations"; "Impact of Pages of the Past and Canada's Heritage Online"; "Pre-Pages of the Past and Canada's Heritage Online".]
Handle with Care
But these sources still offer considerable power when used by the right historians (you!)
Different Methods
• The Dream Case
• Application Programming Interfaces (APIs)
• Scraping Data yourself (Outwit Hub)
• Computational Methods (Python, Bash, Programming Historian)
• HistoryCrawler Virtual Machine
The Dream Case
• A dream:
• for you to find on the websites you visit;
• and for you to create for others, if you build databases…
The Dream Case
• Examples:
• http://edh-www.adw.uni-heidelberg.de/home
• http://www.cwgc.org/find-war-dead.aspx
• Lexis|Nexis
• Sometimes limited (e.g. CWGC to 50,000 records, Lexis|Nexis to a few hundred), which requires multiple searches
http://adamcrymble.blogspot.ca/2014/01/does-your-online-collection-need-api.html
Or maybe you just want a few documents?
Worth bookmarking
• Google Books Advanced Search: http://books.google.com/advanced_book_search
• Internet Archive Advanced Search: http://archive.org/advancedsearch.php
• Hathi Trust Advanced Search: http://babel.hathitrust.org/cgi/ls?a=page;page=advanced
• (Let's visit each)
Google Books
• In 'Advanced Search,' select 'Full view only'
• Do a search; pre-1923 content will be most fruitful
Internet Archive
Hathi Trust
• The world's backup drive for libraries: 4.5+ billion pages!
Also..
• Sometimes a colleague might have compiled this data for you…
• Shawn Graham (Carleton) has compiled a great list: https://github.com/hist3907b-winter2015/module2-findingdata
But if the dream case doesn't work out, it's OK.
Different Methods
• The Dream Case
• Application Programming Interfaces (APIs)
• Scraping Data yourself (Outwit Hub)
• Computational Methods (Python, Bash, Programming Historian)
• HistoryCrawler Virtual Machine
(this one is a bit difficult, but it helps us get some foundational concepts)
Application Programming Interfaces
• API: programs talking to each other
• In our context, it's a way to send an HTTP request and get responses back
• (This is relatively complex, but it will make more sense as we proceed through the workshop)
APIs
• JSON format (machine-readable, instead of a human-readable format like HTML).
• So if I owned 3 iPhones and an iPad (I don't), I'd structure it like this:
{ "iphones" : "3", "ipads" : "1" }
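To see why machine-readable matters, here is a minimal Python sketch that reads the slide's toy record (the key names are just this example, not a real API):

```python
import json

# The toy record from the slide: device counts stored as strings.
record = json.loads('{ "iphones" : "3", "ipads" : "1" }')

# A program can now pull values out by name instead of parsing text by eye.
total = int(record["iphones"]) + int(record["ipads"])
print(total)  # → 4
```

Real API responses work the same way, just with many more keys.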
APIs
APIs (added &fmt=json)
A good introduction: start studying URLs
http://search.canadiana.ca/search?df=1800&dt=1900&q=psycholog*&fmt=json
http://search.canadiana.ca/support/api [instructions]
URLs
• 939 pages of results (!)
• Each document in this case has a unique record key
• But we can figure out the URL formula:
• http://search.canadiana.ca/search/X?df=1800&dt=1900&q=psycholog*&fmt=json
• And solve for X, where X is a value between 1 and 939
URLs
• http://search.canadiana.ca/search/1?df=1800&dt=1900&q=psycholog*&fmt=json
• http://search.canadiana.ca/search/2?df=1800&dt=1900&q=psycholog*&fmt=json
• http://search.canadiana.ca/search/3?df=1800&dt=1900&q=psycholog*&fmt=json
• http://search.canadiana.ca/search/4?df=1800&dt=1900&q=psycholog*&fmt=json
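Typing all 939 of these by hand would be tedious; the "solve for X" step can be sketched in a few lines of Python (the query string is the one from the slides):

```python
# URL pattern from the slides, with the page number left as a placeholder.
BASE = "http://search.canadiana.ca/search/{page}?df=1800&dt=1900&q=psycholog*&fmt=json"

def page_url(page):
    """Build the search-results URL for one page of results."""
    return BASE.format(page=page)

# One URL per page of results, 1 through 939.
urls = [page_url(x) for x in range(1, 940)]
print(urls[0])
print(len(urls))  # 939
```

Each of those URLs can then be fetched and its JSON parsed, page by page.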
URLs
[JSON excerpt: "contributor" : [ "oocihm", … ], alongside a unique record number such as 3837]
EACH item on these pages has a unique number that it, and only it, has. If we can get a list of those oocihm numbers, we could get EVERY full-text item in the database.
URLs
• How do we get those key values? (Stay tuned.)
• But once we have them, we'd see that we have a list of files like:
• http://eco.canadiana.ca/view/X/?r=0&s=1&fmt=json&api_text=1 (where X is the oocihm information)
• So a URL like http://eco.canadiana.ca/view/oocihm.16278/?r=0&s=1&fmt=json&api_text=1 would get the full text of an item.
• You'd have to automate this to get all full-text sources having to do with psychology. But how?
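The shape of that automation can be sketched in Python; the URL pattern is the one above, and the actual fetch (not shown) would use a standard HTTP library with a polite pause between requests:

```python
import time

def fulltext_url(key):
    """Build the full-text view URL for one oocihm record key (pattern from the slides)."""
    return "http://eco.canadiana.ca/view/" + key + "/?r=0&s=1&fmt=json&api_text=1"

# In practice this list would hold every record key pulled from the search results.
keys = ["oocihm.16278"]

for key in keys:
    print(fulltext_url(key))
    # ...fetch the URL and save the JSON here...
    time.sleep(1)  # be polite to the server between requests
```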
Downloading all the files
• We can turn to some other resources, which are a useful demonstration of how DH involves code sharing:
• http://ianmilligan.ca/2014/01/07/historians-love-json-or-one-quick-example-of-why-it-rocks/
• https://canzac.wordpress.com/2014/09/02/canadiana-in-context/
• I'll explain the code and share it
Different Methods
• The Dream Case
• Application Programming Interfaces (APIs)
• Scraping Data yourself (Outwit Hub)
• Computational Methods (Python, Bash, Programming Historian)
• HistoryCrawler Virtual Machine
Outwit Hub
• A free software suite that finds ‘structure’ in web pages and grabs the information that you’re looking for.
• Free in a limited version.
• https://www.outwit.com/products/hub/
Outwit Hub
• A starting database to try this out on: Suda On Line, a 10th-century Byzantine Greek historical encyclopedia
• http://www.stoa.org/sol/
Outwit Hub
Scraper fields:
• Adler Number: begins after 'Adler number: </strong>', ends before '<br/>'
• Translation: begins after '<div class="translation">', ends before '</div>'
Outwit Hub
• Step One: Install Outwit Hub
• Step Two: Paste a URL into the bar at the top of the page
• Step Three: Click 'scrapers,' then 'new,' and give it a name
• Step Four: Say no thanks to buying it (at least for now)
Outwit Hub
• Press 'Catch' if you want to keep going with other websites
• 'Catch' moves the results into your memory bank
• Or you can press 'Export' when you're done to generate a spreadsheet
• (Do a second search for 'rome' and watch it auto-catch)
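Under the hood, Outwit's "begins after / ends before" markers are just delimited-substring extraction. A Python sketch of the same idea, run on an invented fragment shaped like a Suda On Line entry (the HTML here is made up for illustration):

```python
import re

def extract(text, begins_after, ends_before):
    """Grab whatever sits between two literal markers, Outwit-style."""
    pattern = re.escape(begins_after) + "(.*?)" + re.escape(ends_before)
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1) if match else None

# Invented page fragment in the shape of a Suda On Line entry.
html = ('<strong>Adler number: </strong>alpha,100<br/>'
        '<div class="translation">An example translation.</div>')

print(extract(html, 'Adler number: </strong>', '<br/>'))
print(extract(html, '<div class="translation">', '</div>'))
```

The markers are the same ones configured in the scraper above; Outwit simply applies them to every page you feed it.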
It’s a good introduction, but sometimes you need better tools…
Different Methods
• The Dream Case
• Application Programming Interfaces (APIs)
• Scraping Data yourself (Outwit Hub)
• Computational Methods (Python, Bash, Programming Historian)
• HistoryCrawler Virtual Machine
The Dreaded Command Line
• Most of these programs are based in a UNIX environment
• Ian Milligan and James Baker (British Library), “Introduction to the Bash Command Line.” http://programminghistorian.org/lessons/intro-to-bash
The Dreaded Command Line
• Not so bad once you get into it!
• Allows you to run some pretty fine-tuned commands, and begin to rapidly move around your computer.
• Does have a learning curve, but it is worth it.
Basic Programming
• ProgrammingHistorian.org
• Basic programming techniques with an applied perspective
• Not general examples, but specific ones.
Basic Programming
Wget
• A powerful tool for retrieving online material
• Command line only (!)
• Easy way to install on OS X:
• install Homebrew (one line to install, at brew.sh)
• then 'brew install wget'
• To install on all platforms: http://programminghistorian.org/lessons/automated-downloading-with-wget
The Internet Archive
• 15 PB of awesome historical, cultural sources
• But occasionally cumbersome to access en masse
Wget and the Internet Archive
• http://blog.archive.org/2012/04/26/downloading-in-bulk-using-wget/
• Let’s grab all the files relating to a given collection
Finding a Collection
• The Boston Public Library Anti-Slavery Collection
• https://archive.org/details/bplscas
• (but there are many others)
Finding a Collection
• Everything in the Internet Archive has a unique URL, like this: http://archive.org/details/[IDENTIFIER]
• So an item might be: http://archive.org/details/lettertowilliaml00doug
• And the collection is: http://archive.org/details/bplscas/
Finding a Collection
• Create a directory to store all our files
• Visit the advanced search page (http://archive.org/advancedsearch.php)
• Click on 'collection'; a big list loads. Click on 'bplscas' and then search
Finding a Collection
• 8,265 results. That’d be a lot of ‘right clicking’ to download.
• We confirm that this is indeed what we want.
• So we go back.
Finding a Collection
• So we do this:
• Scroll down, do a search for "collection:bplscas", sort by "date asc" (ascending dates), and select CSV format.
• Number of results: 7971
• Click 'search' and download the file
Finding a Collection
• It looks like this: one line per file.
• The first one is dialoguscreatura00nico
• Put that into the search bar, press enter… and voilà.
Finding a Collection
Finding a Collection
• We can now download every single entry in that list; in this case, everything within the Boston Public Library Anti-Slavery Collection.
• We can decide if we want every single format (probably not), or perhaps just the TXT files, or the PDFs, etc.
Finding a Collection
• Step One: Open the CSV file and delete the first line that reads ‘identifier’
• Step Two: Save it as a text file - itemlist.txt
• Step Three: Use wget, copying the commands from the Internet Archive blog post. :)
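Steps One and Two can also be done with a few lines of Python rather than by hand; a sketch, assuming the identifier sits in the CSV's first column:

```python
import csv

def make_itemlist(csv_path, out_path):
    """Copy the identifier column of the search-results CSV into a plain
    text file, one identifier per line, skipping the header row."""
    with open(csv_path, newline='') as src, open(out_path, 'w') as dest:
        reader = csv.reader(src)
        next(reader)  # drop the 'identifier' header line
        for row in reader:
            dest.write(row[0] + '\n')
```

Calling make_itemlist('search.csv', 'itemlist.txt') then hands the result straight to wget's -i flag.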
Example Commands
• All files:
• wget -r -H -nc -np -nH --cut-dirs=2 -e robots=off -l1 -i ./itemlist.txt -B 'http://archive.org/download/'
• Certain file formats
• wget -r -H -nc -np -nH --cut-dirs=2 -A .pdf,.epub -e robots=off -l1 -i ./itemlist.txt -B 'http://archive.org/download/'
Our command
• We just want the TXT files
• wget -r -H -nc -np -nH --cut-dirs=2 -A .txt -e robots=off -l1 -i ./itemlist.txt -B 'http://archive.org/download/'
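If wget is unavailable, the same TXT URLs can be built in Python. The _djvu.txt filename is an assumption here: the Internet Archive usually names its OCR plain text that way, but verify on a sample item before downloading in bulk:

```python
def txt_url(identifier):
    """Build a likely URL for an item's OCR text.

    ASSUMPTION: the Internet Archive typically stores plain-text OCR as
    <identifier>_djvu.txt in the item's download directory; check a
    sample item first.
    """
    return ("http://archive.org/download/" + identifier + "/"
            + identifier + "_djvu.txt")

print(txt_url("lettertowilliaml00doug"))
```

In practice, you would read itemlist.txt line by line, call txt_url on each identifier, and fetch each result with urllib.request.urlretrieve, pausing between requests.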
Exploring
• Now we have LOTS of text files. Or PDFs. Or EPUBs. Or whatever we want for whatever purposes.
Programmatically Interacting
• Caleb McDaniel’s “Data Mining the Internet Archive Collection” at http://programminghistorian.org/lessons/data-mining-the-internet-archive
• Uses the Python programming language to download metadata (information about information)
Programmatically Interacting
• It goes through and grabs the MARC data (library records) for everything in the Anti-Slavery Collection
• It is decently documented and we don’t have time today. However, we can steal his code.
Stealing Code
#!/usr/bin/python
import internetarchive
import time

error_log = open('bpl-marcs-errors.log', 'a')
search = internetarchive.search_items('collection:bplscas')

for result in search:
    itemid = result['identifier']
    item = internetarchive.get_item(itemid)
    marc = item.get_file(itemid + '_marc.xml')
    try:
        marc.download()  # grab the MARC record for this item
    except Exception as e:
        error_log.write('could not download ' + itemid + '\n')
    time.sleep(1)  # pause between requests
Programmatically Interacting
• We save this file into a new directory (slavery-marc) and then run it.
• BORROWING CODE IS OK.
• On the command line, we would type:
• python ia-download.py
Programmatically Interacting
• The results!
• Using his pymarc script to generate location data.
Programmatically Interacting
• Other tools
• Adam Crymble, “Downloading Multiple Records Using Query Strings.” [http://programminghistorian.org/lessons/downloading-multiple-records-using-query-strings]
Two Main Programs
• obo.py
• (which contains definitions for several functions that you call)
• download-searches.py
• (where you can swap out your query and download the result files, all without visiting the site)
Different Methods
• The Dream Case
• Application Programming Interfaces (APIs)
• Scraping Data yourself (Outwit Hub)
• Computational Methods (Python, Bash, Programming Historian)
• HistoryCrawler Virtual Machine
HistoryCrawler
• Download link: http://ianmilligan.ca/historycrawler [link to repository at York University, Toronto]
• Instructions: http://williamjturkel.net/2014/09/09/creating-the-historycrawler-virtual-machine/
• Solves problems of dependencies and reproducibility by providing a virtual environment for scholars
HistoryCrawler
• Step One: Download HistoryCrawler201407-32b.ova from the previous links
• Step Two: Install Oracle VM VirtualBox (https://www.virtualbox.org/)
• Step Three: File → Import Appliance → select the .ova file to generate your machine
• Step Four: Press 'start.' You may have to wait ~1-2 minutes.
• Step Five: The password is 'go'
HistoryCrawler
Tutorials
• Mary Beth Start (PhD Candidate, Western University, Ontario): http://marybethstart.wordpress.com/2014/09/09/getting-started-virtualbox-and-historycrawler/
• William Turkel (Associate Professor, Western University, Ontario): http://williamjturkel.net/how-to/#virtualmachine
HistoryCrawler: A platform for teaching?
• Does require a decent computer to run on
• But it:
• eliminates problems of dependencies;
• eliminates installation issues;
• gets everybody on the same platform;
• allows for sharing and reproducibility of research inputs/outputs.
• Still in progress; I would love any feedback.
Conclusions, Questions & Your Own Data?
Ian Milligan Assistant Professor