Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource
![Page 1: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/1.jpg)
Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource
Andrea Thomer, Gaurav Vaidya*, Robert Guralnick, David Bloom, Laura Russell
![Page 2: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/2.jpg)
GBIF (389 million records!)
http://data.gbif.org/
![Page 3: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/3.jpg)
Where species are, where species aren’t
http://www.mappinglife.org/Sayornis_saya
![Page 4: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/4.jpg)
The big picture (AKN)
Chronhorogram (Ariño & Otegui, 2010), extracted using BIDDSAT (Otegui & Ariño, 2012)
http://www.unav.es/unzyec/mzna/biddsat/recsperyear.php?prov=10&dataset=all&db=GBIF_201202
![Page 5: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/5.jpg)
An expedition into the Rockies, 1904
http://commons.wikimedia.org/wiki/File:Tent_in_montane_field_site.tif
![Page 6: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/6.jpg)
The Great Outdoors
http://commons.wikimedia.org/wiki/File:Step_Valley_Lake_near_Arapahoe_Glacier.tif
![Page 7: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/7.jpg)
http://commons.wikimedia.org/wiki/File:Four_field_biologists_on_glacier.tif
Exploration time
![Page 8: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/8.jpg)
University of Colorado Museum of Natural History (CUMNH) -- founded 1909
http://pinterest.com/cumnh/
http://media-cache-ec3.pinterest.com/avatars/ucmnh-1346976471_600.jpg
![Page 9: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/9.jpg)
Junius Henderson
CUMNH Curator, 1902-1933
http://commons.wikimedia.org/wiki/File:Junius_Henderson.jpg
![Page 10: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/10.jpg)
http://commons.wikimedia.org/wiki/File:Four_field_biologists_on_glacier.tif
Exploration time
![Page 11: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/11.jpg)
Henderson’s notebooks
![Page 12: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/12.jpg)
![Page 13: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/13.jpg)
![Page 14: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/14.jpg)
“This entire project was only possible because people had been making small steps towards digitization over the last 10 years” -- Andrea Thomer
![Page 15: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/15.jpg)
Wikisource: a transcription platform
![Page 16: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/16.jpg)
Step 1: Scanning (1996)
![Page 17: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/17.jpg)
The Process
1. Images on the Wikimedia Commons.
2. Images + text on Wikisource.
3. Images + text + annotations on Wikisource.
4. Data using the MediaWiki APIs.
• Full details: http://dx.doi.org/10.3897/zookeys.209.3247
• Short URL: http://bit.ly/henderson-paper
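Step 4 relies on the MediaWiki APIs. As a minimal sketch, assuming the standard `action=query` / `prop=revisions` interface and the English Wikisource endpoint, the request for one notebook page's raw wikitext could be built like this. The function names are hypothetical and this is not the script used in the project.

```python
from urllib.parse import urlencode

# Assumption: the English Wikisource exposes the usual MediaWiki API entry point.
API_ENDPOINT = "https://en.wikisource.org/w/api.php"


def notebook_page_title(index_file: str, page_number: int) -> str:
    """Wikisource stores each scanned page under 'Page:<file>/<n>'."""
    return f"Page:{index_file}/{page_number}"


def build_wikitext_query(page_title: str) -> str:
    """Return a URL asking the API for the current wikitext of one page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",   # include the revision text itself
        "rvslots": "main",     # main content slot
        "titles": page_title,
        "format": "json",
        "formatversion": "2",
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"
```

Calling `build_wikitext_query(notebook_page_title("Field Notes of Junius Henderson, Notebook 1.djvu", 3))` gives a URL that can be fetched with any HTTP client; the JSON response then carries the transcribed text, annotations included.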
![Page 18: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/18.jpg)
#1. The Wikimedia Commons
![Page 19: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/19.jpg)
Copyright?
http://commons.wikimedia.org/wiki/File:Licensing_tutorial_en.svg
![Page 20: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/20.jpg)
Copyright!
http://commons.wikimedia.org/wiki/Template:PD-scan
http://commons.wikimedia.org/wiki/Template:PD-US-unpublished
![Page 21: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/21.jpg)
Result #1: Images
http://commons.wikimedia.org/wiki/File:Field_Notes_of_Junius_Henderson,_Notebook_1.pdf
![Page 22: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/22.jpg)
#2. Images + text
http://en.wikisource.org/wiki/Index:Field_Notes_of_Junius_Henderson,_Notebook_1.djvu
![Page 23: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/23.jpg)
Just like Wikipedia
http://en.wikisource.org/w/index.php?title=Page:Field_Notes_of_Junius_Henderson,_Notebook_1.djvu/3&oldid=3528371
![Page 24: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/24.jpg)
Dr. Peter Robinson
http://cumuseum.colorado.edu/about/newsdetail.php?newsID=3
CUMNH Director, 1971-1982
Transcribed Henderson’s notebooks, 2000-02
![Page 25: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/25.jpg)
Step 2: Transcription
![Page 26: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/26.jpg)
Result #2: Images + text
http://en.wikisource.org/w/index.php?title=Page:Field_Notes_of_Junius_Henderson,_Notebook_1.djvu/3&oldid=3528371
![Page 27: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/27.jpg)
Result #2: Images + text
http://en.wikisource.org/w/index.php?title=Page:Field_Notes_of_Junius_Henderson,_Notebook_1.djvu/3&oldid=3528371
![Page 28: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/28.jpg)
Combining multiple pages
![Page 29: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/29.jpg)
#3. Images + text + annotations
http://en.wikisource.org/w/index.php?title=Page:Field_Notes_of_Junius_Henderson,_Notebook_1.djvu/3&oldid=3528371
![Page 30: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/30.jpg)
Wikipedia templates
![Page 31: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/31.jpg)
Wikipedia templates are everywhere
![Page 32: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/32.jpg)
The “Neutrality” template
![Page 33: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/33.jpg)
The “Neutrality” template
![Page 34: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/34.jpg)
Examples of templates
![Page 35: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/35.jpg)
Examples of templates
![Page 36: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/36.jpg)
Examples of templates
![Page 37: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/37.jpg)
A template of our own
{{element|formal name of this element|element as written by Henderson}}
Examples:
{{taxon|Sayornis saya|Say Phoebe}}
{{taxon|Carduelis pinus|siskins}}
{{taxon|Siskin|siskins}}
![Page 38: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/38.jpg)
A template of our own
{{element|formal name of this element|element as written by Henderson}}
Examples:
{{dated|1905-07-28|July 28, 1905}}
{{place|Boulder, Colorado|Boulder, Colo}}
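Because every annotation follows the same `{{element|formal name|as written}}` shape, it can be pulled back out of the wikitext mechanically. The sketch below uses a regular expression and assumes two-argument templates with no nesting and no extra pipes; `Annotation`, `extract_annotations` and `strip_annotations` are illustrative names, not part of Wikisource or of the project's code.

```python
import re
from typing import NamedTuple


class Annotation(NamedTuple):
    element: str     # template name: "taxon", "dated" or "place"
    formal: str      # normalized value supplied by the annotator
    as_written: str  # the text as Henderson wrote it


# Assumption: arguments contain no nested templates and no "|" characters.
TEMPLATE_RE = re.compile(r"\{\{\s*(taxon|dated|place)\s*\|([^|{}]*)\|([^|{}]*)\}\}")


def extract_annotations(wikitext: str) -> list[Annotation]:
    """Return every annotation in document order."""
    return [
        Annotation(m.group(1), m.group(2).strip(), m.group(3).strip())
        for m in TEMPLATE_RE.finditer(wikitext)
    ]


def strip_annotations(wikitext: str) -> str:
    """Replace each template with the text as written, recovering the plain transcription."""
    return TEMPLATE_RE.sub(lambda m: m.group(3), wikitext)
```

The third argument keeps the transcription faithful to the notebook, while the second gives a value that can be compared across pages, so the same markup serves both readers and data extraction.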
![Page 39: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/39.jpg)
#3. Annotations
![Page 40: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/40.jpg)
#3. Annotations
![Page 41: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/41.jpg)
#3. Annotations
![Page 42: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/42.jpg)
Calling all volunteers!
![Page 43: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/43.jpg)
Calling all volunteers!
![Page 44: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/44.jpg)
Result #3. Image + text + annotations!
![Page 45: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/45.jpg)
Volunteers arrive
![Page 46: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/46.jpg)
Volunteers arrive
![Page 47: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/47.jpg)
#4. Data
http://www.mappinglife.org/Sayornis_saya
![Page 48: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/48.jpg)
Simple algorithm
![Page 49: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/49.jpg)
Simple algorithm
![Page 50: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/50.jpg)
Simple algorithm
![Page 51: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/51.jpg)
Simple algorithm
![Page 52: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/52.jpg)
Simple algorithm
![Page 53: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/53.jpg)
Complicated script
![Page 54: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/54.jpg)
Complicated, open source script
![Page 55: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/55.jpg)
Result #4. (Text + Images + Annotation) = Data!
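The slide text does not spell the algorithm out, so the following is only one plausible sketch and not the project's open source script: it assumes that each `taxon` annotation is paired with the most recent `dated` and `place` annotations that precede it on the page. The function name and record keys are hypothetical.

```python
import re

# Assumption: two-argument templates, no nesting, as in the examples on the slides.
TEMPLATE_RE = re.compile(r"\{\{\s*(taxon|dated|place)\s*\|([^|{}]*)\|([^|{}]*)\}\}")


def occurrence_records(wikitext: str) -> list[dict]:
    """Pair each taxon with the latest date and place seen before it."""
    current = {"dated": None, "place": None}
    records = []
    for match in TEMPLATE_RE.finditer(wikitext):
        element, formal, verbatim = (group.strip() for group in match.groups())
        if element == "taxon":
            records.append({
                "taxon": formal,
                "verbatim_name": verbatim,
                "date": current["dated"],
                "place": current["place"],
            })
        else:
            # A new date or place replaces the previous one for later taxa.
            current[element] = formal
    return records
```

A taxon that appears before any date or place on the page gets `None` for those fields, which flags it for manual review rather than guessing.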
![Page 56: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/56.jpg)
Where do we go from here?
http://commons.wikimedia.org/wiki/File:Bighorn_sheep_skull_at_Arapaho_glacier,_1904.tif
![Page 57: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/57.jpg)
More books to upload
![Page 58: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/58.jpg)
More books to transcribe
![Page 59: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/59.jpg)
More books to transcribe
http://www.biodiversitylibrary.org/
![Page 60: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/60.jpg)
A better Wikisource
https://commons.wikimedia.org/wiki/File:Wikisource_2012_-_Aubrey.pdf?page=19
![Page 61: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/61.jpg)
“This entire project was only possible because people had been making small steps towards digitization over the last 10 years” -- Andrea Thomer
![Page 63: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/63.jpg)
![Page 64: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/64.jpg)
The following slides were not used in my presentation
![Page 65: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/65.jpg)
Museum collections
![Page 66: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/66.jpg)
Museum records
240. Physa anatina Lea .......................................... Identified by Bastsch
Creek 1 mile north of Loveland, Colo. June 9, 1906. Junius Henderson
![Page 67: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/67.jpg)
Museum records
240. Physa anatina Lea .......................................... Identified by Bastsch
Creek 1 mile north of Loveland, Colo. June 9, 1906. Junius Henderson
![Page 68: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/68.jpg)
Problem: context
240. Physa anatina Lea .......................................... Identified by Bastsch
Creek 1 mile north of Loveland, Colo. June 9, 1906. Junius Henderson
![Page 69: Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikisource](https://reader038.fdocuments.net/reader038/viewer/2022102806/559319421a28abff7b8b47c1/html5/thumbnails/69.jpg)