Stephen Gottschalk, Anthony Kirchgessner, Kimberly Watson
description
Transcript of Stephen Gottschalk, Anthony Kirchgessner, Kimberly Watson
![Page 1: Stephen Gottschalk, Anthony Kirchgessner, Kimberly Watson](https://reader035.fdocuments.net/reader035/viewer/2022081505/56816380550346895dd46377/html5/thumbnails/1.jpg)
Stephen Gottschalk, Anthony Kirchgessner, Kimberly Watson
IDigBio July, 2012
Implementing Optical Character Recognition in Herbarium Digitization: current practices
and challenges
![Page 2: Stephen Gottschalk, Anthony Kirchgessner, Kimberly Watson](https://reader035.fdocuments.net/reader035/viewer/2022081505/56816380550346895dd46377/html5/thumbnails/2.jpg)
Caribbean Workflow
Curation and rapid barcoding of specimens
Specimen imaging
Optical CharacterRecognition (OCR)and data parsing
Specimen CatalogRecord
Fieldbook Data
Manual keying of specimen data
![Page 3: Stephen Gottschalk, Anthony Kirchgessner, Kimberly Watson](https://reader035.fdocuments.net/reader035/viewer/2022081505/56816380550346895dd46377/html5/thumbnails/3.jpg)
What is OCR?
Image Output
![Page 4: Stephen Gottschalk, Anthony Kirchgessner, Kimberly Watson](https://reader035.fdocuments.net/reader035/viewer/2022081505/56816380550346895dd46377/html5/thumbnails/4.jpg)
Processing ConsiderationsImage Processing:• Image size• Color = ~10 mb• Grayscale = ~1 mb
• Processing time• Images cropped to
label can be OCR’d ~10 x faster than uncropped
![Page 5: Stephen Gottschalk, Anthony Kirchgessner, Kimberly Watson](https://reader035.fdocuments.net/reader035/viewer/2022081505/56816380550346895dd46377/html5/thumbnails/5.jpg)
• Corporate edition allows for batch processing large numbers of images at once
• Unique identifiers link the specimen OCR data and the image
• Option for pattern training to enhance OCR quality
Optically Recognizing with ABBYY
![Page 6: Stephen Gottschalk, Anthony Kirchgessner, Kimberly Watson](https://reader035.fdocuments.net/reader035/viewer/2022081505/56816380550346895dd46377/html5/thumbnails/6.jpg)
• 162 Charles Wright, Cuba labels and 114 Tom Zanoni, Dominican Republic labels
• Wright labels chosen because they are difficult to read with OCR, have the most room for improvement
• Zanoni labels are in general more legible, but also contain much more text
• Label headings are unique to each label type, changes in OCR accuracy can be tracked across trials
A Case Study
![Page 7: Stephen Gottschalk, Anthony Kirchgessner, Kimberly Watson](https://reader035.fdocuments.net/reader035/viewer/2022081505/56816380550346895dd46377/html5/thumbnails/7.jpg)
OCR Parameters• Both label types put through the same set of
OCR trials
Trial 1: Built-in parametersTrial 2: Train Pattern Recognition
on one label*Trial 3: Train PR on multiple labelsTrial 4: Train PR on Zanoni label type
Trial 1: Built-in parametersTrial 2: Train Pattern Recognition
on one labelTrial 3: Train PR on multiple labelsTrial 4: Train PR on Wright label type
Wright Labels Zanoni Labels
Trial 5: Train PR on both label types
*Trial 6: add ‘æ’ to English language, train PR on multiple labels
![Page 8: Stephen Gottschalk, Anthony Kirchgessner, Kimberly Watson](https://reader035.fdocuments.net/reader035/viewer/2022081505/56816380550346895dd46377/html5/thumbnails/8.jpg)
TrialsStep 1: all images set to 300 dpi, cropped to label, language = autoselect
![Page 9: Stephen Gottschalk, Anthony Kirchgessner, Kimberly Watson](https://reader035.fdocuments.net/reader035/viewer/2022081505/56816380550346895dd46377/html5/thumbnails/9.jpg)
TrialsStep 2: Pattern Recognition is carried out
![Page 10: Stephen Gottschalk, Anthony Kirchgessner, Kimberly Watson](https://reader035.fdocuments.net/reader035/viewer/2022081505/56816380550346895dd46377/html5/thumbnails/10.jpg)
TrialsStep 3: Run the OCR!
(trained multiple)
(built in)
![Page 11: Stephen Gottschalk, Anthony Kirchgessner, Kimberly Watson](https://reader035.fdocuments.net/reader035/viewer/2022081505/56816380550346895dd46377/html5/thumbnails/11.jpg)
Trial Results: Wright Labels162 Labels total
0.0%
10.0%
20.0%
30.0%
40.0%
50.0%
60.0%
70.0%
80.0%
90.0%"Plantæ"
"Cubenses"
"Wrightianæ"
Full String
Perc
enta
ge o
f lab
els r
ead
corr
ectly
Pattern recognition trial
![Page 12: Stephen Gottschalk, Anthony Kirchgessner, Kimberly Watson](https://reader035.fdocuments.net/reader035/viewer/2022081505/56816380550346895dd46377/html5/thumbnails/12.jpg)
Trial Results: Zanoni Labels114 Labels Total
Pattern recognition trial
Perc
enta
ge o
f lab
els r
ead
corr
ectly
built-in trained once trained mult trained other trained both0.0%
10.0%
20.0%
30.0%
40.0%
50.0%
60.0%
70.0%
80.0%
90.0%
100.0%
"Moscoso"
"Rafael"
"Zanoni"
Full Heading: Jardin Botanico Nacional "Dr. Rafael M. Moscoso"
stripped “ " . ” punctuation from heading: Jardin Botanico Nacional Dr Rafael M Moscoso
![Page 13: Stephen Gottschalk, Anthony Kirchgessner, Kimberly Watson](https://reader035.fdocuments.net/reader035/viewer/2022081505/56816380550346895dd46377/html5/thumbnails/13.jpg)
OCR Text and Next Steps• How to get the individual text files into a database
![Page 14: Stephen Gottschalk, Anthony Kirchgessner, Kimberly Watson](https://reader035.fdocuments.net/reader035/viewer/2022081505/56816380550346895dd46377/html5/thumbnails/14.jpg)
OCR Text and Next Steps• How to get the individual text files into a database– Step 1. Read the file name and text into Excel
using a Powershell script.
![Page 15: Stephen Gottschalk, Anthony Kirchgessner, Kimberly Watson](https://reader035.fdocuments.net/reader035/viewer/2022081505/56816380550346895dd46377/html5/thumbnails/15.jpg)
OCR Text and Next Steps
![Page 16: Stephen Gottschalk, Anthony Kirchgessner, Kimberly Watson](https://reader035.fdocuments.net/reader035/viewer/2022081505/56816380550346895dd46377/html5/thumbnails/16.jpg)
OCR Text and Next Steps• How to get the individual text files into a database– Step 1. Read the file name and text into Excel
using a Powershell script.– Step 2. Parse the file name and migrate to
database of choice.• File names are created with a pattern, so that unique
barcodes are easily parsed:v-284-00041202.txt -> 41202
![Page 17: Stephen Gottschalk, Anthony Kirchgessner, Kimberly Watson](https://reader035.fdocuments.net/reader035/viewer/2022081505/56816380550346895dd46377/html5/thumbnails/17.jpg)
OCR Text and Next Steps
• Finally, what we end up with is: Skeletal data with some data parsed into
fields (e.g. barcode, taxon, image). Images associated with these records. OCR data associated with the images and
database records. OCR data parsed into fields within database records.
![Page 18: Stephen Gottschalk, Anthony Kirchgessner, Kimberly Watson](https://reader035.fdocuments.net/reader035/viewer/2022081505/56816380550346895dd46377/html5/thumbnails/18.jpg)
OCR Text and Next Steps• Natural Language Processing, Machine
Learning and data parsing through Symbiota, Salix, etc. are emerging technologies being explored to complete the catalog records directly from OCR text.
![Page 19: Stephen Gottschalk, Anthony Kirchgessner, Kimberly Watson](https://reader035.fdocuments.net/reader035/viewer/2022081505/56816380550346895dd46377/html5/thumbnails/19.jpg)
AcknowledgementsNational Science Foundation• Digitization of Caribbean Plants and Fungi in The New York Botanical
Garden Herbarium• Digitization TCN: Collaborative Research: Plants, Herbivores, and Parasitoids:
A Model System for the Study of Tri-Trophic Associations
• Barbara Thiers, Robert Naczi, Michael Bevans, Melissa Tulig, Nicole Tarnowsky, Vinson Doyle, Jessica Allen, Elizabeth Kiernan, Annie Virnig, Brandy Watts, Charles Zimmerman
• Visit the Virtual Herbarium: http://sciweb.nybg.org/science2/vii2.asp