Digitizing California Arthropod Collections

Peter Oboyski, Phuc Nguyen, Serge Belongie, Rosemary Gillespie. Essig Museum of Entomology, University of California, Berkeley, California, USA


Transcript of Digitizing California Arthropod Collections

Page 1: Digitizing California Arthropod Collections

Digitizing California Arthropod Collections

Peter Oboyski, Phuc Nguyen, Serge Belongie, Rosemary Gillespie

Essig Museum of Entomology

University of California

Berkeley, California, USA

Page 2: Digitizing California Arthropod Collections

Who is CalBug?

Essig Museum of Entomology

California Academy of Sciences

California State Collection of Arthropods

Bohart Museum, UC Davis

Entomology Research Museum, UC Riverside

San Diego Natural History Museum

LA County Museum

Santa Barbara Museum of Natural History

Page 3: Digitizing California Arthropod Collections
Page 4: Digitizing California Arthropod Collections

Digitization workflow

Handling & Imaging

• (Optional) Sort by locality, date, sex, etc.
• Remove labels, add unique identifier
• Take digital image, name and save file
• Replace labels, return to collection

Data Capture

• Manually enter data into MySQL database
• Online crowd-sourcing of manual data entry
• Optical Character Recognition (OCR) & automated data parsing

Data Manipulation

• Error checking
• Geographic referencing
• Aggregate data in online cache
• Temporospatial analyses
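The automated data parsing step in the workflow can be illustrated with a small sketch. It assumes a transcribed label string in a hypothetical "STATE: County Co. Locality DD Mon YYYY" layout; the regular expressions, field names, and sample text are illustrative assumptions, not the parser CalBug uses.

```python
import re
from datetime import date

# Month abbreviations mapped to month numbers (1-12).
MONTHS = {name: number for number, name in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}

# Assumed date layout: day, month name (abbreviated or full), four-digit year.
DATE_RE = re.compile(r"\b(\d{1,2})[ .-]([A-Za-z]{3})[a-z]*\.?[ .-](\d{4})\b")
# Assumed county layout: a capitalized name followed by "Co" or "Co."
COUNTY_RE = re.compile(r"\b([A-Z][A-Za-z ]+?) Co\b")

def parse_label(text):
    """Extract a collection date and county from label text, if present."""
    result = {"verbatim": text, "date_collected": None, "county": None}
    date_match = DATE_RE.search(text)
    if date_match:
        day, month_name, year = date_match.groups()
        month = MONTHS.get(month_name.lower())
        if month:
            try:
                result["date_collected"] = date(int(year), month, int(day))
            except ValueError:
                pass  # impossible date such as 31 Feb: leave for manual review
    county_match = COUNTY_RE.search(text)
    if county_match:
        result["county"] = county_match.group(1)
    return result

# Hypothetical label text, for illustration only.
parsed = parse_label("CALIF: Alameda Co. Berkeley 12 Jun 1965")
print(parsed)
```

Keeping the verbatim string alongside the parsed fields means anything the patterns miss can still be checked against the original label during error checking.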

Page 5: Digitizing California Arthropod Collections

Why Image Specimens/Labels?

• Data capture can be done remotely
• Magnify difficult-to-read labels
• Potential for OCR
• Verbatim digital archive of label data

Page 6: Digitizing California Arthropod Collections

1st generation - DinoLite digital microscope

Page 7: Digitizing California Arthropod Collections
Page 8: Digitizing California Arthropod Collections

2nd generation – Digital Camera (Canon G9)

Page 9: Digitizing California Arthropod Collections

Higher resolution

Labels flat & unobstructed

Scale bar, controlled light

Important to add species name to image or file name

EMEC218958 Paracotalpa ursina.jpg

~150,000 images waiting to be databased
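Putting the identifier and species name in the file name means both can be recovered later without opening the image. A minimal sketch, assuming every file follows the "identifier Genus species.jpg" pattern of the example above; the regular expression and the `parse_image_name` helper are hypothetical:

```python
import re
from pathlib import Path

# Assumed pattern: letters + digits identifier, a space, then "Genus species".
FILENAME_RE = re.compile(
    r"^(?P<catalog>[A-Z]+\d+)\s+(?P<genus>[A-Z][a-z]+)\s+(?P<species>[a-z-]+)$"
)

def parse_image_name(filename):
    """Split an image file name into catalog number and scientific name."""
    stem = Path(filename).stem  # drop the ".jpg" extension
    match = FILENAME_RE.match(stem)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    return {
        "catalog_number": match["catalog"],
        "scientific_name": f"{match['genus']} {match['species']}",
    }

record = parse_image_name("EMEC218958 Paracotalpa ursina.jpg")
print(record)
```

File names that do not fit the pattern raise an error rather than being guessed at, so they can be set aside for manual renaming.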

Page 10: Digitizing California Arthropod Collections

Data capture

Manually enter data into MySQL database

• Using our own MySQL database (EssigDB)
• Built-in error checking
• Data carry-over from one record to the next
• Taxonomy automatically added

Online crowd-sourcing of manual data entry

• “Notes from Nature”
• Collaboration with Zooniverse
• Citizen Scientist transcription of labels

Optical Character Recognition (OCR) & automated data parsing

• Collaboration with UC San Diego
• Improved word spotting & OCR
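The data carry-over and built-in error checking described for manual entry can be sketched in a few lines. This is an illustration with plain dictionaries, not EssigDB's actual schema or code; the field names, checks, and sample values are all hypothetical.

```python
from datetime import date

# Hypothetical fields that tend to repeat across a series of specimens.
CARRY_OVER_FIELDS = ("state", "county", "locality", "collector")

def carry_over(previous, entered):
    """Fill blank carry-over fields in `entered` from the previous record."""
    record = dict(entered)
    for field in CARRY_OVER_FIELDS:
        if not record.get(field) and previous.get(field):
            record[field] = previous[field]
    return record

def check_record(record):
    """Return a list of problems found in a record (empty means none)."""
    problems = []
    if not record.get("catalog_number"):
        problems.append("missing catalog number")
    collected = record.get("date_collected")
    if collected is not None and collected > date.today():
        problems.append("collection date is in the future")
    return problems

# Hypothetical records: the second leaves the repeated fields blank.
previous = {"catalog_number": "EX0001", "state": "California",
            "county": "Example County", "locality": "Example locality",
            "collector": "A. Collector", "date_collected": date(1950, 6, 1)}
entered = {"catalog_number": "EX0002", "county": "",
           "date_collected": date(1950, 6, 2)}

record = carry_over(previous, entered)
print(record)
print(check_record(record))
```

Only blank fields are filled, so anything the person typing does enter always takes precedence over the carried-over value.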

Page 11: Digitizing California Arthropod Collections
Page 12: Digitizing California Arthropod Collections

Notes from Nature

Citizen Science data transcription

Page 13: Digitizing California Arthropod Collections
Page 14: Digitizing California Arthropod Collections
Page 15: Digitizing California Arthropod Collections

Integrating OCR with crowd sourcing

o Spotting words within images
o Copy-paste, highlight-drag fields
o Auto-detecting repeated “words”
   o e.g. species, states, counties
o Providing an additional “vote” for transcription consensus
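One way to read the "additional vote" idea is as a majority vote in which the OCR output counts like one more transcriber. The sketch below assumes that reading; the normalization, the `min_share` threshold, and the sample strings are illustrative assumptions rather than the Notes from Nature consensus rule.

```python
from collections import Counter

def normalize(text):
    """Collapse whitespace and case so trivial differences do not split votes."""
    return " ".join(text.split()).lower()

def consensus(transcriptions, ocr_text=None, min_share=0.5):
    """Return the value backed by more than `min_share` of votes, else None."""
    votes = [normalize(t) for t in transcriptions if t and t.strip()]
    if ocr_text and ocr_text.strip():
        votes.append(normalize(ocr_text))  # OCR output counts as one more vote
    if not votes:
        return None
    value, count = Counter(votes).most_common(1)[0]
    return value if count / len(votes) > min_share else None

# Two volunteers disagree; the OCR reading breaks the tie.
print(consensus(["Alameda", "Alamoda"]))
print(consensus(["Alameda", "Alamoda"], ocr_text="ALAMEDA"))
```

Fields that return `None` have no clear majority and would need another transcription or a manual check.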

Page 16: Digitizing California Arthropod Collections

The OCR challenge for specimen labels

DETECTION:

• Finding text in a complex matrix
• Machine-typed vs. hand-written labels
• Sliding window classifier creating text bounding boxes
• >95% detection and localization using pixel-overlap measures
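The slide does not say which pixel-overlap measure was used. A common choice for scoring bounding boxes is intersection over union (IoU), sketched here as an assumption; the box format `(left, top, right, bottom)` and the 0.5 threshold are also assumptions.

```python
def box_area(box):
    """Area in pixels of a (left, top, right, bottom) box; 0 if degenerate."""
    left, top, right, bottom = box
    return max(0, right - left) * max(0, bottom - top)

def intersection_over_union(a, b):
    """Shared area of two boxes divided by the area they cover together."""
    overlap = (max(a[0], b[0]), max(a[1], b[1]),
               min(a[2], b[2]), min(a[3], b[3]))
    inter = box_area(overlap)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union else 0.0

def detection_rate(truth_boxes, detected_boxes, threshold=0.5):
    """Fraction of ground-truth boxes matched by any detection at >= threshold."""
    if not truth_boxes:
        return 0.0
    hits = sum(
        1 for truth in truth_boxes
        if any(intersection_over_union(truth, found) >= threshold
               for found in detected_boxes)
    )
    return hits / len(truth_boxes)

truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
detected = [(0, 0, 10, 9)]
print(detection_rate(truth, detected))
```

A stricter threshold demands tighter localization, so the reported rate depends on the threshold as much as on the detector.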

Page 17: Digitizing California Arthropod Collections

Current Progress in OCR recognition

RECOGNITION:

Using Tesseract OCR engine

Machine Type

• 74% word-level accuracy
• 82% character-level accuracy

Hand Writing

• 5.4% word-level accuracy
• 9.2% character-level accuracy
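Word-level and character-level accuracy differ because a single wrong character spoils the whole word. The slides do not define how accuracy was computed; the sketch below assumes one common definition, 1 minus edit distance divided by the length of the correct text, applied once to characters and once to words. The sample strings are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or word lists)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def accuracy(truth, ocr):
    """1 - (edit distance / length of truth), floored at 0."""
    if not truth:
        return 1.0 if not ocr else 0.0
    return max(0.0, 1.0 - edit_distance(truth, ocr) / len(truth))

def character_accuracy(truth_text, ocr_text):
    return accuracy(truth_text, ocr_text)

def word_accuracy(truth_text, ocr_text):
    return accuracy(truth_text.split(), ocr_text.split())

truth_text = "Alameda Co Berkeley"
ocr_text = "Alameda Co Berkelev"  # one misread character
print(round(character_accuracy(truth_text, ocr_text), 3))
print(round(word_accuracy(truth_text, ocr_text), 3))
```

In this example one misread character out of 19 costs one word out of three, which is the same pattern as the lower word-level figures reported above.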

Page 18: Digitizing California Arthropod Collections
Page 19: Digitizing California Arthropod Collections

Where do we go from here?

• Improved recognition of hand-writing
• Incorporate OCR into crowd sourcing
• Develop (semi-) automated data parsing

Page 20: Digitizing California Arthropod Collections

Thank you

http://calbug.berkeley.edu