Digitization Projects Tech Con 2006

Digitization Projects at the State Library of Pennsylvania: Where the Past and Future Meet. Bill Nork, Head of Systems & Preservation; William Fee, Digital Collections Librarian; Kurt Bodling, Digital Resources Cataloger. Pennsylvania Department of Education, State Library of Pennsylvania


Transcript of Digitization Projects Tech Con 2006

Page 1: Digitization Projects Tech Con 2006

Digitization Projects at the State Library of Pennsylvania: Where the Past and Future Meet

Bill Nork, Head of Systems & Preservation

William Fee, Digital Collections Librarian

Kurt Bodling, Digital Resources Cataloger

Pennsylvania Department of Education, State Library of Pennsylvania

Page 2: Digitization Projects Tech Con 2006

www.statelibrary.state.pa.us/digital_projects

Or

Visit the State Library of PA Website

www.statelibrary.state.pa.us

Select Digital Projects of the State Library

Page 3: Digitization Projects Tech Con 2006

Digitization things learned the hard way

Or why do I drink so much coffee?

By Bill Fee

• Try to plan things out as much as you can before starting a project.

• No matter how much you plan, something will blow up in your face.

• It’s often better to throw people at a problem than equipment (if they hit just right, this also counts as percussive maintenance).

• Loud, obnoxious and driving punk rock and techno really improve the workflow (though that could be just a personal preference).

Page 4: Digitization Projects Tech Con 2006

Hardware & Software

We run a Dell Optiplex GX260 with a 2.26 GHz non-hyperthreaded processor. Alas, we’re a PC shop.

Scanner-wise, we have a $25,000 Minolta PS 7000 overhead engine book scanner and an HP ScanJet 7400C that’s up for replacement.

Page 5: Digitization Projects Tech Con 2006

Hardware & Software- again

We scan directly into Photoshop. I can save the archival TIFF, then edit it and create the access JPEG right there. As a library you should be able to get an educational license, which is a heck of a lot cheaper. The program itself may seem more full-featured than you need, but things like batch processing, when you're applying the same edits and sizing to a whole directory of images, really save time. Get them to pay for classes, though- about $200 each, but well worth it.

Page 6: Digitization Projects Tech Con 2006

Still More Hardware & Software

We use OmniPage for OCR. You'll save yourself a heck of a lot of correction time by doing a dual scan- one into Photoshop, one directly into the OCR program, whichever you use. OmniPage has about 98 or 99 percent accuracy for anything but newspapers, but there are others just as good. Hit up ComputerShopper.com and read reviews.
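To put the 98 or 99 percent figure in perspective, here is a rough back-of-the-envelope sketch. It assumes the accuracy is measured per character, which the slide does not say, and the 2,000 characters per page is an illustrative number, not one from our projects.

```python
def expected_ocr_errors(chars_on_page: int, accuracy: float) -> int:
    """Estimate how many characters on one page will need hand correction.

    Assumes 'accuracy' is a per-character rate between 0 and 1.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return round(chars_on_page * (1.0 - accuracy))


# Illustrative page size only; real pages vary widely.
ASSUMED_CHARS_PER_PAGE = 2000

for accuracy in (0.98, 0.99):
    errors = expected_ocr_errors(ASSUMED_CHARS_PER_PAGE, accuracy)
    print(f"{accuracy:.0%} accurate: about {errors} characters to fix per page")
```

Even at the high end, a few hundred pages add up to thousands of corrections, which is why a clean scan straight into the OCR program pays off.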

If I'm doing a web page, I use the Composer feature in Mozilla or Netscape.

I’ve been using these programs and essentially the same hardware since the bad old pre-standards Dark Ages of 5 years ago, and they seem to work.

Page 7: Digitization Projects Tech Con 2006

What criteria do you use to have an item digitized?

• Must be PA related.

• Usually in such poor shape that it cannot circulate, or from the Rare Book Room, or ordered by the Director or Commissioner.

• Must have fewer than 5-10 holding libraries in FirstSearch (not counting us).

• Usually fits a theme- current is the VLaT project- Violence, Labor and Transportation = riots, train wrecks, mine accidents, etc.
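The criteria above can be read as two firm rules ("must") and two soft ones ("usually"). A minimal sketch of that screening logic follows. The class, field names, and the choice of 10 as the cutoff (the upper end of the "5-10" range) are assumptions made for illustration, not a tool we actually run.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed cutoff: the upper end of the "5-10 holding libraries" guideline.
MAX_OTHER_HOLDINGS = 10


@dataclass
class Candidate:
    """A hypothetical record for an item being considered for digitization."""
    title: str
    pa_related: bool
    other_holding_libraries: int  # FirstSearch holdings, not counting us
    cannot_circulate: bool = False
    rare_book_room: bool = False
    ordered_by_management: bool = False  # Director or Commissioner
    fits_current_theme: bool = False


def screen(item: Candidate) -> Tuple[bool, List[str]]:
    """Return (meets the firm rules, list of soft criteria the item also meets)."""
    eligible = item.pa_related and item.other_holding_libraries < MAX_OTHER_HOLDINGS
    soft = []
    if item.cannot_circulate or item.rare_book_room or item.ordered_by_management:
        soft.append("condition, Rare Book Room, or management request")
    if item.fits_current_theme:
        soft.append("fits current theme")
    return eligible, soft


# Made-up example item.
example = Candidate("Report on a mine accident", pa_related=True,
                    other_holding_libraries=3, cannot_circulate=True,
                    fits_current_theme=True)
print(screen(example))
```

Keeping the firm and soft rules separate mirrors the slide: an item that fails a "must" is out, while the "usually" items only help rank what is left.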

Page 8: Digitization Projects Tech Con 2006

Other problems you will find

• Bureaucracy

• Shipment

• File and folder nomenclature

• Poor scans and OCR

• Storage

• Personnel

• “High-priority” projects

• New software, new uses for software, new problems with software that only come up because it’s a new project.

Page 9: Digitization Projects Tech Con 2006

Metadata Considerations

Kurt A.T. Bodling

Digital Resources Cataloger

State Library of Pennsylvania

Page 10: Digitization Projects Tech Con 2006

The Starting Place

• What is the digital object?

– Something newly created?
– Already cataloged?
– A collection?
– A single item?
– A selection from an item?

• Who is it for?

Page 11: Digitization Projects Tech Con 2006
Page 12: Digitization Projects Tech Con 2006
Page 13: Digitization Projects Tech Con 2006

Ben Franklin solutions

• Easy call: siphon data from OPAC

• Tougher: dealing with chapters and single letters

Page 14: Digitization Projects Tech Con 2006
Page 15: Digitization Projects Tech Con 2006

General solution to obit challenges

• Sampling and testing

• Hunting down exceptions

• Creating a data dictionary

• And, of course, going back later to make changes

Page 16: Digitization Projects Tech Con 2006

Data Dictionary defined

MARC : AACR2 :: Dublin Core : Data Dictionary

Page 17: Digitization Projects Tech Con 2006

Data Dictionary for the Pennsylvania Scrap Book Necrology collection

Label in CONTENTdm | Dublin Core Mapping | Content Description and Instructions

Title | Title | Pennsylvania Scrap Book Necrology, Volume ##, p. ##. Metadata crew replaces the first ## with the actual volume numbers (Arabic, not Roman) in the template before loading images of each volume. Add page numbers for each page as part of uploading process.

Creator | Creator | State Library of Pennsylvania

Surname(s) included | Subject | Metadata crew enters the surnames of the deceased individuals on each page. Separate surnames with commas.

Description | Description | Microfilmed scrapbooks of obituaries clipped from Pennsylvania newspapers from 16 October 1891 to 3 March 1904. Many Civil War veterans included.

Publisher | Publisher | State Library of Pennsylvania

Contributor | Contributor |

Date | Date | Metadata crew enters the year(s) of obituaries included in each volume.

Type | Type | text

Format | Format | image/jpeg

Identifier | Identifier |

Source | Source | PHAK 929.3748 P384mi

Language | Language | eng

Relation | Relation |

Coverage | Coverage |

Rights | Rights | Digital images copyright State Library of Pennsylvania. All rights reserved. May be used for educational purposes as long as a credit statement is included. For all other uses, contact the State Library of Pennsylvania, Digital Rights Office, 333 Market Street, Harrisburg, PA 17126-1745. Phone: (717) 783-5969

Audience | Audience |

Transcripts | None | This field is a full-text searchable field into which the OCR for each page will be loaded. It will not be viewable by users, only searchable. The uploading process, if followed correctly, should do this automatically.
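The template idea in the data dictionary can be sketched in a few lines. This is only an illustration using a plain Python dictionary; it is not CONTENTdm's own template or import format, and the function name, the {volume}/{page} placeholders standing in for the ## marks, and the sample surnames are all made up for the example.

```python
# Fixed values copied from the data dictionary; the per-page fields are filled in below.
TEMPLATE = {
    "Title": "Pennsylvania Scrap Book Necrology, Volume {volume}, p. {page}",
    "Creator": "State Library of Pennsylvania",
    "Publisher": "State Library of Pennsylvania",
    "Type": "text",
    "Format": "image/jpeg",
    "Source": "PHAK 929.3748 P384mi",
    "Language": "eng",
}


def build_record(volume: int, page: str, surnames: list, years: str) -> dict:
    """Fill the repeated template with the values that change from page to page."""
    record = dict(TEMPLATE)  # copy, so the shared template is never modified
    record["Title"] = TEMPLATE["Title"].format(volume=volume, page=page)
    # The dictionary says to separate surnames with commas; the space is our choice.
    record["Surname(s) included"] = ", ".join(surnames)
    record["Date"] = years
    return record


# Hypothetical page: volume 3, page 17, two spellings of one surname.
sample = build_record(3, "17", ["Snyder", "Snider"], "1893")
print(sample["Title"])
print(sample["Surname(s) included"])
```

The point is that the crew only types what actually varies: volume, page, surnames, and year. Everything else rides along from the template.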

Page 18: Digitization Projects Tech Con 2006
Page 19: Digitization Projects Tech Con 2006

Creating the data dictionary

• Simple issues first:

– Steal data from the catalog
– Use boilerplate ‘rights management’ statement
– Get repeated data into a template

Page 20: Digitization Projects Tech Con 2006

Creating the data dictionary

• More difficult challenges

– Names of the deceased
– Citation to original source newspapers
– Omissions
– Enhancements
– Difficulties caused by original scrapbooking

Page 21: Digitization Projects Tech Con 2006

Names of the deceased

• Not authority controlled

• Variations between two obit versions

• Variations within one obit

• Lacking first name

Page 22: Digitization Projects Tech Con 2006

Name variations:

Page 23: Digitization Projects Tech Con 2006

Anonymous child:

Page 24: Digitization Projects Tech Con 2006

Names of the deceased

• Solutions:

– Enter only surname, but
– Enter all spellings that appear
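The "surname only, but every spelling" rule can be sketched as a small clean-up step before the names go into the metadata field. The function name and sample names are hypothetical, and treating spellings that differ only by capitalization or stray spaces as duplicates is an assumption, not a rule from our data dictionary.

```python
def surname_entries(spellings_seen: list) -> list:
    """Keep every distinct surname spelling, in the order first seen on the page."""
    seen = set()
    entries = []
    for raw in spellings_seen:
        name = raw.strip()
        key = name.casefold()  # assumed: case differences are not new spellings
        if name and key not in seen:
            seen.add(key)
            entries.append(name)
    return entries


# Hypothetical obituary that spells the same surname two ways.
print(", ".join(surname_entries(["Snyder", "Snider", "Snyder"])))
```

Because the names are not authority controlled, keeping all spellings is what lets a searcher find the page no matter which version they know.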

Page 25: Digitization Projects Tech Con 2006

Citations to original sources

• Visible on microfilm, but NOT in JPEG

• Easily recoverable

Page 26: Digitization Projects Tech Con 2006

Citations to original sources

• Solution:

– Leave this information out of metadata

Page 27: Digitization Projects Tech Con 2006

Omissions

• Blank pages

• Pages glued together

• Military unit information

Page 28: Digitization Projects Tech Con 2006

Military unit info:

Page 29: Digitization Projects Tech Con 2006

Omissions

• Solutions:

– Record page numbers as they appear
– Note when pages don’t appear
– Omit unit information
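Noting which pages do not appear is easy to check once the page numbers have been recorded as they appear. A minimal sketch, assuming the recorded page numbers in a volume are plain integers (the function name is made up for this example):

```python
def missing_pages(pages_seen: list) -> list:
    """List page numbers that fall inside the recorded range but never appear."""
    present = set(pages_seen)
    if not present:
        return []
    return [n for n in range(min(present), max(present) + 1) if n not in present]


# Hypothetical volume where pages 4 and 5 were glued together and never filmed.
print(missing_pages([1, 2, 3, 6, 7]))
```

The list only tells you where to look. Someone still has to go back to the film to see whether the gap is a blank page, glued pages, or a skipped scan.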

Page 30: Digitization Projects Tech Con 2006

Enhancements

• Geographic info

• Occupational info

• Marital status

• And on and on and on.

Page 31: Digitization Projects Tech Con 2006
Page 32: Digitization Projects Tech Con 2006
Page 33: Digitization Projects Tech Con 2006

Enhancements

• Solutions:

– Forgo most enrichment
– Include “former slave”
– Include some terms like “suicide” and “murder”

Page 34: Digitization Projects Tech Con 2006

Scrapbook difficulties

• Running on to second page

• Running on to 3rd, 4th, 5th … pages

Page 35: Digitization Projects Tech Con 2006

Multiple page obit:

Page 36: Digitization Projects Tech Con 2006

Scrapbook difficulties

Repeated obituaries

Page 37: Digitization Projects Tech Con 2006

Scrapbook difficulties

Label at bottom of page, obit on next

Page 38: Digitization Projects Tech Con 2006

Text and title split:

Page 39: Digitization Projects Tech Con 2006

Scrapbook difficulties

• Year-end cumulative death notice

• Articles that were not obits at all

• Volumes containing two years

Page 40: Digitization Projects Tech Con 2006

Cumulative notice:

Page 41: Digitization Projects Tech Con 2006

Not an obit:

Page 42: Digitization Projects Tech Con 2006

My Lessons Learned

• Metadata isn’t (aren’t?) scary

• Patience and perseverance win out

• Small crew = quick decisions

Page 43: Digitization Projects Tech Con 2006

What Did We Learn?

More man-hours than we thought

More staffing to complete task

Decisions about how deep to go with metadata

Page 44: Digitization Projects Tech Con 2006

Questions?

Call or email one of us

Bill Fee 717-783-7014 [email protected]

Kurt Bodling 717-783-5996 [email protected]

Bill Nork [email protected]