Automated metadata creation - Possibilities and pitfalls

Automated Metadata Creation: Possibilities and Pitfalls. Presented by Wilhelmina Randtke, June 10, 2012, Nashville, Tennessee, at the annual meeting of the North American Serials Interest Group. Materials posted at www.randtke.com/presentations/NASIG.html

description

This program presents an overview of automated indexing and automated metadata creation, then discusses a project completed last summer at the Florida State University Law Research Center (formerly Law Library), which used computer-created metadata to index individual pages of a looseleaf resource. Internet search engines rely on machine-created metadata almost exclusively, and some library projects and database companies also use automated indexing. The program will highlight an index and search designed to retrieve pages from a looseleaf resource as each page appeared on a specific date over a 20-year period. This search is located at www.fsulawrc.com. The project was indexed using scripting to extract most metadata; staff then completed missing metadata fields and audited for errors. I will present on the cost-effectiveness of automated metadata creation, given error rates and costs for human- and machine-produced metadata, and give an overall assessment of its potential for digital library projects. The goal is to help catalogers know what is possible, what is difficult, and what is easy when using techniques for automated metadata creation. Presenter: Wilhelmina Randtke, Florida State University Libraries - Law Research Center

Transcript of Automated metadata creation - Possibilities and pitfalls

Page 1: Automated metadata creation - Possibilities and pitfalls

Automated Metadata Creation: Possibilities and Pitfalls

Presented by Wilhelmina Randtke

June 10, 2012

Nashville, Tennessee

At the annual meeting of the North American Serials Interest Group.

Materials posted at www.randtke.com/presentations/NASIG.html

Page 2: Automated metadata creation - Possibilities and pitfalls

Teaser: Preview of the sample project.

http://www.fsulawrc.com

Page 3: Automated metadata creation - Possibilities and pitfalls

Background: What is “metadata”?

Metadata = any indexing information

Examples:

MARC records

color, size, etc. to allow clothes shopping on a website

writing on the spine of a book

food labels

Page 4: Automated metadata creation - Possibilities and pitfalls

What we'll cover

• Automated indexing: Human vs machine indexing

• Range of tools for automated metadata creation: Techy and less techy.

• Sample projects

• A little background on relational databases

• Database design for a looseleaf (a resource that changes state over time).

• Sample project: The Florida Administrative Code 1970-1983

Page 5: Automated metadata creation - Possibilities and pitfalls

Automated Indexing: What’s easy for computers?

Computers like black and white decisions.

Computers are bad with discretion.

Page 6: Automated metadata creation - Possibilities and pitfalls

Word search vs. Subject headings

Page 7: Automated metadata creation - Possibilities and pitfalls

One Trillion

1,000,000,000,000

webpages indexed in Google

… 4 years ago …

Page 8: Automated metadata creation - Possibilities and pitfalls

Nevertheless…

… Human indexing is alive and well

Page 9: Automated metadata creation - Possibilities and pitfalls

How to fund indexing?

Page 10: Automated metadata creation - Possibilities and pitfalls

http://www.ebay.com/sch/Dresses-/63861/i.html?_nkw=summer+dress

Page 11: Automated metadata creation - Possibilities and pitfalls

How to fund indexing?

Page 12: Automated metadata creation - Possibilities and pitfalls

How to fund indexing?

Page 13: Automated metadata creation - Possibilities and pitfalls

Who made the metadata: Human or Machine?

How GoogleBooks gets its metadata: http://go-to-hellman.blogspot.com/2010/01/google-exposes-book-metadata-privates.html

Page 14: Automated metadata creation - Possibilities and pitfalls

Not automated indexing, but a related concept….

Always try to think about

how to reuse existing metadata.

Page 15: Automated metadata creation - Possibilities and pitfalls

High Tech automated metadata creation

Page 16: Automated metadata creation - Possibilities and pitfalls

The high end: Assigning subject headings with computer code

Some technologies:

• UIMA (Unstructured Information Management Architecture)

• GATE (General Architecture for Text Engineering)

• KEA (Keyphrase Extraction Algorithm)

Page 17: Automated metadata creation - Possibilities and pitfalls

Computer Program for Automated Indexing

Inputs: Item; Ontology or Thesaurus

Person’s role: Select an appropriate ontology. Configure the program so that it’s looking at outside sources. Review the results and make sure the assigned subject headings are good.

Program’s role: Take the ontology or thesaurus and apply it to each item to give subject headings.

Output: Subject Headings

Page 18: Automated metadata creation - Possibilities and pitfalls

http://www.nzdl.org/Kea/examples1.html

Page 19: Automated metadata creation - Possibilities and pitfalls

The lower end: Deterministic fields

Page 20: Automated metadata creation - Possibilities and pitfalls
Page 21: Automated metadata creation - Possibilities and pitfalls
Page 22: Automated metadata creation - Possibilities and pitfalls

There’s an app for that

Scripts for extracting fields from a thesis posted on GitHub: https://github.com/ao5357/thesisbot
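
As an illustration of what a "deterministic field" can look like in code, here is a minimal Python sketch. It is not the code from the GitHub repository above; the patterns, the function name extract_fields, and the sample title page are all assumptions made up for this example. It pulls a year and a "by ..." author line from plain text using fixed rules only.

```python
import re
from typing import Dict, Optional

# Hypothetical patterns: a four-digit year, and a line that starts with "by".
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")
BYLINE = re.compile(r"^\s*by\s+(.+?)\s*$", re.IGNORECASE | re.MULTILINE)


def extract_fields(title_page: str) -> Dict[str, Optional[str]]:
    """Return the first year and the first 'by ...' line, or None if absent."""
    year = YEAR.search(title_page)
    byline = BYLINE.search(title_page)
    return {
        "year": year.group(0) if year else None,
        "author": byline.group(1) if byline else None,
    }


# Made-up title page text, standing in for OCR output.
sample = (
    "A STUDY OF LOOSELEAF SERVICES\n\n"
    "by Jane Q. Example\n\n"
    "A thesis submitted to the Graduate School\n"
    "May 2011\n"
)
fields = extract_fields(sample)
```

A field that the rules cannot find comes back as None, which marks it for a person to fill in later rather than guessing.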

Page 23: Automated metadata creation - Possibilities and pitfalls
Page 24: Automated metadata creation - Possibilities and pitfalls
Page 25: Automated metadata creation - Possibilities and pitfalls
Page 26: Automated metadata creation - Possibilities and pitfalls
Page 27: Automated metadata creation - Possibilities and pitfalls
Page 28: Automated metadata creation - Possibilities and pitfalls
Page 29: Automated metadata creation - Possibilities and pitfalls
Page 30: Automated metadata creation - Possibilities and pitfalls
Page 31: Automated metadata creation - Possibilities and pitfalls

Batch OCR

Page 32: Automated metadata creation - Possibilities and pitfalls
Page 33: Automated metadata creation - Possibilities and pitfalls
Page 34: Automated metadata creation - Possibilities and pitfalls
Page 35: Automated metadata creation - Possibilities and pitfalls
Page 36: Automated metadata creation - Possibilities and pitfalls
Page 37: Automated metadata creation - Possibilities and pitfalls
Page 38: Automated metadata creation - Possibilities and pitfalls
Page 39: Automated metadata creation - Possibilities and pitfalls
Page 40: Automated metadata creation - Possibilities and pitfalls

Many tools exist to extract text from PDFs to Excel

Page 41: Automated metadata creation - Possibilities and pitfalls
Page 42: Automated metadata creation - Possibilities and pitfalls
Page 43: Automated metadata creation - Possibilities and pitfalls

Walkthrough – examining the extracted spreadsheets

http://fsulawrc.com/excelVBAfiles/index.html

Page 44: Automated metadata creation - Possibilities and pitfalls

How to plan the program

• Look for patterns

• Write step-by-step instructions about how to process the Excel file

• Remember, NO DISCRETION, computers do not take well to discretion.

• Good steps:

  • Go to the last line of the worksheet

  • Look for the letter a or A

  • Copy starting from the first number in the cell, up to and including the last number in the cell.

• Bad steps:

  • Find the author’s name (this step needs to be broken into small “stupid” steps)
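
The "good steps" above can be sketched in code. The project's own script is in Excel VBA; the Python below is only an illustration, and it assumes the three steps run as one sequence on a list of cell texts from one worksheet column. The function names are hypothetical.

```python
import re
from typing import Optional, Sequence


def last_nonblank(cells: Sequence[str]) -> Optional[str]:
    """Step 1: go to the last line of the worksheet that has any text."""
    for cell in reversed(cells):
        if cell.strip():
            return cell
    return None


def number_span(cell: str) -> Optional[str]:
    """Step 3: copy from the first digit up to and including the last digit."""
    match = re.search(r"\d(?:.*\d)?", cell)
    return match.group(0) if match else None


def extract_from_last_line(cells: Sequence[str]) -> Optional[str]:
    cell = last_nonblank(cells)
    # Step 2: look for the letter a or A; give up if it is not there.
    if cell is None or "a" not in cell.lower():
        return None
    return number_span(cell)


# Made-up worksheet column, standing in for text extracted from one page.
column = ["RULES OF THE DEPARTMENT", "Chapter 6A-1.01", ""]
result = extract_from_last_line(column)
```

Every step is a yes-or-no test or a mechanical copy. Nothing asks the program to decide what "looks like" a chapter number.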

Page 45: Automated metadata creation - Possibilities and pitfalls

Writing the program

• Identify appropriate advisors.

• Remember, most IT staff on a campus just install computers in offices, etc. Programming and database planning are rare skills. The worst IT personnel will not realize that they do not have these skills.

• If an IT staff member tells you they do not know how to do something, then go back to that person for advice on all future projects.

• Try to find entry-level material on coding.

• (Sadly, most computer programming instructions already assume you know some programming.)

• If outsourcing or collaborating, remember that the index is the ultimate goal. An understanding of the index needs to be in the picture, and you will probably have to be the one to bring it.

Page 46: Automated metadata creation - Possibilities and pitfalls

Finding Advisors: Most campus IT is about carrying heavy objects

Page 47: Automated metadata creation - Possibilities and pitfalls

Finding Advisors: Most campus IT is about carrying heavy objects

Page 48: Automated metadata creation - Possibilities and pitfalls

Perfection?

How close to perfection can you get?

Let’s run some code:

A spreadsheet with extracted text: http://fsulawrc.com/excelVBAfiles/23batch6A.xls

Visual Basic script: http://fsulawrc.com/excelVBAfiles/VBAscriptForFAC.docx

The files: You can retrieve some of these same files by searching 6A-1 in the main search for the database at www.fsulawrc.com

Page 49: Automated metadata creation - Possibilities and pitfalls

How much metadata was missing?

Field | Number of empty fields (27,992 fields total, after preliminary removal of blank pages) | Percent of field filled
Chapt. no. before dash | 183 | 99.3%
Chapt. no. after dash | 2179 | 92.2%
Page no. | 1766 | 93.6%
Supp. no. (i.e., date page went into the looseleaf) | 3242 | 88.4%
Replacing supplement (i.e., date page was removed from the looseleaf) | All (however, 105 fields were entered manually in order to demonstrate the interface and get funding for manual metadata creation) | 0%
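
The percentages above follow from the empty-field counts and the 27,992 total. A small Python sketch of the arithmetic, assuming the figures were cut off at one decimal place rather than rounded (which is what reproduces 93.6% for the page number field); the function name is hypothetical:

```python
def percent_filled(empty: int, total: int) -> str:
    """Percent of fields that are not empty, truncated to one decimal place."""
    if total <= 0 or not 0 <= empty <= total:
        raise ValueError("empty must be between 0 and total, and total positive")
    # Work in whole tenths of a percent to avoid floating-point surprises.
    tenths = (total - empty) * 1000 // total
    return f"{tenths // 10}.{tenths % 10}%"


TOTAL_FIELDS = 27992
empty_counts = {
    "Chapt. no. before dash": 183,
    "Chapt. no. after dash": 2179,
    "Page no.": 1766,
    "Supp. no.": 3242,
}
filled = {field: percent_filled(n, TOTAL_FIELDS) for field, n in empty_counts.items()}
```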

Page 50: Automated metadata creation - Possibilities and pitfalls

Cheap and fast and incomplete

This is a search engine built on an index of the automated metadata only:

http://fsulawrc.com/automatedindex.php

It’s better than a shuffled pile of 30,000 pages.

It’s not very good.

If you are thousands of miles away, then this is better than print. If you are in the same room as organized print, print might be better.

Page 51: Automated metadata creation - Possibilities and pitfalls

Filling in the gaps

Code helps speed up the workflow, but filling in the gaps is still time-consuming.

http://fsulawrc.com/phptest/chaptbeforedashfill.php

This is editing a copy of the automated metadata database. You can enter as much as you like, and not break anything.

Page 52: Automated metadata creation - Possibilities and pitfalls

Last step: Auditing for missing pages, by comparing against the instruction sheets that went out with supplements

www.fsulawrc.com/supplementinstructionsheets.pdf
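
In the project this audit was done by staff reading the instruction sheets. As a sketch of the underlying comparison only, assuming the pages each instruction sheet says to insert have been typed into one lookup and the pages in the index into another, a set difference finds the gaps. The data layout and all values below are made up for illustration.

```python
from typing import Dict, List, Set, Tuple

PageKey = Tuple[str, int]  # (chapter, page number), a hypothetical key


def audit(
    expected: Dict[int, Set[PageKey]],
    indexed: Dict[int, Set[PageKey]],
) -> Dict[int, Dict[str, List[PageKey]]]:
    """Per supplement, list pages that are missing from or extra in the index."""
    report = {}
    for supp_no in sorted(set(expected) | set(indexed)):
        want = expected.get(supp_no, set())
        have = indexed.get(supp_no, set())
        missing = sorted(want - have)
        unexpected = sorted(have - want)
        if missing or unexpected:
            report[supp_no] = {"missing": missing, "unexpected": unexpected}
    return report


# Made-up example: supplement 23 should have added three pages.
expected = {23: {("6A-1", 1), ("6A-1", 2), ("6A-1", 3)}}
indexed = {23: {("6A-1", 1), ("6A-1", 3), ("6A-2", 1)}}
report = audit(expected, indexed)
```

A supplement that matches its instruction sheet exactly does not appear in the report, so a person only has to look at the discrepancies.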

Page 53: Automated metadata creation - Possibilities and pitfalls

Task | Hours spent | Category of work
Inspecting looseleaf and planning a database | 20 hours (high skill, high training) | Database work
Digitization with sheetfed scanner | 35 hours (low skill, low training) | Digitization
Planning the code for automated indexing | 20 hours (high skill, high training) | Database work
Coding for the automated indexing | 35 hours (would be faster for someone with a programming background) | Automated metadata
Running script, and cleaning up metadata | 35 hours (skilled staff) | Automated metadata
Loading database and metadata on a server | 10 hours (would be about twice as fast for someone with more database design experience) | Database work
Coding online forms to speed data entry | 15 hours (skilled staff) | Manual metadata
Training on documents and database design | 15 hours (unskilled staff, but done before the student assistant got set up with computer forms and permissions) | Manual metadata
Metadata entry for fields the computer didn’t get | 98.25 hours (unskilled staff) | Manual metadata
Auditing the database against instruction sheets which went out with supplements | 342.75 hours (skilled staff; includes training time for student assistant) | Auditing
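
Adding up the hours above by category gives 626 hours in total, with auditing alone taking a little over half. A short Python sketch of that tally, using only the task names, hours, and categories listed above (the function name is hypothetical):

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

TASKS = [
    ("Inspecting looseleaf and planning a database", 20.0, "Database work"),
    ("Digitization with sheetfed scanner", 35.0, "Digitization"),
    ("Planning the code for automated indexing", 20.0, "Database work"),
    ("Coding for the automated indexing", 35.0, "Automated metadata"),
    ("Running script, and cleaning up metadata", 35.0, "Automated metadata"),
    ("Loading database and metadata on a server", 10.0, "Database work"),
    ("Coding online forms to speed data entry", 15.0, "Manual metadata"),
    ("Training on documents and database design", 15.0, "Manual metadata"),
    ("Metadata entry for fields the computer didn't get", 98.25, "Manual metadata"),
    ("Auditing the database against instruction sheets", 342.75, "Auditing"),
]


def hours_by_category(tasks: Iterable[Tuple[str, float, str]]) -> Dict[str, float]:
    """Sum the hours for each category of work."""
    totals = defaultdict(float)
    for _task, hours, category in tasks:
        totals[category] += hours
    return dict(totals)


totals = hours_by_category(TASKS)
grand_total = sum(totals.values())

# Print categories from most to least time, with each one's share of the total.
for category, hours in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{category:<20} {hours:>7.2f} h  {hours / grand_total:6.1%}")
```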

Page 54: Automated metadata creation - Possibilities and pitfalls

Where did the time go?

[Chart: Tasks and Hours, broken down by category: Database Work, Digitization, Auditing, Manual Metadata Creation, Automated Metadata Creation]

Page 55: Automated metadata creation - Possibilities and pitfalls

Error rates

Automated metadata for Supplement Number: 2.4%

Human metadata for Supplement Number: 0.8%

Automated metadata for Page Number:

• with systematic error: 1.0%

• with the systematic error removed: 0.3%

Human metadata for Page Number: 3.1%

Error rates for the thesis indexer on GitHub: 5% - 6%
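
The slides do not say how these rates were calculated. One common way, shown here only as an assumption, is to compare each extracted value against a value a person has verified and divide the number of mismatches by the number of fields checked. The function name and the sample values are made up.

```python
from typing import Sequence


def error_rate(extracted: Sequence[str], verified: Sequence[str]) -> float:
    """Fraction of fields where the extracted value differs from the verified one."""
    if len(extracted) != len(verified):
        raise ValueError("both lists must cover the same fields")
    if not verified:
        raise ValueError("need at least one field to compare")
    errors = sum(1 for got, want in zip(extracted, verified) if got != want)
    return errors / len(verified)


# Made-up sample: one of four page numbers was read incorrectly.
rate = error_rate(["1", "2", "3", "4"], ["1", "2", "8", "4"])
```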

Page 56: Automated metadata creation - Possibilities and pitfalls

Do error rates matter?

For the computer's error rates, the figure might partly be measuring OCR errors rather than errors in the extraction script.

Most metadata will be words, not numbers.

• Words are easier for a computer to pull out. Misspellings are obvious when reviewing output.

• Words are easier for a person to pull out. Less fatigue.

Page 57: Automated metadata creation - Possibilities and pitfalls

Recommendations

• For practitioners:

• Consider automating a process. Is it possible to index this without human involvement?

• Understand what IT support is available. Support can be someone who picks the appropriate tool, then you apply it.

• For administrators:

• Allow work time for this type of experimentation.

Page 58: Automated metadata creation - Possibilities and pitfalls

Good resources to get started

• A-PDF to Excel Extractor

• A program that takes text from PDFs and puts it in Excel.

• www.a-pdf.com/to-excel/download.htm

• This is an easy start to get source material into a format you can work with.

• Excel Visual Basic (VBA) Tutorials by Pan Pantziarka

• Almost all training material on coding assumes you already know how to code. These tutorials are good, because they assume you do not already know something.

• www.techbookreport.com/tutorials/excel_vba1.html

• For more advanced instructions, use a search engine to read message boards.

Page 59: Automated metadata creation - Possibilities and pitfalls

Good resources to get started

• eHow instructions on how to turn on the Developer Ribbon in Excel 2007

• http://www.ehow.com/how_7175501_turn-developer-tab-excel-2007.html

(use these same instructions for Excel 2010; older versions of Excel have the developer ribbon turned on by default)

• This shows you how to get to the tab where you can do simple coding.

• How to Build a Search Engine

• http://www.udacity.com/overview/Course/cs101/CourseRev/apr2012

• Takes you through how webcrawlers work, using the programming language Python. (A website is a string of text only, nothing more, so these concepts are similar to metadata extraction.)

• This course is good because it doesn’t assume that you already know how to code.

Page 60: Automated metadata creation - Possibilities and pitfalls

Good resources to get started

• Wikipedia section on string processing algorithms.

• http://en.wikipedia.org/wiki/String_%28computer_science%29#String_processing_algorithms

• These six links go to lists of all the things you can do to strings. (Remember, a string is a string of letters – it’s what you will be working with.)

• Use the terminology from here to know what term of art to put into a search engine, so that you can find instructions on how to do that operation in whatever programming language you choose.

• Wikipedia page on relational databases

• http://en.wikipedia.org/wiki/Relational_database

• It will be useful for you to understand primary keys, foreign keys, and tables referencing each other.
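
To make primary keys, foreign keys, and tables referencing each other concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is not the schema of the project database; the table names, column names, and dates are assumptions made up to show how a looseleaf page could point to the supplement that added it and the supplement that replaced it, so that a search can ask what was in force on a given date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE supplement (
        supp_no INTEGER PRIMARY KEY,      -- primary key: one row per supplement
        issued  TEXT NOT NULL             -- ISO date, e.g. 1975-06-01
    );
    CREATE TABLE page (
        page_id     INTEGER PRIMARY KEY,
        chapter     TEXT NOT NULL,
        page_no     INTEGER NOT NULL,
        supp_no     INTEGER NOT NULL REFERENCES supplement(supp_no),  -- foreign key
        replaced_by INTEGER REFERENCES supplement(supp_no)            -- NULL if never replaced
    );
    """
)
# Made-up data: page 1 of chapter 6A-1 arrives in supplement 1, is replaced in supplement 2.
conn.executemany("INSERT INTO supplement VALUES (?, ?)",
                 [(1, "1970-01-01"), (2, "1975-06-01")])
conn.executemany("INSERT INTO page VALUES (?, ?, ?, ?, ?)",
                 [(1, "6A-1", 1, 1, 2), (2, "6A-1", 1, 2, None)])


def pages_as_of(conn: sqlite3.Connection, chapter: str, date: str) -> list:
    """Return page_ids of the pages of a chapter that were in force on a date."""
    sql = """
        SELECT p.page_id
        FROM page AS p
        JOIN supplement AS s_in ON s_in.supp_no = p.supp_no
        LEFT JOIN supplement AS s_out ON s_out.supp_no = p.replaced_by
        WHERE p.chapter = ?
          AND s_in.issued <= ?
          AND (s_out.issued IS NULL OR s_out.issued > ?)
        ORDER BY p.page_no, p.page_id
    """
    return [row[0] for row in conn.execute(sql, (chapter, date, date))]
```

Calling pages_as_of(conn, "6A-1", "1972-03-15") returns the page added by the first supplement, and a date after the second supplement returns its replacement. The dates are stored as ISO text so that plain string comparison sorts them correctly.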

Page 61: Automated metadata creation - Possibilities and pitfalls

Page 62: Automated metadata creation - Possibilities and pitfalls

Special thanks to:

Jason Cronk

Anna Annino

Page 63: Automated metadata creation - Possibilities and pitfalls
