
Improving Metadata Quality: An Iterative Approach Using Scrum

Background

A primary challenge when developing a large-scale digital library is balancing metadata quantity and quality. The Theological Commons is a digital library (built on MarkLogic Server) with ~50,000 books (~16,000,000 pages). Since its release in March 2012, we have been iteratively improving its metadata quality using the Scrum process.

Methods

Scrum is a form of agile project management, which works in fixed iterations (“sprints”) to develop projects and features.

As problems in metadata are identified, team members add them to the product backlog. The problems are described in story form—i.e. from the perspective of end users rather than technologists. The product owner takes responsibility for ordering these stories from most to least significant.

Generally, our team tackles metadata problems using some combination of computational methods and hand editing, with a preference for the former, since a computational fix scales across the entire corpus. We seek to pass through all the data during a single sprint.
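For instance, a mechanical fix can often be expressed as a short XQuery update that touches every record in one pass. This is a minimal sketch, assuming a "books" collection and a title element that may not match our actual schema:

xquery version "1.0-ml";
(: One corpus-wide pass: trim stray whitespace in titles.
   The "books" collection and the title element are illustrative. :)
for $title in fn:collection("books")//title
let $clean := fn:normalize-space($title)
where fn:string($title) ne $clean
return xdmp:node-replace($title, <title>{$clean}</title>)
(: For very large corpora, batch the updates (e.g., via xdmp:spawn)
   rather than running one giant transaction. :)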

At the end of the sprint, we review our work with the entire library staff. It’s not good enough to say that we’ve improved the metadata quality. We aim to demonstrate a new feature that exemplifies the improved quality of the metadata.

Questions

Here are some questions our team is thinking about:

1. Should we develop a test suite for our metadata to flag obvious errors (beyond validation)? (A minimal sketch follows this list.)
2. Are there reliable natural language processing toolkits to assist with improving automatically generated metadata (e.g., OCR output)?
3. How can we best frame user expectations when dealing with metadata deficiencies?
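As a first pass at question 1, such a test suite could be a plain XQuery script asserting invariants that no schema can express. A hedged sketch, assuming a "books" collection and title/date element names that are not necessarily those of the Theological Commons:

xquery version "1.0-ml";
(: Flag records that validate but are still suspect. :)
for $doc in fn:collection("books")
let $flags := (
  if (fn:normalize-space(fn:string(($doc//title)[1])) eq "")
  then "empty or missing title" else (),
  for $d in $doc//date[fn:not(fn:matches(., "^\d{4}$"))]
  return fn:concat("unexpected date format: ", $d),
  for $d in $doc//date[fn:matches(., "^\d{4}$")]
              [xs:integer(.) gt fn:year-from-date(fn:current-date())]
  return fn:concat("date in the future: ", $d)
)
where fn:exists($flags)
return
  <flagged uri="{xdmp:node-uri($doc)}">{
    for $f in $flags return <why>{$f}</why>
  }</flagged>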

Clifford B. Anderson
Curator of Special Collections
Princeton Theological Seminary Library
Princeton, NJ (USA)

Thanks to the members of our digital team: Cortney Frank, Greg Murray, Donna Quick, and Chris Schwartz.

Conclusion

The digital team approaches the problem of quality control in four stages:

• Identifying metadata problems and lacunae
• Prioritizing the most important stories and sending the rest to the product backlog
• Improving the product by implementing the metadata story during a single sprint, if possible
• Assessing the outcome and continuing to identify new metadata problems and lacunae

An iterative approach allows us to regularly improve metadata quality while continuously reevaluating our metadata priorities in response to stakeholders’ needs.


Results

Here are some lessons we’ve learned as a team:

The deeper you dig, the more you will find to improve. The quality improvement process is infinite.

Aim shallow rather than deep so that you cover all documents in a single sprint.

Keep a sharp focus—scope creep also affects metadata cleanup.

When merging metadata from different sources, watch out for inconsistencies (even when all validate against the same schema).
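For example (a hypothetical sketch; the element names and the two sources are illustrative): one source may write dates as "1872" and another as "1872-01-01", both schema-valid, so only a reconciliation pass catches the mismatch:

xquery version "1.0-ml";
declare function local:year($d as xs:string) as xs:string {
  (: Pull the first four-digit year out of whatever form a source uses. :)
  if (fn:matches($d, "\d{4}"))
  then fn:replace($d, "^.*?(\d{4}).*$", "$1")
  else $d
};
for $doc in fn:collection("books")
let $a := fn:string(($doc//marc-date)[1])  (: date from source A, e.g. "1872" :)
let $b := fn:string(($doc//scan-date)[1])  (: date from source B, e.g. "1872-01-01" :)
where $a ne "" and $b ne "" and local:year($a) ne local:year($b)
return xdmp:node-uri($doc)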

Don’t be parsimonious—if you think you might need some external metadata later, just build it into your document with a different namespace.
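A minimal sketch of what this looks like in practice, assuming an invented namespace URI, document URI, and element names; none of these are the Theological Commons schema:

xquery version "1.0-ml";
declare namespace ext = "http://example.org/ns/external-metadata";
(: Stash an externally sourced identifier under its own namespace so it
   can never collide with native elements, even as the schema evolves. :)
let $book := fn:doc("/books/example-book.xml")/book
return xdmp:node-insert-child(
  $book,
  <ext:authority-id source="VIAF">viaf-id-goes-here</ext:authority-id>
)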

Visualizing metadata problems can help you to understand them more intuitively (see Fig. 4).

Aim to increase the percentage of computational approaches (hurray for XQuery!) with every sprint.

Set user expectations with respect to metadata flaws (see Fig. 1).

Fig. 1: Metadata for a search result: example of framing user expectations

Fig. 2: The Theological Commons site (http://commons.ptsem.edu/)

Fig. 3: An iterative cycle (Identify → Prioritize → Improve → Assess)

Fig. 4: A scatter plot showing the estimated distribution of errors in a book, The English Presbyterians (englishpresbyter00drys); x-axis: page of book (0–140), y-axis: error percentage (0.00–1.00)
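Per-page error percentages like those plotted in Fig. 4 could be estimated computationally, e.g., as the fraction of a page's OCR tokens absent from a wordlist. A hedged sketch, assuming page elements holding OCR text, an assumed document URI, and a caller-supplied wordlist; this is not necessarily the method behind the figure:

xquery version "1.0-ml";
(: Known-good words, supplied by the caller. :)
declare variable $wordlist as xs:string* external;
for $page at $n in fn:doc("/books/englishpresbyter00drys.xml")//page
let $tokens := fn:tokenize(fn:lower-case(fn:string($page)), "[^a-z]+")[. ne ""]
let $misses := $tokens[fn:not(. = $wordlist)]
return
  <point page="{$n}" error-pct="{
    if (fn:empty($tokens)) then 0
    else fn:count($misses) div fn:count($tokens)
  }"/>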