Digitisation at Scale: Automating the mass acquisition of digitised content

IS&T Archiving Conference, Washington, April 2016

Dave Thompson, Digital Curator, Wellcome Library

The Wellcome Library

• Part of Wellcome Collection, an astonishing public venue in London developed by the Wellcome Trust, where people can learn more about medicine through the ages & across cultures

• Five-year plan for transforming the Wellcome Library.

Driver for digitisation

• To make our collections available to anyone, anywhere, we are digitising as much of our physical collection as we can, for both our website and the websites of other organisations. We are also digitising and hosting collections from partners that complement our holdings

Transforming the Wellcome Library: 2009-2014. http://wellcomelibrary.org/what-we-do/library-strategy-and-policy/transforming-the-wellcome-library/

The problem

• How to scale systems & processes to deliver on our ambition

• How to design & build new high-volume systems & processes for acquisition, storage, processing & access

• How to manage volumes of data during creation/acquisition

Process design – sources of content

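[Diagram: sources of content and the systems they pass through. Labels recovered from the slide:]

• Sources: In-house, Institutions, Contractors, Harvesting
• Formats: TIFF or JP2; HD & ftp delivery; PDF for grey literature
• Systems: Goobi (METS/OCR), Preservica
• Processing: normalises TIFF to JP2; Jpylyzer validates JP2; auto harvesting of JP2 & DMD
• Modes: Manual, Automatic
• People: Ingest Officer / Digital Curator; Snagging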
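The format handling named on this slide (TIFF normalised to JP2, JP2 validated with Jpylyzer, PDF for grey literature) can be pictured as a simple routing rule. The following is a minimal sketch only: the step names and the idea of sending unrecognised files to snagging are assumptions made for illustration, not a description of how Goobi is configured at the Wellcome Library.

```python
from pathlib import PurePath

# Hypothetical step names. The slide names the tools involved, not how
# the workflow steps are labelled.
ROUTES = {
    ".tif": "normalise_to_jp2",
    ".tiff": "normalise_to_jp2",
    ".jp2": "validate_with_jpylyzer",
    ".pdf": "ingest_grey_literature",
}


def next_step(filename: str) -> str:
    """Return the first processing step for a delivered file, by extension."""
    suffix = PurePath(filename).suffix.lower()
    # Assumption: anything unrecognised goes to a person (snagging)
    # rather than failing silently.
    return ROUTES.get(suffix, "snagging")


def plan(filenames):
    """Group a delivery into a mapping of step name to sorted file list."""
    batches = {}
    for name in filenames:
        batches.setdefault(next_step(name), []).append(name)
    return {step: sorted(names) for step, names in batches.items()}
```

Keeping the routing in one small table reflects the design point made later in the talk: as few moving parts as possible, with people handling only what the machine cannot classify.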

The approach

• (Re)use/develop existing systems where possible, e.g. bibliographic system Sierra, Preservica EE repository

• Identify where new systems would be required, e.g. workflow middleware

• Take a practical approach & accept that it would be iterative, learning as we go

The solution was to use Goobi

Why Goobi?

• Dedicated to digitisation

• Flexibility & process control

• Adaptable & scalable

• Vendor expertise/support


Role of Goobi

• Overall management & tracking of processes

• Initiates ingest into our DAM, Preservica

• Reporting & statistics

Role of humans

• Working at volume did not imply more staff; it implied efficiency

• It also implied automation

• Human work was focussed on tasks machines couldn't do


System & process design

• High volume doesn’t imply use of many systems

• Requires design to be as simple as possible, with as few moving parts as possible

• Processes need to be efficient & scalable, human as well as system


Partnership for scalable digitisation

• Relationship with Internet Archive, which is digitising our Library content

• High-volume, long-term project

• Content harvested from Internet Archive website & processed automatically

• Dedicated Goobi process for fully automated harvesting

Harvesting from Internet Archive

• Goobi has a ‘repository’ of IA identifiers for searching/harvesting.

• Goobi harvests data from Internet Archive website.

• Content processed automatically, including creation of METS & ALTO.

• Content stored in Preservica.

• DDS creates JSON for the player & pre-caches some content.

• Content available in the player.
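To make the idea of harvesting from a list of identifiers concrete, here is a minimal sketch of a queue that works through Internet Archive identifiers and records failures for a person to review. It is an illustration under stated assumptions, not the Goobi implementation: the slides do not say which Internet Archive endpoints Goobi uses, so the public `archive.org/metadata/` and `archive.org/download/` URL patterns are assumed here, and `HarvestQueue` and `fetch` are hypothetical names.

```python
from dataclasses import dataclass, field
from urllib.parse import quote

# Public Internet Archive URL patterns. Whether Goobi uses these
# particular endpoints is an assumption.
IA_METADATA = "https://archive.org/metadata/"
IA_DOWNLOAD = "https://archive.org/download/"


def metadata_url(identifier: str) -> str:
    """URL of the JSON metadata record for one item."""
    return IA_METADATA + quote(identifier, safe="")


def download_url(identifier: str, filename: str) -> str:
    """URL of one file belonging to an item."""
    return IA_DOWNLOAD + quote(identifier, safe="") + "/" + quote(filename)


@dataclass
class HarvestQueue:
    """Hypothetical stand-in for the 'repository' of identifiers."""
    pending: list = field(default_factory=list)
    done: list = field(default_factory=list)
    failed: dict = field(default_factory=dict)

    def run(self, fetch):
        """Try every pending identifier once.

        `fetch` is any callable that takes a URL and raises on failure,
        so the network layer can be swapped or stubbed out.
        """
        while self.pending:
            identifier = self.pending.pop(0)
            try:
                fetch(metadata_url(identifier))
                self.done.append(identifier)
            except Exception as exc:
                # Keep going, and leave a note for the human minder.
                self.failed[identifier] = str(exc)
```

Passing `fetch` in from outside keeps the sketch independent of any particular HTTP library; a failed item is recorded rather than stopping the run, so one bad identifier does not hold up the rest.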

Challenges – M&Ms

• Multi-volume works

• No metadata to support their union

• Have to construct them manually, but process can be simplified

• Time-consuming; still to be fully automated

Challenges – Working with partners

• Changes to Internet Archive website broke our harvesting

• For automated ftp to work, third parties need to follow instructions

• Creation of JPEG2000 images/video

• Incorrect identifiers trip up processes
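One way to reduce the impact of incorrect identifiers is to screen them before any automated step runs. The sketch below is an assumption-laden illustration: the allowed character set and the 100-character limit are placeholders rather than a documented rule from the talk, and `screen` is a hypothetical helper.

```python
import re

# Assumed pattern: starts with a letter or digit, then letters, digits,
# underscore, full stop or hyphen, up to 100 characters in total.
IDENTIFIER = re.compile(r"[A-Za-z0-9][A-Za-z0-9_.-]{0,99}")


def screen(identifiers):
    """Split a supplied list into usable identifiers and rejects.

    Returns (accepted, rejected), where `rejected` maps each problem
    identifier to a short reason a person can act on.
    """
    accepted, rejected, seen = [], {}, set()
    for raw in identifiers:
        ident = raw.strip()  # stray whitespace is tidied, not rejected
        if not IDENTIFIER.fullmatch(ident):
            rejected[ident] = "malformed"
        elif ident in seen:
            rejected[ident] = "duplicate"
        else:
            seen.add(ident)
            accepted.append(ident)
    return accepted, rejected
```

Rejects are returned with a reason instead of raising an error, so a batch from a partner can proceed with the good identifiers while the rest go back for correction.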

Opportunities

• Working with IT, flexibility of virtualised environment

• Working with Intranda, brings in vendor expertise

• Distributed system brings in feedback from many users

• Small team simplifies decision making

• Success leads to success

Life cycle management

• We are in a good place with regard to life cycle management

• Consistent processes based on common workflows

• Goobi outputs are consistent & predictable

• A unified data set is easier to manage in the future

Has automation been successful?

• Yes, with a ‘but’

• Automation can be complex, easy to make mistakes

• Automation requires metadata to be available

• Automated processes still require a human minder

The scale of things

Lessons learned

• Complexity vs simplicity

• Iterative approaches work but are time-consuming

• Vendor support/input crucial when starting from scratch

• Process design essential

Be bold. Sometimes it’s the way we work that has to change

Thank you

Questions now, questions later…?

Dave Thompson, Digital Curator, Wellcome Library

@d_n_t

http://wellcomelibrary.org/
