Digitisation at Scale: Automating the mass acquisition of digitised content

IS&T Archiving Conference, Washington, April 2016
Dave Thompson, Digital Curator, Wellcome Library

The Wellcome Library

• Part of Wellcome Collection, an astonishing public venue in London developed by the Wellcome Trust, where people can learn more about medicine through the ages & across cultures

• Five-year plan for transforming the Wellcome Library.

Driver for digitisation

• To make our collections available to anyone, anywhere, we are digitising as much of our physical collection as we can, for both our website and the websites of other organisations. We are also digitising and hosting collections from partners that complement our holdings

Transforming the Wellcome Library: 2009-2014. http://wellcomelibrary.org/what-we-do/library-strategy-and-policy/transforming-the-wellcome-library/

The problem

• How to scale systems & processes to deliver on our ambition

• How to design & build new high-volume systems & processes for: acquisition, storage, processing, access

• How to manage volumes of data during creation/acquisition

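Process design – sources of content

[Diagram: content from several sources is managed in Goobi (METS/OCR) and ingested into Preservica. Labels recoverable from the slide:]

• Sources: In-house, Institutions, Contractors, Harvesting, Grey literature
• Formats & delivery: TIFF or JP2; HD & ftp; PDF; auto harvesting of JP2 & DMD
• Processing: normalises TIFF to JP2; Jpylyzer validates JP2
• Modes: Manual, Automatic
• People & checks: Ingest Officer / Digital Curator; Snagging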
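The validation step in this diagram can be illustrated with a cheap pre-check that runs before a full validator such as Jpylyzer. The sketch below is not Jpylyzer and not the Wellcome Library's code: it only tests whether a file starts with the 12-byte JP2 signature box, so that obviously wrong deliveries (for example a TIFF renamed to .jp2) can be sent to snagging early. The function names `looks_like_jp2` and `triage` are hypothetical.

```python
from pathlib import Path

# Every JP2 file begins with this 12-byte JPEG 2000 signature box.
JP2_SIGNATURE = bytes.fromhex("0000000C6A5020200D0A870A")


def looks_like_jp2(path: Path) -> bool:
    """Return True if the file starts with the JP2 signature box.

    This is only a first filter; it says nothing about whether the
    rest of the file is well formed.
    """
    with path.open("rb") as handle:
        return handle.read(len(JP2_SIGNATURE)) == JP2_SIGNATURE


def triage(paths):
    """Split files into (candidates, rejects) ahead of full validation."""
    candidates, rejects = [], []
    for path in paths:
        (candidates if looks_like_jp2(path) else rejects).append(path)
    return candidates, rejects
```

Files in the first list would still go through full validation; files in the second can be returned to the supplier without spending validator time on them.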

The approach

• (Re)use/develop existing systems where possible, e.g. bibliographic system Sierra, Preservica EE repository

• Identify where new systems would be required, e.g. workflow middleware

• Take a practical approach & accept that it would be iterative, learning as we go

The solution was to use Goobi

Why Goobi?

• Dedicated to digitisation

• Flexibility & process control

• Adaptable & scalable

• Vendor expertise/support

http://www.inspirelancs.org.uk/interested-in-volunteering-family-carers-volunteers-wanted/

Role of Goobi

• Role of Goobi is overall management & tracking of processes

• Initiates ingest into our DAM, Preservica

• Reporting & statistics

Role of humans

• Working at volume did not imply more staff; it implied efficiency

• Also implied automation

• Human work was focussed on tasks machines couldn't do

http://planetivy.com/gaming/25273/natural-selection-2-gaming-evolution-in-action/

System & process design

• High volume doesn’t imply use of many systems

• Requires design to be as simple as possible, with as few moving parts as possible

• Processes need to be efficient & scalable, human as well as system

http://www.nivenswealthstrategies.com/keeping-it-simple/

Partnership for scalable digitisation

• Relationship with Internet Archive, which is digitising our Library content

• High-volume, long-term project

• Content harvested from Internet Archive website & processed automatically

• Dedicated Goobi process for fully automated harvesting

Harvesting from Internet Archive

• Goobi has a ‘repository’ of IA identifiers for searching/harvesting.

• Goobi harvests data from the Internet Archive website.

• Content is processed automatically, including creation of METS & ALTO.

• Content is stored in Preservica.

• DDS creates JSON for the player & pre-caches some content.

• Content is available in the player.
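The first harvesting steps can be sketched roughly as follows. This is an illustration under stated assumptions, not Goobi's implementation: the slides say only that Goobi harvested from the Internet Archive website. The sketch assumes the public archive.org metadata and download URL patterns, and it assumes a harvest wants the page-image archive and descriptive metadata, identified here by the filename suffixes `_jp2.zip`, `_meta.xml` and `_marc.xml`. The function names and the sample identifier are hypothetical.

```python
from urllib.parse import quote

# Public archive.org URL patterns (an assumption about the route used).
METADATA_URL = "https://archive.org/metadata/{identifier}"
DOWNLOAD_URL = "https://archive.org/download/{identifier}/{name}"

# Assumption: JP2 page images plus descriptive metadata (DMD) are wanted.
WANTED_SUFFIXES = ("_jp2.zip", "_meta.xml", "_marc.xml")


def metadata_url(identifier: str) -> str:
    """URL that lists the files held for one Internet Archive item."""
    return METADATA_URL.format(identifier=quote(identifier))


def select_downloads(identifier: str, metadata: dict) -> list:
    """Pick the download URLs worth harvesting from a metadata response."""
    urls = []
    for entry in metadata.get("files", []):
        name = entry.get("name", "")
        if name.endswith(WANTED_SUFFIXES):
            urls.append(
                DOWNLOAD_URL.format(identifier=quote(identifier), name=quote(name))
            )
    return urls


# Hypothetical item and file listing, standing in for a real response.
sample = {"files": [{"name": "demo00item_jp2.zip"}, {"name": "demo00item.pdf"}]}
print(select_downloads("demo00item", sample))
```

Keeping the selection logic separate from the network call means it can be tested against saved responses, which helps when a change on the remote site alters what is returned.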

Challenges – M&Ms

• Multi-volume works

• No metadata to support their union

• Have to construct them manually, but process can be simplified

• Time-consuming; still to be fully automated

Challenges – Working with partners

• Changes to Internet Archive website broke our harvesting

• For automated ftp to work, third parties need to follow instructions

• Creation of JPEG2000 images/video

• Incorrect identifiers trip up processes
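One way to catch incorrect identifiers before they reach an automated process is to verify them on arrival. The slides do not say which identifiers caused problems, so the sketch below makes an assumption: that deliveries are labelled with Sierra-style bibliographic record numbers of the form "b", seven digits and a modulus-11 check digit. The function names are hypothetical.

```python
import re

# Assumed shape: "b" + 7 digits + check digit (0-9 or "x").
RECORD_PATTERN = re.compile(r"b(\d{7})([\dx])")


def check_digit(digits: str) -> str:
    """Modulus-11 check digit: weight digits 2, 3, 4... from the right."""
    total = sum(int(d) * weight for weight, d in enumerate(reversed(digits), start=2))
    remainder = total % 11
    return "x" if remainder == 10 else str(remainder)


def is_valid_record_number(identifier: str) -> bool:
    """Return True if the identifier is well formed and its check digit matches."""
    match = RECORD_PATTERN.fullmatch(identifier.strip().lower())
    if match is None:
        return False
    digits, supplied = match.groups()
    return check_digit(digits) == supplied
```

A check like this catches mistyped or truncated numbers at the point of delivery, where a third party can still correct them, rather than partway through ingest.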

Opportunities

• Working with IT, flexibility of virtualised environment

• Working with Intranda, brings in vendor expertise

• Distributed system brings in feedback from many users

• Small team simplifies decision making

• Success leads to success

Life cycle management

• Good place with regard to life cycle management

• Consistent processes based on common workflows

• Goobi outputs consistent & predictable

• Unified data set easier to manage in the future

Has automation been successful?

• Yes, with a ‘but’

• Automation can be complex, easy to make mistakes

• Automation requires metadata to be available

• Automated processes still require a human minder

The scale of things

Lessons learned

• Complexity vs simplicity

• Iterative approaches work but are time-consuming

• Vendor support/input crucial when starting from scratch

• Process design essential

Be bold. Sometimes it’s the way we work that has to change

Thank you

Questions now, questions later…?

Dave Thompson, Digital Curator, Wellcome Library

@d_n_t

http://wellcomelibrary.org/