Slide 1

Experiences on Migration of Data in Digitization Projects

Julián Bescós

Presentation for the ERPANET Workshop "Workflow in Digital Preservation", Budapest, 13-15 October 2004

Slide 2

OVERVIEW

1. The Migration Issue
2. Our Experience
3. Migration Tasks
4. Best Practices for Preservation
5. Planning and Schedule

Slide 3

MIGRATION

• Migration is the set of tasks needed to achieve the periodic transfer of digital materials from one hardware/software configuration to another

Purpose
• Long term preservation of the digital information created and stored using digital technology
• Allow broad access
  – Retrieve, display and use

Origin
• New devices, processes and software replace the methods to record, store and access
• New standards
• Enhancement of service

Slide 4

ORIGIN OF MIGRATION

• Technology obsolescence
  – Hardware
    More powerful computers and higher density storage
    Elements for updating are not available (increase of storage, memory, etc.)
  – Basic software
    Operating systems
    Database managers

• Media
  – Lifetime is rarely the constraining factor for DP
  – Obsolescence of old storage media as newer and better media become available in the market

• Obsolescence of the access software
  – Access on new platforms and media
  – Long term programs not available
  – Changes in metadata and in image formats
  – New functions of the software

Slide 5

ORIGIN OF MIGRATION

• In practice it is a combination of:
  – Technology obsolescence
  – New functionalities of the software
  – Derived from information and communication technology
  – Daily work on digitisation, storage and access requiring:
    Higher density storage
    Faster computers

• It is a consequence of:
  – The digital world of information and communication technology is still relatively young and immature

Slide 6

EXPERIENCE IN DIGITALIZATION PROJECTS

• Beginning in 1988 with the design and development of the Information System for the Archivo de Indias in Seville

• Computerization of 66 Archives and Libraries of different kinds and sizes in Spain and abroad

• Digitalization of more than 20 million pages of ancient documents
• Installation of more than 320 workstations
• Development of our own products, ArchiDOC-ArchiGES, for Archives
• With a team in the areas of consulting, management, development, installation, training and maintenance of systems for archives

Archivo General de Indias, Sevilla: Access Room in 1992

Slide 7

MAIN PROJECTS WITH DIGITALIZATION

Archivo General de Indias, Sevilla

Archivo General de Simancas

Archivo Histórico Nacional, Madrid

Archivo Histórico Nacional - Sección Nobleza, Toledo

Archivo Histórico Nacional Sección Guerra Civil, Salamanca

Archivo de la Corona de Aragón, Barcelona

Archivo General de Navarra

Archivo del Reino de Valencia

Archivo del Reino de Mallorca

Biblioteca Sancho el Sabio, Vitoria

Archivo Virtual de la Corona de Aragón (with images from the ACA and AHN)

Archivo Eclesiástico de Poblet

Archivo Histórico Universidad de Salamanca

Archivo Histórico de la Universidad de Santiago de Compostela

Archivo Histórico de la Universidad de Oviedo

Archivo General de la Nación, Colombia

Archivo Histórico Ultramarino, Lisboa

Archivo del Nacionalismo de la Fundación Sabino Arana, Vizcaya

Biblioteca Valenciana

Archivo del Ilustre Colegio Notarial de Granada

Real Academia Española (Diccionarios Histórico)

Diccionario Biográfico Real Academia Historia

Archivo General Militar, Segovia

Archivo General Militar, Ávila

Instituto de Historia y Cultura Militar

Archivo General de la Marina, El Viso del Marqués, Ciudad Real

Archivo Histórico Provincial de Murcia

Sistema de Información del Archivo, Biblioteca, Fototeca y Videoteca de Cruz Roja Española

Biblioteca de la Fundación Francisco de Zabalburu, Madrid

Biblioteca Parlamento Vasco

Archivo-Biblioteca de la Diputación de Cáceres

Digitalization of 11 newspapers for 11 Basque institutions (retrospective and current press)

Archivo Municipal de Castellón de la Plana

Archivo Histórico del Excmo. Ayuntamiento de La Laguna, Tenerife

Archivo del Ayuntamiento Oviedo

Archivo del Komintern, Moscow and its replica in 6 National Archives, LOC and Open Society Archives

Archivo General de Navarra

Archivo General Militar, Segovia

Zabalburu Library

Slide 8

FIGURES OF DIGITALIZATION

Date | Institution | Number of Images | Kind of Images
89-02 | Archivo General de Indias, Sevilla | 11.000.000 | Manuscripts XVI-XIX
97- | Archivo General de la Nación, Colombia | 1.000.000 | Manuscripts
94-00 | Archivo General de Simancas | 1.000.000 | Manuscripts
97-04 | Archivo General Militar, Ávila | 180.000 | Military files
97-04 | Archivo General Militar, Segovia | 300.000 | Military files
98-04 | Archivo General de Navarra | 450.000 | Medieval manuscripts
98- | Archivo General de la Marina, El Viso del Marqués, Ciudad Real | 150.000 | Manuscripts
96- | Archivo de la Real Chancillería de Valladolid | | Manuscripts
93-03 | Archivo Histórico Nacional, Madrid | 3.000.000 | Manuscripts
95-01 | Archivo Histórico Nacional - Sección Nobleza, Toledo | 300.000 | Manuscripts
96 | Archivo Histórico Nacional Sección Guerra Civil, Salamanca | | Manuscripts
96 | Archivo Histórico Provincial, Vizcaya | |
97 | Archivo Histórico Provincial de Murcia | 250.000 | Protocols
99-02 | Archivo Histórico Provincial de Oviedo | |
95 | Archivo Histórico Ultramarino, Lisboa | | Ancient manuscripts
95-04 | Archivo de la Corona de Aragón, Barcelona | 200.000 | Medieval manuscripts
94-01 | Archivo Histórico de la Universidad de Salamanca | 700.000 | Manuscripts
96-02 | Archivo Histórico de la Universidad de Oviedo | |
97-04 | Archivo Histórico de la Universidad de Santiago de Compostela | 400.000 | Manuscripts
98-02 | Archivo del Komintern, Moscow | 1.000.000 | Documents 1900-1945
93-04 | Biblioteca y Archivo de la Fundación Sancho el Sabio, Vitoria | 1.100.000 | Monographs XVI-XIX
96-02 | Biblioteca de la Fundación Francisco de Zabalburu, Madrid | 700.000 | Manuscripts and monographs
96-00 | Archivo del Nacionalismo de la Fundación Sabino Arana, Vizcaya | 100.000 |
97-01 | Archivo Histórico del Excmo. Ayuntamiento de La Laguna, Tenerife | 100.000 | Manuscripts
96 | Archivo del Ilustre Colegio Notarial de Granada | 200.000 | Protocols
1998 | Instituto de Historia y Cultura Militar | 100.000 | Manuscripts
95-00 | Archivo Eclesiástico de Poblet | 200.000 | Manuscripts
98- | Archivo-Biblioteca de la Diputación de Cáceres | 200.000 | Minutes (actas)
98 | Archivo Municipal de Castellón de la Plana | |
98 | Centro de Investigaciones Biológicas (CSIC) | |

Slide 9

FIGURES OF DIGITALIZATION

Date | Institution | Number of Images | Kind of Images
96-00 | Real Academia Española | | Historical dictionaries
96-00 | Digitalization of 11 newspapers for 11 Basque institutions | 300.000/year | Ancient journals
99-04 | Archivo Histórico Provincial Cantabria | |
2000 | Archivo Ayuntamiento Estella | |
00-02 | Archivo y Biblioteca Cruz Roja | | Photographs, monographs
00-04 | Archivo Virtual de Aragón (images from the ACA and AHN) | | Medieval manuscripts
00-01 | Proyecto AER (initially with AGI and AHN) | |
00-04 | Biblioteca Parlamento Vasco | 300.000 | Monographs
01-04 | Archivo del Reino de Valencia | | Manuscripts, protocols
01-02 | Diccionario Biográfico Real Academia Historia | |
01 | Archivo del Ayuntamiento Oviedo | | Census registers (padrones) XV
01-04 | Archivo del Reino de Mallorca | |
02 | Sistema Archivos Principado Asturias | |
02 | Archivo Casa de Alba | |

Slide 10

EXPERIENCES ON MIGRATION

1. Projects from 1988 – 1992: Computer System for Archivo General de Indias

• The Archive contains 86 million pages of original manuscripts related to the Spanish Administration in America (XV-XIX centuries), in 43.000 bundles

• The Computer System integrated:
  – A textual database with 400.000 descriptive entries
  – A digital image archive with 11 million digital images in 1995
  – A module for user and document management: control of user management, consultation room, document movements and statistics

• Access by researchers and archivists from 50 workstations
• About 30% of present consultations are on the screen (1 million pages/year)
• About 35% of prints are digital (85.000/year)
• Access system in service since 1992

Slide 11

EXPERIENCES ON MIGRATION

Architecture
• The database for descriptions in SQL/400 keeps the hierarchical structure of fonds
• Standalone digitization workstations with flatbed scanners and optical disk drives under DOS
• Image servers based on PCs with optical disk drives
• Access from PCs under OS/2

Image Acquisition and Storage
• 11 million images digitized in gray levels with high fidelity with respect to the original manuscripts
• Low cost workstations
• Legibility enhancements applied by users at consultation time
• Non-expert digitization operators
• Digitization: 100 dpi, 16 gray levels
• 1 page/minute, 15 workstations, 2 shifts, 4 years
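As a rough plausibility check, the throughput figures on this slide can be multiplied out. The slide gives the capture rate, the number of workstations, the number of shifts and the duration; the hours per shift and working days per year below are assumptions added only for illustration and are not stated in the presentation.

```python
# Back-of-the-envelope check of the digitization throughput quoted on the slide.
PAGES_PER_MINUTE = 1          # from the slide
WORKSTATIONS = 15             # from the slide
SHIFTS_PER_DAY = 2            # from the slide
YEARS = 4                     # from the slide
HOURS_PER_SHIFT = 7           # assumption, not in the source
WORKING_DAYS_PER_YEAR = 220   # assumption, not in the source


def total_pages() -> int:
    """Return the number of pages captured over the whole campaign."""
    pages_per_shift = PAGES_PER_MINUTE * 60 * HOURS_PER_SHIFT
    pages_per_day = pages_per_shift * SHIFTS_PER_DAY * WORKSTATIONS
    return pages_per_day * WORKING_DAYS_PER_YEAR * YEARS


if __name__ == "__main__":
    print(f"{total_pages():,} pages")
```

Under these assumed working hours the estimate lands close to the 11 million images reported for the Archivo General de Indias; different assumptions would shift it accordingly.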

Slide 12

EXPERIENCES ON MIGRATION

Image Acquisition and Storage
• Images stored on WORM optical disks
  – The structure at the low level (bundle/documents) was also reflected in directories on the WORM disks
  – Access to images on one disk was done through the call number of the document
  – Image path as metadata: image names carried information about the document call number and page number
  – No standard compression was available for gray level images. Images were DPCM-compressed by software without losses.

• Compressed image size of an A4 page: 300-350 KB
• Storage for 1 bundle: 2000 images x 350 KB = 700 MB
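To illustrate what lossless DPCM means for 16-level gray images, here is a minimal sketch. It assumes the simplest possible predictor (the previous pixel in the row) and modular residuals; the presentation does not describe the predictor or the entropy coder that was actually used, so this is only an illustration of the principle.

```python
# Minimal lossless DPCM sketch for one row of 16-level gray pixels.
# Assumption: previous-pixel predictor; the real system's predictor is not documented.
LEVELS = 16


def dpcm_encode(row):
    """Replace each pixel by its difference from the previous pixel (mod LEVELS)."""
    residuals = []
    previous = 0
    for pixel in row:
        residuals.append((pixel - previous) % LEVELS)
        previous = pixel
    return residuals


def dpcm_decode(residuals):
    """Rebuild the original pixels by accumulating the residuals (mod LEVELS)."""
    row = []
    previous = 0
    for residual in residuals:
        previous = (previous + residual) % LEVELS
        row.append(previous)
    return row


if __name__ == "__main__":
    row = [7, 7, 8, 8, 8, 9, 15, 0]
    encoded = dpcm_encode(row)
    assert dpcm_decode(encoded) == row  # round trip is exact, so no information is lost
    print(encoded)
```

The residuals of smooth regions cluster around zero, which is what a subsequent entropy coder (not shown here) would exploit to reduce the file size.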

Slide 13

EXPERIENCES ON MIGRATION

Image Acquisition and Storage
• Media for storage of digital images:

Bundles | Media | Year beg. | Number of disks | Images
1.729 | IBM optical disks (200 MB) | 1989 | 6.916 | 3.458.000
3.732 | Plasmon optical disks (940 MB) | 1991 | 3.732 | 7.464.000
50 | CD-R (640 MB) | 1996 | | 100.000
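The disk counts in the table follow from the bundle size given earlier (2000 images of about 350 KB). The sketch below recomputes them, assuming decimal units (1 MB = 1000 KB) and that a bundle is never shared between disks; both are assumptions made for the illustration.

```python
# Recompute the number of optical disks from the per-bundle storage figures.
IMAGES_PER_BUNDLE = 2000   # from the slides
IMAGE_SIZE_KB = 350        # upper bound quoted on the slides


def disks_per_bundle(disk_capacity_mb: int) -> int:
    """Number of disks needed for one bundle (ceiling division, decimal units)."""
    bundle_kb = IMAGES_PER_BUNDLE * IMAGE_SIZE_KB
    capacity_kb = disk_capacity_mb * 1000
    return -(-bundle_kb // capacity_kb)


def disks_needed(bundles: int, disk_capacity_mb: int) -> int:
    """Total disks for a number of bundles, one bundle never sharing a disk."""
    return bundles * disks_per_bundle(disk_capacity_mb)


if __name__ == "__main__":
    print("IBM 200 MB:    ", disks_needed(1729, 200))
    print("Plasmon 940 MB:", disks_needed(3732, 940))
    print("Images on IBM disks:", 1729 * IMAGES_PER_BUNDLE)
```

With these inputs the IBM and Plasmon rows of the table are reproduced: four 200 MB disks per bundle, and one 940 MB disk per bundle.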

Slide 14

Slide 15

Slide 16

Slide 17

Example of blotch removal, to be applied by the user

Slide 18

Slide 19

Example of reduction of ink bleeding through the paper

Slide 20

Archivo General de Indias

Digitization Room of Archivo de Indias in 1989

Slide 21

Archivo General de Indias

Shelf with optical disks

Slide 22

EXPERIENCES ON MIGRATION

2. Projects from 1992 – 1996:

– Database server under OS/2 and DB2
– Access and digitization workstations from PCs with OS/2
– The relational database keeps the hierarchical structure of documentation
– Images stored on CD-Rs
  Directory structures and image names changed.
  Metadata in binary control files: each image has information about signature (call number), position in hierarchical structure, page number, notes
  Image compression: JPEG
  Metadata in images: resolution, date, dimensions

Slide 23

EXPERIENCES ON MIGRATION

Example: metadata in Binary Control File

– The file keeps information about the hierarchical structure

– It maintains relationship between each image file and its position in the document.

– The control file and its metadata can be imported into the database

Slide 24

EXPERIENCES ON MIGRATION

Migration of Images of Archivo de Indias from 10.600 optical disks to 6.000 CD-Rs

– The images of a bundle are stored on 1 or 2 CD-Rs
– Reading of optical disks through the network
– No direct connectivity between optical disks and Windows NT

– Main operation tasks:
  Decompression of the DPCM format
  Compression to JPEG format
  Temporary storage on magnetic disk
  All images of the bundle are copied to CD-R
  Verification of images by reading
  6.000 CD-Rs, and a backup copy of 6.000 CD-Rs
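The copy and verification steps above can be sketched as follows. This is only an illustration: it copies files between two directories and verifies each one by reading it back and comparing SHA-256 digests. The DPCM-to-JPEG transcoding step is omitted, and the function names and the use of a checksum are assumptions, since the presentation only says that images were verified by reading.

```python
# Sketch of "copy a bundle, then verify by reading back" (transcoding omitted).
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(src_dir: Path, dst_dir: Path):
    """Copy every file under src_dir to dst_dir; return the files that failed."""
    failed = []
    for src in sorted(src_dir.rglob("*")):
        if not src.is_file():
            continue
        dst = dst_dir / src.relative_to(src_dir)
        dst.parent.mkdir(parents=True, exist_ok=True)
        try:
            shutil.copyfile(src, dst)
            if sha256_of(src) != sha256_of(dst):
                failed.append(src)  # copy exists but differs from the source
        except OSError:
            failed.append(src)      # unreadable source or unwritable target
    return failed
```

The returned list plays the role of the error report: files in it would have to be recovered from the backup copy or from another medium.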

Slide 25

EXPERIENCES ON MIGRATION

Migration of Images from 6.916 WORM IBM disks to CD-Rs
– Typically 4 WORM disks (200 MB each) on 1 or 2 CD-Rs

IBM Disks to CD-R (diagram)

Source machine (IBM optical drives):
  Microchannel IBM PS/2
  File system driver for OS/2
  OS/2 1.3 and LAN Server
  Token Ring Microchannel card

Network: Token Ring

Destination machine (CD-R drives):
  Pentium PC
  Windows NT
  Token-Ring PCI card
  3 GB disk, SCSI interface

Slide 26

EXPERIENCES ON MIGRATION

Migration of Images from 3.732 WORM Plasmon disks to CD-Rs
– 1 WORM Plasmon disk (940 MB) on 1 or 2 CD-Rs

Plasmon Disks to CD-R (diagram)

Source machine (Plasmon drives):
  PC with i486
  SCSI interface
  File system driver for OS/2
  OS/2 3.0
  Ethernet card

Network: Ethernet hub

Destination machine (CD-R drives):
  Pentium PC
  Windows NT
  Token-Ring PCI card
  3 GB disk, SCSI interface

Slide 27

EXPERIENCES ON MIGRATION

Migration of Images of Archivo de Indias from 10.600 optical disks to 6.000 CD-Rs
– Requirements of personnel and time
  3 operators during 4 months

Similar migration schemes with fewer images:
• Library Sancho el Sabio (Vitoria): 1.000.000 images
• University of Salamanca: 700.000 images
• Archivo General Militar, Segovia: 200.000 images
• Archivo del Monasterio Poblet: 100.000 images

Slide 28

EXPERIENCES ON MIGRATION

3. Projects from 1996 to now:

– Oracle database
– Access and digitization workstations with PCs running Windows NT, ... Windows XP
– Capturing images also using standard programs and their metadata
– Images stored on magnetic disks, with CD-ROMs as backup
  Metadata in database: scanning operator, date of creation, signature (call number), path, size in bytes... Data about control of the information
  Metadata in image: resolution, dimensions... Data for presentation on computers and for printing
  Image quality: 200 – 300 dpi, 256 gray levels; color images
  Standard formats: TIFF, CCITT Group IV, JPEG, PDF

Slide 29

EXPERIENCES ON MIGRATION

Example: metadata in database

Modes of Image Display

Management of Image Access

Slide 30

EXPERIENCES ON MIGRATION

Example: metadata XML File

– Same functionality as the binary control file

– Standard: virtually any program can import these metadata
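As an illustration of why an XML metadata file is easy to import, the sketch below reads a small sample with Python's standard library. The element and attribute names (bundle, document, image, callNumber, file, page) are hypothetical; the presentation does not show the actual schema.

```python
# Sketch: importing image metadata from a HYPOTHETICAL XML control file.
import xml.etree.ElementTree as ET

# Invented sample; the real file layout is not given in the source.
SAMPLE = """<bundle callNumber="EXAMPLE,1">
  <document id="1">
    <image file="0001.jpg" page="1"/>
    <image file="0002.jpg" page="2"/>
  </document>
  <document id="2">
    <image file="0003.jpg" page="1"/>
  </document>
</bundle>"""


def load_images(xml_text: str):
    """Flatten the bundle/document/image hierarchy into one row per image."""
    root = ET.fromstring(xml_text)
    rows = []
    for document in root.iter("document"):
        for image in document.iter("image"):
            rows.append({
                "call_number": root.get("callNumber"),
                "document": document.get("id"),
                "page": int(image.get("page")),
                "file": image.get("file"),
            })
    return rows


if __name__ == "__main__":
    for row in load_images(SAMPLE):
        print(row)
```

Each row keeps the link between an image file and its position in the hierarchy, which is the information the slides say the control file maintains, in a form that can be loaded into a database table.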

Slide 31

EXPERIENCES ON MIGRATION

Migration of Archivo de Indias from CD-R to magnetic disk in 2000
– Project for online access and Internet
  Just a copy: images are already JPEG-compressed
  10 RAID cabinets of 350 GB each (8 disks x 50 GB)
  1 operator was required for 1 month for the copy from a CD-ROM tower to magnetic disks
– Transfer rate from different media:

Media | Transfer rate | Image | Bundle
IBM optical disk | 60 KB/s | 6 seconds | 4 hours
Plasmon optical disk | 100 KB/s | 3 seconds | 1 hour
CD-R 16x | 2,5 MB/s | <1 second | 5 minutes
Magnetic disk | 80 MB/s | | 1 minute

Similar migrations:
  Sancho el Sabio Library (Vitoria): 1 million images
  Zabalburu Library: 700.000 images
  Military Archives: 500.000 images
  Archivo General de Navarra: 600.000 images
  Komintern Archives (Moscow): 1 million images
  ...

Komintern Archives, Moscow

Slide 32

Archivo General de Indias

UPS

UPS

Image Server

RAID Cabinet 1

RAID Cabinet 2

RAID Cabinet 3

RAID Cabinet 4

RAID Cabinet 5

RAID Cabinet 6

RAID Cabinet 7

RAID Cabinet 8

RAID Cabinet 9

RAID Cabinet 10

Data Base Server

Domain Controller Server

WEB Server

UPS

UPS

SERVERS AND IMAGE STORAGE

Slide 33

Archivo General de Indias

Reserved UPS

Data Base Server

Domain Controller Servers

UPS

WEB Servers

Image Server

RAID Cabinet 1

RAID Cabinet 2

UPS

Reserved for RAID Cabinet 3

Auto-replicated online remote disk subsystem for backup and service

Local network

Slide 34

MIGRATION TASKS

• Analysis of origin and destination data models
• Equivalence between the fields in the origin and destination models
  – New versions include new metadata not available before
• Development of migration software
• Testing with a limited number of objects
• Display of information in a destination card
• Application of migration to all data
• Verification of results
• Correction of errors:
  – Sometimes some images cannot be copied and must be recovered from alternative media or even digitised again

Komintern Archives, Moscow
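The field-equivalence and testing steps listed on this slide can be sketched as below. All field names are hypothetical, chosen to echo the metadata mentioned in the presentation; the sketch maps origin fields to destination fields, fills metadata that did not exist in the old model with empty defaults, and can be run on a limited sample before being applied to all records.

```python
# Sketch of a field mapping between an origin and a destination data model.
# Field names are HYPOTHETICAL; the real models are not shown in the source.
FIELD_MAP = {
    "signature": "call_number",
    "npage": "page",
    "notes": "notes",
}

# Metadata present only in the newer model: left empty after migration.
NEW_FIELD_DEFAULTS = {"scanning_operator": None, "creation_date": None}


def migrate_record(old: dict) -> dict:
    """Translate one origin record; reject fields with no known equivalent."""
    unknown = set(old) - set(FIELD_MAP)
    if unknown:
        raise KeyError(f"no destination field for: {sorted(unknown)}")
    new = dict(NEW_FIELD_DEFAULTS)
    for old_name, new_name in FIELD_MAP.items():
        if old_name in old:
            new[new_name] = old[old_name]
    return new


def migrate_all(records, sample_size=None):
    """Migrate all records, or only the first sample_size for a trial run."""
    subset = records if sample_size is None else records[:sample_size]
    migrated, errors = [], []
    for index, record in enumerate(subset):
        try:
            migrated.append(migrate_record(record))
        except KeyError as exc:
            errors.append((index, str(exc)))
    return migrated, errors
```

Running `migrate_all(records, sample_size=100)` first corresponds to testing with a limited number of objects; the returned error list is what would feed the correction step.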

Slide 35

MAIN COST FACTORS

• Preparation of the system for migration
  – Hardware and basic software:
    Magnetic disk storage for images
    PCs with appropriate OS and DB manager

• Development of software (1 programmer, 2-3 weeks of work)
  – Software development for migration
  – Testing of migration of data

• Operation (usually less than 1 week)
  – Significant operation with removable media

Komintern Archives, Moscow

Slide 36

BEST PRACTICES FOR PRESERVATION

• General principles
  – Based on PCs and mainstream commercial equipment
  – Key hardware provided by first-class IT companies
  – Database managers of widespread use
  – Consultations with institutions undertaking projects
  – Based on standard elements and formats, official or de facto, like TIFF, JPEG, XML, etc.
  – Modular, allowing a progressive installation and easy update of elements
  – Selection of software:
    Functionalities
    Number of installations
    Maintenance
    Provided by an IT company established in the sector
  – Key factors:
    Server, operating system, database manager
    Backup policies

Slide 37

BEST PRACTICES FOR PRESERVATION

• Digitization
  – Capture systems:
    Robust flatbed scanners (A3)
    Overhead (zenithal) scanners. Digital cameras with limitations.
  – Use of standard compression formats: JPEG, CCITT Group IV
  – Ensure that digital images will allow a broad range of future use
  – Capture the highest quality image technically possible and economically feasible for large-scale production
  – Capture the informational content / physical appearance
  – Fast and easy correction of errors

• Criteria for holding selection
  – Value
  – Condition
  – Use
  – Acceptability of the digital object
  – Access aids

Slide 38

BEST PRACTICES FOR PRESERVATION

• Storage
  – Media of wide use and low cost:
    Magnetic disk for online image service (especially in high demand)
    Disks with redundancy
    Backup on high-capacity tapes (10/20 GB)
    One or two units available as hot swap
    It allows migration without personnel operation
    In a distributed network they may need to be stored online in multiple locations
    CD-R or DVD as backup for offline access in case of system failure
  – In general there is little experience in storing massive quantities of culturally valuable materials

• Backup and Recovery
  – Use industry standard backup and recovery procedures:
    Periodic backup to magnetic tape
    A copy held on site for near-term recovery
    A copy stored off-site for disaster recovery

Slide 39

APPLICATION OF MIGRATION

Traditional approach of Computer Science
• Migration of media
  – Refreshing digital information by copying it from medium to medium
  – Conversion of files to another format to be interpreted by new programs; to a reduced number of standard formats

• Migration of technology platform
  – Server and PCs
  – Peripherals
  – Capture devices and CD-R writers
  – Operating system and database manager

• Migration of the digitising and access software
  – Maintenance of software on the new platform
  – New software versions for digitising and access

Slide 40

PLANNING

• Planning for migration is difficult due to:
  – the limited experience
  – the impossibility of predicting when media, software and hardware will become obsolete

• No single strategy applies to all formats of digital information
• It varies in different application environments, for different formats of digital materials and for preserving different degrees of computation, display and retrieval
• It requires a unique new solution for each new format and process
• Automatic conversion is only partially possible
• In general there are no firm plans for migration, other than to stay up to date with current technologies by migrating the content
• Usually there is urgency involved in migration, due to the obsolescence of software and hardware

Slide 41

SCHEDULE

• Schedule
  – New releases of software, databases, etc. can be expected every 2-3 years, with minor updates more often
  – Migration from one storage medium to another every 4-5 years, if not online
  – Migration to new hardware and software occurs less frequently but can be expected every 5-10 years
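The intervals above can be turned into a simple planning aid. The sketch below only adds the quoted ranges to the year of the last change; the layer names and the function are illustrative, and the result is a rough window rather than a firm plan, in line with the planning caveats given in the presentation.

```python
# Sketch: expected review window per layer, using the intervals quoted on the slide.
INTERVALS_YEARS = {
    "software release": (2, 3),
    "storage media (offline)": (4, 5),
    "hardware and software platform": (5, 10),
}


def next_window(last_year: int, layer: str):
    """Return (earliest, latest) year in which the next change can be expected."""
    shortest, longest = INTERVALS_YEARS[layer]
    return last_year + shortest, last_year + longest


if __name__ == "__main__":
    for layer in INTERVALS_YEARS:
        print(layer, next_window(2000, layer))
```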

Slide 42

SUMMARY

• Best practices for Digital Preservation
  – Mainstream commercial equipment
  – Use of standard formats
  – Storage on magnetic disk with redundancy
  – Backup policies
  – Maintenance

• Periodical Update Policy
  – Hardware
  – Media
  – Basic software
  – Application software