Slide 1
Experiences on Migration of Data in Digitization Projects
Julián Bescós
Presentation for the ERPANET Workshop
Workflow in Digital Preservation
Budapest, 13-15 October 2004
Slide 2
OVERVIEW
1. The Migration Issue
2. Our Experience
3. Migration Tasks
4. Best Practices for Preservation
5. Planning and Schedule
Slide 3
MIGRATION
• Migration is the set of tasks that achieve the periodic transfer of digital materials from one hardware/software configuration to another
Purpose
• Long-term preservation of the digital information created and stored using digital technology
• Allow broad access
– Retrieve, display and use
Origin
• New devices, processes and software replace the methods used to record, store and access
• New standards
• Enhancement of service
Slide 4
ORIGIN OF MIGRATION
• Technology obsolescence
– Hardware
  More powerful computers and higher-density storage
  Elements for updating are not available (increase of storage, memory, etc.)
– Basic software
  Operating systems
  Database managers
• Media
– Lifetime is rarely the constraining factor for digital preservation (DP)
– Obsolescence of old storage media as newer and better media become available in the market
• Obsolescence of the access software
– Access on new platforms and media
– Long-term programs not available
– Changes in metadata and in image formats
– New functions of the software
Slide 5
ORIGIN OF MIGRATION
• In practice it is a combination of:
– Technology obsolescence
– New functionalities of the software
– Derived from information and communication technology
– Daily work on digitisation, storage and access requiring:
  Higher-density storage
  Faster computers
• It is a consequence of:
– The digital world of information and communication technology is still relatively young and immature
Slide 6
EXPERIENCE IN DIGITALIZATION PROJECTS
• Beginning in 1988 with the design and development of the Information System for the Archivo de Indias in Seville
• Computerization of 66 archives and libraries of different kinds and sizes in Spain and abroad
• Digitalization of more than 20 million pages of ancient documents
• Installation of more than 320 workstations
• Development of our own products, ArchiDOC-ArchiGES, for archives
• With a team in the areas of consulting, management, development, installation, training and maintenance of systems for archives
Archivo General de Indias, Sevilla Access Room in 1992
Slide 7
MAIN PROJECTS WITH DIGITALIZATION
Archivo General de Indias, Sevilla
Archivo General de Simancas
Archivo Histórico Nacional, Madrid
Archivo Histórico Nacional - Sección Nobleza, Toledo
Archivo Histórico Nacional Sección Guerra Civil, Salamanca
Archivo de la Corona de Aragón, Barcelona
Archivo General de Navarra
Archivo del Reino de Valencia
Archivo del Reino de Mallorca
Biblioteca Sancho el Sabio, Vitoria
Archivo Virtual de la Corona de Aragón (with images from the ACA and AHN)
Archivo Eclesiástico de Poblet
Archivo Histórico Universidad de Salamanca
Archivo Histórico de la Universidad de Santiago de Compostela
Archivo Histórico de la Universidad de Oviedo
Archivo General de la Nación, Colombia
Archivo Histórico Ultramarino, Lisboa
Archivo del Nacionalismo de la Fundación Sabino Arana, Vizcaya
Biblioteca Valenciana
Archivo del Ilustre Colegio Notarial de Granada
Real Academia Española (Historical Dictionaries)
Diccionario Biográfico Real Academia Historia
Archivo General Militar, Segovia
Archivo General Militar, Ávila
Instituto de Historia y Cultura Militar
Archivo General de la Marina, El Viso del Marqués, Ciudad Real
Archivo Histórico Provincial de Murcia
Information System of the Archive, Library, Photo Library and Video Library of Cruz Roja Española
Biblioteca de la Fundación Francisco de Zabalburu, Madrid
Biblioteca Parlamento Vasco
Archivo-Biblioteca de la Diputación de Cáceres
Digitization of 11 newspapers for 11 Basque institutions (retrospective and current press)
Archivo Municipal de Castellón de la Plana
Archivo Histórico del Excmo. Ayuntamiento de La Laguna, Tenerife
Archivo del Ayuntamiento Oviedo
Archivo del Komintern, Moscow and its replica in 6 National Archives, LOC and Open Society Archives
Archivo General de Navarra
Archivo General Militar, Segovia
Zabalburu Library
Slide 8
FIGURES OF DIGITALIZATION
Date | Institution | Number of Images | Kind of Images
89-02 | Archivo General de Indias, Sevilla | 11.000.000 | Manuscripts XVI-XIX
97- | Archivo General de la Nación, Colombia | 1.000.000 | Manuscripts
94-00 | Archivo General de Simancas | 1.000.000 | Manuscripts
97-04 | Archivo General Militar, Ávila | 180.000 | Military files
97-04 | Archivo General Militar, Segovia | 300.000 | Military files
98-04 | Archivo General de Navarra | 450.000 | Medieval manuscripts
98- | Archivo General de la Marina, El Viso del Marqués, Ciudad Real | 150.000 | Manuscripts
96- | Archivo de la Real Chancillería de Valladolid | | Manuscripts
93-03 | Archivo Histórico Nacional, Madrid | 3.000.000 | Manuscripts
95-01 | Archivo Histórico Nacional - Sección Nobleza, Toledo | 300.000 | Manuscripts
96 | Archivo Histórico Nacional Sección Guerra Civil, Salamanca | | Manuscripts
96 | Archivo Histórico Provincial, Vizcaya | |
97 | Archivo Histórico Provincial de Murcia | 250.000 | Protocols
99-02 | Archivo Histórico Provincial de Oviedo | |
95 | Archivo Histórico Ultramarino, Lisboa | | Ancient manuscripts
95-04 | Archivo de la Corona de Aragón, Barcelona | 200.000 | Medieval manuscripts
94-01 | Archivo Histórico de la Universidad de Salamanca | 700.000 | Manuscripts
96-02 | Archivo Histórico de la Universidad de Oviedo | |
97-04 | Archivo Histórico de la Universidad de Santiago de Compostela | 400.000 | Manuscripts
98-02 | Archivo del Komintern, Moscow | 1.000.000 | Documents 1900-1945
93-04 | Biblioteca y Archivo de la Fundación Sancho el Sabio, Vitoria | 1.100.000 | Monographs XVI-XIX
96-02 | Biblioteca de la Fundación Francisco de Zabalburu, Madrid | 700.000 | Manuscripts and monographs
96-00 | Archivo del Nacionalismo de la Fundación Sabino Arana, Vizcaya | 100.000 |
97-01 | Archivo Histórico del Excmo. Ayuntamiento de La Laguna, Tenerife | 100.000 | Manuscripts
96 | Archivo del Ilustre Colegio Notarial de Granada | 200.000 | Protocols
1998 | Instituto de Historia y Cultura Militar | 100.000 | Manuscripts
95-00 | Archivo Eclesiástico de Poblet | 200.000 | Manuscripts
98- | Archivo-Biblioteca de la Diputación de Cáceres | 200.000 | Minutes (actas)
98 | Archivo Municipal de Castellón de la Plana | |
98 | Centro de Investigaciones Biológicas (CSIC) | |
Slide 9
FIGURES OF DIGITALIZATION
Date | Institution | Number of Images | Kind of Images
96-00 | Real Academia Española | | Historical dictionaries
96-00 | Digitization of 11 newspapers for 11 Basque institutions | 300.000/year | Ancient journals
99-04 | Archivo Histórico Provincial Cantabria | |
2000 | Archivo Ayuntamiento Estella | |
00-02 | Archivo y Biblioteca Cruz Roja | | Photographs, monographs
00-04 | Archivo Virtual de Aragón (images from the ACA and AHN) | | Medieval manuscripts
00-01 | Proyecto AER (initially with AGI and AHN) | |
00-04 | Biblioteca Parlamento Vasco | 300.000 | Monographs
01-04 | Archivo del Reino de Valencia | | Manuscripts, protocols
01-02 | Diccionario Biográfico Real Academia Historia | |
01 | Archivo del Ayuntamiento Oviedo | | Census rolls (padrones) XV
01-04 | Archivo del Reino de Mallorca | |
02 | Sistema Archivos Principado Asturias | |
02 | Archivo Casa de Alba | |
Slide 10
EXPERIENCES ON MIGRATION
1. Projects from 1988 – 1992: Computer System for Archivo General de Indias
• The Archive contains 86 million pages of original manuscripts related to the Spanish Administration in America (XV-XIX centuries), in 43.000 bundles
• The Computer System integrated:
– A Textual Data Base with 400.000 descriptive entries
– A Digital Image Archive with 11 million digital images in 1995
– A Module for User and Document Management: control of user management, consultation room, document movements and statistics
• Access by researchers and archivists from 50 workstations
• About 30% of present consultations are on screen (1 million pages/year)
• About 35% of printing is digital (85.000/year)
• Access system in service since 1992
Slide 11
EXPERIENCES ON MIGRATION
Architecture
• The Data Base for Descriptions in SQL/400 keeps the hierarchical structure of fonds
• Standalone digitization workstations with flatbed scanners and optical disk drives under DOS
• Image servers based on PCs with optical disk drives
• Access from PCs under OS/2
Image Acquisition and Storage
• 11 million images digitized in gray levels with high fidelity to the original manuscripts
• Low-cost workstations
• Legibility enhancements applied by users at consultation time
• Non-expert digitization operators
• Digitization: 100 dpi, 16 gray levels
• 1 page/minute, 15 workstations, 2 shifts, 4 years
Slide 12
EXPERIENCES ON MIGRATION
Image Acquisition and Storage
• Images stored on WORM optical disks
– The low-level structure (bundle/documents) was also reflected in the directories on the WORM disks
– Access to the images on a disk was done through the call number of the document
– Image path as metadata: image names carried the document call number and the page number
– No standard compression was available for gray-level images. Images were DPCM-compressed in software, without loss.
• Compressed size of an A4 image: 300-350 Kbytes
• Storage for 1 bundle: 2000 images x 350 KB = 700 MB
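The storage arithmetic on this slide can be reproduced with a few lines of Python. This is only a sketch: the figures (350 KB per image, 2000 images per bundle, 11 million images) come from the slides, while the use of decimal units (1 MB = 1000 KB) and the function names are assumptions made for illustration.

```python
import math

# Figures quoted on the slides; decimal units are an assumption of this sketch.
IMAGE_KB = 350              # upper bound of the compressed A4 image size
IMAGES_PER_BUNDLE = 2000    # images per bundle used in the slide's calculation
TOTAL_IMAGES = 11_000_000   # size of the digital image archive in 1995


def bundle_size_mb(images: int = IMAGES_PER_BUNDLE, image_kb: int = IMAGE_KB) -> float:
    """Storage needed for one bundle, in MB."""
    return images * image_kb / 1000


def archive_size_tb(images: int = TOTAL_IMAGES, image_kb: int = IMAGE_KB) -> float:
    """Storage needed for the whole image archive, in TB."""
    return images * image_kb / 1000**3


def disks_per_bundle(disk_mb: float, bundle_mb: float) -> int:
    """Number of disks of a given capacity needed to hold one bundle."""
    return math.ceil(bundle_mb / disk_mb)


bundle = bundle_size_mb()
print(f"One bundle: {bundle:.0f} MB")
print(f"Whole archive: {archive_size_tb():.2f} TB")
for name, capacity in [("IBM optical disk", 200), ("Plasmon optical disk", 940), ("CD-R", 640)]:
    print(f"{name} ({capacity} MB): {disks_per_bundle(capacity, bundle)} per bundle")
```

With these inputs a bundle needs four 200 MB disks, one 940 MB disk or two 640 MB CD-Rs, since 350 KB is the upper end of the quoted image size.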
Slide 13
EXPERIENCES ON MIGRATION
Image Acquisition and Storage
• Media for storage of digital images:
Bundles | Media | Year beg. | Number of disks | Images
1.729 | IBM optical disks (200 MB) | 1989 | 6.916 | 3.458.000
3.732 | Plasmon optical disks (940 MB) | 1991 | 3.732 | 7.464.000
50 | CD-R (640 MB) | 1996 | | 100.000
Slide 22
EXPERIENCES ON MIGRATION
2. Projects from 1992 – 1996:
– Data Base Server under OS/2 and DB2
– Access and digitization workstations on PCs with OS/2
– The relational Data Base keeps the hierarchical structure of the documentation
– Images stored on CD-Rs
  Directory structures and image names changed.
  Metadata in binary control files: each image has information about signature, position in the hierarchical structure, page number, notes
  Image compression: JPEG
  Metadata in images: resolution, date, dimensions
Slide 23
EXPERIENCES ON MIGRATION
Example: metadata in Binary Control File
– The file keeps information about the hierarchical structure
– It maintains relationship between each image file and its position in the document.
– The control file and its metadata can be imported into the database
Slide 24
EXPERIENCES ON MIGRATION
Migration of Images of Archivo de Indias from 10.600 optical disks to 6.000 CD-Rs
– The images of a bundle are stored on 1 or 2 CD-Rs
– Reading of optical disks through the network
– No direct connectivity between optical disks and Windows NT
– Main operation tasks:
  Decompression of the DPCM format
  Compression to JPEG format
  Temporary storage on magnetic disk
  All images of the bundle are copied to CD-R
  Verification of images by reading
  6.000 CD-Rs, and 6.000 CD-Rs as backup copy
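The slides do not describe the DPCM format that had to be decompressed, so the following is only a generic illustration of lossless DPCM (differential pulse-code modulation) with a previous-pixel predictor. The predictor, the function names and the sample row are assumptions. A real codec would also entropy-code the residuals, which is where the actual size reduction comes from.

```python
def dpcm_encode(row: list[int]) -> list[int]:
    """Replace each pixel by its difference from the previous pixel."""
    previous = 0
    residuals = []
    for pixel in row:
        residuals.append(pixel - previous)
        previous = pixel
    return residuals


def dpcm_decode(residuals: list[int]) -> list[int]:
    """Rebuild the pixels by accumulating the differences."""
    previous = 0
    row = []
    for diff in residuals:
        previous += diff
        row.append(previous)
    return row


# One scan line with 16 gray levels (values 0-15), as in the 1988-1992 digitization.
row = [7, 7, 8, 8, 8, 9, 15, 15]
encoded = dpcm_encode(row)
print(encoded)                      # mostly small values, cheap to entropy-code
assert dpcm_decode(encoded) == row  # lossless round trip
```

Because decoding restores every pixel exactly, the gray-level data can be recovered in full before it is recompressed in another format.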
Slide 25
EXPERIENCES ON MIGRATION
Migration of Images from 6.916 WORM IBM disks to CD-Rs
– Typically 4 WORM disks (200 MB each) in 1 or 2 CD-Rs
[Diagram: IBM Disks to CD-R]
Reading station (IBM optical drives): Microchannel IBM PS/2; file system driver for OS/2; OS/2 1.3 and LAN Server; Token Ring Microchannel card
Writing station (CD-R drives): Pentium PC; Windows NT; Token-Ring PCI card; 3 GB disk; SCSI interface
Connection: Token Ring network
Slide 26
EXPERIENCES ON MIGRATION
Migration of Images from 3.732 WORM Plasmon disks to CD-Rs
– 1 WORM Plasmon disk (940 MB) in 1 or 2 CD-Rs
[Diagram: Plasmon Disks to CD-R]
Reading station (Plasmon drives): PC with i486; SCSI interface; file system driver for OS/2; OS/2 3.0; Ethernet card
Writing station (CD-R drives): Pentium PC; Windows NT; Token-Ring PCI card; 3 GB disk; SCSI interface
Connection: Ethernet hub network
Slide 27
EXPERIENCES ON MIGRATION
Migration of Images of Archivo de Indias from 10.600 optical disks to 6.000 CD-Rs
– Requirements of personnel and time
  3 operators during 4 months
Similar migration schemes with fewer images:
• Library Sancho el Sabio (Vitoria) 1.000.000 images
• University of Salamanca 700.000 images
• Archivo General Militar, Segovia 200.000 images
• Archivo del Monasterio Poblet 100.000 images
Slide 28
EXPERIENCES ON MIGRATION
3. Projects from 1996 to now:
– Oracle Data Base
– Access and digitization workstations on PCs with Windows NT through Windows XP
– Capturing images also using standard programs and their metadata
– Images stored on magnetic disks. CD-ROMs as backup
  Metadata in database: scanning operator, date of creation, signature, path, size in bytes… Data about control of the information
  Metadata in image: resolution, dimensions… Data for presentation on computers and for printing
  Image quality: 200 – 300 dpi, 256 gray levels; color images
  Standard formats: TIFF, CCITT Group IV, JPEG, PDF
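As a rough illustration of keeping image metadata in a database, the sketch below stores the fields listed above (scanning operator, date of creation, signature, path, size in bytes) in a single table. The projects used Oracle; `sqlite3` from the Python standard library is used here only as a stand-in. The table layout, column names, the `page` column and the sample values are all assumptions, not the actual ArchiDOC schema.

```python
import sqlite3


def create_image_table(conn: sqlite3.Connection) -> None:
    """Create a hypothetical per-image metadata table."""
    conn.execute(
        """
        CREATE TABLE image (
            id         INTEGER PRIMARY KEY,
            signature  TEXT    NOT NULL,        -- call number of the document
            page       INTEGER NOT NULL,        -- assumed: page order inside the document
            path       TEXT    NOT NULL UNIQUE, -- location of the image file
            size_bytes INTEGER NOT NULL,
            operator   TEXT,                    -- scanning operator
            created    TEXT                     -- date of creation (ISO 8601)
        )
        """
    )


def register_image(conn, signature, page, path, size_bytes, operator, created) -> None:
    """Record one image file and its control metadata."""
    conn.execute(
        "INSERT INTO image (signature, page, path, size_bytes, operator, created) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (signature, page, path, size_bytes, operator, created),
    )


def pages_of(conn, signature) -> list[str]:
    """Return the image paths of one document in page order."""
    rows = conn.execute(
        "SELECT path FROM image WHERE signature = ? ORDER BY page", (signature,)
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
create_image_table(conn)
# Hypothetical sample records
register_image(conn, "EXAMPLE,12", 2, "bundle12/0002.jpg", 340_000, "op1", "2001-05-14")
register_image(conn, "EXAMPLE,12", 1, "bundle12/0001.jpg", 352_000, "op1", "2001-05-14")
print(pages_of(conn, "EXAMPLE,12"))
```

Keeping the path as an ordinary column means a later change of storage medium only requires updating this table, not renaming the image files.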
Slide 29
EXPERIENCES ON MIGRATION
Example: metadata in database
[Screenshots: Modes of Image Display; Management of Image Access]
Slide 30
EXPERIENCES ON MIGRATION
Example: metadata XML File
– Same functionality as the binary control file
– Standard: virtually any program can import these metadata
Slide 31
EXPERIENCES ON MIGRATION
Migration of Archivo de Indias from CD-R to magnetic disk in 2000
– Project for online access and Internet
  Just copy. Images are already JPEG-compressed
  10 RAID cabinets of 350 GB each (8 disks x 50 GB)
  1 operator was required for 1 month to copy from a CD-ROM tower to magnetic disks
– Transfer rate from different media:
Media | Transfer rate | Image | Bundle
IBM optical disk | 60 KB/s | 6 seconds | 4 hours
Plasmon optical disk | 100 KB/s | 3 seconds | 1 hour
CD-R 16x | 2,5 MB/s | <1 second | 5 minutes
Magnetic disk | 80 MB/s | | 1 minute
Similar migrations:
Sancho el Sabio Library (Vitoria) 1 million images
Zabalburu Library 700.000 images
Military Archives 500.000 images
Archivo General Navarra 600.000 images
Komintern Archives (Moscow) 1 million images
........
Komintern Archives, Moscow
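A raw transfer time can be estimated from the rates in the table above by dividing a data size by the rate. The sketch assumes a 350 KB image and a 700 MB bundle (figures from the earlier slide on WORM storage) and decimal units. It covers transfer only and ignores disk handling, seek times and file-system overhead, so its results are not expected to match the times on the slide exactly.

```python
# Transfer rates from the table, expressed in KB/s (decimal units assumed).
RATES_KB_S = {
    "IBM optical disk": 60,
    "Plasmon optical disk": 100,
    "CD-R 16x": 2_500,
    "Magnetic disk": 80_000,
}

IMAGE_KB = 350        # assumed image size, from the WORM storage slide
BUNDLE_KB = 700_000   # assumed bundle size (700 MB), from the same slide


def transfer_seconds(size_kb: float, rate_kb_s: float) -> float:
    """Raw transfer time, ignoring handling and seek overhead."""
    return size_kb / rate_kb_s


for media, rate in RATES_KB_S.items():
    image_s = transfer_seconds(IMAGE_KB, rate)
    bundle_min = transfer_seconds(BUNDLE_KB, rate) / 60
    print(f"{media}: {image_s:.2f} s per image, {bundle_min:.1f} min per bundle")
```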
Slide 32
SERVERS AND IMAGE STORAGE
Archivo General de Indias
[Diagram: Image Server; RAID Cabinets 1 to 10; Data Base Server; Domain Controller Server; WEB Server; four UPS units]
Slide 33
Archivo General de Indias
[Diagram: Data Base Server; Domain Controller Servers; WEB Servers; Image Server; RAID Cabinets 1 and 2; space reserved for RAID Cabinet 3; UPS units, one of them reserved; auto-replicated online remote disk subsystem for backup and service; local network]
Slide 34
MIGRATION TASKS
• Analysis of origin and destination data models
• Equivalence between the fields of the origin and destination models
– New versions include new metadata not available before
• Development of migration software
• Testing with a limited number of objects
• Display of information in a destination card
• Application of migration to all data
• Verification of results
• Correction of errors:
– Sometimes some images cannot be copied and must be recovered from alternative media or even digitised again
Komintern Archives, Moscow
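The "application, verification, correction" steps above can be sketched as a copy loop that checks every file after writing it and collects the ones that failed, so they can be recovered from alternative media or digitised again. The slides only say that images were verified by reading. The SHA-256 comparison used here is a substituted technique, and the function names are assumptions.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Checksum of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate_tree(source: Path, destination: Path) -> list[Path]:
    """Copy every file under source; return relative paths that failed verification."""
    failed = []
    for src in sorted(p for p in source.rglob("*") if p.is_file()):
        relative = src.relative_to(source)
        dst = destination / relative
        dst.parent.mkdir(parents=True, exist_ok=True)
        try:
            shutil.copyfile(src, dst)
            verified = sha256_of(src) == sha256_of(dst)
        except OSError:
            # Unreadable source or failed write: report it for manual recovery.
            verified = False
        if not verified:
            failed.append(relative)
    return failed
```

A call such as `migrate_tree(Path("cdr_mount"), Path("raid/bundle_0001"))`, with hypothetical paths, returns an empty list when every image was copied and read back correctly.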
Slide 35
MAIN COST FACTORS
• Preparation of the system for migration
– Hardware and basic software:
  Magnetic disk storage for images
  PCs with appropriate OS and DB manager
• Development of software (1 programmer, 2-3 weeks of work)
– Software development for migration
– Testing of migration of data
• Operation (usually less than 1 week)
– Significant operator work when removable media are involved
Komintern Archives, Moscow
Slide 36
BEST PRACTICES FOR PRESERVATION
• General principles
– Based on PCs and mainstream commercial equipment
– Key hardware provided by first-class IT companies
– Database managers in widespread use
– Consultations with institutions undertaking projects
– Based on standard elements and formats, official or de facto, like TIFF, JPEG, XML, etc.
– Modular, allowing progressive installation and easy update of elements
– Selection of software:
  Functionalities
  Number of installations
  Maintenance
  Provided by an IT company established in the sector
– Key factors:
  Server, operating system, database manager
  Backup policies
Slide 37
BEST PRACTICES FOR PRESERVATION
• Digitization
– Capture systems:
  Robust flatbed scanners (A3)
  Overhead (zenithal) scanners. Digital cameras with limitations.
– Use of standard compression formats: JPEG, CCITT Group IV
– Ensure that digital images will allow a broad range of future use
– Capture the highest quality image technically possible and economically feasible for large-scale production
– Capture the informational content / physical appearance
– Fast and easy correction of errors
• Criteria for holding selection
– Value
– Condition
– Use
– Acceptability of the digital object
– Access aids
Slide 38
BEST PRACTICES FOR PRESERVATION
• Storage
– Media of wide use and low cost:
  Magnetic disk for online image service (especially under high demand)
  Disks with redundancy
  Backup on high-capacity tapes (10/20 GB)
  One or two units available as hot swap
  It allows migration without operator intervention
  In a distributed network they may need to be stored online in multiple locations
  CD-R or DVD as backup for offline access in case of system failure
– In general there is little experience in storing massive quantities of culturally valuable materials
• Backup and Recovery
– Use industry-standard backup and recovery procedures:
  Periodic backup to magnetic tape
  A copy held on site for near-term recovery
  A copy stored off site for disaster recovery
Slide 39
APPLICATION OF MIGRATION
Traditional approach of Computer Science
• Migration of media
– Refreshing digital information by copying it from medium to medium
– Conversion of files to another format to be interpreted by new programs; to a reduced number of standard formats
• Migration of technology platform
– Server and PCs
– Peripherals
– Capture devices and CD-R writers
– Operating system and database manager
• Migration of the digitising and access software
– Maintenance of software on the new platform
– New software versions for digitising and access
Slide 40
PLANNING
• Planning for migration is difficult due to:
– the limited experience
– the impossibility of predicting when media, software and hardware will become obsolete
• No single strategy applies to all formats of digital information
• It varies across application environments, for different formats of digital materials and for preserving different degrees of computation, display and retrieval
• It requires a unique new solution for each new format and process
• Automatic conversion is only partially possible
• In general there are no firm plans for migration, other than staying up to date with current technologies by migrating the content
• Usually there is urgency involved in migration, caused by the obsolescence of software and hardware
Slide 41
SCHEDULE
• Schedule
– New releases of software, databases, etc. can be expected every 2-3 years, with minor updates more often
– Migration from one storage medium to another every 4-5 years, if not online
– Migration to new hardware and software occurs less frequently but can be expected every 5-10 years
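The intervals above can be turned into a simple planning aid that gives, for the year of the last change, the range of years in which the next one should be expected. This is only a sketch: the interval values are those on the slide, while the dictionary keys and the function name are assumptions.

```python
# Expected intervals in years, as stated on the slide.
INTERVALS_YEARS = {
    "software release": (2, 3),
    "offline storage media": (4, 5),
    "hardware and software platform": (5, 10),
}


def migration_window(last_year: int, kind: str) -> tuple[int, int]:
    """Earliest and latest year in which the next migration is expected."""
    low, high = INTERVALS_YEARS[kind]
    return last_year + low, last_year + high


for kind in INTERVALS_YEARS:
    start, end = migration_window(1996, kind)
    print(f"{kind}: next change expected between {start} and {end}")
```

For media written in 1996 the window opens in 2000, the same year in which the Archivo de Indias images moved from CD-R to magnetic disk.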