Digital Preservation (E-Archiving)

Disseminating statistics: Internet and Publications Madrid, 3-5 March 2008 Digital Preservation (E-Archiving) Marta Melgar García [email protected]

Transcript of Digital Preservation (E-Archiving)

Page 1: Digital Preservation (E-Archiving)

Disseminating statistics: Internet and Publications, Madrid, 3-5 March 2008

Digital Preservation (E-Archiving)

Marta Melgar García [email protected]

Page 2: Digital Preservation (E-Archiving)

Presentation Index

• Introduction
• Digital Preservation Strategies
• Digital Preservation Problems
• INE Journals digital repository
• INEBase History
– Our Virtual Library
– Project Phases
– The Technical Process in 3 steps
– The Publisher
– Visualization On Internet
– Interesting Data
– IT Data

Page 3: Digital Preservation (E-Archiving)

Introduction

• Digital preservation combines policies, strategies and actions that ensure access to information in digital formats over time.

• Publications will be available and accessible for generations to come.

Source: American Library Association

Digital Preservation definition

Page 4: Digital Preservation (E-Archiving)

Digital preservation strategies and actions address content creation, integrity and maintenance.

– Planning
– Content creation
– Content integrity
– Content maintenance
– Problems

Source: ALA

Digital Preservation strategies

Page 5: Digital Preservation (E-Archiving)

Content creation includes:

• Clear and complete technical specifications

• Production of reliable master files

• Sufficient descriptive, administrative and structural metadata to ensure future access

• Detailed quality control of processes

Digital Preservation strategies

Page 6: Digital Preservation (E-Archiving)

Program planning, management and evaluation should consider:

• Risk assessment and management.
• Cost benefit analysis.
• Legal issues.
• The role of file formats, standards and metadata.
• Storage and maintenance.
• Disaster planning.
• The relationship between preservation and access.
• Preservation strategies, approaches, and methodologies.
• Technology forecasting for preservation.

Source: Cornell University Library

Digital Preservation strategies

Page 7: Digital Preservation (E-Archiving)

Content integrity includes:

• Documentation of all policies, strategies and procedures

• Use of persistent identifiers

• Recorded provenance and change history for all objects

• Verification mechanisms

• Attention to security requirements

• Routine audits

Digital Preservation strategies
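
One common verification mechanism is a fixity check: a checksum is recorded for each master file and recomputed during routine audits. The following is a minimal sketch, not taken from the presentation; the manifest layout and the function names are illustrative assumptions.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file under root (by relative path) to its checksum."""
    root = Path(root)
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def audit(root, manifest):
    """Compare the files on disk with a previously stored manifest."""
    current = build_manifest(root)
    missing = sorted(set(manifest) - set(current))
    changed = sorted(
        name for name in manifest
        if name in current and current[name] != manifest[name]
    )
    return {"missing": missing, "changed": changed}
```

A manifest built when the master files are created can be stored alongside them; a later `audit` call that returns two empty lists suggests the content is still intact.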

Page 8: Digital Preservation (E-Archiving)

Content maintenance includes:

• A computing and networking infrastructure

• Storage and synchronization of files at multiple sites

• Continuous monitoring and management of files

• Programs for refreshing, migration and emulation

• Written disaster prevention and recovery plans

• Periodic review and updating of policies and procedures

Digital Preservation strategies

Page 9: Digital Preservation (E-Archiving)

• We have to preserve records in an electronic era where change and speed are valued more highly than conservation and longevity.

• Enormous amounts of digital information are already lost forever.

• Information technologies become obsolete within a short period of time. This dynamic creates an unstable and unpredictable environment for the continuance of hardware and software.

• There is a proliferation of document and media formats, each potentially carrying its own software and hardware dependencies. Copying these formats from one storage device to another is simple. However, merely copying bits is not sufficient for preservation purposes: if the software is not available, the information will be lost. There is also the complexity of maintaining the integrity of links, embedded objects, etc.

• Digital preservation is expensive.

• Intellectual property and licensing regimes are increasingly restrictive.

Source: http://www.ifla.org

Digital Preservation problems

Page 10: Digital Preservation (E-Archiving)

Process steps:

1. In our OPAC (Online Public Access Catalogue), we select the 856 field (for electronic resources).

2. We create a fixed URL, which points to a location on our own server.

3. We scan the journals into PDF format.

4. We upload the PDF files to the server via FTP.

5. We take the fixed URL as the root and add each PDF file to it.

6. We link every file to the OPAC Web.

7. We view the digitized file in our OPAC Web.

INE Journals digital repository

In our Library we have created a digital repository of printed journals.
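
The process steps for the journals repository can be sketched roughly as follows. This is an illustration only, not the library's actual tooling: the fixed URL, the file names and the function names are hypothetical, and the 856 field is shown as a plain-text rendering (indicators "4" and "0" for HTTP access to the resource itself, subfield $u for the URI).

```python
from ftplib import FTP
from pathlib import Path
from urllib.parse import quote, urljoin

# Hypothetical fixed URL root on the library's own server (step 2).
FIXED_URL = "http://example.org/journals/"


def pdf_url(filename, fixed_url=FIXED_URL):
    """Step 5: add the PDF file name to the fixed URL root."""
    return urljoin(fixed_url, quote(filename))


def marc_856(url):
    """Steps 1 and 6: a textual 856 field that links the record to the PDF."""
    return f"856 40 $u {url}"


def upload_pdf(host, user, password, local_path, remote_dir):
    """Step 4: upload one scanned PDF over FTP (needs a real server)."""
    local_path = Path(local_path)
    with FTP(host) as ftp, open(local_path, "rb") as handle:
        ftp.login(user, password)
        ftp.cwd(remote_dir)
        ftp.storbinary(f"STOR {local_path.name}", handle)


url = pdf_url("boletin_1950_01.pdf")
print(marc_856(url))
```

Keeping the URL root fixed means that only the file name varies from one catalogue record to the next, so files can be added to the server without touching existing links.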

Page 11: Digital Preservation (E-Archiving)

Field 856

INE Journals digital repository

Page 12: Digital Preservation (E-Archiving)

INE Journals digital repository

Page 13: Digital Preservation (E-Archiving)

INE Journals digital repository

Page 14: Digital Preservation (E-Archiving)

INE Journals digital repository

Page 15: Digital Preservation (E-Archiving)

INE Journals digital repository

Page 16: Digital Preservation (E-Archiving)

INE Journals digital repository

Page 17: Digital Preservation (E-Archiving)

Some interesting data:

• No implementation cost

• Personnel involved: 2 people

• Project time: one and a half years

• Current status: more than 1,000 journal issues digitized and published

INE Journals digital repository

Page 18: Digital Preservation (E-Archiving)

INEbase history

Background

• 1996: The INE joins the Internet

• 2000: INEbase is born; all statistical production is offered on the Internet

• 2004: What shall we do with past information only available in printed format? Target: opening up to the public the historical collection of INE publications only available on paper

Statistical books 1858-1997 available on the web

Page 19: Digital Preservation (E-Archiving)

We had to choose between different alternatives:

• Tables in PC-Axis format

• Complete PDF versions of the books

• INEbase history

INEbase history: a new section of INEbase

Page 20: Digital Preservation (E-Archiving)

INEBase History: Our Virtual Library

Page 21: Digital Preservation (E-Archiving)

INEBase History: Our Virtual Library

Page 22: Digital Preservation (E-Archiving)

1858 Yearbook

INEBase History: Our Virtual Library

Page 23: Digital Preservation (E-Archiving)

Population (28 tables)

INEBase History: Our Virtual Library

Page 24: Digital Preservation (E-Archiving)

INEBase History: Our Virtual Library

Page 25: Digital Preservation (E-Archiving)

INEBase History: Our Virtual Library

Page 26: Digital Preservation (E-Archiving)

• Phase 1.

– What should be published? The most symbolic and representative volumes of public statistical activity: Statistical Yearbooks (1858 – 1997) and Population Censuses (1900 – 1970)

– Outsource scanning (more than 100,000 pages)

– Outsource the software development

• Phase 2.

– Cataloguing starts

– Software improvements suggested by use

– 20 publications catalogued before publishing

INEbase history: Project Phases

Page 27: Digital Preservation (E-Archiving)

• Phase 3.

– Internet launch takes place with 20 Yearbooks and 1 Census

• Phase 4.

– Cataloguing and web publication of 78 Yearbooks and 9 Censuses (34 volumes)

• Phase 5.

– Incorporation of new publications

– Scan the Agrarian Census and VS statistics

– Programme adaptation

– Cataloguing & publication

INEbase history: Project Phases

Page 28: Digital Preservation (E-Archiving)

1. Scanning and OCR

• Scanning using the originals

– Unbinding (old and non-unique)

– Guillotining (repeated and unimportant)

– Microfiche (rare, old copies)

• TIFF files obtained

• OCR programme used to generate the txt files used by the search engine

• Once the PDF file is obtained, it is ready to be catalogued

INEbase history: The Technical Process in 3 steps

Page 29: Digital Preservation (E-Archiving)

2. Cataloguing books into the system: “cataloguer” role

1st step: create an index with categories until we get to the final node: the statistical tables

2nd step: associate one or more PDF documents with each node

INEbase history: The Technical Process in 3 steps
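
The two cataloguing steps amount to building a tree whose final nodes are the statistical tables and then attaching PDF documents to nodes. A minimal sketch of such a structure follows; the class, the titles and the file name are invented for illustration and do not describe the cataloguing software actually developed for INE.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One entry in a publication's index; final nodes are tables."""
    title: str
    children: List["Node"] = field(default_factory=list)
    pdfs: List[str] = field(default_factory=list)  # associated PDF documents

    def add(self, child):
        """Attach a child node and return it, so calls can be nested."""
        self.children.append(child)
        return child

    def tables(self):
        """Yield the final nodes, i.e. those without children."""
        if not self.children:
            yield self
        for child in self.children:
            yield from child.tables()


# 1st step: build the index down to the tables (titles are invented).
book = Node("Statistical Yearbook 1858")
population = book.add(Node("Population"))
table = population.add(Node("Population by province"))

# 2nd step: associate one or more PDF documents with a node.
table.pdfs.append("yearbook_1858_population_01.pdf")

print([t.title for t in book.tables()])
```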

Page 30: Digital Preservation (E-Archiving)

INEbase history: The Technical Process in 3 steps

How is cataloguing done? Practical example

Creation of a virtual book: Statistical Yearbook 2010

Node blocked

Page 31: Digital Preservation (E-Archiving)

INEbase history: The Technical Process in 3 steps

Creation of the index publication

Creating as many chapters as needed

Page 32: Digital Preservation (E-Archiving)

INEbase history: The Technical Process in 3 steps

Creation of the tables and association to the corresponding PDF-doc.

Page 33: Digital Preservation (E-Archiving)

INEbase history: The Technical Process in 3 steps

Recreating the hierarchical tree

All the publication's documents appear associated with their corresponding table

Cataloguer’s work ends here

Nodes unblocked

Page 34: Digital Preservation (E-Archiving)

3. Review before publishing

• Cataloguing should be reviewed before being published

• Who reviews? There is a specific role, the “proof-reader”, but this role has not really been used; in practice, another cataloguer does the review

• Once the proof-reading work is finished, the book is ready for publication

Proof-reader’s work ends here

INEbase history: The Technical Process in 3 steps

Page 35: Digital Preservation (E-Archiving)

Main task: to publish books; other tasks: user and transmission control, node translation

Blocked node

Published node

Unblocked node

Book ready to be shown on the Internet

And the translation process begins

INEbase history: The Publisher
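
The blocked, unblocked and published node states mentioned across these slides can be read as a small state machine. The sketch below is one interpretation, not a description of the real software: the transition table is an assumption, in particular the idea that an unblocked node can be blocked again for rework and that a published node is final.

```python
from enum import Enum


class NodeState(Enum):
    BLOCKED = "blocked"        # a cataloguer is still working on the node
    UNBLOCKED = "unblocked"    # cataloguing is finished
    PUBLISHED = "published"    # the publisher has released the node


# Assumed transitions; the slides only name the three states.
ALLOWED = {
    NodeState.BLOCKED: {NodeState.UNBLOCKED},
    NodeState.UNBLOCKED: {NodeState.BLOCKED, NodeState.PUBLISHED},
    NodeState.PUBLISHED: set(),
}


def advance(current, target):
    """Return the new state, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot go from {current.value} to {target.value}")
    return target


state = NodeState.BLOCKED
state = advance(state, NodeState.UNBLOCKED)   # cataloguer's work ends
state = advance(state, NodeState.PUBLISHED)   # publisher releases the book
print(state.value)
```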

Page 36: Digital Preservation (E-Archiving)

Cataloguing Server

Dissemination Server

Transmission process: synchronization of servers

This step might not be needed
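
A synchronization step of this kind could work by comparing the PDF collections held on the two servers and copying only what differs. The sketch below assumes both collections are reachable as local directories, which may not match the real setup; the function names are hypothetical, and the database copy is not covered.

```python
import hashlib
import shutil
from pathlib import Path


def snapshot(root):
    """Map relative file paths under root to a SHA-256 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in root.rglob("*")
        if p.is_file()
    }


def plan_sync(source, target):
    """Decide which files to copy to, or delete from, the target."""
    src, dst = snapshot(source), snapshot(target)
    to_copy = sorted(name for name in src if dst.get(name) != src[name])
    to_delete = sorted(name for name in dst if name not in src)
    return to_copy, to_delete


def apply_sync(source, target):
    """Make the target directory mirror the source directory."""
    to_copy, to_delete = plan_sync(source, target)
    for name in to_copy:
        destination = Path(target) / name
        destination.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(Path(source) / name, destination)
    for name in to_delete:
        (Path(target) / name).unlink()
    return to_copy, to_delete
```

Calling `plan_sync` first gives a dry run of what `apply_sync` would change.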

Page 37: Digital Preservation (E-Archiving)

INEbase history: Visualisation on the Internet

Page 38: Digital Preservation (E-Archiving)

INEbase history: Visualisation on the Internet

Yearbooks ordered by decades

Page 39: Digital Preservation (E-Archiving)

INEbase history: The hierarchical tree....

On the dissemination server

On the cataloguing programme

Page 40: Digital Preservation (E-Archiving)

And just a click on the required table

And a 9-page PDF document is shown

Page 41: Digital Preservation (E-Archiving)

INEbase history: Anything else to be taken into account

Search engine

Change language

No. of tables

Size of pdf file

Page 42: Digital Preservation (E-Archiving)

INEbase history: The search engine

Direct access to the pdf document

Page 43: Digital Preservation (E-Archiving)

The search engine is based on the table titles (sorry, only in Spanish) and the hierarchical tree (in English as well)

Of course, you can also use INE’s general search engine:

INEbase history: The search engine
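
A title-based search of this kind can be approximated with a small inverted index over the table titles. The sketch below is illustrative only and does not reflect how the INEbase search engine is implemented; the table ids and titles are made up, and accents are folded so that a query typed without them still matches.

```python
import re
import unicodedata
from collections import defaultdict


def normalise(text):
    """Lower-case and strip accents, so 'Población' matches 'poblacion'."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


def build_index(titles):
    """Map each word to the set of table ids whose title contains it."""
    index = defaultdict(set)
    for table_id, title in titles.items():
        for word in re.findall(r"\w+", normalise(title)):
            index[word].add(table_id)
    return index


def search(index, query):
    """Return the ids of tables whose title contains every query word."""
    words = re.findall(r"\w+", normalise(query))
    if not words:
        return set()
    return set.intersection(*(index.get(w, set()) for w in words))


# Invented table titles, keyed by a hypothetical table id.
titles = {
    1: "Población por provincias",
    2: "Población por sexo",
    3: "Producción agrícola",
}
index = build_index(titles)
print(sorted(search(index, "poblacion sexo")))
```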

Page 44: Digital Preservation (E-Archiving)

Population censuses: everything shown for the yearbooks also applies

INEbase history: The search engine

Page 45: Digital Preservation (E-Archiving)

1- Economic data

• Initial scanning stage: 12,000 Euros, 110,000 pages

• External development: 90,000 Euros

2- Deadlines

• Scanning + programme development: 6 months

• Cataloguing: 20 months

3- Number of scanned pages

• Yearbook: 70,000 pages

• Census: 30,000 pages

• Total: 100,000 pages

INEbase history: Some Interesting Data
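
As a quick worked calculation from the economic data, the scanning cost per page and the total external spend can be derived as below. The variable names are ours. Note that the slide quotes 110,000 pages for the scanning stage but a total of 100,000 scanned pages; the sketch uses the former, and with 100,000 pages the unit cost would be 0.12 Euros.

```python
scanning_cost_eur = 12_000      # initial scanning stage
scanned_pages = 110_000         # pages quoted for that stage
development_cost_eur = 90_000   # external development

cost_per_page = scanning_cost_eur / scanned_pages
total_external_cost = scanning_cost_eur + development_cost_eur

print(f"Scanning cost per page: {cost_per_page:.3f} EUR")    # 0.109 EUR
print(f"Total external cost: {total_external_cost:,} EUR")   # 102,000 EUR
```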

Page 46: Digital Preservation (E-Archiving)

4- Personnel used:

• Cataloguing: 0 – 3 Recording assistants

• Indexes translator: 1 trainee

• Publisher: 1 – 2 Statisticians

• IT support team

5- How many people use INEbase History?

• Page views in October: 77,623 (1.2 % of total)

INEbase history: Some Interesting Data

Page 47: Digital Preservation (E-Archiving)

IT infrastructure: a reasonably simple system:

• A cataloguing server houses a copy of the work from the database and the collection of PDF pages; multiple cataloguer PCs provided with a "client" application connect to the server

• One of the components of the family of web servers at www.ine.es houses the dissemination server (the software, plus a copy of the database and a copy of the collection of PDF pages). This is the system that serves files to the Internet

• There are copy and safety mechanisms between one environment and the other

• The environment is similar to a content management programme

INEbase history: IT DATA

Page 48: Digital Preservation (E-Archiving)

IT infrastructure: a reasonably simple system:

• Client programmes developed with Microsoft .NET.

• Server programme developed with Java.

• Catalogue and dissemination database: Oracle 9i.

• Programmes for working with PDF files obtained from a manufacturer specialised in this kind of software.

• Conceptual design, setting of requirements, selection of platforms: National Statistics Institute.

• Scanning of originals: Proco S.A.

• Technological partner for development: Sopra Group.

INEbase history: IT DATA

Page 49: Digital Preservation (E-Archiving)

Thank you very much for your attention