Portal of the GDR press Project of the Staatsbibliothek zu Berlin supported by the Deutsche Forschungsgemeinschaft Almut Ilsen, Knut Lohse, Staatsbibliothek zu Berlin, 2012

Portal of the GDR press

Project of the Staatsbibliothek zu Berlin, supported by the Deutsche Forschungsgemeinschaft

Almut Ilsen, Knut Lohse, Staatsbibliothek zu Berlin, 2012

Subject of the project

GDR Newspaper Portal: digitisation of GDR newspapers and development of a portal with additional historical background information provided by the ZZF, the Centre for Contemporary History in Potsdam.

Three GDR newspapers will be digitised; the full text of the articles will be indexed and made available:
• Neues Deutschland [ND] (April 23rd, 1946 – October 3rd, 1990) – 120,000 pages
• Berliner Zeitung [BZ] (May 21st, 1945 – October 3rd, 1990) – 140,000 pages
• Neue Zeit [NZ] (July 22nd, 1945 – July 5th, 1994) – 140,000 pages

Preconditions

The contents of the newspapers are subject to copyright. It was therefore necessary to conclude contracts with the publishers of the three newspapers and with two collecting societies.

Project process

The desired result is reached in several steps:

• Locating and unbinding of bound newspapers, gap management
• Scanning originals: grayscale TIFF
• Automatic image fixing and binarization (fine rotation, conversion to black-and-white images)
• Automatic recognition of layout and text by OCR (optical character recognition)
• Manual layout recognition to improve quality regarding
– Fixing of reading order
– Concatenating of text blocks to generate complete articles
– Labeling of blocks (headlines, images, non-editorials …)
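The binarization step (conversion of grayscale scans to black-and-white images) can be illustrated with a minimal sketch. The slides do not say which thresholding method the project used; Otsu's method is swapped in here purely as a common example, and the function names and the tiny sample page are hypothetical. Fine rotation is not covered.

```python
def otsu_threshold(pixels):
    """Return a gray level (0-255) separating dark from light pixels (Otsu's method)."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for level in range(256):
        weight_dark += histogram[level]
        sum_dark += level * histogram[level]
        if weight_dark == 0:
            continue  # no pixels at or below this level yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # all pixels are already in the dark class
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        # Between-class variance (up to a constant factor).
        between = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if between > best_variance:
            best_variance, best_threshold = between, level
    return best_threshold


def binarize(rows, threshold):
    """Map each pixel to 1 (white) if it is above the threshold, else 0 (black)."""
    return [[1 if value > threshold else 0 for value in row] for row in rows]


# Hypothetical 2x3 grayscale "page": dark ink on a light background.
page = [[30, 40, 220], [35, 210, 230]]
threshold = otsu_threshold([value for row in page for value in row])
black_and_white = binarize(page, threshold)
```

A global threshold such as this is only one option; unevenly lit or yellowed newsprint may call for a locally adaptive threshold instead.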

Problems of automatic layout recognition

Headline spans two columns of a three-column article (type 1)

Two columns are concatenated into the article marked by the headline; the last column becomes a separate article.

Example: Neues Deutschland, April 13th, 1970, page 7

Problems of automatic layout recognition

Headline spans two columns of a three-column article (type 2)

Two columns are concatenated into the article marked by the headline; the first column becomes a separate article.

Example: Neues Deutschland, January 15th, 1961, page 4

Problems of automatic layout recognition

Headline of a multi-column article (combination of type 1 / type 2)

Columns appear to be concatenated arbitrarily into one article; the other columns become separate articles.

Example: Neues Deutschland, January 15th, 1961, page 6
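How such errors can arise may be illustrated with a deliberately naive rule: a column is assigned to an article only if it lies entirely beneath the headline. This is not the algorithm used by the project's layout recognition, which the slides do not describe; the `Block` type, the coordinates, and the rule itself are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A rectangular layout block, reduced to its horizontal extent."""
    name: str
    left: int
    right: int


def columns_under_headline(headline, columns):
    """Naive rule: a column belongs to the article only if it lies under the headline."""
    assigned, orphaned = [], []
    for column in columns:
        inside = headline.left <= column.left and column.right <= headline.right
        (assigned if inside else orphaned).append(column.name)
    return assigned, orphaned


# A three-column article; every column is 100 units wide (hypothetical values).
columns = [Block("col1", 0, 100), Block("col2", 100, 200), Block("col3", 200, 300)]

# Type 1: the headline covers the first two columns, so the last column is split off.
type1 = columns_under_headline(Block("headline", 0, 200), columns)

# Type 2: the headline covers the last two columns, so the first column is split off.
type2 = columns_under_headline(Block("headline", 100, 300), columns)
```

In both cases the orphaned column would end up as a separate article, which is the kind of result the manual correction step has to repair.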

Manual correction of errors made by the automatic recognition

Using the so-called "Korrektor" software, single blocks can be re-assigned. The original reading order is then re-established. New articles can also be created.
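The correction operations described here can be modelled with a very small data structure. The sketch below is not the Korrektor software or its data model, neither of which is documented in these slides; it assumes, for illustration only, that each article is an ordered list of block identifiers.

```python
def reassign_block(articles, block_id, target_article, position=None):
    """Move a block into target_article, creating that article if it does not exist.

    `articles` maps an article id to its blocks in reading order.
    If `position` is None the block is appended, otherwise inserted at that index.
    """
    # Detach the block from whichever article currently holds it.
    for blocks in articles.values():
        if block_id in blocks:
            blocks.remove(block_id)

    target = articles.setdefault(target_article, [])
    if position is None:
        target.append(block_id)
    else:
        target.insert(position, block_id)

    # Remove articles that no longer contain any block.
    for article_id in [a for a, blocks in articles.items() if not blocks]:
        del articles[article_id]
    return articles


# Type 1 error: the third column was recognised as an article of its own.
articles = {"A1": ["headline", "col1", "col2"], "A2": ["col3"]}
reassign_block(articles, "col3", "A1")

# Creating a new article from a block that was wrongly merged (hypothetical ids).
other = {"B1": ["headline", "col1", "advert"]}
reassign_block(other, "advert", "B2")
```

Keeping the blocks of an article in an ordered list means that re-assigning a block and fixing the reading order are the same operation.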

Workflow

Workflow steps (in the order shown in the diagram):

• Unbinding of newspaper volumes
• Scanning
• OCR and layout recognition
• quality check
• manual error correction
• quality check
• preparation for presentation
• digital preservation

key (diagram legend): Staatsbibliothek zu Berlin / external contractor

1) MIK-Center, Berlin
2) Fraunhofer IAIS, St. Augustin
3) ArchivInForm, Potsdam

Preparation for presentation

Images

• converting (downscaling) of master images to presentation images
– every page for page view
– every page as thumbnail
– title page for results of calendar browsing
• transferring data to presentation file
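The downscaling of master images to page-view and thumbnail sizes comes down to computing target dimensions that preserve the aspect ratio. A minimal sketch follows; the pixel sizes are hypothetical, since the slides give neither the master resolution nor the presentation sizes, and the actual resampling would be done by an imaging tool.

```python
def scaled_size(width, height, max_edge):
    """Return (width, height) scaled so that the longer edge is at most max_edge."""
    longest = max(width, height)
    if longest <= max_edge:
        return width, height  # never upscale
    scale = max_edge / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


# Hypothetical master scan of 4000 x 6000 pixels.
page_view = scaled_size(4000, 6000, 1500)
thumbnail = scaled_size(4000, 6000, 150)
```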

Meta data

• reworking of METS data (adding PURLs)
• creating of search index data (issues, articles)
• discovering keywords for enrichments (Who is who / GDR)
• creating of search index data (enrichments)
• transferring data to presentation file

Storage requirements

• Presentation: approx. 300 GigaByte per title
• Archive: approx. 4.5 TeraByte per title
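For a rough idea of the overall volume, the per-title figures can be multiplied by the three newspaper titles of the project. This is only an estimate: it assumes that the approximate per-title values apply equally to all three titles, which the slides do not state.

```python
TITLES = 3                       # Neues Deutschland, Berliner Zeitung, Neue Zeit
PRESENTATION_GB_PER_TITLE = 300  # approximate figure from the slide
ARCHIVE_TB_PER_TITLE = 4.5       # approximate figure from the slide


def total_storage(titles=TITLES):
    """Return (presentation storage in GB, archive storage in TB) for all titles."""
    presentation_gb = titles * PRESENTATION_GB_PER_TITLE
    archive_tb = titles * ARCHIVE_TB_PER_TITLE
    return presentation_gb, archive_tb


presentation_gb, archive_tb = total_storage()
print(f"Presentation: approx. {presentation_gb} GB, archive: approx. {archive_tb} TB")
```

Under that assumption the presentation copies need roughly 900 GigaByte and the archive roughly 13.5 TeraByte.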

Presentation of data

• Authentication: personal library card, xlogon.net (DFN in progress)
• calendar
• full-text search

Presentation of data (full-text search)

The full-text search takes into account terms and expressions that were especially characteristic of the GDR. A query for „Berliner Mauer“ [Berlin Wall] also returns results such as „antifaschistischer Schutzwall“ [anti-fascist protective wall], an expression typically used only in the GDR.
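One simple way to obtain this behaviour is to expand a query with GDR-specific equivalents before searching. The portal's actual mechanism is not described in the slides; the term map, the function names, and the sample articles below are hypothetical.

```python
# Hypothetical mapping from a query term to expressions characteristic of the GDR.
GDR_TERMS = {
    "berliner mauer": ["antifaschistischer Schutzwall"],
}


def expand_query(query, term_map=GDR_TERMS):
    """Return the query together with any GDR-specific equivalents."""
    variants = [query]
    variants.extend(term_map.get(query.strip().lower(), []))
    return variants


def search(query, articles, term_map=GDR_TERMS):
    """Return ids of articles whose text contains the query or one of its variants."""
    needles = [variant.lower() for variant in expand_query(query, term_map)]
    return [
        article_id
        for article_id, text in articles.items()
        if any(needle in text.lower() for needle in needles)
    ]


# Hypothetical sample texts, not taken from the newspapers.
articles = {
    "a1": "Ein antifaschistischer Schutzwall sichert die Grenze.",
    "a2": "Die Berliner Mauer wurde 1961 errichtet.",
    "a3": "Die Ernte fiel in diesem Jahr gut aus.",
}
hits = search("Berliner Mauer", articles)
```

A plain substring match like this ignores German inflection (for example „antifaschistischen Schutzwall“); a production index would normally handle that with stemming or a proper search engine.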

Presentation of data (page view)

• Navigation within issue, zoom
• article is highlighted
• list of articles on this page
• full text is displayed, persons are highlighted
• Links to additional (external) information
• possibility to hide text frames

Future goals

• additional login possibility (DFN, in progress)
• more volumes of „Neues Deutschland“, „Berliner Zeitung“ and „Neue Zeit“ will be included before the end of the project (May 2013)
• User-friendly contact form for reporting errors
• OCR error correction module (visitor interaction)

http://zefys.staatsbibliothek-berlin.de/ddr-presse/
