Portal of the GDR press
Project of the Staatsbibliothek zu Berlin –
supported by the Deutsche Forschungsgemeinschaft
Almut Ilsen, Knut Lohse, Staatsbibliothek zu Berlin, 2012
Subject of the project
GDR Newspaper Portal: digitisation of GDR newspapers and development of a portal with additional historical background information provided by the ZZF, the Centre for Contemporary History, in Potsdam
Three GDR newspapers will be digitised; the full text of the articles will be indexed and made available:
• Neues Deutschland [ND] (April 23rd, 1946 – October 3rd, 1990) – 120,000 pages
• Berliner Zeitung [BZ] (May 21st, 1945 – October 3rd, 1990) – 140,000 pages
• Neue Zeit [NZ] (July 22nd, 1945 – July 5th, 1994) – 140,000 pages
Preconditions
The contents of the newspapers are subject to copyright. It was therefore necessary to conclude contracts with the publishers of the three newspapers and with two collecting societies.
Project process
The desired result is reached in several steps:
• Locating and unbinding bound newspapers, gap management
• Scanning originals: grayscale TIFF
• Automatic image fixing and binarization (fine rotation, conversion to black-and-white images)
• Automatic recognition of layout and text OCR (optical character recognition)
• Manual layout recognition to improve quality regarding
  – Fixing of reading order
  – Concatenating text blocks to generate complete articles
  – Labeling of blocks (headlines, images, non-editorials …)
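The binarization step named above (conversion of grayscale scans to black-and-white images) can be illustrated with a small sketch. The slides do not say which method the project used; Otsu's global threshold is assumed here purely as a common textbook technique, and the function names `otsu_threshold` and `binarize` are hypothetical.

```python
from typing import List


def otsu_threshold(pixels: List[int]) -> int:
    """Return the gray level (0-255) that maximises the between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0    # number of pixels at or below the candidate threshold
    sum_dark = 0  # sum of gray levels of those pixels
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue          # no dark pixels yet
        w_light = total - w_dark
        if w_light == 0:
            break             # every pixel is already in the dark class
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        var_between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(rows: List[List[int]]) -> List[List[int]]:
    """Map a grayscale image (rows of 0-255 values) to 0 = black, 1 = white."""
    threshold = otsu_threshold([p for row in rows for p in row])
    return [[0 if p <= threshold else 1 for p in row] for row in rows]


# Tiny synthetic "scan": dark ink (30, 40) on light paper (200, 210).
page = [[30, 200], [210, 40]]
print(binarize(page))
```

A global threshold is the simplest choice; aged newsprint with uneven background often calls for locally adaptive thresholds, which this sketch does not attempt.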
Problems of automatic layout recognition
Headline spans two columns of a three-column article (type 1)
Two columns are concatenated into the article marked by the headline; the last column becomes a separate article.
Neues Deutschland, April 13th, 1970, page 7
Problems of automatic layout recognition
Headline spans two columns of a three-column article (type 2)
Two columns are concatenated into the article marked by the headline; the first column becomes a separate article.
Neues Deutschland, January 15th, 1961, page 4
Problems of automatic layout recognition
Headline of multi-column article (combination of type 1 and type 2)
Columns appear to be concatenated arbitrarily into one article; the other columns become separate articles.
Neues Deutschland, January 15th, 1961, page 6
Manual correction of errors by the automatic recognition
Using the so-called ‚Korrektor‘ software, single blocks can be reassigned. The original reading order is then re-established. New articles can also be created.
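The correction of a type 1 error (last column split off as a separate article) can be sketched as a block reassignment. This is not the data model of the Korrektor software, which the slides do not describe; the classes `Block` and `Article`, the function `reassign_block`, and the reading-order rule (headline first, then columns left to right, then top to bottom) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Block:
    block_id: str
    kind: str      # e.g. "headline", "text", "image"
    column: int    # 0 = leftmost column of the article
    top: int       # vertical position on the page
    text: str = ""


@dataclass
class Article:
    article_id: str
    blocks: List[Block] = field(default_factory=list)

    def sort_reading_order(self) -> None:
        # Assumed rule: headline first, then left to right, then top to bottom.
        self.blocks.sort(key=lambda b: (b.kind != "headline", b.column, b.top))

    def full_text(self) -> str:
        return " ".join(b.text for b in self.blocks if b.text)


def reassign_block(articles: Dict[str, Article], block_id: str, target_id: str) -> None:
    """Move one block to the target article, creating the target if needed."""
    for article in list(articles.values()):
        for block in article.blocks:
            if block.block_id == block_id:
                article.blocks.remove(block)
                if not article.blocks:
                    del articles[article.article_id]  # drop emptied article
                target = articles.setdefault(target_id, Article(target_id))
                target.blocks.append(block)
                target.sort_reading_order()
                return
    raise KeyError(block_id)


# Type 1 error: the third column was recognised as its own article "a2".
articles = {
    "a1": Article("a1", [
        Block("h", "headline", 0, 0, "Headline"),
        Block("c1", "text", 0, 10, "first column"),
        Block("c2", "text", 1, 10, "second column"),
    ]),
    "a2": Article("a2", [Block("c3", "text", 2, 10, "third column")]),
}
reassign_block(articles, "c3", "a1")
print(articles["a1"].full_text())
```

Passing an unused `target_id` creates a new article, which mirrors the third capability mentioned on the slide.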
Workflow
• Unbinding of newspaper volumes
• Scanning
• OCR and layout recognition
• quality check
• manual error correction
• quality check
• preparation for presentation
• digital preservation
key: steps are marked as carried out by the Staatsbibliothek zu Berlin or by an external contractor
external contractors:
1) MIK-Center, Berlin
2) Fraunhofer IAIS, St. Augustin
3) ArchivInForm, Potsdam
Preparation for presentation
Images
• converting (down scaling) master images to presentation images
  – every page for page view
  – every page as thumbnail
  – title page for results of calendar browsing
• transferring data to presentation file
Meta data
• reworking of METS data (adding PURLs)
• creating search index data (issues, articles)
• discovering keywords for enrichments (Who is who / GDR)
• creating search index data (enrichments)
• transferring data to presentation file
Storage requirements
• Presentation: approx. 300 GigaByte per title
• Archive: approx. 4.5 TeraByte per title
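To put the storage figures in relation to the page counts given earlier, the following back-of-the-envelope calculation may help. It assumes decimal units (1 GB = 10^9 bytes) and that the approximate per-title figures apply equally to all three newspapers, neither of which the slides state; the names used are hypothetical.

```python
# Page counts as given for the three titles.
PAGES = {
    "Neues Deutschland": 120_000,
    "Berliner Zeitung": 140_000,
    "Neue Zeit": 140_000,
}

# Approximate per-title storage from the slide, in bytes (decimal units assumed).
PRESENTATION_BYTES_PER_TITLE = 300 * 10**9
ARCHIVE_BYTES_PER_TITLE = 4_500 * 10**9


def per_page_mb(bytes_per_title: int, pages: int) -> float:
    """Average storage per page in megabytes (10^6 bytes)."""
    return bytes_per_title / pages / 10**6


def total_tb(bytes_per_title: int) -> float:
    """Total storage across all titles in terabytes (10^12 bytes)."""
    return len(PAGES) * bytes_per_title / 10**12


for title, pages in PAGES.items():
    print(
        f"{title}: "
        f"{per_page_mb(PRESENTATION_BYTES_PER_TITLE, pages):.1f} MB/page presentation, "
        f"{per_page_mb(ARCHIVE_BYTES_PER_TITLE, pages):.1f} MB/page archive"
    )
print(f"Total presentation: {total_tb(PRESENTATION_BYTES_PER_TITLE):.1f} TB")
print(f"Total archive: {total_tb(ARCHIVE_BYTES_PER_TITLE):.1f} TB")
```

Because the source figures are approximate, the per-page values are rough averages only.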
Presentation of data
• Authentication: personal library card, xlogon.net (DFN in progress)
• calendar
• full-text search
Presentation of data (full-text search)
The full-text search takes into account terms and expressions that were especially characteristic of the GDR. A query for „Berliner Mauer“ [Berlin Wall] also produces results like „antifaschistischer Schutzwall“ [anti-fascist protective wall], an expression typically used only in the GDR.
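The behaviour described above resembles query expansion with a table of equivalent expressions. The slides do not explain how the portal implements it, so the following is only a minimal sketch under that assumption; the table `EQUIVALENTS` and the functions `expand_query` and `search` are hypothetical, and the only entry is the example pair from the slide.

```python
from typing import Dict, List, Set

# Hypothetical table of expressions treated as equivalent by the search.
EQUIVALENTS: List[Set[str]] = [
    {"berliner mauer", "antifaschistischer schutzwall"},
]


def expand_query(query: str) -> List[str]:
    """Return the normalised query followed by its known equivalents."""
    key = query.strip().lower()
    for group in EQUIVALENTS:
        if key in group:
            return [key] + sorted(group - {key})
    return [key]


def search(query: str, articles: Dict[str, str]) -> List[str]:
    """Return ids of articles whose text contains any expanded term."""
    terms = expand_query(query)
    return [
        article_id
        for article_id, text in articles.items()
        if any(term in text.lower() for term in terms)
    ]


sample = {
    "a1": "Die Berliner Mauer wurde 1961 errichtet.",
    "a2": "Der antifaschistische Schutzwall ... antifaschistischer Schutzwall",
    "a3": "Wetterbericht für Potsdam.",
}
print(search("Berliner Mauer", sample))
```

Plain substring matching ignores German inflection ("antifaschistische" versus "antifaschistischer"); a real index would normally stem or lemmatise terms first.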
Presentation of data (page view)
• Navigation within issue, zoom
• article is highlighted
• list of articles on this page
• full text is displayed, persons are highlighted
• Links to additional (external) information
• possibility to hide text frames
Future goals
• additional login possibility (DFN, in progress)
• more volumes of „Neues Deutschland“, „Berliner Zeitung“ and „Neue Zeit“ will be included before the end of the project (May 2013)
• User-friendly contact form for reporting errors
• OCR error correction module (visitor's interaction)
http://zefys.staatsbibliothek-berlin.de/ddr-presse/