Page 1: The way from pdf-documents to xml-files A brief overview through the OCR- process and the XML mark up Christiana Klingenberg & Donat Agosti.

The way from pdf-documents to xml-files

A brief overview through the OCR-process and the XML mark up

Christiana Klingenberg & Donat Agosti

Page 2:

workflow

Page 3:

document processing

1) OCR (ABBYY FineReader)

• reading the pdf document, dividing the text into blocks

• building training files

• orthography check

2) XML markup (GoldenGATE)

• workflow (level 1)

• FAT / LSID

• treatments

Page 4:

OCR – ABBYY FineReader

Considerations

- building training files for each type face pattern (e.g. for each journal)

- marking the blocks in logical reading order

- recognizing special characters [[worker]], [[queen]], [[male]], [[soldier]]

- orthography check

- saving options

- problems

Page 5:

type face pattern

1804. Carolum Reichard, Brunsviga.

1861. Journal of the Proceedings of the Linnean Society of London, Zoology

1921. Annales de la Société Entomologique de Belgique

2005. Proceedings of the California Academy of Sciences

Page 6:

marking the blocks

[Figure: page scans with numbered text blocks]

marking the blocks in a logical order to get a readable xml document
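The effect of block ordering can be sketched in a few lines of Python. This is only an illustration, not part of ABBYY FineReader: the `Block` class, its `order` field, and the `assemble` function are hypothetical names, and the assumption is that each block carries the reading-order number assigned by the operator.

```python
from dataclasses import dataclass


@dataclass
class Block:
    order: int  # reading-order number given to the block (hypothetical field)
    text: str   # recognized text of the block


def assemble(blocks):
    """Join the block texts in ascending reading order, one paragraph per block."""
    ordered = sorted(blocks, key=lambda b: b.order)
    return "\n\n".join(b.text.strip() for b in ordered)


# Blocks arrive in arbitrary order; the numbering restores the logical sequence.
blocks = [
    Block(3, "50. V. nigra thorace maculata,"),
    Block(1, "Vespa."),
    Block(2, "263"),
]
document = assemble(blocks)
```

If the whole page is marked as a single block instead, there is nothing to sort, and marginal notes end up interleaved with the running text.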

Page 7:

Vespa. 263emargina-ta.50. V. nigra thorace maculata, abdomine fasciis quinque prima antice emarginata, Vespa emarginata. Ent.

Syst. 2. 267. 51. * Habitat in Germania Dom Smidt.simplex51. V. nigra clypeo thoracis margine antico ab-dominisque fasciis quinque simplicibus flavis. Ent. Syst. 2,

267. 52. * Habitat Kiliae.parietina.52. V. nigra clypeo thoraceque maculatis, abdomi-ne fasciis supra quinque, subtus duabus flavis. Ent. Syst,

2. 268. 53. *Panz. Fn. Germ. 49. tab. 24.Habitat Kiliae.

Vespa. 263

50. V. nigra thorace maculata, abdomine fasciis emargina-quinque prima antice emarginata, ta.

Vespa emarginata. Ent. Syst. 2. 267. 51. * Habitat in Germania Dom Smidt.

51. V. nigra clypeo thoracis margine antico ab- simplex. dominisque fasciis quinque simplicibus flavis. Ent. Syst. 2. 267. 52. * Habitat Kiliae. 52. V. nigra clypeo thoraceque maculatis, abdomi- parietina. ne fasciis supra quinque, fubtus duabus flavis. Ent. Syst, 2. 268. 53. Panz. Fn. Germ. 49. tab. 24. Habitat Kiliae.

blocks marked in a logical sequence, „clean“ html

whole text marked in one block, „dirty“ html

Page 8:

special characters

It is not possible to force the ABBYY pattern editor to re-read certain characters!

[[worker]], [[soldier]], [[queen]], [[male]], [[…]] = not recognizable
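Since these symbols are not recognized, one workable approach is to key them in as `[[name]]` placeholders and handle them after the OCR step. The Python sketch below is a hedged illustration of that idea: the function names are hypothetical, and the symbol mapping is an assumption based on the signs commonly used in ant literature, not something taken from the slides.

```python
import re
from collections import Counter

# Matches placeholders such as [[worker]] and captures the name inside.
PLACEHOLDER = re.compile(r"\[\[([^\[\]]+)\]\]")

# Assumed mapping to Unicode symbols; adjust it to the convention of the source text.
SYMBOLS = {
    "worker": "\u263F",   # mercury sign
    "queen": "\u2640",    # female sign
    "male": "\u2642",     # male sign
    "soldier": "\u2643",  # jupiter sign
}


def count_placeholders(text):
    """Count how often each placeholder name occurs in the text."""
    return Counter(PLACEHOLDER.findall(text))


def restore_symbols(text, symbols=SYMBOLS):
    """Replace known placeholders by their symbol; unknown ones stay untouched."""
    return PLACEHOLDER.sub(lambda m: symbols.get(m.group(1), m.group(0)), text)


sample = "bicuspis Emery 1900:268 [[worker]] [[male]] Madagascar"
counts = count_placeholders(sample)
restored = restore_symbols(sample)
```

Counting the placeholders per document gives a quick check that none were lost or mistyped during the orthography check.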

Page 9:

orthography check / problems

• additional dictionaries: “anty_species”, “anty_glossary”, (“anty_Chris”)

• Latin dictionary?

• geographic names dictionary?

• misspelled taxa (incl. species names beginning with CAPITALS)

• available training files for different type patterns for ABBYY (community)

• species dictionaries for different groups (e.g. plants, beetles, birds) (community) (could be used as lexicon in GoldenGATE)
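How a species dictionary could catch misspelled or capitalized epithets can be sketched with Python's standard `difflib` module. This is a minimal illustration, not the ABBYY or GoldenGATE implementation: the `SPECIES` set is a small hypothetical stand-in for a dictionary such as “anty_species”, and the cutoff of 0.8 is an arbitrary assumption.

```python
import difflib

# Hypothetical stand-in for a species dictionary; the real list would be much longer.
SPECIES = {"parallela", "schultzei", "turneri", "cribrinodis", "sinuata", "punctata"}


def check_epithet(word, lexicon=SPECIES):
    """Classify one species epithet and suggest a correction where possible."""
    lower = word.lower()
    if word in lexicon:
        return ("ok", word)
    if lower in lexicon:
        # Known epithet, but written with capital letters.
        return ("capitalized", lower)
    close = difflib.get_close_matches(lower, sorted(lexicon), n=1, cutoff=0.8)
    if close:
        return ("misspelled", close[0])
    return ("unknown", None)


results = [check_epithet(w) for w in ["parallela", "Parallela", "paralela", "xyz"]]
```

Words reported as "unknown" are the ones worth checking against the original pdf, since they may be either OCR errors or names missing from the dictionary.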

Page 10:

saving options

(T) australis Forel = parallela(T) bequaerti Forel = schultzei(T) bicolor (Clark) * = turneri(T) bidentata Brown n. sp. [[worker]] Philippines [13](T) bicuspis Emery 1900:268 [[worker]] [[male]]

Madagascar [15]boliviana Santschi = sinuata(P) brevidentata Wheeler — cribrinodis(T) brevinodis Santschi = cribrinodis(?) brunnipes (Clark) * 1938:361 [[worker]] S Australia:

Reevesby I. [16](T) cephalotes Viehmeyer = parallela(T) ceylonensis Donisthorpe = parallelacineracea Forel = punctata

(T) australis Forel = parallela(T) bequaerti Forel = schultzei(T) bicolor (Clark) * = turneri(T) bidentata Brown n. sp. [[worker]] Philippines [13](T) bicuspis Emery 1900:268 [[worker]] [[male]] Madagascar [15]boliviana Santschi = sinuata (P) brevidentata Wheeler — cribrinodis (T) brevinodis Santschi = cribrinodis(?) brunnipes (Clark) * 1938:361 [[worker]] S Australia: Reevesby I. [16] (T) cephalotes Viehmeyer = parallela (T) ceylonensis Donisthorpe = parallelacineracea Forel = punctata

Page 11:

workflow

Page 12:

GoldenGATE: xml mark up

• FAT / attribute taxon names

– editing species names (beginning with lower case letters, if not recognized as a genus)

– marking of additional, unrecognized taxa (without the author; the author is added during LSID referencing)

– edit annotations (improving the tool)

Page 13:

GoldenGATE: xml mark up

• LSID referencing

– upload of new taxonomic names (quality control?)

– same taxon described by two authors? In case of doubt, which one?

Establishing “taxon format” rules in accordance with the ICZN for taxon upload: “Genus (SubGenus) species subspecies variety” (in most cases this requires prior editing of the taxa, during the OCR process or in GoldenGATE)
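A format rule of this kind can be checked mechanically. The sketch below is one possible reading of the pattern “Genus (SubGenus) species subspecies variety”, written as a Python regular expression; `TAXON`, `parse_taxon`, and `normalize` are hypothetical names, the example names “Aus (Bus) cus dus eus” are placeholders, and hyphenated or abbreviated names are deliberately not covered.

```python
import re

# Genus is capitalized, the optional subgenus stands in parentheses,
# and all following epithets are written in lower case.
TAXON = re.compile(
    r"^(?P<genus>[A-Z][a-z]+)"
    r"(?:\s+\((?P<subgenus>[A-Z][a-z]+)\))?"
    r"(?:\s+(?P<species>[a-z]+))?"
    r"(?:\s+(?P<subspecies>[a-z]+))?"
    r"(?:\s+(?P<variety>[a-z]+))?$"
)


def parse_taxon(name):
    """Return the name parts as a dict, or None if the name breaks the format."""
    match = TAXON.match(" ".join(name.split()))
    return match.groupdict() if match else None


def normalize(name):
    """Capitalize genus and subgenus and lower-case all following epithets."""
    parts = name.split()
    if not parts:
        return ""
    out = [parts[0].capitalize()]
    for part in parts[1:]:
        if part.startswith("("):
            out.append("(" + part.strip("()").capitalize() + ")")
        else:
            out.append(part.lower())
    return " ".join(out)


full = parse_taxon("Aus (Bus) cus dus eus")
fixed = normalize("Vespa Emarginata")
```

A name such as “Vespa Emarginata” fails `parse_taxon` because of the capitalized epithet and passes after `normalize`, which mirrors the manual editing step mentioned above.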

Page 14:

GoldenGATE: treatment mark up

• definitions of treatment options, especially: catalogue entry, synopsis, citation, reference group

• suggestions for simplifying the treatment mark up: journal-specific analyzers?

• treatment mark-up during the “paginator” step and subSubSection mark up afterwards?

Page 15:

GoldenGATE: TaxonX

• TaxonX validation: in GoldenGATE (no need for Oxygen or XMLSpy)

• TaxonX – MODS: what about books?

Page 16:

GoldenGATE: considerations

• new definitions of mark up levels

• LSIDs, citations (DOIs)

• community: “mark up server”, integrating specialists for special groups or mark up levels

Error prevention:

• in case of doubt consult the original pdf (taxa), especially when working with “dirty” html

Page 17:

expenditure of time

• OCR: average of 5.63 min / page

– depends on the type face pattern and on the availability of a training file for that type face pattern

• GoldenGATE: average of 8.18 min / page (tx1)

– the average also includes time spent on debugging and error search

– depends on the number of taxa and treatments

– time will decrease as GoldenGATE is improved and helpful tools are developed
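The two averages can be combined into a rough estimate for a whole document. The short Python sketch below assumes that both steps scale linearly with the number of pages, which is a simplification, since the time depends on type face pattern, taxa, and treatments; the function name is hypothetical.

```python
# Average times per page as reported on the slide (minutes).
OCR_MIN_PER_PAGE = 5.63
MARKUP_MIN_PER_PAGE = 8.18


def estimate_hours(pages):
    """Estimate total working time in hours for OCR plus GoldenGATE mark up."""
    total_minutes = pages * (OCR_MIN_PER_PAGE + MARKUP_MIN_PER_PAGE)
    return total_minutes / 60


hours = estimate_hours(100)  # roughly 23 hours for a 100-page publication
```

At 13.81 min per page in total, the mark up step accounts for a little under 60 % of the time, so improvements to GoldenGATE have the larger effect on the overall effort.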

Page 18:

Time development GoldenGATE