IMPACT Final Conference - Apostolos Antonacopoulos

The Effect of Scanning Parameters on OCR Results: A Case Study. Apostolos Antonacopoulos, PRImA Lab, The University of Salford, United Kingdom. www.primaresearch.org


Case Study: Scanning Parameters

Transcript of IMPACT Final Conference - Apostolos Antonacopoulos

Page 1: IMPACT Final Conference - Apostolos Antonacopoulos

The Effect of Scanning Parameters on OCR Results: A Case Study

Apostolos Antonacopoulos

PRImA Lab, The University of Salford, United Kingdom

www.primaresearch.org

Page 2: IMPACT Final Conference - Apostolos Antonacopoulos

Outline

Background
Image selection
Methods and procedures
Experiments
  Experiment 1: Colour vs. greyscale vs. bitonal
  Experiment 2: Effects of resolution
  Experiment 3: Comparison with NLNZ images
Conclusions

Page 3: IMPACT Final Conference - Apostolos Antonacopoulos

Background

Cost of storage is a real issue for Content Holders
Study by Tracy Powell and Gordon Paynter of the National Library of New Zealand (D-Lib Magazine, 2009) opened a number of questions

Aims:
  Examine the effects of colour in addition to greyscale and bitonal
  Examine the effects of producing bitonal images in different ways
  Examine the effects of different resolutions
  Study the results by image rather than average
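To make the storage concern concrete, the following sketch computes the uncompressed size of a scanned page for the three image types and for two resolutions. The page dimensions and the chosen resolutions are illustrative assumptions, not values from the study, and the function name is hypothetical.

```python
# Bits per pixel for the three image types compared in the study.
BITS_PER_PIXEL = {"colour": 24, "greyscale": 8, "bitonal": 1}


def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Approximate size in bytes of an uncompressed raster scan.

    Ignores file headers and per-row padding, so real files differ slightly.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


# Hypothetical page size in inches (an assumption, not from the slides).
PAGE_W, PAGE_H = 16, 22

for dpi in (300, 400):
    for name, bits in BITS_PER_PIXEL.items():
        size_mb = uncompressed_bytes(PAGE_W, PAGE_H, dpi, bits) / 1e6
        print(f"{dpi} dpi  {name:9s} {size_mb:8.1f} MB")
```

Before compression, colour is three times the size of greyscale and 24 times the size of bitonal at the same resolution, and dropping from 400 to 300 dpi reduces the pixel count to (3/4)^2, or about 56%, of the original.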


Page 4: IMPACT Final Conference - Apostolos Antonacopoulos

Image Selection

Qualitative selection
Parts of newspaper articles (no layout issues)
Variety of newspapers from British Library collection
Quality of overall page taken into account
Regions of different quality selected from same page
Only text regions selected (no graphics present)
No additional artefacts (e.g. warping) present

Page 5: IMPACT Final Conference - Apostolos Antonacopoulos

Methods and Procedures

Regions marked using Aletheia and extracted from the main image as separate PAGE files

Text was keyed and represented in PAGE files

Selected (“standard”) colour reduction and binarisation methods were applied

ABBYY FineReader Engine 9 used for OCR

IMPACT OCR evaluation tool used
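The slides do not say which metric the IMPACT evaluation tool reports. As a minimal sketch of how OCR output can be scored against keyed ground truth, the example below uses a common character-level measure based on Levenshtein edit distance. The function names are hypothetical and this is not the IMPACT tool's implementation.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(ground_truth: str, ocr_text: str) -> float:
    """Share of ground-truth characters recovered, clamped to [0, 1]."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    errors = edit_distance(ground_truth, ocr_text)
    return max(0.0, 1.0 - errors / len(ground_truth))


# One dropped character in an eight-character word.
print(character_accuracy("scanning", "scaning"))  # 0.875
```

Computing this score per extracted region, rather than once over the pooled text, is what allows results to be studied by image.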


Page 6: IMPACT Final Conference - Apostolos Antonacopoulos

Experiment 1: Colour/Grey/Bitonal
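The three image types in this experiment can all be derived from one colour scan. The slides refer only to "standard" colour reduction and binarisation methods without naming them, so the sketch below is an assumption: it uses the ITU-R BT.601 luma weights for greyscale conversion and Otsu's global threshold for binarisation, both widely used choices. Pixel data is a plain list of rows to keep the example dependency-free.

```python
def to_greyscale(rgb_rows):
    """Convert rows of (r, g, b) tuples to 0-255 grey levels (BT.601 weights)."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in rgb_rows]


def otsu_threshold(grey_rows):
    """Return the threshold t that maximises between-class variance.

    Pixels with value <= t form the dark class, the rest the light class.
    """
    hist = [0] * 256
    for row in grey_rows:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_dark, sum_dark = 0, 0
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue
        w_light = total - w_dark
        if w_light == 0:
            break
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarise(grey_rows, threshold):
    """Map grey levels to 0 (black, at or below threshold) or 1 (white)."""
    return [[1 if v > threshold else 0 for v in row] for row in grey_rows]


# Tiny hypothetical 2x2 "scan": dark ink on a light background.
rgb = [[(20, 20, 20), (230, 230, 230)],
       [(230, 230, 230), (20, 20, 20)]]
grey = to_greyscale(rgb)
print(binarise(grey, otsu_threshold(grey)))  # [[0, 1], [1, 0]]
```

A global threshold like this is only one way of producing a bitonal image; a scanner's built-in bitonal mode or a locally adaptive method can give different results on the same page.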

Page 7: IMPACT Final Conference - Apostolos Antonacopoulos

Accuracy Variation per Image


Page 8: IMPACT Final Conference - Apostolos Antonacopoulos

Bitonal: Best Algorithm Vs. Scanner


Page 9: IMPACT Final Conference - Apostolos Antonacopoulos

Original with Large Bitonal Variation


BL9_r0

Page 10: IMPACT Final Conference - Apostolos Antonacopoulos

Experiment 2: Effects of Resolution


Page 11: IMPACT Final Conference - Apostolos Antonacopoulos

Experiment 3: Examine NLNZ Images

Page 12: IMPACT Final Conference - Apostolos Antonacopoulos

Variations in Quality and Accuracy

Other bitonal algorithm better: NLNZ1_r1

Scanner bitonal better: NLNZ4_r0

Page 13: IMPACT Final Conference - Apostolos Antonacopoulos

Conclusions

Averages do not give an accurate picture. Different decisions should be taken for different document types
Better quality images leave room for improvement (re-OCR), especially when accuracy is far from the high 90s%
Current OCR systems are not taking advantage of extra quality?
Higher quality (at least greyscale) is an investment
  Perhaps not so high resolution for “routine” material
“Lossy” compression is a real option
  Better to have a high quality image with an imperceptible “loss” than a perfect low quality image!
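The first conclusion, that averages hide what happens on individual images, can be illustrated with a small sketch. The accuracy figures below are invented purely for illustration and are not results from the study. The image names and function names are likewise hypothetical.

```python
from statistics import mean

# Hypothetical per-image character accuracies (%), NOT data from the study.
accuracy = {
    "image_a": {"colour": 97.0, "greyscale": 97.5, "bitonal": 96.0},
    "image_b": {"colour": 88.0, "greyscale": 89.0, "bitonal": 62.0},
    "image_c": {"colour": 91.0, "greyscale": 90.0, "bitonal": 93.0},
}


def average_by_format(table):
    """Mean accuracy of each image type across all images."""
    formats = next(iter(table.values())).keys()
    return {f: mean(row[f] for row in table.values()) for f in formats}


def spread_by_format(table):
    """Gap between the best and worst image for each image type."""
    formats = next(iter(table.values())).keys()
    return {f: max(row[f] for row in table.values())
               - min(row[f] for row in table.values()) for f in formats}


def best_format_per_image(table):
    """Image type with the highest accuracy for each individual image."""
    return {name: max(row, key=row.get) for name, row in table.items()}


print(average_by_format(accuracy))
print(spread_by_format(accuracy))
print(best_format_per_image(accuracy))
```

With these made-up numbers, bitonal looks clearly worst on average, yet it is the best choice for one image and nearly as good as the others for another. Only the per-image view shows that the low average comes from a single badly affected image.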


Page 14: IMPACT Final Conference - Apostolos Antonacopoulos

Further Information

PRImA: http://www.primaresearch.org

IMPACT: http://www.impact-project.eu