IMPACT Final Conference - Asaf Tzadok

Post on 19-Oct-2014

IBM Adaptive OCR engine and CONCERT (Cooperative Correction), including the library perspective

IBM Labs in Haifa © 2011 IBM Corporation

CONCERT: COoperative eNgine for Correction of ExtRacted Text

Asaf Tzadok

Manager, Image and Document Analytics Group

October 2011

Introduction

An estimated 100 million books or more have been produced since Johann Gutenberg invented movable type in the 15th century.

A large part of this vast literature is now being converted to digital books and moved into the world of electronic publishing.

The digitization process involves:
- Scanning technologies
- OCR (Optical Character Recognition)
- Post-correction

OCR quality ranges between 50% and 90% word-level accuracy.

Post-correction is a must, but it is costly and time-consuming:
- ~1 euro per A5 page

Crowd Sourcing Projects

- Distributed Proofreaders: Project Gutenberg
- National Library of Australia: Australian Newspaper Digitisation
- LDS Church: FamilySearch
- The National Library of Finland: Digitalkoot

All are purely volunteer-based crowdsourcing programs. It works!

Gutenberg Project – 1st Gen.

NLA – Australian Newspapers – 2nd Gen.

Collaborative Correction – State of the Art cont.

State-of-the-art systems, such as Project Gutenberg, simply show the page image and the OCR results to be corrected.

Drawbacks:
- Slow and unproductive process
- Prone to errors
- Hard to cross-check/merge
- Two passes are needed to ensure quality

Result: a complex, hard-to-track process = a lot of manual labor = limited public participation and contribution

DIGITALKOOT - Mole Games – 3rd Gen

Collaborative Correction – Games

- Wider and younger public participation
- Easy to cross-check
- Allows parallelism
- Fully scalable

Drawbacks:
- Low productivity factor
- Static process with a huge amount of work
- Limited use of the feedback from the users

Very long process to complete the digitization

Collaborative Correction – How does it work

A fully web-based collaborative correction system:
- Avoids any installation on the client side
- Intuitive for use by the wide public

Call for participation (optional):
- Via the official website of the library
- Collection based

Volunteers keen on contributing to the preservation of their cultural heritage:
- Top performers lists
- Library recognition awards
- Acknowledgements

CONCERT

Adaptive collaborative correction platform:
- Uses the feedback from the users to improve productivity
- Fully connected to the Adaptive OCR Engine

Strong emphasis on productivity tools:
- Reduce the time for verification/correction
- Patented smart-key approach
- Motivate volunteers

Separating the data entry process into several complementary tasks:
- Optimized application dedicated to each task
- Break down the tasks into subtasks
- Make it suitable for parallel processing
- Online compilation

Digitization flow optimizations:
- Hierarchical context level: character -> word -> page

CONCERT System Architecture

[Architecture diagram. Components: Scanned Book; Image Enhancements; OMNI Engine (ABBYY FRE); Book Fonts Extraction; Book Optimized Adaptive OCR Engine; Dictionaries; CONCERT Quality Control; CONCERT Productivity Tools; CONCERT Games; Web Users; High Quality Transcription]

Adaptive OCR - Requirements

- Consistent and reliable confidence level
  - Important for quality assurance
- No use of prior knowledge on the font
  - Even a "crazy" font can be handled
- Good use of the feedback from the users
  - Character and word level
- Robust to distortion
  - Page-level distortion and printing variations
- Easy to migrate between books from the same publisher
  - Continuous update
- Not too slow
  - Around 2-3 times slower than OMNI engines

Adaptive OCR – Technical Considerations

Pixel domain (template matching)
- Pros
  - Easy to implement
  - Scoring consistency
- Cons
  - Slow
  - Sensitive to small distortions

Feature domain
- Pros
  - Fast
  - Robust to small distortions
  - Using invariant features can improve robustness to distortion
- Cons
  - Inconsistent scoring mechanism
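The pixel-domain trade-off can be illustrated with a minimal sketch of template matching scored by normalized cross-correlation. This is not the IBM engine's implementation; the 5x5 bitmap and the function names `ncc` and `shift_right` are hypothetical and chosen only for illustration. It shows why the score is consistent (an exact match always scores 1) yet sensitive to small distortions (a one-pixel shift of a thin stroke collapses the score).

```python
from math import sqrt

def ncc(a, b):
    """Normalized cross-correlation of two equal-size bitmaps, in [-1, 1]."""
    xs = [p for row in a for p in row]
    ys = [p for row in b for p in row]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den if den else 0.0

def shift_right(img, k=1):
    """Shift a bitmap k pixels to the right, padding with background (0)."""
    return [[0] * k + row[:-k] for row in img]

# Hypothetical template: a one-pixel-wide vertical stroke.
TEMPLATE = [
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]

exact_score = ncc(TEMPLATE, TEMPLATE)                  # perfect match
shifted_score = ncc(TEMPLATE, shift_right(TEMPLATE))   # same glyph, 1 px off
```

Comparing `exact_score` with `shifted_score` shows how a distortion that a human reader would not notice can turn a confident match into a rejection, which is the motivation for compensating distortion before matching.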

Adaptive OCR - Hybrid Approach

Distortion Example

Using hierarchical optical flow:
- High-quality compensation for non-linear character warping
- Can overcome significant distortions

System flow

Character (Carpet) Session
- Fast validation of OCR results
- Every word with a rejected character is routed to the Word Session

Word Verifier Session
- Used for cases where contextual information is necessary
- A rejected word will be corrected in the Page Session

Page-level Session
- For final closure of the page
- Used when a view of the entire page is required for understanding
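The escalation between the three sessions can be sketched as a small state transition. This is an illustrative assumption about the routing, not CONCERT's actual code: the `Session` enum and `next_session` function are hypothetical names, and treating an accepted item as finished is a simplification.

```python
from enum import Enum

class Session(Enum):
    CHARACTER = 1
    WORD = 2
    PAGE = 3
    DONE = 4

def next_session(current, rejected):
    """Return where an item goes after a volunteer's verdict in `current`.

    Assumption: an accepted item needs no further review; a rejected item
    escalates to the next, wider context level.
    """
    if not rejected:
        return Session.DONE
    if current is Session.CHARACTER:
        return Session.WORD   # word containing a rejected character
    if current is Session.WORD:
        return Session.PAGE   # rejected word is corrected on the page
    return Session.DONE       # page session is the final closure

# Trace an item that is rejected at every level.
trace = [Session.CHARACTER]
while trace[-1] is not Session.DONE:
    trace.append(next_session(trace[-1], rejected=True))
```

Each step widens the context available to the volunteer (character, then word, then page), so the cheap, fast session handles most items and only the hard cases reach the slower ones.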

Character Session

OCR results are analyzed:
- Very high confidence results don't require verification
- High confidence results may include some character recognition errors. Hence, the Character Session is used
- Low confidence results may have been caused by segmentation errors. Hence, the Word Session is used

For the Character Session, individual character images are extracted and grouped together based on the recognition results (i.e., all the "b" characters are grouped together in the same session).

Within a given session, all the characters are grouped based on their confidence.
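The triage and grouping described above might look like the following sketch. The threshold values, the `(char, confidence, image_id)` tuple layout, and the function names are assumptions made for illustration; the slides do not give the real thresholds. Ordering each session by ascending confidence is likewise a stand-in for the confidence-based grouping.

```python
from collections import defaultdict

# Hypothetical thresholds; the source does not state the actual values.
VERY_HIGH = 0.98
HIGH = 0.80

def triage(confidence):
    """Decide how an OCR result is handled, based on its confidence."""
    if confidence >= VERY_HIGH:
        return "accept"             # no verification required
    if confidence >= HIGH:
        return "character_session"  # possible character recognition error
    return "word_session"           # possible segmentation error

def build_character_sessions(results):
    """Group character images by recognized character.

    `results` is an iterable of (char, confidence, image_id) tuples.
    Returns {char: [(confidence, image_id), ...]}, lowest confidence first.
    """
    sessions = defaultdict(list)
    for char, confidence, image_id in results:
        if triage(confidence) == "character_session":
            sessions[char].append((confidence, image_id))
    return {char: sorted(items) for char, items in sessions.items()}

example = build_character_sessions([
    ("b", 0.99, "img1"),  # very high confidence: accepted automatically
    ("b", 0.85, "img2"),
    ("b", 0.95, "img3"),
    ("d", 0.50, "img4"),  # low confidence: goes to the Word Session
    ("d", 0.90, "img5"),
])
```

Showing many instances of the same recognized character side by side lets a volunteer spot the odd one out at a glance, which is where the productivity gain of the session comes from.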

Word Session

Used for words that:
- Are not in the dictionary
- Have low-confidence characters
- Have characters rejected in the Character Session

Shows:
- Original word image
- Recognition results
- Possible spelling options

Words are ordered alphabetically:
- Based on the lexicographic order of the recognition results
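A minimal sketch of how words could be selected and ordered for the Word Session follows. The word record layout (`id`, `chars`), the `low_conf` threshold, and the function names are hypothetical; only the three selection criteria and the lexicographic ordering come from the slide.

```python
def recognized_text(word):
    """Join the recognized characters of a word record into a string."""
    return "".join(char for char, _ in word["chars"])

def needs_word_session(word, dictionary, rejected_ids, low_conf=0.80):
    """Apply the three criteria: not in dictionary, low-confidence
    character, or a character rejected in the Character Session."""
    return (
        recognized_text(word).lower() not in dictionary
        or any(conf < low_conf for _, conf in word["chars"])
        or word["id"] in rejected_ids
    )

def word_session_queue(words, dictionary, rejected_ids):
    """Select words for the session, ordered by recognized text."""
    selected = [w for w in words
                if needs_word_session(w, dictionary, rejected_ids)]
    return sorted(selected, key=recognized_text)

# Illustrative data: (character, confidence) pairs per word.
words = [
    {"id": 1, "chars": [("t", 0.99), ("h", 0.99), ("e", 0.99)]},
    {"id": 2, "chars": [("t", 0.99), ("b", 0.95), ("e", 0.99)]},  # "tbe"
    {"id": 3, "chars": [("c", 0.60), ("a", 0.99), ("t", 0.99)]},  # low conf
    {"id": 4, "chars": [("d", 0.99), ("o", 0.99), ("g", 0.99)]},  # rejected
]
queue = word_session_queue(words, {"the", "cat", "dog"}, rejected_ids={4})
queue_ids = [w["id"] for w in queue]
```

Sorting by recognized text places similar-looking results next to each other, so the same correction can often be applied to neighboring entries.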

Word Session – Before data entry

Word Session – After data entry

Page Session

Used for correcting cases where word segmentation fails.

Can be activated in one of 4 flavors:
- Word view
- Line view
- Paragraph view
- Tagging view

The system can move automatically from one problematic word to the next.

CONCERT - Page Session

Multilingual Support - English 1772

Multilingual Support - French 1668

Multilingual Support - German Gothic 1778

Multilingual Support - Dutch 1789

Multilingual Support - Japanese

Hearst Newsreel Collection – Index Card

User Monitoring

Wide public participation may end up with data corruption by:
- Malicious users
- Unqualified users

User rating and feedback motivate use of the system.

Three-way validation:
- Good injection: characters/words with high confidence of being true
- Similar injection: characters/words that look similar but are not identical, for example an 'O' injected into a 'Q' session
- Error injection: characters/words with high confidence of being false
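One way to turn the three injection types into a user rating is sketched below. The scoring rule, the `min_score` cutoff, and the function names are assumptions for illustration only; the slides do not describe how CONCERT computes its ratings. The sketch assumes a careful user accepts "good" injections and rejects both "similar" and "error" injections.

```python
# Expected verdict per injection type (assumption: a similar-looking
# character, such as 'O' in a 'Q' session, should be rejected).
EXPECTED_ACCEPT = {"good": True, "similar": False, "error": False}

def reliability(answers):
    """Fraction of injected control items the user judged correctly.

    `answers` is a list of (injection_kind, user_accepted) pairs.
    Returns None when the user has not seen any control item yet.
    """
    if not answers:
        return None
    correct = sum(1 for kind, accepted in answers
                  if EXPECTED_ACCEPT[kind] == accepted)
    return correct / len(answers)

def is_trusted(answers, min_score=0.9):
    """Hypothetical cutoff below which a user's work is re-checked."""
    score = reliability(answers)
    return score is not None and score >= min_score

careless_user = [
    ("good", True),      # correct: accepted a true item
    ("similar", False),  # correct: rejected the look-alike
    ("error", True),     # wrong: accepted a false item
    ("good", True),      # correct
]
careless_score = reliability(careless_user)
```

Because the injected items are indistinguishable from real work, the same score can feed both quality control (discounting unreliable input) and the user-facing ratings that motivate volunteers.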

User Monitoring – Screenshots

CONCERT Games

CONCERT in use

Hearst Newsreel Archive Collection
- First production use
- Tagging capabilities

Pilot in Japan for the Japanese Library
- Including customization for Japanese

1st-phase pilots in major libraries in Europe
- KB – National Library of the Netherlands
- BL – British Library
- BSB – Bavarian State Library

CONCERT Future Planning

Search over OCR
- Beyond transcription

Improve user feedback
- Online advisor
- Best performers list

Community building around content
- Integrate community tools within the platform

CONCERT Games
- iPhone/iPad/Android/Desktop

E-book creation
- Fully digital transcription
- Using the original font as an option

Page distortion correction
- Fully integrate the word-based page distortion correction

Thank You!