IMPACT Final Conference - Asaf Tzadok

description

IBM Adaptive OCR engine and CONCERT (Cooperative Correction), including the library perspective

Transcript of IMPACT Final Conference - Asaf Tzadok

Page 1: IMPACT Final Conference - Asaf Tzadok

IBM Labs in Haifa © 2011 IBM Corporation

CONCERT: COoperative eNgine for Correction of ExtRacted Text

Asaf Tzadok

Manager, Image and Document Analytics Group

October 2011

Page 2: IMPACT Final Conference - Asaf Tzadok

Introduction

At least an estimated 100 million books have been produced since Johannes Gutenberg invented movable type in the 15th century.

A large part of this vast literature is now being converted to digital books and moved into the world of electronic publishing.

The digitization process involves:
  • Scanning technologies
  • OCR (Optical Character Recognition)
  • Post-correction

OCR quality ranges between 50% and 90% word-level accuracy.

Post-correction is a must, but it is costly and time-consuming:
  • ~1 euro per A5 page

Page 3: IMPACT Final Conference - Asaf Tzadok

Crowd Sourcing Projects

  • Distributed Proofreaders: Project Gutenberg
  • National Library of Australia: Australian Newspaper Digitisation
  • LDS Church: FamilySearch
  • The National Library of Finland: Digitalkoot

All are purely volunteer-based crowdsourcing programs. It works!

Page 4: IMPACT Final Conference - Asaf Tzadok

Gutenberg Project – 1st Gen.

Page 5: IMPACT Final Conference - Asaf Tzadok

NLA – Australian Newspapers – 2nd Gen.

Page 6: IMPACT Final Conference - Asaf Tzadok

Collaborative Correction – State of the Art cont.

State-of-the-art systems, such as Project Gutenberg, simply show the page image and the OCR results to be corrected.

Drawbacks:
  • Slow and unproductive process
  • Prone to errors
  • Hard to cross-check/merge
  • Two passes are needed to ensure quality

Result: a complex, hard-to-track process = a lot of manual labor = limited public participation and contribution

Page 7: IMPACT Final Conference - Asaf Tzadok

DIGITALKOOT - Mole Games – 3rd Gen

Page 8: IMPACT Final Conference - Asaf Tzadok

Collaborative Correction – Games

Advantages:
  • Wider and younger public participation
  • Easy to cross-check
  • Allows parallelism
  • Fully scalable

Drawbacks:
  • Low productivity factor
  • Static process with a huge amount of work
  • Limited use of the feedback from the users

Very long process to complete the digitization

Page 9: IMPACT Final Conference - Asaf Tzadok

Collaborative Correction – How does it work

A fully web-based collaborative correction system:
  • Avoids any installation on the client side
  • Intuitive for use by the wide public

Call for participation (optional):
  • Via the official website of the library
  • Collection based

Volunteers keen on contributing to the preservation of their cultural heritage:
  • Top performers lists
  • Library recognition awards
  • Acknowledgements

Page 10: IMPACT Final Conference - Asaf Tzadok

CONCERT

Adaptive collaborative correction platform:
  • Uses the feedback from the users to improve productivity
  • Fully connected to the Adaptive OCR Engine

Strong emphasis on productivity tools:
  • Reduce the time for verification/correction
  • Patented smart-key approach
  • Motivate volunteers

Separating the data entry process into several complementary tasks:
  • Optimized application dedicated to each task
  • Break down the tasks into subtasks
  • Make it suitable for parallel processing
  • Online compilation

Digitization flow optimizations:
  • Hierarchical context level: character -> word -> page

Page 11: IMPACT Final Conference - Asaf Tzadok

CONCERT System Architecture

[Architecture diagram. Components shown: Scanned Book; Image Enhancements; OMNI Engine (ABBYY FRE); Book Fonts Extraction; Book Optimized Adaptive OCR Engine; CONCERT Quality Control; Dictionaries; Web Users; CONCERT Productivity Tools; CONCERT Games; High Quality Transcription.]

Page 12: IMPACT Final Conference - Asaf Tzadok

Adaptive OCR - Requirements

Consistent and reliable confidence level:
  • Important for quality assurance

No use of prior knowledge of the font:
  • Even unusual ("crazy") fonts can be handled

Good use of the feedback from the users:
  • At character and word level

Robust to distortion:
  • Page-level distortion and printing variations

Easy to migrate between books from the same publisher:
  • Continuous update

Not too slow:
  • Around 2-3 times slower than OMNI engines

Page 13: IMPACT Final Conference - Asaf Tzadok

Adaptive OCR – Technical Considerations

Pixel domain (template matching):
  • Pros: easy to implement; consistent scoring
  • Cons: slow; sensitive to small distortions

Feature domain:
  • Pros: fast; robust to small distortions; using invariant features can further improve robustness to distortion
  • Cons: inconsistent scoring mechanism
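
To make the pixel-domain trade-off concrete, here is a minimal sketch of template matching with normalized cross-correlation. It is an illustration only, not the scoring used by the Adaptive OCR engine: the function names `ncc` and `classify`, the toy 3x3 glyphs and the choice of normalized cross-correlation are all assumptions made for this example.

```python
from math import sqrt


def ncc(a, b):
    """Normalized cross-correlation between two equal-size grayscale glyph images.

    Each image is a list of rows of pixel intensities; the score lies in [-1, 1].
    """
    xs = [p for row in a for p in row]
    ys = [p for row in b for p in row]
    if len(xs) != len(ys):
        raise ValueError("glyph images must have the same size")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mean_x) ** 2 for x in xs) * sum((y - mean_y) ** 2 for y in ys))
    return num / den if den else 0.0  # flat images carry no shape information


def classify(glyph, templates):
    """Return the (label, score) of the best-matching template."""
    return max(((label, ncc(glyph, t)) for label, t in templates.items()),
               key=lambda pair: pair[1])


# Toy templates (hypothetical data): 1 = ink, 0 = background.
TEMPLATES = {
    "I": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "L": [[1, 0, 0], [1, 0, 0], [1, 1, 1]],
}
SHIFTED_I = [[0, 0, 1], [0, 0, 1], [0, 0, 1]]  # same stroke, one pixel to the right

label, score = classify(TEMPLATES["I"], TEMPLATES)
shifted_score = ncc(SHIFTED_I, TEMPLATES["I"])
```

An exact copy of the template scores about 1.0, while the same stroke shifted by a single pixel scores below zero. This illustrates both the scoring consistency and the sensitivity to small distortions listed above.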

Page 14: IMPACT Final Conference - Asaf Tzadok

Adaptive OCR - Hybrid Approach

Page 15: IMPACT Final Conference - Asaf Tzadok

Distortion Example

Using hierarchical optical flow:
  • High-quality compensation for non-linear character warping
  • Can overcome significant distortions

Page 16: IMPACT Final Conference - Asaf Tzadok

System flow

Character (Carpet) Session:
  • Fast validation of OCR results
  • Every word with a rejected character is routed to the Word Session

Word Verifier Session:
  • Used for cases where contextual information is necessary
  • A rejected word will be corrected in the Page Session

Page-level Session:
  • For final closure of the page
  • Used when a view of the entire page is required for understanding
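
The escalation path described above can be sketched as a small routing function. This is a simplified illustration under stated assumptions: the `WordResult` record and the `sessions_visited` function are hypothetical names, and the sketch tracks only how a single word escalates, not the final page-closure step.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WordResult:
    text: str
    char_rejected: List[bool]             # one flag per character, from the Character Session
    word_rejected: Optional[bool] = None  # Word Session verdict, if the word got there


def sessions_visited(word: WordResult) -> List[str]:
    """List the sessions a word passes through, in order."""
    path = ["character"]
    if any(word.char_rejected):
        path.append("word")      # any rejected character escalates the whole word
        if word.word_rejected:
            path.append("page")  # a rejected word is corrected in the Page Session
    return path


clean = WordResult("house", [False] * 5)
fixed_in_word = WordResult("tbe", [False, True, False], word_rejected=False)
needs_page = WordResult("ofthe", [False, True, False, False, False], word_rejected=True)
```

Most words stop at the first, cheapest session. Only the residue that needs more context moves up the character -> word -> page hierarchy.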

Page 17: IMPACT Final Conference - Asaf Tzadok

Character Session

OCR results are analyzed:
  • Very high confidence results don’t require verification.
  • High confidence results may include some character recognition errors. Hence, the Character Session is used.
  • Low confidence results may have been caused by segmentation errors. Hence, the Word Session is used.

For the Character Session, individual character images are extracted and grouped together based on the recognition results (i.e., all the “b” characters are grouped together in the same session).

Within a given session, all the characters are grouped based on their confidence.
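
The triage and grouping described above might look like the following sketch. The slides give no numeric thresholds, so `VERY_HIGH` and `HIGH` are hypothetical values, as are the function names and the `(label, confidence, image_id)` tuple layout. Sorting each session by ascending confidence is one possible reading of "grouped based on their confidence".

```python
from collections import defaultdict

# Hypothetical thresholds; the slides do not give numeric values.
VERY_HIGH = 0.98
HIGH = 0.80


def triage(confidence):
    """Decide which kind of verification a recognized character needs."""
    if confidence >= VERY_HIGH:
        return "none"
    if confidence >= HIGH:
        return "character"
    return "word"


def build_character_sessions(results):
    """Group Character Session candidates by recognized label.

    `results` is an iterable of (label, confidence, image_id) tuples.
    Within each session, items are sorted by ascending confidence so that
    images of similar confidence sit next to each other.
    """
    sessions = defaultdict(list)
    for label, confidence, image_id in results:
        if triage(confidence) == "character":
            sessions[label].append((confidence, image_id))
    return {label: sorted(items) for label, items in sessions.items()}


ocr_results = [
    ("b", 0.91, "img-001"),
    ("b", 0.85, "img-002"),
    ("b", 0.99, "img-003"),  # very high confidence: no verification needed
    ("h", 0.83, "img-004"),
    ("b", 0.40, "img-005"),  # low confidence: left for the Word Session
]
sessions = build_character_sessions(ocr_results)
```

Here the "b" session holds `img-002` and `img-001` (in that order) and the "h" session holds `img-004`. A volunteer looking at one session only has to spot the images that are not the expected letter.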

Page 18: IMPACT Final Conference - Asaf Tzadok

Character Session

Page 19: IMPACT Final Conference - Asaf Tzadok

Character Session

Page 20: IMPACT Final Conference - Asaf Tzadok

Character Session

Page 21: IMPACT Final Conference - Asaf Tzadok

Word Session

Used for words that:
  • Are not in the dictionary
  • Have low-confidence characters
  • Have characters rejected in the Character Session

Shows:
  • Original word image
  • Recognition results
  • Possible spelling options

Words are ordered alphabetically:
  • Based on the recognition results, in lexicographic order
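
A minimal sketch of how such a Word Session queue could be assembled is shown below. Everything here is illustrative: the dictionary-based word records, the `LOW_CONFIDENCE` threshold and the function names are assumptions, and Python's standard `difflib.get_close_matches` stands in for whatever spelling-suggestion mechanism CONCERT actually uses.

```python
from difflib import get_close_matches

LOW_CONFIDENCE = 0.80  # hypothetical threshold, not taken from the slides


def needs_word_session(word, dictionary):
    """Apply the three selection criteria listed above."""
    return (
        word["text"].lower() not in dictionary
        or min(word["char_confidences"]) < LOW_CONFIDENCE
        or word["rejected_in_character_session"]
    )


def build_word_session(words, dictionary):
    """Select words for review, attach spelling options, sort lexicographically."""
    queue = []
    for word in words:
        if needs_word_session(word, dictionary):
            options = get_close_matches(word["text"].lower(), sorted(dictionary), n=3)
            queue.append({**word, "options": options})
    return sorted(queue, key=lambda w: w["text"])


dictionary = {"the", "house", "horse", "mouse"}
words = [
    {"text": "tbe", "char_confidences": [0.90, 0.60, 0.95],
     "rejected_in_character_session": False},
    {"text": "house", "char_confidences": [0.99] * 5,
     "rejected_in_character_session": False},
    {"text": "hause", "char_confidences": [0.95, 0.90, 0.97, 0.96, 0.98],
     "rejected_in_character_session": True},
]
session = build_word_session(words, dictionary)
```

Sorting by the recognized text places identical or near-identical OCR outputs next to each other, so a volunteer can often handle a run of similar words with the same correction.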

Page 22: IMPACT Final Conference - Asaf Tzadok

Word Session – Before data entry

Page 23: IMPACT Final Conference - Asaf Tzadok

Word Session – After data entry

Page 24: IMPACT Final Conference - Asaf Tzadok

Word Session – Before data entry

Page 25: IMPACT Final Conference - Asaf Tzadok

Word Session – After data entry

Page 26: IMPACT Final Conference - Asaf Tzadok

Page Session

Used for correction of cases where word segmentation fails

Can be activated in one of 4 flavors:
  • Word view
  • Line view
  • Paragraph view
  • Tagging view

System can go automatically from one problematic word to another

Page 27: IMPACT Final Conference - Asaf Tzadok

CONCERT - Page Session

Page 28: IMPACT Final Conference - Asaf Tzadok

Multilingual Support - English

1772

Page 29: IMPACT Final Conference - Asaf Tzadok

Multilingual Support - French

1668

Page 30: IMPACT Final Conference - Asaf Tzadok

Multilingual Support - German Gothic

1778

Page 31: IMPACT Final Conference - Asaf Tzadok

Multilingual Support - Dutch 1789

Page 32: IMPACT Final Conference - Asaf Tzadok

Multilingual Support - Japanese

Page 33: IMPACT Final Conference - Asaf Tzadok

Hearst Newsreel Collection – Index Card

Page 34: IMPACT Final Conference - Asaf Tzadok

User Monitoring

Wide public participation may end up with data corruption by:
  • Malicious users
  • Non-qualified users

User rating and feedback motivate the use of the system.

Three-way validation:
  • Good injection: characters/words known with high confidence to be true
  • Similar injection: characters/words which may look similar but are not identical, for example an ‘O’ injected into a ‘Q’ session
  • Error injection: characters/words known with high confidence to be false
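
One way to turn the three injection types into a user rating is sketched below. The slides do not describe the scoring formula, so this is an assumption throughout: the `EXPECTED_ACCEPT` table, the `MIN_TRUST` cut-off and both function names are hypothetical, and the score is simply the fraction of injected control items answered as expected.

```python
# Expected answer to "is this item correct?" for each injection kind.
EXPECTED_ACCEPT = {
    "good": True,      # known-correct item: should be accepted
    "similar": False,  # look-alike (e.g. 'O' in a 'Q' session): should be rejected
    "error": False,    # known-wrong item: should be rejected
}

MIN_TRUST = 0.8  # hypothetical cut-off, not taken from the slides


def trust_score(responses):
    """Fraction of injected control items the user answered as expected.

    `responses` is a list of (injection_kind, user_accepted) pairs.
    Returns None when the user has not seen any control items yet.
    """
    if not responses:
        return None
    hits = sum(EXPECTED_ACCEPT[kind] == accepted for kind, accepted in responses)
    return hits / len(responses)


def is_trusted(responses):
    """True when the user has a score and it reaches the cut-off."""
    score = trust_score(responses)
    return score is not None and score >= MIN_TRUST


careful_user = [("good", True), ("similar", False), ("error", False), ("good", True)]
accepts_everything = [("good", True), ("similar", True), ("error", True), ("good", True)]
```

A user who accepts every item passes the good injections but fails the similar and error injections, so the mix of all three kinds separates careful volunteers from careless or malicious ones.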

Page 35: IMPACT Final Conference - Asaf Tzadok

User Monitoring – Screenshots

Page 36: IMPACT Final Conference - Asaf Tzadok

User Monitoring – Screenshots Cont.

Page 37: IMPACT Final Conference - Asaf Tzadok

User Monitoring – Screenshots Cont.

Page 38: IMPACT Final Conference - Asaf Tzadok

User Monitoring – Screenshots Cont.

Page 39: IMPACT Final Conference - Asaf Tzadok

User Monitoring – Screenshots Cont.

Page 40: IMPACT Final Conference - Asaf Tzadok

User Monitoring – Screenshots Cont.

Page 41: IMPACT Final Conference - Asaf Tzadok

User Monitoring – Screenshots Cont.

Page 42: IMPACT Final Conference - Asaf Tzadok

CONCERT Games

Page 43: IMPACT Final Conference - Asaf Tzadok

CONCERT in use

Hearst Newsreel Archive Collection:
  • First production use
  • Tagging capabilities

Pilot in Japan for the Japanese Library:
  • Including customization for Japanese

1st phase pilots in major libraries in Europe:
  • KB – National Library of the Netherlands
  • BL – British Library
  • BSB – Bavarian State Library

Page 44: IMPACT Final Conference - Asaf Tzadok

CONCERT Future Planning

Search over OCR:
  • Beyond transcription

Improve user feedback:
  • Online advisor
  • Best performers list

Community building around content:
  • Integrate community tools within the platform

CONCERT Games:
  • iPhone/iPad/Android/Desktop

E-book creation:
  • Fully digital transcription
  • Using the original font as an option

Page distortion correction:
  • Fully integrate the word-based page distortion correction

Page 45: IMPACT Final Conference - Asaf Tzadok

Thank You!