OCRFeeder - OCR made easy on GNOME (GUADEC 2012)

Page 1: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)

static void
_f_do_barnacle_install_properties (GObjectClass *gobject_class)
{
  GParamSpec *pspec;

  /* Party code attribute */
  pspec = g_param_spec_uint64 (F_DO_BARNACLE_CODE,
                               "Barnacle code.",
                               "Barnacle code",
                               0, G_MAXUINT64,
                               G_MAXUINT64 /* default value */,
                               G_PARAM_READABLE | G_PARAM_WRITABLE |
                               G_PARAM_PRIVATE);

  g_object_class_install_property (gobject_class,
                                   F_DO_BARNACLE_PROP_CODE,
                                   pspec);
}

Joaquim Rocha

OCRFeeder

OCR Made Easy on GNOME

July 27 2012

Page 2: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)

Joaquim Rocha (Igalia) · OCRFeeder · GUADEC 2012

What is it?

Document Analysis and Optical Character Recognition for GNOME

Page 3: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)

Why?

Paper has a number of problems

No applications for GNU/Linux to do a fair job

Page 4: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)

Paper problems: Security

CC Photo by: http://www.flickr.com/photos/badwsky/

Page 5: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)

Paper problems: Preservation

CC Photo by: http://www.flickr.com/photos/98469445@N00/

Page 6: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)

Paper problems: Data processing

CC Photo by: http://www.flickr.com/photos/hugovk/

Page 7: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)

Paper problems: Ecology

CC Photo by: http://www.flickr.com/photos/pranavsingh/

Page 8: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)

Paper problems: Accessibility

CC Photo by: http://www.flickr.com/photos/illustrator/

Page 9: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)

No fair conversion apps for GNU/Linux

apart from OCR engines, but...

Page 10: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)

OCR != Document Conversion

(it only deals with chars)

(does not consider the layout)

(does not distinguish contents)

Page 11: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)

What's needed is

Document Analysis and Recognition

(conversion of documents to an electronic format)

(first projects in the 80s)

Page 12: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)

Page 13: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)

Page 14: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)

How it works

Page 15: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)

So many layouts...

CC Photo by: http://www.flickr.com/photos/uber-tuber/

Page 16: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)

Layouts vary with the type of document

What works for detecting one layout won't work for others

Page 17: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)

OCRFeeder focuses on contents, not on layouts!

Page 18: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)

Key concept:

If a document image can be divided into windows marked 1 (content) or 0 (not content), then all the 1s can be grouped to outline the contents
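The key concept can be illustrated with a minimal sketch. This is not OCRFeeder's actual code: the function names (content_grid, content_boxes), the grayscale list-of-rows image, the darkness threshold and the 4-connected grouping are all assumptions made for illustration.

```python
from collections import deque


def content_grid(image, window, threshold=128):
    """Return a grid with 1 for each window containing a dark pixel, else 0.

    `image` is a list of rows of grayscale values (0 = black, 255 = white).
    """
    rows = (len(image) + window - 1) // window
    cols = (len(image[0]) + window - 1) // window
    grid = [[0] * cols for _ in range(rows)]
    for y, line in enumerate(image):
        for x, pixel in enumerate(line):
            if pixel < threshold:
                grid[y // window][x // window] = 1
    return grid


def content_boxes(grid):
    """Group 4-connected 1-windows and return their bounding boxes.

    Each box is (left, top, right, bottom) in window units, inclusive.
    """
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first walk over every neighbouring 1-window.
            queue = deque([(r, c)])
            seen[r][c] = True
            left, top, right, bottom = c, r, c, r
            while queue:
                cr, cc = queue.popleft()
                left, right = min(left, cc), max(right, cc)
                top, bottom = min(top, cr), max(bottom, cr)
                for nr, nc in ((cr - 1, cc), (cr + 1, cc),
                               (cr, cc - 1), (cr, cc + 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and grid[nr][nc] == 1 and not seen[nr][nc]):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            boxes.append((left, top, right, bottom))
    return boxes
```

Multiplying each box coordinate by the window size gives an outline in pixels, which could then be passed to an OCR engine or kept as an image area.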

Page 19: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)

Page 20: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)

Recognition:

System-wide OCR engines are used

Engines are configured from the GUI or XML files
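As a rough illustration of driving a system-wide engine from an XML description, here is a sketch. The XML layout (engine, name, executable, arguments), the $IMAGE placeholder and the function names are hypothetical and are not taken from OCRFeeder's real configuration files. The Tesseract arguments assume a version that accepts "stdout" as its output base.

```python
import shlex
import xml.etree.ElementTree as ET

# Hypothetical engine description; OCRFeeder's real schema may differ.
ENGINE_XML = """
<engine>
    <name>Tesseract</name>
    <executable>tesseract</executable>
    <arguments>$IMAGE stdout</arguments>
</engine>
"""


def load_engine(xml_text):
    """Parse the hypothetical engine XML into a plain dictionary."""
    root = ET.fromstring(xml_text.strip())
    return {
        "name": root.findtext("name").strip(),
        "executable": root.findtext("executable").strip(),
        "arguments": root.findtext("arguments").strip(),
    }


def build_command(engine, image_path):
    """Build the argument list, replacing $IMAGE with the image path."""
    args = shlex.split(engine["arguments"])
    return [engine["executable"]] + [
        image_path if arg == "$IMAGE" else arg for arg in args
    ]
```

The resulting list can be passed to subprocess.run(..., capture_output=True, text=True) to collect the recognized text from the engine's standard output.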

Page 21: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)

Page 22: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)

The best-known free OCR engines are detected and configured automatically:

* Tesseract
* GOCR
* OCRAD
* Cuneiform
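One simple way such detection could work is to look for each engine's command on the PATH. The sketch below does that and is not OCRFeeder's own detection code. The command names are the usual ones for these engines, while the ENGINE_COMMANDS table and the detect_engines function are illustrative assumptions.

```python
import shutil

# Usual command names for the engines listed above (assumed, not
# read from OCRFeeder).
ENGINE_COMMANDS = {
    "Tesseract": "tesseract",
    "GOCR": "gocr",
    "OCRAD": "ocrad",
    "Cuneiform": "cuneiform",
}


def detect_engines(which=shutil.which):
    """Return {engine name: executable path} for engines found on PATH.

    `which` is injectable so the lookup can be replaced in tests.
    """
    found = {}
    for name, command in ENGINE_COMMANDS.items():
        path = which(command)
        if path is not None:
            found[name] = path
    return found
```

Calling detect_engines() with no arguments checks the current system. Engines that are not installed are simply left out of the result.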

Page 23: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)

Export formats:

* ODT
* HTML
* Plain text
* PDF

Page 24: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)

User interaction:

Users can edit everything and review the algorithm's results

So the UI can work in attended and unattended modes

The CLI works only in unattended mode

Page 25: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)

Page 26: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)

Demo time!

Page 27: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)

Other features:

* PDF import
* Unpaper preprocessor
* Font style editing
* Image deskewing
* OCR results cleaning
* Project saving/loading

Page 28: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)

Future:

* More export formats: hOCR, etc.

* Make OCR engines' management easier

Page 29: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)

Webpage: http://live.gnome.org/OCRFeeder

git: http://git.gnome.org/ocrfeeder

Bugzilla: http://bugzilla.gnome.org (product: OCRFeeder)

Page 30: OCRFeeder - OCR made easy on GNOME (GUADEC 2012)

Thank you!