Development of Arabic OCR - WordPress.com...NEMLAR (Network for Euro-Mediterranean Language...



Development of Arabic OCR: Opportunities and challenges

Team members in UofG: Qiying He (Tina), SangYu Lee, Leonardo Nunes Parente

Team members in IUG: Ghadeer Abu-Oda, Shadia Baroud


What is OCR?

OCR = Optical Character Recognition


Content

1. Situation & Problems: Background in Gaza, the Arabic language, and existing problems of Arabic OCR

2. Solutions: Hidden Markov Model, open-source software, and their advantages and disadvantages

3. Evaluation & Future Work: Best solution, limitations, and future trends


Part 1: Situation & Problems


Background

- Blind people

- Free software

- Cannot afford new apps

- ATC in IUG [1]

ATC = Assistive Technology Centre


Complexity of Arabic

Cursiveness: 28 characters, of which 22 are cursive and 6 are non-cursive.

Shapes: A character can have up to 4 shapes depending on its position in the word (Table 1) [2].
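The cursive/non-cursive split and the positional shapes can be checked against Unicode's Arabic Presentation Forms-B block, where each positional form decomposes to its base letter with a tag such as `<initial>`. The sketch below is an illustration only and not part of the original slides; the names `BASIC_LETTERS` and `contextual_forms` are assumptions introduced here, and the list of 28 letters excludes hamza and teh marbuta.

```python
import unicodedata
from collections import defaultdict

# Code points of the 28 basic Arabic letters (assumed list for illustration;
# hamza U+0621 and teh marbuta U+0629 are left out).
BASIC_LETTERS = (
    [0x0627, 0x0628]
    + list(range(0x062A, 0x063B))   # teh .. ghain
    + list(range(0x0641, 0x0649))   # feh .. waw
    + [0x064A]                      # yeh
)

POSITION_TAGS = {"<isolated>", "<final>", "<initial>", "<medial>"}


def contextual_forms():
    """Map each base letter to its positional presentation forms."""
    forms = defaultdict(dict)
    # Arabic Presentation Forms-B occupies U+FE70..U+FEFF.
    for cp in range(0xFE70, 0xFF00):
        parts = unicodedata.decomposition(chr(cp)).split()
        # Keep single-letter forms such as "<initial> 0628";
        # ligatures and diacritic forms have more than one base code point.
        if len(parts) == 2 and parts[0] in POSITION_TAGS:
            base = int(parts[1], 16)
            forms[base][parts[0].strip("<>")] = chr(cp)
    return forms


forms = contextual_forms()
four_shapes = [cp for cp in BASIC_LETTERS if len(forms[cp]) == 4]
two_shapes = [cp for cp in BASIC_LETTERS if len(forms[cp]) == 2]
print(len(four_shapes), "letters with 4 shapes;", len(two_shapes), "letters with 2 shapes")
```

Letters that join on both sides have all four forms, while letters that never join to the following letter have only an isolated and a final form.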


Problems about OCR

A. Most apps focus on English or other Latin-based languages

B. Arabic OCR is still at an early stage (inaccurate)

C. Not many techniques for handwritten Arabic recognition


Part 2: Solutions


Solution 1: Statistical Methods

- Hidden Markov Model (HMM)

A. What is HMM?

A tool for representing probability distributions over sequences of observations [3]

B. Why?

- Based on a "process-focused approach": suitable for recognising handwriting

- High accuracy rate

Algorithm              Accuracy Rate
Logistic Regression    89.4% [4]
Linear SVM             85.4% [4]
kNN (3)                89.5% [4]
HMM                    92.1% [5]
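To make the idea of a probability distribution over observation sequences concrete, here is a minimal sketch of the forward algorithm for a discrete HMM. It is not taken from the slides or from [3]; the function name `sequence_likelihood` and the toy parameters (two hidden states, two observation symbols) are assumptions chosen only for illustration.

```python
def sequence_likelihood(start, trans, emit, observations):
    """Forward algorithm: P(observations | model) for a discrete HMM.

    start[s]     probability of starting in state s
    trans[p][s]  probability of moving from state p to state s
    emit[s][o]   probability of emitting symbol o while in state s
    """
    if not observations:
        return 1.0
    n_states = len(start)
    # alpha[s] = P(observations so far, current state = s)
    alpha = [start[s] * emit[s][observations[0]] for s in range(n_states)]
    for obs in observations[1:]:
        alpha = [
            emit[s][obs] * sum(alpha[p] * trans[p][s] for p in range(n_states))
            for s in range(n_states)
        ]
    return sum(alpha)


# Toy model (hypothetical numbers).
START = [0.6, 0.4]
TRANS = [[0.7, 0.3], [0.4, 0.6]]
EMIT = [[0.9, 0.1], [0.2, 0.8]]

print(sequence_likelihood(START, TRANS, EMIT, [0, 1, 0]))
```

Because the model defines a distribution, the likelihoods of all sequences of a fixed length sum to 1, which is a convenient sanity check for an implementation.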


Solution 1: Statistical Methods

- Hidden Markov Model (HMM)

C. How does it work?

A pattern is assigned to the model with the highest posterior probability (i.e. the model that best explains the pattern) [6]

D. Evaluation

- One of the most suitable algorithms for handwriting recognition

- Can be further developed by adapting appropriate software
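The highest-posterior rule under "How does it work?" can be written out with Bayes' rule. The notation is an assumption introduced here, not taken from [6]: $X$ is the observed feature sequence extracted from the image, $\lambda_k$ is the HMM trained for class $k$ (for example a character or word), and $P(\lambda_k)$ is its prior.

```latex
\hat{k}
  = \arg\max_{k} P(\lambda_k \mid X)
  = \arg\max_{k} \frac{P(X \mid \lambda_k)\, P(\lambda_k)}{P(X)}
  = \arg\max_{k} P(X \mid \lambda_k)\, P(\lambda_k)
```

The denominator $P(X)$ is the same for every candidate model, so it can be dropped. The remaining likelihood $P(X \mid \lambda_k)$ is what each HMM computes for the pattern.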


Solution 2: OCR Software

- Tesseract

A. What is Tesseract?

Open-source OCR engine for various operating systems, originally developed at HP between 1984 and 1994 [7]

B. Why?

- Price: Free

Software          Price
Sakhr             £650.00
Omnipage (Pro)    £292.00
ABBYY             £100.00
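As a rough illustration of how Tesseract can be applied to Arabic text, the sketch below calls the `tesseract` command-line tool from Python. It assumes the executable and its Arabic language data (`ara`) are installed; the helper names `build_tesseract_command` and `recognise`, and the file name `page.png`, are hypothetical.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, lang="ara"):
    """Build the CLI call; "stdout" makes tesseract print the text instead of writing a file."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def recognise(image_path, lang="ara"):
    """Run Tesseract on an image and return the recognised text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        encoding="utf-8",  # Arabic output is decoded as UTF-8
        check=True,
    )
    return result.stdout


# Example (requires Tesseract and a scanned page on disk):
# print(recognise("page.png"))
```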


Solution 2: OCR Software

- Tesseract

B. Why?

- Open Source Software (OSS): more opportunities to adapt the software based on users' input

                        Character    Word
Change of error rate    -7.31%       -5.339%

C. Evaluation

- Easy accessibility: no cost & open for input

- Necessity for more participation & better accuracy for Arabic
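The change in error rate is reported per character and per word. The slides do not say how these rates were measured; a common definition, assumed here, is the edit distance between the OCR output and the reference text divided by the reference length. The function names below are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def word_error_rate(reference, hypothesis):
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


print(character_error_rate("kitten", "sitting"))    # 3 edits over 6 characters
print(word_error_rate("the cat sat", "the cat sit"))  # 1 edit over 3 words
```

A negative change in either rate, as in the table, means the adapted system makes fewer errors than the baseline.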


Part 3: Evaluation & Future Work


Free collaborative Arabic OCR software

- Online community

- Developers: university students

- Base: Tesseract + HMM


Free collaborative Arabic OCR software

Linux:

- Android

- Ubuntu

- Debian

- Fedora

- Since 2005: 12,000 developers from more than 1,200 companies [8]

Wikipedia:

- 35 million articles in 288 different languages [9]


Free collaborative Arabic OCR software

NEMLAR (Network for Euro-Mediterranean Language Resources) project [10]: (2003-2005)

- Partners: Egypt, Jordan, Lebanon, Morocco, Tunisia, West Bank & Gaza Strip, Denmark, France, Greece and The Netherlands.

- BLARK (Basic Language Resource Kit) for Arabic

MEDAR project [10]: (2008-2010)

- Machine Translation

- Multilingual Information Retrieval for Arabic

- Supported by the European Commission's ICT programme


Free collaborative Arabic OCR software

Limitations

- Lack of interest among other Arab countries in developing free software

- Programmers uninterested in participating in the project

Future approaches

- Crowdsourcing and database

- Text-to-speech


References

[1] Elaydi H, Shehada H. A Source of Inspiration: ATC for Visually Impaired Students at the Islamic University of Gaza[J]. ICTA, 2007, 7: 12-14.

[2] Asebriy Z, Bencharef O, Raghay S, et al. Comparative systems of handwriting Arabic character recognition[C]//Complex Systems (WCCS), 2014 Second World Conference on. IEEE, 2014: 90-93.

[3] Srihari, S. N. Hidden Markov Models. [PowerPoint slides]. Presented at a CSE 574 lecture at the University at Buffalo.

[4] George, M. [no date]. Optical Character Recognition: Classification of Handwritten Digits and Computer Fonts.

[5] Cao, H. et al. (2014). Progress in the Raytheon BBN Arabic Offline Handwriting Recognition. International Conference on Frontiers in Handwriting Recognition.

[6] RWTH-OCR. (2007) Arabic Handwriting Recognition. [online] Available from https://www-i6.informatik.rwth-aachen.de/~dreuw/arabic.php.

[7] Smith, R. [No date]. The Tesseract open source OCR system. [online] Available from http://static.googleusercontent.com/…/pubs/archive/33418.pdf

[8] Corbet, J., Kroah-Hartman, G. and McPherson, A. (2015) The Linux Foundation Releases Linux Development Report. Available at: http://www.linuxfoundation.org/ (Accessed: 25 August 2015).

[9] Safer, M. (2015) Wikipedia cofounder Jimmy Wales on 60 Minutes. Available at: http://www.cbsnews.com/…/wikipedia-jimmy-wales-morley-safe…/ (Accessed: 29 August 2015).

[10] MEDAR, Speech and Language Technologies for Arabic (no date) Available at: http://www.medar.info/index.php (Accessed: 29 August 2015).


Thank You!


Q & A