Tamil OCR using Tesseract OCR Engine

NAME: K. Balamurugan
REG NO: 12370002
COURSE: M.Sc. (Comp. Science)
SEMESTER: IV Semester

Under the guidance of Dr. K.S. Kuppusamy

TAMIL (TAMIL OCR USING MULTIDIMENSIONAL INTERACTIVE LEARNING MODEL)

Outline:

Introduction to OCR
Abstract
Software Requirement
Analysis
Design
GUI Design
Conclusion

What is OCR? Optical Character Recognition, usually abbreviated to OCR, is the conversion of scanned photos or images of typewritten or printed text into machine-encoded, computer-readable text.

Introduction to OCR

Types of OCR: Basically, there are three types of OCR. Two of them are briefly discussed below:

Handwritten Text OCR

Handwritten text is text that a person produces by writing with a pen or pencil on paper and that is then scanned into digital format using a scanner.

Machine Printed Text OCR

Machine-printed text is common in daily use. It is produced by printing processes such as offset, laser, inkjet and many more. This project comes under this category.

Introduction - Type of OCR

TAMIL (Tamil OCR using Multidimensional Interactive Learning model) is Optical Character Recognition software that converts machine-printed text into editable text. It mainly targets the Tamil language and uses the Tesseract OCR Engine.

The Tesseract OCR engine has been trained for the Tamil language, so this software can convert an image into Tamil text; it uses the Tamil tessdata files to map each character in the image to text. When only a limited range of fonts has to be recognized (a single font, for instance), a single training page might be enough. Tesseract was originally designed to recognize English text only. Efforts have been made to modify the engine and its training system so that they can deal with other languages and UTF-8 characters.

TAMIL OCR USING MULTIDIMENSIONAL INTERACTIVE LEARNING MODEL

Abstract

Tesseract-OCR is one of the most widely used open source OCR engines in the world. It currently supports English by default, along with a few more languages, and it is a command-line tool. The Tesseract OCR Engine is flexible in that it can be trained for any language. In this project an application is developed to train the OCR for the Tamil language. More training images and effort are needed to produce the training data. Since Tesseract-OCR is a command-line tool, it is rarely used by beginners, which limits its usage. Here a GUI was developed for Tesseract-OCR, which lets people use this OCR easily. The Tamil OCR GUI is multidimensional and interactive. It uses the eSpeak TTS engine to convert Tamil text into voice, so it is very beneficial for blind people.


Software Requirements

Tesseract 3.02 OCR Engine
JTessBoxEditor
SerakTesseractTrainer
Visual Studio 2010
.NET Framework 4.0
VB.NET
Notepad++ (Text Creator)
Text2Image command line
Ghostscript 9.0
WebsitesScreenshot.dll
Windows Image Acquisition (Wiaaut.dll)
Hunspell Spell Checker
NHunspellExtender
eSpeak
eSpeak data files

Requirements


MILE Lab Tamil language Text-To-Speech (TTS) Engine
Google Transliteration API
Google Speech Service
Tamil and English spell check dictionaries
Microsoft Speech API (SAPI) 5.4
Tessdata with Tamil trained data file
Web browser component
VC++ Runtime
Microsoft Office Interoperability Word service
Tamil fonts and Tamil keyboard software
Windows operating system (from Windows XP to Windows 8)

Requirements

User Software Requirements

VC++ Runtime 2010
.NET Framework 4.0

Hardware Requirements:

Minimum 260MB RAM, preferably 600MB RAM
700MB free hard disk space
Modem for internet connection

Software Requirements - Tesseract OCR Engine

The Tesseract engine was originally developed as proprietary software at Hewlett Packard labs in Bristol, England and Greeley, Colorado between 1985 and 1994, with some more changes made in 1996 to port to Windows, and some migration from C to C++ in 1998. A lot of the code was written in C, and then some more was written in C++. Since then all the code has been converted to at least compile with a C++ compiler. Very little work was done in the following decade. It was then released as open source in 2005 by Hewlett Packard and the University of Nevada, Las Vegas (UNLV). Tesseract development has been sponsored by Google since 2006. Tesseract does not come with a GUI and is instead run from the command-line interface.


Tesseract was in the top three OCR engines in terms of character accuracy in 1995. It is available for Linux, Windows and Mac OS X; however, due to limited resources, only Windows and Ubuntu are rigorously tested by developers.

Tesseract up to and including version 2 could only accept TIFF images of simple one column text as inputs. These early versions did not include layout analysis and so inputting multi-columned text, images, or equations produced a garbled output. Since version 3.00 Tesseract has supported output text formatting, hOCR positional information and page layout analysis. Support for a number of new image formats was added using the Leptonica library. Tesseract can detect whether text is monospaced or proportional.



The initial versions of Tesseract could only recognize English language text. Starting with version 2 Tesseract was able to process English, French, Italian, German, Spanish, Brazilian Portuguese and Dutch. Starting with version 3 it can recognize Arabic, English, Bulgarian, Catalan, Czech, Chinese (Simplified and Traditional), Danish, German (standard and Fraktur script), Greek, Finnish, French, Hebrew, Croatian, Hungarian, Indonesian, Italian, Japanese, Korean, Latvian, Lithuanian, Dutch, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak (standard and Fraktur script), Slovenian, Spanish, Serbian, Swedish, Tagalog, Thai, Turkish, Ukrainian and Vietnamese. Tesseract can be trained to work in other languages too
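Because Tesseract is run from the command line, a GUI front end mostly has to assemble the right arguments and read back the output file. The following is a minimal sketch of that idea in Python, not the project's VB.NET code. It assumes a `tesseract` executable on the PATH and an installed Tamil language pack named `tam`; the function names are made up for this illustration.

```python
import subprocess


def build_tesseract_command(image_path, output_base, lang="tam",
                            executable="tesseract"):
    """Return the argument list for one Tesseract run.

    Tesseract writes the recognized text to <output_base>.txt.
    The executable name is an assumption about the local installation.
    """
    return [executable, image_path, output_base, "-l", lang]


def run_ocr(image_path, output_base, lang="tam"):
    """Run Tesseract and return the recognized text.

    Requires Tesseract and the language data to be installed.
    """
    subprocess.run(build_tesseract_command(image_path, output_base, lang),
                   check=True)
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()
```

For example, `build_tesseract_command("page.tif", "page")` gives the arguments for recognizing `page.tif` in Tamil and writing the result to `page.txt`.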

Each box results in a component which represents a single character. The box has an x- and y-coordinate, a width and height and the character which has been recognized in the region.

The mistakes made by Tesseract in this detection have to be corrected manually. If the font is not very common, every character may be wrong. Several tools, called "Tesseract box editors", make this job much easier; the one I used is jTessBoxEditor. jTessBoxEditor is written in Java and is therefore platform independent, and it has merge and split options that come in handy. Once the mistakes are corrected, the training file can be created.
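A box file is plain text, so it can also be inspected with a few lines of code. The sketch below assumes the Tesseract 3.0x box format, in which each line holds the symbol followed by the left, bottom, right and top pixel coordinates and a page number (the page number is absent in the 2.0x format). Width and height are derived from those coordinates. The names `Box`, `parse_box_line` and `box_size` are invented for this example.

```python
from collections import namedtuple

# One entry of a box file: the symbol and its bounding box.
Box = namedtuple("Box", "symbol left bottom right top page")


def parse_box_line(line):
    """Parse one box-file line such as 'க 10 20 40 60 0'."""
    parts = line.split()
    if len(parts) not in (5, 6):
        raise ValueError("unexpected box line: %r" % line)
    symbol = parts[0]
    numbers = [int(value) for value in parts[1:]]
    # Older box files carry no page column; treat them as page 0.
    page = numbers[4] if len(numbers) == 5 else 0
    return Box(symbol, numbers[0], numbers[1], numbers[2], numbers[3], page)


def box_size(box):
    """Return (width, height) of a box in pixels."""
    return box.right - box.left, box.top - box.bottom
```

A box editor such as jTessBoxEditor shows the same values graphically and lets the user correct the symbol or the coordinates.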

Software Requirements - Box Editor

jTessBoxEditor is a box editor and trainer for Tesseract OCR, providing editing of box data of both Tesseract 2.0x and 3.0x formats and full automation of Tesseract training. It can read images of common image formats, including multi-page TIFF. The program requires Java Runtime Environment 6.0 or later.

Software Requirements - jTessBoxEditor

http://sourceforge.net/projects/vietocr/files/jTessBoxEditor/

http://code.google.com/p/tesseract-ocr/

http://www.oracle.com/technetwork/java/javase/downloads/index.html

Text-To-Speech Engine: Introduction to TTS

Text-to-speech software, or a speech synthesis program, is used to read text out loud to a user. eSpeak can be used for this purpose. Some TTS engines use natural human voices to avoid sounding like a robot while reading a text to the user. There are also a few different free TTS engines available online where the user can choose the voice in which the text is read. The one I used is the MILE Lab Tamil language Text-To-Speech Engine.

Software Requirements - TTS

eSpeak

eSpeak is a compact open source software speech synthesizer for Linux, Windows, and other platforms. It uses a formant synthesis method, providing many languages in a small size, and it has also been used by Google Translate.

eSpeak is derived from the "Speak" speech synthesizer for British English for Acorn RISC OS computers, which was originally written in 1995 by Jonathan Duddington. A rewritten version for Linux appeared in February 2006 and a Windows SAPI 5 version in January 2007. Subsequent development has added and improved support for additional languages.

In my project eSpeak is used as the offline Tamil text-to-speech engine. It supports speech in multiple languages.
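eSpeak can be driven from the command line, which is one simple way for a GUI to speak recognized text offline. The sketch below only builds the argument list. It assumes an `espeak` executable on the PATH and the Tamil voice `ta`; the function name is made up for this illustration and is not part of the project.

```python
def build_espeak_command(text, voice="ta", wav_path=None,
                         words_per_minute=None, executable="espeak"):
    """Return the argument list for speaking `text` with eSpeak.

    -v selects the voice, -s the speed in words per minute and
    -w writes the speech to a WAV file instead of playing it.
    """
    command = [executable, "-v", voice]
    if words_per_minute is not None:
        command += ["-s", str(words_per_minute)]
    if wav_path is not None:
        command += ["-w", wav_path]
    command.append(text)
    return command
```

The resulting list can be passed to `subprocess.run` on a machine where eSpeak is installed; writing to a WAV file corresponds to the WAV output format offered by the GUI.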

Software Requirements - eSpeak


Microsoft Speech API (SAPI)

The Microsoft text-to-speech voices are speech synthesizers provided for use with applications that use the Microsoft Speech API (SAPI) or the Microsoft Speech Server Platform. There are client and server versions of Microsoft text-to-speech voices. SAPI is used to produce the interactive speech feature in the project.

Software Requirements - Microsoft Speech API (SAPI)

http://en.wikipedia.org/wiki/Microsoft_Speech_API

Hunspell

Hunspell is a spell checker and morphological analyser designed for languages with rich morphology and complex word compounding and character encoding, originally designed for the Hungarian language.

Hunspell is based on MySpell and is backward-compatible with MySpell dictionaries. While MySpell uses a single-byte character encoding, Hunspell can use Unicode UTF-8-encoded dictionaries. The Hunspell spell checker needs two files:

1. Dictionary file with .DIC extension
2. Affix file with .AFF extension
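To give a feel for the dictionary file, the sketch below reads the entries of a .DIC file into a set and checks words against it. It is a deliberate simplification: the first line of a .DIC file holds the approximate entry count, each following entry is a word optionally followed by `/` and affix flags, and the affix rules in the .AFF file are ignored here, so inflected forms are not recognized. This is not how Hunspell or NHunspell work internally; the function names are invented for this example.

```python
def load_dic_words(lines):
    """Collect the base words from the lines of a Hunspell .DIC file."""
    words = set()
    for index, raw in enumerate(lines):
        entry = raw.strip()
        if not entry:
            continue
        if index == 0 and entry.isdigit():
            continue  # first line: approximate number of entries
        entry = entry.split("\t", 1)[0]   # drop optional morphological data
        words.add(entry.split("/", 1)[0])  # drop affix flags
    return words


def is_known(word, words):
    """Return True if the word is a base entry of the dictionary."""
    return word in words
```

With a UTF-8 encoded Tamil dictionary, the file would be opened with `encoding="utf-8"` and its lines passed to `load_dic_words`.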

Software Requirements - Hunspell

Ghostscript is a suite of software based on an interpreter for Adobe Systems' PostScript and Portable Document Format (PDF) page description languages. Its main purposes are the rasterization or rendering of such page description language files, for the display or printing of document pages, and the conversion between PostScript and PDF files.

Ghostscript is open source. It is used in the project to convert PDF files to images; the PDF OCR feature relies on it.
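The PDF-to-image step can be sketched as a Ghostscript command line that renders every page of a PDF to a PNG file. This is an illustration under assumptions, not the project's actual call: the executable is assumed to be `gs` (on Windows the console executable is typically `gswin32c` or `gswin64c`), and the output file pattern and the function name are made up for the example.

```python
def build_ghostscript_command(pdf_path, output_pattern="page-%03d.png",
                              dpi=300, executable="gs"):
    """Return the argument list that renders each PDF page to a PNG image.

    %03d in the output pattern is replaced by Ghostscript with the
    page number, so a three-page PDF gives page-001.png to page-003.png.
    """
    return [
        executable,
        "-dNOPAUSE",            # do not wait for a key press between pages
        "-dBATCH",              # exit after the last page
        "-dSAFER",              # restrict file access from the document
        "-sDEVICE=png16m",      # 24-bit colour PNG output
        f"-r{dpi}",             # rendering resolution in dpi
        f"-sOutputFile={output_pattern}",
        pdf_path,
    ]
```

Each rendered page image can then be passed to the OCR engine like any other input image.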

Software Requirements - GhostScript

Existing System

The existing Tesseract-OCR supports English by default and also supports languages such as Dutch, Spanish, Italian, French and German. Tesseract has been trained for all these languages, but languages such as Tamil and Malayalam have received little training. There is no GUI available for Tesseract in Tamil, and training Tesseract is a big task that even intermediate users find complex. Since the Tesseract OCR Engine is a command-line tool, the OCR is used much less than it could be.

Analysis

Some OCR GUIs have been built on the Tesseract OCR Engine, but they do not have much support for the Tamil language.

Some GUI tools are listed below:
VietOCR
Tesseract-OCR QT4 gui
Lime OCR

A few online services:
CustomOCR
Free OCR
i2OCR (supports the Tamil language, but with very low accuracy)

Analysis-Existing System

The proposed system

The proposed system has to be a GUI, and it is named "Tamil OCR GUI". It is trained for the Tamil language using the Tesseract training procedures. It will be very beneficial for blind people as well as other users. The inputs to the OCR engine are:

Sample Tamil training images
Data files
Tamil dictionary
Final Tamil trained data

I use the programming language VB.NET because it satisfies all the needs: extensibility, simplicity, interoperability, portability, powerful data structures and Unicode support. The GUI was developed in VB.NET as a threaded application.

Analysis- proposed system

The GUI in this project consists of a web browser and a dedicated OCR GUI. The browser captures the images of a web page and extracts all their Tamil and English text using the Tamil OCR component.

The proposed system supports the following languages:

1. Tamil
2. English

The user can switch between these languages and easily OCR Tamil and English images using the proposed system.

Analysis- proposed system

The Tesseract OCR Engine processes a page as follows:

At the first stage, outlines of the text are gathered, by nesting, into blobs. These blobs are organized into text lines, and the lines and regions are analysed for fixed-pitch or proportional text. Text lines are broken into words differently according to the kind of character spacing: fixed-pitch text is chopped immediately by character cells, whereas proportional text is broken into words using definite spaces and fuzzy spaces. Recognition then proceeds as a two-pass process.

In the first pass, an attempt is made to recognize each word in turn. Each word that is recognized satisfactorily is passed to an adaptive classifier as training data, so the adaptive classifier gets a chance to recognize later text more accurately. A second pass is then run over the page for the words that were not recognized well enough in the first pass.

At last a final phase resolves fuzzy spaces, and checks alternative hypotheses for the x-height to locate small cap text. We follow this architecture of Tesseract OCR engine to recognize the Tamil characters.

To train Tesseract for Tamil script recognition, I followed the complete training procedure specified for Tesseract.
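One early step of the Tesseract 3.0x training procedure is to let Tesseract produce a first box file from a training image, which is then corrected by hand. The sketch below builds that command, following the naming scheme for training images documented for Tesseract 3.0x (`[lang].[fontname].exp[num].tif`). The font name used in the example is hypothetical, and the function names are invented for this illustration.

```python
def training_base_name(lang, font, page=0):
    """Return the base name of a training image, e.g. 'tam.myfont.exp0'."""
    return f"{lang}.{font}.exp{page}"


def build_makebox_command(lang, font, page=0, executable="tesseract"):
    """Return the argument list that generates a box file.

    Tesseract reads <base>.tif and writes <base>.box, which has to be
    checked in a box editor before training continues.
    """
    base = training_base_name(lang, font, page)
    return [executable, base + ".tif", base, "batch.nochop", "makebox"]
```

For a hypothetical font called `myfont`, `build_makebox_command("tam", "myfont")` describes the run that turns `tam.myfont.exp0.tif` into `tam.myfont.exp0.box`.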


Tamil OCR Component:

Tesseract OCR Engine trained for Tamil language
Image Extractor
Image Reader
Output Manager
Output Formatter
Text Speaker

Analysis-DFD

Analysis- Training Procedure:

DATA FILES REQUIRED: To train the Tamil language (lang = tam), I have to create 8 data files in the tessdata subdirectory. The 8 files are:

1. tessdata/tam.freq-dawg
2. tessdata/tam.word-dawg
3. tessdata/tam.user-words
4. tessdata/tam.inttemp
5. tessdata/tam.normproto
6. tessdata/tam.pffmtable
7. tessdata/tam.unicharset
8. tessdata/tam.DangAmbigs
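Since training only works when all eight files are present, a small check of the tessdata directory can save time. The sketch below uses the file names listed above; the function names and the idea of running such a check are additions for illustration, not part of the project.

```python
import os

# Suffixes of the eight data files listed for a trained language.
REQUIRED_SUFFIXES = [
    "freq-dawg", "word-dawg", "user-words", "inttemp",
    "normproto", "pffmtable", "unicharset", "DangAmbigs",
]


def required_files(lang="tam"):
    """Return the eight expected file names for a language code."""
    return [f"{lang}.{suffix}" for suffix in REQUIRED_SUFFIXES]


def missing_files(tessdata_dir, lang="tam"):
    """Return the expected files that are not present in tessdata_dir."""
    return [name for name in required_files(lang)
            if not os.path.isfile(os.path.join(tessdata_dir, name))]
```

An empty result from `missing_files("tessdata")` means that all eight Tamil data files were found.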


Analysis- Web OCR Component

Fig: Architecture of Web OCR Component (blocks: Start, Webpage (Web Browser), Image Extractor, Image Reader, Tesseract Tamil Trained OCR Engine, Tamil Trained Dataset, Data, Output Manager, Output Formatter, Web GUI Page, Stop)

1. The Web OCR Component extracts all pictures of the web page and saves them in a single folder called 'OCR Image' (Image Extractor).
2. The Image Reader takes the pictures one by one and sends them to the Tesseract Engine.
3. The Tesseract Engine processes the images and extracts their text, Tamil and English only.
4. The output of the Tesseract Engine is taken by the Output Manager, which maintains the text and images.
5. The Output Formatter puts the images and the corresponding extracted text in a Word document.
6. If the user moves the pointer over any image, steps 1 to 4 are carried out and the Text Speaker component reads the text aloud to the user.


The Web OCR feature needs a web browser; it extracts all images from a particular page and performs OCR on them. Web OCR has the following tools:

Image Grabber
Link Grabber
Web Page to Image Convertor
Web Page to PDF Convertor
Screen Shotter

All the above tools are essential for Web OCR. The Image Grabber extracts all images and saves them to a folder at the location the user specifies.
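The core of an image grabber is finding the image addresses in a page. The sketch below does this for an HTML string with Python's standard library: it collects the `src` attribute of every `img` tag and resolves it against the page address. It is an illustration only, since the project itself works through a VB.NET web browser component; the class and function names and the example address are invented.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageLinkCollector(HTMLParser):
    """Collect the absolute addresses of all <img> tags in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            # Relative addresses are resolved against the page address.
            self.image_urls.append(urljoin(self.base_url, src))


def extract_image_urls(html, base_url):
    """Return the image addresses found in an HTML document."""
    collector = ImageLinkCollector(base_url)
    collector.feed(html)
    return collector.image_urls
```

Downloading each address into the 'OCR Image' folder and passing the files to the OCR engine would complete the grabber.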



Fig: Web browser

One of the components of the project is Web OCR. Because I need a better GUI, a web browser is a good choice: it is the software where most users, and most image files, are found. This browser module has low priority in the project, and not all of the features mentioned below may be implemented; a simple browser with the required features is enough. Developing the web browser requires the following technologies and concepts:

VB.NET
Application programming interfaces (APIs)
Multithreading
Voice library (voice recognition)


Analysis- Tamil OCR GUI

Flowchart of Tamil OCR GUI (blocks: Start, Input Image, Preprocessing, Tesseract Engine, Tamil Trained Dataset, Tamil OCR, Output Text, Post Processing, Spell Checker, Tamil Dictionary, Tamil Data files, Accurate Output Text, Stop)

Analysis- Tamil OCR GUI

Fig: Block Diagram of Full Image OCR (blocks: Input Image, Pre-process, Tesseract OCR Engine, Tamil and English Trained Data Set, Post Process, Tamil OCR GUI Output Text)

Output Formats:

1. PDF
2. MS Word
3. RTF
4. XML
5. WAV
6. MP3
8. HTML
9. Text
10. Single Web page


Fig: Block Diagram of Region Image OCR (blocks: Selected Image Region, Pre-process, Tesseract OCR Engine, Tamil and English Trained Data Set, Post Process, Tamil OCR GUI Output Text)


Fig: Block Diagram of Snapshotting OCR


Fig: Block Diagram of Batch OCR

Analysis- Pre-processing

Fig: Architecture of Pre-processor (blocks: Start, Input Image, Preprocess Manager, Increase dpi (max 300), Convert to black and white, Remove Background (max 300), Remove Inner images (max 300), Preprocessed Image)

Pre-processing is an optional step in Tamil OCR. It is useful for increasing the quality of the image so that the OCR is more accurate.
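The "convert to black and white" step can be illustrated with simple global thresholding: every grayscale pixel at or above a threshold becomes white and every other pixel becomes black. The sketch below works on a plain list of pixel rows so that it needs no imaging library. The fixed threshold of 128 and the mean-based alternative are assumptions made for this example; the method actually used in the project is not stated.

```python
def to_black_and_white(gray_rows, threshold=128):
    """Binarize a grayscale image given as rows of values from 0 to 255."""
    return [[255 if pixel >= threshold else 0 for pixel in row]
            for row in gray_rows]


def mean_threshold(gray_rows):
    """Return the average gray value, a simple image-dependent threshold."""
    pixels = [pixel for row in gray_rows for pixel in row]
    return sum(pixels) / len(pixels)
```

Calling `to_black_and_white(image, mean_threshold(image))` adapts the cut-off to the overall brightness of a scan, which can help with pages that are uniformly dark or light.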

The system is divided into functional modules that are independent of one another. This project has the following modules:

1) Tamil Training to Tesseract
2) Web OCR Component
3) OCR GUI

Design - Modularity

Introduction: The system is designed in two phases:

Preliminary System Design
Detailed System Design

Coupling is the measure of relative interdependence among modules. In the Tamil OCR project coupling is very loose, because any module can easily be added without affecting the other modules, or with only a little modification to the interface calls. Coupling depends on the interface complexity between modules, the point at which entry or reference is made to a module, and what data pass across the interface.

Cohesion is the measure of relative functional strength of individual modules. In the Tamil OCR project cohesion is very high, because each module performs a single task within a software procedure, requiring little interaction with procedures being performed in other parts of the program.

Design

1. Training Tesseract-OCR Task

Tesseract 3.02 is fully trainable. Tesseract 2.0 accepts only TIFF format images, but the latest version includes the Leptonica library, which lets it accept other image formats as well.

The training documentation describes the training process, provides some guidelines on applicability to various languages, and explains what to expect from the results. Tesseract was originally designed to recognize English text only. Efforts have been made to modify the engine and its training system so that they can deal with other languages and UTF-8 characters.


2. Web Tamil OCR Component Development

1. The Web Tamil OCR Component extracts all pictures of the web page and saves them in a single folder called 'OCR Image' (Image Extractor).
2. The Image Reader takes the pictures one by one and sends them to the Tesseract Engine.
3. The Tesseract Engine processes the images and extracts their text, Tamil and English only.
4. The output of the Tesseract Engine is taken by the Output Manager, which maintains the text and images.
5. The Output Formatter puts the images and the corresponding extracted text in a Word document.
6. If the user moves the pointer over any image, steps 1 to 4 are carried out and the Text Speaker component reads the text aloud to the user.


3. Web browser

The main component of the project is Tamil OCR. Because I need a better GUI, a web browser is a good choice, since it is the software where most users are found. This browser module has low priority in the project. Tesseract-OCR is a command-line tool; the project GUI, developed using .NET, lets the user work easily with a graphical user interface. The project targets the development of a web browser component that extracts Tamil characters from images.

The user interface of the OCR GUI contains an open button to select the image file (TIFF format), a recognition button that converts the image to editable text, a font button that selects the font for the output, and a preferences window that holds all preferences. The web browser uses the Tamil OCR component to extract the text of all images, if it is Tamil or English.

The software interfaces used in the project are built in VB.NET, which makes it easy to create programs with a graphical user interface. The project does not make use of any particular protocol, so no specific communication interfaces are needed.


4. OCR GUI

The OCR GUI allows the user to perform various OCR functions more easily in Tamil. This module provides the following features to the user:

1. Whole Image OCR
2. Image Region OCR
3. Snap Shot OCR
4. Snipping Screen OCR
5. Batch OCR
6. Various image processing functionality
7. Good Tamil text editor
8. Easy Tamil transliteration typing (phonetic style)
9. Good Tamil text to speech
10. Image Convertor
11. PDF OCR
12. Various output formats
13. Various input acceptance
14. Easy typing in other languages
15. Email sender
16. Spell checking
17. Snipping and saving
18. Combined OCR document
19. OCR report

Detailed Design

Fig: A few class diagrams of OCR GUI

GUI Design

Fig: Whole Page OCR (Windows 7 and 8 version)
Fig: Whole Page OCR (Windows XP version)
Fig: Region OCR
Fig: Snipping OCR
Fig: Snipping OCR Output
Fig: Batch OCR Windows 8 mode
Fig: Batch OCR Windows XP mode
Fig: Snapshots OCR in Tamil
Fig: Snapshots OCR in English
Fig: Converted to Black and White
Fig: Tamil OCR Help
Fig: Tamil OCR Setting Window
Fig: Tamil OCR Other Setting Window
Fig: Tamil OCR Speak Setting Window
Fig: Tamil OCR UI Setting Window
Fig: Tamil Typing
Fig: Image Preview with Image Properties
Fig: Tamil OCR Print Preview Window
Fig: Tamil OCR Batch OCR Theme Window
Fig: Tamil OCR with Blackly Mode
Fig: Web OCR
Fig: Web Page Capture
Fig: Web Page Link Grabbing
Fig: Image Grabber in Progress
Fig: PDF OCR
Fig: Find and Highlight Feature
Fig: Snipping and Saving Tool
Fig: Batch OCR Context Help
Fig: Auto Cropped Image
Fig: Skewed Image
Fig: De-skewed Image

Tesseract-OCR is made available for the Tamil language
Increases the number of users of Tesseract
The GUI encourages more people to migrate towards open source
It makes it easy for people to work with Tesseract without prior knowledge of it
A web browser can make use of the Tamil OCR component
It is very beneficial for blind people

Advantages of Project

The applications of Tesseract-OCR are:

Newspaper industries
Book publishers
Industries of digital expertise
Web browsers
Web service creation
Web applications
Common people

FUTURE ENHANCEMENT

The accuracy of recognition can be improved by more training
AI applied to make an intelligent snapshotting reader
Image content search
More accurate and powerful spell checker and corrector
Powerful web OCR
Web service
Online web application
Handwritten recognition
Browser extensions
More powerful image preprocessing

Future Enhancement

I hope the result of the project is that Tesseract-OCR is trained for the Tamil language, and that further training can be used to increase its accuracy for Tamil.

This project can increase the number of Tesseract OCR users among Tamil people and encourage most of the Tamil press to migrate towards free and open source software.

The GUI for Tesseract makes it easy for people who have no knowledge of Tesseract to use it properly.

Conclusion

The Tamil OCR Project was a great opportunity for me to experience how software works, to learn about research, and to bring my 2 years of academic study into practice. The 4-month project period was very useful for learning industry-specific practices and standards. I am very confident that this experience will be a great boost and help for my career ahead.

The period under Dr. K.S. Kuppusamy was a great learning experience with very helpful project supervisors. Their evaluation and guidance helped me a lot to learn and understand new techniques and methodologies. Overall, it was a great experience and will greatly benefit me in future.

The biggest advantage of Tesseract OCR is that it is available as open source code. Thus anybody with the interest to study how it works, and the skill to improve it, is able to train it for a new language. In this document, I presented the step-by-step procedure to train the Tesseract engine for printed Tamil text documents. At first we trained Tesseract for a particular font of the English language that had not been supported earlier, performing a series of tests. We then trained Tesseract to recognize the Tamil character set and observed the results. As editing the box file manually is a cumbersome task (this language has a large character set), we tried to generate the box file automatically. We were also able to detect the vowels and consonants of the Tamil character set; however, Tesseract still needs to be trained, in future, for the dependent modifiers and other characters that exist in Tamil documents.


1. The tesseract-OCR project on Google Code is developed by "Ray Smith" and can be accessed at http://code.google.com/p/tesseract-OCR/
2. The procedure for training tesseract-OCR for Indic languages can be obtained from http://code.google.com/p/tesseract-OCR/wiki/TrainingTesseract
3. The tesseract Indic home page is developed and maintained by "Debayan Banerjee" and can be found at http://code.google.com/p/tesseractindic/

REFERENCES


Question
