Sinhala OCR (Digital, Handwritten, & Palm-leaf Text) - eAsia2009 ABS 387-Sinhala OCR
Sinhala OCR (Digital, Handwritten, & Palm-leaf Text)
D. L. Anoj De Silva2
School of Computing, Asia Pacific Institute of Information Technology
(APIIT), Sri Lanka.
Optical Character Recognition (OCR) for Sinhala script has become an area of interest in
recent years, with a number of studies conducted on the subject. This paper uses multi-font,
multi-size digital text, handwritten text and palm-leaf manuscripts as three (3) case studies to
address Sinhala OCR. Each of the three (3) case studies addressed its problem domain by
developing a demo tool. The demo tools were implemented mainly using Artificial Neural
Networks.
Key Words – Sinhala Script, Optical Character Recognition, OCR, Artificial Neural Networks,
Feature Extraction, Image Processing.
Sinhala is an official language of Sri Lanka and is primarily used by the country's ethnic
majority, the Sinhalese. Sinhala script is the principal writing system for the Sinhala language.
Sinhala script derives its orthography from the Brahmi script. The Brahmic scripts are a family
of abugidas (writing systems) used in South Asia, Southeast Asia, Tibet, Mongolia and Manchuria.
Moreover, the Sinhala writing system was influenced by the Pallava Grantha script, which was in
use around the 8th – 10th centuries.
The art of converting human-readable documents into machine-readable and editable ASCII
and/or Unicode files is known as Optical Character Recognition (OCR). Most modern OCR
engines for scripts such as Arabic, Latin, Chinese and Korean are capable of handling
multi-font and multi-size characters. Those engines handle font families such as serif and
sans-serif and different font sizes such as 10, 12 and 14. Our research found no reported
OCR engine for Sinhala script that supports multiple fonts and multiple sizes. It therefore
remains a challenging problem to develop a practical OCR system for multi-font and multi-size
characters contained in a single document. Case study one (1) tries to address this problem.
1 Author of the Case Study 2
2 Author of the Case Study 1
3 Author of the Case Study 3
Case study two (2) addresses the handwritten Sinhala script. A large number of organizations
in Sri Lanka deal with data acquired in the form of Sinhala handwriting. Handwriting is a
major source of input for most organizations where data is collected using hand-filled forms
such as registration forms, tax forms, visa forms and census forms. Currently all collected
data must be entered into information systems manually for processing and
storage. The manual data entry process is extremely time-consuming and error-prone. These
organizations would benefit greatly from a system that could convert handwritten Sinhala
script directly to electronic text.
The third (3) case study addresses palm-leaf manuscripts. Most Sri Lankan historical records,
such as medicinal potions, Buddhist dharma and astrological data, are written on palm leaves.
Over the past two thousand years, most valuable information has been recorded in palm-leaf
manuscripts. Most of these palm leaves are nearing the end of their natural lifetime or are
facing destruction. Some OCR systems have been created for palm-leaf manuscripts, but not for
Sinhala script.
3.0 Sinhala Script
Indic languages primarily belong to two major linguistic families, Indo-Aryan and Dravidian.
In Sri Lanka the majority language is Sinhala, which belongs to the Indo-Aryan family.
Sinhala script uses the abugida system. An abugida is a writing system in which each
vowel-consonant letter represents a pure consonant accompanied by a specific vowel; the vowels
are indicated by modification of the consonant sign, either by means of diacritics or through a
change in the form of the consonant (Daniels and Bright, 1996, p4). To indicate a consonant
with a different vowel, symbols are added around the base symbol: before, after, above or
below it. In some cases, modifiers are placed on both sides of the consonant. This feature
makes OCR a complex task for Indic scripts.
According to Gunasekara (1891, p.3), there are two types of alphabets in Sinhala. They are
the Elu alphabet and the Mixed Sinhala alphabet. The Sinhala alphabet used at present
differs from both the Elu alphabet and the mixed alphabet. The contemporary Sinhala
alphabet consists of a total of 60 letters. It is made up of 18 vowels, 40 pure consonants and
the Anusvaraya and Visargaya. Some researchers consider that there are 41 pure consonants in
the contemporary Sinhala alphabet (Premaratne & Bigun 2002).
4.0 Optical Character Recognition (OCR)
An OCR system consists of several stages such as preprocessing, segmentation, feature
extraction and character recognition. The objective of the preprocessing stage is to enhance
the image quality and prepare the image for further processing. The output of the OCR system is
highly influenced by the preprocessing module. Activities such as thresholding, noise
removal, skew correction and background line removal are usually conducted within the
preprocessing stage. For OCR only the foreground image is required. Extracting the
foreground from the background of an image is known as thresholding (binarization). Noise
removal is usually conducted on the image after undergoing thresholding. Noise can be
defined as any unwanted information contained in a digital image. Noise in document images
can be caused by certain attributes of the scanner, improper tuning of scanning parameters,
texture of the source and type of implement used to produce the characters.
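As an illustration of the noise removal step, the following is a minimal sketch of a median filter written in plain Python. It is not taken from the demo tools; the function name `median_filter`, the 3 x 3 default window and the representation of an image as a list of rows of grayscale values are assumptions made for the example.

```python
def median_filter(image, size=3):
    """Replace each pixel by the median of its size x size neighbourhood.

    `image` is a list of rows of grayscale values. Border pixels use only
    the neighbours that fall inside the image (an assumed border policy).
    """
    height, width = len(image), len(image[0])
    radius = size // 2
    filtered = []
    for y in range(height):
        row = []
        for x in range(width):
            window = sorted(
                image[yy][xx]
                for yy in range(max(0, y - radius), min(height, y + radius + 1))
                for xx in range(max(0, x - radius), min(width, x + radius + 1))
            )
            # Middle element of the sorted window (upper median for even counts).
            row.append(window[len(window) // 2])
        filtered.append(row)
    return filtered


# A single dark speck on a uniform light background is removed.
page = [[200] * 5 for _ in range(5)]
page[2][2] = 0
cleaned = median_filter(page)
```

A median filter suits isolated speckle noise because one outlier in a window cannot move the median, whereas it would shift a mean.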
The goal of segmentation is to break down a set of characters into smaller entities prior to the
recognition process. To reach the final output of segmentation, the group of characters should
be segmented into lines, words and finally individual characters. Character recognition should
not be directly conducted on raw segmented characters because characters of different sizes
and the large number of input variables can cause problems for pattern recognition systems.
Feature extraction is used after segmentation to transform raw character images into a
smaller and consistent number of variables known as features. After the feature extraction
stage character recognition is conducted. The objective of character recognition is to
successfully recognize a character using the extracted character features.
5.0 Case Study 1: Multi-font and Multi-size Sinhala OCR
Our research shows that the only Sinhala OCR engine available and reported as of 2009,
developed by UCSC, is capable of single-font and single-size character recognition, but
cannot recognize multi-font and multi-size Sinhala characters at once.
In reality, however, most documents contain at least two fonts and at
least two font sizes. This case study presents a practical scheme for multi-font and
multi-size character recognition for Sinhala script using an Artificial Neural Network (ANN),
which demonstrates that multi-font and multi-size Optical Character Recognition can
be applied to Sinhala script as well.
Optical images containing multi-font and multi-size Sinhala vowel characters are taken
as the input to the system. Each image passes through pre-processing steps such as
grayscale dilation and median filtering, and is then converted to a binary image using a
global threshold value. As the first step, the image goes through a grayscale dilation process
to reduce unwanted color detail. Noise filtering is then applied to reduce noise to a certain
extent: techniques such as median filtering, sharpening and smoothing are used to enhance
the image quality so that the OCR system produces a reasonable overall output. The grayscale,
noise-reduced image is then converted to a binary image, as shown in Figures 1, 2 and 3. This
binary image consists of just two pixel values (black and white). An intensity value is chosen
as the threshold; pixels with values higher than the threshold are marked as 1 (black pixels)
and pixels with lower values are marked as 0 (white pixels). This process helps
to separate the objects in an image from their background (Gonzalez & Woods, 2002).
Figure 1: RGB image with background color
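The global thresholding rule described above can be sketched as follows. This is a minimal illustration rather than the demo tool's code: the function name `binarize`, the default threshold of 128 and the `text_is_brighter` flag are assumptions. The default follows the convention stated in the text (values above the threshold are marked 1); the flag covers the common case of dark ink on a light page, where the comparison is reversed.

```python
def binarize(gray, threshold=128, text_is_brighter=True):
    """Convert a grayscale image (list of rows) to a 0/1 image.

    With text_is_brighter=True, pixels above the threshold become 1,
    as described in the text. With False, pixels below it become 1.
    """
    binary = []
    for row in gray:
        if text_is_brighter:
            binary.append([1 if value > threshold else 0 for value in row])
        else:
            binary.append([1 if value < threshold else 0 for value in row])
    return binary


sample = [[10, 200], [130, 90]]
marked_bright = binarize(sample)
marked_dark = binarize(sample, text_is_brighter=False)
```

A single global threshold works when the background is uniform; unevenly lit or stained pages generally need a locally adaptive threshold instead.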
Once the image is pre-processed, the segmentation process has to be completed. Most of the
time an optical image contains multiple text lines. Therefore a Horizontal Projection Profile
is used to segment the text lines, as in Figures 4 and 5. “The projection profile gives valleys of
zero height for these OFF pixels between the text lines. Segmentation of the image into
separate lines is done at these valley points” (Reddy & Krishnamoorthi 2008).
Figure 4: Image with multiple text lines
Figure 5: Horizontal Projection Profile of
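Line segmentation with a horizontal projection profile can be sketched as below. The sketch assumes a binary image stored as a list of rows in which 1 marks a text pixel; the function names `horizontal_profile`, `nonzero_runs` and `segment_lines` are illustrative and do not come from the demo tool.

```python
def horizontal_profile(binary):
    """Count the text pixels (value 1) in every row of the image."""
    return [sum(row) for row in binary]


def nonzero_runs(profile):
    """Return (start, end) pairs of consecutive non-zero entries; end is exclusive."""
    runs, start = [], None
    for index, count in enumerate(profile):
        if count > 0 and start is None:
            start = index                      # a text line begins
        elif count == 0 and start is not None:
            runs.append((start, index))        # a zero-height valley ends the line
            start = None
    if start is not None:
        runs.append((start, len(profile)))     # text reaches the last row
    return runs


def segment_lines(binary):
    """Cut the image at the valleys of the horizontal projection profile."""
    return [binary[top:bottom] for top, bottom in nonzero_runs(horizontal_profile(binary))]


page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 1],
]
lines = segment_lines(page)
```

The sketch relies on at least one completely empty row between text lines, so skewed or overlapping lines would need skew correction or a more tolerant valley test first.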
After segmenting the text lines of an optical image, the characters/glyphs are segmented by
applying Vertical Projection Profiles to the segmented text lines, as in Figure 6.
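The same idea applied column-wise gives a sketch of character segmentation. As before, a text line is assumed to be a list of rows of 0/1 values, and the names `vertical_profile` and `segment_glyphs` are illustrative only.

```python
def vertical_profile(line):
    """Count the text pixels (value 1) in every column of a text line."""
    return [sum(column) for column in zip(*line)]


def segment_glyphs(line):
    """Cut a text line into glyph images at empty columns."""
    glyphs, start = [], None
    # The trailing 0 is a sentinel that closes a glyph touching the right edge.
    for x, count in enumerate(vertical_profile(line) + [0]):
        if count > 0 and start is None:
            start = x
        elif count == 0 and start is not None:
            glyphs.append([row[start:x] for row in line])
            start = None
    return glyphs


text_line = [
    [1, 1, 0, 0, 1],
    [1, 0, 0, 1, 1],
]
glyphs = segment_glyphs(text_line)
```

A cut at every empty column treats a vowel modifier that is separated from its base consonant by a gap as its own glyph, and touching characters come out as one; both cases would need extra handling in a full system.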
Once the characters are segmented, each character is extracted into a box with the
width and height of the respective character. These isolated
characters/glyphs are then resized to a specific image size (250 x 250) so that each contains
only the specified amount of pixel data. This normalization process addresses the main
objective of this case study, which is multi-font and multi-size character recognition. A
size-invariant, shape-invariant, constant-size character/glyph image would be the ideal solution
to make the feature extraction more generalized (Shatil A.S.M. & Khan, M., 2006).
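The normalization step can be sketched as cropping each glyph to its bounding box and resizing it to a fixed square. The source does not say which interpolation method the demo tool uses, so nearest-neighbour resizing is an assumption here, as are the function names.

```python
def crop_to_bounding_box(glyph):
    """Trim empty rows and columns around the text pixels of a 0/1 glyph image."""
    rows = [y for y, row in enumerate(glyph) if any(row)]
    cols = [x for x in range(len(glyph[0])) if any(row[x] for row in glyph)]
    if not rows:
        return glyph  # blank image: nothing to crop
    return [row[cols[0]:cols[-1] + 1] for row in glyph[rows[0]:rows[-1] + 1]]


def resize_nearest(image, size=250):
    """Resize to size x size by nearest-neighbour sampling (assumed method)."""
    height, width = len(image), len(image[0])
    return [
        [image[y * height // size][x * width // size] for x in range(size)]
        for y in range(size)
    ]


def normalize(glyph, size=250):
    """Bounding-box crop followed by a resize to a fixed square."""
    return resize_nearest(crop_to_bounding_box(glyph), size)


glyph = [
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
small = normalize(glyph, size=4)
full = normalize(glyph)
```

Stretching every bounding box to a square removes size differences between fonts but also changes the aspect ratio of narrow glyphs, which is a trade-off of this kind of normalization.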
The concept behind this feature extraction method is to create an abstract image of a
character/glyph from the total pixel data (250 x 250 pixels) obtained after the normalization
process. A sample of an abstract image created by the prototype is shown in Figure 8.
Figure 8: Before sampling (250 x 250 pixels) and after sampling to 25 x 25 pixels
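One way to reduce a 250 x 250 glyph to a 25 x 25 abstract image is to average non-overlapping 10 x 10 blocks, which yields 625 features per glyph. The source does not state the exact sampling rule used by the prototype, so block averaging is an assumption, and `downsample` is an illustrative name.

```python
def downsample(image, factor=10):
    """Average non-overlapping factor x factor blocks of a square 0/1 image.

    Returns a flat feature list read row by row; each value is the
    fraction of text pixels in one block.
    """
    blocks_per_side = len(image) // factor
    features = []
    for block_y in range(blocks_per_side):
        for block_x in range(blocks_per_side):
            total = sum(
                image[block_y * factor + dy][block_x * factor + dx]
                for dy in range(factor)
                for dx in range(factor)
            )
            features.append(total / (factor * factor))
    return features


tiny = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
]
tiny_features = downsample(tiny, factor=2)
blank_features = downsample([[0] * 250 for _ in range(250)])
```

The resulting feature vector has a fixed length regardless of the original font or size, which is what the classifier needs as input.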
Finally, a feed-forward neural network trained with the back-propagation algorithm for
supervised learning is chosen for the training and recognition process.
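The following is a minimal sketch of a feed-forward network with one hidden layer trained by back-propagation. The paper does not report the layer sizes, activation function, learning rate or error function of the demo tool, so sigmoid units, squared error, a learning rate of 0.5 and the class name `FeedForwardNet` are all assumptions. The two four-value patterns stand in for real 625-value glyph feature vectors.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class FeedForwardNet:
    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        # Each weight row has one extra entry that acts as the bias.
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)] for _ in range(n_out)]

    def forward(self, x):
        xb = list(x) + [1.0]
        hidden = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in self.w1]
        hb = hidden + [1.0]
        output = [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in self.w2]
        return hidden, output

    def train_step(self, x, target, lr=0.5):
        """One back-propagation update; returns the squared error before the update."""
        hidden, output = self.forward(x)
        xb, hb = list(x) + [1.0], hidden + [1.0]
        # Output deltas for squared error with sigmoid units.
        d_out = [(o - t) * o * (1 - o) for o, t in zip(output, target)]
        # Hidden deltas: output deltas propagated back through w2.
        d_hid = [
            h * (1 - h) * sum(d_out[k] * self.w2[k][j] for k in range(len(d_out)))
            for j, h in enumerate(hidden)
        ]
        for k, row in enumerate(self.w2):
            for j in range(len(row)):
                row[j] -= lr * d_out[k] * hb[j]
        for j, row in enumerate(self.w1):
            for i in range(len(row)):
                row[i] -= lr * d_hid[j] * xb[i]
        return sum((o - t) ** 2 for o, t in zip(output, target)) / 2

    def predict(self, x):
        _, output = self.forward(x)
        return output.index(max(output))


# Two toy feature vectors, one per class, with one-hot targets.
samples = [([1, 0, 0, 1], [1, 0]), ([0, 1, 1, 0], [0, 1])]
net = FeedForwardNet(n_in=4, n_hidden=3, n_out=2)
first_error = sum(net.train_step(x, t) for x, t in samples)
for _ in range(500):
    last_error = sum(net.train_step(x, t) for x, t in samples)
```

For character recognition the output layer would have one unit per character class, and the predicted class is the unit with the highest activation.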
As mentioned above, the demo tool developed to recognize multi-font and multi-size Sinhala
characters/glyphs demonstrates that multi-font and multi-size OCR for Sinhala script
is feasible. This achievement takes Sinhala OCR technology to a new level through the use
of Artificial Neural Networks.
6.0 Case Study 2: Offline Sinhala Handwriting OCR
OCR and Handwriting Recognition for Sinhala script have attracted a significant amount of
attention in recent years. Analysis of existing research reveals that most of the efforts
focus on a limited subset of Sinhala characters and on recognizing constrained and well-defined
handwriting. The significant lack of research in the area of unconstrained Sinhala
handwritten script recognition prevents the existing research and development attempts from
being useful in any realistic environment.
Handwriting recognition is the task of transforming a language represented in its spatial form
of graphical marks into its digital representation (Plamondon and Srihari 2000, p.64). The
ultimate handwritten script recognition system should be able to recognize unconstrained
writing produced by any writer, deal with different writing styles and languages and remain
unaffected by the size of the vocabulary. But developing such a system remains a challenge
due to the complex nature of handwriting. Some of the factors contributing to this complexity
are writer dependency, various writing styles, similar looking characters, nature of the input
signal and vocabulary.
In this research, an offline Sinhala handwriting recognition system was developed, trained and
tested using handwritten names collected from National Identity Cards (NICs) of Sri Lankan
citizens that contain Sinhala script. Since names of individuals contain a majority of
contemporary Sinhala characters, names were chosen as the domain to test the demo tool.
Since NICs contain names written in unconstrained Sinhala script and are readily available,
the NIC was selected as the medium for acquiring handwritten Sinhala names.
As in case study 1, this research uses thresholding and noise removal
techniques for image preprocessing. After preprocessing, segmentation and feature extraction
methods are applied. Finally, an ANN is used as the classifier for character recognition (Figure 9).
Figure 9: Major steps of the system
The use of an ANN provided a considerable level of accuracy for handwriting recognition, but
test results suggested that there is room for improvement. Recognizing all Sinhala characters
falls into the category of large vocabulary problems. For such problems, utilizing knowledge of
the lexicon is a recommended method of increasing the performance of the system. A
technique such as the Hidden Markov Model (HMM) can be integrated into the system to increase
its overall accuracy. Koerich et al. (2002, p.99) and Marinai et al. (2005, p.31) have used
hybrid classifiers of ANN and HMM to increase the accuracy of handwriting recognition.
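The hybrid idea can be illustrated with a small Viterbi decoder in which the per-character scores of a classifier act as emission probabilities and character-to-character transition probabilities, which in practice would be estimated from a lexicon, act as the HMM part. This is a generic sketch under those assumptions and not the method of the cited works; the function names, the smoothing floor and the toy probabilities are hypothetical, and Latin letters stand in for Sinhala characters.

```python
import math

FLOOR = 1e-6  # assumed smoothing value for events missing from the tables


def log_prob(p):
    return math.log(max(p, FLOOR))


def viterbi(scores, start, transitions):
    """Return the most probable character sequence.

    scores:      one dict per character position, mapping character -> classifier score
    start:       dict mapping character -> probability of starting a word
    transitions: dict mapping (previous, current) -> transition probability
    """
    best = {c: (log_prob(start.get(c, 0.0)) + log_prob(p), [c]) for c, p in scores[0].items()}
    for position in scores[1:]:
        step = {}
        for cur, p in position.items():
            # Best way to arrive at `cur` from any character at the previous position.
            prev_score, prev_path = max(
                ((s + log_prob(transitions.get((prev, cur), 0.0)), path)
                 for prev, (s, path) in best.items()),
                key=lambda item: item[0],
            )
            step[cur] = (prev_score + log_prob(p), prev_path + [cur])
        best = step
    return max(best.values(), key=lambda item: item[0])[1]


# The classifier alone slightly prefers "a" at the second position,
# but the transition table says "a" is rarely followed by "a".
scores = [{"a": 0.9, "b": 0.1}, {"a": 0.55, "b": 0.45}]
start = {"a": 0.5, "b": 0.5}
transitions = {("a", "a"): 0.1, ("a", "b"): 0.9, ("b", "a"): 0.5, ("b", "b"): 0.5}
decoded = viterbi(scores, start, transitions)
```

In this toy case the decoder returns the sequence "a", "b", overriding the classifier's marginal preference at the second position because the sequence-level evidence is stronger.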
7.0 Case Study 3: Palm-leaf Manuscript OCR
Palm leaves were once a popular writing medium, especially in the Asian region. These
manuscripts are created by first carving characters or letters into the dried palm leaf using
a metal stylus. Next, the contrast, legibility and visibility of the carving are improved
by applying lampblack with coconut oil or another aromatic oil that has insect-repellent
qualities. The lifetime of palm-leaf manuscripts is not as long as that of artificial
materials. They face destruction from causes such as dampness, fungus and insects.
The destruction of palm-leaf manuscripts leads to the risk of losing the wealth of ancient
knowledge contained within them. OCR systems can be used to preserve the knowledge contained in
these manuscripts more efficiently than a manual process.
The domain would be vast if the selected topic were to address an area such as medicine or
dharma. To avoid those difficulties, the scope is narrowed to address only horoscopes
written on palm leaves.
Sinhala palm-leaf horoscopes have three main parts in which characters are written: two cages
marked with the Zodiac sign and the Nawanshakaya; a description written in Sanskrit using
Sinhala script, which describes the time at which the person was born and related details;
and a description in Sinhala of the person's astronomical details. The proposed system
considers only the two cages and the Sinhala description. Due to the difficulty of identifying
and recognizing them, compound characters, touching characters and the Sanskrit description
are not addressed within the domain.
Even though the basic system is complete, many enhancements are needed before it can be used
as a product. Some are listed below.
- A capability for acquiring data directly within the system (integrated scanning facility).
- A fully automated image pre-processing stage.
- A capability to segment overlapped lines and overlapped characters.
- A capability to match recognized text against the possible words in a lookup table.
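The last enhancement, matching recognized text against a lookup table, could be approached with an edit-distance search, sketched below. This is one possible design and not part of the described system; the function names, the distance limit of 2 and the transliterated zodiac names used as lookup entries are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + (char_a != char_b)  # substitute or match
            ))
        previous = current
    return previous[-1]


def closest_word(word, lookup_table, max_distance=2):
    """Return the nearest lookup entry, or None if nothing is close enough."""
    best = min(lookup_table, key=lambda entry: edit_distance(word, entry))
    return best if edit_distance(word, best) <= max_distance else None


# Hypothetical lookup table of transliterated zodiac names.
lookup_table = ["mesha", "mithuna", "kataka"]
corrected = closest_word("mesna", lookup_table)
rejected = closest_word("xxxxxxxx", lookup_table)
```

Because the horoscope cages draw on a small, closed vocabulary, even a simple nearest-entry search of this kind can repair isolated character errors from the recognizer.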
This paper uses multi-font, multi-size digital text, handwritten text and palm-leaf manuscripts
as three case studies to address Sinhala OCR for the first time. Each of the three (3) case
studies addressed its problem domain by developing a demo tool. Except for the palm-leaf OCR,
the systems show satisfactory results. However, improvements could be made in the
preprocessing, segmenting, feature extraction and character recognition stages to improve the
overall accuracy of all three systems. Recognizing all Sinhala characters falls into the
category of large vocabulary problems. For such problems utilizing the knowledge of the
lexicon is a recommended method.
Our heartfelt gratitude goes out to our project supervisor, Mr. Balachandran Gnanasekaraiyer,
for the vital encouragement and guidance he provided us at all times. We would also like to
thank our project assessors, Mr. Gamindu Hemachandra and Ms. Jina R. Daluwatta, for the valuable
feedback they gave us during the important stages of this project. The support given by the
academic staff, lab administrators and library staff at APIIT is deeply appreciated.
Daniels, P.T., Bright, W., 1996. The World's Writing Systems. 1st ed. New York: Oxford
Gonzalez, R. C., Woods, R. E., 2002. Digital Image Processing. 2nd ed. Pearson Education,
Gunasekara, A. M., 1891. A comprehensive Grammar of the Sinhalese Language. Asian
Educational Services, New Delhi.
Koerich, A.L., Leydier, Y., Sabourin, R., Suen, C.Y., 2002. A hybrid large vocabulary
handwritten word recognition system using neural networks with hidden Markov
models. In: Eighth International Workshop on Frontiers in Handwriting Recognition,
August 6-8 2002 Ontario Canada. 99-104.
Marinai, S., Gori, M., Soda, G., 2005. ‘Artificial Neural Networks for Document Analysis
and Recognition’, IEEE Transactions on Pattern Analysis and Machine Intelligence,
vol. 27, no. 1, pp. 23-35.
Plamondon, R., Srihari, S. N., 2000. ‘On-Line and Off-Line Handwriting Recognition: A
Comprehensive Survey’, IEEE Transactions on Pattern Analysis and Machine Intelligence,
vol. 22, no. 1, pp. 63-84.
Premaratne H.L & Bigun J. 2002, ‘Recognition of Printed Sinhala Characters Using Linear
Symmetry’, The 5th Asian Conference on Computer Vision, 23-25 January 2002,
Reddy N.V.S. & Krishnamoorthi 2008, ‘Hierarchical Recognition System for Machine Printed
Kannada Characters’, IJCSNS International Journal of Computer Science and Network
Security, vol.8 no.11, pp 44-53.
Shatil A.S.M. & Khan, M., c.2006, Minimally Segmenting High Performance Bangla Optical
Character Recognition Using Kohonen Network, Computer Science and Engineering,
BRAC University, Dhaka, Bangladesh.
Bala is a lecturer at the School of Computing at the Asia Pacific Institute of Information
Technology, Sri Lanka, and is a consultant to the ICT Agency of Sri Lanka for the Tamil
language. He was responsible for the standardization of Tamil encoding, collation and keyboard.
Moreover, he has been working in ICT localization for the last 4 years and is a member of the
Local Language Working Group (LLWG) at the ICT Agency.
D. L. Anoj De Silva, Nikeshala Wickramaarachchi, and Tashila Kannangara
Anoj, Nikeshala and Tashila are graduates of the APIIT city campus, holding the B.Sc. (Hons)
in Computing, specialized in Software Engineering, which is affiliated to Staffordshire
University, UK. Currently Anoj is working as an Associate Software Engineer at Virtusa
Corporation, Nikeshala as an intern at Unilever (Pvt) Ltd and Tashila as an E-Marketing
Executive at Archmage (Pvt) Ltd.