
POLITECNICO DI TORINO

SCUOLA DI DOTTORATO
PhD Programme in Electronic and Telecommunications Engineering

XXI cycle

Doctoral Thesis

Mobile Reader Device for Blind
Hardware definition and software development

Paolo Motto Ros

Supervisor: prof. Eros Pasero
Coordinator of the PhD programme: prof. Ivo Montrosset

March 2009


Abstract

In this study the design and development of a mobile reader device for blind people are examined. Despite the great interest in this kind of aid, none of the existing solutions has been designed around the basic idea of an interactive device that follows the user's needs and behaviour with very little training effort. Other products are simply not portable or autonomous, are unable to properly handle textual information, or are not suited to operating in real time. By this last feature we mean a modus operandi that processes a continuous stream of images, acquired by custom hardware, while the user moves the device over the text he is trying to read. Indeed, the system has to guide the user in following the text flow, without imposing any particular constraint on the movements, and to give information about the device alignment with respect to the underlying content. It integrates a full-fledged character recognition engine with word-oriented features, all customizable without modifying the main application. Finally, the proper way of communicating all this information to the user has to be defined.

The proposed mobile reader device for blind people has been designed starting from an analysis of both the already available solutions and the results of previous projects. A set of requirements and constraints has then been established with the help of the end users. Two hardware platforms have been evaluated and compared through a feasibility study in order to choose and define the most suitable one. The chosen platform is derived from the same solutions widely adopted in PDAs (Personal Digital Assistants), thus enabling the software to run on other devices as well. A set of custom development tools has been created to support the design of the main application, which has been built around a core cross-platform processing engine. Particular attention has been paid to the text recognition subsystem, trying to improve its performance by exploiting and evaluating different options. The result is a mobile prototype that, thanks to a custom input peripheral moved by the user along the source to be read, is able to process the images, detect the displacement of the text, recognize it and deliver all this information to the user by means of speech synthesis.


Contents

Abstract II

Introduction 1

1 State of the art and related work 4
1.1 Existing solutions 4
1.2 The Haptic project 9
1.3 Final considerations 13

2 The Stiper and Stiper2 projects 14
2.1 Project aim 14
2.2 The Haptic Workshop 17
2.2.1 Hardware 18
2.2.2 Software 20
2.2.3 Conclusions 22

3 System description 24
3.1 Hardware 24
3.2 Software 26
3.2.1 Binarization 28
3.2.2 Segmentation 29
3.2.3 Skew detection 30
3.2.4 Tracking 31
3.2.5 Recognition 32
3.2.6 Data output 33

4 DSP platform 34
4.1 Proposed solution 35
4.1.1 TMS320C6713 DSP overview 36
4.1.2 XC3S1000 FPGA overview 37
4.1.3 System feasibility 39
4.2 System partitioning 44
4.2.1 FPGA implementation of an image processing engine 46
4.2.2 FPGA implementation of neural networks 50
4.3 Final considerations 53

5 SBC platform 58
5.1 ARM platform 60
5.1.1 XScale architecture 63
5.1.2 SIMD coprocessor 64
5.1.3 PXA270 overview 67
5.2 System feasibility 68
5.2.1 Image acquisition 70
5.2.2 Software components 71
5.3 Final considerations 74

6 Software design and development 76
6.1 Development environment 76
6.2 Design and development guidelines 80
6.2.1 Modularity 81
6.2.2 Flexibility 87
6.2.3 Extensibility 92
6.2.4 Testability 95
6.2.5 Efficiency 102
6.3 Conclusion 119

7 Text recognition 120
7.1 Feature extraction 120
7.2 Pattern classification 125
7.2.1 Artificial neural network classifier 126
7.2.2 Performance improvements 134
7.2.3 Classifier selection 147
7.3 Implementation notes 155
7.4 Word-oriented features 158
7.5 Conclusion 160

8 Conclusion 161

9 Future perspectives 165

A Tables 168
A.1 Artificial neural networks configurations 168
A.2 Holdout method, 90% training – 10% validation 172
A.3 Holdout method, 10% training – 90% validation 177
A.4 Stratified two-fold cross-validation method 183

Bibliography 193


Introduction

One of the big problems for blind people is access to printed text. Nowadays more and more information is stored and managed through computers, but in everyday life people cannot always rely upon such devices, e.g. when consulting restaurant menus or, more importantly, the advice sheets supplied with medicines. These are basic examples, but they give an idea that, although modern technologies are becoming more and more present in our lives, there is still a gap between what they are able to do and what we (or, in this case, blind people) need help with. The final aim of this project is precisely to fill this gap. The idea is to study, design and develop a device that, in a natural way, helps the user to read printed text. This research is only a part of wider projects, named Stiper and Stiper2 (STImolazione PERcettiva — Perceptive Stimulation, still underway, described in Chapter 2), funded by INFN (Istituto Nazionale Fisica Nucleare) and developed by ISS (Istituto Superiore di Sanità) in Rome and by the Laboratorio di Neuronica of the Dipartimento di Elettronica of Politecnico di Torino.

There are already some solutions (explored in Section 1.1), but none of them seems to have all the desired features. These issues were partially addressed in a previous project named Haptic (see Section 1.2), which led to the realization of a first prototype. The goal was to design and develop a device enabling the user to read printed text in a flexible and autonomous way, without having to follow a precise sequence of predefined operations (as happens with other solutions), and leaving the user free to explore the information source (e.g. newspapers, magazines, menus, to name a few) under his complete control. It should be something like a finger extension that translates visual information into tactile and/or audio information. Usually a Braille reader quickly moves the fingers over the surface to find the embossed points; with this solution it is instead the device that is moved around, and tactile feedback is given back to the user, who steadily holds his fingertip on a particular actuator. The feeling should thus be the same as reading a Braille book, and it should not require much learning time.

The Haptic prototype has been used as the starting point for the Stiper and Stiper2 projects (the second being the natural continuation of the first), whose aim is to further extend that goal by designing and developing an autonomous and mobile device with the features described above. First of all, those ideas have been validated by real users and experts (see Section 2.2), who provided important feedback, both confirming the value of this kind of device and giving advice on how to improve it. This led to the design of a new prototype, now focused on speech synthesis and with device portability as the main goal (along with some enhanced features).

Designing and developing such a device means, first of all, defining the hardware platform, taking into account some software issues as well. There are different solutions, based either on completely custom systems (Chapter 4) or on existing commercial ones (Chapter 5). Some criteria for their evaluation are discussed in Section 3.1, and that choice heavily affects the application development. In the first case we have the possibility of resorting to hardware/software co-design methodologies, heading for a system which offers a wider range of capabilities in terms of computational power. On the other side, a well-established, already available platform should offer greater stability in the development cycle, allowing the designer to focus his attention mainly on the system features. Before going into the details of those choices, we will examine the system from a global standpoint (see Chapter 3), since it has been subdivided into sub-components, each one addressing a simpler task. Understanding the whole system design is indeed important both to define the hardware side and to develop the software. The latter is discussed in Chapter 6, which states some guidelines on the development cycle, useful to obtain both better code quality and a more manageable solution, and examines how they have been put into practice. Again, we will have to look at the hardware-specific characteristics in order to achieve better performance. Since the development is cross-platform (the host platform being different from the target one), one requirement is, wherever possible, to develop system-independent components. This does not come at no cost, and we will have to cope, for example, with testability concerns, in order to be sure that what we implement is what we intend. Other important points are flexibility and extensibility, to take future improvements into account, to experiment with new features, or simply to test the existing ones and their optimizations. This last point is usually left to the end of the development cycle, but in this case we have to address it from the beginning, in order not to have to refactor all the code later.

After having defined the hardware and developed the software, we will focus our attention on the text recognition abilities of the system. These are based on Artificial Neural Networks (ANN), which are able to identify the symbols extracted from the images. This approach is a flexible and scalable data-driven method that provides enough reliability, but it has to be properly designed. This involves the definition of a feature extractor, of one or more topologies and of training strategies. Since different improvements can be applied to this subsystem, it is important to find a way of comparing such solutions and choosing the best one. All these issues are addressed in Chapter 7. It is also worth noting that one of the requirements is to have a word-oriented system, and the corresponding functions are implemented on top of the character identification process. In this way, along with a multi-lingual spell-checker, the system is able to give feedback to the user about both the single symbol (when identified) and the whole word (once a space between words is detected).
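As an illustration only (a sketch, not code from the thesis), the fragment below shows how such a word-oriented layer could sit on top of the character recognizer, buffering symbols and emitting the whole word once a space is detected; the tiny KNOWN_WORDS set is a hypothetical stand-in for the multi-lingual spell-checker.

```python
# Illustrative only: a word-oriented layer on top of a character recognizer.
# Symbols are buffered and the whole word is reported once a space is seen.

KNOWN_WORDS = {"read", "device", "blind"}      # toy stand-in for a dictionary


class WordAssembler:
    def __init__(self):
        self.buffer = []

    def on_symbol(self, ch):
        """Called for every recognized character; returns (word, ok) on space."""
        if ch.isspace():
            word, self.buffer = "".join(self.buffer), []
            if word:
                ok = word.lower() in KNOWN_WORDS   # word-level feedback to speak
                return word, ok
            return None
        self.buffer.append(ch)                     # symbol-level feedback point
        return None


if __name__ == "__main__":
    wa = WordAssembler()
    for c in "read ":
        out = wa.on_symbol(c)
        if out:
            print("word:", out[0], "| in dictionary:", out[1])
```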

As one can easily see, this project has involved a wide range of engineering aspects: whenever possible every issue has been studied in detail, but there is still room for improving the existing solution. All these concerns are addressed in Chapters 8 and 9. It is important to remember that the main choice made along this research has been to prefer a complete, functional system over a single subsystem that is almost perfect only at the task it was designed for. This trade-off allows us to test the whole proposed solution with the real users and experts (as was done with the previous prototype), who are the only ones able to validate the system.


Chapter 1

State of the art and related work

The problem of accessing text is not, of course, a new one: over time several solutions have been proposed. One of the most used up to the present day is the Braille system, created by Louis Braille in 1821, which represents each character with a set of embossed points forming one cell, arranged on a 3 × 2 grid. In this way it is possible to represent up to 64 different symbols; nevertheless, two cells are sometimes used, e.g. for numbers and capital letters. This technique can even be used to write books, but these are big and expensive, and really suitable only in some cases. The main alternative is speech synthesis, but this has become feasible only in recent decades.
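As a side note (not from the thesis), the 64-symbol limit follows directly from the cell geometry: six binary dots yield 2^6 = 64 distinct patterns. A minimal sketch, assuming the standard dot numbering and a few example letters from English Braille:

```python
# A 6-dot Braille cell as a bit mask (dots 1-2-3 down the left column,
# 4-5-6 down the right column). Six binary dots give 2**6 = 64 patterns,
# which is why a prefix cell is needed for numbers and capital letters.

def cell_from_dots(dots):
    """Return a 6-bit mask for a set of raised dots, e.g. {1, 2, 4} -> 'f'."""
    mask = 0
    for d in dots:
        mask |= 1 << (d - 1)
    return mask

# Illustrative dot patterns from standard English Braille.
LETTERS = {"a": {1}, "b": {1, 2}, "c": {1, 4}, "f": {1, 2, 4}}

if __name__ == "__main__":
    print("distinct 6-dot patterns:", 2 ** 6)               # 64
    print("mask for 'f':", bin(cell_from_dots(LETTERS["f"])))
```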

The Haptic and Stiper/Stiper2 projects started from the idea that with current technology it should be possible to offer the user a richer experience than the one given by existing solutions. As a remark, reading text is not the only goal of these ambitious projects, but it is currently the most developed line of study.

As one can easily see, it is difficult to state the requirements of this device, since part of the problem is to understand exactly how the system should behave. It should be flexible enough to adapt itself to the user, so as to require the minimum amount of training. We have to remember that this is a system designed and developed by sighted people for blind people, so we have to let them test the prototype periodically, in order to evaluate the whole system and obtain important feedback on how to improve it. Therefore, before starting to design a new prototype, a workshop day was organized, where our ideas were compared with the end users' experience.

1.1 Existing solutions

One of the most used aids for blind people is the Braille bar (also called Braille terminal or Braille display), which is a PC peripheral composed of a series of cells, each one able to reproduce a character by means (usually) of piezoelectric actuators. With 80 cells the user is able to read about one line of text at a time, but there are also smaller ones (40 or 20 cells) that are designed with portability in mind. There is also a device named BrailleBook, a mobile device with a wide array of such cells that claims to be a sort of eBook reader, equipped with a Bluetooth connection (to exchange data) and powered by a battery. All these kinds of products are of course useful, but they are not stand-alone devices and they can only handle already stored information. This could be overcome with the use of a scanner and OCR (Optical Character Recognition) software, but again it is not a comfortable solution in many cases, since it is not an immediate way of reading. There is no direct connection between the input peripheral, the requested output and the user: OCR software is usually designed to automate office tasks, and it sometimes requires a trial-and-error process to correctly recognize everything desired. There is software of this type specifically tailored to the needs of the blind, but reaching the desired data is still not immediate, since the user is not always interested in the whole page, yet with those tools he would in any case have to wait for the processing of the entire sheet. Think for example of a blind person who would like to know the total amount on a bill: he does not need all the details, only to read the required information in an easy way.

Figure 1.1: Detailed view of a Braille Bar

Talking about Braille, we also have to mention Braille printers: they perform the same task as traditional printers, but the output is different. There are many models with different features, among which the most important one is the printing speed. Some of them are even able to emboss Braille on both sides of the paper without the two sides interfering with each other. Nevertheless, this is not at all a "real-time" reading approach, since we still need to acquire the data in some way, and almost the same problems seen before in reaching the desired information will arise. Moreover, we also have to consider some text formatting issues: for example, there is no 1:1 translation between Braille characters and the traditional ones, because uppercase symbols (and also numbers) are preceded by special prefix identifiers.

The main alternative to the Braille system is speech synthesis: it is a more immediate and natural way of communication, but again, like the Braille terminal, it is only a means of delivering information. Of course there are differences, which were exploited in the Haptic workshop (see Section 2.2). It has been widely used in software called "screen readers", which read the screen content to the user; this is all the more useful nowadays, when almost all operating systems interact with the user through a graphical user interface by default. The software is able to describe what is under the mouse cursor, so there is a concept of exploring and searching for information. This overcomes some drawbacks of Braille displays, which are inherently more suitable for text-only systems. It is also worth noting that speech synthesis has the advantage of not requiring any specific hardware (only an audio interface, nowadays integrated into almost every digital device), so using this technology should lead to a more compact design. Of course it has some drawbacks, since it is generally implemented entirely in software, and the algorithms that translate text into an audio signal require a certain amount of computational power that, on mobile systems with limited resources, could slow the software down.

The first device designed to let the user explore a sheet of paper in a natural way, under his complete control, with the purpose of enabling him to read text, was the Optacon (OPTical to TActile CONverter). It was developed in 1969 at Stanford University by Linvill and Bliss (and then commercialized by Telesensory Systems Inc.). Before that device there had been many studies and trials to make the reading task easier for visually impaired people, as reported in [1]. Some of them were aimed at making the conversion of text into Braille easier (eventually taking advantage of Braille Grade II and III, which permit contractions), others at finding suitable ways to build automatic tactual readers (by means of movable keys), at designing alternative encodings for embossed character types (the Moon type), and also at the so-called "talking books". The very first electronic reading aid was the Optophone, developed in 1912 by Dr. E. Fournier d'Albe, which was able to reproduce an audible signal according to the source image seen through a scanning slit. These sounds were something similar to chords, each note corresponding to the presence or absence of a black sign (part of a glyph) in a specific position under the "sensor". The user had to hear them and, by listening to a sequence (or, better, a flow) of such audio information, he should have been able to guess the underlying characters. The results were not as promising as expected, since the required training was too hard, mainly due to the difficulty of remembering, but also of distinguishing, so many different sounds and then putting them together to form a single text symbol. Despite these problems, the Optophone was the starting point of other studies aimed at improving this type of device, always with the goal of designing a personal reading instrument. Later efforts were the Battelle reading aid and the Mauch word generating device. This last one had the novel feature of being able to reproduce sounds similar to speech, in practice defining something like a new language. There was also a first attempt (the Haskins speech machine) at developing a text-to-speech machine, which was able to read (in the sense of speaking) a pre-stored text. Nevertheless it was able to speak only a limited subset of the whole language, carefully chosen as the most common one. In conclusion, when the Optacon appeared, it was considered a big improvement, since it enabled the user to read printed text in a useful manner, with only a little training (in comparison to the alternatives, of course).

The basic idea of the Optacon was to equip the user with a little scanner for exploring the paper, with the graphical information reproduced on a small tactile array of actuators. At that time there was no "generally available hardware" usable for such purposes, so they also developed the needed integrated circuits. From the image processing standpoint, it performed an image binarization with a fixed threshold (a parameter tunable by the user), so characters were not translated into Braille ones, but simply represented point by point. That choice is what Linvill and Bliss called direct translation, in contrast with recognition (see [2]). For the tactile transducer they used vibrating piezoelectric reeds, called bimorphs, each one driven by an electronic oscillator circuit connected to the phototransistors of the scanner. These choices were made with the aim of designing a portable device with enough autonomy. They ran tests to verify the quality of the tactile stimulation and to find the best resolution of both the tactile and the photocell arrays in order to enable the user to read text. In particular, Bliss, in [3], found that for the pica character type (1 pica = 12 pt) a photosensor spacing of about 7 mils was needed and, considering the distance between the top of the highest letter and the bottom of the lowest letter, he went for an input device with 24 photosensors. It is interesting to note that Bliss considered the main cause of failure of the previous reading aids to be the insufficient vertical spatial sampling of the symbols, the limit being, as he stated, 12 photocells per character. Using more sensors also helps the user to properly follow text lines; to further ease this task, the Optacon was equipped with a mechanical tracking aid (considered very helpful) to limit vertical movements. Regarding horizontal shifts, they found that a single column of stimulators was not sufficient: it was very difficult for the user to remember the sequence of stimulation and to recognize the character. This was a well-known concern for all devices of this kind, as pointed out in [1], but for the earlier audio ones there was no way of overcoming the problem; indeed the idea of the Mauch machine of using sounds that resemble speech was an attempt to solve this very issue. Bliss and Linvill, through a set of experiments, found that the ability to read increased considerably with four columns, but in the end they preferred to use six columns, also for better fingertip comfort and to be able to represent an entire character. In this arrangement, the rows are spaced at 1 mm and the columns at 2 mm; these distances are smaller than the traditional ones used in the Braille code (2.5 mm in each direction) in order to give a better feeling of continuous tactile images. The device was designed with the idea of exploring the array with only one finger held steadily on the transducer: they also tried a two-finger solution, but it was an improvement only if the user could use both index fingers, which was not possible since the Optacon requires two hands to handle the scanner and the tactile transducer at the same time.
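To make the direct translation idea concrete, here is a minimal sketch (an illustration under simplifying assumptions, not a reconstruction of the Optacon electronics): each sample of a grayscale patch is compared against a fixed, user-tunable threshold and mapped 1:1 onto a 24 × 6 pin map, so glyphs are reproduced point by point rather than recognized.

```python
import numpy as np

ROWS, COLS = 24, 6          # Optacon-like array: 24 rows x 6 columns


def direct_translation(patch, threshold=128):
    """patch: (24, 6) grayscale array, 0 = black ink, 255 = white paper.
    Returns a boolean map of which pins should be raised (True = ink)."""
    assert patch.shape == (ROWS, COLS)
    return patch < threshold


if __name__ == "__main__":
    patch = np.full((ROWS, COLS), 255, dtype=np.uint8)
    patch[4:10, 1:5] = 20                    # a dark blob standing in for ink
    pins = direct_translation(patch, threshold=128)
    print(pins.sum(), "pins raised out of", ROWS * COLS)
```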

The major advantage of the Optacon was that it made the user able to understand not only the text but also the graphics in an active way, since it is the user who decides which part of the paper should be examined. On the other hand, it required many hours of training, and it could not be used continuously for a long period because the vibrations stress the fingertips over time. In comparison, a scanner and a PC equipped with OCR software are more usable for reading a long text. This device is no longer produced, but it seems that a lot of people still use it. This should lead to the conclusion that it was an excellent device, carefully designed around the user's needs, but that the technology available at that time was not sufficient to achieve such an ambitious goal.

Figure 1.2: The Optacon device

Recently (at the beginning of 2008), Kurzweil and the National Federation of the Blind released a new device called "kReader Mobile"2. It is a commercial mobile phone on top of which they developed specific software able to recognize text in images. Reading the available documentation [4], we can see that the user has to take a snapshot of the page; then the software either tries to recognize the text or tells the user to adjust the shot and take a better snapshot. It is not a "real-time" way of reading, since the image processing and the text recognition are still performed off-line. It is equipped with a high-resolution camera (5 Mp) with autofocus and flash, in order to make sure that the text image is detailed enough to perform the recognition: we have to consider that there is a minimum distance between the camera and the subject for focused snapshots, and that an entire page is processed each time. With this modus operandi the software needs a certain amount of time, difficult to estimate, making the user wait for the first results. Afterwards the device communicates everything it has found to the user by means of speech synthesis. Multiple attempts may be needed before getting the right image: the user is blind and it is difficult to frame the entire subject properly, e.g. minimizing perspective issues. It is nevertheless a helpful tool that addresses the problem with a different approach from the one discussed in this study, with all the pros and cons, as we will see later (see Section 2.1).

2 When we started the Haptic project first and then the Stiper one, this device had been neither released nor announced; so it has to be considered a concurrent device, rather than a predecessor one (as in the case of the Optacon).

1.2 The Haptic project

The first project planned to study a solution enabling the blind to read printed text was Haptic (from the Greek word apto, which refers to everything dealing with the sense of touch). It was an Italian national project funded by INFN (Istituto Nazionale Fisica Nucleare) and developed by ISS (Istituto Superiore di Sanità) in Rome and by the Laboratorio di Neuronica of the Dipartimento di Elettronica of Politecnico di Torino, Italy. Its goal was a series of feasibility studies concerning aids for the blind, with particular regard to the access to printed information and the detection of obstacles at short-to-mid range. For the first class of devices the idea was to develop a kind of hand extension, able to acquire and process images continuously and then to give the proper feedback to the user by means of tactile stimulation. The second type of tool can be seen as a "radar walking stick", equipped with a custom hardware/software system able to estimate object distances in order to give the user an idea of the surrounding environment. At the Politecnico di Torino the choice was to focus on the first kind of aid and to split the project into two sub-projects: one dealing with images and the other with text. It was clear from the beginning that a mere reproduction of the images scaled down to the tactile display resolution was not sufficient to make the user understand anything more complicated than simple drawings. On the other hand, for text recognition, completely different algorithms and approaches were needed, thus leading to the project split. Of course both devices could share the same hardware and software platform, and feedback could in any case be given by means of a tactile display (to be characterized and designed), able to reproduce both images and Braille text. From a practical standpoint, the instrument can be seen as a traditional PC mouse with a camera on the bottom side and a tactile transducer on the top side.

In this project the main effort was the software development and the integration of the tactile transducer into the first prototype [5]. As said before, the idea (inspired by the Optacon) was to develop a tool that simulates a finger extension, with the user holding his fingertip steadily on the tactile transducer while moving the device around the paper. It would be the software that moves the Braille information under the fingertip according to the position of the characters in the image acquired by the integrated camera. The perception of the relative movement between the tactile feedback and the fingertip should have been the same as that felt by a user reading a Braille book.

Figure 1.3: The first prototype, used only for development purposes

Fig. 1.3 gives an idea of how the proposed device is used, although it is only a first prototype, built with an old PC mouse and a small webcam inside; it was used only for software development purposes. The first real prototype was a device equipped with a USB camera on the bottom side and the tactile transducer on the top side (see Fig. 1.4). It was designed with the same ideas as the Optacon in mind, but with the valuable addition of the text recognition feature, preferring the recognition approach to the direct translation one (see [2]). At that time there was no hardware capable of supporting such an approach, but nowadays this seems feasible, and thus the user can directly read the Braille character. As also stated in [3], the next desired big leap forward for those reading aids would have been the ability to perform character recognition and then word reconstruction, together with the capability of helping the user follow the text flow (thus avoiding the need for a mechanical tracking aid) and eventually of spelling the recognized characters.

Figure 1.4: Prototype with integrated tactile display

Figure 1.5: Detailed view of the integrated webcam

With all the previously studied solutions [1], the main limit was considered to be the way of encoding the information in a manner acceptable to the user: in many cases those encodings were difficult to understand and, moreover, difficult to remember; using Braille, or speech synthesis, has the advantage of reducing this gap. Thus it is the device that goes towards the user's needs and experience, and not the opposite, where people have to adapt themselves to the machines. Compared to Braille terminals or printers, the proposed solution should have the advantage of being an all-in-one tool, meaning that it does not require any external accessory; having also the (optional) speech synthesis should enable all visually impaired people to use this aid, even those who do not understand Braille.

Figure 1.6: Detailed view of the tactile display

Using a commonly available camera made it possible to have a first software prototype in a shorter time and with good input image quality. In this way the actual field of view (about 17 mm × 13 mm) was increased with respect to the Optacon one (about 6 mm × 3 mm), enabling the software to view 3 lines (10 pt) at the same time. The only issues were how to find the right height from the surface and how to light up the paper so as not to get dark images. Clearly the system is still not able to read wide portions of text, but thanks to the recognition approach the software can "track" the user's movements, thus providing useful information such as the vertical alignment with the current line of text, the switching between lines and the reaching of the end/beginning of a line. The tactile transducer followed the Optacon idea of giving the user a continuous feeling, so the pins of the array are placed at a distance of about 1 mm from each other and arranged in an 8 × 8 matrix. This should be sufficient to stimulate the whole fingertip and also to correctly reproduce Braille symbols. The technology used is based on piezoelectric bars able to raise the required pins. The choice was not to use vibrating elements, in order to avoid stressing the fingertip. Both the camera and the tactile display were connected to a PC and specific software was developed. Speech synthesis was also considered, as an alternative to Braille. It was not possible to deliver all the needed information, i.e. that about the text recognition and that about the movement tracking, to the user through the tactile display alone, so the proposed solution was to use the tactile display only for the text and the speech synthesis for everything else.
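As a quick sanity check (not from the thesis), the claim that the 17 mm × 13 mm field of view covers about 3 lines of 10 pt text can be verified with a little arithmetic; the 1.2× line spacing below is an assumed typical value.

```python
# How many 10 pt text lines fit in the 13 mm field-of-view height quoted above?
# Assumptions: 1 pt = 25.4/72 mm and a typical 1.2x line spacing (leading).

PT_TO_MM = 25.4 / 72          # ~0.353 mm per point
font_pt = 10
line_spacing = 1.2            # assumed typical leading
fov_h_mm, fov_w_mm = 13.0, 17.0

line_h_mm = font_pt * line_spacing * PT_TO_MM      # ~4.2 mm per text line
print(f"lines visible: {fov_h_mm / line_h_mm:.1f}")  # ~3 lines, as stated
```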


1.3 Final considerations

In this chapter we have seen how the problem of enabling the blind to read printed text has been addressed in the past. The most promising aid has been the Optacon (no longer available today), a portable device with a little scanner and a tactile display. The user was free to explore a page and to feel the graphical (and implicitly also the textual) information with the fingertip. Its main drawback was the lack of a proper character recognition functionality. On the opposite side there is the newly designed Kurzweil kReader, but it requires correctly taking a snapshot of the whole page before processing the image. Both have pros and cons, and we will see in the next chapters how freedom of use and text recognition abilities can be combined in a mobile device. This was already partially addressed in the previous Haptic project, but that was only a feasibility study that opened the road to the Stiper and Stiper2 projects, which take this research further.


Chapter 2

The Stiper and Stiper2 projects

The results obtained in the previous Haptic project were encouraging enough to start a whole new project, named Stiper (STImolazione PERcettiva — Perceptive Stimulation), again funded by INFN and in collaboration with ISS (Istituto Superiore di Sanità). Stiper2, still underway, is the natural continuation of the former and also involves Università di Roma 2.

The first thing to do is certainly to collect useful advice on how a new prototype should be designed, and this can only come from the real end users and/or from experts who have a deep knowledge of blindness and its implications. This was done during a workshop day, named Haptic Workshop simply because the proposed solution was the result of the Haptic project.

2.1 Project aim

Stiper and Stiper2 follow the same research line traced by the previous Haptic project. Again, the top goals are enabling the blind to access printed information and to explore the surrounding environment in a comfortable way. Since at the Laboratorio di Neuronica (Dipartimento di Elettronica of Politecnico di Torino) the efforts were focused on the first goal, in these projects too the decision has been to continue developing innovative reading aids for blind people.

The main goal of the Stiper project was to design and develop a second-generation reader device, with portability and autonomy as the key features. The Haptic prototype was the result of a feasibility study, but a real device, useful on a daily basis, should be light, small and not require an external PC to perform the computations. Of course this raises some questions about the tactile feedback, since the transducer is the part that most limits the achievement of such goals. There are already many alternative ways to communicate information to the blind, as described in [6], where different technologies are compared: although the piezoelectric option is still the most used one, there is much interest in MEMS (Micro Electro-Mechanical Systems), electrorheological fluids and electroactive polymers, to name a few. Many issues have to be taken into account, such as the purpose of these transducers (they could be useful not only for blind people [7], and there are many aspects to tactile feedback, not only its 2D layout), the spatial resolution they can achieve, their bandwidth (which includes both the response time and the operating frequency if they vibrate), their force (since the user applies pressure to explore the display), and also the driving requirements, such as current consumption and applied voltage. The ideal characteristics for our application are a high spatial resolution, to be able to give a continuous tactile experience; a high bandwidth, especially in terms of response time, since our display content is highly dynamic and depends on how the user moves the device; low-power driving requirements; and small dimensions. To the best of our knowledge no such transducer exists, but using MEMS and SMA (Shape Memory Alloys) seems the most promising way. The former are electro-mechanical systems (both sensors and actuators) built directly onto a silicon substrate, the same used for manufacturing electronic devices, while the latter are special materials that can modify their shape only under certain conditions.

An example of those technologies applied to our research field is proposed in [8], where plastic MEMS actuators, integrated with organic-transistor active matrices, are used to build a thin, bendable transducer able to reproduce Braille characters. They require a low input voltage, about 3 V, although the gate voltage can reach up to -30 V; in any case this is better than piezoelectric actuators, which require close to 80 V. Thanks to the small footprint of the resulting Braille elements, these can easily be packed to form a row of cells, useful to reproduce an entire word at a time. Nevertheless, they are not as small as needed to develop a graphical tactile display, because each actuator currently measures 4 × 1 mm; moreover their response is not fast enough, 0.9 s when integrated into the cell, even if the stand-alone actuator has a frequency response of about 2 Hz. These results make such a solution well suited to a mobile Braille display, but not to our current needs: we would like to have a higher degree of integration and better dynamic characteristics. Anyway it remains one of the most promising solutions, and there is still a lot of room for improvement, since most of the limitations are mainly engineering factors and not theoretical ones.

In conclusion, the tactile display design is still an open issue, and we are still searching for a suitable solution; but as we saw during the Haptic Workshop (described below), the integration of this output device can be made optional, and other alternatives, i.e. speech synthesis, will be preferred.

Regarding the aforementioned portability issue, this is mainly a matter of properly designing the hardware platform. It could be either a completely new one (see Chapter 4), designed from scratch, or a customization of an already commercial, widely available solution (see Chapter 5). In any case the design concept behind the device is the same as that of a common PDA (Personal Digital Assistant), but targeted at blind people and with text reading capabilities (see Fig. 2.1). Defining new hardware does not only imply choosing the data processing core, but also deciding how to acquire images, and this is influenced by the user's needs and habits. There are mainly two approaches in this field: the first one is to design a paper scanner device, like the Optacon, and the other a camera-like tool, such as the Kurzweil kReader (both described in Section 1.1). The former leads to a more interactive way of exploring the information source, since it allows a "read as you go" use, and the user has an active role in finding useful information. On the opposite side, the latter enables the user to have a global idea of the subject before going into details, but then he has to wait for the system to process the whole image, as it is not done in real time1, before accessing even a small piece of information; there is no feasible way of skipping the unwanted information, and it is not rare that the user has to make several attempts before getting the right image to process. Of course the ideal solution would be a mix of the two, i.e. a real-time camera performing text recognition, but, from a practical standpoint, there is still no hardware/software platform able to support such functionality. It has also to be remarked that it is almost impossible to replicate the human eye (and brain) functions, so we can only help the user to overcome his disability. According to these considerations, we have decided to follow the same development path traced by the Optacon, since we simply believe that it would be more helpful for such an ambitious aim.

Figure 2.1: Concept design of the Stiper/Stiper2 reading aid

1 In this study we will often use the expression "real time" referring to the timing constraints of the processing. Perhaps it is not the most correct term, since the appliance is not mission critical. Anyway, the results have to be computed in a useful time, ideally before the next frame is captured by the camera, in order not to slow down the application. Hence the frequent use, in a loose sense, of that expression.


Besides the hardware, the software also has to be properly developed, since the requirements are quite different in the two approaches outlined above: in the Optacon-like way of exploring the sheet there are constraints, mainly timing ones, which are by far more stringent than in the alternative case. This leads to paying more attention to optimizations from the beginning, while maintaining a high degree of flexibility to further extend or improve the opportunities offered by the reading aid in the future. Although the software is based on the one developed in the previous Haptic project, it has been completely re-designed in order to achieve these new goals. The previous implementation was a feasibility study, with the main goal of making it work, simply running on a common PC. Now we would like to design a true mobile device with the same performance, even though it does not have the same resources as a PC. This is a challenging task (addressed in Chapter 6), and to achieve these goals it is necessary to carefully plan the system architecture and to design a set of development aids, in order to make the implementation process smoother. Moreover, we have to pay a lot of attention to the text recognition features, since this is the core part of the system. These concerns are extensively addressed in Chapter 7 where, besides improving the single-character recognition performance, word-oriented features have been implemented, following the advice received during the Haptic workshop, described in the next section.
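To give a feeling for the timing constraints mentioned above, the small sketch below computes the per-frame processing budget implied by the loose "real time" requirement (finish before the next frame arrives); the frame rates are assumptions, not figures from the thesis.

```python
# Per-frame time budget for the Optacon-like, continuous-stream approach:
# all processing stages should ideally complete before the next frame.
# The frame rates below are assumed, webcam-class values.

for fps in (10, 15, 30):
    budget_ms = 1000.0 / fps
    print(f"{fps:2d} fps -> per-frame budget ~{budget_ms:.0f} ms "
          f"(binarization + segmentation + tracking + recognition + output)")
```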

2.2 The Haptic Workshop

Although the Haptic device described in Section 1.2 was only a prototype, not usable on a daily basis, it was complete enough to be tested by end users, in order to verify the ideas and its real usefulness. Despite being the result of a previous project, this device was used as the starting point of the new Stiper/Stiper2 project. Thus the so-called "Haptic Workshop" (Turin, 12th June 2006) was organized, to which end users coming from different backgrounds were invited, together with a neuropsychiatrist and a tiflopedagogist (a specialist in the education of the blind). After a brief introduction about the whole Stiper project, its roadmap, its history and the context of its development, the device was presented, followed by a long test session. The users only received some basic advice, and then they were free to test it. This was important also to understand how they would use it, since some assumptions made while designing the software needed to be confirmed. After these first sessions all the impressions, comments and hints were collected, in order to improve the system. All the opinions were summarized and divided into two categories: hardware and software. Of course most of them could be seen as criticisms, but it is better to consider them the answer to the question "Where should we go?".


2.2.1 Hardware

The first impression of the device was certainly enthusiastic, both for the new approach to the problem of reading printed text and for its potential usefulness in daily use, because of its flexibility and its design around the user's needs. Of course such a goal is very ambitious and almost impossible to fulfil at the first attempt, so the purpose of this workshop was precisely to expose the flaws of this prototype, especially in a typical daily use, where behaviour and needs vary a lot from user to user. Starting from these considerations, the most discussed characteristic was the ergonomics: it was simply too heavy and big to be moved around the paper in a comfortable way. Ideally it should be passed along the text, advancing from one character to another, but these are very small shifts compared to the overall device dimensions, so it is too easy to skip some symbols. Being so heavy, it has an initial inertia that works against the intended typical use. It has to be noted that almost all the weight and the space inside the prototype is occupied by the tactile transducer2, so it was proposed to split the device into two parts: the first one small and light, with only the camera, and the other with all the remaining hardware, including the tactile display. This configuration resembles the Optacon design. Our concern was to develop a more portable device, also with respect to the Optacon, and designing a "one hand" device seemed to be a valuable feature (useful to read product labels on supermarket shelves, for example), but the users stated that this configuration was not the major drawback of the solution.

2 The first prototype, which had only the input peripheral, could easily fit into a common PC mouse, as seen in Fig. 1.3.

Regarding the tactile display, the major flaw was its inability to represent more than one character at a time. The ideal solution would be the possibility of showing an entire word at a time, thus requiring 8–10 such transducers (we could also think of using Braille Grade II, which allows contractions). This would make the system more similar to a Braille bar, but such a solution would considerably increase the weight and the dimensions of the device, further pushing the development towards a two-hand device. It is important to remark that the issue is partially independent of the technology employed for the tactile display, since the Braille spacing constraints are fixed and it is not possible to have a device that is both able to display more than a few Braille characters and usable with only one hand at the same time. Anyway, recognizing a word character by character requires more attention, and reading long texts could be prohibitive, although acceptable for short content.

Going into the details of the choice made for the tactile transducer, it was confirmed that, despite not being expressly designed to represent Braille symbols, it was able to adequately simulate a traditional Braille cell. The idea of not using vibrating stimulation (as done in the Optacon) is better from the fingertip-stress standpoint, but more care should be taken in the manufacturing of the rising pins, since at present they are not of equal size (the height in particular is important): although this is not a design flaw, it noticeably impacts the user experience. Also the mechanism for lowering such pins, which currently relies only upon the force applied by the fingertip, is misleading, because it is not reliable and in practice it makes the user feel a "dirty" Braille. Finally, it would be desirable to have the option of using 8-dot Braille as well (along with the traditional 6-dot one used in the current prototype), but this would require a re-design of the transducer (clearly more feasible than the word-sized one).

After the above considerations, which state that the current tactile solution is not as useful as expected, we switched to considering the speech synthesis option. In our opinion it was a second choice, used only because we had not found a suitable tactile way of delivering the information about the position and the alignment within the text flow. We thought that audio information could isolate the user and subtract too much attention from the surrounding environment, also believing that such a way of communicating could partially interfere with the text exploration and reading and make them less comfortable. We also considered that listening and reading are very different experiences, the second being more effective. Talking with real users, we have seen that speech synthesis is much more useful than expected, since they reported that no more than 10–15% of the entire blind population is able to read Braille. This is mainly due to the higher average age of blind people compared to sighted people, since we have to consider that most of them have lost their sight through accidents or common diseases in old age. In such cases they are often reluctant to learn Braille and, moreover, fingertip sensitivity decreases in old age, as reported in [1]. Besides this issue, the continuing evolution of computer technology has made good speech synthesis available on many platforms, not only desktop PCs but also mobile phones, making this solution more viable. Contrary to what was previously hypothesized, delivering audio information does not isolate the user: it may be true for sighted people, but for the blind, who have to sharpen the sensitivity of all the other senses, it is not a problem at all. Moreover, using only speech synthesis, and not a tactile display, has the advantage of allowing a more compact and ergonomic design of the aid.

Despite the above considerations, it was also remarked that a tactile displaywould not be useless. Indeed, when reading a challenging text, for example, the userhas to pay more attention to the content, and in these cases the Braille would be moreeffective3. In the end, the tactile display should still be considered, but as an option.Also the word-oriented features, pointed out when we discussed about the tactile

3Sighted people in such situations also usually prefer to read the text themselves instead of listening to someone else.


Also the word-oriented features, pointed out when discussing the tactile transducer, should be applied to the speech synthesis, for the same reasons: they are of a psychological nature and not bound to the specific way of communicating.

Moving towards a word-oriented system, in contrast with the current character-oriented architecture, it would be useful to have a wider portion of the underlying paper as input. It is not a matter of resolution, because the camera already completely fulfills that requirement, but a concern about the device usability. This improvement would allow a better tolerance in the vertical alignment of the device with the text, and eventually the processing of more than one line at a time, in order to minimize the returns to the beginning of a line after having reached the end. It would also improve the overall reading rate. From the hardware standpoint, this feature requires either moving the camera away or changing the lenses with ones guaranteeing a wider angle of view. The former conflicts with the ergonomics requirements, while the latter, in practice, is not as straightforward as expected. Perhaps a good trade-off could be to mount the camera horizontally and use a mirror tilted by an angle of 45°.

2.2.2 Software

The proposed software solution, still in a development stage, proved to be quite effective in recognizing the text, but not yet very usable in a daily context. The first requested enhancement was clearly the word-oriented features, i.e. the possibility of reading an entire word at once instead of letter by letter. The software was already able to partially record the character recognitions, in order to facilitate going backward and re-reading some parts, but this has to be improved and completed with a syntactic analysis.

Another weakness of the system was found in the tracking algorithm, which is in charge of detecting the user movements, in order to properly activate the recognition process and to inform the user about the alignment in the text flow. As explained in Chapter 3, for computational performance reasons we preferred light data structures and functions, but in this case it seems that the problem was over-simplified. This could also be due to an imperfect understanding of the user behaviour or, considering the hardware issues discussed above, to the difficulty of moving the device smoothly. The resulting erratic movements mislead the software, since two successive images become too different to identify the presence of a new character or the switch to a new line. It may be that the current way of addressing this task is not the optimal one, although a better estimation of the operating parameters could improve the performance. It has to be remarked that the skew detection and tracking algorithms share some basic ideas, but in the first case these have proved to be reliable enough, so it should mainly be a matter regarding the latter function. Using an improved image capture peripheral, like the one proposed above, would significantly help the software, since it could handle larger movements, making the whole system more flexible to the user's needs.


Along with these improvements, it could be useful to develop multi-line features, which would make the application process more than one text line at the same time, in order to reduce the need of returning to the beginning of a line to read the next one. This issue was already pointed out in [3], but its implementation would require a complete re-design of the system architecture, especially from the software standpoint, both for technical reasons and to find suitable ways of managing multiple lines. The application should keep a history of the recognized characters over the whole paper, not only the last ones, and for this purpose a suitable (non-trivial) data structure should be designed; the concept of a session should also be implemented, one for each page, started by the user when beginning to examine a paper and stopped at the end of the reading (a minimal sketch of such a structure is given below). It is important to remark that, although these last functionalities are highly desirable, they are not as important as the others, such as the ability to speak/read a whole word at a time.
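
As a purely illustrative aid, the following sketch shows what such a per-page session might look like; it was not part of the prototype, and the class name, the (line, column) keying and the granularity are all assumptions made here for the sake of the example.

#include <map>
#include <string>
#include <utility>

// Hypothetical sketch of a per-page reading session: recognized characters
// are keyed by their estimated (line, column) position on the paper, so the
// history covers the whole page and not only the last recognitions.
class ReadingSession {
public:
    void start() { history_.clear(); active_ = true; }   // user begins a new page
    void stop()  { active_ = false; }                     // user finishes the page

    // Store (or correct) the character recognized at the given position.
    void record(int line, int column, char recognized) {
        if (active_)
            history_[{line, column}] = recognized;
    }

    // Rebuild the text of one line from the recorded characters.
    std::string lineText(int line) const {
        std::string text;
        for (const auto& entry : history_)
            if (entry.first.first == line)
                text += entry.second;
        return text;
    }

private:
    std::map<std::pair<int, int>, char> history_;
    bool active_ = false;
};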

Figure 2.2: Different interpretations of the same tactile display content, when the virtual Braille cell does not have a fixed position.

The last notable requirement expressed by the users was to make the Braille characters always appear at the same position on the tactile display. When we designed this prototype, we followed the idea that the user would steadily hold the fingertip on the transducer, with the software raising/lowering the pins to reproduce the same feeling as directly exploring a Braille text (this idea was supported by the results obtained in [2]). This means virtually mapping a Braille cell on the tactile display according to the position of that character in the input image. If the user slightly shifts the device, either vertically or horizontally, the character will have a different placement, and hence so will its corresponding Braille symbol on the transducer.


We noticed that users were accustomed to continuously moving the index finger in order to guess the content, and this, together with the movement simulated by the software, was confusing them. This was especially true for symbols which differ only in the position of one or two dots relative to the whole character position; an example is the a, which has only one dot in the upper left position and in practice is equal to a comma (which has only one dot in the middle left position) placed higher; this situation is shown in Fig. 2.2. The user senses only the raised pins, but has no idea of where the character should appear inside the display. According to the original idea, such displacement should have helped the user to understand the vertical alignment, but in practice it was highly misleading. Regarding the horizontal shift of the symbols, we have seen that it had the same side effect as the vertical one, but with the additional drawback of being very irregular. This again relates to the idea of reproducing, on the output display, the relative position of the recognized character in the input image: the typefaces in printed text are usually proportionally spaced, i.e. their placement depends upon their footprint, while the Braille symbols can be considered uniformly spaced or, better, monospaced. This is also one of the reasons for the large size of Braille books. The direct consequence is that whenever we have a narrow character, e.g. an i, it will suddenly disappear (on the tactile display) as soon as the user moves the device slightly, and it will be replaced by the next letter the software focuses on4. This issue, along with the feature of partially representing surrounding characters, could confuse the user. We also have to take into account that upper-case letters require two Braille symbols, which amplifies the problem. It is noteworthy that this is not related at all to the transducer characteristics (neither its size nor its resolution), and the only solution is not to move the tactile information on the display, but to show it steadily, like a traditional Braille cell. The user will then rely only on the audio information to follow the text flow.

2.2.3 Conclusions

The aim of this workshop was to find out how to improve our device, and in this regard it has been very useful. Having established that the solution is very interesting and that the approach to the problem is the right one, the improvements discussed above can be summarized as follows:

Hardware The main concern about the device from the physical standpoint is the ergonomics. This implies a stand-alone, small and light device: to give an idea, we can think of a PDA. This requirement leads to the definition of a new mobile hardware platform (as we will see in Chapters 4 and 5), with the tactile display as an option.

4We will see in Chapter 3 that, again for run-time performance reasons, the application processes only one character at a time.


This choice is due to the fact that the existing transducer cannot be integrated, for the reasons seen above, into a new prototype, thus leading to rely solely upon speech synthesis, which is anyway preferred as a first approach.

Software The switch to a new hardware platform will almost certainly require a complete refactoring of the whole software. Since the goal is to design a mobile device, the hardware will almost surely have limited resources, and this shall be taken into account by stating new requirements, defining new development guidelines and making, in certain cases, some trade-offs. Anyway, the most requested feature is the switch from a character-oriented way of communicating to a complete word-oriented one. Along with these major changes, more attention will also be paid to the tracking and recognition algorithms, given that they, especially the latter, are the core part of the system.


Chapter 3

System description

Starting from the study of the already existing solutions (see Section 1.1) and from the research project goals and requirements (see Chapter 2), we have to concretely define the hardware and software platforms. The first will be a completely new solution, since the current Haptic prototype relies upon a common PC to perform all the required elaborations, while the second will be based on the previous experience and the results obtained in the Haptic project (see Section 1.2). Nevertheless even the software will be totally re-designed from scratch, in order to meet the new requirements and to take advantage of the underlying hardware.

Before analyzing in detail the software (Chapter 6) and the hardware platforms (Chapters 4 and 5), it is useful to examine some guidelines leading the choice of the hardware platform, and to analyze how the main application is organized from the data flow standpoint.

3.1 Hardware

As stated before (see Chapter 2), we are looking for a solution that is both portable and autonomous, a small device like a mobile phone, for example. Different options will be analyzed, and to compare them we first have to state a set of requirements:

• Enough computational power

• Low power consumption

• Ease of interfacing and integrating with other devices

• Ease of development.

We need enough computational power to support heavy workloads, such as image processing and text recognition, operations that are performed continuously and within a short time.


We also have to look at the power consumption of such systems, because in the mobile world it can be a driving factor. Typically these hardware platforms are composed of a base module equipped only with the essential parts, e.g. a CPU, a certain amount of memory and only a few other components. Clearly the possibility of expanding the system is an important point, since we will add some interfaces, like the one for image acquisition or the audio one. Although it will mainly work as a stand-alone device, the ease of integration with existing information systems should be taken into account, both for development reasons and to further extend its usefulness. Last but not least, we have to consider the development cycle for such solutions. It can be based either on a hardware/software co-design methodology or on a pure software design, each one with its pros and cons. In any case the software will play an important role, because it is in charge of controlling the whole system and thus determines the effective value of the proposed solution. As we will see later in Chapter 6, there are some guidelines to follow for a better system quality, and, in order to put them into practice, we need to be supported by the tools. The system should also be "open" enough to be easily integrated and/or extended through third-party components, and these are often available only for some common environments.

Basically, we have two kinds of solutions: the first is what is called a "custom solution", where we start from scratch in designing the platform, choosing the components one by one and assembling them together. On the opposite side, we can look at existing, already available "commercial" solutions and, if necessary, customize them. There are of course pros and cons for each choice, but in any case we will have to make some trade-offs. If we go for a completely custom solution, we could design a system with all the desired computational power, but we may have to design every single interface we need as well. We would have to define the software platform from scratch, and take care of every low-level detail, such as all the drivers necessary to use the peripherals. In this way we would do a lot of work only to set up the platform, instead of focusing our efforts on the functional features of the system. On the opposite side, the use of an already developed and well-established platform will enable us to avoid such problems (or most of them), but presumably we will not have all the hardware resources to perform all the computations in a short time, as required. Thus we will have to deal with optimizations from the beginning, side by side with the software design steps.

According to the considerations just outlined, two different solutions will be evaluated. The first one (Chapter 4) is based on a DSP and an FPGA; it offers more opportunities to achieve the project goals from the pure computational standpoint, and, at a first glance, it seems the natural solution. The second one (Chapter 5) is designed around an SBC (Single Board Computer), which is better suited to a mobile environment, and it will be the base of the proposed solution. For both platforms the internal architecture and different engineering aspects will be analyzed in depth, in order to assess their pros and cons, and a preliminary study about the possible implementation of our application will be presented.


Of course the design issues are not related only to the hardware part; they also involve software aspects and, more generally speaking, the whole development cycle.

Besides the core hardware platform, we also have to decide how to acquire images and how to communicate with the user. For the former issue we still use an external camera, following the same approach used in the Haptic prototype (see Section 1.2), repackaged in a more comfortable case and surrounded by a set of white LEDs to properly control the illumination. As we will see in Section 5.2 there are alternatives, so their development is still underway. To communicate with the blind user, following the advice received in the Haptic workshop (see Section 2.2), the tactile display has been temporarily left out (there are alternatives to the current solution, as outlined in Section 2.1, but currently nothing seems compelling) in favor of speech synthesis.

3.2 Software

Figure 3.1: Overall data-flow diagram

From an architectural standpoint, the system can be divided into three parts: the first acquires images from the paper the user is going to read, the second processes the data and feeds the third unit, which translates the resulting information into audio and tactile output. The internal data flow is clearly more complex, and its diagram can be seen in Fig. 3.1. First of all the input data is pre-processed, in order to clean the bitmap and overcome issues such as uneven illumination across the image.


Then some useful high-level data are computed, and these are used to detect the alignment of the device along the text lines, in order to give useful advice to the user (as seen in the previous chapter). This information controls almost all the successive processing stages, for example determining which character image will be analyzed. The recognition subsystem is based on Artificial Neural Networks (ANNs), a detail which needs to be pointed out since, as we will see later, it plays an important role in this design and it should be taken into account also in the hardware platform definition. The actual implementation only works on one character image at a time, for run-time performance reasons. Nevertheless all the results are collected and then processed to form a string, something like a trace of the previous recognitions. This is useful both when the user wants to back-track and when entire words are identified and eventually corrected. All the computed data, both about the text and the device position, are conveyed to the user through a user feedback subsystem. Its role is to abstract the specific underlying way of communicating information to the user by presenting a uniform and common interface. In the first prototype (the Haptic one) there were two of them: one for the tactile display and one for the speech synthesis. In the current prototype there is only the latter, given that the transducer is no longer used.
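
To illustrate the idea of a uniform output interface, the sketch below shows one possible shape of such an abstraction; it is not the actual source code of the prototype, and all class and method names are assumptions made for this example.

#include <iostream>
#include <memory>
#include <string>
#include <vector>

// Common abstract interface hiding the specific output channel, so that a
// tactile display or a speech synthesizer can be plugged in without touching
// the processing stages.
class OutputUnit {
public:
    virtual ~OutputUnit() = default;
    virtual void deliverText(const std::string& text) = 0;   // recognized content
    virtual void deliverAlignment(int verticalOffset) = 0;   // device position hint
};

class SpeechOutput : public OutputUnit {
public:
    void deliverText(const std::string& text) override {
        std::cout << "[speech] " << text << '\n';             // placeholder for a TTS call
    }
    void deliverAlignment(int verticalOffset) override {
        if (verticalOffset > 0) std::cout << "[speech] move down\n";
        else if (verticalOffset < 0) std::cout << "[speech] move up\n";
    }
};

// The feedback subsystem fans the same information out to every registered unit.
class UserFeedback {
public:
    void addUnit(std::unique_ptr<OutputUnit> unit) { units_.push_back(std::move(unit)); }
    void notifyText(const std::string& text) {
        for (auto& u : units_) u->deliverText(text);
    }
    void notifyAlignment(int offset) {
        for (auto& u : units_) u->deliverAlignment(offset);
    }
private:
    std::vector<std::unique_ptr<OutputUnit>> units_;
};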

The description outlined above is a modular view of the whole system, useful for developing the software (discussed in Chapter 6), but to better understand how the system works, it should be examined from a computational standpoint and divided into a sequence of steps. Each of them has to accomplish a simple and well-defined task, take the input from the previous one and output the required data to the next one, ideally in a pipeline fashion. This structure was developed in the Haptic project, and a more detailed description can be found in [5]. During the development of the new mobile device, those results have been verified and the global architecture has still been considered valid; clearly most of the algorithms used to accomplish these sub-tasks have been improved or extended to meet the new requirements.

Following the idea outlined above, the entire data processing has been divided into six steps:

Binarization The input color image is converted into a black/white one; the purpose is to separate the foreground (i.e. the characters) from the background.

Segmentation Each glyph has to be isolated and represented by a smaller data structure, so that the next steps can be faster than processing a raw image.

Skew detection Since it is impossible to require that the user moves the mouse perfectly straight along the text lines, the program has to detect the skew of the image, in order to provide more accurate results.

Tracking It has to be considered that the integrated camera only captures a portion of the sheet at a time; the device has to help the user to follow the text flow.


Recognition In this stage the actual character recognition is performed, and the results are stored together with data from previous frames, allowing the software not to process the whole image each time.

Data output In the end, the computed information is presented to the user.

Everything outlined above has to be done in real time, so it is necessary to use lightweight algorithms and small data structures as much as possible. For these reasons, computations are made only when necessary (e.g. when the current frame differs significantly from the previous one) and each step tries to reduce and simplify the size of the data to be elaborated, giving higher-level information. The software also has to be flexible enough to deal with different types of media (it is easy to note that the paper, ink and so on used by magazines are quite different from those found in newspapers), so an adaptive system has been developed, with multiple configurations selectable by the user.
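
As a minimal sketch of the "compute only when necessary" policy mentioned above (the exact criterion and thresholds used in the prototype are not specified here, so the fraction-of-changed-pixels test below is an assumption):

#include <cstdint>
#include <cstdlib>
#include <vector>

// Returns true when the new frame differs from the previous one by more than
// a tunable fraction of pixels, i.e. when the expensive stages should run.
bool frameChangedEnough(const std::vector<std::uint8_t>& current,
                        const std::vector<std::uint8_t>& previous,
                        double changedFraction = 0.05,
                        int pixelTolerance = 16) {
    if (current.size() != previous.size() || current.empty())
        return true;                                  // nothing to compare against
    std::size_t changed = 0;
    for (std::size_t i = 0; i < current.size(); ++i)
        if (std::abs(int(current[i]) - int(previous[i])) > pixelTolerance)
            ++changed;
    return changed > changedFraction * current.size();
}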

In the following, each of the steps outlined above will be described in detail.

3.2.1 Binarization


Figure 3.2: Example of image binarization: (a) the source image and (b) the processed one

The input color image is first transformed into a grey-scale one and then into a black and white one; this process is often used in OCR systems to highlight the text and eliminate the background. The image is often corrupted by noise (especially in the background portions, which should be white); this is due to non-uniform illumination, the characteristics of the underlying paper, the way the text was printed and so on. It is easy to guess that a newspaper is completely different from a magazine, so we have to make the system flexible enough to cope with all these varieties of information sources.


Usually, the binarization is done with a thresholding technique, where each pixel is compared with a reference value and hence set to black or white. Before such an operation it is common to filter the image in order to clean it (e.g. with Gaussian, median or composite filters for edge enhancement). In our case, this did not provide enough advantages to justify its use. Moreover such operations could take too much time, and so the proposed solution simply relies on an optimized thresholding. There are various ways to achieve this result, roughly divided into local and global ones: the former determine the threshold for a given pixel only by looking at its neighbourhood, while the latter use a unique value for the whole image. Both have pros and cons: the former could be useful to adapt the process in case of non-uniform illumination, but it gives bad results in large background areas, while the latter is in general more effective (it does not create artifacts), although it usually turns part of the corner and border regions black, since those pixels are darker because of weaker illumination1.

The solution can be to design a non-uniform global adaptive thresholding. Non-uniform means that the reference value is not the same across the whole image, but takes into account that the background is darker near the borders and lighter in the center (because of the illumination system, focused on the image center). Adaptive means that the threshold values are not fixed a priori, but corrected by an image-specific factor. Global means that such a factor is a parameter referring to the whole image instead of a local one, and in this case it is the average pixel intensity. The resulting solution is simple and effective (an example can be seen in Fig. 3.2), and all the parameters involved are user configurable.
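
The sketch below illustrates this non-uniform global adaptive thresholding; the radial weighting of the border correction is an assumption (the thesis only states that the reference value is lowered towards the borders and corrected by the average frame intensity), and the parameter values are placeholders.

#include <cmath>
#include <cstdint>
#include <numeric>
#include <vector>

std::vector<std::uint8_t> binarize(const std::vector<std::uint8_t>& grey,
                                   int width, int height,
                                   double baseFactor = 1.0,
                                   double borderDrop = 0.25) {
    if (grey.empty()) return {};

    // Global, image-specific factor: the mean intensity of the whole frame.
    const double mean = std::accumulate(grey.begin(), grey.end(), 0.0) / grey.size();

    std::vector<std::uint8_t> out(grey.size());
    const double cx = 0.5 * (width - 1), cy = 0.5 * (height - 1);
    const double maxDist = std::sqrt(cx * cx + cy * cy);

    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            // Non-uniform part: lower the threshold towards the image borders,
            // where the LED illumination makes the background darker.
            const double dist = std::sqrt((x - cx) * (x - cx) + (y - cy) * (y - cy)) / maxDist;
            const double threshold = baseFactor * mean * (1.0 - borderDrop * dist);
            out[std::size_t(y) * width + x] =
                (grey[std::size_t(y) * width + x] < threshold) ? 0 : 255;  // 0 = ink
        }
    }
    return out;
}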

3.2.2 Segmentation

The raw image obtained in the previous step can be used to effectively isolate possible characters. At this stage it does not matter whether they are letters or other symbols, but they have to be represented by high-level data in order to better understand the composition of the image. This data can be, for example, the position and size of the surrounding rectangle of each glyph, along with its real area (i.e. the amount of black pixels inside). What we have to do is to segment the image; the proposed solution first transforms the bitmap through the Run-Length Encoding (RLE) algorithm and then finds connected components. This is useful both to reduce the memory needed to store the image and to precisely isolate the black regions. Other methods, such as finding vertical lines that separate symbols, cannot be used, because it is not guaranteed that the text lines are perfectly horizontal, and even in that case there would be some corner cases (due to the specific font spacing).

1The illumination is controlled by a set of white LEDs placed around the camera, but it is still difficult to reach optimal results.


In this way all the successive functions can work on information such as "character position", "glyph size" and so on, without bothering about the actual bitmap representation (which will instead be essential for the recognition process).

After this process, a little clean-up is needed, since sometimes there are small noise elements that have been turned black during thresholding and are difficult to remove with traditional morphological operations. It is also possible that near the image boundaries there are little pieces of symbols that could mislead the next routines, so they are grouped together. It is important to remark that at this stage the character recognition process has not yet been performed, so it is incorrect to state a 1:1 relationship between the isolated graphical elements and the textual ones: for example, the letter i is composed of two distinct parts. This is an important issue to be taken into account during the skew detection and the successive tracking phase.
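
The following sketch illustrates the RLE-plus-connected-components idea described above; it is not the prototype code, and the data layout and names are assumptions. Runs of black pixels are extracted row by row and then merged into components, each summarized by its bounding rectangle and area.

#include <algorithm>
#include <cstdint>
#include <vector>

struct Run   { int row, x0, x1; };
struct Glyph { int x0, y0, x1, y1; int area; };

static int findRoot(std::vector<int>& parent, int i) {
    while (parent[i] != i) { parent[i] = parent[parent[i]]; i = parent[i]; }
    return i;
}

std::vector<Glyph> segment(const std::vector<std::uint8_t>& bw, int width, int height) {
    std::vector<Run> runs;
    for (int y = 0; y < height; ++y) {                       // 1. run-length encoding
        int x = 0;
        while (x < width) {
            if (bw[std::size_t(y) * width + x] == 0) {       // 0 = foreground pixel
                int start = x;
                while (x < width && bw[std::size_t(y) * width + x] == 0) ++x;
                runs.push_back({y, start, x - 1});
            } else {
                ++x;
            }
        }
    }

    std::vector<int> parent(runs.size());
    for (std::size_t i = 0; i < runs.size(); ++i) parent[i] = int(i);
    for (std::size_t i = 0; i < runs.size(); ++i)            // 2. join runs that touch vertically
        for (std::size_t j = i + 1; j < runs.size() && runs[j].row <= runs[i].row + 1; ++j)
            if (runs[j].row == runs[i].row + 1 &&
                runs[j].x0 <= runs[i].x1 && runs[i].x0 <= runs[j].x1)
                parent[findRoot(parent, int(j))] = findRoot(parent, int(i));

    std::vector<Glyph> glyphs;                               // 3. one bounding box per component
    std::vector<int> glyphOfRoot(runs.size(), -1);
    for (std::size_t i = 0; i < runs.size(); ++i) {
        int root = findRoot(parent, int(i));
        if (glyphOfRoot[root] < 0) {
            glyphOfRoot[root] = int(glyphs.size());
            glyphs.push_back({runs[i].x0, runs[i].row, runs[i].x1, runs[i].row, 0});
        }
        Glyph& g = glyphs[glyphOfRoot[root]];
        g.x0 = std::min(g.x0, runs[i].x0);  g.x1 = std::max(g.x1, runs[i].x1);
        g.y0 = std::min(g.y0, runs[i].row); g.y1 = std::max(g.y1, runs[i].row);
        g.area += runs[i].x1 - runs[i].x0 + 1;
    }
    return glyphs;
}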

3.2.3 Skew detection

Figure 3.3: Example of graphical elements grouped together for the estimation of the orientation. The surrounding rectangles are the ones determined in the segmentation, while the lines joining their centers are the ones considered during the skew detection stage.

In order to give the user as much freedom as possible, a useful feature is not to require the device to be held perfectly perpendicular to the text lines. The software will rotate the images in order to make the successive tasks easier. The method adopted in our application involves grouping the regions detected previously into lines, so that it is possible to estimate the skew by computing the angle of such hypothesized lines.


This is a common way in OCR systems to accomplish the task, but it is based upon statistical considerations and hence a lot of data should be provided to achieve reliable results. Clearly this is not a problem when the software has to elaborate an entire sheet. In our application we deal only with a limited portion of the paper, so we have to overcome this issue, for example by averaging the estimated skew across multiple frames. This ensures a more stable and smooth image rotation process, but it takes some time to react to a sudden alignment change. Anyway this should not occur, since it is assumed that the user maintains almost the same orientation during the reading.

The algorithm analyzes the set of already isolated graphical elements, and tries to group them together by examining different criteria, such as the horizontal distance, the relative vertical displacement and so on, in order to find the best arrangement. The optimal one is almost impossible to detect, since there is no way to cover all the possible situations; we can only try to achieve a satisfying average behaviour. So it is not important to perfectly rotate the image, but to "de-skew" it enough to allow correct tracking and recognition. An example of how the glyphs are put together to form a text line can be seen in Fig. 3.3 (the surrounding rectangles are the ones obtained by the segmentation).
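
A rough sketch of this estimation, under the stated assumptions (the actual grouping criteria and smoothing factor of the prototype are not reproduced here): the centres of the glyphs grouped into one hypothesized text line are fitted with a straight line by least squares, and the resulting angle is averaged across frames.

#include <cmath>
#include <vector>

struct Center { double x, y; };

// Angle (radians) of the least-squares line through the glyph centres;
// 0 means a perfectly horizontal text line.
double lineAngle(const std::vector<Center>& centers) {
    if (centers.size() < 2) return 0.0;
    double mx = 0, my = 0;
    for (const auto& c : centers) { mx += c.x; my += c.y; }
    mx /= centers.size(); my /= centers.size();
    double sxy = 0, sxx = 0;
    for (const auto& c : centers) {
        sxy += (c.x - mx) * (c.y - my);
        sxx += (c.x - mx) * (c.x - mx);
    }
    return (sxx > 0) ? std::atan2(sxy, sxx) : 0.0;
}

// Averaging across frames, here as a simple exponential smoothing.
double smoothSkew(double previousEstimate, double newAngle, double alpha = 0.2) {
    return (1.0 - alpha) * previousEstimate + alpha * newAngle;
}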

3.2.4 Tracking

Figure 3.4: Example of the tracking step: the symbols are grouped into lines, and inside the central line the middle glyph is selected.


The system has to inform the user about where the device is positioned with respect to the text flow, since, unlike the Optacon, we have preferred not to rely upon mechanical aids to help the user move straight along a text line. The idea is to focus the attention only on a single line at a time, usually the central one. The application will advise the user whenever the device is too near to the top or bottom border of the image, when the user moves the device to a new line or when it reaches the beginning/end of a line. All these functions are managed by the tracking subsystem, which also has the purpose of activating the recognition when needed. Indeed we have preferred, as already stated above, to analyze only the central symbol, and not to perform such a time-consuming task on everything that appears in the image. The software compares the data coming from two consecutive frames: if they are similar, it means that the user has moved the device only a little, so the displacement between the current symbol arrangement and the previous one should be correlated. In practice, the text line nearest to the image center is selected, and its displacement across two consecutive frames is used as the estimate of the vertical shift. The same approach is used for the horizontal one, where the middle symbol inside that central line is used as reference. An example of this elaboration is shown in Fig. 3.4. It is noteworthy that although we are using the "central property" as the criterion for the selection of a line/character inside an image, the software will not always strictly follow this criterion. Workarounds are needed in order to avoid giving erratic results in some situations, for example when two elements are equally distant from the center, because there is the risk of giving unreliable information. This is a good approach, but it has to be improved to correctly handle subtle corner cases: if the user moves the device too quickly, say as much as the distance between two symbols before the capture of a new frame, the software will not detect the switch to a new character to be analyzed. This particular situation has already been addressed, but it is still critical when there are two or more similar consecutive narrow characters, e.g. ll or Il.

The idea of grouping graphical elements into lines is similar to the one used in the previous skew detection stage, although the implemented functions are different. Here we can take advantage of the image orientation estimate and try to produce more accurate results, independently of the physical alignment of the device with respect to the paper.
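
The tracking idea just described can be sketched as follows (an illustration only, with hypothetical types and names, not the routine used in the prototype): the text line closest to the image centre is selected in both frames, its vertical position gives the vertical shift, and the middle glyph of that line gives the horizontal shift.

#include <cmath>
#include <vector>

struct TextLine { double centerY; std::vector<double> glyphCentersX; };

static const TextLine* centralLine(const std::vector<TextLine>& lines, double imageCenterY) {
    const TextLine* best = nullptr;
    double bestDist = 0;
    for (const auto& line : lines) {
        double d = std::abs(line.centerY - imageCenterY);
        if (!best || d < bestDist) { best = &line; bestDist = d; }
    }
    return best;
}

// Returns false when no displacement can be estimated (e.g. an empty frame).
bool estimateShift(const std::vector<TextLine>& current,
                   const std::vector<TextLine>& previous,
                   double imageCenterY, double* dx, double* dy) {
    const TextLine* c = centralLine(current, imageCenterY);
    const TextLine* p = centralLine(previous, imageCenterY);
    if (!c || !p || c->glyphCentersX.empty() || p->glyphCentersX.empty())
        return false;
    *dy = c->centerY - p->centerY;                                   // vertical shift
    *dx = c->glyphCentersX[c->glyphCentersX.size() / 2]              // horizontal shift from
        - p->glyphCentersX[p->glyphCentersX.size() / 2];             // the middle glyph
    return true;
}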

3.2.5 Recognition

This is the central stage of the whole system, since it is in charge of performing the actual text recognition and some related tasks. As already outlined above, not every frame, and at most one symbol per frame, will be submitted to the recognition subsystem: this should ensure better average run-time performance, since it is easy to guess that this is the most time-consuming task compared to all the others.


Designing such a system, we have to take into account many aspects, related to the creation of a flexible and scalable classifier, easy to manage both at run time and in the development phase. All these issues, especially the ones regarding the classifier, will be addressed in Chapter 7. The character recognition mechanism has to be completed with functions that trace the history of the previously identified symbols. This data has to be "synchronized" with the displacement information coming from the tracking subsystem, in order to have a realistic set of data about the text surrounding the current view of the source paper. This also means having the possibility both of correcting a recognition (when it is not successful at a first attempt or when we obtain a result with a higher degree of confidence), and of making easier the implementation of the word-oriented features (discussed in the previous chapter). Clearly the system also has to be able to identify the spaces between words, and all this information is conveyed to a spell-checker to increase the reliability of the results.
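
As a hypothetical illustration of such a history (names, the confidence field and the position tolerance are assumptions, not the prototype design), each entry keeps the recognized character, its confidence and its estimated horizontal position, so that a later, more confident result for the same position can replace the old one and whole words can be rebuilt before being passed to the spell-checker.

#include <cmath>
#include <string>
#include <vector>

struct RecognizedChar { double x; char symbol; double confidence; };

class RecognitionHistory {
public:
    void add(double x, char symbol, double confidence, double samePositionTolerance = 4.0) {
        for (auto& e : entries_)
            if (std::abs(e.x - x) < samePositionTolerance) {
                // Correction of an already recorded character.
                if (confidence > e.confidence) { e.symbol = symbol; e.confidence = confidence; }
                return;
            }
        entries_.push_back({x, symbol, confidence});   // otherwise a new character
    }

    // Concatenate the stored characters (assumed already ordered by position),
    // e.g. to hand the current word to the spell-checker.
    std::string currentWord() const {
        std::string word;
        for (const auto& e : entries_) word += e.symbol;
        return word;
    }

    void clear() { entries_.clear(); }                 // e.g. when a space is detected

private:
    std::vector<RecognizedChar> entries_;
};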

3.2.6 Data output

The final step is to deliver all the information about the text to the user. As explained before, the first prototype used two ways to communicate with the user: a tactile display for the Braille and the speech synthesis. For all the reasons explained in the previous chapter, the former has been temporarily discarded in favor of the latter. This does not mean that it will never be used again, but, before re-integrating it, we need a more suitable hardware solution. Anyway the software has to be able to handle different kinds of communication channels, so this stage has been designed with a two-level approach: the first is a hardware-independent layer that abstracts the low-level details through a common set of functions, and the other collects all the information from the previous stages and sends it to the specific "output units". In this way it should be easy to add a new peripheral without modifying the whole system2.

Since we have decided to basically rely only upon the speech synthesis, we also have to deal with some minor issues, e.g. giving the proper priorities to the information, since it is possible that an event occurs while the system is still speaking about a previous one. Another aspect to take into account is how to manage the communication between this subsystem and all the rest. It could be done in a synchronous or in an asynchronous way, with pros and cons for each one, but these questions are platform specific and cannot be planned in advance. Ideally the asynchronous mode should be preferred, but this leads to developing a multi-threaded application, which is not always a desirable option on embedded/mobile appliances.
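
The prioritisation issue can be sketched as a simple priority queue of messages, as below; the priority levels and the speak-next interface are assumptions made here for illustration, not the actual implementation.

#include <queue>
#include <string>

struct SpokenMessage {
    int priority;              // higher value = more urgent (e.g. alignment warnings)
    std::string text;
    bool operator<(const SpokenMessage& other) const { return priority < other.priority; }
};

class SpeechQueue {
public:
    void push(int priority, const std::string& text) { queue_.push({priority, text}); }

    // Called whenever the synthesizer becomes idle; an urgent alignment
    // warning therefore overtakes the reading of previously recognized text.
    bool speakNext(std::string* out) {
        if (queue_.empty()) return false;
        *out = queue_.top().text;   // here a real system would call the TTS engine
        queue_.pop();
        return true;
    }

private:
    std::priority_queue<SpokenMessage> queue_;
};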

2This is mainly a software development issue; currently the software still has to be re-compiled in such situations, because there is not a true set of "external APIs" disjoint from the main source files.


Chapter 4

DSP platform

There are many different solutions for designing this kind of system, but we have to start by examining the characteristics and requirements, from the designer's standpoint, in order to find the most suitable one. First of all, we have to consider that we have a continuous flow of data, coming from the image sensor, that has to be elaborated in a useful time. All the first steps, as seen in Section 3.2, always perform the same computations on a large amount of data. In this phase we use data-intensive algorithms such as filters and so on, which can also be optimized by exploiting some data- or instruction-level parallelism. The first technique means identifying a subset of the same sequential operations which can be "applied" at the same time to different portions of the whole input data in an independent way. Thinking about a function that processes an image (where both the input and the output are arrays of data), this requires that the algorithm is structured in such a way that a single output value does not depend on the other outputs, so that we are able to process more than one pixel per iteration. This is the basis of vector CPUs. Taking advantage of instruction-level parallelism instead means grouping different operations to be executed in parallel, or even entire tasks to be run at the same time. The latter option is not feasible in this context, since to properly take advantage of this method we should have a multi-core CPU1. Even with only one core, parallelism can be obtained in numerous ways, such as having a CPU with more than one pipeline or with a superscalar out-of-order pipeline. All of them are well-known techniques, and we can also consider that we always apply the same functions to every video frame in input, without changing the program over time. From these clues we can certainly go for a DSP (Digital Signal Processor), since DSPs are designed exactly for these tasks.

1Of course a multi-core solution is not the only option, since we can look at CPUs with SMT (Simultaneous Multi-Threading), which are able, in some implementations, to perform as fast as a dual-core system even with only one physical core. But in the embedded world this is still not a viable solution.


They generally have a Harvard architecture, with distinct buses for the data and code spaces allowing a better throughput, a pipeline optimized to perform data-intensive applications, a memory architecture designed for streaming data, using DMA extensively, and also an ISA (Instruction Set Architecture) which includes some instructions (directly supported by the underlying hardware) often encountered in number-crunching functions (e.g. MAC, Multiply and ACcumulate).

Besides such considerations, we can also consider directly implementing some algorithms in hardware, for example some image pre-processing routines or even the artificial neural networks. We could develop something like "custom processors" able to perform specific tasks: they would have the advantage of being perfectly tailored to the problem domain, thus delivering higher performance. This is achievable only if we have at our disposal "programmable hardware", and FPGAs2 are the most suitable components for such a design. The idea is to shift the most computationally demanding routines from the software side to the hardware one, through a co-design methodology. Nowadays FPGAs offer a lot of resources, enough flexibility and also some specific special cells, like DSP blocks or CPU cores, to implement almost the entire system on the chip. Not being tied to a specific processing architecture, we could design such a co-processor to support the main processor where it cannot guarantee the desired performance. Last but not least, they also offer the opportunity of re-designing (or adding) some sub-components later with almost no modification of the prototype.

4.1 Proposed solution

Figure 4.1: The Orsys micro-line C6713CPU

2We could also think about ASICs, but it is not reasonable to use them in this research project.


Before starting from scratch in designing this type of system, we looked around in order to find an already developed module, and we selected the micro-line C6713CPU [9], which is equipped with a Texas Instruments (TI) TMS320C6713 DSP, a Xilinx Spartan3 XC3S1000 FPGA and 64 MB of SDRAM. The DSP is based on a VLIW RISC architecture that provides 8 instruction units operating in parallel, yielding a maximum performance of 2400 MIPS and 1800 MFLOPS. Up to 256 kB of internal memory and a two-level cache architecture (64 kB L2 cache, 4 kB L1 program cache and 4 kB L1 data cache) guarantee the memory bandwidth required to sustain a high data throughput. Furthermore, two multi-channel buffered synchronous serial ports, an I2C bus interface, two timers, an enhanced DMA controller and a 16 bit wide Host Interface are built into this processor. An important feature of this DSP is the possibility of performing floating-point arithmetic natively, according to the IEEE 754 standard, both in single and double precision. This uncommon peculiarity makes it possible to leave the already developed routines almost unchanged, while obtaining a significant speedup in code execution. This could be especially useful for the artificial neural networks, for example, but in general it grants the developer the maximum flexibility in number-crunching algorithms.

4.1.1 TMS320C6713 DSP overview

As previously stated, the Texas Instruments TMS320C6713 is a VLIW DSP, where VLIW stands for Very Long Instruction Word; it is an architecture made to take advantage of Instruction-Level Parallelism (ILP): the processor is able to perform more than one operation per cycle, but these have to be specified in the instruction word, which is composed of smaller atomic commands. In this way the parallelism has to be exploited by the compiler3, which is not always able to perform such a task, because the source code is anyway a sequential program. The developer has to follow some coding patterns in order to take full advantage of this architecture, and sometimes he still has to resort to assembly language. Analyzing the reference guide [10] and the datasheet [11], we can see that the reported performance figures are more theoretical than real, since the 2400 MIPS value is the multiplication of the maximum operating frequency, 300 MHz, by the number of functional units, 8. These units are grouped by 4, so we can say that the DSP has two data paths (sometimes called side A and side B). They are identical, each with its own set of 16 general-purpose registers, and there is a cross path which allows data exchange between the two sides, although this introduces a latency. The functional units can operate at the same time, but they are not equivalent:

3In other architectures, such as the most common ones in the desktop PC world, it is the CPU itself that is in charge of identifying the parallelism in the code flow and then, thanks to the out-of-order execution technique, of arranging the issue order to properly take advantage of the hardware resources.


each one (within a given data path) is specialized in a particular type of computation. This means that each unit only supports a very limited subset of the whole instruction set. The core fetches and executes one instruction packet (256 bits wide) at a time, and hence the compiler is in charge of properly filling each corresponding "slot" (32 bits wide) in the instruction word, trying to take advantage of all the resources. One drawback of this architecture is the density of the resulting code, i.e. the amount of memory occupied by a program in relation to the quantity of operations really executed. If the instruction parallelism is not properly exploited, some instruction packets will be filled with NOPs, wasting memory with useless data. So TI has made a distinction between fetch packets and execute packets, the former referring to the data read from the program memory and the latter being the code effectively executed. This means that inside a fetch packet it is possible to have more than one execute packet, and each execute packet can span multiple fetch packets. It is obvious that these situations should be avoided, because in the former case some units will be left idle anyway, and in the latter there will be a memory access penalty. The other drawback is the strong dependency on the compiler's ability to keep all the core units busy: leaving them idle also means under-using the CPU from the computational power perspective.

Another important characteristic of DSPs is the load/store architecture, which implies that all the instructions use registers as operands, i.e. to add two numbers stored in memory we first load them into registers, and only afterwards can we add them. In order to facilitate the elaboration of a stream of data (usually stored in some memory buffer), one functional unit per side is specialized in address generation and allows both linear and circular addressing modes (useful to implement digital filters).

In conclusion, the DSP is still a very fast processor, but we have to remember that it is not designed to execute general purpose programs and its performance strongly depends on the software design and implementation. The overall application architecture and the system partitioning will also determine how the data will be exchanged between the hardware and the software parts. This is an important issue, as much as the raw processing power, and it will be briefly pointed out later, but first it is better to give an overview of the FPGA available on the Orsys module.

4.1.2 XC3S1000 FPGA overview

The Xilinx XC3S1000 FPGA [12, 13] is equivalent to about 1 million system gates, distributed in 17280 logic cells4 organized in 1920 CLBs (Configurable Logic Blocks). Half of these resources can also be used as distributed RAM, for a total of 120 kb.

4Each logic cell is composed of a LUT and a flip-flop.


Figure 4.2: Functional block and CPU (DSP core) diagram of the TMS320C6713B

This device also comprises 24 block RAMs5 and 24 dedicated 18 × 18 signed multipliers. These resources can be used to implement the required external logic, peripherals, interfaces between the DSP and the external world, and even to develop logic cores optimized for specific algorithms. The DSP is designed to efficiently run data-intensive functions, but since it also has to run general purpose tasks, it is the result of some trade-offs. Instead, the uncommitted logic in the FPGA can be combined into a processing unit highly specialized to solve only one task at a higher rate. This is accomplished by using highly scalable, parallel processing techniques and taking advantage of the underlying specific hardware. In this way the performance is not limited by the given resources, since now it is the developer that allocates such resources according to the design constraints/requirements. For example, when using a DSP to implement an FIR filter, the processing rate is limited by the MAC units (2 in the C67x DSPs),

5They are blocks of 1 k × 18 bit of RAM, usable in different configurations and for various purposes, available separately from the CLBs.


while in the FPGA we can take advantage of up to 24 multipliers (in the XC3S1000). So we are able to perform 3600 MMAC/s (millions of MAC operations per second, running the device at 150 MHz), a considerably higher value than with the DSP (600 MMAC/s, running at 300 MHz). This allows the design of filters that are able to output one data element per clock cycle, even when numerous arithmetic operations are involved, a common case in two-dimensional linear filtering. The system throughput is preserved, although a latency is introduced, but in this case it is nonessential. Every multiplier is associated with a specific RAM block, so we could use it to store coefficients or intermediate values. The idea is to develop a system composed of a main processor (the DSP) and a set of application-specific coprocessors directly implemented in the FPGA. The partition between the two options (software or hardware processing) depends on the system requirements and on the elaboration tasks. There are no separate steps for the hardware and software design, since they can be tackled at the same time, in a co-design fashion. A further possibility is to dynamically re-program the FPGA at run time, in order to make visible to the DSP different coprocessors depending on the current stage of the data elaboration process. Nevertheless, even if this opportunity is compelling, it should be carefully evaluated, because we also have to take into account the time and resources spent in performing this re-programming task. In any case, every time we have a processing bottleneck we shall consider moving that function to the hardware side of the system. Of course we cannot directly translate the software algorithm into a hardware one, and we will have to re-design it to take advantage of the inherent parallelism offered by such hardware.
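
For reference, the peak figures quoted above are simply the product of the number of parallel MAC resources and the clock frequency:

\[
2 \times 300\,\mathrm{MHz} = 600\ \mathrm{MMAC/s}\ \text{(C67x DSP)},
\qquad
24 \times 150\,\mathrm{MHz} = 3600\ \mathrm{MMAC/s}\ \text{(XC3S1000 FPGA)}.
\]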

4.1.3 System feasibility

An overview of the chosen hardware platform can be seen in Fig. 4.3 (see [14] for further details), while Fig. 4.4 is a more detailed view of how the FPGA is connected to the rest of the system (all the detailed information can be found in [15]). The FPGA is mainly connected to the EMIF (External Memory Interface) [16], which enables the design of memory-mapped peripherals and is shared with the external RAM, the Flash memory and the CPLD, which contains some necessary glue logic for the board. It is a very flexible interface, because it allows partitioning the whole memory space into four separate ranges (called CE, Chip Enable spaces), each one with its own operating parameters. These include the data size, the endianness, timing parameters and also control signal types, to interface with SDRAM, SBSRAM and ASRAM. The EMIF is a powerful way of adding a coprocessor to the system, but we have to design the required logic in the FPGA, in order to have something like a dual-port RAM which behaves as a bridge between the custom hardware and the DSP, as described in [17].


Figure 4.3: Block diagram of the Orsys micro-line C6713CPU

It is worth noting that the FPGA is also connected to the DSP through the HPI (Host Port Interface) [18], which is a 16 bit parallel port that allows an external peripheral to access the whole memory space, including both memory proper and the internal memory-mapped registers. All the data exchanges are regulated by a set of registers accessible both by the DSP and by the host device, and the transfers directly rely upon the EDMA (Extended DMA) subsystem. In every DSP system the fast processing core has to be kept busy (especially in pipelined architectures) all the time, in order to sustain the desired throughput: this task is fulfilled by the DMA, which is in charge of performing all the data transfers between peripherals and main memory without interfering with the data processing core. In the C6713 the EDMA [19] is responsible for almost every data transfer (as can be seen in Fig. 4.2), even between the L2 cache and the main memory. It supports up to sixteen programmable channels, with a complex queuing and arbitration system to manage different priorities, designed to optimize the bus bandwidth and minimize transfer interference.


Figure 4.4: FPGA connections in the Orsys micro-line C6713CPU

It operates at the same clock as the CPU6, thus reaching a peak transfer rate of 1800 MB/s (at a CPU clock of 225 MHz). It is worth noting that the HPI mentioned above is classified as a master peripheral, meaning that it is the preferred channel to exchange a great amount of data in a fast way. The host peripheral/processor acts as the master device and the DSP as the slave one.

The combination of the HPI and the EDMA could be the ideal solution to transfer input images from an external peripheral into the main memory, provided that such data is streamed through a parallel communication channel. This leads to looking at how the system will acquire images. Traditional systems are based on CCD devices, which are analog sensors requiring an external frame grabber to digitize the information. Nowadays CMOS sensors have become a quite popular choice in this kind of application, since they offer all the required hardware packed in a single device, with a digital output and the possibility to control the acquisition process. The image quality (regarding both the noise level and the dynamic range) obtained by such components is also now comparable with that of CCDs, and their ease of use makes them an effective solution for our system. There are many CMOS sensors, but they all share the same block diagram, shown in Fig. 4.5. It is mainly composed of an image sensor array, with triplets (one for each fundamental color, i.e. red, green and blue) of photodiodes or photogate transistors accessible through column/row addresses.

6The data movement operations are performed at the peripheral rate, so each port has a data buffer in order not to slow down the transfer engine.


The analog signals are first pre-processed and then sent to a set of ADCs, thus obtaining the digital data that, formatted in a proper way, is sent to an output parallel port along with synchronization information. This includes the pixel clock (PCLK), the horizontal reference line (HREF, active when the transferred data belong to the same row) and the vertical sync (VSYNC) marking the start of a new frame. Fig. 4.6 and 4.7 show the relations between these signals. From these diagrams we can see that, with a small amount of glue logic, it is quite straightforward to interface those devices with the HPI of the DSP, as mentioned before. Nevertheless the camera chip synchronization signals have to be adapted to the HPI control lines, and the data have to be buffered, since the output camera chip bus is 8 bits wide and the HPI is 16 bits wide (moreover the accesses should always be performed in pairs, in order to transfer 32 bits). Besides that data bus, there is usually an I2C bus [20] to control the CMOS sensor. This is a low-cost, relatively low-speed (400 kbps in Fast-mode) communication system, easy to implement and to use. It is based on only two lines, with an open-collector line driving technique: the former is for data and commands (SDA signal) and the latter for the clock (SCL signal). It uses a master/slave protocol. In order to handle slow devices, the clock can be kept low by the receiver, thus simulating a wait state. It is a multi-drop bus, and each peripheral has its own static address; in this way it is very effective for creating simple sensor or device networks which do not require a high bandwidth. As said before, in this application it is used to set the operating parameters of the CMOS camera sensor, such as the data output format, the resolution, and the analog signal pre-processing parameters. The C6713 DSP also has a flexible I2C interface [21], fully compliant with the standard, so it is pretty straightforward to manage the sensor; it should only be a matter of properly setting the configuration registers.

Almost all these characteristics are common to every CMOS sensor, but for this project the Omnivision OV7648 [22] has been chosen, mainly for availability reasons. It is able to capture color VGA images (640×480) at a 30 fps rate. Actually the developed software does not require such a relatively high resolution, but if the hardware could sustain the continuous elaboration of large images, it could be useful to implement advanced image processing features. The same reasoning applies to whether or not to make use of the color information: currently the software does not take advantage of it (as seen in Chapter 3), nevertheless it could be helpful to improve the object recognition functions. Namely, this device does not rely upon the I2C protocol, having a custom bus named SCCB (Serial Camera Control Bus); anyway, from the documentation [23] we can see that it is compatible with the standard one.

Looking at the other peripherals needed for this system, we also have to consider the audio interface. Nowadays a common standard is the AC'97 codec, which allows the transfer of multi-channel audio data over a synchronous serial communication. The C6713 has a powerful serial interface named McBSP (Multi-channel Buffered Serial Port) [24],


Figure 4.5: Omnivision OV7648 color CMOS VGA CameraChip block diagram

Figure 4.6: Omnivision OV7648 row output timing diagram

Serial Port) [24], which handles full-duplex communications. It has double-buffered data registers to sustain continuous data streaming and great flexibility in setting the transmit/receive synchronization signals, data formatting and word length. All these features enable the DSP to be interfaced with a great variety of industry-standard codecs and serially connected DACs or ADCs (also thanks to integrated µ-Law and A-Law companding), including the AC'97 one. AC'97 [25] is a serial protocol widely used in PCs to transfer data between a digital controller, integrated in the CPU chipset, and the audio codec, i.e. the device in charge of performing the DAC functions. It is composed of only five wires (a reset signal, a synchronization one and two data lines, one for each direction) and allows several audio channels to be transferred, with a resolution of 16 or 20 bits sampled at 48 or 96 kHz, through time-division multiplexing at a constant bit rate of 12.288 Mbit/s. The audio codec could


Figure 4.7: Omnivision OV7648 frame output timing diagram

be implemented in the FPGA, but we would anyway need an external DAC, so using an "all-in-one" external component should be simpler. In any case, this requires designing a daughterboard to be connected to the main DSP/FPGA one through the expansion connectors.

Another interesting option would be to equip the system with a USB connection to easily interface it with other devices, such as PCs. This would be useful for exchanging data, adding functionalities similar to those already available in PDAs or mobile phones. This goal could be achieved either by implementing a USB core interface in the FPGA (but then we would still need to add an external physical-layer transceiver) or by integrating an external device. The second approach is preferable, both for simplicity and to leave more FPGA resources to other parts. A possible solution is described in [26], where the USB device is directly connected to the EMIF interface. In our case we do not have those signals available on the expansion connectors, but, since they are connected to the FPGA, we could easily route them to a daughterboard (as seen for the audio interface).

The last interface that we could need is a serial port to drive the Braille transducer already used in the Haptic prototype. We could either re-use the already existing serial port or implement a new one directly in the FPGA, since it should not require many resources.

4.2 System partitioning

Having defined the hardware platform, along with the needed external peripherals, we now look at how to partition the system. As stated before, there are plenty of solutions, ranging from implementing everything in software, using the FPGA only for the required interfaces/glue logic, to implementing all the data processing algorithms directly in hardware, using the DSP only to control the whole system. Moving the processing functions from the software side to the hardware is not a straightforward step, because a routine is usually designed in a serial fashion, while the hardware is inherently parallel. This means adding an intermediate step in the development cycle [27]: once the algorithm has been determined, before implementing it we have to consider the underlying architecture to be used for


such a function, i.e. to define which (hardware) resources will be needed. In a pure software system, they are already given by the CPU programming model, possibly together with its special features or extensions (such as dedicated co-processors). Conversely, in programmable hardware, such as FPGAs, we have to design almost everything, ranging from the memory interfaces to the arithmetic units or processing cores (if any). In order to optimize further, we could also merge some functions (e.g. two consecutive filters applied to the same image), thus breaking the 1:1 mapping between the actually implemented functions and the ones described in a high-level data-flow diagram7. For these reasons it is better to partition the system right after having defined the subsystems that compose the application.

Figure 4.8: System partitioning of the DSP/FPGA platform

Fig. 4.8 shows a possible scheme for the implementation of our application on the DSP/FPGA platform. Such divisions are not strictly fixed, and during development/debug/test it will be possible to change this resource allocation in order to achieve better performances. As stated previously, we will use part of the FPGA to implement the interfaces we need, and these are globally summarized in the "System interfaces" block. The image capturing logic has been left out of that group, since it deserves more attention. One option, perhaps the simplest one, is to merely place the glue logic to adapt the CMOS camera control

7We will see in Chapter 6 that this technique is also useful in software development, but this process of merging/splitting routines is highly architecture dependent.


signals to the C6713 HPI (as previously discussed). Otherwise we could think about placing an on-line image pre-processing block between the DSP and the sensor. This would relieve the DSP of such data-intensive routines, which have to be performed on every frame. The DSP is well suited to performing image processing in a very efficient way (a good starting point is [28]), but having an FPGA at our disposal we could obtain better results through hardware/software co-design. This could be a valuable improvement, especially for the low-level operations on the raw image, but some high-level algorithms, such as the tracking one outlined in Section 3.2, may be better suited to a pure software implementation. For this reason, in the scheme of Fig. 4.8 there is an image processing block both in the FPGA and in the DSP. The same reasoning can be applied to the ANN subsystem: an FPGA implementation could seem preferable at first glance, but, as we will see, it is not so clear (in advance) whether it would be the best solution or not.

All the other tasks, ranging from the overall system control and the general information-handling algorithms (i.e. all those not mentioned above) to the speech synthesis, would better be implemented on the DSP and hence in software. In the sections below we will examine a possible implementation of the image processing engine and of the neural network subsystem, which are the most challenging ones; understanding their feasibility is a crucial step in evaluating this kind of platform.

4.2.1 FPGA implementation of an image processing engine

Traditionally, almost all image algorithms [29, 30] assume that the data is already stored in memory, so that it can be handled as a matrix: directly implementing such routines would result in a system with poor performances requiring a lot of resources (such as a dedicated memory). One of the reasons is that we would store the whole image first and then process it, accessing the data in an almost random way, while the input data is an ordered stream. In the end we would transfer such data by serializing them to the DSP. The result is on-line processing re-mapped onto an off-line scheme. In this system there is a continuous stream of data incoming at a fixed rate that has to be elaborated within a useful time. It is much better to find a suitable way of performing all the operations in an on-line manner, always ensuring that all the constraints, such as timing and bandwidth, are respected. Some of these issues, addressed from an architectural point of view, are discussed in [31], where some useful design patterns are suggested for the definition of this kind of system. Design patterns are general solutions that can be applied to a set of problems sharing some common issues. Examples are how to handle a large amount of data, or techniques for the implementation of data processing units which are very fast and/or require a minimal amount of resources. Moreover, they give some advice on how to organize a general framework to address high-level issues (e.g. how to combine different elementary units). In short, design patterns can be seen as the answer to frequently


recurring questions8.

Given our constraints, we should avoid those algorithms that make use of an external memory to store the whole image (e.g. image rotation), in order to rely completely on the internal FPGA resources for faster elaborations. All the other image filters give as output, for each pixel, the result of an operation performed on a set of pixels surrounding the current element: this can be viewed as a sliding window running across the whole image, used to compute the filter. In this case we only need to know the values inside such a window; we do not need to have all the data in advance, and so we can use a general architecture like the one shown in Fig. 4.9. A small set of shift registers is used to store the needed pixel values, and a filter unit computes the output, one value per clock cycle. Their length depends upon the width of the image, while their number is equal to the height of the filter window. This structure is very useful, because it requires less memory and maintains the same bandwidth at the output, although there is a small delay (depending on the filter window size) between the input stream and the output one. Anyway this is a minor issue, at least in our application. An example of such a solution can be found in [32], where the design of a convolution filter is also parametrized on the image and filter sizes. It is worth noting that using a filter with a separable kernel, i.e. one that can be obtained by the multiplication of two vectors, will reduce the number of required multipliers (in a fully-parallel implementation) from N × M (N being the width and M the height of the filter) to N + M. This is a considerable enhancement, both from the resource allocation and the operating speed point of view. With the Spartan3 FPGA, such an implementation can take advantage of the embedded multipliers and of the facilities for using logic cells and/or block RAM for the shift registers. In the former case there is a specific macro, SRL16 (see [13]), thus resulting in a fast and efficient image coprocessor engine.
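To make the saving concrete, the following C++ sketch contrasts the two passes of a separable filter (a vertical pass followed by a horizontal one), which is the same restructuring a hardware implementation exploits to go from N × M to N + M multipliers per output pixel. It is a plain software illustration, not the FPGA design itself; borders are simply clamped for brevity.

    #include <vector>
    #include <cstddef>

    // Separable 2-D convolution: an N x M kernel that is the outer product of a
    // column vector (height M) and a row vector (width N) can be applied as a
    // vertical pass followed by a horizontal pass, using M + N multiplications
    // per pixel instead of M * N.
    std::vector<float> separable_filter(const std::vector<float>& img,
                                        std::size_t w, std::size_t h,
                                        const std::vector<float>& col_k,   // length M
                                        const std::vector<float>& row_k)   // length N
    {
        const std::ptrdiff_t M = col_k.size(), N = row_k.size();
        const std::ptrdiff_t W = (std::ptrdiff_t)w, H = (std::ptrdiff_t)h;
        auto clamp = [](std::ptrdiff_t v, std::ptrdiff_t lo, std::ptrdiff_t hi) {
            return v < lo ? lo : (v > hi ? hi : v);
        };

        std::vector<float> tmp(img.size()), out(img.size());

        // Vertical pass (M multiplications per pixel).
        for (std::ptrdiff_t y = 0; y < H; ++y)
            for (std::ptrdiff_t x = 0; x < W; ++x) {
                float acc = 0.0f;
                for (std::ptrdiff_t k = 0; k < M; ++k) {
                    std::ptrdiff_t yy = clamp(y + k - M / 2, 0, H - 1);
                    acc += col_k[k] * img[yy * W + x];
                }
                tmp[y * W + x] = acc;
            }

        // Horizontal pass (N multiplications per pixel).
        for (std::ptrdiff_t y = 0; y < H; ++y)
            for (std::ptrdiff_t x = 0; x < W; ++x) {
                float acc = 0.0f;
                for (std::ptrdiff_t k = 0; k < N; ++k) {
                    std::ptrdiff_t xx = clamp(x + k - N / 2, 0, W - 1);
                    acc += row_k[k] * tmp[y * W + xx];
                }
                out[y * W + x] = acc;
            }
        return out;
    }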

Even the so-called rank order filters, which also comprise the morphological operations and the median filter, can be implemented with the architecture outlined above. The median filter is very useful for removing spike and random noise without blurring the contours, in contrast with the Gaussian filter, for example (for further information about image processing see [29, 30, 33]). Its main drawback is that, at least in a basic implementation, it is computationally expensive, because it involves sorting the pixel values within the window (for this reason it has not been used in the current software). Despite its complexity it is possible to implement those filters in FPGAs, since it can be shown [34] that ordering a 3 × 3 matrix can be reduced to a vertical ordering followed by a horizontal one. This technique is similar to kernel separation in the convolution operations, and reduces the number of comparisons needed. The parallelism offered by the hardware again allows one to design

8Design patterns originated in the software world, but the idea of classifying and documenting such methods has since been extended to almost all design contexts.


a filter with a throughput of one pixel per clock cycle, a performance difficult to obtain in software. Another fast implementation of this filter can be found in [35], where, thanks to a conceptually novel approach, each new data element is compared only once with the old ones and the results of these tests are stored in special registers; by examining their content the median item can easily be found.
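As a software reference for what such a comparator network computes, here is a small C++ sketch of a well-known 3 × 3 median formulation that first orders each column and then combines one element per column. It is given only to illustrate the reduction in comparisons and is not claimed to be the exact network of [34] or [35].

    #include <algorithm>
    #include <cstdint>

    // Orders three values in place with three compare/swap operations,
    // mirroring a 3-input sorting network.
    static inline void sort3(std::uint8_t& a, std::uint8_t& b, std::uint8_t& c)
    {
        if (a > b) std::swap(a, b);
        if (b > c) std::swap(b, c);
        if (a > b) std::swap(a, b);
    }

    // Median of a 3x3 neighbourhood p[row][col]: after sorting each column,
    // the median of the nine values is the median of {largest of the column
    // minima, median of the column medians, smallest of the column maxima}.
    std::uint8_t median3x3(std::uint8_t p[3][3])
    {
        for (int c = 0; c < 3; ++c)
            sort3(p[0][c], p[1][c], p[2][c]);   // vertical ordering of each column

        std::uint8_t lo = std::max({p[0][0], p[0][1], p[0][2]});  // largest minimum
        std::uint8_t m0 = p[1][0], m1 = p[1][1], m2 = p[1][2];
        sort3(m0, m1, m2);                                         // m1 = median of medians
        std::uint8_t hi = std::min({p[2][0], p[2][1], p[2][2]});  // smallest maximum

        sort3(lo, m1, hi);                                         // overall median is the middle one
        return m1;
    }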

Figure 4.9: General architecture of a 3 × 3 two-dimensional image filter

Generally speaking, there are different ways to speed up such an image processing subsystem, but the most common technique is to design the core in a pipelined fashion. This means avoiding the implementation of a large logic part as a single large combinatorial block, because it would result in a slow unit, possibly becoming the main bottleneck in the data flow. The alternative is to split it into simpler and hence faster elements, putting registers between them in order to synchronize all the data transfers. The whole operation takes a longer time, but we can start a new operation in the first stage while the other stages are completing the preceding ones. In this way we again obtain a high throughput (one output value per clock cycle), even if with a small latency penalty, proportional to the pipeline length. Despite the introduction of new storage elements, this technique should have little impact on the global design, since there is about one FF for each LUT, and these would otherwise be unused. Such a design pattern considerably increases the global speed and can be applied not only at a low-level development stage, where we try to optimize a specific block, but also at an architectural level, as explained in [36], where different combinations of simple convolvers are explored in order to design better solutions (clearly with different trade-offs).

Going further towards a general solution for our image pre-processing engine, we could develop a more general framework, such as the one proposed in [37]. The idea is to write a set of small, well-optimized basic components and to build high-level units by putting together such elementary building blocks (also called "hardware skeletons"). In particular, we can see every filter as the combination of a local operator and a global one. The first works on a pixel basis, e.g. by multiplying each element


by a given weight. The second one examines the whole filter window, for example summing all the previously computed values or ordering them. In this way we have the flexibility of easily modifying the image processing engine without compromising the performances. Such a framework would be composed of three levels of components: the arithmetic ones, the basic image operations and the compound blocks. It would also be useful to explicitly define some particular configurations with enhanced optimizations9. In this way the whole system could be described by a custom high-level language, as in [38], allowing the developer to focus his attention on the problem specification. In such a scenario it is the custom development environment that is in charge of automatically instantiating the required components. It is important to remark that it is only able to select a viable solution, not the best one. Actually, this cannot be known in advance, but only found through a set of post-synthesis simulations, guided by the developer's experience.

To achieve very high performances it could also be necessary to discard the almost universally used two's-complement arithmetic in favor of the Signed Digit Number Representation (SDNR, [39]). It has the interesting property of making additions and multiplications (but also other arithmetic operations) considerably faster, thanks to the redundancy of the representation and the presence of a sign in each digit. Anyway, this comes at the expense of more complicated basic blocks and an increased memory occupation.

After the image pre-processing, as seen in Chapter 3, comes the segmentation, which is the first stage where the application extracts high-level data, in the form of connected components, from the raw input. This function is usually implemented in software, but even in this case the parallelism can be exploited. As described in [40], a single-pass algorithm can be used to avoid the need of allocating a temporary memory buffer and hence the creation of bottlenecks. In this way the DSP is relieved of all the operations on the raw input image, minimizing the data transfers and leaving all its resources for other processing functions, from the tracking ones to the whole system control.
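To give a feel for what "single-pass" means, the following C++ fragment is a simplified software analogue of streaming connected-component labelling on a binary image: pixels arrive row by row, only the previous row of labels is kept, and label equivalences are tracked with a small union-find structure. It is only a sketch of the general idea, not the specific algorithm of [40].

    #include <vector>
    #include <cstdint>

    // Tracks equivalences between provisional labels.
    struct UnionFind {
        std::vector<int> parent;
        int make() { parent.push_back((int)parent.size()); return parent.back(); }
        int find(int x) { while (parent[x] != x) x = parent[x] = parent[parent[x]]; return x; }
        void merge(int a, int b) { parent[find(a)] = find(b); }
    };

    // Returns one provisional label per pixel (0 = background); equivalent labels
    // are merged in 'uf' and can be flattened afterwards with uf.find().
    std::vector<int> label_stream(const std::vector<std::uint8_t>& img,
                                  int w, int h, UnionFind& uf)
    {
        std::vector<int> labels(img.size(), 0);
        std::vector<int> prev_row(w, 0);             // labels of the previous row only
        uf.make();                                   // label 0 reserved for background

        for (int y = 0; y < h; ++y) {
            int left = 0;                            // label of the pixel to the left
            for (int x = 0; x < w; ++x) {
                int idx = y * w + x;
                if (!img[idx]) { left = 0; prev_row[x] = 0; continue; }

                int up = prev_row[x];
                int lab;
                if (left && up) { lab = left; if (left != up) uf.merge(left, up); }
                else if (left)  { lab = left; }
                else if (up)    { lab = up; }
                else            { lab = uf.make(); } // new provisional component

                labels[idx] = lab;
                prev_row[x] = lab;
                left = lab;
            }
        }
        return labels;
    }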

The subsystem described above is similar to the one presented in [41], which showed the feasibility of such an approach. Clearly there are some differences: the most notable ones are the absence of the I2C master unit (since we can use the C6713 one) and the output data format, which consists of information on the connected components in the image. Also the control unit will be different, since in our case we can interface such a coprocessor directly with the main DSP.

9In general we have to note that, to ensure the interchangeability of some basic units, they have a predefined interface, which prevents access to internal details that could otherwise be used to exploit some enhancements.


4.2.2 FPGA implementation of neural networks

Another subsystem that we could move to the hardware side is the artificial neural network one. Although in this case we have neither bandwidth constraints nor precise timing requirements (as in the previous case), we have to ensure that the system is able to perform, in the worst case, one character recognition per frame. As we will see when talking about the run-time performances (and the needed optimizations), the recognition step is dominated by the execution of the ANNs, which represents about 57% of the whole processing time10. So it would be interesting to develop a custom hardware coprocessor to perform those functions. There are many approaches to designing such a part, as reported in [42], but perhaps the most promising one is based on reconfigurable hardware, in our case the FPGA. In contrast with analog solutions we have higher accuracy and flexibility (along with testability and repeatability), which allows better control of the system. Unlike a software implementation we are not limited by its serial nature (or by the underlying processor), being able to exploit the inherent parallelism of ANNs. While offering much of the customizability of ASICs (Application Specific Integrated Circuits), FPGAs have a lower development cost, plus the opportunity of re-configuring the subsystem, even at run time. It is also noteworthy that there are tools [43, 44] that make the development of ANN systems easier, since they allow them to be designed independently of the underlying platform, be it a software or a hardware one. These aids are similar to those designed to deal with image processing [38], based on a database of pre-defined basic blocks. They offer a good degree of flexibility in choosing the target characteristics, for example the underlying architecture (serial/parallel etc.) and the arithmetic issues (the tool presented in [44] is also able to give an estimate of the hardware resources used), and allow different solutions to be explored and compared in a fast and easy way. This is very helpful for developing our application, since, as we will see, there are many ways to implement ANNs in hardware. Moreover, given an architecture, there are many variations, e.g. the number of processing elements dedicated to performing the synapse operations.

Anyway it has to be remarked that the choice between implementing them in the FPGA or in software has to be carefully evaluated. Despite the many advantages of placing ANNs in an FPGA, there are some issues to take into account [45], first of all because we do not have all the flexibility of a software implementation. One concern is the number representation, which is usually a floating-point one, either single or double precision. It guarantees the greatest flexibility, both in terms of precision and range, but it is expensive (i.e. it requires a lot of resources) to implement directly in hardware. For this reason it would be better to adopt a fixed-point number representation, which relies upon integer arithmetic,

10This value has been estimated by running a set of tests with both the final version of the software and the non-optimized one, on the final hardware platform described in the next chapter.


simpler and smaller to implement and hence considerably faster. Most digital designs follow this path, but we first have to investigate the numeric precision suitable for our purpose, which is a constant value across the whole (limited) range. When dealing with floating-point arithmetic these issues are almost always negligible, but with fixed-point arithmetic the allowed range and the needed width of the fractional part have to be carefully checked, in order to avoid overflows or too coarse-grained numeric results. Large numbers, in terms of bit width, are not always a good solution, because they imply larger and slower arithmetic units. The ideal is to find the best trade-off or, better, the minimum number of bits required to achieve the desired results. In general this information is obtained through a set of experiments and is highly dependent on the specific application. Anyway we can get an idea by looking at the results obtained in [46], where various tests on different kinds of problems showed that a resolution of 16 bits should be sufficient both for the training and for the execution of ANNs without a significant performance degradation. Considering only the precision required, we can further truncate the weights to 8 bits. Moreover it is not so useful to have inputs with a precision equal to (or greater than) the one used for the weights, which allows even smaller arithmetic units to be implemented. It is worth noting that a too low resolution can be balanced by an increase in the number of hidden neurons, but too many hidden neurons could affect the generalization ability of the network. Anyway, that solution would also help to increase the network parallelism (each synapse value can be computed independently) while reducing the complexity of each processing element involved, thus leading to a globally faster implementation.
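To make the fixed-point reasoning concrete, the following C++ sketch quantizes floating-point weights to a signed format with a configurable number of fractional bits (saturating to the representable range) and shows the resulting multiply-accumulate. The 16-bit input / 8-bit weight split mirrors the figures reported from [46], but the particular Q formats used here are illustrative assumptions, not values from the thesis design.

    #include <cstdint>
    #include <cmath>
    #include <algorithm>

    // Quantizes a real weight to a signed fixed-point value with 'frac_bits'
    // fractional bits, saturating to the range representable in 'bits' total bits.
    std::int32_t to_fixed(float w, int bits, int frac_bits)
    {
        const std::int32_t max_v = (1 << (bits - 1)) - 1;
        const std::int32_t min_v = -(1 << (bits - 1));
        std::int32_t q = (std::int32_t)std::lround(w * (float)(1 << frac_bits));
        return std::max(min_v, std::min(max_v, q));
    }

    // Multiply-accumulate of an 8-bit weight (assumed Q1.6) and a 16-bit input
    // (assumed Q1.14): the 32-bit accumulator keeps the full 20 fractional bits
    // of the product, so no precision is lost while summing the synapses.
    std::int32_t mac(std::int32_t acc, std::int16_t x_q14, std::int8_t w_q6)
    {
        return acc + (std::int32_t)x_q14 * (std::int32_t)w_q6;
    }

    // Example: to_fixed(0.8125f, 8, 6) returns 52, i.e. 0.8125 * 64.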

Considering the multi-layer perceptron structure of the feed-forward network used in this application (described in Chapter 7), it is not worthwhile (as also reported in [42]) to speed up the sum of the synapse contributions without also taking care of the activation functions. Usually it is neither easy nor effective to implement them directly in hardware, so they are replaced by a suitable approximation, for example a simple look-up table, a first-order approximation or a piecewise-linear function. The last one seems to be the best trade-off between simplicity and accuracy [47]. It is worth noting that the training process should also take these peculiarities into account (i.e. the discontinuities in the gradient of the activation function and the inaccuracies due to the fixed-point arithmetic), either by choosing a sufficiently robust algorithm, as in [47], or by modifying an existing one, as in [46].
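A minimal sketch of a piecewise-linear activation is shown below: it approximates the logistic sigmoid with a handful of segments, which in hardware become a small table of slopes and offsets plus one multiplier and one adder. The breakpoints and coefficients used here are illustrative assumptions, not the ones of the referenced designs, and floating point is used only for readability.

    #include <cstddef>

    // Piecewise-linear approximation of the logistic sigmoid 1/(1+exp(-x)):
    // each segment is stored as (start, slope, offset) and evaluated as
    // y = slope * x + offset.
    struct Segment { float start, slope, offset; };

    float sigmoid_pwl(float x)
    {
        // Symmetry around 0: sigmoid(-x) = 1 - sigmoid(x), so only x >= 0 is tabulated.
        const bool neg = (x < 0.0f);
        const float ax = neg ? -x : x;

        // Illustrative 4-segment table for ax in [0, 8); the clamp below handles saturation.
        static const Segment table[] = {
            { 0.0f, 0.2311f, 0.5000f },
            { 1.0f, 0.1497f, 0.5814f },
            { 2.0f, 0.0506f, 0.7796f },
            { 4.0f, 0.0044f, 0.9644f },
        };

        float y = 1.0f;
        for (std::size_t i = sizeof(table) / sizeof(table[0]); i-- > 0; ) {
            if (ax >= table[i].start) { y = table[i].slope * ax + table[i].offset; break; }
        }
        if (y > 1.0f) y = 1.0f;        // saturate the linear extrapolation for large inputs
        return neg ? 1.0f - y : y;
    }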

As mentioned above, implementing ANNs is a good opportunity for taking advantage of the hardware parallelism offered by FPGAs. Ideally we could associate a different processing element to each synapse and an activation function unit to each neuron in the network. The computation time would thus be roughly proportional to the number of layers, but this can rarely be a feasible solution, because we have in any case a finite amount of hardware resources, especially if we use part of the FPGA for the interfaces or the image pre-processing engine, as discussed above. In our


case, the XC3S1000 FPGA has 24 embedded multipliers, not enough to implement even one neuron (for a detailed view of the ANNs used, see Chapter 7) in a fully parallel way. We could either design a network topology starting from these constraints, or keep the same configuration and time-multiplex the resources, with the drawback of increasing the whole computation time. Another alternative would be to design networks whose weights are only powers of two (or sums of powers of two), thus transforming multiplications into shifts [48]. We can take advantage of the SRL16 feature of Xilinx FPGAs, which allows a logic cell to be used as a 16 bit shift register, and we are no longer limited by the small number of multipliers. Otherwise we can also use look-up tables for the synapse operations. In this case there are more effective alternatives to the raw implementation of such an idea, as reported in [49], where distributed arithmetic (i.e. the operations are performed bit by bit in parallel on different data [50]) is used and, thanks to weight clustering performed according to their precision, one big table can be divided into smaller ones, thus further reducing the required resources. It is worth noting that using only powers-of-two weights is a heavy constraint and the training algorithm has to be modified accordingly.
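The shift-based synapse idea can be illustrated in a few lines of C++: each weight is stored as a sign plus a shift amount, so the "multiplication" reduces to an arithmetic shift. The encoding below (a single power of two per weight) is a deliberately simplified assumption; a sum-of-powers-of-two weight would simply use two such terms.

    #include <cstdint>

    // A power-of-two weight: value = sign * 2^(-shift). In hardware the
    // multiplier disappears and only a shifter and an adder remain.
    struct Pow2Weight {
        bool         negative;
        std::uint8_t shift;      // number of right-shift positions (fractional weight)
    };

    // Accumulates x * w into acc without any multiplication: the product of a
    // fixed-point input by 2^(-shift) is an arithmetic right shift.
    std::int32_t synapse_accumulate(std::int32_t acc, std::int32_t x, Pow2Weight w)
    {
        std::int32_t term = x >> w.shift;            // x * 2^(-shift)
        return w.negative ? acc - term : acc + term;
    }

    // Example: with x = 256 (in some Q format) and w = {false, 2}, the
    // contribution is 256 >> 2 = 64, i.e. a weight of +0.25 added to the sum.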

Talking about the ANN hardware implementation issues for this application, we have to note that our networks are large (see Chapter 7) in comparison with the ones (implemented on FPGAs) described in the literature, which makes it more difficult to fit everything on the single FPGA at our disposal. In any case we have multiple networks to manage, one per symbol to recognize, and this means that either the hardware has to be re-configured at run time for each ANN, or the networks have to be stored on the internal block RAMs (but these could be insufficient), or the weights have to be stored in an external memory. The first solution might not be viable if the time required for the reconfiguration is too long compared to the execution time, so we will almost certainly have to store the weights in memory, most likely the external one. As already pointed out in [42], this could become a bottleneck for the run-time performances. Moreover, the only external memory available on the selected module is the same SDRAM used by the DSP, which shares the bus with the FPGA. This would lead the FPGA to contend for the bus with the DSP, thus slowing down the system. Another thing to be considered is that, unlike for the image pre-processing engine described above, the input data has to be transferred from the main memory (through either the HPI or the memory bus) and then put back, thus further increasing the whole elaboration time. These practical considerations should make the designer careful in determining the proper implementation of the ANN subsystem. Given the aforementioned limitations, a pure software implementation could be preferred, especially considering that the C6713 DSP has a cache memory, which can greatly improve the memory access performances.

In the above discussion we have seen that implementing the image pre-processing


engine in the FPGA should considerably improve the system performances. Nevertheless, moving the ANN subsystem from the software side to the hardware one should be carefully evaluated, since the drawbacks could outweigh the advantages. The FPGA is also very helpful for equipping the chosen module with all the interfaces not directly available in the DSP, without adding many external components, thus enhancing the "integration density" of the whole system. On the other hand, the DSP would be in charge of managing the whole device, running the top level functions and all the other data-intensive processing algorithms, such as the ones needed to track the user movements or the text-to-speech functionalities.

4.3 Final considerations

Designing a custom platform based on a DSP and an FPGA should provide all the computational resources needed to perform the required tasks. Nevertheless there are some drawbacks, mainly concerning the development cycle. First of all, it is not as easy as expected to effectively adopt a software/hardware co-design methodology, since we have to cope with different design tools, and there is still a gap between the hardware side and the software one, mainly represented by the difficulty of transparently exchanging data, a delicate aspect which can heavily impact the run-time performances and the whole system flexibility. To properly take advantage of the hardware, the system partitioning should be fixed before starting to write the code, meaning both the software and the hardware implemented functions. This can be done only once the system functionalities, features and algorithms employed are well defined. In our case the system is still under continuous development, and applying such a strategy would lead, in practice, to a high overhead due to the maintenance of different working versions of the same system. Even the time and the cost of developing such a system have to be considered and, despite the big improvements of the design tools in this area, generally speaking a mixed hardware/software system like this requires more effort than other solutions, mainly because of the more complex debug/test process of the hardware part.

We also have to take the power consumption into account, since this will be a mobile device. In the case of a DSP/FPGA system it is very difficult to estimate such a value a priori, since it strongly depends on the specific application, as reported in [51, 52], and usually the chip vendors release specific tools that enable the designer to obtain an estimate of the power consumption. This is especially true for the FPGA, where the figure is roughly proportional to the resources used, but if we plan to make heavy use of such parts to perform data-intensive computations, for example exploiting the parallelism in the computations and thus multiplying the involved blocks, we cannot expect a low consumption. Also for the DSP it is difficult to estimate this parameter, since it strongly depends on the core and peripheral


utilization. The first one, in a VLIW architecture, is a direct consequence of the software optimizations and is difficult to predict, while the other can be estimated with more accuracy. Nevertheless, from [51] we can see that under a "typical activity" our DSP can require about 1.6 W, with a baseline power consumption, which depends neither on the workload nor on the I/O activity, of more than 1 W. It is also worth noting that it has no power-saving features (only a power-down mode), so there is no way of tweaking this important parameter for a mobile device. We also have to remember that besides these components our system includes the RAM, the Flash memory and the other peripheral chips, which have to be considered as well. In the end, from this point of view, this solution could be too power demanding.

Another important issue to take into account is the lack of almost any "high-level" interface, such as those for communicating with other devices (e.g. the USB port) or those for storing information and/or configuration files, for example on a memory card or similar. Even the essential peripherals (the image acquisition and the audio ones) have to be designed from scratch. This adds a significant overhead to the whole design process, and does not allow us to focus on the system functionalities.

Designing the needed interfaces is not only a hardware concern, since we also need to write a device driver to abstract the low-level details; this is not an easy task, since the involved protocols can be complex, as in the case of USB [53]. It is also almost certain that the resulting software will be multi-threaded, and it will need a filesystem component to write/read information to/from an external storage device. We will need a software communication stack, implementing high-level protocols, to exchange data with other devices. All these hints lead us to desire a traditional operating system, which takes care of all the details in a transparent way. In the DSP world there is no such kind of software, because it would introduce an overhead which subtracts resources, therefore reducing the effective computational power of the system. Usually DSPs are used as coprocessors in a more complex system, thus needing only a low-level Hardware Abstraction Layer (HAL) and allowing the designer to focus his attention only on the application development itself. These HALs are more similar to a BIOS (Basic Input/Output System) than to a traditional OS, and the C6713 too has its own DSPBIOS [54]. It is equipped with device drivers to manage the on-chip peripherals (thanks to the Chip Support Library, CSL [55]), and through a small kernel it offers preemptive multitasking and all the facilities (such as mutexes, semaphores, pipes and so on) needed to implement multi-threaded software. The designer can choose exactly, through the Code Composer Studio IDE, which components will be included, making it possible to build a minimal environment. He has the opportunity to fine-tune all the operating parameters, such as the task priorities, in order to meet all the system constraints, especially the real-time ones (it is also possible to calculate the kernel overhead). The DSPBIOS is a big help for the


application development, but it is still far from replacing a traditional OS11, basically because the target for which it is designed is a completely different one.

In the end, we also have to consider the software development environment, the Code Composer Studio IDE [56]. Even if the system will be designed with a hardware/software co-design methodology, the starting point is still the pure software solution developed during the previous Haptic project. This makes it necessary to pay particular attention to the software aspects even in this study. We have already mentioned the CSL and the DSPBIOS features, but the programming language also has to be considered, since it is important to develop well-designed software. With almost every language it is possible to write good applications, but some of them make such a task easier, without compromising run-time performances or, more in general, effectiveness and efficiency. In the embedded world the most used one is the C language12, which allows both high-level code to be written and the low-level details of the underlying hardware to be exploited. For this project, as we will see in Chapter 6, it is better to state some design guidelines, such as modularity, flexibility, extensibility and testability, with the aim of obtaining a more manageable and maintainable code base, also looking to the future for further expansions and/or improvements. In order to achieve these goals, we also need to be supported both by the development tools and by the language itself, which should enable the developer to follow such practices in an easy way. As we will see in Chapter 6, C++ is a good choice in this respect. Anyway, in the industrial world it has been considered too feature-rich to be used on embedded systems, so a subset of it, called Embedded C++, was developed. Embedded C++ is what can be used with the TI Code Composer Studio, so it will be discussed here rather than in Chapter 6. The main goal of shrinking the language is to make the compilation process easier and faster by removing the features that could be a bottleneck in obtaining fast and small code, or that are simply too complicated to be managed by the development tools13. A good overview of the differences between standard C++ (a complete reference about the C++ language can be found in [57]) and the embedded dialect can be found in [58]. The main features removed are:

1. Multiple inheritance and virtual base classes

11It is worth noting that it would be difficult to develop a complete OS for a DSP since, for example, it generally lacks an MMU (Memory Management Unit). This absence is due to the preference for faster memory access, essential in this kind of application, instead of a more general and flexible solution.

12Sometimes developers also have to use assembly language, but in general this is needed only when the compiler is not able to exploit some optimizations. In any case, a good knowledge of the CPU ISA is useful for writing faster code, since it leads to the adoption of specific coding patterns that make the compiler optimization task easier.

13Although the first ISO C++ standard dates back to 1998, there is still only one compiler that fully supports it, and this can be considered as a proof of the difficulties in developing a compiler front-end for this language, independently of the target application/platform.


2. Run-time type identification (RTTI)

3. Exception handling

4. Templates

5. Namespaces

6. New style casts.

Some of the removed features, like items 1 and 2, are not essential: although they are useful for building well-designed object-oriented software, they can easily be avoided in favor of other programming techniques (they are among the most under-used features). Exception handling is nowadays a powerful mechanism to correctly manage unexpected situations; it is implemented in almost every "modern" language14 and allows the developer to write cleaner code. Anyway, as we will see in Chapter 6, exceptions are not used in the actual code base, because other solutions have been preferred. Looking at the other main differences between Embedded C++ and the complete language, namespaces introduce neither any run-time performance penalty nor any significant compilation-time increase. They are a powerful tool for a better management of the class/function naming scheme, since they can be considered a mechanism similar to the "package" or "module" features available in other languages. New C++-style casts were introduced to give the developer better control over the cast functionality. The plain C-style cast is often a "compiler dependent" operation, but using a proper set of asserts after the type conversion15 we can easily do without the new casts. Templates are one of the major concerns in the design of such an embedded language; their purpose is to enable the designer to write generic code, parametrized on a specific data type. Typical examples are container classes, such as vectors or lists, whose functions are independent of the contained value type. Other examples, where algorithms and classes are separated from the specific data type, are the numeric functions, which can be parametrized on the numeric representation (e.g. fixed-point or floating-point types), or the image class, where most of the functions (such as filters) can be designed, once and for all, independently of the pixel type. Another use of templates is to enable specific code optimizations, thanks to template specialization. In this way we can implement a general algorithm that works in any case, together with some "smart" versions of the same functions that are faster when certain requirements (typically on some data type properties) are met. The compiler automatically chooses the most

14By "modern" language we refer to newly designed languages, e.g. Java, C#, Python and many others.

15We could also resort to something like static assert facilities, which perform checks at compilation time, and hence independently of the run-time program flow, as described in [59], but they rely upon templates, which are not present in Embedded C++.


suited routine based on its context. The main drawback of this technique is the so-called "code bloat", i.e. a big increase in the compiled code size. Templates are declared in header files but they are instantiated only when they are encountered in a source-code file; the compiler will generate the corresponding object code each time they are found, on a compilation-unit basis. If we have two templates instantiated with the same parameters but in different .cpp files, the final executable will contain the same binary code duplicated twice. This phenomenon can cause problems in embedded systems, where usually the memory (even the program memory) is limited. Anyway this is a compiler issue, not a language one. If we want to write generic code without template facilities, we have to resort to inheritance, which introduces a run-time penalty without completely solving the issue (as explained in [60]). A final important observation is that a direct consequence of these differences is the lack of the Standard Template Library (STL, see [61] for reference), which is a valuable tool for developers, because it provides a rich set of data structures and functions to accomplish common tasks, such as data containers and the related algorithms to handle their content. Without it we would certainly have to re-design similar components (considerably increasing the whole development effort), since they are fundamental parts of every application. In short, the software development tools coming with the DSP are not so well suited to supporting the development of the proposed software along some important guidelines, such as modularity, flexibility and extensibility.
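Since the discussion above is fairly abstract, the following standard C++ fragment (not Embedded C++, which lacks templates) shows the kind of specialization the text refers to: a generic routine plus a "smart" version selected automatically for a particular data type. The function name and the pixel-copy scenario are illustrative assumptions, not code from the thesis.

    #include <cstring>
    #include <cstddef>

    // Generic copy: works for any element type, one assignment per element.
    template <typename T>
    void copy_pixels(T* dst, const T* src, std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i)
            dst[i] = src[i];
    }

    // Specialization for 8-bit grey-level pixels: since the type is trivially
    // copyable, the whole row can be moved with a single memcpy, which is
    // usually far faster than the element-wise loop.
    template <>
    void copy_pixels<unsigned char>(unsigned char* dst, const unsigned char* src,
                                    std::size_t n)
    {
        std::memcpy(dst, src, n);
    }

    // The caller simply writes copy_pixels(dst, src, n); the compiler picks the
    // specialized version whenever the pixel type is unsigned char.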

In conclusion, it is clear that performances are important for our application, and trying to maximize them has led to the choice of a hardware platform composed of a DSP and an FPGA. It is well suited to embedded appliances, but our goal is to design a mobile device, which makes the power consumption an important factor. From this standpoint, other solutions are surely better, as we will see in the next chapter. Looking at the ease of interfacing and integrating with other devices, we have seen that almost all the peripherals have to be designed from scratch. This, together with the lack of a full-fledged OS, could make the device hard to develop properly, with the risk of obtaining a too "closed" system, difficult to modify, improve or extend once designed. Moreover the development cycle would be quite complex, because it involves the use of several tools to maintain different versions, hardware and software, of the same components. Going further into the details of those tools, for the software part we have seen that they do not offer the features required to build our application. In short, this is an excellent platform for the design of high performance appliances, but it is not well suited to a mobile context and is difficult to set up and to use for our application.


Chapter 5

SBC platform

A possible alternative to the hardware solution proposed in the previous chapter can be based on the idea of starting from an existing commercial solution, for example in the form of an SBC (Single Board Computer1), and then customizing it so that it suits our needs. This would be helpful especially to reduce the time and cost of development (one of the main drawbacks of the DSP/FPGA platform), but also to obtain a system that is easier to improve further. Almost certainly, the resulting system will be designed around a widespread CPU, a general purpose one and hence not optimized for data-intensive processing applications. Anyway, it could have some facilities to make such computations faster and other features that enable the design of a system more geared towards a mobile solution. Currently there are plenty of alternatives, but we can state some requirements to narrow the search. A first choice is to discard 8 bit and 16 bit MCUs, since they are mainly designed for very small embedded systems and hence have too limited resources for our application. The attention will be focused on 32 bit CPUs: the chosen one should be designed for embedded or, better, mobile appliances, and supported by common operating systems and development tools. These would enable the designer to use "standard" development practices, with particular regard to the support of the C++ programming language. A desired feature is the availability of these CPUs in the form of a SoC (System-On-a-Chip), which significantly reduces the system complexity, because of the integration of all the peripherals into the main processor. The solution should be designed around a well-established and widespread architecture, better if available from different vendors, in order not to be tied to a proprietary/closed system and to have the

1Traditionally this definition has been applied to systems designed for the embedded world where a regular PC was not suitable, hence the need of a "stripped down" version of the same. These platforms took various names, such as PC/104 (perhaps the most famous one), but, referring to the CPU type, they all belong to the x86 family; actually this acronym can be used for a wider range of products, such as the one used for this project, since it is still a complete hardware/software system packaged on a single PCB.


opportunity, in the future, to design an improved solution based on a new CPU without re-developing all the software from scratch.

With these characteristics there are mainly four families of CPUs: ARM, MIPS, PowerPC and x86. The last one is the most widespread architecture in the desktop and notebook world, it offers good performances (from the computational standpoint), and it is supported by all the mainstream OSs and development tools. This guarantees the highest degree of freedom from the developer's perspective; nevertheless it is not a viable solution for a mobile appliance, because it is not shipped as a SoC and usually x86 CPUs require two more chips (the so-called northbridge and southbridge) for a minimal working system. Even the power consumption is so high (at least about 15 W for ultra-low-voltage CPUs) that it makes them unusable for our project. Other devices of this family are designed to improve these aspects, but they are targeted at the embedded (not mobile) world, where small and reliable PC-like instruments, e.g. PC-104 modules, have to be integrated into more complex systems; this kind of solution is not suited at all to our needs. Recently (in 2008, after we had started the Stiper project and evaluated all these possibilities) Intel released a new CPU with the brand name Atom, whose target market is that of the so-called netbooks, or better, UMPCs (ultra-mobile PCs) and MIDs (mobile Internet devices). At first glance this could be an option to be taken into account, since Intel claims that the power consumption is very low (down to less than 1 W) and the chip is also very small. The reality is that even with these characteristics it is still unfeasible to adopt it in our system, since it requires a chipset which considerably increases the global footprint and the power consumption, up to more than 10 W.

Regarding the MIPS and PowerPC solutions, they do not have the drawbacks outlined above and are widely used in embedded systems, being available either as SoCs or as IP (Intellectual Property) cores for programmable hardware devices (FPGAs or ASICs). Typical applications are industrial controllers, network or multimedia appliances and so on, but we are not aware of currently available mobile devices (which are the ones of interest to us) equipped with them.

In the mobile world, the most widespread solution is based on the ARM technology, developed by ARM Ltd., which is a 32 bit RISC architecture (also licensed as an IP core) with enhanced power-saving features and powerful extensions similar to DSP functionalities (MAC and vector operations) that make the task of developing data-intensive applications easier. Furthermore it is often available as a SoC. It has a RISC Harvard architecture (separate data and instruction buses), with a large array of uniform registers, a load/store data processing model, a barrel shifter separated from the ALU, few addressing modes but with automatic increment/decrement options, configurable endianness, conditional execution and fixed size (32 bit) instructions (all the arithmetic operations have three operands, resulting in a flexible programming model). Although nowadays the so-called war between RISC and CISC (Reduced


or Complex Instruction Set Computers) makes almost no more sense, in the embedded area the former is still a winning choice, and the ARM architecture is a proof of it. A simpler architecture needs fewer resources to be implemented, thus increasing the opportunities of integrating other interfaces/components on the same silicon die while still having a low power consumption. These features all contribute to a simpler and cleaner architecture, easily expandable by adding co-processors, with the result of having low-power devices with good performances. The main drawback of RISC architectures, i.e. the low code density, can be overcome by the Thumb mode (see [62] for further details), where 16 bit instructions are used. ARM processors have different operating modes, in order to distinguish, for example, between user code and OS code, but also to handle interrupts more efficiently. Usually the processor is also equipped with an MMU (Memory Management Unit), which is essential for implementing an OS. Many manufacturers have developed solutions based on this architecture, so ideally we could write the software once (or at least the core part, which elaborates the data) and then run it on many devices with different CPU models. The differences among them are mainly related to the integrated interfaces, both to the RAM/Flash memory and to external peripherals, the operating parameters, such as frequency and power consumption, and the programming model, which could be slightly different, i.e. with some custom extensions, with respect to the official one.

5.1 ARM platform

As was done for the DSP/FPGA platform, we have looked for an existing platform, with the aim of quickly starting the software development. Since ARM processors are widely used in PDAs (Personal Digital Assistants), the idea was to select two products:

• A commercial full-fledged PDA

• A bare module equipped only with the CPU and the essential components, such as the memory, with the possibility to extend it through something like an expansion port.

These two devices should share the same CPU core and also the same OS, in order to avoid developing different versions of the same application. At the time of this choice, the most promising solutions were the ones based on the PXA270, first developed by Intel and then by Marvell2. The first selected device is the Fujitsu-Siemens Pocket Loox N560, visible in Fig. 5.1, equipped with a PXA270 CPU running at

2During 2005 Intel sold the whole XScale division to Marvell, thus exiting the ARM market and focusing only on the x86 CPU architecture.


624 MHz, 128 MB of Flash memory and 64 MB of RAM. An interesting feature, still uncommon on this kind of device, is the USB Host port, which allows some peripherals to be connected, such as mass storage devices or even webcams (provided the right device driver is available, as we will see later). The other device is the Toradex Colibri XScale PXA270 module [63], visible in Fig. 5.2, equipped with a PXA270 CPU running at 520 MHz, 32 MB of Flash memory and 64 MB of RAM. It is not a full-fledged PDA like the other one, it is only the core part of an embedded/mobile appliance: almost every interface integrated into the CPU is brought out and made available through a SODIMM200 connector. There are general purpose I/Os, memory card interfaces (SDCard but also MemoryStick, CompactFlash), PS/2, I2C, 2 UARTs, 100 Mb Ethernet, USB (both host and client), audio I/O and a CMOS camera interface. So we have everything we need to develop the system almost without adding any external components3. An important note concerns the OS these devices run: for the PDA it is MS Windows Mobile 5, based on MS Windows CE 5, while the Colibri module runs a basic version of MS Windows CE 5; this is an important point, since it makes the two systems very similar from the software standpoint.

Figure 5.1: Fujitsu-Siemens PocketLoox N560 PDA

With these devices we can start to develop the software and, at the same time, define what could be a novel PDA expressly designed for blind people. Ideally we would have a first system based on the proposed PDA with a webcam connected through the USB port, thus being very similar to the idea outlined in Section 2.1.

3There are some exceptions: for example, if we would like to have an RS232 port, we should add an external transceiver, but this is a pretty straightforward task.


Figure 5.2: Toradex Colibri XScale PXA270 520 MHz module

This could also serve as a prototype available in a short time, in order to allow, once the application has been developed, test sessions to be organized with blind people to verify the effectiveness and usefulness of our proposed solution (as done for the previous prototype). On the other hand, we can design a novel device around the Colibri module, taking into account all the ergonomic issues discussed in the Haptic Workshop (see Section 2.2) and looking at the existing products (see Section 1.1). The use of a common PDA can be seen as a first attempt to integrate our application into existing information systems. In this regard, MS Windows Mobile makes things easier, since it is a common and well-established OS often used in mobile devices. This could eventually lead to making the application a complementary tool, or an add-on, for the devices the users already own (e.g. mobile phones), without making them buy a new one.

A full-fledged PDA has many features, including the GUI and hence a screen, but these are not helpful for blind people. Without them we would be able to design a device more tailored to the end users' needs. The removal of the display from the device is both a hardware and a software concern. Without the screen we could design an aid with totally different ergonomics, smaller, less power hungry and, last but not least, cheaper. Looking at the software side, the GUI uses resources (memory, CPU time) that we would prefer to be available for the main application. Following this idea we could think about a small carrier board for the Colibri module, with an integrated CMOS camera sensor, the power supply circuitry (designed around a rechargeable battery) and the connectors for external interfaces: these would be at least a USB client port (to synchronize data with a PC or to deploy new applications), the audio connector and an SDCard slot. This system could be the final one, but for development purposes we would need an intermediate prototype with more optional interfaces, for example a display to visually check the system


In the end, the system could be based on a backplane with three slots: the first one for the CPU module, the second one for an interface board, with the CMOS camera sensor, the power supply components and the interface connectors, and the third one used for the aforementioned "development peripherals".

At this point someone might wonder why we do not use a common mobile phone equipped with a camera for our purpose. The reason is pretty simple: such an integrated camera is not able to focus at the short distance (about 2 cm) needed by our application: it should operate as a paper scanner, not as a conventional camera (see Section 2.1). More than this, we need to provide a light source in order to properly illuminate the sheet (relying on the ambient light is almost unfeasible in this case), and this can be done by adding a few white LEDs around the camera.

5.1.1 XScale architecture

The PXA270 processor is designed around the Intel XScale architecture4, which implements the ARMv5TE programming model [62]. This means that the CPU is able to perform the long multiply, i.e. a 32×32 bit operation with a 64 bit result, and also some DSP functions, such as the MAC and the saturated addition/subtraction. It has to be remarked that it does not have any FPU, so all the floating-point arithmetic is emulated by software, and hence is much slower, as we will see in Chapter 6. The XScale architecture [64] has a 7-stage superscalar pipeline, with in-order issue of one instruction per clock cycle but out-of-order completion, meaning that although the instructions are issued sequentially (one per clock cycle), in some cases an operation can be terminated while the previous one is still underway; this is possible also in case of a stall, since the pipeline is not lock-stepped. These, along with the extensive use of bypassing and with a weak-consistency memory model, are powerful features that enable the CPU to reach up to about 800 MIPS at a frequency of 624 MHz. This last figure gives an idea of the computational-power difference that exists in comparison with the TMS320C6713 DSP, able to perform up to 2400 MIPS. Anyway it has to be remarked that, especially in the DSP case, this performance is the maximum achievable, not the effective one obtained from real applications. The difference simply means that we will have to pay more attention to software optimizations. It is noteworthy that the pipeline is longer than in other ARM implementations, and this should lead to a greater time penalty when the pipeline is flushed (e.g. in case of branch execution). This can be minimized by using conditional execution (the instructions are simply discarded in the first stages of the pipeline) and by the use of a large branch predictor, called by Intel Branch Target Buffer (BTB) and equipped with 128 entries.
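As a tiny illustration of the kind of code that benefits from conditional execution, consider a branchless selection written with the conditional operator: a compiler targeting this architecture can map it onto predicated ARM instructions instead of a conditional branch, avoiding a pipeline flush. This is only a sketch of the general idea, not code taken from the actual application.

inline int max_int(int a, int b)
{
    // A compiler for ARMv5TE can translate this selection into a compare
    // followed by conditional moves, so no branch (and hence no possible
    // pipeline flush on misprediction) is needed.
    return (a > b) ? a : b;
}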

4Although all the XScale-based processors are now developed and produced by Marvell, Intel still holds the rights to the XScale architecture.


The features outlined above make this processor run efficiently at higher frequencies than other alternatives (when we evaluated this solution it was the fastest ARM processor), but, besides the raw processing power, we have to consider how the memory accesses are performed. As said before, it has a weak-consistency memory model, meaning that the access operations are re-ordered to enhance performance. It has an L1 cache, with a data cache (32 kB, 32-way set associative with a line size of 32 B, plus an additional mini data cache of 2 kB) separated from the instruction cache (with the same characteristics as the other one). There are also some features that boost the performance, such as pend/fill buffers which enable "hit under miss" operations (the core can look for other data in the cache while waiting for a request from the external memory) and write buffers supporting the coalescing of multiple stores to external memory. Interesting features are the possibility of locking, upon a software request, some cache data lines and the presence of an instruction for pre-loading the needed data into the cache. The former option is useful when a function accesses the same constant memory area many times, for example inside a loop, in order to avoid possible line eviction from one iteration to another. The latter helps to increase the whole throughput when processing a large amount of data, by pre-loading in the current iteration the data elaborated in the next one. From this information we can see that, despite being a general purpose CPU, it has been designed to obtain high performances in processing large amounts of data.
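A minimal sketch of how the pre-load can be exploited from C++ is shown below; the intrinsic used is the GCC-style __builtin_prefetch and serves only as an illustration, since the actual mechanism (intrinsic, pragma or hand-written assembly) depends on the toolchain in use.

void invert_buffer(unsigned char *buf, int n)
{
    for (int i = 0; i < n; ++i) {
        // Ask the cache for data a couple of 32 B lines ahead of the
        // element being processed, hiding part of the memory latency.
        if ((i & 31) == 0)
            __builtin_prefetch(buf + i + 64);
        buf[i] = 255 - buf[i];
    }
}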

5.1.2 SIMD coprocessor

Besides the characteristics outlined above, it is important to note the presence of a specific MAC co-processor with a 40 bit accumulator and a novel coprocessor based on the Intel Wireless MMX technology [65]. This last option makes available a powerful set of SIMD (Single Instruction Multiple Data) instructions, similar to the ones found in recent x86 CPUs. These operations are able to elaborate more than one datum simultaneously and in parallel, thus producing multiple results. They treat an operand as a vector of smaller data types, so in the same clock cycle we can obtain a vector of results. Without them we would need approximately one instruction for each vector element. The advantages are mainly two: first, more compact code (and better register use) and, second, significantly faster code, up to eight times faster in our case, also introducing into the software a form of computational parallelism. This solution is often used in the DSP world, where a large amount of data has to be processed in the same way (e.g. in filter functions). The VLIW architecture itself, employed in the DSP seen in the previous chapter, can be considered a further extension of this technique, being able to specify different operations instead of a single one for all the data. Taking advantage of this opportunity means anyway re-thinking some algorithms. Often they are defined in a serial way and we have to exploit the parallelism in a similar way as we would have to do to implement them on an FPGA (as seen in the previous chapter).


This should lead to a considerable performance improvement on the image processing algorithms and on the ANN subsystem (these issues will be addressed in Chapter 6). Nevertheless, in contrast with an FPGA implementation, the resources are now fixed by the programming model of the coprocessor and can not be rearranged like in programmable hardware.

Looking at the actual implementation on the PXA270 CPU, the basic data type is 64 bit wide, and it can be seen as 8 bytes, 4 half-words or 2 words of 32 bit, either signed or unsigned. The coprocessor has 16 new 64 bit registers (for the SIMD data types) and a set of 32 bit registers (to store arithmetic/saturation SIMD flags or constants used in some operations, e.g. shift or alignment), along with a complete set of logical, arithmetic and dedicated load/store (thus bypassing the regular registers) instructions. It is also interesting to see that, in contrast to the MMX/SSE extensions available in the x86 architecture, there are also some very helpful operations such as the sum of the packed data (useful to calculate the average intensity of an image) and the element-wise average of two vectors (needed, for example, in the RGB to greyscale pixel conversion). It is important to note that for these instructions, as for most of the "regular" ones, the issue latency5 is always 1 and the result latency6 is at most 2, except for the ones that have to move data between the coprocessor registers and the memory or the XScale core registers.

This means that ideally we could multiply the throughput of the elaboration of a stream of bytes by up to eight.
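To make the gain concrete, the scalar reference version of the average-intensity computation is sketched below; with the Wireless MMX instructions the inner loop can consume eight pixels at a time (packed additions followed by a final sum of the packed accumulator), which is where the up-to-eightfold speed-up comes from. The function is illustrative and is not taken from the application.

unsigned averageIntensity(const unsigned char *pixels, int n)
{
    unsigned long sum = 0;
    for (int i = 0; i < n; ++i)
        sum += pixels[i];   // the SIMD version would add 8 bytes per operation
    return static_cast<unsigned>(sum / n);
}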

These instructions are processed by a fourth 64 bit pipeline branch, added in the PXA27x architecture, along with the other three already present in the XScale architecture; the block diagram can be seen in Fig. 5.3. The WMMX pipeline starts directly after the two instruction fetch stages (IF1 and IF2) and has its own instruction decode (ID) and register file access (RF) stage. All the branches have their own write-back stage (*WB), thus allowing up to four instructions per cycle to potentially complete. As a final remark we can see that the PXA270 is strongly oriented to embedded data-intensive systems, but to take advantage of all its power we have to use good compilers, able to properly schedule instructions. The developer can also rely on CPU debug facilities and performance monitoring tools in order to verify a posteriori the effectiveness of the generated code.


Figure 5.3: PXA27x pipeline block diagram

Figure 5.4: XScale PXA270 CPU diagram


5.1.3 PXA270 overview

Looking at the peripherals integrated in the PXA270 [66] (see Fig. 5.4 for the CPU diagram), we can see that we have all we need without requiring almost any external circuitry. Among the others there is an AC'97 controller for the audio (in this case we are in the same situation as the DSP platform, requiring an external codec, see Section 4.1, but on the Colibri module such a codec is already present, see [63]), a USB Host Controller (supporting the Universal Serial Bus Specification Revision 1.1) with up to three ports to manage USB peripherals (e.g. webcams, but also a mouse and a keyboard for development purposes) and a USB Client Controller (also supporting specification revision 1.1) with up to two ports. Other useful interfaces are the UARTs (up to three) to communicate with external devices, such as the tactile display, a keypad interface to easily equip the device with some system buttons, and a powerful and flexible set of GPIOs (General Purpose I/O). Particularly important for our system is the presence of the QCIF (Quick Capture Interface) and an I2C interface to manage CMOS camera sensors. The QCIF interface allows a direct connection, without any glue logic, to a great variety of sensors, since it has a lot of configuration options, starting from the bus width and the operation mode (master/slave, parallel/serial bus, different types of synchronization, image size and so on) and ending with the possibility of specifying the data encoding used, in order to correctly pre-process the image before storing it in the main memory. All these operations are performed without relying on the CPU, thanks to a very efficient DMA subsystem, so the software only has to handle the proper events (e.g. an interrupt when the acquisition is completed) and to elaborate the data. As in the case of the DSP, the CMOS sensor is controlled through the I2C interface, compliant with the standard and compatible with the Omnivision SCCB bus [23].

Being a CPU targeted at the mobile market, it has numerous features for both voltage and frequency scaling, and Intel claims [67] that, at 600 MHz, an XScale-based processor is able to perform 750 MIPS with a consumption of 450 mW. Looking at the TMS320C6713 described above, assuming a 60% CPU load, according to [51] the internal logic consumes 1300 mW, leading to a speed/power ratio of about 1.1 MIPS/mW, while in the case of the XScale core this value is about 1.7 MIPS/mW. These values are useful only to give an idea about performance, since even a pure MIPS (Million Instructions Per Second) comparison between the two processors is not very meaningful, because we are referring to two completely different architectures. It would be better to use a benchmark program which tests the CPU in a real scenario, on real functions, rather than estimating theoretical maximum performances.

5The issue latency is defined as the number of clock cycles between the issue of two consecutive similar instructions.

6The result latency is defined as the number of clock cycles between the instruction issue and the result availability.


Also the power consumption is very application dependent, since for the DSP we would need to know a lot of details about the application (available only after its development) and, for the XScale core, there are only some guidelines to estimate the capacity of the power supply subsystem [68].
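For reference, the two ratios quoted above are consistent with the following simple computation (assuming that the 60% load proportionally scales the usable DSP throughput):

\[
\frac{2400\ \mathrm{MIPS} \times 0.6}{1300\ \mathrm{mW}} \approx 1.1\ \mathrm{MIPS/mW},
\qquad
\frac{750\ \mathrm{MIPS}}{450\ \mathrm{mW}} \approx 1.7\ \mathrm{MIPS/mW}.
\]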

It has to be pointed out that, at the time of choosing the proper ARM platform, the XScale architecture was not the only one targeting high performance in the mobile world. The competing solutions were based on the ARM11 micro-architecture [69], which supports the ARMv6 ISA and has almost the same features as the XScale one, although it seems that the latter is more optimized for high performance. The only substantial difference concerns the SIMD extensions: since the Wireless MMX technology is an exclusive feature of the XScale CPUs, the ARM11 ones have only a limited set of vector capabilities, restricted to a maximum data length of 32 bit (instead of 64 bit), and thus do not achieve the same performance results in this field.

Figure 5.5: Toradex Colibri XScale PXA270 module block diagram

All the PXA270 features described above make the Colibri module block diagram (see Fig. 5.5) extremely simple for an already very complete system, since the aforementioned interface board could be a very simple stripped-down version of the motherboard evaluation kit [70, 71].

5.2 System feasibility

As we have done for the DSP/FPGA platform, we have to decide how to implement the whole application, and, in this case, it is easy to guess that we will have a pure software implementation.


Figure 5.6: System partitioning of the PXA270 platform (Colibri module)

Thus the "system partitioning" is pretty straightforward, as shown in Fig. 5.6, in comparison to the DSP/FPGA one, shown in Fig. 4.8. We only have a "system feasibility" concern, which consists in defining/designing the missing required peripherals and establishing the software platform.

A pure software implementation offers the highest degree of freedom and flexibility in designing the whole system, but it has the major drawback of resulting in an implementation slower than a mixed hardware/software one. We have to decide how to take advantage of all the features described previously. Again, the most data-intensive algorithms are the ones that will need more attention, and that could be, whenever possible, parallelized. We have seen in Section 4.2 that the most suitable subsystems to be optimized are the image pre-processing one and the ANN one. For the first one it is quite clear that the use of the available SIMD coprocessor should lead to better performance, since we are able to process eight pixels per loop iteration (on a greyscale image with one byte per pixel). It could also be a viable way for the ANNs, but we first have to carefully check whether the trade-offs, such as the use of fixed-point arithmetic, are still acceptable. These issues will be addressed in Section 6.2; anyway it is worth noting that they are largely in common with the DSP/FPGA platform, even if the underlying architecture is now "fixed".


We can not change the programming model, so the parallelization of some functions is not a matter of system partitioning but rather of optimization.
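To give an idea of the fixed-point trade-off mentioned for the ANN subsystem, a minimal sketch of the arithmetic involved on an FPU-less core is shown below; the Q15 format and the helper names are only illustrative, while the actual choices are addressed in Section 6.2.

typedef short q15_t;   // hypothetical Q15 type: 1 sign bit, 15 fractional bits

inline q15_t q15_mul(q15_t a, q15_t b)
{
    // 16x16 -> 32 bit product, rescaled back to Q15; no FPU involved.
    return static_cast<q15_t>((static_cast<long>(a) * b) >> 15);
}

inline long q15_mac(long acc, q15_t a, q15_t b)
{
    // Accumulate in extended precision, as a DSP-style MAC unit would do.
    return acc + static_cast<long>(a) * b;
}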

In the end, from the hardware standpoint, the only big concern is how to acquire images, while, on the software side, the main concern is to assess the need for third-party libraries and components, as we will see in the next sections.

5.2.1 Image acquisition

As outlined above, in the case of the Colibri module we can use the PXA270 QCIF interface [66] to acquire images, thus avoiding all the workload of decoding the MJPEG data stream coming from a USB webcam. The peripheral would consist of the Omnivision OV7648 CMOS camera sensor alone [22] (the same one we would have used for the DSP/FPGA platform, see Section 4.1), since no external glue logic is needed any longer, and it would be controlled through the I2C port. This is not a viable solution for the PDA, so in that case we will use a USB webcam to accomplish the same task. It has to be remarked that this last option is also available for the proposed SBC. It is a twofold solution, but with the USB option we should be able to have a working prototype sooner (it is only a matter of choosing a suitable webcam), while with the second one we could design a custom device, better tailored to our needs.

Having defined the hardware side of the peripheral, we have to look at the software one, i.e. the device drivers. Both systems (the PDA and the SBC) used in this approach are equipped with MS Windows CE, but this is not a complete OS like the desktop ones, and the device drivers have to be developed by OEMs (Original Equipment Manufacturers). This is true both for the USB webcam and for the direct connection of a CMOS camera sensor. It is the same problem we would have in the DSP case, but now we have some help from an open-source project in one case and from the device manufacturer in the other. The first is an independent project aimed at developing, under MS Windows CE [73], a USB driver for webcams compliant with the USB Video Class 1.1 (UVC) standard [72]; it offers low-level options to access the stream of data. It is not a "standard" driver, fully integrated into the OS and easily usable through the common set of APIs (Application Program Interfaces) described in [74]. To use it the developer has to know how to set some protocol parameters and to strictly follow the programming model explained in the documentation. The results obtained this way are not always satisfactory, since, although the UVC standard is defined in almost all its parts, manufacturers do not always follow it precisely and hence there are sometimes glitches that require workarounds, which have to be carefully tested7.

7We have noted that the proper working of such devices can depend on the version of the device itself, since sometimes the manufacturer updates or modifies the hardware while maintaining the same product name.


Anyway it is still the only solution to equip the PDA with an external camera. An alternative would be to use the camera commonly found in some mobile devices, but we have to take into account both the purpose of such an input peripheral (discussed in Section 2.1) and some practical considerations: usually the integrated camera can not focus at the required distances (about 1–2 cm) and it has no illumination source.

An external USB webcam can also be used with the Colibri module, but a better solution would be to directly connect the already mentioned CMOS camera sensor to the CPU through its QCIF interface (see Section 5.1). This has the main advantage of being a simpler solution, because it requires almost no CPU work to obtain the raw image (it would be a DMA transfer) and also the hardware would be globally simplified. Indeed, in common webcams there is a processor, beside the CMOS sensor, that compresses the images before transmitting them through the USB port. On the SBC side, the CPU has to decompress them before giving them to the application, and this is a significant overhead. Anyway, despite these benefits, there are some drawbacks, since we only have source code examples showing the feasibility of this approach; they are not a fully developed driver. We believe that this is still the most promising solution in the long term, and its development is still underway; the USB webcam option has been extensively used during the current development stages (despite its drawbacks) since it enabled us to have a prototype faster.

5.2.2 Software components

We have seen that the image capture peripheral is the only hardware concern for this platform. The whole system is implemented through a pure software application, and so, in order to verify the feasibility of this approach, we have to evaluate the software environment and, possibly, the need for third party components. The solutions proposed in this chapter are all equipped with MS Windows CE: this means that we have a full-featured operating system, with almost all the needed software components at our disposal. This is true only at a first glance, since going into the details we can see that such an OS does not even have a standard C library8 by default.

This issue has to be carefully taken into account when using (or porting) third party software. Nevertheless it is very similar, especially from the developer's standpoint, to the MS Windows family of OSs, as can be seen in [76]: this is especially true for the WinCE APIs, which are almost identical to the Win32 ones, except for some advanced options. This similarity is only on the application side, since the OS kernel, the device driver subsystem and all the internal details are clearly very different, but this is obvious since they are two distinct, unrelated products, designed with different targets in mind.

8By C standard library we refer to the functions described in [75].


One important note concerns the naming scheme of all the Microsoft products related to the mobile world: end user and commercial products (e.g. mobile phones) always refer to MS Windows Mobile, MS Windows Pocket PC, MS Windows Smartphone and so on, but they are all based on MS Windows CE 5 (version 6 has actually already been released, but there is still no version of MS Windows Mobile based on it), which is the core part of all the other software. The main notable limitation, for our application, is the limit on the maximum amount of memory available for each process, 32 MB, since all the other issues, starting from the reduced graphical and GUI capabilities, are at most a concern in the development phase. While the PDA is equipped with a full-blown OS, with all its features and applications, and it is not possible to modify its configuration, the Colibri module can be re-programmed with a custom version of the OS. This is possible with the MS Platform Builder, which enables the developer to deploy on the device a stripped-down version, with only the minimal required components (possibly even without the GUI, which requires a considerable amount of resources), thus leaving more space to the final application.

Given the outlined features of the shipped software base platform, the idea is to design and develop a core cross-platform engine which takes as input the stream of images coming from the camera sensor, elaborates all the information and gives the results through a text-to-speech synthesis engine (as proposed in Chapter 2). A cross-platform application does not mean a binary-compatible executable, but software that can be easily re-compiled, using the same source files, for the desired target platform, simply by setting the proper compiler options. This is a valuable feature, because we can use the same development platform to test the application, and then deploy it on the final device (all these concerns will be discussed in Chapter 6). Nevertheless we have to search for a viable way of enabling the core part to interact with the user (and with the developer for test/debug purposes). This means having a GUI (Graphical User Interface) and a TTS (Text-To-Speech) library. They are not required to be the same in both cases, but they should offer the same opportunities. In the first case, for the development platform (a common PC with MS Windows XP) we have preferred to use a cross-platform library which offers a lot of advanced features, wxWidgets [77]. It has been chosen for its simplicity, its open-source nature and its flexibility in comparison to other alternatives, such as the well-known MS MFC (Microsoft Foundation Classes). Nevertheless, this library has too many features and, in the end, adds an extra "software layer" to the application without any significant advantage in a mobile scenario. The alternative is to use a light and essential library: WTL (Windows Template Library, [78]) is excellent in this field, since it is composed only of header files and heavily relies on an advanced use of C++ templates. With this technique, called static polymorphism (described for example in [79]), we have zero run-time overhead because all the name bindings are made at compile time instead of at run time (as usual with virtual functions).


Thus WTL can be seen as a tiny wrapper (the source code is also the documentation) around the Win32/WinCE APIs, but with the valuable addition of enabling the developer to use an object-oriented paradigm.
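The following fragment sketches the static polymorphism idea exploited by WTL (the class names are invented for the example): the "virtual" call is resolved at compile time through the template parameter, so there is no virtual table and the call can be inlined.

template <typename Derived>
class WindowBase {
public:
    void Paint()
    {
        // Bound at compile time: no virtual dispatch is involved.
        static_cast<Derived*>(this)->DoPaint();
    }
};

class PreviewWindow : public WindowBase<PreviewWindow> {
public:
    void DoPaint()
    {
        // draw the camera preview here
    }
};

// Usage: PreviewWindow w; w.Paint();   // statically dispatched to DoPaint()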

For the TTS library, there is neither an already available option shipped with WinCE nor a common API such as the SAPI (Speech Application Programming Interface) available on the Win32 platform, which enables developing software relying on such functionalities without depending on the particular implementation. To the best of our knowledge, there is no open-source option for this kind of component (supporting languages other than English, in particular Italian), so we looked for a commercial solution. There are many alternatives, but in the end we have chosen the Loquendo Embedded TTS, because it is available for many languages at different sampling rates (from 8 to 44 kHz) on many platforms, it is optimized for a wide range of CPUs, including the XScale ones, it has low memory (RAM and ROM) requirements and it is already successfully used in other applications for blind people. Looking at the documentation [80] we can see that there are two ways of using this library: synchronous and asynchronous. In the first way, when we send a text to the engine, we wait for it to finish speaking, while in the second way the string is sent and then the application continues to run independently, and at the same time (thanks to multi-threading facilities) the TTS engine produces the speech in parallel. Ideally the latter mode would be better, and it is the way in which the TTS was first used, since the main functions continue to elaborate the successive images. Not stopping the routines that process the input data flow is important to correctly track the device movements and to properly handle the recognized characters (to form words), but this is possible only if there are enough resources to efficiently run two threads in parallel. On a mobile platform, where these are limited, the result is a fragmented speech, interrupted by the main application thread and vice versa. This is the typical consequence of having two heavy threads running at the same time, and the only solution is to make one of them lighter. One step in this direction has been to switch the TTS sampling rate from 16 kHz to 8 kHz; nevertheless the results were still not optimal. This led to the conclusion of preferring a synchronous mode, so the speech is perfectly audible and, using a message queue similar to the one used by GUI applications to manage events, the interruptions of the main program are minimal (provided that the text of each message is short). For the MS Windows CE version of this library there is a further option [81], which consists in a client-server approach, where the proper TTS engine is instantiated in a separate process space and the APIs called by the main application are merely stubs used to dispatch the information from one process to another. This should serve primarily to partially overcome the 32 MB (per process) memory limitation9, leaving more space to the main application.

9One of the biggest improvements of the new MS Windows CE 6 is the elimination of this constraint, but for now the application has only been tested on WinCE 5 systems.


In practice we have noticed no perceptible improvement, rather an occasional slight delay, probably due to the inter-process communication, which is considerably slower than a simple function call. In the end, the adopted solution is an asynchronous, in-process use of the TTS engine, with an 8 kHz sampling rate. It is not the ideal result, but it is the best trade-off we could find.
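A minimal sketch of the message-queue idea is shown below; the engine call and the class are hypothetical placeholders (the real Loquendo API and its integration with the main loop differ), but they illustrate how short messages can be queued by the processing code and spoken one at a time, keeping each interruption of the main program short.

#include <queue>
#include <string>

// Placeholder for the real engine call; it blocks until the text is spoken.
void ttsSpeak(const std::wstring &text)
{
    // the actual TTS engine invocation goes here
}

class SpeechQueue {
public:
    // Called by the processing code: just enqueue a short message.
    void Post(const std::wstring &text) { pending_.push(text); }

    // Called at convenient points of the main loop: speak at most one
    // message, so the tracking routines are never stalled for long.
    void FlushOne()
    {
        if (pending_.empty())
            return;
        ttsSpeak(pending_.front());
        pending_.pop();
    }

private:
    std::queue<std::wstring> pending_;
};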

In conclusion, in this section we have verified the feasibility of the software part of the system. At this stage we have not yet chosen all the components we could need, but nevertheless we have defined a base environment for the final application. We will need more accessory libraries, but none of them is as indispensable as the ones discussed here. These concerns will be addressed in Chapter 6, because they are not related to the specific underlying hardware platform, but are more aimed at helping to develop a well-designed solution.

5.3 Final considerations

The adoption of a more conventional platform, like the one proposed in this chapter, certainly allows starting to develop the software (thanks to the PDA device) from the beginning and, at the same time, designing a custom device built around a full-fledged SBC. This last task is much easier than in the case of the DSP/FPGA platform, because all the needed peripherals are already integrated. The only one to be truly designed from scratch is the image acquisition part, by means of a CMOS camera sensor; nevertheless this should be a pretty straightforward task, which does not require any external glue logic.

The CPU at the base of the proposed solution is certainly not as powerful as a DSP together with an FPGA, but, having examined it in detail, we can see that it is heavily optimized to achieve high performance while maintaining a low power consumption profile. It has a better MIPS/mW ratio than the DSP and, more importantly, it has no "baseline" power consumption. This is a consumption independent of the workload and strictly related to the DSP used; for the examined system (see Section 4.3) it is about 950 mW (according to [51], only for the core part, without considering the analogous value for the I/O, a further 80 mW). By comparison, for a PXA270 at 520 MHz as in the Colibri module, in [68] a value of about 750 mW is reported as a guideline to design the power supply module (and hence a kind of maximum consumption, again referred to the core part, without the I/O). All this information is only presented to give an idea, since in both cases the real results strongly depend on the specific application. For the XScale solution, the power requirements are related to the actual workload and can not be estimated a priori as for the DSP.

From the designer's standpoint, the whole development cycle is now more straightforward, since we do not have to cope with different tools and with all the burden of keeping various hardware/software implementations of many subsystems synchronized.


The increased flexibility is counterbalanced by the need to pay more attention to the optimizations in order to have a usable final system, but this is an achievable goal, as we will see in Chapter 6. Relying on a widespread and well-established OS offers the possibility of focusing the efforts on the main application, which could be deployed on commercial mobile devices (PDAs, smartphones and so on), integrated with existing information services, or optimized for a custom platform (designed around an SBC). Using a well-known CPU architecture as the core part of the system should widen the future possibilities of re-engineering the underlying hardware, e.g. around a new SoC, to achieve better performance, ergonomics or autonomy without an overall re-design of the software from scratch (since the basic programming model would be the same). This makes the whole system open to further improvements and future novel solutions, hopefully guaranteeing a longer life, a wider spread and an increased usefulness.

Considering the development concerns, we can now take advantage of the same tools used to develop applications on common PCs. Good compilers are available, which offer strong support for the C++ language (almost fully compliant with the ISO standard, not restricted to the embedded version as in the DSP case) and hence we are now able to fulfill all the goals/requirements addressed in Chapter 6. This makes it possible to re-use third party libraries10, not essential for the main application itself, but important for testing and debugging. We can further extend these considerations to the aim of defining a common cross-platform environment with enough high level functionalities, something like a "software abstraction layer" that hides all the system-specific details and optimizations.

In the end, summing up all the points outlined above, obtained from the comparison and a brief feasibility study of two different kinds of solutions, we can see that a solution based either on an SBC or on a common PDA should be a more suitable way to develop the device proposed in Chapter 2. Thus, having defined the hardware platform, it is time to focus the attention on the software part, addressed in the next chapter.

10Often we will have to modify their sources in order to adapt them to the new target device, but it is still a feasible task, almost impossible in the DSP environment.


Chapter 6

Software design and development

In the previous chapters we have evaluated and proposed two different solutions for the hardware platform, choosing in the end the one based on the XScale architecture. It is currently available both in commercial PDAs and in the form of an SBC, equipped with an OS based on WinCE1. Now we have to set up the development environment, choosing the right tools (usually an IDE, Integrated Development Environment), the programming language and possibly some other useful libraries. At the same time we will state some guidelines for developing the software, with the aim of designing a final product that is easy to maintain, to improve and to extend. In this chapter we will not look at the details of the implementation (this is not documentation of the developed software), but rather we will discuss each idea and choice behind the software design and development, presenting only some real examples in order to give a practical idea of the consequences of those decisions.

6.1 Development environment

For the development of WinCE applications it is common to use the MS Visual Studio IDE, in this case the 2005 version. It allows building both native executables and MS .NET Compact Framework ones. The second option means writing programs that will run on a virtual machine, and hence the run-time performance is far from optimal2. Anyway such a choice offers a lot of valuable features,

1From now on we will use WinCE to refer to MS Windows CE, which is the core part of the whole family of Microsoft mobile OSs (MS Windows Mobile, MS Windows Pocket PC and so on). Likewise, Win32 will refer to the Microsoft family of desktop OSs running on the x86 CPU architecture, e.g. MS Windows XP, MS Windows 2000 and so on.

2The emulation layer introduced by a virtual machine always slows down performance. Of course there are some techniques to reduce this side effect, such as Just-In-Time (JIT) compilation, but they are difficult to implement on a mobile platform and we are not aware of the availability of such optimizations on the devices we have used.


like a much more comfortable programming and execution environment; but since efficiency, as outlined previously, is one of the main concerns, this approach will be avoided.

Building native executables usually means writing programs in C or C++. The latter is derived from the former and can be seen as an extension of it (a good reference guide to C++ can be found in [57]). In the embedded world it is often avoided in favor of the more traditional C language [75]. The most notable criticisms are mainly related to the performance of the resulting executable, concerning both the size and the speed of the generated code. For this reason the so-called Embedded C++ [58] was created, which is a subset of standard C++ and is supported by some embedded device manufacturers, also in the DSP world, as mentioned in Section 4.3. Again in Section 4.3 we have seen that some notable C++ features, such as templates, namespaces and so on, are not structural language obstacles to obtaining fast and small code; the problem resides in the specific compiler implementation. Anyway we are not interested in building the smallest possible executable, since we have all the memory we need to store the application along with the needed data files. About the speed of the generated code, we can investigate how these advanced language features affect the final executable [60], and then decide if, how and when to use them. Indeed in the C++ language there is no penalty due to a feature that is not needed and hence not used; in short, "we pay for what we use".

C++ is a very complex language and it is not easy to have a good knowledge of all the subtle details and features for an effective use, even if there are good references, such as [82, 83, 84]. Summing up all the features of C++, we can say that it is a multi-paradigm language, since it allows the imperative, object-oriented and functional (through the so-called template meta-programming facilities [85]) programming styles, and we can freely mix them in order to achieve the desired results. Thanks to its abstraction capabilities we can easily define new structured data types, for example to represent graphical objects from a high-level standpoint, and thanks to the data encapsulation features we can guarantee the proper type safety by controlling the access to class members. Type safety is also guaranteed by the introduction of the new-style casts and the possibility of explicitly controlling them. Through the (multiple) inheritance mechanism we can build hierarchies of classes re-using and extending already implemented data types, enhancing the modularity of the application. We can use polymorphism and overriding to customize some functions of newly designed data structures while maintaining the same syntax. There is the operator overloading option, which allows defining custom semantics for the operators of the classes we define. In combination with the use of templates, we can implement the so-called Domain-Specific Embedded Languages (DSELs, see [85]), which are a powerful tool to "inject" a new language, with its own syntax and semantics, into another one, leading to more expressive code, although more complex to debug and to implement from the language designer's standpoint.


Inheritance is not the only option to extend or re-use classes, since we could prefer

composition techniques: they have the advantage of not introducing all the burden of the late binding of functions, usually implemented through the so-called virtual tables. Moreover C++ enables the compiler to use inlining optimizations (it has a proper keyword to specify this request), i.e. to substitute the function code directly into the calling code, instead of using the traditional function call mechanism. The same result can be obtained through the use of preprocessor macros, but they are not type safe and are more difficult to debug3. Code inlining is also very useful combined with templates, which open up all the power of the generic programming techniques: this means being able to implement algorithms independently of the data type for which they will be used. One last valuable feature of C++ is the presence of a proper set of instruments to handle exceptions. They are helpful to write cleaner code, replacing the usual practice of extensively checking every function return value (or specific variables) for errors; nevertheless they introduce a small run-time penalty (this is still an open issue, see [60]).
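A small, self-contained example of the combination of templates and inlining mentioned above (the function is illustrative, not taken from the application): the same clamping routine works for any pixel type, the binding is resolved at compile time and the body can be inlined at the call site.

template <typename T>
inline T clamp(T value, T low, T high)
{
    // Generic and type safe, unlike an equivalent preprocessor macro.
    return value < low ? low : (value > high ? high : value);
}

// clamp(300, 0, 255) == 255;  clamp(-0.2f, 0.0f, 1.0f) == 0.0f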

The power of C++ resides not only in the language itself, but also in the standard library associated with it. In particular the most interesting part is the Standard Template Library (STL, see [61, 86]), composed of a wide range of data structures and algorithms ready to be used. Despite being discarded in the Embedded C++ design, it is very helpful since it provides a good and effective example of how to take advantage of the language features. In short, it is a set of robust solutions to many questions that every developer has to face (e.g. how to properly design a container). In [61] every function/class is described in detail and for each of them the pros and cons, the advantages and the pitfalls of such solutions are specified. The availability of this information shifts the decision of how, when and where to use such components from an implementation issue to a design one. All those components, as the library name implies, make an extensive use of templates, so they are a very flexible and re-usable code base, useful in almost every context, avoiding re-implementing the basic functionalities each time. The developer no longer has to worry about how to manage memory, for example, which can be painful in an environment without a garbage collector, or to manually cope with arrays of characters instead of using strings. In the end the purpose of the STL is to complete the power of C++, which can still be considered a relatively low-level language like C, but with a lot of enhanced features. The developer still has full control of every aspect of the generated code, without any hidden side effects and with a minimal language run-time overhead.

3One of the goals of the C++ designers was to reduce the importance of the preprocessor, since it can be considered a tool useful for writing workarounds when there are no other possibilities. It is better to have instruments to directly tackle those problems, in order to preserve all the characteristics of well-designed software.


Moreover, all the tools usually shipped with other high-level programming environments are available, helping to quickly implement effective applications.
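As a small illustration of the kind of boilerplate the STL removes (the snippet is not taken from the application): containers manage their own memory and the algorithms work on any iterator range.

#include <vector>
#include <numeric>

double meanIntensity(const std::vector<unsigned char> &pixels)
{
    if (pixels.empty())
        return 0.0;
    // std::accumulate replaces the usual hand-written summation loop;
    // the vector releases its own storage, no manual memory management.
    long sum = std::accumulate(pixels.begin(), pixels.end(), 0L);
    return static_cast<double>(sum) / pixels.size();
}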

Besides the STL described above, there are also other libraries that follow the same idea of completing C++ with a useful set of tools. Their aim is both to exploit the power of the language (always with the goal of enabling the designer to develop well-engineered software) and to make the developer's life easier. Among the others, Boost [87] is a rich collection of small libraries that overcomes the lack of some functionalities in the standard C++ library. Examples are an effective way to access the filesystem, new and improved container data types, advanced aids for the implementation of the lambda calculus or for a uniform management of callbacks, multi-threading capabilities4, extended type-traits and meta-programming utilities. They are so useful that some of them have been included in the C++ TR1 proposal [88], which is a first attempt to extend the standard C++ library, a step towards the upcoming new C++0x language ISO standard (expected in 2009). Another very useful library for our application is Blitz++ [89], which provides a wide range of tools to manage arrays. It offers a lot of options, among which the most interesting one is the possibility of writing complex array expressions in a natural way. Sometimes it resembles the power and the flexibility of other well-known commercial tools specifically targeted at this kind of elaboration. Blitz++ is a real example of the techniques described in [79], and the combined use of static polymorphism, type-traits and expression templates enables the developer to rapidly write clean and efficient functions.
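A sketch of the Blitz++ style of array expressions is shown below (the sizes and the operation are illustrative only): a whole-image computation is written as a single readable statement, and the expression templates evaluate it in one pass without creating temporary arrays.

#include <blitz/array.h>

void blend()
{
    blitz::Array<float,2> a(240, 320), b(240, 320), c(240, 320);
    a = 1.0f;
    b = 3.0f;
    // One statement instead of an explicit double loop over rows and columns.
    c = 0.5f * (a + b);
}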

Besides the libraries outlined above, there are others that have been used for specific tasks, and hence will be described later. None of these components were originally designed by their authors to be used also on the WinCE platform. Anyway, all of them are open-source projects and this means having the opportunity to look at the source code and possibly modify it in order to satisfy specific requirements. Porting such libraries to the target platform has required a great effort because, even if they do not have any particular requirement, they still assume the presence of a standard C library and, for C++ code, of the STL. In the first case it is mainly a matter regarding WinCE, which has not been designed with this feature in mind5.

4In this case, for example, there are already other libraries with the same goal, but most of them are only tiny wrappers around low-level system-dependent functionalities. Conversely, this is an attempt to create a high-level set of components, taking advantage of the specific language features. It enables the developer to use new programming techniques in an easy way, relieving him from understanding the internal working details.

5An example is the extensive use of Unicode to represent strings: all the APIs support only 16 bit characters, while many C standard string functions still rely on the ASCII 8 bit encoding. This does not mean that these last ones (and others) are totally absent, but the current implementation is still incomplete or lacking some options.


For the STL, the problem is that, despite the documentation, the version shipped with MS Visual Studio 2005 is sometimes not compliant with the standard, and it seems that it sporadically leads to unpredictable results. This issue only concerns the embedded environment, since the desktop one does not have such problems. It is not due to the compiler itself but rather to the library implementation shipped with it. For these reasons, STLport [90] has been chosen to replace the default library. It is an open-source and well-established implementation of the STL and it is mainly focused on portability, also on embedded systems. In this way we are now able to set up the basic development environment and, on top of it, all the other libraries (starting from Boost and Blitz++) can be adapted or simply re-compiled for this platform. In this way we have a common and uniform set of functionalities running across both the target platform and the development one. For components relying only on this "software layer" no further custom workarounds are needed, and the same source file can be used everywhere, thus having only one code base but multiple execution environments.

6.2 Design and development guidelines

For the software design and development, we need to decide on some guidelines in order to obtain not only a working application, but also a good quality of the implemented code. As already stated in Chapter 2, this project is still underway and the prototype will be reviewed by end users in order to validate the whole system. This means that the software may be modified, improved or extended. We would like to do it in an easy way, thus making the application an open system rather than a closed one. Therefore, we will point out some guidelines leading to a set of practices useful to achieve those goals. We can sum up all these concerns into five categories:

Modularity means subdividing the whole system into a set of independent components connected together. The resulting work can be seen as a framework composed of different objects that form the final application. This allows accomplishing the challenging project goals step by step, making the system management easier and enabling work in parallel on different aspects of the software.

Flexibility is useful both to keep the developed code open to improvements and to easily adapt it to new requirements. We would like to avoid continuously refactoring already implemented algorithms only to use them with new data types, or starting to develop new functions from scratch without the possibility of re-using part of the existing code base, already tested and debugged. The system should be open both from the developer's standpoint and from the user's one.

Extensibility Following the previous points, we could think about an application open enough to be either integrated with, or expanded by, external components. This feature could be useful both for the final application (e.g. to export the recognized text to other programs) and at the design stage. Properly choosing the right tools for each development step means being able to obtain better results. This implies the use of different instruments, which have to be inter-operable among themselves and with the final application itself. This is possible only if the software components can be "exported" outside the main environment.

Testability As we have seen in Section 2.1, the software is very complex, both because the main goal is very challenging and because the involved functions (see Section 3.2) are inherently not easy to implement, debug and test. In order to ensure that what we have developed is what we intended, testability concerns have been addressed from the beginning. Examples are how to test and verify single components, the integration between them and the comparison of different versions of the same function.

Efficiency The run-time performance is important, because of the real-time behaviour of the device. Usually this concern is left as the final one, once the developed application is complete (from the functionality viewpoint), but in this case efficiency can be a driving factor in choosing the proper functions to be used. There are different kinds of optimizations, but some of them have to be considered from the early development stages.

All those issues have been addressed while always trying to take advantage of the underlying hardware. At the same time the solutions should not be too tied to the current platform, in order to avoid refactoring everything in case the application is ported to a new platform. In the following they will be described in detail.

6.2.1 Modularity

As outlined above, this task consists in dividing the whole system into smaller components, which can then be developed and implemented separately. In Section 3.2 we have seen how the whole application is organized from the data-flow standpoint, but now with each of those blocks, shown in Fig. 3.1, we associate a class (or a set of classes) and some functions. The system is already organized as a set of relations between entities, but now we define the set of components that models them. The result of the mapping between functionalities and object classes can be seen in Fig. 6.1. Clearly the proposed architecture is not the only possible one; it is mainly derived from experience rather than from theory.


Some advice (see [91, 92]) has been taken into account; it can be geared more towards development or design aspects, in the form of guidelines, practices or design patterns.

Figure 6.1: Overall data-flow diagram of the proposed application

In the system partitioning process we have to guarantee first of all that every single object has one and only one well-defined responsibility, independently of all the other class roles. Its task has to be sound and complete. With the first characteristic we refer to the coherency of the design: the object role should not be misunderstood nor its features misused. With the second one we look at the usefulness of such a component: it should be self-sufficient in completing all the tasks related to the purpose it was designed for. When developing a class hierarchy (considering the inheritance, use and composition relations) we have to ensure that no inter-dependency exists: if we drew the corresponding schema using a directed graph, according to the above statement, the resulting graph should be acyclic or, better, a tree. This is important to guarantee a good degree of freedom in extending/re-using the framework without being tied to a specific structure.

Other important design aspects are the simplicity and the clarity of the solution, following the KISS principle (Keep It Simple Software or, better, Keep It Small and Simple, derived from Occam's razor, which states that things should be made simple but not too simple), which leads us to prefer correctness and readability above all. There is always the opportunity to extend or to improve an existing solution (if it is well engineered), but downsizing an already implemented function that has grown too fat and heavy is almost always problematic.


In this regard the Test-Driven Development (TDD) methodology [93, 94], and more in general the agile development and extreme-programming practices, are very useful, because implementing a unit test means defining and verifying the functionalities of such a component.
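A minimal sketch of what such a unit test looks like in this spirit (the tested helper is hypothetical, not a function of the actual engine): the expected behaviour is written down, and kept, as executable code.

#include <cassert>

// Hypothetical helper under test: counts the runs of equal pixels in a row.
int countRuns(const unsigned char *row, int width)
{
    if (width <= 0)
        return 0;
    int runs = 1;
    for (int i = 1; i < width; ++i)
        if (row[i] != row[i - 1])
            ++runs;
    return runs;
}

void testCountRuns()
{
    const unsigned char uniform[4] = { 0, 0, 0, 0 };
    assert(countRuns(uniform, 4) == 1);   // a uniform row is a single run
    const unsigned char edge[4] = { 0, 0, 255, 255 };
    assert(countRuns(edge, 4) == 2);      // one transition, two runs
}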

Modularity also means correctly abstracting and encapsulating data. The first term means designing high-level data types by structuring low-level information. Moreover, it suggests separating how the component will be used, i.e. the external protocol, from the actual implementation and its internal details, leaving the developer free to improve or optimize the class without changing the rest of the system. The second term is about how and when to hide low-level details, to prevent external functions from modifying the object state. This is usually obtained by making a distinction between the public interface6 and the private implementation. Anyway this need has to be balanced with the ones related to testability. Special attention is necessary to manage resources that can not be duplicated or that can be instantiated only once (such as the TTS engine or the object representing the camera) and are used by multiple client objects. On the other hand, we shall avoid, whenever possible, designing components relying on global and shared data, since their behaviour would no longer be a function only of their internal state (and perhaps it would cause glitches difficult to track down, e.g. in a multi-threaded environment).

As a final step we also have to state the relations among all the defined classes. There are mainly three types of relations: the first is the use-relation, when a class uses a component simply as a parameter or variable in some functions; the second is composition, when an object is a member of another one; the third is inheritance. The first one usually presents no particular problems, while the other two, and in particular the third, are often misused. Inheritance strengthens the coupling between two classes and can make the base class fragile, especially when it was not designed for inheritance. The derived class can modify its internal state bypassing its public interface, therefore making the code more complex to debug. In synthesis, inheritance is useful to replace the base class with a derived one wherever the first is required, but not to add functionality to an existing one [91]. This relationship plays an important role in object-oriented systems, but it has to be used carefully. Conversely, composition does not have those problems and hence it will be preferred. This choice leads to a "horizontal" framework, with a loose coupling between classes, in contrast to the vertical structure typical of systems built around inheritance. The purpose is to enhance the flexibility of re-using or replacing some parts without affecting the overall code base.
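The following fragment sketches the difference with illustrative class names: instead of inheriting from an image class just to add a filtering step, the filter owns an image as a member and uses it only through its public interface, so the image implementation can change without breaking the dependent code.

#include <vector>

class Bitmap
{
public:
    Bitmap() : width_(0), height_(0) {}
    int Width() const  { return width_; }
    int Height() const { return height_; }
private:
    int width_, height_;
    std::vector<unsigned char> pixels_;
};

// Composition: SmoothingFilter *has a* Bitmap and uses only its public
// interface, so the Bitmap internals can change without breaking this class.
class SmoothingFilter
{
public:
    void Apply(const Bitmap &src) { /* read src through Width()/Height()... */ }
private:
    Bitmap work_;   // internal working copy, not exposed to clients
};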

Looking at the schema presented in Fig. 6.1, we can see that there is a top class named hvEngine which encloses all the core engine functionalities. It is the cross-platform component mentioned before, and its purpose is to take the input image from an external source, perform all the computations and then give back the results. It has only a few public methods (a minimal sketch of this interface follows the list):

Feed is the method that accepts a reference to an image as input. In order to give the maximum flexibility it has been implemented as a template, able to handle different types of images; anyway, it is currently optimized only for BGRA bitmaps. Its task is to take the raw image, perform all the basic image pre-processing functions and convert it into an RLE image, in order to downsize the amount of data passed to the other routines.

Process is the most important function, because it does almost all the work. First it takes the RLE image and performs the segmentation, obtaining a set of high-level data. These are used to track the user movements and to decide when and what information to send to the character recognition subsystem. The results are in the end collected and sent to the objects representing the user feedback devices. This routine is not implemented as a big monolithic function, but rather as a hub that in turn invokes a set of more specific subroutines. For these reasons there is no hvEngine::Process() method marked on the above diagram, but there are hvEngine::Track() and hvEngine::Recognize(). For the interface with the output objects, there is a set of functions denoted by the "Send" prefix, e.g. hvEngine::SendLineStart(), hvEngine::SendLineEnd() and so on.

TakeSnapshot is the method used to take a snapshot of the current internal engine state. It is very useful both to have a graphical preview (for test purposes) and to obtain a set of information about how the different steps of the elaboration are working. It is a separate function from the previous one because theoretically it should not be needed in the final application; it is up to the user of this component to decide how and when to use it. It is coupled with a set of CopySnapshot functions, which output such information to three (optional) data structures: an image for the graphical preview, an HTML text for a readable description and a custom RawSnapshotData type. The first two are used to give the qualitative results in a high-level format, independently of the surrounding environment. In practice, the preview image is very useful also on a mobile device, but the HTML text is not as convenient as expected (this option is actually disabled on the target platform). Having a "raw" low-level snapshot is almost essential for debug and test purposes (as we will see later), but also to optimize the visualization of the progress of the internal computations.

AddUserFB is the method used to make the engine aware of some external object to be used as an output device. It is not possible, for these features, to rely on the snapshot mechanism described above, since the information has to be given to the user as soon as it is computed. For this reason the data are sent from different places inside the Process method, and the engine has to hold a reference to these external components. It has to be remarked that output devices cannot be encapsulated in the engine, since they are system specific and they would tie the implementation to the actual platform, making it difficult to expand or re-use the whole framework.

LoadCfg along with SetCfg and GetCfg are the functions that enable the main top-level application to set up the proper operating parameters. These can be passed to/from custom data structures or loaded from a file. This enhances both the flexibility and the modularity of the system because, for example, the whole ANN subsystem can be entirely replaced on-the-fly, as can all the other parameters.

Reset is the method used to reset the whole engine and to put it in a "safe state". It is essential for the test sessions, since they have to be started from a well-known condition. At run time it can be used as a last resort whenever the system is not running as expected.
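Putting the above descriptions together, the public interface of hvEngine can be sketched as follows. Only the method names come from the actual design; the parameter and return types, and the hvEngineCfg aggregate, are assumptions made here for illustration.

class hvUserFB;           // abstract user-feedback interface (see below in the text)
class RawSnapshotData;    // low-level snapshot data
struct hvEngineCfg;       // hypothetical aggregate of all tunable parameters

class hvEngine
{
public:
    // Accepts a reference to an input image; templated on the bitmap/pixel type.
    template<class TImage> void Feed(const TImage &img);

    // Runs segmentation, tracking, recognition and user feedback on the last frame.
    void Process();

    // Snapshot of the internal state, for preview/debug purposes only.
    void TakeSnapshot();
    void CopySnapshot(RawSnapshotData &raw) const;

    // Registers an external output object (e.g. TTS, tactile display).
    void AddUserFB(hvUserFB &fb);

    // Configuration management: from a file or from custom data structures.
    bool LoadCfg(const char *fileName);
    void SetCfg(const hvEngineCfg &cfg);
    void GetCfg(hvEngineCfg &cfg) const;

    // Puts the engine back into a well-known "safe state".
    void Reset();
};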

The Feed and Process methods have been separated since the engine was initially designed to run in multi-threaded environments: image acquisition libraries are often designed to use a callback mechanism to interact with the main application. Usually callbacks run in a different thread, where an infinite loop acquires images. Since the application runs in the main thread, we have to provide the synchronization infrastructure to safely exchange the data back and forth between the two threads. According to this model, the Feed function was designed to run inside the library thread, and the Process routine in the main one; therefore each of them has its own set of private member data. Ideally, this approach could also be seen as a pipeline model, because Feed is able to run on a new image while Process is still working on the previous frame: there is no interdependency besides the initial data transfer at the beginning of Process. In practice, multi-threading has turned out to be not as useful as expected, and actually a single-threaded approach is used. Anyway, that architecture is still maintained in order to enforce the modularity of the system: this allows improving one part of the system without changing the rest.
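A minimal usage sketch of the single-threaded case (the Camera type, the GrabFrame call and the hvImageBGRA frame type are placeholders, not the actual acquisition API):

// Single-threaded variant: acquisition and processing alternate on each frame.
// In the original two-thread design, Feed() would be called from the capture
// library callback and Process() from the main loop, with a lock protecting
// the frame hand-over.
void ReadingLoop(hvEngine &engine, Camera &camera)
{
    hvImageBGRA frame;                  // hypothetical BGRA bitmap type
    while (camera.GrabFrame(frame))     // hypothetical acquisition call
    {
        engine.Feed(frame);             // pre-processing and RLE conversion
        engine.Process();               // tracking, recognition, user feedback
    }
}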

Along with the top-level hvEngine class, there is a set of other data structures that have been used to develop the framework. One of them is hvImage, which is responsible for representing all the bitmap images together with all the operations we could perform on them, ranging from loading/storing to an I/O stream (and hence also files), to the conversion to other image types and all the common image processing functions. There is also an implementation of a custom platform-independent video file format based on this data type, to enable an off-line use of the system, especially useful for testability purposes, as we will see later. The bitmap data type is used by hvRLEImage to extract high-level information in the form of a set of hvGEnt objects, each of them representing a graphical entity (by means of its surrounding rectangle and a few other properties) in the source image. We cannot yet speak of characters, since the recognition is a successive step. hvRLEImage is also used when there is the need of obtaining the bitmap of the symbols to be identified, and therefore a proper indexing mechanism between hvGEnt objects and the RLE image is maintained, in order not to duplicate such data.

Not everything needs to be packed into a new object type, and this is the case of the DetectSkewDeg function, which does not represent any data but is the routine used to compute the source image skew. It takes an hvGEntSetX object (a collection of hvGEnt) as input, and outputs the angle. All this information is first filtered and then used as input for hvTracker, which has the role of detecting the user movements. This implies flagging the switch to a new text line, a line begin/end condition, and whether or not to analyze a new symbol. Sometimes a character is composed of more than one graphic entity, e.g. the i, and all the parts have to be put together before sending them to the recognition step. Although the tracking routines play the same role as a regular function, they necessarily have to be implemented as members of a class: they have to maintain some information across multiple frames and hence multiple calls, and given that those data are only relevant to these computations, they should not be made visible or shared with all the other objects.
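The reason for wrapping the tracking routines in a class can be sketched as follows; only the class role and the hvGEntSetX input come from the actual design, while the member and method names are invented for illustration:

class hvTracker
{
public:
    hvTracker() : lastSkewDeg_(0.0f), lastBaseline_(0),
                  newLine_(false), symbolReady_(false) {}

    // Called once per frame with the filtered graphical entities.
    void Update(const hvGEntSetX &entities, float skewDeg);

    bool NewLineDetected() const { return newLine_; }
    bool SymbolReady() const     { return symbolReady_; }

private:
    // State carried across frames: it is meaningful only for the tracking
    // computations, so it is kept private and not shared with other objects.
    float lastSkewDeg_;
    int   lastBaseline_;
    bool  newLine_;
    bool  symbolReady_;
};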

When the system states that a character recognition is needed, all the information about that symbol is used to create the raw bitmap for the hvImgRcgn object, which runs the proper routines (they will be described in detail in Chapter 7) and gives as output the glyph identifier and some related data, like the confidence of the result. Again, this is not a big monolithic class, but it is composed of several RcgnNNet components, each one representing a symbol that should be recognized. They encapsulate all the functions and data about the neural networks, but, despite the name, that class does not expose any internal details: it takes as input the feature vector computed by hvImgRcgn and gives as output a numeric value stating the probability that the input data belongs to the symbol the ANN was trained for (for further information see Chapter 7). Through this abstraction it is possible to completely replace the ANNs with other solutions, while still re-using the hvImgRcgn class. The whole system is directly interfaced only with this class, which extracts the feature vector from the input image, passes it to the RcgnNNet objects and then applies the winner-takes-all algorithm to obtain the recognition result. Of course, the composition of this subsystem is fully customizable, thanks to its modularity.
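The winner-takes-all step can be sketched as follows (the Evaluate call and the container layout are assumptions; only the overall scheme, one RcgnNNet per symbol returning a probability-like score, is taken from the text):

#include <cstddef>
#include <vector>

// Each RcgnNNet scores the feature vector against the symbol it was trained for;
// the network with the highest score wins.
int Classify(const std::vector<RcgnNNet> &nets,
             const std::vector<float> &featureVector,
             float &confidence)
{
    int bestSymbol = -1;
    confidence = 0.0f;
    for (std::size_t i = 0; i < nets.size(); ++i)
    {
        const float p = nets[i].Evaluate(featureVector);  // assumed scoring call
        if (p > confidence)
        {
            confidence = p;
            bestSymbol = static_cast<int>(i);   // index identifies the glyph
        }
    }
    return bestSymbol;   // -1 if no network produced a positive score
}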

Each recognition output is encapsulated into a data structure named hvGuess and, together with the previous results, is stored in an hvGuessContainer, which is processed by the function UpdateGuessContainer. This update has to be made according to the information coming from the tracker object, in order to synchronize these data with the device movements along the source paper. Thus, we are able to put together characters to form words, but it is better to verify them with a spell-checker, in this case HunSpell (it will be discussed in Section 7.4). In the end all the information has to be delivered to the user. As already stated, we would like to maintain independence from the particular system, and moreover a TTS engine is completely different from a tactile display. The actual underlying output system has to be decoupled from its high-level functionalities, and in this regard it is very helpful to adopt the Bridge pattern [92]: an abstract class named hvUserFB is used as a "light software layer" to provide a uniform interface (from the system standpoint), and then the concrete derived classes representing the specific devices are developed, e.g. hvUserFB_TTS for the TTS engine. Since it is possible to have, at the same time, more than one of these output components (like TTS and the tactile display, as in the Haptic project, see Section 1.2), the Observer pattern [92] is used in the hvUserFBBroker class. In the end, hvEngine (the top-level class) has only one instance of hvUserFBBroker, which is in charge of dispatching all the messages to all the other (optional) output devices; these can be added/removed dynamically, giving the maximum modularity and flexibility to the whole system.
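A compact sketch of this arrangement is shown below; only the class names hvUserFB, hvUserFB_TTS and hvUserFBBroker come from the actual design, while the methods are hypothetical message examples:

#include <cstddef>
#include <string>
#include <vector>

// Bridge: the engine only sees this abstract interface.
class hvUserFB
{
public:
    virtual ~hvUserFB() {}
    virtual void SayWord(const std::string &word) = 0;
    virtual void SignalLineEnd() = 0;
};

// Concrete device behind the bridge, e.g. the speech synthesizer.
class hvUserFB_TTS : public hvUserFB
{
public:
    virtual void SayWord(const std::string &word) { /* enqueue text to the TTS engine */ }
    virtual void SignalLineEnd()                  { /* play an audio cue */ }
};

// Observer: the broker forwards every message to all registered devices.
class hvUserFBBroker : public hvUserFB
{
public:
    void Add(hvUserFB *fb) { observers_.push_back(fb); }
    virtual void SayWord(const std::string &word)
    {
        for (std::size_t i = 0; i < observers_.size(); ++i)
            observers_[i]->SayWord(word);
    }
    virtual void SignalLineEnd()
    {
        for (std::size_t i = 0; i < observers_.size(); ++i)
            observers_[i]->SignalLineEnd();
    }
private:
    std::vector<hvUserFB *> observers_;
};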

In conclusion, in this section we have seen first the criteria to be used when partitioning a software system, and then the results of such an analysis. As we will see later, this is the basis for making the system flexible, open to future improvements and easier to maintain and test.

6.2.2 Flexibility

Flexibility is an important feature to enable the developed system to be improved and enhanced in the future. Besides, we have to consider that, although the aim of this device, and hence of the software, is well defined (see Section 2.1), we do not know exactly how it will be used, because there are infinite possibilities and it is impossible to consider all of them. We can try to figure out a typical scenario and then provide a device that performs well in such conditions, but it is anyway still a matter of trade-offs. This is where flexibility comes into play: an open enough system will be able to cope with those new conditions, or at least it will be designed to embrace the needed changes easily (e.g. re-using almost all the already developed code base). Flexibility can be considered at three levels:

1. Architecture

2. Development/implementation

3. Run time.


The first is guaranteed by the modularity seen above, provided that the resulting relation schema is not too intricate and there is a loose coupling among the classes. In the proposed solution, the whole system is flexible enough to accept as input a great variety of images, which can be encoded in RGB, BGRA, BGR, HSV or simply a greyscale format. The input device can be an external camera, a custom video file or something else, but in any case the choice neither affects the internal working of the core engine nor requires any modification of it. In a similar way (as we have seen before) we can change the output device (and even use more than one of them) without any sort of problem, even when they are completely different, provided that they are wrapped in a class exposing a predefined interface. Inside the framework, the flexibility is maintained by preferring a horizontal class system instead of a vertical one. It is based upon the collaboration or the re-use of some data types through composition, preferred to the inheritance often used in other solutions. This leads to designing minimal, non-monolithic classes, and to separating the interfaces from their implementation.

The run-time type of flexibility is the one that makes the system adaptable to various scenarios: for example, elaborating images coming from a magazine is completely different from elaborating the ones coming from a newspaper. It is almost impossible that "one size fits all", and it is better to give the user the opportunity of customizing the system, tailoring its characteristics around his needs. The basis for achieving this goal is to equip the system with a comprehensive configuration mechanism. The application actually uses a configuration file, which is a Lua [95] script (more about this choice will be discussed later) and gives a high degree of freedom in specifying the desired setup. The same results can be obtained by directly using the proper core engine methods, since all the parameters can be changed on-the-fly, without having to stop the system first. In any case there are many possibilities to tune the performances, ranging from tweaking the image processing or tracker parameters to customizing the list of ANNs to be loaded or the spell-checker language. The neural networks are not hard-coded and can be easily replaced with new ones, which can differ in topology or performance, or simply be trained on a different character set. This makes it possible to study and improve the recognition subsystem (see Chapter 7 for further details) in parallel with the development of the main application, or even after having completed it.

Besides the dynamic configuration opportunities, we have to take care to design algorithms robust and tolerant enough to handle almost all situations. The first feature means coping with unexpected and not well-defined situations, often leading to (or derived from) computation errors. The system should be able to recover from those failures (avoiding hanging up) in a silent and automatic way, without user intervention; this is a valuable feature since the end user is blind. The second characteristic means making the functions correctly handle all the values the inputs can take. It is a challenging goal, but it can be relaxed with the requirement of making the routines give good results on average (with the least variance possible), or with the assumption that it is better to give no response at all than to provide a wrong one.

Even at the development/implementation level we can look for techniques to promote flexibility. One of these is to resort to generic programming to separate the algorithms from the actual data types they use. If we think about image processing, for example, we can see that a linear filter can be applied in the same way to color or greyscale images: what makes the difference is only the pixel type and its arithmetic operations, in this case the sum and the multiplication. If we are able to abstract the function from these details, we can implement routines once and then re-use them many times. It is not only a matter of economy, but also a good design choice, since if in the future we need to modify such a piece of code, we will have to do it only once, instead of keeping different versions of the same code synchronized. This is the main reason for the introduction of templates in the C++ language: they offer the possibility to parametrize a piece of code on a data type that will be specified by the user of that part. At a first glance it may seem that the same features could be implemented with a bunch of smart macros, but going into the details of the language [57], we can see that templates offer much more than simple "syntactic sugar". Examples are the template specializations, which make the compiler use, for a particular data type, an optimized function instead of the generic one. In this way we are able to keep the flexibility of a general solution without losing the opportunity to implement more efficient code.
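A minimal sketch of this idea, with simplified pixel types and a toy filter (not the actual routines of the framework):

// Generic 3-tap horizontal average: any pixel type providing operator+ and
// operator/(int) can be used, e.g. an RGB struct or a plain greyscale value.
template<class TPixel>
void Average3(const TPixel *src, TPixel *dst, unsigned int width)
{
    for (unsigned int x = 1; x + 1 < width; ++x)
        dst[x] = (src[x - 1] + src[x] + src[x + 1]) / 3;
}

// Specialization for 8-bit greyscale pixels: same behaviour, but this is the
// place where a platform-specific optimized version (e.g. fixed-point or SIMD
// code) can be plugged in without touching the generic algorithm.
template<>
void Average3<unsigned char>(const unsigned char *src, unsigned char *dst,
                             unsigned int width)
{
    for (unsigned int x = 1; x + 1 < width; ++x)
        dst[x] = static_cast<unsigned char>(
            (static_cast<int>(src[x - 1]) + src[x] + src[x + 1]) / 3);
}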

The complex mechanism behind templates is extremely powerful, and it enables the developer to use meta-programming techniques [85]. These are generally used to shift some elaborations from run time to compile time, or to automatically select the best option to achieve the desired results. We use the "meta-programming" expression since it is effectively a new way of writing "programs" which are evaluated at compile time. They have a particular syntax but the same semantics of a "traditional" language, along with most of its traditional expressions and statements. Their purpose is to manipulate the pieces of code to be compiled, for example selecting a specific function among all the others, or defining the right routine based on some characteristics of the input data type. The novelty of this approach is that it does not require any external tool, does not cause any run-time overhead and enables the developer to write more expressive and maintainable code, since we can now state both constraints and relations between data types and functions. This enforces the global flexibility of the system, since, with those aids, we no longer have to manually check or modify crucial details of the application, like the size of an array, which usually has to be specified at compile time as a function of other static parameters. More specifically, the possibility of expressing constraints on data types inside a template enables the developer to write more type-safe code, thus still retaining the powerful flexibility of generic programming and not sacrificing the correctness and robustness of the developed application7. We can rely upon new static assert facilities [85] that perform the desired checks at compile time. Their use is good practice, since they are executed independently of the code execution, in contrast with run-time asserts. The latter are less reliable, because the fact that a program runs without failing any check does not mean that it runs correctly, since the asserts may simply not have been executed. Other useful aids implemented through templates are the so-called type-traits: basically they are template classes that can be specialized and used to develop code that is aware of, and can take advantage of, particular properties, directly inferred, of those data types. This technique does not require modifying the target data implementation or definition; it simply adds to the system new information or functionalities tailored around its properties, improving the separation of concerns.
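A minimal sketch of a type-trait, with names invented here for illustration (the framework's own compile-time checks are shown below in this section): the primary template describes the default case, and a specialization records the properties of a particular pixel type, which other templates can then query at compile time.

// Default: assume a pixel type carries colour information.
template<class TPixel>
struct pixel_traits
{
    enum { is_greyscale = 0 };
    typedef int accumulator_type;     // type wide enough to sum pixels
};

// Specialization: an 8-bit greyscale pixel, with its own accumulator type.
template<>
struct pixel_traits<unsigned char>
{
    enum { is_greyscale = 1 };
    typedef unsigned int accumulator_type;
};

// Client code can now adapt to the pixel type without run-time checks:
//   typename pixel_traits<TPixel>::accumulator_type sum = 0;
//   if (pixel_traits<TPixel>::is_greyscale) { ... }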

It has to be remarked that without templates, as in the case of Embedded C++, we would have to use a complex inheritance hierarchy. In some cases it would not even be sufficient, e.g. when developing containers for built-in data types, which do not belong to any class. The user of the framework would be too tightly bound to its structure, adding more complexity (and run-time overhead) to the whole application and diminishing the opportunities of re-using or extending such a solution, and in conclusion its flexibility.

A brief example of the approach outlined above, using some facilities from the Boost libraries [87], is the hvImgRcgn class, which is used to manage all the character recognition functions. As we will see in Chapter 7, we need to extract a feature vector from the input image before classifying it. In our case, such data is a matrix, and we can have two of them, instead of only one, if we choose to use both the proportional and the non-proportional features. There are some parameters used to declare and use those data structures, like their size or content type. If we wanted to use them with different configurations, we would end up with different classes, even though they are similar and share a common design. This is not the best solution, and we can resort to templates, defining only one object type but with the opportunity to set parameters at compile time. In this way the class declaration can be something similar to:

template<
    bool TNoProp=true, bool TProp=false,
    unsigned int TGridX=16, unsigned int TGridY=16,
    class TNNet=NNet_fxp
> class hvImgRcgn

7 This aspect is so important that in the upcoming C++0x standard a new keyword will be introduced exactly for this purpose.


We can see that it is also possible to specify default parameters and the class representing the neural networks, further increasing the flexibility of such a solution. With only the above declaration, the user of this class could misuse it, since he could set both TNoProp and TProp to false, thus indicating a null feature vector ("neither proportional nor non-proportional features"). We need to put a constraint on such an assignment. This can be done in this way:

typedef bool_<TNoProp> TNoProp_t;
typedef bool_<TProp> TProp_t;

typename TypeConstraint<
    or_<TNoProp_t,TProp_t>
>::check c1_;

The first two code lines are needed to encapsulate the boolean values into a data type that can be managed by the Boost meta-programming framework. TypeConstraint is the static assert mentioned above: it performs the logical or between the two arguments and instantiates an empty and unused variable. This will be swept away by the compiler, therefore not affecting the resulting code in any way. In case of failure, the compiler will not be able to instantiate such a variable and hence will issue an error. This trick is possible since TypeConstraint is defined as follows:

template <typename T> class TypeConstraint
{
public:
    typedef TypeConstraint<typename T::type> check;
};

template <> class TypeConstraint<true_type> {};
template <> class TypeConstraint<false_type>;

Again, we use templates and their specializations as functions that operate on data types instead of objects. If the output of the type argument T is true_type (conventionally the output can be accessed through the type member data type), then the compiler will be able to instantiate TypeConstraint<>::check, given that it is declared and defined. Otherwise, if the result is false_type, this will not be possible, because that specialization is declared but not defined, thus generating the above-mentioned error.

Another brief example of template meta-programming is aligned_ptr, a class used to guarantee the desired memory alignment in the dynamic allocation of objects with arbitrary alignment:

template<class T, std::size_t N> class aligned_ptr
{
    // a1_ = alignment of T
    typedef alignment_of<T> a1_;
    // a2_ = N
    typedef integral_c<std::size_t,N> a2_;
    // a_c = a1_ > a2_ ? a1_ : a2_
    typedef if_<greater<a1_,a2_>,a1_,a2_> a_c;
    // a_ := type with alignment and size == a_c
    typedef aligned_storage<
        a_c::type::value,  // size
        a_c::type::value   // alignment
    > a_;
    // The rest of the class declaration follows, omitted here...
};

The class aligned_storage<> is declared as follows:

template<std::size_t S,std::size_t A> class aligned_storage

where S is the size that such an object should have and A its alignment. In the design of aligned_ptr we have to cope with a possible difference between the desired alignment and the natural one of the type T, and hence we choose the greater of the two: assuming that both are powers of two, this choice is fine because both alignments will be satisfied. Again, this fragment of code is evaluated at compile time and does not have any impact on the run time, so the result is the definition of a custom data type a_ with the desired features.

In the end, flexibility is a key value that always has to be promoted in order to enhance the software quality. In some cases, such as at the architectural level, it almost overlaps with the issue of defining a well-designed class system, while in others it is a matter of developing "smart" functions. With the same purposes, we can exploit some of the powerful features of the programming language we are using. The learning curve of the proposed techniques may be a little steep, but in the long run this approach greatly enhances the reliability of the code base. We are now also able to express the intent of some choices and how some issues should be addressed.

6.2.3 Extensibility

Until now we have seen how to create a modular and flexible framework, composed of many smaller building blocks that, combined together, make up the whole system. After having designed them, it is time to explore their functionalities, both for test and for improvement purposes. As already mentioned, all those components should also work stand-alone, outside the main application scope. Indeed it is useful to further enhance the flexibility of the overall software architecture by enabling the developer to write small and simple throw-away applications. They are not intended to be part of, or a draft of, the final software; nevertheless they are essential to make early tests and comparisons, without waiting for a final and complete release of the main application8. Those tools should not take too much time to be implemented and should have a simple structure, making them easy to improve or to extend with new features. They are only complementary tools used for development purposes.

8 In any case, even if we had only the final full-featured application working as expected, that would not imply the correctness of the single components, since their behaviour is biased by (or masked by) the surrounding environment.

At the beginning of this chapter we saw why C++ has been chosen as the main language to develop the software, highlighting its pros rather than its cons. The main reasons to prefer alternative solutions are its great complexity and its steep learning curve. Moreover, it has a rigid write-compile-run cycle, and this slows down the test phase, for example, because we always have to re-build the application even if, from one execution to the next, the difference is only the value of one parameter. It would be better to be able to rapidly write small scripts, focusing the attention on the use of the already developed components instead of coping with all the details that make the system well designed but also complex. For this aim C++ is no longer the winning choice, and we need an easy-to-use, small, complete and dynamic scripting language. We would like to integrate it with C++, both to extend it and to be extended by it. There are plenty of solutions designed with these characteristics in mind, especially in the open-source world, but in the end the chosen language has been Lua [95, 96].

Lua has been designed with four key goals in mind:

Extensibility Lua was born not as a stand-alone language, but rather as a description language to be integrated into more complex applications, and hence it is pretty straightforward to use it in an embedded way. The APIs are based on a simple and clear stack model, where the client code pushes/pops the proper values and then invokes the desired functions. One host application can have multiple Lua states that do not interfere with each other; the allowed operations can be restricted if needed. This is possible thanks to its high modularity, which makes it easy to strip down the unneeded features and to ensure a very minimal dependence on external resources. It is noteworthy that, from the development standpoint, all the source files can be compiled and linked directly into the main application, without requiring any particular workaround.

Simplicity The strength of a programming language is not proportional to the number of possible ways to implement the same thing, but rather to the ability to achieve all the desired results with a minimal set of well-established features. This does not mean having only simple and basic characteristics; Lua has, for example, first-class functions, closures, proper tail calls, coercion and coroutines. Everything is fully addressed through its small and clean syntax, with only one type for number values and only one for structured data (called table, a kind of heterogeneous associative array). Nevertheless this simplicity does not affect its expressiveness, since it is still possible to use object-oriented and functional paradigms. Its Pascal-like syntax makes the learning curve smooth. Being a dynamic language with automatic memory management, it frees the user from worrying about a lot of low-level details, thus speeding up the whole development cycle.

Efficiency Lua makes use of a Virtual Machine (VM) and hence is neither a purely compiled nor a purely interpreted language. This is in general a good trade-off to achieve higher performances without sacrificing other important features, such as portability. Being tightly coupled with the C language, it makes it easy to move critical code fragments to the C side, in a completely transparent way for the user. The memory management relies upon an efficient incremental garbage collector, and the string data type (which is usually the most memory-hungry one) takes advantage of an effective hashing technique. All these features make Lua a better option, from both the memory use and the speed standpoint, than other mainstream solutions, e.g. Python, Perl and Ruby (although comparisons among different languages are always difficult and controversial, a first attempt can be found in [97]).

Portability As we pointed out when talking about the software, this is an important issue for our application, since it enables re-using the same code base on different platforms. In Lua this is ensured by the fact that the whole core language and its standard libraries are implemented in ANSI C (the compiled code of the full interpreter fits in about 150 kB), and hence every platform for which a C compiler exists can host a Lua VM. This, together with the other features reported above, makes Lua a winning choice to equip embedded applications with a scripting language, even on small microcontrollers with a low amount of memory and limited computing resources.

All the features described above make Lua a good choice to achieve our goals, in particular to open the developed class system to the external world. For those purposes we need an easy way to directly bind C++ class data types to Lua and vice-versa. We already have all the basic functionalities to do it manually, but a facility that does it automatically would relieve us of a tedious and error-prone task. There are already some solutions around, but most of them either require external tools (usually scripts that parse the source code to be exported and generate the proper stub code) or they are not as well-established as needed. Ideally it should be a lightweight software layer directly implemented in C++, that allows describing what we want to export and then takes charge of all the low-level details. We just have to develop the components and then specify which features will be available. There is an already developed open-source solution, named Luabind [98]: it is a library, based on the Boost libraries, that implements a proper domain-specific embedded language (see [85]) through a smart and heavy use of templates. It makes Lua and C++ interact in a pretty straightforward way. As a brief example, we can see a minimal fragment of real C++ code used to export the top-level class hvEngine:

module(L,"hv")

[

class_<hvEngine>("Engine")

.def(constructor<>())

.def("LoadCfg",&hvEngine::LoadCfg)

.def("Feed",&hvEngine::Feed<hvRGB>)

.def("Feed",&hvEngine::Feed<hvBGRA>)

.def("Process",&hvEngine::Process)

.def("Reset"),&hVEngine::Reset)

];

In the first line we specify that the hv module is to be opened in the Lua state L, and then the class hvEngine, its constructor and all the other listed methods are exported. It is noteworthy that Lua does not have any support for templates, so we can only export template instances. Anyway, Luabind is able to manage overloaded functions in a transparent way, and in this case the result is that the Lua Feed method will accept images made of either hvRGB or hvBGRA pixel types.
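On the C++ side, the typical steps needed to run such a script are roughly the following; the sketch uses the standard Lua C API and the Luabind entry point, while RegisterHvModule and the script name are placeholders:

extern "C" {
#include <lua.h>
#include <lualib.h>
#include <lauxlib.h>
}
#include <luabind/luabind.hpp>

void RegisterHvModule(lua_State *L);  // hypothetical helper wrapping the export above

void RunTestScript()
{
    lua_State *L = luaL_newstate();     // create a new, independent Lua state
    luaL_openlibs(L);                   // load the standard Lua libraries
    luabind::open(L);                   // initialize Luabind on this state

    RegisterHvModule(L);                // export the hv.Engine class and its methods

    luaL_dofile(L, "test_engine.lua");  // run a throw-away test/driver script

    lua_close(L);
}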

In conclusion, the main purpose of all the solutions proposed above is to promote the extensibility of the designed software. In the final application this means being able to write smart configuration files, which can interact with the host application and do not merely contain a simple list of property/value pairs; this also helps to enhance the flexibility of the whole system. Moreover, we are now able to design test tools that help the development and the improvement of the desired functionalities. This is possible thanks to a well-designed modularization, which enables using some objects in isolation from all the rest, and thanks to a scripting language tightly coupled with C++. It is also noteworthy that those tools can be seen as a proof of concept of how to integrate the developed software into existing applications. This could help to augment the user experience with this kind of device, which was one reason to choose a commercial PDA device, as seen in Chapter 5.

6.2.4 Testability

Testability means being able both to verify that what we implemented is what we intended and to check the effectiveness of the proposed solutions. It has been one of the major concerns throughout the system development, since this is a complex system and it is not reasonable to perform all the tests only at the end of the development cycle. As we have seen in Chapter 2, there are no precise requirements, and hence it is almost impossible to design and develop a system satisfying those goals at the first attempt. In practice several iterations of design, development and test are needed in order to achieve good results. In this respect the traditional software engineering methodologies do not help, since they prescribe a big initial design stage followed by the implementation phase and only at the end the tests; this is what is usually called the waterfall model. They are based on the assumption that everything can be planned at the beginning, and usually this is the longest phase, leaving a marginal role to the remaining stages, and especially to the tests. However they are applied, they are not useful for our purposes, because they are too rigid and do not react to design changes as needed. Better approaches are the ones that try to keep the whole development cycle tightly coupled to the design and test stages, in order to have continuous feedback about the progress and the effectiveness of the proposed solutions. In this way we will have many frequent releases, instead of only one final release, each short on features but working, with the major benefit of having more control on the match between the proposed solution and its goals. These are called agile methodologies [99], and they focus the attention on the capability to adapt the work in progress to what is actually needed rather than to what was planned at the beginning (usually over a long term). Too often, during the development phase, the requirements change, some viable solutions fail, under-estimated issues arise and other choices show their weakness or, worse, their flaws, thus making it difficult to modify a full-blown monolithic model.

Working on a short term, with short and well-defined deadlines, enables addressing all the aforementioned possible questions as they arise. If the planning is done only once in advance, we still have to predict every corner case, with the risk of having an over-engineered system, too fat, complex and not flexible enough to be adapted at a later stage. Those statements can be true in various engineering fields, but in particular in the software one, where initial requirements are usually general needs rather than accurate requests. Besides, in this area there is a peculiarity with respect to all the others: the developers are not simply "program writers" and can contribute significantly to a well-done system design, and, on the other hand, the designers should also deal with almost all the development aspects [99]. This leads to the conclusion that development and design are two faces of the same coin, not one the consequence of the other.

As outlined above, agile development is a new way of managing the whole system development, and there are actually various methodologies following those ideas. One of the most applied is the one named eXtreme Programming (XP, see [100] for a brief introduction), based on five values: the quality of the communication between all the involved people, the simplicity of the proposed solutions, a continuous feedback to ensure the match between the desired results and the actual situation, the courage needed to make big changes without breaking the whole system, and the respect among all the team members. There are various principles deriving from those values, and these are put to work through a series of practices (divided into primary and corollary ones). Among all the others, incremental design and test-first programming play central roles: the former has already been addressed thanks to the flexibility and modularity discussed previously, and the latter is the basis for test-driven development (see [93, 94]). In the traditional waterfall model, the test phase is the last one, often a misunderstood mix of a debug and/or review process and a Quality Assurance (QA) validation. It only checks the already full-blown system, and hence it is very difficult to properly identify and isolate weaknesses or erratic behaviours. TDD has been proposed to overcome those issues and to make the whole development process smoother by placing the tests at the center of the design step. First we have to write a test, then implement the code that passes that test. The idea behind this practice is that we have first to answer the questions "How should the code be used?" and "How should it work?", and then to implement the code that satisfies those answers. This helps to correctly design components, according to the well-defined responsibility discussed previously. It also enforces the simplicity of the design (the already mentioned KISS principle), because we implement only what we need, not what we think would be useful, as happens in the initial design stage of the waterfall model: it is always easier to add a feature than to remove a problematic or useless one. TDD contributes to each of the five values of extreme programming seen above: tests can be seen as "living documentation", since they show how the code should be used and what its purpose is, thus being an important point in the communication; it promotes simplicity, since it leads to implementing minimal solutions; feedback is obtained by continuously running the tests; courage is increased because we can always know when breaking changes arise, thus making the introduction of new features or the optimization of the existing ones safer; in the end, respect is enhanced since the roles of all the involved people have the same importance in the process, no matter whether they are testers, developers or designers. In synthesis, TDD, as a part of XP, is a powerful tool to develop a well-modularized, flexible and extensible system, enhancing the whole code quality, with fewer bugs since they are tackled as soon as they come out.

Although it seems that TDD is the answer to almost every question in the software engineering field, we have to remember that it is a practice rather than a formal method, and various issues have to be taken into account. First of all, not everything is well suited to being tested, for example GUIs (Graphical User Interfaces) or the so-called AI (Artificial Intelligence) algorithms. Another important point is that TDD itself does not guarantee anything, because a badly designed test will lead to badly designed code, thus shifting the responsibility for the quality of the product from the pure development stage to the test one. Starting from this statement, we can see that first we have to write failing tests, and only afterwards implement the functional code. Ideally we should write at least one test for every function we have (more than one is preferable, to avoid obvious implementations and to check different cases of the same feature), and this leads to the coverage issue, i.e. each line of the software should ideally be addressed by at least one test. Tests have to be run often, after every little change and before any commit or release, in order to always deliver working code. All the tests should be collected in a program separated from the main application we are developing; anyway, both have to be implemented side by side. We have to take care not only of positive tests, but also of negative ones, i.e. to state both when the functions should work and when they should not. The aim is also to stress their tolerance to unexpected input data, and to deal with exceptions, because the system has to be robust in case of unexpected or unspecified behaviours.

We have basically three types of tests:

Unit tests where each component is verified alone, independently of all the others;

Integration tests which are used to check the interactions between objects;

System/acceptance tests which involve the whole system together, in order to validate the whole application against the given requirements.

For each of the above items we can design a test bench application, divided into suites, each one dealing with a particular aspect of this process. Each suite comprises several test groups: these, also called fixtures, are useful when a set of tests shares the same setup and tear-down code (the former is the code needed to arrange the tests and the latter the code that cleans up after their execution). In any case, we have to focus the attention on only one aspect per test, using only one assertion. These assertions usually imply the internal inspection of the state of the component being tested; anyway, this poses some questions, since a correct object-oriented design always promotes so-called information hiding (a class does not have to show any internal detail). This can make it difficult both to set an object in the desired condition before testing it, and then to check whether a member function has done the right thing. If we strictly adhere to that guideline, we will be able to do only black-box tests, i.e. we manage a component only through its public features: in this way we test only its "external" behaviour, in the same way as it will be used in the client code, but we know almost nothing about its internal working. On the opposite side we have white-box tests, which enable the tester to access every detail (both internal and external) of the object under test, thus leaving the greatest freedom of verifying every aspect. This seems the best solution, but it is not always desirable, since it can be misleading to arbitrarily modify the internal state; moreover, as we modify the class implementation, those tests will no longer be valid (and perhaps not even compilable). An intermediate way is the use of glass-box tests, where the internal state can be observed but not changed: in this way we use the component only through its official interface but, as a last resort, we can still look inside to verify some conditions. From these considerations, we can see that not using a white-box methodology helps to design the components well, also with respect to the modularity issues explained in the previous section. Indeed black-box tests should be preferred, since this approach is the most rigorous and the closest to the way in which the components will be used in the final application. This choice requires more attention in designing the setup procedure of the tests (since we do not have direct access to the internal component state) and in verifying the results, so in some rare cases we resorted to the glass-box approach, even if only for debug purposes.

In the software development all the considerations discussed above have been addressed through a set of three different applications, developed and maintained in parallel with the main one. The first one has unit-test functionalities, and for these purposes there are several frameworks available to make this task easier. They can be compared, for example, upon the amount of work needed to add new tests (sometimes they are equipped with external parser tools to make this task automatic), the ease of modifying and porting them to new platforms (in our case we manage at least two platforms, the target one and the development one), the support for fixtures and how they handle crashes and exceptions (intentional and unintentional, as seen before). Other important features are the variety and the quality of the assert functions (especially to compare floating-point data, which often needs special attention, as reported in [101, 102]), the overall flexibility in organizing tests into suites and in supporting different output types, and, in the end, their compliance with the C++ language (we use advanced features, and the test application has to support them). After a brief comparison, the chosen framework is the one named TUT (C++ Template Unit-Test framework [103]), because it offers all the essential functionalities without any frills in a small, clean and portable pure C++ implementation, easy to use, maintain and extend, and so flexible that it is well suited for different types of test applications.

TDD prescribes writing at least one test for each implemented feature, but it is almost impossible to cover every possible condition that the application could encounter, so the purpose of this test software is reduced to guaranteeing that all the basic functionalities work as expected, since they are mostly the basic building blocks on which the whole final application is built. These are mainly the classes and functions involved in the first stages of the elaboration process (see Section 3.2 for further details), where hvImage<> and hvRLEImage play a central role. The first one is a template used to manage all the bitmap images and to perform all the related operations. The second one extracts and holds a set of high-level features, to be used in the successive steps, from the raw image. The adopted strategy is to prefer, whenever possible, black-box tests, because they do not depend on the actual implementation. This also enables verifying the soundness of the optimizations and the coherence of the results across different platforms. In the former case, as we will see later, the implementation needed to be radically changed to meet the new run-time performance requirements, thus making both white-box and glass-box tests unusable. The same statement is also valid when dealing with different platforms, because some internal details could be slightly different. An optimized function is very difficult to debug because, to achieve the desired results, it is sometimes necessary to sacrifice the clarity or the simplicity of the algorithm. We need an instrument to check its conformance to the original purpose of that routine: if the tests are well designed, they can also solve this task, thus helping to ensure both the desired correctness and performances. The same is true when coping with portability issues across different platforms: as already stated, we would like to have a common infrastructure, but we have to guarantee the same results everywhere; since the tests are always the same (they rely only on the object interface), they are a good way to assure this coherence. This last statement has to be verified also for third-party libraries that have been ported, especially for the ones that abstract low-level system-specific functionalities. As a final remark, this unit-test application has been used for debug purposes as well: each time a "weird" behaviour has been found elsewhere, its cause has first been isolated from everything else, and then the bug has been exposed through a specific test and hence fixed. This practice helps to deliver good code quality but also to prevent future problems, since the test remains in the test program and will be run every time.
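A black-box test of this kind could look like the following sketch; it is a hand-written check rather than the actual TUT test code, and the hvImage operations used here (Load, ConvertTo, Width, Height) are assumptions made for illustration:

#include <iostream>

// Verifies, through the public interface only, that loading and converting an
// image preserves its size: the test knows nothing about the internal storage.
bool TestImageConversionKeepsSize()
{
    hvImage<hvBGRA> src;
    if (!src.Load("fixtures/sample.bmp"))      // assumed I/O method
        return false;

    hvImage<unsigned char> grey;
    src.ConvertTo(grey);                       // assumed conversion method

    return grey.Width() == src.Width() && grey.Height() == src.Height();
}

int main()
{
    const bool ok = TestImageConversionKeepsSize();
    std::cout << (ok ? "PASS" : "FAIL") << std::endl;
    return ok ? 0 : 1;
}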

Figure 6.2: Off-line comparison tool diagram

The outlined test strategy is not so useful for every part of the designed software, especially when testing the components involved in the successive phases of the data elaboration process. An example is the skew detection routines (see Section 3.2 for further information): it is not important to compute the image rotation exactly, as long as the software is able to recognize characters. It is clear that this concept is almost impossible to express in a precise and scientific way without involving the whole system. In these cases it is more useful to test how two different versions of the same function behave on the same input data, in order to have a clear idea of how to improve the system. This method is useful to perform the integration tests, i.e. to verify how the objects interact with each other. Again, the main application cannot be used for this purpose, since its input comes from an external camera and hence it is difficult to reproduce the tests, a fundamental requirement. Thus the idea (shown in Fig. 6.2) is to develop an off-line application that takes as input a pre-recorded video file, used as a reference, and then, for each frame, compares the results of the computations. This is the leading idea, but in practice for every test session we would like to examine different aspects of the same data process, so it is feasible to write neither a single full-fledged comprehensive application nor a set of several programs each with a particular aim, because they would be very difficult to maintain and not as flexible as desired (especially if implemented in C++, with its write-compile-run development cycle). It would be preferable to rely on a small set of dynamic scripts, easy to write and to adapt to particular needs. As we have seen, we have used a dynamic scripting language, named Lua, to accomplish this task: we exported all the needed components to this environment, thus being able to reproduce the same run-time scenario as the final application. This has the major advantage of allowing us to fully exploit the flexibility, the modularity and the extensibility seen before. Hence we were able to rapidly write effective programs that run at the same time two different versions (or configurations) of the same components, which receive the same input data and whose output results are compared. Typically the input data is a stream of images stored in a custom video input file, and the software has to report all the significant differences between the results obtained by the two components under test on a frame-by-frame basis. This has been possible thanks to the correct system partitioning and to a well-designed object-oriented framework, which have led to a set of components independent of the surrounding environment, in particular of the input image source. With this solution, each time we see an erratic behaviour in the main application, we are able to record the input data stream, deeply analyze afterwards the causes of such results and improve the involved functions. Clearly, every time we implement a new feature we have to ensure that everything else is still working correctly, and this again can be accomplished with this solution. Furthermore, it can be used to measure the performance improvements, as we will see later. The strategy adopted for this second test tool is the same as for the previous one, but taken to a higher level.
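The core of such a comparison script can be sketched as follows. The actual tools are Lua scripts driving the exported classes; the loop below renders the same logic in C++ for illustration, and the video-reader and comparison calls are assumptions:

// Feeds every frame of a pre-recorded video to two engine instances (e.g. the
// reference configuration and the candidate one) and reports the differences.
void CompareEngines(hvEngine &reference, hvEngine &candidate,
                    const char *videoFileName)
{
    hvVideoReader video(videoFileName);        // assumed reader for the custom format
    hvImageBGRA frame;                         // assumed BGRA bitmap type
    int frameIndex = 0;

    while (video.ReadFrame(frame))             // assumed per-frame access
    {
        reference.Feed(frame);  reference.Process();
        candidate.Feed(frame);  candidate.Process();

        RawSnapshotData refData, candData;
        reference.TakeSnapshot();  reference.CopySnapshot(refData);
        candidate.TakeSnapshot();  candidate.CopySnapshot(candData);

        ReportDifferences(frameIndex, refData, candData);  // assumed comparison/report
        ++frameIndex;
    }
}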

The third and last application developed for testability purposes is directly integrated into the main application, but only on the development platform. Its purpose
is similar to the previous one, but it is designed for an on-line use where we have to evaluate the performance of new features, or of their improvements, from a user-perception standpoint. According to this view, it is useless to compare a raw data structure representing the internal object state as is done by the previous tool; it is better to see how the input data are processed in a graphical way, through both an intermediate elaborated image and a set of visual elements. An example is the evaluation of the tracking functionalities (see Section 3.2 for further reference): it is not possible to state a priori whether a solution is more effective than another one. In this way the developer has an immediate and clear feedback about the overall working of the system, and this can be seen as the system/acceptance tests outlined above. With this application it is possible both to record the input data stream and to use it in subsequent test sessions, in order to have feedback about the development progress. This last topic is addressed by the ability to run two different core engines at the same time, with either the same or different configurations. The second, optional, engine is loaded and managed through a Lua script (the other one is the official main engine) and hence it is fully customizable. This means that the only requirement is to implement a basic interface, but it could be designed with a completely different approach, not related at all to the current solution presented here. In practice this feature is used to compare two different setups of the engine, in order to give a first idea about what to improve, what the effects of some configuration parameters are and so on, giving some advice on how to design further evaluations.

In conclusion, we have seen that testability plays one of the most important roles in the software development stage of the whole project, since it provides the proper instruments and methodology to enable the developer to get feedback and to verify that the developed software is working as expected. Moreover, these tools also help to understand and improve the quality of the whole design, since they enforce the modularity, stress the flexibility and rely upon the extensibility of the whole framework. This aim has been achieved through a set of three different tools, each one addressing a particular aspect of the whole system.

6.2.5 Efficiency

In Chapters 4 and 5 we discussed the choice between two hardware platforms, and we evaluated their computational power. In the end we preferred to use an SBC based on the Marvell PXA270 CPU, with the main consequence of paying particular attention to code optimizations in order to achieve the desired performance. This is a delicate issue in every project; in general we have to avoid both premature optimization and premature pessimization [91]. Anyway it is clear that we have to face these issues from the beginning, in order to have an almost always fully working system, following the extreme-programming practices previously outlined. It is important to have a
continuous feedback on the work in progress about the overall data processing flow. This heavily impacts the whole development cycle, since otherwise we would not be able to perform on-line tests with the main application, possibly recording the input data (when the system behaves incorrectly) to analyze it later. Without this requirement we would have an unusable system until the end of the development process, with the risk of making wrong decisions (e.g. about which algorithm to use or how to implement a function) during the design cycle, mistakes that are difficult to overcome later.

There are three levels of possible performance improvements:

• Design enhancements

• Development/implementation improvements

• Low-level optimizations.

The first optimization is to properly choose the algorithms, with respect both to their complexity and to the type of data they handle. Dealing with images, this aspect is very important, since almost all the image routines operate on the bitmap representation, thus making their complexity proportional to its size, i.e. the number of pixels. Moreover, when using filters that need a “sliding window” (a common case), the computational cost is the image size multiplied by the size of such filters. A first solution is to prefer separable kernels whenever possible (see [29, 30] for further details), reducing the needed operations. Another way is to fuse some functions together, even if at a first glance this seems not to improve the overall performance. The complexity is still the same, but in practice such a solution leads to a speedup, because it considerably reduces the amount of memory accesses, a fundamental concern as we will see later. It is only applicable when two routines work on the same input data and the second one does not need to wait for the completion of the first one. This is true for many image filters, where different kernels can be combined together, but also when we perform the binarization: we can fuse the conversion from color pixels to greyscale ones and the successive thresholding into one unique function. In this way we increase the coupling between functions, but once this image pre-processing step is well established it is no longer a problem. In any case we can always re-write such a macro-function as a composition of other routines, with the aim of exploring different solutions, with the sole requirement of implementing the same interface as the former. This is possible only if the system partitioning into sub-units has been properly done, as discussed previously.
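As a minimal sketch of this function-fusion idea, the routine below converts BGRA pixels to greyscale and thresholds them in a single pass, so each input pixel is read from memory only once. The function name and the fixed threshold are illustrative assumptions: the actual hvImgBin<> implementation uses a position-dependent threshold and SIMD instructions, as discussed later.

    #include <cstdint>

    // Fused greyscale conversion + thresholding: one pass over the BGRA input.
    void greyAndThreshold(const uint8_t *bgra, uint8_t *bin,
                          int width, int height, uint8_t threshold)
    {
        const int n = width * height;
        for (int i = 0; i < n; ++i) {
            const uint8_t *p = bgra + 4 * i;            // B, G, R, A
            // simple unweighted average of the colour components
            uint8_t grey = static_cast<uint8_t>((p[0] + p[1] + p[2]) / 3);
            bin[i] = (grey < threshold) ? 1 : 0;        // 1 = "ink" pixel
        }
    }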

Another decision taken in the design phase is to prefer high-level data structures, which should also be lighter to process, instead of the raw data. An example, as already explained in [5], is the use of an RLE image, which has lower resource requirements than a bitmap one, as the base image data structure in the main process
function of the core engine. Following the same idea, all the subsequent steps work on a set of objects, each one representing a graphic entity in the image, with simple and small structures (basically the coordinates of the surrounding rectangle). This considerably reduces the amount of computation needed: as we have seen in Section 3.2, a method based on the nearest-neighbourhood approach, derived from the ideas presented in [104], is much faster than others, such as the projection profile analysis (for a brief survey see [105]).
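A possible shape for such lightweight structures is sketched below; it only illustrates the idea of storing horizontal runs of black pixels and bounding rectangles instead of full bitmaps, and is not the project's actual RLE image or entity classes.

    #include <vector>

    struct Run {
        int start;     // first column of the run
        int length;    // number of consecutive black pixels
    };

    struct RleImage {
        int width  = 0;
        int height = 0;
        std::vector<std::vector<Run>> rows;   // one vector of runs per image row
    };

    struct BoundingBox {                       // the "small structure" each graphic
        int left, top, right, bottom;          // entity is reduced to
    };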

A final consideration about the possible optimizations that can be made at the design level is to execute the required computations only when needed. In Section 3.2 we have seen how the entire data flow is processed and, at a first glance, one could think that all those steps should be performed for every frame of the input data stream and that, for each of them, all the symbols inside should be recognized. This is a demanding task for a mobile platform, difficult to achieve when there are time constraints, established to give the user a natural feeling while reading (see Section 2.1). As an example, in the final implementation, on the Colibri module examined in Chapter 5, the character recognition step takes about 57% of the whole process time (excluding the image pre-processing phase). This is, on average, 137 ms and it would lead to a frame rate of about 7 fps (with all the other optimizations already turned on). Nevertheless, not all the images have to be processed in the same way, since two consecutive ones will presumably be very similar and, moreover, the central character (the one we are focusing on) will be the same. Once identified, a recognition is no longer needed until a new symbol takes the place of the former. Thanks to this trick (extensively applied also to other functions), we are thus able to reach an average performance of 13.8 fps. It is true that once a full recognition is required the time to process the frame is still high, but in this way we will have at most a little glitch only in that case, while on all the other frames the software will follow the user movements (almost impossible if it takes too much time between one frame and the next).
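The sketch below illustrates this lazy-recognition policy: the expensive classifier is invoked only when the tracked symbol under the device centre changes, otherwise the cached result is reused. The class names and the integer symbol id are hypothetical simplifications of the real tracking data.

    struct Classifier {
        char recognize(int /*symbolId*/) { return '?'; }   // stub for the costly ANN call
    };

    class CentralSymbolReader {
        int  lastSymbolId = -1;
        char lastResult   = '\0';
    public:
        char onFrame(int trackedSymbolId, Classifier &ann) {
            if (trackedSymbolId != lastSymbolId) {          // a new glyph is under focus
                lastResult   = ann.recognize(trackedSymbolId);
                lastSymbolId = trackedSymbolId;
            }
            return lastResult;                              // cached on all other frames
        }
    };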

Once the algorithms have been defined, the next step is to look for a smart implementation. Running a code profiler on the host machine, we have seen that one of the most time-consuming operations is memory allocation on the heap. Usually the technique to overcome this phenomenon is to avoid dynamic allocation, resorting only to the static one. This means knowing at compile time the maximum amount of space needed for each data structure, because the needed stack space has to be allocated in advance. Since we cannot risk running short of memory (with the possible consequence of a buffer overrun), we would have to over-estimate the resource requirements according to the worst case, with the concrete possibility of running out of stack, thus crashing the system9. Conversely, with dynamic allocation, only
the needed space would be allocated, thus having more chances of not incurring in an insufficient memory problem. Anyway it is an expensive operation; a good trade-off in such cases is to resort to more sophisticated allocators able to pool the memory. We request from the OS more memory than needed for the actual object, so that a future allocation for a new object does not require another request. At the same time, when an object is destroyed, its space is not released but retained for future use. This technique is especially useful when managing a lot of small elements with a short life: we both increase the allocation performance and reduce the memory fragmentation. This idea is the basis of the so-called incremental garbage collector used by the run-time of some dynamic languages (Lua is an example), but also of the default allocators used in the STL (the standard C++ library [61]). This is one important reason to rely on a standard C++ environment, since we have this benefit at no cost.

9 In such cases it may be that the software could not even be loaded, because all the memory would be allocated at the beginning, and if there is not enough space the execution will abort. In any case we have to remember that under WinCE there is a memory limit of 32 MB per process.
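A minimal sketch of such a pooling allocator is shown below: released blocks are kept in a free list instead of being returned to the OS, so most allocations never touch the system allocator. This is only an illustration of the idea, not the STL or Lua allocator actually used by the project.

    #include <cstdlib>

    class FixedPool {
        struct Node { Node *next; };
        Node  *freeList = nullptr;
        size_t blockSize;
    public:
        explicit FixedPool(size_t size)
            : blockSize(size < sizeof(Node) ? sizeof(Node) : size) {}

        void *allocate() {
            if (freeList) {                    // reuse a previously released block
                Node *n = freeList;
                freeList = n->next;
                return n;
            }
            return std::malloc(blockSize);     // fall back to the system allocator
        }
        void release(void *p) {                // keep the block for future use
            Node *n = static_cast<Node *>(p);
            n->next = freeList;
            freeList = n;
        }
    };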

Anyway, in the STL there is a separate memory manager for each instance of a container, so once the container is destroyed all the associated memory is freed as well. In our case we run the same functions every time on a stream of images: if those data structures are all declared locally, they will be instantiated and destroyed each time, thus nullifying the aforementioned benefits. Hence the idea of caching such objects: if the instance of a container (not its data, which will be replaced each time) is re-used across multiple frames, the associated allocator will no longer request any memory from the OS, thus avoiding an expensive operation. A clear example is the RLE image object: it is true that it requires less space than a bitmap, but this depends on the actual image: for each row it uses a vector (in the actual implementation) to store all the line elements in that row. If we create one new object of that type for each frame, we will have many allocations each time; instead, if we re-use the old object, its internal allocator will still hold the references to the previously used memory and will enable the component to recycle them. This workaround is not about the implementation of the RLE image, but about how this component is used. In practice, this means moving the declaration of such an object from the scope of the function where it is used to the scope of the class that function belongs to (the same reasoning can be applied to other cases too). Of course we have to make sure that no other routine manipulates it, since the goal of this choice is not to share the component, but to recycle it across multiple runs. The same caching technique can be used every time we need to compute a large set of constants that are the results of a particular mathematical function. On a device with a small amount of memory it would be better to calculate them each time they are needed, but this is counterproductive in our case, because those operations could slow down the system. A concrete example is the binarization routine: as we have seen in Section 3.2, the threshold is not uniform across all the
image and it is a function of some configuration parameters. If we computed those reference values (which depend on the pixel position) every time, we would waste CPU time obtaining the same result over and over (except when the configuration is changed). The idea is to cache those results, thus substituting a memory access for the execution of such a function, leading to better run-time performance. Clearly this trick is useful only if the computation takes more time than a memory access.
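The following sketch combines the two ideas just described: per-frame containers are class members that are cleared (keeping their capacity and pooled memory) instead of being re-created, and the threshold table is recomputed only when the configuration changes. Class and member names are illustrative assumptions, not the project's actual ones.

    #include <cstdint>
    #include <vector>

    class FrameProcessor {
        std::vector<int>     runBuffer;        // reused: cleared, never re-created
        std::vector<uint8_t> thresholdTable;   // recomputed only on config change
        bool configChanged = true;

        void rebuildThresholds() {
            // placeholder for the expensive, position-dependent computation
            thresholdTable.assign(256, 128);
        }
    public:
        void processFrame(const uint8_t *pixels, int n) {
            if (configChanged) {               // pay the cost only when needed
                rebuildThresholds();
                configChanged = false;
            }
            runBuffer.clear();                 // capacity (and pooled memory) kept
            for (int i = 0; i < n; ++i)
                if (pixels[i] < 128)           // placeholder test; the real code
                    runBuffer.push_back(i);    // would use thresholdTable here
        }
    };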

Another useful instrument for implementing optimizations is, again, C++ templates, since with all the techniques previously described we can move a lot of checks and decisions from run time to compile time, thus enabling the compiler to properly choose the most suitable functions. A clear example is the macro-function that takes the input color image and gives as output the binarized one: its declaration relies upon templates, so we have a general implementation for any type of image, but when the input pixel type is of the form BGRA (stating the color order and the overall pixel size) a special function is used, which, in the actual implementation, takes advantage of the SIMD instructions outlined in Section 5.1. Another example is the opportunity of switching from the floating-point arithmetic to the fixed-point one in a simple way, without modifying any routine, through a simple preprocessor directive defined at the top project level. This is in turn possible since all the functions have been parametrized on the numeric type, and once it is defined with all its properties (through the type-traits technique), it is pretty straightforward to switch between these arithmetics and test the effects of the improvements.
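Both compile-time mechanisms can be sketched as follows; the PixelBGRA type, the placeholder loop and the USE_FIXED_POINT macro name are assumptions made for illustration and do not correspond to the project's actual definitions.

    #include <cstdint>

    struct PixelBGRA { uint8_t b, g, r, a; };

    template <typename Pixel>
    void binarize(const Pixel *in, uint8_t *out, int n);   // generic implementation

    template <>
    void binarize<PixelBGRA>(const PixelBGRA *in, uint8_t *out, int n)
    {
        // specialized version: in the real code this is where the SIMD path
        // lives; here it is only a placeholder scalar loop
        for (int i = 0; i < n; ++i)
            out[i] = (in[i].r + in[i].g + in[i].b) / 3 < 128 ? 1 : 0;
    }

    // one switch, defined at the top project level, selects the arithmetic
    #ifdef USE_FIXED_POINT
    typedef fxp::i32q16 real_t;   // custom fixed-point type (sketched later)
    #else
    typedef float       real_t;
    #endif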

Besides the optimizations made at the design and development level, we still have the opportunity of exploiting a set of low-level optimizations; most of them rely upon knowledge of the specific underlying architecture. A good starting point is [64, 106], where useful advice is given on writing C/C++ code that can be well optimized by the compiler. Another useful resource is [107], even if it has not been specifically written for our target platform. It is a good overview of possible bottlenecks and gives some advice on how C++ constructs are concretely handled by a compiler and how they impact performance. More on this last topic can be found in [60], while other useful hints on using C++ for this kind of software are outlined in [79]. We refer to all these practices as low-level ones since most of them are related to the way of coding rather than to design and/or development concerns. There is no theory behind them, only some practical considerations aimed at helping the compiler to automatically exploit new optimization opportunities, possibly enabled by specific directives. In any case all of them have been taken into account, whenever useful, when defining the class hierarchy or when dealing with the flexibility. Most of those “distributed micro-optimizations” are an attempt to take advantage of all the hardware features exposed in Section 5.1, sometimes resorting to a reverse engineering process in order to verify some hypotheses. Those coding practices have been adopted since the early
stages of the development process, to avoid refactoring the whole code base later only to implement them. For this reason the improvements they bring cannot be directly measured in the final application: there is no non-optimized version to compare with. A very basic example is the heavy use of the code inlining feature of the C++ language, which directly “injects” the code of a subroutine into the calling function, thus avoiding the cost of a procedure call. This cost can be overwhelming for a small function: it is not only the cost of a branch execution, but also that of the prologue/epilogue of the subroutine, i.e. the instructions for saving/returning the parameters, variables and so on. With this technique we preserve both the type safety and the desired performance. The alternative would be to use preprocessor macros, but these simply operate as a source code replacement, without performing any control on the semantics.
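A tiny, illustrative comparison of the two approaches: both remove the call overhead for a trivial operation, but only the inline function is type-checked by the compiler.

    #include <cstdint>

    #define CLAMP_MACRO(x) ((x) > 255 ? 255 : (x))       // plain text substitution

    inline uint8_t clampToByte(int x)                     // type-safe, still inlined
    {
        return static_cast<uint8_t>(x > 255 ? 255 : x);
    }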

The above example is both a low-level issue and a development concern: in comparison to the alternatives it promotes code quality and, with the developer’s support, it can be exploited by the compiler itself. Anyway, there are other optimizations that cannot be handled at all by the development tools: these are, in our case, the opportunity of using the already discussed (see Section 5.1) SIMD instructions, the switch from the floating-point arithmetic to the fixed-point one, and some memory use enhancements. All these topics will be further examined in the following. For each of them extensive tests have been made in order to measure the improvements; this has been possible thanks to the same software developed for testability purposes, which enables running two different core engines at the same time on a set of custom video reference files. All the tests have been made both on the code already optimized by the compiler (with all the useful directives turned on) and on the non-optimized one: in this way we can get an idea both of how these techniques further increase the performance (beyond the usual optimizations) and of how these development/design strategies affect a plain implementation, independently of all the possible compiler optimizations.

All the reported results have been measured on the target platform, i.e. the Colibri module (presented in Section 5.1), equipped with an Intel PXA270 CPU running at 520 MHz. Often we will refer to the “feed time” and to the “process time”, which are the execution times of the two distinct routines used to elaborate all the data (as explained previously). The former is hvEngine::Feed, which takes as input the source image and performs all the image pre-processing functions, while the latter is hvEngine::Process, which does all the rest. They have different characteristics and hence not all the optimizations will have the same impact on them, so we will make a distinction between them to have a more detailed analysis. Furthermore, for the process time it will be pointed out when a recognition is needed (not on all the frames, as seen before), because the performance results will differ. For each kind of optimization, two tables will be presented, each one reporting two sets of data: the former are the results with the code not optimized by the compiler (with
the /Od option), the latter with all the compiler optimizations activated (generically denoted by /O2, even if it is not the only directive used). The build name will be one of the following:

Base the basic configuration, without using any of the described techniques

Mem the first improvement, addressing the memory access concerns

Simd the next step, taking advantage of the SIMD coprocessor available on the PXA270 CPU

Fxp the CPU does not have any FPU, so it is better to use the fixed-point arithmetic.

Each one is an improvement over the previous one, so the SIMD build also includes the memory optimizations, and the fixed-point one includes both of them. This is the order in which they have been developed, but in practice they are independent of each other and can be enabled separately.

For each set of results presented in the next sections, the first table reports the total average time (always measured in ms), the average feed time (referring to the Feed routine) and the average process time (relative to the Process function). A second table reports the details about the Process function, both when a recognition is needed (“Recognition”) and when it is not performed (“NoRecognition”). The main difference between them is the involvement of the ANN subsystem, which is loaded as an external DLL (Dynamic Link Library) and has been compiled with all the optimizations turned on, independently of the ones used for the main application. This makes the difference between these two results (Recognition vs NoRecognition) almost constant across all the tests, and it should only be a function of the complexity of the ANN topology. For each figure, both the average measured time and the speedup brought by the specific optimization are reported.

Table 6.1: Effects of the compiler options on the application performances. All timings are in ms.

Build        Total                Feed                 Process
             Time       Speedup   Time      Speedup    Time       Speedup
base (/Od)   1840.73    —         286.88    —          1553.84    —
base (/O2)   186.84     +885%     19.79     +1350%     167.05     +830%
fxp (/Od)    1060.68    —         15.98     —          1044.7     —
fxp (/O2)    72.37      +1366%    4.07      +292%      68.29      +1430%

Table 6.2: Detailed view of the effects of the compiler options on the Process function. All timings are in ms.

Build        Process              Recognition          NoRecognition
             Time       Speedup   Time      Speedup    Time       Speedup
base (/Od)   1553.84    —         3092.3    —          1336.22    —
base (/O2)   167.05     +885%     516.64    +499%      117.6      +1036%
fxp (/Od)    1044.7     —         1882.05   —          925.74     —
fxp (/O2)    68.29      +1430%    136.64    +1277%     58.58      +1480%

Before starting to discuss each type of optimization, we have to note that the most important one is to properly set the compiler options, as Tables 6.1 and 6.2 show. In any case the developer alone cannot do better than the compiler, but he can help the compiler do its job better. Indeed we can see that with all the optimizations implemented (the fxp build) the average time decreased considerably: see for example the total speedups in Table 6.1 and the recognition process one in Table 6.2. This leads to the conclusion that the techniques used not only improved the system performance by themselves, but also offered the additional benefit of making the automatic optimization task done by the development tools easier. Anyway there is an exception: in the case of the feed time, almost all the speedup is given by the adopted techniques. This is because most of them are not achievable by the compiler itself, because it does not know the purpose and the context of those elaborations; it only knows what the application has to do. It is difficult to automatically vectorize the code or to guess the best memory alignment without knowing the requirements. In conclusion, we generally expect that, when all the compiler options are turned on, the performance improvements brought by our optimizations will be higher.

One last remark in this regard is about the compiler used: initially the idea was to use the Intel CE C++ Compiler10, in the strong belief that it could generate more optimized code. According to the documentation [108] it should allow a fine-grained control on this phase and also some advanced features, like code vectorization. In practice, such a compiler was often unable even to compile some code fragments (usually the ones which make heavy use of advanced C++ features), emitting an obscure internal error message, especially when trying to turn on some advanced optimization directives. This has led to reconsidering the compiler choice,
resorting to the common Microsoft compiler [109]; it is able to optimize for the XScale architecture, although it requires some manual settings. Since it was not clear how and when the language intrinsics11 were supported for such an architecture, we developed some small test executables and, thanks to a reverse-engineering analysis, we understood how to write more efficient C++ code exploiting these aspects.

10 As already stated in Chapter 5, Intel designed the XScale architecture and, alongside the hardware products, it also developed a compiler, to be integrated into the MS Visual Studio solutions, in order to fully take advantage of this architecture.

11 The language intrinsics are special constructs that, from the developer’s standpoint, are regular functions, but for which the compiler generates a specific assembly instruction; in practice, they are the most comfortable way to take advantage of CPU-specific features, such as the SIMD coprocessor.

Memory access enhancements

One of the first useful optimizations was surely the one related to memory access since, in the first steps of the application data flow (outlined in Section 3.2), large data structures, i.e. the input bitmap images, are processed. In those cases the main bottleneck is the access to the main memory to read/write the data, and hence they are called “memory-bound” functions, in contrast with other operations which are more affected by the raw computational power of the CPU and hence are called “CPU-bound” routines. As we have seen in Section 5.1, the PXA270 has 32 kB of L1 data cache, useful for memory-intensive applications; anyway, in order to fully take advantage of this opportunity, we have to implement data types so that the cache thrashing phenomenon is reduced. When a cache line is filled with data coming from the memory, the CPU reads 32 B with an alignment of 32 (i.e. the starting address of the data block is a multiple of 32); if we follow this constraint we are able to use fewer cache lines and to minimize the loads/stores from/to the main memory, improving the whole performance. In addition, we could use the ARM pre-load instructions to give hints on how the memory will be accessed. This requires developing custom memory allocators, because under WinCE (see the documentation [76, 74]) there is no guarantee on the alignment of the allocations obtained from the C malloc function or its C++ counterpart (the new construct). The only solution is either to use a C function that requests a larger memory block, inside which the desired data can be fitted, or to rely on meta-programming techniques to define basic data types with the desired alignment. These can then be combined to achieve the desired result in a flexible way (see the aligned_ptr class previously mentioned).
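A sketch of the first option (over-allocating with malloc and returning a 32-byte-aligned pointer inside the block) is shown below; the function names are illustrative, and the real project wraps this kind of logic in an aligned_ptr-like class.

    #include <cstdint>
    #include <cstdlib>

    void *allocAligned32(size_t bytes)
    {
        void *raw = std::malloc(bytes + 32 + sizeof(void *));
        if (!raw) return nullptr;
        uintptr_t base    = reinterpret_cast<uintptr_t>(raw) + sizeof(void *);
        uintptr_t aligned = (base + 31) & ~uintptr_t(31);     // round up to 32 B
        reinterpret_cast<void **>(aligned)[-1] = raw;         // remember raw pointer
        return reinterpret_cast<void *>(aligned);
    }

    void freeAligned32(void *p)
    {
        if (p) std::free(reinterpret_cast<void **>(p)[-1]);
    }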

The choice to rely on the Blitz library [89] to manage all the arrays (and hence also the internal representation of bitmaps) has been made to promote code re-use, flexibility and simplicity. Nevertheless it can be a bottleneck, since it does not always allow fully exploiting all the possible optimizations: examining its implementation, we can see that it is more powerful than our needs require, and the cost of these unneeded options is a more complex structure. For example, it is designed to handle a variable number of data dimensions and, in order to still have an effective solution, it maintains a separate data structure that describes how the data is laid
out in memory. In most of the pre-processing functions used, all the pixels are accessed in a sequential way, thus making all that accessory information counterproductive because, to read/write a single element, we first have to access the memory to compute the exact pixel address. In that case a simple address increment would be sufficient (it has to be remarked that the former way is still useful when we have random access to the data). In such cases it is better to bypass the library, because in those situations only the base address and little other information are necessary. Of course we lose the ability to manage sparse matrices, but it is a useless feature for our application. This opportunity has been exploited when designing the hvImage<> class, where the internal implementation behind the public interface can be switched (at compile time) between a simple bridge to the Blitz functions and a low-level custom implementation. The latter is useful to enhance the run-time performance, but it implies making heavy use of C pointers, with the risk of introducing bugs if misused12; on the other hand, the former ensures a run-time control on the memory accesses. Since the two needs (code correctness and run-time performance) are clearly in contrast with each other, and there is no possible trade-off, we can use the previously developed unit test tools to ensure that the second implementation has the same behaviour as the first one, so we can use it safely. In any case, if we suspect a new bug, we still have the possibility of reverting to the safer solution to debug the code, since this option can be set by a directive at compile time.

12 The dilemma about whether or not to use plain C pointers in this kind of data structure is a longstanding issue, with the main pro of having a powerful tool, but also with the con of not being able to ensure the correctness of the resulting code; in some new languages the possibility of using pointers has been removed.

Table 6.3: Effects of the memory optimizations on the application performances. All timings are in ms.

Build        Total                Feed                 Process
             Time       Speedup   Time      Speedup    Time       Speedup
base (/Od)   1840.73    —         286.88    —          1553.84    —
mem (/Od)    1065.95    +73%      34.65     +728%      1031.3     +51%
base (/O2)   186.84     —         19.79     —          167.05     —
mem (/O2)    115.21     +62%      8.8       +125%      106.41     +57%

Table 6.4: Detailed view of the effects of the memory optimizations on the Process function. All timings are in ms.

Build        Process              Recognition          NoRecognition
             Time       Speedup   Time      Speedup    Time       Speedup
base (/Od)   1553.84    —         3092.3    —          1336.22    —
mem (/Od)    1031.3     +51%      2099.55   +47%       880.19     +52%
base (/O2)   167.05     —         516.64    —          117.6      —
mem (/O2)    106.41     +57%      408.86    +26%       63.63      +85%

The performance results obtained with the solutions discussed above can be found in Tables 6.3 and 6.4. We can see that there is generally a substantial improvement, but it is huge for the Feed function (which is in charge of pre-processing the images), thus validating the ideas pointed out previously. The +728% on the original code, obtained only by optimizing the source code and without the compiler's aid, shows the importance of the memory access aspect when developing this kind of application. Considering the same value for the already optimized code we have “only” a +125%: this could be due to the fact that the compiler already takes care of many of these concerns, but a high degree of improvement is still possible. Looking at the Process function, we can see the same trend in the recognition case, while in the other cases it is a good opportunity to speed up its execution.

In summary, the optimization of the memory use is a good opportunity to improve the performance, making the application 1.62 times faster than the original one (in addition to all the automatic compiler optimizations); this is particularly evident for the Feed function, which has gained a speedup factor of 2.25.

SIMD optimizations

Figure 6.3: Conversion of a BGRA image into a greyscale one through SIMD instructions

After having coped with the memory use of the application (the mem build seen in the previous tables), the next step has been to further take advantage of the Intel Wireless MMX extensions [65] available on the PXA270. As already explained in Section 5.1, these are instructions executed on a separate branch of the CPU pipeline and hence in parallel with respect to all the others. They perform vector operations on the operands, these being considered as packed arrays of 8 bytes, 4 half-words, 2 words or 1 double-word13: the first place where we could use them is in the Feed
function, where we process a lot of elements (the pixels of the input image) through a small set of well-defined operations. These can be executed in parallel on each subset of the source data, enabling the computation of more than one element per iteration cycle and substantially increasing the overall throughput of the function. An example of this technique can be seen in Fig. 6.3: each row represents a contiguous line of 8 B in memory (the MSB is the first one on the left), corresponding to one operand of the WMMX operations. In this case we operate on four 64-bit data items (corresponding to 8 pixels) per iteration, and only 14 instructions are required. If we processed the pixels in the usual way, we would load one pixel (in the BGRA format it occupies 32 bits), then extract each component into a different register and then perform the needed computations14, thus using about 8 instructions per pixel15. In comparison, the proposed solution has an instructions-per-pixel ratio of 1.75. The real algorithm implemented in the hvImgBin<> function (again a template C++ function with a specialization for the BGRA pixel type) has been designed to also perform the binarization on those 8 pixels, always in the same iteration, to further speed up the whole execution time. It is a concrete example of the fusion of two algorithms and makes extensive use of data caching to improve the performance. The purpose is twofold: the first is to halve the amount of required memory accesses to the input data, and the second is to avoid computing each time the threshold reference value for each pixel, extending the use of SIMD instructions even to this task.

13 1 word = 32 bits, 1 double-word = 64 bits.

14 In any case the simple average between the color components is not the most correct solution, since they should have different weights. Anyway, for our purpose this simplification does not impact the quality of the results of the image processing functions at all.

15 This is a rough estimate, since it strongly depends on the optimizations made by the compiler, but we could estimate 4 “unpack” and 3 addition instructions plus 1 shift to simulate the average operation; probably, relying on the barrel shifter of the ARM CPUs, we could find a better solution, but still far from the proposed one.

This technique implies partially bypassing the object-oriented architecture of the whole system, since from the implementation standpoint the SIMD C intrinsics work on 64-bit data types, independently of how they are logically interpreted according to the developer’s intentions. This requires using raw C pointers and type casts in order to have compilable code. Sometimes it is also useful to look at the generated intermediate assembly code to verify their effectiveness. This, at a first glance, could open the system to potential subtle bugs, difficult to discover and fix. Again, thanks to all the test tools already developed, we are able to extensively check the correctness of this improved code, since it is only needed to turn on these optimizations in the unit-test application and to run it: the involved components maintain the same interface, the internal implementation is changed, but it has to give the same results. The abstraction layer provided by the usual object-oriented paradigm has been further extended to the SIMD C intrinsics available with the compiler for the
target platform: a set of identical functions has been defined, with the same interface but with different implementations, depending on the underlying hardware. When the application is compiled for the mobile appliance these are re-mapped onto the original intrinsics, and when it is built for the host platform they are replaced by custom functions16 that simulate the original ones. This has been done also to be able to extensively test the optimized functions on the PCs used for development, without resorting each time to debugging the routines on the target system (which takes more time).

16 Although Intel claims [65] that WMMX instructions can be mapped one to one onto their homologous instructions on x86 CPUs, some of them have no counterpart (e.g. the average or accumulate operations), so these would have to be emulated anyway and this would cancel the benefits of using the MMX coprocessor.
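The following sketch illustrates the portability wrapper just described: the same small operation maps to an intrinsic when building for the target and to a plain scalar loop when building for the host PC. The MMX-style intrinsic _mm_adds_pu8 (saturating add of eight unsigned bytes) is assumed here to stand in for its WMMX counterpart, and the macro and type names are illustrative; the project's actual wrappers may differ.

    #include <cstdint>

    #if defined(TARGET_HAS_WMMX)
    #include <mmintrin.h>
    typedef __m64 vec8u8;
    inline vec8u8 addSaturate(vec8u8 a, vec8u8 b) { return _mm_adds_pu8(a, b); }
    #else
    struct vec8u8 { uint8_t v[8]; };                 // host-side emulation
    inline vec8u8 addSaturate(vec8u8 a, vec8u8 b)
    {
        vec8u8 r;
        for (int i = 0; i < 8; ++i) {
            int s = a.v[i] + b.v[i];
            r.v[i] = static_cast<uint8_t>(s > 255 ? 255 : s);
        }
        return r;
    }
    #endif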

Table 6.5: Effects of the SIMD optimizations on the application performances. All timings are in ms.

Build        Total                Feed                 Process
             Time       Speedup   Time      Speedup    Time       Speedup
mem (/Od)    1065.95    —         34.65     —          1031.3     —
simd (/Od)   1065.91    +0%       15.98     +117%      1049.93    -2%
mem (/O2)    115.21     —         8.8       —          106.41     —
simd (/O2)   110.23     +5%       4.27      +106%      105.96     +0%

Table 6.6: Detailed view of the effects of the SIMD optimizations on the Process function. All timings are in ms.

Build        Process              Recognition          NoRecognition
             Time       Speedup   Time      Speedup    Time       Speedup
mem (/Od)    1031.3     —         2099.55   —          880.19     —
simd (/Od)   1049.93    -2%       2129.34   -3%        897.24     -2%
mem (/O2)    106.41     —         408.86    —          63.63      —
simd (/O2)   105.96     +0%       405.7     +1%        63.56      +0%

Tables 6.5 and 6.6 report the improvements obtained with the SIMD optimizations. The results are mixed: in the first table, for the Process function, with the compiler options turned off (the /Od rows), we even have a performance loss of 2%. This is actually explained by the fact that some SIMD functionalities are wrapped by regular C/C++ functions and the cost of the call mechanism overwhelms the benefits they bring. This is confirmed by the
substantial equality among the results obtained when those compiler directives are enabled, which is the real scenario of the final application. In any case the improvements in the Feed function are significant, with a gain of +117% in the first case and +106% in the other. From the above discussion we would expect a gain of about +350% (comparing the instructions-per-pixel ratios with and without the use of the WMMX extensions), but the reality is far from that; it can be explained by observing that the advantage of SIMD operations only concerns the purely computational part of the function. The memory accesses remain roughly the same, and the image pre-processing stage is heavily affected by this issue.

In conclusion, the global performance improvement on the final application is about 5% with the SIMD optimizations. It is not a high value, but this is due to the limited use of the WMMX extensions. They are useful when we have to manage large uniform data structures, where they are able to speed up the already optimized code by a factor of 2.06, as in the Feed routine.

Floating-point vs fixed-point arithmetic

The last meaningful optimization stems from the observation that the PXA270 CPU (like almost all the similar CPUs used in the mobile world) has no FPU (Floating-Point Unit), thus making all those operations emulated in software, and hence slow. The alternative is to use the fixed-point arithmetic which, relying on the integer arithmetic, is considerably simpler, faster and, most important, directly supported by the hardware. Besides these pros, there are also some cons: the most notable ones are the limited precision and range. We therefore have to be sure that the chosen precision and range are compatible with the involved mathematical operations, avoiding both overflows and too rough results. Moreover we have to develop all the functions independently of the actual number representation, maintaining the flexibility with respect to further improvements (e.g. by replacing the actual implementation with a more efficient one). A general implementation of this arithmetic on the ARM architecture can be found in [110], but the proposed solution, although effective, is not well designed, because it uses C macros, which are not type-safe and thus prone to introducing subtle bugs, and requires a clumsy and unnatural syntax. To overcome these issues, we have to define a custom data type, respecting all the object-oriented practices. This has been achieved by implementing i32q16, which is defined on a 32-bit signed word with 16 bits reserved for the fractional part. A new namespace, fxp, has been introduced and, besides the number type, other useful functions have been defined, such as mul<>, coeff<> and num_cast<>. All of them are templates on numeric types (which can be mixed together) in order to select the best specialization depending on their arguments. The first one is useful when we multiply two different operands and the result is of a third different type. The second one properly selects the right constant values at compile time; if
we always specified a constant through a floating-point number, a conversion would be needed at run time, each time the routine is executed, making it slower, even if there is the theoretical possibility of pre-computing such a parameter. The third one is used to explicitly handle numeric conversions. The definition of a new data type implies the use of regular C/C++ functions, but in a plain compilation they would make the system slower, because of their calling cost. For these reasons the compiler options are fundamental for these improvements, more than in other cases. The inability of the Intel CE C++ compiler to compile C++ code that makes heavy use of templates with these optimizations turned on17 has been the main reason to prefer the Microsoft one. Trying to solve these problems and still looking for an efficient solution, an intermediate step has been to manually write the corresponding code in assembly (respecting the ABI, Application Binary Interface [111]), leading to a better knowledge of how the C code is compiled, useful for subsequently developing an effective implementation of this numeric type. The result of this effort is a high-level, type-safe C++ implementation that is as fast as the built-in common integer data type, as reported in Table 6.7. The measurements have been made by executing those operations on a large data set, with different values; the tests have been repeated ten times and the results averaged, in order to have reliable information. In general we can perform operations 20–30 times faster, but we reach a speedup factor of more than 220 on trigonometric functions (used to rotate the image or part of it). The only exception is the division, because in the integer arithmetic too it is done in software18. Anyway, since in any case it is the most expensive among the four basic operations (about 57 times slower than a multiplication, see Table 6.8), whenever possible it has been carefully avoided.
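The names i32q16, fxp, mul<>, coeff<> and num_cast<> come from the thesis, but their implementation is not shown there; the following is only a minimal sketch of how a Q16.16 type in this spirit might look, with the multiplication going through a 64-bit intermediate to avoid overflow. The real type also provides the helper templates and a numeric_limits<> specialization.

    #include <cstdint>

    namespace fxp {

    class i32q16 {
        int32_t raw;                                  // stored value * 2^16
        i32q16(int32_t r, bool) : raw(r) {}           // construct from raw bits
    public:
        i32q16() : raw(0) {}
        i32q16(int v)    : raw(v * 65536) {}
        i32q16(double v) : raw(static_cast<int32_t>(v * 65536.0)) {}

        i32q16 operator+(i32q16 o) const { return i32q16(raw + o.raw, true); }
        i32q16 operator-(i32q16 o) const { return i32q16(raw - o.raw, true); }
        i32q16 operator*(i32q16 o) const {
            int64_t p = static_cast<int64_t>(raw) * o.raw;   // 64-bit intermediate
            return i32q16(static_cast<int32_t>(p >> 16), true);
        }
        double toDouble() const { return raw / 65536.0; }
    };

    } // namespace fxp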

Table 6.7: Performance comparison among different arithmetics. For each operation, all the timings are relative to the i32q16 number type.

        double     float     i32q16    int
mul     19.05      8.69      1.00      1.01
add     31.46      18.42     1.00      0.99
sub     29.64      13.77     1.00      1.00
div     0.45       0.18      1.00      1.00
sin     226.51     —         1.00      —
cos     221.94     —         1.00      —

17 There is no apparent reason for this weird behaviour of the development tools, since the same C++ code was compiled for the host platform without any problem.

18 Since floating-point numbers are represented in an exponential notation, multiplications and divisions are simpler than additions and subtractions.

Table 6.8: Performance comparison among different operations for the i32q16 numeric type. All the timings are relative to the multiplication one.

mul      add      sub      div      sin      cos
1.00     0.42     0.41     57.25    17.98    20.08

The main difficulty when switching from one arithmetic to another is to ensure that everything still works as expected, and again the test tools developed previously have played a central role. Having the possibility of running two core engines at the same time, we can directly compare the results, e.g. the internal state of the desired components, possibly exposing bugs and potential problems. In order to maintain two different versions of the same code base, we can resort to C++ templates, which allow implementing functions without knowing in advance the exact numeric type that will be used. This opportunity is another reason to develop a proper new data type instead of modifying the existing code to simply use the integer arithmetic (as suggested in [110]). Anyway, this implies developing a complete set of functionalities, including the proper specialization of numeric_limits<> (a type traits class for numeric data types, see [61]), possibly with heavy use of static asserts to prevent potential misuses. All the developed code is then designed to use a generic data type, provided that it respects all the required numeric properties, both in the syntax and in the semantics. The proposed solution has proved to be effective, both in terms of performance and in terms of result accuracy, its precision and range being adequate for this task. Probably, using the same fixed-point range and precision across the whole application is not the best choice, since we could use for each part of the program flow a more suitable custom type, but this would lead to more complex code, with many data conversion operations between one section and another.

Table 6.9: Effects of the fixed-point optimizations on the application performances. All timings are in ms.

Build        Total                Feed                 Process
             Time       Speedup   Time      Speedup    Time       Speedup
simd (/Od)   1065.91    —         15.98     —          1049.93    —
fxp (/Od)    1060.68    +0%       15.98     +0%        1044.7     +0%
simd (/O2)   110.23     —         4.27      —          105.96     —
fxp (/O2)    72.37      +52%      4.07      +5%        68.29      +55%

Table 6.10: Detailed view of the effects of the fixed-point optimizations on the Process function. All timings are in ms.

Build        Process              Recognition          NoRecognition
             Time       Speedup   Time      Speedup    Time       Speedup
simd (/Od)   1049.93    —         2129.34   —          897.24     —
fxp (/Od)    1044.7     +0%       1882.05   +13%       925.74     -3%
simd (/O2)   105.96     —         405.7     —          63.56      —
fxp (/O2)    68.29      +55%      136.64    +197%      58.58      +8%

After having implemented the i32q16 type and integrated it into the main application, as done for all the other optimizations, timings were measured; the results are shown in Tables 6.9 and 6.10, where the fxp build is compared with the previous simd one. We can see from the first part of both tables (when the application is not optimized by the compiler) that moving away from the floating-point arithmetic (despite the results presented in Table 6.7) brings no improvement at all. This can be explained by the fact that the actual implementation strongly relies upon inline code optimizations (as already discussed previously), which are not enabled by default. This substantially cancels the advantage brought by the fixed-point arithmetic, which is in turn fully exploited when those options are enabled (see the second part of the tables). Now we have a global improvement of +55%, mostly due to the Process function, since the Feed routine only pre-processes images and does not perform many numeric operations. Instead, there is a big improvement when a recognition is needed (see the Recognition column in Table 6.10), because in this case the ANN subsystem comes into play and performs a lot of numeric computations. We can roughly estimate its cost, in terms of execution time19, by subtracting the NoRecognition time from the Recognition one (in the latter case the same operations are done, plus the ANN ones). The results are shown in Table 6.11, where we can see again only a small performance improvement when the compiler options are turned off, but a system that is more than four times faster when they are activated. These results are far from the ones we could expect looking at the simple arithmetic comparison done in Table 6.7, but pure computational power (i.e. how many operations can be executed per unit time) is not everything in a real application like this one.

19 We have to remark that the ANN subsystem does not only comprise the neural networks, which are compiled as an external DLL with all the compiler optimizations turned on, but also all the associated functions, like the feature extraction for example. For further details on how the characters are recognized, see Chapter 7.

Table 6.11: Detailed view of the effects of the fixed-point optimizations on the ANN subsystem. All timings are in ms.

Build        ANN subsystem
             Time       Speedup
simd (/Od)   1232.09    —
fxp (/Od)    956.31     +29%
simd (/O2)   342.14     —
fxp (/O2)    78.05      +338%

In conclusion, we can see that switching from the floating-point arithmetic to the fixed-point one has been an important improvement; this has been achieved through a specific implementation that respects all the good design practices of an object-oriented framework, like the separation between the internal implementation and the public interface, the flexibility of replacing the original solution with a better one, and the separation of concerns between the adopted algorithms and their mathematical aspects. The proposed solution leads to an increase of 52% in the whole performance, and up to +338% on the ANN subsystem, which is the core part that recognizes the characters.

6.3 Conclusion

In this chapter we have examined how to set up the proper development environment, choosing both the programming language and some additional libraries. These are a valuable help in letting the developer focus on the application functionalities and in promoting both the quality and the effectiveness of the proposed solutions. A set of guidelines has been discussed and used during the entire design and development cycle, looking not only at the current need of having a working prototype, but also at the future, in order to make this system evolve into something more complete and useful for the blind. We have not discussed which algorithm is the best one for each task, both because they are already outlined in Section 3.2 and because they are discussed in more detail in [5]. Anyway, all of them have been entirely refactored and revised in order to meet the new requirements imposed by the new platform and to improve the quality of the results.

The result is twofold: a final application able to run with acceptable performance on the two platforms proposed in Chapter 5, and a set of tools that can help the designer/developer to further improve the system, or to add features currently left out. The software is designed around a cross-platform engine that takes as input the images captured by the device through an external camera, processes them, both by tracking the user movement and by recognizing the characters, and gives information to the user by means of speech synthesis. Everything is performed trying to achieve the best trade-off between software reliability, efficiency, robustness and flexibility.


Chapter 7

Text recognition

In the previous chapters we first defined the hardware platform (Chapter 5) and then designed and developed the software (Chapter 6). Now it is time to focus our attention on the core part of the system, i.e. the one that performs the text recognition task. As already outlined in Section 3.2, it is based on artificial neural networks, but they are only one part of this subsystem, since we first have to pre-elaborate the data (feature extraction) before classifying it. Afterwards, we collect the results coming from the identification of consecutive characters in order to form entire words. Nevertheless, ANNs are still the most important component, and we will focus most of our attention on them. Starting from an initial implementation, different ways to improve them will be explored; in order to fulfill this task, a proper tool has been designed, developed side by side with the main application.

7.1 Feature extraction

As already mentioned above, the first task is to pre-elaborate the image of the symbol to be recognized, in order to transform it into a set of data suitable to be processed by ANNs. Indeed, they are used to classify not the whole character image as it is, but a more compact (small and with a fixed size) numeric representation of it. The input bitmap exactly represents the glyph to be read, but the same character could have different shapes (each one depending on the text size) among various fonts, and it is not feasible to develop a particular classifier for each of them. Moreover, two identical symbols may produce two different resulting images, for example because of the previous data processing steps, which could lead to slightly different data. For this reason we compute the so-called feature vector, which should univocally correspond to the original image. As explained in [112], we have to take into account different potential problems that could affect the recognition results. In particular the feature extraction function should output a set of data that both
adequately represents the input image and is invariant with respect to these issues:

• Illumination/contrast

• Size

• Position

• Rotation/skew

• Deformations

• Little font decorations.

The first item is mainly a concern of the acquisition process, not of the text itself. Independence from the symbol dimension is required to recognize different text sizes. Tolerance to position and rotation is useful because we do not know a priori where the characters are located in the image nor their orientation. As we have seen in Section 2.1, the end user has no constraints on how to move the device. The last two items can be caused either by a non-optimal image pre-processing or by the variety that exists among different fonts. This last observation is very important, since we would like to design a system that is not tied to a particular font type, and this is a concern of both the feature extraction process and the classification itself. We can note that simply using the character image as the feature vector is not a good solution. We have to define routines to extract features that both form a basis of the problem domain and are orthogonal to each other, in order to avoid redundancy. Besides these ideal characteristics, we can also look at the reconstructability property offered by the chosen feature extraction method, i.e. the possibility of synthesizing the original image from the computed data.

There are several techniques that can be used for this task, and they can be grouped into four main approaches:

Template matching is the method outlined above, where the whole image is the feature vector; the name is due to the classification strategy that is usually employed in this case, i.e. a simple comparison with a reference image.

Unitary transform (or even a set of them) means applying a mathematical transform to the image and using its results as the feature vector; examples are the Fourier transform, Gabor filters and others.

Zoning involves the subdivision of the image into zones (e.g. by superimposing a grid) and then computing local features.

Geometric moments are one of the most common methods used for this task. The method consists in calculating geometrical properties of the image; using a proper combination of them guarantees that they are orthogonal (Hu invariant moments).

It is noteworthy that each of these methods can be applied either to the whole image (in a greyscale or black and white format), or to the contour or the skeleton of the symbol. In these cases special attention has to be paid to particular situations, for example when there is an “internal” and an “external” contour (e.g. P, B) or when symbols are accented. Geometric moments (e.g. Hu moments, which are invariant to deformations) and unitary transforms can give accurate results and often simplify the classification task, but they are also computationally expensive, leading to performance problems. As already pointed out in Section 2.1, the system has to be reactive, i.e. it has to give results in a short time to follow the user's needs. This implies a preference for lightweight, and hence fast, solutions: template matching is one of the simplest ones, but it is useful only if the input data is almost always clean (not affected by noise) and well defined. This is not our case, because we do not know in advance how the user will use the device and what he is trying to read. It is very different to process a newspaper rather than a magazine or a book image. For this reason we need a more flexible method that is both simple and effective. Some of these issues could also be directly overcome in the classification step: for example ANNs exhibit a good flexibility, and this could be exploited to design a second set of “de-rotation” ANNs [113] to tackle the image skew issue. The downside is that the whole system would be too complex for our requirements/constraints, and so we have to look at alternatives.

Examining these concerns, we can see that some of them have already been addressed in the previous computation steps. The illumination uniformity is achieved by equipping the input camera with a set of white LEDs (as we have seen in Section 3.1) and the non-uniform contrast is tackled by an effective thresholding technique. The position is not a problem in our case, since with the segmentation functions each symbol is characterized by its surrounding rectangle (and only this sub-image will be used for the recognition). Finally, the rotation/skew has already been compensated by the skew detection routines (more details on these steps have been discussed in Section 3.2). In this way we still have to deal only with the size, deformation and font decoration issues. For these reasons, the chosen feature extraction method is the zoning strategy with the percentage of black pixels inside each zone as the unique property. Conceptually it can be viewed as superimposing a grid and counting, for each cell, the number of black pixels (divided by the zone area).

This is equivalent to subsampling the image and using the resulting matrix as the feature vector1, and so it has the valuable reconstructability property mentioned above. The size invariance is achieved by adapting the virtual grid to the image dimensions, thus it is also resolution independent and able to recognize symbols of any size. The little differences among various font types, as well as any residual noise, are (partially) canceled out by the averaging operation inside each matrix element. An example of this method can be seen in Fig. 7.1.
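To make the zoning idea concrete, the following is a minimal sketch (hypothetical names, not the project code) of a grid-based feature extractor: it assigns each pixel of a binary glyph image to a single cell of a rows x cols grid and returns the fraction of black pixels per cell. The actual routine, as discussed later in this section, is more careful and spreads each pixel's area across the cells it overlaps.

#include <vector>
#include <cstddef>

// Minimal zoning feature extractor (sketch): divides a binary glyph image
// into rows x cols zones and returns, for each zone, the fraction of black
// pixels. Here each pixel is simply assigned to one cell; the real routine
// treats each pixel as a small square that may span several cells.
std::vector<double> zoningFeatures(const std::vector<unsigned char>& image,
                                   std::size_t width, std::size_t height,
                                   std::size_t rows, std::size_t cols)
{
    std::vector<double> black(rows * cols, 0.0);   // black-pixel counters
    std::vector<double> area(rows * cols, 0.0);    // pixels falling in each cell

    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            // Map the pixel to its grid cell: the grid adapts to the image
            // size, which gives the size/resolution invariance discussed above.
            std::size_t r = y * rows / height;
            std::size_t c = x * cols / width;
            std::size_t cell = r * cols + c;
            area[cell] += 1.0;
            if (image[y * width + x] != 0)          // non-zero = black pixel
                black[cell] += 1.0;
        }
    }
    for (std::size_t i = 0; i < rows * cols; ++i)
        black[i] = (area[i] > 0.0) ? black[i] / area[i] : 0.0;
    return black;                                   // feature vector, row-major
}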

According to this approach, we have two other decisions to take: whether or not to maintain the aspect ratio of the symbol, and the resolution, both horizontal and vertical, of the sampling grid. There is no way of deciding the first issue a priori: retaining the proportionality makes the matrix itself carry one more piece of information, the aspect ratio. On the other hand, one of the two resolutions could be seen as under-used; deciding to stretch the image makes it possible to capture all the available details (the difference is shown in Fig. 7.1). Anyway we could take both of them, and only accurate tests of the feature extractor plus the classifier can give an answer about which choice is better, as we will see in the next section. Regarding the other concern, i.e. the grid resolution, we can see that a low resolution leads to a better compensation of little defects and decorations and to a noise reduction. Conversely, a higher resolution guarantees a better reconstructability and hence should make the classification task easier, but it also highlights small details (with all the pros and cons). Increasing it too much ends up in using the whole source image as the feature matrix. Again, since there are pros and cons for each option, it is better to test their effectiveness considering the whole classifier. Our main concern is the recognition ability of the entire subsystem, and this is the result of both the feature extraction and the identification process. Anyway, although there are infinitely many ways to combine the vertical and horizontal resolutions, we can simply use two equal values, since in the Latin alphabet there are both wide and narrow characters. Examples are 8 or 16, which should guarantee good results: the difference between them can be seen in Fig. 7.2.

The sole drawback of this solution is from the implementation standpoint, since it is very sensitive to numerical approximations. The actual function considers each pixel value not as an intensity condensed into one single point, but rather as a square with a uniform intensity value. The consequence is that each pixel in the source image can span across multiple grid elements and vice versa. This method should give better results, but it requires a longer tuning stage in order to obtain a well-established routine. The initial implementation has been done using floating-point arithmetic (as everything else; only in a later step has it been converted to fixed-point arithmetic, see Section 6.2).

1In the current implementation a black and white image is used as input data, but everything discussed here can be applied to a grey-level image, replacing “percentage of black pixels” with “mean of the pixel intensities”. This decision has been taken in order to improve the computational performances.

Figure 7.1: Examples of the proposed feature extraction: (a) the original image, (b) the feature matrix with the correct aspect ratio, and (c) the one stretched on the real character size

Figure 7.2: Differences between the two dimensions of grids used for the feature extraction: (a) the original image, (b) 8 × 8 size, and (c) 16 × 16 size

Floating-point arithmetic has many subtle aspects to take into account [101, 102]: for example, the associative property of addition does not hold, and the equality comparison may fail even in apparently obvious cases. This can make theoretically correct code fail (or crash, depending on how the exceptions are handled). In particular, the exact results strongly depend on the compiler optimizations, and the only robust solution is to use only greater-than or less-than comparisons, possibly with the help of epsilon values (roughly speaking, the smallest representable differences, defined in the standard library [57]). It has been necessary to deeply test the developed routines with the tools previously designed. Only afterwards, according to the guidelines regarding the optimizations, has the whole algorithm been converted from floating-point to fixed-point arithmetic. Anyway it has been a pretty straightforward process, since the code was already implemented independently of the specific numeric type, following the advice stated in Section 6.2.
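As an illustration of the comparison style just described, here is a small generic sketch (hypothetical names, not the project code) that avoids exact equality tests and uses the machine epsilon from the standard library as a tolerance.

#include <cmath>
#include <limits>

// Values accumulated in a different order (e.g. after compiler optimizations)
// may differ by a few least-significant bits, so exact equality tests are
// fragile. Instead of "a == b", compare with a tolerance derived from the
// machine epsilon and the magnitude of the operands.
bool almostEqual(double a, double b)
{
    const double eps = std::numeric_limits<double>::epsilon();
    double scale = std::fmax(std::fabs(a), std::fabs(b));
    return std::fabs(a - b) <= eps * scale * 4.0;   // small safety factor
}

// Safer pattern for boundary tests: strict less-than against a threshold
// slightly enlarged by the tolerance, rather than an equality check.
bool isBelowBound(double value, double upperBound)
{
    const double eps = std::numeric_limits<double>::epsilon();
    return value < upperBound + eps * std::fabs(upperBound);
}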

In conclusion, the feature extraction is an important step in the recognition process and many issues have been taken into account. Some of them could have been directly tackled in the subsequent classification stage, by designing a system with good tolerance properties.

Instead, we exploited some results from the previous elaboration steps, thus allowing the development of a simple but effective solution that enables the actual classifier to achieve better performances. The main goal in designing the pre-elaboration stage has not been to perfectly overcome all the problems, but rather to attenuate their effect in order to increase the performance of the whole system. Indeed the proposed skew detection algorithm (see Section 3.2) is far from being perfect, but it is good enough to rotate the image so that the ANNs can give reliable results (some results about the effect of the character rotation on an ANN classifier can be found in [113]). Also the resolution of the feature matrix could be a critical parameter for the proposed solution, but all these concerns should be evaluated together with the chosen specific classifier, as we will see in the next sections.

7.2 Pattern classification

In the previous section we have seen why and how to pre-process the input character image before submitting the results to the actual classifier. Now we design such a subsystem, which takes the feature vector2 as input and gives as output the information regarding the class the data (and the source image) belong to. This problem is a particular case of the more general data mining task [114] and it is usually referred to as “pattern classification”. There are several approaches to this challenging goal [115, 116], starting from the simplest one, which consists in calculating the distance between the given data and a set of reference values and selecting the class with the least distance (this approach is the template matching method outlined above). This is not sufficient for our purposes, so we have to turn to alternatives, among which we have [114] decision trees, rule-based, nearest-neighbour and Bayesian classifiers, Artificial Neural Networks (ANNs), Support Vector Machines (SVMs), or even ensembles of classifiers. All of them have their pros and cons, some are more suited than others to specific types of problems, and it is very difficult to say which one is better than the others. Besides considering the problem domain, we have to take into account what we know about the mathematical properties of the data we are trying to classify [115, 116], such as the probabilities according to which some events occur, but this information cannot always be inferred. We can try to estimate their statistical properties (Bayesian decision theory), as in maximum-likelihood and Bayesian parameter estimation. There are non-parametric techniques, like the k-nearest-neighbour one, or fuzzy classifiers when there is no plain mathematical description. Alternatively, instead of trying to infer a model from the data properties, a general model skeleton could be selected a priori and then adapted, i.e. trained, on the particular data set representing the problem.

2Although we have a feature matrix (or matrices) rather than a vector of data, we will often use the latter term since it is the common name used in the literature, independently of the actual data structure.

Among the options there are the so called data-driven methods (but also stochastic techniques, like genetic algorithms, and non-metric approaches, such as decision trees), which do not try to explain the underlying process, but simply to reproduce its external behaviour and define the decision boundaries for the classification. In these cases we are not interested in knowing how and why the classification is working, and we can consider it as a black box to which we give an input and from which we want the correct results as output. Inside that component the model is spread across many parameters, none of which has a direct and clear relationship with the real world (as opposed to the statistical methods, for example). They are adjusted in the so called training process: the data set representing the problem is given to the classifier as a set of desired input-output pairs, and then a specific algorithm modifies the internal values in order to meet certain requirements and to achieve the training goal. This is the so called supervised learning.

The data-driven methods offer the greatest flexibility, because they make no assumptions (and hence also impose no requirements) about the problem characteristics. The resulting solution can be set up independently of the main application development: it is not a matter of designing a particular executable routine, but rather of finding the proper set of parameters, so the training process can be automated through external tools. Among all the others, artificial neural networks [117] are one of the most used approaches: they offer excellent approximation abilities (if used in regression problems) and generalization performances (if used in classification problems), even when complex decision boundaries or non-linearities are involved. They are an extremely flexible technique that enables the developer to shape the model around his requirements/goals, with a lot of options to tune in order to achieve the desired performances. Besides these pros, there are also some cons, mostly concerning the choice of the optimal design [118], the best training strategy and how to improve the results. All these concerns will be addressed in the next sections.

7.2.1 Artificial neural network classifier

Artificial neural networks are a research topic in themselves, and inside this field there are so many different solutions (with so many aims) that it is difficult to choose the best one. All of them share the basic idea of taking inspiration from neuroscience, where neural cells are the basis of all nervous systems. Each of those “basic components” performs a very simple operation on its input and gives the result as output. The real strength is not in the single element, but rather in the interconnection of many of them. Each one has its own set of internal parameters that are tuned in the learning process, and all together they form the model of the problem. It is not our purpose to study all the theory behind them, or the opportunities they offer (a good reference is [117]), but simply to discuss how they have been useful in this research project.

In this study, the ANN type used is the so called multi-layer feed-forward perceptron network (usually indicated as MLP, Multi-Layer Perceptron, network). Neurons are organized in layers: the first one is the input layer, then there are the hidden ones and finally the output one. Each neuron input is connected only to the outputs of neurons in the previous layer (feed-forward). The neurons are of the perceptron type, which means they have the following behaviour:

o_i = f\Bigl( \sum_{j=1}^{k} w_{i,j} \, x_{i,j} + w_{i,0} \Bigr)

where o_i is the output of the i-th neuron (with k inputs), f the so called activation function, w_{i,j} the weight corresponding to the j-th input of the i-th neuron, and w_{i,0} the associated bias. The weights (and biases) are the values modified by the training process, while the activation function and, more importantly, the network topology are established at design time.
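As a concrete reading of the formula above, a minimal sketch of the output computation of a single perceptron follows; the names are hypothetical and a logistic sigmoid is assumed as the activation function, which is one common choice and not necessarily the one used in the project.

#include <vector>
#include <cmath>

// Output of a single perceptron: weighted sum of its inputs plus the bias,
// passed through the activation function f (here a logistic sigmoid).
double perceptronOutput(const std::vector<double>& inputs,   // x_{i,1..k}
                        const std::vector<double>& weights,  // w_{i,1..k}
                        double bias)                         // w_{i,0}
{
    double sum = bias;
    for (std::size_t j = 0; j < inputs.size(); ++j)
        sum += weights[j] * inputs[j];
    return 1.0 / (1.0 + std::exp(-sum));                     // f(net)
}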

Figure 7.3: Architecture of the proposed ANN classifier

First of all we have to decide whether to use a single large network (with an encoded output identifying the classes) or a set of small independent ANNs (each one representing a category). The second option can give more accurate results, because otherwise the decision boundaries in the feature space would be more complex [118] and thus the results worse. It is a better approach even from an architectural standpoint, because it promotes the system modularity and extensibility. The software is configurable via a simple text file, so it is easy to add an entry for a new symbol to be recognized, and it requires neither recompiling nor modifying the main application. Each network takes as input the feature vector computed previously, and returns a value ranging from 0 to 1, which can be seen as the probability that the input data belongs to the class for which that ANN was trained.

The network with the highest score is the winner, while all the others are discarded. This is the winner-takes-all approach; the overall classifier architecture is shown in Fig. 7.3. Some constraints (again defined via the configuration file) have been put on this strategy: a threshold on the best result and another on its difference with respect to the runner-up. This is required to enforce the general reliability of the results. Moreover, in this kind of problem it is better to give the user no information at all than to provide wrong information.
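A minimal sketch of this decision rule follows; the names are hypothetical and, in the real application, the two thresholds are read from the configuration file.

#include <vector>
#include <cstddef>

// Winner-takes-all with reliability constraints (sketch): the class whose
// network gives the highest output wins, but only if that output exceeds a
// minimum threshold and is sufficiently far from the runner-up; otherwise
// the symbol is reported as "not recognized" (-1).
int classify(const std::vector<double>& outputs,   // one value per class network
             double minScore,                      // threshold on the best result
             double minGap)                        // threshold on best - runner-up
{
    if (outputs.empty())
        return -1;
    std::size_t best = 0, second = 0;
    for (std::size_t i = 1; i < outputs.size(); ++i) {
        if (outputs[i] > outputs[best]) { second = best; best = i; }
        else if (outputs[i] > outputs[second] || second == best) second = i;
    }
    double gap = (outputs.size() > 1) ? outputs[best] - outputs[second] : 1.0;
    if (outputs[best] < minScore || gap < minGap)
        return -1;                                  // better no answer than a wrong one
    return static_cast<int>(best);
}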

Another decision to take is how to define the internal network structure, settling how many layers are needed and how they are linked. Usually, independently of the number of hidden layers, each neuron is fully connected, i.e. its inputs are associated with all the units of the previous layer. This model assumes that direct relationships between two non-consecutive layers, or within the same layer, are not allowed. This can cause, for very complex tasks such as this one, an overfitting issue, i.e. the system could be perfectly tailored to the training set, but it does not perform well on new data, and hence it does not generalize the problem enough. This can easily be seen by comparing the error achieved on the training set with the one on a validation set, which has to be disjoint from the former. There are many reasons for this phenomenon [114], related both to a far from optimal training set composition (noisy or under-representative data) and to a too complex decision system. In particular there is the multiple-comparison issue, which in synthesis states that the more opportunities we have of modifying a system, the higher the chance of doing it in a useless way. In the ANN approach it can be seen by examining the network topology, in particular the number of hidden neurons. Having many of them increases the network redundancy [118], and gives more chances to memorize the training data instead of capturing its “essence”. This can be somehow correlated to the aforementioned multiple-comparison issue. In those cases, the network will learn faster, but in practice it will give poor results. One solution is to reduce the number of elements per hidden layer, but we have to pay attention to use a sufficient number of them in order to avoid incurring the opposite problem, the underfitting issue, i.e. the inability to correctly tackle the given problem because of a too simple model. It is noteworthy that in such cases we could increase the number of hidden layers (each one not too large), but this is rarely needed. Another way is to use a not fully-connected network: in this case each neuron is not connected to all the ones in the previous layer (as usual), but only to a specific subset, to be defined. The rationale behind this idea is that each neuron of the hidden layer can be seen as a feature extractor focused on a particular subset of the input data: in this way we reduce the intrinsic redundancy of the plain topology, which is one of the causes of the overfitting. There are many studies relying upon this idea (perhaps the most famous one is [119]), involving a system of networks connected together or an ensemble sharing the same weights.

Other systems [120] simply prefer to use a single, classical network (for each class), but designed with these ideas in mind, allowing only a predetermined connection schema, thus making each neuron have different inputs from the others. A notable side effect of this architecture is that the system is lighter from the computational standpoint, and hence more suited for our purposes, as we will see later.

Many ANN implementations simply have one hidden layer, but even in this case we have to make choices regarding other aspects, leading to different configurations that need to be tested. The input data can be “covered” more than once, meaning that a single input value can be used by more than one neuron. The case of two processing units having exactly the same inputs should be avoided, because they would be redundant. This leads to considering the size of the hidden layer and the amount of input data for each internal neuron as parameters in defining the topology of the network. In the proposed solution the first neuron gets the first N inputs, the second one gets those from N + 1 to 2N and so on; when the end of the input vector is reached, it starts again from the first input. An example of this configuration (the final one used in this project) is shown in Fig. 7.4.

Figure 7.4: Final neural network topology
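The wrapping assignment of inputs to hidden neurons described above can be summarized by a simple index computation; the following sketch (hypothetical names) shows it, together with the numbers of the configuration eventually chosen.

#include <cstddef>

// Index of the input value feeding synapse j of hidden neuron h, in the
// sparse connection scheme described above: neuron 0 gets inputs 0..N-1,
// neuron 1 gets N..2N-1, and so on, wrapping around the input vector.
std::size_t inputIndex(std::size_t h, std::size_t j,
                       std::size_t inputsPerNeuron, std::size_t inputSize)
{
    return (h * inputsPerNeuron + j) % inputSize;
}

// For the beta-h64 family discussed later: 256 inputs, 64 hidden neurons with
// 12 synapses each, i.e. each input is covered 64 * 12 / 256 = 3 times (ICF = 3).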

The first attempt has been to use two 8 × 8 arrays as the input feature vector (using the method explained in Section 7.1), one preserving the aspect ratio and one not. The idea was that using two matrices instead of only one could lead to a larger input data size without redundant information. It performed quite well from the beginning, but once the entire character set (both upper and lower case) was added to the system, the software sometimes seemed to be “confused” and was not able to discriminate some symbols. One hypothesis was that the size of the feature matrices was insufficient to make the classifier work well, because not enough details were preserved in the computed data. Thus 16 × 16 grids were used and the results were encouraging, solving most of the uncertain cases.

Of course there were still cases where it was too difficult to distinguish symbols (e.g. 1 and l), but these errors could be resolved in a post-recognition step where a syntactical analysis of the entire word could be performed.

In order to have smaller networks (and therefore faster to run), the performance difference between using two 16 × 16 arrays (as in the first experiments) or only one of them has been evaluated. The effects of changing other parameters have also been examined, like the number of neurons in the hidden layer or the Input Coverage Factor (ICF). This is defined as the number of times that a single value in the input data is taken into account (in a fully-connected network it is equal to the size of the hidden layer). Results are shown in Table 7.1, where the Mean Squared Error (MSE) has been used as the most important indicator of the performances on the validation set. The feature extraction method is also reported, where the prefix of the input type denotes the grid resolution (width × height) and the postfix indicates whether the grids are proportional (-p), not proportional (-n) or both (-np). As a remark, the number of neurons for each configuration includes also the input/output layers and the bias neuron.

Table 7.1: Comparison of different configurations of neural network topologies and feature extractors

Name        Input type   ICF   Total neurons   Total connections   MSE
beta-h64    16x16-n        3             323                 897   0.000493
beta        16x16-n        3             291                 833   0.000599
beta-np     16x16-np       3             547                1601   0.000599
beta-c7     16x16-n        7             291                1857   0.000661
beta-full   16x16-n       32             291                8257   0.000706
beta-p      16x16-p        3             291                 833   0.000901
alpha       8x8-np         7             163                 961   0.001106
alpha-full  8x8-np        32             163                4161   0.001524

In general it is quite clear that using 16 × 16 arrays leads to better performances, but using both of them (“proportional” and “not proportional” matrices) gives no improvement (e.g. beta-np against beta), only resulting in more expensive networks. We can roughly estimate their computational cost as the total number of connections, since the multiplication involved is the most expensive operation in this case (the activation function can be approximated). Posing constraints on the connections, instead of using fully-connected networks, is a good idea, as can be seen for example by comparing beta with beta-full, or alpha with alpha-full. Anyway, increasing the ICF too much does not lead to better results (beta-c7 vs beta), while the connections are more than doubled.

An alternative is to maintain the same ICF and to increase the size of the hidden layer (from 32 to 64 neurons). Adding new neurons should enable the ANN to better locate the decision boundaries in the feature space [118] and, thanks to the constraint on the connections, we should still be able to avoid the overfitting. A first attempt could be to double the size of the hidden layer, which means halving the input size of each neuron. From Table 7.1 it can be seen that this new configuration (beta-h64) is the one which shows the best performances. It is also worth noting that it is slightly better to use the not-proportional feature set instead of the proportional one (see beta-p in comparison to beta).

In conclusion, the chosen configuration, beta-h64, is a set of networks (one for each symbol to recognize) with an input size of 256, a hidden layer made of 64 neurons, each one having 12 synapses, and a final output neuron fully connected with the hidden ones (as shown in Fig. 7.4). On the validation set, it shows an MSE 2.24 times smaller than that of the first configuration (alpha), while having a 7% smaller size (measured as the number of connections). From the ANN execution standpoint, we can verify that posing constraints on the number of connections can help to improve the performances, as shown in Table 7.2. Those values have been measured on the target platform, enabling all the compiler optimization options and, in the second part of the table, the fixed-point arithmetic. These have been discussed in Section 6.2, but we will discuss the specific issues regarding the ANNs later. In both cases there is a significant speedup, although it is not as high as expected from directly comparing the number of connections in the two configurations.

Table 7.2: Run-time performances of different ANN topologies. All timings are in ms.

Build        ANN configuration   ANN subsystem time   Speedup
simd (/O2)   beta-full                        1268.5         —
simd (/O2)   beta-h64                         342.14      +271%
fxp (/O2)    beta-full                        128.86         —
fxp (/O2)    beta-h64                          78.05       +64%

In order to train these neural networks, a set of character images has been collected, corresponding to the entire Latin alphabet (upper and lower case, vowels with accents and also some other symbols such as brackets) in different fonts. They have been captured with the same software developed for the device, in order to ensure that the training set is truly representative and not an artifact. In a first step, a base repository with 45 samples for each character, equally spread across three common fonts (Arial, Times and Courier), has been prepared in order to train a first set of networks.

Then, thanks to the software feature that allows saving unrecognized symbols, the repository has been progressively expanded until it contained more than 10000 samples. It was created not only using purpose-made sheets, but also different common books (from various publishers), magazines and newspapers, with the aim of covering common cases. Of course the set did not grow uniformly over all the classes, but proportionally to the difficulty of discriminating each symbol from all the others. The most doubtful character images were automatically saved more often, and then added to the repository. This approach implies a periodic re-training of the whole system, in order to improve the global performances.

Finally it has been decided to split the whole set of collected data into two distinct subsets: the first one to be used to train the ANNs and the second one to validate them, in order to compare different solutions. The division, 90% training set and 10% validation set, has been made randomly, but respecting the ratios of occurrences of each symbol in the global collection. All the performance results shown in this section refer to that validation set. Such a choice is not the best one, since it raises some questions about the estimation of the generalization error and hence about the evaluation of the classifier [114]. It could be pointed out that the validation set is too small to give reliable results. On the other hand we need a good and well-representative data set, large enough to properly train the ANNs: for this reason it is better, as a first approach, to use those sets, with the intention of reconsidering this decision later.

Each network has been trained on the whole training set, so it should be able to recognize the proper symbol and to reject all the others. In a first step the standard back-propagation algorithm (described in [121, 117]) has been used, where the weights are modified according to the rule:

\Delta w_{i,j} = -\eta \, \frac{\partial E}{\partial w_{i,j}}

where w_{i,j} is the weight of synapse j of neuron i and η is the learning rate used by the algorithm. This is a good algorithm, but it is too slow and requires too many epochs to achieve the required precision (and this becomes more relevant as the training set grows). Thus alternative methods have been tested, such as QuickProp (quick back-propagation, described in [122]) or RProp (resilient back-propagation, described in [123, 124, 125]). We obtained the best results using the latter, with the following update rule:

\Delta w_{i,j}^{(t)} =
\begin{cases}
  -\Delta_{i,j}^{(t)} & \text{if } \frac{\partial E^{(t)}}{\partial w_{i,j}} > 0 \\
  +\Delta_{i,j}^{(t)} & \text{if } \frac{\partial E^{(t)}}{\partial w_{i,j}} < 0 \\
  0 & \text{otherwise}
\end{cases}

where

\Delta_{i,j}^{(t)} =
\begin{cases}
  \eta^{+} \cdot \Delta_{i,j}^{(t-1)} & \text{if } \frac{\partial E^{(t-1)}}{\partial w_{i,j}} \cdot \frac{\partial E^{(t)}}{\partial w_{i,j}} > 0 \\
  \eta^{-} \cdot \Delta_{i,j}^{(t-1)} & \text{if } \frac{\partial E^{(t-1)}}{\partial w_{i,j}} \cdot \frac{\partial E^{(t)}}{\partial w_{i,j}} < 0 \\
  \Delta_{i,j}^{(t-1)} & \text{otherwise}
\end{cases}

and 0 < η− < 1 < η+.

The main idea behind this algorithm (discussed in [123]) is to have a weight update value that is not proportional to the magnitude of the partial derivative of the error function for each neuron. It takes into account only the sign of such derivative, and hence it is still an adaptive method, but much faster. Of course constraints (∆max and ∆min) on the allowed values of the weight updates are needed. After some tests, the following values have been chosen for the training process: η+ = 1.2, η− = 0.5, ∆max = 50 and ∆min = 10^−6.
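A minimal sketch of this update rule for a single weight follows; the names are hypothetical, the error derivative is assumed to come from standard back-propagation, and the weight-backtracking refinements of some RProp variants are omitted.

#include <algorithm>

// Per-weight state kept by RProp between epochs.
struct RpropState {
    double step = 0.1;       // current update value Delta_{i,j}
    double prevGrad = 0.0;   // dE/dw from the previous epoch
};

// One RProp update for a single weight, following the rule given above.
// Defaults match the parameters chosen in the text: eta+ = 1.2, eta- = 0.5,
// Delta_max = 50, Delta_min = 1e-6.
void rpropUpdate(double& weight, double grad, RpropState& s,
                 double etaPlus = 1.2, double etaMinus = 0.5,
                 double stepMax = 50.0, double stepMin = 1e-6)
{
    double sign = s.prevGrad * grad;
    if (sign > 0.0)
        s.step = std::min(s.step * etaPlus, stepMax);   // same direction: accelerate
    else if (sign < 0.0)
        s.step = std::max(s.step * etaMinus, stepMin);  // sign change: slow down
    // else: keep the previous step size

    if (grad > 0.0)      weight -= s.step;              // move against the gradient sign
    else if (grad < 0.0) weight += s.step;

    s.prevGrad = grad;
}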

Another issue about the training phase is when to stop it. Classical approaches impose a maximum number of epochs or a limit on the mean squared error of each network. In these experiments the maximum MSE allowed was 10^−4, but the training was in any case stopped after 1000 epochs. Almost all the networks have completed this phase reaching such a small error (in a few hundred epochs). Only a few of them, those that have to identify similar symbols (e.g. I and l or 1), have terminated because of the epoch limit. This can be explained considering that the feature vectors, in such cases, are too similar and appear both as positive and negative samples (each class is trained against all the others). It is very difficult to discriminate these characters without a context analysis, and thus the program often misunderstands these symbols. This is not due, as one would expect, to a small difference between the best and the second result, but to a low output value for both networks.

Similar problems arise when the lower and upper cases of a character are almost the same (e.g. o and O): the proposed solution is independent of the input image size, so it is quite impossible to distinguish between them. A workaround would be not to train a class against all the others, but only against the ones which are really different. This would solve the problem only partially, because in that case there would be two (or more) networks returning a high value for the same symbol. It could be helpful, since the system would have an idea about the character, but a better solution would be to integrate a post-recognition step performing a lexical analysis or something similar (e.g. based upon the preceding and following symbols).

7.2.2 Performance improvements

The solution proposed above has been used for the first tests: although it worked properly, it was also clear that the overall performances had to be improved. The previous comparison of results was made by using only the MSE, and the configurations tested, and hence the explored solutions, were few. It is very difficult to evaluate a priori which parameter, taken alone, could undoubtedly lead to better performances, because it interacts with all the others. The idea is to further analyze the results and to find some hints on how to train different ANNs, in order to have a set of configurations from which to choose the best one. As already pointed out in [118], there is no scientific algorithm that gives the optimal ANN as a result. It is rather an iterative process based on a trial and error approach. Anyway we can identify some guidelines on how to design new configurations to be tested. Theoretically we should explore all the combinations of options, but this is not feasible in practice, and so we will stop the search when we find a good solution, rather than trying to find the absolute best one. This also follows the already mentioned Occam's razor principle, which states that, once the requirements are satisfied and the goal achieved, the simpler solution should be preferred. This is important especially in the pattern classification field, where there is no limit on the complexity (and hence the difficulty of tuning) of the system we design [115].

In this part, only some results (and not from all the configurations) will be presented, in order to better focus on the search for new solutions. As we will see in Section 7.3, an automatic build system has been developed to train and evaluate the proposed ANNs. This means that all the tests have been performed on all the configurations, but in order to show how and why new directions have been investigated, only the interesting subset of data will be presented. Almost all the results in this section have been obtained with the training and validation sets defined in the previous section; later the same tests will be repeated with different evaluation strategies. In Appendix A the same results presented here will be shown in complete tables (i.e. all the evaluations for all the configurations), but estimated through a two-fold stratified cross-validation approach (which is more reliable) instead of the holdout method used here.

Deciding to improve the classification performances means first of all deciding how to operate on this subsystem; if we look at the characteristics of ANNs, we can see the following areas of intervention:

Topology So far we have only used variations of the same topology. Defining the right connection schema between the layers is a challenging task, but nevertheless it is a crucial aspect to achieve good results.

Evaluation parameters MSE is almost always used as the sole performance evaluator, since it gives an idea about how far the ANNs are from the desired behaviour. This does not mean that it is a good parameter for every purpose, and alternatives, which could be more meaningful, should also be considered.

Behaviour on specific data subsets Until now, we have evaluated the performances on the validation set without making distinctions between particular subsets: understanding where the ANNs perform better or worse helps to determine new ways of improvement.

Training strategy According to the information collected with the aforementioned proposals, we can modify the classical training process: it is not only a matter of tuning its parameters, but rather of defining new goals or modifying the way the training set is used.

Results on the target device In the end we have to remember that the final target platform has different characteristics from the development one. We have to ensure that the results obtained are good on the final system, where the application will run.

In the following sections those points will be discussed in detail. The goal, as outlined above, is to define a set of different configurations among which to choose the best one.

ANN topologies and configurations

The solution previously described treats the input data as a plain vector, but it should be better to use it as a matrix, since we are dealing with images. The idea is to design a topology in which layers have a 2D structure, thus being ideally better suited to handle this type of information. It would be better to apply constraints on the connections, similar to the ones discussed previously; but in this case, dealing with a 2D arrangement, applying the same idea would anyway lead to a great number of connections. It has the same complexity as a kernel filter, i.e. strictly related to the kernel size multiplied by the input data size. Hence the idea is to use two hidden layers instead of only one. The first one could be seen as a feature extractor that reduces the input data to a smaller data set. The second one could be considered as the real classifier, together with the output layer. In this way we should be able to limit the number of connections while still having a flexible and theoretically more suited solution.

The proposed new topology (see Fig. 7.5) is made of two hidden layers: each neuron of the first layer is connected to a 2 × 2 portion of the input data (each portion contiguous, not overlapped), while each neuron of the second layer is connected to a 3 × 3 array (each one partially overlapped, not contiguous).

Both hidden layers are 8 × 8 matrices and, as explained above, the former halves each dimension of the data, while the latter follows a connection pattern similar to the one used in the first topology (see Fig. 7.4). We can say, by analogy, that there is an ICF of 9, greater than the previously used values of 3 and 7. This parameter is now computed not between the input and the hidden layer, but between the second and the first hidden ones. The relationship between the inputs and the first hidden layer is simply a 4:1 connection. As already pointed out, these choices are a trade-off between complexity and flexibility, since ideally we could have plenty of solutions of this kind.

Figure 7.5: Proposed ANN topology to take advantage of the image 2D structure

Table 7.3: Performances of the proposed 2D topology with respect to the other ones

Name       Input type   Total neurons   Total connections   MSE
beta-h64   16x16-n                323                 897   0.000493
beta       16x16-n                291                 833   0.000599
2Dn        16x16-n                388                1025   0.000687
alpha      8x8-np                 163                 961   0.001106
2Dp        16x16-p                388                1025   0.001209

The results regarding this new topology are shown in Table 7.3, where 2Dn and 2Dp are compared to the previous configurations. The former has been used with the feature vector that does not maintain the image aspect ratio, the latter with the proportional one. Again we can see that the trend (it is difficult to draw absolute conclusions from these few results) leads to preferring the not proportional feature set. Although they should have better performances, according to these tests they lag behind the best configuration previously found, despite the higher number of neurons and connections. This has been quite surprising but, still believing that ANNs based on this topology could be improved, in the subsequent tests this solution will be used extensively, to try to understand the reasons for this behaviour and hence how to improve it.

It is clear from this discussion that it is not possible to simply talk about network topologies in general, but rather about ANN configurations. The classifier is composed not only of the networks themselves, but also of the feature extraction routine (discussed in Section 7.1), which in turn can be configured in different ways. Even the ANNs, as we will see later, can be trained according to various modes, and this affects the results. For this reason we will no longer talk about topologies to identify a specific ANN type. Anyway the topology identifier will be used to group them into different families, i.e. sets of configurations which share the same ANN structure (along with the feature vector parameters). The family names are the same as those of the solutions proposed in this section (and the previous one), i.e. beta-h64, 2Dn and 2Dp, and they will be used as the prefix of the configuration names we will develop.

Evaluation parameters

In the previous tests, the sole parameter used to compare different configurations has been the MSE. This is the most used measure for this purpose, but in practice it is more suited to regression tasks, where the goal is to approximate a given function, and a value expressing the “closeness” of the ANN output with respect to the target relation is the most meaningful option. In our case we have to classify a data set, and hence the results are of boolean type (simplifying the problem), because they come from the comparison between the ANN outputs and some reference values (the thresholds used in the winner-takes-all approach seen before). In synthesis, we are not so much interested in estimating the distance of the proposed model from the desired results, but rather in knowing how many samples in the validation set make the model fail, i.e. in having an answer to how many errors we had.

An answer to the above question can be given by the BitFail parameter. This is a simple true/false property of the ANN output that, given an input, states whether the error of the output is above a certain threshold, called BitFail limit or BitFaillim. Hence it is possible to obtain a number (ranging from 0.0 to 1.0) that represents the percentage of wrong outputs.

This number is obtained by running the network over a set of samples and counting how many times such a condition is verified. It should give a better idea of the classification performances of the tested configuration, although it does not take into consideration how the ANN results are combined together in the final decision algorithm. It is noteworthy that the last statement is true also in the case of the MSE. Anyway, having a low BitFail is clearly better, and it can also be used as the training goal, as an alternative to the classical MSE (as we will see later). This could be useful in order to train the ANNs to be good enough to classify the desired data, rather than to output accurate values.
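As an illustration, a minimal sketch of such a computation over a set of samples follows; the names are hypothetical, and in the actual system this measure is presumably provided by the training library used for the ANN subsystem.

#include <vector>
#include <cmath>

// Fraction of validation samples for which the network output misses the
// desired value by more than the BitFail limit (0.0 means no sample fails).
// outputs and targets are assumed to have the same length.
double bitFail(const std::vector<double>& outputs,   // network outputs
               const std::vector<double>& targets,   // desired values (0 or 1)
               double bitFailLimit)                   // e.g. 0.05 in these tests
{
    if (outputs.empty())
        return 0.0;
    std::size_t failed = 0;
    for (std::size_t i = 0; i < outputs.size(); ++i)
        if (std::fabs(outputs[i] - targets[i]) > bitFailLimit)
            ++failed;
    return static_cast<double>(failed) / outputs.size();
}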

Table 7.4: Comparison of the BitFail parameter for the already designed ANN configurations

Name       Input type   Total neurons   Total connections   BitFail
beta       16x16-n                291                 833   0.004973
beta-h64   16x16-n                323                 897   0.005595
2Dn        16x16-n                388                1025   0.007413
2Dp        16x16-p                388                1025   0.007425
alpha      8x8-np                 163                 961   0.007678

In Table 7.4 we can see the BitFail results for the configurations discussed in the previous sections. Before talking about which one performs better than the others and why, a little note on how these parameters (the MSE before and the BitFail now) are computed is needed. We have seen that the validation set is about 10% of the whole data set, which is composed of more than 10000 samples. Since the system is able to identify the whole Latin alphabet (upper and lower case), numbers and a set of other symbols, for each configuration we have 85 networks. Each of them is tested on the whole validation set, in order to verify when it has to output both “true” (near to 1, meaning that the input data belongs to the class the ANN was trained for) and “false” (near to 0, in all the other cases). So we have about 85 × 1000 values to be used to compute the MSE and the BitFail. This approach will be used across all the tests we will make, and knowing it is useful to determine how many (in absolute value) wrong results we have.

From the comparison of the BitFail values, we can see that now beta-h64 is no longer the best one, and thus we should reconsider our previous decision and choose beta. In any case they clearly perform better than the other ones, thus confirming the conclusion that the proposed 2D ANN topology alone is not sufficient to guarantee good results and further investigations are needed.

Behaviour on specific data subsets

Using the main application in preliminary test sessions, it was observed that almost all the failures of the character recognition subsystem were due to a low output of the corresponding network. This was quite surprising, considering the results about the MSE and BitFail presented above. Trying to investigate this phenomenon, two types of measure have been defined: the intra-class MSE/BitFail (MSEIC/BitFailIC) and the extra-class MSE/BitFail (MSEEC/BitFailEC)3. The former is the MSE/BitFail reported by a network over the validation examples that it should recognize, and the latter is the error in all the other cases (i.e. the ones that it should reject). Thus we can define the positive training/validation set, which should be intended as the set of input data for which the network should give 1 as output, while the negative one is all the rest of the input data, i.e. the one for which the result should be 0. This definition should be intended on a network by network basis, i.e. for the “lower case a” network the positive training/validation set is the set of images containing the a glyph, while the negative one is the set of all the remaining images. It is also clear that, given a network, its positive input set is different from that of another ANN, and the positive set of the former is certainly part of the negative set of all the other ANNs. So, speaking generally about the performances on the positive/negative sets means that the specified result is estimated through an “average” (the actual meaning of average depends upon the particular parameter) of the same measure done for each network on its positive/negative input set. More generally, we can refer to an intra/extra-class parameter as a measure on the positive or negative set4.

Table 7.5: Comparison of the BitFail and MSE intra/extra-class parameters for the already designed ANN configurations

Name       BitFailIC/BitFailEC   MSEIC/MSEEC
2Dn                  40.465753    174.977622
beta                 42.917031     11.664850
2Dp                  50.316832    141.358922
beta-h64             69.307985    177.676221
alpha                73.908127    178.866773

3We will refer to the MSE/BitFail mentioned until now, i.e. the one computed on the entire validation set, as the global MSE/BitFail, in order to distinguish it from these new definitions.

4Given a network, measuring a performance value on the positive set means running the ANN on examples belonging to the same class for which it is designed (hence the intra prefix), while the negative set represents “all the other input data”, not belonging to the actual class (and so we will use the extra prefix).

In Table 7.5 there are some results about the comparison between the intra- and extra-class parameters, in this case the ratio between them, in order to show how much the two subsets differ from the performance standpoint. It is noteworthy that the maximum BitFail ratio is about 74, even if, testing all the configurations designed until now, it can be up to about 108 (for the beta-full configuration, see Table 7.1 for further information). The reason for this phenomenon has been attributed to the large disproportion between the positive and the negative set. As we have outlined before, each network is trained with its own right input set against all the others (the negative set), which in turn is composed of 84 subsets. Thus, if we look at the variety of the samples used in the training, we can see that the positive set is roughly 1/85 of the whole set. It is clear that, since the parameter that drives the learning is an average of the results, the final ANNs will be more suited to reject input patterns than to accept the right ones. This also explains the trend found when investigating their behaviour in the final application: as pointed out previously, whenever there is a recognition failure, it is due to low network outputs rather than to too many high ones.

The results of these tests lead to the conclusion that when dealing with ANNs not only the topology is important, but also the training process. In the next section we will address this issue, trying to overcome the disproportion in the data sets.

Training strategies

In the previous section we have seen that the training process is a crucial aspect to obtain networks able to give good results and, given that there is a disproportion in the data set, we have to find a way to overcome this issue. The idea is to increase the positive training set, for example by allowing the training algorithm to replicate such data either a fixed number of times or in an automatic way. The first option is conceptually similar to giving a weight to each sample in the training set, while the second one tries to train the corresponding ANN on a set equally divided into positive and negative samples, as sketched below. This second option derives from the consideration that the number of samples is not the same for each class, so it could be interesting to set the “multiplier parameter” automatically, on a case by case basis. It should lead to better results, since it tries to balance the training set in a proper way.
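A minimal sketch of this balancing step follows (hypothetical names, not the project code): the positive samples are replicated either a fixed number of times or, in "auto" mode, enough times to roughly match the negative ones.

#include <vector>
#include <cstddef>
#include <algorithm>

struct Sample {
    std::vector<double> features;
    double target;               // 1.0 for the class of this network, 0.0 otherwise
};

// Build the training set for one network by replicating its positive samples.
// multiplier > 0: replicate a fixed number of times; multiplier == 0: "auto"
// mode, i.e. replicate until the positives roughly balance the negatives.
std::vector<Sample> buildTrainingSet(const std::vector<Sample>& positives,
                                     const std::vector<Sample>& negatives,
                                     std::size_t multiplier)
{
    std::size_t copies = multiplier;
    if (copies == 0 && !positives.empty())
        copies = std::max<std::size_t>(1, negatives.size() / positives.size());

    std::vector<Sample> set;
    set.reserve(negatives.size() + copies * positives.size());
    set.insert(set.end(), negatives.begin(), negatives.end());
    for (std::size_t c = 0; c < copies; ++c)
        set.insert(set.end(), positives.begin(), positives.end());
    return set;                  // to be shuffled before training
}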

In order to improve the general precision of the system, it was also decided to change when to stop the training, increasing the maximum number of epochs from 1000 to 10000, but relaxing the constraint on the MSE, setting MSEmax = 10^−3. Another option was to impose the above mentioned BitFail as the stop criterion for the training, instead of the maximum MSE. Again, the idea is to stop the training when the ANN “gives good enough results over the majority of samples”. For this first set of configurations, the following parameters were used: BitFaillim = 0.05, BitFailmax = 0.01.

Table 7.6: Results obtained with the replication of the positive set during the training

Name             Positive set multiplier   Stop criterion   BitFailIC/BitFailEC   MSE        BitFail
2Dn-xa-10k       auto                      Err                         4.247788   0.002293   0.019132
2Dn-x85-10k      85                        Err                         4.256757   0.001645   0.020421
2Dp-x85-10k      85                        Err                         4.439024   0.001933   0.020767
2Dn-x100-10k     100                       Err                         4.793912   0.004381   0.023967
2Dn-xa-bf1-10k   auto                      BitFail                     4.993485   0.005981   0.016461
2Dn-x50-10k      50                        Err                         6.011928   0.001854   0.018326
2Dp-x50-10k      50                        Err                         6.307885   0.001705   0.019938
2Dp-xa-10k       auto                      Err                         6.520548   0.003573   0.021757
2Dp-xa-bf1-10k   auto                      BitFail                     7.000000   0.002044   0.013537
2Dp-x10-10k      10                        Err                        12.009531   0.001658   0.016818
2Dn-x10-10k      10                        Err                        12.303030   0.002226   0.018280
2Dn              —                         Err                        40.465753   0.000687   0.007413
2Dp              —                         Err                        50.316832   0.001209   0.007425
beta-h64         —                         Err                        69.307985   0.000493   0.005595

Some results are reported in Table 7.6, compared with the previous best ones. As expected, in general, multiplying the positive training set leads to better results, because the ratio between the intra-class and extra-class BitFail decreases by an order of magnitude. This is true only up to a certain amount, since it seems that using too high multipliers could be counterproductive (see 2Dn-x100-10k). The automatic way of equalizing the training set is not always so effective, although it offers a general improvement, and the best configuration in this experiment uses this option. The BitFail criterion as the stop condition does not lead to better results than using the MSE (see 2Dn-xa-10k and 2Dn-xa-bf1-10k, which differ only in the stop criterion), but it still offers an opportunity to create different configurations.

As a final remark we can state that, generally speaking, the technique of replicating part of the training set leads to a worse global MSE and BitFail. Given that the idea is mainly to focus on the positive outputs, the learning process will take “less care” of the results on the other input data. Anyway, looking for example at the two best configurations (2Dn-xa-10k and 2Dn-x85-10k) with respect to the “original” one (2Dn), we can see that we have obtained an improvement of one order of magnitude in the intra/extra-class BitFail ratio, against a worsening of the global MSE/BitFail by a factor of 2.4–3.3.

This should not be seen as a problem, since the algorithm used to choose the winning ANN is based on thresholds that are some orders of magnitude greater than the MSE. The purpose of this system is classification and not a regression analysis, for which the global MSE could be the most important performance measure. The same reasoning can be applied to the BitFail parameter: in these tests the limit used was BitFaillim = 0.05, which is significantly smaller than the aforementioned thresholds. This does not mean that such information is useless, but it has to be used in the right way, i.e. simply considered as a term of comparison to understand how to design new and possibly better classifiers, leaving the task of determining their absolute performances to a final comparison.

Results on the target platform

All the studies about the ANNs made so far have been done on the development platform, which is a common PC, and hence with all the desired resources at our disposal, including the possibility of using floating-point arithmetic without incurring any penalty5. However, on the target platform, where the FPU is absent (see Chapter 5), it is better to switch to fixed-point arithmetic (as already discussed in Section 6.2): according to the results shown in Table 6.11, this makes the recognition subsystem more than four times faster.

There are two ways to obtain fixed-point ANNs: the first one is to design a learning algorithm, possibly derived from the classical ones, that specifically uses this arithmetic. The second one still relies on a standard training process, with floating-point numbers, and then converts the resulting ANNs to the desired type. The former solution has some advantages, the most important one being that it addresses the limits of such arithmetic (restricted precision and range) directly, thus being able to overcome them. On the other hand, it could require modifying the training algorithm to obtain good results, as done in [46]. In general it is not guaranteed that traditional learning methods (such as the ones used in this study) are still effective. Instead, converting the resulting floating-point ANNs into fixed-point ones is a pretty straightforward step, and in our case it is directly handled by the library (FANN [126], see Section 7.3 for further details) used for the whole neural network subsystem. Nevertheless, their behaviour still has to be checked, in order to confirm the validity of the reasonings previously made, both about the performances on specific data subsets and about the new training strategies.
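
As a minimal sketch of this second approach, assuming the conversion utilities offered by FANN [126] (file names are only examples), the step looks as follows; on the target the application, compiled against the fixed-point flavour of the library, simply loads the converted configuration file.

    #include "floatfann.h"

    int main(void)
    {
        /* ANN trained with the standard floating-point backpropagation */
        struct fann *ann = fann_create_from_file("example_net_float.net");

        /* FANN inspects the weights, chooses the position of the decimal point
           and writes a configuration file usable by the fixed-point runtime */
        int decimal_point = fann_save_to_fixed(ann, "example_net_fixed.net");

        /* the same decimal point must be used to convert the input data,
           e.g. with fann_save_train_to_fixed() for training/validation files */
        (void)decimal_point;

        fann_destroy(ann);
        return 0;
    }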

Converting the whole main application to the fixed-point arithmetic requires verifying that its behaviour and its performances remain the same. This has to be true both for the general data processing functions and for the ANN subsystem here examined. As we have seen in Section 6.2, all the other routines have been validated with the developed test tool, but it is not possible to use the same approach for the classifier component. It is very difficult to directly compare, sample by sample, the internal behaviour of ANNs with the two types of arithmetic and to demand always the same identical results. Our goal is the classification of feature vectors obtained from the character images, and as long as we have the same system performances, we will not take care of these intermediate discrepancies. As an example, Table 7.7 reports the same data as Table 7.6, but for the configurations running in fixed-point mode.

Table 7.7: Results obtained running the ANNs in fixed-point mode

Name                  Positive set   Stop        BitFail_IC/    MSE        BitFail
                      multiplier     criterion   BitFail_EC
-----------------------------------------------------------------------------------
2Dn-x85-10k           85             Err           9.827422     0.007479   0.024289
2Dp-xa-10k            auto           Err          11.858392     0.015720   0.030056
2Dp-xa-bf1-10k        auto           BitFail      13.623229     0.008746   0.018890
2Dp-x85-10k           85             Err          15.453351     0.005249   0.020744
2Dn-x10-10k           10             Err          17.092174     0.007803   0.023898
2Dn-x50-10k           50             Err          17.983584     0.005133   0.020433
2Dn-xa-10k            auto           Err          18.115016     0.002892   0.017520
2Dn-xa-bf1-10k        auto           BitFail      18.744966     0.008116   0.016784
2Dp-x10-10k           10             Err          20.282318     0.006714   0.019236
2Dp-x50-10k           50             Err          22.516245     0.002307   0.016174
2Dn-x100-10k          100            Err          26.314205     0.003074   0.016070
2Dn                   —              Err          54.720812     0.003854   0.011235
2Dp                   —              Err          76.851064     0.001582   0.008288
beta-h64              —              Err         110.444444     0.000561   0.005756

From Table 7.7 it is possible to see that, generally speaking, duplicating the positive set during the training will give better results, although, for example, setting the parameter to 10 is better than 50 but worse than 85. Both the global MSE and BitFail of each configuration are increased, thus it would be interesting to further investigate this phenomenon by examining the conversion of a floating-point ANN into a fixed-point one.

The library used for dealing with ANNs has an automatic algorithm that, given a network, decides the best format to represent the weights. This choice is made considering the greatest number that could be computed, deduced from the weight values (assuming that the input data has an absolute value smaller than 1) and
avoiding to incur into an overflow. Therefore the place of the decimal point depends upon each ANN, and it is (in general) not the same for all the networks belonging to the same configuration. A higher value for the decimal point of course means a greater precision, since it represents the number of bits used for the fractional part. In practice, a value greater than 6 should be sufficient to achieve good results (as reported in the documentation [126] provided with the ANN library), but it is better to check it anyway. According to [46], there should not be any significant performance loss when using more than 8 bits to represent the weights, and in our case all the numbers are represented by a 32 bit word. Beside those considerations, we also have to verify that the number format used across the whole system does not cause a loss of accuracy in the conversion to the one required by the ANNs. All the other algorithms use a number type named i32q16 (as seen in Section 6.2), based on a signed 32 bit integer with 16 digits left for the fractional part, so an on-the-fly conversion of the input data is required. In order to better understand the differences in the execution of the ANNs with the two approaches, it is useful to highlight the average decimal point used for each configuration, along with the MSE and the BitFail (defined analogously to the one previously used) obtained from the comparison of the outputs in the two modes. To further explore whether switching from the floating-point arithmetic could lead to worse or better results, the ratio between the global MSE in fixed-point mode and the one in floating-point mode is reported, and of course the same for the BitFail performance value. These results, for the configurations already introduced, are reported in Table 7.8 (the minimum and maximum decimal point are omitted here for brevity).
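
A minimal sketch of the on-the-fly conversion mentioned above, assuming a simple bit-shift between the two formats (the helper name is hypothetical, not the actual application code):

    #include <stdint.h>

    typedef int32_t i32q16;   /* signed 32 bit integer, 16 fractional bits */

    /* Converts a value from the system-wide i32q16 format to a representation
       with 'decimal_point' fractional bits (the value chosen by FANN for each
       network, cf. the "average decimal point" column in Table 7.8). */
    static inline int32_t to_ann_fixed(i32q16 value, int decimal_point)
    {
        int shift = 16 - decimal_point;          /* e.g. 16 - 9 = 7 */
        return (shift >= 0) ? (value >> shift)   /* drop the extra fractional bits */
                            : (value << -shift);
    }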

From the data in Table 7.8 it is possible to state that the difference shown by a direct comparison of the two modes is not so great, but, looking at how the global MSE changes, this seems to be more affected than the BitFail by the numeric conversion. This operation can be seen, from the mathematical standpoint, as the introduction of some sort of noise, and thus it directly affects the MSE. Conversely, if we consider the BitFail (proportional to the amount of outcomes significantly different from the desired values), its effect is not so relevant. We have to remember that, for the classification task, it could be better to look at how the BitFail changes, rather than the MSE, and in some cases the BitFail is even smaller after the conversion. These results are important, since they lead to the conclusion that converting the ANNs should not worsen the system performances too much.

Table 7.8: Results of the comparison of the floating-point mode and the fixed-point one

Name              MSE        BitFail    Average          MSE_fxp/   BitFail_fxp/
                                        decimal point    MSE_fp     BitFail_fp
---------------------------------------------------------------------------------
beta-h64          0.000036   0.009819   10.035300        1.137898   1.028807
2Dp               0.000542   0.023552    7.705880        1.308712   1.116279
2Dn-xa-10k        0.001666   0.055117    9.647060        1.261065   0.915764
2Dp-x50-10k       0.002121   0.101566    8.635290        1.353014   0.811203
2Dn               0.003125   0.026016    8.082350        5.609700   1.515530
2Dp-x85-10k       0.004415   0.061920    8.588240        2.715226   0.998888
2Dn-x50-10k       0.005290   0.108415    9.376470        2.769122   1.114950
2Dn-x100-10k      0.005376   0.108611    9.400000        0.701675   0.670511
2Dn-x85-10k       0.006873   0.104777    9.411760        4.546512   1.189405
2Dp-x10-10k       0.006895   0.092886    8.152940        4.049279   1.143737
2Dn-x10-10k       0.007990   0.105905    8.929410        3.504871   1.307305
2Dp-xa-bf1-10k    0.008568   0.064372    8.541180        4.278616   1.395405
2Dn-xa-bf1-10k    0.012345   0.064338    9.317650        1.356976   1.019579
2Dp-xa-10k        0.013857   0.057569    8.611760        4.399913   1.381484

Further results and final considerations

In the previous sections we have defined new network topologies and new evaluation parameters alternative to the classical MSE, we have investigated how these ANNs perform on specific data subsets and we have then established new training strategies. In the end we have also verified their behaviour using the fixed-point arithmetic instead of the classical floating-point one. All the considerations made were geared towards the design of new ANN configurations among which to choose the best one, since it is not possible to state a priori which change improves the results.

After having applied all those variants to the same new network topology, the natural question is whether these efforts are also useful to improve other ANNs. For example we can re-consider networks based on the previous topologies, which exhibited better performances in a “classical” training (i.e. trained with the MSE as the goal and without a partial replication of the training set). In particular, they can be applied to the beta-h64 base configuration which, in the first tests, had been chosen as the best one. Table 7.9 shows that the answer to the question is positive (2Dn-x85-10k has been included for direct comparison purposes), by presenting the same type of results shown in Table 7.7. Also with this configuration we have the same trends found for the 2Dn and 2Dp families, although now replicating the positive set 100 times seems to give better results. Even the tests on the conversion from the floating-point to the fixed-point arithmetic confirm the same conclusions stated above (see Table 7.8), as we can see in Table 7.10.

Table 7.9: Results improvement obtained from the beta-h64 base configuration (fixed-point mode)

Name                    Positive set   Stop        BitFail_IC/    MSE        BitFail
                        multiplier     criterion   BitFail_EC
--------------------------------------------------------------------------------------
beta-h64-x100-10k       100            Err           7.645756     0.001444   0.020421
beta-h64-x85-10k        85             Err           8.134280     0.001363   0.019558
beta-h64-xa-bf1-10k     auto           BitFail       9.807568     0.000881   0.011891
2Dn-x85-10k             85             Err           9.827422     0.007479   0.024289
beta-h64-x50-10k        50             Err          11.915822     0.001075   0.016864
beta-h64-x85-bf2-10k    85             BitFail      13.600000     0.000536   0.005618
beta-h64-x10-10k        10             Err          23.797217     0.000845   0.014861
beta-h64-10k            —              Err          54.186235     0.001330   0.014033
beta-h64                —              Err         110.444444     0.000561   0.005756

Table 7.10: Results of the comparison of the floating-point mode and the fixed-point one, for the beta-h64 configuration family

Name                    MSE        BitFail    Average          MSE_fxp/   BitFail_fxp/
                                              decimal point    MSE_fp     BitFail_fp
---------------------------------------------------------------------------------------
beta-h64-x85-bf2-10k    0.000024   0.012363   10.952900        0.991726   0.919021
beta-h64-xa-bf1-10k     0.000030   0.023978   11.576500        0.924307   0.933992
beta-h64-x50-10k        0.000035   0.031150   11.447100        0.961339   0.937301
beta-h64                0.000036   0.009819   10.035300        1.137898   1.028807
beta-h64-x10-10k        0.000040   0.028065   11.211800        1.021928   0.895907
beta-h64-x100-10k       0.000043   0.040210   11.635300        0.920879   0.943116
beta-h64-10k            0.000045   0.032785   11.105900        1.139736   0.883976
beta-h64-x85-10k        0.000509   0.082249   11.564700        0.935549   0.965341
2Dn-x85-10k             0.006873   0.104777    9.411760        4.546512   1.189405

Examining those results it is noteworthy that beta-h64-x85-bf2-10k has both the lowest MSE and the lowest BitFail among all the configurations tested in this study. Moreover it has the best MSE ratio between the floating-point mode and the fixed-point one, while its BitFail ratio is the runner-up. This should not lead to the conclusion that it is the best configuration, but of course this information is a good indication of the quality of its performances when used as a classifier. In the next section we will see how to choose the best one without any doubt.

7.2.3 Classifier selection

All the results presented until now do not clearly show which solution could be the best one: a final test is required. Those results only give an idea of an “average behaviour” of the ANNs of each configuration, but they do not consider the algorithm used to choose the winner (and hence the recognized symbol) inside the set. As stated previously, the strategy chosen is the winner-takes-all approach, and the only substantially interesting test is a classification one. This means trying to properly recognize the validation set, identifying each character image. The most important parameter is the percentage of correct recognitions, but it is also useful to distinguish between the wrong classifications and the so-called “unknown” ones. The winner-takes-all gives an answer if and only if some conditions are met, otherwise it gives no result at all. For our purposes, this behaviour is useful, since in such cases the answer could be supplied by a simple prediction system or, even better, by a spell-checker. The best configurations with respect to this final test are compared in Table 7.11. In order to also verify that training the ANNs with a limit of 10000 epochs is better than with 1000, the networks have been tested with the two options: the former are the ones with the suffix -10k, while the latter have the suffix -1k. This additional test has been done because, in some particular cases, the ones with a “shorter” training could exhibit a better intra/extra-class BitFail ratio (as can be seen in the beta-h64-x85-bf2-1k vs beta-h64-x85-bf2-10k case). Thus in the final set we have 36 different configurations, but Table 7.11 reports only some of them (for the complete results see Section A.2), in order to give an idea of the improvement from the first ANNs.
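
For illustration purposes only, the following sketch (hypothetical names and types, not the actual implementation) summarizes the winner-takes-all decision: the ANN with the highest output wins only if that output exceeds a minimum threshold and beats the runner-up by a minimum margin (the MinThr and DiffThr parameters discussed later in this section); otherwise the outcome is reported as “unknown”.

    #include <cstddef>
    #include <vector>

    struct WtaResult {
        bool  known;    // false -> "unknown" outcome, no reliable answer
        int   symbol;   // index of the winning ANN (i.e. the recognized class)
        float score;    // output of the winning ANN
    };

    // outputs[i] is the collected output of the ANN trained for symbol i;
    // the vector is assumed to be non-empty.
    WtaResult WinnerTakesAll(const std::vector<float> &outputs,
                             float MinThr, float DiffThr)
    {
        std::size_t best = 0;
        std::size_t second = outputs.size();          // "not assigned yet"
        for (std::size_t i = 1; i < outputs.size(); ++i) {
            if (outputs[i] > outputs[best]) {
                second = best;
                best = i;
            } else if (second == outputs.size() || outputs[i] > outputs[second]) {
                second = i;
            }
        }
        float diff = (second < outputs.size()) ? outputs[best] - outputs[second]
                                               : outputs[best];
        WtaResult r;
        r.symbol = static_cast<int>(best);
        r.score  = outputs[best];
        r.known  = (outputs[best] >= MinThr) && (diff >= DiffThr);
        return r;
    }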

From these results we can see that the proposed new topology, which is the base of the 2Dn and 2Dp configuration families, does not give the expected results, since the best one among them (2Dp-x50-10k) is by far inferior even with respect to the first configuration we have used (alpha). Nevertheless the observations made while trying to improve them, in particular the use of the BitFail as training goal and the replication of the positive set during the learning process, have been useful to develop a solution (beta-h64-x85-bf2) that exhibits not only the best correct recognition performances, but also the lowest percentage of both wrong and unknown results. Its parameters are shown in Table 7.12 (the parameters of all the other configurations are reported in Section A.1).

Table 7.11: Final classification results (%) of the best configurations (running in fixed-point mode)

Name                    Correct    Wrong      No
                        results    results    results
------------------------------------------------------
beta-h64-x85-bf2-10k    97.0646    0.3914      2.544
beta-h64-xa-bf1-10k     96.184     0.5871      3.229
beta-h64-x100-10k       95.5969    0.9785      3.4247
beta-h64-x50-10k        95.5969    0.9785      3.4247
beta-h64-x10-10k        95.3033    0.8806      3.816
beta-h64-x85-10k        95.3033    0.8806      3.816
beta-h64                94.1292    0.3914      5.4795
alpha                   93.8356    0.4892      5.6751
beta-h64-x85-bf2-1k     93.5421    0.5871      5.8708
2Dp-x50-10k             85.1272    2.6419     12.2309
alpha-full              84.9315    2.0548     13.0137
2Dp                     83.4638    1.1742     15.362
2Dn-xa-10k              82.4853    4.5988     12.9159
beta                    74.364     2.9354     22.7006
2Dn                     68.591     5.0881     26.3209
2Dn-x85-10k             52.6419    8.2192     39.1389

Table 7.12: Parameters of the best configuration

Parameter                 Value
---------------------------------------------
Name                      beta-h64-x85-bf2-10k
Input type                16x16-n
Input size                256
Layers                    3
Neurons                   323
Connections               897
Stop criterion            BitFail
MSE_max                   0.001
BitFail_lim               0.025
BitFail_max               0.01
Epochs_max                10000
Positive set multiplier   85

According to this test, at a first glance we can be sure to have found an almost ideal classifier for our purposes. But besides the results, we also have to consider how they have been obtained. The test setup is fundamental to draw conclusions, and in the pattern classification field this means evaluating how the training and the validation set have been determined. The whole data set has been initially formed by a first set of equally distributed character images (with different sizes and fonts) and progressively expanded by saving the bitmaps that were not correctly recognized. This should ensure the representativity of the whole problem domain, since they are obtained with the same application we are developing and hence they are real-world examples. As a next step we separated the training set and the validation one, in order to allow a fair comparison between different solutions. Using only the final error calculated on the training set is not significant, since in case of overfitting [114] we will have poor performances in a real scenario. What we want to achieve is a good estimate of the generalization ability of the proposed solutions, i.e. a value that gives an idea of how the ANNs will behave when previously unseen patterns are classified. This is the main reason for the split of the whole data set into a training set and a validation one. The errors obtained on them could even be divergent [115]: if we insist too much on the learning, the training error will almost certainly decrease, but the corresponding validation value, from a certain point on, will start to increase. This is an example of the effect of the aforementioned overfitting issue. Having a long training does not always lead to better results. For this reason all the configurations belonging to the beta-h64 family have been trained with a maximum epoch limit of both 10000 and 1000.

Table 7.13: Final classification results (%) of the best configurations (running in fixed-point mode) with a 10% training set and 90% validation set

Name                    Correct    Wrong      No
                        results    results    results
------------------------------------------------------
beta-h64-x85-bf2-10k    78.6572    4.4541     16.8886
beta-h64-x50-10k        77.9585    3.6354     18.4061
beta-h64-x100-10k       77.8384    4.5852     17.5764
beta-h64-x85-10k        77.631     4.7271     17.6419
beta-h64-x85-bf2-1k     77.5       4.4978     18.0022
beta-h64-xa-bf1-10k     77.4454    3.8865     18.6681
alpha-full              75.3384    6.8886     17.7729
beta-h64-x10-10k        74.1157    3.2314     22.6528
alpha                   72.8384    4.345      22.8166
2Dn-x85-10k             72.6092    3.9847     23.4061
beta                    71.7904    3.1659     25.0437
beta-h64                71.0044    2.7511     26.2445
2Dn-xa-10k              70.5459    3.5262     25.9279
2Dn                     64.8362    2.4236     32.7402
2Dp-x50-10k             61.5939    9.9672     28.4389
2Dp                     33.4934   28.0786     38.4279

The simplest way to evaluate the generalization properties is the holdout method, which is the one used up until now, using the results on the validation set
as an estimate of them. This does not completely answer the original question, i.e. how they will perform in a real scenario, because not all the data set can be used for the validation and thus we have incomplete information. At the same time we would like to have as many samples as possible available for the training, but this is in contrast with the validation needs. Indeed the main issue is how to properly split the whole data set, establishing the right proportions. In this respect there is no best choice, thus we have to make some trade-offs: with the current division (90% training, 10% validation) we have privileged the learning process, in order to be sure to have well trained ANNs. On the other hand we could not be so sure about their generalization properties, given that the validation set is not as large as we would want. The opposite reasoning would suggest reversing the proportions, in order to have a better idea of their possible real behaviour. The results of this test are shown in Table 7.13: the best configuration is always beta-h64-x85-bf2-10k, but now the percentage of right results has considerably decreased, down to about 79%. Also the percentage of wrong classifications is no longer the lowest one (it is higher than many others, but still a low value). The amount of cases where the classifier is not able to give a reliable output (the last column) is a significant fraction, about 17%. This could be due to a low generalization capability, but since the same trend is verified across all the solutions, we could also hypothesize that it is so because of the small training set. As already pointed out in [118], the training set should be made larger as the ANNs get more complex, and, in our case, having only 10% of the samples could be insufficient.

Another possible reason for the not so good results in the above case could be the choices made about the thresholds in the winner-takes-all algorithm. As explained previously, there is a minimum threshold, MinThr, on the best value, and a threshold, DiffThr, on the difference between the first and second highest values. Until now those parameters have been set to MinThr = 0.7 and DiffThr = 0.1, mainly based on experience rather than rigorous tests. In order to evaluate how they affect the results, some tests were conducted with the same ANNs developed in the last experiments. The results are shown in Table 7.14 and they regard only the best configuration, i.e. beta-h64-x85-bf2-10k. In general we can see that increasing both those parameters makes the wrong results percentage decrease, but also the right results one. Thus it is better to sum them up in more compact tables, and to see how those variations are related to the increments of either MinThr (Table 7.15) or DiffThr (Table 7.16). For each row in those tables, the relative decrements, i.e. with respect to the previous value, are reported. Looking at the first one, we can see that all the variations increase as MinThr grows, thus making its ideal value 1.0 (from the wrong results standpoint), but this is clearly unfeasible. Actually we have always used 0.7, but 0.8 could be a better choice since it still offers a good decrement (25%) on wrong results with a low degradation on the right results side (less than 5%). Looking instead at the influence of the DiffThr parameter on the results, we can see that setting it to a value different from zero is essential, but as we increase it, the relative decrement of the wrong results decreases, while the one about the right results is constant. The ideal value is again 1.0, but it is unfeasible. Anyway we could set it to 0.2, given that the relative improvement on erroneous outputs is still more than 10%. In synthesis, these results give us some advice on how to modify the classifier behaviour in order to have more right results with the risk of having also more wrong ones, or fewer errors but with a lower rate of correct recognitions. It is very difficult to state which trade-off is better. Actually we have selected two of them, the first one with MinThr = 0.7 and DiffThr = 0.1, and the other with MinThr = 0.8 and DiffThr = 0.2. The former has been used throughout all the tests, except these specific ones, while the latter has been established in the above discussion. In any case, the real impact of the two solutions strongly depends also on the post-recognition step, where recognized symbols are put together to form words and then eventually corrected by a spell-checker (as we will see in Section 7.4), but this kind of test cannot be done at the current development stage. For these reasons, in the next steps, when talking about how to select the best configuration, the first setup will be used.

Table 7.14: Classification results (%) of the beta-h64-x85-bf2-10k configuration, obtained by varying both MinThr and DiffThr in the winner-takes-all algorithm

MinThr   DiffThr   Correct    Wrong      No
                   results    results    results
--------------------------------------------------
0.5      0         82.4       6.5        11.1
0.5      0.1       81.2       5.5        13.3
0.5      0.2       80.3       4.8        14.9
0.5      0.3       79.2       4.4        16.4
0.6      0         81.3       5.8        12.9
0.6      0.1       80.1       4.9        15
0.6      0.2       79.2       4.4        16.4
0.6      0.3       78.1       4.1        17.8
0.7      0         79.8       5.2        15
0.7      0.1       78.7       4.5        16.9
0.7      0.2       77.9       4          18.1
0.7      0.3       76.9       3.7        19.4
0.8      0         77.3       4.2        18.4
0.8      0.1       76.2       3.6        20.2
0.8      0.2       75.5       3.2        21.4
0.8      0.3       74.7       3          22.4
0.9      0         72.6       3.2        24.2
0.9      0.1       71.6       2.6        25.9
0.9      0.2       70.9       2.2        26.8
0.9      0.3       70.2       2.1        27.7

Table 7.15: Performance variations of beta-h64-x85-bf2-10k with respect to the MinThr parameter

MinThr   Relative wrong       Relative right
         results decrement    results decrement
-------------------------------------------------
0.5      —                    —
0.6      10%                  1%
0.7      11%                  2%
0.8      25%                  3%
0.9      36%                  6%

Table 7.16: Performance variations of beta-h64-x85-bf2-10k with respect to the DiffThr parameter

DiffThr   Relative wrong       Relative right
          results decrement    results decrement
--------------------------------------------------
0.0       —                    —
0.1       19%                  1%
0.2       13%                  1%
0.3       8%                   1%

In the previous experiments we have seen the importance of having a large validation set in order to have accurate results about the generalization, which is the measure we are interested in. The technique actually used is the holdout one, but there are also others, such as the random subsampling, the k-cross-validation and the bootstrap method [114]. The first one is certainly an improvement over the holdout, because the results are averaged over a set of different selections of the training/validation set, but nevertheless it still does not use all the available samples to measure the performances. The last one does not impose a perfect separation between the training and validation set: when a sample is selected to be part of the training set, it is not discarded from the original data set, thus it could be redrawn (sampling with replacement). Clearly, when measuring the performances with this method, special attention has to be paid to this aspect. With the k-cross-validation technique, the data set is divided into k equally partitioned subsets, and the learning
process is repeated k times, each time using a different subset for the validation and all the remaining ones for the training. Thus we have k sets of results, which will be averaged. There are two main variants to this schema. The first one sets the parameter k to the size of the original data set, and it is called the leave-one-out method, but in practice it is unfeasible due to the excessive amount of time required by the trainings (remember that in our case we have more than 10000 samples). The second one is named stratified k-cross-validation, and it simply adds a constraint by which each subset has to maintain the same proportions among the different classes with respect to the original data set. It is noteworthy that this concept has already been applied to the holdout method used up until now.
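
A minimal sketch of a stratified partition (illustrative code, not the actual tool described in Section 7.3): samples of each class are dealt round-robin to the k folds, so every fold keeps approximately the class proportions of the original data set.

    #include <cstddef>
    #include <map>
    #include <vector>

    typedef int ClassId;
    typedef int SampleId;

    // byClass maps every character class to the list of its sample identifiers.
    std::vector<std::vector<SampleId> >
    StratifiedFolds(const std::map<ClassId, std::vector<SampleId> > &byClass,
                    std::size_t k)
    {
        std::vector<std::vector<SampleId> > folds(k);
        std::map<ClassId, std::vector<SampleId> >::const_iterator it;
        for (it = byClass.begin(); it != byClass.end(); ++it) {
            const std::vector<SampleId> &ids = it->second;
            for (std::size_t i = 0; i < ids.size(); ++i)
                folds[i % k].push_back(ids[i]);   // round-robin within the class
        }
        return folds;
    }

Fold j is then used for the validation while the remaining k-1 folds form the training set; the procedure is repeated for j = 0 .. k-1 and the results are averaged (k = 2 in the tool described in Section 7.3).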

It is not easy to evaluate which of the above techniques is better than the others, but there are two main measures that give an idea of how the results coming from those tests are related to the real-world behaviour of the proposed solutions: the bias and the variance [127, 115]. The former is a measure of how much the estimate is distant from the real value, while the latter gives an idea about the accuracy of the resulting estimate. Usually the random subsampling is not used, since it is affected by the same problems as the holdout method, and k-cross-validation or bootstrap are preferred. In [127] those two techniques have been compared: the first one has in general a slightly higher variance, especially when k is small, while the second one has a low variance but an extremely large bias in some cases. It is also reported that stratification helps to obtain even better results in cross-validation, and the suggested method is thus the stratified ten-fold cross-validation.

As we will see in the next section, a custom software has been developed in order to study all the concerns discussed above about the ANNs and the classification performances. It was originally developed to handle only the holdout method, both for simplicity and to have a first set of results in a short time (as it is easy to argue, all the other methods are much more time consuming). Later it has been modified to support the stratified k-cross-validation, although currently only with k = 2. As seen above, this is not the best solution, nevertheless it still gives a good idea about the real generalization error, because now the results are calculated on all the samples in the original data set, overcoming the crucial dilemma about the proportions of the training/validation set previously discussed. Moreover we are able to compute the confidence interval through the formula [127, 114]:

\[
\frac{2\,h \cdot acc_h + z^2 \;\pm\; z \cdot \sqrt{4\,h \cdot acc_h + z^2 - 4\,h \cdot acc_h^2}}{2\,(h + z^2)}
\]

where h is the size of the validation set, acc_h the estimated accuracy, and z a value depending on the desired confidence level. It is clear that as h grows, the confidence interval gets closer to the estimated value, thus increasing the reliability of those results. Using the cross-validation, h is the size of the whole data
set [127], and for a confidence level of 99% z = 2.58 [114]. Now we have all the information to calculate the results presented in Table 7.17. The best configuration is again beta-h64-x85-bf2-10k, which has the highest percentage of correct recognitions (about 95%), the second lowest error rate (1%) and the lowest amount of no output at all (about 4%). In general we can see that those results are similar to the ones previously found in Table 7.11, but now they are computed in a more rigorous way, thus leading to more accurate estimates.
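
As a worked example of the formula above (purely illustrative code; h = 10000 is only an approximation of the actual data set size), the bounds computed for the best configuration come out close to the interval reported in Table 7.17:

    #include <cmath>
    #include <cstdio>

    // Lower (sign = -1) or upper (sign = +1) bound of the confidence interval.
    double ConfidenceBound(double acc, double h, double z, int sign)
    {
        double root = std::sqrt(4.0 * h * acc + z * z - 4.0 * h * acc * acc);
        return (2.0 * h * acc + z * z + sign * z * root) / (2.0 * (h + z * z));
    }

    int main(void)
    {
        // acc = 94.7208% measured on about h = 10000 samples, z = 2.58 (99%)
        double acc = 0.947208, h = 10000.0, z = 2.58;
        std::printf("%.4f%% - %.4f%%\n",
                    100.0 * ConfidenceBound(acc, h, z, -1),
                    100.0 * ConfidenceBound(acc, h, z, +1));
        return 0;
    }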

Table 7.17: Final classification results (%) of the best configurations (running in fixed-point mode) with the stratified two-fold cross-validation (the confidence level is 99%)

Name                    Correct    Confidence         Wrong      No
                        results    interval           results    results
--------------------------------------------------------------------------
beta-h64-x85-bf2-10k    94.7208    94.1189–95.2641     0.9929     4.2863
beta-h64-x50-10k        92.7055    92.012–93.3432      1.4746     5.8199
beta-h64-x85-10k        92.5875    91.8891–93.2301     1.4845     5.928
beta-h64-x100-10k       92.5186    91.8175–93.1642     1.5434     5.9379
beta-h64-xa-bf1-10k     92.499     91.7971–93.1454     1.2977     6.2033
beta-h64-x10-10k        92.0763    91.3575–92.7401     1.2977     6.626
beta-h64-x85-bf2-1k     91.3685    90.6228–92.0601     1.6418     6.9898
beta-h64                90.1494    89.3607–90.8857     0.757      9.0936
2Dn-x85-10k             86.2761    85.3721–87.1326     2.153     11.571
alpha-full              86.2072    85.3014–87.0657     2.9591    10.8336
alpha                   86.1778    85.2712–87.037      2.4675    11.3547
beta                    63.1046    61.8621–64.33       7.8155    29.0798
2Dn                     61.4923    60.2404–62.7292     9.7424    28.7652
2Dp-x50-10k             61.1974    59.9439–62.4363     9.6441    29.1584
2Dp                     47.6406    46.3649–48.9193    16.575     35.7845
2Dn-xa-10k              20.8907    19.8699–21.9495    19.2685    59.8407

In conclusion, in these sections we have studied and designed the classifier (named beta-h64-x85-bf2-10k) based on ANNs that, together with the feature extraction routine defined in Section 7.1, is the core of the character recognition subsystem. It has an accuracy of about 95%, with only a 1% chance of wrong results. This has been measured through the stratified two-fold cross-validation, which is one of the most accurate ways of computing those results. The proposed solution has been tested with the fixed-point arithmetic, which is the one that will be used in the final application, thus taking into account almost every practical aspect. From the run-time standpoint the execution time has been reduced from an initial value of
1268.5 ms to a final one of 78.05 ms, with an improvement of +1516% (see Tables 6.11 and 7.2 for further details6). It also has to be remarked that not all the results have been presented in this chapter; see Appendix A for a complete set of tables regarding the performances of the developed ANN configurations. In the next section we will talk about how these results have been achieved, since it has been preferred to develop custom tools in order to follow the same path traced in Chapter 6.

6In the second table the performances of the beta-h64 configuration have been measured; it is not the same configuration that has been chosen as the best one, but from the run-time performance standpoint there is no difference, since they have the same topology.

7.3 Implementation notes

The design and development of the character recognition subsystem has been studied in parallel with the development of the main application. It has required setting up an ad-hoc set of tools, in particular to deal with ANNs. There are basically two options: the former is to use a well known off-the-shelf application that is also able to work with ANNs, while the latter requires developing a set of custom tools specifically tailored around our needs. The first way, at a first glance, could be preferable, since it would allow us to focus only on the research goals. We would not have to cope with the implementation details, and we would have all the flexibility given by an environment designed to solve numeric (and not only numeric) problems. We could do all the studies with it and then in the end export all the data to the target application. One issue with this approach is exactly this, since almost certainly we would anyway have to write a tool able to arrange those data in a format suitable for our application, and this should be done both for the development platform and for the mobile one. Then we should test that everything is working as expected and in particular with the same performances obtained by the ANN development tools. In these cases, being bound to a third-party library (and hence usually a closed source tool), or re-implementing the algorithms, could be a problem, due to the risk of introducing bugs or discrepancies. In any case it would be difficult to guarantee the same results on each platform. In synthesis, the ease of studying ANNs, according to these choices, is counterbalanced by the subsequent burden required to integrate those solutions in the final application.

The second option requires setting up an ad-hoc system to do everything: the same code base will be used both to investigate and improve the ANNs and to use them in the target system. As has been done both for the hardware part and the software one, we can look around to find a ready-made suitable solution or design a new one from scratch. Discarding commercial solutions, because of the aforementioned issues, we can search for open source libraries. In this field there are some alternatives, but the most complete one seemed to be FANN (Fast Artificial Neural Network [126]). It offers a full set of functionalities to handle training/test data, the ANNs themselves, different training opportunities and, most importantly, an already developed set of utilities to make the networks run in fixed-point mode. This last option has been designed to achieve the same goal as ours, i.e. an application targeted to the mobile world with an ANN subsystem. In this way we have a full-fledged solution upon which to build the character recognition subsystem. The library has not been designed to run on a particular platform, thus the porting step has been pretty straightforward. Anyway, it has been heavily modified to support at the same time, in the same executable, both the fixed-point and the floating-point arithmetic. Its rich set of C APIs has first been wrapped through C++ classes, then integrated in the RcgnNNet class, and this is managed by the hvImgRcgn<> one, thus following the modularity guideline explored in Section 6.2. All those components have then been exported to the Lua scripting language [95, 96] (thanks to the Luabind library [98]), with the aim of both improving the system extensibility and opening the way to the development of the aforementioned custom set of tools. These choices are similar to the ones taken to design some of the test tools discussed in the previous chapter, where Lua has been used as a “glue language” to connect different already developed components. In this way it is guaranteed that there is no discrepancy between the development tools and the final ones. The resulting ANNs can be used without any conversion and/or loss of performances.
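
As an illustration of the export step, the following sketch shows how a C++ class can be exposed to Lua through Luabind; the RcgnNNet interface shown here is hypothetical and heavily simplified, not the real one.

    #include <string>

    extern "C" {
    #include <lua.h>
    }
    #include <luabind/luabind.hpp>

    // Hypothetical, simplified interface: the real RcgnNNet class wraps the
    // FANN C API and is managed by hvImgRcgn<>.
    class RcgnNNet {
    public:
        RcgnNNet() {}
        bool Load(const std::string &configFile) { /* load the ANN set */ return true; }
        double Output(int index) const { return 0.0; }
    };

    void RegisterRcgnNNet(lua_State *L)
    {
        using namespace luabind;
        open(L);   // initialize Luabind for this Lua state
        module(L) [
            class_<RcgnNNet>("RcgnNNet")
                .def(constructor<>())
                .def("Load", &RcgnNNet::Load)
                .def("Output", &RcgnNNet::Output)
        ];
    }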

Figure 7.6: Overall architecture of the developed software for training/evaluating the ANNs

The decision to implement the data-intensive functions in C++ and all the rest
through a dynamic scripting language has proved to be a winning solution. The whole environment has been developed progressively, as the results shown in the previous sections gave some hints on how to improve or implement new functionalities. Anyway the leading idea has been, since the beginning, to design a tool that takes as input the source character image repository and the list of the desired ANN configurations, and gives as output both the resulting networks and a set of report tables that compare the performances of the proposed solutions. The overall architecture is shown in Fig. 7.6. From a practical standpoint, the software has to partition the input data set into the training and validation ones, to generate the various ANN topologies which will be used for the learning process described in the configuration, to validate the resulting networks and in the end to output all the desired reports. The first task, given the desired set proportions, is pretty straightforward, and by default it generates stratified partitions, i.e. each of them has the same internal composition as the original one, in terms of proportions between different classes. Then all the character images are elaborated and transformed into training/validation files (compatible with the FANN library format), according to the description of all the feature extraction methods that will be used. This has been necessary in order to speed up the ANN training, which is clearly the most time-consuming task. Having those data pre-computed is a meaningful help, given that most of the configurations share the same feature extractor type.

The target ANN configuration descriptions are kept in a couple of files separate from the main ones, and they are easy to set up because of the good data description facilities of the Lua language. Indeed, from the software user standpoint, it is not necessary to know either the working details or the technical solutions adopted (including the development languages used). The end user just has to describe the classifier properties and then launch the main executable. After the learning stage, the ANNs are automatically run on the validation set, in order to measure their performances, and then converted to the fixed-point format; all the tests are then repeated. All the data are collected in the form of tables and saved as a set of CSV (Comma Separated Values) files, so they can be read by all mainstream spreadsheet applications. In the end they are put together in a comprehensive report file (in the PDF format), generated thanks to a small bunch of AWK scripts and LaTeX files.

This software has been used to manage the study of the whole character recognition subsystem: all the results presented in the above sections (apart from the ones regarding the execution times) are excerpts of the automatically generated reports. In particular the same code base has been used first to perform all the tests in the 90%/10% test scenario (the former refers to the training set and the latter to the validation one), then for the 10%/90% option and in the end for the 50%/50% one. This last one has not been explicitly mentioned before, since it simply offered the opportunity (and hence it was soon replaced) to establish the stratified two-fold
cross-validation method used to evaluate and to choose the final classifier configuration. More on the results obtained by this software can be found in Appendix A, where the full set of tables regarding this final test is presented.

7.4 Word-oriented features

In the previous sections we focused our attention on the concerns regarding the character recognition, which has been addressed by designing an image recognition system based on ANNs. Nevertheless, in the tests made with the whole application, thus considering also all the other processing steps that pre-elaborate the data and transform the input image into high level information (see the application data flow described in Section 3.2), the overall performances were not as high as expected. This is due to a set of reasons difficult to identify, and actually a great effort is still underway to understand and improve the global result quality. Going in this direction, and following the advice received during the Haptic Workshop (see Section 2.2), it has been decided to extend the recognition features by switching from a character-oriented system to a word-oriented one.

The first thing to do is to maintain a history of the previous results. As we have seen in Section 6.2, the character recognition is performed only when strictly needed, i.e. when the user moves the device over a new symbol. Thus the information regarding the tracking of the device can be used to store the useful data in a vector. Each time a new character is found, it is either added by expanding the data structure or compared with the already existing information and eventually corrected. This last opportunity is based on the assumption that the output value of the ANNs in the classifier can represent the confidence of the recognition, thus if we re-recognize an already stored symbol and the new one has a higher degree of confidence, it will take the place of the older one. Beside the characters, also the spaces between words have to be detected. This is simply performed by computing the distance between two successive elements and comparing it with a (configurable) threshold. It is a simple and quite effective method, even if it requires to be fine tuned in order to work properly. Indeed it relies on the tracking algorithm, briefly described in Section 3.2, which already groups the graphical elements into lines. All these functions are implemented in hvEngine::UpdateGuessContainer, pointed out in Section 6.2, whose purpose is to keep a hvGuessContainer structure synchronized with the recognition and tracking outputs.
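
The following sketch (hypothetical names, not the actual hvEngine code) illustrates the two ideas just described: a stored symbol is replaced only when it is re-recognized with a higher confidence, and a word break is inserted when the gap between two consecutive symbols exceeds a configurable threshold.

    #include <cstddef>
    #include <vector>

    // Simplified view of one recognized symbol along the tracked text line.
    struct Guess {
        int   symbol;       // recognized character
        float confidence;   // best ANN output obtained for it
        int   x;            // horizontal position reported by the tracking
    };

    void UpdateGuesses(std::vector<Guess> &line, const Guess &g, int spaceThreshold)
    {
        // Symbol already stored at this position: keep the most confident result.
        for (std::size_t i = 0; i < line.size(); ++i) {
            if (line[i].x == g.x) {
                if (g.confidence > line[i].confidence)
                    line[i] = g;
                return;
            }
        }
        // New symbol: a gap wider than the (configurable) threshold from the
        // previous element marks a word boundary.
        if (!line.empty() && g.x - line.back().x > spaceThreshold) {
            Guess space = { ' ', 1.0f, (line.back().x + g.x) / 2 };
            line.push_back(space);
        }
        line.push_back(g);
    }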

It is now quite clear how to add the desired word-oriented functionalities, since it is simply a matter of breaking the aforementioned hvGuessContainer into words and then eventually post-processing them. To accomplish this task there are two basic approaches: the former is to implement a prediction system, which could be run in parallel with the proper character recognition component, and the latter
means to add a spell-checker to the system. The first way is useful to exploit context information, i.e. information about the surrounding letters, to improve the general character recognition performances. The current system does not take into account whether the user is reading a text or a nonsensical sequence of symbols, but taking advantage of the correlations between character occurrences (a language-specific knowledge) could improve the global reliability. The second way is a proper a posteriori check, where there is a complete set of data about the word being read, and so it can be compared against a dictionary. This operation can be performed only after the whole word has been selected, and in general it is a more complex and time-consuming function, thus it has to be completely separated from the other (still useful) solution.

It can be observed that the character recognition system already has good performances, thus making the integration of a spell-checker the preferable option to improve the device functionalities. As it is easy to argue, it is quite difficult to design from scratch an effective spell-checker (both from the result quality and the run-time performance standpoint), and thus it has been preferred to re-use an existing solution. In this field there are already a lot of solutions, both commercial and not, since it is a needed functionality in many common applications, such as word processors, email clients and web browsers, to name a few. As for all the other libraries used (see Chapter 6), we have preferred open-source solutions, since they guarantee a greater flexibility and control over the final product. Looking around for an effective solution, a set of products have been examined: among them Ispell, Aspell and HunSpell seemed to be the most used ones. The first one can be considered an ancestor (but still used nowadays) of all the others; it is a good and reliable choice, but the others offer more features, while this one is only able to correct one error per word. Aspell has been designed to replace Ispell, extending its features and correction capabilities, even integrating a function that tries to suggest correct words based on their “phonetic similarity”. Anyway this last option is only available for the English language. HunSpell [128] is a spell-checker and morphological analyzer developed to replace the previous MySpell, and it has been designed to handle in a suitable way also complex languages, such as Hungarian (which gave the name to the product). It is actually developed for a great variety of platforms (mostly belonging to the Unix world), including Win32, and supports many languages. These are the features that led us to try it7, and it has proved to be an effective solution. As for many other libraries, a particular effort has been put into porting it to the WinCE platform, but in the end it worked in the main application without any problem.

7It is actually integrated in several well-established software products, like all the Mozilla products and the OpenOffice.org suite: also this point has been valuable in taking the decision.
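
A minimal usage sketch of the classic HunSpell C++ interface [128]; the dictionary files and the misspelled word are only examples.

    #include <cstdio>
    #include <hunspell/hunspell.hxx>

    int main(void)
    {
        Hunspell hs("it_IT.aff", "it_IT.dic");   // affix + dictionary files

        const char *word = "pariola";
        if (!hs.spell(word)) {                   // 0 means "not in the dictionary"
            char **sugg = 0;
            int n = hs.suggest(&sugg, word);     // ranked correction candidates
            for (int i = 0; i < n; ++i)
                std::printf("suggestion: %s\n", sugg[i]);
            hs.free_list(&sugg, n);
        }
        return 0;
    }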

With the above described functions, the software is now able not only to recognize single characters, but also to handle entire words, as requested by the end users in the Haptic Workshop (see Section 2.2). Anyway the main application still spells the text character by character, in order to give an idea of how the reading is going, but, when a space is detected, the entire word is passed to the spell-checker and then spoken. With these functionalities the text recognition subsystem is now complete.

7.5 Conclusion

In this chapter we have designed the core character recognition subsystem. All the development guidelines discussed in Chapter 6 have been taken into account and we have also taken advantage of the constraints given by the target hardware platform proposed in Chapter 5. The input image is first decomposed into a feature matrix and this is classified by a flexible and scalable ANN ensemble combined with a winner-takes-all algorithm. According to our tests, done with the stratified two-fold cross-validation method, it is able to recognize about 95% of the characters, with only a 1% probability of errors. This is the result of a comparison between 36 different configurations. Clearly the tests have been made in the same conditions as the ones of the final application on the mobile platform. For these purposes a custom software has been developed and improved in parallel with the main one. Furthermore, the system has been equipped with a spell-checker in order to achieve the project goals outlined in Chapter 2 by implementing the demanded word-oriented functionalities.

Chapter 8

Conclusion

The goal of this project, still underway, is to design and develop a mobile reader device for blind people. After having examined the already existing solutions, we have established the requirements, purposes and features of this aid. From the ergonomics standpoint, it has to be portable and autonomous, while from the functionalities one it has to recognize printed text in a real-time way, as the user moves it on the paper to be read. This, together with all the information needed to guide the reading, is conveyed to the user by means of a speech synthesis engine. The software is designed around a full-fledged character recognition subsystem with word-oriented features, and the hardware is defined taking inspiration from common mobile PDAs.

When dealing with human disabilities it is almost impossible to write requirements a priori as in many other engineering fields. Only through a direct interaction with the end users were we able to define the requirements that such an aid should satisfy. This helped to define the device features and its architecture from a top-level standpoint. Processing a continuous stream of images, in order to give useful information in a short time, has heavily affected both the hardware and the software choices. The first ones have addressed the definition of the proper computing platform, its requirements and how the input/output data are handled. The second ones have led to properly setting up the development environment and carefully planning the whole development cycle, both by stating some guidelines and by creating a set of custom helper tools.

None of those challenging tasks has a unique solution. This is particularly evident for the hardware side of the project, where two different alternatives have been proposed. The first one (see Chapter 4) is based on a DSP (Texas Instruments TMS320C6713) plus an FPGA (Xilinx XC3S1000), while the second one (see Chapter 5) is twofold: a first prototype based on a currently commercially available PDA and, in parallel, a new custom platform designed around an SBC. The two examined kinds of solutions have been deeply analyzed, trying to assess all their
pros and cons, and for both a feasibility study has been done. This has involved the definition of the needed hardware resources, from the peripheral standpoint and from the software one, since a high amount of processing power is needed.

In the case of the DSP/FPGA platform it has been possible to talk about the system partitioning, i.e. to decide which parts of the application can be implemented by means of custom hardware processing cores, and therefore hardware/software co-design methodologies have been investigated. Different approaches have been examined for the two most data-intensive subsystems: the image pre-processing one and the ANN one. The whole development cycle has been considered, in order to correctly evaluate not only the final system, but also the difficulty of designing it. The aim was to find the best trade-off between the required implementation effort, its cost (in terms of maintainability, possibility to extend/integrate with other devices and so on) and the desired results. Making those considerations both for the hardware and the software, we have seen that the DSP/FPGA was not such a good solution, and alternatives have been investigated.

In the end a solution based on the Intel PXA270 CPU has been proposed, since it has been expressly designed for mobile appliances. It runs up to 624 MHz and it is based on the Intel XScale architecture, derived from the ARMv5TE one (ARM is the most widespread CPU architecture in the mobile world). It has advanced features to improve the execution performances: among others, a SIMD coprocessor, extensively used in the final application. A fundamental feature that makes the difference is the good performance with respect to the power consumption: according to Intel, thanks to its frequency and voltage scaling abilities it reaches, at 600 MHz, 750 MIPS with only 450 mW. In comparison, the DSP has a baseline core power consumption of about 950 mW. Both the proposed PDA and SBC are equipped with a full-fledged real-time OS based on MS Windows CE, which simplifies the software development process because we have been able to use the same tools both for the target platform and the host one, leading to a cross-platform software.

Designing and developing the software (see Chapter 6) has been a compelling task, because, in order to have an effective solution with good results, we needed to pose some guidelines. The most important requirements are modularity, flexibility, extensibility, testability and efficiency. Those are strictly correlated, but with the first one we mean all the concerns regarding the correct partitioning of the whole complex application into simpler components, the second one is fundamental for a system in continuous evolution and the third one enables the underlying framework to be used or re-used by external tools. The last two issues are the most important ones, since the testability relies upon a set of practices to verify that the system (and its components) is working as expected and gives reliable results. With these purposes a custom set of tools has been designed, which also constituted the base to analyze the run-time performances of the final application on the target hardware. Although the proposed solutions based on the PXA270 CPU offer a lot of advantages, they
have the main drawback of having less computational power with respect to the DSP/FPGA platform. This has been counterbalanced by paying special attention to the code optimization issues, like the memory access, the use of the Intel Wireless MMX technology, the switch from the floating-point arithmetic to the fixed-point one and an extensive use of the compiler specific options. All these opportunities have improved the average frame processing time from 1840.73 ms (0.54 fps) to 72.37 ms (13.8 fps), more than 25 times faster.

Beside the main application, a set of development tools has been designed in order to study the ANN subsystem and to perform all the required tests on the components being developed. This has been possible by first developing the main application and its underlying framework (with all the core algorithms) in C++, and then using a dynamic scripting language named Lua to export all those components and extend the whole system.

Since the core task of the device is to recognize the text, a great effort has been made to design and develop the corresponding subsystem (see Chapter 7). It is based on ANNs, which are a powerful data-driven method to address every problem that has complex decision boundaries with almost no information known in advance. The character bitmap images are first pre-processed and a feature matrix is computed before classifying it. The proper pattern classification part is based on a set of ANNs, each one able to identify the symbol it was trained for, and then all the outputs are collected by a winner-takes-all algorithm. The ANNs have been developed thanks to a custom software built on the same code base as the final application: this enables validating the results while being sure that the final application will have the same performances in this regard. In order to improve the correct recognition rate, new topologies have been developed (with constraints on the connections), alternative evaluation parameters have been used, the behaviour on specific data subsets has been investigated and new training strategies have been tuned. All those aspects have led to the definition of 36 different ANN configurations, which have been extensively tested in the same environment as the final application, with the fixed-point arithmetic instead of the classical floating-point one. They have been compared first with the holdout method (with different proportions between the training and the validation set) and then with the more reliable stratified two-fold cross-validation one. The best solution is able to correctly recognize characters in 95% of the cases, with an error rate of only 1%. The data set used for this study has been created with the same application that was in development, ending with a repository of more than 10000 real samples of the whole Latin alphabet (both upper case and lower case) plus numbers and a few other symbols. The recognition subsystem, along with the entire application, is fully customizable by means of a configuration file, and thus it is completely replaceable without making any modification to the main application. All the other operational parameters can be easily tuned, allowing to prepare different setups according to the user's needs. On top of the character

On top of the character identification routines, a multi-lingual spell-checker has been integrated, in order to develop the highly demanded word-oriented functionalities.

In conclusion, the proposed mobile reader device for blind can be useful for reading printed text in contexts where it is not possible to use traditional aids. Being based on the same solutions widely adopted in PDAs makes it a valuable tool for everyday life. Its modus operandi mimics the usual way of reading, thus granting better comfort and providing noteworthy help in accomplishing this important need.

Chapter 9

Future perspectives

The research presented in this work has led to the design and development of the mobile reader device for blind discussed in Chapter 2. It is the result of a study about the requirements and specifications of the aid, the most suitable hardware platform (discussed in Chapters 4 and 5), the needed software tools and applications (Chapter 6), and the fundamental text recognition functionalities (Chapter 7). It has been a challenging goal, since we have faced almost all the phases and aspects of the design and development of a new product. Clearly it has not always been possible to go into every detail, and in any case new ideas and solutions may arise during the development process itself, thanks to the experience gained in the meantime. For practical reasons it is not possible to integrate them directly into the current line of research, so they are left to future improvements. There are mainly three areas of improvement: the hardware side, the software side and the text processing features.

The hardware platform has been chosen by comparing the ease of integration with other devices, the power consumption, the computational power and the development cycle. The chosen solution has proven to be suitable for our purposes but, in the meantime, new solutions have become available which could guarantee better performances on almost every considered aspect. Marvell has released the new PXA320 CPU, still based on the XScale architecture and integrating the new Intel Wireless MMX2 technology, which adds new powerful instructions and a 32 bit DDR RAM controller. It runs at up to 806 MHz and Marvell claims that it is 25% faster, in terms of MIPS, than the previous PXA270 CPU (at the same frequency), with a reduced power consumption. Toradex (the manufacturer of the currently used SBC module) already has a Colibri module with that processor in production. Staying with ARM-based CPUs, Cortex-A8 cores (with the ARMv7 ISA) are now available; they have been optimized to achieve high execution performances (with a dual-issue superscalar in-order pipeline), can run at up to about 1 GHz, and have an integrated FPU and a SIMD coprocessor.

This technology is named NEON; it uses a 128 bit data type and it is incompatible with the Intel Wireless MMX one. Nevertheless, the switch to this new CPU should be pretty straightforward, because it is still an ARM CPU and the SIMD operations are actually wrapped in regular C functions: changing their implementation to use the new instructions should be the only modification needed to take advantage of these new features.
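
A minimal sketch of this wrapping idea is shown below, assuming a hypothetical saturating 8-bit addition over image rows: callers only ever see the portable C++ signature, and only the body would be rewritten with Wireless MMX or NEON intrinsics when porting.

#include <cstdint>
#include <cstddef>

// Saturating 8-bit addition of two pixel rows, dst[i] = sat(a[i] + b[i]).
// The rest of the code calls this function; its body is the only part that
// would be replaced by platform-specific SIMD intrinsics.
void addSaturate8(uint8_t* dst, const uint8_t* a, const uint8_t* b, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        unsigned sum = (unsigned)a[i] + (unsigned)b[i];
        dst[i] = (uint8_t)(sum > 255 ? 255 : sum);
    }
}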

Besides the processing core, the overall device architecture also has to be considered. The first improvement should concern the image acquisition peripheral: it currently still relies on a customized USB webcam, but it is conceptually useless to encode the images, send them through the USB interface and then decode them. As we have seen in Section 5.2, we have at our disposal the QCIF interface, which allows a direct connection between the SoC and the CMOS camera sensor. This requires the development of a custom driver, which generally implies building a customized version of the MS Windows CE operating system. This is clearly the best way to accomplish such a task, but it requires a long design and development stage; it is currently underway, but no result is available yet. Customizing the OS should also offer the opportunity to optimize the system around our needs, reducing the resources used by almost useless components and leaving them for other purposes.

Another hardware improvement would be the study and design of a new tactile display: as we have seen in Section 2.1, it has been discarded not because it is a useless option, but mainly because we have not yet found a viable solution, i.e. a display with a small form factor, low power consumption and the ability to represent entire words. There are some ongoing research projects on this question (pointed out in Section 2.1), but still no definitive solution. A natural extension of the system, of course including the tactile transducer, would be to integrate graphical functionalities in the main application, in a similar way to the original Optacon modus operandi. This would extend the usability of the proposed solution, but in this case the performances would strongly depend on the characteristics of the haptic interface.

Looking at the software side, the developed application satisfies the requirements but still needs improvements. One of them is a better integration with the underlying OS, and this requires the development of a customized environment, needed also to properly handle the aforementioned custom camera. Moreover, the already developed algorithms should be improved, first of all the one that tracks the user movements: a solution based on optical-flow techniques could guarantee more accurate results but, at a first survey, all those methods are data-intensive and could therefore heavily affect the execution performances. In any case, in view of an upgrade of the underlying hardware platform, they should be reconsidered.
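
For reference, the sketch below shows one simple, cheaper way of estimating the displacement between two consecutive frames, a sum-of-absolute-differences block search; it is only an illustrative baseline under the assumption of 8-bit grey-level frames, not the tracking algorithm actually implemented in the application.

#include <cstdint>
#include <cstdlib>
#include <climits>

// Estimate the (dx, dy) shift of a w x h frame between prev and curr by
// exhaustively testing offsets in [-range, +range] and keeping the one with
// the smallest sum of absolute differences over the overlapping region.
void estimateShift(const uint8_t* prev, const uint8_t* curr,
                   int w, int h, int range, int& dx, int& dy)
{
    long best = LONG_MAX;
    dx = dy = 0;
    for (int oy = -range; oy <= range; ++oy)
        for (int ox = -range; ox <= range; ++ox) {
            long sad = 0;
            for (int y = range; y < h - range; ++y)
                for (int x = range; x < w - range; ++x)
                    sad += labs((long)curr[y * w + x] -
                                (long)prev[(y + oy) * w + (x + ox)]);
            if (sad < best) { best = sad; dx = ox; dy = oy; }
        }
}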

Another useful improvement would be to switch from word-oriented to text-line-oriented features, in order to handle multiple lines at a time and thus further enhance the text accessibility.

Besides the improvement of existing functionalities, we also have to pay attention to the optimization issues: the work done in Section 6.2 has only traced the path to follow, and a lot of work is still needed. A custom performance profiling system has to be developed, since there is no practical way to do it remotely with the current development environment. We need a low-intrusive method to track which functions are the most used, in order to spot possible bottlenecks. Since this analysis has to be done on the target platform (it is pointless to tune it on the development PC), we could also resort to the hardware performance monitor registers of the PXA270 CPU.
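
A minimal sketch of such a low-intrusive, in-process profiler is given below; it simply accumulates per-scope call counts and elapsed clock ticks through an RAII helper, and the clock source and scope names are placeholders for whatever the target environment actually provides.

#include <ctime>
#include <map>
#include <string>
#include <cstdio>

struct ScopeStats {
    unsigned long calls;
    std::clock_t  ticks;
    ScopeStats() : calls(0), ticks(0) {}
};
static std::map<std::string, ScopeStats> g_profile;

// RAII timer: declare one at the top of a function to be measured, e.g.
//   ScopedTimer t("processFrame");
// and the elapsed time is charged to that scope on exit.
class ScopedTimer {
public:
    explicit ScopedTimer(const char* name) : name_(name), start_(std::clock()) {}
    ~ScopedTimer() {
        ScopeStats& s = g_profile[name_];
        ++s.calls;
        s.ticks += std::clock() - start_;
    }
private:
    std::string  name_;
    std::clock_t start_;
};

// Dump the accumulated statistics, e.g. just before the application exits.
void dumpProfile() {
    std::map<std::string, ScopeStats>::const_iterator it;
    for (it = g_profile.begin(); it != g_profile.end(); ++it)
        std::printf("%-24s calls=%lu ticks=%ld\n",
                    it->first.c_str(), it->second.calls, (long)it->second.ticks);
}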

There are improvement opportunities for the text recognition subsystem as well. The part already developed can be seen as the core of a more complex real-time OCR engine. The character identification itself is reliable, but the integration of a spell-checker is only a first step towards properly handling a complete text. This means correctly managing not only entire words, but also punctuation and graphical symbols, such as the mathematical ones. A first attempt to accomplish this task has already been made, with some of the 85 ANNs expected to detect them: the results are encouraging but not yet satisfactory.

Another interesting option is to make the device automatically detect figures and drawings. This would be helpful both to separate them from the text and to eventually reproduce graphical information on a tactile display. Although the literature offers some work addressing this issue, nothing seems to work properly yet, and research is still underway.

In conclusion, in this chapter we have seen how to improve and enhance the proposed solution, considering the hardware, the software and the whole system performances, in terms of new features and expansion of the existing ones. From the hardware standpoint, new opportunities are becoming available, both for the computing platform and for the peripherals, eventually including a new tactile transducer. The software can be further improved by tuning the current algorithms or by developing new ones. In this way new functionalities could be added to the system, such as a proper full-featured image processing engine, able to properly retrieve and convey both the textual content and the graphical information of the source document.

Appendix A

Tables

The tables shown in this appendix are excerpts from the reports automatically generated by the software developed to handle the training and evaluation of the artificial neural network subsystem, in charge of recognizing the character images. For a full reference about it, see Chapter 7.

For the sections regarding the holdout method, only the two most significant tables are reported, while for the stratified two-fold cross-validation the complete set is presented. Although the first technique has helped in the definition of the entire set of different ANN configurations, the second one, as seen in Section 7.2, is more reliable, and hence gives a more precise idea about the performances of the resulting networks. Since the configurations are the same in all the tests, they are described in a separate section.
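
As a reminder of how the stratified two-fold splits behind these tables can be built, the following C++ sketch partitions a set of labelled samples into two folds while keeping the per-class proportions; the Sample structure is a placeholder and not the actual data type of the training tools.

#include <vector>
#include <map>
#include <algorithm>
#include <cstdlib>
#include <cstddef>

struct Sample { int label; /* feature data omitted */ };

// Fisher-Yates shuffle using the C library generator (seeded elsewhere).
static void shuffleInPlace(std::vector<Sample>& v)
{
    for (std::size_t i = v.size(); i > 1; --i)
        std::swap(v[i - 1], v[std::rand() % i]);
}

// Stratified two-fold split: each class is shuffled independently and its
// samples are dealt alternately to fold 0 and fold 1, so both folds keep
// (approximately) the original class proportions.
void stratifiedTwoFold(const std::vector<Sample>& data,
                       std::vector<Sample>& fold0,
                       std::vector<Sample>& fold1)
{
    std::map<int, std::vector<Sample> > byClass;
    for (std::size_t i = 0; i < data.size(); ++i)
        byClass[data[i].label].push_back(data[i]);

    std::map<int, std::vector<Sample> >::iterator it;
    for (it = byClass.begin(); it != byClass.end(); ++it) {
        shuffleInPlace(it->second);
        for (std::size_t i = 0; i < it->second.size(); ++i)
            (i % 2 == 0 ? fold0 : fold1).push_back(it->second[i]);
    }
}

In the usual two-fold scheme, each configuration is then trained on one fold and validated on the other, and vice versa, before the two runs are combined.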

A.1 Artificial neural networks configurations

Table A.1 reports the configurations used to train the ANNs:

Name The configuration name

FeatCfg The feature data type; 16x16 and 8x8 are the sizes of the matrix, while -n, -p or -np denotes whether the data is computed not proportionally, proportionally or both (meaning that the input data is made of two matrices) with respect to the original image

InSize The size of the input layer of the ANNs (and hence the size of the feature data)

Layers The number of layers (including the input/output ones) of the ANNs

Neurons The whole number of neurons (including the input ones and the bias ones) in each ANN

Connections The whole number of connections in each ANN: not every ANN is fully connected and this value is important because it gives an idea of the computational complexity of the configuration

StopCrit The stop criterion used to train the ANNs: it can be the MSE (Err) or the BitFail (BitFail)

MaxMSE The maximum MSE required to stop the ANN training (if the StopCrit is set to Err)

BfLim The BitFail limit used to compute the BitFail value. It is the maximum difference between the ANN output and the desired value above which it is considered a wrong result

MaxBf The maximum BitFail (i.e. the maximum percentage of wrong results by the BitFail criterion) required to stop the training of the ANNs (if the StopCrit is set to BitFail)

MaxEpochs The maximum number of epochs allowed for the ANN training process

PosMul The (optional) number of times that the positive training set is duplicated. It can be a number or auto, meaning that it is computed automatically, on a case by case basis

Table A.2 reports other parameters used to test/evaluate the ANNs:

Name The configuration name

TestBfLim The BitFail limit used to test/evaluate the ANNs: it can be different from the training one in order to fairly compare the results; it should be the same for all the configurations

CmpBfLim The BitFail limit used to compare the fixed-point ANNs with the floating-point ones: it is a limit on the difference between the two types of output

MinThr The minimum threshold for choosing the winner ANN in the classification tests

DiffThr The minimum threshold on the difference between the first and the runner-up ANN used in the classification tests (a sketch of how MinThr and DiffThr are applied follows this list)
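
The sketch below shows, under the assumption that each per-symbol ANN yields a single output value, how classification decisions of the kind counted as RcgnOk/RcgnKo/RcgnUnk in the following tables can be taken with a minimum threshold (MinThr) and a minimum margin over the runner-up (DiffThr); it is an illustrative reading of these parameters, not the exact code of the test tool.

#include <vector>
#include <cstddef>

struct SymbolScore { char symbol; float output; };

// Winner-takes-all with rejection: the best-scoring symbol wins only if its
// output reaches minThr and leads the runner-up by at least diffThr;
// otherwise the sample is reported as unknown (counted as RcgnUnk).
char classify(const std::vector<SymbolScore>& scores,
              float minThr, float diffThr, char unknown = '?')
{
    if (scores.empty()) return unknown;

    std::size_t best = 0;
    for (std::size_t i = 1; i < scores.size(); ++i)
        if (scores[i].output > scores[best].output) best = i;

    float runnerUp = -1.0e30f;
    for (std::size_t i = 0; i < scores.size(); ++i)
        if (i != best && scores[i].output > runnerUp) runnerUp = scores[i].output;

    if (scores[best].output < minThr) return unknown;
    if (scores.size() > 1 && scores[best].output - runnerUp < diffThr) return unknown;
    return scores[best].symbol;
}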

Table A.1: nnetcfg (one row per ANN configuration, 1 to 36; columns: Name, FeatCfg, InSize, Layers, Neurons, Connections, StopCrit, MaxMSE, BfLim, MaxBf, MaxEpochs, PosMul, as defined above).

Table A.2: nnetcfg2 (one row per ANN configuration, 1 to 36; columns: Name, TestBfLim, CmpBfLim, MinThr, DiffThr, as defined above).

A.2 Holdout method, 90% training – 10% validation

Table A.3 reports global performance results of the ANNs running in fixed-point mode:

Name The configuration name

FeatCfg The feature data type

FeatSize The feature data size

Size The whole size of the validation set, i.e. the sum of the validation set sizes for each ANN in this configuration

Neurons The amount of neurons (including the input ones and the bias ones) in the ANN

Connections The whole number of connections in the ANN. Not every ANN is fully connected and this value is important because it gives an idea of the complexity of the ANN.

MSE The global MSE on the validation set

BitFail The global BitFail on the validation set

Table A.4 reports detailed performance results of the ANNs running in fixed-point mode. In particular it shows how different configurations behave on the positive validation set and on the negative one:

Name The configuration name

FeatCfg The feature data type

BitFailRatio The ratio between IntraBitFail and ExtraBitFail

MSERatio The ratio between IntraMSE and ExtraMSE

IntraBitFail The global BitFail on the positive validation set (also called intra-class BitFail)

IntraMSE The global MSE on the positive validation set (also called intra-class MSE)

IntraSize The whole size of the positive validation set, i.e. the sum of the positive set sizes for each ANN

ExtraBitFail The global BitFail on the negative validation set (also called extra-class BitFail)

ExtraMSE The global MSE on the negative validation set (also called extra-class MSE)

ExtraSize The whole size of the negative validation set, i.e. the sum of the negative set sizes for each ANN

Table A.5 reports the results of the final classification test on the validation set. Now the entire ensemble of ANNs (one for each configuration), running in fixed-point mode, is tested:

Name The configuration name

FeatCfg The feature data type

RcgnOk A number between 0 and 1 representing the successful recognition rate of the validation set

RcgnKo A number between 0 and 1 representing the erroneous recognition rate of the validation set

RcgnUnk A number between 0 and 1 representing the inability to recognize the validation set

Size The size of the validation set

Table A.3: summary.fxp (ordered by MSE). Global fixed-point results for the 36 configurations, with the columns Name, FeatCfg, FeatSize, Size, Neurons, Connections, MSE and BitFail defined above.

Table A.4: report.fxp (ordered by BitFailRatio). Detailed fixed-point results for the 36 configurations, with the columns Name, FeatCfg, BitFailRatio, MSERatio, IntraBitFail, IntraMSE, IntraSize, ExtraBitFail, ExtraMSE and ExtraSize defined above.

Table A.5: classifyreport.fxp (ordered by RcgnOk). Classification test results for the 36 configurations, with the columns Name, FeatCfg, RcgnOk, RcgnKo, RcgnUnk and Size defined above.

A.3 Holdout method, 10% training – 90% validation

Table A.6 reports global performance results of the ANNs running in fixed-point mode:

Name The configuration name

FeatCfg The feature data type

FeatSize The feature data size

Size The whole size of the validation set, i.e. the sum of the validation set sizes for each ANN in this configuration

Neurons The amount of neurons (including the input ones and the bias ones) in the ANN

Connections The whole amount of connections in the ANN. Not every ANN is fully connected and this value is important because it gives an idea of the complexity of the ANN.

MSE The global MSE on the validation set

BitFail The global BitFail of the ANNs on the validation set

Table A.7 reports detailed performance results of the ANNs running in fixed-point mode. In particular it shows how different configurations behave on the positive validation set and on the negative one:

Name The configuration name

FeatCfg The feature data type

BitFailRatio The ratio between IntraBitFail and ExtraBitFail

MSERatio The ratio between IntraMSE and ExtraMSE

IntraBitFail The global BitFail on the positive validation set (also called intra-class BitFail)

IntraMSE The global MSE on the positive validation set (also called intra-class MSE)

IntraSize The whole size of the positive validation set, i.e. the sum of the positive set sizes for each ANN

ExtraBitFail The global BitFail on the negative validation set (also called extra-class BitFail)

ExtraMSE The global MSE on the negative validation set (also called extra-class MSE)

ExtraSize The whole size of the negative validation set, i.e. the sum of the negative set sizes for each ANN

Table A.8 reports the results of the final classification test on the validation set. Now the entire ensemble of ANNs (one for each configuration), running in fixed-point mode, is tested:

Name The configuration name

FeatCfg The feature data type

RcgnOk A number between 0 and 1 representing the successful recognition rate of the validation set

RcgnKo A number between 0 and 1 representing the erroneous recognition rate of the validation set

RcgnUnk A number between 0 and 1 representing the inability to recognize the validation set

Size The size of the validation set

Table A.6: summary.fxp (ordered by MSE). Global fixed-point results for the 36 configurations, with the columns Name, FeatCfg, FeatSize, Size, Neurons, Connections, MSE and BitFail defined above.

Table A.7: report.fxp (ordered by BitFailRatio). Detailed fixed-point results for the 36 configurations, with the columns Name, FeatCfg, BitFailRatio, MSERatio, IntraBitFail, IntraMSE, IntraSize, ExtraBitFail, ExtraMSE and ExtraSize defined above.

Table A.8: classifyreport.fxp (ordered by RcgnOk). Classification test results for the 36 configurations, with the columns Name, FeatCfg, RcgnOk, RcgnKo, RcgnUnk and Size defined above.

Table A.9: classifyreport.fxp (ordered by RcgnOk). Classification test results for the 36 configurations, with the columns Name, FeatCfg, RcgnOk, RcgnKo, RcgnUnk and Size.

A.4 Stratified two-fold cross-validation method

Table A.10 reports global performance results of the ANNs running in floating-point mode:

Name The configuration name

FeatCfg The feature data type

FeatSize The feature data size

Size The whole size of the validation set, i.e. the sum of the validation set sizes for each ANN in this configuration

Neurons The amount of neurons (including the input ones and the bias ones) in the ANN

Connections The whole amount of connections in the ANN. Not every ANN is fully connected and this value is important because it gives an idea of the complexity of the ANN.

MSE The global MSE on the validation set

BitFail The global BitFail on the validation set

Table A.11 reports global performance results of the ANNs running in fixed-point mode. The data reported are the same as in Table A.10.

Table A.12 reports detailed performance results of the ANNs running in floating-point mode. In particular it shows how different configurations behave on the positive validation set and on the negative one:

Name The configuration name

FeatCfg The feature data type

BitFailRatio The ratio between IntraBitFail and ExtraBitFail

MSERatio The ratio between IntraMSE and ExtraMSE

IntraBitFail The global BitFail on the positive validation set (also called intra-class BitFail)

IntraMSE The global MSE on the positive validation set (also called intra-class MSE)

IntraSize The whole size of the positive validation set, i.e. the sum of the positive set sizes for each ANN

ExtraBitFail The global BitFail on the negative validation set (also called extra-class BitFail)

ExtraMSE The global MSE on the negative validation set (also called extra-class MSE)

ExtraSize The whole size of the negative validation set, i.e. the sum of the negative set sizes for each ANN

Table A.13 reports detailed performance results of the ANNs running in fixed-point mode. The data reported are the same as in Table A.12.

Table A.14 reports the results of the comparison between the ANNs running in floating-point mode and the same ANNs running in fixed-point mode:

Name The configuration name

FeatCfg The feature data type used for this configuration

Size The whole size of the validation set, i.e. the sum of the validation set sizes for each ANN

MSE The global MSE between the ANNs running in the two ways

BitFail The global BitFail resulting from the comparison of the ANNs running in the two ways

MinDecPt Each floating-point ANN is automatically converted into a fixed-point one, and the decimal point can be different between ANNs belonging to the same configuration. This is the minimum decimal point (a conversion sketch follows this list).

AvgDecPt As stated above, the decimal point can be different for each ANN; this is the "average" decimal point

MaxDecPt As stated above, the decimal point can be different for each ANN; this is the maximum decimal point

MSE fxp/fp The ratio between the MSE running in fixed-point mode and the one in floating-point mode

BitFail fxp/fp The ratio between the BitFail running in fixed-point mode and the one in floating-point mode
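
As referenced above, the following sketch illustrates one way the automatic conversion of a floating-point ANN into a fixed-point one can choose the position of the decimal point from the largest weight magnitude; it is a simplified view of the mechanism behind MinDecPt, AvgDecPt and MaxDecPt, and not the exact conversion routine of the library in use.

#include <vector>
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstddef>

// Choose how many fractional bits ("decimal point") fit the weights of one
// network: the larger the weights, the fewer fractional bits are available
// in a 32 bit fixed-point word.
int chooseDecimalPoint(const std::vector<float>& weights, int wordBits = 32)
{
    float maxAbs = 0.0f;
    for (std::size_t i = 0; i < weights.size(); ++i)
        maxAbs = std::max(maxAbs, std::fabs(weights[i]));
    int intBits = 1;                       // one bit is kept for the sign
    while ((1 << intBits) <= (int)maxAbs && intBits < wordBits - 2) ++intBits;
    return wordBits - 1 - intBits;         // remaining bits hold the fraction
}

// Convert the weights using the chosen decimal point.
std::vector<int32_t> toFixed(const std::vector<float>& weights, int decPt)
{
    std::vector<int32_t> out(weights.size());
    for (std::size_t i = 0; i < weights.size(); ++i)
        out[i] = (int32_t)(weights[i] * (float)(1 << decPt));
    return out;
}

Because each ANN has its own weight range, the resulting decimal points differ within a configuration, which is what the minimum, average and maximum columns summarize.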

Table A.15 reports the results of the final classification test on the validation set. Now the entire ensemble of ANNs (one for each configuration) is tested, instead of the ANNs taken alone as in the previous tests:

Name The configuration name

FeatCfg The feature data type

RcgnOk A number between 0 and 1 representing the successful recognition rate of the validation set

RcgnKo A number between 0 and 1 representing the erroneous recognition rate of the validation set

RcgnUnk A number between 0 and 1 representing the inability to recognize the validation set

Size The size of the validation set

ConfMin The minimum of the confidence interval

ConfMax The maximum of the confidence interval

ConfLvl The confidence level considered to compute the confidence interval (see the formula after this list)
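
ConfMin and ConfMax can be read, under the assumption of a normal approximation to the binomial distribution of the recognition rate, as the bounds of the usual proportion confidence interval; the formula below is this standard approximation and may differ in detail from the one used by the report generator.

\mathrm{ConfMin},\ \mathrm{ConfMax} \;=\; \hat{p} \mp z_{\mathrm{ConfLvl}} \sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}}

where \hat{p} is the measured RcgnOk rate, n is the Size of the validation set and z_{\mathrm{ConfLvl}} is the standard normal quantile corresponding to the chosen confidence level.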

Table A.16 reports the results of the ANNs running in fixed-point mode from a final classification test on the validation set. Now the entire ensemble of ANNs (one for each configuration) is tested, instead of the ANNs taken alone as in the previous tests. The data reported are the same as in Table A.15.

Table A.10: summary (ordered by MSE). Global floating-point results for the 36 configurations, with the columns Name, FeatCfg, FeatSize, Size, Neurons, Connections, MSE and BitFail defined above.

Table A.11: summary.fxp (ordered by MSE). Global fixed-point results for the 36 configurations, with the columns Name, FeatCfg, FeatSize, Size, Neurons, Connections, MSE and BitFail defined above.


Table A.12: report (ordered by BitFailRatio)

 #  Name                  FeatCfg   BitFailRatio  MSERatio  IntraBitFail  IntraMSE  IntraSize  ExtraBitFail  ExtraMSE  ExtraSize
 1  2Dn-x100-10k          16x16-n   5.1213    3.5026    0.132226  0.027945  10172  0.025819  0.007978  854448
 2  beta-h64-x100-1k      16x16-n   5.2751    8.7391    0.134487  0.020723  10172  0.025495  0.002371  854448
 3  beta-h64-x85-1k       16x16-n   5.5532    8.7226    0.142548  0.020158  10172  0.025669  0.002311  854448
 4  beta-h64-x100-10k     16x16-n   5.8528    12.3926   0.121510  0.020960  10172  0.020761  0.001691  854448
 5  beta-h64-xa-1k        16x16-n   5.9300    9.8041    0.160539  0.020576  10172  0.027072  0.002099  854448
 6  beta-h64-x85-bf1-1k   16x16-n   6.1780    10.6451   0.116496  0.020283  10172  0.018857  0.001905  854448
 7  beta-h64-x85-10k      16x16-n   6.3583    13.4703   0.126130  0.021188  10172  0.019837  0.001573  854448
 8  beta-h64-xa-bf1-1k    16x16-n   6.4029    9.1897    0.150216  0.020124  10172  0.023460  0.002190  854448
 9  beta-h64-x85-bf2-1k   16x16-n   6.7225    12.7921   0.102044  0.021350  10172  0.015180  0.001669  854448
10  beta-h64-xa-bf1-10k   16x16-n   6.7278    15.7772   0.095556  0.021369  10172  0.014203  0.001354  854448
11  2Dn-xa-bf1-10k        16x16-n   6.7462    5.7529    0.107354  0.032720  10172  0.015913  0.005688  854448
12  2Dn-xa-10k            16x16-n   6.7730    11.9722   0.136846  0.030048  10172  0.020205  0.002510  854448
13  2Dp-xa-10k            16x16-p   7.5113    19.4432   0.154149  0.037495  10172  0.020522  0.001928  854448
14  2Dp-x85-10k           16x16-p   7.6568    13.8132   0.154345  0.037280  10172  0.020158  0.002699  854448
15  2Dp-xa-bf1-10k        16x16-p   7.6798    7.7841    0.129571  0.043599  10172  0.016872  0.005601  854448
16  2Dn-x85-10k           16x16-n   7.7231    17.8369   0.139206  0.028521  10172  0.018025  0.001599  854448
17  beta-h64-x50-1k       16x16-n   8.1497    12.6850   0.180004  0.023602  10172  0.022087  0.001861  854448
18  beta-h64-x50-10k      16x16-n   8.2664    19.1621   0.149528  0.024097  10172  0.018089  0.001258  854448
19  2Dn-x50-10k           16x16-n   8.3257    7.7403    0.157786  0.031127  10172  0.018952  0.004021  854448
20  2Dp-x50-10k           16x16-p   9.2667    18.8358   0.176268  0.036316  10172  0.019022  0.001928  854448
21  beta-h64-x85-bf2-10k  16x16-n   11.3144   38.5032   0.073240  0.022602  10172  0.006473  0.000587  854448
22  2Dp-x10-10k           16x16-p   13.2674   44.4840   0.239285  0.049684  10172  0.018036  0.001117  854448
23  2Dn-x10-10k           16x16-n   14.1001   26.6674   0.239579  0.046076  10172  0.016991  0.001728  854448
24  beta-h64-x10-1k       16x16-n   15.3816   31.9126   0.279099  0.031024  10172  0.018145  0.000972  854448
25  beta-h64-x10-10k      16x16-n   16.4314   36.6599   0.248820  0.030211  10172  0.015143  0.000824  854448
26  beta                  16x16-n   35.2877   12.2585   0.275757  0.048441  10172  0.007815  0.003952  854448
27  beta-h64-10k          16x16-n   36.6086   194.3108  0.437967  0.085183  10172  0.011964  0.000438  854448
28  2Dp                   16x16-p   54.3761   213.0722  0.297188  0.091270  10172  0.005465  0.000428  854448
29  2Dn                   16x16-n   55.7799   237.7355  0.298663  0.082281  10172  0.005354  0.000346  854448
30  alpha                 8x8-np    63.2405   154.0812  0.290897  0.058856  10172  0.004600  0.000382  854448
31  beta-p                16x16-p   63.3443   114.8376  0.272513  0.066420  10172  0.004302  0.000578  854448
32  beta-h64              16x16-n   65.9792   171.4801  0.274675  0.046897  10172  0.004163  0.000273  854448
33  beta-np               16x16-np  70.1044   139.0319  0.253343  0.057028  10172  0.003614  0.000410  854448
34  beta-c7               16x16-n   70.1889   153.6429  0.259438  0.055005  10172  0.003696  0.000358  854448
35  alpha-full            8x8-np    90.5603   178.0853  0.249509  0.071871  10172  0.002755  0.000404  854448
36  beta-full             16x16-n   129.3493  132.6867  0.310460  0.064241  10172  0.002400  0.000484  854448
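A reading aid for Table A.12 and, below, Table A.13: the BitFailRatio and MSERatio columns are consistent with the quotients of the intra-set and extra-set figures in the same row (for the first row above, 0.132226 / 0.025819 gives 5.1213 and 0.027945 / 0.007978 gives about 3.503, matching the printed values up to rounding). The snippet below merely recomputes those two columns; the struct and field names are hypothetical and not taken from the thesis sources.

// Reading aid only: recomputes the ratio columns of the report tables from
// the intra-/extra-set errors. Names are hypothetical; data is from Table A.12.
#include <cstdio>

struct ReportRow {
    double intraBitFail, intraMse;  // errors on the 10172-pattern "intra" set
    double extraBitFail, extraMse;  // errors on the 854448-pattern "extra" set
};

int main()
{
    const ReportRow r = { 0.132226, 0.027945, 0.025819, 0.007978 }; // row 1, 2Dn-x100-10k
    std::printf("BitFailRatio = %.4f\n", r.intraBitFail / r.extraBitFail); // ~5.1213
    std::printf("MSERatio     = %.4f\n", r.intraMse / r.extraMse);         // ~3.5028
    return 0;
}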


Table A.13: report.fxp (ordered by BitFailRatio)

 #  Name                  FeatCfg   BitFailRatio  MSERatio  IntraBitFail  IntraMSE  IntraSize  ExtraBitFail  ExtraMSE  ExtraSize
 1  beta-h64-x100-1k      16x16-n   8.3142    10.9481   0.198191  0.024402  10172  0.023838  0.002229  854448
 2  beta-h64-x85-1k       16x16-n   8.6567    10.9076   0.206252  0.023915  10172  0.023826  0.002192  854448
 3  beta-h64-x85-bf2-1k   16x16-n   9.0313    14.8627   0.133307  0.024092  10172  0.014761  0.001621  854448
 4  beta-h64-x85-bf1-1k   16x16-n   9.2051    12.6864   0.164864  0.023240  10172  0.017910  0.001832  854448
 5  beta-h64-xa-1k        16x16-n   9.2064    12.4895   0.233877  0.024620  10172  0.025404  0.001971  854448
 6  beta-h64-xa-bf1-1k    16x16-n   9.2224    11.2308   0.208808  0.023596  10172  0.022641  0.002101  854448
 7  beta-h64-x100-10k     16x16-n   9.3277    15.7504   0.178530  0.024277  10172  0.019140  0.001541  854448
 8  beta-full             16x16-n   9.9477    5.6502    0.213822  0.045072  10172  0.021495  0.007977  854448
 9  alpha-full            8x8-np    10.2225   18.8511   0.155328  0.051134  10172  0.015195  0.002713  854448
10  beta-h64-x85-10k      16x16-n   10.7546   17.6548   0.193177  0.024991  10172  0.017962  0.001416  854448
11  2Dn-x100-10k          16x16-n   10.7786   4.5413    0.279100  0.058335  10172  0.025894  0.012845  854448
12  beta-h64-xa-bf1-10k   16x16-n   11.2178   20.4178   0.141368  0.024622  10172  0.012602  0.001206  854448
13  beta-h64-x50-1k       16x16-n   12.1255   15.8429   0.249705  0.027968  10172  0.020593  0.001765  854448
14  beta-h64-x50-10k      16x16-n   13.5396   24.6404   0.221294  0.028228  10172  0.016344  0.001146  854448
15  2Dn-xa-10k            16x16-n   13.6191   5.3668    0.306921  0.054354  10172  0.022536  0.010128  854448
16  2Dn-xa-bf1-10k        16x16-n   17.1212   12.8311   0.265140  0.091545  10172  0.015486  0.007135  854448
17  beta-h64-x85-bf2-10k  16x16-n   17.3514   47.2416   0.098408  0.025084  10172  0.005671  0.000531  854448
18  2Dp-xa-bf1-10k        16x16-p   17.4788   10.5188   0.269957  0.082657  10172  0.015445  0.007858  854448
19  2Dp-x85-10k           16x16-p   18.3132   13.0993   0.321274  0.068989  10172  0.017543  0.005267  854448
20  beta-c7               16x16-n   18.8000   18.5142   0.232304  0.045510  10172  0.012357  0.002458  854448
21  2Dp-x50-10k           16x16-p   20.1541   15.0141   0.352733  0.071369  10172  0.017502  0.004753  854448
22  beta-h64-x10-1k       16x16-n   21.4203   41.8120   0.346834  0.038036  10172  0.016192  0.000910  854448
23  2Dn-x10-10k           16x16-n   22.0883   14.5236   0.372100  0.065245  10172  0.016846  0.004492  854448
24  2Dp-xa-10k            16x16-p   23.1367   51.2212   0.344868  0.089327  10172  0.014906  0.001744  854448
25  beta-h64-x10-10k      16x16-n   25.0558   49.9834   0.321766  0.037291  10172  0.012842  0.000746  854448
26  2Dn-x85-10k           16x16-n   25.1615   68.0438   0.320684  0.074684  10172  0.012745  0.001098  854448
27  2Dp-x10-10k           16x16-p   27.0004   79.4116   0.398447  0.093604  10172  0.014757  0.001179  854448
28  2Dp                   16x16-p   27.0004   17.1364   0.376033  0.111233  10172  0.013927  0.006491  854448
29  2Dn-x50-10k           16x16-n   28.6689   48.8782   0.354207  0.088889  10172  0.012355  0.001819  854448
30  alpha                 8x8-np    35.0007   46.5328   0.309281  0.059338  10172  0.008836  0.001275  854448
31  beta                  16x16-n   37.2628   12.5612   0.319406  0.050604  10172  0.008572  0.004029  854448
32  2Dn                   16x16-n   46.9044   33.6376   0.420763  0.123608  10172  0.008971  0.003675  854448
33  beta-np               16x16-np  47.3466   62.8038   0.258258  0.052383  10172  0.005455  0.000834  854448
34  beta-h64-10k          16x16-n   52.5854   252.7211  0.498427  0.101436  10172  0.009478  0.000401  854448
35  beta-p                16x16-p   72.7332   105.7573  0.323044  0.069426  10172  0.004441  0.000656  854448
36  beta-h64              16x16-n   100.7338  223.3954  0.342017  0.055956  10172  0.003395  0.000250  854448


Table A.14: summary.cmp.fxp (ordered by MSE)

 #  Name                  FeatCfg   Size    MSE       BitFail   MinDecPt  AvgDecPt  MaxDecPt  MSE fxp/fp  BitFail fxp/fp
 1  beta-h64-x85-bf2-10k  16x16-n   864620  0.000023  0.013953  6  11.16  13  0.951767  0.925285
 2  beta-h64-x85-bf2-1k   16x16-n   864620  0.000031  0.030035  7  11.15  13  0.984331  0.992739
 3  beta-h64-xa-bf1-10k   16x16-n   864620  0.000037  0.028148  6  11.64  13  0.923645  0.925727
 4  beta-h64-x85-bf1-1k   16x16-n   864620  0.000040  0.035954  7  11.44  13  0.977037  0.979458
 5  beta-h64              16x16-n   864620  0.000040  0.010996  6  10.02  12  1.086391  0.999999
 6  beta-h64-x10-10k      16x16-n   864620  0.000040  0.030207  6  11.14  13  0.994074  0.918717
 7  beta-h64-x50-10k      16x16-n   864620  0.000040  0.033640  6  11.55  13  0.948805  0.952587
 8  beta-h64-x10-1k       16x16-n   864620  0.000041  0.032931  7  11.26  13  1.004508  0.944199
 9  beta-h64-x100-10k     16x16-n   864620  0.000042  0.037222  8  11.73  13  0.938796  0.954754
10  beta-h64-x50-1k       16x16-n   864620  0.000043  0.038927  7  11.58  13  0.974017  0.969632
11  beta-h64-x85-10k      16x16-n   864620  0.000045  0.035257  6  11.55  13  0.932799  0.945753
12  beta-h64-x85-1k       16x16-n   864620  0.000046  0.044148  7  11.55  13  0.967408  0.957938
13  beta-h64-xa-bf1-1k    16x16-n   864620  0.000048  0.047638  7  11.60  13  0.975994  0.992206
14  beta-h64-x100-1k      16x16-n   864620  0.000050  0.045958  8  11.67  13  0.957340  0.964737
15  beta-h64-10k          16x16-n   864620  0.000052  0.026193  6  11.06  14  1.102650  0.896871
16  beta-h64-xa-1k        16x16-n   864620  0.000053  0.050976  7  11.71  13  0.961329  0.970455
17  beta-p                16x16-p   864620  0.000095  0.013194  5  8.07   12  1.064277  1.090874
18  beta                  16x16-n   864620  0.000105  0.015090  5  8.55   12  1.019806  1.111211
19  beta-np               16x16-np  864620  0.000402  0.015895  5  7.89   11  1.329736  1.284249
20  2Dn-x85-10k           16x16-n   864620  0.000721  0.046411  5  9.71   13  1.019247  0.840671
21  alpha                 8x8-np    864620  0.000804  0.021063  5  8.05   11  1.205826  1.290637
22  2Dp-x10-10k           16x16-p   864620  0.001005  0.057989  4  8.56   13  1.329460  0.936342
23  2Dp-xa-10k            16x16-p   864620  0.001317  0.045805  4  9.15   13  1.180048  0.853235
24  beta-c7               16x16-n   864620  0.001913  0.026079  5  7.76   11  2.909642  2.209059
25  alpha-full            8x8-np    864620  0.001999  0.030293  5  7.24   11  1.643316  2.085690
26  2Dp-xa-bf1-10k        16x16-p   864620  0.003367  0.032250  4  8.88   13  1.442174  1.016194
27  2Dn-x50-10k           16x16-n   864620  0.003509  0.050683  4  9.55   13  0.653372  0.795395
28  2Dn-x10-10k           16x16-n   864620  0.003510  0.061251  5  9.17   13  2.296268  1.071826
29  2Dn                   16x16-n   864620  0.003746  0.028689  5  8.23   11  3.835878  1.566185
30  2Dp-x50-10k           16x16-p   864620  0.003886  0.051266  4  8.94   12  2.368901  1.030490
31  2Dp-x85-10k           16x16-p   864620  0.004321  0.053733  4  8.87   13  1.939739  0.974845
32  2Dn-xa-bf1-10k        16x16-n   864620  0.005889  0.033947  4  9.73   13  1.354729  1.083675
33  2Dp                   16x16-p   864620  0.006328  0.033838  6  7.91   11  5.109377  2.042753
34  beta-full             16x16-n   864620  0.007192  0.033945  3  6.95   11  6.702304  3.900828
35  2Dn-xa-10k            16x16-n   864620  0.008351  0.058096  4  9.89   13  3.738039  1.198559
36  2Dn-x100-10k          16x16-n   864620  0.013008  0.061724  5  9.72   13  1.626968  1.064687


Table A.15: classify report (ordered by RcgnOk)

 #  Name                  FeatCfg   RcgnOk    RcgnKo    RcgnUnk   Size   ConfMin   ConfMax   ConfLvl
 1  beta-h64-x85-bf2-10k  16x16-n   0.949076  0.009634  0.041290  10172  0.943152  0.954412  0.99
 2  beta-h64-x10-10k      16x16-n   0.932953  0.013763  0.053284  10172  0.926268  0.939072  0.99
 3  beta-h64-x50-10k      16x16-n   0.929021  0.012092  0.058887  10172  0.922167  0.935313  0.99
 4  beta-h64-x10-1k       16x16-n   0.928136  0.013271  0.058592  10172  0.921246  0.934466  0.99
 5  beta-h64-x85-10k      16x16-n   0.924499  0.010912  0.064589  10172  0.917459  0.930983  0.99
 6  beta-h64-x100-10k     16x16-n   0.922139  0.012486  0.065376  10172  0.915006  0.928721  0.99
 7  beta-h64              16x16-n   0.920763  0.008553  0.070684  10172  0.913575  0.927401  0.99
 8  beta-h64-xa-bf1-10k   16x16-n   0.920664  0.011699  0.067637  10172  0.913473  0.927306  0.99
 9  beta-h64-x50-1k       16x16-n   0.912308  0.016417  0.071274  10172  0.904800  0.919277  0.99
10  2Dn-x85-10k           16x16-n   0.910736  0.016713  0.072552  10172  0.903171  0.917763  0.99
11  beta-h64-x85-bf2-1k   16x16-n   0.910736  0.013862  0.075403  10172  0.903171  0.917763  0.99
12  beta-c7               16x16-n   0.909949  0.009634  0.080417  10172  0.902356  0.917006  0.99
13  beta-h64-xa-1k        16x16-n   0.909851  0.012584  0.077566  10172  0.902254  0.916911  0.99
14  beta-np               16x16-np  0.904345  0.011010  0.084644  10172  0.896555  0.911607  0.99
15  beta-full             16x16-n   0.904051  0.011797  0.084153  10172  0.896250  0.911323  0.99
16  beta-h64-x85-bf1-1k   16x16-n   0.900020  0.014550  0.085430  10172  0.892082  0.907434  0.99
17  2Dp-x10-10k           16x16-p   0.896087  0.018580  0.085333  10172  0.888021  0.903636  0.99
18  beta-h64-xa-bf1-1k    16x16-n   0.895989  0.012387  0.091624  10172  0.887919  0.903541  0.99
19  alpha                 8x8-np    0.894416  0.012583  0.093001  10172  0.886295  0.902021  0.99
20  beta-h64-x85-1k       16x16-n   0.893925  0.014747  0.091329  10172  0.885788  0.901546  0.99
21  beta-h64-x100-1k      16x16-n   0.893433  0.013567  0.093001  10172  0.885281  0.901071  0.99
22  2Dp-xa-10k            16x16-p   0.887338  0.017499  0.095163  10172  0.878995  0.895174  0.99
23  alpha-full            8x8-np    0.884782  0.013960  0.101259  10172  0.876361  0.892699  0.99
24  beta-p                16x16-p   0.881439  0.014156  0.104404  10172  0.872919  0.889460  0.99
25  2Dp-x50-10k           16x16-p   0.875541  0.025266  0.099194  10172  0.866850  0.883740  0.99
26  2Dn                   16x16-n   0.871412  0.010225  0.118364  10172  0.862605  0.879733  0.99
27  2Dp                   16x16-p   0.852045  0.012190  0.135765  10172  0.842732  0.860897  0.99
28  2Dn-x10-10k           16x16-n   0.847424  0.025560  0.127015  10172  0.837999  0.856395  0.99
29  2Dn-xa-10k            16x16-n   0.838085  0.017695  0.144220  10172  0.828441  0.847286  0.99
30  2Dp-x85-10k           16x16-p   0.831891  0.028018  0.140090  10172  0.822109  0.841240  0.99
31  beta-h64-10k          16x16-n   0.830712  0.010519  0.158769  10172  0.820904  0.840088  0.99
32  2Dn-x50-10k           16x16-n   0.720704  0.032147  0.247149  10172  0.709085  0.732034  0.99
33  beta                  16x16-n   0.635765  0.063311  0.300924  10172  0.623370  0.647983  0.99
34  2Dp-xa-bf1-10k        16x16-p   0.600570  0.040307  0.359123  10172  0.587979  0.613030  0.99
35  2Dn-xa-bf1-10k        16x16-n   0.592410  0.033327  0.374263  10172  0.579784  0.604916  0.99
36  2Dn-x100-10k          16x16-n   0.440621  0.043256  0.516123  10172  0.427964  0.453356  0.99
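The ConfMin and ConfMax columns, here and in Table A.16 below, bracket the RcgnOk rate at the confidence level given in ConfLvl. For the first row above (0.949076 over 10172 samples) they agree, to roughly four decimal places, with a Wilson score interval at the 0.99 level; whether that is exactly the interval computed in the thesis is an assumption, and the sketch below only illustrates that reading.

// Illustrative sketch only: a Wilson score interval for a recognition rate,
// one possible way to obtain ConfMin/ConfMax bounds at ConfLvl = 0.99.
#include <cmath>
#include <cstdio>

void wilsonInterval(double p, double n, double z, double& lo, double& hi)
{
    const double z2 = z * z;
    const double centre = (p + z2 / (2.0 * n)) / (1.0 + z2 / n);
    const double half = z * std::sqrt(p * (1.0 - p) / n + z2 / (4.0 * n * n))
                        / (1.0 + z2 / n);
    lo = centre - half;
    hi = centre + half;
}

int main()
{
    double lo = 0.0, hi = 0.0;
    // First row of Table A.15: RcgnOk = 0.949076 over 10172 samples;
    // z ~ 2.5758 is the two-sided normal quantile for a 99% level.
    wilsonInterval(0.949076, 10172.0, 2.5758, lo, hi);
    std::printf("[%.6f, %.6f]\n", lo, hi); // close to the table's 0.943152 / 0.954412
    return 0;
}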


Table A.16: classify report.fxp (ordered by RcgnOk)

 #  Name                  FeatCfg   RcgnOk    RcgnKo    RcgnUnk   Size   ConfMin   ConfMax   ConfLvl
 1  beta-h64-x85-bf2-10k  16x16-n   0.947208  0.009929  0.042863  10172  0.941189  0.952641  0.99
 2  beta-h64-x50-10k      16x16-n   0.927055  0.014746  0.058199  10172  0.920120  0.933432  0.99
 3  beta-h64-x85-10k      16x16-n   0.925875  0.014845  0.059280  10172  0.918891  0.932301  0.99
 4  beta-h64-x100-10k     16x16-n   0.925186  0.015434  0.059379  10172  0.918175  0.931642  0.99
 5  beta-h64-xa-bf1-10k   16x16-n   0.924990  0.012977  0.062033  10172  0.917971  0.931454  0.99
 6  beta-h64-x10-10k      16x16-n   0.920763  0.012977  0.066260  10172  0.913575  0.927401  0.99
 7  beta-h64-x10-1k       16x16-n   0.917715  0.013370  0.068915  10172  0.910409  0.924474  0.99
 8  beta-h64-x85-bf2-1k   16x16-n   0.913685  0.016418  0.069898  10172  0.906228  0.920601  0.99
 9  beta-h64-xa-1k        16x16-n   0.912997  0.015238  0.071765  10172  0.905514  0.919939  0.99
10  beta-h64-x50-1k       16x16-n   0.912407  0.018777  0.068816  10172  0.904902  0.919371  0.99
11  beta-h64-xa-bf1-1k    16x16-n   0.903166  0.016024  0.080810  10172  0.895335  0.910469  0.99
12  beta-h64-x85-bf1-1k   16x16-n   0.902969  0.017007  0.080024  10172  0.895132  0.910280  0.99
13  beta-h64              16x16-n   0.901494  0.007570  0.090936  10172  0.893607  0.908857  0.99
14  beta-h64-x100-1k      16x16-n   0.894711  0.019367  0.085922  10172  0.886600  0.902306  0.99
15  beta-np               16x16-np  0.893433  0.018580  0.087986  10172  0.885281  0.901071  0.99
16  beta-h64-x85-1k       16x16-n   0.892646  0.018286  0.089068  10172  0.884469  0.900310  0.99
17  beta-p                16x16-p   0.872493  0.016123  0.111384  10172  0.863717  0.880783  0.99
18  2Dn-x85-10k           16x16-n   0.862761  0.021530  0.115710  10172  0.853721  0.871326  0.99
19  alpha-full            8x8-np    0.862072  0.029591  0.108336  10172  0.853014  0.870657  0.99
20  alpha                 8x8-np    0.861778  0.024675  0.113547  10172  0.852712  0.870370  0.99
21  beta-c7               16x16-n   0.851356  0.024774  0.123870  10172  0.842027  0.860227  0.99
22  2Dp-x10-10k           16x16-p   0.821372  0.025069  0.153559  10172  0.811365  0.830960  0.99
23  2Dp-xa-10k            16x16-p   0.812426  0.037161  0.150413  10172  0.802237  0.822207  0.99
24  2Dn-x50-10k           16x16-n   0.790012  0.037947  0.172041  10172  0.779405  0.800240  0.99
25  beta-h64-10k          16x16-n   0.787063  0.008750  0.204188  10172  0.776404  0.797346  0.99
26  2Dn-x10-10k           16x16-n   0.656016  0.071471  0.272513  10172  0.643766  0.668063  0.99
27  beta-full             16x16-n   0.643335  0.041683  0.314982  10172  0.630991  0.655491  0.99
28  beta                  16x16-n   0.631046  0.078155  0.290798  10172  0.618621  0.643300  0.99
29  2Dp-x85-10k           16x16-p   0.623673  0.078942  0.297385  10172  0.611203  0.635981  0.99
30  2Dn                   16x16-n   0.614923  0.097424  0.287652  10172  0.602404  0.627292  0.99
31  2Dp-x50-10k           16x16-p   0.611974  0.096441  0.291584  10172  0.599439  0.624363  0.99
32  2Dn-xa-bf1-10k        16x16-n   0.532639  0.083857  0.383503  10172  0.519858  0.545376  0.99
33  2Dp-xa-bf1-10k        16x16-p   0.508749  0.046500  0.444751  10172  0.495959  0.521528  0.99
34  2Dp                   16x16-p   0.476406  0.165750  0.357845  10172  0.463649  0.489193  0.99
35  2Dn-x100-10k          16x16-n   0.454679  0.038144  0.507176  10172  0.441975  0.467443  0.99
36  2Dn-xa-10k            16x16-n   0.208907  0.192685  0.598407  10172  0.198699  0.219495  0.99
