
A Greek voice recognition interface for ROV applications, using machine learning technologies and the CMU Sphinx platform

FOTIOS K. PANTAZOGLOU
Hellenic Centre for Marine Research
Institute of Oceanography
Former US base at Gournes
P.O. Box 2214, 71003 Heraklion
GREECE
[email protected]

GEORGIOS P. KLADIS
Hellenic Military Academy
16673 Vari Attikis
GREECE
[email protected]

NIKOLAOS K. PAPADAKIS ∗
Hellenic Military Academy
16673 Vari Attikis
GREECE
[email protected]

Abstract: Finding new technical solutions to command and control smart, high-technology devices has become a necessity in our days. This is due to the fact that manual control of these devices may often become cumbersome for the operator, especially when he/she is involved with numerous tasks. A common way to overcome this is the use of Automatic Speech Recognition (ASR) procedures, which is the main topic of this article. In this article we present the implementation of a Greek CMU Sphinx model that can be used in Remotely Operated Vehicle (ROV) operations and applications. In particular, this work is focused on the development and training of the CMU Sphinx platform for the Greek language using well-established machine learning tools and technologies. The generic Greek model and the Greek model for ROV applications are freely available in an international repository via https://gitlab.sse.gr/fpantazoglou/omilia and https://goo.gl/9v3QqG

Key–Words: Human-machine interface, Machine learning, CMU Sphinx, Pocketsphinx, Greek language, Automatic Speech Recognition, Hidden Markov Models, Remotely Operated Vehicles

1 Introduction

New technologies and practices characterize our age with their rapid development and expansion. Automatic Speech Recognition (ASR) aims not only to improve our everyday life, but also to be used in a variety of high-tech systems and equipment. One of the oldest issues that the scientific community has dealt with is the application of ASR methods in practical systems.

According to [1], ASR has been in use since the 1950s. ASR involves the use of sophisticated microcomputers, Internet-based computing programs and programming languages, applied in a variety of settings. This may improve everyday tasks and help in a number of more sophisticated and complicated apparatuses. CMU Sphinx [2], by Carnegie Mellon University, Sun Microsystems Laboratories, and the Mitsubishi Electric Research Laboratories, is a popular open source platform in the scientific community. Pocketsphinx is a CMU Sphinx variant developed by Huggins-Daines [3] as an answer to the rapid developments in small and smart devices such as smartphones, tablets and microcomputers. Pocketsphinx is designed to be handheld- and mobile-friendly, and is the main topic of this work.

∗Corresponding author

The wide range of tools provided and the flexible design [2][3] allow their use in the development of speech recognition applications. The software, including instructions, is well documented, and an active community is in place for commercial support. Through 20 years of ongoing support and development, both CMU Sphinx and Pocketsphinx support a plethora of languages including English, German, French, Spanish, Russian and Mandarin [4]. Both platforms are user friendly and, due to the systematic way in which they can be trained, they can easily accommodate any language or application-specific dictionary. In particular, all the necessary tools are provided by the platforms for the creation of an accurate language model for any new task [5].

Still, although a great number of works address the previous issues, only a small amount is dedicated to a Greek language model as a basis for ASR applications. For example, in [6] the authors used CMU Sphinx as part of the research project RAPP (Robotic Applications for Delivering Smart User Empowering Applications) and created a small Greek model which included only certain Greek words, limiting the applicability of their approach.

WSEAS TRANSACTIONS on SYSTEMS and CONTROL
Fotios K. Pantazoglou, Georgios P. Kladis, Nikolaos K. Papadakis
E-ISSN: 2224-2856, Volume 13, 2018, p. 550


To overcome this, a generic Greek language model was necessary as a basis for more specialized models that can be used in a variety of applications. So, at first, a generic Greek language model was created and introduced in [7]. Its development involved the use of the general principles for creating a new model, according to the instructions given by the CMU Sphinx developer team. Those instructions are used whenever a new language model is implemented.

During the training process, a machine learning procedure, the platform is provided, via sphinxtrain (https://goo.gl/YDxgeu), part of the CMU Sphinx package, with audio data of the desired language and the corresponding transcribed texts. Through a statistical process, the training algorithm creates the basic feature parameters for the desired language. At the same time, through decoding, training results are tested in order to achieve greater accuracy. Since the training and adaptation of the Greek model is an adaptive and continuous process, the authors expand on previous work by presenting a more application-oriented derivative of their work. In particular, this work involves the control of a Remotely Operated underwater Vehicle (ROV). Specifically, the authors trained and adapted the already created generic Greek model so it can be used for voice control in ROV applications.

The remainder of this article is structured as follows: in Section 2 a brief literature review of ASR methods is presented. Section 3 refers to Hidden Markov Models. In Section 4, CMU Sphinx, Pocketsphinx and their design architectures are analyzed, and in Section 5 a presentation of the development and implementation of the generic Greek model for the CMU Sphinx platform is included. Section 6 is dedicated to the machine learning process that resulted in the ROV-oriented Greek model, and the paper concludes with Section 7.

2 Automatic Speech Recognition (ASR)

ASR is the process of converting a voice signal into a sequential string of words using a computational algorithm [8]. The main objective of the scientific community is to improve the accuracy of ASR in various environments, for different types of spoken language, and for different speakers. The development of powerful processors and multi-core computing systems has allowed the successful application of ASR approaches in real applications [9]. ASR is used in speech-to-speech programs, for example [10], or in daily-use applications like Skype (the Skype online translator).

ASR techniques are also widely used in voice search applications, in intelligent digital assistants like Microsoft's Cortana and Apple's Siri, and in the gaming industry. Smart homes use ASR to allow the user to interact with the home control system and, at the same time, in cars ASR-based applications are used for informational or entertainment purposes [11].

One of the biggest challenges is the use of ASR for handicapped people [12], or the use of ASR for controlling high-tech scientific apparatuses.

A brief presentation of the main characteristics of ASR systems and their architectures follows, including the main categories and modelling techniques that are used.

2.1 ASR Design Architecture

The typical architecture of an ASR system is depicted in Fig. 1. The system consists of four parts: a signal processing and feature extraction part, the Acoustic Model (AM), the Language Model (LM) and the Hypothesis Search (HS). The first part receives the audio signal as input.

Figure 1: A typical architecture of an ASR system.

In particular, the input signal is filtered from noise and any disturbances, and is transformed into the frequency domain, thus producing the principal features to be fed to the AM. The AM incorporates knowledge about acoustics and voice, and a score is generated for the variable-length sequence.

The LM calculates the probability of a hypothetical word sequence, creating the LM score. Thereafter, the HS part combines the two scores produced and outputs the sequence of words with the greatest score.
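As an illustration of how the HS part combines the two scores, the following minimal sketch picks the word sequence with the greatest total log-score. The score tables and the hypothesis list are invented for the example; a real decoder searches a vast hypothesis space and typically applies a language-model weight:

```python
# Hypothetical log-probability tables standing in for a trained AM and LM.
AM_LOG_SCORES = {"go left": -3.2, "go right": -2.7, "stop": -5.1}
LM_LOG_SCORES = {"go left": -1.1, "go right": -1.4, "stop": -0.9}

def best_hypothesis(hypotheses, lm_weight=1.0):
    """Combine AM and LM log-scores and return the best-scoring word sequence."""
    def total(h):
        return AM_LOG_SCORES[h] + lm_weight * LM_LOG_SCORES[h]
    return max(hypotheses, key=total)

print(best_hypothesis(["go left", "go right", "stop"]))  # -> go right
```

The log-domain sum corresponds to multiplying the acoustic and language-model probabilities, which is why the two scores can simply be added before the maximisation.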

2.2 ASR Categories

ASR systems are categorized and characterized by the expressions that they recognize. Those can be classified, according to [13], as:


◦ Single word recognition systems. Systems that recognize individual words or single utterances. Typically, a period of silence is necessary at both ends of the sampling window, while individual words or expressions are accepted one at a time.

◦ Expression recognition systems. These systems are similar to the above, but they allow the simultaneous processing of expressions as long as there exists a minimum gap between them.

◦ Continuous speech. These systems allow a natural, continuous flow of speech by the user. They are among the most difficult systems to create, as they contain techniques that determine the boundaries of expressions.

◦ Spontaneous speech. This is speech that is natural and unrehearsed. A spontaneous speech recognition system needs to be able to handle a variety of physical features of human speech, including words that are used in spoken language such as "ah", "eh", etc.

CMU Sphinx and Pocketsphinx, as platforms, are compatible with all ASR categories. The accuracy of the developed models depends on the training of each model.

2.3 Modelling techniques

Different modelling techniques are used in the ASR process in order to generate speaker models using the speaker-specific feature vector. In [14] these are grouped as: the acoustic-phonetic approach [15], the pattern recognition approach [20], template based approaches [16], the Dynamic Time Warping (DTW) approach [21], knowledge based approaches [14], statistical based approaches [20], learning based approaches [14], artificial intelligence approaches [14], stochastic approaches [14], and approaches based on Deep Neural Networks (DNN) [17]. A brief description of each modelling technique follows.

◦ Acoustic-phonetic approach. Based on the theoryof vocal phonetics and assumptions.

◦ Pattern recognition approach. The method is divided into two separate parts: pattern training and pattern recognition. The method is based on HMMs and can be applied to a sound, a word, or a phrase.

◦ Template based approaches. Here the input speech is compared to a set of pre-recorded words in order to find the best match.

◦ Dynamic Time Warping. DTW is an algorithm that measures similarities between two sequences that may vary in time or speed. It has been successfully applied to audio data.

◦ Knowledge based approaches. A knowledge base for a language is encoded into a machine. Being able to acquire this knowledge, covering all cases of a language, and then successfully using it, is not a trivial task.

◦ Statistical based approaches. Oral speech is statistically modeled using automatically trained statistical processes. The main disadvantage of statistical models is that they require preliminary modelling assumptions. In ASR, HMMs have been implemented with great success.

◦ Learning based approaches. To overcome the drawbacks of applications using HMMs, learning methods such as neural networks and genetic algorithm programming can be introduced. In these learning models it is not necessary to provide clear rules or other knowledge.

◦ Artificial intelligence approach. This is a hybrid methodology between the acoustic-phonetic approach and the pattern recognition approach. However, this approach has had limited success, mainly due to the difficulty of quantifying existing knowledge.

◦ Stochastic approach. Probability models are used to deal with uncertain or incomplete information. The most popular stochastic approach is the HMM, characterized by a finite set of states and a set of output distributions.

◦ Deep Neural Network approaches. With the implementation of DNNs, error rates have been significantly reduced, due to computing power, the existence of large training data sets, and the better understanding of these models. DNNs are capable of modeling complex non-linear relationships between inputs and outputs. This ability is very important for high quality and accurate acoustic modeling.

In particular, CMU Sphinx and Pocketsphinx are based on modeling with the use of HMMs, i.e. on statistical based approaches.
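As a concrete illustration of one of the techniques above, the DTW alignment can be sketched with its classic dynamic program, where each cell holds the minimal cumulative cost of aligning prefixes of the two sequences. This is a minimal textbook formulation over plain numeric sequences, not the variant used by any particular ASR system:

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two numeric sequences.

    O(len(a) * len(b)) dynamic program: cost[i][j] is the minimal
    cumulative cost of aligning a[:i] with b[:j].
    """
    n, m = len(a), len(b)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])          # local mismatch cost
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]

# Two sequences with the same shape but different speed align perfectly.
print(dtw_distance([1, 2, 3, 4], [1, 1, 2, 3, 3, 4]))  # -> 0.0
```

The zero distance in the example shows why DTW suits speech: the second sequence is a "slower" rendition of the first, and the warping absorbs the timing difference.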

3 Hidden Markov Models

A Hidden Markov Model (HMM) is the machine learning model on which CMU Sphinx and Pocketsphinx are based. In his work, Baum [18] first described Hidden Markov Models, defining them as a collection of states connected by transitions. As noted by Jurafsky [19], we use them in speech recognition since they allow us to reason about both observed events and hidden events in a probabilistic model. In his tutorial,


Rabiner [20] describes the following components that specify an HMM:

1. N, the number of states in the model (starting and ending states included): Q = q1 q2 ... qN.

2. M, the number of distinct observation symbols per state, which correspond to the physical output of the system being modeled.

3. A, an N × N state transition matrix, where aij represents the probability of a transition from state i to state j:

aij = P [qt+1 = Sj | qt = Si], 1 ≤ i, j ≤ N.

4. bi, the observation output distribution in state i.

5. π, the initial state distribution π = {πi}, where

πi = P [q1 = Si], 1 ≤ i ≤ N.

Rabiner also described the steps which lead to the generation of an observation sequence O = O1 O2 ... OT from the HMM when appropriate values of N, M, A, B and π are given. Those steps are:

1. According to the initial state distribution π, choose an initial state q1 = Si.

2. Set t = 1.

3. According to the symbol probability distribution in state Si, choose Ot = vk.

4. Transit to a new state qt+1 = Sj according to the state transition probability distribution for state Si.

5. Set t = t + 1. If t < T, return to step 3; otherwise terminate the procedure.
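The generation procedure above can be sketched directly. The toy model below uses invented parameters for N = 2 states and M = 2 observation symbols; it is only an illustration of the five steps, not a speech model:

```python
import random

# Toy HMM: N = 2 states, M = 2 observation symbols (invented parameters).
states = ["S1", "S2"]
symbols = ["v1", "v2"]
A = {"S1": [0.7, 0.3], "S2": [0.4, 0.6]}   # state transition matrix a_ij
B = {"S1": [0.9, 0.1], "S2": [0.2, 0.8]}   # output distributions b_i
pi = [0.6, 0.4]                            # initial state distribution

def generate(T, seed=0):
    """Generate an observation sequence O = O1 O2 ... OT from the HMM."""
    rng = random.Random(seed)
    # Step 1: choose an initial state according to pi.
    q = rng.choices(states, weights=pi)[0]
    O = []
    for _ in range(T):
        # Step 3: emit a symbol according to B for the current state.
        O.append(rng.choices(symbols, weights=B[q])[0])
        # Step 4: transit to a new state according to A.
        q = rng.choices(states, weights=A[q])[0]
    return O

print(generate(5))
```

Each run yields a length-T observation sequence; the hidden state path itself is discarded, which is exactly what makes the states "hidden" from the observer.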

By knowing the two model parameters (N and M), the specification of the observation symbols, and the specification of the three probability measures (A, B, π), we can specify an HMM. In general, for convenience, we use the compact notation λ = (A, B, π) to indicate the complete set of parameters.

Rabiner [20] also introduced the idea that an HMM should be characterized by three fundamental problems [19]:

Problem 1 (Likelihood): Given an HMM λ = (A, B) and an observation sequence O, determine the likelihood P(O | λ).

Problem 2 (Decoding): Given an observation sequence O and an HMM λ = (A, B), discover the best hidden state sequence Q.

Problem 3 (Learning): Given an observation sequence O and the set of states in the HMM, learn the HMM parameters A and B.

To solve the three fundamental problems we use different algorithms, as presented in detail in [19]. Those are: the Forward algorithm for the likelihood problem, the Viterbi algorithm for the decoding problem, and the Forward-Backward (or Baum-Welch) algorithm for the learning problem.
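As an illustration of the decoding problem, a compact version of the Viterbi algorithm is sketched below. The states, observations and probabilities form an invented toy example (a weather-style model), not a speech model:

```python
def viterbi(obs, states, pi, A, B):
    """Viterbi algorithm: most likely hidden state sequence for obs.

    pi[s]    - initial probability of state s
    A[s][s2] - transition probability s -> s2
    B[s][o]  - probability of emitting observation o in state s
    """
    # Initialisation: probability of starting in each state and emitting obs[0].
    V = [{s: pi[s] * B[s][obs[0]] for s in states}]
    back = [{}]
    # Recursion: extend the best path into each state at each time step.
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prob, prev = max((V[t - 1][p] * A[p][s] * B[s][obs[t]], p)
                             for p in states)
            V[t][s] = prob
            back[t][s] = prev
    # Termination and backtracking.
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# Invented example: hidden weather states, observed activities.
states = ["Rainy", "Sunny"]
pi = {"Rainy": 0.6, "Sunny": 0.4}
A = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
     "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
B = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
     "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}
print(viterbi(["walk", "shop", "clean"], states, pi, A, B))
# -> ['Sunny', 'Rainy', 'Rainy']
```

In an ASR decoder the same recursion runs over HMM states of phonemes, with acoustic likelihoods in place of the toy emission table.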

4 CMU Sphinx and Pocketsphinx platforms

CMU Sphinx has been developed jointly by Carnegie Mellon University, Sun Microsystems Laboratories and Mitsubishi Research Laboratories [2]. It has been developed in Java and supports all types of audio models represented via HMMs. CMU Sphinx is highly flexible, supports all types of HMM-based acoustic models and multiple search strategies. It is an open source project that supports most common languages, and its code is freely available at SourceForge 1.

At the same time, it is easy to exploit the capabilities of computing machines with multiple processors in the process of training our model with large data sets. CMU Sphinx allows the user to alter the language model by parametrizing only one part of the system, without affecting the overall architecture.

Pocketsphinx is a lightweight version of CMU Sphinx. It was first intended to be a version of SPHINX-II for use on mobile devices and hardware with limited performance specifications. However, as Huggins-Daines showed [3], despite its reasonable performance it needed a new design architecture.

Pocketsphinx is developed in the ANSI C programming language, and its requirements were, namely: application portability, simple implementation, small code and data footprint, security, and memory performance [22]. Due to the specifications of the working platform addressed in this work, a laptop computer of modest computing power (Intel Core Duo with 4 GB RAM) is used for Pocketsphinx.

4.1 CMU Sphinx Design Architecture

Lamere in his work [2] gave a very detailed presentation of the CMU Sphinx platform and its architecture. The design architecture of CMU Sphinx and the interdependencies and communications among its blocks are depicted in Fig. 2. The main blocks are the Frontend, the Decoder and the Knowledge Base. All blocks, except those in the Knowledge Base, are independently replaceable custom-made software modules written in Java.

In the Frontend module speech is received and parametrized. The Decoder incorporates the linguist,

1https://goo.gl/nzRDmF


Figure 2: CMU Sphinx design architecture (Lamere et al., 2003).

the search manager, and the acoustic scorer, while the Knowledge Base includes the Acoustic Model (AM) and Language Model (LM) as well as the Lexicon (or Phonetic Dictionary).

The front end block and its several communicating blocks are shown in Fig. 3.

Figure 3: CMU Sphinx Front end architecture (Lamere et al., 2003).

Each block is activated via the excitation of its input and processes the information to determine whether it is voice data or a control signal. The control signal sets the beginning or ending of the speech, something very important for the decoder. If the incoming data is speech, it is processed and stored temporarily for the next block.

One of the design features of the program is the fact that all the block outputs can be used simultaneously. The input data is not necessarily an input of the first block; it can be introduced to numerous blocks. It is also possible for any of the blocks to be replaced for filtering purposes, reduction of noise, etc. The characteristics calculated using independent sources of information can be fed directly to the decoder, either alongside the speech signal or by skipping it altogether. The entire system allows continuous operation, and has the ability to function in a very flexible way.

The decoder incorporates the search manager, the linguist and the acoustic scorer blocks. The search manager is responsible for the construction and search of a probability tree for the best matching case. The search tree construction is based on the information provided by the linguist. Additionally, the search manager communicates with the acoustic scorer to obtain the scores for the incoming audio data. For a more detailed treatment of the algorithms used, the reader is referred to the Viterbi [23] and Bushderby [24] works.

According to [25], the linguist creates the search graph that the decoder uses when searching for words, while concealing the complexities associated with creating this graph. The search graph is created using the language structure, as represented by a given LM, and the topological structure of the AM. The linguist can also use a dictionary (usually a pronunciation dictionary) to match words from the LM to data sequences of the AM. When creating the search graph, the linguist can incorporate sub-units of arbitrary-length frames. The design allows the user to provide different configurations for various systems and recognition requirements. For example, a simple number-recognition application can use a simple linguist that keeps the entire search process in memory, while larger applications with vocabularies on the order of 100,000 words use smarter linguists that store only a certain amount of data from the search process in memory. The aim of the acoustic scorer is to calculate the output state probability or the probability density values for the different states for any given input vector.

A requirement is the ability to communicate with the Frontend blocks in order to obtain the extracted attributes. The scorer keeps all information about the probability densities of the output states, while the search block ignores whether the scoring is done with continuous, semi-continuous or discrete HMMs. Finally, the design allows the user to include conventional Gaussian scoring procedures, or scoring approaches based on game theory, for example.

4.2 Pocketsphinx Design Architecture

The Pocketsphinx decoder architecture has been derived mainly from Sphinx-II and its architecture design [22]. The design architecture is depicted in Fig. 4.

Figure 4: Pocketsphinx Decoding Architecture (Daines, 2011).


During the decoding phase, Pocketsphinx uses a multi-search approach 2. This involves the FWDTREE, FWDFLAT and BESTPATH stages.

The FWDTREE search method is based on a dictionary that has a tree structure. It separates the search graph into the vocabulary tree and the word sequence. In order to have a fast computational process, the method uses a large number of adjustable, width-wise beams. In particular, a separate beam is applied to the last phoneme of each word from all HMMs. The FWDTREE search method is quick but not very accurate.

The FWDFLAT search method searches a lattice created by the FWDTREE but does not re-rank it on the same grid. It essentially reassesses the active vocabulary that has already been assumed by the FWDTREE. Thus, the total number of candidate words is clearly more limited than the original number of words. Hence, a dynamic association of each phrase with only that set of words is possible during decoding, which is attractive for portable devices of modest specifications.

The BESTPATH lattice generation (best-path) method is a rescoring of the word lattice that has been generated by the FWDFLAT search. The rescoring methodology is based on Dijkstra's shortest path algorithm [26]. In fact, the first search method is the principal one and the other two are used for corrective and complementary purposes.
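The best-path idea can be illustrated with a plain Dijkstra search over a toy word lattice. The graph and its costs below are invented for the example; in a real decoder the edge weights combine acoustic and language-model scores over a lattice produced by the earlier search passes:

```python
import heapq

def dijkstra(graph, start, goal):
    """Lowest-cost path through a directed weighted graph (Dijkstra)."""
    queue = [(0.0, start, [start])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, w in graph.get(node, {}).items():
            if nxt not in seen:
                heapq.heappush(queue, (cost + w, nxt, path + [nxt]))
    return float("inf"), []

# Toy lattice: nodes are word hypotheses, weights stand in for negated log-scores.
lattice = {
    "<s>":   {"go": 1.0, "no": 2.5},
    "go":    {"left": 1.2, "right": 0.8},
    "no":    {"right": 0.5},
    "left":  {"</s>": 0.3},
    "right": {"</s>": 0.3},
}
print(dijkstra(lattice, "<s>", "</s>"))
```

The returned path, here "<s> go right </s>", corresponds to the word sequence with the lowest accumulated cost through the lattice.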

5 Implementation of the generic Greek model

The flow chart for the entire process is depicted in Figure 5 and a description of each phase follows.

Figure 5: Implementation of the generic Greek model - flow chart.

2https://goo.gl/N7bk3n

5.1 Implementation of the Greek language model

Language models calculate the probability of a hypothetical word sequence, creating the LM score. They can describe more complex languages, like Greek, and are ideal for cases with free-form input, such as verbal speech.

There are two ways to create the language model. The former involves small and limited vocabularies, making use of the on-line tool developed by Carnegie Mellon University 3. This method also assumes the existence of a generic model. With the latter, for larger language models (i.e. 20,000 words) and in the absence of a generic model, as in the Greek language case, the following steps are performed.

1. Installation of the CMUCLMTK package of programs that are provided by Carnegie Mellon University.

2. Preparation of the necessary texts, which need to be of a specific form. In particular, there are no capital letters, and no punctuation marks other than tones. Separate phrases are labeled by the <s> and </s> symbols, and there are no number symbols; numbers are instead represented in verbal form. For the Greek language model, the training data were created from various sources: texts from daily news, books of different themes, and subtitles from documentaries. These adopt 'simple spoken' language, avoiding any technical terms or specialized vocabulary. It is noted that it is possible to improve the language model by adding additional data. At present, the Greek language model consists of over 19,000 words that represent the most common words used in daily speech.

3. Thereafter, all text samples are transformed into binary format via the tools provided by the CMUCLMTK package, and the LM is created.

It is noted that for the application considered, to be used in ROV operations, a specific language model was created by the proper selection of particular words/phrases.
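The text preparation described in step 2 can be sketched as follows. The <s>/</s> markers follow the format described above, while the normalisation details (exactly which characters are stripped) are simplifying assumptions for the example:

```python
import re

def normalize_sentence(sentence):
    """Prepare one sentence for LM training text: lowercase, drop
    punctuation and digits (accented Greek letters, i.e. tones, are
    kept), and wrap the result in <s> ... </s> markers."""
    text = sentence.lower()
    # Keep letters (including accented Greek) and whitespace only.
    text = "".join(ch for ch in text if ch.isalpha() or ch.isspace())
    text = re.sub(r"\s+", " ", text).strip()
    return f"<s> {text} </s>"

print(normalize_sentence("Πήγαινε αριστερά, τώρα!"))
# -> <s> πήγαινε αριστερά τώρα </s>
```

One such line per sentence would then be written to the training corpus file that the CMUCLMTK tools consume; any number symbols would first need to be spelled out in verbal form, which this sketch does not attempt.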

5.2 Implementation of the Greek phonetic dictionary

One of the most important parts of the generic Greek model is the phonetic dictionary. It is essential for converting Greek words into a form that is compatible with the CMU Sphinx and Pocketsphinx software.

3http://www.speech.cs.cmu.edu/tools/lmtool-new.html


Initially, such a conversion involves a specific alphabet. The alphabet used internationally on the CMU Sphinx platform is ARPABET 4. ARPABET is the phonetic representation developed by ARPA (Advanced Research Projects Agency) in the framework of speech comprehension programs. Each phoneme of the English alphabet is matched with a separate ASCII character string. For the creation of the Greek vocabulary, two sub-tasks are performed: first, a sufficient number of Greek words is needed to create the dictionary, and second, it is necessary to convert this dictionary into a form compatible with the system. This is described in the sequel.

5.2.1 Dictionary creation

For the creation of the dictionary, open source software and applications are preferred. It is noted that it is not possible to include all words of our daily spoken language in the dictionary; for example, scientific terms, toponyms or surnames are not included. Still, such words can be added so that the dictionary responds effectively during the recognition process.

The first source of words used is provided by the Institute for Language and Speech Processing, as a result of the work by the researchers of [27].

Furthermore, the Greek lexicon 5 of the open source program OpenOffice/LibreOffice is used. The words are edited in order to avoid duplication of records, and they are converted to binary form to be used by the CMU Sphinx platform.

5.2.2 Word-to-phoneme conversion mechanism - the Greek phonemes

After the creation of the dictionary, the words are transformed into a set of phonemes compatible with the CMU Sphinx and Pocketsphinx system. Initially the Greek phonemes are defined and the transformation mechanism is deployed. In the Greek language there exist 24 letters and 25 phonemes:

◦ Greek letters α, β, γ, δ, ε, ζ, η, θ, ι, κ, λ, µ, ν, ξ, o,π, ρ, σ, τ, υ, φ, χ, ψ, ω

◦ Greek phonemes α, ε, ι, o, oυ, β, γ, δ, ζ, θ, κ, λ,µ, ν, π, ρ, σ, τ, φ, χ, µπ, ντ, γκ, τσ, τζ

There are phonemes that contain two letters, and there are letters, such as η and ι, that correspond to the same phoneme i. There exist the double letters ξ and ψ, and double letters corresponding to a single phoneme, such as the one corresponding to ε. In the Greek alphabet we also have the final sigma ς, and double letters

4 http://www.speech.cs.cmu.edu/cgi-bin/cmudict
5 http://www.elspell.gr/

such as ββ, κκ, λλ, etc. Finally, αυ and ευ correspond to a different phonetic realization, according to the letter that follows, as shown by Tsardoulias [6].

Taking into account all the above, together with the phonemes that CMU Sphinx uses, the list of Greek phonemes used in the Greek model is:

◦ Greek CMU Sphinx phonemes: AE, AA, B, CH, D, DH, EH, F, G, HH, IH, JH, K, KS, L, M, N, OW, P, PS, R, S, SIL, T, TH, UW, V, W, Z, ZB, ZD, ZDH, ZL, ZM, ZN, ZR, ZV, ZW

Provided the list of Greek phonemes, it is easy to create the phonetic dictionary with the use of an available algorithm. The specific steps during this procedure, in ARPABET format, are:

1. Conversion of all letters to minuscule.

2. Replacement of the double and triple combinations of letters with their corresponding phonemes, in the following order: ου, ού, σμπ, σντ, σγ, σβ, σδ, σμ, σν, σλ, σρ, μπ, ντ, γκ, γγ, τς, τζ, σς, κκ, ββ, λλ, μμ, νν, ππ, ρρ, ττ.

3. Replacement of the double special vowel combinations with the corresponding vocals, in the following order: αι, άι, ει, έι, οι, όι, υι, ύι, αυ, αβ, ευ, εβ, αυ, αφ, ευ, εφ.

4. Replacement of all other remaining letters with the corresponding phonemes.
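The ordered-replacement steps above can be sketched in Python. The mapping table below is deliberately partial and illustrative: it covers only a few digraphs and the plain single letters, and it omits the context-dependent αυ/ευ rules and the accented variants, which the full model handles.

```python
# Minimal grapheme-to-phoneme sketch following the ordered steps above.
DIGRAPHS = [                  # steps 2-3: longest matches first, in order
    ("ου", "UW"), ("μπ", "B"), ("ντ", "D"), ("γκ", "G"),
    ("αι", "EH"), ("ει", "IH"), ("οι", "IH"), ("τσ", "TS"), ("τζ", "DZ"),
]
SINGLES = {                   # step 4: remaining single letters
    "α": "AA", "ε": "EH", "η": "IH", "ι": "IH", "ο": "OW", "υ": "IH",
    "ω": "OW", "β": "V", "γ": "G", "δ": "DH", "ζ": "Z", "θ": "TH",
    "κ": "K", "λ": "L", "μ": "M", "ν": "N", "ξ": "KS", "π": "P",
    "ρ": "R", "σ": "S", "ς": "S", "τ": "T", "φ": "F", "χ": "HH", "ψ": "PS",
}

def to_phonemes(word: str) -> str:
    word = word.lower()                     # step 1: minuscule
    phones, i = [], 0
    while i < len(word):
        for gr, ph in DIGRAPHS:             # try digraphs before singles
            if word.startswith(gr, i):
                phones.append(ph)
                i += len(gr)
                break
        else:                               # no digraph matched here
            phones.append(SINGLES.get(word[i], word[i]))
            i += 1
    return " ".join(phones)

print(to_phonemes("μπροστα"))   # B R OW S T AA
```

Each dictionary line of the phonetic dictionary then pairs the word with this phoneme string, in the usual CMUdict-style format.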

After the completion of the Greek dictionary, a necessary step is the calibration and training of the Greek Acoustic Model (AM).

5.3 Training of the Greek AM

The most cumbersome task in the overall process of creating the generic Greek model is the training of the AM. This is a machine learning process that produces a series of statistical records containing information about speech units; in effect, it is the "mapping" of the language used in voice recognition services. For this mapping to be performed, the system is fed with a large number of audio files matched to the written, transcribed text. Depending on the computational power and the amount of data, this is the most costly step in the training of the AM. The more speakers and the greater the variety of spoken subjects included, the better the training outcome. For example, feeding the system with sound data that refer only to a scientific or economic

WSEAS TRANSACTIONS on SYSTEMS and CONTROL, Fotios K. Pantazoglou, Georgios P. Kladis, Nikolaos K. Papadakis, E-ISSN: 2224-2856, Volume 13, 2018


field may yield poor performance in speech recognition. Audio data are required to be of a specific format for use in Pocketsphinx and sphinxtrain 6 .

Sphinxtrain is responsible for the training of the AM. Training is necessary since an AM for a new language is required. Specifically, the audio data are in MS WAV format, mono, at 16 kHz, 16 bit. A range of different sound sources was chosen to achieve thematic and acoustic plurality. These are:

◦ Recordings of individuals. Individuals were asked to read texts or speak fluently, and they were recorded according to the general instructions given by the CMU Sphinx development team.

◦ News recordings. A large number of current news texts, from various thematic sections, was selected; these were then pronounced by a synthetic female voice with the help of the Google Translate service and recorded via the free Audacity software 7 .

◦ Free audiobooks. Free audiobooks provided by the Greek Ministry of Education 8 were used.

◦ Audio data from academic lectures. A very important source of good audio data was the open lectures of Hellenic universities found in the Open Courses repository 9 . These were edited to extract only the needed sound, and then processed to comply with the data specifications.

◦ Meetings of the Greek Parliament 10 . The repository of audiovisual material of the Greek Parliament contains a variety of subject material, as well as many different speakers. These were processed and stored in the appropriate format for the sphinxtrain training mechanism.
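The required audio format (mono, 16 kHz, 16-bit WAV) can be verified before training with a short check. This helper is not part of sphinxtrain; it is a convenience sketch using only the Python standard library, here demonstrated on a generated one-second silent clip.

```python
import io
import wave

def check_sphinx_format(wav_bytes: bytes) -> bool:
    """Verify a WAV file matches the sphinxtrain requirement:
    mono, 16 kHz sample rate, 16-bit samples."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        return (w.getnchannels() == 1 and
                w.getframerate() == 16000 and
                w.getsampwidth() == 2)      # 2 bytes = 16 bit

# Write a short silent clip in the required format and verify it.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setframerate(16000)
    w.setsampwidth(2)
    w.writeframes(b"\x00\x00" * 16000)      # one second of silence
print(check_sphinx_format(buf.getvalue()))  # True
```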

Overall, the audio material used for the training process had a total duration of 358:09:38 hours. The overall accuracy of the generic model is derived from the combination of all the individual statistics produced during the training process. Having the individual statistics, one can adjust the training and hardware parameters and then run recurrent training processes until better accuracy is achieved for the underpinned application.
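One standard statistic of this kind is the word error rate between a reference transcription and the decoder's hypothesis. The sketch below is the usual Levenshtein-distance-over-words computation; it is illustrative and not part of the sphinxtrain toolchain itself.

```python
# Word error rate: edit distance over words, normalized by reference length.
def wer(reference: str, hypothesis: str) -> float:
    r, h = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / len(r)

print(round(wer("take right turn", "take right burn"), 3))  # 0.333
```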

The implementation of the generic Greek model is the necessary step towards developing a smaller, use-case-oriented Greek model that can be used in ROV applications. The description of this model follows.

6 https://goo.gl/noanMQ
7 http://www.audacityteam.org/
8 http://ebooks.edu.gr/new/
9 http://opencourses.gr

10 http://www.parliament.gr/

6 The Greek model for ROV applications

ROVs are a family of submarine robotic vehicles that allow tasks to be performed in conditions and environments that are neither friendly nor easily accessible to humans. Connected via an umbilical cable with the support ship, they allow the operator to undertake scientific missions characterized by high precision in sampling and data quality. One of the most valuable tools of an ROV is its manipulator system.

Figure 6: Hydro-Lek manipulator during MaxROVER operations (https://goo.gl/DSmoz6).

Via such a system, the operator can perform anything from simple tasks (e.g., picking up objects) to more complicated procedures (e.g., maintenance, sampling, and surveying of the sea-bed) (Fig. 6). The manipulator systems are remotely controlled by the operator. The most basic manipulator system consists of a control device, arms, joints, and a gripper as the end-effector, where different tools can be attached. The most sophisticated systems have complex electronic control systems and a large number of degrees of freedom in movement.

One of the most important parts of the system is the operator's interface for performing tasks. The handling systems may vary according to the manufacturer and the degrees of freedom of the manipulator. These may range from simple levers for the operation of the manipulator to more sophisticated handling systems.

Such operations require dexterity and a good knowledge of the system by the operator. Usually, in ROV tasks, the work shift is six hours, which adds a further degree of difficulty to the entire process. The handling needs to be accurate and, at the same time, may require repetition. In this context, an ASR system that combines simple rules, ease of use, and repetition can decisively help to further increase the productivity of the tools used by the ROV and its operation by the user. Thus the user is able to operate the existing control and command systems on board using simple voice commands and focus on other tasks. Voice commands are clearly simpler and can improve the ease of operation in a number of applications.

With this in mind, the generic Greek model described in the previous section is used for the creation of a smaller, ROV-oriented Greek model. Since higher accuracy is needed, the generic Greek model is adapted using the technique presented by the CMU Sphinx development team. A smaller, pruned language model (LM) is also needed, since large LMs do not help the overall accuracy.

From our working experience, commands like stop, turn left, turn right, take sample, open, slow, more, less, close, add waypoint, mark, etc. can form a set of voice commands that can ease the work of an ROV team. Therefore, as a first step in the adaptation process, new phrases containing the potential command words were used to create the new pruned LM. The new, much smaller LM is created with the process already described in Section 5.1.
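Preparing the text corpus for such a pruned LM can be sketched as follows. The command inventory below simply reuses the example commands listed above; the `<s> ... </s>` sentence markers are the convention the CMU language-modeling tools expect, and the helper name is hypothetical.

```python
# Build a one-sentence-per-line corpus file for the LM toolchain.
COMMANDS = [
    "stop", "turn left", "turn right", "take sample", "open", "slow",
    "more", "less", "close", "add waypoint", "mark",
]

def make_corpus(commands):
    # Wrap each command phrase in the sentence markers used by the
    # CMU LM tools, one phrase per line.
    return "\n".join(f"<s> {c} </s>" for c in commands)

print(len(make_corpus(COMMANDS).splitlines()))  # 11
```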

Besides the LM, the new adapted AM that fits our ROV operations is fed with new training material for more accurate performance. The new training material comes from the recording of 30 phrases (depicted in Fig. 7, translated into English) by a large number of individuals. These phrases include the principal commands that an ROV operating team naturally uses. The phrases do not carry any conceptual meaning.

The individuals chosen for the data collection were divided equally with respect to gender. Thereafter, the recorded and edited data set was introduced to the sphinxtrain program for the adaptation process, which recalculates all the statistical information and characteristics needed by the Acoustic Model (AM).

During the testing process of the new ROV-adapted model, the accuracy of the model was in most cases 100%, as depicted in Fig. 8.

HCMR's 11 MaxROVER (Fig. 9) is an Observation/Light Work Class ROV that is certified to dive and work down to depths of 2000 m. It has been widely used for research projects and has the capability to monitor and record marine habitats, take samples, and conduct precise measurements of scientific interest.

11 https://www.hcmr.gr

Figure 7: Set of phrases used for the training of the ROV Greek model.

The manipulation system of MaxROVER consists of two simple 5-DOF manipulators provided by Hydro-Lek 12 . Those are controlled with the help of a simple and standard 6-joystick system. Extra tools, like a built-in line cutter, can be used to break, cut, or lift objects. Specially designed sampling and measurement apparatuses can be added to the MaxROVER, and this gives extra flexibility in the number of tasks it can undertake 13 . A built-in sampling basket is not present, and this fact differentiates the sampling strategy: separate baskets must be deployed in coordination with the ROV deployment. This adds more working load to the ROV operating team and reduces the number of simultaneous tasks that it can perform during a single dive.

Usually the HCMR ROV operating team consists of four technical scientists: a dive supervisor, a pilot, a sonar and arm manipulation operator, and a data-log responsible. Since the system is mainly operated in manual mode, a high degree of collaboration is needed between the ROV operating team members, the scientists on board, and the support ship personnel.

12 http://www.hydro-lek.com/
13 https://bit.ly/2zJLp4q

Figure 8: Part of the training report for the ROV Greek model.

Figure 9: HCMR MaxROVER ROV.

In a common use scenario, the Greek model for ROV operations can be used in a number of tasks undertaken by the pilot, the sonar and arm manipulation operator, and the data-log responsible. Simple small computers, like the Raspberry Pi 3 (which already supports Pocketsphinx), can be used to automate some of the work done.

The pilot could, for example, demand that the ROV follow a certain path along waypoints via voice commands like start, stop, take right turn, heading 60 degrees, etc. The sonar and arm manipulation operator can use voice commands to take samples and conduct measurements in an easier and faster way. Finally, the data-log responsible can, with the help of voice commands and specially programmed applications, handle a larger amount of collected data at the same time. It is noted that the accuracy of recognition is greatly influenced by factors such as the noise of the surrounding environment, the quality of the speaker's speech, and stress in the speech during operations.
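Wiring such recognized phrases to ROV actions can be sketched as a simple dispatcher: the hypothesis string returned by the recognizer (e.g. Pocketsphinx) is matched against registered commands. The command names and handler bodies below are illustrative, not the authors' actual control software.

```python
# Minimal command dispatcher sketch for recognized voice commands.
HANDLERS = {}

def command(phrase):
    """Register a handler for a recognized command phrase."""
    def register(fn):
        HANDLERS[phrase] = fn
        return fn
    return register

@command("stop")
def stop():
    return "thrusters idle"

@command("take right turn")
def right_turn():
    return "heading +90"

def dispatch(hypothesis: str) -> str:
    # Unknown or low-confidence hypotheses fall through to a no-op.
    return HANDLERS.get(hypothesis.strip().lower(), lambda: "ignored")()

print(dispatch("STOP"))             # thrusters idle
print(dispatch("take right turn"))  # heading +90
```

In practice, each handler would forward the action to the existing control and command systems on board rather than return a string.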

7 Conclusions

Human-machine interaction technologies like ASR have been gaining momentum over the past decades. Voice recognition and voice communication technologies have contributed significantly to a range of our day-to-day activities. Taking advantage of the rapid growth and spread of technology, we have at our disposal a range of tools that enable us to improve both our daily routine and our performance in a series of very demanding tasks.

CMU Sphinx and Pocketsphinx are ASR platforms that allow the development of those tools via a systematic methodology, which is well documented and easy to reproduce. Thanks to this systematic method, a Greek voice recognition interface for ROV applications was described in this work. Despite the special characteristics and features of the Greek language, it was shown herein that CMU Sphinx can easily accommodate such challenges and may be used to cater to any special requirements according to the application concerned. The interested reader is referred to [7] for a detailed treatment of the implementation of the generic Greek model for CMU Sphinx.

After the introduction of the generic Greek model for CMU Sphinx [7], the authors used the methodology provided by the CMU Sphinx development team to further implement a Greek model that can be used in ROV operations and applications. This was necessary since a specialized model was needed for a small-vocabulary application (ROV control).

The development of the ROV-oriented Greek model, including the new adapted Acoustic Model, the voice dictionary, and the new Language Model, was performed using a modest computing system.

Both the generic Greek model of CMU Sphinx and the Greek model created for ROV operations can be found at https://gitlab.sse.gr/fpantazoglou/omilia and https://goo.gl/9v3QqG for the interested reader.

Nowadays, microcomputers (like the Raspberry Pi) are used in various maritime applications [28]. They can be easily programmed and provide a capable environment where CMU Sphinx can be deployed. Future work of the authors will involve the deployment of the presented Greek model on a Raspberry Pi for the control of ROV manipulators. The new voice-guided control system will aid in reducing the work load of ROV pilots, improving accuracy and productivity.

Acknowledgements: The authors would like to thank Mr. Nickolay Shmyrev, developer and support officer for the CMU Sphinx platform. Mr. Shmyrev contributed decisively with his experience and enthusiasm during the training and customization phase of the implementation of the Greek model.

References:

[1] T. B. Martin, H. J. Zadell, E. F. Grunza, M. B. Hmcher, and D. R. Reddy, Speech Recognition by Machine: A Review, Proc. IEEE Conf. Rec. Soc. Amer. Trans. Comput. Tech. Rep., vol. 6443, no. 2, 1974, pp. 541-546.



[2] P. Lamere et al., The CMU SPHINX-4 Speech Recognition System, IEEE Intl. Conf. Acoust. Speech Signal Process. (ICASSP 2003), Hong Kong, vol. 1, 2003, pp. 2-5.

[3] D. Huggins-Daines, M. Kumar, A. Chan, A. W. Black, M. Ravishankar, and A. I. Rudnicky, Pocketsphinx: A Free, Real-Time Continuous Speech Recognition System for Hand-Held Devices, Proc. IEEE Int. Conf. Acoust. Speech Signal Process., vol. 1, 2006, pp. 185-188.

[4] CMU Sphinx Acoustic and Language Models, https://bit.ly/2SzLkYd, 2018.

[5] CMU Speech Recognition Toolkit, https://bit.ly/2QC00bL, 2018.

[6] E. G. Tsardoulias, A. L. Symeonidis, and P. A. Mitkas, An automatic speech detection architecture for social robot oral interaction, Proc. Audio Mostly 2015: Interaction With Sound, 2015.

[7] F. K. Pantazoglou, N. K. Papadakis, and G. P. Kladis, Implementation of the generic Greek model for CMU Sphinx speech recognition toolkit, eRA-12 International Scientific Conference, 2017.

[8] M. Anusuya and S. Katti, Speech recognition by machine: A review, Int. J. Comput. Sci. Inf. Secur., vol. 6, no. 3, 2009, pp. 181-205.

[9] D. Yu, Automatic Speech Recognition: A Deep Learning Approach, Springer, 2014.

[10] D. Stallard et al., The BBN TransTalk Speech-to-Speech Translation System, June 2009.

[11] M. L. Seltzer, Y. C. Ju, I. Tashev, Y. Y. Wang, and D. Yu, In-car media search, IEEE Signal Process. Mag., vol. 28, no. 4, 2011, pp. 50-60.

[12] A. Samah and A. A. Osman, Controlling Home Devices for Handicapped People via Voice Command Techniques, 2015.

[13] K. Geetha and E. Chandra, Automatic Speech Recognition: An Overview, Int. J. Eng. Comput. Sci., vol. 2, no. 3, 2013, pp. 633-639.

[14] S. K. Gaikwad, B. W. Gawali, and P. Yannawar, A Review on Speech Recognition Technique, Int. J. Comput. Appl., vol. 10, no. 3, 2010, pp. 16-24.

[15] J. F. Hemdal and G. W. Hughes, A feature based computer recognition program for the modeling of vowel perception, in Models for the Perception of Speech and Visual Form, MIT Press, 1967.

[16] R. K. Moore, Twenty things we still don't know about speech, Workshop on Progress and Prospects of Speech Research and Technology, 1994.

[17] G. Hinton et al., Deep Neural Networks for Acoustic Modeling in Speech Recognition, IEEE Signal Process. Mag., November 2012, pp. 82-97.

[18] L. E. Baum, An inequality and associated maximization technique in statistical estimation of probabilistic functions of a Markov process, Inequalities, vol. 3, 1972, pp. 1-8.

[19] D. Jurafsky and J. Martin, Hidden Markov Models, in Speech and Language Processing, Chapter 20, 2017.

[20] L. R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proc. IEEE, vol. 77, no. 2, 1989, pp. 257-286.

[21] P. Senin, Dynamic Time Warping Algorithm Review, December 2007, pp. 1-23.

[22] D. Huggins-Daines, An Architecture for Scalable, Universal Speech Recognition, PhD thesis, 2011.

[23] A. J. Viterbi, Error bounds for convolutional codes and an asymptotically optimum decoding algorithm, IEEE Trans. Inf. Theory, vol. 13, no. 2, 1967, pp. 260-269.

[24] R. Singh, M. K. Warmuth, B. Raj, and P. Lamere, Classification with Free Energy at Raised Temperatures, EuroSpeech, no. 2, 2003, pp. 1773-1776.

[25] W. Walker et al., Sphinx-4: A Flexible Open Source Framework for Speech Recognition, Sun Microsystems Laboratories, TR-2004-139, 2004, pp. 1-9.

[26] E. W. Dijkstra, A note on two problems in connexion with graphs, Numer. Math., vol. 1, no. 1, 1959, pp. 269-271.

[27] A. Protopapas, M. Tzakosta, A. Chalamandaris, and P. Tsiakoulis, IPLR: an online resource for Greek word-level and sublexical information, Lang. Resour. Eval., vol. 46, no. 3, Sep. 2012, pp. 449-459.

[28] G. Divya Priya and I. Harish, Raspberry Pi Based Underwater Vehicle for Monitoring Aquatic Ecosystem, International Journal of Engineering Trends and Applications (IJETA), vol. 6, iss. 2, Mar-Apr 2015.
