
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 22, NO. 12, DECEMBER 2014 2701

A Generic and Scalable Architecture for a Large Acoustic Model and Large Vocabulary Speech Recognition Accelerator Using Logic on Memory

Ojas A. Bapat, Paul D. Franzon, Fellow, IEEE, and Richard M. Fastow

Abstract— This paper describes a scalable hardware accelerator for speech recognition, which uses a two-pass decoding algorithm with word-dependent N-best Viterbi beam search. The observation probability calculation (senone scoring) and the first pass of decoding, using a bigram language model, are implemented in hardware. The word lattice output from the first pass is used by software for the second pass, with a trigram language model. The proposed design uses a logic-on-memory approach to exploit high-bandwidth NOR flash memory, improving random read performance for senone scoring and first-pass decoding, both of which are memory-intensive operations. The proposed HW/SW co-design achieves an overall speedup of 4.3X over a 2.4-GHz Intel Core 2 Duo processor running the CMU Sphinx speech recognition software, while consuming an estimated 1.72 W of power. The hardware accelerator provides improved speech recognition accuracy by supporting larger acoustic models and word dictionaries while maintaining real-time performance.

Index Terms— Accelerator, beam search, embedded, hardware software co-design, logic on memory, multipass decoding, N-best, speech recognition, Sphinx.

I. INTRODUCTION

A MAJORITY of continuous speech recognition algorithms use hidden Markov models (HMMs). The excessive memory bandwidth and computing power required to obtain high recognition accuracy in real time are the two main factors keeping speech recognition from mass adoption on embedded platforms. The use of smaller acoustic models and word dictionaries to maintain real-time performance induces inaccuracy in recognition. The high computational requirement uses up most of the resources on a general-purpose CPU, and the acoustic and language models occupy most of the cache and dynamic RAM (DRAM). This results in resource contention and leaves the CPU unable to perform any other task alongside speech recognition.

Manuscript received October 18, 2012; revised June 10, 2013; accepted December 3, 2013. Date of publication January 21, 2014; date of current version November 20, 2014. This work was supported by the Research under Contract 5582374 through Spansion Inc. and NC State University.

O. A. Bapat was with the Department of Electrical and Computer Engineering, North Carolina State University, Raleigh, NC 27695 USA. He is now with Spansion Inc., Sunnyvale, CA 94085 USA (e-mail: [email protected]).

P. D. Franzon is with the Department of Electrical and Computer Engineering, North Carolina State University, Raleigh, NC 27695 USA (e-mail: [email protected]).

R. M. Fastow is with Spansion Inc., Sunnyvale, CA 94085 USA (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TVLSI.2013.2296526

Existing hardware solutions [1]–[6] are far from generic and are optimized for use with a given set of acoustic and N-gram language models. Many of them are unusable with other models, while others suffer considerable performance degradation when used with other models.

Existing hardware/software co-designs [7]–[10] mainly calculate the observation probability in hardware. However, it is unclear which tasks in the algorithm should be fixed in hardware and which parts should be software controllable through commands, to allow maximum flexibility while providing improved performance.

In this paper, a scalable and portable hardware accelerator for speech recognition is proposed. It accelerates the acoustic modeling and decoding stages of the speech recognition algorithm. A multipass decoding approach [11], which splits the speech decode process into two parts, has been used. The first pass is carried out on a large search space, using a simple language model and an N-best Viterbi search [12], [13], which works reasonably well at coarse recognition and reduces the size of the search space. The second pass is carried out on a smaller search space, using more sophisticated N-gram [14] language models. The first pass has been implemented in hardware and the second pass in software. This keeps the performance of the hardware unaffected by the use of sophisticated models, without compromising the end accuracy of the recognition process. Moreover, it makes the hardware design much simpler. The software used here to demonstrate the benefits of the proposed architecture is Sphinx 3.0 [15], developed at Carnegie Mellon University.

This paper is organized as follows. Section II discusses the fundamentals of HMM-based speech recognition and multipass decoding. Section III describes the trade-offs between hardware complexity, power consumption, and speech recognition accuracy. Section IV describes the proposed hardware architecture. In Section V, the modeling methodology used for performance, area, and power estimation is presented. Finally, Section VI discusses results and comparisons with related work.

II. BACKGROUND

This section describes the working of an HMM-based speech recognition system. A complete speech recognition system is shown in Fig. 1. It mainly consists of three parts: front-end digital signal processing, acoustic modeling, and

1063-8210 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


Fig. 1. Complete speech recognition system.

TABLE I

STATISTICAL MODELS USED IN SPEECH RECOGNITION

speech decoding. In the acoustic modeling stage, the features of incoming speech are compared to pretrained acoustic models to find which phones are closest to the observed speech. The decoding stage uses the Viterbi beam search [16] algorithm to find the most likely sequence of phones for the observed speech. Each stage in the algorithm uses models that represent the probabilities of sounds, sequences of sounds, words, and sequences of words in the language. Gaussian distributions are used to represent the nature of the sounds, and HMMs are used to model the sequences and durations of the sounds. Word sequences and their probabilities are stored as weights, which are added during the decode process. The models commonly used are shown in Table I.

A. Front End

The goal of a speech recognition system is to recognize the uttered sequence of words. In the front end, input speech is sampled and a spectral analysis is performed to generate feature vectors representing this speech. These feature vectors are generated at set intervals called frames. Each such feature vector is called an observation. The duration of a frame depends on the front end; CMU Sphinx uses 10-ms frames. Thus, the input speech is converted to an observation sequence.
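As a concrete illustration of the framing step, the sketch below slices sampled audio into consecutive 10-ms frames, each of which would then yield one observation vector. This is illustrative only; the 16-kHz sampling rate is an assumption, not a detail taken from the paper.

```python
# Illustrative sketch (not the Sphinx front end): slicing sampled audio
# into 10-ms frames, each of which would yield one observation vector
# after spectral analysis.

def make_frames(samples, sample_rate=16000, frame_ms=10):
    """Split a 1-D list of samples into consecutive non-overlapping frames."""
    frame_len = sample_rate * frame_ms // 1000   # 160 samples at 16 kHz
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

# One second of (dummy) audio becomes a 100-frame observation sequence.
frames = make_frames([0.0] * 16000)
assert len(frames) == 100 and len(frames[0]) == 160
```

A real front end would additionally apply windowing and overlap between frames; only the 10-ms cadence is taken from the text.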

B. Acoustic Modeling

Every spoken word in the language is represented in terms of basic sounds called phones. The pronunciation of every phone is affected by its context, i.e., the phones preceding and succeeding it. Thus, phones are combined with their neighboring phones to form context-dependent units called triphones. The total number of possible triphones can be

Fig. 2. HMM for a triphone.

very large for any language; e.g., there are 50 phones in the English language, giving 50^3 possible triphones. Each triphone is represented by a statistical HMM, as shown in Fig. 2. The proposed design supports N-state HMMs. In the rest of this paper, the terms triphone and HMM will be used interchangeably.

Each state in the HMM generates observation probabilities b_j(Y_t), shown in (1). Each state in the HMM is represented by a multivariate Gaussian mixture model. In a multivariate mixture, vectors are used to represent the mean and variance parameters of the Gaussian distributions. The dimensionality of these vectors is governed by the dimensionality of the features used to represent incoming speech, which in turn depends on factors such as the type of front-end filter/window functions used and the input sampling rate. The parameters of these Gaussian distributions are set by training and are called the acoustic model. To avoid redundancy and to reduce training effort, states with the same Gaussian distributions are combined into one state called a senone [17].
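The state tying described above can be sketched as follows. The triphone names and Gaussian-parameter placeholders are hypothetical, and real trainers tie states using decision trees rather than exact parameter equality; the sketch only shows the effect of tying, i.e., that each distinct distribution gets one shared senone ID and is scored once per frame.

```python
# Hypothetical sketch of state tying: HMM states whose Gaussian parameters
# are identical are mapped to a single shared senone ID.

def tie_states(state_params):
    """state_params: dict of (triphone, state_index) -> hashable Gaussian params.
    Returns (senone_ids, state_to_senone)."""
    senone_ids = {}        # params -> senone ID
    state_to_senone = {}
    for state, p in state_params.items():
        if p not in senone_ids:
            senone_ids[p] = len(senone_ids)
        state_to_senone[state] = senone_ids[p]
    return senone_ids, state_to_senone

# Two triphones of "a" sharing the same first-state distribution "g1".
params = {("k-a+t", 0): "g1", ("k-a+p", 0): "g1", ("k-a+t", 1): "g2"}
_, tied = tie_states(params)
assert tied[("k-a+t", 0)] == tied[("k-a+p", 0)]   # one shared senone
```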

In the acoustic modeling (observation probability calculation) stage of the recognition process, the observation probabilities of the senones are calculated by computing the distance between the incoming feature vector and the Gaussian distribution of each senone, as shown in (1). This stage is called senone scoring:

$$\log b_j(Y_t) = \sum_{m=1}^{M} C_{jm} \sum_{n=1}^{N} \left( Y_t[n] - \mu_{jm}[n] \right)^2 \cdot V_{jm}[n]. \quad (1)$$
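As a sanity check, (1) can be transcribed directly into code: a weighted sum over the M mixtures of the variance-scaled squared distance between the feature vector Y_t and each mixture mean. Following the paper's form, the log-domain constants are assumed to be folded into the mixture weights C_jm.

```python
# Direct transcription of (1); constants assumed folded into C[m].

def senone_score(Y, C, mu, V):
    """Y: feature vector (length N); C: mixture weights (length M);
    mu, V: M x N mean and (inverse-)variance arrays for one senone."""
    M, N = len(C), len(Y)
    return sum(C[m] * sum((Y[n] - mu[m][n]) ** 2 * V[m][n] for n in range(N))
               for m in range(M))

# Two mixtures, two feature dimensions: the first mixture matches Y exactly.
score = senone_score([1.0, 2.0], [0.5, 0.5],
                     [[1.0, 2.0], [0.0, 0.0]],
                     [[1.0, 1.0], [1.0, 1.0]])
assert score == 2.5
```

The hardware would evaluate this for every senone in the library, for every frame, which is why the stage dominates memory bandwidth.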

C. Speech Decoding

The objective of the decoder is to find the most probable sequence of words from the language model, given the observation sequence. The probability of this word sequence is shown in (2). The term P(O|W) in (2) is called the observation or acoustic probability and is calculated by senone scoring. It is calculated for every HMM state during the decoding process. P(W) is obtained from the language model:

$$W_1 \ldots W_n = \arg\max_{W} \frac{P(W) \cdot P(O|W)}{P(O)} \quad (2)$$

$$\log(\delta_t(j)) = \max_{1 \le i \le N} \left[ \log(\delta_{t-1}(i)) + \log a_{ij} \right] + \log b_j(Y_t). \quad (3)$$

Equation (3) is used to calculate the probability for HMMs in the language model over the entire observed sequence. The term $\delta_t(j)$ is the probability that an HMM would be in state j for an observation t. This calculation needs to be repeated for each state in the language model that is active for the given observation frame t. Each HMM can be a sequence


Fig. 3. Hardware software partitioning for multipass speech recognition.

of N states. In the models used, each HMM consists of three states and is called a triphone. The Viterbi beam search [18] keeps active only those HMMs that are above a predetermined threshold, since it is impractical to calculate the probability of all HMMs in the language model. Previous work [18] suggests that the best possible recognition accuracy for a given language model can be obtained with approximately 10% of the HMM states in the language model kept active. Simulations with CMU Sphinx confirm that this is true for the Wall Street Journal CSR I corpus.
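One frame of the beam-pruned update in (3) can be sketched as below. The dictionary-based state representation is illustrative, not the hardware's data layout: each active state j takes the best predecessor score plus the log transition weight, adds the senone score log b_j(Y_t), and states falling more than the beam width below the frame's best score are deactivated.

```python
import math

# Sketch of one frame of the beam search in (3) with pruning.

def viterbi_frame(active, log_a, log_b, beam):
    """active: dict state -> log score from frame t-1.
    log_a: dict (i, j) -> log transition probability.
    log_b: dict j -> senone score for frame t.
    Returns the surviving states and their scores for frame t."""
    new_scores = {}
    for (i, j), a in log_a.items():
        if i in active:
            cand = active[i] + a + log_b[j]        # eq. (3) for one (i, j)
            if cand > new_scores.get(j, -math.inf):
                new_scores[j] = cand
    best = max(new_scores.values())
    return {j: s for j, s in new_scores.items() if s >= best - beam}

active = {0: 0.0}
log_a = {(0, 0): math.log(0.5), (0, 1): math.log(0.5)}
log_b = {0: -10.0, 1: -1.0}
survivors = viterbi_frame(active, log_a, log_b, beam=5.0)
assert set(survivors) == {1}   # state 0 falls outside the beam and is pruned
```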

D. Multipass Decoding Using N-Best Search

Single-pass decoding uses the Viterbi beam search algorithm and computes the most probable sequence of phones, given the observed speech. This sequence of phones usually corresponds to the most probable sequence of words. However, this assumption may not hold if the probable path contains a dictionary word with multiple pronunciations. Also, the Viterbi algorithm assumes the dynamic programming invariant, i.e., if the ultimate best path for the observation passes through state K, then it must include the best path up to and including state K. Higher order N-gram [14] language models do provide better accuracy by attaching probabilities to sequences of words, but they also violate the dynamic programming invariant and need backtracking to find the best path, or risk losing this information.

In multipass decoding [11], the decoding process is divided into two passes. The first pass of decoding follows the N best paths at every node [11], [12] and uses a simple unigram/bigram model. In the proposed design, the first pass is implemented in hardware using a word-dependent N-best search [13], which follows multiple paths at the word level. This approach has lower complexity than other approaches [13] and is very generic, as it does not make use of multiple word sequences. A unigram/bigram language model is used. This model stores probabilities for individual words and pairs of words only and does not violate the dynamic programming invariant. This eliminates the need for backtracking and simplifies the hardware implementation. It also allows the use of a wide beam width and a larger vocabulary for the Viterbi search. The output of this coarse first decoding pass is a lattice of identified words. This lattice includes multiple paths, not just the best path, and is of a much lower order compared with the entire word vocabulary. The second decoding pass, which is implemented in software, uses this word lattice as its input and rescores it using a sophisticated N-gram language model, e.g., a trigram, to obtain the best hypothesis.

The proposed hardware/software partitioning scheme is shown in Fig. 3. The use of multipass decoding with an N-best search on a unigram/bigram language model makes this implementation very generic and usable as the coarse first pass for speech recognition in any application. The beam width for the first pass has to be wide enough that the word lattice always includes the best possible hypothesis. The second decoding pass is in software and can be very application specific; for example, it can have an N-gram for "Call XYZ from company ABC on cell phone." The multipass decoding approach is very well suited to a hardware/software co-design because it keeps the first decode pass generic and does not require any feedback from the second decode pass. This allows the speech recognition front end, senone scoring, and both decode passes to work completely in parallel.
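The division of labor between the two passes can be sketched end to end. The lattice representation here (a flat list of candidate word sequences with accumulated acoustic scores) and the toy trigram function are simplifications of the real lattice structure, used only to show how the software pass rescores the hardware pass's output.

```python
# Hypothetical sketch of the two-pass flow: the first (hardware) pass emits
# multiple candidate paths; the second (software) pass rescores complete
# paths with a trigram language model and picks the best hypothesis.

def rescore(lattice_paths, trigram_logprob):
    """lattice_paths: list of (word_tuple, acoustic_log_score) from pass 1."""
    def path_score(words, acoustic):
        lm = sum(trigram_logprob(words[i - 2:i + 1]) for i in range(2, len(words)))
        return acoustic + lm
    return max(lattice_paths, key=lambda p: path_score(p[0], p[1]))

# Two candidate paths from the first pass; the second is acoustically better.
paths = [(("call", "ann", "now"), -40.0), (("call", "an", "ow"), -39.0)]
# Toy trigram model that strongly prefers the phrase "call ann now".
tri = lambda w3: 0.0 if tuple(w3) == ("call", "ann", "now") else -5.0
best = rescore(paths, tri)
assert best[0] == ("call", "ann", "now")   # trigram rescoring flips the choice
```

Because the trigram only ever sees complete first-pass paths, no feedback into the hardware pass is needed, which is the property the partitioning relies on.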

III. DESIGN SPACE EXPLORATION

A SystemC model of the hardware was developed and used with CMU Sphinx 3.0 [15] for design space exploration and performance analysis of the speech recognition system. The parameterized SystemC model allowed investigation of the trade-offs for various HW/SW partitioning schemes and the impact of various hardware configurations on the overall performance of the system. Both the hardware and the communication interface were modeled in SystemC to observe the end-to-end latency for any operation. The parameters of the hardware model for each HW/SW split investigated were chosen such that the hardware performs in real time for each split. The results of the simulation for each split represent the impact of software and interface latency on the HW/SW co-design. The details of the SystemC model are discussed in Section V.

The first step in design space exploration is to estimate the CPU and memory bandwidth requirements for the speech recognition algorithm (Table II). The senone scoring process is the most CPU and memory intensive, as it involves calculating two summations and reading the entire acoustic model. This phase can greatly benefit from hardware acceleration. The decode stage is not as CPU and memory intensive; however, the random nature of memory accesses and the data dependency between them slow down this phase of the algorithm. The second pass of decoding works on the word lattice, which is very small compared with the entire dictionary. Hence, it does not require much CPU and memory bandwidth. Multiple hardware/software partitions are possible for an HMM-based speech recognition system. One is to accelerate only the observation probability calculation stage [1], [2], [9] in hardware while the word search is performed in software (Split 1). The next step is to accelerate the processing of HMMs (triphones). In this split (Split 2), the HMM scores are sent back to the software every frame, and the software performs the transitions from one HMM to the next. In Split 3, the hardware performs transitions for HMMs within a single word, and the interword transitions are performed by software. The hardware sends back a list of identified words to the software


TABLE II

CPU AND MEMORY BANDWIDTH REQUIREMENTS FOR VARIOUS STAGES IN THE SPEECH RECOGNITION PROCESS

TABLE III

HARDWARE SOFTWARE PARTITIONS INVESTIGATED

Fig. 4. Performance comparison with Sphinx 3.0 running on a desktop PC with various HW/SW splits for acceleration.

TABLE IV

COMMUNICATION BANDWIDTH REQUIREMENT FOR VARIOUS HW/SW SPLITS

and receives a new list of words to be activated every frame. The next step is to bring the interword transitions into hardware (Split 4). Table III shows the various HW/SW splits investigated in this paper.

The software used was Sphinx 3.0 running on a 2.4-GHz Intel Core 2 Duo processor with 4-GB RAM. The acoustic model used had 8000 senones and eight Gaussian mixtures per senone [19], with a 64-K word dictionary, a bigram language model for the first pass of decoding, and a trigram language model [20] for the second pass. The goal was to observe the performance improvement and communication bandwidth requirements for all possible HW/SW partitions while keeping the word error rate (WER) constant. Fig. 4 and Table IV show the simulation results. It can be observed that implementing more tasks in hardware achieves greater performance improvement. Offloading the interword transitions (Split 4) into hardware provides a huge performance benefit, since they involve many memory accesses. The improved performance

Fig. 5. WER for different language models in pass I and trigram in pass II.

translates into better end accuracy of speech recognition by enabling the use of larger acoustic and language models in real time.

Another factor that is important in a HW/SW co-design is the communication bandwidth requirement between the hardware accelerator and the CPU. Table IV shows the communication requirements for the various HW/SW splits. Split 1, which is implemented in many systems today [1], [2], [8], [9], requires nominal communication bandwidth. This system provides good performance improvement for large acoustic models but not for large language models, as most of the time is spent by software doing the word search. Split 2 requires large communication bandwidth and a very fast interface. Split 3 is very efficient in terms of bandwidth requirements; however, interword transitions, which are memory intensive and can benefit from hardware acceleration, are still done in software. Moreover, in this partitioning scheme, to maintain sequential DRAM access in hardware (which provides a huge performance boost), the incoming list of words to be activated has to be sorted by software to match the order of the active word list in hardware. This results in additional software overhead. Split 4 offers both low communication bandwidth and high performance improvement. Moreover, this scheme allows the hardware to run in parallel with the software, without any feedback from software. The output of the hardware is a reduced search space (word lattice) on which the software can work in the second decode pass to find the best hypothesis. The HW/SW split chosen for the proposed design is Split 4.

For high-accuracy real-time speech recognition, it is necessary to support large acoustic models and large word vocabularies. Another factor that affects accuracy is the use of sophisticated language models, which model probabilities of sequences of words rather than individual words. As observed in Fig. 5, a trigram language model performs better than a bigram model for a multipass decoder with a one-best algorithm. It is difficult to build hardware that is optimized to work equally well with any N-gram language model. Working with larger N-gram models increases the number of random accesses to memory. It also greatly complicates the architecture, as the history of the last N words and backtracking of the best path become necessary. In addition, searching or


Fig. 6. Hardware architectures explored I. (a) Baseline architecture. (b) Architecture with no feedback for senone activation.

hashing techniques are needed to determine whether an N-gram is present in the model for a given active word sequence. Instead, a bigram model with an N-best algorithm and a wider beam width in the first pass can be used to improve the WER. This is evident from the simulation results shown in Fig. 5. The WER obtained using bigram models with increased beam width in the first pass and trigram models in the second pass is statistically similar to the WER obtained using a trigram model in both passes.

In this paper, the hardware is designed to work with unigram/bigram language models to reduce hardware complexity. Unigrams can be indexed directly using word indices (IDs), and bigrams can be indexed using either the source or destination word ID. Direct indexing keeps memory accesses very simple. The multipass decoding approach [11] is implemented, where the first decode pass is a word-dependent approximate N-best time-synchronous Viterbi beam search algorithm [12], [13], which returns a word lattice representing multiple best paths rather than a single path. The beam width for the first pass, which uses a bigram model, is increased to make sure that the best hypothesis is not omitted from the output of the first pass. The second pass, using a trigram language model, works on the word lattice generated in the first N-best pass and chooses the best hypothesis.
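The direct indexing described above can be sketched as follows. The backoff-to-unigram fallback for unseen word pairs is an assumption about the language model, not a detail given in the paper; the point of the sketch is that a transition lookup is a plain array index by source word ID, with no searching or hashing.

```python
# Sketch of direct indexing: unigram weights live in a flat array indexed
# by word ID, and bigram rows are indexed by source word ID, so a word
# transition lookup needs no search or hashing.

def bigram_weight(src, dst, unigram, bigram_rows, backoff=-3.0):
    """bigram_rows[src] maps destination word ID -> log weight for observed
    pairs; otherwise fall back to a backoff penalty plus the unigram weight
    (an assumed backoff scheme, for illustration)."""
    row = bigram_rows[src]                 # direct index by source word ID
    return row.get(dst, backoff + unigram[dst])

unigram = [-1.0, -2.0, -0.5]               # indexed by word ID
bigram_rows = [{1: -0.2}, {}, {0: -0.1}]   # one row per source word ID
assert bigram_weight(0, 1, unigram, bigram_rows) == -0.2
assert bigram_weight(1, 2, unigram, bigram_rows) == -3.5   # backoff path
```

In the hardware, a "row" corresponds to a prefetchable block of flash; fetching whole rows at once is what the bigram prefetch described later exploits.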

Design space exploration started with a baseline design for the hardware, shown in Fig. 6(a), in which the Senone score unit (SSU) and Viterbi decoding unit (VU) share the same memory and do not work at the same time. The decoding stage provides feedback to the senone scoring stage, which activates the senones that need to be scored for the next frame. As seen in Fig. 9, for a system that uses a wide beam width (larger than 10%) and a large vocabulary (64-K words), 92.2% of the senones are always active, compared to 25% for a smaller vocabulary of 5 K with a narrow beam width. The activation of senones for each frame not only introduces added complexity and memory requirements, but also introduces a dependency between the decode stage and the acoustic modeling stage. Breaking this feedback loop allows senone scoring to be performed independently and in parallel with the decode stage, with the small overhead of scoring 8% more senones. While the decode stage works on frame N, the senone scores for the next frame can be calculated.

Fig. 7. Hardware architectures explored II. (a) Architecture with NOR flash for acoustic and language model storage. (b) Architecture with cache for HMMs and prefetch for bigrams.

Fig. 8. Comparison between hardware accelerator with and without feedback for senone activation.

Fig. 9. Number of senones active for each frame for a 5-K dictionary with narrow beam width and a 64-K dictionary with wide beam width.

This architecture is shown in Fig. 6(b). It offers significant performance improvement for a modest increase in power consumption (see Fig. 8).
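The benefit of breaking the feedback loop can be illustrated with a toy latency model, in which scoring of frame t+1 overlaps decoding of frame t. The cycle counts are arbitrary; only the structural point, that the decoupled units bound throughput by the slower stage rather than the sum of both, reflects the text.

```python
# Toy latency model (arbitrary cycle counts): with all senones scored every
# frame, the SSU can score frame t+1 while the VU decodes frame t.

def serial_cycles(frames, score_t, decode_t):
    """Baseline of Fig. 6(a): scoring and decoding alternate each frame."""
    return frames * (score_t + decode_t)

def pipelined_cycles(frames, score_t, decode_t):
    """Fig. 6(b): pipeline fill, then throughput set by the slower stage."""
    return score_t + frames * max(score_t, decode_t)

assert pipelined_cycles(100, 4, 5) < serial_cycles(100, 4, 5)
```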


TABLE V

DATA ACCESSED EVERY FRAME FOR SENONE SCORING AND DECODE PASS I

TABLE VI

MEMORY ACCESS EFFICIENCY FOR DIFFERENT HARDWARE CONFIGURATIONS

For the senone scoring stage, the entire acoustic model has to be read for each frame. For the decode stage, the HMM data and language model weights need to be read for each active HMM. Table V shows the distribution of the data read in this stage. It turns out that 86% of the data accesses are to non-volatile (read-only model) data. Hence, the proposed design uses the logic-on-memory approach with an on-chip high-bandwidth NOR flash memory [Fig. 7(a)]. A combination of on-chip static RAM (SRAM) and off-chip DRAM was used for storing the volatile data, such as senone scores and active HMM scores. Simulations were performed to determine the access efficiency of multiple memory configurations (see Table VI). Memory efficiency is defined as the ratio of the actual bandwidth achieved in simulation to the theoretical maximum memory bandwidth for the given memory configuration. For senone scoring, since the entire acoustic model is read sequentially, the DRAM provides good access efficiency. The flash memory performs slightly better because it does not need activation, refresh, and precharge. For the Viterbi decoder, the accesses to word models, HMM models, and bigrams are random. Hence, memory access efficiency is greatly improved by storing the models in a high-bandwidth NOR flash memory, which supports fast random access.

Two factors make the architecture in Fig. 7(b) beneficial for this design. For the decode phase, the word model/HMM model can be read only once per word and cached, so that it can be reused for the remaining active instances of the word. Fig. 10 shows that as the number of instances of a word increases, the cache hit rate increases dramatically. The optimal size of the cache was found to be as small as five HMMs within a word. The cache is purged when moving from one word to the next. A higher cache hit rate means that the available flash bandwidth can be used to prefetch multiple rows of bigrams for word transitions. This masks the bigram fetch latency and improves performance. Based on the factors discussed in this section, the architecture shown in Fig. 7(b) and HW/SW Split 4 were chosen for this design; they are discussed in detail in the upcoming sections.
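The word-local caching policy can be sketched as below. The FIFO eviction and the flash-read callback are illustrative choices, not the hardware's actual replacement logic; the sketch shows the two behaviors taken from the text, reuse across a word's active instances, and a purge at every word boundary.

```python
# Sketch of the word-local HMM cache: model data for a word's HMMs is read
# from flash once and reused across the word's active instances; the cache
# is purged on every word boundary. The 5-entry capacity follows the paper.

class WordLocalHMMCache:
    def __init__(self, capacity=5):
        self.capacity = capacity
        self.store = {}
        self.hits = self.misses = 0

    def fetch(self, hmm_id, read_from_flash):
        if hmm_id in self.store:
            self.hits += 1
        else:
            self.misses += 1
            if len(self.store) >= self.capacity:
                self.store.pop(next(iter(self.store)))   # simple FIFO evict
            self.store[hmm_id] = read_from_flash(hmm_id)
        return self.store[hmm_id]

    def purge(self):                          # called on each word boundary
        self.store.clear()

cache = WordLocalHMMCache()
reads = []
flash = lambda h: reads.append(h) or h        # records each flash access
for instance in range(4):                     # 4 active instances of a word
    for hmm in ("h0", "h1", "h2"):
        cache.fetch(hmm, flash)
cache.purge()
assert cache.misses == 3 and cache.hits == 9  # flash read once per HMM
```

The hit rate frees flash bandwidth, which the design spends on prefetching bigram rows for upcoming word transitions.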

Simulations were performed with 400 utterances from the Wall Street Journal CSR I corpus, using the proposed

Fig. 10. HMM cache hit rate for various cache sizes and values of N (number of hypotheses kept active at every word).

Fig. 11. Effect of the number of Gaussian mixtures in the acoustic model on WER.

multipass N-best decode approach. The results of these simulations are shown in Fig. 11. A large 64 K vocabulary with an 8-Gaussian-mixture acoustic model was observed to be the optimal design point. Increasing the number of Gaussian mixtures beyond 8 did not provide much improvement in WER for the 64 K vocabulary. Hence, the design point chosen was an acoustic model with 8000 senones and 8 Gaussians, and a bigram language model with a 64 K word dictionary. The hardware was designed to meet this requirement in real time, including the communication between the hardware and CPU.

IV. PROPOSED HARDWARE ACCELERATOR

This section describes the basic blocks of the proposed hardware accelerator. The system consists of three main units (see Fig. 12). The Interface Control Unit decodes the commands and data obtained from software and controls the functioning of the Senone Score Unit (SSU) and the Viterbi Unit (VU). It is also responsible for sending the word lattice back to the software when requested. The SSU calculates the observation probabilities, i.e., senone scores, for the entire library of senones. The VU calculates the state probabilities, i.e., state


Fig. 12. Top Level Hardware Block Diagram.

Fig. 13. Block diagram for a scalable SSU.

scores, for all active states in the language model. In addition, it applies pruning thresholds to the states and activates new states if needed. It also adds recognized words to the word lattice as and when they are recognized.

A. Senone Score Unit

The SSU is shown in Fig. 13. It calculates the score for each senone using (1). Once the start command is received, the senone ID incrementer loops through the entire list of senones in the library. This module is highly pipelined to provide sustained throughput. Multiple distance calculation units are used in parallel to consume all the data provided by the high-bandwidth NOR flash.

1) Block Senone Scoring: The CPU sends a block of feature vectors for two consecutive frames to the hardware. The SSU computes the senone scores for both frames simultaneously while reading the acoustic model just once. The calculation of scores is split across the entire length of two frames and hence does not require increased parallelism. The memory bandwidth for the NOR flash is reduced by a factor of two. The size of the senone score SRAM increases by a factor of two, since storage is required for the scores of two consecutive blocks for the decode stage to use, as shown in Fig. 14.

Fig. 14. Pipelining of operations of the speech recognition algorithm.

Fig. 15. SSU power consumption for various block sizes for block senone scoring.

Scoring all senones for each frame implies that the feature vector is the only component that changes every frame. This can be exploited to share acoustic model reads across frames. This technique is a variation of the subvector clustering used in [21]. Scoring of the entire senone library is spread across multiple frames. Fig. 15 shows the effect of block size on the power consumption of the SSU. This design uses a block size of two because, at the 180-nm technology node, the design is limited by SRAM size.

The data pipeline is shown in Fig. 14. The decode stage is delayed by two frames. The senone scoring stage works on two sets of feature vectors simultaneously over the entire two frames while reading the acoustic model only once. This reduces the total memory reads, the peak bandwidth requirement, and consequently the read power consumption. It adds a latency of two frames but still achieves real-time throughput. The size of the senone score SRAM is increased by a factor of two, since the scores for the previously scored block of frames are still in use.

2) Flash Control and Memory Structure: The flash control unit translates the acoustic library offset and senone ID into the first memory address of the senone entry. A packed data structure is used for the senones in the library to ensure that performance is limited only by the physical bandwidth of the memory. At the beginning of every senone, the length of its record is stored; this identifies the end of the senone. At the beginning of each library, the number of senones is stored, which identifies the end address of the last senone in the library. The data structure used to store the acoustic model (Fig. 16) allows the use of multiple acoustic models with different numbers of senones, Gaussian mixtures, or feature vector dimensions.
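The length-prefixed traversal can be sketched as a walk over a flat array of words. The exact field widths and ordering of Fig. 16 are not reproduced in this excerpt, so the layout below (a senone count at offset 0, then records each headed by their own length) is an illustrative assumption that only demonstrates the addressing idea.

```python
def senone_offsets(model):
    """Walk a packed, length-prefixed acoustic model (a sketch).
    model[0] : number of senones in the library
    model[k] : length of the senone record starting at offset k
    Returns the start offset of each senone record."""
    count = model[0]
    offsets, pos = [], 1
    for _ in range(count):
        offsets.append(pos)
        pos += model[pos]   # length field at the head of every record
    return offsets
```

Because every record carries its own length, the same walker handles acoustic models with different numbers of mixtures or feature dimensions, which is the flexibility the packed structure is designed for.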


Fig. 16. Data structure for acoustic model.

Fig. 17. Block diagram for a scalable VU.

3) Distance Calculation: This module computes the inner summation of (1). It has four parallel units for the subtraction-square-multiply operations. The output of these units is fed to two stages of addition. Scalability for larger acoustic models can easily be achieved by adding more flash memories and distance calculation units in parallel, as shown in Fig. 13.
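The subtract-square-multiply datapath computes, per Gaussian, a weighted squared distance between the feature vector and the mixture mean. Since (1) is not reproduced in this excerpt, the inverse-variance weighting below is an assumption (it matches the usual diagonal-covariance Gaussian evaluation); the 4-wide chunking only mirrors the four parallel hardware units.

```python
def distance(x, mean, inv_var):
    """Inner summation sketch: sum_i (x_i - m_i)^2 * v_i, where v_i is an
    assumed precomputed inverse-variance term. Four lanes per iteration
    mirror the four parallel subtract-square-multiply units."""
    total = 0.0
    for base in range(0, len(x), 4):            # 4 lanes per "cycle"
        for i in range(base, min(base + 4, len(x))):
            d = x[i] - mean[i]                  # subtract
            total += d * d * inv_var[i]         # square, multiply, accumulate
    return total
```

In the hardware, the two stages of addition that follow the four lanes perform the accumulation that the `total +=` line models here.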

4) Logarithmic Addition: Since all the operations are in the logarithmic domain, a logarithmic addition is required for the outer summation of (1). This involves calculating the value of log(A + B) from log(A) and log(B). This is done using a lookup table similar to the one used by CMU Sphinx [15]. This unit needs pipelining, as it has to access the lookup table. However, no parallelism is required, since this operation takes place only once for every N distance calculations.
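The table-based log-add relies on the identity log(A + B) = max + log(1 + exp(min − max)), where only the difference max − min indexes the table. Sphinx works in a scaled integer log domain; the sketch below uses natural logs, and the table step and cutoff are illustrative assumptions.

```python
import math

# Assumed table parameters for this sketch (not Sphinx's actual values).
STEP = 0.001        # table resolution in the log domain
TABLE_MAX = 20.0    # beyond this difference the correction is ~0
TABLE = [math.log1p(math.exp(-i * STEP))
         for i in range(int(TABLE_MAX / STEP) + 1)]

def log_add(log_a, log_b):
    """log(A + B) from log(A) and log(B) via a single table lookup."""
    hi, lo = (log_a, log_b) if log_a >= log_b else (log_b, log_a)
    diff = hi - lo
    if diff >= TABLE_MAX:
        return hi                      # smaller term is negligible
    return hi + TABLE[int(diff / STEP)]
```

Because each lookup serves one outer-summation step per N distance calculations, a single pipelined instance of this unit keeps up with the parallel distance datapath.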

B. Viterbi Unit

The VU (Fig. 17) is responsible for performing the first decode pass on the incoming speech using a simple unigram/bigram language model. It has a long pipeline that works on each HMM from the active list, which is streamed from the DRAM sequentially. The random accesses required for the state score calculation in (3) are the transition probabilities log a_ij and the senone scores log b_j(Y_t). These accesses are divided between the language model stored in flash and the senone scores stored in the SRAM by the SSU. The pipelining of these accesses provides further performance benefit.

1) Flash Control and Memory: The bigram language model, the word structures and probabilities, and the HMM structures are stored in the flash memory. The word dictionary stores the sequence of HMMs that form each word. The HMM dictionary stores the senones and transition probabilities within an HMM. The bigram model stores the probabilities of going from

Fig. 18. Data structure for word dictionary, HMM model, and language model.

one source word to multiple destination words. Each word in the dictionary occupies one 256-bit wide line of memory and can be accessed in a single read. Similarly, an HMM of up to three states (a triphone) can be accommodated in one line of memory and accessed in a single read. All word and HMM models that are larger than one line in memory are stored in the form of a linked list (Fig. 18). Each node of the linked list is one line in flash memory. The first node of the linked list can be accessed directly by translating the word ID or HMM ID. Each subsequent node stores the pointer to the next node. The word model and triphone structures are accessed only once per word and stored in a cache. The structure of the active list (Section IV-B.4) allows reuse of this data for multiple instances of the same word. The bigrams are indexed by source word ID. Multiple lines of bigrams for each source word are prefetched into the SRAM. The prefetch SRAM is divided equally to store multiple bigrams of all possible source words in that frame. The bigrams are read from this SRAM in the order of destination word IDs as each word from the active list is processed. For larger language models, multiple parallel flash memories and DRAMs can be used to improve memory bandwidth.
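The linked-list layout can be sketched as follows, with each flash line modeled as a fixed-width row whose last slot holds the next-node pointer. The payload width, pointer slot position, and sentinel value are assumptions of this sketch; the real lines are 256 bits wide with a layout per Fig. 18.

```python
LINE_PAYLOAD = 7   # assumed payload slots per line; last slot is the pointer
NO_NEXT = -1       # assumed sentinel marking the final node

def read_model(flash, first_line):
    """Gather a multi-line word/HMM model by following next-pointers.
    flash maps line address -> list of LINE_PAYLOAD+1 slots.
    The first line is found directly by translating the word/HMM ID."""
    data, line = [], first_line
    while line != NO_NEXT:
        row = flash[line]
        data.extend(row[:LINE_PAYLOAD])   # payload portion of the line
        line = row[LINE_PAYLOAD]          # pointer to the next node
    return data
```

Models that fit in one line terminate after a single read, which is why most words and all triphones cost exactly one flash access.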

2) HMM Scoring: This block calculates (3) for each HMM. The last state scores are available from the HMM active list entry in DRAM. Transition probabilities and senone IDs are obtained from the flash. The senone IDs are used to read the corresponding senone scores from the SSU SRAM. A simple add-compare-select unit adds the last state scores, transition probabilities, and senone scores, compares them, and selects the best.
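For a left-to-right HMM with self-loops (the topology suggested by the three-state triphones described earlier, and an assumption of this sketch since (3) is not reproduced here), the add-compare-select update for one state can be written as:

```python
def update_state(prev_scores, trans, senone_score, j):
    """Add-compare-select for state j of a left-to-right HMM (a sketch).
    prev_scores[i] : last frame's score of state i (log domain)
    trans[i][j]    : log transition probability a_ij
    senone_score   : log observation probability b_j(Y_t)"""
    # Candidate predecessors: advance from state j-1, or self-loop on j.
    candidates = [prev_scores[i] + trans[i][j]
                  for i in (j - 1, j) if i >= 0]
    # Compare, select the best, then add the observation probability.
    return max(candidates) + senone_score
```

All quantities are additions in the log domain, which is why the hardware needs only adders and comparators here rather than multipliers.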

3) Adaptive Pruning: Adaptive pruning [22] is used to limit the size of the search space for the first pass of decode. The initial pruning threshold (T0) is set by the software and then modified for each frame using (4). Here, Nset is the maxhmmpf parameter set by the user, and Nt is the number of HMMs that are active in the current frame. This equation represents a closed-loop system, which adjusts Tt+1 to keep Nt+1 (the number of HMMs active in the next frame) as close to Nset as possible. The value of α is set to 0.2 to dampen the response of this system. A 10% tolerance is added to Nset to make sure that the number of HMMs passed is always more than Nset. This tolerance value was obtained empirically by running 400 sentences from the Wall Street Journal CSR I corpus. For any frame, the adaptive threshold is never used if it is wider than the initial beam threshold set by software.


Fig. 19. Number of active HMMs per frame over an entire utterance, which lasts 381 frames.

Also, the adaptive threshold is not calculated unless the value of Nt is larger than Nset. A similar equation is used for the word thresholds and for the pruning of N-best paths. This threshold is subtracted from the best HMM score of the previous frame t and compared with every state score in frame t + 1 for pruning. For the word threshold, the same procedure is repeated with the word score. For N-best path pruning, the same technique is applied to multiple active list entries of the same word. Fig. 19 shows the number of active states that pass pruning for an example utterance, which lasts 381 frames.

Tt+1 = Tt + α(1.1 · Nset − Nt). (4)
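The closed-loop update of (4), together with the two guard conditions described in the text (the adaptive beam is engaged only when Nt exceeds Nset, and is never used if it is wider than the initial beam), can be sketched as follows. Treating a larger threshold value as a wider beam is a sign-convention assumption of this sketch.

```python
ALPHA = 0.2        # damping factor alpha from the text
TOLERANCE = 1.1    # 10% tolerance applied to N_set

def next_threshold(t_t, n_t, n_set, t_init):
    """T_{t+1} = T_t + alpha * (1.1 * N_set - N_t), with guards (a sketch)."""
    if n_t <= n_set:
        return t_init               # adaptive beam not engaged this frame
    t_next = t_t + ALPHA * (TOLERANCE * n_set - n_t)
    return min(t_next, t_init)      # never wider than the initial beam
```

When Nt overshoots 1.1 · Nset, the correction term goes negative and the beam tightens; when it undershoots, the beam relaxes back toward (but never past) the initial software-set beam.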

4) Active List Generation: The active list is maintained at the word level. There is a separate active list entry for each combination of a word and its predecessor. All the entries for one word are stored together. Each word entry contains entries for all the active HMMs within that word. For each active HMM entry, the scores of all states in the HMM are stored. For each word entry, the following are stored: its word ID, the word ID of the previous word it transitioned from, the start frame of the first HMM of the word, the ID of the first HMM state in the word, and the start frame of the last HMM of the word. After an HMM is scored, the best of its state scores is compared with the adaptive pruning threshold. If it passes this threshold, the HMM is added to the active list for the next frame.

5) Word Lattice Generation: Whenever the last HMM of a word is deactivated, the word is added to the word lattice. For each word in the lattice, the following are stored: its word ID, previous word ID, score, start frame of the first HMM of the word, start frame of the last HMM of the word, and end frame of the last HMM of the word. This is the same format as used by CMU Sphinx and is discussed in [11]. It provides both path and time information for each recognized word. Such a lattice can easily be converted into a word graph, an N-best sentence list, an N-best word list, or any other lexical tree notation required by the software.
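The lattice record can be sketched as a plain structure holding exactly the fields listed above, plus a minimal consumer that extracts an N-best word list. Names and the Python-object representation are illustrative; the hardware stores these as packed fields.

```python
from dataclasses import dataclass

@dataclass
class LatticeEntry:
    """One word-lattice record (fields as listed in the text; a sketch)."""
    word_id: int
    prev_word_id: int     # path information: predecessor word
    score: float          # log-domain path score
    first_hmm_start: int  # time information: frame indices
    last_hmm_start: int
    last_hmm_end: int

def n_best_words(lattice, n):
    """One possible consumer: the n highest-scoring lattice entries."""
    return sorted(lattice, key=lambda e: e.score, reverse=True)[:n]
```

Because each entry carries both its predecessor and its frame span, the same records support backtracing a word graph or re-scoring with a higher-order language model in the software second pass.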

6) New Word and HMM Activation: For the activation of new HMMs within a word, the score of the last state of an HMM is checked against the HMM exit threshold. If this threshold is passed, the next HMM within the word is activated. The method for activating new words is less straightforward. Here, a transition has to be made from a word that has exited to all other possible words. This task is performed using a word activation map. This map represents the new entries that need to be activated for an exiting word by checking the possible sounds it can transition to. As existing word entries are read from the active list, their previous word IDs are compared against the word lattice entries from the last frame. If entries already exist, the better of the existing and new entry is chosen. After all the existing entries for a word have been processed, the remaining new entries for the word are appended to the list. The word/unigram model in flash memory is grouped by the starting senones of words, and the same insertion order is maintained in the active list and word lattice. This ensures that the lists never lose order and can be processed sequentially.

C. Interface Between Software and Hardware

The interface and control unit (Fig. 12) decodes the commands received from software. The SSU and VU are used to service these commands. A variety of commands have been defined to set configurable parameters for the proposed hardware. Some of the configurable parameters are as follows.

1) ACOUSTIC_MODEL_OFFSET: This is the memory offset for the acoustic model used. This lets the user change acoustic models on the fly.

2) LANGUAGE_MODEL_OFFSET: This is the memory offset for the language model used. This lets the user switch language models.

3) HMM_MODEL_OFFSET: This is the memory offset for the HMM model used. This lets the user switch HMM models.

4) HMM_INIT_BEAM: This is the threshold value that defines the widest beam width to be used for pruning HMMs.

5) WORD_INIT_BEAM: This is the threshold value that defines the widest beam width to be used for pruning words.

6) MAXHMMPF: This sets the maximum number of HMMs to be kept active in any frame. Adaptive pruning [22] is used to calculate the pruning threshold using the number of active HMMs from previous frames. The threshold is adjusted to keep the number of active HMMs close to maxhmmpf.

7) MAXWPF: This is similar to maxhmmpf, except that it controls the number of word exits.

8) MAX_N_BEST: This parameter sets the maximum number of instances of a single word to be kept active. Adaptive pruning is used to prune the N-best list as well. The pruning threshold for each word is different and is stored in the active list along with the word entry. This threshold is used only if the HMM and word pruning thresholds are not able to keep the number of instances of a word below the N-best parameter.

9) FEATURE_LENGTH: This parameter is used to dynamically change the dimensions of the incoming feature


Fig. 20. Example simulation with SystemC.

vector from software. This allows compatibility with multiple front ends. It has been observed in [23] and [24] that, for lower input sampling rates, lower dimensional feature vectors can be used without sacrificing accuracy.

10) HMM_LENGTH: This is the number of states in each HMM in the language model.

11) MAX_MIXTURES: This is the maximum number of Gaussian mixtures used for any senone.
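Collected together, the software-visible parameters above form a small configuration block. The sketch below groups them in one structure; the field names mirror the command names, and the defaults shown are the design-point values quoted elsewhere in the paper (39-D features, 3-state triphones, 8 mixtures, N-best of 10), not actual hardware reset values.

```python
from dataclasses import dataclass

@dataclass
class AcceleratorConfig:
    """Software-configurable parameters of the accelerator (a sketch;
    defaults follow the design point chosen in Section III)."""
    acoustic_model_offset: int = 0   # lets the user change acoustic models
    language_model_offset: int = 0   # lets the user switch language models
    hmm_model_offset: int = 0        # lets the user switch HMM models
    hmm_init_beam: float = 0.0       # widest beam for pruning HMMs
    word_init_beam: float = 0.0      # widest beam for pruning words
    maxhmmpf: int = 0                # max active HMMs per frame
    maxwpf: int = 0                  # max word exits per frame
    max_n_best: int = 10             # max active instances of a single word
    feature_length: int = 39         # feature vector dimensions
    hmm_length: int = 3              # states per HMM (triphone)
    max_mixtures: int = 8            # Gaussian mixtures per senone
```

Exposing the model offsets as plain parameters is what allows acoustic and language models to be swapped at run time without touching the hardware.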

V. HARDWARE MODELING METHODOLOGY

A parameterized transaction-level SystemC model of the proposed hardware architecture was developed. The SystemC model was transaction based at the boundary between the hardware and CPU. The model was integrated with CMU Sphinx. Sphinx was modified to send commands to the SystemC model of the hardware, which runs on a separate thread. The communication between Sphinx and SystemC was done using a shared memory structure. This allowed observation of the effects of different hardware configurations on the end accuracy of the recognition process. An example of the interaction between the Sphinx and SystemC simulation threads is shown in Fig. 20. This model was used for design space exploration and for estimation of performance, area, and power.

A. Timing, Area and Power Estimation

For timing, area, and power estimation, a combination of SystemC and RTL was used. The basic building blocks of the hardware, such as adders, multipliers, and so on, were identified and synthesized using Synopsys DesignWare [25] and glue logic in the Spansion CS239LS 180-nm standard cell library. The major operations involved in the entire algorithm were identified and modeled as timed functions/methods in the SystemC model. These operations were further subdivided into basic arithmetic operations. The latency of each of the arithmetic operations was estimated using synthesized RTL.

TABLE VII
ESTIMATED POWER CONSUMPTION OF THE HARDWARE FOR DIFFERENT TASKS IN THE RECOGNITION PROCESS

For area estimation, the total number of building blocks required for each operation modeled in SystemC was estimated. The individual components in synthesized Verilog were stitched together using glue logic. The total logic area was calculated by adding the areas of each of the basic blocks. The glue logic within the basic blocks was seen to add a 20% overhead to the area. An additional 20% area overhead was added to account for similar glue logic at the top level.

For calculating the power, counters were used in the SystemC model to count the number of arithmetic operations on each basic building block and the memory read/write operations for the SRAM and flash. The power consumption of each basic block was obtained from synthesis and used to derive the energy per operation. The total power consumption for logic was obtained by multiplying the energy per operation by the total number of operations obtained from the SystemC simulation. This number was padded by another 20% to match the area overhead added for top-level glue logic. For the SRAM and flash, the read/write energy per operation was obtained from Spansion, Inc., and multiplied by the total number of memory operations obtained from simulation. DRAM power was obtained by performing simulations using the DRAMSim2 [26] memory system simulator. A SystemC wrapper was built around DRAMSim2. This wrapper passed commands and data between the hardware SystemC model and DRAMSim2. The DRAMSim2 simulation provided the time spent by the DRAM in each mode, i.e., read/write/activate/refresh/precharge. The data obtained from the SystemC/DRAMSim2 co-simulation was verified by performing a trace-based simulation with DRAMSim2 on the same data. The energy consumption was calculated using the formulas provided by DRAM manufacturers for the power consumption of the DRAM in each operating mode [27]. The total power consumption of the hardware was calculated by adding the energy consumption of the logic, SRAM, flash, and DRAM.
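The logic-power bookkeeping described above reduces to a small activity-based calculation: per-block operation counts times per-operation energies, padded by the glue-logic overhead, divided by simulated time. The function and argument names below are illustrative.

```python
def logic_power(op_counts, energy_per_op, sim_time_s, glue_overhead=0.20):
    """Activity-based logic power estimate (a sketch of the method above).
    op_counts[b]     : number of operations on basic block b (from counters)
    energy_per_op[b] : energy per operation of block b (from synthesis), J
    sim_time_s       : simulated time, seconds
    Returns average power in watts."""
    energy_j = sum(op_counts[b] * energy_per_op[b] for b in op_counts)
    energy_j *= 1.0 + glue_overhead   # pad for top-level glue logic
    return energy_j / sim_time_s
```

The SRAM, flash, and DRAM contributions follow the same count-times-energy pattern, with their per-operation energies coming from the memory vendor and DRAMSim2 instead of synthesis.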

VI. RESULTS

This design is estimated to run at a clock speed of 100 MHz using the Spansion CS239LS 180-nm standard cell library. The SSU supports processing an acoustic library of 8000 senones, eight Gaussian mixtures, and a 39-D feature vector in 0.85x real time. It uses a 768-bit wide flash memory to store the acoustic model. The VU supports an N-best search using a unigram/bigram language model of 64-K words. Using a maximum N-best parameter value of 10, 0.78x real-time performance was achieved for the VU. However, this number varies largely with the input. Numbers as high as 0.92x were


TABLE VIII

SSU COMPARISON WITH RELATED WORK

TABLE IX

VU COMPARISON WITH RELATED WORK

observed for one of the utterances in the test data. The VU uses a 256-bit wide flash memory to store the bigram, word, and triphone models. The flash memories used have a random read latency of 80 ns. An on-chip SRAM of 14 KB is needed for the DRAM read/write buffers, log-add lookup tables, HMM dictionary cache, bigram prefetch cache, and word activation map. The SRAM requirement for storing senone scores and the word lattice depends on the size of the acoustic and word models. In this design, 64 KB is needed for the senone scores and 16 KB for the word lattice. The estimated area is 200-K gate equivalents for the SSU logic and 120-K gate equivalents for the VU logic.

The SSU consumes an estimated 422 mW in the data path and 384 mW for memory reads. The VU consumes 230 mW in the data path and 359 mW for memory reads. The total estimated on-chip SRAM power consumption for the SSU and VU is 143 mW. A comparison with previous work is shown in Tables VIII and IX. All the designs in these tables meet the real-time performance criterion. The figure of merit chosen for the comparison of SSUs is the power consumed per Gaussian, as the Gaussian is the lowest-level unit of the acoustic modeling stage. The work done by the acoustic modeling stage can easily be represented in terms of the number of Gaussians processed per frame. The figure of merit chosen for the VU is the power consumed per word in the dictionary. The amount of work done in this stage is proportional to the number of words in the vocabulary and the language model used. Since prior work uses different language models, only the number of words in the dictionary is used for comparison. It should also be noted that prior work does not implement the N-best algorithm.

Table VII shows the power consumed by the different stages of the recognition algorithm. Senone scoring is the most power hungry, as the entire acoustic model needs to be read once every two frames. The power consumption of the other stages can be reduced by caching and by using a NOR flash memory, which provides fast random accesses.

The proposed hardware architecture provides high memory access efficiency by using NOR flash for random accesses (see Table VI). The random nature of word and HMM dictionary reads greatly reduces memory access efficiency when they are stored in a DRAM. The memory read power for the VU is greatly reduced by caching HMM dictionary reads in an SRAM until all entries of a single word have been processed.

An average cache hit rate of 65.74% was achieved for HMM dictionary reads. The second major factor in improving the performance and power efficiency of the VU is the use of a simplified unigram/bigram language model. The proposed hardware/software co-design provides an overall 4.3X performance improvement compared with a software-only solution running on an Intel Core 2 Duo 2.4-GHz processor with 4 GB of 667-MHz DDR2 SDRAM, while consuming an estimated 1.72 W. The proposed partition, which uses multipass decoding with a word-dependent N-best search, provides a generic hardware architecture that can be used with software running any front end and N-gram language model.

VII. CONCLUSION

In this paper, an architecture for a generic speech recognition accelerator has been proposed. The goal was to achieve performance improvement using hardware acceleration while keeping the hardware architecture generic and portable. This involved accelerating the observation probability calculation stage of the algorithm while supporting multiple front ends. This was achieved with a SIMD architecture for the SSU, which supports multidimensional feature vectors and variable numbers of senones and Gaussian mixtures. For the acceleration of the search phase of the algorithm, one of the biggest challenges was to decouple the hardware implementation from the software stack and the end application. Sophisticated N-gram language models provide accuracy improvements in speech recognition, especially if they can be made application specific. However, building a hardware architecture that works equally well with any N-gram model is a difficult task. In this design, a multipass decode algorithm was used, with simpler (unigram/bigram) language models in the first pass. The word-dependent N-best algorithm, which keeps track of the N best paths at each node, was used to compensate for the reduction in accuracy due to the simpler language models. The second pass of decode was kept in software and can use any higher-order N-gram language model.

This paper implements the observation probability calculation stage and the word N-best decoder in hardware. The next step would be to look at any bottlenecks in the front end stage. The front end stage can be computationally intensive if it includes modules to combat noise, echo, reverberation,


and other ambient conditions. It would also be interesting to observe how other, more complex N-best algorithms and A* decoding algorithms perform in hardware.

ACKNOWLEDGMENT

O. A. Bapat would like to thank R. Eliyahu, J. Olson, and the Spansion design team for providing power consumption data for the on-chip NOR flash memory and SRAM, and for access to their standard cell library.

REFERENCES

[1] D. Chandra, U. Pazhayaveetil, and P. Franzon, “Architecture for low power large vocabulary speech recognition,” in Proc. IEEE Int. SOC Conf., Sep. 2006, pp. 25–28.

[2] U. Pazhayaveetil, D. Chandra, and P. Franzon, “Flexible low power probability density estimation unit for speech recognition,” in Proc. IEEE ISCAS, May 2007, pp. 1117–1120.

[3] F. L. Vargas, R. Dutra, and R. Fagundes, “An FPGA based Viterbi algorithm implementation for speech recognition,” in Proc. IEEE ICASSP, vol. 2. May 2001, pp. 1217–1220.

[4] S. Yoshizawa and N. Hayasaka, “Scalable architecture for word HMM-based speech recognition and VLSI implementation in complete system,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 53, no. 1, pp. 70–77, Jan. 2006.

[5] Y.-K. Choi, K. You, J. Choi, and W. Sung, “A real-time FPGA-based 20,000-word speech recognizer with optimized DRAM access,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 57, no. 8, pp. 2119–2131, Aug. 2010.

[6] P. J. Bourke and R. A. Rutenbar, “A low-power hardware search architecture for speech recognition,” in Proc. 9th Int. Conf. Speech Commun., Sep. 2008, pp. 2102–2105.

[7] M. Li and T. Wen, “Hardware software co-design for Viterbi decoder,” in Proc. ICEPT-HDP, Jul. 2008, pp. 1–4.

[8] B. Matthew, A. Davis, and Z. Fang, “A low-power accelerator for the SPHINX 3 speech recognition system,” in Proc. Int. Conf. CASES, 2003, pp. 210–219.

[9] O. Cheng, W. Abdulla, and Z. Salcic, “Hardware–software codesign of automatic speech recognition system for embedded real-time applications,” IEEE Trans. Ind. Electron., vol. 58, no. 3, pp. 850–859, Mar. 2011.

[10] P. Li and H. Tang, “Design of a low-power coprocessor for mid-size vocabulary speech recognition systems,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 58, no. 5, pp. 961–970, May 2011.

[11] D. Jurafsky and J. H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 2nd ed. Englewood Cliffs, NJ, USA: Prentice-Hall, 2009, pp. 285–348.

[12] R. Schwartz and Y.-L. Chow, “The N-best algorithms: An efficient and exact procedure for finding the N most likely sentence hypotheses,” in Proc. ICASSP, vol. 1. Apr. 1990, pp. 81–84.

[13] R. Schwartz and S. Austin, “A comparison of several approximate algorithms for finding multiple (N-best) sentence hypotheses,” in Proc. ICASSP, vol. 1. Apr. 1991, pp. 701–704.

[14] P. F. Brown, P. V. deSouza, R. L. Mercer, V. J. D. Pietra, and J. C. Lai, “Class-based N-gram models of natural language,” Comput. Linguist., vol. 18, no. 4, pp. 467–479, Dec. 1992.

[15] (2013, Jun. 2). CMU Sphinx [Online]. Available: http://cmusphinx.sourceforge.net/

[16] A. Viterbi, “Error bounds for convolutional codes and an asymptotically optimum decoding algorithm,” IEEE Trans. Inf. Theory, vol. 13, no. 2, pp. 260–269, Apr. 1967.

[17] M. Y. Hwang and X. Huang, “Subphonetic modeling with Markov states—Senone,” in Proc. ICASSP, vol. 1. Mar. 1992, pp. 33–36.

[18] L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proc. IEEE, vol. 77, no. 2, pp. 257–286, Feb. 1989.

[19] K. Vertanen. (2013, Jun. 2). CMU Sphinx Acoustic Models—US English [Online]. Available: http://www.keithv.com/software/sphinx/us/

[20] K. Vertanen. (2013, Jun. 2). CSR LM-1 Language Model Training Recipe [Online]. Available: http://www.keithv.com/software/csr/

[21] Y.-K. Choi, K. You, J. Choi, and W. Sung, “VLSI for 5000-word continuous speech recognition,” in Proc. IEEE ICASSP, Apr. 2009, pp. 557–560.

[22] H. Van Hamme and F. Van Aelten, “An adaptive-beam pruning technique for continuous speech recognition,” in Proc. 4th ICSLP, vol. 4. Oct. 1996, pp. 2083–2086.

[23] C. Sanderson and K. K. Paliwal, “Effect of different sampling rates and feature vector sizes on speech recognition,” in Proc. IEEE Annu. Conf. Speech Image Technol. Comput. Telecommun., vol. 1. Dec. 1997, pp. 161–164.

[24] H. Hirsch, K. Hellwig, and S. Dobler, “Speech recognition at multiple sampling rates,” in Proc. 7th Eur. Conf. Speech Commun. Technol., Sep. 2001, pp. 1–4.

[25] Synopsys, Mountain View, CA, USA. (2013, Jun. 2). DesignWare Building Block IP Documentation Overview [Online]. Available: https://www.synopsys.com/dw/doc.php/doc/dwf/dwbb_overview.pdf

[26] P. Rosenfeld, E. Cooper-Balis, and B. Jacob, “DRAMSim2: A cycle accurate memory system simulator,” Comput. Archit. Lett., vol. 10, no. 1, pp. 16–19, Jan. 2011.

[27] Micron Technology. (2013, Jun. 2). Calculating Memory System Power for DDR SDRAM [Online]. Available: http://www.ece.umd.edu/class/enee759h.S2005/references/dl201.pdf

[28] U. C. Pazhayaveetil, Hardware Implementation of a Low Power Speech Recognition System. Ann Arbor, MI, USA: ProQuest, 2007.

Ojas A. Bapat received the master’s and Ph.D. degrees from North Carolina State University, Raleigh, NC, USA, in 2009 and 2012, respectively.

He is currently a Systems Hardware Engineer with Spansion, Inc., Sunnyvale, CA, USA. He has been involved in a variety of topics, including speech recognition, in-situ calibration of circuits using an on-chip coprocessor, and DDR2 controller design. His current research interests include speech recognition algorithms, development of HW/SW co-designs and co-processors, and hardware performance modeling and design using SystemC/C++ and RTL.

Paul D. Franzon (SM’99–F’06) received the Ph.D. degree from the University of Adelaide, Adelaide, S.A., Australia, in 1988.

He is currently a Professor of electrical and computer engineering with North Carolina State University, Raleigh, NC, USA. He was with AT&T Bell Laboratories, Murray Hill, NJ, USA, DSTO Australia, and Australia Telecom, and he co-founded two companies, Communica and LightSpin Technologies. He has led several major efforts and published over 200 papers. His current research interests include the technology and design of complex systems incorporating VLSI, MEMS, advanced packaging, and nano-electronics.

Dr. Franzon received the National Science Foundation Young Investigators Award in 1993 and the Alcoa Award in 2005. He was selected to join the NCSU Academy of Outstanding Teachers in 2001. He was selected as a Distinguished Alumni Professor in 2003.

Richard M. Fastow received the B.S. degree in physics from the Massachusetts Institute of Technology, Cambridge, MA, USA, and the Ph.D. degree in physics/material science from Cornell University, Ithaca, NY, USA.

He is currently the Director of speech technology and algorithms with Spansion, Inc., Sunnyvale, CA, USA. He was previously involved in nonvolatile memory at AMD and Intel. He holds 65 patents and has 31 publications. His current research interests include chip architectures for memory-intensive tasks, such as speech recognition.