  • CS267 Assignment 0

    Ekaterina Gonina

    Bio:

    I am a first-year PhD student in CS at UC Berkeley's ParLab. I am interested in parallel graph algorithms and in optimizing them for different parallel hardware platforms. I am also interested in developing parallel programming frameworks that make programming newly emerging parallel hardware easier for everyday programmers. As an undergrad at UIUC, I worked with Professor L.V. Kale on parallelizing and optimizing minimum spanning tree algorithms using MPI and Charm++ (www.charm.cs.uiuc.edu).

    What I'm hoping to gain from this course is a systematic view of the current state of the art in developing parallel applications, with hands-on examples of its implications. I'm also hoping to build a solid foundation for my parallel programming Prelim exam.

    Application:

    Data-Parallel Large Vocabulary Continuous Speech Recognition on GPUs

    Speech recognition is an important application that the Parallel Computing Lab at UC Berkeley is exploring in order to mine for parallel patterns and eventually create frameworks and techniques for implementing inference engines efficiently on manycore platforms. Speech recognition is a key technology enabling human-computer interaction in many emerging applications; for example, keeping a diary of a conference meeting or recording a one-on-one research meeting are useful applications that both industry and the research community could benefit from. However, language vocabulary models are extremely large, and their size grows with the recognition accuracy we want. Parallel processing is therefore essential: we want to recognize human speech quickly, ideally in real time, and that is only possible if we take full advantage of the parallelism of current manycore platforms.

    One effective approach to the large vocabulary continuous speech recognition (LVCSR) problem is to use a Hidden Markov Model (HMM) with a beam-search approximate inference algorithm. This system uses a recognition network that is compiled offline from a variety of knowledge sources using powerful statistical learning techniques. Spectral speech features are extracted by signal processing of the audio input, and the inference engine then computes the most likely word sequence based on the extracted speech features and the recognition network [1]. Since it is not feasible to explore the whole recognition network to find the most likely word sequence given the speech signal, the authors use the beam-search heuristic to reduce the problem space to a feasible size.
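    As a concrete illustration, beam search keeps only the hypotheses whose score falls within a fixed beam width of the current best; everything else is discarded. A minimal sketch in Python (the function and state names are illustrative, not from the implementation in [1]):

```python
def beam_prune(hypotheses, beam_width):
    """Keep only hypotheses scoring within beam_width of the best.

    hypotheses: dict mapping state -> log-probability score.
    Pruning bounds the search space at the cost of exact optimality.
    """
    if not hypotheses:
        return {}
    best = max(hypotheses.values())
    return {s: p for s, p in hypotheses.items() if p >= best - beam_width}

# With a beam of 2.0, states more than 2.0 below the best are dropped.
active = {"s0": -1.0, "s1": -2.5, "s2": -4.0}
pruned = beam_prune(active, beam_width=2.0)
```

    A wider beam keeps more hypotheses (higher accuracy, more work); a narrower beam prunes more aggressively.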

  • Parallel Platform:

    The application targets NVIDIA G8x-series GPUs, a parallel platform with a SIMD architecture. A G8x GPU has an array of Streaming Multiprocessors (SMs), each of which has 8 scalar processors. The authors used the CUDA programming environment to implement the parallel version of the algorithm. A CUDA application is organized into a sequential host program and one or more parallel kernels that the host invokes to run on the GPU. The basis of parallel execution is a CUDA thread: a kernel executes a scalar sequential program across a set of threads. The programmer organizes the threads into thread blocks, which get scheduled to run on the SIMD lanes of one multithreaded SM. Each SM has 16KB of on-chip memory with high bandwidth and low latency; this memory is shared among the threads in a block. For more information about CUDA and NVIDIA GPU programming, see [2].
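    The execution model can be sketched as a plain-Python CPU emulation (not real CUDA — the block/thread indices and the per-block shared list stand in for CUDA's blockIdx, threadIdx, and __shared__ memory; the kernel and function names are made up for illustration):

```python
def launch(kernel, grid_dim, block_dim, *args):
    """Emulate a CUDA kernel launch serially on the CPU.

    CUDA would run grid_dim thread blocks of block_dim threads each,
    scheduled onto the SMs; here we simply loop. Each simulated thread
    gets its block index, thread index, and a per-block 'shared' list
    standing in for the SM's on-chip shared memory.
    """
    for block in range(grid_dim):
        shared = [0.0] * block_dim  # visible only within this block
        for thread in range(block_dim):
            kernel(block, thread, shared, *args)

def scale_kernel(block, thread, shared, data, factor):
    """Each thread handles one element -- the data-parallel pattern."""
    i = block * len(shared) + thread  # global index, as in CUDA
    if i < len(data):                 # guard against extra threads
        data[i] *= factor

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
launch(scale_kernel, 2, 4, data, 2.0)  # 2 blocks x 4 threads
```

    On a real GPU the threads run concurrently across the SIMD lanes, which is why each thread must index its own element rather than share loop state.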

    Inference Engine Implementation:

    The inference engine implements the beam-search algorithm, iterating over the set of active states to infer the set of next active states in the recognition network graph. Each iteration begins with a set of active states that represent the most likely word sequences up to the current observation. The first step computes the observation probabilities of all potential next states, the second step computes the next-state likelihoods, and the third step selects the most likely next states to retain as the set of active states for the next iteration [1]. The figure below illustrates the HMM graph and a high-level overview of the iteration process, and the figure after it gives a detailed breakdown of the inference engine's iterations:
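    The three steps above can be sketched as a toy sequential version in Python (the network, state names, log-probability scores, and table-based observation model are illustrative assumptions, not the implementation in [1]):

```python
import math

# Toy recognition network: state -> list of (next_state, transition log-prob).
network = {
    "a": [("b", math.log(0.6)), ("c", math.log(0.4))],
    "b": [("b", math.log(0.5)), ("c", math.log(0.5))],
    "c": [("c", math.log(1.0))],
}

def observation_logprob(state, frame):
    """Stand-in acoustic score for one speech frame."""
    return {"b": -0.2, "c": -1.5}.get(state, -5.0)

def iterate(active, frame, beam_width):
    """One beam-search iteration: the three steps from the text."""
    # Step 1: observation probabilities of all potential next states.
    candidates = {dst for src in active for dst, _ in network[src]}
    obs = {s: observation_logprob(s, frame) for s in candidates}
    # Step 2: next-state likelihoods (best predecessor per state).
    nxt = {}
    for src, score in active.items():
        for dst, trans in network[src]:
            cand = score + trans + obs[dst]
            if cand > nxt.get(dst, -math.inf):
                nxt[dst] = cand
    # Step 3: retain only states within beam_width of the best.
    best = max(nxt.values())
    return {s: p for s, p in nxt.items() if p >= best - beam_width}

active = iterate({"a": 0.0}, frame=0, beam_width=3.0)
```

    Repeating this per speech frame traces the most likely path through the network while keeping the active set small.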

  • The major bottleneck in the algorithm is the transfer of data between the CPU and the GPU; thus the key challenge in using the GPU effectively is to keep all computations and intermediate results in GPU memory.

    The most significant parallelization potential in this algorithm is data-level parallelism. In computing next-state likelihoods, for example, the authors parallelize over the set of active end states and do the computations for each state in parallel; in computing observation probabilities they take advantage of the embarrassingly parallel structure of the problem and compute the probability for each state independently in parallel.
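    A minimal sketch of that embarrassingly parallel step, here using a Python thread pool in place of GPU threads (the scoring function is a made-up stand-in for the real acoustic model):

```python
from concurrent.futures import ThreadPoolExecutor

def observation_prob(state):
    """Stand-in acoustic score; the real model would evaluate the
    current frame's speech features for this state."""
    return 1.0 / (1 + state)

def compute_all(states):
    """Each state's probability depends on no other state, so the map
    is embarrassingly parallel -- on a GPU, one thread per state."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return dict(zip(states, pool.map(observation_prob, states)))

probs = compute_all([0, 1, 3])
```

    Because there are no cross-state dependencies, the speedup of this step scales with the number of parallel lanes available.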

    Results:

    The application performed very well on the GPU: it achieved a 9x overall speedup over its sequential equivalent, with 19x in the observation-probability kernel and 11.8x in updating the next-state likelihoods. The key result is that the performance of the parallel version is about 6.25x better than that of the sequential version [1]. The two versions of the algorithm retain the same accuracy. Below is some performance data the authors gathered, broken down by kernel:

  • Conclusion:

    This project represents a significant step in parallelizing automatic speech recognition on manycore platforms, specifically on NVIDIA GPUs using CUDA. It uses the data-parallel model to achieve a 9x speedup over the sequential version of the algorithm, illustrating that there is a large space to explore in improving the performance of speech recognition applications on such architectures. The next step, which we are currently working on, is implementing the inference engine using a different recognition network, the Weighted Finite State Transducer (WFST), which is optimized for state-space size. We hope to see further speedup and performance improvement from parallelizing speech recognition using this model.

    References:

    1. Jike Chong, Youngmin Yi, Nadathur Rajagopalan Satish, Kurt Keutzer. "Data-Parallel Large Vocabulary Continuous Speech Recognition on Graphics Processing Unit". Poster, GSRC Annual Symposium, September 29, 2008.

    2. NVIDIA CUDA Reference Manual, http://developer.download.nvidia.com/compute/cuda/2_0/docs/CudaReferenceManual_2.0.pdf