  • CS267 Assignment 0

    Ekaterina Gonina

    Bio:

    I am a first-year PhD student in CS at UC Berkeley's ParLab. I am interested in parallel graph algorithms and in optimizing them for different parallel hardware platforms. I am also interested in developing parallel programming frameworks that make programming newly emerging parallel hardware easier for everyday programmers. As an undergrad at UIUC, I worked with Professor L.V. Kale on parallelizing and optimizing minimum spanning tree algorithms using MPI and Charm++ (www.charm.cs.uiuc.edu).

    What I'm hoping to gain from this course is a systematic view of the current state of the art in developing parallel applications, with hands-on examples of its implications. I'm also hoping to build a solid foundation for my parallel programming Prelim exam.

    Application:

    Data-Parallel Large Vocabulary Continuous Speech Recognition on GPUs

    Speech recognition is an important application that the Parallel Computing Lab at UC Berkeley is exploring in order to mine for parallel patterns and eventually create frameworks and techniques for implementing inference engines efficiently on manycore platforms. Speech recognition is a key technology enabling human-computer interaction in many emerging applications; for example, keeping a diary of a conference meeting or recording a one-on-one research meeting are useful applications that both industry and the research community could benefit from. However, language vocabulary models are extremely large, and their size grows with the recognition accuracy we want. Parallel processing is therefore essential: we want to recognize human speech quickly, ideally in real time, and that is only possible if we take full advantage of the parallelism of current manycore platforms.

    One effective approach to the large vocabulary continuous speech recognition (LVCSR) problem is to use a Hidden Markov Model (HMM) with a beam-search approximate inference algorithm. This system uses a recognition network that is compiled offline from a variety of knowledge sources using powerful statistical learning techniques. Spectral speech features are extracted by signal processing of the audio input, and the inference engine then computes the most likely word sequence based on the extracted speech features and the recognition network [1]. Since it is not feasible to explore the whole recognition network to find the most likely word sequence given the speech signal, the authors use the beam-search heuristic to reduce the problem space to a feasible size.
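    As a concrete illustration, beam search keeps only the hypotheses whose score falls within a fixed beam width of the current best; everything else is discarded. A minimal sketch in Python (the function and state names are illustrative, not from the implementation in [1]):

```python
def beam_prune(hypotheses, beam_width):
    """Keep only hypotheses scoring within beam_width of the best.

    hypotheses: dict mapping state -> log-probability score.
    Pruning bounds the search space at the cost of exact optimality.
    """
    if not hypotheses:
        return {}
    best = max(hypotheses.values())
    return {s: p for s, p in hypotheses.items() if p >= best - beam_width}

# With a beam of 2.0, states more than 2.0 below the best are dropped.
active = {"s0": -1.0, "s1": -2.5, "s2": -4.0}
pruned = beam_prune(active, beam_width=2.0)
```

    A wider beam keeps more hypotheses (higher accuracy, more work); a narrower beam prunes more aggressively.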

  • Parallel Platform:

    The application targets NVIDIA G8x-series GPUs, a parallel platform with a SIMD architecture. A G8x GPU has an array of Streaming Multiprocessors (SMs), each of which has 8 scalar processors. The authors used the CUDA programming environment to implement the parallel version of the algorithm. A CUDA application is organized into a sequential host program and one or more parallel kernels that the host invokes to run on the GPU. The basis of parallel execution is a CUDA thread: a kernel executes a scalar sequential program across a set of threads. The programmer organizes the threads into thread blocks, which get scheduled to run on the SIMD lanes of one multithreaded SM. Each SM has 16KB of on-chip memory with high bandwidth and low latency; this memory is shared among the threads in a block. For more information about CUDA and NVIDIA GPU programming, see [2].
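    The execution model can be sketched as a plain-Python CPU emulation (not real CUDA — the block/thread indices and the per-block shared list stand in for CUDA's blockIdx, threadIdx, and __shared__ memory; the kernel and function names are made up for illustration):

```python
def launch(kernel, grid_dim, block_dim, *args):
    """Emulate a CUDA kernel launch serially on the CPU.

    CUDA would run grid_dim thread blocks of block_dim threads each,
    scheduled onto the SMs; here we simply loop. Each simulated thread
    gets its block index, thread index, and a per-block 'shared' list
    standing in for the SM's on-chip shared memory.
    """
    for block in range(grid_dim):
        shared = [0.0] * block_dim  # visible only within this block
        for thread in range(block_dim):
            kernel(block, thread, shared, *args)

def scale_kernel(block, thread, shared, data, factor):
    """Each thread handles one element -- the data-parallel pattern."""
    i = block * len(shared) + thread  # global index, as in CUDA
    if i < len(data):                 # guard against extra threads
        data[i] *= factor

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
launch(scale_kernel, 2, 4, data, 2.0)  # 2 blocks x 4 threads
```

    On a real GPU the threads run concurrently across the SIMD lanes, which is why each thread must index its own element rather than share loop state.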

    Inference Engine Implementation:

    The inference engine implements the beam-search algorithm, iterating over the set of active states to infer the set of next active states in the recognition network graph. Each iteration begins with a set of active states that represent the most likely word sequences up to the current observation. The first step computes the observation probabilities of all potential next states, the second step computes the next-state likelihoods, and the third step selects the most likely next states to retain as the set of active states for the next iteration [1]. The figure below illustrates the HMM graph and a high-level overview of the iteration process, and the figure after it gives a detailed breakdown of the inference engine's iterations:
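    The three steps above can be sketched as a toy sequential version in Python (the network, state names, log-probability scores, and table-based observation model are illustrative assumptions, not the implementation in [1]):

```python
import math

# Toy recognition network: state -> list of (next_state, transition log-prob).
network = {
    "a": [("b", math.log(0.6)), ("c", math.log(0.4))],
    "b": [("b", math.log(0.5)), ("c", math.log(0.5))],
    "c": [("c", math.log(1.0))],
}

def observation_logprob(state, frame):
    """Stand-in acoustic score for one speech frame."""
    return {"b": -0.2, "c": -1.5}.get(state, -5.0)

def iterate(active, frame, beam_width):
    """One beam-search iteration: the three steps from the text."""
    # Step 1: observation probabilities of all potential next states.
    candidates = {dst for src in active for dst, _ in network[src]}
    obs = {s: observation_logprob(s, frame) for s in candidates}
    # Step 2: next-state likelihoods (best predecessor per state).
    nxt = {}
    for src, score in active.items():
        for dst, trans in network[src]:
            cand = score + trans + obs[dst]
            if cand > nxt.get(dst, -math.inf):
                nxt[dst] = cand
    # Step 3: retain only states within beam_width of the best.
    best = max(nxt.values())
    return {s: p for s, p in nxt.items() if p >= best - beam_width}

active = iterate({"a": 0.0}, frame=0, beam_width=3.0)
```

    Repeating this per speech frame traces the most likely path through the network while keeping the active set small.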

  • The major bottleneck in the algorithm is the transfer of data between the CPU and the GPU; thus the key challenge in using the GPU effectively is to keep all computations and intermediate results in GPU memory.

    The most significant parallelization potential in this algorithm is data-level parallelism. In computing next-state likelihoods, for example, the authors parallelize over the set of active end states and do the computations for each state in parallel; in computing observation probabilities they take advantage of the embarrassingly parallel structure of the problem and compute the probability for each state independently in parallel.
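    A minimal sketch of that embarrassingly parallel step, here using a Python thread pool in place of GPU threads (the scoring function is a made-up stand-in for the real acoustic model):

```python
from concurrent.futures import ThreadPoolExecutor

def observation_prob(state):
    """Stand-in acoustic score; the real model would evaluate the
    current frame's speech features for this state."""
    return 1.0 / (1 + state)

def compute_all(states):
    """Each state's probability depends on no other state, so the map
    is embarrassingly parallel -- on a GPU, one thread per state."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return dict(zip(states, pool.map(observation_prob, states)))

probs = compute_all([0, 1, 3])
```

    Because there are no cross-state dependencies, the speedup of this step scales with the number of parallel lanes available.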

    Results:

    The application performed very well on the GPU: it achieved a 9x overall speedup over its sequential equivalent, with 19x in the observation-probability kernel and 11.8x in updating the next-state likelihoods. The key result is that the performance of the parallel version is about 6.25x better than that of the sequential version [1]. The two versions of the algorithm retain the same accuracy. Below is some performance data the authors gathered, broken down by kernel:

  • Conclusion:

    This project represents a significant step in parallelizing automatic speech recognition on manycore platforms, specifically on NVIDIA GPUs using CUDA. It uses the data-parallel model to achieve a 9x speedup over the sequential version of the algorithm, illustrating that there is a large space to explore in improving the performance of speech recognition applications on such architectures. The next step, which we are currently working on, is implementing the inference engine using a different recognition network, the Weighted Finite State Transducer (WFST), which is optimized for state-space size. We hope to see further speedup and performance improvement from parallelizing speech recognition using this model.

    References:

    1. Jike Chong, Youngmin Yi, Nadathur Rajagopalan Satish, Kurt Keutzer. "Data-Parallel Large Vocabulary Continuous Speech Recognition on Graphics Processing Unit". Poster, GSRC Annual Symposium, September 29, 2008.

    2. NVIDIA CUDA Reference Manual, http://developer.download.nvidia.com/compute/cuda/2_0/docs/CudaReferenceManual_2.0.pdf