Pushing the Envelope: Rethinking Acoustic Processing A-1
Pushing the Envelope: Rethinking Acoustic Processing for Speech Recognition
A. Executive Summary
State-of-the-art speech recognition systems continue to improve, but the core acoustic operation
remains the same: a single feature vector (derived from the power spectral envelope over a 20-30
ms window, stepped forward by ~10 ms per frame) is compared to a set of distributions derived
from training data for an inventory of sub-word units (usually some variant of phones). This step
has remained essentially unchanged for decades, and we believe that this limited perspective is a
key weakness in speech recognizers. Note for instance that, under good conditions, human phone
error rate for nonsense syllables has been estimated to be as low as 1.5% [Allen 1994], as
compared with rates over an order of magnitude higher for the best machine phone recognizers
[Lee & Glass 1998, Deng & Sun 1994, Robinson et al. 1994]. In this light, our best current
recognizers appear half-deaf, only making up for this deficiency by incorporating strong domain
constraints. To develop generally applicable and useful recognition techniques, we must
overcome the limitations of current acoustic processing. Interestingly, even human phonetic
categorization is poor for extremely short segments (e.g. <100 ms), suggesting that analysis of
longer time regions is somehow essential to the task. This suggestion is supported by information
theoretic analysis showing discriminative conditional dependence between features separated in
time by up to several hundred milliseconds [Yang et al. 2000, Bilmes 1998].
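As a point of reference, the conventional framewise analysis described above can be sketched as follows. This is only an illustration: the 25 ms window, 10 ms step, 8 kHz sample rate, Hamming taper, and the function names `frame_signal` and `log_power_spectrum` are assumptions chosen for the example, not details taken from any particular recognizer.

```python
import numpy as np

def frame_signal(x, sample_rate, win_ms=25.0, step_ms=10.0):
    """Slice a 1-D signal into overlapping analysis frames (one row per frame)."""
    win = int(round(sample_rate * win_ms / 1000.0))
    step = int(round(sample_rate * step_ms / 1000.0))
    if len(x) < win:
        return np.empty((0, win))
    n_frames = 1 + (len(x) - win) // step
    # Row i holds samples [i*step, i*step + win).
    idx = np.arange(win)[None, :] + step * np.arange(n_frames)[:, None]
    return x[idx]

def log_power_spectrum(frames):
    """Hamming-windowed log power spectrum of each frame."""
    windowed = frames * np.hamming(frames.shape[1])
    power = np.abs(np.fft.rfft(windowed, axis=1)) ** 2
    return np.log(power + 1e-10)  # small floor avoids log(0)

# One second of noise stands in for speech (assumed 8 kHz telephone-band audio).
sr = 8000
x = np.random.default_rng(0).standard_normal(sr)
frames = frame_signal(x, sr)
feats = log_power_spectrum(frames)
```

Each row of `feats` is one short-time spectral vector; a cepstral front end would apply further smoothing and a cosine transform to rows of this kind.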
In the development of speech recognition, there have been certain innovations in feature
processing, such as delta calculation, cepstral mean normalization, or RASTA [Hermansky &
Morgan 1994], which have been able to effect valuable performance improvements with minimal
changes to the statistical processing. In general, however, signal processing and statistical
modeling techniques have co-evolved, making it unlikely that a modification in one domain will
significantly improve performance without a corresponding change in the other. This is clearly
illustrated for the case of longer temporal support, which is most simply introduced by using
highly overlapped analysis windows (e.g., 500 ms processing windows with a 10 ms frame step).
Unfortunately, successive frames of the resulting features are highly correlated when compared to
a standard 20-30 ms window, and this increased violation of the conditional independence
assumptions made in the statistical processing leads to the introduction of tweak factors at
every time scale in an effort to compensate. The close coupling of signal processing and statistical
modeling leads us to propose a balanced effort between the two areas — radical modifications to
the front end processing, along with a corresponding restructuring of the statistical models to
accommodate these modifications.
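The degree of overlap behind this correlation is easy to quantify. The short calculation below is a sketch only; the helper name `frame_overlap_fraction` is hypothetical, and 25 ms is taken as a representative value from the 20-30 ms range.

```python
def frame_overlap_fraction(win_ms, step_ms):
    """Fraction of each analysis window that is shared with the next frame."""
    return max(0.0, (win_ms - step_ms) / win_ms)

standard = frame_overlap_fraction(25.0, 10.0)    # conventional short window
long_win = frame_overlap_fraction(500.0, 10.0)   # highly overlapped long window
```

With a 10 ms step, successive 25 ms windows share 60% of their samples, whereas successive 500 ms windows share 98%, so adjacent long-window frames are computed from nearly identical data.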
The proposed work consists of two tasks:
1) Signal Processing: Replacing the current notion of a spectral energy-based vector at
time t by a set of variables based on posterior probabilities of broad acoustic categories
for long-time and short-time functions of the spectro-temporal plane, where long-time
refers to periods of up to a second. Depending on the categories, nontraditional variables
such as pitch-related features may be useful, and other ‘style’ variables such as speaking
rate could also be incorporated. These features will result in multiple streams of
probabilistic information.
2) Statistical Modeling: Modifying the statistical models both to incorporate these new
multirate front ends, and to handle explicitly areas of missing information, i.e. portions of
the time-frequency plane that are obscured by acoustic degradation. This task will also
cover the discriminative learning of dependence across streams, and the exploitation of
this information for optimal combination design. In addition, the work will result in new
event-based models, in the sense of allowing acoustic cues of multiple time spans
associated with one unit.
We will pursue a number of techniques, broadly inspired by both human audition and our wish to
develop compatible statistical underpinnings that work together within a unifying multirate (and
multistream) framework, itself derived from our understanding of perception. We seek to replace
energy-based features, the common currency of acoustic front ends, by values reflecting the
posterior probability of different signal categories, themselves defined by data-driven techniques.
These probabilistic estimates will be supported on time-frequency windows drawn from a large
and flexible family, selected by experimental results from the auditory system, by data-adaptive
decompositions, and by empirical evaluation. Further input from hearing research will come via
novel features developed to reflect pitch, rate and other perceptual attributes. These approaches
are particularly important for the case of conversational speech, which exhibits the greatest
variability in speaking rate and vocal quality, and which must be analyzed in terms of parameters
that vary across the full range of time scales, from phones through to phrases and beyond. The
best approach to variability in realization is to have a wide range of alternative information
sources from which to estimate the speech content, allied with a combination strategy able to
switch opportunistically amongst the most useful sources at any given instant.
Our proposal seeks to address the impairment of current speech recognizers through a radical
reconstruction of the interface between sound and search. The representation of speech as a
sequence of spectral envelopes will be pushed aside. Pronunciation and grammar constraints,
while invaluable for reducing word error rates, can often serve to mask basic problems in the
acoustic classification, and thus we will not explore their extension in this work. Instead, we will
concentrate on the modeling of the most basic speech sounds, with application to word
recognition tasks using systems that hold constant the later stages of processing (search,
pronunciation, language modeling, etc.). A solid foundation at this level — in itself a novel
approach in speech recognition — will accrue further benefits when constraints are re-applied.
The teams working on the two tasks in this proposal will interact closely, through membership
overlap, frequent meetings, and joint work on internal evaluations. Each team includes strong
senior players, known for their innovations in these areas. Task 1 will include Hynek
Hermansky; Task 2 will include Mari Ostendorf and Herve Bourlard. Evaluation strategies for
both tasks will be developed by George Doddington. PI Nelson Morgan, along with Dan Ellis and
Kemal Sonmez, will work on both Tasks 1 and 2. A separate proposal with SRI as the prime site
will focus on Rich Transcription, and we will ensure that SRI can exploit the technologies
developed in this Novel Approaches effort when they bear fruit.
While the approaches proposed here comprise a radical departure from mainstream methods, we
feel that pushing the envelope is a difficult but necessary step to achieve the dramatic
reduction in word error rate that this program seeks.
B. Innovative Claims
1) Use of broad category posterior probabilities over time-frequency patches rather
than short-window spectra or cepstra as basic features.
2) As a particular case, use of long temporal regions and limited spectral regions,
where "long" means analysis windows from 100 ms to 1 second, and "limited"
means 1-3 critical bands.
3) An alternate approach to time frequency analysis will incorporate a signal
adaptive front end, using an information-theoretic criterion to cluster temporal
regions with a local cosine basis tree.
4) Integration of features from analysis using differing temporal extent, using
multirate statistical models.
5) Integration of methods from Computational Auditory Scene Analysis (CASA)
into this new ASR framework.
6) Development of multiple streams based both on the methods referred to above
and on a criterion of minimal common errors between streams, subject to an
overall constraint on errors to avoid a trivial but useless solution.
7) Development of an event-based statistical model, in which event timing but not
extent is critical.
8) Development of multistream models incorporating all possible stream
combinations.
9) Use of partial information techniques so that low-confidence regions of time-
frequency are not given significant weight.
10) Development of task choice/evaluation methods to match the goals of this project.
C. Statement of Work
Core research - The contractor will research, develop, evaluate and document innovative
acoustic processing algorithms for ASR, including time-frequency tradeoffs for front end signal
processing, and development of statistical algorithms and models to optimally incorporate the
new front end features.
Evaluation — The contractor will develop methods for the rapid evaluation of progress in
intermediate stages of analysis, in the course of which small tasks will be proposed and
developed. In addition, more traditional means of ASR evaluation will be employed throughout
the project. Together, these methods will be used to guide progress within the project. In the 4th
and 5th years these procedures will be augmented by feedback from results in the governmental
evaluations of the Rich Transcription task, as the best Novel Approaches will be integrated into
the SRI-based evaluation system for the later years of the project.
Program Collaboration — The contractor will encourage collaboration within its team (in
particular with frequent internal meetings, conference calls, team access to a common Web site,
and the use of CVS or similar mechanisms to facilitate the development of common code
wherever possible). The contractor will also attend and support EARS meetings to facilitate
program level collaboration.
Reporting — The contractor will prepare and submit deliverable reports describing the progress,
results, and technical details of task-related activities.
Project Management — The contractor will manage the EARS Novel Approaches effort
including budgeting, scheduling, resource specification, and financial tracking of the project Tasks.
The contractor will coordinate, consolidate, and submit Task-level Status Report and Project
Summary deliverables as defined in the PIP.
D. Technical Rationale
Introduction
Word error rates for ASR are still too high in general, and particularly so for conversational
speech and other speech recorded under the imperfect acoustic conditions typical of many
military and commercial applications. The single largest contribution to the significant
improvements obtained by researchers over the last 5 years has been due to adaptation (in the
most general sense) over substantial amounts of testing data, and while this can be invaluable, it is
often not applicable for tasks that require excellent performance regardless of the data available
for each speaker. Even with these enhancements the performance is too poor for many
applications, and minor refinements of the basic methods are unlikely to yield the needed
improvement of conversational speech recognition down to the 5-10% range in word error rate,
particularly under general acoustic conditions (e.g., cell phone, speaker phone, and/or noisy
acoustic background).
It is further unlikely that any single magic bullet will be able to provide the desired degree of
improvement. Rather, as we have seen from the past, multiple innovations will be required to
provide significant change. However, there is an additional problem to be faced — the problem of
the so-called local minimum in error rates for complete ASR systems. As noted in [Bourlard,
Hermansky & Morgan 1996], once a system or set of approaches has been extensively
optimized, nearly any change to the system will lead to an increase in word error rates. While
most changes will simply be based on unsuccessful ideas, there is a small subset of initially
unpromising novel approaches that could lead to fundamental improvements in the complete
system once the consequences of the change are better understood. The core system design in
nearly every current state-of-the-art ASR system uses cepstral coefficients derived from an
auditory-scaled filter bank, computed over a 20-30 ms analysis window once every ~10 ms, with
Gaussian mixture models trained on such features to provide acoustic likelihoods (combined in
later components with the prior linguistic knowledge of pronunciation and grammar models). The
signal processing and statistical components for such a system have co-evolved so that it is
difficult to improve performance by modifying one without a corresponding change in the other.
Consider a much simpler problem than conversational speech recognition, the recognition of read
connected digits. Surprisingly, for the case of noisy digit strings such as is explored in the Aurora
project [Hirsch & Pearce 2000], error rates are still extremely high, averaging 13.1% for the
baseline system when tested over SNRs between 20 and 0 dB. Therefore, even for cases where
the range of pronunciation variability is small and where there is very little to improve on in the
language model, the performance is poor, even with the best current systems. This is true
despite the use of a Maximum a Posteriori (MAP) recognition scheme, in which the most
probable model will always be chosen. The suboptimality in practice must arise from incorrect
models, in the sense that the statistics do not well represent the data that will be seen in
recognition. This needs to be corrected in two ways: first, the data representation (features)
needs to be chosen so that the ultimate hypotheses are invariant over a range of conditions that
may not be seen during the training phase; and second, the statistical models must be developed
to properly represent the distributions and dependencies that will be observed from the stream
(or streams) of new features.
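For reference, the MAP decision rule mentioned above can be written as follows. The notation is assumed here for illustration and does not appear in the proposal: X denotes the sequence of acoustic feature vectors and M ranges over the candidate models (for example, word sequences).

```latex
\hat{M} \;=\; \arg\max_{M} P(M \mid X)
        \;=\; \arg\max_{M} \frac{p(X \mid M)\, P(M)}{p(X)}
        \;=\; \arg\max_{M} p(X \mid M)\, P(M)
```

The rule minimizes the probability of error only if p(X | M) and P(M) describe the data actually encountered in recognition, which is why mismatched acoustic models make the practical result suboptimal.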
Summarizing these arguments, we need a coordinated effort in signal processing and in statistical
modeling in order to successfully provide an improved alternative to today's state of the art.
However, we are left with a critical difficulty: When essentially all changes to the current
standard are likely to provide an increase in the error rate (or small decreases when used together
with the older system with some combination methods such as ROVER [Fiscus 1997]), how can
we determine the most likely directions for ultimate significant improvements? The core idea
common to all the pieces proposed in this document is to move beyond the current framewise
orientation of speech recognizers. Why should this be desirable (and not just different)? To begin
with, typical cepstral methods are inherently sensitive to spectral amplitudes, which are affected
by channel characteristics, noise, and reverberation. They also are sensitive to modification based
on context and speaking style. Temporal information is incorporated in a very specific and
limited way in the first order Markov models that we use, and it is likely that there is much more
that is fundamental to speech patterns that could be incorporated.
For these reasons, we are proposing to focus on the incorporation of acoustic information from
much longer time regions (100 ms to 1 s), using a number of different approaches to feature
extraction from the time-frequency plane. Accompanying this core approach are a number of
other key ideas that our preliminary efforts have suggested: multirate statistical models, partial
information techniques, and models of higher-order auditory processing such as pitch perception
and source formation. Additionally, our experience over the last 5 years has convinced us that
there is no single ideal form of front end signal processing, so that a multistream approach will
be used. Unlike earlier efforts in which arbitrary multiple front ends were used, we will be
focusing on developing a rational approach to the design of these front ends with the criterion of
minimizing the number of errors in common (subject to an overall constraint on errors to avoid a
trivial but useless solution).
We have formulated the proposed work in terms of 2 tasks. We are proposing sufficient
personnel in each area to ensure progress, but there will be enough overlap that a coordinated
effort between the task groups can be assured. The two tasks are:
1) Signal Processing: design and instantiate a new acoustic front end to calculate functions
of the time-frequency plane, particularly with a much longer time support than is
typically used for ASR. The output of the front end will be more like posterior
probabilities of broad classes than like spectral energies or cepstra.
2) Statistical Modeling: design and instantiate methods to handle incomplete information,
multiple rate data, and multiple streams of the posteriors based on the new time-
frequency functions.
Late in the project we will also take the best-scoring approaches from our internal evaluations
and provide them to our colleagues working on Rich Transcription for application to the ultimate
goals of the EARS program.
The proposed solutions within the two tasks are further summarized below.
Task 1: Signal Processing
A core idea for this task is to replace the current notion of a spectral energy-based vector at time t
with a vector based on posterior probabilities of broad categories for long-time (up to a second or
more) and short-time functions of the time frequency plane. Other analyses will use time and
frequency windows intermediate between these extremes. Depending on the categories,
nontraditional variables such as pitch-related features could be useful, and attributes such as
speaking rate could also be included in the classification. These features may be represented as
multiple streams of probabilistic information.
• Multiple time-frequency trade-offs: As we have noted, the dominant representation of
sound information is energy spectra calculated over brief time frames. This has been
advantageous in allowing hidden Markov models to accommodate timing variations
encountered in real speech, but at the cost of making certain kinds of temporal information
awkward or impossible to employ. Our recent work has shown effective and highly
complementary recognition by pushing the time-frequency balance to the opposite extreme:
TRAPS make independent first-stage classifications based only on information in a single
narrow frequency band, measured over an extended time window of up to 1 second
[Hermansky & Sharma 1998]. TRAPS are competitive with spectral features for clean
speech, and can halve the error rate for bandlimited noise corruption [Hermansky & Sharma
1999]. There is no particular reason to favor either dimension exclusively; a framework able
to integrate partially-dependent information will allow us to use both these views of the
signal, as well as a range of other analyses that use time and frequency windows intermediate
between these extremes. In particular, we will also use local cosine trees to zoom in and out in
time for a multitude of scales (bandwidths) weighted in probability using the entropy
criterion in growing the multiresolution tree.
• Auditory-based signal cues: In a further re-evaluation of the information extracted from
the basic sound data, we will consider looking beyond the coarse spectral energy (in all its
time-frequency guises) to extract information from the finer time structure within each band.
This information is demonstrably important to listeners, as can be witnessed by the blurry,
whispering crowd effect that is the best that can be resynthesized from current speech
recognition representations [Ellis-surfsynth 1997]. Pitch information is noteworthy in
allowing listeners to recognize voiced segments as such, and particularly in helping to glue
together different parts of the signal energy that properly belong to the same voice, and
separating them from other interfering energy. Approaches of this kind, employing pitch,
onset, modulation and spatial cues, have been developed under the banner of computational
auditory scene analysis (CASA) [Cooke & Ellis 2001], and will be integrated as a further
basis for classification within our framework.
• Principled multistream framework: The simplest approach to recognition is to identify a
single cue (such as broad short-time spectral profile) and use it for classification, but such an
approach is intrinsically brittle. Human perception apparently uses a far more robust
approach of employing numerous, redundant, parallel cues which are integrated to form a
single decision [Minsky 1986]. Recently, speech recognition systems based on multiple
independent recognizers have consistently and significantly outperformed other systems
[Fiscus 1997, Singh et al. 2001], but their hypothesis-level combinations are heuristic.
Instead, we will develop a probabilistically-rigorous system for combining many sources of
information, with different degrees of mutual dependence, to yield an optimal classification.
In this way, errors that occur independently in different information streams can be
discounted, and weak but consistent evidence from multiple sources can reinforce a correct
decision. We see relative error rate improvements of 25% or more over the best single stream
when complementary information streams are combined at the appropriate intermediate level
(for the noisy digits Aurora task) [Ellis & Bilmes 2000].
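The long-time, narrow-band analysis described in the first bullet above can be illustrated with a short sketch. It assumes a matrix of log critical-band energies at a 10 ms frame step is already available; the helper name `temporal_patterns`, the edge padding, and the per-window mean removal are assumptions made for this example rather than a description of the published TRAPS processing. A per-band classifier (not shown) would map each such vector to class posteriors.

```python
import numpy as np

def temporal_patterns(band_energies, band, context_frames=50):
    """Return one long temporal vector per frame for a single frequency band.

    band_energies: array of shape (n_frames, n_bands), e.g. log critical-band
    energies at a 10 ms step. With context_frames=50 each vector spans
    101 frames, i.e. roughly one second centered on the current frame.
    """
    trajectory = band_energies[:, band]
    # Repeat the edge values so that every frame has a full context window.
    padded = np.pad(trajectory, context_frames, mode="edge")
    width = 2 * context_frames + 1
    windows = np.lib.stride_tricks.sliding_window_view(padded, width)
    # Remove each window's mean so the pattern reflects shape rather than level
    # (an assumption of this sketch).
    return windows - windows.mean(axis=1, keepdims=True)

# Random numbers stand in for 3 s of 15-band log energies.
energies = np.random.default_rng(0).standard_normal((300, 15))
patterns = temporal_patterns(energies, band=4)
```

Here `patterns` has one 101-point row per frame, each drawn from a single band, in contrast to the conventional vector that covers all bands at one instant.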
For all of these, our vision is to displace spectral energy magnitude as the common element
underlying all acoustic modeling, replacing it with posterior probabilities of particular speech
classes — values with specific meaning that can be estimated from any kind of partial-signal cue,
and amenable to optimal combination via well-known procedures. A key part of the research will
be the determination of archetypical speech categories that will be used for generation of the low-
level posteriors. Our preference is to use data-driven methods to determine these categories,
rather than relying on linguistic theory-dependent classes such as articulatory features; however,
if data-driven approaches lead to something very close to traditional categories, we will use them.
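To make the idea of posteriors as a common currency concrete, the sketch below combines class posteriors from several streams. It covers only the simplest case, in which the streams are assumed conditionally independent given the class; handling partial dependence between streams, as proposed here, would require more than this. The function name `combine_posteriors` and the numbers are illustrative assumptions.

```python
import numpy as np

def combine_posteriors(stream_posteriors, priors):
    """Combine per-stream class posteriors into a single posterior.

    Assumes the streams are conditionally independent given the class, so that
    P(q | x_1..x_K) is proportional to P(q)^(1-K) * prod_k P(q | x_k).
    stream_posteriors: shape (K, n_classes); priors: shape (n_classes,).
    """
    stream_posteriors = np.asarray(stream_posteriors, dtype=float)
    priors = np.asarray(priors, dtype=float)
    k = stream_posteriors.shape[0]
    # Work in the log domain for numerical stability.
    log_post = np.log(stream_posteriors + 1e-12).sum(axis=0)
    log_post += (1 - k) * np.log(priors)
    log_post -= log_post.max()
    post = np.exp(log_post)
    return post / post.sum()

# Two streams, three broad classes, uniform priors (made-up values).
streams = [[0.6, 0.3, 0.1],
           [0.5, 0.1, 0.4]]
combined = combine_posteriors(streams, priors=[1 / 3, 1 / 3, 1 / 3])
```

In this example the two streams disagree about the second and third classes but agree on the first, so the combined posterior for the first class (about 0.81) is higher than either stream gives alone.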
Task 2: Statistical Modeling
A key goal in Task 2 is modifying the statistical models to both incorporate these new (multirate)
front ends, and to explicitly handle missing information (i.e., portions of the time-frequency
plane that are obscured by degradation such as noise or reverberation). This task will also include
the development of discriminative learning of dependence across streams, and incorporation of
this information for optimal combination.
• Partial information recognition: One of the greatest weaknesses of current speech
technology is the tacit assumption that the signal consists exclusively of the single voice of
interest, or, failing that, the attempt to normalize the problem in such a way as to approximate this
condition. The strength, however, of a recognition scheme based on multiple alternative
information streams is that individual streams can become unreliable or unavailable without
compromising the overall classification — provided their unreliability is correctly detected.
Thus, detection and labeling of non-target information is central at every level of this
approach. To address a classification problem in which a certain, dynamically-varying
portion of the information is unavailable, we will use the optimal tools of the missing data
formalism, recently developed specifically to address situations in which high levels of
nonstationary noise can temporarily obscure any part of the signal. This approach has been
shown to be extremely successful in small-vocabulary, high-noise conditions such as the Aurora
task [Barker, Green & Cooke 2001]. Classification that uses acoustic information only when
it is informative, and backs off to contextual inference for other dimensions, offers the best
promise of rising to human levels of recognition in the phone-detection task, and by extension
to the classification of larger units.
• Statistical models for multistream combination: In previous work on subband
(multiband) recognizers, we have developed methods for the combination of information from
disjoint parts of the spectrum. In one of these approaches, developed at IDIAP, likelihoods
are estimated by integrating over all possible reliable stream subsets. We will conduct research
to determine if such methods are useful outside of the multiband example, and in particular
for the streams developed in Task 1. In other work at OGI and ICSI, neural networks and
simple combination rules have been used to integrate information from multiple streams
outside of the multiband context, and we will incorporate this experience in our work.
Conditional auxiliary variables can also be used to generate functions that estimate the
reliability of streams for their combination. Finally, we will study a new model called HMM2,
a mixture of HMMs that in principle could permit dynamic subband
segmentation as well as optimal recombination [Bourlard et al. 2000, Bengio et al. 2000].
• Multirate and event-based models for recognition: The obvious next step after dividing
the observation space into multiple streams is to allow multiple rates, but little work has been
done with such observations because of the synchronization problems in decoding associated
with having different rates. In this work, we propose two alternative models that provide a
mechanism for using multirate features: a multirate model which is essentially a coupled set
of HMMs and a multiresolution model which has a switching mechanism between HMM
streams associated with the different rates to accommodate signal-dependent analysis. In
work on machine toolwear (wear on the edge of a machining tool), UW has shown that the
multirate model outperforms a standard Gaussian mixture HMM [Fish 2001]. A key
difference in the application of this model to speech is that it is problematic to allow
segmentation times only at the slowest rate, so we introduce the notion of event-based
models with fixed temporal extent but variable-rate analysis within each feature stream and
changing temporal resolution across streams. At SRI, related preliminary work has been done
on a kind of HMM that can handle a multiresolution feature stream, i.e. where there is a
single stream but the resolution varies as a function of time. In both cases, much work is still
needed before the methods will be useful for large vocabulary recognition, including research
on parameter tying and discriminative stream coupling. Most fundamentally, we need to gain
experience in applying this approach to automatic speech recognition.
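The partial information idea in the first bullet above can be illustrated with the simplest form of missing-data classification: marginalizing unreliable feature dimensions out of a diagonal-covariance Gaussian. This is a sketch under stated assumptions; the reliability mask is taken as given, and the function name `masked_log_likelihood` and all numbers are hypothetical.

```python
import numpy as np

def masked_log_likelihood(x, reliable, mean, var):
    """Log-likelihood of x under a diagonal Gaussian using reliable dims only.

    For a diagonal covariance, marginalizing over the unreliable dimensions
    amounts to dropping their terms from the sum.
    """
    x, mean, var = (np.asarray(a, dtype=float) for a in (x, mean, var))
    reliable = np.asarray(reliable, dtype=bool)
    d = x[reliable] - mean[reliable]
    v = var[reliable]
    return float(-0.5 * np.sum(np.log(2.0 * np.pi * v) + d * d / v))

# Two classes with unit variance; the third feature is swamped by noise.
mean_a, mean_b, var = [0.0, 0.0, 0.0], [3.0, 3.0, 3.0], [1.0, 1.0, 1.0]
x = [0.1, -0.2, 10.0]
all_dims = [True, True, True]
mask = [True, True, False]  # third dimension labeled unreliable

full_a = masked_log_likelihood(x, all_dims, mean_a, var)
full_b = masked_log_likelihood(x, all_dims, mean_b, var)
masked_a = masked_log_likelihood(x, mask, mean_a, var)
masked_b = masked_log_likelihood(x, mask, mean_b, var)
```

Using every dimension, the corrupted value pulls the decision toward class b; once the unreliable dimension is ignored, class a is preferred, which shows why correct detection of unreliable regions matters.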
Throughout these subtasks runs a consistent theme: constructing both a formal and a practical
foundation for the incorporation of multiple, incomplete acoustic streams, including streams
having different temporal support.
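The combination over all possible reliable stream subsets mentioned above can be sketched as a weighted sum over subsets. The sketch assumes that a separate estimator exists for every non-empty subset of streams and that a weight expressing the belief that exactly that subset is reliable is supplied from elsewhere; posteriors are used in place of likelihoods for simplicity, and the function names and numbers are hypothetical.

```python
from itertools import combinations

def stream_subsets(streams):
    """All non-empty subsets of the given stream names (2^K - 1 of them)."""
    return [c for r in range(1, len(streams) + 1)
            for c in combinations(streams, r)]

def full_combination(subset_posteriors, subset_weights):
    """Weighted sum of the class posteriors produced for each stream subset.

    subset_posteriors: dict mapping subset (tuple) -> list of class posteriors
    subset_weights: dict mapping subset (tuple) -> weight for that subset
    """
    total = sum(subset_weights.values())
    n_classes = len(next(iter(subset_posteriors.values())))
    combined = [0.0] * n_classes
    for subset, posterior in subset_posteriors.items():
        weight = subset_weights[subset] / total  # normalize the weights
        for q, p in enumerate(posterior):
            combined[q] += weight * p
    return combined

subsets = stream_subsets(["a", "b"])  # [("a",), ("b",), ("a", "b")]
posteriors = {("a",): [0.7, 0.3], ("b",): [0.4, 0.6], ("a", "b"): [0.8, 0.2]}
weights = {("a",): 0.2, ("b",): 0.2, ("a", "b"): 0.6}
combined = full_combination(posteriors, weights)
```

The number of subsets grows as 2^K - 1, so a direct implementation of this kind is practical only for a handful of streams.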
It could be said that, in general, the work proposed here is motivated much more by speech
perception than speech production. In other words, we take a very different approach than in
the recent trends in speech recognition that involve articulatory modeling. Both approaches are
well motivated from the perspective of trying to account for the variability observed in
conversational speech, in providing mechanisms for handling phones of widely different
durations and articulation quality. However, our perceptually-motivated models have the added
advantages that (1) they can better account for a signal that is not entirely due to articulators,
since there is noise (including multispeaker interference) and reverberation, and (2) they are more
amenable to discriminative training techniques. Of course, this argument does not preclude using
both approaches in a single system, but we believe that this would be too much to tackle in one
project. Like the other important efforts in pronunciation modeling or language modeling that we
will not work with here, the proposed project has the potential to yield a key component for
speech recognition systems of the future.
E. State-of-the-Art System
Several systems are available for use in the proposed project. First and foremost, due to the ICSI-SRI
collaboration, the complete SRI system is available for our use, including on-site SRI
researchers at ICSI who are expert in its use and modification. It will be used for evaluations at
ICSI and SRI of methods developed at all the sites, and is further described below. Team
members at all sites will also be able to use HTK, the UW Graphical Models Toolkit (GMTK),
and a variant of the ICSI hybrid neural network/HMM system that was used successfully in the
1998 Hub 4 evaluation. In addition, multirate modeling software (to be developed at UW as part
of this project) will be available to all sites.
The target Rich Transcription system for transferring the successful results of this project will be
the SRI DECIPHER™ system, which incorporates a combination of state-of-the-art techniques.
The DECIPHER™ system has consistently exhibited state-of-the-art performance throughout
several years of government-administered tests, and is distinguished by its detailed modeling of
pronunciation variation, its robustness to noise and channel distortion, and its multilingual
capabilities. These features make the system accurate in recognizing spontaneous speech of many
styles, dialects, languages, and noise conditions.
Relevant techniques used in DECIPHER™ include:
• Bottom-up state-clustered Gaussian mixture HMMs for acoustic modeling [Digalakis &
Murveit 1994].
• Acoustic adaptation to speakers, channels, and environments using affine mean and variance
transforms [Digalakis et al. 1995, Sankar & Lee 1996] and combined transform-based and
Bayesian adaptation [Digalakis & Neumeyer 1996].
• Vocal-tract length normalization [Cohen et al. 1995, Wegmann et al. 1996]
• Inverse transform speaker adaptive training [Jin et al. 1998]
• Probabilistic optimum filtering to overcome noisy and mismatched conditions [Neumeyer &
Weintraub 1994]
• Context and word-dependent phone duration modeling [Gadde 2000]
• Progressive search with lattice recognition and N-best rescoring [Ostendorf et al. 1991,
Murveit et al. 1993]
• Minimum word error decoding by posterior maximization in confusion networks [Stolcke et
al. 1997, Mangu et al. 2000]
• Multiple system combination based on word confusion networks [Fiscus 1997, Stolcke et al.
2000]
• Acoustic Model Training using MMIE [Jing et al. 2001]
These and other components are already integrated in a reconfigurable software system that can
be readily retargeted to new tasks. For example, the same system with minor changes has been
applied successfully to the NIST HUB4 and Hub5 tasks [Sankar et al. 1998; Stolcke et al. 2000]
and the SPINE evaluations [Gadde et al. 2001].
F. Tasks
Task Number  Task Title            Lead Site                                        Principal Investigator(s)
1            Signal Processing     International Computer Science Institute (ICSI)  Nelson Morgan (ICSI), Hynek Hermansky (OGI)
2            Statistical Modeling  International Computer Science Institute (ICSI)  Nelson Morgan (ICSI), Dan Ellis (Columbia)
Where two names are listed in the Principal Investigator (PI) column, the PI is listed first and
the co-PI is listed second.
G. Costs
Category               Year 1     Year 2     Year 3     Year 4     Year 5      Total
Task 1: Signal Processing and Evaluation
Labor                 111,231    115,772    120,541    125,548    130,805    603,897
Benefits               13,901     14,595     15,324     16,091     16,895     76,806
travel                 10,500      8,250      8,250      8,250      8,250     43,500
ODC: tuition            6,450      6,450      6,450      6,450      6,450     32,250
equipment               6,510          0          0          0          0      6,510
subcontracts          579,337    564,959    569,411    573,990    578,894  2,866,591
overhead              137,702     96,339    100,160    104,173    108,385    546,759
Total Cost            865,631    806,365    820,136    834,502    849,679  4,176,313
Task 2: Statistical Modeling and Evaluation
Labor                 111,231    115,772    120,541    125,548    130,805    603,897
Benefits               13,901     14,595     15,324     16,091     16,895     76,806
travel                 10,500      8,250      8,250      8,250      8,250     43,500
ODC: tuition            6,450      6,450      6,450      6,450      6,450     32,250
equipment               6,510          0          0          0          0      6,510
subcontracts          529,354    517,620    532,422    546,303    555,364  2,681,063
overhead              137,702     96,339    100,160    104,173    108,385    546,759
Total Cost            815,648    759,026    783,147    806,815    826,149  3,990,785
TOTAL               1,681,279  1,565,391  1,603,283  1,641,317  1,675,828  8,167,098
Budgetary Estimate
               ICSI   Columbia      IDIAP        OGI        SRI         UW      Total
Task 1    1,309,722    500,002          0  1,500,000    866,589          0  4,176,313
Task 2    1,309,722    500,002    318,940          0    843,866  1,018,255  3,990,785
Total     2,619,444  1,000,004    318,940  1,500,000  1,710,455  1,018,255  8,167,098
H. Tasks
1.1 Task 1, Signal Processing
Participating Sites: ICSI (Lead Site), OGI, Columbia, SRI International
Key Personnel: Nelson Morgan (PI), Hynek Hermansky (co-PI), George Doddington, Dan
Ellis, Kemal Sonmez
Dependencies: Task 1 can be done without Task 2, but has a much greater likelihood of success
if both are done, since traditional modeling approaches are not well-suited to the kinds of signal
processing proposed in Task 1. Task 2 does not make sense without Task 1.
1.2 Technical Objective
This task comprises a set of related novel approaches to front end signal processing for ASR.
The goal is to produce multiple feature representations which, when used together, will reduce
word error rates on conversational and/or noisy speech. The methods employed should be
applicable to some extent with conventional means for combining such approaches (e.g., using
ROVER [Fiscus 1997]), but the much greater potential lies with coordination with the statistical
modeling approaches to be developed in Task 2.
More specifically, our goal is to replace the current notion of a spectral energy-based vector at
time t with variables based on posterior probabilities of broad categories for long-time and short-
time functions of the time-frequency plane. Depending on the categories, nontraditional variables
such as pitch-related features could be useful, and speaking style variables such as speaking rate
could also affect the classification. Long-time features could be based on periods as long as a
second. These features may be represented as multiple streams of probabilistic information.
Other analyses will use time and frequency windows intermediate between these extremes.
The activities proposed under this task consist of:
• Signal processing using multiple time-frequency tradeoffs, including intervals of up to a
second and spectral bandwidths of down to a single critical band
• Use of auditory-based signal cues to generate non-traditional (e.g., pitch-related) variables
• Employing these processing alternatives to generate functions relating to the posterior
probabilities of data-driven acoustic classes, to be used as the front end observations in
the recognition system.
• Principled generation of mutually-beneficial streams of such functions for use with
multistream, multirate statistical models
If successful, this work would remove much of the sensitivity current recognizers have to short-
term spectral variability that occurs over speakers, speaking styles, and acoustic conditions, both
by placing a greater reliance on temporal information and by reducing the dependence on any
particular aspect of the time-frequency plane.
Any such high-risk research program can fail in many ways. However, the team assembled for
this task has had a number of key successes in this area, and the methods chosen are based on
long study of the problem and on promising preliminary results. Therefore, we think it highly
likely that research along these lines will yield discoveries useful to working systems in the
future.
1.3 Technical Essence
In order to develop novel approaches to the acoustic signal processing done in an ASR front end,
it is necessary to break away from the traditional structure of a local cepstrum based on the
outputs of a mel-scaled filter bank over a 20-30 ms analysis window. Our team's experience and
intuition point us towards alternatives that are characterized by several key properties:
• The core features will be functions of broad-category posterior probabilities over time-
frequency patches, where the patches can include long temporal regions and limited
spectral regions. Here, "long" means analysis windows from 100 ms to 1 second, and
"limited" means 1-3 critical bands.
• Features reflecting additional aspects of the signal such as pitch and speaking rate will be
incorporated to improve the posterior probability estimation.
• Multiple streams of features will be designed and tested, based both on the methods
above and on a criterion of minimal common errors between streams, with an overall error
reduction constraint to prevent a trivial but useless solution.
Details of the proposed methods follow in section 1.4.
1.4 Technical Approach
Introduction
The core idea for Task 1 is to replace the current notion of a spectral energy-based vector at a
given time instant with a set of variables describing posterior probabilities of broad categories for
long-time and short-time functions of the time-frequency plane. This research direction is based
on the empirical observation that information relevant for deriving sub-word categories is not
confined to short regions in the immediate vicinity of a particular frame, but rather is
distributed over a long timespan corresponding to a syllable or more [Bilmes 1998, Yang et al.
2000]. We have also observed that speech categories can be classified surprisingly well given very
little spectral information but only the temporal evolution of frequency-localized parameters
[Hermansky et al. 1999]. Finally, we have experimentally determined that multiple feature
vectors having different properties can be combined to yield lower error rates on a number of
problems [Janin et al. 1999].
Our experience with mechanisms to exploit these observations is still limited, and many of the
key parameters of such a new approach are unexplored. Some of the questions remaining to be
asked are:
• Once we deviate from time-local spectrally (or cepstrally) based features, how should we
determine the optimal time-frequency patches to use, and what functions of these patches
should be applied? Early explorations suggest that frequency-local temporal patterns of
up to 1 s can be extremely useful in augmenting the more time-local cepstral patterns, but
we have no assurance of the optimality of such an approach.
• If these longer analysis windows are used to generate posterior probabilities, for what
classes should these probabilities be generated? Presumably they should be sub-word
classes, perhaps for primitives such as broad acoustic categories. However, we prefer to
derive such classes from training data rather than relying on expert opinion.
• Traditional speech features, as well as the methods mentioned above, rely on local
spectral envelope information, and deliberately suppress pitch information as unhelpful
in practice. However, harmonic structure is a strong cue for the specific character of
frequency-local information. Simply extending local feature vectors to include pitch
information has not been a successful approach (H.L. Mencken noted that "For every
complex problem there is always a solution that is simple, elegant, and wrong."). We
think that this information may be more helpful in estimating the probabilities for broad
classes using limited frequency regions, supporting recognition decisions that are less
sensitive to signal variability. Still, exactly how do we make use of pitch and other
prosodic features present in the speech signal?
• Assuming that there is no one single sufficient answer to the questions above, how can we
choose candidates for combination in a multistream multirate system?
In the methods described below, which expand the brief summaries of Section D, we will return
to these points and give our current perspectives on how we will attempt to answer the
questions posed. We will also draw comparisons with related work.
Multiple time-frequency trade-offs
We will explore the use of different varieties of time-frequency functions as the basis of the
feature streams providing fuel to our statistical engine. For all the reasons described earlier, it is
highly probable that we need to extend our analysis to significantly larger time windows than are
typically used. As a specific example, we will work with narrow spectral subbands and long
temporal windows (up to a second or more, but in any event sufficiently long for multiple
syllables). In our previous work on this topic we have used the mean and variance-normalized
temporal trajectory of logarithmic spectral energy in a single critical band as a feature vector for
class posterior probability estimation [Hermansky & Sharma 1998]. We developed this
approach for use in a multistream combination system alongside conventional features, and using
posterior estimates as intermediate features. For the OGI Numbers task (natural numbers such
as "sixty-five" spoken over the telephone), the temporal trajectory features afforded a 23% relative
error rate improvement to 3.7% overall, compared to 4.8% for the posterior-feature system without
temporal trajectory features; our plain GMM-HMM baseline had a word error rate of 6.0% on
the same task [Hermansky et al. 1999].
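The temporal trajectory features described above can be sketched roughly as follows. This is a minimal illustration, not the implementation used in the cited work: the function name, the 10 ms frame step, the context of 50 frames on each side (about one second in total), and the edge padding are all assumptions made for the example. It needs NumPy 1.20 or later for sliding_window_view.

```python
import numpy as np

def temporal_pattern_features(log_band_energy, context=50):
    """Cut a long, normalized temporal trajectory around every frame.

    log_band_energy: 1-D sequence of log energies for ONE critical band,
    assumed to hold one value per 10 ms frame (an assumption of this sketch).
    Returns an array of shape (n_frames, 2 * context + 1); row t is the
    mean- and variance-normalized trajectory centered on frame t.
    """
    x = np.asarray(log_band_energy, dtype=float)
    width = 2 * context + 1
    # Repeat the edge values so that every frame has a full-length window.
    padded = np.pad(x, context, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, width)
    mean = windows.mean(axis=1, keepdims=True)
    std = windows.std(axis=1, keepdims=True)
    # Guard against flat windows (zero variance).
    return (windows - mean) / np.maximum(std, 1e-8)
```

Each row would then be passed to a per-band classifier that estimates class posterior probabilities; that stage is not shown here.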
Working with such narrow bands makes the acoustic processing inherently more robust to
frequency-localized signal degradations. Psychoacoustics, however, has identified an effect
known as Co-modulation Masking Release (CMR), in which correlated noise in multiple bands
can actually help the identification of signals in one of the bands [Hall et al. 1984]. This suggests
that we should incorporate more than a single critical band in each of our estimators.
This brings us back to the question of determining sets of optimal time-frequency patches. One
approach is to follow psychoacoustically-guided inspiration, and choose our spectrotemporal
support on the basis of experimental results for broad category classification or detection.
However, we also have other tools at our disposal, such as linear and nonlinear discriminant
optimization techniques [Kajarekar et al. 2001]. In particular, we can segment the signal in a
principled manner by assigning proper probabilities to different time-frequency resolution
tradeoffs using a local cosine tree representation. Wavelet packets and cosine bases have good
feature localization properties, and fast algorithms exist for building multiresolution
representations. However, they have a major shortcoming for pattern recognition applications, in
that the representations are not time-invariant. Because the transform coefficients of a pattern are
quite different when subjected to translation, there is no simple mechanism to define a
translation-invariant detector. In this project, we will incorporate a novel way to address this
problem by using local cosine trees to zoom in and out in time to segment the transients and
stationary parts. In processing the signal, long segments on the order of a second are transformed
with a local cosine basis to generate an entropy-based tree and a dyadic time axis division. The
shortest segments at the highest resolution (the leaf nodes of the tree) form the basic frames. The
local cosine transform effectively clusters these frames in groups of powers of two, depending on
their frequency content. Each segment is then analyzed into Fourier spectra at a multitude of
scales (bandwidths), and probability estimates from each scale can be weighted by the entropy
criterion computed during the growth of the multiresolution tree.
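To make the entropy-driven dyadic time division concrete, here is a simplified sketch. It is not the local cosine tree proposed above: it uses rectangular, non-overlapping blocks with an orthonormal DCT-II in place of smoothly windowed local cosine bases, and the additive entropy cost and the function names are assumptions chosen for illustration. A segment is split in two whenever the combined cost of its halves is lower than its own cost.

```python
import numpy as np

def dct_ortho(x):
    """Orthonormal DCT-II of a 1-D signal (stand-in for a local cosine transform)."""
    n = len(x)
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    basis = np.sqrt(2.0 / n) * np.cos(np.pi * (t + 0.5) * k / n)
    basis[0] /= np.sqrt(2.0)
    return basis @ x

def entropy_cost(coeffs, total_energy):
    """Additive entropy cost: -sum p*log(p), with p = c^2 / total signal energy."""
    p = coeffs ** 2 / total_energy
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def best_dyadic_split(x, min_len, total_energy=None, start=0):
    """Return (cost, segments); segments is a list of (start, end) sample ranges."""
    x = np.asarray(x, dtype=float)
    if total_energy is None:
        total_energy = float((x ** 2).sum()) or 1.0
    own_cost = entropy_cost(dct_ortho(x), total_energy)
    # Stop when the halves would be shorter than the basic frame length.
    if len(x) // 2 < min_len or len(x) % 2:
        return own_cost, [(start, start + len(x))]
    half = len(x) // 2
    left_cost, left_segs = best_dyadic_split(x[:half], min_len, total_energy, start)
    right_cost, right_segs = best_dyadic_split(x[half:], min_len, total_energy, start + half)
    if left_cost + right_cost < own_cost:
        return left_cost + right_cost, left_segs + right_segs
    return own_cost, [(start, start + len(x))]
```

The returned segments tile the input with lengths that are power-of-two multiples of the shortest frame. The per-segment costs could in principle serve as the weights mentioned above, although this sketch does not return them separately.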
Whether our time and frequency extents are based on empirical classification/detection results, or
adaptively derived from local signal properties, we aim to generate feature streams that look more
like probabilities than energies. Returning to our second introductory question, for what classes
should these probabilities be generated? Our preference is to develop data-driven broad classes,
as in [Kajarekar & Hermansky 2000] where we achieved a 15% relative improvement in word
error rate by defining new subword units that maximized the conditional independence of words
and features given the unit — i.e., the criterion in deriving the unit was to capture as much as
possible of the lexically-relevant information in the features. In the proposed project, new broad
classes will be derived for the chosen time-frequency patches, and the combined set of
probabilities for a particular time can be used as an input to classification machinery for finer
categories. One particularly salient characteristic of subword units (which would be a likely cue
for broad classes) is voicing. Voicing can be extracted from signal fine structure, as discussed in
the next section.
Auditory-based signal cues
We will also look beyond the coarse spectral energy (in all its time-frequency guises) to extract
information from the finer time structure within each band. Sounds that we perceive as having a
pitch — such as voiced speech, the buzzing of a bee, or the squeaking of a door — generally exhibit
energy modulation at a specific fundamental frequency across a number of frequency bands. For
instance, the fundamental of voiced speech usually lies in the range 100-250 Hz, but the formant
frequencies excited at that pitch extend up to 3000 Hz and beyond. The prevalence and
importance of these wideband periodic signals, and the strong perceptual cohesion of all the
frequency bands involved, inspired the development of the weft element, which represents this
kind of signal in a single, integrated object [Ellis 1997]. Wefts are extracted from a mixed signal
by calculating the autocorrelation of the energy envelope in each peripheral frequency channel,
summing normalized autocorrelations across channels to form a periodogram (a display of
dominant periodicities as a function of time), picking the dominant periodicities at each time slice,
then going back to the per-channel autocorrelations to estimate the energy intensity at each
periodicity for each frequency band. The result is an element defined by a fundamental-period
track, and a smooth time-frequency energy surface which describes the overall amplitude and
formant structure associated with that period.
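The first steps of this procedure (per-channel envelope autocorrelation, summation across channels, and picking the dominant period for one time slice) might look like the sketch below. It is illustrative only and is not the weft analysis of [Ellis 1997]: the function names, the 80-400 Hz search range, and the choice of a single dominant period are assumptions, and the final step of reading per-band intensities back from the per-channel autocorrelations is omitted.

```python
import numpy as np

def normalized_autocorr(envelope, max_lag):
    """Autocorrelation of a mean-removed envelope, scaled so that lag 0 equals 1."""
    e = np.asarray(envelope, dtype=float)
    e = e - e.mean()
    denom = float(np.dot(e, e))
    if denom == 0.0:
        return np.zeros(max_lag + 1)
    acf = [np.dot(e[:len(e) - lag], e[lag:]) for lag in range(max_lag + 1)]
    return np.array(acf) / denom

def dominant_period(envelopes, sample_rate, f_min=80.0, f_max=400.0):
    """envelopes: array (n_channels, n_samples) of energy envelopes for one time slice.

    Returns (lag_in_samples, summary), where summary is the sum of the
    normalized autocorrelations over channels (one slice of a periodogram).
    """
    max_lag = int(sample_rate / f_min)
    min_lag = int(sample_rate / f_max)
    summary = sum(normalized_autocorr(ch, max_lag) for ch in envelopes)
    lag = min_lag + int(np.argmax(summary[min_lag:max_lag + 1]))
    return lag, summary
```

For envelopes modulated at 125 Hz and sampled at 8 kHz, the strongest peak in the search range falls at a lag of 64 samples, that is, one fundamental period.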
The key advantage of this operation is that, in suitable circumstances, it is possible to
simultaneously estimate the intensity of two different periods in the same time-frequency cell.
Also, because the time-frequency surface is assumed to have smoothly-varying properties, any
missing values (where the dominance of one signal over another makes it impossible to extract
the intensity of the weaker source in that band) can be estimated either by simple interpolation of
the surrounding values, or by more sophisticated inference.
Periodicity-based signal analysis is an example of the computational auditory scene analysis
(CASA) approaches which will be applied to the problem of better estimating the low-level
probabilities that will form our new front end streams. Other techniques that we will investigate,
similarly concerned with organizing the signal according to the different sound sources present,
include using information about common onset and offset across frequency bands, and coherent
modulation in amplitude and frequency [Cooke & Ellis 2001]. There is strong evidence for tuned
receptive fields in the primary auditory cortex, analogous to the oriented edge detectors in the
primary visual cortex, which could be implicated in the perception of formant transitions in
speech [Simon et al. 1998], and analysis along these lines will be investigated within the same
CASA framework.
In previous experiments along these lines, incorporation of an autocorrelation-based harmonicity
measure as a basis for a soft labelling of time-frequency cells as corrupt or reliable yielded a word
error rate improvement of approximately 10% relative (from 42% to 38%) at 0 dB SNR in the
Aurora noisy digits task, when compared to the same missing-data system using only static noise
estimates for labelling [Barker, Cooke & Ellis 2001]. This result, based only on clean speech
training data, actually outperforms the multicondition baseline result (trained on speech
corrupted to resemble the test set), even though multicondition training typically betters clean-
training by a factor of 2 or more in this high-noise case.
Principled multistream framework
Finally, we will explore approaches to generating multiple streams from the algorithms discussed
above. One simple approach is to use a number of likely but arbitrary candidates from the
different algorithms. In particular, we would generate streams with differing temporal and
spectral resolutions, since even events that seem to be quite localized in time or frequency can
have correlated effects over significant stretches of time due to the inertia of the vocal
mechanisms (coarticulation). But even given the above methods, there is a range of functions
that could be applied to the streams to give them different properties, for instance to develop
features reflecting energy with particular time-frequency orientations, i.e., energy moving up or
down in frequency at a particular rate. In all of these cases, we will try to choose streams that
generate orthogonal errors; at least, we wish to minimize the number of errors in common for the
different streams (subject to an overall constraint on errors to avoid a trivial but useless solution,
as noted previously). This requirement can be defined quite simply for the case of combining
multiple recognizers, as is done with approaches such as ROVER [Fiscus 1997]; however, what
would it mean for combination at the level of the acoustic front end? In previous experiments
[Wu et al. 1998] we have compared the specific words and utterances where errors occur for
systems based on single streams to predict which streams would be most profitably combined.
However, since our combination is at an earlier stage, it would make more sense to develop an
orthogonality criterion at the acoustic level. We have investigated measures of the mutual
information in different feature streams as a basis for choosing both which streams to combine
and the method of combination, for instance, before or after the posterior estimation stage [Ellis
& Bilmes 2000]; minimizing the conditional mutual information of posterior estimates based on
single feature sets correctly selected the best-performing stream pair, which achieved a 25%
relative word error rate improvement over the best single stream (for the Aurora noisy digits
task). Another approach to stream selection would be to examine broad category or
phonetic errors for the individual streams, or to compute relative entropies between error signals
over the different classes for each stream. We will experiment with all these approaches to this
problem.
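One simple reading of the "minimal common errors" criterion, with its overall error constraint, is sketched below. The stream names and the per-stream error-rate threshold are hypothetical, and this token-level count is only one of the selection measures discussed above (it is not the conditional mutual information criterion).

```python
from itertools import combinations

def pick_stream_pair(errors, max_error_rate):
    """Choose the two streams that share the fewest errors.

    errors: dict mapping a stream name to a list of booleans, one per test
    token, where True means that stream got the token wrong.
    max_error_rate: streams with a higher individual error rate are excluded;
    this is the overall constraint that rules out trivial solutions.
    Returns (common_error_count, stream_a, stream_b), or None if fewer than
    two streams satisfy the constraint.
    """
    eligible = {name: e for name, e in errors.items()
                if sum(e) / len(e) <= max_error_rate}
    best = None
    for a, b in combinations(sorted(eligible), 2):
        common = sum(x and y for x, y in zip(eligible[a], eligible[b]))
        if best is None or common < best[0]:
            best = (common, a, b)
    return best

# Hypothetical example: two poor streams with perfectly complementary errors
# would win on common errors alone, so the constraint is needed to exclude them.
T, F = True, False
example_errors = {
    "plp":   [T, F, F, T, F, F, F, F],
    "traps": [F, T, F, F, F, F, T, F],
    "msg":   [T, T, F, T, F, F, F, F],
    "bad_a": [T, T, T, T, F, F, F, F],
    "bad_b": [F, F, F, F, T, T, T, T],
}
```

With a threshold of 0.4 the two 50%-error streams are excluded and the pair sharing no errors among the remaining streams is chosen; with no effective threshold the complementary but poor pair is selected instead, which is the trivial but useless solution.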
1.5 Evaluation Methodology
It is now generally accepted that research productivity is enhanced, if not determined, by
guidance through evaluation feedback. This is true at least in human language technologies and it
is endorsed by the subject solicitation through the requirement for this section on evaluation.
Careful definition of evaluation can also help to define the research tasks and to focus research on
key problem areas. Thus evaluation ideally should be a dynamic process during the research
process — as better understanding of the problem is gained, research tasks and performance
measures should be modified to focus on the critical technical challenges.
Spoken language is arguably the most challenging problem that humankind faces today. Being the
fundamental process of one human communicating semantic information over an acoustic channel
to another individual, it is a multifaceted problem that in the limit requires human competence in
acoustic, linguistic, and semantic processing, including complete (and dynamic) world knowledge.
Some of the most important challenges in taking on such a huge mission involve calibration of the
problem, specifically in segmenting the problem into sub-problems that can be tackled in a
meaningful way, even if not fully solved, and in identifying a successful plan of attack (which
means identifying the critical technical challenges and the order in which they should be
addressed). The subject solicitation (for Novel Approaches) presents an opportunity, an
invitation even, to do this kind of segmentation. Potential opportunities for segmentation lie
between the levels of acoustic, phonetic, syllabic, word, syntactic and semantic representation.
Evaluation methodology for speech transcription comprises four parts. The most fundamental of
these is the definition of the evaluation task. The other three supporting parts are the selection
of a performance measure, the identification/collection/selection of a speech corpus and other
research resources, and the establishment of an evaluation process.
For the signal processing research task, we expect to define the evaluation task at a low level of
representation, primarily at the level of the word and the level of the syllable. The lower the
level the better, because the signal processing research addresses the problem of acoustic
representation and of decoding speech into the most primitive symbolic level, typically a
phonetic-level unit. The closer evaluation is to this level, the better the feedback and insight on
issues and technical challenges. There is danger in evaluating at the phonetic level, however,
because the conceptual integrity and perceptual value/stability at this level is not clear. Thus we
propose to create evaluation tasks at the syllabic and word level.
We also propose to establish human performance benchmarks and to use this information to
better understand where the challenges lie and what levels of algorithmic performance are
reasonable to expect. When human listeners are unconstrained, transcription performance is quite
high. Word error rates of less than one percent may be observed for human listeners, even for
conversational speech at 10 dB S/N ratio [Deshmukh et al. 1996]. In order to achieve this level of
performance, however, we believe that human listeners depend on semantic and contextual
information. And in order to provide truly useful guidance to us, our human performance
benchmarks must use only those sources of knowledge that we are dealing with. But limiting
human listeners to these (lower level) sources of knowledge is a tricky and difficult challenge.
While we propose to use only natural and unconstrained speech as test data, there may be
reasonable ways to eliminate or minimize the unwanted higher level information provided to
listeners. For example, we can excise sub-phrase (~1 sec, two- or three-word) segments and
present these randomly to human listeners.
Word error rate (WER) has been used as the standard performance measure for speech
transcription tasks for many years. And while efforts have been made to find more meaningful
measures of transcription performance (such as information-based measures and information-
weighted WER), the WER remains as a simple and reasonably satisfactory performance measure.
We therefore plan to use it also, and to adapt it to syllable recognition performance as well.
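For reference, the standard WER computation (edit distance between reference and hypothesis token sequences, divided by the reference length) can be sketched as below. The function names are ours, and the scoring ignores details such as text normalization that real scoring tools handle. Because it operates on token lists, the same function applies unchanged to syllable sequences.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference tokens.

    reference, hypothesis: lists of tokens (words, or syllables for a
    syllable error rate).
    """
    if not reference:
        raise ValueError("reference must contain at least one token")
    # prev[j] = edit distance between the reference prefix seen so far
    # and the first j hypothesis tokens.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        cur = [i]
        for j, hyp_tok in enumerate(hypothesis, start=1):
            cur.append(min(prev[j] + 1,                         # deletion
                           cur[j - 1] + 1,                      # insertion
                           prev[j - 1] + (ref_tok != hyp_tok))) # substitution or match
        prev = cur
    return prev[-1] / len(reference)

def relative_improvement(baseline_wer, new_wer):
    """Relative error reduction, e.g. from 4.8% to 3.7% gives about 0.23."""
    return (baseline_wer - new_wer) / baseline_wer
```

The second helper reproduces the kind of relative figure quoted earlier in this proposal: a drop from 4.8% to 3.7% is a relative improvement of about 23%.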
Another way of looking at transcription errors which has been useful is to tabulate these errors in
a detection task framework [Doddington et al. 1997]. This results in miss probabilities and false
alarm rates which can be analyzed in various ways. For example, when these statistics are
conditioned on word (or on syllable) they can provide valuable insight into failure mechanisms
and model weaknesses.
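A minimal sketch of such a tabulation follows. It assumes that an alignment between reference and hypothesis is already available as a list of token pairs; the data layout and function name are hypothetical. Since a false alarm rate needs a choice of denominator that the text does not specify, the sketch reports raw false alarm counts per word rather than a rate.

```python
from collections import Counter

def detection_stats(alignment):
    """Tabulate per-word misses and false alarms from an aligned transcript.

    alignment: list of (ref_token, hyp_token) pairs; None in the reference
    slot marks an insertion, None in the hypothesis slot marks a deletion.
    Returns {word: {"p_miss": float or None, "false_alarms": int}}.
    """
    ref_count, hyp_count, hits = Counter(), Counter(), Counter()
    for ref, hyp in alignment:
        if ref is not None:
            ref_count[ref] += 1
        if hyp is not None:
            hyp_count[hyp] += 1
        if ref is not None and ref == hyp:
            hits[ref] += 1
    stats = {}
    for word in set(ref_count) | set(hyp_count):
        misses = ref_count[word] - hits[word]
        stats[word] = {
            # Undefined (None) for words that never occur in the reference.
            "p_miss": misses / ref_count[word] if ref_count[word] else None,
            "false_alarms": hyp_count[word] - hits[word],
        }
    return stats
```

Conditioning on syllables instead of words only requires passing syllable-level alignments.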
Evaluation corpora will be drawn from published corpora that are available from LDC and other
linguistic resource suppliers. And while these corpora will be under our control and therefore not
truly unseen by us, we will control the use of speech data in the usual way to avoid being
misled by biased results, with a division of speech data resources into "training data",
"development test data", and "evaluation test data".
The most valuable feedback, especially at the exploratory stages of research being proposed here,
requires detailed analysis of performance. This includes, as a prime example, the conditioning of
performance on parameters of interest. Thus the evaluation tools must be flexible and easily
modified in order to accommodate the range and change of research focus. Evaluation will thus
play an active role in research through exploratory analysis of test results in an effort to
understand the problem better, not just to measure the overall success of various algorithmic
ideas.
In summary, the evaluation protocol must be nimble enough to evolve along with advances in the
ambitious research program described above, but will feature:
• establishment of human performance benchmarks
• regular evaluation of technical achievement
• test data drawn from standard LDC corpora
• evaluation at the phone, syllable, and/or word levels, with variants to be explored as part of
this effort
• use of standard metrics, such as word error rate and its adaptations to the syllable recognition task.
Evaluation will play a central role in our research effort, as it does in all productive speech
research environments. The fact that we are dealing with novel approaches and with new ideas
does not eliminate or minimize the value of performing evaluations and performance analysis to
gain an understanding of the problem (and of our proposed ideas). In fact the situation is just the
opposite — new and radical ideas need all the more testing and analysis in order to become mature
contributions to speech technology.
1.6 Resources Required
The research proposed here will be solely based on available speech data and annotations. LDC
Corpora (Switchboard, Call Home, Broadcast News), in addition to ICSI and UW Meeting data,
will be used for train and test sets for large vocabulary work. We will also use ICSI's
phonetically labeled Switchboard data, and a variety of small vocabulary corpora available at ICSI
(e.g., Aurora) for quick turnaround development work.
1.7 Work Plan
In the first year, we expect to focus on the exploration of time-frequency tradeoffs for the front
end work of Task 1, making use of some sensible default choices for the broad acoustic categories
(e.g., voicing, stop vs continuant, etc.) that will be used for the posterior probabilities. In parallel,
we will work on refining the choice of these categories through data-driven methods. We will also
begin to develop pitch-related measures to assist in the probability estimation. Initial Task 1
work will be based on a small vocabulary task (e.g., Aurora) to keep experiment turnaround time
fast, to minimize language model effects (given the focus on front end design), and because we
will be using a simplified statistical modeling structure since the Task 2 work will not yet have
generated new instantiations for Task 1 to use.
After the first year, we will use a reduced set of candidate approaches from the first year (the
better time-frequency tradeoffs, pitch-related features, and data-driven broad acoustic categories
that we found) and do a joint development to study the interaction between these choices. In
addition, once we observe encouraging results for a particular approach on a small task, we will
migrate to a task that will be compatible with the Rich Transcription effort at SRI/ICSI/UW, with
the goal of eventually transferring technology to that effort. Task 1 researchers will work more
closely with the Task 2 researchers at this point, both to transfer the new front end to the
statistical modelers and to incorporate the improved models in the front end studies.
Our initial efforts will focus on the use of a single feature stream, but once all of these methods
have been developed, we will put a greater focus on the principled generation of multiple feature
streams. By this time Task 2 work should have refined the stream combination methods.
Throughout the project, we will frequently compare notes on the different approaches being
tested, e.g., for determination of time-frequency tradeoffs or of functions for multistream
generation. In this way, we hope to encourage diverse thought within the collaboration while
providing frequent opportunities for merging the best ideas. In the final two years, we will work
on merging some of the different signal processing approaches more explicitly, as well as
transferring the best technology to the SRI team's Rich Transcription system.
1.8 Milestones and Schedule
1.9 Cost Breakdown
Category               Year 1     Year 2     Year 3     Year 4     Year 5      Total
Task 1: Signal Processing and Evaluation
Labor                 111,231    115,772    120,541    125,548    130,805    603,897
Benefits               13,901     14,595     15,324     16,091     16,895     76,806
labor total:          125,132    130,367    135,865    141,639    147,700    680,703
travel                 10,500      8,250      8,250      8,250      8,250     43,500
ODC: tuition            6,450      6,450      6,450      6,450      6,450     32,250
equipment               6,510          0          0          0          0      6,510
ODC total:             12,960      6,450      6,450      6,450      6,450     38,760
subcontracts:
Columbia              100,000    100,000    100,001    100,000    100,001    500,002
OGI                   300,000    300,000    300,000    300,000    300,000  1,500,000
SRI                   179,337    164,959    169,410    173,990    178,893    866,589
subcontract total:    579,337    564,959    569,411    573,990    578,894  2,866,591
overhead              137,702     96,339    100,160    104,173    108,385    546,759
Total Cost Estimate   865,631    806,365    820,136    834,502    849,679  4,176,313
Task 1 Roll-up:
Labor                   603,897
Benefits                 76,806
travel                   43,500
ODC: tuition             32,250
ODC: equipment            6,510
subcontracts          2,866,591
overhead                546,759
Total Cost Estimate   4,176,313
2.1 Task 2, Statistical Modeling
Participating Sites: ICSI (Lead Site), UW, Columbia, SRI International, IDIAP
Key Personnel: Nelson Morgan (PI), Dan Ellis (co-PI), Hervé Bourlard, George Doddington, Mari
Ostendorf, Kemal Sonmez
Dependencies: Requires Task 1 (Signal Processing), since the point of this task is to model and
integrate the information streams developed in Task 1.
2.2 Technical Objective
This task is aimed at improving the ability of statistical ASR engines to accommodate novel front
end signal processing such as the approaches developed in Task 1. The goal is to develop
statistical models and modeling methods that will optimally incorporate multiple feature
representations, including representations with much longer time support than is used in
conventional systems. When used together with such diverse front end representations, the
desired result would be to reduce word error rates on conversational and/or noisy speech. The
methods employed should be applicable to some extent with existing multiple front ends (e.g.,
PLP and RASTA), but the much greater potential lies with coordination with the front end
approaches to be developed in Task 1.
More specifically, our goal is to replace the current notion of a statistical model based on a single
stream of observations assumed to have a short-time basis with statistical models that both
incorporate these new (multirate) front ends, and explicitly handle missing information (i.e.,
portions of the time-frequency plane that are obscured by degradation such as noise or
reverberation). This task will also include the development of discriminative learning of
dependence across streams, and incorporation of this information for optimal combination.
The activities proposed under this task consist of:
• Development of multirate and event-based statistical models for speech recognition
• Development of partial information approaches for the noisy incomplete data that will
comprise some of the multiple streams
• Principled combination of multiple streams for minimum error rates, particularly for
posterior probabilities for data-driven acoustic classes used as front end observations.
If this work is successful, it would vastly broaden the capability of recognition engines to
accommodate very different feature streams such as the ones proposed in Task 1. If both are
successful, the results of this task would complement those of Task 1 to further decrease the
sensitivity current recognizers have to variability in the short-term spectrum that occurs over
speakers, speaking styles, and acoustic conditions.
As with Task 1, such a high-risk research program can fail in many ways. However, as with Task
1, the team assembled for this task has had a number of key successes in this area in the past, and
the methods chosen are based on long study of the problem and on preliminary promising results.
Therefore, we think it highly likely that research along these lines will yield discoveries that will
be used by the state-of-the-art systems of the future.
2.3 Technical Essence
It has often been difficult to incorporate novel front ends into traditional speech recognition
systems. We believe that a significant reason for this problem has been that the implicit
assumptions built into our common statistical models have made them optimal for the traditional
short-time front end. More generally, in order to optimally incorporate novel approaches to the
acoustic signal processing done in an ASR front end, we must develop new statistical structures
and methods that can handle multiple time scales and multiple feature streams. Our team's
experience and intuition point us towards alternatives that are characterized by several key
properties:
• Multiple coupled HMMs will be used to integrate features from analysis over differing
temporal extent, which implies an event-driven rather than segmental view of the speech
process.
• We will investigate models for fixed rate, variable rate and multirate feature streams.
• So-called partial information or missing data formalisms will be used to handle low-
confidence regions in time and frequency.
• A variety of statistical formulations that describe possible combinations of (or coupling
between) candidate streams will be tested, including Markov and hidden Markov
dependencies.
• New statistical modeling frameworks will be developed for characterizing speech events,
which may have acoustic cues of varying time spans, rather than the standard approach of
characterizing subword units bounded by specific start and end times.
2.4 Technical Approach
There is currently much interest in new types of front ends, particularly using different scales of
temporal processing. Evidence from data-driven learning indicates that there is much potential
utility of features computed using relatively long analysis windows (>100 ms). However, hidden
Markov Models (HMMs) per se are not well suited to these features for a couple of reasons.
First, the HMM and frame-based cepstra have co-evolved as ASR system components, and
hence are very much tuned to each other; this is one reason why progress with novel approaches
has been so difficult. In particular, systems incorporating new signal processing methods in the
front end are at a disadvantage when tested using standard HMMs. Secondly, the standard way
to use longer temporal scales with an HMM is to simply use a large analysis window and a small
(e.g., 10 ms) frame step, so that the frame rate is the same as for the small analysis window. The
problem with this approach is that successive features at the slow time scale are even more
correlated than are those at the fast time scale, so there will be a need for "tweak factors" at
every time scale. These points suggest that we should consider changing the statistical model.
One approach that has been proposed is to add feature dependencies or explicitly model the
dynamics of frame-based features in various extensions of HMMs. While we have ourselves
made contributions in this area, we now believe that a very different approach is needed that
relaxes the frame-based processing constraint. We propose instead to focus on the problem of
multistream and multirate process modeling for two main reasons. First, the use of multiple
streams provides robustness to corruption of individual streams. Second, the use of multiple
streams introduces more flexibility in characterizing speech at different time and frequency scales,
which we hypothesize will be useful for both noise robustness and characterizing the variability
observed in conversational speech.
Partial information recognition
One of the greatest weaknesses of current speech technology is the tacit assumption that the
signal consists exclusively of the single voice of interest, or, failing that, the attempt to normalize
the problem so as to approximate this condition. The strength, however, of a recognition
scheme based on multiple alternative information streams is that individual streams can become
unreliable or unavailable without compromising the overall classification — provided their
unreliability is correctly detected. Thus, detection and labeling of non-target information is
central at every level of this approach. To address a classification problem in which a certain,
dynamically-varying portion of the information is unavailable, we will use the optimal tools of
the missing data formalism, recently developed specifically to address situations in which high
levels of nonstationary noise can temporarily obscure any part of the signal. This approach has
been shown to be extremely successful in small-vocabulary, high-noise conditions such as the
Aurora task [Barker, Green & Cooke 2001]. Classification that uses acoustic information only
when it is informative, and backs off to contextual inference for other dimensions, offers the best
promise of rising to human levels of recognition in the phone-detection task.
Consider, in particular, the effect of acoustical interference such as noise and reverberation on a
particular stream. While onsets and strong formant peaks will in general be relatively unaffected,
the spectral valleys between formants and the periods of energy decay are likely to be affected
much more seriously. As a simplification, we could say that some aspects of the original speech
spectrum are observable, while others have been masked by the interfering energy and are thus
hidden. Bayesian decision theory has little difficulty accommodating this situation: the overall
posterior probability of a particular subword class $q$ given the present evidence $X_p$ is simply
the expected value of the posterior over all possible values of the missing data $X_m$, i.e.:

$$p(q \mid X_p) = \int p(q \mid X_p, X_m)\, p(X_m \mid X_p)\, dX_m$$

The case of a diagonal-covariance Gaussian model is particularly simple because each data
dimension is conditionally independent, so integrating over the missing data dimensions reduces
to evaluating the Gaussian only for the available data dimensions. Thus for a mixture of
diagonal-covariance Gaussians, the most common distribution model used in speech recognizers, the
data likelihood needed to compare models in the HMM decoding algorithm can be calculated as
[Cooke et al. 2001]:

$$p(X_p \mid q) = \sum_k p(k \mid q)\, p(X_p \mid k, q) \int p(X_m \mid k, q)\, dX_m$$

where $p(k \mid q)$ is the static mixture weight for component $k$ within state $q$, and the integral
over $X_m$ will, in the absence of any information on the missing data, evaluate to unity and disappear.
The principal difficulty, of course, lies in detecting which values have been corrupted and which
are reliable, i.e., in dividing each observation vector into the present and missing subsets $X_p$ and
$X_m$. In general, this information will not be directly available (since that would imply knowing a
priori the separate spectra of target and noise), and it must be inferred somehow from the signal
content, either by using CASA-like processes to organize the input, by comparing the
consequences of different possible labelings in terms of the resulting fit to speech models, or by a
combination of both these methods.
Statistical models for multistream combination:
Our team and others have found that incorporating multiple feature streams can improve
performance [Janin et al. 1999, Hermansky et al. 1999, Singh et al. 2001]. While
approaches that combine systems at the word level (e.g. ROVER [Fiscus 1997]) can be very
useful, their utility greatly diminishes for systems that are not very good individually. However,
when feature streams are combined at a lower level, we have often found that systems that are
individually rather poor can behave quite well collectively, and can also show robust behavior
when properly chosen. Therefore, we will be working to develop new approaches for the
combination of such streams, in coordination with the corresponding activity in Task 1.
Multiband speech recognition is an example of an approach in which individual streams (based on
narrow bands of the spectrum) have relatively poor performance, while the complete system has
often demonstrated good performance, particularly for robustness to narrow-band noise
[Bourlard et al. 1996, Bourlard & Dupont 1996]. However, some limitations of the approach are
evident. In particular, the assumption of independence between frequency bands could be the
cause of reduced performance in the case of clean speech and wide-band noise.
To overcome the above problems, while nicely reconciling missing data and multiband
approaches, IDIAP recently proposed a new set of "full combination" rules which integrate
acoustic models trained on all possible combinations of subbands, preserving correlation
information and leading to higher performance in all noise conditions. More specifically, the
HMM emission (posterior) probability $p(q \mid X)$ is estimated by integrating over all possible
subband combinations:

$$p(q \mid X) = \sum_{p=1}^{B} p(q, b_p \mid X) = \sum_{p=1}^{B} p(q \mid X_p)\, p(b_p \mid X)$$

where $B$ is the number of possible subband combinations ($B = 2^K$, if $K$ is the number of
subbands), $X_p$, as above, represents the present (reliable) subset, and $p(b_p \mid X)$ the
probability that subset $b_p$ is reliable.
Compared with the missing data approach, this method is equivalent to treating the pointer to
the missing subsets as an additional latent variable used during HMM training/decoding. It was
recently evaluated, and showed significant improvements in all kinds of stationary and
nonstationary, narrow- and wide-band noise conditions [Hagen et al. 2000, Hagen 2001].
Although excellent results were already achieved with uniform reliability measures $p(b_p \mid X)$,
and no adaptation at all to the signal or noise conditions, further improvements could also be
achieved by doing online and unsupervised maximum likelihood adaptation of those reliability
weights. For the longer term goals of this project we will extend the full-combination approach
to the multistream problem in general, not only to frequency subband streams.
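To make the link between the missing-data likelihood and the full-combination rule concrete, the following minimal Python sketch computes class posteriors for a single frame. It is an illustration under stated assumptions, not the IDIAP implementation: each subband is a single feature dimension, each class is a diagonal-covariance Gaussian mixture, and the per-subset posteriors are obtained by marginalizing that one model over the absent dimensions rather than by training a separate model for every subband combination. All function and variable names are hypothetical.

```python
import itertools
import numpy as np

def marginal_log_likelihood(x, present, weights, means, variances):
    """log p(X_p | q) for a diagonal-covariance GMM, using only present dimensions.

    weights: (n_mix,), means and variances: (n_mix, K), present: boolean (K,).
    """
    idx = np.flatnonzero(present)
    if idx.size == 0:
        return 0.0  # no evidence: the integral over all dimensions is 1
    diff = x[idx] - means[:, idx]
    var = variances[:, idx]
    log_comp = -0.5 * np.sum(diff ** 2 / var + np.log(2.0 * np.pi * var), axis=1)
    return float(np.logaddexp.reduce(np.log(weights) + log_comp))

def subset_posterior(x, present, models, priors):
    """p(q | X_p) over all classes, by Bayes' rule on the marginal likelihoods."""
    log_joint = np.array(
        [marginal_log_likelihood(x, present, *m) for m in models]
    ) + np.log(priors)
    return np.exp(log_joint - np.logaddexp.reduce(log_joint))

def full_combination_posterior(x, models, priors, reliability=None):
    """p(q | X) = sum over the B = 2^K subsets b_p of p(q | X_p) p(b_p | X)."""
    K = x.shape[0]
    subsets = list(itertools.product([False, True], repeat=K))
    if reliability is None:
        # Uniform reliability weights p(b_p | X) (assumption of this sketch).
        reliability = np.full(len(subsets), 1.0 / len(subsets))
    post = np.zeros(len(models))
    for w, subset in zip(reliability, subsets):
        post += w * subset_posterior(x, np.array(subset), models, priors)
    return post
```

Each model is passed as a `(weights, means, variances)` tuple. Because the number of subsets grows as 2^K, a sketch like this is only practical for a small number of subbands.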
We will also investigate the use of a frequency-based HMM to compute emission probabilities,
yielding a model called HMM2 (a mixture of HMMs). HMM emission probabilities are
typically modeled through Gaussian mixtures or artificial neural networks. Also, in the
multiband-based recognizers discussed above, we have to decide a priori the number and position of the
subbands being considered. In the HMM2 formalism, introduced in [Bourlard et al. 2000, Bengio
et al. 2000], the emission probabilities of the HMM (now referred to as "temporal HMMs") are
estimated through a secondary, state-dependent HMM (referred to as "feature HMMs")
working along the feature vector. This model will then allow for dynamic (time and state
dependent) subband (frequency) segmentation, integration of all possible frequency subband
combinations, as well as optimal recombination according to a standard maximum likelihood
criterion (although other criteria used in standard HMMs could also be used). This approach may
also be viewed as performing a kind of nonlinear vocal tract normalization.
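The HMM2 idea of a feature HMM "working along the feature vector" can be sketched as a forward pass over the components of one frame. The sketch below is a simplified illustration, assuming one scalar Gaussian per feature-HMM state and a maximum likelihood score; the function names and parameter layout are hypothetical and are not taken from the cited HMM2 papers.

```python
import numpy as np

def log_gauss(x, mean, var):
    """Log density of a scalar Gaussian, broadcast over states."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def feature_hmm_log_likelihood(vec, log_init, log_trans, means, variances):
    """Emission score of one temporal-HMM state for one feature vector.

    The forward algorithm runs along the components of `vec` (e.g. ordered by
    frequency) instead of along time.
    log_init: (S,), log_trans: (S, S), means and variances: (S,).
    """
    alpha = log_init + log_gauss(vec[0], means, variances)
    for x in vec[1:]:
        # alpha_new[j] = logsum_i(alpha[i] + log_trans[i, j]) + log b_j(x)
        alpha = np.logaddexp.reduce(alpha[:, None] + log_trans, axis=0)
        alpha = alpha + log_gauss(x, means, variances)
    return float(np.logaddexp.reduce(alpha))

# Hypothetical two-state, left-to-right feature HMM: state 0 models the
# lower components (mean 0), state 1 the upper components (mean 3).
with np.errstate(divide="ignore"):
    example_log_init = np.log(np.array([1.0, 0.0]))
    example_log_trans = np.log(np.array([[0.7, 0.3], [0.0, 1.0]]))
example_means = np.array([0.0, 3.0])
example_vars = np.array([1.0, 1.0])
example_vec = np.array([0.1, -0.2, 2.9, 3.2])
example_score = feature_hmm_log_likelihood(
    example_vec, example_log_init, example_log_trans, example_means, example_vars
)
```

In a full system, one such feature HMM would be attached to each temporal-HMM state, and the point at which the feature HMM changes state acts as a data-dependent subband boundary.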
Multirate recognition
The idea of multistream modeling can be extended to the case where the feature processes in the
different streams involve different time scales. Here, we look at two different mathematical
frameworks for handling such observations: one assumes that all time scales are always observed,
and the other that a subset of the time scales are observable (as for signal-dependent signal
processing) according to a switching model.
For the fully observable case, UW has developed a multirate model that is essentially a coupled
set of HMMs, where each HMM operates on a different feature stream with dependence of
higher rate state transitions and/or distributions on the lower rate state index. The multirate model
has some aspects in common with factorial HMMs [Ghahramani & Jordan 1997, Saul & Jordan
1999, Logan & Moreno 1998, Nock & Young 2000] and the multistream models previously
discussed. However, it differs because the feature time series are extracted at different rates, and
also because of the explicit coupling across scales with conditional distributions. Multirate
features have previously been investigated for speech, but only with weak coupling between the
streams via score combination techniques [Dupont & Bourlard, 1997]. Here we propose to look
at Markov coupling between the states and the conditional Gaussian dependencies across
streams. It may also be possible to extend the ideas of HMM2, described previously, to the
multirate framework.
The multirate model was used at UW to improve classification performance in an application of
estimating the wear on the edge of a machine milling tool [Fish 2001]. The tool wear-level
estimation task required modeling at two different rates for gradual trends and chipping events.
Related multiresolution models have also been used with success in image classification
applications [Li et al. 1999]. While application to speech recognition problems will require more
sophisticated parameter tying and decoding structures than in these other applications, we expect
this approach to revolutionize speech processing because of the broader spectrum of acoustical
processing methods that can be incorporated in a statistically meaningful way. In particular, the
framework provides a way to integrate features computed over phone- and syllable-sized time
scales with standard short-time features that have already been shown to be useful. In addition to
evidence that syllable-size features would be useful for English, we anticipate that it will provide
a much better framework for integrating intonation features, as is critical in tone languages.
While the basic theoretical infrastructure and baseline software are in place, the multirate
framework is not quite ready for application to speech recognition. At a practical level, the
implementation is in MATLAB, which is extremely slow and impractical for use with large
speech corpora. Hence, some software development is needed. Also, the model is currently only
implemented for two streams and the team intends to incorporate more. More importantly, there
are a number of algorithmic extensions that will be required to maximally benefit from this
approach. Three key problems that we propose to address include characterization of
asynchronous streams, discriminative learning, and parameter tying.
The toolwear model was based on a paradigm incorporating a fixed number of high-rate features
(and state transitions) for each low-rate feature; i.e., the low-rate state would change after N
high-rate state transitions, where N was fixed for each coupled model. Such a fixed structure does
not make sense for speech, since many phone instances will be shorter in duration than the frame
interval of our longest time features. What is needed is an event-driven view of the feature modeling
process, where event timing but not extent is critical -- that is, observations center at some event
time instant rather than span some region of time with definitive start and end times. In early
stages of the project, we will use signal-dependent processing that specifies feature "timing",
such as keying syllable level features to energy maxima (likely vowel nucleus) or using
mechanisms motivated by the cosine basis tree. Later, we will investigate joint design of the
window placement function and recognition model parameters. It is important to note that this
approach is not the same as the event-based models of ASR that have been proposed in acoustic
phonetics [Bitar & Espy-Wilson, 1996; Niyogi et al., 1998; Lahiri, 1999], because we are
explicitly avoiding intermediate hard decisions in decoding and use statistical discrete-event
process models rather than simply treating the event posteriors as frame-based HMM features.
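For reference, the fixed-ratio paradigm of the toolwear model described above (one low-rate state transition after every N high-rate transitions) can be sketched as a forward pass over the joint state of two coupled chains. This is a minimal illustration under stated assumptions, not the UW implementation: high-rate transitions depend on the current low-rate state, high-rate emissions depend only on the high-rate state, and observation log-likelihoods are precomputed. All names are hypothetical.

```python
import numpy as np

def multirate_forward(ll_fast, ll_slow, N, log_pi_s, log_pi_f, log_A_s, log_A_f):
    """Log-likelihood of a two-rate coupled HMM with a fixed rate ratio N.

    ll_fast:  (T, F) log-likelihood of each high-rate frame under each fast state
    ll_slow:  (ceil(T / N), S) log-likelihood of each low-rate frame under each slow state
    log_pi_s: (S,) initial slow-state log-probabilities
    log_pi_f: (S, F) initial fast-state log-probabilities given the slow state
    log_A_s:  (S, S) slow-state log-transition matrix, indexed [from, to]
    log_A_f:  (S, F, F) fast-state log-transitions given the current slow state
    """
    T = ll_fast.shape[0]
    # alpha[s, f]: joint log-probability of the observations so far and state (s, f).
    alpha = log_pi_s[:, None] + log_pi_f + ll_slow[0][:, None] + ll_fast[0][None, :]
    for t in range(1, T):
        if t % N == 0:
            # The slow chain moves once every N fast frames and emits a low-rate frame.
            alpha = np.logaddexp.reduce(alpha[:, None, :] + log_A_s[:, :, None], axis=0)
            alpha = alpha + ll_slow[t // N][:, None]
        # The fast chain moves every frame, conditioned on the current slow state.
        alpha = np.logaddexp.reduce(alpha[:, :, None] + log_A_f, axis=1)
        alpha = alpha + ll_fast[t][None, :]
    return float(np.logaddexp.reduce(alpha.ravel()))
```

The event-driven view proposed here would replace the fixed `t % N == 0` test with a signal-dependent timing decision, which is exactly the part that this fixed-ratio sketch cannot express.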
In parallel to the effort on event-based modeling, we will investigate solutions to the standard
ASR problems of discriminative learning and parameter tying. First, we propose to look at
discriminative learning of dependence across streams, particularly with features. An efficient
algorithm has been developed theoretically, but it is not yet implemented. In addition, we
propose to develop methods for determining what features should be used in the different
streams. This can be thought of as an extension of linear discriminant transformations based on a
new estimation technique. Since the ultimate goal is large vocabulary speech recognition, we need
to develop methods for automatic structure learning, including state tying but also stream
coupling. At a minimum, we can train a system with independent stream HMMs and use
standard HMM state clustering techniques, and then force the coupled HMM to have the same
tying structure. However, we hope to do more by leveraging previous work at UW by Nock on
state tying for factorial HMMs [Nock 2001].
In related work at SRI, we have proposed a multiresolution HMM that will consistently model
subsets of observable (or, partially observable) variables from a multirate process, obtained from
a signal-adaptive front end. We assume that the time scales at which phonetic cues occur are not
absolute and may change from one realization to another. To address this issue, multiresolution
observation HMMs (MRHMM) will be introduced. The observable subset of multirate features
is indicated by a hidden variable corresponding to the multirate state index; hence, the MRHMM
can be thought of as switching between resolution groups. The switching transition probabilities
of the HMM are computed directly by the local cosine basis tree generation algorithm and are
determined by the size of the drop in entropy at every branching of the tree. The resulting HMM
structure will be able to localize around transients in time by going to higher scales, and to
localize around transients in frequency by going to lower scales.
In [Sonmez et al. 2000], we have demonstrated a set of phonetically-motivated acoustic features
that discriminate a preliminary test set of highly ambiguous voiceless stops in CV contexts. The
features were automatically computed from data that had been hand-marked for consonant burst
location and voicing onset (extension to automatic marking is also proposed). Two corpora were
processed using a parallel set of features: conversational speech over the telephone
(Switchboard), and a corpus of carefully elicited speech. The latter provides an upper bound on
discrimination, and allows for comparison of feature usage across speaking style. We explored
data-driven approaches to obtaining variable-length time-localized features compatible with an
HMM statistical framework. We also suggested techniques for extension to automatic annotation
of burst location, for computation of features at such points, and for augmentation of an HMM
system with the added information.
The main idea of the switching multiresolution framework proposed in this task is to capture the
spectra at each time instant at a time-frequency resolution warranted by the instantaneous
stationarity. HMMs produce outputs uniformly in time, i.e. one feature vector per frame, the
unit time window. The way to synchronize the output of our cosine packet-segmented front-end
is to define the shortest interval of time (maximum time resolution, minimum frequency
resolution) as the effective frame. This way short duration phones such as stops can be localized
precisely without contamination from neighboring vowels and stationary sections can benefit
from a longer time window.
2.5 Evaluation Methodology
Evaluation support for the statistical modeling research task (task 2) will be conducted in
accordance with the same evaluation principles and guidelines that were outlined in section H1.5
for the signal processing research task (task 1). Especially for the case of the statistical modeling
task, evaluation will serve as a tool to better understand the space of (statistically abstracted)
phonetic/syllabic/other representations. Armed with a better understanding (i.e., a more accurate
statistical characterization), it is virtually certain that our definition of the representational space
will change. Furthermore, this change may well be in unit concept as well as unit inventory.
Thus it is very important that our evaluation tools maintain generality and flexibility, to adapt to
changing research needs, as described in section H1.5.
2.6 Resources Required
The research proposed here will be solely based on available speech data and annotations. LDC
Corpora (Switchboard, Call Home, Broadcast News), in addition to ICSI and UW Meeting data,
will be used for training and test sets for large vocabulary work. We will also use ICSI's
phonetically labeled Switchboard data, and a variety of small vocabulary corpora available at ICSI
(e.g., Aurora) for quick turnaround development work.
2.7 Work Plan
While the statistical modeling effort is aimed at providing a framework for more appropriate
evaluation of the new features developed in Task 1, there are already innovative multistream
front ends (e.g., TRAP-based) that will be useful for initial development of the methods
proposed here for modeling partial information and new approaches to multistream combination.
The extension of these techniques to multirate processing is straightforward, so these plus the
signal-dependent analysis already developed by SRI will be useful for the initial work on
multirate modeling. Initial modeling work will be based on a small vocabulary task (e.g., Aurora),
in part to keep experiment turnaround time fast but also because the parameter tying
infrastructure needed to address large vocabulary recognition will require some research.
After the first year, guided by error analyses in performance evaluation studies, we will begin
assessing the maturing modeling techniques on the new features developed in Task 1. In addition,
once we observe encouraging results for a particular approach on a small task, we will migrate to
a task that will be compatible with the Rich Transcription effort at SRI/ICSI/UW, with the goal of
eventually transferring technology to that effort. We anticipate that much of the work on more
complex tasks will use the framework of lattice rescoring as opposed to the much larger software
development cost of building (or even modifying) a large vocabulary decoder compatible with the
new methods.
As described earlier, we will keep the effort focused on acoustic modeling by fixing the
vocabulary and language model in all rescoring efforts, and by conducting initial developments on
tasks, like Aurora, where the language model is not a critical factor in the system performance.
Throughout the project, we will leverage the results from each of the different modeling
approaches to help improve the others. Initially, this will be primarily in terms of sharing
acoustic segmentations, comparing performance on the best feature sets, etc. In the final two
years, we will work on merging some of the different modeling approaches more explicitly, as
well as transferring the best technology to the SRI Rich Transcription system.
2.8 Milestones and Schedule
2.9 Cost Breakdown
Task 2: Statistical Modeling

Category                Year 1    Year 2    Year 3    Year 4    Year 5       Total
Labor                  111,231   115,772   120,541   125,548   130,805     603,897
Benefits                13,901    14,595    15,324    16,091    16,895      76,806
labor total:           125,132   130,367   135,865   141,639   147,700     680,703
travel                  10,500     8,250     8,250     8,250     8,250      43,500
ODC: tuition             6,450     6,450     6,450     6,450     6,450      32,250
ODC: equipment           6,510         0         0         0         0       6,510
ODC total:              12,960     6,450     6,450     6,450     6,450      38,760
subcontracts:
  Columbia             100,000   100,000   100,001   100,000   100,001     500,002
  IDIAP                 60,200    62,960    65,720    67,100    62,960     318,940
  SRI                  179,099   158,367   163,449   168,677   174,274     843,866
  UW                   190,055   196,293   203,252   210,526   218,129   1,018,255
subcontract total:     339,299   321,327   329,170   335,777   337,235   1,662,808
overhead               137,702    96,339   100,160   104,173   108,385     546,759
Total Cost             815,648   759,026   783,147   806,815   826,149   3,990,785

Task 2 Roll-up:
Labor                    603,897
Benefits                  76,806
travel                    43,500
ODC: tuition              32,250
ODC: equipment             6,510
subcontracts           1,662,808
overhead                 546,759
Total Cost             3,990,785
I. Language Preferences
The Novel Approaches team will take advantage of the multilingual efforts being developed at
SRI for other projects, including the Rich Transcription effort. Consequently, the offeror prefers
Egyptian Arabic and Mandarin Chinese, both because SRI has chosen them and because of the large
amount of data available in these dialects compared to most alternatives.
J. Resources Required
The research proposed here will be solely based on available speech data and annotations. LDC
Corpora (Switchboard, Call Home, Broadcast News), in addition to ICSI and UW Meeting data,
will be used for training and test sets for large vocabulary work. We will also use ICSI's
phonetically labeled Switchboard data, and a variety of small vocabulary corpora available at ICSI
(e.g., Aurora) for quick turnaround development work.
K. Resources Offered
The ICSI/OGI/SRI/UW/Columbia/IDIAP team has always held a strong commitment to the
sharing of resources with the larger speech community, and in the past has made available such
resources as: SRI’s LM toolkit and Switchboard prosodic feature database; and ICSI’s
phonetically transcribed Switchboard data, and their front end (RASTA) and neural network
software; and a number of UW’s HTK-compatible recognition modules. Additionally, the jointly
developed Meeting data are in preparation for release at this time. We intend to continue this
commitment to resource sharing under this program.
We anticipate that the main contributions will be in terms of software modules for front-end
analysis, acoustic modeling and/or error analysis. Since the research proposed here is of a high
risk nature, it is impossible at this time to predict which aspects will be the most successful and
therefore worth making publicly available. However, our past record should provide a strong
sense of our intention.
As is always the case when releasing new resources to the community, the release schedule must
depend on the readiness of the materials for public use and on our ability to provide appropriate
support. We anticipate a graduated release process, first circulating and exercising the materials
among the sites composing our team, then to our affiliated Rich Transcription team, and
ultimately to other teams and to the community at large. Materials will be made available for
research purposes only.
L. Cost Sharing Offered
This proposal offers no cost sharing.
M. Deliverables
Required Quarterly Status Reports and Annual Project Summary Reports - ICSI will
deliver both the DARPA/ITO Quarterly Status Reports and the Annual Project Summary
Reports as required in the solicitation and as scheduled in Section H for each Task. These
reports will contain the information required by the solicitation and will be electronically
submitted via the DARPA/ITO Technical-Financial Information Management System (T-FIMS),
at the government-furnished Uniform Resource Locator (URL) on the World Wide Web (WWW).
Additional Deliverables — In addition to the required reports, ICSI will deliver, subject to the
agreement of all parties, computer software and/or technical data developed under this contract.
The exact nature of these deliveries depends on agreements with DARPA regarding the specifics
of yearly R&D program elements and, as such, cannot be uniquely described at this time.
Proprietary Claims — The members of the Starting Over Novel Approaches research team do
not intend to limit the release of results, reports, or presentations made as a result of this
program, which have been documented as the only deliverables of the effort. This does not
preclude ICSI or its teammates from sharing proprietary information with the Government if it is
deemed to be mutually beneficial.
When the team participates in the performance of the Contract, each team member agrees to grant
to the other team members, through the terms of a subcontract between the parties, a royalty-free
limited license, without right to sublicense, to use in performance of the prime contract and
subcontract the intellectual property solely developed by a team member during
performance of the prime contract and subcontract ("Program Intellectual Property") to the
extent necessary for a team member to perform the prime contract or subcontract as the situation
dictates. Each team member shall ensure that its use of another team member's Program
Intellectual Property shall not compromise the confidentiality or proprietary nature of said
intellectual property.
Each team member shall retain sole title to Program Intellectual Property created solely by its
employees and consultants, and Program Intellectual Property created by employees or
consultants of more than one team member shall be jointly owned by such team members. A
joint owner will not have to account to the other owner(s) for any income or other consideration
received from its exploitation of the jointly owned intellectual property unless provided for in a
separate written agreement entered into by the joint owners.
Any subcontract awarded to a team member by ICSI will contain the same intellectual property
terms and conditions as are contained in the prime contract. Such subcontract will also contain
the above terms set forth in this section.
Identification and Assertion of Restrictions on the Government's Use, Release, or Disclosure of Technical Data or Computer Software
The team's technical approach to the development of the contract deliverables is partially
based on programs that were developed with mixed funding and will be delivered with
Government Purpose Rights if requested by DARPA or program stakeholders. SRI's Decipher(TM)
software is one of these programs.
N. Exceptions
(1) We note in the BAA that there is no guidance for including a bibliography of sources and
references cited in the proposal. Based on verbal guidance from the DARPA EARS PM, we have
included a single consolidated bibliography as an Appendix to our proposal.
(2) The involvement of Professors Morgan and Ostendorf represents an exception to the policy
outlined in the EARS BAA, which was that:
If a site works on both Rich Transcription and Novel Approaches, it
should use different individuals in each to insure that its Novel
Approaches work is not shortchanged in order to maximize results on
the NIST-administered evaluations for Rich Transcription.
in the sense that they each will play roles in both this proposal and in the Rich
Transcription effort proposed by the SRI/ICSI/UW team. However, we argue
that this will not negatively affect the Novel Approaches effort in either case. For Professor
Ostendorf, her primary and major role will be in the Rich Transcription Proposal. For this Novel
Approaches proposal, her graduate student is very advanced and independent, so Professor
Ostendorf can take a minor role in this effort. Only 2 weeks per year of her time is budgeted
for student supervision associated with the Novel Approaches work. For Professor Morgan, his
primary time commitment is to the Novel Approaches proposal, with only 2 weeks per year of
his time budgeted for meetings and consulting for the Rich Transcription work. However, no
other ICSI or UW staff or students will be on both proposals, so that there will be no mechanism
for the involvement of Novel Approaches personnel in the Rich Transcription project. Thus,
while this is an exception to the letter of the BAA requirements, we believe that we are
complying with the spirit of the requirement. In addition, the connection to the SRI/UW/ICSI
effort will provide a good avenue for transferring the successful technology developed in this
work to a Rich Transcription system in the later years of the Novel Approaches project.
O. Management Plan
Based on our extensive experience with domestic and international collaborations for speech
research, and in particular on our previous work together in various combinations, the
ICSI/SRI/UW/OGI/Columbia/IDIAP team has developed a comprehensive management plan.
The key characteristics of this plan are given below.
Team Structure — The PI for the proposal is Nelson Morgan, who has supervised numerous
domestic and international collaborations at ICSI. He will be the point person for the DARPA
PM, and is also the Director of ICSI so that there are no other management layers for the
contractor organization. He will of course be responsive to key researchers in the project, and in
particular will work closely with a Management Committee consisting of Hynek Hermansky,
Dan Ellis, and George Doddington. Note that Professors Hermansky and Ellis are also co-PIs for
the two Tasks in this proposal. The committee will share responsibility for coordinating efforts
between the two tasks and between research efforts and DARPA requests. Within each task, the
PI and co-PI will coordinate communication between key personnel across sites. The ICSI
contracts staff will also report to Professor Morgan, and will handle the subcontracts with SRI,
UW, OGI, Columbia, and IDIAP.
Technical Structure — The execution of the technical program is the responsibility of two
teams, corresponding to each of the major research tasks: Signal Processing and Statistical
Modeling. Professor Morgan is the PI of each, but the co-PIs (Hermansky and Ellis) will have
significant roles in task management. Each Task team is composed of the personnel detailed in
the respective task descriptions provided in Section H and will, with the aid of this management,
plan, execute, and report on the task-related program elements. These two teams coordinate
through the common PI, through the Management Committee, and more generally through key
personnel who work on both tasks. Incorporating substantial input from key personnel, the
project PI will guide the research, development, and implementation of the EARS prototypes,
ensuring linkage of vision, implementation, and execution.
Team Leadership — To address the technology development and implementation challenges
cited in previous sections of this proposal, we offer a carefully selected group of outstanding key
individuals described in Section P. The key personnel team members have the academic
background, record of past accomplishment, and substantive experience necessary to meet the
EARS Novel Approaches challenges. We will dedicate all key personnel to the project as long as
their skills are required. The PI, Nelson Morgan, will be assigned to this program for its entirety.
Alterations in key personnel assignments will not be made before securing permission from the
government.
Government Coordination - Communication with the Government EARS Team members is
essential to the success of the program and is an area to which the PI and Co-PIs will pay special
attention. Our Team expects Government participation and encourages direct communication
with the staff that comprises our technical organization. We will actively participate in an
overarching EARS program management structure to facilitate program-wide coordination of
research and evaluation. The PI and Co-PIs will be available to meet with DARPA at its
convenience to assist in briefing program progress and specifics to stakeholders and agency
leadership or to address and resolve issues related to achieving program objectives. Finally, we
will promptly deliver Quarterly Status Reports as well as Annual Project Summary Reports that
describe technical plans versus performance, results, and findings.
Internal Communications - Communication is essential to the successful functioning of the
team. In the large multisite collaborations that we have been in, personnel exchange has been an
invaluable tool. Therefore, we will plan frequent visits between participating sites, particularly so
that the most active researchers will learn to work with one another. Naturally, the Management
Committee will also schedule and conduct regular meetings to review progress and define plans
and actions required to meet DARPA objectives.
Our technology approach to facilitating communication has four components. (1) Schedule
regular opportunities for face-to-face meetings, including an annual team meeting, smaller group
discussions (generally coordinated with other conferences and workshops) and training
workshops (e.g., on Decipher) early in the contract period. (2) Establish and maintain a secure
Web site that contains information on meetings, reviews, experiments, and other programmatic
data to provide continuous availability of critical data to team members. This will allow us to
synchronize all the people, documents, software, and information required to manage our effort
on a single hub. (3) Make a subset of the materials on this site available on an independent open
site as approved by the Management Committee and coordinated with DARPA. (4) Use
commercial tools such as email, videoconferencing, and teleconferencing to reduce dependency on
travel and encourage team interchange and communication.
P. Personnel Qualifications
Nelson Morgan, PI: Nelson Morgan received his B.S., M.S., and Ph.D. degrees in 1977,
1979, and 1980 respectively, all in electrical engineering from UC Berkeley. He has been
working in speech processing since 1980, and prior to that was a practicing audio engineer.
He led a speech research effort at National Semiconductor starting in 1980, worked for several
years on basic pattern recognition algorithms at the EEG Systems Lab in the mid-80s, and has
led the speech research effort at ICSI since 1988; he currently is also Director of that
institution. He is also a Professor-in-residence in the Electrical Engineering Department of UC
Berkeley, has over 150 publications including 3 books (the most recent being a graduate text
in speech and audio processing), and has graduated 10 PhDs, 3 MSs, and numerous
postdoctoral fellows. With Herve Bourlard, he was an originator of the hybrid system
approach to speech recognition (neural networks used probabilistically with HMMs), and
with Hynek Hermansky was the co-inventor of RASTA and several of its alternate forms. He
is the holder of a number of patents in speech processing methods. He is on the Board of
Directors of the Applied Voice Input-Output Society (AVIOS). He is the former co-Editor-
in-Chief of Speech Communication. He was formerly on the Neural Network Technical
Committee of the Signal Processing Society, and currently is on the Speech Technical
Committee of the same society. He is a Fellow of the IEEE. Professor Morgan will devote
25% of his time to the proposed project, and will split his time equally between the two
Tasks.
Hynek Hermansky, co-PI: Hynek Hermansky received his Ph.D. from the University of
Tokyo, Japan, in 1983, and his M.S. from Brno University of Technology, Czech Republic,
in 1972, both in electrical engineering. He has worked on speech processing for 30 years. He
was Research Fellow and Assistant Professor at Brno University of Technology, Research
Scholar under 5-year Japanese Ministry of Education Fellowship at University of Tokyo,
Research Engineer at Panasonic Technologies in Santa Barbara, and Senior Member of
Technical Staff at U S WEST Advanced Technologies. He is Professor and Director of Center
for Information Technologies at the Department of Electrical and Computer Engineering at
OGI School of Science and Engineering of Oregon Health and Science University in Portland,
Oregon, and Senior Research Scientist at ICSI Berkeley. He is also a Member of the Board of
the International Speech Communication Association, Associate Editor and Member of the
Technical Committee on Speech Processing of IEEE, Member of the Editorial Board of
Speech Communication, and Visiting Faculty at a number of European Universities and
International Summer Schools. He has over 100 publications, has graduated 4 PhDs, and
holds 5 U.S. patents. His technical achievements include Perceptual Linear Prediction,
RASTA processing of speech (with Nelson Morgan), multistream techniques for
automatic recognition of speech (ASR) (with Misha Pavel, Herve Bourlard and Nelson
Morgan), Feature Neural Networks for HMM-based ASR (with Dan Ellis), and ASR from
temporal patterns (TRAPs). He is a Fellow of the IEEE, awarded for research and
development of perceptually based speech processing methods. Professor Hermansky will
devote 30% of his time to this project, and will work entirely on Task 1.
Daniel Ellis, co-PI: Dan Ellis has been an assistant professor in the Electrical Engineering
department of Columbia University since 2000, and is also a visiting senior research scientist
at the International Computer Science Institute, with whom he has been affiliated since 1996.
From 1989 to 1996, he was a research assistant in the Media Laboratory of the
Massachusetts Institute of Technology. He received S.M. (1992) and Ph.D. (1996) degrees
from MIT's Department of Electrical Engineering and Computer Science, where his
dissertation was on prediction-driven computational auditory scene analysis. He holds a
B.A. (Hons.) degree in Electrical and Information Sciences from Cambridge University,
awarded in 1987. Dr. Ellis's principal research interest is signal processing and perceptual
organization in the auditory system, and how this can be emulated and exploited in automatic
systems for recognition and organization of speech and audio. He has authored many
significant publications on these topics, as well as sound analysis software in use worldwide.
He is the co-ordinator of the AUDITORY email discussion list with an international
membership of over 700. Prof. Ellis expects to devote 1 summer month/year of his time split
equally between the two Tasks.
Hervé Bourlard, Professor: H. Bourlard received the Electrical and Computer Science
Engineering degree and the Ph.D. degree in Applied Sciences, both from Faculté
Polytechnique de Mons, Mons, Belgium. He was a member of the Scientific Staff at the
Philips Research Laboratory of Brussels (PRLB, Belgium), and an R&D Manager at L&H
Speech Products (BE); he is now Director of IDIAP (Dalle Molle Institute for Perceptual
Artificial Intelligence), a not-for-profit research institute, affiliated with the Swiss Federal
Institute of Technology at Lausanne (EPFL, CH) and the University of Geneva, involved in
speech and speaker recognition, computer vision and machine learning. He is also Professor at
EPFL and External Fellow of the International Computer Science Institute (ICSI) at Berkeley
(CA). His current interests mainly include statistical pattern classification, artificial neural
networks, and applied mathematics, with applications to time series processing, speech and
speaker recognition, and language modeling. He is the author/coauthor of over 140 reviewed
papers (including one IEEE paper award) and book chapters, as well as two books. H.
Bourlard is a member of the programme and/or scientific committee of numerous international
conferences, Editor-in-Chief of the "Speech Communication" journal (Elsevier), and Action
Editor of the "Neural Networks" journal. He is also a Member of the Admin Committee of
EURASIP (European Association for Signal Processing), a Member of the Advisory Council
of ISCA (International Speech Communication Association), Member of the IEEE Technical
Committee on Neural Network Signal Processing, and an appointed expert for the European
Commission. He is a Fellow of the IEEE "for contributions in the fields of statistical speech
recognition and neural networks", and a member of the Board of Trustees of the Swiss
Network for Innovation. Professor Bourlard will devote 1 month/year to this project, and will
work entirely on Task 2.
George Doddington: Dr. Doddington received a B.S. degree from the University of Florida
and M.S. and Ph.D. degrees from the University of Wisconsin, all in Electrical Engineering.
He then joined Texas Instruments (TI), where he worked for 20 years. While at TI he created
a corporate initiative in speech technology R&D, became chief of speech research, and was
elected a Senior Fellow of Texas Instruments. He was also responsible for numerous
advances in speech research and speech technology while at TI: he provided leadership
in the creation and sharing of common speech corpora and speech resources, he provided
leadership in the development of evaluation standards and evaluation-guided speech research,
and he was responsible for a number of firsts in speech research, resources, and technology
deployment, including the first formal evaluation of commercial automatic speech recognition
systems (in IEEE Spectrum, in 1980) and the first commercial deployment of speech
recognition and speaker verification over the telephone (for Sprint, in 1989). Dr. Doddington
then joined SRI, where he was employed for 10 years. During this period he was on
assignment to the US government, providing technical direction for government research
programs in human language technology: Among his contributions, he managed DARPA’s
Human Language Technology (HLT) program during 1992-1995. Since then he has provided
technical guidance for government programs in HLT. This includes translating program
objectives into formal R&D tasks, defining research resources, formulating performance
measures, and creating evaluation processes. Dr. Doddington expects to devote 15% of his
time to the proposed effort, with his time split evenly between the two Tasks.
Mari Ostendorf, Professor: Mari Ostendorf received her B.S., M.S., and Ph.D. degrees in
1980, 1981, and 1985, respectively, all in electrical engineering from Stanford University. She
has been working in speech processing for more than 15 years. In 1985, she joined the Speech
Signal Processing Group at BBN Laboratories, where she worked on low-rate coding and
acoustic modeling for continuous speech recognition. Two years later, Dr. Ostendorf went to
Boston University in the Department of Electrical and Computer Engineering where she built
and established a speech processing program. She joined the Electrical Engineering
Department at the University of Washington in 1999 as an Endowed Professor of System
Design Methodologies, and, together with two assistant professors, has established a large
speech research lab. Recently she was named the Associate Chair for Research in the EE
Department. Recent research contributions have been in the areas of: acoustic modeling for
spontaneous speech recognition, dependence modeling for adaptation, use of out-of-domain
data in language modeling, stochastic models of prosody for both recognition and synthesis,
and information extraction for speech. She has authored over 120 papers and has graduated 13
Ph.D. students and several MS students. Dr. Ostendorf has served on the Speech Processing
and the DSP Education Committees of the IEEE Signal Processing Society, as editor of the
journal Computer Speech and Language, and on numerous conference and workshop technical
committees. She is a Senior Member of the IEEE and a member of Sigma Xi. Professor
Ostendorf expects to devote 2 weeks per year of her time to the proposed effort, as she will
primarily be focused on the Rich Transcription proposal coming from SRI. For this proposal,
she will work entirely on Task 2.
Kemal Sonmez: Kemal Sonmez is a Senior Research Engineer in the Speech Technology and
Research (STAR) Lab at SRI International. He received his Ph.D. in Electrical Engineering
from the University of Maryland, College Park, while spending three summers as a visiting research
scientist with the speech group at Texas Instruments in Dallas. He joined SRI in 1996, where
he worked on and managed a number of Government-sponsored research efforts. In
particular, he was the PI on SRI’s speaker recognition effort in conversational telephone
speech using the Switchboard corpus, and on the ROAR acoustic modeling project. He was
especially involved in the development of signal adaptive approaches to frontend processing,
multiresolution modeling, modeling of prosody and robust processing of prosodic features.
Dr. Sonmez expects to devote 30% of his time to the proposed EARS Novel Approaches
effort, overseeing the work done at SRI, assisting with the work at UW, and working on both
Tasks 1 and 2.
Q. Team Capabilities
The proposing team comprises a rich set of individuals and institutions, with collective
experience both in large speech recognition projects and evaluations, and in a wide range of
research innovations. The prime contractor, ICSI, is the central site for a number of domestic and
international collaborations incorporated in this project, and is itself known as a center for
research in novel approaches. Key collaborations upon which this proposal builds include work
of Bourlard and Morgan on neural networks [Morgan & Bourlard 1995] and between Hermansky
and Morgan on robust acoustic strategies ("RASTA approaches") [Hermansky & Morgan 1994].
The trio’s philosophy of innovation in ASR was elaborated in [Bourlard et al. 1996]. In addition
to Hermansky’s OGI and Bourlard’s IDIAP, the team includes Columbia and UW, whose speech
laboratories have generated extensions to and advances beyond conventional speech modeling
approaches, and SRI, which contributes both a well-established baseline recognition system and a
history of speech innovations. Finally, Dr. George Doddington brings to the team his extensive
experience in the evaluation of speech processing systems.
ICSI: ICSI has been a center for innovation in speech processing for 13 years, benefiting from
interactions between research staff, international visitors, and U.C. Berkeley students and
faculty. In addition to the work with Bourlard and Hermansky, ICSI innovations have included:
multi-stream models [Janin et al. 1999]; work on speaking rate [Mirghafori et al. 1996, Morgan
& Fosler 1998, Fosler & Morgan 1999]; direct training of utterance posteriors [Konig et al.
1996]; stochastic perceptual models of speech [Morgan et al. 1995]; combining phone and
syllable information [Wu et al. 1997, 1998]; computational auditory scene analysis [Barker et al.
2000, Ellis 1999]; noise and reverberation robustness [Benitez et al. 2001, Kingsbury 1998, Shire
1999]; and the annotation and study of conversational speech based on the expert phonetic
labeling of a portion of the Switchboard Corpus [Greenberg et al. 1996]. While the goal of ICSI
speech research has been to develop novel approaches providing insight, rather than to achieve
the best error rates in the short term, ICSI has participated in DARPA evaluations, including
Resource Management, Wall Street Journal, and the 1998 Hub 4 evaluation (jointly with
Sheffield and Cambridge [Cook et al. 1999, Morgan et al. 1999]), with a 20% WER.
SRI: SRI has been developing large vocabulary recognition research systems for the past decade
in the context of various DARPA and DoD-funded programs. It continuously maintains an
LVCSR research system based on the Decipher recognizer and the SRILM language model
toolkit (the latter is freely available for research use). Many of the components of current state-
of-the-art LVCSR systems have been invented or co-invented by SRI researchers: bottom-up
state-clustered Gaussian mixture HMMs for acoustic modeling [Digalakis & Murveit 1994];
acoustic adaptation to speakers, channels, and environments using affine mean and variance
transforms [Digalakis et al. 1995, Sankar & Lee 1996] and combined transform-based and
Bayesian adaptation [Digalakis & Neumeyer 1996]; progressive search with lattice recognition
and N-best rescoring [Murveit et al. 1993]; minimum word error decoding and posterior
maximization in confusion networks [Stolcke et al. 1997, Mangu et al. 2000]. SRI last
participated in the Hub-4 evaluations in 1998, with a WER of 21%. The system later underwent
a major overhaul prior to the 2000 Hub-5 evaluation, resulting in a 24% relative WER reduction
on that task, and a 2001 Hub-5 eval performance of 29% WER. The Decipher system continues
to evolve and on the DARPA-sponsored 2001 SPINE evaluations achieved a WER of 27%, the
best performance of any submitted system.
OGI: The Anthropic Signal Processing Laboratory at OGI was established 9 years ago by Prof.
Hynek Hermansky. Hermansky’s early work on processing of corrupted speech introduced data-
guided speech analysis techniques, which later evolved into data-guided RASTA filters
[Hermansky & Morgan 1994] and data-guided spectral projections [Hermansky & Malayath
1998]. Multi-band recognition was introduced and extensively studied in collaboration with
Bourlard in Switzerland and Morgan at ICSI in 1995, and later evolved into the TRAP technique
[Hermansky & Sharma 1998], which provides robustness to many types of spectral distortions and also
addresses context dependency of phonemes. Nonlinear generalization of early data-guided
techniques in the form of Feature Nets (tandem approach) was introduced with Ellis of
ICSI/Columbia [Hermansky et al. 2000]. The group successfully participates in NIST Speaker ID
evaluations, in DARPA’s SPINE evaluations, and in telecommunication industry standards
(ETSI) activities, and collaborates most intensively with ICSI, CMU’s robust speech recognition
group, IIT Madras (India), CSLP at Johns Hopkins University, and CSLU group at OGI.
Columbia: The Laboratory for Recognition and Organization of Speech and Audio was
established within Columbia’s Electrical Engineering department by Professor Ellis when he
moved there in 2000. Combining the statistical pattern recognition and learning techniques of
speech recognition with a broader range of audio processing techniques drawn from audio
engineering and auditory modeling, the lab addresses information extraction from all kinds of
sound signals. In addition to development of the Tandem approach to speech acoustic modeling,
the best performing system in the Aurora Eurospeech Special Event in September 2001 [Ellis &
Reyes 2001], current projects in the lab span a range of topics from speech through to music
analysis and clustering, to alarm sound detection and classification.
IDIAP: IDIAP (Dalle Molle Institute for Perceptual Artificial Intelligence) is a semi-private
research institute affiliated with the Swiss Federal Institute of Technology (EPFL) at Lausanne
and the University of Geneva and conducts research in the areas of speech and speaker
recognition, computer vision and machine learning. Speech recognition systems developed at
IDIAP are based on Hidden Markov Models (HMM) and on hybrid systems combining HMMs
and Neural Networks (NNs). To improve the performance of those systems, the speech group
research activities include the study of robust speech analysis techniques (e.g., measuring
information on different window lengths and using multiscale systems), robust speech modeling
(e.g., multi-stream approaches), and robust decision rules (e.g., measures of confidence).
UW: In 1999, the Department of Electrical Engineering at the University of Washington made a
major commitment to building a speech technology research program, hiring Mari Ostendorf and
two other faculty members who founded the Signal, Speech and Language Interpretation (SSLI)
Laboratory, building on the program that she had created 12 years before at Boston University.
There are currently 21 researchers in the lab, which is dedicated to solving core problems in
speech and language technologies, facilitating multidisciplinary research, and providing a broad
educational experience for students. Prof. Ostendorf’s research contributions include: segment-
based (or, trajectory) models of acoustic parameters for ASR, dependence models for speaker
adaptation, sentence-level mixture language models for topic and dialog structure, use of out-of-
domain data in language modeling, computational modeling of prosody for speech recognition and
synthesis, integration of prosody and parsing, and information extraction from speech. Her BU
group participated in several DARPA evaluations (in collaboration with BBN); she works on
Hub 5, and contributed to the UW SPINE evaluation effort this fall, led by Prof. Jeff Bilmes.
R. Technology Transfer
As noted in Section K (Resources Offered), each of the participating sites will make available to
the research community a number of resources resulting from our research. Particularly through
SRI, we have excellent pathways for transfer to government programs. Finally, nearly all sites
have close relations with one or more commercial entities (e.g., Qualcomm for ICSI) who will be
very interested in positive results in this project.
In some sense, though, the strongest route for technology transfer from a high-risk algorithms
program such as this is through our link with the SRI-centered Rich Transcription project, which
will be incorporating the most successful of our algorithms in their systems in years 4 and 5 of
their project.
S. Government-owned Resources
The Government-Furnished Equipment (GFE) identified below is required by our team for the
performance of the proposed effort, and should be included in the terms of a resulting contract.
This GFE has not been included in the price of this proposal. Our performance depends on our
team receiving approval to either transfer these items to the resulting contract or to allow for our
use of these items on a rent-free, non-interference basis. SRI has a pending bid into SPAWAR to
purchase this equipment; if the bid is not accepted then these items will need to be transferred to
the resultant contract.
Organization  Description               Tag Number                            Accountable Contract Number
SRI           Sun Disk Drive            G443067                               N66001-94-C-6048
SRI           R-Squared Disk Drives     G443126, G444294, G444295             N66001-94-C-6048
SRI           Seagate Disk Drives       G443127, G443129                      N66001-94-C-6048
SRI           Sun Computers             G443879, G443880, G442408             N66001-94-C-6048
SRI           Seagate Disk Drives       G444136, G444137                      N66001-94-C-6048
SRI           R-Squared SCSI Drives     G444138, G444143, G444144             N66001-94-C-6048
SRI           Sun Computers             G444148, G444316, G444141             N66001-94-C-6048
SRI           Procom Hard Drives        G444376, G444377                      N66001-94-C-6048
SRI           Acropolis Disks           G444639, G444641                      N66001-94-C-6048
SRI           Seagate Disk Drives       G444674, G444675                      N66001-94-C-6048
SRI           Sun Storage System        G444940, G444941, G444942             N66001-94-C-6048
SRI           Dell Computer             G445103                               N66001-94-C-6048
SRI           Vanguard Disk Drives      G445165, G445372, G445373, G445374    N66001-94-C-6048
SRI           Vanguard Hard Drives      G445371, G445372, G445373, G445374    N66001-94-C-6048
SRI           Sun CD/Rom Reader         G441922                               N66001-94-C-6046
SRI           R-Squared Disk Drives     G443711, G443712, G443713             N66001-94-C-6046
SRI           DEC Computer              G443724, G443725, G443726, G443727    N66001-94-C-6046
SRI           R-Squared Disk Drive      G444089                               N66001-94-C-6046
SRI           Sun Computers             G444122, G444158                      N66001-94-C-6046
SRI           Seagate Disk Drive        G444160                               N66001-94-C-6046
SRI           Sun Storage Systems       G444943, G444944                      N66001-94-C-6046
SRI           Dell Computers            G445123, G445124, G445125             N66001-94-C-6046
UW            Dell 868 Computers        1175736, 1175709, 1175680             N66019928924
UW            Acer 868 Computers        No tag                                N66019928924
UW            Sun Ultra-5_10 Computers  1175755, 1175756, 1175760             N66019928924
All team sites requesting equipment (ICSI, SRI, UW, OGI, and Columbia) have included the
requisite letters of notification with this proposal indicating that we cannot provide new
information technology resources to support the development evaluation activities that are
required. These letters fully comply with the requirements of the solicitation and are included as
Appendix B in the paper versions of this proposal.
T. Organizational Conflict of Interest
SRI International is currently providing technical support to several offices within DARPA. We
do not believe that any potential conflict of interest exists as these individuals are not directly
involved in the effort proposed herein. The individuals and the offices they support are listed
below:
Murray Burke DARPA/IXO Program Manager for High Performance Knowledge Base and
Rapid Knowledge Formation programs
Tim Grayson DARPA/TTO Program Manager for Digital RF Tags (DRaFT) program
William Coleman OSD C3I/SAPCO Sr. Technical Advisor to DARPA Deputy Director
William Schneider DARPA/MTO Program Manager for Optoelectronics
Richard Wishner Director DARPA/IXO
Thomas Strat IXO Program Manager (Expected effective date 1/21/02)
None of the other sites (UW, Columbia, OGI, ICSI, and IDIAP) are providing such support.
Appendix A: References
Allen, J.,
How do humans process and recognize speech?
IEEE Transactions on Speech and Audio, 2(4): 567-577, Oct. 1994
Barker, J., Cooke, M. and Ellis, D.,
Decoding speech in the presence of other sound sources,
Proc. ICSLP-2000, Beijing, October 2000
Barker, J., Cooke, M. and Ellis, D.,
Combining bottom-up and top-down constraints for robust ASR: The multisource decoder,
Workshop on Consistent and Reliable Acoustic Cues CRAC-2001 at Eurospeech-2001, Aalborg, Denmark, September 2001.
Barker, J., Green, P. and Cooke, M.,
Linking ASA and robust ASR by missing data techniques,
Proc. WISP-2001, Stratford UK, 2001.
Bengio, S., Bourlard, H. and Weber, K.,
An EM algorithm for HMMs with emission distribution represented by HMMs,
IDIAP Research Report RR00-11, 2000.
Benitez, C., Burget, L., Chen, B., Dupont, S., Garudadri, H., Hermansky, H., Jain, P., Kajarekar,
S. and Sivadas, S.,
Robust ASR front-end using spectral-based and discriminant features: experiments on the
Aurora tasks,
Proc. Eurospeech-2001, Aalborg, September 2001.
Bilmes, J.,
Maximum Mutual Information Based Reduction Strategies for Cross-Correlation based Joint
Distributional Modeling,
Proc. ICASSP-98, Seattle, 469-472, 1998.
Bilmes, J.,
Buried Markov models for speech recognition,
Proc. ICASSP-99, Phoenix, II-713-716, 1999.
Bitar, N. and Espy-Wilson, C.,
Knowledge-based parameters for HMM speech recognition,
Proc. ICASSP, pp. 29-32, 1996.
Bourlard, H., Bengio, S., and Weber, K.,
New Approaches Towards Robust and Adaptive Speech Recognition,
invited keynote, to appear in Proc. Advances in Neural Information Processing Systems (NIPS-13)
workshop, T.K. Leen, T.G. Dietterich, and V. Tresp (Eds.), MIT Press, Denver CO, Dec. 2000.
Bourlard, H. and Dupont, S.,
A New ASR Approach Based on Independent Processing and Recombination of Partial
Frequency Bands,
Proc. ICSLP-96, 426-429, Philadelphia PA, 1996.
Bourlard, H., Dupont, S., Hermansky, H. and Morgan, N.,
Towards Subband-Based Speech Recognition,
Proc. VIII European Signal Processing Conference (EUSIPCO’96) (Trieste, Italy), 1579-1582,
1996.
Bourlard, H., Hermansky, H. and Morgan, N.,
Towards Increasing Speech Recognition Error Rates,
Speech Communication, 205-231, May 1996.
Cohen, J., Kamm, T., Andreou, A.G.,
Vocal Tract Normalization in Speech Recognition: Compensating for Systematic Speaker
Variability,
Proc. 15th Annual Speech Research Symposium, pp. 175-178, Baltimore, MD, 1995.
Cook, G., Christie, J., Ellis, D., Fosler-Lussier, E., Gotoh, Y., Kingsbury, B., Morgan, N.,
Renals, S., Robinson, A.J., and Williams, G.,
The SPRACH System for the Transcription of Broadcast News,
DARPA LVCSR meeting, 1999.
Cooke, M. and Ellis, D.,
The auditory organization of speech and other sources in listeners and computational models,
Speech Communication 35, 141-177, 2001.
Cooke, M., Green, P., Josifovski, L. and Vizinho, A.,
Robust automatic speech recognition with missing and unreliable acoustic data,
Speech Communication 34(3), 267-285, June 2001.
Deng, L. and Sun, D.,
Phonetic classification and recognition using HMM representation of overlapping articulatory
features for all classes of English sounds,
Proc. ICASSP-94, 45-48, April 1994.
Deshmukh, N., Duncan, R., Ganapathiraju, A. and Picone, J.,
Benchmarking Human Performance for Continuous Speech Recognition,
Proc. ICSLP-96, Philadelphia, 1996.
Doddington, G., Corrada, A., Ganapathiraju, A., Goel, V., Kirchhoff, K., Ordowski, M., Picone,
J. and Wheatley, B.,
Syllable-Based Speech Recognition,
Final Report, Johns Hopkins Summer Workshop 1997,
http://www.clsp.jhu.edu/ws97/syllable/final_presentation/index.html
Dupont, S. and Bourlard, H.,
Using multiple time scales in a multistream speech recognition system,
Proc. Eurospeech-97, I-3-6, Rhodes, 1997.
Ellis, D.,
The Weft: A representation for periodic sounds,
Proc. ICASSP-97, Munich, II-1307-1310, 1997.
Ellis, D.,
Listening to speech recognition — the Surfsynth home page,
Web page, http://www.icsi.berkeley.edu/~dpwe/projects/surfsynth , 1997.
Ellis, D.,
Using knowledge to organize sound: The prediction-driven approach to computational auditory
scene analysis and its application to speech/nonspeech mixtures,
Speech Communication 27(3-4), pp. 281-298, April 1999.
Ellis, D. and Bilmes, J.,
Using mutual information to design feature combinations,
Proc. ICSLP-2000, Beijing, 2000.
Ellis, D. and Reyes, M.,
Investigations into Tandem acoustic modeling for the Aurora task,
Proc. Eurospeech-01, Aalborg, Denmark, September 2001.
Ellis, D., Singh, R. and Sivadas, S.,
Tandem Acoustic Modeling in Large-Vocabulary Recognition,
Proc. ICASSP-01, Salt Lake City UT, May 2001.
Fiscus, J.,
A post-processing system to yield reduced word error rates: Recognizer Output Voting Error
Reduction (ROVER),
Proc. ASRU-97, Santa Barbara, 1997.
Fish, R.,
Dynamic models of machining vibrations, designed for classification of tool wear,
Ph.D. dissertation, Univ. of Washington, 2001.
Fosler-Lussier, E.,
Multi-Level Decision Trees for Static and Dynamic Pronunciation Models,
Proc. Eurospeech-99, Budapest, I-463-466, 1999.
Fosler-Lussier, E., and Morgan, N.,
Effects of Speaking Rate and Word Frequency on Conversational
Pronunciations,
Speech Communication Special Issue on pronunciation variation, 29 (2-4),
137-157, 1999.
Ghahramani, Z. and Jordan, M.,
Factorial Hidden Markov Models,
Machine Learning 29, 245-273, 1997.
Greenberg, S., Hollenback, J. and Ellis, D.,
Insights into spoken language gleaned from phonetic transcriptions of the Switchboard corpus,
Proc. ICSLP-96, Philadelphia, 1996.
Hagen, A.,
Robust speech recognition based on multi-stream processing,
PhD Thesis, Swiss Federal Institute of Technology, Lausanne, December 2001,
also IDIAP Research Report, RR01-4.
Hagen, A., Morris, A. and Bourlard, H.,
From Multi-Band Full Combination to Multi-Stream Full Combination Processing in Robust
ASR,
Proc. ISCA ITRW Intl. Workshop on Automatic Speech Recognition: Challenges for the Next Millennium, Paris, Sep. 18-20, 2000.
Hall, J., Haggard, M. and Fernandes, M.,
Detection in noise by spectro-temporal pattern analysis,
Journal of the Acoustical Society of America, 76, 50-56, 1984.
Hatch, A.,
Word-Level Confidence Estimation for Automatic Speech Recognition,
M.S. Thesis, University of California at Berkeley, August 2001.
Hermansky, H., Ellis, D., and Sharma, S.,
Tandem connectionist feature stream extraction for conventional HMM systems,
Proc. ICASSP-2000, Istanbul, June 2000, III-1635-1638
Hermansky, H. and Malayath, N.,
Spectral Basis Functions from Discriminant Analysis,
Proc. ICSLP’98, Sydney, November 1998.
Hermansky, H. and Morgan, N.,
RASTA Processing of Speech,
IEEE Transactions on Speech and Audio Processing, special issue on Robust Speech Recognition
2(4), 578-589, Oct., 1994
Hermansky, H. and Sharma, S.,
TRAPS — Classifiers of temporal patterns,
Proc. ICSLP-98, Sydney, III-1003-1006, 1998
Hermansky, H. and Sharma, S.,
Temporal Patterns (TRAPS) in ASR of Noisy Speech,
Proc. ICASSP-99, Phoenix AZ, March 1999.
Hermansky, H., Sharma, S. and Jain, P.,
Data-derived nonlinear mapping for feature extraction in HMM,
Proc. ASRU-99, Keystone CO, 1999
Hermansky, H., Tibrewala, S. and Pavel, M.,
Towards ASR on Partially Corrupted Speech,
Proc. ICSLP-96, 462-465, Philadelphia, October 1996.
Hirsch, H. and Pearce, D.,
The AURORA Experimental Framework for the Performance Evaluations of Speech
Recognition Systems under Noisy Conditions,
Proc. ISCA ITRW ASR2000, Paris, September 2000.
Janin, A., Ellis, D. and Morgan, N.,
Multi-stream speech recognition: ready for prime time?
Proc. Eurospeech 1999, II-591-594, 1999.
Jurafsky, D., Wooters, C., Segal, J., Stolcke, A., Fosler, E., Tajchman, G. and Morgan, N.,
Using a Stochastic Context-Free Grammar as a Language Model for Speech Recognition,
Proc. ICASSP-95, 189-192, 1995.
Kajarekar, S., Yegnanarayana, B. and Hermansky, H.,
A Study of Two Dimensional Linear Discriminants for ASR,
Proc. ICASSP-01, Salt Lake City UT, 2001.
Kajarekar, S. and Hermansky, H.,
Optimization of units for continuous-digit recognition task,
Proc. ICSLP-2000, Beijing, 2000.
Kingsbury, B.,
Perceptually-inspired signal processing strategies for robust speech recognition in reverberant environments,
Ph.D. Dissertation, University of California at Berkeley, December 1998.
Konig, Y., Bourlard, H. and Morgan, N.,
REMAP - Experiments with Speech Recognition,
Proc. ICASSP-96, 3350-3353, 1996.
Lahiri, A.,
Speech recognition with phonological features,
Proc. Int. Congress on Phonetic Sciences, pp. 715-718, 1999.
Lee, S., and Glass, J.,
Real-time Probabilistic Segmentation for Segment-Based Speech Recognition,
Proc. ICSLP-1998, Sydney, 1998.
Li, J., Najmi, A. and Gray, R.M.,
Image classification by a two-dimensional hidden Markov model,
Proc. ICASSP-99, VI-3313-3316, Phoenix AZ, 1999.
Logan, B. and Moreno, P.,
Factorial HMMs for Acoustic Modeling,
Proc. ICASSP-98, 813-816, Seattle WA, 1998.
Malayath, N., Hermansky, H., Kajarekar, S., and Yegnanarayana, B.,
Data-Driven Temporal Filters and Alternatives to GMM in Speaker Verification,
Digital Signal Processing 10, 55-74, 2000.
Minsky, M.,
The Society of Mind,
Simon and Schuster, 1986.
Mirghafori, N., Fosler, E. and Morgan, N.,
Towards Robustness to Fast Speech in ASR,
Proc. ICASSP-96, 335-338, 1996.
Morgan, N., Baron, D., Edwards, J., Ellis, D., Gelbart, D., Janin, A., Pfau, T., Shriberg, E. and
Stolcke, A.,
The Meeting Project at ICSI,
Proc. Human Language Technologies Conference, San Diego, March 2001
Morgan, N. and Bourlard, H.,
Continuous Speech Recognition: An Introduction to the Hybrid HMM/Connectionist
Approach,
Signal Processing Magazine, pp 25-42, May 1995
Morgan, N., Bourlard, H., Greenberg, S. and Hermansky, H.,
Stochastic Perceptual Models of Speech,
Proc. ICASSP-95, 397-400, 1995.
Morgan, N., Ellis, D., Fosler-Lussier, E., Janin, A. and Kingsbury, B.,
Reducing errors by increasing the error rate: MLP Acoustic Modeling
for Broadcast News Transcription,
Proc. DARPA LVCSR meeting, 1999.
Morgan, N., and Fosler-Lussier, E.,
Combining multiple estimators of speaking rate,
Proc. ICASSP-98, 729-732, Seattle WA, May 1998.
Niyogi, P., Mitra, P., and Sondhi, M. M.,
A detection framework for locating phonetic events,
Proc. ICSLP, pp. 1067-1070, 1998.
Nock, H.,
Techniques for Modelling Phonological Processes in Automatic Speech Recognition,
Ph.D. dissertation, Cambridge Univ., 2001.
Nock, H. and Young, S.,
Loosely-coupled HMMs for ASR,
Proc. ICSLP-2000, III-143-146, Beijing, 2000.
Robinson, A., Hochberg, M., and Renals, S.,
IPA: Improved Phone Modelling with Recurrent Neural Networks,
Proc. ICASSP-94, 37-40, April 1994.
Saul, L. and Jordan, M.,
Mixed memory Markov models,
Machine Learning 37, 75-87, 1999.
Shire, M.,
Data-driven modulation filter design under adverse acoustic conditions and using phonetic and
syllabic units,
Proc. Eurospeech-99, Budapest, III-1123-1126, 1999.
Simon, J., Depireux, D. and Shamma, S.,
Representation of complex spectra in the auditory cortex,
Proc. 11th International Symposium on Hearing, ed. Palmer, Rees, Summerfield & Meddis,
513-520, Whurr Publishers, 1998.
Singh, R., Seltzer, M., Raj, B. and Stern, R.,
Speech in noisy environments: Robust automatic segmentation, feature extraction and
hypothesis combination,
Proc. ICASSP-01, Salt Lake City, 2001.
Sonmez, M., Plauche, M., Shriberg, E., and Franco, H.,
Consonant Discrimination in Elicited and Spontaneous Speech: A Case for Signal-adaptive
Front Ends in ASR,
Proc. ICSLP-2000, Beijing, October 2000.
Wu, S., Kingsbury, B., Morgan, N. and Greenberg, S.,
Performance Improvements Through Combining Phone- and Syllable-Scale Information in
Automatic Speech Recognition,
Proc. ICSLP-98, 459-462, Sydney, 1998.
Wu, S., Shire, M., Greenberg, S. and Morgan, N.,
Integrating Syllable Boundary Information Into Speech Recognition,
Proc. ICASSP-97, 987-990, Munich, 1997.
Yang, H., Van Vuuren, S., Sharma, S. and Hermansky, H.,
Relevance of Time-Frequency Features for Phonetic and Speaker-Channel
Classification,
Speech Communication 31, 35-50, 2000.