Open Problems in Speech Recognition Nelson Morgan, EECS and ICSI.
Posted 20-Dec-2015
ICSI and EECS
• International Computer Science Institute
• Nonprofit, closely affiliated with UCB-EECS:
- faculty (e.g., Morgan, Feldman)
- Board (Berlekamp, Karp, Malik)
- students (PhD, MS)
• Focus areas in speech, language, theory, internet research; CITRIS involvement
A working speech recognizer (circa 1920)
A working speech recognizer (circa 2002)
Current Applications
• Toys
• Telephone queries (operator/touch tone replacement)
• Voice dialing (for cell phones)
• Dictation (esp. for specific domains)
Major Reasons for Success
• Late 60’s statistical methodology (HMMs, developed for cryptography) applied to speech in 70’s and 80’s
• Moore’s Law + engineering refinements to HMM training/recognition (1986-now)
• Normalization approaches (mean norms, RASTA filtering, vocal tract length approx)
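The "mean norms" idea can be sketched in a few lines. A fixed channel (for example, a different telephone handset) multiplies the spectrum, so it adds a roughly constant offset to every frame of log-spectral or cepstral features; subtracting the per-utterance mean removes that offset. This is only an illustration with NumPy: the function name and the synthetic data are assumptions, not anything taken from the talk.

```python
import numpy as np

def cepstral_mean_normalize(features):
    """Subtract the per-utterance mean from each feature dimension.

    features: array of shape (num_frames, num_coeffs), e.g. cepstra.
    """
    features = np.asarray(features, dtype=float)
    return features - features.mean(axis=0, keepdims=True)

# Synthetic example (hypothetical data): the same "speech" seen through
# two channels that differ by a constant offset per coefficient.
rng = np.random.default_rng(0)
clean = rng.normal(size=(200, 13))
channel_offset = rng.normal(size=(1, 13))
filtered = clean + channel_offset

# After normalization the two versions agree up to rounding error.
max_diff = np.abs(cepstral_mean_normalize(clean)
                  - cepstral_mean_normalize(filtered)).max()
```

The sketch covers only a constant channel offset; slowly varying channels are what motivate filtering approaches such as RASTA.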
Two examples of things that helped
• RASTA: digit error rose from 2% to 60% when tested on a different phone system, then fell to 3% using RASTA; now used for voice dialing in millions of cell phones
• Vocal tract length normalization: 1 parameter for each speaker, significant effect on errors; now used in all large research systems
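As a rough sketch of why RASTA helps with a changed phone channel: it band-pass filters the time trajectory of each log-spectral band, so a constant offset introduced by the channel is suppressed while speech-rate modulations pass through. The numerator below follows the commonly cited RASTA form; the pole value (0.94 here), the 100 frames/s rate, and the synthetic test signal are assumptions for illustration, and published variants differ.

```python
import numpy as np

def rasta_filter(log_traj, pole=0.94):
    """Band-pass filter one log-spectral trajectory over time.

    The numerator 0.1 * [2, 1, 0, -1, -2] is a smoothed differentiator
    (zero gain at DC); the single pole re-integrates with leakage.
    The pole value is an assumption; published variants differ.
    """
    x = np.asarray(log_traj, dtype=float)
    numer = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])
    # FIR part: fir[n] = sum_k numer[k] * x[n - k], zero initial state.
    fir = np.convolve(x, numer)[: len(x)]
    y = np.zeros_like(x)
    prev = 0.0
    for n, value in enumerate(fir):
        prev = value + pole * prev  # leaky integration
        y[n] = prev
    return y

# Hypothetical trajectory: 4 Hz modulation sampled at 100 frames/s,
# with and without a constant channel offset in the log domain.
frames = np.arange(1000)
speech_like = np.sin(2 * np.pi * 4.0 * frames / 100.0)
out_clean = rasta_filter(speech_like)
out_shifted = rasta_filter(speech_like + 5.0)

# Once the start-up transient has decayed, the offset has no effect.
late_diff = np.abs(out_clean[500:] - out_shifted[500:]).max()
```

The loop is written out explicitly to keep the example dependency-free; a library IIR filter routine would do the same job.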
Major Technical Challenges
• Speaker variability for fluent/conversational (pronunciation, rate, overlaps)
- 25-40% error on conversations
• Acoustic variability for general environments (noise, reverb, talker movement)
- 3-10% error on read digits (vs <1% in clean conditions)
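Error figures like these are usually word error rates: the number of substitutions, deletions, and insertions needed to turn the recognizer output into the reference transcript, divided by the number of reference words. The slides do not say how the numbers were scored, so the following is a generic sketch with a hypothetical function name and made-up sentences.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

# One substitution ("two" -> "too") and one deletion ("four"): 2 errors / 4 words.
example_wer = word_error_rate("one two three four", "one too three")
```

Because insertions count as errors, this measure can exceed 100% on very poor output.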
Modern ASR Systems
• From 50,000 ft, all ASR systems the same:
- compute local spectral envelope
- determine likelihoods of speech sounds
- search for most likely HMMs
• Spectral envelope distorted by many things
- Alternatives often are bad fits to the statistical models
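The last two steps of the pipeline above (per-frame likelihoods of speech sounds, then a search for the most likely HMM state sequence) can be illustrated with a toy Viterbi search. This is a minimal sketch, not the decoder of any particular system: the two-state model, its probabilities, and the function name are all assumptions made up for the example.

```python
import numpy as np

def viterbi(log_obs, log_trans, log_init):
    """Return the most likely state path.

    log_obs:   (num_frames, num_states) frame log-likelihoods
    log_trans: (num_states, num_states) log transition probs, from -> to
    log_init:  (num_states,) log initial state probs
    """
    num_frames, num_states = log_obs.shape
    score = log_init + log_obs[0]
    backptr = np.zeros((num_frames, num_states), dtype=int)
    for t in range(1, num_frames):
        # cand[i, j]: best score of a path in state i at t-1 moving to j.
        cand = score[:, None] + log_trans
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_obs[t]
    # Trace back from the best final state.
    path = [int(score.argmax())]
    for t in range(num_frames - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

# Hypothetical two-sound model with "sticky" transitions.
obs = np.log(np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3],
                       [0.1, 0.9], [0.2, 0.8]]))
trans = np.log(np.array([[0.8, 0.2], [0.2, 0.8]]))
init = np.log(np.array([0.5, 0.5]))
best_path = viterbi(obs, trans, init)
```

Working in the log domain avoids numerical underflow when many frame probabilities are multiplied together.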
ASR in Brief
[Block diagram: Speech -> Signal Processing -> Phonetic Probability Estimator -> Decoder (word search) -> Words; the Pronunciation Lexicon and Grammar feed the decoder]
ASR is half-deaf
• Phonetic classification very poor
• Success due to constraints (domain, speaker, noise-canceling mic, etc)
• These constraints can mask the underlying weakness of the technology
Rethinking Acoustic Processing for ASR
• Escape dependence on spectral envelope
• Use multiple front ends across time/freq
• Modify statistical models to accommodate new front ends
• Design optimal combination schemes for multiple models
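One simple way to combine several front ends is to merge their per-frame class posteriors with a weighted log-linear (geometric-mean) rule. The sketch below illustrates that one scheme only; the slides do not say which combination rule is meant, and the function name, weights, and numbers are hypothetical.

```python
import numpy as np

def combine_posteriors(stream_posteriors, weights=None):
    """Log-linear combination of posteriors from several streams.

    stream_posteriors: (num_streams, num_frames, num_classes)
    weights: optional (num_streams,) stream weights; equal if omitted.
    """
    p = np.asarray(stream_posteriors, dtype=float)
    if weights is None:
        weights = np.full(p.shape[0], 1.0 / p.shape[0])
    w = np.asarray(weights, dtype=float)
    # Floor the probabilities so that log(0) never occurs.
    log_p = np.log(np.clip(p, 1e-12, None))
    # Weighted sum over streams -> (num_frames, num_classes).
    combined = np.tensordot(w, log_p, axes=1)
    combined -= combined.max(axis=1, keepdims=True)
    probs = np.exp(combined)
    return probs / probs.sum(axis=1, keepdims=True)

# Two hypothetical front ends that disagree on a single frame.
stream_a = [[0.6, 0.3, 0.1]]
stream_b = [[0.2, 0.7, 0.1]]
merged = combine_posteriors([stream_a, stream_b])
```

With equal weights the merged estimate favors the class both streams find plausible; unequal weights let a more reliable stream dominate.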
The DARPA (IAO) “EARS” Program
• New 5 year program to radically reduce errors in conversational speech-to-text
• Two components:
- Rich Transcription (large reductions in error rate, improvements in readability and portability to new languages)
- Novel Approaches (radical changes)
EARS: Effective Affordable Reusable Speech-to-text
• Rich Transcription: 4 teams
- SRI/ICSI/UW
- BBN/U.Pitt/UW/LIMSI
- Cambridge U.
- IBM
• Novel Approaches: 2 teams
- ICSI/SRI/UW/OGI/Columbia/IDIAP
- Microsoft
Novel Approach 1: Pushing the Envelope (aside)
• Problem: Spectral envelope is a fragile information carrier
• Solution: Probabilities from multiple time-frequency patches
[Figure: OLD: a single 10 ms frame gives the estimate of sound identity. PROPOSED: the i-th, k-th, and n-th estimates, from patches spanning up to 1 s, pass through information fusion to give the estimate of sound identity. Horizontal axis: time.]
Novel Approach 2: Beyond Frames…
• Problem: Features & models interact, new features may require different models
• Solution: Advanced features require advanced models, not limited by fixed-frame-rate paradigm
[Figure: OLD: short-term features -> conventional HMM. PROPOSED: advanced features -> multi-rate / dynamic scale classifier.]
Other speech-to-text projects
• Dialog systems: DARPA Communicator/Symphony, German SmartKom
• Noise/reverberation for cell phone, military environments: DARPA SPINE program, various European projects (EU, ETSI)
• Recognition/retrieval/summarization for multiparty meetings: Swiss IM2, EU m4, ICSI/UW/SRI/Columbia NSF-ITR
Resource generation from Berkeley researchers
• gmtk - a new graphical model toolkit specialized for speech (extension of 2 PhD theses, Bilmes [UW] and Zweig [IBM])
• Publicly available speech/neural network software (RASTA, speech neural network training system)
• Soon: a “meeting data” corpus
Campus interaction
• Within EECS (CIS):
- Feldman (also ICSI), NLU
- Jordan and Russell, machine learning
• Linguists:
- Ohala, phonology
- Fillmore (ICSI), semantic lexicography
Natural Speech + Language Projects at ICSI/EECS
• Berkeley Restaurant Project (BeRP) - online stochastic context free grammar probabilities with natural mixed initiative
• SmartKom - tourist information query system w/American pronunciations of German place names
Summary
• Progress in speech recognition research led to working systems in particular domains
• Performance still severely limited for conversational speech, noisy/reverberant conditions
• We and others are working to transcend these limitations with novel approaches