
Speech Processing

František Hrdina

Presentation parts:

• Speech processing theory

• Available commercial software

• SDK SAPI 5.1

Speech processing includes different technologies and applications:

• Speech encoding

• Speaker separation

• Speech enhancement

• Speaker identification (biometrics)

• Language identification

• Keyword spotting

• Automatic speech recognition (ASR), the core problem of an intelligent human-computer interface (IHCI)

Intelligent human-computer interface

Theory of everything

F(X)=0

...for suitable values of F, and suitable interpretations of X. You can come up with another theory, but it will merely be a special case of this one.

Automatic speech recognition (ASR)

Systems can be divided by

– Vocabulary size

• From tens of words to hundreds of thousands of words

– The speaking format of the system

• Isolated or connected words (phone dialing), continuous speech

– The degree of speaker dependence of the system

• Speaker-dependent, speaker-independent

– The constraints of the task

• As the vocabulary size increases, the number of possible word combinations to be recognized grows exponentially. Some form of task constraint, such as a formal syntax and formal semantics, is required to make the task more manageable.
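The exponential growth mentioned above is easy to quantify: with a vocabulary of V words and an utterance of n words, an unconstrained recognizer faces V^n candidate sequences. A quick sketch (the vocabulary and utterance sizes are illustrative):

```python
def unconstrained_sequences(vocab_size: int, length: int) -> int:
    """Number of possible word sequences with no task constraints: V**n."""
    return vocab_size ** length

# A 10-word digit vocabulary over a 7-word utterance (e.g. phone dialing):
print(unconstrained_sequences(10, 7))       # 10000000
# A 60,000-word dictation vocabulary over the same utterance length
# already exceeds 10**33 sequences -- hence the need for grammars:
print(unconstrained_sequences(60_000, 7))
```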

ASR Phases

• Signal pre-processing - filtering, 16-bit/20 kHz sampling (65,536 amplitude values, one sample every 0.05 ms)

• Signal-processing phase – to reduce the data rate; 10-30 ms segments; windowing – to prevent discontinuities and spectrum distortion

• Pattern matching - matching the feature vector to already existing ones and finding the best match. There are four major ways to do this: (1) template matching, (2) hidden Markov models, (3) neural networks, and (4) rule-based systems.

• Time-alignment phase - a sequence of vectors recognized over time is aligned to represent a meaningful linguistic unit (phoneme, word). Different methods can be applied, for example, the Viterbi method, dynamic programming, and fuzzy rules.

• Language analysis - the language units recognized over time are further combined and interpreted in terms of the syntax, the semantics, and the concepts of the language used in the system.

Automatic speech recognition

Signal pre-processing

• Filtering

• 16-bit/20 kHz sampling (65,536 amplitude values, one sample every 0.05 ms)

• The sound card can be accessed via DirectX on Windows

• http://www.ymec.com/products/dssf3e/index.htm
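The sampling figures above can be checked with a few lines of arithmetic:

```python
# Checking the slide's figures: 16-bit samples at a 20 kHz sample rate.
bits_per_sample = 16
sample_rate_hz = 20_000

levels = 2 ** bits_per_sample                   # quantization levels
period_ms = 1000 / sample_rate_hz               # time between samples
byte_rate = sample_rate_hz * bits_per_sample // 8

print(levels)      # 65536 distinct amplitude values
print(period_ms)   # 0.05 ms per sample
print(byte_rate)   # 40000 bytes/s of raw audio
```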

Signal-processing

• To reduce the data rate and to obtain feature vectors

• 10-30ms segments

• windowing – to prevent discontinuities and spectrum distortion

• http://www.ymec.com/products/dssf3e/index.htm
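Windowing can be sketched as follows; the Hamming window used here is one common choice (the slide does not name a specific window), and the frame length is an assumption within the 10-30 ms range above:

```python
import math

def hamming(n: int) -> list:
    """Hamming window coefficients: w[k] = 0.54 - 0.46*cos(2*pi*k/(n-1)).
    Tapers the frame edges toward zero to reduce the spectral leakage
    caused by cutting the signal into segments."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]

# One 10 ms frame at 20 kHz = 200 samples (illustrative frame length).
frame = [1.0] * 200                 # dummy frame of constant amplitude
w = hamming(len(frame))
windowed = [s * wk for s, wk in zip(frame, w)]
# Edges are tapered down to 0.08, the centre stays near 1.0:
print(round(windowed[0], 2), round(windowed[100], 2))
```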

Speech can be represented on the:

• Time scale (waveform)

• Frequency scale (spectrum)

• Both a time and a frequency scale (spectrogram)

Features: loudness, pitch, cepstrum (Fourier analysis of the logarithmic amplitude spectrum of the signal) ~ autocorrelation, formants – frequencies with the highest energy
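The "cepstrum ~ autocorrelation" connection to pitch can be illustrated with a minimal autocorrelation pitch estimator; the test signal and the lag bounds are invented for illustration:

```python
import math

def autocorr_pitch(signal, sample_rate, min_lag=40, max_lag=400):
    """Estimate pitch as the lag that maximizes the autocorrelation.

    Minimal sketch: the lag bounds (assumed here) restrict the search
    to roughly 50-500 Hz at a 20 kHz sample rate.
    """
    best_lag, best_val = min_lag, float("-inf")
    for lag in range(min_lag, max_lag):
        val = sum(signal[i] * signal[i + lag] for i in range(len(signal) - lag))
        if val > best_val:
            best_lag, best_val = lag, val
    return sample_rate / best_lag

# Synthetic 200 Hz tone at 20 kHz: the period is 100 samples.
rate = 20_000
tone = [math.sin(2 * math.pi * 200 * t / rate) for t in range(1000)]
print(autocorr_pitch(tone, rate))   # close to 200 Hz
```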

Signal-processing

Digital filter banks model, based on the human auditory system:

• quasi-linear up to about 1 kHz

• quasi-logarithmic above 1 kHz

http://mi.eng.cam.ac.uk/~ajr/SA95/SpeechAnalysis.html
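This quasi-linear/quasi-logarithmic frequency warping is commonly approximated by the mel scale; the formula below is a standard approximation, not one given on the slides:

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Common mel-scale approximation: near-linear below ~1 kHz,
    logarithmic above it."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

# Below 1 kHz the mapping is close to linear...
print(round(hz_to_mel(500), 1), round(hz_to_mel(1000), 1))
# ...while doubling the frequency above 1 kHz adds ever fewer mels:
print(round(hz_to_mel(2000), 1), round(hz_to_mel(4000), 1))
```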

Pattern matching

• Matching the feature vector to already existing ones and finding the best match. There are four major ways to do this:

• (1) template matching,

• (2) hidden Markov models,

• (3) neural networks,

• (4) rule-based systems.
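Way (1), template matching, is classically implemented with dynamic time warping (DTW). A minimal sketch on 1-D feature sequences (real recognizers match sequences of multidimensional feature vectors; the templates below are invented):

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences.
    The DP table allows stretching/compressing time so that utterances
    spoken at different rates can still be compared."""
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

# Match an utterance against stored templates; the lowest score wins.
templates = {"yes": [1.0, 3.0, 2.0], "no": [5.0, 5.0, 4.0]}
utterance = [1.0, 1.0, 3.0, 2.0]            # "yes", spoken a bit slower
best = min(templates, key=lambda w: dtw_distance(templates[w], utterance))
print(best)   # -> yes
```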

Pattern matching Word segmentation

• The relationship between the segmentation of sensory input (e.g., the speech signal) into chunks and the recognition of those chunks.

Two processes or one? A chicken-and-egg problem:

• One approach to resolving the paradox is to assume that segmentation and recognition are two aspects of a single process: tentative hypotheses about each issue are developed and tested simultaneously, and mutually consistent hypotheses are reinforced.

• A second approach is to suppose that there are segmentation cues in the input that are used to give at least better-than-chance indications of what segments may correspond to identifiable words.

• Bottom-up vs. top-down; word segmentation by children
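The segmentation-cue approach can be caricatured as greedy longest-match lookup in a lexicon; the toy lexicon below is invented, and the example deliberately mirrors the "gray tape"/"great ape" boundary ambiguity discussed later:

```python
def greedy_segment(stream, lexicon):
    """Segment an unsegmented symbol stream by always taking the longest
    lexicon word that matches at the current position. A toy sketch:
    greedy matching commits early and never revisits a boundary."""
    words, i = [], 0
    while i < len(stream):
        for j in range(len(stream), i, -1):        # longest match first
            if stream[i:j] in lexicon:
                words.append(stream[i:j])
                i = j
                break
        else:
            return None                            # no parse found
    return words

lexicon = {"great", "grey", "ape", "tape"}
# Greedy longest match commits to "great" and then reads "ape":
print(greedy_segment("greatape", lexicon))   # ['great', 'ape']
# The same stream of symbols with one letter changed parses differently:
print(greedy_segment("greytape", lexicon))   # ['grey', 'tape']
```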

ASR with MLP

Pattern matching MLP - features

• Connectionism (brain-like) vs. Turing machine

• The main pros of MLPs over other statistical modeling methods are:

(1) MLP implementations typically require fewer assumptions and can be optimized in a data-driven fashion, (2) backpropagation training can be generalized to any optimization criterion, including maximum likelihood and all forms of discriminative training, and (3) MLP modules can easily be integrated in nonadaptive architectures.

• Most industrial speech recognizers to date contain very few MLP components.

• The main disadvantage: MLP training time is typically much greater than that of nonconnectionist models for which closed-form or fast iterative solutions can be derived.
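What an MLP actually computes can be shown with a tiny forward pass; the weights below are hand-set to compute XOR (a real recognizer learns such weights from data by backpropagation, mapping feature vectors to phoneme scores):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def mlp_xor(x1: int, x2: int) -> float:
    """Forward pass of a 2-2-1 MLP whose hand-set weights compute XOR.
    Illustrative only: the point is that a hidden layer of nonlinear
    units lets the network represent non-linearly-separable mappings."""
    h1 = sigmoid(10 * x1 + 10 * x2 - 5)    # hidden unit acting like OR
    h2 = sigmoid(10 * x1 + 10 * x2 - 15)   # hidden unit acting like AND
    return sigmoid(10 * h1 - 10 * h2 - 5)  # output: OR and not AND

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, round(mlp_xor(a, b)))      # 0, 1, 1, 0
```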

Pattern matching MLP - future

• MLP-based models have outperformed state-of-the-art traditional recognition systems on some of the most challenging recognition tasks.

• Despite the recent advances in multimodule architectures and gradient-based learning, several key questions are still unanswered, and many problems are still out of reach. How much has to be built into the system, and how much can be learned? How can one achieve true transformation-invariant perception with NNs?

• New concepts (possibly inspired by biology) will be required for a complete solution. The accuracy of the best NN/HMM hybrids for written or spoken sentences cannot even be compared with human performance. Topics such as the recognition of three-dimensional objects in complex scenes are totally out of reach. Human-like accuracy on complex PR tasks such as handwriting and speech recognition may not be achieved without a drastic increase in the available computing power. Several important questions may simply resolve themselves with the availability of more powerful hardware, allowing the use of brute-force methods and very large networks.

Time-alignment

• A sequence of vectors recognized over time is aligned to represent a meaningful linguistic unit (phoneme, word). Different methods can be applied, for example, the Viterbi method, dynamic programming, and fuzzy rules.
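The Viterbi method can be sketched on a toy two-state HMM; all probabilities below are invented, with states standing in for phonemes and observations for quantized acoustic labels:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state path for an observation sequence.
    Dynamic programming: for each step, keep only the best-scoring
    path ending in each state."""
    v = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    path = {s: [s] for s in states}
    for o in obs[1:]:
        v.append({})
        new_path = {}
        for s in states:
            prob, prev = max(
                (v[-2][p] * trans_p[p][s] * emit_p[s][o], p) for p in states
            )
            v[-1][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    best = max(states, key=lambda s: v[-1][s])
    return path[best]

# Two phoneme-like states emitting quantized acoustic labels 'a'/'b':
states = ("ph1", "ph2")
start_p = {"ph1": 0.6, "ph2": 0.4}
trans_p = {"ph1": {"ph1": 0.7, "ph2": 0.3}, "ph2": {"ph1": 0.4, "ph2": 0.6}}
emit_p = {"ph1": {"a": 0.9, "b": 0.1}, "ph2": {"a": 0.2, "b": 0.8}}
print(viterbi(["a", "a", "b"], states, start_p, trans_p, emit_p))
```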

Language analysis

• The language units recognized over time are further combined and interpreted in terms of the syntax, the semantics, and the concepts of the language used in the system.

ASR Problems

The speech signal is highly variable according to the speaker, speaking rate, context, and acoustic conditions.

People use a huge vocabulary of over 300,000 words.

Ambiguity of speech:

• Homophones: "hear" and "here"

• Word boundaries: /greiteip/ as "gray tape" or "great ape"

• Syntactic ambiguity: "the boy jumped over the stream with the fish"

Commercial Software

• An extremely complicated task: no room for new players.

• Nuance Communications dominates server-based telephony and PC applications market.

• IBM - command and control (grammar-constrained) and dictation. Claims it will reach human-level speech recognition quality by 2010; Microsoft by 2011.

• Microsoft – SpeechServer.

• Growing market segment – mobile phones. (Operators such as Vodafone, et cetera)

SDK SAPI 5.1

• Allows using speech synthesis (TTS – text to speech) and speech recognition in custom applications

• C&C (command and control) or natural speaking

• User-independent, can be trained

• 60,000 English words

• Free

• My experience:

– Not based on .NET (maybe in Vista?); chaotic documentation (compared to .NET's), but still easy to use

– Training greatly improves performance

– Good accuracy on a constrained vocabulary; the grammar is defined as a state machine

– Too sensitive (recognizes nearly any sound as a word); better results probably with a speaker-dependent system

– Does not provide any way to view the recognition score, e.g. for setting a threshold

– Useless for controlling Windows (because of the two previous points)

– Some features are not implemented yet (though the interface is provided); for example, add the digit "3" to the vocabulary and it is recognized as "three"

– Can be very useful when combined with an IR remote control (supplies parameters to commands)

– Dictation mode not tested; used in MS Word

– TTS sounds very artificial

Literature

• Nikola K. Kasabov: Foundations of Neural Networks, Fuzzy Systems, and Knowledge Engineering. The MIT Press, 1998, 581 pp.

• Michael A. Arbib (ed.): The Handbook of Brain Theory and Neural Networks, Second Edition. The MIT Press, 2003, 1290 pp.

• John-Paul Hosom, Ron Cole, Mark Fanty, Johan Schalkwyk, Yonghong Yan, Wei Wei: Training Neural Networks for Speech Recognition. Center for Spoken Language Understanding (CSLU), Oregon Graduate Institute of Science and Technology, February 2, 1999

• http://en.wikipedia.org/wiki/Speech_recognition/

• http://mi.eng.cam.ac.uk/~ajr/SA95/SpeechAnalysis.html

• http://www.phon.ucl.ac.uk/courses/spsci/matlab/lect10.html