Speech technology basics


Transcript of Speech technology basics

Page 1: Speech technology   basics

Speech Technology- Basics

Presenter: Eshwari.G

Page 2: Speech technology   basics
Page 3: Speech technology   basics

What is DSP?

• Digital signal processing (DSP) is the processing of signals in digital form.

Page 4: Speech technology   basics

SIGNAL

A signal is a description of how one parameter varies with another parameter.

Continuous signals: x(t)

Discrete signals: x[n]

Page 5: Speech technology   basics

DIGITAL SIGNAL

Digital signals x[n] are discrete signals x[n] whose sample values are also restricted to a finite set of levels.

Page 6: Speech technology   basics
Page 7: Speech technology   basics
Page 8: Speech technology   basics

ANALOG TO DIGITAL CONVERSION

Analog-to-digital conversion is an electronic process in which a continuously variable (analog) signal is changed, without altering its essential content, into a multi-level (digital) signal.

The input to an analog-to-digital converter (ADC) consists of a voltage that varies among a theoretically infinite number of values. Examples are sine waves, the waveforms representing human speech, etc.

The output of the ADC, in contrast, has defined levels or states.

The simplest digital signals have only two states, and are called binary.
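To make the idea concrete, here is a minimal sketch (not part of the original slides) of the two steps an ADC performs: sampling a continuously varying signal at discrete instants and rounding each sample to one of a finite set of levels. The sampling rate, bit depth, and test sine wave are illustrative assumptions.

```python
import numpy as np

# Illustrative parameters (assumptions, not values from the slides).
fs = 8000          # sampling rate in Hz
f0 = 440.0         # frequency of the "analog" sine wave in Hz
bits = 8           # converter resolution
duration = 0.01    # seconds of signal to convert

# "Analog" input: evaluate the continuous signal at discrete sample instants.
t = np.arange(0, duration, 1.0 / fs)
x_analog = np.sin(2 * np.pi * f0 * t)        # values vary continuously in [-1, 1]

# Quantization: map each sample to one of 2**bits defined levels (here 256).
levels = 2 ** bits
x_digital = np.round((x_analog + 1) / 2 * (levels - 1)).astype(int)

print(x_digital[:10])    # integer codes between 0 and 255
```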

Page 9: Speech technology   basics

Advantages of digital signals

• First, digital signals can be stored easily.

• Second, digital signals can be reproduced exactly. All you have to do is be sure that a zero doesn't get turned into a one or vice versa.

• Third, digital signals can be manipulated easily. Since the signal is just a sequence of zeros and ones, and since a computer can do anything specifiable to such a sequence, you can do a great many things with digital signals. And what you are doing is called digital signal processing.

Page 10: Speech technology   basics

BASIC STRUCTURE OF A DIGITAL SIGNAL PROCESSING SYSTEM

[Block diagram: analog input signal → pre-amplifier → amplified analog signal → analog-to-digital converter (A/D) → digitized signal (e.g., 001101101010) → digital signal processor running software (algorithm) → processed digital signal (e.g., 010110110101) → digital-to-analog converter (D/A) → processed analog signal → final amplifier → analog output signal]

Page 11: Speech technology   basics

DIGITAL TO ANALOG CONVERSION

Page 12: Speech technology   basics

BASIC STRUCTURE OF A DIGITAL SIGNAL PROCESSING SYSTEM

[The block diagram from Page 10 is shown again: analog input → pre-amplifier → A/D converter → digital signal processor → D/A converter → final amplifier → analog output.]

Page 13: Speech technology   basics

Synthesis & Decomposition

The process of combining signals is called synthesis.

Decomposition is the inverse operation of synthesis, where a single signal is broken into two or more additive components.

Page 14: Speech technology   basics

Synthesis & Decomposition

2041 × 4 = ?

The number 2041 can be decomposed into: 2000 + 40 + 1

Each of these components can be multiplied by 4 and then synthesized to find the final answer: 8000 + 160 + 4 = 8164

The goal of this method is to replace a complicated problem with several easy ones.

Page 15: Speech technology   basics

Synthesis & Decomposition

• There are infinite possible decompositions for any given signal, but only one synthesis.

• For example, the numbers 15 and 25 can only be synthesized (added) into the number 40.

• In comparison, the number 40 can be decomposed into 1+39, 2+38, 30+10, etc.

Page 16: Speech technology   basics

SUPERPOSITION

Divide & conquer strategy:

The signal being processed is broken into simple components.

Each component is processed individually.

The results are reunited.

Page 17: Speech technology   basics

SUPERPOSITION

Page 18: Speech technology   basics

DECOMPOSITION

There are two main ways to decompose signals in signal processing: impulse decomposition and Fourier decomposition.

Page 19: Speech technology   basics

Impulse DECOMPOSITION

Impulse decomposition breaks an N-sample signal into N component signals, each containing N samples.

Each of the component signals contains one point from the original signal, with the remainder of the values being zero.

A single nonzero point in a string of zeros is called an impulse.
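As an illustrative sketch (not from the slides), impulse decomposition can be written in a few lines of NumPy: each component keeps exactly one sample of the original signal, and adding the components back together (synthesis) recovers the original.

```python
import numpy as np

x = np.array([2.0, -1.0, 0.5, 3.0])    # small example signal, N = 4
N = len(x)

# Build N component signals, each holding one sample of x and zeros elsewhere.
components = []
for k in range(N):
    c = np.zeros(N)
    c[k] = x[k]                        # a single (scaled, shifted) impulse
    components.append(c)

# Synthesizing (adding) the components gives back the original signal.
assert np.allclose(sum(components), x)
```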

Page 20: Speech technology   basics

IMPORTANCE OF IMPULSE DECOMPOSITION

Impulse decomposition is important because it allows signals to be examined one sample at a time.

Similarly, systems are characterized by how they respond to impulses.

By knowing how a system responds to an impulse, the system's output can be calculated for any given input. This approach is called convolution.

Page 21: Speech technology   basics

Fourier Decomposition

Any N point signal can be decomposed into N+2 signals, half of them sine waves and half of them cosine waves.

The lowest frequency cosine wave (called xC0[n] in this illustration) makes zero complete cycles over the N samples, i.e., it is a DC signal.

Page 22: Speech technology   basics

Fourier Decomposition

The next cosine components, xC1[n], xC2[n], and xC3[n], make 1, 2, and 3 complete cycles over the N samples, respectively.

Since the frequency of each component is fixed, the only thing that changes for different signals being decomposed is the amplitude of each of the sine and cosine waves.
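A short sketch of this idea using NumPy's real FFT (an illustration with a made-up signal, not the slide's own example): the frequencies of the cosine and sine components are fixed, and the decomposition only has to find their amplitudes.

```python
import numpy as np

N = 64
n = np.arange(N)
# Example signal: a DC offset plus one cosine and one sine component.
x = 0.5 + 1.0 * np.cos(2 * np.pi * 3 * n / N) + 0.25 * np.sin(2 * np.pi * 5 * n / N)

X = np.fft.rfft(x)                  # N/2 + 1 complex values
cos_amplitudes = 2 * X.real / N     # amplitudes of the cosine waves (xC0, xC1, ...)
sin_amplitudes = -2 * X.imag / N    # amplitudes of the sine waves

print(cos_amplitudes[0] / 2)        # DC component (zero cycles): about 0.5
print(cos_amplitudes[3])            # cosine making 3 complete cycles: about 1.0
print(sin_amplitudes[5])            # sine making 5 complete cycles: about 0.25
```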

Page 23: Speech technology   basics

CONVOLUTION & FOURIER ANALYSIS

The two main techniques of signal processing are convolution and Fourier analysis.

Strategy

Decompose signals into simple additive components,

Process the components in some useful manner,

Synthesize the components into a final result.

This is DSP.

Page 24: Speech technology   basics

CONVOLUTION

Convolution is a mathematical way of combining two signals to form a third signal.

Using the strategy of impulse decomposition, systems are described by a signal called the impulse response.

Convolution relates the three signals of interest: the input signal, the output signal, and the impulse response.

Convolution provides the mathematical framework for DSP.

Page 25: Speech technology   basics

IMPULSE RESPONSE

The delta function is a normalized impulse, that is, sample number zero has a value of one, while all other samples have a value of zero.

Delta function is frequently called the unit impulse.

Page 26: Speech technology   basics

IMPULSE RESPONSE

Impulse response is the signal that exits a system when a delta function (unit impulse) is the input.

If two systems are different in any way, they will have different impulse responses.

Just as the input and output signals are often called x[n] and y[n], the impulse response is usually given the name h[n].

Page 27: Speech technology   basics

IMPULSE RESPONSE

• Any impulse can be represented as a shifted and scaled delta function.

• Consider a signal, a[n], composed of all zeros except sample number 8, which has a value of -3.

• This is the same as a delta function shifted to the right by 8 samples, and multiplied by -3.

• In equation form: a[n] = -3δ[n-8]

Page 28: Speech technology   basics

IMPULSE RESPONSE

If the input to a system is an impulse, such as -3δ[n-8], what is the system's output?

Scaling and shifting the input results in an identical scaling and shifting of the output.

Page 29: Speech technology   basics

IMPULSE RESPONSE

If δ[n] results in h[n], it follows that -3δ[n-8] results in -3h[n-8].

In words, the output is a version of the impulse response that has been shifted and scaled by the same amount as the delta function on the input.

If you know a system's impulse response, you immediately know how it will react to any impulse.

Page 30: Speech technology   basics

How a system changes an input signal into an output signal

First, the input signal can be decomposed into a set of impulses, each of which can be viewed as a scaled and shifted delta function.

Second, the output resulting from each impulse is a scaled and shifted version of the impulse response.

Third, the overall output signal can be found by adding these scaled and shifted impulse responses.

In other words, if we know a system's impulse response, then we can calculate what the output will be for any possible input signal.
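The three steps above can be turned directly into a short NumPy sketch (the input signal and impulse response are made up for illustration): decompose the input into scaled, shifted impulses, add up the correspondingly scaled and shifted impulse responses, and check that the result matches a direct convolution.

```python
import numpy as np

x = np.array([1.0, 2.0, -1.0, 0.5])    # example input signal
h = np.array([0.5, 0.25, 0.1])         # example impulse response

# Output length for linear convolution is len(x) + len(h) - 1.
y = np.zeros(len(x) + len(h) - 1)

# Each input sample x[k] is a scaled, shifted delta function x[k]*δ[n-k];
# the system's response to it is x[k]*h[n-k].  Summing these gives the output.
for k, xk in enumerate(x):
    y[k:k + len(h)] += xk * h

assert np.allclose(y, np.convolve(x, h))   # same result as direct convolution
```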

Page 31: Speech technology   basics

Advantages over analogue processing

• It is able to provide far better levels of signal processing than is possible with analogue hardware alone.

• It is able to perform mathematical operations that enable many of the spurious effects of the analogue components to be overcome.

• In addition to this, it is possible to easily update a digital signal processor by downloading new software.

• Once a basic DSP card has been developed, it is possible to use this hardware design to operate in several different environments, performing different functions, purely by downloading different software.

• It is also able to provide functions that would not be possible using analogue techniques.

Page 32: Speech technology   basics

Limitations

• It is not able to provide perfect filtering, demodulation and other functions because of mathematical limitations.

• In addition to this, the processing power of the DSP card may impose some processing limitations.

• It is also more expensive than many analogue solutions, and thus it may not be cost effective in some applications.

Page 33: Speech technology   basics

SPEECH ANALYSIS

Extraction of properties or features from a speech signal.

Involves a transformation of s(n) into another signal, a set of signals, or a set of parameters.

Objectives: simplification and data reduction.

Page 34: Speech technology   basics

Signal

• Continuous signal (both parameters can assume a continuous range of values)

Vertical axis (y-axis): amplitude. Horizontal axis (x-axis): time.

The parameter on the y-axis (the dependent variable) is said to be a function of the parameter on the x-axis (the independent variable).

Page 35: Speech technology   basics

Speech Waveform

Here, the time axis is the horizontal axis, running from left to right, and the curve shows how the pressure increases and decreases in the signal.

This is the time domain representation.

Page 36: Speech technology   basics

Frequency domain (spectral)

[Figure: a 1-ms pulse f(t) in the time domain and its spectrum f(ω).]

Page 37: Speech technology   basics

Time domain vs. frequency domain (temporal vs. spectral)

Spectrum at 0.15 seconds into the utterance, in the beginning of the "o" vowel.

Page 38: Speech technology   basics

SHORT TIME ANALYSIS

Short segments of the speech signal are isolated and processed as if they were short segments from a sustained sound.

This is repeated as often as desired.

Each short segment is called an analysis frame.

Result: a single number or a set of numbers.

Page 39: Speech technology   basics

SHORT TIME ANALYSIS

• Assumption: properties of the speech signal change relatively slowly with time.

• This assumption leads to a variety of speech processing methods.

Page 40: Speech technology   basics

TYPES OF SHORT TIME ANALYSIS

Short Time Energy (Average Magnitude)

Short Time Average Zero crossing rate

Short Time Auto-correlation

Page 41: Speech technology   basics

Short Time Energy (Average Magnitude)

Amplitude of the speech signal varies appreciably with time

Amplitude of unvoiced segments is much lower than the amplitude of voiced segments

Short time energy provides a convenient representation that reflects these amplitude variations

Page 42: Speech technology   basics

Short Time Energy (Average Magnitude)

[Figure: (a) 50 ms of a vowel; (b) the squared version of (a); (c) the energy for a window length of 5 ms.]
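A minimal short-time energy sketch is given below (the 16 kHz sampling rate, non-overlapping frames, and synthetic test signal are assumptions; only the 5 ms window length comes from the slide): square the samples and sum them within each analysis frame.

```python
import numpy as np

fs = 16000                      # assumed sampling rate (Hz)
win = int(0.005 * fs)           # 5 ms analysis window, as in the slide
hop = win                       # non-overlapping frames, for simplicity

# Synthetic signal: a louder voiced-like segment followed by a quieter noisy one.
t = np.arange(0, 0.05, 1.0 / fs)
s = np.concatenate([np.sin(2 * np.pi * 150 * t), 0.1 * np.random.randn(len(t))])

# Short-time energy: sum of squared samples in each frame.
energy = np.array([np.sum(s[i:i + win] ** 2)
                   for i in range(0, len(s) - win + 1, hop)])
print(energy)    # large values for the louder segment, much smaller for the quieter one
```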

Page 43: Speech technology   basics

Short Time Average Zero crossing rate

For a continuous signal, a zero crossing occurs when s(t) = 0.

For a discrete signal, a zero crossing occurs if successive samples have different algebraic signs.

Page 44: Speech technology   basics

Short Time Average Zero crossing rate

For sinusoids F0 = ZCR/2

For speech signals calculation of F0 from ZCR is less precise

High ZCR – Unvoiced speech

Low ZCR – Voiced speech

Drawback – highly sensitive to noise.

ZCR is a simple measure of frequency content of the signal
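A sketch of the short-time zero crossing rate for a discrete signal follows (frame length, sampling rate, and test signals are assumptions): count sign changes between successive samples in each frame.

```python
import numpy as np

def short_time_zcr(s, frame_len):
    """Zero crossing rate per frame: sign changes between successive samples."""
    rates = []
    for i in range(0, len(s) - frame_len + 1, frame_len):
        signs = np.sign(s[i:i + frame_len])
        crossings = np.sum(np.abs(np.diff(signs)) > 0)
        rates.append(crossings / frame_len)
    return np.array(rates)

fs = 16000
t = np.arange(0, 0.02, 1.0 / fs)
voiced_like = np.sin(2 * np.pi * 120 * t)      # low frequency -> low ZCR
unvoiced_like = np.random.randn(len(t))        # noise-like -> high ZCR

print(short_time_zcr(voiced_like, 160))        # small values
print(short_time_zcr(unvoiced_like, 160))      # much larger values
```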


Page 45: Speech technology   basics

Short Time Autocorrelation

Speech signal: s(n)

Fourier transform of s(n): S(e^jω)

Energy spectrum: |S(e^jω)|²

The inverse Fourier transform of |S(e^jω)|² is the autocorrelation of s(n).

This preserves information about harmonic and formant amplitudes in s(n).

Page 46: Speech technology   basics

Autocorrelation - Significance

The autocorrelation function at zero lag contains the energy of the signal.

The period can be estimated by finding the location of the first maximum (away from zero lag) in the autocorrelation function.

The autocorrelation function contains much more information about the detailed structure of the signal.
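A sketch of autocorrelation-based F0 estimation (sampling rate, pitch range, and synthetic frame are assumptions): compute the autocorrelation of a frame and take the lag of the first strong peak away from zero lag as the pitch period.

```python
import numpy as np

fs = 16000
f0_true = 200.0                                 # pitch of the synthetic test frame
t = np.arange(0, 0.03, 1.0 / fs)
frame = np.sin(2 * np.pi * f0_true * t) + 0.1 * np.random.randn(len(t))

# Short-time autocorrelation of the frame (non-negative lags only).
r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]

# First major peak away from zero lag, searched inside a plausible pitch range.
min_lag = int(fs / 400)     # highest F0 considered: 400 Hz
max_lag = int(fs / 70)      # lowest F0 considered: 70 Hz
peak_lag = min_lag + np.argmax(r[min_lag:max_lag])

print("estimated F0:", fs / peak_lag, "Hz")     # close to 200 Hz
```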

Page 47: Speech technology   basics

Autocorrelation - Application

Applications

1. F0 estimation

2. Voiced /unvoiced determination

3. Linear prediction.

Page 48: Speech technology   basics

Cepstrum

s(n) → DFT → S(e^jω) → log magnitude → log|S(e^jω)| → IDFT → cepstrum

Cepstrum was derived by reversing the first four letters of "spectrum".

Cepstrum was introduced by Bogert, Healey and Tukey in 1963 for characterizing the seismic echoes resulting from earthquakes

A cepstrum is the result of taking the Inverse Fourier transform (IFT) of the log spectrum as if it were a signal.

Originally it was defined as ‘spectrum of spectrum’.

Operations on cepstra are labelled as quefrency analysis, liftering, or cepstral analysis

Page 49: Speech technology   basics

Why Cepstrum?

• The cepstrum can be seen as information about rate of change in the different spectrum bands.

• It has been used to determine the fundamental frequency of human speech.

• Cepstrum pitch determination is particularly effective because the effects of the vocal excitation (pitch) and vocal tract (formants) are additive in the logarithm of the power spectrum and thus clearly separate.

• The cepstrum is often used as a feature vector for representing the human voice and musical signals.

Page 50: Speech technology   basics

Cepstral concepts - Quefrency

The independent variable of a cepstral graph is called the quefrency.

The quefrency is a measure of time, though not in the sense of a signal in the time domain.

For example, if the sampling rate of an audio signal is 44100 Hz and there is a large peak in the cepstrum whose quefrency is 100 samples, the peak indicates the presence of a pitch that is 44100/100 = 441 Hz.

This peak occurs in the cepstrum because the harmonics in the spectrum are periodic, and the period corresponds to the pitch.
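The 44100/100 = 441 Hz example above can be reproduced with a short sketch (the harmonic-rich 441 Hz test tone is an assumption): take the IDFT of the log magnitude spectrum and read off the quefrency of the main peak.

```python
import numpy as np

fs = 44100
f0 = 441.0                                    # chosen so the pitch period is 100 samples
t = np.arange(0, 0.05, 1.0 / fs)
# Harmonic-rich signal: several harmonics of 441 Hz, loosely like voiced speech.
x = sum(np.sin(2 * np.pi * f0 * k * t) / k for k in range(1, 6))

spectrum = np.fft.fft(x * np.hamming(len(x)))
log_mag = np.log(np.abs(spectrum) + 1e-10)    # small constant avoids log(0)
cepstrum = np.fft.ifft(log_mag).real          # IDFT of the log spectrum

# Peak in a plausible quefrency (pitch period) range.
qmin, qmax = int(fs / 500), int(fs / 70)
peak_quefrency = qmin + np.argmax(cepstrum[qmin:qmax])

print("peak quefrency:", peak_quefrency, "samples")    # about 100
print("estimated pitch:", fs / peak_quefrency, "Hz")   # about 441 Hz
```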

Page 51: Speech technology   basics

Cepstral concepts - Rahmonics

• The x-axis of the cepstrum has units of quefrency, and peaks in the cepstrum (which relate to periodicities in the spectrum) are called rahmonics.

• To obtain an estimate of the fundamental frequency from the cepstrum we look for a peak in the quefrency region

Page 52: Speech technology   basics

Cepstral concepts - Liftering

A filter that operates on a cepstrum might be called a lifter.

A low pass lifter is similar to a low pass filter in the frequency domain.

It can be implemented by multiplying the cepstrum by a window in the cepstral domain; when the result is converted back, it yields a smoother signal.

Page 53: Speech technology   basics

Cepstral Analysis

• Low quefrency components (up to about 3 to 4 ms) predominantly correspond to the spectral envelope. These are also called cepstral coefficients.

• High quefrency components (beyond 4 ms) predominantly correspond to the periodic excitation or source.

• If the signal is periodic, a strong peak is seen in the high quefrency region at T0, the pitch period.

• If the signal is unvoiced, components are distributed over all quefrencies.

Page 54: Speech technology   basics

The cepstral coefficients

• Cepstral coefficients can be derived both from the filter-bank and linear predictive analyses.

• By keeping only the first few cepstral coefficients and setting the remaining coefficients to zero, it is possible to smooth the harmonic structure of the spectrum.

• Cepstral coefficients are therefore very convenient coefficients to represent the speech spectral envelope.

• Cepstral coefficients have rather different dynamics, the higher coefficients showing the smallest variances.

Page 55: Speech technology   basics

Cepstrum

Formants can be estimated by locating the peaks in the log spectrum.

For voiced speech there is a peak in the cepstrum

For unvoiced speech there is no such peak in the cepstrum

Position of the peak is a good estimate of the Pitch Period

Page 56: Speech technology   basics

Linear Predictive Coding

• Linear Predictive Coding (LPC) is one of the most powerful speech analysis techniques.

• It is one of the most useful methods for encoding good quality speech at a low bit rate.

• It provides extremely accurate estimates of speech parameters, and is relatively efficient for computation.

Page 57: Speech technology   basics

Linear Predictive Coding

[Source-filter model: source (excitation signal) → transfer function → speech]

We can use the LPC coefficients to separate a speech signal into two parts: the transfer function (which contains the vocal quality, i.e., the formants) and the excitation (which contains the pitch and the loudness).

Page 58: Speech technology   basics

• LPC analyzes the speech signal by estimating the formants, removing their effects from the speech signal, and estimating the intensity and frequency of the remaining buzz.

• The process of removing the formants is called inverse filtering, and the remaining signal is called the residue.

Page 59: Speech technology   basics

• The numbers which describe the formants and the residue can be stored or transmitted somewhere else. LPC synthesizes the speech signal by reversing the process: use the residue to create a source signal, use the formants to create a filter (which represents the tube), and run the source through the filter, resulting in speech.

• Because speech signals vary slowly with time, this process is done on short chunks of the speech signal, which are called frames. Usually 30 to 50 frames per second give intelligible speech with good compression.

Page 60: Speech technology   basics

Basic Principle

A Speech sample can be approximated as a linear combination of past speech samples

By minimizing the sum of the squared differences between the actual speech samples and the predicted ones, a unique set of predictor coefficients can be determined.

Linear Predictive Coding
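A sketch of this principle using the autocorrelation method (the prediction order, sampling rate, and two-tone test frame are assumptions, and real LPC front ends typically use the Levinson-Durbin recursion instead of a direct solve): predict each sample from the previous p samples and solve the normal equations for the predictor coefficients.

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Autocorrelation-method LPC: solve the normal equations R a = r."""
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:])
                  for k in range(order + 1)])                  # autocorrelation r[0..p]
    R = np.array([[r[abs(i - j)] for j in range(order)]        # Toeplitz matrix
                  for i in range(order)])
    return np.linalg.solve(R, r[1:])    # coefficients a_k in s[n] ≈ Σ a_k s[n-k]

fs = 8000
t = np.arange(0, 0.032, 1.0 / fs)
# Synthetic frame: two "formant-like" tones plus a little noise, Hamming windowed.
frame = (np.sin(2 * np.pi * 300 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t)
         + 0.05 * np.random.randn(len(t))) * np.hamming(len(t))

a = lpc_coefficients(frame, order=8)

# Prediction error (the residue) is much smaller than the frame itself.
pred = np.convolve(frame, np.concatenate(([0.0], a)))[:len(frame)]
residue = frame - pred
print("residue energy:", np.sum(residue ** 2), "signal energy:", np.sum(frame ** 2))
```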

Page 61: Speech technology   basics

Applications

1. F0 estimation

2. Pitch

3. Vocal tract area functions

4. Representing speech for low bit rate transmission or storage

Linear Predictive Coding

Page 62: Speech technology   basics

Highlights

1. Extremely accurate estimation of Speech Parameters

2. High speed of Computation

3. Robust, reliable & accurate method

Linear Predictive Coding

Page 63: Speech technology   basics

Ways in which the basic models of analysis and the associated parameters from them are used in an integrated system:

• Diagnostic applications (CSL & VAGMI)

• Digital transmission of voice communication

• Man-machine communication by voice:
  a. Voice response systems
  b. Speaker recognition systems
  c. Speech recognition systems

Page 64: Speech technology   basics

Pre-emphasis

[Figure: speech spectrum before pre-emphasis and after pre-emphasis]

Pre-emphasis boosts the amount of energy in the high frequencies. For voiced segments like vowels, there is more energy at the lower frequencies than at the higher frequencies; this is known as spectral tilt.

Boosting the high frequency energy makes information from these higher formants more available to the acoustic model and improves phone detection accuracy.

This pre-emphasis is done with a filter.
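A common form of that filter is the first-order difference y[n] = x[n] - α·x[n-1]; the sketch below uses α = 0.97, a typical value that is an assumption rather than something stated on the slide.

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    """First-order pre-emphasis filter: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

fs = 16000
t = np.arange(0, 0.02, 1.0 / fs)
# Vowel-like test signal: strong low-frequency component, weak high-frequency one.
x = np.sin(2 * np.pi * 200 * t) + 0.05 * np.sin(2 * np.pi * 3000 * t)

y = pre_emphasis(x)
# In y, the 3000 Hz component is boosted relative to the 200 Hz component,
# reducing the spectral tilt described above.
```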

Page 65: Speech technology   basics

Windowing

The goal of feature extraction is to provide spectral features.

Speech is a non-stationary signal: its spectrum changes very quickly, so we cannot usefully extract spectral features from an entire utterance or conversation at once.

Instead, we want to extract spectral features from a small window of speech that characterizes a particular subphone (its statistical properties are constant within this region).

Windowing determines the portion of the speech signal that is to be analyzed by zeroing out the signal outside the region of interest.
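A minimal framing-and-windowing sketch follows (the 25 ms frame length and 10 ms frame shift are typical values assumed here, not values from the slides): slide a window along the signal and multiply each frame by a Hamming window.

```python
import numpy as np

fs = 16000
frame_len = int(0.025 * fs)     # 25 ms analysis frame (assumed, typical value)
frame_shift = int(0.010 * fs)   # 10 ms shift between successive frames

signal = np.random.randn(fs)    # stand-in for one second of speech samples

window = np.hamming(frame_len)
frames = []
for start in range(0, len(signal) - frame_len + 1, frame_shift):
    frames.append(signal[start:start + frame_len] * window)   # windowed frame

frames = np.array(frames)       # one row per windowed analysis frame
print(frames.shape)
```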

[MFCC pipeline: pre-emphasis → window → DFT → Mel filter bank → log → IDFT → deltas]

Page 66: Speech technology   basics

Windowing techniques

• Rectangular
• Bartlett
• Hamming
• Hanning
• Blackman
• Kaiser

The most commonly used are the rectangular and the Hamming windows.

Page 67: Speech technology   basics

Bartlett Window

Rectangular Window

Page 68: Speech technology   basics

Hanning Window

Hamming Window

Page 69: Speech technology   basics

Kaiser Window

Blackman Window

Page 70: Speech technology   basics
Page 71: Speech technology   basics

DFT


Spectrum at 0.15 seconds into the utterance, in the beginning of the "o" vowel.

Page 72: Speech technology   basics

The Mel frequency

Human hearing is not equally sensitive at all frequency bands.

Modeling this property of human hearing during feature extraction improves speaker recognition performance.

The form of the model used in MFCCs is to warp the frequencies output by the DFT onto the mel scale.

A mel (Stevens et al, 1937; Stevens and Volkmann, 1940) is a unit of pitch. Pairs of sounds that are perceptually equidistant in pitch are separated by an equal number of mels.

The mapping between frequency in Hz and the mel scale is linear below 1000 Hz and logarithmic above 1000 Hz. The mel frequency can be computed from the raw acoustic frequency as follows: Mel(f) = 1127 ln(1 + f/700)
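The formula translates directly into code; the sketch below (the test frequencies are arbitrary) also includes the inverse mapping, which is convenient when placing mel filter bank centre frequencies.

```python
import numpy as np

def hz_to_mel(f):
    """Mel(f) = 1127 * ln(1 + f / 700), the mapping given above."""
    return 1127.0 * np.log(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse mapping from mels back to Hz."""
    return 700.0 * (np.exp(m / 1127.0) - 1.0)

for f in (100, 500, 1000, 2000, 4000):
    print(f, "Hz ->", round(hz_to_mel(f), 1), "mel")
# Below 1000 Hz the mapping is roughly linear; above 1000 Hz it grows logarithmically.
```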


Page 73: Speech technology   basics

Mel filter Bank

During MFCC computation, we implement this intuition by creating a bank of filters that collect energy from each frequency band, with 10 filters spaced linearly below 1000 Hz and the remaining filters spread logarithmically above 1000 Hz .

Finally, we take the log of each of the mel spectrum values.

In general, the human response to signal level is logarithmic - humans are less sensitive to slight differences in amplitude at high amplitudes than at low amplitudes.

In addition, using a log makes the feature estimates less sensitive to variations in input such as power variations due to the speaker’s mouth moving closer or further from the microphone.

Page 74: Speech technology   basics

Log magnitude spectrum

[Figure: the magnitude spectrum and the log magnitude spectrum of the same frame]


Replace each amplitude value in the magnitude spectrum with its log

Visualize the log spectrum as if itself were a waveform

Page 75: Speech technology   basics

Cepstrum is the spectrum of the log of the spectrum. By taking the spectrum of the log spectrum, we have left the frequency domain of the spectrum and gone back to the time domain


IDFT

Page 76: Speech technology   basics

There is a large peak around 120, corresponding to the F0. There are various other components at lower values on the x-axis. These represent the vocal tract filter (the position of the tongue and the other articulators). Thus, if we are interested in detecting phones, we can make use of just the lower cepstral values. If we are interested in detecting pitch, we can use the higher cepstral values.


Cepstrum

Page 77: Speech technology   basics

MFCC: 12 coefficients

For MFCC extraction, we generally just take the first 12 cepstral values. These 12 coefficients will represent information solely about the vocal tract filter, cleanly separated from information about the glottal source. It turns out that cepstral coefficients have the extremely useful property that the variance of the different coefficients tends to be uncorrelated. This is not true for the spectrum, where spectral coefficients at different frequency bands are correlated.
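For reference, the whole chain is available in open-source libraries; the sketch below assumes the librosa library and a placeholder audio file, and is not the implementation described on the slides. It extracts 13 coefficients per frame, of which the 0th is closely related to frame energy and the remaining 12 describe the spectral envelope.

```python
import librosa    # assumption: librosa is installed; any MFCC implementation would do

# "speech.wav" is a placeholder path for a recorded utterance.
y, sr = librosa.load("speech.wav", sr=16000)

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # 13 coefficients per frame
print(mfcc.shape)                                    # (13, number_of_frames)
```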


MFCC

Page 78: Speech technology   basics

The extraction of the cepstrum with the inverse DFT results in 12 cepstral coefficients for each frame.

We next add a 13th feature: the energy from the frame.

Energy correlates with phone identity and so is a useful cue for phone detection (vowels and sibilants have more energy than stops, etc.).

The energy in a frame is the sum over time of the power of the samples in the frame; thus, for a signal x in a window from time sample t1 to time sample t2, the energy is

Energy = Σ (t = t1 to t2) x²[t]


Energy

Page 79: Speech technology   basics

Deltas

The speech signal is not constant from frame to frame. This change, such as the slope of a formant at its transitions, or the nature of the change from a stop closure to stop burst, can provide a useful cue for phone identity.

For this reason, we also add features related to the change in cepstral features over time.

We do this by adding for each of the 13 features (12 cepstral features plus energy) a delta or velocity feature and a double delta or acceleration feature.

Each of the 13 delta features represents the change between frames in the corresponding cepstral or energy feature, and each of the 13 double delta features represents the change between frames in the corresponding delta features.
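A minimal sketch of the delta and double-delta computation (using a simple frame-to-frame difference; production front ends usually fit a short regression over several frames, and the feature matrix here is random placeholder data):

```python
import numpy as np

def deltas(features):
    """Frame-to-frame change of each feature (simple first difference)."""
    d = np.zeros_like(features)
    d[1:] = features[1:] - features[:-1]    # change relative to the previous frame
    return d

# Placeholder feature matrix: one row per frame, 13 columns (12 cepstra + energy).
feats = np.random.randn(100, 13)

delta = deltas(feats)               # velocity features
delta_delta = deltas(delta)         # acceleration features

full = np.hstack([feats, delta, delta_delta])   # 39 features per frame
print(full.shape)                               # (100, 39)
```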


Page 80: Speech technology   basics
Page 81: Speech technology   basics

SPEECH SPECTROGRAPH

• A speech spectrograph is a laboratory instrument that displays a graphical representation of the amplitudes of the various component frequencies of speech on a time-based plot.

• It is a tool for analyzing vocal output.

• It is used for identifying the formants, and for real-time biofeedback in voice training and therapy.

Page 82: Speech technology   basics

SPEECH SPECTROGRAPH (Analog)

Page 83: Speech technology   basics

Speech Spectrograph (Digital)

[Pipeline: pre-emphasis → window → DFT → plot of amplitude vs. frequency → spectrogram plotted against time]

Page 84: Speech technology   basics


Page 85: Speech technology   basics

SPEECH SPECTROGRAPH

• There are two main kinds of analysis performed by the spectrograph: wideband (with a bandwidth of 300-500 Hz) and narrowband (with a bandwidth of 45-50 Hz).
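A digital sketch of the two analyses using SciPy follows (the synthetic harmonic signal and exact window lengths are assumptions): a short analysis window gives a wideband spectrogram, a long window gives a narrowband one.

```python
import numpy as np
from scipy.signal import spectrogram

fs = 16000
t = np.arange(0, 1.0, 1.0 / fs)
# Stand-in for voiced speech: a 150 Hz fundamental plus harmonics.
x = sum(np.sin(2 * np.pi * 150 * k * t) / k for k in range(1, 20))

# Wideband: short window (about 3 ms) -> wide analysis bandwidth, good time resolution.
f_wb, t_wb, S_wb = spectrogram(x, fs, nperseg=48)

# Narrowband: long window (about 20 ms) -> bandwidth near 50 Hz, resolves the harmonics.
f_nb, t_nb, S_nb = spectrogram(x, fs, nperseg=320)

print(S_wb.shape, S_nb.shape)    # (frequency bins, time frames) for each analysis
```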

Page 86: Speech technology   basics

WIDEBAND SPECTROGRAPH

• When used for normal speech with a fundamental frequency of around 100-200 Hz, the wideband analysis will pick up energy from several harmonics at once and add them together.

• The F0 (fundamental frequency) can be determined from the graphic.

• Also, the frequencies and relative strengths of the first two formants (F1 and F2) are visible as dark, rather blurry concentrations of energy.