Stein VoiceDSP 3.1
Voice DSP Processing III
Yaakov J. Stein
Chief Scientist, RAD Data Communications
Voice DSP
Part 1 Speech biology and what we can learn from it
Part 2 Speech DSP (AGC, VAD, features, echo cancellation)
Part 3 Speech compression techniques
Part 4 Speech Recognition
Voice DSP - Part 3
Simple coders
– G.711 A-law / μ-law
– Delta
– ADPCM
CELP coders
– LPC-10
– RELP / GSM
– CELP
Other methods
– MBE
– MELP
– STC
– Waveform Interpolation
Encoder Criteria
Encoders can be compared in many ways; the most important criteria are:
– Bit rate (Kbps)
– Speech quality (MOS)
– Delay (algorithmic [frame + lookahead] + computational + propagation)
– Computational complexity
Often less important:
– Bit exactness (interoperability)
– Transcoding robustness
– Behavior on non-speech (babble noise, tones, music)
– Bit error robustness
PSTN Quality Coders
Rate (Kbps)  ITU-T encoder
128          16 bit linear sampling
64           G.711 A-law / μ-law 8 bit log sampling
32           G.726 ADPCM
16           G.728 LD-CELP
8            G.729* CS-ACELP
4            SG16 Q21* ???
* toll quality MOS rating, but higher delay
Digital Cellular Standards
Coder Rate(Kbps) Approach Quality(MOS) Complexity Delay(ms)
GSM FR 13 RPE-LTP 3.5 Low 40
GSM HR 5.6 VSELP <3.5 High 45
GSM EFR 12.2 ACELP 4.0 Medium 45
GSM AMR 4-12 ACELP 3.5-4.0 Medium 45
TIA IS54 8 VSELP 3.5 Medium 45
TIA IS641 8 ACELP 4.0 Medium 45
TIA IS96 8* QCELP <3.5 Medium 45
TIA EVRC 8* ACELP 4.0 High 50
TIA Q13 13* QCELP 4.0 Med-High 45?
* = Variable rate
Military / Satellite Standards
Coder Rate(Kb/s) Approach Quality(MOS) Complexity Delay(ms)
FS-1015 (LPC-10) 2.4 LPC 2.5 Low-med 13.5
FS-1016 4.8 CELP 3.0 High 67.5
MELP 2.4 MELP 3.3 Med-high 67
Satellite 1 4.8 IMBE 3.3-3.5 Medium 100
Satellite 2 2.4-3.6 AMBE 3.3-3.5 Medium 100
Voice DSP
Simple coders
G.711
16 bit linear sampling at 8 KHz means 128 Kbps
Minimal toll quality linear sampling is 12 bit (96 Kbps)
8 bit linear sampling (256 levels) is noticeably noisy
Due to
– the prevalence of low amplitudes
– the logarithmic response of the ear
we can use logarithmic sampling
Different standards for different places
G.711 - cont.
μ-law:  F(x) = sgn(x) · ln(1 + μ|x|) / ln(1 + μ)     μ = 255 (North America)
A-law:  F(x) = sgn(x) · A|x| / (1 + ln A)  for |x| < 1/A
        F(x) = sgn(x) · (1 + ln(A|x|)) / (1 + ln A)  otherwise     A = 87.56 (rest of world)
Although very different looking, they are nearly identical
G.711 approximates these expressions by 16 staircase straight-line segments
(8 negative and 8 positive); μ-law: horizontal segment through origin, A-law: vertical segment
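As a sketch, the continuous μ-law compander and its inverse can be written as follows (illustrative helper names; the 16-segment staircase approximation actually used in G.711 is omitted, and μ = 255 as above):

```python
import math

MU = 255.0  # North American mu-law constant

def mu_law_compress(x, mu=MU):
    """Continuous mu-law companding of a sample x in [-1, 1]."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mu_law_expand(y, mu=MU):
    """Inverse companding: recover x from the compressed value y."""
    return math.copysign(((1.0 + mu) ** abs(y) - 1.0) / mu, y)

# Low amplitudes get most of the levels: an input at 1% of full
# scale already maps to roughly 23% of the output range.
```

This is exactly why logarithmic sampling suits speech: the prevalent low-amplitude samples are spread over many quantizer levels, matching the ear's logarithmic response.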
DPCM
Due to the low-pass character of speech,
differences are usually smaller than signal values
and hence require fewer bits to quantize
Simplest Delta-PCM (DPCM): quantize the first difference signal
Delta-PCM: quantize the difference between the signal and a prediction
ŝn = p(sn-1, sn-2, …, sn-N) = Σi pi sn-i
If we predict using a linear combination (FIR filter), this is linear prediction
Delta modulation (DM): use only the sign of the difference (1-bit DPCM)
Sigma-delta (1 bit): oversample, DM, trade off rate for bits
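A minimal sketch of the prediction idea (hypothetical helper names; the predictor is just an FIR filter over past samples):

```python
def predict(history, coeffs):
    """FIR prediction: s_hat[n] = sum_i p_i * s[n-i].
    `history` holds the most recent samples, oldest first."""
    return sum(p * s for p, s in zip(coeffs, reversed(history)))

def prediction_errors(signal, coeffs):
    """Prediction error e[n] = s[n] - s_hat[n] over the whole signal."""
    N = len(coeffs)
    errors = []
    for n in range(len(signal)):
        # pad with zeros before the start of the signal
        hist = [0.0] * max(0, N - n) + signal[max(0, n - N):n]
        errors.append(signal[n] - predict(hist, coeffs))
    return errors
```

With `coeffs = [1.0]` this reduces to the first difference: a slowly varying (low-pass) signal yields small errors that need fewer bits than the samples themselves.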
DPCM with prediction
If the linear prediction works well, then the prediction error
εn = sn - ŝn
will be lower in energy and whiter than sn itself!
Only the error is needed for reconstruction,
since the predictable portion can be predicted: sn = ŝn + εn !
[Block diagram: encoder — the prediction filter produces ŝn, which is subtracted from sn to give εn; decoder — the prediction filter output ŝn is added back to εn to recover sn]
DPCM - post-filtering
Simplest case: if highly oversampled,
then the previous sample sn-1 predicts sn well,
so we can use DM:
if sgn(εn) < 0 then -Δ else +Δ
For DM there is no way to encode zero prediction error,
so the decoded signal oscillates wildly
The standard remedy is a post-filter that low-pass filters this noise
But there is a b i g g e r problem!
Open-loop Prediction
The encoder's linear predictor is also present in the decoder,
but there it runs as feedback
The decoder's predictions would be accurate given the precise error εn,
but it receives only the quantized error ε̃n, and the two models diverge!
[Block diagram: encoder — sn minus prediction-filter (PF) output, then quantizer Q; decoder — inverse quantizer IQ followed by PF in a feedback loop]
Side Information
There are two ways to solve the problem ...
The first way is to send the prediction coefficients
from the encoder to the decoder
and not to let the decoder derive them
The coefficients sent are called side-information
Using side-information means higher bit-rate
(since both εn and the coefficients must be sent)
The second way does not require increasing bit rate
Closed-loop Prediction
To ensure that the encoder and decoder stay “in sync”
we put the decoder into the encoder
Thus the encoder’s predictions are identical to the decoder’s
and no model difference accumulates
[Block diagram: the encoder quantizes the error (Q), immediately inverse-quantizes it (IQ), and feeds the reconstruction to its prediction filter (PF) — exactly the computation the decoder performs]
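A toy closed-loop DPCM loop (first-order predictor, uniform quantizer; all names illustrative). The key point: the encoder predicts from its own *reconstructed* samples, so the decoder, running the same recursion, stays bit-identical:

```python
def closed_loop_dpcm_encode(signal, step=0.1):
    """Quantize the prediction error, but update the predictor from
    the reconstructed sample, exactly as the decoder will."""
    codes = []
    pred = 0.0
    for s in signal:
        code = round((s - pred) / step)   # quantized error index
        codes.append(code)
        pred = pred + code * step         # reconstruction = next prediction
    return codes

def closed_loop_dpcm_decode(codes, step=0.1):
    """Decode by running the identical recursion on the codes."""
    recon = []
    pred = 0.0
    for code in codes:
        pred = pred + code * step
        recon.append(pred)
    return recon
```

Because both sides iterate the same map, the reconstruction error stays bounded by step/2 per sample instead of accumulating as in the open-loop scheme.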
Two types of error
For DM there are two types of error, depending on the step size Δ:
– Δ too small: the estimate cannot keep up with rapid signal changes
– Δ too large: the estimate oscillates wildly around a slowly varying signal
Adaptive Step Size
Speech signals are very nonstationary
We need to adapt the step size Δ to match signal behavior
– Increase Δ when the signal changes rapidly
– Decrease Δ when the signal is relatively constant
Simplest method (for DM only):
– If the present bit is the same as the previous, multiply Δ by K (K = 1.5)
– If the present bit is different, divide Δ by K
– Constrain Δ to a predefined range
More general method:
– Collect N samples in a buffer (N = 128 … 512)
– Compute the standard deviation in the buffer
– Set Δ to a fraction of the standard deviation
• Send Δ to the decoder as side-information, or
• Use backward adaptation (closed-loop computation)
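The simplest rule above (DM only) might be sketched like this; K = 1.5 follows the slide, while the Δ range and function names are illustrative:

```python
def adaptive_dm(signal, delta0=0.1, K=1.5, dmin=0.01, dmax=1.0):
    """Adaptive delta modulation: same bit as last time -> grow the
    step (signal is running away), different bit -> shrink it."""
    bits, recon = [], []
    est, delta, prev = 0.0, delta0, None
    for s in signal:
        bit = 1 if s >= est else 0
        if prev is not None:
            delta = min(dmax, delta * K) if bit == prev else max(dmin, delta / K)
        est += delta if bit else -delta     # 1-bit update of the estimate
        bits.append(bit)
        recon.append(est)
        prev = bit
    return bits, recon
```

On a rising ramp, runs of identical bits grow Δ so the estimate keeps up; on a constant stretch the bits alternate and Δ shrinks toward its minimum.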
ADPCM
G.726 has
– Adaptive predictor
– Adaptive quantizer and inverse quantizer
– Adaptation speed control
– Tone and transition detector
– Mechanism to prevent loss from tandeming
Computational complexity is relatively high (10 MIPS)
24 and 16 Kbps modes are defined, but are not toll quality
G.727 offers the same rates but is embedded, for packetized networks
ADPCM only exploits the general low-pass characteristic of speech
What is the next step?
Scalar Quantization
A standard A/D has preset, evenly distributed levels
G.711 has preset, non-evenly distributed levels
Given a criterion we can make an adaptive quantizer
Simplest criterion: minimum squared quantization error
εn = sn - ŝn     E = Σn εn²
We need an algorithm to find the optimal placement of levels [EM-type algorithms]
Vector Quantization
We can do the same thing in higher dimensions
Here we wish to match input data xi, i = 1 .. N
to a codebook of codewords Cj, j = 1 .. M
with minimal mean squared error
E = Σi=1..N | xi - C(xi) |²
where C(xi) is the codeword closest to xi in the codebook
[Figure: an input vector xi and the partition of the plane among codewords C1 … C4]
LBG Algorithm for VQ
Input xi, i = 1 .. N [clustering, unsupervised learning]
Randomly initialize codebook Cj, j = 1 .. M
Loop until converged:
Classification step
  for i = 1 .. N
    for j = 1 .. M
      compute Dij² = | xi - Cj |²
    classify xi to the Cj with minimal Dij²
Expectation step
  for j = 1 .. M correct the center: Cj = (1/Nj) Σ xi over the xi classified to Cj
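A minimal 1-d sketch of the two LBG steps above (illustrative names; real speech VQ runs on vectors, and LBG typically also grows the codebook by splitting codewords):

```python
def lbg_1d(data, codebook, iterations=10):
    """Alternate classification (nearest codeword) and expectation
    (replace each codeword by the centroid of its cluster)."""
    codebook = list(codebook)
    for _ in range(iterations):
        clusters = [[] for _ in codebook]
        for x in data:                      # classification step
            j = min(range(len(codebook)), key=lambda k: (x - codebook[k]) ** 2)
            clusters[j].append(x)
        for j, cluster in enumerate(clusters):   # expectation step
            if cluster:                     # keep empty cells unchanged
                codebook[j] = sum(cluster) / len(cluster)
    return codebook
```

On well-separated data the codewords converge to the cluster centroids, which is the minimum-MSE placement for this partition.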
Speech Application of VQ
OK, I understand what to do with scalar quantization
what is VQ good for ?
We could try to simply VQ frames of speech samples
but this doesn’t work well !
We can VQ spectra or sub-band components
We often VQ parameter sets (e.g. LPC coefficients)
We also VQ model error signals
Voice DSP
CELP coders
LPC-10
Based on 10th order LPC (obviously) [Bishnu Atal]
180 sample blocks are encoded into 54 bits:
Pitch + U/V (found using AMDF)  7 bits
Gain  5 bits
10 reflection coefficients found by the covariance method:
– first two coefficients converted to log area ratios
– L1, L2, a3, a4: 5 bits each
– a5, a6, a7, a8: 4 bits each
– a9: 3 bits, a10: 2 bits
total  41 bits
Sync  1 bit
54 bits sent 44.44 times per second results in 2400 bps
By using VQ the bit rate could be reduced to under 1 Kbps!
LPC-10 speech is intelligible, but synthetic sounding,
and much of the speaker identity is lost!
The Residual
We can recover sn by adding back the residual error signal
sn = ŝn + εn
So if we send εn as side-information we can recover sn
εn is smaller than sn, so it may require fewer bits!
But εn is whiter than sn, so it may require many bits!
The question has now become:
How can we compress the residual?
Encoding the Residual
RELP (6-9.6 Kbps)
Low-pass filter and downsample the residual to 1 KHz
Encode using ADPCM
VQ-RELP (4.8 Kbps)
VQ coding of the residual
RELP (4.8 Kbps)
Perform FFT on the residual
Baseband coding
RPE-LTP (GSM-FR at 13 Kbps) Regular Pulse Excitation - Long Term Prediction
Perform long term prediction (pitch recovery)
Subtract to obtain a new residual
Decimate by 3, use the phase with maximum energy
Extract a 6-bit overall gain
Encode the remainder with 3 bits/sample
Residual and Excitation
Synthesis filter: sn = en + Σm am sn-m
Analysis filter: rn = sn - Σm am sn-m
So rn = en !
[Figure: excitation en → all-pole filter → sn; sn → all-zero filter → residual rn]
Note: the all-zero filter is the inverse of the all-pole filter
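The analysis/synthesis pair can be sketched directly from the two equations (plain Python, illustrative names):

```python
def analysis_filter(s, a):
    """All-zero filter: r[n] = s[n] - sum_m a[m] * s[n-1-m]."""
    r = []
    for n in range(len(s)):
        pred = sum(a[m] * s[n - 1 - m] for m in range(len(a)) if n - 1 - m >= 0)
        r.append(s[n] - pred)
    return r

def synthesis_filter(e, a):
    """All-pole filter: s[n] = e[n] + sum_m a[m] * s[n-1-m]."""
    s = []
    for n in range(len(e)):
        pred = sum(a[m] * s[n - 1 - m] for m in range(len(a)) if n - 1 - m >= 0)
        s.append(e[n] + pred)
    return s
```

Running the residual of a signal back through the synthesis filter reproduces the signal, which is exactly the rn = en identity on the slide: the two filters are inverses.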
CELP
Atal’s idea:
Find a way to efficiently encode the excitation!
Questions:
How can we find the excitation?
Theoretically, by algebra (invert the filter!)
How can we efficiently encode the residual?
VQ - Code Excited Linear Prediction
How can we efficiently find the best codeword?
Exhaustive search
[Figure: excitation en → LPC synthesis filter → sn]
CELP - cont.
Atal and friends (Schroeder, Remde, Singhal, etc.) discovered:
Even random codebooks work well [Gaussian, uniform]
We don’t need large codebooks [e.g. 1024 codewords for 40 samples]
We can center-clip with little loss
A codebook with constant amplitude is almost as good
So we can use codebooks with structure (and save storage/search/bits):
– Multipulse (MP)
– Constant Amplitude Pulse
– Regular Pulse (RP)
Special Excitations
The shift technique reduces random CB operations from O(N²) to O(N)
[a b c d e f] [c d e f g h] [e f g h i j] ...
Using a small number of ±1 amplitude pulses leads to MIPS reduction
Since most values are zero, there are few operations
Since amplitudes are ±1, there are no true multiplications
In a CB containing both CW and -CW we can save half
Algebraic codebooks exploit algebraic structure
Example: choose pulses according to a Hadamard matrix
Using the FHT reduces computation
Conjugate structure codebooks
The excitation is the sum of codewords from two related CBs
Analysis by Synthesis
Finding the best codeword by exhaustive search
[Figure: each codebook entry is passed through the LPC synthesis filter, subtracted from sn, its error energy computed, and the minimum-energy codeword selected]
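The exhaustive analysis-by-synthesis loop might look like this (an illustrative sketch; real CELP coders add gains, perceptual weighting, and the fast-search tricks of the previous slide):

```python
def synthesize(e, a):
    """All-pole LPC synthesis: s[n] = e[n] + sum_m a[m] * s[n-1-m]."""
    s = []
    for n in range(len(e)):
        s.append(e[n] + sum(a[m] * s[n - 1 - m]
                            for m in range(len(a)) if n - 1 - m >= 0))
    return s

def abs_search(target, codebook, a):
    """Synthesize every codeword; keep the one with minimal error energy."""
    best_j, best_err = -1, float("inf")
    for j, cw in enumerate(codebook):
        synth = synthesize(cw, a)
        err = sum((t - y) ** 2 for t, y in zip(target, synth))
        if err < best_err:
            best_j, best_err = j, err
    return best_j, best_err
```

If the true excitation happens to be in the codebook, the search finds it with zero error; in general it returns the closest match in the synthesized (not the excitation) domain, which is the whole point of analysis by synthesis.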
Perceptual Weighting
The criterion for selecting the best codeword should be perceptual,
not simply the energy of the difference signal!
We perceptually weight both the signal and the synthesized signal
[Figure: a PW filter applied to sn and to each synthesized candidate before the subtraction]
Since PW is a filter, we need use it only once, on the difference
[Figure: the equivalent structure with a single PW filter after the subtraction]
Perceptual Weighting - cont.
The most important PW effect is masking
Coding error energy near formants is not heard anyway,
so we allow higher error near the formants
but demand lower perceivable error energy elsewhere
To do this we de-emphasize according to the LPC spectrum!
The simplest filter is A(z) = 1 - Σi ai z^-i where the ai are the LPC coefficients
How do we take the critical bandwidth into account? We perform bandwidth expansion:
W(z) = (1 - Σi γ1^i ai z^-i) / (1 - Σi γ2^i ai z^-i)
with denominator expansion > numerator (γ2 < γ1)
Expanding by a factor γ widens each bandwidth by BW = -ln(γ) Fs / π
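Bandwidth expansion and the weighting filter W(z) = A(z/γ1)/A(z/γ2) can be sketched as below; γ1 = 0.9 and γ2 = 0.6 are assumed "typical" values for illustration, not taken from any particular standard:

```python
def bw_expand(a, gamma):
    """Replace a_i by gamma**i * a_i, i.e. evaluate A(z/gamma):
    poles/zeros move toward the origin, widening their bandwidths."""
    return [(gamma ** (i + 1)) * ai for i, ai in enumerate(a)]

def perceptual_weight(s, a, gamma1=0.9, gamma2=0.6):
    """Apply W(z) = A(z/gamma1) / A(z/gamma2) to the signal s."""
    num = bw_expand(a, gamma1)   # all-zero (numerator) part
    den = bw_expand(a, gamma2)   # all-pole part, expanded more
    out = []
    for n in range(len(s)):
        x = s[n] - sum(num[m] * s[n - 1 - m]
                       for m in range(len(num)) if n - 1 - m >= 0)
        y = x + sum(den[m] * out[n - 1 - m]
                    for m in range(len(den)) if n - 1 - m >= 0)
        out.append(y)
    return out
```

With γ1 = γ2 the numerator and denominator cancel and the filter is the identity; making γ2 smaller than γ1 deepens the de-emphasis near the formant peaks, which is the masking effect described above.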
Post-filter
Not related to the subject, but since we are already here …
In order to increase the subjective quality of many coders,
post-filters are often used to emphasize the formant structure
These have the same form as the perceptual weighting filter,
– but with denominator expansion < numerator!
– the post-filter also reinforces the spectral tilt,
which should then be compensated by an IIR filter
– since the spectral valleys are de-emphasized,
we should change the PW filter parameters γ1 and γ2 accordingly
Originally proposed for ADPCM!
Subframes
Coders with large frames (> 10 ms) need a long excitation signal,
and hence a lot of bits to encode it
An alternative is to divide the frame into (2-4) subframes,
each of which has its own codeword excitation
[Figure: frame n between frames n-1 and n+1; one LPC analysis per frame, one codeword (CW) per each of subframes 1-4]
We really should recompute the LPC per subframe,
but we can get away with interpolating!
Lookahead
If we are already dividing up the frame,
we can compute the LPC based on a shifted frame
This is called lookahead, and it adds processing delay!
To decrease delay we can use a backward-looking IIR filter,
and then we needn’t send/store the LPC coefficients at all!
[Figure: the LPC analysis window shifted forward relative to the four codeword subframes]
What happened to the pitch?
Unlike LPC, the ABS CELP coder is excited by a codebook
Where does the pitch come from?
Random CB: minimization will prefer “good” excitations
Regular/multipulse: pulse spacing (but there are not enough pulses for high pitch)
But this is usually not enough (the residual has pitch periodicity)
Two solutions:
Adaptive codebook (Kleijn, et al.)
Long term prediction (Atal + Singhal)
Both of these reinforce the pitch component
Adaptive CB
The adaptive codebook holds repetitions of previous excitations
The total excitation is a weighted sum of the stochastic CB (random, MP, RP, etc.)
and the adaptive CB
[Figure: adaptive CB scaled by gain Ga, fixed CB scaled by gain Gs, summed and fed to the LPC synthesis filter]
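The excitation construction in the figure, with illustrative helper names; the adaptive codeword is just the past excitation repeated at the pitch lag:

```python
def adaptive_codeword(past_excitation, lag, length):
    """Repeat the last `lag` excitation samples to fill a subframe."""
    tail = past_excitation[-lag:]
    return [tail[n % lag] for n in range(length)]

def total_excitation(adaptive_cw, fixed_cw, Ga, Gs):
    """e[n] = Ga * adaptive[n] + Gs * fixed[n]."""
    return [Ga * a + Gs * f for a, f in zip(adaptive_cw, fixed_cw)]
```

Because the adaptive contribution repeats with period `lag`, it injects exactly the pitch periodicity that the fixed codebook alone cannot supply.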
Long Term Prediction
We use both long-term (pitch predictor) and short-term (LPC) prediction
The long term predictor may have only one delay tap, but then a non-integer one:
1 / (1 - β z^-τ)
[Figure: codebook → gain → pitch predictor → LPC filter; the output is subtracted from sn, and error computation with perceptual weighting drives the codebook search]
Federal Standard CELP
FS-1016 at 4.8 Kbps has MOS 3.2
Developed by AT&T Bell Labs for the DOD; 144 bits / 30 ms frame
10th order LPC on a 30 ms Hamming window,
no pre-emphasis, additional 15 Hz BW expansion (quality and LSP robustness)
Conversion to LSP and nonuniform scalar quantization: 34 bits
4 subframes (7.5 ms) with LSP interpolation
512 entry fixed CB (static -1,0,+1 from center-clipped Gaussian)
+ 5 bit nonuniform quantized gain: 56 bits
256 entry adaptive CB (8 bits) + 5 bit nonuniform quantized gain: 48 bits
(optional noninteger delays, optional …)
Perceptual weighting
Postfilter + spectral tilt compensation, removable for noise or tandeming
FEC 4 bits, SYNC 1 bit, reserved 1 bit
G.728
16 Kbps with MOS similar to G.726 at 32 Kbps
Low 5 sample (0.625 ms) delay
High computational complexity (about 30 MIPS)
CELP with backward LPC
LPC order 50 (why not? - we don’t transmit side-information!)
Frame of 2.5 ms (20 samples)
4 subframes of 0.625 ms (5 samples)
Perceptual weighting
Only the 10 bit index into the fixed CB is transmitted
10 bits per 0.625 ms is 16 Kbps!
G.729
8 Kbps toll-quality coder for DSVD and VoIP
Computational complexity 20 MIPS, but G.729a is about 10 MIPS
Frame 10 ms (80 samples), lookahead 5 ms (1 subframe)
LPC, LSP, VQ, LSP interpolation
CS-ACELP CB (interleaved single pulse permutation), 4 [+1] pulses / subframe
Closed loop pitch prediction and adaptive CB (delay + gain)
2 (40 sample) subframes per frame
For each frame the encoder outputs 80 bits:
LSF coefficients 18 bits, pitch 8 bits, gain CB 14 bits,
adaptive CB 5 bits, parity check 1 bit,
pulse positions 26 bits, pulse signs 8 bits
G.729 annexes
A Compatible reduced complexity encoder with minimal MOS reduction
B VAD and CNG
C Floating point implementation
D 6.4 Kbps version
  similar to G.729 but 64 output bits per frame; quality better than G.726 at 24 Kbps
  LSF coefficients 18b, pitch + adaptive CB 8+4b, gain CB 12b, fixed CB 22b
E 11.8 Kbps coder for high quality and music
G.723.1
6.3 (MP-MLQ) and 5.3 (ACELP) Kbps rates
About 18 MIPS on a DSP
Frame 30 ms (240 samples), lookahead 7.5 ms
LPC on 30 ms (240 sample) frames, LSP and VQ
Open-loop pitch computation on half-frames (120 samples)
Excitation on 4 subframes (60 samples each) per frame
Perceptual weighting and harmonic noise weighting
Fifth-order closed loop pitch predictor
MP-MLQ: 5 or 6 [+1] pulses / subframe, positions all even or all odd
ACELP: 4 [+1] pulses / subframe, positions differ by 8
Annex A: VAD-CNG; Annex B: floating point implementation
Voice DSP
Other Methods
MBE / MELP / STC / WI
MBE coder
LPC-10 makes a hard U/V decision - no mixed voicing
Multi-Band Excitation uses a different excitation:
– harmonics of the pitch frequency
– frequency-dependent binary U/V decision
– large number of sub-bands (>16)
Simultaneous ABS estimation of pitch and spectral envelope
Then the U/V decision is made based on the spectral fit
Dynamic programming is used for pitch tracking
[Figure: binary voicing decision V as a function of frequency f]
MBE coder - cont.
DVSI made various MBE, AMBE and IMBE coders for satellite use (INMARSAT)
Bit rates 2.4 - 9.6 Kbps (toll quality at 3.6 Kbps)
Integral FEC for bit-error robustness
As an example:
128 bits for each 20 ms frame
pitch 8 bits
U/V decisions K bits (K < 12)
spectral amplitudes (DCT) 75-K bits
FEC (Golay codes) 45 bits
MELP
The DOD wanted a new 2.4 Kbps coder with MOS similar to FS-1016
Main problems with LPC-10:
– voicing determination errors
– no handling of partially voiced speech
Unlike MBE, MELP uses the standard LPC model
The MELP excitation is a pulse train plus random noise
Soft voicing decision in a small number (5) of sub-bands
Frame 22.5 ms (180 samples)
10th order LPC, 15 Hz BW expansion, LSF, interpolation, VQ
Pitch refinement
5 sub-bands (0-500-1000-2000-3000-4000 Hz) with pitch and noise excitation
FEC
Sinusoidal Transform Coder
McAulay and Quatieri model:
instead of LPC, use a sum of sine waves
sn = Σi=1..N Ai cos(ωi n + φi)
For each analysis frame (10 - 20 ms) we need to extract the N amplitudes Ai, frequencies ωi and phases φi
Voiced speech:
use the pitch and important harmonics [from pitch-synchronized STFT]
Unvoiced speech:
use peaks of the STFT [points where the slope changes from + to -]
At high bit-rates keep magnitudes, frequencies and phases
At low bit-rates the frequencies are constrained and the phases modeled
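The synthesis equation above is directly implementable (a sketch; a real STC also interpolates amplitudes and phases between frame updates):

```python
import math

def stc_synthesize(amps, freqs, phases, length):
    """s[n] = sum_i A_i * cos(omega_i * n + phi_i), omega in rad/sample."""
    return [sum(A * math.cos(w * n + p)
                for A, w, p in zip(amps, freqs, phases))
            for n in range(length)]
```

A single component at ω = π/2 rad/sample produces the sequence 1, 0, -1, 0, …; voiced speech is modeled by placing the ωi at the pitch harmonics.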
STC - cont.
[Block diagram: encoder — overlapped windowing, FFT, peak picker, spectrum encoder (e.g. an all-pole model); decoder — spectrum decoder, sum of sinusoids → ŝn]
• The sparse spectrum is updated at regularly spaced times
• Amplitudes are linearly interpolated between updates
• The interpolated phase must obey 4 conditions
STC - cont.
Tracking the sinusoidal components
[Figure: frequency tracks vs time; tracks are “born” and “die” as components appear and disappear]
Waveform Interpolation
Voiced speech is a sequence of pitch-cycle waveforms
The characteristic waveform usually changes slowly with time
It is useful to think of the waveform in 2d
This waveform can be the speech signal or the LPC residual
[Figure: 2d surface of characteristic waveforms; axes are phase in pitch period vs time]
WI - cont.
[Block diagram: encoder — LPC + pitch tracking, characteristic waveform extraction, 2d CW alignment, quantization; decoder — decoding, waveform interpolation, conversion to 1d → ŝn]
• Per frame, LPC and pitch are extracted
• The CW is represented by features (e.g. DFT coefficients)
• Alignment is by circular shift until maximum correlation
• Separate treatment for voiced and unvoiced segments