
TELE4652 Mobile and Satellite

Communications

Lecture 9 – Speech Coding

Overview

• Important in the development of Cellular

Networks

• Speech compression advances were significant in

the growth in network capacity from 1G networks

to 2G networks (and beyond)

• Techniques can reduce the amount of digital data required to represent speech by factors of ten or more with little loss in perceptual quality

• This lecture will provide a brief overview of the

important ideas of speech compression


Speech Processing

• Speech Compression is one aspect of the overall

speech processing

• For example, GSM speech processing is shown

below:


GSM Speech Processing

Aspects of a speech processing system:

1. A/D conversion. GSM samples at 8kHz with 13 bits/sample. Raw data rate is 104kbps.

2. Framing – groups together a sequence of speech samples to form a frame. GSM takes frames of 160 samples every 20ms.

3. Frame classification – a Voice Activity Detection (VAD) algorithm is used to identify whether a frame contains speech or not


4. Discontinuous Transmission System (DTX) – don’t transmit if there is no speech within the frame. This:

• Prolongs battery life

• Reduces the level of interference across the network

On average, there is speech data to transmit less than half the time.

5. Comfort Noise Generation – If there is no speech data, the receiver will generate some comfort noise (background noise)

6. Silence Descriptor (SID) – Tx sends an estimate of the background noise. In GSM this is done once every 480ms



7. Error Concealment – bit errors on the channel can cause speech frames to be corrupted. If this is detected, a Bad Frame Indicator (BFI) is set and the lost speech frame is replaced by a prediction from previous frames. 16 consecutive lost frames result in failure of the acoustic channel.

8. Speech Codec (Compression) – GSM uses a technique called Regular Pulse Excitation – Long Term Prediction (RPE-LTP). This is a type of Linear Predictive Coding (LPC). Output is 260 bits for every 20ms frame, so the output data rate is 13kbps.
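The rates quoted in the list above follow from simple arithmetic. The sketch below reproduces them; the constant and function names are illustrative choices, not part of any GSM specification.

```python
# Figures taken from the list above; the names are illustrative only.
SAMPLE_RATE_HZ = 8000       # GSM sampling rate
BITS_PER_SAMPLE = 13        # uniform PCM resolution at the A/D converter
FRAME_MS = 20               # speech frame duration
CODEC_BITS_PER_FRAME = 260  # RPE-LTP output per frame


def raw_rate_bps(sample_rate_hz, bits_per_sample):
    """Bit rate of the uncompressed PCM stream."""
    return sample_rate_hz * bits_per_sample


def samples_per_frame(sample_rate_hz, frame_ms):
    """Number of speech samples grouped into one frame."""
    return sample_rate_hz * frame_ms // 1000


def codec_rate_bps(bits_per_frame, frame_ms):
    """Bit rate at the speech codec output."""
    return bits_per_frame * 1000 // frame_ms


raw = raw_rate_bps(SAMPLE_RATE_HZ, BITS_PER_SAMPLE)
frame = samples_per_frame(SAMPLE_RATE_HZ, FRAME_MS)
coded = codec_rate_bps(CODEC_BITS_PER_FRAME, FRAME_MS)
print(raw, frame, coded, raw / coded)
```

The last printed value is the compression ratio between the raw PCM stream and the codec output.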



Overview:


Digitising Speech

• Telephone quality speech is generally taken as band-limited to [300Hz, 3.4kHz]

• Usual sampling frequency is 8kHz

• High quality audio (for music)

requires a greater bandwidth

• Speech digitisation systems can

be understood as Pulse Code

Modulation (PCM)

• Sample, Quantise, Encode


Pulse Code Modulation

• Basic PCM system

• Quantisation causes a loss of information, which appears as signal distortion


Quantisation of Speech Samples

• Perceptual quality requires at least 13 bits/sample with uniform quantisation

• This follows from the required Signal to Quantisation Noise Ratio (SQNR)

• Resulting data rate, 8kHz x 13 bits/sample = 104kbps, is much too high for cellular applications

• Non-uniform quantisation – companding

• Companding = Compressing + Expanding. Non-linear amplification of speech

• Common schemes – A-law and µ-law

• Used in the PSTN
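To see how the number of bits per sample relates to SQNR, a common textbook rule of thumb can be used: for a uniform quantiser driven by a full-scale sinusoid, SQNR is roughly 6.02n + 1.76 dB for n bits. The slides do not state this formula, so it is an assumption here, and real speech (which is not a full-scale sinusoid) will see a lower figure.

```python
def sqnr_db(bits):
    """Approximate SQNR (dB) of a uniform quantiser with `bits` bits/sample.

    Assumes a full-scale sinusoidal input; this is a rule of thumb,
    not a measurement on speech.
    """
    return 6.02 * bits + 1.76


def pcm_rate_bps(sample_rate_hz, bits):
    """Bit rate of a PCM stream."""
    return sample_rate_hz * bits


for n in (8, 13):
    print(n, "bits:", round(sqnr_db(n), 2), "dB,", pcm_rate_bps(8000, n), "bps")
```

Each extra bit buys about 6 dB of SQNR, which is why dropping from 13 to 8 bits with a uniform quantiser would be audible unless the quantiser levels are redistributed, as companding does.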


Companding

• Non-uniform amplification in the time domain

• Idea is that there is more information contained in

the low amplitude parts of a speech waveform

• There should be more quantisation levels at low

amplitudes


Companding

• Reduces the number of bits per sample to 8

• Resultant bit rate is only 64kbps

• Still too high a rate for cellular systems, but used in cordless phones

• In practice, performed either by non-linear amplification prior to uniform quantisation, or directly by a non-uniform quantiser
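The first option above (non-linear amplification followed by a uniform quantiser) can be sketched with the continuous µ-law characteristic, y = sgn(x) ln(1 + µ|x|) / ln(1 + µ). This is a minimal illustration assuming µ = 255 and samples normalised to [-1, 1]; it is not the segmented encoder used in practical PSTN equipment, and the function names are hypothetical.

```python
import math

MU = 255.0  # assumed value of the mu-law parameter


def mu_compress(x, mu=MU):
    """Compress a sample in [-1, 1] with the continuous mu-law curve."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_expand(y, mu=MU):
    """Inverse of mu_compress."""
    return math.copysign(((1.0 + mu) ** abs(y) - 1.0) / mu, y)


def quantise_uniform(y, bits=8):
    """Uniform mid-tread quantiser on [-1, 1]."""
    levels = 2 ** (bits - 1) - 1
    return round(y * levels) / levels


def compand_quantise(x, bits=8):
    """Compress, quantise uniformly, then expand at the receiver."""
    return mu_expand(quantise_uniform(mu_compress(x), bits))


x = 0.01  # a low-amplitude sample
print(abs(quantise_uniform(x) - x), abs(compand_quantise(x) - x))
```

For the low-amplitude sample the companded error is far smaller than the plain 8-bit uniform error, which is the point of placing more quantisation levels at low amplitudes.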


Adaptive Differential PCM

• Use an adaptive algorithm to predict the next speech

sample

• Quantise and encode the difference between the

prediction and the actual sample

• Can adapt step-size based on the received signal

characteristic

• This reduces the dynamic range of the quantiser, and hence

the number of bits required in representation

• The receiver can use the same predictive algorithm, and

then receives the difference to add to its prediction


Adaptive DPCM

• Adaptive DPCM encoder


Adaptive DPCM

• Predictive – estimate the next sample from the

previous sample(s)

• Then quantise and encode these prediction errors

• This greatly reduces the dynamic range of the

input samples, and so can reduce the number of

bits/sample and so reduce the data rate


Adaptive DPCM

• Prediction filters are simple FIR filters

• For example, gradient prediction (with $T_s$ the sampling period):

$$z_n = w_{n-1} + T_s \, \frac{w_{n-1} - w_{n-2}}{T_s}$$

• This gives:

$$z_n = 2 w_{n-1} - w_{n-2}$$

• This is the simplest, first-order estimate

• General form would be

$$z_n = \sum_{i=1}^{K} a_i \, w_{n-i}$$

• Then encode prediction error:

$$e_n = w_n - z_n$$

Adaptive DPCM

• The filter coefficients can be made adaptive

• The quantisation step size can be made adaptive

• Slowly varying signal, step size is small

• Rapidly varying signal, step size is large

• Perceptual quality requires 4 bits/sample – 32kbps

data rate -> still too high for cellular applications
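A minimal DPCM sketch of the predict, quantise, and reconstruct loop is given below, using the gradient predictor z_n = 2w_{n-1} - w_{n-2}. It makes several simplifying assumptions that the slides do not specify: the step size is fixed rather than adaptive, the predictor coefficients are fixed, and the encoder predicts from its own reconstructed samples so that the decoder can track it exactly.

```python
import math


def dpcm_encode(samples, step, bits=4):
    """Encode samples as quantised prediction errors (integer codes)."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    w1 = w2 = 0.0                       # last two reconstructed samples
    codes = []
    for w in samples:
        z = 2.0 * w1 - w2               # gradient prediction
        q = round((w - z) / step)       # quantise the prediction error
        q = max(lo, min(hi, q))         # clip to the available code range
        codes.append(q)
        w2, w1 = w1, z + q * step       # same reconstruction as the decoder
    return codes


def dpcm_decode(codes, step):
    """Rebuild samples by adding each received error to the prediction."""
    w1 = w2 = 0.0
    out = []
    for q in codes:
        w = 2.0 * w1 - w2 + q * step
        out.append(w)
        w2, w1 = w1, w
    return out


# Hypothetical test signal: a 200 Hz tone sampled at 8 kHz.
signal = [0.5 * math.sin(2 * math.pi * 200 * n / 8000) for n in range(200)]
codes = dpcm_encode(signal, step=0.02)
rebuilt = dpcm_decode(codes, step=0.02)
print(max(abs(a - b) for a, b in zip(signal, rebuilt)))
```

Because the prediction error of a slowly varying signal is small, 4-bit codes suffice here even though the signal itself spans a much larger range than 16 steps.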


Linear Predictive Coding

• Able to achieve bit rates as low as 2kbps. Good perceptual quality requires of the order of 10kbps

• Speech is heavily structured

• Rather than directly encode the speech samples,

develop a model of the speaker’s vocal tract and an

excitation to this model

• Send the model parameters and the excitation to

the receiver, which can then regenerate the speech

sample

• This greatly reduces the data rate needed



• Speech generation model


Speech waveforms

• Voiced and unvoiced speech



• Analysis by synthesis principles



• Classify speech samples as one of two types –

voiced or unvoiced

• Voiced can be modelled as a periodic input

• Unvoiced speech as random noise input

• Model the human vocal tract as an all-pole filter

• Analysis by synthesis – iterative (feedback) system. Use current model of the speech generation process to generate current speech sample. Based on the error, refine the model until the generated and real samples match as closely as desired


• MMSE ideas are used to judge the match. However, the error is generally weighted by a perceptual weighting function

• This weighting function can account for auditory

masking phenomena

• Also seek to enhance formant frequencies


LPC Principles

• Short-term prediction – predict the current speech sample from previous samples (generally 8-16 samples)

$$\tilde{s}[n] = \sum_{k=1}^{p} a_k \, s[n-k]$$

• This is equivalent to an all-pole prediction filter:

$$H(z) = \frac{1}{1 - P_s(z)}, \qquad P_s(z) = \sum_{k=1}^{p} a_k z^{-k}$$

• $P_s(z)$ is called the analysis filter. $H(z)$ is the synthesis filter

• Residuals (prediction errors):

$$\tilde{s}[n] = \sum_{k=1}^{p} a_k \, \tilde{s}[n-k] + r[n]$$

$$\tilde{S}(z) = \frac{1}{1 - P_s(z)} \, R(z)$$

LPC Principles

• Prediction residuals can be used as inputs to the

synthesis filter at the receiver to regenerate the

speech samples

• Aim is to choose the filter coefficients to minimise mean square prediction error

$$E = \sum_{n} \left( s[n] - \tilde{s}[n] \right)^2$$

• More generally this can be a perceptually weighted mean

• The average is done over a 10-20ms speech frame

LPC Principles

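• Select the filter coefficients $\{a_k\}$ to minimise the MSE

• This gives the equation, for $i = 1, \dots, p$ and with the convention $a_0 = -1$:

$$\frac{\partial E}{\partial a_i} = 2 \sum_{k=0}^{p} \sum_{n=1}^{N} s[n-i] \, s[n-k] \, a_k = 0$$

• Correlation matrix over input speech samples:

$$C_{mk} = C[m,k] = \sum_{n=1}^{N} s[n-m] \, s[n-k]$$

• The optimal filter coefficients $\mathbf{a} = [a_1, a_2, \dots, a_p]$ can thus be found from a matrix equation. Moving the $a_0$ term to the right-hand side:

$$\sum_{k=1}^{p} C_{ik} \, a_k = C_{i0}, \qquad i = 1, \dots, p$$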

LPC Principles

• This is fundamentally a matrix inversion problem

• The challenge is to find computationally efficient

methods to perform it

• Common methods are:

• Autocorrelation method

• Covariance method

• Both employ an algorithm, called Durbin’s algorithm, to do the matrix inversion using reflection coefficients

• These are a set of coefficients related to $\{a_k\}$
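For the autocorrelation method, the Levinson-Durbin recursion can be sketched as follows. It solves for the predictor coefficients a_1..a_p from the autocorrelation values r[0..p] and produces the reflection coefficients as a by-product. The function names are illustrative, and the sign convention for the reflection coefficients varies between texts; the one below is only one common choice.

```python
def autocorrelation(frame, p):
    """Autocorrelation r[0..p] of one speech frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i - m] for i in range(m, n))
            for m in range(p + 1)]


def levinson_durbin(r, p):
    """Return (a, k, err) for the order-p predictor.

    a   : predictor coefficients a_1..a_p
    k   : reflection coefficients k_1..k_p
    err : final prediction error energy
    """
    a = [0.0] * (p + 1)          # a[0] is unused
    err = r[0]
    ks = []
    for i in range(1, p + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err
        ks.append(k)
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)     # error shrinks at every order
    return a[1:], ks, err


# Autocorrelation of a first-order process with correlation 0.5:
a, ks, err = levinson_durbin([1.0, 0.5, 0.25], 2)
print(a, ks, err)
```

In the example the second coefficient comes out as zero, as expected for a signal that is fully described by a first-order predictor.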

LPC Principles

• Often the reflection coefficients are quantised and

encoded in place of the filter coefficients

• Can be directly employed in lattice

implementations of digital filters

• An alternative approach, then, is to compute these

reflection coefficients directly from the speech

samples

• A popular technique using this idea is the

‘covariance lattice method’, employing what is

called the Burg algorithm


Basic LPC System

• Achieve bit rates as low as 2kbps

• Improvements are needed to achieve the speech quality expected in cellular networks


LPC Refinements

Techniques can be employed to enhance this basic

LPC system, to achieve desired speech quality

1. Speech frame windowing

2. Non-uniform quantisation of filter parameters

3. Long-term prediction filter

4. A weighted error filter

5. Representation of the excitation signal


Speech Frame Windowing

• Typically determine the LPC filter coefficients once per speech frame – 20ms (160 samples)

• Divide frame into sub-frames (4-7ms in duration) to determine excitation signals

• A window is used to prevent discontinuities between speech frames

• Hamming window is common

$$w[n] = 1.5863 \left[ 0.54 - 0.46 \cos\!\left( \frac{2\pi n}{L_a} \right) \right]$$
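The scaled Hamming window can be generated directly from the formula. The sketch below assumes the window length L_a equals the frame length of 160 samples; the slides do not fix this value. It also checks numerically that the factor 1.5863 brings the mean-square value of the window to about one, which suggests the factor is there to preserve frame energy.

```python
import math


def scaled_hamming(length, scale=1.5863):
    """Scaled Hamming window w[n], n = 0..length-1."""
    return [scale * (0.54 - 0.46 * math.cos(2.0 * math.pi * n / length))
            for n in range(length)]


def apply_window(frame, window):
    """Multiply a speech frame by the window, sample by sample."""
    return [s * w for s, w in zip(frame, window)]


window = scaled_hamming(160)   # assumed L_a = 160
mean_square = sum(w * w for w in window) / len(window)
print(round(mean_square, 4))
```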

Non-uniform Quantisation

• Uniform quantisation of filter coefficients $\{a_k\}$ would require 10 bits/coefficient, for acceptable quality

• An alternative is the reflection coefficients, $\{k_i\}$

• The most common approach is to encode a non-linear mapping of the reflection coefficients

• Inverse sine transform: $\{\sin^{-1}(k_i)\}$

• Log-area ratios (LAR):

$$LAR_i = \log\!\left( \frac{1 + k_i}{1 - k_i} \right)$$

• Can use piecewise linear approximations to the above, where ± takes the sign of $k_i$:

$$LAR_i \approx \begin{cases} k_i & \text{if } |k_i| < 0.675 \\ \pm\left(2|k_i| - 0.675\right) & \text{if } 0.675 < |k_i| < 0.95 \\ \pm\left(8|k_i| - 6.375\right) & \text{if } 0.95 < |k_i| < 1.0 \end{cases}$$
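The two LAR mappings can be compared with a short sketch. The slides write only "log", so the base-10 logarithm used in `lar_exact` is an assumption, and both function names are hypothetical. The piecewise version uses the three segments given above.

```python
import math


def lar_exact(k):
    """Log-area ratio of reflection coefficient k, |k| < 1 (base 10 assumed)."""
    return math.log10((1.0 + k) / (1.0 - k))


def lar_piecewise(k):
    """Piecewise linear approximation to the log-area ratio."""
    mag = abs(k)
    if mag < 0.675:
        return k
    if mag < 0.95:
        return math.copysign(2.0 * mag - 0.675, k)
    return math.copysign(8.0 * mag - 6.375, k)


for k in (0.2, 0.675, 0.8, 0.95, 0.99):
    print(k, round(lar_exact(k), 3), round(lar_piecewise(k), 3))
```

The segments join continuously at 0.675 and 0.95, and the steeper slope near |k| = 1 gives finer resolution where the synthesis filter is most sensitive to coefficient errors.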

Non-uniform Quantisation

• Non-uniform quantisation of parameters can

significantly reduce the overheads in bits per

parameter

• For an 8th order LPC filter, the use of LAR can reduce the quantisation overheads from 10 bits/coefficient, so 80 bits total, to 6,6,5,5,4,4,3,3 = 36 bits in total for the equivalent LARs.

• This is a significant saving in total bit rate

• A different approach, called line spectral pairs, is to identify and encode the formant frequencies

Long-term Prediction

• Correlations in speech span many speech samples, particularly pitch effects in voiced speech

• A long-term predictor can be used to model the finer structure in the speech spectrum

• Firstly perform short-term prediction, and find the residual r[n].

• Then perform a long-term prediction

• One tap case, this is to form the filter to minimise the long-term residual:

$$e[n] = r[n] - G \, r[n - \alpha]$$

• The resulting filter is called the pitch synthesis filter

• For first order case,

$$\frac{1}{P(z)} = \frac{1}{1 - G z^{-\alpha}}$$

• α represents the pitch delay, and G the pitch gain.

• These coefficients would be found using the MMSE criterion

• The general pitch synthesis filter would be

$$\frac{1}{P(z)} = \frac{1}{1 - P_l(z)}$$

where the long-term predictor is

$$P_l(z) = \sum_{k=-m_1}^{m_2} G_k \, z^{-(\alpha + k)}$$

• Either first order, $m_1 = m_2 = 0$, or third order, $m_1 = m_2 = 1$, is sufficient in practice


• LTP is usually performed once per sub-frame

• Delay encoded with 7 bits, and gain as 3-4 bits

• Cascaded with the STP filter
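For the one-tap case, the MMSE solution has a closed form for each candidate delay: G = Σ r[n] r[n-α] / Σ r[n-α]², and the best delay is the one that maximises (Σ r[n] r[n-α])² / Σ r[n-α]². A minimal exhaustive search is sketched below; the lag range, function name, and test signal are illustrative assumptions rather than values from any standard.

```python
def ltp_search(r, start, length, min_lag, max_lag):
    """Find the pitch delay and gain for one sub-frame of the residual r.

    The sub-frame covers r[start : start + length]; start must be at
    least max_lag so that every delayed sample exists.
    """
    best_lag, best_gain, best_score = min_lag, 0.0, -1.0
    for lag in range(min_lag, max_lag + 1):
        cross = sum(r[n] * r[n - lag] for n in range(start, start + length))
        energy = sum(r[n - lag] * r[n - lag]
                     for n in range(start, start + length))
        if energy == 0.0:
            continue
        score = cross * cross / energy   # reduction in squared error
        if score > best_score:
            best_lag, best_gain, best_score = lag, cross / energy, score
    return best_lag, best_gain


# Hypothetical residual: a decaying pulse repeated every 50 samples.
pattern = [0.8 ** i for i in range(50)]
residual = [pattern[n % 50] for n in range(200)]
lag, gain = ltp_search(residual, start=120, length=40, min_lag=20, max_lag=80)
print(lag, gain)
```

With 7 bits for the delay, up to 128 candidate lags can be indexed, so the search cost per sub-frame stays modest.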



• Effective in removing pitch effects


Weighted Error Filter

• Based on the principle of Auditory masking

• Both temporal and spectral masking phenomena

• Spectral masking – there is a minimum level above which signals are detectable

• Hence, perceptible noise comes from spectral regions with little to no speech components



• Quantisation noise will be more significant in regions with no speech signal energy

• We would like to weight the quantisation so that less emphasis is placed on the quantisation of the dominant spectral regions

• Warp the spectrum to emphasise the silent parts of the spectrum,

$$W(z) = \frac{1 - P_s(z)}{1 - P_s(z/\gamma)}$$

• Here $\gamma$, with $0 < \gamma < 1$, is the degree of de-emphasis

• For γ = 0, we get the usual MMSE situation
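A direct-form sketch of the weighting filter is shown below. It assumes the common reading of the formula in which P_s(z/γ) = Σ a_k γ^k z^{-k}, so the denominator uses the short-term coefficients scaled by powers of γ. The function name and the zero initial state are illustrative choices.

```python
def perceptual_weighting(x, a, gamma):
    """Filter x through W(z) = (1 - P_s(z)) / (1 - P_s(z/gamma)).

    a     : short-term predictor coefficients a_1..a_p
    gamma : weighting factor; denominator taps are a_k * gamma**k
    """
    p = len(a)
    a_w = [a[k] * gamma ** (k + 1) for k in range(p)]
    y = []
    for n in range(len(x)):
        acc = x[n]
        for k in range(1, p + 1):
            if n - k >= 0:
                acc -= a[k - 1] * x[n - k]      # numerator: 1 - P_s(z)
                acc += a_w[k - 1] * y[n - k]    # denominator: 1 - P_s(z/gamma)
        y.append(acc)
    return y


print(perceptual_weighting([1.0, 1.0, 1.0], [0.5], 0.0))
print(perceptual_weighting([1.0, 2.0, 3.0], [0.5], 1.0))
```

Under this reading, γ = 0 leaves only the numerator, so the filter output is the short-term prediction residual, while γ = 1 makes numerator and denominator cancel and the signal passes through unchanged.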


• The diagram below shows the use of the

perceptual spectral weighting filter in the analysis

by synthesis speech codec


Excitation Signals

• The assumption that speech is either voiced or unvoiced is too simplistic for well-performing speech codecs

• Aim to find some more realistic excitations

• Trade-off – accuracy of excitation signal against

the number of bits needed to represent it


Excitation Signals

• Most common techniques: Multi-pulse

Excitations, Regular Pulse Excitation, Codebook

Excitation


Multi-Pulse Excitation

• Determine the location and amplitude of several

pulses to give the best fit (via weighted

perceptual MSE) on residuals after STP and LTP

• Typical is 4 pulses per sub-frame (say 5ms

segment with 40 samples)

• Express as

$$v[n] = \sum_{k=0}^{M-1} \beta_k \, \delta[n - m_k]$$


• Quantise the pulse positions and amplitudes and

send them to the decoder

• Iterative techniques can be used to determine the

optimal pulse locations

• Combinatorial encoding can be used to represent pulse positions – can achieve 17 bits for 4 pulse positions (naively we’d need 6 bits/position)

• Amplitudes – determine scaling factor (max value or RMS) -> 6 bits for it

• Other, scaled amplitudes encoded with 3 bits/amplitude


• The typical MPE excitation requires 35 bits per

5ms sub-frame to encode

• This implies a data rate of 7kbps for the excitation

• The excitation is by far the largest part of an LPC

system in terms of data rate

• This is then the major differentiator between

different LPC techniques
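The MPE bit counts quoted above can be checked with a few lines of arithmetic. The sketch assumes that combinatorial encoding means indexing one of the C(40, 4) unordered sets of pulse positions; the helper name is illustrative.

```python
import math


def bits_for(choices):
    """Smallest number of bits that can index `choices` alternatives."""
    return (choices - 1).bit_length()


SUBFRAME_SAMPLES = 40   # 5ms at 8kHz
PULSES = 4
SUBFRAME_MS = 5

naive_position_bits = PULSES * bits_for(SUBFRAME_SAMPLES)
combinatorial_bits = bits_for(math.comb(SUBFRAME_SAMPLES, PULSES))
scale_bits = 6
amplitude_bits = PULSES * 3

total_bits = combinatorial_bits + scale_bits + amplitude_bits
rate_bps = total_bits * 1000 // SUBFRAME_MS
print(naive_position_bits, combinatorial_bits, total_bits, rate_bps)
```

There are 91390 ways to place 4 pulses among 40 positions, which fits in 17 bits compared with 24 bits for four separate 6-bit positions.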


Regular Pulse Excitation

• The excitation sequence is a set of regular pulses

• For example, one of four possible sets of pulses is chosen:

• The aim is then to determine the pulse set and

the amplitudes that minimise the residual error



• Example of an RPE determination algorithm

1. Do short-term LPC analysis:

$$r[n] = s[n] - \sum_{i=1}^{p} a_i \, s[n-i]$$

2. Determine LTP parameters, α and G

3. Compute LTP residual:

$$d[n] = r[n] - G \, r[n - \alpha]$$

4. Pass these residuals through a smoothing filter

$$y[n] = \sum_{i=0}^{N-1} d[i] \, z(n - i)$$

where z[n] is some windowing function (say Hamming sinc)

$$z[n] = \operatorname{sinc}\!\left( \frac{\pi n}{D} \right) \left[ 0.54 + 0.46 \cos\!\left( \frac{\pi n}{Q} \right) \right], \quad \text{for } -Q \le n \le Q$$

5. Decompose y[n] into D sets (say D = 4 – the number of different pulse sets), and compute the energy of each set: For each k = 1 up to D:

$$\beta_i^{(k)} = y[k + iD]$$

$$E^{(k)} = \sum_{i=0}^{M-1} \left( \beta_i^{(k)} \right)^2$$

6. Choose the pulse set with the maximum energy
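Steps 5 and 6 can be sketched in a few lines. The code below uses 0-based grid offsets k = 0..D-1 rather than the 1-based indexing of the formulas, and the function name is an illustrative choice.

```python
def rpe_select(y, D):
    """Pick the regular pulse grid with the largest energy.

    y : smoothed LTP residual for one sub-frame
    D : pulse spacing, which is also the number of candidate grids
    Returns (grid offset, pulse amplitudes on that grid).
    """
    best_k, best_energy, best_pulses = 0, -1.0, []
    for k in range(D):
        pulses = y[k::D]                      # beta_i = y[k + i*D]
        energy = sum(b * b for b in pulses)
        if energy > best_energy:
            best_k, best_energy, best_pulses = k, energy, pulses
    return best_k, best_pulses


offset, pulses = rpe_select([0.0, 1.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0], 4)
print(offset, pulses)
```

Only D energies have to be compared per sub-frame, which is why the regular grid is so much cheaper to search than free pulse positions.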


• The regular spacing of the pulses makes this very

computationally efficient (say, compared to MPE)

• For D = 4 and N = 40 (for a 5ms frame):

• 2 bits for the starting position of the pulse train

• For the 10 pulse amplitudes:

• 6 bits for the amplitude scaling factor

• 3 bits for each pulse (10 x 3 = 30 bits)

• The total bits is then 38 bits/5 ms frame

• The resultant data rate is 7.6 kbps


Code Excitation

• Observe that the largest component of MPE and

RPE is the excitation sequence

• The point of LPC is that, after STP and LTP the

residual should just look like random noise

• Consider the excitation instead as a random noise

signal

• Have a large codebook of random excitations, and

choose the best one


Code Excitation

• Encoder searches through the codebook to find

the excitation with the lowest MSE output

• It then transmits the index of this excitation to

the receiver

• Acceptable performance requires a codebook of

say 1024 with Gaussian populated random

sequences

• For a 5ms sub-frame, would use 10 bits for the

codeword index, and say 5 bits for overall gain

(frame energy)


Code Excitation

• This results in 15 bits/sub-frame

• Data rate is only 3 kbps!

• CELP produces a huge saving in data rate

• The problem is, it is very slow and

computationally expensive to search through

large codebooks for the best excitation sequence

• This is the R&D emphasis in CELP – how to structure the codebook to reduce computational overheads
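The cost of the codebook search is easy to see in a brute-force sketch. The code below assumes a Gaussian codebook, an all-pole synthesis filter that starts from zero state in each sub-frame, and a plain (unweighted) squared-error criterion; a practical encoder would also apply the perceptual weighting filter and the long-term predictor. All names are hypothetical.

```python
import random


def synthesise(excitation, a):
    """Pass an excitation through the all-pole filter 1 / (1 - P_s(z))."""
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for k in range(1, len(a) + 1):
            if n - k >= 0:
                acc += a[k - 1] * out[n - k]
        out.append(acc)
    return out


def make_codebook(size, length, seed=0):
    """Codebook of random Gaussian excitation sequences."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(length)] for _ in range(size)]


def celp_search(target, codebook, a):
    """Return (index, gain) of the codeword with the lowest squared error."""
    target_energy = sum(t * t for t in target)
    best_index, best_gain, best_err = None, 0.0, float("inf")
    for index, code in enumerate(codebook):
        y = synthesise(code, a)
        yy = sum(v * v for v in y)
        if yy == 0.0:
            continue
        ty = sum(t * v for t, v in zip(target, y))
        err = target_energy - ty * ty / yy   # error at the optimal gain ty/yy
        if err < best_err:
            best_index, best_gain, best_err = index, ty / yy, err
    return best_index, best_gain


codebook = make_codebook(1024, 40)           # 10-bit index, 5ms sub-frame
lpc = [0.9]                                  # toy first-order filter
target = [2.0 * v for v in synthesise(codebook[7], lpc)]
print(celp_search(target, codebook, lpc))
```

Every one of the 1024 codewords is filtered and compared for each 5ms sub-frame, which illustrates why structured codebooks that avoid this full filtering step matter so much in practice.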


CELP

• Schematic diagram of the principles


CELP

• To implement in real-time, there are many

proposed ways to structure and populate the

codebooks

• Sparse excitation codebooks (a few random

pulses)

• Ternary codebooks (three state)

• Algebraic codebooks (generated by an FEC code)

• Vector Sum codebooks (form excitations as a sum

of codewords from smaller codebooks)


VSELP

• Used in USDC – IS-54


VSELP

• Speech frames were 20ms divided into 5ms sub-frames

• Used three 128 size codebooks for excitations

• One of the codebooks implemented the LTP loop

-> an adaptive codebook

• Gains of codeword determined each sub-frame

and vector quantised with 8 bits

• Each of the three excitations required a 7 bit

index

• Thus, 116 bits per frame for excitation (4 sub-frames × (3 × 7 + 8) bits)

VSELP

• The LPC filter used in IS-54 was 10th order, and

coefficients quantised as 38 bits once per frame

• An additional 5 bits were used per speech frame

for overall sample energy

• Data rate was 159 bits per 20ms frame, or

7.95kbps

• Channel coding was then employed to produce

the transmitted data rate of 13kbps


CELP

• Code Excited Linear Predictive Coding (CELP)

tends to be the choice in 3G cellular standards

• This has been made possible through the increase

in computing power

• Thus, modern speech codecs are able to achieve

lower data rates than previously possible


The GSM Codec

• GSM used RPE-LTP for speech coding


GSM Codec

• Speech was sampled at 8kHz with 13bits/sample

• Pre-emphasis was used to enhance high-frequency content (and improve numerical precision) -> filter

$$H(z) = 1 - 0.9 z^{-1}$$

• Broken into 20ms frames of 160 samples

• Sub-frames are 5ms of 40 samples

• Hamming window to remove boundary effects between frames:

$$s_{psw}[n] = 1.5863 \cdot s_{ps}[n] \cdot \left[ 0.54 - 0.46 \cos\!\left( \frac{2\pi n}{L} \right) \right]$$
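The pre-emphasis filter H(z) = 1 - 0.9z^{-1} corresponds to the difference equation y[n] = x[n] - 0.9 x[n-1]. A minimal sketch, assuming the filter starts from zero state (the function name is illustrative):

```python
def pre_emphasis(samples, coeff=0.9):
    """Apply y[n] = x[n] - coeff * x[n-1], starting from zero state."""
    out = []
    prev = 0.0
    for s in samples:
        out.append(s - coeff * prev)
        prev = s
    return out


print(pre_emphasis([1.0, 1.0, 1.0, 1.0]))     # constant (low-frequency) input
print(pre_emphasis([1.0, -1.0, 1.0, -1.0]))   # alternating (high-frequency) input
```

After the first sample, the constant input is attenuated to about 0.1 while the alternating input is boosted to about 1.9, which shows the high-frequency emphasis.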

GSM Codec

• STP done with an 8th order LPC filter

• Encode the LAR using 6,6,5,5,4,4,3,3 bits, so 36

bits in total

• STP filter is updated once per frame

• LTP is done per sub-frame

• 2 bits for LTP gain, and 7 bits for the delay

• Regular Pulse Excitation is used – D = 4 and N = 13

(one of four possible pulse sets with 13 pulses per

set)


GSM Codec

• The initial RPE phase takes 2 bits

• Amplitude scaling factor is 6 bits

• Then, there are 13 amplitudes with 3 bits per

amplitude

• The result is 260 bits/frame (36 bits for the LARs plus 4 × 56 bits for the sub-frames)

• Speech codec data rate is 13kbps for GSM


GSM Codec

• Summary of GSM codec output bits:

Parameter                       Bits per 20 ms frame

8 STP LAR coefficients          36

4 LTP gains                     4 × 2 = 8

4 LTP delays                    4 × 7 = 28

4 RPE grid positions            4 × 2 = 8

4 RPE block maxima              4 × 6 = 24

4 × 13 RPE pulse amplitudes     52 × 3 = 156

Total                           260
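The totals in the summary can be cross-checked by adding up the per-frame and per-sub-frame allocations. The dictionary keys below are illustrative labels, not field names from the GSM specification.

```python
LAR_BITS = [6, 6, 5, 5, 4, 4, 3, 3]   # one entry per LAR, sent once per frame
SUBFRAMES_PER_FRAME = 4
FRAME_MS = 20

BITS_PER_SUBFRAME = {
    "ltp_gain": 2,
    "ltp_delay": 7,
    "rpe_grid_position": 2,
    "rpe_block_maximum": 6,
    "rpe_pulse_amplitudes": 13 * 3,
}

frame_bits = sum(LAR_BITS) + SUBFRAMES_PER_FRAME * sum(BITS_PER_SUBFRAME.values())
rate_bps = frame_bits * 1000 // FRAME_MS
print(sum(LAR_BITS), sum(BITS_PER_SUBFRAME.values()), frame_bits, rate_bps)
```

The pulse amplitudes account for 156 of the 260 bits, consistent with the earlier observation that the excitation dominates the data rate.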


GSM Codec

The structure of the decoder:

