TELE4652 Mobile and Satellite
Communications
Lecture 9 – Speech Coding
Overview
• Important in the development of Cellular
Networks
• Speech compression advances were significant in
the growth in network capacity from 1G networks
to 2G networks (and beyond)
• Techniques can reduce the amount of digital data
required to represent speech by factors of ten or
more at no loss in perceptual quality
• This lecture will provide a brief overview of the
important ideas of speech compression
TELE4652
Speech Processing
• Speech Compression is one aspect of the overall
speech processing
• For example, GSM speech processing is shown
below:
GSM Speech Processing
Aspects of a speech processing system:
1.A/D conversion. GSM samples at 8kHz with 13
bits/sample. Raw data rate is 104kbps.
2.Framing – groups together a sequence of speech
samples to form a frame. GSM takes frames of 160
samples every 20ms.
3.Frame classification – a Voice Activity Detection
(VAD) algorithm is used to identify whether frame
contains speech or not
GSM Speech Processing
4. Discontinuous Transmission System (DTX) – don’t transmit if there is no speech within the frame. This
• Prolongs battery life
• Reduces the level of interference across the network
There should be speech data to transmit less than half the time on average.
5. Comfort Noise Generation – If there is no speech data, the receiver will generate some comfort noise (background noise)
6. Silence Descriptor (SID) – Tx sends an estimate of the background noise. In GSM this is done once every 480ms
GSM Speech Processing
7. Error Concealment – bit errors on the channel can corrupt speech frames. A corrupted frame that is detected is flagged with a Bad Frame Indicator (BFI), and the lost speech frame is replaced by a prediction from previous frames. 16 consecutive lost frames result in failure of the acoustic channel.
8. Speech Codec (Compression) – GSM uses a technique called Regular Pulse Excitation – Long Term Prediction (RPE-LTP). This is a type of Linear Predictive Coding (LPC). Output is 260 bits for every 20ms frame, so the output data rate is 13kbps.
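The rates quoted in items 1, 2 and 8 follow from simple arithmetic. A minimal sketch, with illustrative names that are not part of any GSM specification:

```python
# Figures are the ones quoted in the list above; names are hypothetical.
SAMPLE_RATE_HZ = 8000
BITS_PER_SAMPLE = 13
FRAME_MS = 20
CODEC_BITS_PER_FRAME = 260


def raw_rate_bps():
    """Bit rate straight out of the A/D converter."""
    return SAMPLE_RATE_HZ * BITS_PER_SAMPLE


def samples_per_frame():
    """Number of speech samples grouped into one frame."""
    return SAMPLE_RATE_HZ * FRAME_MS // 1000


def codec_rate_bps():
    """Bit rate at the output of the speech codec."""
    return CODEC_BITS_PER_FRAME * 1000 // FRAME_MS


def compression_ratio():
    """How many times smaller the coded stream is than raw PCM."""
    return raw_rate_bps() / codec_rate_bps()
```

The codec therefore reduces the raw 104kbps stream by a factor of eight.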
GSM Speech Processing
Overview:
Digitising Speech
• Telephone quality speech is
generally taken as band-limited to
[300Hz, 3.4kHz]
• Usual sampling frequency is 8kHz
• High quality audio (for music)
requires a greater bandwidth
• Speech digitisation systems can
be understood as Pulse Code
Modulation (PCM)
• Sample, Quantise, Encode
Pulse Code Modulation
• Basic PCM system
• Quantisation causes a loss of information, which appears as
signal distortion
Quantisation of Speech Samples
• Perceptual quality requires at least 13 bits/sample
• From Signal to Quantisation Noise Ratio (SQNR)
• Resulting data rate, 8kHz x 13 bits/sample =
104kbps is much too high for cellular application
• Non-uniform quantisation – companding
• Companding = Compressing + Expanding. Non-
linear amplification of speech
• Common schemes – A-law and µ-law
• Used in the PSTN
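As a rough guide to the SQNR argument, the sketch below uses the textbook approximation SQNR ≈ 6.02n + 1.76 dB for an n-bit uniform quantiser driven by a full-scale sinusoid. That formula is an assumption brought in for illustration; the slides do not state which SQNR expression was used.

```python
def sqnr_db(bits):
    """Approximate SQNR of a uniform quantiser (full-scale sinusoid assumed)."""
    return 6.02 * bits + 1.76


def pcm_rate_bps(sample_rate_hz, bits):
    """Bit rate of plain PCM at the given sampling rate and word length."""
    return sample_rate_hz * bits
```

For example, `sqnr_db(13)` is about 80 dB, and `pcm_rate_bps(8000, 13)` reproduces the 104kbps figure above.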
Companding
• Non-uniform amplification in the time domain
• Idea is that there is more information contained in
the low amplitude parts of a speech waveform
• There should be more quantisation levels at low
amplitudes
Companding
• Reduces the number of bits per sample to 8
• Resultant bit rate is only 64kbps
• Still too high for cellular systems, but used in
Cordless Phones
• Practically either performed by non-linear
amplification prior to quantisation, or by a direct
non-uniform quantiser
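A minimal sketch of the first option (non-linear amplification followed by uniform quantisation), using the µ-law curve with µ = 255. The helper names and the simple 8-bit rounding scheme are assumptions for illustration; practical codecs use a segmented table rather than this direct formula.

```python
import math

MU = 255.0  # mu-law parameter used in North American and Japanese telephony


def mu_compress(x, mu=MU):
    """Compress x in [-1, 1]; low amplitudes are stretched, high ones squeezed."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_expand(y, mu=MU):
    """Inverse of mu_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


def encode(x, bits=8):
    """Compress, then quantise uniformly to a signed integer code."""
    levels = 2 ** (bits - 1) - 1   # 127 for 8 bits
    return round(mu_compress(x) * levels)


def decode(code, bits=8):
    """Dequantise, then expand back to the linear domain."""
    levels = 2 ** (bits - 1) - 1
    return mu_expand(code / levels)
```

Small inputs such as `decode(encode(0.01))` come back far more accurately than an 8-bit uniform quantiser would manage, which is the point of placing more levels at low amplitudes.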
Adaptive Differential PCM
• Use an adaptive algorithm to predict the next speech
sample
• Quantise and encode the difference between the
prediction and the actual sample
• Can adapt step-size based on the received signal
characteristic
• This reduces the dynamic range of the quantiser, and hence
the number of bits required in representation
• The receiver can use the same predictive algorithm, and
then receives the difference to add to its prediction
Adaptive DPCM
• Adaptive DPCM encoder
Adaptive DPCM
• Predictive – estimate the next sample from the
previous sample(s)
• Then quantise and encode these prediction errors
• This greatly reduces the dynamic range of the
input samples, and so can reduce the number of
bits/sample and so reduce the data rate
Adaptive DPCM
• Prediction filters are simple FIR filters
• For example, gradient prediction: $z_n = w_{n-1} + \frac{w_{n-1} - w_{n-2}}{T_s} T_s$
• This gives: $z_n = 2w_{n-1} - w_{n-2}$
• This is the simplest, first-order estimate
• General form would be $z_n = \sum_{i=1}^{K} a_i w_{n-i}$
• Then encode prediction error: $e_n = w_n - z_n$
Adaptive DPCM
• The filter coefficients can be made adaptive
• The quantisation step size can be made adaptive
• Slowly varying signal, step size is small
• Rapidly varying signal, step size is large
• Perceptual quality requires 4 bits/sample – 32kbps
data rate -> still too high for cellular applications
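The prediction and error-coding steps can be sketched as below. This is a simplified, non-adaptive illustration: it uses the gradient predictor (next sample = 2 × previous − the one before), a fixed step size, and unbounded integer codes, all of which are assumptions rather than the scheme of any particular ADPCM standard. The encoder predicts from reconstructed samples so that the decoder can form exactly the same prediction.

```python
def predict(history):
    """Gradient prediction: z_n = 2*w_{n-1} - w_{n-2}."""
    return 2 * history[-1] - history[-2]


def dpcm_encode(samples, step):
    """Quantise the prediction error of each sample with a fixed step size."""
    recon = [0.0, 0.0]          # reconstructed history, shared with the decoder
    codes = []
    for w in samples:
        z = predict(recon)
        q = round((w - z) / step)   # quantised prediction error e_n = w_n - z_n
        codes.append(q)
        recon.append(z + q * step)
    return codes


def dpcm_decode(codes, step):
    """Rebuild the samples by adding each decoded error to the prediction."""
    recon = [0.0, 0.0]
    for q in codes:
        recon.append(predict(recon) + q * step)
    return recon[2:]
```

Because the quantiser sits inside the prediction loop, the reconstruction error of every sample stays within half a step. An adaptive version would additionally vary `step` (and possibly the predictor) with the signal.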
Linear Predictive Coding
• Able to achieve bit rates as low as 2kbps; good
perceptual quality needs of the order of 10kbps
• Speech is heavily structured
• Rather than directly encode the speech samples,
develop a model of the speaker’s vocal tract and an
excitation to this model
• Send the model parameters and the excitation to
the receiver, which can then regenerate the speech
sample
• This greatly reduces the data rate needed
Linear Predictive Coding
• Speech generation model
Speech waveforms
• Voiced and unvoiced speech
Linear Predictive Coding
• Analysis by synthesis principles
Linear Predictive Coding
• Classify speech samples as one of two types –
voiced or unvoiced
• Voiced can be modelled as a periodic input
• Unvoiced speech as random noise input
• Model the human vocal tract as an all-pole filter
• Analysis by synthesis – iterative (feedback) system.
Use current model of the speech generation process
to generate current speech sample. Based on the
error, refine the model until the generated and real
samples match as closely as desired
Linear Predictive Coding
• MMSE ideas. However, the error is generally
weighted by a perceptual weighting function
• This weighting function can account for auditory
masking phenomena
• Also seek to enhance formant frequencies
LPC Principles
• Short-term prediction – predict the current speech
sample from previous samples (generally 8-16)
• $\tilde{s}[n] = \sum_{k=1}^{p} a_k s[n-k]$
• This is equivalent to an all-pole prediction filter: $H(z) = \frac{1}{1 - P_s(z)}$, where $P_s(z) = \sum_{k=1}^{p} a_k z^{-k}$
• $P_s(z)$ is the predictor used in the analysis filter $1 - P_s(z)$. $H(z)$ is the
synthesis filter
• Residuals (prediction errors): $\tilde{s}[n] = \sum_{k=1}^{p} a_k \tilde{s}[n-k] + r[n]$, i.e. $\tilde{S}(z) = \frac{1}{1 - P_s(z)} R(z)$
LPC Principles
• Prediction residuals can be used as inputs to the
synthesis filter at the receiver to regenerate the
speech samples
• Aim is to choose the filter coefficients to minimise
mean square prediction error: $E = \sum_{n} \left( s[n] - \tilde{s}[n] \right)^2$
• More generally this can be a perceptually
weighted mean
• The average is done over a 10-20ms speech frame
LPC Principles
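• Select the filter coefficients $\{a_k\}$ to minimise the MSE
• This gives the equation: $\frac{\partial E}{\partial a_i} = -2 \sum_{n=1}^{N} s[n-i] \left( s[n] - \sum_{k=1}^{p} a_k s[n-k] \right) = 0$, for $i = 1, \ldots, p$
• Correlation matrix over input speech samples: $C = [C_{mk}]$, with $C_{mk} = \sum_{n=1}^{N} s[n-m]\, s[n-k]$
• The optimal filter coefficients can thus be found
from a matrix equation: $\sum_{k=1}^{p} C_{ik} a_k = C_{i0}$ for $i = 1, \ldots, p$, where $\mathbf{a} = [a_1, a_2, \ldots, a_p]$. With the convention $a_0 = -1$ this is written compactly as $\sum_{k=0}^{p} C_{ik} a_k = 0$, i.e. $C\mathbf{a} = \mathbf{0}$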
LPC Principles
• This is fundamentally a matrix inversion problem
• The challenge is to find computationally efficient
methods to perform it
• Common methods are:
• Autocorrelation method
• Covariance method
• Both employ an algorithm, called Durbin’s
algorithm, to do the matrix inversion using
reflection coefficients
• These are a set of coefficients related to $\{a_k\}$
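A minimal sketch of the autocorrelation method with the Levinson-Durbin recursion is given below. It returns the predictor coefficients, the reflection coefficients and the final prediction error energy. The function names are illustrative, and the sign convention for the reflection coefficients varies between texts; here each one equals the last predictor coefficient of its order.

```python
def autocorrelation(frame, order):
    """R[lag] = sum_n s[n] * s[n - lag], for lag = 0..order."""
    n = len(frame)
    return [sum(frame[i] * frame[i - lag] for i in range(lag, n))
            for lag in range(order + 1)]


def levinson_durbin(r, order):
    """Solve the normal equations for a_1..a_p without explicit inversion."""
    a = [0.0] * (order + 1)      # a[1..order]; a[0] is unused
    reflection = []
    err = r[0]                   # prediction error energy at order 0
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err            # reflection coefficient for order i
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        reflection.append(k)
        err *= (1.0 - k * k)     # error energy shrinks at each order
    return a[1:], reflection, err
```

For instance, `levinson_durbin(autocorrelation(frame, 8), 8)` would give an 8th order predictor for one (windowed) speech frame. The cost grows as the square of the order rather than the cube needed for a general matrix inversion.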
LPC Principles
• Often the reflection coefficients are quantised and
encoded in place of the filter coefficients
• Can be directly employed in lattice
implementations of digital filters
• An alternative approach, then, is to compute these
reflection coefficients directly from the speech
samples
• A popular technique using this idea is the
‘covariance lattice method’, employing what is
called the Burg algorithm
Basic LPC System
• Achieve bit rates as low as 2kbps
• Improvements are needed to achieve the speech quality
expected in cellular networks
LPC Refinements
Techniques can be employed to enhance this basic
LPC system, to achieve desired speech quality
1.Speech frame windowing
2.Non-uniform quantisation of filter parameters
3.Long-term prediction filter
4.A weighted error filter
5.Representation of the excitation signal
Speech Frame Windowing
• Typically determine LPC filter coefficients once per
speech frame – 20ms (160 samples)
• Divide frame into sub-frames (4-7ms in duration)
to determine excitation signals
• A window is used to prevent discontinuities
between speech frames
• Hamming window is common: $w[n] = 1.5863 \left( 0.54 - 0.46 \cos\frac{2\pi n}{L_a} \right)$
Non-uniform Quantisation
• Uniform quantisation of filter coefficients $\{a_k\}$
would require 10 bits/coefficient, for acceptable
quality
• An alternative is the reflection coefficients, $\{k_i\}$
• The most common approach is to encode a
non-linear mapping of the reflection coefficients
• Inverse sine transform: $\sin^{-1}(k_i)$
• Log-area ratios (LAR): $LAR_i = \log\frac{1 + k_i}{1 - k_i}$
• Can use piecewise linear approximations to the
above (the sign is that of $k_i$):
$LAR_i \approx \begin{cases} k_i & \text{if } |k_i| < 0.675 \\ \pm\left( 2|k_i| - 0.675 \right) & \text{if } 0.675 < |k_i| < 0.95 \\ \pm\left( 8|k_i| - 6.375 \right) & \text{if } 0.95 < |k_i| < 1.0 \end{cases}$
Non-uniform Quantisation
• Non-uniform quantisation of parameters can
significantly reduce the overheads in bits per
parameter
• For an 8th order LPC filter, the use of LAR can reduce
the quantisation overheads from 10 bits/coefficient,
so 80 bits total, to 6+6+5+5+4+4+3+3 = 36 bits in total
for the equivalent LARs.
• This is a significant saving in total bit rate
• A different approach, called line spectral pairs, is
to identify and encode the formant frequencies
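The log-area ratio mapping, its piecewise linear approximation and the bit saving can be sketched as follows. The base-10 logarithm is an assumption: the slides do not state the base, but base 10 is the one for which the three-segment approximation (slopes 1, 2 and 8) follows the exact curve.

```python
import math

LAR_BITS = [6, 6, 5, 5, 4, 4, 3, 3]   # bit allocation quoted above


def lar_exact(k):
    """Log-area ratio of a reflection coefficient k, with |k| < 1."""
    return math.log10((1 + k) / (1 - k))


def lar_approx(k):
    """Three-segment piecewise linear approximation; keeps the sign of k."""
    mag = abs(k)
    if mag < 0.675:
        value = mag
    elif mag < 0.95:
        value = 2 * mag - 0.675
    else:
        value = 8 * mag - 6.375
    return math.copysign(value, k)


def bits_saved(order=8, uniform_bits=10):
    """Bits per frame saved relative to uniform coefficient quantisation."""
    return order * uniform_bits - sum(LAR_BITS)
```

The segments join continuously at 0.675 and 0.95, and `bits_saved()` gives the 80 − 36 = 44 bits per frame saved by the LAR allocation.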
Long-term Prediction
• Correlations in speech span many speech samples,
particularly pitch effects in voiced speech
• A long-term predictor can be used to model the
finer structure in the speech spectrum
• Firstly perform short-term prediction, and find the
residual r[n].
• Then perform a long-term prediction
• In the one-tap case, this is to form the filter to minimise
the long-term residual: $e[n] = r[n] - G\, r[n-\alpha]$
Long-term Prediction
• The resulting filter is called the pitch synthesis
filter
• For first order case, $\frac{1}{P(z)} = \frac{1}{1 - G z^{-\alpha}}$
• α represents the pitch delay, and G the pitch gain.
• These coefficients would be found using the
MMSE criterion
• The general pitch synthesis filter would be $\frac{1}{P(z)} = \frac{1}{1 - P_l(z)}$,
where the long-term predictor is $P_l(z) = \sum_{k=-m_1}^{m_2} G_k z^{-(\alpha + k)}$
• Either first order, $m_1 = m_2 = 0$, or third order, $m_1 = m_2 = 1$,
is sufficient in practice
Long-term Prediction
• LTP is usually performed once per sub-frame
• Delay encoded with 7 bits, and gain as 3-4 bits
• Cascaded with the STP filter
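A minimal sketch of a one-tap LTP search over one sub-frame: for each candidate delay, the MMSE gain is the cross-correlation between the current and delayed residual divided by the energy of the delayed residual, and the delay with the smallest error is kept. The function name, argument layout and exhaustive search are assumptions for illustration; a 7-bit delay field allows up to 128 candidate lags.

```python
def ltp_search(r, start, length, min_lag, max_lag):
    """Return (delay, gain, error) minimising sum (r[n] - G*r[n - delay])^2.

    r is the short-term residual; the sub-frame is r[start:start + length].
    Requires start >= max_lag so that the delayed segment exists.
    """
    current = r[start:start + length]
    best = (min_lag, 0.0, float("inf"))
    for lag in range(min_lag, max_lag + 1):
        past = r[start - lag:start - lag + length]
        energy = sum(p * p for p in past)
        if energy == 0.0:
            continue                      # no usable history at this lag
        gain = sum(c * p for c, p in zip(current, past)) / energy
        err = sum((c - gain * p) ** 2 for c, p in zip(current, past))
        if err < best[2]:
            best = (lag, gain, err)
    return best
```

In a codec the delay and gain returned here would then be quantised (for example 7 bits and 3-4 bits, as above) before transmission.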
Long-term Prediction
• Effective in removing pitch effects
Weighted Error Filter
• Based on the principle of Auditory masking
• Both temporal and spectral masking phenomena
• Spectral masking – minimum level above which
signals are detectable
• Hence, audible noise comes from spectral regions with
little to no speech components
Weighted Error Filter
• Quantisation noise will be more significant in
regions with no speech signal energy
• We would like to weight the quantisation so that
less emphasis is placed on the quantisation of the
dominant spectral regions
• Warp the spectrum to emphasise the silent parts
of the spectrum: $W(z) = \frac{1 - P_s(z)}{1 - P_s(z/\gamma)}$
• Here, $0 < \gamma < 1$ is the degree of de-emphasis
• For γ = 0, we get the usual MMSE situation
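Assuming the weighting filter has the common form W(z) = (1 − P_s(z)) / (1 − P_s(z/γ)), its denominator is obtained by scaling the k-th short-term predictor coefficient by γ^k. The sketch below builds both polynomials; the function names are illustrative only.

```python
def bandwidth_expand(a, gamma):
    """Coefficients of P_s(z/gamma): a_k becomes a_k * gamma**k, k = 1..p."""
    return [ak * gamma ** k for k, ak in enumerate(a, start=1)]


def weighting_filter(a, gamma):
    """Numerator and denominator of W(z) as polynomials in z^-1."""
    numerator = [1.0] + [-ak for ak in a]
    denominator = [1.0] + [-ak for ak in bandwidth_expand(a, gamma)]
    return numerator, denominator
```

With `gamma = 1` the two polynomials coincide, so W(z) = 1; as `gamma` tends to 0 the denominator tends to 1 and W(z) tends to 1 − P_s(z).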
Weighted Error Filter
• The diagram below shows the use of the
perceptual spectral weighting filter in the analysis
by synthesis speech codec
Excitation Signals
• The assumption that speech is either voiced or
unvoiced is too simplistic for good performing
speech codecs
• Aim to find some more realistic excitations
• Trade-off – accuracy of excitation signal against
the number of bits needed to represent it
Excitation Signals
• Most common techniques: Multi-pulse
Excitations, Regular Pulse Excitation, Codebook
Excitation
Multi-Pulse Excitation
• Determine the location and amplitude of several
pulses to give the best fit (via weighted
perceptual MSE) on residuals after STP and LTP
• Typical is 4 pulses per sub-frame (say 5ms
segment with 40 samples)
• Express as $v[n] = \sum_{k=0}^{M-1} \beta_k\, \delta[n - m_k]$
Multi-Pulse Excitation
• Quantise the pulse positions and amplitudes and
send them to the decoder
• Iterative techniques can be used to determine the
optimal pulse locations
• Combinatorial encoding can be used to represent
pulse positions – can achieve 17 bits for 4 pulse
positions (naively we’d need 6 bits/position)
• Amplitudes – determine scaling factor (max value
or RMS) -> 6 bits for it
• Other, scaled amplitudes encoded with 3
bits/amplitude
Multi-Pulse Excitation
• The typical MPE excitation requires 35 bits per
5ms sub-frame to encode
• This implies a data rate of 7kbps for the excitation
• The excitation is by far the largest part of an LPC
system in terms of data rate
• This is then the major differentiator between
different LPC techniques
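The MPE bit budget can be checked with a short calculation. Combinatorial coding only needs enough bits to index every way of choosing 4 positions out of 40, which is where the 17 bits come from. The function names are illustrative.

```python
import math


def position_bits_naive(pulses, slots):
    """Each pulse position coded independently."""
    return pulses * math.ceil(math.log2(slots))


def position_bits_combinatorial(pulses, slots):
    """One index over all C(slots, pulses) position sets."""
    return math.ceil(math.log2(math.comb(slots, pulses)))


def mpe_bits(pulses=4, slots=40, scale_bits=6, amp_bits=3):
    """Total excitation bits per sub-frame: positions + scale + amplitudes."""
    return (position_bits_combinatorial(pulses, slots)
            + scale_bits + pulses * amp_bits)


def rate_bps(bits, subframe_ms=5):
    """Bit rate implied by sending `bits` every sub-frame."""
    return bits * 1000 / subframe_ms
```

Here `mpe_bits()` gives 17 + 6 + 12 = 35 bits, and `rate_bps(35)` gives 7kbps, against 24 bits for the positions alone with naive coding.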
Regular Pulse Excitation
• The excitation sequence is a set of regular pulses
• For example, one of four possible sets of pulses
is chosen:
• The aim is then to determine the pulse set and
the amplitudes that minimise the residual error
Regular Pulse Excitation
• Example of an RPE determination algorithm
1. Do short-term LPC analysis: $r[n] = s[n] - \sum_{i=1}^{p} a_i s[n-i]$
2. Determine LTP parameters, α and G
3. Compute LTP residual: $d[n] = r[n] - G\, r[n-\alpha]$
4. Pass these residuals through a smoothing filter, $y[n] = \sum_{i=0}^{N-1} d[i]\, z[n-i]$,
where z[n] is some windowing function (say
Hamming-windowed sinc): $z[n] = \mathrm{sinc}\left( \frac{\pi n}{D} \right) \left( 0.54 + 0.46 \cos\frac{\pi n}{Q} \right)$, for $-Q \le n \le Q$
Regular Pulse Excitation
5. Decompose y[n] into D sets (say D = 4 – the
number of different pulse sets), and compute the
energy of each set. For each k = 1 up to D: $\beta_i^{(k)} = y[k + iD]$ and $E^{(k)} = \sum_{i=0}^{M-1} \left( \beta_i^{(k)} \right)^2$
6. Choose the pulse set with the maximum energy
Regular Pulse Excitation
• The regular spacing of the pulses makes this very
computationally efficient (say, compared to MPE)
• For D = 4 and N = 40 (for a 5ms frame):
• 2 bits for the starting position of the pulse train
• For the 10 pulse amplitudes:
• 6 bits for the amplitude scaling factor
• 3 bits for each pulse (10 x 3 = 30 bits)
• The total bits is then 38 bits/5 ms frame
• The resultant data rate is 7.6 kbps
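Steps 5 and 6 of the RPE algorithm, together with the bit count above, can be sketched as follows. The grid index is 0-based here (k = 0..D−1) rather than 1..D, and the function names are illustrative.

```python
def select_grid(y, d=4):
    """Split y into d interleaved pulse sets and keep the most energetic one.

    Returns (grid index, pulse amplitudes); set k holds y[k], y[k+d], ...
    """
    best_k, best_energy = 0, -1.0
    for k in range(d):
        energy = sum(v * v for v in y[k::d])
        if energy > best_energy:
            best_k, best_energy = k, energy
    return best_k, y[best_k::d]


def rpe_bits(n=40, d=4, scale_bits=6, amp_bits=3):
    """Bits per sub-frame: grid position + scale factor + pulse amplitudes."""
    pulses = n // d
    grid_bits = (d - 1).bit_length()     # 2 bits for four grids
    return grid_bits + scale_bits + pulses * amp_bits
```

Only d energy sums are needed per sub-frame, which is why the search is so much cheaper than the position search in MPE. `rpe_bits()` returns 2 + 6 + 30 = 38 bits.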
Code Excitation
• Observe that the largest component of MPE and
RPE is the excitation sequence
• The point of LPC is that, after STP and LTP the
residual should just look like random noise
• Consider the excitation instead as a random noise
signal
• Have a large codebook of random excitations, and
choose the best one
Code Excitation
• Encoder searches through the codebook to find
the excitation with the lowest MSE output
• It then transmits the index of this excitation to
the receiver
• Acceptable performance requires a codebook of,
say, 1024 random sequences populated with
Gaussian samples
• For a 5ms sub-frame, would use 10 bits for the
codeword index, and say 5 bits for overall gain
(frame energy)
Code Excitation
• This results in 15 bits/sub-frame
• Data rate is only 3 kbps!
• CELP produces a huge saving in data rate
• The problem is, it is very slow and
computationally expensive to search through
large codebooks for the best excitation sequence
• This is the R&D emphasis in CELP – how to
structure the codebook to reduce computational
overheads
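The exhaustive search that makes CELP expensive can be sketched as below. To keep it short, this illustration matches the codeword directly to a target vector and leaves out the synthesis and perceptual weighting filters that a real encoder would apply to every candidate; that simplification, the seed and the function names are assumptions.

```python
import random


def make_codebook(size=1024, length=40, seed=0):
    """Codebook of Gaussian random sequences (one per 5ms sub-frame)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(length)] for _ in range(size)]


def search_codebook(target, codebook):
    """Return (index, gain) of the codeword with the lowest squared error."""
    best_index, best_gain, best_err = -1, 0.0, float("inf")
    for index, code in enumerate(codebook):
        energy = sum(c * c for c in code)
        gain = sum(t * c for t, c in zip(target, code)) / energy  # MMSE gain
        err = sum((t - gain * c) ** 2 for t, c in zip(target, code))
        if err < best_err:
            best_index, best_gain, best_err = index, gain, err
    return best_index, best_gain


def celp_bits(codebook_size=1024, gain_bits=5):
    """Bits per sub-frame: codeword index plus overall gain."""
    return (codebook_size - 1).bit_length() + gain_bits
```

Every sub-frame requires a pass over all 1024 codewords, and with the filters included each pass involves filtering a 40-sample sequence, which is the computational burden described above. `celp_bits()` gives the 15 bits per sub-frame, i.e. 3kbps.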
CELP
• Schematic diagram of the principles
CELP
• To implement in real-time, there are many
proposed ways to structure and populate the
codebooks
• Sparse excitation codebooks (a few random
pulses)
• Ternary codebooks (three state)
• Algebraic codebooks (generated by an FEC code)
• Vector Sum codebooks (form excitations as a sum
of codewords from smaller codebooks)
VSELP
• Used in USDC – IS-54
VSELP
• Speech frames were 20ms divided into 5ms sub-
frames
• Used three 128 size codebooks for excitations
• One of the codebooks implemented the LTP loop
-> an adaptive codebook
• Gains of codeword determined each sub-frame
and vector quantised with 8 bits
• Each of the three excitations required a 7 bit
index
• Thus, 116 bits per frame for excitation
VSELP
• The LPC filter used in IS-54 was 10th order, and
coefficients quantised as 38 bits once per frame
• An additional 5 bits were used per speech frame
for overall sample energy
• Data rate was 159 bits per 20ms frame, or
7.95kbps
• Channel coding was then employed to produce
the transmitted data rate of 13kbps
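The VSELP figures quoted on these slides add up as follows. This is only a check of the arithmetic, using the numbers as stated here; the parameter names are illustrative.

```python
def vselp_bits_per_frame(codebooks=3, index_bits=7, gain_bits=8,
                         subframes=4, lpc_bits=38, energy_bits=5):
    """Return (excitation bits, total bits) per 20ms frame."""
    excitation = subframes * (codebooks * index_bits + gain_bits)
    total = excitation + lpc_bits + energy_bits
    return excitation, total


def rate_bps(bits, frame_ms=20):
    """Bit rate implied by sending `bits` every frame."""
    return bits * 1000 / frame_ms
```

Each sub-frame carries 3 × 7 + 8 = 29 bits, giving 116 bits of excitation per frame, 159 bits in total and 7.95kbps.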
CELP
• Code Excited Linear Predictive Coding (CELP)
tends to be the choice in 3G cellular standards
• This has been made possible through the increase
in computing power
• Thus, modern speech codecs are able to achieve
lower data rates than previously possible
The GSM Codec
• GSM used RPE-LTP for speech coding
GSM Codec
• Speech was sampled at 8kHz with 13bits/sample
• Pre-emphasis was used to enhance high-frequency
content (and improve numerical
precision) -> filter $H(z) = 1 - 0.9 z^{-1}$
• Broken into 20ms frames of 160 samples
• Sub-frames are 5ms of 40 samples
• Hamming window to remove boundary effects
between frames: $s_{psw}[n] = 1.5863 \cdot s_{ps}[n] \cdot \left( 0.54 - 0.46 \cos\frac{2\pi n}{L} \right)$
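The pre-emphasis filter and the scaled Hamming window described on this slide can be sketched as below. Taking L to be the frame length is an assumption (the slide does not define it), and the function names are illustrative.

```python
import math


def pre_emphasis(samples, coeff=0.9):
    """Apply H(z) = 1 - coeff * z^-1, i.e. y[n] = x[n] - coeff * x[n-1]."""
    out, prev = [], 0.0
    for x in samples:
        out.append(x - coeff * prev)
        prev = x
    return out


def hamming_window(samples, scale=1.5863):
    """Weight a frame by scale * (0.54 - 0.46 * cos(2*pi*n / L)), L = len(frame)."""
    length = len(samples)
    return [scale * x * (0.54 - 0.46 * math.cos(2 * math.pi * n / length))
            for n, x in enumerate(samples)]
```

A 160-sample frame would be processed as `hamming_window(pre_emphasis(frame))` before the short-term analysis.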
GSM Codec
• STP done with an 8th order LPC filter
• Encode the LAR using 6,6,5,5,4,4,3,3 bits, so 36
bits in total
• STP filter is updated once per frame
• LTP is done per sub-frame
• 2 bits for LTP gain, and 7 bits for the delay
• Regular Pulse Excitation is used – D = 4 and N = 13
(one of four possible pulse sets with 13 pulses per
set)
GSM Codec
• The initial RPE phase takes 2 bits
• Amplitude scaling factor is 6 bits
• Then, there are 13 amplitudes with 3 bits per
amplitude
• The result is 260 bits/frame
• Speech codec data rate is 13kbps for GSM
GSM Codec
• Summary of GSM codec output bits:
Parameter                      Bits per 20ms frame
8 STP LAR coefficients         36
4 LTP gains                    4 × 2 = 8
4 LTP delays                   4 × 7 = 28
4 RPE grid positions           4 × 2 = 8
4 RPE block maxima             4 × 6 = 24
4 × 13 RPE pulse amplitudes    52 × 3 = 156
Total                          260
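The summary can be verified by adding up the allocation. The dictionary keys below are descriptive labels only, not field names from the GSM specification.

```python
# Bits per 20ms frame, as listed in the summary above.
GSM_FRAME_BITS = {
    "STP LAR coefficients": 36,
    "LTP gains": 4 * 2,
    "LTP delays": 4 * 7,
    "RPE grid positions": 4 * 2,
    "RPE block maxima": 4 * 6,
    "RPE pulse amplitudes": 4 * 13 * 3,
}


def total_bits():
    """Total codec output bits per frame."""
    return sum(GSM_FRAME_BITS.values())


def bit_rate_bps(frame_ms=20):
    """Codec output bit rate."""
    return total_bits() * 1000 // frame_ms
```

The total is 260 bits per frame, of which 188 bits describe the RPE excitation, and the rate is 13kbps.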
GSM Codec
The structure of the decoder: