Speech & Audio Coding - LiU

Speech & Audio Coding

TSBK01 Image Coding and Data Compression

Lecture 11, 2003

Jörgen Ahlberg

Outline

• Part I - Speech

– Speech

– History of speech synthesis & coding

– Speech coding methods

• Part II – Audio

– Psychoacoustic models

– MPEG-4 Audio

Speech Production

• The human vocal apparatus consists of:

– lungs

– trachea (wind pipe)

– larynx

• contains 2 folds of skin called vocal cords which blow apart and flap together as air is forced through

– oral tract

– nasal tract


The Speech Signal

Elements of the speech signal:

– spectral resonance (formants, moving)

– periodic excitation (voicing, pitched), pitch contour

– noise excitation (fricatives, unvoiced, no pitch)

– transients (stop-release bursts)

– amplitude modulation (nasals, approximants)

– timing

• Vowels: characterised by formants, generally voiced.

– Tongue and lips: effect of rounding.

– Examples of vowels: a, e, i, o, u, a, ah, oh.

– Vibration of vocal cords: male […] Hz, female up to […] Hz.

– Vowels have on average much longer duration than consonants.

– Most of the acoustic energy of a speech signal is carried by vowels.

[Figure: F1-F2 chart of formant positions]

History of Speech Coding

• […] Channel vocoder: first analysis-by-synthesis system, developed by Homer Dudley of AT&T labs (VODER).

[Figure: VODER, the architecture]

• […] PCM: first conceived by Paul M. Rainey and independently by Alex Reeves (AT&T Paris) in […]. Deployed in US PSTN in 1962.

• OVE formant synthesis (Gunnar Fant, KTH).

• […] µ-law encoding proposed (standardised for telephone network in 1972, G.711).

• […] Delta modulation proposed, differential PCM invented.

• […] ADPCM developed.

• […] CELP vocoder proposed (majority of coding standards for speech signals today use a variation on CELP).

Source-filter Model of Speech Production

• The signal from a source is filtered by a time-varying filter with resonant properties similar to those of the vocal tract.

• The gain controls Av and AN determine the intensity of voiced and unvoiced excitation.

• The higher formants are attenuated by […] dB/octave (due to the nature of our speech organs).

• This is an over-simplified model of speech production. However, it is very often adequate for understanding the basic principles.

Speech Coding Strategies

1. PCM

• Invented 1926, deployed 1962.

• The speech signal is sampled at 8 kHz.

• Uniform quantization requires >10 bits/sample.

• Non-uniform quantization (G.711, 1972)

• Quantizing the companded signal y to 8 bits -> 8 kHz × 8 bits/sample = 64 kbit/s.
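The non-uniform quantization step can be illustrated with a small sketch. It assumes the continuous µ-law curve with µ = 255 and a plain 8-bit uniform quantizer on the companded value; G.711 itself uses a segmented approximation of this curve, so the function names and details below are illustrative only.

```python
import math

MU = 255  # assumption: the mu value commonly quoted for mu-law companding


def compress(x, mu=MU):
    """Map x in [-1, 1] to y in [-1, 1] with the continuous mu-law curve."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def expand(y, mu=MU):
    """Inverse of compress()."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


def quantize_8bit(y):
    """Uniformly quantize y in [-1, 1] to one of 256 integer codes."""
    return min(255, int((y + 1.0) / 2.0 * 256))


def dequantize_8bit(code):
    """Return the midpoint of the quantization cell of an 8-bit code."""
    return (code + 0.5) / 256 * 2.0 - 1.0


def bitrate(sample_rate_hz=8000, bits_per_sample=8):
    """Bits per second for one PCM channel."""
    return sample_rate_hz * bits_per_sample


if __name__ == "__main__":
    x = 0.01  # a quiet sample
    companded = expand(dequantize_8bit(quantize_8bit(compress(x))))
    uniform = dequantize_8bit(quantize_8bit(x))
    print("companded error:", abs(companded - x))
    print("uniform error:  ", abs(uniform - x))
    print("bitrate:", bitrate(), "bit/s")
```

For quiet samples the companded path gives a smaller error than uniform 8-bit quantization, which is why 8 bits can suffice where uniform quantization needs more than 10.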


2. Adaptive DPCM

• Example: G.726 (1974)

• Adaptive predictor based on six previous differences.

• Gain-adaptive quantizer with 15 levels -> 32 kbit/s.
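A heavily simplified sketch of the idea follows. It is not G.726: the predictor here is just the previous reconstructed sample (instead of an adaptive predictor over six previous differences), and the step-size adaptation rule and its constants are assumptions chosen for illustration. It does show the two essential points: only the quantized difference is transmitted, and the decoder can track the step size without side information.

```python
import math

LEVELS = 15                                   # codes -7 .. 7
MAX_CODE = LEVELS // 2
BITS_PER_CODE = math.ceil(math.log2(LEVELS))  # 4 bits per sample


def _adapt(step, code):
    """Assumed rule: grow the step after large codes, shrink it after small ones."""
    if abs(code) >= 5:
        factor = 1.6
    elif abs(code) <= 1:
        factor = 0.9
    else:
        factor = 1.0
    return min(max(step * factor, 1e-4), 1.0)


def encode(samples, step=0.05):
    """Return one code in -7..7 per sample."""
    codes, prediction = [], 0.0
    for x in samples:
        code = round((x - prediction) / step)
        code = max(-MAX_CODE, min(MAX_CODE, code))
        codes.append(code)
        prediction += code * step   # reconstructed sample is the next prediction
        step = _adapt(step, code)   # depends only on the transmitted code
    return codes


def decode(codes, step=0.05):
    """Mirror of encode(): rebuild the samples from the codes alone."""
    out, prediction = [], 0.0
    for code in codes:
        prediction += code * step
        out.append(prediction)
        step = _adapt(step, code)
    return out


if __name__ == "__main__":
    signal = [0.5 * math.sin(2 * math.pi * 200 * n / 8000) for n in range(80)]
    codes = encode(signal)
    print(codes[:10])
    print(BITS_PER_CODE * 8000, "bit/s")
```

With 15 levels each code fits in 4 bits, and 4 bits at 8 kHz gives the 32 kbit/s quoted above.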


3. Model-based Speech Coding

• Advanced speech coders are based on models of how speech is produced:

Excitation source -> Vocal tract

An Excitation Source

[Figure: excitation source consisting of a noise generator and a pulse generator; the pulse generator is controlled by the pitch]

Vocal Tract Filter 1: A Fixed Filter Bank

[Figure: bank of band-pass (BP) filters with gains g1, g2, ..., gn]

Vocal Tract Filter 2: A Controllable Filter

Linear Predictive Coding (LPC)

• The controllable filter is modelled as

$y_n = \sum_{i=1}^{M} a_i y_{n-i} + G \varepsilon_n$

where $\varepsilon_n$ is the input signal, $y_n$ is the output and $M$ is the filter order.

• We need to estimate the vocal tract parameters ($a_i$ and $G$) and the excitation parameters (pitch, v/uv).

• Typically the source signal is divided into short segments and the parameters are estimated for each segment.

• Example: The speech signal is sampled at 8 kHz and divided into segments of 180 samples (22.5 ms/segment).
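To make the filter equation concrete, here is a minimal sketch of the synthesis side: an excitation segment is passed through the recursion with given coefficients and gain. The function names, the unit-pulse voiced excitation and the Gaussian unvoiced excitation are assumptions for illustration, not part of the lecture.

```python
import random

SAMPLE_RATE_HZ = 8000
SEGMENT_LENGTH = 180  # samples per segment, i.e. 22.5 ms at 8 kHz


def lpc_synthesize(excitation, a, gain):
    """y[n] = sum_i a[i-1] * y[n-i] + gain * excitation[n], zero initial state."""
    y = []
    for n, e in enumerate(excitation):
        acc = gain * e
        for i, a_i in enumerate(a, start=1):
            if n - i >= 0:
                acc += a_i * y[n - i]
        y.append(acc)
    return y


def pulse_train(length, pitch_period):
    """Voiced excitation: a unit pulse every pitch_period samples."""
    return [1.0 if n % pitch_period == 0 else 0.0 for n in range(length)]


def noise(length, seed=0):
    """Unvoiced excitation: white Gaussian noise (seeded for repeatability)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(length)]


if __name__ == "__main__":
    voiced = lpc_synthesize(pulse_train(SEGMENT_LENGTH, 60), a=[0.5], gain=2.0)
    unvoiced = lpc_synthesize(noise(SEGMENT_LENGTH), a=[0.5], gain=0.1)
    print(voiced[:4])
    print(1000 * SEGMENT_LENGTH / SAMPLE_RATE_HZ, "ms per segment")
```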

Typical Scheme of an LPC Coder

[Figure: block diagram with a noise generator and a pulse generator (input: pitch) feeding the vocal tract filter; control parameters: v/uv, gain, filter coeffs]

Estimating the Parameters

• v/uv estimation

– Based on energy and frequency spectrum.

• Pitch-period estimation

– Look for periodicity, either via the autocorrelation function (acf) or some other measure that gives a minimum value when the candidate lag p equals the pitch period.

– Typical pitch-periods: 20 - 160 samples.
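One measure with the stated property (a minimum at the pitch period) is the average magnitude difference function (AMDF). Whether this is the measure shown on the original slide is an assumption; the sketch below only illustrates the principle, using the 20 to 160 sample search range given above.

```python
def amdf(y, p):
    """Average magnitude difference between y[n] and y[n - p]."""
    diffs = [abs(y[n] - y[n - p]) for n in range(p, len(y))]
    return sum(diffs) / len(diffs)


def estimate_pitch_period(y, p_min=20, p_max=160):
    """Return the smallest lag in [p_min, p_max] that minimizes the AMDF.

    The segment must be longer than p_max samples.
    """
    return min(range(p_min, p_max + 1), key=lambda p: amdf(y, p))


if __name__ == "__main__":
    # Synthetic voiced segment: one pulse every 50 samples.
    segment = [1.0 if n % 50 == 0 else 0.0 for n in range(400)]
    print(estimate_pitch_period(segment))
```

Multiples of the true period also give small AMDF values, so the sketch returns the smallest minimizing lag. A practical estimator needs more care with noisy or only roughly periodic segments.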


• Vocal tract filter estimation

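– Find the filter coefficients $a_i$ that minimize the squared prediction error (averaged over the segment)

$\varepsilon^2 = \left( y_n - \sum_{i=1}^{M} a_i y_{n-i} \right)^2$

By the model $y_n = \sum_{i=1}^{M} a_i y_{n-i} + G \varepsilon_n$, the residual inside the parentheses is the scaled excitation $G \varepsilon_n$.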

– Compare to the computation of optimal predictors (Lecture 7).


• Assuming a stationary signal, the optimal coefficients $\mathbf{a} = (a_1, \dots, a_M)^T$ solve

$R\,\mathbf{a} = \mathbf{p}$

where R and p contain acf values.

• This is called the autocorrelation method.


• Alternatively, in case of a non-stationary signal, the acf values are replaced by covariance values computed directly from the samples of the segment.

• This is called the autocovariance method.
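The autocorrelation method can be sketched as follows. The acf is estimated from one segment, and the resulting Toeplitz system (R built from acf values at lags 0 to M-1, p from lags 1 to M) is solved with the Levinson-Durbin recursion. The choice of solver and all names are assumptions for illustration; any linear solver would give the same coefficients.

```python
def autocorrelation(y, max_lag):
    """r[k] = sum_n y[n] * y[n + k] for k = 0 .. max_lag."""
    return [sum(y[n] * y[n + k] for n in range(len(y) - k))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve R a = p with R[i][j] = r[|i - j|] and p[i] = r[i + 1].

    Returns the predictor coefficients a_1 .. a_order and the remaining
    prediction error energy.
    """
    a = []          # coefficients of the current model order
    error = r[0]    # prediction error energy of the current model order
    for m in range(1, order + 1):
        acc = r[m] - sum(a[i] * r[m - 1 - i] for i in range(m - 1))
        k = acc / error  # reflection (PARCOR) coefficient
        a = [a[i] - k * a[m - 2 - i] for i in range(m - 1)] + [k]
        error *= 1.0 - k * k
    return a, error


if __name__ == "__main__":
    # Output of a first-order all-pole filter with a_1 = 0.5, driven by one pulse.
    y = [0.5 ** n for n in range(60)]
    coeffs, err = levinson_durbin(autocorrelation(y, 2), order=2)
    print(coeffs)
```

The intermediate values k are the partial correlation (PARCOR) coefficients, one of the alternative representations of the vocal tract filter.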

Example

• Coding of parameters using LPC10 (1984):

| Parameter | Bits |
|---|---|
| v/uv | 1 bit |
| Pitch | 6 bits |
| Voiced filter | 46 bits |
| Unvoiced filter | 46 bits |
| Synchronization | 1 bit |
| Sum: | 54 bits -> 2.4 kbit/s |
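The sum and the bitrate can be checked by hand. This assumes that each segment carries only one 46-bit filter description (voiced or unvoiced, selected by the v/uv bit) and that segments are 22.5 ms long, as in the 8 kHz, 180-sample example earlier in the lecture.

```latex
\[
\underbrace{1}_{\text{v/uv}} + \underbrace{6}_{\text{pitch}}
+ \underbrace{46}_{\text{filter}} + \underbrace{1}_{\text{sync}}
= 54\ \text{bits/segment},
\qquad
\frac{54\ \text{bits}}{22.5\ \text{ms}} = 2400\ \text{bit/s} = 2.4\ \text{kbit/s}
\]
```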

The Vocal Tract Filter

• Different representations:

– LPC parameters

– PARCOR (Partial Correlation Coefficients)

– LSF (Line Spectrum Frequencies)

Code Excited Linear Prediction Coding (CELP)

Encoding:

• LPC analysis -> V(z)

• Define perceptual weighting filter. This permits more noise at formant frequencies, where it will be masked by the speech.

• Synthesise speech using each codebook entry in turn as the input to V(z).

• Calculate optimum gain to minimise perceptually weighted error energy in speech frame.

• Select codebook entry that gives lowest error.

• Transmit LPC parameters and codebook index.

Decoding:

• Receive LPC parameters and codebook index.

• Re-synthesise speech using V(z) and codebook entry.

Performance:

• […] kbit/s: MOS […], delay […] ms, […] MIPS

• […] kbit/s: MOS […], delay […] ms, […] MIPS

• […] kbit/s: MOS […], delay […] ms, […] MIPS
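The CELP encoder's codebook search can be sketched as below. For brevity the sketch omits the perceptual weighting filter and minimises the plain squared error, and it takes the LPC coefficients as given. The tiny codebook and all names are hypothetical.

```python
def synthesize(excitation, a):
    """All-pole synthesis filter V(z): y[n] = x[n] + sum_i a[i-1] * y[n-i]."""
    y = []
    for n, x in enumerate(excitation):
        y.append(x + sum(a_i * y[n - i]
                         for i, a_i in enumerate(a, start=1) if n - i >= 0))
    return y


def search_codebook(target, codebook, a):
    """Return (index, gain, error) of the entry that best matches target."""
    best = None
    for index, entry in enumerate(codebook):
        synth = synthesize(entry, a)
        energy = sum(s * s for s in synth)
        if energy == 0.0:
            continue  # an all-zero entry cannot be scaled to match anything
        # Optimum gain for this entry (least-squares projection).
        gain = sum(t * s for t, s in zip(target, synth)) / energy
        error = sum((t - gain * s) ** 2 for t, s in zip(target, synth))
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best


if __name__ == "__main__":
    codebook = [
        [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
        [1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
    ]
    a = [0.5]
    target = [3.0 * s for s in synthesize(codebook[1], a)]
    print(search_codebook(target, codebook, a))
```

Only the winning index and gain (plus the LPC parameters) need to be transmitted; the decoder holds the same codebook and repeats the synthesis step.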

Examples

• G.728

– V(z) is chosen as a large FIR-filter (M ~ 50).

– The gain and FIR-parameters are estimated recursively from previously received samples.

– The code book contains 127 sequences.

• GSM

– The code book contains regular pulse trains with variable frequency and amplitudes.

• MELP

– Mixed excitation linear prediction

– The code book is combined with a noise generator.

Other Variations

• SELP – Self Excited Linear Prediction

• MPLP – Multi-Pulse Excited Linear Prediction

• MBE – Multi-Band Excitation Coding

Quality Levels

| Quality level | Bandwidth | Bitrate |
|---|---|---|
| Synthetic quality | | <4 kbit/s |
| Communication quality | | 4 – 16 kbit/s |
| Network (toll) quality | 300 – 3400 Hz | 16 – 64 kbit/s |
| Broadcast quality | 10 kHz | >64 kbit/s |

Subjective Assessment

• MOS (Mean Opinion Score): result of averaging opinion scores for a set of between […] – […] untrained subjects.

• They rate the quality from 1 to 5 (bad, poor, fair, good, excellent).

• A MOS of […] or higher defines good or toll quality (network quality): the reconstructed signal is generally indistinguishable from the original.

• A MOS between […] – […] defines communication quality (telephone communications).

• A MOS between […] – […] implies synthetic quality.

• In digital communications, speech quality is classified into four general categories, namely: broadcast, network or toll, communications, and synthetic.

• Broadcast wideband speech (high quality "commentary" speech) is generally achieved at rates above 64 kbit/s.

• DRT (Diagnostic Rhyme Test): listeners should recognise one of the two possible words in a set of rhyming pairs (e.g. meat/heat).

• DAM (Diagnostic Acceptability Measure): trained listeners judge various factors, e.g. muffledness, buzziness, intelligibility.

[Figure: quality versus data rate ([…] kHz sampling rate)]
