Speech & Audio Coding - LiU
Speech & Audio Coding
TSBK01 Image Coding and Data Compression
Lecture 11, 2003
Jörgen Ahlberg
Outline
• Part I - Speech
– Speech
– History of speech synthesis & coding
– Speech coding methods
• Part II – Audio
– Psychoacoustic models
– MPEG-4 Audio
Speech Production
• The human vocal apparatus consists of:
– lungs
– trachea (wind pipe)
– larynx
• contains 2 folds of skin called vocal cords which blow apart and flap together as air is forced through
– oral tract
– nasal tract
The Speech Signal
• Elements of the speech signal:
– spectral resonance (formants, moving)
– periodic excitation (voicing, pitched), pitch contour
– noise excitation (fricatives, unvoiced, no pitch)
– transients (stop-release bursts)
– amplitude modulation (nasals, approximants)
– timing
The Speech Signal
• Vowels: characterised by formants; generally voiced.
• Tongue and lips: effect of rounding.
• Examples of vowels: a, e, i, o, u, a, ah, oh.
• Vibration of vocal cords: male … to … Hz, female up to … Hz.
• Vowels have on average a much longer duration than consonants.
• Most of the acoustic energy of a speech signal is carried by vowels.
[Figure: F1-F2 chart, formant positions]
History of Speech Coding
[Figure: VODER, the architecture]
OVE formant synthesis (Gunnar Fant, KTH, …)
• …: Channel vocoder, the first analysis-by-synthesis system, developed by Homer Dudley of AT&T labs (VODER).
• 1926: PCM, first conceived by Paul M. Rainey and independently by Alec Reeves (ITT Paris) in 1937. Deployed in the US PSTN in 1962.
• …: µ-law encoding proposed (standardised for the telephone network in 1972 as G.711).
• …: Delta modulation proposed; differential PCM invented.
• …: ADPCM developed.
• …: CELP vocoder proposed (the majority of coding standards for speech signals today use a variation on CELP).
Source-filter Model of Speech Production
• The signal from a source is filtered by a time-varying filter with resonant properties similar to those of the vocal tract.
• The gain controls Av and AN determine the intensity of voiced and unvoiced excitation.
• The higher formants are attenuated by … dB/octave (due to the nature of our speech organs).
• This is an over-simplified model of speech production. However, it is very often adequate for understanding the basic principles.
Speech Coding Strategies
1. PCM
• Invented 1926, deployed 1962.
• The speech signal is sampled at 8 kHz.
• Uniform quantization requires >10 bits/sample.
• Non-uniform quantization (G.711, 1972)
• Quantizing y to 8 bits -> 64 kbit/s.
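The non-uniform quantization step can be illustrated with a small sketch. It assumes the continuous µ-law compression curve with µ = 255 followed by a uniform 8-bit quantizer; G.711 itself specifies a segmented approximation of this curve, so the function names and the exact quantizer below are illustrative assumptions, not the standard's tables.

```python
import math

MU = 255  # assumed mu-law parameter (continuous curve, not the G.711 segment table)

def mulaw_compress(x, mu=MU):
    """Map x in [-1, 1] to y in [-1, 1]; small amplitudes get more resolution."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mulaw_expand(y, mu=MU):
    """Inverse of mulaw_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)

def encode(x, bits=8):
    """Compress, then quantize y uniformly to an integer code in [0, 2**bits - 1]."""
    levels = 1 << bits
    code = int((mulaw_compress(x) + 1.0) / 2.0 * levels)
    return min(max(code, 0), levels - 1)

def decode(code, bits=8):
    """Reconstruct at the centre of the quantization cell, then expand."""
    levels = 1 << bits
    return mulaw_expand((code + 0.5) / levels * 2.0 - 1.0)

SAMPLE_RATE_HZ = 8000
BITS_PER_SAMPLE = 8
BITRATE = SAMPLE_RATE_HZ * BITS_PER_SAMPLE  # 8000 samples/s * 8 bits = 64000 bit/s
```

Because the compression is logarithmic, the reconstruction error scales roughly with the signal amplitude, which is why 8 bits suffice where uniform quantization needs more than 10.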
Speech Coding Strategies
2. Adaptive DPCM
• Example: G.726 (1974)
• Adaptive predictor based on six previous differences.
• Gain-adaptive quantizer with 15 levels (4 bits/sample) -> 32 kbit/s.
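The idea of a gain-adaptive quantizer inside a DPCM loop can be sketched as follows. This is a deliberately simplified toy, not G.726: the predictor is just the previous reconstructed sample (rather than an adaptive predictor over six previous differences), and the step-size rule and its constants are assumptions chosen for illustration.

```python
import math

LEVELS = 15                              # quantizer codes -7..7
BITS = math.ceil(math.log2(LEVELS))      # 4 bits per sample
BITRATE = 8000 * BITS                    # 32000 bit/s at 8 kHz sampling

def adapt(step, code):
    """Hypothetical gain rule: grow the step after large codes, shrink after small ones."""
    if abs(code) >= 5:
        factor = 1.5
    elif abs(code) <= 1:
        factor = 0.9
    else:
        factor = 1.0
    return min(max(step * factor, 1e-4), 0.5)

def adpcm_encode(samples, step=0.02):
    codes, pred = [], 0.0
    for x in samples:
        # Quantize the prediction error with the current step size.
        code = max(-7, min(7, round((x - pred) / step)))
        codes.append(code)
        pred += code * step              # same reconstruction the decoder forms
        step = adapt(step, code)         # adaptation uses only transmitted data
    return codes

def adpcm_decode(codes, step=0.02):
    out, pred = [], 0.0
    for code in codes:
        pred += code * step
        out.append(pred)
        step = adapt(step, code)
    return out
```

The step size is adapted from the transmitted codes alone, so the decoder can track it without any side information.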
Speech Coding Strategies
3. Model-based Speech Coding
• Advanced speech coders are based on models of how speech is produced:
Excitation source -> Vocal tract
An Excitation Source
Noise generator
Pulse generator
Pitch
Vocal Tract Filter 1: A Fixed Filter Bank
[Figure: bank of band-pass (BP) filters with gains g1, g2, ..., gn]
Vocal Tract Filter 2: A Controllable Filter
Linear Predictive Coding (LPC)
• The controllable filter is modelled as
$y_n = \sum_{i=1}^{M} a_i y_{n-i} + G\varepsilon_n$
where $\varepsilon_n$ is the input signal, $y_n$ is the output and $M$ is the filter order.
• We need to estimate the vocal tract parameters ($a_i$ and $G$) and the excitation parameters (pitch, v/uv).
• Typically the source signal is divided into short segments and the parameters are estimated for each segment.
• Example: The speech signal is sampled at 8 kHz and divided into segments of 180 samples (22.5 ms/segment).
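As a minimal sketch of the decoder side of this model, the following code drives the recursive filter with either a pulse train (voiced) or white noise (unvoiced) for one 180-sample segment. The function names and the unit-amplitude excitation are assumptions for illustration.

```python
import random

FRAME_LEN = 180  # samples per segment at 8 kHz (22.5 ms)

def excitation(voiced, pitch_period, n=FRAME_LEN, seed=0):
    """Pulse train with the given period for voiced frames, white noise otherwise."""
    if voiced:
        return [1.0 if i % pitch_period == 0 else 0.0 for i in range(n)]
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def lpc_synthesize(a, gain, eps, history=None):
    """Compute y[n] = sum_{i=1..M} a[i-1] * y[n-i] + gain * eps[n]."""
    m = len(a)
    past = list(history) if history else [0.0] * m   # past[0] holds y[n-1]
    out = []
    for e in eps:
        y = sum(ai * yi for ai, yi in zip(a, past)) + gain * e
        out.append(y)
        past = [y] + past[:m - 1]
    return out
```

For example, `lpc_synthesize([0.5], 2.0, [1.0, 0.0, 0.0])` yields the decaying impulse response of a first-order filter; passing the last M outputs of one segment as `history` carries the filter state into the next segment.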
Typical Scheme of an LPC Coder
Noise generator
Pulse generator
Pitch
Vocal tract filter
v/uv, Gain, Filter coeffs
Estimating the Parameters
• v/uv estimation
– Based on energy and frequency spectrum.
• Pitch-period estimation
– Look for periodicity, either via the autocorrelation function (acf) or some other measure, for example one that gives a minimum value when p equals the pitch period.
– Typical pitch-periods: 20 - 160 samples.
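One common measure that is minimal at the pitch period is the average magnitude difference function (AMDF). Whether this is the measure the lecture had in mind is an assumption; the sketch below only illustrates the principle, searching the typical lag range of 20 to 160 samples.

```python
def amdf(y, p):
    """Average magnitude difference between the signal and itself delayed by p samples."""
    n = len(y) - p
    return sum(abs(y[i + p] - y[i]) for i in range(n)) / n

def estimate_pitch_period(y, p_min=20, p_max=160):
    """Return the smallest lag in [p_min, p_max] that minimizes the AMDF."""
    p_max = min(p_max, len(y) - 1)
    return min(range(p_min, p_max + 1), key=lambda p: amdf(y, p))
```

Multiples of the true period also give (near-)minimal values, so a practical estimator needs some guard against pitch doubling; here ties are simply resolved in favour of the smallest lag.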
Estimating the Parameters
• Vocal tract filter estimation
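– Find the filter coefficients that minimize the squared prediction error, averaged over the segment:
$e_n^2 = \left( y_n - \sum_{i=1}^{M} a_i y_{n-i} \right)^2$
Under the filter model, this residual is the excitation term $G\varepsilon_n$.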
– Compare to the computation of optimal predictors (Lecture 7).
Estimating the Parameters
• Assuming a stationary signal:
where R and p contain acf values.
• This is called the autocorrelation method.
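A minimal sketch of the autocorrelation method, assuming the usual normal equations R a = p with Toeplitz R built from acf values R(0)..R(M-1) and p = (R(1), ..., R(M)). The Levinson-Durbin recursion used to solve them is a standard technique swapped in here for illustration; the lecture does not say which solver it uses.

```python
def autocorr(y, max_lag):
    """Biased acf estimates R(0), ..., R(max_lag) over one segment."""
    n = len(y)
    return [sum(y[i] * y[i + k] for i in range(n - k)) / n
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve sum_i a_i * r[|i-j|] = r[j], j = 1..order, exploiting the Toeplitz structure.

    Returns the predictor coefficients a_1..a_order and the residual energy.
    """
    a = []
    err = r[0]
    for m in range(order):
        acc = r[m + 1] - sum(a[i] * r[m - i] for i in range(m))
        k = acc / err                                   # reflection (PARCOR) coefficient
        a = [a[i] - k * a[m - 1 - i] for i in range(m)] + [k]
        err *= (1.0 - k * k)
    return a, err
```

The returned residual energy can be used to set the gain G, and the intermediate values k are the PARCOR coefficients.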
Estimating the Parameters
• Alternatively, in case of a non-stationary signal:
where
• This is called the autocovariance method.
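For comparison, here is a sketch of the autocovariance method under the usual assumption that it solves C a = s with c(i, j) = sum over n of y[n-i] y[n-j], where the sum runs only over samples inside the segment, and s_i = c(i, 0). The matrix C is symmetric but not Toeplitz, so a general solver (plain Gaussian elimination here) is used. The exact summation limits are an assumption.

```python
def covariance_lpc(y, order):
    """Estimate a_1..a_order by the covariance method on one segment."""
    n = len(y)

    def c(i, j):
        # Sum only over n where both y[n-i] and y[n-j] lie inside the segment.
        return sum(y[t - i] * y[t - j] for t in range(order, n))

    # Augmented matrix [C | s].
    aug = [[c(i, j) for j in range(1, order + 1)] + [c(i, 0)]
           for i in range(1, order + 1)]

    # Gaussian elimination with partial pivoting.
    for col in range(order):
        piv = max(range(col, order), key=lambda row: abs(aug[row][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for row in range(col + 1, order):
            f = aug[row][col] / aug[col][col]
            for k in range(col, order + 1):
                aug[row][k] -= f * aug[col][k]

    # Back substitution.
    a = [0.0] * order
    for i in reversed(range(order)):
        tail = sum(aug[i][k] * a[k] for k in range(i + 1, order))
        a[i] = (aug[i][order] - tail) / aug[i][i]
    return a
```

On a short, decaying segment generated exactly by a second-order recursion, this recovers the coefficients up to rounding, because no stationarity or windowing assumption is involved.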
Example
• Coding of parameters using LPC10 (1984):
Parameter          Bits
v/uv               1 bit
Pitch              6 bits
Voiced filter      46 bits
Unvoiced filter    46 bits
Synchronization    1 bit
Sum:               54 bits -> 2.4 kbit/s (each frame carries one of the two filter descriptions)
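The bit rate follows from the bit budget per frame. The short check below assumes the 180-sample segment at 8 kHz from the earlier LPC example also applies to LPC10, and that a frame carries either the voiced or the unvoiced filter description (46 bits in both cases).

```python
SAMPLE_RATE_HZ = 8000
FRAME_SAMPLES = 180          # assumed frame length: 22.5 ms at 8 kHz

frame_bits = {
    "v/uv": 1,
    "pitch": 6,
    "filter": 46,            # voiced or unvoiced filter, 46 bits either way
    "synchronization": 1,
}

bits_per_frame = sum(frame_bits.values())                       # 54 bits
bitrate = bits_per_frame * SAMPLE_RATE_HZ / FRAME_SAMPLES       # bits per second
```

With these numbers, 54 bits every 22.5 ms gives 2400 bit/s, matching the 2.4 kbit/s in the table.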
The Vocal Tract Filter
• Different representations:
– LPC parameters
– PARCOR (Partial Correlation Coefficients)
– LSF (Line Spectrum Frequencies)
Code Excited Linear Prediction Coding (CELP)
Encoding:
• LPC analysis -> V(z)
• Define perceptual weighting filter. This permits more noise at formant frequencies, where it will be masked by the speech.
• Synthesise speech using each codebook entry in turn as the input to V(z)
• Calculate optimum gain to minimise perceptually weighted error energy in speech frame
• Select codebook entry that gives lowest error
• Transmit LPC parameters and codebook index
Decoding:
• Receive LPC parameters and codebook index
• Re-synthesise speech using V(z) and codebook entry
Performance:
• … kbit/s: MOS …, delay … ms, … MIPS
• … kbit/s: MOS …, delay … ms, … MIPS
• … kbit/s: MOS …, delay … ms, … MIPS
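The CELP encoder's analysis-by-synthesis search (pass each codebook entry through V(z), compute the optimum gain, keep the entry with the lowest error) can be sketched as below. This is a simplified illustration under stated assumptions: V(z) is an all-pole filter with zero initial state, the error is the plain squared error (the perceptual weighting filter is omitted), and the codebook is a plain list of excitation vectors.

```python
def synthesize(a, excitation):
    """All-pole V(z): y[n] = sum_i a[i-1] * y[n-i] + excitation[n], zero initial state."""
    out = []
    for n, e in enumerate(excitation):
        y = e + sum(a[i] * out[n - 1 - i] for i in range(len(a)) if n - 1 - i >= 0)
        out.append(y)
    return out

def celp_search(target, a, codebook):
    """Return (index, gain, error) of the codebook entry that best matches the target frame."""
    best = None
    for index, code in enumerate(codebook):
        synth = synthesize(a, code)
        energy = sum(s * s for s in synth)
        if energy == 0.0:
            continue                                   # an all-zero entry cannot match anything
        gain = sum(t * s for t, s in zip(target, synth)) / energy   # least-squares optimum gain
        error = sum((t - gain * s) ** 2 for t, s in zip(target, synth))
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best

def celp_decode(a, codebook, index, gain):
    """Decoder: re-synthesise the frame from the received index and gain."""
    return [gain * s for s in synthesize(a, codebook[index])]
```

Only the LPC parameters, the codebook index and the gain need to be transmitted; the decoder holds the same codebook and repeats the synthesis.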
Examples
• G.728
– V(z) is chosen as a large FIR-filter (M ~ 50).
– The gain and FIR parameters are estimated recursively from previously received samples.
– The code book contains 127 sequences.
• GSM
– The code book contains regular pulse trains with variable frequency and amplitudes.
• MELP
– Mixed excitation linear prediction
– The code book is combined with a noise generator.
Other Variations
• SELP – Self Excited Linear Prediction
• MPLP – Multi-Pulse Excited Linear Prediction
• MBE – Multi-Band Excitation Coding
Quality Levels
Bitrate          Bandwidth       Quality level
<4 kbit/s        -               Synthetic quality
4 – 16 kbit/s    -               Communication quality
16 – 64 kbit/s   300 – 3400 Hz   Network (toll) quality
>64 kbit/s       10 kHz          Broadcast quality
Subjective Assessment
• In digital communications, speech quality is classified into four general categories, namely: broadcast, network or toll, communications, and synthetic.
• Broadcast wideband speech: high quality "commentary" speech, generally achieved at rates above 64 kbit/s.
• MOS (Mean Opinion Score): the result of averaging opinion scores for a set of between … and … untrained subjects.
• They rate the quality 1 to 5 (1 bad, 2 poor, 3 fair, 4 good, 5 excellent).
• MOS of 4 or higher defines good or toll quality (network quality): the reconstructed signal is generally indistinguishable from the original.
• MOS between … and … defines communication quality (telephone communications).
• MOS between … and … implies synthetic quality.
Subjective Assessment
• DRT (Diagnostic Rhyme Test): listeners should recognise one of the two possible words in a set of rhyming pairs, e.g. meat/heat.
• DAM (Diagnostic Acceptability Measure): trained listeners judge various factors, e.g. muffledness, buzziness, intelligibility.
[Figure: quality versus data rate (8 kHz sampling rate)]