
UPTEC F11 034

Degree project 30 credits, May 2011

Stereo coding for the ITU-T G.719 codec

Tomas Jansson



Abstract

Stereo coding for the ITU-T G.719 codec

Tomas Jansson

This thesis presents a stereo coding architecture for the ITU-T G.719 fullband mono codec. G.719 is suitable for teleconferencing applications with a competitive audio quality for speech and audio signals that are encoded at 32, 48 and 64 kbps. The proposed stereo architecture comprises parametric stereo coding where the spatial properties of the stereo channels are modeled with the use of parameters, which are encoded and transmitted to the decoder together with an encoded downmix of the stereo channels. The stereo architecture has been implemented in MATLAB with an external mono coding using a floating-point ANSI-C implementation of the ITU-T G.719 codec.

Two parametric stereo models have been implemented in a framework operating in the complex-valued Modified Discrete Fourier Transform (MDFT) domain. The first model is based on the inter-channel cues that represent level differences, time differences and coherences between the stereo channels. The cues approximate the corresponding interaural cues that characterize our localization of sound in space. The second model is based on the Karhunen-Loève Transform (KLT) with the associated rotation angles, the inter-channel time differences and the residual scaling parameters. An improved MDFT domain extraction of the inter-channel time difference between the stereo channels has been used for both stereo models. The extracted stereo parameters have been non-uniformly quantized based on the spatial accuracy and the frequency dependency of the human auditory system.

The data rate of the stereo parameters has been estimated for each model to around 4 kbps. As a result G.719 has been used as a core codec at 44 and 60 kbps in order to subjectively evaluate the performance of the fullband stereo codec at 48 and 64 kbps. In the comparison with G.719 dual mono coding, i.e. independent mono coding of the stereo channels, the evaluation showed a higher performance of the proposed stereo models for complex clean and reverberant speech signals. However, no consistent gain of the parametric stereo coding was revealed for noisy speech, mixed content and music signals. In addition, the first stereo model consistently showed a slightly higher performance than the second model in the subjective evaluation, but with no significant difference.

The results revealed a high potential for parametric stereo coding using the ITU-T G.719 codec. In comparison to the existing stereo codecs 3GPP AMR-WB+ and 3GPP eAAC+, the average performance was better at the same bitrate of 48 kbps.

ISSN: 1401-5757, UPTEC F11 034
Examiner: Tomas Nyberg
Subject reader: Mikael Sternad
Supervisor: Manuel Briand


Sammanfattning (Summary in Swedish)

This thesis presents a stereo architecture for ITU-T G.719, a mono codec for full-bandwidth (20 kHz) audio signals. G.719 can be used for, among other applications, teleconferencing systems, since the audio quality for speech and audio signals is competitive at bitrates of 32, 48 and 64 kbps. The stereo architecture uses models that describe the spatial image of the stereo channels with parameters, which are encoded and sent to the decoder together with a coded downmix of the stereo channels. The presented stereo architecture has been implemented in MATLAB and uses a floating-point ANSI-C implementation of ITU-T G.719 for the external mono coding.

Using the complex-valued MDFT (Modified Discrete Fourier Transform), two parametric stereo models have been developed within the G.719 framework. The first model is based on differences between the stereo channels in terms of level, time delay and coherence. These parameters approximate the interaural differences that characterize human localization of sound sources. The second model is based on the KL transform (Karhunen-Loève Transform), which is characterized by rotation angles, but also uses the time delay between the stereo channels and the level of the KLT residual. The estimation of the time delay between the channels, the ICTD (Inter-Channel Time Difference), has been improved with two new algorithms. The extracted stereo parameters have been quantized non-uniformly based on the spatial limitations of the human auditory system.

The bitrate of the stereo parameters has been estimated to around 4 kbps for both models. To subjectively assess the performance of the stereo codec at 48 and 64 kbps, G.719 has been used to encode the downmixed stereo channels at 44 and 60 kbps. Compared with G.719 DM (Dual Mono) coding, i.e. independent mono coding of the stereo channels, the implemented stereo models proved more efficient for clean and reverberant complex speech signals. However, no consistent improvement from parametric stereo coding could be observed for noisy speech, music and mixed content, e.g. a combination of speech, music and sound effects. In addition, the subjective evaluation consistently showed a slightly higher performance for the first stereo model compared to the second.

Overall, the results showed a high potential for parametric stereo coding with ITU-T G.719. In the comparison with the existing stereo codecs 3GPP AMR-WB+ [60] and 3GPP eAAC+ [61], the audio quality of the proposed G.719 stereo architecture was on average perceived as better at a bitrate of 48 kbps.


Acknowledgments

This Master’s thesis is the final achievement for my degree of Master of Science in Engineering Physics at Uppsala University. The work has been performed at the department of Audio Technologies at Ericsson Research in Stockholm.

I would like to acknowledge my supervisor Manuel Briand for his tremendous support during the work on and the writing of this thesis. I would also like to thank Erlendur Karlsson and Erik Norvell for their help and advice. Finally, I would like to express my gratitude to the participants in the subjective evaluation of the developed stereo architecture.


Table of contents

1 INTRODUCTION
  1.1 Background
  1.2 Thesis outline

2 ITU-T G.719 FULLBAND SPEECH AND AUDIO CODEC
  2.1 G.719 encoder
    2.1.1 Framework
    2.1.2 Perceptual coding
  2.2 G.719 decoder
    2.2.1 Overlap-add synthesis
  2.3 Performance of ITU-T G.719
    2.3.1 Algorithmic efficiency
    2.3.2 Subjective Quality Performance
  2.4 Conclusion

3 CHARACTERISTICS OF OUR PERCEPTION OF SOUND IN SPACE
  3.1 Localization of sound in space using interaural cues
    3.1.1 Limitations of the auditory system
    3.1.2 Summing localizations of audio sources
  3.2 Considerations about room acoustics
  3.3 Conclusion

4 EXISTING STEREO CODING TECHNIQUES
  4.1 Joint stereo coding
    4.1.1 Mid-side coding
    4.1.2 Intensity stereo coding
    4.1.3 KLT-based stereo coding
  4.2 Binaural cue coding
    4.2.1 Spatial cue extraction
    4.2.2 Spatial cue synthesis
  4.3 Parametric Stereo coding
    4.3.1 Improvements over BCC technique
    4.3.2 Decorrelator
    4.3.3 Quantization and coding of spatial cues
    4.3.4 Pre-weighting for improved downmix from USAC
  4.4 Conclusion

5 STEREO FRAMEWORK FOR THE ITU-T G.719 CODEC
  5.1 Transform compatible with G.719 codec
  5.2 Block processing from G.719 with additional zero-padding
  5.3 Frequency resolution based on the ERB scale
  5.4 Conclusion

6 STEREO MODELS BASED ON THE BCC/PS ARCHITECTURE
  6.1 High level description of model 1
  6.2 High level description of model 2
  6.3 Parameter extraction
    6.3.1 Optimization of the frequency domain filterbank
    6.3.2 Improved ICTD extraction
  6.4 Signal alignment of stereo channels before downmixing
  6.5 MDFT domain decorrelation
  6.6 Stereo synthesis
    6.6.1 Upmixing in model 1
    6.6.2 Upmixing in model 2
  6.7 Conclusion

7 QUANTIZATION AND CODING OF THE STEREO PARAMETERS
  7.1 Non-uniform quantization
  7.2 Coding of the quantized parameters
  7.3 Bitrate estimations for training and MPEG database
  7.4 Conclusion

8 PERFORMANCE OF THE IMPLEMENTED G.719 STEREO CODEC
  8.1 Objective measurement
  8.2 Subjective evaluation
    8.2.1 MUSHRA – G.719 stereo at 48 kbps
    8.2.2 MUSHRA – G.719 stereo at 64 kbps
  8.3 Conclusion

9 CONCLUSION AND FUTURE WORK
  9.1 Conclusion
  9.2 Future work
    9.2.1 Compatibility with ITU-T G.719
    9.2.2 Reduction of complexity and delay
    9.2.3 Improved subjective quality for complex stereo signals

REFERENCES

APPENDIX
  A. Critical bandwidths and ERB
  B. Stereo recording techniques
  C. Distributions of stereo parameters
  D. MDCT, MDST, MDFT and TDAC properties
  E. Stereo material


Notations

Mathematical symbols

$i$                      $\sqrt{-1}$
$\Re[\cdot]$             Real part of a complex number
$\Im[\cdot]$             Imaginary part of a complex number
$E[\cdot]$               Mathematical expectation
$\angle[\cdot]$          Argument of a complex number
$|\cdot|$                Absolute value
$\lfloor\cdot\rfloor$    Rounding to the nearest integer towards minus infinity
$\otimes$                Element-wise multiplication of matrix elements
$a^{*}$                  Complex conjugate of the complex number $a$
$A^{T}$                  Transpose of the matrix (or vector) $A$
$A^{H}$                  Complex-conjugate transpose of the matrix (or vector) $A$

Indices and variables

$n$      Discrete time index
$k$      Frequency coefficient index
$j$      Index of time-domain signal block
$b$      Index of frequency subband
$f$      Frequency
$f_s$    Sampling frequency
$2N$     Length of time-domain signal block
$N$      Length of MDCT, MDST and MDFT spectra

Time-domain signals

$l[n]$         Left channel time-domain input signal
$r[n]$         Right channel time-domain input signal
$\hat{l}[n]$   Left channel time-domain output signal
$\hat{r}[n]$   Right channel time-domain output signal
$m[n]$         Non-coded downmix (mid) time-domain signal
$s[n]$         Residual (side) time-domain signal
$d[n]$         Decorrelated downmix time-domain signal
$\hat{m}[n]$   Coded downmix (mid) time-domain signal
$w[n]$         Window function
$h_b[n]$       Filter impulse response for the frequency subband of index $b$
$F_b[k]$       Filter of subband $b$ in the frequency-domain filterbank

Filterbank signals

$x_{j,b}[n]$   Subband time-domain signal of index $b$ for block $j$ of $x[n]$

Frequency-domain spectra

$X[k]$                  DFT or MDFT spectrum of the time-domain signal $x[n]$
$X_{j,b}[k]$            DFT or MDFT subband spectrum of index $b$ for block $j$ of $x[n]$
$X^{MDCT}_{j,b}[k]$     MDCT subband spectrum of index $b$ for block $j$ of $x[n]$
$X^{MDST}_{j,b}[k]$     MDST subband spectrum of index $b$ for block $j$ of $x[n]$

Signal processing functions

$\phi_{xy}[\tau]$   Normalized cross-correlation function of $x[n]$ and $y[n]$ as a function of the time lag $\tau$
$P_X[j,b]$          Subband signal power of $X_{j,b}[k]$
$P_{XY}[j,b]$       Subband cross-spectrum of $X_{j,b}[k]$ and $Y_{j,b}[k]$

Subband parameters

$B_{ICLD}$                 Number of subbands in the ICLD filterbank
$\Delta P_{XY}[j,b]$       Subband power ratio between the channels $X_{j,b}[k]$ and $Y_{j,b}[k]$
$\Delta\tau_{LR}[j,b]$     Subband time difference between the channels $L_{j,b}[k]$ and $R_{j,b}[k]$
$c_{LR}[j,b]$              Subband inter-channel coherence between the channels $L_{j,b}[k]$ and $R_{j,b}[k]$
$\sigma[j,b]$              Rotation angle of the KLT for time-domain block $j$ and frequency subband $b$
$\Delta\theta_{LR}[j,b]$   ICPD for time-domain block $j$ and frequency subband $b$
$\Theta[j,b]$              OCPD for time-domain block $j$ and frequency subband $b$
$\Delta t[b]$              Time shift specified for the frequency subband of index $b$

Abbreviations

3GPP 3rd Generation Partnership Project

AMR-WB+ Extended Adaptive Multi-Rate Wideband

BCC Binaural Cue Coding

BMLD Binaural Masking Level Difference

CCF Cross-Correlation Function

DFT Discrete Fourier Transform

eAAC+ Enhanced Advanced Audio Coding Plus

ERB Equivalent Rectangular Bandwidth

FFT Fast Fourier Transform

IACC Inter-Aural Cross Correlation

ICC Inter-Channel Coherence

ICLD Inter-Channel Level Difference

ICPD Inter-Channel Phase Difference

ICTD Inter-Channel Time Difference

ILD Interaural Level Difference

IS Intensity Stereo

ITD Interaural Time Difference

ITU International Telecommunication Union

JS Joint Stereo

KLT Karhunen-Loève Transform

LAME LAME Ain't an MP3 Encoder

MDCT Modified Discrete Cosine Transform

MDFT Modified Discrete Fourier Transform


MDST Modified Discrete Sine Transform

MPEG Moving Picture Experts Group

MPS MPEG Surround

MS Mid Side

MUSHRA MUltiple Stimuli with Hidden Reference and Anchors

OLA OverLap Add

OPD Overall Phase Difference

PE Perceptual Entropy

PCA Principal Component Analysis

PS Parametric Stereo

QMF Quadrature Mirror Filter

SMLD Side-Mid Level Difference

SMR Signal-to-Masking Ratio

SNR Signal-to-Noise Ratio

STFT Short-Time Fourier Transform

USAC Unified Speech and Audio Coding

WMOPS Weighted Million Operations Per Second


1 Introduction

1.1 Background

Audio coding is used to obtain an efficient and compact representation of audio signals that allows low-cost transmission and storage. An efficient encoding is desirable for transmission over channels with limited capacity, such as mobile radio channels in telecommunication systems. At the decoding side the original audio signal is reconstructed from the encoded and compact representation. While lossless coding enables perfect reconstruction of the audio signal, lossy coding techniques use perceptual models based on the human auditory system in order to remove perceptual irrelevancies, i.e. inaudible information. Perceptual audio coding allows for an even more compact representation of the original audio signal than lossless coding. However, in order to obtain an acceptable audio quality, the technique requires accurate psychoacoustical models of human hearing as well as advanced signal processing methods.

Audio codecs, i.e. encoder and decoder, are designed according to complexity, delay and quality constraints defined by the end-user application. In real-time systems there is also a latency constraint, which requires audio coding algorithms with a low algorithmic delay. The development of audioconferencing and videoconferencing systems, together with the progress of audio recording techniques, has driven the development from narrowband (100-3400 Hz) speech codecs to wideband (50-7000 Hz), super wideband (50-14000 Hz), and now fullband audio codecs (20-20000 Hz). Fullband audio coding schemes represent all frequencies that are audible to the human ear and allow for a very immersive audio quality for all types of audio signals.

Audio codecs are standardized with requirements on their usage in order to facilitate interoperability between telecommunication systems from different manufacturers. In June 2008 the first fullband audio coding standard of the ITU-T (Telecommunication Standardization Sector of the International Telecommunication Union) was specified as the G.719 codec. The codec was designed for teleconferencing applications and handles fullband speech, mixed content and music signals at data rates of 32, 48 and 64 kbps with a low complexity and a low delay.

In order to obtain a more realistic listening experience of the audio signals it is of interest to enhance mono systems with stereo or multi-channel audio support. For teleconferencing applications this gives a new dimension where the participants are not just characterized by their voice but also by their spatial location and acoustical environment as is illustrated in Figure 1.1. The ITU-T G.719 codec has no stereo support but it can be used by stereo rendering systems such as videoconferencing with dual mono encoding, i.e. independent mono coding of the two stereo channels.


Figure 1.1: Stereo audio for spatial localization in teleconferencing.

For stereo audio there are not just redundancies within the channels but also between them, which can be exploited for a more efficient stereo coding. As in perceptual mono coding, the properties of human hearing can be used to further compress the representation of the stereo channels. Parametric stereo coding has successfully been used in existing stereo codecs such as 3GPP Enhanced Advanced Audio Coding Plus (eAAC+) [61][62] and Extended Adaptive Multi-Rate Wideband (AMR-WB+) [60]. It is therefore of high interest to investigate the benefit of parametric stereo coding for the ITU-T G.719 codec in order to improve the compression ratio while maintaining the subjective quality.

1.2 Thesis outline

The objective of this thesis has been to study relevant state-of-the-art stereo coding methods and propose an innovative and competitive stereo coding algorithm for the ITU-T G.719 mono codec. The codec would have to handle fairly complex teleconferencing speech and audio signals, e.g. clean speech, reverberant speech and mixed content of speech and music.

The thesis consists of nine chapters where a solution for stereo coding with the ITU-T G.719 mono codec is presented. In chapter 2 the G.719 codec is described with a focus on the properties that are important for stereo coding. The basic principles of human spatial hearing are presented in chapter 3, followed by a review of useful and interesting state-of-the-art stereo coding techniques in chapter 4. In chapter 5, the framework of the proposed stereo coding solution for G.719 is presented prior to a detailed description of the stereo modeling and the coding of the stereo channels in chapters 6 and 7. In chapter 8, the performance of the proposed algorithms is evaluated and in chapter 9 conclusions are drawn along with suggestions for further improvements.


2 ITU-T G.719 fullband speech and audio codec

In June 2008 a fullband extension to the standardized ITU-T G.722.1C codec [8] was adopted as the ITU-T G.719 codec [1]. The codec was standardized at bitrates of 32, 48 and 64 kbps but allows bitrates of 32-128 kbps in steps of 4 kbps. Fullband audio signals, sampled at 48 kHz, are processed in overlapping blocks of two consecutive frames of 960 samples (20 ms), as will be presented in section 2.1.

The G.719 codec works in the Modified Discrete Cosine Transform (MDCT) domain, where the time-domain audio signal is transformed into a frequency representation. The gain of transform coding can be high; for example, a single sinusoid can be described in the frequency domain by its frequency, amplitude and phase for an arbitrarily long duration. However, audio signals usually contain both simple and complex spectral characteristics, which results in different coding gains for different parts of the signal. For the same reason, different types of signals will not be perceived to have the same quality even though they are coded in the same way. G.719 handles some of these differences by, for example, detecting frames with transients, i.e. impulses in the waveform, and processing them differently, and by adaptively allocating bits to different parts of the frequency content.

The algorithmic delay of G.719 is 40 ms due to the buffering of two frames, which can be seen as a low delay in comparison to other codecs, as will be presented in section 2.3. In addition, the complexity of the encoder and decoder is low, i.e. the computational load is small, and the audio quality is high for clean, reverberant and noisy speech, mixed content and music at bitrates from 64 kbps.

In contrast to MPEG audio codecs where the encoders are informatively defined, the ITU-T specification of G.719 defines both the encoder and the decoder algorithms normatively. In the following sections the ITU-T G.719 encoder and decoder will be presented with focus on the components that are important for the development of a stereo architecture for G.719.

2.1 G.719 encoder

The G.719 encoder consists of several processing blocks in which the 48 kHz sampled input signal is encoded into a bitstream (see Figure 2.1).

As a first step, the input mono signal is analysed by a transient detector in order to decide what type of time-frequency transform will be used, or more precisely, what time resolution will be used to best represent the waveform. When the spectral content of the signal can be considered stationary, input frames of length N = 960 samples (20 ms) are transformed. The mapping from the time domain to the frequency domain is made by use of an adaptive Modified Discrete Cosine Transform (MDCT) with a transform size of 2N = 1920 samples, which corresponds to two frames. The MDCT comprises Time-Domain Aliasing (TDA), which implies that the 2N samples of each transform block are folded into a signal block of N = 960 samples. The aliased signal is subsequently transformed into N spectral coefficients. For non-stationary frames the time-domain aliased signal is split into four sub-frames of N/4 = 240 samples (5 ms) to better represent the fast changes in the time-domain signal. These non-stationary frames are transformed using a transform with higher temporal resolution than the MDCT used in the stationary mode.

The transformed signal frame is subsequently divided into 44 frequency sub-vectors that approximate the frequency resolution of the human ear, i.e. high resolution at low frequencies and low resolution at high frequencies (see Appendix A). The norms of the sub-vectors are estimated, quantized, encoded and used to normalize the spectral coefficients. According to a psychoacoustic model the norms are adaptively weighted and used to allocate bits for the coding of the sub-vectors. Based on the bit allocation the normalized spectral coefficients are lattice-vector quantized. Finally, the flags indicating transients, type of coding, etc. form the encoded bitstream together with the coded quantization indices for the lattice codevectors and the coded norm indices. In addition, in the stationary mode the level of the non-coded coefficients (noise) is estimated, coded and included in the bitstream as well.

[Figure 2.1 shows the encoder processing blocks: the input signal feeds a transient detector and the transformation; norm estimation, norm quantization and coding, spectrum normalization, norm adjustment, bit allocation, lattice quantization and coding, and noise level adjustment follow; the transient flag, quantized norms, lattice codevectors and noise level adjustment are multiplexed (MUX) into the encoded bitstream.]

Figure 2.1: Block diagram of the G.719 encoder [1].

2.1.1 Framework

The G.719 framework is defined by the transformation of time-domain signals into frequency-domain spectra. The transform is a Modulated Lapped Transform (MLT) that is performed differently depending on the mode selected by the transient detection, i.e. transient or stationary mode. The MLT consists of a windowing followed by a Modified Discrete Cosine Transform (MDCT), which in turn can be subdivided into Time-Domain Aliasing (TDA) and a Discrete Cosine Transform of type IV (DCTIV). The transient mode adds a further time segmentation into four sub-frames to improve the time resolution. However, the buffering and windowing are common to both the stationary and the transient mode, which means that there is no additional delay due to extra buffering in the transient detector.

The transients are detected from the time-domain signal in order to select a fine time resolution (for transients) or a fine frequency resolution (for stationary signals). For each processed frame, the short-time energy in four sub-frames of 240 samples is compared to a long-term energy estimate based on the energies of the previous frames with forgetting-factor smoothing. If the short-time energy is more than 7.8 dB above the long-term energy the transient mode is selected, otherwise the stationary mode is used (see [1] for more details). The switching between the stationary and the transient mode is instantaneous and does not require a transition window, as is the case, for example, in MPEG2/4-AAC [25][28]. The instantaneous switching therefore does not introduce any additional delay. In addition, since the transform is applied to blocks of two consecutive frames, the transient mode is also triggered for the overlapped (next) frame whenever a transient is detected. The block processing with windowing, buffering and transformation in the MLT is described in the following sections.
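The decision rule described above can be summarized in a short sketch. The following Python/NumPy snippet is not part of G.719 or of this thesis; it only mimics the comparison of sub-frame energies against a smoothed long-term energy with the 7.8 dB threshold, and the forgetting factor value and the exact energy definitions are illustrative assumptions (see [1] for the normative procedure).

import numpy as np

def detect_transient(frame, long_term_energy, alpha=0.75, threshold_db=7.8):
    # frame            : 960 time-domain samples (20 ms at 48 kHz)
    # long_term_energy : smoothed energy carried over from previous frames
    # alpha            : forgetting factor (illustrative value, not from the spec)
    # threshold_db     : transient mode is chosen when a sub-frame energy exceeds
    #                    the long-term energy by more than this amount
    is_transient = False
    for sub in frame.reshape(4, 240):                 # four 5 ms sub-frames
        short_term_energy = np.sum(sub.astype(float) ** 2) / sub.size
        if short_term_energy > long_term_energy * 10.0 ** (threshold_db / 10.0):
            is_transient = True
        # forgetting-factor smoothing of the long-term energy estimate
        long_term_energy = alpha * long_term_energy + (1.0 - alpha) * short_term_energy
    return is_transient, long_term_energy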

2.1.1.1 Frame buffering and windowing with overlap

A time-limited block of the input audio signal can be seen as windowed with a rectangular window. The windowing, which is a multiplication in the time domain, becomes a convolution in the frequency domain and results in a large frequency spread for this window. In addition, the sampling theorem states that the maximum frequency that can be correctly represented in discrete time is the Nyquist frequency, i.e. half of the sampling rate; otherwise aliasing occurs. For example, in a signal sampled at 48 kHz a frequency of 25 kHz, i.e. 1 kHz above the Nyquist frequency of 24 kHz, will be analysed as 23 kHz due to the aliasing (see [2] for more information).

Due to the large frequency spread of the rectangular window, the frequency analysis can be contaminated by the aliasing. In order to reduce the frequency spread and suppress the aliasing effect, windows without sharp discontinuities can be used. Two examples are the sine and the Hann windows, defined in [2], which compared to the rectangular window have a larger attenuation of the side lobes but also a wider main lobe. This is illustrated in Figure 2.2, where the shapes of the windows and the corresponding frequency spectra can be observed. Consequently, there has to be a trade-off between the possible aliasing and the frequency resolution.


Figure 2.2: Three window functions and their corresponding frequency spectrum. The windows are 1920 samples long at a sampling rate of 48 kHz.

In the synthesis of the analysed and encoded blocks of a processed audio signal, the window effects have to be cancelled. For example, the inverse window function could be applied to the coded time-domain blocks, but artefacts are then likely to be audible near the block edges due to discontinuities and amplification of the coding errors. In order to reduce the block artefacts, overlap-add techniques are commonly used [2].

In ITU-T G.719 the blocks of two consecutive frames are windowed with a sine window of length 2N = 1920 samples that is defined by

$w[n] = \sin\left(\frac{\pi}{2N}\left(n + \frac{1}{2}\right)\right)$, for $n = 0, \ldots, 2N-1$   (2.1)

The signals are processed with an overlap in the data of 50% between consecutive blocks. The windowed signal of each block is given by

$x_w[n] = \begin{cases} w[n]\, x_{OLD}[n], & n = 0, \ldots, N-1 \\ w[n]\, x[n-N], & n = N, \ldots, 2N-1 \end{cases}$   (2.2)

which is illustrated in Figure 2.3. The figure shows the buffering and windowing with the overlap of N = 960 samples between blocks of length 2N. The blocks are time-domain aliased (TDA) into signals of length N that are transformed using the Discrete Cosine Transform of type IV (DCTIV). The information from the transient detector is not used in the buffering, the windowing or the TDA, but only for the DCTIV, which implies that the buffering and windowing are common to the stationary and the transient mode. The combination of the TDA and the DCTIV is the MDCT, which is further presented in the following section.
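As an illustration of equations (2.1) and (2.2), the following minimal Python/NumPy sketch builds one windowed 2N block from the stored previous frame and the current frame; it only mirrors the buffering and windowing described above and is not the G.719 reference implementation.

import numpy as np

N = 960                                      # one 20 ms frame at 48 kHz
n = np.arange(2 * N)
w = np.sin(np.pi / (2 * N) * (n + 0.5))      # sine window of equation (2.1)

def window_block(x_old, x):
    # Equation (2.2): the first half of the block reuses the stored frame x_old,
    # the second half is the new frame x, giving 50 % overlap between blocks.
    block = np.concatenate([x_old, x])       # [x_old[0..N-1], x[0..N-1]]
    return w * block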


Figure 2.3: G.719 buffering, windowing and transformation of an audio signal [1].

2.1.1.2 Modified Discrete Cosine Transform

The MDCT is used in G.719 to transform the buffered and windowed signal blocks into a frequency representation. The transform comprises Time-Domain Aliasing (TDA), which means that the signal blocks of 2N = 1920 samples are folded (aliased) into blocks of N = 960 samples. The time-domain aliased signal of each block is then represented by N coefficients of cosine basis functions. Due to the TDA it is not possible to reconstruct the time-domain signals from individual MDCT spectra, but the framework of overlapped signal blocks enables perfect reconstruction. The 50 % overlap and the properties of the windows are essential for the reconstruction, where the TDA can be cancelled with overlap-add of consecutive inverse-transformed MDCT spectra. The conditions for Time-Domain Aliasing Cancellation (TDAC) and the perfect reconstruction with the overlap-add technique will be described in more detail in section 2.2.1.

As described in section 2.1.1.1 the signal blocks are overlapped in order to avoid block artefacts. The number of transformed samples per time unit is thereby increased in comparison to the transformation of non-overlapped blocks, which in itself would increase the bitrate required to code the spectra. However, due to the TDA in the MDCT the number of coefficients is reduced by the corresponding overlap factor, so the overlap does not increase the bitrate. This, in combination with the real-valued frequency coefficients, makes the MDCT competitive for audio coding with a compact representation of the signals.

The MDCT spectrum $X^{MDCT}[k]$ of the windowed signal $x_w[n]$ is by definition obtained as

$X^{MDCT}[k] = \sqrt{\frac{2}{N}} \sum_{n=0}^{2N-1} x_w[n] \cos\left(\frac{\pi}{N}\left(n + \frac{1}{2} + \frac{N}{2}\right)\left(k + \frac{1}{2}\right)\right)$, for $k = 0, \ldots, N-1$   (2.3)

In Appendix D it is shown how the MDCT can be expressed as the DCTIV of a time-domain aliased signal as illustrated in Figure 2.3.
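For illustration, equation (2.3) can be evaluated directly as in the Python/NumPy sketch below. The direct O(N^2) matrix product is deliberately naive, the sqrt(2/N) normalization follows the equation as written here, and the fast TDA plus DCTIV factorization used by the actual codec is not shown.

import numpy as np

def mdct(x_w, N):
    # Direct evaluation of equation (2.3): x_w is one windowed block of
    # 2N samples and the result is a spectrum of N real coefficients.
    n = np.arange(2 * N)[None, :]            # time index (row)
    k = np.arange(N)[:, None]                # frequency index (column)
    basis = np.cos(np.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
    return np.sqrt(2.0 / N) * (basis @ x_w)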


2.1.1.3 Transient mode transformation

In the transient mode of G.719 the time-aliased signal block $\tilde{x}_w = Q x_w$ (with the notation used in Appendix D.I) is reversed in time and divided into four sub-frames. The reversal re-creates the temporal coherence of the input signal that was destroyed by the TDA. The first and the last sub-frames are windowed by half-sine windows with a quarter of zero-padding, while the second and third sub-frames are windowed with the ordinary sine window, as illustrated in Figure 2.4. The overlap between the windowed sub-frames is 50 % and each segment is MDCT transformed, i.e. time-domain aliased and DCTIV-transformed, which results in sub-spectra of length N/4. Thus the total length of the four sub-spectra is N frequency coefficients, i.e. the transform lengths are equal in the stationary and the transient mode of G.719.


Figure 2.4: Windowing of sub-frames in the transient mode [1].

2.1.2 Perceptual coding

In G.719 the MDCT spectra are perceptually encoded based on a psychoacoustical model. The model describes the human auditory system and is used to shape the quantization so that the introduced coding errors are not audible.

In Figure 2.5 the principle of the perceptual coder is illustrated. The MDCT spectrum of the transformed windowed time-domain signal is split into 44 sub-vectors that approximate the frequency resolution of the ear (see Appendix A) by increasing sub-vector lengths with increasing frequency. The sub-vector spectra are quantized and coded based on the sub-vector energies, or norms, that are weighted according to the psychoacoustical model. The coding procedure is similar for the two time-resolution modes in G.719, but for the transient mode the spectral coefficients of the four sub-frames are interleaved before coding to preserve the coherence of the signal in the time-domain (see [1] for details).



Figure 2.5: Block diagram of the perceptual encoder based on MDCT domain masking.

The norm of each sub-vector is estimated and quantized with a uniform logarithmic scalar quantizer using 40 levels spaced 3 dB apart. The MDCT spectra are normalized with the quantized norms in order to reduce the amount of information needed to describe the spectra. The quantized norms are both differentially and Huffman encoded [3] before they are transmitted to the decoder, where they are used to de-normalize the decoded MDCT spectra. In the next step of the encoding, bits are iteratively allocated to each sub-vector as a function of the quantized sub-vector norms. The goal of the bit allocation is to distribute the available bits in such a way that the maximum subjective quality is obtained at a given data rate, i.e. a given number of bits. Therefore the quantized norms of the sub-vectors are perceptually weighted to account for psychoacoustical masking and threshold effects (see [1] for more information about the perceptual weighting).

For each iteration of the bit allocation, the sub-vector with the largest weighted norm is found and one bit is assigned to each MDCT coefficient in that sub-vector. The corresponding norm is then decreased by 6 dB, and the procedure is repeated until all available bits are assigned. When a sub-vector has been assigned 9 bits per coefficient its norm is set to minus infinity so that no more bits are allocated to it. Given the allocated bits, the normalized sub-vectors are lattice vector quantized and Huffman coded. More information about the vector quantization can be found in [1] for G.719 specifically and in [4] more generally.
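The two steps described above can be summarized in the Python/NumPy sketch below. The 3 dB scalar quantizer and the greedy allocation loop follow the description in the text, while the perceptual weighting, the exact quantizer offsets, the Huffman coding and the termination details of the real allocation are omitted; the function names and the simple stopping rule are illustrative, not taken from the G.719 specification.

import numpy as np

def quantize_norm_db(norm, step_db=3.0, levels=40):
    # Uniform logarithmic scalar quantization of one sub-vector norm:
    # the norm (assumed > 0) is rounded to the nearest 3 dB step and the
    # index is clipped to 40 levels (the real index range is not reproduced).
    index = int(np.clip(np.round(20.0 * np.log10(norm) / step_db), 0, levels - 1))
    return index, 10.0 ** (index * step_db / 20.0)

def allocate_bits(weighted_norms_db, subvector_lengths, bit_budget):
    # Greedy allocation: pick the sub-vector with the largest weighted norm,
    # give one bit to each of its coefficients, lower its norm by 6 dB, and
    # stop a sub-vector once it has 9 bits per coefficient.
    norms = np.array(weighted_norms_db, dtype=float)
    bits = np.zeros(len(norms), dtype=int)            # bits per coefficient
    while True:
        b = int(np.argmax(norms))
        cost = subvector_lengths[b]                   # one bit per coefficient
        if norms[b] == -np.inf or cost > bit_budget:
            break
        bits[b] += 1
        bit_budget -= cost
        norms[b] -= 6.0
        if bits[b] == 9:
            norms[b] = -np.inf                        # exclude from further passes
    return bits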

In the stationary mode the level of the non-coded spectral coefficients in the sub-vectors assigned zero bits is estimated, quantized and included in the bitstream for frequencies below the so-called transition frequency. The transition frequency is given by the last encoded sub-vector and is used in the decoder to separate noise filling from bandwidth extension, as will be described in section 2.2.

The quantization indices of the norms, the encoded sub-vector spectra and the estimated noise level form the encoded bitstream. In addition, information about for example the coding mode (stationary or transient) and the coding (Huffman or not) is added to the bitstream that is transmitted to the G.719 decoder.


2.2 G.719 decoder

Figure 2.6 shows the block diagram of the G.719 decoder, where the encoded bitstream is decoded into a 48 kHz waveform signal. The transient flag is used to decode the bitstream correctly since it specifies the type of frame configuration that was used in the encoder. The transmitted sub-vector norms are used to recompute how the bits were allocated in the encoder in order to decode the lattice vector indices into MDCT sub-vectors. With the transmitted noise level estimate, low-frequency non-coded spectral coefficients are regenerated by use of a spectral-fill codebook constructed from the received spectral coefficients. For the non-coded coefficients above the transition frequency, i.e. the frequency of the last quantized sub-vector, bandwidth extension is used for regeneration (see [1] for more information). Subsequently, the received spectral coefficients are mixed together with the synthesized coefficients into a normalized spectrum. The envelope of the normalized spectrum is shaped by the decoded quantized norms, which were used to normalize the spectrum in the encoder. Finally, the decoded time-domain signal is generated by the inverse MLT of the decoded MDCT spectrum. In the following, the inverse transformation related to the G.719 framework is described, while more information about the lattice decoding and the spectral filling can be found in [1].

[Figure 2.6 shows the decoder processing blocks: the encoded bitstream is demultiplexed (DEMUX) into the transient flag, quantized norms, lattice codevectors and noise level adjustment index; lattice decoding and the spectral fill generator are followed by envelope shaping and the inverse transformation that produces the output signal.]

Figure 2.6: Block diagram of the G.719 decoder [1].

2.2.1 Overlap-add synthesis

Despite the TDA in the MDCT, the time domain signal can be reconstructed due to the Time-Domain Aliasing Cancellation (TDAC) obtained with OverLap-Add (OLA) of the inverse transformed signal blocks. In this section the synthesis framework of G.719 is presented and in Appendix section D.II the TDAC properties of the MDCT are derived.

The Inverse MDCT (IMDCT) is defined as

$x_w[n] = \sqrt{\frac{2}{N}} \sum_{k=0}^{N-1} X^{MDCT}[k] \cos\left(\frac{\pi}{N}\left(n + \frac{1}{2} + \frac{N}{2}\right)\left(k + \frac{1}{2}\right)\right)$, for $n = 0, \ldots, 2N-1$   (2.4)


where $X^{MDCT}[k]$ is the decoded MDCT spectrum of N frequency coefficients. Similarly to the MDCT (see section 2.1.1.2), the IMDCT can be expressed as the inverse time-domain aliasing (ITDA) applied to the inverse DCTIV of the MDCT spectrum (see Appendix section D.II).

In Figure 2.7 the G.719 transformation of the MDCT spectra into a time-domain signal is illustrated. The spectra of length N = 960 coefficients are transformed with the inverse DCTIV, which is equal to the forward DCTIV (see section 2.1.1.2). Subsequently, the spectra are inverse time-domain aliased into time-domain blocks of 2N = 1920 samples. The blocks are windowed by a synthesis sine window that is equal to the analysis window used in the encoder. Each windowed block of index j is added to the previous block of index j-1 with 50% overlap. In the OLA the TDA is cancelled due to the properties of the MDCT and consequently, for every inverse-transformed MDCT spectrum, one frame, i.e. N = 960 samples, of the audio signal is reconstructed. The second half of each windowed block is correspondingly added to the first half of the next block, which together cancel the TDA and reconstruct the next frame of the time-domain signal.
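A minimal Python/NumPy sketch of this synthesis is given below, pairing a direct evaluation of equation (2.4) with synthesis windowing and 50 % overlap-add; combined with the MDCT sketch in section 2.1.1.2 it reproduces the input frames (except for the very first block, which has no predecessor). It only illustrates the TDAC principle and is not the G.719 reference implementation.

import numpy as np

def imdct(X, N):
    # Direct evaluation of equation (2.4): N MDCT coefficients are expanded
    # back into a time-domain block of 2N samples.
    n = np.arange(2 * N)[:, None]
    k = np.arange(N)[None, :]
    basis = np.cos(np.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
    return np.sqrt(2.0 / N) * (basis @ X)

def overlap_add(blocks, window, N):
    # Synthesis windowing and 50 % overlap-add: the second half of each
    # windowed block is added to the first half of the next one, which
    # cancels the time-domain aliasing (TDAC) and yields one N-sample frame
    # per inverse-transformed spectrum.
    frames = []
    previous_tail = np.zeros(N)
    for block in blocks:
        y = window * block
        frames.append(previous_tail + y[:N])
        previous_tail = y[N:]
    return np.concatenate(frames)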


Figure 2.7: Inverse MDCT with overlap-add [1].

The inverse transformation depends on the transient flag transmitted from the encoder. In the transient mode the decoded MDCT spectrum is first de-interleaved into four spectra. These four spectra are individually inverse-transformed with a short IDCTIV, inverse aliased, windowed and overlap-added, as illustrated in Figure 2.8. The resulting time-domain block is then reordered to undo the time reversal performed in the encoder. The obtained block of length N is inverse aliased, windowed and overlap-added into time-domain signal frames as in the stationary mode, which is shown in Figure 2.7.



Figure 2.8: Inverse transform in the transient mode [1].

2.3 Performance of ITU-T G.719

The performance of the ITU-T G.719 codec can be described by its algorithmic efficiency, which is characterized in terms of algorithmic delay, algorithmic complexity and memory usage, and the subjective audio quality it delivers at different bitrates.

2.3.1 Algorithmic efficiency

The G.719 codec has a low complexity and a low algorithmic delay [1]. The delay depends on the frame size of 20 milliseconds and the look-ahead of one frame used to form the transform blocks. Hence, the algorithmic delay of the G.719 codec is 40 milliseconds.

The algorithmic delays of comparable codecs such as 3GPP eAAC+ [61] and 3GPP AMR-WB+ [60] are significantly higher. For AMR-WB+ the algorithmic delay for mono coding is between 77.5 and 227.6 ms depending on the internal sampling frequency. For eAAC+ the algorithmic delay is 323 ms for mono coding at 32 kbps and a 48 kHz sampling rate [6].

In Table 2.1 the average and worst-case complexity of G.719 is expressed in Weighted Million Operations Per Second (WMOPS). The figures are based on complexity reports using the basic operators of the ITU-T STL2005 Software Tool Library v2.2 [7]. For comparison, Table 2.2 shows the complexity of three comparable audio codecs: eAAC+, AMR-WB+ and ITU-T G.722.1C [8], the low-complexity super-wideband codec (14 kHz) that G.719 was developed from. The memory requirements of G.719 are presented in Table 2.3.

The delay and complexity measures show that the G.719 codec is very efficient in terms of complexity and algorithmic delay especially when compared to eAAC+ and AMR-WB+.


Bit rate (kbps)   Encoder only          Decoder only          Encoder plus Decoder
                  Average    Maximum    Average    Maximum    Average    Maximum
32                6.663      7.996      6.876      7.413      13.539     15.397
48                7.427      9.073      7.230      7.806      14.657     16.861
64                7.912      9.899      7.554      8.161      15.466     18.060

Table 2.1: G.719 codec complexity in WMOPS [1].

Bit rate (kbps)   G.722.1C   eAAC+   AMR-WB+
32                10.3       42.6    86.7

Table 2.2: Average complexity of comparable codecs in WMOPS [12].

Memory type                                Encoder   Decoder   Codec (encoder plus decoder)
Static RAM (16-bit kWords)                 1.0       3.9       4.9
Scratch RAM (16-bit kWords)                12.2      12.2      24.4
Data ROM (16-bit kWords)                   8.3       8.9       10.7
Program ROM (1000's of basic operations)   1.2       1.2       1.8

Table 2.3: Memory usage of the G.719 codec [1].

2.3.2 Subjective Quality Performance

ITU-T has performed subjective listening tests of mono coded material in two different experiments [9]. The LAME codec [10], version 3.97, was used as a reference in the comparison with the G.719 codec in both experiments. LAME is a perceptual MPEG-1, -2 and -2.5 Layer III encoder released under the LGPL licence. In the second experiment LAME and G.719 were also compared with ITU-T G.722.1C [8], the super-wideband codec (14 kHz) that G.719 was developed from.

The aim of the tests was to show, with 95% statistical confidence, that the audio quality of G.719 at 32, 48 and 64 kbps is not worse than that of LAME MP3 at 40, 56 and 64 kbps, respectively. The same requirement applied in the comparison to ITU-T G.722.1C at 48 kbps. The experiments were performed at two independent laboratories with 24 listeners each. The evaluation was performed with triple-stimulus/hidden-reference/double-blind tests with Degradation Mean Opinion Scores (DMOS) as described in [11].

The first experiment used clean, reverberant and noisy speech samples (English and French), while the second experiment used samples of mixed content and music. The mixed content was represented by trailers, news jingles, etc. with a mixture of speech (English and Spanish), music and noise.

In the first experiment G.719 performed better than LAME for both languages and for all bitrates that were tested. The second experiment similarly showed that for mixed content and music the quality of G.719 was not worse than that of LAME and G.722.1C.

An additional test of G.719 at higher bitrates was performed at Ericsson with 12 listeners on critical music items [9]. It was shown that transparency between the reference signal (source) and the G.719 encoded signal was obtained at 128 kbps as can be observed in Figure 2.9. The audio quality starts to saturate at 64 kbps and the degradation for the lower bitrates is significant.

Figure 2.9: Test results from subjective evaluation of ITU-T G.719 at higher bitrates [9].

2.4 Conclusion

In terms of audio coding, fullband audio signals are generally more demanding than speech signals, which have characteristics that can be efficiently modelled. ITU-T G.719 is a low-delay, low-complexity fullband audio codec that was standardized in 2008. G.719 encodes 50 % overlapped time-domain blocks of 40 ms in the Modified Discrete Cosine Transform (MDCT) domain. The codec performs well for speech, mixed content and music signals at bitrates from 64 kbps.

In order to enhance ITU-T G.719 with a stereo layer, the framework of G.719 should be considered. The time resolution and the time-to-frequency transform are two important aspects. G.719 uses an instantaneous switching between two different time resolutions (20 or 5 ms) for signal blocks of stationary or transient characteristics. In order to maintain a low complexity the stereo layer should be compatible with the G.719 transform blocks, and a stationary and a transient mode might be necessary also for the stereo layer. The special properties of TDAC that enable perfect reconstruction with the MDCT are also important to consider in the stereo coding, since they could be disturbed by the stereo processing.


In addition, the characteristics of the mono codec in terms of delay and subjective audio quality should be considered. The algorithmic delay of 40 ms in G.719 should not be significantly increased, especially since the G.719 codec is intended for real-time applications, e.g. tele- and videoconferencing. At the standardized bitrates of 32, 48 and 64 kbps the quality is not transparent, and a sufficient stereo audio quality could probably be obtained without transparent stereo coding.


3 Characteristics of our perception of sound in space

In the same manner as a psychoacoustical model is used in ITU-T G.719 to efficiently model the components of the audio signal that are perceptually important (see chapter 2), it is important to exploit the perception of sound in space in the pursuit of an efficient stereo model for stereo coding. Humans perceive spatial images of what they hear with their ears, similarly to how visual images are perceived with the eyes. The spatial image is formed by the sound propagating from audio sources, which are perceived as auditory objects with localization, width, intensity, etc. In many cases the perceived direction of the auditory object coincides with the actual location of the sound source, but depending on the environment it can differ [15]. In this chapter some properties of spatial hearing that are important for stereo coding are described.

3.1 Localization of sound in space using interaural cues

The spatial perception is related to the analysis of the ear entrance signals by the human auditory system. In a free-field listening scenario the sound of an audio source propagates without reflections, e.g. from walls, and the spatial image is entirely determined by the two paths between the audio source and the ears. The sound that enters the ear canal is filtered by the shoulders, head and outer ears (pinnae) differently depending on the location of the source. The filters, usually called Head-Related Transfer Functions (HRTFs), modify the source signal so that the horizontal direction (azimuth) and the vertical direction (elevation) to the source can be perceived by the auditory system. The accuracy is high since the differences between the signals at the two ears can be compared, as will be described in the following.

The human ear can be divided into three parts: the outer, the middle and the inner ear; see Figure 3.1. The variations in the acoustical pressure are captured by the outer ear, which consists of the pinna and the external auditory canal, and are led to the eardrum in the middle ear. The malleus, incus and stapes mechanically transmit the air pressure variations to the fluids of the inner ear cochlea through the oval window.

Figure 3.1: The anatomy of the human ear [16].


In the cochlea there are three fluid-filled tubes separated by membranes where two types of sensory cells (hair cells) are placed. The vibrations form waves in the fluids that travel along the membranes into the cochlea and excite the hair cells. The excitation pattern of the hair cells becomes different depending on the frequencies of the analysed sound. For pure tones the wave has a maximum amplitude, i.e. a maximum response, at a certain position in the cochlea as illustrated in Figure 3.2. High frequencies have a high response towards the base (near the stapes) while the low frequencies have a high response near the center (apex) of the cochlea.

Figure 3.2: Position in cochlea of maximum response to pure tones [67]. The frequency (in Hz) of the maximum response increases non-linearly with the distance (in percent) from the center (apex) of the cochlea.

As shown in Figure 3.2, the frequency response is non-linearly distributed along the cochlear axis, so the excitation for high frequencies lies closer to the base than that for low frequencies. In experiments measuring the threshold of frequency masking [2] it has been found that the threshold is rather flat for a narrow frequency range around the frequency of the masker while it drops off rapidly outside this range. For example, consider a tone that is masked by noise with a constant spectral density. If the bandwidth of the noise increases, the masking threshold for the tone will increase, but only up to a certain limit. For noise wider than this limit the masking threshold is only slightly affected even if the total level of the noise increases. These results are related to the frequency response in the cochlea, and in 1940 Fletcher defined the concept of critical bandwidths [67] (see Appendix A).

The analysed sound is transmitted via the auditory nerve to the superior olivary complex (SOC) in the brain stem where the information from the two ears excites binaural neurons. The binaural neurons describe the differences between the two ear input signals in terms of Interaural Time Differences (ITDs), Interaural Level Differences (ILDs) and something that can be seen as the InterAural Cross-Correlation (IACC) between the two signals [2][15].

Two of the interaural cues, the ILD and ITD, are related to the horizontal angular position (azimuth) of the source. For an audio source in front of the head the paths of the source signals are equally long and the wavefronts arrive simultaneously at the ears. For a source more to the right or the left the difference in path length implies that there is an ITD between the ear entrance signals. In addition, there is an ILD between the direct and the diffracted sound due to shadowing by the head, as illustrated in Figure 3.3.



Figure 3.3: Localization of a single source described by level and time differences [50].

In order to represent the azimuth of the audio source by the ITD and ILD it is important to consider the frequency dependency that follows from the frequency analysis in the cochlea. There is otherwise a cone of confusion, pointing out from each ear, on which the location determined by the ITD and ILD is ambiguous [17][18].

Depending on the frequency of the incident sound there is a different response to the ITD and ILD. For low frequencies below about 1 kHz the sensitivity to the ILD is significantly lower than the ITD sensitivity. This can be related to the negligible ILD of long wavelengths that diffract around the head without any significant shadowing. The ITD response is similarly low for frequencies above about 1.6 kHz where the ILD from the shadowing is dominant. For even higher frequencies the effect of the shadowing increases. For example, a sine wave of 3 kHz located straight out from the right ear has been shown to be attenuated by about 10 dB in the left ear while a sine wave of 6 kHz was attenuated by about 20 dB [17].

The idea of a “cut-off” frequency between high ITD and high ILD importance is often referred to as the duplex theory, but it has been shown that ITDs play a role even at higher frequencies [15][17][18]. However, in that case, the auditory system responds to the timing difference of the amplitude envelope rather than the ITD between the waveforms within the envelope, as illustrated in Figure 3.4. The auditory system cannot perceive whether the point B lags the point A or leads the point C. There is in other words no lateral displacement due to the ITD between the blue (solid) and red (dashed) waveforms in the figure. However, the time difference between the amplitude envelopes is perceived and affects the localization [17].



Figure 3.4: For high frequencies (above 1.6 kHz) the ITD of the waveform envelopes is perceived by the auditory system while the ITD within the envelope is ambiguous.

Nevertheless, the ITD and ILD depend on more than the azimuth to the source since the incident sound is not just shadowed by the head but also modified by the pinnae and the shoulders, i.e. by the HRTF filters. For example, frequencies between 6 and 10 kHz have been shown to be strongly affected by the head and pinna filtering [15]. In addition, it has been shown that the ITD and presumably the ILD affect the timbre of the signal, although the timbre mainly depends on the monaural spectrum [19].

The Inter-Aural Cross-Correlation (IACC), defined as the maximum of the normalized cross-correlation between the ear entrance signals, is, unlike the ILD and ITD, not related to the direction of the auditory object but to its perceived width. Totally coherent signals with IACC = 1 result in a compact auditory object while the width increases for signals with a lower IACC. In Figure 3.5, the impact of the IACC is shown for listening using headphones where the ear entrance signals are approximately equal to the headphone signals. When the IACC decreases, the source width increases until two distinct uncorrelated sources appear.

Figure 3.5: Width of the auditory object as a function of the IACC [50].


3.1.1 Limitations of the auditory system

The sensitivity to changes in the interaural cues is limited in the human auditory system by what is called Just Noticeable Differences (JNDs). For the level differences (ILD) the changes need to be larger than about 0.5 – 1 dB in order to be detected [15]. This resolution is almost constant over all frequencies and stimulus levels, but it depends on the reference value of the ILD. With a higher reference ILD a larger change is needed for detection, e.g. for a reference ILD of 9 dB the JND is about 1.2 dB while for an ILD of about 15 dB the JND is in the range of 1.5 to 2 dB [15]. The JND of the ITD has similar properties but is, in contrast, strongly frequency dependent [15][20]. For frequencies below 1 kHz, where the JND is lowest, the sensitivity to changes in the ITD can be described by a constant interaural phase difference of 0.05 rad. For higher frequencies the JND increases quickly. In addition, as for the ILD, the JND increases with the reference value, i.e. a larger ITD has a larger JND. However, the stimulus level does not seem to affect the JND of the ITD. The sensitivity to changes in the IACC cue is mainly related to the reference level of the IACC. If the coherence is large, i.e. the IACC is close to one, very small changes of about 0.002 can be detected, but the JND increases with decreasing IACC. At an IACC level of about zero a much larger change of about 0.2 is needed for detection by the human auditory system [15].
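As an aside, these JND figures indicate how stereo parameters can be represented more coarsely where the ear is less sensitive. The following is a purely illustrative sketch (not taken from any of the cited studies) of a non-uniform ILD quantization grid whose step size grows from about 1 dB near 0 dB towards about 2 dB at large level differences; the growth factor and the range are assumptions.

```python
import numpy as np

def ild_quantizer_grid(max_ild=18.0):
    """Illustrative non-uniform ILD grid: ~1 dB steps near 0 dB that widen
    towards ~2 dB at large level differences, loosely mimicking the JND
    behaviour described above (step growth factor is an assumption)."""
    levels, step = [0.0], 1.0
    while levels[-1] < max_ild:
        levels.append(levels[-1] + step)
        step = min(step * 1.15, 2.0)
    grid = np.array(levels)
    return np.concatenate([-grid[:0:-1], grid])   # symmetric around 0 dB

def quantize_ild(ild_db, grid):
    """Map an ILD value to the nearest grid point."""
    return grid[np.argmin(np.abs(grid - ild_db))]
```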

The human auditory system is not just limited by the JNDs but also renders the cues with a certain binaural sluggishness. The rendering has low-pass characteristics with a cut-off frequency in the range of 10 to 33 Hz, i.e. the time constants of the time-varying rendered cues are between 30 and 100 ms. Similarly, since the audio signals are analysed in critical bands in the ears (see Appendix A), the cues cannot be distinguished with a finer resolution than the ERB scale. However, although the human auditory system cannot follow fast variations of the interaural cues they can still be sensed. For example, fluctuations in the ILD or the ITD may be perceived as a spatial diffuseness similar to a decreased IACC.

Another psychoacoustic phenomenon of the spatial hearing that is related to the sensitivity to changes of the interaural cues is the Binaural Masking Level Difference (BMLD) [18]. The BMLD implies that the masking threshold is lowered if the masking audio (masker) gets a different spatial location than the masked signal (maskee). The effect can for example be observed when an in-phase masker is present in both ears together with an in-phase maskee. If the maskee signal then is presented out-of-phase between the ears it might suddenly be detected. Thus, the masking threshold is lowered due to the different spatial properties of the masker and the maskee.

3.1.2 Summing localization of audio sources

Two independent audio sources are perceived as two auditory objects with the location given by the ITD and ILD and the width of the object characterized by the IACC as described in section 3.1. However, when the audio sources are not independent but correlated, new auditory objects (phantom sources) are perceived in the space between the original sources. This property of the spatial hearing is necessary for stereo capturing and reconstruction of auditory objects. In Appendix B common stereo recording techniques are presented while the properties of the reconstruction are described in the following.


The ability to realistically reconstruct an auditory stage with several auditory objects is affected by both the recording technique and the playback system. The paths between the speakers and the ears are different for e.g. headphones and loudspeakers, and the related interaural cues are thereby different.

In case of headphone playback the acoustical path between the speakers and the ears is negligible. In this case the interaural cues can be described by inter-channel cues extracted from the stereo channels. Explicitly, the ITD is described by the Inter-Channel Time Difference (ICTD), the ILD is described by the Inter-Channel Level Difference (ICLD) and the IACC is given by the Inter-Channel Coherence (ICC). Consequently, the ICTD and the ICLD are directly related to the direction of the sources in space while the ICC is related to the width of the sources, as shown in Figure 3.6. When there is no ICLD or ICTD between the channels an auditory object is perceived in the middle of the head while the object is moved in the space between the headphones depending on the ICLD and the ICTD values. The width of the auditory object increases as the ICC decreases from one until there is a very low correlation between the channels and two independent objects are perceived in the left and right channels.


Figure 3.6: Headphone localization of auditory objects characterized by inter-channel cues.

a) The directional sources are characterized by the ICLD and the ICTD. With no ICLD or ICTD there is an auditory object perceived between the headphones (1) while the source is moved to (2) or (3) with increasing/decreasing ICLD and/or ICTD values.

b) The width of the auditory object increases from (1) to (3) while the ICC decreases until two independent sources are perceived (4).

For the standard stereo loudspeaker playback (-30° and +30°), as illustrated in Figure 3.7, similar effects of the inter-channel cues can be observed, but the acoustical path cannot be neglected. The complexity of the acoustical path implies that the interaural cues can be changed and affected by each other. For example, Blumlein showed that an ICLD panned signal could introduce an ITD at the ears of a listener [21]. However, the acoustical path effects need not be considered by a stereo coder since as long as the inter-channel differences are preserved in the coding stage the localization will not be changed from the original recorded signal [15][22]. In other words, the auditory objects should be located in the encoded audio signal just as they are in the uncoded audio signal.


Figure 3.7: Loudspeaker localization of auditory objects characterized by inter-channel cues.

a) The directional sources are characterized by the ICLD and the ICTD. If there is no ICLD or ICTD an auditory object is perceived in the middle between the speakers (1) while increasing/decreasing ICLD or ICTD move the source to e.g. (2) or (3).

b) The width of the auditory object increases from (1) to (3) while the ICC decreases.

3.2 Considerations about room acoustics

After the perception of the direct and diffracted sound signals, the reflections of the signal provide the listener with some additional information, a spatial impression. The information describes the environment, or more specifically the acoustics of the room, for example the material the walls are made of. The reverberation, i.e. the accumulated reflections, also tells the listener something about the surrounding environment, like the size of the room. In Figure 3.8 it is shown how the spatial impression is given by the propagation of the direct and the reflected sounds from an audio source.


Figure 3.8: Spatial impression from reflections and reverberation in the room acoustics [50].

Reflected signal components that arrive at the ears after the direct sound have different effects on the perceived spatial image, which is illustrated in Figure 3.9. An early reflection that arrives not later than about 1 ms after the direct sound affects the location by the principle of summing localization. Later reflections are suppressed according to the so-called precedence effect, which implies that the first wavefront in effect masks the second wavefront. In other words, the auditory object appears as if the later signal was not present. However, when the time difference exceeds about 2-50 milliseconds, depending on e.g. stimulus properties, the reflection is perceived as an echo [15][17]. The lower echo thresholds (~2 ms) have been observed for brief stimuli such as clicks while the higher thresholds (~50 ms) apply to long duration stimuli, e.g. noise and speech [23]. As presented in the figure, timbral coloration in the form of comb filtering occurs for early reflections, usually within 20 ms, which means that frequency components are periodically attenuated and amplified. Later reflections that arrive more than about 80 milliseconds after the direct sound contribute more to the listener envelopment, i.e. the perception of the environment, than to the auditory object [15].


Figure 3.9: Source localization in presence of reflections. Early reflections arriving up to 1 ms after the first wavefront result in summing localization. Later reflections cause coloration of the signal, but the precedence effect implies that the location of the second wavefront is suppressed. Even later reflections of more than 2-50 ms, depending on stimulus properties, are perceived as echoes [23].

The room acoustics provide useful cues for other properties of the signals as well, e.g. the distance to the audio sources. The distance to sources that are familiar to the listener is mainly determined by the signal power, but also by the amount of high-frequency content, which decreases with the distance travelled. For unfamiliar sound sources this information is not enough, but in the presence of reflections the ratio between the powers of the direct and the reflected sound, and the reverberation times, may be reliable distance cues. In addition, the auditory object width is also affected by the reflections. It has been shown that early reflections (up to about 80 milliseconds) from angles of about ±90 degrees from the forward direction have a great effect on the spatial impression of the environment [15].

3.3 Conclusion

Based on the sound propagation of audio sources, spatial images are perceived by our auditory system. The brain reacts to Interaural Level Differences (ILD) and Interaural Time Differences (ITD) between the analyzed entrance signals of the ears. In a simplified model of the spatial hearing, these cues describe the direction of audio sources while the Inter-Aural Cross Correlation (IACC) between the signals affects the width of the auditory objects. The spatial image perceived in the case of two audio sources is described by the source signals and the Head-Related Transfer Functions (HRTFs) modeling the paths from the sources to the ears. For coherent sources the interaural level and time differences are summed and perceived as the ILD and ITD of a phantom source in between the real audio sources. As the correlation between the sources decreases, the IACC is lowered and the width of the auditory object increases until the two distinct audio sources are perceived separately.

The human auditory system has several limitations. For example, there are Just-Noticeable Differences (JNDs) that define the sensitivity to changes of the different interaural cues. The temporal and spectral resolutions of the binaural rendering are also limited, which results, for example, in the phenomenon of the Binaural Masking Level Difference (BMLD). The BMLD means that the masking threshold is lowered when the masker and maskee have different spatial properties.

In rooms, reflections and reverberation, i.e. accumulated reflections, provide the listener with additional information about the environment, e.g. the size of the room. In addition, due to the precedence effect, the localization of audio sources in the presence of early reflections is mainly based on the cues of the first wavefront and not on the early reflections. In other words, the influence of the early reflections is suppressed by the human auditory system.

The presented properties of the spatial hearing are all important for stereo coding. However, in order to increase the performance of a stereo codec the reconstruction of the interaural cues is particularly important. In addition, the limitations of the human auditory system in the form of JNDs and the limited time-frequency resolution should definitely be considered in order to efficiently allocate the bits and increase the subjective quality.


4 Existing stereo coding techniques

Within the evolution of perceptual audio coding, techniques for efficient coding of stereo audio have been developed. The techniques provide a compact representation of the stereo audio in order to obtain a better stereo quality for bitrate-constrained applications. This chapter will give an overview of the existing stereo coding techniques that are part of several codecs available today, such as MPEG-1 Layer III (MP3) [24] and MPEG-2/4 AAC [25][28].

Prior to the evolution of efficient stereo coding techniques in the beginning of the 1990s, the two channels of stereo material were typically encoded separately, i.e. Dual Mono (DM) coded. Each channel was thereby encoded without consideration of the other channel, which implies that the method was very bitrate demanding. In the extreme case of identical left and right channels, the bitrate needed for DM coding is twice the bitrate actually needed. In general, stereo signals are more complex than that, but a joint representation of the stereo channels can be very useful and lead to an efficient data rate reduction.

Perceptual DM coding is problematic not only because of the high bitrate demands but also due to audible artefacts that might be introduced because of the Binaural Masking Level Difference (BMLD) that was presented in chapter 3. The BMLD implies that, at least for low frequencies, monaurally masked quantization noise can become audible in stereo playback due to a spatial property different from that of the audio signal itself [18]. In this case, the psychoacoustical models of the human hearing are applied independently to the channels and therefore the quantization noise might be uncorrelated between the two channels and unmasked by correlated signal components.

The first method used for stereo compression by joint coding of the stereo channels, so-called Joint Stereo (JS) coding, was Mid-Side (MS, see section 4.1.1) coding [29]. In MS coding the left and right stereo channels are transformed into a mid channel that represents the sum of the two channels and a side channel that represents the difference. For correlated signals the coding of the mid and side channels might be more efficient than coding of the left and right channels, e.g. considering the previous example of identical channels. However, it is in general not suitable to use MS coding alone, since for uncorrelated channels more bits are needed to encode the mid and side channels than to encode the left and the right channels independently. MS coding is therefore usually combined with DM coding where the most efficient method is chosen for each processed frame.

The performance of joint stereo coding is additionally affected by the technique used to record the stereo audio (see Appendix B). For example, the spaced microphones in an AB configuration result in less correlation between the channels than coincident microphones in an XY configuration do. Consequently, for AB recorded material it is more difficult to take advantage of stereo channel similarities and the gain of joint stereo coding is reduced. The JS coding with MS and DM is further described in section 4.1.1.

The JS technique was later enhanced with Intensity Stereo (IS, see section 4.1.2) coding [30] where only the mid channel is encoded while scale factors are used to describe the levels of the left and the right channel in relation to the level of the mid channel. The scale factors, which are estimated in frequency subbands, can be transmitted to the decoder at a very low bitrate. IS coding can therefore be very efficient since the bitrate for transmission of the scale factors is generally much lower than for transmission of the side channel.

IS can be seen as a parametric model of the stereo channels where the scalefactors describe the spatial properties of the signals in the encoded and transmitted (mono) mid channel. The technique can be generalized to parametric stereo coding with downmixing of the stereo channels and extraction of parameters that describe the stereo image. The downmixed mono channel might be encoded using a perceptual mono codec and transmitted together with the encoded stereo parameters to the stereo decoder. In Figure 4.1 the architecture of a parametric stereo codec is illustrated with downmixing and stereo analysis in the encoder and stereo synthesis using the decoded parameters and the decoded mono channel in the stereo decoder.

Figure 4.1: Parametric stereo encoder and decoder schemes.

An advantage of the presented parametric stereo coding architecture is that the artefacts related to the BMLD tend to be reduced. The use of a downmix signal implies that there is only one quantization noise used in the stereo synthesis. The spatial properties of the masked quantization noise and the masking signals are therefore equal and the binaural unmasking is avoided [31]. In MS coding, similar properties are obtained since the quantization noises in the mid and side channels become correlated in the stereo synthesis. In other words, the binaural unmasking is avoided since the left and the right channel are made up of both the mid and the side channel (see section 4.1.1) [29][32].

Another advantage of the architecture is the backward compatibility to mono decoders and mono playback systems. In addition, the scheme can be extended for multi-channel configurations as for example in MPEG Surround (MPS) [15] where the multi-channel audio is downmixed into mono or stereo channels which subsequently can be encoded as mono and stereo signals respectively. However, multi-channel coding of more than two channels will not be presented further since it is beyond the topic of this thesis.


Following the ideas of IS, the parametric stereo architecture depicted in Figure 4.1 has been enhanced with more complex parametric models. In section 4.1.3 a stereo coding method based on Principal Component Analysis (PCA) is presented. The principal component of the stereo channels is encoded and transmitted to the decoder where the ambience component is estimated using a Principal Component to Ambience energy Ratio (PCAR).

Another technique, the Binaural Cue Coding (BCC, see section 4.2) [15], uses the Inter-Channel Level Differences (ICLDs), the Inter-Channel Time Differences (ICTDs) and the Inter-Channel Coherence (ICC) to describe the stereo image. As explained in chapter 3, these inter-channel cues approximate the interaural cues which model the spatial properties of the auditory objects that are perceived by the human auditory system. However, in order to relate the inter-channel cues of complex stereo signals to the interaural cues of individual auditory objects, the stereo channels are decomposed into frequency subbands of signal blocks. The resolutions of the decomposition are chosen according to the time and the frequency resolutions of the auditory system (see chapter 3) in order to model the human spatial hearing.

In Parametric Stereo coding (PS, see section 4.3) [22] the principles of BCC are used in a slightly different implementation. The ICTD parameter is in PS replaced by the Inter-Channel Phase Difference (ICPD) and the Overall Phase Difference (OPD) parameters which can be estimated with a lower complexity than the time differences.

4.1 Joint stereo coding

Joint Stereo (JS) coding is originally the combination of Dual Mono (DM), Mid-Side (MS) and Intensity Stereo (IS) coding. The DM coding of each channel is used when the joint stereo coding methods are not efficient. Switching between the different coding methods is in [29] suggested on a frequency subband, frame-by-frame basis with the aim to always choose the most efficient coding algorithm in terms of compression and subjective quality. Due to the properties of the human hearing, the spatial image is mainly described by the level differences between the channels at frequencies above about 2 kHz (see chapter 3). IS coding is therefore used for frequencies above 2 kHz with MS coding of the lower frequencies.

4.1.1 Mid-side coding

In Mid-Side (MS) coding the representation of the stereo channels is changed in order to remove the channel redundancies and make the coding more efficient. The mid channel represents the sum of the channels while the side channel represents the corresponding difference. Consequently, the power of the mid channel is usually considerably higher than the power of the side channel and more bits have to be allocated for the mid-channel. Especially when the channels are correlated it is generally more efficient to encode the mid and side channels instead of the original left and right channels. The transformation of the left and the right stereo channels into the mid m[n] and the side s[n] channels can be performed in various domains, e.g. in the DFT domain or by using a specific filterbank. However, for the time domain signals l[n] and r[n] the MS transformation is given by


$$ \begin{bmatrix} m[n] \\ s[n] \end{bmatrix} = \frac{1}{2} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \begin{bmatrix} l[n] \\ r[n] \end{bmatrix} \qquad (4.1) $$

Since the matrix operation is invertible, the original stereo channels can be reconstructed in the stereo decoder from the decoded mid ($\hat{m}[n]$) and side ($\hat{s}[n]$) channels according to

$$ \begin{bmatrix} \hat{l}[n] \\ \hat{r}[n] \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \begin{bmatrix} \hat{m}[n] \\ \hat{s}[n] \end{bmatrix} \qquad (4.2) $$

where $\hat{l}[n]$ and $\hat{r}[n]$ are the reconstructed left and right channels respectively.
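As an illustration of Eqs. (4.1) and (4.2), a minimal sketch of the MS transform and its inverse, assuming NumPy arrays holding the time-domain channels (not taken from any particular codec implementation):

```python
import numpy as np

def ms_encode(l, r):
    """Eq. (4.1): m = (l + r) / 2, s = (l - r) / 2."""
    return 0.5 * (l + r), 0.5 * (l - r)

def ms_decode(m_hat, s_hat):
    """Eq. (4.2): l = m + s, r = m - s."""
    return m_hat + s_hat, m_hat - s_hat

# Round trip on a toy stereo block
l = np.array([0.5, 0.4, -0.1, 0.0])
r = np.array([0.3, 0.4, -0.2, 0.1])
m, s = ms_encode(l, r)
l_rec, r_rec = ms_decode(m, s)
assert np.allclose(l, l_rec) and np.allclose(r, r_rec)
```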

MS coding is not always more efficient than DM coding and therefore a switching criterion is usually used in order to select the most efficient method. According to [33] the criterion could be related to the powers of the mid and side channels. More specifically, if the power of the side channel exceeds a certain fraction of the power of the mid channel, it is more efficient to encode the original stereo channels than the mid and side channels. In [29] psychoacoustical thresholds for the left and right channels are calculated, and in case they differ by less than 2 dB the mid and side channels are encoded.

The LAME codec (version 3.98.4) [10] uses MS coding with a criterion of switching between MS coding and DM coding based on the Perceptual Entropy (PE). The PE is estimated as the number of bits required for the encoded signal to obtain the same perceptual quality as the original signal, i.e. transparent quality [34][35]. In other words, it determines the amount of bits required to put the quantization noise just below the masking threshold. In previous versions of LAME a Masking Level Difference (MLD) was additionally considered for the selection of MS or DM coding. The MLD was defined as the ratio between the side and the mid channel masking thresholds where binaural unmasking was considered by the Binaural Masking Level Difference (BMLD), which is presented in chapter 3. Nevertheless, in the latest LAME version the effect of BMLD is still considered since it is included in the masking threshold used in the PE measure. The PE for each block of the signals is defined as

$$ PE = \sum_{b=0}^{B-1} W[b]\,\log_2\!\big(SMR[b]\big) \qquad (4.3) $$

where SMR[b] is the Signal-to-Masking Ratio (SMR), B is the number of frequency subbands of index b and W[b] is the number of frequency coefficients in subband b. The SMR is given by

$$ SMR[b] = \begin{cases} E[b]/T[b] & \text{if } E[b] > T[b] \\ 1 & \text{if } E[b] \le T[b] \end{cases} \qquad (4.4) $$

where E[b] and T[b] are the spectrum energy and the corresponding masking threshold in the subband b. The estimated PE for the Left and Right (LR) channels is compared to the PE of the Mid and Side (MS) channels in order to determine the most efficient coding strategy. Since a lower PE is less bit demanding the channels with the lowest total PE are encoded, i.e. the MS channels if the PE for the MS channels is lower than the PE for the LR channels and vice versa.
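A minimal sketch of this decision rule, following Eqs. (4.3) and (4.4); the subband energies and masking thresholds are assumed to be provided by a psychoacoustic model that is not shown here:

```python
import numpy as np

def perceptual_entropy(E, T, W):
    """Eqs. (4.3)-(4.4): PE = sum_b W[b] * log2(SMR[b]) with
    SMR[b] = E[b]/T[b] if E[b] > T[b] and 1 otherwise."""
    smr = np.where(E > T, E / T, 1.0)
    return float(np.sum(W * np.log2(smr)))

def choose_coding_mode(E_l, T_l, E_r, T_r, E_m, T_m, E_s, T_s, W):
    """Encode the channel pair (LR or MS) with the lowest total PE."""
    pe_lr = perceptual_entropy(E_l, T_l, W) + perceptual_entropy(E_r, T_r, W)
    pe_ms = perceptual_entropy(E_m, T_m, W) + perceptual_entropy(E_s, T_s, W)
    return "MS" if pe_ms < pe_lr else "LR"
```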


The MS usage has been studied for nine stereo samples of different characteristics (see Database 1 in Appendix E) that have been encoded with LAME at 64 kbps. The characteristics are categorized as: clean speech, mixed content, music and noisy speech. The results are presented in Table 4.1 where the percentage of MS coding varies considerably even within the same category of stereo samples.

Type            Sample    Percentage of MS coding [%]    Average [%]
Clean speech    cls1      44.50                          28.95
                cls2      16.77
                cls3      25.57
Mixed content   mix1      45.12                          72.44
                mix2      99.76
Music           mus1      85.89                          92.95
                mus2      100.00
Noisy speech    nos1      100.00                         65.34
                nos2      30.68

Table 4.1: MS switching for stereo material (see Database 1 in Appendix E) encoded with LAME at 64 kbps high quality joint stereo mode.

In order to better understand the signal characteristics suitable for MS coding the switching for the mixed content and speech samples mix1 and cls2 is illustrated in Figure 4.2 and Figure 4.3. The mixed content sample consists of a moving speech source with restaurant background noise while the clean speech sample consists of two overlapping talkers with stable locations.


Figure 4.2: MS switching for the mixed content sample “mix1” encoded with LAME in the 64 kbps high quality joint stereo mode.

a) The waveform of the left channel where the blue and red lines correspond to blocks of LR and MS coding respectively.

b) The waveform of the right channel where the blue and red lines correspond to blocks of LR and MS coding respectively.


c) The Perceptual Entropy ratio on a logarithmic scale. When the blue curve is above (respectively below) the black threshold line, the PE of MS (respectively LR) coding is larger than the PE of LR (respectively MS) coding.

d) The Inter-Channel Coherence (ICC) between the left and the right channels, varying between 0 and 1, where 1 indicates coherent channels while 0 indicates uncorrelated channels.


Figure 4.3: MS switching for the clean speech sample “cls2” encoded with LAME in the 64 kbps high quality joint stereo mode.

a) The waveform of the left channel where the blue and red lines correspond to blocks of LR and MS coding respectively.

b) The waveform of the right channel where the blue and red lines correspond to blocks of LR and MS coding respectively.

c) The Perceptual Entropy ratio between coding of the MS and the LR channels, expressed on a logarithmic scale. When the blue curve lies above (respectively below) the black threshold line, the PE of MS (respectively LR) coding is larger than the PE of LR (respectively MS) coding.

d) The Inter-Channel Coherence (ICC) between the left and the right channels, varying between 0 and 1, where 1 indicates coherent channels while 0 indicates uncorrelated channels.

On one hand, in the first and the last part of the mixed content signal, i.e. around 0.15 and 0.85 in normalized time, the Inter-Channel Coherence (ICC) between the channels is low. In these parts of the signal the PE of the MS channels is mostly higher than the PE of the LR channels, and consequently DM coding is mainly used. On the other hand, in the middle of both the clean speech and the mixed content samples, MS coding is used even though the ICC is relatively low. However, the left and right channels seem to have similar amplitudes (or powers) where MS coding is preferred. In conclusion, according to the PE it seems to be more efficient to encode the MS channels when the stereo channels are fairly correlated and have similar powers. It can be noticed that these properties are consistent with the criterion presented in [33] where MS is used when the power of the side channel is low in relation to the power of the mid channel.

Even if the different switching criteria are not compared here, studies have shown that MS coding with the PE switching criterion results in relatively high stereo audio quality [35]. In addition, there exist several alternative measures for determination of the most efficient coding method. For example, in [35] something called the Allocation Entropy (AE) is suggested to be used in the decision rule for MS and LR switching. In [36] a Spatial Perceptual Entropy (SPE) is introduced as a more relevant lower bound for a transparent quality of multichannel audio than the use of solely the PE.

4.1.2 Intensity stereo coding

Intensity Stereo (IS) coding is a JS coding technique where the intensity differences between the channels, sometimes called scale factors, are used to model the stereo image. In the encoder, a mono signal is formed as a downmix of the left and the right channel, for example equal to the mid channel in MS coding (see section 4.1.1). Consequently, the downmix signal m[n] can be defined as the average of the two channels according to

$$ m[n] = \frac{l[n] + r[n]}{2} \qquad (4.5) $$

where l[n] and r[n] are the time domain samples of the left and right channels respectively.

The IS encoder and decoder are illustrated in Figure 4.4.


Figure 4.4: Intensity stereo encoder and decoder with transmission of encoded downmix channel and intensity scale factors.

From the left and the right input channels the scale factors representing the intensity of each channel are calculated. Due to phase differences between the signals, the downmix averaging might result in coloration and amplification or attenuation of signal components in the downmix signal. In order to reduce these effects the downmix channel is equalized, i.e. scaled to have a power that is equal to the summed powers of the input channels. The downmixing will be further described later in this section. In the decoder, the scale factors of the independent channels are used to scale the decoded downmix signal in order to estimate the original left and right stereo channels.

The equalization of the downmix signal and the estimation of the intensity scale factors are performed independently in frequency subbands. The signals are usually split into frequency subbands using a time-domain filterbank or a time-to-frequency transform, e.g. the Discrete Fourier Transform (DFT). The filterbank has a frequency resolution based on the critical band representation that models the human auditory system (see Appendix A). In a DFT based codec the spectral coefficients are grouped into subbands with bandwidths following the critical band representation. However, according to [39] the bandwidth of the subbands can be increased to twice the ERB without notable impairments in the stereo quality. As a result, the bitrate of the scale factors can be decreased by half in relation to the higher resolution for subbands following the ERB scale [15][39][40].

The downmixed signal is equalized such that the power of the signal components within every subband b is equal to the summed powers of the corresponding components in each channel. The subband downmix spectrum $M_{j,b}[k]$ is given by

$$ M_{j,b}[k] = e[j,b]\,\big(L_{j,b}[k] + R_{j,b}[k]\big) \qquad (4.6) $$

where $L_{j,b}[k]$ and $R_{j,b}[k]$ are the left and right channel spectra of subband b from the DFT of the time-domain block of index j, with frequency coefficient index k. $e[j,b]$ is the equalization factor defined as

$$ e[j,b] = \sqrt{\frac{P_L[j,b] + P_R[j,b]}{P_{L+R}[j,b]}} \qquad (4.7) $$

where the powers $P_L[j,b]$ and $P_R[j,b]$ are estimated in the subbands of the left and the right channel DFT spectra according to

$$ P_X[j,b] = \frac{1}{4N^2}\sum_{k \in k_b} \big|X_{j,b}[k]\big|^2 \qquad (4.8) $$

$X_{j,b}[k]$ denotes the subband spectrum of the left or right channel, $k_b$ are the frequency coefficients of subband b and 2N is the length of the transformed time-domain block j. The power $P_{L+R}[j,b]$ is obtained similarly as the power of the summed channels, i.e. the power of the sum $L_{j,b}[k] + R_{j,b}[k]$ in Eq. (4.6).
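A minimal sketch of the equalized downmix of Eqs. (4.6)-(4.8), assuming complex DFT spectra of one 2N-sample block and a list of subband index arrays; the square root in the equalization factor follows the power-matching interpretation of Eq. (4.7) used above:

```python
import numpy as np

def subband_power(X, k_b, N):
    """Eq. (4.8): power of the spectrum X over the bins k_b."""
    return np.sum(np.abs(X[k_b]) ** 2) / (4 * N ** 2)

def equalized_downmix(L, R, bands, N):
    """Eqs. (4.6)-(4.7): per-subband equalized downmix M = e * (L + R)."""
    M = np.zeros_like(L)
    for k_b in bands:                       # bands: list of bin-index arrays
        P_l = subband_power(L, k_b, N)
        P_r = subband_power(R, k_b, N)
        P_lr = subband_power(L + R, k_b, N)
        e = np.sqrt((P_l + P_r) / P_lr) if P_lr > 0 else 0.0
        M[k_b] = e * (L[k_b] + R[k_b])
    return M
```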

The main limitation for IS coding is the large distortions that might occur when the scale factors are used to model the stereo image of complex fullband audio signals with wide spatial images or time shifts between the captured channels [15]. However, when the scale factors are used above 2 kHz they approximate the spatial image fairly well according to the properties of the human auditory system as described in chapter 3. IS coding is therefore, in JS coding, used for the higher frequencies above 2 kHz together with MS coding of the lower frequencies below 2 kHz.

In [33], an enhanced JS coding scheme with time-alignment of the stereo channels is proposed. The time-alignment is used to increase the coherence between the stereo channels in order to reduce the power of the side channel and enable efficient stereo coding without transmission of the side channel. The side channel is, however, encoded and transmitted at high bitrates in order to achieve a better stereo quality. The right channel is time-aligned to the left channel and the mid and side channels are obtained according to

$$ \begin{bmatrix} m[n] \\ s[n] \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ g[n] & -1 \end{bmatrix} \begin{bmatrix} l[n] \\ r_a[n] \end{bmatrix} \qquad (4.9) $$

where $r_a[n]$ is the aligned right channel and g[n] is a gain factor. With an appropriate choice of the gain factor the transform matrix is invertible and the left and the aligned right channel can be reconstructed by

$$ \begin{bmatrix} \hat{l}[n] \\ \hat{r}_a[n] \end{bmatrix} = \frac{1}{1 + g[n]} \begin{bmatrix} 1 & 1 \\ g[n] & -1 \end{bmatrix} \begin{bmatrix} \hat{m}[n] \\ \hat{s}[n] \end{bmatrix} \qquad (4.10) $$

ˆ (4.10)

where $\hat{m}[n]$ and $\hat{s}[n]$ are the decoded mid and side signals.

The right channel can then be synthesized by de-alignment of the decoded signal $\hat{r}_a[n]$ using a transmitted time-alignment parameter. The authors in [33] show that the subjective audio quality is a function of the selected coding mode. The lowest quality is obtained for the parametric coding mode where the mid channel is encoded and transmitted together with the gain factors and the time-alignment parameters. The quality increases with an increasing amount of bits allocated for the side channel, and at a total bitrate of 96 kbps the performance was shown to be similar to JS coding with LAME at the same bitrate.

4.1.3 KLT-based stereo coding

A more advanced JS coding scheme is proposed in [37]. The scheme relies on a bi-dimensional Karhunen-Loève Transform (KLT) which can be seen as an energetic hierarchization of the signal components. The KLT of two correlated signals theoretically generates two de-correlated signals due to projection on the (orthogonal) eigenvector basis of the covariance matrix between the stereo channels [41]. The signals obtained from the KLT can be considered as optimal mid and side channels, m[n] and s[n] respectively. In the DFT domain, the KLT downmix of the left and right subband spectra is performed according to

$$ \begin{bmatrix} M_{j,b}[k] \\ S_{j,b}[k] \end{bmatrix} = \begin{bmatrix} \cos(\sigma[j,b]) & \sin(\sigma[j,b]) \\ -\sin(\sigma[j,b]) & \cos(\sigma[j,b]) \end{bmatrix} \begin{bmatrix} L_{j,b}[k] \\ R_{j,b}[k] \end{bmatrix} \qquad (4.11) $$

where $\sigma[j,b]$ is a rotation angle for the subband b that minimizes the energy of the side channel while the energy of the mid channel is maximized. According to [38] and [37] the angle can be evaluated as

$$ \sigma[j,b] = \frac{1}{2}\tan^{-1}\!\left(\frac{2R_{12}}{R_{11} - R_{22}}\right), \qquad -\frac{\pi}{4} \le \sigma[j,b] \le \frac{\pi}{4} \qquad (4.12) $$


where R11 and R22 are the auto-covariance of the left and right signals respectively, and R12 is the corresponding cross-covariance. The covariances of zero-mean signals can be estimated in the DFT domain according to

$$ R_{11}[j,b] = \frac{1}{2N}\sum_{k \in k_b} \big|L_{j,b}[k]\big|^2 $$
$$ R_{12}[j,b] = \frac{1}{2N}\sum_{k \in k_b} \Re\!\big\{L_{j,b}[k]\,R_{j,b}^{*}[k]\big\} $$
$$ R_{22}[j,b] = \frac{1}{2N}\sum_{k \in k_b} \big|R_{j,b}[k]\big|^2 \qquad (4.13) $$

where $k_b$ are the frequency coefficients of the subband b and 2N is the length of the transformed time-domain block j [42][43].
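A minimal sketch of the subband KLT of Eqs. (4.11)-(4.13), assuming complex DFT subband spectra of one block; the covariance normalization is omitted since it cancels in the angle:

```python
import numpy as np

def klt_rotation(L, R):
    """Rotate the subband spectra L and R into mid/side channels M and S
    (Eq. (4.11)) using the angle of Eq. (4.12) computed from the
    covariances of Eq. (4.13)."""
    R11 = np.sum(np.abs(L) ** 2)
    R22 = np.sum(np.abs(R) ** 2)
    R12 = np.sum(np.real(L * np.conj(R)))
    denom = R11 - R22
    if denom != 0.0:
        sigma = 0.5 * np.arctan(2.0 * R12 / denom)
    else:                       # equal channel powers: angle saturates at +-pi/4
        sigma = np.sign(R12) * np.pi / 4
    c, s = np.cos(sigma), np.sin(sigma)
    return c * L + s * R, -s * L + c * R, sigma
```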

The advantage of the KLT downmix is that the inherent problems of signal cancellation in the mid channel using the passive averaging downmix (see Eq. (4.5)) are minimized. The energy of the side channel is significantly lower than the energy of the mid channel if there is a high correlation between the channels. However, in [37] it is shown that the gain of coding the KLT mid and side channels compared to coding the original MS channels is low. In addition, the coding gain is decreased even further since the rotation angles from the KLT have to be transmitted to the decoder in order to reconstruct the left and right stereo channels from the KLT mid and side channels.

The KLT can however be efficient for IS coding where the mid channel is transmitted together with the scale factors determined as $\cos(\sigma[j,b])$ and $\sin(\sigma[j,b])$. The mono decoded KLT downmix channel $\hat{M}_{j,b}[k]$ is then inverse transformed in the decoder according to

$$ \begin{bmatrix} \hat{L}_{j,b}[k] \\ \hat{R}_{j,b}[k] \end{bmatrix} = \begin{bmatrix} \cos(\sigma[j,b]) & -\sin(\sigma[j,b]) \\ \sin(\sigma[j,b]) & \cos(\sigma[j,b]) \end{bmatrix} \begin{bmatrix} \hat{M}_{j,b}[k] \\ 0 \end{bmatrix} \qquad (4.14) $$

However, even if the power of the mid signal is maximized, the side channel might contain some energy from uncorrelated signal components. In order to compensate for the lost energy, the scale factors could be amplified so that the overall power of the reconstructed left and right stereo channels becomes correct [37][38].

In [42] and [43] another parametric stereo codec based on the Karhunen-Loève Transform is proposed. The codec is implemented in the Short-Time Fourier Transform (STFT) domain with sine windowed blocks of 4096 samples resulting in 4096 frequency coefficients. The coefficients are grouped into 20 subbands with bandwidths following the ERB scale which is described in Appendix A.

The stereo channels are transformed into a mid and side channel according to Eq. (4.11) but with an additional constraint on the rotation angle $\sigma[j,b]$. The angle is constrained in order to avoid signal cancellation in the overlap add (see section 2.2.1) of two adjacent signal blocks. The KLT representation is illustrated in Figure 4.5 where the M and S vectors are illustrated in the span of the left and right channel subband vectors with the subband data as the black dots.


Figure 4.5: KLT of the left and right subband spectra with constrained rotational angle.

If the M channel changes from the first to the second quadrant between two overlapping signal blocks there will be a cancellation of the left channel components due to the sign change in the M channel. The rotation angle is therefore constrained to the range $[0, \pi/2]$, i.e. the first quadrant, where it is most probable to find the optimal mid channel in general stereo recordings. The angle is given by

$$ \sigma[j,b] = \begin{cases} \dfrac{1}{2}\arctan\!\left(\dfrac{2R_{12}}{R_{11} - R_{22}}\right) + \dfrac{\pi}{2} & \text{if } R_{11} - R_{22} < 0 \\[2mm] \dfrac{1}{2}\arctan\!\left(\dfrac{2R_{12}}{R_{11} - R_{22}}\right) & \text{otherwise} \end{cases} \qquad (4.15) $$

where the covariances R11, R12 and R22 are given by Eq. (4.13). Due to the constrained angle there is a possibility that the real mid is captured partly or fully in the side channel. For example, if the left and right channels are fully anti-correlated, the constrained angle will be $\sigma[j,b] = \pi/4$ when it should be $\sigma[j,b] = 3\pi/4$. Consequently, the real mid channel becomes represented by the side channel while the real side channel becomes represented by the mid channel.
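A minimal sketch of the constrained angle selection, following Eq. (4.15) as reconstructed above; the handling of exactly equal channel powers is an assumption:

```python
import numpy as np

def constrained_klt_angle(R11, R12, R22):
    """Eq. (4.15): KLT rotation angle constrained to [0, pi/2] so that the
    mid channel stays in the first quadrant between overlapping blocks."""
    denom = R11 - R22
    if denom == 0.0:        # equal powers: the constrained angle is taken as pi/4
        return np.pi / 4
    base = 0.5 * np.arctan(2.0 * R12 / denom)
    return base + np.pi / 2 if denom < 0 else base
```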

The mid channel is encoded and transmitted to the decoder while the side channel is modeled and represented by parameters. In this case the spectral envelope of the side channel is described by the Principal Component to Ambience energy Ratio (PCAR) which is defined as


$$ PCAR[j,b] = 10\log_{10}\!\left(\frac{P_M[j,b]}{P_S[j,b]}\right) \qquad (4.16) $$

where the subband powers $P_M[j,b]$ and $P_S[j,b]$ are estimated as described in Eq. (4.8).

At the decoder side the stereo synthesis is based on the inverse KLT where the non-transmitted side channel is estimated. The side channel is characterized by a weak correlation to the mid channel which cannot be described by the transmitted stereo parameters. The mid channel is therefore decorrelated using a random phase all-pass filter (see section 4.3.2) and scaled using the PCAR in order to obtain a signal that has similar properties as the true side channel. The upmixing (synthesis) can be expressed as

$$ \begin{bmatrix} \hat{L}_{j,b}[k] \\ \hat{R}_{j,b}[k] \end{bmatrix} = \begin{bmatrix} \cos(\sigma[j,b]) & -\sin(\sigma[j,b]) \\ \sin(\sigma[j,b]) & \cos(\sigma[j,b]) \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & a[j,b] \end{bmatrix} \begin{bmatrix} \hat{M}_{j,b}[k] \\ D_{j,b}[k] \end{bmatrix} \qquad (4.17) $$

where $D_{j,b}[k]$ is the decorrelated version of $\hat{M}_{j,b}[k]$ and

$$ a[j,b] = \frac{1}{10^{\,PCAR[j,b]/20}} \qquad (4.18) $$
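A minimal sketch of the PCAR-based upmix of Eqs. (4.16)-(4.18), assuming a decorrelated signal D with approximately the same power as the decoded mid channel:

```python
import numpy as np

def pcar_upmix(M_hat, D, sigma, pcar_db):
    """Eq. (4.17) with the estimated side channel a * D (Eq. (4.18))."""
    a = 10.0 ** (-pcar_db / 20.0)          # Eq. (4.18)
    S_est = a * D                          # modelled side channel
    c, s = np.cos(sigma), np.sin(sigma)
    L_hat = c * M_hat - s * S_est
    R_hat = s * M_hat + c * S_est
    return L_hat, R_hat
```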

In [43] this stereo codec was subjectively evaluated with AAC+ used for the mono coding. The stereo bitrate was 3 kbps which together with the mono coding resulted in a total bitrate of 25 kbps. The proposed stereo codec showed on average a similar but slightly lower quality than the eAAC+ codec at 24 kbps for some critical samples from the MPEG database (see Appendix E and [43] for more information).

4.2 Binaural cue coding

Binaural Cue Coding (BCC) can be seen as an extension of IS coding with additional stereo parameters. The extension implies that BCC is suitable for fullband coding, in contrast to IS which is not appropriate for low frequencies where the time differences are perceptually important (see chapter 3). The BCC stereo model is based on the Inter-Channel Level Difference (ICLD), the Inter-Channel Time Difference (ICTD) and the Inter-Channel Coherence (ICC) which approximate the corresponding interaural cues that are important for the human spatial hearing. The ICLD and the ICTD are mainly associated with the perceived spatial location while the ICC corresponds to the perceived width of the auditory objects (see chapter 3). The timbral coloration from early reflections and the listener envelopment from later reflections, which contribute to the perception of the acoustical environment, are described even if they are not explicitly modelled. The coloration, for example, is described by the spectral envelope which is in a sense controlled by the subband synthesis of the ICLD [15].

In the BCC architecture the audio signals are processed in frequency subbands of filtered signal blocks. The time and frequency resolutions correspond to the time and frequency resolutions of the human auditory system, similarly as for IS coding (see section 4.1.2). The stereo image is analysed in signal blocks of about 4-16 ms, which is more frequent than the rate at which the human auditory system senses ILD and ITD changes, i.e. time constants between 30 ms and 100 ms.


The reason for the higher time resolution is that the human auditory system perceives fast ILD and ITD variations similarly to a lower IACC [15]. The precedence effect, which implies that the localization is predominantly determined by the first wavefront, is however not considered. In that case signal blocks of about 2 ms would have to be processed separately. Nevertheless, the BCC stereo architecture proposed in [15] has shown good results for stereo coding at bitrates below 48 kbps. In [40] hybrid BCC is suggested in order to obtain an even better performance. The method is similar to JS coding and uses DM coding for the subbands of the lowest frequencies together with BCC for the higher frequency subbands.

In Figure 4.6 a block diagram of the BCC coding scheme is presented. The ICLD, ICTD and ICC are extracted from subband signals of the left and the right stereo channels. The subband signals are downmixed with equalization as presented in section 4.1.2. The power of each subband of the downmix signal m[n] is thereby equal to the summed powers of the left and right subband signals. The downmixed mono signal and the stereo parameters, which are quantized and encoded, are transmitted to the decoder. In the decoder the decoded mono subband signal $\hat{m}_{j,b}[n]$ is decorrelated into the two subband signals $\hat{d}_{j,b}^{\,1}[n]$ and $\hat{d}_{j,b}^{\,2}[n]$, with signal block index j and subband index b. The decorrelated signals are used to reconstruct signal components with low correlation between the original stereo channels. The ICLD, ICTD and ICC are then used to synthesize the stereo channels as a linear combination of the decorrelated and the decoded mono signals.


Figure 4.6: Binaural coding scheme with ICLD, ICTD and ICC synthesis. The stereo channels are analyzed and downmixed in the subband domain. The mono signal is encoded and transmitted to the decoder together with the encoded ICLD, ICTD and ICC. In the decoder the left and the right stereo channel are synthesized with the decoded stereo parameters, the decoded mono and two decorrelated versions of the mono signal.

4.2.1 Spatial cue extraction

In BCC the ICLD, ICTD and ICC that are used to describe the spatial image of the stereo signals are extracted from the subband signals or spectra. The ICLD describes the difference in intensity of the two channel signals as the power ratio between the right and the left subband signals, expressed in dB, according to [15]

$$ \Delta P_{rl}[j,b] = 10\log_{10}\!\left(\frac{P_r[j,b]}{P_l[j,b]}\right) \qquad (4.19) $$

where $P_l[j,b]$ and $P_r[j,b]$ are the powers of the subband signals $l_{j,b}[n]$ and $r_{j,b}[n]$. The power of a subband signal $x_{j,b}[n]$ is estimated according to

$$ P_x[j,b] = \frac{1}{2N}\sum_{n \in n_j} x_{j,b}^2[n] \qquad (4.20) $$

where $n_j$ are the sample indices of signal block j of length 2N.

The ICLD can be similarly estimated for frequency domain subband spectra if the stereo analysis is performed in for example the DFT domain. The subband powers are then estimated as described in Eq. (4.8).

The ICTD is given by the time lag (in samples) that maximizes the normalized cross-correlation function $\phi_{lr}[j,b,\tau]$ according to

$$ \Delta\tau_{lr}[j,b] = \arg\max_{\tau}\,\big\{\phi_{lr}[j,b,\tau]\big\} \qquad (4.21) $$

where

$$ \phi_{lr}[j,b,\tau] = \frac{\Phi_{lr}[j,b,\tau]}{\sqrt{\Phi_{ll}[j,b,0]\,\Phi_{rr}[j,b,0]}} \qquad (4.22) $$

The cross-correlation function $\Phi_{lr}[j,b,\tau]$ is determined as a function of the time lag $\tau$ according to

$$ \Phi_{lr}[j,b,\tau] = \frac{1}{2N}\sum_{n \in n_j} l_{j,b}[n+\tau]\,r_{j,b}[n] \qquad (4.23) $$

where $n_j$ are the sample indices of signal block j of length 2N.

The ICC is determined as the maximum value of the normalized cross-correlation function according to

$$ c_{lr}[j,b] = \max_{\tau}\,\phi_{lr}[j,b,\tau] \qquad (4.24) $$
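A minimal sketch of the cue extraction of Eqs. (4.19)-(4.24) for one time-domain subband block, using the lag convention of Eq. (4.23) (a positive ICTD then means that the left channel is delayed relative to the right); the edge truncation of the correlation sums is a simplification:

```python
import numpy as np

def extract_cues(l, r, max_lag=40):
    """Return ICLD [dB], ICTD [samples] and ICC for one subband block."""
    n = len(l)
    icld = 10.0 * np.log10(np.mean(r ** 2) / np.mean(l ** 2))   # Eq. (4.19)

    denom = np.sqrt(np.sum(l ** 2) * np.sum(r ** 2))             # Eq. (4.22)
    best_phi, best_tau = -np.inf, 0
    for tau in range(-max_lag, max_lag + 1):
        # Eq. (4.23): sum over n of l[n + tau] * r[n]
        if tau >= 0:
            num = np.dot(l[tau:], r[:n - tau])
        else:
            num = np.dot(l[:n + tau], r[-tau:])
        phi = num / denom
        if phi > best_phi:
            best_phi, best_tau = phi, tau
    return icld, best_tau, best_phi              # Eqs. (4.21) and (4.24)
```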

A method for ICTD extraction in the DFT domain is proposed in [65] where the ICTD is estimated from the linear regression of the unwrapped Inter-Channel Phase Difference (ICPD). The method is less complex for a DFT domain codec, but there is a higher possibility of incorrect estimations due to improper phase unwrapping.


4.2.2 Spatial cue synthesis

The aim of the spatial synthesis is to synthesize inter-channel cues in the reconstructed channels that are as close as possible to the inter-channel cues of the original stereo signals. The BCC synthesis consists of delays, scale factors and filters that are applied to the decoded mono subband signals $\hat{m}_{j,b}[n]$ to estimate the left and right stereo channels. According to [15][39], the upmix can be expressed as

[ ] [ ] [ ][ ] [ ] [ ][ ] [ ] [ ][ ] [ ] [ ]kdbjbbjtnmbjanr

kdbjbbjtnmbjanl

bjbjbj

bjbjbj

2,22,2,

1,11,1,

ˆ,,ˆ,ˆ

ˆ,,ˆ,ˆ

+∆−=

+∆−= (4.25)

where the subband parameters ax, bx and xt∆ , and the decorrelated subband signals [ ]kd xbj ,

ˆ with x=1,2 will be presented in the following.

The ICTD parameter is used to determine the delays that are applied to the decoded mono signal in the synthesized channels. Since a positive (respectively negative) ICTD denotes that the right (respectively left) channel is delayed in relation to the right (respectively left) channel, the delays are given by

[ ] [ ]{ }0,,max,1 bjbjt lrτ∆=∆ (4.26)

and

[ ] [ ]{ }0,,max,2 bjbjt lrτ∆−=∆ (4.27)

In [46] it is suggested that the ICTD could be omitted from the parametric bitstream if lower bitrates are required. In that case an ICTD could be approximated from the ICLD according to

[ ] [ ]bjPqfbj rlslr ,, ∆=∆τ (4.28)

where q is a scaling factor and fs is the sampling frequency. A positive value of q moves the source further in the direction given by the ICLD cue. According to [46] a value of

61025 −⋅=q seconds/dB has showed to improve the subjective audio quality in comparison to the case when no ICTD is used.

As mentioned in section 4.2, decorrelation filters are used to synthesize uncorrelated components with the amount of decorrelation determined by the ICC parameter. In [39] the decorrelated subband signals for each channel x are given by the convolution of the downmix signal [ ]nm bj ,ˆ and the reverberation-filters [ ]nhx

[ ] [ ] [ ]nmnhnd bjxx

bj ,, ˆˆ ∗= (4.29)

with the filter impulse responses given by

[ ] [ ]

−=

−=

otherwise

NnTf

nwnh h

n

hsx

x

0

1,,011 (4.30)

Page 56: Stereo coding for the ITU-T G.719 codec - DiVA Portal

42 Chapter 4 - Existing stereo coding techniques

where [ ]nwx is independent stationary white Gaussian noise, Th is a time constant of the exponential decay of the impulse response, fs is the sampling frequency and Nh is the length of the impulse response. The time constant determines the reverberation time and the length of the impulse response is chosen as short as possible with a satisfying ICC synthesis [39]. Additional decorrelation filters will be described in section 4.3.2.

With the assumptions that the downmixed mono signal is equalized (see section 4.1.2) and decorrelated to the subband signals [ ]kd bj

1,

ˆ and [ ]kd bj2,

ˆ the parameters [ ]j,bax and [ ]j,bbx (introduced in Eq. (4.25)) can be determined according to

[ ] [ ] [ ][ ]

[ ] [ ] [ ][ ]

[ ] [ ] [ ]( ) [ ][ ] [ ]

[ ] [ ] [ ]( ) [ ][ ] [ ]j,bPj,bC

j,bPj,bBj,bAj,bb

j,bPj,bCj,bPj,bBj,bAj,bb

j,bCj,bBj,bAj,ba

j,bCj,bBj,bAj,ba

d

m

d

m

2

1

ˆ

ˆ2

ˆ

ˆ1

2

1

1

1

1

1

−+=

−+=

+−=

+−=

(4.31)

with

[ ] [ ]

[ ] [ ]( ) [ ] [ ][ ] [ ]( )j,bAj,bC

j,bcj,bAj,bAj,bB

j,bA

lr

/j,bΔPrl

+=

+−=

=

1241

1022

10

(4.32)

where [ ]bjPm ,ˆ , [ ]bjPd ,1ˆ and [ ]bjPd ,2ˆ are the subband powers for the decoded mono signal respectively the decorrelated mono signals.

Without ICC synthesis the scalefactors [ ]j,bbx becomes zero and [ ]j,bax can be simplified to

[ ] [ ]

[ ] [ ] [ ]

⋅=

+=20

12

101

10

1011/j,bΔP

/j,bΔP

rl

rl

j,baj,ba

/j,ba (4.33)

Another method used to synthesize ICC is based on variations of the ICLD within the subbands [15]. The scalefactors [ ]j,bax could then be defined for each sample as

[ ] [ ] [ ]

[ ] [ ] [ ] [ ]

⋅=

+=+∆

+∆

20/),(12

10/),(1

2

1

10

101/1nbjP

nbjP

rl

rl

nana

naν

ν

(4.34)

with

[ ] [ ] [ ]nbjcn xlrx νν ),1( −= (4.35)

Page 57: Stereo coding for the ITU-T G.719 codec - DiVA Portal

Section 4.3 - Parametric Stereo coding 43

where [ ]nxν is a random sequence which according to [46] could be uniformly distributed over the range of 5± dB. The average subband ICLD is preserved since the random sequence has a zero mean. The ICC synthesis can be performed similarly by varying the ICTD, but the variation should be more smoothly than for the ICLD [15].

The spatial synthesis can also be perform in the DFT domain where each spectral coefficient of the downmix signal is multiplied with the scalefactors [ ]j,ba and [ ]j,bbdefined as in Eq. (4.31) or (4.33). The ICTD is applied as a linear phase shift, i.e. the coefficients of the decoded mono spectra [ ]kM bj ,

ˆ are multiplied by

[ ] xtkN

i

x ekG∆−

(4.36)

where xt∆ are the delays given by Eq. (4.26) and (4.27), and 2N is the length of the transformed signal block j [46][47].

4.3 Parametric Stereo coding

Parametric Stereo (PS) coding can be seen as an improvement of BCC with the same concepts but with a more efficient implementation. For example, PS does not extract the ICTD but the Inter-Channel Phase Differences (ICPD) and the Overall Phase Difference (OPD), which can be efficiently encoded at a very low data rate. In addition, PS comprises dynamic segmentation in time of the audio signals. This means that stable spatial properties can be encoded with a low parameter update rate while strong spatial dynamics can be encoded with a higher time resolution, i.e. with a higher parameter update rate. This is similar to the switching technology in ITU-T G.719 according to transient or stationary characteristics (see chapter 2). The update rate of the parameters in the stationary mode is based on the lower bound of the measured time constant of the human auditory system, i.e. between 23 and 100 milliseconds, while in transient mode the windows are about 2 ms long in order to consider the precedence effect [15].

PS encoders and decoders process the signals either using a time-domain filterbank or the DFT. The frequency resolution is non-uniform according to the frequency resolution of the human auditory system. For the DFT this is obtained by organizing the spectral coefficients into subbands with bandwidths proportional to the critical bandwidths (see Appendix A). One implication of this is, however, that long transform windows are required for a sufficiently good resolution at low frequencies. However, PS is often implemented with a hybrid complex-modulated Quadrature Mirror Filterbank (QMF), which has a low computational complexity. The hybrid QMF partitions the signals into subbands where the lower frequency subbands are fed into an additional filterbank to increase the frequency resolution for those frequencies.

PS and BCC coding techniques never reach transparent quality even at high bitrates, i.e. the model saturates even when the quality of the mono coder approaches perceptual transparency. Indeed the approximations in the parametric description and the stereo image are still noticeable [22].

Page 58: Stereo coding for the ITU-T G.719 codec - DiVA Portal

44 Chapter 4 - Existing stereo coding techniques

4.3.1 Improvements over BCC technique

The improvements of PS coding compared to BCC are mainly related to implementation optimizations. The extraction of the Inter-Channel Phase Difference (ICPD) and the Overall Phase Difference (OPD) that replace the ICTD will be presented in the following together with the refined downmix and upmix (synthesis) that is used in PS coding. In addition, the decorrelation technique and the quantization and coding of the stereo parameters will be introduced as well.

4.3.1.1 Phase coding

In PS the relative time differences between the stereo channels are represented by the ICPD and the OPD. In the DFT domain the ICPD is given by

[ ] [ ] [ ]

∠=∆ ∑

bkkbjbjLR kRkLbj ,,,θ (4.37)

where kb are the frequency coefficients of subband b for the transformed signal block j.

Since this parameter only describes the relative phase difference between the two stereo channels, an additional phase parameter is estimated in order to be able to reconstruct absolute phase and reduce the signal cancellation in the overlap-add procedures in both encoder and decoder [22]. The OPD specifies the phase difference between the left and the downmix spectra according to

[ ] [ ] [ ]

∠=Θ ∑

bkkbjbj kMkLbj ,,, (4.38)

where [ ]kM bj , is the downmixed mono spectrum of the subband b.

The ICLD estimation is not affected by the phase coding but in PS the ICLD is defined as the power ratio between the left and the right channel instead of the power ratio between the right and the left channel as in BCC. In the DFT domain the ICLD is estimated according to

[ ] [ ][ ]

=∆

bjPbjP

bjPR

LLR ,

,log10, 10 (4.39)

where the powers [ ]bjPL , and [ ]bjPR , of the left and right subband spectra are estimated according to Eq. (4.8).

In PS, the ICC is estimated differently if the phase parameters are used or not. The ICPD and OPD could be omitted in order to reduce the bitrate, especially at higher frequencies where the phase differences are not very important for the human spatial hearing (see chapter 3). The phase differences can then be included in the ICC parameter which means that the ICC is extended to negative values. However, when the phase parameters are used for the stereo synthesis the ICC is estimated from the phase aligned stereo channels according to

Page 59: Stereo coding for the ITU-T G.719 codec - DiVA Portal

Section 4.3 - Parametric Stereo coding 45

[ ]

[ ] [ ] [ ]

[ ] [ ] [ ] [ ]

=

∑∑

∆∗

bb

b

LR

kkbjbj

kkbjbj

kk

bjibjbj

LR

kRkRkLkL

ekRkLbjc

,,,,

,,,

,

θ

(4.40)

or

[ ][ ] [ ]

[ ] [ ] [ ] [ ]

=

∑∑

bb

b

kkbjbj

kkbjbj

kkbjbj

LR

kRkRkLkL

kRkLbjc

,,,,

,,

, (4.41)

When the phase parameters are not used in the stereo synthesis the ICC is estimated according to

[ ][ ] [ ]

[ ] [ ] [ ] [ ]

=

∑∑

bb

b

kkbjbj

kkbjbj

kkbjbj

LR

kRkRkLkL

kRkLbjc

,,,,

,,

, (4.42)

In [44] another method for bit reduction is suggested. The OPD, in the case of a passive averaging downmix (as in Eq. (4.5)), is derived from other spatial cues according to

[ ] [ ] [ ] [ ]( )bjiLR

bjP LRLR ebjcbj ,20/, ,10, θ∆∆ +∠=Θ (4.43)

Even if the OPD is only approximately given by quantized parameters it has been shown that the sensitivity to the quantization is low [44]. Another estimation method of the OPD is presented in [45]. This method requires only the used of the ICLD and ICPD, i.e. the ICC is not needed, according to

[ ][ ]( ) [ ]((

[ ] [ ]( )[ ] [ ] [ ]( )

∆⋅+

∆⋅=∆=∆

=Θ otherwisebjbjcbjc

bjbjcbjPbjif

bjLR

LR

LRLR

,cos,,,sin,

arctan

0,&,0,

21

2

θθ

πθ (4.44)

where

[ ][ ]

[ ] 10/,

10/,

1 10110, bjP

bjP

LR

LR

bjc ∆

+= and [ ] [ ] 10/,2 101

1, bjPLRbjc ∆+=

According to [45] the method is more robust to quantization errors than the method presented in [44] since the ICC quantization errors have no effect on the estimation. In addition, the diffuseness can be better represented due to independent ICPD and ICC controlling.

Page 60: Stereo coding for the ITU-T G.719 codec - DiVA Portal

46 Chapter 4 - Existing stereo coding techniques

4.3.1.2 Refined downmix/upmix scheme

The equalized downmixing method that is used in IS (see section 4.1.2) and BCC (see section 4.2) is also suitable for PS. In order to avoid artifacts from a very high amplification of almost cancelled subband spectra the equalization factor can be limited to 3-6dB according to [40][62]. Another improvement to the equalized downmixing is suggested in [48] where the stereo channel spectra are phase aligned before the downmixing. More precisely, the right channel spectrum Rj,b[k] is phase-shifted so that it is in phase with the left signal Lj,b[k] before the sum is equalized. The method has been shown to perform almost equally well for different types of audio material and significantly better for signals with strong out-of-phase components [48].

Even if equalized downmixing is most common for PS coding there are other methods used. In [44] it is shown that the energy losses in a passive averaged downmix signal can be compensated in the decoder. The acoustic energy of the original signals is preserved by scaling the downmix spectral coefficients [ ]kM bj ,

ˆ according to

[ ] [ ] [ ]kMbjkM bjbj ,,ˆ,ˆ κ=′ (4.45)

with

[ ][ ]

[ ] [ ] [ ] [ ]( )bjbjcbj

LRLRbjPbjP

bjP

LRLR

LR

,cos,1021011022, 20/,10/,

10/,

θκ

∆⋅++⋅+

= ∆∆

(4.46)

where the stereo parameters are defined according to section 4.3.1.1.

Without transmission of the ICPD the scalefactor is defined as

[ ][ ]

[ ] [ ] [ ]bjcbj

LRbjPbjP

bjP

LRLR

LR

,1021011022, 20/,10/,

10/,

∆∆

⋅++⋅+

=κ (4.47)

where the ICC parameter is differently estimated in order to consider the phase differences.

The PS stereo synthesis has been developed in order to be more efficient than the BCC stereo synthesis. In PS, only one decorrelation filter is used instead of two which was the case in BCC (see section 4.2). The decorrelation effect is a result from the addition of the decorrelated signal to the downmix signal with different sign in the synthesized left and right channels. This implies that there is a complementary amplification or attenuation of the signal components in the synthesized channels which implies that the correlation between them can be controlled. The amount of decorrelation that is determined by the ICC parameter is controlled by the proportion of the decorrelated signal within the synthesized channels.

Page 61: Stereo coding for the ITU-T G.719 codec - DiVA Portal

Section 4.3 - Parametric Stereo coding 47

In Figure 4.7 a block diagram of a PS encoder and decoder implemented in the DFT domain is presented. The left and right stereo signals are transformed into the DFT domain where the spectral coefficients are grouped into subband. The ICLD, ICPD, OPD and ICC are estimated from the subband spectra and quantized, encoded and transmitted to the decoder together with the encoded mono signal. At the decoder side, the subband DFT spectrum of the decoded mono signal [ ]kM bj ,

ˆ is decorrelated into the subband spectrum[ ]kD bj , . The spectra are used together with the decoded stereo parameters to synthesize the

left and right stereo channels. In the following two different methods of stereo synthesis for PS will be described.

Mono decoder

Q-1

[ ]nlDECODER

ICLD, ICPD, OPD & ICC

analysis

Mono encoder

Q [ ]nl [ ]nr

[ ]nm

Downmix

ENCODER

DFT …

… …

:

:

DFT

: IDFT

[ ]nr

:

[ ]kM bj ,ˆ

ICLD, ICPD, OPD & ICC synthesis

IDFT :

Decorr filter

IDFT

[ ]kD bj ,

Figure 4.7: Parametric Stereo (PS) encoder and decoder working in the DFT domain. The transformed stereo spectra are split into subbands that are analyzed and downmixed. The mono signal is encoded and transmitted to the decoder together with the encoded ICLD, ICPD, OPD and ICC. In the decoder the left and the right stereo channel are synthesized with the decoded stereo parameters, the decoded mono and a decorrelated version of the mono signal.

There are two upmix (stereo synthesis) methods defined for PS coding i.e. with or without phase synthesis [15][22]. For both methods the subband spectra of the synthesized stereo channels can be described by

[ ][ ] [ ] [ ]

[ ]

=

kDkM

bjkRkL

bj

bj

bj

bj

,

,

,

,ˆˆ

E (4.48)

where E is an upmix matrix that is specific for the upmix method.

When the phase parameters are not available in the decoder the upmix matrix aE is used. The power of the downmix signal is assumed to have been equalized to the summed powers of the original left and right stereo channels. In addition, the decorrelation filter is assumed to have an all-pass frequency response which implies that the power of the decorrelated mono [ ]kD bj , is equal to the power of the decoded mono [ ]kM bj ,

ˆ . According to [15], the upmix matrix is given by

Page 62: Stereo coding for the ITU-T G.719 codec - DiVA Portal

48 Chapter 4 - Existing stereo coding techniques

[ ] [ ][ ]

[ ] [ ] [ ] [ ][ ] [ ] [ ] [ ]

+−+−++

=

),,sin(),,cos(),,sin(),,cos(

,00,

,2

1

bjbjbjbjbjbjbjbj

bjbj

bja

βαβαβαβα

λλ

E (4.49)

with

[ ][ ]

[ ] 10/,

10/,21 101

10, bjP

bjP

LR

LR

bj ∆

+=λ

(4.50)

and

[ ] [ ] 10/,22 101

1, bjPLRbj ∆+=λ (4.51)

The rotational angle [ ]bj,α is used to determine the amount of correlation between the synthesized stereo channels by

[ ] [ ]( )bjcbj LR ,arccos21, =α (4.52)

On one hand, when the ICC is zero, i.e. [ ] 0, =bjcLR , the angle [ ]4

, πα =bj which implies

that the left and right channels become orthogonal, i.e. uncorrelated. On the other hand, when the absolute value of the ICC is maximal, i.e. [ ] 1, =bjcLR or [ ] 1, −=bjcLR , the angle

is [ ] 0, =bjα and [ ]2

, πα =bj respectively. Then the synthesized signals become fully

correlated or anti-correlated respectively.

The overall rotation angle [ ]bj,β is chosen so that the average of the output signals consists of the decoded mono by

[ ] [ ] [ ][ ] [ ] [ ]( )

+−

= bjbjbjbjbjbj ,tan

,,,,arctan,

12

12 αλλλλβ (4.53)

From Eq. (4.53) it follows that the angle [ ]bj,β becomes zero independently of the ICC for stereo channels with equal power, i.e. [ ] 0, =∆ bjPLR , while [ ] 0, ≥bjβ if [ ] 0, <∆ bjPLR and [ ] 0, ≤bjβ if [ ] 0, >∆ bjPLR .

The aE upmix is illustrated in Figure 4.8.a where the synthesized left and right subband spectra are represented by the vectors bj,L and bj,R in the span of the decoded and the

decorrelated mono spectra bj,M and bj,D respectively.

Page 63: Stereo coding for the ITU-T G.719 codec - DiVA Portal

Section 4.3 - Parametric Stereo coding 49

bj,D

bj,R

bj,L

[ ]bj,α[ ]bj,β

bj,M[ ]bj,α

bj,R

bj,D

bj,M

[ ]bj,δ

bj,L

[ ]( )bj,sin ν [ ]( )bj,cos ν

a b

Figure 4.8: Visualization of the Parametric Stereo (PS) stereo synthesis.

a) aE used for stereo synthesis when phase parameters are not transmitted [15].

b) bE used for stereo synthesis when phase parameters are available [15].

In Figure 4.8.b the second synthesis method with bE (presented in the following) is illustrated. The pairs [ ]kL bj ,

ˆ and [ ]kR bj ,ˆ are represented as points within the oval that is

located in the span of the decoded and the decorrelated mono spectra. The coherence between the channels depends on the angle [ ]bj,δ which according to [15] is obtained by

[ ]

[ ] [ ]

[ ][ ] [ ]

[ ][ ]

=∆=

=otherwise

bjbj

bjcbjbj

bjPbjcif

bj LR

LRLR

2,

1,,

,,,2

arctan21mod

0,,4

,2

2

1

2

1

π

λλ

λλ

π

δ (4.54)

where [ ]bj,1λ and [ ]bj,2λ are given by Eq. (4.50) and Eq. (4.51) respectively.

The angle [ ]bj,δ can be seen as a rotational angle of the orthogonal decoded and decorrelated mono spectra that is equivalent to the bi-dimensional KLT (see section 4.1.2) of the synthesized subband spectra of the left and right channels. The upmix can therefore be seen as the inverse KLT of the decoded and decorrelated mono spectra. However, since the mono spectrum is equalized in the encoder the spectra are scaled using [ ]bj,ν defined according to [22] as

[ ] [ ][ ]bj

bjbjv

,1,1

arctan,µµ

+−

= (4.55)

where

Page 64: Stereo coding for the ITU-T G.719 codec - DiVA Portal

50 Chapter 4 - Existing stereo coding techniques

[ ] [ ]( )[ ][ ]

[ ][ ]

2

1

2

2

1

2

,,

,,

1,41,

+

−+=

bjbj

bjbj

bjcbj LR

λλ

λλ

µ

In addition, the phase is synthesized using the ICPD and OPD. The total upmix matrix becomes

[ ][ ]

[ ] [ ]( )[ ]( ) [ ]( )[ ]( ) [ ]( )

[ ]( )[ ]( )

=

∆−Θ⋅

Θ⋅

bjvbjv

bjbjbjbj

ee

bj bjbji

bji

,sin00,cos

,cos,sin,sin,cos

00

2, ,,

,

δδδδ

θbE (4.56)

which is used in Eq. (4.48) to synthesize the stereo channels.

The modulo operation in the definition of the angle [ ]bj,δ ensures that the mono signal lies in the first quadrat in order to not have signal cancellations in the overlap-add between signal blocks where the sign of [ ]( )bj,cos δ or [ ]( )bj,sin δ has been changed (see section 4.1.3) [22]. This makes the use of bE not appropriate if the ICC parameters are negative, i.e. defined without an absolute value (see Eq. (4.42)), which should be the case if phase parameters are not estimated [15].

4.3.2 Decorrelator

The decorrelation technique is a filtering method used to generate an output signal that is incoherent with the input signal from a fine-structure point of view. Moreover, the spectral and temporal envelopes of the decorrelated signal shall remain. The decorrelation filters are typically all-pass filters with specific phase modification of the input signal.

Time-differences between stereo channels have a specific relation to what is perceived by the human auditory system. As described in section 3.2, a time-difference up to 1 ms between two correlated source signals is related to the localization of a phantom auditory object. When the ITD between the two signals at the ear entrances increases the precedence effect [18] implies that the direction of the source is still detected but the width of the source is also increasing. When the ITD gets larger than about 40 milliseconds the sources are perceived with an echo see Figure 4.9.

Figure 4.9: The effect of time shifts between identical source signals to the human auditory system [50]. The sound source signal s[n] is delayed by t∆ samples applied to the left channel.

Δt (ms)

θ (degree)

0 1 2 10 20 30 40

30

-30

Stereo

imag

e

Precedence effect Echo effect

Δt

s[n]

Page 65: Stereo coding for the ITU-T G.719 codec - DiVA Portal

Section 4.3 - Parametric Stereo coding 51

A simple decorrelation filter is obtained by a fullband constant delay [31] but when the decorrelated signal is added to the decoded mono signal in the decoder (as described in section 4.3.1.2) there will be a strong comb-filter effect [31]. This means that in the reconstructed left and right channels there are large amplifications and attenuations periodically over the frequencies due to the linear phase shift obtained by the constant time shift. Such a large comb-filter effect is not desirable since it introduces an annoying metallic sound effect [31]. In addition, the fullband delay can affect the spatial localization, e.g. if transients are spread in the synthesized signals the precedence effect might imply that a source is moved in the stereo image.

In [49] an all-pass decorrelation filter with non-linear phase is proposed with an impulse response given by

[ ] ( )∑=

−+=

2/

012cos2 hN

l hh

lnN

lN

nh π

10 −≤≤ hNn (4.57)

where Nh is the length of the filter. The phase response is similar to the constant delay filter but the slope is decreasing with frequency, which can be seen as a decreased delay for the higher frequencies. This considers the fact that high frequencies evolve faster than low frequencies in time and are therefore more sensitive to time shifting [51].

In [52] an all-pass filter with random phase is proposed for decorrelator. A random phase could be defined from the transfer function

[ ] [ ]kiekH ϕ= (4.58)

where the random phase [ ]kϕ might be uniformly distributed over the interval of [ ]ππ ,− .

In [51] a subband time-shifting filter is proposed with different linear phase shifts applied in different frequency subbands of the signal. According to [51], the time shifts should be in the range of -20 ms to 20 ms in order to obtain enough decorrelation without affecting the overall signal characteristics significantly. In addition, it is suggested to limit the shifts depending on the frequency range, e.g. proportional to the longest waveform period present in each subband. The precedence effect can change the spatial image due to the time shifts but since the shifts are different in each subband the overall result is a wider stereo image, i.e. increased spaciousness. The transfer function of the filter is subband dependent according to

[ ] [ ]btNki

bj ekH∆

, (4.59)

where k is the spectral coefficient index, b is the subband index and 2N is the length of the filtered signal block. The sample delay [ ]bt∆ is a random variable uniformly distributed in the interval of for example [ ]ms20,,ms20 − [51].

Page 66: Stereo coding for the ITU-T G.719 codec - DiVA Portal

52 Chapter 4 - Existing stereo coding techniques

Artificial reverberators are another type of decorrelation filters that are characterized by a dense frequency spectrum with irregular resonances along the frequency axis and a diffuse noise-like impulse response. In [31] a reverberator based on serially connected all-pass filters with fractional delays, i.e. delays that are not constrained to the sample interval but can be fractions of samples, is proposed. The all-pass filters are given by the frequency response

[ ] p

p

gzzgzH −

++

=1

(4.60)

where g is a gain factor that determines the attenuation of the impulse response and p is the delay in samples and the order of the filter, i.e. the distance between the impulses in the impulse response. In Figure 4.10 the impulse response, the frequency response and the phase response of an all-pass filter following Eq. (4.60) with g = 0.6 and p = 10 are presented.

Figure 4.10: Impulse, frequency and phase response of an all-pass filter, g = 0.6, p = 10.

The proposed reverberator in [31] has a transfer function defined in the Z- transform domain as

[ ] [ ] [ ] [ ]

[ ] [ ] [ ]∏=

−−

++

=A

aapai

b

apaib

bd

b zeagzeagzzH

b

b

1 1 α

α

ψ (4.61)

where d is an overall delay, b is the subband index, bψ is a phase rotation factor introducing fractional delay, A is the number of all-pass links, gb is the gain factors, p is the delay and bα is the phase rotation for fractional delay of each all-pass filter.

4.3.3 Quantization and coding of spatial cues

In parametric stereo coders the extracted parameters are quantized, encoded and transmitted to the decoder together with the encoded mono signal. In PS [22] the ICLD, the ICTD, the ICPD, the OPD and the ICC, are scalar quantized according to perceptual criteria. The goal is to introduce quantization errors that are not perceptible by the human auditory system.

10 20 30 40 50 60 70 80 90 100

-0.6

-0.4

-0.2

0

0.2

0.4

0.6

Samples

Am

plitu

de 0 0.2 0.4 0.6 0.8 1-5

-4

-3

-2

-1

0

1

Normalized frequency (xπ rad/sample)

Mag

nitu

de (d

B)

0 0.2 0.4 0.6 0.8 1-2000

-1500

-1000

-500

0

Normalized frequency (xπ rad/sample)

Phas

e (d

eg)

Page 67: Stereo coding for the ITU-T G.719 codec - DiVA Portal

Section 4.3 - Parametric Stereo coding 53

The ICLD quantization table is based on the JND for the ILD, which is roughly independent of stimuli level and frequency but it increases with the reference ILD [15]. The ICLD is therefore non-linearly quantized in 31 levels that can be represented with 5 bits. The range is between -50 and 50 dB with smaller quantization step around 0 dB according to [22]

[ ]

50] 45, 40, 35, 30, 25, 22, 19, 16, 13, 10, 8, 6, 4, 2,... 0, 2,- 4,- 6,- 8,- 10,- 13,- 16,- 19,-

22,...- 25,- 30,- 35,- 40,- 45,- -50,[=qICLD (4.62)

where q = 0,…,30

The quantization indices for the quantized ICLD is obtained according to

[ ] [ ] [ ]

−∆=∆ qbjPbjP LRq

QLR ICLD,minarg, (4.63)

The sensitivity to ITD differences can be described by a constant phase for the low frequencies where the ITD is most important (see section 3.1). Based on this, the ICPD parameter is uniformly quantized with 8 levels, i.e. with 3 bits, according to [22]

[ ] ]4

7,4

6,4

5,4

4,4

3,4

2,4

,0[ πππππππ=qICPD (4.64)

where q = 0,…,7

The corresponding quantization indices are obtained from

[ ] [ ]

Λ

+∆

=∆ ICPD,21,4mod,

πθ

θbjbj LRQ

LR (4.65)

where ICPDΛ is the cardinality of the set of quantized ICPD values, i.e. the number of elements in ICPD .

The OPD is quantized with the same quantizer, i.e. according to Eq. (4.65).

The JND for the IC is smallest for correlated signals, i.e. for a high IC, and increases significantly when the signals are low correlated [15]. The ICC table used in PS is defined in 8 levels according to [22]

[ ] ]1,589.0,0,36764.0,60092.0,84118.0,937.0,1[ −−=qICC (4.66)

where q = 0,…,7

The quantization indices are obtained from

[ ] [ ] [ ]

−= qbjcbjc LRq

QLR ICC,minarg, (4.67)

Page 68: Stereo coding for the ITU-T G.719 codec - DiVA Portal

54 Chapter 4 - Existing stereo coding techniques

The quantized stereo parameters are then differentially coded over time, which means that for each signal block of index j the difference to the quantization indices of the previous block j-1 is transmitted. The differential quantization indices are given by

[ ] [ ] [ ]bjbjbj QQQd ,1,, −−= ρρρ (4.68)

where ρ denotes either the ICLD, ICPD, OPD or ICC. The quantization indices for the subbands of the first block have to be transmitted since they are the reference for the differentially coded indices. In order to reduce the bitrate even further the parameters can be modulo coded since the cardinality of the quantization tables are known in both the encoder and the decoder (see [22] for more information). The modulo-coded time-differential indices of the quantized parameters are finally entropy coded. In other words, the most probable index is represented with the shortest codeword by using for example Huffman coding [3].

In [22] the bitrates for each parameter have been estimated for PS with 34 subbands and a time resolution of 23 ms from a database of 80 different audio recordings. The phase parameters are only encoded and transmitted for the subbands below 2 kHz due to the fact that the ILD is dominant for the higher frequencies (see section 3.1). The estimated bitrates are presented in Table 4.2. The average total bitrate is 7.7 kbps, and it can be noticed that a similar amount of bits are allocated for the ICLD and the ICC while the phase parameters, i.e. the ICPD and the OPD, are encoded with a slightly lower bitrate. In eAAC+ [62] the total bitrate have been scaled down to an average of 1.5 kbps with 20 frequency bands, an update of 46 ms and no transmission of ICPD and OPD [22]. The listening test results in [22] show that encoded material with PS in eAAC+ at 24 kbps have equivalent quality to AAC+ [61] at 32 kbps, i.e. there is a coding gain of about 25 percent.

Parameter Bitrate (kbps) ICLD 2.84 ICPD 1.16 OPD 0.96 ICC 2.75 Total 7.7

Table 4.2: Average bitrates for Parametric Stereo (PS) [22].

4.3.4 Pre-weighting for improved downmix from USAC

The main problem in the equalized downmix, which is presented in section 4.1.2, is related to phase cancellation of signal components. In [53] a pre-weighting of the stereo channels has been introduced for MPEG USAC [54]. The stereo channel signals are weighted before they are summed in order to reduce the cancellation from fully anti-correlated signal components, i.e. L[k] = -R[k].

The equalized downmix from section 4.1.2, Eq. (4.6), becomes

[ ] [ ] [ ] [ ] [ ] [ ]( )kRbjwkLbjwbjekM bjbjbj ,2,1, ,,, += (4.69)

where

Page 69: Stereo coding for the ITU-T G.719 codec - DiVA Portal

Section 4.3 - Parametric Stereo coding 55

( )

=

−=

],[],[

],[2],[

2

1

brERbrw

brERbrw (4.70)

with

[ ][ ] [ ]

[ ]

[ ][ ]

[ ]20

,10

,

20,

10,

10,2110

10,),cos(2110],[ bjP

LR

bjP

bjP

LRLR

bjP

LRLR

LRLR

bjc

bjcbjbjER ∆∆

∆∆

⋅⋅++

⋅⋅∆⋅++=

θ (4.71)

The power of the sum of the weighted signals is equalized to the sum of the left and right channel powers as in the original downmix. The equalization factor is conclusively given by

],[],[],[],[

21bjP

bjPbjPbjeRwLw

RL

+

+= (4.72)

where the subband powers are estimated according to Eq. (4.8).

The ER parameter describes the ratio between the power of the summed non-aligned and aligned left and right stereo channels. In order words, the ER parameter in Eq. (4.71) can be expressed as

[ ][ ]bjP

bjPbjER

RL

RL

,,

],[′+

+= (4.73)

where

[ ]bjibj

LRekRkR ,, ][][ θ∆⋅=′

is the phase-aligned right channel. The ordinary equalized downmix of anti-correlated signals is useless since the signal components to be equalized are cancelled by the summation. With the pre-weighting the power of the non-aligned signals, i.e. [ ]bjP RL ,+ , becomes zero and consequently the pre-weighted sum of the signals is the given by

[ ] [ ] [ ] [ ]( )kRkLbjekM bjbjbj ,,, 02, ⋅+⋅= (4.74)

In the other extreme case, when the signals are coherent, ER = 1 which implies that the weights are equal to one and the pre-weighted downmix is equal to the ordinary equalized downmix.

Due to the pre-weighting the OPD will be changed as can be seen in Figure 4.11. If the OPD is estimated in the decoder from the ICPD and the ICLD as described in section 4.3.1.1, the Eq. (4.44) has to be changed to

[ ][ ]( ) [ ]( )( )[ ] [ ]( )

[ ] [ ] [ ] [ ]( )

∆⋅+⋅∆⋅

=∆=∆=Θ

∆otherwise

bjbjwbjwbjbjwbjPbjif

bj

LRbjP

LR

LRLR

LR ,cos,10,,sin,

arctan

0,&,,0,

220/,

1

2

θθ

πθ (4.75)

Page 70: Stereo coding for the ITU-T G.719 codec - DiVA Portal

56 Chapter 4 - Existing stereo coding techniques

l

r

(l+r)/2

opd

w1l

w2r

(l+r)/

opd

bj,L

bj,R

bj,M

ΘΘ′

bj,L1w

bj,R2w

bj,M

a b

Figure 4.11: The OPD is changed due to the pre-weighting.

a) The ordinary downmix where the left and right channels are summed and equalized.

b) The pre-weighting of the left and right subband spectra changes the OPD, i.e. the phase between the downmix spectrum and the left channel spectrum.

4.4 Conclusion

Stereo coding techniques are based on reduction of redundancies between the stereo channels by matrixing as well as parametric descriptions of the stereo image from a subjective perspective. In this chapter, the evolution of these techniques from Joint Stereo to Parametric Stereo has been presented. In addition, recent improvements of the techniques such as a KLT-based stereo model and improved downmix techniques have been described.

The presented parametric stereo models are known to not deliver transparent spatial audio quality for complex stereo channels. The synthesized stereo channel can suffer from a narrow stereo image with a poor reconstruction of the diffuse sounds. Nevertheless, these stereo coding techniques have shown to be efficient for intermediate subjective quality at bitrates up to 48 kbps [15]. It is therefore interesting to investigate the potential of parametric stereo coding with the ITU-T G.719 codec, especially since, for video conferencing, loudspeaker systems compatible with intermediate subjective quality are used. In the next chapter, a stereo framework that enables parametric stereo coding with G.719 will be presented.

Page 71: Stereo coding for the ITU-T G.719 codec - DiVA Portal

57

5 Stereo framework for the ITU-T G.719 codec

A stereo framework for ITU-T G.719 is naturally related to the framework of the mono codec which is described in section 2.1.1. To be able to use advantageous methods of the existing stereo coding techniques presented in chapter 4 there are some additional requirements of the framework. In this chapter a choice of the transform, the time resolution and the frequency resolution is made in order to enable efficient stereo coding with G.719 using the architecture of parametric stereo coding.

5.1 Transform compatible with G.719 codec

The Modified Cosine Transform (MDCT) used in G.719 has the advantages of high energy compaction with OverLap Add (OLA) to reduce block artifacts and enable perfect reconstruction. However, there is no explicit phase information in the MDCT spectra since the transform is real-valued. In addition, phase differences in form of ITDs are perceptually important for the spatial hearing (see chapter 3). The parametric stereo architecture with downmixing implies that only one channel is present in the decoder and ICTDs should be applied in order to obtain a more correct stereo image. In addition, uncorrelated components in the stereo channel are regenerated using phase altering decorrelation filters. Thus, in order to introduce phase information but still preserve the compatibility to G.719 the MDCT has been combined with the Modified Discrete Sine Transform (MDST) into the complex-valued Modified Discrete Fourier Transform (MDFT). First the MDST will be introduced and then the MDFT given by the MDCT and MDST spectra will be presented.

The MDST resembles the MDCT but the cosine basis functions are replaced by sine functions. The MDST of the time-domain signal x[n] is given by

[ ] [ ]∑−

=

+

++=

12

0 21

221sin2 N

nw

MDST kNnN

nxN

kX π for k=0,…,N-1 (5.1)

The MDST has similar properties as the MDCT and with OLA the effects of the TDA can be cancelled similarly as for the MDCT. The properties of the MDST are further described in Appendix section D.III.

With the MDCT as the real part and the negative MDST as the imaginary part the MDFT is defined as

[ ] [ ]∑−

=

+

++−

=12

0

21

2212 N

n

kNnNi

enxN

kXπ

for k=0,…,N-1 (5.2)

The MDFT transform can also be expressed in relation to the Discrete Fourier Transform (DFT) using the shift theorem which implies that a time shift of the signal x[n] corresponds to a linear phase shift of the DFT spectrum X[k]. Similarly, by duality, a frequency shift of the DFT spectrum corresponds to a complex modulation of the time-domain signal. The MDFT is related to the Discrete Fourier Transform (DFT) by

Page 72: Stereo coding for the ITU-T G.719 codec - DiVA Portal

58 Chapter 5 - Stereo framework for the ITU-T G.719 codec

[ ] [ ]

+

++

−−

= 4

12

122 NkN

Ni

nN

i

eenxDFTN

kXππ

for k=0,…,N-1

(5.3)

where the DFT is given by:

[ ]( ) [ ]∑−

=

−=

12

0

N

n

nkNi

enxnxDFTπ

for k=0,…,2N-1 (5.4)

Consequently, the MDFT can be efficiently computed with the Fast Fourier Transform algorithm [12] that reduces the complexity from ( )( )22NO to ( ) ( )( )NNO 2log2 2

arithmetic operations.

The Inverse MDFT (IMDFT) of the MDFT spectrum [ ]kX is given by

[ ] [ ]

ℜ= ∑

=

+

++1

0

21

2212

21 N

k

kNnNi

ekXN

nxπ

for n=0,…,2N-1 (5.5)

The IMDFT can be efficiently computed using the same technique as for the forward MDFT that was expressed in relation to the DFT. In this case the IMDFT is related to the Inverse DFT according to

[ ] [ ]( )kXIDFTeNnxn

Ni 2

2

π

= for n=0,…,2N-1 (5.6)

with

[ ]( ) [ ]∑−

=

−=

12

021 N

k

nkNi

ekXN

kXIDFTπ

for n=0,…,2N-1 (5.7)

and

[ ][ ]

[ ]( )

−=

−−

−== ∗

+

+−−+

+

++

12,,12

1,,0

4112

21

41

21

NNkforekNX

NkforekXkX NkNN

Ni

NkNNi

π

π

(5.8)

where * denotes the complex conjugate. The properties of the MDFT and IMDFT are further described in Appendix section D.IV.

To sum up, the MDFT has an explicit phase which is useful for parametric stereo coding since the time differences between the channels can be analyzed and synthesized. In addition, the compatibility with ITU-T G.719 is maintained since the real part of the MDFT spectra is equal to the MDCT spectra which can be encoded with G.719 and transmitted to the decoder. However, there are some more constraints to the stereo framework as will be described in the next section.

Page 73: Stereo coding for the ITU-T G.719 codec - DiVA Portal

Section 5.2 - Block processing from G.719 with additional zero-padding 59

5.2 Block processing from G.719 with additional zero-padding

The block processing of the stereo layer is similar to the G.719 mono codec block processing (see section 2.1.1). However, time-shifting that could be used for the stereo synthesis results in observable artifacts in the signals when the ordinary G.719 framework is used. There are two issues related to time-shifting in the MDFT domain which are wrap-around of the signal blocks and signal attenuation in the reconstruction due to unsynchronized analysis and synthesis windows.

A time-domain signal x[n] is by the MDFT described in terms of complex sinusoids that are periodic and thus they describe the signal even outside the transformed block. In other words the signal is periodically extended by the MDFT. From Eq. (5.5) it follows that

[ ] [ ]

[ ]( )

[ ]nxekXN

ekXN

Nnx

N

k

kkNnN

i

N

k

kNNnNi

−=

ℜ=

=

ℜ=+

=

++

+

++

=

+

+++

1

0

1221

221

1

0

21

2212

221

2212

ππ

π

(5.9)

The MDFT extension has thus a 4N sample periodicity i.e. x[n+4N] = x[n]. The signal block x = (x1, x2, x3, x4) of 2N samples with

[ ]

[ ]

−=

12/

2/)1(

mNx

Nmxmx for m=1,..,4

(5.10)

is thereby extended to (…, -x1, -x2, -x3, -x4, x1, x2, x3, x4, -x1, -x2, -x3, -x4,…). The DFT shift theorem shows similar results in the MDFT domain which means that a time shift of the time-domain signal x[n] corresponds to a linear phase shift of the MDFT spectrum X[k]. A time shift of Δt samples is realized by

[ ] [ ]tk

Ni

shifted ekXkX∆

+−

= 21π

1,,0 −= Nk (5.11)

However, since the time-shift is performed in the MDFT domain the extended signal is shifted and the effect can be seen as a wrap-around with sign change. If the signal block x = (x1, x2, x3, x4) of length 2N is delayed by N/2 samples in the MDFT domain according to Eq. (5.11) the shifted signal block becomes xshifted = (-x4, x1, x2, x3). For shifted audio signals this wrap-around effect might result in audible artifacts especially when transients are wrapped around.

In order to avoid the wrap-around effect the analysis and synthesis windows are zero-padded on both sides. For MDFT shifted signal blocks the zeros become a part of the time shifted signal instead of the negatively wrapped-around signal components. For example, a delay of N/2 samples applied in the MDFT domain to the signal block x = (x1, x2, x3, x4) of length 2N results in xshifted = (0, x1, x2, x3) instead of xshifted = (-x4, x1, x2, x3) if the amount of zero-padding is at least N/2 samples.

Page 74: Stereo coding for the ITU-T G.719 codec - DiVA Portal

60 Chapter 5 - Stereo framework for the ITU-T G.719 codec

In Figure 5.1 the effect of the zero-padding is shown for a speech signal with a MDFT domain time-shift applied in frequency subbands. The spectrograms of the signal, processed both with and without zero-padding, shows that there are signal components that have been wrapped-around in Figure 5.1.a. According to Figure 5.1.b the energy is significantly lower when zero-padding is used, which also is shown in the power spectrum in Figure 5.1.c.

Frequency [kHz] 20 10 0

sample 9000 4500 0

a

b

c Pow

er [d

B]

-12 -24 -36 -48 -60

0

Freq

uenc

y [k

Hz]

0

10

20

0

10

20

Figure 5.1: Wrap-around effect for time-shifts applied in the MDFT domain.

a) Spectrogram of a time-shifted signal processed with G.719 standard sine windows.

b) Spectrogram of a time-shifted signal with zero-padded windows. c) Power spectrum of the shifted signals illustrated in a) (blue line) and b) (purple

line) at the point where the wrap-around effect is observable.

In Figure 5.2 the zero-padded window for the G.719 stereo layer is presented. The amount of zero-padding is Z = 192 samples which corresponds to 4 ms and determines the maximum time-shift that can be introduced without wrap-around. The amount of zeros is chosen to allow time-shifts that are perceptually important for the spatial hearing (see chapter 3). The effective overlap of the windows is still 50 % in order to maintain the perfect reconstruction given by the Princen-Bradley condition (see Appendix section D.II).

Page 75: Stereo coding for the ITU-T G.719 codec - DiVA Portal

Section 5.2 - Block processing from G.719 with additional zero-padding 61

0 500 1000 1500 2000 2500 3000 35000

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

N = 960 samples

20 ms 4

Z = 192

Transform block, 2304 samples

sample

Ampl

itude

Figure 5.2: Zero-padded windows for reduced wrap-around effect.

The second issue related to the time-shifting depends on unsynchronized analysis and synthesis windows. When a signal block is time-shifted in the MDFT domain it is already windowed by the analysis window which means that the window is also shifted. At the synthesis in the OLA the synthesis window is applied with zero time-shifts and consequently it is not synchronized with the shifted analysis window. As a result, the Princen-Bradley condition is not completely fulfilled and the reconstruction is not perfect.

In order to reduce the problem the synthesis window could be shifted for compensation but when subband time-shifts are applied it is not possible to synchronize the windows completely. Alternatively, a squared analysis (respectively synthesis) window could be used together with a rectangular synthesis (respectively analysis) window, but the high frequency leakage of the rectangular window makes the sine analysis and synthesis windows more beneficial. In addition, the sine window is smooth and the effect from the unsynchronized windows is not as significant as for steeper window functions, e.g. the Vorbis window [14]. Annoying artifacts related to the issue have not been observed using the sine window and consequently equal sine analysis and synthesis windows are used for the G.719 stereo framework.

Because of the zero-padded windows the length of the transform blocks is increased as illustrated in Figure 5.2. The window lengths are increased with 2x192 samples from 1920 to 2304 samples as presented in Table 5.1. As already mentioned the overlap is still 50 % which implies that the hop size is unchanged. The size of the transform blocks, that is half the size of the windows, is however increased from 960 to 1152 frequency coefficients. The transform blocks are thereby not fully compatible with the G.719 mono layer anymore due to different transform sizes. In addition, as a first step the stereo architecture for G.719 is implemented in MATLAB while a floating point ANSI-C implementation of the ITU-T G.719 mono codec is used. For simulation purposes an adapted stereo framework has been developed as now will be described.

Page 76: Stereo coding for the ITU-T G.719 codec - DiVA Portal

62 Chapter 5 - Stereo framework for the ITU-T G.719 codec

Mono layer (G.719) Stereo layer

Hop size 960 samples 960 samples

Window length 1920 samples 2304 samples

Transform size 960 coefficients 1152 coefficients

Table 5.1: Different window and transform lengths for the mono and the stereo layer.

The developed stereo framework uses inverse transformations with OLA followed by direct transformations in order to change the resolution of the MDFT/MDCT spectra. In Figure 5.3 a block diagram of the framework is presented.

G.719 decoder

Subband synthesis

Decorrelation

[ ]kM

Q-1

[ ]nl

DECODER

Stereo analysis

G.719 encoder

Q

[ ]nl

[ ]nr

[ ]kM

Windowing MDFT

Downmix

ENCODER

Windowing MDFT

IMDFT OLA

Frequency partitioning …

… …

:

:

Frequency partitioning

:

:

IMDFT OLA

IMDFT OLA [ ]nr

[ ]nm [ ]nm

Figure 5.3: Block diagram of the proposed stereo framework for ITU-T G.719.

The left and right channel signals l[n] and r[n] are buffered, windowed and transformed with an algorithmic delay of 40 ms as in G.719. The MDFT spectra of each channel are partitioned into frequency subbands, as will be described in the next section, and analyzed in order to describe the stereo image. The stereo parameters are quantized, coded and transmitted to the decoder. Additionally, the subband spectra of the left and right channels are downmixed and inverse transformed with OLA into a time domain signal m[n]. The mono signal is encoded by G.719 with an additional algorithmic delay of 20 ms due to buffering and transmitted to the decoder. In the decoder the mono signal is decoded, windowed and transformed into subband spectra. The spectra are decorrelated and used together with the decoded mono spectra and the decoded stereo parameters to synthesize the left and right channel spectra. The spectra are subsequently inverse-transformed with OLA to the time-domain signals [ ]nl and [ ]nr . The total algorithmic delay of the stereo framework is conclusively 60 ms which is 20 ms more than the G.719 mono codec requires (see chapter 2). In section 9.2.1 the compatibility between the stereo and mono layers is discussed further.

Page 77: Stereo coding for the ITU-T G.719 codec - DiVA Portal

Section 5.3 - Frequency resolution based on the ERB scale 63

Due to the OLA the mono signal m[n] is made up of consecutive blocks that are differently processed and consequently the MDCT of m[n] used in the G.719 encoder is not equal to the real part of the MDFT spectrum M[k]. The effect of the modification of the MDCT spectra due to the OLA has been studied both objectively and subjectively using the developed stereo architecture for G.719 that is further described in the following chapters. The segmental Signal to Noise Ratio (SNR) (see section 8.1), i.e. the power ratio between the signal and the difference between the reference signal and the decoded stereo signal have been estimated in time-domain segments of 256 samples. The SNR showed to be equal in average for the stereo coding both with and without OLA when the mono signal was transmitted to the decoder uncoded. More specifically, for ten stereo samples of different characteristics (see MUSHRA database in Appendix E) the difference in the segmental SNR lies in the interval of [-0.6, 0.6] dB and [-0.9, 0.9] dB for the left and the right channels respectively with 95 % statistical confidentiality. Similar subjective results were observed and no significant artifacts were introduced in the output signals. Conclusively, there are no large effects of the additional OLA introduced in the proposed stereo framework. The defined stereo framework is therefore considered suitable for the first implementation of a stereo architecture for the G.719 codec.

5.3 Frequency resolution based on the ERB scale

Parametric stereo models are often based on stereo analysis in time and frequency tiles in order to capture single sources in each subband of the spectra. In the stereo architecture for ITU-T G.719 the MDFT spectra are partitioned into frequency subbands using overlapping frequency-domain filters that approximate the cochlear filterbank with bandwidths following the ERB scale (see Appendix A).

In Figure 5.4 a filterbank with ten subbands is presented with the total bandwidth limited to 20 kHz similarly as in G.719.

Figure 5.4: Filterbank with 10 frequency subbands following the ERB scale.

The filter functions are multiplied with the MDFT spectra in order to partition them into frequency subbands; in this case ten subbands. The subband spectra [ ]kX bj , are obtained by

0 5 10 15 20 250

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Ampl

itude

Frequency [kHz]

Page 78: Stereo coding for the ITU-T G.719 codec - DiVA Portal

64 Chapter 5 - Stereo framework for the ITU-T G.719 codec

[ ] [ ]kXkFkX jbbj ⋅=][, , Bb ,,1= (5.12)

where [ ]kFb are the subband filter functions of index b and B is the total number of subbands which cover the entire frequency spectrum , i.e. the 1152 MDFT coefficients.

The filterbank is constructed so that the sum of all the filters is one in every frequency coefficient. This implies that the fullband MDFT spectra that are inverse transformed and overlap-added in the encoder and the decoder are given by the sum of the corresponding subband spectra. The fullband spectrum [ ]kX j is thus obtained from the subband spectra

[ ]kX bj , according to

[ ] [ ]∑=

=B

bbjj kXkX

1, (5.13)

where B is the number of subbands, which in the example of Figure 5.4 is B = 10.

Except for the first and the last filters the filterbank is defined by

[ ][ ]

−=−

=

−+

=

−−− 1,,1

,,cos5.05.0

1,21,11

,2,1,1,2

,1

bbb

bbbb

b

b

GGkforkF

GGkforGG

GkkF

π (5.14)

where

[ ] [ ]

++

=2

1,1

bHbHG b and [ ] [ ]

+++

=2

21,2

bHbHG b

with

[ ]

( )

+−

=

5.000437.0

12196.3

s

Bb

fNe

bH for b = 2,…,B-1

where N=1152 is the length of the MDFT spectra and fs is the sampling frequency, i.e. 48 kHz. The first and the last subband filters (of indices b = 1 and b = B) have to be defined slightly different since they are not fully overlapped by other subband filters. More specifically, the filters are equal to one where there is not overlap by the adjacent filters. This can be observed for the tenth subband in Figure 5.4 where the amplitude of the filter is equal to one for frequencies between about 16 and 20 kHz.

In chapter 6, the number of subbands for the filterbank will be specified and optimized for two different stereo models in order to increase the subjective stereo quality at a low bitrate.

Page 79: Stereo coding for the ITU-T G.719 codec - DiVA Portal

Section 5.4 - Conclusion 65

5.4 Conclusion

A stereo framework for ITU-T G.719 has been defined by the complex-valued MDFT, zero-padded sine analysis and synthesis windows and a frequency-domain filterbank following the ERB scale. The framework is robust to time-shifts applied as linear phase shifts in the MDFT domain due to the zero-padding. However, the zero-padding implies that the stereo layer spectra are longer than the mono layer spectra and the compatibility between them is not complete. These problems are resolved with extra transforms that introduces an additional delay of 20 ms for the stereo layer which together with G.719 results in an algorithmic delay of 60 ms for the stereo codec.

Page 80: Stereo coding for the ITU-T G.719 codec - DiVA Portal
Page 81: Stereo coding for the ITU-T G.719 codec - DiVA Portal

67

6 Stereo models based on the BCC/PS architecture

The stereo framework for ITU-T G.719 that is proposed in chapter 1 allows stereo processing at a fixed time and frequency resolution. With respect to the background in chapter 4 a stereo architecture based on BCC and PS is associated with the G.719 codec, which is not subjectively transparent at the standardized bitrates of 32, 48 and 64 kbps. Since neither the parametric stereo models are able to deliver transparent quality, in this case the stereo quality (see section 4.3), it is of a very high interest to combine parametric stereo models with this transform codec. The target is to evaluate what level of quality that can be reached by this stereo architecture when associated with a perceptual audio codec like G.719.

The BCC/PS stereo architecture has been improved and adapted to the ITU-T G.719 mono codec. The proposed architecture uses the fixed time-resolution from G.719 while the frequency resolution for the MDFT domain stereo processing is parameter dependent. This optimized frequency resolutions that are not a part of the original BCC/PS architecture will be presented in section 6.3.1.

As described in chapter 1 the stereo layer is implemented in MATLAB while the ITU-T G.719 mono codec is used in a floating point ANSI-C implementation. Thereby, and due to the fact that the stereo and mono layers comprise different transform sizes, an inverse transformation with OLA is performed prior to the mono coding (see section 5.2). Similarly, the time-domain output signal of the mono coder is transformed back to the MDFT domain in the decoder using a direct transform. This results in an algorithmic delay of 20 ms for the stereo layer. The compatibility between the stereo and mono layers is further discussed in chapter 9.2.1.

The original BCC and PS architectures use equalized downmixing of the stereo channels into a mono (or mid) channel that is perceptually encoded and transmitted to the decoder. The stereo image is described with parametric models which in terms of bitrate can be advantageous over e.g. dual mono or mid side coding, see chapter 4. In BCC a stereo model with the ICLD, ICTD and ICC is used while the ICTD is replaced by the ICPD and the OPD in PS. More details about the BCC and PS stereo models are given in the sections 4.2 and 4.3.

For the proposed stereo architecture two different parametric models have been developed and improved using both novel and existing techniques. The stereo models are characterized by their downmixing method and the set of parameters that is used for the stereo modeling. In the following, the stereo models are described briefly in the sections 6.1 and 6.2 where the concepts are related to existing stereo coding techniques. In section 6.3, optimized frequency resolutions for stereo analysis are presented together with two new algorithms for improved parameter extraction. The algorithms are used in order to stabilize the stereo synthesis and to obtain a better subjective quality in terms of spatial image. Subsequently, signal alignment adapted to the given stereo framework is presented in section 6.4. In section 6.5, a decorrelation filter for spatial width synthesis is presented. Finally, in section 6.6, the synthesis of the stereo channels in the optimized frequency resolutions is described. Note that in this chapter analysis and synthesis is performed with MDFT spectra if nothing else is stated.

Page 82: Stereo coding for the ITU-T G.719 codec - DiVA Portal

68 Chapter 6 - Stereo models based on the BCC/PS architecture

6.1 High level description of model 1

The first stereo model describes the stereo image by using the ICLD, ICTD and ICC. In other words it uses the same parameters as the BCC technique does (see section 4.2). The model is also characterized by the equalized downmix from IS (see section 4.1.2) with additional pre-weighing from MPEG USAC defined in the section 4.3.4. In Figure 6.1 a block diagram of the subband processing in the encoder and the decoder for model 1 is presented with the frame and subband indices omitted for clarity.

Figure 6.1: Block diagram for encoder and decoder of model 1. The difference between this model and the second model is the downmix/upmix method and the parameterization of the stereo image, which is indicated by the blocks with red text. Block and subband indices are omitted for clarity. Notice that Q refers to both quantization and coding of the stereo parameters. Similarly, Q-1 denotes the decoding and de-quantization of the coded quantization indices into quantized stereo parameters.

According to Figure 6.1 the cross-spectrum of the left and right subband spectra $L_{j,b}[k]$ and $R_{j,b}[k]$ is computed at the encoder side in order to extract the ICTD $\Delta\tau_{LR}$, which is used for alignment of the stereo channels. From the time-aligned subband spectra $L^a_{j,b}[k]$ and $R^a_{j,b}[k]$ the ICLD $\Delta P_{LR}$ and the ICC $c_{LR}$ are extracted. The time-aligned subband spectra are pre-weighted and down-mixed into an equalized mono spectrum $M_{j,b}[k]$, which is inverse-transformed and encoded with the G.719 encoder. The encoded mono signal is transmitted to the G.719 decoder together with the quantized and encoded stereo parameters.

In the decoder the decoded mono signal is transformed into the MDFT domain. The subband spectrum $\hat{M}_{j,b}[k]$ is decorrelated with a specific all-pass decorrelation filter based on constrained Time-Shifting (TS). From the decoded mono spectrum $\hat{M}_{j,b}[k]$ and the decorrelated mono spectrum $D_{j,b}[k]$ the left and right channel subband spectra $\hat{L}_{j,b}[k]$ and $\hat{R}_{j,b}[k]$ are synthesized using the decoded quantized stereo parameters.


In contrast to BCC/PS the frequency resolution is parameter dependent, i.e. the filterbank is individually optimized for each stereo parameter. The number of subbands in the frequency domain is different for the analysis and synthesis of the ICLD, ICTD and ICC respectively. The resolution for the time alignment is equal to the resolution for the ICTD extraction, but for higher performance the downmixing and decorrelation are performed with higher frequency resolutions. The higher resolution of the downmixing does not affect the bitrate but slightly increases the complexity of the algorithm.

6.2 High level description of model 2

The second model utilizes the KLT downmix method, which is described in section 4.1.3, in contrast to the equalized downmix in model 1. The main difference to the existing KLT-based stereo model [42][43] (see section 4.1.3) is that the time differences between the stereo channels are modeled by the ICTD. With this adaptive mid-side transform the energy of the mid channel is maximized while the energy of the side channel is minimized. The rotational angle has been constrained to $0 \le \sigma \le \pi/2$ in order to avoid energy losses in the overlap-add (see section 4.1.3).

A block diagram of model 2 is presented in Figure 6.2. There are several differences between the stereo models, but the ICTD is extracted and used in the same way as in model 1. The KLT rotational angle $\sigma$, which is obtained in the downmixing stage, is encoded together with the ICTD $\Delta\tau_{LR}$ and the Side-to-Mid Level Difference (SMLD) $\Delta P_{SM}$ defined in section 6.3.

Figure 6.2: Block diagram for encoder and decoder of model 2. The difference between this model and the first model is the downmix/upmix method and the parameterization of the stereo image, which is indicated by the blocks with red text. There is a local decoding in the encoder for improved estimation of the side channel in the decoder. Block and subband indices are omitted for clarity.


In order to improve the estimation of the side channel in the decoder a local mono decoding is performed in the encoder. The mid and side spectra are both inverse-transformed with overlap-add to the time-domain signals $m[n]$ and $s[n]$ respectively. The mid signal $m[n]$ is encoded, decoded and MDFT-transformed while the side signal $s[n]$ is directly transformed to the MDFT domain. The SMLD is then calculated as the power ratio of the transformed side spectrum $\tilde{S}_{j,b}[k]$ and the decoded mono spectrum $\hat{M}_{j,b}[k]$. The inverse and direct MDFT transforms may seem redundant, but it can be shown that the resulting spectra are not equal. Therefore, since the mid spectrum is transformed via the time domain, the side spectrum is transformed in the same way, as depicted in Figure 6.2.

In the stereo decoder the mid spectrum is decoded and decorrelated with the same decorrelation filter as in model 1. The non-transmitted side-channel spectrum $S_{j,b}[k]$ is approximated by the decorrelated mid-channel spectrum $D_{j,b}[k]$ scaled by the decoded and quantized SMLD parameter. The time-aligned left and right channel subband spectra $\hat{L}^a_{j,b}[k]$ and $\hat{R}^a_{j,b}[k]$ are subsequently synthesized by the inverse KLT of the mid-channel spectrum $\hat{M}_{j,b}[k]$ and the approximated side channel $\hat{S}_{j,b}[k]$, using the decoded quantized rotational angle $\sigma_Q[j,b]$. Finally the ICTD $\Delta\tau_{LR}$ is synthesized to form the subband spectra $\hat{L}_{j,b}[k]$ and $\hat{R}_{j,b}[k]$ that approximate the original stereo channels $L_{j,b}[k]$ and $R_{j,b}[k]$. The synthesis is described in more detail in section 6.6.

6.3 Parameter extraction

The enhancements in the defined stereo models are partly located in the area of the parameter extraction. The ICLD and ICC of model 1 are extracted as described in section 4.3.1.1. The KLT angle of model 2 is defined according to Eq. (4.15) in the KLT-based codec that is described in section 4.1.3. The SMLD is defined similarly to the PCAR, which is used in that codec, but the energy ratio is inverted to describe the side-to-mid power ratio, which is given by

$\Delta P_{SM}[j,b] = 10 \log_{10} \dfrac{P_S[j,b]}{P_M[j,b]}$   (6.1)

where j and b are the block and subband indices respectively and the powers $P_S[j,b]$ and $P_M[j,b]$ are estimated as described in section 4.3.1.1. The SMLD describes the power of the side channel in relation to the mid channel, i.e. inversely to the PCAR defined in section 4.1.3. The benefit of the SMLD over the PCAR is a lower complexity since a division is replaced by a multiplication.
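
As a minimal illustration of Eq. (6.1), the MATLAB sketch below computes the SMLD from given subband powers. The variable names and the numerical values are hypothetical and only serve to show the calculation; the actual powers are estimated as described in section 4.3.1.1.

% Minimal sketch of Eq. (6.1): the SMLD as the side-to-mid power ratio in dB.
% P_S and P_M are assumed per-subband powers of the side and mid spectra for
% one block (example values only).
P_S = [0.02 0.15 0.08];           % side-channel subband powers
P_M = [1.10 0.90 0.40];           % mid-channel subband powers
SMLD = 10 * log10(P_S ./ P_M);    % Eq. (6.1), one value per subband [dB]
disp(SMLD)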

The ICTD used in both stereo models is extracted with two new algorithms for stabilized stereo synthesis. The algorithms are described later in this section. First the optimization of the filterbanks for the spatial analysis and synthesis, the downmixing and the decorrelation will be described.


6.3.1 Optimization of the frequency domain filterbank

The frequency resolution for the stereo processing (analysis and synthesis) is given by a bank of filters overlapping in frequency. These filters approximate the cochlear filterbank (see Appendix A), where the filter bandwidths are defined according to the ERB scale as described in section 5.3. In the implemented stereo architecture for ITU-T G.719 the filterbank resolutions have been optimized individually for each stereo parameter in order to maximize the subjective quality while minimizing the data rate. In other words, the number of subbands has been reduced until a clear degradation of the subjective quality could be observed. The total number of subbands is equivalent for the two stereo models in order to obtain comparable bitrates.

The number of subbands used for the stereo analysis and synthesis of the defined stereo parameters, the downmixing and the decorrelation is presented in Table 6.1 and Table 6.2. There is a significantly lower frequency resolution for the KLT than for the EQ-DMX with pre-weighting. This is related to the fact that the rotational angles of the KLT are encoded and transmitted to the decoder, which means that the bitrate is dependent on the frequency resolution. In the first model there is no downmix parameter transmitted to the decoder, so the frequency resolution has no impact on the final bitrate.

Stereo parameter                     Number of subbands
Model 1
    $\Delta P_{LR}$ (ICLD)           16
    $\Delta\tau_{LR}$ (ICTD)         12
    $c_{LR}$ (ICC)                   10
Model 2
    $\sigma$ (KLT)                   14
    $\Delta P_{SM}$ (SMLD)           12
    $\Delta\tau_{LR}$ (ICTD)         12

Table 6.1: Frequency resolution for the stereo parameters. The number of subbands covering the frequency range of 20 kHz is between 10 and 16 for the different stereo parameters.

                     Number of subbands
EQ-DMX               30
KLT                  14
Decorrelation        40

Table 6.2: Frequency resolution for downmixing and decorrelation. The number of subbands is significantly different between the EQ-DMX and the KLT.

The two downmix methods have been compared in their respective frequency resolutions by evaluating the errors from the projection of the stereo channels onto the downmix channel in every subband. The orthogonal projection of $X_{j,b}[k]$ onto $M_{j,b}[k]$ minimizes the energy of the projection/reconstruction error [64], which is given by


$\varepsilon_X\!\left(\Pi_X[j,b]\right) = \sum_{k \in k_b} \left| X_{j,b}[k] - \Pi_X[j,b]\, M_{j,b}[k] \right|^2$   (6.2)

where the projection coefficient is

$\Pi_X[j,b] = \dfrac{\sum_{k \in k_b} X_{j,b}[k] \cdot M^{*}_{j,b}[k]}{\sum_{k \in k_b} \left| M_{j,b}[k] \right|^2}$   (6.3)

The energy of the projection error for each subband is summed over the subbands and normalized by the spectrum energy of the projected channel $X_j[k]$ according to

$\mathrm{E}_X = 10\log_{10}\!\left( \dfrac{\sum_{b=1}^{B} \varepsilon_X\!\left(\Pi_X[j,b]\right)}{\sum_{k=1}^{N} \left| X_j[k] \right|^2} \right)$   (6.4)

where B is the number of subbands, i.e. 30 for the EQ-DMX and 14 for the KLT, and N is the number of frequency coefficients in the spectrum $X_j[k]$.

The normalized projection errors of the two downmix methods have been compared block by block and averaged over the stereo channels and the blocks according to

$\bar{\mathrm{E}} = \dfrac{1}{2J} \sum_{j=1}^{J} \left( \mathrm{E}_L^{EQ\text{-}DMX}[j] - \mathrm{E}_L^{KLT}[j] + \mathrm{E}_R^{EQ\text{-}DMX}[j] - \mathrm{E}_R^{KLT}[j] \right)$   (6.5)

where J is the total number of blocks evaluated and $\mathrm{E}_X^{EQ\text{-}DMX}[j]$ (respectively $\mathrm{E}_X^{KLT}[j]$) is the normalized projection error using the equalized downmix with pre-weighting (respectively the KLT downmix). L and R denote the left and right channel spectra $L_j[k]$ and $R_j[k]$ respectively.

The average projection error difference $\bar{\mathrm{E}}$ has been evaluated over ten stereo samples of clean, reverberant and noisy speech, mixed content and music. These samples are used in chapter 8 to evaluate the performance of the proposed stereo models (see the MUSHRA database in Appendix E). The comparison showed that $\bar{\mathrm{E}} = -0.51$ dB, i.e. the projection error was on average 0.51 dB lower for the EQ-DMX than for the KLT. Consequently, there is a slightly higher potential for a good reconstruction using the EQ-DMX than using the KLT. However, if the frequency resolution for the KLT is increased to 30 subbands, the average projection error difference is $\bar{\mathrm{E}} = 1.19$ dB. In the case of equal subband resolutions the KLT thus results on average in a lower minimum reconstruction error than the EQ-DMX. However, this comparison of the downmix signals does not consider the upmix methods or the psychoacoustical principles of human hearing.
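
The following MATLAB sketch illustrates how the normalized projection error of Eqs. (6.2)-(6.4) can be evaluated for one block and one channel. The random test spectra and the uniform subband partition are assumptions made only for this example; in the real comparison the ERB-based subband partitions and the actual MDFT spectra are used.

% Sketch of Eqs. (6.2)-(6.4): normalized projection error of a channel
% spectrum X onto a downmix spectrum M, summed over subbands.
% The spectra and the subband partition are hypothetical test data.
N = 960;                                   % number of frequency coefficients
X = randn(N,1) + 1i*randn(N,1);            % channel spectrum (e.g. left)
M = randn(N,1) + 1i*randn(N,1);            % downmix spectrum
edges = round(linspace(1, N+1, 31));       % 30 subbands (EQ-DMX resolution)

errSum = 0;
for b = 1:numel(edges)-1
    kb  = edges(b):edges(b+1)-1;           % coefficients of subband b
    PiX = sum(X(kb) .* conj(M(kb))) / sum(abs(M(kb)).^2);   % Eq. (6.3)
    errSum = errSum + sum(abs(X(kb) - PiX * M(kb)).^2);     % Eq. (6.2)
end
E = 10*log10(errSum / sum(abs(X).^2));     % Eq. (6.4), normalized error [dB]
fprintf('Normalized projection error: %.2f dB\n', E);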


As a consequence of the reduced number of subbands for the stereo parameters the bandwidths of the frequency filters in the filterbank have been increased. They still approximate the human hearing by nearly following the non-linear ERB scale but the bandwidths have been adjusted and defined as multiples of the original ERB. In Figure 6.3 the filterbank bandwidths at corresponding center frequencies are presented for the different subband resolutions used in the stereo analysis and synthesis.

Figure 6.3: Bandwidth of subbands as a function of the centre frequencies of the frequency filters.

The bandwidths of odd multiples of the ERB scale from 1 to 9 are shown as a reference. The bandwidths for the stereo parameters vary between 3 and 9 ERB. The largest bandwidths are allowed at the highest frequencies, from about 10 kHz, since the human auditory system is more sensitive to low frequency components [2]. The ICC and ICTD/SMLD parameters are processed with bandwidths of approximately 5xERB for all frequencies, while the ICLD and the KLT parameters are processed with a slightly higher resolution at higher frequencies, i.e. a bandwidth of about 3xERB at 8 kHz. The higher resolution for the KLT and the ICLD parameters at high frequencies can be related to the duplex theory, i.e. the higher importance of level differences above about 2 kHz (see section 3.1). The frequency resolutions could probably be optimized further with respect to the psychoacoustics, especially with a lower resolution for the high frequency subbands of the ICTD parameter.

The amount of overlap between the frequency filters has been defined as a compromise between time-domain leakage, i.e. spread of signal components in time, and potential signal attenuation when different time shifts are applied in adjacent subbands. The same amount of overlap relative to the bandwidths has been used for all the filterbanks independently of the subband resolutions.


The sharpest subband filters possible are ideal band-pass filters, which can be expressed as a linear combination of ideal low-pass filters. The impulse response of an ideal low-pass filter $h_L[n]$ with a rectangular DFT frequency response

$H_L[k, B] = \begin{cases} 1, & k \le B \\ 0, & \text{otherwise} \end{cases}$   (6.6)

is according to [63] given by

$h_L[n, B] = \dfrac{2B}{f_s}\,\mathrm{sinc}\!\left(\dfrac{2B}{f_s}\,n\right) = \dfrac{\sin\!\left(\dfrac{2\pi B}{f_s}\,n\right)}{\pi n}, \qquad n = -\infty, \ldots, \infty$   (6.7)

where B is the bandwidth (in frequency bins), which is equal to the cut-off frequency, of the low-pass filter and $f_s$ = 48 kHz is the sampling frequency. The filter is symmetric, indefinitely long and non-causal, which means that it requires future values of the signal to be filtered. Filtering of the signal blocks is, however, performed in the frequency domain as the multiplication of the zero-phase frequency response $H_L[k]$ and the signal spectra. This corresponds in the time domain to a circular convolution (see [63]) of the truncated impulse response $h_L[n, B]$, $n = 0, \ldots, 2N-1$, that is cyclically extended so that $h_L[n+N, B] = h_L[N-1-n, B]$. The impulse response of 2N = 2304 samples and the corresponding frequency response for an ideal low-pass filter with a bandwidth of 100 frequency coefficients are illustrated in Figure 6.4.

Figure 6.4: Impulse response $h_L[n, B]$ of 2N = 2304 samples (top figure) and frequency response $H_L[k, B]$ of N coefficients for frequencies up to $f_s/2$ = 24 kHz (bottom figure) of an ideal low-pass filter where the bandwidth B = 100 samples.
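
As an illustration of Eqs. (6.6)-(6.8), the MATLAB sketch below builds the rectangular zero-phase frequency response of an ideal low-pass filter and combines two of them into a band-pass subband filter. The transform length and the example bandwidths are hypothetical and are not the resolutions used in the implemented filterbank.

% Sketch of Eqs. (6.6)-(6.8): ideal low-pass and band-pass subband filters
% defined directly by their rectangular zero-phase frequency responses.
% N, B1 and B2 are hypothetical example values.
N  = 1152;                        % frequency coefficients up to fs/2
B1 = 60;                          % lower cut-off (in frequency bins)
B2 = 160;                         % upper cut-off (in frequency bins)

HL  = @(B) double((0:N-1)' <= B); % Eq. (6.6): rectangular low-pass response
HBP = HL(B2) - HL(B1);            % Eq. (6.8): band-pass as difference of low-passes

% Filtering a spectrum is a bin-wise multiplication, which corresponds to a
% circular convolution with the truncated impulse response in the time domain.
X  = randn(N,1) + 1i*randn(N,1);  % some input spectrum (test data)
Xb = HBP .* X;                    % spectrum restricted to the subband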


In a signal block that has been circularly convolved with this filter there will be a resonance of the signal components with a frequency close to the cut-off frequency. The cut-off frequency enhancement is especially significant for signal blocks that consist of both low and high power parts. The high amplitude filter taps might then be suppressed by a low power signal at the same time as the low amplitude filter taps become dominant due to a high power signal. The time-domain leakage of this low-pass filter is therefore most noticeable for audio signals with transients, i.e. sudden impulses of high energy. The resonance of the cut-off frequency is shown in Figure 6.5, where the impulse and frequency response of the truncated low-pass filter in Figure 6.4 are presented.

Figure 6.5: Truncated impulse response of the low-pass filter in Figure 6.4 with B = 100 (top figure) and the corresponding frequency response for frequencies up to fs/2 = 24 kHz (bottom figure). The cut-off frequency of the low-pass filter is significantly enhanced by the truncated impulse response.

The stereo framework filterbank is, however, built from band-pass filters, but as already mentioned these can be expressed as a function of low-pass filters. An ideal band-pass filter of bandwidth $B_{BP}$ can be constructed from two ideal low-pass filters according to

$h_{BP}[n, B_1, B_2] = h_L[n, B_2] - h_L[n, B_1]$   (6.8)

where $B_2 > B_1$ are the low-pass filter bandwidths that correspond to the two cut-off frequencies, and the bandwidth of the filter is $B_{BP} = B_2 - B_1$.

As presented in section 5.3, the fullband frequency spectra of the subband processed signal blocks are reconstructed by a summation of the subband spectra (see Eq. (5.13)). The total filter for two adjacent ideal band-pass filters is similarly given by


$h_{BP}[n, B_{11}, B_{12}] + h_{BP}[n, B_{21}, B_{22}] = h_L[n, B_{12}] - h_L[n, B_{11}] + h_L[n, B_{22}] - h_L[n, B_{21}] = h_L[n, B_{22}] - h_L[n, B_{11}]$   (6.9)

where $B_{22} > B_{21} = B_{12} > B_{11}$, since the lower cut-off frequency of the second subband filter coincides with the higher cut-off frequency of the first subband filter.

For the band-pass filter there will be a resonance at both cut-off frequencies $B_{22}$ and $B_{11}$, since the combined impulse response is the difference between the two sinc functions of the low-pass filters.

In the stereo synthesis the subband spectra are independently scaled and time-shifted (see section 6.6). This will, however, affect the subband filtering and introduce additional resonance frequencies. For example, if the two adjacent subband filters in Eq. (6.9) are considered, there will be three different frequencies in the combined sinc impulse response. Let the first subband (band-pass filter) be amplified by $\psi_1$ while the second subband is amplified by $\psi_2$. The combined filter is

$\psi_1 h_{BP}[n, B_{11}, B_{12}] + \psi_2 h_{BP}[n, B_{21}, B_{22}]$
$= \psi_1 h_L[n, B_{12}] - \psi_1 h_L[n, B_{11}] + \psi_2 h_L[n, B_{22}] - \psi_2 h_L[n, B_{21}]$
$= \psi_2 h_L[n, B_{22}] + (\psi_1 - \psi_2)\, h_L[n, B_{12}] - \psi_1 h_L[n, B_{11}]$   (6.10)

which implies that there will be three resonance frequencies, $B_{11}$, $B_{22}$ and $B_{12} = B_{21}$, as long as $\psi_1 \neq \psi_2$.

Similarly, if two different time shifts are applied to the subband filters according to

$h_{BP}[n - \Delta t_1, B_{11}, B_{12}] + h_{BP}[n - \Delta t_2, B_{21}, B_{22}]$
$= h_L[n - \Delta t_1, B_{12}] - h_L[n - \Delta t_1, B_{11}] + h_L[n - \Delta t_2, B_{22}] - h_L[n - \Delta t_2, B_{21}]$
$= h_L[n - \Delta t_2, B_{22}] + h_L[n - \Delta t_1, B_{12}] - h_L[n - \Delta t_2, B_{21}] - h_L[n - \Delta t_1, B_{11}]$   (6.11)

there might be three resonance frequencies, $B_{11}$, $B_{22}$ and $B_{12} = B_{21}$, if $\Delta t_1 \neq \Delta t_2$, since $h_L[n - \Delta t_1, B_{12}] \neq h_L[n - \Delta t_2, B_{21}]$ in general.

The frequency responses of the subband filters are significantly smoother with an increased amount of overlap (see Figure 5.4). The filters can consequently be described by shorter impulse responses, which implies that the time-domain leakage is reduced. However, with an increased overlap between the subband filters there will be a larger probability of signal cancellations in the reconstruction of the fullband spectra (see Eq. (5.13)) if different time (phase) shifts have been applied to the overlapped subband spectra.

The problems of the filterbank overlaps are illustrated in Figure 6.6 where a filterbank of three filters is used.


Figure 6.6: Effects of the amount of overlap in the frequency-domain filterbank. An artificial ICTD is applied to the third subband of the right channel. The spectrograms of the shifted signal in three cases of different amount of overlap between frequency filters are shown. In addition, the power spectra of the corresponding shifted signals and the non-shifted left channel reference signal are illustrated together with the filterbanks.

a) Sharp filters result in a large time-domain leakage but no signal attenuations.
b) Overlapping filters result in a small time-domain leakage but large signal attenuations.
c) Compromise of overlap where both the time-domain leakage and the signal attenuations are acceptable.

In the figure, an artificial ICTD is applied to the third subband, i.e. the subband of the highest frequencies. The right channel is shifted while the left channel is preserved as the reference. The spectrogram in Figure 6.6.a shows that when the filters are very sharp there is a large time-domain leakage at 3 kHz but no signal attenuations. Furthermore, in Figure 6.6.b, where the amount of overlap is large, there is no noticeable time-domain leakage but large signal attenuations. This can also be observed in the power spectrum, where the attenuation is largest for the frequency $f_{lim}$ at the intersection of the second and third frequency filters. Finally, Figure 6.6.c shows that both the time-domain leakage and the signal attenuations are acceptable with the compromise in the overlap.


6.3.2 Improved ICTD extraction

Stereo signals can be difficult to model, especially when the environment is noisy or when various audio components of the mixtures overlap in time and frequency, e.g. noisy speech, speech over music or simultaneous talkers. However, stereo signals made up of few sound components can also be difficult to model, especially with the use of a parametric approach, which can lead to some ambiguities in the parameter extraction as described in the following.

The conventional approach of ICTD extraction relies on the Cross-Correlation Function (CCF), where the time-lag corresponding to the maximum amplitude is selected as the ICTD. According to the definition a positive (respectively negative) ICTD means that the left (respectively right) channel is delayed by the corresponding ICTD relative to the right (respectively left) channel. The corresponding amplitude of the CCF normalized by the left and right channel powers is identified as the ICC, see section 4.2.1.

There are two identified problems related to the conventional ICTD extraction method which result in unstable ICTDs over the processed frames. The first issue is related to signal components of high tonality, e.g. voiced speech or tones from musical instruments. The second issue is related to signals with a high level of interfering components, e.g. strong reflections, background noises or secondary sources.

A state-of-the-art algorithm handling these problems is presented in [65]. The ICTD is extracted with the conventional method but smoothed in time depending on the ICC $c_{LR}$ and the tonality index $TON$, which varies from 0 to 1 where 1 indicates high tonality. The tonality index can be estimated in different ways, e.g. as described in [66]. The smoothing is performed with a forgetting factor $\alpha$ according to:

$\Delta\tau_{LR}[j,b] = \alpha \cdot \Delta\tau_{LR}[j,b] + (1-\alpha) \cdot \Delta\tau_{LR}[j-1,b]$   (6.12)

where the forgetting factor is given by:

$\alpha = \left(1 - TON[j,b]\right) \cdot c_{LR}[j,b]$   (6.13)

The smoothing makes the ICTD very approximate for signals detected as tonal and/or with a low ICC. The spatial location is therefore averaged and very slowly evolving for this type of signal content. In other words, the algorithm does not allow for a precise tracking of the ICTD when the signal characteristics evolve quickly in time.
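
A minimal MATLAB sketch of the smoothing in Eqs. (6.12)-(6.13) is given below. The per-frame ICTD, tonality and ICC traces are synthetic test data, and the initialization of the smoothed value with the first raw estimate is an assumption made for the example.

% Sketch of the state-of-the-art ICTD smoothing [65], Eqs. (6.12)-(6.13).
% ictd, ton and icc are assumed per-frame values for one subband (test data).
J    = 50;
ictd = -88 + 20*randn(J,1);          % raw ICTD estimates (samples)
ton  = 0.9*ones(J,1);                % tonality index, 1 = highly tonal
icc  = 0.8*ones(J,1);                % inter-channel coherence

smoothed = zeros(J,1);
smoothed(1) = ictd(1);               % assumed initialization
for j = 2:J
    alpha = (1 - ton(j)) * icc(j);                         % Eq. (6.13)
    smoothed(j) = alpha*ictd(j) + (1-alpha)*smoothed(j-1); % Eq. (6.12)
end
% With high tonality alpha is close to zero, so the smoothed ICTD barely
% reacts to new estimates, which is exactly the sluggishness discussed above.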

Figure 6.7 describes the problem of the solution proposed in [65] for the case of tonal components. The analyzed stereo signal is artificially made up of two consecutive glockenspiel tones at 1.6 kHz and 2 kHz with a constant time delay of 88 samples between the channels. The extracted ICTD from the global maximum of the CCF varies significantly between the frames while it should be constant. The smoothed ICTD is updated very slowly due to the high tonality of the signal but nevertheless this results in an unstable spatial image and energy losses in the frame-by-frame spatial synthesis with OLA.


Two novel algorithms that stabilize the extracted ICTD will now be presented. The first algorithm allows for a robust extraction of the ICTD from stereo signals made up of tonal components, while the second algorithm is used to select the relevant ICTDs related to the dominant source.

Figure 6.7: State-of-the-art ICTD extraction [65] for tonal components.

a) Inter-Channel Time Difference (in samples) for two consecutive glockenspiel tones at 1.6 kHz and 2 kHz with an artificially applied time-delay of -88 samples between the channels. The ICTD obtained from the global maximum of the CCF is varying between frames due to the high tonality. The smoothed ICTD is slowly (respectively quickly) updated when the tonality is high (respectively low).

b) Tonality index varying from 0 to 1.

c) Extracted Inter-Channel Coherence (ICC) used as forgetting factor in case of low tonality in the ICTD smoothing algorithm described in [65].

6.3.2.1 Stabilized ICTD extraction of tonal components

The analysis of the CCF for tonal components reveals that there is not always a clear global maximum, i.e. when the conventional method for ICTD extraction is used there is no clear ICTD. Figure 6.8 illustrates a problematic situation where the signals in the stereo channels are delayed. In this example, a voiced segment of a recorded speech signal (AB recording) is considered and two time-lags, one positive and one negative, are close in amplitude. Therefore there lies an ambiguity in the stereo analysis and both time-lags are possible ICTD values. These observations are also relevant for any kind of tonal signals, such as a musical instrument for example, and are further described in the following.

Figure 6.8: Ambiguous maximum of the CCF for tonal components.

a) Waveform of the left and right channels.
b) Cross-Correlation Function computed from the left and right channels.
c) Zoom of the CCF for time-lags between -192 and 192 samples.

The ICTD ambiguity of tonal components such as musical instruments is problematic. Several local maxima of the CCF might have similar (or very close) amplitude and can be considered as potential ICTD candidates. Figure 6.9 describes this ambiguity for an artificial stereo signal generated from a single glockenspiel tone with a constant delay of 88 samples between the stereo channels. The time-lag difference $\Delta\tau$ between the local maxima is given by the frequency of the tone, i.e. f = 1.6 kHz, according to $\Delta\tau = f_s/f = 30$, where the sampling frequency $f_s$ = 48 kHz. For this particular stereo signal, the time-lags of the possible maxima of the CCF are defined by $\Delta\tau$ and $\tau_0$ according to

$\tau_m = m \times \Delta\tau + \tau_0$   (6.14)

where

$\Delta\tau = \dfrac{f_s}{f} = 30, \qquad \tau_0 = 2, \qquad m = \{-6, \ldots, 0, \ldots, 6\}$

The time-lags have been limited to {-192,…,+192} samples due to a psycho-acoustical consideration related to the maximum acceptable ITD value, which in this case is considered to vary in the range {-4,…,+4} ms. $\tau_0$ is the minimum time-lag that maximizes the CCF. According to Figure 6.9, the artificially introduced ICTD of 88 samples between the left and right channels corresponds to the local maximum of index m = -3, which is not the actual global maximum. As a result, the ICTD obtained using the conventional extraction method is not necessarily reliable in the case of tonal components (voiced speech, musical instruments, etc.).

Figure 6.9: The global maximum identification does not match the Inter-Channel Time Difference.
a) Waveforms of the left and right channels. Glockenspiel tone at 1.6 kHz with a time delay of 88 samples between the channels.
b) Cross-Correlation Function computed from the left and right channels.
c) Zoom of the CCF for time-lags between -192 and 192 samples. The time-lag difference between the local maxima is 30 samples.
d) Zoom of the CCF for time-lags between -100 and 100 samples. $\tau_0 = 2$ is, for this particular signal, the time-lag of the global maximum of the CCF. The artificially injected ICTD corresponds to the local maximum at the time-lag $\tau = -88$ samples, which is not the global maximum.

This ambiguous ICTD might result in an unstable frame-by-frame spatial synthesis. The overlapped segments in the OLA can become misaligned and generate energy losses. Moreover, the stereo image becomes unstable due to the possible switching between opposite delays.


Thus, in order to stabilize the extraction of ICTD for stereo signals of several tonal components a novel algorithm has been developed. The algorithm consists of two sub-algorithms that both consider multiple local maxima of the CCF as ICTD candidates. The least complex algorithm determines the sign of the ICTD among one positive and one negative ICTD candidate while the second algorithm in addition considers several positive and negative ICTD candidates before the sign is determined.

The steps of the algorithms can be summarized as follows

1. The CCF $\phi_{lr}[\tau]$, which is a normalized function between -1 and 1, is defined along positive and negative time-lags $\tau$ and estimated according to

$\phi_{lr}[\tau] = \begin{cases} \phi_{lr}[\tau + 2N], & \text{for } \tau = -N, \ldots, -1 \\ \phi_{lr}[\tau], & \text{for } \tau = 0, \ldots, N-1 \end{cases}$   (6.15)

with

$\phi_{lr}[n] = \Re\!\left( IDFT\!\left( \bar{\Phi}[k] \right) \right) \quad \text{for } n = 0, \ldots, 2N-1$   (6.16)

where

$\bar{\Phi}[k] = \begin{cases} \Phi[k], & \text{for } k = 0, \ldots, N-1 \\ \Phi^{*}[2N-1-k], & \text{for } k = N, \ldots, 2N-1 \end{cases}$   (6.17)

and

$\Phi[k] = \dfrac{L[k] \cdot R^{*}[k]}{\sqrt{\dfrac{1}{N}\sum_k L[k]\,L^{*}[k] \cdot \dfrac{1}{N}\sum_k R[k]\,R^{*}[k]}}$   (6.18)

where * denotes the complex conjugate and the Inverse DFT (IDFT), according to Eq. (5.7), is applied to the extended normalized cross-spectrum $\bar{\Phi}[k]$ of the left and right channel MDFT spectra $L[k]$ and $R[k]$ of length N.
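
The MATLAB sketch below follows step 1 and computes the CCF from two subband spectra via the conjugate-symmetric extension of Eq. (6.17) and a 2N-point inverse DFT. The complex test spectra are random placeholders; in the codec the left and right MDFT subband spectra are used.

% Sketch of step 1, Eqs. (6.15)-(6.18): CCF from the left/right spectra via
% the extended normalized cross-spectrum and a 2N-point inverse DFT.
% L and R are hypothetical complex spectra of length N for one subband.
N = 1152;
L = randn(N,1) + 1i*randn(N,1);
R = randn(N,1) + 1i*randn(N,1);

Phi    = (L .* conj(R)) ./ ...
         sqrt( (sum(L.*conj(L))/N) * (sum(R.*conj(R))/N) );  % Eq. (6.18)
PhiExt = [Phi; conj(Phi(end:-1:1))];                         % Eq. (6.17), length 2N
phi    = real(ifft(PhiExt));                                 % Eq. (6.16)

tau = [-N:-1, 0:N-1]';                 % time-lag axis of Eq. (6.15)
ccf = [phi(N+1:2*N); phi(1:N)];        % re-ordered so ccf(m) is the CCF at lag tau(m)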

2. Local maxima $L_i$ are defined for both positive and negative time-lags according to

$L_i = \phi_{lr}[\tau] \quad \text{where} \quad \begin{cases} \phi_{lr}[\tau] > \phi_{lr}[\tau - 1] \\ \phi_{lr}[\tau] > \phi_{lr}[\tau + 1] \end{cases}, \qquad \tau \in [-N, \ldots, 0, \ldots, N-1]$   (6.19)

where i is a positive integer used to index the local maxima. 2N is the length of the analyzed speech/audio segment of index j. Note that frame and subband indices are omitted for clarity.

In the following, either path A or path B is selected for the two sub-algorithms respectively, i.e. steps 1, 2, 3.A, 4 or steps 1, 2, 3.B, 4, 5.B.

3.A. Two candidates C for both positive and negative time-lags are identified directly from the set of local maxima according to


$\hat{C}^{+} = \max\!\left( L_i \mid \tau_i \ge 0 \right), \quad i = 1, 2, \ldots$
$\hat{C}^{-} = \max\!\left( L_i \mid \tau_i < 0 \right), \quad i = 1, 2, \ldots$   (6.20)

where $\tau_i$ is the time-lag of the corresponding local maximum $L_i$. In Figure 6.10-3.A the local maxima and the identified positive and negative ICTD candidates are shown for block j of the glockenspiel sample used in Figure 6.9.

3.B. For all local maxima, one or several candidates $C_l$ (l is the candidate index) are identified according to the definition of the global maximum

$G = \max\!\left( L_i \right), \quad i = 1, 2, \ldots$   (6.21)

and the following distance criterion

$C_l = \left\{ L_i \mid G - L_i \le \alpha \times T \right\}, \quad i = 1, 2, \ldots$   (6.22)

where α is set to 2 but can possibly be dependent on the signal characteristics by using a tonality measure or the cross-correlation coefficient i.e. G, and T is a threshold defined further down in the algorithm.

Each identified candidate has an amplitude relatively close to G and a corresponding time-lag $\tau_l$. Two candidates are selected, one for the positive and one for the negative time-lags, according to

$\hat{\tau}^{+} = \underset{\tau_l \ge 0}{\arg\min} \left| \tau_l - \tau^{*+} \right|$
$\hat{\tau}^{-} = \underset{\tau_l < 0}{\arg\min} \left| \tau_l - \tau^{*-} \right|$   (6.23)

where the reference time-lag $\tau^{*+}$ (respectively $\tau^{*-}$) is the last extracted positive (respectively negative) ICTD. The corresponding $C_l$ are possible ICC candidates and are denoted $\hat{C}^{+}$ and $\hat{C}^{-}$. In Figure 6.10-3.B the identification of the ICTD candidates and the selection of the positive and negative ICTD candidates are shown for block j of the glockenspiel sample used in Figure 6.9. The last extracted positive and negative ICTDs used to select the positive and negative candidates are also shown.

4. The sign of the ICTD is determined differently depending on the difference between the ICC candidates.

1. If the condition $\left| \hat{C}^{+} - \hat{C}^{-} \right| \le T$ is verified, where T is set to 0.1 but can be signal dependent, for example relative to the value of G, i.e. $T = \beta \times G$, there are two possibilities:


i. If the ICLD (see the definition in section 4.2.1) is able to indicate a dominant channel, i.e. $\left| \Delta P_{LR} \right| > \gamma$, then the ICTD is set accordingly

$\Delta\tau_{LR} = \begin{cases} \hat{\tau}^{+}, & \text{if } \Delta P_{LR} < 0 \\ \hat{\tau}^{-}, & \text{if } \Delta P_{LR} > 0 \end{cases}$   (6.24)

where γ is set to a constant of 6 dB.

ii. Otherwise when the ICLD is not able to indicate a dominant channel, the ICTD candidate that is closest to the ICTD of the previous block of index j-1 is selected, i.e.

$\Delta\tau_{LR}[j] = \underset{\tau \in \{\hat{\tau}^{+}, \hat{\tau}^{-}\}}{\arg\min} \left| \Delta\tau_{LR}[j-1] - \tau \right|$   (6.25)

2. Otherwise when there is no sign ambiguity the ICTD is given by the time-lag corresponding to the maximum ICC candidate, i.e.

$\Delta\tau_{LR}[j] = \begin{cases} \hat{\tau}^{+}, & \text{if } \hat{C}^{+} > \hat{C}^{-} \\ \hat{\tau}^{-}, & \text{otherwise} \end{cases}$   (6.26)

Step 4 is illustrated in Figure 6.10-4.1, where the sign of the ICTD is determined by the ICTD of the previous block j-1, and Figure 6.10-4.2.

5.B. The reference time-lags are updated accordingly

$\begin{cases} \tau^{*+} = \hat{\tau}^{+}, & \text{if } \Delta\tau_{LR}[j] \ge 0 \\ \tau^{*-} = \hat{\tau}^{-}, & \text{otherwise} \end{cases}$   (6.27)
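
To make the candidate selection and the sign decision concrete, the MATLAB sketch below combines the simpler path A candidates of Eq. (6.20) with step 4, Eqs. (6.24)-(6.26). The synthetic CCF, the ICLD value and the previous-frame ICTD are example inputs, and the bookkeeping of the reference time-lags in steps 3.B and 5.B is omitted for brevity.

% Sketch of path A (Eq. (6.20)) followed by the sign decision of step 4,
% Eqs. (6.24)-(6.26). ccf/tau mimic the output of step 1; icld and the
% previous ICTD are assumed inputs for this example.
N    = 1152;
tau  = (-N:N-1)';
ccf  = cos(2*pi*(tau+88)/30) .* exp(-abs(tau)/800);  % synthetic tonal-like CCF
icld = 8;                                            % ICLD of the block [dB]
ictdPrev = -88;                                      % ICTD of block j-1
T = 0.1;  gamma = 6;                                 % thresholds from the text

range = abs(tau) <= 192;                             % psycho-acoustic lag limit
isMax = [false; ccf(2:end-1) > ccf(1:end-2) & ccf(2:end-1) > ccf(3:end); false];
cand  = find(isMax & range);                         % local maxima, Eq. (6.19)

pos = cand(tau(cand) >= 0);  neg = cand(tau(cand) < 0);
[Cp, ip] = max(ccf(pos));    tauPos = tau(pos(ip));  % positive candidate
[Cn, in] = max(ccf(neg));    tauNeg = tau(neg(in));  % negative candidate

if abs(Cp - Cn) <= T                       % ambiguous sign
    if abs(icld) > gamma
        if icld > 0, ictd = tauNeg; else, ictd = tauPos; end  % Eq. (6.24)
    else
        cands = [tauPos tauNeg];
        [~, sel] = min(abs(cands - ictdPrev));                % Eq. (6.25)
        ictd = cands(sel);
    end
else                                       % clear maximum, Eq. (6.26)
    if Cp > Cn, ictd = tauPos; else, ictd = tauNeg; end
end
fprintf('Selected ICTD: %d samples\n', ictd);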


Figure 6.10: The ICTD selected as one of the positive and negative ICTD candidates that are identified in the CCF. The CCF is computed from the left and right channel spectra of a stereo sample with glockenspiel tones and an artificially applied ICTD of -88 samples. The subfigure numbers correspond to the step numbers of the algorithm for improved ICTD extraction of tonal components. In both 4.1 and 4.2 the negative ICTD candidate of -88 samples is selected.

Path A has the advantage of being less complex than the algorithm described in path B. However, it does not consider previously extracted (positive and negative) ICTDs, which results in a less stable ICTD extraction. In the following, the more complex path B algorithm is selected in order to better demonstrate the benefits.

The behavior of the algorithm is illustrated with an artificial stereo signal made up of a glockenspiel tone with a constant delay of 88 samples between the stereo channels. Figure 6.11 shows the ICTD candidates derived from the algorithm. This particular analysis demonstrates that the global maximum is not related to the ICTD between the stereo channels. However, the algorithm identifies a positive and a negative ICTD candidate that are further compared to select the relevant ICTD that was originally applied to the stereo channels.


Figure 6.11: Positive and negative ICTD candidates extracted by the multiple maxima algorithm when the global maximum is not related to the ICTD.
a) Waveform of left and right channels of a stereo signal made up of a glockenspiel tone at 1.6 kHz delayed in the left channel by 88 samples.
b) CCF computed from the left and right channels. The algorithm considers multiple maxima in the range of {-192,…,192} sample time-lags, which is equivalent to ICTDs varying in the range {-4,…,4} ms at a sampling frequency of 48 kHz.
c) Zoom of the CCF for time-lags between -192 and 192 samples. One positive and one negative ICTD candidate are selected as the closest values to the last selected positive and negative ICTD.

Figure 6.12 illustrates the preservation of the localization for voiced frames in the case of a female speech signal recorded with an AB microphone setup. The ICTD extracted by the algorithm is constant over two frames even if the global maximum of the CCF has changed. The algorithm uses the ICLD (see algorithm step 4.1.i) in order to identify the dominant channel. In this particular example the left channel is dominant and therefore a positive ICTD is not acceptable.


Figure 6.12: Improved ICTD extraction based on multiple CCF maxima and the ICLD between the original channels.

A. Analyzed frame of index j.
   I. Waveform of left and right channels with an ICLD = 8 dB.
   II. CCF computed from the left and right channels.
   III. Zoom of the CCF for perceptually relevant time-lags between -4 and 4 ms, or equivalently -192 to 192 samples with a sampling frequency of 48 kHz. The positive ICTD candidate is in this case the global maximum of the CCF in the range of the relevant time-lags, but it has not been selected by the algorithm since the ICLD > 6 dB. This means that the left channel is dominant and therefore a positive ICTD is not acceptable.
B. Analyzed frame of index j+1.
   I. Waveform of left and right channels with an ICLD = 9 dB.
   II. CCF computed from the left and right channels.
   III. Zoom of the CCF for perceptually relevant time-lags between -4 and 4 ms, or equivalently -192 to 192 samples with a sampling frequency of 48 kHz. The negative ICTD candidate has been selected by the algorithm as the relevant ICTD, and in this specific case it is the global maximum of the CCF in the relevant range of time-lags.


Another ambiguity in the ICTD extraction appears when two overlapped sources with equivalent energy are analyzed within the same time-frequency tile i.e. the same block and the same frequency subband. Figure 6.13 shows the analysis of an artificial stereo signal made up of two speakers with different spatial localizations generated by applying two different ICTD. The algorithm step 4.1.ii is able to preserve the localization as long as there is an ambiguity by selecting the ICTD candidate that is closest to the previously extracted ICTD.

Figure 6.13: Ambiguous ICTD in the case of two different delays in the same analyzed segment, solved by preserving the localization.
a) Waveform of left and right channels.
b) CCF computed from the left and right channels for a double-talker speech signal with controlled ICTDs of -50 and 27 samples artificially applied to the original sources.
c) Zoom of the CCF for time-lags between -192 and 192 samples. The negative and positive ICTD candidates are identified as -50 and 26 samples. The negative ICTD is selected for the currently analyzed frame since this particular time-lag maximizes the CCF and is coherent with the ICTD extracted in the previous frame.

To show the improvement of the multiple maxima algorithm compared to the state-of-the-art algorithm Figure 6.14 illustrates the extracted ICTD for the same glockenspiel sample as depicted in Figure 6.7. The ICTD extraction is clearly improved since the ICTD now perfectly follows the artificially applied time difference between the channels. In particular the smoothing used in the state-of-the-art technique is not able to preserve the localization of the directional source when the tonality is high.


Figure 6.14: Improved ICTD extraction of tonal components. The ICTD is extracted over frames for a stereo sample of two glockenspiel tones at 1.6 kHz and 2 kHz with an artificially applied time difference of -88 samples between the channels. The new ICTD extraction algorithm considering several maxima of the CCF stabilizes the ICTD compared to the existing state-of-the-art algorithm [65].

6.3.2.2 Selection of relevant ICTD based on adaptive Inter-Channel Coherence Limit

In the presence of interfering components such as reflections, background noise or secondary sources, the extracted ICTD might not be related to the location of the dominant source, i.e. the sound source of interest, like speech for example. The extracted ICTD can then be seen as relevant only when the ICC is relatively high, as illustrated in Figure 6.15. An ICC of one indicates that the analyzed channels are coherent and the corresponding ICTD determines the delay between the correlated components. However, an ICC close to zero means that the analyzed channels comprise different sound components and that the ICTD is not perceptually important for the localization. With an ICC between zero and one the question is whether the ICTD is relevant or not, as illustrated in Figure 6.15.


Figure 6.15: Relevancy of the ICTD depending on the ICC level [50].

Based on this idea an algorithm evaluating the relevancy of the ICTD cue based on the level of ICC has been defined. Rather than comparing the amount of correlation (ICC) against a fixed threshold, the algorithm introduces an adaptation of the ICC limitation according to the evolution of the signal characteristics.

An Adaptive ICC Limitation (AICCL) is computed over analyzed blocks of index j by using an adaptive filtering of the ICC defined as

$AICC[j] = \alpha \times ICC[j] + (1 - \alpha) \times AICC[j-1]$   (6.28)

where the forgetting factor $\alpha$ has been set to 0.08.

The AICCL is then further limited and compensated by a constant value β set to 0.1 according to

$AICCL[j] = \max\!\left( AICCL_0,\; AICC[j] - \beta \right)$   (6.29)

The constant compensation allows for a variable degree of selectivity of the ICTD according to

$ICTD[j] = \begin{cases} ICTD[j], & \text{if } ICC[j] \ge AICCL[j] \\ ICTD[j-1], & \text{if } ICC[j] < AICCL[j] \end{cases}$   (6.30)

The additional limitation AICCL0 restricts the selection of relevant ICTDs to correlated components with an ICC above this lower bound. AICCL0 can be fixed or estimated according to the knowledge of the acoustical environment, e.g. a theater with applause, office with background noise, etc. Without additional knowledge on the level of noise or more generally speaking on the characteristics of the acoustical environment, a suitable value of AICCL0 has been fixed to 0.75.
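
A compact MATLAB sketch of Eqs. (6.28)-(6.30) is given below. The per-frame ICC and raw ICTD traces are synthetic, the initialization of the recursion with the first-frame values is an assumption made for the example, and the constants follow the values given in the text.

% Sketch of the adaptive ICC limitation, Eqs. (6.28)-(6.30).
J     = 250;
icc   = max(0, min(1, 0.95 - 0.003*(1:J)' + 0.05*randn(J,1)));  % decreasing ICC
ictd  = 48*sin(2*pi*(1:J)'/J) + 5*randn(J,1);                   % raw ICTD trace
alpha = 0.08;  beta = 0.1;  AICCL0 = 0.75;

aicc    = zeros(J,1);  aicc(1) = icc(1);       % assumed initialization
selIctd = zeros(J,1);  selIctd(1) = ictd(1);
for j = 2:J
    aicc(j) = alpha*icc(j) + (1-alpha)*aicc(j-1);     % Eq. (6.28)
    aiccl   = max(AICCL0, aicc(j) - beta);            % Eq. (6.29)
    if icc(j) >= aiccl                                % Eq. (6.30)
        selIctd(j) = ictd(j);                         % relevant: update
    else
        selIctd(j) = selIctd(j-1);                    % irrelevant: hold previous
    end
end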


In order to illustrate the behavior of the algorithm, an artificial stereo signal made up of a mixture of speech and recorded fan noise has been generated with a fully controlled ICTD of the speech source. Figure 6.16 shows the benefit of using the AICCL (green curve of Figure 6.16.c), which allows the extraction of a stabilized ICTD (red curve of Figure 6.16.d) even when the acoustical environment is critical, i.e. with a high level of noise in the stereo mixture. The ICTD selected according to the AICCL is coherent with the original (true) ICTD. In conclusion, the algorithm is able to stabilize the position of the sources over time with an update of the ICTD only when the extracted ICTD is perceptually relevant.

Figure 6.16: Adaptive ICC for improved ICTD extraction.

a) Synthetic stereo signal made up of the sum of a speech signal and stereo fan noise with a progressively decreasing SNR.

b) The speech signal is artificially delayed in one of the stereo channels according to a sine function to approximate an ICTD varying from 1 to -1 ms (fs = 48 kHz).

c) The extracted ICC is progressively decreasing and also switching from low to high values. The green line represents the Adaptive ICC Limitation.

d) Superposition of the originally extracted ICTD as well as the perceptually relevant ICTD extracted from coherent components.


6.4 Signal alignment of stereo channels before downmixing

For both parametric stereo models the downmixing of the stereo channels into a mono channel has been enhanced with signal alignment as suggested in [33] (see section 4.1.2). The stereo channels are, however, not aligned in the time-domain but in the MDFT domain subbands. The alignment suppresses the comb-filtering effect, i.e. the signal cancellations that occur in the sum of two non-aligned signals at the frequencies where the delay corresponds to a phase difference of 180 degrees. This is illustrated in Figure 6.17 where a simple averaging of the stereo channels is performed. The spectrograms and power spectra show that the very strong attenuations are avoided by the alignment.

Figure 6.17: Spectral effect of the alignment (time delay compensation) before downmixing the stereo channels into a mono channel.

a) Spectrogram of the downmix of incoherent stereo channels, the comb-filtering effect can be observed as horizontal lines.

b) Spectrogram of the aligned downmix, i.e. sum of the aligned/coherent stereo channels.

c) Power spectrum of both downmix signals. There is a large comb-filtering in case the channels are not aligned which is equivalent to energy losses in the mono downmix.


In addition, signal components that comprise a clear ICTD are not well represented in the downmix signal if the left and right channels are not aligned. This problem is illustrated in Figure 6.18, where especially transients with non-zero ICTDs appear duplicated in the downmixed signal. With this type of downmix signal it is not possible to reconstruct the left and right channels in the decoder. Fortunately, the transients can be significantly better represented by the downmix signal if the stereo channels are aligned prior to the downmixing, as is shown in Figure 6.18.b.

Figure 6.18: Temporal effect of the alignment (time delay compensation) before downmixing the stereo channels into a mono channel.

a) Waveform of an equalized downmix signal where double transients can be observed due to non-aligned signal components.

b) Waveform of an equalized downmix signal where transients are better represented due to signal alignment prior to the downmixing.

The extracted and selected ICTD of block j and subband b, as described in section 6.3.2, is used to shift the right channel while the left channel is used as a reference. The time-shifts are performed in the MDFT domain according to

$L^{a}_{j,b}[k] = L_{j,b}[k]$
$R^{a}_{j,b}[k] = R_{j,b}[k] \times e^{-i 2\pi \frac{\left(k + \frac{1}{2}\right)}{N} \Delta\tau_{LR}[j,b]}, \qquad k = 0, \ldots, N-1$   (6.31)

where the difference to DFT domain time-shifts is the ½ bin frequency shift.
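
The alignment of Eq. (6.31) amounts to a phase ramp with a half-bin offset, as sketched below in MATLAB; the spectra and the ICTD value are placeholders for one subband.

% Sketch of Eq. (6.31): time alignment of the right channel in the MDFT
% domain using the extracted ICTD. L, R and the ICTD are example inputs.
N    = 1152;
k    = (0:N-1)';
L    = randn(N,1) + 1i*randn(N,1);   % left subband spectrum (reference)
R    = randn(N,1) + 1i*randn(N,1);   % right subband spectrum
ictd = -88;                          % extracted ICTD for this subband [samples]

La = L;                                          % left channel is kept unchanged
Ra = R .* exp(-1i*2*pi*(k + 0.5)/N * ictd);      % Eq. (6.31), 1/2-bin shifted ramp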

6.5 MDFT domain decorrelation

To model the diffuse sounds that are characterized by a low correlation between the stereo channels [15], a constrained Time-Shifting (TS) FIR filter is used to decorrelate the transmitted mono spectra. Different time-shifts are applied to the frequency subbands, with constraints to decrease the amount of signal attenuation in the overlapped frequency regions. This is illustrated in Figure 6.19, where the frequency and phase responses of the combined filters of the filterbank and the decorrelation filter are shown. Due to the time-shift differences between the subbands there are comb-filtering effects, which are largest at the intersections of the subband filters, i.e. at $f_{lim}$ shown in Figure 6.6. With the constrained time-shifts there is no phase cancellation at $f_{lim}$ and the comb-filtering is reduced, especially at the lower frequencies. However, at the higher frequencies the comb-filtering effects can still be observed for frequencies close to the intersection frequency $f_{lim}$ due to the large overlap of the filterbank filters.

Figure 6.19: Effect of constrained time-shifts in the decorrelation filter. The frequency and phase response of the combined filtering by the filterbank and decorrelation filter. The decorrelation is performed in 40 subbands following the ERB scale.

The impulse responses for each subband are constant over blocks of index j and given by

$h_b[n] = \delta\!\left[ n - \Delta t[b] \right]$   (6.32)

where $\delta[n]$ is the Kronecker delta function.

The time-shifts $\Delta t[b]$ are obtained by

$\Delta t[b] = \dfrac{\Delta t_{min} + \Delta t_{max}}{2}, \qquad \text{for } b = 1$

$\Delta t[b] = \begin{cases} \Delta t[b-1] + c_{max}[b] \cdot T_{lim}[b], & \text{if } c_{max}[b] \ge c_{min}[b] \\ \Delta t[b-1] - c_{min}[b] \cdot T_{lim}[b], & \text{otherwise} \end{cases}, \qquad \text{for } b = 2, \ldots, B$   (6.33)


where B is the number of subbands used for the decorrelation and the minimum and maximum change factors are defined as

$c_{min}[b] = \dfrac{\Delta t[b-1] - \Delta t_{min}}{T_{lim}[b]}$
$c_{max}[b] = \dfrac{\Delta t_{max} - \Delta t[b-1]}{T_{lim}[b]}$   (6.34)

The time-shifts are constrained to the interval of 6 ms to 8 ms, i.e.

$\Delta t_{min} = 6\ \text{ms}, \qquad \Delta t_{max} = 8\ \text{ms}$   (6.35)

and $T_{lim}$ is the period corresponding to the frequency at the intersection between consecutive subband filters, i.e.

$T_{lim}[b] = \dfrac{1}{f_{lim}[b]}$   (6.36)

The lower bound of the time-shifts $\Delta t_{min}$ ensures the decorrelation effect, at least for non-stationary signals. The variable time-shifts over the frequency subbands enhance this decorrelation effect, while the upper bound of the time-shifts $\Delta t_{max}$ restricts the filter lengths in order to suppress framework related artefacts, e.g. wrap-around effects.

The filtering is performed in the MDFT domain, where the DFT-transformed subband impulse responses $h_b[n]$ are multiplied with the decoded mono spectrum $\hat{M}_{j,b}[k]$ according to

$D_{j,b}[k] = \hat{M}_{j,b}[k] \cdot DFT\!\left( h_b[n]\, e^{-i\pi n / (2N)} \right)$   (6.37)

where the DFT domain filters are transformed to the MDFT domain by the phase shift of ½ frequency bin.
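
The MATLAB sketch below illustrates one possible realization of the constrained time-shifts of Eqs. (6.33)-(6.36) and their application in the spectral domain. The band layout, the intersection frequencies and the mono spectrum are toy values, and the change factors are floored to whole numbers of periods $T_{lim}$; that rounding is an assumption made here so that adjacent subbands differ by an integer number of periods of $f_{lim}$, which gives the phase continuity at the subband intersections described above. The delay is applied as a phase ramp with the same half-bin convention as Eq. (6.31) rather than via an explicit DFT of the impulse response.

% Sketch of the constrained time-shift decorrelator, Eqs. (6.32)-(6.37).
% Band layout, intersection frequencies and the mono spectrum are toy data.
fs = 48000;  N = 1152;  B = 40;
fint  = logspace(log10(300), log10(20000), B);   % toy intersection frequencies [Hz]
Tlim  = fs ./ fint;                              % Eq. (6.36), expressed in samples
dtMin = 6e-3 * fs;                               % Eq. (6.35): 6 ms in samples
dtMax = 8e-3 * fs;                               % Eq. (6.35): 8 ms in samples

dt = zeros(1, B);
dt(1) = (dtMin + dtMax) / 2;                     % Eq. (6.33), first subband
for b = 2:B
    % Assumed integer change factors (floored), see the remark above.
    cMin = floor((dt(b-1) - dtMin) / Tlim(b));   % cf. Eq. (6.34)
    cMax = floor((dtMax - dt(b-1)) / Tlim(b));   % cf. Eq. (6.34)
    if cMax >= cMin
        dt(b) = dt(b-1) + cMax * Tlim(b);        % Eq. (6.33)
    else
        dt(b) = dt(b-1) - cMin * Tlim(b);
    end
end

% Apply each subband delay to the decoded mono spectrum as a phase ramp,
% using rectangular subbands here instead of the overlapping filterbank.
Mhat  = randn(N,1) + 1i*randn(N,1);              % decoded mono spectrum (toy)
k     = (0:N-1)';
edges = round(linspace(1, N+1, B+1));
D     = zeros(N,1);
for b = 1:B
    kb = edges(b):edges(b+1)-1;
    D(kb) = Mhat(kb) .* exp(-1i*2*pi*(k(kb) + 0.5)/N * dt(b));
end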

The constrained TS filter has shown some advantages over other types of decorrelation filters. Compared to Random Phase (RP) filters (see section 4.3.2) the amount of signal coloration is low, despite the comb-filtering effects. In fact, informal listening showed that the comb-filtering is not very critical since it is located at the higher frequencies. In the subjective comparison of the filters the TS filtered signal was able to reproduce a clean spatial width when played together with the mono signal. The RP filter on the other hand resulted in a more distorted and unnatural spatial widening.


The TS filter has the advantage of a high amount of decorrelation over the full frequency spectrum, in contrast to a Non-Linear Phase (NLP) filter where the time-shifts are increasing or decreasing with frequency (see section 4.3.2). For an NLP filter, the cross-correlation coefficient in certain subbands might be unnecessarily high, i.e. a low decorrelation effect is obtained for the frequencies of the specific subband. This not only has an impact on the reproduction of the spatial width in the stereo signals. A high amount of correlation might also result in signal cancellations in the stereo synthesis if the mono and the decorrelated mono spectra are subtracted, as they are in the defined stereo models (see section 6.6).

In addition to the subjective observations, the amount of decorrelation has been measured objectively. In Figure 6.20 the average cross-correlation coefficients between the mono and the decorrelated mono channels are presented. The coefficients are averaged over the processed frames of ten stereo samples with different characteristics (MUSHRA database, see Appendix E). The TS filter is compared to the NLP and RP filters defined by impulse responses of 300 taps. As can be seen in the figure, the results are similar for all the filters. However, the amount of decorrelation is on average slightly higher for the TS filter, with a mean cross-correlation coefficient of 0.26 for the presented samples.

Figure 6.20: Averaged cross-correlation coefficient between the mono and the decorrelated mono channels for stereo samples of different characteristics used for evaluation of the subjective stereo quality, see section 8.2. The TS, NLP and RP filters show similar results for the objective measure of decorrelation.

6.6 Stereo synthesis

In the stereo decoder the coded mono and the decorrelated mono spectra are used together with the quantized stereo parameters to regenerate the stereo channels. The optimized filterbank resolutions imply that the stereo synthesis requires a mapping of the parameters between the different resolutions.



6.6.1 Upmixing in model 1

In model 1 the stereo channels are synthesized according to the upmix matrix E_a that is used in PS, as described in section 4.3.1.2. The ICLD is synthesized bin by bin while the ICC and ICTD are applied in their respective filterbank resolutions. The scale factors λ1 and λ2, which are defined in section 4.3.1.2, are weighted by the subband filters and summed according to

$$\Lambda_1[k] = \sum_{b_{ICLD}=1}^{B_{ICLD}} \lambda_1[j, b_{ICLD}]\, F^{ICLD}_{b_{ICLD}}[k], \qquad \Lambda_2[k] = \sum_{b_{ICLD}=1}^{B_{ICLD}} \lambda_2[j, b_{ICLD}]\, F^{ICLD}_{b_{ICLD}}[k], \qquad k = 0,\ldots,N-1 \qquad (6.38)$$

where the subband index b_ICLD denotes the subbands in the ICLD filterbank F^{ICLD}_{b_ICLD}[k] and B_ICLD is the number of ICLD subbands.

The ICLD and ICC are applied to estimate the aligned left and right channels by

$$\begin{bmatrix} \hat{L}_a^{j,b_{ICC}}[k] \\[2pt] \hat{R}_a^{j,b_{ICC}}[k] \end{bmatrix} = \begin{bmatrix} \Lambda_1[k] & 0 \\ 0 & \Lambda_2[k] \end{bmatrix} \begin{bmatrix} \cos(\alpha_{j,b_{ICC}} + \beta_{j,b_{ICC}}) & \sin(\alpha_{j,b_{ICC}} + \beta_{j,b_{ICC}}) \\ \cos(-\alpha_{j,b_{ICC}} + \beta_{j,b_{ICC}}) & \sin(-\alpha_{j,b_{ICC}} + \beta_{j,b_{ICC}}) \end{bmatrix} \begin{bmatrix} \hat{M}_{j,b_{ICC}}[k] \\[2pt] D_{j,b_{ICC}}[k] \end{bmatrix} \qquad (6.39)$$

where the rotational angles α and β are defined as described in section 4.3.1.2. L̂_a^{j,b_ICC}[k] denotes the estimated aligned left channel spectrum filtered into subbands of the ICC resolution, i.e.

$$\hat{L}_a^{j,b_{ICC}}[k] = F^{ICC}_{b_{ICC}}[k] \cdot \hat{L}_a^{j}[k], \qquad b_{ICC} = 1,\ldots,B_{ICC} \qquad (6.40)$$

where B_ICC is the number of ICC subbands. The same notation applies to other spectra and filterbank resolutions.

The ICLD used to determine the rotational angle β is estimated for each subband in the filterbank resolution of the ICC. The ICLD of the middle frequency coefficient in each ICC subband is used as the ICLD for all the coefficients in the subband, which can be expressed as:

$$\Delta P^{ICC}_{LR}[j, b_{ICC}] = \Delta P_{LR}\bigl[j, b^{\ast}\bigr], \qquad b^{\ast} = \underset{b_{ICLD}=1,\ldots,B_{ICLD}}{\arg\max}\; F^{ICLD}_{b_{ICLD}}\!\left[\frac{k^{\mathrm{start}}_{b_{ICC}} + k^{\mathrm{end}}_{b_{ICC}}}{2}\right] \qquad (6.41)$$

where the subband index b_ICC (respectively b_ICLD) denotes subbands in the ICC (respectively ICLD) filterbank. k^start_{b_ICC} and k^end_{b_ICC} are the first and last non-zero frequency coefficients in the subband b_ICC of the ICC filterbank, i.e.


$$F^{ICC}_{b_{ICC}}[k] = 0 \quad \text{if } k < k^{\mathrm{start}}_{b_{ICC}} \text{ or } k > k^{\mathrm{end}}_{b_{ICC}}, \qquad k = 0,\ldots,N-1 \qquad (6.42)$$
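A minimal MATLAB sketch of this parameter-resolution mapping (Eqs. 6.41-6.42) is given below: for each ICC subband, the ICLD value of the ICLD subband whose filter is largest at the middle non-zero coefficient of the ICC subband is selected. The filterbanks, the ICLD values and the rounding of the middle coefficient are placeholder assumptions for illustration.

% Sketch of the mapping of Eqs. (6.41)-(6.42): pick, for each ICC subband, the
% ICLD value of the ICLD subband whose filter is largest at the middle non-zero
% coefficient of the ICC subband. F_ICC, F_ICLD and icld are dummy inputs.
N      = 960;
B_ICC  = 10;  B_ICLD = 16;
F_ICC  = rand(N, B_ICC);   F_ICLD = rand(N, B_ICLD);   % subband filters (dummy)
icld   = randn(1, B_ICLD);                             % ICLD per ICLD subband (dummy)

icld_on_icc = zeros(1, B_ICC);
for b = 1:B_ICC
    nz     = find(F_ICC(:,b) > 0);                     % non-zero coefficients, Eq. (6.42)
    k_mid  = round((nz(1) + nz(end)) / 2);             % middle coefficient (rounded here)
    [~, bStar] = max(F_ICLD(k_mid, :));                % Eq. (6.41): argmax over ICLD bands
    icld_on_icc(b) = icld(bStar);
end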

Subsequently the ICTD is applied to the ICLD and ICC synthesized right channel filtered by the ICTD filterbank according to

$$\begin{bmatrix} \hat{L}_{j,b_{ICTD}}[k] \\[2pt] \hat{R}_{j,b_{ICTD}}[k] \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & e^{\,i 2\pi \left(k + \frac{1}{2}\right) \Delta\tau_{LR}[j,\, b_{ICTD}] / N} \end{bmatrix} \begin{bmatrix} \hat{L}_a^{j,b_{ICTD}}[k] \\[2pt] \hat{R}_a^{j,b_{ICTD}}[k] \end{bmatrix} \qquad (6.43)$$

using the DFT shift theorem with a compensation of half a frequency bin due to the frequency shift of the MDFT basis functions (see section 5.2). L̂_a^{j,b_ICTD}[k] and R̂_a^{j,b_ICTD}[k] are the L̂_a^{j,b_ICC}[k] and R̂_a^{j,b_ICC}[k] subband spectra filtered by the ICTD filterbank instead of the ICC filterbank, as denoted according to Eq. (6.40).
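A compact MATLAB sketch of the model 1 upmix of Eqs. (6.39) and (6.43) for one block and one subband is shown below. The per-bin scale factors and the angles α and β are assumed to be available from the ICC/ICLD processing of section 4.3.1.2; all inputs are placeholders.

% Sketch of the model 1 upmix (Eqs. 6.39 and 6.43) for one subband of one block.
% M_hat, D, Lambda1, Lambda2, alpha, beta and tau are placeholder inputs.
N       = 960;
k       = (0:N-1).';
M_hat   = randn(N,1) + 1i*randn(N,1);     % decoded mono MDFT spectrum (dummy)
D       = randn(N,1) + 1i*randn(N,1);     % decorrelated mono spectrum (dummy)
Lambda1 = rand(N,1);  Lambda2 = rand(N,1);% per-bin ICLD scale factors, Eq. (6.38)
alpha   = 0.3;  beta = 0.1;               % ICC/ICLD rotation angles (section 4.3.1.2)
tau     = -12;                            % ICTD in samples for this subband (dummy)

% Eq. (6.39): ICLD and ICC synthesis of the aligned channels
L_a = Lambda1 .* ( cos(alpha + beta)  .* M_hat + sin(alpha + beta)  .* D );
R_a = Lambda2 .* ( cos(-alpha + beta) .* M_hat + sin(-alpha + beta) .* D );

% Eq. (6.43): ICTD synthesis on the right channel with half-bin compensation
L_hat = L_a;
R_hat = R_a .* exp(1i*2*pi*(k + 0.5)*tau/N);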

6.6.2 Upmixing in model 2

In model 2 the stereo channels are synthesized by the inverse KLT of the decoded mono and the estimated side channel spectra, as described in section 6.2 and section 4.1.3. The ICTD is subsequently applied as a linear phase-shift to the right channel of the inverse-KLT transformed subband spectra. The side channel spectra are estimated by scaling of the decorrelated mono spectra D_{j,b_SMLD}[k] in the filterbank resolution of the SMLD according to

$$\hat{S}_{j,b_{SMLD}}[k] = 10^{\Delta P_{SM}[j,\,b_{SMLD}]/20}\, \sqrt{\frac{P_M[j,\,b_{SMLD}]}{P_D[j,\,b_{SMLD}]}}\; D_{j,b_{SMLD}}[k] \qquad (6.44)$$

The aligned stereo channels are estimated by the inverse KLT of the mono and scaled decorrelated mono spectra in the frequency resolution of the KLT, i.e.

$$\begin{bmatrix} \hat{L}_a^{j,b_{KLT}}[k] \\[2pt] \hat{R}_a^{j,b_{KLT}}[k] \end{bmatrix} = \begin{bmatrix} \cos\sigma[j,b_{KLT}] & -\sin\sigma[j,b_{KLT}] \\ \sin\sigma[j,b_{KLT}] & \cos\sigma[j,b_{KLT}] \end{bmatrix} \begin{bmatrix} \hat{M}_{j,b_{KLT}}[k] \\[2pt] \hat{S}_{j,b_{KLT}}[k] \end{bmatrix} \qquad (6.45)$$

where σ[j,b_KLT] is the transmitted rotational angle of the KLT downmix and Ŝ_{j,b_KLT}[k] is the Ŝ_{j,b_SMLD}[k] subband spectrum filtered with the KLT filterbank instead of the SMLD filterbank.

Finally, the ICTD is applied to the right channel of the estimated aligned stereo channels as in model 1 according to Eq. (6.43).
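A corresponding MATLAB sketch of the model 2 synthesis (Eqs. 6.44-6.45, followed by the ICTD phase of Eq. 6.43) is given below; the spectra, the subband powers, the SMLD, the KLT angle and the ICTD are placeholder inputs, not values from the actual implementation.

% Sketch of the model 2 upmix (Eqs. 6.44, 6.45 and 6.43) for one subband.
% All inputs are placeholders for illustration.
N       = 960;
k       = (0:N-1).';
M_hat   = randn(N,1) + 1i*randn(N,1);      % decoded mono (mid) MDFT spectrum (dummy)
D       = randn(N,1) + 1i*randn(N,1);      % decorrelated mono spectrum (dummy)
smld_dB = -12;                             % SMLD for this subband [dB] (dummy)
P_M     = mean(abs(M_hat).^2);             % mono power in the subband
P_D     = mean(abs(D).^2);                 % decorrelated mono power in the subband
sigma   = 38.4611*pi/180;                  % dequantized KLT angle [rad] (dummy)
tau     = 7;                               % ICTD in samples (dummy)

% Eq. (6.44): side channel estimate from the scaled decorrelated mono spectrum
S_hat = 10^(smld_dB/20) * sqrt(P_M / P_D) * D;

% Eq. (6.45): inverse KLT of mid and estimated side channels
L_a = cos(sigma) * M_hat - sin(sigma) * S_hat;
R_a = sin(sigma) * M_hat + cos(sigma) * S_hat;

% Eq. (6.43): ICTD applied to the right channel with half-bin compensation
L_hat = L_a;
R_hat = R_a .* exp(1i*2*pi*(k + 0.5)*tau/N);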

6.7 Conclusion

A stereo architecture for the ITU-T G.719 mono codec has been developed based on the combination of Binaural Cue Coding (BCC) and Parametric Stereo (PS) with additional time-alignment of the stereo signals. Two different parametric stereo models working in the complex-valued MDFT domain have been implemented in MATLAB, interfacing with the ITU-T G.719 codec implemented in ANSI-C.


The major differences between the defined stereo models lie in the parametric descriptions of the stereo image and the methods used for downmixing/upmixing of the stereo channels. The first model uses the equalized downmix used in BCC together with pre-weighting from MPEG USAC. The stereo image is parameterized by use of the Inter-Channel Level Difference (ICLD), the Inter-Channel Time Difference (ICTD) and the Inter-Channel Coherence (ICC). In the second model the stereo channels are transformed into optimized mid and side channels using a constrained Karhunen-Loève Transform (KLT). The mid channel is encoded and transmitted to the decoder while the side channel is parameterized by the Side-Mid Level Difference (SMLD). The ICTD is used in combination with the rotational angle σ of the KLT and the SMLD to approximate the original stereo signals in the decoder.

Two new algorithms for improved ICTD extraction, for tonal components and for stereo channels with low correlation, have been developed. The extracted ICTD is used both to describe the stereo image and to improve the downmixing. The stereo channels are analysed and synthesized in frequency subbands using optimized filterbank resolutions for each parameter. The spatial synthesis is performed with a mapping of the parameter values as a consequence of the different frequency resolutions for the stereo parameters. Furthermore, a constrained time-shifting decorrelation filter has been defined to estimate uncorrelated components in the stereo channels.

Both models use three stereo parameters with an equal total number of subbands. The models should therefore be comparable, but since they describe the stereo channels differently there might be a difference in performance, both in terms of stereo bitrate and subjective quality. These aspects are the topics of the following chapters, where the stereo parameters are coded and the performance of the defined G.719 stereo architecture is evaluated for both stereo models.


7 Quantization and coding of the stereo parameters

The stereo parameters that are used to synthesize the stereo channels in the decoder are quantized and encoded in order to provide an efficient stereo coding solution for data-rate constrained applications. The quantization errors are distributed according to perceptual criteria so as not to introduce audible errors. In the following sections 7.1 and 7.2 the quantization and the coding of the stereo parameters for the two stereo models (introduced in sections 6.1 and 6.2 respectively) are presented. The final stereo bitrate depends strongly on the signal characteristics and therefore estimations of the average stereo bitrate are presented in section 7.3.

7.1 Non-uniform quantization

The stereo parameters in each frame and subband are independently quantized with precisions that are different for each parameter. The precision and the dynamic range of each parameter are determined by taking the psychoacoustical properties of the human auditory system into consideration. The main property is the Just Noticeable Differences (JNDs) that determine the smallest perceivable changes in the interaural cues (see section 3.1.1).

In the first G.719 stereo model the ICLD and ICC are defined exactly as in PS (see section 4.3.1.1), which implies that the corresponding quantization tables (see section 4.3.3) are representative. The ICLD is quantized according to Eq. (4.62) and the ICC according to Eq. (4.66), where the quantized values are represented by the index of the closest table value according to

$$\rho_Q[j,b] = \arg\min_q \bigl|\, \rho[j,b] - \mathrm{P}_q \,\bigr| \qquad (7.1)$$

where ρ denotes the parameter, i.e. the ICLD or ICC in this case, and P is the corresponding quantization table.

The JND for the ITD does not depend on the stimulus level but it tends to increase for high reference ITD values [15]. The dependence on frequency is however high, and for frequencies below 1 kHz the sensitivity can be described by a constant phase difference (see section 3.1.1). Despite the frequency dependency, the ICTD is quantized equally over all frequencies. The reason for this is that the coding gains from larger quantization errors were insignificant due to the entropy coding (see section 7.2). The effects of the quantization of the ICTD were evaluated with headphones and resulted in a non-uniform quantization table of 63 levels as follows

ICTD_q = [-192, -175, -160, -145, -130, -115, -100, -90, -82, -75, -69, -63, -57, -51, -48, -45, -42, -39, -36, -33, -30, -27, -24, -21, -18, -15, -12, -9, -6, -3, -1, 0, 1, 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45, 48, 51, 57, 63, 69, 75, 82, 90, 100, 115, 130, 145, 160, 175, 192]   (7.2)

where q = 0,…,62.


The dynamic range from -4 ms to +4 ms (-192 to +192 samples) is motivated by the fact that the width of the auditory objects is affected by ITDs larger than 1 ms, which implies that there would be audible quantization errors if the ICTD were limited to 1 ms as suggested by psychoacoustics (see section 3.1). The ICTD quantization indices are obtained according to Eq. (7.1), i.e. as the index of the closest ICTD table value.
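A minimal MATLAB sketch of this nearest-value quantization (Eq. 7.1) using the ICTD table of Eq. (7.2) is given below; the measured ICTD value is a placeholder.

% Sketch of Eq. (7.1): quantize an ICTD (in samples) to the closest entry of the
% table in Eq. (7.2). The measured ICTD is a dummy value.
ictd_table = [-192 -175 -160 -145 -130 -115 -100 -90 -82 -75 -69 -63 -57 -51 ...
              -48 -45 -42 -39 -36 -33 -30 -27 -24 -21 -18 -15 -12 -9 -6 -3 -1 ...
                0  1  3  6  9  12 15 18 21 24 27 30 33 36 39 42 45 48 51 57 ...
               63 69 75 82 90 100 115 130 145 160 175 192];

ictd_meas = -23.4;                                   % measured ICTD [samples] (dummy)
[~, qIdx] = min(abs(ictd_meas - ictd_table));        % Eq. (7.1): closest table value
q         = qIdx - 1;                                % 0-based index as in Eq. (7.2)
ictd_q    = ictd_table(qIdx);                        % dequantized value (-24 here)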

For the KLT angle that is used in the second stereo model for G.719, the quantization steps are defined according to the ILD. For coherent stereo channels there is a direct relation between the KLT rotational angle and the ICLD. The relation can be observed from geometry and can be derived from the upmix matrices of the defined stereo models (see section 6.6, Eq. (6.39) and Eq. (6.45)), which gives

$$\sigma[j,b] = \tan^{-1}\!\left(\frac{1}{10^{\Delta P'_{LR}[j,b]/20}}\right) \qquad (7.3)$$

Consequently it is possible to use the quantization table for the ICLD to define a quantization table for the KLT. The translation is obtained by

$$\sigma_q = \tan^{-1}\!\left(\frac{1}{10^{\mathrm{ICLD}_q/20}}\right) \qquad (7.4)$$

which results in a quantization table of 31 levels according to

σ_q = [0.1812, 0.3222, 0.5729, 1.0188, 1.8112, 3.2186, 4.5416, 6.4019, 9.0059, 12.6189, 17.5484, 21.7079, 26.6194, 32.2502, 38.4611, 45.0000, 51.5389, 57.7498, 63.3806, 68.2921, 72.4516, 77.3811, 80.9941, 83.5981, 85.4584, 86.7814, 88.1888, 88.9812, 89.4271, 89.6778, 89.8188]   (7.5)

with q = 0,…,30, where the angles are given in degrees. An angle of 0 degrees denotes a source in the left channel and an angle of 90 degrees corresponds to a source in the right channel. The precision is very high at the extremities of the table, i.e. for extremely panned sources, while the precision is significantly lower when the channels are equally powered, i.e. for an angle of 45 degrees.

The sensitivity to the rotational angle has been evaluated with headphones by listening to white Gaussian noise panned between the left and the right channels using a controlled and variable rotational angle. The listening experiments showed that the auditory system was not very sensitive to small movements of the sound source when it was panned around 0 degrees, while such movements could be very well perceived when the source was panned at either +45 or -45 degrees. The same effects can be observed for ICLD differences between the stereo channels. The quantization indices of the KLT rotational angle are, as for the other stereo parameters, determined by the closest table value according to Eq. (7.1).
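The translation of Eq. (7.4) can be verified with a few lines of MATLAB. The ICLD table below is assumed to be the 31-level PS table of section 4.3.3 (spanning ±50 dB); since that table is not reproduced in this chapter, it is stated here as an assumption.

% Sketch of Eq. (7.4): derive the KLT angle table from an ICLD table.
% The ICLD table is assumed to be the 31-level PS table (section 4.3.3).
icld_table = [-50 -45 -40 -35 -30 -25 -22 -19 -16 -13 -10 -8 -6 -4 -2 ...
                0   2   4   6   8  10  13  16  19  22  25 30 35 40 45 50];

sigma_table = atan( 1 ./ 10.^(icld_table/20) ) * 180/pi;   % Eq. (7.4), in degrees
% sigma_table reproduces the values of Eq. (7.5), from 89.8188 down to 0.1812
% degrees for this (ascending) ICLD ordering.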


The last stereo parameter used in the G.719 stereo models is the SMLD, which describes the power of the side channel in relation to the mid channel. The dynamic range has been determined from the distribution of SMLDs (see Appendix C) extracted from a training database of 26 stereo samples of different characteristics presented in Appendix E. Based on listening experiments the SMLD quantization table is defined as

SMLD_q = [-50, -40, -30, -20, -10, -5, -1, 3]   (7.6)

where q = 0,…,7.

It can be noticed that the power of the side channel can be larger than the power of the mid channel, which is related to the constrained rotational angle (see section 6.2).

7.2 Coding of the quantized parameters

The average length of the codewords used to represent the quantized stereo parameters has been decreased by Huffman coding [3]. This means that the bitrate becomes variable since the length of the codewords depends on the probability of occurrence. The most probable indices are represented with the shortest codewords. Tables of the Huffman codes have been determined from distributions of the corresponding stereo parameters (see Appendix C) using a training database of 26 stereo samples of different characteristics, which is presented in Appendix E.

The bitrate is lowered even further by differential coding of the quantization indices, which aims to reduce the dynamics of the quantization indices as much as possible. However, it is not certain that differential coding is more efficient than a direct encoding of the quantization indices. In addition, the indices can be differentially coded in frequency (between the subbands) or in time (between the signal blocks). For the frequency-differential coding, the index of the first subband is coded directly while the difference to the previous subband is coded for the other subbands. The parameter ρ is therefore coded according to

$$c_{\rho}[j,b] = \begin{cases} \mathrm{huffman}\bigl(\rho_Q[j,b]\bigr) & \text{for } b = 1 \\ \mathrm{huffman}\bigl(\rho_Q[j,b] - \rho_Q[j,b-1]\bigr) & \text{otherwise} \end{cases} \qquad (7.7)$$

where ρ_Q[j,b] is the quantization index for subband b of block j and c_ρ[j,b] is the corresponding Huffman codeword. The same principles are used for the time-differential coding, where the parameter ρ is coded according to

$$c_{\rho}[j,b] = \begin{cases} \mathrm{huffman}\bigl(\rho_Q[j,b]\bigr) & \text{for } j = 1 \\ \mathrm{huffman}\bigl(\rho_Q[j,b] - \rho_Q[j-1,b]\bigr) & \text{otherwise} \end{cases} \qquad (7.8)$$
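The two differential strategies of Eqs. (7.7) and (7.8) amount to simple index differences prior to the Huffman mapping, as in the MATLAB sketch below; the index matrix is a placeholder and only the symbols to be Huffman coded are formed (the Huffman tables themselves are trained as described above).

% Sketch of Eqs. (7.7)-(7.8): form the symbols that are Huffman coded for the
% frequency- and time-differential strategies. rhoQ is a placeholder matrix of
% quantization indices with one row per block j and one column per subband b.
rhoQ = randi([0 62], 20, 12);            % dummy indices (e.g. ICTD, 12 subbands)

% Frequency-differential symbols, Eq. (7.7): first subband sent as is,
% remaining subbands as the difference to the previous subband.
symFreq           = rhoQ;
symFreq(:, 2:end) = rhoQ(:, 2:end) - rhoQ(:, 1:end-1);

% Time-differential symbols, Eq. (7.8): first block sent as is,
% remaining blocks as the difference to the previous block.
symTime           = rhoQ;
symTime(2:end, :) = rhoQ(2:end, :) - rhoQ(1:end-1, :);

% Each symbol is then mapped to its codeword from the trained Huffman tables.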


In order to determine the most efficient coding strategy, the average length of the codewords has been estimated. The codeword lengths are compared for direct coding of the quantization indices and for the time- and frequency-differential indices. In addition, the entropy of the corresponding coding strategy has been estimated. The entropy determines the lowest possible average codeword length that can be obtained with entropy coding [2]. The measure can consequently be used to evaluate the performance of the Huffman coding. Nevertheless, Shannon has shown that the average length of the Huffman codewords cannot be more than one bit higher than the entropy [2]. The entropy is given by

$$\mathrm{Entropy} = -\sum_{q=1}^{Q} p_q \log_2 p_q \qquad (7.9)$$

where Q is the total number of codewords, e.g. 31 for the non-differential ICLD, and p_q is the probability of codeword q occurring in the stereo signals.
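Eq. (7.9) can be evaluated directly from a symbol histogram, as in the MATLAB sketch below; the symbol stream is a placeholder.

% Sketch of Eq. (7.9): empirical entropy of a stream of quantization indices.
symbols = randi([0 30], 1, 10000);               % dummy indices, e.g. 31-level ICLD
p       = histcounts(symbols, -0.5:1:30.5);      % occurrence counts per index
p       = p(p > 0) / numel(symbols);             % probabilities p_q (non-zero only)
H       = -sum(p .* log2(p));                    % entropy in bits per symbol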

In Table 7.1 the average codeword lengths for the different coding strategies are presented. The entropy and the Huffman codeword lengths have been determined using the distribution of the quantized stereo parameters for the previously mentioned training database. For all the stereo parameters, the time-differential coding results in the shortest Huffman codewords. The largest coding gain is obtained for the ICTD, where the codeword length is 64 % lower than for the non-differential uncoded quantization indices. The lowest coding gain, on the other hand, is obtained for the rotational angle of the KLT, i.e. σ, where the gain is just 31 % compared to no coding. In addition, the table shows that the Huffman coding is efficient, with an average codeword length that is close (less than 0.25 bits) to the shortest possible codewords, i.e. the entropy, in all cases.

Stereo parameter   Coding strategy          Uncoded   Huffman coded   Entropy
ICLD (ΔP_LR)       Non-differential         5         3.50            3.47
                   Frequency differential   6         3.42            3.37
                   Time differential        6         2.64            2.58
ICC (c_LR)         Non-differential         3         2.71            2.61
                   Frequency differential   4         2.86            2.80
                   Time differential        4         2.06            2.02
ICTD (Δτ_LR)       Non-differential         6         5.26            5.23
                   Frequency differential   7         5.41            5.40
                   Time differential        7         2.15            1.90
KLT angle (σ)      Non-differential         5         4.27            4.24
                   Frequency differential   6         4.19            4.16
                   Time differential        6         3.47            3.43
SMLD (ΔP_SM)       Non-differential         3         2.51            2.48
                   Frequency differential   4         2.52            2.45
                   Time differential        4         1.78            1.72

Table 7.1: Average codeword lengths (in bits) for coding of the quantization indices of the stereo parameters using different coding strategies. The entropy and the Huffman codewords are determined using the distribution of the quantized stereo parameters from a training database of 26 stereo samples (see Appendix E). The time-differentially coded parameters (presented in boldface) are represented with the shortest average codewords.


Even if the time-differential coding was shown to be the most appropriate method for all stereo parameters on average, there is an opportunity to select the most efficient method on a block-per-block basis. For each block the number of bits needed to encode the quantized parameters is determined by the Huffman tables, and the methods are consequently comparable. However, the information about which method is used has to be transmitted to the decoder in order to decode the stereo parameters properly. In the case of two possible coding methods this can be signalled by a one-bit flag, while two bits are needed for selection among all three coding strategies, i.e. non-differential, frequency-differential or time-differential. For the stereo parameters of the G.719 stereo models the number of extra bits needed for this signalling was larger than the obtained gain, at least for the training database of stereo samples. Consequently, time-differential coding has been selected for all the parameters with no adaptation to the signal characteristics of a processed signal block.

In PS (see section 4.3.3) the bitrate is further reduced by not transmitting the phase parameters for subbands that represent frequencies above 2 kHz, which is related to the duplex theory (see section 3.1). However, for the G.719 stereo codec the ICTD parameters are transmitted for all the frequency subbands. The reason is that the human auditory system is sensitive to timing differences between the amplitude envelopes in both channels (see section 3.1). Informal listening tests showed that there are audible impairments of the stereo quality when the ICTD parameters are not synthesized for the higher frequencies. This result might, however, be related to the time resolution of 20 ms, since the envelope of the stereo channels could be better controlled by the ICLD with a finer time resolution.

7.3 Bitrate estimations for training and MPEG database

The first bitrate estimation is obtained from the training database of stereo samples, which was used to define the Huffman codebooks. Subsequently, the bitrate has been estimated on the MPEG database in order to verify that the statistics (distributions of the stereo parameters) of the training database are representative for other stereo samples. The stereo samples of the used databases are presented in Appendix E.

In Table 7.2 the average bitrates estimated from both the training and the MPEG database are presented for the stereo parameters in the defined G.719 stereo models. In model 1 the ICLD consumes more bits than the ICC and ICTD since it has a finer frequency resolution. The total bitrate for model 1 was estimated to 4.46 kbps for the training database, while it was somewhat lower at 3.88 kbps for the MPEG database. The total bitrate for model 2 is slightly higher, with 4.77 kbps and 4.37 kbps for the two databases respectively. It can be noticed that the main difference between model 2 and model 1 is a higher bitrate for the rotational angle of the KLT than for the ICLD. The reason might be that not just the level difference is controlled by the KLT angle but also the amount of correlation, i.e. the level relation between the mid and side channels. This makes the KLT angle more variable than the corresponding parameters in model 1, i.e. the ICLD and the ICC, and the differential coding becomes less efficient, which can also be seen in Table 7.1.


Model   Parameter        Number of subbands   Average bitrate, kbps (training)   Average bitrate, kbps (MPEG)
1       ICLD (ΔP_LR)     16                   2.07                               1.75
1       ICC (c_LR)       10                   1.04                               0.88
1 & 2   ICTD (Δτ_LR)     12                   1.35                               1.25
2       SMLD (ΔP_SM)     12                   1.06                               0.96
2       KLT angle (σ)    14                   2.36                               2.16

        Total Model 1                         4.46                               3.88
        Total Model 2                         4.77                               4.37

Table 7.2: Average bitrates for the training and the MPEG database of stereo samples.

Even if the average bitrates are relevant, it is also important to consider the variations in the bitrates due to the stereo characteristics. In Figure 7.1 the bitrate of each sample in the MPEG database is presented for model 1 and model 2. The bitrate variation within the database is similar for both models, but for individual samples there are some differences. For example, the sample "te03.wav" is more efficient to encode than the sample "te01.wav" for model 1, while it is the other way around for model 2. This is related to the smaller coding gain of the KLT angle due to its higher dynamics compared to the ICLD parameter. The lowest bitrates are obtained for the samples with mono-like spatial properties, but even for those samples model 1 is more efficient. This is also related to the high dynamics of the KLT angle, which implies that the Huffman codewords that represent the time-differentially coded KLT quantization indices become longer than the corresponding ICLD codewords.

The total bitrate for any of the stereo samples does not exceed the average bitrate by more than 0.9 and 0.7 kbps for model 1 and model 2 respectively. Furthermore, the average bitrates for the two models differ by approximately 0.5 kbps, which is not a significant difference. The total bitrate for both stereo models is consequently estimated to about 4 kbps based on the training and the MPEG database of stereo samples.


Figure 7.1: Bitrate estimations for the MPEG database. The average total bitrate for model 1 (M1) is 3.88 kbps, distributed as 1.75 kbps for the ICLD, 0.88 kbps for the ICC and 1.25 kbps for the ICTD. For model 2 (M2), the average total bitrate is 4.37 kbps, distributed as 2.16 kbps for the KLT angle σ, 0.96 kbps for the SMLD and 1.25 kbps for the ICTD.


7.4 Conclusion

The stereo parameters that are extracted in the encoder are quantized and coded with respect to perceptual criteria. The aim is to introduce coding errors that are not audible to the human auditory system.

The ICLD and ICC parameters are scalar quantized using non-linear quantization tables from Parametric Stereo (PS). The KLT angle is quantized based on the principles used to quantize the ICLD, while completely new tables have been defined for the SMLD and the ICTD parameters. The indices of the quantized parameters are differentially coded over time in order to decrease the dynamics, which enables a more efficient Huffman coding.

The ICTD parameters are transmitted for the entire bandwidth, in contrast to PS where the phase parameters are not transmitted for frequencies above 2 kHz. This was shown to be necessary for a good stereo reconstruction since the timing of the signal envelopes is perceived by the auditory system.

The total bitrate for transmitting the stereo parameters has been estimated for the two G.719 stereo models. Codewords for the differentially coded quantization indices were defined using a training database from which the bitrates were estimated. The obtained bitrates were verified on a large database of stereo samples from MPEG. The average bitrates were estimated to about 4 kbps for both models.


8 Performance of the implemented G.719 stereo codec

In this chapter the performance of the proposed stereo architecture for ITU-T G.719 is evaluated. The evaluation of sound quality mainly relies on subjective experiments following recommendations from the Radiocommunication Sector of the International Telecommunication Union (ITU-R). In addition, objective measures can be used for rough estimation and during development work. However, it is hard to predict the subjective quality from objective measures, and they should only be used as complements to the subjective tests.

The performance of the stereo models has been evaluated at two different bitrates. The bitrate for coding the stereo parameters is constant at 4 kbps, while the ITU-T G.719 mono codec has been used at 44 and 60 kbps. The bitrates were chosen based on informal listening tests where the saturation of the quality was evaluated for the first parametric model. It was found that the model saturated at bitrates roughly between 64 and 96 kbps depending on the signal characteristics. The gain of an increased mono bitrate with a constant stereo bitrate of 4 kbps was consequently determined to be marginal above 64 kbps in general. For the performance evaluation, a set of ten stereo samples has been encoded with both G.719 stereo models at total bitrates of 48 and 64 kbps. The same material has been encoded with ITU-T G.719 Dual Mono (DM) at 2x32, 2x48 and 2x64 kbps and with the stereo codecs 3GPP eAAC+ [62] and 3GPP AMR-WB+ [60] at 48 kbps for comparison. The encoded material used for the performance evaluation is presented as the MUSHRA database in Appendix E.

8.1 Objective measurement

In a first step, the performance of the codecs has been evaluated objectively in terms of the reconstruction of the left and right channel waveforms. The measure that has been used is a segmental Signal-to-Noise Ratio (SNR), which measures the power of the reference signal in relation to the power of the reconstruction error for each analyzed frame/block. The measure is based on the definition used in [55] with additional averaging over the stereo channels. The SNR is determined in time segments of 256 samples for each stereo channel and averaged over the segments and the channels in order to obtain a single measure for each stereo sample. This can be expressed as

$$\mathrm{SNR} = -10\log_{10}\!\left(\frac{10^{-\mathrm{segSNR}_l/10} + 10^{-\mathrm{segSNR}_r/10}}{2}\right) \qquad (8.1)$$

with

$$\mathrm{segSNR}_x = \frac{1}{N_S}\sum_{s=0}^{N_S-1} 10\log_{10}\!\left(\frac{\displaystyle\sum_{n=sM}^{(s+1)M-1} x[n]^2}{\displaystyle\sum_{n=sM}^{(s+1)M-1}\bigl(x[n]-\hat{x}[n]\bigr)^2}\right)$$

where N_S is the number of segments and M = 256 samples is the length of the segments.
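A MATLAB sketch of this segmental SNR, following the reconstruction of Eq. (8.1) above, is shown below; the reference and decoded stereo signals are placeholders.

% Sketch of the segmental SNR of Eq. (8.1); x and xhat are placeholder stereo
% signals (columns = left/right), M is the segment length.
M    = 256;
x    = randn(48000, 2);                       % reference stereo signal (dummy)
xhat = x + 0.01*randn(size(x));               % decoded stereo signal (dummy)

Ns     = floor(size(x,1)/M);                  % number of segments
segSNR = zeros(1, 2);                         % per-channel segmental SNR [dB]
for ch = 1:2
    acc = 0;
    for s = 0:Ns-1
        idx = s*M + (1:M);
        e   = x(idx,ch) - xhat(idx,ch);
        acc = acc + 10*log10( sum(x(idx,ch).^2) / sum(e.^2) );
    end
    segSNR(ch) = acc / Ns;
end

% Eq. (8.1): combine the two channels into a single measure
SNR = -10*log10( ( 10^(-segSNR(1)/10) + 10^(-segSNR(2)/10) ) / 2 );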

The results from the objective evaluation are presented in Figure 8.1.


Figure 8.1: SNR for the test database of stereo samples encoded with the selected stereo codecs under the specified conditions. The notations Model1 and Model2 refer to the G.719 stereo models and DM means dual mono coding, i.e. individual mono coding of each channel.

In general there is a significantly better waveform reconstruction for the samples that are DM coded with ITU-T G.719 than for the corresponding samples encoded using the two stereo models. In addition, Figure 8.1 shows that there is no significant difference in the average SNR between model 1 and model 2, and the difference between the total bitrates of 48 and 64 kbps is only about 1 dB on average. Compared to eAAC+ and AMR-WB+, the objective performance of the G.719 stereo models is better on average, and for the clean and reverberant speech samples in particular. The SNRs for the proposed stereo models are however lower than for the competitors for the mixed content samples. The waveform is not as well reconstructed by the stereo codecs as by the DM coding, but the measure does not consider what is important to the human auditory system. The performance of the codecs is therefore evaluated subjectively in the following section.

8.2 Subjective evaluation

In order to evaluate the performance of the proposed stereo models the subjective stereo audio quality has been estimated using the methodology of MUltiple Stimuli tests with Hidden Reference and Anchor (MUSHRA) defined by ITU-R [56].

The MUSHRA method is suitable for evaluation of intermediate audio quality under the assumption that the coding introduces clear impairments. Test results with statistical significance can normally be obtained with about 20 listeners according to the MUSHRA specification [56].



In the tests the listeners are presented with a known reference signal, a hidden reference, one anchor and the encoded samples under the specified conditions. The anchor is a low-pass filtered version of the reference signal with a bandwidth of 7 kHz, used to provide a reference point of bad audio quality for fullband signals (up to 20 kHz).

The items are evaluated using a scale ranging from 0 to 100, commonly divided into five equally spaced categories: bad, poor, fair, good and excellent. To make the scaling procedure easier for the listener, all conditions are available simultaneously, and switching between them allows comparison between all of them and not just with the reference. One attribute to always consider in the scoring is the basic audio quality, but for stereo (or multi-channel audio) the spatial image must also be considered. The final results from the test are obtained by a statistical analysis of the MUSHRA scores. The test specification allows outliers in the results, from either overly critical or insufficiently critical listeners, to be rejected in order to obtain a more reliable result. The individual scores x_m for each condition are averaged both over the audio items and the listeners and presented with a 95 % confidence interval. The interval is determined according to

$$\bar{x} \pm 1.96\,\frac{\tilde{x}}{\sqrt{M}} \qquad (8.2)$$

where x̄ is the average

$$\bar{x} = \frac{1}{M}\sum_{m=1}^{M} x_m$$

and x̃ denotes the standard deviation, which is given by

$$\tilde{x} = \sqrt{\frac{1}{M-1}\sum_{m=1}^{M}\bigl(x_m - \bar{x}\bigr)^2}$$

where M is the total number of scores over all listeners and all samples for the analysed condition, e.g. all the scores that concern G.719 stereo model 1 at 48 kbps.
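The confidence interval of Eq. (8.2) is computed as in the MATLAB sketch below; the score vector is a placeholder.

% Sketch of Eq. (8.2): mean score and 95 % confidence interval for one condition.
scores = 60 + 10*randn(1, 90);            % dummy MUSHRA scores (listeners x items)
M      = numel(scores);
xbar   = mean(scores);                    % average score
xtilde = std(scores);                     % standard deviation (1/(M-1) normalization)
ci     = xbar + 1.96 * xtilde / sqrt(M) * [-1 1];   % 95 % confidence interval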

The stereo models for G.719 have been evaluated in two MUSHRA tests with headphone listening, using the stereo samples and the coding conditions that were presented in the beginning of this chapter.

8.2.1 MUSHRA – G.719 stereo at 48 kbps

The first MUSHRA test evaluates the performance of the proposed stereo models together with the ITU-T G.719 mono codec at a total bitrate of 48 kbps. The ten stereo samples of clean, reverberant and noisy speech, mixed content and music in the MUSHRA database (see Appendix E) have been encoded according to Table 8.1.


G.719 Dual Mono (DM)   2x32, 2x48 kbps
G.719 Stereo           48 kbps
eAAC+                  48 kbps
AMR-WB+                48 kbps

Table 8.1: Coding conditions for the 48 kbps MUSHRA test.

The stereo audio quality concerning both the overall audio quality and the stereo image has been evaluated according to the MUSHRA methodology. The listener panel consisted of 9 experienced listeners at Ericsson Research – Multimedia technologies in Stockholm and Luleå.

The average MUSHRA scores for each condition are presented in Figure 8.2.

The results show that there is no significant difference between the two proposed stereo models for G.719 since the confidence intervals largely overlap. However, the model based on inter-channel cues has a slightly better average score than the KLT-based stereo model. Moreover, G.719 stereo using the inter-channel cues seems to perform better than both AMR-WB+ and eAAC+ at the same bitrate, since the confidence intervals do not overlap, while this statement does not hold for the other stereo model associated with G.719.

The G.719 stereo codecs perform as well as G.719 DM coding at 2x32 = 64 kbps. There is thus possibly a coding gain of about 25 % from using the parametric stereo models together with G.719.

[Figure 8.2: bar chart of MUSHRA scores on a 20-100 scale. Conditions: REF, AMR-WB+_stereo48, G719_dualMono_2x32, G719_dualMono_2x48, LP7, M1_G.719_stereo48, M2_G.719_stereo48, eaacp_stereo48. Mean scores printed in the figure: 99.9, 65.6, 69.7, 83.6, 73.3, 69.4, 59.9, 33.6.]

Figure 8.2: MUSHRA test results for ten stereo samples coded at 48 kbps. The labels for each condition denote the codec used, i.e. AMR-WB+, G.719 or eAAC+ (eaacp), the coding technique, i.e. dual mono or stereo and the bitrate. The G.719 stereo models are denoted with M1 and M2 for model 1 and model 2 respectively and the notation LP7 denotes the low-pass filtered anchor at 7 kHz.


In Figure 8.3, the results from the first MUSHRA test are presented using only the stereo samples from the categories of clean and reverberant speech together with a binaural recording (see Appendix section B.IV), i.e. the samples cs1, cs2, rs1 and bi1 defined in Appendix E. For these categories the proposed stereo models have an average score that is about 7 MUSHRA points above the average score over all categories. For these samples the G.719 stereo models score higher than or equal to G.719 DM at 2x48 = 96 kbps, and consequently there might be a coding gain of about 50 %. The AMR-WB+ codec has a quality in the range of the G.719 stereo models, although with a lower average score, while eAAC+ performs significantly worse, in the same range as G.719 DM at 2x32 kbps.

[Figure 8.3: bar chart of MUSHRA scores on a 20-100 scale. Conditions: REF, AMR-WB+_stereo48, G719_dualMono_2x32, G719_dualMono_2x48, LP7, M1_G.719_stereo48, M2_G.719_stereo48, eaacp_stereo48. Mean scores printed in the figure: 100, 71.2, 53.0, 69.8, 79.0, 78.5, 51.8, 35.2.]

Figure 8.3: MUSHRA test results for the samples of clean speech (cs1, cs2), reverberant speech (rs1) and the binaural recording (bi1) coded at 48 kbps. The labels for each condition denote the codec used, i.e. AMR-WB+, G.719 or eAAC+ (eaacp), the coding technique, i.e. dual mono or stereo and the bitrate. The G.719 stereo models are denoted with M1 and M2 for model 1 and model 2 respectively and the notation LP7 denotes the low-pass filtered anchor at 7 kHz.

For the other categories of stereo samples, i.e. noisy speech, mixed content and music (the samples ns1, ns2, mi1, mi2, mu1 and mu2), it is not possible for the stereo models at 48 kbps to reach the quality of the DM coding at 64 kbps. Figure 8.4 shows that the stereo models for G.719 obtained scores about 5 MUSHRA points lower than the average over all categories (see Figure 8.2). Furthermore, the scores for the DM coding are much higher than the corresponding scores for the clean speech samples. However, the stereo quality for the G.719 stereo models is still competitive with AMR-WB+ and eAAC+.


[Figure 8.4: bar chart of MUSHRA scores on a 20-100 scale. Conditions: REF, AMR-WB+_stereo48, G719_dualMono_2x32, G719_dualMono_2x48, LP7, M1_G.719_stereo48, M2_G.719_stereo48, eaacp_stereo48. Mean scores printed in the figure: 99.9, 61.8, 80.8, 92.8, 69.5, 63.3, 65.3, 35.2.]

Figure 8.4: MUSHRA test results for the samples of noisy speech (ns1, ns2), mixed content (mi1, mi2) and music (mu1, mu2) coded at 48 kbps. The labels for each condition denote the codec used, i.e. AMR-WB+, G.719 or eAAC+ (eaacp), the coding technique, i.e. dual mono or stereo and the bitrate. The G.719 stereo models are denoted with M1 and M2 for model 1 and model 2 respectively and the notation LP7 denotes the low-pass filtered anchor at 7 kHz.

8.2.2 MUSHRA – G.719 stereo at 64 kbps

The second MUSHRA test evaluates the performance of the proposed stereo models together with the ITU-T G.719 mono codec at a total bitrate of 64 kbps. The mono channel of the parametric stereo architecture is thereby encoded at a higher bitrate than in the first MUSHRA test, while the bitrate of the stereo parameters is constant at 4 kbps. The ten previously presented stereo samples have for this test been encoded according to Table 8.2.

G.719 Dual Mono (DM)   2x32, 2x48, 2x64 kbps
G.719 Stereo           64 kbps

Table 8.2: Coding conditions for the 64 kbps MUSHRA test.

The stereo audio quality has been evaluated with a listener panel of 8 experienced listeners at Ericsson – Multimedia technologies in Stockholm and Luleå.

In Figure 8.5 the results averaged over all ten encoded samples are presented. The relation between the two G.719 stereo models is similar to that in the first MUSHRA test (see section 8.2.1). The results show that the stereo quality obtained for the two stereo models is significantly higher than for the DM coding at equal bitrate, i.e. 2x32 = 64 kbps. In addition, the results overlap with the results for G.719 DM at 2x48 = 96 kbps. The quality for DM coding at 2x64 = 128 kbps is however significantly higher, with an average MUSHRA score of 89.9 points compared to 77.0 and 73.4 for the two stereo models respectively.


[Figure 8.5: bar chart of MUSHRA scores on a 20-100 scale. Conditions: REF, G719_dualMono_2x32, G719_dualMono_2x48, G719_dualMono_2x64, LP7, M1_G.719_stereo64, M2_G.719_stereo64. Mean scores printed in the figure: 99.5, 61.0, 79.6, 89.9, 32.5, 77.0, 73.4.]

Figure 8.5: MUSHRA test results for ten stereo samples coded at 64 kbps. The labels for each condition denote the codec used, i.e. G.719, the coding technique and the bitrate. The G.719 stereo models are denoted with M1 and M2 for model 1 and model 2 respectively and the notation LP7 denotes the low-pass filtered anchor at 7 kHz.

For the categories of clean, reverberant and binaurally recorded speech there is a behavior similar to that in the first MUSHRA test, as can be seen in Figure 8.6. More specifically, the scores for the DM coding are lower than the corresponding results over all categories presented in Figure 8.5, while the scores for the stereo models are higher. In this case it implies that the stereo quality is significantly better for the stereo coding than for the DM coding at equal bitrates. The quality at 64 kbps stereo coding is actually significantly better than DM coding at 2x48 = 96 kbps and lies in the range of DM coding at 2x64 kbps. The coding gain is therefore, once again, up to 50 % for these types of stereo samples.

[Figure 8.6: bar chart of MUSHRA scores on a 20-100 scale. Conditions: REF, G719_dualMono_2x32, G719_dualMono_2x48, G719_dualMono_2x64, LP7, M1_G.719_stereo64, M2_G.719_stereo64. Mean scores printed in the figure: 98.9, 43.6, 63.4, 82.4, 34.7, 86.3, 81.6.]

Figure 8.6: MUSHRA test results for the samples of clean speech (cs1, cs2), reverberant speech (rs1) and the binaural recording (bi1) coded at 64 kbps. The labels for each condition denote the codec used, i.e. G.719, the coding technique and the bitrate. The G.719 stereo models are denoted with M1 and M2 for model 1 and model 2 respectively and the notation LP7 denotes the low-pass filtered anchor at 7 kHz.


Finally, the results for the noisy speech, mixed content and music samples are presented in Figure 8.7. The scores for the DM coding are high for both 2x48 = 96 and 2x64 = 128 kbps. The G.719 stereo coding using the two stereo models cannot reach this level of quality; the quality instead lies in the range of DM coding at 2x32 = 64 kbps. There is consequently no gain of parametric stereo coding for these categories of stereo samples.

[Figure 8.7: bar chart of MUSHRA scores on a 20-100 scale. Conditions: REF, G719_dualMono_2x32, G719_dualMono_2x48, G719_dualMono_2x64, LP7, M1_G.719_stereo64, M2_G.719_stereo64. Mean scores printed in the figure: 100, 72.7, 90.4, 94.8, 31.6, 70.8, 68.0.]

Figure 8.7: MUSHRA test results for the samples of noisy speech (ns1, ns2), mixed content (mi1, mi2) and music (mu1, mu2) coded at 64 kbps. The labels for each condition denote the codec used, i.e. G.719, the coding technique and the bitrate. The G.719 stereo models are denoted with M1 and M2 for model 1 and model 2 respectively and the notation LP7 denotes the low-pass filtered anchor at 7 kHz.

8.3 Conclusion

The subjective quality of the proposed G.719 stereo models is on average higher than the quality delivered by G.719 Dual Mono (DM) coding at equal bitrates. The results are based on two MUSHRA tests where the stereo quality was evaluated at total bitrates of 48 and 64 kbps for ten stereo samples of clean, reverberant and noisy speech, mixed content and music characteristics.

The high performance of the stereo models was especially observable for clean, reverberant and low-noise speech signals, even when the stereo image was complex as for the binaural recording. The coding gain was shown to be up to 50 % in the comparison with DM coding. G.719 DM coding has a low score for the clean speech samples since G.719 is not a dedicated speech coder such as AMR-WB+. The quantization noise is therefore higher and more noticeable for G.719, and especially in DM coding systems.


For the samples of noisy speech, mixed content and music, the tests showed that there was no consistent gain of parametric stereo coding. This was also indicated by the objective evaluation in terms of segmental SNR of the reconstructed waveform (see section 8.1). The typical problem can be described as a poor representation of stereo ambience, which for example was observed in a stereo sample with applause. The reason is probably a combination of the losses in the downmixing, the low frequency resolution at a lowered stereo bitrate and the synthesis of the uncorrelated components, i.e. the decorrelation. The performance of the proposed parametric stereo models working with the G.719 codec is however not particularly bad. In the low-bitrate evaluation, i.e. at a total bitrate of 48 kbps, the G.719 stereo codec was compared to the stereo modes of AMR-WB+ and eAAC+ at equal bitrates. The test showed that the quality is on average equal to or better for the G.719 stereo models than it is for the AMR-WB+ and eAAC+ codecs.

The comparison of the G.719 stereo models, i.e. model 1 and model 2, based on the objective measurement showed that there was no significant difference between them. The SNR was however on average slightly higher for model 2 than for model 1. The subjective evaluations showed on the other hand that model 1 seems to obtain a higher quality than model 2, both on average and for each of the studied groups of samples: clean and reverberant speech on the one hand, and noisy speech, mixed content and music on the other. Consequently, it seems as if the artifacts introduced by model 2 were perceptually more annoying than the artifacts introduced by model 1. A possible explanation is that the quantization steps of the stereo parameters were not as well defined for model 2 as for model 1, which means that the comparison between the models could possibly turn out differently at higher stereo bitrates. The lower frequency resolution for the KLT parameter than for the ICLD cue might also contribute to the difference between the G.719 stereo models.


9 Conclusion and future work

9.1 Conclusion

In this thesis a stereo architecture for the ITU-T G.719 mono codec is described. ITU-T G.719 is a fullband MDCT-domain codec suitable for speech and audio signals at 32, 48 and 64 kbps (extended up to 128 kbps).

Two different parametric stereo models working in the complex-valued MDFT domain have been developed and implemented in MATLAB for evaluation of the subjective stereo audio quality. The models are based on existing state-of-the-art stereo coding techniques where the stereo signals are described by a downmix of the stereo channels and several stereo parameters. The first model uses an equalized downmix with pre-weighting and performs the extraction of the ICLD, ICTD and ICC cues. The second model is characterized by the KLT downmix and the extraction of the ICTD and SMLD parameters.

The defined stereo architecture and models have been enhanced with optimized frequency resolutions for stereo analysis and synthesis, signal alignment, optimized downmix methods and improved parameter extraction. The latter mainly consists of two novel algorithms used to improve the ICTD extraction for stereo signals of high tonality and with a high level of interfering components, e.g. background noise. Furthermore, a constrained time-shifting decorrelation filter and quantization tables for the stereo parameters have been defined.

The stereo parameters have been non-uniformly quantized with quantization tables based on psycho-acoustical principles and parameter distributions. Subsequently, the time-differential quantization indices have been Huffman coded. The total stereo bitrate has been estimated to about 4 kbps in average for stereo samples of different characteristics. Two MUSHRA tests have shown a competitive subjective performance of the defined stereo architecture at total bitrates of 48 and 64 kbps for clean, reverberant and noisy speech, mixed content and music. The proposed stereo models are significantly advantageous in comparison to dual mono coding for complex clean and reverberant speech signals. However, no consistent gain was shown for noisy speech, mixed content and music.

In general the first model shows a slightly higher performance than the second model for all the categories of samples under test. At the stereo bitrate of about 4 kbps, the model and architecture constraints appear more restrictive for the second model. An objective measurement of the reconstruction error showed that there was no significant difference between the models; model 2 was actually slightly better than model 1 in terms of SNR.

Despite the limitations of the evaluated stereo codecs the subjective listening tests revealed a potential for parametric stereo coding using the ITU-T G.719 codec. In comparison to the existing stereo codecs AMR-WB+ [60] and eAAC+ [61] the average performance was better at the equal bitrate of 48 kbps.


9.2 Future work

The defined stereo architecture for ITU-T G.719 could be improved from several aspects. In this section three possible subjects of improvements will be discussed.

9.2.1 Compatibility with ITU-T G.719

One important part of further development is to increase the compatibility between ITU-T G.719 and the stereo architecture. The primary issue is the currently incompatible transform sizes due to the additional zero-padding of the transform windows in the stereo layer. The zero-padding is needed to suppress noticeable distortions from the wrap-around of the circular convolution caused by the frequency-domain time-shifts of the signal alignment, the ICTD synthesis and the decorrelation.

Since the signal content in the transform blocks is equal irrespective of the zero-padding, the possibility of spectral resampling could be investigated. For example, in [57] MDCT spectra are resampled in an improved estimation of fundamental frequencies where the transform size is constrained by the frame size of the audio codec. In the G.719 stereo codec the MDCT spectra of the stereo layer would have to be down-sampled from 1152 to 960 bins before mono coding. Similarly, the decoded MDCT spectra would have to be up-sampled before stereo synthesis in the decoder.

The zero-padding is more important for the signal alignment and the ICTD synthesis than for the decorrelation where the wrap-around effect is not as critical. In addition, the defined stereo architecture allows the signal alignment and ICTD synthesis to be separated from the other stereo processing, which is illustrated in Figure 9.1. Consequently, the time-shifts could be performed in another domain, e.g. using a time-domain filterbank, and the transforms of the stereo and mono layer would be compatible, i.e. without zero-padding.


Figure 9.1: MDFT-domain G.719 stereo architecture with external ICTD analysis, signal alignment and ICTD synthesis.


One advantage of the ITU-T G.719 codec is the instantaneous switching between the two time resolutions of the stationary and transient modes, i.e. 20 ms and 5 ms. Currently there is no transient mode available for the stereo layer, but in order to maintain full compatibility of the transform spectrum sizes this would be required. The development of a stereo transient mode is not completely straightforward. Among other things, transients may or may not be coincident in the two stereo channels, which, due to the frequency resolution, would ideally imply different time resolutions for the left and right channel processing. In addition, transients in the downmix channel would probably be of main interest. However, in order to adjust the time resolution, the transients must be identified prior to the time-to-frequency transformations, and this is not compatible with the frequency-domain downmixing.

9.2.2 Reduction of complexity and delay

The use of fully compatible transforms in the mono and stereo layers makes it possible to reduce the complexity of the G.719 stereo codec. The inverse and direct transformations of the non-coded and encoded mono spectra become redundant and could therefore be omitted (see Figure 6.1 and Figure 6.2 in chapter 6). The result is a lower complexity but the performance of the codec could also be slightly improved since model distortion effects from the OLA are avoided.

To decorrelate the mono signal in the decoder, the complex MDFT spectra are needed. In this case the imaginary part of the MDFT spectrum, i.e. the MDST spectrum, is missing, but fortunately it can be estimated from the transmitted MDCT spectrum. In [58] an algorithm for MDST estimation is presented where the precision of the estimate can be traded against complexity. The algorithm requires three consecutive MDCT spectra, of indices j-1, j and j+1, to estimate the MDST spectrum of index j. Consequently, since the MDCT spectrum of block j+1 is needed, the algorithm implies an algorithmic delay of 20 ms in the case of the ITU-T G.719 framework. The delay of the proposed stereo layer would therefore be equal to that of the current implementation, but with the gain of a lower complexity. In other words, a low-delay estimator of the MDST spectra would be needed in order to reduce the algorithmic delay.

9.2.3 Improved subjective quality for complex stereo signals

The subjective quality of complex stereo signals of mixed content and music turned out to not be sufficient in the current implementations of the stereo architecture for ITU-T G.719. In order to improve the performance there are several possibilities.

Informal listening tests showed that the quality could be improved with a higher frequency resolution for the subband processing. The impairments from a lowered frequency resolution seemed to be more noticeable in the presence of mono coding than they were without it in the frequency resolution optimizations (see section 6.3.1).

Further enhancements can be obtained by extended stereo models. For example, the side-signal in the second stereo model with KLT, ICTD and SMLD could probably be better estimated in the decoder with additional stereo parameters that describe the phase and the fine structure of the side signal.


The subjective performance of the stereo codec could also be improved by the combination of parametric and residual coding, i.e. hybrid coding. The advantage of hybrid coding over dual mono and entirely parametric stereo coding would probably be highest at intermediate bitrates where the quality of the parametric stereo models saturates. This is illustrated in Figure 9.2 where the performance of a typical parametric stereo coder is compared to dual mono and hybrid coding [59].


Figure 9.2: Sketch of the stereo quality as a function of the bitrate showing the benefit of hybrid coding [59].


References

[1] ITU-T, Recommendation G.719, “Low-complexity full-band audio coding for high-quality conversational applications”, 2008

[2] M. Bosi & R. E. Goldberg, “Introduction to digital audio coding and standards”, Springer Science+Business Media, New York, 2003

[3] D.A. Huffman, “A method for the construction of minimum-redundancy codes”, Proceeding of the I.R.E, pp. 1098-1102, 1952

[4] A. Gersho & R.M. Gray, “Vector quantization and signal compression”, Kluwer Academic Publishers, Massachusetts, 1992

[5] Y. Wang, L. Yaroslavsky, M. Vilermo & M. Väänänen, “Some Peculiar Properties of the MDCT”, Proceedings of ICSP2000, 2000

[6] 3GPP TR 26.936 V8.0.0, “Technical Specification Group Services and System Aspects; Performance characterization of 3GPP audio codecs (Release 8)”, December 2008

[7] ITU-T, Recommendation G.191, “Software tools for speech and audio coding standardization”, 2005

[8] ITU-T, Recommendation G.722.1, “Low-complexity coding at 24 and 32 kbit/s for hands-free operation in systems with low frame loss”, 2005

[9] M. Xie, P. Chu, A. Taleb & M. Briand, “ITU-T G.719: A new low-complexity full-band (20 kHz) audio coding standard for high-quality conversational applications”, 2009 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2009

[10] LAME project, http://lame.sourceforge.net, 2010

[11] ITU-R, Recommendation BS.1116-1, “Methods for the subjective assessment of small impairments in audio systems including multichannel sound systems”, 1997

[12] M. Xie, D. Lindbergh & P. Chu, “ITU-T G.722.1 Annex C: A new low-complexity 14 kHz audio coding standard”, 2006 IEEE International Conference on Acoustics, Speech and Signal Processing Proceedings, 2006

[13] J. W. Cooley and J. W. Tukey, “An Algorithm for the Machine Calculation of Complex Fourier Series”, Mathematics of Computation, Vol. 19, pp. 297-301, April 1965

[14] Xiph.org Foundation, “OGG Vorbis I specification,” http://www.xiph.org/vorbis/doc/, 2010

[15] J. Breebaart and C. Faller, “Spatial Audio Processing, MPEG Surround and Other Applications”, Chichester, 2007

[16] Medway NHS Foundation Trust, Retrieved June 1, 2010 from http://www.medway.nhs.uk/images/parts_of_the_ear.gif

[17] D. R. Begault, “3-D Sound for Virtual Reality and Multimedia”, Academic Press, Cambridge, 1994

[18] J. Blauert, “Spatial Hearing, The psychophysics of Human Sound Localization”, Revised edition, The MIT Press, Cambridge, 1997


[19] V. Pulkki & C. Faller, “Directional Audio Coding: Filterbank and STFT-based Design”, 120th AES Convention, Paper 6658, 2006

[20] J. Breebaart, S. van de Par and A. Kohlrausch, ”Binaural processing model based on contralateral inhibition. II. Dependence on spectral parameters”, J. Acoust. Soc. Am. 110, pp. 1089-1104, August 2001

[21] Blumlein A. D., “Improvements in and relating to sound transmission, sound recording and sound reproducing systems”, UK Patent 394325, 1931

[22] J. Breebaart, S. van de Par, A. Kohlrausch and E. Schuijers, “Parametric Coding of Stereo Audio”, EURASIP Journal on Applied Signal Processing, Issue 9, pp. 1305-1322, 2005

[23] R. Y. Litovsky, H. S. Colburn, W.A. Yost & S. J. Guzman, “The precedence effect”, J. Acoust. Soc. Am. 106, pp. 1633-1654, October 1999

[24] ISO/IEC JTC 1/SC 29/WG 11, 11172-3, “Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbits/s – Part 3: Audio”, 1993

[25] ISO/IEC JTC 1/SC 29/WG 11, 13818-3, “Generic Coding of Moving Pictures and Associated Audio: Audio”, 1994

[26] ISO/IEC JTC 1/SC 29/WG 11, 13818-7, “Generic Coding of Moving Pictures and Associated Audio: Advanced Audio Coding”, 1997

[27] ISO/IEC JTC 1/SC 29/WG 11, 14496-3, “Coding of Audio-Visual Objects: Audio”, 2001

[28] ISO/IEC JTC 1/SC 29/WG 11, FDAM2 14496-3, “Parametric Coding”, 2004

[29] J. D. Johnston & A. J. Ferreira, “Sum-difference stereo transform coding”, IEEE ICASSP, 1992

[30] J. Herre, K. Brandenburg & D. Lederer, ”Intensity stereo coding”, 96th AES Convention, Preprint 3799, 1994

[31] J. Engdegård, H. Purnhagen, J. Rödén & L. Liljeryd, ”Synthetic Ambience in Parametric Stereo Coding”, 116th AES Convention Paper 6074, May 2004

[32] A. Mouchtaris & P. Tsakalides, “Multichannel Audio Coding for Multimedia Services in Intelligent Environments”, in G. A. Tsihrintzis & L. C. Jain (Eds.), Multimedia Services in Intelligent Environments (pp. 103-148), Springer-Verlag, Berlin, 2008

[33] J. Lindblom, J. H. Plasberg & R. Vafin, “Flexible sum-difference stereo coding based on time-aligned signal components”, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, October 2005

[34] J. D. Johnston, “Estimation of Perceptual Entropy Using Noise Masking Criteria”, IEEE ICASSP, 1988

[35] C.M. Liu, W.C. Lee & Y.H. Hsiao, “M/S Coding Based On Allocation Entropy”, Proc. of the 6th Int. Conference on Digital Audio Effects (DAFX-03), September 2003

[36] S. Chen, R. Hu & N. Xiong, “A Multimedia Application: Spatial Perceptual Entropy (SPE) of Multichannel Audio Signals”, National Engineering Research Center for Multimedia Software, Wuhan University, China, 2010

[37] R.G. van der Waal & R.N.J. Veldhuis, ”Subband coding of stereophonic digital audio signals”, IEEE ICASSP, Vol. 5, pp. 3601-3604, April 1991


[38] C.M. Liu & J.C. Liu, “A New Intensity Stereo Coding Scheme for MPEG1 Audio Encoder – Layers I and II”, IEEE Transactions on Consumer Electronics , Vol. 42 pp.535-539, Aug 1996

[39] C. Faller, “Parametric Multichannel Audio Coding: Synthesis of Coherence Cues”, IEEE Transactions on Audio, Speech and Language Processing, 2005

[40] C. Faller, “Parametric Coding of Spatial Audio”, Ph. D thesis, EPFL, 2004

[41] R. Irwan & R.M. Aarts, “Two-to-Five Channel Sound Processing”, J. Audio Eng. Soc., Vol. 50, No. 11, November 2002

[42] M. Briand, D. Virette & N. Martin, “Parametric Representation of Multichannel Audio Based on Principal Component Analysis”, Proc. of the 9th Int. Conference on Digital Audio Effects (DAFx-06), September 2006

[43] M. Briand, D. Virette & N. Martin, “Parametric Coding of Stereo Audio Based on Principal Component Analysis”, 120th AES Convention, Paper 6813, May 2006

[44] J. Lapierre & R. Lefebvre, “On Improving Parametric Stereo Audio Coding”, 120th AES Convention, Paper 6804, May 2006

[45] J. Kim, E. Oh & J. Robilliard, “Enhanced Stereo Coding with phase parameters for MPEG Unified Speech and Audio Coding”, 127th AES Convention, Paper 7875, October 2009

[46] C. Faller & F. Baumgarte, “Binaural Cue Coding – Part II: Schemes and Applications”, IEEE Transactions on Speech and Audio Processing, Vol. 11, No. 6, November 2003

[47] P. Ojala, M. Tammi & M. Vilermo, “Parametric Binaural Audio Coding”, IEEE ICASSP, pp. 393-396, March 2010

[48] Samsudin, E. Kurniawati, F. Sattar, N. B. Poh, S. George, “A Subband Domain Downmixing Scheme for Parametric Stereo Encoder”, 120th AES Convention, Paper 6815, May 2006

[49] K. Suresh & T.V. Sreenivas, “MDCT Domain Analysis and Synthesis of Reverberation for Parametric Stereo Audio”, 123rd AES Convention, Paper 7281, October 2007

[50] M. Briand, “Etudes d’algorithmes d’extraction des informations de spatialisation sonore: application aux formats multicanaux”, Institut National Polytechnique de Grenoble, 2007

[51] M. Bouéri & C. Kyirakakis, “Audio Signal Decorrelation Based on a Critical Band Approach”, 117th AES Convention, Paper 6291, October 2004

[52] G. S. Kendall, “The Decorrelation of Audio Signals and Its Impact on Spatial Imagery”, Computer Music Journal, 19:4, pp. 71-87, 1995

[53] ISO/IEC JTC 1/SC 29/WG 11, M17494, “Coding of Moving Pictures and Audio, CE on improvements to low bitrate stereo in USAC”, April 2010

[54] ISO/IEC JTC 1/SC 29/WG 11, N11213, “Coding of Moving Pictures and Audio”, January 2010

[55] P. Mermelstein, “Evaluation of a segmental SNR measure as an indicator of the quality of ADPCM coded speech”, J. Acoust. Soc. Am., Volume 66, Issue 6, pp. 1664-1667, December 1979


[56] ITU-R, Recommendation ITU-R BS.1534-1, “Method for the subjective assessment of intermediate quality level of coding systems”, 2003

[57] C. A. Santoro, & C. I. Cheng, “Multiple F0 Estimation in the Transform Domain”, 10th ISMIR Conference, 2009

[58] C. I. Cheng, “Method for estimating magnitude and phase in the MDCT domain”, AES Convention Paper 6091, 2004

[59] C. Faller, “Parametric Coding of Spatial Audio”, EPFL, 2004

[60] 3GPP TS 26.290 V9.0.0, “Extended Adaptive Multi-Rate – Wideband (AMR-WB+) codec; Transcoding functions (Release 9)”, September 2009

[61] 3GPP TS 26.403 V9.0.0, “Enhanced aacPlus general audio codec; Encoder specification; Advanced Audio Coding (AAC) part (Release 9)”, December 2009

[62] 3GPP TS 26.405 V.9.0.0, “Enhanced aacPlus general audio codec; Encoder specification parametric stereo part (Release 9)”, December 2009

[63] P. Denbigh, “System analysis and signal processing, with emphasis on the use of MATLAB”, Addison-Wesley Longman Ltd, Harlow, 1998

[64] T. Söderström, P. Stoica, “System Identification”, Prentice Hall International, Hemel Hempstead, 1989

[65] C. Tournery & C. Faller, ”Improved Time Delay Analysis/Synthesis for Parametric Stereo Audio Coding”, 120th AES Convention, 2006

[66] J. A. Rossi-Katz & A. Natarajan, ”Tonality and its Application to Perceptual-Based Speech Enhancements”, University of Colorado, 2003

[67] H. Fletcher, “Auditory Patterns”, Review of Modern Physics, Vol. 12, pp. 47-55, January 1940

[68] E. Zwicker, “Subdivision of the Audible Frequency Range into Critical Bands (Frequenzgruppen)”, The Journal of the Acoustical Society of America, Vol. 33, February 1961

[69] B. R. Glasberg, B. C. J. Moore, ”Derivation of auditory filter shapes from notched-noise data”, Hearing Research, Vol. 47, pp. 103-138, 1990

[70] J. Boudreau, R. Frank, G. Sigismondi, T. Vear, R. Waller, “Microphone Techniques for Recording”, Shure, 2009

[71] F. Rumsey & T. McCormick, “Sound and Recording” (6th ed.), Focal Press, Oxford, 2009

[72] J. Breebaart, G. Hotho, J. Koppens, E. Schuijers, W. Oomen & S. van de Par, “Background, Concept, and Architecture for the Recent MPEG Surround Standard on Multichannel Audio Compression”, J. Audio Eng. Soc., Vol. 55, No. 5, May 2007

[73] DPA Microphones, “Stereo Recordings with DPA Microphones”, 2006

[74] http://megapolisfestival.org/blogalogadingdong/wp-content/uploads/2009/09/ku100_dummy_head_400x502.jpg

[75] ISO/IEC JTC 1/SC 29/WG 11, N2006, “Report on the MPEG-2 AAC Stereo Verification Tests”, February 1998


Appendix

A. Critical bandwidths and ERB

In 1940 Fletcher [67] described the auditory system as an array of overlapping band-pass filters with bandwidths equal to the “critical bandwidths” according to the constant masking thresholds. This can be related to the frequency discrimination of the cochlea and is therefore sometimes referred to as the cochlear filterbank. In [68] Zwicker divided the audible frequency range into 24 critical bands with variable positions along the frequency scale. The critical bandwidths are approximately constant at about 100 Hz for center frequencies up to 500 Hz and about 1/5 of the center frequency for higher frequencies. These critical bandwidths can analytically be expressed as

$$CB\,[\mathrm{Hz}] = 25 + 75\left(1 + 1.4\,f_c^2[\mathrm{kHz}]\right)^{0.69} \qquad (A.1)$$

where CB is the critical bandwidth in Hz as a function of the center frequency $f_c$ in kHz [2].

Later experiments showed similar results but with slightly different bandwidths, especially for the lower frequencies [2]. In 1990 Glasberg and Moore [69] defined the Equivalent Rectangular Bandwidth (ERB) which agreed with the later experiments on the critical bandwidths. The ERB can analytically be expressed as

$$ERB\,[\mathrm{Hz}] = 24.7\left(4.37\,f_c[\mathrm{kHz}] + 1\right) \qquad (A.2)$$

where the ERB in Hz is a function of the center frequency $f_c$ in kHz [69].

In Figure A.1 the critical bandwidths from Zwicker are compared to the corresponding ERBs from Glasberg and Moore. The ERB scale results in a finer frequency resolution than the original CB scale, especially at frequencies below 500 Hz, since the bandwidths are smaller. Consequently, the ERB scale describes the frequency selectivity of the human auditory system with stronger requirements on the frequency resolution of perceptual coders. In addition to the presented results, experiments have shown that the critical bands for binaural rendering are slightly broader than for monaural rendering [18]. These observations might consequently allow a lower frequency resolution for the rendering of the stereo parameters in a parametric stereo coder than for the mono coding.
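As an illustration, the following MATLAB sketch evaluates Eqs. (A.1) and (A.2) over the audible range and reproduces the comparison plotted in Figure A.1:

% Critical bandwidth (Zwicker) and ERB (Glasberg & Moore) as functions of the
% center frequency, Eqs. (A.1) and (A.2).
fc_kHz = logspace(log10(0.02), log10(20), 200);    % 20 Hz to 20 kHz
CB_Hz  = 25 + 75*(1 + 1.4*fc_kHz.^2).^0.69;        % Eq. (A.1)
ERB_Hz = 24.7*(4.37*fc_kHz + 1);                   % Eq. (A.2)
loglog(fc_kHz*1000, CB_Hz, fc_kHz*1000, ERB_Hz);
xlabel('Frequency (Hz)'); ylabel('Bandwidth (Hz)');
legend('Critical Bandwidth (Zwicker)', 'ERB (Glasberg & Moore)', 'Location', 'northwest');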


Figure A.1: Approximations of the critical bandwidth: the critical bandwidth according to Zwicker and the ERB according to Glasberg & Moore, as functions of the center frequency.

B. Stereo recording techniques

The techniques used for stereo recording strongly impact the signal characteristics to encode. In the following the most common recording techniques will be briefly described.

B.I. Coincident microphones

In order to capture stereo signals with coincident microphones, i.e. microphones in the same spatial location, at least one of the microphones must be directional. The combinations of directional microphones can be categorized into the XY and the Mid-Side (MS) configurations.

In the XY configuration two directional microphones are separated by an angle of 90° to 180° depending on the width of the sound stage to be covered [71]. In Figure B.2.a an XY configuration using two perpendicular cardioid-directional microphones is presented. However, other types of directional microphones can be used in order to increase the amount of captured reverb, e.g. figure-eight microphones, which are bi-directional and capture rear sounds as well.

In the MS microphone configuration the first microphone can be cardioid-directional, bidirectional or omnidirectional and captures the middle or mid channel. The other microphone must be bidirectional and captures sound perpendicular to the direction of the source as illustrated in Figure B.2.b. Before playback of the recorded audio the left and right channels are constructed from the mid and the side channels according to

$$\begin{aligned} L &= M + S \\ R &= M - S \end{aligned} \qquad (B.1)$$



where L and R are the left and right channels respectively, M is the mid-channel and S is the side channel. For each MS configuration there is an equivalent XY configuration and the characteristics of the captured signals are similar [71].

The Inter-Channel Time Differences (ICTDs) between the channels of stereo audio recorded with coincident microphones are negligible since the microphones essentially capture the sounds at the same location in space. This has the disadvantage that the stereo effect becomes weaker, especially at lower frequencies where the time (phase) difference is an important spatial cue (see chapter 3). On the other hand, the advantage of these recording techniques is the backward compatibility to mono systems, since the channels can be downmixed (see chapter 4) without any comb-filter effect, i.e. cancellation or amplification of signal components caused by phase differences [15].

For the coincident microphone configurations, room effects, i.e. reflections and reverberation, affect the two recorded signals similarly. The Inter-Channel Coherence (ICC) between the channels is consequently larger than it would have been for recordings with non-coincident microphones. In other words, the effect of the room (or environment in general) is not fully captured and the width of the spatial objects can be altered [72].

Although there is no ICTD in the recorded channels, the directional microphones capture the intensity of the source differently depending on its spatial location. Consequently, there is an Inter-Channel Level Difference (ICLD) between the captured channels which is perceptually relevant for localization at frequencies above about 1.6 kHz (see chapter 3).

Comparing the two coincident microphone configurations, the XY configuration is at a disadvantage for central sounds, which are off-axis for both directional microphones. The central sounds might as a consequence be captured with a poor frequency response, which makes the spatial image more unstable [71]. In addition, the MS configuration allows modification of the stereo width without physically changing the angles of the microphones, which is necessary in the XY configuration. The width is controlled by adjusting the balance between the mid and the side channels when constructing the left and the right stereo channels; a wider stereo image, for example, is obtained when the gain of the side channel is increased, since the difference between the left and right channels becomes larger, as shown in the sketch below.
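A minimal MATLAB sketch of this width control, assuming the mid and side channels are available as column vectors and using synthetic stand-in signals so that the snippet is self-contained:

% Reconstruction of the left and right channels from mid (M) and side (S)
% according to Eq. (B.1), with a side gain g controlling the stereo width
% (g = 1 gives the plain MS decoding, g > 1 widens the image).
fs = 48000; t = (0:fs-1).'/fs;       % one second of synthetic material
M  = sin(2*pi*440*t);                % stand-in mid channel
S  = 0.2*randn(fs, 1);               % stand-in side channel (ambience)
g  = 1.5;                            % example width gain
L  = M + g*S;                        % left channel
R  = M - g*S;                        % right channel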

B.II. Spaced microphone configurations

In contrast to the XY and MS configurations, the microphones can be located in two different positions in the AB configuration. The configuration uses two omnidirectional microphones spaced by a distance d, as illustrated in Figure B.2.c. The principal stereo characteristic of the captured channels is the ICTD, whose range depends on the distance d [71]. However, the channels might also comprise an ICLD if the distance between the source and the microphones is short.

The main drawback of the AB configuration is that a simple stereo-to-mono downmix will generate a comb-filtering effect (see section 6.4), as illustrated by the sketch below. On the other hand, the reproduction of the reverberant sound field is improved and the spatial width of the directional sources may be increased. However, the possible decrease in ICC coefficients might make the directional cues less accurate than in the XY and MS configurations.
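As a small illustration of the comb-filtering claim, assume the two channels are identical apart from an ICTD of tau seconds; the passive downmix 0.5(L + R) then has the magnitude response |cos(pi f tau)| with notches at odd multiples of 1/(2 tau):

% Magnitude response of the passive downmix 0.5*(L + R) when R is a copy of L
% delayed by tau seconds, showing the comb-filter notches.
fs  = 48000;
tau = 0.5e-3;                              % example ICTD of 0.5 ms
f   = 0:10:fs/2;
H   = abs(cos(pi*f*tau));                  % |0.5*(1 + exp(-1i*2*pi*f*tau))|
plot(f, 20*log10(H + eps));
xlabel('Frequency (Hz)'); ylabel('Downmix magnitude (dB)');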


Another advantage with the AB configuration is that for large distances between the sources and the microphones the omnidirectional microphones capture low frequencies better than the directional microphones used in other configurations do [73].

B.III. Near coincident configurations

Near coincident microphone configurations can be seen as a combination of the XY and AB microphone setups. The microphones are, as the name indicates, placed in different locations, but not as far apart as in the AB configuration. In consequence, these configurations capture the ambience (environment sounds, room effect, etc.) better than the XY/MS configurations and capture the directions more precisely than the AB configuration.

One commonly used near coincident configuration was first used by the Office de Radiodiffusion-Télévision Française (ORTF). The setup consists of two cardioid-directional microphones that are spaced by 17 cm and separated by 110°, as illustrated in Figure B.2.d. The idea behind the setup is to capture spatial cues similar to the binaural cues of human hearing; the distance between the microphones represents the distance between the ears and the separating angle emulates the shadowing effects of the human head [73].

Figure B.2: Microphone configurations for stereo recordings. a) XY configuration with coincident perpendicular cardioid-directional microphones. b) Mid-Side configuration with coincident omnidirectional and figure-eight microphones. c) AB configuration with two omnidirectional microphones spaced by the distance d. d) ORTF near coincident configuration.


B.IV. Binaural recording

The binaural recording technique takes the ideas of the ORTF configuration even further and aims to reproduce exactly the same listening experience as if the listener were present at the auditory stage. Two microphones are placed near the ear entrances of a listener or in a dummy head with modelled pinnae and ear canals (see Figure B.3). The dummy head is intended to approximate an average auditory system that would be acceptable to most people.

When the binaurally recorded signals are played back through headphones the listening experience is very realistic, but if the listener is not the same person as the one used for the recording there can be spatial distortions due to the incompatibility between the auditory systems. The listener can then experience front-back confusion (front-back reversals [18]) or limited externalization, i.e. the sound is perceived inside the listener's head rather than outside.

Figure B.3: Neumann KU100 dummy head used for binaural recording [74].

C. Distributions of stereo parameters

In this appendix, the distributions of the extracted stereo parameters for the training database are presented in Figure C.4. The training database consisted of 26 stereo samples of clean, reverberant and noisy speech, mixed content and music with a total playback time of about 7 minutes (see Appendix E). The distributions have been considered in the definition of the quantization tables for the stereo parameters (see section 7.1).

In Figure C.5 to Figure C.9 the distribution of the quantized and differentially coded stereo parameters are presented for the training database of stereo samples. The quantization indices refer to the tables defined in section 7.1. The dynamics of the time-differentially coded quantization indices are lower than the dynamics of the non-differential and the frequency-differential quantization indices for all the stereo parameters. The distributions have been considered for the coding of the stereo parameters (see section 7.2).
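A minimal MATLAB sketch of the frequency- and time-differential coding of the quantization indices, assuming the indices for one stereo parameter are collected in a subbands-by-blocks matrix (the random stand-in data below does not reproduce the low dynamics observed for real parameters in Figures C.5 to C.9):

% Frequency- and time-differential coding of quantization indices for one
% stereo parameter; idx has one row per subband and one column per block.
idx      = randi([0 31], 20, 100);                 % stand-in quantization indices
freqDiff = [idx(1, :); diff(idx, 1, 1)];           % differences between subbands
timeDiff = [idx(:, 1),  diff(idx, 1, 2)];          % differences between processed blocks
% For real stereo parameters the time-differential indices showed the lowest
% dynamics and are therefore the cheapest to entropy code (see section 7.2).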


Figure C.4: Distributions of the extracted non-quantized stereo parameters for the training database. (a) ICLD in dB, (b) ICC, (c) ICTD in samples, (d) KLT angle in radians and (e) SMLD in dB.

Figure C.5: Distributions of the ICLD quantization indices for the training database. The top figure shows the indices of the quantized parameter which are differentially coded between the subbands, i.e. in frequency (the middle figure), and over processed blocks, i.e. in time (the bottom figure).



Figure C.6: Distributions of the ICC quantization indices for the training database. The top figure shows the indices of the quantized parameter which are differentially coded between the subbands, i.e. in frequency (the middle figure), and over processed blocks, i.e. in time (the bottom figure).

Figure C.7: Distributions of the ICTD quantization indices for the training database. The top figure shows the indices of the quantized parameter which are differentially coded between the subbands, i.e. in frequency (the middle figure), and over processed blocks, i.e. in time (the bottom figure).



Figure C.8: Distributions of the KLT angle quantization indices for the training database. The top figure shows the indices of the quantized parameter which are differentially coded between the subbands, i.e. in frequency (the middle figure), and over processed blocks, i.e. in time (the bottom figure).

Figure C.9: Distribution of the SMLD quantization indices for the training database. The top figure shows the indices of the quantized parameter which are differentially coded between the subbands, i.e. in frequency (the middle figure), and over processed blocks, i.e. in time (the bottom figure).



D. MDCT, MDST, MDFT and TDAC properties

In this appendix the perfect reconstruction properties of the MDCT and the MDFT are shown. In section D.I it is shown how the MDCT can be expressed in terms of the DCTIV of a time-domain aliased signal block. In section D.II the properties of the TDAC in the overlap-add of inverse-transformed MDCT spectra are shown. In section D.III the MDST is expressed in matrix form, to be used in section D.IV where the MDFT is described in more detail.

D.I. MDCT

The MDCT in Eq. (2.3) can be equally expressed in matrix form as

$$\mathbf{X}_w^{MDCT} = \mathbf{T}\,\mathbf{x}_w \qquad (D.1)$$

where

$$\mathbf{X}_w^{MDCT} = \begin{bmatrix} X_w^{MDCT}[0] \\ \vdots \\ X_w^{MDCT}[N-1] \end{bmatrix} \quad\text{and}\quad \mathbf{x}_w = \begin{bmatrix} x_w[0] \\ \vdots \\ x_w[2N-1] \end{bmatrix}$$

The transform matrix T of dimension N×2N is given by

$$\mathbf{T} = \sqrt{\frac{2}{N}}\begin{bmatrix} t[0,0] & \cdots & t[0,2N-1] \\ \vdots & \ddots & \vdots \\ t[N-1,0] & \cdots & t[N-1,2N-1] \end{bmatrix}$$

where

$$t[k,n] = \cos\!\left(\frac{\pi}{N}\left(n+\frac{1}{2}+\frac{N}{2}\right)\left(k+\frac{1}{2}\right)\right) \qquad (D.2)$$

Due to the trigonometric properties of the cosine functions it follows that

$$t[k,N-1-n] = \cos\!\left(\frac{\pi}{N}\left(N-1-n+\frac{1}{2}+\frac{N}{2}\right)\left(k+\frac{1}{2}\right)\right) = \cos\!\left(2\pi\left(k+\frac{1}{2}\right)-\frac{\pi}{N}\left(n+\frac{1}{2}+\frac{N}{2}\right)\left(k+\frac{1}{2}\right)\right) = -t[k,n]$$

for n = 0,…,N/2-1 (D.3)

and

$$t[k,2N-1-n] = \cos\!\left(\frac{\pi}{N}\left(2N-1-n+\frac{1}{2}+\frac{N}{2}\right)\left(k+\frac{1}{2}\right)\right) = \cos\!\left(4\pi\left(k+\frac{1}{2}\right)-\frac{\pi}{N}\left(N+n+\frac{1}{2}+\frac{N}{2}\right)\left(k+\frac{1}{2}\right)\right) = t[k,N+n]$$

for n = 0,…,N/2-1 (D.4)

The MDCT defined in Eq. (2.3) can then be expressed as

$$X_w^{MDCT}[k] = \sqrt{\frac{2}{N}}\left[\sum_{n=0}^{N/2-1}\left(x_w[n]-x_w[N-1-n]\right)t[k,n] + \sum_{n=0}^{N/2-1}\left(x_w[N+n]+x_w[2N-1-n]\right)t[k,N+n]\right] \qquad (D.5)$$

for k = 0,…,N-1

Defining

$$\mathbf{x}_w = \begin{bmatrix}\mathbf{x}_{w,1}\\ \mathbf{x}_{w,2}\\ \mathbf{x}_{w,3}\\ \mathbf{x}_{w,4}\end{bmatrix} \quad\text{where}\quad \mathbf{x}_{w,m} = \begin{bmatrix} x_w[(m-1)N/2]\\ \vdots\\ x_w[mN/2-1]\end{bmatrix} \quad\text{for } m=1,\dots,4 \qquad (D.6)$$

the MDCT can be equally written in matrix form as

$$\mathbf{X}_w^{MDCT} = \mathbf{T}_1\left(\mathbf{x}_{w,1}-\mathbf{x}_{w,2}^R\right) + \mathbf{T}_2\left(\mathbf{x}_{w,3}+\mathbf{x}_{w,4}^R\right), \qquad \mathbf{X}_w^{MDCT} = \begin{bmatrix} X_w^{MDCT}[0]\\ \vdots\\ X_w^{MDCT}[N-1]\end{bmatrix} \qquad (D.7)$$

where

$$\mathbf{T}_1 = \sqrt{\frac{2}{N}}\begin{bmatrix} t[0,0] & \cdots & t[0,N/2-1]\\ \vdots & & \vdots\\ t[N-1,0] & \cdots & t[N-1,N/2-1]\end{bmatrix}, \qquad \mathbf{T}_2 = \sqrt{\frac{2}{N}}\begin{bmatrix} t[0,N] & \cdots & t[0,3N/2-1]\\ \vdots & & \vdots\\ t[N-1,N] & \cdots & t[N-1,3N/2-1]\end{bmatrix} \qquad (D.8)$$

and the time reversed sub-vectors are

$$\mathbf{x}_{w,m}^R = \begin{bmatrix} x_w[mN/2-1]\\ \vdots\\ x_w[(m-1)N/2]\end{bmatrix} \quad\text{for } m=2,4 \qquad (D.9)$$

As already mentioned, the MDCT can be seen as the Discrete Cosine Transform of type IV (DCTIV) of a signal block that has been time-domain aliased. There exist eight different types of DCTs which can be seen as real valued versions of the Discrete Fourier Transform (DFT), which for a signal x[n] of length N is defined as

$$X[k] = \sum_{n=0}^{N-1} x[n]\,e^{-i\frac{2\pi}{N}nk}, \quad\text{for } k = 0,\dots,N-1 \qquad (D.10)$$

For the DCTs the complex exponential basis functions are replaced by cosine functions that are shifted in both time and frequency. The DCT of type IV (DCTIV) has a similar frequency shift as the MDCT while the time shift is different. For the signal x[n] of length N the DCTIV is defined as

$$X_{DCT}[k] = \sqrt{\frac{2}{N}}\sum_{n=0}^{N-1} x[n]\cos\!\left(\frac{\pi}{N}\left(n+\frac{1}{2}\right)\left(k+\frac{1}{2}\right)\right), \quad\text{for } k = 0,\dots,N-1 \qquad (D.11)$$

or equally in matrix form

$$\mathbf{X}_{DCT} = \mathbf{U}\mathbf{x} \qquad (D.12)$$

with the vectors

$$\mathbf{X}_{DCT} = \begin{bmatrix} X_{DCT}[0]\\ \vdots\\ X_{DCT}[N-1]\end{bmatrix} \quad\text{and}\quad \mathbf{x} = \begin{bmatrix} x[0]\\ \vdots\\ x[N-1]\end{bmatrix} \qquad (D.13)$$

and the transform matrix U of dimension N×N

$$\mathbf{U} = \sqrt{\frac{2}{N}}\begin{bmatrix} u[0,0] & \cdots & u[0,N-1]\\ \vdots & \ddots & \vdots\\ u[N-1,0] & \cdots & u[N-1,N-1]\end{bmatrix} \qquad (D.14)$$

where

$$u[k,n] = \cos\!\left(\frac{\pi}{N}\left(n+\frac{1}{2}\right)\left(k+\frac{1}{2}\right)\right) \qquad (D.15)$$

The rows of the transform matrix U are the basis functions of the DCTIV, which are orthonormal. Since the matrix is orthogonal and symmetric it is also equal to its inverse, which defines the inverse transform. The DCTIV computed directly from the definition requires in the order of N² operations, but similarly to the Fast Fourier Transform (FFT) algorithm the number of operations can be reduced to the order of N log₂ N.


In order to express the MDCT in terms of the DCTIV it can be noticed that

$$u[k,n+N/2] = \cos\!\left(\frac{\pi}{N}\left(n+\frac{N}{2}+\frac{1}{2}\right)\left(k+\frac{1}{2}\right)\right) = t[k,n] \qquad (D.16)$$

and

$$u[k,N/2-1-n] = \cos\!\left(\frac{\pi}{N}\left(\frac{N}{2}-1-n+\frac{1}{2}\right)\left(k+\frac{1}{2}\right)\right) = \cos\!\left(2\pi\left(k+\frac{1}{2}\right)-\frac{\pi}{N}\left(N+n+\frac{1}{2}+\frac{N}{2}\right)\left(k+\frac{1}{2}\right)\right) = -t[k,n+N] \qquad (D.17)$$

which implies that

$$\mathbf{X}_w^{MDCT} = \mathbf{T}\mathbf{x}_w = \mathbf{T}_1\left(\mathbf{x}_{w,1}-\mathbf{x}_{w,2}^R\right) + \mathbf{T}_2\left(\mathbf{x}_{w,3}+\mathbf{x}_{w,4}^R\right) = \mathbf{U}_2\left(\mathbf{x}_{w,1}-\mathbf{x}_{w,2}^R\right) - \mathbf{U}_1\left(\mathbf{x}_{w,3}^R+\mathbf{x}_{w,4}\right) \qquad (D.18)$$

with

$$\mathbf{U}_1 = \sqrt{\frac{2}{N}}\begin{bmatrix} u[0,0] & \cdots & u[0,N/2-1]\\ \vdots & & \vdots\\ u[N-1,0] & \cdots & u[N-1,N/2-1]\end{bmatrix} \quad\text{and}\quad \mathbf{U}_2 = \sqrt{\frac{2}{N}}\begin{bmatrix} u[0,N/2] & \cdots & u[0,N-1]\\ \vdots & & \vdots\\ u[N-1,N/2] & \cdots & u[N-1,N-1]\end{bmatrix} \qquad (D.19)$$

where u[k,n] is defined in Eq. (D.15).

Consequently,

$$\mathbf{X}_w^{MDCT} = \begin{bmatrix}\mathbf{U}_1 & \mathbf{U}_2\end{bmatrix}\begin{bmatrix} -\mathbf{x}_{w,3}^R-\mathbf{x}_{w,4}\\ \mathbf{x}_{w,1}-\mathbf{x}_{w,2}^R\end{bmatrix} = \mathbf{U}\begin{bmatrix}\mathbf{0}_{N/2} & \mathbf{0}_{N/2} & -\mathbf{J}_{N/2} & -\mathbf{I}_{N/2}\\ \mathbf{I}_{N/2} & -\mathbf{J}_{N/2} & \mathbf{0}_{N/2} & \mathbf{0}_{N/2}\end{bmatrix}\begin{bmatrix}\mathbf{x}_{w,1}\\ \mathbf{x}_{w,2}\\ \mathbf{x}_{w,3}\\ \mathbf{x}_{w,4}\end{bmatrix} = \mathbf{U}\mathbf{Q}\mathbf{x}_w \qquad (D.20)$$

where the matrix U is the DCTIV matrix defined in Eq. (D.14) and the matrix Q defines the TDA with the identity matrix I, the time-reversal matrix J and the zero matrix 0 of dimension N/2×N/2 according to

$$\mathbf{I}_{N/2} = \begin{bmatrix}1 & \cdots & 0\\ \vdots & \ddots & \vdots\\ 0 & \cdots & 1\end{bmatrix}, \quad \mathbf{J}_{N/2} = \begin{bmatrix}0 & \cdots & 1\\ \vdots & ⋰ & \vdots\\ 1 & \cdots & 0\end{bmatrix} \quad\text{and}\quad \mathbf{0}_{N/2} = \begin{bmatrix}0 & \cdots & 0\\ \vdots & & \vdots\\ 0 & \cdots & 0\end{bmatrix} \qquad (D.21)$$

In this case, the signal block $\mathbf{x}_w$ of 2N samples is reversed in time, sign changed and folded (aliased) into a vector of half the length, i.e. N samples, by the TDA. There is a positive aliasing of the first N/2 samples, i.e. $\mathbf{x}_{w,1}$, while there is a negative aliasing of the last N/2 samples, i.e. $\mathbf{x}_{w,4}$. This is one of the properties of the MDCT that enables TDAC in the decoder (see section D.II).
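The folding can be checked numerically; the MATLAB sketch below builds T, U and Q from Eqs. (D.1), (D.14) and (D.20) and verifies that the direct MDCT of a 2N-sample block equals the DCTIV of the folded N-sample block (the value of N is an arbitrary example):

% Numerical check of the TDA folding in Eq. (D.20): T*xw == U*(Q*xw).
N  = 8;
xw = randn(2*N, 1);                                                 % windowed signal block
k  = (0:N-1).';
T  = sqrt(2/N) * cos(pi/N * ((0:2*N-1) + 1/2 + N/2) .* (k + 1/2));  % Eqs. (D.1)-(D.2)
U  = sqrt(2/N) * cos(pi/N * ((0:N-1)   + 1/2)       .* (k + 1/2));  % DCTIV, Eq. (D.14)
I  = eye(N/2); J = fliplr(I); Z = zeros(N/2);
Q  = [Z,  Z, -J, -I;                                                % TDA matrix, Eq. (D.20)
      I, -J,  Z,  Z];
foldErr = max(abs(T*xw - U*(Q*xw)));                                % ~1e-15, i.e. identical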

In order to better understand the effects of the window the G.719 sine window w[n] (defined in Eq. (2.1)) can be applied to the equations. The sine window is symmetric, or equally expressed

$$w[n] = w[2N-1-n] \quad\text{for } n = 0,\dots,2N-1 \qquad (D.22)$$

and can consequently be expressed in four sub-vectors as

$$\mathbf{w} = \begin{bmatrix}w[0]\\ \vdots\\ w[2N-1]\end{bmatrix} = \begin{bmatrix}\mathbf{w}_1\\ \mathbf{w}_2\\ \mathbf{w}_3\\ \mathbf{w}_4\end{bmatrix} = \begin{bmatrix}\mathbf{w}_1\\ \mathbf{w}_2\\ \mathbf{w}_2^R\\ \mathbf{w}_1^R\end{bmatrix} \qquad (D.23)$$

where $\mathbf{w}_1^R$ and $\mathbf{w}_2^R$ denote the time-reversed sub-vectors $\mathbf{w}_1$ and $\mathbf{w}_2$, which are defined as

$$\mathbf{w}_1 = \begin{bmatrix}w[0]\\ \vdots\\ w[N/2-1]\end{bmatrix}, \quad \mathbf{w}_2 = \begin{bmatrix}w[N/2]\\ \vdots\\ w[N-1]\end{bmatrix}, \quad \mathbf{w}_1^R = \begin{bmatrix}w[N/2-1]\\ \vdots\\ w[0]\end{bmatrix} \quad\text{and}\quad \mathbf{w}_2^R = \begin{bmatrix}w[N-1]\\ \vdots\\ w[N/2]\end{bmatrix} \qquad (D.24)$$

Let the signal vector x that is of length 2N be divided into four sub-vectors of length N/2 according to

$$\mathbf{x} = \begin{bmatrix}x[0]\\ \vdots\\ x[2N-1]\end{bmatrix} = \begin{bmatrix}\mathbf{x}_1\\ \mathbf{x}_2\\ \mathbf{x}_3\\ \mathbf{x}_4\end{bmatrix} \qquad (D.25)$$

where the sub-vectors x₁, x₂, x₃ and x₄ are given by

$$\mathbf{x}_m = \begin{bmatrix}x[(m-1)N/2]\\ \vdots\\ x[mN/2-1]\end{bmatrix} \quad\text{for } m=1,\dots,4 \qquad (D.26)$$

The windowed signal is obtained from the time-domain multiplication

$$\mathbf{x}_w = \begin{bmatrix}\mathbf{x}_{w,1}\\ \mathbf{x}_{w,2}\\ \mathbf{x}_{w,3}\\ \mathbf{x}_{w,4}\end{bmatrix} = \begin{bmatrix}\mathbf{w}_1\otimes\mathbf{x}_1\\ \mathbf{w}_2\otimes\mathbf{x}_2\\ \mathbf{w}_2^R\otimes\mathbf{x}_3\\ \mathbf{w}_1^R\otimes\mathbf{x}_4\end{bmatrix} \qquad (D.27)$$

The MDCT of the windowed signal x_w[n] given in Eq. (D.18) can then be expressed as

$$\mathbf{X}_w^{MDCT} = \mathbf{U}\begin{bmatrix}\mathbf{0}_{N/2} & \mathbf{0}_{N/2} & -\mathbf{J}_{N/2} & -\mathbf{I}_{N/2}\\ \mathbf{I}_{N/2} & -\mathbf{J}_{N/2} & \mathbf{0}_{N/2} & \mathbf{0}_{N/2}\end{bmatrix}\begin{bmatrix}\mathbf{w}_1\otimes\mathbf{x}_1\\ \mathbf{w}_2\otimes\mathbf{x}_2\\ \mathbf{w}_2^R\otimes\mathbf{x}_3\\ \mathbf{w}_1^R\otimes\mathbf{x}_4\end{bmatrix} = -\mathbf{U}_1\left(\mathbf{w}_2\otimes\mathbf{x}_3^R+\mathbf{w}_1^R\otimes\mathbf{x}_4\right) + \mathbf{U}_2\left(\mathbf{w}_1\otimes\mathbf{x}_1-\mathbf{w}_2^R\otimes\mathbf{x}_2^R\right) \qquad (D.28)$$

where U₁ and U₂ are defined in Eq. (D.19).

The properties of the sine window are, like the properties of the MDCT, important for the perfect reconstruction. In the following section D.II the inverse MDCT and the TDAC related to the OLA technique are described using these notations.

D.II. IMDCT and TDAC

With the notations used in Eq. (D.26) the IMDCT of the MDCT spectra obtained in Eq. (D.28) can be expressed in matrix form with

$$\hat{\mathbf{x}}_w = \begin{bmatrix}\hat{x}_w[0]\\ \vdots\\ \hat{x}_w[2N-1]\end{bmatrix}$$

according to

$$\hat{\mathbf{x}}_w = \mathbf{Q}^T\mathbf{U}^T\mathbf{X}_w^{MDCT} = \begin{bmatrix}\mathbf{0}_{N/2} & \mathbf{I}_{N/2}\\ \mathbf{0}_{N/2} & -\mathbf{J}_{N/2}\\ -\mathbf{J}_{N/2} & \mathbf{0}_{N/2}\\ -\mathbf{I}_{N/2} & \mathbf{0}_{N/2}\end{bmatrix}\begin{bmatrix}\mathbf{U}_1^T\\ \mathbf{U}_2^T\end{bmatrix}\left(-\mathbf{U}_1\left(\mathbf{w}_2\otimes\mathbf{x}_3^R+\mathbf{w}_1^R\otimes\mathbf{x}_4\right) + \mathbf{U}_2\left(\mathbf{w}_1\otimes\mathbf{x}_1-\mathbf{w}_2^R\otimes\mathbf{x}_2^R\right)\right) \qquad (D.29)$$

where the matrices U and Q are defined in Eq. (D.14) and Eq. (D.20).

The columns of the matrices U1 and U2 are orthonormal and U1 is orthogonal to U2 which means that

$$\mathbf{U}_i^T\mathbf{U}_j = \begin{cases}\mathbf{I}_{N/2} & \text{if } i=j\\ \mathbf{0}_{N/2} & \text{otherwise}\end{cases} \quad\text{for } i,j = 1,2$$

where I_{N/2} and 0_{N/2} are the identity and zero matrices of dimension N/2×N/2.

The IMDCT in Eq. (D.29) is consequently given by

$$\hat{\mathbf{x}}_w = \mathbf{Q}^T\mathbf{U}^T\mathbf{X}_w^{MDCT} = \begin{bmatrix}\mathbf{w}_1\otimes\mathbf{x}_1-\mathbf{w}_2^R\otimes\mathbf{x}_2^R\\ \mathbf{w}_2\otimes\mathbf{x}_2-\mathbf{w}_1^R\otimes\mathbf{x}_1^R\\ \mathbf{w}_2^R\otimes\mathbf{x}_3+\mathbf{w}_1\otimes\mathbf{x}_4^R\\ \mathbf{w}_2\otimes\mathbf{x}_3^R+\mathbf{w}_1^R\otimes\mathbf{x}_4\end{bmatrix} \qquad (D.30)$$

The windowed time-domain signal that was transformed is not obtained from the inverse MDCT, i.e. $\hat{\mathbf{x}}_w \neq \mathbf{x}_w$. However, with windowing and OLA the aliasing errors can be cancelled. Let the time-domain signal x[n] of length 3N be windowed in two blocks with 50% overlap according to

$$\mathbf{x}_w^1 = \begin{bmatrix}\mathbf{w}_1\\ \mathbf{w}_2\\ \mathbf{w}_2^R\\ \mathbf{w}_1^R\end{bmatrix}\otimes\begin{bmatrix}\mathbf{x}_1\\ \mathbf{x}_2\\ \mathbf{x}_3\\ \mathbf{x}_4\end{bmatrix}, \qquad \mathbf{x}_w^2 = \begin{bmatrix}\mathbf{w}_1\\ \mathbf{w}_2\\ \mathbf{w}_2^R\\ \mathbf{w}_1^R\end{bmatrix}\otimes\begin{bmatrix}\mathbf{x}_3\\ \mathbf{x}_4\\ \mathbf{x}_5\\ \mathbf{x}_6\end{bmatrix} \qquad (D.31)$$

where

$$\mathbf{x}_m = \begin{bmatrix}x[(m-1)N/2]\\ \vdots\\ x[mN/2-1]\end{bmatrix} \quad\text{for } m=1,\dots,6 \qquad (D.32)$$

and the window sub-vectors are defined according to Eq. (D.23).

The IMDCT of the MDCT-transformed blocks $\mathbf{x}_w^1$ and $\mathbf{x}_w^2$ can be determined by Eq. (D.30) as

$$\hat{\mathbf{x}}_w^1 = \begin{bmatrix}\hat{\mathbf{x}}_{w,1}^1\\ \hat{\mathbf{x}}_{w,2}^1\\ \hat{\mathbf{x}}_{w,3}^1\\ \hat{\mathbf{x}}_{w,4}^1\end{bmatrix} = \begin{bmatrix}\mathbf{w}_1\otimes\mathbf{x}_1-\mathbf{w}_2^R\otimes\mathbf{x}_2^R\\ \mathbf{w}_2\otimes\mathbf{x}_2-\mathbf{w}_1^R\otimes\mathbf{x}_1^R\\ \mathbf{w}_2^R\otimes\mathbf{x}_3+\mathbf{w}_1\otimes\mathbf{x}_4^R\\ \mathbf{w}_2\otimes\mathbf{x}_3^R+\mathbf{w}_1^R\otimes\mathbf{x}_4\end{bmatrix}, \qquad \hat{\mathbf{x}}_w^2 = \begin{bmatrix}\hat{\mathbf{x}}_{w,1}^2\\ \hat{\mathbf{x}}_{w,2}^2\\ \hat{\mathbf{x}}_{w,3}^2\\ \hat{\mathbf{x}}_{w,4}^2\end{bmatrix} = \begin{bmatrix}\mathbf{w}_1\otimes\mathbf{x}_3-\mathbf{w}_2^R\otimes\mathbf{x}_4^R\\ \mathbf{w}_2\otimes\mathbf{x}_4-\mathbf{w}_1^R\otimes\mathbf{x}_3^R\\ \mathbf{w}_2^R\otimes\mathbf{x}_5+\mathbf{w}_1\otimes\mathbf{x}_6^R\\ \mathbf{w}_2\otimes\mathbf{x}_5^R+\mathbf{w}_1^R\otimes\mathbf{x}_6\end{bmatrix} \qquad (D.33)$$

When these inverse-transformed blocks are multiplied with the synthesis window and overlap-added with 50% overlap the time-domain signal y is obtained as

$$\mathbf{y} = \begin{bmatrix}\mathbf{y}_1\\ \vdots\\ \mathbf{y}_6\end{bmatrix} = \begin{bmatrix}\mathbf{w}_1\\ \mathbf{w}_2\\ \mathbf{w}_2^R\\ \mathbf{w}_1^R\\ \mathbf{0}\\ \mathbf{0}\end{bmatrix}\otimes\begin{bmatrix}\hat{\mathbf{x}}_{w,1}^1\\ \hat{\mathbf{x}}_{w,2}^1\\ \hat{\mathbf{x}}_{w,3}^1\\ \hat{\mathbf{x}}_{w,4}^1\\ \mathbf{0}\\ \mathbf{0}\end{bmatrix} + \begin{bmatrix}\mathbf{0}\\ \mathbf{0}\\ \mathbf{w}_1\\ \mathbf{w}_2\\ \mathbf{w}_2^R\\ \mathbf{w}_1^R\end{bmatrix}\otimes\begin{bmatrix}\mathbf{0}\\ \mathbf{0}\\ \hat{\mathbf{x}}_{w,1}^2\\ \hat{\mathbf{x}}_{w,2}^2\\ \hat{\mathbf{x}}_{w,3}^2\\ \hat{\mathbf{x}}_{w,4}^2\end{bmatrix} = \begin{bmatrix}\mathbf{w}_1\otimes\hat{\mathbf{x}}_{w,1}^1\\ \mathbf{w}_2\otimes\hat{\mathbf{x}}_{w,2}^1\\ \mathbf{w}_2^R\otimes\hat{\mathbf{x}}_{w,3}^1+\mathbf{w}_1\otimes\hat{\mathbf{x}}_{w,1}^2\\ \mathbf{w}_1^R\otimes\hat{\mathbf{x}}_{w,4}^1+\mathbf{w}_2\otimes\hat{\mathbf{x}}_{w,2}^2\\ \mathbf{w}_2^R\otimes\hat{\mathbf{x}}_{w,3}^2\\ \mathbf{w}_1^R\otimes\hat{\mathbf{x}}_{w,4}^2\end{bmatrix} \qquad (D.34)$$

where

$$\mathbf{y}_m = \begin{bmatrix}y[(m-1)N/2]\\ \vdots\\ y[mN/2-1]\end{bmatrix} \quad\text{for } m=1,\dots,6 \qquad (D.35)$$

Insertion of Eq. (D.33) into Eq. (D.34) gives

$$\mathbf{y} = \begin{bmatrix}\mathbf{w}_1\otimes\mathbf{w}_1\otimes\mathbf{x}_1-\mathbf{w}_1\otimes\mathbf{w}_2^R\otimes\mathbf{x}_2^R\\ \mathbf{w}_2\otimes\mathbf{w}_2\otimes\mathbf{x}_2-\mathbf{w}_2\otimes\mathbf{w}_1^R\otimes\mathbf{x}_1^R\\ \left(\mathbf{w}_2^R\otimes\mathbf{w}_2^R+\mathbf{w}_1\otimes\mathbf{w}_1\right)\otimes\mathbf{x}_3+\left(\mathbf{w}_2^R\otimes\mathbf{w}_1-\mathbf{w}_1\otimes\mathbf{w}_2^R\right)\otimes\mathbf{x}_4^R\\ \left(\mathbf{w}_1^R\otimes\mathbf{w}_1^R+\mathbf{w}_2\otimes\mathbf{w}_2\right)\otimes\mathbf{x}_4+\left(\mathbf{w}_1^R\otimes\mathbf{w}_2-\mathbf{w}_2\otimes\mathbf{w}_1^R\right)\otimes\mathbf{x}_3^R\\ \mathbf{w}_2^R\otimes\mathbf{w}_2^R\otimes\mathbf{x}_5+\mathbf{w}_2^R\otimes\mathbf{w}_1\otimes\mathbf{x}_6^R\\ \mathbf{w}_1^R\otimes\mathbf{w}_1^R\otimes\mathbf{x}_6+\mathbf{w}_1^R\otimes\mathbf{w}_2\otimes\mathbf{x}_5^R\end{bmatrix} = \begin{bmatrix}\mathbf{w}_1\otimes\mathbf{w}_1\otimes\mathbf{x}_1-\mathbf{w}_1\otimes\mathbf{w}_2^R\otimes\mathbf{x}_2^R\\ \mathbf{w}_2\otimes\mathbf{w}_2\otimes\mathbf{x}_2-\mathbf{w}_2\otimes\mathbf{w}_1^R\otimes\mathbf{x}_1^R\\ \left(\mathbf{w}_1\otimes\mathbf{w}_1+\mathbf{w}_2^R\otimes\mathbf{w}_2^R\right)\otimes\mathbf{x}_3\\ \left(\mathbf{w}_2\otimes\mathbf{w}_2+\mathbf{w}_1^R\otimes\mathbf{w}_1^R\right)\otimes\mathbf{x}_4\\ \mathbf{w}_2^R\otimes\mathbf{w}_2^R\otimes\mathbf{x}_5+\mathbf{w}_2^R\otimes\mathbf{w}_1\otimes\mathbf{x}_6^R\\ \mathbf{w}_1^R\otimes\mathbf{w}_1^R\otimes\mathbf{x}_6+\mathbf{w}_1^R\otimes\mathbf{w}_2\otimes\mathbf{x}_5^R\end{bmatrix} \qquad (D.36)$$

In order to cancel the effects of the windowing there are some constraints on the window function. In Eq. (D.36) it can be seen that if

$$\mathbf{w}_1\otimes\mathbf{w}_1+\mathbf{w}_2^R\otimes\mathbf{w}_2^R = \begin{bmatrix}1 & \cdots & 1\end{bmatrix}^T \quad\text{and}\quad \mathbf{w}_2\otimes\mathbf{w}_2+\mathbf{w}_1^R\otimes\mathbf{w}_1^R = \begin{bmatrix}1 & \cdots & 1\end{bmatrix}^T \qquad (D.37)$$

the sub-vectors x3 and x4 can be reconstructed from the MDCT domain spectra. The sine window used for analysis and synthesis in the G.719 framework satisfies this condition. Thus, a signal part of length N is reconstructed in every OLA of two inverse-transformed spectra as depicted in Figure 2.7.

When equal analysis and synthesis windows are used together with the defined MDCT there are actually two conditions for perfect reconstruction. The window function must [5]:

- be symmetric, i.e. w[n] = w[2N-1-n], and
- satisfy the Princen-Bradley condition, i.e.

$$w^2[n] + w^2[n+N] = 1$$

The first condition is included in the notation for the window defined in Eq. (D.23) where the window is expressed as symmetric.
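Both conditions are easy to check numerically; the MATLAB sketch below assumes the usual form w[n] = sin(π/(2N)(n + 1/2)) for the G.719 sine window of Eq. (2.1) and a 20 ms frame at 48 kHz:

% Check of the symmetry and Princen-Bradley conditions for the sine window.
N = 960;                                           % 20 ms frame at 48 kHz
n = (0:2*N-1).';
w = sin(pi/(2*N) * (n + 1/2));                     % assumed sine window, Eq. (2.1)
symErr = max(abs(w - flipud(w)));                  % w[n] = w[2N-1-n]       -> ~0
pbErr  = max(abs(w(1:N).^2 + w(N+1:2*N).^2 - 1));  % w^2[n] + w^2[n+N] = 1  -> ~0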

D.III. MDST

The MDST can be obtained as the Discrete Sine Transform of type IV (DSTIV) of a time-domain aliased signal, in the same way as the MDCT can be expressed as the DCTIV of a time-domain aliased signal (see section D.I). With the notations of section D.I, more specifically the signal block x of length 2N given by Eq. (D.25) windowed by the sine window w defined in Eq. (D.23), the MDST of the windowed signal is obtained as

$$\mathbf{X}_w^{MDST} = \mathbf{V}\mathbf{Q}''\left(\mathbf{w}\otimes\mathbf{x}\right) \qquad (D.38)$$

where the matrix Q″ consists of the sub-matrices I_{N/2}, J_{N/2} and 0_{N/2} of dimension N/2×N/2 according to

$$\mathbf{Q}'' = \begin{bmatrix}\mathbf{0}_{N/2} & \mathbf{0}_{N/2} & \mathbf{J}_{N/2} & -\mathbf{I}_{N/2}\\ \mathbf{I}_{N/2} & \mathbf{J}_{N/2} & \mathbf{0}_{N/2} & \mathbf{0}_{N/2}\end{bmatrix} \qquad (D.39)$$

with the identity matrix I, the time-reversal matrix J and the zero matrix 0 defined according to Eq. (D.21). The matrix V is the DSTIV transformation matrix of dimension N×N given by

$$\mathbf{V} = \sqrt{\frac{2}{N}}\begin{bmatrix}v_{0,0} & \cdots & v_{0,N-1}\\ \vdots & \ddots & \vdots\\ v_{N-1,0} & \cdots & v_{N-1,N-1}\end{bmatrix} \qquad (D.40)$$

where

$$v_{i,j} = \sin\!\left(\frac{\pi}{N}\left(j+\frac{1}{2}\right)\left(i+\frac{1}{2}\right)\right) \qquad (D.41)$$

The MDST of the windowed signal given in Eq. (D.38) can be rewritten as

$$\mathbf{X}_w^{MDST} = \mathbf{V}\mathbf{Q}''\mathbf{x}_w = \mathbf{V}\begin{bmatrix}\mathbf{0}_{N/2} & \mathbf{0}_{N/2} & \mathbf{J}_{N/2} & -\mathbf{I}_{N/2}\\ \mathbf{I}_{N/2} & \mathbf{J}_{N/2} & \mathbf{0}_{N/2} & \mathbf{0}_{N/2}\end{bmatrix}\begin{bmatrix}\mathbf{w}_1\otimes\mathbf{x}_1\\ \mathbf{w}_2\otimes\mathbf{x}_2\\ \mathbf{w}_2^R\otimes\mathbf{x}_3\\ \mathbf{w}_1^R\otimes\mathbf{x}_4\end{bmatrix} = \mathbf{V}_1\left(\mathbf{w}_2\otimes\mathbf{x}_3^R-\mathbf{w}_1^R\otimes\mathbf{x}_4\right) + \mathbf{V}_2\left(\mathbf{w}_1\otimes\mathbf{x}_1+\mathbf{w}_2^R\otimes\mathbf{x}_2^R\right) \qquad (D.42)$$

where

$$\mathbf{V}_1 = \sqrt{\frac{2}{N}}\begin{bmatrix}v_{0,0} & \cdots & v_{0,N/2-1}\\ \vdots & & \vdots\\ v_{N-1,0} & \cdots & v_{N-1,N/2-1}\end{bmatrix} \quad\text{and}\quad \mathbf{V}_2 = \sqrt{\frac{2}{N}}\begin{bmatrix}v_{0,N/2} & \cdots & v_{0,N-1}\\ \vdots & & \vdots\\ v_{N-1,N/2} & \cdots & v_{N-1,N-1}\end{bmatrix} \qquad (D.43)$$

The IMDST of the MDST spectrum of the windowed signal block $\mathbf{x}_w$ is given by

$$\hat{\mathbf{x}}_w = \mathbf{Q}''^T\mathbf{V}^T\mathbf{X}_w^{MDST} = \begin{bmatrix}\mathbf{0}_{N/2} & \mathbf{I}_{N/2}\\ \mathbf{0}_{N/2} & \mathbf{J}_{N/2}\\ \mathbf{J}_{N/2} & \mathbf{0}_{N/2}\\ -\mathbf{I}_{N/2} & \mathbf{0}_{N/2}\end{bmatrix}\begin{bmatrix}\mathbf{V}_1^T\\ \mathbf{V}_2^T\end{bmatrix}\left(\mathbf{V}_1\left(\mathbf{w}_2\otimes\mathbf{x}_3^R-\mathbf{w}_1^R\otimes\mathbf{x}_4\right)+\mathbf{V}_2\left(\mathbf{w}_1\otimes\mathbf{x}_1+\mathbf{w}_2^R\otimes\mathbf{x}_2^R\right)\right) = \begin{bmatrix}\mathbf{w}_1\otimes\mathbf{x}_1+\mathbf{w}_2^R\otimes\mathbf{x}_2^R\\ \mathbf{w}_2\otimes\mathbf{x}_2+\mathbf{w}_1^R\otimes\mathbf{x}_1^R\\ \mathbf{w}_2^R\otimes\mathbf{x}_3-\mathbf{w}_1\otimes\mathbf{x}_4^R\\ \mathbf{w}_1^R\otimes\mathbf{x}_4-\mathbf{w}_2\otimes\mathbf{x}_3^R\end{bmatrix} \qquad (D.44)$$

The result is determined using the fact that the columns of the matrices V1 and V2 are orthonormal and that V1 is orthogonal to V2, i.e.

$$\mathbf{V}_i^T\mathbf{V}_j = \begin{cases}\mathbf{I}_{N/2} & \text{if } i=j\\ \mathbf{0}_{N/2} & \text{otherwise}\end{cases} \quad\text{for } i,j = 1,2$$

where I_{N/2} and 0_{N/2} are the identity and zero matrices of dimension N/2×N/2.

D.IV. MDFT

In matrix form the MDFT of the windowed signal x_w[n] can be expressed as

$$\mathbf{X}_w = \left(\mathbf{U}\mathbf{Q}' - i\,\mathbf{V}\mathbf{Q}''\right)\left(\mathbf{w}\otimes\mathbf{x}\right) \qquad (D.45)$$

where the matrices U and V are given by Eq. (D.14) and Eq. (D.40) respectively, and x and w are the signal and window vectors according to Eq. (D.25) and Eq. (D.23). Q′ and Q″ are the TDA matrices of the MDCT and the MDST, defined as

$$\mathbf{Q}' = \begin{bmatrix}\mathbf{0}_{N/2} & \mathbf{0}_{N/2} & -\mathbf{J}_{N/2} & -\mathbf{I}_{N/2}\\ \mathbf{I}_{N/2} & -\mathbf{J}_{N/2} & \mathbf{0}_{N/2} & \mathbf{0}_{N/2}\end{bmatrix} \quad\text{and}\quad \mathbf{Q}'' = \begin{bmatrix}\mathbf{0}_{N/2} & \mathbf{0}_{N/2} & \mathbf{J}_{N/2} & -\mathbf{I}_{N/2}\\ \mathbf{I}_{N/2} & \mathbf{J}_{N/2} & \mathbf{0}_{N/2} & \mathbf{0}_{N/2}\end{bmatrix} \qquad (D.46)$$

where the identity matrix I, the time-reversal matrix J and the zero matrix 0 of dimension N/2×N/2 are defined according to Eq. (D.21). Note that the MDCT TDA matrix Q′ is equal to the matrix Q in chapter 2.

The Inverse MDFT (IMDFT) of the MDFT spectrum X_w[k] is given by

$$\hat{x}_w[n] = \frac{1}{2}\,\Re\!\left\{\sqrt{\frac{2}{N}}\sum_{k=0}^{N-1} X_w[k]\,e^{\,i\frac{\pi}{N}\left(n+\frac{1}{2}+\frac{N}{2}\right)\left(k+\frac{1}{2}\right)}\right\} \quad\text{for } n = 0,\dots,2N-1 \qquad (D.47)$$

which can be expressed in matrix form as

$$\hat{\mathbf{x}}_w = \frac{1}{2}\,\Re\!\left\{\left(\mathbf{U}\mathbf{Q}'-i\,\mathbf{V}\mathbf{Q}''\right)^H\mathbf{X}_w\right\} = \frac{1}{2}\,\Re\!\left\{\left(\mathbf{U}\mathbf{Q}'-i\,\mathbf{V}\mathbf{Q}''\right)^H\left(\mathbf{U}\mathbf{Q}'-i\,\mathbf{V}\mathbf{Q}''\right)\left(\mathbf{w}\otimes\mathbf{x}\right)\right\} \qquad (D.48)$$

The real part of the product between the inverse and the direct transform matrices has some special properties that enable perfect reconstruction without OLA since

$$\frac{1}{2}\,\Re\!\left\{\left(\mathbf{U}\mathbf{Q}'-i\,\mathbf{V}\mathbf{Q}''\right)^H\left(\mathbf{U}\mathbf{Q}'-i\,\mathbf{V}\mathbf{Q}''\right)\right\} = \mathbf{I}_{2N} \qquad (D.49)$$
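The identity in Eq. (D.49) can be verified numerically; the MATLAB sketch below builds U, V, Q′ and Q″ from Eqs. (D.14), (D.40) and (D.46) for an arbitrary example size and checks that the real part of the Gram matrix of the MDFT transform equals twice the identity, so that the factor 1/2 in the IMDFT gives back the full 2N-sample block:

% Numerical check of Eq. (D.49) for the MDFT matrix U*Qc - i*V*Qs.
N  = 8;
k  = (0:N-1).'; n = 0:N-1;
U  = sqrt(2/N) * cos(pi/N * (n + 1/2) .* (k + 1/2));   % DCTIV, Eq. (D.14)
V  = sqrt(2/N) * sin(pi/N * (n + 1/2) .* (k + 1/2));   % DSTIV, Eqs. (D.40)-(D.41)
I  = eye(N/2); J = fliplr(I); Z = zeros(N/2);
Qc = [Z,  Z, -J, -I; I, -J, Z, Z];                     % Q'  (MDCT TDA), Eq. (D.46)
Qs = [Z,  Z,  J, -I; I,  J, Z, Z];                     % Q'' (MDST TDA), Eq. (D.46)
M  = U*Qc - 1i*V*Qs;                                   % MDFT matrix, Eq. (D.45)
recErr = max(max(abs(0.5*real(M'*M) - eye(2*N))));     % ~1e-15, i.e. Eq. (D.49) holds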


Consequently, for the MDFT there are no effects from the TDA in the inverse-transformed block, i.e. $\hat{\mathbf{x}}_w = \mathbf{x}_w$, which is not the case for the MDCT or the MDST. However, windowing and OLA are still used in the MDFT domain in order to reduce block artifacts. For the MDFTs of the blocks $\mathbf{x}_w^1$ and $\mathbf{x}_w^2$ defined in Eq. (D.31) the IMDFT becomes equal to the windowed signal, that is

$$\hat{\mathbf{x}}_w^1 = \begin{bmatrix}\hat{\mathbf{x}}_{w,1}^1\\ \hat{\mathbf{x}}_{w,2}^1\\ \hat{\mathbf{x}}_{w,3}^1\\ \hat{\mathbf{x}}_{w,4}^1\end{bmatrix} = \begin{bmatrix}\mathbf{w}_1\otimes\mathbf{x}_1\\ \mathbf{w}_2\otimes\mathbf{x}_2\\ \mathbf{w}_2^R\otimes\mathbf{x}_3\\ \mathbf{w}_1^R\otimes\mathbf{x}_4\end{bmatrix} \quad\text{and}\quad \hat{\mathbf{x}}_w^2 = \begin{bmatrix}\hat{\mathbf{x}}_{w,1}^2\\ \hat{\mathbf{x}}_{w,2}^2\\ \hat{\mathbf{x}}_{w,3}^2\\ \hat{\mathbf{x}}_{w,4}^2\end{bmatrix} = \begin{bmatrix}\mathbf{w}_1\otimes\mathbf{x}_3\\ \mathbf{w}_2\otimes\mathbf{x}_4\\ \mathbf{w}_2^R\otimes\mathbf{x}_5\\ \mathbf{w}_1^R\otimes\mathbf{x}_6\end{bmatrix} \qquad (D.50)$$

The inverse-transformed blocks are multiplied with the synthesis window and overlap-added with 50% overlap, which determines the time-domain signal y according to

$$\mathbf{y} = \begin{bmatrix}\mathbf{y}_1\\ \vdots\\ \mathbf{y}_6\end{bmatrix} = \begin{bmatrix}\mathbf{w}_1\\ \mathbf{w}_2\\ \mathbf{w}_2^R\\ \mathbf{w}_1^R\\ \mathbf{0}\\ \mathbf{0}\end{bmatrix}\otimes\begin{bmatrix}\hat{\mathbf{x}}_{w,1}^1\\ \hat{\mathbf{x}}_{w,2}^1\\ \hat{\mathbf{x}}_{w,3}^1\\ \hat{\mathbf{x}}_{w,4}^1\\ \mathbf{0}\\ \mathbf{0}\end{bmatrix} + \begin{bmatrix}\mathbf{0}\\ \mathbf{0}\\ \mathbf{w}_1\\ \mathbf{w}_2\\ \mathbf{w}_2^R\\ \mathbf{w}_1^R\end{bmatrix}\otimes\begin{bmatrix}\mathbf{0}\\ \mathbf{0}\\ \hat{\mathbf{x}}_{w,1}^2\\ \hat{\mathbf{x}}_{w,2}^2\\ \hat{\mathbf{x}}_{w,3}^2\\ \hat{\mathbf{x}}_{w,4}^2\end{bmatrix} = \begin{bmatrix}\mathbf{w}_1\otimes\hat{\mathbf{x}}_{w,1}^1\\ \mathbf{w}_2\otimes\hat{\mathbf{x}}_{w,2}^1\\ \mathbf{w}_2^R\otimes\hat{\mathbf{x}}_{w,3}^1+\mathbf{w}_1\otimes\hat{\mathbf{x}}_{w,1}^2\\ \mathbf{w}_1^R\otimes\hat{\mathbf{x}}_{w,4}^1+\mathbf{w}_2\otimes\hat{\mathbf{x}}_{w,2}^2\\ \mathbf{w}_2^R\otimes\hat{\mathbf{x}}_{w,3}^2\\ \mathbf{w}_1^R\otimes\hat{\mathbf{x}}_{w,4}^2\end{bmatrix} \qquad (D.51)$$

which becomes

$$\mathbf{y} = \begin{bmatrix}\mathbf{w}_1\otimes\mathbf{w}_1\otimes\mathbf{x}_1\\ \mathbf{w}_2\otimes\mathbf{w}_2\otimes\mathbf{x}_2\\ \left(\mathbf{w}_2^R\otimes\mathbf{w}_2^R+\mathbf{w}_1\otimes\mathbf{w}_1\right)\otimes\mathbf{x}_3\\ \left(\mathbf{w}_1^R\otimes\mathbf{w}_1^R+\mathbf{w}_2\otimes\mathbf{w}_2\right)\otimes\mathbf{x}_4\\ \mathbf{w}_2^R\otimes\mathbf{w}_2^R\otimes\mathbf{x}_5\\ \mathbf{w}_1^R\otimes\mathbf{w}_1^R\otimes\mathbf{x}_6\end{bmatrix} \qquad (D.52)$$

The sub-vectors x3 and x4 in the overlap region are perfectly reconstructed since the Princen-Bradley condition is fulfilled for the sine window (see section D.II). The signal parts x1, x2, x5 and x6 can similarly be perfectly reconstructed if the previous and the subsequent time-domain blocks from the IMDFT are windowed and overlap-added.

E. Stereo material

In this appendix the different databases of stereo samples that have been used in the thesis are briefly presented.


E.I. Database 1

The database consists of nine fullband stereo samples with a total length of about 1 minute. The samples have been sampled at 48 kHz and are represented by a 16-bit quantizer. The items, which consist of clean speech, noisy speech, mixed content and music, are briefly presented in Table E.1.

Item Characteristics

cls1 Clean speech from three sources in three different stable positions determined by ICLD and ICTD. The sources do not overlap in time and there are audible room acoustics.

cls2 Clean speech from a female and a male talker at stable positions determined by ICLD. There is a slight overlap between the sources but no audible room effects.

cls3 ICLD panning of a clean speech signal.
mix1 ICLD panned speech signal with significant restaurant background noise.
mix2 Centred speech signal with cheering sport fans.

mus1 Instrumental part of the unplugged version of the song Layla by Eric Clapton where the spatial image is wide and varying over time.

mus2 Jazz song with percussion, bass, piano, saxophone and vocals. The spatial image is not extreme but there are room acoustics.

nos1 Stable speech source with background noises.

nos2 A distinct speech source that overlaps with a reverberant speech source where the spatial locations are given by both ICLD and ICTD. The background noise is clearly noticeable with an approximately constant power over time.

Table E.1: Description of the database 1 stereo samples.

E.II. Training database

The training database contained 26 fullband or super wideband stereo samples with a total length of approximately 7 minutes. The samples are sampled at 48 kHz and represented by a 16-bit quantizer. There is one binaural recording, nine samples of clean and reverberant speech, six noisy speech samples, five samples of mixed content and five samples with music as presented in Table E.2. Note that the items consist of fullband stereo audio if nothing else is stated.

Item Characteristics

1bSAt1s1.c34 Female and male talkers that overlap in time so that no more than two sources occur simultaneously. The sources have stable spatial locations determined by ICLD and ICTD. Super-wideband

1bSAt1s1.c35 Non-overlapping talkers with stereo office noise backgrounds with stable locations determined by ICLD and ICTD. Super-wideband

1bSAt1s2.c34 Similar characteristics as 1bSAt1s1.c34. Super-wideband
1bSAt2s1.c34 Similar characteristics as 1bSAt1s1.c34. Super-wideband

1bSAt2s2.c33 Two female talkers that slightly overlap in time at stable locations from ICLD and ICTD. Super-wideband.

1bSAt2s3.c33 Similar characteristics as 1bSAt2s2.c33. Super-wideband
1bSAt3s2.c34 Similar characteristics as 1bSAt1s1.c34. Super-wideband

binaural Binaural recording of a speech source that moves around the listener. There is a slight background noise and some occasional secondary sources which can be described as footsteps and a creaking noise.

concat_stereo Concatenation of six very critical music samples of different characteristics.
cls1 See Table E.1, item cls1.
cls2 See Table E.1, item cls2.


cls3 See Table E.1, item cls3.
mix1 See Table E.1, item mix1.
mix2 See Table E.1, item mix2.
mus2 See Table E.1, item mus2.

mu6 Part of the song Drive my car by the Beatles, characterized by stereo channels with low correlation. The left channel consists mainly of bass, guitar and tambourine accompaniment while a cowbell is the main component of the right channel. The singing is, however, present in both stereo channels.

mu7 Introduction of the song I feel fine by the Beatles with panning of a tone and low correlated stereo channels.

nos1 See Table E.1, item nos1.

s_no_2t_1_org Four talkers in different locations due to ICLD and ICTD with background noise similar to a fan.

s_no_ft_3_org Talker in the right stereo channel with race car backgrounds in both channels.

Scene_5_AB AB recording of female talkers that overlap in time. The environment consists of low-frequency noise.

Scene_6_AB AB recording of female talkers that overlap in time. The environment consists of noise with a higher frequency than the noise in Scene_5_AB.

Scene3_AB_st AB recording of female talkers that overlap in time with a slight but not significant background noise.

Scene6_AB_st AB recording of male and female talkers that overlap in time with a slight but not significant background noise.

Sting Part of the song It’s Probably Me by Sting with an ambient stereo image.

SwedRota A moving talker walking around the listener in a very ambient environment with reverb and significant background noises.

Table E.2: Description of the training database stereo samples.

E.III. MPEG database

The database consists of 49 stereo samples which have been compiled by MPEG and used, for example, in the evaluation of MPEG-1 Layer 3 and MPEG-2 AAC [75]. All samples consist of fullband audio sampled at 48 kHz and represented by a 16-bit quantizer. The total length of the database is about 16 minutes and the samples have mainly music characteristics with both single and multiple instruments. In addition there are some samples of clean speech and mixed content, e.g. music in rain. The spatial properties of the samples are very different, from mono-like samples to wide and complex stereo images. The items of the database are presented in Table E.3, where they are listed by the names given by MPEG.

Item Name
te01 Dorita
te02 We shall be happy
te03 Castanets
te04 Harpsichord
te05 Pitch Pipe
te06 Glockenspiel
te07 Male German Speech
te08 Suzanne Vega
te09 Tracy Chapman
te10 Fireworks
te11 Ornette Coleman
te12 Bass Synth
te13 Bass guitar


te14 Haydn Trumpet Concert
te15 Carmen
te16 Accordion/Triangle
te17 Tambourine
te18 Percussion
te19 Male speech
te20 George Duke
te21 Asa Jinder
te22 Dire Straits
te23 Dalarnas Spelmansförbund
te24 Stefan Nilsson
te25 Stravinsky
te26 Ravel
te27 Triangles
te28 Clay
te29 spiral wave
te30 aimai
te31 ether
te32 Palmtop boogie
te33 <CROISEMENT I> pour hautbois, violon et contrebasse
te34 drifting
te35 dramatics
te36 O1
te37 Fourth
te38 Interlude by Halves for violin, flute and piano
te39 accellation
te40 atmosphere
te41 fanfare
te42 Kids Drive Dance(KDD)
te43 Bass clarinet
te44 Bransle
te45 Brel
te46 Guitar + Castanets
te47 Fools
te48 Layla
te49 Music Rain

Table E.3: List of the MPEG database stereo samples.

E.IV. MUSHRA database

The MUSHRA database consists of ten fullband or super-wideband stereo samples of clean, reverberant and noisy speech, mixed content and music characteristics. The samples originate from both natural and post-processed recordings where one of the samples is binaurally recorded. The items are sampled at 48 kHz and represented by a 16-bit quantizer.


They are named by their characteristics as

• 1 binaural recording (bi1)
• 1 reverberant speech (rs1)
• 2 clean speech (cs1, cs2)
• 2 noisy speech (ns1, ns2)
• 2 mixed content (mi1, mi2)
• 2 music (mu1, mu2)

In Table E.4 the items are briefly presented by their characteristics as perceived in headphone listening. Note that the items consist of fullband stereo audio if nothing else is stated.

Item Characteristics
bi1 Critical part of the binaural sample in Table E.2.

rs1 Part of the a cappella song Tom's Diner by Suzanne Vega which contains reverberant singing.

cs1 See Table E.1, item cls2.
cs2 See Table E.2, item 1bSAt2s2.c33. Super-wideband

ns1 Part of Scene3_AB_st (see Table E.2). Two simultaneous female talkers where one of them is moving. Noticeable background noises which do not overpower the speech sources.

ns2 Part of Scene6_AB_st (see Table E.2). Three talkers (two female and one male) which partly overlap each other. The spatial locations are stable to the left, the middle and to the right for the three sources respectively. Noticeable background noises which do not overpower the speech sources.

mi1 See Table E.2, item SwedRota.
mi2 Speech and music with a high level of ambient jubilation and applause.
mu1 See Table E.2, item mu6.

mu2 Part of the song Jolene by Dolly Parton. A wide stereo image of guitar picking with singing located in the middle (front) of the stereo image.

Table E.4: List of the MUSHRA database stereo samples.