MPEG Spatial Audio Object Coding (SAOC) · – Audio coders are good at music, but not at speech...

Prof. Dr.-Ing. K. Brandenburg, Dr.-Ing. G. Schuller

MPEG Spatial Audio Object Coding (SAOC)

Prof. Dr.-Ing. Gerald Schuller

Fraunhofer IDMT & Ilmenau Technical University, Ilmenau, Germany


Overview

• Concept

• MPEG Surround integration

• Advantages of SAOC

• Applications

• Conclusion


Concept: From MPEG Surround to SAOC (1)

Current Spatial Audio Coding: Channel-oriented (MPEG Surround)

[Figure: Chan. #1 ... #4 -> SAC Encoder -> downmix signal(s) + side info -> SAC Decoder -> Chan. #1 ... #4]


Object-oriented Spatial Audio Coding

[Figure: Obj. #1 ... #4 -> SAOC Encoder -> downmix signal(s) + side info -> SAOC Decoder -> obj. #1 ... #4 -> Renderer (driven by Interaction/Control) -> Chan. #1, Chan. #2, ...]

Concept: From MPEG Surround to SAOC (2)

• Processes object signals instead of channel signals

• Side info: a few kbit/s per audio object

• Mono or stereo downmix

• “Mixing”/rendering parameters vary according to real-time user interaction
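The object-oriented idea can be illustrated with a toy sketch. This is not the MPEG SAOC algorithm: it assumes a mono downmix, a single frame and a single frequency band, and it uses per-object relative powers as a stand-in for the side information. The function names `saoc_encode` and `saoc_render` are hypothetical.

```python
def saoc_encode(objects):
    """Toy encoder: mono downmix plus one relative-power value per object.

    objects: list of equal-length sample lists (one frame, one band).
    """
    n = len(objects[0])
    downmix = [sum(obj[i] for obj in objects) for i in range(n)]
    powers = [sum(s * s for s in obj) for obj in objects]
    total = sum(powers) or 1.0
    # Assumed stand-in for the side info: each object's share of the total power.
    side_info = [p / total for p in powers]
    return downmix, side_info


def saoc_render(downmix, side_info, gains):
    """Toy decoder/renderer for one output channel.

    Each object is approximated as (its power share) * downmix, then weighted
    by the gain the user chose for it. Nothing is separated exactly; the
    downmix is only re-weighted.
    """
    scale = sum(g * w for g, w in zip(gains, side_info))
    return [scale * s for s in downmix]


objects = [[1.0, -1.0, 1.0, -1.0], [0.5, 0.5, 0.5, 0.5]]
downmix, side_info = saoc_encode(objects)
unchanged = saoc_render(downmix, side_info, gains=[1.0, 1.0])
second_only = saoc_render(downmix, side_info, gains=[0.0, 1.0])
```

With all gains set to 1 the rendered signal equals the downmix, which mirrors the backwards-compatibility idea: a device that ignores the side info simply plays the downmix.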


MPEG Surround integration/extension

[Figure: Obj. #1 ... #4 -> SAOC Encoder -> downmix signal(s) + SAOC bitstream -> SAOC Transcoder (driven by Interaction/Control) -> downmix signal(s) + MPS bitstream -> MPEG Surround Decoder -> Chan. #1, Chan. #2, ...; the SAOC Transcoder and the MPEG Surround Decoder together form the Combined Decoder]

• MPEG SAOC decoder = MPEG SAOC Transcoder + MPEG Surround decoder


Advantages using MPEG SAOC (1)

Key features:

• Highly efficient storage/transport of individual audio objects ..

• .. in a backwards compatible downmix

• User interactive rendering of the audio objects (e.g. move or amplify objects)

• Flexible rendering configurations (e.g. 2.0, 5.1, binaural, ..)


Advantages using MPEG SAOC (2)

Other features:

• Low complexity decoding/rendering for a large number of objects compared with individually encoded and rendered objects

• Compatible with any core codec (for the downmix)

• Powerful rendering engine (= MPEG Surround) integrated, no additional solution required


Applications (1)

• Interactive Remix / Karaoke

Examples:

– Suppress / attenuate instruments or vocals (Karaoke)

– Modify the original track to reflect current preference (e.g. “more drums & less strings” for a dance party)

– Choose between different vocal tracks (“female lead vocal vs. male lead vocal”)

– Control the dialog/speech level in movies/news broadcasts for better speech intelligibility

Main feature:

• Backwards compatibility


Applications (2)

• Gaming / Rich Media

Examples:

– Efficient and flexible audio transport in multi-player games or applications (e.g. Second Life)

– Efficient storage together with flexible rendering of audio in small interactive games

Main feature:

• Storage / bitrate efficiency


Applications (3)

• Teleconferencing

Examples:

– Mobile conference over headphones: Virtual 3D-audio line-up of communication partners all around the listener

– Conference setup with 2 or more loudspeakers: Spatial distribution of communication partners

Main feature:

• Quality improvement:

– Increased speech intelligibility

– Increased listening comfort


Conclusions SAOC

• Highly efficient transport/storage of audio objects and flexible/interactive audio scene rendering

• Backwards compatible downmix for reproduction on legacy devices

• Flexible rendering configurations

• Under standardization within MPEG

• Very interesting applications, e.g.:

– Remixing/Karaoke

– Gaming

– Teleconferencing


MPEG Parametric Surround

• Signal is decomposed into several bands (flexible configuration) outside the core coder

• For certain groups of bands, the Interaural Level Difference (ILD), the Interaural Time Difference (ITD) and a coherence value (the correlation, i.e. the concentration in space) are determined

• These parameters are used to generate the side information

• The down-mix is either a stereo signal or a mono signal

• The decoder uses the down-mix and the side information to generate surround sound which sounds “similar” to the original (psychoacoustics!)
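The parameter extraction described above can be sketched for a single stereo frame. This is a simplified illustration, not the MPEG Surround analysis: it assumes a plain DFT instead of the codec's filter bank, the band edges are hypothetical, and the time difference is left out for brevity. The name `spatial_parameters` is made up for this example.

```python
import cmath
import math


def dft(frame):
    """Naive DFT (O(N^2)), returning bins 0..N/2; fine for a short frame."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n // 2 + 1)
    ]


def spatial_parameters(left, right, band_edges):
    """Per-band level difference (dB) and coherence for one stereo frame.

    band_edges: hypothetical DFT-bin boundaries, e.g. [0, 8, 17].
    """
    spec_l, spec_r = dft(left), dft(right)
    eps = 1e-12  # avoids division by zero in silent bands
    params = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        p_l = sum(abs(spec_l[k]) ** 2 for k in range(lo, hi))
        p_r = sum(abs(spec_r[k]) ** 2 for k in range(lo, hi))
        cross = sum(spec_l[k] * spec_r[k].conjugate() for k in range(lo, hi))
        level_diff_db = 10.0 * math.log10((p_l + eps) / (p_r + eps))
        coherence = abs(cross) / math.sqrt((p_l + eps) * (p_r + eps))
        params.append((level_diff_db, coherence))
    return params


n = 32
left = [math.sin(2 * math.pi * 2 * t / n) for t in range(n)]
right = [0.5 * s for s in left]  # same signal at half the amplitude
params = spatial_parameters(left, right, band_edges=[0, 8, 17])
```

In this example the right channel is a scaled copy of the left, so the first band shows a level difference of about 6 dB and a coherence close to 1.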


Universal Speech and Audio Coding (USAC)

• Problem:

– Speech coders are good at speech but not at music,

– Audio coders are good at music, but not at speech (speech is too non-stationary; the 1024-sample block size smears the quantization noise over time and makes speech sound reverberant)

• MPEG decided to tackle the problem

• Goal: to come up with a universal coder which handles speech and audio as well as the best speech or audio coder in that bit-rate range


Universal Speech and Audio Coding

• A competition was conducted by MPEG

• Winner of this competition was a joint submission by Fraunhofer IIS and VoiceAge Corp. in Canada

• Their submission was a combination of VoiceAge’s AMR-WB+ coder and Fraunhofer’s HE-AAC coder

• The bit-rate range for the competition was about 12 to 64 kb/s

• Target is mainly mobile devices (wireless phones, digital radio…)


Universal Speech and Audio Coding

• We already know HE-AAC

• But how does the VoiceAge coder work?

• Answer: It is based on CELP (Code Excited Linear Prediction)

• CELP is based on predictive coding, just as we saw for ULD or lossless predictive coding

• Here: usually prediction of order 12 (this was found to be sufficient to model the human vocal tract for speech production)
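Linear prediction of a given order can be sketched with the autocorrelation method and the Levinson-Durbin recursion. This is a generic textbook formulation, not the exact analysis used in AMR-WB+; the function names and the default `order=12` (taken from the slide) are choices made for this example.

```python
def autocorrelation(x, max_lag):
    """r[lag] = sum over n of x[n] * x[n - lag], for lag = 0..max_lag."""
    return [
        sum(x[n] * x[n - lag] for n in range(lag, len(x)))
        for lag in range(max_lag + 1)
    ]


def levinson_durbin(r, order):
    """Predictor coefficients a[1..order] for x_hat[n] = sum a[k] * x[n-k]."""
    a = [0.0] * (order + 1)  # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # signal is already perfectly predicted (or silent)
        acc = r[i] - sum(a[k] * r[i - k] for k in range(1, i))
        refl = acc / err  # reflection coefficient of stage i
        new_a = a[:]
        new_a[i] = refl
        for k in range(1, i):
            new_a[k] = a[k] - refl * a[i - k]
        a = new_a
        err *= 1.0 - refl * refl
    return a[1:], err


def lpc(x, order=12):
    """Convenience wrapper: predictor coefficients and residual energy."""
    return levinson_durbin(autocorrelation(x, order), order)


def residual(x, coeffs):
    """Prediction error e[n] = x[n] - x_hat[n]; missing history counts as 0."""
    out = []
    for n in range(len(x)):
        pred = sum(c * x[n - 1 - k] for k, c in enumerate(coeffs) if n - 1 - k >= 0)
        out.append(x[n] - pred)
    return out


signal = [0.9 ** n for n in range(200)]  # decaying exponential, first-order process
coeffs, err = lpc(signal)
e = residual(signal, coeffs)
```

For this first-order test signal the first coefficient comes out close to 0.9 and the remaining eleven are close to zero; speech needs the higher orders to capture several vocal tract resonances.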


Universal Speech and Audio Coding

• The prediction residual is then encoded using codebook vectors, called the code excitation, taken from a fixed codebook (innovation) and an adaptive codebook (past samples)


CELP (Code Excited Linear Prediction)

• Structure of the CELP decoder (from Wikipedia, CELP):

[Figure: CELP decoder block diagram; labels: decoder prediction filter (usually order 12), constantly adapted delay]
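The decoder structure can be sketched in a few lines: the excitation is the sum of a scaled fixed-codebook vector and a scaled, delayed copy of the past excitation (the adaptive codebook), and the result drives the prediction (synthesis) filter. This is a minimal illustration under simplifying assumptions (integer delay, no post-filtering, no gain quantization); `celp_decode_frame` and its parameters are hypothetical names.

```python
def celp_decode_frame(fixed_vec, fixed_gain, pitch_lag, adaptive_gain,
                      lpc_coeffs, past_excitation, filter_state):
    """Decode one frame in a toy CELP decoder.

    fixed_vec:       codevector from the fixed (innovation) codebook
    pitch_lag:       delay in samples (>= 1) into the past excitation
    lpc_coeffs:      a[1..p], with x_hat[n] = sum a[k] * x[n-k]
    past_excitation: earlier excitation samples, at least pitch_lag of them
    filter_state:    last p output samples, most recent last
    """
    history = list(past_excitation)
    excitation = []
    for sample in fixed_vec:
        adaptive = history[-pitch_lag]  # excitation one pitch period back
        e = fixed_gain * sample + adaptive_gain * adaptive
        excitation.append(e)
        history.append(e)

    # Synthesis filter 1/A(z): output = excitation + prediction from past output.
    state = list(filter_state)
    output = []
    for e in excitation:
        pred = sum(a * state[-1 - k] for k, a in enumerate(lpc_coeffs))
        y = e + pred
        output.append(y)
        state.append(y)

    p = len(lpc_coeffs)
    return output, history, state[len(state) - p:]


out, history, state = celp_decode_frame(
    fixed_vec=[1.0, 0.0, 0.0, 0.0], fixed_gain=1.0,
    pitch_lag=2, adaptive_gain=0.5,
    lpc_coeffs=[0.5], past_excitation=[0.0, 0.0], filter_state=[0.0],
)
```

The returned `history` and `state` would be passed into the next frame, which is how the adaptive codebook and the filter memory carry over between frames.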


Universal Speech and Audio Coding

• ACELP (Algebraic CELP): The codebook is not explicitly stored, but algebraically described by pulses and their distances to the next pulses

• AMR: VoiceAge speech coder (for instance for 3GPP), for bit rates between about 4.75 and 12.2 kb/s

• AMR-WB: Wideband extension (up to 7 kHz bandwidth), 6.6 to 23.85 kb/s

• AMR-WB+: Used for the MPEG submission, has a transform coding kernel in it too, to obtain higher bandwidth and bit rates up to about 32 kb/s

Source: IEEE Transactions on Speech and Audio Processing, Bessette et al., 2002
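The algebraic codebook idea can be sketched as follows: instead of storing codevectors, a vector is built on demand from a few signed unit pulses. This follows only the description above (a first position plus distances to the next pulses); real ACELP codecs define their own pulse position layouts, which are not reproduced here, and `algebraic_codevector` is a hypothetical name.

```python
def algebraic_codevector(length, first_position, distances, signs):
    """Build a sparse codevector from signed unit pulses.

    first_position: index of the first pulse
    distances:      distance from each pulse to the next one
    signs:          +1 or -1 for each pulse (one more entry than distances)
    """
    positions = [first_position]
    for d in distances:
        positions.append(positions[-1] + d)
    if len(signs) != len(positions):
        raise ValueError("need exactly one sign per pulse")
    if positions[0] < 0 or positions[-1] >= length:
        raise ValueError("pulse outside the codevector")

    vec = [0.0] * length
    for pos, sign in zip(positions, signs):
        vec[pos] += float(sign)
    return vec


vec = algebraic_codevector(8, first_position=1, distances=[3, 2], signs=[1, -1, 1])
```

Only the pulse description has to be transmitted, so the codebook costs no table memory and the search can exploit the sparsity of the vectors.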


Universal Speech and Audio Coding

• AMR-WB+ has a transform based mode called TCX, which is based on an FFT (not an MDCT)

• The TCX mode is switchable: The audio stream is divided into 80 ms “super frames”, each of which consists of two 40 ms frames, and each 40 ms frame consists of two 20 ms frames

• For each 20 ms frame it is decided whether ACELP or TCX is used

• For TCX it is decided whether it is applied to frames of 20 ms, 40 ms, or 80 ms, to obtain different numbers of subbands

Source: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2005, Bessette et al.
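The switching structure can be made concrete by enumerating the possible mode combinations of one 80 ms super frame. The sketch assumes, as the frame hierarchy above suggests, that a 40 ms TCX frame must coincide with one of the two 40 ms halves; the function names and mode labels are made up for this example.

```python
from itertools import product


def half_frame_options():
    """Ways to code one 40 ms half: a single TCX40, or two 20 ms frames."""
    options = [("TCX40",)]
    options += list(product(("ACELP", "TCX20"), repeat=2))
    return options


def superframe_options():
    """Ways to code one 80 ms super frame: a single TCX80, or two 40 ms halves."""
    options = [("TCX80",)]
    for first, second in product(half_frame_options(), repeat=2):
        options.append(first + second)
    return options


modes = superframe_options()
print(len(modes))  # 5 options per half -> 5 * 5 + 1 = 26 combinations
```

Each half has five options (one TCX40 plus four ACELP/TCX20 pairs), so there are 25 combinations of halves plus the single TCX80 case.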


Universal Speech and Audio Coding (USAC)

• USAC combines AMR-WB+ with HE-AAC

• An important component is a suitable switch between them, such that for the current audio signal the suitable coder is selected

• Some integration between subband coding modes in AMR-WB+ and HE-AAC


Universal Speech and Audio Coding

• Tests showed: the resulting codec is indeed at least as good as a virtual coder that always picks the better of HE-AAC and AMR-WB+ (which was a requirement)

• It was tested on speech, audio, and mixed speech and audio (the latter being the most difficult)

• That showed that the goal was reached
