MPEG Spatial Audio Object Coding (SAOC)
Prof. Dr.-Ing. K. Brandenburg, [email protected] Dr.-Ing. G. Schuller, [email protected] Page 1
MPEG Spatial Audio Object Coding (SAOC)
Prof. Dr.-Ing. Gerald Schuller
Fraunhofer IDMT & Ilmenau Technical University, Ilmenau, Germany
Overview
• Concept
• MPEG Surround integration
• Advantages of SAOC
• Applications
• Conclusion
Concept: From MPEG Surround to SAOC (1)
Current Spatial Audio Coding: Channel-oriented (MPEG Surround)
[Block diagram: Chan. #1 ... Chan. #4 → SAC Encoder → downmix signal(s) + side info → SAC Decoder → Chan. #1 ... Chan. #4]
Concept: From MPEG Surround to SAOC (2)
Object-oriented Spatial Audio Coding
[Block diagram: Obj. #1 ... Obj. #4 → SAOC Encoder → downmix signal(s) + side info → SAOC Decoder → obj. #1 ... obj. #4 → Renderer (with Interaction/Control input) → Chan. #1, Chan. #2, ...]
• Processes object signals instead of channel signals
• Side info: a few kbit/s per audio object
• Mono or stereo downmix
• “Mixing”/rendering parameters vary according to real-time user interaction
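The object-based idea above can be illustrated with a toy sketch. This is not the MPEG SAOC algorithm: a real codec estimates its parameters per time/frequency tile, whereas this sketch uses a single relative power per object for the whole signal, so it only shows how a downmix, a little side info and user-chosen rendering gains fit together. The function names `saoc_encode` and `saoc_render` are made up for this example.

```python
def saoc_encode(objects):
    """Toy encoder: mono downmix plus one relative power per object (side info)."""
    n = len(objects[0])
    downmix = [sum(obj[i] for obj in objects) for i in range(n)]
    powers = [sum(x * x for x in obj) for obj in objects]
    total = sum(powers) or 1.0
    side_info = [p / total for p in powers]  # power fractions, summing to 1
    return downmix, side_info


def saoc_render(downmix, side_info, render_gains):
    """Toy decoder/renderer: one scaled copy of the downmix per output channel.

    render_gains[ch][k] is the user-chosen gain of object k in channel ch.
    The power gain of a channel is the side-info-weighted mean of the squared
    rendering gains, which is exact only if the objects are uncorrelated.
    """
    channels = []
    for gains in render_gains:
        power_gain = sum(w * g * g for w, g in zip(side_info, gains))
        scale = power_gain ** 0.5
        channels.append([scale * x for x in downmix])
    return channels


# Two objects, rendered so that object 0 goes left and object 1 goes right.
objects = [[1.0, 0.0, 1.0, 0.0], [0.0, 2.0, 0.0, 2.0]]
downmix, side_info = saoc_encode(objects)
left, right = saoc_render(downmix, side_info, [[1.0, 0.0], [0.0, 1.0]])
```

Changing the entries of `render_gains` at playback time corresponds to the "Interaction/Control" input: the transmitted downmix and side info stay the same, only the rendering changes.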
MPEG Surround integration/extension
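[Block diagram: Obj. #1 ... Obj. #4 → SAOC Encoder → downmix signal(s) + SAOC bitstream → SAOC Transcoder (with Interaction/Control input) → downmix signal(s) + MPS bitstream → MPEG Surround Decoder → Chan. #1, Chan. #2, ...; the SAOC Transcoder and the MPEG Surround Decoder together form the combined decoder]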
• MPEG SAOC decoder = MPEG SAOC Transcoder + MPEG Surround decoder
Advantages using MPEG SAOC (1)
Key features:
• Highly efficient storage/transport of individual audio objects ...
• ... in a backwards-compatible downmix
• User-interactive rendering of the audio objects (e.g. move or amplify objects)
• Flexible rendering configurations (e.g. 2.0, 5.1, binaural, ...)
Advantages using MPEG SAOC (2)
Other features:
• Low-complexity decoding/rendering for a large number of objects, compared with individually encoded and rendered objects
• Compatible with any core codec (for the downmix)
• Powerful rendering engine (= MPEG Surround) integrated, no additional solution required
Applications (1)
• Interactive Remix / Karaoke
Examples:
– Suppress / attenuate instruments or vocals (Karaoke)
– Modify the original track to reflect current preference (e.g. “more drums & less strings” for a dance party)
– Choose between different vocal tracks (“female lead vocal vs. male lead vocal”)
– Control the dialog/speech level in movies/news broadcasts for better speech intelligibility
Main feature:
• Backwards compatibility
Applications (2)
• Gaming / Rich Media
Examples:
– Efficient and flexible audio transport in multi-player games or applications (e.g. Second Life)
– Efficient storage together with flexible rendering of audio in small interactive games
Main feature:
• Storage / bit-rate efficiency
Applications (3)
• Teleconferencing
Examples:
– Mobile conference over headphones: virtual 3D-audio line-up of communication partners all around the listener
– Conference setup with 2 or more loudspeakers: spatial distribution of communication partners
Main feature:
• Quality improvement:
– Increased speech intelligibility
– Increased listening comfort
Conclusions SAOC
• Highly efficient transport/storage of audio objects and flexible/interactive audio scene rendering
• Backwards-compatible downmix for reproduction on legacy devices
• Flexible rendering configurations
• Under standardization within MPEG
• Very interesting applications, e.g.:
– Remixing/Karaoke
– Gaming
– Teleconferencing
MPEG Parametric Surround
• The signal is decomposed into several bands (flexible configuration) outside the core coder
• For certain groups of bands, the Interaural Level Difference (ILD), the Interaural Time Difference (ITD) and a coherence value (the correlation, i.e. how concentrated the sound is in space) are determined
• These parameters are used to generate the side information
• The downmix is either a stereo signal or a mono signal
• The decoder uses the downmix and the side information to generate surround sound which sounds “similar” to the original (psychoacoustics!)
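As a rough illustration of the parameters listed above, the following sketch computes a level difference and a coherence value for one pair of channel signals. It assumes the inputs are already the samples of a single band (the band decomposition is not shown), it leaves out the time difference, and the function names and the small `eps` guard are choices made for this example rather than anything prescribed by the standard.

```python
import math


def level_difference_db(left, right, eps=1e-12):
    """Level difference between two band signals in dB (positive: left is louder)."""
    p_left = sum(x * x for x in left)
    p_right = sum(x * x for x in right)
    return 10.0 * math.log10((p_left + eps) / (p_right + eps))


def coherence(left, right, eps=1e-12):
    """Normalized cross-correlation at lag zero, in the range [-1, 1]."""
    cross = sum(l * r for l, r in zip(left, right))
    norm = math.sqrt(sum(l * l for l in left) * sum(r * r for r in right))
    return cross / (norm + eps)


def mono_downmix(left, right):
    """Simple mono downmix of the two channels."""
    return [0.5 * (l + r) for l, r in zip(left, right)]


# Same waveform in both channels, right channel at half the amplitude.
left = [1.0, -1.0, 1.0, -1.0]
right = [0.5, -0.5, 0.5, -0.5]
ild = level_difference_db(left, right)   # about 6 dB
coh = coherence(left, right)             # about 1 (fully correlated)
mono = mono_downmix(left, right)
```

Only the downmix and a few such numbers per band group need to be transmitted, which is why the side information is small compared with coding each channel separately.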
Universal Speech and Audio Coding (USAC)
• Problem:
– Speech coders are good at speech, but not at music
– Audio coders are good at music, but not at speech (speech is too non-stationary; the 1024-sample block size smears the quantization noise and makes speech sound reverberant)
• MPEG decided to tackle the problem
• Goal: a universal coder which handles speech and audio as well as the best speech or audio coder in that bit-rate range
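To get a feeling for the 1024-sample block size mentioned above, the block length can be converted into a duration. The sampling rates used here are example assumptions, not values taken from the slides.

```python
def block_duration_ms(block_size, sample_rate_hz):
    """Duration of one block of samples in milliseconds."""
    return 1000.0 * block_size / sample_rate_hz


# Example sampling rates (assumed for illustration only).
durations = {rate: block_duration_ms(1024, rate) for rate in (48000, 44100, 32000)}
# 48 kHz: about 21.3 ms, 44.1 kHz: about 23.2 ms, 32 kHz: 32.0 ms
```

Quantization noise is spread over the whole block, so the lower the sampling rate, the longer the time span over which a speech onset is smeared.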
Universal Speech and Audio Coding
• A competition was conducted by MPEG
• The winner of this competition was a joint submission by Fraunhofer IIS and VoiceAge Corp. in Canada
• Their submission was a combination of VoiceAge’s AMR-WB+ coder and Fraunhofer’s HE-AAC coder
• The bit-rate range for the competition was about 12 to 64 kb/s
• The target is mainly mobile devices (wireless phones, digital radio, ...)
Universal Speech and Audio Coding
• We already know HE-AAC
• But how does the VoiceAge coder work?
• Answer: it is based on CELP (Code Excited Linear Prediction)
• CELP is based on predictive coding, just as we saw for ULD or lossless predictive coding
• Here: usually prediction of order 12 (this was found to be sufficient to model the human vocal tract for speech production)
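A common way to obtain such predictor coefficients is the autocorrelation method with the Levinson-Durbin recursion. The sketch below shows it in plain Python; it is a generic textbook procedure and not claimed to be the exact analysis used in AMR-WB+ (windowing, lag windowing and coefficient quantization are left out). The function names are chosen for this example.

```python
def autocorrelation(signal, max_lag):
    """Autocorrelation values r[0..max_lag] of a finite signal."""
    n = len(signal)
    return [sum(signal[i] * signal[i - lag] for i in range(lag, n))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the normal equations for a linear predictor.

    Returns (a, err) such that the prediction is
    x_hat[n] = sum(a[k] * x[n - 1 - k] for k in range(order))
    and err is the remaining prediction error energy.
    """
    a = [0.0] * order
    err = r[0]
    for m in range(order):
        if err <= 0.0:
            break  # signal is perfectly predictable (or silent)
        acc = r[m + 1] - sum(a[k] * r[m - k] for k in range(m))
        refl = acc / err  # reflection coefficient of this stage
        new_a = a[:]
        new_a[m] = refl
        for k in range(m):
            new_a[k] = a[k] - refl * a[m - 1 - k]
        a = new_a
        err *= 1.0 - refl * refl
    return a, err


# Test signal: impulse response of a one-pole filter x[n] = 0.9 * x[n-1].
x = [0.9 ** n for n in range(200)]
coeffs, residual_energy = levinson_durbin(autocorrelation(x, 12), 12)
```

For this test signal an order-12 predictor finds a first coefficient close to 0.9 and the remaining eleven close to zero, since one past sample already predicts the signal.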
Universal Speech and Audio Coding
• The prediction residual is then encoded using codebook vectors, called code excitation, drawn from a fixed codebook (innovation) and an adaptive codebook (past samples)
CELP (Code Excited Linear Prediction)
• Structure of the CELP decoder (from Wikipedia, CELP):
[Figure: CELP decoder block diagram; labels: decoder prediction filter (usually order 12), constantly adapted delay]
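The decoder structure can be sketched as follows, under simplifying assumptions: one subframe, an integer pitch lag, no post-filtering, and no decoding of the transmitted indices into gains and vectors. The excitation is the sum of a scaled fixed-codebook vector and a scaled copy of the past excitation (the adaptive codebook, read with the adapted delay), and it is then passed through the prediction (synthesis) filter. The function name `celp_decode_subframe` and its argument layout are made up for this example.

```python
def celp_decode_subframe(fixed_vec, fixed_gain, pitch_lag, pitch_gain,
                         lpc, past_excitation, past_output):
    """Decode one subframe of a simplified CELP decoder.

    past_excitation must hold at least pitch_lag samples and past_output at
    least len(lpc) samples. Returns (output, excitation_history, output_history).
    """
    # 1) Excitation = fixed codebook (innovation) + adaptive codebook (past samples).
    exc_hist = list(past_excitation)
    excitation = []
    for c in fixed_vec:
        adaptive = exc_hist[-pitch_lag]  # excitation one pitch period earlier
        e = fixed_gain * c + pitch_gain * adaptive
        exc_hist.append(e)
        excitation.append(e)

    # 2) Synthesis filter: s[n] = e[n] + sum_k lpc[k] * s[n - 1 - k].
    out_hist = list(past_output)
    output = []
    for e in excitation:
        prediction = sum(a * out_hist[-1 - k] for k, a in enumerate(lpc))
        s = e + prediction
        out_hist.append(s)
        output.append(s)
    return output, exc_hist, out_hist


# Tiny example: a single pulse, pitch lag 2, first-order predictor.
samples, _, _ = celp_decode_subframe(
    fixed_vec=[1.0, 0.0, 0.0, 0.0], fixed_gain=1.0,
    pitch_lag=2, pitch_gain=0.5, lpc=[0.5],
    past_excitation=[0.0, 0.0], past_output=[0.0])
```

In the example the pulse reappears with half its amplitude two samples later (the adaptive codebook contribution), and the first-order filter adds a decaying tail to both.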
Universal Speech and Audio Coding
• ACELP (Algebraic CELP): the codebook is not explicitly stored, but algebraically described by pulses and their distances to the next pulses
• AMR: VoiceAge speech coder (for instance for 3GPP), for between about 4.75 and 12.2 kb/s
• AMR-WB: wideband extension (up to 7 kHz bandwidth), 6.6 to 23.85 kb/s
• AMR-WB+: used for the MPEG submission; it also contains a transform coding kernel, to obtain higher bandwidth and bit rates up to about 32 kb/s
Source: IEEE Transactions on Speech and Audio Processing, Bessette et al., 2002
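The algebraic codebook idea from the first bullet can be illustrated with a small sketch: instead of storing code vectors, each vector is generated from a few pulse positions and signs. The layout used here (40-sample subframe, 5 interleaved tracks, one pulse per track, 8 possible positions per track) is a hypothetical example and not the layout of a particular AMR mode; the function names are also made up.

```python
def decode_pulses(position_indices, signs, num_tracks=5):
    """Map per-track position indices to sample positions.

    Track t contains the sample positions t, t + num_tracks, t + 2*num_tracks, ...
    """
    return [(track + num_tracks * index, sign)
            for track, (index, sign) in enumerate(zip(position_indices, signs))]


def algebraic_codevector(length, pulses):
    """Build a sparse code vector from (position, sign) pairs, sign in {+1, -1}."""
    vec = [0] * length
    for position, sign in pulses:
        vec[position] += sign
    return vec


# One pulse per track: index in 0..7 (3 bits) plus a sign (1 bit) per pulse.
pulses = decode_pulses([0, 1, 2, 3, 7], [1, -1, 1, -1, 1])
codevector = algebraic_codevector(40, pulses)
bits_per_vector = 5 * (3 + 1)  # 20 bits address 2**20 different vectors
```

With this hypothetical layout, 20 bits select one of about a million code vectors without any table in memory, which is the point of describing the codebook algebraically.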
Universal Speech and Audio Coding
• AMR-WB+ has a transform-based mode called TCX, which is based on an FFT (not an MDCT)
• The TCX mode is switchable: the audio stream is divided into 80 ms “super frames”, which consist of two 40 ms frames, and each 40 ms frame consists of two 20 ms frames
• For each 20 ms frame it is decided whether ACELP or TCX is used
• For TCX it is decided whether it is applied to frames of 20 ms, 40 ms, or 80 ms, to obtain different numbers of subbands
Source: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2005, Bessette et al.
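The frame structure described above determines how many ways a super frame can be split between ACELP and TCX. The sketch below enumerates them, assuming (as the nesting suggests) that a 40 ms TCX frame must align with one of the two 40 ms halves and that ACELP always operates on 20 ms frames. The labels `TCX20`, `TCX40` and `TCX80` and the function names are used here only for illustration.

```python
from itertools import product


def half_options(start_ms):
    """All ways to code one 40 ms half: one TCX40, or two 20 ms frames."""
    options = [[("TCX40", start_ms, 40)]]
    for first, second in product(("ACELP", "TCX20"), repeat=2):
        options.append([(first, start_ms, 20), (second, start_ms + 20, 20)])
    return options


def superframe_configurations():
    """All mode combinations for one 80 ms super frame.

    Each configuration is a list of (mode, start_ms, duration_ms) tuples.
    """
    configs = [[("TCX80", 0, 80)]]
    for first_half, second_half in product(half_options(0), half_options(40)):
        configs.append(first_half + second_half)
    return configs


configs = superframe_configurations()
num_configs = len(configs)  # 1 + 5 * 5 = 26
```

Each 40 ms half has five options (one TCX40, or four ACELP/TCX20 pairs), which gives 25 combinations, plus the single 80 ms TCX frame.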
Universal Speech and Audio Coding (USAC)
• USAC combines AMR-WB+ with HE-AAC
• An important component is a suitable switch between them, such that the coder best suited to the current audio signal is selected
• There is some integration between the subband coding modes in AMR-WB+ and HE-AAC
Universal Speech and Audio Coding
• Tests showed: the resulting codec is indeed at least as good as a virtual coder that performs like the better of HE-AAC and AMR-WB+ (which was a requirement)
• It was tested on speech, audio, and mixed speech and audio (the latter being the most difficult)
• This showed that the goal was reached