Download - Image Compression and Video Compression 2004 Notes - 6 Audio Compression

1

Lecture notes of Image Compression and Video

Compression series

6. Perceptual Audio CodingProf. Amar Mukherjee

Weifeng Sun

#2

Topics

Introduction to Image CompressionTransform CodingSubband Coding, Filter BanksIntroduction to Wavelet TransformWavelet Image CompressionPerceptual Audio CodingVideo Compression

2

#3

Contents

SoundPhysiology of the EarHuman Sound Perception

PsychoacousticsAuditory Masking

Frequency maskingTemporal masking

MPEG Audio CompressionExample

#4

Many applications need digital audioStoring

ipodgames

StreamingInternet radio

Interactive multimedia session like VOIP, conferencing.

However, cannot apply entropy coder directly.Very random, missing repeated sequence of byte pattern.

Motivation

3

#5

Sound

Sound is a continuous signal that is created by compressing and decompressing the air.

This changing air pressure causes the eardrum to vibrate.

#6

Describing Sound

IntensityDefined as power per unit area.The level of perceived sound by a human is referred to as loudness.

Measured in terms of phons.Defined as the perceive intensity at 1000 Hz by normal human population.

Pitch frequency of sound (tone)

Quality (timbre)

4

#7

Decibels

For most audio specifications, sound levels are presented in terms of decibels (dB),

A log scale for intensity changes.The softest 1000 Hz tone heard by the normal ear is I0 = 1 picowatt/meter2.With I0 as a reference, all other intensities, I,can be given in terms of dB by:

#8

Physiology of the Ear

Thousands of “microphones”

hair cells in cochleaAutomatic gain control

muscles around transmission bones

Directivitypinna

Boost of middle frequencies

auditory canalNonlinear processing

auditory nerve

5

#9

The Outer Ear

For a point source of sound, it spreads out according to the inverse square law. For a given sound intensity, a larger ear captures more of the wave and hence more sound energy.

#10

The Tympanic Membrane

The tympanic membrane or "eardrum" receives vibrations traveling up the auditory canal and transfers them through the tiny ossicles to the oval window, the port into the inner ear. The eardrum is some fifteen times larger than the oval window, giving an amplification of about fifteen compared to the oval window alone.

6

#11

The Ossicles

The three tiniest bones in the body form the coupling between the vibration of the eardrum and the forces exerted on the oval window of the inner ear. Serves as an amplifier with a factor of about threeunder optimum conditions.Can be adjusted by muscle action to actually attenuate the sound signal for protection against loud sounds.

#12

The Inner Ear (Two Organs)

Semicircular canalsbody's balance organ

Cochleabody's microphoneconverting sound pressure impulsesfrom the outer ear into electrical impulseswhich are passed on to the brain via the auditory nerve.

7

#13

The Cochlea

Snail-shell like structure.Three fluid filled sections.

The organ of Corti is the sensor of pressure variations.

#14

Place Theory

High frequency sounds selectively vibrate the basilar membrane of the inner ear near the entrance port (the oval window). Lower frequencies travel further along the membrane before causing appreciable excitation of the membrane. The basic pitch determining mechanism is based on the location along the membrane where the hair cells are stimulated.

8

#15

The Auditory Nerve

Taking electrical impulses from the cochleaand the semicircular canals, the auditory nerve makes connections with both auditory areas of the brain.

#16

Auditory Area of Brain

Binaural relay When the auditory nerve from one ear takes information to the brain, that information is directly sent to both the processing areas on both sides of the brain.

9

#17

Perceptual Audio Compression

The basis of the Perceptual Codecs is Psychoacoustic Masking.An audio file will contains sounds that are not heard by us, even though these sounds lie within the human audible range.Masking Techniques:

Frequency (Concurrent) MaskingTemporal Masking

#18

Dynamic Range of Hearing

The practical dynamic range could be said to be from the threshold of hearing to the threshold of pain.

This remarkable dynamic range is enhanced by an effective amplification structure which extends its low end and by a protective mechanism which extends the high end.

130 decibels0 decibels

1013I0 = 10,000,000,000,000 I0I0

Threshold of Pain Threshold of Hearing

10

#19

Frequency Response of the Ear

The ear has a non-flat frequency response.

Tones played at the same volume with different frequencies can sound like they are being played at different volume levels. In order to maintain the same loudness over the hearing spectrum, the intensity level must vary.

#20

Equal-loudness Curve

Audible frequencyrange

20 to 20,000 Hz

Normal voice range 500 Hz to 2 kHz Low frequencies are vowels and bass High frequencies are consonants

11

#21

Human Hearing Sensitivity

ExperimentPut a person in a quiet room. Raise level of 1 kHz tone until just barely audible. Vary the frequencyPlot

#22

Frequency (Concurrent) Masking

Experiment: Play 1 kHz tone (masking tone) at fixed level (60 dB). Play test tone at a different level (e.g., 1.1 kHz), and raise level until just distinguishable. Vary the frequency of the test tone and plot the threshold when it becomes audible

12

#23


Frequency masking effect dependent on frequency.As shown people can detect lower frequency test tones closer than higher frequency test tones.

#24


Sub bands are uniform, but human hearing sensitivity isn’t, so psychoacoustic model and MPEG encoder compensates by preserving more data in lower filtered sub bands.There are about 25 critical bands.

13

#25

Temporal Masking

Experiment: Play 1 kHz masking tone at 60 dB, plus a test tone at 1.1 kHz at 40 dB. Test tone can't be heard (it's masked). Stop masking tone, then stop test tone after a short delay. Adjust delay time to the shortest time when test tone can be heard. Repeat with different level of the test tone and plot.

#26

Temporal Masking

PostmaskingIf the volume drops sharply then the human ear takes a few milliseconds to adjust, during this period the lower volume sounds are not heard and these can be discarded.

PremaskingIf the volume rises sharply then the human ear discards the last few milliseconds of the quieter sound and immediately starts processing the louder sound. The last few milliseconds of data before a sharp rise in volume can also be discarded.

The discarded sounds are replaced by silence, which can be run length encoded.

14

#27

Combination

#28

MPEG Audio Coding Algorithm

Subband filteringMasking for each band caused by nearby band using the psychoacoustic modelDiscard those bands If their power are below the masking threshold Quantization/bit-allocation/coding Format bitstream

15

#29

Filter Banks

Explains frequency-domain masking

#30

Frequency-domain Masking

16

#31

Example

After analysis, the first levels of 16 of the 32 bands are:

Band 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16Level(db)0 8 12 10 6 2 10 60 35 20 15 2 3 5 3 1

If the level of the 8th band is 60dB, it gives a masking of 12 dB in the 7th band, 15dB in the 9th. Level in 7th band is 10 dB ( < 12 dB ), so ignore it. Level in 9th band is 35 dB ( > 15 dB ), so send it.Only the amount above the masking level needs to be sent, so instead of using 6 bits to encode it, we can use 4 bits saving 2 bits (= 12 dB).

#32

MPEG Audio Compression

Three MPEG codecslayer 1

suitable for consumer recording384kbps for a stereo signal, compression ratio 4:1

layer 2suitable for professional recording and Broadcasting256..192 kbps for a stereo signal, compression ratio 6:1..8:1

layer 3 (mp3)suitable for Internet transmission128..112 kbps for a stereo signal, compression ratio 10:1..12:1

More on MP3Better critical band filter is used (non-equal frequencies)psychoacoustic model includes temporal masking effectstakes into account stereo redundancyuses Huffman coder