VIDEO COMPRESSING TECHNIQUES MODEL-VIDEOCOMP100
SIGMA TRAINERS, AHMEDABAD (INDIA)
INTRODUCTION

This trainer includes theory and software used for different types of video compression techniques.

1. Manual: Includes more than 200 pages discussing different types of video compression.
2. Video compression formats: To compress AVI, MPEG-1, MPEG-2 and WMV, and to compress to MPEG using VCD, SVCD, or DVD.
3. Video compression software:
   1. Blaze Media Pro software
   2. Alparysoft Lossless Video Codec
   3. MSU Lossless Video Codec
   4. DivX Player with DivX Pro Codec (98/Me)
   5. Elecard MPEG-2 Decoder & Streaming pack
VIDEO COMPRESSING TECHNIQUES - MPEG-2 VIDEO COMPRESSION

Video compression refers to reducing the quantity of data used to represent video content without excessively reducing the quality of the picture. It also reduces the number of bits required to store and/or transmit digital media. Compressed video can be transmitted more economically over a smaller carrier. Digital video requires high data rates - the better the picture, the more data is ordinarily needed. This means powerful hardware, and lots of bandwidth when video is transmitted. However, much of the data in video is not necessary for achieving good perceptual quality, because it can be easily predicted - for example, successive frames in a movie rarely change much from one to the next - and this makes data compression work well with video. Video compression can make video files far smaller with little perceptible loss in quality. For example, DVDs use a video coding standard called MPEG-2 that makes the movie 15 to 30 times smaller while still producing a picture quality that is generally considered high for standard-definition video. Without proper use of data compression techniques, either the picture would look much worse, or one would need more disks per movie.
Theory

Video is basically a three-dimensional array of color pixels. Two dimensions serve as spatial (horizontal and vertical) directions of the moving pictures, and one dimension represents the time domain. A frame is the set of all pixels that correspond to a single point in time; basically, a frame is the same as a still picture. (Frames are sometimes made up of fields; see interlace.) Video data contains spatial and temporal redundancy. Similarities can thus be encoded by merely registering differences within a frame (spatial) and/or between frames (temporal). Spatial encoding takes advantage of the fact that the human eye is unable to distinguish small differences in colour as easily as it can changes in brightness, so very similar areas of colour can be "averaged out" in a similar way to JPEG images (JPEG image compression FAQ, part 1/2). With temporal compression, only the changes from one frame to the next are encoded, as often a large number of the pixels will be the same across a series of frames (About video compression).
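As a minimal illustration of temporal compression, the Python sketch below (assuming 8-bit grayscale frames held as NumPy arrays) encodes a frame as its per-pixel difference from the previous frame; real codecs add motion compensation and entropy coding on top of this idea.

```python
import numpy as np

def encode_delta(prev_frame: np.ndarray, cur_frame: np.ndarray) -> np.ndarray:
    # Store only the per-pixel change from the previous frame.
    return cur_frame.astype(np.int16) - prev_frame.astype(np.int16)

def decode_delta(prev_frame: np.ndarray, delta: np.ndarray) -> np.ndarray:
    # Rebuild the current frame from the previous frame plus the delta.
    return (prev_frame.astype(np.int16) + delta).astype(np.uint8)

prev = np.full((4, 4), 120, dtype=np.uint8)   # a mostly static scene
cur = prev.copy()
cur[0, 0] = 200                               # only one pixel changed
delta = encode_delta(prev, cur)
assert np.array_equal(decode_delta(prev, delta), cur)
print(np.count_nonzero(delta), "of", delta.size, "pixels changed")
```

Because most entries of the delta are zero, a simple entropy coder can store it in far fewer bits than the full frame.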
Lossless compression

Some forms of data compression are lossless. This means that when the data is decompressed, the result is a bit-for-bit perfect match with the original. While lossless compression of video is possible, it is rarely used. This is because any lossless compression system will sometimes produce a file (or portions of one) that is as large as, or has the same data rate as, the uncompressed original. As a result, all hardware in a lossless system would have to be able to run fast enough to handle uncompressed video as well. This eliminates much of the benefit of compressing the data in the first place. For example, digital videotape cannot vary its data rate easily, so dealing with short bursts of maximum-data-rate video would be more complicated than running at the maximum rate all the time.
Intraframe vs interframe compression One of the most powerful techniques for compressing video is interframe compression. This works by comparing each frame in the video with the previous one. If the frame contains areas where nothing has moved, the system simply issues a short command that copies that part of the previous frame, bit-for-bit, into the next one. If objects move in a simple manner, the compressor emits a (slightly longer) command that tells the decompressor to shift, rotate, lighten, or darken the copy -- a longer command, but still much shorter than intraframe compression. Interframe compression is best for finished programs that will simply be played back by the viewer. Interframe compression can cause problems if it is used for editing.
Since interframe compression copies data from one frame to another, if the original frame is simply cut out (or lost in transmission), the following frames cannot be reconstructed. Some video formats, such as DV, compress each frame independently, as if they were all unrelated still images (using image compression techniques). This is called intraframe compression. Editing intraframe-compressed video is almost as easy as editing uncompressed video -- one finds the beginning and ending of each frame, and simply copies bit-for-bit each frame that one wants to keep, and discards the frames one doesn't want. Another difference between intraframe and interframe compression is that with intraframe systems, each frame uses a similar amount of data. In interframe systems, certain frames called "I frames" aren't allowed to copy data from other frames, and so require much more data than other frames nearby. (The "I" stands for intra-coded; such frames can be decoded independently.) It is possible to build a computer-based video editor that spots problems caused when I frames are edited out while other frames need them. This has allowed newer formats like HDV to be used for editing. However, this process demands a lot more computing power than editing intraframe-compressed video with the same picture quality.
MPEG (MOVING PICTURE EXPERTS GROUP)

MPEG is a set of standards established for the compression of digital video and audio data. It is the universal standard for digital terrestrial, cable and satellite TV, DVDs and digital video recorders. MPEG uses lossy compression within each frame similar to JPEG, which means pixels from the original images are permanently discarded. It also uses interframe coding, which further compresses the data by encoding only the differences between periodic frames (see interframe coding). MPEG performs the actual compression using the discrete cosine transform (DCT) method (see DCT). MPEG is an asymmetrical system: it takes longer to compress the video than it does to decompress it in the DVD player, PC, set-top box or digital TV set. As a result, in the early days, compression was performed only in the studio. As chips advanced and became less costly, they enabled digital video recorders, such as TiVos, to convert analog TV to MPEG and record it on disk in real time (see DVR).

MPEG-1 (Video CDs)

Although MPEG-1 supports higher resolutions, it is typically coded at 352x240 x 30 fps (NTSC) or 352x288 x 25 fps (PAL/SECAM). Full 704x480 and 704x576 frames (BT.601) were scaled down for encoding and scaled up for playback. MPEG-1 uses the YCbCr color space with 4:2:0 sampling, but did not provide a standard way of handling interlaced video. Data rates were limited to 1.8 Mbps, but often exceeded. See YCbCr sampling.

MPEG-2 (DVD, Digital TV)

MPEG-2 provides broadcast-quality video with resolutions up to 1920x1080. It supports a variety of audio/video formats, including legacy TV, HDTV and five-channel surround sound. MPEG-2 uses the YCbCr color space with 4:2:0, 4:2:2 and 4:4:4 sampling and supports interlaced video. Data rates are from 1.5 to 60 Mbps. See YCbCr sampling.
MPEG-4 (All Inclusive and Interactive) MPEG-4 is an extremely comprehensive system for multimedia representation and distribution. Based on a variation of Apple's QuickTime file format, MPEG-4 offers a variety of compression options, including low-bandwidth formats for transmitting to wireless devices as well as high-bandwidth for studio processing. See H.264.
MPEG-4 also incorporates AAC, which is a high-quality audio encoder. MPEG-4 AAC is widely used as an audio-only format (see AAC).
A major feature of MPEG-4 is its ability to identify and deal with separate audio and video objects in the frame, which allows separate elements to be compressed more efficiently and dealt with independently. User-controlled interactive sequences that include audio, video, text, 2D and 3D objects and animations are all part of the MPEG-4 framework. For more information, visit the MPEG Industry Forum at www.mpegif.org.
MPEG-7 (Meta-Data) MPEG-7 is about describing multimedia objects and has nothing to do with compression. It provides a library of core description tools and an XML-based Description Definition Language (DDL) for extending the library with
additional multimedia objects. Color, texture, shape and motion are examples of characteristics defined by MPEG-7.
MPEG-21 (Digital Rights Infrastructure) MPEG-21 provides a comprehensive framework for storing, searching, accessing and protecting the copyrights of multimedia assets. It was designed to provide a standard for digital rights management as well as interoperability. MPEG-21 uses the "Digital Item" as a descriptor for all multimedia objects. Like MPEG-7, it does not deal with compression methods.
The Missing Numbers MPEG-3 was abandoned after initial development because MPEG-2 was considered sufficient. Because MPEG-7 does not deal with compression, it was felt a higher number was needed to distance it from MPEG-4. MPEG-21 was coined for the 21st century.
MPEG Vs. Motion JPEG Before MPEG, a variety of non-standard Motion JPEG (M-JPEG) methods were used to create consecutive JPEG frames. Motion JPEG did not use interframe coding between frames and was easy to edit, but not as highly compressed as MPEG. For compatibility, video editors may support one of the Motion JPEG methods. MPEG can also be encoded without interframe compression for faster editing. See MP3, MPEG LA, MPEGIF, MPEG-2
MPEG-2 is a standard for "the generic coding of moving pictures and associated audio information." It is widely used around the world to specify the format of the digital television signals that are broadcast by terrestrial (over-the-air), cable, and direct broadcast satellite TV systems. It also specifies the format of movies and other programs that are distributed on DVD and similar disks. The standard allows text and other data, e.g., a program guide for TV viewers, to be added to the video and audio data streams. TV stations, TV receivers, DVD players, and other equipment are all designed to this standard. MPEG-2 was the second of several standards developed by the Moving Picture Experts Group (MPEG) and is an international standard (ISO/IEC 13818).
While MPEG-2 is the core of most digital television and DVD formats, it does not completely specify them. Regional institutions adapt it to their needs by restricting and augmenting aspects of the standard. See "Profiles and Levels" below.
MPEG-2 includes a Systems part (part 1) that defines two distinct (but related) container formats. One is Transport Stream, which is designed to carry digital video and audio over somewhat-unreliable media. MPEG-2 Transport Stream is commonly used in broadcast applications, such as ATSC and DVB. MPEG-2 Systems also defines Program Stream, a container format that is designed for reasonably reliable media such as disks. MPEG-2 Program Stream is used in the DVD and SVCD standards.
The Video part (part 2) of MPEG-2 is similar to MPEG-1, but also provides support for interlaced video (the format used by analog broadcast TV systems). MPEG-2 video is not optimized for low bit-rates (less than 1 Mbit/s), but outperforms MPEG-1 at 3 Mbit/s and above. All standards-conforming MPEG-2 Video decoders are fully capable of playing back MPEG-1 Video streams.
With some enhancements, MPEG-2 Video and Systems are also used in most HDTV transmission systems.
The MPEG-2 Audio part (defined in Part 3 of the standard) enhances MPEG-1's audio by allowing the coding of audio programs with more than two channels. Part 3 of the standard allows this to be done in a backwards compatible way, allowing MPEG-1 audio decoders to decode the two main stereo components of the presentation.
Part 7 of the MPEG-2 standard specifies a rather different, non-backwards-compatible audio format. Part 7 is referred to as MPEG-2 AAC. While AAC is more efficient than the previous MPEG audio standards, it is much more complex to implement, and somewhat more powerful hardware is needed for encoding and decoding.
Video coding (simplified)
An HDTV camera generates a raw video stream of more than one billion bits per second. This stream must be compressed if digital TV is to fit in the bandwidth of available TV channels and if movies are to fit on DVDs. Fortunately, video compression is practical because the data in pictures is often redundant in space and time. For example, the sky can be blue across the top of a picture and that blue sky can persist for frame after frame. Also, because of the way the eye works, it is possible to delete some data from video pictures with almost no noticeable degradation in image quality.
TV cameras used in broadcasting usually generate 50 pictures a second (in Europe and elsewhere) or 59.94 pictures a second (in North America and elsewhere). Digital television requires that these pictures be digitized so that they can be processed by computer hardware. Each picture element (a pixel) is then represented by one luminance number and two chrominance numbers. These describe the brightness and the color of the pixel (see YUV). Thus, each digitized picture is initially represented by three rectangular arrays of numbers.
A common (and old) trick to reduce the amount of data that must be processed per second is to separate the picture into two fields: the "top field," which is the odd numbered rows, and the "bottom field," which is the even numbered rows. The two fields are displayed alternately. This is called interlaced video. Two successive fields are called a frame. The typical frame rate is then 25 or 29.97 frames a second. If the video is not interlaced, then it is called progressive video and each picture is a frame. MPEG-2 supports both options.
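The field split itself is just a matter of taking alternate scan lines; a small sketch, assuming a frame stored as a NumPy array with one row per line:

```python
import numpy as np

frame = np.arange(6 * 4).reshape(6, 4)   # toy 6-line "picture"
top_field = frame[0::2]                  # 1st, 3rd, 5th ... lines (odd-numbered)
bottom_field = frame[1::2]               # 2nd, 4th, 6th ... lines (even-numbered)

# Interleaving the two fields restores the full frame.
rebuilt = np.empty_like(frame)
rebuilt[0::2], rebuilt[1::2] = top_field, bottom_field
assert np.array_equal(rebuilt, frame)
```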
Another trick to reduce the data rate is to thin out the two chrominance matrices. In effect, the remaining chrominance values represent the nearby values that are deleted. Thinning works because the eye is more responsive to brightness than to color. The 4:2:2 chrominance format indicates that half the chrominance values have been deleted. The 4:2:0 chrominance format indicates that three quarters of the chrominance values have been deleted. If no chrominance values have been deleted, the chrominance format is 4:4:4. MPEG-2 allows all three options.
MPEG-2 specifies that the raw frames be compressed into three kinds of frames: I(ntra-coded)-frames, P(redictive-coded)-frames, and B(idirectionally predictive-coded)-frames.
An I-frame is a compressed version of a single uncompressed (raw) frame. It takes advantage of spatial redundancy and of the inability of the eye to detect certain changes in the image. Unlike P-frames and B-frames, I-frames do not depend on data in the preceding or the following frames. Briefly, the raw frame is divided into 8 pixel by 8 pixel blocks. The data in each block is transformed by a "discrete cosine transform." The result is an 8 by 8 matrix of coefficients. This transform does not change the information in the block; the original block can be recreated exactly by applying the inverse cosine transform. The math is a little esoteric but, roughly, the transform converts spatial variations into frequency variations. The advantage of doing this is that the image can now be simplified by quantizing the coefficients. Many of the coefficients, usually the higher frequency components, will then be zero. The penalty of this step is the loss of some subtle distinctions in brightness and color. If one applies the inverse transform to the matrix after it is quantized, one gets an image that looks very similar to the original image but that is not quite as nuanced. Next, the quantized coefficient matrix is itself compressed. Typically, one corner of the quantized matrix is filled with zeros. By starting in the opposite corner of the matrix, then zigzagging through the matrix to combine the coefficients into a string, then substituting run-length codes for consecutive zeros in that string, and then applying Huffman coding to that result, one reduces the matrix to a smaller array of numbers. It is this array that is broadcast or that is put on DVDs. In the receiver or the player, the whole process is reversed, enabling the receiver to reconstruct, to a close approximation, the original frame.
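The condensed sketch below walks that same chain (DCT, quantization, zigzag scan, run-length coding) for one 8x8 block, using SciPy's DCT. The flat quantizer step and the test block are illustrative choices, not values from the MPEG-2 standard.

```python
import numpy as np
from scipy.fftpack import dct, idct

def dct2(block):
    # Separable 2-D DCT: transform the columns, then the rows.
    return dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')

def idct2(coeffs):
    return idct(idct(coeffs, axis=0, norm='ortho'), axis=1, norm='ortho')

def zigzag_order(n=8):
    # Visit the matrix along anti-diagonals, alternating direction.
    return sorted(((i, j) for i in range(n) for j in range(n)),
                  key=lambda p: (p[0] + p[1],
                                 p[0] if (p[0] + p[1]) % 2 else p[1]))

block = np.zeros((8, 8)); block[:4, :] = 160.0; block[4:, :] = 96.0
qstep = 16.0                                  # illustrative flat quantizer
quantized = np.round(dct2(block) / qstep)     # most coefficients become zero

# Zigzag the 8x8 matrix into a 64-entry string, then run-length code zeros.
zz = [int(quantized[i, j]) for i, j in zigzag_order()]
runs, run = [], 0
for c in zz:
    if c == 0:
        run += 1
    else:
        runs.append((run, c)); run = 0
runs.append(("EOB",))                         # end-of-block marker
print(runs)

# The decoder reverses the steps and recovers a close approximation.
approx = idct2(quantized * qstep)
print(float(np.abs(approx - block).max()))    # small reconstruction error
```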
Typically, every 15th frame or so is made into an I-frame. P-frames and B-frames might follow an I-frame like this, IBBPBBPBBPBB(I), to form a Group of Pictures (GOP); however, the standard is flexible about this.
P-frames provide more compression than I-frames because they take advantage of the data in the previous I-frame or P-frame. I-frames and P-frames are called reference frames. To generate a P-frame, the previous reference frame is reconstructed, just as it would be in a TV receiver or DVD player. The frame being compressed is divided into 16 pixel by 16 pixel "macroblocks." Then, for each of those macroblocks, the reconstructed reference frame is searched to find the 16 by 16 macroblock that best matches the macroblock being compressed. The offset is encoded as a "motion vector." Frequently, the offset is zero. But, if something in the picture is moving, the offset might be something like 23 pixels to the right and 4 pixels up. The match between the two macroblocks will often not be perfect. To correct for this, the encoder computes the strings of coefficient values as described above for both macroblocks and then subtracts one from the other. This "residual" is appended to the motion vector, and the result is sent to the receiver or stored on the DVD for each macroblock being compressed. Sometimes no suitable match is found. Then, the macroblock is treated like an I-frame macroblock.
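A minimal sketch of that macroblock search, scoring each candidate offset by the sum of absolute differences (SAD); real encoders use much faster search strategies than this exhaustive scan and then code the residual as described above.

```python
import numpy as np

def best_motion_vector(ref, cur_block, top, left, search=8):
    # Search the reference frame around (top, left) for the 16x16 region
    # that best matches the current macroblock.
    h, w = ref.shape
    best, best_sad = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if 0 <= y and 0 <= x and y + 16 <= h and x + 16 <= w:
                sad = np.abs(ref[y:y+16, x:x+16].astype(int) - cur_block).sum()
                if sad < best_sad:
                    best_sad, best = sad, (dx, dy)
    return best, best_sad

ref = np.zeros((64, 64), dtype=np.uint8)
ref[10:26, 10:26] = 200                       # object in the reference frame
cur = np.zeros_like(ref)
cur[14:30, 13:29] = 200                       # same object, shifted
mb = cur[14:30, 13:29].astype(int)            # macroblock being compressed
mv, sad = best_motion_vector(ref, mb, top=14, left=13)
print("motion vector (dx, dy):", mv, "SAD:", sad)  # (-3, -4), SAD 0
```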
The processing of B-frames is similar to that of P-frames except that B-frames use the picture in the following reference frame as well as the picture in the preceding reference frame. As a result, B-frames usually provide more compression than P-frames. B-frames are never reference frames.
While the above paragraphs generally describe MPEG-2 video compression, there are many details that are not discussed including details involving fields, chrominance formats, responses to scene changes, special codes that label the parts of the bitstream, and so on. MPEG-2 compression is complicated. TV cameras capture pictures at a regular rate. TV receivers display pictures at a regular rate. In between, all kinds of things are happening. But it works.
MPEG-2 also introduces new audio encoding methods:

• low bitrate encoding with halved sampling rate (MPEG-1 Layer 1/2/3 LSF)
• multichannel encoding with up to 5.1 channels
• MPEG-2 AAC
Profiles and Levels

MPEG-2 Profiles

Abbr.  Name             Frames   YUV    Streams  Comment
SP     Simple Profile   P, I     4:2:0  1        no interlacing
MP     Main Profile     P, I, B  4:2:0  1
422P   4:2:2 Profile    P, I, B  4:2:2  1
SNR    SNR Profile      P, I, B  4:2:0  1-2      SNR: Signal to Noise Ratio
SP     Spatial Profile  P, I, B  4:2:0  1-3      low, normal and high quality decoding
HP     High Profile     P, I, B  4:2:2  1-3
MPEG-2 Levels

Abbr.  Name        Pixels/line  Lines  Framerate (Hz)  Bitrate (Mbit/s)
LL     Low Level   352          288    30              4
ML     Main Level  720          576    30              15
H-14   High 1440   1440         1152   30              60
HL     High Level  1920         1152   30              80
Profile @ Level   Sampling  Resolution     Framerate max. (Hz)  Bitrate (Mbit/s)  Example Application
MP@LL             4:2:0     320 × 240      24                   4                 Set-top boxes (STB)
                            352 × 288      30
MP@ML             4:2:0     720 × 480      30                   15 (DVD: 9.8)     DVD, SD-DVB
                            720 × 576      25
MP@H-14           4:2:0     1440 × 1080    30                   60 (HDV: 25)      HDV
                            1280 × 720     30
MP@HL             4:2:0     1920 × 1080    30                   80                ATSC 1080i, 720p60, HD-DVB (HDTV)
                            1280 × 720     60
422P@ML           4:2:2     720 × 480      30                   50                Sony IMX using I-frame only, broadcast "contribution" video (I&P only)
                            720 × 576      25
422P@H-14         4:2:2     1440 × 1080    30                   80                Potential future MPEG-2-based HD products from Sony and Panasonic
                            1280 × 720     60
422P@HL           4:2:2     1920 × 1080    30                   300               Potential future MPEG-2-based HD products from Panasonic

DVD
The DVD standard uses MPEG-2 video, but imposes some restrictions:
• Allowed resolutions:
  o 720 × 480, 704 × 480, 352 × 480, 352 × 240 pixels (NTSC)
  o 720 × 576, 704 × 576, 352 × 576, 352 × 288 pixels (PAL)
• Allowed display aspect ratios:
  o 4:3
  o 16:9
  o (2.21:1 is often listed as a valid DVD aspect ratio, but is actually just a 16:9 image with the top and bottom of the frame masked in black)
• Allowed frame rates:
  o 29.97 frame/s (NTSC)
  o 25 frame/s (PAL)

Note: By using a pattern of REPEAT_FIRST_FIELD flags in the headers of encoded pictures, pictures can be displayed for either two or three fields, so almost any picture display rate (minimum ⅔ of the frame rate) can be achieved. This is most often used to display 23.976 frame/s (approximately film rate) video on NTSC.

• Audio+video bitrate:
  o Video peak 9.8 Mbit/s
  o Total peak 10.08 Mbit/s
  o Minimum 300 kbit/s
• YUV 4:2:0
• Additional subtitles possible
• Closed captioning (NTSC only)
• Audio:
  o Linear Pulse Code Modulation (LPCM): 48 kHz or 96 kHz; 16- or 24-bit; up to six channels (not all combinations possible due to bitrate constraints)
  o MPEG Layer 2 (MP2): 48 kHz, up to 5.1 channels (required in PAL players only)
  o Dolby Digital (DD, also known as AC-3): 48 kHz, 32–448 kbit/s, up to 5.1 channels
  o Digital Theater Systems (DTS): 754 kbit/s or 1510 kbit/s (not required for DVD player compliance)
  o NTSC DVDs must contain at least one LPCM or Dolby Digital audio track.
  o PAL DVDs must contain at least one MPEG Layer 2, LPCM, or Dolby Digital audio track.
  o Players are not required to play back audio with more than two channels, but must be able to downmix multichannel audio to two channels.
• GOP structure:
  o A sequence header must be present at the beginning of every GOP
  o Maximum frames per GOP: 18 (NTSC) / 15 (PAL), i.e. 0.6 seconds in both cases
  o Closed GOP required for multiple-angle DVDs
Application-specific restrictions on MPEG-2 video in the DVB standard:
Allowed resolutions for SDTV:

• 720, 640, 544, 480 or 352 × 480 pixels, 24/1.001, 24, 30/1.001 or 30 frame/s
• 352 × 240 pixels, 24/1.001, 24, 30/1.001 or 30 frame/s
• 720, 704, 544, 480 or 352 × 576 pixels, 25 frame/s
• 352 × 288 pixels, 25 frame/s

Allowed resolutions for HDTV:

• 720 × 576 × 50 frame/s progressive (576p50)
• 1280 × 720 × 25 or 50 frame/s progressive (720p50)
• 1440 or 1920 × 1080 × 25 frame/s progressive (1080p25 - film mode)
• 1440 or 1920 × 1080 × 25 frame/s interlaced (1080i25)
• 1920 × 1080 × 50 frame/s progressive (1080p50), a possible future H.264/AVC format

Application-specific restrictions on MPEG-2 video in the ATSC standard, allowed resolutions:

• 1920 × 1080 pixels, 30 frame/s (1080i)
• 1280 × 720 pixels, 60 frame/s (720p)
• 720 × 576 pixels, 25 frame/s (576i, 576p)
• 720 or 640 × 480 pixels, 30 frame/s (480i, 480p)

Note: 1080i is encoded with 1920 × 1088 pixel frames, but the last 8 lines are discarded prior to display.
ISO/IEC 13818

Part 1: Systems. Describes synchronization and multiplexing of video and audio.
Part 2: Video. Compression codec for interlaced and non-interlaced video signals.
Part 3: Audio. Compression codec for perceptual coding of audio signals; a multichannel-enabled extension of MPEG-1 audio.
Part 4: Describes procedures for testing compliance.
Part 5: Describes systems for software simulation.
Part 6: Describes extensions for DSM-CC (Digital Storage Media Command and Control).
Part 7: Advanced Audio Coding (AAC).
Part 9: Extension for real-time interfaces.
Part 10: Conformance extensions for DSM-CC.
(Part 8, a 10-bit video extension whose primary application was studio video, has been withdrawn due to lack of interest from industry.)
Today, nearly all video compression methods in common use (e.g., those in standards approved by the ITU-T or ISO) apply a discrete cosine transform (DCT) for spatial redundancy reduction. Other methods, such as fractal compression, matching pursuits, and the use of a discrete wavelet transform (DWT) have been the subject of some research, but are typically not used in practical products (except for the use of wavelet coding as still-image coders without motion compensation). Interest in fractal compression seems to be waning, due to recent theoretical analysis showing a comparative lack of effectiveness to such methods.
The use of most video compression techniques (e.g., DCT or DWT based techniques) involves quantization. The quantization can either be scalar quantization or vector quantization; however, nearly all practical designs use scalar quantization because of its greater simplicity.
In broadcast engineering, digital television (DVB, ATSC and ISDB ) is made practical by video compression. TV stations can broadcast not only HDTV, but multiple virtual channels on the same physical channel as well. It also conserves precious bandwidth on the radio spectrum. Nearly all digital video broadcast today uses the MPEG-2 standard video compression format, although H.264/MPEG-4 AVC and VC-1 are emerging contenders in that domain.
Multimedia compression formats

Video:
• ISO/IEC (MPEG): MPEG-1 | MPEG-2 | MPEG-4 | MPEG-4/AVC
• ITU-T: H.261 | H.262 | H.263 | H.264
• Others: AVS | Dirac | Indeo | MJPEG | RealVideo | VC-1 | Theora | VP6 | VP7 | WMV

Audio:
• ISO/IEC (MPEG): MPEG-1 Layer III (MP3) | MPEG-1 Layer II | AAC | HE-AAC
• ITU-T: G.711 | G.722 | G.722.1 | G.722.2 | G.723 | G.723.1 | G.726 | G.728 | G.729 | G.729.1 | G.729a
• Others: AC3 | ATRAC | FLAC | iLBC | Monkey's Audio | Musepack | RealAudio | SHN | Speex | Vorbis | WavPack | WMA

Image:
• ISO/IEC / ITU-T: JPEG | JPEG 2000 | JPEG-LS | JBIG | JBIG2
• Others: BMP | GIF | ILBM | PCX | PNG | TGA | TIFF | WMP

Media container formats:
• General: 3GP | ASF | AVI | FLV | Matroska | MP4 | MXF | NUT | Ogg | Ogg Media | QuickTime | RealMedia
• Audio only: AIFF | AU | WAV
Digital Compression

An uncompressed SDI signal outputs 270 Mbit of data every second. In digital broadcasting, compression is essential to squeeze all this data into a 10 MHz RF channel. Many people mistakenly equate the term 'bit rate' with picture quality; 'bit rate' actually refers to how the signal is processed. Thanks to the unique modular design of all Gigawave digital microwave links, the 'plug-in' encoder and modulator modules can easily be changed on-site or upgraded as new compression techniques evolve.
Compression Techniques used in Telecommunications and Broadcasting:

Standard            Bit Rate (Mb/s)  Delay
ETSI 140            140              0
ETSI 34             34               Negligible
ETSI 17             17
ETSI 8              8
DigiBeta            120 (approx.)    Negligible
Digital S           50
MPEG-1              1.5
MPEG-2              1.5 - 80         2 - 24 frames
Beta SX             18
EBU 24 News         8
MPEG-4              N/A
Motion JPEG         30 - 100         3 frames
JPEG 2000           N/A
DVC Pro 25/50/100   25/50/100        3 frames
DVCam               25               3 frames
DV                  25               3 frames
Wavelets            18 - 100
AUDIO COMPRESSION TECHNIQUES
Many different compression techniques exist for various forms of data. Compression of still pictures is comparatively simple because many pixels are repeated in groups; techniques include horizontal repeated-pixel compression (PCX format), data conversion (GIF format), and fractal-path repeated pixels. For motion video, compression is relatively easy because large portions of the screen don't change between frames; therefore, only the changes between images need to be stored. Text compression is extremely simple compared to video and audio. One method counts the probability of each character and then reassigns smaller bit values to the most common characters and larger bit values to the least common characters, as in the sketch below.
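A minimal sketch of that idea: build a Huffman code from character frequencies, so frequent characters get short bit strings. The heap layout and tiebreakers here are implementation choices, not part of any standard.

```python
import heapq
from collections import Counter

def huffman_codes(text: str) -> dict:
    # Heap entries are (frequency, tiebreaker, tree); a tree is either a
    # character or a (left, right) pair of subtrees.
    heap = [(f, i, ch) for i, (ch, f) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, count, (a, b)))
        count += 1
    codes = {}
    def walk(tree, prefix=""):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"
    walk(heap[0][2])
    return codes

text = "this is an example of huffman coding"
codes = huffman_codes(text)
bits = sum(len(codes[c]) for c in text)
print(bits, "bits vs", 8 * len(text), "bits uncompressed")
```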
However, digital samples of audio data have proven to be very difficult to compress; these techniques do not work well at all for audio data. The data change often, and no values are common enough to save sufficient space. Currently, five methods are used to compress audio data with varying degrees of complexity, compressed audio quality, and amount of data compression.
The digital representation of audio data offers many advantages: high noise immunity, stability, and reproducibility. Audio in digital form also allows for efficient implementation of many audio processing functions through the computer.
Converting audio from analog to digital begins by sampling the audio input at regular, discrete intervals of time and quantizing the sampled values into a discrete number of evenly spaced levels. According to the Nyquist theory, a time-sampled signal can faithfully represent frequencies only up to half the sampling rate; content above that threshold is aliased, blurring the signal and adding readily apparent noise.
The sampling frequencies in use today range from 8 kHz for basic speech to 48 kHz for commercial DAT machines. The number of quantizer levels is typically a power of 2 to make full use of a fixed number of bits per audio sample. The typical range for bits per sample is between 8 and 16 bits. This allows for a range of 256 to 65,536 levels of quantization per sample. With each additional bit of quantizer spacing, the signal to noise ratio increases by roughly 6 decibels (dB). Thus, the dynamic range capability of these representations is from 48 to 96 dB, respectively.
The data rates associated with uncompressed digital audio are substantial. For audio data on a CD, for example, which is sampled at 44.1 kHz with 16 bits per channel for two channels, about 1.4 megabits per second are processed. A clear need exists for some form of compression to enable the more efficient storage and transmission of digital audio data.
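The arithmetic behind those figures is easy to check:

```python
import math

# CD audio: sample rate x bits per sample x channels.
cd_rate = 44_100 * 16 * 2
print(cd_rate, "bit/s =", round(cd_rate / 1e6, 2), "Mbit/s")   # ~1.41 Mbit/s

# Roughly 6 dB of dynamic range per quantizer bit.
for bits in (8, 16):
    levels = 2 ** bits
    print(bits, "bits:", levels, "levels,",
          round(20 * math.log10(levels), 1), "dB")
```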
Voc File Compression
The simplest compression techniques simply remove any silence from the entire sample. Creative Labs introduced this form of compression with its Soundblaster line of sound cards. This method analyzes the whole sample and then codes the silence into the sample using byte codes. It is very similar to run-length coding; see the sketch below.
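A toy sketch of the idea, assuming 8-bit unsigned samples centered on 128: runs of near-silent samples are replaced by a marker and a run length, exactly as run-length coding replaces repeated bytes.

```python
def compress_silence(samples, threshold=2, zero=128):
    # Replace runs of near-silent samples with ("SIL", run_length) pairs.
    out, run = [], 0
    for s in samples:
        if abs(s - zero) <= threshold:
            run += 1
        else:
            if run:
                out.append(("SIL", run))
                run = 0
            out.append(s)
    if run:
        out.append(("SIL", run))
    return out

print(compress_silence([128, 129, 127, 128, 180, 60, 128, 128]))
# [('SIL', 4), 180, 60, ('SIL', 2)]
```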
Linear Predictive Coding and Code Excited Linear Predictor
This was an early development in audio compression, used primarily for speech. A Linear Predictive Coding (LPC) encoder compares speech to an analytical model of the vocal tract, then throws away the speech and stores the parameters of the best-fit model. The output quality was poor, often compared to synthetic computer speech, and thus plain LPC is not used much today.
A later development, Code Excited Linear Predictor (CELP), increased the complexity of the speech model further, while allowing for greater compression due to faster computers, and produced much better results. Sound quality improved, while the compression ratio increased. The algorithm compares speech with an analytical model of the vocal tract and computes the errors between the original speech and the model. It transmits both model parameters and a very compressed representation of the errors.
Mu-law and A-law compression
Logarithmic compression is a good method because it matches the way the human ear works: it only loses information which the ear would not hear anyway, and gives good-quality results for both speech and music. Although the compression ratio is not very high, it requires very little processing power to achieve. It is the international standard telephony encoding format (standardized by the ITU, formerly the CCITT, as G.711) and is commonly used in North America and Japan for ISDN 8 kHz-sampled, voice-grade digital telephone service. It packs each 16-bit sample into 8 bits by using a logarithmic table to encode a 13-bit dynamic range, dropping the 3 least significant bits of precision. The quantization levels are dispersed unevenly instead of linearly: the logarithmic step spacings represent low-amplitude samples with greater accuracy than higher-amplitude samples, mimicking the ear's roughly logarithmic perception of loudness. This method is fast and compresses data into half the size of the original sample. It is used quite widely due to the universal nature of its adoption.
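The sketch below implements the continuous mu-law curve (mu = 255), assuming samples normalized to [-1, 1]. Deployed telephony codecs (ITU-T G.711) use a segmented, table-driven approximation of this curve rather than the math library.

```python
import math

MU = 255.0

def mulaw_encode(x: float) -> int:
    # Compress a sample in [-1, 1] to an 8-bit code (0..255).
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return int(round((y + 1) / 2 * 255))

def mulaw_decode(code: int) -> float:
    y = code / 255.0 * 2 - 1
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

for x in (0.001, 0.01, 0.1, 1.0):
    code = mulaw_encode(x)
    print(f"{x:6.3f} -> code {code:3d} -> {mulaw_decode(code):.4f}")
```

The printout shows the companding at work: quiet samples are reproduced with far finer resolution than a linear 8-bit quantizer would allow.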
Adaptive Differential Pulse Code Modulation (ADPCM)
The Interactive Multimedia Association (IMA) is a consortium of computer hardware and software vendors cooperating to develop a standard for multimedia data. Their goal was to select a public-domain audio compression algorithm that is able to provide a good compression ratio while maintaining good audio quality. In addition, the coding had to be simple enough to enable software-only decoding of 44.1 kHz samples on a 20 MHz, 386-class computer.
This process is a simple conversion based on the assumption that the changes between successive samples will not be very large. The first sample value is stored in its entirety, and each successive value describes the change from its predecessor as a step of up to +/- 8 levels, which uses only 4 bits instead of 16. Therefore, a 4:1 compression ratio is achieved, with less loss as the sampling frequency increases. At 44.1 kHz, the compressed signal is an accurate representation of the uncompressed sample that is difficult to discern from the original. This method is used widely today because of its simplicity, wide acceptance, and high level of compression.
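A toy differential coder in that spirit, assuming 16-bit samples and a fixed step size; the real IMA algorithm instead adapts the step size from a lookup table as it runs.

```python
def dpcm_encode(samples, step=64):
    # Send the first sample whole, then a 4-bit step (-8..+7) per sample.
    first, deltas, prev = samples[0], [], samples[0]
    for s in samples[1:]:
        d = max(-8, min(7, round((s - prev) / step)))
        deltas.append(d)
        prev += d * step          # track the decoder's reconstruction
    return first, deltas

def dpcm_decode(first, deltas, step=64):
    out, prev = [first], first
    for d in deltas:
        prev += d * step
        out.append(prev)
    return out

first, deltas = dpcm_encode([1000, 1050, 1130, 1100, 900])
print(dpcm_decode(first, deltas))  # close to, but not exactly, the input
```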
MPEG

The Moving Picture Experts Group (MPEG) audio compression algorithm is an International Organization for Standardization (ISO) standard for high-fidelity audio compression. It is one part of a three-part compression standard, the other two parts covering video and systems. MPEG compression is lossy, but nonetheless can achieve transparent, perceptually lossless compression. It is firmly founded in psychoacoustic theory. The premise behind the technique is simple: if a sound cannot be heard by the listener, then it does not need to be coded. Human hearing is quite sensitive, but discerning differences in a collage of sounds is quite difficult. Masking is the phenomenon where a strong signal "covers" another signal such that the softer one cannot be heard by the human ear. An extension of this is temporal masking, which describes masking of a soft sound after a loud sound has stopped; the time it takes to hear the softer sound, measured under scientific conditions, is about 5 ms. Because the sensitivity of the ear is not linear but is instead dependent upon the frequency, masking effects differ depending on the frequency of the sounds.
MPEG compression uses masking as the basis for compressing the audio data. Those sounds that cannot be heard by the human ear do not need to be encoded. The audio spectrum is divided into 32 frequency bands because sound masking occurs over a range of frequencies for each loud sound. Then the volume levels are measured in each band to detect for any masking. Masking effects are taken into account, and the signal is then encoded.
In addition to encoding a single signal, MPEG compression supports one or two audio channels in one of four modes:

1) Monophonic
2) Dual monophonic: two independent channels
3) Stereo: stereo channels that share bits, but without joint-stereo coding
4) Joint stereo: takes advantage of the correlations between stereo channels

The MPEG method allows for a compression ratio of up to 6:1. Under optimal listening conditions, expert listeners could not distinguish the coded and original audio clips. Thus, although this technique is lossy, it still produces accurate representations of the original audio signal.
SPEECH COMPRESSION

I. Introduction
The compression of speech signals has many practical applications. One example is in digital cellular technology where many users share the same frequency bandwidth. Compression allows more users to share the system than otherwise possible. Another example is in digital voice storage (e.g. answering machines). For a given memory size, compression allows longer messages to be stored than otherwise.
Historically, digital speech signals are sampled at a rate of 8000 samples/sec. Typically, each sample is represented by 8 bits (using mu-law). This corresponds to an uncompressed rate of 64 kbps (kbits/sec). With current compression techniques (all of which are lossy), it is possible to reduce the rate to 8 kbps with almost no perceptible loss in quality. Further compression is possible at a cost of lower quality. All of the current low-rate speech coders are based on the principle of linear predictive coding (LPC) which is presented in the following sections.
II. LPC Modeling

A. Physical Model:
When you speak:
• Air is pushed from your lungs through your vocal tract, and out of your mouth comes speech.
• For certain voiced sounds, your vocal cords vibrate (open and close). The rate at which the vocal cords vibrate determines the pitch of your voice. Women and young children tend to have high pitch (fast vibration) while adult males tend to have low pitch (slow vibration).
• For certain fricative and plosive (or unvoiced) sounds, your vocal cords do not vibrate but remain constantly open.
• The shape of your vocal tract determines the sound that you make.
• As you speak, your vocal tract changes its shape, producing different sounds.
• The shape of the vocal tract changes relatively slowly (on the scale of 10 msec to 100 msec).
• The amount of air coming from your lungs determines the loudness of your voice.
B. Mathematical Model:
• The above model is often called the LPC model.
• The model says that the digital speech signal is the output of a digital filter (called the LPC filter) whose input is either a train of impulses or a white noise sequence.
• The relationship between the physical and the mathematical models:

  Vocal tract ↔ LPC filter (coefficients $a_1, \dots, a_{10}$)
  Vocal cord vibration (voiced) ↔ impulse-train excitation
  Vocal cord vibration period ↔ pitch period $T$
  Fricatives and plosives (unvoiced) ↔ white-noise excitation
  Air volume ↔ gain $G$

• The LPC filter is given by

  $H(z) = \dfrac{G}{1 - \sum_{k=1}^{10} a_k z^{-k}}$,

  which is equivalent to saying that the input-output relationship of the filter is given by the linear difference equation

  $s(n) = \sum_{k=1}^{10} a_k s(n-k) + G\,u(n)$,

  where $s(n)$ is the speech sample and $u(n)$ is the excitation (impulse train or white noise).
• The LPC model can be represented in vector form as

  $A = (a_1, \dots, a_{10}, G, V/UV, T)$,

  where $V/UV$ is the voiced/unvoiced decision and $T$ is the pitch period.
• $A$ changes every 20 msec or so. At a sampling rate of 8000 samples/sec, 20 msec is equivalent to 160 samples.
• The digital speech signal is divided into frames of size 20 msec. There are 50 frames/second.
• The model says that the frame of speech samples $\{s(0), \dots, s(159)\}$ is equivalent to the parameter vector $A$. Thus the 160 values of $s(n)$ are compactly represented by the 13 values of $A$.
• There's almost no perceptual difference in the reconstructed speech if:
  o For voiced sounds (V): the impulse train is shifted (insensitive to phase change).
  o For unvoiced sounds (UV): a different white noise sequence is used.
• LPC synthesis: Given $A$, generate $s(n)$ (this is done using standard filtering techniques).
• LPC analysis: Given $s(n)$, find the best $A$ (this is described in the next section).
III. LPC Analysis
• Consider one frame of speech signal: $s(n)$, $n = 0, 1, \dots, 159$.
• The signal is related to the innovation $e(n)$ through the linear difference equation

  $e(n) = s(n) - \sum_{k=1}^{10} a_k s(n-k)$.

• The ten LPC parameters $a_1, \dots, a_{10}$ are chosen to minimize the energy of the innovation:

  $E = \sum_{n=0}^{159} e^2(n)$.

• Using standard calculus, we take the derivative of $E$ with respect to each $a_i$ and set it to zero:

  $\partial E / \partial a_i = 0, \quad i = 1, \dots, 10$.

• We now have 10 linear equations with 10 unknowns (the normal equations):

  $\sum_{k=1}^{10} a_k R(|i-k|) = R(i), \quad i = 1, \dots, 10$,

  where $R(m)$ is the autocorrelation of the frame.
• The above matrix equation could be solved using:
  o The Gaussian elimination method.
  o Any matrix inversion method (MATLAB).
  o The Levinson-Durbin recursion (an efficient method that exploits the Toeplitz structure of the equations; see below).
• Levinson-Durbin recursion: solve the above for $a_1, \dots, a_{10}$, and then set the gain from the residual energy:

  $G^2 = R(0) - \sum_{k=1}^{10} a_k R(k)$.

• To get the other three parameters ($G$, $V/UV$, $T$), we solve for the innovation by filtering the frame:

  $e(n) = s(n) - \sum_{k=1}^{10} a_k s(n-k)$.

• Then calculate the autocorrelation of $e(n)$:

  $R_e(m) = \sum_n e(n)\,e(n-m)$.

• Then make a decision based on the autocorrelation: if $R_e(m)$ shows a strong peak at some lag $T$ in the expected pitch range, the frame is declared voiced with pitch period $T$; otherwise it is declared unvoiced.
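A compact sketch of this analysis, assuming one 160-sample frame in a NumPy array: form the autocorrelations $R(0), \dots, R(10)$, then solve the normal equations with the Levinson-Durbin recursion.

```python
import numpy as np

def autocorr(frame, order=10):
    return np.array([np.dot(frame[:len(frame) - k], frame[k:])
                     for k in range(order + 1)])

def levinson_durbin(R, order=10):
    # Solve sum_k a_k R(|i-k|) = R(i), exploiting the Toeplitz structure.
    a = np.zeros(order + 1)
    a[0], err = 1.0, R[0]
    for m in range(1, order + 1):
        k = -(R[m] + np.dot(a[1:m], R[m-1:0:-1])) / err  # reflection coeff.
        a[1:m] += k * a[1:m][::-1]
        a[m] = k
        err *= 1.0 - k * k
    return -a[1:], err     # predictor coefficients a_k, residual energy

rng = np.random.default_rng(0)
frame = np.sin(0.3 * np.arange(160)) + 0.01 * rng.standard_normal(160)
coeffs, err = levinson_durbin(autocorr(frame))
print(np.round(coeffs[:3], 3), round(float(err), 4))
```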
IV. 2.4 kbps LPC Vocoder
• The following is a block diagram of a 2.4 kbps LPC Vocoder:
• The LPC coefficients are represented as line spectrum pair (LSP) parameters.
• LSP are mathematically equivalent (one-to-one) to LPC.
• LSP are more amenable to quantization.
• LSP are calculated from the prediction polynomial $A(z) = 1 - \sum_{k=1}^{10} a_k z^{-k}$ by forming the sum and difference polynomials

  $P(z) = A(z) + z^{-11} A(z^{-1})$
  $Q(z) = A(z) - z^{-11} A(z^{-1})$.

• Factoring the above equations, the roots of $P(z)$ and $Q(z)$ lie on the unit circle at angles $\omega_1, \omega_2, \dots, \omega_{10}$; these angles are called the LSP parameters.
• LSP are ordered and bounded:

  $0 < \omega_1 < \omega_2 < \dots < \omega_{10} < \pi$.

• LSP are more correlated from one frame to the next than LPC.
• The frame size is 20 msec, so there are 50 frames/sec, and 2400 bps is equivalent to 48 bits/frame. These bits are allocated as follows: 34 bits for the LSP parameters, 7 bits for the gain, and 7 bits for the jointly encoded pitch period and voicing decision.
• The gain, $G$, is encoded using a 7-bit non-uniform scalar quantizer (a 1-dimensional vector quantizer).
• For voiced speech, the pitch period $T$ ranges from 20 to 146. $T$ and the voiced/unvoiced decision are jointly encoded: the 127 voiced pitch values plus a single unvoiced code fit in the $2^7 = 128$ codewords of the 7-bit field.
V. 4.8 kbps CELP Coder
• CELP = Code-Excited Linear Prediction.
• The principle is similar to the LPC vocoder except:
  o Frame size is 30 msec (240 samples)
  o The excitation (innovation) is coded directly
  o More bits are needed
  o Computationally more complex
  o A pitch prediction filter is included
  o The vector quantization concept is used
• A block diagram of the CELP encoder is shown below:
• The pitch prediction filter is given by

  $P(z) = \dfrac{1}{1 - \beta z^{-T}}$,

  where the pitch lag $T$ could be an integer or a fraction thereof.
• The perceptual weighting filter is given by

  $W(z) = \dfrac{A(z/\gamma_1)}{A(z/\gamma_2)}$,

  where the constants $\gamma_1$ and $\gamma_2$ ($0 < \gamma_2 < \gamma_1 \le 1$) have been determined empirically to be good choices.
• Each frame is divided into 4 subframes. In each subframe, the codebook contains 512 codevectors.
• The gain is quantized using 5 bits per subframe.
• The LSP parameters are quantized using 34 bits, similar to the LPC vocoder.
• At 30 msec per frame, 4.8 kbps is equivalent to 144 bits/frame, allocated among the LSP parameters, the codebook indices, and the gains for the four subframes.
VI. 8.0 kbps CS-ACELP

• CS-ACELP = Conjugate-Structured Algebraic CELP.
• The principle is similar to the 4.8 kbps CELP coder except:
  o Frame size is 10 msec (80 samples)
  o There are only two subframes, each of which is 5 msec (40 samples)
  o The LSP parameters are encoded using two-stage vector quantization
  o The gains are also encoded using vector quantization
• At 10 msec per frame, 8 kbps is equivalent to 80 bits/frame. In ITU-T G.729, the standardized CS-ACELP coder, these 80 bits cover the LSP parameters, the adaptive-codebook (pitch) delays, the fixed-codebook indices, and the vector-quantized gains.
VII. Demonstration This is a demonstration of five different speech compression algorithms (ADPCM, LD-CELP, CS-ACELP, CELP, and LPC10). To use this demo, you need a Sun Audio (.au) Player. To distinguish subtle differences in the speech files, high-quality speakers and/or headphones are recommended. Also, it is recommended that you run this demo in a quiet room (with a low level of background noise).
"A lathe is a big tool. Grab every dish of sugar."
• Original (64000 bps) This is the original speech signal sampled at 8000 samples/second and u-law quantized at 8 bits/sample. Approximately 4 seconds of speech.
• ADPCM (32000 bps) This is speech compressed using the Adaptive Differential Pulse Coded Modulation (ADPCM) scheme. The bit rate is 4 bits/sample (compression ratio of 2:1).
• LD-CELP (16000 bps) This is speech compressed using the Low-Delay Code Excited Linear Prediction (LD-CELP) scheme. The bit rate is 2 bits/sample (compression ratio of 4:1).
• CS-ACELP (8000 bps) This is speech compressed using the Conjugate-Structured Algebraic Code Excited Linear Prediction (CS-ACELP) scheme. The bit rate is 1 bit/sample (compression ratio of 8:1).
• CELP (4800 bps) This is speech compressed using the Code Excited Linear Prediction (CELP) scheme. The bit rate is 0.6 bits/sample (compression ratio of 13.3:1).
• LPC10 (2400 bps) This is speech compressed using the Linear Predictive Coding (LPC10) scheme. The bit rate is 0.3 bits/sample (compression ratio of 26.6:1).
IMAGE COMPRESSING TECHNIQUES – JPEG
One of the hottest topics in image compression technology today is JPEG. The acronym JPEG stands for the Joint Photographic Experts Group, a standards committee that had its origins within the International Organization for Standardization (ISO). In 1982, the ISO formed the Photographic Experts Group (PEG) to research methods of transmitting video, still images, and text over ISDN (Integrated Services Digital Network) lines. PEG's goal was to produce a set of industry standards for the transmission of graphics and image data over digital communications networks.
In 1986, a subgroup of the CCITT began to research methods of compressing color and gray-scale data for facsimile transmission. The compression methods needed for color facsimile systems were very similar to those being researched by PEG. It was therefore agreed that the two groups should combine their resources and work together toward a single standard.
In 1987, the ISO and CCITT combined their two groups into a joint committee that would research and produce a single standard of image data compression for both organizations to use. This new committee was JPEG.
Although the creators of JPEG might have envisioned a multitude of commercial applications for JPEG technology, a consumer public made hungry by the marketing promises of imaging and multimedia technology is benefiting greatly as well. Most previously developed compression methods do a relatively poor job of compressing continuous-tone image data; that is, images containing hundreds or thousands of colors taken from real-world subjects. And very few file formats can support 24-bit raster images.
GIF, for example, can store only images with a maximum pixel depth of eight bits, for a maximum of 256 colors. And its LZW compression algorithm does not work very well on typical scanned image data. The low-level noise commonly found in such data defeats LZW's ability to recognize repeated patterns.
Both TIFF and BMP are capable of storing 24-bit data, but in their pre-JPEG versions are capable of using only encoding schemes (LZW and RLE, respectively) that do not compress this type of image data very well.
JPEG provides a compression method that is capable of compressing continuous-tone image data with a pixel depth of 6 to 24 bits with reasonable speed and efficiency. And although JPEG itself does not define a standard image file format, several have been invented or modified to fill the needs of JPEG data storage.
JPEG in Perspective
Unlike all of the other compression methods described so far in this chapter, JPEG is not a single algorithm. Instead, it may be thought of as a toolkit of image compression methods that may be altered to fit the needs of the user. JPEG may be adjusted to produce very small, compressed images that are of relatively poor quality in appearance but still suitable for many applications. Conversely, JPEG is capable of producing very high-quality compressed images that are still far smaller than the original uncompressed data.
JPEG is also different in that it is primarily a lossy method of compression. Most popular image format compression schemes, such as RLE, LZW, or the CCITT standards, are lossless compression methods. That is, they do not discard any data during the encoding process. An image compressed using a lossless method is guaranteed to be identical to the original image when uncompressed.
Lossy schemes, on the other hand, throw useless data away during encoding. This is, in fact, how lossy schemes manage to obtain superior compression ratios over most lossless schemes. JPEG was designed specifically to discard information that the human eye cannot easily see. Slight changes in color are not perceived well by the human eye, while slight changes in intensity (light and dark) are. Therefore JPEG's lossy encoding tends to be more frugal with the gray-scale part of an image and to be more frivolous with the color.
JPEG was designed to compress color or gray-scale continuous-tone images of real-world subjects: photographs, video stills, or any complex graphics that resemble natural subjects. Animations, ray tracing, line art, black-and-white documents, and typical vector graphics don't compress very well under JPEG and shouldn't be expected to. And, although JPEG is now used to provide motion video compression, the standard makes no special provision for such an application.
The fact that JPEG is lossy and works only on a select type of image data might make you ask, "Why bother to use it?" It depends upon your needs. JPEG is an excellent way to store 24-bit photographic images, such as those used in imaging and multimedia applications. JPEG 24-bit (16 million color) images are superior in appearance to 8-bit (256 color) images on a VGA display and are at their most spectacular when using 24-bit display hardware (which is now quite inexpensive).
The amount of compression achieved depends upon the content of the image data. A typical photographic-quality image may be compressed from 20:1 to 25:1 without experiencing any noticeable degradation in quality. Higher compression ratios will result in image files that differ noticeably from the original image but still have an overall good image quality. And achieving a 20:1 or better compression ratio in many cases not only saves disk space, but also reduces transmission time across data networks and phone lines.
An end user can "tune" the quality of a JPEG encoder using a parameter sometimes called a quality setting or a Q factor. Although different implementations have varying scales of Q factors, a range of 1 to 100 is typical. A factor of 1 produces the smallest, worst quality images; a factor of 100 produces the largest, best quality images. The optimal Q factor depends on the image content and is therefore different for every image. The art of JPEG compression is finding the lowest Q factor that produces an image that is visibly acceptable, and preferably as close to the original as possible.
The JPEG library supplied by the Independent JPEG Group uses a quality setting scale of 1 to 100. To find the optimal compression for an image using the JPEG library, follow these steps:
1. Encode the image using a quality setting of 75 (-Q 75).
2. If you observe unacceptable defects in the image, increase the value and re-encode the image.
3. If the image quality is acceptable, decrease the setting until the image quality is barely acceptable. This will be the optimal quality setting for this image.
4. Repeat this process for every image you have (or just encode them all using a quality setting of 75).
JPEG isn't always an ideal compression solution. There are several reasons:
• As we have said, JPEG doesn't fit every compression need. Images containing large areas of a single color do not compress very well. In fact, JPEG will introduce "artifacts" into such images that are visible against a flat background, making them considerably worse in appearance than if you used a conventional lossless compression method. Images of a "busier" composition contain even worse artifacts, but they are considerably less noticeable against the image's more complex background.
• JPEG can be rather slow when it is implemented only in software. If fast decompression is required, a hardware-based JPEG solution is your best bet, unless you are willing to wait for a faster software-only solution to come along or buy a faster computer.
• JPEG is not trivial to implement. It is not likely you will be able to sit down and write your own JPEG encoder/decoder in a few evenings. We recommend that you obtain a third-party JPEG library, rather than writing your own.
• JPEG is not supported by very many file formats. The formats that do support JPEG are all fairly new and can be expected to be revised at frequent intervals.
The JPEG specification defines a minimal subset of the standard called baseline JPEG, which all JPEG-aware applications are required to support. This baseline uses an encoding scheme based on the Discrete Cosine Transform (DCT) to achieve compression. DCT is a generic name for a class of operations identified and published some years ago. DCT-based algorithms have since made their way into various compression methods.
DCT-based encoding algorithms are always lossy by nature. DCT algorithms are capable of achieving a high degree of compression with only minimal loss of data. This scheme is effective only for compressing continuous-tone images in which the differences between adjacent pixels are usually small. In practice, JPEG works well only on images with depths of at least four or five bits per color channel. The baseline standard actually specifies eight bits per input sample. Data of lesser bit depth can be handled by scaling it up to eight bits per sample, but the results will be bad for low-bit-depth source data, because of the large jumps between adjacent pixel values. For similar reasons, colormapped source data does not work very well, especially if the image has been dithered.
The JPEG compression scheme is divided into the following stages:
1. Transform the image into an optimal color space.
2. Downsample chrominance components by averaging groups of pixels together.
3. Apply a Discrete Cosine Transform (DCT) to blocks of pixels, thus removing redundant image data.
4. Quantize each block of DCT coefficients using weighting functions optimized for the human eye.
5. Encode the resulting coefficients (image data) using a Huffman variable word-length algorithm to remove redundancies in the coefficients.
Figure 9-11 summarizes these steps, and the following subsections look at each of them in turn. Note that JPEG decoding performs the reverse of these steps.
Figure 9-11: JPEG compression and decompression
Transform the image
The JPEG algorithm is capable of encoding images that use any type of color space. JPEG itself encodes each component in a color model separately, and it is completely independent of any color-space model, such as RGB, HSI, or CMY. The best compression ratios result if a luminance/chrominance color space, such as YUV or YCbCr, is used. (See Chapter 2 for a description of these color spaces.)
Most of the visual information to which human eyes are most sensitive is found in the high-frequency, gray-scale, luminance component (Y) of the YCbCr color space. The other two chrominance components (Cb and Cr) contain high-frequency color information to which the human eye is less sensitive. Most of this information can therefore be discarded.
In comparison, the RGB, HSI, and CMY color models spread their useful visual image information evenly across each of their three color components, making the selective discarding of information very difficult. All three
color components would need to be encoded at the highest quality, resulting in a poorer compression ratio. Gray-scale images do not have a color space as such and therefore do not require transforming.
Downsample chrominance components
The simplest way of exploiting the eye's lesser sensitivity to chrominance information is simply to use fewer pixels for the chrominance channels. For example, in an image nominally 1000x1000 pixels, we might use a full 1000x1000 luminance pixels but only 500x500 pixels for each chrominance component. In this representation, each chrominance pixel covers the same area as a 2x2 block of luminance pixels. We store a total of six pixel values for each 2x2 block (four luminance values, one each for the two chrominance channels), rather than the twelve values needed if each component is represented at full resolution. Remarkably, this 50 percent reduction in data volume has almost no effect on the perceived quality of most images. Equivalent savings are not possible with conventional color models such as RGB, because in RGB each color channel carries some luminance information and so any loss of resolution is quite visible.
When the uncompressed data is supplied in a conventional format (equal resolution for all channels), a JPEG compressor must reduce the resolution of the chrominance channels by downsampling, or averaging together groups of pixels. The JPEG standard allows several different choices for the sampling ratios, or relative sizes, of the downsampled channels. The luminance channel is always left at full resolution (1:1 sampling). Typically both chrominance channels are downsampled 2:1 horizontally and either 1:1 or 2:1 vertically, meaning that a chrominance pixel covers the same area as either a 2x1 or a 2x2 block of luminance pixels. JPEG refers to these downsampling processes as 2h1v and 2h2v sampling, respectively.
Another notation commonly used is 4:2:2 sampling for 2h1v and 4:2:0 sampling for 2h2v; this notation derives from television customs (color transformation and downsampling have been in use since the beginning of color TV transmission). 2h1v sampling is fairly common because it corresponds to National Television Standards Committee (NTSC) standard TV practice, but it offers less compression than 2h2v sampling, with hardly any gain in perceived quality.
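A minimal sketch of 2h2v (4:2:0) downsampling, assuming a chrominance plane with even dimensions held as a NumPy array: each output pixel is the average of a 2x2 block of input pixels.

```python
import numpy as np

def downsample_420(chroma: np.ndarray) -> np.ndarray:
    # Group the plane into 2x2 blocks and average each block.
    h, w = chroma.shape
    blocks = chroma.astype(np.float64).reshape(h // 2, 2, w // 2, 2)
    return blocks.mean(axis=(1, 3))

cb = np.arange(16, dtype=np.float64).reshape(4, 4)
print(downsample_420(cb))   # a 2x2 plane: one value per 2x2 input block
```

For 2h1v (4:2:2) sampling, the same averaging would be applied horizontally only.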
Apply a Discrete Cosine Transform
The image data is divided up into 8x8 blocks of pixels. (From this point on, each color component is processed independently, so a "pixel" means a single value, even in a color image.) A DCT is applied to each 8x8 block. DCT converts the spatial image representation into a frequency map: the low-order or "DC" term represents the average value in the block, while successive higher-order ("AC") terms represent the strength of more and more rapid changes across the width or height of the block. The highest AC term represents the strength of a cosine wave alternating from maximum to minimum at adjacent pixels.
The DCT calculation is fairly complex; in fact, this is the most costly step in JPEG compression. The point of doing it is that we have now separated out the high- and low-frequency information present in the image. We can discard high-frequency data easily without losing low-frequency information. The DCT step itself is lossless except for roundoff errors.
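For readers who want to experiment, an orthonormal 2D DCT of one 8x8 block can be computed with SciPy as sketched below. The choice of scipy.fft.dctn is mine for illustration; the JPEG standard does not mandate any particular implementation.

import numpy as np
from scipy.fft import dctn, idctn

block = np.random.rand(8, 8) * 255 - 128   # one 8x8 block, level-shifted to -128..127
coeffs = dctn(block, norm='ortho')         # coeffs[0, 0] is the DC (average) term
restored = idctn(coeffs, norm='ortho')     # inverse DCT; matches block up to roundoff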
Quantize each block
To discard an appropriate amount of information, the compressor divides each DCT output value by a "quantization coefficient" and rounds the result to an integer. The larger the quantization coefficient, the more data is lost, because the actual DCT value is represented less and less accurately. Each of the 64 positions of the DCT output block has its own quantization coefficient, with the higher-order terms being quantized more heavily than the low-order terms (that is, the higher-order terms have larger quantization coefficients). Furthermore, separate quantization tables are employed for luminance and chrominance data, with the chrominance data being quantized more heavily than the luminance data. This allows JPEG to exploit further the eye's differing sensitivity to luminance and chrominance.
It is this step that is controlled by the "quality" setting of most JPEG compressors. The compressor starts from a built-in table that is appropriate for a medium-quality setting and increases or decreases the value of each table
entry in inverse proportion to the requested quality. The complete quantization tables actually used are recorded in the compressed file so that the decompressor will know how to (approximately) reconstruct the DCT coefficients.
Selection of an appropriate quantization table is something of a black art. Most existing compressors start from a sample table developed by the ISO JPEG committee. It is likely that future research will yield better tables that provide more compression for the same perceived image quality. Implementation of improved tables should not cause any compatibility problems, because decompressors merely read the tables from the compressed file; they don't care how the table was picked.
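A minimal sketch of the quantization step follows, using the sample luminance table from Annex K of the JPEG specification (the committee-developed table most compressors start from, as noted above). The scaling rule applied for a given "quality" setting varies between implementations and is omitted here.

import numpy as np

# Sample luminance quantization table from Annex K of the JPEG specification.
Q_LUMA = np.array([
    [16, 11, 10, 16,  24,  40,  51,  61],
    [12, 12, 14, 19,  26,  58,  60,  55],
    [14, 13, 16, 24,  40,  57,  69,  56],
    [14, 17, 22, 29,  51,  87,  80,  62],
    [18, 22, 37, 56,  68, 109, 103,  77],
    [24, 35, 55, 64,  81, 104, 113,  92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103,  99],
])

def quantize(coeffs, table=Q_LUMA):
    # Divide each DCT coefficient by its quantization coefficient and round;
    # this is the step where information is deliberately discarded.
    return np.round(coeffs / table).astype(int)

def dequantize(qcoeffs, table=Q_LUMA):
    # The decompressor can only approximately reconstruct the coefficients.
    return qcoeffs * table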
Encode the resulting coefficients
The resulting coefficients contain a significant amount of redundant data. Huffman compression will losslessly remove the redundancies, resulting in smaller JPEG data. An optional extension to the JPEG specification allows arithmetic encoding to be used instead of Huffman for an even greater compression ratio. (See the section called "JPEG Extensions (Part 1)" below.) At this point, the JPEG data stream is ready to be transmitted across a communications channel or encapsulated inside an image file format.
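The redundancy mentioned here comes largely from long runs of zero-valued high-frequency coefficients once the quantized block is read in zigzag order. The sketch below makes that structure visible; it is an illustrative simplification (for one thing, real JPEG codes the DC term separately), not the standard's exact Huffman symbol layout.

def zigzag_order(n=8):
    # Visit the n x n block along anti-diagonals, alternating direction.
    order = []
    for s in range(2 * n - 1):
        diag = [(i, s - i) for i in range(n) if 0 <= s - i < n]
        if s % 2 == 0:
            diag.reverse()
        order.extend(diag)
    return order

def run_length_zeros(qcoeffs):
    # Collapse runs of zeros into (run_length, value) pairs -- the kind of
    # structure the entropy coder exploits.
    flat = [qcoeffs[i][j] for i, j in zigzag_order(len(qcoeffs))]
    pairs, run = [], 0
    for v in flat:
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    return pairs  # trailing zeros are implied, much like JPEG's end-of-block marker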
JPEG Extensions (Part 1)
What we have examined thus far is only the baseline specification for JPEG. A number of extensions have been defined in Part 1 of the JPEG specification that provide progressive image buildup, improved compression ratios using arithmetic encoding, and a lossless compression scheme. These features are beyond the needs of most JPEG implementations and have therefore been defined as "not required to be supported" extensions to the JPEG standard.
Progressive image buildup
Progressive image buildup is an extension for use in applications that need to receive JPEG data streams and display them on the fly. A baseline JPEG image can be displayed only after all of the image data has been received and decoded. But some applications require that the image be displayed after only some of the data is received. Using a conventional compression method, this means displaying the first few scan lines of the image as it is decoded. In this case, even if the scan lines were interlaced, you would need at least 50 percent of the image data to get a good clue as to the content of the image. The progressive buildup extension of JPEG offers a better solution.
Progressive buildup allows an image to be sent in layers rather than scan lines. But instead of transmitting each bitplane or color channel in sequence (which wouldn't be very useful), a succession of images built up from approximations of the original image are sent. The first scan provides a low-accuracy representation of the entire image--in effect, a very low-quality JPEG compressed image. Subsequent scans gradually refine the image by increasing the effective quality factor. If the data is displayed on the fly, you would first see a crude, but recognizable, rendering of the whole image. This would appear very quickly because only a small amount of data would need to be transmitted to produce it. Each subsequent scan would then further sharpen the whole displayed image until the full quality level was reached.
A limitation of progressive JPEG is that each scan takes essentially a full JPEG decompression cycle to display. Therefore, with typical data transmission rates, a very fast JPEG decoder (probably specialized hardware) would be needed to make effective use of progressive transmission.
A related JPEG extension provides for hierarchical storage of the same image at multiple resolutions. For example, an image might be stored at 250x250, 500x500, 1000x1000, and 2000x2000 pixels, so that the same image file could support display on low-resolution screens, medium-resolution laser printers, and high-resolution imagesetters. The higher-resolution images are stored as differences from the lower-resolution ones, so they need less space than they would need if they were stored independently. This is not the same as a progressive series, because each image is available in its own right at the full desired quality.
The baseline JPEG standard defines Huffman compression as the final step in the encoding process. A JPEG extension replaces the Huffman engine with a binary arithmetic entropy encoder. The use of an arithmetic coder reduces the resulting size of the JPEG data by a further 10 percent to 15 percent over the results that would be achieved by the Huffman coder. With no change in resulting image quality, this gain could be of importance in implementations where enormous quantities of JPEG images are archived.
Arithmetic encoding has several drawbacks:
• Not all JPEG decoders support arithmetic decoding. Baseline JPEG decoders are required to support only the Huffman algorithm.
• The arithmetic algorithm is slower in both encoding and decoding than Huffman.
• The arithmetic coder used by JPEG (called a Q-coder) is owned by IBM and AT&T. (Mitsubishi also holds patents on arithmetic coding.) You must obtain a license from the appropriate vendors if their Q-coders are to be used as the back end of your JPEG implementation.
Lossless JPEG compression
A question that commonly arises is "At what Q factor does JPEG become lossless?" The answer is "never." Baseline JPEG is a lossy method of compression regardless of adjustments you may make in the parameters. In fact, DCT-based encoders are always lossy, because roundoff errors are inevitable in the color conversion and DCT steps. You can suppress deliberate information loss in the downsampling and quantization steps, but you still won't get an exact recreation of the original bits. Further, this minimum-loss setting is a very inefficient way to use lossy JPEG.
The JPEG standard does offer a separate lossless mode. This mode has nothing in common with the regular DCT-based algorithms, and it is currently implemented only in a few commercial applications. JPEG lossless is a form of Predictive Lossless Coding using a 2D Differential Pulse Code Modulation (DPCM) scheme. The basic premise is that the value of a pixel is combined with the values of up to three neighboring pixels to form a predictor value. The predictor value is then subtracted from the original pixel value. When the entire bitmap has been processed, the resulting predictors are compressed using either the Huffman or the binary arithmetic entropy encoding methods described in the JPEG standard.
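A sketch of the 2D DPCM idea described above follows, using one of the standard's defined predictor choices (the average of the left and upper neighbours). The edge handling and the selection among the seven defined predictors are simplified assumptions here, not the standard's exact rules.

def dpcm_residuals(img):
    # img: 2D list/array of pixel values. Predict each pixel from its left (a)
    # and upper (b) neighbours and keep only the prediction error.
    h, w = len(img), len(img[0])
    residuals = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            a = img[y][x - 1] if x > 0 else 0   # left neighbour
            b = img[y - 1][x] if y > 0 else 0   # upper neighbour
            predictor = (a + b) // 2            # JPEG lossless predictor 7
            residuals[y][x] = img[y][x] - predictor
    return residuals  # residuals are then entropy coded (Huffman or arithmetic)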
Lossless JPEG works on images with 2 to 16 bits per pixel, but performs best on images with 6 or more bits per pixel. For such images, the typical compression ratio achieved is 2:1. For image data with fewer bits per pixel, other compression schemes perform better.
JPEG Extensions (Part 3)
The following JPEG extensions are described in Part 3 of the JPEG specification.
Variable quantization is an enhancement available to the quantization procedure of DCT-based processes. This enhancement may be used with any of the DCT-based processes defined by JPEG with the exception of the baseline process.
The process of quantization used in JPEG quantizes each of the 64 DCT coefficients using a corresponding value from a quantization table. Quantization values may be redefined prior to the start of a scan but must not be changed once they are within a scan of the compressed data stream.
Variable quantization allows the scaling of quantization values within the compressed data stream. At the start of each 8x8 block, a quantizer scale factor scales the quantization table values within an image component and matches these values to the AC coefficients stored in the compressed data. Quantization values may then be located and changed as needed.
Variable quantization allows the characteristics of an image to be changed to control the quality of the output based on a given model. The variable quantizer can constantly adjust during decoding to provide optimal output.
The amount of output data can also be decreased or increased by raising or lowering the quantizer scale factor. A maximum size for the resulting JPEG file or data stream can thus be enforced through continual adaptive adjustments made by the variable quantizer.
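As a rough sketch of the mechanism, scaling a quantization table by a per-block quantizer scale factor might look like the following. The clamping range and integer rounding are assumptions made for illustration; the specification defines the exact arithmetic.

import numpy as np

def scale_quant_table(table, scale_factor):
    # Larger scale factors discard more data (smaller output); smaller
    # factors preserve more detail (larger output).
    scaled = np.rint(table * scale_factor)
    return np.clip(scaled, 1, 255).astype(int)  # quantization values stay positive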
The variable quantization extension also allows JPEG to store image data originally encoded using a variable quantization scheme, such as MPEG. For MPEG data to be accurately transcoded into another format, the other format must support variable quantization to maintain a high compression ratio. This extension allows JPEG to support a data stream originally derived from a variably quantized source, such as an MPEG I-frame.
Selective refinement is used to select a region of an image for further enhancement. This enhancement improves the resolution and detail of a region of an image. JPEG supports three types of selective refinement: hierarchical, progressive, and component. Each of these refinement processes differs in its application, effectiveness, complexity, and amount of memory required.
• Hierarchical selective refinement is used only in the hierarchical mode of operation. It allows for a region of a frame to be refined by the next differential frame of a hierarchical sequence.
• Progressive selective refinement is used only in the progressive mode. It allows greater bit resolution of zero and non-zero DCT coefficients in a coded region of a frame.
• Component selective refinement may be used in any mode of operation. It allows a region of a frame to contain fewer colors than are defined in the frame header.
Tiling is used to divide a single image into two or more smaller subimages. Tiling allows easier buffering of the image data in memory, quicker random access of the image data on disk, and the storage of images larger than 64Kx64K samples in size. JPEG supports three types of tiling: simple, pyramidal, and composite.
• Simple tiling divides an image into two or more fixed-size tiles. All simple tiles are coded from left to right and from top to bottom and are contiguous and non-overlapping. All tiles must have the same number of samples and component identifiers and must be encoded using the same processes. Tiles on the bottom and right edges of the image may be smaller than the designated tile size when the image dimensions are not a multiple of the tile size.
• Pyramidal tiling also divides the image into tiles, but each tile is also tiled using several different levels of resolution. The model of this process is the JPEG Tiled Image Pyramid (JTIP), which is a model of how to create a multi-resolution pyramidal JPEG image.
A JTIP image stores successive layers of the same image at different resolutions. The first image stored at the top of the pyramid is one-sixteenth of the defined screen size and is called a vignette. This image is used for quick displays of image contents, especially for file browsers. The next image occupies one-fourth of the screen and is called an imagette. This image is typically used when two or more images must be displayed at the same time on the screen. The next is a low-resolution, full-screen image, followed by successively higher-resolution images and ending with the original image.
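A simple way to build a JTIP-like resolution chain is to halve the image repeatedly, as sketched below with the Pillow library. The number of levels and the use of Pillow are illustrative assumptions, not the JTIP specification itself.

from PIL import Image

def build_pyramid(path, levels=4):
    img = Image.open(path)
    layers = [img]
    for _ in range(levels - 1):
        w, h = layers[-1].size
        # Each successive layer has half the width and height of the last,
        # ending near the small "vignette"-like size described above.
        layers.append(layers[-1].resize((max(1, w // 2), max(1, h // 2))))
    return list(reversed(layers))  # smallest layer first, as in a JTIP pyramid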
Pyramidal tiling typically uses the process of "internal tiling," where each tile is encoded as part of the same JPEG data stream. Tiles may optionally use the process of "external tiling," where each tile is a separately encoded JPEG data stream. External tiling may allow quicker access of image data, easier application of image encryption, and enhanced compatibility with certain JPEG decoders.
• Composite tiling allows multiple-resolution versions of images to be stored and displayed as a mosaic. Composite tiling allows overlapping tiles that may be different sizes and have different scaling factors and compression parameters. Each tile is encoded separately and may be combined with other tiles without resampling.
SPIFF (Still Picture Interchange File Format)
SPIFF is an officially sanctioned JPEG file format that is intended to replace the de facto JFIF (JPEG File Interchange Format) format in use today. SPIFF includes all of the features of JFIF and adds quite a bit more functionality. SPIFF is designed so that properly written JFIF readers will read SPIFF-JPEG files as well.
For more information, see the article about SPIFF.
Other JPEG extensions include the addition of a version marker segment that stores the minimum level of functionality required to decode the JPEG data stream. Multiple version markers may be included to mark areas of the data stream that have differing minimum functionality requirements. The version marker also contains information indicating the processes and extensions used to encode the JPEG data stream.
There are three major graphics formats on the web: GIF, JPEG, and PNG. Of these, PNG has the spottiest support, so one is generally left to choose between the GIF and JPEG formats. There are many other formats in which image files can be saved, but if you use them, many of your web site visitors will likely be unable to view your images.
JPEG is a lossy compression technology, so some information is lost when converting a picture to JPEG. Use this format for most photographs because the images will be smaller and look better than a GIF format picture.
GIF files are better for figures with sharp contrast (such as line drawings, Gantt charts, logos, and buttons). One can also create transparent areas and animations with GIF images. A GIF image has a maximum of 256 colors, however, so images with gradations of color will not look very good.
GIF is a patented file format technology. PNG is an open standard that can be used for many of the applications of GIF images. PNG is better than GIF in most respects, providing more possible colors, alpha-channel transparency, and color-matching features. The PNG format is not as widely supported as GIF, although it is supported (to differing degrees) on version 4 and later browsers.
BMP, or bitmap, files are the native picture format of the Windows operating system. Using them on a web page can cause problems because they cannot be viewed by most browsers. Stay away from using BMP files on the web.
TIFF images have great picture quality but also a very large file size. Most browsers cannot display TIFF images. Use TIFF on your machine to save images for printing or editing; do not use TIFFs on the web.
The GIF image format
GIF stands for Graphics Interchange Format. It is probably the most common image format used on the Web. GIFs have the advantage of usually being very small in size, which makes them fast-loading. Unlike JPEGs, GIFs use lossless compression, which means they make the file size small without losing or blurring any of the image itself.
GIFs also support transparency, which means that they can sit on top of a background image on your web page without having ugly rectangles around them.
Another cool thing that GIFs can do is animation. You can make an animated GIF by drawing each frame of the animation in a graphics package that supports the animated GIF format, then export the animation to a single GIF file. When you include this file in your Web page (with the img tag), your animation will be displayed on the page!
The major disadvantage of GIFs is that they only support up to 256 colours (this is known as 8-bit colour and is a type of indexed colour image). This means they're not good for photographs, or any other image that contains lots of different colours.
Making Fast-Loading GIFs
It's worthwhile making your GIF file sizes as small as possible, so that your Web pages load quickly. People will get very bored otherwise, and probably go to another website!
Most graphics programs let you control various settings when making a GIF image, such as palette size (the number of colours in the image) and dithering. Generally speaking, use the smallest palette size you can. Usually a 32-colour palette produces acceptable results, although for low-colour images you can often get away with 16. Images with lots of colours will of course need a bigger palette - say, 128, or even 256 colours.
[Example images: 8-colour GIF (1292 bytes) vs. 64-colour GIF (2940 bytes)]
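To experiment with palette sizes yourself, Pillow's quantize method is one way to do it. This is a sketch under the assumption that Pillow is available; the input file name and the colour counts are just examples.

from PIL import Image

img = Image.open('photo.png').convert('RGB')
for colours in (16, 32, 128, 256):
    # Reduce the image to a fixed-size palette and save as GIF;
    # fewer colours generally means a smaller file.
    img.quantize(colors=colours).save(f'out_{colours}.gif')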
The JPEG Image Format
JPEG stands for Joint Photographic Experts Group, a bunch of boffins who invented this format to display full-colour photographic images in a portable format with a small file size. Like GIF images, they are also very common on the Web. Their main advantage over GIFs is that they can display true-colour images (up to 16 million colours), which makes them much better for images such as photographs and illustrations with large numbers of colours.
The main disadvantage of the JPEG format is that it is lossy. This means that you lose some of the detail of your image when you convert it to JPEG format. Boundaries between blocks of colour may appear more blurry, and areas with lots of detail will lose their sharpness. On the other hand, JPEGs preserve the full colour depth of the image, which of course is great for high-colour images such as photographs.
JPEGs also can't do transparency or animation - in these cases, you'll have to use the GIF format (or PNG format for transparency).
Making Fast-Loading JPEGs
As with GIFs, it pays to make your JPEGs as small as possible (in terms of bytes), so that your websites load quickly. The main control over file size with JPEGs is called quality, and usually varies from 0 to 100%, where 0% is low quality (but smallest file size), and 100% is highest quality (but largest file size). 0% quality JPEGs usually look noticeably blurred when compared to the original. 100% quality JPEGs are often indistinguishable from the original:
[Example images: low-quality JPEG (4089 bytes) vs. high-quality JPEG (17465 bytes)]
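The quality knob is exposed directly in most libraries; for example, with Pillow (the file names and quality values here are illustrative assumptions):

from PIL import Image

img = Image.open('photo.png').convert('RGB')
for q in (10, 50, 90):
    # Lower quality -> heavier quantization -> smaller, blurrier file.
    img.save(f'out_q{q}.jpg', quality=q)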
The PNG Image Format
PNG is a relatively new invention compared to GIF or JPEG, although it's been around for a while now. (Sadly, some browsers such as IE6 still don't support them fully.) It stands for Portable Network Graphics. It was designed to be an alternative to the GIF file format, but without the licensing issues that were involved in the GIF compression method at the time.
There are two types of PNG: PNG-8 format, which holds 8 bits of colour information (comparable to GIF), and PNG-24 format, which holds 24 bits of colour (comparable to JPEG).
PNG-8 often compresses images even better than GIF, resulting in smaller file sizes. On the other hand, PNG-24 is often less effective than JPEGs at compressing true-colour images such as photos, resulting in larger file sizes than the equivalent quality JPEGs. However, unlike JPEG, PNG-24 is lossless, meaning that all of the original image's information is preserved.
PNG also supports transparency like GIF, but can have varying degrees of transparency for each pixel, whereas GIFs can only have transparency turned on or off for each pixel. This means that whereas transparent GIFs often have jagged edges when placed on complex or ill-matching backgrounds, transparent PNGs will have nice smooth edges.
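The sketch below uses Pillow to give an image a smoothly varying alpha channel, something a GIF's on/off transparency cannot express. The gradient source and the file names are assumptions made for illustration.

from PIL import Image

img = Image.open('logo.png').convert('RGBA')
# Build an alpha mask that fades from fully transparent to fully opaque
# across the image.
alpha = Image.linear_gradient('L').rotate(90).resize(img.size)
img.putalpha(alpha)
img.save('logo_faded.png')  # PNG keeps the per-pixel alpha; GIF could not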
Note that unlike GIF, PNG-8 does not support animation.
One important point about PNG: Earlier browsers don't recognise them. If you want to ensure your website is viewable by early browsers, use GIFs or JPEGs instead.
[Example images: 16-colour PNG-8 (6481 bytes) vs. full-colour PNG-24 (34377 bytes)]
Summary of image formats
This summary captures the key differences between the GIF, JPEG, PNG-8, and PNG-24 image formats.
GIF: Better for clipart and drawn graphics with few colours, or large blocks of colour. Can only have up to 256 colours. Images are "lossless" - they contain the same amount of information as the original (but with only 256 colours). Can be animated. Can have transparent areas.
JPEG: Better for photographs with lots of colours or fine colour detail. Can have up to 16 million colours. Images are "lossy" - they contain less information than the original. Cannot be animated. Cannot have transparent areas.
PNG-8: Better for clipart and drawn graphics with few colours, or large blocks of colour. Can only have up to 256 colours. Images are "lossless" - they contain the same amount of information as the original (but with only 256 colours). Cannot be animated. Can have transparent areas.
PNG-24: Better for photographs with lots of colours or fine colour detail. Can have up to 16 million colours. Images are "lossless" - they contain the same amount of information as the original. Cannot be animated. Can have transparent areas.
Image or Graphic?
Technically, neither. If you really want to be strict, computer pictures are files, the same way WORD documents or solitaire games are files. They're all a bunch of ones and zeros all in a row. But we do have to communicate with one another so let's decide.
Image. We'll use "image". That seems to cover a wide enough topic range.
I went to my reference books and there I found that "graphic" is more of an adjective, as in "graphic format." You see, we denote images on the Internet by their graphic format. GIF is not the name of the image; GIF refers to the compression scheme behind the raster format set up by CompuServe. (More on that in a moment.)
So, they're all images unless you're talking about something specific.
44 Different Graphic Formats?
It does seem like a big number, doesn't it? In reality, there are not 44 different graphic format names. Many of the 44 are different versions under the same compression umbrella, interlaced and non-interlaced GIF, for example.
Before getting into where we get all 44 - and there are more than that, even - let me back-pedal for a moment.
There actually are only two basic methods for a computer to render, or store and display, an image. When you save an image in a specific format you are creating either a raster or meta/vector graphic format. Here's the lowdown:
Raster image formats (RIFs) should be the most familiar to Internet users. A Raster format breaks the image into a series of colored dots called pixels. The number of ones and zeros (bits) used to create each pixel denotes the depth of color you can put into your images.
If your pixel is denoted with only one bit-per-pixel then that pixel must be black or white. Why? Because that pixel can only be a one or a zero, on or off, black or white.
Bump that up to 4 bits-per-pixel and you're able to set that colored dot to one of 16 colors. If you go even higher to 8 bits-per-pixel, you can save that colored dot at up to 256 different colors.
Does that number, 256, sound familiar to anyone? That's the upper color level of a GIF image. Sure, you can go with fewer than 256 colors, but you cannot have more than 256.
That's why a GIF image doesn't work overly well for photographs and larger images. There are a whole lot more than 256 colors in the world. Images can carry millions. But if you want smaller icon images, GIFs are the way to go.
Raster image formats can also save at 16, 24, and 32 bits-per-pixel. At the two highest levels, the pixels themselves can carry up to 16,777,216 different colors. The image looks great! Bitmaps saved at 24 bits-per-pixel are great quality images, but of course they also run about a megabyte per picture. There's always a trade-off, isn't there?
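The colour counts quoted above fall straight out of powers of two, and the megabyte figure is simple arithmetic, as this small sketch shows (the 640x480 frame size is just an assumed example):

for bpp in (1, 4, 8, 16, 24):
    print(bpp, 'bits per pixel ->', 2 ** bpp, 'possible colours')

# An uncompressed 24-bit image at 640x480:
print(640 * 480 * 3 / 1_000_000, 'MB')   # roughly 0.9 MB -- "about a megabyte"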
The three main Internet formats, GIF, JPEG, and Bitmap, are all Raster formats.
Some other Raster formats include the following:
CLP Windows Clipart
DCX ZSoft Paintbrush
DIB OS/2 Warp format
FPX Kodak's FlashPic
IMG GEM Paint format
JIF JPEG Related Image format
MSP Microsoft Paint format
PCT Macintosh PICT format
PCX ZSoft Paintbrush
PPM Portable Pixel Map (UNIX)
PSP Paint Shop Pro format
RAW Unencoded image format
RLE Run-Length Encoding (Used to lower image bit rates)
TIFF Aldus Corporation format
WPG WordPerfect image format
Pixels and the Web
Since I brought up pixels, I thought now might be a pretty good time to talk about pixels and the Web. How much is too much? How many is too few?
There is a delicate balance between the crispness of a picture and the number of pixels needed to display it. Let's say you have two images, each is 5 inches across and 3 inches down. One uses 300 pixels to span that five inches, the other uses 1500. Obviously, the one with 1500 uses smaller pixels. It is also the one that offers a more crisp, detailed look. The more pixels, the more detailed the image will be. Of course, the more pixels the more bytes the image will take up.
So, how much is enough? That depends on whom you are speaking to, and right now you're speaking to me. I always go with 100 pixels per inch. That creates a ten-thousand-pixel square inch. I've found that allows for a pretty crisp image without going overboard on the bytes. It also allows some leeway to increase or decrease the size of the image and not mess it up too much.
The lowest I'd go is 72 pixels per inch, the agreed-upon low end of the image scale. In terms of pixels per square inch, it's a whale of a drop to 5184. Try that. See if you like it, but I think you'll find that lower-definition monitors really play havoc with the image.
Meta/Vector Image Formats
You may not have heard of this type of image formatting, not that you had heard of Raster, either. This formatting falls into a lot of proprietary formats, formats made for specific programs. CorelDraw (CDR), Hewlett-Packard Graphics Language (HGL), and Windows Metafiles (EMF) are a few examples.
Where the Meta/Vector formats have it over Raster is that they are more than a simple grid of colored dots. They're actual vectors of data stored in mathematical formats rather than bits of colored dots. This allows for a strange shaping of colors and images that can be perfectly cropped on an arc. A squared-off map of dots cannot produce that arc as well. In addition, since the information is encoded in vectors, Meta/Vector image formats can
be blown up or down (a property known as "scalability") without looking jagged or crowded (a property known as "pixelating"). So that I do not receive e-mail from those in the computer image know, there is a difference in Meta and Vector formats. Vector formats can contain only vector data whereas Meta files, as is implied by the name, can contain multiple formats. This