MP3 Optimization Exploiting Processor Architecture and Using Better Algorithms
Mancia Anguita
Universidad de Granada
J. Manuel Martinez-Lechado
Vitelcom Mobile Technology
Abstract
An application’s execution time depends on the processor architecture and clock frequency, the computational complexity of the algorithm, the choice of compiler and optimization options, and how well the programmer explicitly and implicitly exploits the processor architecture. This article quantifies the influence of these factors for an MP3 decoder through experimental results.
Outline
What’s the problem?
MP3 decoder overview
MP3 decoder implementations
Performance comparison
Experiment results
Conclusion
What’s the problem?
What factors can influence an application’s execution time?
  The executing processor’s architecture and clock frequency
  The computational complexity of the algorithm
  The compiler
  The programmer’s skill
But how much influence do these factors exert on overall performance?
MP3 decoder overview (1)
MP3 decoder overview (2)
Preprocessing
  Finds frames in the bitstream
  Extracts their compressed audio data and information (Huffman tables, scale factors)
Requantization
  Reconstructs the original frequency line samples $xr_i$ using the scale factors extracted during preprocessing
  $xr_i = \operatorname{sign}(is_i)\,|is_i|^{4/3} \times 2^{C_j/4}$
MP3 decoder overview (3)
Huffman decoding
  Huffman encoding is a lossless coding scheme
  The decoding process is based on several Huffman tables that map Huffman codes to symbols
    17 different tables in total
  The significant parts of the processing
    Handling the compressed audio bitstream
    Searching the Huffman tables
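To make the two costs named above concrete (bitstream handling and table searching), here is a small sketch of bit-by-bit Huffman decoding in C. Everything in it is an assumption for illustration: the `BitReader` and `HuffNode` types, the function names, and the four-symbol toy code, which is not one of the MP3 tables.

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal MSB-first bit reader over a byte buffer. */
typedef struct {
    const uint8_t *data;
    size_t bit_pos;
} BitReader;

int read_bit(BitReader *br)
{
    int bit = (br->data[br->bit_pos >> 3] >> (7 - (br->bit_pos & 7))) & 1;
    br->bit_pos++;
    return bit;
}

/* One node of a binary Huffman tree stored in an array.
 * symbol >= 0 marks a leaf; otherwise child[b] is the next node for bit b. */
typedef struct {
    int child[2];
    int symbol;
} HuffNode;

/* Toy code (NOT an MP3 table): 0 -> 0, 10 -> 1, 110 -> 2, 111 -> 3. */
const HuffNode toy_tree[] = {
    { {1, 2}, -1 },  /* 0: root */
    { {0, 0},  0 },  /* 1: leaf, code 0 */
    { {3, 4}, -1 },  /* 2: prefix 1 */
    { {0, 0},  1 },  /* 3: leaf, code 10 */
    { {5, 6}, -1 },  /* 4: prefix 11 */
    { {0, 0},  2 },  /* 5: leaf, code 110 */
    { {0, 0},  3 },  /* 6: leaf, code 111 */
};

/* Decode one symbol by walking the tree one bit at a time. */
int decode_symbol(const HuffNode *tree, BitReader *br)
{
    int node = 0;
    while (tree[node].symbol < 0)
        node = tree[node].child[read_bit(br)];
    return tree[node].symbol;
}
```

Each decoded symbol costs one memory access and one branch per code bit, which is why faster search strategies are attractive for this stage.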
MP3 decoder overview (4)
Reordering
  The encoder reorders short blocks to make the Huffman coding more efficient
  The decoder reverses this reordering
Stereo decoding
  Used to exploit redundancies between the different stereo channels
  In single-channel or dual-channel mode, no stereo processing is necessary
MP3 decoder overview (5)
Alias reduction
  Necessary to negate the alias effects of the polyphase filter bank in the encoder
  Consists of eight butterfly calculations for each pair of adjacent subbands
IMDCT
MP3 decoder overview (6)
Frequency inversion
  To compensate for frequency inversions, this stage negates every odd sample in all odd subbands
Synthesis polyphase filter bank
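The frequency inversion stage described on this slide is small enough to show in full. A minimal sketch, assuming the IMDCT output is held as a 32 × 18 array indexed `[subband][sample]`; the function name is hypothetical.

```c
#define SUBBANDS 32
#define SAMPLES_PER_SUBBAND 18

/* Negate every odd-indexed sample in every odd-indexed subband. */
void frequency_inversion(double s[SUBBANDS][SAMPLES_PER_SUBBAND])
{
    for (int sb = 1; sb < SUBBANDS; sb += 2)
        for (int i = 1; i < SAMPLES_PER_SUBBAND; i += 2)
            s[sb][i] = -s[sb][i];
}
```

Only 16 × 9 = 144 of the 576 samples are touched, so this stage is cheap compared with the IMDCT and the synthesis filter bank.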
MP3 decoder implementations (1)
Standard version
  Implements MP3 following the documentation
  Uses only the tables specified in the standard
Basic version
  Improves on the standard version
  Replaces some instructions with others that take fewer clock cycles
    e.g., replaces floating-point divisions with multiplications and some integer multiply instructions with shifts
  Replaces computationally intensive library functions with tables
  Uses library functions that rely on special processor instructions in place of slower high-level programmer code
  Uses loop unrolling to improve some loops
MP3 decoder implementations (2)
SIMD version
  Improves on the basic version using SIMD extensions
  MP3 is based on vector operations, so it can benefit from SIMD instructions
    Requantization, stereo processing, IMDCT, and synthesis filter bank
  Uses SIMD to improve memory initializations and block transfers
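As an illustration of how one of the stages named above maps onto SIMD, the sketch below converts mid/side stereo to left/right four samples at a time with SSE intrinsics. It is not the authors' code: the function name `ms_stereo` and the in-place buffer layout are assumptions, and a scalar loop is used when SSE is not available at compile time.

```c
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE 1
#else
#define HAVE_SSE 0
#endif

/* Mid/side to left/right: L = (M + S) / sqrt(2), R = (M - S) / sqrt(2).
 * On entry left[] holds M and right[] holds S; both are overwritten. */
void ms_stereo(float *left, float *right, size_t n)
{
    const float inv_sqrt2 = 0.70710678f;
    size_t i = 0;
#if HAVE_SSE
    const __m128 k = _mm_set1_ps(inv_sqrt2);
    for (; i + 4 <= n; i += 4) {
        __m128 m = _mm_loadu_ps(left + i);   /* four M values */
        __m128 s = _mm_loadu_ps(right + i);  /* four S values */
        _mm_storeu_ps(left + i, _mm_mul_ps(_mm_add_ps(m, s), k));
        _mm_storeu_ps(right + i, _mm_mul_ps(_mm_sub_ps(m, s), k));
    }
#endif
    for (; i < n; i++) {  /* scalar tail, or the whole loop without SSE */
        float m = left[i], s = right[i];
        left[i] = (m + s) * inv_sqrt2;
        right[i] = (m - s) * inv_sqrt2;
    }
}
```

The unaligned load and store intrinsics keep the sketch safe for any buffer; with 16-byte aligned buffers the aligned variants could be used instead.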
MP3 decoder implementations (3)
Algorithm version
  Improves on the basic version with better algorithms
  Synthesis polyphase filter bank
    Konstantinides’ method reduces the number of operations by transforming the matrixing operation into a 32-point DCT and some reordering operations
  IMDCT
    Marovich’s method reduces the IMDCT to a fast DCT and some data copying operations
  Huffman decoding
    A tree-clustering algorithm can speed up the search process
MP3 decoder implementations (4)
Algorithm-SIMD version
  Based on the SIMD version, combined with SIMD implementations of the improved IMDCT and synthesis algorithms and with clustering Huffman decoding
Performance comparison (0)
Optimization options
Performance comparison (1)
O2
  Includes classical optimizations that are processor independent
  Includes inline function expansion
G6
  Optimizes code for Pentium Pro, PII, and PIII, generating code that is compatible with earlier processors
G7
  Optimizes code for Pentium IV, generating code that is compatible with earlier processors
QxK
  Allows vectorization using the SSE and MMX instructions included in PIII and P4
Arch:SSE
  Uses SSE and cmov instructions
Performance comparison (2)
Test platform
Test MP3 file
Note
  We measure processor clock cycles instead of time, so the results are independent of the processor clock frequency
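One common way to count cycles rather than seconds on x86 is to read the time-stamp counter around the code under test. The sketch below assumes GCC or Clang (`__rdtsc()` from `<x86intrin.h>`) and falls back to `clock()` elsewhere; the slides do not say which measurement method the authors used, and `dummy_stage` is a hypothetical stand-in for a decoder stage.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
static inline uint64_t read_cycles(void) { return __rdtsc(); }
#else
#include <time.h>
/* Fallback for non-x86 builds: clock ticks, not processor cycles. */
static inline uint64_t read_cycles(void) { return (uint64_t)clock(); }
#endif

/* Hypothetical workload standing in for a decoder stage. */
double dummy_stage(int n)
{
    double acc = 0.0;
    for (int i = 0; i < n; i++)
        acc += (double)i * 0.5;
    return acc;
}

/* Return the counter ticks spent in one call of dummy_stage(n). */
uint64_t measure_stage(int n, double *result)
{
    uint64_t start = read_cycles();
    *result = dummy_stage(n);
    uint64_t end = read_cycles();
    return end - start;
}
```

On many recent processors the time-stamp counter advances at a constant rate regardless of the current core frequency, so treating its ticks as core clock cycles is an approximation there.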
Experiment results (1)
Experiment results (2)
Experiment results (3)
Conclusion
Exploiting architecture features can be as important as choosing the right algorithms
Programmers can exploit architecture features to a higher degree than compilers
Optimization choice depends on the application
[Figure: synthesis polyphase filter bank data flow. Subband samples (32 subbands x 18 samples) pass through the DCT to give 64 values (0 to 63), which are pushed into a FIFO of 16 x 64 = 1024 samples. Sixteen 32-sample blocks of the FIFO form the 512-value U vector, which is multiplied by the D window to give w0 to w15; each of the 32 output samples is Sum(w0 ~ w15).]