L28:Lower Power Algorithm for Multimedia Systems(2)

25
L28:Lower Power Algorithm for Multimedia Systems(2) 1999. 8 성성성성성성 성 성 성 http://vada.skku.ac.kr

description

L28:Lower Power Algorithm for Multimedia Systems(2). 1999. 8 성균관대학교 조 준 동 http://vada.skku.ac.kr . Low Power Video Processor. Uzi Zangi, Technion - VLSI Systems Research Center, 1997 Asynchronous logic to save power - PowerPoint PPT Presentation

Transcript of L28:Lower Power Algorithm for Multimedia Systems(2)

Page 1: L28:Lower Power Algorithm for Multimedia Systems(2)

L28:Lower Power Algorithmfor Multimedia Systems(2)

1999. 8 성균관대학교 조 준 동

http://vada.skku.ac.kr

Page 2: L28:Lower Power Algorithm for Multimedia Systems(2)

Low Power Video Processor

Uzi Zangi, Technion - VLSI Systems Research Center, 1997

Asynchronous logic to save power Didn’t work because:Slow design (13.5MHz) &Small circ

uit (<100K gates) : clock load is small.Adding Async. control costs more then clocking.

Gated clock Didn’t work because:

Frequency is very low (13.5MHz). Register activity is very high. No need for clock tree.

Page 3: L28:Lower Power Algorithm for Multimedia Systems(2)

Minimizing bus switching Transfer the value or it’s negative on the bus, according to the

minimum number of toggle bits. Add one bit that will indicate the polarity of the bus. Good for buses with:

large number of bits (more than 10). High capacitance (more then 2pF). High toggle activity (more then 1/2).

Overheads: Routing of one more bit. Extra logic for the decision (timing, area).

Page 4: L28:Lower Power Algorithm for Multimedia Systems(2)

Minimizing bus switching (Cont.)Didn’t work because:

Largest bus is 8bit.Capacitance less than 1pF.Toggle activity not very high.

Block Adecision

unitCx

Block B

nnBus (Ct)

E linen slice

n

Page 5: L28:Lower Power Algorithm for Multimedia Systems(2)

Power Reduction in InfoPadApproach Power

ReductionComments

Voltage Scaling x21 1.1V vs 5VOptimized Cell Lib. x3-4 TR sizing, Reduced swing

and self-timed FIFO…Gated Clocks x2-3 error checking for

address onlyBlock decoding x8 enabling only one block in

the SRAMAlgorithm Selection x5-10 VQ vs DCTBit swing reduction x3.7 1.1V vs 300mV in

memory

Page 6: L28:Lower Power Algorithm for Multimedia Systems(2)

Power Management by Gated Clock

• Power Management Scheme by Enabling Clock

• Power Management Scheme by adding Clock Generation block

block 1

block 1

block 1

enable 1

enable 3

enable 2

c lk

block 1

block 1

block 1

c lk

enable 1

enable 3

enable 2

c lock management

Page 7: L28:Lower Power Algorithm for Multimedia Systems(2)

Method That Works: Pixel Differentials

Pixel value area locality. This is exploited most heavily in compression (save on sto

rage and transmission). Most of the functions are linear, able to work on differenc

es. The entire algorithm was rewritten (interpolations, filters,

matrices, etc.) New algorithm differs from original by no more then 1 lsb bit per pixel.

Page 8: L28:Lower Power Algorithm for Multimedia Systems(2)

MethodologyC++

SimulatorAlgorithm Image

Image

Compare

VerilogSimulatorRTL

Synopsys Netlist P&RCadence Opus

SpiceNetlist

EpicPowermill

Currents,power

Image

0.35 LibCompass

Page 9: L28:Lower Power Algorithm for Multimedia Systems(2)

Pixel Difference

0

2

4

6

8

10

12

0.00% 20.00% 40.00% 60.00% 80.00% 100.00%Pixel Differential

Cur

rent

Register Current [mA]

Logic Current [mA]

Total Current [mA]

Page 10: L28:Lower Power Algorithm for Multimedia Systems(2)

Pixel Differentials Algorithm Results

Number of Pixels

Differential Pixels

Differential Ratio

Register Current (mA)

Logic Current (mA)

Total Current (mA)

Current Saving

Power Saving

3600 0 0.00% 3.8 6.8 10.63646 424 11.63% 1.5 3.6 5.1 52% 77%3646 616 16.90% 1.4 3.22 4.62 56% 81%3190 1536 48.15% 1.02 2.2 3.22 70% 91%3494 2730 78.13% 0.82 1.21 2.03 81% 96%3190 3116 97.68% 0.8 1.16 1.96 82% 97%

Page 11: L28:Lower Power Algorithm for Multimedia Systems(2)

Summary Attempted to save power on a battery-operated chip by app

lication specific algorithmic/architectural techniques: Async. Logic, Gated clock, Minimizing bus switching.

All Attempts failed. These methods may still apply to very large, very fast chips, and on variable load application.

Successfully applied an algorithmic change, inspired by image compression. It may not work on non-compressible data but works exceptionally well on images.

Easily saved 80% power, potentially can save more than 90%.

Page 12: L28:Lower Power Algorithm for Multimedia Systems(2)

A SINGLE-CHIP DIGITAL CAMERAH. Teresa H. Meng, “Low-Power Wireless Video System” , IEEE Communication Magazine, June, 1998

◈ Given the recent development in CMOS RF transceiver design, wireless transmission at a bandwidth in excess of 10Mb/s will soon become possible using next-generation CMOS technology.

◈ The design of a low-power large-scale parallel MPEG2 encoder architecture to be used in a single-chip digital CMOS video camera.

◈ The single-chip digital camera architecture includes a 640 x 480 array of CMOS photo diodes, embedded DRAM for storing four frames of color data, and parallel array processor for video signal processing

◈ The parallel processor architecture is designed to implement highly computationally intensive image and video processing tasks such as color conversion , discrete cosine transform(DCT), and motion estimation for MPGE2.

Page 13: L28:Lower Power Algorithm for Multimedia Systems(2)

A SINGLE-CHIP DIGITAL CAMERAC MO S photo sensors

Emnbedded DRAM (pixel memory)Parallel video

processors

C olume proc essor 40

C olume proc essor 39

C olume processor 2

C olume processor 116 c olume x 480 pixels

480 pixels

640

pixe

ls

S ilicon surface

Sideview

Topview

Page 14: L28:Lower Power Algorithm for Multimedia Systems(2)

A SINGLE-CHIP DIGITAL CAMERA

Module/ operation

External I/ O access

8 x 128 x 16 SRAM (write)

8 x 128 x 126 SRAM (read)

Latch

Multiplier

C arry- selector adder

Word size

16 bits

16 bits

16 bits

16 bits

16 bits

16 bits

Energy/ op(pJ )

160

180

80

4

64

18

Normalized to adder

10

9

4.4

0.22

3.6

1

Energy per operation at a 1.5V supply in 0.8m CMOS technology

Page 15: L28:Lower Power Algorithm for Multimedia Systems(2)

A SINGLE-CHIP DIGITAL CAMERA◈ Design Consideration

The proposed architecture considers three algorithms commonly used in video coding standards : red-green-blue(RGB)-to-yellow-ultraviolet (YUV) conversion, discrete cosign transform(DCT), and motion estimation

To reduce power consumption, as many parallel processors as practically feasible should be used to reduce the clock frequency, because a reduced clock frequency implies a lower supply voltage.

For MPEG-2 encoding, the computational demand required for motion estimation(1.6 BOPS for 30 frames/s based on the algorithm proposed by Chalidabhongese and Kuo) limits the number of columns in each processor domain to 16, because otherwise the required clock speed for each processor would be too high for a low-power design

Page 16: L28:Lower Power Algorithm for Multimedia Systems(2)

A SINGLE-CHIP DIGITAL CAMERA

◈ PERFORMANCE

In order to sustain this computational demand, each processor is required to run at a clock frequency equal to or higher than 40 MHz.

When implemented in a 0.2 CMOS technology, a 1V supply voltage should be more than enough to support a 40MHz operation

Under these condition, this parallel processor architecture delivers a processing of 1.6 BOPS with a power consumption of 40mW

Page 17: L28:Lower Power Algorithm for Multimedia Systems(2)

Vector Quantization

• Lossy compression technique which exploits the correlation that exists between neighboring samples and quantizes samples together

Page 18: L28:Lower Power Algorithm for Multimedia Systems(2)

Complexity of VQ EncodingThe distortion metric between an input vector X anda codebook vector C_i is computed as follows:

Three VQ encoding algorithms will be evaluated: full search, tree search and differential codebook tree-search.

Page 19: L28:Lower Power Algorithm for Multimedia Systems(2)

Full Search

• Brute-force VQ: the distortion between the input vector and every entry in the code-book is computed, and the codeindex that corresponds to the minimum distortion is determined and sent over to the decoder.

• For each distortion computation, there are 16 8-bit memory accesses (to fetch the entries in the codeword), 16 subtractions, 16 multiplications, 15 additions. In addition, the minimum of 256 distortion values, which involves 255 comparison operations, must be determined.

Page 20: L28:Lower Power Algorithm for Multimedia Systems(2)

Tree-structured Vector Quantization

If for example at level 1, the input vector iscloser to the left entry, then the right portion of the tree is never compared below level 2 and an index bit 0 istransmitted.

Here only 2 x log 2 256 = 16 distortion calculations with 8 comparisons

Page 21: L28:Lower Power Algorithm for Multimedia Systems(2)

Algorithmic Optimization• Minimizing the number of operations

– example• video data stream using the v

ector quantization (VQ) algorithm

• distortion metric

– Full search VQ• exhaustive full-search• distortion calculation : 256• value comparison : 255

15

0

2

jijji CXD

– Tree-structured VQ• binary tree-search• some performance

degradation• distortion calculation : 16 ( 2

x log2 256 )• value comparison : 8

1

2 2

3 3 3 3

8 8

0

0 1 0 1

1

Page 22: L28:Lower Power Algorithm for Multimedia Systems(2)

Differential Codebook Tree-structure Vector Quantization

• The distortion difference b/w the left and right node needs to be computed. This equation can be manipulated to reduce the number of operations

.

Page 23: L28:Lower Power Algorithm for Multimedia Systems(2)

Algorithmic Optimization

– Differential codebook tree-structure VQ• modify equation for optimizing operations

algorithm # ofmem.

accessfull searchtree searchdifferentialtree search

# ofmul.

# ofadd.

# ofsub

4096 4096 3840 4096256 256 240 264

136 128 128 0

15

0

15

0,,

2,

2,

15

0

15

0

2,

2,

2j j

jleftjrightjjrightjleft

j jjrightjjleftjrightleft

CCXCX

CXCXD

Page 24: L28:Lower Power Algorithm for Multimedia Systems(2)

Multiplication with Constants

• Techniques and tools have been developed to scale coefficients so as to minimize the number of 1’s in the coefficients so as to minimize the number of shift-add operations.

Page 25: L28:Lower Power Algorithm for Multimedia Systems(2)

Gated clocks to shut down modules when not used.