Multi-core SOC for Future Media Processing


Qin Xing, Yan Xiaolang

The Institute of VLSI Design, Zhejiang University

The Institute of VLSI Design, Zhejiang Univ. 223/4/19

Outline

Opportunities & challenges from media processing

Multimedia algorithm characteristics & mapping

Multi-core SOC architecture & technology

Benchmarking results

Project status

Future work


Opportunities

Video conference, IP phone, smart terminal, PDA, video camera, HDTV, set-top box, …


Challenges: multiple standards

[Figure: bit rate (Mbit/s, axis 0 to 6) versus year (1994 to 2005) for successive encoder generations, from the 1st MPEG-2 encoder through the 2nd, 3rd, 4th, and 5th generation encoders. Standards shown: MPEG-2, MPEG-4, H.263, H.26L, H.264 / MPEG-4 Part 10, AVS, WMV, VP3.]


Challenges: demanding hardware requirements

Very high computational complexity: H.264 encoding of 720 x 576 pixels @ 30 frames/s needs up to 30 GOPS

Multiple standards co-exist, which demands flexibility and programmability

Low power

Low cost

Best choice: application-specific instruction-set processor (ASIP)
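To put the 30 GOPS figure in per-pixel terms, the arithmetic can be sketched as below. The function names are illustrative and not from the slides; the only inputs taken from the slide are the frame size, the frame rate, and the 30 GOPS upper bound.

```c
#include <assert.h>

/* Pixels processed per second for a given frame size and frame rate. */
static long pixels_per_second(int width, int height, int fps)
{
    assert(width > 0 && height > 0 && fps > 0);
    return (long)width * height * fps;
}

/* Average operation budget per pixel for a given total operation rate. */
static double ops_per_pixel(double ops_per_second, long pixel_rate)
{
    assert(pixel_rate > 0);
    return ops_per_second / (double)pixel_rate;
}
```

Calling pixels_per_second(720, 576, 30) gives 12,441,600 pixels/s, so 30 GOPS corresponds to roughly 2,400 operations per pixel, which is the scale of workload the processor has to sustain.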


Multimedia algorithm characteristics: outer loop and inner loop

Outer loop: interface (GUI), OS (Linux), bit-stream parsing (pack/unpack, VLC, CABAC), data transferring

Inner loop: regular algorithms (prediction, FIR, DCT, motion estimation)

[Figure: the outer loop (interface, operating system, data transferring, bitstream parsing) encloses the inner loop (filtering, 2D transform, block add/sub).]


Multimedia algorithm mapping

Programmable, heterogeneous processors are the preferred choice for the implementation:

General MCU (RISC core): outer loop

Enhanced DSP (EDSP, with added bit-wise operations): outer loop

Vector processor (VP, VLIW + SIMD): inner loop

[Figure: General MCU, Enhanced DSP, Vector Processor.]
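The mapping can be pictured as a static task-to-core table. The sketch below is only an illustration: all type and function names are hypothetical, and the split of outer-loop tasks between the MCU and the EDSP is an assumption (the slides say only that both serve the outer loop; placing bit-stream parsing on the EDSP is suggested by its bit-wise operations).

```c
#include <assert.h>

/* The three processor types named on the slide. */
typedef enum { CORE_MCU, CORE_EDSP, CORE_VP } core_t;

/* Tasks named on the "algorithm characteristics" slide. */
typedef enum {
    TASK_GUI,
    TASK_OS,
    TASK_DATA_TRANSFER,
    TASK_BITSTREAM_PARSING,
    TASK_PREDICTION,
    TASK_FIR,
    TASK_DCT,
    TASK_MOTION_ESTIMATION,
    TASK_COUNT
} task_t;

/* Assumed assignment: control-style outer-loop work on the MCU,
 * bit-level outer-loop work on the EDSP, regular inner-loop kernels on the VP. */
static const core_t task_to_core[TASK_COUNT] = {
    [TASK_GUI]               = CORE_MCU,
    [TASK_OS]                = CORE_MCU,
    [TASK_DATA_TRANSFER]     = CORE_MCU,
    [TASK_BITSTREAM_PARSING] = CORE_EDSP,
    [TASK_PREDICTION]        = CORE_VP,
    [TASK_FIR]               = CORE_VP,
    [TASK_DCT]               = CORE_VP,
    [TASK_MOTION_ESTIMATION] = CORE_VP
};

/* Look up which core a task is mapped to. */
static core_t core_for_task(task_t t)
{
    assert((int)t >= 0 && (int)t < TASK_COUNT);
    return task_to_core[t];
}

/* Inner-loop tasks are exactly those mapped to the vector processor. */
static int is_inner_loop(task_t t)
{
    return core_for_task(t) == CORE_VP;
}
```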


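Multi-core SOC architecture: top level

[Figure: top-level block diagram. Blocks shown: CK520 with IM and DM; the media processing kernel, containing the EDSP (IM, DM), the VP (IM, DM), and a DMA; AHB master and AHB slave ports; interfaces (I/F); memory controller (Mem Ctrl); AMBA AHB and AMBA APB buses.]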


Inside the media processing kernel

[Figure: block diagram of the media processing kernel. Blocks shown: E-DP; V-DP1 to V-DP4; GDM; V-DM1 to V-DM4; GTM; 2D crossbar connection network; GAG1 to GAG4; EDSP control path; vector control path; DMA and off-chip memories.]


Technologies: specialized instruction set

Integer IDCT in H.264

C reference:

for (j = 0; j < BLOCK_SIZE; j++) {
    for (i = 0; i < BLOCK_SIZE; i++) {
        m5[i] = img->cof[i0][j0][i][j];
    }
    m6[0] = (m5[0] + m5[2]);
    m6[1] = (m5[0] - m5[2]);
    m6[2] = (m5[1] >> 1) - m5[3];
    m6[3] = m5[1] + (m5[3] >> 1);
}

Intel MMX: 13 cycles

__asm {
    mov      edx, mptr
    movdqu   xmm1, [edx]
    packssdw xmm1, xmm1        // read m5[0] from memory to xmm1
}
// m5[1] and m5[2] are loaded into xmm2 and xmm3 in the same way (not shown on the slide)
__asm {
    movdqu   xmm4, [edx + 48]
    packssdw xmm4, xmm4        // read m5[3] from memory to xmm4
}
__asm {
    movq     xmm5, xmm1
    psubw    xmm1, xmm3        // m6[1] = (m5[0] - m5[2]);
    paddw    xmm3, xmm5        // m6[0] = (m5[0] + m5[2]);
    movq     xmm5, xmm2
    psraw    xmm2, 1
    psubw    xmm2, xmm4        // m6[2] = (m5[1] >> 1) - m5[3]
    psraw    xmm4, 1
    paddw    xmm4, xmm5        // m6[3] = m5[1] + (m5[3] >> 1)
}

Our IS: 6 cycles
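The C fragment on this slide shows only the first butterfly stage of one 1-D pass. As a plain scalar sketch, a complete 4x4 inverse transform could look like the following. The second butterfly stage and the final (x + 32) >> 6 rounding come from the H.264 standard rather than from the slide, and the function names and the plain int[4][4] block (in place of img->cof) are illustrative assumptions.

```c
#include <assert.h>

#define BLOCK_SIZE 4

/* One 1-D 4-point inverse pass: the slide's first stage (m6),
 * followed by the second butterfly stage from the H.264 standard. */
static void inverse_1d(const int m5[BLOCK_SIZE], int out[BLOCK_SIZE])
{
    int m6[BLOCK_SIZE];

    assert(m5 && out);
    m6[0] = m5[0] + m5[2];
    m6[1] = m5[0] - m5[2];
    m6[2] = (m5[1] >> 1) - m5[3];   /* assumes arithmetic right shift */
    m6[3] = m5[1] + (m5[3] >> 1);

    out[0] = m6[0] + m6[3];
    out[1] = m6[1] + m6[2];
    out[2] = m6[1] - m6[2];
    out[3] = m6[0] - m6[3];
}

/* 2-D inverse transform, in place: rows first, then columns,
 * with the standard's rounding applied after the column pass. */
static void inverse_transform_4x4(int blk[BLOCK_SIZE][BLOCK_SIZE])
{
    int in[BLOCK_SIZE], out[BLOCK_SIZE];
    int i, j;

    for (j = 0; j < BLOCK_SIZE; j++) {
        inverse_1d(blk[j], out);
        for (i = 0; i < BLOCK_SIZE; i++)
            blk[j][i] = out[i];
    }
    for (i = 0; i < BLOCK_SIZE; i++) {
        for (j = 0; j < BLOCK_SIZE; j++)
            in[j] = blk[j][i];
        inverse_1d(in, out);
        for (j = 0; j < BLOCK_SIZE; j++)
            blk[j][i] = (out[j] + 32) >> 6;
    }
}
```

The transform uses only additions, subtractions, and shifts, which is why it packs well into SIMD lanes and why a dedicated butterfly instruction can cut the cycle count.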


Technologies: instruction merging

6-tap sub-pixel interpolation

result = 0;
pres_y = dy == 1 ? y_pos : y_pos + 1;
pres_y = max(0, min(maxold_y, pres_y));            // load
for (x = -2; x < 4; x++)                           // control
{
    pres_x = max(0, min(maxold_x, x_pos + x));     // load
    result += imY[pres_y][pres_x] * COEF[x + 2];   // computation, permutation and load
}
result1 = max(0, min(255, (result + 16) / 32));    // computation

[Figure: instruction mix of the interpolation loop, split into control, computation, permutation, and load/store; segment values 30%, 25%, 35%, and 10%. After load/store and permutation instructions are merged, the chart shows computation, control, and merged Ld/St and Perm., and execution time is reduced by half.]
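For reference, a self-contained scalar version of the 6-tap filter is sketched below. The coefficients (1, -5, 20, 20, -5, 1) are those of the H.264 half-sample luma filter and are not given on the slide. The function names, the flat row-major image layout, and the width/height parameters (standing in for maxold_x + 1 and maxold_y + 1) are illustrative assumptions.

```c
#include <assert.h>

/* Clamp v into [lo, hi]; plays the role of max(lo, min(hi, v)) on the slide. */
static int clip(int lo, int hi, int v)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* H.264 6-tap half-sample filter taps (from the standard, not from the slide). */
static const int COEF[6] = { 1, -5, 20, 20, -5, 1 };

/* Horizontal half-sample value between (x_pos, y_pos) and (x_pos + 1, y_pos).
 * Out-of-picture reads are clamped to the nearest border pixel. */
static int interp_half_h(const unsigned char *imY, int width, int height,
                         int x_pos, int y_pos)
{
    int result = 0;
    int pres_y, x;

    assert(imY && width > 0 && height > 0);
    pres_y = clip(0, height - 1, y_pos);
    for (x = -2; x < 4; x++) {
        int pres_x = clip(0, width - 1, x_pos + x);      /* clamped load address */
        result += imY[pres_y * width + pres_x] * COEF[x + 2];
    }
    return clip(0, 255, (result + 16) / 32);             /* round and saturate */
}
```

Each tap needs an address clamp, a load, and a lane alignment before the multiply-accumulate, which is the load and permutation overhead that the merged instructions target.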


Benchmarking results for CPU core

[Figure: bar chart of CPU core benchmark results. Labels: CK520, MIPS. Axis: 0 to 700.]


Simulation results for DSP performance

Enhanced DSP

CAVLC (context-adaptive variable-length coding)

Sequence (CIF)    MIPS/frame (max)    MIPS/frame (average)
Foreman           0.147,832           0.029,898
Mobile            0.541,943           0.134,240

OGG (new audio standard)

Function          MIPS/frame
MDCT              6
De_VQ             2.5
Floor/Coupling    3.5


Simulation results for DSP performance

Vector processor: H.264 baseline decoder

Format    Sequence (298 frames)    MIPS @ 30 frames/s (max)    MIPS @ 30 frames/s (average)
QCIF      Foreman                  28.1                        12.7
QCIF      Akiyo                    19.8                        5.3
CIF       Foreman                  116.3                       52.3
CIF       Akiyo                    92.9                        22.8


Project status

Finished 2 versions of the CPU core

Released the DSP instruction set

Writing and verifying the RTL of the enhanced DSP

Benchmarking the vector processor

Developing software tools


Future work

Scheduling for task-level parallelism (TLP) between heterogeneous processors

Simulation/debugging tools for heterogeneous processors

Methodologies for design space exploration


Thank you!
