Multi-core SOC for Future Media Processing


Qin Xing, Yan Xiaolang

The Institute of VLSI Design, Zhejiang University

The Institute of VLSI Design, Zhejiang Univ. 223/4/19

Outline

Opportunities & challenges from media processing
Multimedia algorithm characteristics & mapping
Multi-core SOC architecture & technology
Benchmarking results
Project status
Future work


Opportunities

Video conference
IP-phone
Smart terminal
PDA
Video camera
HDTV
Set-top box
…


Challenges: multiple standards

[Figure: bit rate in Mbit/s (axis 0 to 6) plotted against year (1994 to 2005) for five successive encoder generations, from the 1st MPEG-2 encoder to the 5th generation encoder. Standards marked on the chart: MPEG-2, MPEG-4, H.263, H.26L, H.264 / MPEG-4 part 10, AVS, WMV, VP3.]


Challenges: demanding hardware requirements

Very high computational complexity
  H.264 encoding of 720 x 576 pixels @ 30 frames/s needs up to 30 GOPS
Multiple standards co-exist
  Demands flexibility and programmability
Low power
Low cost

Best choice: application-specific instruction-set processor (ASIP)
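To put the 30 GOPS figure in perspective, it can be converted into an average operation budget per pixel and per 16x16 macroblock. The sketch below is only back-of-the-envelope arithmetic on the numbers quoted above; the function names are illustrative and do not come from the slides, and the frame dimensions are assumed to be multiples of 16.

```c
#include <stdint.h>

/* Pixels that must be processed per second for a given frame size and rate. */
int64_t pixels_per_second(int width, int height, int fps) {
    return (int64_t)width * height * fps;
}

/* 16x16 macroblocks per second (assumes width and height are multiples of 16). */
int64_t macroblocks_per_second(int width, int height, int fps) {
    return (int64_t)(width / 16) * (height / 16) * fps;
}

/* Average operation budget per unit of work at a given throughput. */
double ops_per_unit(double ops_per_second, int64_t units_per_second) {
    return ops_per_second / (double)units_per_second;
}
```

Calling `ops_per_unit(30e9, pixels_per_second(720, 576, 30))` gives roughly 2,400 operations per pixel, and the same call with `macroblocks_per_second(720, 576, 30)` gives roughly 617,000 operations per macroblock.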


Multimedia algorithm characteristics: outer loop and inner loop

Outer loop:
  Interface (GUI)
  OS (Linux)
  Bit-stream parsing (pack/unpack, VLC, CABAC)
  Data transferring

Inner loop: regular algorithms (prediction, FIR, DCT, motion estimation)

[Figure: the outer loop (interface, operating system, data transferring, bitstream parsing) surrounding the inner loop (filtering, 2D transform, block add/sub).]
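The contrast between the two kinds of work can be sketched with one small task of each type. The code below is an illustration under stated assumptions, not code from the slides: the `BitReader` helper and all function names are hypothetical. The outer-loop example decodes an unsigned Exp-Golomb code, a variable-length code used in H.264 bitstreams, where the number of bits consumed depends on the data. The inner-loop example adds a residual block to a prediction block with clipping, where every sample gets the same treatment.

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal MSB-first bit reader (hypothetical helper, not from the slides). */
typedef struct {
    const uint8_t *buf;
    size_t size;    /* buffer length in bytes */
    size_t bitpos;  /* index of the next bit to read */
} BitReader;

/* Read one bit; returns 0 once the buffer is exhausted. */
unsigned read_bit(BitReader *br) {
    if (br->bitpos >= br->size * 8) return 0;
    unsigned bit = (br->buf[br->bitpos >> 3] >> (7 - (br->bitpos & 7))) & 1u;
    br->bitpos++;
    return bit;
}

/* Outer-loop style: unsigned Exp-Golomb decode.
   Serial and data-dependent: the code length is only known after
   counting the leading zero bits. */
unsigned read_ue(BitReader *br) {
    unsigned leading_zeros = 0;
    while (leading_zeros < 31 && br->bitpos < br->size * 8 && read_bit(br) == 0)
        leading_zeros++;
    unsigned suffix = 0;
    for (unsigned i = 0; i < leading_zeros; i++)
        suffix = (suffix << 1) | read_bit(br);
    return (1u << leading_zeros) - 1u + suffix;
}

/* Inner-loop style: add a 4x4 residual to a 4x4 prediction and clip to 8 bits.
   Regular and data-parallel: the same operation on all 16 samples. */
void block_add_4x4(const uint8_t pred[16], const int16_t resid[16], uint8_t out[16]) {
    for (int i = 0; i < 16; i++) {
        int v = pred[i] + resid[i];
        out[i] = (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
    }
}
```

The first function is dominated by branches and bit manipulation, while the second is a fixed-length loop with no data-dependent control flow, which is the kind of code that SIMD hardware handles well.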


Multimedia algorithm mapping

Programmable and heterogeneous processors are the preferred choice for the implementation:

General MCU (RISC core): outer loop
Enhanced DSP (EDSP, + bit-wise operations): outer loop
Vector processor (VP, VLIW + SIMD): inner loop


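Multi-core SOC architecture: top level

[Figure: top-level block diagram. Blocks shown: CK520, EDSP and VP cores, each with its own IM and DM; a DMA; AHB master and AHB slave ports; interface (I/F) blocks; a memory controller (Mem Ctrl); AMBA AHB and AMBA APB buses. One region is labeled "Media processing kernel".]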


Inside the media processing kernel

[Figure: block diagram of the media processing kernel. Blocks shown: E-DP; V-DP1 to V-DP4; GDM; V-DM1 to V-DM4; GTM; GAG1 to GAG4; a 2D crossbar connection network; the EDSP control path; the vector control path; DMA and off-chip memories.]


Technologies: specialized instruction set

    for (j = 0; j < BLOCK_SIZE; j++) {
        for (i = 0; i < BLOCK_SIZE; i++) {
            m5[i] = img->cof[i0][j0][i][j];
        }
        m6[0] = (m5[0] + m5[2]);
        m6[1] = (m5[0] - m5[2]);
        m6[2] = (m5[1] >> 1) - m5[3];
        m6[3] = m5[1] + (m5[3] >> 1);
    }

    __asm {
        mov       edx, mptr
        movdqu    xmm1, [edx]
        packssdw  xmm1, xmm1      // read m5[0] from memory to xmm1
    }
    __asm {
        movdqu    xmm4, [edx + 48]
        packssdw  xmm4, xmm4      // read m5[3] from memory to xmm4
    }
    // xmm2 holds m5[1] and xmm3 holds m5[2]; their loads are omitted on the slide
    __asm {
        movq      xmm5, xmm1      // keep a copy of m5[0]
        psubw     xmm1, xmm3      // m6[1] = (m5[0] - m5[2])
        paddw     xmm3, xmm5      // m6[0] = (m5[0] + m5[2])
        movq      xmm5, xmm2      // keep a copy of m5[1]
        psraw     xmm2, 1
        psubw     xmm2, xmm4      // m6[2] = (m5[1] >> 1) - m5[3]
        psraw     xmm4, 1
        paddw     xmm4, xmm5      // m6[3] = m5[1] + (m5[3] >> 1)
    }

Integer IDCT in H.264

Intel MMX: 13 cycles
Our IS: 6 cycles
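The C fragment above shows only the first butterfly stage of the transform. For context, a complete 4x4 inverse transform can be sketched as follows. This is a minimal sketch, not the authors' implementation: the second stage and the final `(x + 32) >> 6` rounding follow the H.264 4x4 inverse integer transform as it is commonly written in software decoders, the function names are illustrative, and right shifts of negative values are assumed to be arithmetic, as they are on mainstream compilers.

```c
/* One 1D pass of the 4x4 inverse integer transform.
   in[] holds four coefficients, out[] receives four results. */
void idct4_1d(const int in[4], int out[4]) {
    /* First butterfly stage (the part shown on the slide, m5 -> m6). */
    int e0 = in[0] + in[2];
    int e1 = in[0] - in[2];
    int e2 = (in[1] >> 1) - in[3];
    int e3 = in[1] + (in[3] >> 1);
    /* Second butterfly stage combines the intermediate values. */
    out[0] = e0 + e3;
    out[1] = e1 + e2;
    out[2] = e1 - e2;
    out[3] = e0 - e3;
}

/* Full 4x4 inverse transform: rows, then columns, then rounding. */
void idct4x4(int coef[4][4], int resid[4][4]) {
    int tmp[4][4];
    for (int r = 0; r < 4; r++)
        idct4_1d(coef[r], tmp[r]);              /* horizontal pass */
    for (int c = 0; c < 4; c++) {
        int col[4], res[4];
        for (int r = 0; r < 4; r++)
            col[r] = tmp[r][c];
        idct4_1d(col, res);                     /* vertical pass */
        for (int r = 0; r < 4; r++)
            resid[r][c] = (res[r] + 32) >> 6;   /* final rounding and scaling */
    }
}
```

The transform needs only additions, subtractions and one-bit shifts, which is why it maps onto a handful of packed SIMD instructions and why the cycle count depends so strongly on how the operands are loaded and packed.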


Technologies: instruction merging

    result = 0;
    pres_y = dy == 1 ? y_pos : y_pos + 1;
    pres_y = max(0, min(maxold_y, pres_y));            // load
    for (x = -2; x < 4; x++)                           // control
    {
        pres_x = max(0, min(maxold_x, x_pos + x));     // load
        result += imY[pres_y][pres_x] * COEF[x + 2];   // computation, permutation and load
    }
    result1 = max(0, min(255, (result + 16) / 32));    // computation

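[Figure: two pie charts of the operation mix for 6-tap sub-pixel interpolation. Before merging, the slices are control, computation, permutation and load/store, with shares of 30%, 25%, 35% and 10% (the extracted text places 30% next to load/store). After merging, the slices are computation, control, and merged load/store and permutation.]

Merging load/store with permutation cuts the execution time by half.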
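The interpolation fragment on this slide can be made self-contained as a small C function. This is a sketch under stated assumptions: the taps (1, -5, 20, 20, -5, 1) are the standard H.264 6-tap filter, which matches the slide's division by 32 since the taps sum to 32, and the function names, the single-row interface and the `width` parameter are illustrative rather than taken from the slides.

```c
/* Clamp v to the range [lo, hi]. */
static int clamp_int(int v, int lo, int hi) {
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Horizontal half-pixel sample between x_pos and x_pos + 1 in one row.
   Positions outside the row are clamped to the nearest edge sample,
   mirroring the max/min address clamping on the slide. */
int interp6_h(const unsigned char *row, int width, int x_pos) {
    static const int COEF[6] = {1, -5, 20, 20, -5, 1};
    int result = 0;
    for (int x = -2; x < 4; x++) {
        int pres_x = clamp_int(x_pos + x, 0, width - 1);   /* address generation */
        result += row[pres_x] * COEF[x + 2];               /* load and multiply-accumulate */
    }
    return clamp_int((result + 16) / 32, 0, 255);          /* round, scale, clip */
}
```

Each iteration performs an address clamp, a load and a multiply-accumulate, so a scalar version spends much of its time on work other than arithmetic. That is the overhead the merged load/store and permutation instructions are aimed at.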


Benchmarking results for CPU core

[Figure: bar chart with the labels CK520 and MIPS; axis ticks from 0 to 700.]


Simulation results for DSP performance: enhanced DSP

CAVLC (context adaptive variable length coding)

Sequence (CIF)    MIPS/frame (max)    MIPS/frame (average)
Foreman           0.147,832           0.029,898
Mobile            0.541,943           0.134,240

OGG (new audio standard)

Function          MIPS/frame
MDCT              6
De_VQ             2.5
Floor/Coupling    3.5


Simulation results for DSP performance: vector processor

H.264 baseline decoder

Format    Sequence (298 frames)    MIPS @ 30 frames/s (max)    MIPS @ 30 frames/s (average)
QCIF      Foreman                  28.1                        12.7
QCIF      Akiyo                    19.8                        5.3
CIF       Foreman                  116.3                       52.3
CIF       Akiyo                    92.9                        22.8


Project status

Finished 2 versions of the CPU core
Released the DSP instruction set
Writing and verifying RTL of the enhanced DSP
Benchmarking the vector processor
Developing software tools


Future work

Scheduling for task-level parallelism (TLP) between heterogeneous processors
Simulation/debugging tools for heterogeneous processors
Methodologies for design space exploration


Thank you!