

530 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 38, NO. 3, MARCH 2003

A Single-Chip MPEG-2 Codec Based on Customizable Media Embedded Processor

Shunichi Ishiwata, Tomoo Yamakage, Yoshiro Tsuboi, Takayoshi Shimazawa, Tomoko Kitazawa, Shuji Michinaka, Kunihiko Yahagi, Hideki Takeda, Akihiro Oue, Tomoya Kodama, Nobu Matsumoto, Takayuki Kamei, Mitsuo Saito,

Takashi Miyamori, Goichi Ootomo, and Masataka Matsui, Member, IEEE

Abstract—A single-chip MPEG-2 MP@ML codec, integrating 3.8M gates on a 72-mm² die, is described. The codec employs a heterogeneous multiprocessor architecture in which six microprocessors with the same instruction set but different customization execute specific tasks such as video and audio concurrently. The microprocessor, developed for digital media processing, provides various extensions such as a very-long-instruction-word coprocessor, digital signal processor instructions, and hardware engines. Making full use of the extensions and optimizing the architecture of each microprocessor based upon the nature of specific tasks, the chip can execute not only MPEG-2 MP@ML video/audio/system encoding and decoding concurrently, but also MPEG-2 MP@HL decoding in real time.

Index Terms—Audio coding, codecs, microprocessors, motion compensation, MPEG-2, multiprocessing, video signal processing.

I. INTRODUCTION

RECENTLY, many single-chip MPEG-2 MP@ML video encoders and coders/decoders (codecs) have been developed [2]–[7]. Most of these devices share the same architectural characteristics: 1) heavy iteration of relatively simple tasks, such as motion estimation, discrete cosine transform (DCT), quantization, and variable-length encoding, is implemented using dedicated hardware engines, and 2) a simple embedded reduced-instruction-set computer (RISC) or digital signal processor (DSP) performs the remainder of the tasks. This organization is very common and provides a suitable balance between cost and performance. As semiconductor technology progresses, single-chip MPEG-2 MP@ML encoders have integrated audio and system processing [8]–[11] while keeping the above-mentioned architectural characteristics. However, the optimum architecture for video encoding/decoding differs from that for audio and system encoding/decoding. This motivates another architectural characteristic: a heterogeneous multiprocessor architecture, in which each processor is optimized for a specific application such as video or audio.

Such application-specific processors are common in many applications. However, recent techniques employed in high-end microprocessors, such as out-of-order speculative execution, branch prediction, and circuit technologies to achieve high clock frequency, are unsuitable for these application-specific processors. On the other hand, the addition of application-specific instructions and/or hardware engines to accelerate heavy iteration of relatively simple tasks often results in a large gain in performance at little additional cost, especially in DSP applications. A heterogeneous multiprocessor architecture using simple and small microprocessors achieves a better cost–performance tradeoff than a single high-end general-purpose microprocessor with hardware engines, because a general-purpose microprocessor is too expensive to do the simple job of coordinating hardware engines.

Manuscript received April 19, 2002; revised October 24, 2002.

S. Ishiwata, Y. Tsuboi, T. Shimazawa, T. Kitazawa, S. Michinaka, K. Yahagi, H. Takeda, A. Oue, N. Matsumoto, T. Miyamori, G. Ootomo, and M. Matsui are with the SoC Research and Development Center, Semiconductor Company, Toshiba Corporation, Kawasaki 212-8520, Japan (e-mail: shunichi.ishiwata@toshiba.co.jp).

T. Yamakage and T. Kodama are with the Multimedia Laboratory, Research and Development Center, Toshiba Corporation, Kawasaki 212-8582, Japan.

T. Kamei and M. Saito are with the Broadband System LSI Project, Semiconductor Company, Toshiba Corporation, Kawasaki 212-8520, Japan.

Digital Object Identifier 10.1109/JSSC.2002.808291

Based on this understanding, we have developed a customizable media embedded processor architecture [1] suitable for digital media processing. We have also developed hardware/software development tools to support the customization. We have applied the architecture and tools to develop a single-chip MPEG-2 MP@ML codec. It has a heterogeneous multiprocessor architecture in which each processor is optimized based on the nature of its intended application. The proposed chip can execute not only MPEG-2 MP@ML video/audio/system encoding and decoding simultaneously, but also MPEG-2 MP@HL decoding. Several MPEG-2 MP@ML video codecs [4], [7], MPEG-2 MP@ML video/audio/system encoders [8]–[11], and MPEG-2 MP@HL decoders [12]–[15] have already been reported. However, none of these previous circuits is capable of performing all of these functions on a single integrated circuit. Recently, an integrated circuit capable of performing these functions was developed [16] concurrently with our chip, but, as far as the authors know, its details have not been disclosed.

The customizable media embedded processor architecture and the hardware/software development tools that support customization are described in Sections II and III, respectively. The details of the MPEG-2 codec architecture and its large-scale integrated (LSI) implementation are described in Sections IV and V, respectively.

II. CUSTOMIZABLE MEDIA EMBEDDED PROCESSOR ARCHITECTURE

Fig. 1 shows the proposed media embedded processor architecture. It is customizable and consists of a basic part and an extensible part. The former is described in Section II-A and the latter in the subsequent sections.

0018-9200/03$17.00 © 2003 IEEE


Fig. 1. Proposed media embedded processor architecture.

A. Basic Configuration

The basic configuration of the processor consists of the core instruction set, an instruction cache/RAM, a data cache/RAM, a direct-memory-access (DMA) controller, and a bus bridge to a main bus, as shown in Fig. 1. The base processor is a 32-bit five-stage simple RISC core that consists of about 50K gates and can operate at 200 MHz in 0.18-µm CMOS technology. Basic properties as well as customizable extensions of this core are summarized in Table I [17]. The basic configuration includes memory size and optional instructions such as multiply and divide. The configurable extensions are depicted in Fig. 1 and described in Sections IV-A–D.

B. VLIW Extension

Currently, the following two very-long-instruction-word (VLIW) extensions are supported:

1) a 2-issue VLIW execution unit with a 32-bit instruction length;

2) a 3-issue VLIW execution unit with a 64-bit instruction length [1].

In both cases, a core ALU and an extended coprocessor operatein parallel.

In addition, load/store instructions to/from coprocessor registers support an autoincrement-with-modulo addressing mode, in which the general-purpose registers (GPRs) in the RISC core are used as the address registers.

C. User Custom Instructions

The instruction format of user-custom instructions is limited to two register operands plus a 16-bit sub-opcode or immediate operand. Execution of each of these instructions must be completed within one clock cycle. Branch or load/store instructions cannot be added.

TABLE I
CONFIGURATION AND EXTENSION OF THE MEDIA EMBEDDED PROCESSOR

D. DSP Extension Instructions

Fig. 2. Tool integrator.

Each DSP extension instruction contains two register operands plus a 16-bit sub-opcode or immediate operand. In these instructions, execution may exceed one cycle. State registers can be implemented. Load/store instructions from/to a data RAM can also be added. For example, a typical DSP pipeline might consist of an address calculation using the core ALU, followed by a load from the data RAM, which is further followed by a multiply-accumulation in the extended DSP unit.

E. Hardware Engine and Local Memory

Heavy iteration of a relatively simple task can be implemented as a hardware engine. A C or C++ function is mapped into a hardware engine and its associated code as follows.

1) Arguments and return values that are integers are mapped into specific instructions that read or write registers in the hardware engine through the control bus.

2) Passing a large data structure is performed by load/store instructions or by a DMA transfer through the local bus to/from local memory connected to the hardware engine.

3) A function call is performed by starting the hardware engine, using a specific instruction to write to a command register in the hardware engine through the control bus.

The control bus and the local bus are separate, so that a load/store through the control bus, a DMA transfer over the local bus, and a hardware engine execution can operate in parallel.

F. Heterogeneous Multiprocessor Architecture

Multiple processors can be connected to shared memory using a main bus, as shown in Fig. 1. In this context, each processor is referred to as a media module (MM). Each MM contains a basic processor configuration and may also contain a coprocessor, user-custom instructions, DSP extensions, hardware engines, and local memories. The architecture of each MM can be optimized independently based on its target application. Both 32- and 64-bit bus widths are supported.

III. HARDWARE AND SOFTWARE DEVELOPMENT TOOL

TABLE II
MAIN FUNCTIONS OF THE CHIP

We have also developed a tool that generates 1) synthesizable register-transfer-level (RTL) code, scripts for synthesis tools, and a verification program, and 2) a software development kit (assembler, compiler, linker, simulator, and debugger). The tool generates these based on a script that describes the architecture extensions (see Fig. 2). This tool can significantly reduce the time required for hardware/software development. Details of this tool are described in [18] and [19].

In practice, extending the architecture also requires modification of the application C source code and the addition of a C++ model that describes the function of the extended part. The compiler is extended automatically so that user-custom instructions and DSP extension instructions can be expressed as function calls in the C source code. To support the custom hardware engines, function calls must be replaced by data transfers through the control bus and the local bus. Our tool integrates the C++ model into the core simulator and produces a chip-level simulator. The corresponding RTL is automatically generated from the C++ model by a high-level synthesis tool as long as the extended part is relatively small.


Fig. 3. Block diagram of the MPEG-2 codec LSI.

Fig. 4. Outline of data flow in MPEG decoding.

IV. MPEG-2 CODEC ARCHITECTURE

Based on the media embedded processor architecture, a codec integrated circuit was implemented. Table II shows the main functions of the chip.

Fig. 3 shows a block diagram of the chip. It contains six MMs. Each MM is optimized based on the nature of its target application and includes the following extensions:

• bitstream mux/demux MM: bitstream input/output (I/O) hardware engine (HWE);

• audio MM: 2-issue VLIW with a multiply-accumulation coprocessor and an audio I/O HWE;

• video encode/decode MM: five HWEs, a DSP extension for the variable-length coder (VLC) and variable-length decoder (VLD), and user-custom instructions for single-instruction multiple-data (SIMD) operations;

• motion estimation (for video compression) MM: block-matching operation HWE;

• video pre/postprocessing MM: video filter HWE and video I/O HWE;

• general control MM: no extension (i.e., basic configuration only).

The fact that the optimized architecture contains such a variety of MMs illustrates the necessity of the proposed extensible media processor. The methodology results in effective reuse of common components and can be used in other applications such as image recognition [1].

Next, an overview of the software implementation is described. As shown in Fig. 4, the MPEG-2 decode task is distributed to the MMs as follows.

1) In the bitstream mux/demux MM, a bitstream input to the bitstream I/O HWE is transferred to the SDRAM by an interrupt handler. This bitstream is read from the SDRAM and demultiplexed into an audio elementary stream and a video elementary stream. These streams are written to the SDRAM.


Fig. 5. Outline of data flow in MPEG encoding.

2) In the audio MM, the audio elementary stream is read from the SDRAM and decoded. The result is written to the SDRAM and then transferred to the audio I/O HWE by an interrupt handler.

3) In the video encode/decode MM, the video elementary stream is read from the SDRAM and decoded. The result is written to the SDRAM.

4) In the video pre/postprocessing MM, the decoded video is read from the SDRAM and postprocessed. The result is written to the SDRAM and then transferred to the video I/O HWE by an interrupt handler.

The MPEG-2 encode operation is the reverse of the decode operation, except for motion estimation. An outline of the data flow, shown in Fig. 5, is as follows.

1) In the video pre/postprocessing MM, a video source input to the video I/O HWE is transferred to the SDRAM by an interrupt handler. It is read from the SDRAM, preprocessed, and written back to the SDRAM.

2) In the motion estimation MM, coarse-grain motion estimation is performed using the preprocessed video, and the result is written to the SDRAM.

3) In the video encode/decode MM, the preprocessed video is encoded using the above result, and the generated video elementary stream is written to the SDRAM.

4) In the audio MM, an audio source input to the audio I/O HWE is transferred to the SDRAM by an interrupt handler. It is then encoded, and the generated audio elementary stream is written to the SDRAM.

5) In the bitstream mux/demux MM, the video and audio elementary streams are read from the SDRAM and multiplexed into a program or transport stream. The result is written to the SDRAM. Finally, it is transferred to the bitstream I/O HWE by an interrupt handler.

Fig. 6. VLIW extension of audio MM.

It is noted that data transfer between MMs is executed via the SDRAM. This implementation is essential in that it alleviates the difficulties in scheduling the multitasking operations. It is also indispensable for reducing the peak data transfer rate between the MMs, which justifies the single shared-bus architecture. The bus arbitration algorithm is fair round-robin, so that the worst-case latency is easily estimated. Our method of data transfer between MMs causes simultaneous access to shared resources in the SDRAM. To resolve this problem, hardware semaphore registers are used.

On each MM, multiple tasks such as encode, decode, and I/O tasks can run pseudosimultaneously in an interrupt-driven, time-division-multiplexed manner. The kernel that switches tasks on each MM is also customized based on the nature of the target application.

Several notable implementation details involving extensions of the media embedded processor are described in Sections IV-A–D.

A. DMA Controller

A DMA controller in each MM performs the following two functions.

1) Transferring a rectangular region between memory regions using a single command. This is often required in video applications and is usually implemented in MPEG-2 encoders. The implementation is similar to the one in [20].

Fig. 7. Block diagram of video encode/decode MM.

2) Descriptor chain mode to chain several DMA transfers. Descriptor tables are stored in a data RAM by software and are read by the DMA controller. This is indicated by the solid line from the data RAM to the DMAC in Fig. 1. The descriptor chain mode reduces the frequency of interactions between hardware and software and makes software pipelining less complicated. This is especially important in the case of a shared-bus multiprocessor architecture, because it is very difficult to estimate when a DMA transfer finishes.

B. Audio MM

Fig. 6 shows the structure of the VLIW extension in the audio MM. It has the following features:

1) eight 32-bit GPRs;

2) two 64-bit accumulators;

3) a 32-bit ALU, shifter, and multiply-accumulator (MAC) supporting the following instructions:

a) add, subtract, logical, shift, and funnel shift;

b) multiply, multiply–add, and multiply–subtract;

c) leading-zero detect, absolute difference, MIN/MAX, and clipping.

The processor core has two processing modes: core mode and VLIW mode. In core mode, a 16- or 32-bit instruction is issued each cycle. In VLIW mode, a 32-bit fixed-length instruction is executed. The mode is changed using subroutine call and return instructions. Each VLIW instruction consists of a 32-bit core instruction, a 32-bit coprocessor instruction, or a combination of a 16-bit core instruction and a 16-bit coprocessor instruction. For instance, a 16-bit load instruction with address increment and a coprocessor multiply–add instruction can be executed simultaneously. The GPRs in the core can be used as input data for the 32-bit MAC in the coprocessor. Furthermore, the leading-zero detection, absolute difference, MIN/MAX, and clipping instructions are added to the processor core as optional instructions. The customization results in about a threefold improvement in performance over the basic processor core for fast Fourier transform (FFT) and filter operations, which appear in audio decoding and encoding.

C. Video Encode/Decode MM

The block diagram of the video encode/decode MM is shown in Fig. 7. In MPEG-2 video encoding/decoding, most of the time-consuming operations are executed by the hardware engines, while the firmware's jobs are mainly address calculations for DMA transfers and parameter preparation for the hardware engines. Some of the local memories are double buffered so that DMA transfers and all hardware engines can operate in parallel. This parallel operation is accomplished by a software pipelining technique; the unit of pipelining is the macroblock.

The VLC and VLD utilize the DSP extension with a resource-sharing technique for sharing data between a DSP extension and a hardware engine. The VLD provides the following two functions:

• as a DSP extension, instructions to decode a symbol of a fixed/variable-length code;

• as a hardware engine, a command to decode a macroblock.

The use of the single-symbol decode instruction described above leads to an efficient implementation of the macroblock-decode command, since it allows the reuse of resources such as 1) the local memory that stores the bitstream; 2) the arithmetic unit that extracts bits; and 3) the control logic that refills a temporary bit buffer from the local memory.

An example of the use of these two functions is as follows. The macroblock-decode command stores macroblock header parameters such as the macroblock type and motion vectors in the data RAM. As an exception, the macroblock address increment [22], which is the first variable-length code in a macroblock header, is decoded by a DSP instruction, because the software should not issue macroblock-decode commands for skipped macroblocks while it controls the rest of the pipeline. In addition, providing both the above DSP extension and the hardware engine makes it possible to support both 1) MPEG-2 MP@HL, which requires high performance but little flexibility, and 2) other standards, such as JPEG, DV, and MPEG-4, which demand moderate performance but much flexibility.


Another feature that supports other standards is that the hardware engines have not only macroblock commands but also single-function commands such as Q, 8×8 DCT, and 2×4×8 IDCT. The single-function commands share resources with the macroblock commands and make the hardware slightly more complex. As an example, the quantization method in JPEG [23] is somewhat different from that in MPEG-2 and, therefore, a macroblock command for MPEG-2 is inappropriate; in this case, the single-function commands are still useful. Another example is the 2×4×8 IDCT for DV decoding [24]. In DV decoding, the firmware's jobs of unpacking, variable-length decoding, and inverse quantization are shared with the motion estimation MM, which is not otherwise used in decoding. The same technique is useful for superposition of text and video object layers in MPEG-4 shape decoding [25]. Such flexible task assignment to an MM is one of the advantages of our architecture compared with conventional dedicated architectures.

Several user-custom instructions for SIMD operations are added to the RISC core. They are parallel add, subtract, shift, set-less-than, and logical operations, special instructions for encoding motion vectors, and byte shuffle. They are especially useful in calculating addresses of reference pictures and encoding motion vectors, which are dominant tasks of the firmware in MPEG encoding. In addition, a predictive motion vector (PMV) hardware engine is added to accelerate the decoding of motion vectors in MPEG-2 MP@HL decoding; the user-custom instructions alone are insufficient to achieve this level of performance.

The fine-grain motion estimation (ME Fine) hardware engine has the following functions:

• motion estimation with half-pixel precision;

• mode selection (candidates are {forward, backward, or bidirectional} × {frame or field} motion compensation (MC), dual-prime MC, no MC, and intra; some modes are excluded depending on the picture coding type and the progressive frame flag);

• forming the best predicted picture corresponding to the selected mode and storing it in the prediction RAM;

• calculating the macroblock activity, which is used for rate control.

With the exception of the dual-prime MC option, all functions can be executed by a single command. If the dual-prime MC option is required, the command is split into two: the first command is used to calculate an address, while the second command uses this address to transfer an opposite-parity field.

The ME Fine hardware engine is separate from the motion estimation MM, since fine-grain motion estimation requires not only block-matching operations but also other functions, such as half-pixel interpolation and averaging of the forward and backward reference pictures.

In summary, the real-time MPEG-2 encoding/decoding, MPEG-4 encoding/decoding, and DV decoding listed in Table II are achieved by utilizing the following extensions:

• MPEG-2 encode: DSP extension, DCT, Q, IQ, IDCT HWE, MC HWE, ME Fine HWE, SIMD UCI, VLC/VLD HWE;

• MPEG-2 decode: DSP extension, DCT, Q, IQ, IDCT HWE, MC HWE, PMV HWE, VLC/VLD HWE;

Fig. 8. Block diagram of motion estimation MM.

• MPEG-4 encode: DSP extension, DCT, Q, IQ, IDCT HWE, MC HWE, ME Fine HWE, SIMD UCI;

• MPEG-4 decode: DSP extension, DCT, Q, IQ, IDCT HWE, MC HWE;

• DV decode: DSP extension, DCT, Q, IQ, IDCT HWE, MC HWE.

The firmware's main jobs, other than address calculation for DMA transfers and parameter preparation for the hardware engines, are:

MPEG-2 encode: rate control;
MPEG-2 decode: nothing;
MPEG-4 encode: VLC, DCT DC/AC coefficient prediction, rate control;
MPEG-4 decode: VLD, DCT DC/AC coefficient prediction, shape decoding;
DV decode: unpack, VLD, IQ.

D. Motion Estimation MM

The block diagram of the motion estimation MM is shown in Fig. 8. The block-matching operation is dominant in MPEG-2 encoding and is implemented as a hardware engine. It consists of a pair of block-matching engines, a reference buffer, a reference prefetch buffer, and a target macroblock buffer. The latter two are double buffered so that the block-matching engines and DMA transfers can operate in parallel. The block-matching engines have the following features.

TABLE III
SPECIFICATIONS OF THE CHIP

Fig. 9. Microphotograph of the chip.

• Each block-matching engine can perform an 8×8-pixel block-matching operation with one-cycle throughput, which is attained by utilizing a pipelined SIMD structure. Efficient use of these engines achieves a wide search range, up to ±144 pixels horizontally and ±96 pixels vertically, with a hierarchical multifield telescopic search algorithm [21] modified to reduce the required memory bandwidth to a third of the original value. (Using the telescopic search algorithm increases the required memory bandwidth, because the search location for each macroblock is irregular, which increases the amount of reference reading. The modified algorithm makes the search location regular at the cost of more frequent current-macroblock reading.) It is worth noting that the architecture is fairly general and allows use of a wide variety of motion estimation algorithms.

• The shape of the search range can be set flexibly, which makes it possible to widen the search range when the input video is rapidly panning.

• The difference between the pixel dc (average) value of the current picture and that of the reference picture is compensated during the block-matching operation. This provides efficient motion estimation even if the input video contains a fade-in/out pattern.

V. LSI IMPLEMENTATION

Table III shows the chip specifications. The chip is fabricated in 0.18-µm six-level-metal CMOS technology, integrating 3.8M logic gates on the 72-mm² die. It operates at 150 MHz and consumes 1.5 W while performing the main functions, such as the real-time codec operations listed in Table II. Fig. 9 shows a chip microphotograph and Fig. 10 shows a photograph of the reference system board.

Fig. 10. Photograph of the evaluation board.

VI. CONCLUSION

This paper has presented a customizable embedded processor architecture and the hardware/software development tools that support the customization. This architecture enables the quick development of an application-optimized heterogeneous multiprocessor. (By “application,” we are referring to media-processing applications such as video, audio, bitstream mux/demux, graphics, and image processing.) Based on this customizable architecture, an MPEG-2 MP@ML codec has been successfully developed. Making full use of customization, the chip can execute in real time: 1) simultaneous MPEG-2 MP@ML video/audio/system encoding and decoding; 2) MPEG-2 MP@HL decoding; 3) DV decoding; and 4) MPEG-4 core profile plus interlace encoding and decoding. We expect our approach to become more advantageous in the future as design complexity and the demand for systems-on-chip increase.

ACKNOWLEDGMENT

The authors wish to thank Dr. K. Suzuki, A. Yokosawa, Y. Uetani, S. Koto, and N. Kogure for research on algorithm and firmware implementation, and the media embedded processor team, the software development tool team, and the design verification team for their cooperation and great efforts.

REFERENCES

[1] Y. Kondo, T. Miyamori, T. Kitazawa, S. Inoue, H. Takano, I. Katayama, K. Yahagi, A. Ooue, T. Tamai, K. Kohno, Y. Asao, H. Fujimura, H. Uetani, Y. Inoue, S. Asano, Y. Miyamoto, A. Yamaga, Y. Masubuchi, and T. Furuyama, “A 4-GOPS 3-way VLIW image recognition processor based on a configurable media-processor,” in IEEE Int. Solid-State Circuits Conf. (ISSCC 2001) Dig. Tech. Papers, Feb. 2001, pp. 148–149.

[2] A. van der Werf, F. Brüls, R. P. Kleihorst, E. Waterlander, M. J. W. Verstraelen, and T. Friedrich, “I.McIC: A single-chip MPEG-2 video encoder for storage,” IEEE J. Solid-State Circuits, vol. 32, pp. 1817–1823, Nov. 1997.

[3] M. Mizuno, Y. Ooi, N. Hayashi, J. Goto, M. Hozumi, K. Furuta, A. Shibayama, Y. Nakazawa, O. Ohnishi, S. Zhu, Y. Yokoyama, Y. Katayama, H. Takano, N. Miki, Y. Senda, I. Tamitani, and M. Yamashina, “A 1.5-W single-chip MPEG-2 MP@ML video encoder with low-power motion estimation and clocking,” IEEE J. Solid-State Circuits, vol. 32, pp. 1807–1816, Nov. 1997.

[4] E. Iwata, K. Seno, M. Aikawa, M. Ohki, H. Yoshikawa, Y. Fukuzawa, H. Hanaki, K. Nishibori, Y. Kondo, H. Takamuki, T. Nagai, K. Hasegawa, H. Okuda, I. Kumata, M. Soneda, S. Iwase, and T. Yamazaki, “A 2.2-GOPS video DSP with 2-RISC MIMD, 6-PE SIMD architecture for real-time MPEG2 video coding/decoding,” in IEEE Int. Solid-State Circuits Conf. (ISSCC 1997) Dig. Tech. Papers, Feb. 1997, pp. 258–259.


[5] E. Miyagoshi, T. Araki, T. Sayama, A. Ohtani, T. Minemaru, K. Okamoto, H. Kodama, T. Morishige, A. Watabe, K. Aoki, T. Mitsumori, H. Imanishi, T. Jinbo, Y. Tanaka, M. Taniyama, T. Shingou, T. Fukumoto, H. Morimoto, and K. Aono, “A 100-mm² 0.95-W single-chip MPEG2 MP@ML video encoder with a 128-GOPS motion estimator and a multi-tasking RISC-type controller,” in IEEE Int. Solid-State Circuits Conf. (ISSCC 1998) Dig. Tech. Papers, Feb. 1998, pp. 30–31.

[6] E. Ogura, M. Takashima, D. Hiranaka, T. Ishikawa, Y. Yanagita, S. Suzuki, T. Fukuda, and T. Ishii, “A 1.2-W single-chip MPEG2 MP@ML video encoder LSI including wide search range (H: ±288, V: ±96) motion estimation and 81-MOPS controller,” IEEE J. Solid-State Circuits, vol. 33, pp. 1765–1771, Nov. 1998.

[7] Y. Tsuboi, H. Arai, M. Takahashi, and M. Oku, “A low-power single-chip MPEG-2 CODEC LSI,” in Proc. IEEE Custom Integrated Circuits Conf. (CICC), May 1999, pp. 73–76.

[8] G. Kizhepat, K. Choy, R. Hinchley, P. Lowe, and R. Yip, “A single-chip MPEG-2 video audio and system encoder,” in IEEE Int. Solid-State Circuits Conf. (ISSCC 1999) Dig. Tech. Papers, Feb. 1999, pp. 266–267.

[9] S. Kumaki, T. Matsumura, K. Ishihara, H. Segawa, K. Kawamoto, H. Ohira, T. Shimada, H. Sato, T. Hattori, T. Wada, H. Honma, T. Watanabe, H. Sato, K.-I. Asano, and T. Yoshida, “A single-chip MPEG2 422@ML video, audio, and system encoder with a 162-MHz media-processor and dual motion estimation cores,” in Proc. IEEE Custom Integrated Circuits Conf. (CICC), May 1999, pp. 95–98.

[10] W. H. A. Bruls, E. W. Salomons, A. van der Werf, R. K. Gunnewiek, and L. Camiciotti, “A low-cost audio/video single-chip MPEG2 encoder for consumer video storage applications,” in IEEE Int. Conf. Consumer Electronics (ICCE) Dig. Tech. Papers, June 2000, pp. 314–315.

[11] S. Kumaki, H. Takata, Y. Ajioka, T. Ooishi, K. Ishihara, A. Hanami, T. Tsuji, Y. Kanehira, T. Watanabe, C. Morishima, T. Yoshizawa, H. Sato, S.-I. Hattori, A. Koshio, K. Tsukamoto, and T. Matsumura, “A 99-mm², 0.7-W single-chip MPEG2 422P@ML video, audio, system encoder with a 64-Mbit embedded DRAM for portable 422P@HL encoder system,” in Proc. IEEE Custom Integrated Circuits Conf. (CICC), May 2001, pp. 425–428.

[12] H. Kim, W.-S. Yang, M.-C. Shin, S.-J. Min, S.-O. Bae, and I.-C. Park, “Multithread VLIW processor architecture for HDTV decoding,” in Proc. IEEE Custom Integrated Circuits Conf. (CICC), May 2000, pp. 559–562.

[13] T. Koyama, E. Iwata, H. Yoshikawa, H. Hanaki, K. Hasegawa, M. Aoki, M. Yasue, T. Schronbenhauser, M. Aikawa, I. Kumata, and H. Koyanagi, “A 250 MHz single-chip multiprocessor for A/V signal processing,” in IEEE Int. Solid-State Circuits Conf. (ISSCC 2001) Dig. Tech. Papers, Feb. 2001, pp. 146–147.

[14] Y. Watanabe, Y. Otobe, K. Yoshitomi, H. Takahashi, and K. Kohiyama, “A single-chip MPEG2 MP@HL decoder for DTV recording/playback system,” in IEEE Int. Conf. Consumer Electronics (ICCE) Dig. Tech. Papers, June 2001, pp. 90–91.

[15] H. Yamauchi, S. Okada, K. Taketa, Y. Matsuda, T. Mori, S. Okada, T. Watanabe, Y. Harada, M. Matsudaira, and Y. Matsushita, “A 0.8-W HDTV video processor with simultaneous decoding of two MPEG2 MP@HL streams and capable of 30 frame/s reverse playback,” in IEEE Int. Solid-State Circuits Conf. (ISSCC 2002) Dig. Tech. Papers, Feb. 2002, pp. 372–373.

[16] LSI Logic. [Online]. Available: http://www.lsilogic.com

[17] T. Miyamori, “A configurable and extensible media processor,” presented at the Embedded Processor Forum 2002, Apr. 2002.

[18] A. Mizuno, K. Kohno, R. Ohyama, T. Tokuyoshi, H. Uetani, H. Eichel, T. Miyamori, N. Matsumoto, and M. Matsui, “Design methodology and system for a configurable media embedded processor extensible to VLIW architecture,” in Proc. Int. Conf. Computer Design (ICCD 2002), Sept. 2002, pp. 2–7.

[19] SoC Design Methodology With MeP. [Online]. Available: http://www.mepcore.com/english/index_e.html

[20] K. Okamoto, T. Jinbo, T. Araki, Y. Iizuka, H. Nakajima, M. Takahata, H. Inoue, S. Kurohmaru, T. Yonezawa, and K. Aono, “A DSP for DCT-based and wavelet-based video codecs for consumer applications,” IEEE J. Solid-State Circuits, vol. 32, pp. 460–467, Mar. 1997.

[21] Toshiba Science and Technology Highlights 1998. Single-chip MPEG2 encoder LSI. [Online]. Available: http://www.toshiba.co.jp/tech/review/1998/high98/research/r3/index.htm

[22] Information Technology—Generic Coding of Moving Pictures and Associated Audio Information: Video, ISO/IEC 13818-2, ITU-T Recommendation H.262, May 1996.

[23] Information Technology—Digital Compression and Coding of Continuous-Tone Still Images—Requirements and Guidelines, ISO/IEC 10918-1, ITU-T Recommendation T.81, 1992.

[24] Helical-Scan Digital Video Cassette Recording System Using 6.35-mm Magnetic Tape for Consumer Use (525-60, 625-50, 1125-60, and 1250-50 Systems), IEC 61834, 1994.

[25] Information Technology—Generic Coding of Audio-Visual Objects—Part 2: Visual, ISO/IEC 14496-2, 1999.

Shunichi Ishiwata received the B.E., M.E., and Ph.D. degrees in physical electronics engineering from the Tokyo Institute of Technology, Tokyo, Japan, in 1987, 1989, and 1992, respectively.

In 1992, he joined Toshiba Corporation, Kawasaki, Japan, where he has been working on the development of MPEG2 decoder LSIs, media processors, and MPEG2 codec LSIs. He is currently a Senior Specialist in the SoC Research and Development Center and is working on MPEG video encoder algorithm, architecture, and software development.

Tomoo Yamakage received the B.E. degree in information engineering and the M.E. degree in electronic engineering from Tohoku University, Sendai, Japan, in 1988 and 1990, respectively.

In 1990, he joined Toshiba Corporation, Kawasaki, Japan, where he has been working on MPEG-2 standardization and the development of MPEG-2 LSIs. He is currently a Research Scientist in the Corporate Research and Development Center and is working on the development of high-definition DVD and video watermarking technology.

Yoshiro Tsuboi received the Ph.D. degree in theoretical physics from the University of Tsukuba, Ibaraki, Japan, in 1989.

He joined the ULSI Research Center, Toshiba Corporation, Kawasaki, Japan, in 1991, where he was engaged in development of high-speed silicon bipolar devices. In 1995, he changed his field of activity to research and development of system LSI design, and is currently dedicated to system LSI firmware design and development in the SoC Research and Development Center.

Takayoshi Shimazawa was born in Saitama, Japan, on April 27, 1965. He received the B.S. and M.S. degrees in electrical engineering from Keio University, Yokohama, Japan, in 1989 and 1991, respectively.

In 1991, he joined the Semiconductor Device Engineering Laboratory, Toshiba Corporation, Kawasaki, Japan, where he was engaged in the research and development of advanced system LSIs, including MPEG2 video decoders and MPEG2 video and audio codecs. Currently, he is involved in the design of system LSIs based on a customizable media embedded processor in the SoC Research and Development Center.

Tomoko Kitazawa received the B.S. degree in physics from the University of Tsukuba, Ibaraki, Japan, in 1983.

She joined Toshiba Corporation, Kawasaki, Japan, in 1983, where she was engaged in research and development of amorphous-silicon TFT-LCDs. In recent years, she has been working on the development of media processors and MPEG2 codec LSIs. She is currently a Specialist with the SoC Research and Development Center.


Shuji Michinaka received the B.S. and M.S. degrees from Hokkaido University, Sapporo, Japan, in 1990 and 1992, respectively.

In 1992, he joined Toshiba Corporation, Kawasaki, Japan, where he has been working on the development of MPEG2 decoder LSIs.

Kunihiko Yahagi received the B.S. degree in physics from the University of Tokyo, Tokyo, Japan, in 1993.

In 1993, he joined Toshiba Corporation, Kawasaki, Japan, where he has been working on the development of MPEG2 decoder LSIs, media processors, and MPEG2 codec LSIs. He is currently working on the development of a memory subsystem for an MPEG codec LSI with the SoC Research and Development Center.

Hideki Takeda received the B.S. degree in electrical engineering and the M.S. degree in electronic engineering from the University of Tokyo, Tokyo, Japan, in 1992 and 1994, respectively.

In 1994, he joined Toshiba Corporation, Kawasaki, Japan, where he has been engaged in the development of MPEG2 decoder LSIs and media processors. He has also been working on the design methodology for those LSIs. He is currently involved in the development of MPEG2 codec LSIs.

Akihiro Oue received the B.E. and M.E. degrees in electrical and electronic engineering from the Tokyo Institute of Technology, Tokyo, Japan, in 1996 and 1998, respectively.

In 1998, he joined Toshiba Corporation, Kawasaki, Japan, where he has been working on the development of MPEG2 codec LSIs. He is currently a Software Engineer with the SoC Research and Development Center.

Tomoya Kodama received the B.E. and M.E. degrees in electronic engineering from Tohoku University, Sendai, Japan, in 1991 and 1993, respectively.

In 1993, he joined Toshiba Corporation, Kawasaki, Japan, where he has been working on the development of MPEG2 software and LSIs. He is currently a Research Scientist with the Corporate Research and Development Center and is working on MPEG video CODEC algorithm and software development.

Nobu Matsumoto received the B.E. and M.E. degrees in electrical engineering from Waseda University, Tokyo, Japan, in 1981 and 1983, respectively.

In 1983, he joined Toshiba Corporation, Kawasaki, Japan, where he has been engaged in the research and development of EDA systems and software development environments.

Takayuki Kamei was born in Tokyo, Japan, in 1972. He received the B.E. degree in electrical engineering and the M.S. degree in computer science from Keio University, Yokohama, Japan, in 1995 and 1997, respectively.

He is currently a Processor Development Engineer with the Broadband System-LSI Project, Toshiba Corporation Semiconductor Company, Kawasaki, Japan.

Mitsuo Saito received the M.S. degree in electronic engineering from the University of Tokyo, Tokyo, Japan, in 1974.

He joined the Research and Development Center, Toshiba Corporation, Kawasaki, Japan, in 1974. His focus has been the incorporation of Japanese (including Kanji) characters into computer systems, and desktop publishing. From 1984 to 1985, he studied at the Massachusetts Institute of Technology Media Laboratory and performed research on three-dimensional (3-D) graphics and human interface. In 1986, after returning to Japan, he started a 3-D graphics chip development project. The successor of this chip was adopted for use in the Sony Playstation. He also started a RISC processor development project which later became a joint project with Silicon Graphics, Inc. Beginning in 1994, he was involved in a server development project and designed a super-fault-resilient system. Since 1997, he has been a Director of the System ULSI Engineering Laboratory, Toshiba, where he has led the project team developing the processor for Sony's Playstation2, a media processor project, and a communication chip set project. In April 2000, his organization became the System LSI Research and Development Center, by combining the LSI design and the device technology development teams. In July 2001, he became Chief Fellow, and now leads the Broadband System LSI Project.

Takashi Miyamori received the B.S. and M.S. degrees in electrical engineering from Keio University, Yokohama, Japan, in 1985 and 1987, respectively.

Since joining Toshiba Corporation, Kawasaki, Japan, in 1987, he has performed research and development of microprocessors. From 1996 to 1998, he was a Visiting Researcher at Stanford University, Stanford, CA, where he researched reconfigurable computer architectures. He is currently a Senior Specialist in the SoC Research and Development Center, Toshiba.

Goichi Ootomo received the B.S. and M.S. degrees in electronic engineering from Ibaraki University, Ibaraki, Japan, in 1987 and 1989, respectively.

In 1989, he joined Toshiba Corporation, Kawasaki, Japan, where he was involved in the research and development of digital signal processors. Since 1992, he has been involved in the development of MPEG2 video decoder/encoder LSIs.


Masataka Matsui (S'83–M'85) received the B.S. and M.S. degrees in electronic engineering from the University of Tokyo, Tokyo, Japan, in 1983 and 1985, respectively.

In 1985, he joined the Semiconductor Device Engineering Laboratory, Toshiba Corporation, Kawasaki, Japan, where he was engaged in the research and development of static memories including 1-Mb CMOS SRAM and 1-Mb BiCMOS SRAM, BiCMOS ASICs, video compression/decompression LSIs, media processors, and embedded microprocessors. From 1993 through 1995, he was a Visiting Scholar at Stanford University, Stanford, CA, doing research on low-power LSI design. He is currently managing digital media system-on-chip (SoC) development at the SoC Research and Development Center, Toshiba. His current research interests include configurable embedded processor design, low-power and high-speed circuit design, and video compression/decompression SoC design. He is also a Visiting Lecturer at the University of Tokyo.

Mr. Matsui serves as a Program Committee Member for the Symposium on VLSI Circuits and the IEEE Custom Integrated Circuits Conference.