Signal Processing: Image Communication€¦ · 05.02.2016 · order to accelerate kernel image...

ARTICLE IN PRESS

Contents lists available at ScienceDirect

Signal Processing: Image Communication

Signal Processing: Image Communication ] (]]]]) ]]]–]]]

0923-59

doi:10.1

� Cor

E-m

Pleasreco

journal homepage: www.elsevier.com/locate/image

An attention controlled multi-core architecture for energy efficientobject recognition

Joo-Young Kim �, Sejong Oh, Seungjin Lee, Minsu Kim, Jinwook Oh, Hoi-Jun Yoo

Division of Electrical Engineering, School of Electrical Engineering and Computer Science, KAIST, 335 Gwahak-ro, Yuseong-gu, Daejeon 305-701, Republic of Korea

a r t i c l e i n f o

Article history:

Received 2 October 2009

Accepted 8 March 2010

Keywords:

Attention controlled

Multi-core architecture

Object recognition

Visual attention

Energy efficient

65/$ - see front matter & 2010 Published by

016/j.image.2010.03.003

responding author. Tel.: +82 42 350 8931; fax

ail address: [email protected] (J.-Y. Ki

e cite this article as: J.-Y. Kim, et agnition, Signal Process. Image Comm

a b s t r a c t

In this paper, an attention controlled multi-core architecture is proposed for energy

efficient object recognition. The proposed architecture employs two IP layers having

different roles for energy efficient recognition processing: the attention/control IPs

compute regions-of-interest (ROIs) of the entire image and control the multiple

processing cores to perform local object recognition processing on selected area. To this

end, a task manager is proposed to perform dynamic scheduling of various ROI tasks

from the attention IP to multiple cores in a unit of small-sized grid-tile. Thanks to a

number of grid-tile threads generated by the task manager, the utilization of the

multiple cores amounts to 92% on average. As a result, the proposed architecture

achieves 2.1� energy reduction in multi-core recognition system by indicating

processing cores to focus on critical area of the image with a 0.87 mJ attention

processing. Finally, the proposed architecture is implemented in 0.13 mm CMOS

technology and the fabricated chip verifies 3.2� lower energy dissipation per frame

than the state-of-the-art object recognition processor.

& 2010 Published by Elsevier B.V.

1. Introduction

Object recognition is the process of identifying objectsout of an input image by matching their features withtrained database of target objects. It has been a coretechnology for computer vision applications such asautonomous vehicles, autonomous mobile robots, andsurveillance systems [1–5], and also can be applied tovideo applications such as JPEG/MPEG encoding anddecoding [6]. Fig. 1 shows a simplified process of thescale invariant feature transform (SIFT) [7], a commonlyused object recognition algorithm. First, various scalespaces of an input image are generated by Gaussianfiltering operations with different coefficient values, anddifference-of-Gaussian (DoG) images are generated bysubtracting two neighboring image spaces. Then, object

Elsevier B.V.

: +82 42 350 3410.

m).

l., An attention contrun. (2010), doi:10.10

key-points are localized by searching local maxima/minima points among three neighboring DoG images.Each key-point is converted to a descriptor vector todescribe its magnitude and orientation characteristics.After this SIFT key-point description, generated vectorsare matched with the object database for the finalrecognition.

Since each stage of object recognition requires hugeamount of computations, it is difficult to achieve a real-time operation with a conventional single general-purpose processor (GPP). Zhang et al. [8] reveals a2.33 GHz Intel Core2 processor can only obtain 2 frame/sperformance when it runs the SIFT algorithm for VGA(640�480) sized input video, and they show thatreal-time operation over 30 frame/s for the SIFT can beachieved by parallelizing it on a multi-core systemcontaining two 2.33 GHz Intel Core2 Quad processors(8 cores). However, a multi-core system using GPPs is acostly solution in terms of silicon area, power, and energyconsumption. For example, the power consumption of the

olled multi-core architecture for energy efficient object16/j.image.2010.03.003

www.elsevier.com/locate/image

dx.doi.org/10.1016/j.image.2010.03.003

mailto:[email protected]

dx.doi.org/10.1016/j.image.2010.03.003

ARTICLE IN PRESS

σ00

InputImage

Gaussian ScaleSpace Generation

-

Difference ofGaussian (DoG)

Key-pointLocalization

DescriptorGeneration

-

-

- [5,3,7,2,…, ]

σ01

σ02

σ03

σ04

PostDatabaseMatching

Fig. 1. SIFT based object recognition process.

J.-Y. Kim et al. / Signal Processing: Image Communication ] (]]]]) ]]]–]]]2

multi-core system of [8] is more than 100 W, which isusually not appropriate for mobile applications.

To achieve real-time object recognition with consider-ably lower power, several embedded multi-core proces-sors have been developed in recent years [1–5]. Abbo et al.[1] and Kyo et al. [2] present massively parallel singleinstruction multiple data (SIMD) architectures that max-imize data-level parallelism in pre-processing stages ofobject recognition. Kim et al. [3] present a multi-corearchitecture including 10 processing units and 8 channelmemories to exploit task-level parallelism over data-levelparallelism in object recognition. Kim et al. [4] propose aspecial hardware IP named visual attention engine (VAE)[9] in a multi-core architecture with 8 SIMD processingcores. This processor proves that the concept of visualattention [10] can reduce the computational costs ofobject recognition in real system implementation. Theproposed VAE extracts the saliency map out of the inputimage and filters the key-points extracted by the proces-sing cores based on the map. By reducing the number ofvalid key-points, workloads of key-point based task suchas descriptor generation and further database matchingcan be reduced. However, the VAE plays a role in just apre-processing filter in the architecture because it doesnot have any controlling ability over multiple cores exceptreducing the number of key-points. The pixel based taskssuch as Gaussian scale space generation, DoG generation,and local maxima/minima search do not benefit from theVAE, and are executed by multiple cores with a conven-tional column-wise processing for the whole area ofimage.

In this paper, we propose an attention controlledmulti-core architecture for energy efficient object recog-nition. This architecture introduces two different IP layershaving different roles for more energy efficient proces-sing: the attention/control IPs that estimate regions-of-interest (ROIs) as a global workload estimator for theentire image and control the operations of multipleprocessing cores, and multiple processing cores thatperform local object recognition processing focusing onthose areas. Based on the ROI results, the introduced taskmanager dynamically controls the tasks and threads ofmultiple cores with a fine-grained grid-tile based proces-

Please cite this article as: J.-Y. Kim, et al., An attention contrrecognition, Signal Process. Image Commun. (2010), doi:10.10

sing model. It also manages the overall processing speedand resource allocation of multiple cores by differentiat-ing task scheduling methods. The rest of this paper isorganized as follows. In Section 2, the visual perception,an enhanced attention processing that extracts theROIs out of an input image, is introduced and the overallobject recognition algorithm based on it will be brieflycovered. In Section 3, the attention controlled multi-corearchitecture is proposed for energy efficient objectrecognition. Its grid-tile based processing model, applica-tion mapping, workload distribution, and schedulingmethod will be explained. The overall energy savingeffect of the proposed architecture will be evaluated.Section 4 describes the implementation of visual percep-tion in energy efficient way with cellular neural networksand neuro-fuzzy classifier. Section 5 shows the siliconrealization of the proposed architecture. After that, thispaper will be summarized in Section 6.

2. Visual perception based object recognition algorithm

Fig. 2 shows the flow diagram of the proposed visualperception based object recognition algorithm. It com-bines two different algorithms, visual perception andgeneral object feature description algorithm described inSection 1. The visual perception, an enhanced attentionprocessing, extracts the regions-of-interest (ROIs) out ofthe input image to provide more complete attentioninformation than previous one [4]. Firstly, the visualperception adopts Itti’s visual attention model [10] toextract several bottom-up features such as intensity,color, orientation, and motion, and promote them to asingle unified map named saliency map. Then, it selectsthe most salient parts of the image as the seed points andperforms region growing from seed points via repeatedhomogeneity classifications to determine the ROIs ofobjects [11]. For the feature description part, we employthe SIFT algorithm [7]. Consequently, object recognitionprocessing can be selectively performed in this algorithmwith the visual perception stage that extracts the ROIs ofinput image prior to detailed object feature descriptionstage.


dx.doi.org/10.1016/j.image.2010.03.003

ARTICLE IN PRESS

Fig. 2. Visual perception based object recognition.

VisualPerception

Engine

TaskManager

SIMD0

Network-on-Chip

ROI tiles

Multiple Cores

Attention /Control IPs

iMEM

Threadcontroller

Statustable

Dynamicscheduler

ROI Power

Management

SIMD1

iMEM

SIMDN-2

iMEM

SIMDN-1

iMEM

Fig. 3. Proposed attention controlled multi-core architecture.

J.-Y. Kim et al. / Signal Processing: Image Communication ] (]]]]) ]]]–]]] 3

3. Attention controlled multi-core architecture

3.1. Overall architecture

Fig. 3 shows the proposed attention controlled multi-core architecture. Unlike the conventional architecturesthat increase parallelism simply by exploiting moreprocessing cores, this architecture employs two differentIP layers – attention/control IP layer and parallelprocessing layer – having different roles for energyefficient recognition processing. The attention/control IPlayer estimates global workload of overall input image,i.e., regions-of-interest, and controls multiple cores inparallel processing layer to run application programfocusing on those areas, which results in more energyefficient processing. The attention/control IP layer


includes a visual perception engine (VPE) and taskmanager (TM), and the parallel processing layer consistsof N multiple cores. We call each of them an SIMDprocessor unit (SPU). The SPU core is a fullyprogrammable device and each of N SPU cores can runan independent program.

The VPE is a special IP to extract attention regions outof an input image by performing the visual perceptionalgorithm described in Section 2. As a result, it representsthe ROIs of the input image in the unit of small-sized tilecalled grid-tile. For example, the ROIs of a VGA (640�480pixels) sized image can be built by a number of grid-tileswhose size is 40�40 pixels. The SPU cores are employedto perform a feature description algorithm such as SIFT forthe selected grid-tile area by the VPE. It exploitsdual-issued VLIW instructions whose length is 51-bit and


dx.doi.org/10.1016/j.image.2010.03.003

ARTICLE IN PRESS


8-way ALU datapaths for SIMD extended instructions inorder to accelerate kernel image processing tasks such asGaussian image filtering, gradient computation, andhistogram operation. More detailed information aboutan SPU core is presented in [5]. The TM takes a role incontrol of multiple SPU cores. It performs dynamicscheduling of the ROI tile tasks from the VPE to SPU coresand manages multiple threads of SPU cores based on theircentralized status table when they run whole applicationprogram. In addition, it can control the overall executiontime of the SPU cores by its scheduling method. It canassign the grid-tile tasks as fast as possible to idle coresfor real-time applications or can save the SPU coreresources to meet low power requirements.

3.2. Grid-tile based ROI processing model

Fig. 4 shows a grid-tile based ROI processing model ofthe proposed architecture (Fig. 4b) with a conventionalcolumn-wise processing model (Fig. 4a). Most of theconventional multi-core architectures for image processingapplications adopt the column-wise processing model thateach processing core performs processing for the designatedcolumn region of the input image. By doing this, theworkloads of overall image processing are well distributedto multiple cores and data-level parallelism is easily ex-ploited by them in this model. However, the conventionalcolumn-wise processing cannot be applied to the proposedattention controlled architecture because attended regionswill be different every time according to each input image.By assigning specific regions to specific cores like conven-tional model, the workloads of ROI tile tasks would not bedistributed among processing cores. To handle this, weintroduce a grid-tile based ROI processing model with adynamic scheduler that assigns the ROI grid-tiles todifferent processing cores in a grid-tile unit every singleframe. With the status table of processing cores, it candynamically re-assign remaining grid-tile tasks to idle cores.Therefore, this grid-tile based ROI processing model issuitable in the proposed architecture to distribute theworkloads of overall image, i.e., the ROIs of the inputimage, to multiple cores.

Image C

olumn 0

PU PU PU PU PU

Image C

olumn 1

Image C

olumn 2

Image C

olumn 3

Image C

olumn N

-1

Image width

Pix

el u

pdat

e

N cores

S

=

Fig. 4. (a) Column-wise processing model and


3.3. Application mapping

There are two strategies in application mapping inmulti-core architecture. The first one is multiprocessorprogramming. In this scenario, multiple cores perform aone large process together while each core performs adifferent kind of small thread as a part of the wholeprocess. For example, for the SIFT algorithm mappingcase, cores 0–3 are mapped for Gaussian scale spacegeneration, core 4 is for DoG computation, cores 5–7are for key-point localizations, and so on. Therefore,intermediate data communications should be carried inthis application mapping and task-level parallelism isachieved because different cores perform different kindsof tasks at the same time. However, this multiprocessorprogramming requires different kinds of programming fordifferent multiple cores with program and data synchro-nization issues in data communications. Typically,multiprocessor programming demands much more timeand efforts than single processor programming and itsperformance gain is limited due to the data dependencyamong cores. In the proposed architecture, each SPU corecontains a 16-bit barrier register for program synchroni-zation in multiprocessor programming. The TM in thiscase should do complex partitioning of the ROI tasks,configure each core’s program, and control threads ofthem. The second strategy is to divide the whole processby the region while each core runs the whole applicationprogram. For example, core 0 performs SIFT for the area ofgrid-tile 0 while core 1 performs it for the area of grid-tile 1. In this scenario, multiple cores can achieve data andthread-level parallelism. Data communications amongcores occurring in boundary condition are less than themultiprocessor programming case. Table 1 summarizesthe two methods of application mapping in multi-corearchitecture.

Since we can avoid hard programming efforts andsynchronization issues by adopting the second method,we apply it to the SIFT mapping to the proposedarchitecture. The SIFT kernel applicable for any tile sizeis mapped to each SPU core and the TM schedules the ROIgrid-tile tasks to multiple SPU cores and manages the

Dynamiccheduler

PU

PU

PU

PU

Dyn

amic

sch

edul

ing

ROI area M grid-tiles

N cores

Status report

(b) grid-tile based ROI processing model.


dx.doi.org/10.1016/j.image.2010.03.003

ARTICLE IN PRESS

Table 1Two methods in application mapping.

Programming efforts Inter-processor data

transactions

Issues Parallelism

Multiprocessor

programming

High (multiple

programs)

Frequently occurs Program and data

synchronization

Task-level parallelism

Divide by region Low (use same program) Not frequently occurs Boundary region data sharing Data-level parallelism

Thread-level

parallelism

Partition k

Input tile and support region onexternal memory

stream_base

Circular buffer

SPU local memory

W

H

h

w

vdata reuse

(x,y)

newly fetched data

Parallel execution

Fig. 5. Software mapping on SPU core.


threads. In the SIFT mapping to a single SPU core, astreaming approach is used to reduce the size of localembedded memory as shown in Fig. 5. A mapped taskfetches input stream into the local memory by the DMA, itperforms data calculations, and the output stream iswritten back via the DMA. The DMA is effective to fetchand store such partial streams with hiding data fetchoverhead because it runs in parallel with the SIMDdatapaths of SPU core. The pseudo code in the figureshows an example of processing partial 2-dimensionalstreams. For each partition, the win_init function loads theinitial window as large as w�h pixels and the inner loopadvances the window by w�v pixels per iteration.The wait function waits until the DMA is idle. Softwarepipelining technique is applied to the inner loop to hidethe latency of data fetch. The win_load function loadsthe pixel data for the next iteration, and it has no flowdependence to the kernel_func function. Therefore, thekernel_func processes data fetched at the pervious loopiteration without waiting for the completion of win_load.The window processing creates a logical circular buffer inthe local memory, so that the overlapped region acrosstwo consecutive loop iterations can be reused. The SPU’ssoftware development kit provides a runtime libraryincluding the 2-dimensional window operations and acompiler based on GNU C Compiler. While developing theapplications, most of the source code is compiled by the


compiler and small hotspot kernels are developed ininline assembly.

3.4. Scheduling method

In the proposed architecture, the TM can control theusage of multiple SPU cores based on the ROI results ofthe VPE. It can slow down the overall process withresource saving or speed it up for real-time performanceby exploiting different scheduling methods. Since thenumber of ROI grid-tiles can be used as a workloadestimator of all SPU cores, the resource allocationof SPU cores can be controlled based on it. Fig. 6 showsthe flowchart of the proposed workload-aware taskscheduling (WATS) that differentiates the number ofoperating cores based on the measured workloads. Thisscheduling method aims at intelligent resource allocationand power saving by performing it.

At the beginning, the TM counts the number of ROIgrid-tiles from the VPE to measure the overall workload ofinput image. It classifies the measured workload to 1 of W

workload levels to represent the amount of workloads ofcurrent frame. The W workload levels are divided by W�1threshold values that can be updated in frame basis. Then,it determines the number of operating SPU cores andcorresponding power domains according to the selected


dx.doi.org/10.1016/j.image.2010.03.003

ARTICLE IN PRESS

Any update?

Update thresholds

Classify levels

Determine # of operatingcores (out of N)

Determine # ofpower domains

Turn off unusedpower domains

Assign ROIs to cores

WorkloadMeasure

Frame end?

N

Y

Wait powertransition

Y

Y

N

Measure Exe.Time

1 of W workload levels

NxWmea

W

Fig. 6. Flowchart of workload-aware task scheduling.


workload level Wmea. Simply, we can determine thenumber of operating cores out of N cores to be a linearlyproportional, Nx (Wmea/W). Power co-optimization is alsoperformed by gating off the power domains of unusedcores. The WATS only activates the appropriate amount ofcores and power domains, and saves the rest of them.After waiting a hundred ms power transition periods of thedomains, the TM starts to assign the ROI grid-tile tasks tothe active SPU cores. In our implementation case, whichwill be described later in detail in Section 5, we choose thenumber of SPU cores and their power domains to 16and 4, respectively. Since the WATS determines thenumber of operating SPU cores in proportion to theworkload amount, the execution time of whole frame ismanaged around a constant value. Unless all the ROIgrid-tile tasks are assigned to active SPU cores, the TMaggressively assigns remaining tasks to idle cores basedon their status table. Completing all of the ROI grid-tiletasks, the TM measures the overall execution time ofcurrent frame for further information to update thethreshold values. The execution time can be shortened


or lengthened by decreasing and increasing the thresholdvalues, respectively. This WATS shows the execution timeof whole process and resource saving of multiple SPUcores can be managed at the same time. A schedulingmethod of the TM can be changed by suitably program-ming it for the purpose of overall system.

3.5. Experimental results

To evaluate the performance of the proposed attentioncontrolled multi-core architecture, we perform experi-ments on 100 sample images from real video frames witha cycle accurate simulator in C++/System C. The size ofinput image is the VGA (640�480) and the size of a grid-tile can be configured from 24�24, 32�32, y to 72�72pixels, by 8 pixels step both in width and height. In theexperiments, the VPE extracts the ROIs of the input imagein a grid-tile unit and the TM schedules them to N SPUcores with or without any scheduling method. Theparameterized SIFT program is mapped to each SPU core


dx.doi.org/10.1016/j.image.2010.03.003

ARTICLE IN PRESS


and the N value is set to 16. The experiments cover 4issues of the proposed architecture. First, the utilizationsof multiple SPU cores are evaluated. Second, the perfor-mance of the grid-based ROI processing architecture isanalyzed according to the size of the grid-tile. Third,execution time management, and resource usage of theWATS will be covered. Last, the overall energy reductioneffect by the proposed attention controlled architecturewill be shown.

Fig. 7 shows the hardware utilizations of 16 SPU coreswhen the TM schedules the ROI grid-tile tasks to all of the16 SPU cores without any special scheduling method.The size of grid-tile is set to 40�40 pixels. For 100 frameimages, the average activation time of each SPU core ismeasured. With a number of ROI grid-tile tasks, the

SPU0SPU1

SPU2SPU3

SPU4SPU5

SPU6SPU

4

8

12

16

Exe

cutio

n tim

e (m

s)

Fig. 7. Hardware utiliza

24 32 40 48

1

2

3

4

5

6

7

8

Size of grid-tile

Optimal point

Overall execution time

Exe

cutio

n tim

e pe

r RO

I* g

rid-ti

le (m

s)

Fig. 8. Tile size vs. overa


average utilization of all SPU cores is calculated as 92%while ranging from 87% to 99%. This result showsattention controlled multiple processing cores are highlyutilized in the proposed architecture.

Fig. 8 shows the performance of the grid-based ROIprocessing according to the size of a grid-tile. We measurethe average execution time per ROI grid-tile and theaverage number of ROI grid-tiles for the 100 sampleimages where the size of a grid-tile varies from 24�24 to72�72 with a step of 8�8. As a result, the execution timeof an ROI grid-tile linearly increases with the size of grid-tiles. Since the SPU core contains SIMD extended1-dimensional vector processing, the overall executiontime is linear even though the processed area increasesquadratically. On the other hand, the number of ROI

7SPU8

SPU9SPU10

SPU11SPU12

SPU13SPU14

SPU15

Avg: 92%

tion of SPU cores.

56 64 72

20

40

60

80

100

120

140

160#

of R

OI g

rid-ti

les

(pixels)

* ROI: Regions-of-Interest

ll execution time.


dx.doi.org/10.1016/j.image.2010.03.003

ARTICLE IN PRESS

Conventional

Ene

rgy

cons

umpt

ion

(mJ)

2.1Xreduction

0.87mJ

13.54mJ

Attentioncontrolled

5

10

15

16 SPU cores5.60mJ

6.47mJ

Attention

Fig. 10. Energy reduction effect.


grid-tiles decreases quadratically according to the size ofgrid-tile. By multiplying two factors, we can obtain overallexecution time. The resulting performance is not optimalwhen the size of grid-tile is too small or too large, i.e.,24�24 and 72�72, however, the differences are not verylarge for the medium size cases from 32�32 to 64�64.The optimal point is when the tile size is 32�32. Thecases of multiples of 16 such as 32�32, 48�48, and64�64 show better performance due to their better rowdata mapping to SPU core. Lastly, it is remarkable thatthe experimental results can be affected by hardwarestructure and software mapping on it. In our case, we usedan 8-way SIMD processor and streaming model insoftware mapping as described in Section 3.3. Datatransaction style with external memory can be anotherfactor in determining the optimal size of the grid-tile.Using cache or scratch pad, using DMA or not can affectthe performance. In our case, we used a scratch pad with aDMA capable of 2-dimensional data access.

Fig. 9 shows the experimental results of the WATSshowing its execution time management. We measuredthe execution times of the overall process of SPU coreswhen the ROI grid-tiles are scheduled to 16 SPUs by theWATS. The mean and variance value of the number of ROIgrid-tiles for 100 sample images are measured as 79.33and 22.87, respectively, where the size of input image isthe VGA (640�480) and the size of a grid-tile is 40�40pixels. As shown in the result graph, the WATS managesthe overall execution time around 16.4 ms with a smallvariance of 2.62 under the large ROI workload variation. Inthis situation, the average number of operating SPU coresis measured as 13.16 and the rest SPU core resources canbe used for other purposes or saved for low powerconsumption.

Fig. 10 shows the total energy reduction by theproposed architecture. We measured the energyconsumption of conventional and attention controlledmulti-core architecture under the condition that thehardware of multiple cores are the same as 16 SPU

Frame

# of

RO

I* g

rid-ti

les

0

40

80

120

160

Input video frames μ=79.33 / σ

Execution time16.4ms

Varianc2.62

Fig. 9. Execution time ma


cores and the energy consumption of a controller unit isignored in both the cases. Therefore, energy overhead ofthe proposed attention controlled architecture is energydissipation by the VPE for attention processing. As aresult, the attention controlled multi-core architecturereduces the total energy consumption by 2.1 times byreducing the energy consumption of energy hungrymultiple cores with investing a 0.87 mJ attentionprocessing.

4. Neuro-fuzzy visual perception implementation

In the proposed attention controlled architecture, theROIs of an input image are used as an important clue forintelligent control of multiple cores to reduce the overallenergy consumption. In this section, we describe theimplementation of a special hardware named visualperception engine (VPE) that extracts the ROIs out of theinput image prior to detailed object recognition. Since thevisual perception is an additional process to obtain energysaving in object recognition processing, the implementa-

#

(100 samples)= 22.87

Exe

cutio

n tim

e (m

s)

10

20

30

40

0

16.4

e Resource use13.16/16.00

* ROI: Regions-of-Interest

nagement by WATS.


dx.doi.org/10.1016/j.image.2010.03.003

ARTICLE IN PRESS


tion cost of it in terms of energy dissipation should besmall enough.

For energy efficient design, the VPE employs an RISCcontroller and 3 dedicated hardware blocks for thedetailed visual perception process as shown in Fig. 11: amotion estimator (ME), a visual attention engine (VAE),and an object detection engine (ODE). The ME is employedto generate a motion vector map between twoconsecutive frames using a conventional block matchingalgorithm [12]. The VAE is employed to accelerate featuremap extraction and normalization to generate saliencymap. The ODE is proposed to perform final ROIclassification from selected seed points that determinesneighboring pixels can be joined to interest region ofobject or not. The RISC controller takes a role in control ofthe 3 dedicated hardware blocks and performingsoftware-oriented operations. A 24 KB memory is usedfor storing original images and data communicationamong the 3 special blocks by sharing intermediateprocessing data. In the design of the VAE and ODE, bio-inspired 2-dimentional cellular neural networks (CNNs)and neuro-fuzzy classifier suitable for rapid featureextraction and robust classification, respectively, areemployed.

4.1. Cellular neural networks based visual attention engine

Fig. 12 shows the block diagram of the VAE and its cellorganizations [9]. The VAE is composed of 40�40 cellarrays and a linear array of 40 PEs. The cells performstorage and inter-cellular communication functions whilethe PEs are responsible for processing the cells data. EachVAE cell consists of a 6T SRAM based 4�8-bit registerthat stores intermediate data and result data of CNNoperation, and 8-bit 4-directional shift register that shiftsthe register data to the adjacent cells. A shift operation on

Target algorithm

Intensity Color Orientation

dynamicfeature

Saliency map

static features

Motion

Region growing via pixel classification

Input Image

Multi-scale generation

Center surround difference

Normalization & combination

Seed selection (10seeds)

ROIs of image

Fig. 11. Block diagram of vis


the entire chip requires only 1 cycle to complete. Basedon its fully connected 2-dimentional cellular structureand fast data movement among neighbor cells, its regionaland collective processing accelerates the various featureextractions that contains a lot of image filtering andconvolution operations.

4.2. Object detection engine: neuro-fuzzy classifier

Fig. 13 shows the block diagram of the ODE and itstransistor-level circuit implementations [11]. The ODEconsists of Gaussian fuzzy membership and single-layerneural network, used for similarity measure between theseed and the target pixel and for final decision making inclassification, respectively. An analog-digital mixed modeapproach is used to exploit both advantages of analogand digital circuits. Firstly, data processing parts of theODE are designed by analog circuits to obtain area andpower reduction in non-linear Gaussian function andsynaptic multiplier implementation. After converting8-bit intensity, saliency, and location digital values ofthe target and seed pixel to analog signals, 3 analogGaussian function circuits measure the similaritiesbetween the two pixels. Gaussian function is realized bythe combination of MOS differential pair and minimumfollowing circuit in current mode operation. Thedifferential pair circuit generates two symmetricdifferential signals where each of them has exponentialnon-linearity characteristics, and the minimum followercircuit results Gaussian-like output by following theminimum between the two signals. Current mode neuralsynaptic multipliers are implemented by binary-weightedcurrent mirrors to merge the 3 similarities with theirweight values. A comparator circuit makes the finaldecision result 0 or 1 through thresholding operation.On the other hand, a learning part in order to update the

D $

32-bitRISC

SharedMemory(24KB)

VisualAttentionEngine

ObjectDetection

Engine

MotionEstimator

32b Pipelined BU

S

Visual Perception Engine

I $

ual perception engine.


dx.doi.org/10.1016/j.image.2010.03.003

ARTICLE IN PRESS

Cellular Neural Network (CNN) Cell Array

Redistribution Layer

CNN Processing Elements (x40)

Controller

state

IMEM(2KB)

/equalize

read bus 1

writeenable

write bus

North

West South

East

/write bus

readenable 1

/pre-charge

loadenable

WL0

WL1

WL2

WL3

/shift

shift

LOA

D

SHIF

T IN

SHIFT O

UT

readenable2

REGISTERFILE SHIFT

REGISTER

north en

south en

east en

west en

Pulsed Latch6T SRAM Cell

NMOS only Mux/ Demux(Dynamic Logic)read bus 2

Fig. 12. Cellular neural network based visual attention engine.


weight values of neural network is implemented bydigital circuits for easy management of data storage.With an unsupervised Hebbian learning that strengthensthe connection weight to firing cells [13] only a multiplierand an adder are used for each weight value. As aresult, analog-digital mixed mode implementationreduces the area and power consumption of the ODEby 59% and 44%, respectively, compared with those ofdigital-only implementations. With the 4 parallelexecution units, the ODE completes the ROI detectionfor each region within 7 ms at 200 MHz operatingfrequency.


4.3. VPE evaluation

To evaluate the energy efficiency of the VPE, wecompare the energy consumption of it with the othercomparable processing unit when both of them run thesame visual perception algorithm. We compare the VPE toan SPU core because its SIMD processing is also quitesuitable for visual perception algorithm. To consider areafactor, we compare the results of 2 SPU cores with theVPE. As shown in Table 2, the VPE’s dedicated hardwareblocks execute target algorithm more than 3.5 timesfaster than SPU core’s SIMD acceleration. As a result, the


dx.doi.org/10.1016/j.image.2010.03.003

ARTICLE IN PRESS

X

Gaussian Function Circuits Neural Synaptic Multiplier

w1…DAC

1-DGaussian

ADC

Wk+1 = Wk+ �*xk*yk

out

DAC

Analog Neuro-Fuzzydatapath

Iout

A

B

Differentialpair

Vbias

Xin Xrefb3 b0

Iout

b7 b4

vIin

Iout= Iinx(∑bix2i)

DAC

1-DGaussian

St

DACSs

It

Is

DAC

2-DGaussian

Xt

DACXs

DACYt

DACYs

Weight DataRegister FileDigital

w2…

w3…

�

yk

xk

Analog

4

Fig. 13. Neuro-fuzzy object detection engine.

Table 2VPE evaluation.

Execution time Power dissipation Energy consumption Energy gain

2 SPU cores 36.9 ms 67.5 mW 2.49 mJ –

VPE 10.4 ms 83 mW 0.87 mJ 2.85�


VPE completes the target visual perception algorithmwith 0.87 mJ energy dissipation and this is 2.85� moreenergy efficient than 2 SPU cores.

5. System integration

The proposed attention controlled architecture isimplemented in a real-time object recognition processor


as shown in Fig. 14. Except the VPE, TM, and 16 SPUcores, a decision processor (DP) is added for post data-base matching task after key-point description by the 16SPU cores. In the proposed processor, 3 recognition stages,visual perception, feature description, and databasematching, operated by the VPE, 16 SPUs, and DP,respectively, are executed in the pipeline for high-recognition throughput. All 21 IP blocks including2 external memory interfaces are fully interconnected


dx.doi.org/10.1016/j.image.2010.03.003

ARTICLE IN PRESS

Fig. 14. System integration block diagram.

0 1

2 3

4 5

6 7

8SPU SPU SPU SPU SPU SPU

SPU SPU SPU SPU SPU SPU

9

10 11

SPU12

SPU13

SPU14

SPU15

Visual PerceptionEngine

DecisionProcessor

NoC

PLL

Task Manager

Fig. 15. Chip micrograph.


through a network-on-chip (NoC). The 16 SPU cores aredivided into 4 SPU clusters where a cluster contains 4 SPUcores and has a separate power domain. Further analysisabout the number of separate power domains and itspower reduction effect is given in [14]. For power gatingof each domain, external power gating is used. In the chip,the TM manages 4 power domains of 4 SPU clusters byrequesting each of the external regulators to gate-offcorresponding domain.

Fig. 15 shows the chip micrograph of the proposedprocessor fabricated in 0.13 mm 1-poly 8-metal CMOStechnology. Its area is 7�7 mm2 and contains 36.4 Mtransistors including 3.7 M logic gates and 396 KB on chipSRAM. The operating frequency is 200 MHz for IP blocksand 400 MHz for NoC. Its peak performance amounts to201.4 GOPS where 1 operation means 16-bit fixed-pointoperation. The average power consumption is 496 mW atthe supply voltage of 1.2 V. Table 3 summarizes chipfeatures.

Table 4 shows performance comparisons with previousobject recognition processors [2–4,14]. While IMAP-CARhas a target application of vehicle vision, other processorsshare the same target application, SIFT based objectrecognition. In commercial CELL-BE processor [15] andmulti-core object recognition processor [3], only SIFTfeature description is implemented without post databasematching process. On the other hand, full object


recognition is implemented in dual-mode objectrecognition processor [4] and the proposed processorwith the database size of 5632 and 16,384, respectively.


dx.doi.org/10.1016/j.image.2010.03.003

ARTICLE IN PRESS

Table 3Chip summary.

Technology 0.13 mm 1P 8 M CMOS

Package 320 pin FPGA

Die size 7 mm�7 mm

Power supply 1.2V core, 2.5V I/O

Operating frequency 200 MHz IPs/400 MHz NoC

Transistor counts 36.4 M transistors

3.73 M gates/396 KB SRAM

Power consumption Peak: 695 mW/ Average: 496 mW

Peak performance VPE 54 GOPS

16 SPU cores 128 GOPS

DP 19.4 GOPS

Total 201.4 GOPS

Power efficiency 290 GOPS/W

Target application Real-time object recognition

Input screen VGA (640�480 pixels)

Frame rate 60 frame/s

Database size 16,384 vectors (50 objects)

Table 4Performance comparisons.

IMAP-CAR [2] CELL-BE [15] CICC2007 [3] ISSCC2008 [4] This work

Process 130 nm 90 nm 180 nm 130 nm 130 nm

Chip size/

Transistor count

Not reported/

26.8 M TRs

235 mm2/241 M TRs 38.5 mm2/0.8 M gates, 34 KB 36 mm2/2M gates, 228 KB 49 mm2/36 M TRs

Power supply 1.2 V 0.9–1.3 V 1.8 V 1.2 V 1.2 V

Operating

frequency

100 MHz 3.2 GHz 200 MHz 200 MHz 200 MHz

Peak performance 100 GOPS 4200 GFLOPS 81.6 GOPS 125 GOPS 201.4 GOPS

Power

consumption

o2 W About 50 W 1.08 W (avg.) 0.6 W (avg.) 0.5 W (avg.)

Target application Vehicle area

detection

SIFT descriptor

generation

SIFT descriptor generation Object recognition

(attention based)

Object recognition

(attention based)

Hardware

architecture

SIMD Ring interconnected

multi-core

Multi-core w/shared channel

memories

Reconfigurable SIMD/

MIMD

Attention controlledmulti-core

Frame rate 32 fps 0.48 fps 16 fps 22 fps 60 fpsDatabase size N.A. N.A. N.A. 5632 16,384

Energy/frame 40 mJ 104 J 67.9 mJ 26.5 mJ 8.2 mJScreen resolution 256�240 UVGA (1600�1200) QVGA (320�240) QVGA (320�240) VGA (640�480)

Energy/pixel 651 nJ 54 mJ 884 nJ 345 nJ 26.7 nJ


Since the attention controlled multi-core architecturereduces the overall workload of object recognition andmanages multi-core resource efficiently, the proposedprocessor achieves 60 frame/s frame rate for the VGA-sized input video with 496 mW low power consumption.As a result, the obtained 8.2 mJ energy dissipation perframe is the lowest ever and 3.2� less than the state-of-the-art object recognition processor.

6. Conclusions

In this article, an attention controlled multi-corearchitecture is presented for energy efficient objectrecognition. The VPE computes regions-of-interest (ROIs)out of the input image and the TM controls the multipleprocessing cores to perform local object recognitionprocessing for the selected area. The TM performsdynamic scheduling of various ROI tasks from the VPE tomultiple cores in a unit of small-sized grid-tile. In this


architecture, the utilization of the multiple cores ismeasured as 92% and the 32�32 pixel size is the bestfor the grid-tile in terms of overall execution cycles. As aresult, the proposed architecture achieves 2.1� energyreduction in multi-core object recognition system withsacrificing small overhead for attention processing. Final-ly, the proposed architecture is realized in 0.13 mm CMOStechnology and verifies 3.2� lower energy dissipation perframe than the state-of-the-art object recognition processor.

Although this architecture is optimized for ROI basedselective processing, it is also suitable for general imageprocessing applications. With the tile based fine-grainedprocessing, task management by the TM can achieve highutilization of multiple processing cores.

References

[1] A. Abbo, et al., XETAL-II: a 107 GOPS, 600 mW massively-parallelprocessor for video scene analysis, IEEE ISSCC Dig. Tech. Pap.(February 2007) 270–271.


dx.doi.org/10.1016/j.image.2010.03.003

ARTICLE IN PRESS


[2] Shorin Kyo, et al., A 100 GOPS in-vehicle vision processor for pre-crash safety systems based on ring connected 128 4-way VLIWprocessing elements, IEEE Symp. VLSI Circuits (2008) 28–29.

[3] D. Kim, et al., An 81.6 GOPS object recognition processor based on NoCand visual image processing memory, IEEE CICC (2007) 443–446.

[4] K. Kim, et al., A 125GOPS 583 mW network-on-chip based parallelprocessor with bio-inspired visual attention engine, Dig. Tech. Pap.IEEE ISSCC (2008) 308–309.

[5] J.-Y. Kim, et al., A 201.4GOPS 496 mW real-time multi-objectrecognition processor with bio-inspired neural perception engine,Dig. Tech. Pap. IEEE ISSCC (2009) 150–151.

[6] C. Privitera, L. Stark, Algorithms for defining visual regions-of-interest: comparison with eye fixations, IEEE Trans. Pattern Anal.Mach. Intell. 22 (9) (2000) 970–981.

[7] David G. Lowe, Distinctive image features from scale-invariantkeypoints, ACM Int. J. Comput. Vision 60 (2) (2004) 91–110.

[8] Q. Zhang, et al., SIFT implementation and optimization for multi-core systems, IEEE International Symposium on Parallel andDistributed Processing, IPDPS 2008.


[9] S. Lee, et al., The brain mimicking visual attention engine: an80�60 digital cellular neural network for rapid global featureextraction, IEEE Symp. VLSI Circuits (2008) 26–27.

[10] L. Itti, et al., A model of saliency-based visual attention for rapidscene analysis, IEEE Trans. Pattern Anal Mach. Intell. 20 (11) (1998).

[11] M. Kim, et al., A 22.8GOPS 2.83 mW neuro-fuzzy object detectionengine for fast multi-object recognition, IEEE Symp. VLSI Circuits(2009) 260–261.

[12] P. Pirsch, N. Demassieux, W. Gehrke, VLSI architectures for videocompression—a survey, Proc. IEEE 83 (2) (1995) 220–246.

[13] D.O. Hebb, in: The Organization of Behavior, John Wiley & Sons,1949.

[14] J.-Y. Kim, et al., A 60 fps 496 mW multi-object recognitionprocessor with workload-aware dynamic power management,ACM/IEEE International Symposium on Low-Power Electronicsand Design (ISLPED), pp. 365–370, 2009.

[15] B. Kwon, et al., Parallelization of the scale-invariant keypointdetection algorithm for cell broadband engine architecture, IEEEConsum. Commun. Networking Conf. (CCNC) (2008) 1030–1034.


dx.doi.org/10.1016/j.image.2010.03.003

Signal Processing: Image Communication€¦ · 05.02.2016 · order to accelerate kernel image...

Documents

Transcript of Signal Processing: Image Communication€¦ · 05.02.2016 · order to accelerate kernel image...