
Image Processing Capabilities of ARM-based

Micro-controllers for Visual Sensor Networks

M. Amac Guvensan, A. Gokhan Yavuz, Z. Cihan Taysi, M. Elif Karsligil, Esra Celik 

Department of Computer Engineering, Yildiz Technical University
Istanbul, Turkey
{amac, gokhan, cihan, elif, esra}@ce.yildiz.edu.tr

Abstract—The last decade witnessed the rapid development of Wireless Sensor Networks (WSNs). More recently, the availability of inexpensive hardware such as CMOS cameras and microphones that are able to ubiquitously capture multimedia content from the environment has fostered the development of Wireless Multimedia Sensor Networks (WMSNs) [1]. Nodes in such networks require a significant amount of processing power to interpret the collected sensor data. Most of the currently available wireless multimedia sensor nodes are equipped with ARM7 core micro-controllers [2]. On the other hand, the ARM9 core is a viable alternative, which delivers deterministic high performance and flexibility for demanding and cost-sensitive embedded applications. Thus, we evaluated the performance of the ARM7 core against the ARM9 core in terms of processing power and energy consumption. Our test results showed that the architectural improvements of the ARM9 core alone resulted in a 30% speed-up in execution time, while the ARM9 core in general performed 9 to 11 times faster than the ARM7 core.

Keywords-Convolution, Visual Sensor Networks, ARM-based micro-controllers, Performance Evaluation

I. INTRODUCTION

Wireless Multimedia Sensor Networks (WMSNs) receive much attention with the rapid development of and progress in sensor technologies, embedded computing, and the availability of inexpensive CMOS cameras and microphones [2]. WMSNs are able to obtain audio/visual information from an observed area, in contrast to traditional WSNs, where sensor nodes were restricted to collecting basic scalar data. Thus, they recently enabled many WMSN applications such as surveillance sensor networks, law-enforcement reporting, traffic control systems, advanced health care delivery, automated assistance to the elderly, telemedicine, and industrial process control [1].

Interpreting the observed data accurately and timely is the main focus of these applications. Depending on the requirements of the application, this processing might occur either at the more powerful base station or within the network on the battery/CPU constrained sensor nodes. WMSNs are supposed to process huge amounts of data during their operation [3]. Transmitting this raw data to a base station without in-network processing causes sensor nodes to deplete their batteries quickly. Hence, substantial energy could be saved by avoiding the transmission of redundant information [4]. Besides, sending raw data, especially medium/high resolution video images, is a serious problem due to communication link restrictions [5]. Fortunately, new generation micro-controllers encourage researchers to implement signal processing algorithms such as encoding, compression, and object detection on the nodes. In-situ processing brings many advantages to overcome the challenges in WMSNs. Therefore, researchers aim at transferring only meaningful information from sensor nodes to base stations.

Objects and/or regions could be described as the most important meaningful information in a picture/frame. Sensor nodes in a visual sensor network are responsible for the detection, recognition, and tracking of targeted objects [6]. To fulfill these tasks, several image processing techniques have been implemented on sensor nodes recently. Real-time image processing, especially real-time video processing, is still hard to implement on existing nodes due to their CPU/memory constraints. Exploring the currently available wireless multimedia sensor nodes [2][7] shows us that ARM micro-controllers are very common among these nodes. Due to their low cost, small size, and simplicity, they became very popular and dominant in the market. ARM processors are used extensively in consumer electronics, including PDAs, mobile phones, music/media players, and hand-held game consoles. ARM [8] offers a wide range of micro-controller cores, such as ARM7, ARM9, and ARM11, with different functionalities and capabilities. ARM7 is one of the more powerful micro-controller cores for embedded computing. However, its performance is still behind the requirements of real-time video processing, especially for high resolution images. On the other hand, ARM9 and ARM11 introduce new capabilities for signal processing algorithms. ARM11, however, includes a wide range of peripherals such as an LCD controller, an MPEG encoder, and a graphics accelerator, since it is specifically designed for multimedia applications. Due to its high cost and peripheral overhead, it is not a viable candidate for visual sensor nodes. Thus, it is not included in this study.

In this paper, we compare the performance of ARM9 with the performance of ARM7 and show the necessity of ARM9 based micro-controllers for embedded computing in visual sensor networks. A wide range of image/video processing algorithms, from preprocessing to encoding, are made up of a sequence of convolutions, which transform images using a set of kernel matrices [9]. Thus, we believe that how well a micro-controller performs the compute-intensive convolution process could be indicative of its suitability for embedded signal processing applications.

The rest of this paper is organized as follows. Section II discusses available studies on performance evaluation for embedded systems. In Section III, we categorize the basic convolution algorithms and explain their importance for multimedia sensor applications. Section IV presents the architectural details and differences of ARM7 based and ARM9 based micro-controllers. In Section V, we present the results of the performance evaluation of the ARM7 and ARM9 cores.

II. RELATED WORK

Performance optimization of image processing algorithms on embedded systems can be achieved either in hardware or in software. In [10], the authors propose an embedded video image recognition system based on a Single Instruction Multiple Data (SIMD) micro-controller and memory arrays. Although their platform runs four times faster than a general purpose processor, it is not suitable for use in a wireless sensor network due to its very high production cost and energy consumption. In [11], it is shown that using a high performance micro-controller is more efficient when performing complex operations like image processing. Thus, the authors propose a dual processor architecture consisting of an Intel PXA255 micro-controller and a TI MSP430F1611 micro-controller. The PXA255 is used only for image processing, whereas the MSP430F1611 is used for general sensing purposes. The proposed system also adopts a dual radio solution based on low power 802.15.4 and 802.11g interfaces.

In [12], an embedded smart camera system is proposed for traffic surveillance. The authors employ both hardware and software optimizations to improve the system performance. Hardware optimizations include the use of Direct Memory Access (DMA) and the integration of a DSP. Software optimizations are limited to the removal of external memory accesses and the replacement of floating point operations with fixed-point operations. Stationary vehicle detection was selected as an example application. The prototype system is capable of processing 1.5 frames per second (fps) at full PAL resolution.

In [13], the authors propose a low cost embedded color vision system consisting of a RISC based micro-controller operating at 75 MHz and a CMOS camera. The proposed system performs color blob tracking by comparing the RGB values of each pixel with lower and upper bounds. The system is capable of tracking color blobs at 16 fps at a resolution of 80×143.

A parallel image processing architecture is proposed in [14] for embedded vision systems. The system is based on an FPGA and operates at a clock rate of 50 MHz. It is capable of performing pre-processing functions such as filtering, correlation, and transformation. The system benefits from 16 identical processing elements, each of which can be considered a small DSP intended for image processing.

In [15], the authors demonstrate the feasibility of a distributed surveillance system by developing a prototype implementation. They design the system architecture based on commercial off-the-shelf elements to accelerate the development phase and to lower the production cost. The system has three main parts: the sensing unit, the processing unit, and the communication unit. The sensing unit is a CMOS camera that is able to deliver 30 fps at VGA resolution and is connected to the processing unit via a FIFO memory. The processing unit consists of 10 TI TMS320C64x DSPs that can deliver an aggregate performance of 80 GIPS. A peripheral component interconnect (PCI) bus couples the DSPs and connects them to the communication unit. The communication unit has a wired Ethernet interface, a wireless GSM interface, and an Intel XScale IXP425 processor to manage communications.

In [9], the authors evaluate the performance of the AD Blackfin BF561 fixed point, dual core DSP for image processing algorithms. They use assembly optimizations and a stream image processing technique to reduce the runtime of the algorithms.

Most of the work discussed above proposes special architectures to satisfy the requirements of visual sensor networks. Although these architectures perform well, they are not applicable to visual sensor networks due to their high manufacturing cost and energy consumption. Thus, we believe that general purpose micro-controllers are more appropriate for the task.

III. CONVOLUTION IN IMAGE PROCESSING

Convolution is a fundamental mathematical operation for image processing. Many image processing algorithms exploit convolution basically to enhance the image and/or extract features from the image [6]. The basic idea is that an image is scanned with a window of predefined size and shape. The output pixel value is the weighted sum of the input pixels within the window, where the weights are the values of the filter assigned to every pixel of the window itself. The window consisting of the weights is called the convolution kernel. An image represented by the 2D matrix $F := (f(x, y))_{m \times n}$ can be transformed into another image $g(x, y)$ by convolving $F$ with a kernel, denoted by $H$:

$$g(x, y) = f(x, y) * h(x, y) \qquad (1)$$

This process can be expressed explicitly as in Equation 2:

$$g(x, y) = \sum_{j=-\frac{n}{2}}^{\frac{n}{2}} \; \sum_{i=-\frac{n}{2}}^{\frac{n}{2}} h\!\left[\, j + \tfrac{n-1}{2},\; i + \tfrac{n-1}{2} \,\right] \times f[\,x - j,\; y - i\,] \qquad (2)$$

Convolution is applied to images for many different purposes, including noise removal, smoothing, blurring, sharpening, and determination of edges [16]. Convolution-based image processing algorithms can be categorized into three groups:

1) Enhancing Operations
2) Derivative-based Operations
3) Morphological Operations


Fig. 1. Convolution-based Operations

 A. Enhancing Operations

Image enhancing operations are utilized to prepare images for further processing such as segmentation and feature extraction. Applying these types of filters mainly aims at reducing noise and smoothing/blurring/sharpening the image. Existing image filters can be grouped into several domains, such as linear and non-linear, low-pass and high-pass, rectangular and circular. As an example, the two most popular filters, a low-pass and a high-pass filter, are given in Equation 3.

$$L_{pass} = \frac{1}{9}\begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix} \qquad H_{pass} = \begin{bmatrix} -1 & -1 & -1 \\ -1 & 8 & -1 \\ -1 & -1 & -1 \end{bmatrix} \qquad (3)$$
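For illustration, the two kernels of Equation 3 could be declared as follows in C (a minimal sketch of ours; the names lowpass3 and highpass3 are not from the original text):

/* 3x3 low-pass (averaging) kernel of Equation 3: every weight is 1/9,
 * so each output pixel becomes the mean of its 3x3 neighborhood. */
static const float lowpass3[3][3] = {
    { 1.0f / 9, 1.0f / 9, 1.0f / 9 },
    { 1.0f / 9, 1.0f / 9, 1.0f / 9 },
    { 1.0f / 9, 1.0f / 9, 1.0f / 9 },
};

/* 3x3 high-pass kernel of Equation 3: weighs the center pixel against
 * its neighborhood, accentuating fine detail and edges. */
static const float highpass3[3][3] = {
    { -1.0f, -1.0f, -1.0f },
    { -1.0f,  8.0f, -1.0f },
    { -1.0f, -1.0f, -1.0f },
};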

 B. Derivative-based Operations

Derivative-based operations are applied for detecting the three basic types of discontinuities in a digital image: points, lines, and edges [16]. There are two types of operations: first-order derivatives and second-order derivatives. The most common first-order derivative-based operation is edge detection. To find edges in a 2D image, a horizontal and a vertical kernel are scanned across the image to calculate both the gradient magnitude and the gradient direction. The most preferred gradient filters among the first-order derivatives are the Roberts, Prewitt, Sobel, and Gaussian kernels. Equation 4 gives an example of the Prewitt filter-pair.

$$\mathrm{Prewitt}_x = \begin{bmatrix} 1 & 1 & 1 \\ 0 & 0 & 0 \\ -1 & -1 & -1 \end{bmatrix} \qquad \mathrm{Prewitt}_y = \begin{bmatrix} 1 & 0 & -1 \\ 1 & 0 & -1 \\ 1 & 0 & -1 \end{bmatrix} \qquad (4)$$
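As an illustration of how such a filter-pair is used (our own sketch, not part of the original evaluation), the two kernel responses at a pixel, gx and gy, combine into the gradient magnitude and direction as follows:

#include <math.h>

/* Combine the horizontal and vertical responses of a kernel pair
 * (e.g. the Prewitt pair of Equation 4) into the gradient magnitude
 * (edge strength) and direction (edge orientation, in radians). */
static void gradient(float gx, float gy, float *mag, float *dir)
{
    *mag = sqrtf(gx * gx + gy * gy);
    *dir = atan2f(gy, gx);
}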

Another way of finding edges is the Laplacian, which is an example of a second-order derivative of a 2D function.

C. Morphological Operations

Morphological operations in image processing aim at extracting image components that are useful in the representation and description of region shapes, such as boundaries, skeletons, and the convex hull. There are four main types of morphological operations: dilation, erosion, opening, and closing. Dilation causes objects to dilate or grow in size, whereas erosion causes objects to shrink. The amount and the way that they grow and shrink depend upon the choice of the structuring element, i.e., the convolution kernel. Opening and closing, on the other hand, are combinations of dilation and erosion. In opening, first erosion and then dilation is applied to an image. Opening generally smooths the contours of an image, breaks narrow isthmuses, and eliminates thin protrusions. In closing, erosion and dilation are applied in the opposite order compared to opening. Although it tends to smooth sections of contours, as opposed to opening, it generally fuses narrow breaks and long thin gulfs, eliminates small holes, and fills gaps in the contours.
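For a binary image, dilation and erosion reduce to simple neighborhood tests. The following sketch (ours; it assumes a 0/1 image stored row-major and a full 3×3 structuring element) shows dilation; erosion is obtained by initializing v to 1 and replacing |= with &=:

/* Binary dilation with a 3x3 structuring element: a pixel becomes 1
 * if ANY pixel under the element is 1, so objects grow by one pixel.
 * Border pixels are skipped. */
static void dilate3x3(const unsigned char *src, unsigned char *dst,
                      int width, int height)
{
    for (int y = 1; y < height - 1; y++) {
        for (int x = 1; x < width - 1; x++) {
            unsigned char v = 0;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++)
                    v |= src[(y + dy) * width + (x + dx)];
            dst[y * width + x] = v;
        }
    }
}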

 D. Complexity Analysis of the Convolution Process

2D convolution is computationally intensive. For an $N \times N$ image and an $M \times M$ kernel, the time complexity of the sequential convolution is $O(N^2 M^2)$. The convolution process consists of a series of load-multiply-store operations; thus, it can be regarded as a multiply-and-accumulate (MAC) operation, and micro-controllers with such DSP features usually perform the convolution faster. The pseudo-code for the convolution is given in Algorithm 1.
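To make the cost concrete (our own arithmetic, using the largest test case of Section V): convolving a 640×480 image with a 7×7 kernel takes on the order of

$$640 \times 480 \times 7 \times 7 \approx 1.5 \times 10^{7}$$

multiply-accumulate operations per frame, before any loop overhead is counted.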

Algorithm 1 The pseudo-code of the 2D convolution

CONVOLUTION(*srcImg, width, height, convKernel, kernelSize, *dstImg)
  half = kernelSize/2
  for i = half to height − half − 1 do
    for j = half to width − half − 1 do
      tmp = 0
      for k = −half to half do
        for l = −half to half do
          tmp = tmp + convKernel[k + half][l + half] × srcImg[i + k][j + l]
        end for
      end for
      dstImg[i][j] = tmp
    end for
  end for
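A direct C rendering of Algorithm 1 follows as a minimal sketch (our own; it assumes an 8-bit grayscale image, a float kernel, and an odd kernelSize, and it leaves the border pixels untouched, just like the pseudo-code):

/* 2D convolution following Algorithm 1: for every interior pixel,
 * accumulate the kernel-weighted sum of its neighborhood and store
 * the clamped result. Images are width*height, row-major, 8-bit;
 * convKernel holds kernelSize*kernelSize floats (kernelSize odd). */
void convolution(const unsigned char *srcImg, int width, int height,
                 const float *convKernel, int kernelSize,
                 unsigned char *dstImg)
{
    int half = kernelSize / 2;

    for (int i = half; i < height - half; i++) {
        for (int j = half; j < width - half; j++) {
            float tmp = 0.0f;            /* reset accumulator per pixel */

            for (int k = -half; k <= half; k++)
                for (int l = -half; l <= half; l++)
                    tmp += convKernel[(k + half) * kernelSize + (l + half)]
                         * srcImg[(i + k) * width + (j + l)];

            if (tmp < 0.0f)   tmp = 0.0f;   /* clamp to the 8-bit range */
            if (tmp > 255.0f) tmp = 255.0f;
            dstImg[i * width + j] = (unsigned char)tmp;
        }
    }
}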

IV. ARM9 VS ARM7

The ARM9 core provides several key improvements over the ARM7 core in terms of clock frequency, cycle count, pipeline design, and extra peripherals. In the following subsections we describe these improvements in detail.

 A. Clock Frequency

The increased clock frequency of the ARM9 core comes from its 5-stage pipeline design, compared to the 3-stage pipeline design of the ARM7 core. Increasing the number of pipeline stages increases the amount of parallelism in the design, thus reducing the amount of logic that must be evaluated within a single clock period. With a 5-stage pipeline design, the processing of each instruction is spread across five or more clock cycles, so up to five instructions can be worked on during any one clock cycle. The maximum clock frequency of the ARM9 core is generally in the range of 1.8 to 2.2 times the clock frequency of the ARM7 core [8].

 B. Cycle Count 

Cycle count improvements give increased performance, independent of the clock frequency. The amount of improvement depends on the mix of instructions in the code being executed, which is affected by the nature of the program and, for high-level languages, by the compiler used.


1) Loads and stores: The most significant improvement in the instruction cycle count, moving from the ARM7 core to the ARM9 core, is the performance of load and store instructions. Reducing the number of cycles for loads and stores gives a significant improvement in program execution time, since almost 30% of instructions are loads and/or stores. The reduction of cycles for loads and stores is achieved by two fundamental micro-architectural differences between the designs of the cores.

• The ARM9 core has separate instruction and data memory interfaces, allowing the CPU to simultaneously fetch an instruction and read or write a data item. This is called a modified Harvard architecture. The ARM7 core, in contrast, has a single memory interface.

• The 5-stage pipeline introduces separate Memory and Write Back stages, which are used to access memory for loads or stores and to write results back to the register file.

Table I summarizes the cycles required to execute various load and store instructions. The table shows that all store instructions take one cycle less on the ARM9 core than on the ARM7 core. It also shows that load instructions generally take two cycles less on the ARM9 core, if there are no interlocks.

TABLE I
LOAD AND STORE CYCLE COUNTS

Instruction Type                    ARM7 Core             ARM9 Core
                                    Execute  Interlock    Execute  Interlock
LDR (load one word)                 3        0            1        0 or 1
LDM of n registers (load multiple)  n+2      0            n        0 or 1
LDRH (load one halfword)            3        0            1        0 to 2
LDRB (load one byte)                3        0            1        0 to 2
LDRSB (load one signed byte)        3        0            1        0 to 2
LDRSH (load one signed halfword)    3        0            1        0 to 2
STR (store one word)                2        0            1        0
STM of n registers (store multiple) n+1      0            n        0

2) Interlocks: Pipeline interlocks occur when the data required by an instruction is not available due to the incomplete execution of an earlier instruction. When an interlock occurs, the hardware stalls the execution of the instruction until the data is ready. This provides complete binary compatibility with earlier ARM processor designs, however it increases the execution time of the code sequence by a number of interlock cycles. Compilers and assembly-code programmers can in many cases reduce the number of interlock cycles by re-arranging the order of instructions or by using other techniques.

ARM compilers implement code scheduling optimizations to reduce the number of interlock cycles. It is often possible to find a useful instruction to move between two consecutive loads, but not always. This means that the average number of cycles to execute an LDR is between 1 and 2; the exact number depends on the code being compiled and on the sophistication of the compiler.
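A hypothetical C-level illustration of this scheduling (our example; the actual reordering happens at the instruction level and is normally performed by the compiler): in the first version each loaded value is consumed immediately, exposing a load-use interlock on the ARM9 core, while in the second the independent second load fills the slot between the first load and its use.

/* Load-use back to back: each addition needs the value loaded by the
 * immediately preceding instruction, so the pipeline may stall. */
int sum_pair_naive(const int *p)
{
    int a = p[0];
    int s = a + 1;        /* uses 'a' right after its load */
    int b = p[1];
    return s + b;         /* uses 'b' right after its load */
}

/* Scheduled: the second, independent load is moved between the first
 * load and its first use, hiding the potential interlock cycle. */
int sum_pair_scheduled(const int *p)
{
    int a = p[0];
    int b = p[1];         /* independent work between load and use */
    return a + b + 1;
}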

3) Branches: Although the ARM9 core has a deeper pipeline, the number of cycles for a branch instruction is the same on the ARM9 and ARM7 cores. This is because the pipelines have the same number of stages up to the end of the execute stage, so branches are implemented in the same way on both cores. The cycle counts for branches are given in Table II.

TABLE II
BRANCH CYCLE COUNTS

Condition Code Check   ARM7   ARM9
Pass                   3      3
Fail                   1      1

The ARM9 core does not implement branch prediction, because branches on these CPUs are fairly inexpensive in terms of the lost opportunity to execute other instructions.

C. Pipeline

The ARM7 core implements the 3-stage pipeline design shown in Figure 2. In a single cycle, the Execute stage can read operands from the register bank, pass them through the barrel shifter and the Arithmetic and Logic Unit (ALU), and write the results back to the register bank. Data reads from and writes to the memory system are also performed in the Execute stage; to do this, the instruction stays in the Execute stage of the pipeline for multiple cycles.

Fig. 2. 3-stage pipeline of ARM7 core

The ARM9 core implements the 5-stage pipeline design shown in Figure 3. It is also a Harvard architecture, so data accesses do not have to compete with instruction fetches for the use of one bus. Result forwarding is also implemented, so that results from the ALU and data loaded from memory are fed back immediately to be used by the following instructions. This avoids having to wait for results to be written back to, and then read from, the register bank.

In this pipeline design, dedicated pipeline stages have been added for memory access and for writing results back to the register bank. Also, the register read has been moved back into the Decode stage. These changes allow for higher clock frequencies by reducing the maximum amount of logic that must operate in a single clock cycle.

Fig. 3. 5-stage pipeline of ARM9 core

On the ARM7 core, loads from and stores to the memory system occupy the Execute stage, so the instruction stays in the Execute stage of the pipeline for multiple cycles. On the ARM9 core, with its dedicated Memory stage, LDR uses the Execute stage for only one cycle, allowing other instructions to use the Execute stage in the following cycles, unless there are interlocks. This makes LDR effectively a single cycle instruction.

V. PERFORMANCE EVALUATION

Nowadays, several companies offer micro-controllers with ARM cores, so a huge number of ARM7 and ARM9 based micro-controllers are available off-the-shelf with a large variety of peripheral configurations. Our primary focus is to analyze the suitability of the cores for visual sensor networks in terms of processing power and energy consumption. Thus, for our purposes, the peripheral configuration is not crucial to micro-controller selection, and therefore any micro-controller with an ARM7 or ARM9 core could be used for the task.

For the performance evaluation of the ARM7 core, we used a development board from Olimex [17] built around Atmel's AT91SAM7S256 [18] micro-controller, which contains an ARM7TDMI-S core. The AT91SAM7S256 operates at a maximum speed of 55 MHz and features 128 KB of flash memory and 64 KB of SRAM. For the ARM9 core, we used a single board computer (SBC) from FriendlyARM [19], which has Samsung's ARM920T based S3C2440A micro-controller. This SBC is capable of operating at up to 400 MHz and is equipped with 64 MB of external SDRAM.

We performed the tests on the boards without any operating system and with all unnecessary peripherals of the micro-controllers turned off. By doing so, we were able to measure the pure performance of the cores. To eliminate the possible negative effects of different development tools, we used the Sourcery CodeBench development tool with the gcc-4.5.1 compiler from CodeSourcery for the tests.

On both cores, we applied the convolution process to images of different resolutions. In order to demonstrate the performance differences between the two cores, we conducted our tests with regard to three main criteria: image resolution, kernel size, and code optimization. For image resolution, resolutions of 160×120, 320×240, and 640×480 were selected for low, medium, and high resolution images, respectively. Kernel sizes were chosen as 3×3, 5×5, and 7×7. Both unoptimized and optimized code were generated for each combination of resolution and kernel size. For code optimization, the compiler's -O2 optimization mode was used; its optimizations mainly include loop optimizations, jump threading, common subexpression elimination, and instruction scheduling.

Figure 4, Figure 5, and Figure 6 show the time required for the convolution process on both the ARM7 and ARM9 cores. Considering the relative operating frequencies of the two cores (400 MHz versus 48 MHz), a linear speedup of 8.33 would be the theoretical limit. However, our tests showed that the ARM920T performed 9 to 11 times faster than the ARM7TDMI-S. This extra speedup is to be credited to the ARM920T's architectural design enhancements. Furthermore, we observed that the extra speedup would be as high as 30% with unoptimized code, whereas with optimizations enabled it drops and varies between 7% and 12%. As an example, the optimized convolution code runs 90% faster on the ARM920T with a kernel size of 3×3 and an image resolution of 160×120, if the ARM7TDMI-S were to operate at the same clock frequency as the ARM920T.
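The frequency-normalized comparison in the last sentence can be made explicit (our own formulation of the arithmetic implied above): dividing the measured speedup by the 400 MHz / 48 MHz clock ratio isolates the architectural contribution,

$$s_{\mathrm{arch}} = \frac{t_{\mathrm{ARM7}}/t_{\mathrm{ARM9}}}{f_{\mathrm{ARM9}}/f_{\mathrm{ARM7}}} = \frac{t_{\mathrm{ARM7}}/t_{\mathrm{ARM9}}}{8.33},$$

so a value of $s_{\mathrm{arch}} = 1.9$ corresponds to the quoted 90% advantage at equal clock frequency.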

The ARM920T outperforms the ARM7TDMI-S not only in terms of speedup but also in terms of energy consumption. Both the ARM7TDMI-S and ARM920T cores consume 0.25 mW per MHz. Although at full speed the ARM920T consumes 8.33 times more power than the ARM7TDMI-S, it can complete the same process 9 to 11 times faster. Thus, the ARM920T consumes 7%-32% less energy for the same operation. Table III gives a detailed comparison of the cores regarding their energy consumption.
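Worked out with the stated figures (our arithmetic): at 0.25 mW/MHz, the two test parts dissipate

$$P_{\mathrm{ARM7}} = 0.25 \times 48 = 12\ \mathrm{mW}, \qquad P_{\mathrm{ARM9}} = 0.25 \times 400 = 100\ \mathrm{mW},$$

and since energy is power times run time, the energy ratio is the power ratio divided by the speedup $s$:

$$\frac{E_{\mathrm{ARM9}}}{E_{\mathrm{ARM7}}} = \frac{P_{\mathrm{ARM9}}}{P_{\mathrm{ARM7}}} \cdot \frac{t_{\mathrm{ARM9}}}{t_{\mathrm{ARM7}}} = \frac{8.33}{s},$$

which is below 1 whenever the speedup exceeds the clock ratio, as it does here.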

The primary bottleneck of the ARM7TDMI-S core is that the convolution process itself takes too much time to complete, especially for medium and high resolution images. This bottleneck is the primary inhibiting factor for real-time video processing on ARM7 cores. On the other hand, the results of the ARM920T core encourage us to use it on video sensor nodes.

TABLE III
PERFORMANCE COMPARISON OF ARM7TDMI-S AND ARM920T CORES IN TERMS OF ENERGY CONSUMPTION (mJ)

Resolution     ARM7TDMI-S¹                ARM920T²
               3×3     5×5     7×7        3×3    5×5    7×7
Low            0.76    1.82    3.32       0.4    1.7    3
Medium         3.02    7.46    13.87      1.6    6.8    12.4
High           12.26   30.47   56.84      6.3    27.8   50.9

¹ ARM7TDMI-S operates at 48 MHz and consumes 0.25 mW/MHz.
² ARM920T operates at 400 MHz and consumes 0.25 mW/MHz.

VI. CONCLUSION

Our performance evaluation test results show that most multimedia sensor nodes are not equipped with enough processing power for real-time video processing on medium/high resolution images. One of the most favored micro-controller cores, the ARM7TDMI-S, performed far behind the ARM920T in terms of processing power and energy consumption. The architectural design enhancements of the ARM920T alone speed up the convolution process by up to 30% compared to the ARM7TDMI-S; including its higher clock frequency, the ARM920T performs 9 to 11 times faster than the ARM7TDMI-S. Besides, the ARM920T consumes 10%-50% less energy for the same type of operations. These test results encourage us to utilize ARM9 cores on multimedia sensor nodes for real-time video processing.

REFERENCES

[1] I. F. Akyildiz, T. Melodia, and K. R. Chowdhury, "A Survey on Wireless Multimedia Sensor Networks," Computer Networks (Elsevier), vol. 51, no. 4, pp. 921-960, Mar. 2007.
[2] I. T. Almalkawi, M. Guerrero Zapata, J. N. Al-Karaki, and J. Morillo-Pozo, "Wireless multimedia sensor networks: Current trends and future directions," Sensors, vol. 10, no. 7, pp. 6662-6717, 2010.
[3] J. Molina, J. M. Mora-Merchan, J. Barbancho, and C. Leon, "Multimedia data processing and delivery in wireless sensor networks," InTech, 2010.
[4] L. W. Chew, L.-M. Ang, and K. P. Seng, "Survey of image compression algorithms in wireless sensor networks," in Information Technology, 2008 (ITSim 2008), International Symposium on, vol. 4, Aug. 2008, pp. 1-9.
[5] S. Soro and W. Heinzelman, "A survey of visual sensor networks," Hindawi Advances in Multimedia, vol. 2009, 2009.


(a) Time elapsed during the convolution process for different resolutions with a 3×3 kernel (unoptimized code)
(b) Time elapsed during the convolution process for different resolutions with a 3×3 kernel (optimized code)

Fig. 4. Performance comparison of ARM7TDMI-S and ARM920T in terms of processing power

(a) Time elapsed during the convolution process for different resolutions with a 5×5 kernel (unoptimized code)
(b) Time elapsed during the convolution process for different resolutions with a 5×5 kernel (optimized code)

Fig. 5. Performance comparison of ARM7TDMI-S and ARM920T in terms of processing power

(a) Time elapsed during the convolution process for different resolutions with a 7×7 kernel (unoptimized code)
(b) Time elapsed during the convolution process for different resolutions with a 7×7 kernel (optimized code)

Fig. 6. Performance comparison of ARM7TDMI-S and ARM920T in terms of processing power

[6] R. Zilan, J. M. Barcelo-Ordinas, and B. Tavli, "Image Recognition Traffic Patterns for Wireless Multimedia Sensor Networks," in Wireless Systems and Mobility in Next Generation Internet. Berlin, Heidelberg: Springer-Verlag, 2008, pp. 49-59.
[7] I. Akyildiz, T. Melodia, and K. Chowdhury, "Wireless multimedia sensor networks: Applications and testbeds," Proceedings of the IEEE, vol. 96, no. 10, pp. 1588-1605, Oct. 2008.
[8] "ARM Website," http://www.arm.com/.
[9] M. G. Benjamin and D. Kaeli, "Stream image processing on a dual-core embedded system," in Proceedings of the 7th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS '07), 2007, pp. 149-158.
[10] S. Kyo, S. Okazaki, and T. Arai, "An integrated memory array processor architecture for embedded image recognition systems," in Proceedings of the 32nd Annual International Symposium on Computer Architecture (ISCA '05), 2005, pp. 134-145.
[11] D. McIntire, K. Ho, B. Yip, A. Singh, W. Wu, and W. J. Kaiser, "The low power energy aware processing (LEAP) embedded networked sensor system," in Proceedings of the 5th International Conference on Information Processing in Sensor Networks (IPSN '06), 2006, pp. 449-457.
[12] M. Bramberger, J. Brunner, B. Rinner, and H. Schwabach, "Real-time video analysis on an embedded smart camera for traffic surveillance," in Real-Time and Embedded Technology and Applications Symposium, 2004 (RTAS 2004), 10th IEEE, May 2004, pp. 174-181.
[13] A. Rowe, C. Rosenberg, and I. Nourbakhsh, "A low cost embedded color vision system," in Intelligent Robots and Systems, 2002, IEEE/RSJ International Conference on, vol. 1, 2002, pp. 208-213.
[14] S. McBader and P. Lee, "An FPGA implementation of a flexible, parallel image processing architecture suitable for embedded vision systems," in Proceedings of the 17th International Symposium on Parallel and Distributed Processing (IPDPS '03), 2003.
[15] M. Bramberger, A. Doblander, A. Maier, B. Rinner, and H. Schwabach, "Distributed embedded smart cameras for surveillance applications," Computer, vol. 39, no. 2, pp. 68-75, Feb. 2006.
[16] R. C. Gonzalez and R. E. Woods, Digital Image Processing. Prentice Hall, 2002.
[17] "SAM7-P256 Development Board for AT91SAM7S256 ARM7TDMI-S Micro-controller," http://www.olimex.com/dev/index.html.
[18] "AT91SAM7S256 Datasheet," http://www.atmel.com/.
[19] "FriendlyARM - ARM based Development Boards and Modules," http://www.friendlyarm.net/.
