
Neighbor Cache Prefetching for Multimedia Image and Video Processing

Rita Cucchiara, Member, IEEE, Massimo Piccardi, Member, IEEE, and Andrea Prati, Member, IEEE

Abstract: Cache performance is strongly influenced by the type of locality embodied in programs. In particular, multimedia programs handling images and videos are characterized by a bidimensional spatial locality, which is not adequately exploited by standard caches. In this paper we propose novel cache prefetching techniques for image data, called neighbor prefetching, that improve the exploitation of bidimensional spatial locality. A performance comparison is provided against other established prefetching techniques on a multimedia workload (with MPEG-2 and MPEG-4 decoding, image processing, and visual object segmentation), including a detailed evaluation of both the miss rate and the memory access time. Results show that neighbor prefetching achieves a significant reduction in the time due to delayed memory cycles (more than 97% on MPEG-4, compared with 75% for the second-best technique). This reduction leads to a substantial speedup on the overall memory access time (up to 140% for MPEG-4). Performance has been measured with the PRIMA trace-driven simulator, specifically devised to support cache prefetching.

Keywords: Cache memories, prefetching, multimedia, image processing, neighbor prefetching.

EDICS: 2-EXTN

Rita Cucchiara is with Dipartimento di Ingegneria dell’Informazione, Universita di Modena e Reggio Emilia, Via Vignolese, 905/b - Modena - Italy - phone: +39-059-2056136 - fax: +39-059-2056129 - e-mail: [email protected]

Massimo Piccardi is with Department of Computer Systems, Faculty of IT, University of Technology, Sydney - Broadway NSW 2007 - Australia - phone: +61-2-9514-7942 - fax: +61-2-9514-1807 - e-mail: [email protected]

Andrea Prati is with Dipartimento di Ingegneria dell’Informazione, Universita di Modena e Reggio Emilia, Via Vignolese, 905/b - Modena - Italy - phone: +39-059-2056142 - fax: +39-059-2056129 - e-mail: [email protected]

I. Introduction

MULTIMEDIA workloads are leading the design of modern computers, since high performance in executing multimedia applications is mandatory. To this aim, memory access performance proves particularly critical, since the speed gap between processors and memories still tends to increase.

Cache memories improve memory access performance by allowing the exploitation of temporal and spatial data locality. However, spatial locality can actually be exploited only if the storage organization in the cache mirrors the memory access schemes embodied in programs. This requirement is often not satisfied for multimedia data, such as images or videos. In fact, image programs are characterized by a peculiar access locality, that we call 2D spatial locality, since images are structured as bidimensional arrays and, whenever the CPU accesses a single data item, there is a high probability that it will access logically adjacent data items in both the vertical and horizontal directions.

Most algorithms for image compression, noise filtering, and image analysis process two-dimensional pixel blocks (e.g., MPEG typically uses 8 x 8 or 16 x 16 pixel blocks), thus exhibiting both horizontal and vertical spatial locality. Since standard cache architectures are based on blocks preserving horizontal spatial locality only, a large number of compulsory misses in accessing image data has been reported for multimedia programs [1][2][3]. Enlarging the cache size is not the right solution, since it decreases only conflict or capacity misses, and even increasing the block size improves the exploitation of horizontal locality only. It is also possible to alter the order of memory references in the algorithms in order to reduce the number of misses. However, we aim to describe a general technique that does not require instruction re-ordering.
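To make the asymmetry between horizontal and vertical locality concrete, the following minimal sketch maps pixel addresses of a row-major image to cache block addresses. The block size, the row size, the one-byte pixel, and all function names are assumptions chosen for illustration only; they are not taken from the paper.

```python
# Assumed parameters (hypothetical, for illustration only).
BLOCK_SIZE = 32   # bytes per cache block
NB_ROW = 512      # bytes per image row (512 pixels, 1 byte per pixel)


def block_of(addr: int) -> int:
    """Block address: the memory address without the byte-offset field."""
    return addr // BLOCK_SIZE


def pixel_addr(base: int, row: int, col: int) -> int:
    """Address of pixel (row, col) in a row-major image starting at base."""
    return base + row * NB_ROW + col


centre = pixel_addr(0, 10, 100)
right = pixel_addr(0, 10, 101)   # horizontal neighbor
below = pixel_addr(0, 11, 100)   # vertical neighbor

# The horizontal neighbor falls in the same block as the centre pixel,
# while the vertical neighbor lies NB_ROW / BLOCK_SIZE blocks further on.
same_block_horizontal = block_of(centre) == block_of(right)
same_block_vertical = block_of(centre) == block_of(below)
blocks_apart_vertical = block_of(below) - block_of(centre)
print(same_block_horizontal, same_block_vertical, blocks_apart_vertical)
```

Under these assumptions a standard cache block brings in the horizontal neighbor for free, whereas the vertical neighbor requires a separate block fetch.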

To enhance cache performance, prefetching techniques have been extensively explored. Cache data prefetching consists of inserting data into the cache before they are requested by the current instruction, so as to limit or eliminate the waiting time due to compulsory misses. Prefetching can be implemented mainly in either hardware or software. The main advantage of hardware implementations is that prefetching can exploit program information available only to the hardware, such as the fact that a memory reference results in a hit or a miss. The main drawback, instead, is that hardware implementations are more limited than software implementations in terms of timing and computational resources. However, the recent increase of on-chip resources has made hardware prefetching more appealing, and therefore in this paper we concentrate on hardware techniques [4][5][6]. A detailed analysis of hardware implementations of cache prefetching is found in the paper by Tse and Smith [4]. For hardware techniques, two main directions have been explored: static and adaptive (or stride-based). Static techniques prefetch one or more blocks based on a static assumption of probability; adaptive techniques analyze the history of strides (the stride is the difference in address between two consecutive accesses made by the same instruction) to predict the most useful blocks to be prefetched. The former approach achieves high performance on vector data, and this results in a good performance improvement also on image data, since at least the horizontal locality is used. The latter is good for programs working on data with regular memory access schemes. For this reason, adaptive techniques have been proposed for multimedia programs, too [2][7][8][9].

In fact, many multimedia image and video processing programs are regular in memory access: think for instance of basic convolution-based filtering, where all the image points are raster scanned and processed by a coefficient mask. Nevertheless, some multimedia programs are not regular, since the computation is data-dependent. Typical examples are contour-tracing algorithms, region-growing segmenters, or labeling operators, used to extract visual objects from images (visual objects are supported by MPEG-4 and MPEG-7). In these cases, even stride-based techniques are not adequate: as we will show in this paper, they can be outperformed by some specific techniques optimized for prefetching image data.

For these reasons, this work aims to analyze the performance of prefetching techniques on a significant multimedia benchmark with different access schemes to image and video data. The goal is to prove that a significant speedup can be achieved by using prefetching techniques, and especially by some innovative and image-oriented prefetching methods that we present in this paper. In particular, we propose two new static techniques, called neighbor and 8-step neighbor prefetching, that outperform other prefetching techniques both in terms of miss rate and memory access time improvement. In [10], we introduced neighbor prefetching and compared it with other well-assessed prefetching techniques in terms of eliminated cache misses only. In this paper we extend the performance evaluation to temporal analysis, since the most relevant performance figure in this context is the achievable reduction in terms of memory access time.

The paper is structured as follows: the next section presents related works on cache prefetching, including the main proposals for image and multimedia workloads. Section 3 describes the prefetching techniques compared in this paper, including the novel neighbor prefetching techniques. Section 4 outlines the multimedia benchmark used for performance evaluation. Section 5 describes the temporal model used for timing evaluation and the simulation environment. Section 6 presents the performance results in terms of cache misses and improvement in memory access time on the benchmark’s programs; the impact of line size, cache size and miss penalty is shown, too. In the conclusions, the relevant aspects of our work are briefly sketched and the main results achieved are commented on.

II. Related works

In this section, we aim to present a brief overview of the most relevant works addressing cache prefetching. Most papers in the literature address general-purpose workloads, but cache prefetching techniques have been evaluated also for some multimedia programs; for instance, results are proposed by Zucker et al. in [7][9] for MPEG-2 decoder and encoder programs and by Pimentel et al. in [8] for the median algorithm, used to produce non-interlaced frames from videos with interlaced frames. In this work we extend the analysis to a larger and more complete (in the sense that it covers different image data access schemes) multimedia benchmark oriented to image and video processing.

In cache prefetching, the two main directions of static and adaptive techniques have been investigated. The basic static prefetching method is known as One-Block-Lookahead (OBL) [11] (also called always prefetch in a more recent work [4]). With OBL, every time a reference to block i is made, a lookup in the cache is issued for block i + 1; if the block is absent, it is prefetched. Many other OBL-based solutions have been proposed; amongst them we cite those based on stream buffers [7][12][13], which prefetch one or more blocks following block i + 1 into another level of the memory hierarchy, i.e. the stream buffer. The implicit assumption of all static techniques is that, given the spatial locality, block i + 1 has a high probability of being referenced in the near future, and that scheduling its prefetching on the first reference to block i grants adequate timeliness. These assumptions are correct, and confirmed by many simulations, especially when arrays are accessed. In this paper we show that they are justifiable for image data too, when pixels of an image or a video frame are evaluated and processed one by one in raster-scan mode (i.e. row-by-row). Nevertheless, whenever programs exhibit 2D spatial locality, we will show that prefetching performance can be improved with respect to that of OBL, since OBL does not exploit locality in the vertical direction.

The other main direction for prefetching is that of adaptive techniques. Adaptive techniques introduce some form of adaptivity in order to achieve a better prediction than that represented by block i + 1. The idea is to infer block reference probability from the recent reference history, and it is promising for programs with regular memory access schemes. A basic algorithm computes the stride, i.e. the address difference between the last two memory references made by the same instruction, and adds it to the address of the last memory reference to obtain the block prediction. Fu and Patel in [14] proposed the use of an associative memory, called the Stride Prediction Table (SPT), to store stride information. Chen and Baer in [6] proposed the use of a Reference Prediction Table (RPT), similar to the SPT, but with an added state machine to assert whether the prediction can be trusted (correct) or not (incorrect). Selecting strides useful for prefetching is often referred to as stride filtering. Other proposals combine static and adaptive techniques in order to benefit from both approaches: in [15], a stride predictor is combined with a stream buffer and a voting scheme is used to decide whether the stream buffer or the stride predictor must be used. Finally, other authors propose more complicated forms of adaptivity (as in [16]), but we expect them to be too demanding in hardware resources to be feasible in hardware, and therefore did not use them for comparison in this paper.

In this work we consider image data only, i.e. the regular array-based structure used for referencing pixels in an image or in a video frame. Hence, we evaluate the performance of a possible data cache for image data only. Several studies refer to the performance improvement that can be achieved by further splitting the data cache for scalar and vector data types [17] or for temporal locality and spatial locality [18][19]. In [9], a stream cache separated from the main cache is used to store data accessed with regular strides. Just as many authors suggest the use of a separate cache for vector data, we suggest the adoption of a special cache for image data since this choice, together with a specific cache replacement strategy, reduces cache misses in the very frequent accesses to image data [20]. Although the addition of a further dedicated cache in modern general-purpose processors is not very likely, this may not be the case for multimedia processors, which could significantly benefit from the results reported in this paper.

The primary metrics typically used in the literature for cache performance evaluation are the miss rate and the efficiency expressed as the fraction of eliminated misses. These metrics [7][9][10][6] are useful for an initial evaluation of prefetching techniques because they focus on cache performance independently of system parameters. Nevertheless, as clearly stated in [4] and [5], performance analysis must address the execution time. Tse and Smith in [4] proposed a system-level analysis of the impact of prefetching on system performance by way of an accurate cycle-by-cycle execution simulation. The main metric used is the MCPI, defined as the total memory access penalty due to delayed memory cycles divided by the total number of instructions executed. This figure includes all aspects which can be influenced by prefetching, while at the same time excluding all those which cannot. For the purpose of comparison, the metric used in [4] is actually the relative MCPI, expressed as the ratio between the MCPI with and without cache prefetching. Hennessy and Patterson in [5] use instead the average memory access time (AMAT), which includes also the time due to non-delayed memory cycles, and is divided by the number of memory references. The AMAT expresses the overall impact of memory access. In this work, we report results with metrics equivalent to the relative MCPI and AMAT for a wide range of system parameters such as the prefetching technique, cache size, miss penalty, and others.
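As a compact restatement of the verbal definitions above, the metrics can be written as follows. The symbols are our own shorthand and do not appear in the cited works: $T_{\mathrm{del}}$ is the total time due to delayed memory cycles, $T_{\mathrm{nondel}}$ the time due to non-delayed memory cycles, $N_{\mathrm{instr}}$ the number of instructions executed, and $N_{\mathrm{ref}}$ the number of memory references.

```latex
\begin{align}
\mathrm{MCPI} &= \frac{T_{\mathrm{del}}}{N_{\mathrm{instr}}}, \\
\mathrm{MCPI}_{\mathrm{rel}} &= \frac{\mathrm{MCPI}_{\text{with prefetching}}}{\mathrm{MCPI}_{\text{without prefetching}}}, \\
\mathrm{AMAT} &= \frac{T_{\mathrm{del}} + T_{\mathrm{nondel}}}{N_{\mathrm{ref}}}.
\end{align}
```

Read this way, the relative MCPI isolates the component that prefetching can change, while the AMAT also carries the fixed cost of non-delayed cycles.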

III. Prefetching techniques

As previously stated, we divide prefetching techniques into the two main categories of static and adaptive prefetching. For the purpose of comparison, in this paper we use OBL as the reference for the static techniques [4][6], and the stride-based prediction technique [9] as the reference for the adaptive class (called SPT for short in the following). In [10], we also explored more sophisticated adaptive techniques based on some form of stride filtering (2-delta filters, [21]), but the performance achieved did not prove significantly different from that of the basic scheme, and those results are therefore not reported in this paper.

TABLE I
Comparison of prefetching techniques

| Technique | Possible prefetching address (A0 is the current block address) | Nr. of prefetches attempted | Nr. of prefetches issued |
|---|---|---|---|
| No-prefetch | - | 0 | 0 |
| OBL | B(A0)+1 | 1 | 0-1 |
| adaptive | B(A0 + S) | 1 | 0-1 |
| 1st-ref neighbor | B(A0)+1, B(A0)-1, B(A0+NBrow), B(A0-NBrow), B(A0+NBrow)+1, B(A0+NBrow)-1, B(A0-NBrow)+1, B(A0-NBrow)-1 | 8 | 0-8 |
| 8-step neighbor | B(A0)+1, B(A0)-1, B(A0+NBrow), B(A0-NBrow), B(A0+NBrow)+1, B(A0+NBrow)-1, B(A0-NBrow)+1, B(A0-NBrow)-1 | 0-8 | 0-1 |

In OBL prefetching, every time a memory reference is made, a lookup in the cache is performed for the block that follows the currently referenced one in memory. If we call A0 the current address and B(A0) the current block address, i.e. the memory address without the byte offset field, the block address of the looked-up block will be B(A0) + 1 (see Table I). If this block is absent from the cache, its prefetch is issued. Thanks to the lookup, the actual prefetching of data is performed only if it is not already in the cache, thus avoiding useless bus occupation.
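The OBL rule can be sketched with a toy simulation. This is an illustration under strong assumptions that are ours, not the paper's: the cache is unbounded (no replacement), a prefetched block is available immediately, and the block size is 32 bytes. The function and variable names are hypothetical.

```python
BLOCK_SIZE = 32  # assumed bytes per cache block


def block_of(addr):
    """B(A0): the memory address without the byte-offset field."""
    return addr // BLOCK_SIZE


def obl_simulate(addresses, use_obl=True):
    """Return (demand_misses, prefetches_issued) over an address trace.

    Simplifications: unbounded cache, prefetches complete instantly.
    """
    cache = set()
    misses = prefetches = 0
    for a0 in addresses:
        blk = block_of(a0)
        if blk not in cache:          # demand miss
            misses += 1
            cache.add(blk)
        if use_obl:
            nxt = blk + 1             # looked-up block B(A0) + 1
            if nxt not in cache:      # the lookup avoids useless bus traffic
                cache.add(nxt)
                prefetches += 1
    return misses, prefetches


trace = list(range(256))              # sequential byte accesses over 8 blocks
without_prefetch = obl_simulate(trace, use_obl=False)
with_obl = obl_simulate(trace, use_obl=True)
print(without_prefetch, with_obl)
```

On a purely sequential trace every block after the first is already present when it is first referenced, which is the best case for OBL; a column-wise walk through an image would not benefit in the same way.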

In adaptive prefetching, every time a memory reference is made, a lookup in the SPT is performed. If there is a hit, a lookup in the cache is performed at the address B(A0+S), where S is the computed stride. If this block is absent from the cache, its prefetch is issued.
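A minimal sketch of this stride-based scheme follows. It models the SPT as an unbounded dictionary indexed by instruction address, which is a simplification of the finite associative memory described in the literature; the class name, the instruction address 0x400, and the block and row sizes are all assumptions made for illustration.

```python
BLOCK_SIZE = 32  # assumed bytes per cache block


def block_of(addr):
    return addr // BLOCK_SIZE


class StridePredictionTable:
    """Toy SPT: remembers the last data address used by each instruction."""

    def __init__(self):
        self.last_addr = {}

    def predict(self, pc, a0):
        """Return the predicted address A0 + S, or None on an SPT miss."""
        prev = self.last_addr.get(pc)
        self.last_addr[pc] = a0
        if prev is None:
            return None
        stride = a0 - prev            # S: difference between the last two references
        return a0 + stride


spt = StridePredictionTable()
cache = set()
issued = []
# One instruction walking down an image column with 512-byte rows.
for a0 in (0, 512, 1024, 1536):
    cache.add(block_of(a0))           # demand access (miss handling not modeled)
    predicted = spt.predict(pc=0x400, a0=a0)
    if predicted is not None and block_of(predicted) not in cache:
        cache.add(block_of(predicted))
        issued.append(block_of(predicted))
print(issued)
```

With a constant stride the predictor locks on after the second reference; on data-dependent access patterns the stride changes from one reference to the next and the predicted block is often not the one needed.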

In this paper, we propose to explicitly exploit the 2D spatial locality in images by a new approach called neighbor prefetching. To this aim, we give definitions of three algorithms, called basic, first-reference, and 8-step neighbor prefetching, respectively.

Def. 1: The basic neighbor prefetching, at each memory reference, attempts to prefetch all blocks containing data in the nearest neighborhood of the one currently referenced; we assume a neighborhood of 3 x 3 blocks, called 8-connected blocks; for each block belonging to the 8-connected set that is not in the cache already, a prefetch is issued.

For the sake of clarity, see the sketch of Fig. 1. Let us assume image pixels are stored row-by-row, and divided into blocks of the same size as cache blocks; the 8-connected blocks contain pixels adjacent in both the vertical and horizontal directions.

In basic neighbor prefetching, every prefetching stage consists of eight lookups and M issued prefetches, with M ∈ [0,8] (M is equal to 0 if all the 8-connected blocks are already present in the cache). The block addresses of the looked-up blocks are shown in Table I, with reference to the A0 address and where NBrow is the number of bytes in an image row. Neighbor prefetching mirrors the way data are accessed by many image processing algorithms, including both raster-scan and data-dependent ones, and thus promises a general improvement in 2D spatial locality exploitation. At the same time, potential drawbacks of this approach are evident: i) it carries a substantial increase of the lookup pressure, and ii) it potentially issues a higher number of prefetches. The first effect can be limited by an adequate cache architecture. We assume a highly efficient lookup mechanism, with a multiple-port tag directory and pipelining with prefetch issuing, so as to consider the lookup time negligible. We also introduce two modifications to the basic neighbor prefetching, reducing both the lookup pressure and the number of issued prefetches.
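The eight candidate block addresses of Table I can be computed as in the sketch below. The function names, the 32-byte block, and the unbounded cache modeled as a set are our assumptions; the sketch also assumes NBrow is a multiple of the block size and larger than it, so that the eight candidates are distinct, and it simply skips candidates that fall before address zero.

```python
BLOCK_SIZE = 32  # assumed bytes per cache block


def block_of(addr):
    return addr // BLOCK_SIZE


def neighbor_blocks(a0, nb_row):
    """Block addresses of the 8-connected blocks, in the order of Table I."""
    here = block_of(a0)
    down = block_of(a0 + nb_row)
    up = block_of(a0 - nb_row)
    return [here + 1, here - 1, down, up, down + 1, down - 1, up + 1, up - 1]


def basic_neighbor_prefetch(a0, nb_row, cache):
    """One prefetching stage: eight lookups, M prefetches issued (0 <= M <= 8)."""
    issued = [b for b in neighbor_blocks(a0, nb_row)
              if b >= 0 and b not in cache]
    cache.update(issued)              # prefetches assumed to complete instantly
    return issued


a0 = 10 * 512 + 100                   # pixel (row 10, column 100), 512-byte rows
cache = {block_of(a0)}
first_stage = basic_neighbor_prefetch(a0, 512, cache)
second_stage = basic_neighbor_prefetch(a0, 512, cache)
print(len(first_stage), len(second_stage))
```

The second stage issues nothing because all neighbors are already cached, but it still pays for eight lookups, which is the lookup pressure discussed above.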

Fig. 1. 8-connected blocks of an image pixel

We consider the sequences of memory references all made to the same block. A sequence starts with the first reference to a block and ends with the first reference to a different block (not included). Therefore, a generic sequence can range from one reference only up to an unlimited number of references. However, in the tested applications sequences are typically only a few references long. The idea is to reduce the amount of prefetching by scheduling the prefetching activity not on all the references but only on the first reference of each sequence. The rationale of first-reference neighbor prefetching is that if a block is prefetched at this stage, it will not be replaced in the cache during the sequence, making it unnecessary to check for its presence during the next references to the block in the same sequence.

Def. 2: In first-reference neighbor prefetching, the prefetching activities are scheduled at the first reference only of each sequence of references.

First-reference neighbor prefetching minimizes lookup costs, accounting for them just once for each block referenced. Nevertheless, drawback ii) remains on the first reference, when a high number M of prefetches could be issued all together, with a potentially long completion time. In order to improve this aspect we define a modification for distributing the M prefetches over time along the sequence of references to the same memory block.
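The saving in lookup pressure can be illustrated by counting how many references trigger a prefetching stage (eight lookups each) under the basic and the first-reference schedules. This is only a counting sketch on a synthetic sequential trace; the function name and the 32-byte block are assumptions.

```python
BLOCK_SIZE = 32  # assumed bytes per cache block


def block_of(addr):
    return addr // BLOCK_SIZE


def count_prefetch_stages(addresses, first_reference_only):
    """Count the references that trigger the eight neighbor lookups."""
    stages = 0
    current_block = None
    for a0 in addresses:
        blk = block_of(a0)
        is_first = blk != current_block   # a new sequence of references starts
        current_block = blk
        if is_first or not first_reference_only:
            stages += 1
    return stages


trace = list(range(128))   # 4 blocks, 32 consecutive references to each
basic_stages = count_prefetch_stages(trace, first_reference_only=False)
first_ref_stages = count_prefetch_stages(trace, first_reference_only=True)
print(8 * basic_stages, 8 * first_ref_stages)   # total lookups under each schedule
```

The synthetic trace exaggerates the sequence length compared with the few references per sequence reported for the tested applications, so the ratio shown here should be read as an upper-end illustration only.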

Def. 3: In the 8-step neighbor prefetching, at each memory reference of a sequence of references to the same block, at most one prefetch is actually issued.

We can describe this method more precisely with a simple algorithm. Let us call block i one of the 8-connected blocks of the currently referenced one, with i = 1...8 as in Fig. 1. At each memory reference to block A0 the following algorithm is executed:

    {
      if first reference (A0 block) = TRUE then i ← 1;
      else i ← lastdirection + 1;
      hit ← TRUE;
      while (i ≤ 8 AND hit = TRUE) do
      {
        hit ← lookup (block i);
        if (hit = TRUE) then i ← i + 1;
        else prefetch (i);
      }
      lastdirection ← i;
    }

Every time the program changes the referenced block, direction #1 is looked up first. If the data are already present in the cache, the following block within the neighborhood (in the clockwise direction) is checked. This process continues until either a block is found absent from the cache (hit = FALSE), in which case its prefetch is issued, or the whole neighborhood has been checked. The next reference to the same block resumes the process from the direction following the last one checked. The idea is to distribute the prefetching activity more effectively over time.

Block #1 is used for the first lookup (as in OBL) since it is the one with the highest static probability of access (experimental evidence is given in the next section). The clockwise direction is a heuristic rule tuned for raster-scan processing, since data belonging to block #2 should be absent from the cache and thus its prefetching ought to be scheduled before the others. Therefore, for each reference of a sequence of references to the same block, the 8-step neighbor prefetching performs a number N of lookups (with N ranging between 0 and 8) until a miss occurs; the total number of lookups over a sequence of references to the same block cannot exceed 8, as in first-reference neighbor prefetching. Moreover, the 8-step algorithm performs at most one prefetch for each reference, thus avoiding long prefetching latencies.
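The 8-step algorithm can be transcribed into a small runnable sketch. Since Fig. 1 is not reproduced here, the numbering of the directions is an assumption: direction 1 is taken to be the block to the right (as in OBL) and the remaining ones follow clockwise (down-right, down, down-left, left, up-left, up, up-right). The class name, the unbounded cache, the instantly completed prefetches, and the 32-byte block are likewise our assumptions.

```python
BLOCK_SIZE = 32  # assumed bytes per cache block


def block_of(addr):
    return addr // BLOCK_SIZE


class EightStepPrefetcher:
    """Toy model of 8-step neighbor prefetching (unbounded cache)."""

    def __init__(self, nb_row):
        self.nb_row = nb_row
        self.cache = set()
        self.current_block = None
        self.last_direction = 0

    def directions(self, a0):
        """Neighbor blocks for directions 1..8 (assumed clockwise numbering)."""
        here = block_of(a0)
        down = block_of(a0 + self.nb_row)
        up = block_of(a0 - self.nb_row)
        return [here + 1, down + 1, down, down - 1,
                here - 1, up - 1, up, up + 1]

    def reference(self, a0):
        """Handle one reference; return (lookups, prefetched block or None)."""
        blk = block_of(a0)
        self.cache.add(blk)               # demand access, miss cost not modeled
        if blk != self.current_block:     # first reference of a new sequence
            self.current_block = blk
            i = 1
        else:
            i = self.last_direction + 1
        neighbors = self.directions(a0)
        lookups, prefetched, hit = 0, None, True
        while i <= 8 and hit:
            lookups += 1
            hit = neighbors[i - 1] in self.cache
            if hit:
                i += 1
            else:                         # at most one prefetch per reference
                prefetched = neighbors[i - 1]
                self.cache.add(prefetched)
        self.last_direction = i
        return lookups, prefetched


prefetcher = EightStepPrefetcher(nb_row=512)
results = [prefetcher.reference(10 * 512 + 100) for _ in range(9)]
print(results)
```

Starting from an empty cache, nine references to the same block issue one prefetch each for the first eight references and none afterwards, so the eight prefetches of the first-reference scheme are spread over the sequence.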

In neighbor prefetching, in order to enable the prefetching mechanism, the memory address A0 must be associated with the correct image row size in bytes, NBrow, by way of a mapping function. For implementation of the mapping function, we refer to modern programming languages such as C, C++ and Java and we assume that image data are declared as either static or dynamic 2D arrays (or objects derived from 2D arrays). For static variables, the NBrow information can be extracted by the compiler from variable declarations and stored in a special symbol table; the mapping function can then be initialised at run time from the special symbol table. For dynamic variables, the NBrow information can be stored in the mapping function at allocation time.

In either case, the extracted information must then be made available to the prefetching unit. In our implementation, the prefetching unit is enabled to query the mapping function to retrieve NBrow from A0; in order to improve the efficiency of this mechanism, the most recent A0-NBrow associations are assumed to be cached in the prefetching unit.
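One possible software model of such a mapping function is an address-range table that is filled when images are declared or allocated and queried with A0. The paper does not specify the data structure, so the class below, its method names, and the example addresses and image sizes are all hypothetical.

```python
import bisect


class RowSizeMap:
    """Hypothetical mapping function from an address A0 to the row size NBrow."""

    def __init__(self):
        self._starts = []    # sorted start addresses of the registered images
        self._entries = []   # (start, end, nb_row), parallel to _starts

    def register(self, start, n_rows, nb_row):
        """Record an image, e.g. at allocation time or from the symbol table."""
        idx = bisect.bisect_left(self._starts, start)
        self._starts.insert(idx, start)
        self._entries.insert(idx, (start, start + n_rows * nb_row, nb_row))

    def lookup(self, a0):
        """Return NBrow for the image containing a0, or None for other data."""
        idx = bisect.bisect_right(self._starts, a0) - 1
        if idx < 0:
            return None
        start, end, nb_row = self._entries[idx]
        return nb_row if start <= a0 < end else None


row_map = RowSizeMap()
row_map.register(start=0x10000, n_rows=512, nb_row=512)   # assumed 512x512 image
row_map.register(start=0x80000, n_rows=288, nb_row=352)   # assumed 352x288 frame
print(row_map.lookup(0x10000 + 5220), row_map.lookup(0x80000), row_map.lookup(0))
```

A hardware prefetching unit would more plausibly hold a handful of recent associations in registers, as the text suggests; the sorted table here only shows the logical lookup.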

IV. Multimedia benchmark

Multimedia programs exhibit a different locality in data access depending on the data type: numerical data and text are accessed with standard spatial/temporal locality schemes, while audio is often processed with a stream access showing a strong 1D spatial locality. Instead, in image and video processing a wide variety of memory access schemes is used, often depending both on programs and on the data themselves. Unfortunately, no generally accepted benchmark exists for multimedia image and video processing yet. Therefore, we selected a kernel of basic programs that are commonly adopted in image multimedia tools, as in other works on multimedia performance evaluation [7][22]: the set includes image and video decompression programs and common programs for image manipulation, characterized by different data access schemes.

A. The benchmark’s programs

Table II summarizes the benchmark’s programs used in this work.

Image convolution (Convo for short) is the basic algorithm for image processing; it consists of processing each image pixel by convolving the pixels of its bidimensional neighborhood with a coefficient mask. Examples of programs based on convolution are filtering for noise cleaning, image enhancement, edge detection, and template matching. The image is processed in raster-scan mode, and neither the algorithm execution nor the data access is data-dependent; a substantial amount of strictly 2D locality is embedded in processing each pixel and its neighborhood. In many papers addressing performance evaluation in image processing [2][7][23], convolution is included in the basic benchmark.

Thresholding (Thresh) represents pure raster-scan data access, since pixels are loaded and processed one by one, row by row. In particular, it consists of a simple comparison between the pixel value and an assigned threshold.

Chain code (Chain) is included as a typical data-dependent image processing program characterized by an unpredictable 2D spatial locality. It is a propagative algorithm, since the computational flow is propagated along the image, in an a-priori unknown direction depending on the data themselves [24]. The program we used is a standard edge tracing algorithm that scans objects’ contours and can be used for encoding objects’ geometric properties [24][25][26], including shape encoding for MPEG-4 [27][28]. In edge tracing, the direction of the next pixel to be used depends on the image and its edges, and therefore the address of the next memory reference is data-dependent.

The benchmark also includes standard video decoders. MPEG-2 decode is the typical benchmark for multimedia processors, since it is currently the most widespread multimedia program, both for video streams and files [1][6][27][28][29][30]. MPEG frames are of three different types: I (only spatial compression in JPEG style), P (forward-predicted), and B (forward-backward predicted). Typically MPEG sequences are periodic repetitions of the same type pattern, for instance the standard IPBBPBBPBBPBB (used by the MPEG videos in the tests); when the pattern is IIIIIII, the decompression (called JPEG in Table II) consists only of JPEG decoding of a sequence of frames (sometimes also called MJPEG or M-JPEG type). MPEG-2 performs variable-length decoding and inverse quantization of Discrete Cosine Transform (DCT) coefficients, Inverse DCT (IDCT) computation, and motion compensation, mostly operating on 8x8 and 16x16 pixel blocks. For the tests we have used different MPEG data, selected from standard movies with different type pattern, image size and number of frames (files and address traces are available at http://enki.ing.unimo.it/ImageLab/research.html). In particular, the first two (Frisco and Pirates) are MJPEG and are described in Table III, while the others (Waterski and FlowerG) are MPEG movies with sequence pattern IPBBPBBPBBPBB and are described in Table III.

Finally, MPEG-4 decode (MPEG4) has been included as an example of the recent audio and video compression standard, ISO/IEC 14496 [27]. The MPEG-4 standard defines the basic coding structure for managing media objects, among them Visual Objects (VOs), which are meaningful individual parts of images that are coded separately. VO decoding requires more than one image plane to be decoded for each VO, one for the shape (or alpha plane) and one for texture information. In this work we use the MPEG-4 decoder from CSELT, developed according to standard specifications [27]. Tests are reported for a video example, whose features are summarized in Table III; note that the trace from only 4 frames is so large that it contains more than 23.4 million memory references and, for this reason, we limit our study to a few frames. Nevertheless, the behaviour of the program and the data access pattern do not change significantly when considering more frames.

TABLE II
The benchmark’s programs

| Name | Algorithm | Application | Data access |
|---|---|---|---|
| Thresh | Pixel thresholding: for each pixel of the image, executes an action (i.e., a threshold) in function of the pixel value | binarization, color change, luminance enhancement, motion detection... | Raster-scan, data-independent |
| Convo | 5x5 convolution: Executes an image convolution in a window of 5x5 pixels | filtering, noise cleaning, image enhancement, edge detection... | Raster-scan, local in a pixel neighborhood, data-independent |
| Chain | Chain code: starting from an initial pixel (e.g. an edge pixel), follows the adjacent pixels exhibiting the same property | Edge coding, information extraction from image; encoding of visual objects in MPEG-4; contour tracing... | Local in a pixel neighborhood, data-dependent |
| JPEG | JPEG decode: processing of 8x8 or 16x16 pixel blocks; only spatial compression is used | Decompressing JPEG still images or MJPEG videos (MPEG with I-frames only) | Partially raster-scan, partially block-based processing; data-dependent inside the pixel blocks |
| MPEG2 | MPEG-2 decode: processing of 8x8 (16x16) pixel blocks, both spatial and temporal compressions are used | Decompressing MPEG-2 videos, for video visualization or frame processing | |
| MPEG4 | MPEG-4 decode: each visual object can be treated differently | Decompressing MPEG-4 videos, for visualization, animation of visual objects, video processing | |

TABLE III
JPEG, MPEG2 and MPEG4 workload

| | FRISCO | PIRATES |
|---|---|---|
| Nr. of frames | 51 | 280 |
| Frame Size | 160x128 | 160x128 |
| Type | MJPEG | MJPEG |

| | WATERSKI | FLOWERG | DANCE |
|---|---|---|---|
| Nr. of frames | 80 | 61 | 4 |
| Frame Size | 336x208 | 352x288 | 325x288 |
| Type | MPEG2 | MPEG2 | MPEG2 |

The source code of the selected programs is as follows:
• as JPEG/MPEG2 decoder, the free codec developed by the MPEG Software Simulation Group has been used. The complete reference is: MPEG-2 Encoder / Decoder, Version 1.2, July 19, 1996, Web site http://www.mpeg.org/MSSG/;
• as MPEG4 decoder, the Version 1.0 (August 1999) of the program developed for the MPEG-4 Video (ISO/IEC 14496-2) standard within the framework of the European ACTS project MoMuSys was selected. Further information on the project can be found at http://www.tnt.uni-hannover.de/project/eu/momusys/ and the software can be downloaded for example from http://www.ganesh.org/~pmeerw/.
Instead, the Thresh, Convo and Chain programs are self-written in standard C language.

B. The impact of the workload on cache architecture

In this work, we analyze the impact of cache architecture on accessing image and video data types (that we call 2D data) and we propose new cache prefetching techniques for a dedicated 2D cache, containing only 2D data. Instead, no improvement is needed for the conventional cache containing other data types (that we call for short, although improperly, scalar data) such as text, audio, numerical data, and so on. This is due to the fact that the impact of memory accesses to image data is significantly more critical than that to other data in multimedia image processing applications. Table IV reports a comparison of the number of memory accesses to 2D and scalar data (NREF 2D and NREF Scalar columns respectively) and the number of misses and miss rate without any prefetching approach for a 2D cache and a conventional cache of 32 KB each, both 2-way set-associative with 32 bytes/line. Table IV refers to the following algorithms: Convo and Thresh on a 512x512 image, Chain on a 352x240 frame of the FlowerG Video reported in Table III, JPEG and MPEG2 programs for videos of Table III and MPEG4 decode.

Table IV shows that the number of scalar references is normally higher than that of 2D references (apart from Convo) but, conversely, the scalar miss rate is nearly zero in all cases. This means that standard locality is already well captured by a conventional cache and that no further effort to reduce the scalar miss rate is justified (especially if interference with 2D references is avoided, as we ensure by using a separate cache for image data). Instead, the 2D miss rate is two orders of magnitude greater than the scalar one on average. This indicates that the locality of 2D references is not captured by conventional caches as accurately as that of other data. Therefore, it could be convenient to explore different prefetching techniques based on a separate cache, in order to achieve better performance while ensuring less interference with data exhibiting different locality.

Raster-scan algorithms, such as Convo and Thresh, allow for static prediction of cache misses, which are mainly of the compulsory type (if the convolution mask radius is limited). Thus, cache performance in terms of miss rate could be strongly improved with data prefetching techniques. Moreover, convolution programs suitably match standard cache architecture in terms of temporal locality, which arises from the large re-use of pixels in a neighborhood (in fact, Table IV shows that Convo has the lowest 2D miss rate).
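As a quick worked check of the compulsory-miss argument, the number of distinct cache lines covered by the 512x512 test image can be compared with the Thresh row of Table IV. The calculation assumes one byte per pixel, which is our assumption and is not stated in the text; the reference and miss counts are copied from Table IV.

```python
IMAGE_ROWS, IMAGE_COLS = 512, 512
BYTES_PER_PIXEL = 1          # assumption: 8-bit grey-level pixels
LINE_SIZE = 32               # bytes per cache line, as in the Table IV setup

# Each line of the image must be fetched at least once: compulsory misses.
image_bytes = IMAGE_ROWS * IMAGE_COLS * BYTES_PER_PIXEL
compulsory_misses = image_bytes // LINE_SIZE

# Thresh row of Table IV (2D data, no prefetching).
thresh_refs_2d = 524_288
thresh_misses_2d = 8_192
miss_rate_percent = 100 * thresh_misses_2d / thresh_refs_2d

print(compulsory_misses, thresh_misses_2d, miss_rate_percent)
```

Under the one-byte-per-pixel assumption the number of lines in the image equals the reported 2D misses for Thresh, which is consistent with the misses of this raster-scan program being compulsory; it does not by itself prove it.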

Conversely, Chain has been included as an example of image computation that potentially does not benefit from standard cache techniques, since it exhibits unusual and non-predictable spatial locality. From a data access point of view, propagative algorithms show the same bidimensional spatial locality as raster-scan ones but, at the same time, their data accesses cannot be predicted. Moreover, the temporal locality is difficult to exploit, because points involved in the neighbor computation may be used in the future, but perhaps only after a long computation, so that capacity misses are likely to occur. For instance, Fig. 2 shows that when the P pixel is referenced, the other pixels loaded in cache (belonging to the same blocks) could be unused or, as in this case, used after a long processing time; therefore spatial locality is not exploited and the data loaded in cache do not have adequate timeliness.

| Program | NREF 2D | NREF Scalar | Nr. Miss 2D | Nr. Miss Scalar | Miss Rate 2D % | Miss Rate Scalar % |
|---|---|---|---|---|---|---|
| Thresh | 524,288 | 1,874,661 | 8,192 | 1,622 | 1.5625% | 0.0865% |
| Convo | 19,504,949 | 6,538,099 | 16,370 | 1,518 | 0.0839% | 0.0232% |
| Chain | 484,491 | 15,023,718 | 6,987 | 2,012 | 1.4421% | 0.0134% |
| JPEG Frisco | 1,566,730 | 18,649,423 | 44,288 | 14,638 | 2.8268% | 0.0785% |
| JPEG Pirates | 8,601,610 | 117,003,333 | 242,259 | 69,609 | 2.8164% | 0.0595% |
| MPEG2 FlowerG | 47,474,320 | 85,751,703 | 631,784 | 62,809 | 1.3308% | 0.0732% |
| MPEG2 Waterski | 47,702,928 | 98,544,033 | 684,189 | 69,311 | 1.4343% | 0.0703% |
| MPEG4 Dance | 23,439,182 | 689,270,094 | 1,218,222 | 1,337,120 | 5.1974% | 0.1940% |

TABLE IV
Workload features

Fig. 2. A propagative algorithm

Nevertheless, Chain has an embedded 2D data locality that can be measured: we profile memory accesses to image data structures using a simple measure of spatial locality, given by the difference between the current and the following memory address issued by the same instruction (i.e. the stride). In the histogram of Fig. 3(a), the large number of strides equal to +1 and -1 reflects the standard 1D spatial locality (i.e. the pixel accessed is adjacent to the one previously accessed, in the forward or backward direction, respectively), while the large number of strides equal to -352 and +352 indicates that the blocks of pixels belonging to the previous and following rows (the image is 352x240) are accessed with a 2D spatial locality.
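The stride measure described above can be illustrated with a minimal Python sketch. It assumes a trace given as (instruction address, data address) pairs and one byte per pixel; the function name `stride_histogram` and the trace layout are illustrative assumptions, not the format produced by the tracer used in this work.

```python
from collections import Counter


def stride_histogram(trace):
    """Count strides per instruction.

    trace: iterable of (pc, address) pairs in program order (assumed layout).
    The stride is the difference between two consecutive addresses issued
    by the same instruction (same pc).
    """
    last_address = {}   # last data address seen for each instruction
    histogram = Counter()
    for pc, address in trace:
        if pc in last_address:
            histogram[address - last_address[pc]] += 1
        last_address[pc] = address
    return histogram


# Hypothetical accesses on a 352-pixel-wide image: one step forward,
# one step to the following row, one step backward.
trace = [(0x10, 1000), (0x10, 1001), (0x20, 5000), (0x10, 1353), (0x10, 1352)]
print(stride_histogram(trace))
```

In such a histogram, peaks at +1 and -1 correspond to 1D locality, while peaks at plus or minus the row width correspond to accesses to the previous and following rows.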

The JPEG and MPEG2 programs are dominated by operations on image blocks (8 x 8 pixels) or macroblocks (16 x 16 pixels), thus exhibiting a strong 2D data locality. MPEG2 decoding typically incurs a high number of cache misses [4]: obviously, a large part of them are compulsory misses if no prefetching is used. Moreover, due to the relatively large mask size, the large 2D locality causes a number of cache conflicts as well. The 2D miss rate depends on the image format and compression but is of the same order of magnitude for the same MPEG format: in our case it is about 2.8% for JPEG videos and 1.4% for MPEG2 (see Table IV). Contrary to what might be expected, MPEG2 has a lower miss rate than JPEG. This has been confirmed by our former experiments in which the locality of the MPEG2 decoder was studied: the results of these experiments are not reported in this paper, but they demonstrate that MPEG2 shows a higher re-use of the pixel’s neighborhood and thus achieves lower miss rates.

The performance improvement achievable by reducing the 2D miss ratio depends on the amount of 2D references with respect to the overall amount of memory references; in our workload, for the MPEG applications, for instance, the percentage of 2D references varies between about 3% and 30%.

Fig. 3(b) reports the stride histogram for the Waterski video. The dominant stride value is 1, which means that the greater part of memory accesses are performed in raster-scan order. However, some other non-negligible stride values are concentrated around values corresponding to pixels of the same 8x8 (16x16) block belonging to the previous and following rows, thus providing the 2D data locality.

Finally, the MPEG4 program is more complex, since many image planes for each VO are processed with a large number of dynamic image data allocations. It computes the same Inverse DCT as the MPEG-2 programs but it must also decode alpha planes for the VOs’ shape information. MPEG4 decoding exhibits a very high 2D miss rate, about 5% in our test (see Table IV). Assuming that most of the misses are compulsory, all the benchmark programs should benefit greatly from adopting aggressive prefetching techniques.

Fig. 3. Demonstration of 2D spatial locality: (a) locality of Chain; (b) locality of MPEG2 on the Waterski video

Miss rates of multimedia applications cannot be considered negligible in general: in our workload the 2D miss rates vary mostly in a range of 1-5%, but they strongly depend on the cache size and configuration; miss rates of up to 20% have been reported by other authors for some cache configurations [3].

V. Architectural and timing model

In this section, we describe the architectural and timing model used to evaluate results by means of temporal analysis. We simulate a RISC processor with an ideal single pipeline allowing a standard one clock-per-instruction execution ($T_{EXEC} = 1 \cdot T_{CK}$) and a reference memory architecture with a specific cache for image data (2D cache) as well as a standard data cache for other data (scalar cache). Simulations are based on the following assumptions [31]:
• The 2D and scalar caches can be accessed separately, without interference.
• The size of the 2D cache is assumed to be 32 KB (1 K lines of 32 bytes each, 2-way set-associative), which is a typical size for an L1 split cache. In addition, we also report performance for different cache sizes (both smaller and larger).
• The 2D cache is not multiple-ported, i.e. only one miss or one prefetch can be sustained at a time. The hit penalty is assumed null for simplicity, i.e. $T_{HIT} = 1 \cdot T_{CK}$, included in the instruction execution, while miss and prefetch accesses have the same penalty $T_{MISS} = T_{PF} = 8 \cdot T_{CK}$ for the line fill. In addition, we also report performance assuming higher miss penalties.
• The lookup time, i.e. the time to access the cache tag directory, is assumed null; this is a simplifying assumption which does not affect the significance of simulation results. In particular, for neighbor prefetching we assume a multiple-ported tag directory which can sustain multiple lookups at a time.
• Misses are blocking (or synchronizing) events for execution, meaning that when a miss happens, execution must wait for the completion of the miss line fill before starting again. Moreover, any scheduled prefetching activity is completed before serving the miss. Hence, if the completion of prefetching loads into the cache the data required by the miss, no miss line fill is then issued. This mechanism is called lookahead miss inquire: when a miss occurs, the current prefetching queue is tested and a line fill is performed only if the queued prefetches are not already loading the data which caused the miss.
• Instead, prefetches are non-blocking events for execution, i.e. they can be accomplished in parallel with instruction execution and their penalty can be hidden by the execution of other instructions.
• In our architectural model, we assume in-order execution, since this is the typical execution model of several processors used for multimedia applications, such as the Philips TriMedia and the Texas Instruments C6000. Modern general-purpose processors are instead based on out-of-order engines, which can more effectively tolerate the effect of cache misses.

The total execution time is computed according to this model. We account a time $T_{EXEC}$ (the basic one-clock execution time) for completing each instruction with no memory reference or with a memory hit. If an instruction causes a miss, we account an additional execution time of $T_{MISS}$. If an instruction causes a prefetch, prefetching is performed in parallel with instruction execution, and no time is accounted for prefetching in addition to that of instruction execution. If instruction execution causes a miss while a prefetch is in progress, the prefetching activity is completed before serving the miss; for the completion of prefetching we account an additional execution time $(1 - OV) T_{PF}$, where $OV$ is the fraction of $T_{PF}$ covered by overlapping. If further prefetches were already scheduled in the prefetching queue, they are completed and a time $T_{PF} = T_{MISS}$ is added to the total execution time for each of them. The time $T_{MISS}$ for the miss line fill is counted only if the prefetches in the prefetching queue have not already loaded into the cache the data required by the reference which caused the miss.
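The time accounting described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the PRIMA simulator: the trace format (one pair per instruction), the function name `total_time`, and the choice that each executed instruction overlaps one clock of the prefetch at the head of the queue are all assumptions made for the example. Whether an access is a hit or a miss is taken as given by the trace.

```python
from collections import deque

T_EXEC = 1           # one clock per instruction
T_MISS = T_PF = 8    # line-fill penalty in clocks (reference configuration)


def total_time(trace):
    """Return the execution time, in clocks, of a trace.

    Each entry is (missed_line, prefetched_lines): missed_line is the cache
    line that misses at this instruction (None for a hit or a non-memory
    instruction); prefetched_lines lists the lines whose prefetch is
    scheduled by this instruction.
    """
    clocks = 0
    queue = deque()  # pending prefetches as [line, remaining_clocks]
    for missed_line, prefetched_lines in trace:
        clocks += T_EXEC
        # The prefetch at the head of the queue proceeds in parallel.
        if queue:
            queue[0][1] -= T_EXEC
            if queue[0][1] <= 0:
                queue.popleft()
        if missed_line is not None:
            # Blocking miss: all queued prefetches are completed first.
            loaded = False
            while queue:
                line, remaining = queue.popleft()
                # (1 - OV) * T_PF for the head, a full T_PF for the others.
                clocks += remaining
                loaded = loaded or line == missed_line
            # Lookahead miss inquire: fill only if no prefetch loaded the line.
            if not loaded:
                clocks += T_MISS
        for line in prefetched_lines:
            queue.append([line, T_PF])
    return clocks


# A prefetch of line 5 issued three instructions before the reference to it.
with_prefetch = [(None, [5]), (None, []), (None, []), (5, [])]
without_prefetch = [(None, []), (None, []), (None, []), (5, [])]
print(total_time(with_prefetch), total_time(without_prefetch))
```

In the first trace the prefetch is only partially overlapped, so the reference waits for the remaining part of the prefetch instead of a full line fill.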

In this work, performance was measured by using trace-driven simulation.

Traces are collected by using the tracer Spy [32] and processed using our simulator, called PRIMA (PRefetching IMproves Multimedia Applications), which evolved from the ACME simulator [33]: we integrated support for prefetching and 2D caches, and then extended it to perform temporal simulation [34].

VI. Performance results

Using the architectural model and the benchmark defined, we measured the performance achieved using caches without prefetching, comparing well-known prefetching methods (OBL and SPT) with the new neighbor-based ones.

A. Number of misses

The number of misses is an important parameter for estimating the efficacy of a prefetching method, since an effective prefetching method will achieve a high miss reduction. In addition to the number of misses, an expressive measure for comparing the different prefetching techniques is the fraction of misses eliminated. We call it the miss-based efficiency of the i-th prefetching method, $\eta_i$, defined as:

$$\eta_i = \frac{N_{MISS} - N_{MISS\_PF_i}}{N_{MISS}} \tag{1}$$

where $N_{MISS}$ is the number of misses without prefetching. This miss-based efficiency ranges between 0 and 1 (but could even be negative in case of ineffective prefetching) and tends to 1 when prefetching achieves the highest performance, that is, when $N_{MISS\_PF_i}$ (i.e. the number of non-eliminated misses) tends to 0.
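As a quick illustration of Eq. 1, the following Python sketch evaluates the miss-based efficiency on the MPEG4 column of Table V. The function name `miss_efficiency` is an illustrative choice; the counts are transcribed from Table V.

```python
def miss_efficiency(n_miss, n_miss_pf):
    """Fraction of misses eliminated by prefetching (Eq. 1)."""
    return (n_miss - n_miss_pf) / n_miss


# MPEG4 column of Table V: misses without prefetching and with each technique.
N_MISS = 609_111
remaining_misses = {
    "OBL": 231_661,
    "SPT": 14_826,
    "1st-ref neighbor": 12_968,
    "8-step neighbor": 16_815,
}

for technique, n_miss_pf in remaining_misses.items():
    print(f"{technique}: {100 * miss_efficiency(N_MISS, n_miss_pf):.1f}%")
```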

Table V and Fig. 4 show, respectively, the number of misses $N_{MISS\_PF_i}$ and the efficiency $\eta_i$ (in percentage) for the four considered prefetching techniques on the test programs, for a 2-way set-associative cache of 32 KB with 32 bytes per line.

Results confirm that exploiting horizontal spatial locality only (as OBL does) is not an effective solution: in fact, the number of misses with OBL prefetching remains non-negligible and the efficiency of OBL is only 75% for JPEG, 83-87% for MPEG2 and only 61.9% for MPEG4 (see Fig. 4). Better results are achieved by SPT prefetching ($\eta$ is greater than 90% for MJPEG video, 86-87% for MPEG2 videos and, in particular, 97.5% for MPEG4). However, the best performance is obtained by both the neighbor-based techniques, with an efficiency very close to 100% for most of the programs. If the prefetching mechanism did not introduce any time cost, these results would support the feasibility of including prefetching techniques, especially neighbor-based ones, in the design of a processor oriented to multimedia image and video processing.

B. Time analysis

However, the number of misses is not sufficient to define execution performance, since many bus transfers are due to prefetches: if the number of prefetches is too high, or if prefetches are not sufficiently overlapped with instruction execution, then according to the model previously presented the gain achieved by the miss reduction would be lost due to the prefetching cost [4].

As a matter of fact, the prefetching techniques issue many prefetching operations, whose number is reported in Table VI; note however that this number is less than or equal to the original number of misses. For each method, the second row of Table VI denotes the number of prefetch cycles that are not completely hidden by instruction execution. In many cases this number is 0: this means that all prefetches issued are hidden and do not cause any time penalty. Conversely, in other cases (especially for MPEG2 and MPEG4) the number of not completed prefetches introduces a non-negligible delay, which depends on the fraction of the prefetch cycle that is not overlapped ($1 - OV$).

In order to take into account the prefetch cost, we compute the Memory Access Delay Time, MADT, as adopted in [4][35]. This figure accounts for the delay caused by the access to the lower level of the memory hierarchy. Thus, we define MADT as the ratio between the total amount of memory delay and the number of memory reference instructions executed ($N_{REF}$).


| | Thresh | Convo | Chain | JPEG Frisco | JPEG Pirates | MPEG2 FlowerG | MPEG2 Waterski | MPEG4 Dance |
|---|---|---|---|---|---|---|---|---|
| No pf | 8,192 | 16,370 | 6,987 | 44,288 | 242,259 | 631,784 | 684,189 | 609,111 |
| OBL | 1 | 6 | 1,351 | 11,008 | 60,245 | 78,601 | 113,966 | 231,661 |
| SPT | 1 | 32 | 40 | 1,590 | 8,688 | 78,282 | 90,166 | 14,826 |
| 1st-ref neighbor | 1 | 2 | 12 | 83 | 428 | 28,385 | 70,734 | 12,968 |
| 8-step neighbor | 1 | 2 | 23 | 83 | 428 | 35,362 | 75,753 | 16,815 |

TABLE V
Number of misses $N_{MISS\_PF_i}$

Fig. 4. Miss-based efficiency ηi (in percentage)


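| | Thresh | Convo | Chain | JPEG Frisco | JPEG Pirates | MPEG2 FlowerG | MPEG2 Waterski | MPEG4 Dance |
|---|---|---|---|---|---|---|---|---|
| OBL, issued | 8,192 | 16,366 | 6,653 | 33,410 | 182,718 | 582,638 | 630,747 | 390,102 |
| OBL, not completely overlapped | 0 | 0 | 0 | 0 | 0 | 25,892 | 30,714 | 652 |
| SPT, issued | 8,192 | 16,418 | 7,606 | 49,749 | 272,576 | 797,478 | 910,582 | 614,588 |
| SPT, not completely overlapped | 0 | 0 | 2,615 | 0 | 0 | 104,234 | 51,871 | 213,780 |
| 1st-ref neighbor, issued | 8,225 | 16,436 | 11,320 | 45,417 | 248,439 | 777,279 | 1,119,519 | 711,234 |
| 1st-ref neighbor, not completely overlapped | 0 | 0 | 6 | 0 | 0 | 67,154 | 84,191 | 3,715 |
| 8-step neighbor, issued | 8,224 | 16,436 | 11,117 | 45,373 | 248,280 | 766,309 | 1,103,399 | 692,890 |
| 8-step neighbor, not completely overlapped | 0 | 0 | 6 | 0 | 0 | 17,043 | 22,227 | 758 |

TABLE VI
The number of issued prefetches $N_{PF_i}$ and the number of prefetches not completely overlapped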

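In the absence of prefetching, this quantity is:

$$MADT_{NO\_PF} = \frac{N_{MISS} \, T_{MISS}}{N_{REF}} \tag{2}$$

In the case of the i-th prefetching technique (assuming $T_{PF_i} = T_{MISS}$), MADT becomes:

$$MADT_{PF_i} = \frac{N_{MISS\_PF_i} \, T_{MISS} + N_{PF_i} (1 - OV_{PF_i}) \, T_{MISS}}{N_{REF}} \tag{3}$$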
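Eqs. 2 and 3 can be checked numerically with a short Python sketch. The function names are illustrative assumptions; the counts are transcribed from Tables IV, V, VI and VII. The helper `average_overlap` simply inverts Eq. 3 to recover the average overlap fraction from a measured total delay; it is an illustration and not a quantity reported in the tables.

```python
T_MISS = 8  # clocks, reference configuration


def madt_no_pf(n_miss, n_ref, t_miss=T_MISS):
    """Memory access delay time without prefetching (Eq. 2)."""
    return n_miss * t_miss / n_ref


def madt_pf(n_miss_pf, n_pf, ov, n_ref, t_miss=T_MISS):
    """Memory access delay time with prefetching (Eq. 3)."""
    return (n_miss_pf * t_miss + n_pf * (1 - ov) * t_miss) / n_ref


def average_overlap(total_delay, n_miss_pf, n_pf, t_miss=T_MISS):
    """Average OV obtained by solving Eq. 3 for OV (illustrative helper)."""
    return 1 - (total_delay - n_miss_pf * t_miss) / (n_pf * t_miss)


# Thresh without prefetching: 8,192 misses over 524,288 2D references.
print(madt_no_pf(8_192, 524_288) * 524_288)   # total delay in clocks

# OBL on MPEG2 FlowerG: total delay from Table VII, counts from Tables V and VI.
ov = average_overlap(748_250, n_miss_pf=78_601, n_pf=582_638)
print(round(ov, 4))
```

With these counts the total delay for Thresh without prefetching matches the corresponding entry of Table VII (65,536 clocks).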

In Eq. 3, $N_{MISS\_PF_i}$ accounts for the actual number of misses (i.e. excluding those attempted but not issued thanks to the lookahead miss inquire mechanism), as shown in Table V. $N_{PF_i}$ is the total number of issued prefetches (see Table VI), while $N_{PF_i} \, OV_{PF_i} \, T_{MISS}$ is the total amount of time saved thanks to the overlap between prefetches and instruction execution ($OV_{PF_i}$ being the average of all $OV$ contributions over the whole execution). Therefore, MADT is a more significant performance measure than the number of misses alone, since it also takes into account the prefetches that are not completed, weighted by the fraction not overlapped with other executed instructions.

The $N_{PF_i}$ quantity strongly depends on the prefetching technique. As summarized in Table I, the OBL and SPT techniques schedule a maximum of one prefetch for each reference: therefore the prefetches issued are fewer than (or at most equal to) the number of references $N_{REF}$ (note that we are considering prefetching on reference). Actually, by comparing Tables IV and VI, it is evident that the number of issued prefetches is just a few percent of the number of references. Instead, in first-reference neighbor prefetching, as discussed in Section III, a maximum of eight prefetches could be issued on the first reference of each sequence of references to the same block; in the 8-step version, eight prefetches (at most) are distributed along the sequence of references to the same block. Since the length of a sequence of references is not known a priori, the neighbor-based techniques could cause a higher number of prefetches and potentially a higher MADT. Instead, by comparing Table IV and Table VI, we measured that even for the neighbor techniques the number of prefetches actually issued is much lower than the number of references. This shows that the locality embedded in the program is suitably exploited by the cache prefetching techniques.

Table VII reports the total memory access delay time ($MADT \cdot N_{REF}$) for each tested program, or in other words the number of clocks spent accessing the lower level of the memory hierarchy; without prefetching (first row), it measures the miss penalty only, while in the case of a prefetching technique it measures the sum of both miss and prefetch penalties.

Just as in the previous section we defined the miss-based efficiency (Eq. 1), we can define the MADT efficiency in terms of MADT as:

$$\eta^T_i = \frac{MADT_{NO\_PF} - MADT_{PF_i}}{MADT_{NO\_PF}} \tag{4}$$

This measure has the characteristics of an efficiency, but at the same time it is a relative measure equivalent to the relative MCPI of [4] ($\eta^T_i = 1 - \text{relative MCPI}$). The measures of $\eta^T_i$, reported in the histogram of Fig. 5, are similar to the $\eta_i$ ones of Fig. 4, but differ by the cost of the not completely overlapped prefetches. In fact, by substituting the definitions of Eqs. 2 and 3 in Eq. 4, we have that

$$\eta^T_i = \eta_i - \frac{N_{PF_i} (1 - OV_{PF_i})}{N_{MISS}} \tag{5}$$

The second term is the cost of prefetching for the not completely overlapped memory accesses. By comparing the graphs of Figs. 4 and 5 we can note that prefetches are totally hidden (and thus $\eta^T_i = \eta_i$) for Convo, while in other cases (JPEG, MPEG2, MPEG4) this cost is not negligible. The prefetching costs particularly affect the OBL and SPT methods and far less the neighbor methods. On average, 8-step neighbor prefetching exhibits the best performance: the explicit exploitation of the 2D data locality allows a better prediction of the data to be prefetched, together with a good distribution of the prefetching activity over time.
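For reference, the substitution that leads from Eqs. 2, 3 and 4 to Eq. 5 can be written out step by step. The block below is only a worked restatement using the symbols already defined; the common factor $T_{MISS}$ and the denominator $N_{REF}$ cancel.

```latex
\begin{aligned}
\eta^T_i
  &= \frac{MADT_{NO\_PF} - MADT_{PF_i}}{MADT_{NO\_PF}} \\
  &= \frac{N_{MISS}\,T_{MISS} - N_{MISS\_PF_i}\,T_{MISS} - N_{PF_i}(1 - OV_{PF_i})\,T_{MISS}}{N_{MISS}\,T_{MISS}} \\
  &= \frac{N_{MISS} - N_{MISS\_PF_i}}{N_{MISS}} - \frac{N_{PF_i}(1 - OV_{PF_i})}{N_{MISS}} \\
  &= \eta_i - \frac{N_{PF_i}(1 - OV_{PF_i})}{N_{MISS}}
\end{aligned}
```

When every prefetch is completely overlapped ($OV_{PF_i} = 1$), the second term vanishes and $\eta^T_i = \eta_i$.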

An important consideration is that, observing the time measures for MPEG4, not only OBL but also SPT prefetching exhibits a relatively low efficiency ($\eta^T_i$ is 61.8% and 75.6%, respectively). Instead, the adoption of neighbor prefetching proves to be extremely beneficial for MPEG-4, with an efficiency of 97%.
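The MADT efficiency of Eq. 4 can be evaluated directly from the total delays of Table VII, since the common factor $N_{REF}$ cancels in the ratio. The following Python sketch does this for the MPEG4 column; the function name `madt_efficiency` is an illustrative assumption and the delays are transcribed from Table VII.

```python
def madt_efficiency(delay_no_pf, delay_pf):
    """MADT efficiency (Eq. 4) from total delays in clocks (MADT * N_REF)."""
    return (delay_no_pf - delay_pf) / delay_no_pf


# MPEG4 column of Table VII (total memory access delay time, in clocks).
DELAY_NO_PF = 4_872_888
delays = {
    "OBL": 1_857_230,
    "SPT": 1_188_205,
    "1st-ref neighbor": 130_392,
    "8-step neighbor": 138_596,
}

for technique, delay in delays.items():
    print(f"{technique}: {100 * madt_efficiency(DELAY_NO_PF, delay):.1f}%")
```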

Prefetching actually improves performance in terms of memory access time. This result is not obvious, since prefetching could even worsen the memory access time when the prefetching activity cannot be completely overlapped with execution and the number of prefetches is high. The amount of execution time actually saved can be easily deduced from Table VII, by comparing the number of clock cycles spent with a given prefetching method with that of the first row, obtained without prefetching.

Another relevant measure takes into account how much the improvement obtained by prefetching influences the memory access time: this can be obtained by adding the MADT (which is the average memory access delay time) to the normal memory access time without delay (equal to $T_{HIT} = T_{EXEC} = 1 \cdot T_{CK}$). The result is the average Memory Access Time, MAT [5], defined as:

$$MAT_{NO\_PF} = MADT_{NO\_PF} + \frac{N_{HIT} \, T_{HIT}}{N_{REF}} \tag{6}$$

and

$$MAT_{PF_i} = MADT_{PF_i} + \frac{N_{HIT\_PF_i} \, T_{HIT}}{N_{REF}} \tag{7}$$

We use the MAT measure to compute the relative percentage speedup of the i-th technique with respect to the case without prefetching as

$$\xi^T_i = \left( \frac{MAT_{NO\_PF}}{MAT_{PF_i}} - 1 \right) \cdot 100 = \frac{MAT_{NO\_PF} - MAT_{PF_i}}{MAT_{PF_i}} \cdot 100 \tag{8}$$

| | Thresh | Convo | Chain | JPEG Frisco | JPEG Pirates | MPEG2 FlowerG | MPEG2 Waterski | MPEG4 Dance |
|---|---|---|---|---|---|---|---|---|
| No pf | 65,536 | 130,960 | 55,896 | 354,304 | 1,938,072 | 5,054,272 | 5,473,512 | 4,872,888 |
| OBL | 8 | 48 | 10,808 | 88,064 | 481,960 | 748,250 | 1,095,475 | 1,857,230 |
| SPT | 8 | 256 | 8,165 | 12,720 | 69,504 | 931,044 | 899,561 | 1,188,205 |
| 1st-ref neighbor | 8 | 16 | 137 | 664 | 3,424 | 705,661 | 1,182,970 | 130,392 |
| 8-step neighbor | 8 | 16 | 184 | 664 | 3,424 | 359,468 | 728,157 | 138,596 |

TABLE VII
The total memory access delay time ($MADT \cdot N_{REF}$), in clocks

Fig. 5. MADT efficiency $\eta^T_i$ (in percentage)

Fig. 6. MAT speedup $\xi^T_i$ (in percentage)

The measures in the graph of Fig. 6 show that prefetching achieves a significant speedup even when considering the total memory access time and not the memory delay time only. We can note that the speedup is low with all prefetching techniques for the Convo program, since its initial miss rate was very low (0.08%). The speedup is instead about 10%, whichever prefetching technique is used, for a simple raster-scan program such as Thresh. The speedup becomes relevant for JPEG, whose initial miss rate is about 2.8%: neighbor prefetching reaches a speedup of 19.7% (versus 14.1% and 18.8% for OBL and SPT, respectively). The speedup reaches up to 8% in the best case in our experiments on MPEG2, with 8-step neighbor prefetching. More importantly, the 8-step neighbor speedup reaches 35% for MPEG4, which has a considerable initial miss rate of about 5%. Moreover, this result has been computed with a relatively low $T_{MISS} = T_{PF} = 8 \cdot T_{CK}$; in Subsection D we show that this speedup is significantly greater for higher values of the miss penalty, exceeding 140%.
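Eqs. 6 to 8 can be illustrated with a minimal Python sketch on the JPEG Frisco program. The function names are illustrative, and the sketch assumes that the number of hits equals the number of references minus the number of misses, which is not stated explicitly in the text; the counts are transcribed from Tables IV, V and VII.

```python
T_HIT = 1  # clocks


def mat(total_delay, n_miss, n_ref, t_hit=T_HIT):
    """Average memory access time (Eqs. 6 and 7).

    total_delay is MADT * N_REF in clocks; the hit count is assumed to be
    n_ref - n_miss.
    """
    n_hit = n_ref - n_miss
    return (total_delay + n_hit * t_hit) / n_ref


def speedup_percent(mat_no_pf, mat_pf):
    """Relative percentage speedup of the memory access time (Eq. 8)."""
    return (mat_no_pf / mat_pf - 1) * 100


# JPEG Frisco: 2D references (Table IV), misses (Table V), delays (Table VII).
N_REF = 1_566_730
mat_no_pf = mat(total_delay=354_304, n_miss=44_288, n_ref=N_REF)
mat_neighbor = mat(total_delay=664, n_miss=83, n_ref=N_REF)

print(f"{speedup_percent(mat_no_pf, mat_neighbor):.1f}%")
```

Under these assumptions the result is close to the 19.7% reported for neighbor prefetching on JPEG.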

In conclusion, in the case of algorithms with significant miss rates, such as JPEG and, in particular, MPEG4, the 2D memory access time improvement is highly relevant. In the case of algorithms with still non-negligible miss rates, such as Thresh, Chain and MPEG2, the time improvement is about 10% with the given cache configuration. Finally, in the case of a simple and regular algorithm with high temporal locality such as Convo, the time improvement achievable by prefetching is heavily limited by the low initial miss rate.

C. Impact of cache size

In this section we discuss the impact of cache parameters, mainly cache size, block size, and degree of associativity, on the performance of the multimedia programs.

Varying the block size does not particularly affect performance. This is because enlarging the block size improves the exploitation of 1D spatial locality only, and therefore we should expect performance improvements similar to those of OBL techniques. An example confirming this assumption is given in Fig. 7(a), which shows the MADT efficiency on the MPEG4 decoder, measured on a 32 KB two-way associative cache (with the time model discussed in Section V) with block size varying from 8 to 32 bytes.

We do not report similar results for the associativity degree, since we found that this is not a critical parameter: most of the misses were compulsory and not due to conflicts.

From our experiments, the cache size is the most relevant parameter. Fig. 7(b) shows the MADT efficiency measured with different cache sizes (two-way associative cache with 32 bytes per block) for the MPEG4 program. As the cache size is enlarged, the performance improves from a very small cache (8 KB) to a medium size cache (32 KB); then, a further cache enlargement does not improve performance anymore. However, a significant result is that 8-step neighbor prefetching is able to almost completely eliminate misses even with caches of small size. This measure can be very useful in cache design for low-cost multimedia processors, which cannot be endowed with large data caches (for instance, the Philips TriMedia TM1100 processor, with a 16 KB data cache).

D. Impact of the miss penalty

The following test analyzes the impact of the miss penalty time on the total memory access time: the higher the penalty, the larger the impact of misses and of not completely overlapped prefetches. Therefore, the benefit achievable with a prefetching strategy able to eliminate almost all misses, while at the same time overlapping prefetches, will accordingly be higher. Since the ratio between memory and cache access times is increasing with current technologies, large miss penalties are expected to become common.

In all the previous tests the miss penalty considered was $T_{MISS} = T_{PF} = 8 \cdot T_{CK}$. Now, let us suppose we modify the architecture by increasing the number of clocks needed to access the lower level of the hierarchy. What happens if the memory miss penalty increases? Considering the memory access delay time, not only the miss costs but also the prefetching costs will increase. Moreover, this increase is not proportional, since the number of not completely overlapped prefetches changes, together with the fraction overlapped. We report the simulations performed for the MPEG4 decoder with different miss penalties. Results with the other programs of the benchmark are similar and are omitted for brevity. Fig. 8(a) shows the MADT efficiency $\eta^T_i$ for MPEG4 on a 32 KB cache with 32-byte lines, with $T_{MISS} = T_{PF} = 8$, 16, 24 and $32 \cdot T_{CK}$.

With the previously introduced assumption of $T_{MISS} = 8 \cdot T_{CK}$, the efficiency of the neighbor methods was higher than that of all standard techniques (see Fig. 5). As the latency increases, the prefetching efficiency varies differently for each method: in particular, the SPT technique is very sensitive to this change, while OBL is almost insensitive. When $T_{MISS} = 24 \cdot T_{CK}$, SPT exhibits an efficiency of only 59.7%, compared with 62.14% for OBL. Nevertheless, the efficiency of the neighbor techniques remains close to 100%; in particular, 8-step neighbor prefetching has an efficiency of 97.42% with $8 \cdot T_{CK}$ and 97.3% with $24 \cdot T_{CK}$.

Results for the speedup on the overall memory access time (MAT) are even more significant. Fig. 8(b) reports the MAT speedup $\xi^T_i$ for MPEG4 for the same 32 KB cache with 32-byte lines, with $T_{MISS} = T_{PF} = 8$, 16, 24 and $32 \cdot T_{CK}$. The speedup achievable with OBL and SPT prefetching grows modestly with the miss penalty, showing that the non-eliminated misses and the non-overlapped prefetches heavily limit prefetching performance. Instead, the speedup obtained with neighbor-based prefetching techniques grows linearly with the miss penalty, exceeding 110% for a miss penalty of $24 \cdot T_{CK}$ and 140% for a miss penalty of $32 \cdot T_{CK}$. This means that the memory access time can be more than halved by neighbor prefetching for these configurations.
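The linear growth can be made plausible with an idealised bound derived from Eqs. 2, 6 and 8. The derivation below is a sketch under assumptions that are not part of the measurements: all misses are eliminated and all prefetches are completely overlapped (so that $MAT_{PF_i} = T_{HIT}$), the number of hits without prefetching is $N_{REF} - N_{MISS}$, and $m = N_{MISS} / N_{REF}$ denotes the initial miss rate.

```latex
\begin{aligned}
MAT_{NO\_PF} &= \frac{N_{MISS}\,T_{MISS} + (N_{REF} - N_{MISS})\,T_{HIT}}{N_{REF}}
              = T_{HIT} + m\,(T_{MISS} - T_{HIT}) \\
\xi^T_{\max} &= \left( \frac{MAT_{NO\_PF}}{T_{HIT}} - 1 \right) \cdot 100
              = m \left( \frac{T_{MISS}}{T_{HIT}} - 1 \right) \cdot 100
\end{aligned}
```

With $m \approx 0.052$ for MPEG4 (Table IV) and $T_{HIT} = T_{CK}$, this bound gives about 36% for $T_{MISS} = 8 \cdot T_{CK}$, 120% for $24 \cdot T_{CK}$ and 161% for $32 \cdot T_{CK}$. The bound is linear in the miss penalty, and the measured neighbor speedups quoted above remain below it, as expected since some misses and some non-overlapped prefetches remain.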

VII. Conclusion

Fig. 7. (a) MADT efficiency for MPEG4 with variable block size ($T_{MISS} = 8 \cdot T_{CK}$; 32 KB size); (b) MADT efficiency for MPEG4 with variable cache size ($T_{MISS} = 8 \cdot T_{CK}$; 32 bytes/line)

Fig. 8. (a) MADT efficiency for MPEG4 with variable $T_{MISS}$ (32 bytes/line; 32 KB size); (b) MAT speedup for MPEG4 with variable $T_{MISS}$ (32 bytes/line; 32 KB size)

This paper has presented a detailed performance analysis of different cache prefetching techniques on a multimedia benchmark. The benchmark consists of a set of standard image and video decompression programs (JPEG, MPEG-2, MPEG-4) and some common image processing programs characterized by different memory access schemes to 2D data, including both raster-scan and data-dependent accesses.

We proposed two new cache prefetching techniques, namely first-reference and 8-step neighbor prefetching, whose results outperform standard techniques both in terms of miss rate and memory access delay time. The most important results can be summarized as follows:
1) Prefetching should be adopted by multimedia-oriented processors, since all programs working on images and video have a high initial miss rate and misses are essentially of the compulsory type. All prefetching methods improve performance considerably.
2) The standard One-Block Lookahead prefetching technique, which exploits classic (1D) spatial locality, is interesting for raster-scan programs but is not particularly efficient for programs working on macroblocks such as MPEG-2 and MPEG-4.
3) Stride-based techniques (such as the Stride Prediction Table) achieve better results in terms of eliminated misses, but if a time analysis is performed by also taking into account the cost of prefetching, they show limited efficiency (81-83% for MPEG-2 and 75% for MPEG-4 with the reference cache architecture). Moreover, the performance of the SPT technique decreases as the latency increases (see previous section).
4) The new prefetching techniques proposed in this paper, namely first-reference and 8-step neighbor prefetching, show high efficiency both in terms of miss rate and memory access delay time, and an impressive speedup of the overall memory access time. They outperform the other techniques in the tests performed on the multimedia workload. On average, 8-step neighbor prefetching performs slightly better than first-reference neighbor prefetching, thanks to a better distribution of the prefetching activity.
5) A specific consideration must be made regarding the MPEG-4 program. MPEG-4 decoding exhibits a significant miss rate (about 5% in the reference 2D cache), since it is characterized by a data access sequence which is not captured well by standard caches. With this workload, neighbor techniques are very interesting: for instance, with a two-way set-associative 32 KB cache with 32 bytes/line, the miss rate becomes about 0.11-0.14% and the speedup in the average memory access time ranges from 35% to 140% as the miss penalty increases.
6) A detailed analysis of implementation costs was outside the scope of the present paper. However, the area and power costs of a split data cache can be considered, to a first approximation, proportional to its size. In this paper, results were reported with a reference cache size of 32 KB, which is fairly large and correspondingly costly, but results seem to be very good also for smaller sizes such as 8 KB, as shown in Fig. 7(b). Therefore, the technique could be used even for small caches, which imply far more limited implementation costs.

These results qualify neighbor prefetching as an effective solution for prefetching of 2D data such as the images and videos processed in the most common multimedia applications.

References

[1] I. Kuroda and T. Niscitani, “Multimedia processors,” Pro-ceedings of the IEEE, vol. 86, no. 6, pp. 1203–1221, 1998.

[2] R. Cucchiara, M. Piccardi, and A. Prati, “Exploiting cachein multimedia,” in Proc. of IEEE Intl. Conf. on MultimediaComputing and Systems (ICMCS), 1999, vol. 1, pp. 345–350.

[3] Z. Wu and W. Wolf, “Study of cache system in video signalprocessors,” in Proc. IEEE Workshop on Signal ProcessingSystems (SiPS), Oct. 1998, pp. 23–32.

[4] J. Tse and A.J. Smith, “CPU cache prefetching: timingevaluation of hardware implementation,” IEEE Transac-tions on Computers, vol. 47, no. 5, pp. 509–526, 1998.

[5] J. Hennessy and D. Patterson, Computer architecture: aquantitative approach, Morgan Kaufmann Publisher, 2 edi-tion, 1996.

[6] T.F. Chen and J.L. Baer, “A performance study of hard-ware and software data prefetching schemes,” in Proc. ofthe 21th Intl. Symp. on Computer Architecture (ISCA),1996, pp. 223–232.

[7] D. Zucker, M.J. Flynn, and R. Lee, “A comparison of hard-ware prefetching techniques for multimedia benchmark,” inProc. of IEEE Multimedia, 1996, pp. 236–244.

[8] A.D. Pimentel, L.O. Hertberger, P. Struik, and P. VanDer Wolf, “Hardware versus hybrid data prefetching inmultimedia processors,” in Proc. of IEEE Intl. Perfor-mance, Computing and Communications Conf. (IPCCC),1999, pp. 525–531.

[9] D.B. Zucker, R.B. Lee, and M.J. Flynn, “Hardware and

software cache prefetching techniques for MPEG bench-marks,” IEEE Transactions on Circuits and Systems forVideo Technology, vol. 10, no. 5, pp. 782–796, Aug. 1995.

[10] R. Cucchiara, M. Piccardi, and A. Prati, “Improving cacheperformance for multimedia applications,” Tech. Rep. n.91, Dept. of Engineering, University of Ferrara, Italy, ac-cepted for publication on Multimedia Tools and Applica-tions, Kluwer Academic Publishers, 2001.

[11] A.J. Smith, “Cache memories,” ACM Computing Surveys,vol. 14, no. 3, pp. 473–530, 1982.

[12] N.P. Jouppi, “Improving direct-mapped cache perfor-mance by the addition of a small fully-associative cacheand prefetch buffers,” in Proc. of IEEE/ACM Intl. Symp.on Computer Architecture (ISCA), May 1990, pp. 364–373.

[13] P. Ranganathan, S. Adve, and N.P. Jouppi, “Performanceof image and video processing with general-purpose proces-sors and media ISA extensions,” in Proc. of IEEE/ACMIntl. Symp. on Computer Architecture (ISCA), 1999, pp.124–135.

[14] J.W.C. Fu and J.H. Patel, “Data prefetching in multipro-cessor vector cache memories,” in Proc. of IEEE/ACMIntl. Symp. on Computer Architecture (ISCA), May 1991,pp. 54–63.

[15] G. Singh Manku, M.R. Prasad, and D.A. Patterson, “Anew voting based hardware data prefetch scheme,” inProc. of IEEE Intl. Conf. on High-Performance Comput-ing, 1997, pp. 100–105.

[16] D. Joseph and D. Grunwald, “Prefetching using Markovpredictors,” IEEE Transactions on Computers, Special Is-sue on Cache Memory and Related Problems, vol. 48, no.2, pp. 121–134, Feb. 1999.

[17] M. Tomasko, S. Hadjiyiannis, and W.A. Najjar, “Ex-perimental evaluation of array caches,” in IEEE TCCANewsletter, Mar. 1997, pp. 11–16.

[18] A. Gonzalez, C. Aliagas, and M. Valero, “A data cachewith multiple caching strategies tuned to different types oflocality,” in Proc. of ACM Intl. Conf. on Supercomputing,1995, pp. 338–347.

[19] V. Milutinovic, B. Markovic, M. Tomasevic, and M. Trem-blay, “The split temporal/spatial cache: initial perfor-mance analysis,” in Proc. of the SCIzzL-5, Santa Clara,CA, USA, 1996.

[20] R. Cucchiara and M. Piccardi, “Exploiting image processing locality in cache pre-fetching,” in Proc. of IEEE Intl. Conf. on High-Performance Computing, Dec. 1998, pp. 466–472.

[21] R.J. Eickemeyer and S. Vassiliadis, “A load instruction unit for pipelined processor,” IBM Journal of Research and Development, vol. 37, pp. 547–564, July 1993.

[22] C. Lee, M. Potkonjak, and W. Mangione-Smith, “MediaBench: a tool for evaluating multimedia and communications systems,” in Proc. of IEEE/ACM Intl. Symp. on Microarchitecture (MICRO), 1997, pp. 330–335.

[23] P. Baglietto, M. Maresca, M. Migliardi, and N. Zingirian, “Image processing on high performance RISC systems,” Proceedings of the IEEE, vol. 84, no. 7, pp. 917–925, 1996.

[24] H. Freeman, “On the encoding of arbitrary geometric configurations,” IRE Trans. Electron. Comput., vol. EC-10, pp. 260–268, 1961.

[25] J. Koplowitz, “On the performance of chain codes for quantization of line drawings,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 3, pp. 180–185, 1981.

[26] T. Kaneko and M. Okudaira, “Encoding of arbitrary curves based on the chain code representation,” IEEE Transactions on Communications, vol. COM-33, pp. 697–707, 1985.

[27] ISO/IEC DIS 14496-2 Information technology, Coding of audio-visual objects – Part 2: Visual.

[28] A.K. Katsaggelos, L.P. Kondi, F.W. Meier, J. Ostermann, and G.M. Schuster, “MPEG-4 and rate-distortion-based shape-coding techniques,” Proceedings of the IEEE, vol. 86, no. 6, pp. 1126–1154, June 1998.

[29] D. Le Gall, “MPEG: a video compression standard for multimedia applications,” Communications of the ACM, vol. 34, no. 4, pp. 46–58, 1991.

[30] T.F. Chen and J.L. Baer, “Effective hardware-based data prefetching for high-performance processors,” IEEE Transactions on Computers, vol. 44, no. 5, pp. 609–623, 1995.

[31] R. Cucchiara, M. Piccardi, and A. Prati, “Temporal analysis of cache prefetching strategies for multimedia applications,” in Proc. of IEEE Intl. Performance, Computing and Communications Conf. (IPCCC), accepted for publication, in press, 2001.

[32] G. Irlam, SPA – SPARC analyzer tool set, http://www.base.com/gordoni/spa/spa-1.0.tar.Z, 1991.

[33] http://atanasoff.nmsu.edu/ acme/acs.html. ACMECache Simulator.

[34] http://enki.ing.unimo.it/ImageLab/prima.html. PRIMA Web Site.

[35] G. Park, O. Kwon, T. Han, A. Kim, and S. Yang, “An improved lookahead instruction prefetching,” in Proc. of the High-Performance Computing on the Information Superhighway (HPC-Asia), 1997, pp. 712–715.
