



Lecture Notes in Computer Science 5657
Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board

David Hutchison
Lancaster University, UK

Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA

Josef Kittler
University of Surrey, Guildford, UK

Jon M. Kleinberg
Cornell University, Ithaca, NY, USA

Alfred Kobsa
University of California, Irvine, CA, USA

Friedemann Mattern
ETH Zurich, Switzerland

John C. Mitchell
Stanford University, CA, USA

Moni Naor
Weizmann Institute of Science, Rehovot, Israel

Oscar Nierstrasz
University of Bern, Switzerland

C. Pandu Rangan
Indian Institute of Technology, Madras, India

Bernhard Steffen
University of Dortmund, Germany

Madhu Sudan
Microsoft Research, Cambridge, MA, USA

Demetri Terzopoulos
University of California, Los Angeles, CA, USA

Doug Tygar
University of California, Berkeley, CA, USA

Gerhard Weikum
Max-Planck Institute of Computer Science, Saarbruecken, Germany


Koen Bertels, Nikitas Dimopoulos, Cristina Silvano, Stephan Wong (Eds.)

Embedded Computer Systems: Architectures, Modeling, and Simulation

9th International Workshop, SAMOS 2009
Samos, Greece, July 20-23, 2009
Proceedings



Volume Editors

Koen Bertels
Stephan Wong
Delft University of Technology
Mekelweg 4, 2628 CD Delft, The Netherlands
E-mail: {k.l.m.bertels,j.s.s.m.wong}@tudelft.nl

Nikitas Dimopoulos
University of Victoria
Department of Electrical and Computer Engineering
P.O. Box 3055, Victoria, BC, V8W 3P6, Canada
E-mail: [email protected]

Cristina Silvano
Politecnico di Milano
Dipartimento di Elettronica e Informazione
P.za Leonardo Da Vinci 32, 20133 Milan, Italy
E-mail: [email protected]

Library of Congress Control Number: 2009930367

CR Subject Classification (1998): C, B

LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues

ISSN 0302-9743
ISBN-10 3-642-03137-4 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-03137-3 Springer Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

springer.com

© Springer-Verlag Berlin Heidelberg 2009
Printed in Germany

Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
SPIN: 12718269 06/3180 5 4 3 2 1 0


Preface

The SAMOS workshop is an international gathering of highly qualified researchers from academia and industry, sharing ideas in a 3-day lively discussion on the quiet and inspiring northern mountainside of the Mediterranean island of Samos. The workshop meeting is one of two co-located events (the other event being the IC-SAMOS). As a tradition, the workshop features presentations in the morning, while after lunch all kinds of informal discussions and nut-cracking gatherings take place. The workshop is unique in the sense that not only solved research problems are presented and discussed, but also (partly) unsolved problems and in-depth topical reviews can be unleashed in the scientific arena. Consequently, the workshop provides the participants with an environment where collaboration rather than competition is fostered.

The SAMOS conference and workshop were established in 2001 by Stamatis Vassiliadis with the goals outlined above in mind, and located on Samos, one of the most beautiful islands of the Aegean.

The rich historical and cultural environment of the island, coupled with the intimate atmosphere and the slow pace of a small village by the sea in the middle of the Greek summer, provides a very conducive environment where ideas can be exchanged and shared freely.

SAMOS IX followed the series of workshops started in 2001 with a new expanded program including three special sessions to discuss challenging research trends. This year, the workshop celebrated its ninth anniversary, and 18 papers were presented, carefully selected out of 52 submissions, resulting in an acceptance rate of 34.6%. Each submission was thoroughly reviewed by at least three reviewers and considered by the international Program Committee during its meeting at Delft in March 2009. Indicative of the wide appeal of the workshop is the fact that the submitted works originated from a wide international community. In more detail, the regular papers come from 19 countries: Austria (2), Belgium (2), Brazil (1), Canada (1), Finland (8), France (6), Germany (5), Greece (3), India (1), Italy (3), Japan (1), Norway (2), Russia (1), Spain (2), Sweden (2), Switzerland (2), The Netherlands (7), UK (1) and USA (2).

Additionally, three special sessions were organized on topics of current interest: (1) "Instruction-Set Customization", (2) "Reconfigurable Computing and Processor Architectures", and (3) "Mastering Cell BE and GPU Execution Platforms". Each special session used its own review procedures, and was given the opportunity to include some relevant works selected from the regular papers submitted to the workshop in addition to some invited papers. Globally, 14 papers were included in the three special sessions. The workshop program also included one keynote speech by Yale Patt from the University of Texas at Austin.



A workshop like this cannot be organized without the help of many people. First of all, we would like to thank the members of the Steering and Program Committees and the external referees for their dedication and diligence in selecting the technical papers. The investment of their time and insight was very much appreciated. Then, we would like to express our sincere gratitude to Karin Vassiliadis for her continuous dedication in organizing the workshop. We also would like to thank Carlo Galuzzi for managing the financial issues, Sebastian Isaza for maintaining the website and publicizing the event, Zubair Nawaz for managing the submission system, and Dimitris Theodoropoulos and Carlo Galuzzi (again) for preparing the workshop proceedings. We also thank Lidwina Tromp for her continuous effort in the workshop organization.

We hope that the attendees enjoyed the SAMOS IX workshop in all its aspects, including many informal discussions and gatherings. We trust that you will find this year's SAMOS workshop proceedings enriching and interesting.

July 2009

Koen Bertels
Nikitas Dimopoulos
Cristina Silvano
Stephan Wong


Organization

General Co-chairs

N. Dimopoulos  University of Victoria, Canada
S. Wong  TU Delft, The Netherlands

Program Co-chairs

K. Bertels  TU Delft, The Netherlands
C. Silvano  Politecnico di Milano, Italy

Special Session Co-chairs

L. Carro  UFRGS, Brazil
E. Deprettere  Leiden University, The Netherlands
C. Galuzzi  TU Delft, The Netherlands
A. Varbanescu  TU Delft, The Netherlands
S. Wong  TU Delft, The Netherlands

Proceedings Co-chairs

C. Galuzzi  TU Delft, The Netherlands
D. Theodoropoulos  TU Delft, The Netherlands

Web and Publicity Chair

S. Isaza TU Delft, The Netherlands

Submissions Chair

Z. Nawaz TU Delft, The Netherlands

Finance Chair

C. Galuzzi TU Delft, The Netherlands

Symposium Board

S. Bhattacharyya  University of Maryland, USA
G.N. Gaydadjiev  TU Delft, The Netherlands



J. Glossner  Sandbridge Technologies, USA
A.D. Pimentel  University of Amsterdam, The Netherlands
J. Takala  Tampere University of Technology, Finland (Chairperson)

Steering Committee

L. Carro  UFRGS, Brazil
E. Deprettere  Leiden University, The Netherlands
N. Dimopoulos  University of Victoria, Canada
T. D. Hamalainen  Tampere University of Technology, Finland
S. Wong  TU Delft, The Netherlands

Program Committee

C. Basto  NXP, USA
J. Becker  Karlsruhe University, Germany
M. Berekovic  TU Braunschweig, Germany
S. Chakraborty  University of Singapore, Singapore
F. Ferrandi  Politecnico di Milano, Italy
G. Fettweis  TU Dresden, Germany
J. Flich  Technical University of Valencia, Spain
W. Fornaciari  Politecnico di Milano, Italy
P. French  TU Delft, The Netherlands
K. Goossens  NXP, The Netherlands
D. Guevorkian  Nokia, Finland
R. Gupta  University of California Riverside, USA
C. Haubelt  University of Erlangen-Nuremberg, Germany
M. Hannikainen  Tampere University of Technology, Finland
D. Iancu  Sandbridge Technologies, USA
V. Iordanov  Philips, The Netherlands
H. Jeschke  University of Hannover, Germany
C. Jesshope  University of Amsterdam, The Netherlands
W. Karl  University of Karlsruhe, Germany
M. Katevenis  FORTH-ICS and University of Crete, Greece
A. Koch  TU Darmstadt, Germany
K. Kuchcinski  Lund University, Sweden
D. Liu  Linkoping University, Sweden
W. Luk  Imperial College, UK
J. McAllister  Queen's University of Belfast, UK
D. Milojevic  Universite Libre de Bruxelles, Belgium
A. Moshovos  University of Toronto, Canada
T. Mudge  University of Michigan, USA
N. Navarro  Technical University of Catalonia, Spain
A. Orailoglu  University of California San Diego, USA
B. Pottier  Universite de Bretagne Occidentale, France



K. Rudd  Intel, USA
T. Sauter  Austrian Academy of Sciences, Austria
P-M. Seidel  SMU University, USA
H. Schroder  University of Dortmund, Germany
F. Silla  Technical University of Valencia, Spain
M. Sima  University of Victoria, Canada
G. Theodoridis  Aristotle University of Thessaloniki, Greece
L. Vintan  University of Sibiu, Romania

Reviewers

Aaltonen, Timo
Agosta, Giovanni
Ali, Zeyshan
Alvarez, Mauricio
Arnold, Oliver
Arpinen, Tero
Azevedo, Arnaldo
Basto, Carlos
Becker, Juergen
Becker, Tobias
Berekovic, Mladen
Blume, Steffen
Bournoutian, Garo
Buchty, Rainer
Chakraborty, Samarjit
Chen, MingJing
Ciobanu, Catalin
Deprettere, Ed
Dimitrakopoulos, Giorgos
Dimopoulos, Nikitas
Ehliar, Andreas
Feng, Min
Ferrandi, Fabrizio
Fettweis, Gerhard
Flatt, Holger
Flich, Jose
Flynn, Michael
Fornaciari, William
French, Paddy
Galuzzi, Carlo
Gelado, Isaac
Glossner, John
Goossens, Kees
Guevorkian, David

Gupta, Rajiv
Hanke, Mathias
Hannikainen, Marko
Haubelt, Christian
Iancu, Daniel
Isaza, Sebastian
Jeschke, Hartwig
Jesshope, Chris
Jin, Qiwei
Kakarountas, Athanasios
Karl, Wolfgang
Karlstrom, Per
Katevenis, Manolis
Kellomaki, Pertti
Klussmann, Heiko
Kuchcinski, Krzysztof
Lee, Kwangyoon
Limberg, Torsten
Liu, Dake
Luk, Wayne
Mamidi, Suman
Martin-Langerwerf, Javier
Martorell, Xavier
McAllister, John
Merino, Julio
Milojevic, Dragomir
Moshovos, Andreas
Mudge, Trevor
Nagarajan, Vijay
Najjar, Walid
Navarro, Nacho
Nikolopoulos, Dimitrios
Nolte, Norman
Norkin, Andrey



Orailoglu, Alex
Pilato, Christian
Pottier, Bernard
Rudd, Kevin
Sauter, Thilo
Sazeides, Yiannakis
Schroder, Hartmut
Schulte, Michael
Seo, Sangwon
Silla, Federico
Silvano, Cristina
Sima, Mihai
Sima, Vlad-Mihai
Spinean, Bogdan

Takala, Jarmo
Theodoridis, George
Thomas, David
Tian, Chen
Tsoi, Brittle
Vintan, Lucian
Westermann, Peter
Woh, Mark
Wong, Stephan
Wu, Di
Yang, Chengmo
Zaccaria, Vittorio


Table of Contents

Beachnote

What Else Is Broken? Can We Fix It? . . . . . 1
Yale Patt

Architectures for Multimedia

Programmable and Scalable Architecture for Graphics Processing Units . . . . . 2
Carlos S. de La Lama, Pekka Jaaskelainen, and Jarmo Takala

The Abstract Streaming Machine: Compile-Time Performance Modelling of Stream Programs on Heterogeneous Multiprocessors . . . . . 12
Paul M. Carpenter, Alex Ramirez, and Eduard Ayguade

CABAC Accelerator Architectures for Video Compression in Future Multimedia: A Survey . . . . . 24
Yahya Jan and Lech Jozwiak

Programmable Accelerators for Reconfigurable Video Decoder . . . . . 36
Tero Rintaluoma, Timo Reinikka, Joona Rouvinen, Jani Boutellier, Pekka Jaaskelainen, and Olli Silven

Scenario Based Mapping of Dynamic Applications on MPSoC: A 3D Graphics Case Study . . . . . 48
Narasinga Rao Miniskar, Elena Hammari, Satyakiran Munaga, Stylianos Mamagkakis, Per Gunnar Kjeldsberg, and Francky Catthoor

Multiple Description Scalable Coding for Video Transmission over Unreliable Networks . . . . . 58
Roya Choupani, Stephan Wong, and Mehmet R. Tolun

Multi/Many Cores Architectures

Evaluation of Different Multithreaded and Multicore Processor Configurations for SoPC . . . . . 68
Sascha Uhrig

Implementing Fine/Medium Grained TLP Support in a Many-Core Architecture . . . . . 78
Roberto Giorgi, Zdravko Popovic, and Nikola Puzovic



Implementation of W-CDMA Cell Search on a FPGA Based Multi-Processor System-on-Chip with Power Management . . . . . 88
Roberto Airoldi, Fabio Garzia, Tapani Ahonen, Dragomir Milojevic, and Jari Nurmi

A Multiprocessor Architecture with an Omega Network for the Massively Parallel Model GCA . . . . . 98
Christian Schack, Wolfgang Heenes, and Rolf Hoffmann

VLSI Architectures Design

Towards Automated FSMD Partitioning for Low Power Using Simulated Annealing . . . . . 108
Nainesh Agarwal and Nikitas J. Dimopoulos

Radix-4 Recoded Multiplier on Quantum-Dot Cellular Automata . . . . . 118
Ismo Hanninen and Jarmo Takala

Prediction in Dynamic SDRAM Controller Policies . . . . . 128
Ying Xu, Aabhas S. Agarwal, and Brian T. Davis

Inversion/Non-inversion Implementation for an 11,424 Gate-Count Dynamic Optically Reconfigurable Gate Array VLSI . . . . . 139
Shinichi Kato and Minoru Watanabe

Architecture Modeling and Exploration Tools

Visualization of Computer Architecture Simulation Data for System-Level Design Space Exploration . . . . . 149
Toktam Taghavi, Mark Thompson, and Andy D. Pimentel

Modeling Scalable SIMD DSPs in LISA . . . . . 161
Peter Westermann and Hartmut Schroder

NoGAP: A Micro Architecture Construction Framework . . . . . 171
Per Karlstrom and Dake Liu

A Comparison of NoTA and GENESYS . . . . . 181
Bernhard Huber and Roman Obermaisser

Special Session 1: Instruction-Set Customization

Introduction to Instruction-Set Customization . . . . . 193
Carlo Galuzzi



Constraint-Driven Identification of Application Specific Instructions in the DURASE System . . . . . 194
Kevin Martin, Christophe Wolinski, Krzysztof Kuchcinski, Antoine Floch, and Francois Charot

A Generic Design Flow for Application Specific Processor Customization through Instruction-Set Extensions (ISEs) . . . . . 204
Kingshuk Karuri, Rainer Leupers, Gerd Ascheid, and Heinrich Meyr

Runtime Adaptive Extensible Embedded Processors — A Survey . . . . . 215
Huynh Phung Huynh and Tulika Mitra

Special Session 2: The Future of Reconfigurable Computing and Processor Architectures

Introduction to the Future of Reconfigurable Computing and Processor Architectures . . . . . 226
Luigi Carro and Stephan Wong

An Embrace-and-Extend Approach to Managing the Complexity of Future Heterogeneous Systems . . . . . 227
Rainer Buchty, Mario Kicherer, David Kramer, and Wolfgang Karl

Applying the Stream-Based Computing Model to Design Hardware Accelerators: A Case Study . . . . . 237
Frederico Pratas and Leonel Sousa

Reconfigurable Multicore Server Processors for Low Power Operation . . . . . 247
Ronald G. Dreslinski, David Fick, David Blaauw, Dennis Sylvester, and Trevor Mudge

Reconfigurable Computing in the New Age of Parallelism . . . . . 255
Walid Najjar and Jason Villarreal

Reconfigurable Multithreading Architectures: A Survey . . . . . 263
Pavel G. Zaykov, Georgi K. Kuzmanov, and Georgi N. Gaydadjiev

Special Session 3: Mastering Cell BE and GPU Execution Platforms

Introduction to Mastering Cell BE and GPU Execution Platforms . . . . . 275
Ed Deprettere and Ana L. Varbanescu

Efficient Mapping of Multiresolution Image Filtering Algorithms on Graphics Processors . . . . . 277
Richard Membarth, Frank Hannig, Hritam Dutta, and Jurgen Teich



Implementing Blocked Sparse Matrix-Vector Multiplication on NVIDIA GPUs . . . . . 289
Alexander Monakov and Arutyun Avetisyan

Experiences with Cell-BE and GPU for Tomography . . . . . 298
Sander van der Maar, Kees Joost Batenburg, and Jan Sijbers

Realizing FIFO Communication When Mapping Kahn Process Networks onto the Cell . . . . . 308
Dmitry Nadezhkin, Sjoerd Meijer, Todor Stefanov, and Ed Deprettere

Exploiting Locality on the Cell/B.E. through Bypassing . . . . . 318
Pieter Bellens, Josep M. Perez, Rosa M. Badia, and Jesus Labarta

Exploiting the Cell/BE Architecture with the StarPU Unified Runtime System . . . . . 329
Cedric Augonnet, Samuel Thibault, Raymond Namyst, and Maik Nijhuis

Author Index . . . . . 341


What Else Is Broken? Can We Fix It?

Yale Patt

The University of Texas at Austin

Abstract. The founder and soul of this conference, Professor Stamatis Vassiliadis, always wanted a Keynote on the beach. A keynote without PowerPoint, air conditioning, and all the other usual comforts of keynotes, comforts both for the speaker and for the audience. After all, the great thinkers of this ancient land did their thinking, teaching, and arguing without PowerPoint and without air conditioning. But they were they and we are we, and no sane SAMOS keynote speaker would put himself in the same league with those masters.

Nonetheless, Stamatis wanted it, and I never found it easy to say no to Stamatis, so last year at SAMOS VIII, I agreed to give a Keynote on the Beach. It has been subsequently relabeled The Beachnote, and I have been asked to do it again.

The question of course is what subject to explore in this setting, where the sound of the speaker's voice competes with the sounds of the waves banging against the shore, where the image of the speaker's gestures competes with the image of the blue sky, bright sun, and hills of Samos. I decided last summer to choose a meta-topic, rather than a hard core technical subject: "Is it broken?", with particular emphasis on professors (are they ready to teach, are they ready to do research?) and students (are they learning, is their education preparing them for what is needed after they graduate?).

My sense is that for this environment, a meta-topic is the right model, and so I propose to visit it again. For example: our conferences and journals. Are they broken? Can we fix them? Somewhat more technical: the interface between the software that people write to solve problems and the hardware that has to run that software. Is it broken? Can we fix it? These are just examples of some of the things we might explore in this year's Beachnote. As I said last year, I will welcome other suggestions from the audience as to what they think is broken. My hope is to have us all engaged in identifying and discussing some of the fundamental problems that plague our community.

K. Bertels et al. (Eds.): SAMOS 2009, LNCS 5657, p. 1, 2009.


Programmable and Scalable Architecture for Graphics Processing Units

Carlos S. de La Lama1, Pekka Jaaskelainen2, and Jarmo Takala2

1 Universidad Rey Juan Carlos, Department of Computer Architecture, Computer Science and Artificial Intelligence, C/ Tulipan s/n, 28933 Mostoles, Madrid, Spain
[email protected]

2 Tampere University of Technology, Department of Computer Systems, Korkeakoulunkatu 10, 33720 Tampere, Finland
[email protected], [email protected]

Abstract. Graphics processing is an application area with a high level of parallelism at the data level and at the task level. Therefore, graphics processing units (GPU) are often implemented as multiprocessing systems with high performance floating point processing and application specific hardware stages for maximizing the graphics throughput.

In this paper we evaluate the suitability of Transport Triggered Architectures (TTA) as a basis for implementing GPUs. TTA improves scalability over the traditional VLIW-style architectures, making it interesting for computationally intensive applications. We show that TTA provides high floating point processing performance while allowing more programming freedom than vector processors.

Finally, one of the main features of the presented TTA-based GPU design is its fully programmable architecture, making it a suitable target for general purpose computing on GPU APIs which have become popular in recent years.

Keywords: GPU, GPGPU, TTA, VLIW, LLVM, GLSL, OpenGL.

1 Introduction

3D graphics processing can be seen as a compound of sequential stages applied to a set of input data. Commonly, graphics processing systems are abstracted as so-called graphics pipelines, with only minor differences between the various existing APIs and implementations. Therefore, stream processing [1], where a number of kernels (user defined or fixed) are applied to a stream of data of the same type, is often thought of as the computing paradigm of graphics processing units.

Early 3D accelerating GPUs were essentially designed to perform a fixed set of operations in an effective manner, with no capabilities to customize this process [2]. Later, some vendors started to add programmability to their GPU products, leading to the standardization of "shading languages". Both of the major graphics APIs (OpenGL and DirectX) proposed their own implementation of such languages. DirectX introduced the High Level Shading Language [3], while OpenGL defined the OpenGL Shading Language (GLSL) [4], first supported as an optional extension to OpenGL 1.4 and later becoming part of the standard in OpenGL 2.0.

GLSL is similar to the standard C language, but includes some additional data types for vectors and matrices, and library functions to perform the common operations with the data types. Programs written in GLSL (called shaders) can customize the behavior of two specific stages of the OpenGL graphics pipeline (dashed boxes in Figure 1) [5]. Vertex shaders are applied to the input points defining the vertices of the graphics primitives (such as points, lines or polygons) in a 3D coordinate system called model space.
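As a minimal sketch (not from the paper), a GLSL vertex shader of the kind described above might look as follows; gl_Vertex, gl_Color, gl_ModelViewProjectionMatrix, and gl_Position are built-in variables of the OpenGL 2.x GLSL interface:

```glsl
// Minimal vertex shader: transforms each vertex from model space to
// clip space and forwards a per-vertex color to the rasterizer.
varying vec4 color;   // interpolated per fragment by the rasterizer

void main(void)
{
    color = gl_Color;
    gl_Position = gl_ModelViewProjectionMatrix * gl_Vertex;
}
```

A matching fragment shader would read the interpolated varying and write the final fragment color.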

Depending on the type of primitive being drawn, the rasterizer then generates a number of visible points between the transformed vertices. These new points are called fragments. Each drawn primitive usually produces as many fragments as there are covered pixels on the screen. The rasterizer interpolates several attributes, such as color or texture coordinates, between vertices to find the corresponding value (called a varying) for each fragment, and the programmable fragment shader can postprocess and modify those values.

The movement to allow programming of parts of the graphics pipeline led to GPU vendors providing custom APIs for using their GPUs for more general purpose computing (GPGPU) [6], extending the application domain of GPUs to a wide range of programs with highly parallelizable computation. Finally, at the end of 2008, a vendor-neutral API for programming heterogeneous platforms (which can also include GPU-like resources) was standardized. The OpenCL standard [7] was welcomed by the GPGPU community as a generic alternative to platform-specific GPGPU APIs such as NVIDIA's CUDA [8].

This paper presents design work in progress on a programmable and scalable GPU architecture based on the Transport Triggered Architecture (TTA), a class of VLIW architectures. The proposed architecture, which we call TTAGPU, is fully programmable and implements all of the graphics pipeline in software. TTAGPU can be scaled at the instruction and task level to produce GPUs with varying size/performance ratios, enabling its use in both embedded and desktop systems. Furthermore, the full programmability allows it to be adapted for the GPGPU style of computation and, for example, to support the OpenCL API. While common practice in GPU design goes through the intensive use of data-parallel models, our approach tries to exploit parallelism at the instruction level, thus avoiding the programmability penalty caused by SIMD operations.

Fig. 1. Simplified view of the customizable OpenGL pipeline: Vertices → Vertex Shader → Transformed vertices → Rasterizer → Fragments → Fragment Shader → Colored fragments → Framebuffer → To screen

The rest of the paper is organized as follows. Section 2 discusses briefly the related work, Section 3 describes the main points of the TTAGPU design, Section 4 provides some preliminary results on the floating point scalability of the architecture, and Section 5 concludes the paper and discusses future directions.

2 Related Work

The first generation of programmable GPUs included specialized hardware for vertex processing and fragment processing as separate components, together with texture mapping units and rasterizers, set up in a multi-way stream configuration to exploit the inherent parallelism present in 3D graphics algorithms.

As modern applications needed to customize the graphics processing to a higher degree, it became obvious that such heterogeneous architectures were not the ideal choice. Therefore, with the appearance of the unified shader model in 2007 [9], the differences between vertex and fragment shaders began to disappear. Newer devices have a number of unified shaders that can do the same arithmetic operations and access the same buffers (although some differences in the instruction sets are still present). This provides better programmability for the graphics pipeline, while the fixed hardware in critical parts (like the rasterizer) ensures high performance. However, the stream-like connectivity between computing resources still limits the customization of the processing algorithm. The major GPU vendors (NVIDIA and ATI) follow this approach in their latest products [10,11].

The performance of the unified shader is evaluated in [12] by means of implementing a generic GPU microarchitecture and simulating it. The main conclusion of the paper is that although graphical performance improves only marginally with respect to non-unified shader architectures, it has real benefits in terms of efficiency per area. The shader performance analysis in the paper uses shaders implemented in the OpenGL ARB assembly-like low-level language. Although already approved by the Architecture Review Board, this is still an extension to the OpenGL standard, while GLSL is already part of it, which is why we have used GLSL as our shader program input language. Furthermore, new trends in parallel non-graphical computation on GPUs are geared towards using high-level languages.

A different approach to achieving GPU flexibility is being proposed by Intel with its Larrabee processor [13]. Instead of starting from a traditional GPU architecture, they propose an x86-compatible device with additional floating-point units for enhanced arithmetic performance. Larrabee includes very little specialized hardware, the most notable exception being the texture mapping unit. Instead, the graphics pipeline is implemented in software, making it easier to modify and customize. Larrabee is to be deployed as a "many-core" solution, with the number of cores at 64 or more. Each core comprises a 512-bit vector FPU capable of 16 simultaneous single-precision floating-point operations.


Programmable and Scalable Architecture for Graphics Processing Units 5

3 TTAGPU Architecture

The goal of the TTAGPU design is to implement an OpenGL-compliant graphics API which is accelerated with a customized TTA processor, supports programming of the graphics pipeline as described in the OpenGL 2.1 specification [14] (GLSL-coded shaders), and allows high-level language programmability, especially with support for the OpenCL API in mind. Therefore, the design follows a software-based approach, similar to Larrabee, with additional flexibility provided through programmability. However, as it is not tied to the x86 architecture, the datapath resource set can be customized more freely to accelerate the GPU application domain.

3.1 Transport Triggered Architectures

VLIWs are considered interesting processor alternatives for applications with high requirements for data processing performance [15] and with limited control flow, such as graphics processing.

The Transport Triggered Architecture (TTA) is a modular processor architecture template with a strong resemblance to VLIW architectures. The main difference between TTAs and VLIWs lies in how they are programmed: instead of defining which operations are started in which function units (FU) at which instruction cycles, TTA programs are defined as data transports between the register files (RF) and FUs of the datapath. Operations are started as a side-effect of writing operand data to the "triggering port" of an FU. Figure 2 presents a simple example TTA processor [16].

The programming model of VLIW imposes limitations on scaling the number of FUs in the datapath. Upscaling the number of FUs has been problematic in VLIWs due to the need to include as many write and read ports in the RFs as there are FU operations potentially completed and started at the same time. Additional ports increase the RF complexity, resulting in larger area and critical path delay. Also, adding an FU to the VLIW datapath potentially requires new bypassing paths from the FU's output ports to the input ports of the other FUs in the datapath, which increases the interconnection network complexity. Thanks to its programmer-visible interconnection network, the TTA datapath

Fig. 2. Example of a TTA processor


6 C.S. de La Lama, P. Jaaskelainen, and J. Takala

can support more FUs with simpler RFs [17]. Because the scheduling of data transports between datapath units is programmer-defined, there is no obligation to scale the number of RF ports according to the number of FUs [18]. In addition, the datapath connectivity can be tailored to the application at hand, adding only the bypassing paths that benefit the application the most.

In order to support fast automated design of TTA processors, a toolset project called TTA-based Codesign Environment (TCE) was started in 2003 at Tampere University of Technology [19]. TCE provides a full design flow from software written in C down to a parallel TTA program image and a VHDL implementation of the processor. However, as TTAGPU was evaluated only at the architectural level for this paper, the most important tools used in the design were its cycle-accurate instruction set simulator and the compiler, both of which automatically adapt to the set of machine resources in the designed processors.

Because TTA is a statically scheduled architecture with a high level of detail exposed to the programmer, the runtime efficiency of the end results produced with the design toolset depends heavily on the quality of the compiler. TCE uses the LLVM Compiler Infrastructure [20] as the backbone for its compiler toolchain (later referred to as 'tcecc'), and thus benefits from its global optimizations such as aggressive dead code elimination and link-time inlining. In addition, the TCE code generator includes an efficient instruction scheduler with TTA-specific optimizations, and a register allocator optimized to produce better instruction-level parallelism for the post-pass scheduler.

3.2 Scaling on the Instruction Level

The TTAGPU OpenGL implementation is structured into two clearly separated parts. The first part is the API layer, which is meant to run on the main CPU in a real scenario. It communicates with the GPU through a command FIFO, each command having a maximum of 4 floating-point arguments. The second part is the software implementation of the OpenGL graphics pipeline running on the TTA. We have tried to minimize the number of buffers to make the pipeline stages as long as possible, as this gives the compiler more optimization opportunities.

The OpenGL graphics pipeline code includes both the software implementation of the pipeline routines themselves and the user-defined shader programs written in GLSL. For the graphics pipeline code, we have so far implemented a limited version capable of simple rendering, allowing us to link against real OpenGL demos with no application code modification. Because tcecc already supports compilation of C and C++, it is possible to compile the user-defined GLSL code with little additional effort by using C++ operator overloading and a simple preprocessor, and to merge the shader code with the C implementation of the graphics pipeline.

Compiling GLSL code together with the C-based implementation of the graphics pipeline allows user-provided shaders to override the programmable parts, while providing the additional advantage of global optimizations and code specialization performed after the final program linking. For example, if a custom shader program does not use a result produced by some of the fixed functionality of the



    for i = 1...16 do
        f = produce_fragment()          // the rasterizer code
        f = glsl_fragment_processor(f)
        write_to_framebuffer_fifo(f)

Fig. 3. Pseudocode of the combined rasterizer/fragment shader loop body

graphics pipeline code, that pipeline code will be removed by the dead code elimination optimization. That is, certain types of fragment shader programs compiled with the pipeline code can lead to higher rasterizer performance.

Preliminary profiling of the current software graphics pipeline implementation showed that the bottleneck so far is the rasterizer and, depending on its complexity, the user-defined fragment shader. This makes sense, as the data density in the pipeline explodes after rasterization, since a high number of fragments is usually generated by each primitive. For example, a line can be defined using two vertices, from which the rasterizer produces enough fragments to represent all the visible pixels between the two vertices. Thus, in TTAGPU we concentrated on optimizing the rasterizer stage by creating a specialized rasterizer loop which processes 16 fragments at a time.

The combined rasterizer/custom fragment shader loop (pseudocode shown in Fig. 3) is fully unrolled by the compiler, effectively implementing a combined 16-way rasterizer and fragment processor in software. The aggressive procedure inlining converts the fully unrolled loop to a single big basic block, with the actual rasterizer code producing a fragment and the user-defined fragment shader processing it, without the need for large buffers between the stages. In addition, the unrolled loop bodies can often be made completely independent from each other, improving the potential for a high level of ILP exposed to the instruction scheduler. In order to avoid extra control flow in the loop, which would make it harder to extract instruction-level parallelism (ILP) statically, we always process 16 fragments at a time "speculatively" and discard the possible extra fragments at the end of the computation.

3.3 Scaling on the Task Level

In order to achieve scalability on the task level, we placed hardware-based FIFO buffers at certain points in the software graphics pipeline. The idea is to add "frontiers" at suitable positions in the pipeline, allowing multiple processors to produce and process the FIFO items arbitrarily. It should be noted, however, that it is entirely possible in this configuration that the same processor produces and processes the items in the FIFOs. In this type of single-core setting, the hardware FIFO merely reduces the memory accesses required to pass data between the graphics pipeline stages.

The guidelines followed when placing these buffers were: 1) separate stages with different data densities, 2) place the FIFOs in such positions that the potential for ILP at each stage is as high as possible, and 3) compile the user-defined shader code and related graphics pipeline code together to maximize code specialization and ILP.



[Figure: the OpenGL API and command FIFO feeding the TTAGPU driver tasks: vertex processing, clipping, rasterization / fragment processing, and framebuffer writing, connected by the vertex and fragment FIFOs.]

Fig. 4. High-level software structure

These three points are met by placing two hardware FIFOs in the pipeline. The first is placed after vertex processing, as the number of processed vertices needed for primitive rasterization changes with the rendering mode (points, lines or polygons), resulting in varying data density. This FIFO allows vertex processing to proceed until enough vertices for primitive processing are available. It also serves as an entry point for new vertices generated during clipping.

The second FIFO is placed after fragment processing and before the framebuffer writing stage. Framebuffer writing has some additional processing to perform (ownership test, blending, etc.) that cannot be performed completely on a per-fragment basis, as it depends on the results of previous framebuffer writes. This FIFO allows us to create the highly parallelizable basic block performing rasterization and fragment processing with no memory writes, as the framebuffer writing is done with a custom operation accessing the FIFO.

The hardware-supported FIFOs have a set of status registers that can be used to poll for FIFO emptiness and fullness. This enables us to use lightweight cooperative multithreading to hide the FIFO waiting time with the processing of elements from the other FIFOs. The software implementation structure is shown in Figure 4.

The clean isolation between stages allows the system to connect sets of processors that access the FIFO elements as producers and/or consumers, making the system flexible and scalable at the task level. Scaling at the task level can be done simply by adding either identical TTAs or even processors with completely different architectures to the system. The only requirement placed on the added processors is access to the hardware FIFOs.

4 Results

In order to evaluate the ILP scalability of the TTAGPU in the combined rasterizer/fragment processor loop, we implemented a simple example OpenGL



Table 1. Resources in the TTAGPU variations

resource                     1 FPU   2 FPU   4 FPU   8 FPU   16 FPU
floating point units           1       2       4       8       16
32 bit x 32 register files     1       2       4       8       16
1 bit boolean registers        2       4       8      16       32
transport buses                3       6      12      24       48
integer ALUs                   1       1       1       1        1
32 bit load-store units        1       1       1       1        1
32 bit shifters                1       1       1       1        1

application that renders a number of lines randomly to the screen and colors them with a simple fragment shader.

The goal of this experiment was to see how well a single TTAGPU core scales at the instruction level merely by adding multiples of a resource set to the architecture and recompiling the software using tcecc. The resource set we used for scaling included a single FPU, three transport buses, and a register file with 32 general-purpose 32-bit registers. The resources in the benchmarked TTAGPU variations are listed in Table 1.

In order to produce realistic cycle counts for floating-point code, we used the pipeline model of the MIPS R4000 floating-point unit, a description of which was available in the literature [21]. The unit implements eight floating-point operations that share eleven different pipeline resources. However, our benchmark used only addition, division, multiplication and comparison of floating-point values.

The benchmark was executed using the TCE cycle-accurate processor architecture simulator for TTAGPUs with different numbers of resource sets. Figure 5 shows the speedup in the unrolled rasterizer loop from just adding multiples of the "scaling resource sets" to the machine and recompiling the code. The figure indicates that the ILP scalability of the heavily utilized rasterizer loop is almost linear, thanks to the aggressive global optimizations and a register allocator that avoids the reuse of registers as much as possible, reducing the number of false dependencies limiting the parallelization between the

Fig. 5. Scalability of the rasterizer loop with different numbers of floating point resources (speedups of 1.0x, 1.8x, 3.8x, 7.2x and 11.5x for 1, 2, 4, 8 and 16 FPU resource sets, respectively)



loop iterations. The scaling gets worse when approaching the 16-FPU version because of a hard limit of about 500 general-purpose registers in our compiler, and because the loop was implemented with only 16 iterations. With a larger iteration count there would be more operations with which to hide the latencies of the previous iterations.

5 Conclusions

In this paper we have proposed a mainly software-based implementation of a graphics processing unit based on the scalable TTA architecture. We have shown that TTA is an interesting alternative for applications where high data processing performance is required, as is the case with GPUs. TTA provides improved scalability at the instruction level in comparison to VLIWs, due to its programmer-visible interconnection network.

The scalability of the proposed TTAGPU on both the task and the instruction level makes the system an interesting platform also for other data-parallel applications designed to be executed on GPU-type platforms. Evaluation of the proposed TTAGPU platform for supporting applications written using the OpenCL 1.0 standard [7] is ongoing. Additional future work includes completing the OpenGL API implementation, evaluating the multi-core performance of TTAGPU, and implementing an actual hardware prototype.

Acknowledgments. This research was partially funded by the Academy of Finland, the Nokia Foundation and the Finnish Center for International Mobility (CIMO).

References

1. Stephens, R.: A survey of stream processing. Acta Informatica 34(7), 491–541 (1997)

2. Crow, T.S.: Evolution of the Graphical Processing Unit. Master's thesis, University of Nevada, Reno, NV (December 2004)

3. St-Laurent, S.: The Complete Effect and HLSL Guide. Paradoxal Press (2005)

4. Kessenich, J.: The OpenGL Shading Language. 3DLabs, Inc. (2006)

5. Luebke, D., Humphreys, G.: How GPUs work. Computer 40(2), 96–100 (2007)

6. Owens, J.D., Luebke, D., Govindaraju, N., Harris, M., Kruger, J., Lefohn, A.E., Purcell, T.J.: A Survey of General-Purpose Computation on Graphics Hardware. Computer Graphics Forum 26(1), 80–113 (2007)

7. Khronos Group: OpenCL 1.0 Specification (February 2009), http://www.khronos.org/registry/cl/

8. Halfhill, T.R.: Parallel Processing with CUDA. Microprocessor Report (January 2008)

9. Lindholm, E., Nickolls, J., Oberman, S., Montrym, J.: NVIDIA Tesla: A unified graphics and computing architecture. IEEE Micro 28(2), 39–55 (2008)

10. Wasson, S.: NVIDIA's GeForce 8800 graphics processor. Tech Report (November 2007)



11. Wasson, S.: AMD Radeon HD 2900 XT graphics processor: R600 revealed. Tech Report (May 2007)

12. Moya, V., Gonzalez, C., Roca, J., Fernandez, A., Espasa, R.: Shader Performance Analysis on a Modern GPU Architecture. In: 38th IEEE/ACM Int. Symp. Microarchitecture, Barcelona, Spain, November 12-16. IEEE Computer Society, Los Alamitos (2005)

13. Seiler, L., Carmean, D., Sprangle, E., Forsyth, T., Abrash, M., Dubey, P., Junkins, S., Lake, A., Sugerman, J., Cavin, R., Espasa, R., Grochowski, E., Juan, T., Hanrahan, P.: Larrabee: A Many-Core x86 Architecture for Visual Computing. ACM Transactions on Graphics 27(18) (August 2008)

14. Segal, M., Akeley, K.: The OpenGL Graphics System: A Specification. Silicon Graphics, Inc. (2006)

15. Colwell, R.P., Nix, R.P., O'Donnell, J.J., Papworth, D.B., Rodman, P.K.: A VLIW architecture for a trace scheduling compiler. In: ASPLOS-II: Proc. second int. conf. on Architectural support for programming languages and operating systems, pp. 180–192. IEEE Computer Society Press, Los Alamitos (1987)

16. Corporaal, H.: Microprocessor Architectures: from VLIW to TTA. John Wiley & Sons, Chichester (1997)

17. Corporaal, H.: TTAs: missing the ILP complexity wall. Journal of Systems Architecture 45(12-13), 949–973 (1999)

18. Hoogerbrugge, J., Corporaal, H.: Register file port requirements of Transport Triggered Architectures. In: MICRO 27: Proc. 27th Int. Symp. Microarchitecture, pp. 191–195. ACM Press, New York (1994)

19. Jaaskelainen, P., Guzma, V., Cilio, A., Takala, J.: Codesign toolset for application-specific instruction-set processors. In: Proc. Multimedia on Mobile Devices 2007, pp. 65070X-1–65070X-11 (2007), http://tce.cs.tut.fi/

20. Lattner, C., Adve, V.: LLVM: A compilation framework for lifelong program analysis & transformation. In: Proc. Int. Symp. Code Generation and Optimization, Palo Alto, CA, March 20-24, p. 75 (2004)

21. Hennessy, J.L., Patterson, D.A.: Computer Architecture: A Quantitative Approach, 3rd edn. Morgan Kaufmann Publishers Inc., San Francisco (2003)


The Abstract Streaming Machine:

Compile-Time Performance Modelling of Stream Programs on Heterogeneous Multiprocessors

Paul M. Carpenter, Alex Ramirez, and Eduard Ayguade

Barcelona Supercomputing Center, C/Jordi Girona, 31, 08034 Barcelona, Spain
{paul.carpenter,alex.ramirez,eduard.ayguade}@bsc.es

Abstract. Stream programming offers a portable way for regular applications such as digital video, software radio, multimedia and 3D graphics to exploit a multiprocessor machine. The compiler maps a portable stream program onto the target, automatically sizing communications buffers and applying optimizing transformations such as task fission or fusion, unrolling loops and aggregating communication. We present a machine description and performance model for an iterative stream compilation flow, which represents the stream program running on a heterogeneous multiprocessor system with distributed or shared memory. The model is a key component of the ACOTES open-source stream compiler currently under development. Our experiments on the Cell Broadband Engine show that the predicted throughput has a maximum relative error of 15% across our benchmarks.

1 Introduction

Many people [1] have recognized the need to change the way software is written to take advantage of multi-core systems [2] and distributed memory [3,4,5]. This paper is concerned with applications such as digital video, software radio, signal processing and 3D graphics, all of which may be represented as block diagrams, in which independent blocks communicate and synchronize only via regular streams of data. Such applications have high task and data parallelism, which is hidden when the program is written in C or a similar sequential programming language, requiring the programmer to apply high-level optimizations such as task fusion, fission and blocking transformations by hand. Recent work on stream programming languages, most notably StreamIt [6] and Synchronous Data Flow (SDF) [7], has demonstrated how a compiler may potentially match the performance of hand-tuned sequential or multi-threaded code [8].

This work is part of the ACOTES project [9], which is developing a complete open-source stream compiler for embedded systems. This compiler will automatically partition a stream program to use task-level parallelism, size communications buffers and aggregate communications through blocking. This paper describes the Abstract Streaming Machine (ASM), which represents the target system to this compiler. Figure 1 shows the iterative compilation flow, with a

K. Bertels et al. (Eds.): SAMOS 2009, LNCS 5657, pp. 12–23, 2009.c© Springer-Verlag Berlin Heidelberg 2009



[Figure: the compilation flow from SPM-annotated source through Mercurium (task fusion, allocation), then GCC with the ICI blocking plugin, producing an executable whose trace, or a trace from the ASM simulator, feeds back into the search algorithm.]

Fig. 1. The ACOTES iterative stream compiler

search algorithm determining the candidate mapping, which is compiled using Mercurium [10] and GCC. The Mercurium source-to-source convertor translates from the SPM source language [11,12], and performs task fusion and allocation. The resulting multi-threaded program is compiled using GCC, which we are extending within the project to perform blocking to aggregate computation and communication. Running the executable program generates a trace, which is analysed by the search algorithm to resolve bottlenecks. An alternative feedback path generates a trace using the ASM simulator, which is a coarse-grain model of the ASM. This path does not require recompilation, and is used when resizing buffers or to approximate the effect of fission or blocking.

2 Stream Programming

There are several definitions of stream programming, differing mostly in the handling of control flow and restrictions on the program graph topology [13]. All stream programming models, however, represent the program as a set of kernels communicating only via unidirectional streams. The producer has a blocking push primitive and the consumer has a blocking pop primitive. This programming model is deterministic provided that the kernels themselves are deterministic, there is no other means of communication between kernels, each stream has one producer and one consumer, and the kernels cannot check whether a push or pop would block at a particular time [14].

When the stream program is compiled, one or more kernels are mapped to each task, which is executed in its own thread. The communications primitives



are provided by the ACOTES run-time system, acolib, which also creates and initializes threads at the start of the computation, and waits for their completion at the end. The run-time system supports two-phase communication, and can be implemented for shared memory, distributed memory with DMA, and hardware FIFOs. On the producer side, pushAcquire returns a pointer to an empty array of np elements; the np parameter is equal to the producer's blocking factor, and is supplied during stream initialization. When the task has filled this buffer with new data, it calls pushSend to request that acolib delivers the data to the consumer. On the consumer side, popAcquire returns a pointer to the next full block of nc elements. When the consumer has finished with the data in this block, it calls popDiscard to mark the block as empty.

3 ASM Machine Description

The target is represented as a bipartite graph, with processors and memories in one partition and interconnects in the other. Figure 2 shows the topology of two example targets. Each processor and interconnect is defined using the parameters summarized in Figures 3 and 4, and described below. The machine description defines the machine visible to software, which may not exactly match the physical hardware. For example, the OS in a PlayStation 3 makes six of the eight SPEs available to software. We assume that the processors used by the stream program are not time-shared with other applications while the program is running.

Each processor is defined using the parameters shown in Figure 3(a). The details of the processor's ISA and micro-architecture are described internally to the back-end compiler, so they are not duplicated in the ASM. The processor description includes the costs of the acolib library calls. The costs of the pushSend and popAcquire primitives are given by a staircase function; i.e. a fixed cost, a

[Figure: (a) a Cell-based system, with the PPE, main memory and eight SPEs with their local stores (LS0–LS7) attached to the EIB; (b) a shared-memory system, with processors P1–P3 and their caches ($1–$3) attached to memory over a bus.]

Fig. 2. Topology of two example targets



(a) Definition of a processor

Parameter          Description                                               Value
name               Unique name in platform namespace                         'SPEn'
clockRate          Clock rate, in GHz                                        3.2
hasIO              True if the processor can perform IO                      False
addressSpace       List of the physical memories addressable by this         [(LSn,0)]
                   processor and their virtual address
pushAcqCost        Cost, in cycles, to acquire a producer buffer             448
                   (before waiting)
pushSendFixedCost  Fixed cost, in cycles, to push a block (before waiting)   1104
pushSendUnit       Number of bytes per push transfer unit                    16384
pushSendUnitCost   Incremental cost, in cycles, to push pushUnit bytes       352
popAcqFixedCost    Fixed cost, in cycles, to pop a block (before waiting)    317
popAcqUnit         Number of bytes per pop transfer unit                     16384
popAcqUnitCost     Incremental cost, in cycles, to pop popUnit bytes         0
popDiscCost        Cost, in cycles, to discard a consumer buffer             189
                   (before waiting)

(b) Definition of an interconnect

Parameter          Description                                               Value
name               Unique name in platform namespace                         'EIB'
clockRate          Clock rate, in GHz                                        1.6
elements           List of the names of the elements (processors and         ['PPE','SPE0', ..., 'SPE7']
                   memories) on the bus
interfaceDuplex    If the bus has more than one channel, then define         [True, ..., True]
                   for each processor whether it can transmit and
                   receive simultaneously on different channels
interfaceRouting   Define for each processor the type of routing from        [None, ..., None]
                   this bus: storeAndForward, cutThrough, or None
startLatency       Start latency, L, in cycles                               80
startCost          Start cost on the channel, S, in cycles                   0
bandwidthPerCh     Bandwidth per channel, B, in bytes per cycle              16
finishCost         Finish cost, F, in cycles                                 0
numChannels        Number of channels on the bus                             3
multiplexable      False for a hardware FIFO that can only support           True
                   one stream

Fig. 3. Processor and interconnect parameters of the Abstract Streaming Machine and values for the Cell Broadband Engine



block size, and an incremental cost for each complete or partial block after the first. This variable cost is necessary both for FIFOs and for distributed memory with DMA. For distributed memory, the size of a single DMA transfer is often limited by hardware, so that larger transfers require additional processor time in pushSend to program multiple DMA transfers. The discontinuity at 16K in Figure 5 is due to this effect.

The addressSpace and hasIO parameters provide constraints on the compiler mapping, but are not required to evaluate the performance of a valid mapping. The former defines the local address space of the processor; i.e. which memories are directly accessible and where they appear in local virtual memory, and is used to place stream buffers. The model assumes that the dominant bus traffic is communication via streams, so either the listed memories are private local stores, or they are shared memories accessed via a private L1 cache. In the latter case, the cache should be sufficiently effective that the cache miss traffic on the interconnect is insignificant.

The hasIO parameter defines which processors can perform system IO, and is a simple way to ensure that tasks that need system IO are mapped to a capable processor.

Each interconnect is defined using the parameters shown in Figure 3(b). The system topology is given by the elements parameter, which for a given interconnect lists the adjacent processors and memories. Each interconnect is modelled as a bus with multiple channels, which has been shown to be a good approximation to the performance observed in practice when all processors and memories on a single link are equidistant [15]. Each bus has a single unbounded queue to hold the messages ready to be transmitted, and one or more channels on which to transmit them. The compiler statically allocates streams onto buses, but the choice of channel is made at runtime. The interfaceDuplex parameter defines for each resource (i.e. processor or memory) whether it can simultaneously read and write on different channels.

The bandwidth and latency of each channel are controlled using four parameters: the start latency (L), start cost (S), bandwidth (B), and finish cost (F). In transferring a message of size n bytes, the latency of the link is given by L + S + ⌈n/B⌉, and the cost incurred on the link by S + ⌈n/B⌉ + F. This model is natural for distributed memory machines, and amounts to the assumption of cache-to-cache transfers on shared memory machines.

Hardware routing is controlled using the interfaceRouting parameter, which defines for each processor whether it can route messages from this interconnect: each entry can take the value storeAndForward, cutThrough or None.

Each memory is defined using the parameters shown in Figure 4. The latency and bandwidth figures are currently unused in the model, but may be used by the compiler to refine the estimate of the run time of each task. The memory definitions are used to determine where to place communications buffers, and provide constraints on blocking factors.

The Abstract Streaming Machine: Compile-Time Performance Modelling

Parameter   Description                         Value
name        Unique name in platform namespace   'LSn'
size        Size, in bytes                      262144
clockRate   Clock rate, in GHz                  3.2
latency     Access latency, in cycles           2
bandwidth   Bandwidth, in bytes per cycle       128

Fig. 4. Memory parameters of the Abstract Streaming Machine and values for the Cell Broadband Engine

4 ASM Program Description

The compiled stream program is a connected directed graph of tasks and point-to-point streams, as described in Section 2. All synchronization between tasks happens in the blocking acolib communications primitives described above.

A task may have complex data-dependent or irregular behaviour. The basic unit of sequencing inside a task is the subtask, which pops a fixed number of elements from each input stream and pushes a fixed number of elements on each output stream. In detail, the work function for a subtask is divided into three consecutive phases. First, the acquire phase obtains the next set of full input buffers and empty output buffers. Second, the processing phase works locally on these buffers, and is modelled using a fixed processing time, determined from a trace. Finally, the release phase discards the input buffers, and sends the output buffers, releasing the buffers in the same order they were acquired. This three-stage model is not a deep requirement of the ASM, and was introduced as a convenience in the implementation of the simulator, since our compiler will naturally generate subtasks of this form.
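The acquire/process/release structure can be sketched as follows. The Stream class below is a minimal in-memory stand-in for an acolib endpoint, invented for this sketch; real acquire calls would block until a full input buffer (or empty output buffer) is available.

```python
from collections import deque

class Stream:
    """Minimal in-memory stand-in for a blocking stream endpoint."""
    def __init__(self, bufs):
        self.free = deque(bufs)   # buffers available to acquire
        self.held = deque()       # buffers currently held by the subtask
    def acquire(self):
        buf = self.free.popleft()              # would block in a runtime
        self.held.append(buf)
        return buf
    def release(self):
        self.free.append(self.held.popleft())  # FIFO: acquisition order

def run_subtask(inputs, outputs, process):
    # Phase 1: acquire the next full input and empty output buffers.
    in_bufs = [s.acquire() for s in inputs]
    out_bufs = [s.acquire() for s in outputs]
    # Phase 2: process locally (a fixed, trace-derived cost in the ASM).
    process(in_bufs, out_bufs)
    # Phase 3: release buffers in the same order they were acquired.
    for s in inputs + outputs:
        s.release()

# Example: one input stream, one output stream, a doubling work function.
src = Stream([[1, 2, 3]])
dst = Stream([[0, 0, 0]])
run_subtask([src], [dst], lambda ins, outs:
            outs[0].__setitem__(slice(None), [2 * x for x in ins[0]]))
```

Because every popped/pushed element count is fixed per subtask, the simulator can schedule these three phases without inspecting the data itself.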

A stream is defined by the size of each element, and the location and length of either the separate producer and consumer buffers (distributed memory) or the single shared buffer (shared memory). These buffers do not have to be of the same length. If the producer or consumer task uses the peek primitive, then the buffer length should be reduced to model the effective size of the buffer, excluding the elements of history that share the buffer. The Finite Impulse Response (FIR) filters in the GNU Radio benchmark of Section 6 are described in this way. It is possible to specify a number of elements to prequeue on the stream before execution begins.
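The buffer-length adjustment for peeking consumers can be expressed directly. The paper does not give an explicit formula, so the functions below are a sketch; for a FIR filter, the assumption is the usual one that a filter with T taps keeps T - 1 elements of history in the buffer.

```python
def effective_buffer_length(buffer_len, history):
    """Usable length of a stream buffer when the consumer peeks.

    `history` elements stay resident in the buffer for peeking, so they
    are unavailable for new data and the model sees a smaller buffer.
    """
    assert 0 <= history < buffer_len, "history must leave room for new data"
    return buffer_len - history

def fir_effective_length(buffer_len, taps):
    # A FIR filter with `taps` coefficients peeks at taps - 1 past elements.
    return effective_buffer_length(buffer_len, taps - 1)

# Example: a 64-element buffer feeding a 16-tap FIR filter behaves, for
# modelling purposes, like a 49-element buffer.
```

Understating the effective length in this way keeps the model conservative: it never credits the stream with space that the history elements permanently occupy.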

5 Implementation and Methodology

We use a small suite of benchmarks and target platforms, which have been translated by hand into the description files. The benchmarks were evaluated on an IBM QS20 blade, which has two Cell processors. The producer-consumer benchmark is used to determine basic parameters, and has two actors: a producer, and