
An Operating System

for Reconfigurable Computing

Research Thesis for the Degree of Doctor of Philosophy

By Grant Brian Wigley Bachelor of Engineering in Computer Systems Engineering (Hons), University of South Australia

[email protected]

Adelaide, South Australia

April 2005

Reconfigurable Computing Lab

School of Computer and Information Science

Division of Information Technology, Engineering and Environment

The University of South Australia


Abstract

Field programmable gate arrays are a class of integrated circuit that enable logic functions and

interconnects to be programmed in almost real time. They can implement fine-grained parallel

computing architectures and algorithms in hardware that were previously the domain of

custom VLSI. Field programmable gate arrays have shown themselves useful at exploiting

concurrency in a range of applications such as text searching, image processing and

encryption. When coupled with a microprocessor, which is more suited to computation

involving complex control flow and non-time-critical requirements, they form a potentially

versatile platform commonly known as a Reconfigurable Computer.

Reconfigurable computing applications have traditionally had the exclusive use of the field

programmable gate array, primarily because the logic density of the available devices has

been comparable to the size of a single application. But with the modern FPGA

expanding beyond 10 million system gates, and through the use of dynamic reconfiguration, it

has become feasible for several applications to share a single high density device. However,

developing applications that share a device is difficult as the current design flow assumes the

exclusive use of the FPGA resources. As a consequence, the designer must ensure that

resources have been allocated for all possible combinations of loaded applications at design

time. If the sequence of application loading and unloading is not known in advance, all

resource allocation cannot be performed at design time because the availability of resources

changes dynamically.

The use of a runtime resource allocation environment modelled on a classical software

operating system would allow the full benefits of dynamic reconfiguration on high density

FPGAs to be realised. In addition to runtime resource allocation, other services provided by

an operating system such as abstraction of I/O and inter-application communication would

provide additional benefits to the users of a reconfigurable computer. This could possibly

reduce the difficulty of application development and deployment.

In this thesis, an operating system for reconfigurable computing that supports dynamically

arriving applications is presented. This is achieved by firstly developing the abstractions with

which designers implement their applications and a set of algorithm requirements that specify

the resource allocation and logic partitioning services. By combining these, an architecture of


an operating system for reconfigurable computing can be specified. A prototype

implementation on one platform with multiple applications is then presented which enables an

exploration of how the resource allocation algorithms interact amongst themselves and with

typical applications.

Results obtained from the prototype include the measurement of the performance loss in

applications, and the time overheads introduced due to the use of the operating system.

Comparisons are made with programmable logic applications run with and without the

operating system. The results show that the overheads are reasonable given the current state of

the technology of FPGAs. Formulas for predicting the user response time and application

throughput based on the fragmentation of an FPGA are then derived. Weaknesses in the

current design flows and in the architecture of current FPGAs are highlighted that must be

rectified if an operating system is to become mainstream. For the tool flows these include the

ability to pre-place and pre-route cores and to perform high-speed runtime routing. For the

FPGAs these include an optimised network, a memory management core, and a separate layer

to handle dynamic routing of the network.


Contents

1 INTRODUCTION .................................................................................................1

2 RUNTIME SUPPORT FOR RECONFIGURABLE COMPUTING ........................6

2.1 Field programmable technology.............................................................................................................8

2.1.1 Introduction ..........................................................................................................................................8

2.1.2 Reconfigurable computing architectures ............................................................................................10

2.1.3 FPGA architectures.............................................................................................................................15

2.1.4 Conclusion ..........................................................................................................................................19

2.2 Abstractions, services and runtime systems ........................................................................................20

2.2.1 Services...............................................................................................................................................20

2.2.2 Prototypes ...........................................................................................................................................24

2.2.3 Evaluation...........................................................................................................................................26

2.3 Allocation and partitioning...................................................................................................................29

2.3.1 Allocation ...........................................................................................................................................29

2.3.2 Partitioning .........................................................................................................................................32

2.4 Reconfigurable computing design flow................................................................................................35

2.4.1 Traditional design flow.......................................................................................................................35

2.4.2 Runtime application design flow ........................................................................................................38

2.5 Applications and benchmarks for reconfigurable computers ...........................................40

2.6 Conclusion..............................................................................................................42

3 METHODOLOGY ..............................................................................................43

3.1 Abstractions, architecture and design flow .........................................................................47

3.2 Resource allocation and application partitioning ...............................................................47

3.3 Operating system prototype and metrics.............................................................................48

3.4 Performance evaluation ........................................................................................................48

3.5 Conclusion..............................................................................................................................49

4 ABSTRACTIONS, ARCHITECTURE AND DESIGN FLOW .............................50

4.1 Abstractions ...........................................................................................................................................52

4.1.1 Process abstraction..............................................................................................................................52

4.1.2 Address space .....................................................................................................................................59


4.1.3 Inter-process communication..............................................................................................................61

4.1.4 Conclusion ..........................................................................................................................................67

4.2 Operating system architecture .............................................................................................................68

4.2.1 Previous reconfigurable computing runtime system architectures......................................................68

4.2.2 Proposed reconfigurable computing runtime system architecture ......................................................70

4.2.3 Sample application execution .............................................................................................................73

4.2.4 Conclusion ..........................................................................................................................................74

4.3 Algorithm specifications........................................................................................................................75

4.3.1 Runtime requirements for algorithms .................................................................................................75

4.3.2 Allocation ...........................................................................................................................................76

4.3.3 Partitioning .........................................................................................................................................76

4.3.4 Conclusion ..........................................................................................................................................77

4.4 New application design flow .................................................................................................78

4.5 Conclusion..............................................................................................................................80

5 RESOURCE ALLOCATION AND APPLICATION PARTITIONING..................81

5.1 Allocation ...............................................................................................................................................82

5.1.1 Survey of allocation literature.............................................................................................................82

5.1.2 Algorithm 1 – Greedy based...............................................................................................................84

5.1.3 Algorithm 2 – Bottom left ..................................................................................................................85

5.1.4 Algorithm 3 – Minkowski Sum ..........................................................................................................87

5.1.5 Algorithm performance.......................................................................................................................90

5.1.6 Algorithm selection ..........................................................................................................................100

5.2 Partitioning ..........................................................................................................................................101

5.2.1 Survey of partitioning literature........................................................................................................101

5.2.2 Algorithm 1 – Temporal partitioning................................................................................................103

5.2.3 Algorithm performance.....................................................................................................................104

5.3 Conclusion............................................................................................................................................107

6 OPERATING SYSTEM PROTOTYPE & METRICS ........................................108

6.1 Operating system prototype ...............................................................................................................110

6.1.1 Hardware platform............................................................................................................................111


6.1.2 Application architecture....................................................................................................................113

6.1.3 Primitive architecture........................................................................................................................114

6.1.4 ReConfigME implementation architecture .......................................................................................115

6.1.5 Sample application execution ...........................................................................................................121

6.1.6 Applications for ReConfigME..........................................................................................................124

6.1.7 Implementation issues ......................................................................................................................129

6.2 Metrics..................................................................................................................................................130

6.2.1 Response time...................................................................................................................................130

6.2.2 Throughput .......................................................................................................................................131

6.3 Conclusion............................................................................................................................................132

7 PERFORMANCE EVALUATION.....................................................................133

7.1 Experimental environment .................................................................................................................135

7.1.1 Benchmark application .....................................................................................................................135

7.1.2 Experimental configuration ...............................................................................136

7.2 Performance results.............................................................................................................................144

7.2.1 User response time............................................................................................................................144

7.2.2 Application throughput .....................................................................................................................150

7.2.3 Conclusion ........................................................................................................................................156

7.3 Predictor metrics .................................................................................................................................157

7.3.1 Response time...................................................................................................................................157

7.3.2 Application throughput .....................................................................................................................160

7.3.3 Comparison of fragmentation measure.............................................................................................162

7.3.4 Chance of allocation .........................................................................................................................163

7.4 Conclusion............................................................................................................................................167

8 CONCLUSION AND FUTURE WORK ............................................................168

8.1 Research contributions........................................................................................................................169

8.1.1 Summary of major contributions ......................................................................................................172

8.2 Suggestions for future work................................................................................................................173

9 REFERENCES ................................................................................................174


List of Figures

Figure 1: General FPGA Structure .............................................................................................9

Figure 2: Reconfigurable computer with a reconfigurable ALU..............................................11

Figure 3: Reconfigurable computer with a reconfigurable coprocessor...................................12

Figure 4: Loosely coupled reconfigurable computer ................................................................13

Figure 5: FPGA granularity examples ......................................................................................16

Figure 6: Granularity of an FPGA architecture ........................................................................17

Figure 7: Various FPGA logic allocation mechanisms ............................................................21

Figure 8: Two Dimensional Bin Packing .................................................................................30

Figure 9: Hardware Circuit Design Methodology ....................................................................36

Figure 10: A Summary of the methodology used in this thesis................................................46

Figure 11: The previous work, methodology and deliverables associated with this chapter ...50

Figure 12: Software Operating System Process .......................................................................53

Figure 13: Data flow graph.......................................................................................................55

Figure 14: Reconfigurable computing process abstraction.......................................................58

Figure 15: Classical operating system address space ...............................................................59

Figure 16: Reconfigurable computing address space abstraction.............................................61

Figure 17: Software inter-process communication abstraction ................................................62

Figure 18: Possible inter-process communication mechanisms ...............................................63

Figure 19: Processes of fixed size arranged in a fixed mesh topology orientated network......65

Figure 20: The on-chip network used in the reconfigurable computing inter-process communication abstraction ...............................................................................................67

Figure 21: Client-Server model architecture ............................................................................69

Figure 22: The RAGE System Dataflow Architecture .............................................................69

Figure 23: Architecture of the operating system ......................................................................70


Figure 24: Allocation service....................................................................................................76

Figure 25: Hardware partitioning .............................................................................................77

Figure 26: The previous work, methodology and deliverables associated with this chapter ...81

Figure 27: Greedy based allocation ..........................................................................................85

Figure 28: The bottom left allocation algorithm process..........................................................86

Figure 29: The heuristic used to calculate the remaining rectangles........................................87

Figure 30: Minkowski Sum example........................................................................................88

Figure 31: Bottom left heuristic used with the Minkowski Sum..............................................90

Figure 32: The execution runtime of the greedy, bottom left and Minkowski Sum allocation algorithms .........................................................................................................................93

Figure 33: A Fragmented FPGA...............................................................................................96

Figure 34: Fragmentation recorded for the typical, large and small sized applications ...........98

Figure 35: Temporal partitioning proposed by Purna.............................................................103

Figure 36: The execution runtime obtained from the partitioning algorithm.........................105

Figure 37: Previous work, methodology and deliverables associated with this chapter ........108

Figure 38: ReConfigME implementation architecture ...........................................................110

Figure 39: RC1000pp Block Diagram....................................................................................112

Figure 40: The RC1000pp ......................................................................................................113

Figure 41: Application architecture for ReConfigME............................................................113

Figure 42: Operating system primitive architecture ...............................................................115

Figure 43: Platform tier architecture.......................................................................................117

Figure 44: Architecture of ReConfigME’s Colonel ...............................................................118

Figure 45: Operating system tier ............................................................................................120

Figure 46: User tier architecture .............................................................................................120

Figure 47: Complete sample application in data flow graph format ......................................121

Figure 48: Handel-C code listing for the add-one data flow graph node .................................121


Figure 49: Java class file defining data flow graph structure .................................................122

Figure 50: ReConfigME data flow graph class structure .......................................................123

Figure 51: Status displayed before the allocation of the application......................................124

Figure 52: Status displayed after the allocation of the application.........................................124

Figure 53: Screen capture of the blob tracking application executing on ReConfigME........125

Figure 54: Allocation status of the FPGA when the blob tracking is loaded onto the FPGA by ReConfigME...................................................................................................................126

Figure 55: Screen capture of the edge enhancement application executing on ReConfigME126

Figure 56: Allocation status of the FPGA when the edge enhancement is loaded onto the FPGA by ReConfigME...................................................................................................127

Figure 57: Allocation status of the FPGA when the edge enhancement and the blob tracking are loaded onto the FPGA by ReConfigME ...................................................................128

Figure 58: Screen capture from Xilinx Floorplanner verifying the locations of the applications on the FPGA ...................................................................................................................128

Figure 59: Previous work, methodology, and deliverables associated with this chapter .......133

Figure 60: DES block architecture..........................................................................................136

Figure 61: Test case 1 floor-plan ............................................................................................139

Figure 62: Test case 2, set 1 (typical sized) floor-plans .........................................................139

Figure 63: Test case 2, set 2 (large sized) floor-plans ............................................................140

Figure 64: Test case 3 floor-plans...........................................................................................141

Figure 65: Test case 4, set 1 (typical sized) floor-plans .........................................................142

Figure 66: Test case 4, set 2 (large sized) floor-plans ............................................................143

Figure 67: The response time versus the number of partitions the application is divided into for sets 1 (typical) and 2 (large) in test case 4 ................................................................149

Figure 68: The response time versus the number of applications already allocated onto the FPGA for sets 1 (typical) and 2 (large) in test cases 2 and 4 .........................................150

Figure 69: Possible worst-case signal delay.............................................................................151


Figure 70: The application throughput versus the number of partitions the application is divided into for sets 1 (typical) and 2 (large) in test case 4 ............................................155

Figure 71: The application throughput versus the number of applications already allocated onto the FPGA for sets 1 (typical) and 2 (large) in test cases 2 and 4 ..........................155

Figure 72: The user response time versus the fragmentation percentage for both sets and all test cases .........................................................................................................................158

Figure 73: User response time versus the adjusted fragmentation for both sets and all test cases ................................................................................................................................159

Figure 74: A graph of application throughput versus fragmentation......................................160

Figure 75: User response time versus the adjusted fragmentation for both sets and all test cases ................................................................................................................................161

Figure 76: Response time versus Walder fragmentation measure..........................................163

Figure 77: Signal delay versus Walder fragmentation measure .............................................163

Figure 78: A graph of the percentage success and failed allocation of applications in test cases 2 and 3.............................................................................................................................166


List of Tables

Table 1: A Summary of Characteristics of Reconfigurable Computing Architectures ............14

Table 2: A summary of the characteristics relating to the use of an operating system for reconfigurable computing .................................................................................................19

Table 3: Services Provided by Runtime System Prototypes.....................................................27

Table 4: Summary of common reconfigurable computing applications...................................40

Table 5: A research methodology suggested by Crnkovic .......................................................44

Table 6: Methodology paths used in this thesis ........................................................................45

Table 7: Evaluation of network topologies...............................................................................66

Table 8: A summary of the well-known allocation algorithms that appear in the research literature ............................................................................................................................83

Table 9: Parameters of the applications used to measure the execution runtime of the allocation and partitioning algorithms ..............................................................................92

Table 10: Number of applications allocated onto the FPGA....................................................94

Table 11: The average percentage increase in fragmentation for the algorithms compared to each other..........................................................................................................................99

Table 12: Summary of partitioning algorithm runtime complexities .....................................102

Table 13: A summary of the metrics designed for reconfigurable computing operating systems ...........................................................................................................................130

Table 14: Parameters of the applications used in the response time and throughput experiments.....................................................................................................................138

Table 15: User response time for test case 1 ..........................................................................146

Table 16: User response time for test case 2 ..........................................................................146

Table 17: User response time for test case 3 ..........................................................................147

Table 18: User response time for test case 4 ..........................................................................147

Table 19: Application throughput for the worst case and when the application was not under the operating system control ...........................................................................................152


Table 20: Application throughput for test case 1.....................................................................152

Table 21: Application Throughput for test case 2 ..................................................................152

Table 22: Application throughput for test case 3....................................................................153

Table 23: Application throughput for test case 4....................................................................153

Table 24: Results from an experiment to verify the allocation success formula ....................165


List of Equations

Equation 1: Minkowski Sum ....................................................................................................88

Equation 2: Walder Fragmentation Grade ................................................................................96

Equation 3: Example of Fragmentation Grade .........................................................................97

Equation 4: Fragmentation percentage .....................................................................................97

Equation 5: Fragmentation percentage ...................................................................................157

Equation 6: Linear equations for the response time versus fragmentation percentage for both

sets ..................................................................................................................................158

Equation 7: Adjusted fragmentation percentage for predicting user response time ...............159

Equation 8: Linear equations for the signal delay versus fragmentation percentage for both

sets ..................................................................................................................................161

Equation 9: Adjusted fragmentation percentage for predicting signal delay..........................162

Equation 10: The percentage chance of allocating a process .................................................164

Acknowledgements

I’d firstly like to thank my supervisor Dr David Kearney. I met David about six years ago

when he offered to supervise me in an honours project. Towards the end of the year it was

very clear that the project could turn into a great PhD, and he suggested that I should seriously

consider it. Over the next five years and after much blood, sweat and tears, a PhD dissertation

was written. Throughout this period David was always there to discuss research ideas,

proofread my papers, discuss the weekly football results on a Monday morning, and of course edit

the dissertation. I am very grateful for the time he spent with me as it was far beyond what

was expected. I would also like to thank him for his contribution to my travel expenses which

allowed me to attend several international conferences where I was able to meet fellow

researchers who I could discuss my ideas with. Neither of us will ever forget the 2 weeks we

spent in La Grande Motte and Montpellier at FPL 2003. I would also like to thank my

associate supervisor Dr Oliver Diessel for all his help and guidance in the first few years of

my PhD.

Over the past five years I have worked with a great group of people within the Reconfigurable

Computer Lab including Martyn George, John Hopf, Mark Jasiunas, Ross Smith, Matthew

Pink, and Maria Dahlquist. I would like to thank you all for the many research discussions we

have had and the friendship you have shown me over the years. I would also like to

acknowledge the financial support provided to both myself and the lab by the Sir Ross and

Keith Smith Trust fund.

I would also like to thank the School of Computer and Information Science for the financial

support they have provided to both me and the Reconfigurable Computing Lab. Since I began

my PhD, the school has had three Heads of School: Andy Koronios, Brenton Dansie, and

David Kearney, who have always provided me with any equipment I needed to conduct my

research. Other academic staff members within the school I would also like to thank include

Sue Tyerman and Rebecca Witt for proof reading the thesis, and Jill Slay for giving me the

opportunity to teach parts of our offshore program in Hong Kong and Malaysia. Although this

did not directly contribute to my thesis, I gained valuable experience from it which I will take

into my professional academic career. I would also like to thank the general staff who manage

our office; without you guys the place would come to a standstill. A special mention goes to

Malcolm Bowes and Greg Warner for their brilliant system administration assistance and

friendship.

Completing a PhD can be very stressful at times, especially during the write up. But with the

help of great friends, I managed to find the necessary support when needed. I’d like to thank

Ali and Shania Darling, Stewart Itzstein, Hayley Reynolds, and Kate Tidswell who did just

that. Special thanks go to Wayne Piekarski and Hannah Slay. Wayne; thanks for your

friendship, advice, all those stories, the Hong Kong showroom trips and the thousands of

lunches we have had over the past five or so years. Completing a PhD at the same time as

you made it that much more enjoyable. Hannah, what can I say? Work colleague, dive buddy,

boat captain, but most of all a true friend. Thank you so much for the support, advice, and

friendship you have given me over the past 4 years. All those dive trips, the hundreds of

lunches, and all the gossip sessions we had really did give me a chance to forget about the

PhD just when I needed to.

Most importantly of all, I would like to thank my mum Glenda, dad Brian, my sister Kelly,

and my grandparents, for all their love, support, and financial assistance they have given me

over the past 27 years; without it I would not be writing this section in my PhD thesis. Mum,

thanks for all your help around home and all those late night dinners. Dad, thanks for getting

me to help you around the house, it made for a nice distraction when I needed it. Kelly, thanks

for telling me to keep at it all those times when I thought I’d had enough. Finally, my late

grandpa Jack Wigley always told me that one of the most important things in life is your

education. I never forgot that and it turned out that he was right; thanks Pa.

Grant Wigley

Declaration

I declare that this thesis does not incorporate without acknowledgment any material

previously submitted for a degree or diploma in any university and that to the best of my

knowledge it does not contain any materials previously published or written by another

person except where due reference is made in the text or published in my paper list below.

Grant Wigley

Adelaide April 2005

Author publications

1. Piekarski, W, Smith, R, Wigley, G, Thiele, N, Thomas, B, and Kearney, D., “Mobile Hand Tracking

using FPGAs for Low Powered Augmented Reality.” In 8th International Symposium on Wearable

Computers, Arlington, VA, Nov 2004.

2. Smith, R, Piekarski, W, and Wigley, G., “Hand Tracking for Low Powered Mobile AR User

Interfaces.” In 6th Australasian User Interface Conference, Newcastle, Australia, 2005.

3. Jasiunas, M, Kearney, D, Hopf J, and Wigley, G., “Fusion for Uninhabited Airborne Vehicles.” In 2nd

Field Programmable Technology (FPT’02), Hong Kong, China, 2002, IEEE Computer Society.

4. Wigley, G, Kearney, D, and Warren, D., “Introducing ReConfigME: An Operating System for

Reconfigurable Computing.” In 12th International Conference on Field Programmable Logic and

Applications (FPL’02), Montpellier, France, 2002, Springer.

5. Warren, D, Wigley, G, and Kearney, D., “Hardware Implementation of Geometric Hashing.” In 2nd

Field Programmable Technology (FPT’02), Hong Kong, China, 2002, IEEE Computer Society.

6. Wigley, G, and Kearney, D. “The Management of Applications for Reconfigurable Computing using an

Operating System”. In 7th Asia-Pacific Computer Systems Architecture Conference, 2002, ACS Press.

7. Wigley, G, and Kearney, D., “Research Issues in Operating Systems for Reconfigurable Computing.” In

International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA’02), 2002,

CSREA Press.

8. Wigley, G, Hopf, J, and Kearney, D., “Using Software Techniques when Developing Hardware

Applications”. In International Conference on Engineering of Reconfigurable Systems and Algorithms

(ERSA’02), 2002, CSREA Press.

9. Warren, D, Kearney, D, and Wigley, G., “Field Programmable Technology Implementation of Target

Recognition Using Geometric Hashing”. In International Conference on Engineering of Reconfigurable

Systems and Algorithms (ERSA’02), 2002, CSREA Press.

10. George, M, Pink, M, Kearney, D, and Wigley, G., “Efficient Allocation of FPGA Area to Multiple

Users in an Operating System for Reconfigurable Computing”. International Conference on

Engineering of Reconfigurable Systems and Algorithms (ERSA’02), June 2002, CSREA Press.

11. Wigley, G, and Kearney, D., “The Development of an Operating System for Reconfigurable

Computing”. In IEEE Symposium on FPGAs for Custom Computing Machines (FCCM'01), Napa

Valley, CA, USA, April 2001, IEEE Press.

12. Wigley, G, and Kearney, D., “The First Real Operating System for Reconfigurable Computers”. In 6th

Australasian Computer Science Week (ACSAC’01), Gold Coast, Australia, January 2001. IEEE

Computer Society.

13. Diessel, O, and Wigley, G., “Opportunities for Operating Systems Research in Reconfigurable

Computing”. School Technical Report, March 2000

14. Diessel, O, Kearney, D, and Wigley, G. “A Web-Based Multi-user Operating System for

Reconfigurable Computing”. 6th Reconfigurable Architectures Workshop (RAW’99). Springer,

December 1999

Chapter 1 – Introduction

1 Introduction

In any electronic product development a choice must be made between using general-purpose

hardware such as a microprocessor, or special purpose hardware such as application specific

integrated circuits (ASIC). If a microprocessor is used as the target hardware, the application

will be written in software. For many applications such as word processing this is suitable.

However there are many applications that require complex computations to be performed in

real time and there are no microprocessors that can achieve this. These applications can

include automatic target recognition (ATR) [72], active networks [127] and image processing

[8]. Such applications are transferred into a custom hardware implementation, as it is

well-accepted that this type of solution produces higher performance at a lower unit cost of

production, but at a high non-recurring engineering cost and often with long system

development cycles. For many applications, producing an application-specific mask is not economically

viable as only very few units may be required.

An alternative to a pure hardware implementation that avoids the disadvantages described

above is through the use of a field programmable gate array (FPGA). The key difference

between an FPGA and other chip technologies such as ASIC is that it can be configured by

the end user in the field. There is less risk involved when configuring applications in the field

because if a mistake is made, it is not necessary to wait weeks and spend a large sum of

money to fabricate a new device. The FPGA can simply be reconfigured with the corrected

application. Although an FPGA may not accelerate an algorithm as much as if it were

implemented in an ASIC, it has been shown that significant performance increases over

software can be achieved. The current popularity of these devices is reflected by the sales

figures of FPGA vendors. In 2003, Xilinx, which fabricates only FPGAs, was ranked third in

overall sales figures of semiconductor manufacturers [121].

Although FPGAs have provided considerable speedup to numerous algorithms, they are not

ideally suited to all types of applications. Some algorithms, such as floating point arithmetic,

cannot be efficiently implemented in an FPGA in the absence of “hardware” multipliers.

Other algorithms are best implemented with a combination of hardware and software to

optimise performance. To accommodate this type of application, FPGAs have been coupled

with microprocessors to form what is commonly known as a Reconfigurable Computer. This

provides the ability to exploit both the general purpose processor’s flexibility and the FPGA’s

capability to implement application specific computations.

The research domain of reconfigurable computing has become very popular over the past ten

years. At present there are at least five international conferences held each year which discuss

various topics of reconfigurable computing. The topic of operating systems for reconfigurable

computing has had increased interest over the past few years with several special sessions and

focus groups hosted at these conferences.

There have been many reconfigurable computing platforms proposed and built. Early attempts

include Garp [66], SPACE [63] and Pam [18]. These platforms primarily consisted of low

density FPGAs, small amounts of on-board memory, and a low bandwidth between the FPGA

and microprocessor. Due to limited resources, many of these types of reconfigurable

computers could only accommodate one application, resulting in a single task environment.

These applications were then able to be completely designed prior to loading onto the

reconfigurable computer as their execution order was known in advance.

With the development of reconfigurable computers containing FPGAs with in excess of 6

million system-gates, such as the RC2000 [32] and Bioler 3 [20], it is now feasible to consider

the possibility of sharing the FPGA between multiple concurrently executing applications.

This could potentially increase the resource usage of the expensive FPGA logic and decrease

response times so users will not have to wait for the FPGA to be completely available. The

multiple use of an FPGA depends on some form of runtime reconfiguration.

All SRAM-based FPGAs can be runtime reconfigured; that is, the context of the device can be

completely changed whilst the controlling software application continues to run. Partial

reconfiguration is an optimisation of runtime reconfiguration where only a specific part of the

FPGA is changed. This minimises the total reconfiguration time but requires all of the

applications to be stopped. Dynamic reconfiguration allows part of the FPGA to be stopped

and reconfigured whilst the remainder continues to operate unaffected. This allows

applications to be configured onto the FPGA without having to wait until the entire FPGA is

free or stop other executing applications. However none of these innovations provide

hardware resource allocation, a mechanism that is essential if an FPGA is to be shared

amongst applications. The use of dynamic runtime reconfiguration has gone part way to

providing this sharing support that would be needed in an operating system like environment.
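The distinction between these three reconfiguration styles can be made concrete with a simple model. The sketch below is purely illustrative; the class and method names are invented for exposition and do not correspond to any vendor bitstream interface:

```python
from dataclasses import dataclass, field

@dataclass
class FPGA:
    """Toy model of the reconfiguration modes described above (illustrative only)."""
    regions: dict = field(default_factory=dict)  # region id -> loaded circuit
    running: set = field(default_factory=set)    # regions currently executing

    def full_reconfigure(self, new_contents):
        # Runtime reconfiguration: the entire device context is replaced,
        # so every executing application stops.
        self.running.clear()
        self.regions = dict(new_contents)

    def partial_reconfigure(self, region, circuit):
        # Partial reconfiguration: only one region changes, minimising the
        # reconfiguration time, but all applications must still be stopped.
        self.running.clear()
        self.regions[region] = circuit

    def dynamic_reconfigure(self, region, circuit):
        # Dynamic reconfiguration: only the target region stops; the
        # remaining regions continue to operate unaffected.
        self.running.discard(region)
        self.regions[region] = circuit
        self.running.add(region)
```

Note that even in the dynamic case the model says nothing about *which* region a newly arriving circuit should occupy; deciding that is the resource allocation problem the text identifies as missing.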

Consider the following application. An unpiloted aerial vehicle (UAV) is an aircraft without

an on-board pilot that is flown autonomously or via a radio link. These aircraft are usually small

and have limited power and payload capabilities. During a flight, a UAV may require several

diverse algorithms including data encryption and target recognition. Traditionally these

algorithms would have all been configured onto the FPGA at one time even if they were not

used concurrently. This results in the use of an FPGA with a much higher logic density than

would be required if it could be shared amongst the multiple algorithms. Through the use of

an operating system, a much lower logic density FPGA can be used as the algorithms can be

configured onto the FPGA when needed, resulting in smaller and less demanding

reconfigurable computers.
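The density saving argued for in this scenario can be illustrated numerically. The task areas (in arbitrary logic-cell units) and mission phases below are invented figures for exposition, not measurements from a real UAV payload:

```python
# Invented UAV task areas, in arbitrary logic-cell units.
uav_tasks = {"encryption": 3000, "target_recognition": 5000,
             "sensor_fusion": 4000, "telemetry": 1000}

# Hypothetical mission phases, each naming the tasks active concurrently.
phases = [{"encryption", "telemetry"},
          {"target_recognition", "telemetry"},
          {"sensor_fusion", "encryption"}]

# Traditional approach: every algorithm resident at once.
static_area = sum(uav_tasks.values())

# Operating-system approach: the FPGA only needs to hold the largest
# concurrently active working set, since tasks are loaded on demand.
dynamic_area = max(sum(uav_tasks[t] for t in phase) for phase in phases)

print(static_area, dynamic_area)  # 13000 vs 7000 logic cells
```

Under these assumed figures the on-demand approach nearly halves the required logic density, which is the motivation for the smaller, less demanding reconfigurable computers mentioned above.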

Such a scenario raises several interesting research questions that have not been deeply

investigated. How is logic area to be allocated to FPGA applications at runtime? Are there

any suitable algorithms that already exist in other research domains that support this? Can an

application be divided into a number of parts to better fit the FPGA surface without

significant impact on application performance? If so, what is a suitable partitioning

algorithm? What abstractions are necessary for designers to write applications for such an

operating system? What would the application design flow be, and would the existing one be

modifiable to suit? Are popular FPGA architectures suitable for an operating system and, if

not, what is needed for them to support it? These are the major questions that will be addressed

in this thesis. The remainder of this thesis is structured as follows.

Chapter 2 organises the previous work in the field into several major themes. Initially, a

review of reconfigurable computing and FPGA architectures is given with the aim of

determining the most suitable FPGA and reconfigurable computer to be used with an

operating system. Previous attempts at sharing the FPGA amongst multiple applications are

then outlined. The current allocation of FPGA resources, application partitioning techniques

and previous attempts at describing and building runtime systems for reconfigurable computers

are detailed. The current design flow used in reconfigurable computing application

development is then reviewed in relation to its suitability to support dynamic resource

allocation. Finally, suitable applications and benchmarks that appear in the research literature

that can be used in conjunction with an operating system environment will be investigated.

In chapter 3, the methodology used in this thesis is presented. The methodology consists of

four phases. Firstly, an operating system for a reconfigurable computer is conceptualised. This

results in a set of abstractions and operating system architecture for a reconfigurable

computer. Secondly, a set of suitable algorithms that might be used to implement the

architecture is specified. The top performing algorithms according to selected criteria will

then be implemented. Thirdly, a prototype and set of metrics that can be used to measure its

performance are described. The prototype is described by implementing all of the previously

suggested abstractions, architecture and algorithm specification. The set of metrics are

selected by reviewing application literature to determine what hardware designers perceive

application performance to be. Finally, the performance of the prototype is evaluated and

characterised. This is achieved by selecting a benchmark application, building a test

environment with which to use the prototype, and performing a series of experiments.

Following this, a series of correlations are made between the results.

In chapter 4, the operating system for reconfigurable computing is conceptualised. The

abstractions, algorithm specifications, operating system architecture and application design

flow are presented. The abstractions that make the conceptual framework of the operating

system are developed by making comparisons to software operating systems. The algorithm

specifications that will ultimately implement these abstractions are presented. From this, an

architecture of an operating system for a reconfigurable computer is outlined. Finally, an

investigation is conducted into whether the current reconfigurable computing application

design flow can be used or modified to suit the proposed operating system architecture.

In chapter 5, suitable algorithms that perform FPGA area allocation and application

partitioning are presented. These algorithms are selected from a range of algorithms initially

evaluated for approximate complexity to rule out those that are unsuitable. Selected

algorithms are then implemented and their performance measured against particular

micro-metrics. From these results, the most suitable allocation and partitioning algorithms can be

determined.

Chapter 6 presents a prototype operating system known as ReConfigME. This consists of the

previously described abstractions, algorithms, and architecture and is demonstrated running

three applications at runtime. A set of metrics that can be used to measure the performance of

the operating system and its applications will then be presented. This is achieved by

investigating what application designers perceive application performance to be and

determining if any software metrics can be transferred into the reconfigurable computing

domain.

Chapter 7 involves using the prototype to evaluate its performance. This is achieved by

selecting a suitable benchmark application, building a set of test cases for use in a series of

experiments, and carrying out the experiments to measure the impact the operating system has on

the user response time and application performance. From the results obtained from the

experiments, correlations are made to determine if there are any connections between the

measured metrics. A series of formulas are then derived to predict the likely performance

before an application is loaded.

Chapter 8 concludes this thesis by summarising the key contributions made by the author and

presents possible future work.

Chapter 2 – Runtime support for reconfigurable computing

2 Runtime support for reconfigurable computing

From the early beginnings of development with programmable hardware, such as

programmable logic devices, the goal has always been to capture some of the flexibility in

conventional software based systems, while retaining the algorithmic speedup that hardware

provides [79]. A Field Programmable Gate Array (FPGA) [26] is one such programmable

logic device. An FPGA is a silicon chip made of an array of configurable logic blocks. Once

programmed, these configurable logic blocks form hardware circuits. FPGAs have been

coupled with general purpose processors

and memory to form a potentially versatile platform commonly known as a Reconfigurable

Computer [40].

A unique feature of some reconfigurable computers is dynamic reconfiguration [71]. This

allows a hardware application to be configured onto the FPGA without having to stop the

execution of other resident applications. As FPGA density increases beyond 10 million

configurable gates [108], and with the use of dynamic reconfiguration, it is becoming more

feasible for several applications that once required one FPGA each, to share a single high

density device. However, the current design flow has almost no support for allocating FPGA

resources to dynamically arriving applications. As a consequence the designer must ensure

that resources have been allocated for all possible combinations of loaded applications at

design time. However, if the types of applications to be loaded are not known at design time, it

is not feasible for all resource allocation to be performed as the availability of resources

changes dynamically over time.

The use of a runtime environment loosely modelled on the traditional software operating

system [126] may overcome this problem. As reconfigurable computing applications arrive

(consisting of both software and hardware), the runtime system would need to provide

services such as FPGA, microprocessor and memory resource allocation, communication

between the software and the hardware parts of the application, and general housekeeping

duties. While dynamic reconfiguration is necessary for an operating system, it is not sufficient

for the support of sharing a reconfigurable computer.
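One possible shape for the service interface such a runtime system might expose is sketched below. This is an invented illustration, not the interface of any existing system; it reduces FPGA area to a single pool of logic cells and stands in for the allocation, communication, and housekeeping duties just listed:

```python
class ReconfigurableOS:
    """Illustrative (invented) runtime-system interface for a shared FPGA."""

    def __init__(self, fpga_cells):
        self.free_cells = fpga_cells  # total logic area available
        self.resident = {}            # application name -> cells allocated

    def load(self, app, cells_needed):
        # FPGA resource allocation: admit the application only if enough
        # logic area remains; otherwise the caller may queue or partition it.
        if cells_needed > self.free_cells:
            return False
        self.free_cells -= cells_needed
        self.resident[app] = cells_needed
        return True

    def unload(self, app):
        # Housekeeping: reclaim the area when an application finishes.
        self.free_cells += self.resident.pop(app)

    def send(self, app, data):
        # Placeholder for communication between the software and hardware
        # parts of an application; real systems would use a bus or DMA.
        assert app in self.resident, "application not loaded"
        return data  # echo back for illustration
```

A real allocator must of course track *where* on the two-dimensional FPGA surface each application sits, not merely how many cells remain; that placement problem is taken up in the allocation theme below.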

This chapter reviews previous work associated with these issues and is structured into themes.

The first theme is configurable hardware. In this section an investigation into the numerous

types of reconfigurable computing platforms and FPGA architectures proposed in the

literature is presented. From all of these proposals, the most suitable platform and device

architecture for use in an operating system environment will be selected. The second theme is

prototype runtime systems for reconfigurable computing applications. In this section it will be

shown that although there are many runtime systems previously developed, they are usually

only responsible for trivial FPGA configuration and data transfer between the host and FPGA.

It is shown that the issues of resource allocation have not been deeply explored in published

prototypes. The third theme is algorithms for allocation and partitioning that might be used in

the proposed operating system. In this section the use of allocation and partitioning algorithms

for further increasing the usage of the FPGA are detailed. It is shown that there have been

very few attempts to apply resource allocation to the reconfigurable computing domain. The

fourth theme is design flow. In this section, it is shown that the current design flow used for

reconfigurable computing applications has limitations when applied to a dynamic runtime

reconfigurable environment. It will also be shown that of the few alternative design flows

proposed, none fully address all the issues of designing reconfigurable computing applications

for a shared environment. The final theme is on applications and benchmarks for

reconfigurable computers. In this section it will be shown that few benchmarks have been

developed for use with an operating system. It will also present the types of applications that

are best suited to an operating system environment.

2.1 Field programmable technology

In this section both field programmable technologies (FPT) and reconfigurable computers

(RC) will be discussed with the aim of determining the most suitable platform for use with an

operating system. This will be achieved by firstly classifying reconfigurable computers

according to their degree of coupling between the FPT and the microprocessor, and secondly

classifying FPT according to their logic granularity. From this the most suitable category of

reconfigurable computer and FPT for use in an operating system is identified. It will be shown

that the most suitable platform consists of a large number of FPT logic cells arranged in

medium granularity on a commercial FPGA chip, coupled to a custom von Neumann

processor using a high speed general purpose bus and via a local bus to bulk commodity

RAM. Such a platform will be shown to have the capacity to run multiple applications of

general interest with acceptable performance and to have a justified need and capability for

runtime resource allocation.

2.1.1 Introduction

An FPT device [34] can be configured in the field by the end user [95] to create a digital logic

hardware solution. These devices have become very popular over the past decade as they do

not require the high engineering costs and long manufacturing lead times that application

specific integrated circuits (ASICs) [21] do. In this thesis we are interested in them because

they allow reuse and sharing of the device by different applications. Field programmable

technology devices primarily include programmable logic devices [115] and FPGAs [26].

Programmable logic devices (such as CPLDs and PALs) will not be considered in this thesis

as their density is relatively small compared with the modern FPGA and there is little

advantage in sharing a low density device amongst multiple applications.

An FPGA consists of an array of uncommitted circuit elements and interconnect resources

and is configured by the end user through a form of hardware programming. Figure 1 gives an

overview of the general structure of an FPGA. The logic cell on an FPGA is often referred to

as a configurable logic block (CLB) and performs the logic operations of the application,

usually implemented via a k-way lookup table and a flip flop for state storage. The routing

matrix connects the CLBs together using a specific structure that may consist of local and

chip length wires. The I/O cells, often referred to as I/O blocks, connect directly to the pins of

the device and are used to read and write signals from outside the chip.

Figure 1: General FPGA Structure
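The behaviour of a CLB as described above, a k-way lookup table plus a flip flop for state storage, can be sketched in a few lines. This is a simplified software model for exposition only; real CLBs also include carry logic and many configuration modes not shown here:

```python
class LUT:
    """Toy model of a k-input lookup table with a registered output."""

    def __init__(self, k, config_bits):
        # The 2**k configuration bits are what the bitstream programs:
        # one truth-table entry per combination of the k inputs.
        assert len(config_bits) == 2 ** k
        self.k = k
        self.config = config_bits
        self.ff = 0  # flip flop for state storage

    def evaluate(self, inputs):
        # The input bits index into the truth table (combinational output).
        idx = sum(bit << i for i, bit in enumerate(inputs))
        return self.config[idx]

    def clock(self, inputs):
        # On a clock edge the flip flop captures the LUT output.
        self.ff = self.evaluate(inputs)
        return self.ff

# Configuring the same cell as different logic is just a different bit pattern:
and_gate = LUT(2, [0, 0, 0, 1])  # truth table for a 2-input AND
xor_gate = LUT(2, [0, 1, 1, 0])  # truth table for a 2-input XOR
```

The point the model makes is that "programming" an FPGA changes only memory contents (the `config` bits and routing state), which is why SRAM-based devices can be rewritten any number of times.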

FPGA manufacturers have developed a variety of hardware technologies for programming.

Some make chips with fuses that are programmed by passing a large current through them.

These types of FPGAs are called one-time programmable (OTP) because they cannot be

rewritten internally once the fuses are blown [36]. Other FPGAs make the connections using

pass transistors controlled by configuration memory. One type of FPGA resembles an

EPROM or EEPROM: it is erased and then placed in a special programming socket to

reprogram it. Most manufacturers now use static RAM to control the pass transistors for each

interconnection, thus allowing the FPGA to be rapidly reconfigured any number of times.

With this rapid reconfigurability, some FPGAs provide support for partial and/or runtime

reconfiguration [71]. This is the idea of changing the configuration of an FPGA whilst its

computation is still in progress. Some FPGAs even support dynamic runtime reconfiguration

where only a portion of the FPGA is reconfigured while the remaining part continues to

execute. In this thesis we will only consider SRAM reprogrammable FPGAs because if an

FPGA can not support dynamic runtime reconfiguration, such as the fused or EPROM type

FPGAs, there is little advantage in using an operating system to manage it as its resources can

only be allocated at compile time. This prevents applications from sharing the FPGA at

runtime.

Initially considered a weakness due to the volatility of the programming data storage, the

in-system reprogramming capability of SRAM-based FPGAs led to the new Reconfigurable

Computing paradigm [132]. It is generally agreed in the literature that a reconfigurable

computer (RC) is a computing machine that incorporates programmable logic devices to

create a hardware architecture that may be modified at runtime [122] [95] [130]. In common

with the original conception [55] of a machine that can have a fixed architecture driven by

software and a variable architecture via programmable interconnect of user definable logic

cells, a reconfigurable computing machine should include a general purpose processor. This

provides the ability to exploit both the general purpose processor’s flexibility and the

reconfigurable processor fabric’s capability to implement application specific computations.

2.1.2 Reconfigurable computing architectures

In this section there is a review of reconfigurable computing platform architectures with the

aim of selecting the most suitable for use in applied research for an operating system

environment. The platforms are divided into three categories according to the coupling

between the microprocessor and reconfigurable processing unit. The three categories are then

evaluated against a set of criteria to determine the most suitable one for use in an operating

system environment. These criteria include:

• Availability of parallelism

• The need for custom design tools not currently available

• The potential for runtime resource allocation

• The overall density and speed of the programmable logic, and

• The commercial availability of platforms.

There have been many attempts to categorise reconfigurable computing platforms. A

commonly accepted classification is either a tightly-coupled or a loosely-coupled platform

[49]. A tightly-coupled reconfigurable computer has the reconfigurable processing unit

integrated into the general purpose processor internal buses, whereas a loosely-coupled

reconfigurable computer is connected via a general purpose bus. Wolinski [135] extended this

to include new coupling architectures. This led to three categories: a reconfigurable ALU in

which arithmetic operations can be customised, a coprocessor where special instructions can

be diverted from the integer unit to the attached coprocessor, and a loosely-coupled

configuration where the processor transfers the data for the programmable logic via an I/O

bus such as PCI.

Reconfigurable processing unit as a replacement ALU

Reconfigurable instruction set processors (see Figure 2) have been used in a variety of

applications, but now attract renewed interest as they have been shown to increase the

performance of some multimedia applications [103]. In this type of reconfigurable computer

the arithmetic operations can be customised by reconfiguring the ALU. For example, an add

instruction can be replaced with add modulo 3. Due to the customised instruction set, general

interest applications would need to be recast in terms of the specific instruction set. There is a

lack of commercially available platforms and design tools with which to build the

applications and prototype. There are only a few offerings including Elixent [53], and PACT

[15], with several more proposed in academia including GARP [66] and Kress [85].
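The idea of customising an arithmetic operation can be illustrated with a small software analogy. The dispatch-table model below is an assumption made purely for illustration; it does not describe any of the cited architectures.

```python
# Illustrative model of a reconfigurable ALU: the operation bound to
# the "add" opcode can be swapped at configuration time, e.g. for an
# add modulo 3. A software analogy only, not a cited hardware design.

class ReconfigurableALU:
    def __init__(self):
        self.ops = {"add": lambda a, b: a + b}  # default instruction set

    def reconfigure(self, opcode, fn):
        """Replace the logic behind an opcode (models reconfiguration)."""
        self.ops[opcode] = fn

    def execute(self, opcode, a, b):
        return self.ops[opcode](a, b)

alu = ReconfigurableALU()
print(alu.execute("add", 2, 2))                   # conventional add -> 4
alu.reconfigure("add", lambda a, b: (a + b) % 3)  # customise the ALU
print(alu.execute("add", 2, 2))                   # add modulo 3 -> 1
```

As the sketch suggests, any application compiled against the default "add" must be recast once the opcode's semantics change, which is the portability problem noted above.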

The ability to perform runtime resource allocation on such architectures is very limited. The instruction sets could be swapped at the same time as switching the application in a multi-threaded manner. This could be incorporated into a traditional operating system by extending each thread's context to include the custom instruction set in use. However this would still result in sequential processing, so the advantage of real hardware parallelism is still lost.

Figure 2: Reconfigurable computer with a reconfigurable ALU

Reconfigurable processing unit as a coprocessor

It has been shown in the past that an increase in performance could be gained when a von

Neumann based architecture is coupled with a coprocessor, often a floating point coprocessor. Similarly, in reconfigurable computing a performance increase can be achieved


when a microprocessor is coupled with a reconfigurable processing unit (see Figure 3). Unlike

a reconfigurable ALU, where the instruction set can be customised, in a coprocessor

architecture the special instructions are diverted from the integer unit to the attached

reconfigurable processing unit for execution in programmable logic.

The microprocessor either exists within the same fabric as the reconfigurable processing unit,

commonly known as a hard core microprocessor, or is configured onto the programmable

logic, known as a soft core. Hard core microprocessors have been demonstrated through

several commercially available platforms containing devices such as Xilinx Virtex II Pro

(PowerPC) [141] and Altera Excalibur (ARM) [3]. Soft core microprocessors include the

Xilinx Microblaze [137] and Altera Nios [5]. However as the fabric is shared between the

microprocessor and programmable logic, the processor clock speed is usually slower than a

modern von Neumann processor, and the amount of logic available to the user is less as some

must be consumed by the hard or soft core microprocessor.

Figure 3: Reconfigurable computer with a reconfigurable coprocessor

Reconfigurable processing unit coupled to the I/O system bus

A reconfigurable processing unit coupled to a general purpose processor via an I/O bus is

commonly known as a loosely coupled architecture (see Figure 4). This type of architecture is

very common and commercially available (for example the RC2000 [32], BenOne [105] and

Virtual Computer [29]). There are also research versions of these including PRISM-I [9],

SPACE 2 [63] and Pilchard [92]. The main advantage of this type of architecture is the ease

of constructing a system using current high speed microprocessors and large gate count reconfigurable processing units [14]. Although it has been argued in the past that the bandwidth between the attached processor and the reconfigurable processing unit is too low [69], improvements to the performance of standard buses (e.g. the 64 bit, 133 MHz PCI-X bus has increased bandwidth to in excess of 1000 MB/s) have increased the effective data transfer rate. The loosely coupled architecture also offers a high potential for runtime resource allocation due to the use of a high density FPGA.
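The PCI-X figure quoted above follows directly from the bus parameters. As a quick illustrative check:

```python
# Theoretical peak bandwidth of a 64-bit, 133 MHz PCI-X bus:
# one 64-bit transfer per clock cycle.
bus_width_bits = 64
clock_hz = 133_000_000

peak_bytes_per_s = bus_width_bits / 8 * clock_hz
print(peak_bytes_per_s / 1e6)  # about 1064 MB/s, i.e. in excess of 1000 MB/s
```

This is a theoretical peak; sustained transfer rates are lower once bus arbitration and protocol overheads are accounted for.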

Figure 4: Loosely coupled reconfigurable computer

Summary

To determine which of the three reconfigurable computing platform architectures should be

used in an operating system environment, a set of criteria was chosen to rank them according

to particular characteristics. These criteria are as follows.

1. The ability of the platform to support parallelism; without this, most of the applications would need to be redesigned, resulting in a possible drop in performance.

2. The commercial availability; part of this thesis involves measuring the actual

performance of the operating system and the applications executing under it.

3. The need for developing custom design tools; there is a better chance of the operating

system being accepted within the programmable logic domain if the current design

flow and tool set can be used. This also alleviates the need for any tool set

development.

4. The overall density and speed of the programmable logic and microprocessor; the

density of the programmable logic must be able to have modern hardware applications


configured onto it and the microprocessor must have the necessary clock speed to

execute the software part of the application.

5. The possibility that the architecture can support runtime resource allocation; as the

goal of the operating system is to support multiple runtime applications, the ability to

perform resource allocation is essential.

6. The suitability of the platform to support the general interest applications that will be

used under an operating system environment. The architecture must minimise the performance loss imposed on applications by the introduction of the operating system.

Table 1 shows a comparison of each of these three architectures, ranking them on how well

they perform in each of the criteria described above.

                                          Reconfigurable   Reconfigurable    Loosely
                                          ALU              Computer with     Coupled
                                                           Coprocessor

Availability for parallelism              Low              Medium            High
Commercial availability of platforms      Medium           Medium            High
Need for custom design tools
not currently available                   Medium           Medium            Low
Overall density and speed of
programmable logic and processor          Low              Medium            High
Possibilities for runtime
resource allocation                       Low              Medium            High
Supports general interest
reconfigurable computing applications     Low              Medium            High

Table 1: A Summary of Characteristics of Reconfigurable Computing Architectures

The architecture of the reconfigurable computer that appears most attractive for an operating

system is the one based on the loosely-coupled microprocessor and reconfigurable processing


unit. A platform with such architecture is readily commercially available with a fast clock rate

microprocessor and high density FPGA, can have its applications designed using the standard

design flow and tool set, supports applications with parallelism, provides possibilities for

runtime resource allocation and supports general interest reconfigurable computing

applications. Neither the reconfigurable coprocessor nor the reconfigurable ALU is ideally suited, as each performs less well in one or more of the criteria.

2.1.3 FPGA architectures

In this section architectures for FPGAs are reviewed with the objective of evaluating their

suitability to work together with an operating system that allocates resources at runtime and a

loosely-coupled reconfigurable computing architecture. There are two major facets of FPGA architectures that are of interest: the granularity of the configurable logic elements, and the availability of dynamic reconfiguration support, either as a capability for dynamic reconfiguration driven from the attached host or as built-in hardware support that implements context switching through autonomous self reconfiguration. It will be shown that a medium

grain FPGA which has dynamic reconfiguration but not necessarily self reconfiguration is the

closest match for the proposed operating system.

Granularity

FPGA devices are commonly categorised according to the granularity of the architecture. A typical architecture may be considered fine, medium or coarse grain. Fine grain architectures (see Figure 5 (a)) usually contain a simple configurable logic block, often only two-input gate logic. They are best suited to applications that require very fine grain bit manipulation. Some of the well-known academic fine-grain reconfigurable devices include Montage [65], Triptych [22], 3D-FPGA [37], GARP [66] and BRASS [28]. Commercial fine-grain architectures have included the Xilinx 6200 [142] and Altera Flex 10K [4]. Fine grained architectures require more routing resources than coarser grained ones, and given the cost of routing in terms of area and delay, the optimal architecture is not likely to be the finest grain. Fine grain architectures are also arguably obsolete, with very few commercial versions released recently.


(a) Xilinx 6200 FPGA (Fine grain) (b) Xilinx Virtex (Medium grain)

(c) PACT (Coarse grain)

Figure 5: FPGA granularity examples

FPGAs with larger lookup tables and CLBs, more flip flops and CLB to CLB direct carry

chains are commonly known as medium grain architectures. These include LP_PGAII [60],

LEGO [38], Xilinx 4000 and Xilinx Virtex [139]. This type of architecture looks promising

for use in an operating system environment as it has a high enough density to support multiple

applications, is commercially available, and supports general interest applications. Figure 5

(b) is an example of a medium grained architecture functional unit and shows the increase in

complexity as compared with the fine grain architecture.

Coarse grained architectures for FPGAs have more complex logic and routing elements that

are domain specific. This implies that there are components that optimise a domain specific

function that could be implemented in other ways on a medium grained architecture. They

also have a routing structure that is suited to a particular application domain. Devices with

such an architecture include RaPiD [52] and Chameleon [33]. Domain specific architectures

often provide little performance increase to applications that can not be customised to suit the

architecture. Figure 5 (c) is an example of a coarse grained architecture functional unit and


shows the increase in complexity when compared with the medium grain architecture. A summary of what constitutes a particular type of granularity is shown in Figure 6.

1. 2-input NAND gate CLBs and local routing (fine grain)

2. Reconfigurable logic blocks with lookup tables, flip flops and a routing hierarchy (medium grain)

3. Reconfigurable logic blocks with fast carry and other inter-block communication, and a routing hierarchy (medium grain)

4. Reconfigurable logic block including a multiplier and a routing hierarchy (coarse grain)

5. Reconfigurable logic block including a state machine and internal RAM, and a specialist routing structure (coarse grain)

Figure 6: Granularity of an FPGA architecture

Dynamic reconfigurability

Another facet of the architecture that is of interest is the ability of the FPGA to support

dynamic reconfiguration from either the attached host or via single cycle self reconfiguration.

This is necessary so applications can be configured onto the FPGA at runtime whilst not

affecting the already configured ones. An early attempt at dynamic reconfiguration driven

from the attached host was demonstrated on the Xilinx 6200 series FPGA. This fine grained

FPGA performed random access partial reconfiguration which allowed selected CLBs of the

FPGA to be reconfigured while the remainder of them continued to operate unaffected, thus

allowing applications to be configured at runtime. This type of dynamic reconfiguration was

inherited by the Xilinx Virtex, Virtex II and Virtex II pro series of FPGAs.

A single cycle self-reconfigurable FPGA is a device that can toggle a particular part of its own configuration, causing part of, or the entire, FPGA to be reconfigured within a single clock cycle. Early research into this was conducted by DeHon [47] where he presented

the Dynamically Programmable Gate Array (DPGA) which was able to rapidly switch among

several pre-programmed configurations. This rapid reconfiguration allows DPGA array


elements to be reused in time without significant overhead. Trimberger [128] presented a

time-multiplexed FPGA architecture based on an extension of the Xilinx 4000 product series.

It contained eight configurations of the FPGA that were stored in on-chip memory and could

reconfigure the FPGA in a single clock cycle of the memory. Scalera [114] presented the first

design and implementation of a context switching reconfigurable computer (CSRC). CSRC

was designed to be a 4 bit DSP dataflow engine that is simultaneously capable of efficiently

implementing glue logic. However Sidhu [119] stated that for efficient self reconfiguration,

the device should perform fast logic switching and fast random access of the configuration

memory. CSRC could switch contexts in a single clock cycle but provided only serial

configuration memory access. Therefore Sidhu proposed the Self-Reconfigurable Gate Array

architecture, which supports both single cycle context switching and single cycle random

memory access. More recently, two commercial reconfigurable devices with coarse grain

architectures that support self reconfiguration have appeared in the literature. These include

XPP [15] from PACT and ACM [110] from Quicksilver. These devices have specialist architectures that can execute domain specific applications more efficiently.
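The multi-context devices above can be modelled abstractly in software. The sketch below (all names are assumptions) stores several pre-programmed configurations on chip and switches the active one in a single step, in the spirit of the DPGA and time-multiplexed FPGA designs; it is an analogy only, as real devices perform the switch in one hardware clock cycle.

```python
# Software analogy of a time-multiplexed FPGA: several configurations
# are held in on-chip memory and the active one is switched in a
# single step. Illustrative only; not any of the cited devices.

class MultiContextFPGA:
    def __init__(self, num_contexts):
        self.contexts = [None] * num_contexts  # on-chip configuration store
        self.active = 0

    def load_context(self, slot, configuration):
        self.contexts[slot] = configuration    # slow external load

    def switch(self, slot):
        self.active = slot                     # models a one-cycle switch

    def evaluate(self, inputs):
        return self.contexts[self.active](inputs)

fpga = MultiContextFPGA(8)                     # eight stored contexts, as in [128]
fpga.load_context(0, lambda x: x & 0xF)        # context 0: mask logic
fpga.load_context(1, lambda x: x << 1)         # context 1: shift logic
fpga.switch(1)
print(fpga.evaluate(3))                        # shift result -> 6
```

The point of the model is that `switch` is cheap while `load_context` is expensive, which is exactly the property that lets array elements be reused in time without significant overhead.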

Although an FPGA architecture that supports single cycle self reconfiguration would be of

interest for an operating system environment, the current devices that support it are either too

domain specific or are research based projects without commercially available products.

Conclusion

The FPGA architecture that is most suited to an operating system would be one that can

support self reconfiguration, is not too domain specific, and is commercially available.

However as shown in Table 2, none of the three FPGA architectures have all of these

characteristics. The architecture that looks the most promising for use in this operating system

is a medium grained FPGA. Although self reconfiguration is desirable, in commercially

available FPGAs it is only supported in coarse grained architectures. Such architectures are too domain specific, and it could be difficult to map general interest applications to them. Fine grained architectures, although not domain specific and supporting runtime reconfiguration (Xilinx 6200), are not suited either, as they are no longer commercially available and therefore a prototype cannot be built.


                              Fine Grain   Medium Grain   Coarse Grain

Commercial availability       Low          High           Medium
Support for reconfiguration   Dynamic      Dynamic        Dynamic or Self
Domain specific               No           No             Yes

Table 2: A summary of the characteristics relating to the use of an operating system
for reconfigurable computing

2.1.4 Conclusion

In this section it has been shown that medium grained FPGAs arranged as an I/O attachment

to a standard microprocessor present the most appropriate platform for an operating system

for reconfigurable computing. While devices that self reconfigure are emerging, they are excessively domain specific, which limits their use for an operating system. Hence FPGAs

which are externally reconfigurable under the control of a host are a more appropriate

platform for general purpose reconfigurable computing even though their reconfiguration

times are longer and the flexibility of the geometry of reconfiguration modules is limited.


2.2 Abstractions, services and runtime systems

In the previous section it was determined that the most suitable reconfigurable computing

platform for use with an operating system is a medium grained FPGA that supports runtime

reconfiguration, coupled to a microprocessor via a standard I/O interface. As a reconfigurable

computer has little support for automated resource allocation, the scheduling of incoming

applications, or the management of the transfer of I/O, the use of a runtime system may be

necessary.

In this section various services and runtime systems that have been proposed to support

reconfigurable computing applications are reviewed. What is implied by the services of resource allocation, hardware partitioning, application scheduling and I/O in a reconfigurable computing environment is initially examined. The prototypes that have been published are

then summarised and compared in relation to these services. An evaluation shows that none of

the prototypes have included all the services.

2.2.1 Services

The structure of an operating system for a traditional von Neumann based architecture has

been well defined [120] [126] [125] as comprising services (e.g. application launching,

multi-tasking, and file and memory management) and abstractions (e.g. process and file). This

is not the case in reconfigurable computing. However there is a growing trend in the

reconfigurable computing research literature to investigate services for a runtime system. To

date these include resource allocation, hardware partitioning, application scheduling and I/O.

Resource allocation

If a reconfigurable computer has applications competing for hardware resources, mechanisms

and policies are required to allocate these resources in a way that will not interfere with other

executing applications. These resources include the FPGA logic area, routing matrix, I/O pins

and external memory. Most of the reconfigurable computing resource allocation literature has

avoided discussion on the allocation of the routing matrix, I/O pins and external memory and

has concentrated on the FPGA logic area [25].

The concept of allocating the FPGA logic area to reconfigurable computing applications

within an operating system like environment was first suggested by Brebner (Virtual

Hardware Operating System [24]). He suggested that the FPGA area be divided into square


segments of equal size which the applications could be allocated to (Figure 7 (b)). This

increased the number of applications from one per FPGA to one per square segment. As the

FPGA was only divided into a small number of fixed segments, this kept the complexity of

the allocation algorithm to a minimum. However, modifications were needed to the current design flow so that applications could be expressed in these fixed sized segments [24], or current applications had to be broken down at runtime. The process required to break a

reconfigurable computing application into segments, or swappable logic units (SLU) as they

are defined by Brebner, is too time consuming to be performed at runtime [23] in an operating

system. To avoid this overhead he then suggested that SLUs have various rectangular

dimensions (Figure 7 (c)). Although there was no deep investigation, a two-dimensional

recursive bisection technique was suggested as an algorithm to perform this allocation.
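Brebner's recursive bisection suggestion can be sketched as a simple guillotine-style allocator. The code below is an illustrative assumption: the class names, the splitting policy and the absence of free-region merging are all choices made here, not the algorithm from [24].

```python
# Hedged sketch: a guillotine-style recursive bisection allocator for
# rectangular regions of FPGA logic area. Illustrative only.

class Region:
    def __init__(self, x, y, w, h):
        self.x, self.y, self.w, self.h = x, y, w, h

class BisectionAllocator:
    def __init__(self, width, height):
        self.free = [Region(0, 0, width, height)]  # whole fabric free

    def allocate(self, w, h):
        """Find a free region large enough, bisect off the unused
        strips, and return the allocated rectangle (or None)."""
        for i, r in enumerate(self.free):
            if r.w >= w and r.h >= h:
                self.free.pop(i)
                if r.w - w > 0:   # vertical cut: strip to the right
                    self.free.append(Region(r.x + w, r.y, r.w - w, r.h))
                if r.h - h > 0:   # horizontal cut: strip above
                    self.free.append(Region(r.x, r.y + h, w, r.h - h))
                return Region(r.x, r.y, w, h)
        return None               # no region fits

    def release(self, region):
        # A full implementation would merge adjacent free regions.
        self.free.append(region)
```

For example, `BisectionAllocator(64, 64).allocate(16, 32)` returns the rectangle at the origin and leaves two free strips behind; fragmentation management, which this sketch omits, is the hard part of such allocators.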

RACE [124] and the Dynamically Reconfigurable System [73] suggested that the FPGA logic

area be allocated by configuring one application per FPGA device (see Figure 7 (a)). This

allows the current design flow to be used in application development but reduces the

allocation complexity as only one application is configured per FPGA. The limitation with

this policy is the number of applications on the platform at one time is restricted to the

number of FPGA devices. Burns [27] extended this idea to allow applications to have various

geometric shapes (Figure 7 (d)). Although this gives the runtime system the most freedom in

the shape selection, allocation algorithms that could perform such a task with a minimal

overhead were not presented.

Figure 7: Various FPGA logic allocation mechanisms

If the resource allocation policy shown in Figure 7 (b)(c)(d) were to be used, there is a

possibility that routing resources beyond the logic area allocation may be needed to allow

inter application communication or external I/O. These routes will breach the logic area

allocation and could possibly require a global router to manage them [45]. A runtime routing


package for the Xilinx Virtex FPGAs (JRoute) has been designed by Keller [82] and can be

configured as a global router. As applications are routed, JRoute is able to manage what

resources have been used and what resources are available. This gives JRoute the ability to

perform routing at runtime. However, the routing algorithm used in JRoute does not guarantee

a route can be found and thus limits its usability.

Although there is limited literature on I/O pin allocation, Babb [12] proposed a technique that

allows more than one application to use a single I/O pin. He suggested that the FPGA pins be

multiplexed amongst multiple logical wires and be clocked at the maximum frequency of the

FPGA. This would allow more than one application to use a selected pin. The decision of

whether to share an I/O pin was made at compile time through the use of a static scheduler. If

this mechanism were to be transferred to the operating system, a dynamic scheduler would

need to be used as applications would be arriving at runtime.
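Babb's scheme can be modelled in software as time-division multiplexing driven by a static schedule. The sketch below is purely illustrative; the function and wire names are assumptions, not the interface from [12].

```python
# Hedged sketch of pin multiplexing: several logical wires share one
# physical pin by time-division multiplexing, with the slot order
# fixed by a static (compile-time) schedule.

def multiplex(pin_slots, logical_wires, cycles):
    """Drive one physical pin for `cycles` clock ticks, giving each
    logical wire the slot assigned by the static schedule."""
    pin_trace = []
    for t in range(cycles):
        wire = pin_slots[t % len(pin_slots)]      # static TDM schedule
        pin_trace.append(logical_wires[wire][t])  # sample that wire
    return pin_trace

# Two applications' wires sharing one pin, alternating each cycle.
wires = {"app_a": [1, 1, 0, 0], "app_b": [0, 1, 0, 1]}
trace = multiplex(["app_a", "app_b"], wires, 4)
# trace samples app_a on even ticks and app_b on odd ticks.
```

The model also shows the limitation noted above: the number of wires sharing a pin is bounded by how much faster the pin can be clocked than the applications themselves.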

Application partitioning and scheduling

The concept of partitioning logic circuits is well published and is defined to be the process of

dividing an application into multiple parts because it can not fit onto the target device in its

entirety. Once partitioned, these parts can then either be configured onto a reconfigurable

computer in time [77, 78] or configured at the same time onto a single or multi-FPGA

reconfigurable computer. If partitioned in time, a scheduler can be used to determine which

parts should be loaded and removed at what times [117]. Scheduling applications for a

reconfigurable computer is not the same as with a von Neumann based architecture. There are

no obvious ways to pre-empt a reconfigurable application due to the absence of a well defined

instruction fetch, decode, and execute cycle. If all the application parts are configured onto

the FPGA at one time and are not removed until the application has completed, pre-emption is

not necessary.

The idea of partitioning reconfigurable computing applications in time under runtime system

control was introduced in both the XOS [86] and ACE runtime [45] systems. In these papers

it was proposed that if a reconfigurable computing application was larger than the target

device, it should be partitioned into equal sizes matching the target device and then these parts

be swapped onto the FPGA through the use of a pre-emptive scheduler. Partitioning

reconfigurable computing applications across multiple FPGAs was demonstrated by the

Dynamically Reconfigurable System [73] and RACE [124]. In both these systems the


reconfigurable computer had multiple FPGAs and a single application was partitioned and

configured at one time onto them. The process of partitioning the application across multiple

FPGAs was performed statically using well known partitioning algorithms. Performing the

partitioning statically does not introduce an unwanted time overhead but provides the ability

to execute applications larger than the target FPGA. The order in which to load the partitioned

application can then be determined through the use of static schedulers [112] [104], which have been proposed in the research literature.
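A static schedule for temporally partitioned parts can be derived by topologically sorting the partition dependency graph. The sketch below is a generic illustration under that assumption; it is not the scheduler of [112] or [104], and the dependency representation is invented for the example.

```python
# Hedged sketch of a static scheduler: order partitions so each is
# configured only after the partitions it depends on have executed
# (a topological sort over the dependency graph).

from collections import deque

def static_schedule(deps):
    """deps maps each partition to the partitions it depends on.
    Returns a load order, or raises on cyclic dependencies."""
    indegree = {p: len(d) for p, d in deps.items()}
    users = {p: [] for p in deps}
    for p, d in deps.items():
        for q in d:
            users[q].append(p)
    ready = deque(sorted(p for p, n in indegree.items() if n == 0))
    order = []
    while ready:
        p = ready.popleft()
        order.append(p)
        for q in users[p]:
            indegree[q] -= 1
            if indegree[q] == 0:
                ready.append(q)
    if len(order) != len(deps):
        raise ValueError("cyclic partition dependencies")
    return order

# p2 needs p0's results; p3 needs p1 and p2.
print(static_schedule({"p0": [], "p1": [], "p2": ["p0"], "p3": ["p1", "p2"]}))
# -> ['p0', 'p1', 'p2', 'p3']
```

Because every dependency is known before execution, the entire order is computed offline, which is precisely what distinguishes static from dynamic scheduling here.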

Instead of partitioning an application across multiple FPGAs, Brebner [24] proposed the idea

of partitioning reconfigurable computing applications into partitions known as swappable

logic units (SLU). SLUs could be swapped onto an FPGA into a number of fixed sized

locations. The advantage of this is the ability for the FPGA to support multiple concurrent

applications. However the process of partitioning all applications into an SLU structure is

complex and time consuming and it was noted that it could not be performed at runtime

without a significant time overhead. Therefore all applications were statically partitioned into

the fixed sized SLU structure. It appears from the literature that very few researchers have

considered the idea of partitioning applications into variable sized parts at runtime whilst

under operating system control.

Input and output

In almost all of the runtime systems proposed and built, researchers have seen the need to

manage the transfer of I/O between the reconfigurable computing application and the

microprocessor. As most of the runtime systems only support one concurrent application, it

has been assumed that the I/O would simply be transferred via the FPGA pins. However if the

runtime system was to support more than one concurrent application, this may not be possible

as sharing I/O pins amongst applications on an FPGA can be difficult.

To avoid contention on the FPGA pins, Babb [12] suggested that multiple signals be

multiplexed onto one pin (see the resource allocation discussion above). This would then allow more than one application to

use a single pin at one time. However the number of applications that can share the FPGA pin

is limited by the clock speed of the reconfigurable application and FPGA. An alternative to

multiplexing the pins is to use an on-chip network and arbitrator. This involves configuring a

network and arbitrator onto the FPGA before any applications are configured. As applications are configured onto the FPGA, they are connected to the on-chip network, which is responsible for creating a


route between the application and the network arbitrator. The arbitrator then negotiates with

all of the applications giving each one exclusive use of the network and I/O pins at a specific

time. To avoid contention on the FPGA pins only the network arbitrator is directly connected

to them and all of the applications transfer their I/O via the network. Although this has been

proposed [81], it has not been used in conjunction with any runtime system for reconfigurable

computing.
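The arbitration scheme described above might be sketched as follows. The round-robin policy and all names here are assumptions, as the negotiation protocol of [81] is not detailed in the text.

```python
# Hedged sketch of an on-chip network arbitrator granting each
# configured application exclusive use of the shared I/O pins in
# turn. Simple round-robin; illustrative only.

class NetworkArbitrator:
    def __init__(self):
        self.clients = []   # applications attached to the on-chip network
        self.turn = 0

    def attach(self, app_id):
        """Called when an application is configured onto the FPGA."""
        self.clients.append(app_id)

    def grant(self):
        """Return the application holding the pins this time slot."""
        if not self.clients:
            return None
        owner = self.clients[self.turn % len(self.clients)]
        self.turn += 1
        return owner

arb = NetworkArbitrator()
arb.attach("filter")
arb.attach("crypto")
grants = [arb.grant() for _ in range(4)]  # alternates between the two apps
```

Only the arbitrator touches the physical pins; applications see the network, so newly configured applications gain I/O access without any pin contention.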

2.2.2 Prototypes

Operating systems traditionally provide run time support for applications. Surprisingly, in

view of the number of reconfigurable platforms and architectures proposed and built [18, 31,

61, 63], very few of these projects have included an investigation of run time support.

Every group that built a platform has seen the need for a single user loader [63] [124],

often in the guise of interface software between the reconfigurable computing platform and

the host system. Some researchers have seen the need for a run time environment.

Smith developed a system he called RACE [124], reconfigurable and adaptive computing

environment. In RACE the allocation of FPGA logic area was on a one application per FPGA

basis. This dramatically reduces the complexity of the allocation algorithm as it is much easier

to determine an available FPGA as compared with a section of area on one FPGA. It is worth

noting that although RACE can support as many applications as it has FPGAs, as FPGA

density increases, the need for multiple FPGA based platforms diminishes as numerous

applications can fit onto one very dense FPGA.

Similar to Smith, with respect to the allocation policy, Jean reported on a resource manager

for a dynamically reconfigurable system [73]. He defined a dynamically reconfigurable

system to be one that allows part of the hardware to be reconfigured whilst another part

continues to execute. If implemented on one FPGA, it would be similar to partial

reconfiguration, however in this paper it was implemented by the use of multiple FPGAs.

Therefore the resource manager allocated and de-allocated whole FPGAs when required. Jean

reported on the performance of the resource manager when used with several applications and

the impact of supporting concurrency and preloading in reducing application execution time.

Davis [45] proposed a method for developing reconfigurable hardware object class libraries and a runtime environment to manage these. He presented a conceptual development of layered abstractions. At the top is a hardware object scheduler which manages precompiled cores, in

the middle is a place and route layer and at the bottom is a virtualization of the FPGA to make

the system portable. However the authors missed the need for resource allocation. This

indicates they did not have a vision for a full scale runtime system, as a loader does not need the allocation step. There is also no evidence the authors attempted to implement the layered architecture, as the details for each level of abstraction are very brief.

Moseley [104] proposed a runtime management system called Reconnetics. The runtime

system provides an environment of high-level control over the logic gates which requires little

knowledge of the underlying hardware technology. The user supplies the circuits; these are

captured and placed in an archive for later use and an engine directed by a high-level user

program loads, places, and interacts with these processors. Although this provides the ability

to dynamically load applications and performance results were given showing that the runtime

system has been implemented, the author does not mention anything about area allocation, a

fundamental process that is required in an operating system.

Rakhmatov [112] proposed AMORE, an adaptive multi-user online reconfigurable engine, an architecture that consists of FPGAs with multiple attached microprocessors. This runtime system is of interest as it addresses the issues of FPGA logic area allocation and

application scheduling. It supports variable sized FPGA logic area allocation through the use

of a two dimensional bin packing algorithm. However the authors do not explain why the

particular bin packing algorithm was chosen and do not investigate other alternative allocation

algorithms. Application scheduling is performed by the use of a dynamic scheduler [111]. The

order in which the applications are loaded onto the FPGA is determined by the

communication constraints, making sure all communication applications are resident on the

FPGA at one time. Although named a dynamic scheduler, the scheduling decisions are all

made offline when a complete list of applications to be loaded onto the FPGA is known.
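A two-dimensional bin packing allocation of the kind AMORE performs can be sketched with a simple shelf heuristic. Since the paper does not describe which bin packing algorithm was chosen, the code below is a generic illustration only, not their implementation.

```python
# Hedged sketch of variable-sized FPGA logic area allocation via a
# two-dimensional "shelf" packing heuristic. Illustrative only.

def shelf_pack(fpga_w, fpga_h, requests):
    """Place (width, height) requests on horizontal shelves, left to
    right; open a new shelf when the current one is full. Returns a
    list of (x, y) origins, with None where a request does not fit."""
    placements = []
    shelf_y, shelf_h, cursor_x = 0, 0, 0
    for w, h in requests:
        if cursor_x + w > fpga_w:            # shelf full: open a new one
            shelf_y += shelf_h
            cursor_x, shelf_h = 0, 0
        if w > fpga_w or shelf_y + h > fpga_h:
            placements.append(None)          # request cannot be placed
            continue
        placements.append((cursor_x, shelf_y))
        cursor_x += w
        shelf_h = max(shelf_h, h)            # shelf grows to tallest item
    return placements

# Three application regions packed onto a 100x100 logic area.
print(shelf_pack(100, 100, [(60, 30), (50, 20), (40, 40)]))
# -> [(0, 0), (0, 30), (50, 30)]
```

Comparing heuristics like this one (first fit, best fit, and so on) is exactly the evaluation the AMORE authors omit.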

A commercial reconfigurable computing operating system, the Field Upgradeable Systems Environment (FUSE), was developed by Nallatech [106]. FUSE is an

interface between the platform operating system and the hardware circuit design language. It

allows data communications from software directly to the FPGA based applications running

on the hardware. Although named “Reconfigurable Computer Operating System”, it really


only provides an API between the user and the hardware platform. It provides very few of the

services described in the section above that are necessary for it to be an operating system.

Although this thesis will concentrate on operating systems for reconfigurable computers with

an attached microprocessor, it is worth noting runtime system research undertaken on the

Xputer [85] and AMORE architectures. Kress outlined an operating system for custom

computing machines (XOS) based on the Xputer Paradigm in [86]. Multi-tasking was

performed on the Xputer through the use of a dynamic scheduler, instead of having

concurrently executing applications. This involved the dynamic scheduler swapping the

application on the reconfigurable logic unit after a set amount of time.

2.2.3 Evaluation

A summary of the services that each prototype incorporates is shown in Table 3. From this

table it can clearly be seen that no runtime system provides support for even the minimal

number of services that have already been proposed in the literature. In fact, of the eight

prototypes discussed, only five of them perform any type of hardware partitioning, only one

provides any method of data input and output, two implement a pre-emptive scheduler, one

implements a static scheduler and all eight implement some form of FPGA logic area

allocation. Of the prototypes that implement some form of hardware partitioning, four of them

partition across multiple FPGAs and one partitions into fixed sized SLUs. None of the

prototypes partition applications at runtime into variable sized partitions. Although all of the

prototypes perform logic area allocation, only two allocate applications of variable sizes while

the remaining six allocate on a per FPGA basis. The per FPGA basis allocation was primarily

used as the current design flow could be used to produce applications for such a system and

the allocation algorithm has a low computational complexity. Neither of the prototypes that
performed variable sized allocation gave performance results for its algorithm, nor did either
compare alternatives.


Name of Prototype                      | Allocation of resources               | Application Partitioning | Scheduling of applications   | I/O
---------------------------------------|---------------------------------------|--------------------------|------------------------------|---------------------------
AMORE [112]                            | Logic area, variable sized rectangles | Not mentioned            | Static scheduling            | Not mentioned
Dynamically Reconfigurable System [73] | Logic area, per FPGA                  | Across multiple FPGAs    | Not mentioned                | Not mentioned
Virtual Hardware Operating System [24] | Logic area, variable sized rectangles | Static, into SLUs        | Not mentioned                | Bus addressable registers
RACE [124]                             | Logic area, per FPGA                  | Across multiple FPGAs    | None                         | Not mentioned
ACEruntime [45]                        | Logic area, per FPGA                  | Across multiple FPGAs    | Pre-emptive scheduler        | Not mentioned
Reconnectics [104]                     | Not mentioned                         | Not mentioned            | Demand and static scheduling | Not mentioned
XOS [86]                               | Logic area, per FPGA                  | Across multiple FPGAs    | Pre-emptive scheduler        | Not mentioned
FUSE (Nallatech) [106]                 | Logic area, per FPGA                  | Not mentioned            | Not mentioned                | Not mentioned

Table 3: Services Provided by Runtime System Prototypes

This section outlined what is implied by resource allocation, hardware partitioning,

application scheduling and I/O management as discussed in reconfigurable computing

runtime system literature. It was shown that there is a lack of a concise set of services that

must be provided by a runtime system for a reconfigurable computer. Of the prototypes


proposed and built, no single runtime system has demonstrated a complete set of these services.

In most of the published research, resource allocation and hardware partitioning appear to be

the defining services of runtime systems for reconfigurable computing. However there

appears to have been little investigation and performance evaluation into optimal allocation

and partitioning algorithms for variable sized applications.


2.3 Allocation and partitioning

In the previous section, runtime systems for reconfigurable computers that appear in the

research literature were presented. It was shown that although the services of resource

allocation and hardware partitioning were suggested, none of the prototypes had included any

systematic evaluation of algorithms that can operate on variable sized dynamically arriving

applications. In this section, published resource allocation and hardware partitioning

algorithms that are suitable and appear in either the reconfigurable computing or other

research domains will be presented. The review will focus on algorithms which have a

potential to be transferred into the reconfigurable computing domain.

2.3.1 Allocation

In any system where there are applications competing for limited hardware resources, a

resource manager is required. Without one, applications can take resources that are already

occupied by other applications, ultimately causing them to have undefined behaviours. The

task of a resource manager is to monitor the availability of the hardware resources and

allocate them as requested to arriving applications. In a reconfigurable computer with

applications competing for the hardware, the major resource that needs management is the

FPGA logic area. Allocating reconfigurable computing applications onto the FPGA logic area

can be simply defined as translating an area (the application) onto another area (FPGA) so

that the translated area does not overlap any existing used area but fits within the boundary of

the total usable area.
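The fit test implied by this definition can be sketched in a few lines of code. This is an illustrative sketch only; the rectangle representation (bottom-left corner plus width and height) is an assumption, not the data structure of any prototype discussed here.

```python
def fits(app, used, fpga_w, fpga_h):
    """Return True if rectangle `app` = (x, y, w, h) lies inside the
    FPGA boundary and overlaps none of the rectangles in `used`."""
    x, y, w, h = app
    if x < 0 or y < 0 or x + w > fpga_w or y + h > fpga_h:
        return False  # outside the total usable area
    for ux, uy, uw, uh in used:
        # two rectangles overlap unless one lies entirely to the
        # side of, or above/below, the other
        if not (x + w <= ux or ux + uw <= x or
                y + h <= uy or uy + uh <= y):
            return False  # would overlap an existing application
    return True
```

An allocator would call this test for each candidate position before translating the application onto the FPGA surface.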

The general resource allocation problem has been investigated in various research domains

including allocating data to computer file systems [96] and memory [98], allocating

processors to multi-processor computers [107], and block placement within mesh connected

parallel computers [144]. However, many of these previous attempts are unable to be adapted

for reconfigurable computing as there is a fundamental difference. In traditional von

Neumann computing, the data is moved to the processor. In multiprocessors for example,

programs are moved to the processors, as in the classical load balancing problem. Another

example is the well explored problem of allocating applications to a mesh of processors. This

is not really relevant to FPGAs either because the location of the hardware modules in an

FPGA is not necessarily predetermined by the location of existing hardware. In the case of

existing hardware on the FPGA (such as might be provided for memory access) the task is not


to load a program onto this hardware but to locate the FPGA core in an appropriate position in

relation to the existing hardware. In reconfigurable computers, the circuits (“processors”) are

moved to connect with memory under area constraints and therefore these general resource

allocation algorithms can not be adapted.

The problem of determining where to allocate incoming reconfigurable computing

applications onto an FPGA is very similar to that of calculating fabric cutting plans in the

manufacture of clothing. In clothing manufacturing, the goal is to minimise the amount of

fabric wasted when a particular pattern of clothing is cut out. Milenkovic [102] used a

commonly known mathematical formula known as the Minkowski Sum [58] to calculate all

the possible locations where the clothing pieces could be laid out. As the Minkowski Sum

only reports on the possible locations in which the pattern can be laid out, a greedy based

algorithm was used to select which of them resulted in the optimal placement. He reported

that current software packages performing this task are wasting approximately 20% of the

fabric and it was shown that through the use of the Minkowski Sum and the greedy algorithm

this dropped to around 9.5%.
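For axis-aligned rectangles, the Minkowski Sum has a particularly simple closed form that the sketch below illustrates: the set of bottom-left positions at which an item would overlap an obstacle is itself a rectangle, the obstacle inflated by the item's dimensions. This is a hypothetical illustration of the idea, not Milenkovic's implementation, which handles arbitrary clothing-piece polygons.

```python
def forbidden_region(obstacle, item_w, item_h):
    """For an axis-aligned obstacle (ox, oy, ow, oh), return the
    rectangle of bottom-left positions (x, y) at which an
    item_w x item_h item would overlap it -- the Minkowski sum of the
    obstacle with the reflected item."""
    ox, oy, ow, oh = obstacle
    return (ox - item_w, oy - item_h, ow + item_w, oh + item_h)
```

Any candidate position outside the union of these forbidden regions (and inside the usable boundary) is a legal placement; a greedy algorithm then selects among them.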

The problem of packing reconfigurable computing applications onto an FPGA is also similar

to the well-studied problem of bin packing. The traditional bin-packing problem involves
packing a list L = {a1, a2, …, an} of items into a set of bins B1, B2, … subject to the constraint
that the set of items in any bin fits within that bin’s capacity. Most literature on bin-packing
concentrates on classical one dimensional bin-packing, such as First Fit and Best Fit. One
dimensional bin-packing is not suitable for area allocation on an FPGA as the surface of the
FPGA is two dimensional. However, as shown in Figure 8, two dimensional bin packing is.

Figure 8: Two Dimensional Bin Packing


The two-dimensional finite bin packing problem consists of determining the minimum

number of large identical rectangular bins that are required to allocate, without overlapping, a

given set of rectangular items. Two categories of two dimensional bin-packing algorithms

include offline bin packing, where the algorithm has full knowledge of all items in the list L, and

online bin packing where packed items cannot be repacked at a later time and the full list of

items to be packed is not known at the start. Offline bin packing is not relevant in this thesis

as it can only be used when all the applications that are required to be packed onto the FPGA

surface are known in advance. As the application arrival rate into the runtime system

proposed in this thesis will be random, the process of bin packing can not be performed

offline. Online bin packing algorithms are able to accept applications arriving after the

packing process has begun and are thus more suitable to the proposed runtime system.

Baker, Coffman, and Rivest [13] introduced an implementation of two dimensional online bin

packing. In this implementation, the items are rectangles and the goal is to pack them into a

unit-width semi-infinite strip so as to minimize the total length of the strip spanned by the

packing. Packed rectangles can not overlap each other or the boundaries of the strip. Each

successive item is placed as near the bottom of the strip as possible and then as far left at that

height as possible. The problem with this particular implementation and most other online two

dimensional bin packing algorithms is that the strip of area is assumed to be infinite, whereas
an FPGA's area is finite, and that the runtime complexity is at best O(n³) [39], where n is the
number of rectangles being allocated.

Chazelle [35] implemented a modified two dimensional online bin packing algorithm in an

attempt to reduce the runtime complexity but retain the quality of allocation. It was found that

using poorly ordered lists can lead to arbitrarily bad packings and long runtimes [13].

However this could be avoided by simply ordering the lists in decreasing widths and

allocating all tasks in the bottom left corner (a heuristic commonly known as bottom-left).

Chazelle reported that the algorithm had only quadratic complexity in terms of the number of

rectangles. Although a reduction in runtime was achieved, the algorithm still assumed an

infinite height bin.
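The decreasing-width ordering and bottom-left placement rule can be sketched as follows. This is a simplified illustration, not Chazelle's implementation: it naively enumerates candidate positions at the edges of already placed rectangles rather than using his more efficient data structures, and it assumes every item is no wider than the strip.

```python
def bottom_left_pack(items, strip_width):
    """Pack (w, h) rectangles into a semi-infinite strip of the given
    width: sort by decreasing width, then place each item at the lowest
    position where it fits, leftmost at that height (the bottom-left
    heuristic)."""
    placed = []  # (x, y, w, h) tuples
    for w, h in sorted(items, key=lambda r: -r[0]):
        best = None
        # candidate positions: the floor and left wall, plus the top
        # and right edges of already placed rectangles
        for y in sorted({0} | {py + ph for _, py, _, ph in placed}):
            for x in sorted({0} | {px + pw for px, _, pw, _ in placed}):
                if x + w > strip_width:
                    continue
                if all(x + w <= px or px + pw <= x or
                       y + h <= py or py + ph <= y
                       for px, py, pw, ph in placed):
                    best = (x, y)
                    break
            if best:
                break
        placed.append((best[0], best[1], w, h))
    return placed
```

The total strip height spanned by the returned placement is the quantity the heuristic tries to minimise.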

In an attempt to adapt a bin packing algorithm that can allocate FPGA logic area, Bazargan

[17] made modifications to the best fit and first fit algorithms. His algorithm involved

dividing the remaining free FPGA area into empty regions (sometimes referred to as holes)


and if the incoming application is able to fit into that empty region, it was marked as a

candidate for allocation. Best fit, lowest bottom side or the bottom left heuristic is then used

to determine into which empty region the application is allocated. The advantage of this

implementation is that, by not considering every possible place to which an application can be
allocated, the time complexity was reduced to O(log n) for each allocation. However, this
resulted in an average loss in allocation quality of around 7%.
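The hole-based selection step can be sketched as a best-fit choice over a list of rectangular empty regions. This is an illustration of the idea only, not Bazargan's implementation: the hole representation is assumed, and the maintenance of the hole list after an allocation (splitting the chosen hole) is omitted.

```python
def best_fit_hole(app_w, app_h, holes):
    """Among rectangular empty regions ('holes'), each (x, y, w, h),
    that can contain an app_w x app_h application, pick the one that
    wastes the least area (best fit).  Returns the chosen hole's index,
    or None if the application fits in no hole."""
    best, best_waste = None, None
    for i, (hx, hy, hw, hh) in enumerate(holes):
        if app_w <= hw and app_h <= hh:           # candidate for allocation
            waste = hw * hh - app_w * app_h       # unused area in this hole
            if best_waste is None or waste < best_waste:
                best, best_waste = i, waste
    return best
```

Swapping the waste metric for the hole's bottom edge or bottom-left corner gives the lowest-bottom-side and bottom-left variants mentioned above.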

An attempt to develop a resource allocation management tool for reconfigurable computing

was made by Eatmon [51]. He introduced the Dynamic Resource Allocation and Management

framework (DREAM). DREAM is a tool that evaluates placement (defined as allocation in

this thesis to avoid confusion with traditional logic placement) algorithms for configurable

logic devices. Incorporated into DREAM are three placement algorithms, best-fit, first-fit and

random placement. However, the results reported by the author for these algorithms are
unacceptable for use in a runtime system. For example, the total execution time for the First-fit

algorithm averaged around 11 seconds, for the Best-Fit algorithm around 6 seconds and for

the random placement around 4 seconds. These total execution times place too much of an

overhead onto any runtime system that would use such algorithms.

In this section it was shown that there are some promising algorithms for FPGA logic area

allocation in both the reconfigurable computing and other research domains. However high

complexity and the quality of the allocation in terms of wasted area on the FPGA is a problem

for some algorithms. There appears to be a trade-off between the quality of allocation and the
execution runtime, and no one has yet attempted this type of analysis. The allocation problem
has only been considered in isolation here, and the ability to partition applications to overcome
blocking (where an application cannot be allocated unless split up) has not been considered.

Hardware partitioning for reconfigurable computing is reviewed in the next section.

2.3.2 Partitioning

Logic partitioning is traditionally used to split an application into equal sized parts when it

can not fit onto target devices. In a runtime environment, it is envisaged that logic partitioning

would be used to divide an application into numerous parts of different sizes. This is because

in a runtime system as applications are loaded and removed from the FPGA, the area becomes

fragmented (distributed non-contiguously). Instead of waiting for contiguously available area

to configure the application onto, it is partitioned into specified sizes that match what is


currently available. The benefit of this is that it should overcome the possibility that large or

odd shaped applications may be starved of execution time because they do not fit in their

original form onto the allocated space. As logic partitioning has been an active area of

research for at least the last 25 years, there have been numerous proposed solutions.

According to Alpert [2], logic partitioning algorithms can be divided into four major

categories: move-based approaches, geometric representations, combinatorial formulations,

and clustering approaches. However move-based approaches dominate the research literature

primarily because the algorithms are very intuitive and simple to describe and implement.

Those that have the potential to be used in conjunction with a runtime system will be outlined

below.

Generally, an algorithm is move-based if it iteratively constructs a new solution based on the

previous history. In 1970, Kernighan and Lin (KL) [83] described an algorithm that involves

iteratively swapping pairs of neighbourhood modules with an objective function of

minimising the cut-size, that being the number of nets connected to nodes in both partitions.

A simple implementation of KL requires O(n³) time per pass. Fiduccia and Mattheyses (FM)

[57] modified the KL algorithm and reduced the time per pass to linear in the size of the

netlist. The key to the speed up was the bucket data structure used to find the best node to

move. Instead of using the greedy improvement approaches described above to minimise the

cut-size, Kirkpatrick et al. [84] introduced Simulated Annealing (SA). This involves picking a
random neighbour of the current solution and moving to it if it represents an improvement, or
with a temperature-dependent probability even if it does not. It can be shown that over time

it will converge to a globally optimum solution given an infinite number of moves and a

temperature schedule that cools to zero slowly. The authors of [75] conclude that SA is a

competitive approach when compared to KL in specific circumstances; however multiple runs

of KL with random starting solutions may be preferable in others.
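The cut-size objective and the move-based gain computation shared by KL and FM can be sketched as follows. This is a minimal illustration under an assumed representation (nets as node sets, partition membership as a 0/1 map); it naively recomputes the cut-size for each candidate, whereas FM's bucket data structure is precisely what avoids this recomputation.

```python
def cut_size(nets, side):
    """Number of nets with nodes in both partitions; `nets` is a list of
    node sets and `side[v]` is 0 or 1."""
    return sum(len({side[v] for v in net}) == 2 for net in nets)

def best_move(nets, side, locked):
    """One FM-style step: the unlocked node whose move to the other
    partition gives the largest reduction in cut-size (its 'gain')."""
    base = cut_size(nets, side)
    best, best_gain = None, None
    for v in side:
        if v in locked:
            continue
        side[v] ^= 1                       # tentatively move v
        gain = base - cut_size(nets, side)
        side[v] ^= 1                       # undo the move
        if best_gain is None or gain > best_gain:
            best, best_gain = v, gain
    return best, best_gain
```

A full pass would apply the best move, lock the moved node, and repeat until all nodes are locked, keeping the best cut seen.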

However FPGA partitioning poses different challenges than Min-Cut partitioning due to the

hard size and pin constraints in mapping onto these devices. FPGAs also have variable sized

partitions, and as such partitioning with the objective of minimising the number of
communication wires may not be adequate, as the partition may not fit into the desired
location. Woo and Kim [136] proposed a k-way extension to the FM algorithm which has the

objective function of minimising the maximum number of I/O pins used in the device. This

algorithm is similar to FM in that modules are swapped until an objective function is reached,


but many more modules may need to be examined before finding a feasible solution. Kuznar

et al. [88] applied FM bi-partitioning to address the common multiple FPGA device

partitioning problem. In this algorithm, given a number of devices and modules in the circuit,

an integer linear program can be solved to find a set of devices that yields a lower bound on a

cost. It is possible that this solution can be mapped onto the devices while still satisfying the

I/O pin constraints. However logic partitioning algorithms with these objective functions are

not suited to a runtime environment because they partition into fixed sizes. In a runtime

environment the application may need to be divided in numerous different geometrical

dimensions.

A possible way to represent a circuit is to describe it according to a directed acyclic graph.

The nodes in the graph represent computation while the edges represent the communication

between the nodes. Purna [109] introduced the concept of temporal partitioning of directed

acyclic graphs. Temporal partitioning partitions and schedules a data flow graph into

temporally interconnected subtasks. Given the logic capacity of the configurable computing

unit, temporal partitioning will partition the circuit k-way such that each partitioned segment

will not exceed the capacity of the configurable unit. Scheduling then assigns an execution

order to the partitioned segments so as to ensure correct execution. This algorithm might be

suitable for a runtime environment if the temporal dimension was replaced by a geometric

one.
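The core of temporal partitioning — walking the task graph in a dependency-respecting order and cutting a new segment whenever the device capacity would be exceeded — can be sketched as a simple greedy pass. This is an illustrative simplification, not Purna's algorithm; node names and areas are assumed, and an oversized single node would in practice need further splitting.

```python
from graphlib import TopologicalSorter

def temporal_partition(area, deps, capacity):
    """Greedy temporal-partitioning sketch.  `deps[v]` is the set of
    predecessors of node v in the data flow graph; `area[v]` is its
    logic area.  Returns segments executed one after another, each
    fitting within the configurable unit's capacity."""
    order = TopologicalSorter(deps).static_order()  # predecessors first
    segments, current, used = [], [], 0
    for v in order:
        if current and used + area[v] > capacity:
            segments.append(current)    # close the full segment
            current, used = [], 0
        current.append(v)
        used += area[v]
    if current:
        segments.append(current)
    return segments
```

Scheduling then simply executes the returned segments in list order, which is correct because the topological order guarantees every node's predecessors appear in the same or an earlier segment.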

In this section it was shown that although there have been numerous allocation and logic

partitioning algorithms proposed and implemented, very few of them are suitable for a

runtime environment for a reconfigurable computer. As applications are dynamically arriving,

the two dimensional FPGA area must be allocated to incoming applications so as not to affect

the other executing applications. When applications are loaded and removed, holes of various

shapes and sizes are created, and therefore the partitioning algorithm must be able to divide an
application into variously sized parts.


2.4 Reconfigurable computing design flow

This section reviews the current application design environment for reconfigurable computing

platforms that consist of a medium grained FPGA loosely coupled to a standard

microprocessor. It will be shown that at present this environment assumes that the designer

will carry out resource allocation on the FPGA at design time. Where dynamic

reconfiguration is supported, the designer must still do resource allocation because the

reconfiguration involves interchanging cores with identical resource requirements. Very little

support exists for the adaptive integration of software and reconfigurable hardware as would

be needed if runtime reconfiguration were to be widely used in reconfigurable applications.

2.4.1 Traditional design flow

An application that targets a reconfigurable computer with an I/O system bus style coupling

consists of two parts: a hardware circuit, or bitstream, that configures the FPGA, and a

software host program that interacts with the platform. These two files combined form a

reconfigurable computing application. The hardware circuit design methodology for a

reconfigurable computing application has been adopted from the VLSI domain and is shown

in Figure 9. It consists of three major stages: design entry, technology mapping and place and

route.

Design entry primarily involves describing how the circuit will behave but as a consequence

also involves allocating the necessary FPGA I/O pins. There are two common ways to do this,

using schematic capture or using a hardware description language. Schematic capture uses a

computer aided drawing package to describe the circuit. It involves the selection of

components from a library, connection of components’ input and output wires, and naming

and commenting of the components. As shown in Figure 9, schematic capture results in an

implicit entry of a netlist and does not require synthesis. However, a disadvantage is it may

require an early selection of the target technology [101]. Schematic capture was popular a

decade ago when hardware circuits were several thousand gates large. As the number of gates

that make up circuits increased, it became very difficult to use as it was too time consuming to

lay out large circuits at gate level.


Figure 9: Hardware Circuit Design Methodology. (Design entry — via a hardware description
language describing gate-level or register transfer level functionality, or schematic capture in
a CAD package — is followed by logic synthesis and target library mapping to produce a gate
level netlist; technology mapping maps the netlist to the vendor specific architecture;
placement and routing create the circuit layout; and finally the bitstream that will configure
the FPGA is created.)

A hardware description language (HDL) can be a more abstract design entry method. It

includes many of the elements known from programming languages, such as data operations
(addition, subtraction, etc.) and control operations (if, case, etc.). Two traditional HDLs are Very High

Speed Integrated Circuits Hardware Description Language (VHDL) [7] and Verilog. These

languages allow circuits to be described at either the behavioural or structural level.

Behavioural descriptions involve abstract definitions of system functionality as register

transfer level (RTL) whereas structural descriptions involve gate level connections. The

description then undergoes the process of synthesis, which involves mapping the circuit to a

netlist.

A new set of HDLs to create hardware circuits has recently become popular. These HDLs

contain subsets of common software programming languages such as C [16]. They use similar

syntax and are extended to support hardware circuits. Examples of such languages include

Handel-C [30], System-C [19], and Hardware Join Java [68]. An advantage of these

languages as compared with the traditional HDLs is the ability to reduce the design time.

Traditional design involves prototyping the algorithm in software or behavioural VHDL and


then translating it into register transfer level VHDL or Verilog; a process that can introduce

errors and requires extensive debugging. These new HDLs may avoid these problems as there

is no need to prototype in software or behavioural VHDL because the language is already

software-like. Another advantage of these new HDLs is students with limited or no VHDL

experiences are able to develop hardware circuits. It was shown in Loo [93], students with

limited VHDL experience were able to develop hardware applications such as a parallel filter

within weeks. However, a disadvantage of Handel-C for example, is it often requires more

area than what a VHDL implementation would. Loo [93] showed that a DES encryption in

Handel-C required five times more area than the corresponding VHDL implementation.

Once a detailed gate description of the circuit has been created, it must be translated to the

actual logic elements of the FPGA. This stage is known as technology mapping and is

dependent on the exact target architecture. With a lookup table (LUT) based architecture, the

circuit is divided into a number of small sub-functions, each of which can be mapped to a

single LUT. The resultant blocks are then allocated to a specific location within the hardware,

often close to communicating blocks to minimise routing delays, in a process known as

placement. The communicating blocks are allocated and wired together by configuring the

appropriate routing matrices in a process known as routing. These two processes are often

combined and are known as place and route as the placement of the circuit will directly affect

the quality of routes made. Floor-planning is often used as part of the current design flow as it

can reduce the time required to complete the place and route phase. However, floor-planning

assumes the designer knows what resources will be available [54]. At this stage in the design

methodology, the timing and behaviour of the circuit can be analysed to verify that it meets

the minimal operating speed constraints. After successful timing verification, the hardware

design process is complete and a bit sequence (commonly known as a bitstream) is generated.

Once a bitstream has been produced, a software host program must be developed to load it

onto the surface of the FPGA. A typical flow of this host program would be to set the clock

rate, configure the FPGA and perform the necessary I/O between the host computer and

reconfigurable computing platform. It uses a combination of the platform device driver and

associated application programming interface to perform these required management tasks.

After both the hardware circuit and software host program have been developed, the software

host program is used to load the hardware circuit onto the FPGA, perform the desired input


and output between the FPGA application and host computer, and finally close the

reconfigurable computer after use, leaving it in a known state. To guarantee the application

will not be interrupted by other applications modifying the hardware, the software host

program and platform device driver block subsequent attempts to reconfigure the platform.
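The typical host-program flow described above can be sketched as follows. The driver object and its method names (open_board, set_clock, configure, write, read, close) are hypothetical illustrations of a platform API, not the interface of any particular vendor's device driver.

```python
def run_application(bitstream_path, input_data, driver):
    """Sketch of a software host program for a loosely coupled
    reconfigurable computer: set the clock, configure the FPGA, perform
    I/O, and leave the platform in a known state."""
    board = driver.open_board()                # acquire exclusive access
    try:
        driver.set_clock(board, mhz=40)        # set the design clock rate
        with open(bitstream_path, "rb") as f:
            driver.configure(board, f.read())  # load the bitstream onto the FPGA
        driver.write(board, input_data)        # stream input to the circuit
        return driver.read(board)              # stream the results back
    finally:
        driver.close(board)                    # leave the platform in a known state
```

Holding the board handle for the whole run is what blocks other applications from reconfiguring the platform mid-execution.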

However with dynamic runtime reconfiguration becoming a feature in most modern FPGAs,

additional bitstreams need to be configured onto the FPGA at the same time. This has

resulted in changes to the traditional design flow.

2.4.2 Runtime application design flow

There has been considerable research conducted into improvements in design methodologies

for reconfigurable computing applications so they can enable the use of runtime and dynamic

reconfiguration. Hadley and Hutchings [64] described a methodology for implementing

runtime systems that partially reconfigure FPGA devices. It involved maximising the static

circuitry and minimising the dynamically changing circuits. Shirazi et al. [116] described a

method that automates the identification and mapping of reconfigurable regions in runtime

reconfigurable designs. This involved identifying possible components for reconfiguration, a

sequence of conditions for activating an appropriate component, and optimising the

successive components based upon reconfiguration time, operation speed, and design size.

Vasilko et al. [129] introduced a design methodology for partial runtime reconfiguration

composed of two phases, the design of the static portion using the traditional design flow and

the partitioning, scheduling and allocating of the runtime part of the application. They stated

that their methodology reduced the number of design flow iterations, resulting in shorter

design time and high-quality results.

An application note [138] from Xilinx (a major manufacturer of dynamic runtime

reconfigurable FPGAs) outlined how the traditional design flow had been modified to support

module based partial reconfiguration. The design methodology is as follows. The design is

first described with a traditional HDL in conformance with a set of partial reconfiguration

guidelines. It had been noted earlier in the thesis that for partial reconfiguration to be

successful on their architecture, numerous constraints had to be placed on the designer. The

major one is that partial reconfiguration is column based: modules should extend
column wise and not row wise. Further constraints are also placed on the HDL coding,
including: a top level design limited to I/O, clock logic, and the instantiation of the bus
macros; each reconfigurable module defined as a self-contained block; bus macros used only
for communication between modules; all clock resources defined to use dedicated global
routing; and no module directly sharing signals with other modules, except clocks. Following
design entry, a floor-plan is then constructed describing the position on the

FPGA where each of the modules in the application will be configured. The standard place

and route tools are then run on each of the modules as well as each configuration of a

particular module in the application. An initial bitstream for the full design is then created,

followed by individual bitstreams for each reconfigurable module. The bitstreams are then

configured onto the FPGA by a software host program via the SelectMAP interface.

All of these reported methodologies are built around the traditional sequential design flow

described in section 2.4.1 and can only be used when the order in which applications execute
is known prior to runtime. Cores can not be pre-placed and pre-routed and then relocated at

runtime. However Dyer [50] proposed that through the use of direct bitstream manipulation

and the Xilinx JBits SDK [62] applications could be relocated and connections could be

rerouted online, opening up future dynamic applications. The current SDK of JBits does not

support combinatorial and sequential synthesis, timing-driven placement, or advanced

routing. In our experiments, JBits also did not scale to circuits beyond a very small size.

In this section it was shown that the traditional design flow is not suitable for developing

applications that use dynamic resource allocation. However through various academic and

commercial design methodologies it is now possible to develop applications that use dynamic

reconfiguration if all of the applications are known prior to configuring the FPGA. The

literature lacks a design methodology that supports applications that arrive dynamically.


2.5 Applications and benchmarks for reconfigurable computers

Many reconfigurable computing applications are composed of a combination of hardware and

software. This has the advantage of being able to exploit the algorithmic speedup that

hardware can provide as well as the flexibility software gives. Since the introduction of the

first commercially available FPGA by Xilinx in 1984, there have been numerous applications

proposed and built. A selection of the many published reconfigurable computing applications

that have been implemented on a medium grained FPGA loosely-coupled reconfigurable

computer are shown in Table 4.

Cryptography | Signal/Image processing            | Communications             | Other
-------------|------------------------------------|----------------------------|------------------------------------
DES [80]     | Cordic [99]                        | IPsec [43]                 | Searching and Sorting
AES [97]     | Automatic target recognition [131] | Reconfigurable Router [67] | Boolean Satisfiability (SAT) [145]
IDEA [91]    | Edge detection [10]                | LZ Data compression [70]   |
             | Convolution [113]                  |                            |

Table 4: Summary of common reconfigurable computing applications

These selected applications can be broadly classified into the application domains of cryptography, demonstrated through various implementations of the commonly used single and triple Data Encryption Standard (DES) [80], the Rijndael or Advanced Encryption Standard (AES) [97], the International Data Encryption Algorithm (IDEA) [91] and LZ data compression [70];

communications, through the implementations of the IP security protocol [43] and

reconfigurable routers [67]; signal processing, with implementations of cordic [99], automatic

target recognition (ATR) [131], edge detection [10], convolution [113], and software radios

[48]; and searching, through the Boolean satisfiability (SAT) algorithm [145]. All of these

applications are well suited to reconfigurable computing as the algorithms can be heavily

parallelised resulting in considerable speedups in execution runtime. Most of them can also be

written with a pipeline like structure which allows data to be streamed from the input device,

usually the microprocessor, into the application and then streamed out of it. This type of I/O


architecture minimises the amount of onboard or on-chip memory needed to store and hold

large amounts of input data. This is seen as an advantage as most reconfigurable computers

have limited onboard memory (often less than 128 MB) and on-chip memory (often less than

10 Mbits). It also suits the loosely coupled reconfigurable computing architecture selected for use in an operating system for a reconfigurable computer, because of the high bandwidth coupling between the microprocessor and FPGA.
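The pipelined, streaming style of computation described above can be modelled in software. The following Python sketch is illustrative only (the stage names and operations are assumptions, not taken from the thesis); it shows how data streams through a chain of stages so that each stage buffers only one item at a time rather than the whole input:

```python
# A minimal software model of a pipelined, streaming application:
# each stage consumes a stream and yields results, so large input
# sets never need to be held in on-chip or onboard memory at once.

def source(data):
    for item in data:          # input words streamed in (e.g. from the microprocessor)
        yield item

def stage_scale(stream, k):
    for x in stream:           # one pipeline stage: a single operation per word
        yield x * k

def stage_offset(stream, c):
    for x in stream:           # a second stage, chained to the first
        yield x + c

def run_pipeline(data):
    # Compose the stages the way circuits would be chained on the FPGA.
    return list(stage_offset(stage_scale(source(data), 2), 1))
```

Because each generator pulls one item at a time, the memory footprint is per-stage, mirroring the small-memory advantage of streamed FPGA pipelines.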

Unlike in the software community, the reconfigurable computing domain lacks benchmarks

that can be used to compare the performance of these applications. Although benchmarks for

general purpose computers have been deeply investigated, there still appear to be very few that are

specifically designed for the area of reconfigurable computing. Two examples of these

however are the Adaptive Computer System (ACS) benchmark suite [87] and the

Reconfigurable Architecture Workstation (RAW) [11] benchmark suite. The ACS benchmark

suite has been designed for evaluating a configurable computing system’s architecture and

tools. Instead of using functional benchmarks, ACS used stress-marks, or benchmarks that

focus on a specific characteristic of a reconfigurable system, such as versatility, capacity, timing

sensitivity, scalability and interfacing. The RAW benchmark suite consists of 12 programs

representing general purpose algorithms including binary heap, bubble sort, merge sort, DES,

Fast Fourier Transform (FFT), game of life, and matrix multiply. The size of each benchmark

program is adjusted depending upon the capacity of the target reconfigurable hardware.


2.6 Conclusion

This chapter has presented a review of the major themes in the published literature on runtime

systems for reconfigurable computing. The important outcomes of the survey have been to

highlight the absence of an operating system like software for a reconfigurable computer, the

lack of resource allocation and partitioning algorithms that are suitable for use in an operating

system, the inability of the current design flow to produce applications to be used in

conjunction with such a system, and the need for metrics to measure the performance of the

applications and the associated system.

This raises several research questions which will be addressed in this thesis:

• Is it feasible to implement an operating system with low overheads that supports

dynamically arriving applications for a reconfigurable computer?

• What abstractions and services need to be provided by such an operating

system?

• What constraints will be placed on applications if resource allocation and application partitioning have to be completed at runtime?

• Are there suitable algorithms for these services and how will they affect the operating

system?

• What modifications will need to be made to the current design flow to adapt it so it

can produce runtime applications?

• What benchmarks and metrics are necessary to measure the performance of the applications under the operating system and of the operating system itself?

It is the purpose of the remainder of this work to provide theoretical and experimental

foundations to show that an operating system can be described and implemented for a

reconfigurable computer. Each of these questions will in turn be answered throughout this

thesis.


3 Methodology

In the previous chapter it was shown that there are quite significant gaps in the literature

regarding the runtime management of reconfigurable computing applications. A summary of

these gaps is stated below and in this thesis research contributions will be made to address

them.

1. There is no agreed list of abstractions that should be used in an operating system for

reconfigurable computing (section 2.2.1).

2. Current design flows have little support for dynamic reconfiguration with resource

allocation (section 2.4).

3. Algorithms for runtime resource allocation and runtime application partitioning have

not been deeply investigated in the reconfigurable computing domain (section 2.3).

4. There is no prototype runtime system for reconfigurable computing that demonstrates

runtime resource allocation and partitioning (section 2.2.2).

5. There has been little discussion of metrics that might be used to evaluate the

performance of an operating system for reconfigurable computing (section 2.5).

6. There have been no evaluations of the effect an operating system environment will

have on reconfigurable computing application performance (section 2.5).

In this chapter the methodology that will be used to address these gaps will be outlined. It is

based on previous work of Crnkovic [42], in which he suggested that a methodology consists of

following a path of categorising the research question, selecting a strategy that will result in

the question being answered, and choosing a validation technique to verify the results

obtained. For each stage he suggested five different types of research questions and these are

summarised in Table 5. As an example, a type of question being proposed is “how to do X?”;

a strategy is then selected to provide an answer to the question; in this case it could be any of

the 5 shown in Table 5. In the final stage, a validation technique is selected to verify the



results obtained from the strategy. The selected validation technique depends upon the

strategy used.

Question: Feasibility. Does X exist and what is it? Is it possible to do X at all?
Strategy/Result: Qualitative model. Report interesting observations; generalise from examples; structure a problem area.
Validation: Persuasion. "I thought hard about this, and I believe ..."

Question: Characterisation. What are the characteristics of X? What exactly do we mean by X? What are the varieties of X and how are they related?
Strategy/Result: Technique. Invent new ways to do some tasks, including implementation techniques; develop ways to select from alternatives.
Validation: Implementation. "Here is a prototype of a system that ..."

Question: Method/means. How can we do X? What is a better way to do X? How can we automate doing X?
Strategy/Result: System. Embody the result in a system, using the system both for insight and as a carrier of results.
Validation: Evaluation. "Given these criteria, the object rates as ..."

Question: Generalisation. Is X always true of Y? Given X, what will Y be?
Strategy/Result: Empirical model. Develop empirical predictive models from observed data.
Validation: Analysis. "Given the facts, here are the consequences."

Question: Selection. How do I decide whether X or Y?
Strategy/Result: Analytic model. Develop structural models that permit formal analysis.
Validation: Experience. Report on use in practise.

Table 5: A research methodology suggested by Crnkovic

The research undertaken in this thesis has been divided into four chapters: abstractions, architecture and design flow; resource allocation and application partitioning; operating system prototype and metrics; and performance evaluation. For each chapter, a methodology that will address the associated research questions is constructed from Crnkovic’s [42] work above; these methodologies are shown in Table 6. Figure 10 summarises the four methodologies used in this

thesis. For each methodology, the previous work that is drawn upon and the expected

deliverables resulting from the research are shown. The remainder of this chapter is structured

into four sections, with each directly relating to a future chapter in this thesis. In each section,


the research questions that will be addressed are stated, the methodology that will be used to derive answers for them will be presented, and the resultant deliverables will be outlined.

Methodology                                                   Question          Strategy           Validation
Chapter 4: Abstractions, architecture and design flow         Feasibility       Qualitative model  Persuasion and implementation
Chapter 5: Resource allocation and application partitioning   Method            Technique          Evaluation
Chapter 6: Operating system prototype and metrics             Method            Technique          Implementation
Chapter 7: Performance evaluation                             Characterisation  System             Evaluation

Table 6: Methodology paths used in this thesis


Figure 10: A summary of the methodology used in this thesis



3.1 Abstractions, architecture and design flow

When designing abstractions, an architecture and a design flow for use in conjunction with an operating system, there are two questions that need to be addressed. These are as follows:

1. Is it feasible to define abstractions and an architecture to support runtime resource

allocation for reconfigurable computing?

2. Is it feasible to design applications for an operating system using the current tools and

design flow?

Both these questions are categorised as feasibility and the methodology chosen to address

them involves developing a qualitative model that will be validated through a combination of

persuasion and implementation. In the first question there will be two parts in developing this

qualitative model. A uniqueness and analogy investigation between reconfigurable computing

and the software based architecture will be conducted. This will result in a list of abstractions

for a reconfigurable computing operating system. Based on these abstractions the architecture

for an operating system will be derived. This architecture will then define the requirements of

the algorithms that provide these abstractions. The list of abstractions will be validated

through the use of persuasion and the architecture will be validated via an implementation.

The second question will involve investigating whether it is possible to modify the current

design flow and tools to develop applications for use in conjunction with an operating system

environment. This will be validated through a qualitative discussion.

3.2 Resource allocation and application partitioning

As a result of the research carried out in the previous section, a set of algorithm specifications

for both area allocation and hardware partitioning will have been derived. An area allocation

and partitioning algorithm that satisfy those specifications and meet selected criteria will then

be chosen for use in the operating system. This stimulates the following question:

1. How is area allocation and application partitioning performed in conjunction with an

operating system for reconfigurable computing?

This type of question is categorised as method; the strategy of technique was chosen to

provide a solution to the question and an evaluation will be used to validate the results. The


strategy technique involves initially undertaking a survey of the research literature in other

domains to see if either area allocation or hardware partitioning algorithms that meet the

specifications have already been proposed. From this, both allocation and partitioning

algorithms that might be suitable will be selected. These algorithms will then be sorted based

on their complexity and runtime performance. The higher ranked algorithms will then be

adapted to suit the operating system architecture proposed in the previous section. The best

performing allocation and partitioning algorithm will then be selected for use in the operating

system. This research will be validated through evaluation where the most suitable algorithms

will be evaluated against criteria to determine the allocation and partitioning algorithm that

performs the best.

3.3 Operating system prototype and metrics

Once the architecture and algorithms of the operating system have been determined, a set of

metrics are needed to measure the performance of the applications. This raises the following

question:

1. How can the performance of the applications under operating system control be

measured?

This question can be categorised as method; the strategy of technique will be used to address

it, and it will be validated through implementation. The strategy of technique will involve

selecting a set of metrics that will measure the impact of the introduction of an operating

system on the user and application performance. This will be achieved by reviewing the

current research literature to determine exactly what application designers perceive the

performance characteristics of their applications to be. The most important performance

characteristics will then become the metrics that measure the performance associated with the

operating system. The research results will be validated through an implementation prototype

of the operating system and by executing some popular applications on the operating system.

3.4 Performance evaluation

The result of the research conducted in the previous section will be a set of metrics that can be

used to characterise the performance of a prototype operating system. The research question

that needs to be addressed in this section is:


1. What effect does this operating system have on application performance?

2. How quickly does the prototype respond to user interaction?

3. Are there any relationships between the results obtained from the experiments?

All of these questions are categorised as characterisation; the strategy of system will be used

to address them, and the results will be validated through an evaluation. To address these

questions via the strategy of system involves initially creating a test environment so the effect

the operating system has on application performance and user interaction can be measured.

This test environment will incorporate a selected benchmark application, a series of test cases,

and a prototype implementation as a test bed. To verify the result from the strategy of system,

an evaluation to determine if there are any relationships between any of the measured metrics

will be performed. If so, an attempt to derive formulas for predicting the correlation will be

undertaken.

3.5 Conclusion

In this chapter it was explained how the research conducted in this thesis will be divided into

four chapters. For each chapter, the research questions being proposed were presented,

methodologies to address these questions were put forward and the associated deliverables

were outlined. In the remainder of this thesis, these methodologies will be executed with the

aim of filling the research gaps that were exposed in chapter 2.


4 Abstractions, architecture and design flow

It was highlighted in the literature review that there appears to be no agreed set of

abstractions, architecture or design flow for a reconfigurable computing operating system.

Therefore a set of abstractions, architecture, algorithm specification and new design flow will

be presented in this chapter. A summary of the previous work, methodologies and

deliverables associated with this chapter are shown in Figure 11.

[Figure 11 diagram: previous work (software operating systems, FPGA technology, existing algorithms, existing design flow); methodology (uniqueness analysis, analogous analysis, architecture design process, design flow analysis); deliverables (operating system abstractions, architecture design, specifications of algorithms, new design flow)]

Figure 11: The previous work, methodology and deliverables associated with this chapter


The chapter is divided into four parts, with each section associated with a specific deliverable.

In the first section, a set of abstractions that suit a reconfigurable computing operating system

will be presented. This will be achieved through a survey of software operating systems and

reconfigurable computing technology, combined with a qualitative approach based on analogy

and uniqueness. The selected abstractions will define what the architecture of the operating

system must implement; the architecture is therefore presented in the second section. In the third section,

the specifications of the algorithms that implement the resource allocation and partitioning


components of the architecture will be defined. In the final section, the implications of the

new operating system architecture and its underlying abstractions for the design of

reconfigurable computing applications are investigated. This is combined with previously

published design flow research to result in a new design flow for application development

within an operating system environment.


4.1 Abstractions

Abstraction is a design technique that focuses on the essential aspects of an entity and ignores

or conceals less important ones [76]. It is an important tool for simplifying a complex

situation to a level where analysis, experimentation, or understanding can take place. It has

long been associated with classical software operating systems. A widely accepted set of

software operating system abstractions includes the process, the address space, and inter-process communication [120]. These abstractions define the architecture and algorithm

specifications that will ultimately be implemented in the operating system. In an operating

system for reconfigurable computing, a generally agreed set of abstractions is yet to appear

(see 2.2.1). Therefore before any architecture or algorithm specification can take place, a set

of abstractions and resulting services needs to be selected.

In this section, a set of abstractions for a reconfigurable computing operating system will be

defined. This will be achieved by drawing an analogy from the software operating system

domain and examining unique features of a reconfigurable computer. For abstractions that

already exist in the software domain (process, address space and inter-process

communication) the investigation aims to find out if it can be transferred to the reconfigurable

computing domain. If there are unique features preventing direct transfer, this work attempts

to accommodate these unique features.

4.1.1 Process abstraction

Early software computer systems allowed only one program to be executed at a time. This

program had complete control of the system and had sole access to that subset of the system’s

resources which it was authorised to access. This resulted in the notion of a process shown in

Figure 12. A process is defined to be a sequential program in execution and is composed of

the object program (or code), the data on which the program will execute, any resources

required by the program and the status of the program’s execution.
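The classical process record just described can be sketched as a simple structure. The following Python sketch is an illustration only; the field names and values are assumptions, not an API from the thesis:

```python
from dataclasses import dataclass, field

# A sketch of the classical process notion: the object program (code),
# the data it operates on, the resources it holds, and its execution status.

@dataclass
class Process:
    code: bytes                                      # the object program
    data: bytes                                      # data the program executes on
    resources: list = field(default_factory=list)    # e.g. open files, devices
    status: str = "ready"                            # e.g. ready / running / blocked

p = Process(code=b"\x90\x90", data=b"")
```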


Figure 12: Software Operating System Process

Analogy and uniqueness

As was the case in the microprocessor computer system, reconfigurable computing needs to

evolve into a multiple application environment to better utilise the hardware resources. If

multiple applications are loaded onto the reconfigurable computer, the hardware circuit, the

application data, and any resources the application is using all need to be associated with the

particular application. However, the software process abstraction cannot be transferred

directly to a reconfigurable computing operating system as there are three unique features of

reconfigurable computing applications preventing it.

Firstly, there can be no exact counterpart of context switching program code. In the software

process abstraction, the program code is a set of sequential instructions which can arbitrarily

be divided into equal sized parts. However, in a reconfigurable computer, the “program”

consists of a two-dimensional logic circuit that is commonly loaded in its entirety onto an

FPGA for execution. Partitioning hardware is far more computationally complex than

partitioning sequential software, unless the circuit has been arranged to facilitate this at design

time.

Secondly, maintaining the process state of a reconfigurable computing application is much

more complex than a software one. When a software process is swapped off a microprocessor,

the operating system performs a context switch to ensure the current state of the process is

maintained. This involves saving the values of a fixed number of registers, often including the


process number and a program counter. This procedure ensures the program can be loaded

with the same state at a later date. However if a circuit is swapped off an FPGA, there are not

a fixed set of registers that can be saved to ensure it can be reloaded with the same state at a

later date. This can only be achieved by saving all state holding elements of the circuit,

thereby resulting in a variable sized process state, much different to that of a software context. The traditional software process abstraction is unable to hold a variable sized state and thus cannot be used as the reconfigurable computing process abstraction.
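The contrast between the fixed-size software context and the variable-sized hardware context can be made concrete. In the following Python sketch (all structures are illustrative assumptions, not from the thesis), the software save always returns the same named registers, while the hardware save grows with the number of state-holding elements in the circuit:

```python
# Fixed-size software context versus variable-sized hardware context.

def save_software_context(regs):
    # Fixed size: always the same named registers, regardless of the program.
    return {"pc": regs["pc"], "sp": regs["sp"], "r0": regs["r0"]}

def save_hardware_context(circuit_state_elements):
    # Variable size: one entry per flip-flop or memory bit in the circuit,
    # so the saved state depends on the application being swapped off.
    return list(circuit_state_elements)

sw = save_software_context({"pc": 100, "sp": 0xFF00, "r0": 7})
hw = save_hardware_context([0, 1, 1, 0, 1])   # size depends on the circuit
```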

Thirdly, in a software process, the program data associated with the object code is always

contained in a separate part within that process. However, in a reconfigurable computing

application, instructions are circuits and the division between data and computational

elements is less clear. Three commonly used locations where data can be stored in a

reconfigurable computing application is one, in the lookup tables [41], two, in the block

memory [140] commonly distributed around the edge of the FPGA, and three, in external data

RAM[32]. Most modern reconfigurable computing platforms have memory attached directly

to the I/O pins of the FPGA and data can be streamed into the circuit via an on-chip memory

controller. The classical software process abstraction is unable to represent all these forms of

data storage.

Due to these three unique features of reconfigurable computing applications, the software

process abstraction can not be transferred to the reconfigurable computing domain without

modification. An investigation into what has previously been documented in the research

literature that potentially modifies the software based abstraction to overcome these unique

features will now be undertaken.

Survey of literature

Although dividing hardware circuits and swapping parts during execution is usually avoided

because of the difficulty of saving state and the loss in performance, logic partitioning can be

performed with minimal loss in performance if an application is designed to support it. There

have been several suggested ways to structure a circuit so logic partitioning can be performed

within the operating system with minimal loss in performance.

Firstly, the circuit could have a fixed structure composed of smaller equally sized circuits that

when arranged in a particular geometric alignment would be logically equivalent to one large


circuit. An example of where this type of structure is implemented is in the Virtual Hardware

Operating System [24]. Brebner introduced the idea of describing a circuit as a collection of

swappable logic units (SLUs), or “hardware pages”. This allows parts of the circuit or SLUs

to be more easily swapped in and out of the hardware on demand, similar to how a software

page is swapped in and out of memory. However, there are problems with an SLU like

structure. Partitioning a significantly large circuit into many small sized partitions can impact

of the performance. Partitioning it into few large SLUs can lead to internal fragmentation of

the area. It would seem more appropriate to partition the application into parts that are

variable in size that follow the natural structure of the application.

Secondly, the circuit could be structured according to a data flow graph (DFG) [44] as was

demonstrated in the adaptive multi-user online reconfigurable engine AMORE [112]. A data

flow graph can be viewed as an abstract circuit representation without clock signals or timing

information, where the nodes represent operations and the edges represent data paths. In

particular the nodes of the graph can be either simple operations such as adders, bit shifts, or

memory read and writes; or complex operations such as floating point division or multipliers.

An example is shown in Figure 13. The advantage of modelling a circuit according to a data

flow graph is the nodes on the graph do not have a fixed size area. A data flow graph based

circuit can be partitioned if there is insufficient area available on the FPGA for the entire

circuit to be loaded in one location.

[Figure 13 diagram: an addition node with inputs 1 and 2 feeds a multiplication node with input 3, producing Y = (1 + 2) x 3]

Figure 13: Data flow graph
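The graph of Figure 13 can be captured in a few lines. The following Python sketch is an illustration of the DFG idea (the representation is an assumption, not the thesis's data structure): nodes are operations, edges are data paths, and evaluation flows from inputs to the output Y = (1 + 2) x 3:

```python
import operator

# Operation nodes supported by this toy DFG.
OPS = {"+": operator.add, "x": operator.mul}

# Each node maps to (operation, inputs), where an input is either a
# literal value or the name of another node (an edge / data path).
dfg = {
    "add": ("+", [1, 2]),
    "mul": ("x", ["add", 3]),
}

def evaluate(dfg, node):
    if not isinstance(node, str):            # a literal input value
        return node
    op, inputs = dfg[node]
    args = [evaluate(dfg, i) for i in inputs]  # follow the data paths
    return OPS[op](args[0], args[1])

y = evaluate(dfg, "mul")                     # Y = (1 + 2) x 3
```

Because no node has a fixed area and the graph has no timing information, exactly this kind of structure is what makes size-driven partitioning tractable.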


However, algorithms that partition a DFG require some intelligence to maintain the circuit’s

performance. Most of these algorithms [2] attempt to minimise the number of communication

channels required between partitions as these can increase the circuit delay. In an operating

system, not only does the number of communication links need to be minimised but the

application has to be partitioned into specified sizes that match the available space on the

FPGA. This means that some nodes may need to be aggregated together and others separated. If the DFG contains a feedback loop that is partitioned, a state holding element needs to be inserted. A more detailed evaluation of logic partitioning is deferred until later in this thesis

(see section 5.2).
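As an illustration of the size-matching problem described above, the following Python sketch greedily aggregates the nodes of a topologically ordered DFG into partitions that each fit a given area budget (the free regions on the FPGA). It ignores communication-channel minimisation and is not the algorithm selected later in this thesis; it assumes every node fits within some budget:

```python
# Greedy size-constrained partitioning of a topologically ordered DFG:
# fill one partition's area budget, then start the next.

def partition(nodes_in_topo_order, node_area, budgets):
    partitions, current, used, b = [], [], 0, 0
    for node in nodes_in_topo_order:
        if used + node_area[node] > budgets[b]:
            partitions.append(current)   # budget full: close this partition
            current, used, b = [], 0, b + 1
        current.append(node)
        used += node_area[node]
    partitions.append(current)
    return partitions

parts = partition(["a", "b", "c", "d"],
                  {"a": 3, "b": 2, "c": 4, "d": 1},
                  budgets=[5, 5, 5])
```

A real allocator would also weigh the number of edges cut between partitions, since each cut becomes a communication channel that adds circuit delay.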

If any part of the reconfigurable computing application is to be swapped off the FPGA before

it has completed execution (pre-emption), a context switch, similar to that associated with a

software program, has to be performed. The status of a hardware circuit needs to be preserved when it is removed from the FPGA surface part way through its execution (a task switch).

Previous research in task switching for reconfigurable computing applications has been

conducted by Simmler [123]. He outlined that to successfully perform task switching, the

current state of all registers and internal memories must be able to be extracted from the

circuit, all registers and memory bits must be able to be preset or reset when a circuit is

restored, the position of all state holding elements must be known in advance to perform state

extraction, and the platform clock has to be able to be completely stopped. He also described

three limitations of FPGA designs if used in a task switching environment. Firstly, if latches or registers are implemented by means of combinatorial logic, their storage can neither be read nor initialised on most FPGAs. Secondly, the design must indicate when it is safe to stop the clock and switch the task. For example, allowing a task to switch at any time can lead to a switch right after an addressing phase of the external memory; when the task is restored it will read invalid data, as the valid data would have already been presented at the memory output. Due to

these design limitations, it is felt that pre-emption in the initial prototype would place too

many restrictions onto the designer and will not be considered.

Unlike a classical software operating system where if a process has an extremely long

execution time it typically blocks all other waiting processes, on a reconfigurable computer,

the process with the long execution time only blocks a portion of the FPGA area. Other

waiting applications can be loaded onto the other available area. Thus performing pre-emption

is not as critical for a reconfigurable computing multi-user operating system.


Reconfigurable computing process abstraction

The reconfigurable computing process abstraction that will be used in the operating system in

this thesis will consist of the hardware circuit being described as a data flow graph with data

source and sink nodes inserted for simplified I/O access. Hardware circuits can be designed

according to a DFG model [94] and it simplifies I/O access as a DFG is modelled as a flow of

data. A DFG also provides support for efficient partitioning.

There are several research tools described in the literature that are able to convert a typical

hardware circuit into a data flow graph if not initially designed with one [74] [134] [89].

These tools could be used in conjunction with the standard FPGA design flow to assist in

developing applications so they can fit a DFG structure. If however the circuit can not be

modelled as a data flow graph, it can still be used with the process abstraction, although the

operating system will not be able to partition it unless the application structure comes with a custom partitioning algorithm.

The use of a DFG with inserted data source and sink nodes simplifies I/O because, as a DFG is

modelled as a flow of data, input can be loaded at one end of the graph and the output can be

obtained from the other (see Figure 14 (c)). All data transferred between the process and

external resources is passed via the data source and sink nodes. These nodes are then

interfaced to a standard communication module that will be attached to every process that

requires I/O. This provides the basis for virtual I/O in the operating system. The associated

data will be streamed into the circuit from the external memory or other processes. This could

quite easily be extended to support streaming from both block RAM (BRAM) and lookup

table configured memory in future operating system prototypes. The reconfigurable process

abstraction in this thesis will not have the ability to store the process state as the operating

system will not perform pre-emption.


Figure 14: Reconfigurable computing process abstraction

As described so far, the reconfigurable computing process abstraction performs all I/O via a

specially constructed standard communication module attached to every process (Figure

14(a)). However, a special type of process abstraction that can directly access the FPGA pins

(Figure 14(b)) for application performance reasons is now introduced. This process

abstraction is very similar to the ones described above, although it can have an alternative

source of I/O instead of, or as well as the external memory. How the process connects to the

I/O pins will be left to the designer and abstracted away from the process abstraction. Such an

extension to the process abstraction is required for applications to avoid any loss in

performance due to the introduced memory latency. This may include applications that

require large amounts of streaming data such as real-time video images in an image fusion

application. The major reason why not all processes have direct pin access is that it places a constraint on allocation. Ideally, for routing efficiency, processes that have direct I/O pin

access should be placed as close to the pins as possible. Placing this restriction on all

processes may increase the complexity of the associated allocation algorithms.


4.1.2 Address space

Most modern software operating systems support multiple pseudo concurrent processes. This

is partly achieved by processes sharing the main memory and being swapped back and forth

to the microprocessor for execution. For the main memory to be safely shared amongst

multiple processes, it must be allocated to them according to operating system policies and

then have mechanisms put in place to prevent illegal access from other processes. This led to

what is known as an address space abstraction and in a classical operating system, is a linear

set of locations used by a process to reference the primary memory locations, operating

system services, and resources (see Figure 15). The address space stores all the logical entities

used by a process and specifies an address by which they can be referenced without kernel

involvement. A process can only reference memory that has been mapped into its address

space.

Figure 15: Classical operating system address space

Analogy and uniqueness

In a reconfigurable computing environment, if there are multiple applications on the FPGA at

one time, an address space abstraction will be required to prevent hardware circuits from

accessing or modifying parts of the FPGA that may affect other executing circuits. If the

application data is stored separate from the circuit, in on-board memory for example, a

mechanism to address and protect it will also be required. These requirements are analogous

to what the software address space abstraction can provide. However, there are three unique

features of a reconfigurable computer that prevent the software address space abstraction from

being transferred without modification into the reconfigurable computing domain.


First, in a software operating system a process consists of sequential instructions and data that

are allocated into memory and accessed through a linear address space. In a reconfigurable

computer, a process will consist of a two-dimensional logic circuit that needs to be loaded

onto an FPGA and possibly data that needs to be stored into on-board memory. In the

software address space abstraction there is no concept of a two-dimensional hardware

resource.

Second, there are other resources apart from external RAM that need to be allocated to processes

on a reconfigurable computer. In a software operating system, memory is a major resource

that requires allocation. In a reconfigurable computer, CLBs, routing wires, BRAM,

multipliers and I/O pins are just some of the resources that could require allocation to

processes. Address space allocation algorithms need to be modified to suit this complex

environment.

Third, sharing a logic circuit on a reconfigurable computer is much more

difficult than sharing a software program stored in memory. The software address space

abstraction allows software located in memory to be shared between processes for read

access. This has been well demonstrated through examples such as shared libraries. However

in a reconfigurable computer, not all circuits can be shared. For example, if a circuit has the

associated data embedded into it, sharing the circuit is not possible. For a circuit to be shared

it must be able to have data streamed in externally and it must be able to be time multiplexed.

An example of a shared circuit is a memory controller that is responsible for reading and

writing a single bank of off-chip RAM for several processes. Only one process at a time can

read or write to the memory and once it is complete the next process can use the shared

circuit. These three unique features of a reconfigurable computer prevent the software-based

address space abstraction from being transferred from the software operating system domain

without modification.

Reconfigurable computing address space abstraction

The reconfigurable computing address space abstraction that will be used in this operating

system will consist of a two-dimensional address space for the FPGA and a single dimension

address space for the on-board memory as shown in Figure 16. The FPGA address space will

be represented in two-dimensions with each cell corresponding to a configurable logic block

(CLB). Each cell in the address space will initially only hold a value to represent whether the


CLB is available for allocation. This abstraction, in combination with an allocation algorithm

will provide protection as it will prevent other circuits from being allocated to occupied

FPGA locations. The on-board memory will be represented by a linear address space in

conjunction with conventional memory allocation algorithms. If the data used is not

embedded into the logic, it must come from an external source, and hence the external data

needs to be included with the logic in a single process concept.

Figure 16: Reconfigurable computing address space abstraction

The local routing resources and I/O pins will not be separately represented in this address

space but will be considered to be included within the area used exclusively by the process, as

they will be assumed to be part of a primitive architecture (to be discussed in the inter-process

communication abstraction). The process can use any local routing resources within the

bounding box of area allocated to it. However, the operating system must be able to reserve

some routing resources for inter-process communication that do not constrain where the

process can be allocated.
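The two-dimensional FPGA address space and the protection it provides can be sketched as follows (a simplified illustration in which the class and method names are assumptions; each cell stands for one CLB holding only an availability flag):

```python
# Sketch of the FPGA address space: a 2D grid of CLB cells. A circuit may be
# allocated only if every cell in its bounding box is free, which is what
# prevents one circuit from overwriting another.

class FPGAAddressSpace:
    def __init__(self, cols, rows):
        self.free = [[True] * cols for _ in range(rows)]

    def region_free(self, x, y, w, h):
        return all(self.free[r][c]
                   for r in range(y, y + h) for c in range(x, x + w))

    def allocate(self, x, y, w, h):
        if not self.region_free(x, y, w, h):
            return False                  # would overlap an executing circuit
        for r in range(y, y + h):
            for c in range(x, x + w):
                self.free[r][c] = False   # mark the bounding box as occupied
        return True

space = FPGAAddressSpace(8, 8)
assert space.allocate(0, 0, 4, 4)        # first circuit placed
assert not space.allocate(2, 2, 4, 4)    # overlapping request rejected
assert space.allocate(4, 0, 4, 4)        # disjoint region accepted
```

In later prototypes each cell could carry more than a boolean, for example the owning process identifier, without changing this basic protection check.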

4.1.3 Inter-process communication


In early operating systems, processes were able to communicate only through the use of

shared memory. With shared memory, the user writes code that accesses a special part of the computer’s address space that more than one process can reach. Data is placed in this part of the address space by one process and other processes subsequently read and use it. However, if two processes do not share the same address space, the operating system kernel must manage access. Inter-process communication (IPC) provides a mechanism to


allow processes to communicate and to synchronise their actions without necessarily sharing

some part of the address space. IPC abstractions enable a process to copy information from its

own address space, form it into a message and send the message to a receiving process which

will copy it into its own address space. This is shown in Figure 17.

Figure 17: Software inter-process communication abstraction (a message carrying the shared data is copied from the address space of Process X into that of Process Y)

Analogy and uniqueness

Processes in an operating system for reconfigurable computing also need to be able to

communicate with other processes that do not share the same address space. When an

application is partitioned, each partition becomes a new process and as these processes do not

share the same address space, a communication mechanism is needed. This concept is

analogous to the software inter-process communication abstraction in which messages are

formed into packet-like capsules of data and passed between the communicating processes.

However in a reconfigurable computer, instead of passing the packets via special files known

as ports as is the case in software, the data can be transferred either via memory, non-shared

direct hardware channels, abutment, or an on-chip network.

Performing inter-process communication via memory often involves connecting the two

communicating processes to the memory via an arbitrator and memory controller (see Figure

18). This reduces the need for the memory access circuitry to be configured into each process.

When communication between the two processes is required, one process would indicate to

the arbitrator it wishes to write data into memory. The arbitrator would grant it access,

allocate it a memory location and the data would then be loaded into memory at that location

via the memory controller. Once the write was completed, the arbitrator would indicate to the

communicating process that data is available and it would be passed onto it via the memory

controller.
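The request-grant-write-notify-read sequence just described can be sketched as follows (a minimal single-threaded illustration; the class and method names are invented for this example and the real arbitrator is a hardware circuit):

```python
# Sketch of memory-based inter-process communication: an arbitrator grants one
# writer at a time, allocates a memory location via the "memory controller",
# and notifies the receiving process that data is available.

import queue

class MemoryArbitrator:
    def __init__(self):
        self.memory = {}               # location -> data (the on-board memory)
        self.next_loc = 0
        self.pending = queue.Queue()   # (location, receiver) notifications

    def write(self, sender, receiver, data):
        # Grant access to the writer, allocate a location, perform the write,
        # then signal the receiving process that data is waiting.
        loc = self.next_loc
        self.next_loc += len(data)
        self.memory[loc] = data
        self.pending.put((loc, receiver))
        return loc

    def read(self, receiver):
        loc, dest = self.pending.get()
        assert dest == receiver        # only the intended process may read
        return self.memory.pop(loc)

arb = MemoryArbitrator()
arb.write("P1", "P2", [10, 20, 30])
print(arb.read("P2"))   # [10, 20, 30]
```

The point of the arbitrator is that neither process needs its own memory access circuitry; both share the single controller and only ever see the grant/notify handshake.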


Higher performance processes typically need to communicate directly to other processes

without the overhead introduced by the memory controller and arbitrator. This can be

achieved through non-shared hardware channels or via abutment (see Figure 18). Non-shared

hardware channels involve the use of a runtime router that can dynamically route channels

between the two communicating processes. Abutment involves collocating two processes

with a particular geometric alignment so that a standard interface creates a communication

channel just because they are placed next to each other. Once the channels have been routed,

the data can then be transferred between the two processes at a greater performance than if it

were transferred via memory.

Another mechanism that supports inter-process communication in reconfigurable computing

applications is the use of an on-chip network (see Figure 18). This involves configuring a

communication infrastructure onto the FPGA separate from all processes and then a process

can connect to the shared network. The advantage of using an on-chip network is that similar

performance may be achieved when compared to the non-shared hardware channels, but there

is more flexibility in where the processes can be allocated. There are a variety of network

topologies that can be used for the shared network, ranging from a bus network to a star

network, and these will be discussed in more detail in the next section.

Figure 18: Possible inter-process communication mechanisms: (a) Process 1 and 2 via a hardware channel; (b) Process 1 and 3 via abutment; (c) Process 2 and 3 via onboard memory; (d) Process 4 and 5 via a shared on-chip network


Although the software inter-process communication abstraction can be transferred to the

reconfigurable computing domain with little modification, an alternative to ports must be used

to transfer the data between communicating processes. A survey of previously published

communication interfaces will now be investigated.

Survey

An early attempt at inter-process communication with FPGAs via abutment was described by

Brebner [24]. He stated that for inter-process communication to be possible, each swappable logic unit (SLU) had to be of fixed size and have a communication interface built into it. Once the SLUs were placed onto the FPGA in pre-determined locations, each could communicate with the SLUs directly above, below, left and right of it via the communication interface. If an SLU wanted to communicate with an SLU that was not in one of these locations, the communication was either not possible, had to pass through intermediate SLUs until it reached the required one, or had to wait until the desired location became available.

Mignolet [100] and Yi-Ran [143] avoided some of the problems that are faced by the

abutment style of communication. They proposed the use of a shared fixed uniform mesh

packet forwarding network as shown in Figure 19. This is the most widely suggested

architecture that is described in the literature for the sharing of a single programmable logic

chip among applications that are loaded at user demand. However, it does not avoid the

difficulty of fixed sized circuits. The problem with using equal sized processes in inter-

process communication is the restrictive nature of the size of the process that can be placed,

and the possible increase in internal area fragmentation when the processes and the fixed size

area segments are not exactly the same size. It can also result in a loss in application

performance because the application is automatically partitioned into multiple processes so it

can fit into the area requirements. However, this has the advantage that the network is a fixed

location on the FPGA and does not need to be altered at runtime.


Figure 19: Processes of fixed size arranged in a fixed mesh-topology network

A consequence of the relaxation of the regular fixed size constraints is that a shared on-chip

network must be dynamically re-routable. Since routing at runtime involves online algorithms

that must have execution times that are not excessive in comparison with application

execution time, the complexity of the runtime routing must be restrained. In [81], Kearney

presented an evaluation of network topologies including bus, star, mesh, ring and tree that

might be suitable for such runtime re-routable shared networks. This is shown in Table 7. His criteria were ease of implementation, wire routing cost (i.e. some topologies require many wires to be run over large distances on the chip), concurrency, or the ability to support multiple memory banks, latency, and scalability, or how the topology performs with a substantial number of applications connected to it; all important criteria for a reconfigurable

computing operating system. He concluded that although the bus topology was clearly not the

best performer based on the criteria, the poor concurrency and latency of the topology could

be overcome through the use of multiple buses.


          Ease of          Wire routing
          implementation   cost           Concurrency   Latency   Scalability
Bus       ++               -              --            -         --
Star      ++               --             ++            ++        --
Mesh      --               --             +             +         +
Ring      ++               +              -             +/-       +/-
Tree      -                -              +             +         +

Table 7: Evaluation of network topologies
(+ favourable; - unfavourable; +/- neutral)

Reconfigurable computing inter-process communication abstraction

The software inter-process communication abstraction will be transferred to the

reconfigurable computing domain as there are no real unique features of FPGAs preventing it.

It will consist of the formation of messages and the passing of these between communicating

processes. However, instead of the use of ports, it will be supported by a pre-configured

primitive architecture. A primitive architecture is an FPGA logic design shared by several

applications and remains on the FPGA as applications are allocated and de-allocated. The

primitive architecture may be runtime reconfigured in minor ways as the needs of applications

change. The primitive architecture proposed here consists of a memory controller and shared

on-chip re-routable bus network (see Figure 20). A memory based inter-process

communication style was initially selected because it would be the easiest to implement for

the first operating system as most platforms have onboard memory. However, there appears

no real reason why the other, more direct forms of inter-process communication cannot be

implemented in future prototypes. Processes wishing to communicate data can be placed

anywhere on the FPGA and the network will be re-routed to connect to the process.


Figure 20: The on-chip network used in the reconfigurable computing inter-process

communication abstraction

4.1.4 Conclusion

In this section, through an analogy and uniqueness survey, a set of abstractions consisting of

the process, address space and inter-process communication were selected for use with a

reconfigurable computer. It was outlined that the process abstraction will consist of the

hardware circuit being described as a data flow graph with inserted source and sink nodes.

This gives the operating system the ability to stream I/O data into the application as well as

support application partitioning. The address space abstraction will consist of a two-

dimensional address space to represent the FPGA and a traditional one-dimensional address

space to address the attached memory. The inter-process communication abstraction will

consist of processes forming messages and passing them to other processes via an on-chip

network and memory controller. This type of abstraction best supports dynamically arriving

variable sized processes.


4.2 Operating system architecture

In the previous section it was demonstrated that an operating system for reconfigurable

computing should consist of three abstractions: process, address space and inter-process

communication. Although all of these abstractions exist in the software operating system

domain, they had to be modified in order to suit a reconfigurable computer. These newly

defined abstractions influence the structure and components of the operating system

architecture. In this section, the architecture for the operating system proposed in this thesis

will be developed. This will be achieved by summarising any previous attempts at suggesting

reconfigurable computing operating system architectures and from this previous knowledge,

any ideas that fit the proposed abstractions will form the basis of the new architecture. The

new architecture will then be completed by describing all the relevant components and

interactions between them.

4.2.1 Previous reconfigurable computing runtime system architectures

In chapter 2, previous research on operating system like artefacts was reviewed. In this

section lower level implementation details on runtime systems are reviewed. As there have

been several runtime systems developed for reconfigurable hardware (see 2.2.2), these have

resulted in a few simple customised architectures. The most primitive of these is the

client/server-like architecture described by Simmler et al. [123]. This architecture is composed

of only three sub-systems: a client application, a hardware management unit, and the

reconfigurable hardware itself, as shown in Figure 21. In this architecture the client

application communicates with the hardware management unit, which in turn converts the

request into the specific platform API and then directly passes it to the hardware. This type of

architecture is very primitive and not really suited to the operating system proposed in this

thesis as it only performs simple tasks such as the configuration of the FPGA and

management of I/O. It does not provide support for dynamic resource allocation, application

partitioning or inter-process communication.


Figure 21: Client-Server model architecture (client applications communicate with the server/hardware manager, which drives the FPGA coprocessor board)

Burns [27] extended this architecture to include two new sub-systems: a transformation and

configuration manager, as presented in Figure 22. The transformation manager is responsible

for translating the circuits to improve area usage. For example, if the position that the circuit

was designed to be placed onto is occupied, the transformation manager will rotate, mirror or

scale the circuit so it can be placed in a different location. However, very few details were

provided on how to actually perform the transformations and it is felt that it would be too

computationally expensive to do so at runtime. The configuration manager is simply a

hardware abstraction layer so the architecture can be ported to any type of reconfigurable

hardware. Although the architecture itself is not suitable for the operating system in this

thesis, as it also does not provide support for the proposed abstractions, the concept of a

hardware abstraction layer will be utilised.

Figure 22: The RAGE System Dataflow Architecture


4.2.2 Proposed reconfigurable computing runtime system architecture

As there is no architecture of an operating system that suits the selected process, address

space, and inter-process communication abstraction described in the literature, the

components and interactions between them that result in an architecture will be presented

here. The architecture shown in Figure 23 consists of seven components responsible for user

input, service providing, resource allocation, logic partitioning, bitstream compilation and on-

chip network configuration, hardware abstraction, and the on-chip network itself. Each of

these components will now be described in more detail, followed by a description of the path

a sample application would take to be executed under this architecture.

Figure 23: Architecture of the operating system

Shell


The shell in this operating system will be very similar to a traditional one. It will provide an

interface between the user, the operating system and the hardware. Users will input

commands and execute applications via the interface. The operating system service provider

is then responsible for converting them into appropriate calls to the operating system

application programming interface. Applications will be delivered to the Allocator, which will begin the process of preparing them for execution. The operating system service provider will also report back to the shell on the status of the requested operations. Possible commands


that the shell could provide would include loading an application, accessing on board

memory, and clock control.

Operating system service provider

The operating system service provider is responsible for interpreting the user commands from

the shell, converting them into the appropriate application programming interface calls, and then

passing them directly to the hardware abstraction layer. The services the component would

provide include on-board memory reading and writing, and platform specific configuration

information. The advantage of using such a component is if the hardware abstraction layer is

altered at any point in time, only the operating system service provider has to be modified. If

the architecture did not have such a component and the hardware abstraction layer was

changed, all of the user interfaces would need to be modified. As the operating system may

have several different shells in the future, this will reduce the amount of maintenance that

needs to be performed.

Allocator

The Allocator is responsible for finding a section of vacant area on the FPGA that is large

enough to accommodate the new application, and for allocating the necessary on-board

memory. This requires the Allocator to keep track of where the free FPGA area is and where

the previously allocated processes have been put. When an application is passed to the

Allocator from the operating system service provider, it will initially calculate an estimate of

the amount of free area that will be required to accommodate it. If the area is available, the

Allocator selects the best place to locate the application. If there is not enough area available

for the complete application, it will either partition or block it.

If there is enough area available but it is not in one contiguous block, the application will need

to be partitioned. The Allocator will then determine the largest segment of free area available

and pass that and the application to the Partitioner. This process may be repeated numerous

times until the complete application has been successfully allocated or partitioning fails and

the application is blocked. All of the allocated and partitioned parts will then be passed to the

placer for the next stage of processing. The full specifications of the allocation algorithm will

be outlined in section 4.3.2.
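The Allocator's allocate-or-partition behaviour can be sketched as follows (a deliberately simplified, one-dimensional illustration in which areas are plain numbers and the function names are invented; the real algorithm, specified in section 4.3.2, works on two-dimensional regions):

```python
# Sketch of the Allocator loop: allocate the whole application if a single free
# block is large enough; otherwise repeatedly partition it against the largest
# remaining free block, blocking the application if partitioning ever fails.

def try_allocate(app_area, free_blocks, partition):
    """free_blocks: sizes of free contiguous blocks. partition(need, avail)
    returns the area successfully split off into that block, or 0 on failure."""
    if app_area > sum(free_blocks):
        return "blocked"                  # not enough total free area
    remaining = app_area
    for block in sorted(free_blocks, reverse=True):
        if remaining <= 0:
            break
        if block >= remaining:
            remaining = 0                 # the whole remainder fits this block
        else:
            fitted = partition(remaining, block)
            if fitted == 0:
                return "blocked"          # partitioning failed
            remaining -= fitted
    return "allocated" if remaining == 0 else "blocked"

# A trivial stand-in partitioner that always fills the offered block.
print(try_allocate(100, [60, 50], lambda need, avail: min(need, avail)))
```

Blocked applications would be placed in the ready queue, as described above, until more area becomes available.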


Partitioner

The Partitioner is responsible for partitioning the application logic into multiple parts if it can

not fit onto the available area in its current geometric dimensions. If the Partitioner is called, it

would have already been determined by the Allocator that there is enough area available for

the application, but it is not in one contiguous block. The Allocator will inform the Partitioner

of the amount of area it has to partition the application into and the Partitioner will return

which aggregation of nodes the data flow graph have been allocated to that particular segment

of area. This process will be repeated until the entire application has been allocated and

partitioned. If the Partitioner is unable to divide the application into a small enough partition

to fit into the specified area, it will inform the Allocator of this. The Allocator will then decide

whether to search for a larger portion of area or block and place the application in the ready

queue. More details on the partitioning algorithm will be given in section 4.3.3.

Loader

The loader is primarily responsible for creating the bitstreams that will configure the FPGA,

and the configuration of the on-chip physical network. Once the application has been

partitioned into a process and allocated a place on the FPGA, the loader determines how the

communication network must be configured to incorporate the new process. It combines this

network configuration information with the process itself to produce the specific FPGA

bitstream. It then passes the bitstream and configuration information to the hardware

abstraction layer that will actually load it onto the target FPGA.

Hardware abstraction layer

It is commonly agreed that a hardware abstraction layer (HAL) is a layer of programming that

allows an operating system to interact with a hardware device at a more abstract level. Unlike

modern personal computer hardware such as the hard disk or memory, there is no standard

interface for an FPGA or reconfigurable computer. Every manufacturer has a different

application programming interface (API) and for the user to access it, they need to have a

good understanding of the platform. The hardware abstraction layer in this operating system

should provide a standard API that abstracts away the underlying reconfigurable computing

hardware. It should provide an API to connect to the platform, configure the FPGA, control

the clock rates and access any onboard memory. All of the hardware specific code can then be

hidden, not only from the user but from the operating system itself. The advantage of using


this type of HAL is the target platform can be changed without significant redevelopment of

the operating system except for the hardware abstraction layer itself.
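Such a HAL can be sketched as an abstract interface with a platform-specific implementation behind it (an illustrative sketch; the method names are assumptions, not any vendor's real API, and the simulated platform exists only to show the idea):

```python
# Sketch of the hardware abstraction layer: the operating system programs
# against this interface, and only the platform-specific subclass changes
# when the target reconfigurable computing platform does.

from abc import ABC, abstractmethod

class ReconfigurableHAL(ABC):
    @abstractmethod
    def connect(self): ...                      # open the platform

    @abstractmethod
    def configure(self, bitstream: bytes): ...  # load an FPGA bitstream

    @abstractmethod
    def set_clock(self, hz: int): ...           # control the clock rate

    @abstractmethod
    def read_memory(self, addr: int, n: int) -> bytes: ...

    @abstractmethod
    def write_memory(self, addr: int, data: bytes): ...

class SimulatedPlatform(ReconfigurableHAL):
    """A stand-in platform used here purely for illustration."""
    def __init__(self):
        self.mem = bytearray(1024)
    def connect(self): return True
    def configure(self, bitstream): self.loaded = bitstream
    def set_clock(self, hz): self.hz = hz
    def read_memory(self, addr, n): return bytes(self.mem[addr:addr + n])
    def write_memory(self, addr, data): self.mem[addr:addr + len(data)] = data

hal = SimulatedPlatform()
hal.write_memory(0, b"abc")
print(hal.read_memory(0, 3))    # b'abc'
```

Porting the operating system to a new board would then mean writing one new subclass, leaving the rest of the system untouched.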

Network and network manager

The network and network manager together form the primitive architecture which is primarily

responsible for supporting inter-process communication and I/O. It consists of the inter-

process communication network and associated hardware to support it. The primitive

architecture is configured onto the FPGA by the loader before any user processes are. Then as

processes are configured onto the FPGA, the network configurator informs the network

manager via the hardware abstraction layer what must be changed in the network configuration to support the incoming process. Once the changes have been made, the network manager is responsible

for arbitrating between the processes to decide which has access to the network, so that the data being transferred is not damaged or corrupted.
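The arbitration role of the network manager can be sketched with a simple round-robin grant over the attached processes (an illustrative sketch; the class and process names are invented for this example):

```python
# Sketch of the network manager's arbitration: a round-robin grant over the
# processes attached to the shared bus, so only one process transfers at a
# time and data on the network is never interleaved or corrupted.

class BusArbiter:
    def __init__(self):
        self.attached = []
        self.turn = 0

    def attach(self, pid):
        # Called as the loader re-routes the bus to a newly placed process.
        self.attached.append(pid)

    def grant(self):
        # The next attached process gains exclusive access to the bus.
        pid = self.attached[self.turn % len(self.attached)]
        self.turn += 1
        return pid

arb = BusArbiter()
for p in ("P1", "P2", "P3"):
    arb.attach(p)
print([arb.grant() for _ in range(4)])   # ['P1', 'P2', 'P3', 'P1']
```

Round-robin is only one possible policy; a hardware arbiter could equally grant by priority, but the exclusive-access guarantee is what matters here.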

4.2.3 Sample application execution

The execution of an application begins with the user informing the operating system via the

shell that there is an application waiting execution. The application described in a data flow

graph format is then passed via the operating system service provider to the Allocator. The

Allocator begins by calculating if there is enough vacant FPGA area for the entire application

to be configured onto the FPGA in its current geometric dimensions. If there is, the allocated

location and the application will be passed onto the placer. If not, the application will be

blocked, put into a ready queue until more area becomes available and the user informed via

the operating system service provider. If there is area available, but it is not in one contiguous

block, the largest amount of free area and the application will be passed to the Partitioner. The

Partitioner will then attempt to partition the application into a process that can fit into the

allocated area. This process of partitioning and allocating is repeated until either the

application has been fully partitioned and allocated or all the area has been used. If all the area

is used and the application has not been fully allocated, the application will be blocked and

placed into the ready queue.

The completed application will be passed to the loader so an FPGA bitstream can be

produced. The loader will also configure the on-chip network for the processes that require

inter-process communication and off-chip I/O. Once the bitstream has been created, it will be

configured onto the FPGA via the hardware abstraction layer.


Once the application is in execution, the user can interact with any of the processes through

the operating system service provider. Via the shell, the user can request an I/O operation, an

alteration to the clock speed, or even the termination and removal of selected processes. The

user operations will be translated into the hardware abstraction API by the operating system

service provider, which will in turn pass the commands directly on to it.

4.2.4 Conclusion

In this section, the architecture for a reconfigurable computing operating system that suits the

process, address space, and inter-process communication abstraction has been presented. This

was achieved by firstly surveying the previous literature to investigate whether one had

already been proposed that could be used. It was shown that few architectures for reconfigurable computing runtime systems have been presented and, as such, a new one that

suits the selected abstractions was developed. This architecture consists of seven components

with the Allocator supporting the address space abstraction, the Partitioner supporting the

process abstraction and the network and network manager supporting the inter-process

communication abstraction. The other four components include a shell as a user interface, an

operating system service provider to support hardware configuration, a loader to generate the

FPGA bitstreams, and a traditional hardware abstraction layer to abstract the low level

platform programming from the user and operating system.


4.3 Algorithm specifications

It was highlighted in the architecture described in the previous section that the defining components of the operating system, which will ultimately support two of the proposed abstractions (process and address space), are the allocation of FPGA area (Allocator) and the logic

partitioning of the applications (Partitioner). Although several algorithms have been proposed

in the previous literature to solve both (see section 2.3), very few of them are suitable for this

environment because of the need to allocate area to variable sized applications, partition

applications into predefined sizes, and to carry out both of these at runtime.

As these algorithms now need to be performed at runtime, because the status of the FPGA cannot be predicted at design time, a trade-off between execution runtime and the quality of allocation and partitioning needs to be made. In this section the requirements and algorithm

specifications of the allocation and partitioning components will be presented.

4.3.1 Runtime requirements for algorithms

The traditional design stages of partitioning, and placement and routing use stochastic

algorithms to produce high performance applications at the expense of the total execution run

time. This type of algorithm is ideal for the offline design flow because the resulting

performance of the application is far more important than the total execution runtime of the

algorithms. The designer is able and happy to wait for an extended period of time for the

stages to produce such a high performance result. However, in an operating system

environment this is neither necessary nor possible. A near-linear runtime complexity and an

actual execution time in the order of milliseconds are required.

If the execution runtime of either the allocation or partitioning algorithms is reduced, the

resulting performance of the application can suffer. In the case of allocation, the applications

may use more FPGA area than if a stochastic-based allocation algorithm were

used. However as FPGA logic density will reach 20 million system gates in the foreseeable

future, logic area is becoming much less of a restriction than it once was. As the applications

used in the operating system will be made up of larger granularity processes which have

already been pre-placed and pre-routed using the high performance commercial design tools,

the number of iterations a partitioning algorithm would need to make in order to perform a

successful partition would be reduced.


4.3.2 Allocation

The specifications of the FPGA allocation algorithm are as follows. When a reconfigurable

computing application dynamically arrives into the operating system via the ready

queue, or part of an application arrives via the Partitioner, the Allocator must, given the

application’s geometrical dimensions, first determine if the total amount of area the application

requires is less than what is currently available on the FPGA. If not, the application will be

placed back into the ready queue. If there is enough area available the Allocator must return

the position and size of the area where the application can be put so it will not overlap or

interfere with any other already resident applications. If there are several positions where

the application can be put, the one that is returned must be the best in terms of some criteria. This

will be discussed in more detail in the next chapter.

If the total sum of the available area is greater than what is required by the incoming

application but there is not one contiguous block of area larger than what is needed by the

application, the position and size of the largest available area will be returned. This will then

be used by the Partitioner so it can attempt to assemble a subset of connected data flow graph

nodes that match the available area. This allocation process is shown in Figure 24.

Figure 24: Allocation service
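The allocation service described above can be illustrated with a short Python sketch. This is a model only: the Region type, the free-region list, the total_free parameter and the three-way result are assumptions made for the illustration, and the bookkeeping that updates the free list after a successful allocation is omitted.

```python
from dataclasses import dataclass

@dataclass
class Region:
    x: int   # bottom-left corner, in CLBs
    y: int
    w: int   # width and height, in CLBs
    h: int

    @property
    def area(self) -> int:
        return self.w * self.h

def allocate(app_w, app_h, free, total_free):
    """Model of the Allocator's decision: allocate the application, hand the
    largest free region to the Partitioner, or requeue the application."""
    if total_free < app_w * app_h:
        return ("requeue", None)              # not enough area anywhere
    for r in free:
        if r.w >= app_w and r.h >= app_h:     # first contiguous fit
            return ("allocate", Region(r.x, r.y, app_w, app_h))
    # enough total area but no contiguous block: the largest goes to the Partitioner
    return ("partition", max(free, key=lambda r: r.area))
```

The three outcomes correspond directly to the three cases described in the text above.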

4.3.3 Partitioning

To avoid possible lengthy delays in user response time, logic partitioning will be used to

divide applications into geometric dimensions that match what is available on the FPGA.

Unlike most traditional logic partitioning algorithms that perform bi-partitioning or min-cut

(see section 2.3.2), to maximise the area utilisation of the FPGA, an application should be


partitioned into a specified size. Therefore the partitioning algorithm must be able to accept a

particular size constraint from the Allocator and fill the area with as many data flow graph

nodes as possible without impacting on the application’s performance or integrity, as shown in

Figure 25. It should also avoid partitioning feedback loops and minimise the amount of inter-

process communication where possible. Once the specified area is full, the Partitioner should

indicate to the Allocator it requires another block of area, and then continue its process of

partitioning. This process is repeated until the entire application has been partitioned.

Figure 25: Hardware partitioning
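A minimal sketch of this partitioning loop is given below. It packs data flow graph nodes, represented here only by their pre-computed CLB areas, into blocks whose capacities are supplied by the Allocator; the avoidance of feedback loops and the minimisation of inter-process communication are deliberately omitted, so this is an illustration of the control flow, not the implemented Partitioner.

```python
def partition(node_areas, request_area):
    """Fill Allocator-supplied blocks with as many nodes as fit, repeating
    until the whole application has been partitioned.  `request_area` is a
    callable returning the capacity (in CLBs) of the next free block."""
    partitions, remaining = [], list(node_areas)
    while remaining:
        capacity = request_area()            # ask the Allocator for a block
        block, used = [], 0
        for a in list(remaining):
            if used + a <= capacity:
                block.append(a)
                used += a
                remaining.remove(a)
        if block:                            # an empty block means nothing fitted;
            partitions.append(block)         # a real Partitioner would wait here
    return partitions
```

Each returned block is a subset of connected data flow graph nodes that matched one area offered by the Allocator.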

4.3.4 Conclusion

In this section the runtime and functional requirements of both the allocation and partitioning

algorithms were defined. It was specified that as the algorithms will be executing at runtime,

their runtime complexity must be approximately linear, and they must execute in milliseconds

on a typical software platform. It was outlined that the allocation algorithm must calculate a

location on the FPGA that can accommodate the size of an incoming application and the

partitioning algorithm must be able to divide an application structured as a data flow graph

into any number of various sized partitions. These runtime and functional specifications will

be used to guide the selection of the algorithms that will be incorporated into the operating

system prototype in chapter 6.


4.4 New application design flow

It was outlined in the literature review in section 2.4 that the traditional design flow does not

adequately support the development of partial bitstreams for use with dynamic runtime

reconfiguration. Recent attempts have been made to overcome this limitation through the use

of methodologies that support module based application design [138]. However, dynamic

allocation and partitioning will require further modifications to suit the suggested operating

system environment. In this section the limitations of the current design flow methodologies

that prevent it from being used in this environment will be outlined. This will be followed by

the new design flow methodology that will be used to develop applications for execution

under the proposed operating system architecture.

The current design flow as outlined in the literature review is used for describing hardware

circuits that are to be loaded onto a reconfigurable computer whose entire surface is

configured at once. However, with many FPGAs now supporting dynamic runtime

reconfiguration, modifications to the design flow have had to be made to support it. A recent

suggestion states that dynamic runtime reconfiguration would benefit if the applications were

designed as module-like components. These modules are then swapped in and out at the

same location on the FPGA through the use of dynamic runtime reconfiguration. This is

achieved through the compilation of an initial bitstream and numerous partial bitstreams that

are ordered and then configured onto the FPGA over time. This can only be performed if all

of the application modules are known prior to the compilation of the bitstreams. In the

architecture proposed above however, applications will dynamically arrive into the system

and through the use of the suggested operating system can be arbitrarily placed anywhere on

the FPGA. Applications that are designed with the current design flow are unable to be used

in such a system.

Firstly, as FPGA area allocation is performed at runtime because the availability of hardware

resources cannot be predicted at compile time, all of the application modules need to be

relocatable. However in the current design flow, these modules are pre-placed and pre-routed

and can only be relocated through the use of device specific APIs such as JBits.

Experimentation with JBits by the author, however, shows that it is unable to arbitrarily

relocate and reconnect pre-placed cores of a practical size. Secondly, as dynamic partitioning

will be used to divide an application into a more suitable geometric size, applications need to


be designed with a data flow graph structure. As logic partitioning is most commonly

performed manually by the application designer, the current design flow has no support for

designing applications with any such structure. Thirdly, a runtime router is required to finalise

any routes that are not local to any pre-routed core. In fact, pre-routed cores must be designed

to avoid some of the routing resources on the FPGA so these resources are available for

runtime global routing. This is because no current commercial FPGA architecture

has a separate global routing architecture. Once the position of the application has been

determined, the runtime router will then connect it to either another application or directly to

I/O pins. Fourthly, if the operating system is used with an architecture that does not support

dynamic reconfiguration, checkpoints need to be inserted into the design so the

application can be paused whilst a reconfiguration of the whole FPGA is carried out. There are

no provisions in the current design flow for any type of checkpoint insertion.

All of the limitations described above have led to the development of a new design flow

methodology. It initially involves describing the application through the use of a traditional

hardware description language. In order to utilise the dynamic partitioning of the operating

system, it must be designed with a data flow graph structure. The nodes of the graph will be

the computational elements of the application such as the adders and multipliers. The arcs of

the graph describe which nodes need communications between each other. Following the

design entry, the application is then synthesized and technology mapped. Each of the data

flow graph nodes is then internally pre-placed and pre-routed through the use of the

traditional placement and routing algorithms. This reduces the amount of placement and

routing that needs to be performed at runtime. As there are no external routing connections

made between the nodes, they will be relocatable at runtime. Once all of the modules have

been completed, the application is then ready to be loaded into the operating system. After the

operating system has determined the location of all the data flow graph nodes, the runtime

router would be used to connect all the communicating nodes together.
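The application structure that this design flow produces might be modelled as follows. The class and field names are purely illustrative assumptions; what matters is that nodes carry only an internally placed and routed footprint, while the arcs are left unrouted for the runtime router.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str               # a computational element, e.g. an adder or multiplier
    width: int              # pre-placed, pre-routed footprint in CLBs
    height: int
    bitstream: bytes = b""  # internally placed and routed, hence relocatable

@dataclass
class Application:
    nodes: dict = field(default_factory=dict)
    arcs: list = field(default_factory=list)   # (src, dst) pairs for the runtime router

    def add_node(self, n: Node) -> None:
        self.nodes[n.name] = n

    def connect(self, src: str, dst: str) -> None:
        # no wires are pre-routed between nodes, so each node stays relocatable;
        # the arcs are handed to the runtime router once positions are known
        self.arcs.append((src, dst))
```

An application is then simply a set of relocatable nodes plus the communication arcs to be finalised at runtime.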


4.5 Conclusion

This chapter resulted in four major deliverables. First, through a qualitative analogy and

uniqueness survey between software and reconfigurable computing operating systems, three

newly defined abstractions were described. These were:

1. The reconfigurable computing process abstraction consisting of the hardware circuit

described as a data flow graph model with data source and sink nodes inserted for

virtualised I/O.

2. The reconfigurable computing address space abstraction consisting of a two-

dimensional address space for the FPGA and a single dimensional address space for

the external memory.

3. The reconfigurable computing inter-process communication abstraction consisting of the

formation of messages and passing them between processes via a pre-configured

primitive architecture of a memory controller and on-chip re-routable bus network.

Secondly, these abstractions were combined with particular features from previously

suggested reconfigurable computing runtime systems to result in the formation of the

reconfigurable computing operating system architecture. Its major components include a shell

for user interaction, Allocator for resource allocation, Partitioner for application partitioning,

a loader for configuring and managing the platform, and a primitive architecture that supports

inter-process communication.

Thirdly, the specifications of the algorithms that will implement the Allocator and Partitioner

components of the architecture were defined. It was determined that for the runtime

requirements of these algorithms, their quality of allocation or partitioning needed to be

traded for a reduced execution runtime. For the functional requirements, the Allocator must be

able to load an incoming application on a vacant part of the FPGA that will not interfere with

any other executing applications. The Partitioner must be able to divide a data flow graph

structured application into partitions of specified sizes.

Finally, a modified design flow for application development that suits the newly defined

abstractions and operating system architecture was proposed. This was achieved by

investigating the limitations of the current design flow and surveying the literature to

determine if any other application development environments could be used.


5 Resource allocation and application partitioning

In the previous chapter, a set of abstractions, the operating system architecture and resulting

algorithm specifications for the components of allocation and partitioning were presented. In

this chapter, candidate algorithms that meet the specifications of the allocation and

partitioning components will be implemented and their performance measured

against a set of selected metrics. This will then enable the most suitable allocation and

partitioning algorithms to be selected. A summary of the previous work, methodologies and

deliverables associated with this chapter are shown in Figure 26.

Figure 26: The previous work, methodology and

deliverables associated with this chapter

The chapter is divided into two sections: algorithms for resource allocation and algorithms for

application partitioning. In each section there are three tasks undertaken that result in the

selection of the most suitable algorithm. This initially involves surveying the research

literature with the aim of listing all the algorithms that suit the functional and runtime

specifications outlined in the previous chapter. These algorithms are then sorted based on

their complexity and runtime performance. The higher ranked algorithms will be adapted and

implemented to suit the operating system architecture. The performance of the implemented

algorithms will then be measured through the use of selected metrics. The algorithm that is

judged to perform the best will then be selected to be used in the associated component within

the operating system prototype.


5.1 Allocation

When the FPGA surface is shared amongst multiple applications, an address space abstraction

will be used to define what resources each application has been allocated. It will also prevent

applications from corrupting each other by using already taken resources. In order to support

this address space abstraction, an Allocator, which is responsible for allocating hardware

resources to incoming applications, has been defined. In section 4.3.2, the functional

specifications of the allocation algorithm within the Allocator component were presented.

These are summarised below.

1. To determine the size and position of a vacant segment of area onto which an incoming

application can fit without interfering with already allocated applications.

2. If there is enough vacant area available on the FPGA for the incoming application but

not in one contiguous segment, the largest segment that is available should be

determined.

3. If there is not enough vacant area on the FPGA for the incoming application to be

allocated onto, the application should be added to a ready queue until more area

becomes available.

4. If there is more than one possible location to place the application, choose the location

that maximises the usage of the FPGA area among all present and future applications.

In this section the most suitable allocation algorithm for use in the operating system prototype

will be selected. This will be achieved through an initial survey of the previous allocation

literature that appears in either the reconfigurable computing or other research domains. These

algorithms will be ranked according to their runtime complexity and the lower complexity

ones that meet the absolute runtime limits will be adapted to suit this environment. The

performance of these adapted algorithms will then be measured using selected metrics with

the aim of determining the most suitable allocation algorithm for use in the operating system

prototype.

5.1.1 Survey of allocation literature

Shown in Table 8 is a summary of the allocation algorithms presented in this thesis that have

some potential for use in the proposed operating system. They are ranked in order of runtime


complexity from least to most where n is the number of possible locations the applications

can be allocated onto and m is the size of the application being allocated.

Algorithm                           Runtime complexity   Satisfies functional specifications

Bottom Left [17]                    O(log n)             Yes

Minkowski Sum [58]                  O(n + m)             Yes

One Dimensional Bin Packing [90]    O(n log n)           No

Two Dimensional Bin Packing [13]    O(n³)                Yes

DREAM [51]                          O(n²)                Yes

Table 8: A summary of the well-known allocation algorithms that appear in the research literature

The traditional bin-packing problem is similar to the allocation of FPGA area. Most bin-

packing algorithms concentrate on the classical one dimensional bin-packing problem but

these are not suitable for the allocation of FPGA area as the FPGA surface is two

dimensional. Two dimensional bin-packing algorithms can be adapted to suit the operating

system Allocator and a particular implementation of one that suits the functional

specifications has been described by Baker, Coffman and Rivest [13]. The problem associated

with this algorithm is that its stated runtime complexity of O(n³) far exceeds the linear

requirements as defined in section 4.3.1. Bazargan [17] presented a modified two dimensional

bin-packing algorithm that, by not considering every possible place to which the application

can be allocated, reduced the runtime complexity to O(log n) and so met the requirement. Although the

quality of allocation will be affected, this is offset by the significant decrease in runtime

complexity. Eatmon [51] presented a modified algorithm based on the Bazargan proposal

which improved the quality of the overall allocation but at the expense of an increase in the

runtime complexity to O(n²).


The Minkowski Sum [58] has been shown to improve the utilisation of material when applied

to the problem of fabric cutting plans. This problem is very similar to the allocation of

applications onto an FPGA. However, the runtime complexity to determine the fabric cutting

plans (O(n⁵)) far exceeds the linear requirements as defined in section 4.3.1. This is because

non-convex polygons are allowed in the most general Minkowski Sum and a very slow

greedy algorithm is used to select the optimal location. The problem of allocating applications

onto an FPGA can be simplified by restricting it to convex polygons, as will be shown in section

5.1.4. This results in runtime complexity of the Minkowski Sum reducing to linear time.

Another linear-time algorithm, which selects among the multiple possible locations where the

application can be allocated, is used instead of a greedy-based algorithm.

Only the Bottom Left and Minkowski Sum algorithms meet both the functional and runtime

specifications required for the Allocator. These two algorithms will now be described in more

detail, outlining any modifications that were needed to suit the operating system architecture.

5.1.2 Algorithm 1 – Greedy based

The first algorithm described and implemented for allocating the FPGA area to incoming

applications is based on a traditional greedy style algorithm. Such an algorithm was

implemented before either of the two described above because not only is its runtime

complexity linear (see below), it is also very easy to implement and a better understanding of

the problem could be achieved before more complex algorithms were implemented.

As the applications arrive into the operating system, they are queued in a standard first in first

out (FIFO) queue (shown in Figure 27 (a)). When there are one or more applications in the

queue, the algorithm will take the application at the front of the queue and begin the

allocation. This involves searching a list of areas that match the known minimum area

requirements of the application (shown in Figure 27 (b)). The minimum area requirements of

the application are pre-calculated at compile time and are known as a virtual rectangle. The

FPGA area in this algorithm is represented as a list, analogous to a list of disk blocks on a

disk drive. To calculate the location of the vacant area within which the application will be

allocated, the algorithm initially determines if there is enough vacant area on the FPGA for

the application to be allocated. If not, the application will be placed back into the queue until

more area becomes available.



Figure 27: Greedy based allocation

If there is enough vacant area, the bottom left corner of the virtual rectangle will be placed

over the first CLB in the list. If it overlaps with other previously allocated applications, the

virtual rectangle will progressively and deterministically be moved through the list until a

location can be found (shown in Figure 27 (c)) where the application can be allocated and not

interfere with other applications. Once the algorithm finds a successful location, it will stop

searching and mark the area as used (shown in Figure 27 (d)). The location details will then

be passed onto the next stage in the architecture for further processing. If the virtual rectangle

searches the list of CLBs without a successful allocation, it will return the location and size of

the segment of free area that was closest to the application area requirements. The details of

this segment are then passed onto the Partitioner for further processing.
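The scan just described can be sketched as follows, assuming the FPGA area is modelled as a set of occupied CLB coordinates. For brevity the sketch returns None on failure rather than the closest-fit free segment described above; the function name and grid representation are assumptions of the illustration.

```python
def greedy_allocate(used, fpga_w, fpga_h, app_w, app_h):
    """Slide the application's virtual rectangle through the CLB grid,
    bottom-left corner first, claiming the first overlap-free position.
    `used` is a set of occupied (x, y) CLB coordinates, mutated on success."""
    for y in range(fpga_h - app_h + 1):
        for x in range(fpga_w - app_w + 1):
            cells = {(x + i, y + j) for i in range(app_w) for j in range(app_h)}
            if not cells & used:
                used |= cells        # mark the area as used
                return (x, y)
    return None                      # caller would fall back to the Partitioner
```

Because the rectangle is moved deterministically through the list of CLBs, repeated allocations pack applications from the bottom-left corner outwards.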

5.1.3 Algorithm 2 – Bottom left

The second allocation algorithm chosen to be adapted and implemented for the Allocator is a

variation modelled on the bottom left algorithm proposed by Bazargan. It consists of two parts:

one, an empty space manager for insertion and deletion of applications, similar to the greedy

based algorithm, and two, a set of heuristics for dividing up the free space. In the empty space

manager, the free FPGA area is represented by rectangles, where each rectangle may have

multiple CLBs (Figure 28 (c)). When an application arrives in the queue (Figure 28 (a)), the


algorithm searches the list of available rectangles, looking for a rectangle with dimensions

that are equal to or larger than the size of the application, remembering the size of the

application has been pre-calculated at compile time. When the first suitable rectangle is

located, the algorithm will allocate the application to the bottom left-hand corner of the

selected rectangle, assuming the application size is less than the rectangle it is being allocated

into (Figure 28 (d)). The location of where the application is to be placed is then passed on to

the placer for further processing. If there are no rectangles that can accommodate the

application, the application will be blocked and placed back into the ready queue.

Figure 28: The bottom left allocation algorithm process

In the second part of the bottom left allocation algorithm, the remaining free space is divided

into two more rectangles according to a heuristic. Initially, the remaining area is partitioned

into three new rectangles. These rectangles are defined by two segments that intersect with the

corner of the allocated application and the edge of the rectangle it is being allocated into (see

Sa and Sb in Figure 29 (a)). Depending upon the heuristic chosen when the algorithm is

started, either the shortest (SRS) (Figure 29 (c)) or longest (LRS) (Figure 29 (b)) remaining

segment is used to divide the remaining area into two rectangles. The details of the original

rectangle are then removed from the list and replaced with the size and location of the two

new ones.


Figure 29: The heuristic used to calculate the remaining rectangles

There were two changes made to the algorithm that was originally proposed by Bazargan.

Firstly, the algorithm can now dynamically choose between the heuristics that divide the

remaining area into rectangles. Bazargan made no mention in his paper of which strategy should

be chosen for which application. The Allocator can now select from the shortest or longest

segment in an attempt to keep the available free rectangles as square as possible. For example,

a rectangle 1 CLB wide x 12 CLBs high is much harder to allocate to applications as

compared with a rectangle 3 CLBs wide x 4 CLBs high. Secondly, the algorithm was

integrated into the operating system architecture so it could be called iteratively from the

Partitioner.

The time complexity of the bottom left algorithm is linear in the number of rectangles stored

in the list. As the algorithm creates at most two new rectangles every time a new application

is allocated, the number of rectangles is at most twice the number of applications on the

FPGA. Not only does this minimise the average total runtime of the allocation function, it will

also produce a more predictable runtime because the total number of areas stored in the list is

predictable.
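The split step might be sketched as below. The exact correspondence between the two candidate splits and the SRS/LRS heuristics depends on the rectangle's dimensions, so this sketch simply labels them vertical and horizontal and, in the spirit of the adapted algorithm, picks the split that keeps the new free rectangles closest to square. All names and the (x, y, w, h) rectangle representation are assumptions of the illustration.

```python
def split_free(rect, app_w, app_h, keep_square=True):
    """After allocating an app_w x app_h application into the bottom-left
    corner of free rectangle rect = (x, y, w, h), split the remaining free
    area into two new rectangles and return them (degenerate ones dropped)."""
    x, y, w, h = rect
    vertical = [(x + app_w, y, w - app_w, h), (x, y + app_h, app_w, h - app_h)]
    horizontal = [(x, y + app_h, w, h - app_h), (x + app_w, y, w - app_w, app_h)]

    def worst_aspect(rects):
        # the most elongated rectangle a split produces; lower is "squarer"
        return max((max(rw, rh) / min(rw, rh)
                    for _, _, rw, rh in rects if rw and rh), default=1.0)

    choice = min((vertical, horizontal), key=worst_aspect) if keep_square else vertical
    return [r for r in choice if r[2] > 0 and r[3] > 0]
```

For the example in the text, splitting so as to avoid a 1 CLB x 12 CLB strip leaves rectangles that are far easier to allocate later.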

5.1.4 Algorithm 3 – Minkowski Sum


The final allocation algorithm implemented for possible use in the operating system prototype

is based on the Minkowski Sum. The Minkowski Sum algorithm is often used in motion

planning to determine the free space among a set of obstacles so that an optimal path may be

planned for traversal by some physical entity between two points in the space [58]. The


specifications required by the allocation algorithm are similar to the motion planning problem

in that the identification of free space is needed.

The Minkowski Sum based allocation algorithm consists of calculating all of the possible

locations where the incoming application could be placed. A second step is needed to

determine to which of those locations the application should be allocated in order to optimise

performance. The Minkowski Sum can be defined as the set of all points that are the sum of a

point in one set together with a point in another set. This is shown in Equation 1.

P + Q = { p + q | p ∈ P, q ∈ Q }

Equation 1: Minkowski Sum

For the allocation problem described in this thesis, it is assumed that the FPGA area can be

depicted as two polygons, U and F where the used area is denoted by U, and the vacant area is

denoted by F. Given that T is a rectangle depicting the area required by the incoming

application, and r is the centre point within T, all possible locations into which it can be allocated

are found by calculating the Minkowski Sum of the polygons T and U. Figure 30 below

further explains Minkowski Sum.


Figure 30: Minkowski Sum example


The polygon or application labelled T with the centre point p inside it represents a new

application in the queue waiting to be allocated onto the FPGA (Figure 30 (a)). The

crosshatched polygons represent the used space or other applications that have already been

allocated on the FPGA (Figure 30 (b)). Figure 30 (c) shows the Minkowski Sum of the

applications and is represented by the dotted area (S) surrounding those applications. This

area combined with the application area (U) is the location where the centre point p inside

polygon T cannot be placed, indicated as prohibited in the figure. The rest of the area (F) is

considered to be safe to allocate the application T into. If polygon T is translated around the

boundary of the dotted area S (Figure 30 (c)), with centre point p of T fixed to the boundary’s

edge, it is clear that T will always be touching but never intersecting with the shaded polygon.

For the Minkowski Sum algorithm to have a runtime complexity of O (U + T) as needed by

the Allocator, the polygons U and T must both be convex. As the incoming applications are

rectangular in shape, T will always be convex. However, depending upon previous

allocation history, U may either be convex or non-convex. To ensure the polygon will be

convex, De Berg [46] proposed the following method that avoided using any non-convex

polygons in the Minkowski Sum.

1. Decompose the non-convex polygon U into discrete polygons u1, u2, …, un.

2. For each of the polygons ui in the set of polygons u1, u2, …, un, find the Minkowski

Sum si of T and ui, giving s1, s2, …, sn.

3. Use the elementary polygon union operator to combine polygons s1, s2, …, sn.

This results in both of the polygons in the Minkowski Sum being convex and therefore meets

the runtime requirements of the Allocator previously defined.

The area where the incoming application can be allocated has now been calculated. However,

within this area, there are numerous locations where the application could be allocated. A

simple bottom left corner heuristic is used to determine the exact location of where to allocate

the incoming application. This involves calculating the location of all the corners of the

available area (see Figure 31 (a)) and then allocating the application into the segment whose

corner is closest to the bottom left of the FPGA and is large enough to accommodate the

application (see Figure 31 (b)). If there is no segment of area that is large enough to

accommodate the application but the total amount of free area is greater than the pre-compiled


estimate of the application, the algorithm will determine the largest segment and pass its

details onto the Partitioner for further processing. If the FPGA does not have enough total

area to accommodate the application, it will be blocked and put back into the ready

queue until more area becomes available.

Figure 31: Bottom left heuristic used with the Minkowski Sum

5.1.5 Algorithm performance

The performance of a hardware resource allocation algorithm can be measured by how much

of the resource cannot be utilised by any application. For example, if a microprocessor is

being shared, a measure could be how many of the instructions it processes over a set period of

time are not related to any application. To measure the performance of an FPGA allocation

algorithm, the raw area utilisation of the FPGA surface could be used. However, such a

measurement ignores the amount of area that is wasted because the algorithm does not tightly

pack all incoming applications, possibly preventing any application from being loaded there.

FPGA area is a finite and expensive resource and a trade-off between execution runtime and

area utilisation needs to be made. As each of the algorithms allocate the applications

according to a different set of rules, the overall utilisation of the FPGA area will vary for each

one. To measure the overall usage of the FPGA area, a metric known as fragmentation will be

introduced.

As the allocation algorithm in this operating system will be used at runtime, the amount of

time it takes to execute will also be a measure of its performance. To gain a more accurate

measurement of the expected runtime of each algorithm, an experiment will be performed to

measure it when each algorithm is used with various sizes and numbers of applications. As it


is currently unknown what type and size of applications will be used in conjunction with the

operating system, three sets of varying sized applications are proposed. The allocation

algorithms will then be tested using all three sets of applications. These sets of applications

have been selected to produce the worst degradation in runtime performance that would be

expected in the actual operating system environment.

In this section, an experiment that measures both the execution runtime and the amount of

area that is wasted due to poor allocations is performed. Initially, the experiment test bed is

described which includes the number and size of applications used in the experiment. The first

part of the experiment will be to measure the runtime consumed by all three algorithms under

various conditions. In the second part of the experiment, the same set of applications will be

loaded onto the FPGA and the fragmentation will be measured after each application has been

allocated. In both parts, results and graphs will be presented and then conclusions will be

drawn on the result obtained.

Experimental test bed

In order to create an environment in which realistic results could be generated, an assumption

on the size of the applications had to be made. As the operating system can accommodate

multiple concurrent applications, sets of incoming applications had to be generated. As it is

currently unclear what type of application the operating system will primarily be used in

conjunction with, it is difficult to estimate the size, arrival rate and execution time of the

incoming applications. As such, three categories of different means and standard deviation

area usage of the applications were used. These values were chosen to represent small, typical

and large applications with respect to the target size of the FPGA and the variations were

modelled on a Gaussian distribution. The application arrival and execution time were selected

for each category so there would always be several applications waiting in the queue. The

details are shown in Table 9.


Application set               Mean area       Std. dev. area   Mean inter-arrival   Mean execution
                              (% of FPGA)     (% of FPGA)      time (time units)    time (time units)
Typical size applications     4%              2%               20                   200
Large size applications       8%              4%               30                   200
Small size applications       2%              1%               6                    200

Table 9: Parameters of the applications used to measure the

execution runtime of the allocation and partitioning algorithms
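As a concrete illustration, the application sets of Table 9 could be generated as follows. The thesis specifies only the Gaussian area distribution and the mean inter-arrival and execution times; the exponential inter-arrival distribution, the clamping of areas to a small positive minimum, and all names below are assumptions made for this sketch.

```python
import random

def generate_applications(count, mean_area, sd_area, mean_interarrival, exec_time):
    """Generate one set of synthetic applications.

    Areas (as a percentage of the FPGA) are drawn from a Gaussian distribution
    and clamped to a small positive minimum; arrival times are cumulative.
    """
    apps, t = [], 0.0
    for _ in range(count):
        area = max(0.1, random.gauss(mean_area, sd_area))
        t += random.expovariate(1.0 / mean_interarrival)  # assumed distribution
        apps.append({"area_pct": area, "arrival": t, "exec_time": exec_time})
    return apps

# The three categories of Table 9, fifty applications each.
typical = generate_applications(50, 4.0, 2.0, 20, 200)
large = generate_applications(50, 8.0, 4.0, 30, 200)
small = generate_applications(50, 2.0, 1.0, 6, 200)
```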

Execution runtime

The experiment to measure the execution runtime of each allocation algorithm involved

generating three sets of applications, allocating them onto the FPGA with each of the

algorithms, and then measuring how long it took to complete the allocation for each

application (algorithm execution runtime). The execution time in this experiment was

measured as wall clock time on an otherwise idle Celeron 1.2 GHz microprocessor with

256 MB of RAM, running Microsoft Windows XP.

The experiment began by generating fifty applications for each of the sets of typical, small

and large sized applications. Each of these application sets was then allocated onto the FPGA

by all three algorithms, resulting in nine iterations of the experiment. A single iteration of the

experiment involved the following. Initially, the first application in the selected set was

allocated onto an empty FPGA and the execution runtime it took to do so was recorded. This

should result in the minimum runtime requirement associated with the particular allocation

algorithm. For each remaining application in the set or until the applications could no longer

fit, it was allocated onto the FPGA, the number of other applications resident on the FPGA

was recorded, and the execution runtime of the algorithm to perform the allocation was

measured. For each set of applications (typical, small and large), a graph summarising the

number of applications resident on the FPGA versus the execution runtime for each algorithm

is presented in Figure 32.


[Figure 32 comprises three graphs, one for each of the typical, small and large application sets. Each graph plots the number of applications already resident on the FPGA against the allocation execution time in milliseconds on an Intel Celeron 1.2 GHz microprocessor, for the Greedy, Bottom Left and Minkowski Sum with best fit allocation algorithms.]

Figure 32: The execution runtime of the greedy, bottom left

and Minkowski Sum allocation algorithms


The graphs in Figure 32 can be summarised as follows. The greedy based algorithm consumed the most execution runtime, ranging from an average of 40ms when the FPGA was empty up to 812ms when the incoming application could not be successfully allocated. The bottom left algorithm consumed the second most execution runtime, averaging 173% less than the greedy algorithm and ranging from 8ms when the FPGA was empty up to 356ms, depending upon the size and number of applications already resident on the FPGA. The algorithm that consumed the least execution runtime was the Minkowski Sum with best fit allocation, averaging 140% less than the bottom left algorithm and 598% less than the greedy algorithm; it ranged from 18ms when the FPGA was empty up to 156ms, again depending upon the size and number of resident applications. There are several points worth noting regarding these results.

1. The number of applications that could be allocated onto the FPGA varied depending

upon the algorithm used. Shown in Table 10 is the total number of applications that

could be allocated onto the FPGA by each algorithm for all three sets of applications.

The fewest applications were allocated onto the FPGA when the Minkowski

Sum algorithm was used, followed by the Bottom Left algorithm, and the most

applications were able to be allocated when the Greedy algorithm was used. This was

consistent across all three application sets. There was also little variation between the

numbers of applications allocated onto the FPGA by each algorithm indicating the

amount of area wasted due to fragmentation would be similar. This is shown later in

this section.

Algorithm Small Typical Large

Minkowski Sum 40 23 14

Bottom Left 43 25 16

Greedy 45 28 18

Table 10: Number of applications allocated onto the FPGA

2. The minimum execution times for each of the algorithms were very similar. Across all

three application sets, the minimum execution runtime averaged 24ms for the

Minkowski Sum algorithm, 26ms for the Bottom Left algorithm and 28ms for the

Greedy Based algorithm. If the operating system only operated with a few resident

applications on the FPGA at one time, there would be little advantage in using any

particular allocation algorithm.


3. In each of the graphs, the greedy based algorithm’s execution runtime peaked at

approximately 805ms. This situation occurred because in each case the attempt at

allocating the application failed. The number of times the virtual rectangle had to be

progressively moved through the FPGA until the allocation failed was similar for each

set, even though the applications were of various sizes. This is because the applications are quite small relative to the size of the FPGA (approximately 2% to 8%). This resulted in hundreds of moves, and altering the number of moves by 10 or 15 made very little difference to the execution runtime. However, if

larger sized applications were used, greater than 20% of the FPGA area, a reduction in

the overall execution runtime would be experienced.

4. The greedy based allocation algorithm consumed far more execution runtime when

used with large sized applications, averaging 259% more than the bottom left and

1123% more than Minkowski Sum. This can be explained by the way it performs its

allocation. For each application allocated, a search of the FPGA area begins in the

bottom left corner. When available area is found it immediately allocates the

application to it. Therefore, as the FPGA fills, searching from the bottom left corner becomes increasingly ineffective, as the search repeatedly checks area that has already been allocated.
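The behaviour described in point 4 can be sketched with a simple CLB-grid model. The scan below restarts from the bottom left corner on every allocation, so as the grid fills most of the work re-examines already occupied area; this is an illustrative model and its grid representation and names are assumptions, not the thesis implementation.

```python
def greedy_allocate(grid, app_w, app_h):
    """Claim the first free app_w x app_h window, scanning from the bottom left.

    grid: 2D list of booleans, grid[row][col], True = occupied; row 0 is the
    bottom of the FPGA. Returns the (col, row) of the window or None on failure.
    """
    rows, cols = len(grid), len(grid[0])
    for row in range(rows - app_h + 1):        # bottom row first
        for col in range(cols - app_w + 1):    # left column first
            window_free = all(
                not grid[row + dr][col + dc]
                for dr in range(app_h) for dc in range(app_w)
            )
            if window_free:
                for dr in range(app_h):        # claim the window immediately
                    for dc in range(app_w):
                        grid[row + dr][col + dc] = True
                return (col, row)
    return None  # exhaustive search failed: the worst case for runtime
```

Successive 4 x 4 allocations on an 8 x 8 grid land at (0, 0), then (4, 0), then (0, 4), with each later call re-scanning the filled bottom rows before finding free area.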

5. The fluctuations shown in the graphs of the bottom left allocation algorithm (17 to 25

typical; 30 to 43 small; 10 to 17 large) are due to the algorithmic complexity of

recalculating the free area when applications are removed. The process of

recalculating this free area involves combining all the surrounding available area and

then dividing it into the minimum number of rectangular segments so it can more

easily be reallocated. The sharp increases on the graphs reflect whether an application was removed before or after the incoming application was allocated.

Fragmentation

During the experiment to measure the allocation algorithms, it was found that each of the

algorithms allocated the incoming applications in different locations, as would be expected.

As a result it was noted that in some of the iterations of the experiment, more applications

could be allocated onto the FPGA when particular algorithms were used. This resulted in a

better usage of the available FPGA area. For example, more FPGA area would be wasted if a

small application was allocated in the middle of a large area. To measure this efficiency, a

metric known as fragmentation will be introduced.


Fragmentation is a term usually associated with secondary disk storage. Resources are

typically represented as linear arrays of equal sized blocks. Intuitively, disk fragmentation

exists when a large file is not able to be stored as a contiguous set of blocks and thus must be

partitioned. An FPGA is similar, as it is divided into CLB units, but the FPGA area is two dimensional. There is no agreed definition of what constitutes area fragmentation on an FPGA. It

is defined here that an FPGA is fragmented if an application is unable to be allocated to the

FPGA even though the FPGA has adequate non-contiguous free area, as shown in Figure 33.

Figure 33: A Fragmented FPGA

Walder and Platzner [133] proposed a method of measuring fragmentation to quantify allocation situations. Their fragmentation grade is shown in Equation 2.

F = 1 - √( Σᵢ nᵢ·aᵢ² ) / ( Σᵢ nᵢ·aᵢ )   where nᵢ is the number of rectangles and aᵢ is their size.

Equation 2: Walder Fragmentation Grade

The problem with this measurement of fragmentation is that it does not work well in the case where every second CLB is used, the checkerboard case. For example, if the FPGA has 32 CLBs, of which 16 are used and 16 are available, laid out in a checkerboard pattern, the fragmentation grade is as shown in Equation 3.


F = 1 - √(16·1²) / 16 = 1 - 4/16 = 0.75

Equation 3: Example of Fragmentation Grade

In this case the fragmentation grade should be 100%, as every application needs to be partitioned except applications only one CLB in size. To overcome this limitation, a new fragmentation measure was developed, expressed as a percentage relating the number of holes left between the previously allocated applications to the number of remaining free CLBs. This is shown in Equation 4. A hole is defined as a contiguous portion of free FPGA area. For example, if the FPGA has five holes and there are 50 free CLBs, the percentage of fragmentation is approximately 8%. The measure also handles the checkerboard case that limits the Walder fragmentation formula. This measure will be used to rank the allocation algorithms in terms of fragmentation performance. It is also shown in section 7.3 that, when adjusted for the mean size of the applications on the FPGA, it is an excellent predictor of the user response time and application throughput associated with the operating system.

F = 0                             if A = 1
F = ( (h - 1) / (A - 1) ) × 100   if A > 1

where h is the number of holes and A is the total free area; A has units of the minimum unit of allocation (CLBs in most cases).

Equation 4: Fragmentation percentage
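Both fragmentation measures can be stated compactly in code. The sketch below assumes holes are counted as 4-connected regions of free CLBs (the thesis defines a hole only as a contiguous portion of free area, so the connectivity rule is an assumption) and reproduces the checkerboard example above.

```python
from math import sqrt

def walder_grade(hole_areas):
    """Walder/Platzner fragmentation grade (Equation 2) over a list of hole areas."""
    total = sum(hole_areas)
    return 1 - sqrt(sum(a * a for a in hole_areas)) / total

def fragmentation_pct(num_holes, free_area):
    """Fragmentation percentage of Equation 4 (free_area in CLBs)."""
    if free_area <= 1:
        return 0.0
    return (num_holes - 1) / (free_area - 1) * 100

def count_holes(grid):
    """Count 4-connected regions of free cells (grid[row][col], True = occupied)."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    holes = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] or seen[r][c]:
                continue
            holes += 1                      # flood-fill one hole
            stack = [(r, c)]
            seen[r][c] = True
            while stack:
                cr, cc = stack.pop()
                for nr, nc in ((cr + 1, cc), (cr - 1, cc), (cr, cc + 1), (cr, cc - 1)):
                    if 0 <= nr < rows and 0 <= nc < cols \
                            and not grid[nr][nc] and not seen[nr][nc]:
                        seen[nr][nc] = True
                        stack.append((nr, nc))
    return holes
```

For the checkerboard case, `walder_grade([1] * 16)` gives 0.75 while `fragmentation_pct(16, 16)` gives 100%, and for the five-hole, 50-CLB example `fragmentation_pct(5, 50)` gives roughly 8%.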

The second part of the experiment measures the fragmentation generated by each algorithm

when the FPGA is at various capacities. This initially involved generating over a hundred

applications in each of the previously described application sets. These applications were then

allocated onto the FPGA by all three algorithms and the fragmentation was calculated. For

each of the application sets (typical, small and large), a graph summarising the fragmentation

versus FPGA capacity is shown in Figure 34.


[Figure 34 comprises three graphs, one for each of the typical, large and small application sets. Each graph plots the amount of FPGA area used (%) against the measured fragmentation for the Greedy, Bottom Left and Minkowski Sum with best fit allocation algorithms.]

Figure 34: Fragmentation recorded for the typical,

large and small sized applications


The graphs in Figure 34 can be summarised as follows. Averaging all of the 14 fragmentation measures taken in each application set, the Greedy algorithm produced the least fragmentation. This was followed by the Bottom Left algorithm, which created on average 16% more fragmentation than the Greedy algorithm. The Minkowski Sum algorithm with bottom left corner allocation produced the most fragmentation, averaging 34% more than the Greedy algorithm and 16% more than the Bottom Left. A summary of the increase in fragmentation between the algorithms is shown in Table 11. There are several points that can be drawn from the table and the graphs; they are discussed below.

Category Greedy to Bottom Left Greedy to Minkowski Bottom Left to Minkowski

Typical 15.7% 33.1% 15.2%

Large 15.2% 30.3% 13.7%

Small 20.8% 46.2% 18.0%

Total 16.0% 34.2% 16%

Table 11: The average percentage increase in fragmentation

for the algorithms compared to each other

1. A higher percentage of fragmentation was generated when the small sized application

set was allocated onto the FPGA as compared to the typical and large sized application

sets. This was consistent across all three allocation algorithms with the maximum

fragmentation being recorded at 1.65% for Minkowski Sum, 1.42% for Bottom Left,

and 1.23% for Greedy. This can be explained by noting that, for the same percentage of area usage, allocating many small applications creates more small holes between them, which ultimately increases the fragmentation measure.

2. The maximum area usage of the FPGA varied depending upon the size of the

application set used. For the typical sized applications, the maximum percentage of the

FPGA that was occupied by applications was approximately 58%. This increased by

10% to 68% for large sized applications and increased a further 1% to 69% for the

small sized applications. As the execution runtime of the applications was selected so

the FPGA would fill up and a queue of applications would occur after some time,

these results reflect the maximum usage area that can be obtained with applications of


the specified size. The remaining area not consumed by the applications is taken up by

fragmentation.

3. Although not conclusive, there appears to be a connection between the amount of

fragmentation generated and the FPGA area usage. Within each application set and in

most cases, as the amount of area consumed on the FPGA increased so did the

fragmentation. There are several measurements where this is not the case including

36% and 43% for the small sized; 64% for the large size; and 46% and 47% for the

typical size. However, it is felt that these outliers could have been generated either by the size of an incoming application exactly matching a hole, significantly reducing the fragmentation, or by an application having to be allocated in a location that generated several new holes. As FPGA usage increased, it would be expected that the

fragmentation would increase as the number of holes created by applications not being

allocated in ideal locations would increase.

5.1.6 Algorithm selection

From the results gained in the experiment described above, the algorithm based on the

Minkowski Sum with the bottom left corner heuristic was selected as the most suitable of the

three allocation algorithms to be used in the operating system prototype. The justification is as

follows. Although the fragmentation generated by the Greedy algorithm was the least, its

absolute execution time was far too great as compared with the Bottom Left and Minkowski

Sum, especially for larger sized applications. An excessive execution runtime could ultimately

result in a much longer response time, a factor that has to be minimised. Of the remaining two

algorithms, the Bottom Left generated the least fragmentation averaging 16% less but

consumed more execution runtime, approximately 140% more.

The choice depends on how area and response time are valued in a particular situation, but as extra area can easily be purchased, it was decided to use the Minkowski Sum allocation algorithm in the operating system prototype: the increase in fragmentation it produces is not great, while the extra execution time of the Bottom Left algorithm is significant. Although FPGA area is a valuable resource that needs to be managed, execution runtime was valued more highly in this situation.


5.2 Partitioning

Once a reconfigurable computing application is under execution on the FPGA, it will be

known as a process. This process consists of an application, or part thereof, that is structured

in a data flow graph format with data source and sink nodes inserted for easier I/O transfer. In

an attempt to reduce the user response time and increase the FPGA usage, applications will be

broken down into multiple processes of specified sizes so they can fit into particular locations

on the FPGA. This requires a logic partitioning algorithm that has the following functional

specifications, as defined in section 4.3.3.

1. Partition an application, structured as a data flow graph, into parts that satisfy various specified area constraints.

2. Partition the application in a way that does not affect the integrity of its operation.

3. Minimise the effect of the partitioning on the application’s performance.

In this section the most suitable partitioning algorithm for use in the operating system

prototype will be selected. This will be achieved through an initial survey of the previous

partitioning literature that appears in either the reconfigurable computing or other research

domains. These algorithms will then be ranked according to their complexity and runtime

performance and the highest ranked ones that meet the runtime requirements will be adapted

to suit this environment. The performance of these adapted algorithms will then be measured

using selected metrics with the aim of determining the most suitable partitioning algorithm for

use in the operating system prototype.

5.2.1 Survey of partitioning literature

Logic partitioning has been an active area of research for at least the last 25 years and has resulted in numerous algorithms being proposed and implemented. Logic partitioning has traditionally been used to divide an application into equal sized parts when it cannot fit onto

the target device. However, in the proposed operating system, logic partitioning will be used

to divide an application into a particular size and geometrical configuration. Shown in Table

12 is a summary of the partitioning algorithms presented in this thesis. They are ranked in

order of runtime complexity from least to most. n is the number of nodes in the application

which need to be partitioned.


Algorithm                      Runtime complexity                         Satisfies functional specifications
Temporal Partitioning [109]    O(V + E)  (V = vertices, E = edges)        Yes
FM [57]                        O(n)                                       No
Simulated Annealing [84]       O(V^(1/2) E)  (V = vertices, E = edges)    No
MP2 [136]                      O(n²)                                      No
KL [83]                        O(n² log n)                                No

Table 12: Summary of partitioning algorithm runtime complexities

Three of the most well-known partitioning algorithms are the Kernighan and Lin (KL) [83],

Fiduccia and Mattheyses (FM) [57] and Simulated Annealing [84]. These algorithms are

based on an iterative min-cut heuristic for partitioning networks and have the runtime

complexities of O(n² log n), O(n) per pass and O(V^(1/2) E) respectively. Although very common, these algorithms are not suited to the proposed operating system because they do not meet the previously defined partitioning specifications: they are unable to partition applications into varying specific sizes. They are primarily used to partition an application so that the number of communication channels required between the partitions is minimised. KL and Simulated Annealing also have runtime complexities that far exceed the linear specification as

previously discussed. Although FM is stated as being linear per pass, it is commonly accepted

that several passes of the algorithm are required before an acceptable result is obtained.

There have been several partitioning algorithms targeted for use with FPGAs that consider the

hard size and I/O pin constraints associated with such devices. Woo and Kim [136] proposed

an extension to the FM algorithm that minimised the maximum number of I/O pins used on

the device. Kuznar et al [88] also modified the FM algorithm to address the problem of

partitioning applications onto multiple FPGAs. However again, these algorithms are not

suited to the proposed operating system as they partition applications into fixed sizes and their

runtime complexities are far too high. Purna [109] introduced the concept of temporal


partitioning of a directed acyclic graph. Given the size of the FPGA, the algorithm will partition

an application into k-way equal sized parts in linear time. Although this algorithm does not

meet the variable partition size specification set for the Partitioner, it does meet the linear

runtime complexity constraint and could be adapted to support variable sized partitioning.

Only the temporal partitioning [109] algorithm proposed by Purna meets the runtime

requirements and partially meets the functional specifications needed by the Partitioner. This

algorithm will now be described, outlining the modifications that were made to adapt it to the

operating system environment.

5.2.2 Algorithm 1 – Temporal partitioning

Although the temporal partitioning algorithm proposed by Purna meets the linear runtime

complexity required by the Partitioner, it needs modifications. The major one is to change from partitioning applications in time to partitioning them in space. The current temporal partitioning algorithm initially assigns each node in the data flow graph an ‘As Soon As Possible’ (ASAP) execution level, or depth level (Figure 35 (b)). This level is used to guarantee that a node can only be executed once all of its predecessors have executed, thereby respecting the data flow graph node dependencies. The algorithm then uses these ASAP levels to partition the nodes of the data flow graph into k-way equal sized partitions (Figure 35 (a)) in a time complexity of O(V + E), where V is the number of vertices and E is the number of edges

in the data flow graph.

Figure 35: Temporal partitioning proposed by Purna


Instead of having to partition an application into equal sizes, the operating system requires the

partitioning algorithm to be able to divide the application into a number of predefined sized

partitions that match the current FPGA layout. For example if the incoming application was

30 CLBs in size (10 x 3) and there were two segments of free area of 20 CLBs (4 x 5) and 12

CLBs (4 x 3), the application would need to be partitioned to fit into these two segments.

This was achieved by integrating a target partition size, and a monitor that keeps track of how

much the application has been partitioned. The process of partitioning an application in the

operating system begins with the target partition size being calculated by the Allocator and

passing it to the Partitioner along with the application. As the application arrives at the

Partitioner, it calculates the ASAP levels for each of the data flow graph nodes. It then begins

to partition the data flow graph nodes according to their ASAP levels, starting at the lower

levels first. Once the combined area required by the nodes exceeds the target size of the

partition, the Partitioner records the last node that was put into the partition and returns back

to the Allocator the details of which nodes have been allocated into the partition. If the entire

application does not fit into the allocated segment, it will request another from the Allocator.

The Partitioner will then repeat the process described above, however it will begin with the

last node that was partitioned instead of the first node. This entire process is repeated until the

entire data flow graph has been partitioned into segments of vacant area.
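The modified algorithm described above can be sketched as follows. ASAP levels are computed with one topological pass in O(V + E); nodes are then consumed in level order until each target partition size supplied by the Allocator is filled. The node areas, graph representation and tie-breaking within a level are assumptions of this sketch rather than details taken from the thesis.

```python
from collections import defaultdict, deque

def asap_levels(areas, edges):
    """Assign each node its ASAP (depth) level in O(V + E).

    areas: dict mapping node name -> area in CLBs; edges: list of (src, dst).
    """
    succs, indeg = defaultdict(list), {n: 0 for n in areas}
    for s, d in edges:
        succs[s].append(d)
        indeg[d] += 1
    level = {n: 0 for n in areas}
    queue = deque(n for n in areas if indeg[n] == 0)
    while queue:
        n = queue.popleft()
        for m in succs[n]:
            level[m] = max(level[m], level[n] + 1)
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)
    return level

def partition(areas, edges, target_sizes):
    """Fill each target partition with nodes in ASAP order, stopping a
    partition when the next node would exceed its target size. Returns the
    partitions and a flag indicating whether every node was placed."""
    levels = asap_levels(areas, edges)
    order = sorted(areas, key=lambda n: levels[n])   # predecessors come first
    partitions, i = [], 0
    for size in target_sizes:
        part, used = [], 0
        while i < len(order) and used + areas[order[i]] <= size:
            part.append(order[i])
            used += areas[order[i]]
            i += 1
        partitions.append(part)
    return partitions, i == len(order)
```

For a diamond-shaped graph a→{b, c}→d with areas of 4, 3, 3 and 2 CLBs and target segments of 8 and 10 CLBs, the sketch places {a, b} in the first segment and {c, d} in the second.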

The changes made to the algorithm have not affected its runtime complexity. To gain an

accurate measure of the actual time the partitioning algorithm took to execute, an experiment

described in the next section was conducted to measure it under various conditions.

5.2.3 Algorithm performance

The performance of a partitioning algorithm is usually measured by the effect it has on the

performance of the application it partitions. If the clock speed or throughput is significantly

reduced because it has been partitioned, the algorithm is usually considered to be poor.

However the performance of a partitioning algorithm within an operating system must

consider both the effect it has on application performance and the amount of execution

runtime it consumes. In this section the amount of execution runtime that the partitioning

algorithm consumes under various conditions will be measured. The experiment to measure

the loss in application performance will be deferred until the operating system prototype has

been described.


To measure the execution runtime the partitioning algorithm consumes under various

conditions, a trivial application structured as a data flow graph consisting of 40 nodes was

partitioned 40 times. Each time the application was partitioned, the number of parts it was

divided into increased. It was initially divided into two parts, then three, and so on until all 40

nodes of the data flow graph were divided into their own partition. Each time the algorithm

completed an iteration of the partitioning algorithm, its execution runtime was recorded. To

make sure only the runtime of the partitioning algorithm was being measured, an array of target partition sizes had previously been determined so as to prevent the measured times being distorted by the runtime of the Allocator. It was decided to use an application consisting of 40 nodes across all of the iterations of the partitioning experiment because this would produce the worst possible execution runtime in each test; for the same number of partitioned nodes, the amount of runtime the algorithm would consume is likely to be less if, for example, the application only had a total of 20 nodes. It was felt that the results from this experiment would be very similar when other applications were used because, although the application only performed trivial computation, the partitioning algorithm did not consider what computation was being performed within the nodes, only the connections between them.

The results from this experiment are graphed in Figure 36 and there are several points that can

be drawn from the results.

[Figure 36 is a graph plotting the number of cores the application is partitioned into (up to 40) against the execution runtime of the partitioning algorithm in milliseconds on an Intel Celeron 1.2 GHz microprocessor.]

Figure 36: The execution runtime obtained from the partitioning algorithm


1. The minimum execution time consumed by the algorithm was approximately 85ms.

The major part of this execution time was consumed calculating the ASAP levels of

each node in the data flow graph as the application had 40 nodes at various levels.

This was evident because even though the number of partitions the application was

being divided into went from 2 to 5, the execution time only increased by 18ms. The

minimum execution time could be reduced if an application with fewer nodes was

partitioned.

2. The graph appears to show an approximately linear relationship between the number of partitions the application is being divided into and the execution runtime in the range of 5 to 20 partitions. At 20 and 35 partitions the graph appears to break from this linear relationship and increase significantly. There does not appear to be any obvious explanation for this.

3. There would be few situations where dividing an application into any more than 20

partitions would be suitable because the loss in performance would likely far outweigh

the benefits of squeezing the application into the last few percent of spare area on the

FPGA.

In summary, it is felt that the execution runtime of the modified temporal partitioning algorithm

would not introduce too much of an overhead if integrated into the operating system. The

average application would only be partitioned into between 2 and 15 partitions and as such

would introduce approximately 85ms to 250ms of delay. This was considered to be an

acceptable overhead. Therefore, the modified temporal partitioning algorithm will be

integrated into the prototype operating system described in the next chapter. This concludes

the experiments conducted into the performance measurement of the allocation and

partitioning algorithms.


5.3 Conclusion

This chapter resulted in two major deliverables: an algorithm for the Allocator and an

algorithm for the Partitioner. This was achieved by firstly creating a list of algorithms that

matched the runtime and functional specifications of the Allocator and Partitioner from either

the reconfigurable computing or non-reconfigurable computing domains. These algorithms

were then sorted based on their runtime complexity and the most promising were modified to

suit the architecture and then implemented for further experimentation. An experiment to

measure the execution runtime of the algorithms was then performed to determine whether it

was acceptable. From this experiment it was judged the best performing allocation algorithm

was the Minkowski Sum with bottom left heuristic which recorded a maximum execution

runtime of 100ms. It was also determined that the modified temporal partitioning algorithm

also meets the runtime requirements with a maximum execution runtime of approximately

200ms. Both of these algorithms will now be integrated into the operating system prototype

described in the next chapter.

Chapter 6 – Operating system prototype and metrics

6 Operating system prototype & metrics

In the previous chapters a set of abstractions, an architecture, algorithm specifications, and

specific allocation and partitioning algorithms for a reconfigurable computing operating

system were all defined. This chapter describes ReConfigME; the prototype of a

reconfigurable computing operating system. The chapter also reports on the experience

running applications on the operating system and introduces the metrics that will be used to

assess its performance in chapter 7. Figure 37 illustrates the methodology that is associated

with this chapter.

Figure 37: Previous work, methodology and

deliverables associated with this chapter

This chapter is divided into two sections and each section is associated with a deliverable. The

first section details how the prototype operating system known as ReConfigME was

constructed according to the architecture and algorithm specifications that were previously

defined in this thesis. This section will include a discussion on the prototype’s target platform,

application and primitive architecture, operating system structure, the applications developed

for use with ReConfigME, and issues that were faced during the implementation. Previous


research literature from both the software and reconfigurable computing operating system

domains will be used to influence the construction of ReConfigME. In the second section, a

set of metrics will be selected that will be used in the following chapter to measure the

associated performance of the operating system prototype. These metrics will be selected by

reviewing previous literature to determine what application designers perceive reconfigurable

computing application performance to be. These will be combined with any metrics that can

be transferred from the software operating system domain that measure important operating

system performance characteristics.

The actual programming of ReConfigME was carried out by the author, Martyn George,

Maria Dahlquist, and Mark Jasiunas based on the detailed design demonstrated here under the

direction of the author and his supervisor. The work was funded by the Sir Ross and Sir Keith

Smith Trust Fund and acknowledgements are made here to the programmers and the funding

authority that supported them over several years.


6.1 Operating system prototype

In the previous chapters, the process, address space, and inter-process communication

abstractions, an architecture, and allocation and partitioning algorithms were all discussed and

decisions were made on which were the most suitable for use in a reconfigurable computing

operating system. In this section all of these details and decisions are combined into the

construction of a prototype operating system known as ReConfigME. The purpose of

ReConfigME is to manage applications on the FPGA, but ReConfigME does not itself run on

the FPGA. In theory it could run on the same FPGA if that FPGA had a suitable hard or soft

core processor that supported high level languages, and the FPGA supported self-reconfiguration.

However, current commercial FPGAs do not have a fast enough hard core processor and do not

have self-reconfiguration capabilities. The algorithms comprising the operating system

could also be adapted to run in hardware, but since this is the first prototype, the emphasis is

on an easy implementation platform. Therefore ReConfigME is a set of Java-based applications

executing in software on a standard PC. Shown in Figure 38 is the internal structure of

ReConfigME. This implementation architecture is more complex than the original

architecture proposed in chapter 4 as that architecture only maps to the Colonel component of

the operating system.

Figure 38: ReConfigME implementation architecture


The ReConfigME implementation is structured into three tiers consisting of user, platform

and operating system which are connected via a standard TCP/IP network. Users connect to

ReConfigME through a custom built client interface which enables them to load applications,

transfer application data and configuration information, and monitor the reconfigurable

computing platform status. ReConfigME enforces a strict FPGA application architecture

consisting of a data flow graph structure, memory based I/O, EDIF application file format,

and the associated software only components. It supports multiple applications through the

use of FPGA hardware resource allocation, application logic partitioning, runtime bitstream

generation, and runtime reconfiguration. For easier implementation and due to technology

limitations, ReConfigME has a limit on the number of concurrent applications and uses static

application memory allocation. The current FPGAs and their design tools do not support

dynamic runtime reconfiguration of arbitrary sized applications, thus ReConfigME simulates

dynamic runtime reconfiguration. When ReConfigME wants to allocate a new application to

the FPGA, all running applications are checkpointed, the FPGA clock is stopped, and a new

bitstream including the new and all the existing applications is downloaded. The existing and

new applications are then started or restarted.

This section is structured as follows. The reconfigurable computing platform used and the

factors affecting its selection are described first. In section 6.1.2, the restrictions placed on the

application’s design are outlined as an application architecture. The primitive architecture

used to support the inter-process communication abstraction is then detailed. In section 6.1.4,

ReConfigME’s software implementation structure is described which includes the use of a

three-tier networked communication architecture. A detailed listing of the procedure involved

in executing an application under ReConfigME is then described through the use of a sample

application. The applications that were implemented to test the correct functionality of

ReConfigME are then detailed. The section concludes by reviewing why the implementation

did not entirely match the proposed architecture and the implementation issues that were

faced during construction.

6.1.1 Hardware platform

The prototype of ReConfigME was developed on a standard PC with a Celoxica RC1000pp

development board, in a typical co-processor configuration. The RC1000pp is a standard PCI

bus card equipped with a Xilinx Virtex XCV1000 part with 1 million system gates. It has

8MB of SRAM directly connected to the FPGA in four 32-bit wide memory banks. The


memory is dual ported to the host CPU across the PCI bus accessible by DMA transfer or as a

virtual address. Figure 39 is a block diagram showing the connections between the

components of the RC1000pp development board.

Figure 39: RC1000pp Block Diagram

This platform was selected for the operating system prototype for several reasons. Firstly, the

platform consists of a medium grained FPGA, loosely coupled to a modern high performance

microprocessor via a standard PCI bus. This configuration was determined in section 2.1.2 to

best suit an operating system environment. The medium grained FPGA has ample resources

to be shared amongst multiple concurrent applications and the PCI bus has sufficient I/O

bandwidth to support the streaming of data into the applications. Secondly, the platform has

four banks of high capacity dual port memory. As was described in the inter-process

communication abstraction, processes will communicate with each other and the external

microprocessor via the platform’s on-board memory. External I/O data will be loaded into a

memory bank via the PCI bus and then passed into the process via the FPGA pins and

memory controller. This type of I/O transfer requires dual-port memory as the host and FPGA

can communicate directly with the memory bank. Finally, the platform supports runtime

reconfiguration via SelectMAP over PCI.


Figure 40: The RC1000pp

6.1.2 Application architecture

The applications used in conjunction with ReConfigME need to be designed with the

following four characteristics. Firstly, the applications should be structured according to the

data flow graph model defined in section 4.1.1 and shown in Figure 41. This enables the

Partitioner to divide the application into several partitions that better match the geometric

dimensions of the vacant FPGA area. However, applications that are not structured as a data

flow graph model can still be used with ReConfigME but no attempt to partition them will be

made. This may result in an extended application response time, as not being able to partition

the application will increase the chance of it being blocked if the particular size of vacant area

needed is not available.

Figure 41: Application architecture for ReConfigME

Secondly, data source and sink nodes are inserted into the applications at points where input

or output data is required. As all inter-process communication is conducted via on-board

memory, these nodes provide the interface between the application and the on-chip memory

controller. The applications in the prototype have access to 1MB of memory each starting at a


virtual address of 0x00. The memory controller will then convert this virtual address to a real

address based on the static allocation of external memory. The on-chip memory controller in

conjunction with the ReConfigME server is then responsible for reading and writing the data

to and from the applications and the appropriate memory location. This interface allows

applications to be programmed in both VHDL and Handel-C.
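The virtual-to-real translation performed by the memory controller can be sketched in software. The following is an illustrative Java sketch, not the ReConfigME source: the slot numbering scheme and method names are assumptions, with only the 1MB per-application region and the 0x00 virtual base taken from the text.

```java
// Hypothetical sketch of the static virtual-to-real address translation
// described above. Slot assignment and names are assumptions for
// illustration only.
class StaticMemoryMap {
    static final int APP_REGION_BYTES = 1 << 20; // 1MB per application

    // Each application is statically assigned a slot; its virtual addresses
    // start at 0x00 and are offset by the slot base at runtime.
    public static long toReal(int slot, long virtualAddr) {
        if (virtualAddr < 0 || virtualAddr >= APP_REGION_BYTES)
            throw new IllegalArgumentException("address outside 1MB region");
        return (long) slot * APP_REGION_BYTES + virtualAddr;
    }
}
```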

Thirdly, each of the nodes of the data flow graph must be relocatable on the FPGA, as

ReConfigME will determine where to allocate the nodes at runtime. Since the current tools do

not support runtime routing of pre-placed and pre-routed applications, such applications

cannot be arbitrarily relocated. Each node of an application is therefore synthesised into an

intermediate file format at compile time, but not placed and routed to a bitstream. The

intermediate file format chosen for ReConfigME is EDIF. This file format has advantages over

many of the others: almost all design entry methods can generate it, it is not specific

to a particular FPGA vendor or company, it has an open specification, and

multiple EDIF files can easily be merged together to result in a single FPGA bitstream. The

EDIF files are combined by the operating system with an area constraint file that specifies the

location of each node and the complete FPGA is then placed and routed.

Finally, each of the nodes in the data flow graph model, and the entire model itself, must have

an estimate of the geometric dimensions of the FPGA area they will require when passed into

ReConfigME (see Figure 41). It is therefore necessary at design time to execute the place and

route tools over each of the nodes to gain a size estimate. This will not be an entirely accurate

area estimate, especially if the aspect ratio has to be changed, and as such a margin for error is

added to the area estimate used in ReConfigME.
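As a minimal sketch of the padding step described above, the following Java fragment adds a safety margin to a node's place-and-route area estimate. The 20% margin used in the usage example is an assumption for illustration; the thesis does not state the margin ReConfigME uses.

```java
// Illustrative sketch only: pads a node's pre-compiled area estimate (in
// CLBs) with a safety margin to absorb inaccuracy from aspect-ratio changes.
class AreaEstimate {
    public static int padded(int widthClb, int heightClb, double margin) {
        int area = widthClb * heightClb;          // raw place-and-route estimate
        return (int) Math.ceil(area * (1.0 + margin)); // round up, never under-book
    }
}
```

For example, a 20 CLB by 20 CLB node padded by an assumed 20% margin would be booked as 480 CLBs.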

6.1.3 Primitive architecture

The primitive architecture of ReConfigME is that part of the hardware that is configured onto

the FPGA before any user applications and remains there. The primitive architecture is used

to support the previously defined inter-process communication abstraction. It consists of a

memory controller and network terminators. The memory controller is responsible for

granting access to the memory when requested by an application, and managing the transfer

of the I/O to the particular application. As the RC1000 consists of four 2MB memory banks,

accessible either via the host computer or FPGA, the memory controller has to negotiate with

the platform memory arbitrator to ensure both the host and FPGA applications do not write to


the same memory bank simultaneously. For ease of implementation, the memory controller

logically divides the memory into fixed-size blocks, each of which is then allocated to a

single process requiring I/O. Although this limits the total number of processes resident on the

FPGA, it will not impact on the results gained from the set of experiments that will be

conducted on the prototype. Shown in Figure 42 are the memory and the primitive

architecture associated with ReConfigME.

Figure 42: Operating system primitive architecture
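The fixed-size block scheme described above can be sketched as a simple ownership table. This is an illustrative Java sketch, assuming integer process identifiers and a configurable block count; the real memory controller is a hardware circuit, so this only mirrors its allocation policy.

```java
import java.util.Arrays;

// Hedged sketch of fixed-block I/O memory allocation: memory is divided into
// equal blocks, each owned by at most one process. Block count and process
// identifiers are illustrative assumptions.
class FixedBlockAllocator {
    private final int[] owner; // process id per block, -1 if free

    public FixedBlockAllocator(int blocks) {
        owner = new int[blocks];
        Arrays.fill(owner, -1);
    }

    // Returns the index of the block granted to the process, or -1 if every
    // block is taken (i.e. the maximum number of I/O-using processes is reached).
    public int allocate(int processId) {
        for (int i = 0; i < owner.length; i++) {
            if (owner[i] == -1) { owner[i] = processId; return i; }
        }
        return -1;
    }

    public void free(int block) { owner[block] = -1; }
}
```

The hard block limit is what bounds the number of processes resident on the FPGA, as noted in the text.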

As I/O arrives at the memory controller from a process, it negotiates with the memory

arbitrator to ensure it has exclusive access to the particular memory bank. Once access has

been granted, it then has to convert the local addressing scheme that each process is using into

the global addressing scheme to ensure the data is loaded into the correct location in memory.

The memory controller then either reads or writes the data into the calculated memory

position.

Each of the processes is connected to the memory controller via a single network terminator.

The network terminators simply provide the matching interface for the data source and sink

nodes so processes can easily connect to them. This currently consists of a custom bus of 21

address lines, 32 data lines, 4 single-bit control lines, and a single-bit clock line. Processes can

then either read or write anywhere within the 1MB range allocated to them.

6.1.4 ReConfigME implementation architecture

The overall architecture of the operating system is component based with each operation

separated into small independent components which communicate via a simple message based


mechanism. As there are many issues relating to reconfigurable computing operating systems

that have not been fully researched, this type of architecture was chosen over the more

traditional monolithic operating system architecture. As the requirements of a modern

traditional operating system are well defined, the implementation of a monolithic

architecture is relatively straightforward. In a reconfigurable computing operating

system, however, the requirements are still unclear, so a monolithic

architecture was avoided because such architectures are difficult to maintain as the detailed

requirements emerge.

A simple multi-client server arrangement was chosen to structure the inter-component

communication in the prototype. This involved one client-server connection between

the user and bitstream generation components, and another between the bitstream generation

components and the reconfigurable computing platform. This allows the user to be remotely

located from the majority of the operating system components, possibly via a remote web

front end, and the reconfigurable computer to be remotely located from the bitstream

generation tools. Another benefit of this design is that ReConfigME can manage multiple

FPGA cards which can be physically located within the same machine or in separate

machines making it easily scalable. Such an arrangement allows maximum flexibility with

respect to location of the user, platform and ReConfigME’s bitstream generation components.
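The kind of TCP/IP client-server exchange used between the tiers can be illustrated with a minimal loopback round trip. The one-line text protocol and the class name below are assumptions made for demonstration; ReConfigME's actual wire format is not given in the text.

```java
import java.io.*;
import java.net.*;

// Illustrative sketch (not the ReConfigME source) of a request/response
// exchange over standard TCP/IP sockets, as used between the tiers.
class TierLink {
    // Opens a loopback server on an ephemeral port, sends one request from a
    // client, and returns the server's reply.
    public static String roundTrip(String request) throws Exception {
        try (ServerSocket server = new ServerSocket(0)) {
            Thread service = new Thread(() -> {
                try (Socket s = server.accept();
                     BufferedReader in = new BufferedReader(
                         new InputStreamReader(s.getInputStream()));
                     PrintWriter out = new PrintWriter(s.getOutputStream(), true)) {
                    out.println("ACK " + in.readLine()); // acknowledge the request
                } catch (IOException ignored) {}
            });
            service.start();
            try (Socket c = new Socket("127.0.0.1", server.getLocalPort());
                 PrintWriter out = new PrintWriter(c.getOutputStream(), true);
                 BufferedReader in = new BufferedReader(
                     new InputStreamReader(c.getInputStream()))) {
                out.println(request);
                String reply = in.readLine();
                service.join();
                return reply;
            }
        }
    }
}
```

Because the two endpoints are connected only by sockets, either one can be relocated to another machine without code changes, which is the flexibility the text describes.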

The inter-component communication structure has ReConfigME divided into three tiers; user,

operating system, and platform. Although there is no general agreement about what

constitutes a tier [6], a machine separated by network communication is considered a tier

in this prototype. The client tier primarily performs the interaction between the operating

system tier and the user by providing a shell as an interface. The operating system tier

contains the operating system architecture that consists of the resource allocation, application

partitioning and bitstream generation. The platform tier consists of the reconfigurable

computing platform and the components needed to access it. ReConfigME’s components

were then separated into these three tiers and can be seen in Figure 38. The curved cornered

rectangles indicate the component was constructed specifically for the prototype. Rectangular

components represent off the shelf products. This figure is very similar to a protocol stack;

data enters the tier via the bottom component which is connected to the others via a physical

network. Data progresses through the tier until it reaches the destination component. Likewise

data that needs to be transferred to another tier will progress down through the tier until it


reaches the physical network. Each of the components and tiers will now be discussed in more

detail.

Platform tier

The platform tier consists of seven components and is primarily responsible for the

communications to and from the reconfigurable computing platform. All of the components

except the reconfigurable computer and the network are resident in software on a PC that hosts

the reconfigurable computing platform. The top level component is the hardware abstraction

layer (HAL) server and is responsible for hiding the platform specific API. It is a simple API

written in Java that can be used with various platforms to offer access and control over the

hardware. It provides methods for reading and writing bitstreams to the FPGA, reading and

writing to the on-board memory, and clock management. As the RC1000 used in

ReConfigME is shipped with C++ libraries, Java native method calls were used to connect the

hardware abstraction layer API to the corresponding RC1000 library method. The advantage

of the hardware abstraction layer is the same API can be used to communicate to any number

of different target platforms.
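The HAL API described above might look like the following Java sketch. The method names and signatures are guesses made for illustration (the text says only that the HAL offers bitstream download, on-board memory access, and clock management); the in-memory class stands in for the real server, which forwards these calls to the RC1000 C++ library via Java native methods.

```java
// Hypothetical shape of the hardware abstraction layer API; names and
// signatures are assumptions, not taken from the ReConfigME source.
interface HardwareAbstractionLayer {
    void configureBitstream(byte[] bitstream);
    void writeMemory(int bank, long offset, byte[] data);
    byte[] readMemory(int bank, long offset, int length);
    void setClockRate(int hz);
}

// In-memory stand-in used here only to exercise the interface. The real
// implementation wraps the platform's native library.
class SimulatedHal implements HardwareAbstractionLayer {
    private final byte[][] banks = new byte[4][2 << 20]; // four 2MB banks
    private byte[] bitstream;
    private int clockHz;

    public void configureBitstream(byte[] b) { bitstream = b.clone(); }
    public void writeMemory(int bank, long offset, byte[] data) {
        System.arraycopy(data, 0, banks[bank], (int) offset, data.length);
    }
    public byte[] readMemory(int bank, long offset, int length) {
        byte[] out = new byte[length];
        System.arraycopy(banks[bank], (int) offset, out, 0, length);
        return out;
    }
    public void setClockRate(int hz) { clockHz = hz; }
}
```

The benefit claimed in the text follows directly: client code is written against the interface, so any platform with its own implementation can be driven unchanged.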

The hardware abstraction layer also supports a client/server paradigm so the reconfigurable

computing platform can be remotely located (see Figure 43). Connections are made to the

HAL server via standard TCP/IP sockets from the HAL client, located in the operating system

tier. Bitstream files, input and output data, and clock configurations are then passed back and

forth between the client and server.

Figure 43: Platform tier architecture

The other components in the platform tier are used to support the HAL server. Java was

chosen as the implementation language because of its ease of internetworking, its object

orientated semantics, and its portability across different hosts, operating systems, and

hardware. The PC operating system component, in this case Windows XP, is needed to


manage the hardware resources of the host computer and the TCP/IP and network components

are required to provide the connectivity between the HAL server and the HAL client.

Operating system tier

The operating system tier consists of seven components and is responsible for allocating and

partitioning applications, the generation of the FPGA bitstreams, and the transfer of

application data and configuration information between the platform and user tiers. The top

level component of the operating system tier is dubbed “Colonel” (a homophone of a software

operating system’s kernel, spelt differently to avoid confusion). The Colonel does everything

except the transfer of data between the other tiers. It consists of three sub-components and the

bitstream generation tools (see Figure 44) in a structure that reflects the architecture of the

operating system that was described in section 4.2.2.

Figure 44: Architecture of ReConfigME’s Colonel

As a user connects to ReConfigME, their application and configuration information is passed

into the Colonel via the user server. The application and its pre-compiled geometric

dimensions are then passed onto the Allocator which, in conjunction with the Partitioner, will

determine whether the application can be configured onto the FPGA or is blocked and put into a

queue because of the lack of vacant area. The Allocator consists of the Minkowski Sum with

bottom left fit algorithm that was described in section 5.1.4 and the Partitioner consists of the

modified temporal partitioning algorithm that was described in section 5.2.2.

Once all the locations of the application’s partitions have been determined, the Allocator will

create a constraints file which ensures the absolute placement details it calculated are

followed once the FPGA bitstream is generated. In ReConfigME, the constraints

file is in the standard vendor format. The main control loop will then create and call a script

that executes the place and route tools. This will generate an FPGA bitstream that includes all


of the loaded applications in their correct locations. It is then passed onto the HAL client, which

is responsible for connecting to the platform and configuring the new bitstream onto the

FPGA.

The Colonel also manages the transfer of application data: capturing the input data

from the user and loading it into the on-board memory, and reading the output data from the

on-board memory and passing it back to the user. This task primarily consists of an address

translation. The local addressing scheme is translated into the platform’s global addressing

scheme to ensure the correct location in the platform’s memory is accessed for either reading

or writing. The Colonel also passes specific clock and platform information between the HAL

client and user server.

The second level components of the operating system tier are the HAL client and the user server (see

Figure 45). The HAL client component is responsible for creating a connection to the desired

platform and passing all of the I/O, bitstreams and configuration information between the two.

It allows the platform to be remotely located from the Colonel. The advantage of this is that

ReConfigME can target numerous different platforms without having to have them all located

in the same machine as the Colonel.

The user server handles all the communications between the user client in the user tier and the

Colonel. This includes input and output of application data, incoming applications, and

platform configuration information such as clock settings. The user server accepts

connections via standard TCP/IP sockets from numerous remotely located clients located in

the user tier. Once a connection has been established, it is responsible for passing the data to

the Colonel and then responding to the client with the associated response. The advantage of

having the communication component separate from the Colonel is that if the network protocol or

client/server API is altered, only those components need to be modified, not the complex

Colonel itself.


Figure 45: Operating system tier

User tier

The user tier contains five components and is primarily responsible for providing a user

interface and connection to the operating system tier. The top level component is the user

interface (see Figure 46) and consists of a combination of a simple command line interface for

user input and a graphical user interface for displaying the geometric layout of the currently

executing applications on the reconfigurable computing platform. Via the command line

interface, users are able to load applications, stream I/O data to the platform’s on-board

memory and configure particular platform settings such as clock values. The graphical user

interface displays the results of the allocation and partitioning of applications as they are

loaded into ReConfigME (see Figure 51).

The user client is the second level component in the user tier and provides an interface to the

Colonel via the user server. It communicates via standard TCP/IP sockets to the user server

located in the operating system tier and simply converts user requests from the command line

interface into the API defined for use between the user client and server components. The

advantage in using the user client and server is other user interfaces can easily be added with

little or no change to the Colonel.

Figure 46: User tier architecture


6.1.5 Sample application execution

Two types of files need to be created for an application to be loaded onto

the reconfigurable computer via ReConfigME: the application itself, with an EDIF file for

each data flow graph node, and a Java class file that defines how these EDIF files are

connected together in a data flow graph model. The first stage in developing an application for

use with ReConfigME is the generation of the series of EDIF files that describes the

behaviour of the application. This procedure initially involves the designer determining how

the application will be structured. Shown in Figure 47 is the complete sample application

structured as a data flow graph model with the ADD 1, XOR, and AND representing one node

each.

Figure 47: Complete sample application in data flow graph format

Each of these nodes will then result in a single EDIF file. Almost any design entry method

can be used to create these nodes but in this example the hardware description language

developed by Celoxica known as Handel-C [30] was used. Shown in Figure 48 is a code

listing of the first node in a sample application.

Figure 48: Handel-C code listing for add one data graph flow node

It simply reads a 32 bit number from the first location in memory, adds one to the number,

and then writes the result back into the second location in memory. As can be seen from the


code, the data is loaded into the memory from the host via the readMem() and writeMem()

methods. These methods insert the data source and sink nodes into the application so it can be

connected to the memory controller. The Handel-C source code for the other nodes in the data

flow graph look very similar except instead of adding one to the number, the second node

performs a logical XOR against a set mask and the third node performs a logical AND against

another set mask. All the Handel-C source files are then compiled and an EDIF file is

generated for each node. As is shown in the code in Figure 49, three new cores and their

dimensions which represent each node in the graph are added into the instance tg.

Figure 49: Java class file defining data flow graph structure

The code initially involves creating an instance of the class TaskGraph which represents a

data flow graph, and initialising the parameters defining its geometric dimensions, name and

whether it should be partitioned. An application can be prevented from being partitioned by

ReConfigME if the designer believes it has strict performance constraints. Each of the EDIF

filenames and the area they will consume are then added into the structure of the data flow

graph as nodes by simply adding a new Vertex into an array within the instance. In this

sample execution, the data flow graph consists of three nodes or EDIF files: add_one_core,

XOR_core, and AND_core. The edges which represent the communication links between the

nodes in the data flow graph are created in the instance by calling the method addEdge and

passing the core numbers of the communicating nodes. In the sample application, the

add_one_core connects to the XOR_Core, which connects to the AND_core. Shown in Figure

50 is a static class diagram of the complete data flow graph application.


Figure 50: ReConfigME data flow graph class structure
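The TaskGraph usage described above can be re-created as a hedged sketch. Only the names TaskGraph, Vertex, addEdge, and the three core files come from the text; the fields, constructor parameters, edge representation, file extension, and node dimensions below are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// A node of the data flow graph: one EDIF file plus its estimated dimensions.
// Field names are guesses, not the ReConfigME source.
class Vertex {
    final String edifFile;
    final int widthClb, heightClb;
    Vertex(String edifFile, int widthClb, int heightClb) {
        this.edifFile = edifFile; this.widthClb = widthClb; this.heightClb = heightClb;
    }
}

// Minimal re-creation of the TaskGraph class described in the text.
class TaskGraph {
    final String name;
    final boolean partitionable; // designer may forbid partitioning
    final List<Vertex> vertices = new ArrayList<>();
    final List<int[]> edges = new ArrayList<>(); // {from, to} core numbers

    TaskGraph(String name, boolean partitionable) {
        this.name = name; this.partitionable = partitionable;
    }
    int addVertex(Vertex v) { vertices.add(v); return vertices.size() - 1; }
    void addEdge(int from, int to) { edges.add(new int[]{from, to}); }
}
```

A sample graph would then be built by adding add_one_core, XOR_core, and AND_core as vertices and chaining them with two addEdge calls, before handing the object to the (hypothetically signatured) RC1000 connection class.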

The next part in the Java class file is to define the connection to the ReConfigME server and

pass the TaskGraph object containing all the data flow graph details. This is simply performed

by creating an instance of the class RC1000 with the parameters of the TaskGraph, IP address

and port number of the server. This results in the instance of the data flow graph and all

associated EDIF files being loaded into ReConfigME so the generation of the bitstream can

begin. Once the bitstream has been generated and dynamically configured onto the FPGA, the

necessary reads and writes to and from the memory are performed.
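The end-to-end behaviour of the three sample nodes (add one, XOR against a mask, AND against a mask) can be mimicked in software. The mask values below are invented for illustration; the thesis does not state the masks used in the sample application.

```java
// Software-equivalent sketch of the three Handel-C nodes in the sample
// application. Mask constants are assumptions, not from the thesis.
class SamplePipeline {
    static final int XOR_MASK = 0x0F0F0F0F; // assumed mask for XOR_core
    static final int AND_MASK = 0x00FFFF00; // assumed mask for AND_core

    public static int run(int input) {
        int v = input + 1; // add_one_core: read, add one, write back
        v ^= XOR_MASK;     // XOR_core
        v &= AND_MASK;     // AND_core
        return v;
    }
}
```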

Shown in Figure 51 and Figure 52 are screen captures of ReConfigME processing the sample

application. Figure 51 (a) shows the primitive architecture consisting of the memory

controller (shown in grey) and two network terminators (shown in black and white)

configured onto the FPGA when the prototype is started. Once the user client connects to

ReConfigME, it begins processing the application and a log of this is shown in Figure 51 (b).

This involves allocating the application onto the FPGA, connecting the data source and sink

nodes onto the network terminator, and generating and configuring the bitstream onto the

FPGA. Once the application is configured onto the FPGA, the user interface shown in Figure

52 (a) is updated to reflect the new allocation layout. The final stage in the sample application

execution is to read and write the I/O data with the output data being stored in a local file. The

status of these actions is reported via the client interface and is shown in Figure 52 (b). Once

the client disconnects from the ReConfigME server, the application is removed from the

FPGA and the new bitstream is generated and configured onto the FPGA. This completes the

sample application execution listing.


(a) User interface (b) log file

Figure 51: Status displayed before the allocation of the application

(a) User interface (b) log file

Figure 52: Status displayed after the allocation of the application

6.1.6 Applications for ReConfigME

There were four applications implemented to verify the correct functionality of ReConfigME.

The first one was described in the previous section and was a simple mathematical based

application used to demonstrate the process of designing and executing applications under

ReConfigME. The next two applications were implemented to show that ReConfigME can be

used with real applications that require large amounts of I/O to be transferred between the

hardware circuit and software part of the application. These two applications are described

here. The last application implemented for use with ReConfigME is based on encryption and

was used to measure the performance of the prototype including the allocation and

partitioning algorithms as reported in the next chapter.

In this section, the applications of blob tracking and edge enhancement implemented for use

with ReConfigME will be detailed. This will include a description of the application, the

application’s specifications such as area consumption, and the output generated by

ReConfigME to show its allocation details.


Blob tracking

Blob tracking is a term used in the vision tracking research community which is the process of

finding the location of a known object in a series of images. In the application described here,

the object of interest is an orange coloured ball and a series of images were taken as the ball

was randomly moved. The first step in the blob tracking algorithm implemented for ReConfigME was to separate the orange coloured ball from the rest of the image. This was achieved by performing a threshold operation on the image, based on a colour value that matched the

orange ball. Each pixel in the image was examined to determine if it matched the colour of

interest. If the pixel matched the colour, in the output image it was set to white whereas if it

did not match, it was set to black. This procedure was repeated for every pixel in a frame.

Once the known colour had been separated from the image, the centre of these pixels had to

be calculated. This was achieved by simply calculating the mean location of all the pixels that

matched the threshold colour of interest. This point was then indicated by the use of red

crosshairs. Shown in Figure 53 is a screen capture of the blob tracking application.
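The threshold-and-centroid procedure described above can be sketched in software as follows. This is an illustrative Python version, not the thesis's Handel-C circuit; the reference orange value and the tolerance in `is_orange` are assumptions, as the thesis does not state the exact colour threshold.

```python
# Illustrative sketch of the two blob-tracking steps described above:
# colour thresholding, then the mean location of the matching pixels.
# An image is a list of rows of (r, g, b) tuples.

def is_orange(pixel, tol=40):
    """Crude colour match against an assumed reference orange."""
    ref = (255, 140, 0)
    return all(abs(c - rc) <= tol for c, rc in zip(pixel, ref))

def blob_track(image):
    """Return (binary output image, centre of matching pixels or None)."""
    binary, xs, ys = [], [], []
    for y, row in enumerate(image):
        out_row = []
        for x, pixel in enumerate(row):
            if is_orange(pixel):
                out_row.append(1)      # white in the output image
                xs.append(x)
                ys.append(y)
            else:
                out_row.append(0)      # black in the output image
        binary.append(out_row)
    if not xs:
        return binary, None
    # Centre is simply the mean location of all matching pixels.
    return binary, (sum(xs) / len(xs), sum(ys) / len(ys))
```

In the real application this runs once per frame, with the crosshairs drawn at the returned centre.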

Figure 53: Screen capture of the blob tracking application executing on ReConfigME

The application consists of two parts: the hardware circuit containing the blob tracking

algorithm which performs the threshold and calculation of the centre location written in

Handel-C, and the software application responsible for transferring the I/O to ReConfigME,

capturing the video in real time via a camera, and displaying the threshold image and location

of the crosshairs. The hardware circuit of the blob tracking application consumes

approximately 400 CLBs or 7% of the target FPGA, has pre-defined dimensions calculated to

be 20 CLBs by 20 CLBs and is 70 lines of non-commented Handel-C. Shown in Figure 54 is

a screen capture of the location of where the application was allocated on the FPGA when


loaded by ReConfigME. The light blue rectangle is the memory controller, the pink rectangle

is the network terminator, and the red rectangle is the blob tracking hardware circuit. The

remaining blue area is the available FPGA area for allocation.

Figure 54: Allocation status of the FPGA when the blob tracking is loaded onto the

FPGA by ReConfigME

Edge enhancement

Edge enhancement is another well-known image processing algorithm and involves

identifying the edges of objects in an image. This algorithm is often the first stage in template

matching or target recognition. The algorithm firstly involves performing a threshold of the

intensity change across a window of pixels. If the intensity change exceeds the selected

threshold, the pixel is marked as an edge. The window is moved across the entire image in

both a horizontal and a vertical direction. The output from the edge enhancement application

is shown in Figure 55.
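The windowed intensity-change test described above can be sketched as follows. This is an illustrative Python version, not the thesis's Handel-C circuit, and the window used here (adjacent-pixel differences in each direction) is an assumption, since the thesis does not state the window size.

```python
# Sketch of the edge-enhancement step described above: the intensity
# change across a small window is compared against a threshold, in both
# the horizontal and vertical directions. A grayscale image is a list
# of rows of integer intensities.

def edge_enhance(image, threshold):
    h, w = len(image), len(image[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx = abs(image[y][x] - image[y][x - 1]) if x > 0 else 0
            dy = abs(image[y][x] - image[y - 1][x]) if y > 0 else 0
            if dx > threshold or dy > threshold:
                edges[y][x] = 1   # mark this pixel as an edge
    return edges
```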

Figure 55: Screen capture of the edge enhancement application

executing on ReConfigME


As was the case in the blob tracking application, the edge enhancement application consists of

two parts: the hardware circuit that executes the edge detection algorithm, and the software

part that transfers the I/O to ReConfigME and displays the resultant edge detection. The

hardware circuit consumes 480 CLBs, has pre-calculated dimensions of 40 CLBs wide by 12

CLBs high and is 111 lines of non-commented Handel-C code. Shown in Figure 56 is a

screen capture of the location of where the edge enhancement application was allocated on the

FPGA when loaded by ReConfigME as the sole application. The red rectangle is the memory controller, the pink rectangle is the network terminator, and the grey rectangle is the edge enhancement hardware circuit. The remaining blue area is the FPGA area available for allocation to other applications. The colours differ from those in the blob tracking example because ReConfigME assigns random colours to incoming applications.

Figure 56: Allocation status of the FPGA when the edge enhancement

is loaded onto the FPGA by ReConfigME

Multiple concurrent applications with ReConfigME

Shown in Figure 57 is the allocation status when both the blob tracking and edge

enhancement applications were allocated onto the FPGA at the same time. The edge

enhancement application was loaded first (shown in grey), and next the blob tracking

application was loaded (shown in white). The memory controller is shown in pink.


Figure 57: Allocation status of the FPGA when the edge enhancement and the blob

tracking are loaded onto the FPGA by ReConfigME

With both the applications allocated onto the FPGA and the clock set to 25MHz, both

applications executed correctly and there was no noticeable difference in the frame rate of either application compared to running them separately. The output from both applications

was identical when compared to the output generated when each application had exclusive use

of the FPGA. The edge enhancement application was removed by the operating system, the

network terminator was re-allocated, and the blob tracking application continued to execute

correctly. Finally, the blob tracking application was removed from ReConfigME and the

FPGA was re-configured with no applications. Shown in Figure 58 is the output from the

Xilinx Floorplanner design tool, which shows both applications configured onto the FPGA. The

blob tracking application is shown in yellow, the edge enhancement in green and the memory

controller in light grey. This figure reflects the allocation constraints placed onto the

applications by ReConfigME.

Figure 58: Screen capture from Xilinx Floorplanner verifying

the locations of the applications on the FPGA


In this section it has been shown that ReConfigME can correctly manage real reconfigurable computing applications. Both of these applications were designed to be used with the operating system, but likewise, existing applications could easily be either loaded with very little modification, or re-designed according to the operating system application architecture so as to take advantage of application partitioning.

6.1.7 Implementation issues

Owing to an attempt to minimise the implementation complexity of ReConfigME, as well as to several technology limitations, selected characteristics that were defined in the architecture of an operating system in chapter 5 were not carried over into the construction of the prototype. The

most noticeable is the omission of a shared bus network to support inter-process

communication. In section 4.1.3 it was determined that inter-process communication could be

optimised through the use of such a network. However the runtime routing of the bus proved

impractical with the available tools. Although direct bitstream manipulation is possible

through the use of the JBits API, it was found through experimentation that the JBits runtime

router was unable to achieve a successful route for circuits of the complexity requested by the

operating system. Dynamic reconfiguration, where one application is running whilst another

is being loaded, was also not implemented because the FPGA architecture provides only

column based partial reconfiguration. Together with limits on the location and configuration

of tri-state buffers, the FPGA was found to be too inflexible to support dynamic

reconfiguration under the operating system. Even if this limitation on dynamic

reconfiguration did not exist, the tool flow’s inability to support relocatable pre-placed and

pre-routed nodes and runtime routability is currently the main limitation to its practical use

due to the extra time required to re-place and re-route applications each time a context switch

occurs.


6.2 Metrics

With any set of new abstractions, metrics are required to define particular characteristics of a

system’s performance. Two metrics that are commonly used to measure and compare the

performance of traditional software operating systems are response time and throughput, as shown in Table 13. As many of the goals of a reconfigurable computing operating system are

similar to that of the traditional operating system, it was felt that these metrics should also be

used to measure the performance of the prototype ReConfigME. In this section response time

and throughput will be outlined in more detail with the aim of using them in the next chapter

for a performance evaluation.

Metric           Definition
Response time    The amount of time the operating system takes to respond to a user request
Throughput       The amount of processing on user-level tasks per unit of time

Table 13: A summary of the metrics designed for reconfigurable computing operating systems

6.2.1 Response time

Response time is a well defined metric commonly described as the amount of time an

operating system takes to respond to user requests. It is heavily influenced by scheduling

policies, context switch time slices, and I/O latency. It is considered a very important measure in an operating system that runs a significant number of applications with real-time user interaction. For a reconfigurable computer, the response time, also known in this case as latency, is the amount of time between the user loading the application and the first results arriving back. In the prototype reconfigurable computing

operating system, this response time primarily consists of the execution runtime of the

allocation and partitioning algorithms, commercial place and route tools, and FPGA

reconfiguration time, as all of these occur after the user's request to execute the application has been entered.
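The way these components add up to the user-visible latency can be sketched as follows. This is only an accounting pattern under stated assumptions: the four stage functions are hypothetical stand-ins for ReConfigME's allocation and partitioning, the vendor place-and-route tools, and FPGA reconfiguration.

```python
# Sketch of how the response-time components named above compose.
# Each stage is timed with a wall clock and the response time is the
# sum, since all stages occur after the execute request is entered.

import time

def timed(stage):
    """Run one stage and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    stage()
    return time.perf_counter() - start

def response_time(allocate, partition, place_and_route, reconfigure):
    """Total latency from the execute request to the loaded application."""
    return sum(timed(stage)
               for stage in (allocate, partition, place_and_route, reconfigure))
```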


6.2.2 Throughput

Throughput is a commonly used metric in software operating systems to measure the amount

of output an application can generate over a specific time. In a reconfigurable computer, the

definition of throughput is no different. In order to measure the throughput of a hardware

application though, a characteristic commonly known as circuit delay is measured. Circuit

delay by definition is the amount of time between successive outputs of the circuit. The

inverse of this is usually the circuit’s clock speed. When comparing two identical

applications, the larger the circuit delay, the lower the throughput.
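The inverse relationship between circuit delay and throughput can be made concrete with a small sketch. The figures below are illustrative only, not measurements from the thesis; the 64-bit block size echoes the DES benchmark used later.

```python
# Minimal sketch of the throughput/circuit-delay relationship described
# above: circuit delay is the time between successive outputs, so the
# output rate is its inverse (scaled by the bits produced per output).

def throughput(circuit_delay_s, bits_per_output):
    """Output rate in bits per second given the delay between outputs."""
    return bits_per_output / circuit_delay_s

# A DES-style core emitting one 64-bit block per 40 ns, versus the same
# core slowed to an 80 ns delay: doubling the delay halves the throughput.
fast = throughput(40e-9, 64)
slow = throughput(80e-9, 64)
```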

It is expected that the throughput of the individual reconfigurable computing applications will

be reduced when used in conjunction with an operating system. As previously stated, the

traditional design flow tools and algorithms have been implemented to maximise throughput. However, in the operating system, the allocation and partitioning algorithms have traded this throughput for a decrease in execution time. Therefore in the next chapter an experiment to

measure the loss in throughput in applications used in conjunction with ReConfigME will be

performed.


6.3 Conclusion

This chapter resulted in two deliverables. The first is a prototype operating system known as ReConfigME, which is based on the architecture described in chapter 4. This included details of the selected platform and the implementation itself. It was discussed how the

primitive architecture that is used to support inter-process communication, the networked tier

architecture used to implement the prototype itself, a sample application execution listing, and

implementation issues that arose during its construction. The second deliverable was a set of

metrics including response time and application throughput that will be used in the next

chapter to measure the performance of the operating system and its applications.

Chapter 7 – Performance evaluation

7 Performance evaluation

In the previous chapter, a prototype operating system for reconfigurable computing known as

ReConfigME was presented and an example of concurrently executing applications was

given. Along with this prototype, a set of commonly used metrics from the software operating

system domain that measure an operating system’s impact on user response and application

performance were identified. In this chapter, these metrics will be transferred into the

reconfigurable computing operating system domain and will be used in a series of

experiments to measure the performance of applications under ReConfigME control.

Correlations between throughput, response time, and fragmentation will then be presented,

and a formula for predicting the likelihood of the operating system allocating a particular

application onto an FPGA will be developed. A summary of the previous work,

methodologies and deliverables associated with this chapter are shown in Figure 59.

Figure 59: Previous work, methodology, and deliverables

associated with this chapter

The chapter is divided into three sections with each reflecting a particular deliverable. In the

first section, the test environment, benchmark application, and test cases that will be used in

the experiments to measure the response time and throughput of ReConfigME will be

outlined. In the second section there is a detailed description of the three experiments


conducted on the operating system, the results obtained, and the conclusions that can be

drawn from the results. In the first experiment, the average time the user has to wait for their

application to have its resources allocated on the FPGA will be measured. This will be

performed by loading a series of different sized applications onto the FPGA under varying

conditions. In the second experiment, the throughput of a reconfigurable computing

application executing in conjunction with ReConfigME will be compared to the throughput

when executed in the reconfigurable computing environment without an operating system. In

the third section, all the results from these experiments will be compared to determine if any

correlation between the user response time, throughput, and area usage can be established

through the introduction of a new fragmentation metric.


7.1 Experimental environment

In the previous chapter, the operating system architecture defined in section 4.2.2 was

implemented into a prototype known as ReConfigME and the execution of some non-partitionable applications was demonstrated. To measure ReConfigME's performance, such

as user response time or application performance, a suitable partitionable benchmark

application and test environment generating simulated workloads had to be developed. In this

section, the benchmark application and the reasons why it was selected to be used in the

experiments, as well as the test cases used to generate the performance results will be

outlined.

7.1.1 Benchmark application

Although benchmarks for general purpose computers have been deeply investigated, there

still appear to be very few that are specifically designed for reconfigurable computing

applications running under an operating system. The Adaptive Computer System (ACS)

benchmark suite [87] and the Reconfigurable Architecture Workstation (RAW) [11]

benchmark suite provide a set of benchmarks in the form of commonly implemented

reconfigurable computing applications. These benchmark applications have been commonly

used to evaluate the performance of placement and routing algorithms by measuring the

characteristics of versatility, capacity, timing sensitivity, and scalability. However many of

these applications cannot be applied immediately to the operating system because they need to

be extensively re-designed to suit the new application architecture.

An application that can be readily re-designed according to the operating system application

architecture is the Data Encryption Algorithm (DEA). DEA or the ANSI equivalent Data

Encryption Standard (DES) [56] is a widely used method of encryption that enciphers and

deciphers blocks of data consisting of 64 bits under control of a 56-bit key. The algorithm

consists of three different procedures which encrypt the text: an initial permutation (IP), a

complex key-dependent computation and another permutation which is the inverse of the

initial one as shown in Figure 60. These processes are repeated 16 times to produce the

encrypted or decrypted text. To reduce the chance of the text being decrypted by unauthorised

parties, a stronger version of DES called Triple DES is used. This simply involves three

copies of the DES application used in sequence.
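The resulting nine-part structure can be sketched as a data flow graph. This is a structural illustration only; the node names are hypothetical, not ReConfigME's actual identifiers.

```python
# Sketch of the benchmark structure described above: each DES copy has
# three stages (initial permutation, the 16-round key-dependent
# computation, inverse permutation), and three copies chained in
# sequence form triple DES -- nine independently partitionable nodes.

STAGES = ["initial_permutation", "rounds_1_16", "inverse_permutation"]

def triple_des_graph():
    """Return the nine node names and the edges chaining them in sequence."""
    nodes = [f"des{copy}_{stage}" for copy in (1, 2, 3) for stage in STAGES]
    edges = list(zip(nodes, nodes[1:]))   # linear pipeline: 8 edges
    return nodes, edges
```

The data source and sink described in the text attach to the first and last of these nine nodes.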


Figure 60: DES block architecture

There are several hardware implementations of the DES algorithm previously published [1,

118]. The Free-DES [59] implementation was chosen as the basis for the benchmark

application because it is able to fit onto the target FPGA (approximately 16% of the target

area on a Xilinx Virtex 1000) and has a high performance in terms of clock speed and

throughput. The Free-DES implementation was modified to suit the application architecture

by structuring it according to the data flow graph model with three nodes. Three copies of the modified

Free-DES application were then combined in sequence to form a triple DES application. A

data source and a sink node were then inserted on the first and last nodes respectively. The

resultant triple DES application consists of nine individual parts that can be independently

partitioned.

7.1.2 Experimental configuration

To ensure ReConfigME's performance is measured in situations that approximate a realistic operating environment, a set of four test cases was developed for use in each of the experiments. These test cases were generated as follows. A workload generator produces simulated applications and the actual DES application, which are then placed in a ready queue.

The operating system ready queue is a first in first out (FIFO) where applications are stored

prior to being allocated to the FPGA. The simulated tasks are assigned a size and execution

time, each of which is selected from different and independent Gaussian distributions. When a

simulated application has been allocated onto the FPGA, it is also connected to the memory

controller via an I/O bus to generate more realistic routing resource usage. The workload

generator also samples an inter-arrival time for each simulated application from a random


distribution which determines when each application is entered onto the ready queue.

Depending on the arrival and execution time chosen, the ready queue can be empty which

implies that the FPGA is not fully loaded with applications. If the ready queue contains one or

more applications, the FPGA is considered, for the purposes of this chapter, to be fully loaded.
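The workload generator described above can be sketched as follows. This is an illustrative Python version under stated assumptions: sizes and execution times come from independent Gaussian distributions as the text says, while the inter-arrival distribution (here also Gaussian, with standard deviation half the mean) and the clamping of samples are assumptions. The default parameter values follow the "typical" set in Table 14.

```python
# Sketch of the workload generator: each simulated task gets a size and
# execution time from independent Gaussian distributions, plus a sampled
# inter-arrival time that determines when it joins the FIFO ready queue.

import random
from collections import deque

def generate_workload(n, mean_area=4.0, sd_area=2.0,
                      mean_arrival=20.0, mean_exec=200.0, seed=0):
    rng = random.Random(seed)
    ready_queue = deque()                  # FIFO queue of pending tasks
    clock = 0.0
    for i in range(n):
        # Advance the arrival clock by a sampled inter-arrival time.
        clock += max(0.0, rng.gauss(mean_arrival, mean_arrival / 2))
        ready_queue.append({
            "id": i,
            "area_pct": max(0.1, rng.gauss(mean_area, sd_area)),
            "exec_time": max(1.0, rng.gauss(mean_exec, mean_exec / 2)),
            "arrival": clock,
        })
    return ready_queue
```

A non-empty queue at any instant corresponds to the "fully loaded" condition described in the text.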

In the experiments, the mean and variance of the inter-arrival time was selected so that at least

one application was on average in the queue so that all experiments represented a fully loaded

FPGA in the context of using the operating system. Note that a fully loaded FPGA does not

mean that all the area was used by executing applications because fragmentation prevents this

from occurring. A fully loaded FPGA was used in the experiment as it is expected to result in

the worst case degradation in performance.

The experiments were divided into four cases each and two sets. Cases two, three and four

involve allocating the DES application after all the simulated applications. Case one, shown

in Figure 61 is special in that the DES application is allocated first and no other simulated

applications are allocated at all. Case one was generated to provide the expected best case

when using the operating system. These values will then be used as the baseline to compare

the subsequent cases. For cases two, three, and four, the state of the tasks laid out on the FPGA

just before the DES core is placed in the ready queue is denoted the initial floor-plan for the

case.

In case two the operating system is run with applications being allocated and then as they

complete being removed and replaced with new tasks. It is possible to run the operating

system indefinitely but we expect that after a certain number of application completions some

steady state will be reached. To examine if this happened, case two has a number of runs of

different lengths n. A run consists of the operating system allocating n tasks to the FPGA and

then placing the DES application in the queue. In case two n was chosen to be 100, 200, 300,

400, and 500. The initial floor-plans for this test case are shown in Figure 62 and Figure 63.

In case three a run for n = 100 was repeated 5 times. This allowed an examination of

variability in the measured mean performance values due to the random nature of the

selection of simulated tasks (the random nature of the initial floor plan). It was then assumed

that all data collected from the experiment has this amount of random variability. The floor-plans for this test case are shown in Figure 64.


In case four a run of 100 applications was generated. Simulated applications were then added

continually after that until an initial floor plan was found that required the partitioning of the

DES application to exactly m partitions where m ranged from one to nine. The DES

application was then partitioned and allocated onto the FPGA using the OS and performance

measured. The floor-plans for this test case are shown in Figure 65 and Figure 66.

In test cases two and four all the experiments were duplicated for two sets. Set one, denoted as

“typical” corresponded to a mean application size of 4% of the FPGA area for the simulated

tasks. Set two, denoted as "large", corresponded to a mean size of 8%. In both sets the standard deviation of the size was set to half the mean. The mean and standard deviation of the area and the

mean inter-arrival time and execution time are all shown in Table 14. The DES application

used approximately 16% of the FPGA.

                             Mean area     Std. dev. area   Mean inter-arrival     Mean execution
                             (% of FPGA)   (% of FPGA)      time (units of time)   time (units of time)
Typical size applications    4%            2%               20                     200
Large size applications      8%            4%               30                     200

Table 14: Parameters of the applications used in the response time and throughput experiments

In all of the initial floor-plans, the thin red rectangles located around the edges of the FPGA

are the memory controllers, the blue rectangles are the simulated applications and the green

rectangles are the triple DES benchmark application. If there is more than one green

rectangle, it means ReConfigME was unable to allocate the DES application without

partitioning it. If there are nine green rectangles it means every node in the data flow graph

was partitioned into a single process. In some cases, although there should be a given number of partitions, it may appear as though there are fewer because some partitions were allocated next to others, making it appear as though the application was not partitioned into the correct number of parts. The red lines connecting the processes are the routes that would be used to

connect the applications to the memory controllers.


Figure 61: Test case 1 floor-plan

(a) n = 100 (b) n = 200

(c) n = 300 (d) n = 400

(e) n = 500

Figure 62: Test case 2, set 1 (typical sized) floor-plans


(a) n = 100 (b) n = 200

(c) n = 300 (d) n = 400

(e) n = 500

Figure 63: Test case 2, set 2 (large sized) floor-plans


(a) run 1 (b) run 2

(c) run 3 (d) run 4

(e) run 5

Figure 64: Test case 3 floor-plans


(a) m = 1 (b) m = 2

(c) m = 3 (d) m = 4

(e) m = 5 (f) m = 6

(g) m = 7 (h) m = 8

(i) m = 9

Figure 65: Test case 4, set 1 (typical sized) floor-plans


(a) m = 1 (b) m = 2

(c) m = 3 (d) m = 4

(e) m = 5 (f) m = 6

(g) m = 7 (h) m = 8

(i) m = 9

Figure 66: Test case 4, set 2 (large sized) floor-plans


7.2 Performance results

The use of software operating systems on a von Neumann based architecture saw the

introduction of extra unwanted overheads such as context switching. However, the average

user was prepared to accept these extra overheads as long as the advantages of an increase in

accessibility and ease of use of the hardware platform that an operating system can provide

outweighed them. A similar situation seems likely to occur with the introduction of an

operating system to reconfigurable hardware.

In the previous chapter, it was identified that the execution time of the operating system

causes applications to have extra latency at start up and to have possible loss in application

performance because of the use of the application architecture and partitioning. These are the

major contributors to these overheads in a reconfigurable computing operating system. The

metrics of user response time and application throughput were selected to measure these

overheads. To judge if the increased accessibility and ease of use provided by the operating

system outweigh the overheads, a series of experiments to measure the user response time and

application throughput have been carried out.

7.2.1 User response time

The user response time in an operating system performing real-time user interaction should be

kept to a minimum. Users are only prepared to wait a certain length of time for a response

to their input. In a software operating system, the majority of the user response time is

consumed when a context switch is performed, and when other applications have use of the

microprocessor. However, in the reconfigurable computing environment proposed in this

thesis, a context switch is performed far less often. When an application is loaded into the

reconfigurable computing operating system, other applications can continue to execute while

the new application is allocated and partitioned. When the application is about to be loaded to

the FPGA, assuming the new application can fit, the other applications could continue to execute if the new application were loaded via dynamic reconfiguration. This

is unlike a software operating system where the application using the microprocessor would

need to be stopped. Therefore context switches are not as expensive in terms of response time

for reconfigurable computing. The user response time in the prototype operating system is

only the latency experienced when loading the application onto the FPGA. The majority of

this latency is the execution runtime of the allocation and partitioning algorithms.


The user response time in this experiment will be calculated by measuring the execution

runtime of the Colonel component of ReConfigME. This execution time includes the runtime

consumed by the allocation and partitioning algorithms, the interactions between them, and

the generation of the placement constraints file.

The execution runtime of the entire bitstream generation is excluded in the user response time

because it was performed at runtime in ReConfigME only due to tool limitations. With the

future development of new tool flows which support relocatable pre-placed and pre-routed

cores, and a hierarchical routing structure where only the top level needs to be routed at

runtime, there will be no need to place and route the entire bitstream at runtime. All of the

nodes in the data flow graph would be pre-placed and routed at compile time. ReConfigME

would then determine the location of these nodes at runtime and simply relocate them to the

allocated positions. After this, a runtime router would connect all of the communicating nodes

together via a special routing layer on the FPGA reserved for inter-process communication.

This would only require a few new routes as compared to routing the entire application and

thus could be completed in a short amount of execution time. As the current place and route

consumes approximately five minutes for the triple DES application at present, there appears

to be no reason why a new tool flow with these features could not reduce this to times

comparable with the rest of the operating system algorithms.

The actual reconfiguration time is also excluded from the user response time because it is

likely that future development of new FPGA architectures will better support dynamic

reconfiguration because of its increased interest within the reconfigurable computing

community. The majority of the current FPGAs supporting dynamic reconfiguration only

allow column based partial reconfiguration which is too restrictive in the current environment

so the entire device is reconfigured every time a change is made. As complete reconfiguration

on a modern FPGA such as a Virtex II Pro currently consumes less than a second,

architectures supporting true dynamic reconfiguration whilst making a contribution to the

operating system performance, will be secondary to the impact of improved tool flows that

allow relocatable pre-placed and pre-routed cores.

The aim of the user response time experiment is thus to measure only the operating system

allocation and partitioning latency when the benchmark application is partitioned into a range

of processes (test case 4), and when it is allocated onto the FPGA when a number of


applications are already allocated onto the FPGA (test case 2). The variance in user response

time will also be measured through the use of test case 3 as an estimate of experimental

variability.

Results

In test cases three and four, the number of applications already allocated on the FPGA, the

user response time, and the remaining FPGA area was measured for each initial floor-plan. In

addition, in test case two, the number of partitions the application was divided into was also

recorded. The results from these experiments are shown in Table 15, Table 16, Table 17 and

Table 18. In the tables, n is the number of previously allocated applications and m is the

number of partitions the application is divided into.

Test case 1   Applications already resident on the FPGA   User response time (ms)
m = 1         0                                           58

Table 15: User response time for test case 1

Test case 2, set 1 (typical):
          Applications already   No. of       User response   Remaining
          resident on the FPGA   partitions   time (ms)       area (CLBs)
n = 100   8                      3            332             2851
n = 200   8                      5            493             2739
n = 300   9                      4            400             2953
n = 400   6                      5            444             2861
n = 500   11                     7            675             2842

Test case 2, set 2 (large):
          Applications already   No. of       User response   Remaining
          resident on the FPGA   partitions   time (ms)       area (CLBs)
n = 100   4                      4            267             2551
n = 200   7                      7            490             2802
n = 300   6                      5            358             2775
n = 400   3                      3            211             2421
n = 500   7                      5            362             2587

Table 16: User response time for test case 2


Test case 3 (n = 100, m = 3):
        Applications resident   User response   Remaining
        on the FPGA             time (ms)       area (CLBs)
Run 1   9                       339             3118
Run 2   9                       337             2912
Run 3   8                       330             2835
Run 4   10                      345             2651
Run 5   6                       299             2897

Table 17: User response time for test case 3


Table 18: User response time for test case 4

Shown in Table 15 is the user response time when the DES application was allocated without

partitioning onto an empty FPGA. As expected this generated the lowest user response time

of all the experiments at 58ms. This is because it did not need to be partitioned, there were no

other applications on the FPGA, and only one partition needed to be allocated. In Table 16,

Test case 4, set 1 (typical):

Number of      Applications resident   User response   Remaining area
partitions m   on the FPGA             time (ms)       (CLBs)
m = 1          10                      98              3118
m = 2          6                       238             2268
m = 3          7                       306             3054
m = 4          10                      391             2891
m = 5          8                       478             3231
m = 6          10                      596             2778
m = 7          7                       615             2665
m = 8          10                      818             2779
m = 9          9                       899             2685

Test case 4, set 2 (large):

Number of      Applications resident   User response   Remaining area
partitions m   on the FPGA             time (ms)       (CLBs)
m = 1          3                       60              2059
m = 2          2                       175             2097
m = 3          4                       217             2268
m = 4          4                       259             2984
m = 5          5                       343             2551
m = 6          3                       375             2954
m = 7          4                       412             2891
m = 8          3                       503             2054
m = 9          2                       560             2211


given the number of applications already allocated onto the FPGA, the user response time
varied from 332ms (n = 100, three partitions) to 675ms (n = 500, seven partitions) for set 1,
and from 211ms (n = 400, three partitions) to 490ms (n = 200, seven partitions) in set 2. It
also shows the amount of

remaining area in CLBs was fairly consistent for all values of n. This indicates the system has

reached a steady state after 100 applications have been allocated.

As it was determined from the previous test case that the system had reached a steady state
after 100 applications had been completed for both tests, the variance in the user response
time was calculated from five runs of n = 100 and m = 3 (Table 17) and was found to be
approximately 16ms. Only five runs were used to calculate the variance because the variation
in user response time appeared almost constant across these five runs. This variance will be
used on all graphs and response time calculations to indicate the range of error in any one
measurement.

The final experiment measured the response time as m (the number of partitions) ranged
from 1 to 9; the results are shown in Table 18. In set 2 (large) in test cases 2 and 4, a reduction of

approximately 40% in the user response time was recorded when compared to set 1. For

example in test case 4, when the application was partitioned into two partitions, in set 1 the

user response time was 238ms as compared to 175ms for set 2.

Relationships

Shown in Figure 67 is a graph of the response time versus the number of partitions the

application is divided into for both sets in test case 4. There appears to be a linear-like
relationship between the response time and the number of partitions the application is divided
into; that is, the response time increases at a constant rate as the number of partitions grows.
This is an expected result because the allocation and

partitioning algorithms that allocate and divide the application have a linear runtime

complexity as defined in chapter 5.
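This linear trend can be checked directly from the chapter's own numbers. The sketch below fits an ordinary least-squares line to the set 1 (typical) response times from Table 18; the fitting code and variable names are ours, the data are from the table:

```python
# Least-squares fit of user response time (ms) against the number of
# partitions m, using the set 1 (typical) measurements from Table 18.
m = list(range(1, 10))                             # number of partitions, 1..9
t = [98, 238, 306, 391, 478, 596, 615, 818, 899]   # response time (ms)

n = len(m)
mean_m = sum(m) / n
mean_t = sum(t) / n

slope = sum((x - mean_m) * (y - mean_t) for x, y in zip(m, t)) / \
        sum((x - mean_m) ** 2 for x in m)
intercept = mean_t - slope * mean_m

# Coefficient of determination for the fitted line
ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(m, t))
ss_tot = sum((y - mean_t) ** 2 for y in t)
r2 = 1 - ss_res / ss_tot

print(f"slope = {slope:.1f} ms/partition, intercept = {intercept:.1f} ms, R^2 = {r2:.3f}")
```

The fitted slope of roughly 96 ms per extra partition, with R² above 0.98, is consistent with the linear-runtime explanation given above.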

Shown in Figure 68 is a graph of the response time versus the number of applications already

allocated onto the FPGA before the DES application was placed for both sets in test case 2

and 4. There appears to be very little correlation between the user response time and the

number of applications already allocated on the FPGA. This is evident in both sets because


for example, the user response time ranged from 60ms to 500ms for three applications

allocated in set 1, and ranged from 100ms to 820ms for 10 applications allocated in set 2.

[Figure: line graph of response time (ms, 0 to 1000) versus the number of partitions (1 to 10), for set 1 (typical) and set 2 (large)]

Figure 67: The response time versus the number of partitions the application is
divided into for sets 1 (typical) and 2 (large) in test case 4


Response time vs. number of applications on the FPGA for test case 2 and 4

0

100

200

300

400

500

600

700

800

900

1000

0 2 4 6 8 10

Number of applications on the FPGA

Res

pons

e Ti

me

(ms)

12

Set 1 (typical) Set 2 (large)

Figure 68: The response time verses the number of applications already allocated

onto the FPGA for sets 1 (typical) and 2 (large) in test case 2 and 4

7.2.2 Application throughput

Application throughput is a commonly used metric that measures the amount of output an

application can produce per unit of time. The use of an operating system to manage the

hardware resources can result in a drop in application throughput because of the introduced

application architecture and operating system policies. Users may be prepared to trade some

of the application throughput for an increase in platform accessibility and ease of application

design but too large a loss in throughput could make the operating system unattractive.

In ReConfigME, designing an application using a data flow graph structure will not reduce

the concurrency (that is, change the number of pipeline stages); however, it might be expected to

reduce the clock rate due to the introduced design methodology. The application throughput

of a reconfigurable computing application will be measured in this experiment by calculating

the benchmark application’s signal delay (clock rate). Signal delay by definition is the amount

of time it takes between outputs. When comparing two identical applications, the larger the

signal delay, the lower the throughput.


The aim of this experiment is to measure the effect on application throughput when the

number of partitions the application is divided into (test case 4) and the number of

applications that are already allocated on the FPGA (test case 2) are varied. The worst case

application throughput was measured on a floor-plan generated when the benchmark

application is partitioned into the maximum of nine individual partitions with all

communicating partitions separated by the maximum wire length. This floor-plan is shown in

Figure 69. The benchmark application's best case throughput will be measured when not

under control of the operating system by not placing any allocation constraints onto it and

using the standard design flow and place and route tools alone. These test cases should

generate the two extreme values and will be used as a baseline to compare the application

throughput measured from the other tests.

Figure 69: Possible worst case signal delay

Results

In all of the test cases, the signal delay measured in milliseconds, the performance loss compared
to the application throughput when the benchmark was not under operating system control,
measured as a percentage, and the fragmentation calculated from Equation 4, measured as a
percentage, were calculated for each of the final floor-plans. The results from these

experiments are shown in Table 19, Table 20, Table 21, Table 22 and Table 23. In the tables,

n is the number of previously allocated applications and m is the number of partitions the
application is divided into.
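As a point of reference for the tables that follow, the reported performance loss appears to be the fraction of baseline throughput given up, with the 30.016 ms un-managed signal delay (reported in the results below) as baseline. A minimal sketch, with the function name and the formula being our reading of the tables:

```python
BASELINE_MS = 30.016  # signal delay when not under operating system control

def performance_loss(delay_ms: float, baseline_ms: float = BASELINE_MS) -> float:
    """Throughput loss (%) relative to the baseline signal delay.

    Throughput is inversely proportional to signal delay, so the loss is
    the fraction of baseline throughput given up: 1 - baseline/delay.
    """
    return (1.0 - baseline_ms / delay_ms) * 100.0

# Reproduces the test case 2, set 1 entries for n = 100 and n = 200:
print(round(performance_loss(31.475), 2))  # 4.64
print(round(performance_loss(36.529), 2))  # 17.83
```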


                                   Signal delay   Performance   Fragmentation    Remaining area
                                   (ms)           loss (%)      (%)              (CLBs)
Expected worst case (Figure 69)    46.047         33.14         0.1239           4479
Expected best case, not under
OS control                         30.016         0             Not applicable   4719

Table 19: Application throughput for the worst case and when the application was
not under the operating system control

                                                           Signal delay   Performance   Fragmentation
                                                           (ms)           loss (%)      (%)
Best case under operating system control (test case 1)     30.785         2.49          0.0214

Table 20: Application throughput for test case 1

Test case 2, set 1 (typical):

          Signal delay (ms)   Performance loss (%)   Fragmentation (%)
n = 100   31.475              4.64                   0.7213
n = 200   36.529              17.83                  1.2402
n = 300   32.949              8.9                    0.8810
n = 400   34.836              13.84                  0.9448
n = 500   34.305              12.5                   1.9143

Test case 2, set 2 (large):

          Signal delay (ms)   Performance loss (%)   Fragmentation (%)
n = 100   36.341              17.41                  0.6778
n = 200   34.205              12.25                  0.7631
n = 300   27.676              -8.45                  0.5509
n = 400   40.221              25.37                  0.7851
n = 500   35.902              16.39                  1.1310

Table 21: Application throughput for test case 2


Test case 3 (n = 100):

        Signal delay (ms)   Performance loss (%)   Fragmentation (%)
Run 1   32.572              7.85                   1.1286
Run 2   33.102              9.32                   1.1637
Run 3   30.768              2.44                   0.8449
Run 4   32.132              6.59                   1.3728
Run 5   31.056              3.35                   0.7119

Table 22: Application throughput for test case 3

Test case 4, set 1 (typical):

m = number     Signal delay (ms)   Performance loss (%)   Fragmentation (%)
of partitions
m = 1          30.514              1.63                   0.8151
m = 2          30.675              2.15                   0.7619
m = 3          31.578              4.95                   0.8912
m = 4          34.254              12.38                  1.0253
m = 5          35.102              14.49                  1.0477
m = 6          37.501              19.96                  1.6852
m = 7          40.258              25.44                  1.7526
m = 8          43.024              30.32                  2.0140
m = 9          44.023              31.82                  2.1542

Test case 4, set 2 (large):

m = number     Signal delay (ms)   Performance loss (%)   Fragmentation (%)
of partitions
m = 1          30.972              3.08                   0.4753
m = 2          30.578              1.85                   0.4284
m = 3          31.279              4.04                   0.5207
m = 4          33.927              11.51                  0.5732
m = 5          35.087              14.46                  0.7248
m = 6          38.047              21.12                  0.8012
m = 7          38.482              21.99                  0.7218
m = 8          39.973              24.9                   0.8212
m = 9          42.731              29.76                  0.9096

Table 23: Application throughput for test case 4

Shown in Table 19 is the signal delay for the worst case test, and as expected it was 33%

higher than in any other experiments. The use of longer routes for the inter-process

communication significantly affects the amount of throughput the application could generate

in this floor-plan. Also reported in this table is the signal delay when the application was not

executed under the operating system. This value is used as a baseline to compare all of the

other signal delays against. The fragmentation was not calculated in this part of the experiment


because the operating system did not perform the allocation as it was automatically performed

by the commercial place and route tools.

The signal delay recorded in the expected best case test under operating system control, shown
in Table 20, was 2.5% higher than when the application was run without the operating system. This

can be explained because when the application was not used with the operating system, the

place and route algorithms were able to further optimise the application’s performance as

there were no allocation restrictions placed onto it.

Shown in Table 21 is the signal delay recorded when a different number of applications
were already allocated onto the FPGA (test case 2). The signal delay varied from 31.475ms
(n = 100) to 36.529ms (n = 200) for set 1, and from 27.676ms (n = 300) to 40.221ms
(n = 400) in set 2. When n = 300 in both set 1 and 2

there was minimal impact in performance due to the partitioning. This can be explained

because in each test case the communicating partitions were allocated next to each other, thus

minimising signal delay. As it was determined in the user response time experiment that the
system reached a steady state after 100 applications had been completed, the variance in signal
delay was calculated from the results shown in Table 22 and found to be 0.906ms. In test case 4,
shown in Table 23, the signal delay when m (number of partitions) ranged from 1 to 9 varied
from 30.514ms for m = 1 to 44.023ms for m = 9 in set 1, and from 30.972ms for m = 1 to
42.731ms for m = 9 in set 2.

Relationships

Shown in Figure 70 is a graph of the application throughput versus the number of partitions

the application was divided into, for both sets in test case 4. There appears to be a linear

relationship between the signal delay and the number of partitions the application is

partitioned into beyond 3 partitions. That is the rate at which the signal delay increases is

proportional to the number of partitions the application is divided into. The flat part of the

graph between 1 and 3 partitions shows that partitioning the application into 1 to 3 partitions

has little effect on its application throughput. The size and number of the applications

already on the FPGA appears to have little effect on the application throughput as shown in

Figure 71.


[Figure: line graph of signal delay (ms, 25 to 45) versus the number of partitions (1 to 10), for set 1 (typical) and set 2 (large)]

Figure 70: The application throughput versus the number of partitions the application
is divided into for sets 1 (typical) and 2 (large) in test case 4

[Figure: line graph of signal delay (ms, 25 to 50) versus the number of applications on the FPGA (0 to 12), for set 1 (typical) and set 2 (large)]

Figure 71: The application throughput versus the number of applications already
allocated onto the FPGA for sets 1 (typical) and 2 (large) in test cases 2 and 4


7.2.3 Conclusion

In this section, experiments were conducted to measure the impact on the user response time

and application performance due to the introduction of the prototype operating system

ReConfigME. The user response time ranged from 58ms recorded in the expected best case

(test case 1) to 899ms when the benchmark application was partitioned into a maximum of

nine partitions in set 1 of test case 4. The signal delay ranged from 30.016ms in the expected

best case not under the operating system control to 46.047ms in the expected worst case test.

From these experiments it was found that both the user response time and signal delay only

significantly increased when the application was partitioned. The Partitioner itself did not

consume the all of the runtime, the Allocator used most of it as it had to be called for every

partition. The signal delay increased with the number of partitions because of the extra wire

length introduced due to the inter-process communication. It was also found that the number

of applications already allocated onto the FPGA did not significantly affect either the user

response time or signal delay. Overall, it was concluded that the extra user response time and

lower application throughput introduced by the prototype operating system were not excessive

and would be outweighed by the advantages that an operating system provides.


7.3 Predictor metrics

In the previous experiments, the user response time of ReConfigME and the application

throughput were measured. A common factor that appears likely to cause a loss in the

response time and throughput is the fragmentation of the FPGA (the formula proposed for

calculating the fragmentation is repeated from chapter 5 and is shown in Equation 5). It would

be expected that response time and signal delay would increase with the fragmentation. This

is because the more fragmented the FPGA becomes, the greater the chance that the application will

need to be partitioned. Once the application is partitioned, the response time and signal delay

will increase because the Allocator has to be called multiple times and more inter-process

communication routes are used. However the amount of increase is currently unknown. In this

section, an investigation will verify if there is a correlation between response time, signal

delay, and fragmentation, and quantify it.

F = 0,                          if h = 1
F = (1 − 1/h) × (h/A) × 100,    if h > 1

where h is the number of holes and A is the total free area; A has units of the minimum
unit of allocation (CLBs in most cases)

Equation 5: Fragmentation percentage
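As a sketch, the fragmentation measure of Equation 5 can be computed directly; the function name and the pairing with a worked example from Table 24 are ours:

```python
def fragmentation(holes: int, free_area: int) -> float:
    """Fragmentation percentage of an FPGA floor-plan (Equation 5).

    holes     -- number of holes between allocated applications (h)
    free_area -- total free area in units of the minimum unit of
                 allocation, typically CLBs (A)
    """
    if holes <= 1:
        return 0.0  # a single contiguous hole is not fragmented
    return (1.0 - 1.0 / holes) * (holes / free_area) * 100.0

# Example: the first initial floor-plan of test case 3 (Table 24)
# had 23 holes and 3919 free CLBs.
print(round(fragmentation(23, 3919), 4))  # 0.5614
```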

7.3.1 Response time

In the experiments that measured the response time and application throughput (the signal
delay), the fragmentation percentage was also recorded. A graph of the
response time versus the fragmentation percentage for all of these test cases is shown in Figure

72. From this graph, there appears to be a linear like relationship between the response time

and the fragmentation which is highlighted by the two linear regression graphs, one for set 1

and set 2. However, the rate at which the response time increases compared to the

fragmentation differs between the two sets of applications. For the same response time, the

large sized applications in set 2 have a lower fragmentation percentage compared with the

smaller sized applications in set 1.


[Figure: scatter plot of response time (ms, 0 to 1000) versus fragmentation (%, 0 to 2.5) for all test cases, with linear regression lines for test case 4, sets 1 and 2]

Figure 72: The user response time versus the fragmentation
percentage for both sets and all test cases

Before the fragmentation percentage can be used as a predictor for response time, the

fragmentation percentages for the larger sized applications need to be adjusted. To determine

the amount of adjustment, the linear equations for the response time versus the fragmentation

percentage for both sets were calculated from the linear regression graphs shown in Figure 72.

These equations are shown in Equation 6. From these equations it was determined that the

fragmentation percentage for set 1 (typical size) had to be adjusted by multiplying it by a

factor of 1.9. This value has been calculated with a limited amount of experimental data and

further investigation would be required before applying it to a wider set of situations. A graph

of the response time versus the adjusted fragmentation that can be used as a predictor for

response time is shown in Figure 73.

Yt = 470x − 150
Yl = 895x − 250

where
x = fragmentation percentage
Yt = response time (ms) for set 1 (typical sized applications)
Yl = response time (ms) for set 2 (large sized applications)

Equation 6: Linear equations for the response time versus
fragmentation percentage for both sets
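The 1.9 adjustment follows from the ratio of the two regression slopes; a quick check, taking the slopes from Equation 6 as 470 (set 1) and 895 (set 2):

```python
# Regression slopes from Equation 6: response time (ms) per unit
# fragmentation (%) for each application set.
slope_typical = 470.0   # Yt, set 1 (typical sized applications)
slope_large = 895.0     # Yl, set 2 (large sized applications)

# Multiplying set 1 fragmentation values by this factor maps them onto
# the set 2 trend, giving a single combined predictor for response time.
adjustment = slope_large / slope_typical
print(round(adjustment, 1))  # 1.9
```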


[Figure: scatter plot of response time (ms, 0 to 1000) versus adjusted fragmentation for response time (%, 0 to 2.5), all test cases combined, with a single linear regression line]

Figure 73: User response time versus the adjusted fragmentation
for both sets and all test cases

A linear regression data analysis was performed on these adjusted values and the R² value was
calculated at 0.766, meaning that the regression explained 76% of the variation. An
adjusted fragmentation formula that can be used as a predictor for response time was thus
developed and is shown in Equation 7. It is simply the original fragmentation formula shown

in Equation 4 with an adjustment for the mean size of the applications already allocated onto

the FPGA represented as a percentage of the entire FPGA area multiplied by the adjustment

value.

Fr = (1 − 1/h) × (h/A) × 100 × 23.5 × (M/F)

where
Fr = adjusted fragmentation percentage for response time
h = number of holes on the FPGA
A = number of free CLBs on the FPGA
M = mean size of the applications on the FPGA in CLBs
F = total size of the FPGA in CLBs

Equation 7: Adjusted fragmentation percentage for predicting user response time
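A sketch of the adjusted predictor of Equation 7; the function name is ours, and the sample mean application size and total FPGA size in the example are illustrative assumptions, not measured values:

```python
def adjusted_fragmentation(holes: int, free_area: int,
                           mean_app_size: float, fpga_size: float,
                           factor: float = 23.5) -> float:
    """Adjusted fragmentation percentage (Equation 7).

    Scales the basic fragmentation measure by the mean size of the
    resident applications as a fraction of the whole FPGA, times the
    empirically derived adjustment value (23.5 for response time).
    """
    if holes <= 1:
        return 0.0
    base = (1.0 - 1.0 / holes) * (holes / free_area) * 100.0
    return base * factor * (mean_app_size / fpga_size)

# Illustrative only: 23 holes and 3919 free CLBs, with an assumed mean
# resident application of 420 CLBs on a hypothetical 6144-CLB FPGA.
print(round(adjusted_fragmentation(23, 3919, 420, 6144), 3))  # 0.902
```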


7.3.2 Application throughput

When examining the results obtained from the experiment which measured application

throughput, it was noticed that there appeared to be a connection between the fragmentation

percentage and the signal delay. Shown in Figure 74 is a graph of the fragmentation versus the

measured signal delay for each of the floor-plans in both sets for all test cases. Similar to the

response time versus fragmentation, there is a linear-like relationship between the

fragmentation and signal delay highlighted by the linear regression plots. This is an expected

result because as the fragmentation of the FPGA increases, so does the number of vacant

holes on the FPGA. This often leads to a greater chance that the application will

need to be partitioned. Once the application has been partitioned, wires need to be routed

between the processes for inter-process communication. However, the rate at which the signal

delay increases is different between the two sets representing different application sizes. For

the same fragmentation percentage, the larger sized applications in set 2 have a higher signal

delay.

[Figure: scatter plot of signal delay (ms, 25 to 50) versus fragmentation (%, 0 to 2.5) for all test cases, with linear regression lines for test case 4, sets 1 and 2]

Figure 74: A graph of application throughput versus fragmentation

Again, the two linear equations were calculated from the graph and are shown in Equation 8.
From these equations it was determined that the fragmentation percentage for set 1 (typical
size) had to be adjusted by multiplying it by a factor of 2.55. A graph of the signal delay
versus the adjusted fragmentation that can be used as a predictor for signal delay is shown in
Figure 75.

Yt = 9.8x + 24.1
Yl = 24.9x + 19.2

where
x = fragmentation percentage
Yt = signal delay (ms) for set 1 (typical sized applications)
Yl = signal delay (ms) for set 2 (large sized applications)

Equation 8: Linear equations for the signal delay versus
fragmentation percentage for both sets

[Figure: scatter plot of signal delay (ms, 25 to 50) versus adjusted fragmentation for signal delay (%, 0 to 2.5), all test cases combined, with a single linear regression line]

Figure 75: Signal delay versus the adjusted fragmentation
for both sets and all test cases

A linear regression data analysis was performed on these adjusted values and the R² value was
calculated at 0.6749. This resulted in an adjusted fragmentation formula that can be used as a
predictor for signal delay, shown in Equation 9.


Fs = (1 − 1/h) × (h/A) × 100 × 25 × (M/F)

where
Fs = adjusted fragmentation percentage for signal delay
h = number of holes on the FPGA
A = number of free CLBs on the FPGA
M = mean size of the applications on the FPGA in CLBs
F = total size of the FPGA in CLBs

Equation 9: Adjusted fragmentation percentage for predicting signal delay

From the response time and signal delay correlation analysis, adjusted fragmentation formulas

that predict the user response time and signal delays were derived. From this result, it was

shown that the mean size of the applications already allocated on the FPGA affects the

response time under the operating system and the application throughput for a given

fragmentation. The larger the application size the more signal delay and response time will be

experienced for the same fragmentation value. To minimise the response time and signal

delay, it is proposed to separate the FPGA into regions where large and small sized

applications are allocated. This will result in higher application throughputs and
lower user response times for the smaller sized applications; that is, the large sized
applications will not have as much effect on the performance of the smaller sized applications.

7.3.3 Comparison of fragmentation measure

There is only one other fragmentation formula for calculating the amount of fragmentation on

an FPGA (see chapter 5) that appears in the research literature. Walder proposed a formula

shown in Equation 2 which is derived from a histogram of free rectangular areas. He states the

lower the value of the fragmentation, the higher the probability that a future application can

be mapped. However, as shown in the graphs in Figure 76 and Figure 77, there is a very small

correlation between the Walder fragmentation percentage and either response time or signal

delay. It can be concluded that the Walder fragmentation measure cannot be used as a

good predictor for either operating system response time or application throughput.


[Figure: scatter plot of response time (ms, 0 to 1000) versus the Walder fragmentation measure (0.2 to 0.55), for test case 4, sets 1 (typical) and 2 (large)]

Figure 76: Response time versus Walder fragmentation measure

[Figure: scatter plot of signal delay (ms, 29 to 47) versus the Walder fragmentation measure (0.2 to 0.55), for test case 4, sets 1 (typical) and 2 (large)]

Figure 77: Signal delay versus Walder fragmentation measure

7.3.4 Chance of allocation

The fragmentation of the FPGA area in this thesis has been defined as a ratio of the number of

holes left between the previously allocated applications and the amount of remaining area. It


is calculated from the formula shown in Equation 4. Although the percentage of

fragmentation existing on an FPGA at a particular point in time is useful for measuring the

performance of the allocation algorithm, for predicting the possible user response time and

application throughput, it gives no indication to the user of the likelihood that their

application will be successfully allocated. The factors affecting the chance of an application

being allocated successfully are likely to include available area on the FPGA, the number of

holes that have been created by previously allocated applications, and the size of the incoming

application. These three characteristics have been combined into a formula shown in Equation

10 that predicts the chance that an incoming application will be successfully allocated.

S = A / (h × PA) × 100

where
S = percentage chance of successful allocation of the application
A = number of free CLBs available
h = number of holes on the FPGA
PA = number of CLBs of the application being allocated

Equation 10: The percentage chance of allocating a process

This formula considers the size of the application attempting to be allocated PA, the number of

holes that exist on the FPGA h due to previous allocations, and the amount of area still

available A. The size of the incoming application and the amount of available area must both

be measured in the same unit. The larger S is, the more chance the application has of being
allocated; the closer the value gets to zero, the less chance there is of it being allocated. If
the value is greater than 100%, the application will almost always be allocated.
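Equation 10 is straightforward to compute; the sketch below reproduces the S columns of the first row of Table 24 (the function name is ours):

```python
def allocation_chance(free_area: int, holes: int, app_size: int) -> float:
    """Percentage chance of successfully allocating an application
    (Equation 10): S = A / (h * PA) * 100.

    free_area -- free CLBs available on the FPGA (A)
    holes     -- number of holes left by previous allocations (h)
    app_size  -- size of the incoming application in CLBs (PA)
    """
    return free_area / (holes * app_size) * 100.0

# Run 1 of test case 3 (Table 24): 3919 free CLBs and 23 holes.
print(round(allocation_chance(3919, 23, 420), 1))  # 40.6 -- was allocated
print(round(allocation_chance(3919, 23, 930), 1))  # 18.3 -- failed
```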

The results obtained in the experiments described in section 7.2 have been used to verify the

formula for predicting the success of allocating a particular sized application onto an already

occupied FPGA. The process of verifying the formula was conducted as follows. For all of

the initial floor-plans generated in test case 3, and four of the initial floor-plans generated in

test case 2, the amount of free area in CLBs and the number of holes present in each were

calculated. An initial floor-plan has previously been defined to be the resultant arrangement of

applications by ReConfigME processing all of the input applications prior to allocating the

benchmark application. The two test cases were chosen to be used in the verification as the

application was partitioned into the same number of processes in each test case and a different


sized incoming application was successfully allocated in each test case. These results are

shown in Table 24.

Test case 3 (S calculated for the 420 CLB partitions, which succeeded, and the full
930 CLB application, which failed):

        Free area (CLBs)   Holes   S (%), 420 CLBs   S (%), 930 CLBs
                                   (success)         (failed)
Run 1   3919               23      40.6              18.3
        3499               25      33.3              15.0
        3079               28      26.2              11.8
Run 2   3668               22      39.7              17.9
        3248               24      32.2              14.6
        2828               28      24.1              10.9
Run 3   3864               19      48.4              21.9
        3444               20      41                18.5
        3024               21      34.29             15.5
Run 4   3592               23      37.2              16.8
        3172               28      27                12.2
        2752               29      23                10.2
Run 5   4211               19      52.8              23.8
        3791               21      43                19.4
        3371               22      36.4              16.5

Test case 2 (S calculated for the 306 CLB processes, which succeeded, and a
420 CLB process, which failed):

          Free area (CLBs)   Holes   S (%), 306 CLBs   S (%), 420 CLBs
                                     (success)         (failed)
n = 200   3193               22      47.4              34.5
          2887               24      39.3              28.6
n = 300   3380               21      52.6              38.2
          3074               24      41.9              30.5
n = 400   3294               20      53.8              39.2
          2988               23      42.5              30.9
n = 500   3229               29      36.4              26.5
          2923               32      29.9              21.7
          2617               35      24.4              17.8

Table 24: Results from an experiment to verify the allocation success formula

In all of the floor-plans in test case 3, the triple DES benchmark application which consumes

930 CLBs (30 x 31) of area was unable to be allocated by ReConfigME without being

partitioned, i.e. it was not successfully allocated in one process. This resulted in the triple

DES application being partitioned into three single DES processes, each consuming 420

CLBs (20 x 21) of area which were able to be successfully allocated. This is shown in the

floor-plans in Figure 64 as there are three green rectangles in each. Similarly, in the floor-

plans in test case 2, the single DES process was unsuccessfully allocated onto the FPGA. This

resulted in ReConfigME having to further partition the benchmark application into two


smaller processes, each consuming 306 CLBs (17 x 18) of area, which were able to be

successfully allocated onto the FPGA. This is shown in the floor-plans in Figure 62. For each

of the floor-plans, the percentage chance of successfully allocating the incoming application

was calculated for both the allocated and non-allocated process.

From the graph shown in Figure 78, applications with S calculated above 39% were

successfully allocated, whereas S less than 18% resulted in the application not being allocated.

For S between 18 and 39, some applications were allocated and some were not. These results

verify that the formula can be used to predict the likelihood that a particular sized application

will or will not be successfully allocated to an FPGA floor-plan.

[Figure: scatter plot of the percentage chance of success S (0 to 60) for each sample (1 to 16), distinguishing allocated and failed applications in test cases 2 and 3]

Figure 78: A graph of the percentage success and
failed allocation of applications in test cases 2 and 3

This formula can also be used for comparing the performance of FPGA area allocation

algorithms. For example, if an identical series of applications are loaded onto two FPGAs by

different allocation algorithms, the formula can then be used to predict the chance of

allocating the next application. If the percentage is lower on one FPGA than on the other, it can be
concluded that the allocation algorithm that produced the higher percentage has performed better.


7.4 Conclusion

This chapter resulted in four major deliverables: an experimental test environment, a

benchmark application, user response time and application throughput performance results,

and correlation factors between fragmentation, throughput, and user response time, including

a formula for predicting the successful allocation of an incoming application. The

experimental test environment consisted of a series of initial floor-plans and a benchmark

application which were used to generate the user response time and application throughput for

a variety of situations. The results in these experiments were then correlated through the use

of a linear regression data analysis. A formula for predicting the user response time and signal

delay based upon the amount of fragmentation the floor-plan has been derived. It was

discovered that as well as the fragmentation measured by Equation 4, the mean size of the

applications already allocated onto the FPGA affect both the user response time and signal

delay of the next application to be allocated. It was concluded that to minimise the affect of

different sized applications, the FPGA should be segmented into regions where large and

small applications are allocated separately. From the results obtained in the experiments, a

formula for predicting the successful allocation of an application was developed and verified.

It was determined that a variation of fragmentation reported in Equation 4 predicts the chance

of an application being allocated.

8 Conclusion and Future Work

The focus in this thesis was to design, build, and evaluate the performance of a resource

allocating operating system for a reconfigurable computer. This was achieved by firstly

describing a set of reconfigurable computing abstractions, defining a reconfigurable

computing operating system architecture that suits these abstractions, and outlining the

algorithm specifications that are needed in the components of the architecture. The algorithms

for resource allocation and application partitioning in the operating system were then selected

by ranking previously published algorithms according to their runtime complexity. The

algorithms with the least runtime complexity were then modified to suit the operating system

environment and implemented so experiments could be performed to measure performance.

The best performing resource allocation and partitioning algorithms were then selected to be

part of the prototype operating system, ReConfigME. This was implemented using a

commercially available reconfigurable computing platform and consisted of a three-tiered

network architecture. Users can load their applications onto the user tier, have them processed

by the Colonel on the operating system tier, and then configured onto an FPGA located in a

computer in the platform tier. A series of experiments were performed on the operating

system to measure its effect on user response time and application throughput. From these

experimental results, a predictor for user response time and signal delay based upon the fragmentation of the FPGA was derived. Finally, a formula for estimating the percentage chance of successfully allocating an incoming application was also derived. In this chapter, a summary of the contributions made in this thesis is outlined, followed by suggestions for future work in the field.


8.1 Research contributions

In chapter 2 it was shown that there are significant gaps in the literature regarding the runtime

management of reconfigurable computing applications. In this thesis, contributions were

made in the following areas.

1. There is no agreed list of abstractions that should be used in an operating system for

reconfigurable computing (see section 4.1).

Before any operating system architecture or prototype could be built, a set of abstractions for

a reconfigurable computing operating system had to be defined. By analogy with a software operating system, process, address space, and inter-process communication abstractions for reconfigurable computing were defined. The process abstraction consists of an application in execution, structured according to a data flow graph model with data source and sink nodes inserted for simplified I/O access. All I/O is then streamed into or out of the application via these nodes. A variant of the process abstraction was defined

that had direct access to the FPGA pins if an application had hard performance limits. All

connections to these pins are handled by the designer and not the operating system. The

address space abstraction consists of a two dimensional address space that represents the

FPGA at CLB granularity, and a one dimensional address space that represents the on-board

memory. This abstraction prevents processes from accessing resources that are not allocated

to them. The inter-process communication abstraction consists of creating and passing

messages between processes, similar to what happens in a software operating system. All

inter-process communication is conducted via memory, and for performance reasons all processes and the memory are connected via a shared multiple-bus topology.
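The three abstractions can be summarised in code. The sketch below is illustrative only (it is not the ReConfigME source; all names are assumptions): a process as a data flow graph with explicit source/sink nodes, and a two-dimensional CLB address space paired with a one-dimensional memory.

```python
# Illustrative sketch of the thesis abstractions, not an implementation.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    kind: str  # "source", "sink", or "compute"

@dataclass
class Process:
    """An application in execution, structured as a data flow graph;
    I/O is streamed through the source and sink nodes."""
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)  # (producer, consumer) pairs

@dataclass
class AddressSpace:
    """2D address space at CLB granularity plus 1D on-board memory."""
    clb_rows: int
    clb_cols: int
    memory_words: int

    def owns(self, allocated: set, row: int, col: int) -> bool:
        # A process may only touch CLBs that have been allocated to it.
        return (row, col) in allocated
```

The `owns` check captures the protection property stated above: a process cannot access resources outside the region allocated to it.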

2. Current design flows have little support for dynamic reconfiguration with resource

allocation (see section 4.4).

A modified application design flow, which structures the application according to a data flow graph model, was developed for use with the proposed operating system architecture. This structure allows the operating system to use an application partitioning algorithm to divide the application into smaller parts. Once the application has been defined, each node of the data flow graph is then synthesized, technology mapped, and placed and routed using the commercially available design tools. However, each of the nodes must still be relocatable, as their position on the FPGA is not finalised until runtime.

3. Algorithms for runtime resource allocation and runtime application partitioning have

not been deeply investigated in the reconfigurable computing domain (see chapter 5).

Before any algorithm for resource allocation or application partitioning could be selected,

their functionality had to be defined. For online FPGA area allocation the functional

specifications were defined as:

• Determine the size and position of a vacant segment of area into which an application can fit without interfering with already allocated applications

• If there is not enough contiguous vacant area, find the largest segment of contiguous area

• If there is not enough total area, block the application and place it in a ready queue

• Minimise the amount of area wasted due to poor allocation choices
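A minimal sketch of the first requirement is shown below, using a naive occupancy-grid scan rather than the Minkowski Sum algorithm eventually selected; the grid representation and function name are assumptions for illustration. When no region is found, the caller would fall back to the remaining requirements (largest contiguous segment, or blocking into a ready queue).

```python
# Hedged sketch of the first allocation requirement on a boolean occupancy
# grid (True = CLB occupied). Not the thesis algorithm: a naive scan, O(n^2)
# per candidate position rather than the linear-time method selected later.

def find_vacant(grid, w, h):
    """Return (row, col) of a vacant w x h region, or None if no such
    contiguous region exists."""
    rows, cols = len(grid), len(grid[0])
    for r in range(rows - h + 1):
        for c in range(cols - w + 1):
            if all(not grid[r + dr][c + dc]
                   for dr in range(h) for dc in range(w)):
                return (r, c)
    return None

grid = [[False] * 8 for _ in range(8)]
grid[0][0] = True                 # one occupied CLB fragments the corner
print(find_vacant(grid, 4, 4))    # -> (0, 1)
```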

For an online application partitioning algorithm the functional specifications were defined as:

• Repetitively divide an application, structured as a data flow graph, into partitions of varying specified sizes until the entire application has been partitioned

• Avoid partitioning feedback loops, to minimise the impact on performance
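The first requirement can be sketched as a greedy packing pass over the nodes in data flow order. The sketch below is illustrative (names and the linear node-list simplification are assumptions); feedback-loop avoidance, the second requirement, is deliberately omitted.

```python
# Hedged sketch of the first partitioning requirement: repeatedly cut an
# application into partitions no larger than a given capacity. Nodes are
# given as a list of sizes, standing in for a data flow graph in data
# flow order; cycle handling is not modelled here.

def temporal_partition(node_sizes, capacity):
    """Greedily pack nodes into partitions whose total size does not
    exceed capacity. Returns a list of lists of node indices."""
    parts, current, used = [], [], 0
    for i, size in enumerate(node_sizes):
        if used + size > capacity and current:
            parts.append(current)     # close the full partition
            current, used = [], 0
        current.append(i)
        used += size
    if current:
        parts.append(current)
    return parts

print(temporal_partition([3, 2, 4, 1, 2], capacity=5))  # -> [[0, 1], [2, 3], [4]]
```

An oversized node (larger than the capacity) still receives its own partition rather than being dropped, which mirrors the requirement that the entire application must eventually be partitioned.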

It was then determined that, as both algorithms will be executed at runtime, their runtime complexity had to be linear with respect to the number of CLBs for allocation, or the number of nodes in the graph for partitioning. A survey of both the reconfigurable and non-reconfigurable computing literature was then made, which resulted in a list of the

promising allocation and partitioning algorithms. These algorithms, along with a greedy-based allocation algorithm, were then implemented so their performance could be measured in an experiment that recorded their execution runtime and fragmentation. From this experiment it was concluded that the Minkowski Sum allocation algorithm with the bottom-left heuristic, and the temporal partitioning algorithm, were suitable for use in an operating system for reconfigurable computing.


4. There is no prototype runtime system for reconfigurable computing that demonstrates

runtime area resource allocation and partitioning (see section 4.2.2 and section 6.1).

Before a prototype reconfigurable computing operating system was constructed, the

architecture of one was proposed. This consists of six major components including an

interface, Partitioner, Allocator, Loader, hardware abstraction layer and primitive on-chip

architecture. A prototype called ReConfigME was then constructed consisting of all of the

components described in the architecture, separated into three tiers, all connected and

communicating together via standard TCP/IP message passing. The Allocator and Partitioner

were combined into the Colonel component of ReConfigME which is the core of the

operating system. Users connect to ReConfigME via a standard command line interface, where they upload their application and, once it is executing, stream the I/O data to it. An

application architecture was also defined which consisted of a data flow graph structure and

data source and sink nodes for I/O.
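The user-tier to operating-system-tier interaction can be sketched as a plain TCP upload. The hostname, port, and length-prefixed message format below are invented for the example; ReConfigME's actual protocol is not specified here.

```python
# Illustrative sketch only: a user-tier client sending an application to the
# operating-system tier over TCP/IP. The framing (4-byte big-endian length
# prefix) is an assumption, not ReConfigME's wire format.
import socket

def upload_application(host: str, port: int, payload: bytes) -> None:
    """Send an application image to the operating system tier."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(len(payload).to_bytes(4, "big"))  # length prefix
        sock.sendall(payload)                          # application data

# e.g. upload_application("reconfigme.example", 9000, dfg_bytes)
```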

5. There has been little discussion of metrics that might be used to evaluate the

performance of an operating system for reconfigurable computing (see section 6.2).

To evaluate the performance of the prototype operating system and applications executing

under its control, a set of metrics including the user response time and application throughput

were selected. The user response time consists of measuring the execution runtime of the

operating system when processing applications under various conditions. The application

throughput consists of calculating the signal delay of the application once it has been

allocated and partitioned. Partitioning the application and not co-locating the processes can potentially increase the total wire length, which is the primary reason for a decrease in application throughput.
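The wire-length argument can be made concrete with a crude proxy: Manhattan distance between the placements of communicating partitions, with a linear delay model. Both the distance metric and the per-CLB penalty below are assumptions for the sketch, not the thesis delay model.

```python
# Illustrative only: estimating how non-co-located partitions lengthen wires
# and hence signal delay. The linear delay model and ns_per_clb constant are
# assumptions for this sketch.

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def estimated_delay(base_delay, placements, edges, ns_per_clb=0.1):
    """base_delay plus a per-CLB penalty for each inter-partition wire.
    placements maps partition id -> (row, col); edges lists communicating
    partition pairs."""
    wire = sum(manhattan(placements[a], placements[b]) for a, b in edges)
    return base_delay + ns_per_clb * wire

# Distant partitions pay a wire-length penalty; co-located ones do not.
print(estimated_delay(10.0, {0: (0, 0), 1: (6, 8)}, [(0, 1)]))  # -> 11.4
```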

6. There have been few evaluations of the effect an operating system environment will

have on reconfigurable computing application performance (see chapter 7).

Experiments were conducted to firstly measure the effect the operating system has on user

response time, and secondly the effect it has on application throughput. It was determined that

the Allocator was the main contributor to the execution runtime of ReConfigME. The

Allocator consumed far more time when the application was partitioned into multiple


processes, as each process required the Allocator to be called at least once. The throughput of an application executing under ReConfigME was only significantly affected when it was divided into multiple partitions and the partitions were allocated in locations that resulted in longer wire lengths. From the experimental results, good correlations between response time and fragmentation, and between application throughput and fragmentation, were determined. By performing a linear regression analysis on the data set, a formula for predicting the user response time and signal delay based upon the fragmentation was derived. It was determined that the user response time and signal delay of smaller applications were lower than those of larger applications for the same amount of fragmentation. From this result, it was concluded that the FPGA should be segmented into regions where similar sized applications are allocated. This will minimise the impact larger applications have on the performance of smaller ones. Finally, a formula for predicting the chance of allocating an application of a specified size onto an FPGA with a particular amount of fragmentation was proposed. This formula was verified against the experimental data.
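The regression step described above can be sketched as ordinary least squares on (fragmentation, response time) pairs. The data points below are invented for illustration; the thesis used its own measurements, and the resulting coefficients are not those of the thesis formula.

```python
# Hedged sketch of the analysis step: fit response time against
# fragmentation by ordinary least squares, y = a*x + b.

def linear_fit(xs, ys):
    """Least-squares slope a and intercept b for y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

frag = [0.1, 0.2, 0.3, 0.4]   # fragmentation values (invented)
resp = [1.2, 1.9, 3.1, 3.8]   # measured response times, seconds (invented)
a, b = linear_fit(frag, resp)
predict = lambda f: a * f + b  # predictor for a new floor-plan
```

Once fitted, `predict` gives the expected response time for a floor-plan with a given fragmentation, which is exactly how the derived formula is used in the chapter.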

8.1.1 Summary of major contributions

Listed below is a summary of the major research contributions that were made in this thesis.

• Abstractions for a reconfigurable computing operating system.

• An architecture of an operating system for reconfigurable computing.

• The creation of allocation and partitioning algorithms with ‘low’ execution runtime.

• The implementation of a prototype including multiple applications.

• A performance evaluation of the prototype.

• Two new fragmentation metrics which can predict signal delay and application

throughput.

• A recommendation to segment the FPGA into regions where different sized applications are allocated separately, reducing fragmentation.

• The abstraction of the commercial place and route tools away from the traditional user's experience.

• A proposal that, for an FPGA to take full advantage of an operating system, its architecture should have a separate routing layer and support true dynamic reconfiguration.


8.2 Suggestions for future work

The research results reported in this thesis give some quite definite directions for future research in operating systems for reconfigurable computers. Most importantly, there need to be significant modifications to the design tools before the full benefits of an operating system can be realised. These include the ability to relocate pre-placed and pre-routed cores, and a runtime router with minimal execution time that can route inter-process communication wires after an application has been loaded into the operating system by the user. A global routing architecture, separate from the routing hierarchy used in the pre-place and pre-route phase, could improve the execution time of a runtime router and would potentially result in easier application allocations, as the Allocator would not have to allocate around inter-process communication wires. The development of an FPGA architecture that can perform dynamic partial reconfiguration without the column-based restriction would also improve performance, as applications would not need to be check-pointed and stopped between reconfigurations.

Other future works could include a dynamic bus network that can be routed at runtime to

accommodate multiple reconfigurable computing applications. For example, once the location

of the application had been determined, the bus would dynamically extend to the location of

the new application, connect to it and then arbitrate the communications between the memory

and other applications. Finally, to improve the performance of inter-process communication, a mechanism for routing channels directly between communicating applications, and between applications and the I/O pins of the FPGA, could be developed to minimise any bottlenecks associated with the platform memory. These types of communications could be used with performance-oriented applications or applications with significant amounts of I/O.


9 References

[1] Advanced_Semiconductor_Technology, "AST Des Core," Israel 2003. [2] Alpert, C. and Kahng, A., "Recent Directions in Netlist Partitioning: A Survey,"

Integration, the VLSI Journal, vol. 19, pp. 1-81, 1995. [3] Altera, "Excalibur Device Datasheet," 2004. [4] Altera, "Flex 10K FPGA Datasheet," 2003. [5] Altera, "Nios Development Tools Documentation Suite," 2003. [6] Amjad, U., Application (Re)Engineering. Upper Saddle River, New Jersey: Prentice

Hall, 1997. [7] Ashenden, P., The VHDL Cookbook, 1st ed. Adelaide: University of South Australia,

1990. [8] Athanas, P. and Abbott, A., "Image Processing on a Custom Computing Platform,"

presented at Fourth International Workshop on Field-Programmable Logic and Applications (FPL '94), Prague, Czech Republic, 1994.

[9] Athanas, P. and Silverman, H., "Processor Reconfiguration through Instruction-Set Metamorphosis," IEEE Computer, vol. 26, pp. 11-18, 1993.

[10] Atmel, "Edge Detection in AT6000s FPGAs," 1997. [11] Babb, J., Frank, M., Lee, V., Waingold, E., Barua, R., Taylor, M., Jang, K.,

Devabhaktuni, S., and Agarwal, A., "The RAW Benchmark Suite: Computation Structures for General Purpose Computing," presented at IEEE Symposium on FPGAs Custom Computing Machines (FCCM 97), Napa Valley, CA, USA, 1997.

[12] Babb, J., Tessier, R., and Agarwal, A., "Virtual Wires: Overcoming pin limitations in FPGA-based logic emulators," presented at IEEE Symposium on FPGAs Custom Computing Machines (FCCM'93), Napa Valley, CA, USA, 1993.

[13] Baker, B. S., Coffman, E. G., and Rivest, R. L., "Orthogonal packings in two dimensions," SIAM J. Comput., vol. 9, pp. 846-855, 1980.

[14] Barat, F. and Lauwereins, R., "Reconfigurable Instruction Set Processors: A Survey," presented at IEEE International Workshop on Rapid System Prototyping (RSP 2000), Paris, France, 2000.

[15] Baumgarte, V., May, F., Nuckel, A., Vorbach, M., and Weinhardt, M., "PACT XPP - A Self Reconfigurable Data Processing," presented at Engineering of Reconfigurable Systems and Algorithms (ERSA 01), Las Vegas, NV, USA, 2001.

[16] Bazargan, K., Kastner, R., Ogrenci, S., and Sarrafzadeh, M., "A C to Hardware/Software Compiler," presented at IEEE Symposium on FPGAs for Custom Computing Machines (FCCM'00), Napa Valley, CA, USA, 2000.

[17] Bazargan, K. and Sarrafzadeh, M., "Fast Online Placement for Reconfigurable Computing Systems," presented at IEEE Symposium on FPGAs for Custom Computing Machines (FCCM'99), Napa Valley, CA, USA, 1999.

[18] Betrin, P. and Touati, H., "PAM Programming Environments: Practice and Experience," presented at IEEE Symposium on FPGAs for Custom Computing Machines (FCCM'94), Napa Valley, CA, USA, 1994.

[19] Bhasker, J., A SystemC Primer: Star Galaxy Publishing, 2002. [20] Bioler, "Bioler 3 Reconfigurable Computing Platform Datasheet," 2004. [21] Bogrow, H., "Field Programmable Gate Arrays: Off-the-shelf QML Components for

Rapid Technology Insertion," presented at Military and Aerospace Applications of Programmable Devices and Technologies Conference (MAPLD), Greenbelt, MD, USA, 1998.

[22] Borriello, G., Ebeling, C., Hauck, S., and Burns, S., "The Triptych FPGA architecture," IEEE Transactions on VLSI Systems, vol. 3, pp. 491-500, 1995.


[23] Brebner, G., "Automatic Identification of Swappable Logic Units in XC6200 Circuitry," presented at 7th International Workshop on Field-Programmable Logic and Applications (FPL'97), London, UK, 1997.

[24] Brebner, G., "A Virtual Hardware Operating System for the Xilinx XC6200," presented at 6th International Workshop on Field-Programmable Logic and Applications (FPL'96), Darmstadt, Germany, 1996.

[25] Brebner, G. and Donlin, A., "Runtime Reconfigurable Routing," presented at 9th Symposium on Parallel and Distributed Processing (IPPS/SPDP'98), Orlando, FL, USA, 1998.

[26] Brown, S., Francis, R., Rose, J., and Vranesic, Z., Field Programmable Gate Arrays. Boston, USA: Kluwer and Acad. Publishers, 1992.

[27] Burns, J., Donlin, A., Hogg, J., Singh, S., and Wit, M., "A Dynamic Reconfiguration Run-Time System," presented at IEEE Symposium on FPGAs for Custom Computing Machines (FCCM'97), Napa Valley, CA, USA, 1997.

[28] Caspi, E., Chu, M., Huang, R., Yeh, J., Markovskiy, Y., DeHon, A., and Wawrzynek, J., "Stream Computations Organized for Reconfigurable Execution (SCORE): Introduction and Tutorial," presented at 10th International Workshop on Field-Programmable Logic and Applications (FPL'00), Austria, 2000.

[29] Casselman, S., "Virtual Computing and the Virtual Computer.," presented at IEEE Workshop on FPGAs for Custom Computing Machines (FCCM'93), Napa Valley, CA, 1993.

[30] Celoxica, "Handel-C Reference Manual," 2003. [31] Celoxica, "RC1000-PP Hardware Reference Manual," 2000. [32] Celoxica, "RC2000 Hardware Reference Manual," 2003. [33] Chameleon_Systems, "CS2000 Advance Product Specification," 2000. [34] Chan, P. and Schlag, M., "Architectural Tradeoffs in Field-Programmable Device

based Computing Systems," presented at IEEE Workshop on FPGAs for Custom Computing Machines (FCCM'93), Napa Valley, CA, USA, 1993.

[35] Chazelle, B., "The Bottom-Left Bin-Packing Heuristic: An Efficient Implementation," IEEE Transactions on Computers, vol. 32, pp. 697-707, 1983.

[36] Chen, X., Feng, W., Zhao, J., Meyer, F., and Lombardi, F., "Reconfiguring One-Time Programmable FPGAs," in IEEE Micro, vol. 19, 1999, pp. 53-63.

[37] Chiricescu, S., Leeser, M., and Vai, M., "Design and Analysis of a Dynamically Reconfigurable Three-Dimensional FPGA," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 9, 2001.

[38] Chow, P., Seo, S., Rose, R., Chung, K., Paez-Monzon, G., and Rahardja, I., "The Design of a SRAM-Based Field-Programmable Gate Array - Part I: Architecture," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 7, 1999.

[39] Coffman, E. G., Csirk, J., and Woeginger, G., "Approximate Solutions to Bin Packing Problems," in Handbook of Applied Optimization: Cambridge University Press, 1999, pp. 607-615.

[40] Compton, K. and Hauck, S., "Reconfigurable Computing: A Survey of Systems and Software," ACM Computing Surveys (CSUR), vol. 34, pp. 171-210, 2002.

[41] Cong, J. and Xu, S., "Technology Mapping for FPGAs with Embedded Memory Blocks," presented at ACM International Symposium on Field-Programmable Gate Arrays (FPGA'98), Monterey, CA, USA, 1998.

[42] Crnkovic, I., "Software Engineering and Science," vol. 2003: Malardalen University, 2001.


[43] Danadlis, A., Prasanna, V., and Rolim, J., "An Adaptive Cryptographic Engine for IPSec Architectures," presented at IEEE Symposium on FPGAs Custom Computing Machines (FCCM'00), Napa Valley, CA, USA, 2000.

[44] Davis, A. and Keller, R., "Data Flow Program Graphs," IEEE Computer, vol. 15, pp. 26-41, 1982.

[45] Davis, D., Barr, M., Bennett, T., Edwards, S., Harris, J., Miller, I., and Schanck, C., "A Java Development and Runtime Environment for Reconfigurable Computing," presented at 9th Symposium on Parallel and Distributed Processing (IPPS/SPDP'98), Orlando, FL, USA, 1998.

[46] De Berg, M., Van Kreveld, M., Overmars, M., and Cheong, O., Computational Geometry: Algorithms and Applications: Springer-Verlag, 2000.

[47] DeHon, A., "DPGA-Coupled Microprocessors: Commodity ICs for the 21st Century," presented at IEEE Symposium on FPGAs for Custom Computing Machines (FCCM'94), Napa Valley, CA, USA, 1994.

[48] Dick, C., Harris, F., and Rice, M., "Synchronization in Software Radios - Carrier and Timing Recovery using FPGAs," presented at IEEE Symposium on FPGAs for Custom Computing Machines (FCCM00), Napa Valley, CA, USA, 2000.

[49] Diessel, O., Kearney, D., and Wigley, G., "A Web-based Multi-user Operating System for Reconfigurable Computing," presented at 10th Symposium on Parallel and Distributed Processing (IPPS/SPDP'99), San Juan, Puerto Rico, 1999.

[50] Dyer, M., Plessl, C., and Platzner, M., "Partially Reconfigurable Cores for Xilinx Virtex," presented at 12th International Workshop on Field-Programmable Logic and Applications (FPL'02), Montpellier, France, 2002.

[51] Eatmon, D. and Gloster, C., "Evaluating Placement Algorithms for Run-time Reconfigurable Systems," presented at 2nd Annual Military and Aerospace Programmable Logic Device, Maryland, MD, USA, 1999.

[52] Ebeling, C., Cronquist, D., and Franklin, P., "RaPiD - Reconfigurable pipelined datapath," presented at International Workshop on Field-Programmable Logic and Applications (FPL'96), Berlin, Germany, 1996.

[53] Elixent, "D-Fabric Reconfigurable Algorithm Processing Datasheet," 2004. [54] Emmert, J. and Bhatia, D., "A Methodology for Fast FPGA Floorplanning," presented

at 7th International Symposium on Field-Programmable Gate Arrays (FPGA'99), Monterey, CA, USA, 1999.

[55] Estrin, G., "Reconfigurable Computer Origins: The UCLA Fixed-Plus-Variable (F+V) Structure Computer," IEEE Annals of the History of Computing, vol. 24, 2002.

[56] Federal_Information_Processing_Standards_Publication, "Data Encryption Standard (DES)," 46-3, October 1999.

[57] Fiduccia, C. and Mattheyses, R., "A Linear Time Heuristic for Improving Network Partitions," presented at ACM/IEEE Design Automation Conference (DAC'82), 1982.

[58] Flato, E. and Halperin, D., "Exact Minkowski Sums and Applications," presented at Annual Symposium on Computational Geometry, Barcelona, Spain, 2002.

[59] Free-IP, "Free-DES Implementation Notes," 2000. [60] George, V. and Rabaey, J., Low-Energy FPGAs: Architecture and Design: Kluwer

Academic Publishers, 2001. [61] Gokhale, M., Holmes, W., Kosper, A., Kunze, D., Lopresti, D., Lucas, S., Minnich,

R., and Olsen, P., "SPLASH: A Reconfigurable Linear Logic Array," presented at International Conference on Application-Specific Array Processing, 1990.

[62] Guccione, S., Levi, D., and Sundararajan, P., "JBits: A Java-based Interface for Reconfigurable Computing," presented at 2nd Annual Military and Aerospace Applications of Programmable Devices and Technologies Conference (MAPLD).


[63] Gunther, B., "SPACE 2 as a Reconfigurable Stream Processor," presented at 4th Australasian Conference on Parallel and Real-time Systems (PART'97), Singapore, 1997.

[64] Hadley, J. and Hutchings, B., "Design Methodologies for Partially Reconfigured Systems," presented at IEEE Symposium on FPGAs for Custom Computing Machines (FCCM'95), Napa Valley, CA, USA, 1995.

[65] Hauck, S., Borriello, G., Burns, S., and Ebeling, C., "MONTAGE: An FPGA for synchronous and asynchronous circuits," presented at International Workshop Field-Programmable Logic and Applications (FPL'92), Vienna, Austria, 1992.

[66] Hauser, J. and Wawrzynek, J., "Garp: A MIPS Processor with a Reconfigurable Coprocessor," presented at IEEE Symposium on FPGAs for Custom Computing Machines (FCCM'97), Napa Valley, CA, USA, 1997.

[67] Hess, J., Lee, D., Harper, S., Jones, M., and Athanas, P., "Implementation and Evaluation of a Prototype Reconfigurable Router," presented at IEEE Symposium on FPGAs for Custom Computing Machines (FCCM'99), Napa Valley, CA, USA, 1999.

[68] Hopf, J., Itzstein, G., and Kearney, D., "Specification of Concurrent Reconfigurable Hardware using Hardware Join Java," presented at IEEE International Conference on Field-Programmable Technology (FPT), Hong Kong SAR, China, 2002.

[69] Huang, A., "Processor-In-Memory System Simulator," Massachusetts Institute of Technology, 1998.

[70] Huang, W., Saxena, N., and McCluskey, E., "A Reliable LZ Data Compressor on Reconfigurable Coprocessors," presented at IEEE Symposium on FPGAs for Custom Computing Machines (FCCM'00), Napa Valley, CA, USA, 2000.

[71] Hutchings, B. and Wirthlin, M., "Implementation Approaches for Reconfigurable Logic Applications," presented at International Workshop in Field-Programmable Logic and Applications (FPL'95), Oxford, England, 1995.

[72] Jacobs, S. and Bekker, J., "Automatic target recognition systems using high-resolution radar,," presented at 3rd Workshop on Conventional Weapon ATR, 1996.

[73] Jean, J., Tomko, K., Yavagal, V., Shah, J., and Cook, R., "Dynamic Reconfiguration to Support Concurrent Applications," IEEE Transactions on Computers, vol. 48, pp. 591-602, 1999.

[74] Jerraya, A., Park, I., and O'Brien, K., "Amical : an interactive High Level Synthesis Environment," presented at European Conference on Design Automation, Paris, France, 1993.

[75] Johnson, D., Aragon, C., McGeoch, L., and Schevon, C., "Optimisation by Simulated Annealing: An Experimental Evaluation Part i, Graph Partitioning," Operations Research, vol. 37, pp. 865-892, 1989.

[76] Kafura, D., Object-Oriented Software Design and Construction with C++: Prentice-Hall, 1998.

[77] Kaul, M. and Vemuri, R., "Optimal Temporal Partitioning and Synthesis for Reconfigurable Architectures," presented at Design Automation and Test, Paris, France, 1998.

[78] Kaul, M., Vemuri, R., Govindarajan, S., and Ousiss, I., "An Automated Temporal Partitioning and Loop Fission Approach for FPGA Based Reconfigurable Synthesis of DSP Applications," presented at 36th Annual Conference on Design Automation Conference (DAC'99), New Orleans, LA, USA, 1999.

[79] Kean, T., "Configurable Logic: A Dynamically Programmable Cellular Architecture and its VLSI implmentation," in Dept. Computer Science: University of Edinburgh, 1988.


[80] Kean, T. and Duncan, A., "DES Key Breaking, Encryption and Decryption on the XC6212," presented at IEEE Symposium on FPGAs for Custom Computing Machines (FCCM'98), Napa Valley, CA, USA, 1998.

[81] Kearney, D. and Veldman, G., "Evaluation of Network Topologies for a Runtime Re-routable Network on a Programmable Chip," presented at IEEE International Conference on Field-Programmable Technology (FPT), Tokyo, Japan, 2003.

[82] Keller, E., "JRoute: A Run-time Routing API for FPGA Hardware," Parallel and Distributed Processing, pp. 874-881, 2000.

[83] Kernighan, B. and Lin, S., "An Efficient Heuristic Procedure for Partitioning Graphs," Bell System Technical Journal, vol. 49, pp. 291-307, 1970.

[84] Kirkpatrick, S., Gelatt, C., and Vecchi, M., "Optimisation by Simulated Annealing," Science 220, pp. 671-680, 1983.

[85] Kress, R., "A Fast Reconfigurable ALUs for Xputers," Univ. Kaiserslautern, 1996. [86] Kress, R. and Hartenstein, U., "An Operating System for Custom Computing

Machines based on the Xputer Paradigm," presented at International Workshop on Field-Programmable Logic and Applications (FPL'97), London, UK, 1997.

[87] Kumar, S., Pires, L., Ponnuswamy, S., and Spaanenburg, H., "A Benchmark Suite for Evaluating Configurable Computing Systems - Status, Reflections, and Future Directions," presented at International Symposium on Field-Programmable Gate Arrays (FPGA'00), Monterey, CA, USA, 2000.

[88] Kuznar, R., Brglez, F., and Kozminski, K., "Cost Minimization of Partitions into Multiple Devices," presented at 30th ACM/IEEE Design Automation Conference (DAC'93), Dallas, TX, USA, 1993.

[89] Le, M., Burghardt, F., and Rabaey, J., "Software Architecture of the Infopad System," presented at Mobidata Workshop on Mobile and Wireless Information Systems, New Brunswick, NJ, USA, 1994.

[90] Lee, C. and Lee, D., "A simple on-line bin packing algorithm," Journal of ACM, vol. 32, pp. 562-572, 1985.

[91] Leong, P., Cheung, O., Tsoi, K., and Leong, P., "A Bit-Serial Implementation of the International Data Encryption Algorithm IDEA," presented at IEEE Symposium on FPGAs for Custom Computing Machines (FCCM'00), Napa Valley, CA, USA, 2000.

[92] Leong, P., Cheung, O., Tung, T., Kwok, C., Wong, M., and Lee, K., "Pilchard - A Reconfigurable Computing Platform with Memory Slot Interface," presented at IEEE Symposium on FPGAs for Custom Computing Machines (FCCM'01), Napa Valley, CA, USA, 2001.

[93] Loo, S., Wells, B., and Kulick, J., "Handel-C for Rapid Prototyping of VLSI Coprocessors for Real Time Systems," presented at Southeastern Symposium on System Theory, Huntsville, Alabama, USA, 2002.

[94] Madsen, J., "Hardware Synthesis - An Introduction." Denmark: Technical University of Denmark, 2002.

[95] Mangione-Smith, W., "Seeking Solutions in Configurable Computing," IEEE Computer, vol. 30, pp. 38-43, 1997.

[96] McKusick, M., Joy, W., Leffler, S., and Fabry, R., "A Fast File System for UNIX," University of California, Berkeley, Berkeley, CA 18th February 1984.

[97] McLoone, M. and McCanny, J., "Single-Chip FPGA Implementation of the Advanced Encryption Standard Algorithm," presented at International Workshop on Field-Programmable Logic and Applications (FPL'01), Belfast, UK, 2001.

[98] Mehta, M. and DeWitt, D., "Dynamic Memory Allocation for Multiple Query Workloads," presented at 19th International Conference on Very Large Databases, Dublin, Ireland, 1993.


[99] Mencer, O., Morf, M., and Flynn, M., "PAM-Blox: High Performance FPGA Design for Adaptive Computing," presented at IEEE Symposium on FPGAs for Custom Computing Machines (FCCM'98), Napa Valley, CA, USA, 1998.

[100] Mignolet, J., Vernalde, S., Verkest, D., and Lauwereins, R., "Enabling hardware-software multitasking on a reconfigurable computing platform for networked portable multimedia applications," presented at Engineering of Reconfigurable Systems and Algorithms (ERSA'02), Las Vegas, NV, USA, 2002.

[101] Mikroelektronik, "Programmable Logic Devices - Design Entry," Bolton Institute 2001.

[102] Milenkovic, V., Daniels, K., and Li, Z., "Placement and Compaction of Nonconvex Polygons for Clothing Manufacture," presented at 4th Canadian Conference on Computational Geometry, Newfoundland, Canada, 1992.

[103] Miyamori, T. and Olokotun, K., "REMARC: Reconfigurable Multimedia Array Coprocessor," presented at International Symposium on Field Programmable Gate Arrays (FPGA'98), Monterey, CA, USA, 1998.

[104] Moseley, R., "Reconnetics: A System for the Dynamic Implementation of Mobile Hardware Processes in FPGAs," Communicating Process Architectures 2002, pp. 177-190, 2002.

[105] Nallatech, "BallyVision Reconfigurable Computing Datasheet," 2002.

[106] Nallatech, "Field Upgradeable Systems Environment (FUSE)," 2002.

[107] Parsons, E. and Sevcik, K., "Coordinated Allocation of Memory and Processors in Multiprocessors," presented at Conference on Measurement and Modelling of Computer Systems, Philadelphia, PA, USA, 1996.

[108] Patterson, J. and Agah, H., "Synopsys and Xilinx Unveil Next Generation Flow for Platform FPGAs," Xcell Journal Online, vol. 41, 2001.

[109] Purna, K. and Bhatia, D., "Temporal Partitioning and Scheduling Data Flow Graphs for Reconfigurable Computers," IEEE Transactions on Computers, vol. 48, pp. 579-590, 1999.

[110] QuickSilver_Technologies, "Adaptive Computing Machine," 2003.

[111] Rakhmatov, D., "Dynamic Scheduling in Run-Time Reconfigurable Systems," University of Arizona, 1998.

[112] Rakhmatov, D., Vrudhula, S., Brown, T., and Nagarandal, A., "Adaptive Multiuser Online Reconfigurable Engine," in IEEE Design & Test of Computers, vol. 17, 2000, pp. 53-67.

[113] Ratha, N., Jain, A., and Rover, D., "Convolution on Splash 2," presented at IEEE Symposium on FPGAs for Custom Computing Machines (FCCM'95), Napa Valley, CA, USA, 1995.

[114] Scalera, S. and Vazquez, J., "The Design and Implementation of a Context Switching FPGA," presented at IEEE Symposium on FPGAs for Custom Computing Machines (FCCM'98), Napa Valley, CA, USA, 1998.

[115] Sharma, A., Programmable Logic Handbook: PLDs, CPLDs and FPGAs: McGraw-Hill Professional, 1998.

[116] Shirazi, N., Luk, W., and Cheung, P., "Automating Production of Run-time Reconfigurable Designs," presented at IEEE Symposium on FPGAs for Custom Computing Machines (FCCM'98), Napa Valley, CA, USA, 1998.

[117] Shirazi, N., Luk, W., and Cheung, P., "Run-time Management of Dynamically Reconfigurable Designs," presented at International Workshop on Field-Programmable Logic and Applications (FPL'98), Tallinn, Estonia, 1998.

[118] Shoghi_Communications, "DES/TDES Core," New Delhi, India 2003.

[119] Sidhu, R., Wadhwa, S., Mei, A., and Prasanna, V., "A Self-Reconfigurable Gate Array Architecture," presented at 10th International Workshop on Field Programmable Logic and Applications (FPL'00), 2000.

[120] Silberschatz, A., Galvin, P., and Gagne, G., Applied Operating System Concepts: John Wiley & Sons, 2000.

[121] Silicon_Strategies, "Chip Makers Post Mixed Results," TechWeb, San Jose, CA, USA 2004.

[122] Sima, M., Vassiliadis, S., Cotofana, S., Eijndhoven, J., and Vissers, K., "A Taxonomy of Custom Computing Machines," presented at 1st Workshop on Embedded Systems and Software, Utrecht, The Netherlands, 2000.

[123] Simmler, H., Levinson, L., and Manner, R., "Multitasking on FPGA Coprocessors," presented at 10th International Workshop on Field-Programmable Logic and Applications (FPL'00), Villach, Austria, 2000.

[124] Smith, D. and Bhatia, D., "RACE: Reconfigurable and Adaptive Computing Environment," presented at International Workshop on Field-Programmable Logic and Applications (FPL'96), Darmstadt, Germany, 1996.

[125] Stallings, W., Operating Systems: Internals and Design Principles, 4th ed.: Prentice Hall, 2000.

[126] Tanenbaum, A., Operating Systems - Design and Implementation: Prentice Hall, 1997.

[127] Tennenhouse, D., Smith, J., Sincoskie, W., Wetherall, D., and Minden, G., "A Survey of Active Network Research," in IEEE Communications Magazine, vol. 35, 1997, pp. 80-86.

[128] Trimberger, S., "A Time-Multiplexed FPGA," presented at IEEE Symposium on FPGAs for Custom Computing Machines (FCCM'97), Napa Valley, CA, USA, 1997.

[129] Vasilko, M., Gibson, D., Long, D., and Holloway, S., "Towards a Consistent Design Methodology for Run-Time Reconfigurable Systems," presented at IEE Colloquium on Reconfigurable Systems, Glasgow, Scotland, 1999.

[130] Villasenor, J. and Mangione-Smith, W., "Configurable Computing," in Scientific American, 1997.

[131] Villasenor, J., Schoner, B., Chia, K., and Zapata, C., "Configurable Computing Solutions for Automatic Target Recognition," presented at IEEE Symposium on FPGAs for Custom Computing Machines (FCCM'96), Napa Valley, CA, USA, 1996.

[132] Vuillemin, J., Bertin, P., Roncin, D., Shand, M., Touati, H., and Boucard, P., "Programmable active memories: Reconfigurable systems come of age," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 4, pp. 56-69, 1996.

[133] Walder, H. and Platzner, M., "Non-preemptive Multitasking on FPGAs: Task Placement and Footprint Transform," presented at International Conference on Engineering of Reconfigurable Systems and Algorithms, Las Vegas, NV, USA, 2002.

[134] Wilberg, J., Kuth, A., Camposano, R., Rosenstiel, W., and Vierhaus, T., "A Design Exploration Environment," presented at 6th Great Lakes Symposium on VLSI, Iowa State University, 1996.

[135] Wolinski, C., "Reconfigurable Computing Systems," Los Alamos National Laboratory 2003.

[136] Woo, N. and Kim, J., "An Efficient Method of Partitioning Circuits for Multiple-FPGA Implementation," presented at 30th ACM/IEEE Design Automation Conference (DAC'93), Dallas, TX, USA, 1993.

[137] Xilinx, "MicroBlaze Datasheet," 2003.

[138] Xilinx, "Two Flows for Partial Reconfiguration: Module Based or Difference Based," Application Notes XAPP290, September 2004.

[139] Xilinx, "Virtex Architecture Datasheet," 2002.

[140] Xilinx, "Virtex-II Architecture Datasheet," 2000.

[141] Xilinx, "Virtex-II Pro Datasheet," 2003.

[142] Xilinx, "XC6200 Field Programmable Gate Arrays Datasheet," 1997.

[143] Yi-Ran, S., Kumar, S., and Jantsch, A., "Simulation and Evaluation for a Network on a chip architecture using NS-2," presented at 20th IEEE Norchip Conference, Copenhagen, 2002.

[144] Yoo, S., Choo, H., Youn, H., Yu, C., and Lee, Y., "On Task Relocation in Two-Dimensional Meshes," Journal of Parallel and Distributed Computing, vol. 60, pp. 616-638, 2000.

[145] Zhong, P., Martonosi, M., Ashar, P., and Malik, S., "Accelerating Boolean Satisfiability with Configurable Hardware," presented at IEEE Symposium on FPGAs for Custom Computing Machines (FCCM'98), Napa Valley, CA, USA, 1998.