
DESIGNING EMBEDDED MULTIPROCESSOR NETWORKS-ON-CHIP

WITH USERS IN MIND

A Thesis

Submitted to the Faculty

of

Carnegie Mellon University

by

Chen-Ling Chou

In Partial Fulfillment of the Requirements for

the Degree of

Doctor of Philosophy

April 2010


© Copyright by Chen-Ling Chou 2010

All Rights Reserved


To my parents, Chien-Te Chou and Hui-Yueh Chiang, and my husband, Hung-Chih Lai


ACKNOWLEDGMENTS

I would like to express my sincere gratitude to all those who have inspired me during my doctoral study and have supported me in finishing this dissertation.

I especially want to thank my advisor, Professor Radu Marculescu, for his continuous support, motivation, and invaluable guidance during my research and study at Carnegie Mellon University (CMU). His perpetual energy and enthusiasm in research have motivated all his advisees, including me. Without his inspiration, patience, friendship, and our stimulating discussions, this dissertation would never have been possible.

I am also grateful to my thesis committee members, Professor Shawn Blanton, Dr. Michael Kishinevsky, Professor Twan Basten, and Professor Onur Mutlu, for their insightful suggestions and comments on my research. In particular, I would like to thank Dr. Michael Kishinevsky for hiring me as an intern at the Intel Strategic CAD Lab. That experience broadened my perspective on the practical aspects of the industry.

All my lab buddies at the Center for Silicon System Implementation (CSSI) of CMU made it a convivial place to work. In particular, I would like to thank my colleagues in our System Level Design (SLD) group, namely Paul Bogdan, Shun-ping Chiu, Cory Bevilacqua, and Miray Kas, as well as all previous members of the SLD group, namely Jung-Chun (Mike) Kao, Umit Ogras, Nicholas H. Zamora, and Ting-Chun Huang. They have inspired me in research and life through our interactions during the long hours in the lab. Thanks.


I would also like to thank all of my friends in Pittsburgh who made this city a better place to live. In particular, I would like to thank my badminton friends at CMU and the University of Pittsburgh, who have made my Ph.D. life more fruitful and exciting. Playing badminton regularly with them keeps me full of energy and has contributed to my persistence and hard work in research.

My deepest gratitude goes to my family (my mother Hui-Yueh Chiang, my father Chien-Te Chou, and my husband Hung-Chih Lai) for their unflagging love and support throughout my life; this dissertation would simply have been impossible without them. In particular, without the encouragement and support from Hung-Chih, my graduate study would have finished much earlier, without a Ph.D. degree.

Finally, I would like to express my gratitude to the funding agencies: the National Science Foundation and the Gigascale Systems Research Center, one of five research centers funded under the Focus Center Research Program, a Semiconductor Research Corporation program.


TABLE OF CONTENTS

Page

LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi

LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii

ABBREVIATIONS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xviii

ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xx

1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.1. Trends and Challenges for Embedded Systems . . . . . . . . . . . . . . . . . . . . . . . .1

1.2. Evolution of Embedded System Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3

1.3. Motivation for User-Centric Design. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8

1.3.1. User Behavior Variation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8

1.3.2. Proposed User-aware Design Methodology . . . . . . . . . . . . . . . . . . . . . .12

1.4. Dissertation Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .15

1.4.1. DSE for Full-custom NoC with Predictable System Configurations . . .15

1.4.2. User-centric Design Methodology Handling Unpredictable System Configurations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .15

1.5. Dissertation Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .20

2. Embedded NoC Platform Characterization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

2.1. NoC Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .23

2.2. Application Modeling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .29

2.3. Trace-based Energy Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .30

2.3.1. User Trace Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .31

2.3.2. Computation Energy Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .32


2.3.3. Communication Energy Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . .33

3. System Interconnect DSE for Full-custom NoC Platforms . . . . . . . . . . . . . . . . . . . . . . 35

3.1. Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .35

3.2. Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .37

3.3. System Interconnect in MPSoC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .38

3.3.1. General Framework for Application-specific MPSoC . . . . . . . . . . . . . .38

3.3.2. System Interconnect Problem Formulation . . . . . . . . . . . . . . . . . . . . . .40

3.3.3. Communication Fabric Exploration Flow . . . . . . . . . . . . . . . . . . . . . . .43

3.4. Optimization of System Interconnect Problem. . . . . . . . . . . . . . . . . . . . . . . . .45

3.4.1. Exact System Interconnect Exploration . . . . . . . . . . . . . . . . . . . . . . . . .45

3.4.2. Heuristic for Speeding up System Interconnect Exploration . . . . . . . . .48

3.5. Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .50

3.5.1. Industrial Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .50

3.5.2. Synthetic Applications for Larger Systems . . . . . . . . . . . . . . . . . . . . . .53

3.6. Summary. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .55

4. User-Centric DSE for Heterogeneous Embedded NoCs. . . . . . . . . . . . . . . . . . . . . . . . . 57

4.1. Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .57

4.2. Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .58

4.3. Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .59

4.4. The Problem and Steps for DSE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .62

4.4.1. User Behavior Similarity and Clustering . . . . . . . . . . . . . . . . . . . . . . . .62

4.4.2. Automated NoC Platform Generation . . . . . . . . . . . . . . . . . . . . . . . . . .65

4.4.3. Validation Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .70

4.5. Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .72

4.5.1. Evaluation of User Behavior Clustering. . . . . . . . . . . . . . . . . . . . . . . . .73


4.5.2. NoC Platform Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .75

4.5.3. Evaluation of Entire Design Methodology . . . . . . . . . . . . . . . . . . . . . . .76

4.6. Summary. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .77

5. Energy- and Performance-Aware Incremental Mapping for NoC . . . . . . . . . . . . . . . . . 79

5.1. Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .79

5.2. Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .83

5.3. Motivational Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .84

5.4. Incremental Run-time Mapping Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . .86

5.4.1. Proposed Methodology. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .86

5.4.2. Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .88

5.4.3. Significance of the Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .89

5.5. Solving the Incremental Mapping Problem . . . . . . . . . . . . . . . . . . . . . . . . . . .90

5.5.1. Solutions to the Near Convex Region Selection Problem. . . . . . . . . . . .90

5.5.2. Solutions to the Vertex Allocation Problem . . . . . . . . . . . . . . . . . . . . . .103

5.6. Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .107

5.6.1. Evaluation of Region Selection Algorithm on Random Applications. . .107

5.6.2. Evaluation of Vertex Allocation Algorithm on Random Applications . .109

5.6.3. Random Applications Considering Energy Overhead for the Entire Incremental Mapping Process. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .111

5.6.4. Real Applications Considering Energy Overhead for the Entire Incremental Mapping Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .112

5.7. Summary. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .114

6. Fault-tolerant Techniques for On-line Resource Management . . . . . . . . . . . . . . . . . . . . 117

6.1. Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .117

6.2. Related Work and Novel Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .120

6.3. Analysis for Network Contention and Spare Core Placement . . . . . . . . . . . . .121


6.3.1. Network Contention Impact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .121

6.3.2. Spare Core Placement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .125

6.4. Investigations Involving New Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .129

6.5. Fault-tolerant Resource Management. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .133

6.5.1. RUN_FT_MAPPING Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .134

6.5.2. RUN_FT_MAPPING Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .135

6.6. Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .138

6.6.1. Evaluation with Specific Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .138

6.6.2. Impact of Failure Rates with Spare Core Placement . . . . . . . . . . . . . . . .140

6.6.3. Evaluation with Real Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . .141

6.7. Summary. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .142

7. User-Aware Dynamic Task Allocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143

7.1. Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .143

7.2. Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .147

7.3. Preliminaries and Methodology Overview. . . . . . . . . . . . . . . . . . . . . . . . . . . .148

7.3.1. Motivational Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .148

7.3.2. System Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .153

7.3.3. Overview of the proposed methodology . . . . . . . . . . . . . . . . . . . . . . . . .155

7.3.4. User Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .157

7.4. Problem Formulation of User-Aware Task Allocation Process . . . . . . . . . . .159

7.5. User-Aware Task Allocation Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . .162

7.5.1. Solving the Region Forming Sub-problem (P1) . . . . . . . . . . . . . . . . . . .162

7.5.2. Solving the Region Rotation Sub-problem (P2) . . . . . . . . . . . . . . . . . . .165

7.5.3. Solving the Region Selection Sub-problem (P3). . . . . . . . . . . . . . . . . . .168

7.5.4. Solving the Application Mapping Sub-problem (P4) . . . . . . . . . . . . . . .168


7.6. Light-Weight Model Learning Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .168

7.7. Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .171

7.7.1. Evaluation on Random Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . .173

7.7.2. Real Applications with Run-time Energy Overhead Considered . . . . . .177

7.7.3. Real Applications with On-line Learning of User Model . . . . . . . . . . . .180

7.8. Summary. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .183

8. Conclusions and Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185

8.1. Dissertation Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .185

8.2. Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .188

8.2.1. Challenges Ahead for User-centric Embedded System Design . . . . . . .188

8.2.2. Increasing Flow Experience by Designing Embedded Systems . . . . . . .189

LIST OF REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193

APPENDIX A. Machine Learning Techniques Survey for User-centric Design . . . . . . . . 203

APPENDIX B. ILP-based Contention-aware Application Mapping . . . . . . . . . . . . . . . . . 207

B.1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .207

B.2. Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .207

B.3. Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .209

B.4. ILP-based Contention-aware Mapping Approach . . . . . . . . . . . . . . .210

B.4.1. Parameters and Variables . . . . . . . . . . . . . . . . . . . . . . . . . .210

B.4.2. Objective Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .211

B.4.3. Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .212

B.5. Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .213

B.5.1. Experiments using Synthetic Applications . . . . . . . . . . . . .213

B.5.2. Experiments using Real Applications . . . . . . . . . . . . . . . . .215

B.6. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .217


LIST OF TABLES

Table Page

1.1 Three different categories of user-system interaction. . . . . . . . . . . . . . . . . . . . . . . . 11

2.1 Impact of adding the control network on area. The synthesis is performed for Xilinx Virtex-II Pro XC2VP30 FPGA.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

4.1 Architecture template for the NoC platform. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

4.2 Computation energy consumption comparison for three trace clusters and different resource sets derived by the proposed and traditional design flow. . . . . . . . . . . . . . 75

5.1 L1(R’) + L1(R-R’) minimization problem when using the Euclidean Minimum (EM), Fixed Center (FC), and Neighbor_aware Frontier (NF) heuristics. . . . . . . . . 97

5.2 Mapping approach proposed in [27] vs. our algorithms results. . . . . . . . . . . . . . 114

6.1 Comparison among the Random, MBS [99], and Nearest Neighbor (NN) [27] mapping methods.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132

6.2 Throughput and energy consumption comparison between the proposed FT and Nearest Neighbor (NN) approaches for all-to-all and one-to-all communication patterns. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139

6.3 Impact of contamination area on different failure rates under Side and Random spare core placements. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140

6.4 Comparison between the Nearest Neighbor (NN) and our FT mapping results on the overall system performance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142

7.1 Event communication cost [in bits] for three approaches and five applications entering the system as shown in Figure 7.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151

7.2 Comparison of communication consumption among different approaches on different-size NoCs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177

7.3 Comparison of the run-time overhead and the overall communication energy savings under four implementations on a 5 × 5 mesh NoC. . . . . . . . . . . . . . . . . . . 180


7.4 Normalized event cost in stages 1, 2, and 3 under different user models from four users normalized to the total event cost of “Nearest Neighbor [27]” approach.. . . . 181

B.1 Energy and throughput comparison between energy-aware in [79] and contention-aware mapping. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214

B.2 Communication energy overhead and throughput improvement of our contention- aware solution compared to the energy-aware solution [79]. . . . . . . . . . . . . . . . . . 215


LIST OF FIGURES

Figure Page

1.1 General idea of newly proposed user-centric design. . . . . . . . . . . . . . . . . . . . . . . . 3

1.2 The design hierarchy and evolution of embedded systems in terms of hardware capacity and software programmability, namely task-level, resource-level, system-level and our proposed user-level design. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.3 (a) Traditional system design methodology, Y-chart, for embedded systems (b) On-line optimization to determine user satisfaction. . . . . . . . . . . . . . . . . . . . . . . . . 5

1.4 Hierarchy of needs at each level of abstraction from system designer and user perspectives. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

1.5 Three-day user traces from two users. (a) Appearances of five different Windows applications (b) Total number of applications in the system at each time instant. . 9

1.6 User satisfaction ratings corresponding to different CPU usage for two users. . . . 10

1.7 Sketch of (a) traditional and (b) user-centric design flows. . . . . . . . . . . . . . . . . . . 13

1.8 User-centric design flow for heterogeneous NoCs, including user behavior analysis, NoC architecture automation, and optimization process. Five types of problems marked with the “*” sign, and their related machine learning techniques, are surveyed in Appendix A.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2.1 Homogeneous or heterogeneous 2-D mesh NoCs with PEs interconnected via the data and control networks, described in a generalized way.. . . . . . . . . . . . . . . . . . . 25

2.2 (a) The logical view of the control network. (b) The on-chip router micro-architecture that handles the control network.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

2.3 Application Characterization Graph (ACG) characteristics. The tasks belonging to the same vertex are mapped onto the same PE. Each edge represents the communication between two nodes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3.1 Block diagram for a general MPSoC platform. . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

3.2 General platform with multiple IPs communicating via the system interconnect. . . 39


3.3 (a) System interconnect design space trading off the system performance and area/wirelength overhead (b) Traditional bus model connecting four IP blocks (c) Fully connected switches with four IP blocks (d) Possible optimized communication fabric for four IP blocks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

3.4 The flow of the communication fabric design space exploration with the analysis, simulation, and evaluation stages shown explicitly. . . . . . . . . . . . . . . . . . . . . . . . . 44

3.5 A three-IP example of communication fabric exploration using the branch and bound algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

3.6 The pseudo code of the system interconnect exploration using the branch and bound method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

3.7 The proposed heuristic for four IPs with the number of muxes set to 2. . . . . . . . . 49

3.8 System interconnect exploration for a real SoC design. (a) Pareto-optimal set (latency vs. fabric area) obtained via analysis. (b) Simulation results for solutions in (a). (c) Pareto-optimal set (i.e., latency vs. fabric wirelength) obtained via analysis. (d) Simulation results for solutions in (c). . . . . . . . . . . . . . . . . . . . . . . . . 51

3.9 Forty non-Pareto points and Pareto curve plots obtained via analysis (a) and via simulation (b). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

3.10 Solutions comparison between branch and bound method (BB) and the proposed heuristic for system interconnect exploration of a synthetic application with 13 IP blocks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

3.11 Run-time and solution quality comparison between branch and bound approach (BB) and our heuristic as the system size scales up. . . . . . . . . . . . . . . . . . . . . . . . . 54

4.1 The proposed user-centric design flow in terms of the off-line DSE processes. . . 60

4.2 Main steps of user behavior clustering. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

4.3 Main steps for computational resource selection.. . . . . . . . . . . . . . . . . . . . . . . . . . 66

4.4 Main steps for resource location assignment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

4.5 Validation process of the newly proposed methodology.. . . . . . . . . . . . . . . . . . . . 71

4.6 Pareto points showing the tradeoffs between price and computation energy consumption. For each cluster, four users are randomly selected and their Pareto curves are plotted. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

5.1 Example of NoC incremental application mapping comparing the greedy and our proposed solutions. The greedy approach, which does not consider additional mappings, incurs higher communication overhead for App 2, and higher system communication cost as well, compared to our proposed solution. . . . . . . . . . . . . . 80

5.2 Motivational example for incremental mapping process. (a) Optimal solution (b) Near convex region solution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

5.3 Overview of the proposed incremental mapping methodology.. . . . . . . . . . . . . . . 86

5.4 Overview of the proposed methodology. (a) The incoming application ACG (b) Current system configuration (c) The near convex region selection step (d) The vertex allocation step. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

5.5 The impact of Manhattan Distance (MD) on communication energy consumption for four different scenarios (S1-S4). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

5.6 L1(R’) + L1(R-R’) minimization problem: select a region R’, such that the sum of the total Manhattan Distance (MD) between any pair of tiles inside region R’ and that inside region R-R’ is minimized. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

5.7 Region with N = 20 resulting from several distinct methods, namely (a) Best Case (BC) (b) Worst Case (WC) (c) Euclidean Minimum (EM) (d) Fixed Center (FC) (e) Random Frontier (RF) (f) Neighbor_aware Frontier (NF). Note that the shape of the resulting regions would be the same even if shifted to other coordinates. Here, we only consider minimizing the total Manhattan Distance between any pair of these N tiles inside R’, i.e., L1(R’). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

5.8 L1 distance results showing the scalability of the solutions obtained via the Best Case (BC), Worst Case (WC) and four heuristics (EM, FC, RF, and NF).. . . . . . . 95

5.9 Histogram over 1000 runs for the L1(R’) + L1(R-R’) minimization problem. We represent [L1(R’) + L1(R-R’)] distances on the x-axis and their frequency of occurrence on the y-axis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

5.10 Dispersion and Centrifugal factor calculation example. . . . . . . . . . . . . . . . . . . . 99

5.11 Near convex region selection algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

5.12 Incremental run-time mapping process. (a) The ACG of the incoming application (b) Current system behavior (c) Near convex region selection process (d) Vertex allocation process. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .101

5.13 Vertex allocation algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

5.14 Vertex allocation process based on the example in Figure 5.12. (a) Initial configuration with every vertex white. (b) Vertex 6 is discovered. (c) Vertex 9 is discovered. (d) Vertex 7 is finished and colored black. (e) Vertex 9 is colored from gray to black. (f) Vertex 6 is colored from gray to black (g) Vertex allocation process is done; all vertices are colored black. . . . . . . . . . . . . . . . . . . . . . . . . . . . 105


5.15 (a) Impact of selection region process on inter-processor communication. (b) Communication energy loss: optimal mapping vs. our allocation algorithm given a selected region. (c) Optimal vs. our allocation algorithm under different communication rates. (d) Communication energy savings: arbitrary mapping vs. our allocation algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

5.16 Communication energy consumption comparison using random applications. . . 111

6.1 Non-ideal 2-D mesh platform consisting of resources connected via a network. The resources include computational tiles (i.e., manager tiles, active and spare cores) and memory tiles. Permanent, transient, or intermittent faults may affect the computational and communication components on this platform. . . . . . . . . . . . . . 118

6.2 Application mapping on mesh-based 3 × 3 NoC (a) Application characteristic ACG = (V, E) (b) Source-based contention (c) Destination-based contention (d) Path-based contention. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122

6.3 The (a) source-based (b) destination-based (c) path-based contention impact on average packet latency. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

6.4 (a) Application Characterization Graph (ACG) (b) Spare cores (‘S’) are assigned towards the side of the system. (c) Spare cores ‘S’ are randomly distributed in the system (d) Spare cores ‘S’ are evenly distributed in the system. . . . . . . . . . . . . . . 126

6.5 Quantitative analysis on the performance impact on three different spare core placements. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

6.6 Two mapping results for the ACG in Figure 6.4(a) where the spare cores are randomly placed on the platform. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131

6.7 3D Kiviat plots showing WMD, LCC, and SFF metrics for three different mapping schemes (i.e., Random, MBS, and NN). . . . . . . . . . . . . . . . . . . . . . . . . . . 132

6.8 The FT resource management framework. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134

6.9 Main steps of RUN_MIGRATION process. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135

6.10 Main steps of RUN_FT_MAPPING process. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137

7.1 Contiguous (a) and non-contiguous (b)-(e) allocations for four applications using standard techniques.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144

7.2 Motivational example of run-time resource management with user behavior taken into consideration. (a) Application characteristics. (b) Events in the system. (c)(d)(e) Task allocation scheme under Approach 1, Approach 2, and the Hybrid approach, respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149

7.3 Overview of the proposed methodology. Default approach (i.e., Approach 2) is applied in stage 1. Hybrid approach with pre-defined user model is applied in stage 2. Hybrid approach with on-line learned user model is applied in stage 3. . . 155

7.4 Algorithm flow for our proposed methodology.. . . . . . . . . . . . . . . . . . . . . . . . . . . 156

7.5 Main steps of the region forming algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163

7.6 Example showing the region forming algorithm on an ACG. . . . . . . . . . . . . . . . . 164

7.7 The subtraction calculation during the region rotation process. . . . . . . . . . . . . . . . 166

7.8 Main steps of the region rotation algorithm.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167

7.9 Four possible decision tree structures for user model. . . . . . . . . . . . . . . . . . . . . . . 169

7.10 4-fold cross-validation for model learning. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171

7.11 (a) Pseudo code of the tree structure learning process without the cross-validation method and (b)(c) with the cross-validation method. . . . . . . . . . . . . . . . 162

7.12 Communication energy loss compared to the optimal solution for (a) region forming (P1) sub-problem and (b) application mapping (P4) sub-problem on a 2D-mesh NoC. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174

7.13 (a) Communication cost comparison among Approach 1, Approach 2, and the hybrid approach (which considers the user behavior) on an 8 × 8 NoC. (b) L(R), where R is the available/unused resources, comparison among Approach 1, Approach 2, and the hybrid approach. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176

8.1 Model exploration for user-centric design flow. . . . . . . . . . . . . . . . . . . . . . . . . . . 188

8.2 Four-quadrant states in terms of challenge and skill level.. . . . . . . . . . . . . . . . . . . 191

A.1 (a) Five types of problems for user-centric design i) classification ii) regression iii) similarity iv) clustering v) reinforcement learning (b) Selected machine learning approaches. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .204

B.1 (a) Logical and (b) physical application characterization graph. (c) One core mapping example.. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208

B.2 Path-based contention count in a 4 × 4 NoC comparing the random, energy-aware in [79] and contention-aware mapping. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214

B.3 (a) Parallel-1 benchmark (b)(c) Mapping results of the energy-aware approach [79] and our contention-aware method (d) Average packet latency and throughput comparison under these two mapping methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . 216


ABBREVIATIONS

ACG Application characterization graph

CC Computation capacities

CMP Chip multiprocessors

DSE Design space exploration

DSP Digital signal processor

E3S Embedded system synthesis benchmark

GM Global manager

FCA Failure contamination area

FIFO First-in-first-out

FT Fault tolerant

GPU Graphics processing units

IDC Identification content

ILP Integer linear programming

I/O Input/output

IP Intellectual property

LACG Logical application characterization graph

MCR Minimal computation requirement

MD Manhattan distance

MPSoC Multiprocessor Systems-on-Chip

NI Network interface

NN Nearest neighbor

NoC Networks-on-Chip

OS Operating system

PACG Physical application characterization graph

PCI Peripheral component interconnect


PDA Personal digital assistant

PE Processing element

PL Port location

PTM Predictive technology model

RMS Recognition, mining, and synthesis

SATA Serial advanced technology attachment

SoC Systems-on-Chip

UART Universal asynchronous receiver/transmitter

USB Universal serial bus

WCET Worst case execution time


ABSTRACT

Future embedded Systems-on-Chip (SoCs) designed at nanoscale will likely consist of tens or hundreds of (potentially energy-efficient) heterogeneous cores supporting one or several dedicated applications. For such systems, Networks-on-Chip (NoC) communication architectures have been proposed as a scalable solution consisting of a network of resources exchanging packets while running various applications concurrently. Over recent years, embedded systems have gained an enormous amount of processing power and functionality, with the ultimate goal of power and performance optimization.

In this dissertation, starting from the premise that the ultimate goal of any system optimization is to satisfy the end user, we study outstanding problems in embedded system design methodology, while incorporating user behavior information into the modeling, analysis, optimization, and evaluation steps. Our specific contributions are as follows.

• For predictable system configurations derived from use-case applications, we explore the design space of the system interconnect in application-specific multiprocessor systems-on-chip (MPSoCs). With the proposed analytical and simulation models, we can generate fabric solutions with theoretically optimal cost-performance trade-offs, while considering various design constraints, such as power, area, and wirelength.

• For unpredictable system configurations incorporating users’ interaction with the system, we present a new design methodology for automatically generating regular NoC platforms, while explicitly including information about the user experience in the design process. This off-line design flow aims at minimizing the workload variance and allows the system to better adapt to different types of users.

• For applications entering and leaving the system dynamically, we propose an efficient technique for run-time application mapping onto heterogeneous NoC platforms with the goal of minimizing the communication energy consumption, while still providing performance guarantees. The proposed technique allows new applications to be easily added to the system platform with minimal inter-processor communication overhead.

• To address the problem of run-time resource management in NoC platforms while considering permanent, transient, and intermittent failures, we propose a system-level fault-tolerant approach that investigates several metrics for network contention and system fragmentation, as well as their impact on system performance.

• Finally, having generated system platforms which exhibit less variation among users’ behavior, we explore flexible and extensible run-time resource management techniques that allow the system to adapt to run-time stimuli specific to each class of user behavior; these techniques change dynamically according to user models built on-line based on different user needs.


1. INTRODUCTION

1.1. Trends and Challenges for Embedded Systems

Embedded systems consist of hardware and software integrated on the same silicon platform, typically running one or a few dedicated applications in a static or dynamic manner [116]. These systems have become very popular in recent years and, in fact, dominate the semiconductor industry nowadays. To give a bit of perspective, whereas only 3% of processors are used in general-purpose workstations, desktop, or laptop computers, about 97% of the 6.5 billion processors produced worldwide in 2004 were integrated into embedded systems deployed in avionics, automotive, multimedia, consumer electronics, office appliances, robots, and toys [55].

From a technological standpoint, computing hardware has improved dramatically over the past forty years. As Gordon Moore predicted, almost every measure of capability in electronic devices (e.g., processor speed, memory storage capacity, etc.) has improved at roughly exponential rates over the years. For example, flash drives with capacities over 1GB have replaced 3-1/2 inch floppy disks with a capacity of 1.44MB, while cell phones have gradually replaced beepers and other obsolete communication devices because of their higher flexibility and efficiency in communication [6]. However, among these high-tech products, only a few have made a long-lasting impact, while the others were eliminated through competition.

A natural question, then, is whether or not the success of embedded systems follows, in some sense, Darwin’s principle of natural selection [48] or Spencer’s concept of survival of the fittest [157]. In short, both philosophies argue that all species evolve from common ancestors and only the fittest organisms get the chance to prevail over time.

Although finding a definite answer is a complicated endeavor, we believe that such ideas may also apply to the evolution of embedded systems. More precisely, we believe that the success of various embedded systems comes as a result of users’ selection; therefore, the products which fit users’ demands best eventually dominate the market, while the other products are simply not competitive and are meant to perish over a short period of time. Perhaps a more appropriate interpretation of these classical principles of evolution in the context of embedded systems would be to consider the survival of the “fit enough” system. Indeed, although embedded systems have gained an enormous amount of processing power and functionality, from the users’ perspective, the newest or the most advanced products are not necessarily the best. Instead, quite often, one can observe that products that “fit enough”, or provide “just-enough performance”, do reasonably well [132], so designers can focus on adding additional features (e.g., appearance, low power, practicability, interface, price) rather than focusing exclusively on improving devices’ raw performance. Indeed, due to the high variability seen in user preferences, it becomes much more challenging for system designers to satisfy users’ taste, and this is especially true for the large class of personal embedded systems (e.g., cell phones, personal digital assistants (PDAs), gaming devices, etc.) [86].

Starting from these ideas, and in contrast to the traditional design flow, we propose a user-centric embedded system design methodology which gets users directly involved in the design flow, with the goal of minimizing the workload variance; this allows the system to better adapt to different types of user needs and workload variations. More specifically, we collect traces from various users (see dots in Figure 1.1) and investigate important behavioral traits in order to cluster them (see circles in Figure 1.1). For each cluster of such user traces, and depending on the architectural parameters extracted from high-level specifications, we propose an optimization technique for the system architecture (see the square in Figure 1.1, which is applied at design time). We also propose validation techniques to assess the robustness of the newly proposed design methodology. For such a design, we can further apply optimization techniques (see the arrow in Figure 1.1, which operate at run time) to better adapt to users’ requirements on-line. Of note, in this dissertation we restrict our attention to user-centric design for embedded applications. However, we believe the idea of “user-centric design” can be applied to other areas too, such as web applications [25][26], marketing [128][148], user interface design [133], and game design [139].
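To make the trace clustering step concrete, the following minimal Python sketch (an illustration only, not the exact procedure developed later in this dissertation) groups user traces by a few behavioral features using k-means; the feature set and the sample values are assumptions for illustration.

import numpy as np
from sklearn.cluster import KMeans

# One row per user trace: [avg. concurrent apps, app switches per 15 min,
# mean CPU usage (%), CPU usage variance] -- hypothetical values.
traces = np.array([
    [2.48, 1.1, 60.0,  71.0],
    [2.06, 1.5, 46.0, 540.0],
    [2.40, 1.2, 58.0,  90.0],
    [1.90, 1.6, 44.0, 480.0],
])

# Standardize each feature so no single trait dominates the distance metric.
z = (traces - traces.mean(axis=0)) / traces.std(axis=0)

# Group similar traces; each cluster later receives its own optimized platform.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(z)
print(labels)  # cluster index per user trace, e.g., [0 1 0 1]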

1.2. Evolution of Embedded System Design

Embedded systems today are increasingly complex and multi-functional in nature. The design hierarchy, as well as the evolution, of embedded systems can be represented as in Figure 1.2. Given the advances in the semiconductor industry (see the left part of Figure 1.2), more and more microprocessors are used for building real systems. Moreover, Intellectual Property (IP) integrated solutions provide Systems-on-Chip (SoC) designers with a fast way to develop robust embedded applications.

Figure 1.1 General idea of the newly proposed user-centric design. [Scatter of user traces over user behavior traits 1 and 2: dots are user traces; circles are clusters of similar traces; the square marks the platform generated for each cluster at design time; the arrow marks run-time adaptation of these platforms to users’ requirements.]

For providing high scalability in large SoC designs,

the Networks-on-Chip (NoC) communication architecture represents a promising solution. NoCs consist of a network of resources (including computation and storage elements) exchanging data [17][47]. In terms of software programming (see the right part of Figure 1.2), single-processor platforms can execute multiple tasks via multi-threading. However, it is now recognized that increasing the clock frequency of future processors at the rate sustained during the last two decades is no longer a viable option. As a result, we witness a rapid move from uniprocessor to multiprocessor systems.

Over the past few decades, various approaches have been proposed to address the design process at task-, resource-, and system-level [138] (see Figure 1.2). More precisely, at task-level, timing analysis performed on each task is of crucial importance for real-time systems, including program execution path analysis and data dependency/dynamic behavior

Figure 1.2 The design hierarchy and evolution of embedded systems in terms of hardware capacity and software programmability, namely task-level, resource-level, system-level and our proposed user-level design. [Pyramid levels: task-level, resource-level, system-level, user-level; hardware capacity evolves from microcontroller/IP core to Systems-on-Chip (SoC)/Networks-on-Chip (NoC); software programming evolves from a single task on a uniprocessor, to multiple tasks on a uniprocessor (multi-threading), to multiple tasks on multiprocessors (multi-processing); the user level captures different workloads from the user-system interaction process.]

analysis for estimating the worst-case, average-case, and best-case execution times accurately [58][61][71][97][166]. At resource-level, resources are shared among periodic and aperiodic tasks; this requires time-triggered or event-triggered scheduling schemes, such as rate-monotonic scheduling, earliest deadline first, maximum urgency first, etc. [1][90][125][131][135][144][169]. At system-level, due to platform integration complexity, various computation and communication models relying on certain assumptions about task and resource profiles are used for early design space exploration (DSE). The traditional system design flow at system-level follows the Y-chart in Figure 1.3(a) [7][8][100][171]. Given the architecture parameters (e.g., computation and communication components, network

Figure 1.3 (a) Traditional system design methodology, Y-chart, for embedded systems. (b) On-line optimization to determine user satisfaction. [Panel (a): architecture parameters (fixed type and number of resources, fixed communication protocol, memory size, area and cost constraints) and application parameters/design metrics (QoS parameters, power budget, latency and bandwidth constraints) feed an automated platform design step (task mapping, scheduling, resource allocation, ...) that produces an application-specific embedded system. Panel (b): run-time strategies (adaptive mapping, scheduling, ...) combined with a user satisfaction survey (excellent/good/fair/poor).]

topology, etc.) and application-specific parameters (e.g., power constraints, maximum latency, multiple use-cases [106], etc.), the customized architecture (or system platform) is automatically generated offline using static techniques, such as generic optimization [4][114], symbolic search [60][109][150], predictive modeling [44][91][121], or dynamic programming [31]. Afterwards, the system is manufactured and deployed for use by different users, as shown in Figure 1.3(b). However, due to differences in users’ behavior, the platform will likely not satisfy all users equally well, even assuming perfect techniques for run-time optimization. In other words, some users may find the system difficult or inefficient to use, even though it may be highly recommended by other users. Such issues are typically the cause of significant losses in product sales and revenues.

Since any system optimization has ultimately the goal of satisfying each end user, we consider one more level in this design hierarchy, namely the user-level, in order to deal with the real workload variation across different users [132]. As shown in this representation (see the bottom of the pyramid in Figure 1.2), the users interact directly with the system. Due to variations in users’ behavior, the workload across different resources may exhibit high variability even when using the same hardware platform. Murali et al. [106] deal with the mapping of a finite set of use-cases onto a given NoC, where all use-cases belong to the same task sets. Our methodology targets generating NoC-based platforms for multiple applications running on them simultaneously, where each application has its own task set. In addition, our scenario considers the users’ interaction with the system; therefore, the system configurations at each time instant cannot be predicted off-line [62]. This motivates us to define a new DSE methodology for future embedded systems by considering an extra degree of freedom, namely, the user experience; this encompasses all aspects related to end-user interaction with the platform and the associated design costs (e.g., power, performance).


In order to design embedded systems from the users’ perspective, we discuss the needs at each level of abstraction, both from the system designer’s and the users’ perspectives, as shown in Figure 1.4. First, at the task- and resource-level, the designers need to make sure that the code for tasks is error-free and written in a modular style (i.e., as IP modules). Later, IP module integration/composition at system-level helps build the embedded system, while covering the entire design space for early estimation (i.e., system composability). In addition, at user-level, a wide range of embedded systems typically provides programmability to support system upgradability and extensibility, in order to deal with run-time system changes from various users.

Once the system is manufactured and deployed, the needs are different from the users’ perspective (see Figure 1.4). Basically, users purchase the end products based on the functionality and features they need. Also, the end products need to be easy to set up (reliability), operate (usability), and update (adaptability) in order to support different run-time stimuli and user preferences. Therefore, our main contribution in this work is to develop

Figure 1.4 Hierarchy of needs at each level of abstraction from system designer and user perspectives. [Designer perspective: errorlessness and modularity at task- and resource-level, composability at system-level, extensibility at user-level; user perspective: functionality, reliability, usability, adaptability; design complexity and user demand grow in opposite directions across the levels.]

a user-centric design methodology, both from the system designer’s and the users’ perspectives; the user-centric design flow is discussed in the next section.

1.3. Motivation for User-Centric Design

As discussed above, future embedded systems running multiple applications concurrently should rely on a variety of system configurations, which are challenging to design. Although prior work on exploring the design space exists [66], the traditional design flow (see Figure 1.3) can still generate only one or just a few platform configurations, most likely along the same Pareto curve trading off multiple objectives [51]. However, due to the potentially high user behavior variation, such a platform (or limited set of platforms) can hardly meet all user needs or maximize user satisfaction, even assuming perfect techniques for run-time optimization. Given all the above considerations, this section first discusses the potential of the user-centric design flow (see Section 1.3.1) and then introduces a novel idea for developing new methodologies and optimization techniques that get users directly involved in the design flow (see Section 1.3.2).

1.3.1. User Behavior Variation

The critical questions that determine the potential of a user-centric design flow are as follows: i) How much difference is there in users’ behavior? ii) How can one make sure that a particular user is satisfied with the system at hand? iii) Is it necessary to propose different designs for different users? In this chapter, we try to answer these questions based on some realistic user traces.

Regarding the first question, Figure 1.5 presents data and the corresponding CPU usage from a three-day trace (about 7-9 working hours per day) of two user sequences collected from


five applications, namely Internet Explorer, Microsoft Office PowerPoint, Matlab, Adobe Acrobat, and Microsoft Office Word, running under Windows XP. More precisely, Figure 1.5(a) plots the presence of these five applications separately, with “high/low” values meaning the application is “running/not running” in the system (with solid and dashed lines for the two different users). Figure 1.5(b) shows the total number of applications executing in the system for these two users (in this representation, each time unit represents 15 minutes). As we can see, the arrival order and the frequencies of applications entering and leaving the system vary a lot from one user to another. Based on the data for these five applications, the average number of

Figure 1.5 Three-day user traces from two users. (a) Appearances of five different Windows applications (Internet Explorer, Microsoft Office PowerPoint, Matlab, Adobe Acrobat, Microsoft Office Word). (b) Total number of applications in the system at each time instant. [x-axes: time unit (0-100); y-axis in (b): number of applications in the system (0-5); solid and dashed lines distinguish user 1 and user 2.]

applications running concurrently in the system is 2.48 and 2.06 for the solid-line and dash-line users, respectively, while the switching frequency (i.e., the number of times the user switches from one application to another) is 1.1/15 min and 1.5/15 min, respectively. Moreover, from these collected traces, we observe that the solid-line user always makes high use of the CPU (on average, 60% CPU usage with a variance of 71), while the dash-line user has a higher variance of CPU utilization (on average, 46% CPU usage with a variance of 540).
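As an illustration of how such statistics can be derived, the short Python sketch below computes the average number of concurrently running applications and a switching-frequency proxy from a 0/1 application-presence matrix; this data layout is an assumption, not the actual trace format used in our experiments.

import numpy as np

def trace_features(presence: np.ndarray):
    # presence: (time units x applications) 0/1 matrix; one time unit = 15 minutes.
    # Average number of applications running concurrently per time unit.
    avg_concurrent = presence.sum(axis=1).mean()
    # Proxy for switching frequency: application start/stop events per time unit.
    switch_rate = np.abs(np.diff(presence, axis=0)).sum() / presence.shape[0]
    return avg_concurrent, switch_rate

# Toy trace: 4 time units, 3 applications.
trace = np.array([[1, 0, 1],
                  [1, 1, 0],
                  [0, 1, 0],
                  [1, 1, 1]])
print(trace_features(trace))  # (2.0, 1.25)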

With respect to the second question, recent studies have shown that there exists considerable variation in user expectation and user satisfaction relative to the actual system performance [68][152][153]. Namely, some users are sensitive to system changes, while others are not. Evidence is given in Figure 1.6, showing the relationship between the CPU usage for some collected traces and the user satisfaction for two different users. During the experiments, users provide a satisfaction rating (1: very poor, 2: poor, 3: indifferent, 4: good, 5: very good) every 15 minutes. The correlation of the user satisfaction rating (variable x) to the CPU usage (variable y) can be interpreted using Pearson’s product-moment correlation coefficient ($r_{xy}$):

Figure 1.6 User satisfaction ratings corresponding to different CPU usage for two users. [x-axis: user satisfaction (1: very poor to 5: very good); y-axis: CPU usage (0-100%); separate point sets for user 1 and user 2.]

$$r_{xy} = \frac{n\sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)}{\sqrt{n\sum x_i^2 - \left(\sum x_i\right)^2}\,\sqrt{n\sum y_i^2 - \left(\sum y_i\right)^2}} \qquad (1.1)$$

where $n$ is the number of points in the data series $X$ and $Y$, written as $x_i$ and $y_i$ with $i = 1, \dots, n$. The correlation is a value between -1 and 1, indicating the degree of linear dependence between the variables; as it approaches zero, the relationship weakens. Conversely, the closer the coefficient is to either -1 or 1, the stronger the correlation between the variables, and the more sensitive the user is to CPU usage. As observed in Figure 1.6, the correlation between the CPU usage and the user satisfaction is -0.36 for the first user and -0.85 for the second. We can conclude that user 2 is more sensitive to CPU utilization. This variation in user satisfaction indicates the existence of potential for further optimization.
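For reference, a direct transcription of Eq. (1.1) in Python is sketched below; the satisfaction ratings and CPU-usage values are hypothetical stand-ins for the 15-minute samples described above, chosen to mimic a CPU-sensitive user such as user 2.

import numpy as np

def pearson_r(x: np.ndarray, y: np.ndarray) -> float:
    # Pearson's product-moment correlation coefficient, as in Eq. (1.1).
    n = len(x)
    num = n * (x * y).sum() - x.sum() * y.sum()
    den = np.sqrt(n * (x ** 2).sum() - x.sum() ** 2) * \
          np.sqrt(n * (y ** 2).sum() - y.sum() ** 2)
    return num / den

satisfaction = np.array([5.0, 4.0, 4.0, 3.0, 2.0, 2.0, 1.0])         # rating per 15 min
cpu_usage    = np.array([20.0, 35.0, 30.0, 50.0, 70.0, 75.0, 90.0])  # percent
print(pearson_r(satisfaction, cpu_usage))  # close to -1: a CPU-sensitive user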

Regarding the third question, it is indeed important to analyze how users interact with the systems they use. We classify such interactions into three categories. Table 1.1 summarizes the differences between these three categories:

Table 1.1 Three different categories of user-system interaction.

User-system Interaction | Applications | Note
I. Shared, and used by several people at one time | flight schedule monitors, central air-conditioners, etc. | policy-driven, designed for popularity
II. Shared, but only used by one person at one time | ATM machines, equipment in fitness centers, rental cars, computers in libraries, etc. | event-driven, designed for diversification
III. Non-shared; one person owns the system | cell phones, personal digital assistants (PDAs), mp3 players, etc. | user-driven, designed for user satisfaction

The systems in the first category are public and can be used by several people at the same

time. The design of such systems places emphasis on wide accessibility and it always follows

a static policy. Flight schedule monitors, for instance, fall into this category. We suggest

surveying the human dynamics for this category.

The second category of systems is also public, but such systems are only used by one person at a time. Equipment in fitness centers or computers in a library belong to this category. We suggest storing diverse (default) settings for such systems; when a user logs in (i.e., an event occurs), the system can easily adapt to his/her preferences.

The third (and the most difficult to design) category is represented by systems that are personal, such as cell phones, PDAs, or laptops. Due to the high variation in user satisfaction, we suggest minimizing such variations not only during the off-line DSE but also at run-time.

In this dissertation, we focus on designs belonging to the second and third categories; for designs in the first category, there is a need to explore human activity patterns (more discussion is elaborated in Section 8.2.2).

1.3.2. Proposed User-aware Design Methodology

With the above discussion in mind, we now present new methodologies and optimization techniques that involve the users directly in the design flow, as shown in Figure 1.7. More precisely, in contrast to the traditional design flow (see Figure 1.7(a)), we first

incorporate the user experience into the design process in order to minimize the workload

variance; then, we apply further optimizations in order to maximize the overall user

satisfaction (see Figure 1.7(b)). This process has two major steps:


Off-line design: Most system studies suggest two approaches for eliciting the user

requirements [148][149]: i) navigation-by-asking, which can be done by user interviews and contextual enquiry through paperwork, phone interviews, or other media [54], and ii) navigation-by-proposing, which is based on feedback on existing prototypes (limited versions of the product/artifact [24]) or former-generation products. Using these two approaches during

the design process, it is possible to develop more than one model for different types of users

which will incur less variation among the users' behavior¹. We note that during the platform design space exploration step, which is the main focus of the first part of this dissertation [40][41], we target the main features (i.e., critical and predictable workload) of the system from the hardware-resources perspective, with deterministic software running on it (i.e., deterministic resource management, deterministic routing scheme, etc.). In other words, the workloads generated by newly downloaded or updated applications will stress the hardware resources of the SoC in a similar manner as the initial set of applications and therefore incur a minimal penalty.

Figure 1.7 Sketch of (a) traditional and (b) user-centric design flows.

1. To design a brand-new embedded system, without any prior knowledge of the user trace, we suggest using the navigation-by-asking approach in order to come up with the architecture/application template. We also suggest studying the human activity patterns from other related embedded systems for generating meaningful traces.

On-line optimization: Due to various user expectations, a lightweight on-line optimization is proposed to maximize the user satisfaction. Suggested methods include reinforcement learning (i.e., the system learns the behavior through trial-and-error interactions with a dynamic environment [152][153]) and regression (i.e., predicting or forecasting subsequent behavior [36][122]). Of note, system upgradability and extensibility are now considered important features for a wide range of embedded systems, as discussed in Figure 1.4 of Chapter 1.2; that is, the platform should be flexible enough to support various run-time system changes, including newly-downloaded applications, third-party application programs, bug fixes/patches, etc. However, all such updates are typically captured via software replacement (e.g., based on the latest release of the firmware [12][67]) used to upgrade a system already deployed in the field, rather than in the off-line platform design space exploration step. For all such updates, the hardware resources inside the system remain the same, but a different version of the firmware is used to support the application updates. Similar work can be seen in [38], which proposes an on-line user model for dynamic resource management under a real-time operating system, where the parameters are updated according to the newly-downloaded applications.
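As a toy illustration of the regression option only (the actual techniques are surveyed in Appendix A; the feature choice and learning rate below are illustrative assumptions), an on-line user model could be refined with a stochastic-gradient step each time the user reports a satisfaction rating:

```cpp
#include <array>
#include <cstddef>

// On-line linear model: predicted satisfaction = w . x, where the feature
// vector x might hold, e.g., CPU usage, number of active applications, and a
// bias term. (Both the features and eta are assumptions for this sketch.)
struct OnlineRegressor {
    std::array<double, 3> w{};     // model weights, start at zero
    double eta = 0.01;             // learning rate

    double predict(const std::array<double, 3>& x) const {
        double y = 0.0;
        for (std::size_t i = 0; i < w.size(); ++i) y += w[i] * x[i];
        return y;
    }
    // One stochastic-gradient step on the squared prediction error, applied
    // each time the user reports an actual satisfaction rating.
    void update(const std::array<double, 3>& x, double rating) {
        const double err = predict(x) - rating;
        for (std::size_t i = 0; i < w.size(); ++i) w[i] -= eta * err * x[i];
    }
};
```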


1.4. Dissertation Overview

This dissertation focuses on developing new methodologies, design automation and

optimization tools to support embedded NoC design while taking the user experience

information into consideration. The contribution of this thesis can be divided into two parts: 1)

DSE for full-custom embedded NoC with predictable system configurations and 2) user-

centric design methodology handling unpredictable system configurations. In what follows,

we summarize our contribution in these two directions.

1.4.1. DSE for Full-custom NoC with Predictable System Configurations

The first part of the dissertation addresses a new problem for system interconnect design

space exploration of application-specific MPSoCs supporting use-case applications where the

system configuration is given in advance. As a novel contribution, we develop an analytical

model for network-based communication fabric design space exploration and theoretically

generate fabric solutions with optimal cost-performance trade-offs, while considering various

design constraints, such as power, area, and wirelength. For large systems, we propose an efficient approach for obtaining competitive solutions with significantly less computation time.

The accuracy of our analytical model is evaluated via a SystemC simulator using several

synthetic applications and an industrial SoC design.

1.4.2. User-centric Design Methodology Handling Unpredictable System Configurations

The second part of this dissertation focuses on developing a user-centric design

methodology for embedded systems targeting heterogeneous NoC platforms which support

multiple applications interacting with the system, i.e. unpredictable system configurations. In

order to expedite the adoption of the user-centric concept in future embedded systems, we cover the design

space exploration of heterogeneous NoC platforms, as well as the validation process to show

the robustness of the proposed flow (see Section 1.4.2.A). We further apply on-line

optimization processes with the goal of maximizing user satisfaction and the associated design

metrics (see Section 1.4.2.B).

1.4.2.A. DSE methodology for Heterogeneous Embedded NoC

As discussed in Figure 1.2, as opposed to the traditional design flow considering the task-,

resource-, or system-level optimization, our proposed methodology targets one level above,

namely, user-level design. More importantly, through analyzing the users’ interaction with the

system, we are able to provide more robust platforms for applications characterized with high

workload variation. Figure 1.8 outlines the proposed design methodology. Given collected

user traces from existing systems or prototypes, as well as the basic architecture and

application templates, a novel design methodology is proposed for building user-centric

heterogeneous embedded NoCs, which aims at minimizing the workload variance and allows

the system to better adapt to different types of users. This methodology addresses the user

behavior analysis (including classification, similarity, and clustering problems), DSE for

automated NoC platform generation (including model learning problem), and potential

optimization (i.e. regression, reinforcement learning problems). More precisely, we apply

machine learning techniques to cluster the traces from various users into several classes, such

that the differences in user behavior for each class are minimized. Then, for each cluster, we

propose an architecture automation deciding the number, the type, and the location of

resources available in the platform, while satisfying various design constraints. Of note, as

shown with the “*” sign in this figure, five types of problems, i.e. classification, similarity,

clustering, regression, and reinforcement learning, are explored for user-centric embedded

systems design. More details about these five types of problems and related machine

learning techniques are surveyed in Appendix A.
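As one possible illustration of the clustering step (a sketch only, under the assumption that each user trace has been summarized into a numeric feature vector, e.g. average number of applications and CPU-usage statistics; Appendix A surveys the actual techniques), a k-means-style loop could group the traces:

```cpp
#include <vector>

using Feature = std::vector<double>;   // e.g., avg. #apps, CPU mean/variance

// Squared Euclidean distance between two feature vectors.
double dist2(const Feature& a, const Feature& b) {
    double d = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        d += (a[i] - b[i]) * (a[i] - b[i]);
    return d;
}

// Lloyd's algorithm: assign each trace to its nearest centroid, recompute
// the centroids, and repeat for a fixed number of iterations.
std::vector<int> kmeans(const std::vector<Feature>& traces,
                        std::vector<Feature> centroids, int iters) {
    std::vector<int> label(traces.size(), 0);
    for (int it = 0; it < iters; ++it) {
        for (std::size_t i = 0; i < traces.size(); ++i)        // assign step
            for (std::size_t c = 0; c < centroids.size(); ++c)
                if (dist2(traces[i], centroids[c]) <
                    dist2(traces[i], centroids[label[i]]))
                    label[i] = static_cast<int>(c);
        std::vector<Feature> sum(centroids.size(),
                                 Feature(traces[0].size(), 0.0));
        std::vector<int> cnt(centroids.size(), 0);
        for (std::size_t i = 0; i < traces.size(); ++i) {      // update step
            ++cnt[label[i]];
            for (std::size_t d = 0; d < traces[i].size(); ++d)
                sum[label[i]][d] += traces[i][d];
        }
        for (std::size_t c = 0; c < centroids.size(); ++c)
            if (cnt[c] > 0)
                for (std::size_t d = 0; d < sum[c].size(); ++d)
                    centroids[c][d] = sum[c][d] / cnt[c];
    }
    return label;   // label[i] = cluster index of user trace i
}
```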


We have performed multiple experiments on a real embedded system benchmark using realistic user traces, with the goal of minimizing the energy consumption under given price constraints. By incorporating the user experience into the off-line DSE step, the system

platforms generated by our approach achieve about 30% computation energy savings, on

average, compared to the unique platform derived from the traditional design flow shown in

Figure 1.3; this implies that each system configuration we generate is highly suitable for a

particular class of user behaviors.

Figure 1.8 User-centric design flow for heterogeneous NoCs, including user behavior analysis, NoC architecture automation, and the optimization process. Five types of problems (marked with the "*" sign) and their related machine learning techniques are surveyed in Appendix A.

1.4.2.B. Optimizations for NoC-based embedded systems

Having generated system platforms which exhibit less variation among the user behavior,

we explore extensible and flexible run-time resource management techniques that allow

systems to adapt to run-time stimuli specific to different user behaviors. Our NoC-based

embedded systems support a diverse mix of large and small applications running

simultaneously. More precisely, we address the following three problems:

1. Energy- and performance-aware incremental mapping for NoC

Achieving effective run-time mapping on heterogeneous systems is a challenging task,

particularly since the arrival order of the target applications is not known a priori. We

address precisely the energy- and performance-aware incremental mapping problem for

NoC-based platforms and propose an efficient technique with the goal of minimizing

the communication energy consumption of the entire system, while still providing

the required performance guarantees. The proposed technique not only minimizes

the inter-processor communication energy consumption of the incoming application,

but also allows for new applications to be added to the system with minimal inter-

processor communication overhead. Experimental results show that the proposed

technique is very fast and scales very well, and as much as 50% communication energy

savings can be achieved compared to the state-of-the-art task allocation scheme.

2. Fault-tolerant techniques for on-line resource management

Resource utilization and system reliability are critical issues for the overall computing

capability of multiprocessor systems-on-chip (MPSoCs) running a mix of small and

large applications. This is particularly true for MPSoCs consisting of many cores that

communicate via the NoC approach since any failures propagating through the

computation or communication infrastructure can degrade the system performance, or

even render the whole system useless. Such failures may result from imperfect


manufacturing, crosstalk, electromigration, alpha particle hits, or cosmic radiation, etc.

and can be permanent, transient, or intermittent in nature. Therefore, the system configurations become unpredictable under such non-ideal platforms.

Given the above considerations, we are the first to propose a system-level fault-tolerant

approach addressing the problem of run-time resource management in non-ideal NoC

platforms. The proposed application mapping techniques in this new framework aim at

optimizing the entire system performance and communication energy consumption,

while considering the static and dynamic occurrence of permanent, transient, and

intermittent failures in the system. As the main theoretical contribution, we address the

spare core placement problem and its impact on system fault-tolerant (FT) properties. At

the same time, several critical metrics are investigated for providing insight into the

resource management process. A FT application mapping approach for non-ideal NoC

platforms is then proposed to solve this problem. Experimental results show that our

proposed approach is efficient and highly scalable; significant throughput improvements

can be achieved compared to the existing solutions that do not consider possible failures

in the system.

3. User-aware dynamic task allocation

Users’ dynamic interactions with the system result in different system configurations,

which cannot be predicted and modeled at design time. Consequently, determining how

to react to run-time stimuli the system receives, while maintaining high performance is

a major objective of this dissertation. As a novel contribution, we incorporate the user behavior information in the resource allocation process; this allows the system to better

respond to real-time changes and adapt dynamically to different user needs. In other

words, the technique is well-suited to be embedded in future products (cell phones,

PDAs, multimodal games, etc).


Several algorithms are proposed for solving the task allocation problem, while

minimizing the communication energy consumption and network contention resulting

from the same or different applications. We further present a light-weight machine

learning technique for boosting the user model at run-time. Experimental results

show that, for real applications, by considering the real user behavior information and building the user model on-line, we can achieve around 75.8% communication energy savings compared to the state-of-the-art task allocation scenario on the NoC platform.

1.5. Dissertation Organization

The OS-controlled NoC architecture, application model and the associated energy model

on the target embedded MPSoCs supporting one or multiple applications are first introduced

in Chapter 2. The full-custom NoC platform design with predictable system configurations is explored in Chapter 3. Then, for platforms having unpredictable system configurations, we

present a new design methodology for automatic platform generation of future embedded

NoCs, while including explicitly the information about the user experience into the design

process (Chapter 4). Having generated system platforms which exhibit less variation among

the users’ behavior, in Chapter 5, we present the incremental mapping techniques for

supporting applications interacting with the embedded NoC platforms. Following that, in Chapter 6, considering more general platform scenarios, we address the system reliability issue and present FT application mapping techniques for the target platforms where permanent,

transient, and intermittent failures may happen in the system. In Chapter 7, while observing the

major variation coming from users’ interaction with the system, we explore flexible and

extensible run-time resource management techniques that allow the system to adapt to run-time

stimuli specific to each class of user behaviors; these techniques can change dynamically

according to the user model built based on user needs.


Following these off-line DSE and on-line optimization techniques for user-centric

embedded systems, we summarize our contributions and discuss some interesting open

problems in user-centric design in Chapter 8. Finally, we study related machine learning

techniques helping user-centric embedded system design in Appendix A. In Appendix B, the

integer linear programming (ILP) model is built for investigating critical factors on system

performance, where the conclusion has been used for supporting the run-time resource

management optimization as explained in Chapter 6 and Chapter 7.


2. EMBEDDED NOC PLATFORM CHARACTERIZATION

In order to better illustrate the methodologies, algorithms and ideas of user-centric

embedded NoC designs developed in this dissertation, the platform characterization and user trace descriptions are needed. This chapter first provides a discussion of the suitable NoC

platform for handling predictable and unpredictable system configurations, respectively.

Finally, the application and energy models reflecting the user traces are described.

2.1. NoC Architecture

NoC represents a novel communication paradigm for systems-on-chip [47][134]. The NoC solution brings a networking approach to on-chip communication and provides notable improvements in terms of performance, scalability, and flexibility over traditional bus-based or more complex hierarchical bus structures (e.g. AMBA, STBus) [94]. In general, the NoC architecture consists of multiple heterogeneous processors/resources and storage elements interconnected via a packet-switched network. For NoC platforms targeting one or several use-case applications resulting in a few predictable system configurations, it is

necessary to discuss the design space exploration of NoC topology with several design

metrics, e.g. physical effects (SoC floorplan, total wirelength, maximum wirelength, area

overhead of interconnect fabrics), and other tight design parameters (application deadlines,

system performance, communication power consumption). More details for exploring the full-

custom NoC platform design are shown in Chapter 3.


From Chapter 4 to Chapter 7 in this dissertation, our target NoC platform supports multi-

processing where multiple applications are able to enter and leave the system dynamically,

resulting in unpredictable system configurations. Under such multi-processing paradigm with

various unpredictable system configurations, there is no way to customize the communication

architecture; instead, a NoC with a regular topology (i.e. mesh, torus, ring) would be more

suitable. Although most of the work presented in this dissertation is applicable to other

topologies as we discuss when appropriate in the remaining chapters of this dissertation, we

assume our target NoC platform consists of multiple resources or processing elements (PEs)

interconnected by a 2-D H × W mesh network, as shown in Figure 2.1. The system can be

either homogeneous (i.e., identical PEs integration) or heterogeneous (i.e., consist of

different types of PEs or PEs operating at different voltage and frequency levels1). We

formulate the NoC platform in a generalized way, while illustrating the properties of

computation components, communication components, and the control scheme under such

platform.

• Computation components in NoC platform: Assume there exist n different types of

PEs/resources $r_i$, i.e. $r_1, r_2, ..., r_n \in RE$, having different computation capabilities $CC(r_i)$ in the platform, where $CC(r_1) \leq CC(r_2) \leq ... \leq CC(r_n)$². $N(r_i)$ represents the

number of resources of type ri in the platform. Therefore, the NoC-based MPSoC

platform can be characterized as Λ = (A, Ω(A)) where A = (N(r1), N(r2), ..., N(rn))

represents a resource set, capturing the number and the types of PEs integrated in the

1. The PEs operate at fixed voltage and frequency levels which are selected from a finite set $(V_i, f_i)$. When the voltage level of a PE is different from that of the network, mixed-clock first-in-first-out buffers (FIFOs) need to be utilized. We also assume that the voltage/frequency assignment for PEs (or the voltage island partitioning problem) is already determined using an approach similar to the one presented in [119].

2. We note that for some MPSoCs supporting memory-intensive applications, i.e. video/audio or multimedia, the location of a PE in Figure 2.1 can be replaced with a block of memory module if necessary.


platform while Ω(A) represents the precise location of each PE in platform Λ (i.e.

resource mapping).

• Communication components in NoC platform: The communication infrastructure

consists of a data network and a control network (shown as solid and dotted lines,

respectively, in Figure 2.1), each containing routers and channels connected to the PEs

via standard network interfaces (NIs). The data network delivers data packets among

PEs under a wormhole routing scheme [113], while the control network (i.e., the routers

and links represented by dotted lines in Figure 2.1) is used to move around the control messages sent from the global manager (GM). The data and control networks are

separated to ensure that data in the data network does not interfere with the control

messages in the control network. For large NoCs, it is suggested to have multiple

distributed managers, instead of one global manager, along with a hierarchical control mechanism, similar to the cluster locality idea proposed in [110].

Figure 2.1 Homogeneous or heterogeneous 2-D mesh NoCs with PEs interconnected via the data and control networks, described in a generalized way. (PE: processing element; NI: network interface; R: router; OS: operating system; GM: global manager.)

• Control schemes in NoC platform: At least one of the PEs acts as a GM, i.e., master

PE, operating under the control of an operating system (OS), while others can be

considered as slave PEs (see Figure 2.1); each of them is an independent sub-

system, including the processing core (control unit and datapath) and its local

memory. Of note, the real-time OS in our embedded system should be designed to

be compact and efficient. We assume that such OS supports non-preemptive multi-

tasking and event-based programming. More precisely, the OS provides predictable

and controllable resource management, which includes monitoring the user’s

behavior and making the task allocation/mapping decision only when new events

occur (i.e. an application enters the system); the slave PEs are responsible for executing

the tasks/jobs assigned to them by the GM.

Here, we first provide a more thorough description of the control scheme via the GM and the control network. An accompanying discussion of the router micro-architecture (arbitration, buffers, etc.) that handles the control network and its area overhead is then included, as well as a complete energy estimation via simulation of the control network.

• Operation of the GM and the control network: The task of the GM is to continuously

track the status of the PEs (idle/available or used/unavailable) in the system. When an

incoming application Q enters the system, the GM runs our incremental mapping

process and makes the run-time decision for the incoming application Q. After the

mapping decision is taken, the necessary resources are allocated to the tasks of this

incoming application, and the application starts executing. Once the application Q

finishes its execution and leaves the system, the PEs assigned to the application Q send

their address back to the GM through the control network to notify the GM that they


become available. This way, the GM always knows the status of the PEs in the system

and can take further decisions for new applications. Therefore, we do not need a fully

connected network, but just a tree that accumulates the messages for the GM, where the

structure of it is equivalent to a broadcast tree obtained by reversing the direction of all

edges as shown in Figure 2.2(a). We note that the architecture of our proposed control

network is designed here for the specific purpose; other different types of control

networks can also be built into the platform for supporting different types of control

messages [164].

• Design of the control network: As shown in Figure 2.1, the control network has

limited connectivity requirements, and it is physically separated from the data network.

More precisely, these two networks do not share any circuitry, such as links or buffers.

• Area overhead of our proposed network: In terms of implementation, we employ the

router described in [94] for the data network. In order to evaluate the overhead of the

control network, we add extra buffers and a MUX/DEMUX pair to the existing router

used for the data network (see Figure 2.2(b)). After that, we implemented a 4 × 4 mesh network using both the original routers and the modified routers. Finally, the designs are synthesized using Xilinx ISE to evaluate the area overhead (see the data in Table 2.1).

Figure 2.2 (a) The logical view of the control network. (b) The on-chip router micro-architecture that handles the control network.

• Energy overhead of the control messages in the control network: In terms of the

energy overhead for delivering the control messages, we utilize the bit energy model

[170], which is the same metric used when dealing with data messages, as will be explained later in Section 2.3. The energy consumption for transmitting the control

messages is related to the location of the GM and the amount of control messages. As

mentioned before, the control network is used only to send the status information from

PEs to the GM. This status information includes the address of the PE and an extra bit of

information showing whether the PE is busy or idle. Therefore, the size of the control messages depends on the network size. For a W × H NoC, we only need $\lceil \log_2(W \times H) \rceil$ bits to decode the address of each PE. Obviously, the volume of the control messages is

much smaller than the volume of the data messages, which is usually in Megabytes per

second for embedded applications. Moreover, the architecture view of the control

network is much simpler than the data network, as described before. Consequently, the

control network is expected to have significantly smaller energy consumption compared to the data network. Indeed, if the energy consumption to send the information from the PEs to the GM becomes comparable to that of the data communication due to the increasing NoC platform size, then a more sophisticated hierarchical control mechanism is more suitable.

Table 2.1 Impact of adding the control network on area. The synthesis is performed for a Xilinx Virtex-II Pro XC2VP30 FPGA.

                                                 # of slices
  one router in 'pure' data network                     392
  one router in our proposed network                    401
  area overhead                                        2.3%
  4 × 4 mesh network with 'pure' data network          6737
  4 × 4 mesh network with our proposed network         6891
  area overhead                                        2.2%
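A minimal sketch (hypothetical types and message layout, not the actual RTL) of the GM bookkeeping and PE status messages described above:

```cpp
#include <cstdint>
#include <vector>

// One status message on the control network: the PE's (x, y) address plus a
// single busy/idle bit, i.e., ceil(log2(W*H)) + 1 bits of payload.
struct StatusMsg { std::uint8_t x, y; bool busy; };

// The global manager tracks which PEs are available for incoming applications.
class GlobalManager {
    int W, H;
    std::vector<bool> busy_;              // one flag per PE, row-major order
public:
    GlobalManager(int w, int h) : W(w), H(h), busy_(w * h, false) {}

    // Invoked whenever a status message arrives over the control network.
    void on_status(const StatusMsg& m) { busy_[m.y * W + m.x] = m.busy; }

    // Idle PEs are the candidates when mapping a newly arrived application.
    std::vector<int> idle_pes() const {
        std::vector<int> out;
        for (int i = 0; i < W * H; ++i)
            if (!busy_[i]) out.push_back(i);
        return out;
    }
};
```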

2.2. Application Modeling

Assume the proposed embedded system supports m different applications $q_i \in Q$, where i = 1, ..., m. Similar to the off-line analysis in [30][127], each application $q_i$ can be characterized by the Application Characterization Graph $ACG^{q_i} = (V^{q_i}, E^{q_i})$. Each ACG (see Figure 2.3) is represented as a directed graph with the following properties:

• Vertices: Each vertex $v_j^{q_i} \in V^{q_i}$ represents a cluster of tasks in application $q_i$. Tasks belonging to the same cluster/vertex should run on their own PE. Each vertex has its minimal computation requirement MCR($v_j^{q_i}$) at which it should operate in order to meet the application deadlines³. Of note, the vertex $v_j^{q_i}$ to resource $r_i$ mappings are one-to-one, where the mapping function is denoted as map(·), i.e., map($v_j^{q_i}$) = $r_i$. In addition, the power profiling of each application at the vertex level on different types of PEs is assumed to be available, where $P(v_j^{q_i}, r_k = \mathrm{map}(v_j^{q_i}))$ represents the power consumption while vertex $v_j^{q_i}$ maps/executes on resource $r_k$. Of note, for some memory-intensive applications, e.g. multimedia, a vertex can also be characterized as a buffer or memory unit and needs to be assigned to the corresponding memory block in the platform (see Footnote 2 in this chapter).

• Edges: Each directed edge $e_{jk}^{q_i} \in E^{q_i}$ characterizes the communication between vertex $v_j^{q_i}$ and vertex $v_k^{q_i}$, while the weights comm($e_{jk}^{q_i}$) stand for the communication rate (i.e., bits per time unit) from vertex $v_j^{q_i}$ to vertex $v_k^{q_i}$.

Figure 2.3 Application Characterization Graph (ACG) characteristics. The tasks belonging to the same vertex are mapped onto the same PE. Each edge represents the communication between two nodes.
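For concreteness, a minimal C++ sketch of one way to represent an ACG (hypothetical field names, not the data structures used in our tools):

```cpp
#include <vector>

// One vertex of an ACG: a cluster of tasks that must share a single PE.
struct ACGVertex {
    double mcr;            // MCR(v_j): minimal computation requirement
    int mapped_resource;   // map(v_j) = r_i (one-to-one mapping)
};

// One directed edge v_j -> v_k carrying comm(e_jk) bits per time unit.
struct ACGEdge {
    int src, dst;          // indices into the vertex array
    double comm_rate;      // comm(e_jk), in bits per time unit
};

// ACG^{q_i} = (V^{q_i}, E^{q_i}) for a single application q_i.
struct ACG {
    std::vector<ACGVertex> vertices;
    std::vector<ACGEdge>   edges;
};
```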

2.3. Trace-based Energy Modelling

Multiprocessor embedded computing systems are always designed with the goal of

consuming less power. The trend of adding heterogeneity helps lower the power demands while

maintaining performance. For a reasonable platform modeling, as seen in Section 2.1, we

have n different types of PEs with different computation capabilities in our NoC platform.

Here, a system-level energy model for such a NoC-based MPSoC platform is presented. It is worth mentioning that, as reported for the MIT RAW on-chip network, the communication

energy consumption represents 36% of the total energy consumption [20]. Therefore, in order

to achieve the accuracy of the high level system modeling, our proposed energy model covers

both computation and communication modules. We first formulate the user traces, which are recorded from relevant users over time, and then describe the computation energy model and the communication energy model, respectively, for such trace-based user patterns.

3. Note that, while dealing with the off-line task partitioning process, in order to meet the application deadline, we use the worst case communication time for communications between nodes (i.e., longest communication path). Moreover, we use the worst case execution time (WCET) for data-dependent tasks, where the WCET of a task is the maximum length of time that the task takes to execute on a resource with a certain computation capability.


2.3.1. User Trace Modeling

As we mentioned in Section 1.3.2, the best way for collecting the user traces is either from

the product prototype or from products belonging to an earlier generation of systems.

However, since such NoC-based products are rather difficult to have access to, we collect user

patterns/traces by monitoring the behavior of the Windows XP environment as users log in and log off the system. By collecting real traces, as opposed to generating traces based on traditional distributions like the heavy-tailed or exponential distribution, we are able to directly

capture the essence of human behavior while users interact with computing systems.

Before explaining the collected user traces, some terminology needs to be introduced. Let

[t1, App Q, t2] characterize an event where an application Q enters and then leaves the system

during a specific time period between two arbitrary moments in time t1 and t2. A session is a

sequence of events between a user signing in and out of a system. An episode is a discrete

period extracted from a session. The behavior of any user is defined as a set of consecutive,

overlapping events spanning a given period of interaction between the user and the system. In

order to learn the user’s behavior, it is desirable to examine long episodes, or even entire

sessions, as this would generate more accurate user data.

In our experiments, we collect multiple sessions from twenty users within three months.

Each session is represented as discrete time sequences $\Re_i = \langle \Re_i^t \rangle$ sampled every 10 minutes⁴, collected from user i while logged into the system. Each element $\Re_i^t = \{q_1, q_2, ...\}$ represents the set of applications actively running in the system at discrete time t. For example, from the session of user i,

$$\Re_i = \langle \Re_i^1, \Re_i^2, \Re_i^3, \Re_i^4, \Re_i^5, \Re_i^6, \Re_i^7, \Re_i^8, \ldots \rangle = \langle \{q_1\}, \{q_1, q_2\}, \{q_2\}, \{q_1, q_2, q_3\}, \{q_1, q_3\}, \{q_2, q_3\}, \{q_1, q_3\}, \{q_2\}, \ldots \rangle \qquad (2.1)$$

4. Here we set 10 minutes for sampling the collected user traces at the application level, with the consideration of reasonable simulation time for our experiments. For obtaining more accurate user interaction with the system, it is suggested to sample at higher rates, e.g. every minute, or to collect data at the thread or process level.

it is intuitive to see that application 2 enters the system at time 2 and leaves at time 5.

Application 1 enters the system at time 1 and leaves at time 3, but later enters again at time 4,

etc. We can further obtain several events from this session, e.g. [1, App 1, 3], [2, App 2, 5], [4,

App 3, 8], [6, App 2, 7].

Therefore, by doing such an experiment, the collected information includes the detailed

time sequence of application usage in the system; that is, we know how many and which applications the user frequently accesses, when, and for how long.

Those traces will be utilized for later experiments discussed in Chapter 4 and Chapter 7.
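As an illustration (hypothetical types, not the actual Windows XP instrumentation code), a session and the extraction of events [t1, App Q, t2] from it can be sketched as:

```cpp
#include <map>
#include <set>
#include <vector>

// One sample of a session: the set of application IDs active at a discrete
// time step (our traces are sampled every 10 minutes).
using AppSet  = std::set<int>;
using Session = std::vector<AppSet>;      // R_i = <R_i^1, R_i^2, ...>

// An event [t1, App Q, t2]: application Q enters at time t1 and leaves at t2.
struct Event { int t1; int app; int t2; };

// Scan a session and emit one event per contiguous run of each application.
// (Applications still active when the session ends produce no event here.)
std::vector<Event> extract_events(const Session& s) {
    std::vector<Event> events;
    std::map<int, int> enter;             // app -> time step it entered
    for (int t = 0; t < static_cast<int>(s.size()); ++t) {
        for (int q : s[t])                // newly observed apps enter at t+1
            enter.emplace(q, t + 1);      // time steps are 1-based, as in (2.1)
        for (auto it = enter.begin(); it != enter.end(); ) {
            if (s[t].count(it->first) == 0) {   // app absent: it left at t+1
                events.push_back({it->second, it->first, t + 1});
                it = enter.erase(it);
            } else {
                ++it;
            }
        }
    }
    return events;   // e.g., yields [1, App 1, 3], [2, App 2, 5], ... for (2.1)
}
```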

2.3.2. Computation Energy Modeling

$E_{comp}(\alpha, \{\beta\})$ represents the computation energy consumption while running α on {β}, where α can be a vertex $v_i$, an application $q_i$, or a user trace $\Re_i$, and {β} stands for a resource set with one or multiple available resources able to run α. Assuming that the power consumption of each task running on a certain specific resource is obtained by off-line analysis and given in advance (as explained in Section 2.2), $E_{comp}(\alpha, \{\beta\})$ can be obtained as a linear summation of the computation costs of running each vertex on the corresponding resources. Assuming the duration of the trace α spans from time 0 to $T_\alpha$,

$$E_{comp}(\alpha, \{\beta\}) = \sum_{\forall v_i \text{ in app } q_i \text{ or trace } \Re_i} \left( P\!\left(v_i^{q_i}, r_k = \mathrm{map}\!\left(v_i^{q_i}\right)\right) \times \sum_{t=1}^{T_\alpha} \Delta_{v_i}(t) \right) \qquad (2.2)$$

where $\Delta_{v_i}(t)$ is 1 if vertex $v_i$ is running on the system, and 0 otherwise. Throughout this dissertation, we use the embedded benchmark suite from [50], which profiled the codes


of real embedded applications from the EEMBC benchmarks (available at www.eembc.org) onto several commercial processors and reported the power profiles for each vertex and the corresponding graph $ACG^{q_i} = (V^{q_i}, E^{q_i})$ for application $q_i$, as well as the idle power for each commercial processor.
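A minimal sketch of Equation (2.2) (hypothetical types; the power values $P(v_i, \mathrm{map}(v_i))$ are assumed to come from the off-line profiling described above):

```cpp
#include <vector>

// Per-vertex inputs to Equation (2.2): the profiled power P(v_i, map(v_i))
// and the activity indicator Delta_{v_i}(t) over the trace duration.
struct VertexActivity {
    double power;                   // P(v_i, r_k = map(v_i)), from profiling
    std::vector<bool> active;       // active[t] == Delta_{v_i}(t+1)
};

// Equation (2.2): computation energy is the sum, over all vertices in the
// trace, of profiled power times the number of time units the vertex runs.
double computation_energy(const std::vector<VertexActivity>& vs) {
    double e = 0.0;
    for (const VertexActivity& v : vs) {
        int running = 0;
        for (bool on : v.active) running += on;   // sum_t Delta_{v_i}(t)
        e += v.power * running;
    }
    return e;
}
```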

2.3.3. Communication Energy Modeling

The communication energy modeling for NoC architectures has been explored in the

literature [170]. Ye et al. in [170] built the bit energy metric (with bit-level accuracy) for

modeling the energy consumption in a communication network. More precisely, it tracks

the dynamic energy consumed when transmitting one bit of data from the source to the

destination PEs through the whole network fabrics, including interconnect wires, arbiters,

input/output buffers, and crossbar for routing the data.

In this dissertation, we choose the bit energy model from [170] as it provides an efficient

approximation for the network fabrics under consideration, with reasonable accuracy at the system level of abstraction. For transmitting one bit through each network fabric component

(interconnect wires, buffers, etc.), we obtain the parameters from the Predictive Technology

Model (PTM) [129] which provides accurate, customizable, and predictive model files for

future transistor and interconnect technologies. We believe that with PTM, the modeling of

the system interconnect is accurate enough even before the advanced semiconductor

technology is fully developed. Here, we give the details of the communication energy

modelling for our NoC platform supporting worm-hole switching and minimal-path routing.

$E_{comm}(\alpha, \{\beta\})$ represents the total communication energy consumption of running α (user traces recording the behavior of a set of applications over a finite period of time) on the resource set {β} from time 0 to $T_\alpha$ (the duration of user trace α), while vertices having higher communication are assigned to PEs as closely as possible:

$$E_{comm}(\alpha, \{\beta\}) = \sum_{\text{all applications}} \left( E_{comm}^{App\,Q} \times \sum_{t=1}^{T_\alpha} \Delta_{App\,Q}(t) \right) \qquad (2.3)$$

where $\Delta_{App\,Q}(t)$ = 1 if application Q is active in the system between time t-1 and t, and 0 otherwise. The communication energy consumption of any application Q per time unit is calculated as follows:

$$E_{comm}^{App\,Q} = \sum_{\forall e_{ij} \in E \text{ in App } Q} comm(e_{ij}) \times E_{Bit}(e_{ij}) \qquad (2.4)$$

where $comm(e_{ij})$ is the communication rate of an edge $e_{ij}$ in application Q (in bits per time unit), and $E_{Bit}(e_{ij})$ stands for the energy consumption to send one bit between the PEs where vertices $v_i$ and $v_j$ are allocated (in Joules per bit). More precisely,

$$E_{Bit}(e_{ij}) = (MD(e_{ij}) + 1) \times E_{Rbit} + MD(e_{ij}) \times E_{Link} \qquad (2.5)$$

The term $MD(e_{ij})$ represents the Manhattan Distance between the PEs where vertices $v_i$ and $v_j$ are allocated. The parameter $E_{Rbit}$ stands for the energy consumed in the routers, including the crossbar switch and buffers, while $E_{Link}$ represents the energy consumed in one unit link, for one bit of data; these parameters are assumed to be constants obtained from the PTM model.

We note that the parameters $E_{Rbit}$ and $E_{Link}$ would be different under different circuit designs, post-silicon devices, wirelengths and bandwidths, or even different semiconductor technologies [129]. Here, we set them to fixed values such that the overall computation-to-communication energy consumption ratio is about 7:3, similar to the observation from the MIT RAW on-chip network [20].
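A minimal sketch of Equations (2.4) and (2.5) (hypothetical types; the E_Rbit and E_Link values are assumed to be supplied from the PTM-derived constants):

```cpp
#include <vector>

// Equation (2.5): energy to move one bit between the PEs hosting v_i and v_j,
// where md = MD(e_ij) is their Manhattan distance. E_Rbit and E_Link are the
// per-bit router and per-unit-link energies (constants from the PTM model).
double e_bit(int md, double e_rbit, double e_link) {
    return (md + 1) * e_rbit + md * e_link;   // md+1 routers, md link units
}

// Equation (2.4): per-time-unit communication energy of one application, as
// a sum over its edges of comm(e_ij) times E_Bit(e_ij).
struct Edge { int md; double comm_rate; };    // MD(e_ij), bits per time unit

double app_comm_energy(const std::vector<Edge>& edges,
                       double e_rbit, double e_link) {
    double e = 0.0;
    for (const Edge& ed : edges)
        e += ed.comm_rate * e_bit(ed.md, e_rbit, e_link);
    return e;
}
```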


3. SYSTEM INTERCONNECT DSE FOR FULL-CUSTOM NOC

PLATFORMS

3.1. Introduction

In the foreseeable future, it is expected that more computing resources will be integrated

into systems built at nanoscale. Consequently, the interconnect infrastructure plays a crucial

role for building truly scalable platforms [28]. For application-specific multi-processor

systems-on-chip (MPSoCs) supporting one or a few dedicated applications resulting in

predictable system configurations, customizing the computation and communication

architecture is needed in order to optimize various design metrics, such as power

consumption, throughput, area overhead, wirelength cost, etc.

Interconnect topology and protocol design are both critical steps while designing the chip

communication infrastructure. We note that most of the existing interconnect solutions are

developed to support a specific standard or in-house communication protocol [75]. In this

chapter, we propose a new approach which improves the system interconnect for data

transmission, while leaving the protocol untouched. This facilitates design re-use and minimizes the design effort, both of which are critical to meet tight time-to-market constraints.

Starting from these considerations, the goal of this chapter is to develop a new

methodology for system interconnect exploration that allows designers to easily take

meaningful (system-wide) optimization decisions. More precisely, our approach optimizes the

data communication phase by replacing the bus-based interconnect with a NoC architecture,

while minimally altering the control phase of the communication protocol. In other words,


instead of providing a specific interconnect solution which satisfies the imposed design

constraints, the proposed approach explores a class of communication architectures

exhaustively, while considering the available floorplanning information [2][57]. As a result,

our approach can help designers find the Pareto-optimal solutions trading off power,

performance and other design metrics that account for various physical-level effects.

Due to the high resource reuse it enables, this hybrid approach can be easily integrated

into an up-to-date design flow in industry, as opposed to forcing a sudden paradigm change

towards fully NoC-based designs. To the best of our knowledge, this is the first attempt to

implement a hybrid communication model where the data phase (i.e. data transmission)

happens via the NoC approach, while the control phase from the original protocol design is

kept unchanged. Towards this end, our main contributions are as follows:

• The SIDE framework is proposed in an analytical manner for system interconnect

design space exploration, which allows a single run to explore multiple design points

that trade off various design metrics (e.g. average packet latency, area cost, wirelength).

The accuracy of the proposed analytical model is further validated using a SystemC

simulation model specifically developed for this work.

• To reduce the exploration complexity, we also propose a heuristic approach which

achieves three orders of magnitude reduction in runtime, while still providing high

quality solutions compared to the optimal solution.

• By taking the floorplanning information into account while enumerating various system

interconnect topologies, we are able to produce optimal placement of resources across

the communication fabric.

Taken together, these contributions represent an important step towards providing

designers an efficient analytical solution for system interconnect in application-specific


MPSoCs. Of note, the terms system interconnect and communication fabric are used

interchangeably in this chapter.

The remainder of this chapter is organized as follows. In Section 3.2, we review the

related work. The general MPSoC platform with the related interconnect problem and design

space exploration flow are described in Section 3.3, while new optimization algorithms are

presented in Section 3.4. Experimental results in Section 3.5 show the accuracy and efficiency

of our system interconnect exploration under realistic benchmarks and an industrial case

study. Finally, we summarize our contribution in Section 3.6.

3.2. Previous Work

There exists a significant body of work on synthesizing and generating bus-based systems.

For instance, Sonics MicroNetwork [156] is a TDMA-based bus system handling different

access patterns and interrupt schemes of the intellectual property (IP) modules, while still providing a high bandwidth. The STBus from STMicroelectronics is a flexible and high-performance

communication infrastructure based on shared buses and support for advanced protocol

features, such as out-of-order and multi-threading [160].

NoCs have been recently proposed as a promising solution to solve the scalability problem

in bus-based systems [17][47]. For application-specific NoCs, using a regular topology is not always a good choice. Instead, topology selection and synthesis are becoming critical steps in the design of an efficient communication architecture [107][168]. For instance, Yan et al. propose greedy algorithms with Steiner-tree methods for solving the NoC synthesis problem

[168]. The NoC synthesis problem considering physical effects, such as floorplanning and

wirelength, is discussed in [18][78][85][108][158]. Ascia et al. propose a genetic approach for the NoC mapping problem considering multiple objectives [5]. In addition, other tools, such as


xpipes [85], NetChip [18], for NoC architecture automation and interconnect modeling

[9][120][170] are proposed for system-level communication optimization.

3.3. System Interconnect in MPSoC

3.3.1. General Framework for Application-specific MPSoC

Figure 3.1 shows a generic architecture for application-specific MPSoCs. As seen, such a

platform consists of multiple computing modules, i.e. general-purpose processors, graphics processing units (GPUs), digital signal processors (DSPs), intellectual property (IP) blocks such as video/audio processors, and related peripheral input/output (I/O) controllers. These modules are not

only communicating with each other through the system interconnect, but also connected to

the off-chip memory and I/Os, such as universal serial bus (USB) devices, serial advanced

technology attachment (SATA), universal asynchronous receiver/transmitter (UART),

peripheral component interconnect (PCI).

Figure 3.1 Block diagram for a general MPSoC platform.

Traditional system interconnect uses the bus-based protocol which consists of two main

phases, namely a control phase and a data phase. A complete data transmission from a source

IP block to a destination block needs to complete the control phase first and only then the data

phase can proceed. The control phase follows the general handshake protocol (with VALID/

READY signals and exchanges of data packet information like data size, data priority etc.) to

ensure that the data are successfully transferred to/from buffer through the up-stream and

down-stream data buses, respectively.

The general platform with multiple IPs connecting with the system interconnect is shown in

Figure 3.2. In this representation, each IP block bi has only one input and one output port

denoted as “ini” and “outi”, respectively. The data packets are sent/received to/from the

communication fabric through the network interfaces. In general, the communication fabric

consists of multiplexers (mux), repeaters for transmitting packets over long links, links of different widths, storage elements (e.g., elastic buffer memory, static random access memory, scratchpad memory, dynamic random access memory, etc.), and other control circuitry (e.g., the arbiter).

Figure 3.2 General platform with multiple IPs communicating via the system interconnect.

All signals in the control phase are managed by the system arbiter, which drives

multiple data transactions and supports out-of-order transaction completion too. In addition,

each IP block of the platform can act as a master, slave, or both under such protocol. A master

initiates read or write (R/W) requests to the arbiter, while a slave can only respond to such R/W

requests from the arbiter. However, such a bus-based protocol is not scalable and easily becomes a performance bottleneck as the number of IP blocks in the platform increases (i.e., ten-

plus IPs). Therefore, our idea is to keep the control phase (i.e. communication protocol) as it is

and build other communication fabric for data transmission among IPs. In addition, the system

arbiter is not only responsible for handling and scheduling the send or receive requests from

master or slave IP blocks, but also for setting up the path similar to circuit switching

techniques for data transmission on the proposed communication infrastructure. The problem

formulation for system interconnect is discussed next.

3.3.2. System Interconnect Problem Formulation

The system configurations are assumed to be predictable and may derive from one or multiple use-case applications [105]. Similar to the application modeling in Section 2.2, the use-case applications are decomposed into a set of communicating tasks via static analysis and simulation and are characterized by an application characterization graph ACG = (B, E),

which is a directed graph where each vertex bi in B represents an IP block, while each directed

edge eij in E characterizes the communication flow from vertex bi to vertex bj. The weights

comm(eij) stand for bandwidth values (in bits per second) required for communication from

vertex bi to vertex bj. The system interconnect problem can be formulated as follows:


Given i) Floorplan of the system with information about placement regions or exact

locations for input and output ports, ii) the ACG and iii) design metrics and constraints (e.g.,

wirelength, area, power);

Objective - Explore a class of communication architectures that trades-off the system

performance and other design metrics, while meeting all the imposed constraints (i.e.,

maximum wirelength and communication fabric power-consumption overhead).

The high-level view of the interconnect synthesis problem with |B| = 4 IP blocks is shown

in Figure 3.3. Figure 3.3(a) explores the system interconnect design space, trading off the system performance and the area/wirelength overhead, shown on the left and right y-axes, respectively. As shown, there are two extreme points in the design space. The left-most point illustrates the

system interconnect corresponding to a traditional bus for four IP blocks1: the data can be sent

from any source IP to the up-stream bus and then be stored in the buffers. Later on, once the

destination IP is ready for receiving data, the stored data can be delivered through the down-

stream bus to its destination IP. Such a communication model is simple and typically has a

low area and wirelength overhead (see Figure 3.3(b)). However, such a bus model will suffer

from poor performance and scalability issues when integrating more IP blocks in the system

since all data transmissions need to share the same wires (i.e. the up-stream and down-stream

buses).

Figure 3.3(c) plots the other extreme case, i.e. fully connected switches. Intuitively, the

system performance of such a model is much better than the one in Figure 3.3(b), since each

IP can communicate with any other IP, at any time, through its own mux without sharing it

1. In this chapter, a mux as in Figure 3.3 represents a switch with an arbiter, so it has routing capabilities.


with the other IPs. However, the area and wirelength overhead under this model is much

higher than the one in Figure 3.3(b).

While keeping in mind these two extreme cases, our approach aims at exploring a class of

communication architectures for any specific application ACG=(B,E) and determining the

Pareto-optimal set which trades off the system performance against area and wirelength, while

satisfying various design constraints (i.e., maximum wirelength, power-consumption

overhead). For our simple example, Figure 3.3(d) shows one possible Pareto solution with

reasonable system performance and area/wirelength overhead.

Figure 3.3 (a) System interconnect design space trading off the system performance and area/wirelength overhead. (b) Traditional bus model connecting four IP blocks. (c) Fully connected switches with four IP blocks. (d) Possible optimized communication fabric for four IP blocks. (BUFFER can be an on-chip or off-chip memory.)

The solution space of this interconnect problem, from a logical view, follows the |B|th Bell

number, also known as the set partitioning problem, where Bn is the number of ways that a set

of n elements can be partitioned into non-empty subsets [13]. For example, B4 = {(1234),

(1)(234), (2)(134), (3)(124), (4)(123), (12)(34), (13)(24), (14)(23), (12)(3)(4), (13)(2)(4),

(14)(2)(3), (23)(1)(4), (24)(1)(3), (34)(1)(2), (1)(2)(3)(4)}, i.e. elements 1, 2, 3, and 4 are

partitioned into 1, 2, 3, or 4 subsets. Also, it is clear that the interconnects in Figure 3.3(b), (c),

and (d) correspond to sets (1234), (1)(2)(3)(4), and (12)(34), respectively.

We note that the solution space grows rapidly, roughly as $O\left((n/\log n)^n\right)$ ($B_1$ through $B_{10}$ are 1, 2, 5, 15, ..., 115975). In this work, we first propose an algorithm to cover the entire

design space for obtaining the Pareto-optimal solutions (Section 3.4.1); later on, we propose a

heuristic for larger systems to efficiently cover most of the Pareto-optimal solutions as

detailed in Section 3.4.2.
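For reference, the Bell numbers quoted above follow the Bell-triangle recurrence; a minimal sketch:

```cpp
#include <cstdint>
#include <vector>

// Bell-triangle recurrence: bell(n) counts the set partitions of n elements,
// i.e., the size of the interconnect solution space for n IP blocks
// (bell(1)..bell(10) = 1, 2, 5, 15, 52, 203, 877, 4140, 21147, 115975).
std::uint64_t bell(int n) {
    std::vector<std::uint64_t> row{1};                // triangle row for n = 1
    for (int i = 1; i < n; ++i) {
        std::vector<std::uint64_t> next{row.back()};  // new row starts with
        for (std::uint64_t v : row)                   // the previous row's end
            next.push_back(next.back() + v);
        row = std::move(next);
    }
    return row.back();                                // rightmost entry = B_n
}
```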

3.3.3. Communication Fabric Exploration Flow

The proposed communication fabric exploration flow for application-specific NoC is

depicted in Figure 3.4. Assume that the floorplan of IP blocks is given2. The inputs of our

flow are i) the I/O port locations (pl) for each IP block, ii) the application communication

graph (ACG) and iii) design constraints, D (e.g., maximum wirelength, maximum power-

consumption for the communication fabric) as shown in Figure 3.4. The modeling block

contains the performance analysis tool, optimal wirelength model in linear-time complexity,

and the fabric area model. Of course, it is also possible to include other models such as, power

[170], or even inductive coupling or crosstalk noise analysis [34] in this exploration.

2. In an industrial setting, this is often the case; e.g., a DDR controller should be near the edge, etc. Moreover, if there are un-placed IP blocks, a floorplanner tool, such as PARQUET [2], can be included to floorplan the chip as a pre-processing step.


With all inputs and the modeling library available, we explore a class of communication

fabrics and report the Pareto-optimal sets trading off selected design metrics, while satisfying

all design constraints (see the analysis stage in Figure 3.4). Without loss of generality, in the

rest of this chapter we assume that all the communication fabrics work under the same

operating frequency, although the proposed framework can be easily applied to fabrics with

multiple operating frequency settings throughout the chip by reflecting such a concern in the

performance analysis and wirelength models we can handle. In addition, the buffer sizing

problem and specific routing scheme can be further discussed after the communication fabrics

are decided [80][159].

As shown in the simulation stage of Figure 3.4, the netlists corresponding to the Pareto-

optimal sets are automatically generated and fed to the cycle-accurate SystemC simulator

specifically developed for this study. The simulator is used both to evaluate the Pareto-optimal points found analytically and to validate the accuracy of our analysis. Our simulator follows an Intel® XScale™ System Interconnect (XSI)-like communication protocol, an on-chip interconnect for application-specific SoCs, for handling multiple data transmissions. Finally, we evaluate the accuracy of the analytical solutions by simulating them with the SystemC simulator and comparing the analytical and simulation results (as depicted in the evaluation stage of Figure 3.4).

Figure 3.4 The flow of the communication fabric design space exploration, with the analysis, simulation, and evaluation stages shown explicitly.

3.4. Optimization of System Interconnect Problem

3.4.1. Exact System Interconnect Exploration

Our communication fabric exploration is based on a branch and bound approach. This

approach is capable of searching all solutions efficiently by walking through a tree structure.

For instance, Figure 3.5 shows an example of the tree structure needed to explore the

communication fabric solutions for a simple case with three IPs in the system. Assigning different IPs to the communication muxes allows us to explore different solutions. Two

extreme cases of the communication fabric are 1) assign each IP to a separate mux and 2)

assign all IPs to one single mux (see logical views in Figure 3.3(a) and (b), respectively).

As seen in Figure 3.5, the tree structure starts with the root node where no IPs are assigned

to any of the muxes. At each level i, we assign IP bi to a different mux, denoted as an intermediate node, by branching out from its corresponding parent node. For example, the branches of node (1xx)(xxx)(xxx) at level 2, where IP b2 is placed into muxes 1, 2, and 3, result in nodes

(12x)(xxx)(xxx), (1xx)(2xx)(xxx), and (1xx)(xxx)(2xx), respectively. In addition, to speed up

exploration while keeping the optimality of the approach, we stop branching the nodes which

are isomorphic with other nodes in the tree. For example (see the “R” sign in Figure 3.5), the


nodes (1xx)(2xx)(xxx) and (1xx)(xxx)(2xx) are isomorphic, which implies that the solutions

branching out from node (1xx)(2xx)(xxx) are identical to those branching out from node

(1xx)(xxx)(2xx). Therefore, the node (1xx)(xxx)(2xx) is considered redundant in this case and

there is no need to further consider its children nodes in the solution space.

In addition, all nodes at level 3 are leaf nodes since all IPs have been assigned to muxes.

When reaching a leaf node, the expected average packet latency, total wirelength, area, and

power under the resulting mux structure are obtained using our analytical performance model,

optimal tree placer in [32], and the power model. If these results satisfy the design constraints

(e.g., the maximum wirelength in the fabric is smaller than a given value, and the power-

consumption overhead compared to a bus-based design does not exceed a threshold), we

Figure 3.5 A three-IP example of communication fabric exploration using the branch and bound algorithm. The tree grows from the root node at level 0 (no IPs assigned) through intermediate nodes at levels 1 and 2 (with redundant, isomorphic nodes pruned) to the leaf nodes at level 3, where the expected average latency, total wirelength, and fabric area are evaluated.


check whether or not this solution belongs to the Pareto-optimal set. If yes, we include the

solution into the Pareto-optimal set and delete any solutions dominated by this one.

The branching process is applied in a recursive manner until all branchings hit level 3 or

stop at some intermediate level. The pseudo code of the branch and bound algorithm

implemented in a depth-first search manner is shown in Figure 3.6. Our solution structure is

listed in line 01. The main search function of the tree structure is shown in lines 06-21. As

seen, at every iteration, we add one IP to a specific mux (line 06). When reaching a leaf node,

the expected latency, wirelength, area, and power are calculated and the solution is identified

as a Pareto point (lines 12-15). If we reach an intermediate node which is not redundant, we do

Figure 3.6 The pseudo code of the system interconnect exploration using the branch and bound method.

Input: I/O port regions pl, ACG = (B,E), design constraints D
Output: Pareto-optimal set trading off metrics
01 Solution S {latency, area, wirelength, power};
02 MAIN PROCEDURE{
03   Pareto-optimal ← ∅;
04   Pareto-optimal = EXPLORE(1, 1);
05 }
   EXPLORE(next_agent, next_mux){
06   Solution S[next_mux].push_back(next_agent);
07   IF (S is leaf_node)
08     S.latency = estimate_latency(S, ACG);
09     S.area = calculate_area(S, pl);
10     S.wirelength = calculate_wirelength(S, pl);
11     S.power = calculate_power(S, pl);
12     IF (S satisfies constraints D)
13       IF (S dominates solutions in Pareto-optimal)
14         Pareto-optimal ← Pareto-optimal ∪ {S};
15         delete_non_Pareto in Pareto-optimal;
16   ELSE
17     FOR (mux_ind = 1 to num_agent)
18       IF (S is redundant_node)
19         break;
20       ELSE
21         EXPLORE(next_agent+1, mux_ind); }


the depth-first search recursively to branch out this node by placing the next IP at different

muxes (see lines 17-21).
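For illustration, the exploration of Figure 3.6 can be sketched in C++ as follows. This is a minimal sketch, not the actual implementation: evaluate() and meetsConstraints() are toy stand-ins for the analytical models and the design constraints D, and the isomorphism pruning is reduced to never assigning an IP to more than one empty mux.

#include <vector>
#include <algorithm>

// Toy stand-ins for the analytical models of Section 3.3.3; in the real flow
// these are the performance, wirelength, area, and power analyses.
struct Metrics { double latency, area, wirelength, power; };

Metrics evaluate(const std::vector<std::vector<int>>& muxes) {
    Metrics m{0, 0, 0, 0};
    for (const auto& mux : muxes) {
        if (mux.empty()) continue;
        m.latency += double(mux.size()) * mux.size(); // latency grows with fan-in
        m.area += 1.0 + mux.size();
        m.wirelength += mux.size();
        m.power += mux.size();
    }
    return m;
}

bool meetsConstraints(const Metrics&) { return true; } // design constraints D

// a dominates b if it is no worse in every metric.
bool dominates(const Metrics& a, const Metrics& b) {
    return a.latency <= b.latency && a.area <= b.area &&
           a.wirelength <= b.wirelength && a.power <= b.power;
}

std::vector<Metrics> paretoSet;   // Pareto-optimal set under construction

// Depth-first branch and bound: assign IP 'ip' of 'numIPs' to every mux in turn.
void explore(std::vector<std::vector<int>>& muxes, int ip, int numIPs) {
    if (ip == numIPs) {                        // leaf node: all IPs assigned
        Metrics m = evaluate(muxes);
        if (!meetsConstraints(m)) return;
        for (const Metrics& p : paretoSet)
            if (dominates(p, m)) return;       // dominated: discard this leaf
        // keep m and delete any solutions that m dominates
        paretoSet.erase(std::remove_if(paretoSet.begin(), paretoSet.end(),
                        [&](const Metrics& p) { return dominates(m, p); }),
                        paretoSet.end());
        paretoSet.push_back(m);
        return;
    }
    for (std::size_t k = 0; k < muxes.size(); ++k) {
        muxes[k].push_back(ip);
        explore(muxes, ip + 1, numIPs);
        muxes[k].pop_back();
        if (muxes[k].empty()) break;           // isomorphism pruning: a second
    }                                          // empty mux would repeat this node
}

Calling explore() with three empty muxes and numIPs = 3 enumerates exactly the non-redundant leaves of the tree in Figure 3.5.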

The estimate_latency function in Figure 3.6 (line 08) computes the average packet latency

for a given solution using a technique similar to the analysis presented in [120]. In short, we

first calculate the contention probability between each flow passing through the same

multiplexer. Then, we use these contention probabilities to find the approximate queuing

delays, as described in [120]. Similarly, the calculate_area and calculate_wirelength functions

in Figure 3.6 (see lines 09-10) are implemented using the technique presented in [32], while the

calculate_power function is estimated using the predictive technology model [129].
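The exact analysis of [120] is not reproduced here; as a rough illustration of how contention translates into queuing delay, the following sketch uses a generic M/M/1-style approximation (an assumption of this sketch, not the model of [120]): the flows sharing a multiplexer contribute to its utilization, and the waiting time grows sharply as that utilization approaches one.

#include <vector>

// One flow through a shared multiplexer; rates in packets/cycle and
// service times in cycles/packet are illustrative units.
struct Flow { double rate, serviceTime; };

// First-order queuing delay at one mux: an M/M/1-style stand-in for the
// contention-probability analysis of [120], not the exact technique.
double estimateMuxDelay(const std::vector<Flow>& flows) {
    if (flows.empty()) return 0.0;
    double utilization = 0.0, meanService = 0.0;
    for (const Flow& f : flows) {
        utilization += f.rate * f.serviceTime;  // each flow's load contribution
        meanService += f.serviceTime;
    }
    meanService /= flows.size();
    if (utilization >= 1.0) return 1e9;         // saturated mux: huge delay
    // waiting time rises sharply as the shared mux approaches saturation
    return meanService * utilization / (1.0 - utilization);
}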

In summary, for each mux with more than two inputs, we decompose the mux into a tree

structure and later apply the linear time optimal tree placement method in [32] to place each

decomposed mux. After all muxes are placed, the wirelength is calculated using the Steiner tree

method [168]. The corresponding area is the sum of the areas of all decomposed muxes, repeaters, and buffers for each mux structure, and the corresponding power is the total power consumed by the

system interconnect components.

3.4.2. Heuristic for Speeding up System Interconnect Exploration

The run-time complexity of the above branch and bound algorithm grows exponentially

with the number of IP blocks in the system. Therefore, we propose a linear run-time heuristic

which can obtain solutions close to the ones in the Pareto-optimal set.

For solving the communication fabric optimization problem for |B| IP blocks, we generate

|B| solutions with the number of muxes (num_mux) ranging from 1 to |B| (i.e., |B| classes). The final

solution of this heuristic is obtained from the best solution among these |B| classes.


Figure 3.7 shows one class of the heuristic with two muxes for four IP blocks (i.e., |B|=4,

num_mux = 2). As shown in the figure, the structure starts with the root node where no IPs are

assigned to any mux (see level 0 in Figure 3.7). We first sort the IPs based on the total

communication bandwidth requirement, d_i = Σ_{∀j} [comm(e_ij) + comm(e_ji)], and assume d1 >

d2 > d3 > d4. Later, at level i, we assign IP block di to each mux, as shown in Figure 3.7; that is,

at level i, we have a partial solution where IPs d1~di are assigned to muxes. Then, we apply

the performance analysis tool (explained in Section 3.4.1) to each intermediate node, i.e., the partial_latency function in Figure 3.7. The algorithm branches only along the node with the

better performance (i.e., lower partial_latency value). In order to deal with the sensitivity to

Figure 3.7 The proposed heuristic for four IPs with the number of muxes set to 2 (sort order: d1 > d2 > d3 > d4). At each level, the node with the lower partial_latency is branched; the other node is branched with probability p if its partial_latency is within var · partial_latency of the better node.


system performance of the partial_latency function, we accept a solution with a certain probability p when its corresponding performance is within the variance var of the best solution on that level. This process continues until reaching the leaf nodes (all IPs are assigned to muxes), at which point we consider whether or not each resulting solution belongs to the Pareto curve.
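One class of this heuristic can be sketched in C++ as follows; partialLatency() is a hypothetical stand-in for the partial performance analysis of Section 3.4.1, and the IPs are assumed to arrive pre-sorted by decreasing bandwidth demand d1 > d2 > ...:

#include <vector>
#include <random>
#include <limits>
#include <algorithm>

// Hypothetical stand-in for the partial performance analysis of Section 3.4.1:
// scores a partial assignment of IPs to muxes (lower is better).
double partialLatency(const std::vector<std::vector<int>>& muxes) {
    double worst = 0.0;                       // toy: worst mux fan-in dominates
    for (const auto& m : muxes) worst = std::max(worst, (double)m.size());
    return worst;
}

// One class of the heuristic: place IPs 0..numIPs-1 (sorted by decreasing
// bandwidth demand d_i) into numMux muxes, branching only along the best
// node, or a near-best node with probability p.
std::vector<std::vector<int>> heuristicClass(int numIPs, int numMux,
                                             double var, double p,
                                             std::mt19937& rng) {
    std::vector<std::vector<int>> muxes(numMux);
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    for (int ip = 0; ip < numIPs; ++ip) {
        int best = 0;
        double bestLat = std::numeric_limits<double>::max();
        std::vector<double> lat(numMux);
        for (int k = 0; k < numMux; ++k) {    // evaluate each tentative placement
            muxes[k].push_back(ip);
            lat[k] = partialLatency(muxes);
            muxes[k].pop_back();
            if (lat[k] < bestLat) { bestLat = lat[k]; best = k; }
        }
        int chosen = best;
        for (int k = 0; k < numMux; ++k) {    // accept a near-best node with prob. p
            if (k != best && lat[k] - bestLat < var * bestLat && coin(rng) < p) {
                chosen = k; break;
            }
        }
        muxes[chosen].push_back(ip);
    }
    return muxes;
}

Running heuristicClass() once for every num_mux from 1 to |B| and keeping the best outcome reproduces the |B|-class structure described above.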

3.5. Experimental Results

3.5.1. Industrial Case Study

To evaluate the potential of our communication fabric exploration approach for a real

application, we apply this approach to an industrial SoC design, namely the Intel® media

processor CE 3100 [83]. For this example, we are given the number of IPs in the system, as

well as the application communication graph. In addition, the floorplan information of this

design and the locations of I/O ports for each IP are also known. We used an industrial process

technology for estimating the area, wirelength, and power on system interconnect. Through

system interconnect exploration, we report the Pareto-optimal set trading off performance and

two physical design metrics, i.e., area and wirelength, while satisfying the imposed constraints

such that the system designers can easily make meaningful system optimization choices.

We first apply the proposed exact exploration technique to find the Pareto-optimal set

trading off the average packet latency and communication fabric area (see Figure 3.8(a)) with

the power overhead constraint set to 0.15 (i.e., the power consumption of the communication fabric cannot exceed that of a bus-based implementation by more than 15%). This constraint is reasonable since the power consumption of the communication fabric represents only a small portion of the power consumption of the entire system. Similarly, Figure 3.8(c) reports the Pareto-

optimal set trading off the average packet latency and communication wirelength.


In order to validate the accuracy of the analysis stage (see Figure 3.4), we take the

potential solutions in the Pareto-optimal set and simulate them using our SystemC simulator.

The simulation results of data in Figure 3.8(a) and (c) are shown in Figure 3.8(b) and (d),

respectively. Note that in our analysis model (Figure 3.8(a) and (c)), we capture the high-level

system performance without implementing all the details of the communication protocol.

Therefore, protocol-related latencies, such as the set-up time, are not included in the analysis. Since the relative accuracy is sufficient to make meaningful comparisons between alternative solutions, we show the normalized latency values in Figure 3.8.

Figure 3.8 System interconnect exploration for a real SoC design. (a) Pareto-optimal set (latency vs. fabric area) obtained via analysis. (b) Simulation results for solutions in (a). (c) Pareto-optimal set (latency vs. fabric wirelength) obtained via analysis. (d) Simulation results for solutions in (c). Compared to the 1-mux case (bus model), the highlighted 3-mux solutions reduce latency by about 40% with 10.2% area and 5.3% wirelength overhead.

We note that the Pareto points

are accurately captured by analysis. Later, in Figure 3.9, we also show that no actual Pareto

points are missed by analysis.

As seen in Figure 3.8, our proposed SIDE framework covers the entire design space

exploration of the system interconnect, including the traditional bus model (one-mux case, or a single bus), which suffers from poor performance but involves less fabric area and

wirelength. When compared against the one mux case (i.e. a single bus), the fabric with three

muxes (i.e., the highlighted circles in Figure 3.8) can achieve around 40% reduction in

communication latency with only 5.3% wirelength and 10.2% area overhead with respect to

the original single bus design. The power consumed by the system interconnect with a 3-mux implementation is about 8.16% higher than that of a bus-based implementation. Note

that the reported overhead in area and power is negligible with respect to the entire chip area

and power (below 0.1%). However, the communication latency improvement leads to

significant gains in system-level performance for multiple applications.

In addition, we plot the non-Pareto points to confirm that no candidate points are missed in our exploration process. After obtaining all solutions with the branch and bound

algorithm, we select forty points with a smaller fabric area and later report the real simulation

Figure 3.9 Forty non-Pareto points and the Pareto curve, obtained via analysis (a) and via simulation (b).

results for those points. The analysis and simulation results for the Pareto points, plus the forty non-Pareto points, are shown in Figure 3.9(a) and (b), respectively. As shown in Figure 3.9(a), it is easy to see that all points in the Pareto-optimal set (cross signs) dominate the other forty

solutions (dot signs). It is worth mentioning that the forty points obtained from the analysis

stage are indeed dominated by the Pareto-optimal set in real simulation, as shown in

Figure 3.9(b); this demonstrates that our early-stage analysis is able to make good design

choices systematically.

3.5.2. Synthetic Applications for Larger Systems

We now evaluate the run-time and the solution quality of the branch and bound approach

(see Section 3.4.1) against the proposed heuristic (see Section 3.4.2). Four categories of

synthetic applications are generated, with complete floorplaning information about I/O

locations for each IP. Each category contains 10 applications with 7, 9, 11, and 13 IPs,

respectively.

Figure 3.10 shows the solutions obtained with the branch and bound approach and our

heuristic (displayed with dots and crosses, respectively) and their corresponding Pareto curves

for one synthetic application with 13 IPs. As seen in Figure 3.10, those two Pareto curves are

close, even though three of the Pareto-optimal solutions are not obtained by the heuristic; that is, among the heuristic solutions, only points 7, 8, and 14 are indeed in the Pareto-optimal set (they coincide with points 1, 2, and 6). The degradation in quality of the heuristic solutions compared to the optimal solutions is calculated as the difference in area between the solution generated by the heuristic and the area of the Pareto-optimal solution with the closest (from below) latency. For example, the average area increase with respect to the exact

algorithm for all design points reported in Figure 3.10 is 2%.


As mentioned before, the branch and bound exploration is exponential in nature. The run-

time overhead for exploring the system with 7, 8, ... , 13 IPs is 40 ms, 55 ms, 70 ms, 310 ms, 3

s, 2 min, and 40 min, respectively. For future MPSoCs with hundreds of IPs, the proposed heuristic is needed to explore the exponentially increasing design space. Figure 3.11 shows how our heuristic performs compared to the branch and bound approach as the system size scales up

Figure 3.10 Solution comparison between the branch and bound method and the proposed heuristic for system interconnect exploration of a synthetic application with 13 IP blocks.

Figure 3.11 Run-time and solution quality comparison between the branch and bound approach and our heuristic as the system size scales up.

(four categories with the number of IPs set to 7, 9, 11, and 13). For the heuristic, the iteration count iter, the variance var, and the probability p are set to 30, 0.3, and 0.5, respectively. For a system

consisting of 11 IPs, our heuristic runs 1800 times faster than the branch and bound algorithm,

on average. Meanwhile, the solutions obtained by the heuristic remain competitive as the

system size scales up.

3.6. Summary

In this chapter, we have addressed the problem of system interconnect exploration for

application-specific MPSoCs where the system configurations are predictable. As a novel

contribution, we have developed an analytical model for network-based communication fabric

design space exploration and theoretically generated fabric solutions with optimal cost-

performance trade-offs, while considering various design constrains, such as power, area, and

wirelength. For large systems, we have proposed an efficient approach for obtaining

competitive solutions with significant less computation time. The accuracy of our analytical

model has been evaluated via a SystemC simulator using several synthetic applications and an

industrial SoC design.

In the remainder of this dissertation, we will address the design space exploration for NoC

platforms where the system configurations are not predictable due to users interacting with

multiple applications within the system.


4. USER-CENTRIC DSE FOR HETEROGENEOUS NOCS

4.1. Introduction

As mentioned in Chapter 3, for systems resulting in predictable system configurations, the

traditional Y-chart flow (see Figure 1.3) works well; however, for future embedded systems,

most likely, we will have multiple applications interacting with the system, which results in

unpredictable system configurations. Since such application interaction is due to the end user behavior, by analyzing the user interaction with the system we are able to provide

more robust platforms for applications characterized by high workload variation and

unpredictable system configurations.

In order to incorporate the end user behavior into the DSE, in this chapter our user-centric

design methodology relies on collecting user traces from similar, existing systems or

prototypes (see Figure 1.8). The user trace modeling has been discussed in Section 2.3.1,

which captures what applications are running, at what times, in the system. The novel

contributions of our proposed DSE methodology are as follows:

• First, we target user behavior analysis. More precisely, we apply machine learning

techniques to cluster the traces from various users such that the differences in user

behavior for each class are minimized.

• Then, for each cluster, we propose an offline algorithm for automated architecture

generation of heterogeneous NoC platforms that deal explicitly with computation and


communication components and satisfy various design constraints, while facing

significant workload variations.

We note that by taking the user experience into consideration in the DSE methodology,

the generated system platforms exhibit less variation among the users’ behavior; this implies

that each system is highly suitable for a particular user cluster and therefore the overhead of

later applying various online optimization techniques can be reduced as well

[36][38][122][152][153]. In this chapter, however, we restrict ourselves to the offline

optimization part of platform generation, while follow-up chapters will consider the run-time

optimization aspects (see Chapter 5, Chapter 6, and Chapter 7).

4.2. Related Work

In an early attempt, Dick and Jha propose a multiobjective genetic search algorithm for

co-synthesis of hardware/software embedded systems which trades off price and power

consumption [51]. Some design methodologies for automatic generation of architecture for

heterogeneous embedded MPSoCs were later studied in [7][100]. Different from the heuristics

used to handle a large design space, Ozisikyilmaz et al. propose a predictive modeling

technique to estimate the system performance by looking at information from past systems

[121]. More recently, Shojaei et al. propose a BDD-based approach to efficiently obtain

Pareto points which help multi-dimensional optimization [150]. Instead of using the bus-

based communication, Chatha et al. address the automated synthesis of an application-

specific NoC architecture with optimized topology [31]. However, their approach targets

single-application characteristics (i.e., the communication trace graph is fixed), which is not realistic for different users. Murali et al. consider multiple use-cases during the NoC design process [106]. However, they optimize the NoC using only worst-case constraints. In reality, the distributions of use-cases for various users are very different.


Gheorghita et al. presented a generic and systematic design-time/run-time methodology for handling the dynamic nature of modern embedded systems, the so-called system-scenario-

based design [63].

The differences in user behavior have also been studied. For instance, Kang et al. in

[86] observe the differences between younger and middle-aged adults in the use of

complicated electronic devices. Rabaey et al. in [132] discuss the wide range of workloads of

the future and advocate for new metrics to guide the exploration and optimization of future

systems, such as user functionality, reliability, and composability. To the best of our knowledge, we

are the first to take the collected user traces as input into DSE for building MPSoC platforms,

where their system configurations (i.e. system scenarios, use-cases) are not predictable at

design time.

4.3. Preliminaries

In this chapter, we give the details of the proposed user-centric design framework for

embedded NoC platforms. Later, we illustrate the detailed steps and related machine learning

techniques for off-line user-centric DSE, while targeting a generic platform of a system that

belongs to the third category in Table 1.1.

Our proposed user-centric design flow is shown in Figure 4.1. In order to take the user

behavior into consideration, the inputs of our design flow are:

• Architecture template, which consists of computation resources (e.g., FPGA, DSP,

ASIC), communication resources (e.g., router, FIFO, segmented bus), and the

communication protocol (e.g., routing/switching scheme). Of note, we focus only on the NoC platform with a 2-D mesh topology and minimal-path routing, but the communication

architecture may be more general.


• Applications specification which captures the task graph characteristics (e.g., number of

tasks and communication rate between them), inter-application synchronization,

computation profile (e.g., power consumption, application deadlines).

• User experience, which is based on data from users' involvement through contextual

enquiry, prototype, and feedback from previous generation products (see

Figure 1.3(b)). This may include user traces, customer preferences, or other relevant

data.

The entire user-centric design flow involves several steps with the goal of generating

systems that meet the user needs. Here, we assume that the user needs in this case are to have

a system with low power consumption, but still able to maintain its basic performance. In

other words, our goal for the system design is to minimize the energy consumption, i.e., the computation and communication energy consumption, per user.

Figure 4.1 The proposed user-centric design flow in terms of the off-line DSE processes. The architecture template, application template, and user experiences feed the user behavior clustering step (application-usage similarity and k-MEAN clustering, Section 4.4.1), which produces trace clusters 1 to k and the identification content (IDC); the automated NoC platform design step (computational resource selection and resource location assignment, Section 4.4.2) then generates NoC platforms 1 to k.


As mentioned in Section 1.3.2, we first need to understand the psychological, social, or

even ergonomic factors that affect the pattern of involvement of different users, in order to

classify the user behaviors. Then, we need to explore the problem of clustering the user traces

such that all users belonging to the same cluster have a similar1 behavior, while interacting

with the target system (see clusters 1 to k in Figure 4.1, details are discussed in

Section 4.4.1). Here, we assume that k is a given design parameter that can be determined by

market surveys or from previous design experience2. After knowing the involvement of users

into the same cluster, we can decide the architecture parameters (i.e., the number and type of

resources) under different design constraints/metrics (area, cost, etc.). Then, for building the

platform, we can follow the NoC platform automation process specific to this cluster of traces

(see Section 4.4.2). Later, we propose a validation process for this user-centric design flow in

order to assess whether or not the system can satisfy the involvement of users in that cluster

configuration (see Section 4.4.3).

To formulate the problem for off-line design, some terminologies are needed:

• ri : a resource of type i considered as the computation component in the platform (see

Section 2.1). Assume there exist n different types of resources, r1, r2, ..., rn ∈ RE; N(ri)

represents the number of resources of type ri for the platform, while M(ri) represents

the price of resource ri.

• qi : an application with a set of tasks which are not shared with other applications. Each

application qi can be characterized as ACG^qi = (V^qi, E^qi), where the property details of

1. In terms of similarity, it can be the 'frequency of accessing an application', the 'time spent with each particular application', etc. Through confirmatory factor analysis and model analysis, one can derive latent variables (based on various observable behavior variables as listed above) that can be used to better classify users into categories (see the details in Section 4.4.1).

2. For example, for non-shared systems owned by a single person, k might be 5 or more, such as the different models of cell phones. However, for systems which are shared and used by several people at a time, a universal design (k = 1 or maybe 2) that is usable and effective for everyone is sufficient.


each application has been described in Section 2.2. Assume there exist m different applications which can run on the platform, i.e., q1, q2, ..., qm ∈ Q. Each vertex v_j^qi has a set of resources to which it can be mapped in order to meet the application deadlines, i.e., RE_j^qi = { r_i ∈ RE | MCR(v_j^qi) ≤ CC(r_i) }.

We note that the computation/communication energy models for a specific user trace have been characterized in Section 2.3.

4.4. The Problem and Steps for DSE

4.4.1. User Behavior Similarity and Clustering

As mentioned in Figure 1.1, the variation of user behavior when interacting with the system is quite high; in order to satisfy most of the users through off-line DSE, we therefore need to classify the user behaviors and cluster the user traces such that the users belonging to the same cluster have a similar behavior while interacting with the system. Keeping the goal of minimizing the energy consumption, we need to observe how users interact with the system, i.e., how many and which applications each user often uses and for how long. In addition, each application has different resource requirements and power profiles, so extracting this kind of information is crucial for later customizing the design process.

Here, we define some terms in order to quantify how similar the user traces are; the steps of the user behavior clustering process are explained later

in Figure 4.2.

• Application resource demand (L): The degree of resource demand of an application. An application qi which demands a larger number of resources of type rn has a higher L_rn^qi value.

• Inter-application similarity: Two applications requiring similar resources (i.e. having

similar resource demand) have a high inter-application similarity coefficient.

• Application appearance probability (p_v^ℜi): The probability of observing a subset of applications, v, in user trace ℜi.

• Application-usage similarity: Two user traces reflecting a similar frequency of

application appearance (i.e. having similar application appearance probability) have a

high application-usage similarity coefficient.

• Subset function (F): If A is a subset of (or is included in) B, then F(A, B) = 1; otherwise

it is 0.

• Cluster mapping (C): C(i) = j indicates that i has been clustered into the jth group,

where i:C(i) = j represents all the elements in the jth group.

• k-MEAN clustering: The algorithm in [21] groups the objects (or data points) based on

attributes/features into k different groups, where k is a positive integer. The grouping

here is done by minimizing the sum of squares of distances between data and the

corresponding cluster centroid. The k-MEAN clustering algorithm involves three

important steps. First, k initial data points are randomly selected from the data set and set as the centers of the clusters. Second, we perform the re-assign process, i.e., assign the other data points to the closest center. Third, we perform the re-center process, i.e., the centroid of each of the k clusters is re-calculated. We repeat the second and third steps until the process converges (the centroid of each cluster no longer changes).
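For concreteness, the k-MEAN loop described above can be sketched as follows (a minimal C++17 sketch; the data points would be the resource-demand or application-appearance vectors defined in this section):

#include <vector>
#include <random>
#include <algorithm>
#include <iterator>

using Point = std::vector<double>;

double dist2(const Point& a, const Point& b) {      // squared Euclidean distance
    double d = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) d += (a[i] - b[i]) * (a[i] - b[i]);
    return d;
}

// Returns the cluster index C(i) for each data point, following the three
// steps above: random initial centers, re-assign, re-center, until stable.
// Assumes non-empty data and 0 < k <= data.size().
std::vector<int> kMeans(const std::vector<Point>& data, int k, std::mt19937& rng) {
    if (data.empty() || k <= 0) return {};
    std::vector<Point> centers;
    std::sample(data.begin(), data.end(), std::back_inserter(centers), k, rng);
    std::vector<int> assign(data.size(), -1);
    bool changed = true;
    while (changed) {
        changed = false;
        for (std::size_t i = 0; i < data.size(); ++i) {   // re-assign step
            int best = 0;
            for (int c = 1; c < k; ++c)
                if (dist2(data[i], centers[c]) < dist2(data[i], centers[best]))
                    best = c;
            if (assign[i] != best) { assign[i] = best; changed = true; }
        }
        std::vector<Point> sum(k, Point(data[0].size(), 0.0)); // re-center step
        std::vector<int> count(k, 0);
        for (std::size_t i = 0; i < data.size(); ++i) {
            for (std::size_t d = 0; d < data[i].size(); ++d)
                sum[assign[i]][d] += data[i][d];
            ++count[assign[i]];
        }
        for (int c = 0; c < k; ++c)
            if (count[c] > 0) {                // empty clusters keep old centers
                for (double& v : sum[c]) v /= count[c];
                centers[c] = sum[c];
            }
    }
    return assign;
}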

By applying the k-MEAN clustering approach, our goal is to cluster the users having a similar behavior when interacting with the system into the same cluster. Each user can be treated as a data point, and the distance between users reflects the coefficients of inter-application and application-usage similarity: the closer the distance, the higher the similarity coefficient. More precisely, as shown in Figure 4.2, the clustering is achieved by first grouping together all similar applications (i.e., applications in the same cluster vi have a high inter-application similarity, see Steps 1-4), and then clustering the traces that use these application groups in a similar way (i.e., traces in the same cluster Si have a high application-usage similarity, see Steps 5-7).

Figure 4.2 Main steps of user behavior clustering.

Input: the task graph characteristics of each application qi, G^qi = (T^qi, E^qi), and the task-level computing cost, E_comp(t_j^qi, r_k)
Output: the user behavior clusters S

• Step 1: Derive the Pareto curve trading off the resources and the computation power consumption for each application qi (similar to the solution proposed in [51]). Each Pareto point gives the minimum power consumption E_comp(qi, (N(r1), N(r2), ..., N(rn))) for application qi.

• Step 2: Given all Pareto points, calculate L^qi = (L_r1^qi, L_r2^qi, ..., L_rn^qi), i.e., the resource demand of application qi for each resource type rj, j = 1, ..., n, where

  L_rj^qi = Σ_{x=1}^{max(N(rj))−1} { avg[E_comp(qi, (N(r1), ..., x, ..., N(rn)))] − avg[E_comp(qi, (N(r1), ..., x+1, ..., N(rn)))] } / (max(N(rj)) − 1)

• Step 3: Normalize L^qi for each application qi:

  L^qi = (L_r1^qi − avg(L^qi), ..., L_rn^qi − avg(L^qi)), where avg(L^qi) = (L_r1^qi + L_r2^qi + ... + L_rn^qi) / n

• Step 4: Set each application as a data point di and apply k-MEAN clustering to group all data points di into z clusters. Assign the center of each cluster, μr, where r = 1, ..., z, to the identification content (IDCr), which will be utilized in the testing stage, and define a z-dimensional application vector V = (v1, v2, ..., vz) = (di:C(di)=1, di:C(di)=2, ..., di:C(di)=z) capturing the applications within the corresponding cluster.

• Step 5: Calculate p_v^ℜi = (p_v1^ℜi, p_v2^ℜi, ..., p_vz^ℜi) for each user trace ℜi, i.e., the appearance probability of each application set vj, j = 1, ..., z, where

  p_vj^ℜi = [ Σ_{∀⟨ℜi^t⟩ ∈ ℜi} F({qk, ql, ...} ∈ ⟨ℜi^t⟩, vj) ] / [ Σ_{∀⟨ℜi^t⟩ ∈ ℜi} 1 ]

• Step 6: Set p_v^ℜi of each user trace as a data point di and apply k-MEAN clustering to group the data points di into k clusters, where each group Si contains user traces with a high application-usage similarity.

• Step 7: Assign the center of each cluster, μr', where r' = 1, ..., k, to the identification content (IDCr'). Then, generate a k-dimensional trace cluster vector S = (S1, S2, ..., Sk) = (di:C(di)=1, di:C(di)=2, ..., di:C(di)=k) capturing the user traces within the corresponding cluster.

4.4.2. Automated NoC Platform Generation

From Section 4.4.1, we obtain the set of user traces which have a similar interaction with

the system. Here, for each cluster of traces, our goal is to generate an NoC platform (i.e., a set

of resources interconnected via a mesh-like network) which minimizes the energy

consumption. Therefore, the design automation process of our NoC platform involves two

critical steps: i) Computational resource selection, which decides the number and type of

resources needed to build the platform with the price constraint satisfied, and ii) Resource

location assignment, which provides the tile location for each resource in the 2-D tile-based

NoC. Of note, while running a user trace on any platform Λ, one can observe that applications

enter and leave the system dynamically. Here, we apply a greedy approach for the application

mapping problem; that is, based on the available resources of the platform, we assign vertex vi

to the currently available resource rj that consumes the minimum amount of power. The vertex-to-resource mappings are one-to-one, where the mapping function is denoted as map( ), i.e., map(vi) = rj. In addition, the price of each resource type rj is defined as M(rj), while the total platform price constraint is set to Φ.

1. Computational Resource Selection

Given all user traces in a cluster S, i.e., ∀ℜi = ⟨ℜi^t⟩ ∈ S, and a price constraint Φ, find a resource set A which

minimizes  Σ_{∀ℜi ∈ S} E_comp(ℜi, A)  (4.1)

such that:  Σ_{∀ri ∈ A} M(ri) ≤ Φ  (4.2)

∀ vertex vx in S, map(vx) ∈ {REx}  (4.3)

The steps for the computational resource selection problem are summarized in Figure 4.3.

In Equation 4.2, the sum of the prices of the resources integrated in the platform must not be greater than the price constraint Φ, while Equation 4.3 guarantees that all application tasks running on the platform are assigned to specific resources in order to meet the application deadlines.

Main idea: We start out with an initial set of resources (A(0)) which minimizes our

objective without considering the price constraint (Step 1). The price constraints can later


be met by replacing the more expensive resources with cheaper ones. Since there are at most n × (n−1) pairs of possible replacements for a platform with n types of resources, n × (n−1) evaluations are performed. Then, the replacement that results in the largest price reduction with the smallest computation energy consumption overhead is applied (Step 2), while the requirement of Equation 4.3 remains satisfied. We continue this step until the price of the updated resource set satisfies the price constraint.

Figure 4.3 Main steps for computational resource selection.

• Step 1: Initialize platform A with unlimited resources; find A(0) which minimizes Σ_{∀ℜi ∈ S} E_comp(ℜi, A(0)), and set j ← 0.
• Step 2: While (Θ(j) = Σ_{∀ri ∈ A(j)} M(ri) > Φ):
  find A(j+1) = (..., N(rx)−1, ..., N(ry)+1, ...), i.e., replace rx with ry, where M(ry) < M(rx), so as to minimize the energy-price cost ratio
  [ Σ_{∀ℜi ∈ S} E_comp(ℜi, A(j+1)) − Σ_{∀ℜi ∈ S} E_comp(ℜi, A(j)) ] / [ Σ_{∀ri ∈ A(j)} M(ri) − Σ_{∀ri ∈ A(j+1)} M(ri) ]
  while all tasks still meet the application deadlines; set j ← j+1.
  End While; return A(j).
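A minimal C++ sketch of this greedy loop is given below; totalCompEnergy() is a toy placeholder for the cluster-wide energy objective of Equation 4.1, and the deadline-feasibility check of Equation 4.3 is omitted for brevity:

#include <vector>
#include <limits>

// counts[i] = N(r_i), price[i] = M(r_i) in USD.
// Toy placeholder for the cluster-wide computation energy of a resource set;
// the real evaluation uses the Pareto-point models of Figure 4.2.
double totalCompEnergy(const std::vector<int>& counts) {
    double e = 0.0;  // pretend: resource type i consumes (i+1)*10 energy units
    for (std::size_t i = 0; i < counts.size(); ++i)
        e += counts[i] * (i + 1) * 10.0;
    return e;
}

double totalPrice(const std::vector<int>& counts, const std::vector<double>& price) {
    double sum = 0.0;
    for (std::size_t i = 0; i < counts.size(); ++i) sum += counts[i] * price[i];
    return sum;
}

// Greedy resource selection (Figure 4.3): start from the unconstrained optimum
// A(0) and repeatedly swap one expensive resource for a cheaper one, choosing
// the swap with the smallest energy increase per dollar saved, until the
// platform price satisfies phi.
std::vector<int> selectResources(std::vector<int> counts,
                                 const std::vector<double>& price, double phi) {
    while (totalPrice(counts, price) > phi) {
        double bestRatio = std::numeric_limits<double>::max();
        int bestX = -1, bestY = -1;
        const double e0 = totalCompEnergy(counts);
        for (std::size_t x = 0; x < counts.size(); ++x) {
            if (counts[x] == 0) continue;
            for (std::size_t y = 0; y < counts.size(); ++y) {
                if (x == y || price[y] >= price[x]) continue; // cheaper swaps only
                std::vector<int> trial = counts;
                --trial[x]; ++trial[y];
                double ratio = (totalCompEnergy(trial) - e0) / (price[x] - price[y]);
                if (ratio < bestRatio) { bestRatio = ratio; bestX = (int)x; bestY = (int)y; }
            }
        }
        if (bestX < 0) break;             // no cheaper replacement exists
        --counts[bestX]; ++counts[bestY];
    }
    return counts;
}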

Example: Assume that we are building a platform with (W × H) = Q resources with the

templates of five different types of resources, i.e. r1, r2, r3, r4, r5 and r1 has the strongest

computational capability but with the highest price. The resource set A is defined as (N(r1),

N(r2), N(r3), N(r4), N(r5)). In this case, in Step 1 we have A(0) = (Q,0,0,0,0). Then, in Step 2, if A(0) cannot satisfy the price constraint Φ (i.e., Q × M(r1) > Φ), we replace one resource with a lower-priced one in the original set, i.e., we evaluate (Q−1,1,0,0,0), (Q−1,0,1,0,0), (Q−1,0,0,1,0), and (Q−1,0,0,0,1), and find the next-iteration resource set A(1) which has the smallest computational cost overhead per unit of price reduction. We pursue this greedy approach until finding A(j)

which satisfies the price constraint Φ.

Complexity: Simply speaking, the solution space of the computational resource selection problem is related to that of the integer composition problem, i.e., finding the set S = {(N(r1), N(r2), ..., N(rn)) | N(r1) + N(r2) + ... + N(rn) = Q, N(ri) ∈ ℕ}, whose size equals (Q + n − 1)!/(Q! × (n − 1)!); for instance, for Q = 25 tiles and n = 5 resource types this already amounts to 23,751 candidate resource sets. However, using our proposed greedy approach, we can obtain a reasonable solution about 60× faster compared to the exhaustive search time (see the experimental results in Section 4.5).

2. Resource Location Assignment

After obtaining the number and type of computational resources from the previous step,

our task becomes to allocate each resource to the tile-based NoC platform with the goal of


minimizing the communication energy consumption when all user traces for a certain cluster

are running in the system. The resource location assignment problem is formulated as follows:

Given all user traces in a cluster S, i.e., ∀ℜi = ⟨ℜi^t⟩ ∈ S, and a W×H 2-D tile-based NoC3 with a resource set A that satisfies

Σ_{i=1}^{n} N(ri) ≤ (W×H), ∀ri ∈ A,  (4.4)

find a one-to-one resource location assignment Ω( ) from any resource ri in A to a specific tile location, Ω(ri) = (xi, yi), which

minimizes  Σ_{∀ℜi ∈ S} E_comm(ℜi, Ω(A))  (4.5)

such that:  1 ≤ xi ≤ W, 1 ≤ yi ≤ H.  (4.6)

To solve this problem, we need the following notation:

• B(xi, yi): The neighbors of tile (xi, yi), i.e., (xi+1, yi), (xi, yi+1), (xi-1, yi), (xi, yi-1), where

1 ≤ xi+1, xi−1 ≤ W and 1 ≤ yi+1, yi−1 ≤ H.

• Empty/Full tile: The tile (xi, yi) without/with a computational resource ri already

assigned to it.

• Transmission matrix ψ: Each entry ψ uv stores the aggregate communication rate

between resources ru and rv.

The steps for the resource location assignment problem are summarized in Figure 4.4.

Figure 4.4 Main steps for resource location assignment.

• Step 1: Generate an n×n transmission matrix ψ with each entry
  ψ(u,v) = ψ_uv = Σ_{∀ℜi ∈ S} Σ_{∀⟨ℜi^t⟩ ∈ ℜi} Σ_{∀qi ∈ ⟨ℜi^t⟩} Σ_{∀e_jk^qi ∈ E^qi} comm(e_jk^qi), where map(vj) = ru and map(vk) = rv.
• Step 2: Normalize the transmission matrix ψ, i.e., ψ_uv ← ψ_uv / Σ_{v=1}^{n} ψ_uv.
• Step 3: Get the u with the largest ψ_uv or ψ_vu value, then set the location of ru, i.e., Ω(ru) = (xu, yu), to the center of the platform.
• Step 4: Decide B(xu, yu) such that an ri with a greater ψ_ui has a higher probability of being assigned to B(xu, yu), i.e., ψ_u1 : ψ_u2 : ... : ψ_un ≈ N(r1) : N(r2) : ... : N(rn), where r1, r2, ..., rn ∈ B(xu, yu).
• Step 5: Get a filled tile (xu, yu) with the greatest number of empty neighboring tiles.
• Step 6: Repeat Steps 4 and 5 until all resources are assigned to their corresponding tile locations in the NoC platform.

Main idea: We start out by calculating and normalizing the transmission matrix ψ

(Steps 1-2). Then, by allocating two resources, ru and rv, with the highest ψ uv values as close

3. We believe that the dimensions of the mesh (W × H), or even the total number of resources for building the platform, should be determined by previous design experience rather than by the outcome of the synthesis step, since they are related not only to system reliability (e.g., one can have spare cores in the platform), but also to the manufacturing process, or even to chip yield. Apart from such factors, the values of W and H should be close to each other in order to minimize the communication cost among the resources.


as possible, we are able to minimize the communication energy consumption while running applications on the system (Steps 3-5). More precisely, the neighboring resources of ru are

assigned based on the ratio ψui, for i = 1, ..., n, as shown in Step 4.

Complexity: Assume the user trace set is ℜ. The number of exhaustive resource allocation assignments on a (W × H) = Q platform with the resource set (N(r1), N(r2), ..., N(rn)) is

|ℜ| × C(Q, N(r1)) × C(Q−N(r1), N(r2)) × ... × C(Q−N(r1)−...−N(r_{n−1}), N(rn)) = |ℜ| × Q! / (N(r1)! × N(r2)! × ... × N(rn)!).  (4.7)

Our proposed heuristic reduces this problem to a complexity of (|ℜ| + |ψ| × Q), where Steps 1 and 2 need to go through the user trace set ℜ once before generating ψ, of size n × n. Later, in Steps 3-6, based on ψ, we assign each resource in Q greedily


onto the corresponding tile locations in the platform. As shown in experimental results in

Section 4.5, we can obtain a reasonable solution about 4000× faster compared to the optimal

search time.
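The flavor of this placement can be sketched in C++ as follows. Note that this sketch simplifies Step 4 of Figure 4.4: instead of distributing neighbors in proportion to the ψ ratios, it greedily attaches the unplaced resource with the strongest tie to the nearest free tile; all names are illustrative.

#include <vector>
#include <cstdlib>

struct Tile { int x, y; };

// Greedy tile placement (in the spirit of Figure 4.4): resources that
// communicate heavily (large psi[u][v]) end up on neighboring tiles.
// 'psi' is the aggregate inter-resource communication matrix of Step 1.
std::vector<Tile> placeResources(const std::vector<std::vector<double>>& psi,
                                 int W, int H) {
    int n = (int)psi.size();
    std::vector<std::vector<int>> grid(W, std::vector<int>(H, -1));
    std::vector<Tile> loc(n, {-1, -1});
    std::vector<bool> placed(n, false);
    // Step 3: seed the most communicative resource at the platform center.
    int seed = 0; double best = -1.0;
    for (int u = 0; u < n; ++u) {
        double s = 0.0;
        for (int v = 0; v < n; ++v) s += psi[u][v] + psi[v][u];
        if (s > best) { best = s; seed = u; }
    }
    loc[seed] = {W / 2, H / 2}; grid[W / 2][H / 2] = seed; placed[seed] = true;
    // Steps 4-6 (simplified): repeatedly pick the unplaced resource with the
    // strongest tie to a placed one, and put it on the nearest free tile.
    for (int step = 1; step < n && step < W * H; ++step) {
        int bu = -1, bv = -1; best = -1.0;
        for (int u = 0; u < n; ++u) {
            if (!placed[u]) continue;
            for (int v = 0; v < n; ++v)
                if (!placed[v] && psi[u][v] + psi[v][u] > best) {
                    best = psi[u][v] + psi[v][u]; bu = u; bv = v;
                }
        }
        if (bv < 0) break;
        int bx = -1, by = -1, bd = W + H + 1;     // nearest empty tile to bu
        for (int x = 0; x < W; ++x)
            for (int y = 0; y < H; ++y)
                if (grid[x][y] < 0) {
                    int d = std::abs(x - loc[bu].x) + std::abs(y - loc[bu].y);
                    if (d < bd) { bd = d; bx = x; by = y; }
                }
        loc[bv] = {bx, by}; grid[bx][by] = bv; placed[bv] = true;
    }
    return loc;
}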

Of note, we focus only on the NoC platform with a 2-D mesh (W × H) topology, where W and H are design parameters which can be determined by previous experience, as explained in Footnote 3 of this chapter. In addition, minimal-path routing is selected as the routing scheme throughout this chapter, but the communication architecture and routing scheme may be

more general. That is, our proposed resource assignment approach can be extended to

different topologies under different routing schemes, for instance, by redefining B(xi, yi)

which are the neighbors of tile (xi, yi) and giving some weight to B(xi, yi) which would capture

the distance (as determined by the topology and routing scheme) among the computational

resources.

4.4.3. Validation Process

Here, we validate the potential and robustness of our user-centric design flow (see

Figure 4.5). In this chapter, the system is defined as robust if it performs well not only under

ordinary/given cases (i.e. training dataset), but also under unpredictable/unknown cases (i.e.

testing dataset). Generally speaking, the training and testing datasets are usually given. The

training dataset is used to generate platforms under the user-centric design flow, while the

testing dataset is used to determine whether or not this design flow produces robust platforms

for different types of users. Theoretically, the user traces (see Figure 4.1) observed from an

older version of the platform, Dbefore, can be set as the training dataset in order to produce a

new generation platform. Then, we should take the user traces running on the new platform,

Dafter, as the testing dataset in order to validate the design flow. However, in practice, we

cannot have access to the later traces, Dafter, in advance. Therefore, if we have a reasonable


amount of data in Dbefore, then this is usually split into two parts, namely the training and

testing datasets, that are used to build and evaluate the design flow. If we have too little data,

then the bootstrap method is a well-known approach used for generating more data [21].

As seen in Figure 4.5, we are given the user traces in the testing dataset with size Ntesting, i.e., we have Ntesting users in the testing stage, where each user's traces are collected accordingly. For each user i with the collected trace set ℜi, we do the cluster identification check. More precisely, with the information of the identification content (IDC) obtained from the training process (see Figure 4.1 and Steps 4 and 7 in Figure 4.2), we report which cluster this user belongs to; that is, ℜi has a higher inter-application and application-usage similarity coefficient with the other traces belonging to the same cluster (say, cluster k). Ideally, for a user i identified to be in the kth cluster during the testing stage, this user's traces should report the best performance while executed on NoC platform k generated from the training stage. Therefore, to validate the accuracy of our user-centric design flow, we evaluate whether or not the NoC platform k = (Ak, Ω(Ak)) is the best platform for user i, i.e., whether the total energy consumption of running the user traces on it,

Σ_{∀ℜi ∈ S} [ E_comp(ℜi, Ak) + E_comm(ℜi, Ω(Ak)) ],

is smaller than on all other generated platforms.

Figure 4.5 Validation process of the newly proposed methodology. Each of the Ntesting testing user traces goes through the cluster identification check (using the identification content, IDC, and reporting some cluster k); if NoC platform k is indeed the most suitable one, the user is counted as matched (Nok ← Nok + 1).

If yes, it means that the user i passes the validation process and we label it as a match for

user i. Finally, the accuracy rate for our user-centric design flow, i.e., (Nok/Ntesting)×100%,

is reported, where Nok is the number of users that pass the validation process, while Ntesting is the

total number of users in the testing stage. It is obvious that the higher the accuracy rate is, the

more robust the platforms are.
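The accuracy computation itself is straightforward; a minimal sketch (with hypothetical inputs: a per-user energy table and the clusters reported by the IDC check) is:

#include <vector>

// energy[i][k] = E_comp + E_comm of user i's traces on platform k;
// identified[i] = cluster k reported by the IDC check for user i.
// Returns the accuracy rate (Nok/Ntesting) x 100% of the validation process.
double validationAccuracy(const std::vector<std::vector<double>>& energy,
                          const std::vector<int>& identified) {
    int nOk = 0;
    for (std::size_t i = 0; i < energy.size(); ++i) {
        int bestPlatform = 0;                      // platform with minimum energy
        for (std::size_t k = 1; k < energy[i].size(); ++k)
            if (energy[i][k] < energy[i][bestPlatform]) bestPlatform = (int)k;
        if (bestPlatform == identified[i]) ++nOk;  // match: IDC picked the best
    }
    return energy.empty() ? 0.0 : 100.0 * nOk / energy.size();
}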

4.5. Experimental Results

To evaluate the user behavior model and the associated design flow, we apply our

proposed methodology to some real applications with realistic user traces. Our environment

and design inputs are as follows:

• Five different types of computational resources ri are available in the architecture

template; the corresponding processor model and its price (in U.S. dollars, USD), M(ri)

are listed in Table 4.1.

Table 4.1 Architecture template for the NoC platform.

Resource Type, ri              Part Number             Price, M(ri)
r1: DSP 300MHz                 TI TMS320C6203          112
r2: RISC 266MHz                IBM PowerPC 405GP       65
r3: DSP 60MHz                  Analog Devices 21065L   10
r4: x86 μprocessor 400MHz      AMD K6-2E               77
r5: μcontroller 133MHz         AMD ElanSC520           33

• Seven applications are executed on the system platform, including two synthetic

applications generated by the TGFF package [162], Automotive/Industrial,

Consumer, Networking, Office automation, and Telecom from the embedded system

benchmark suite (E3S) [50]. Some pre-processing (such as task binding,

scheduling) is done for these seven applications, where task graphs have sizes 7, 7,


8, 6, 5, 4, and 6, respectively. Each task will later execute on one resource. In addition, the task profiles, i.e., the power consumption of running each task ti on each processor type, are analyzed beforehand under specified performance constraints.

• Hundreds of user traces (i.e., both training and testing datasets) are used to validate the

accuracy of the design flow. Realistic user patterns are collected by observing the behavior of the Windows XP environment for twenty users as they log in and log off the system. We sample the patterns at about 10-minute intervals to generate the traces; the bootstrap method is

later applied to generate even more traces [21].

Assume that, due to various incompatibilities, at most four applications can execute on the

platform simultaneously. In addition, based on data from market surveys or previous design

experience, assume that our goal is to generate three different platforms (i.e., parameter k is set

to 3) in order to satisfy different types of users. The price constraint for each platform is set to

1500 USD (i.e., Φ = 1500).

4.5.1. Evaluation of User Behavior Clustering

The clustering of user behavior is the most critical step in this design flow. Indeed, if the

user traces in the same cluster have a high variance in terms of the resource requirements, the

corresponding platform may not fit most users in this cluster well.

Figure 4.6 shows the clustering results. All feasible Pareto points are derived trading

off the price of the platform and the computation energy consumption. We randomly

select four users in each trace cluster and plot the corresponding Pareto curves. As shown

in Figure 4.6, the variation of users within the same cluster is quite small. We also

produce three resource sets (A1, A2, and A3) for these three trace clusters, while meeting

the price constraint (Φ = 1500). For example, for cluster 1, the resource set A1 consists of


3 resources of type r1, 6 of type r2, 6 of r3, 6 of r4, and 7 of r5, with the total price being equal

to 1479. As shown, these three resource sets (A1, A2, and A3) are quite different although their

prices are close to 1500.

Table 4.2 shows the normalized computation energy consumption of using these three

resource sets with each trace cluster. For example, for the second entry in the second column,

the value 1.22 gives the computation energy consumption ratio of running all traces in

cluster 1 onto A1 and A2; that is,

Σ_{∀ℜi ∈ S1} E_comp(ℜi, A1) : Σ_{∀ℜi ∈ S1} E_comp(ℜi, A2) = 1 : 1.22  (4.8)

Figure 4.6 Pareto points showing the trade-offs between price (in U.S. dollars) and computation energy consumption. For each cluster, four users are randomly selected and their Pareto curves are plotted. The resulting resource sets are A1 = (3,6,6,6,7), A2 = (5,5,3,2,13), and A3 = (3,7,3,4,11).

Table 4.2 Computation energy consumption comparison for three trace clusters and the different resource sets derived by the proposed and traditional design flows.

Resources \ Traces   Cluster 1   Cluster 2   Cluster 3
Set A1               1           1.47        1.35
Set A2               1.22        1           1.33
Set A3               1.33        1.28        1
Set A'               1.50        1.18        1.31

Of note, from Figure 4.6 and Table 4.2, we can conclude that the user-centric

methodology has the potential to separate quite effectively the users having different behaviors when interacting with the system. In addition, by doing so, our platforms can be optimized for each specific cluster of users while satisfying the design goal.

Finally, we compare our proposed methodology against the traditional design flow which

generates only one platform, A’ (see the last row of Table 4.2), while optimizing the

computation energy consumption for the entire set of user traces, under the price constraint

Φ = 1500. As it can be observed, we achieve about 30% computation energy savings, on

average, compared to the unique platform, A’, derived from the traditional design flow.

4.5.2. NoC Platform Evaluation

We first evaluate the solution quality of the computational resource selection algorithm in

Section 4.4.2.I for traces with 200 users in the training dataset (Dbefore), against the best

solution which can be derived from the Pareto curve in Figure 4.6. The experiments are

performed on an AMD Athlon™ 64 Processor 3000+ running at 2.04GHz. Compared to

the optimal solution obtained from exhaustive search, our method consumes 5% more

computation energy, on average, for all these three clusters. However, it requires more than

10 hours to get the optimal resource sets for one cluster, while our algorithm takes only about

10 minutes to produce these reasonable platforms. Of note, for evaluating future systems in


the market with millions of users, the proposed heuristic has the potential to produce platforms within industrial time-to-market constraints.

Next, we evaluate the solution quality of the resource location assignment algorithm

(Section 4.4.2.II) against the optimal solution, given a fixed set of resources running the

user traces. We observe that our method consumes only 7% more energy in

communication, on average, compared to the optimal resource location assignment; moreover, our heuristic takes only several seconds, while obtaining the optimal solution takes hours.

To show the potential of our approach for larger platforms, we apply our proposed

approach to resource selection and allocation on 6 × 6, 8 × 8, 10 × 10 platforms using the

same settings shown in Table 4.1. Our approach has less than 7% computation energy

overhead and 5% communication energy overhead compared to the optimal solution for these

three platform settings. However, our solutions can be obtained within 12, 15, and 20 minutes, while it takes about 40 minutes, 10 hours, and more than three days to get the optimal solution for the 6 × 6, 8 × 8, and 10 × 10 platforms, respectively.

4.5.3. Evaluation of Entire Design Methodology

Finally, we apply the validation process in Figure 4.5 (Section 4.4.3) to show the

potential of the user-centric design methodology. The size of the training dataset ranges from 100 to 700 (we sample the collected user behavior at about 10-minute intervals as users log in and log off the system), while the size of the testing dataset is fixed to 500.

rate, (Nok/Ntesting) × 100%, increases as the size of the training data increases (for training

dataset size of 100, 300 and 500, the accuracy rate is 73%, 84%, and 87%, respectively).

By applying 700 training traces for building these three platforms, we have more information for user behavior clustering and can therefore achieve an even higher accuracy rate (around 90%).

4.6. Summary

In this chapter, we have proposed a unified user-centric design framework for off-line

design space exploration (DSE) and on-line optimization techniques for embedded systems.

Our investigations target primarily heterogeneous multi-processor SoCs with resources

communicating via the NoC approach, but the methodology is completely general and applicable to embedded systems design.

More precisely, in this new design methodology, we consider explicitly the information

about the user experience and apply machine learning techniques to develop a design flow

which aims at minimizing the workload variance; this allows the system to better adapt to

different types of user needs and workload variations. As shown, efficient algorithms have

been proposed for clustering the users’ behavior and automatically generating 2-D NoC

platforms such that the values of the total computation and communication energy

consumption are minimized, given specific design constraints. In addition, a validation process

for the proposed user-centric design flow has been proposed to show the robustness of the

framework. Although we focus on the architectures interconnected by 2D mesh networks with

minimal-path routing schemes, our user-centric design framework can be adapted to other

regular architectures with different network topologies or different deterministic routing

schemes.

Our experimental results using real applications have shown that by considering the user

experience in the design space exploration step, the system platforms generated by our

approach achieve more than 30% total energy savings, on average, compared to the single

platform derived from the traditional design flow; this implies that each system configuration


we generate is highly suitable for the targeted class of user and workload behaviors. Last but

not least, the problems addressed in this work are focused at the system-level, while future

work can cover the other levels of abstraction using a similar philosophy.


5. ENERGY- AND PERFORMANCE-AWARE INCREMENTAL

MAPPING FOR NOC

5.1. Introduction

Having generated NoC platforms which exhibit less variation among the users’ behavior

in Chapter 4, in this chapter, we concentrate on the dynamic resource management process

and present a robust algorithm for heterogeneous NoC-based MPSoCs. Here, we target real-

time applications described as task graphs (see the application modelling in Figure 2.3) as

opposed to general-purpose best-effort applications usually found in chip multiprocessors

(CMPs). As with the target NoC platform discussed in Section 2.1, these applications are mapped onto embedded MPSoCs where the basic architecture consists of homogeneous PEs operating

at multiple voltage levels. More precisely, we assume that only the PEs connected to the NoC

have multiple voltage levels, whereas the network itself (including links, routers, etc.), is in its

own voltage-frequency domain. Of note, our proposed algorithm can also be applied to

platforms with different types of resources without any change with the given CC(ri) for each

resource ri.

A global manager (GM) is responsible for system resource management, which involves mapping the incoming applications to the available PEs and handling the inter-processor communication (the details of the control scheme related to the GM are shown in Section 2.1). Since the arrival order and execution times of the applications are not known at design time (that is, applications arrive at arbitrary times and leave the system after being executed), performing effective run-time mapping is an important and challenging task. Towards this end, we propose a run-time

mapping technique which allocates the appropriate resources to the incoming application tasks

such that the communication energy is minimized, given some deadline constraints. At the

same time, all the pre-existing applications still run on the initial set of resources they have

been allocated to.

To illustrate the proposed methodology, we assume a system architecture with two voltage

levels as shown in Figure 5.1. The gray squares represent the PEs operating at higher voltage

levels, while the black dots show the tasks belonging to a pre-existing application which

cannot be reallocated.

Figure 5.1 Example of NoC incremental application mapping comparing the greedy and our proposed solutions, panels (a)-(g). The greedy approach, which does not consider additional mappings, incurs higher communication overhead for App 2, and a higher system communication cost as well, compared to our proposed solution. [Figure: the App 1 and App 2 task graphs; the initial configuration with a pre-existing application (initial cost = 10); after App 1, greedy solution (cost = 10 + 6 = 16) vs. proposed solution (cost = 10 + 7 = 17); after App 2, greedy solution (cost = 10 + 6 + 12 = 28) vs. proposed solution (cost = 10 + 7 + 8 = 25).]

Applications App 1 and App 2, shown in Figure 5.1(a) and Figure 5.1(b), respectively, need to be mapped sequentially to the initial system configuration

in Figure 5.1(c). Suppose that vertices 4 and 6 are the critical vertices for App 1, while vertex

2 is the critical vertex for App 2; this means that they must be allocated to the PEs operating at

the highest voltage level in order to meet the application deadlines.

After the arrival of each new application App i, a greedy approach would map App i to the

NoC resources such that the inter-processor communication cost of App i is minimized for the

current configuration (that is, ignoring any future arrivals). In this simple example, the total

system communication cost is the sum of the communication costs of all applications; that is:

System communication cost = Σ_(∀App) Σ_(∀(i,j)) MD(vi, vj)    (5.1)

where MD(vi, vj) represents the Manhattan Distance between any two application vertices, vi and vj, connected to each other. As illustrated in Figure 5.1(d), even though the

greedy approach minimizes the communication cost for the current configuration, the newly

generated region consisting of the remaining (available) PEs is quite scattered. Consequently,

mapping any additional application onto this configuration would be ineffective, as it can be

seen for the non-contiguous region of App 2 in Figure 5.1(e).
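To make the metric concrete, the following is a minimal C++ sketch of the cost in Equation 5.1 (our illustration, not the thesis implementation; all type and function names are assumptions):

#include <cstdlib>
#include <utility>
#include <vector>

using Tile = std::pair<int, int>;            // (row, column) of the PE a vertex is mapped to

// Manhattan Distance between two tiles, the MD( ) of Equation 5.1.
int manhattan(const Tile& a, const Tile& b) {
    return std::abs(a.first - b.first) + std::abs(a.second - b.second);
}

// One mapped application: the PE tile of each vertex plus its ACG edges.
struct MappedApp {
    std::vector<Tile> placement;             // placement[v] = tile of vertex v
    std::vector<std::pair<int, int>> edges;  // communicating vertex pairs (i, j)
};

// Equation 5.1: sum of MD over all edges of all applications in the system.
int systemCommCost(const std::vector<MappedApp>& apps) {
    int cost = 0;
    for (const MappedApp& app : apps)
        for (const auto& e : app.edges)
            cost += manhattan(app.placement[e.first], app.placement[e.second]);
    return cost;
}

Evaluated on the placements of Figure 5.1, this metric yields exactly the greedy cost of 28 and the proposed cost of 25 quoted above.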

As opposed to this, our newly proposed methodology does consider applications that may arrive in the system at future times and, consequently, it offers a more effective mapping in the

presence of dynamically incoming applications. Indeed, as shown in Figure 5.1(g), when

App 2 is mapped after App 1, the system communication cost becomes smaller than the cost

obtained using the greedy approach. Intuitively, since the pre-existing applications cannot be

reallocated, the performance of the greedy solution becomes much worse compared to our

proposed solution.

Note that the task migration approach is complementary to our incremental mapping;

indeed, task migration is an effective strategy to achieve load balancing and high resource


utilization. For distributed systems without shared memory support, the task migration policy must be implemented by passing messages among PEs; the implicit migration cost is large due to the need to move the process context [19]. However, for embedded MPSoCs with shared

memory, we have two contexts to worry about from a migration perspective: The user context

(called the remote) and the system context (called the deputy or home node). Only the user

context (i.e. stacks, memory maps, registers of the process) needs to be migrated, while the

system context is kept either on the home node or in the shared memory. Therefore, the

migration process can be implemented with middleware support on top of the operating

system. In this chapter, we do not focus on the task migration process. Instead, we target an

incremental mapping process which does not need to change the current system configuration.

In summary, the novel contribution of this chapter consists of a new approach for dynamic

application mapping such that the total communication energy consumption in the system is

minimized. At the same time, additional applications can be easily added to the resulting

system with minimal communication cost overhead.

The remainder of this chapter is organized as follows: We first review the related work (Section 5.2) and give a motivational example to highlight the key idea of our work (Section 5.3). In Section 5.4, we formulate the problem of run-time incremental mapping and

present the proposed methodology. Then, we propose a two-step algorithm to solve this

problem; more precisely, the near convex region selection problem is discussed in

Section 5.5.1, while the vertex allocation problem is addressed in Section 5.5.2. The

experimental results appear in Section 5.6, while Section 5.7 summarizes our main

contribution.


5.2. Related Work

Resource allocation is a fundamental problem encountered in a variety of areas, including

processor allocation for supercomputers and task assignment in massively parallel processing

systems. While dealing with the resource management process, Karp et al. [88] study the problem of finding the shape of a region assigned to tasks that minimizes the pairwise distance of all points within that region; they observe that there is no closed-form solution for obtaining the optimal region. Bender et al. [14] express the solution to the resource allocation problem as a differential equation and provide a theoretical proof for obtaining the optimal solution. Bender et al. [15] present an approximation algorithm for selecting processors so as to minimize the average number of communication hops in supercomputers. Shojaei et al. [151] present a Pareto-algebra heuristic for finding, at run-time, multiple feasible configurations trading off several design metrics, e.g. energy consumption, against the resource usage of various types of resources.

In terms of the off-line resource allocation problem for NoC, several approaches have

been proposed. Hu et al. in [79] propose a branch and bound algorithm to map IP cores onto a

tile-based NoC architecture, while satisfying the bandwidth constraints and minimizing the

total communication energy consumption. The work in [104] considers the mapping problem

for minimizing the communication delay with split routing.

While dealing with the resource management problem for “multiple applications” in the

system, Pop et al. present an approach to incremental design of distributed systems for hard

real-time applications over a bus [130]. More recently, Murali et al. proposed a methodology

for mapping multiple use-cases onto NoCs, where each use-case has different communication

requirements and traffic patterns [105].


In terms of the on-line resource management for NoCs, the techniques proposed so far rely on a resource manager operating under operating system (OS) control [117]. This OS-controlled mechanism allows the system to operate effectively in a dynamic manner. Smit et al. [154] propose a run-time task assignment algorithm on heterogeneous processors. However, the task graphs are restricted to have either a small number of tasks or a task degree of no more than two. More recently, Carvalho et al. propose a dynamic task mapping scheme for NoC-based heterogeneous MPSoCs, targeting channel load minimization for improved performance [27].

As such, none of the previous work mentioned above maximizes the system efficiency by considering the possible addition of new applications. In this chapter, our goal is to

optimize the communication energy consumption for all possible system configurations (at

different time instances) considering that applications can dynamically arrive and leave the

system.

5.3. Motivational Example

We illustrate the incremental mapping process using three applications. For simplicity, the

system considered in this example has only a single voltage level. As shown in the optimal

mapping solution in Figure 5.2(a), whenever a new application arrives in the system, we

minimize the average communication distance for the incoming application and all existing

applications in the system. Applying the optimal mapping in practice would be infeasible

since the run-time for deciding the configuration which gives the optimal solution and

reconfiguring the previous applications by migrating tasks is too high. However, we can get a

significant insight from analyzing the results produced by an optimal solution. Indeed, by

looking at Figure 5.2, we observe that each application tends to cover a convex region, while

the PE utilization of the system increases. Therefore, if no task migration is allowed,


allocating an incoming application to a region which looks as convex as possible helps

minimize the communication overhead for any additional incoming application (see our

proposed solution in Figure 5.2(b)).

In general, a region is convex if it contains all the line segments connecting any pair of

points inside it. Bender et al. [14] define the region to be optimal if the average distance

between all pairs of points is a minimum; as such, they expect the shape of an optimal region

to be convex. However, the concept of near convex region we use in this chapter is slightly

more general; it stands for a region whose area is close to the area of its convex hull [87]. The

key goals in our approach for selecting a near convex region are to 1) minimize the average

communication distance (i.e., number of hops) between the processors assigned to the tasks of

the currently incoming application and 2) minimize the non-contiguous regions which may

incur a higher communication cost if additional applications are mapped onto them.

Figure 5.2 Motivational example for the incremental mapping process. (a) Optimal solution (b) Near convex region solution. [Figure: snapshots of the optimal mapping for App 1, App 1 + App 2, and App 1 + App 2 + App 3, next to the corresponding near convex region mappings for the same application sequence.]

Consequently, our problem formulation generalizes the optimal region considerations [14] for

dynamic system configurations with limited resources.

5.4. Incremental Run-time Mapping Problem

5.4.1. Proposed Methodology

Our proposed methodology is summarized in Figure 5.3. All applications are described

by ACGs which result from an off-line task partitioning similar to [30][127]. Our on-line

incremental mapping process is activated only when an application arrives in the system¹. Our objective is to first select a near convex region (see Section 5.5.1) and then decide which PE within this region each vertex in the ACG should be mapped to (Section 5.5.2), such that the communication energy consumption is minimized under given timing constraints.

1. Of note, in this dissertation we assume that each application is characterized by one fixed ACG. Therefore, when an application arrives, we know the number of PEs to select in order to meet its deadline. However, the application specification could be made more general by considering different modes of computation. For example, while executing multimedia applications, some users always run in high-quality mode, while others run in low-quality mode, depending on the situation. This mode prediction can be further captured in the user model, as explained in Chapter 7.

Figure 5.3 Overview of the proposed incremental mapping methodology. [Figure: applications App 1, App 2, ..., App n pass through an off-line Task Partitioning Process producing ACG1, ACG2, ..., ACGn; on-line, the GM performs the Near Convex Region Selection (see Section 5.5.1) and Vertex Allocation (see Section 5.5.2) steps and updates the System Utilization.]

To give an example, we follow the incremental mapping process as it deals with an incoming application, shown in Figure 5.4. Here we see the ACG of the incoming application (Figure 5.4(a)) which is going to be allocated to the current system configuration (Figure 5.4(b)). As shown in this example, there are two voltage levels in the system: the gray squares are PEs at the high voltage level, while the white squares are PEs at the low voltage level. The dark vertices in the ACG stand for critical vertices which need to be allocated to PEs at the high voltage level in order to meet the application deadlines. In Figure 5.4(b), the black circles on the utilized PEs of the current system configuration are the tasks of a pre-existing application. The incremental mapping process therefore has to allocate each vertex of the ACG to an idle PE, while minimizing the inter-processor communication and meeting the application deadlines. Two steps are proposed for this process, as explained below².

Figure 5.4 Overview of the proposed methodology. (a) The incoming application ACG (b) Current system configuration (c) The near convex region selection step (d) The vertex allocation step. [Figure: the Application Characterization Graph (ACG) with vertices marked MCR(v) = ‘H’ or MCR(v) = ‘L’; the platform with the GM, PEs at the ‘H’ and ‘L’ voltage levels, idle and utilized PEs, and pre-existing tasks; candidate regions R1, R2, and R3, with R1 assumed selected; the final vertex allocation.]

1. The first step is to select a near convex region, that is, a region as convex as possible. A region is convex if it contains all the line segments connecting any pair of its points [87]; that is, when arbitrarily connecting two points of the region, the line segment between them should lie inside the region. As shown in Figure 5.4(c), R1 and R2 are more convex than region R3, and all of them have at least two PEs at the high voltage level. Selecting a nonconvex region would incur much higher communication costs for additional mappings. We address this issue in more detail in Section 5.5.1. Here, we assume R1 is selected in this example.

2. The second step consists of assigning vertices to PEs within the selected region (with critical vertices mapped onto PEs with higher voltage levels), while minimizing the inter-vertex communication. More details are given in Section 5.5.2.

5.4.2. Problem Formulation

To formulate this problem, we need a few notations as follows:

• PEij: the PE located at the intersection of the ith row and jth column of the network. We assume PE11 is the global manager GM³;

• V(PEij): the voltage level at which processor PEij operates;

• MCR(vi): the minimal computation requirement at which vertex vi should operate in order to meet the application deadlines;

2. For a larger NoC with multiple distributed managers, a hierarchical control mechanism may be needed, similar to the cluster locality approach proposed in [110]. In that case, selecting a suitable cluster for allocating the incoming application would be done before applying our two-step algorithm.

3. Of note, the location of the GM does affect the MD between the PEs and the GM and therefore slightly modifies the energy consumption of moving the control messages, as seen in Equation 2.5. However, compared to the energy consumed on sending the data messages, this difference in energy consumption on the control network is negligible. Here, we assume the GM is located at the top-left corner of the platform.


• MD(PEij, PEkl): Manhattan Distance (MD) between PEij and PEkl.

Using this notation, the problem of dynamic incremental mapping for NoCs can be formulated as follows:

Given the current system behavior and the ACG of the incoming application,

Find a near convex region R and a vertex mapping function map( ), with ∀vk ∈ V, map(vk) → PEij in R, with the objective:

min { Energy = Σ_(∀e(i,j)) w(e(i,j)) × MD(map(vi), map(vj)) }    (5.2)

such that ∀vk ∈ V, V(PEij) ≥ MCR(vk), where PEij = map(vk).
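As a hedged illustration of how a candidate mapping can be checked against this formulation, the C++ sketch below evaluates the Equation 5.2 objective and the MCR constraint; the struct layout and names are our assumptions, not the thesis code:

#include <cstdlib>
#include <vector>

struct Edge { int i, j; double w; };     // ACG edge with communication volume w(e(i,j))

struct Acg {
    std::vector<int> mcr;                // MCR(vk): minimal voltage level of each vertex
    std::vector<Edge> edges;
};

struct Pe { int row, col, voltage; };    // V(PEij) encoded as an integer level

// Returns the Equation 5.2 objective for the mapping map[k] = PE of vertex vk,
// or -1.0 if the constraint V(map(vk)) >= MCR(vk) is violated.
double mappingEnergy(const Acg& acg, const std::vector<Pe>& map) {
    for (size_t k = 0; k < acg.mcr.size(); ++k)
        if (map[k].voltage < acg.mcr[k])
            return -1.0;                 // infeasible: deadline cannot be met
    double energy = 0.0;
    for (const Edge& e : acg.edges) {
        int md = std::abs(map[e.i].row - map[e.j].row) +
                 std::abs(map[e.i].col - map[e.j].col);
        energy += e.w * md;              // w(e(i,j)) x MD(map(vi), map(vj))
    }
    return energy;
}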

5.4.3. Significance of the Problem

To show that the MD metric in the problem formulation heavily affects the communication energy consumption, we consider the following experiment. An ACG is generated using the TGFF package [162]. Then we implement four scenarios for mapping this application onto an 8 × 8 homogeneous NoC. Scenario 1 (S1) in Figure 5.5 uses our method,

scenario 2 (S2) uses the Nearest Neighbor heuristic proposed in [27], scenario 3 (S3)

randomly maps the application vertices inside a 4 × 4 rectangle region, while scenario 4 (S4)

randomly maps the application vertices onto any PEs in an 8 × 8 NoC. The x-axis in

Figure 5.5 represents the average MD between two vertices, while the y-axis represents the

communication energy consumption normalized to that of the first scenario. As we can see,

minimizing the MD between application vertices is an effective way to minimize the

communication energy consumption of the applications.


5.5. Solving the Incremental Mapping Problem

5.5.1. Solutions to the Near Convex Region Selection Problem

When dealing with region selection problem for the incremental mapping process, we

need to minimize the communication cost of the incoming application and, at the same time,

minimize the communication cost overhead for any additional incoming application. To

generalize and formulate this problem, the L1 distance is defined as follows.

Definition 1: The L1 distance of a region R with N tiles, denoted as L1(R), is the total MD

between any pair of these N tiles inside R.
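A direct (O(N²)) computation of Definition 1 can be sketched as follows; this is illustrative C++ code, not taken from the thesis:

#include <cstdlib>
#include <utility>
#include <vector>

using Tile = std::pair<int, int>;

// L1(R): total Manhattan Distance over all unordered pairs of tiles in R.
long long l1Distance(const std::vector<Tile>& region) {
    long long total = 0;
    for (size_t a = 0; a < region.size(); ++a)
        for (size_t b = a + 1; b < region.size(); ++b)
            total += std::abs(region[a].first - region[b].first) +
                     std::abs(region[a].second - region[b].second);
    return total;
}

For example, for a straight line of N = 4 tiles, l1Distance returns 3 + 2 + 1 + 2 + 1 + 1 = 10, which matches the worst-case formula N × (N − 1) × (N + 1)/6 = 4 × 3 × 5/6 = 10 derived in Section 5.5.1.A below.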

The scenario in Figure 5.6(a) covers the general case of the incremental mapping problem. Such a scenario includes some pre-existing applications (i.e., the black circles in Figure 5.6) running in the system, as well as a new incoming application which needs to be allocated on the remaining/available PEs (the M PEs inside the thick-line region R indicated in Figure 5.6(a)).

Figure 5.5 The impact of Manhattan Distance (MD) on communication energy consumption for four different scenarios (S1-S4). [Plot: average Manhattan Distance between vertices (x-axis) vs. communication energy consumption ratio normalized to S1 (y-axis), with markers for S1, S2, S3, and S4.]

Assume that this incoming application requires N PEs (with N < M). Our objective is to find a sub-region R’ with N PEs to assign to this incoming application which minimizes the metric L1(R’) + L1(R-R’), as shown in Figure 5.6(b)⁴. Intuitively, it is difficult to consider the two terms, L1(R’) and L1(R-R’), at the same time. In Section 5.5.1.A, we first focus on minimizing the first term, L1(R’) (we call this the L1 problem), which is a special case of the general allocation problem. This gives us an insight into the problem of minimizing L1(R’) + L1(R-R’), which is discussed in Section 5.5.1.B. Finally, in Section 5.5.1.C, the region selection algorithm is proposed for NoC platforms with multiple voltage levels.

5.5.1.A Minimization of L1(R’)

Minimizing L1(R’) for a region R’ with N tiles is a special case of the general allocation

problem in [15]. To find a lower bound for L1(R’), we first implement the best-case solution to

conjecture the best shape of the region. Then, the worst-case solution in a contiguous region

for this problem is derived as the upper bound. Note that since the incremental mapping

process is done on-line, we need to look for near-optimal solutions with very low cost (i.e.,

4. Of note, if the workload of future applications is predictable, it is advisable to place weights on the two terms, L1(R’) and L1(R-R’). This idea is explored later, in Chapter 7, when the user behavior is taken into account in the resource management process.

Figure 5.6 L1(R’) + L1(R-R’) minimization problem: select a region R’, such that the sum of the total Manhattan Distance (MD) between any pair of tiles inside region R’ and that inside region R-R’ is minimized. [Figure: (a) region R of available tiles, |R| = M; (b) selected sub-region R’ with |R’| = N and remaining region R-R’ with |R-R’| = M-N.]

low computation time). Therefore, we also propose four sub-optimal solutions and check whether any such lower-cost solution gets close enough to the optimal case⁵.

Figure 5.7 plots one region (with N = 20) generated by each of the six cases which include

two possible extreme cases, the Best Case (BC) and Worst Case (WC) and our proposed

solutions (EM, FC, RF, NF).

5. Note that neither the best-case algorithm nor the sub-optimal solutions have a known closed-form formula in terms of N. Therefore, we can only obtain the optimal result from exhaustive search, and the results for the sub-optimal solutions from simulation.

Figure 5.7 Region with N = 20 resulting from several distinct methods, namely (a) Best Case (BC), (b) Worst Case (WC), (c) Euclidean Minimum (EM), (d) Fixed Center (FC), (e) Random Frontier (RF), and (f) Neighbor_aware Frontier (NF). Note that the shape of the resulting regions would be the same even if shifted to other coordinates. Here, we only consider minimizing the total Manhattan Distance between any pair of these N tiles inside R’, i.e., L1(R’). [Panel values: (a) BC, L1 = 563; (b) WC, L1 = 1330; (c) EM, L1 = 566; (d) FC, L1 = 563; (e) RF, L1 = 741; (f) NF, L1 = 570.]

1. Best Case (BC): This corresponds to the optimal solution generated by an exhaustive

search but obviously this works only for moderate values of N.

2. Worst Case (WC): The closed-form solution for the worst case of a contiguous region R with N tiles is:

L1(worst)(R) = [N × (N − 1) × (N + 1)] / 6    (5.3)

Proof: The worst case of L1(R) with N tiles inside R is obtained by placing the fth tile at distance f − k from the kth placed tile, where f = 1 ~ N and k = 1 ~ f − 1 (i.e., all N tiles form a straight line); that is,

L1(worst)(R) = (1) + (1 + 2) + … + (1 + 2 + … + (N − 1))
= (1) × (N − 1) + (2) × (N − 2) + … + (N − 1) × (1)
= Σ_(i=1..N−1) i × (N − i)
= [N × (N − 1) × (N + 1)] / 6

3. Euclidean Minimum (EM): While adding the fth tile into the region, with f = 1 ~ N, the EM heuristic updates the center, (xc, yc), by recalculating the arithmetic mean of the f − 1 tiles and then selects a tile (x, y) with minimum Euclidean distance, √((x − xc)² + (y − yc)²), to the updated center.

4. Fixed Center (FC): The 1st tile of the region is always set as the fixed center, (xc, yc). While adding the fth tile into the region, with f = 1 ~ N, the FC heuristic selects a tile (x, y) with minimum Manhattan Distance, |x − xc| + |y − yc|, to the fixed center.

5. Random Frontier (RF): While adding the fth tile into the region, with f = 1 ~ N, the RF heuristic randomly selects a tile from the frontier of the region consisting of the f − 1 tiles selected so far.

6. Neighbor-aware Frontier (NF): Every tile has four neighbors. A tile is considered to be available if it has not yet been selected into the region. While adding the fth tile into the region, with f = 1 ~ N, the NF heuristic searches the frontier of the resulting region and then selects a tile with the minimal number of available neighbors.

We should note that there exists more than one solution for each of these four heuristics (EM, FC, RF, and NF), and even for the BC and WC scenarios. Figure 5.7 shows a few concrete instances of regions generated by each of the six cases for N = 20. The numbers on the tiles in Figure 5.7(b)-(f) represent the selection order when forming the regions. We initially set the tile (5, 5) as the 1st tile of the region.
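As an illustration of the NF heuristic described above, the sketch below grows a region of N tiles on an otherwise empty grid, always taking the frontier tile with the fewest available neighbors. This is a simplified reading of the heuristic; the grid bounds, the seed tile, and the tie-breaking are our assumptions:

#include <set>
#include <utility>
#include <vector>

using Tile = std::pair<int, int>;

static const int DR[4] = {1, -1, 0, 0};
static const int DC[4] = {0, 0, 1, -1};

// Grow a region of n tiles from 'seed' on a gridSize x gridSize grid using
// the Neighbor-aware Frontier (NF) rule: among the available tiles adjacent
// to the region, pick one with the minimal number of available neighbors.
std::vector<Tile> nfRegion(Tile seed, int n, int gridSize) {
    std::set<Tile> region{seed};
    auto onGrid = [gridSize](int r, int c) {
        return r >= 0 && c >= 0 && r < gridSize && c < gridSize;
    };
    while (static_cast<int>(region.size()) < n) {
        Tile best{-1, -1};
        int bestAvail = 5;                            // above the maximum of 4
        for (const Tile& t : region) {                // frontier candidates
            for (int d = 0; d < 4; ++d) {
                int r = t.first + DR[d], c = t.second + DC[d];
                if (!onGrid(r, c) || region.count({r, c})) continue;
                int avail = 0;                        // candidate's available neighbors
                for (int e = 0; e < 4; ++e) {
                    int nr = r + DR[e], nc = c + DC[e];
                    if (onGrid(nr, nc) && !region.count({nr, nc})) ++avail;
                }
                if (avail < bestAvail) { bestAvail = avail; best = {r, c}; }
            }
        }
        if (best.first < 0) break;                    // no candidate left (grid full)
        region.insert(best);
    }
    return {region.begin(), region.end()};
}

Feeding the resulting region into an L1 computation such as the one in Section 5.5.1 reproduces compact, hole-free shapes of the kind shown in Figure 5.7(f).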

As can be seen in Figure 5.7(a), the resulting shapes for BC are almost “circular”. However, this solution cannot be applied to run-time incremental mapping due to its high run-time overhead. For example, it takes more than 40 minutes to get the optimal solution for N = 20 when running on an Intel® Pentium 4 CPU at 2.60 GHz. On the contrary, the EM, FC, and RF heuristics take less than 10 μsec. We observe that the EM and FC solutions differ by less than 1% from the optimal solution (see Figure 5.7(c) and (d)). For the RF case, there may exist holes inside the region, which greatly increases the L1 distance (e.g., for N = 20, a 31.6% increase of the L1 distance compared to the optimal solution, as in Figure 5.7(e)). In order to reduce the probability of getting holes inside a region, we propose the NF heuristic, which includes the neighbor information. Indeed, the solution produced by the NF for N = 20 is only 1.24% away from the optimal solution, and it can be obtained within 5 μsec.

To show the scalability of these heuristics, we need to test regions containing a large number of tiles. The simulation results for the L1 distance under the BC and WC scenarios and the four heuristics (EM, FC, RF, and NF) are shown in Figure 5.8(a). We plot the L1 distance from running 1000 experiments with N varying from 1 to 200 (see Figure 5.8(a)). Since it takes more than 40 minutes to get the result for the BC even for N = 20, we do not report results for the BC scenario when N is greater than 20 tiles. We do show, however, results for BC when N varies from 1 to 20 (see Figure 5.8(b)).

Figure 5.8 L1 distance results showing the scalability of the solutions obtained via the Best Case (BC), Worst Case (WC) and four heuristics (EM, FC, RF, and NF). [Plots: L1 distance (y-axis) vs. number of tiles N (x-axis); (a) N from 1 to 200 for WC, EM, FC, RF, and NF; (b) N from 1 to 20, additionally including BC.]

From Figure 5.8(a), we observe that the results obtained from the EM and FC cases are close to each other. Of note, the NF heuristic shows only a 2.63% increase of the L1 distance compared to the FC heuristic for N = 200. Moreover, from Figure 5.8(b), the EM, FC, and NF cases are

close to the optimal solution (i.e., the BC scenario) for N varying from 1 to 20.

5.5.1.B Minimization of L1(R’) + L1(R-R’)

Now we have a better sense of how to minimize L1(R’) + L1(R-R’). Considering the system configuration in Figure 5.6(a), assume that the new incoming application has 12 vertices, i.e., |R’| = 12. To minimize L1(R’) + L1(R-R’), we implement the EM, FC, and NF heuristics described above. The first tile of each approach is selected from any boundary tile of R, i.e., any tile on the boundary of the available region or next to the boundary of the existing vertices. Note that, compared to the problem discussed in Section 5.5.1.A, the L1(R’) + L1(R-R’) minimization problem has one obvious limitation: the boundary constraint. Under this limitation, for the EM, FC, and NF cases, we need to add one more constraint; that is, the tile which is selected must be inside the region R. Additionally, for the NF case, since the boundary condition greatly influences the neighbor information, we need additional modifications. Initially, every tile in the grid has four neighbors, except the corner tiles, which have only three neighbors. The other steps are the same as in Section 5.5.1.A.

Figure 5.9 shows the histogram of [L1(R’) + L1(R-R’)] values derived from 1000 runs for each heuristic (EM, FC, and NF). For example, in Figure 5.9(c), which uses the NF approach, the [L1(R’) + L1(R-R’)] distance equals 498 in 369 of the runs; of note, neither EM nor FC ever obtains this small value. We also summarize these data in Table 5.1, which lists the L1 distance of the selected region R’ and the remaining region R-R’ averaged over the 1000 runs. Also, we list the standard deviation over the mean of L1(R’) + L1(R-R’) in the 1000 runs, and the best/worst results for each heuristic.

Table 5.1 L1(R’) + L1(R-R’) minimization problem when using the Euclidean Minimum (EM), Fixed Center (FC), and Neighbor_aware Frontier (NF) heuristics.

Heuristic                     | L1(R’) + L1(R-R’) = distance sum | Std. deviation/mean | Min (distance sum) | Max (distance sum)
Euclidean Minimum (EM)        | 155.190 + 426.076 = 581.266      | 67.095/581          | 504                | 696
Fixed Center (FC)             | 159.086 + 404.782 = 563.868      | 55.503/564          | 502                | 672
Neighbor_aware Frontier (NF)  | 167.604 + 342.268 = 509.872      | 14.544/510          | 498                | 568

Figure 5.9 Histogram over 1000 runs for the L1(R’) + L1(R-R’) minimization problem. We represent [L1(R’) + L1(R-R’)] distances on the x-axis and their frequency of occurrence on the y-axis. [Histograms: (a) EM, (b) FC, (c) NF, over the value range 480-700.]

From Figure 5.9 and Table 5.1, we observe that EM and FC do not work well for solving the L1(R’) + L1(R-R’) minimization problem. As seen in Table 5.1, even though the L1(R’) value of the EM and FC heuristics is quite small, the pairwise distance outside the region, namely L1(R-R’), is relatively high; that is, the selected region does not help additional mappings. On the contrary, the NF, with the neighbor information included, does help the additional mappings (the decrease in the L1(R-R’) distance is 19% and 15% compared to the EM and FC, respectively). Even though the L1(R’) distance of the NF is about 8% and 5% larger than that of the EM and FC, respectively, the total distance, L1(R’) + L1(R-R’), of the NF is still 10% less than the solutions provided by the EM and FC. In addition, when observing the NF in Figure 5.9(c), we have a higher probability of getting a smaller L1(R’) + L1(R-R’) value compared to the EM and FC heuristics shown in Figure 5.9(a) and (b). This matches the goals of the incremental mapping process, namely, to minimize the inter-processor communication cost of the incoming application (i.e., a smaller L1(R’)) and to easily add additional applications to the resulting system with minimal inter-processor communication overhead (i.e., a smaller L1(R-R’)).

5.5.1.C Solution to the Region Selection Problem for the Run-time Incremental Mapping Process

From the discussion in Section 5.5.1.A and Section 5.5.1.B, we decide to apply the NF heuristic to the region selection problem (see Figure 5.3). Since, in the above discussion, all tiles were considered to be the same (i.e., a homogeneous system), we need to define two new terms in order to deal with the heterogeneity of our proposed platform (see Figure 5.4(b)).

• Dispersion factor (D): The dispersion factor of a PE, D(PE), is defined as:

D(PE) = C − number of utilized neighbors of that PE    (5.4)

where C is a constant. For the corner PEs, C = 3; for all other PEs (including the boundary), C = 4⁶.

PEs with a smaller D(PE) value have a higher likelihood of being included into the current region. Indeed, a PE that has most of its neighbors utilized (i.e., a PE with a small D(PE) value) is very likely to be isolated later; selecting this PE for the current region therefore helps reduce its dispersion probability.

• Centrifugal factor (C): The PE centrifugal factor, C(PE), is defined as the Manhattan Distance between the PE and the border of the current region. PEs with a smaller C(PE) value have a higher likelihood of being included into the current region. Indeed, since every PE in a near convex region should be close to the borders of that region, a PE with a smaller C(PE) is better suited for selection to form a near convex region.

Examples of the calculation of the Dispersion and Centrifugal factors are shown in Figure 5.10. PE12, PE13, PE21, PE22, PE23, and PE31 are running pre-existing applications and are considered unavailable. The current region is framed with thicker lines.

6. The reason for not setting C to 2, 3, and 4 for corner tiles, boundary tiles, and the other tiles, respectively, is to avoid a higher probability of repeatedly selecting boundary tiles into the region, which could make the formed region lose its convexity.

Figure 5.10 Dispersion and Centrifugal factor calculation example. [Figure: a 7 × 7 grid with the GM at the top-left and the current region framed with thick lines; annotated values: D(PE32) = 1, C(PE32) = 1; D(PE34) = 4, C(PE34) = 2; D(PE53) = 3, C(PE53) = 1; D(PE77) = 3, C(PE77) = 7.]

We demonstrate how to select and bring PEs into the current region while keeping its shape as convex as possible. In Figure 5.10, four PEs (PE32, PE34, PE53, and PE77) are selected as examples for calculating D(PE) and C(PE). For PE32, D(PE32) = 1 since its neighbors PE22, PE31, and PE42 are unavailable, while C(PE32) = 1 since it has a Manhattan Distance of 1 to the region boundary. The Dispersion and Centrifugal factors of the other three PEs are shown in Figure 5.10. Since the PE with the minimum D(PE) + C(PE) value has the highest likelihood of forming a near convex region, for the current region, PE32 is more likely to be

selected to become part of the region than PE53, PE34, and PE77. The steps of the region selection algorithm (similar to maze routing [93]) are shown in Figure 5.11, assuming k voltage levels in the system, where mk is the number of available PEs in the kth voltage level.

Of note, the value of k in the platform does affect the convexity of the selected region; the larger k is, the less convex the formed region may be. Therefore, if k is too close to the number of PEs in the system, that is, if the platform is highly heterogeneous, the approach proposed in this chapter is not suitable. However, it has been reported that k is much smaller than the number of PEs in the platform, due to the circuit (mixed-clock FIFO) or energy overhead of communication between PEs at different voltage levels. One experiment in [119] shows that, for the telecom benchmark collected from the embedded system synthesis benchmark suite (E3S) [50], having 3 voltage levels on a 5 × 5 NoC (25 PEs in total) reduces the energy consumption more than four times compared to the single voltage level case, while also staying below the energy consumption of the 4, 5, or more voltage level cases. Given a platform with such reasonable heterogeneity, the region selected by our proposed approach (see Figure 5.11) is more convex than those of other task allocation approaches which do not consider additional mappings; this has a huge impact on the overall communication cost, as proved in Section 5.5.1.B.

Figure 5.11 Near convex region selection algorithm.

Step 1): Assign each vertex v to Si, where i is greater than M(v), and sort the sets in non-decreasing order of size, |S1| ≤ |S2| ≤ … ≤ |Sk|.
Step 2): Start with S1; select a PEij with minimum code transfer energy consumption and include it into the region.
Step 3): Update D(PE) and C(PE) for the unselected and idle PEs of that set. Select the PEij with the lowest D(PEij) + C(PEij) into the region. Continue with Step 3 until the number of PEs in the selected region matches the size of this set.
Step 4): Repeat Step 3 for the remaining sets.
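The selection loop of Steps 2-3 can be sketched in C++ as below for one voltage-level set. This is a condensed illustration under our own assumptions (the grid encoding, the corner test, and measuring C(PE) as the distance to the closest region tile), and it omits the per-set voltage filtering and the initial minimum-code-transfer-energy choice:

#include <algorithm>
#include <climits>
#include <cstdlib>
#include <vector>

enum class Cell { Utilized, Idle, InRegion };
using Grid = std::vector<std::vector<Cell>>;

static const int DR[4] = {1, -1, 0, 0};
static const int DC[4] = {0, 0, 1, -1};

static bool inside(const Grid& g, int r, int c) {
    return r >= 0 && c >= 0 && r < (int)g.size() && c < (int)g[0].size();
}

// D(PE) = C - (number of utilized neighbors), with C = 3 for corner PEs
// and C = 4 for all other PEs (Equation 5.4).
static int dispersion(const Grid& g, int r, int c) {
    bool corner = (r == 0 || r == (int)g.size() - 1) &&
                  (c == 0 || c == (int)g[0].size() - 1);
    int utilized = 0;
    for (int d = 0; d < 4; ++d)
        if (inside(g, r + DR[d], c + DC[d]) &&
            g[r + DR[d]][c + DC[d]] == Cell::Utilized)
            ++utilized;
    return (corner ? 3 : 4) - utilized;
}

// C(PE): Manhattan Distance from the PE to the current region (assumed here
// to be the distance to the closest region tile).
static int centrifugal(const Grid& g, int r, int c) {
    int best = INT_MAX;
    for (int i = 0; i < (int)g.size(); ++i)
        for (int j = 0; j < (int)g[0].size(); ++j)
            if (g[i][j] == Cell::InRegion)
                best = std::min(best, std::abs(r - i) + std::abs(c - j));
    return best == INT_MAX ? 0 : best;
}

// Steps 2-3 of Figure 5.11 for one set: grow the region by 'count' idle PEs.
void growRegion(Grid& g, int count) {
    while (count-- > 0) {
        int bestR = -1, bestC = -1, bestScore = INT_MAX;
        for (int r = 0; r < (int)g.size(); ++r)
            for (int c = 0; c < (int)g[0].size(); ++c)
                if (g[r][c] == Cell::Idle) {
                    int score = dispersion(g, r, c) + centrifugal(g, r, c);
                    if (score < bestScore) { bestScore = score; bestR = r; bestC = c; }
                }
        if (bestR < 0) break;            // no idle PE left to select
        g[bestR][bestC] = Cell::InRegion;
    }
}

As the complexity discussion in Section 5.5.1.D notes, keeping the frontier scores in a heap avoids the full-grid rescans of this naive version.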

Now, we consider a simple example and describe our approach step-by-step. The ACG of the incoming application and the current system behavior are given in Figure 5.12(a) and (b), where the black dots show the pre-existing applications in the system.

Figure 5.12 Incremental run-time mapping process. (a) The ACG of the incoming application (b) Current system behavior (c) Near convex region selection process (d) Vertex allocation process. [Figure: the ACG with vertices v1-v9 and MCR(v4) = MCR(v6) = ‘H’, hence |SH| = 2 and |SL| = 7; a 7 × 7 platform with PEs at the ‘H’ and ‘L’ voltage levels and the GM at the top-left; the selected region R1 with per-PE selection order {1}-{9}; the final vertex placement.]

The number marked on

each PE (e.g., {3} on PE32) in Figure 5.12(c) represents the selection order in forming a near convex region. PEs with the same number indicate that they are selected into the region at the same time.

We can see from Figure 5.12(a) that there are two vertices (v4 and v6) with M(v4) = M(v6) = ‘H’ which are supposed to be mapped onto PEs at the high (‘H’) voltage level; the other vertices can be mapped onto PEs at the ‘L’ voltage level; namely, |SH| = 2 (SH = {v4, v6}) and |SL| = 7 (Step 1). Let us start with SH (Step 2), and assume that PE42 is selected first to become part of the region, as it minimizes the code transfer energy consumption (Figure 5.12(c)). Then, PE43 is the second PE selected for the region (Step 3), since it has the lowest D(PE43) + C(PE43) = 3 + 1 (D(PE43) = 3 because PE33, PE44, and PE53 are all idle; comparing this to D(PE44) + C(PE44) = 4 + 2 or D(PE46) + C(PE46) = 4 + 4 implies that PE43 gets selected). Step 3 terminates since the PEs for SH are all selected inside the region. Now, we deal with the selection for SL (Step 4). Going back to Step 3, PE32 is selected since it has the lowest D(PE) + C(PE) = 1 + 1. After that, PE33 is selected with D(PE33) + C(PE33) = 1 + 1, and then PE41 is selected with D(PE41) + C(PE41) = 2 + 1. By the same rule, PE51, PE52, and PE53 are selected with the lowest D(PE) + C(PE) = 4. Finally, PE61 is randomly selected among PE61, PE62, PE63, PE34, and PE54, all of which get the same value of D(PE) + C(PE).

5.5.1.D Complexity of the Region Selection Algorithm

In order to determine the time complexity of the region selection algorithm, assume that the ACG = (V, E) and that the system contains a total of n × n PEs organized in k voltage levels, where |V| < n². Therefore, |S1| + |S2| + ... + |Sk| = |V|, where Si is the set of vertices to be selected for the ith voltage level, and m1 + m2 + ... + mk ≤ n², where mi is the number of available PEs at the ith voltage level.

In Step 1 of Figure 5.11, calculating the sizes of the PE sets takes O(V) time, and the run-time for sorting them is O(V log V) if using QUICKSORT. Steps 2-4 need O(m1² + m2² + ... + mk²), since in Step 3 we need to update D(PE) + C(PE) for each PE in a certain set, which takes linear time. The worst-case scenario occurs when only one PE is selected into the region each time. Thus, the total run-time of the region selection algorithm is O(n⁴ + V log V); that is, O(n⁴), since V log V < n² log n² < n⁴. However, in Steps 3-4, we can record the frontier of the region and store the information, D(PE) + C(PE), of this wavefront in a HEAP. Using this data structure, the run-time for Steps 3-4 is reduced from O(m1² + m2² + ... + mk²) to O(S1 log S1 + S2 log S2 + ... + Sk log Sk) = O(V log V). Thus, the total time complexity of the region selection algorithm becomes O(V log V).

5.5.2. Solutions to the Vertex Allocation Problem

After the near convex region is selected, we continue allocating vertices of the incoming

application to the PEs with specific voltage levels in the selected region (see Figure 5.3),

while minimizing the inter-processor communication. To keep track of the vertex allocation

process, we color each vertex white, gray, or black. A gray vertex indicates that it has some

tentative PE locations but its precise location will be decided later. On the contrary, a black

vertex indicates that it has been already mapped onto some PE and this mapping will not

change anymore. All vertices start out being white and may later either become gray and then

black, or become directly black. A PE is set to be unavailable after a black vertex is mapped

onto it.

We define two actions for vertices:

• DISCOVER: This consists of 1) Select available PEs with a specific voltage level for

vertex t and 2) Color vertex t gray; then, vertex t is considered as “discovered”.


• FINISH: This consists of 1) Select a specific PE for vertex t such that the distance

between vertex t and its gray or black neighboring vertices is minimized. (Note that if

more than one PE gets the minimum distance, we select the PEij with its D(PEij) closest

to the number of nonblack neighbors of vertex t) and 2) Color vertex t black; then,

vertex t is considered as “finished”.

In short, we first sort vertices into an ordered set using the non-increasing order of their

total communication volume; that is, the higher communication volume a vertex has, the

earlier it is discovered or finished. The vertex allocation algorithm is summarized in

Figure 5.13.

Let us now follow the same example as in Figure 5.12. The ACG in Figure 5.12(a) is going

to be mapped onto the region R1 which has been selected in Section 5.5.1.C. The final result is

shown in Figure 5.12(d); Figure 5.14 shows the vertex allocation process step-by-step.

Remember that for this ACG, the smallest vertex set is SH= {v4 , v6}.

Assume now that, based on the total communication volume, the vertex ordered set is {9, 6, 7, 5, 8, 4, 1, 3, 2}. Also, assume that all vertices are initially white (Figure 5.14(a)).

Figure 5.13 Vertex allocation algorithm.

Step 1): Color all vertices white. Then, start with the first white vertex in the smallest vertex set based on the ordered set.
Step 2): IF the neighbors of vertex t are neither gray nor black, then do DISCOVER for vertex t.
Step 3): IF the neighbors of vertex t are either gray or black, then do FINISH for vertex t.
Step 4): Go back to the first vertex of the ordered set; do Steps 2 and 3 for each nonblack vertex t until the color of any nonblack vertex changes. Then go to Step 5.
Step 5): Repeat Step 4 if there exists any nonblack vertex in the ordered set; otherwise, stop the algorithm.
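An illustrative C++ skeleton of this coloring loop is given below. The DISCOVER and FINISH actions (candidate-PE selection and distance minimization) are abstracted behind callbacks, the smallest-set starting rule is omitted, and a connected ACG with at least one edge is assumed, so this is only a sketch of the state machine, not the thesis implementation:

#include <functional>
#include <vector>

enum class Color { White, Gray, Black };

struct TaskGraph {
    std::vector<std::vector<int>> adj;   // ACG adjacency lists
    std::vector<int> order;              // vertices in non-increasing comm. volume
};

void allocateVertices(const TaskGraph& g,
                      const std::function<void(int)>& discover,  // tentative PEs, gray
                      const std::function<void(int)>& finish) {  // pin one PE, black
    std::vector<Color> color(g.adj.size(), Color::White);        // Step 1
    bool anyNonBlack = true;
    while (anyNonBlack) {                                        // Steps 4-5
        anyNonBlack = false;
        for (int v : g.order) {
            if (color[v] == Color::Black) continue;
            anyNonBlack = true;
            bool coloredNeighbor = false;
            for (int u : g.adj[v])
                if (color[u] != Color::White) { coloredNeighbor = true; break; }
            if (coloredNeighbor) {                               // Step 3: FINISH
                finish(v);
                color[v] = Color::Black;
                break;                                           // a color changed
            }
            if (color[v] == Color::White) {                      // Step 2: DISCOVER
                discover(v);
                color[v] = Color::Gray;
                break;                                           // a color changed
            }
            // gray with only white neighbors: keep scanning the ordered set
        }
    }
}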

Figure 5.14 Vertex allocation process based on the example in Figure 5.12. (a) Initial configuration with every vertex white. (b) Vertex 6 is discovered. (c) Vertex 9 is discovered. (d) Vertex 7 is finished and colored black. (e) Vertex 9 is colored from gray to black. (f) Vertex 6 is colored from gray to black. (g) Vertex allocation process is done; all vertices are colored black. [Figure: the ACG and the selected region at each of the steps (a)-(g), with tentative (gray) PE candidates and final (black) placements marked.]

We start with vertex 6, since it has the smallest order among the vertices in SH (Step 1). Since at this time its neighbors, vertices 5 and 7 (see Figure 5.14(a)), are white, we DISCOVER this vertex (i.e., color it gray) and select PE42 and PE43 (because these are the only PE locations at the ‘H’ voltage level in region R1) as its candidates (Step 2); namely, vertex 6 will later be allocated onto PE42 or PE43 (Figure 5.14(b)). Then, continuing with Step 4, we go back to the first vertex in the ordered set. At this moment, the color of vertex 9 changes from white to gray, since all neighbors of vertex 9 are white; we select PE32, PE33, PE41, PE51, PE52, PE53, and PE61 for it (Figure 5.14(c)). During the following repetition of Step 4, the colors of vertices 9 and 6 remain unchanged. Then, we consider vertex 7; its color changes from white to black directly, since vertices 6 and 9 are gray. We allocate vertex 7 onto PE52 (Figure 5.14(d)), since PE32, PE33, PE41, PE52, and PE53 have the minimum distance to the gray PEs where vertex 6 may be allocated, and D(PE52) = 3 equals the number of nonblack neighbors of vertex 7. Then, since there still exist nonblack vertices in the ordered set, we repeat Step 4. Vertex 9 becomes black, since its neighbor, vertex 7, is colored black (Figure 5.14(e)), and we allocate vertex 9 onto its precise location, PE51 (Figure 5.14(f)). We continue this process until all vertices are colored black and each vertex is allocated a precise PE location. Figure 5.14(g) shows the final result of the vertex allocation process.

Complexity of the vertex allocation algorithm: The total run-time of our algorithm has a complexity of O(V² + E). This is because the body of the loop (Steps 4-5) executes |V| times, while reaching at most |V| vertices each time. In addition, since each vertex is reached at most two times (i.e., DISCOVER and FINISH), and the adjacency list of each vertex is scanned only when the vertex is reached, the total time for scanning the adjacency lists is O(E).


5.6. Experimental Results

We first evaluate the impact of the near convex region selection and vertex allocation steps of the incremental mapping process using synthetic benchmarks (see Section 5.6.1 and Section 5.6.2, respectively). Then, the overall algorithm, with the run-time energy overhead considered, is evaluated using synthetic benchmarks (see Section 5.6.3). To show the potential of our proposed mapping algorithms for real applications, we later apply them to the embedded system synthesis benchmark suite, E3S [50] (Section 5.6.4).

5.6.1. Evaluation of Region Selection Algorithm on Random Applications

To show that the choice of a near convex region heavily impacts the communication cost of the incremental mapping process, we consider the following experiment. Several sets of applications are generated using the TGFF package [162]. The vertex number and the communication volume are randomly generated according to some specified distributions. Then, applications are randomly selected for mapping onto the system or for removal from it. For mapping onto the resulting system, while the pre-existing applications remain fixed, two different strategies are implemented: 1) a greedy approach minimizing the inter-processor communication cost of the current configuration, but without considering newly incoming applications, and 2) a near convex region is first selected using the proposed approach, and then the application is optimally mapped onto this region using exhaustive search.

The number of vertices per application ranges between 5 and 10. The system consists of 7 × 7 PEs, with PE11 being used as the global manager. The variance of the communication volume per edge in one application is set arbitrarily between 0 and 10⁶. Initially, there is no application in the system. The sequence of events in the system is incremented whenever an application arrives in or departs from the system. If the number of idle PEs in the system is smaller than the number of vertices of the incoming application, then the incoming application is not accepted.

Figure 5.15(a) shows the inter-processor communication cost ratio between the mapping

using Strategy_1 (i.e., without selecting a region) and that in Strategy_2 (with selecting a near

convex region). Here, the inter-processor communication contains all communications (i.e.,

pre-existing and the incoming applications) in the system. We also show the number of

utilized PEs (except the GM) in that particular system configuration.

Figure 5.15 (a) Impact of the region selection process on inter-processor communication. (b) Communication energy loss: optimal mapping vs. our allocation algorithm given a selected region. (c) Optimal vs. our allocation algorithm under different communication rates. (d) Communication energy savings: arbitrary mapping vs. our allocation algorithm. [Plots: (a) communication cost ratio of Strategy_1 vs. Strategy_2 and the number of utilized PEs over a random sequence of incoming applications; (b) communication energy loss (%) vs. number of vertices per application (8-12); (c) communication energy loss (%) vs. variance of communication rate per edge (10⁰-10⁶); (d) communication energy savings (%) vs. number of vertices per application (8-30).]

As shown in Figure 5.15(a), there is a slight increase in the communication cost ratio at the beginning, because the greedy approach performs well when the number of utilized PEs in the system is small. Once the number of utilized PEs increases due to the incoming applications, the benefit of our proposed algorithm becomes obvious. Finally, the ratio becomes stable, since for Strategy_1, whenever an application leaves the system, a scattered region is always left behind for additional mappings. This example demonstrates that near convex region selection definitely helps the incremental mapping process.

5.6.2. Evaluation of Vertex Allocation Algorithm on Random Applications

We first compare the run-time and solution quality of our algorithm against the exhaustive approach. The experiments here are performed on an Intel® Pentium 4 CPU (2.60 GHz, with 768 MB memory), while we later report the run-time and energy overhead of running our algorithms on a real embedded processor in Section 5.6.4. The run-time for finding the optimal mapping within the selected region increases exponentially with the number of vertices in each application: for 8, 9, 10, 11, and 12 vertices in one application, it takes 0.2 sec, 1.5 sec, 4 min, 10 min, and 2 hrs, respectively, to obtain the optimal mapping. On the other hand, the run-time of our algorithm stays within 3 μsec when the number of vertices varies between 8 and 20. Since finding the optimal mapping for a region with 13 vertices takes more than 26 hrs, we vary the number of vertices per application from 8 to 12 (see the points on the x-axis in Figure 5.15(b)). More specifically, there are 5 categories (|V| = 8-12), each category containing 40 applications generated with TGFF.

We denote the energy consumption of our allocation algorithm by Eh, and the energy consumption of the optimal mapping within the same region by Ee. Thus, (Eh − Ee)/Ee × 100% is the percentage of energy loss compared to the optimal solution. As shown in Figure 5.15(b), the energy loss for each category is always less than 21%, and it does not scale up as the problem size increases. Therefore, our vertex allocation algorithm provides good results for large designs.

Now, we address the impact of the variance of the communication rate per edge on the energy consumption in Figure 5.15(c). The vertex number in one application used in this experiment is fixed to 10, and the variance of the communication volume per edge is set to 10⁰, 10¹, 10², ..., 10⁶ (7 categories). For each category, we run 50 different ACGs and calculate the averages. It can be seen from Figure 5.15(c) that our average communication energy loss in all categories is within 21.5% of the optimal solution for the same region.

Last, in Figure 5.15(d), we compare the solution of an arbitrary mapping against our algorithm. Since the run-time of arbitrary mapping is very small, we can consider ACGs with a large number of vertices (i.e., up to 30 in Figure 5.15(d)) and see if our algorithm scales well. The number of vertices per application used in this experiment ranges between 8 and 30 (i.e., 6 categories, |V| = 8, 10, 15, 20, 25, and 30). We generate 20 different regions with PE locations corresponding to the number of vertices in each category and then run 20 different applications on each selected region. The variance of the communication volume per edge and per application is set between 0 and 10⁶.

We denote the energy consumption of our allocation algorithm by Eh, and the energy consumption of the arbitrary mapping solution by Ea; then, (Ea − Eh)/Ea × 100% is the percentage of energy savings compared to the arbitrary mapping solutions, which are averaged over 500 random results. As shown in Figure 5.15(d), at least 45% savings can be achieved in all categories; of note, the savings increase as the vertex count scales up.


5.6.3. Random Applications Considering Energy Overhead for the Entire Incremental Mapping Process

We compare seven scenarios against our near convex region selection and vertex allocation technique (denoted as ‘our_all’ in Figure 5.16) in terms of communication energy consumption. In this comparison, the communication energy consumption includes the energy overhead of delivering control messages (see Section 2.1) and that of running our on-line processes. The latter overhead is measured by executing the C programs on a MicroBlaze processor running on a Xilinx Virtex-II Pro XC2VP30 FPGA. For all these scenarios, we first perform the region selection process using the neighbor-aware frontier (NF), Euclidean minimum (EM), or fixed center (FC) heuristic, or our near convex region selection technique (our). Then, we perform vertex allocation using either random mapping (random) or our proposed approach (our) presented in Section 5.5.2.

Figure 5.16 Communication energy consumption comparison using random applications. [Plot: total communication rate per application (bits, x-axis) vs. communication energy consumption ratio relative to ‘our_all’ (y-axis), for (NF + our), (EM + our), (FC + our), (NF + random), (EM + random), (FC + random), and Nearest Neighbor.]

For example, in Figure 5.16, the notation ‘NF + our’ indicates that we implement the ‘NF’ algorithm (Section 5.5.1.A) for the region selection step and then use ‘our’ vertex

allocation solution (Section 5.5.2) for mapping vertices into the selected region. The experimental results in Figure 5.16 already include the energy overhead of running these algorithms, i.e., ‘NF’, ‘EM’, or ‘FC’, and ‘our’; the results in Figure 5.16 assume that the ‘random’ algorithm has zero energy overhead.

Finally, the last scenario we consider is the state-of-the-art mapping approach proposed in

[27]. In this case, we allocate the vertices as close as possible without considering a particular

region. The comparison with this scenario is marked as ‘Nearest Neighbor’ in Figure 5.16.

As observed in Figure 5.16, when the total communication rate of the applications is small, our approach incurs some quality loss due to the comparable run-time energy overhead of running our mapping algorithm. However, when the total communication volume per application is over 10000 bits, we achieve more than 37.5% = (1.6 − 1)/1.6 communication energy savings compared to all the other scenarios.

5.6.4. Real Applications Considering Energy Overhead for the Entire Incremental Mapping Process

The communication energy overhead of the on-line processes comprises the message transmission over the control network and the execution of our on-line algorithms (i.e., near convex region selection and vertex allocation). The incremental mapping process is activated only when a new application arrives, at which point the PEs need to send their status to the GM (global manager). The communication volume for all control messages is [a bits (for the PE address, which depends on the network size) + 1 bit (PE status)] × MD (the sum of the Manhattan Distances of all PEs to the GM). For the 6 × 6 network, a = 6 (since 2^6 = 64 > 36) and MD = 180; therefore, all control bits for one incoming application are on the order of 1 Kilobit. Compared to the communication volume in real applications (which is in the Megabits range), the energy overhead for transmitting the control messages is negligible.
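For concreteness, the following minimal C++ sketch (ours, not code from the thesis tool flow) reproduces this control-volume estimate for an n × n mesh; the GM position and helper names are our assumptions.

    #include <cmath>
    #include <cstdio>
    #include <cstdlib>

    // Control-message volume (in bit transmissions) for one incoming
    // application on an n x n mesh: each PE sends a = ceil(log2(n*n))
    // address bits plus 1 status bit to the global manager (GM),
    // weighted by its Manhattan distance to the GM.
    long controlBits(int n, int gmX, int gmY) {
        int a = static_cast<int>(std::ceil(std::log2(n * n)));
        long md = 0;  // sum of Manhattan distances of all PEs to the GM
        for (int x = 0; x < n; ++x)
            for (int y = 0; y < n; ++y)
                md += std::abs(x - gmX) + std::abs(y - gmY);
        return static_cast<long>(a + 1) * md;
    }

    int main() {
        // 6 x 6 network with the GM at a corner tile: a = 6 (2^6 = 64 > 36)
        // and MD = 180 as in the text, giving 7 x 180 = 1260 bits.
        std::printf("%ld bits\n", controlBits(6, 0, 0));
        return 0;
    }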


Next, we evaluate the extra energy overhead of our on-line algorithms. Our system contains 6 × 6 PEs of type AMD ElanSC520 (133 MHz) and AMD K6-2E (500 MHz), plus one MicroBlaze core (100 MHz) for the global manager running our on-line algorithms. To evaluate the potential of our on-line algorithms for real applications, we apply them to the embedded system synthesis benchmark suite, E3S [50]. We first perform the off-line partitioning process for each benchmark. The communication energy consumption is measured by a C++ simulator using the bit energy model [170]. We start with a given system configuration running a set of pre-existing applications. We define the following terms for the energy saving calculation:

• Ph: the communication power consumption of our mapping algorithms

• Pa: the communication power consumption of the state-of-the-art allocation scheme which maps tasks onto a contiguous region and as close as possible [27]

• Pon-line: the power consumption of running the on-line algorithms (a constant, obtained from the MicroBlaze datasheet)

• Tet: the execution time of that application

• Ton-line: the execution time of running our on-line algorithms (obtained from a MicroBlaze processor running on a Xilinx Virtex-II Pro XC2VP30 FPGA)

Thus, the communication energy savings of our algorithms compared to the mapping approach proposed in [27] are calculated as follows:

((Pa × Tet) − (Ph × Tet + Pon-line × Ton-line)) / (Pa × Tet) × 100%    (5.5)
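As a sanity check of Equation 5.5, the sketch below plugs in hypothetical power numbers (Pa, Ph, and Pon-line are illustrative values of ours; only Ton-line = 0.053 msec is taken from the text) to show how the one-time on-line overhead amortizes as Tet grows:

    #include <cstdio>

    // Equation 5.5: savings of our mapping (power Ph, plus the one-time
    // on-line overhead Pon_line * Ton_line) over the baseline [27]
    // (power Pa), both running for Tet. Units must simply be consistent.
    double savingsPercent(double Pa, double Ph, double PonLine,
                          double Tet, double TonLine) {
        double baseline = Pa * Tet;
        double ours = Ph * Tet + PonLine * TonLine;
        return (baseline - ours) / baseline * 100.0;
    }

    int main() {
        // Hypothetical Pa = 10, Ph = 5, Pon_line = 2 (arbitrary units);
        // Ton_line = 0.053 msec (telecom, from the text). The overhead
        // dominates at small Tet and amortizes as Tet grows, mirroring
        // the trend of Table 5.2.
        for (double tet : {0.007, 0.03, 0.2})
            std::printf("Tet = %.3f msec -> %.1f%% savings\n",
                        tet, savingsPercent(10.0, 5.0, 2.0, tet, 0.053));
        return 0;
    }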


Table 5.2 Mapping approach proposed in [27] vs. our algorithm results.

Benchmark   Tet, application exec. time (msec)   Communication energy savings
telecom     0.007                                0%
            0.03                                 25%
            > 0.2                                50.3%
consumer    0.0005                               0%
            0.008                                25%
            > 0.04                               47%

As shown in Table 5.2, if the telecom benchmark runs for only 0.007 msec, we cannot achieve any communication energy savings; the energy overhead of running our algorithms plus the energy with our algorithms applied is almost the same as the communication energy with the allocation scheme proposed in [27]. However, if the telecom benchmark runs longer (take 0.03 msec, for example), we already gain 25% communication energy savings compared to the mapping solution in [27]; we note that the overhead of running our algorithms is included.

We observe that about 48.6% communication energy savings can be achieved, on average, compared to the implementation proposed in [27] when the execution time of applications is over 0.2 msec. The run-time overhead of executing the incremental mapping process on MicroBlaze (i.e., Ton-line) for telecom and consumer is 53 μs and 42 μs, respectively.

5.7. Summary

Achieving effective run-time mapping on MPSoCs is a challenging task, particularly since

the arrival order of the target applications is not known a priori. In this chapter, we target real-

time applications which are dynamically mapped onto heterogeneous embedded MPSoCs


where communication happens via the NoC approach and resources connected to the NoC

have multiple voltage levels.

More precisely, we have addressed the energy- and performance-aware incremental mapping problem for NoCs with multiple voltage levels and proposed an efficient technique (consisting of near convex region selection and vertex allocation processes) to solve it. As shown, using the near convex region selection technique, the mapping results of our algorithms can be obtained very efficiently; also, they are not far from the optimal case. Moreover, additional incoming applications can be added into the system with minimal communication overhead. Experimental results have shown that the proposed technique is very fast and that as much as 50% communication energy savings can be achieved compared to a state-of-the-art task allocation scheme.

Of note, in this chapter, we have addressed the run-time resource management problem on the 2-D mesh-based platform, i.e., a platform with regular topology. However, the workload variation on the system may result from the system itself or from users’ interaction with the system; we discuss these two factors in the context of run-time optimization in Chapter 6 and Chapter 7, respectively.


6. FAULT-TOLERANT TECHNIQUES FOR ON-LINE RESOURCE MANAGEMENT

6.1. Introduction

Resource utilization and system reliability are critical issues for the overall computing

capability of MPSoCs running a mix of small and large applications [23]. This is particularly

true for MPSoCs consisting of many cores that communicate via the NoC approach since any

failures propagating through the computation or communication infrastructure can degrade the

system performance, or even render the whole system useless. Such failures may result from

imperfect manufacturing, crosstalk, electromigration, alpha particle hits, or cosmic radiation,

and be permanent, transient, or intermittent in nature [145].

Existing fault-tolerant (FT) techniques for NoC resilience target the device, packet/data, or

end-to-end transaction levels of abstraction [53][142]. However, there is a need to

complement these approaches by handling failures at system-level and thus ensuring

resiliency while maintaining the required levels of system performance. From this perspective,

it has been shown that adding spare cores and wires can significantly improve the reliability,

reduce the cost, and be a substitute for the burn-in process [65][145]. For instance, for the

Intel 80-core processor [72], adding 10 or 20 spare cores to achieve 10 × 9 or 10 × 10

configurations, can make the system yield jump to 90% and 99%, respectively [146].

As shown in Figure 6.1, the NoC platform we consider is a 2-D tile-based architecture,

which consists of various resources and network elements. More precisely, the resources

consist of computational tiles (i.e. processors/cores/resources) and memory tiles, while the


network elements consist of routers, links, and resource-network interfaces. In the remainder of this chapter, we use the terms “resource” and “core” interchangeably when there is no ambiguity.

In terms of the computational tiles, we assume a j-out-of-i-core model [145]; that is, except for the distributed manager tiles that control the status of the entire system, the platform consists of i cores, of which at least j should be defect-free (or active, reachable) cores responsible for running the application tasks in order to satisfy the system performance requirements. In other words, if there exist k (permanent) faulty cores in the system due to imperfections in manufacturing (see the ‘flash’ sign in Figure 6.1), then we assign the remaining i − j − k cores as spares for application computation. Of note, some design parameters of the model, i.e., i, j, and k, are related to the chip yield or manufacturing process; a more detailed discussion on them can be found in [145][146].

Figure 6.1 Non-ideal 2-D mesh platform consisting of resources connected via a network. The resources include computational tiles (i.e., manager tiles, active and spare cores) and memory tiles. Permanent, transient, or intermittent faults may affect the computational and communication components on this platform. (Legend: ‘CP’ computational core/tile; ‘MEM’ memory tile; ‘MA’ distributed global manager; ‘S’ spare core; routers and links; permanent faulty cores and transient/intermittent faulty cores marked with ‘!’.)

Coming back to Figure 6.1, the main task of the manager tiles (‘MA’ tiles in Figure 6.1) is to 1) decide on resource management and 2) control the migration process via the platform operating system. The role of a spare core (‘S’ in Figure 6.1) is to replace the (transient or intermittent) faulty cores (see ‘!’ in Figure 6.1) or other unreachable cores (due to the

failure of the system interconnect). In other words, each active core has a probability p > 0 to

be affected by transient, intermittent, or permanent faults. We note that p is not a constant

during the chip lifetime, as it depends on chip lifetime cycles, processor utilization, or even

temperature. If necessary, the application tasks assigned to the active core will migrate to the

spares in order to continue being processed. Such task migration processes are controlled by

the distributed manager tiles.

Performing effective resource management for such irregular MPSoCs while failures occur dynamically, and minimizing the communication energy consumption while maximizing the entire system performance, is a challenging task. It is obvious that the lack of regularity increases the distance among various cores; this may further incur higher network contention on inter- or intra-application communication. In turn, the contention in the network may degrade the system throughput. Critical factors causing system degradation need to be quantified in order to handle the dynamic application mapping on such irregular platforms. In addition, when a transient, intermittent, or permanent failure occurs, the system must be able to isolate the failure from the offending resource; thus, some mechanisms are needed in order to avoid failure propagation to the rest of the system.

Given the above considerations, we address the problem of run-time fault-tolerant resource management, with the objective of allocating the application tasks to the available, reachable, and defect-free resources in irregular NoC-based multiprocessor platforms (i.e., a j-out-of-i computation model with known and dynamic faulty probability p for each active core). The goal of this dynamic technique is to minimize the communication energy consumption and network contention, while maximizing the overall system performance. The challenge of the approach is to manage the run-time and energy overhead of running such an algorithm, while maintaining useful levels of fault tolerance in the network [42]. Our contributions are as follows:


• First, we explore the spare core placement problem and investigate the impact on failure

propagation probability.

• Second, we analyze the major factors that produce network contention before

investigating critical metrics for measuring the network contention and system

fragmentation, as well as their impacts on system performance.

• Third, we propose and evaluate an efficient algorithm for fault-tolerant resource

management with the goal of minimizing the communication energy consumption and

maximizing the overall system performance.

Taken together, these specific contributions improve the system-level resiliency, while

optimizing the communication energy consumption and the system performance.

The remainder of this chapter is organized as follows. In Section 6.2, we review the

relevant work. Section 6.3 analyzes the impact on network contention and spare core

placement. In Section 6.4, we investigate several critical metrics and provide insight into the

FT resource management problem on irregular platforms. The problem formulation and

details of the proposed FT algorithms are presented in Section 6.5. Experimental results are

presented in Section 6.6. Finally, we summarize our contribution in Section 6.7.

6.2. Related Work and Novel Contributions

There is considerable work on online failure/error diagnosis and detection for

multiprocessor systems at micro-architecture level with low power and area overhead

[49][96]. Besides this work, operating system control in NoC-based multiprocessor platforms

has been proposed to support system-level fault-tolerance [117]. Other techniques for failure/

error as well as thermal monitoring for NoC platforms have been proposed in [84][136]. More

recently, Huang et al. have taken the system lifetime reliability into consideration at design time, while dealing with the application task mapping in NoC-based MPSoCs [82].

stochastic and adaptive routing schemes [53][102][142].

There exists prior work on run-time application mapping on NoCs that aims at optimizing

the packet latency and power/energy consumption [27][79][110][154]. However, to the best of

our knowledge, this is the first work that considers the run-time fault-tolerant resource

management on non-ideal NoC platforms which is able to cope with the occurrence of static

and dynamic failures on both computational and communication components. Of note, while we

assume there exists a fault/error detection scheme with support for thermal monitoring, we

focus our attention on application mapping for such non-ideal NoC platforms which support

multiple applications entering and leaving the system dynamically.

6.3. Analysis for Network Contention and Spare Core Placement

6.3.1. Network Contention Impact

Since applications enter and leave the system dynamically, contention among application communications can hardly be avoided. Therefore, before proposing the mechanism for run-time FT resource management (especially on such an irregular NoC), it is necessary to quantify the impact of all possible communication contention on system performance; here, we classify contention into three types: source-based, destination-based, and path-based. Figure 6.2 captures one application mapping on a mesh-based 3 × 3 NoC, where the application characteristic is defined in Section 2.2.

• Source-based contention: it occurs when two traffic flows originating from the same

source contend for the same links, as shown in Figure 6.2(b).


• Destination-based contention: it occurs when two traffic flows which have the same

destination contend for the same links, as shown in Figure 6.2(c).

• Path-based contention: it occurs when two traffic flows which neither come from the same source nor go towards the same destination contend for the same links somewhere in the network, as shown in Figure 6.2(d). These two traffic flows can come from the same application or from different applications, giving the so-called internal or external network contention defined in [98] and [99]. By its definition, path-based contention mostly comes from external contention (see the classification sketch below).
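The distinction can be captured in a few lines; the sketch below (our illustration, with hypothetical types) classifies the contention between two flows that are already known to share at least one link under the routing function:

    #include <string>

    struct Flow { int srcRow, srcCol, dstRow, dstCol; };

    // Classify the contention between two flows that share at least one
    // link (link sharing itself must be derived from the routing, e.g.
    // XY routing): same source -> source-based; same destination ->
    // destination-based; otherwise path-based.
    std::string contentionType(const Flow& a, const Flow& b) {
        if (a.srcRow == b.srcRow && a.srcCol == b.srcCol)
            return "source-based";
        if (a.dstRow == b.dstRow && a.dstCol == b.dstCol)
            return "destination-based";
        return "path-based";
    }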

To illustrate the impact of the source-based, destination-based, and path-based network

contention on the packet latency, we consider the following experiment, i.e., several mapping

configurations (see Figure 6.3) in a 4 × 4 mesh NoC: without/with only source-based

contention (cases 1 vs. 2), without/with only destination-based contention (cases 3 vs. 4), and

without/with only path-based contention (cases 5 vs. 6). We apply the XY routing and

wormhole switching for data transmission with 5 flits per packet. The communication rate (or

the packet injection rate from the source core) of transmissions in each configuration is set to

be the same. For fixed injection rates in each configuration, we run 100 different experiments

and calculate the corresponding average packet latency and throughput; the latency is calculated from the time when packets are generated at the sources to the time when the packets reach the destination. The results are plotted in Figure 6.3, with the x-axis showing the total injection rate of all transmissions in that configuration and the y-axis showing the average packet latency at the corresponding injection rate.

Figure 6.2 Application mapping on a mesh-based 3 × 3 NoC: (a) application characteristic ACG = (V, E); (b) source-based contention; (c) destination-based contention; (d) path-based contention.

Figure 6.3 The (a) source-based, (b) destination-based, and (c) path-based contention impact on average packet latency (axes: packet injection rate, in packets/cycle, vs. average packet latency; cases 1/2: without/with source-based contention; cases 3/4: without/with destination-based contention; cases 5/6: without/with path-based contention).

As seen in Figure 6.3(a), for source-based contention (i.e., cases 1 and 2), the throughput

is the same. This makes sense since every generated packet needs to pass through the link

from the source core to its router; therefore, the system performance is basically limited by the

injection rate of the source core.

For destination-based contention (i.e., cases 3 and 4 in Figure 6.3(b)), the system

throughput has about 2% improvement. We observe that the bottleneck of these two

configurations (i.e., cases 3 and 4) is actually due to the link between the router and its

corresponding destination core for which all packets contend. Obviously, such intra-tile

contention can be mitigated via careful hardware/software codesign (i.e., a clustering process), but cannot be solved via mapping.

As seen in Figure 6.3(c), there is a dramatic throughput difference when comparing cases

5 and 6 (without/with path-based contention, respectively) as 118% throughput improvement

is observed (i.e., the throughput improves from 0.16 to 0.35 without path-based contention in

the network). Moreover, we observe that the frequency of occurrence of the path-based

contention is much higher compared to the source-based and destination-based contention as

the system size scales up. By doing several experiments involving many runs, we observed

that the ratio of path-based to source-based contention and the ratio of path-based to

destination-based contention increase linearly with the network size (i.e., for 4 × 4, 6 × 6,

8 × 8, and 10 × 10, the ratios are 1.2, 2.5, 4.0, and 5.6, respectively). Therefore, in the remainder of this chapter, we focus on minimizing the path-based contention since it has the most significant impact on the packet latency and can be mitigated through the mapping process.


Of note, to show the impact of path-based contention minimization on system performance, we support this claim using an integer linear programming-based (ILP-based) contention-aware mapping technique with the goal of minimizing the network contention and communication energy consumption (see Appendix B for a detailed formulation of the ILP method and more experimental results for synthetic and real applications). Through such ILP-based analysis, it can be observed that by mitigating the critical contention, i.e., path-based contention, the end-to-end average packet latency can be significantly decreased with minimal communication energy overhead. Indeed, this concept has been explored in Chapter 5. As discussed in Section 5.5.1, the near convex region selection step allows us to minimize the network contention resulting from different applications, which mitigates a large portion of path-based contention. Since it is impossible to remove all internal and external path-based contention during run-time mapping on such an irregular platform, we investigate several metrics, as discussed in Section 6.4, in order to achieve this goal.

6.3.2. Spare Core Placement

Any fault-tolerant (FT) scheme needs to exhibit: i) no single point of failure, ii) no single point of repair, iii) fault detection and recovery, iv) fault isolation to the failing core, and v) fault containment to prevent propagation of the failure [59]. For the first two requirements, it is clear that since spare cores exist, if any of the cores in the system fails, it is unlikely to bring the entire system to a halt. In addition, we do not need to shut down the entire system in order to replace a failed core; instead, we can simply rely on a state recovery scheme in each core [64] or replace the failed core with a spare one at run-time via task/process migration. Task migration at task- and resource-level has been well studied for reduced response time [19][161] and proactive/reactive interrupts [140] between processors, and so is out of the scope of this chapter. Instead, we focus on task migration at system-level, e.g., spare core placement and spare selection for faulty core replacement.

From a system-level point of view, the spare core placement problem needs to be addressed since it directly affects the last three properties of the FT scheme, especially for systems relying on NoC-based communication. Indeed, with a good spare core placement, not only do the distances between the spare and faulty cores decrease, but failure propagation to the rest of the system is also avoided.

Figure 6.4 (a) Application Characterization Graph (ACG) of the incoming application; (b) Case 1 (Side): spare cores (‘S’) are assigned towards the side of the system; (c) Case 2 (Random): spare cores are randomly distributed in the system; (d) Case 3 (Uniform): spare cores are evenly distributed in the system. (Legend: ‘MA’ manager core; ‘MEM’ memory; ‘S’ spare core; existing and incoming tasks; permanent faulty cores; a transient fault on resource r20; the failure contamination area (FCA) and the migration from the faulty core to a spare.)

Assume that an incoming application (see its ACG in Figure 6.4(a)) needs to be mapped onto a 6 × 6 NoC platform interconnected via a mesh network under wormhole switching and XY routing (if there exist failed links, minimal-path routing is used instead). Each resource rmn is located in the NoC at the intersection of the mth row and nth column. Several spare core placement schemes are studied here: Case 1) Side assignment: assign the spare cores to the side of the system (Figure 6.4(b)); Case 2) Random assignment: randomly distribute the spare cores in the system (Figure 6.4(c)); and Case 3) Uniform assignment: evenly distribute the spare cores in the system (Figure 6.4(d)).

Intuitively, the distance among the active cores in Case 2 and Case 3 is higher than that in Case 1, since the effective system size grows by interleaving the spares; this, in turn, results in higher communication energy consumption and lower system performance. For example, for the incoming application in Figure 6.4(b) and (c), MD(e12) = MD(e13) = 1 in Case 1, while MD(e13) = 3 in Case 2. However, when a transient fault occurs at core r20, the manager will assign the closest spare to recover from the fault. Therefore, cores r25 and r11 are selected in these two cases, which means that the distance between the faulty core and the closest spare is 5 and 2, respectively. Moreover, we define the failure contamination area (FCA) to reflect the failure propagation probability, namely the greatest area resulting from the communication re-routing while replacing the faulty core with a spare. As seen in Figure 6.4(a), since vertices v2, v3 and v5 communicate with vertex v4, the FCA in Case 1 is much higher than that in Case 2. Shown as the thick frames in Figure 6.4(b), (c), and (d), the FCA value in Cases 1 and 2 is 18 and 6, respectively; this may further degrade the performance of other existing applications, shown with black dots in the system.


Figure 6.5 shows the quantitative analysis of the performance impact with the number of spares ranging from 1 to 10 in a 10 × 10 NoC. Since the system configurations cannot be predicted in advance, we measure the all-to-all Manhattan Distance between all active cores, Daa, for these three spare core placement cases, where a smaller Daa value represents a higher probability of obtaining smaller communication energy consumption for the entire system. In addition, the average Manhattan Distance between the active cores and the closest spares, denoted Das, is important for fault isolation and containment. For the random placement case, i.e., Case 2, we perform 50 experiments and report the average Daa and Das values. As seen in Figure 6.5, as the number of spares increases, the Daa value in Cases 2 and 3 is slightly greater than that in Case 1 (less than 7% overhead), but ends up with half of the migration distance as the number of spares becomes greater than 6. In addition, we observe that the Daa and Das results for Case 2 and Case 3 are very close to each other.
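For reference, the two placement metrics can be computed directly from the tile coordinates; the following is our sketch, which takes Daa as the average over all unordered pairs of active cores, matching the ‘avg.’ label in Figure 6.5:

    #include <algorithm>
    #include <cstdlib>
    #include <limits>
    #include <vector>

    struct Tile { int row, col; };

    int manhattan(Tile a, Tile b) {
        return std::abs(a.row - b.row) + std::abs(a.col - b.col);
    }

    // Daa: average all-to-all Manhattan Distance between active cores;
    // a smaller value suggests lower communication energy overall.
    double daa(const std::vector<Tile>& active) {
        long sum = 0, pairs = 0;
        for (size_t i = 0; i < active.size(); ++i)
            for (size_t j = i + 1; j < active.size(); ++j, ++pairs)
                sum += manhattan(active[i], active[j]);
        return pairs ? static_cast<double>(sum) / pairs : 0.0;
    }

    // Das: average distance from each active core to its closest spare,
    // a proxy for fault isolation/containment (task migration distance).
    double das(const std::vector<Tile>& active,
               const std::vector<Tile>& spares) {
        long sum = 0;
        for (const Tile& a : active) {
            int best = std::numeric_limits<int>::max();
            for (const Tile& s : spares)
                best = std::min(best, manhattan(a, s));
            sum += best;
        }
        return active.empty() ? 0.0 : static_cast<double>(sum) / active.size();
    }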

Figure 6.5 Quantitative analysis of the performance impact of three different spare core placements (x-axis: number of spare cores in a 10 × 10 NoC; left axis: Daa ratio, i.e., Daa_random/Daa_side and Daa_uniform/Daa_side, the average all-to-all Manhattan Distance between active cores; right axis: Das ratio, i.e., Das_random/Das_side and Das_uniform/Das_side, the average distance between active cores and the closest spare cores).

In conclusion, distributing spares at random or evenly within the system slightly increases the all-to-all Manhattan Distance among the active cores. This may cause some additional communication energy consumption during the mapping process, but it maintains useful levels of fault isolation and containment compared to the scenario where spares are placed on the sides. More evaluations of system throughput for spare core placement appear in Section 6.6.2.

6.4. Investigations Involving New Metrics

In practice, applications can enter and leave the system dynamically, and since the faults existing in the system can be transient or intermittent, the locations of faulty cores may change at run-time. Therefore, our goal is to find a mapping function, map( ), for allocating the incoming application tasks (see application modeling in Section 2.2) to the reachable, available, and fault-free cores on a non-ideal 2-D mesh NoC platform (see Figure 6.1) such that i) the communication energy consumption is minimized, ii) the network contention is minimized, and iii) the entire system performance is maximized. Of note, the applications’ execution times and their relative ordering are not known in advance; thus, considering the entire system optimization during mapping is the critical challenge in solving this problem. Again, the application modeling and the communication energy modeling have been described in Section 2.2 and Section 2.3, respectively. Even though we assume the platform under consideration is based on a 2-D tile-based architecture and a wormhole switching scheme, we emphasize that our proposed algorithms can be extended to other irregular topologies with other switching schemes.

Three new performance metrics are defined next in order to reach these three goals of the FT mapping function map( ).


1. Weighted Manhattan Distance (WMD): Let vertices vi and vj be mapped onto resources

rab and rcd, respectively. The weighted Manhattan Distance between any two vertices is

defined by

comm(eij) × MD(map(vi) , map(vj)) = comm(eij) × (|a-c| + |b-d|) (6.1)

Based on the bit energy metric [170], it is obvious that the weighted Manhattan distance

is positively correlated with the communication energy consumption.

2. Link Contention Count (LCC): Link contention occurs when two communication flows eij and ekl from the same or different ACGs, where i ≠ k and j ≠ l, contend for the same link somewhere in the network. Such link contention can produce a significant degradation of system performance.

3. System Fragmentation Factor (SFF): This factor reflects the degree to which the non-contiguity of one application may affect other regions where vertices of different applications may be allocated. The system fragmentation factor is defined as

SFF = (w × h − |V| − f − s) / (w × h)    (6.2)

where w and h are the width and the height of the minimal enclosing rectangle covering the mapping solution of ACG = (V, E), while f and s are the number of faulty cores and spares in that rectangle, respectively. Therefore, a smaller SFF value for each application is a good indication of optimizing the entire system performance.

One example can be seen in Figure 6.6, where two possible mapping results (map( ) and map’( )) are shown with solid and dotted circles, respectively. With the same ACG shown in Figure 6.4(a), the WMD of vertex v1 and vertex v6 in map( ) is r(e16) × 2, while in map’( ) it is r(e16) × 5. Under the routing shown in Figure 6.6, the LCC in map( ) for the ACG is 1 (i.e., e13 and e24 share the same link); the LCC in map’( ) is 5. In addition, the SFF in map( )


for the ACG is 0.11 = 1/9 while that in map’( ) is 0.33 = 4/12. As seen, the larger the SFF, the

more interference exists between the cores from different applications (see the dashed line in

Figure 6.6(b)).
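The first and third metrics follow directly from the definitions above; the sketch below (ours) computes WMD and SFF, and enumerates the links of an XY route, from which the LCC can be counted by intersecting the link sets of two flows. The f = s = 1 split in the SFF comment is our reading of the map( ) example.

    #include <cstdlib>
    #include <utility>
    #include <vector>

    struct Tile { int row, col; };

    // Weighted Manhattan Distance (Equation 6.1) for an edge e_ij with
    // volume comm(e_ij), when v_i -> r_ab and v_j -> r_cd.
    long wmd(long comm, Tile a, Tile b) {
        return comm * (std::abs(a.row - b.row) + std::abs(a.col - b.col));
    }

    // System Fragmentation Factor (Equation 6.2). For the map( ) example
    // in the text: sff(3, 3, 6, 1, 1) = 1/9 ~ 0.11 (assuming one faulty
    // core and one spare inside the 3 x 3 enclosing rectangle).
    double sff(int w, int h, int nV, int f, int s) {
        return static_cast<double>(w * h - nV - f - s) / (w * h);
    }

    // Links traversed under XY routing: along the columns (x) first,
    // then along the rows (y). Two flows with distinct sources and
    // distinct destinations that share a link add to the LCC.
    std::vector<std::pair<Tile, Tile>> xyRouteLinks(Tile src, Tile dst) {
        std::vector<std::pair<Tile, Tile>> links;
        Tile cur = src;
        while (cur.col != dst.col) {
            Tile next{cur.row, cur.col + (dst.col > cur.col ? 1 : -1)};
            links.push_back({cur, next});
            cur = next;
        }
        while (cur.row != dst.row) {
            Tile next{cur.row + (dst.row > cur.row ? 1 : -1), cur.col};
            links.push_back({cur, next});
            cur = next;
        }
        return links;
    }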

Now, we evaluate the composite effect of these three metrics (i.e., WMD, LCC, SFF) on the average packet latency². Several ACGs are generated using the TGFF package [162],

where the number of vertices ranges from 5 to 15. Then, we implement three different

scenarios (i.e., Random mapping (Random), Multiple Buddy Strategy (MBS) [99] and Nearest

Neighbor approach (NN) [27]) for allocating application tasks onto a 10 × 10 NoC with

randomly selected faulty and spare cores. For each scenario, the average packet latencies, as

well as the average values of the three metrics defined above are calculated for twenty

different system configurations.

We employ 3D Kiviat graphs to provide a composite view of the impact of these three metrics [112]. A Kiviat graph consists of three dimensions, each representing one of the aforementioned metrics emanating from a central point, as seen in Figure 6.7. Each metric varies from zero to the largest value observed in the experiments. As shown, the composite view of the three metrics lies within the shaded area. Intuitively, the smaller the area, the better the system performance and the lower the communication energy consumption of a particular approach.

2. Of note, our objective function (i.e., system performance and communication energy consumption) is positively correlated with the average packet latency [27][79].

Figure 6.6 Two mapping results for the ACG in Figure 6.4(a), where the spare cores are randomly placed on the platform: (a) map( ), (b) map’( ).
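The thesis does not spell out how the Kiviat area is computed, but the values in Table 6.1 below are consistent with the standard area of a polygon drawn on three axes 120° apart, sketched here:

    #include <cmath>

    // Area of a 3-axis Kiviat (radar) plot with axes 120 degrees apart:
    // the polygon splits into three triangles of area
    // (1/2) * m_i * m_j * sin(120 deg) = (sqrt(3)/4) * m_i * m_j.
    double kiviatArea(double wmd, double lcc, double sff) {
        return std::sqrt(3.0) / 4.0 * (wmd * lcc + lcc * sff + sff * wmd);
    }
    // e.g., kiviatArea(0.95, 0.93, 0.94) = 1.1478, reproducing the
    // Random column of Table 6.1.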

Table 6.1 reports the values of these three metrics and the corresponding Kiviat area K( ); the system latency is measured in cycles and the communication energy consumption is normalized to the Random approach. As we can see, due to its smaller Kiviat area, the NN approach performs better and consumes less communication energy than the MBS and Random approaches.

Table 6.1 Comparison among the Random, MBS [99], and Nearest Neighbor (NN) [27] mapping methods.

metrics \ mapping scenario             Random    MBS [99]   NN [27]
WMD                                    0.95      0.60       0.2
LCC                                    0.93      0.60       0.57
SFF                                    0.94      0.85       0.58
Kiviat area                            1.1478    0.5975     0.2427
average packet latency                 342.1     148.27     94.22
normalized comm. energy consumption    1         0.48       0.32

Figure 6.7 3D Kiviat plots showing the WMD, LCC, and SFF metrics for three different mapping schemes: (a) Random, (b) MBS, and (c) NN.


6.5. Fault-Tolerant Resource Management

In general, FT resource management covers i) migrating tasks from faulty cores to spares (discussed in Section 6.5.1) and ii) allocating the incoming application tasks to available, reachable, and fault-free cores (discussed in Section 6.5.2). The platform and application models have been discussed in Section 2.1 and Section 2.2, respectively. Here, we define some properties before formulating the FT scheme:

• There are two sets of cores/tiles rmn in the NoC platform, namely {CP, MEM} (see the ‘CP’ and ‘MEM’ cores in Figure 6.1)³.

• All applications have undergone off-line analysis and are described by ACG = (V, E) (see Section 2.2), where the type of each vertex vi, type(vi), can be either ‘t’ (i.e., representing a cluster of tasks) or ‘b’ (i.e., representing a buffer or memory unit). Tasks belonging to the same cluster/vertex should run on their own defect-free computational core (‘CP’), while buffer modules should be assigned to the memory tiles (‘MEM’) of the NoC platform.

• For rmn ∈ CP, s(rmn) stands for the status of the core located at rmn: s(rmn) = −3 if the core is assigned to be a spare, s(rmn) = −2 if the core is permanently faulty, s(rmn) = −1 if the core is affected by transient or intermittent faults, s(rmn) = 0 if the core has already been assigned to some application, and s(rmn) = 1 if the core is idle/available.

• map( ): vi → map(vi) = rmn stands for a mapping function from one vertex to one core.

3. As explained in Footnote 2 of Chapter 2, we may have memory modules inside the MPSoC platform for memory-intensive applications, as well as some vertices being characterized as buffer units in applications.


The FT resource management strategy is described in Figure 6.8. Of note, in this chapter, we assume that the failure rate of each core is updated based on the temperature of the core. In addition, we apply reactive task migration⁴ on faulty cores (see Section 6.5.1). As seen, our scheme supports multiple applications entering and leaving the system dynamically (see Section 6.5.2), and the distributed managers keep track of the cores’ status.

6.5.1. RUN_MIGRATION Process

Similarly to the control scheme in [117], our NoC platform includes the data and control

network. The reactive task migration procedure is given in Figure 6.9.

We note that the FCA value in Step 02 (Figure 6.9) is highly dependent on the spare core placement, whose impact is evaluated in Section 6.3.2. The run-time and energy overhead of Steps 01 and 03 are discussed in [19][161] based on the process response times and code sizes at task- and resource-level, and are out of the scope of this work. Indeed, we consider here task migration at system-level, so we focus on spare core placement, spare selection for faulty core replacement, etc., instead of showing the details of the task migration at task- and resource-level (e.g., setting up the interrupts in the code for doing the migration).

4. The reactive task migration process implies that after one core fails, it will be replaced by a spare, while the proactive task migration implies that the system manager monitors the failure probability of each core and migrates a failure-prone core to a spare before it actually fails. Although we focus on the former scheme, it can easily be extended to a proactive scheme with additional NoC monitoring [84][117].

Figure 6.8 The FT resource management framework.

    while(1) {
        if (faults are detected at rmn && migration is necessary)
            RUN_MIGRATION(rmn);
        if (one application ACGQ enters && resources are enough)
            map(∀ vi ∈ V in ACGQ) = RUN_FT_MAPPING(conf, ACGQ);
        if (one application ACGP leaves)   // update core status
            ∀ rmn ∈ map⁻¹(vi ∈ V in ACGP): s(rmn) ← 1;
    }

Figure 6.9 Main steps of RUN_MIGRATION process.

    01: The failed core tmn sends out a message to the manager through the control network.
    02: The distributed manager searches for the closest available spare which results in a smaller FCA value.
    03: Execute the code migration or related data transmission through the data network.

6.5.2. RUN_FT_MAPPING Process

Problem Formulation

Given the current system configuration, conf, and the incoming application ACGQ,

Find map( ): vi → map(vi) = rmn, ∀ vi ∈ V in ACGQ, which minimizes the Kiviat area K( ) corresponding to the three metrics WMD, LCC, and SFF,

Such that:

∀ vi ≠ vj ∈ V: map(vi) ≠ map(vj)    (6.3)

∀ vi ∈ V with type(vi) = ‘t’: map(vi) = rmn ∈ CP and s(rmn) = 1    (6.4)

∀ vi ∈ V with type(vi) = ‘b’: map(vi) = rmn ∈ MEM    (6.5)


Equation 6.3 means that each vertex should be mapped to exactly one tile and no tile can

host more than one vertex. Equation 6.4 and Equation 6.5 imply that the vertices should be

assigned to the correct type of resources in the system.
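Constraints (6.3)-(6.5) amount to a simple validity check on a candidate mapping; here is a sketch with our own data types, mirroring the status encoding s(rmn) defined above:

    #include <map>
    #include <set>
    #include <vector>

    enum class TileType { CP, MEM };
    enum class VertexType { Task, Buffer };   // 't' and 'b' in the text

    struct Vertex { int id; VertexType type; };
    struct TileInfo { TileType type; int status; };  // status = s(rmn)

    // Check a candidate mapping against (6.3)-(6.5): injective
    // vertex-to-tile assignment, tasks on idle CP cores (s(rmn) = 1),
    // buffers on MEM tiles.
    bool isValid(const std::vector<Vertex>& V,
                 const std::map<int, int>& mapTo,        // vertex id -> tile id
                 const std::map<int, TileInfo>& tiles) {
        std::set<int> used;
        for (const Vertex& v : V) {
            auto it = mapTo.find(v.id);
            if (it == mapTo.end()) return false;
            if (!used.insert(it->second).second) return false;          // (6.3)
            const TileInfo& t = tiles.at(it->second);
            if (v.type == VertexType::Task &&
                (t.type != TileType::CP || t.status != 1)) return false; // (6.4)
            if (v.type == VertexType::Buffer &&
                t.type != TileType::MEM) return false;                   // (6.5)
        }
        return true;
    }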

RUN_FT_MAPPING Algorithm

Any run-time algorithm must be lightweight and have low energy consumption. Hence, we define a few variables to achieve such a goal. For each tile t, the number of available (and non-faulty) neighboring cores is stored in the variable neighbor[t], while the center of the current resulting region R is stored in the variable center[R]. ED(tij, tkl) stands for the Euclidean Distance between tiles tij and tkl, i.e., ED(tij, tkl) = (|i − k|² + |j − l|²)^1/2.

The steps of the RUN_FT_MAPPING algorithm are shown in Figure 6.10. We assume the numbers of vertices of type ‘b’ and ‘t’ in the incoming ACG are S1 and S2, respectively.

In Steps 01-07, we select a region from the current system configuration with the goal of minimizing the SFF; this helps reduce the LCC caused by different applications. Also, the number of cores belonging to the MEM and CP sets should be equal to the number of vertices of types ‘b’ and ‘t’ in the ACG. Then, in Steps 08-10, we assign vertices to the selected region with the goal of minimizing the WMD and LCC caused by this incoming application. By minimizing these three metrics, we seek to obtain a smaller Kiviat area through the incremental mapping process.

The run-time complexity of the FT algorithm is O(V log V + E log E). In Steps 03-07, if we search the possible tiles and store the information neighbor[tmn] + ED(tmn, center[R]) of this wavefront in a heap, the complexity is O(V log V). In Steps 08-10, with another heap structure, it takes O(E log E) to get the next unassigned vertex from the ACG and O(V log V) to find an available tile for it inside the region R. Of note, our FT approach can be implemented on platforms that support different topologies (i.e., torus, or even irregular networks) by modifying the neighbor[t] value accordingly.

Figure 6.10 Main steps of RUN_FT_MAPPING process.

Input: (1) the current system configuration, conf; (2) one incoming application ACGQ = (V, E).
Output: mapping solutions for all vertices in ACGQ to fault-free cores of the corresponding types.

    01: Set a region R ← ∅.
    02: If S1 > 0, mode = 1; otherwise, mode = 2.
    03: If mode = 1, select a core rmn ∈ MEM with minimum code transfer energy
        consumption and R ← R ∪ {rmn}, then go to 04. If mode = 2, randomly select a
        tile rmn ∈ CP and R ← R ∪ {rmn}, then go to 06.
    04: If S1 > (# of cores ∈ MEM in R), select rmn ∈ MEM with the smallest
        neighbor[rmn] + ED(rmn, center[R]) and R ← R ∪ {rmn}.
    05: Repeat 04 until S1 = (# of cores ∈ MEM in R), then go to 06.
    06: If S2 > (# of cores ∈ CP in R), select rmn ∈ CP with the smallest value of
        neighbor[rmn] + ED(rmn, center[R]) and R ← R ∪ {rmn}.
    07: Repeat 06 until S2 = (# of cores ∈ CP in R), then go to 08.
    08: Start with the vertex vk in ACGQ having the largest Σi=1..|V| (comm(eki) + comm(eik))
        value and map it onto the rmn ∈ CP or MEM closest to center[R] if type(vk) = ‘t’
        or ‘b’, respectively.
    09: Pick an unassigned vertex vi with the largest comm(eki) + comm(eik) value to all
        assigned vertices vk in ACGQ, and map it onto one available core rmn in R (i.e.,
        map(vi) = rmn) such that the WMD value between vi and all other assigned vk is
        minimized. If more than one tile qualifies, select the one which results in the
        smallest LCC value.
    10: Repeat 09 until all vertices are assigned to tiles selected in R.
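To make the heap-based selection of Steps 04-07 concrete, here is a sketch of ours: the platform state is abstracted into freeTile and neighborCount, and center[R] is frozen at the seed tile for brevity, although it shifts slightly as tiles are added.

    #include <cmath>
    #include <functional>
    #include <queue>
    #include <set>
    #include <utility>
    #include <vector>

    struct Tile { int row, col; };

    // Grow the region R from a seed by repeatedly absorbing, from the
    // frontier of R, the free tile with the smallest
    // neighbor[t] + ED(t, center[R]); keeping the frontier in a min-heap
    // yields the O(V log V) bound cited above.
    std::vector<Tile> growRegion(Tile seed, size_t needed, int n,
                                 const std::vector<std::vector<bool>>& freeTile,
                                 const std::vector<std::vector<int>>& neighborCount) {
        const int dr[] = {0, 0, 1, -1}, dc[] = {1, -1, 0, 0};
        auto ed = [&](int r, int c) {
            return std::hypot(r - seed.row, c - seed.col);
        };
        using Entry = std::pair<double, std::pair<int, int>>;
        std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> frontier;
        std::set<std::pair<int, int>> seen{{seed.row, seed.col}};
        std::vector<Tile> region;
        frontier.push({0.0, {seed.row, seed.col}});
        while (!frontier.empty() && region.size() < needed) {
            std::pair<int, int> rc = frontier.top().second;
            frontier.pop();
            region.push_back({rc.first, rc.second});
            for (int d = 0; d < 4; ++d) {   // expand the wavefront
                int r = rc.first + dr[d], c = rc.second + dc[d];
                if (r < 0 || c < 0 || r >= n || c >= n) continue;
                if (!freeTile[r][c] || !seen.insert({r, c}).second) continue;
                frontier.push({neighborCount[r][c] + ed(r, c), {r, c}});
            }
        }
        return region;
    }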


6.6. Experimental Results

6.6.1. Evaluation with Specific Patterns

In this section, we evaluate our FT mapping algorithm using a set of widely-used

workloads consisting of 1) communication-intensive applications with all vertices

communicating in an all-to-all fashion and 2) applications where all vertices communicate

with each other through a central memory only (denoted as one-to-all communication).

Several sets of applications are generated using the TGFF package [162] with the number of

vertices ranging from 5 to 35 in one ACG. The communication rates are randomly generated

according to some specified distributions. The sequences of incoming applications are also

generated randomly.

In terms of spare cores, we consider two spare core placement scenarios: 1) Side placement, where all spares are assigned towards the sides, and 2) Random placement, where spares are randomly distributed across the platform. Also, 10% of the computational cores are assumed to be permanently faulty due to the manufacturing process and are randomly distributed across the platform. As observed, the uniform spare core placement scenario (Case 3 discussed in Section 6.3.2) gives similar results to the random spare core placement; hence, we report only the comparison between the side and random placements.

To have a more accurate fault model, we use thermal modelling via HotSpot [73] to estimate the temperature of each active computational core. We set the failure rate per cycle for each core (at a room temperature of 25 °C) to 10^-9. The estimated failure rate of each core is then updated using the Arrhenius model [76] based on the temperature obtained from the thermal measurements. In addition, in terms of the failure rate of the memory, the Alion System Reliability Analysis Center reports that memory mean-time-between-failure rates are around 700 years [74]. Also, since the on-chip buffer/memory


tiles benefit from built-in self-diagnostics and repair schemes, we assume that permanent failures in memory do not occur during simulation.
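The temperature-dependent failure rate can be sketched as follows (our code; the activation energy Ea = 0.7 eV is an illustrative value, not one given in the thesis):

    #include <cmath>

    // Arrhenius scaling of the per-cycle failure rate with temperature:
    // lambda(T) = lambda_ref * exp((Ea / k) * (1/T_ref - 1/T)), T in Kelvin.
    double failureRate(double tempCelsius) {
        const double k         = 8.617e-5;   // Boltzmann constant, eV/K
        const double Ea        = 0.7;        // activation energy, eV (assumed)
        const double lambdaRef = 1e-9;       // per cycle at 25 C (thesis value)
        const double Tref      = 25.0 + 273.15;
        const double T         = tempCelsius + 273.15;
        return lambdaRef * std::exp((Ea / k) * (1.0 / Tref - 1.0 / T));
    }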

Table 6.2 shows the throughput and communication energy consumption comparison for two mapping approaches, namely 1) our proposed FT mapping (FT) and 2) Nearest Neighbor (NN) [27]⁵, a heuristic which maps vertices with higher communication as closely as possible, for NoCs of different sizes. All results are obtained from an NoC simulator written in C++.

As seen in Table 6.2 for the all-to-all communication, our proposed technique (FT) indeed achieves a higher throughput and a lower communication energy consumption compared to the NN approach, especially for larger NoC platforms. Of note, our FT approach performs even better, relative to the NN approach, when the spares are located randomly in the platform. For one-to-all communication, the system performance cannot improve much since the bottleneck is mainly due to the memory module (i.e., accessing data to/from memory). Despite this, we still achieve more communication energy savings compared to the NN approach.

5. Even though the path load approach proposed in [27] performs the best, it does not consider any type of failure in the cores of the platform, so it is not directly comparable with our FT approach.

Table 6.2 Throughput and energy consumption between the proposed FT and Nearest Neighbor (NN) approaches for all-to-all and one-to-all communication patterns.

                        Spare core placement - Side               Spare core placement - Random
ACGs / NoC size         throughput       comm. energy             throughput       comm. energy
                        improvement      consumption savings      improvement      consumption savings
                        (FT vs. NN)      (FT vs. NN)              (FT vs. NN)      (FT vs. NN)
5 × 5 all-to-all        23.2%            12.5%                    23.5%            15.8%
10 × 10 all-to-all      98.1%            32.1%                    102.1%           36.2%
30 × 30 all-to-all      163.2%           77.3%                    178.2%           85.2%
5 × 5 one-to-all        3.4%             13.8%                    4.1%             17.8%
10 × 10 one-to-all      5.7%             17.5%                    6.9%             23.6%
30 × 30 one-to-all      19.4%            32.8%                    25.9%            54.1%


6.6.2. Impact of Failure Rates with Spare Core Placement

As discussed in Section 6.3.2, the higher the failure contamination area (FCA), the

higher the probability to affect additional applications and degrade the overall system

performance. Here, we show the impact of the contamination area on different failure rates.

Also, since the lifetime of the computational processing core follows a bathtub curve (i.e. the

failure rate follows different phases: infant, normal, and wear-out), we apply different failure

rates to capture the failures in different phases. We consider two cases in these experiments.

Assume x is the failure rate per cycle at room temperature (25 C) and the estimated failure

rate of each core would be updated using the Arrhenius model based on the temperature

obtained from the thermal measurement. Several sets of applications are generated using the

TGFF package with the vertex number ranging from 5 to 35 and the edge number ranging

from 5 to 50 in the ACG. The proposed FT approach is applied to map the incoming

applications.

Table 6.3 reports the average contamination area and its variance for 10 × 10 and 30 × 30 NoCs with different failure rates. As shown, when spares are randomly distributed in the platform, the FCA value is about 4 and 12 times smaller than when spares are grouped towards the side, for the 10 × 10 and 30 × 30 NoCs, respectively. In addition, when the failure rate gets higher, randomly distributed spares have even less of an influence on the system.

Table 6.3 Impact of contamination area on different failure rates under Side and Random spare core placements. Entries are FCA (avg., var.).

              x = 10^-9                           x = 10^-6
NoC size      Side             Random             Side              Random
10 × 10       (19.2, 180.2)    (5.2, 6.8)         (29.1, 250.4)     (6.5, 13.5)
30 × 30       (65.7, 1403)     (6.8, 12.6)        (125.2, 5874)     (8.2, 21.2)


6.6.3. Evaluation with Real Applications

We evaluate the potential of our algorithm on several real applications, namely five benchmarks from the Embedded System Synthesis Benchmarks Suite [50], a video object plane decoder, an MPEG4 decoder, a picture-in-picture application, and a multi-window display application, where the last four applications include several memory modules. The ACGs of these nine applications are built through an off-line analysis; applications are randomly selected to enter and leave the system.

The Nearest Neighbor (NN) mapping approach in [27] is evaluated against our FT method. The comparisons of 1) the average packet latency (i.e., the time elapsed between packet generation at the source core and packet arrival at the destination core, in cycles), 2) the communication energy consumption, and 3) the Kiviat area for these approaches are given in Table 6.4. In each run, 5%-15% of the computational cores are assumed to be permanently faulty and randomly distributed in the system. We report the average results over 50 runs for each mapping approach under different NoC sizes (e.g., 10 × 10 NN). We note that the range of each metric in the Kiviat graph is normalized from zero to the largest value observed in the random mapping implementation. The same fault model is applied as in Section 6.6.1. In addition, the energy overhead of running our FT algorithm is included in the communication energy consumption measurement.

As shown in Table 6.4, our approach obtains lower average packet latency and smaller communication energy consumption compared to the NN approach. The Kiviat area data also imply that by minimizing the WMD, LCC, and SFF metrics, we are able to reduce the average packet latency quite significantly; this, in turn, increases the system performance and decreases the communication energy consumption. The run-time overhead of running the FT (see Figure 6.10) and NN approaches on a 100 MHz MicroBlaze processor acting as a distributed manager is, on average, 68 μs and 46 μs, respectively; these values are well suited to this kind of on-line optimization.

Table 6.4 Comparison between the Nearest Neighbor (NN) and our FT mapping results on the overall system performance.

NoC size   mapping approach   avg. latency   normalized comm. energy consumption   Kiviat area
10 × 10    NN [27]            105.37         1                                     0.264
           FT                 63.82          0.54                                  0.051
20 × 20    NN [27]            191.28         1                                     0.351
           FT                 67.33          0.37                                  0.042
30 × 30    NN [27]            275.12         1                                     0.383
           FT                 66.33          0.29                                  0.021

6.7. Summary

In this chapter, we have proposed a system-level fault-tolerant approach addressing the problem of run-time resource management in non-ideal multiprocessor platforms where communication happens via the NoC approach. The proposed application mapping techniques in this new framework aim at optimizing the entire system performance and communication energy consumption, while considering the static and dynamic occurrence of permanent, transient, and intermittent failures in the system. As the main theoretical contribution, we have analyzed the main factors producing network contention and addressed the spare core placement problem together with its impact on the system's fault-tolerance (FT) properties. Then we have investigated several critical metrics and provided insight into the resource management process. Finally, an FT application mapping approach for non-ideal multiprocessor platforms has been presented. Experimental results have shown that our proposed approach is efficient and highly scalable; significant throughput improvements can be achieved compared to existing solutions that do not consider possible failures in the system.


7. USER-AWARE DYNAMIC TASK ALLOCATION

7.1. Introduction

As discussed in Figure 1.3, for multiple use-case NoCs (such as the Tile64™ by Tilera [164], which delivers high-performance computing for embedded applications), the different system configurations resulting from unpredictable failures (presented in Chapter 6) or multi-user behaviors are too dynamic and too complex in nature to be modeled off-line. As discussed for the user-centric design flow (see Figure 1.8), even after generating NoC platforms which exhibit less variation among the users’ behavior (see Chapter 4), there still exists some variation between the targeted platform and the users within each cluster (see the distance between the squares and the dots belonging to that cluster in Figure 1.1). Therefore, the need for more sophisticated light-weight run-time optimization for maximizing the user satisfaction and adapting to different user needs becomes apparent (see the arrows in Figure 1.1). This chapter focuses on the resource management problem, more precisely, the task allocation problem, while taking the user behavior into consideration¹.

The task allocation problem has been a key issue for performance optimization in parallel

and distributed systems [10][15][16][29][95][98][99][165]. To date, contiguous (see

Figure 7.1(a)) and non-contiguous (Figure 7.1(b)-(e)) allocation techniques have been

1. The user-centric run-time optimization proposed in this chapter can also be applied to the non-ideal platform, i.e., a platform with permanent, transient, or intermittent faults, discussed in Chapter 6.


proposed for resource assignment, aiming at i) maximizing the utilization of system resources

and ii) maximizing application performance. Contiguous techniques restrict the resource

allocation of a given application to form a convex shape [29][95][165], while non-contiguous

allocation does not have such a restriction. Well-known non-contiguous task allocation

strategies which have been proposed in [10][99] are shown in Figure 7.1(b)-(e).

Figure 7.1 Contiguous (a) and non-contiguous (b)-(e) allocations for four applications using standard techniques: (a) contiguous allocation; (b) non-contiguous, Random; (c) non-contiguous, Paging; (d) non-contiguous, MBS; (e) non-contiguous, GABL. (The figure also marks the internal and external contention among the allocated resources.)

In the “Random” strategy, we randomly assign the non-allocated resources to applications. In

“Paging” and “Multiple Buddy Strategy” (MBS), the mesh network is initially divided into non-overlapping sub-meshes. “Paging” selects the non-allocated sub-meshes in row-major order, while “MBS” allocates applications to contiguous sub-meshes if possible. “Greedy-Available-Busy-List” (GABL) allocates resources from the largest free sub-mesh of any size. The contiguous strategy achieves only 34% to 66% resource utilization [95], while non-contiguous allocation can reach up to 78% [165]. However, the performance of non-contiguous allocation may suffer due to internal and external contention caused by messages originating from the “same” or “different” applications, respectively, contending for the same link. Of note, there is no external contention for contiguous allocation (see Figure 7.1(a)) because the tasks of the same application belong to a convex region, while tasks from different applications do not communicate among themselves (also, application tasks are not reallocated once they start executing).

The resource management techniques proposed so far for SoCs rely on a resource manager operating under the control of an operating system (OS) [117]; this allows the system to operate effectively in a dynamic manner. Since SoC design is moving towards a communication-centric paradigm [17], new metrics (e.g., physical links with limited bandwidth, communication-energy minimization) need to be considered in the resource management process. Indeed, with an inadequate task assignment, the system will likely perform poorly. As an example, the “Random” case in Figure 7.1(b) causes severe internal/external network contention; this contention incurs a longer transmission latency and a smaller system throughput.

This chapter focuses on proposing an effective run-time resource management scheme for embedded NoC-based MPSoCs. Given that the arrival order and the execution times of the target applications are not known a priori, achieving effective run-time resource management on such a platform is a challenging task. Of note, as discussed in Section 6.3.1, path-based internal and external contention has an especially large impact on system performance. Compared to Chapter 5, Chapter 6, and the previous work, our contributions in this chapter are as follows:

• Propose strategies for minimizing the path-based internal and external network

contention.

• Present algorithms for resource management while incorporating certain user

characteristics to better respond to run-time changes.

• Propose light-weight machine learning techniques for learning structures with critical

parameters/thresholds which are able to adapt to different types of users.

Different users result in different system configurations which cannot be predicted and modeled at design time. Consequently, how to react to the run-time stimuli the system receives, while maintaining high performance, is our main objective in this chapter. The best-known example of considering external interactions (e.g., human users) with electronic devices is perhaps the e-commerce site “Amazon.com”: an interface collecting various data on user activity (e.g., who is interested in what and when) helps the search engine recommend to users what products to buy. Also, context-aware mobile computing utilizes wearable sensor devices for sensing the users and their current state to exploit the context information and reduce the demand for human attention [77].

We believe that our approach aimed at incorporating the user behavior in resource

management can automatically adapt to different user needs. In other words, the technique is

well-suited to be embedded in future products belonging to the second and third categories in

Table 1.1.

This chapter is organized as follows: In Section 7.2, we review related work. The system

description and our newly proposed methodology are described in Section 7.3. The run-time


task allocation problem is formulated in Section 7.4 and efficient algorithms to solve it are

presented in Section 7.5. The on-line light-weight user model learning process is explained in

Section 7.6. Experimental results in Section 7.7 show the kind of communication energy

savings that can be achieved by considering the user behavior. Finally, we summarize our

contribution in Section 7.8.

7.2. Related Work

The resource management problem has been addressed in the literature in various contexts

like supercomputers, parallel and distributed systems, SoCs, etc. While having the goal of maximizing the system performance, many techniques, such as partitioning, mapping, scheduling, resource sharing, and load balancing, have been proposed to date. Pastrnak et al.

present methods such that several tasks can run concurrently by exploiting the task- and data-

level parallelism [127]. Chang et al. address the coarse-grain task partitioning and clustering

problem to preserve the modularity of the initial application description [29]. Moreira et al.

propose an online resource allocation heuristic for multiprocessor SoCs which can achieve

utilization values up to 95% [111]. Nollet et al. apply task migration to improve the system

performance by reconfiguring the hardware tiles [118] and propose adaptive routing to ensure the quality-of-service requests of various applications [117]. Smit et al. propose a run-time task assignment algorithm for heterogeneous processors [154] targeting the current system configuration.

Some prior work does consider the whole system configuration. For instance, Murali et al.

propose an off-line methodology for mapping multiple use-cases (with different

communication requirements and traffic patterns) onto NoCs [105]. Also, Pop et al. present

the incremental design approach of distributed systems for hard real-time applications over a

bus [130].


With the ever-increasing connectivity between systems and users, how to design an electronic system which can examine and mediate people’s interactions is becoming an important challenge. We first addressed the run-time task allocation problem on NoC-based platforms while considering the user behavior in [36]. Here, we generalize the technique in [36] to include the user behavior and a pre-defined user model in the energy optimization process, and then propose a light-weight machine learning technique for boosting the user model at run-time. We show that by taking the user behavior into consideration during the task allocation process and building a specific model for each user, the system can respond much better to run-time changes and adapt dynamically to user needs.

7.3. Preliminaries and Methodology Overview

7.3.1. Motivational Examples

Our motivational example of run-time task allocation with user behavior taken into

consideration is given in Figure 7.2. When an event occurs (e.g., an application enters the

system), our objective is to allocate the tasks belonging to this application to the available

resources on the platform such that the path-based internal/external network contention

and communication cost can be minimized. In the remaining part of this chapter, the term “internal/external contention” stands for path-based internal/external contention, which occurs when two traffic flows that neither come from the same source nor go towards the same destination contend for the same links somewhere in the network (see the definition in Section 6.3.1). More precisely, reducing the internal contention always comes with the benefit of minimizing the communication cost; however, it increases the probability of external contention for additional mappings. More details can be found in the example

discussed in Figure 7.2.

Figure 7.2 Motivational example of run-time resource management with user behavior taken into consideration. (a) Application characteristics. (b) Events in the system. (c)(d)(e) Task allocation schemes under Approach 1 (primarily minimize internal metrics), Approach 2 (primarily minimize external metrics), and the hybrid approach (user behavior considered), respectively.

Figure 7.2(a) describes the characteristics of applications App 1 - App 5. More precisely, each vertex vi contains a cluster of tasks that will acquire a resource

later. Each edge eij represents the communication between two vertices vi and vj, while the

weight of each edge, r(eij), gives the corresponding communication rate (e.g., bits/sec).

Figure 7.2(b) shows the events in the system; one event [t1, Q, t2] represents App Q running

in the system from time t1 (sec) to t2 (sec) as defined in Section 2.3.1. Figure 7.2(c), (d), and

(e) show the system configurations at particular times 0, 1, 2, 3, and 4 under three different

approaches. At each specific time, a new application arrives in the system and creates a more

complex configuration. These three different approaches are as follows:

• Approach 1: Focus primarily on minimizing the internal contention and communication

cost; minimize the external contention only as a secondary goal.

• Approach 2: Focus primarily on minimizing the external contention; minimize the

internal contention and communication cost only as a secondary goal.

• Hybrid approach: A hybrid method combining Approaches 1 and 2, with the user behavior taken into consideration.

As observed in Figure 7.2(c), when the system utilization increases, in Approach 1, the

remaining (available) resources are quite dispersed; this results in an increase of external

contention and incurs a higher communication overhead for additional mappings. On the

contrary, in Approach 2 (see Figure 7.2(d)), the regions occupied by applications (shown with

thicker lines) are near convex; this helps reducing the external contention and lessen the

additional communication costs. For example, as shown in Figure 7.2(d) at time 1, there is no

external contention under such configuration. However, the drawback of Approach 2 is that

with this near convex region limitation, we can only obtain the sub-optimal communication

cost for the incoming application. It is hard to judge which approach is an adequate solution to

the run-time task allocation problem, particularly when the application characteristics are not

151

known a priori. Therefore, in this chapter, we address this very issue and present a hybrid

allocation approach leveraging Approaches 1 and 2 while considering the user interaction with

the system.

Here, we consider at least one PE acting as a global manager, which observes the user’s behavior over a long session (or episode). Also, for the same example shown in Figure 7.2, we assume that the manager predicts that applications App 2 and App 4 have a higher probability of becoming critical applications, since the former has a higher communication rate, while the latter has a long presence in the system according to the user’s behavior traces. In the hybrid approach, Approach 1 is applied only to the critical applications.

In this example, we assume that the minimal-path routing is used to transmit data. For

simplicity, the communication cost (i.e. energy consumption in bits, EQ) of the event

[t1, Q, t2] is defined as follows:

\( E_Q = \sum_{\forall (i,j)\,\in\,App\,Q} \bigl( MD(v_i, v_j) \times comm(e_{ij}) \bigr) \times (t_2 - t_1) \)   (7.1)

where MD(vi , vj) represents the Manhattan Distance between any two vertices, vi and vj,

connected to each other in App Q. The event communication cost for each application under

different approaches is summarized in Table 7.1.

Table 7.1 Event communication cost [in bits] for the three approaches and the five applications entering the system as shown in Figure 7.2.

                  App 1   App 2   App 3   App 4   App 5
Approach 1         40      200     50      100     175
Approach 2         40      250     60      120     125
Hybrid approach    40      200     60      100     125
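To make Equation 7.1 concrete, consider a hypothetical two-task event (not one of the applications in Figure 7.2): a single edge with rate comm(e_{12}) = 10 bits/sec, tasks mapped at Manhattan distance 2, and event [0, Q, 5]. Then

\( E_Q = \bigl( MD(v_1, v_2) \times comm(e_{12}) \bigr) \times (t_2 - t_1) = (2 \times 10) \times (5 - 0) = 100 \text{ bits}. \)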


As shown in Figure 7.2(c), the internal contention is indeed minimized for the first four applications, and these four events consume the least amount of communication energy by time 3 (see Table 7.1 under Approach 1). However, the remaining (available) resources are quite dispersed (see Figure 7.2(c) at time 3) when the system utilization increases; this results in an increase of external contention and therefore incurs a higher communication overhead for any additional mapping. As shown in Table 7.1, the event communication cost for App 5 under Approach 1 is about 40% higher than those under Approach 2 and the hybrid approach.

On the contrary, in Approach 2, the regions occupied by applications (shown with thicker lines) are near convex; this helps reduce the external contention and decrease the additional communication costs (compare Figure 7.2(c) and (d) at time 3). However, the drawback of Approach 2 is that, with this near convex region limitation, we can only obtain a sub-optimal communication cost for the incoming application. In other words, it does not work well when the system utilization is low (see Table 7.1 and compare the costs of App 2, 3, and 4 under Approach 2 with those under Approach 1). For Approaches 1 and 2, it is hard to judge which one is the adequate solution to the run-time task allocation problem, particularly when the application characteristics are not known a priori.

Our motivation for applying the hybrid approach is to leverage the advantages offered by

both Approaches 1 and 2; that is, we aim at balancing the internal/external contention and, subsequently, at minimizing the event communication cost (see Table 7.1). The basic idea is that,

by observing the user’s behavior, we can predict what the critical applications are and then

minimize their communication cost via Approach 1. Even if this may cause a larger external

contention, this can be later mitigated by applying Approach 2 to other applications.

Moreover, we observe that mitigating the internal and external contention reduces the

system fragmentation, which has a huge impact on system throughput. To show the influence

of system fragmentation on the mapping quality, we consider two scenarios for multiple


application mappings on 6 × 6 and 10 × 10 mesh NoCs, respectively, while considering fifteen to twenty-five vertices having all-to-all communication in each application: Scenario 1) randomly and contiguously select the unused resources for each application; Scenario 2) apply either Approach 1 or Approach 2. We observe that the system throughput in the second scenario improves by 45% and 108%, respectively, compared to the first scenario; this implies that mitigating the internal and external contention not only helps minimize the communication overhead for additional mappings, but also has a huge impact on the maximum system throughput.

Of note, the reallocation of resources to defragment the system, also called task migration, is a complementary approach aimed at achieving load balancing and high resource utilization. For distributed systems without shared memory support, the task migration policy must be implemented by passing messages among resources; the implicit migration cost is large due to the need to move the process context [161]. Therefore, in this chapter, we do not consider the task migration process. Instead, we target a run-time mapping process which does not need to change the current system configuration.

7.3.2. System Description

As mentioned in Chapter 2, our embedded NoC platform is based on a 2-D mesh

architecture consisting of heterogeneous processing resources (i.e., master PEs running

the OS and acting as global managers, and several slave processors/PEs, as shown in Figure 2.1(a)). Each segmented link l_i between PEs has the same bandwidth B(l_i). We

further assume that each slave processor SP_i has its own computation capability, CC(SP_i), with levels set from 1 to i. In addition, the slave processors in our system are processor-based cores, e.g., Digital Signal Processors (DSPs) or ARM cores, and all application codes are compiled in advance and stored in the global program memory, which each slave processor can easily access. In the remaining part of the chapter, we use the terms “master PE” and “global manager (GM)” interchangeably when there is no ambiguity.

The real-time OS built into our embedded system is designed to be very compact and efficient, such as the Open AT OS provided by Wavecom [77]. We assume such an OS supports non-preemptive multi-tasking and event-based programming. More precisely, the OS control

mechanism as presented in Chapter 2 can be used to provide predictable and controllable

resource management, which includes monitoring the user’s behavior and making the task

allocation/mapping decision only when new events occur.

The communication infrastructure of such a platform consists of a Data Network and a Control Network (shown as solid and dotted lines, respectively, in Figure 2.1) which support a minimal-path routing scheme² and wormhole switching. The sum of all communication flows passing through a link l_i cannot exceed its bandwidth B(l_i). Under such a platform,

the bit energy model as presented in Section 2.3.3 can be used to derive the communication

energy consumption of the entire system analytically. Assume now that applications enter

and leave the system at time t, where t is an integer. The total communication energy

consumed by some events in any session/episode si under a certain user model ST, during time

interval t = 0 to T_{s_i}, is denoted by:

\( E_{comm}(T_{s_i}, ST) = \sum_{\text{all applications}} \left( E_{comm}^{App\,Q} \times \sum_{t=1}^{T_{s_i}} \Delta_{App\,Q}(t) \right) \)   (7.2)

where T_{s_i} is the length of session s_i, E_{comm}^{App\,Q} stands for the communication energy consumption of any application Q per time unit (see Section 2.3.3), and \( \Delta_{App\,Q}(t) = 1 \) if App Q is active between times t-1 and t, and 0 otherwise.

2. After the mapping of the incoming application is done, elimination of possible deadlocks between the communication traces can be achieved by adding additional virtual channels in the router as a post-processing step [46]. In this chapter, we focus on the mapping step.

7.3.3. Overview of the proposed methodology

We assume that all applications have been characterized by the Application Characteristic

Graph ACG = (V, E) as presented in Section 2.2. Our proposed methodology handling the

user-aware task allocation includes three stages, as shown in Figure 7.3. In stage 1, when a user first signs in to the system, the master PE uses the default approach (i.e., Approach 2) for application mapping and, at the same time, records the user sequences that characterize this particular user’s interaction with the system. As new user sequences are collected, the manager enters stage 2 and the hybrid approach is applied (i.e., the selection between Approaches 1 and 2 is based on the pre-defined user model). After a sufficiently long period of time, we boost the user model on-line; the boosted model is then used for any subsequent user interaction. Then, in stage 3, the hybrid approach is selected according to i) the application characteristics, ii) the on-line learned user model, and iii) the current system configuration.

Figure 7.3 Overview of the proposed methodology. The default approach (i.e., Approach 2) is applied in stage 1. The hybrid approach with the pre-defined user model is applied in stage 2. The hybrid approach with the on-line learned user model is applied in stage 3.

Figure 7.4 illustrates the algorithm flow of our proposed methodology; it also shows the four sub-problems (P1 ~ P4) relevant to the hybrid approach in stages 2 and 3. More precisely, for Approach 1, we first form a region that minimizes the internal contention of the incoming application (P1) and then rotate/translate the resulting region to fit the current system configuration (P2). For Approach 2, in order to minimize the external contention, we do the opposite, namely first select a near convex region based on the current configuration (P3), and then map the application tasks onto the selected region (P4). The detailed explanation and objective of each sub-problem are given in Section 7.4.

Figure 7.4 Algorithm flow for our proposed methodology.

Note that if the number of vertices in an application is greater than the total number of

available resources, the system can, in principle, reject this application or start it at a different

level of granularity (which may result in lower performance). Since we focus on task

allocation, we use the first mechanism; changing the application granularity at run-time is left

for future work.
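A minimal C++ sketch of this control flow is given below, with hypothetical hooks standing in for the P1-P4 solvers and the criticality test of Section 7.6 (none of these names come from the thesis implementation; this is only an illustration of the flow in Figure 7.4):

#include <optional>

struct ACG {};        // application characteristic graph (placeholder)
struct Config {};     // current system configuration (placeholder)
struct Mapping {};    // task-to-resource assignment (placeholder)
struct UserModel { double alpha_th = 0.8, beta_th = 0.7; };

// Hypothetical hooks standing in for the real P1-P4 solvers and checks.
bool enoughResources(const Config&, const ACG&) { return true; }
bool isCritical(const ACG&, const UserModel&)   { return false; }             // decision tree of Section 7.6
std::optional<Mapping> regionForming(const Config&, const ACG&)      { return Mapping{}; }    // P1
std::optional<Mapping> regionRotation(const Config&, const Mapping&) { return std::nullopt; } // P2
Mapping regionSelection(const Config&, const ACG&)       { return Mapping{}; }                // P3
Mapping applicationMapping(const Mapping&, const ACG&)   { return Mapping{}; }                // P4

// Event handler following Figure 7.4: try Approach 1 up to `iter` times for
// critical applications, then fall back to Approach 2.
std::optional<Mapping> onEventTrigger(const Config& cfg, const ACG& app,
                                      const UserModel& um, int iter = 3) {
  if (!enoughResources(cfg, app)) return std::nullopt;  // reject the event
  if (isCritical(app, um)) {
    for (int i = 0; i < iter; ++i) {                    // Approach 1: P1, then P2
      if (auto region = regionForming(cfg, app))
        if (auto m = regionRotation(cfg, *region)) return m;
    }
  }
  Mapping region = regionSelection(cfg, app);           // Approach 2: P3, then P4
  return applicationMapping(region, app);
}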

7.3.4. User Modeling

As shown in Figure 7.3 (i.e., steps from stage 2 to stage 3), the manager records many user

sequences for a specific user and then, at run-time, builds the user model, which is able to predict the probability of a certain application being critical. Considering the run-time overhead, we propose two ratios for building this model.

• Instantaneous communication rate ratio (α): the ratio between the communication rate of application Q (bits per time unit) and that of all applications occurring from time 0 to T:

\( \alpha = \frac{\sum_{\forall e_{ij} \in E \text{ in App } Q} r(e_{ij})}{\sum_{\forall \text{App } Q' \text{ ever occurred}} \left( \sum_{\forall e_{ij} \in E \text{ in App } Q'} r(e_{ij}) \right)} \)   (7.3)

• Cumulative communication energy ratio (β): the ratio between the communication energy consumption of application Q while active in the system (\( \Delta_{App\,Q}(t) = 1 \) if Q is active between times t-1 and t, and 0 otherwise) and the communication energy of all events in the system from time 0 to T:

\( \beta = \frac{\left( \sum_{t=1}^{T} \Delta_{App\,Q}(t) \right) \times E_{comm}^{App\,Q}}{E_{comm\_total}(T)} \)   (7.4)

Later on, we introduce a user model based on two thresholds/parameters, α_th and β_th, with which the master PE predicts whether the incoming application is critical or not for a particular user. As mentioned, if the incoming application is recognized as critical, then Approach 1 is applied; otherwise, Approach 2 is used.

We observe, however, that in most systems the application sequence from different users is not stationary (the user behavior variation has been discussed in Section 1.3.1). More precisely, none or multiple applications may be identified as critical from the sequences belonging to a certain user, so the parameters (α_th and/or β_th) should be able to fit different users. In other words, under such non-stationary behavior across users (see Figure 1.5), it is meaningful to learn the structure of the user model on-line, provided that we can collect enough user sequence information. We note that the light-weight learning process is not executed every time an application enters the system; instead, it can be executed based on (manual) user settings, or when the global manager has recorded enough user data. The user learning process is described in Section 7.6, while its performance (including the overhead) is reported in Section 7.7.3. In addition, the learning process can be done either on the global manager or on some slave processors which have access to the recorded user information. We note that, since the “light-weight” user model in this chapter is built at run-time for each specific user, we include only two parameters, α_th and β_th, which can be obtained with low computational effort. For more accurate user models, we could include other factors which affect the user behavior, such as location, time, and environment (similar to context-aware computing [141]), or even use a more complex structure for building the models (e.g., a neural network [126]).
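For illustration only, the two ratios can be computed in one pass over the recorded event trace. The Event layout below is an assumption of this sketch, not the thesis's data structure:

#include <algorithm>
#include <vector>

// One recorded event [t1, Q, t2]; rate_sum is the sum of r(e_ij) over the
// edges of App Q, and e_comm its communication energy per time unit.
struct Event { int t1, t2; double rate_sum, e_comm; };

// Equation 7.3: instantaneous communication rate ratio of event/application q.
double alpha(const std::vector<Event>& trace, const Event& q) {
  double total = 0.0;
  for (const auto& e : trace) total += e.rate_sum;   // all apps that ever occurred
  return total > 0.0 ? q.rate_sum / total : 0.0;
}

// Equation 7.4: cumulative communication energy ratio of q up to time T.
double beta(const std::vector<Event>& trace, const Event& q, int T) {
  double e_total = 0.0, e_q = 0.0;
  for (const auto& e : trace) {
    // Sum of Delta(t) over t = 1..T is the number of time units e is active.
    int active = std::max(0, std::min(e.t2, T) - std::max(e.t1, 0));
    e_total += active * e.e_comm;
    if (&e == &q) e_q = active * e.e_comm;  // assumes q is an element of trace
  }
  return e_total > 0.0 ? e_q / e_total : 0.0;
}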

Moreover, predicting the user sequences at run-time may be a challenging task (e.g., finding the correlations among applications using hidden Markov models [126], or predicting the modes of the applications for each user), but it has the potential to further optimize the resource management process at run-time.

The following user-aware task allocation process (Sections 7.4 and 7.5) and user model learning process (Section 7.6) are limited to systems utilized by one specific user for a certain time or a long period of time, i.e., systems in categories 2 and 3 in Table 1.1. For systems with multiple users interacting at the same time, it is suggested to explore the human activity patterns (for more discussion, see Section 8.2.2).

7.4. Problem Formulation of User-Aware Task Allocation Process

In this section, we first define some terms; the formulations of the four sub-problems (P1 - P4 in Figure 7.4) are presented afterwards.

• MD(si = (xi, yi), sj = (xj, yj)): Manhattan Distance between locations si and sj, where xi, xj, yi, and yj are the x- and y-coordinates in the mesh system, i.e., MD(si, sj) = |xi - xj| + |yi - yj|.

• ED(si = (xi, yi), sj = (xj, yj)): Euclidean Distance between locations si and sj, i.e., ED(si, sj) = (|xi - xj|² + |yi - yj|²)^{1/2}.

• R: a region containing several locations. This region can be contiguous or non-contiguous.

• L(R): the sum of pairwise Manhattan distances between all locations within R.

Similar to the two dispersal metrics presented by Mache [101], we use a metric for measuring the external contention in the system during the run-time mapping process, namely L(R) + L(R’ - R), where R’ is the region with the available resources in the current system configuration and R is the region of available resources which is going to be selected for the incoming application. We proved in Chapter 5 that minimizing the metric L(R) + L(R’ - R) helps reduce the external contention.
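For concreteness, a minimal C++ sketch of these distance and dispersal metrics follows; the Loc struct and function names are our own illustration, not the thesis implementation:

#include <cmath>
#include <cstdlib>
#include <vector>

struct Loc { int x, y; };

// Manhattan Distance between two locations.
int MD(Loc a, Loc b) { return std::abs(a.x - b.x) + std::abs(a.y - b.y); }

// Euclidean Distance between two locations.
double ED(Loc a, Loc b) {
  double dx = a.x - b.x, dy = a.y - b.y;
  return std::sqrt(dx * dx + dy * dy);
}

// L(R): sum of pairwise Manhattan distances between all locations within R.
long L(const std::vector<Loc>& R) {
  long sum = 0;
  for (size_t i = 0; i < R.size(); ++i)
    for (size_t j = i + 1; j < R.size(); ++j)
      sum += MD(R[i], R[j]);
  return sum;
}

// Dispersal metric for selecting region R out of the free region R':
// minimizing L(R) + L(R' - R) helps reduce external contention (Chapter 5).
long dispersal(const std::vector<Loc>& R, const std::vector<Loc>& RprimeMinusR) {
  return L(R) + L(RprimeMinusR);
}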

Region Forming Sub-problem (P1)

Given the ACG of an incoming application and the current system configuration,

Find a region R and a corresponding location G(v_i) inside R, ∀v_i ∈ V in the ACG, which:

min { Comm. cost = \( \sum_{\forall e_{ij} \in E} comm(e_{ij}) \times MD(G(v_i), G(v_j)) \) }   (7.5)

Such that: ∀v_i ≠ v_j ∈ V, G(v_i) ≠ G(v_j)   (7.6)

∀v_i ≠ v_j with M(v_i) = M(v_j), MD(G(v_i), G(v_j)) ≈ dist   (7.7)

where dist is observed from the current configuration.

If more than one region satisfies Equation 7.5, then select the region R for which L(R) is minimized, i.e., we select the region as convex as possible, since this helps reduce the external contention.

Region Rotation Sub-problem (P2)

Given an already formed region R (derived from P1) and the current configuration with region R’ containing the available resources,

Find a placement for the region R within R’ which:

min { L(R’ - R) }   (7.8)

Such that: ∀v_i ∈ V, CC(SP_i = G(v_i)) ≥ M(v_i)   (7.9)

∀ link l_k: \( \sum_{\forall \text{apps in the system}} \) (comm. flows through l_k) ≤ B(l_k)   (7.10)

Region Selection Sub-problem (P3)

Given the number of resources s required by the incoming application and the current configuration with region R’ containing the available resources,

Find a region R inside R’ with the number of locations in R equal to |V| which:

min { L(R) + L(R’ - R) }   (7.11)

and min { nodes_affected(R) + links_affected(R) }   (7.12)

Such that: ∀ computation capacity level i, # of (CC(s) = i) in R = # of (M(v_i ∈ V) = i) in the ACG   (7.13)

Application Mapping Sub-problem (P4)

Given a selected region R (derived from P3) and the ACG of the incoming application,

Find H(v_i) inside R, ∀v_i ∈ V in the ACG, which:

min { Comm. cost = \( \sum_{\forall e_{ij} \in E} comm(e_{ij}) \times MD(H(v_i), H(v_j)) \) }   (7.14)

Such that: ∀v_i ≠ v_j ∈ V, H(v_i) ≠ H(v_j)   (7.15)

∀v_i ∈ V, M(v_i) = CC(G(v_i))   (7.16)

∀ link l_k: \( \sum_{\forall \text{apps in the system}} \) (comm. flows through l_k) ≤ B(l_k)   (7.17)

7.5. User-Aware Task Allocation Approaches

Here, the algorithms used in Approaches 1 and 2 to solve the four sub-problems P1 ~ P4

are described in detail; why and how each approach is selected is later explained in

Section 7.6.

7.5.1. Solving the Region Forming Sub-problem (P1)

For this sub-problem, we do not set any region boundary; therefore, there may exist more than one solution minimizing the internal contention. As stated, we select the region R with L(R) minimized. This is because the more convex the region is, the better it is for minimizing the external contention and the communication overhead of additional applications.

In general, a region is convex if it contains all the line segments connecting any pair of points inside it. Bender et al. [16] define a region to be optimal if the average distance between any pair of points is minimal; as such, the shape of an optimal region is expected to be convex. However, the concept of near convex region we use here is more general; it stands for a region whose area is closest to the area of its convex hull [35][37][89]. Our objective in this sub-problem is to minimize the internal contention and communication cost of the incoming application and, at the same time, make the resulting region as convex as possible.

The region forming procedure is shown in Figure 7.5; it assumes that the input ACG is represented using adjacency lists. Several additional data structures are maintained for each vertex in the ACG. The color of each vertex u ∈ V is stored in the variable color[u], and the communication-weighted sums of u to its BLACK and WHITE neighbors are stored in the variables Adj_b[u] and Adj_w[u], respectively. In addition, the center of the current resulting region R is stored in the variable Center[R].

Figure 7.5 Main steps of the region forming algorithm.

Input: (1) current system configuration; (2) ACG = (V, E)
Output: a region R(G) and its corresponding mapping G(·) for each vertex, i.e., R with the locations g_{x1,y1} = (x1, y1) = G(v1), g_{x2,y2} = G(v2), ..., g_{x|V|,y|V|} = G(v|V|)

01: for each vertex u ∈ V
02:   do color[u] ← WHITE
03: R ← ∅
04: choose u ∈ V such that Adj_w[u] is maximized
05: color[u] ← BLACK
06: G[u] = Center[R] ← (0, 0)
07: R ← R ∪ {G[u]}
08: while \( \sum_{\forall u \in V} \) (color[u] == BLACK) < |V|
09:   do update Adj_b[u] for each vertex u
10:   choose u with color[u] = WHITE and Adj_b[u] maximized
11:   color[u] ← BLACK
12:   choose an available location g_{x,y} such that, ∀v ∈ V with color[v] == BLACK, MD(g_{x,y}, G(v)) ≈ dist (where dist is observed from the current configuration), \( \sum_{\forall v \in V,\, color[v] == BLACK} \) (comm(e_{uv}) × MD(g_{x,y}, G(v))) is minimized, and then ED(Center[R], g_{x,y}) is minimized
13:   G[u] ← g_{x,y}
14:   R ← R ∪ {g_{x,y}}
15:   update Center[R]

An illustrative example is shown in Figure 7.6; here, we assume that each vertex v_i in the ACG has the same computational requirements. A BLACK vertex in Figure 7.6 stands for a vertex which has been processed and has received its specific location, while a WHITE vertex has not been processed yet. Initially, all vertices are WHITE and vertex v3 (for clarity, circle 3 stands for vertex v3) is selected as having the largest communication to its neighbors (see Figure 7.6(a)).

Then vertex v3 is located at the center grid G_{0,0} (see the solid dot in Figure 7.6(b)). Next, vertex v2 is selected since it has the largest communication rate with vertex v3 (compared to vertices v1, v6, and v7; see Figure 7.6(c)) and is located at G_{-1,0} (Figure 7.6(d)). Now the center is updated to G_{-1/2,0}, as shown with the solid dot in Figure 7.6(d). Then, vertex v1 is selected since it has the largest communication with the BLACK vertices v2 and v3 (see Figure 7.6(e)). Now, since grid positions G_{0,1}, G_{1,0}, and G_{0,-1} have the shortest MD and the same internal contention, we calculate their ED to the center G_{-1/2,0}. We select G_{0,1} or G_{0,-1} for vertex v1, since their ED to the center is the smallest, as shown in Figure 7.6(f). Following this, vertices v5, v4, v6, and v7 are successively selected for forming the region; the remaining process is shown in Figure 7.6(g)-(n). The final solution is shown in Figure 7.6(n) with a thick line.

Figure 7.6 Example showing the region forming algorithm on an ACG.

Complexity of the region forming algorithm: The initialization overhead is O(V) (lines 1-3), while line 4 takes O(E) and lines 5-7 take constant time. There are |V| - 1 iterations of the main loop (lines 8-15) in total. In each iteration, one vertex is reached by searching for the edge with the maximum communication rate (which takes O(E) time), and finding its location costs O(log V) if a heap is used to search the wavefront of the resulting region. The total time complexity is therefore O(VE log V). However, if the edge search is also implemented with a heap, it takes only O(E log E) over all |V| - 1 iterations; the time complexity of this algorithm can thus be reduced to O(V log V + E log E).
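For concreteness, a compact C++ sketch of the greedy loop in Figure 7.5 follows. It is simplified: it ignores the capability-matching constraint and the dist bound observed from the configuration, and all names are illustrative (this is not the thesis implementation):

#include <cmath>
#include <cstdlib>
#include <limits>
#include <set>
#include <vector>

struct Loc {
  int x, y;
  bool operator<(const Loc& o) const { return x < o.x || (x == o.x && y < o.y); }
};

struct Acg {
  int n;                                  // |V|
  std::vector<std::vector<double>> comm;  // comm[u][v], 0 if no edge
};

// Greedy region forming (Figure 7.5), ignoring capability matching and dist.
std::vector<Loc> regionForm(const Acg& g) {
  std::vector<Loc> G(g.n);                // G[u] = location of vertex u
  std::vector<bool> black(g.n, false);
  // line 4: start from the vertex with the largest total communication
  int first = 0; double best = -1;
  for (int u = 0; u < g.n; ++u) {
    double w = 0; for (int v = 0; v < g.n; ++v) w += g.comm[u][v];
    if (w > best) { best = w; first = u; }
  }
  black[first] = true; G[first] = {0, 0};
  std::set<Loc> used = {{0, 0}};
  double cx = 0, cy = 0;                  // Center[R]
  for (int placed = 1; placed < g.n; ++placed) {
    // line 10: pick the WHITE vertex most connected to BLACK vertices
    int u = -1; best = -1;
    for (int w = 0; w < g.n; ++w) {
      if (black[w]) continue;
      double adjB = 0;
      for (int v = 0; v < g.n; ++v) if (black[v]) adjB += g.comm[w][v];
      if (adjB > best) { best = adjB; u = w; }
    }
    // line 12: among free neighbors of the region, minimize weighted MD,
    // breaking ties by Euclidean distance to the region center
    Loc bestLoc{}; double bestCost = std::numeric_limits<double>::max(), bestEd = 0;
    for (const Loc& s : used)
      for (Loc c : {Loc{s.x + 1, s.y}, Loc{s.x - 1, s.y}, Loc{s.x, s.y + 1}, Loc{s.x, s.y - 1}}) {
        if (used.count(c)) continue;
        double cost = 0;
        for (int v = 0; v < g.n; ++v)
          if (black[v]) cost += g.comm[u][v] * (std::abs(c.x - G[v].x) + std::abs(c.y - G[v].y));
        double ed = std::hypot(c.x - cx, c.y - cy);
        if (cost < bestCost || (cost == bestCost && ed < bestEd)) {
          bestCost = cost; bestEd = ed; bestLoc = c;
        }
      }
    black[u] = true; G[u] = bestLoc; used.insert(bestLoc);
    cx = cy = 0;                          // line 15: update Center[R]
    for (const Loc& s : used) { cx += s.x; cy += s.y; }
    cx /= used.size(); cy /= used.size();
  }
  return G;
}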

7.5.2. Solving the Region Rotation Sub-problem (P2)

By solving sub-problem P1, we get a region with each vertex v_i at its corresponding location G(v_i). Now, we need to search for a placement of this region on the current configuration, with the objective of fitting the region within the configuration as well as possible. First, we define some terms and a metric for measuring the fit.

• Rrec: the minimal enclosing rectangle containing the region R. Assume the size of the rectangle is m × n.

• Rotation(Rrec): all rotations of Rrec, i.e., rotations by 90, 180, and 270 degrees, and reflections (up-down, left-right).

• grid_status(Rrec): an m × n matrix whose entries are set to the minimal computation requirement of the vertex allocated on them, or to 0 otherwise.

• Rs: an m × n sub-mesh in the system configuration.

• system_status(Rs): an m × n matrix whose entries are set to the computational capacity of the corresponding resource if it is not used, or to 0 otherwise.

• subtract(grid_status(Rrec), system_status(Rs)): subtracting grid_status(Rrec) from system_status(Rs). If any entry of the resulting matrix is negative, then subtract(Rrec, Rs) is set to 0; otherwise, subtract(Rrec, Rs) is defined as the sum of the entries in the resulting matrix. Since subtract(Rrec, Rs) is defined as the matching


difference, a lower positive value stands for a better match. Of note, if

subtract(Rrec , Rs) = 0, we cannot place Rrec onto the sub-mesh Rs.

One simple example is illustrated in Figure 7.7. Assume that the region R shown in Figure 7.7(a) contains all vertices inside it. The minimal rectangle containing R, Rrec, is shown in Figure 7.7(b). Assume that two different sub-meshes, Rs1 and Rs2, of the same size as Rrec are extracted from the current configuration; the empty spaces in Figure 7.7(c) and (d) represent the locations of unused (i.e., available) resources. After subtracting grid_status(Rrec) from system_status(Rs), subtract(Rrec, Rs1) is 3, which implies that these two rectangles fit well. On the contrary, subtract(Rrec, Rs2) is set to 0, since one of the entries of the resulting matrix is negative, which implies that these two rectangles do not match.

The steps of the region rotation algorithm are shown in Figure 7.8.

Figure 7.7 The subtraction calculation during the region rotation process.

grid_status(Rrec) =    system_status(Rs1) =    system_status(Rs2) =
  1 2 1 0                1 2 1 1                 1 2 1 1
  0 1 1 2                0 1 1 2                 1 1 1 2
  0 0 1 0                0 0 1 2                 0 0 0 2

Rs1 - Rrec =           Rs2 - Rrec =
  0 0 0 1                0 0 0 1
  0 0 0 0                1 0 0 0
  0 0 0 2                0 0 -1 2

subtract(Rrec, Rs1) = 3          subtract(Rrec, Rs2) = 0

Note that, to contain the run-time overhead of the region rotation algorithm, we start by searching the sub-meshes available at the corners, or the meshes located on the wavefront of the used locations (see line 4 in Figure 7.8). Finally, we select the sub-mesh with the best match value (i.e., the smallest positive sub value); this also helps reduce the fragmentation of the system. Of note, there is no optimal solution (i.e., optimal selection of the sub-mesh) for this step, because it is not possible to know in advance the future sequence of events.

Complexity of the region rotation algorithm: The initialization in lines 1 and 2 takes O(V) and O(mn), respectively. For searching the possible sub-meshes around the corners, or the meshes at the wavefront of the used locations, there are about O(M + N) iterations of the main loop (lines 3-7). Each iteration, which obtains a candidate sub-mesh and performs the match process, takes O(mn). Therefore, the overall complexity is O(mn(M + N)).

Figure 7.8 Main steps of the region rotation algorithm.

Input: (1) current system configuration, Conf(M × N); (2) a region R(G) and its corresponding mapping G(·) for each vertex, i.e., R with the locations g_{x1,y1} = (x1, y1) = G(v1), ...
Output: a matching function map(·) for mapping R to Conf

01: calculate the size of Rrec(m × n), where m = max(xi | i = 1~|G|) - min(xi | i = 1~|G|) and n = max(yi | i = 1~|G|) - min(yi | i = 1~|G|)
02: calculate grid_status(Rrec_all), where Rrec_all = Rotation(Rrec)
03: do
04:   search the possible available sub-meshes Rs(m × n) in the current system configuration
05:   calculate system_status(Rs)
06:   calculate sub = subtract(grid_status(Rrec_all), system_status(Rs))
07: choose R such that the sub value is minimized and > 0
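A small C++ sketch of the subtract( ) test follows, assuming matrices stored as 2-D vectors (illustrative only); the commented example reproduces the values of Figure 7.7:

#include <vector>

using Mat = std::vector<std::vector<int>>;

// subtract(grid_status(Rrec), system_status(Rs)): returns 0 if any entry of
// Rs - Rrec is negative (no fit); otherwise the sum of the remaining entries,
// where a smaller positive value means a tighter match.
int subtract(const Mat& rrec, const Mat& rs) {
  int sum = 0;
  for (size_t i = 0; i < rrec.size(); ++i)
    for (size_t j = 0; j < rrec[i].size(); ++j) {
      int d = rs[i][j] - rrec[i][j];
      if (d < 0) return 0;   // a required capability exceeds what is available
      sum += d;
    }
  return sum;
}

// Example reproducing Figure 7.7: subtract(rrec, rs1) == 3, subtract(rrec, rs2) == 0.
// Mat rrec = {{1,2,1,0}, {0,1,1,2}, {0,0,1,0}};
// Mat rs1  = {{1,2,1,1}, {0,1,1,2}, {0,0,1,2}};
// Mat rs2  = {{1,2,1,1}, {1,1,1,2}, {0,0,0,2}};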


7.5.3. Solving the Region Selection Sub-problem (P3)

The purpose of this sub-problem is to minimize the external contention while selecting a region R, containing all the required resources, from the region R’ of available resources in the current configuration. As discussed in [35][37], selecting a near convex region indeed helps minimize i) the external contention, ii) the communication cost for the incoming application, i.e., L(R), and iii) the communication overhead for additional applications, i.e., L(R’ - R). Therefore, selecting resources that form a near convex region becomes our goal for this sub-problem. More details on the algorithm steps, together with examples, were given in Section 5.5.1.

Note that there is no optimal solution for this region selection sub-problem, since the sequence of future events is not known a priori. We later evaluate the overall methodology on long event sessions and show the potential of the region selection algorithm (see Section 7.7).

7.5.4. Solving the Application Mapping Sub-problem (P4)

The inputs of the application mapping sub-problem are i) the ACG of the incoming application and ii) the region resulting from P3. Our goal here is to map the application tasks in the ACG to the resource locations in R (i.e., a vertex allocation process) such that the communication energy consumption is minimized. More details on the algorithm steps and examples are available in Section 5.5.2.

7.6. Light-Weight Model Learning Process

As already mentioned, once the global manager collects enough data on the sequences from one user, it starts building a light-weight user model that can be used to decide between Approaches 1 and 2 (see stage 3 in Figure 7.3), instead of simply using the pre-defined user model (stage 2 in Figure 7.3). We apply a machine learning technique, more precisely decision


tree learning, for building the user model from the input traces of each specific user. Due to the requirement of having a small energy and run-time overhead, our tree structure has i) up to four leaf nodes and ii) the two feature parameters α and β for each branch (see Figure 7.9). Under such a decision tree, there are 22 different tree structures and decision combinations we can consider, where each tree structure TS_c can be interpreted as a unique classifier TS_c(α_th, β_th) (i.e., c = 1-22). The terms “classifier” and “tree structure” are used interchangeably when there is no ambiguity.

In Figure 7.9(a)-(d), we plot four possible classifiers. For example, under the tree structure in Figure 7.9(a), Approach 1 is applied to applications where either “α is greater than α_th and β is smaller than β_th”, or “α is smaller than α_th and β is greater than β_th”. If we use the structure in Figure 7.9(d), then the critical-application decision is made depending only on the α value.

Figure 7.9 Four possible decision tree structures for the user model.
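As an illustration, two of the 22 classifiers can be written directly in C++; the names are ours, and for Figure 7.9(d) we assume Approach 1 is taken on the high-α branch (the prose above only fixes that the decision depends on α):

enum class Approach { A1, A2 };

struct Thresholds { double alpha_th, beta_th; };

// Figure 7.9(a): Approach 1 iff exactly one of the two features exceeds its threshold.
Approach classifyA(double alpha, double beta, Thresholds th) {
  bool hiA = alpha >= th.alpha_th, hiB = beta >= th.beta_th;
  return (hiA != hiB) ? Approach::A1 : Approach::A2;
}

// Figure 7.9(d): the decision depends on alpha only (direction assumed here).
Approach classifyD(double alpha, double /*beta*/, Thresholds th) {
  return (alpha >= th.alpha_th) ? Approach::A1 : Approach::A2;
}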

While the sequences of events vary a lot between different users (see Figure 1.5, where the number and type of applications running on the system are quite different), the specific tree structure should be learned to fit a certain user. We denote the collected/given sessions³ of any user u_i as the training dataset (D_tr^{u_i}), and the future (or unseen) sessions as the testing dataset (D_te^{u_i}). The goal of the model learning process is to find a tree structure ST and the related threshold values, α_th and/or β_th, such that the communication energy consumption for the future user sequences in D_te^{u_i} is minimized. Since the dataset D_te^{u_i} is not given in advance, we can only target the sessions in D_tr^{u_i}; that is,

Given the sessions in D_tr^{u_i},

Find a tree structure TS_c and the corresponding parameters (α_th^c, β_th^c) which minimize:

min { \( \sum_{\forall s_i \text{ in } D_{tr}^{u_i}} E_{comm\_total}\bigl(T_{s_i}, TS_c(\alpha_{th}^c, \beta_{th}^c)\bigr) \) }   (7.18)

where T_{s_i} is the length of session s_i, c = 1-22, and α_th, β_th = 0.1, 0.2, ..., 0.9.

It has been observed that the performance on the training set is not a good indicator of the predictive performance on unseen data, due to the problem of over-fitting [56]. More precisely, Efron [56] proposes a cross-validation method for minimizing the future prediction error. Here, we first explain the k-fold cross-validation method [92] and then provide the pseudo code of our tree-based learning process with and without cross-validation.

3. A session is a sequence of events between a user logging in and logging off the system. As shown in Figure 1.5, every duration from the positive clock edge to the negative clock edge is referred to as an event.

In the k-fold cross-validation, the training dataset D_tr is randomly partitioned into k mutually exclusive subsets (called folds) D_1, D_2, ..., D_k of approximately equal sizes. Each classifier TS(α_th, β_th) is trained and tested k times; each time t ∈ {1, 2, ..., k}, the classifier is trained on D - D_t and tested on the validation set D_t. The performance of the classifier is measured by the average testing result on the validation sets D_t, t = 1~k. A cross-validation example with k = 4 is shown in Figure 7.10, where the validation sets are indicated by the gray blocks. For each classifier TS, we train on (D2,D3,D4), (D1,D3,D4), (D1,D2,D4), and (D1,D2,D3) and test on D1, D2, D3, and D4, respectively. The goal is to find the classifier whose average testing results on the validation sets are the best.

Figure 7.11(a) and (b)(c) illustrate the pseudo code of the structure learning process without and with cross-validation, respectively. The sub-function find_para takes as inputs D and TS_c, and returns the parameters (α_th^c, β_th^c) for classifier c together with its corresponding performance result, E_min. The k-fold cross-validation method works by training on D - D_t (see line 3 in Figure 7.11(b)) and testing on the validation set D_t (see line 4 in Figure 7.11(b)) for t = 1 - k.

Of note, the complexity of the k-fold cross-validation approach depends on several parameters, such as the value of k, the model selection with its related factors, and the amount of training data we have (i.e., the number of sequences of events we collect). For k, 10 is a widely suggested value [56]. In terms of the amount of training data, intuitively, the more, the better. If the user learning process has no memory or runtime limitation, then we can use all the collected data, or even apply the bootstrap method [56], which can increase the model accuracy. In practice, however, we need to carefully select all the parameter settings. We report the overhead for our system environment in Section 7.7.3.

Figure 7.10 4-fold cross-validation for model learning (at each step t = 1, ..., 4, the validation fold D_t is shaded and the remaining folds form the training set).

Figure 7.11 (a) Pseudo code of the tree structure learning process without the cross-validation method and (b)(c) with the cross-validation method.

(a) structure learning without cross-validation
01: for c = 1 : 1 : num_classifier
02:   (TS_{D_tr}^c(α_th^c, β_th^c), E^c) = find_para(D_tr, TS_c)
03: select one classifier c′ s.t. E^{c′} is minimized
04: output: learned model TS^{c′}(α_th^{c′}, β_th^{c′})

(b) structure learning with k-fold cross-validation
01: for c = 1 : 1 : num_classifier
02:   for t = 1 : 1 : k
03:     (TS_{D_tr - D_t}^c(α_th^c, β_th^c), E^c) = find_para(D_tr - D_t, TS_c)
04:   E_all^c = (1/k) \( \sum_{t=1}^{k} \sum_{\forall s_i \text{ in } D_t} E_{comm\_total}\bigl(T_{s_i}, TS_{D_{tr}-D_t}^c(\alpha_{th}^c, \beta_{th}^c)\bigr) \)
05: select one classifier c″ minimizing E_all^{c″}
06: (TS_{D_tr}^{c″}(α_th^{c″}, β_th^{c″}), E^{c″}) = find_para(D_tr, TS_{c″})
07: output: learned model TS^{c″}(α_th^{c″}, β_th^{c″})

(c) sub-function find_para(Dataset D, classifier TS_c)
01: E_min ← ∞
02: for α_th = 0.1 : 0.1 : 0.9
03:   for β_th = 0.1 : 0.1 : 0.9
04:     E_tmp ← \( \sum_{\forall s_i \text{ in } D} E_{comm\_total}\bigl(T_{s_i}, TS_c(\alpha_{th}, \beta_{th})\bigr) \)
05:     if E_tmp < E_min
06:       do E_min ← E_tmp
07:          (α_th^c, β_th^c) ← (α_th, β_th)
08: return (α_th^c, β_th^c, E_min)
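A compact C++ sketch of the learning loops in Figure 7.11 follows. Session and EcommTotal are placeholders for the recorded sessions and the simulator call that evaluates the cost term of Equation 7.18; the fold layout is an assumption of this sketch:

#include <limits>
#include <vector>

struct Session { int id; };  // one recorded user session (placeholder)

// Placeholder hook: in the thesis this cost is evaluated by the NoC simulator.
double EcommTotal(const Session&, int /*c*/, double /*ath*/, double /*bth*/) {
  return 0.0;
}

struct Learned { int c; double ath, bth, energy; };

// Figure 7.11(c): grid-search the thresholds of one classifier over dataset D.
Learned findPara(const std::vector<Session>& D, int c) {
  Learned best{c, 0, 0, std::numeric_limits<double>::max()};
  for (double a = 0.1; a < 0.95; a += 0.1)
    for (double b = 0.1; b < 0.95; b += 0.1) {
      double e = 0;
      for (const Session& s : D) e += EcommTotal(s, c, a, b);
      if (e < best.energy) best = {c, a, b, e};
    }
  return best;
}

// Figure 7.11(b): k-fold cross-validation over all classifiers.
Learned learnModel(const std::vector<std::vector<Session>>& folds, int numClassifiers) {
  int bestC = 0; double bestE = std::numeric_limits<double>::max();
  for (int c = 0; c < numClassifiers; ++c) {
    double eAll = 0;
    for (size_t t = 0; t < folds.size(); ++t) {
      std::vector<Session> train;                       // D_tr - D_t
      for (size_t u = 0; u < folds.size(); ++u)
        if (u != t) train.insert(train.end(), folds[u].begin(), folds[u].end());
      Learned l = findPara(train, c);
      for (const Session& s : folds[t])                 // validate on D_t
        eAll += EcommTotal(s, c, l.ath, l.bth);
    }
    eAll /= folds.size();
    if (eAll < bestE) { bestE = eAll; bestC = c; }
  }
  // retrain the winning classifier on the full training set
  std::vector<Session> all;
  for (const auto& f : folds) all.insert(all.end(), f.begin(), f.end());
  return findPara(all, bestC);
}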

7.7. Experimental Results

In Section 7.7.1, we evaluate the hybrid approach consisting of Approaches 1 and 2 under

the pre-defined user model (stage 2 in Figure 7.3). We first evaluate two sub-problems, P1 and

P4, which have an optimal solution; later, the methodology combining P1-P4 at stage 2 is

evaluated. In Section 7.7.2, the energy overhead of running our run-time algorithms is

evaluated for real applications. Finally, the performance of the on-line user model learning

process in stage 3 is reported in Section 7.7.3.

7.7.1. Evaluation on Random Applications

We first evaluate the solution quality of the region forming and application mapping algorithms against the optimal solution. The experiments are performed on an AMD Athlon™ 64 Processor 3000+ running at 2.04 GHz; the results are shown in Figure 7.12. Twenty categories of random applications are generated with TGFF [162]; these are the combinations of an ACG density of 30%, 50%, 70%, or 90% and a variance of the communication rate per edge within one application of 10, 10², 10³, 10⁴, or 10⁵. Each category contains 50 applications, with the number of vertices in an ACG ranging from 8 to 12, and each vertex is assumed to have the same computation requirement. The ACG density of an application is defined as

174

the number of edges in the application divided by the total number of edges in the complete

graph (in which each pair of vertices is connected by an edge). For instance, the ACG density

of an application in Figure 7.6 with 7 vertices and 9 edges is 9/21×100% = 42.8%.

Figure 7.12(a) gives the communication energy consumption under the region forming algorithm (P1) by comparing it against the optimal solution (i.e., the internal communication cost minimized without any boundary constraint). Figure 7.12(b) compares the communication energy consumption of the application mapping algorithm (P4) against the optimal solution for a given region obtained from the region selection algorithm (P3), where the size of the region equals the number of vertices in the ACG. The simulation takes around 3 minutes to obtain the optimal solution for each application; this is clearly inadequate for a run-time solution. In contrast, our algorithms for P1 and P4 take less than 1 microsecond. As shown in Figure 7.12(a)-(b), the loss in communication energy consumption is less than 12% compared to the optimal solution for all categories, with about 4.5% and 6.2% communication energy loss, on average, in Figure 7.12(a) and (b), respectively.

Figure 7.12 Communication energy loss compared to the optimal solution for (a) the region forming (P1) sub-problem and (b) the application mapping (P4) sub-problem on a 2D-mesh NoC (axes: ACG density (%), variance of the communication rate per edge, and communication energy loss percentage).

Next, we evaluate stage 2 of the entire methodology shown in Figure 7.3. Assume that 10 different applications can be invoked by a user; their ACGs have been generated with the number of vertices in each application ranging from 3 to 10, and the execution time of the applications ranges from 5 to 30 seconds. Next, we use probabilities to capture the user behavior. More precisely, we randomly generate the first 100 events; after that, events occur according to the occurrence probability of all previous events. The events displayed in Figure 7.13 start from the 101st event. The events are executed on a platform with 8 × 8 processors, one of them being the master PE. The pre-defined user model we use has the same structure as in Figure 7.9(b), with α_th and β_th set to 0.8 and 0.7, respectively (these threshold values best fit all the data collected from various user sequences). The communication cost at time t is computed as the total communication energy consumed by all applications running in the system over the period from time t-1 to time t.

In Figure 7.13(a), we denote the communication cost of the hybrid approach by “cost_3”, while the communication costs of Approaches 1 and 2 are denoted by “cost_1” and “cost_2”, respectively. For the hybrid approach, the process starts taking the user behavior into consideration after time 20, while the iteration count iter (see Figure 7.4) is set to 3. Note that the information about applications leaving the system is not displayed in the figure.

As shown in Figure 7.13(a), with the user behavior considered, the energy overhead at the start is less than when a deterministic approach is applied. As shown, Approach 1 performs well initially, since the system utilization is low. As the system utilization increases, Approach 1 performs poorly, since there are many non-contiguous regions in the system. One can also see that the communication cost ratio does not fluctuate significantly after the system runs for a certain period of time. We estimate that the hybrid approach achieves about 40% and 25% communication energy savings compared


to Approaches 1 and 2, respectively. We also compare the L(R) metric among the different approaches, where R is the set of available/unused resources in the system at each time unit. We denote the L(R) of the hybrid approach by “L_3”, while the L(R) values of Approaches 1 and 2 are denoted by “L_1” and “L_2”. In Figure 7.13(b), we plot the ratios L_3/L_1 and L_3/L_2, and

the number of applications in the system at each time unit.

Figure 7.13 (a) Communication cost comparison among Approach 1, Approach 2, and the hybrid approach (which considers the user behavior) on an 8 × 8 NoC. (b) Comparison of L(R), where R is the set of available/unused resources, among Approach 1, Approach 2, and the hybrid approach.

As shown in Figure 7.13(b), the L(R) of the hybrid approach is less than 10% greater than that of Approach 2, which implies

that the external contention in the hybrid approach is not severe. In addition, when the system utilization is higher (i.e., more applications are present in the system), it can be observed that the L(R) in Approach 1 is much higher than that in the hybrid approach; this is because, when an application leaves the system, a scattered region is always left behind in the system configuration.

To show the scalability of our proposed methodology, we report in Table 7.2 the communication consumption comparison among these three approaches on NoCs of different sizes, i.e., 6 × 6, 8 × 8, 10 × 10, and 12 × 12. The communication cost of running applications on each NoC for a certain approach is obtained once the ratio no longer fluctuates significantly, i.e., after the system runs for a certain period of time (similar to the case in Figure 7.13(a)). As observed, the ratios decrease as the size of the platform increases, which shows that our hybrid approach is particularly promising for large NoC platforms.

Table 7.2 Comparison of communication consumption among the different approaches on NoCs of different sizes.

communication ratio   6 × 6   8 × 8   10 × 10   12 × 12
cost_3/cost_1          0.74    0.60     0.48      0.32
cost_3/cost_2          0.81    0.75     0.71      0.63

7.7.2. Real Applications with Run-time Energy Overhead Considered

In this section, we apply our proposed methodology to real applications, i.e. the embedded

system benchmark suite (E3S) [50]: Automotive/Industrial, Consumer, Networking, Office

automation, and Telecom. Our heterogeneous 5 × 5 mesh-based NoC contains 24 slave processors (some are AMD ElanSC520 cores operating at 133 MHz, some are AMD K6-2E cores


operating at 500 MHz) and one master PE, a MicroBlaze core (100 MHz), acting as the global manager.

A C++ simulator using the bit energy metric model in [170] evaluates the communication energy consumption, where E_link is set to 4.49 × 10⁻¹³ (Joule/bit) and E_Rbit contains the energy consumed by the routing engine (10⁻¹³ Joules/packet), arbiter request (1.155 × 10⁻¹² Joules/packet), switch fabric (2.84 × 10⁻¹³ Joules/bit), and buffer reading and writing (1.056 × 10⁻¹² and 2.831 × 10⁻¹² Joules/bit, respectively).

The five E3S benchmarks have been partitioned off-line [30][127]. As such, the number of vertices in the ACG of each benchmark ranges from 3 to 8, and the vertices have two different computation capacity levels, where critical vertices must operate on the AMD K6-2E in order to meet the application deadline. The user sequences come from realistic data collected from five different applications running under Windows XP. The execution time of an event is normalized to the reasonable range of 10 μs to 10 ms for the E3S benchmarks. We run 200 events (from the 101st to the 300th event) for each scenario: “Nearest Neighbor [27]”, “Approach 1”, “Approach 2”, and “Hybrid approach”. For the “Approach 1” and “Approach 2” scenarios, we apply only Approach 1 and Approach 2, respectively, to all events (see Figure 7.4). For the “Hybrid approach” scenario, Approaches 1 and 2 are selected at run-time (with the iteration count iter (see Figure 7.4) set to 3) based on the pre-defined user model in Figure 7.9(b), with α_th and β_th set to 0.8 and 0.7, respectively. Of note, “Approach 2” and “Hybrid approach” correspond to stages 1 and 2 in Figure 7.3, respectively.

In the following evaluation, we consider the run-time and energy overhead of processing our proposed algorithms: i) running the approach selection process (i.e., computing α and β and comparing them to α_th and β_th) on the manager, ii) running the resource assignment process (i.e., P1-P2 for Approach 1 and P3-P4 for Approach 2), and iii) sending the control messages over the control network back to the manager. Of note, the communication volume for all control messages in


one event is Z = [a bits (encoding the location of a slave processor, which depends on the network size) + 1 bit (resource status)] × MD (the sum of the distances of all slave processors to the master PE). Of note, compared to the data messages transmitted for real applications (which are on the order of megabits), the overhead of sending control messages is clearly negligible.
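As an illustrative calculation (assuming, for concreteness only, that the master PE sits at a corner of the 5 × 5 mesh used here): encoding one of 25 PE locations takes a = ⌈log₂ 25⌉ = 5 bits, so each slave contributes 5 + 1 = 6 bits per report; the sum of the Manhattan distances of all 24 slaves to a corner master is \( \sum_{x=0}^{4}\sum_{y=0}^{4}(x+y) = 100 \), giving Z = 6 × 100 = 600 bits per event, indeed negligible next to megabit-scale data traffic.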

To compare the communication energy consumption for each scenario, we denote some variables as follows:

• E_x^{App Q}: the communication energy consumption of an event Q, or an application Q, in scenario x.

• E_{x_select}^{App Q}: the energy overhead of running the approach selection and resource assignment processes on the master PE for the event Q in scenario x.

• T_{x_select}^{App Q}: the run-time overhead of running the approach selection and resource assignment processes on the master PE for the event Q in scenario x (obtained from a MicroBlaze processor running on a Xilinx Virtex-II Pro XC2VP30 FPGA).

We set the “Nearest Neighbor [27]” scenario as the baseline algorithm and assume zero energy and run-time overhead for the baseline. Then the total communication energy consumption for these 200 events in scenario x is:

\( \sum_{Q=101}^{300} \left[ E_x^{App\,Q} + E_{x\_select}^{App\,Q} \right] \)

The experimental results are shown in Table 7.3. We determine experimentally that for the “Hybrid approach”, about 60% communication energy savings can be achieved compared to the “Nearest Neighbor” scenario.


When compared with the “Approach 1” and “Approach 2” schemes, the “Hybrid approach” provides around 38% and 24% communication energy savings, respectively. The average run-time overhead of the approach selection and resource assignment processes in the “Approach 1”, “Approach 2”, and “Hybrid approach” scenarios is 47.2 μsec, 49.4 μsec, and 53.4 μsec per event, respectively. As the hard deadlines of the E3S benchmarks are on the order of milliseconds, our algorithms are suitable for execution at run-time.

7.7.3. Real Applications with On-line Learning of User Model

Here, we evaluate the performance with the on-line learned user model. In this implementation, we include a Multimedia System (MMS) [80], a Video Object Plane Decoder [104], and the five E3S benchmarks [50]; the number of vertices in the ACG of each benchmark ranges from 5 to 25. Four scenarios (Nearest Neighbor [27] and stages 1, 2, and 3 in Figure 7.3) are considered for the system configuration of a 10 × 10 mesh network. For stage 3, the 10-fold cross-validation method is used for learning the user model [56]. As discussed in Section 2.3.1, in order to capture the essence of human behavior while users interact with computing systems, the user sequences come from collecting the behaviors of four users running seven applications (i.e., Media Player, Windows Explorer, PowerPoint, Matlab, Adobe Acrobat, Microsoft Word, and Outlook Express) in a Windows XP environment; the execution time is normalized, ranging from 10 μs to 100 ms.

Table 7.3 Comparison of the run-time overhead and the overall communication energy savings under four implementations on a 5 × 5 mesh NoC.

Approach                  | Tx_selection: avg. run-time overhead per event (μsec) | Normalized total event cost
Nearest Neighbor [27]     | 0                                                      | 1
Approach 1                | 47.2                                                   | 0.682
Approach 2 (stage 1)      | 49.4                                                   | 0.551
Hybrid approach (stage 2) | 53.4                                                   | 0.405


Table 7.4 shows the user model setting and the normalized total event cost (including the communication energy consumption of all events and the energy overhead of running the approach selection and the resource assignment process for each event). In stage 2, the number of user sequences is set to 10, with each having 25-50 time units; αth and βth are set to 0.8 and 0.7, respectively, for the default user model. For the experiment in stage 3, we collect the user sequences in sessions of 10 minutes, sampled as the user logs in and out of the system over three months. We use the sequences collected in the first two months as the training dataset Dtr and apply the 10-fold cross-validation method for learning the user model. As seen in Table 7.4, the on-line learned user models differ from one user to another. The performance of stage 3 in Table 7.4 is evaluated on sequences excluded from the training dataset (i.e., the sequences collected in the last month).

Table 7.4 Normalized event cost in stages 1, 2, and 3 under different user models from four users normalized to the total event cost of “Nearest Neighbor [27]” approach.

                                             | user #1                   | user #2                   | user #3                   | user #4
Nearest Neighbor [27]: normalized event cost | 1                         | 1                         | 1                         | 1
stage 1: user model (ST, αth, βth)           | N/A                       | N/A                       | N/A                       | N/A
stage 1: normalized event cost               | 0.431                     | 0.405                     | 0.384                     | 0.437
stage 2: user model (ST, αth, βth)           | (Figure 7.9(c), 0.8, 0.7) | (Figure 7.9(c), 0.8, 0.7) | (Figure 7.9(c), 0.8, 0.7) | (Figure 7.9(c), 0.8, 0.7)
stage 2: normalized event cost               | 0.318                     | 0.272                     | 0.297                     | 0.341
stage 3: user model (ST, αth, βth)           | (Figure 7.9(c), 0.8, 0.8) | (Figure 7.9(d), 0.8, N/A) | (Figure 7.9(c), 0.7, 0.5) | (Figure 7.9(b), N/A, 0.9)
stage 3: normalized event cost               | 0.284                     | 0.201                     | 0.232                     | 0.291


As observed in Table 7.4, with the pre-defined user model considered (stage 2), we achieve 70% communication energy savings, on average, compared to the “Nearest Neighbor” scheme. With the learned user model (stage 3), we can achieve 18% and 75% communication energy savings, on average, compared to the pre-defined user model scheme and the “Nearest Neighbor” scheme, respectively.4 When the user model learning procedure is run on the training dataset Dtr without the 10-fold cross-validation method, we observe that, on average for each user, 15% more energy is consumed on the testing dataset compared to user model learning with the cross-validation method applied.
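For illustration only, the minimal scikit-learn sketch below mimics this tree-based learning step with and without 10-fold cross-validation; the feature matrix X (trace-derived features), the labels y, the data shapes, and max_depth are placeholder assumptions, not the thesis's actual dataset or implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((600, 8))          # placeholder features derived from user traces
y = rng.integers(0, 7, size=600)  # placeholder labels, e.g., 7 applications

# Without cross-validation: fit once on the training dataset Dtr
tree = DecisionTreeClassifier(max_depth=5).fit(X, y)
print("training accuracy:", tree.score(X, y))  # optimistic (over-fitting risk)

# With 10-fold cross-validation: a less biased estimate on unseen sequences
scores = cross_val_score(DecisionTreeClassifier(max_depth=5), X, y, cv=10)
print("10-fold CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```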

Of note, the overhead of the model learning step is not included in Table 7.4 since the learning step is not executed for each event. Basically, we collect hundreds of sequences from one user over a period of time and update the user model once for that user, if necessary. The overhead of learning the user model is affected by multiple factors, i.e., the user model complexity (including the parameters and the structure used for building the model), the amount of collected sequences (i.e., how often the model is updated), and the computation capability of the global manager.

Here, we report that the run-time overhead of the user model learning process (the steps in Figure 7.11(b)), running on a 100 MHz MicroBlaze processor with the collected user sequences (sessions of 10 minutes, sampled as the user logs in and out of the system for three months), is 1.3 seconds without and 9.6 seconds with the cross-validation process. The user model is learned from 22 different decision trees (or classifiers) combined with different αth and βth values ranging from 0.1, 0.2, ..., to 0.9, while using the 10-fold cross-validation. We conclude that having the cross-validation process in the learning procedure indeed helps

4. For example, in the MIT RAW on-chip network, where the communication energy consumption represents 36% of the total energy consumption [20], applying our approach in stage 3 of Figure 7.3 can save 27% of the total energy consumption compared to the “Nearest Neighbor” scheme.


build a more accurate user model, but we need to evaluate its run-time overhead. Therefore, in reality, there is no strict rule about when, where, or even how to update the user model. It is suggested to perform some analysis to see whether or not the current user behavior still fits the latest updated user model. In addition, we could update the model slightly (e.g., modifying the thresholds only) instead of relearning it from the whole set of collected sequences. Moreover, for future embedded systems, the user model learning process need not run on the system itself. We could periodically upload the collected sequences to a data center and, with its strong computation capacity, build a more accurate model before downloading the relevant parameters back to the system.

The proposed tree-based structure for building the user model on-line for each specific user is the first step toward run-time optimization that takes the user behavior into consideration. More work needs to be done to increase the adaptability of the system while considering the feedback between system and users. In addition, if a user’s behavior changes drastically for certain reasons (as shown in Figure 1.1, where the dot moves from one cluster to other clusters), it is suggested to either include heavier run-time optimization such that the system can better adapt to this user, or to recommend that the user acquire a new platform which fits him/her well with light-weight optimization.

7.8. Summary

In this chapter, we have proposed a run-time strategy for allocating the application tasks to an embedded MPSoC platform where communication happens via the NoC approach. As a novel contribution, we have incorporated the user behavior information in the resource allocation process; this allows the system to better respond to real-time changes and to adapt dynamically to different user needs. Several algorithms have been proposed for solving the task allocation problem, while minimizing the communication energy consumption and network contention.


By applying machine learning techniques, more precisely tree-based model learning, for building the user model from input traces, we can achieve around 75.8% communication energy savings compared to an arbitrary contiguous allocation scenario on the NoC platform. Although we focus on the 2-D mesh NoC platform, our algorithm can be adapted to other regular architectures with different network topologies.

The methodology proposed in this chapter can be applied to embedded systems in the second and third categories discussed in Table 1.1. For the systems in the first category, we are also interested in researching the patterns of human dynamics, which would allow us to follow specific human actions in ultimate detail, such as adding the social behavior component (flow experience). This remains to be done in future work (see Section 8.2.2).


8. CONCLUSIONS AND FUTURE DIRECTIONS

With industry shifting to platform-based embedded system design, traditional DSE techniques have progressed from a task-level and resource-level perspective to even a system-level one, targeting system optimization with the goal of improving the system performance. Over recent years, embedded systems have gained an enormous amount of processing power and functionality; future systems will likely consist of tens or hundreds of heterogeneous cores supporting multiple applications. From the users’ perspective, however, users purchase “fit enough” products which typically provide “just-enough performance” during operation, and focus instead on additional concerns, such as the appearance and practicability of the product, or even its price. In addition, due to the high variability seen in user preferences, it becomes much more challenging for system designers to meet the various users’ tastes.

8.1. Dissertation Contributions

In this dissertation, we propose a unified user-centric embedded design framework for both off-line DSE and on-line optimization, explicitly incorporating the user experience into the process. In other words, we target incorporating the user behavior information into the system design, optimization, and evaluation steps. The main contributions of this dissertation can be summarized as follows:

• For MPSoCs with predictable system configurations, the platform can be generated following the traditional Y-chart design flow, given the universal use-cases, application parameters, and architecture templates. In Chapter 3, we explored the system interconnect for large MPSoC designs using the NoC communication approach, which aims at trading off the system performance and several physical design metrics. The results demonstrated that the optimization framework is capable of obtaining Pareto solutions with multiple buses instead of a single bus, which significantly reduce communication latency with negligible fabric wirelength and area penalties.

• Satisfying the end user is the ultimate goal of any system optimization. Toward this end, in Chapter 4, we presented a new design methodology for automatic regular platform generation of embedded NoCs targeting unpredictable system configurations, while explicitly including information about the user experience in the design process; this aims at minimizing the workload variance and allows the system to better adapt to different types of users. More precisely, we relied on machine learning techniques to cluster the traces from various users into several classes, such that the differences in user behavior within each class are minimized. Then, for each cluster, we proposed an automated architecture generation process that decides the number, the type, and the location of the resources available in the platform, while satisfying various design constraints.

• Exploring on-line resource allocation techniques for mapping multiple applications onto multiple computing resources is a fundamentally important issue in MPSoC design; it also belongs to the large class of resource allocation problems in parallel systems. In Chapter 5, efficient techniques for run-time application mapping onto NoC platforms with multiple voltage levels were presented, with the goal of minimizing the total communication energy consumption and maximizing the system performance, while still providing the required performance guarantees. In parallel, the proposed techniques allow new applications to be easily added to the system platform with minimal inter-processor communication overhead.


• The collective resource utilization and system reliability are important for achieving the overall computing capacity of MPSoCs. Especially for larger MPSoCs integrating hundreds or thousands of cores, where the communication happens via the NoC approach, any failure in the computation or communication components may degrade the system performance, or even render the whole system useless. In Chapter 6, we discussed the workload variation resulting from the system itself and then investigated the spare core placement problem taking the fault-tolerance property into account. As the main theoretical contribution, we addressed the resource management problem on irregular NoC platforms where permanent, transient, and intermittent faults can appear statically or dynamically in the system. A fault-tolerant application mapping algorithm was presented which allocates the application tasks to the available, reachable, and defect-free resources with the goal of maximizing the overall system performance.

• Due to variations in users’ behavior, the workload across different resources may exhibit high variability even when using the same hardware platform. In Chapter 7, extensible and flexible run-time resource management techniques were presented that allow systems to respond much better to run-time changes. In addition, we proposed light-weight machine learning techniques for learning the user model at run-time such that the systems are able to adapt dynamically to user needs. Given the application characteristics, the on-line learned user model, and the current system configuration, our algorithm assigns the dynamic application tasks to the appropriate resources such that the overall system performance is maximized. It has been experimentally demonstrated that considering the user behavior during the resource management process significantly improves the system performance.


8.2. Future Directions

The methodologies and user-centric ideas presented in this dissertation can open several

interesting research topics and challenges. In what follows, we summarize these directions.

8.2.1. Challenges Ahead for User-centric Embedded System Design

System-level approaches (e.g., early performance analysis and evaluation) play an important role in DSE, especially for large-scale embedded systems which consist of multiple heterogeneous cores. Here, we highlight several important issues for designing embedded systems (not just NoC platforms!) with users in mind.

• Model exploration and its level of granularity: Simply speaking, the proposed user-centric design flow (see Figure 1.7(b)) explores models based on given data, as shown in Figure 8.1. Machine learning techniques help explore robust models, together with useful parameters/features extracted from the given data, for predicting the output result as accurately as possible. More details on user model exploration and the corresponding challenges are surveyed in Appendix A. In addition, finding the right level of granularity for the application, platform, and user trace specification (see Chapter 2) is still an open problem for embedded systems.

Figure 8.1 Model exploration for user-centric design flow.



• Human Dynamics: Systems are designed for humans. Exploring human dynamics helps capture individual human behavior and follow specific human actions in ultimate detail. Several studies on human dynamics are already available, such as those on heavy-tailed distributions (also known as power-law distributions) [11][69][70][163]. However, it is still an open problem how one can incorporate human activity patterns in embedded systems design.

• Workload Fidelity: Due to the variation in user preferences, it is necessary to understand the buyer-to-be’s workload of target applications before designing a system [52]. Chen et al. developed a workload analysis, recognition, mining, and synthesis (RMS), to model events, objects, and concepts based on end-user inputs, which can be applied to embedded systems, gaming, graphics, and even financial analytics [33]. With a good understanding of the workloads from different users, we could determine the right level of granularity for the application specification, and later capture the main scenarios of the system configuration for each specific user.

User-centric research for embedded systems is still at an early stage, and there is much work that needs to be carried out, from modeling the user behavior, to analyzing various workloads, to user-aware DSE and optimization. However, we believe that making users part of the design process is crucial for embedded systems design, as this can lead to better and more flexible designs in the future.

8.2.2. Increasing Flow Experience by Designing Embedded Systems

We argue that future embedded systems need to be designed using a flexible user-centric

design methodology geared primarily toward maximizing the user satisfaction (i.e., flow

experience) rather than only optimizing performance and power consumption. Therefore,

compared to the traditional design, we aim at re-focusing the current design paradigm by


placing the user behavior at the center of the design process and by using psychological

variables such as user ability and motivation as the main drivers of this process. This allows

systems to become more capable of adapting to different users' needs and of enhancing short-

and long-term user satisfaction.

Generally speaking, the flow experience is a mental state in which an individual feels

completely immersed in the task or activity at hand. Think, for instance, of web browsing.

While in flow, the user experiences enhanced motivation, concentration, positive affect, and

task involvement [45]. Theoretically, an individual achieves a state of flow when his/her

abilities match the challenges faced when engaged in executing a particular task. It has been

observed that designing interfaces that favor flow experiences helps increase the usability of

the information technology in use. Moreover, the increased user motivation experienced

during a flow episode guarantees continued use of technology and enhanced return behavior

[124].

Prior work in psychology indicates that while each task or application makes users feel

more or less stimulated, the optimal level of motivation (i.e., the flow experience) is achieved

by challenging tasks that match the user's abilities as seen in Figure 8.2 [45][167]. Indeed, if

the level of challenge presented by an application is low and does not engage the user, then the

user can quickly lose interest and get relaxed or even become bored by that particular activity

(see zones III and IV in Figure 8.2). However, if the task challenge is beyond the user's current

ability, then the activity becomes overwhelming and the user may feel frustrated or even

anxious (zone I in Figure 8.2). As shown in Figure 8.2, the flow zone (zone II) is reached only

when the task is challenging and the user's skills are great enough to deal with it; when

achieving a flow experience, the individual truly finds pleasure in doing the current activity

[45][147].


Therefore, from previous psychological research, mapping the relationship between the

task difficulty and the level of user ability is very important in predicting flow experience.

Applying this work to the embedded system design process, the experience of anxiety or relaxation should prompt designers either to change the level of challenge in the system (e.g., the CPU) or to motivate the user to increase his/her skill level in order to re-experience the flow

[147]. Thus, in order to maximize user flow experience, the traditional design paradigm needs

to be redefined by taking into consideration psychological variables such as user ability and

positive affect.

Looking forward, we believe that enhancing the users’ flow experience offers a new paradigm for understanding both individual and collective human functioning, and we consequently plan to explore its implications for user-centric DSE of embedded systems. Motivation and preliminary results are reported in [43]. We are hopeful that the future design process will take into account the human nature of users and their ever-changing abilities and interests.

Figure 8.2 Four-quadrant states in terms of challenge and skill level.

(Axes: skill level on the horizontal axis, challenge level on the vertical axis. Quadrants: I. Anxiety, II. Flow, III. Boredom, IV. Relaxation.)


Bibliography

[1] H. Abdel-wahab, et al., “A proportional share resource allocation algorithm for real-time, time-shared systems,”

Proc. Real-Time Systems Symposium, 1996, pp. 288-299.

[2] S. N. Adya and I. L. Markov, “Fixed-outline floorplanning: enabling hierarchical design,” IEEE Trans. on VLSI

Systems, vol 11(6), Dec. 2003, pp. 1120-1135.

[3] R. Alur, D. L. Dill, “A theory of timed automata,” Theoretical Computer Science, 1994, vol. 126, pp. 183-235.

[4] G. Ascia, V. Catania, M. Palesi, “Multi-objective mapping for mesh-based NoC architectures,” Proc. Hardware/

Software Codesign and System Synthesis (CODES+ISSS), Sept. 2004, pp.182-187.

[5] G. Ascia, V. Catania, M. Palesi., “A multi-objective genetic approach to mapping problem on Network-on-Chip,”

Journal of Universal Computer Science, vol. 12, no. 4, 2006, pp. 370-394.

[6] A. Avd, “1.1 Billion Cell Phones Sold Worldwide In 2007, Says Study,” http://www.switched.com/2008/01/25/1-

1-billion-cell-phones-sold-worldwide-in-2007-says-study/.

[7] A. Baghdadi, et al., “An efficient architecture model for systematic design of application-specific multiprocessor

SoC,” Proc. DATE, 2001, pp. 55-63.

[8] F. Balarin, et al. Hardware-Software Co-design of Embedded Systems - The POLIS approach. Kluwer Academic

Publishers, 1997.

[9] N. Banerjee, P. Vellanki, K. S. Chatha, “A power and performance model for Network-on-Chip architectures,”

Proc. DATE, 2004, pp. 1250-1255.

[10] S. Bani-Mohammad, M. Ould-Khaoua, I. Ababneh, L. M. Mackenzie, “An efficient processor allocation strategy

that maintains a high degree of contiguity among processors in 2D mesh connected multicomputers.” Proc.

Computer Systems and Applications, 2007, pp. 934-941.

[11] A.-L. Barabási, “The origin of bursts and heavy tails in human dynamics,” Nature 435, 2005, pp. 207-211.

[12] M. Barr, “Architecting embedded systems for add-on software,” Embedded Systems Programming, Sept. 1999,

pp. 49-60.

[13] E. T. Bell, “Exponential Numbers,” Amer. Math. Monthly, vol. 41, 1934, pp. 411-419.

[14] C. M. Bender, M. A. Bender, E. D. Demaine, S. P. Fekete, “What is the optimal shape of a city?,” Journal of

Physics A: Mathematical and General, vol. 37, 2004, pp. 147-159.

[15] M. A. Bender, et al., “Communication-aware processor allocation for supercomputers,” Proc. Workshop on

Algorithms and Data Structure, Aug. 2005, pp. 169-181.

[16] C. M. Bender, M. A. Bender, E. Demaine, and S. Fekete, “What is the optimal shape of a city?,” Journal of Physics

A: Mathematical and General, vol. 37, 2004, pp. 147-159.

[17] L. Benini, G. De Micheli, “Networks on chip: a new paradigm for systems on chip design,” Proc. DATE, 2002, pp.

418-419.

[18] D. Bertozzi, and A. Jalabert, “NoC synthesis flow for customized domain specific multiprocessor systems-on-

chip,” IEEE Trans. Parallel Distrib. Syst., vol. 16, no. 2, Feb. 2005, pp. 113-129.


[19] S. Bertozzi, et al., “Supporting task migration in multi-processor systems-on-chip: a feasibility study,” Proc.

DATE, 2006, pp. 1-6.

[20] P. Bhojwani, et al., “A heuristic for peak power constrained design of network-on-chip (NoC) based multimode

systems,” Proc. VLSI Design, Jan. 2005, pp. 124-129.

[21] C. M. Bishop, Pattern Recognition and machine learning (Information Science and Statistics), 2006.

[22] R. Bitirgen, E. İpek, and J. F. Martínez, “Coordinated management of multiple resources in chip multiprocessors:

A machine learning approach,” Intl. Symp. on Microarchitecture, Nov. 2008, pp. 318-329.

[23] S. Borkar, “Thousand core chips: a technology perspective,” in Proc. DAC, 2007, pp. 746-749.

[24] R. Burke, “The Wasabi Personal Shopper: a case-based recommender system,” Proc. Artificial intelligence States,

July 1999, pp. 844-849.

[25] I. V. Cadez, et al., “Model-based clustering and visualization of navigation patterns on a web site,” Data Mining

and Knowledge Discovery, 2003, pp. 399-424.

[26] I. V. Cadez, S. Gaffney, P. Smyth, “A general probabilistic framework for clustering individuals and objects,”

Proc. on Knowledge Discovery and Data Mining, Aug. 2000, pp. 140-149.

[27] E. Carvalho, N. Calazans, F. Moraes, “Heuristics for Dynamic Task Mapping in NoC-based Heterogeneous

MPSoCs,” IEEE/IFIP Workshop on Rapid System Prototyping, Porto Alegre, Brazil, May 2007, pp. 34-40.

[28] F. Catthoor, et al., “How can system-level design solve the interconnect technology scaling problem?,” Proc.

DATE, 2004, pp. 332-337.

[29] C. Chang and P. Mohapatra, “Improving performance of mesh connected multicomputers by reducing

fragmentation,” Journal of Parallel and Distributed Computing, vol. 52, no. 1, 1998, pp. 40-68.

[30] J.-M. Chang and M. Pedram, “Codex-dp: co-design of communicating systems using dynamic programming,”

IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. (TCAD), vol. 19, July 2000, pp. 732-744.

[31] K.S. Chathak, K. Srinivasan, G. Konjevod, “Automated techniques for mapping of application-specific network-

on-chip architectures,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. (TCAD), Aug. 2008, pp. 1425-

1438.

[32] S. Chatterjee, Z. Wei, A. Mishchenko, R. Brayton,“A linear time algorithm for optimum tree placement,” Intl.

Workshop on Logic and Synthesis, 2007.

[33] Y.-K. Chen, et al., “Convergence of recognition, mining, and synthesis workloads and its implications,” Proc. of

IEEE, 2008, pp. 790-807.

[34] P. Chen and K. Keutzer, “Towards true crosstalk noise analysis,” Proc. ICCAD, 1999, pp. 132-138.

[35] C.-L. Chou, R. Marculescu, “Incremental run-time application mapping for homogeneous NoCs with multiple

voltage levels,” Proc. Hardware/Software Codesign and System Synthesis (CODES+ISSS), Oct. 2007, pp. 161-

166.

[36] C.-L. Chou, R. Marculescu, “User-aware dynamic task allocation in Networks-on-Chip,” Proc. DATE, 2008, pp.

1232-1237.


[37] C.-L. Chou, U. Y. Ogras, R. Marculescu, “Energy- and performance-aware incremental mapping for Networks-

on-Chip with multiple voltage levels” IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems

(TCAD), vol. 27, no. 10, Oct. 2008, pp. 1866-1879.

[38] C.-L. Chou, R. Marculescu, “Run-time task allocation considering user behavior in embedded multiprocessor

Networks-on-Chip,” IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems (TCAD), vol. 19,

no. 1, Jan. 2010, pp. 78-91.

[39] C.-L. Chou, R. Marculescu, “Contention-aware application mapping for Network-on-Chip communication

architectures,” Proc. ICCD, Oct. 2008, pp. 164-169.

[40] C.-L. Chou, R. Marculescu, “User-centric design space exploration for heterogeneous Network-on-Chip

platforms,” Proc.DATE, April 2009, pp. 15-20.

[41] C.-L. Chou, R. Marculescu, “Designing heterogeneous embedded Network-on-Chip with users in mind,” to

appear, IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 2010.

[42] C.-L. Chou, R. Marculescu, “Fault-tolerant algorithms for run-time resource allocation in many core systems,”

Proc. Semiconductor Research Corporation (SRC), TECHCON, 2009.

[43] C.-L. Chou, A. M. Miron, R. Marculescu, “Find your flow: Increasing flow experience by designing ‘Human’

Embedded Systems,” to appear, Proc. DAC, 2010.

[44] H. Cook, K. Skadron, “Predictive design space exploration using genetically programmed response surfaces,”

Proc. DAC, 2008, pp. 960-965.

[45] M. Csikszentmihalyi, “Flow: The psychology of optimal experience,” New York: Harper and Row, 1990.

[46] W. J. Dally and C. L. Seitz, “Deadlock-free message routing in multiprocessor interconnection networks,” IEEE

Trans. on Computer, 1987, pp. 547-553.

[47] W. J. Dally, B. Towles, “Route packets, not wires: on-chip interconnection network,” Proc. DAC, 2001, pp. 684-

689.

[48] C. Darwin, On the origin of species by means of natural selection, or the preservation of favoured races in the

struggle for life,1859: ISBN 0-451-52906-5.

[49] S. Das, et al, “RazorII: In situ error detection and correction for PVT and SER tolerance,” IEEE Journal of Solid-

State Circuits, Jan. 2009, pp. 32-48.

[50] R. Dick, “Embedded system synthesis benchmarks suites (E3S),” http://ziyang.eecs.umich.edu/~dickrp/e3s/

[51] R. P. Dick, N. K. Jha, “MOGAC: A multiobjective genetic algorithm for hardware-software cosynthesis of

distributed embedded systems,” IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems

(TCAD), 1998, pp. 920-935.

[52] K. Diefendorff, and P. K. Dubey, “How multimedia workloads will change processor design,” Computer 30, 9,

Sept. 1997, pp. 43-45.

[53] T. Dumitras, S. Kerner, R. Marculescu, “Towards on-chip fault-tolerant communication,” Proc. ASP-DAC, 2003,

pp. 225-232.

[54] K. Eason, Information Technology and Organizational Change, 1st ed., Taylor & Francis, Inc., 1989.


[55] S. A. Edwards, “What do we do with 10^12 transistors? The case for precision timing,” DSRC TeraChip Workshop,

2008.

[56] Bradley Efron, “Estimating the error rate of a prediction rule: improvement on cross-validation”, Journal of the

American Statistical Association, vol. 78, no. 382, 1983, pp. 316-331.

[57] Z. Feng, et al., Floorplan representation in VLSI, handbook of DATA structures and applications, by D.P. Mehta

and S. Sahni, Chapman and Hall, 2004, pp. 53-1: 53-29.

[58] C. Ferdinand, R. Wilhelm, “On predicting data cache behavior for real-time systems,” Proc. of the ACM Workshop

on Languages, Compilers, and Tools for Embedded Systems, 1998, pp.16-30.

[59] S. Fuller, RapidIO: The Embedded System Interconnect. ISBN: 0470092912.

[60] M. Geilen, T. Basten, “A calculator for pareto points,” Proc. DATE, 2007, pp. 16-20.

[61] S. V. Gheorghita, et al., “Automatic scenario detection for improved WCET estimation,” Proc. DAC, 2005, pp.

101-104.

[62] S. V. Gheorghita, T. Basten, H. Corporaal, “Application scenarios in streaming-oriented embedded-system

design,” IEEE Design & Test of Computers, vol. 25, no. 6, 2008, pp.581-589.

[63] S. V. Gheorghita, et al., “System-scenario based design of dynamic embedded system,” ACM Trans. on Design

Automation of Electronic Systems (TODAES), vol. 14, no. 1, Jan. 2009.

[64] M. Gomaa, et al., “Transient-fault recovery for chip multiprocessors,” Proc. ISCA, 2003, pp. 98-109.

[65] C. Grecu, et al., “Essential fault-tolerance metrics for NoC infrastructures,” On-Line Testing Symposium, 2007, pp.

37-42.

[66] M. Gries, “Methods for evaluating and covering the design space during early design development,” Integr. VLSI

Journal, 2004, pp. 131-183.

[67] Rebecca E. Grinter, “Systems architecture: product designing and social engineering,” ACM SIGSOFT Software

Engineering Notes, vol. 24, no. 2, 1999, pp.11-18.

[68] A. Gupta, B. Lin, P. A. Dinda, “Measuring and understanding user comfort with resource borrowing,” Proc. High

Performance Distributed Computing, June 2004, pp. 214-224.

[69] Mor Harchol-Balter, “The effect of heavy-tailed job size distributions on computer system design,” Proc. of the

ASA-IMS Conf. on Applications of Heavy Tailed Distributions in Economics, June 1999.

[70] T. Henderson and S. Bhatti, “Modelling user behaviour in networked games,” Proc of Intl. Conf. on Multimedia,

Sept. 2001, pp. 212-220.

[71] A. Hergenhan, W. Rosenstiel, “Static timing analysis of embedded software on advanced processor architectures,”

Proc. DATE, 2000, pp. 552-559.

[72] Y. Hoskote, “A 5-GHz Mesh Interconnect for a Teraflops Processor,” IEEE Micro, vol. 27, no. 5, Sept./Oct., 2007,

pp. 51-61.

[73] http://lava.cs.virginia.edu/HotSpot/

[74] http://src.alionscience.com/

[75] http://www.arm.com/products/solutions/axi_spec.html


[76] http://www.sematech.org/docubase/document/3955axfr.pdf

[77] http://www.wavecom.com/.

[78] Y. Hu, “Physical synthesis of energy-efficient networks-on-chip through topology exploration and wire style

optimization,” Proc. ICCD, 2005, pp. 111-118.

[79] J. Hu, R. Marculescu, “Energy- and performance-aware mapping for regular NoC architectures,” IEEE Trans. on

Computer-Aided Design of Integrated Circuits and Systems (TCAD), vol. 24, Apr. 2005, pp. 551-562.

[80] J. Hu, R. Marculescu, “Application-specific buffer space allocation for networks-on-chip router design,” Proc.

ICCAD, 2004, pp 354-361.

[81] J. Hu, R. Marculescu, “Energy-aware communication and task scheduling for network-on-chip architectures under

real-time constraints,” Proc. DATE, 2004, pp. 234-239.

[82] L. Huang, F. Yuan, and Q. Xu, “Lifetime reliability-aware task allocation and scheduling for MPSoC platforms,”

Proc. DATE, 2009, pp. 51-56.

[83] Intel Media processor CE 3100 [online] http://download.intel.com/design/celect/downloads/ce3100-product-

brief.pdf

[84] P. Ituero, et al., “Leakage-based on-chip thermal sensor for CMOS technology,” IEEE Intl. Symposium on Circuits

and Systems, 2007, pp.3327-3330.

[85] A. Jalabert, et al., “xpipesCompiler: A tool for instantiating application specific networks-on-chip,” Proc. DATE,

2005, pp. 884-889.

[86] N. E. Kang, W. Yoon, “Age- and experience-related user behavior differences in the use of complicated electronic

devices,” Int. J. Hum.-Comput. Stud., vol. 66, no. 6, 2008, pp 425-437.

[87] J. Kao, F. B. Prinz, “Optimal motion planning for deposition in layered manufacturing,” Proc. Design Engineering

Technical Conf., Sept. 1998, pp. 1-10.

[88] R. M. Karp, A. C. McKellar, C. K. Wong, “Near-optimal solutions to a 2-dimensional placement problem,” SIAM

Journal on Computing, vol. 4, 1975, pp. 271-286.

[89] J. Kao, F. B. Prinz, “Optimal motion planning for deposition in layered manufacturing,” Proc. Design Engineering

Technical Conf., Sept. 1998, pp. 13-16.

[90] D. I. Katcher, H. Arakawa, J. K. Strosnider, “Engineering and analysis of fixed priority schedulers,” IEEE Trans.

on Software Engineering, 1993, pp. 920-934.

[91] S. Khan, “Using predictive modeling for cross-program design space exploration in multicore systems,” Proc.

Parallel Architecture and Compilation Techniques, 2007, pp. 327-338.

[92] R. Kohavi, “A study of cross-validation and bootstrap for accuracy estimation and model selection.” Proc. of the

Fourteenth International Joint Conference on Artificial Intelligence, 1995, pp. 1137-1143.

[93] C. Y. Lee, “An algorithm for path connection and its applications,” IRE Trans. Electron Comput., vol. EC-10, Sept.

1961, pp. 346-365.


[94] H. G. Lee, N. Chang, U. Y. Ogras, R. Marculescu, “On-chip communication architecture exploration: A

quantitative evaluation of point-to-point, bus, and network-on-chip approaches,” ACM Trans. on Design

Automation of Electronic Systems (TODAES), vol. 12, no. 3, Aug. 2007.

[95] K. Li and K.-H. Cheng, “A two-dimensional buddy system for dynamic resource allocation in a partitionable mesh

connected system,” Journal of Parallel and Distributed Computing, vol. 12, 1991, pp. 79-83.

[96] M-L Li, et al., “Accurate microarchitecture-level fault modeling for studying hardware faults,” Intl. Conf. on High

Performance Computer Architecture, 2009, pp. 105-116.

[97] B. Lisper, “Fully automatic, parametric worst-case execution time analysis,” Workshop on Worst-Case Execution

Time (WCET) Analysis, 2003, pp. 77-80.

[98] W. Liu, V. Lo, K. Windisch, B. Nitzberg, “Non-contiguous processor allocation algorithms for distributed

memory multicomputers”, Proc. on Supercomputing, 1994, pp. 227-236.

[99] V. Lo, K. Windisch, W. Liu, and B. Nitzberg, “Non-contiguous processor allocation algorithms for mesh-

connected multicomputers,” IEEE Trans. on Parallel and Distributed Computing, vol. 8, no. 7, 1997, pp. 712-726.

[100]D. Lyonnard, et al., “Automatic generation of application-specific architectures for heterogeneous multiprocessor

system-on-chip,” Proc. DAC, 2002, pp. 518-523.

[101]J. Mache, V. Lo, “Dispersal metrics for non-contiguous processor allocation,” Technical Report, University of

Oregon, 1996.

[102]S. Manolache, P. Eles, and Z. Peng, “Fault and energy-aware communication mapping with guaranteed latency

for applications implemented on NoC,”Proc. DAC, 2005, pp. 266-269.

[103]Tom M. Mitchell. Machine Learning, ISBN: 0070428077, McGraw-Hill Science/Engineering/Math, 1997.

[104]S. Murali, G. De Micheli, “Bandwidth-constrained mapping of cores onto NoC architectures,” Proc. DATE, 2004,

pp. 896-901.

[105]S. Murali, et al., “Mapping and configuration methods for multi-use-case networks on chips,” Proc. ASP-DAC,

2006, pp. 146-151.

[106]S. Murali, et al., “A methodology for mapping multiple use-cases onto Networks on Chips,” Proc. DATE, 2006,

pp. 1-6.

[107]S. Murali, and G. De Micheli, G., “SUNMAP: a tool for automatic topology selection and generation for NoCs,”

Proc. DAC, 2004, pp. 914-919.

[108]S. Murali, et al.,“Designing application-specific networks on chips with floorplan information” Proc. ICCAD,

2006, pp. 355-362.

[109]S. Mohanty, et al., “Rapid design space exploration of heterogeneous embedded systems using symbolic search

and multi-granular simulation,” Proc. Joint Conference on Languages, Compilers and Tools For Embedded

Systems: Software and Compilers For Embedded Systems, 2002, pp. 18-27.

[110]A. A. F. Mohammad, R. Rudolf, J. Henkel, “ADAM: Run-time agent-based distributed application mapping for

on-chip communication,” Proc. DAC, 2008, pp. 760-765.


[111]O. Moreira, J. J. Mol, M. Bekooij, “Online resource management in a multiprocessor with a network-on-chip,”

Proc. ACM Symp. on Applied Computing, March 2007, pp.1557-1564.

[112]M. F. Morris, “Kiviat graphs: conventions and figures of merit,” SIGMETRICS Perform. Eval. Rev. 3, vol. 3, Oct.

1974, pp. 2-8.

[113]T. Moscibroda and Onur Mutlu, “A case for bufferless routing in on-chip networks,” Proc. ISCA, 2009, pp. 196-

207.

[114]F. Moya, J.M. Moya, J.C. Lopez, “Evaluation of design space exploration strategies,” Proc. EUROMICRO, pp.

472-476, 1999.

[115] A. Neumaier, “Solving ill-conditioned and singular linear systems: A tutorial on regularization,” SIAM Review 40,

1998, pp. 636-666.

[116]T. Noergaard, Embedded Systems Architecture: A Comprehensive Guide for Engineers and Programmers

(Embedded Technology), Elsevier Science & Technology Books, 2005.

[117]V. Nollet, T. Marescaux, D. Verkerst, “Operating-system controlled network on chip,” Proc. DAC, 2004, pp. 256-

259.

[118]V. Nollet, et al., “Centralized run-time resource management in a network-on-chip containing reconfigurable

hardware tiles,” Proc. DATE, 2005, pp. 234-239.

[119]U. Y. Ogras, R. Marculescu, P. Choudhary, D. Marculescu, “Voltage-frequency island partitioning for GALS-

based networks-on-chip,” Proc. DAC, 2007, pp. 110-115.

[120]U. Y. Ogras, R. Marculescu, “Analytical router modeling for Networks-on-Chip performance analysis,” Proc.

DATE, 2007, pp. 1-6.

[121]B. Ozisikyilmaz, G. Memik, A. Choudhary, “Efficient system design space exploration using machine learning

techniques,” Proc. DAC, 2008, pp. 966-969.

[122]B. Ozisikyilmaz, G. Memik, and A. Choudhary, “Machine learning models to predict performance of computer

system design alternatives,” Proc. of international Conference on Parallel Processing (ICPP), 2008, pp. 495-502.

[123]O. Ozturk, M. Kandemir, S. W. Son, “An ILP based approach to reducing energy consumption in NoC-based

CMPS,” Proc. International Symposium on Low Power Electronics and Design (ISLPED), 2007, pp. 27- 29.

[124]S. Pace, “A grounded theory of the flow experiences of Web users,” Int. J. Human-Computer Studies, vol. 60,

2004, pp. 327-363.

[125]J. C. Palencia, M. González Harbour, “Schedulability analysis for tasks with static and dynamic offsets,” Proc. of

the Real-Time Systems Symposium, 1998, pp. 26-37.

[126]G. Paliouras, V. Karkaletsis, C. D. Spyropoulos. Machine Learning and Its Applications: Advanced Lectures,

Springer, 2001.

[127]A. M. Pastrnak, P. H. N. de With, S. Stuijk, J. van Meerbergen, “Parallel implementation of arbitrary-shaped

MPEG-4 decoder for multiprocessor Systems,” Proc. Visual Comm. and Image Processing, 2006.

[128]G. Prabhu, D. M. Frohlich, “Innovation for emerging markets: confluence of user, design, business and

technology research,” Proc. Human Computer Interaction, July 2005, pp. 22-27.


[129]Predictive Technology Model (PTM) website; http://www.eas.asu.edu/~ptm

[130]P. Pop, P. Eles, T. Pop, Z. Peng, “An approach to incremental design of distributed embedded systems,” Proc.

DAC, 2001, pp. 450-455.

[131]P. Pop, P. Eles, Z. Peng, “Bus access optimization for distributed embedded systems based on schedulability

analysis,” Proc. DATE, 2001, pp. 567-575.

[132]J. M. Rabaey, D. Burke, K. Lutz, J. Wawrzynek, “Workloads of the Future,” IEEE Design and Test of Computers,

vol. 25, no. 4, 2008, pp. 358-365.

[133]D. Rachovides, M. Perry, “HCI Research in the home: lessons for empirical research and technology

development,” Proc. Human Computer Interaction, vol. 2, Sept. 2006, pp. 11-15.

[134]V. Raghunathan, M. B. Srivastava and R. K. Gupta, “A survey of techniques for energy efficient on-chip

communication,” Proc. DAC, 2003, pp. 900-905.

[135]K. Ramamritham, J. A. Stankovic, P-F Shiah, “Efficient scheduling algorithms for real-time multiprocessor

systems,” IEEE Trans. on Parallel and Distributed Systems, 1990, pp. 184-194.

[136]P. Rantala, et al., “Agent-monitored fault-tolerant Network-on-Chips: concept, hierarchy, and case Study with

FFT Application,” DAC Workshop Digest in Diagnostic Services in Network-on-Chips, April 2008.

[137]C.-E. Rhee, H.-Y. Jeong, S. Ha, “Many-to-many core-switch mapping in 2-D mesh NoC architectures,” Proc.

ICCD, 2004, pp. 438-443.

[138]K. Richter, D. Ziegenbein, M. Jersak, and R.Ernst, “Bottom-up performance analysis of HW/SW platforms,”

Proc. of the IFIP World Computer Congress - Tc10 Stream on Distributed and Parallel Embedded Systems:

Design and Analysis of Distributed Embedded Systems, Aug. 2002, pp. 173-183.

[139]R. Rouse, Game design: theory and practice, Wordware Game Developer's Library, 2001.

[140]G. Sassatelli, et al., “Run-time mapping and communication strategies for Homogeneous NoC-Based MPSoCs,”

Proc IEEE Symposium on Field-Programmable Custom Computing Machines, 2007, pp. 295-296.

[141]B. Schilit, N. Adams, and R. Want, “Context-aware computing applications,” IEEE Workshop on Mobile

Computing Systems and Applications, 1994, pp. 85-90.

[142]T. Schonwald, et al., “Fully adaptive fault-tolerant routing algorithm for Network-on-Chip architecture,” Proc.

Digital System Design Architectures, Methods and Tools, 2007, pp. 527-534.

[143]B. Sethuraman, R. Vemuri, “optiMap: a tool for automated generation of NoC architectures using multi-port

routers for FPGAs,” Proc. DATE, 2006, pp. 947-952.

[144]L. Sha, R. Rajkumar, and S. S. Sathaye, “Generalized rate-monotonic scheduling theory: a framework for

developing real-time systems,” Proc. of the IEEE, vol. 82, no.1, Jan. 1994, pp.68-82.

[145]S. Shamshiri, et al., “A cost analysis framework for multi-core systems with spares,” Proc. Int. Test Conference,

2008, pp. 1-8.

[146]S. Shamshiri and K.-T. Cheng, “Yield and cost analysis of a reliable NoC,” IEEE VLSI Test Symposium, 2009, pp.

173-178.


[147]D.J. Shernoff et al., “Student engagement in high school classrooms from the perspective of flow theory,” School

Psychology Quarterly, 18, 2003, pp. 158-176.

[148]H. Shimazu, “ExpertClerk: Navigating shoppers buying process with the combination of asking and proposing,”

Proc. Joint Conference on Artificial Intelligence, 2001, pp. 1443-1450.

[149]H. Shimazu, “ExpertClerk: a conversational case-based reasoning tool for developing salesclerk agents in e-

commerce webshops,” Artificial Intelligence Review 18(3-4), pp. 223-244.

[150]H. Shojaei, et al., “SPaC: A symbolic pareto calculator,” Proc. CODES+ISSS, 2008, pp. 179-184.

[151]H. Shojaei, et al., “A parameterized compositional multi-dimensional multiple-choice knapsack heuristic for

CMP run-time management,” Proc. DAC, 2009, pp. 917-922.

[152]A. Shye, et al.,“Power to the People: Leveraging Human Physiological Traits to Control Microprocessor

Frequency,” Proc. MICRO, Nov. 2008, pp. 188-199.

[153]A. Shye, et al.,“Learning and Leveraging the Relationship between Architecture-Level Measurements and

Individual User Satisfaction,” Proc. ISCA, June 2008, pp. 427-438.

[154]L. T. Smit, et al., “Run-time assignment of tasks to multiple heterogeneous processors,” Progress Embedded

Systems Symp., Oct. 2004, pp. 185-192.

[155]L. I. Smith, “A tutorial on principal components analysis”, citeulike:353145, February 26, 2002.

[156]Sonics Integration Architecture. Available [online] http://www.sonicsinc.com

[157]H. Spencer, The Principles of Sociology, 1897, New York: D. Appleton.

[158]K. Srinivasan, et al., “An automated technique for topology and route generation of application Specific on-chip

interconnection networks,” Proc. ICCAD, 2005, pp. 231-237.

[159] K. Srinivasan, K. S. Chatha, “A technique for low energy mapping and routing in network-on-chip architectures,”

Proc. International Symposium on Low Power Electronics and Design (ISLPED), 2005, pp. 387-392.

[160]STMicroelectronics STBus Interconnect [online] http://www.st.com/stonline/products/technologies/soc/

stbus.htm

[161]T. T. Suen and J. S. Wong, “Efficient task migration algorithm for distributed systems,” IEEE Trans. Parallel

Distrib. Syst. vol. 3, 1992, pp. 488-499.

[162]Task graphs for free (TGFF v3.0) Keith Vallerio, 2003. http://ziyang.eecs.umich.edu/~dickrp/tgff/.

[163]A. Vázquez, et al., “Modeling bursts and heavy tails in human dynamics,” Physical Review E73, 036127, 2006.

[164]D. Wentzlaff, et al., “On-Chip Interconnection Architecture of the Tile Processor,” IEEE MICRO, vol. 27, no. 5,

2005, pp. 15-31.

[165]K. Windisch and V. Lo, “Contiguous and non-contiguous processor allocation algorithms for k-ary n-cubes”,

Proc. Intl. Conference on Parallel Processing, 1995, pp. 164-168.

[166]F. Wolf, R. Ernst, “Execution cost interval refinement in static software analysis,” J. Syst. Archit. 47, 3-4, Apr.

2001, pp. 339-356.

[167]R. A. Wright, and J. W. Brehm, “Energization and goal attractiveness,” In L.A. Pervin (Ed.), Goal concepts in

personality and social psychology, 1989, pp. 169-210, Hillsdale, NJ: Erlbaum.


[168]S. Yan and Bill Lin, “Application-specific network-on-chip architectures synthesis based on set partitions and

Steiner trees”, Proc. ASPDAC, 2008, pp. 277-282.

[169]P. Yang, et al.,“Managing dynamic concurrent tasks in embedded real-time multimedia systems,” Proc. of the

Symposium on System Synthesis (ISSS), 2002, pp. 112-119.

[170]T. T. Ye, L. Benini, and G. De Micheli, “Analysis of power consumption on switch fabrics in network routers,”

Proc. DAC, 2002, pp. 524-529.

[171]N.-E. Zergainoh, A. Baghdadi, A. Jerraya, “Hardware/software codesign of on-chip communication architecture

for application-specific multiprocessor system-on-chip,” Int. J. Embedded Systems, vol. 1, 2005, pp. 112-124.


APPENDIX A. MACHINE LEARNING TECHNIQUES SURVEY

FOR USER-CENTRIC DESIGN

Generally speaking, machine learning is the study of algorithms that allow machines/computers/systems to learn from experience (i.e., collected training data) in such a manner that their expected performance later improves [103]. In recent years, machine learning has made its way from artificial intelligence into areas of administration, commerce, and industry. In addition, it has become the preferred approach for speech recognition, computer vision, medical analysis, robot control, computational biology, sensor networks, etc. [126]. More recently, for general systems design, Ozisikyilmaz et al. applied linear regression and neural network methods on small portions of data obtained through cycle-accurate simulations to predict the performance of the entire design space [121][122]. Bitirgen et al. applied a neural network approach to multiple shared chip multiprocessor resources to enforce higher-level performance objectives [22].

In this appendix, we study how machine learning algorithms can help user-centric embedded system design. As mentioned in Section 1.4.2, five types of problems, i.e., classification, similarity, clustering, regression, and reinforcement learning problems (see “*” in Figure 1.8), from the user-centric design flow can be solved using machine learning techniques. One example is shown in the case study of the NoC embedded system (see Figure 4.1), where we explore the k-means clustering method for classifying user traces. Here, we first explain these five types of problems (see Figure A.1(a)) and their applications. Later, several machine learning techniques will be investigated to solve these problems (see Figure A.1(b)).


• i) Classification: Given a data-set X = {x1, x2, ..., xi} and the corresponding discrete class-set Y = {y1, y2, ..., yj}, the classification problem is to classify the new data xnew into one class in Y. An example is medical diagnosis (i.e., diagnosing whether a patient has cancer or not, given medical reports from many patients). In our case of user-centric embedded design for Figure 4.1, we can identify which class a new user/customer belongs to and which generated platform is suitable for him/her.

• ii) Similarity: Given a data-set X = {x1, x2, ..., xi}, the similarity problem is to find data in this set that are similar with respect to a given feature. Finding similar images via Google, listing similar products on Amazon, etc. belong to this category. As an example, in our case, it is highly recommended to figure out the similarity between users while they interact with the system.

[Figure A.1 (a) Five types of problems for user-centric design: i) classification (from data to discrete classes), ii) regression (predicting a numeric value), iii) similarity (finding data), iv) clustering (discovering structure in data), v) reinforcement learning (training by feedback). (b) Selected machine learning approaches: Naïve Bayes (NB), Support Vector Machine (SVM), K-nearest neighbor, Logistic regression, Decision tree, K-means, Neural Networks, Q-learning, Hidden Markov Model (HMM), Gaussian Mixed Model (GMM), Bayesian Networks.]

• iii) Clustering: Given a data-set X = {x1, x2, ..., xi}, the clustering problem is to assign similar data to the same group/cluster, that is, to discover the structure in the data. As an example from sequence analysis in computational biology, clustering is used to group homologous sequences into gene families. For our case study in Chapter 4, we explore the k-means clustering method for grouping user traces according to some similarity coefficients (see the sketch after this list).

• iv) Regression: The regression problem is to predict a numeric value from previous data, as well as the potential trend. Examples include the stock market and the temperature prediction for the following days. For user-centric design, we can apply it in off-line DSE for predicting the performance of the entire design space from small portions of data [121][122], and in on-line optimization for predicting the application execution time for a certain user with the goal of improving the system performance [22][36].

• v) Reinforcement learning: The reinforcement learning problem is to allow the machine/agent to learn its behavior based on feedback from the environment. This behavior can be learned once, or it can keep adapting as time goes by. It is widely used in robot design, e.g., robot navigation, where collision avoidance behavior can be learned through negative feedback from bumping into obstacles. In terms of user-centric design, the system is suggested to update its strategies dynamically based on the users’ feedback [152][153].
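As a concrete illustration of problem type iii), the following minimal sketch clusters user traces with k-means; the feature encoding (per-application usage frequencies), the trace count, and the number of clusters are illustrative assumptions, not the setup of Chapter 4:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
traces = rng.random((40, 7))  # 40 users x 7 applications: assumed usage features

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(traces)
print(km.labels_)                 # cluster (user class) assigned to each trace
print(km.cluster_centers_.shape)  # (3, 7): one centroid per user class
```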


Now, we propose to use some popular machine learning approaches [21] to solve these five problems, as shown in Figure A.1(b). An arrow from problem A to method B in Figure A.1 implies that problem A has been solved using method B in the literature. However, even when applying those methods, there still exist various challenges in solving these problems; we mention a few as follows. Techniques such as cross-validation [56] and regularization [115] have been proposed in order to avoid model over-fitting, i.e., the case where the model is more accurate on the training dataset but less accurate in predicting new data or an unseen testing dataset. Principal component analysis (PCA) [155] is widely used to reduce the problem dimensionality, i.e., to derive a smaller number of artificial variables (called principal components) from a number of observed variables without much loss of information. The bootstrapping approach is suggested when the training dataset is scarce, i.e., when the model has too many variables and there are not enough participants/observations [92].
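For PCA specifically, a minimal sketch of the dimensionality reduction mentioned above (the data are synthetic and the choice of two principal components is an arbitrary assumption):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.random((100, 20))  # 100 observations of 20 observed variables

pca = PCA(n_components=2).fit(X)
print(pca.transform(X).shape)         # (100, 2): the artificial variables
print(pca.explained_variance_ratio_)  # information retained per component
```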


APPENDIX B. ILP-BASED CONTENTION-AWARE APPLICATION MAPPING

B.1. Introduction

In this appendix, we analyze the impact of network contention on the application mapping for 2D mesh-based NoC architectures. Our main theoretical contribution consists of an integer linear programming (ILP) formulation of the contention-aware off-line application mapping problem, which aims at minimizing the type of network contention that most affects the system performance [39].

Previous work attempts to minimize the communication energy consumption

[79][123][137][159]. However, the communication energy consumption is a good indicator of

latency only if there is no congestion in the network. Indeed, in the absence of congestion,

packets are injected/transmitted through the network as soon as they are generated and then

latency can be estimated by counting the number of hops from source to destination.

Compared to previous work, our focus in this appendix is on the network contention problem;

this highly affects the latency, throughput, and communication energy consumption. We show

that, by mitigating the network contention, the packet latency can be significantly reduced;

this means that the network can support more traffic which directly translates into significant

throughput improvements.

B.2. Preliminaries

As mentioned in Section 2.2, the off-line analysis of the target application has already been performed. To better explain the off-line application mapping, we first need to introduce the following definitions:


• A Logical Application Characterization Graph (LACG) = (V, E) is a weighted directed graph (see Figure B.1(a)). Each vertex $v_i \in V$ represents a core which will later be allocated to one specific processing resource. Each directed edge $e_{ij} = (v_i, v_j) \in E$ represents the communication from core vi to core vj. The weight comm(eij), or $comm_{v_i, v_j}$, stands for the communication rate (bits) from core vi to vj within each period, while bw(eij), or $bw_{v_i, v_j}$, stands for the required bandwidth for the communication from vi to vj.

• A Physical Application Characterization Graph (PACG) = (R, P) is a directed graph (see Figure B.1(b)), where each vertex $r = r(v_i) \in R$ represents a resource which gets assigned a cluster of tasks vi, and each directed edge pij represents the routing path from resource ri to resource rj. We denote by L(pij), or L(ri, rj), the set of links within routers that make up the path pij from ri to rj, where |L(pij)| is the size of that set, i.e., the number of links making up pij.

Figure B.1 (a) Logical and (b) physical application characterization graph. (c) One core mapping example.


A mapping function map( ) maps the cores in the LACG to the resources in the NoC; under a given routing mechanism, this results in the PACG.

Figure B.1(c) shows the mapping result of the LACG under the deterministic XY routing: cores v1, v2, v3, and v4 are mapped onto resources r4, r5, r6, and r3, respectively, and L(p45) = {l1}, L(p46) = {l1, l3}, L(p53) = {l3, l6}, L(p36) = {l5}, and L(p56) = {l3}. Note that, following the types of network contention defined in Section 6.3.1, source-based contention occurs in this case since L(p45) ∩ L(p46) = {l1} ≠ ∅, destination-based contention occurs since L(p46) ∩ L(p56) = {l3} ≠ ∅, and path-based contention occurs since L(p46) ∩ L(p53) = {l3} ≠ ∅. Motivated by the significant impact of path-based contention (see the discussion of Figure 6.3), in what follows we summarize our ILP-based contention-aware mapping with path-based contention minimized.
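To make the link sets and the set-intersection tests concrete, the following sketch computes L(p) under deterministic XY routing and checks for shared links; the router coordinates and link naming are our illustrative assumptions, not the exact layout of Figure B.1(c):

```python
def xy_path_links(src, dst):
    """Set of directed links on the XY route from router src=(x, y) to dst."""
    links, (x, y) = set(), src
    while x != dst[0]:                   # route along the X dimension first
        nx = x + (1 if dst[0] > x else -1)
        links.add(((x, y), (nx, y)))
        x = nx
    while y != dst[1]:                   # then along the Y dimension
        ny = y + (1 if dst[1] > y else -1)
        links.add(((x, y), (x, ny)))
        y = ny
    return links

# Assumed coordinates for resources r4, r5, r6, r3 on a small mesh
p45 = xy_path_links((0, 1), (1, 1))  # r4 -> r5
p46 = xy_path_links((0, 1), (2, 1))  # r4 -> r6
p53 = xy_path_links((1, 1), (2, 0))  # r5 -> r3

print(p45 & p46)  # non-empty: source-based contention (same source r4)
print(p46 & p53)  # non-empty: path-based contention (disjoint endpoints)
```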

B.3. Problem Formulation

Given the application characteristics and the NoC architecture, our objective is to map the IP cores onto the NoC resources such that the sum of the weighted communication distance and the path-based network contention is minimized under a given routing mechanism. Of note, minimizing the weighted communication distance directly contributes to minimizing the communication energy consumption as well. More formally:

Given the LACG of the application, the routing mechanism, and the NoC architecture,

Find a mapping function map( ) from LACG = (V, E) to PACG = (R, P) which minimizes:

$$\min \left\{ \frac{1-\alpha}{\beta} \times \sum_{\forall e_{ij} \in E} \left[ comm(e_{ij}) \times \left| L(map(e_{ij})) \right| \right] + \frac{\alpha}{\gamma} \times \sum \left| L(map(e_{ij})) \cap L(map(e_{kl})) \right| \right\}, \quad \text{for } i \neq k \text{ and } j \neq l \tag{B.1}$$

such that:

210

(B.2)

(B.3)

(B.4)

where if and Bk is the capacity for link lk

Since the communication distance and the path-based contention count have different units, the normalization of these two metrics is approximated by assuming a worst-case scenario. More precisely, β is set to $\left(\sum_{\forall e_{ij} \in E} comm(e_{ij})\right) \times \left(2 \times (N-1)\right)$ for an N × N NoC platform, where the second factor, $2 \times (N-1)$, is the longest distance in the network. γ is set to the average number of path-based contentions over a reasonable number of random mapping configurations. α is a weighting coefficient meant to balance the communication distance and the contention count. More precisely, we set α as the ratio of "the number of cores" to "the number of resources + 1" (i.e., α = |V|/(|R| + 1)). If the number of cores is much smaller than the number of resources (i.e., α is small), the first term in (B.1) has a higher weight, in order to avoid a higher communication distance. Equation B.2 and Equation B.3 basically mean that each core should be mapped to exactly one resource and no resource can host more than one core. Finally, Equation B.4 guarantees that the load of each link will not exceed its bandwidth.
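As a quick illustration of these coefficients, the following short Python sketch (our own; the traffic values in lacg_comm and the random-mapping contention sample are purely illustrative placeholders) computes α and β for a small instance and estimates γ from a sample of random mappings.

import math

def alpha_beta(lacg_comm, num_resources):
    """alpha = |V| / (|R| + 1); beta = total traffic x longest mesh distance."""
    cores = {v for edge in lacg_comm for v in edge}
    alpha = len(cores) / (num_resources + 1)
    n = math.isqrt(num_resources)              # N for an N x N platform
    beta = sum(lacg_comm.values()) * (2 * (n - 1))
    return alpha, beta

def gamma_estimate(contention_counts):
    """gamma: average path-based contention count of random mappings."""
    return sum(contention_counts) / len(contention_counts)

# Hypothetical traffic (bits per period) for the LACG of Figure B.1(a).
lacg_comm = {("v1", "v2"): 64, ("v1", "v3"): 96,
             ("v2", "v4"): 64, ("v4", "v3"): 96, ("v2", "v3"): 32}
alpha, beta = alpha_beta(lacg_comm, 9)         # 4 cores on a 3 x 3 NoC
gamma = gamma_estimate([3, 5, 4, 4])           # counts from 4 random mappings
print(alpha, beta, gamma)                      # 0.4, 352 * 4 = 1408, 4.0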

B.4. ILP-based Contention-aware Mapping Approach

B.4.1. Parameters and Variables

The given parameters are as follows:

• $MD_{r_s r_t}$ stands for the Manhattan Distance from resource $r_s$ to resource $r_t$.


• The NoC architecture consists of |K| uni-directional segment links with IDs $\{l_1, l_2, \ldots, l_{|K|}\}$.

• For each link $l_k$, where $k = 1 \ldots |K|$, $l_k^{r_s r_t}$ represents whether or not link $l_k$ is part of the routing path from resource $r_s$ to resource $r_t$, i.e., $l_k^{r_s r_t} = 1$ if $l_k \in L(r_s, r_t)$, and 0 otherwise. Of note, the above parameters are known under a given NoC architecture with a fixed routing mechanism.

The variables of interest are as follows:

• $m_{v_i}^{r_s}$ shows the mapping result and can only take values in {0, 1}. More precisely, this variable is set to 1 if core $v_i$ is mapped onto resource $r_s$.

• $p_{v_i v_j}^{r_s r_t}$ shows the communication path result and can only take values in {0, 1}. This variable is set to 1 if the communication is carried over the path from resource $r_s$ to resource $r_t$, where cores $v_i$ and $v_j$ are mapped onto $r_s$ and $r_t$, respectively.

• $z_{l_k}(v_i v_j v_m v_n, r_s r_t r_p r_q)$ shows the path-based contention and can only take values in {0, 1}. This variable is set to 1 when the cores $v_i$, $v_j$, $v_m$, and $v_n$ are mapped onto resources $r_s$, $r_t$, $r_p$, and $r_q$, respectively, and, at the same time, the communication path from resource $r_s$ to resource $r_t$ shares link $l_k$ with the path from resource $r_p$ to resource $r_q$.
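For readers who prefer code, here is a hedged sketch (not the thesis implementation) of how these decision variables could be declared with the open-source PuLP modeler; the instance data (cores, tiles, edges) are illustrative placeholders. Note that the index space of $z$ grows with every (edge pair, resource 4-tuple, link) combination, which makes the ILP large for big instances.

import itertools
import pulp

# Illustrative instance: 4 cores on a 3 x 3 platform (placeholders, not thesis data).
cores = ["v1", "v2", "v3", "v4"]
tiles = [f"r{k}" for k in range(1, 10)]
edges = [("v1", "v2"), ("v1", "v3"), ("v2", "v4"), ("v4", "v3"), ("v2", "v3")]

# m[v, r] = 1 iff core v is mapped onto resource r (0/1 per Equation B.8).
m = pulp.LpVariable.dicts("m", [(v, r) for v in cores for r in tiles], cat="Binary")

# p[(vi, vj), (rs, rt)] = 1 iff edge (vi, vj) becomes the path rs -> rt (B.11).
p = pulp.LpVariable.dicts(
    "p",
    [(e, rt) for e in edges for rt in itertools.permutations(tiles, 2)],
    cat="Binary")

print(len(m), len(p))   # 36 mapping variables, 5 * 72 = 360 path variables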

B.4.2. Objective Function

Our objective is to minimize both the weighted communication distance and the path-based network contention count, i.e.,

$$\min\left\{ \frac{1-\alpha}{\beta} \times \sum_{\forall (v_i, v_j) \in E} \left( comm_{v_i,v_j} \times \sum_{\forall r_s, r_t \in R} MD_{r_s r_t} \times p_{v_i v_j}^{r_s r_t} \right) + \frac{\alpha}{\gamma} \times \sum_{\substack{\forall (v_i, v_j), (v_m, v_n) \in E \\ \forall r_s, r_t, r_p, r_q \in R \\ \forall l_k}} z_{l_k}(v_i v_j v_m v_n, r_s r_t r_p r_q) \right\} \qquad \text{(B.5)}$$

B.4.3. Constraints

The following constraints are used:

• One-to-one core-to-resource mapping: Each resource cannot accept more than one core (see Equation B.6). Each core should be mapped onto exactly one resource (see Equation B.7). Equation B.8 makes sure that the variables $m_{v_i}^{r_s}$ take values of either 0 or 1.

$$\forall r_s \in R, \quad \sum_{\forall v_i \in V} m_{v_i}^{r_s} \leq 1 \qquad \text{(B.6)}$$

$$\forall v_i \in V, \quad \sum_{\forall r_s \in R} m_{v_i}^{r_s} = 1 \qquad \text{(B.7)}$$

$$\forall v_i \in V, r_s \in R, \quad 0 \leq m_{v_i}^{r_s} \leq 1 \qquad \text{(B.8)}$$

• Communication path: Any two communicating cores that belong to two different

resources make up a path. Therefore,

$$\forall (v_i, v_j) \in E, \quad p_{v_i v_j}^{r_s r_t} = \begin{cases} 1, & \text{if } \left(m_{v_i}^{r_s} = 1\right) \text{ and } \left(m_{v_j}^{r_t} = 1\right) \\ 0, & \text{otherwise} \end{cases} \qquad \text{(B.9)}$$

To transform Equation B.9 into an ILP formulation, we impose the following constraints:

$$m_{v_i}^{r_s} + m_{v_j}^{r_t} - 1 \leq p_{v_i v_j}^{r_s r_t} \leq \frac{m_{v_i}^{r_s} + m_{v_j}^{r_t}}{2} \qquad \text{(B.10)}$$

$$0 \leq p_{v_i v_j}^{r_s r_t} \leq 1 \qquad \text{(B.11)}$$

• Bandwidth constraint on each link: For each k, the total bandwidth requirement of all paths through link $l_k$ cannot exceed its bandwidth $B_k$.


$$\sum_{\forall (v_i, v_j) \in E} \ \sum_{\forall r_s, r_t \in R} bw_{v_i,v_j} \times l_k^{r_s r_t} \times p_{v_i v_j}^{r_s r_t} \leq B_k \qquad \text{(B.12)}$$

• Path-based network contention count: This type of contention occurs when two paths with different sources and different destinations contend for the same link. Therefore,

$$\forall \left(l_k \in L(r_s, r_t)\right) \text{ and } \left(l_k \in L(r_p, r_q)\right), \ \forall (v_i, v_j), (v_m, v_n) \in E \text{ with } i \neq m \text{ and } j \neq n, \ \forall r_s, r_t, r_p, r_q \in R: \quad z_{l_k}(v_i v_j v_m v_n, r_s r_t r_p r_q) = 1 \qquad \text{(B.13)}$$

$$\text{if } m_{v_i}^{r_s} = m_{v_j}^{r_t} = m_{v_m}^{r_p} = m_{v_n}^{r_q} = 1 \qquad \text{(B.14)}$$

To transform Equation B.14 into an ILP formulation, we impose the following constraints:

$$p_{v_i v_j}^{r_s r_t} + p_{v_m v_n}^{r_p r_q} + l_k^{r_s r_t} + l_k^{r_p r_q} - 3 \leq z_{l_k}(v_i v_j v_m v_n, r_s r_t r_p r_q) \qquad \text{(B.15)}$$

$$0 \leq z_{l_k}(v_i v_j v_m v_n, r_s r_t r_p r_q) \leq 1 \qquad \text{(B.16)}$$

Equation B.15 and Equation B.16 determine whether or not a path-based contention occurs; if so, the corresponding variable $z_{l_k}$ is set to 1.
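Putting the pieces together, the following self-contained PuLP sketch assembles this formulation for a toy 2 × 2 mesh under XY routing. It is a minimal sketch under stated assumptions, not the thesis implementation: the instance data (traffic volumes, link capacity, γ) are invented for illustration, and, since the routing is deterministic, the parameters $l_k^{r_s r_t}$ are folded in by only generating $z$ constraints for links that lie on both candidate paths (so B.15 reduces to $p + p' - 1 \leq z$).

import itertools
import pulp

# Toy instance (illustrative assumptions, not thesis data).
N = 2                                              # 2 x 2 mesh
tiles = [(x, y) for y in range(N) for x in range(N)]
cores = ["v1", "v2", "v3"]
edges = [("v1", "v2"), ("v1", "v3"), ("v2", "v3")]
comm = {("v1", "v2"): 64, ("v1", "v3"): 96, ("v2", "v3"): 32}  # bits/period
bw = comm                                          # required bandwidth per edge
B_k = 256                                          # uniform link capacity

def xy_links(src, dst):
    """Directed links traversed by deterministic XY routing."""
    (x, y), (dx, dy) = src, dst
    out = []
    while x != dx:
        nxt = x + (1 if dx > x else -1); out.append(((x, y), (nxt, y))); x = nxt
    while y != dy:
        nxt = y + (1 if dy > y else -1); out.append(((x, y), (x, nxt))); y = nxt
    return out

routes = {(s, t): xy_links(s, t) for s in tiles for t in tiles if s != t}
links = sorted({l for path in routes.values() for l in path})
MD = {st: len(routes[st]) for st in routes}        # Manhattan distance

alpha = len(cores) / (len(tiles) + 1)              # |V| / (|R| + 1)
beta = sum(comm.values()) * (2 * (N - 1))          # worst-case distance weight
gamma = 4.0                                        # assumed random-mapping average

prob = pulp.LpProblem("contention_aware_mapping", pulp.LpMinimize)
m = pulp.LpVariable.dicts("m", [(v, r) for v in cores for r in tiles], cat="Binary")
p = pulp.LpVariable.dicts("p", [(e, st) for e in edges for st in routes], cat="Binary")

for r in tiles:                                    # (B.6) at most one core per tile
    prob += pulp.lpSum(m[v, r] for v in cores) <= 1
for v in cores:                                    # (B.7) each core placed exactly once
    prob += pulp.lpSum(m[v, r] for r in tiles) == 1

for e in edges:                                    # (B.10) linearized AND for paths
    vi, vj = e
    for st in routes:
        rs, rt = st
        prob += m[vi, rs] + m[vj, rt] - 1 <= p[e, st]
        prob += 2 * p[e, st] <= m[vi, rs] + m[vj, rt]

for lk in links:                                   # (B.12) per-link bandwidth cap
    prob += pulp.lpSum(bw[e] * p[e, st]
                       for e in edges for st in routes if lk in routes[st]) <= B_k

z = []                                             # (B.15)/(B.16) contention flags
for e1, e2 in itertools.combinations(edges, 2):
    if e1[0] == e2[0] or e1[1] == e2[1]:
        continue                                   # skip source/destination-based pairs
    for st1 in routes:
        for st2 in routes:
            for lk in set(routes[st1]) & set(routes[st2]):
                var = pulp.LpVariable(f"z{len(z)}", cat="Binary")
                prob += p[e1, st1] + p[e2, st2] - 1 <= var
                z.append(var)

# (B.5) weighted communication distance plus weighted contention count.
prob += ((1 - alpha) / beta) * pulp.lpSum(comm[e] * MD[st] * p[e, st]
                                          for e in edges for st in routes) \
        + (alpha / gamma) * pulp.lpSum(z)

prob.solve(pulp.PULP_CBC_CMD(msg=0))
placement = {v: r for v in cores for r in tiles if m[v, r].value() == 1}
print(placement, pulp.value(prob.objective))

For realistic problem sizes the number of z variables grows quickly, so stronger solvers or longer run times should be expected.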

B.5. Experimental Results

B.5.1. Experiments using Synthetic Applications

We first evaluate the number of path-based contentions in the application mappings produced for a 4 × 4 NoC platform under three different scenarios: random (or ad-hoc) mapping, energy-aware mapping [79], and our contention-aware mapping. Several sets of synthetic applications are generated using the TGFF package [162]. The number of cores used in this experiment ranges from 12 to 16, while the number of edges varies from 15, 20, ..., to 60 (organized in 10 categories). For each category, we generate 100 random task graphs and the corresponding


results (i.e., number of contentions, communication energy consumption, system throughput)

are calculated.

Figure B.2 shows the number of path-based contentions for these three scenarios, while Table B.1 shows the communication energy ratio and throughput savings of our contention-aware mapping approach, normalized to the results of the energy-aware mapping approach in [79], for selected categories (with the number of edges set to 20, 30, 40, and 50). Of note, the communication energy consumption and the packet latency are measured by a C++ simulator using the bit energy model in [170].

Table B.1 Energy and throughput comparison between the energy-aware mapping in [79] and contention-aware mapping.

# of edges                     20       30       40       50
communication energy ratio     1.02     1.07     1.11     1.08
throughput savings             18.2%    24.1%    21.8%    13.5%

Figure B.2 Path-based contention count in a 4 × 4 NoC comparing the random, energy-aware [79], and contention-aware mappings (x-axis: # of edges in the task graph; y-axis: # of path-based contentions).

As we can see in Figure B.2, the contention-aware mapping effectively reduces the path-based contention. Moreover, the reduction increases as the number of edges scales up. For instance, for task graphs with 50 edges, the number of path-based contentions in the mapping configuration can be reduced from 36 to 5. As observed in Table B.1, under the contention-

aware mapping, the communication energy consumption is up to 11%¹ larger compared to the energy-aware mapping solution; however, the system throughput is improved by around 19.4%, on average. From Figure B.2 and Table B.1, it can be concluded that the contention-aware mapping effectively reduces the path-based contention, which yields a significant system throughput improvement with negligible energy loss.

B.5.2. Experiments using Real Applications

To evaluate the potential of our contention-aware idea on real examples, we apply it to several benchmarks, namely two examples with a high degree of parallelism (Parallel-1 and Parallel-2) [143], LU Decomposition [143], and an MPEG4 decoder [159]. In Table B.2, the first three benchmarks are mapped onto a 3 × 3 NoC platform, while the last one is mapped onto a 4 × 4 NoC platform. The first through fifth columns in Table B.2 show, respectively, the name of the benchmark, the number of cores and edges in the LACG, the communication energy overhead, and the throughput improvement of our contention-aware solution compared to the energy-aware solution [79].

As seen in Table B.2, our contention-aware solution achieves a 17.4% throughput improvement, on average, with a communication energy loss within 9%, compared to the energy-aware solution in [79].

Table B.2 Communication energy overhead and throughput improvement of our contention-aware solution compared to the energy-aware solution [79].

benchmarks           cores   edges   comm. energy overhead   throughput improvement
Parallel-1           9       13      0%                      16.9%
Parallel-2           9       15      8.8%                    20.4%
LU Decomposition     9       11      6.5%                    14.1%
MPEG4 Decoder        12      26      3.6%                    18.2%

1. We note that this is only the communication energy part. If the communication energy consumption is around 20% of the total energy consumption (as shown in [94]), we have only a 2.2% energy loss.


Figure B.3 plots the LACG of the Parallel-1 benchmark (see Figure B.3(a)), the mapping results under two scenarios, i.e., energy-aware mapping [79] using the ILP approach and our contention-aware mapping approach (see Figure B.3(b) and (c), respectively), and the average packet latency comparison for different injection rates under these two scenarios (see Figure B.3(d)). The existing path-based contentions are highlighted in the mapping results. As seen in the energy-aware mapping result in Figure B.3(b), there are two pairs of path-based contentions in the network, while no path-based contention occurs when using the contention-aware approach. We observe that, with such path-based contention, the average latency goes up dramatically after the packet injection rate exceeds a critical point (i.e., the network enters the congestion mode, see Figure B.3(d)). Also, when the contention-aware constraints are taken into consideration during the mapping process, the throughput for Parallel-1 improves from 0.2173 packets/cycle to 0.254 packets/cycle, which represents about a 16.9% throughput improvement.

Figure B.3 (a) Parallel-1 benchmark; (b)(c) mapping results of the energy-aware approach [79] and our contention-aware method, with the path-based contentions highlighted; (d) average packet latency and throughput comparison under these two mapping methods (x-axis: packet injection rate (packet/cycle); y-axis: average packet latency).

B.6. Summary

In this appendix, we have addressed the issue of off-line core-to-resource mapping for NoC-based platforms while explicitly considering network contention minimization. We have reported our results obtained from many experiments involving both synthetic and real benchmarks. Experimental results show that, compared to other existing mapping approaches based on communication energy minimization, our contention-aware mapping technique (with the goal of reducing the path-based network contention) achieves a significant decrease in packet latency (and, implicitly, a throughput increase) with a negligible communication energy overhead.

Although in this appendix we focus on 2-D mesh NoCs with XY routing, our idea can be further adapted to other architectures with different network topologies, as long as deterministic routing schemes are used. Moreover, the idea of minimizing the network contention is not limited to core-to-resource mapping as presented here. Instead, it can be applied to other NoC synthesis problems, as well as to mapping/scheduling heuristics on parallel systems, to achieve further system throughput improvements.