
The Pennsylvania State University

The Graduate School

Department of Computer Science and Engineering

A PARALLEL ARCHITECTURE FOR NON-DETERMINISTIC

DISCRETE EVENT SIMULATION

A Thesis in

Computer Science and Engineering

by

Marc D. Bumble

© 2001 Marc D. Bumble

Submitted in Partial Fulfillment of the Requirements

for the Degree of

Doctor of Philosophy

May 2001


We approve the thesis of Marc D. Bumble.

Date of Signature

Lee D. Coraor
Associate Professor of Computer Science and Engineering
Thesis Adviser
Chair of Committee

Mary Jane Irwin
Professor of Computer Science and Engineering

John J. Metzner
Professor of Computer Science and Engineering

Ageliki Elefteriadou
Associate Professor of Civil Engineering

Dale A. Miller
Professor of Computer Science and Engineering
Head of the Department of Computer Science and Engineering


Abstract

An architecture for a non-deterministic simulation machine is described and presented for the purpose of accelerating the simulation of road traffic. The thesis includes a survey of related work and a description of general architectural methods applied to accelerate non-deterministic parallel event simulation. A study of the traffic simulator CORSIM was undertaken to identify software simulation bottlenecks. Mathematical analysis is used to assist in the decision between running a simulation in an event-driven or time-driven mode. Finally, the details of the simulator architecture are presented. The architecture is divided into event generation, the event queue, the scheduler, and the unifying communications network. The slowest subcomponent is shown to be accelerated with a speedup of 91.


Table of Contents

List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix

List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x

Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv

Chapter 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.1 The Importance of Non-deterministic Simulation . . . . . . . . . . . 4

1.2 Simulation Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

1.3 Opportunities for Acceleration in Discrete Simulation . . . . . . . . . 8

1.3.1 Difficulties faced by Parallel Discrete Event-Driven Simulations 11

1.4 Simulation’s Niche . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

1.5 Microscopic & Macroscopic . . . . . . . . . . . . . . . . . . . . . . . 15

Chapter 2. Traffic Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2.0.1 Reuschel & Pipes . . . . . . . . . . . . . . . . . . . . . . . . . 21

2.0.2 General Motors’ Car-following Model . . . . . . . . . . . . . . 23

2.0.3 Vehicle Deceleration . . . . . . . . . . . . . . . . . . . . . . . 26

2.0.4 Macroscopic Models . . . . . . . . . . . . . . . . . . . . . . . 27

2.0.4.1 Greenshields . . . . . . . . . . . . . . . . . . . . . . 29

2.0.4.2 Greenberg . . . . . . . . . . . . . . . . . . . . . . . 31

Chapter 3. Previous Work Related to Simulation Architectures . . . . . . . . . . 33


3.1 Logic Simulation Machines . . . . . . . . . . . . . . . . . . . . . . . . 34

3.1.1 Boeing Computer Simulator . . . . . . . . . . . . . . . . . . . 35

3.1.2 The IBM Los Gatos Logic Simulation Machine . . . . . . . . 40

3.1.3 Barto and Szygenda’s Hardware Simulator . . . . . . . . . . . 45

3.1.4 Abramovici’s Logic Simulation Machine . . . . . . . . . . . . 50

3.1.5 Levendel, Menon, and Patel’s Logic Simulator . . . . . . . . . 54

3.1.6 Megalogican . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

3.1.7 The IBM Yorktown Simulation Engine . . . . . . . . . . . . . 70

3.1.8 HAL: A Block Level Logic Simulator . . . . . . . . . . . . . . 77

3.1.9 MARS: Micro-Programmable Accelerator for Rapid Simulation . . 87

3.1.10 Reconfigurable Machine . . . . . . . . . . . . . . . . . . . . . 93

3.1.11 Bauer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

3.2 Accelerator & General Purpose Machine . . . . . . . . . . . . . . . . 100

3.2.1 Splash . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

3.2.2 The ArMen . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

3.3 Optimistic Processing . . . . . . . . . . . . . . . . . . . . . . . . . . 112

3.4 Non-Deterministic Simulation . . . . . . . . . . . . . . . . . . . . . . 114

3.4.1 Hoogland, Spaa, Selman, and Compagner . . . . . . . . . . . 117

3.4.2 Monaghan & Pearson, Richardson, and Toussant . . . . . . . 117

3.5 Reduction Buses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

3.5.1 Parallel Reduction Network . . . . . . . . . . . . . . . . . . . 119

Chapter 4. Software Traffic Simulation . . . . . . . . . . . . . . . . . . . . . . . 123


4.1 Event Generation & Queue . . . . . . . . . . . . . . . . . . . . . . . 124

4.1.1 Event Generation Software . . . . . . . . . . . . . . . . . . . 124

4.1.2 Event Queue Software . . . . . . . . . . . . . . . . . . . . . . 127

4.2 CORSIM: An Established Software Simulator . . . . . . . . . . . . . 130

4.2.1 CORSIM Function Categories . . . . . . . . . . . . . . . . . . 131

4.2.2 NT versus Linux . . . . . . . . . . . . . . . . . . . . . . . . . 131

4.2.3 CORSIM Profile . . . . . . . . . . . . . . . . . . . . . . . . . 133

4.3 Trafix: A Road Traffic Simulator . . . . . . . . . . . . . . . . . . . . 138

4.3.1 A Shared, Pooled Allocator . . . . . . . . . . . . . . . . . . . 144

Chapter 5. Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147

5.1 Event versus Time-Driven Simulation . . . . . . . . . . . . . . . . . . 148

5.1.1 Expected Advantage of Event vs Time-Driven Simulation . . 148

5.1.2 Decision between Event vs Time-Driven Modes . . . . . . . . 149

5.1.3 Exponentially Distributed Example . . . . . . . . . . . . . . . 152

5.1.4 Weibull Distribution Example . . . . . . . . . . . . . . . . . . 155

5.2 Topology: Traffic Map Layout . . . . . . . . . . . . . . . . . . . . . . 158

Chapter 6. Design Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161

6.1 Reconfigurable Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . 161

6.2 Systolic Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172

6.3 Content Addressable Memory . . . . . . . . . . . . . . . . . . . . . . 173

6.4 Reduction Bus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178


Chapter 7. Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179

7.1 Distributed Multiprocessors . . . . . . . . . . . . . . . . . . . . . . . 180

7.2 Processing Elements . . . . . . . . . . . . . . . . . . . . . . . . . . . 184

7.2.1 Event Generation . . . . . . . . . . . . . . . . . . . . . . . . . 185

7.2.1.1 Event Generator Results . . . . . . . . . . . . . . . 190

7.2.2 Event Queue . . . . . . . . . . . . . . . . . . . . . . . . . . . 190

7.2.2.1 The Service Event Sorter . . . . . . . . . . . . . . . 191

7.2.2.2 The Linear Array . . . . . . . . . . . . . . . . . . . 195

7.2.2.3 The Queue Model Results . . . . . . . . . . . . . . . 198

7.2.3 Scheduler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202

7.2.3.1 Vehicle Data . . . . . . . . . . . . . . . . . . . . . . 203

7.2.3.2 Vehicle Initialization . . . . . . . . . . . . . . . . . . 207

7.2.3.3 Road Movement . . . . . . . . . . . . . . . . . . . . 207

7.2.3.4 Intersection Movement . . . . . . . . . . . . . . . . 213

7.2.3.5 Scheduler Results . . . . . . . . . . . . . . . . . . . 214

7.3 Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219

7.3.1 Communications Architectures . . . . . . . . . . . . . . . . . 221

7.3.2 Parallel Bus Architecture . . . . . . . . . . . . . . . . . . . . 226

7.3.3 Search Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 227

7.3.4 Phase 1 Elimination . . . . . . . . . . . . . . . . . . . . . . . 227

7.3.5 Phase 2 Selection . . . . . . . . . . . . . . . . . . . . . . . . . 228

7.3.6 Cross-Point Matrix . . . . . . . . . . . . . . . . . . . . . . . . 231

7.3.7 Network Results . . . . . . . . . . . . . . . . . . . . . . . . . 235


Chapter 8. Optimistic Synchronization . . . . . . . . . . . . . . . . . . . . . . . 242

Chapter 9. Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250


List of Tables

2.1 Car-Following Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.2 Vehicle Deceleration Notation . . . . . . . . . . . . . . . . . . . . . . . . 26

2.3 Harmonic Mean Speed Notation . . . . . . . . . . . . . . . . . . . . . . 28

2.4 Notation for Greenshields Equations . . . . . . . . . . . . . . . . . . . . 29

3.1 Barto’s Simulator Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 47

3.2 Synchronous Discrete Event-Driven Simulation Algorithm . . . . . . . . 107

4.1 Event Generation Code I . . . . . . . . . . . . . . . . . . . . . . . . . . 126

4.2 Event Generation Code II . . . . . . . . . . . . . . . . . . . . . . . . . . 128

4.3 Event Queue Loop Code . . . . . . . . . . . . . . . . . . . . . . . . . . . 129

4.4 CORSIM Function Classifications . . . . . . . . . . . . . . . . . . . . . . 132

4.5 CORSIM Runtime Under Linux and NT . . . . . . . . . . . . . . . . . . 135

4.6 Scheduler Software Function Profile . . . . . . . . . . . . . . . . . . . . 144

7.1 Event Generator and Event Queue FPGA Implementation . . . . . . . . 201

7.2 Vehicle Data Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206

7.3 Acceleration Decisions for a Road . . . . . . . . . . . . . . . . . . . . . . 212

7.4 Acceleration Decisions for an Intersection . . . . . . . . . . . . . . . . . 216

7.5 Scheduler Chip Implementation . . . . . . . . . . . . . . . . . . . . . . . 218


List of Figures

1.1 Simulator Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.1 Time Headway . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.2 Car-Following Notation Figure . . . . . . . . . . . . . . . . . . . . . . . 20

2.3 Car-Following Acceleration . . . . . . . . . . . . . . . . . . . . . . . . . 22

3.1 The Boeing Simulator Architecture . . . . . . . . . . . . . . . . . . . . . 36

3.2 Simulation Classifications . . . . . . . . . . . . . . . . . . . . . . . . . . 37

3.3 The Boeing Simulator Logic Processor . . . . . . . . . . . . . . . . . . . 39

3.4 Los Gatos Logic Simulation Machine Architecture . . . . . . . . . . . . 42

3.5 The IBM Los Gatos Logic Simulation Machine . . . . . . . . . . . . . . 44

3.6 Barto’s Logic Simulator Architecture . . . . . . . . . . . . . . . . . . . . 49

3.7 Logic Simulation Machine . . . . . . . . . . . . . . . . . . . . . . . . . . 51

3.8 Levendel’s Logic Simulator Architecture . . . . . . . . . . . . . . . . . . 55

3.9 Mapping Circuit Blocks a_i and a_j to Processors p_i and p_j . . . . . 57

3.10 Interface Between the Data Sequencers and the Time-Shared Parallel Bus 58

3.11 The Controlling Processor Unit Configuration . . . . . . . . . . . . . . . 60

3.12 Subordinate Processor Unit Configuration . . . . . . . . . . . . . . . . . 61

3.13 Interface Between the Parallel Bus and the Cross-Point Matrix . . . . . 65

3.14 Interface Between the Data Sequencers and a Cross-Point Matrix . . . . 67

3.15 Megalogican Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 69


3.16 The YSE Logic Processor Configuration . . . . . . . . . . . . . . . . . . 72

3.17 A Switch Port "K" Example with its Logic Port Connection . . . . . . 74

3.18 The YSE Scheduling Problem . . . . . . . . . . . . . . . . . . . . . . . . 78

3.19 The HAL Level Ordering Method . . . . . . . . . . . . . . . . . . . . . . 82

3.20 The HAL Hardware Architecture . . . . . . . . . . . . . . . . . . . . . . 83

3.21 Internal mechanism of a Logic Processor . . . . . . . . . . . . . . . . . . 84

3.22 Global MARS Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 88

3.23 Internal Cluster Architecture . . . . . . . . . . . . . . . . . . . . . . . . 89

3.24 Architecture of the Processing Element . . . . . . . . . . . . . . . . . . . 91

3.25 MARS logic simulation pipeline . . . . . . . . . . . . . . . . . . . . . . . 93

3.26 The RM Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

3.27 The LSIM Fanout Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

3.28 The LSIM Evaluation phase . . . . . . . . . . . . . . . . . . . . . . . . . 97

3.29 Bauer’s Reconfigurable Logic Simulator . . . . . . . . . . . . . . . . . . 99

3.30 The Splash 2 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 102

3.31 The Splash 2 Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

3.32 The ArMen Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

3.33 Digital-Serial Implementation of the Global Minimum Computation and Broadcast 111

3.34 The Ising Spin Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116

3.35 A General 2-Bit Feedback Shift Register . . . . . . . . . . . . . . . . . . 118

3.36 Random Number Generator . . . . . . . . . . . . . . . . . . . . . . . . . 118

3.37 Parallel Reduction Network . . . . . . . . . . . . . . . . . . . . . . . . . 121

3.38 PRN Arithmetic Logical Unit Node . . . . . . . . . . . . . . . . . . . . . 122


4.1 Simulation Timeline Generation . . . . . . . . . . . . . . . . . . . . . . . 126

4.2 Profile Chart of CORSIM on NT . . . . . . . . . . . . . . . . . . . . . . 136

4.3 Profile Chart of CORSIM on Linux . . . . . . . . . . . . . . . . . . . . . 137

4.4 The Trafix Software Structure . . . . . . . . . . . . . . . . . . . . . . . . 140

4.5 The Trafix Display . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140

4.6 Trafix Input Environment . . . . . . . . . . . . . . . . . . . . . . . . . . 142

5.1 Wrapping a Traffic Map onto the Simulator . . . . . . . . . . . . . . . . 159

5.2 Different Lexicographical Map Layouts . . . . . . . . . . . . . . . . . . . 160

6.1 General Xilinx FPGA Architecture . . . . . . . . . . . . . . . . . . . . . 164

6.2 Xilinx Architecture Interconnects . . . . . . . . . . . . . . . . . . . . . . 165

6.3 The Xilinx XC4000 Configurable Logic Block . . . . . . . . . . . . . . . 166

6.4 Block Diagram of the Altera Flex 10K Architecture . . . . . . . . . . . . 168

6.5 Diagram of the Altera Embedded Array Block (EAB) . . . . . . . . . . 169

6.6 Diagram of the Altera Logic Element (LE) . . . . . . . . . . . . . . . . . 171

6.7 Associative Memory Block Diagram . . . . . . . . . . . . . . . . . . . . 175

6.8 An Associative Memory Cell . . . . . . . . . . . . . . . . . . . . . . . . 176

6.9 Associative Memory Match Logic . . . . . . . . . . . . . . . . . . . . . . 177

7.1 System User Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181

7.2 Processor Elements Network . . . . . . . . . . . . . . . . . . . . . . . . . 183

7.3 Local Processing Element Design . . . . . . . . . . . . . . . . . . . . . . 186

7.4 The Event Generator Flow Diagram . . . . . . . . . . . . . . . . . . . . 188

7.5 Service Event Sorter: Cycle 1 . . . . . . . . . . . . . . . . . . . . . . . . 193


7.6 Service Event Sorter: Cycle 2 . . . . . . . . . . . . . . . . . . . . . . . . 194

7.7 Service Event Sorter: Cycle 3 . . . . . . . . . . . . . . . . . . . . . . . . 196

7.8 Service Event Sorter: Cycle 4 . . . . . . . . . . . . . . . . . . . . . . . . 197

7.9 Linear Array Queue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197

7.10 Linear Sort Array Input Example . . . . . . . . . . . . . . . . . . . . . . 199

7.11 Linear Sort Array Output Example . . . . . . . . . . . . . . . . . . . . . 199

7.12 Speedup vs Events for Event Generation, Arrival and Service Queues . . 201

7.13 An Intersection and its Departing Roads . . . . . . . . . . . . . . . . . . 204

7.14 Scheduler Vehicle Queue . . . . . . . . . . . . . . . . . . . . . . . . . . . 208

7.15 Calculations for Vehicle Movement on a Road . . . . . . . . . . . . . . . 211

7.16 Calculations for Vehicle Movement Through an Intersection . . . . . . . 215

7.17 Processing Element for 4-way Intersection and Exit Roads . . . . . . . . 220

7.18 A Network of Processing Elements . . . . . . . . . . . . . . . . . . . . . 222

7.19 The 3-Dimensional Network Structure . . . . . . . . . . . . . . . . . . . 223

7.20 The PE Interconnection Network . . . . . . . . . . . . . . . . . . . . . . 224

7.21 K-ary Search Tree Network . . . . . . . . . . . . . . . . . . . . . . . . . 225

7.22 Algorithm Phase 2 Method 2 . . . . . . . . . . . . . . . . . . . . . . . . 230

7.23 Cross-point Switch Architecture . . . . . . . . . . . . . . . . . . . . . . . 232

7.24 Processing and Communications Time . . . . . . . . . . . . . . . . . 233

7.25 Exponential Distribution in Event vs Time-Driven Simulation . . . . . . 238

7.26 Exponential Distribution Slice of Figure 7.25 . . . . . . . . . . . . . . . 239

7.27 Weibull Distribution in Event vs Time-Driven Simulation . . . . . . . . 241

7.28 Weibull Distribution Slice of Figure 7.27 . . . . . . . . . . . . . . . . . . 241


9.1 Speedup Results by Section . . . . . . . . . . . . . . . . . . . . . . . . . 248


Acknowledgments

First, and most importantly, I wish to thank my wife, Anna, for her patience, guidance, and assistance.

I would like to express gratitude to both the Pennsylvania State University and Massachusetts Institute of Technology library systems. Without their assistance and public access policies, this thesis would not have been possible.

Thanks to Henry Lieu of the United States Federal Highway Administration for providing access to the CORSIM source code.

Special thanks to Ms. Ralene Marcoccia and Mr. Joe Hanson of Altera's University Programs Department for providing the Altera simulation software which formed the backbone of our FPGA analysis.

Pie charts are rendered using the Ploticus software package, which was developed by Steve Grubb (www.sgpr.net). GNU software and the Linux kernel are used extensively in this research.


Chapter 1

Introduction

Simulation is one important aspect of computing. There are a wide variety of simulation applications, from gaming to financial management. Computer manufacturers have thus far concentrated the vast majority of their efforts on developing general purpose computer architectures which quickly execute a stored program. The stored program concept dates back to the inception and seminal papers of modern computing [132]. The concept relies on the ability to store instructions and data which can be retrieved sequentially from memory and then executed by a processor. The stored program concept works well for general purpose computing, where the applications are diverse and the runtime speeds are non-critical. Unfortunately, not all environments are so accommodating. In this thesis, the proposed architecture deviates from the stored program concept by moving the processor instructions out of memory and embedding them in the reconfigurable logic data path of the architecture. This deviation is unusual.

Presented are methods for accelerating discrete event simulation in general, with a focus on the specific example of traffic simulation. Using an accelerated simulator, existing metropolitan models can be adjusted to reflect accidents or incidents for traffic management. The simulation can then be rapidly re-run to determine whether proposed detour and signal control solutions will alleviate congestion. Results can provide additional guidance about detour impact on the rest of the traffic grid. Demonstrating their importance as traffic management tools, air traffic simulators are used to minimize delays and maximize passenger throughput by increasing efficiency.

Traffic simulators are often unable to simulate traffic at a rate much greater than the time required to actually run traffic on a network of roads. A recent demonstration of MITSIM, projecting a section of the Boston arterial flow project, simulated traffic moving at a stated rate of approximately 90% of real time. The speed of these simulators is adequate for the design of new traffic pattern construction and for optimizing traffic signal timing sequences. However, the response time is inadequate for handling traffic incidents, which require a greater level of acceleration to be useful to traffic managers attempting to optimize existing networks experiencing unanticipated crises.

Metropolitan traffic grids are often strained by the advent of celebrations or demonstrations which may foster abnormal traffic loads. Concentrations of congregants may induce localized surges of congestion. Even a simple traffic incident in an already strained metropolitan street grid yields immediate consequences. This thesis presents a simulation machine architecture capable of serving as a rapid traffic incident response simulation system. The accelerated machine is designed to assist traffic management officials in obtaining and testing detours. The machine is capable of running its simulations fast enough to be useful to the traffic officers on the street. Although anyone stuck in a traffic jam can attest to the benefits and increased satisfaction gained by avoiding congestion, the implications of increased traffic throughput are not just a matter of convenience. Currently, the resuscitation rate nationally is only 2 to 5 percent, in large part because defibrillators do not reach victims in time [124]. Faster response time to injury victims can be directly correlated with an increased survival rate.

An architecture composed of multiple processing elements is proposed. The processing elements are united and synchronized towards the common goal of accelerating discrete event simulation. Many of the examples illustrate methods applicable to general discrete event simulation. Microscopic road traffic simulation is also provided as a concrete example.

The thesis organizes its discussion of discrete event simulation into the following topics. First, Chapter 1 describes the motivation behind the research. A basic model of discrete event simulation is illustrated, which is referred to throughout the thesis as the basis of discrete event simulation. Chapter 1 also highlights opportunities for acceleration which are applied in later sections, as well as the basic constraints of simulation. Finally, Chapter 1 describes why simulation, as opposed to mathematical analysis, is often required. Related aspects of traffic theory are reviewed in Chapter 2. This chapter discusses the derivation of the microscopic acceleration model developed by General Motors in Section 2.0.2. Two macroscopic models, the first developed by Greenshields and the second developed by Greenberg, are reviewed in Sections 2.0.4.1 and 2.0.4.2, respectively. Chapter 3 presents an overview of deterministic simulators. In addition to the logic simulators, there is a review of some optimistic simulation hardware in Section 3.3 and of hardware used for random number generation in Section 3.4. Chapter 4 describes the simulation software implemented and/or studied in this research. Software is used as a standard against which the throughput of the proposed hardware implementation is measured. In order to gain an understanding of the requirements of hardware simulators, Section 4.2 presents the runtime characteristics of CORSIM, a representative software simulator which has been developed under the aegis of the United States Department of Transportation. CORSIM is widely used in research and practice. The results of Section 4.2 focus hardware acceleration efforts on the bottlenecks of simulation. Trafix, an open source, free software traffic simulator developed concurrently by the author, is overviewed in Section 4.3. Chapter 5 performs statistical analysis useful in deciding whether to run a simulation in time-driven or event-driven mode. A brief analysis of the geometric constraints involved in partitioning a traffic map over the proposed simulation architecture is presented in Section 5.2. Chapter 6 describes some of the key hardware methods which are applied to accelerate the simulator. These approaches include reconfigurable logic, systolic arrays, associative memory, and a reduction bus. Chapter 7 presents the architecture of the simulator, which applies the techniques described in Chapter 6 to accelerate each component of the simulation model defined in Chapter 1. Chapter 8 considers optimistic modifications to the simulator design. Finally, Chapter 9 describes the results of the work.

1.1 The Importance of Non-deterministic Simulation

Non-deterministic simulation is an important tool used by a variety of disciplines. Faster simulations will allow engineers to predict and accommodate changes in metropolitan traffic models. A system which can accurately predict traffic jams or service interruptions will assist in their prevention. Faster simulation allows existing traffic models to quickly reflect accidents or changes in available traffic routes. Accelerated simulation models can be re-run rapidly to pinpoint expected traffic congestion. Traffic engineers can model their proposed changes and simulate solutions quickly for verification. In a real-time application, simulators can either evaluate traffic management strategies for re-routing during incidents, or assess and optimize traffic control schemes under various, changing traffic demands.

Recent instances of large-scale discrete event simulation, and concerns about their slow pace, abound in the press. One example, which occurred on state-of-the-art equipment, was the simulation of a 1-kilometer-wide comet striking the earth's ocean to determine the impact detonation power and resulting shockwave strength. The simulation was performed using the teraflops (trillion floating point operations per second) supercomputer at Sandia National Laboratories.

A kilometer is about the size of the largest fragment of Comet Shoemaker-Levy 9, which crashed into Jupiter in 1994, an event that was also the subject of computational simulations [105, 111]. The calculation used Sandia's "bang and splat" shock physics code and was run on 1,500 processors of the new Intel Teraflops computer being installed at the Labs; 1,500 processors is one-sixth of the expected final 9,000-processor configuration.

The calculation assumed a 1-kilometer-diameter comet (weighing about a billion tons) traveling 60 kilometers per second and impacting Earth's atmosphere at about a 45-degree angle. The modeled comet is small as far as comets go (the massive Comet Hale-Bopp weighs about ten trillion tons). The problem was divided into 54 million zones and ran for 48 hours. The results, although dramatic, confirmed earlier predictions about a comet impact, but they did so with much finer resolution in three dimensions than had ever before been possible [105].


Scientists at the University of Wisconsin have recently applied genetic algorithms which allow the process of Darwinian natural selection to guide the design of diesel engines [127]. This is the kind of problem that chokes even the most powerful supercomputer. Computers running software developed by Dr. Reiz and his colleagues at government laboratories, universities, and in industry have begun to make progress, though the progress is slow. “A typical simulation will run for several days on a supercomputer,” Dr. Reiz said. “That simulation is of one engine cycle which actually takes place in less than a tenth of a second. . . . There can be dozens of parameters to adjust, each of which affects the others. Finding an optimal combination by trial and error on a real-world engine could take practically forever. But with simulations taking two days apiece, trying all the combinations of variables with a computer does not seem to work much faster.” [127]

In terms of emergency evacuations, simulation has also recently been applied to crowd behavior during a fire with limited egress. The intent is to allow emergency planners to prevent death or injuries due to panic [40].

Traffic engineering presents a practical and realistic simulation application. One possible scenario models the new millennium celebration in Times Square, New York City. Changes could be made to existing models, factoring in the effects of traffic outages, thereby allowing the simulation and verification of proposed traffic detours. Traffic outages could occur due to construction, accidents, or terrorist activity. Accelerated simulators will prove highly effective in assisting engineers rerouting traffic during emergency situations. The same scenarios hold for rail [32] and aeroplane traffic. If transportation systems designers have fast simulators available, the following system design enhancements become feasible [88]:

• Real-time verifiable experimentation becomes plausible.

• Simulation can contribute to systems management rather than being used solely for the design process.

• Users can be more confident of the accuracy of their implementation decisions.

1.2 Simulation Model

Discrete event simulations typically have three basic common denominators. First,

they contain a set of state variables denoting the current state of the simulation. The

state variables contain information such as the number and availability of system re-

sources. Secondly, a typical discrete simulation contains an event queue, depicted in Fig-

ure 1.1. The event queue is a list of pending events which have been created by an event

generator but not yet executed by the scheduler. These events require system resources

to execute. The availability of resources is described by the state variables. Events often

contain an arrival timestamp and possibly a duration. The arrival timestamp indicates

when the event impacts the system’s state variables. Event arrival times and service

times are frequently generated based on statistical models. For example, events may

arrive according to a Poisson Distribution. Finally, the third common denominator of

discrete event simulations is the global simulation clock which keeps track of the sim-

ulation’s progress. The simulation must maintain proper causal states, meaning that

each event must be executed in the environment created by the execution of the prior


events. Therefore, if prior events have depleted a particular resource, that resource will

be unavailable for the execution of a following event.

The simulation generally executes a main loop [55] which repeatedly removes the

event with the smallest timestamp from the event queue. Each event is processed by

making appropriate state changes to the simulation model’s state variables.
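This main loop can be sketched in a few lines of Python. The sketch below is a hypothetical single-server queue, not code from the thesis; the names and the fixed service time are illustrative only.

```python
import heapq

def simulate(arrivals, service_time):
    """Minimal discrete event simulation of a single-server queue.

    `arrivals` is a list of arrival timestamps; `service_time` is a fixed
    service duration. State variables: the global simulation clock and the
    time at which the server next becomes free.
    """
    event_queue = [(t, "arrival") for t in arrivals]
    heapq.heapify(event_queue)       # pending events, ordered by timestamp
    clock = 0.0                      # global simulation clock
    server_free_at = 0.0             # state variable: server availability
    departures = []
    while event_queue:
        clock, kind = heapq.heappop(event_queue)   # smallest timestamp first
        if kind == "arrival":
            # the event executes in the state left by all prior events
            start = max(clock, server_free_at)
            server_free_at = start + service_time
            heapq.heappush(event_queue, (server_free_at, "departure"))
        else:
            departures.append(clock)
    return departures
```

For example, `simulate([0.0, 1.0, 2.0], 2.0)` returns departures at `[2.0, 4.0, 6.0]`: each event executes against the state left by all earlier-timestamped events, which is exactly the causal ordering discussed below.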

Discrete simulation creates a system which changes state at specific points in

time. The simulation model jumps from one state to the next when an event occurs and

is processed. A telephone system might contain a set of state variables which describe

telephone trunks leading from a substation as either full or available to route new calls.

Additional state variables might contain the number of calls currently being handled by

the substation. Typical events at the substation might include call arrivals, inbound

calls being routed through the station, calls being terminated, or calls being blocked.

1.3 Opportunities for Acceleration in Discrete Simulation

Discrete event simulation offers several strong openings for acceleration. It

is easy to cite impediments to acceleration, chief among them the need for

causality in event execution. Requiring each event to execute in the environment

created by its predecessors does impose an inherently sequential nature. However,

various attempts have been made to allow simulation events to be processed concur-

rently, or optimistically [29, 43, 58, 56, 80, 140]. Three main opportunities are explored

in this work.

The first opportunity views events as collections of smaller discrete subcomponents,

parts of which are independent, an approach sometimes referred to as fine-grained parallelism.


[Figure 1.1 components: Event Generator, Random Number Generator, Event Queue, Scheduler, Simulation Time Clock, Server, Queue]

Fig. 1.1. Simulator Model. The simulator is divided into the components illustrated. The Event Generator creates random events, according to a user-selected statistical distribution, along with each event's resource requirements. The events and their attributes are placed in the Event Queue. The Scheduler steps through the Event Queue in chronological order according to the Global Simulation Clock, attempting to allocate resources to each event. If the resources are available, the event can execute. If not, the event is blocked.


In computer architecture, there was an exploration of Reduced Instruction Set Comput-

ing (RISC) versus Complex Instruction Set Computing (CISC). The CISC proponents

proposed rich, sophisticated instruction sets with large selections of addressing modes

which tended to slow instruction execution. RISC proponents proposed a simple, fixed-

length instruction set with fewer addressing modes. Though it may take several simple

instructions on a RISC machine to replace one of the more complicated instructions, it

is possible to process the simpler instructions more rapidly [21, pg.118]. Dividing simu-

lation events into smaller sub-tasks reveals independent sub-components which may be

executed rapidly and in parallel.

The second advantage which can be exploited within simulation is the locality

of data. For example, in traffic simulation, vehicles tend to move along a continuous

trajectory, flowing from one street into an intersection and then onto the next street.

If a traffic simulation is divided along naturally occurring geographic boundaries, data

required to process the vehicle will be easily cached within the processing elements

handling the respective streets and intersections. For example, if a processing element

is assigned to handle an intersection and its egressing roads, then information about

the roads and intersection including speed limit, road grade, traffic signals, etc. can all

be maintained within the processing element and need not be moved with individual

vehicle datasets. All vehicles traversing roads and intersections will store properties

locally on the road and intersection processing elements. Common geographic data is

stored locally. The currently processed vehicle may become the following vehicle’s leader

during the next processing stage. The sedentary nature of much of the data will help


to alleviate the common memory bottleneck problems which are often experienced by

general purpose computers.

The third opportunity allows the hardware to compute all of the possible outcomes

of a conditional statement concurrently and a priori, and then simply select the

appropriate result. This approach trades the silicon real estate required to

implement the needed functional units for accelerated access to the required results.

1.3.1 Difficulties faced by Parallel Discrete Event-Driven Simulations

The greatest opportunity for increasing simulation speed lies in concurrent event

processing. Parallel execution of discrete event-driven simulation is limited by the need

to always process the queued event with the smallest timestamp. If a different event

were removed from the queue and processed, that second event might incorrectly change

the simulation's state variables in whose context the event with the next-smallest

timestamp would execute. Having an event in the future affect an event in the past is called a causality

error [58].

Fujimoto [55] provides a quick example using two events, Ei and Ej with their

respective timestamps, Ti and Tj . It is assumed that if i < j then Ti < Tj . If Ei writes

into a state variable that is read by Ej , then Ei must be executed before Ej to be sure

that no causality error occurs. This example ignores partial concurrent execution of the

events, which may be possible given the sequencing constraints.
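Fujimoto's example can be made concrete with a toy sketch. The write/read events and the shared state variable below are hypothetical illustrations, not code from the thesis.

```python
def run_events(order):
    """Executes a write event Ei (Ti = 1) and a read event Ej (Tj = 2)
    in the given order and returns the value Ej observes.
    """
    events = {"Ei": ("write", 1), "Ej": ("read", 2)}
    state = 0            # the shared state variable
    observed = None
    for name in order:
        action, timestamp = events[name]
        if action == "write":
            state = timestamp    # Ei writes into the state variable
        else:
            observed = state     # Ej reads the state variable
    return observed
```

In causal order, `run_events(["Ei", "Ej"])` returns `1`: the read observes the written value. In the violated order, `run_events(["Ej", "Ei"])` returns `0`: Ej reads a value Ei has not yet written, which is precisely a causality error.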

Hence, the possibilities for increasing speed via concurrency seem limited. In

the proposed architecture, reconfigurable logic devices are targeted at the computation


problems as opposed to conventional processors. Reconfigurable logic devices devote their

silicon area to a large number of computing primitives, interconnected via a configurable

network. Both the primitives and the interconnection network can be programmed to

fit the problem. Computational tasks are spatially implemented on the device, allowing

intermediate computation results to flow directly from producing to receiving functions.

Since thousands of primitives can reside on a single chip, significant amounts of data flow

may occur without crossing chip boundaries. In this thesis, the entire task is mapped into

hardware. The reconfigurable logic provides a spatially oriented processing environment

as opposed to the temporally-oriented processing provided by general purpose processors.

Reconfigurable logic provides the following advantages over traditional micropro-

cessors. These advantages are used to accelerate discrete event simulation [47].

• Functional Unit Distribution - Rather than broadcasting a new instruction to

the functional units on every cycle, instructions are locally configured in the data-

path, allowing the reconfigurable device to compress the data stream distribution

and effectively deliver more instructions into active silicon on each cycle.

• Spatial routing of computational intermediates - As space and primitives

permit, intermediate values are routed in parallel from producing functions to

consuming functions rather than forcing all communication to take place in time

through a central resource bottleneck.


• Fine-grained computing approach - Having more, often finer-grained, sepa-

rately programmable building blocks, reconfigurable devices provide a large num-

ber of separately programmable units allowing a greater range of computations to

occur per time step.

• Resource Placement - Distributed, deployable resources eliminate bottlenecks.

Resources including memory, interconnect, and functional units are distributed and

deployed based on need, rather than being centralized in large pools. Independent,

local access allows reconfigurable designs to take advantage of local and parallel

on-chip bandwidth, instead of creating a central resource bottleneck.

In addition to these advantages of reconfigurable logic, the causality constraints

described in this section are similar to the constraints faced by the logic simulators of

Chapter 3. Some of the same techniques applied to the deterministic logic simulators

will also be targeted on the non-deterministic simulation problem. In traffic simulation,

causality constraints are limited to the environment directly surrounding the vehicle.

For example, the movement of a vehicle on one city block may

be completely independent of a vehicle traveling on the next block, let alone a vehicle

traveling across town. At a higher level, communications efficiency between concurrent

processors is reviewed. All of these techniques and attributes are used to assist in the

acceleration of parallel discrete event simulation.


1.4 Simulation’s Niche

Simulation is generally carried out either macroscopically or microscopically. Macro-

scopic models use aggregate data which may include speed-flow-density and queue dis-

persion equations. These equations determine how vehicles move through the traffic

grid. The microscopic approach provides a greater level of detail, modeling each vehicle

individually. These two approaches are examined more closely in Section 1.5.

Simulation fulfills the need for accurate predictions of the behavior of

interdependent systems where mathematical solutions remain elusive. In

traffic networks, there are many situations where various lanes of traffic interact in the

sense that a traffic stream departing from one queue of vehicles enters one or more

other queues, perhaps after merging with portions of yet other traffic streams departing

from still other queues. The merging of these traffic streams has the unfortunate effect

of complicating the character of the arrival process at the downstream queues [18, pg

209]. Once the vehicles travel beyond their entry points, their interarrival times become

strongly correlated with the vehicle lengths and inertia. It therefore becomes impossible

to carry out a precise and effective analysis comparable to the queueing theory analysis

for the M/M/1 and M/G/1 systems [18, pg 209]. Analysis of these systems requires the

assumption of inter-arrival independence.

Bertsekas provides a traffic analogy when describing why the assumption of in-

dependent arrival times is often not applicable in simulation models. Consider a slow

truck traveling on a busy narrow street followed by several faster cars. The truck will

typically see empty space ahead while being closely followed by the faster cars [18, pg


209]. The assumption of independent arrival times in a network of nodes is often not a

valid assumption. Without the assumption of independence, many mathematical models

do not apply.

1.5 Microscopic & Macroscopic

A microscopic analysis focuses on the speeds of individual units, and the time

and distance headways between units, whereas a macroscopic approach characterizes

aggregate flow rates, average velocities, and distances. Traffic analysis methods range

from simple equations to complex simulation models. Traffic stream models can often

be used for uninterrupted flow situations where demands do not exceed capacities [102].

For oversaturated situations, where traffic flow may be complicated by interruptions,

more complex methods, including shock wave analysis, queue analysis, and simulation

modeling, can be employed.

Macroscopic analysis may be selected for higher-density, larger-scale systems in

which a study of the behavior of groups of units is sufficient [102]. Macroscopic

traffic analysis focuses on three fundamental characteristics: flow, density, and speed. For

instance, macroscopic analysis might explore the average vehicle velocity on a freeway

at peak versus off-peak times of the day near a particular exit. Continuum models are

needed for better understanding the collective behavior of traffic [104]. Michalopoulos

notes that applications of the existing high-order macroscopic models have not shown

satisfactory results, especially in situations which are congested and contain interrupted

flows. In high-density situations, the models may suffer from stability problems, although


in some cases these problems are related to the numerical method used. Vehicle interac-

tion is also considered to be one of the components that contribute to flow acceleration.

Unfortunately, so far there are neither theoretical arguments nor experimental results

that lead to an unambiguous choice of such contributions [104].

Microscopic analysis may be selected for moderate-sized systems where the num-

ber of transport units passing through the system is relatively small and there is the

need to study the behavior of individual units in the system. The designed simulator is

intended to assist with traffic incident recovery. A microscopic model was selected for

this research to provide detailed incident traffic analysis.


Chapter 2

Traffic Theory

Traffic flow theory has been well studied. In a traffic stream, a minimum space

must be available in front of every vehicle so that the operator can control the vehicle

without colliding with the lead vehicle. Vehicle spacing is an important criterion for the

operator’s level of service. Large vehicle spacing provides operators with considerable

freedom of motion. In theory, as vehicle spacing decreases, operators are required to

devote more concentration to the task of driving and to reduce their velocity. Decreased

spacing results in lower levels of comfort, but higher throughput as long as the vehicle

spacing remains greater than the critical spacing limit. After the critical limit is reached,

the throughput begins to drop along with the level of service. An extreme example of

the service drop is a stopped queue of vehicles where the spacing is minimal, the level of

service is at its nadir, and the flow is zero.

The results contained in this chapter are used in the development of Trafix, a traf-

fic simulator which is discussed in Section 4.3, and in the architecture implementations

described in Section 7.2.3. Specifically, the car-following model described in Section 2.0.2

is often used as the basis of traffic simulation. The simulators developed for this thesis

depend on the car-following model.

Figure 2.1 illustrates the time headway. An analogous measurement, the distance

headway, is defined as the space between two selected points on the lead and following


vehicle. The time headway is more frequently encountered in practice “... because

of the greater ease of measuring time headway. Distance headway can be obtained

photographically; however, it is more often obtained by calculation based on the time

headway and individual speed measurements.” [102]

The notation and definitions of Figure 2.2 are used to develop car-following models.

Two vehicles moving left to right are illustrated, with the lead vehicle, n, having a

length of L_n and the following vehicle, n + 1, a length of L_{n+1}. The other figure

parameters are listed and defined in Table 2.1. Note that the acceleration of the

following vehicle, \ddot{x}_{n+1}, occurs at time t + \Delta t, not t. The interval

\Delta t, sometimes called the operator reaction time, is the time required for the

operator of the following car to decide upon and initiate a new acceleration.

Variable           Description
n                  lead vehicle
n + 1              following vehicle
L_n                length of lead vehicle
L_{n+1}            length of following vehicle
x_n                position of lead vehicle
x_{n+1}            position of following vehicle
\dot{x}_n          speed of lead vehicle
\dot{x}_{n+1}      speed of following vehicle
\ddot{x}_{n+1}     acceleration rate of following vehicle
t                  time t
t + \Delta t       \Delta t time after time t

Table 2.1. Car-Following Notation. Notation used in car-following theories is listed. A matching illustration of the notation symbols is contained in Figure 2.2.


[Figure 2.1: distance vs. time trajectories of consecutive vehicles passing an observation point, marking the time headway, space headway, occupancy time, and time gap]

Fig. 2.1. Time Headway. The time headway is defined as the elapsed time between the arrivals of pairs of vehicles. The time headway, t_5 - t_3, consists of two time intervals: the occupancy time, denoted in the figure as t_4 - t_3, which is the time required for the vehicle to actually pass the observation point, plus the time gap between the rear of the lead vehicle and the front of the following vehicle, denoted as t_5 - t_4. The time headway is not specifically defined as the time between the passage of the following edges of two consecutive vehicles, but is simply taken as the time between the passage of identical points on two consecutive vehicles. In practice, the leading edges of vehicles are frequently used [102].


[Figure 2.2: lead vehicle n of length L_n and following vehicle n + 1 of length L_{n+1}, with positions x_n and x_{n+1}]

Fig. 2.2. Car-Following Notation Figure. Notation used in car-following theories is illustrated. Car-following theories describing vehicle interactions were developed in the 1950s and 1960s [102]. Various car-following models were developed; one of the most notable was the work performed at General Motors Corporation (GM). The GM research was accompanied by field experiments and the discovery of the mathematical bridge between the microscopic and macroscopic theories of traffic flow [102]. The notation cited above, and listed in Table 2.1, was developed by General Motors.


Chapter 2 contains information on both microscopic and macroscopic simulation

theory. The first section, Section 2.0.1, describes the work by Reuschel and Pipes who

developed microscopic equations to describe vehicular traffic flow. Section 2.0.2, discusses

the General Motors formula for the calculation of acceleration. The General Motors

acceleration formula is for microscopic simulation calculation. The next sub-section,

Section 2.0.3, derives the equations used for vehicle deceleration and stopping from basic

principles. Finally, Section 2.0.4 discusses two macroscopic simulation models. One

of the first macroscopic models, the Greenshields model, is reviewed in Section 2.0.4.1.

Section 2.0.4.2 reviews Greenberg’s model which is a fluid-flow macroscopic traffic model.

Shockwave models are not reviewed, but detailed information can be found in [35, 99,

60, 104].

2.0.1 Reuschel & Pipes

The challenge to describe vehicular flow in a microscopic manner led Reuschel

and Pipes to formulate the motion of pairs of vehicles following each

other, as described in Equation 2.1 [82]. The derivation of the expression is described

graphically in Figure 2.3.

x_n - x_{n+1} = L + S\dot{x}_{n+1}    (2.1)

Differentiation of Equation 2.1 leads to Equation 2.2, which is referred to as the

basic equation of the car-following models. Research groups associated with the General

Motors Corporation developed a linear mathematical formula which fitted well against


[Figure 2.3: lead vehicle n and following vehicle n + 1, with positions x_n and x_{n+1}, separated by L + S\dot{x}_{n+1}]

Fig. 2.3. Car-Following Acceleration. The challenge to describe vehicular flow in a microscopic manner led Reuschel and Pipes to formulate the motion of pairs of vehicles following each other by the expression x_n - x_{n+1} = L + S\dot{x}_{n+1} [82]. In this expression, it is assumed that each driver maintains a separation distance proportional to the speed of his vehicle, \dot{x}_{n+1}, plus a constant distance L, which is composed of the length of the vehicle plus the distance headway as determined at standstill, when \dot{x}_n = \dot{x}_{n+1} = 0. The constant S is measured in units of time.


high-density traffic data. The formula they derived, maintaining a linear

relationship, is provided in Equation 2.3. The equation differs from Equation 2.2 by the

introduction of \Delta t, which is defined to be the time lag of the response to the stimulus [82].

\ddot{x}_{n+1} = \frac{1}{S}\left[\dot{x}_n - \dot{x}_{n+1}\right]    (2.2)

\ddot{x}_{n+1}(t + \Delta t) = \frac{1}{S}\left[\dot{x}_n(t) - \dot{x}_{n+1}(t)\right]    (2.3)

2.0.2 General Motors’ Car-following Model

The General Motors research group established five generations of car-following

models, which all take the form:

response = f(sensitivity, stimulus)    (2.4)

The response in Equation 2.4 represents vehicle acceleration. The stimulus is the

relative velocity of the lead and following vehicles. The five models are distinguished by

differences in their sensitivity terms.

In the first model, Equation 2.5, the sensitivity term, α, is assumed to be a

constant. The equation is equivalent to Equation 2.3, with α = 1/S.

\ddot{x}_{n+1}(t + \Delta t) = \alpha\left[\dot{x}_n(t) - \dot{x}_{n+1}(t)\right]    (2.5)


Equation 2.5 forms a feedback mechanism for the acceleration of the following

vehicle. If the lead vehicle's velocity is greater than the following vehicle's, the difference

between the two velocities, \dot{x}_n(t) - \dot{x}_{n+1}(t), is positive, and the following

vehicle accelerates. Conversely, if the lead vehicle's velocity is lower,

the following vehicle's acceleration becomes negative, slowing its approach.

Field experimentation led the GM research team to note a wide range of values

for the sensitivity constant α. Hence, the team first tried to introduce separate sensitivity

constants, with the constant used in the equation selected based upon the vehicles' relative

distance. A higher sensitivity term, α_1, is used when the vehicles are closer together, as

shown in Equation 2.6. The equation is unsatisfactory due to its inherent discontinuity.

\ddot{x}_{n+1}(t + \Delta t) = \left\{\alpha_1 \text{ or } \alpha_2\right\}\left[\dot{x}_n(t) - \dot{x}_{n+1}(t)\right]    (2.6)

Gazis, Herman, and Potts [61] changed the linear property of Equation 2.5 by

allowing the constant sensitivity factor, α, to become inversely proportional to the sepa-

ration distance between the vehicles. The group introduced the physical spacing between

the lead and following vehicles as a parameter, leading to Equation 2.7. As the distance

between the vehicles decreases, the sensitivity term is given more weight. Gazis’s modi-

fication is illustrated in Equation 2.7, where α is a new constant.

\ddot{x}_{n+1}(t + \Delta t) = \frac{\alpha}{x_n(t) - x_{n+1}(t)}\left[\dot{x}_n(t) - \dot{x}_{n+1}(t)\right]    (2.7)


Next, Equation 2.7 is modified, yielding Equation 2.8, which gives the sensitivity

term more weight based on the speed of the following vehicle. The rationale is that as

the speed of the traffic stream increases, the operator of the following vehicle becomes more

sensitive to the relative velocity between the lead and the following vehicles. In work

subsequent to [61], Gazis generalized Equation 2.8 into its final form, Equation 2.9,

which became the final version of the acceleration formula for microscopic car-following

acceleration.

\ddot{x}_{n+1}(t + \Delta t) = \frac{\alpha'\left[\dot{x}_{n+1}(t + \Delta t)\right]}{x_n(t) - x_{n+1}(t)}\left[\dot{x}_n(t) - \dot{x}_{n+1}(t)\right]    (2.8)

Equation 2.9 is a continued effort to generalize the sensitivity term. The final

model allows the following vehicle velocity and relative vehicle separation to have a

generalized exponential effect. The equation allows the speed and distance headway

components to be raised to powers other than one, using exponents m and l. All previous

car-following models are specialized cases of Equation 2.9.

\ddot{x}_{n+1}(t + \Delta t) = \frac{\alpha_{l,m}\left[\dot{x}_{n+1}(t + \Delta t)\right]^m}{\left[x_n(t) - x_{n+1}(t)\right]^l}\left[\dot{x}_n(t) - \dot{x}_{n+1}(t)\right]    (2.9)

When m = l = 0, Equation 2.9 is reduced to Equation 2.5. Equation 2.7 results

when m = 0 and l = 1. Equation 2.9 was used in the software simulator of Section 4.3

and in the hardware reconfigurable logic implementations resulting from Figures 7.15

and 7.16. In both the hardware and software simulations, the values used for αl,m, m,

and l are the values used in Yang [146], where αl,m = 1.25, and m = l = 1.
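With those parameter values, Equation 2.9 can be sketched directly in code. The helper below is a hypothetical illustration (the argument names are not from the thesis); positions and speeds of the lead and following vehicles are taken at time t, and the following vehicle's speed at t + Δt.

```python
def gm_acceleration(x_lead, x_follow, v_lead, v_follow, v_follow_next,
                    alpha=1.25, m=1, l=1):
    """Fifth-generation GM car-following model (Equation 2.9).

    Returns the following vehicle's acceleration at t + dt. Defaults are
    the values from Yang used in this thesis: alpha_{l,m} = 1.25, m = l = 1.
    """
    # sensitivity term: alpha * (follower speed at t+dt)^m / (spacing at t)^l
    sensitivity = alpha * v_follow_next**m / (x_lead - x_follow)**l
    # stimulus term: relative velocity of lead and following vehicles at t
    stimulus = v_lead - v_follow
    return sensitivity * stimulus
```

Calling it with `m=0, l=0` reduces the expression to the first model, Equation 2.5, with constant sensitivity α.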


2.0.3 Vehicle Deceleration

The formula for deceleration when a vehicle is speeding is derived from

the basic principles of Equations 2.10, 2.12, and 2.14. Letting t_0 be zero leads to

Equations 2.11 and 2.13, respectively, from Equations 2.10 and 2.12.

Variable    Description
v_f         final velocity
v_0         initial velocity
\bar{v}     average velocity
t_f         final time
t_0         initial time
a           average acceleration
x_f         final position
x_0         initial position

Table 2.2. Vehicle Deceleration Notation. Notation used in the vehicle deceleration equations of Section 2.0.3 is defined.

a = \frac{dv}{dt} = \frac{v_f - v_0}{t_f - t_0}    (2.10)

a t_f = v_f - v_0    (2.11)

\bar{v} = \frac{dx}{dt} = \frac{x_f - x_0}{t_f - t_0}    (2.12)


\bar{v} t_f = x_f - x_0    (2.13)

\bar{v} = \frac{1}{2}(v_0 + v_f)    (2.14)

Then, combining Equations 2.11, 2.13, and 2.14, letting x_0 be 0, and eliminating

\bar{v}, the final form of Equation 2.15 is derived. This is the form used in both the hardware

and software models to compute deceleration when a vehicle is speeding. By letting v_f

go to 0, the same equation is also used to stop at the end of a road during a stop signal.

a = \frac{v_f^2 - v_0^2}{2 x_f}    (2.15)
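Equation 2.15 translates directly into code. The helper below is a hypothetical sketch (names are illustrative, not from the thesis); negative return values are decelerations.

```python
def deceleration(v0, vf, xf):
    """Average acceleration needed to go from speed v0 to speed vf over
    distance xf (Equation 2.15, with x0 = 0).

    For a stop at the end of a road during a stop signal, vf is 0.
    """
    return (vf**2 - v0**2) / (2.0 * xf)
```

For example, bringing a vehicle from 20 units/s to a stop over 50 units of distance requires a constant deceleration of 4 units/s^2.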

2.0.4 Macroscopic Models

Macroscopic traffic simulation uses a continuum traffic model in its approach.

The macroscopic models view traffic as collections of vehicles flowing along a network

of roads. Using macroscopic theory, the equations and conditions for maximum traffic

throughput can be derived. Section 2.0.4.1 reviews Greenshields’ work and the equations

he developed. Section 2.0.4.2 follows a second approach by Greenberg.

The general equilibrium equation relating traffic flow (q), density (k) and mean

harmonic speed (µs) is given in Equation 2.16. These variables further depend on envi-

ronmental factors which include roadway, driver, and vehicle characteristics along with


basic environmental factors like weather [60]. Equation 2.16 will be used below to de-

termine the maximum flow rate [60].

q = k\mu_s    (2.16)

The harmonic mean vehicle speed, µs, is defined by Equation 2.17 where addi-

tional terms are defined in Table 2.3.

\mu_s = \frac{n}{\sum_{i=1}^{n} \frac{1}{\mu_i}} = \frac{nL}{\sum_{i=1}^{n} t_i}    (2.17)

Variable    Description
t_i         time the ith vehicle takes to cross the highway segment
\mu_i       ith vehicle speed
n           number of vehicles passing a point on the highway
L           length of highway section

Table 2.3. Harmonic Mean Speed Notation. The variables used in Equation 2.17 are defined.
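Both forms of Equation 2.17 can be sketched as follows. These are hypothetical helpers (not thesis code); the two forms agree because each crossing time is t_i = L / \mu_i.

```python
def harmonic_mean_speed(speeds):
    """Harmonic mean of individual vehicle speeds (first form of Eq. 2.17)."""
    return len(speeds) / sum(1.0 / u for u in speeds)

def harmonic_mean_from_times(times, length):
    """Second form of Eq. 2.17: n vehicles crossing a highway segment of
    the given length, each taking times[i] to cross."""
    return len(times) * length / sum(times)
```

For instance, two vehicles at 10 and 30 units/s over a 60-unit segment take 6 and 2 seconds to cross, and both forms give a harmonic mean speed of 15 units/s (lower than the arithmetic mean of 20, as slower vehicles occupy the segment longer).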


2.0.4.1 Greenshields

Greenshields [67] published one of the earliest works where he observed the linear

relationship of Equation 2.18 between the velocity and density of vehicles on a stretch of

road. The variables of the equation are defined in Table 2.4. The use of the Greenshields

model depends on whether or not the equation satisfies the traffic model's boundary

conditions. The Greenshields model satisfies the boundary conditions both when the density,

k, approaches zero and when the density approaches the jam density, k_j;

therefore, the Greenshields model is used for either light or dense traffic [60].

\mu_s = \mu_f - \frac{\mu_f}{k_j} k    (2.18)

Variable    Description
\mu_f       mean free speed - maximum speed as density tends to 0
\mu_s       harmonic mean of vehicle speeds
k           density (vehicles per lane-unit)
k_j         jam density in vehicles per lane-unit (the density when vehicles are bumper to bumper and stopped)

Table 2.4. Notation for Greenshields Equations. The variables used in Sections 2.0.4.1 and 2.0.4.2 are defined.


Corresponding relationships for flow vs. density and for flow vs. speed can be

developed by eliminating k from Equation 2.18 using Equation 2.16. Equation 2.19

results.

\mu_s^2 = \mu_f \mu_s - \frac{\mu_f}{k_j} q    (2.19)

Similarly, using Equation 2.16 to eliminate µs from Equation 2.18, results in

Equation 2.20.

q = \mu_f k - \frac{\mu_f}{k_j} k^2    (2.20)

Now, using Equations 2.19 and 2.20, the speed and density required for maximum

flow can be obtained. Differentiating Equation 2.19 with respect to \mu_s provides

Equation 2.21.

2\mu_s = \mu_f - \frac{\mu_f}{k_j} \frac{dq}{d\mu_s}    (2.21)

Maximum flow is reached when \frac{dq}{d\mu_s} = 0, or when the mean velocity is equal to half the mean free speed, \mu_s = \frac{\mu_f}{2} [60].

Equation 2.20 can be similarly manipulated by differentiating q with respect to k

yielding Equation 2.22.

\frac{dq}{dk} = \mu_f - 2k \frac{\mu_f}{k_j}    (2.22)


Again, setting the derivative \frac{dq}{dk} = 0 yields the maximum-flow density, k = \frac{k_j}{2}, equivalent to half the jam density. The maximum flow for the Greenshields relationship can then be obtained by inserting the maximum-flow density and the mean velocity which yields the maximum flow into Equation 2.16, producing Equation 2.23.

q_{max} = \frac{k_j \mu_f}{4}    (2.23)
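The Greenshields relationships can be sketched as a few hypothetical helpers (not thesis code; any consistent units for speed and density will do):

```python
def greenshields_speed(k, mu_f, k_j):
    """Linear speed-density relationship (Equation 2.18)."""
    return mu_f - (mu_f / k_j) * k

def greenshields_flow(k, mu_f, k_j):
    """Flow-density relationship (Equation 2.20), i.e. q = k * mu_s."""
    return mu_f * k - (mu_f / k_j) * k**2

def greenshields_max_flow(mu_f, k_j):
    """Maximum flow (Equation 2.23), attained at k = k_j/2, mu_s = mu_f/2."""
    return k_j * mu_f / 4.0
```

With a free speed of 60 and a jam density of 120, the flow at half the jam density (k = 60) equals the maximum flow of 1800, and the speed there is half the free speed, consistent with the derivation above.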

2.0.4.2 Greenberg

Concurrently with the development of the General Motors (GM) model, the Port

Authority of New York, which was assisting GM with the testing of the GM model,

was developing a macroscopic flow model of its own, referred to as the Greenberg model.

Speed is defined as a function of density. Optimum speed is reached when the traffic flow

level reaches capacity. The Greenberg model satisfies its boundary conditions when the

density is approaching the jam density. Unlike the Greenshields model, the boundary

conditions of the Greenberg model are not satisfied as k approaches zero. Therefore the

Greenberg model is only useful for modeling dense traffic conditions [60]. The Greenberg

Model is contained in Equation 2.24, where c is a constant.

\mu_s = c \ln\left(\frac{k_j}{k}\right)    (2.24)


Substituting Equation 2.16 into Equation 2.24 yields an equation for the flow in terms of the density.

q = ck \ln\left(\frac{k_j}{k}\right)    (2.25)

If q in Equation 2.25 is differentiated with respect to k, and the derivative is set

to zero to solve for the maximum flow, Equation 2.27 is obtained.

\frac{dq}{dk} = c \ln\left(\frac{k_j}{k}\right) - c    (2.26)

\ln\left(\frac{k_j}{k}\right) = 1    (2.27)

Substituting Equation 2.27 into Equation 2.24 shows that c = \mu_s, the

velocity at maximum flow.
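A sketch of the Greenberg model and its maximum-flow condition (hypothetical helpers, not thesis code; c and k_j are the model constants defined above):

```python
import math

def greenberg_speed(k, c, k_j):
    """Greenberg logarithmic speed-density model (Equation 2.24)."""
    return c * math.log(k_j / k)

def greenberg_max_flow(c, k_j):
    """At maximum flow ln(k_j/k) = 1 (Equation 2.27), so k = k_j/e and the
    speed there equals c; by Equation 2.16 the maximum flow is c * k_j / e."""
    return c * k_j / math.e
```

Note that as k approaches zero the logarithm diverges, which reflects why the Greenberg model fails its boundary condition in light traffic and is useful only for dense conditions.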


Chapter 3

Previous Work Related to Simulation Architectures

Special purpose machines have been designed and implemented expressly for the

development and performance enhancement of logic systems [1, 3, 13, 14, 28, 89, 98, 119,

135, 137, 138]. In logic design, simulations are used to verify new projects and to run

fault analysis of these designs. The simulation of logic entails modeling deterministic

behavior; the proposed machine, by contrast, simulates non-deterministic behavior. Additionally, traffic exhibits a wider variety of dynamic behavior than a limited set of deterministic logic functions can express. Recent research has begun to explore the application of parallel processing to real-time traffic simulation. Both microscopic [110] and macroscopic [36, 81] approaches have been explored.

The proposed non-deterministic simulation architecture differs from the existing

body of published research. The simulator architecture leverages the locality of data

inherent in traffic simulation. The architecture mitigates the Von Neumann bottleneck

by embedding simulation instructions in reconfigurable logic. Distributed processing is

developed by applying both a global network of processing elements and the implemen-

tation of embedded instructions as pipelined, systolic arrays within reconfigurable logic.

Unlike conventional general-purpose processors, where data and instructions are fetched from memory, computation is performed, and the data is returned to memory, in the proposed architecture data flows from functional unit to functional unit, accomplishing computation as part of its transport process. The data channel is pipelined, allowing

concurrent computation of different stages of the simulation within each processing el-

ement. Multiple processing elements are networked together in a scalable architecture

facilitating further parallelism.
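The flow-through style of computation just described can be loosely illustrated in software: each stage hands its results directly to the next, so data is transformed in transit rather than shuttled to and from memory. This is only an analogy with invented stage names, not the reconfigurable-logic implementation:

```python
# Loose software analogy of flow-through computation: data moves from
# stage to stage, being transformed in transit. Stage names are invented.
def stage_generate(items):
    for x in items:
        yield x                  # source stage emits raw data

def stage_transform(upstream):
    for x in upstream:
        yield x * 2              # a per-element computation performed in transit

def stage_accumulate(upstream):
    total = 0
    for x in upstream:
        total += x               # sink stage reduces the stream
    return total

result = stage_accumulate(stage_transform(stage_generate(range(5))))
```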

The chapter on related work is divided into several subsections. Section 3.1 de-

scribes simulation machines which were created to analyze and test deterministic logic

designs. Unlike the architecture presented in the thesis, logic simulators are determinis-

tic. Section 3.2 reviews the architectures of two machines, the Splash and the ArMen,

which are not specifically logic simulators, but whose designs are relevant to the archi-

tecture presented in this thesis. The Splash design provides good background for the

thesis processing element architecture. Section 3.3 describes the Rollback Chip, which

is a piece of hardware used for state saving during optimistic simulation. Section 3.4

describes three papers on random number generator hardware implementations. Finally,

Section 3.5 describes a reduction bus. The thesis develops a unique reduction bus model

which synchronizes the network of processing elements during either time or event-driven

operation.

3.1 Logic Simulation Machines

Section 3.1 reviews deterministic simulation machines which have been constructed

to simulate and verify logic circuits and designs.


3.1.1 Boeing Computer Simulator

The Boeing Simulator architecture is illustrated in Figure 3.1. The system went

into operation in June of 1970. Categorized by Figure 3.2, the Boeing Simulator is an

event-driven, fine-grained, conservative simulator. The simulator’s primary purpose is

logic simulation. The Boeing Simulator was initially used to perform an architectural

study of a navigation processor, logic design support in the development of general-

purpose computers and special-purpose processors, as well as fault tests of manually

generated logic circuit boards [138]. The simulator contains 4 independent logic pro-

cessors which implement an event-driven logic simulation algorithm. There is a paged

memory which consists of 16K 48-bit words. A memory switch exists to allow shared

memory access. The processor’s architecture is composed of 4 parts, as illustrated in

Figure 3.3. There are 3 scratch pad memories which store the logical equations (ES -

equation store), device delays (D - delay), and the events (E). The fourth part is the

logic evaluation hardware.

To initialize the simulation, the host partitions the logic design simulation among

the 4 processors. The host generates and stores the equations which specify the oper-

ations of the primitive logic elements. A maximum of twelve gates can be represented

by each equation. The host also stores the gate connectivity information. During each

simulation cycle, the smallest time delay is saved and used for the next simulation time

increment. The basic simulation flow is divided into the following steps [20]:

1. The ES points to the active equations and acts as an event queue.

2. For each active equation, the delay value, D, is reduced by the current time step.


Fig. 3.1. The Boeing Simulator Architecture. The four logic processors of the Boeing Simulator operate independently, using an event-driven simulation algorithm. The crossbar switch allows the host to access internal state information from each processor's core memory. The communications loop provides interprocessor connectivity and connects the logic processors with the host interface. The loop also allows the host to access processor registers and scratch pad memories.


Fig. 3.2. Simulation Classifications [16]. Computer-generated simulation can be divided into two main classes, time-driven simulation and event-driven simulation. Time-driven simulation steps through each time cycle. Event-driven simulation skips those time cycles which lack events. Event-driven simulations may take optimistic or conservative approaches. Conservative simulations follow a strictly causal approach, where each event can execute only after all events with earlier timestamps have executed. Optimistic simulations may allow simulations to proceed in a non-causal fashion, but must ensure that the simulation results are causal. Both time- and event-driven simulation have two degrees of granularity, coarse and fine. Coarse-grained simulations tend to group collections of events together and evaluate the collections as a unit. In a fine-grained simulation, each event is simulated as a discrete unit.
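The time-driven versus event-driven distinction in Figure 3.2 comes down to how the clock advances. A minimal illustrative sketch (not modeled on any particular machine in this chapter):

```python
import heapq

# Time-driven: step through every tick, whether or not anything happens.
def time_driven(events, t_max):
    """events: dict mapping time -> list of actions."""
    fired = []
    for t in range(t_max):                 # visits even empty time cycles
        for action in events.get(t, []):
            fired.append((t, action))
    return fired

# Event-driven: jump directly from one pending event to the next.
def event_driven(events):
    """events: list of (time, action) pairs."""
    queue = list(events)
    heapq.heapify(queue)                   # pending events ordered by timestamp
    fired = []
    while queue:
        t, action = heapq.heappop(queue)   # skips straight to the next event
        fired.append((t, action))
    return fired

sparse = {3: ["a"], 900: ["b"]}
assert time_driven(sparse, 1000) == event_driven([(3, "a"), (900, "b")])
```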


3. When an equation’s delay is reduced to zero, the equation is evaluated. The result-

ing value is propagated to all the connected components according to a connectivity

list. The ES memory is updated.

4. Equations whose output values change are handled as follows:

• The equation output field is updated and written back to memory.

• The corresponding logic delay is fetched from the D scratch memory.

• The new delay and logic values are stored in the E scratch memory.
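The four steps above amount to a delay-countdown event loop: the smallest remaining delay sets the next time increment, and equations whose delay reaches zero are evaluated. A hypothetical software rendering (all names invented; this is not the Boeing hardware):

```python
# Hypothetical rendering of the Boeing delay-countdown cycle.
def boeing_cycle(active):
    """active: dict of eq_id -> remaining delay. Returns (fired ids, new dict)."""
    if not active:
        return [], {}
    step = min(active.values())            # smallest delay = next time increment
    remaining = {eq: d - step for eq, d in active.items()}
    fired = [eq for eq, d in remaining.items() if d == 0]    # evaluate these now
    pending = {eq: d for eq, d in remaining.items() if d > 0}
    return fired, pending

fired, pending = boeing_cycle({"eq1": 3, "eq2": 5, "eq3": 3})
assert sorted(fired) == ["eq1", "eq3"] and pending == {"eq2": 2}
```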

The Boeing Simulator was constructed of approximately 28K TTL packages, 40K

of fast memory, and 65K of core memory [138]. The system ran at a clock speed of 10

MHz. The simulator could model 36K elements, or 48K 2-input gates. Elements can

consist of flip-flops, gates, and one-shot devices specified in equation form. Its estimated

performance is 1 million gate evaluations per second.

To its credit, the Boeing Computer Simulator is the first [78] of the logic simulators

to be built and operated. Boeing recognized the limitations of using general-purpose

computing architectures for software logic simulators. The Boeing Computer Simulator

was created to assist in the design of digital equipment used in the United States space

program. Their simulator offered Boeing tremendous logic design advantages at that

time [78].

However, the Boeing simulator also had some significant drawbacks, mostly due

to the technologies available at the time of its creation. These drawbacks include:

• The Boeing simulator had non-programmable logic functions, implemented by logic

cards.


Fig. 3.3. The Boeing Simulator Logic Processor. The logic processors contain 16K words of 48-bit, 650-nanosecond core memory which is divided into two banks. The memory is accessed through a local crossbar switch. Three scratch pad memories, the equation store (ES), the device delay (D), and the event scratch pad memories, are all used to store intermediate results. The event computations are performed by the processor hardware.


• The Boeing simulator test patterns were interpreted on a host computer as the

simulation was run. These patterns needed to be transmitted from the host to the

simulator during runtime.

• Logic formulas and designs were entered via punched cards, and the output was delivered to a line printer.

• The Boeing system employed separate processors in a multiprocessor environment.

These processors require inter-processor communication via the Communications

Loop Stations.

• The system required data to be read/written to memory which incurred further

time delays.

3.1.2 The IBM Los Gatos Logic Simulation Machine

The IBM Logic Simulation Machine (LSM) was conceived in 1977 by John Cocke,

Richard L. Malm, and John Schedletsky of IBM’s Thomas J. Watson Research center

in Yorktown Heights, New York. The Los Gatos Logic Simulation Machine is a logic

simulator which can simulate 64,512 logic expressions at a rate of 640 × 10^6 expressions per second [28, 78, 88]. The Los Gatos Logic Simulation Machine was one of the few machines

implemented which was not an event-driven simulation machine. The other time-driven

simulation machines are the Yorktown Simulation Engine [119] and the machine by

Levendel et al [98]. All three machines are fine-grained. The Los Gatos engineering team

was under the incorrect impression that event-driven simulators “monitor all the gates in

a design and from one cycle of simulation time to the next evaluates new outputs for only


those gates that had a change of inputs.” [78] The designers felt that the “disadvantage

in building a hardware event-driven simulator is in the increased complexity and cost due

in large part to the large amount of data that needs to be passed among various parts of

the simulator. This complexity soon results in switching problems and communications

bottlenecks.” [78]

The LSM design approach decided that these hazards could be avoided and rea-

sonable simulation speed could be attained by relying on fast and parallel hardware

techniques. They felt their assumption was justified by the results.

The Los Gatos Logic Simulation Machine is composed of 3 types of processors,

logic processors, array processors, and a control processor. The three types of processors

are interconnected via a crossbar switch. The processors are depicted in Figure 3.4. The

logic processors do the majority of the simulation work. The logic processors fetch, eval-

uate, and store the results of each gate event. The array processors are used to simulate

memory arrays such as Random Access Memory (RAM). The control processor regulates

the operation of the Los Gatos Logic Simulation Machine. The control processor starts,

stops and allows a simulation to be interrupted. The control processor also allows the

host computer to interface with the simulation. The switch in Figure 3.4 is a 64 by 64

crossbar switch which allows all processors to communicate.

The basic structure of a logic processor is depicted in Figure 3.5. The processor

contains an instruction memory, two data memories, a logic unit, and a gate delay value

memory. The instruction memory contains room for 1K of 80-bit instruction words.

Each instruction word contains a function opcode, five input operand addresses, and


Fig. 3.4. Los Gatos Logic Simulation Machine Architecture. The Los Gatos Logic Simulation Machine is composed of 3 types of processors: logic processors, array processors, and a control processor. The three types of processors are interconnected via a crossbar switch. The logic processors do the majority of the simulation work; they fetch, evaluate, and store the results of each gate event. The array processors are used to simulate memory arrays such as RAM. The control processor directs the operation of the machine; it starts, stops, and allows a simulation to be interrupted.


other control information. The input operand addresses represent gate inputs. If a gate

has more than 5 inputs, it is decomposed into a sequence of 5-input functions.

The Logic processor accesses two data memories, which each contain 2,048 2-bit

words. One memory is called the input data memory and the other is the output data

memory. The function memory contains 1,024 64-bit words. The functions described by this table are actually 6-input functions, so that the system allows internal chaining of the 5-input functions. The function memory words are 64 bits wide because there are 2^6 = 64 possible input combinations for these internal gates, and hence 64 possible outputs.

The delay value memory stores a table of rise and fall delay times associated with

each instruction in the instruction memory. Thus the delay value memory contains 1K

words each of which is 16 bits in length. Of the 16 bits, 8 bits are devoted to rise time

and 8 bits are devoted to the function’s fall time. These delay times range from 1 to

256 units of delay. The minimum delay is 1 unit. If “... more than one delay unit is

specified for an instruction in the instruction memory, the output of the logic unit is not

written to the output memory but is retained until the specified number of time steps

have elapsed.” [28]

The model and test patterns are downloaded and distributed among the Los Gatos

Logic Simulation Machine logic processors. When running, an instruction is fetched from

the instruction memory and the inputs are fetched from the data input memory according

to the data addresses in the instruction. The function code and the data are passed to

the logic unit where they are evaluated. The result is sent to the inter-processor switch

which can be latched into the output data memory. This fetch, evaluate, and store cycle

is called an instruction step. Instruction steps are executed in a seven stage pipeline,


Fig. 3.5. The IBM Los Gatos Logic Simulation Machine. Logic processors each contain an instruction memory, two data memories, a logic unit, and a gate delay value memory. The instruction memory contains room for 1K of 80-bit instruction words. Each instruction word contains a function opcode, five input operand addresses, and other control information. The input operand addresses represent the simulated gate inputs. If a gate has more than 5 inputs, it is decomposed into a sequence of five-input functions.


one instruction following another in the pipeline and each instruction completing every

100 nanoseconds [28].
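The quoted throughput can be sanity-checked with simple arithmetic. The text gives only the 100 ns per-step completion time, so the logic-processor count of 64 used below, chosen to match the 64-by-64 crossbar, is an inferred assumption:

```python
# Back-of-the-envelope check of the quoted LSM rate. The processor count
# is an inferred assumption; only the 100 ns step time is stated.
step_time_ns = 100                            # one instruction step completes every 100 ns
per_processor_rate = 1e9 / step_time_ns       # 10 million instruction steps/sec per processor
assumed_processors = 64                       # assumption: matches the 64x64 crossbar
aggregate = per_processor_rate * assumed_processors
assert aggregate == 640e6                     # the quoted 640 x 10^6 expressions per second
```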

The designers should be commended for their bold attempt at an unusual simulation approach. The time-driven approach simplifies the machine. The disadvantages

of this system, however, are:

• The system simulates gates even during non-event times.

• To simulate delays, the system actually holds the function outputs until a specified

number of delays has passed. This wastes simulation time and slows the system

down.

3.1.3 Barto and Szygenda’s Hardware Simulator

A 1980 PhD dissertation at the University of Texas detailed the implementation

of a special high performance simulation machine architecture designed to perform high

speed logic simulation. This special purpose simulation machine hardware is based on

distributed processing. The machine is an event-driven, fine-grained, conservative logic

simulator. The research notes that some problems are not easily handled by basic Von

Neumann computer architectures. Problems involving large data structures which must

be moved, searched, etc. are often not aided by the underlying computer architecture. A

programmer finds that the hardware provides little or no assistance in setting up the data

structures or data flows required by the problem [13]. Large databases, and event-driven

simulators are examples of problems which fall into this category. Several efforts have attempted to define architectures for databases; these incorporated associative memory and intelligent disk systems. The hardware simulator research

presents a possible architecture for logic simulation [13]. The architecture of the machine

is illustrated in Figure 3.6. The logic simulator described here supports a table based,

event-driven simulator. The logic simulator performs according to the algorithm of

Table 3.1.

The evaluation phase consists of two tasks. First, it searches the activity flags to determine which gates are active during a particular simulation time increment. Second, it calculates the gate or element output changes indicated by those flags, determines the required gate changes, and schedules them to occur at the appropriate times. When the evaluation phase is finished, the update phase commits the changes into the simulation.

The update phase consists of two concurrent tasks. The update phase propagates

the signals through the gates using the scheduling information generated during the

evaluation phase [13]. Gates whose output changes also have their activity flags set

during the update phase. The activity flags indicate which gates are active in the current

simulation cycle. Those gates with flagged activity indicators will be evaluated during

the evaluation phase.

The Event Queue Processor (EQP) schedules events which need to be evaluated

in future simulation cycles in the Event Queue Memory (EQM). The EQP maintains

the EQM as linked lists of events. Each list contains a gate descriptor and its events

are sorted in chronological order. The scheduled events contain a field, called the time

count-down (TC) field, which stores the amount of time remaining until the event is


process SIMULATE
begin
   Load and initialize simulation data;
   STI = 0;   { STI = Simulation Time Increment }
   while (STI < STImax) do
   begin
      EVAL phase;
      UPDATE phase;
      STI = STI + 1;
   end
end.

Table 3.1. Barto's Simulator Algorithm. Barto's simulator runs using the algorithm described in this table. The algorithm consists of the evaluation phase and the update phase. The evaluation phase searches through its list of gates to determine which need evaluation during this simulation cycle. The update phase performs the required gate evaluations and propagates the results.


to occur. On each pass through the update processor, this TC field of the gate event

descriptor is decremented and when the TC field equals zero, the event occurs.
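A hypothetical sketch of the TC mechanism just described: each update pass decrements every pending event's TC field, and events whose TC reaches zero fire in that cycle (names are invented for illustration):

```python
# Hypothetical sketch of Barto's TC (time count-down) event handling: each
# update pass decrements every pending event's TC; zero means "fire now".
def update_pass(pending):
    """pending: list of (gate, tc) pairs. Simulates one time increment."""
    firing, still_pending = [], []
    for gate, tc in pending:
        tc -= 1
        if tc == 0:
            firing.append(gate)              # the event occurs this cycle
        else:
            still_pending.append((gate, tc)) # keep counting down
    return firing, still_pending

firing, rest = update_pass([("g1", 1), ("g2", 3)])
assert firing == ["g1"] and rest == [("g2", 2)]
```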

Barto discusses some of the reasons for the low performance of software simulators.

The memory system of a general purpose computer has essentially no structure at all. A

software simulator operating on a general purpose machine must impose a structure on

the memory and move the data. To accomplish the data manipulations, the program’s

instructions must be fetched from memory and then executed. It may take several

instructions to move one word of data, since its present and new addresses must be

calculated, the word must be loaded from memory into a register, then stored somewhere

else in memory. The time required for this process will depend on the efficiency of the

computer’s architecture. Still, no general purpose machine has an architecture which is

natural for logic simulation [13].

To its credit, this machine is one of the earliest proposed simulation engines. The

system has the following advantages and disadvantages:

1. Advantages

• It is a specialized simulation engine, not composed of general-purpose processors.

• This machine is an event-driven simulation machine.

2. Disadvantages

• The architecture is not expandable: the processing power that can be focused on the simulation problem is fixed and limited.


Fig. 3.6. Barto's Logic Simulator Architecture. The logic simulator architecture is composed of three processors: the update, the evaluation, and the event queue processors. The system also contains 5 memories: the status and data memory (SDM), the fan-in memory (FIM), the fan-out memory (FOM), the activity flag memory (AFM), and the event queue memory (EQM). The AFM contains flags indicating which simulation gates are currently active. The AFM is used in conjunction with the FOM to direct gate evaluation results. The EQM serves as the simulation's event list and is maintained by the event queue processor. The majority of the simulation work is performed by the evaluation and update processors during the update phase.


3.1.4 Abramovici’s Logic Simulation Machine

Abramovici et al developed a distributed parallel processing architecture for han-

dling logic simulation [1]. The architecture, proposed in December of 1981, is based on

pipelining and concurrency. The simulator is an event-driven, fine-grained, conserva-

tive machine. The model is capable of handling both simple gates and more complex

functions. The employed timing analysis can handle more than simple unit gate delays.

Separate Processing Units (PU) are dedicated to specific tasks of the algorithm,

such that the entire logic simulation algorithm is executed as a result of the cooperation

between the individual PUs working concurrently [1]. The logic simulation tasks are

pipelined.

Each process illustrated in Figure 3.7 is assigned to a process unit. Concurrency

is achieved by pipelining the dataflow. The Event List Manager PU receives the future

events and event times or event cancellations from the scheduler PU. The Event List

Manager orders the events in causal order in the Event List Memory. When all the

events scheduled at a particular time have finished processing, the scheduler signals the

Event List Memory PU to advance to the next time cycle. The Event List Memory PU

then activates the Current Event Processor. The Current Event Processor receives a new

time value and a pointer to the first event on the list of events for that time. When the

Current Event Processor finishes the list, the Event List Memory PU issues a finished

signal.

The Current Event Processor retrieves each event from the list of events for a

particular event time from the Event List Memory. Each event is sent to the Model Access


Fig. 3.7. Logic Simulation Machine. The simulation machine architecture, with its refinement for simple evaluations, is illustrated. The Event List Manager orders the events received from the Scheduler in increasing time order and stores them in the Event List Memory. The Current Event Processor retrieves events in order from the Event List Memory and sends them to the Model Accessing Unit and Simple Configuration Processor for evaluation. The Model Accessing Unit finds all the event receivers and sends the receiver addresses to the Simple Configuration Processor. The Simple Configuration Processor forwards the gates impacted by the events to the Evaluator. The results after evaluation are sent to the Scheduler, which delays transmission of the results by the appropriate delay for each gate.


Unit and the Simple Configuration Processor (Simple Config Proc). The Current Event

Processor can perform user interaction and simulation control tasks such as processing

user-set break points and system state monitoring.

The Model Accessing Unit retrieves each element which receives the results from

the current event. This fanout list is stored in the Model Accessing Unit’s local memory.

The fanout list entries propagate to the Simple Configuration Processor. When the

fanout list from the current event is depleted, the Model Accessing Unit forwards a done

message to the Simple Configuration Processor PU.

Both the Function Configuration Processor (Func Conf Proc) and the Simple

Configuration Processor receive the current event from the Current Event Processor and

the fanout list elements from the Model Accessing Unit. The Simple Configuration Pro-

cessor maintains a definition of each type of element which is forwarded to the Evaluation

(Eval) unit.

The Functional Configuration Processor forwards large functional element param-

eters to the appropriate Functional Evaluator (FEV) in which the function is statically

assigned. Small functional elements may be dynamically assigned to idle FEVs. The

Functional Configuration processor must transmit the configuration of the dynamically

configured FEV.

The Evaluation Unit (Eval) receives the configuration and element types to be

evaluated from the Simple Configuration Processor [1]. The Evaluator performs the

event evaluations and sends the results to the Scheduler. The Evaluation Unit may also

send a cancellation event message to the Scheduler. Cancellations occur when an element

is re-evaluated changing the results of a previously scheduled event.


Finally, the results from the Evaluator and the FEVs are forwarded to the Sched-

uler. The Scheduler retrieves the event delay (service time) from its local memory.

Delays may be associated with a type of element or be specific to a particular circuit

element. The Scheduler determines the time of an event by adding its delay to the cur-

rent simulation time and then the new event and its simulation time are forwarded to

the Event List Manager. Event cancellations are also forwarded to the Event List Man-

ager. The PU operations on the events are quite simple, involving either data transfer or

logic/arithmetic. The designers selected micro-programmable processors to serve as the

PUs. Microcode could avoid substantial software overhead involved in fetching and de-

coding macro-instructions [1]. Any changes to the number of logic values used, the delay

modeling, or the timing analysis can be incorporated by changing the microinstructions.

The system architecture need not be modified.
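The Scheduler's timestamping rule is simple enough to state in a few lines. The sketch below is hypothetical (the delay table and all names are invented); it follows the description above, in which an event's time is its delay added to the current simulation time, and a re-evaluation can cancel a previously scheduled event:

```python
# Hypothetical sketch of the Scheduler's rule: event time = current
# simulation time + element delay; a cancellation drops a pending event.
delays = {"AND2": 3, "OR2": 2}          # per-element-type delays (illustrative)

def schedule(event_list, now, element, element_type, value):
    t = now + delays[element_type]      # timestamp the new event
    event_list.append((t, element, value))
    event_list.sort()                   # Event List Manager keeps causal order
    return event_list

def cancel(event_list, element):
    # a re-evaluation changed a previously scheduled result: drop its event
    return [e for e in event_list if e[1] != element]

evts = schedule([], now=10, element="g7", element_type="AND2", value=1)
assert evts == [(13, "g7", 1)]
assert cancel(evts, "g7") == []
```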

Some of the machine’s advantages and disadvantages are as follows:

1. Advantages

• The processors are microcoded to tailor the instruction set architecture to the

simulation problem tasks.

2. Disadvantages

• The architecture cannot be easily scaled to focus more processing power on

the simulation problem.


3.1.5 Levendel, Menon, and Patel’s Logic Simulator

The following logic simulator was developed by Y. H. Levendel, P. R. Menon and S.

H. Patel. The simulator served as the basis of Patel’s 1982 PhD dissertation at the Illinois

Institute of Technology. The simulator differs from the work by Barto and Abramovici in

that multiple processors do not perform dedicated tasks. Barto’s machine, described in

Section 3.1.3, contains processors which are dedicated to event queue management, gate

evaluation, and fan-out updates (see Figure 3.6). The machine is an event-driven, fine-

grained, conservative logic simulator. The machine developed by Abramovici contains

processors dedicated to circuit evaluations and event list management (see Figure 3.7).

In Levendel’s machine, a host pre-processor distributes the sub-circuits of a design among

various homogeneous processors. The modularity of this design allows an easy increase

of computational power to be assigned to the simulation. The architecture is illustrated

in Figure 3.8

This simulator consists of processors p1 to pn. The circuit to be simulated is

referred to as the target circuit. The target circuit is partitioned into blocks ai through

aj . The circuit connections between blocks ai and aj are designated as bij . It should

be noted that blocks are not necessarily circuit clusters, that is to say that the elements

in a block can be from disjoint portions of the circuit. Each circuit block ax is mapped

onto a processor py and is then called sub-circuit cz as illustrated in Figure 3.9. During

the simulation, each sub-circuit, cz , is simulated independently. Different sub-circuits

become active as the signals proceed from the primary inputs to the primary outputs. As

the simulation progresses, data is carried between sub-circuits ci and cj changing the logic


Fig. 3.8. Levendel's Logic Simulator Architecture. The logic simulator architecture includes a communications structure, a controlling processor, several simple subordinate evaluators for simulating gate-level blocks, and several functional subordinate evaluators for simulating functional blocks of the design under test. A cross-point matrix is used to connect the controlling processor with the simple subordinate evaluators. The functional subordinate evaluators are connected to the same cross-point matrix through a bus interface unit and a parallel bus [98].


values of bij . The interconnect between the original sub-circuit blocks, bij , is now referred

to as dij , which is the datapath between sub-circuits ci and cj . The controlling processor

and the Simple Evaluators are connected via a cross-point matrix. The Functional

Evaluators are connected to the cross-point matrix through a bus interface unit and a

parallel bus. The bus interface is shown in Figure 3.10. The parallel bus has sufficient

speed for the Functional Evaluators according to a timing calculation performed within

the study.

Concurrency during simulation is achieved by allowing the sub-circuits ci and

cj to be evaluated independently. Different circuits will become active as signal values

proceed from the primary inputs to the primary output [98].

The simulator is configured to consist of one controlling processor and a multi-

tude of subordinate processors which are interconnected by a communications structure.

Processors pi and pj in Figure 3.9 are both subordinate processors. The sub-circuits ci

and cj reside in the subordinate processor’s memories.

The system works as follows. At the beginning of each simulation cycle, the con-

trolling processor sends any primary inputs required to each subordinate processor using

the communication structure. The controlling processor then issues a start signal to the

subordinate processors ordering them to begin the next simulation cycle. The subordi-

nate processor may generate events which will need to be forwarded to other processors

for future cycles of the simulation. In the case of logic simulations, a change in a logic value on an output signal line becomes a scheduled event. In this system, only events scheduled for the immediately following simulation time cycle are transferred between the subordinate processors, in order to reduce communications overhead. Therefore, the

Fig. 3.9. Mapping Circuit Blocks ai and aj to Processors pi and pj The simulator consists of processors p1 through pn. The target circuit is subdivided into blocks a1 through an. The circuit connections which span two blocks are designated as bij. Each circuit block, aq, is then mapped onto a processor, pq, as sub-circuit cq. The original interconnections, bij, between the circuit blocks ai and aj are mapped into the datapath dij. The blocks aq may contain disjoint pieces of the original circuit; they are not necessarily clustered sections of the original circuit.

Fig. 3.10. Interface Between the Data Sequencers and the Time-Shared Parallel Bus This figure illustrates the communications signals required between the data sequencers and the time-shared parallel bus. When the Output Data Sequencer (ODS) has data to transmit, it pulls the Request to Send (RTS) line high. The data sequencer receives permission to transmit when it receives a pulse back on the bus grant line. The ODS then transmits all the data in its Output FIFO Buffer (OFB), and the receiver stores the inbound data in its local Input FIFO Buffer (IFB). The ODS then sets the RTS line low, which releases the bus. All requesting bus users have equal priority.


scheduled time need not be explicitly transmitted, allowing messages to accumulate and thereby reducing the number of communications overhead bytes transmitted. Data

transferred to the controlling processor from the subordinate processors consists of the

primary outputs and any user requested data.

When the subordinate processors finish processing their sub-circuits for the sim-

ulation cycle, they each inform the controlling processor. When all the subordinate

processors report their completion, and the controlling processor has also finished trans-

ferring its primary inputs scheduled for the next simulation cycle to the subordinate

processors, the controlling processor broadcasts a start cycle to the subordinate proces-

sors beginning the next simulation cycle.
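The cycle protocol just described can be sketched in software. The class and method names below, and the toy event-forwarding behavior, are illustrative assumptions and not part of Levendel's design [98]:

```python
class Subordinate:
    """One subordinate processing unit: it simulates its sub-circuit for a
    cycle, then reports done. All names here are illustrative assumptions."""
    def __init__(self, index, n_procs):
        self.index, self.n_procs = index, n_procs
        self.inbox = []            # events delivered for the coming cycle
        self.done = False

    def run_cycle(self):
        # Toy "simulation": forward each value, incremented, to the next
        # processor as an event for the immediately following cycle.
        out = [((self.index + 1) % self.n_procs, v + 1) for v in self.inbox]
        self.inbox = []
        self.done = True
        return out

class Controller:
    def __init__(self, subs):
        self.subs = subs

    def cycle(self, primary_inputs):
        # 1. Send the primary inputs over the communications structure.
        for proc, val in primary_inputs:
            self.subs[proc].inbox.append(val)
        # 2. Broadcast "start"; each subordinate simulates independently
        #    (sequentially here, concurrently in the real machine).
        pending = []
        for s in self.subs:
            s.done = False
            pending += s.run_cycle()
        # 3. All subordinates have signalled "done"; deliver the events
        #    they scheduled for the immediately following cycle.
        assert all(s.done for s in self.subs)
        for proc, val in pending:
            self.subs[proc].inbox.append(val)

subs = [Subordinate(i, 2) for i in range(2)]
ctrl = Controller(subs)
ctrl.cycle([(0, 5)])      # processor 0 receives primary input 5
```

In the real machine step 2 is a broadcast start signal and step 3 is the done line; the sequential loop here only models the barrier semantics, not the concurrency.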

The controlling processor is depicted in Figure 3.11. The controlling processor

contains local memory. Each subordinate unit also consists of a processing unit and local

memory. The controller and the subordinate processors both have one input and one

output FIFO buffer to handle inbound and outbound data messages. They also each

have one input and one output data sequencer for the communications interface.

The subordinate processor configuration is illustrated in Figure 3.12. The Pro-

cessing Unit (PU) is a 16-bit microprocessor. The input and output data sequencers

are either specially designed Application Specific Integrated Chips (ASIC) or single chip

microcomputers. The subordinate PU evaluates the circuit elements or functions. The

PU contains the circuit blocks to be simulated. The Output Data Sequencer (ODS) is

isolated from the PU by a FIFO buffer. The ODS allows the subordinate unit to send

data whether or not that subordinate’s PU is active during a simulation cycle. The Input

Fig. 3.11. The Controlling Processor Unit Configuration The controlling processor serves as the interface between the general-purpose host computer and the simulator [98]. The controlling processor synchronizes the subordinate processors, maintains the simulation clock, supplies the subordinate processors with their primary inputs, and gathers the primary output values from the simulation. The controlling processor unit also maintains the user-requested simulation monitor values.

Fig. 3.12. Subordinate Processor Unit Configuration The subordinate processing unit (PU) represents both the Simple Evaluator processors and the Functional Evaluator processors. The PU is a general-purpose 16-bit microprocessor which evaluates the circuit function or element. Each PU receives a block of the target circuit when the target circuit is initially partitioned. The input and output data sequencers establish connections via the communications structure and transfer data to and from the FIFO buffers, respectively. Data is transferred to and from other subordinate processors or the controlling processor, which are also connected to the communications structure.


Data Sequencer (IDS) behaves and is configured in much the same way. Thus the PU runs two concurrent processes: the simulation process and the communication process.

The PU stores data destined for the controller or other subordinate PUs in the

output FIFO buffer (OFB). The ODS will request an appropriate channel on the com-

munications structure if there is data which must be transferred from the OFB. When

granted access to the communications structure, the ODS transmits the data across the

communications channel. Data received by an IDS is placed in the Input FIFO Buffer

(IFB). To separate data from different simulation cycles, the data in the OFB is sepa-

rated by end of data (EOD) markers. The EOD marker also allows the PU to write new

data into the FIFO before the ODS has finished transferring data out. The same system

is implemented in the IFBs.

Two dedicated signal lines run between the controller and the subordinate units,

synchronizing the simulation. The controlling processor signals the subordinate proces-

sors using the start signal, and the subordinate processors signal the controller using the

done signal. The done line is pulled active when all the subordinate processors have fin-

ished their individual processing. The asserted start signal from the controller indicates

that the subordinate processors can initiate processing for the next simulation cycle.

The start signal causes all subordinate processors to load an EOD marker into the IFBs.

The EOD marks the end of input data arriving from other subordinate processors and

the controller during the current simulation cycle. When the PU reaches the EOD flag

in its IFB, then that PU has loaded all the required data for this simulation cycle, and

the PU may now begin processing the events. The start signal also alerts the ODS to

begin sending out data for the next simulation cycle.


When the PU has finished processing events for a simulation cycle, the PU loads

an EOD marker into the OFB and begins to process the next simulation cycle. When

the ODS encounters the EOD marker in the OFB, the ODS has finished transferring

its data for the current cycle. The ODS then signals the controlling processor using the

done line.

The processor unit (PU) contains three important data structures which are rel-

evant to its operation. The three data structures are the:

• Sub-Circuit Description Table

• Activity List

• Event Queue

The sub-circuit description table contains the requisite information needed to

evaluate and process the PU's assigned sub-circuit. For each element in that sub-circuit,

the table contains the value, type, delay, input status word pointer, signal values on the

fan-in lines, and the corresponding fanout list which handles signals bound for other

subordinate processors. The external fanout list requires more space than the internal

fanout list, because the value, destination processor, and element index information must

be stored for the external fanout.
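A plausible in-memory layout for one table entry is sketched below. The field names are assumptions drawn directly from the list above, and the wider external fanout record reflects the extra destination information the text describes:

```python
from dataclasses import dataclass, field

# Sketch of one sub-circuit description table entry. Field names are
# illustrative; the text lists a value, type, delay, input status word
# pointer, fan-in signal values, and internal/external fanout lists.

@dataclass
class ExternalFanout:
    value: int
    dest_processor: int   # external fanout must also name the destination
    element_index: int    # ... and the element index on that processor

@dataclass
class ElementEntry:
    value: int
    elem_type: str
    delay: int
    input_status_ptr: int
    fanin_values: list = field(default_factory=list)
    internal_fanout: list = field(default_factory=list)   # local indices only
    external_fanout: list = field(default_factory=list)   # ExternalFanout records

e = ElementEntry(value=0, elem_type="NAND", delay=2, input_status_ptr=0x40,
                 fanin_values=[1, 1],
                 internal_fanout=[7],
                 external_fanout=[ExternalFanout(0, dest_processor=3,
                                                 element_index=12)])
```

The asymmetry is visible here: an internal fanout entry is just a local index, while an external entry carries the value, destination processor, and remote element index.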

The controlling processor connects the simulator to the general purpose host plat-

form. The controlling processor is illustrated in Figure 3.11. The controlling processor

maintains the simulated time, synchronizes the subordinate processors, supplies primary inputs to the subordinate processors, and gathers primary outputs from them.


The controlling processor is similar to the subordinate units. It also contains a cen-

tral processing unit (CPU) with local memory, input and output FIFO buffers (IFB

and OFB), and input and output data sequencers (IDS and ODS). The controller is

connected to the subordinate processor units via the communications structure. The

controller initiates processing for each simulation cycle by issuing the start signal. When

all the subordinate processors have reported that they are finished, the controller sets the done signal, indicating that the current simulation cycle is over.

In the controlling processor, the start signal is also wired to the controller’s ODS

unit. The start signal tells the ODS to begin transferring data for the cycle. The

ODS transmits its data across the communications structure until it encounters the

EOD marker, at which point the ODS signals the controlling CPU that it has finished transferring data by setting the done signal.

The communications structure is divided into two sub-structures. The first structure is a time-shared parallel bus which connects to the slower Functional Evaluators. The interface between the parallel bus and the cross-point matrix is illustrated in Figure 3.13. The second communications structure is the cross-point matrix, which connects directly to the faster Simple Evaluators and to the parallel bus.

The interface between the time-shared parallel bus and the data sequencers is

illustrated in Figure 3.10. When a Functional Evaluator’s ODS has data to send, it sets

the request to send (RTS) signal high. The bus control grants permission to the ODS

by signalling on the bus grant line. The ODS then sends all of its data to the receiving

IDS. The receiving IDS stores the data in its local IFB. When finished, the ODS sets

Fig. 3.13. Interface Between the Parallel Bus and the Cross-Point Matrix The study [98] demonstrated that although a cross-point matrix is preferable for communications between the Simple Evaluators, a parallel bus is cost-effective and sufficient for communications to and from the Functional Evaluators. The Bus Interface Unit is designed to transfer data between the cross-point matrix and the parallel bus. Data Sequencer 1 transfers data from the Functional Evaluators connected to the parallel bus to the Simple Evaluators and the controlling processor, which are connected to the cross-point matrix; the parallel data from the bus must be transmitted serially across the cross-point matrix. Data Sequencer 2 sends data from the cross-point matrix to the parallel bus, again translating the input serial data to output parallel bus data.


the RTS line low releasing the time-shared parallel bus. All subordinate processors have

equal priority on the bus.

The data transferred between subordinate processors or to the controlling pro-

cessor consists of events scheduled for the next simulation cycle. The data sent to the

controlling processor consists of the return address of the sending subordinate processor,

the element number (either a primary output or user requested monitor point) and the

element value. A separate request line, request to send to the controller (RTSC), is used

to address the controlling processor. When transmitting to the controlling processor,

the sending ODS address lines contain the sending subordinate unit’s address.

The interface between the Simple Evaluators’ data sequencers and the cross-point

matrix is illustrated in Figure 3.14. To send data, the ODS puts the destination address

on the address lines and signals on the RTS line. If the destination is not busy, the

transfer request is granted. The ODS transmits its data serially to the receiving IDS

which stores the data in its local IFB. The data ready signal is used to indicate the

presence of data at the IDS. If an access request is denied, the data is stored locally and

attempts are made to send blocked data later. Access requests to the cross-point matrix

are denied if the destination is pre-occupied with another incoming call. The call block

is controlled by use of the busy/grant line.
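The deny-and-retry behavior described above can be sketched as follows. The class names, and the simplification that a transfer completes in a single call, are assumptions for illustration:

```python
from collections import deque

# Sketch of the blocked-transfer retry: if the destination port is busy
# with another incoming call, the data is held locally and retried later.

class CrossPoint:
    def __init__(self, n):
        self.busy = [False] * n
        self.ifb = [deque() for _ in range(n)]   # per-port input FIFO buffers

    def request(self, dest):
        """RTS: grant only if the destination is not in another call."""
        if self.busy[dest]:
            return False
        self.busy[dest] = True
        return True

    def transfer(self, dest, data):
        self.ifb[dest].append(data)
        self.busy[dest] = False   # release when the serial transfer ends

class Ods:
    def __init__(self, matrix):
        self.matrix = matrix
        self.blocked = deque()    # data whose access request was denied

    def send(self, dest, data):
        if self.matrix.request(dest):
            self.matrix.transfer(dest, data)
        else:
            self.blocked.append((dest, data))   # store locally, retry later

    def retry_blocked(self):
        for _ in range(len(self.blocked)):
            self.send(*self.blocked.popleft())

m = CrossPoint(2)
ods = Ods(m)
m.busy[1] = True          # destination pre-occupied with another call
ods.send(1, "x")          # denied: data stored locally
m.busy[1] = False
ods.retry_blocked()       # re-attempted transmission now succeeds
```

The key design point mirrored here is that a denial never stalls the sender: the ODS parks the data and moves on to its next destination.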

The machine by Levendel has several advantages and disadvantages:

1. Advantages

• The presented machine is scalable, and additional processing hardware can

be effectively added to increase the simulation execution speed.

Fig. 3.14. Interface Between the Data Sequencers and a Cross-Point Matrix This figure illustrates the communications signals required between the data sequencers and a cross-point matrix. When the Output Data Sequencer (ODS) has data to transmit, it puts the destination address on the address lines and pulls the Request to Send (RTS) line high. If the destination is not busy, the matrix control grants the request, and the ODS sends the data across the cross-point matrix channel serially. The receiver stores the inbound data in its local Input FIFO Buffer (IFB). The Data Ready signal line indicates the presence of data. If the destination is busy when the ODS attempts a data transfer, the data is stored back in the ODS's local memory, and an RTS is performed for the next destination; later, the ODS will re-attempt transmission of the blocked data.


2. Disadvantages

• The system is synchronous and time-driven.

• The machine is based on a multiprocessor architecture.

3.1.6 Megalogican

The Megalogican contains a special purpose computing engine attached to a gen-

eral purpose workstation which is used for logic simulation and design verification [63].

The Megalogican was announced to the public in November of 1983. The machine is an

event-driven, coarse or fine-grained, conservative logic simulator. The system is an 80286

computing platform with 3 bit-slice engines. The bit slice engines connect directly to

dedicated memory and to two neighboring processors through a hardware FIFO queue

forming a three-processor ring [20]. The three connected units are the State Unit, the Evaluation Unit, and the Queue Unit.

The Queue Unit maintains the event queue, or list of simulation events. The Queue Unit receives the results of the Evaluation Unit and provides events in time order

to the State Unit. The State Unit receives the net values from the host and maintains the

gate values along with the connectivity information in a state array. The evaluation unit

takes each logic element’s input values and function and generates the new output value.

Specific tasks were encoded as microcode instructions. Hardware-accelerated simulations

demonstrated a 100 fold speed increase over their software counterparts running on an

80286. The hardware simulator was capable of 100,000 gate evaluations per second and

was capable of handling circuits of 64,000 primitives.

Some of the Megalogican’s advantages and disadvantages are as follows:

Fig. 3.15. Megalogican Architecture The Megalogican is composed of three processors. The Queue Unit provides events in time order to the State Unit. The State Unit gathers and maintains network values along with the circuit connectivity information in a state array. The Evaluation Unit uses each logic element's input values and function to generate the resulting output values.


1. Advantages

• The system has some flexibility due to the use of microcoded functions im-

plemented in the system processors.

• The system was intended to be commercially available to the general public.

2. Disadvantages

• The architecture could not be easily scaled to focus more processing power

on a simulation problem.

3.1.7 The IBM Yorktown Simulation Engine

The IBM Yorktown Simulation Engine (YSE) [119] is a descendant of the Los Gatos Logic Simulation Machine [28, 78, 88]. The machine was proposed sometime

before 1983. The YSE is a special purpose, parallel, programmable computer for logic

gate-level simulation. Like the Los Gatos Logic Simulation Machine, the YSE is also not

event-driven. It is a time-driven, fine-grained logic simulator.

The YSE architecture is also composed of logic processors, array processors, and

a control processor. The logic processor simulates a portion of the total system logic,

up to a maximum of 8K gates per processor. The gates in each processor are simulated

serially, at a rate of 80 ns per gate [119]. The array processors simulate storage devices

such as RAMs and Read Only Memories (ROMs). The control processor provides com-

munication between the YSE and a host machine. The control processor loads the YSE

processors with the necessary simulation data. There is also an inter-processor switch


which connects up to 256 array and logic processors with the control processors during

the simulation.

In the YSE, the logic processors are capable of 3 modes of operation. Two of the

three logic processor internal arrangements can be seen illustrated in Figure 3.16. The

first mode is the unit-delay mode. In the unit-delay mode, each gate has the same delay.

So a combinatorial network of N levels of depth takes N time units to stabilize. The

second mode is referred to as rank-order. In rank-order mode, the gates are connected

so that all equal depth combinatorial networks stabilize in a single time unit. Finally

the third and last mode is called the mixed-mode. In the mixed-mode, the combinatorial

networks are simulated in rank-order mode and storage units are modeled in the unit-

delay mode. A single simulation clock cycle carries out one clock cycle of the simulated

machine. In actual use, the YSE is generally run in mixed mode.

The logic unit in the YSE is simply a RAM access that reads a value from a function table [119]. The YSE's operations thus consist of nothing more than table lookups; there are no conditionals, branches, etc.

In the rank-order mode, the gate instructions must be executed in order. No

gate’s instruction can be executed before those of its predecessor gates. This ordering

prohibits feedback, so it is impossible to conveniently simulate memory [119]. Rank-

order simulation imposes an order on instruction execution which allows any equal depth

combinatorial network to produce the correct output results at the end of each simulation

cycle.
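Rank-order execution amounts to a topological (levelized) ordering of the gates. The sketch below, using an assumed two-gate netlist rather than anything from [119], shows why any equal-depth combinatorial network settles in a single pass:

```python
# Sketch of rank-order evaluation: gates are levelized so that every gate
# executes after all of its predecessors, letting a feedback-free
# combinatorial network produce correct outputs in one pass.

def rank_order(gates):
    """gates: {name: (func, [input names])}. Names with no entry are
    primary inputs (level 0). Returns gate names in rank (level) order."""
    levels = {}
    def level(g):
        if g not in gates:
            return 0                       # primary input
        if g not in levels:
            levels[g] = 1 + max(level(i) for i in gates[g][1])
        return levels[g]
    return sorted(gates, key=level)

def simulate(gates, inputs):
    values = dict(inputs)
    for g in rank_order(gates):            # predecessors always come first
        func, ins = gates[g]
        values[g] = func(*(values[i] for i in ins))
    return values

AND = lambda a, b: a & b
NOT = lambda a: a ^ 1
netlist = {"n1": (AND, ["a", "b"]), "out": (NOT, ["n1"])}
vals = simulate(netlist, {"a": 1, "b": 1})
```

Note that `rank_order` would recurse forever on a feedback loop, which mirrors the text's point that rank-order mode prohibits feedback and therefore cannot conveniently simulate memory.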

Fig. 3.16. The YSE Logic Processor Configuration Two of the three logic processor internal arrangements are illustrated: the rank-order mode and the unit-delay mode. The mixed mode, described in Section 3.1.7, is not illustrated. In rank-order mode, the gates are connected so that all equal-depth combinatorial networks stabilize in a single time unit. In unit-delay mode, each gate has the same delay, so a combinatorial network of N levels of depth takes N time units to stabilize.


The unit-delay mode does not require this type of ordered execution. The results

of instruction executions do not affect any other instruction’s inputs until the next simu-

lation cycle. Therefore instructions may be executed in any order within each simulation

cycle. In the unit-delay mode, the processor configuration is similar to the rank-order

processor structure, except the memory is divided into parts A and B. During the unit-

delay mode simulation, alternating simulation time cycles take turns reading/writing to

the processor data memory. In the even cycles, the processor might write to the “B”

memory and read from the “A” memory. In odd cycles, the processor would then write to

the “A” memory and read from the “B” memory. The net effect is that every simulation

cycle performs a single gate delay for every gate in the entire simulated machine [119].
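The A/B memory scheme is classic double buffering. The sketch below, with an assumed two-inverter ring as the netlist, shows how feedback becomes simulable because every cycle applies exactly one gate delay to every gate:

```python
# Sketch of the unit-delay A/B memory scheme: every gate reads the
# previous cycle's state from one memory and writes its new output to the
# other, so evaluation order within a cycle does not matter.

def unit_delay_cycle(gates, read_mem):
    """Evaluate every gate from read_mem, writing into a fresh memory."""
    write_mem = {}
    for name, (func, ins) in gates.items():
        write_mem[name] = func(*(read_mem[i] for i in ins))
    return write_mem

NOT = lambda a: a ^ 1
ring = {"x": (NOT, ["y"]), "y": (NOT, ["x"])}   # feedback is legal here

mem_a = {"x": 0, "y": 0}
mem_b = unit_delay_cycle(ring, mem_a)   # even cycle: read "A", write "B"
mem_a = unit_delay_cycle(ring, mem_b)   # odd cycle: read "B", write "A"
```

Because all reads come from the previous cycle's memory, the two inverters update simultaneously; the ring oscillates with a period of two unit delays, something rank-order mode could not represent at all.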

Mixed-mode allows the unit-delay and rank-order modes to be combined. In

mixed-mode, memory elements are simulated in the unit-delay mode and the combina-

toric logic is simulated in rank-order mode.

The inter-processor switch allows communications between all the YSE processors

during the simulation. A sample switch port to a logic processor connection is illustrated

in Figure 3.17. In the YSE, all processors operate synchronously, with a common clock

and identical values in their program counters. The processors may all execute different

instructions; however, each processor will execute its first, second, ..., kth instruction

in lock step [119]. The YSE takes advantage of this synchronization by sending each

processor’s result to all the other processors via the switch multiplexor. So, at each time

increment, T, each switch multiplexor has every processor’s kth result at its data input.

Results generated at time T in one processor can be at any other processor’s inputs by

time T + 1.

Fig. 3.17. A Switch Port "K" Example with its Logic Processor Connection The inter-processor switch allows communications between all the YSE processors during the simulation. A sample connection between a switch port and a logic processor is illustrated.


The YSE array processor was designed to consist of two parts, the parallel adapter

and the backing store processor. The parallel adapter, or PAD, collects array data from

the logic processors, passing the data on to the backing store processor, or BSP. The

PAD also works in the reverse direction, distributing data from the BSP to the logic processors.

The PAD contains input and output memories, and an instruction memory. The input

memory is similar to the logic processor’s input memory in form and function. The

input memory is loaded from the inter-processor switch in the same manner as the logic

processor. The instruction memory words contain addresses in the input memory which

contain the gate inputs for each simulation time cycle. The relevant data is transferred

to the BSP from the input memory. The instructions contain control codes which are

passed directly to the BSP. The control codes indicate:

• whether the signals passed this cycle are valid.

• the data type passed to the BSP (e.g., address, data to be written, write enable, etc.).

• which array is being addressed.

• what operation is to be performed (read or write).

For example, in the case of a read operation, the BSP will write the data from

the array into the PAD’s output memory. From the PAD’s output memory, the data

would then be transferred to the inter-processor switch according to an additional PAD

instruction field.


The BSP contains an array descriptor memory and a large backing store. The

BSP also contains registers holding the array addresses, the data to be written, and

data which was read. Each entry of the array descriptor memory describes one of the

simulated arrays held in the backing store. The descriptors indicate the offset to the

beginning of the array and the array’s stride (size of each element). When the BSP

receives a read or write command, the BSP uses the appropriate descriptor to calculate

the target array address and then performs the requested function.
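The descriptor-based address calculation can be shown in a few lines. The field names and the address formula are straightforward assumptions from the description above, not taken from [119]:

```python
from dataclasses import dataclass

# Sketch of the BSP's descriptor-based addressing: each simulated array is
# located in the backing store by an offset, and its elements are spaced
# by a stride (the size of each element).

@dataclass
class ArrayDescriptor:
    offset: int   # start of this array within the backing store
    stride: int   # size of each element

def backing_store_address(desc, index):
    # Target address = array base + element index * element size.
    return desc.offset + index * desc.stride

ram0 = ArrayDescriptor(offset=0x1000, stride=4)
addr = backing_store_address(ram0, 3)
```

Keeping the per-array geometry in descriptors lets one backing store hold many simulated arrays of different element sizes, with the BSP resolving each read or write command against the appropriate descriptor.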

To its credit, the YSE demonstrated that a speed increase of several orders of

magnitude is achievable for a gate-level logic simulation using a parallel, special purpose

machine. The simulator relies on the user performing separate timing verification via a

proof technique called the Level Sensitive Scan Design (LSSD) discipline. The timing

analysis is data-independent, so simulation is not required for the analysis.

However, the machine also has the following disadvantages:

• All the gates of the design are executed in every simulation cycle, regardless of

whether their input data are valid in a given cycle [119]. This is a waste of processor

time which could be used to accelerate the simulation.

• The scheduling problem. The YSE has some data flow problems which may be

handled by the YSE’s compiler through the insertion of nop instructions. One

limitation of the YSE’s inter-processor communications is that a processor can

only receive one value from one other processor at a time. If two instruction inputs

must be fed to a processor, the values cannot be written during the same simulation

cycle, as illustrated in Figure 3.18. The inputs must be staggered in time, as only


one can be written during each cycle. The extra wait cycle for the instruction must

be filled by a nop, or another independent instruction.

• The YSE does not handle non-deterministic simulations. There is no probabilistic

estimate of bus contention, data traffic, etc. All instructions take exactly the

same amount of time to execute. Scheduled instructions are always performed.

Switching contention is completely resolved at compile time. It was felt that “... if

the switch had to receive and arbitrate communications defined only at run time,

its control logic alone might well have exceeded the size of the entire current YSE

switch.” [119]
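The compiler's NOP-insertion fix for the scheduling problem can be sketched as below. The schedule representation and the conflict rule, namely that a processor can receive only one value from one other processor per cycle, are illustrative assumptions modeled on Figure 3.18:

```python
NOP = None   # placeholder instruction slot

def deconflict(schedules):
    """schedules maps each sending processor to a per-cycle list of
    destination processors (or NOP). A destination may receive at most one
    value per cycle, so a conflicting sender's remaining instructions are
    all pushed down one position by inserting a NOP."""
    changed = True
    while changed:
        changed = False
        n_cycles = max(len(s) for s in schedules.values())
        for t in range(n_cycles):
            seen = set()
            for proc in sorted(schedules):
                sched = schedules[proc]
                dest = sched[t] if t < len(sched) else NOP
                if dest is NOP:
                    continue
                if dest in seen:
                    sched.insert(t, NOP)   # delay this sender one cycle
                    changed = True
                else:
                    seen.add(dest)
    return schedules

# Processors A and C both target B in the same cycle (the top drawing of
# Fig. 3.18); C's schedule is pushed down one slot by an inserted NOP.
out = deconflict({"A": ["B"], "C": ["B", "B2", "B3"]})
```

A real compiler would prefer to fill the inserted slot with an independent instruction rather than a NOP, exactly as the figure caption notes; the sketch only resolves the conflict.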

3.1.8 HAL: A Block Level Logic Simulator

HAL is another high-speed hardware logic simulation machine which gains speed

by exploiting concurrency in simulation processes. The HAL results were initially reported in 1983. HAL is a special-purpose simulation engine which is approximately 10^3 times faster than a comparable software simulator, or about 10^5 times slower than the

actual machine [89]. HAL contains 32 distributed special parallel processors, which uti-

lize a Block Oriented Simulation Technique [125]. HAL is designed to simulate custom

designed Large Scale Integration (LSI) computers composed of a central processor unit,

a system controller and a memory unit. HAL is also designed to be capable of simulat-

ing large logic networks at high speed using a reduced amount of hardware. HAL is an

event-driven, coarse-grained, conservative logic simulator.

Fig. 3.18. The YSE Scheduling Problem In this illustration, time runs vertically, placing later instructions lower in the schedules. The top drawing shows that both data items cannot be delivered to processor B at the same time. This scheduling dilemma can be solved by moving all the instructions of processor C down one position and inserting a NOP; or, if the compiler can find an independent instruction which does not require inter-processor communication, that instruction can be substituted for the NOP.


HAL derives much of its speedup by simulating all functions used in the simu-

lation as hardware. Specialized hardware modules perform most of the simulator pro-

cessing within a single step as opposed to multiple program steps required by a software

simulation. HAL also takes advantage of pipelining to allow concurrency among inde-

pendent hardware modules. The simulation algorithm is implemented by independent

sub-function sequences, and block event streams are fed into a pipeline that is com-

posed of hardware modules [89]. Finally, sub-circuits which lie on the same level in

level-ordering , can be executed in parallel on different processors.

The HAL simulation team cites an interesting example of a software simulator

evaluating a mainframe computer design. If a software simulator whose simulation performance is about 10^8 times slower than an actual machine were used, it would take about

15 years to execute a test program, a task that would take only five seconds on an actual

machine [89].

In their formulation of the HAL simulator, three major delay simulation models

were considered: the zero-delay model, the unit-delay model, and the nominal-delay

model. The zero-delay simulation is only applicable to a synchronous logic circuit which

does not include feedback loops or delay-dependent units. The zero-delay simulation

arranges all gates in a block according to the signal propagation order, and assigns an

equal level number for those gates at the same logic depth. The zero-delay model handles

all logic elements as ideal switching elements with no switching delay. The zero-delay

algorithm exhibits abundant parallelism because all gates at the same level or depth can

be executed at the same time by different processors. The unit-delay model simply


handles all gates as having the same unit delay time. The unit-delay model allows delay-

dependent logic elements or feedback loops, so it is possible to model memory using the

delay-dependent model. Unit-delay however requires two sets of memory to store the

input and output of each logic element. In each simulation cycle, all gates are evaluated

by using the previous values, stored in the input memory. New values are stored in the

output memory. In the next simulation cycle, the values in the input memory and output

memory are exchanged so that previous outputs are used as inputs in the next cycle [89].
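The two-memory exchange just described is easy to render in software. The sketch below uses an assumed netlist encoding (a dictionary mapping each gate to its function and fan-in names), not HAL's actual representation:

```python
def unit_delay_cycle(gates, inputs_mem, outputs_mem):
    """Evaluate every gate from the previous cycle's values (inputs_mem),
    writing the new values into outputs_mem."""
    for gate_id, (func, fanin) in gates.items():
        outputs_mem[gate_id] = func(*(inputs_mem[src] for src in fanin))

def simulate(gates, initial, cycles):
    # two sets of memory, as in HAL's unit-delay model [89]
    inputs_mem, outputs_mem = dict(initial), dict(initial)
    for _ in range(cycles):
        unit_delay_cycle(gates, inputs_mem, outputs_mem)
        # exchange the memories: this cycle's outputs become next cycle's inputs
        inputs_mem, outputs_mem = outputs_mem, inputs_mem
    return inputs_mem
```

Because every gate reads only the previous cycle's values, feedback loops are well defined; a single inverter fed back on itself, for example, toggles once per unit-delay cycle.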

Unit-delay simulation requires more simulation cycles than zero-delay, because unit-delay

simulation often requires several cycles to allow the events in process to settle. The final

model is the nominal-delay model, which also has concurrency in gate evaluations for all

gates which belong to the same time period. The time span selected for a time period is

set for the duration of the simulation. However, the longer the time span, the greater the

amount of possible gate evaluation concurrency. The HAL simulation engine employs

the zero-delay model based on gate level-ordering.
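Level-ordering itself amounts to a topological levelling pass over the netlist. A recursive sketch, assuming a fan-in table in which primary inputs have an empty fan-in list:

```python
def assign_levels(fanin):
    """Assign each gate a level equal to its logic depth; gates with the
    same level can be evaluated concurrently under the zero-delay model."""
    levels = {}
    def level_of(gate):
        if gate not in levels:
            ins = fanin[gate]
            # a gate sits one level beyond its deepest fan-in gate
            levels[gate] = 0 if not ins else 1 + max(level_of(s) for s in ins)
        return levels[gate]
    for gate in fanin:
        level_of(gate)
    return levels
```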

The HAL simulation team also created categories of simulation granularity. For

logic simulation, the group divided the granularity levels into gate-level, block-level, and

function-level granularity. Gate-level simulation evaluates all gates on the same level

as individual units. Block level evaluation groups gates into collections of several tens

of gates which are evaluated as a unit. Functional-level evaluation is more complex

than block-level evaluation. As the granularity of the simulation becomes coarser, inter-

granular event propagations are reduced significantly [89]. The granularity of event

execution is a benefit derived from executing deterministic logic simulations. Non-

deterministic event-driven simulations cannot be divided into blocks of events, as future


events depend on previous events for their existence; since the simulation events are

generated randomly, events which exist in one simulation run will probably not

exist in the next.

HAL handles its simulation as a block-level simulation. Figure 3.19 illustrates

the level-ordering method. Data independent blocks which can execute concurrently are

assigned to the same level. The system begins by concurrently executing all blocks in

level 0 with new input values. If a block’s output values change, then the new output

values are propagated to the appropriate input values of the next level. When all of the

blocks of level 0 have been executed, level 1 is handled. The simulation continues until

all levels of the simulation have been processed.
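The sweep just described can be sketched as a loop over levels; blocks are modelled here as (function, fan-in names, output name) tuples, an illustrative encoding rather than HAL's:

```python
def run_levels(blocks_by_level, values):
    """Execute all blocks level by level, propagating only changed outputs."""
    for level in sorted(blocks_by_level):
        updates = {}
        # blocks on one level are data independent; HAL evaluates them
        # concurrently, while this sketch simply loops over them
        for func, fanin, out in blocks_by_level[level]:
            result = func(*(values[name] for name in fanin))
            if values.get(out) != result:   # propagate only if the output changed
                updates[out] = result
        values.update(updates)              # now visible to the next level
    return values
```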

HAL’s hardware is organized according to the diagram illustrated in Figure 3.20.

HAL is composed of 29 logic processors, 2 memory processors, and a router cell network.

Logic processors handle block-level simulation if the block contains only combinatorial

logic gates. The logic processor is itself composed of a node processor and a dynamic gate

array (DGA). The node processor manages event processing among the blocks. Each

node processor can handle up to 1000 blocks where each block can contain 32 inputs

and 32 outputs. The DGA performs gate evaluations within each block. The DGA gate

evaluations are performed by table lookup. The DGA receives the block inputs and the

block type. The inputs are then routed to index a location according to both the block

type and the gate inputs. Individual gate functions are implemented by the table-driven

method, where the function table for the gate is embedded in the RAM area in the form

of bit patterns [89].
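Table-driven gate evaluation of this kind can be sketched directly: each gate type's truth table is stored as a bit pattern, and the packed input values select one bit of it. The patterns below are ordinary two-input truth tables, not HAL's RAM layout:

```python
# bit i of each pattern is the gate output for packed inputs i = (b << 1) | a
FUNC_TABLE = {
    "and": 0b1000,
    "or":  0b1110,
    "xor": 0b0110,
}

def eval_gate(gate_type, a, b):
    index = (b << 1) | a                     # pack the inputs into a table index
    return (FUNC_TABLE[gate_type] >> index) & 1
```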


Fig. 3.19. The HAL Level Ordering Method. HAL implements a zero-delay simulation model. As part of that model, events propagate through the simulation in pipelined fashion. The simulation is subdivided into blocks, with independent blocks executing at the same level. In each clock cycle, events which are generated from register outputs propagate through combinatorial circuits and reach register inputs. At the end of each clock cycle, the register outputs are updated by the register inputs, and then generate events for the next clock cycle. The evaluation of each block is executed at most once per clock cycle. Output values for all blocks are preserved until the next simulation clock cycle [89].


Fig. 3.20. The HAL Hardware Architecture. HAL is an array of 29 identical logic processors (NP1-NP29), two memory processors (NP30 and NP31), and a router cell network. The logic processors perform coarse-grain logic block simulation for blocks which contain only combinational logic. Each logic processor contains a node processor and a dynamic gate array (DGA). The memory processor consists of a node processor and a memory node simulator (MNS). The control processor performs level and clock synchronization among the logic and memory processors. The router cell network connects the processors and enables store-and-forward packet transmission among them [89].


Fig. 3.21. Internal Mechanism of a Logic Processor. The internal mechanics of one of HAL's 29 logic processors is illustrated. The event-set process block receives events from the router-cell network illustrated in Figure 3.20. The event-set process block sets and updates the input-status memory with the received event information. The event-fetch process block searches the input-status memory for new events and sends the new event information to the dynamic gate array for evaluation. The new event's block is evaluated by the dynamic gate array, which returns its results to the output-status memory. The update-status process compares the new status with the previous one stored in the output-status memory, and puts the updated status back into the output-status memory. The fan-out process block uses the connection memory to determine the fanout list for propagating the evaluation results. Finally, the event-send process block transfers events through either the router-cell network or a local bypass if the result is needed for the same block [89].


In Figure 3.21, the interior of the first node processor is displayed. The event-set

process block receives events from the router-cell network which can be seen illustrated in

Figure 3.20. The event-set process block sets and updates the input-status memory with

the received event information. The event-fetch process block searches the input-status

memory for new events and sends the new event information to the dynamic gate array

for evaluation. The new event’s block is evaluated by the dynamic gate array which

returns its results to the output status memory. The update-status process compares

the new status with the previous one stored in the output-status memory, and puts

the updated status back into the output-status memory. The fan-out process block

uses the connection memory to determine the fanout list for propagating the evaluation

results. Finally, the event-send process block transfers events through either the router-

cell network or a local bypass if the result is needed for the same block [89].
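Collapsed into sequential code, one pass through those stages looks roughly as follows; the block functions and memory layouts are illustrative assumptions:

```python
def process_events(incoming, input_status, blocks, output_status, connections):
    """One pass of event-set, event-fetch/evaluate, update-status, fan-out,
    and event-send, modelled sequentially."""
    pending = set()
    for block_id, pin, value in incoming:        # event-set process
        input_status[block_id][pin] = value
        pending.add(block_id)

    outgoing = []
    for block_id in pending:                     # event-fetch process
        new_out = blocks[block_id](input_status[block_id])   # DGA evaluation
        if new_out != output_status.get(block_id):           # update-status compare
            output_status[block_id] = new_out
            for dest, dest_pin in connections[block_id]:     # fan-out process
                outgoing.append((dest, dest_pin, new_out))   # event-send process
    return outgoing
```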

Memory Processors contain node processors which are the same as in the logic

processors. The memory simulator models main memory and the cache. Although

memory could have been simulated by logic processors, the amount of memory required

to model a mainframe computer would exhaust HAL’s logic processor capacity. So the

memory processors were developed to model memory without degrading the simulation

performance. The memory simulator, however, stores the memory data for its simulated

memory blocks in the host computer’s main memory. By using the host computer’s main

memory, HAL’s simulation memory capacity is several megabytes of memory. However,

HAL required approximately 3 ms to simulate each 16-bit read or write memory access

cycle [89].


The Control Processor, also called the Host Processor, synchronizes the simulation

level operations with the simulation clock across the logic and memory processors. When

a node processor ends the evaluation in the current level, it sends an end signal to the

control processor. When the control processor has received level end signals from all

the node processors indicating that they have finished, the control processor increments

the simulation level. The control processor then broadcasts the new level to all node

processors, and the simulation begins for the new level. The control processor also

manages data transfers between the logic and memory processors and the host computer.

The router cell network connects the processors and facilitates store and forward message

packet communication between the processors.
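The end-signal protocol amounts to a barrier synchronization across the node processors. A threaded sketch of the same discipline, with assumed names standing in for the hardware:

```python
import threading

def make_level_runner(num_procs, levels, work):
    """Each node processor runs `work` for a level, then waits at the barrier
    (its "end signal") until every processor has finished that level."""
    barrier = threading.Barrier(num_procs)
    def node_processor(pid, results):
        for level in range(levels):
            results.append((pid, level, work(pid, level)))
            barrier.wait()   # control processor advances the level here
    return node_processor
```

Because every thread waits at the barrier, no processor can begin level n+1 until all have completed level n, mirroring the control processor's broadcast of the new level.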

The HAL simulation model introduces the novel approach of block-level simu-

lation which improves simulation speed at the cost of simulation granularity. HAL’s

disadvantages are as follows:

• The zero-delay, unit-delay, and nominal-delay models used to describe different

approaches to evaluating logic simulation are not applicable to non-deterministic

event-driven simulation. In non-deterministic simulation, the events, which are

analogous to gates in a logic simulation, are created on the fly, and therefore can

not be easily sorted according to evaluation levels. These methods all depend on

the gates existing before the simulation begins so that the events can be ordered.

• The granularity categories of the logic event-driven simulations depend on a priori

information about the hardware being simulated. Non-deterministic simulations


don’t have all the events generated before the simulation occurs, so again this

categorization doesn’t apply to non-deterministic simulation.

• HAL’s dependence on the host processor’s main memory to simulate memory in its

test models causes a substantial 3 ms time penalty for simulation memory accesses.

After the success of the initial HAL machine, second and third generation ma-

chines were developed culminating with the construction of HAL III. HAL III consists

of 127 processors and a maximum memory capacity of 254 Mbytes [135]. HAL III is

reported to be more than 10,000 times faster than conventional multiprocessor software

simulators.

3.1.9 MARS: Micro-Programmable Accelerator for Rapid Simulation

MARS, the Micro-Programmable Accelerator for Rapid Simulation, was devel-

oped at AT&T Bell Laboratories and built in approximately 1987. MARS is a pipelined,

parallel accelerator whose microprocessors can be reconfigured through microprogram-

ming [3]. MARS is classified as an event-driven, fine-grained, conservative logic simula-

tor.

MARS consists of 256 clusters which are connected to a binary 8-cube communi-

cations network. A host processor can access the network and each cluster. The clusters

and network are illustrated in Figure 3.22. A cluster contains 14 Processing Elements

(PE). Every PE serves as a single stage of the pipeline. Each cluster performs a partition

of a multiple-delay logic simulation and communicates with the other clusters via the


communication network. Figure 3.23 illustrates the internal components of a MARS

cluster.


Fig. 3.22. Global MARS Architecture. MARS consists of 256 clusters which are connected to a binary 8-cube communications network. A host processor can access the network and each cluster.

A cluster contains a communications network node, a local message switch, 14

PEs and a housekeeping processor. The 14 PEs are connected via a 16x16 crossbar switch

which also connects to the housekeeping processor and the external global communica-

tions network. The housekeeping processor is implemented as an M68020 processor,

which uses a local disk to store circuit partitions.

Figure 3.24 illustrates the architecture of the MARS Processing Elements (PE).

Each PE acts as a pipeline stage and together, the pipelined PEs perform the simula-

tion [3]. Individual PE functions include event scheduling, fanout updating, and function


Fig. 3.23. Internal Cluster Architecture. Each cluster contains a communications network node, a local message switch, 14 processing elements (PE) and a housekeeping processor. The 14 PEs are connected via a 16x16 crossbar switch which also connects to the housekeeping processor and the external global communications network.


evaluation. The PEs communicate with each other and other clusters through their local

message switch.

As illustrated in Figure 3.24, each PE contains a microprogram RAM, a data

RAM, a register array containing 32 registers, an address arithmetic unit (AAU), a bit

field operation unit (FOU), and a message Queue Unit which serves as the I/O queue

for the PE. The FOU can perform operations on 1, 2, 4, and 8-bit field sizes. The AAU

is used to address the PE’s data RAM using a variety of addressing methods. The AAU

also performs 16-bit arithmetic, logical and shift operations. The AAU can support

multiplication and division at the rate of 1-bit per clock cycle. The FOU, on the other

hand, can perform bit-wise data extraction from two separate words, a bit-wise addition

operation on the operands, and then re-pack the results all in the same clock cycle.
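That single-cycle extract/add/re-pack can be written out with shifts and masks; the field positions and widths below are arbitrary examples, not MARS's encoding:

```python
def extract(word, pos, width):
    """Pull a `width`-bit field out of `word` starting at bit `pos`."""
    return (word >> pos) & ((1 << width) - 1)

def fou_add(word_a, pos_a, word_b, pos_b, width, dest, pos_d):
    """Extract one field from each operand, add them, and re-pack the
    truncated sum into `dest` -- what the FOU does in a single clock."""
    total = extract(word_a, pos_a, width) + extract(word_b, pos_b, width)
    total &= (1 << width) - 1                # keep only the field width
    mask = ((1 << width) - 1) << pos_d
    return (dest & ~mask) | (total << pos_d)
```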

The data path consists of three 16-bit buses: A, B, and C. The microinstruction cycles

consist of three phases. In phase 1, data is read from registers onto a bus. During

phase 2, the AAU and FOU operate on the retrieved data and place the results on

a bus during phase 3. The contents of the buses are also written to selected registers

during the final phase.

Other units in Figure 3.24 include the data RAM address register (DAR), the

data RAM high address register (DHAR), the external address (EAD), the external

data register (ED), the field select register (FSR), the microinstruction register (MIR),

the memory select register (MSR), and the program address register (PAR).

Figure 3.25 illustrates the fanout phase and evaluation phase of a cluster for logic

simulation. Each stage of the pipeline represents one PE. The same PE may be used in

both phases.


Fig. 3.24. Architecture of the Processing Element. Each Processing Element (PE) acts as a pipeline stage and together, the PE pipeline performs the simulation [3]. Individual PE functions include event scheduling, fanout updating, and function evaluation. The PEs communicate with each other and other clusters through their local message switch. The PEs contain a microprogram RAM, a data RAM, a register array containing 32 registers, an address arithmetic unit (AAU), a bit field operation unit (FOU), and a message Queue Unit which serves as the I/O queue for the PE.


During the fanout phase, the signal scheduler contains pointers to linked lists

of events. The output filter keeps track of current and pending signal values as well as

canceled events [3]. The oscillation detector detects zero delay oscillations and interrupts

the housekeeper if a predetermined number of oscillations is exceeded. The output log

records events in its data RAM on watched signals. The fanout pointer list, fanout list,

and input table all are used to propagate the gate results to the proper elements on the

evaluated gate’s fanout list. Finally, the gate scheduler schedules the gates whose inputs

have changed for evaluation during the next appropriate evaluation cycle.

The evaluation phase starts where the fanout phase left off. The gate scheduler

pops the gates to be evaluated off its stack of events and forwards the events to the

input table. The input table then fetches the appropriate input values for the gates

and forwards the values to the gate type table. The gate type table adds its data for

the appropriate gate to the data and moves it to the function unit. The function unit

evaluates the single gate and passes its computed result to the delay table which adds the

correct gate delay. Next, the input vector list, the output filter, and the signal scheduler

detect the new events generated by evaluation and schedule the new events for the next

fanout phase.
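A sequential sketch of the evaluation phase, with the per-PE tables reduced to plain dictionaries (an illustrative stand-in, not MARS's microcode):

```python
def evaluation_phase(gate_stack, input_table, gate_type, functions, delays, now):
    """Walk the pipeline stages for every scheduled gate and produce the
    future events handed back to the signal scheduler."""
    new_events = []
    while gate_stack:
        gate = gate_stack.pop()              # gate scheduler
        inputs = input_table[gate]           # input table
        kind = gate_type[gate]               # gate type table
        value = functions[kind](inputs)      # function unit
        fire_time = now + delays[kind]       # delay table adds the gate delay
        new_events.append((fire_time, gate, value))
    return new_events
```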

The MARS project has the following advantages and disadvantages:

1. Advantages

• Provides programmability through its use of microcoded PE chips.

• MARS works well for designs which utilize variable bit fields and variable

memory widths.


Fig. 3.25. MARS Logic Simulation Pipeline. The fanout phase and evaluation phase of a cluster for logic simulation are illustrated above. Each stage of the pipeline represents one PE. The same PE may be used in both phases.

2. Disadvantages

• Each PE has to receive data and instructions from RAM, which will cause

some speed penalties to access the data.

• The PEs also function as processors, with data-paths, so the operations of each

PE may involve reading from a bus, calculating a result, and then writing to

a bus, with appropriate storage to either memory or registers.

3.1.10 Reconfigurable Machine

The Reconfigurable Machine [137] (RM) combines FPGAs and RAMs to support

a wide range of applications. The RM, built in approximately 1992, incorporates FPGAs

which are capable of in-circuit reconfiguration allowing the RM to reload several types

of configuration data during power-on. The RM is an event-driven, fine-grained, conser-

vative logic simulator. A first prototype version of the RM, called RM-I, has been built


and applied to a multiple-delay Logic Simulator (LSIM). LSIM can simulate 1 million

gate events per second at a 4 MHz clock rate [137].

The RM architecture employs FPGAs which allow in-circuit reconfiguration and

relatively fast switching speeds. FPGAs come in four types with respect to programming

technology. The four types are anti-fuse, EPROM, EEPROM, and the SRAM type. The

anti-fuse and EPROM types do not allow in-circuit reconfiguration. The other two types

do allow in-circuit reconfiguration, but the SRAM type offers faster switching rates. So

the RM project decided to employ SRAM FPGAs. The project used the XC3090 (9000

gate class) FPGA from Xilinx, which contains 320 Configurable Logic Blocks (CLBs)

and 144 Input/Output Blocks (IOBs).

One of the FPGAs serves as the interface module. The other four FPGAs serve

as processing modules. Each of the four FPGAs accesses two types of memory, shared

and distributed. Both types of memory are implemented as 24-bit words. The FPGAs

have access only to their local memory when the FPGA is in its processing mode. When

not in the processing mode, the host has access to each FPGA memory using global

addressing. The RM can configure 4 pipeline stages with memory access.

The RM employs a tightly connected communications architecture with all FP-

GAs directly connected. There is also a 24-bit global bus used for global data transfer

and control. The system can be configured to run with a 16 MHz/2^n clock, where

n = 0, 1, 2, ..., 7 is selected by the configuration data [137]. When the clock

speed is less than or equal to 4 MHz, the RM-I can process two memory accesses within

a single system clock cycle which is good for read/modify/write cycles.


Fig. 3.26. The RM Machine. The FPGAs selected for the RM implementation were SRAM-type FPGAs which allowed in-circuit reconfiguration. The RM employs a distributed memory architecture, using 32K words of 24 bits. Each FPGA accesses only its local memory when in the processing mode. The communications network consists of a global bus which has 24 data bits and 6 control bits. One FPGA serves as the system interface. The configuration data interface unit determines the FPGA configurations.


When applied to logic simulation, the RM implementation is called the Logic

Simulator (LSIM). The logic simulator implementation on the RM was divided into

two main phases, fanout and evaluation [137]. The fanout phase allowed events in the

current time increment to propagate to the inputs of the next gates. These gates are

then scheduled for evaluation. During the evaluation phase, the event manager fetches

each gate and its signals from the event list. The evaluator receives the gate type and

input signals and retrieves the appropriate rise and fall delays from the function code

table. Finally, a comparator compares the gate's new output to its previous output.

If there was a change, the gate, its delay information, and the new signal results are

forwarded to the scheduler as a future event.
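The time-mapping queue used by the scheduler behaves like a time wheel: future events are hashed into buckets by firing time modulo the wheel size. The class below is a generic sketch of that structure, not the RM-I memory layout:

```python
class TimeWheel:
    """Bucketed future-event list indexed by (time mod size)."""
    def __init__(self, size=64):
        self.size = size
        self.buckets = [[] for _ in range(size)]

    def schedule(self, time, event):
        self.buckets[time % self.size].append((time, event))

    def pop_current(self, now):
        """Remove and return events due exactly at `now`; events for a
        later wheel revolution stay in the bucket."""
        bucket = self.buckets[now % self.size]
        due = [ev for t, ev in bucket if t == now]
        bucket[:] = [(t, ev) for t, ev in bucket if t != now]
        return due
```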

The FPGAs are used for implementing logic functions [137]. Gate-level circuits

implemented on each FPGA of RM-I are designed manually. Xilinx tools are used for

automatic placement and routing.

The RM advantages and disadvantages are as follows:

1. Advantages

• The system has demonstrated 170 to 190 times speedup running a Logic

Diagnosis Engine as compared to the same system compiled as software on

the host computer.

2. Disadvantages

• The machine has limited storage capacity, so tasks requiring large amounts of

memory are not practical.


Fig. 3.27. The LSIM Fanout Phase. The Scheduler first increments the simulation cycle time and then gets the pointer to the linked list of current events from the time-mapping queue. Each event consists of the gate identifier and its new output value. The Propagator, which receives each event from the Scheduler, uses the Connection Index Table to locate the current gate's fan-out receiving gates in the Connection Table. The Event Manager receives the propagated gate identifier, the terminal identifier, and the signal value. The Event Manager updates the Input Signal Table according to the values from the Propagator. Events whose inputs changed have their Activity Flags set to false and are stored in the event list.


Fig. 3.28. The LSIM Evaluation Phase. In the Evaluation phase, the Event Manager gets each event from the event list, clears its activity flag, and retrieves the gate's input values. The Evaluator retrieves each gate identifier and its input signal values from the Event List. Next, the gate type is pulled from the Function Code Table, and the rise/fall delays are retrieved from the Delay Table. The Evaluator determines the gate's output value and forwards the result to the comparator. The comparator evaluates the gate output to determine if a change has occurred. If the output is new, the gate identifier, its delay information, and the new output value are sent to the Scheduler. The Scheduler places the new event on the Time Mapping Queue.


• The 4 FPGA processors have fixed connections. The maximum bandwidth

between the processors is not scalable.

• The Xilinx XC3090 FPGA clock rate is limited to between 4 and 8 MHz.

3.1.11 Bauer

A more recent logic simulation design which utilizes FPGAs was developed in

work by Bauer [14]. This logic simulator uses reconfigurable logic to accelerate the dis-

crete event simulation of logic circuits. The focus of this work is accelerating discrete

event simulation, but the target is again deterministic. The foundation for this recon-

figurable computing system is an FPGA-based emulator, which provides large blocks of

reconfigurable logic [14]. The simulation is generated by a compiler which compiles a

behavioral Verilog HDL description of the design under test.

Each emulation module of Figure 3.29 runs a small operating system to manage

the behavioral simulation and logic netlist emulation. A separate control processor which

is not illustrated performs higher level operating system functions including network ac-

cess and disk management. The emulation modules consist of a PowerPC 403GCX

processor, local RAM, and a local FPGA array with its associated programmable inter-

connect. Emulation modules connect to each other via programmable interconnects.

The advantages and disadvantages of the system are:

1. Advantages

• Focus on accelerating Logic simulation as discrete event simulation.

• The system is scalable.



Fig. 3.29. Bauer's Reconfigurable Logic Simulator. The architecture of the system consists of one or more emulation modules. Each module consists of a CPU, RAM, and an FPGA array with a local programmable interconnect. The interconnect allows the FPGA array to be treated as one large reservoir of reconfigurable logic. The figure depicts two emulation modules.


2. Disadvantages

• Purely time-driven implementation.

• The system foundation is composed of FPGA-based emulators. Emulation

sacrifices generality for performance: it cannot be used to simulate behavioral

circuit models that contain delays or other constructs that are either non-

structural or cannot be synthesized into gate-level circuitry [14].

3.2 Accelerator & General Purpose Machine

The two machines in this section, the Splash accelerator, and the ArMen, a general

purpose parallel machine, are not specifically designed as logic simulators, and therefore

do not fit the criteria of Section 3.1. Splash, described in Section 3.2.1, is designed

to provide very high performance on a range of bit-processing problems. Similar to the

architecture proposed in this thesis, Splash employs reconfigurable logic in the form of

systolic arrays. The work also provides invaluable feedback on its architecture advan-

tages and disadvantages. Section 3.2.2 describes the ArMen, which is perhaps the closest

architecture to the system presented by the thesis in that the ArMen is a general pur-

pose machine using reconfigurable logic which is specifically designed to support parallel

discrete event simulation. The ArMen, however, has a significantly different approach

to synchronization and it lacks a reduction network.

3.2.1 Splash

The original Splash 1 is a single board which plugs into the VME bus of a Sun

workstation. In approximately 1991, Splash was designed to serve as a systolic processing


system [8] using a Sun workstation as its host. The general purpose machine is normally

an SIMD machine and the original design was motivated by a systolic algorithm for

DNA pattern matching [64]. The boards consist of 32 Xilinx XC3090 FPGAs which are

programmed to serve as processing elements. The FPGAs are connected as a linear array

by a 32-bit wide data bus. The two end chips, X0 and X31 can be connected together

allowing the FPGAs to form a ring. Between each pair of FPGAs lies a shared 128K x 8

RAM with an 8-bit wide path to the FPGAs. The Splash 1 board clock rates can be set

in factors of 2 from 1 MHz to 32 MHz. The slower speeds allow placement and routing

design difficulties to be accommodated.
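Systolic operation of the kind Splash supports can be modelled as a linear pipeline in which every PE applies its function and hands the result to its neighbour on each clock. A small software model, with arbitrary per-stage functions standing in for FPGA logic:

```python
def systolic_run(stages, stream):
    """Push `stream` through a linear systolic array, one item per clock;
    pipeline[i] holds the value latched at PE i's output."""
    n = len(stages)
    pipeline = [None] * n
    results = []
    # feed the real items, then n empty clocks to flush the array
    for item in list(stream) + [None] * n:
        new = [stages[0](item) if item is not None else None]
        for i in range(1, n):
            prev = pipeline[i - 1]
            new.append(stages[i](prev) if prev is not None else None)
        pipeline = new
        if pipeline[-1] is not None:         # a value leaves the last PE
            results.append(pipeline[-1])
    return results
```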

Splash 2 attempts to alleviate the I/O bound drawbacks of Splash 1 by using a

Sparc II host and a connection to the system’s SBus. Splash 2 is expected to be 8 to 10

times faster than its predecessor in terms of its sustainable I/O rate.

In Figure 3.30, each Splash 2 array board is composed of 16 processing elements,

FPGAs designated X1 through X16, each of which is connected to a 16-way crossbar

switch. An additional FPGA processing element, designated X0, controls the switch

configuration. The FPGA processing elements are Xilinx XC4010 FPGAs which are

each connected to 500K of local memory. The host Sun platform can directly address the

processing elements' 500K local memories. The memories are connected to the FPGAs by a 16-bit data bus and an 18-bit address bus. The FPGAs have 36-bit bidirectional

data paths to both left and right neighbors as well as the crossbar switch [7]. A crossbar

input may be configured to connect to any number of output ports allowing point-to-

point, multicast, and broadcast communication. This configuration allows X0 to receive

broadcast data from the host on the SIMD bus and rebroadcast it through the crossbar

Fig. 3.30. The Splash 2 Architecture. The Splash 2 architecture was designed based on the newer Xilinx XC4010 10,000-gate FPGA. Splash 2 is scalable from one board with 16 processing elements to a combination of boards yielding 256 processing elements. The input and output data streams can be provided with direct memory access (DMA) from the Sun SBus or from an external source. The crossbar switches on each board fully connect the board's 16 processing elements. Application programs can be written in behaviorally described VHDL.

Fig. 3.31. The Splash 2 Interface. The Splash 2 interface board contains 3 bidirectional DMA channels. Each DMA channel is connected to the Splash array boards via a FIFO queue. XL and XR are two user-programmable FPGAs which can process the incoming and outgoing data streams, optionally stopping and starting the system clock as data fills the output channel or new data becomes available on the input channel. In Splash 2, the clock frequency is selectable by the host in 50-Hz increments from 100 Hz to 30 MHz.


to each of the 16 PEs on the board. The input to each board’s processing element array

is through the XL processing element which connects to a 36-bit SIMD bus of each X0

element on all the boards and to the X1 element of the first board. Additional array

boards can be linked together by extending the linear data path from the X16 element

of one board to the X1 element of the next board [7]. The XR unit determines which

board is the last board in the chain.

The Sparc host downloads the configuration data to the processing elements on

each board, which includes X0-X16, XL, XR and the crossbar switch. The host system

can read and write to the DMA channel FIFOs on the interface board as shown in

Figure 3.31. The host can also stop and start the system clock, set up and manage

the DMA channels, read and write to the processing element memories, and receive

interrupts from both the DMA channels and from each of the computing elements. Each

array board contains a set of bidirectional handshake registers through which the host can

communicate directly and asynchronously with the computing elements. There is also

a single-bit broadcast mechanism and a 2-bit wide global AND/OR reduction network

between the processing elements and the interface board [7].

Splash was designed to handle various programming models including a single

instruction/multiple data stream (SIMD) model, a one-dimensional pipelined systolic

model, and several higher-dimensional systolic models [7]. The SIMD applications utilize

the X0 element and the crossbar switch on each board to broadcast the instructions and

data to all processing elements simultaneously. The instruction stream is sent from the

host to the X0 chip on each board via the SIMD bus. X0 broadcasts the instruction to

all 16 of the board’s processing elements, which are each programmed with one or more


identical SIMD computing engines. These engines synchronously receive and execute

instructions and perform nearest neighbor communications through the linear data path.

Global element synchronization is accomplished with the AND/OR reduction network.

One-dimensional systolic arrays are formed by using the processing board’s 36-bit

linear data path to form a continuous pipeline from the host, through the array, and back

to the host. The crossbar switch allows an individual processing element to be bypassed

or multi-dimensional systolic arrays to be implemented.

The Splash system has several advantages and disadvantages:

1. Advantages

• Splash is a general purpose system.

2. Disadvantages

• I/O bandwidth and inter-processor communications have proved to be a lim-

iting factor during system testing. Splash 1 was entirely I/O bound [8].

• Splash 1 has only a single systolic datapath between all of its 32 Xilinx FPGAs.

Splash 2 implemented a crossbar interconnect, but this still limits the speed

of the communication required for simulation synchronization.

• Splash was designed for synchronous SIMD operation; however, most simulations might work better with asynchronous multiple instruction, multiple data (MIMD) operation. MIMD allows different nodes of the same simulation

to operate with different constraints and statistical distributions in a non-

homogeneous simulation network.


• The two-bit reduction network is rather small and might be constraining for

event-driven simulations.

3.2.2 The ArMen

In 1994, Beaumont et al. [15] proposed a new architecture for discrete synchronous

event-driven simulation using FPGAs. The MIMD ArMen implementation allows the

parallel execution of events with the same timestamp in virtual time. The machine is an

event-driven, fine-grained, conservative logic simulator. All processors wait until all the

event computations for a given simulation cycle are complete. Then the simulation can

proceed to the next phase. The next simulation cycle is the global minimum of all the

minimum timestamped events on each node. The protocol respects causality constraints

since all processors are always executing events with the same timestamp [15].

The two main global control operations are the synchronization barrier which ev-

ery processor must reach before the simulation can proceed to the next simulation cycle,

and the calculation of the global minimum of all the Local Virtual Times (LVTs) in order

to determine the next time to be simulated. These two operations have been implemented

in the FPGAs of the ArMen machine. The algorithm is provided in Table 3.2.

Each ArMen node is tightly coupled to an FPGA ring. The reconfigurable ring,

called the logic layer, allows the synthesis of application-specific operators. ArMen can

be configured and specialized at runtime.

The basic ArMen architecture is illustrated in Figure 3.32. Each ArMen node

consists of a processor and FPGA combination. The processor is connected via its

system bus to a bank of memory. The FPGA can be dynamically reconfigured by the

    GVT ← GVT_Computation(tWakeUp_min_i)      . Global minimum computation and broadcast
    while (¬ End_Simulation)
        if (GVT = tWakeUp_min_i) then
            〈〈 Model evaluation;
               Sending of generated messages;
               Waiting for acknowledgments 〉〉
        Global_Synchronization()              . In order to be sure that every execution
                                              . is over in the current time step
        〈〈 tWakeUp_min_i evaluation 〉〉        . Local minima search
        GVT ← GVT_Computation(tWakeUp_min_i)  . New global minimum computation and broadcast

Table 3.2. Synchronous Discrete Event-Driven Simulation Algorithm. In this algorithm, the Global Virtual Time (GVT) is the minimum of all the local virtual times at each simulation processing node. Instructions occurring between 〈〈 〉〉 are executed concurrently. The protocol respects the causality constraint since all processors are always executing events with the same timestamp [15].

attached processor. The processor loads configuration data into the FPGA during a 100

ms delay using memory-mapped registers [49]. There are four input/output ports on each

FPGA. The north port connects to the processor bus. The east and west FPGA ports are

connected to the adjacent FPGAs forming a ring topology. The south port is generally

free, but can be connected to other processor nodes to form different communications

topologies.

The logic layer can provide either application speedups or other services [49]. Al-

gorithms or functions are implemented in the FPGAs which serve as local accelerators

with data exchanges between the FPGA and the processor. The processor writes values

into the FPGA registers allowing the FPGA to perform its configured calculations [15].

The processor then reads its results back. Local FPGA-based accelerators can be fed

and controlled using the MIMD framework. Experimentation has shown that the system

throughput is limited by the processor read/write speed. The FPGA and the processor

are synchronized via the processor’s interrupt signal line. The FPGA to FPGA com-

munications can be either synchronous, taking advantage of the same clock signal, or

asynchronous, using ready/ack signal lines.

When implementing synchronous parallel event-driven simulation, all processors

send flags to their associated FPGAs indicating the need to synchronize. The processors

set up a synchronization barrier for each simulation cycle. When every processor reaches

the barrier and sends the appropriate signal, node 0 issues a restart signal.
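The complete synchronous cycle can be modeled sequentially in a few lines. In the C++ sketch below, which is purely illustrative (the `Node`, `gvt_of`, and `simulate` names are ours, and the sequential loop stands in for the hardware barrier and restart signal), each cycle executes exactly those events whose timestamp equals the global minimum, as in Table 3.2:

```cpp
#include <functional>
#include <limits>
#include <queue>
#include <vector>

// Illustrative model of the synchronous protocol: each node keeps a
// min-ordered queue of event timestamps; per cycle, only events whose
// timestamp equals the GVT execute, so causality is never violated.
using Clock = long;

struct Node {
    std::priority_queue<Clock, std::vector<Clock>, std::greater<Clock>> events;
};

Clock gvt_of(const std::vector<Node>& nodes) {     // global minimum of the LVTs
    Clock g = std::numeric_limits<Clock>::max();
    for (const auto& n : nodes)
        if (!n.events.empty() && n.events.top() < g) g = n.events.top();
    return g;
}

int simulate(std::vector<Node>& nodes) {           // returns events executed
    int executed = 0;
    for (Clock gvt = gvt_of(nodes);
         gvt != std::numeric_limits<Clock>::max();
         gvt = gvt_of(nodes))
        for (auto& n : nodes)
            while (!n.events.empty() && n.events.top() == gvt) {
                n.events.pop();                    // "model evaluation"
                ++executed;
            }
    return executed;
}
```

In the real machine the two steps of each iteration, the barrier and the GVT computation, are exactly the two global control operations that ArMen moves into its FPGA logic layer.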

To compute the global minimum time, which is the time of the next event, the

ArMen machine implements the following strategy, illustrated in Figure 3.33. Each

FPGA computes the minimum of its own node’s next event time value and the local time

Fig. 3.32. The ArMen Architecture. Each ArMen node has a processor connected via its system bus to a bank of memory and an FPGA. The FPGA can be dynamically reconfigured by the attached processor. The input/output ports of each FPGA are divided into four ports. The north port connects to the processor bus. The east and west FPGA ports are connected to the adjacent FPGAs, forming a ring topology.


minimums for the nodes to its left and right. Each node writes its local minimum in its

associated FPGA. After n levels of vertical pipelining and shifting of data between the

FPGAs, the minimum over 2n+1 values is computed. So, as an example, if the minimum

timestamp needed to be computed among 21 timestamp values, then 10 levels of vertical

pipelining would be required. If N > 2n+1, where N is the number of processors in the

system, then the vertical pipeline is executed again with the results from the previous

run until at least N/2 levels are computed. The computation broadcasts the result to all

processors. All the processors of the MIMD network have to contribute to the calculation

at the same time under control of the simulation kernel. The ArMen can switch from

MIMD to Single Program Multiple Data (SPMD) mode to assist in the computation.

SPMD differs from SIMD in that the program is replicated and stored and executed at

several (i.e. multiple) nodes, as opposed to the SIMD model where the instructions are

transmitted one by one to the processing elements.
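The shifting scheme can be mimicked in software. In the sketch below, which is a functional model only (not ArMen's FPGA implementation; the function name is ours), each of the N ring nodes replaces its value with the minimum of itself and its two neighbours at every level, so after n levels each node holds the minimum over a window of 2n + 1 values:

```cpp
#include <algorithm>
#include <vector>

// Functional model of the ring minimum computation of Figure 3.33.
// One "level" corresponds to one shift of data between adjacent FPGAs.
std::vector<long> ring_min_levels(std::vector<long> v, int levels) {
    const int N = static_cast<int>(v.size());
    for (int step = 0; step < levels; ++step) {
        std::vector<long> next(N);
        for (int i = 0; i < N; ++i) {
            long west = v[(i - 1 + N) % N];  // value shifted in from the left
            long east = v[(i + 1) % N];      // value shifted in from the right
            next[i] = std::min(std::min(west, east), v[i]);
        }
        v = next;
    }
    return v;  // after n levels: min over a 2n+1 window at each node
}
```

With 2n + 1 values and n levels every node ends up holding the global minimum, matching the 21-value, 10-level example above.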

Compared to a pure software implementation using the same operating system,

the ArMen group reports a speedup of 40 for the global minimum computation on four

32-bit integer values and one level of the FPGA pipeline. The remaining latency is attributed

to the delay required by the processor interrupt signal. With 120 processors, a speedup

of 600 is expected.

The advantages and disadvantages of the ArMen system are:

1. Advantages

• Capable of running random event-driven simulations.

• The system is scalable.

Fig. 3.33. Digital-Serial Implementation of the Global Minimum Computation and Broadcast. The method of comparing each local next event time stored at every node to come up with a global minimum next event time is illustrated in this figure. After n levels of vertical pipelining and shifting of data between the FPGAs, the minimum over 2n + 1 values is computed.


2. Disadvantages

• Finding the global virtual time requires an O(n^2) algorithm.

• Lacks a reduction network.

• Uses a ring topology, which does not scale well.

3.3 Optimistic Processing

Section 3.3 reviews an optimistic state-saving hardware approach to discrete event

simulation. This hardware device is used to allow optimistic simulation to proceed by

the application of a state saving technique. Checkpoints are saved in the event that the

optimistic path taken turns out to be incorrect, so that the previously saved point in the

simulation can be quickly restored.

In July of 1988, Fujimoto et al. [58] proposed a special purpose hardware design

called the Rollback Chip (RBC) for parallel discrete event-driven simulation. The RBC is

designed to work with the Time Warp mechanism which handles difficult clock synchro-

nization problems. Time Warp relies on a lookahead and rollback mechanism to achieve

widespread exploitation of parallelism. The state of each process must be periodically

saved, and when necessary, the process must be rolled back to a previously checkpointed

time.

The Rollback Chip (RBC) is a type of memory management unit and data cache

combined into a single component [58]. The chip was specifically designed to work with

the Time Warp mechanism developed by D. R. Jefferson [80].


Instead of putting data into “protected” areas of memory, the RBC manipulates

the addresses generated by the CPU in order to avoid overwriting old values which may be

required for a future rollback operation. The RBC is designed to be embedded in every

computation node of a multi-node system in order to assist with rollback operations for

that local node. The processors perform optimistic simulation, computing as far ahead

as possible and then rolling back if a straggler or late event arrives with a timestamp

which is earlier than the node’s local simulation clock time.

The RBC provides the processor it serves with version controlled memory (VCM).

Version controlled memory is identical to normal read/write memory, except that a

process may “mark” the state of the memory as one which may later need to be restored

via a rollback operation. In a parallel simulation, the processors issue a mark operation

after processing a simulation event. Simulation variables which are subject to rollback

and therefore require state saving, must be stored in version controlled memory.

The RBC supports six operations. The reset operation initializes the RBC. The

Mark operation preserves the current state of version controlled memory. A write(A,D)

operation writes data D into the location at memory address A. A read(A) operation

reads the most recently written version of data associated with address A (excluding

rolled back write operations) and returns the data. A rollback(k) operation restores the

version controlled memory to the kth previously marked state (k > 0). Finally, the last

operation is advance(k), in which the k oldest marked states are discarded and

can be reclaimed as available memory space. This memory reclamation is called fossil

collection and is similar to garbage collection, but it also performs additional irrevocable

operations such as I/O [58].
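The semantics of these six operations can be captured in a short functional model. The C++ sketch below keeps each marked state as a delta map of writes; this reproduces the VCM behaviour only, since the real RBC achieves the same effect by manipulating addresses in hardware rather than by copying (the class and method names are ours):

```cpp
#include <unordered_map>
#include <vector>

// Functional model of the RBC's version controlled memory (VCM).
// Each frame holds the writes made since the previous mark(); reads
// search from the newest frame backwards, so rolled-back writes are
// excluded automatically once their frames are discarded.
class VersionControlledMemory {
public:
    VersionControlledMemory() { reset(); }
    void reset() { frames_.assign(1, {}); }            // RBC reset
    void mark()  { frames_.emplace_back(); }           // preserve current state
    void write(int addr, int data) { frames_.back()[addr] = data; }
    int read(int addr) const {                         // most recent version
        for (auto it = frames_.rbegin(); it != frames_.rend(); ++it) {
            auto f = it->find(addr);
            if (f != it->end()) return f->second;
        }
        return 0;                                      // unwritten locations read as 0
    }
    void rollback(int k) {                             // restore k-th previous mark
        for (int i = 0; i < k && frames_.size() > 1; ++i) frames_.pop_back();
    }
    void advance(int k) {                              // fossil-collect k oldest marks
        for (int i = 0; i < k && frames_.size() > 1; ++i) {
            for (const auto& kv : frames_[0])          // fold oldest into successor;
                frames_[1].insert(kv);                 // newer values are kept
            frames_.erase(frames_.begin());
        }
    }
private:
    std::vector<std::unordered_map<int, int>> frames_;
};
```

The model makes the cost trade-off visible: every read may have to search back through the marked frames, which is the software analogue of the extra address calculation the hardware RBC performs on every memory access.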


The processor can invoke the reset, mark, rollback, and advance operations by

writing into the RBC’s control registers which are memory mapped to the CPU’s address

space. The RBC read and write operations represent normal CPU read/write operations

to specific program variables. Context switch controls are represented by additional

registers in the RBC.

The advantages and disadvantages of this system are:

1. Advantages

• Allows optimistic simulation processing.

• Provides a fast state-restoring technique.

2. Disadvantages

• Extra address calculations are required with every memory access.

• Larger memory space is required.

3.4 Non-Deterministic Simulation

Section 3.4 gives a brief introduction to the Ising Spin model. This model inspired

three groups of physicists [76, 106, 118] to develop random number generators which are

applicable for seeding statistical distributions.

The simulation machines discussed in Section 3.1 are deterministic logic simula-

tors. An important feature which distinguishes this work from previous simulators is

its enhanced ability to model non-deterministic simulation. However, non-deterministic

number generation has been both required and created by previous architectures. These


special purpose architectures created processors which were designed for one specific ap-

plication, the Ising Spin model. In statistical mechanics, the calculation of static and

dynamic properties of Ising Spin systems by means of the Monte-Carlo method is today

a standard technique [76]. Ising Spin models have been compared to a stochastic vari-

ation of cellular automata, similar to the popular Game of Life [107, 106] and to the

Hamiltonian Path problem [131].

Another analogous comparison is the model of an array of pixels composed of

one or more bits which each interact with other adjacent pixels. The exact meaning of

the pixel depends on the nature of the simulation. In the case of cellular automata, the

pixel is a cell, in the simulation of a discretized fluid, it is a region of space that can

accommodate gas molecules, and in the case of statistical physics, each pixel represents

a discrete degree of freedom of a many-body system. In the last case, the pixel is called

a spin if the system to be simulated is a magnetic system [106].

Specifically, the Ising Spin Model is composed of discrete variables, Si, called

spins, which take on one of two values, up (+1) or down (−1), and occupy the sites

of a regular or random D-dimensional lattice, where D = 1, 2, 3, . . . as illustrated in

Figure 3.34. The Ising model was first used as a model for the behavior of magnetic ma-

terials. A magnetic material consists of a large number of regularly located microscopic

magnetic moments ( or dipoles ) which are also called spins because they arise from

angular momentum or spin properties of electrons. In the Ising Model, these dipoles are

only allowed to point in one of two opposite directions.

“When high accuracy is required or complex systems are studied, one is severely

limited by the amounts of computing time that is needed. Large amounts of computation

Fig. 3.34. The Ising Spin Model. A two-dimensional array of Ising Spins is illustrated. Similar to the model of a lattice of magnetic moments, the elements can have either up or down spin. Figures of three-dimensional spin models can be found in [131].

are necessary because the accuracy obtained in a Monte-Carlo study is proportional to 1/√N, where N is the number of iterations of the algorithm. For many computations that

we wish to do . . . the cost of performing the computation on a general purpose processor

is prohibitive.” [118] In order to accelerate their computations, the physicists developed

a special purpose processor for performing Monte-Carlo simulations on a particular class

of problems, the three dimensional Ising models. Despite its modest cost, this machine

is faster than the fastest supercomputers on the one particular problem for which it was

designed. Details of the algorithm required and the architecture of the processor developed

can be found in [118, 76]. Here, the interest is specifically in the uniform random number

generation techniques which were developed.


3.4.1 Hoogland, Spaa, Selman, and Compagner

To generate uniformly distributed random numbers, Hoogland employed a Feed-

back Shift Register algorithm [76] which consists of a 127-bit shift register with feedback

on the input of the first bit. If we denote the nth bit of the sequence by xn, where

n = 0, 1, 2, . . . , 15, the Feedback Shift Register algorithm used is [76]:

x_n = (x_{n+(127−2(16−1))} + x_{n+(127−1(16−1))}) mod 2 = (x_{n+97} + x_{n+112}) mod 2        (3.1)

Selecting values for p and q of Figure 3.35 which produce the maximum length

sequences, the maximum non-repeating period is 2^127 − 1 [76]. The 32-bit random num-

bers are selected out of the 127-bit sequence at intervals of 32 clock cycles. Figure 3.35

illustrates the circuit used to accomplish the random number generation. Every 32 clock

cycles produce a new 32-bit random number. An accelerated version which produces a

new uniformly distributed random number every 2 clock cycles is shown in Figure 3.36.
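A software model of the generator of Figure 3.35 is easy to express. In the C++ sketch below, the register length P = 127 comes from the text, while the tap pair (Q, P) = (97, 127) is our reading of the offsets in Equation 3.1; as the text notes, the taps must be chosen to give a maximum-length sequence, so the constants should be checked against [76] before reuse:

```cpp
#include <bitset>
#include <cstdint>

// Software model of the two-tap feedback shift register of Figure 3.35.
// The feedback bit is the modulo-2 sum (XOR) of positions Q and P; after
// L shifts the L output positions are completely refreshed and form the
// next random number. Tap positions are an assumption, not verified.
class FeedbackShiftRegister {
public:
    static constexpr int P = 127, Q = 97, L = 32;
    explicit FeedbackShiftRegister(std::bitset<P> seed) : state_(seed) {}

    void shift() {
        bool fb = state_[Q - 1] ^ state_[P - 1];  // feedback from taps q and p
        state_ <<= 1;                             // position 1 -> 2, ..., p is lost
        state_[0] = fb;                           // feedback enters position 1
    }

    uint32_t next() {                             // one L-bit random number
        for (int i = 0; i < L; ++i) shift();
        uint32_t r = 0;
        for (int i = 0; i < L; ++i) r |= uint32_t(state_[i]) << i;
        return r;
    }
private:
    std::bitset<P> state_;
};
```

Hoogland's accelerated design of Figure 3.36 computes 16 feedback bits combinationally per clock, which is why it emits a number every 2 cycles rather than every 32 shifts as this sequential model does.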

3.4.2 Monaghan & Pearson, Richardson, and Toussant

Both Monaghan's [106] and Pearson's [118] random number generation work directly parallels Section 3.4.1, except that the random numbers generated are 8 and 24 bits wide, respectively. One of the most salient points of Pearson's work is not the

discussion of the success of the random number generator described above, but of their

earlier failure with a different Random Number Generator (RNG) which was based on

the Linear Congruence Algorithm [87]. During testing, small but significant discrepan-

cies were discovered after comparisons with known results. The average magnetization in

Fig. 3.35. A General Two-Tap Feedback Shift Register. A general two-tap feedback shift register of p bits is used to generate a random number of L bits. From the initial state of the register, a sequence of p bits, the next state is produced by inserting the modulo-2 sum of the feedback bits in positions q and p into position 1 of the shift register. In the shift register, the original bit in position 1 is shifted into position 2, and so on up to position p, the original contents of which is lost. If all bits in the register are initially zero, the shift register will remain in that state forever. If the shift register progresses through the other 2^p − 1 states before repeating, a maximum-length sequence is produced. After L shifts, the contents of the L positions used to generate the random number are completely refreshed and the next random number can be read [76].

Fig. 3.36. Random Number Generator. The actual design implemented by Hoogland differs from Figure 3.35 in that this design produces one random number per two clock cycles, instead of one every 32 clock cycles, where L = 32. This design is composed of 16 8-bit shift registers and additional logic to compute the L required bits in parallel. The result of the single synchronized shift in each register, coupled with the 16 feedback circuits, is equivalent to 16 shifts of the circuit of Figure 3.35. The random number for this circuit is read from bit positions 96 through 127 [76].


the high temperature phase and at zero magnetic field of the Ising model is strictly zero.

But long runs with different addends using the original linear congruence random number generator produced magnetizations as large as 0.01, which were reproducible across different random number seeds [118]. As a result, the random

number generator was redesigned following Figure 3.35. With the new generator design,

which is described above, the tests yielded consistent results with zero magnetizations.

3.5 Reduction Buses

Some multi-processor implementations include a specially designed bus which

fosters a combination of both communications and computation. These buses are often

referred to as reduction buses. The CM-5 [77] contains a reduction bus tying its separate

processing units together. Another reduction bus which has a relatively high level of

functionality is the Parallel Reduction Network developed by Reynolds [116]. This bus

is the subject of Section 3.5.1.

3.5.1 Parallel Reduction Network

Reynolds [116] proposed a Parallel Reduction Network (PRN) bus which both

computes and disseminates different binary, associative operations across state vectors

of values. State vectors, composed of subcomponent values, are passed through the

reduction bus. Instructions traveling with the data can request, for example, that all of the first components of the state vectors be added together and that all of the second components of a two-component state vector be OR-ed together. In the network, the hardware reads

state vectors of size m, computes m globally reduced values, and writes a globally reduced


state vector. Separate auxiliary processor units interface with the PRN to handle the

expected load of traffic.

Figure 3.37 illustrates a binary reduction tree of depth log2 n, where n is the num-

ber of processors in the multi-processor network. Each node of the PRN is an Arithmetic

and Logic Unit (ALU) with some additional logic for tagged selective operations. Global

reduction operations can be computed and disseminated in O(log n) time.

A single ALU is illustrated in Figure 3.38. The ALUs perform binary associative

operations on the two inputs based on a programmed operation code which accompanies

the state vectors as they flow through the PRN. The operations include sum, minimum,

maximum, logical AND, logical OR, etc. Operations including minimum and maximum

support tagged selective operations. A tag, chosen by the selector, accompanies the

winning value of the binary operation. Additionally, an error check is performed on

the two incoming opcodes. If they do not match, an error condition is set in the tag

registers denoting the problem with the resulting state vector. The PRN also pipelines

the reduction operations at a rate which equals the delay time of each stage.
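A functional model of one reduction pass is straightforward. In the C++ sketch below (the opcode set is a subset of the operations listed above, and the names are ours; the real PRN carries the opcode alongside each state-vector component and checks it at every node), values are combined pairwise, level by level, exactly as the binary ALU tree of depth log2 n would:

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Illustrative opcodes; the PRN also supports maximum, logical AND, etc.
enum class Op { Sum, Min, Or };

long alu(Op op, long a, long b) {       // one ALU node of Figure 3.38
    switch (op) {
        case Op::Sum: return a + b;
        case Op::Min: return std::min(a, b);
        case Op::Or:  return a | b;
    }
    return 0;
}

// Combine the n processor inputs pairwise, level by level, as the
// reduction tree of Figure 3.37 would; the single result would then be
// broadcast back down to every auxiliary processor.
long prn_reduce(Op op, std::vector<long> v) {
    while (v.size() > 1) {
        std::vector<long> next;
        for (std::size_t i = 0; i + 1 < v.size(); i += 2)
            next.push_back(alu(op, v[i], v[i + 1]));
        if (v.size() % 2 != 0) next.push_back(v.back());  // odd leftover passes up
        v = std::move(next);
    }
    return v[0];
}
```

Because each while-iteration models one tree level, the loop runs O(log n) times, which is the source of the O(log n) compute-and-disseminate bound quoted above.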

Reynolds’ design has both advantages and disadvantages:

1. Advantages

• Allows the simultaneous computation and dissemination of data throughout

the network.

• The computation does not impact the processors.

2. Disadvantages

• May suffer from geometric constraints in large networks.

Fig. 3.37. Parallel Reduction Network. Reynolds [116] designed a reduction bus referred to as the Parallel Reduction Network (PRN), which is presented as a k-ary tree of depth log_k n, where n is the number of processors in the network. Each node of the tree is an Arithmetic and Logic Unit (ALU) with some logic for tagged selective operations. Each Auxiliary Processor (AP) has sets of memory-mapped input and output registers. The PRN reads values from the input registers and writes the corresponding globally reduced results to the AP output registers. An interlock mechanism prevents memory access contention. The tree allows a global reduction operation to be computed and disseminated in O(log n) time.

Fig. 3.38. PRN Arithmetic and Logical Unit Node. A single Arithmetic and Logical Unit (ALU) node of the PRN network of Figure 3.37 is illustrated. The ALUs perform binary operations on two inputs based on a programmed operation code which accompanies the inputs; operations include sum, minimum, maximum, logical AND, logical OR, etc. Each input register of the ALU is paired with a Tag register. The ALU supports tagged selective operations whereby a tag, chosen by the selector, accompanies the winning value of each binary operation. An error check is performed on the two incoming opcodes. If they do not match, an error condition is set in the tag registers denoting a problem with the resulting state vector.


Chapter 4

Software Traffic Simulation

The primary focus of this work is the creation of an architecture for a non-

deterministic traffic simulation machine. In order to clearly concentrate acceleration

efforts, software models were used as a guide for the hardware development. The software models included in this thesis were studied in three phases. In determining whether

this project was worth pursuing, the small, simple code modules of Section 4.1 were used to

establish what types of speedup can be obtained to accelerate discrete event simulation.

Once the initial publications [24, 25, 26] had established that this simulator work is both desired and justified, a study of a representative and well-established traffic simulator,

CORSIM, was undertaken. This study is described in Section 4.2. Since CORSIM is not

an open source, free software simulator, sharing verifiable results is not practical using

CORSIM as a standard for comparison. Other possible candidates for study were rejected

for similar reasons. The final stage of the simulator work did require a system to verify

the accuracy of the selected Scheduler algorithm employed in Section 7.2.3. Therefore, as

a separate effort, the Trafix simulator was created. Unlike conventional simulators, Trafix is open source, free, and modular. It is briefly described in Section 4.3.


4.1 Event Generation & Queue

As part of the initial studies used to gauge the effectiveness and direction of

the selected approach, the event generation and the event queue of Figure 1.1 were first

examined. An event generator was constructed using software which was then translated

to a reconfigurable logic implementation. The same sequence was performed on the event

queue segment. In Section 4.1.1, the event generation software was first implemented

following the methods applied in [141]. This method facilitated a fine-grained, parallel,

systolic hardware implementation in Section 7.2.1.

In Section 4.1.2, the Event Queue software applied standard GNU C++ classes to

manage both the event queue and the random distribution calculations. Therefore the

event generation code was re-written in Section 4.1.2, so that standardized code could

be applied, and attention focused on the queuing software.

4.1.1 Event Generation Software

An abbreviated software outline is listed in Table 4.1. In the C++ code, first,

the Poisson event arrival offset, τ , is calculated according to Equation 4.1 [141]. In

Equation 4.1, Ω1, or rand1 in the code, is an independent random variable uniformly

distributed over [0,1). λ, or LAMBDA in the code, is the object or event arrival rate.

The event generator dynamically allocates space for the new event, s, and enqueues the

s object.

τ = -(1/λ) log Ω1     (4.1)


The resulting τx values generated by this equation can be seen in Figure 4.1. τ0

is the distance from the beginning of the timeline to the first event. τ1 is the distance

from the beginning of the first event to the beginning of the second event, and so on.

New event arrival times are calculated by adding the arrival offset, τ , to the previous

event arrival time. The clock is then advanced to the new event arrival time. Service

events which overlap at the end of the current timeline segment are carried over into the

next segment.

The Poisson service time, σ, is calculated according to Equation 4.2 [141]:

σ = -(1/µ) log Ω2     (4.2)

where µ, given as an average number of events per second, is the object or event

service rate. µ is the same as MU in Table 4.1. The σx values generated by Equation 4.2

are illustrated in Figure 4.1 to be the offsets from the beginning to the end of event x.

Ω2, or rand2 in the code, is also an independent random variable which is uni-

formly distributed over [0,1). The service time offset, σ, is added as an offset to the

event’s arrival time to determine the end of the event’s service time. Event resources

are released at the end of this service time. Both σ and τ are independent and exponentially distributed. In software, allocating memory and then generating the Poisson arrival and service times requires approximately 30,500 nanoseconds on an UltraSPARC.
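Equations 4.1 and 4.2 are both instances of inverse-transform sampling of an exponential distribution. The step can be sketched as follows; the function name and the guard against a zero uniform variate are additions of this sketch, not part of the thesis code:

```cpp
#include <cmath>
#include <random>

// Inverse-transform sampling as in Equations 4.1 and 4.2: an exponential
// offset (tau or sigma) is produced from a uniform variate Omega on [0,1).
// 'rate' stands for lambda (arrivals) or mu (service).
double exponential_offset(double rate, std::mt19937 &gen) {
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    double omega = uni(gen);
    while (omega == 0.0) omega = uni(gen); // avoid log(0)
    return -std::log(omega) / rate;        // tau = -(1/rate) * log(Omega)
}
```

Successive arrival times are then running sums of these offsets, matching the timeline construction of Figure 4.1.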


Fig. 4.1. Simulation Timeline Generation. [Figure: two timeline segments showing arrival offsets τ0 through τ7 and service offsets σ0 through σ7, with service events overlapping the segment boundary carried into the next segment.] Each succeeding arrival starts at an offset of τx from the previous arrival. Similarly, each service time σx is an offset from event x's corresponding arrival time. These dependencies, which constrain event arrival time and event service time generation, appear to prevent speedup through parallelism.

event* s = new event();

clock = s->arrival = clock - (1/LAMBDA)*log(rand1);

s->service = - (1/MU)*log(rand2) + s->arrival;

Table 4.1. Event Generation Code I. The initial event generation implementation followed Walrand [141], creating random arrival and service times as fine-grained, parallel, discrete steps in a systolic array. The approach is also illustrated in the hardware event generation block diagrams of Section 7.2.1.


4.1.2 Event Queue Software

This section focuses on the software used for the service queue. The software

version is implemented as a GNU LIBG++ XPPQ Priority Queue class. In the C++

software simulation, the time required for the insertion and extraction of events to and

from the event queue increases as the queue strays from its optimum size. The proposed

hardware queue speed, on the other hand, is not affected by its size, and provides a 10² speedup over the software model.
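The size dependence of the software queue follows from its heap structure: each insertion and extraction costs O(log n) comparisons, so cost grows with queue size, whereas the proposed hardware queue is size-independent. A minimal sketch of the software side, using `std::priority_queue` as a stand-in for the older GNU LIBG++ XPPQ class:

```cpp
#include <queue>
#include <vector>

// Minimal event record ordered by time; the earliest event has priority.
struct Event {
    double time;
    bool operator>(const Event &o) const { return time > o.time; }
};

using EventQueue =
    std::priority_queue<Event, std::vector<Event>, std::greater<Event>>;

// Each deq/enq pair costs O(log n) heap operations, which is why software
// queue time drifts upward as the queue strays from its optimum size.
double run_events(EventQueue &q, int n) {
    double last = 0.0;
    for (int i = 0; i < n; ++i) {
        Event e = q.top();           // dequeue the earliest event...
        q.pop();
        last = e.time;
        q.push({e.time + 1.0});      // ...and enqueue a successor event
    }
    return last;
}
```

The successor offset of 1.0 is arbitrary for illustration; the simulation draws it from the distributions of Section 4.1.1.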

The software simulation model used for comparison is written in C++ and is

illustrated in Tables 4.2 and 4.3. Some additional processing is performed when the

event data structure is allocated. The arrival and service queues are maintained as a single heap data structure, unlike the dual queue hardware mechanism of the processing elements in the proposed architecture, which is described in Section 7.2.


while (create_event_cnt <= num_events) {
    create_event_cnt++;
    arrival_time = clock + rnd1();
    service_time = rnd2();
    Event_Class *s =
        new Event_Class(arrival_time,
                        service_time);
    queue->enq(*s);
};

Table 4.2. Event Generation Code II. The Event Generation code allocates an event with an arrival time which is a random offset from the previous event's arrival time. The service time for the event is then selected to be a random offset from its own arrival time. The two random values need not necessarily use the same statistical distribution. The event is also constructed to randomly require resources when it is executed by the scheduler. This code differs from Table 4.1 in that GNU LIBG++ standard classes are applied. Table 4.1 creates its random offsets using distribution methods from Walrand [141].


while (events <= num_events) {
    events++;
    // included in speedup test
    Event_Class event = queue->deq();
    if (event.getArrival() == true) {
        if ((event.res.get_a() <= a_resource_counter) &&
            (event.res.get_b() <= b_resource_counter)) {
            a_resource_counter -= event.res.get_a();
            b_resource_counter -= event.res.get_b();
            event.SetNextArrivalTime();
            // push service event
            // enq included in speedup test
            queue->enq(event);
        } else {
            if (event.res.get_a() >= a_resource_counter)
                block_a++;
            if (event.res.get_b() >= b_resource_counter)
                block_b++;
            // requeue a replacement event
            arrival_time = clock + rnd1();
            service_time = rnd2();
            Event_Class *arriv = new Event_Class(arrival_time,
                                                 service_time);
            // enq not included in speedup test
            queue->enq(*arriv);
        }
    } else {
        a_resource_counter += event.res.get_a();
        b_resource_counter += event.res.get_b();
    }
};

Table 4.3. Event Queue Loop Code. The arrival and service queues are maintained as a single heap data structure, unlike the proposed dual queue hardware mechanism illustrated in Figure 7.3. If the dequeued event is an arrival event, then the resources available are compared against the resources required by the event. If the required resources are available, a service event is enqueued. If resources are unavailable, the event is recorded as a blocked event. When a service event is dequeued, its resources are returned to the available resources pool. To gather accurate timing results, the number of events in the event queue is kept constant. The extra time used to generate additional arrival events in order to maintain the queue size is not included in the speedup plot of Figure 7.12.


4.2 CORSIM: An Established Software Simulator

As part of the effort to develop a profile of a traffic simulator, CORSIM (COR-

ridor SIMulator) was selected as a representative software simulation model. CORSIM

microscopically models vehicular traffic flows and emissions, and accounts for pedestrians.

Developed in Fortran by the Federal Highway Administration (FHWA), CORSIM is

part of the TRAF family of simulation models. CORSIM combines TRAF-NETSIM,

a simulation model of non-freeway traffic, and FRESIM, a simulation model of freeway

traffic [37]. NETSIM, the older of the two simulators, grew out of the Urban Traffic

Control System, developed for mainframes in the early 1970s. The CORSIM model

and its components comprise one of the first traffic simulation environments of its kind.

CORSIM has been widely used in the traffic engineering community and claims to have

been calibrated and validated in a wide variety of traffic and highway design conditions.

The FHWA granted special access to study and to evaluate the CORSIM source code as

part of the research generated for use in this thesis and its related publications.

In current applications, CORSIM is used to evaluate alternatives planned for

highway networks [37, 41]; it may be used, for example, to evaluate new traffic signal

optimization strategies. The runtime required by the simulator has caused CORSIM to

be used in off-line applications only. Real time applications, however, are becoming more

prevalent in transportation engineering, and in such applications, speed is critical. A

study of CORSIM runtime characteristics determined that the processor tended to dwell

in simulation scheduling and overhead routines [27]. Therefore, attempts to accelerate

traffic simulation need to accelerate or eliminate overhead and event scheduling.


4.2.1 CORSIM Function Categories

Following Figure 1.1, CORSIM functions were categorized according to the Ta-

ble 4.4 classifications. The event generation, event list, scheduling, and timer classifi-

cations are derived from the simulation model depicted in Figure 1.1. Two additional

categories of overhead and statistics are not depicted in the figure, but are required by

simulators.

4.2.2 NT versus Linux

The CORSIM code was delivered as a win32 based software system which compiles

and runs under the Microsoft Fortran Compiler. This compiler is a Fortran 90 compiler

which is integrated with Microsoft’s Developer Studio. The compiler is now maintained

and developed by Compaq. As CORSIM was delivered using the Microsoft compiler,

there was strong incentive for our decision to profile under NT.

In order to study the code better, and as part of its procurement, the source was

ported to the Linux operating system. A code translation program called VAST/f90

from Pacific-Sierra Research was selected as the compiler under Linux. The VAST/f90

system translates the Fortran 90 code to an intermediate Fortran 77 version. Then

GNU’s g77 compiler is called to compile the intermediate result. The VAST/f90 system

uses a library of routines to emulate Fortran 90’s memory allocation and deallocation

routines. Some of the advantages of the VAST/f90 compiler include its target operating

system, Linux, and its freely available personal version. The incorporation of the g77

compiler provides the binaries produced by VAST/f90 with the advantages of profiling


Classification          Description
event list              Majority of code queues or dequeues objects waiting for an
                        event. Handles traffic queues at lights and on links, as
                        well as other queues of data structures.
event generator         Functions which generate or sink vehicles in the
                        simulation.
overhead                Functions which include general mathematics calculations,
                        memory allocation, data error checks, etc.
scheduling              Majority of code executes and schedules events. Traffic
                        routing across multiple road links as well as within each
                        link and intersection. This classification also handles
                        traffic signal events and pedestrian action.
scheduling/event list   Functions which are combinations of the scheduling and
                        event list categories in an approximately 50%-50% mix of
                        functionality.
shutdown                Simulation shutdown functions, such as flushing data and
                        closing out files.
statistics              Functions which calculate general traffic statistics and
                        results. These statistics include vehicle speed, stops,
                        delays, hours of travel, miles of travel, fuel
                        consumption, and emissions.
timer                   Functions which control the simulation clocks and timers.

Table 4.4. CORSIM Function Classifications. The CORSIM functions were classified according to these eight categories. CORSIM functions often contained a myriad of category functionality, but were classified according to the majority of the function code. In some cases, the subroutines performed approximately 50% routing and 50% event list work, so an additional combinational category was added. The event generation, event list, scheduling, and timer categories are derived from Figure 1.1.


using gprof and debugging with gdb, the GNU debugger. VAST/f90’s disadvantages

include significant delays during memory allocation and deallocation.

Unlike NT, profiling under Linux includes operating system and compiler related

functions. To compare the profiling results, the non-CORSIM Linux overhead func-

tions were removed from their respective profile before the chart and graph values were

computed.

4.2.3 CORSIM Profile

CORSIM was profiled under the NT operating system as a stand-alone application without the rest of TSIS, and the Linux port was profiled in the same manner. Perl scripts were written to parse the resulting profile data for both operating systems. A single classification file, used for both the NT and Linux functions, was parsed by multiple profiling scripts to maintain a common set of criteria among the profiling data sets. The runtime statistics from 20 CORSIM traffic models were then averaged and joined with the classification categories based on the CORSIM function names. The pie chart data in Figures 4.2 and 4.3 are the result of these scripts. Additional Perl scripts were written to parse and compute the simulation runtime statistics of Table 4.5.
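The reduction step performed by the Perl scripts amounts to joining per-function runtimes with the classification table and totaling by category. A sketch of that join, written in C++ for consistency with the rest of the thesis code (the function and category names here are illustrative only):

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Join per-function runtimes with a function-to-category classification and
// total the runtime by category, as the profiling scripts do. Functions not
// found in the classification file default to "overhead" in this sketch.
std::map<std::string, double>
totals_by_category(const std::vector<std::pair<std::string, double>> &profile,
                   const std::map<std::string, std::string> &classification) {
    std::map<std::string, double> totals;
    for (const auto &entry : profile) {
        auto it = classification.find(entry.first);
        const std::string &cat =
            (it != classification.end()) ? it->second : "overhead";
        totals[cat] += entry.second;
    }
    return totals;
}
```

Dividing each total by the grand total then yields the category percentages plotted in Figures 4.2 and 4.3.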

CORSIM functions were examined and categorized according to the eight clas-

sifications described in Table 4.4. CORSIM functions often contain code which falls


under more than one of the classifications listed. Therefore, the functions were classi-

fied according to the major functionality of their code. There were some subroutines

whose functionality was approximately a 50%-50% split between scheduling and event

queue maintenance, so these functions were put in their own classification, called schedul-

ing/event list. A large percentage of CORSIM functions included in the overhead cate-

gory were devoted to parsing and error checking the simulation input data files. Instead

of having generalized functions to bounds check and parse, there were usually specialized

functions for each input line type in the .trf file.

Our study of CORSIM is motivated by our desire to determine the bottlenecks of

the simulation model illustrated in Figure 1.1. After the CORSIM functions were cate-

gorized, the profiling data derived using the NT tools, PREP, PROFILE, and PLIST,

was joined with the function classifications and reduced to yield the pie chart shown

in Figure 4.2. This figure illustrates the percentage of CORSIM runtime devoted to

each category of simulation function. When run under the NT operating system, COR-

SIM dwells mostly in its scheduling and overhead functions. Therefore, the simulation

architecture proposal must carefully consider their acceleration.

Table 4.5 lists the simulation models from the Georgia Institute of Technology’s

Civil Engineering CORSIM repository which were used to compute and average the

CORSIM simulation results. The listed runtimes indicate values derived from their

respective profiling datasets. Please note that at the time these files were generated,

code optimization flags were active under NT but not under Linux due to a compiler

bug. So, the code is expected to execute faster on Linux when compiler optimizations

become available.


Function    Runtime Linux   Runtime NT
actctrl          5.32          4.71
ca101001         9.77         10.08
ca101002        10.36         10.31
intch1          12.70          9.81
mmbs            10.04         10.16
mmd1             8.84          9.65
mmf1             9.39         10.22
mmp1             9.99         10.06
mmp2             9.86         10.12
mms1            10.28         10.18
opt 80          32.88         12.48
proj3           33.36         12.82
rabs            37.23         26.80
rad1            33.60         29.31
raf1            33.47         28.99
rap1            34.78         25.19
rap2            33.31         29.30
ras1            33.50         29.20
scen1          259.42        193.62
scen3          228.26        173.20
Averages        40.78         31.25

Table 4.5. CORSIM Runtime Under Linux and NT. The various Georgia Institute of Technology CORSIM repository models which were used to test CORSIM under both Linux and NT are listed along with their respective runtimes. The NT profiler only provides results for the CORSIM functions, so the operating system and compiler-generated functions were culled from the Linux results. Note that although the NT results appear to run faster by about 16.5%, these results indicate the amount of CPU time used by the simulation, and are not necessarily the duration the user waited for their results. For example, the time values neglect operating system and compiler function overheads which will be present under both NT and Linux. All times shown are in seconds.


Fig. 4.2. Profile Chart of CORSIM on NT. [Pie chart: scheduling 46%, overhead 44%, event_list 6%, statistics 4%, scheduling/event_list 0%, event_generator 0%, timer 0%, shutdown 0%.] Illustrated are the percentages of CORSIM runtime used by eight categories of simulation functions when run under the NT operating system. The graph represents an average of the 20 simulations from Table 4.5 run under the NT operating system and profiled using NT's profiling tools. The CORSIM functions were classified into the eight categories of scheduling, scheduling/event list, event list, timer, event generator, overhead, statistics, and shutdown. These categories are described in Table 4.4.

The graphs in Figures 4.2 and 4.3 illustrate the importance of accelerating the

CORSIM overhead and scheduling categories. A new proposed simulation accelerator

must focus on these two simulation components.

The first CORSIM category, overhead, is dominated by its data integrity routines

which read data from input files, verify that data, and then store the results for later

retrieval. The proposed architecture assists in alleviating much of the overhead required

by CORSIM. For example, with the reconfigurable logic approach, the simulator system

must be configured before it is used, and error checking on the input data occurs once

during initialization. The network of roads and the scheduling and routing algorithms

need to be implemented in reconfigurable hardware before the system starts the simu-

lation. Much of the data which is input into the CORSIM simulation is configured as


Fig. 4.3. Profile Chart of CORSIM on Linux. [Pie chart: scheduling 56%, overhead 26%, event_list 9%, scheduling/event_list 9%, statistics 1%, event_generator 0%, timer 0%, shutdown 0%.] This chart is the same as the graph presented in Figure 4.2, but produced under Linux. The same dataset of 20 simulation models was run and averaged to create the pie chart, which depicts CORSIM categories according to their dwell time. The GNU gprof profiler was used to produce this data chart.

hardware in the proposed simulator. The setup is based on selections from available

configurations or sub-configuration model segments.

The version of CORSIM provided with version 4.2 of the TSIS package, having

been constructed over time, is not modular in its software functionality. Routines reg-

ularly blend input data integrity, event list handling, and event scheduling functions.

CORSIM source code is not generally publicly available for research study and compari-

son. For these and a myriad of other reasons, a second simulator, Trafix, was developed

for modeling the traffic scheduling software functionality. Trafix was developed in C++

and is open source. Its development models follow the research provided in [146].


4.3 Trafix: A Road Traffic Simulator

In working with road traffic simulators, it becomes immediately clear that the

traffic research and management community requires a standard open source, free traf-

fic simulator. The simulator should also have a standard general input file format, so

that simulations can be easily tested under various simulators for verification. The free

software, open source approach allows researchers to test various traffic theories on a

standardized software platform in a reliable and reproducible fashion. Further, an inex-

pensive system can provide smaller municipalities with access to a tool for making their

own local road networks and traffic signal timing schemes more efficient. Maximizing

traffic throughput and minimizing delay benefits trade and tourism both domestically

and internationally. The potential windfall is large.

The Trafix simulator was developed in C++ with the GNU gcc compiler under the

GNU-Linux operating system. The program uses Xfig, which has created its own input

file format standard, to generate an input file describing the simulation road network.

The code is written to be modular so that various components can be replaced as the user

community requires. The overall modular design concept is overviewed in Figure 4.4.

For instance, attempts have been made to allow the code to be easily changed in the future, altering the current dependence on Xfig input files. Trafix displays its animated

output in X windows as illustrated in Figure 4.5. In addition to using Xfig for its input

and X windows for output, Trafix employs the STLPORT Standard Template Library

(STL) routines wherever expedient to foster the reuse of code which is intended to both

lead to efficiency and reduce errors.


STLPORT is a more portable, cross-platform version of the Silicon Graphics Standard Template Library (SGI STL). The SGI STL maintains the current implementation of the Hewlett-Packard Company's original version of the STL. Trafix employs the

STL container classes whose interfaces are described by Bjarne Stroustrup [134]. These

classes include standard library container objects and iterators which are used to main-

tain various simulation objects. For example, in Trafix, to maintain a list of vehicles, an

STL vector of vehicles is generated. This vector can then be stepped through by use of

the vector class iterator.
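The vehicle-list pattern described above can be sketched as follows; the `Vehicle` fields shown are illustrative placeholders, as the real Trafix class carries far more state:

```cpp
#include <vector>

// Illustrative vehicle record; the actual Trafix class is richer.
struct Vehicle {
    double position;   // distance along the current road
    double speed;      // current speed
};

// Step through an STL vector of vehicles with the vector class iterator,
// as the text describes, advancing each vehicle by one time step.
void advance_all(std::vector<Vehicle> &vehicles, double dt) {
    for (std::vector<Vehicle>::iterator it = vehicles.begin();
         it != vehicles.end(); ++it) {
        it->position += it->speed * dt;
    }
}
```

The container is non-intrusive: `Vehicle` needs no modification to be stored in the vector, which is one of the STL benefits enumerated below.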

The STL container classes provide programmers with a variety of benefits. In-

dividual containers are simple and efficient. Each container provides a set of standard

operations with standard names and semantics. Individual container classes may also

provide operations which are specific to a particular container class. The same classes

also provide a set of standard common iterators to efficiently access the container mem-

bers. The container classes are non-intrusive, that is the objects need not be modified

to be stored within the containers. The containers each take an allocator argument

which can be used as a handle for implementing services for every container. The allo-

cator greatly eases the provision of universal services which may include persistence and

object I/O. The STL benefits are further enumerated in [134].

At this time, Trafix forks two processes. The first process displays the two win-

dows, each containing maps. One window holds input map symbols which have been

used to generate the simulation, and the second depicts the background map for the

animated traffic display. The first process is intended to eventually migrate into a more

suitable user interface as community interest materializes. The second process handles


Fig. 4.4. The Trafix Software Structure. [Diagram of three layers: a simulation layer (Road Q, Intersection Q, Timeline), a symbolic layer (Road, Place, Intersection), and a physical layer (Xfig objects and attributes; X Windows objects).] Written to be modular, the Trafix software is composed of three levels. The bottom, physical, level allows Trafix to interface with its input and output systems. As currently written, Trafix reads its input from Xfig files and animates its output in X Windows displays. The middle, symbolic, layer serves as an intermediate level between the simulator and its physical files and converts raw data into conceptual objects. These objects include roads, intersections, and places. The top layer consists of simulation objects including such concepts as timelines, road queues, and intersection queues.

Fig. 4.5. The Trafix Display. A view of the Trafix simulator is illustrated. Vehicles are moving from the left and bottom lanes to the top and right. Turning decisions at the second intersection depend on the vehicle's randomly assigned destination. Cars, buses, and trucks are represented as different sized and colored boxes.


the animation of the vehicle traffic. The animation is visible only on the single graphic

map display. Trafix simulates car, bus and truck traffic moving through intersections

and along roads.

An input example corresponding to the Trafix network displayed in Figure 4.5 is

illustrated in Figure 4.6. The figure shows an Xfig editor with a Trafix input map drawn.

The Trafix library includes sources, destinations, and intersections. Vehicle sources are

drawn as triangles, vehicle sinks are circles and the three-way intersections should be

obvious from the drawing context. The Trafix software parses the Xfig drawing files

and generates input geometric objects based on the Xfig map drawing. These Xfig data

objects consist of the raw data which define the Xfig drawing. These raw data types

include such geometric shapes as lines, circles, ellipses, compounds, etc. which are all

based on the Xfig input file format. Source, destination, intersection, and road attributes

are included in the drawn object comments.

Once the raw, physical layer objects are created from the parsed Xfig drawing

file, the geometric shapes are used to compute symbolic road network objects. Drawn

lines are used to create road objects. The Trafix library compounds composed of boxes,

triangles, and circles are used to generate source, destination, and intersection objects.

Trafix performs some network error checking to insure that the network is fully connected.

A symbolic traffic map is created and used to generate a third level of simulation data

types as shown in Figure 4.4. This third level of Trafix objects includes timelines, vehicle

queues, and intersection queues.

A Trafix simulation proceeds by executing an event loop. The interior of the loop

is divided into two stages. The first stage of the loop checks network source nodes from


Fig. 4.6. Trafix Input Environment. The Trafix input environment incorporates Xfig, an open source drawing package. Xfig allows user-generated libraries of objects, which Trafix supplies. For Trafix, the current library consists of two-, three-, and four-way intersections, along with source and destination nodes. Roads consist of drawn lines. Object characteristics are included in the object comments. A road speed limit can be set by adding a speed attribute in the drawn road line comment. Xfig comments are not visible by default, but can be accessed using the Edit tool on the Editing modes tool bar. Object names are set either by including a name attribute in the object comments or by creating a compound consisting of a text object with the desired name along with the selected object to receive the name. Using compounds to add names to objects allows the name to be visible on the drawn map. Intersections of degree greater than four must be made by combinations of smaller intersections. Intersections are composed of boxes. Roads to and from the intersection must end inside of the intersection's associated peripheral boxes, one road per peripheral box. Source nodes, which generate vehicles, are represented by triangles, and destination nodes, represented by circles, are vehicle sinks. Roads leading to or exiting from a source or sink must have one end within the source or destination object.


which vehicles erupt at particular simulation time cycles. Each source node contains a

timeline segment of vehicle events consisting of their arrival times based on a user selected

random distribution. The arrival distribution characteristics are entered by the user as

attributes in the Xfig Trafix library source symbols. These erupting vehicle objects are

popped off the timeline and injected into the traffic network if there is sufficient space

on the road at the source node. If space is lacking, the event is blocked. The second

stage of the event loop processes all vehicles already in motion in the simulation network.

Vehicles in motion are updated in a time-driven fashion. Leaders are moved first. Vehicle

movement routines are located in the vehicle.cc source file.
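The two-stage loop described above can be condensed into a skeleton. This is a hypothetical sketch of the structure, not the Trafix source itself (the actual movement routines live in vehicle.cc); all names here are illustrative:

```cpp
#include <queue>
#include <vector>

// Hypothetical condensed skeleton of stage 1 of the Trafix event loop.
struct SourceEvent { double arrival; };

struct Source {
    std::queue<SourceEvent> timeline;  // per-source arrival timeline segment
    int road_space = 0;                // free space on the road at the source
    int blocked = 0;                   // arrivals blocked for lack of space
};

// Stage 1: pop vehicles whose arrival time has come off each source timeline
// and inject them into the network if there is space; otherwise block them.
int inject_arrivals(std::vector<Source> &sources, double now) {
    int injected = 0;
    for (auto &s : sources) {
        while (!s.timeline.empty() && s.timeline.front().arrival <= now) {
            s.timeline.pop();
            if (s.road_space > 0) { --s.road_space; ++injected; }
            else                  { ++s.blocked; }
        }
    }
    return injected;
}
// Stage 2 (not shown) would then update all vehicles already in motion in a
// time-driven fashion, moving leaders first.
```

The arrival times on each timeline would be drawn from the user-selected random distribution entered as attributes on the Xfig source symbols.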

Trafix was created to allow a verification of the car-following acceleration schemes

employed. No suitable open source traffic simulator was available when the project

started. The initial intent of Trafix is not to serve as a stand-alone finished traffic

simulation program. However, incidental materials were added as convenient, and hooks

are available in the software, such that the software has become a good starting point

towards a finished traffic simulator. The code was released and is available as a free

software, open source project with the intent of providing others with a valuable starting

point from which to improve. The Trafix simulator code is GNU-public licensed and

available from its web site at http://trafix.sourceforge.net.

The vehicle movement functions from Trafix which are used to verify the algo-

rithms and equations used in Section 7.2.3 were tested and timed on a 600 MHz Pentium III running SuSE Linux 7.0. The software runtimes measured are averages for each cited vehicle movement function. The timing results are shown in Table 4.6.


Function                            Runtime in µs
vehicle_road_initialize                     30.43
move_vehicle_on_road                        36.76
vehicle_intersection_initialize             23.19
move_vehicle_thru_intersection              48.40

Table 4.6. Scheduler Software Function Profile. The four modular vehicle movement functions from Trafix were timed on a 600 MHz Pentium III running SuSE Linux 7.0. These functions are contained in the vehicle.cc source file. The times presented are in microseconds. The software bottleneck is in the intersection handling function. The time shown is the time elapsed during function execution. The functions are executed each time a vehicle is processed.

4.3.1 A Shared, Pooled Allocator

The design approach in Trafix generates two processes. One process is needed

to provide a fast response to the user. A second process is required to execute the simulation event loop and therefore becomes bound by it. These two processes need to

communicate easily. UNIX provides a variety of methods to allow cross-process commu-

nications including semaphores, pipes, and shared memory. A natural solution for this

application is to allocate the simulation objects in shared memory and allow both the

user and event loop processes access to the same objects. Then, if the user selects to

change the scale of the map, the simulation process immediately sees the change and

can alter its calculations to adopt the newly selected map scale factors in its vehicle

movement computations.

In C++, local (auto) objects are stored on the run-time stack, symbols are bound

directly to the local objects, and storage management is performed via stack-based mark

and release strategies in which enough space to hold all locals is allocated, all at once,


upon entry to a block, and released upon exit. Globals or statics have similar properties.

Programmers who desire other access and lifetime control strategies must use the new

operator in order to create objects on the free store, and then explicitly manage their

storage. While block structuring accounts for some of the efficiency and flexibility advan-

tages of C++ over other languages, it is not without its cost. In many object-oriented

applications, difficulties arise in using block structures to obtain the desired effect in

controlling storage lifetimes, and requiring users to manually employ the new and delete

operators leads to error-prone results. Container classes offer an attractive alternative

to this manual manipulation of memory. Besides their value in organizing groups of ob-

jects as data structures, container classes are perhaps the best means for ensuring that

groups of objects have coexistent lifetimes. In other words, containers serve as a way of

extending the scope or lifetime rules of C++. Knowing that all objects created within

some collection exist, unless explicitly removed, until the collection itself is destroyed,

can help minimize a good deal of awkwardness and error-prone code [95].

The C++ language provides a programming concept, referred to as an alloca-

tor, which is used to insulate programmers from the details of physical memory. The

allocator provides standard methods and a standard interface for allocating and deal-

locating memory along with standard names of types used as pointers and references.

Further, the STLport library, which is the Standard Template Library implementation employed by

Trafix, is well suited to employ a standard allocator. However, although it would seem

obvious that a programmer might need to allocate a container class in shared memory, a

shared memory allocator class was not readily available. The STL code is written to be

portable across platforms and operating systems. Memory allocation is very operating


system specific. Therefore, the desired shared memory allocator was not part of the STL

library. The STL classes are, however, written properly to accept and use an allocator

if provided.

Because the Trafix code required a C++ shared-memory allocator class, one was written and now resides at http://allocator.sourceforge.net. A link from www.stlport.org points to the allocator web site. The allocator uses the standard memory template to allocate shared memory on the GNU/Linux operating system. The

allocator creates large blocks of shared memory and then manages the memory, allo-

cating and deallocating sub-portions of it as required by the STL container classes in

the form of chunks. The allocator keeps track of free memory using a bit vector. The

allocator works with the STLport container classes. Additional general information

on allocators can be found in [134].


Chapter 5

Analysis

Chapter 5 performs some basic mathematical analysis of simulation properties.

Section 5.1 uses mathematics to determine whether event versus time-driven simula-

tion yields results faster under particular constraints. Section 5.2 reviews some of the

constraints involved in wrapping a traffic map on an array of processors.

Deciding between running the simulator in event or time-driven mode is important

for simulations in which event processing is not continuous. As an example of non-

continuous event processing, consider the case of simulating telephone calls where the

calls are temporarily assigned virtual circuits within the communications network. For

this example, the circuits are the resources required by each event. An event generator

creates a sequence of calls which are placed into an event queue. The calls can be initiated

if their required circuits are available when they actually execute. The executed call

then temporarily depletes the circuit it uses from the available pool. However, in the

telephone simulation, the circuit does not require continuous adjustment or modification.

The execution of the telephone call simply needs to schedule a secondary event which

will return the circuits to the available pool for other calls to use when the current call

completes. Logic simulation is a similar example, although deterministic. When a gate

executes, it changes the state of its output signals, but it does not need continuous event

processing. The gate only needs attention when one of its input signals is scheduled to


change. Contrast these examples with traffic simulation. The event generator schedules

new vehicles to enter the traffic network. The vehicle arrival times are enqueued in the

event queue. At the appropriate time, the vehicle is popped off the event queue and

enters the traffic network. Once moving within the network, the vehicle’s acceleration,

velocity, and position require constant updates. Traffic is a type of simulation which

requires continuous event processing by the scheduler during every simulation cycle.

Since traffic needs attention every simulation cycle, the model naturally falls into

the time-driven mode of simulation. Virtual circuit telephone simulation and logic simu-

lation may be better suited for an event-driven model. Additional hardware, referenced

in Section 7.3.2, can be included in the simulator design to accelerate event-driven simulation.

5.1 Event versus Time-Driven Simulation

The analysis in this section compares and examines event versus time-driven sim-

ulation. Section 5.1.1 illustrates the maximum speedup which can be expected from

running under an event versus a time-driven approach. Section 5.1.2 uses statistics to

find the solution point which optimizes runtime by selecting between an event versus a

time-driven mode.

5.1.1 Expected Advantage of Event vs Time-Driven Simulation

Felderman [52] compares two distributed processing methods which are analogous

to event and time-driven simulation. One method is asynchronous and the other is

synchronous. In the synchronous method, after each subtask is completed, all processes


must reach a barrier before being allowed to process the next sub-task. The synchronous

method is analogous to the time-driven approach. The asynchronous method does not

require the barrier. Individual jobs run to completion as fast as they progress. Felderman

demonstrates that the asynchronous method has an expected potential speedup over the synchronous method of no more than ln P, where P is the number of processors used. So the speedup gained from event-driven processing over time-driven processing will be no more than ln P.

5.1.2 Decision between Event vs Time-Driven Modes

When confronted with a network of random event generators, the time of the next expected event can be calculated using order statistics. Frequently, an objective

is to determine the fastest car in a race or the heaviest mouse among those fed a certain

diet [126]. Similarly, random variables can be ordered according to their magnitudes.

For this work, the shortest expected arrival time must be found in order to determine

which simulation approach, time or event-driven, is the most appropriate.

Let X1, X2, . . . , Xn denote independent continuous random variables which have

distribution functions shown in Equation 5.1.

F1(x), F2(x), . . . , Fn(x) (5.1)


The distribution functions of Equation 5.1 have the corresponding density functions of

Equation 5.2.

f1(x), f2(x), . . . , fn(x) (5.2)

Ordered random variables, Xi, are denoted X(1), X(2), . . . , X(n) where X(1) ≤

X(2) ≤ . . . ≤ X(n). For continuous random variables, ties occur with probability zero, allowing the equality signs to be dropped.

So the maximum value of Xi is

X(n) = max(X1, X2, . . . , Xn) (5.3)

and the minimum value is

X(1) = min(X1, X2, . . . , Xn) (5.4)


For this work, the goal is to determine the minimum next expected event time which is

X(1). The density function of X(1), denoted g1(x) can be found as:

P[X(1) ≤ x] = 1 − P[X(1) > x]
            = 1 − P(X1 > x, X2 > x, . . . , Xn > x)
            = 1 − [1 − F1(x)] [1 − F2(x)] · · · [1 − Fn(x)]     (5.5)

Taking the derivative of both sides yields the density function,

g1(x) = f1(x) [1 − F2(x)] · · · [1 − Fn(x)]
      + [1 − F1(x)] f2(x) · · · [1 − Fn(x)] + . . .
      + [1 − F1(x)] [1 − F2(x)] · · · fn(x)     (5.6)

The expected time of the next arrival event can then be calculated by finding the

expectation of g1(x) as follows:

E(x) = ∫₀^∞ x g1(x) dx     (5.7)

For an actual simulator, the computation of Equation 5.7 would be automated

given that the user supplies the appropriate F (x) and f(x). For the purposes of this


thesis, the expected minimum timestamp is derived for two sample distributions, the Exponential and the Weibull. The resulting equations are used to derive the

results of Chapter 9.

5.1.3 Exponentially Distributed Example

As an example, the next expected event time for a network of two exponentially

distributed event generators will be calculated. The exponential density function is

provided in Equation 5.8:

f(x) = (1/θ) e^(−x/θ)     (5.8)

The exponential cumulative distribution function can then be calculated as:

F(x) = 0 for x < 0

F(x) = P(X ≤ x) = ∫₀^x (1/θ) e^(−t/θ) dt = [−e^(−t/θ)]₀^x = 1 − e^(−x/θ) for x ≥ 0     (5.9)

The density function for this example contains two generators, so Equation 5.6

simplifies to:

g1(x) = f1(x) [1− F2(x)] + [1− F1(x)] f2(x) (5.10)


Substituting equations 5.8 and 5.9 into Equation 5.10 yields the following proba-

bility density function, g1(x):

g1(x) = (1/θ1) e^(−x/θ1) [1 − (1 − e^(−x/θ2))] + [1 − (1 − e^(−x/θ1))] (1/θ2) e^(−x/θ2)

     = (1/θ1) e^(−x/θ1) e^(−x/θ2) + e^(−x/θ1) (1/θ2) e^(−x/θ2)

     = (1/θ1) e^(−(θ1+θ2)x/(θ1θ2)) + (1/θ2) e^(−(θ1+θ2)x/(θ1θ2))

     = (1/θ1 + 1/θ2) e^(−(θ1+θ2)x/(θ1θ2))     (5.11)

Next, the expected value of the minimum is derived by plugging Equation 5.11 into Equation 5.7:

E(x) = ∫₀^∞ x (1/θ1 + 1/θ2) e^(−(θ1+θ2)x/(θ1θ2)) dx

     = (1/θ1 + 1/θ2) ∫₀^∞ x e^(−αx) dx,  where α = (θ1+θ2)/(θ1θ2)     (5.12)

Equation 5.12 can be simplified by applying the following Γ(n) definition:

Γ(n) = ∫₀^∞ x^(n−1) e^(−x) dx     (5.13)


The situation in Equation 5.12 is slightly different due to the α in the exponent.

We can massage Equation 5.12 by using the substitution y = αx and dy = αdx. We

derive Equation 5.14:

∫₀^∞ x^(n−1) e^(−αx) dx = ∫₀^∞ (y/α)^(n−1) e^(−y) (1/α) dy

                        = (1/α)^n ∫₀^∞ y^(n−1) e^(−y) dy

                        = (1/α)^n Γ(n)     (5.14)

The results of Equation 5.14 and the substitution for α can be plugged back

into Equation 5.12. For the two generator example of Equation 5.12, n = 2. Note that Γ(n) = (n − 1)!.

E(x) = (1/θ1 + 1/θ2) Γ(n) (1/α)^n

     = (1/θ1 + 1/θ2) Γ(2) (1/α)^2

     = (1/θ1 + 1/θ2) (2 − 1)! (θ1θ2/(θ1+θ2))^2

     = θ1θ2/(θ1+θ2)     (5.15)


We can interpret the results of Equation 5.15 as follows. If the mean of the first

process, θ1, is 5 seconds, and the mean of the second process, θ2, is 10 seconds, then the

expected minimum arrival time of the two processes is given by Equation 5.15 as:

E(x) = θ1θ2/(θ1 + θ2) = (5 · 10)/(5 + 10) = 3.33 seconds     (5.16)

For three exponentially distributed event generators, the next expected event for

x would occur at time:

E(x) = θ1θ2θ3/(θ2θ3 + θ1θ3 + θ1θ2)     (5.17)

Assuming that all event generators create events with the same θ value, the ex-

pected value of the next event in Equation 5.17 can be generalized to Equation 5.18.

E(x) = θ/N     (5.18)

5.1.4 Weibull Distribution Example

As a second example, the expected minimum is calculated for independent, identically distributed (IID) Weibull event generators. This example uses a two-source experiment similar to Equation 5.16. The Weibull distribution has the density function found in Equation 5.19.

f(x) = (γ/θ) x^(γ−1) e^(−x^γ/θ) for x > 0     (5.19)


Equation 5.19 can be integrated directly to derive the cumulative distribution

function in Equation 5.20.

F(x) = ∫₀^x (γ/θ) y^(γ−1) e^(−y^γ/θ) dy = [−e^(−y^γ/θ)]₀^x = 1 − e^(−x^γ/θ), x > 0     (5.20)

Equations 5.19 and 5.20 can be inserted into Equation 5.10. Also, if IIDs are

assumed, then γ1 = γ2 and θ1 = θ2 allowing additional simplifications:

g1(x) = ((γ1/θ1) x^(γ1−1) e^(−x^γ1/θ1)) (e^(−x^γ2/θ2)) + (e^(−x^γ1/θ1)) ((γ2/θ2) x^(γ2−1) e^(−x^γ2/θ2))

     = (2γ/θ) x^(γ−1) e^(−2x^γ/θ),  where θ1 = θ2 = θ and γ1 = γ2 = γ     (5.21)

To find the expected minimum, the results of Equation 5.21 are inserted into

Equation 5.7:

E(x) = ∫₀^∞ (2γ/θ) x^γ e^(−2x^γ/θ) dx     (5.22)

If we then let y = x^γ, so that x = y^(1/γ) and dx = (1/γ) y^((1−γ)/γ) dy, we can substitute these results into Equation 5.22 to obtain:

E(x) = (2γ/θ) ∫₀^∞ y e^(−2y/θ) (1/γ) y^((1−γ)/γ) dy

     = (2/θ) ∫₀^∞ y^(1/γ) e^(−2y/θ) dy     (5.23)


Employing the Gamma function substitution listed in Equation 5.24, where a = 2/θ and n = 1/γ, we find a solution for Equation 5.23.

∫₀^∞ x^n e^(−ax) dx = Γ(n + 1)/a^(n+1)     (5.24)

E(x) = (2/θ) ∫₀^∞ y^(1/γ) e^(−2y/θ) dy

     = (2/θ) Γ(1/γ + 1) / (2/θ)^(1/γ+1)

     = (2/θ)^(−1/γ) Γ(1/γ + 1)     (5.25)

Equation 5.25 has a nice solution for γ = 1 or γ = 2. For the latter case, the

formula yields:

E(x) = (2/θ)^(−1/2) Γ(1/2 + 1)

     = (2/θ)^(−1/2) (1/2) Γ(1/2)

     = (θ/2)^(1/2) (1/2) √π     (5.26)

Equation 5.25 can be generalized for N independent, identically distributed (IID) generators as:

E(x) = (N/θ)^(−1/γ) Γ(1/γ + 1)     (5.27)


5.2 Topology: Traffic Map Layout

Much of the simulator design acceleration is facilitated by judicious use of inherent data locality. Data locality is dependent on adjacent nodes in the simulation being physically adjacent within the simulator. Communications delays are a factor due to the accelerated speed of the machine. The inter-processing element communications speed is shown to be comparable to the simulation event processing speed in Section 7.3.6. Therefore, the question arises: what is the maximum distance expected between any two traffic map sections when the map is divided and assigned to processing arrays in the simulator?

Fishburn [53] shows that for a graph, Gn,m, with nm vertices arranged in n rows and m ≥ max{n, 2} columns, with an edge (u, v) between vertices u and v if the vertices are adjacent either horizontally or vertically, the bandwidth of Gn,m equals n. For the example of a 4 by 6 map, G4,6, shown in Figure 5.1 and arranged judiciously into the stack of the same figure, the largest discontinuous jump required between the normal connections of the original map squares in the stack would be 4 vertical array sections, the value of n. Therefore, in the simulator, if the target simulation map is divided into sections and distributed on the simulator, in terms of communications, the largest discontinuous jump depends on min{rows, columns} of the simulated traffic map.


Fig. 5.1. Wrapping a Traffic Map onto the Simulator The Traffic network to besimulated is partitioned and assigned to the processing elements of the simulator. Assumingthat the map is divided into an m by n matrix of subsections where n ≤ m, what isthe largest discontinuity between the resulting tiles? For the stack of tiles illustrated,Fishburn [53] shows that when the tiles are laid out in a down-diagonal lexicographicallinear arrangement, the maximum discontinuity is expected to be n. This value providesan estimate of the connectivity required by the simulator.


column-row lexicographic:

1  5  9 13 17 21
2  6 10 14 18 22
3  7 11 15 19 23
4  8 12 16 20 24

down-diagonal lexicographic:

1  3  6 10 14 18
2  5  9 13 17 21
4  8 12 16 20 23
7 11 15 19 22 24

row-column lexicographic:

 1  2  3  4  5  6
 7  8  9 10 11 12
13 14 15 16 17 18
19 20 21 22 23 24

up-diagonal lexicographic:

 1  2  4  7 11 15
 3  5  8 12 16 19
 6  9 13 17 20 22
10 14 18 21 23 24

Fig. 5.2. Different Lexicographical Map Layouts Four linear arrangements from [53]illustrate different possible layouts for the map sections of Figure 5.1. The algorithm usedto derive the graphs is evident from their construction. The column-row and the down-diagonal lexicographic arrangements yield bandwidth arrangements with |f(u)− f(v)| = 4.The other maps are not bandwidth arrangements because |f(u) − f(v)| = 5 for the up-diagonal lexicographic arrangement and |f(u)− f(v)| = 6 for the row-column lexicographicarrangement.


Chapter 6

Design Methods

The overall design employs a few general methods of hardware design to optimize

and accelerate its performance. Chapter 6 reviews various architecture elements which

are applied within the design as described in Chapter 7. Section 6.1 describes Recon-

figurable Logic which has been applied throughout the system. Parallelism in the form

of Systolic Arrays is described in Section 6.2. Section 6.3 discusses Content Addressable

Memory which is applied to Event Generation in Section 7.2.1. Finally, Section 6.4

reviews a Reduction Bus.

6.1 Reconfigurable Logic

The major new technology which facilitates the proposed solution is the applica-

tion of reconfigurable logic [51]. Because of the acceleration dependence on reconfigurable

logic, this section will be covered in detail. Reconfigurable logic is difficult to understand

without a brief overview of FPGAs, the selected source of reconfigurable logic. Recon-

figurable logic has several advantages:

• More application specific than processors.

• No instruction fetch.

• Allows high degrees of parallelism.


• Allows reuse of silicon.

• Provides flexible redesign which is more affordable than ASICs.

A Field Programmable Gate Array is an array of uncommitted elements whose in-

terconnections can be programmed by the user. In 1985, the Xilinx company introduced

the first FPGA. Typical FPGAs consist of logic blocks with programmable interconnects.

The interconnects are wiring segments of the chip which may be of varying lengths.

Switches connect the logic blocks to the wiring segments. A design is implemented by

partitioning the design among the FPGA’s logic blocks and by using the interconnect to

route signals appropriately.

FPGA logic block structures vary widely. They may contain combinational logic,

multiplexors and lookup tables. Many logic blocks also contain flip-flops to assist in the

implementation of sequential circuits [22].

Reconfigurable logic consists of a collection of elementary units which can be flex-

ibly connected together to form larger functions. Reconfigurability allows more than one

custom circuit to run on a given piece of silicon. Reconfigurable logic can be config-

ured while the FPGAs remain within their resident systems. The elementary gates can

also be reconfigured as the system is running which is called “on the fly” configuration.

However, there is often a performance impact or delay when “on the fly” reconfiguration

is implemented. For discrete event-driven simulations, reconfigurable logic allows the

user to select from a smorgasbord of random statistical distributions and implement the

choice as hardware. So the statistical models which result are unconstrained in form and

faster than their software counterparts.


Field Programmable Gate Arrays (FPGAs) were selected as the reconfigurable

logic. FPGAs increase the application speed, but maintain the reconfigurability of soft-

ware. Further, the FPGAs are reconfigurable in place, meaning that the chip needs not

be removed from the board to be reprogrammed. Unlike traditional microprocessors,

reconfigurable logic instructions are directly embedded in the technology and need not

be fetched from memory and executed sequentially. Commercially available FPGAs have

been categorized into four classifications [22]:

• Symmetrical Array

• Row Based

• Hierarchical Programmable Logic Device

• Sea-of-Gates

The two major FPGA competitors are Xilinx and Altera. Xilinx follows a Sym-

metric Array architecture, which consists of a two-dimensional array of logic blocks

interconnected by both horizontal and vertical routing channels as shown in Figures 6.1

and 6.2. Connections which traverse different numbers of switching matrices will have

different delays, making accurate FPGA simulation timing difficult.

Xilinx Configurable Logic Blocks (CLBs) use Lookup Tables (LUT) which can

generate functions based on stored values. The Xilinx XC4000 CLB, depicted in Fig-

ure 6.3, is capable of implementing two independent functions of 4 variables, a single

function of five variables, or some functions of up to nine variables. The two CLB outputs

can be either combinational or registered.



Fig. 6.1. General Xilinx FPGA Architecture. Xilinx FPGA architectures generallyconsist of a two dimensional array of programmable Configurable Logic Blocks (CLBs).The FPGAs contain horizontal and vertical routing channels running between the rows andcolumns of the CLBs. The programmable resources may be controlled by setting staticRAM cell values [22].



Fig. 6.2. Xilinx Architecture Interconnects. The horizontal and vertical routing linesconnect at the FPGA switch matrices. Single length lines are intended for short connectionsor connections which do not have critical timing requirements. The XC3000 series alsocontains Direct Interconnect lines which allow the CLBs to reach their neighbors on theright, top, and bottom. For connections which span a distance of more than one CLB, theconnections are made via the wiring segments and the switch matrices. Connections routedthrough the switches incur significant routing delays [22]. Special long lines traverse theentire width or length of the chip. The long lines cross at least one switch and are used tointerconnect several CLBs with minimum delay [22].



Fig. 6.3. The Xilinx XC4000 Configurable Logic Block. The Xilinx XC4000 Con-figurable Logic Block (CLB) utilizes a two-stage arrangement of lookup tables (LUT) thatyields a greater logic capacity per CLB than the XC3000’s single stage LUT. The ar-rangement allows the CLB to implement two independent functions of 4 variables, a singlefunction of five variables, or a function of up to nine variables. The two CLB outputs canbe either combinational or registered.


Altera FPGAs consist of a Hierarchical Array of Programmable Logic Devices

(PLDs). Altera refers to its devices as Complex Programmable Logic Devices (CPLDs)

and distinguishes CPLDs from FPGAs based on their interconnect structures. The

segmented interconnect structure of FPGAs is distinguished by its use of multiple metal

lines of varying lengths, joined by pass transistors or anti-fuses, to connect logic cells. In

contrast, the continuous interconnect structure of CPLDs uses seamless metal lines to provide logic cell-to-logic cell connectivity [5]. Brown et al. classify these Hierarchical Arrays as FPGAs because the Altera devices consist of a two-

dimensional array of programmable blocks and a programmable routing structure [22].

In this thesis, the conventions found in [22] are followed and the devices are referred to as

FPGAs. The Altera devices also implement multi-level logic and are user programmable.

Figure 6.4 illustrates the block diagram of the Altera Flex 10K architecture. In

the center of the figure resides the embedded array which consists of a series of embedded

array blocks (EABs). EABs can function as either memory or logic. When configured

as memory, the EABs each provide 2K of bits which can be used to create RAM, ROM,

FIFO functions, or dual-port RAM. EABs can also be configured into complex logic

functions, such as multipliers, micro-controllers, state machines, and DSP functions [4].

EABs can be used independently or ganged together. A more detailed diagram of the

EAB implementation can be found in Figure 6.5.

The logic array, consisting of logic array blocks (LABs), is featured in Figure 6.4.

Each LAB contains eight logic elements (LEs) and a local interconnect. The LEs consist

of a 4-input LUT, a programmable flip-flop, and dedicated signal paths for cascaded



Fig. 6.4. Block Diagram of the Altera Flex 10K Architecture. Each group of logicelements (LEs) is combined into a logic array block (LAB). The LABs are arranged intorows and columns. Each row contains a single embedded array block (EAB). The LABsand the EABs are interconnected forming a network. Chip I/O elements (IOEs) are locatedat the end of each row and column [5].



Fig. 6.5. Diagram of the Altera Embedded Array Block (EAB). The EmbeddedArray Block (EAB) is a flexible block of RAM with registers on the input and output ports.Logic functions may be created by programming the EAB with a read-only pattern duringconfiguration, creating a large Lookup Table (LUT). The large capacity of the EAB allowscomplex logic functions to be implemented in one level without routing delays. The EABscan also implement large, dedicated blocks of RAM which eliminate the timing and routingconcerns of competitor FPGAs. The competitor FPGAs must often string together smallerdistributed RAM blocks to allocate larger memories [5].


functions. The eight LEs can be used to create functions such as 8-bit counters, address

decoders, or state machines. The internals of an LE are illustrated in Figure 6.6.

The Sea-of-Gates format consists of logic blocks in which the interconnection net-

work is overlayed on the logic blocks themselves. The Sea-of-Gates approach is not

commonly used. Row-Based FPGAs consist of a multiplexor based interconnection net-

work which employs anti-fuse technology. Anti-fuse connections are normally open, high

impedance connections. Programming permanently closes the appropriate connections

by melting a dielectric and thereby configures the connections on the chip. Actel manu-

factures some row-based devices.

Although reconfigurable logic facilitates algorithm implementations which are

faster than analogous software, the application of Application Specific Integrated Circuits

(ASIC) would yield still faster results. ASICs are silicon chips fabricated to perform a

single specific task. ASICs are generally not configurable, and their rigid nature makes

them less appropriate for application in a more general purpose simulator. Reconfig-

urable logic serves as a compromise between the slow but flexible general purpose pro-

cessor and the overly rigid ASIC approach. Whereas ASICs would allow the user access

to a limited array of statistical models, FPGAs allow the user to implement a statistical

model for event generation of the user’s choosing. Section 7.2.3 uses reconfigurable logic

to flexibly implement a configurable arrangement of functional units. Reconfigurable

logic preserves the parallel processing aspect of ASICs along with permitting some of

the flexible programming of the general purpose processor.



Fig. 6.6. Diagram of the Altera Logic Element (LE). The Altera Flex 10K LogicElement (LE) is the smallest unit of logic in the architecture. Each LE contains a 4-inputLUT. Additionally, each LE contains a programmable flip-flop with a synchronous enablewhich can be configured as a D, T, JK or SR flip-flop. The LE also contains a carry chainwhich supports high speed counters and adders, and a cascade chain which implementswide-input (large fan-in) functions with minimum delay. Carry and cascade chains canconnect all LEs in a Logic Array Block (LAB) and all LABs in the same row [5].


The Altera FPGA was selected over Xilinx for a couple of important reasons.

First, the manufacturer claims that the hierarchical architecture of the FPGA allows re-

alistic timing simulation. The continuous global vertical and horizontal routing structure

is claimed to provide more predictable performance as opposed to other manufacturer

approaches which employ a routine segmented interconnect with switch matrices. The

Altera interconnect ensures predictable performance and accurate simulation and timing

analysis. This predictable performance contrasts with that of other manufacturers, which

use a segmented connection scheme and therefore have unpredictable performance [5, pg.

69]. The competitor FPGAs must often string together smaller distributed RAM blocks

to allocate larger memories causing timing and routing concerns.

Reconfiguration allows the semantic expressiveness of very large instructions without paying the commensurate bandwidth and deep storage costs for these powerful instructions. The sacrifice made in developing this solution is the ability to change the entire instruction on every cycle. Reconfiguration opens a middle ground, or an intermediate binding time, between behavior which is hardwired at fabrication time and behavior which is specified on a cycle-by-cycle basis. This middle ground is useful to consider in the design of any kind of computing device, not just conventional FPGAs [47].

6.2 Systolic Arrays

In this work, the event generation hardware is implemented as a pipelined, parallel

systolic array. Translating software to hardware significantly increased the simulation

calculation performance as the hardware logic is faster than comparable software.


Reconfigurable logic naturally facilitates the creation of Systolic Arrays [73, page

580]. In a systolic array, data is pumped from processing element to processing element

at regular intervals, until the data circulates back to memory. Intermediate results can

be passed along in the pipeline, instead of writing them back to the register file after

every instruction [70].

Systolic systems consist of interconnected cells, each of which is capable of per-

forming a simple operation. Systolic systems tend to have uncomplicated communication

and control structures which provide an advantage in design and implementation. Sev-

eral cells are generally joined together to form an array or tree. Data flows through the

cells which are pipelined together.

Initially, systolic arrays were proposed for special purpose computers implemented

in Very Large Scale Integration (VLSI) silicon, in an effort to reduce design costs. Design

costs in FPGAs are already somewhat minimal, but here the systolic array architecture

allows the implementation of multidimensional pipelines. Arithmetic functional units

(adders, multipliers, etc.) can be flexibly formed and interconnected allowing both par-

allelism and the continuous flow of data through computation units. These systolic

pipelines are ideal for implementing parallel algorithms to compute simulation events.
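As a behavioral illustration only (a Python sketch, not the AHDL implementation), a one-dimensional systolic pipeline can be modeled as a row of single-operation cells with a pipeline register between each pair, every cell firing once per "clock" step:

```python
# Behavioral sketch of a 1-D systolic pipeline: each cell applies one simple
# operation and hands its result to the next cell on every cycle.
def systolic_run(stages, inputs):
    """stages: list of single-argument functions, one per cell.
    inputs: stream of values fed into the leftmost cell."""
    regs = [None] * len(stages)                      # pipeline registers
    outputs = []
    for x in list(inputs) + [None] * len(stages):    # extra cycles to drain
        # advance every cell simultaneously (rightmost first so data moves)
        for i in reversed(range(len(stages))):
            src = regs[i - 1] if i > 0 else x
            regs[i] = stages[i](src) if src is not None else None
        if regs[-1] is not None:
            outputs.append(regs[-1])
    return outputs

# a three-cell pipeline: add 1, double, subtract 3
cells = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
print(systolic_run(cells, [0, 1, 2]))  # [-1, 1, 3]
```

Once the pipe is full, one result emerges per cycle even though each datum passes through every cell, which is the property the event generator exploits.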

6.3 Content Addressable Memory

Content addressable memory (CAM), or associative memory, is a storage system

which can both store data and also perform some minor processing, such as searches.

Although a CAM approach is less applicable to a traffic simulation example, the CAMs

are suitable to a more general discrete event simulation. When working in conjunction


with a processor, the content addressable memory queue allows some events to be pro-

cessed independently, freeing the scheduler to attend to other tasks. A block diagram

of an Associative Memory is illustrated in Figure 6.7 [100]. An example of a storage bit

within one memory word is shown in Figure 6.8 [100]. One of the proposed Event Queue

approaches of Section 7.2.2 applies Content Addressable Memory. If the required event’s

resources are unavailable, the queue assists the scheduler by removing the impotent

events.

Searches are accomplished by seeking a specific bit pattern. For the search, two

parameters are supplied, the matching bit pattern and a mask to limit the search set.

The search is dependent on which values are stored in memory as opposed to the address

or location of those values; it accomplishes a search by content. For example, a request might be to search for memory words whose lowest-order 8 bits contain the pattern “00000100” and return the first match. In this case, the mask selects the lowest-order 8 bits, the argument is the given bit pattern, and returning the first word provides conflict resolution in the event of multiple matching words. Figure 6.9 illustrates the matching logic from [100].
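The masked search-by-content semantics can be sketched in software as follows (an illustrative model; the hardware compares all words in one parallel operation rather than looping):

```python
# Sketch of a CAM search-by-content: every stored word is compared against
# the argument under a key (mask); bits where the key is 0 are ignored.
def cam_search(words, argument, key):
    """Return the match-register bits (as word indices) and the first match."""
    matches = [i for i, w in enumerate(words)
               if (w ^ argument) & key == 0]   # masked-in bits must agree
    return matches, (matches[0] if matches else None)

# 16-bit words; search for words whose lowest-order 8 bits are 00000100
memory = [0x1204, 0x3A04, 0x1305]
matches, first = cam_search(memory, argument=0x0004, key=0x00FF)
print(matches, first)  # [0, 1] 0 -- words 0 and 1 match; word 0 wins the tie
```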

Besides the structure depicted in Figure 6.7, an additional Tag register is often

included, allowing the rapid determination of which words in the memory are valid.

When a word is written to the associative memory, the Tag Register, which contains one

bit for each memory word, is scanned until the first 0 bit is found indicating an unused

word. The new value can then be written to the corresponding associative memory word,

and that bit is then flipped to a 1 indicating that the newly written word is valid. To


[Figure 6.7 elements: Argument Register, Key Register, Match Register, Associative Memory array (m words, n bits/word), Read/Write, Input/Output]

Fig. 6.7. Associative Memory Block Diagram The block diagram of an Associative Memory or Content Addressable Memory consists of the four elements shown. The Memory storage array and match logic are used both for the storage of data and for allowing a parallel search. The Argument Register contains the value to be compared with all the words in the Memory array in one parallel operation. The Key register is really a Mask register, used to limit which bits of the Argument Register are used in the search for a match. The Match Register contains one bit for every word in the array. During a search, words which match set a bit in the Match Register. The found data can be read sequentially by selecting words whose Match Bits have been set by the results of the search. These architectures usually also contain a Tag register, which like the Match Register contains one bit for each word in the array, indicating whether or not that word contains valid data [100].


[Figure 6.8 elements: match logic with inputs Aj and Kj, Read/Write/Input lines, R-S flip-flop Fij, match output to Mi, Output]

Fig. 6.8. An Associative Memory Cell The diagram illustrates the typical logic contained within an associative memory cell [100]. The cell contains a flip-flop storage unit, Fij, and circuits for reading, writing, and matching the cell contents with an argument. A write transfers an input bit to the flip-flop; a read returns the stored value [100].



Fig. 6.9. Associative Memory Match Logic The match logic compares the value stored in each cell with the corresponding bit held in the argument register. For a match to occur, the argument bit and the cell bit must contain the same value, so they must either both be 0 or both be 1. In Boolean terms, the per-bit logic is xj = Aj Fij + Aj′ Fij′, where xj is 1 when both bits are equal. For a match, all bits must register a 1, so the xj's are ANDed together. The Kj bit removes a bit from the comparison by forcing that bit's comparison output high whenever the bit is masked out, regardless of the match logic outcome; a masked-out bit therefore cannot defeat the matching logic result [100].


delete a word in memory, the corresponding Tag bit is set to 0. Further, the tag bits are

included in the match logic to prevent an invalid word from participating.

In the case of discrete event simulation, content addressable memory can assist

in removing events which lack required resources from the event queue.
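The Tag-register bookkeeping described above can be sketched in software (illustrative class and method names; the hardware scans tag bits and matches words in parallel):

```python
# Sketch of Tag-register bookkeeping for an associative memory: one valid
# bit per word; writes claim the first free slot, deletes clear the bit, and
# tag bits gate the match logic so invalid words never participate.
class TaggedCAM:
    def __init__(self, size):
        self.words = [0] * size
        self.tag = [0] * size                  # 1 = word holds valid data

    def write(self, value):
        i = self.tag.index(0)                  # scan for the first 0 (free) bit
        self.words[i] = value
        self.tag[i] = 1                        # newly written word is now valid
        return i

    def delete(self, index):
        self.tag[index] = 0                    # e.g. drop an event whose
                                               # resources are unavailable

    def search(self, argument, key):
        return [i for i in range(len(self.words))
                if self.tag[i] and (self.words[i] ^ argument) & key == 0]
```

In the event-queue role, removing an impotent event is just a `delete` of its tag bit; no data needs to be moved.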

6.4 Reduction Bus

A reduction bus is a communications structure which has the dual purpose of both

communications and some minor simultaneous computation. For instance, the bus can

both determine and disseminate the next global minimum event time through a network

of interconnected nodes. The CM-5®, developed by the Thinking Machines Corporation, can consist of hundreds or thousands of processors, linked together by two communi-

cations networks, the Control and Data networks. The Control network contains a global

reduction network which allows many data elements to be combined producing a smaller

result. The CM-5’s reduction network directly supports integer summation, finding the

integer maximum, logical OR, and logical exclusive OR operations [77]. Reynolds de-

scribes another reduction network referred to as the Parallel Reduction Network (PRN),

where each node in the binary tree structure of the PRN is an Arithmetic Logical Unit

(ALU) specifically for performing reduction computation [116]. Reynolds’ PRN is de-

scribed in Section 3.5.1. The reduction network presented in Section 7.3.2 is a flattened

network, where the considered constraints include both the node geometric layout and

short bus run-lengths.
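The global-minimum computation such a network performs can be sketched as a binary-tree reduction (a software model; in a PRN-style network each tree node is an ALU performing one comparison per step):

```python
# Sketch of a binary-tree reduction network computing the global minimum
# next-event time from each node's local candidate.
def tree_reduce_min(times):
    level = list(times)
    while len(level) > 1:
        if len(level) % 2:                     # odd count: last value passes up
            level.append(level[-1])
        # each internal "ALU" combines one pair per step
        level = [min(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

local_next_event = [17.5, 12.0, 19.25, 14.0, 12.5]
print(tree_reduce_min(local_next_event))  # 12.0, broadcast back to all nodes
```

For N nodes the tree needs only about log2(N) combining steps, which is why a reduction network outperforms polling each node in turn.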


Chapter 7

Architecture

The proposed architecture is composed of multiple processing elements united

and synchronized towards the common goal of accelerating discrete event simulation. In

the case of traffic simulation, each processing element is responsible for simulating the

vehicles within an intersection and for simulating the traffic on the intersection’s outgoing

roads. Traffic intersections which are directly joined by roads in the simulation are

similarly co-located on adjacent nodes within the architecture to profit from the resulting

data locality. Data dependencies are local to each simulation node providing ample

opportunity for the concurrent processing of events on different processing elements.

As illustrated in Figure 1.1, the simulation architecture for each processing ele-

ment is divided into the three main categories of event generation, an event queue, and

a scheduler. Additionally, there is the interconnection network which binds the elements

into a single cohesive computing machine. The architecture of the system is sub-divided

into four smaller sub-architectures. The architecture, overviewed in Section 7.1,

is composed of multiple processing elements which are unified into a cohesive simula-

tor. Individual processing element sub-components are discussed in Sections 7.2.1, 7.2.2,

and 7.2.3. Section 7.2.1 describes the event generation design. Section 7.2.2 describes the

event queue which stores events generated by the hardware described in Section 7.2.1.

At each processing element, events are retrieved from the event queues and scheduled


and processed by scheduler hardware described by Section 7.2.3. Finally, Section 7.3

describes the network which unites and synchronizes the processing elements.

7.1 Distributed Multiprocessors

Figure 7.1 illustrates the operational environment of the simulator which is simi-

lar to Levendel’s [98]. A general purpose machine serves as a User Interface and system

Controller. The Controller pre-processes the simulation data, partitioning and loading

it across the various simulator processing elements. Given the size and scalability of the system, more than one Controller may initialize the simulator to ensure reasonable response time. However, if there are several Controller units, one will be des-

ignated as the main Controller, responsible for the initial simulation partitioning. The

pre-processing, partitioning, and post-processing topics are not covered in this thesis.

Once the simulator is loaded, the main Controller provides the initial Start signal, il-

lustrated in Figure 7.2. The Controller machines receive, postprocess, and provide the

simulation results to the user. Intermediate simulation results can be obtained by either

programming the processing elements to automatically post results as certain thresholds

are crossed or by interruptions from the Controller. The interruptions allow the user

to monitor simulation progress. Major adjustments are not possible without halting

the simulator and reconfiguring processing element logic. Results include traffic run-

time statistics and output values from monitored points at specific simulation times or

conditions.

The Controller unit is a general purpose machine which receives the simulation

input from the user. The Controller then partitions the simulation across the simulator


[Figure 7.1 elements: General Purpose Controller (Preprocessing, Postprocessing) connected to the Simulator Processing Element Network]

Fig. 7.1. System User Interface The proposed preprocessing and post-processing operational environment for the multi-processing element simulator architecture is similar to Levendel's [98]. The Simulator Processing Element Network is composed of a parallel reduction bus structure and a cross-point matrix. The parallel bus is used for synchronization and initialization. The cross-point matrix network is for inter-processor communications.


processing element network of Figure 7.2, optimizing the distribution to take advantage

of data locality among the neighboring processing elements and the simulator commu-

nications network. For a road traffic network, the Controller logically assigns the roads

and intersections of the simulated road network onto the processing elements, attempting

to place adjacent simulation nodes on adjacent simulator processing elements. Multiple

controllers can be used, once the simulation network is partitioned, to initialize the system's reconfigurable logic. For instance, several controllers, one per quadrant, can be used to load the initialization and configuration data into the processing elements of each quadrant. Once the simulation starts, the controllers can monitor the system and check

to see when user-determined thresholds are crossed. In the traffic simulation example,

one concern is throughput. Overflowing vehicle queues indicate a traffic bottleneck, so

a threshold can be assigned to the traffic queue size as a simulation break point. The

Controller units are responsible for providing formatted output to the user. Configuration

of the processing elements will be slow, but once the simulation starts running, the Con-

trollers will not negatively affect the speed of the simulation unless the user requires the

simulation to proceed slowly. Controllers can step through a simulation and run it in a

debug mode. Citing the traffic network model example, once a simulator is initialized for

the traffic network model, major model changes are infrequent as road networks change

slowly.
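The threshold check the Controller performs on intermediate results can be sketched as follows (illustrative names; the real Controller would read these queue counts over the simulator network):

```python
# Sketch of Controller-side threshold monitoring: the simulation can be
# paused or reported when a user-defined threshold (here, vehicle queue
# length) is crossed at any processing element.
def check_breakpoints(queue_lengths, threshold):
    """Return the processing elements whose vehicle queues overflow."""
    return [pe for pe, length in queue_lengths.items() if length > threshold]

lengths = {"pe0": 12, "pe1": 47, "pe2": 9}
hot = check_breakpoints(lengths, threshold=40)
print(hot)  # ['pe1'] -- a traffic bottleneck: halt or report to the user
```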

Vehicle data flows into each processing element as the vehicle enters the corre-

sponding traffic map section. Vehicle data is transmitted between processing elements

using either nearest neighbor interconnect routing or the processing element array cross-

point switch. Greater detail of the processing element sub-components is provided in


[Figure 7.2 elements: General Purpose Controller; Cross-Point Switch and Parallel Bus communications structures; per-Processing-Element signals: Data/Clk, Address, Control, Done, Start]

Fig. 7.2. Processor Element Network The simulator consists of a Controller and a network of processing elements, interconnected by both a shared parallel reduction bus and a dedicated communications structure. The communications structure is composed of cross-point matrices laid out in approximately fully connected star topologies. Further detail on the cross-point matrix network is found in Section 7.3.6, and the parallel bus is described in Section 7.3.2.


the following sections. A controlling processor at the core of the simulator initiates each

simulation cycle using the Start signal in both its time and event-driven modes. In

time-driven mode, the processing elements have already exchanged input values for the

beginning of the next simulation cycle during the previous cycle using the communica-

tions structure. In event-driven mode, the processing elements must wait until the next

event time is determined before exchanging data. Only data required for the next sim-

ulation cycle is exchanged by the processing elements, alleviating the need to exchange

event scheduling times with the vehicle data. Data may also be generated for the Con-

trolling unit, upon user request or instruction, and it is expected that this user-requested data will impact the simulator’s processing speed. Processing elements signal that they

are ready using the Done signal line on the reduction bus, illustrated in Figure 7.2.

In event-driven mode, when all processing elements have signalled the end of

the current simulation cycle, the next event time is determined using the reduction

bus of Section 7.3.2. Data between the processing elements can then be exchanged

for the next time cycle. The time-driven mode avoids both the next simulation time

cycle determination and the subsequent data exchange which occurs during the cycle

processing. Once all processing for the previous simulation time cycle is complete, the

main controlling processor initiates the next simulation cycle using the Start signal.
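The Start/Done protocol of an event-driven cycle can be sketched behaviorally (illustrative class and signal names; the real elements run concurrently and the minimum is taken by the reduction bus, not a software loop):

```python
# Behavioral sketch of one event-driven cycle: Start fires all processing
# elements, each raises Done, the reduction determines the next global event
# time, and only the data needed for the next cycle is then exchanged.
class PE:
    """Minimal stand-in for a processing element."""
    def __init__(self, pending):
        self.pending = sorted(pending)       # local timestamped events
        self.done = False

    def process_events_at(self, t):
        self.pending = [e for e in self.pending if e != t]

    def next_event_time(self):
        return self.pending[0] if self.pending else float("inf")

    def exchange_data(self):
        pass                                 # placeholder for cross-point traffic

def event_driven_cycle(elements, now):
    for pe in elements:                      # Start: elements run in parallel
        pe.process_events_at(now)
        pe.done = True                       # Done line on the reduction bus
    nxt = min(pe.next_event_time() for pe in elements)   # reduction-bus minimum
    for pe in elements:                      # exchange only next-cycle data
        pe.exchange_data()
        pe.done = False
    return nxt

pes = [PE([3.0, 7.0]), PE([3.0, 5.0])]
print(event_driven_cycle(pes, 3.0))  # 5.0 -- the next global event time
```

Time-driven mode simply skips the minimum computation and advances by a fixed increment, which is why it trades dead cycles for a simpler protocol.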

7.2 Processing Elements

The local processing element architecture, consisting of an event generator, local

event queues, and a scheduler, is shown in Figure 7.3. The processing elements perform

the actual scheduling calculation on each discrete event in the system. For the road


traffic example, the processing elements compute the acceleration, velocity, and positions

of each vehicle as they traverse the simulated road network. Each processing element

is responsible for part of the overall simulation map, as partitioned by the Controller

during the simulation initiation. For the traffic simulation, parts of both the routing

table and map attributes are implemented in reconfigurable logic before the simulation

starts. Source nodes which introduce new events into the simulation network need event

generators and queues to handle arrival and service events. Simulation nodes which

serve as pass-throughs or way-points for events already active in the simulation may

require only the scheduler components, which can themselves be complex. Further, the scheduler

components may contain queue structures which should not be confused with the Arrival

and Service queues illustrated in Figure 7.3. The processing elements each include a

microprocessor, RAM, and EEPROM to provide added design flexibility.

Each processing element incorporates hardware to exchange simulation data with

other elements connected to the Communications Structure of Figure 7.2. The inbound

events are handled by additional small communications FIFO queues not illustrated in

Figure 7.3. These communications queues are used to maintain the ordered inbound

events from other processing elements and the ordered outbound events sent to other

processing elements.

7.2.1 Event Generation

Speedup of event-driven simulation is attacked from two vantage points. First, a

separate event generator is created which functions in parallel with, and independently


[Figure 7.3 blocks: Event Generator; Arrival Queue; Service Queue; comparator; Scheduler; events from adjacent PEs]

Fig. 7.3. Local Processing Element Design The local processing element (PE) design uses two queues for each server. The arrival queue holds the sorted list of arrival events from the Event Generator and adjacent network processing elements. Service events, which are created from processing successful arrival events, are stored in the service queue. A comparator samples the heads of both queues and indicates where the next minimum local timestamped event resides.


from, the event scheduler. Although some data dependency exists during event gener-

ation, partial parallelism at this stage is reasonable. Data dependency exists because

event arrivals are calculated as random offsets from the previous event’s arrival time.

Also, service durations, required for some types of simulation, are calculated as random

offsets from the event’s arrival time. Note that not all types of simulation require service

durations. Road traffic does not necessarily require a duration, but a telephone call in

a communications network simulation needs to determine the length of time its corre-

sponding circuit is unavailable. The partial parallelism derived by calculating the arrival

offset and service duration concurrently is possible because the arrival and service offsets

are not themselves dependent on anything. However those offsets must then be added

to either the previous or current event’s arrival time, respectively. The event generator

computes event arrival times, service times, and resource requirements with some par-

tial parallelism (see Figure 7.4). The resulting event objects are stored in a memory

queue which is accessible to the scheduling software. The memory queue serves as the

simulation’s event queue.
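The dependency structure just described can be sketched in software (illustrative rates and names; the thesis generates the exponential offsets in hardware logarithm units rather than with a library call):

```python
import random

# Sketch of the event generator's data flow: the arrival and service offsets
# are independent exponential draws (so they can be computed in parallel);
# each arrival then depends on the previous arrival, and each service
# completion on its own arrival.
def generate_events(n, arrival_rate, service_rate, seed=1):
    rng = random.Random(seed)
    events, t_prev = [], 0.0
    for _ in range(n):
        arrival_offset = rng.expovariate(arrival_rate)   # independent draw
        service_offset = rng.expovariate(service_rate)   # independent draw
        t_arrive = t_prev + arrival_offset   # depends on the previous event
        t_finish = t_arrive + service_offset # depends on this arrival
        events.append((t_arrive, t_finish))
        t_prev = t_arrive
    return events

for t_arrive, t_finish in generate_events(3, arrival_rate=1.0, service_rate=2.0):
    print(f"arrive {t_arrive:.3f}  finish {t_finish:.3f}")
```

Only the two additions are serial; the two draws are the portion the systolic pipeline executes in parallel.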

Speedup is accomplished by translating some simulation loop software into par-

allel, systolic, hardware. The hardware is designed through a combination of recon-

figurable logic technology and systolic arrays. Reconfigurable logic allows the user to

compile various statistical distribution models into hardware inexpensively. Systolic ar-

rays lend themselves nicely to the parallel execution of the independent sub-portions of

events which are otherwise data dependent. These non-dependent parts of the events

can be executed in parallel. The Event Generator from Figure 1.1 is translated into both

software and the hardware of Figure 7.4 for timing comparisons.


[Figure 7.4 blocks: Create Arrival Time Offset; Create Service Time Offset; Add Offset to Previous Arrival Time; Add Offset to Current Arrival Time; Set Resources; Place Event in Queue; pipeline registers between stages]

Fig. 7.4. The Event Generator Flow Diagram The Event Generator of Figure 1.1 is subdivided into arrival and service time generation. The time offsets can be created in parallel. This design converts event generation software into a two-dimensional reconfigurable, systolic array. Reconfigurable logic boosts the execution speed of event generation by fostering parallel computation. In the systolic array depicted above, data is pumped from one processing block to the next at regular intervals, until the data circulates to the Event Queue.


In the hardware version, multiple calculations happen simultaneously. First, the

three outer pipeline blocks of Figure 7.4, Create Service Time Offset, Create Arrival Time

Offset, and Set Resources execute simultaneously. Create Service Time Offset and Create

Arrival Time Offset generate the Poisson arrival and service times of Equations 4.1

and 4.2. Next, the arrival time offset is added to the current clock time to determine

the actual arrival time in the Add Offset to Previous Arrival Time block. In the next

step, the service time offset is added to the actual arrival time yielding the time at which

the event is finished and its resources become available again. Simultaneously, the start

event data is matched to its resource requirements in the Create Next Start Event block.

The Create Related Finish Event block pumps out its value in the next step. However,

when the pipe is loaded, start and finish events emerge from the pipeline simultaneously,

with each cycle.

The hardware version of Section 4.1.1 was modeled using Altera’s Max+Plus II®

FPGA simulation package. The design, written in the AHDL language, used the Flex

10K series FPGA chips. The Max+Plus II® design automation package consists of a

series of tools including an editor, a compiler, and a simulator. The editor allows designs

to be entered as text files in AHDL (the Altera Hardware Description Language), Verilog,

or VHDL (VHSIC Hardware Description Language where VHSIC is another acronym

standing for Very High Speed Integrated Circuits). The compiler translates the design

into files for simulation, timing, and device programming. The Max+Plus II® simulator

provides timing information and allows design functionality to be verified.

The Altera Flex 10K series of FPGAs has the following features. The devices

contain 10,000 to 100,000 typical gates, 720 to 5,392 registers, and 6,144 to 24,576 RAM


bits. Additional routing features on the chip facilitate predictable interconnect delays

which provide reliable simulation results. Software design support and automatic place-

and-route tools are provided by Altera’s Max+Plus II® development system.

7.2.1.1 Event Generator Results

The design depicted in Figure 7.4 is translated into AHDL, the Altera Hardware Description Language. The event generator is synthesized as a combination of five chips. Four

logarithm units are required, two producing their results on the even clock cycles and

the other two producing results on the odd clock cycles. The Create Arrival Time Offset

and the Create Service Time Offset blocks of Figure 7.4 each require one odd and one

even logarithm unit. In simulation at a 200 nanosecond clock rate, the hardware version

requires 200 nanoseconds, producing one event per clock cycle. Therefore, we achieve a

speedup of 150. This speedup is just for the translation of the event generation software

code as pipelined, systolic hardware. The event generator implementation results are

listed in Table 7.1.

7.2.2 Event Queue

The second point of attack is the queue of waiting events. This

queue is designed to hold the events in order of their arrival. One proposed memory

queue is a Content Addressable Memory Queue. If the required event’s resources are

unavailable, the queue assists the scheduler by removing impotent events.


After events are created by the event generator, they are stored in the arrival

queue in order. The arrival queue can be easily implemented as a FIFO queue. Success-

fully executed arrival events create service events. However, the service events may be

generated out of order. The smallest timestamped events, be they service or arrival, must

be continually available to allow the events to be executed in time order. This section

presents two service queue alternatives. The first method, described in Section 7.2.2.1,

maintains a sorted queue and can select the nth element in O(1) time, but requires a constant four cycles to insert a new element. The second method, described in Section 7.2.2.2, inserts new elements in O(1) time and can pop the smallest element off in O(1) time, but does not maintain a sorted queue.

7.2.2.1 The Service Event Sorter

The first method, the Service Event Sorter, applies associative memory to sort

events in 4 cycles, significantly faster than standard software sorts. This sorting mech-

anism maintains a sorted array facilitating selection of the kth smallest element. The

hardware consists of the Input Register, a Content Register Array, a Marked Array, and

a Maxbit Register. The input value is compared against the content array values and

inserted in the correctly sorted position within the content array. Auxiliary hardware

registers and logic are used to quickly locate the correct insertion point for the new value.

Longer queues can be created by chains of smaller queues.

In the first cycle, illustrated in Figure 7.5, all words in the content array are

compared to the input register value. If any word in the content array has the same

most significant bit (MSB) as the input register, the first bit of the maxbit register is


set. If any content array word has the same top two MSBs as the input register, then the first two bits of the maxbit register are set. So if three bits of the maxbit register are set, at least one word in the content array has the same top 3 MSBs as the input register. In the example depicted in Figures 7.5, 7.6, and 7.7, two words have the same top two MSBs as the input register value.

The proposed algorithm works as follows. All registers in the content array are

compared with the input register. A network of nodes, called the match array, is used to

determine the number of most significant bits which each content register has in common

with the input data register. A single register, the maxbit register, records the result.

For example, if one or more content array registers match the data input register on all

3 of the 3 most significant bits, then the first 3 bits of the maxbit register are set to logic

1’s.

In the second of the four cycles, all words in the content array matching the input

register with the maximum number of MSBs are marked by setting bits in a marked

array. The marked array consists of one bit per word of the content array. Content

array words which have the maximum number of MSBs matching the input register are

marked as illustrated in Figure 7.6. These words will cluster due to the binary nature of

the search and the sorted queue format.

During the third cycle, the required words in the content array will be moved

down to allow room for the insertion of the input register contents. If the least significant

marked bit in the maxbit register is i, then the i+1 bit of the input register is checked.

When the i+1 bit of the input register is a zero, as shown in Figure 7.7, all registers

in the content array from the marked register to the end of the array are shifted down



Fig. 7.5. Service Event Sorter: Cycle 1 The Service Event Sorter can sort events in 4 cycles, significantly faster than standard software sorts. In the first cycle, all words in the content array are compared to the input register value. If any word in the content array has the same most significant bit (MSB) as the input register, the first bit of the maxbit register is set. If any content array word has the same top two MSBs as the input register, then the first two bits of the maxbit register are set. So if bit three of the maxbit register is set, it indicates that at least one word in the content array has the same top 3 MSBs as the input register. In this example, two words have the same top two MSBs as the input register value.



Fig. 7.6. Service Event Sorter: Cycle 2 In the second cycle, all words in the content array which contain the same number of matching maximum bits with the input register are marked. For this example, two words match the input register with their two most significant bits.


one register in a single clock cycle. Otherwise, if the i+1 bit of the input array is a one,

then all registers below the marked registers are shifted down one position creating a

spot for the new word to be inserted as shown in Figure 7.7. Finally, in the fourth step,

the input register value can be inserted into the array in proper order.
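The four cycles can be modeled in software as follows (a behavioral sketch using a 4-bit word width; the hardware performs each cycle's comparisons in parallel across the whole content array rather than looping):

```python
# Software model of the Service Event Sorter's four cycles for one insertion
# into a sorted content array of 4-bit words.
WIDTH = 4

def msb_match_len(a, b, width=WIDTH):
    # length of the shared most-significant-bit prefix of a and b
    for k in range(width, -1, -1):
        if a >> (width - k) == b >> (width - k):
            return k
    return 0

def sorter_insert(content, value, width=WIDTH):
    # Cycle 1: the maxbit register records the longest MSB match over all words
    maxbits = max((msb_match_len(w, value, width) for w in content), default=0)
    # Cycle 2: mark the contiguous cluster of words achieving that match length
    marked = [i for i, w in enumerate(content)
              if msb_match_len(w, value, width) == maxbits]
    # Cycle 3: bit i+1 of the input register decides the insertion side; the
    # words below the insertion point shift down one slot
    if maxbits < width and marked and (value >> (width - maxbits - 1)) & 1:
        pos = marked[-1] + 1                 # input is larger: insert below cluster
    else:
        pos = marked[0] if marked else len(content)
    # Cycle 4: the input register value is written into the freed slot
    content.insert(pos, value)
    return content

# the example of Figures 7.5-7.7: insert 0101 into the sorted content array
print(sorter_insert([0b0001, 0b0011, 0b0110, 0b0111, 0b1001, 0b1010], 0b0101))
# [1, 3, 5, 6, 7, 9, 10]
```

Because all marked words share the same prefix and differ from the input at the next bit, the cluster always lies entirely on one side of the new value, so a single shift suffices.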

7.2.2.2 The Linear Array

The second service queue mechanism consists of a linear array approach which is described in [97] and illustrated in Figure 7.9. However, instead of using the linear array to sort the values, the array simply maintains fast access to the minimum-timestamped event. All new simulation events are passed into the leftmost array element, the queue head, and when removed, the elements are also popped off the queue head. Each element of the queue contains two registers and a comparator. The larger of the two resident elements is passed to the right, and the smaller of the two elements is passed to the left. Therefore the smallest entry is always at the leftmost queue element. Comparators in each element and the queue push/pop signal steer the 2x2 multiplexor logic to route the correct entries in and out of the processing element registers.

The service queue is required to always have the smallest element ready. The availability of the smallest element can be reasoned as follows. Assume that at some time, t, the queue contains N elements. Therefore, the leftmost element, K, has examined a sequence of N values, retaining the smallest value. This value can be popped off in 1 move. The element to K's right, K-1, has examined at least N-2 values, so the 2nd smallest value can be either at element K, or at element K-1, but it must be in one of



Fig. 7.7. Service Event Sorter: Cycle 3 The third cycle shifts words in the content array to insert the input register word. If the least significant marked bit in the maxbit register is i, then the i+1 bit of the input register is checked. When the i+1 bit of the input register is a zero, all registers in the content array from the marked register to the end of the array are shifted down one register in a single clock cycle as illustrated in the figure. Otherwise, if the i+1 bit of the input register is a one, then all registers below the marked registers are shifted down one register, creating a spot for the new word to be inserted.



Fig. 7.8. Service Event Sorter: Cycle 4 The fourth cycle inserts the input register word into the content array. If the least significant marked bit in the maxbit register is i, then the i+1 bit of the input register is checked. When the i+1 bit of the input register is a zero, the input register word is inserted into the first marked word position. If the i+1 bit of the input register is a one, then the word is inserted below the last marked word.
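The four cycles of Figures 7.5 through 7.8 can be summarized as a behavioral software model. The sketch below is a sequential Python approximation (the hardware performs each cycle's comparisons and shifts in parallel, one cycle each); the 4-bit width and the sample contents are taken from the figures:

```python
def sorter_insert(content, word, width=4):
    """Behavioral model of the four-cycle Service Event Sorter insert.
    `content` is kept in ascending order; the hardware runs each cycle's
    comparisons and shifts concurrently, this model runs them serially."""
    # Cycle 1: maxbit = length of the longest MSB prefix of `word`
    # shared with any word already stored in the content array.
    maxbit = 0
    for k in range(width, 0, -1):
        if any(w >> (width - k) == word >> (width - k) for w in content):
            maxbit = k
            break
    # Cycle 2: mark every stored word sharing that maximal prefix.
    marked = [i for i, w in enumerate(content)
              if maxbit and w >> (width - maxbit) == word >> (width - maxbit)]
    # Cycles 3-4: pick the insertion point from the next input bit,
    # then shift and insert (one clock each in the hardware).
    if marked:
        if maxbit == width:
            pos = marked[0]          # exact duplicate timestamp
        else:
            next_bit = (word >> (width - maxbit - 1)) & 1
            pos = marked[0] if next_bit == 0 else marked[-1] + 1
    else:
        # no stored word shares even the MSB: globally smallest or largest
        pos = 0 if (word >> (width - 1)) == 0 else len(content)
    content.insert(pos, word)
    return content
```

With the Figure 7.5 contents (0001, 0011, 0110, 0111, 1001, 1010) and input 0101, two words share the top two MSBs, the third input bit is zero, and the word lands ahead of the first marked entry, matching Figure 7.8.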


Fig. 7.9. Linear Array Queue The queue consists of a linear array of processing elements. All new elements are passed into the leftmost array element, and when removed, the elements exit the same leftmost element. Each element of the queue contains two registers and a comparator. The larger of the two resident elements may be passed to the right, and the smaller of the two elements may be passed to the left. Therefore the smallest entry is always at the leftmost queue element. Comparators in each queue element steer the multiplexor logic to route the correct entries in and out of the processing element registers.


those two places, and can be accessed in 2 moves since the smallest element must be removed first.

The nth smallest element to enter the array is in any position from K down to K − (n−1). Then the next smallest element to enter the queue will be in any position from K down to K − ((n−1)−1), which provides our inductive step for the (n−1)th smallest element. So our nth smallest element can always be retrieved in n steps. This queue allows us to push and pop each element in O(1) time. The queue is illustrated in Figures 7.10 and 7.11.

Figure 7.10 illustrates a sequence of values being pushed into the array. The top array illustrates the first time step, with each successive array below depicting the same array during the next clock cycle. Comparators on each processing element and their associated multiplexors steer the values into each element of the array. Larger elements are pushed to the right during insertion. When events are popped off the queue, the analogous sequence of steps is illustrated in Figure 7.11: smaller elements are pushed to the left.
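The push/pop behavior of Figures 7.10 and 7.11 can be sketched in software. In this Python model each clocked compare-swap wave of the hardware becomes a sequential sweep, which is an approximation: the model ends up keeping the array fully sorted, a stronger invariant than the hardware guarantees (only the head minimum).

```python
class LinearArrayQueue:
    """Behavioral model of the linear array queue of Figure 7.9.
    Cell 0 stands for the leftmost processing element (the queue head)."""

    def __init__(self):
        self.cells = []

    def _compare_swap_pass(self):
        # One clock's worth of neighbor compare-swaps. In hardware all
        # elements swap concurrently; this left-to-right sweep bubbles a
        # newly pushed value right to its resting place.
        for i in range(len(self.cells) - 1):
            if self.cells[i] > self.cells[i + 1]:
                self.cells[i], self.cells[i + 1] = self.cells[i + 1], self.cells[i]

    def push(self, value):
        self.cells.insert(0, value)   # new values enter at the queue head
        self._compare_swap_pass()     # larger values are pushed right

    def pop(self):
        value = self.cells.pop(0)     # the smallest entry is always at the head
        self._compare_swap_pass()     # smaller values drift left
        return value
```

Pushing the Figure 7.10 sequence 8, 2, 7, 3, 1, 5, 6, 4 and then popping eight times returns 1 through 8 in order, as in Figure 7.11.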

7.2.2.3 The Queue Model Results

The implemented hardware service queue is a five element design closely resembling Figure 7.9. For the traffic simulation example, there is no need for the fully ordered list of Figure 7.6. The linear array queue is capable of pushing one 16-bit value per 40 nanoseconds. The smallest queue value can also be popped out at that rate. It is assumed that each simulator cycle needs to push one event and pop one event from the service queue. Therefore, the queue achieves the system's 80 ns cycle time. Queue data



Fig. 7.10. Linear Sort Array Input Example The figure illustrates a sequence of values being pushed into the array. The top array illustrates the first time step, with each successive array below depicting the same array during successive clock cycles. Comparators on each processing element and multiplexors between each element steer the values into each element of the array of Figure 7.9. Larger elements are pushed to the right.


Fig. 7.11. Linear Sort Array Output Example The figure illustrates a sequence of values being popped out of the array. Comparators on each processing element and multiplexors between each element steer the values into each element of the array. Smaller elements are pushed to the left.


values also require pointers to the event data, so pairs of values must be pushed and popped off the queue. Conversely, when new elements are pushed into a software data structure, the existing software elements must be fetched from memory to allow the CPU to compare the stored elements to the new arrival so that the insertion point can be determined, or an address must be calculated to determine a proper bin on which to chain the new entry for hashing. Software methods require more time, and variable amounts of it.
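For contrast, the software approach described above can be sketched with a binary heap. `heapq` is standard Python, but the (timestamp, event pointer) pairs and names here are illustrative; every push sifts the new entry past O(log N) stored elements that must first be fetched from memory, which is exactly the variable cost the hardware queue avoids.

```python
import heapq

# A minimal software service queue holding (timestamp, event_reference)
# pairs, standing in for the pointer pairs described in the text.
service_queue = []

def push_event(timestamp, event_ref):
    # Sifts through O(log N) stored entries to find the insertion point.
    heapq.heappush(service_queue, (timestamp, event_ref))

def pop_event():
    # Always yields the smallest timestamp first.
    return heapq.heappop(service_queue)

push_event(40, "veh-A")
push_event(10, "veh-B")
push_event(25, "veh-C")
```

Popping now returns the events in timestamp order: (10, "veh-B"), then (25, "veh-C"), then (40, "veh-A").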

Comparing the two methods in terms of their required hardware, the Service Event Queue requires two registers for each memory word, where one is the memory register and the second is within the Matchbit array. Additional matching logic, a Match Array Bit, and a Tag bit are required for each word. With the Linear Array approach, each stored word requires 1 storage register, and an amortized 50% of both a multiplexor and a comparator plus some additional control logic. The hardware requirements are similar in magnitude.

Using Altera's Max+Plus II® FPGA simulation package, the Event Generator and the Service Queue have been simulated as individual parts running with a clock period of 80 ns. The service queue was simulated with a 40 ns period, allowing it to push an event during the first half cycle and pop an event during the next. A 5 processing element queue was implemented on one Altera EPF10K20TC144-3 chip utilizing 90% of the chip's resources. The system FPGA components are listed in Table 7.1.

Figure 7.12 illustrates the speedup expected for other distributions if the 80 ns clock is maintained in their hardware implementations. The implemented hardware


Function          Quantity  Chip Type        % Utilized  Clock Rate
Event Generator
  Logarithm Unit      4     EPF10K40RC208-3     95%      15.84 MHz
  Event Logic         1     EPF10K30RC240-3     71%      12.78 MHz
Service Queue
  Linear Array        1     EPF10K20TC144-3     90%      30.95 MHz

Table 7.1. Event Generator and Event Queue FPGA Implementation The Altera FPGAs used to simulate the event generation and event list hardware are listed. The natural logarithm unit uses pairs of FPGAs to facilitate one result per clock cycle. Two logarithm units are used to generate one arrival time per clock cycle, and another two logarithm units are used to generate a service time offset per clock cycle. Each logarithm unit of the pairs generates output on alternating clock cycles. The Linear Array implementation is described in Section 7.2.2.2.

[Plot: speedup (roughly 100x to 300x) versus the number of events in the arrival queue (1 to 1000) for the Uniform, Normal, LogNormal, Weibull, and Negative Exponential distributions.]

Fig. 7.12. Speedup vs Events for Event Generation, Arrival and Service Queues Illustrated are the speedup values obtained by comparing the software event generation and queuing from the code in Table 4.3 against the respective hardware implementation. The speedup values were derived on a dual Intel Pentium 350 MHz RedHat Linux box running the 2.2.15 kernel. Compilation of the software was with the GNU gcc compiler, version 2.95.2, using the optimization flag. The speedup results indicate a second order of magnitude speedup.


distribution provides the speedup illustrated for the Negative Exponential curve in Figure 7.12. It is also important to note that the proposed linear array queue need not be implemented as reconfigurable logic. The queue can be implemented as an Application-Specific Integrated Circuit (ASIC), and would probably be able to function at an even faster clock rate with many more queue elements. The code execution times were clocked on a dual Pentium 350 MHz machine running RedHat Linux kernel version 2.2.15. The code was compiled using the GNU gcc compiler, version 2.95.2. To gather accurate timing results, the number of events in the queue is kept constant. The extra time used to generate additional arrival events in order to maintain the queue size is not included in the speedup plot of Figure 7.12.

7.2.3 Scheduler

The results of Section 4.2.3 determined that discrete event simulation scheduling algorithms are an important target of acceleration research. However, unlike the Event Generation and Event Queue implementations, the scheduler implementation is very simulation dependent. For instance, a discrete event simulation of road traffic might have a very different scheduling algorithm than a biological scenario. For this study, the simulation of traffic was selected. The nature of microscopic traffic simulation is the determination of position, velocity, and acceleration, along with routing and other considerations. Traffic simulation has the added benefit of a significant amount of data locality. Vehicles in a system tend to dwell in the same locality and their dependencies rely on other sets of data within that same locality. Even when vehicles move, they move to an adjacent node within the traffic network. The work in this study, and especially


within this section, must be viewed in light of these properties of traffic simulation. The other properties of event generation and the event queue are probably more widely applicable to discrete event simulation in general. However, there is a wide field of analogous simulations which can benefit from the results of this section.

Due to current technology limitations, it is assumed that each processing element contains one intersection and all of the roads which exit from the intersection. Processing elements model two, three, and four-way intersections. A four-way intersection is illustrated in Figure 7.13. Larger intersections are modeled using combinations of these three intersection types.

The Scheduler component of the architecture is described in several sections. Section 7.2.3.1 explains the data structure which composes a vehicle description. Next, Section 7.2.3.2 describes how the vehicle data structure of Section 7.2.3.1 is initialized as the vehicle enters the traffic simulation network. Vehicle road movement computation is described in Section 7.2.3.3. Section 7.2.3.4 describes the computation involved in vehicle movement through an intersection. Finally, Section 7.2.3.5 presents the Scheduler experimental results.

7.2.3.1 Vehicle Data

The data requirements of the logic simulators discussed in Chapter 3 are less demanding than the data requirements faced by a traffic simulator. For logic simulation, the output of a particular logic function tends to be a relatively simple signal value. The output of a logical gate, AND or OR for example, tends to be a single value. Traffic simulation faces a broader, more complex set of values which must be computed and


Fig. 7.13. An Intersection and its Departing Roads The traffic model employed assumes that each processing element can model one intersection and its exiting roads, which are highlighted. The processing element handles traffic entering the intersection from up to four directions. The processing element continues to handle the traffic on the lanes which exit the intersection. When a vehicle reaches the end of a road it is handed off to the next processing element. Processing elements handle two, three and four-way intersections. Large intersections are generated by creating combinations of these smaller intersection types.


transmitted with each simulation cycle. The challenge is to develop a minimum dataset which is feasible and useful. For this research, the vehicle is composed of the sub-fields described in Table 7.2.

Vehicle acceleration on roads in the scheduler is determined by a variety of conditions. One factor is whether or not the vehicle has a leader within its headway. For this thesis, the headway is a 4 second following time based on the vehicle's velocity. Other acceleration criteria include the distance to the end of the road, the value of the traffic signal at the end of the road, and the vehicle's velocity with regard to the speed-limit. If the vehicle is determined to be in a following mode, the vehicle's acceleration is calculated using Equation 2.9. Vehicles which do not follow a leader, are not approaching the end of a road, and are not speeding, use a table lookup to determine their free-flow acceleration. Similar to [146], the acceleration is determined by the most restrictive constraint of Equation 7.1. Tables 7.3 and 7.4 provide the algorithms used to determine vehicle acceleration.

a_n = min( a_n^CarFollowing, a_n^TrafficSignal, a_n^FreeFlow, a_n^Speeding )        (7.1)

As the vehicle flows through the simulation network, additional data is required, but that data remains stationary at each local processing element. Free flow acceleration, for example, is maintained at each processing node. Vehicles needing the value perform


Field Name            # of bits   Description
source                    16      Vehicle entered the network at source.
destination               16      Vehicle travels to destination.
velocity                   9      Vehicle speed (ft/sec)
vehicle type               2      High or Low Performance Car, Bus or Truck
vehicle id                18      Unique ID, used in combination with Source
lane assignment            2      Format to handle 4 lanes, 3 travel + shoulder.
valid word bit             1      Is data word valid?
center point x            32      x coordinate
center point y            32      y coordinate
distance down lane        16      distance along road
reserved                  16      Reserved.

Total:                   160      Total bits for vehicle

Table 7.2. Vehicle Data Fields In order to accelerate vehicle processing, the size of the vehicle data needs to be limited to allow it to pass quickly through the datapath and avoid memory stores and fetches wherever possible. The data fields listed in the table are those which are required for movement computation. The source is included for statistics and vehicle identification. The destination is used for vehicle routing. The vehicle type is used in combination with the velocity to determine the free flow acceleration. For movement within an intersection on turning lanes, a polar coordinate system is used. Within intersections, the distance down lane field is initialized to a radial θ field which keeps track of the angle traversed during the vehicle's turning motion. Similarly, the reserved field is used to keep track of the angular velocity, and is initialized to ω0. Free flow accelerations are stored on each node in table format. The data which is vehicle specific and accessed with each calculation moves with the vehicle. If more data is incorporated with the vehicle, local memory and cache can be used in conjunction with the event list and a pre-fetch mechanism.
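The Table 7.2 layout can be modeled as a single packed 160-bit word. The field ordering and helper names below are assumptions (the thesis fixes only the widths), but the sketch shows how the fields pack and unpack with shifts and masks:

```python
# Field widths from Table 7.2, most significant field first (the
# ordering within the 160-bit word is assumed, not specified).
VEHICLE_FIELDS = [
    ("source", 16), ("destination", 16), ("velocity", 9),
    ("vehicle_type", 2), ("vehicle_id", 18), ("lane_assignment", 2),
    ("valid_word_bit", 1), ("center_point_x", 32), ("center_point_y", 32),
    ("distance_down_lane", 16), ("reserved", 16),
]
assert sum(width for _, width in VEHICLE_FIELDS) == 160

def pack_vehicle(fields):
    """Pack a dict of field values into one 160-bit integer word."""
    word = 0
    for name, width in VEHICLE_FIELDS:
        value = fields.get(name, 0)
        assert 0 <= value < (1 << width), f"{name} overflows {width} bits"
        word = (word << width) | value
    return word

def unpack_vehicle(word):
    """Recover the field dict from a packed 160-bit word."""
    fields = {}
    for name, width in reversed(VEHICLE_FIELDS):
        fields[name] = word & ((1 << width) - 1)
        word >>= width
    return fields
```

A round trip through `pack_vehicle` and `unpack_vehicle` preserves every field, which is the property the fixed-width datapath relies on.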


a table lookup based on the vehicle type and current velocity. Each node also contains network routing information.

7.2.3.2 Vehicle Initialization

When each vehicle is dequeued from either the Arrival or Service queue of Figure 7.3 and injected into the simulation network, the vehicle must be initialized with its vehicle type, source node, and destination node. Additional data may also be required, depending on the simulation. A single stage implementation was created to initialize new vehicles.

7.2.3.3 Road Movement

The road movement implementation is composed of two parts. The first part initializes vehicles each time they enter a new road. The data fields which are initialized include the lane assignment, center point, and distance down lane fields from Table 7.2. In Section 7.2.3.5, this initialization implementation is referred to as Veh Road Init. The rest of this section deals with the second implementation, referred to as Move on Road in Table 7.5, which performs the vehicle movement computation.

The queues in Figure 7.14 are not event queues as illustrated in Figure 1.1. The Event Queue in Figure 1.1 expels approximately one event per simulation time cycle. The number of vehicles generated and expelled from source nodes is determined by the user's selected statistical generation distribution. The event queues of Section 7.2.2 are required only in conjunction with event generation and only on simulation source nodes.



Fig. 7.14. Scheduler Vehicle Queue The scheduler vehicle queue implementation is similar for both movement on a road and movement in an intersection. Two FIFOs are maintained. Newly arrived vehicles are placed in the Entry FIFO. The Main FIFO is for those vehicles which are in-progress, either along a road or through an intersection. The comparator between the two queues selects the vehicle with the most advanced position down the lane, and routes that vehicle's data into the appropriate functional units of Figure 7.15 or 7.16 for either road-handling or intersection-handling movement calculations respectively. Vehicle datasets which are circulated back from either Figure 7.15 or 7.16 are placed in the main FIFO for the next simulation time cycle's computation. Although the dual queue design is similar to the system illustrated in Figure 7.3, the application here is much different. The vehicle events stored in the queues of Figure 7.3 are dormant until their appropriate simulation time cycle arrives. The vehicles are then dequeued, initialized, and injected into the Scheduler, which for traffic, simulates the road network. Vehicle events in the queues of Figure 7.3 are indexed based on time. The vehicles stored in the queues shown in this figure are already moving within the traffic network and are themselves circulated and processed every simulation cycle. These vehicles are indexed based on their lane position, not time.


The main queue of Figure 7.14 is used to store the vehicles which circulate to the computation units each simulation cycle so that the vehicle position, velocity, and acceleration fields can be updated. The second queue, the entry queue, stores vehicles which have just entered the intersection or roadway. The implementation calculates the vehicle position, velocity and acceleration for vehicles moving in the same direction on a road. Similarly, each intersection implementation handles one direction of traffic entering the intersection and exiting the same intersection in one of possibly three directions. Each processing element contains FPGAs to handle a four-way intersection and its corresponding exiting lanes of traffic. All queued vehicles in Figure 7.14 are processed every simulation time cycle. The queued data values represent vehicles moving on either a road or through an intersection. Each vehicle is circulated from the appropriate queue into the calculation hardware implementation (Figure 7.15 or 7.16) for acceleration, velocity and position updates, and then the vehicle is circulated back either to the main FIFO or to the next traffic network node. Each vehicle in the queues is handled once per simulation cycle. The vehicles are advanced based on position; those closest to the end of the road are moved first.
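The comparator selection between the two FIFOs can be sketched as follows. This is a minimal Python model (the vehicle tuples and names are illustrative simplifications of the full Table 7.2 word): of the two FIFO heads, the vehicle furthest down the lane is dispatched to the movement pipeline first.

```python
from collections import deque

def next_vehicle(main_fifo, entry_fifo):
    """Model of the Figure 7.14 comparator. Vehicles are
    (distance_down_lane, vehicle_id) pairs; the head vehicle with the
    most advanced lane position is dispatched first."""
    if not main_fifo:
        return entry_fifo.popleft()
    if not entry_fifo:
        return main_fifo.popleft()
    # vehicles closest to the end of the road are processed first
    if main_fifo[0][0] >= entry_fifo[0][0]:
        return main_fifo.popleft()
    return entry_fifo.popleft()

main = deque([(900, "A"), (400, "B")])    # in-progress vehicles
entry = deque([(650, "C")])               # newly arrived this cycle
order = [next_vehicle(main, entry) for _ in range(3)]
```

With the sample queues above, the dispatch order is A (900 ft), then C (650 ft), then B (400 ft), independent of which FIFO each vehicle sits in.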

For the road movement calculation, the vehicles are processed as follows. Although the vehicles pass through the 6 stages of the pipeline illustrated in Figure 7.15, the processing is divided into 4 cycles. Vehicle data passes through the first four stages where its acceleration is determined. The next two stages are a repeat of the first two cycles, only during the second pass, the vehicle is fulfilling its role as a leader to subsequent vehicles in the same lane. If no vehicle is following in the same lane, the vehicle data replaces the previous leader in the lane-based leader register.


The movement calculations can now be defined according to their 6 pipeline stages. As vehicle data enters the pipeline in the first stage, several actions begin immediately. First, the computation of all forms of acceleration, which requires a total of four stages, begins. In order to compute the acceleration, the vehicle's lane assignment and leader are determined. The lane assignment in conjunction with the vehicle's headway facilitates the determination of the vehicle's leader. The second stage computes items including the vehicle's headway. Acceleration computation continues during the second and third stages. In the third stage, the vehicle's position and velocity computations begin, and free-flow acceleration indices are computed. During the fourth stage, the correct acceleration value is selected from the computed accelerations. The acceleration selection is based on the algorithm described in Table 7.3 and Equation 7.1. Once the acceleration for the vehicle is selected, the vehicle can finish its velocity and position computations and begin to serve as a leader for any qualifying subsequent vehicle. Stage five allows the vehicle with its newly calculated acceleration to be placed in the appropriate leader register for subsequent vehicles. The final stage of the pipeline computation decides whether the vehicle has reached the end of the road, therefore needing to transfer to the next intersection, or whether the vehicle should be returned to the main queue of Figure 7.14 to await the next simulation cycle of processing. The pipeline is illustrated in Figure 7.15.
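The stage-by-stage flow above can be sketched as a toy occupancy model. The stage labels condense the text, and the model tracks only which vehicle occupies which stage on each clock (no datapath computation), which is enough to see the 6-cycle latency and the one-vehicle-per-cycle throughput:

```python
# Condensed labels for the six road-movement pipeline stages.
STAGES = [
    "determine lane and leader; start all acceleration calcs",
    "compute headway; acceleration calcs continue",
    "begin position/velocity; free-flow table index",
    "select acceleration (Eq. 7.1); finish position/velocity",
    "publish vehicle to the lane leader register",
    "route: end of road -> next node, else -> main FIFO",
]

def run_pipeline(vehicles):
    """Advance vehicle ids through the pipeline, one stage per clock.
    Returns one row per clock cycle listing the occupant of each stage
    (index 0 is stage one), until the pipeline drains."""
    pipe = [None] * len(STAGES)
    feed = list(vehicles)
    timeline = []
    while feed or any(stage is not None for stage in pipe):
        pipe = [feed.pop(0) if feed else None] + pipe[:-1]
        timeline.append(list(pipe))
    return timeline
```

Two vehicles fed back-to-back occupy the pipeline for 8 clocks in total; the first vehicle reaches the routing stage on clock 6, the second on clock 7.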

The scheduler hardware implementation for the selected traffic example takes significant advantage of data locality. The selected model distinguishes road traffic and intersection traffic. Vehicles are assumed to be initialized and injected onto a road. Properties such as the speed-limit, grade, and other road characteristics are considered



Fig. 7.15. Calculations for Vehicle Movement on a Road The steps required to calculate a vehicle's movement along a road are represented. For each simulation time cycle, vehicles on the road are moved from the Main and Entry FIFO queues of Figure 7.14 into the pipeline described by the block diagram. The pipeline processes the vehicle data words to adjust the vehicle acceleration, velocity, and position. Each vehicle passing through the calculation pipeline may depend on its immediate predecessor's calculation if the previous vehicle is the current vehicle's leader in traffic. Because the lead time required to calculate the acceleration is 6 cycles, and because a dependency may exist where this vehicle's acceleration may be required to determine the following vehicle's acceleration, all the possible acceleration outcomes commence calculation immediately and concurrently. Acceleration determines the duration of each vehicle's process time in the pipeline. As the accelerations are being computed, the vehicle's traveling lane and possible lead vehicle are determined. The appropriate acceleration is selected based on the vehicle's relation to its leader, the vehicle's distance to the end of the road, the traffic signal value at the end of the road, the vehicle's previous velocity, the speed-limit, etc. The block diagram was implemented as a 6 stage pipeline.


if leader = 0 then                              ▷ Not following anyone - no leader
    if (traffic signal == green) || (plenty of open road) then
        if speeding then                        ▷ decelerate in proportion to speeding
            accel = (V^2 − V_speeding^2) / (2·headway)
        else
            accel = table lookup free-flow value
        end if
    else                                        ▷ open road, but traffic signal or road end
        if (road end) then                      ▷ stop sign, wait for intersection access
            accel = 0, velocity = 0
        else
            accel = −V^2 / (2·roadleft)
        end if
    end if
else                                            ▷ Following a leader
    if (traffic signal == red) then
        accel = −V^2 / (2·roadleft)
    else
        accel = α_{l,m} · [ẋ_{n+1}(t+Δt)]^m / [x_n(t) − x_{n+1}(t)]^l · [ẋ_n(t) − ẋ_{n+1}(t)]
    end if
end if

Table 7.3. Acceleration Decisions for a Road The algorithm for determining vehicle acceleration during road travel is provided. Only stop-sign traffic signals were simulated, therefore vehicles stop at the end of a road before proceeding into the intersection. Acceleration for free-flow traffic is determined by table-lookup based on the vehicle speed and type. Equations 2.9 and 2.15 derived in Chapter 2 are incorporated in this algorithm.
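The decision tree of Table 7.3 can be rendered as a single Python function. This is an illustrative sequential sketch, not the hardware datapath: the hardware computes all candidate accelerations concurrently and then selects per Equation 7.1, the function and parameter names are assumptions, and the "plenty of open road" test is assumed here to be the 4-second headway check.

```python
def road_acceleration(v, x, leader, signal_green, road_left, headway,
                      speed_limit, freeflow_accel, alpha, l=1, m=0):
    """Sketch of the Table 7.3 acceleration decision for road travel.
    v, x: this vehicle's velocity and position; leader: (v, x) of the
    lead vehicle or None; freeflow_accel: the table-lookup value;
    alpha, l, m: car-following parameters of Eq. 2.9."""
    if leader is None:                            # not following anyone
        if signal_green or road_left > v * 4.0:   # plenty of open road (4 s assumed)
            if v > speed_limit:                   # decelerate in proportion to speeding
                return (speed_limit**2 - v**2) / (2 * headway)
            return freeflow_accel                 # free-flow table lookup
        if road_left <= 0:                        # at the stop sign: wait for access
            return 0.0
        return -v**2 / (2 * road_left)            # brake to stop at the road end
    if not signal_green:                          # following and facing the road end
        return -v**2 / (2 * road_left)
    lead_v, lead_x = leader                       # car-following law (Eq. 2.9)
    return alpha * v**m / (lead_x - x)**l * (lead_v - v)
```

For example, a lone vehicle under the limit simply returns its free-flow table value, while a speeding lone vehicle returns a negative (braking) acceleration proportional to how far over the limit it is.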


constant with respect to the roads. The vehicle is composed of data which includes the destination, velocity, type, etc. Some vehicle properties, free-flow acceleration for example, need not move with the vehicle, but can be locally accessed based on the vehicle type and velocity in each simulation node.

7.2.3.4 Intersection Movement

Similar to the road movement implementation of Section 7.2.3.3, the intersection movement implementation is composed of two parts. The first part initializes vehicles each time they enter the intersection for the first time. The data fields which are initialized include the lane assignment, θ0, and ω0 fields from Table 7.2. In Section 7.2.3.5, this initialization implementation is referred to as Veh Intersect Init in Table 7.5. The rest of this section deals with the second implementation, referred to as Move in Intersect in Table 7.5. Move in Intersect performs the computation for vehicle movement within the intersection.

Vehicle motion through the intersections is similar to the motion computations of the traffic on roads. There are, however, some differences. For this simulator design, the motion through the intersection and the related computations were converted to angular velocity and acceleration for intersection turns. Another difference is that as vehicles traverse the intersections, an assumption is made that the vehicles continue through the end of the intersection and do not stop. All roads, on the other hand, are simulated as being terminated by stop-signs. Otherwise, the intersection computations also follow a similar 6 stage pipeline. The vehicle's lane position and leader are determined in the first stage. All acceleration choices also begin their computation in the first stage. In


the second stage, a determination of whether or not the leader falls within the vehicle's headway is started. Acceleration computations continue. The third stage initiates the vehicle's free-flow acceleration table index computation. Vehicle position and velocity computations begin during the third stage. The selection of the proper acceleration occurs in the fourth pipeline stage. Vehicle position and velocity computations complete during this fourth stage. In the fifth stage, the vehicle may begin to serve as the leader for any subsequent vehicles. If the vehicle has crossed the intersection, it is handed off to the exit road entrance queue for road handling. Otherwise, the vehicle is returned to the intersection main queue for further processing during the next simulation cycle.

The general mathematical formulas used to describe vehicle motion which are applied in this section are reviewed in Chapter 2.

For this thesis, all roads are assumed to be straight, running either north/south or east/west. Further, all intersections are assumed to be governed by stop signs. Traffic lights were not modeled. For turning computations required within intersections, angular acceleration equations analogous to Equations 2.9 and 2.15 were derived.
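Those angular counterparts follow by substituting angular position θ, angular velocity ω, and angular acceleration for their linear equivalents in the car-following and kinematic forms. A sketch under that substitution (the parameter names and sample numbers are illustrative, not the thesis datapath's):

```python
def angular_car_following(omega_f, theta_f, omega_l, theta_l, lam, l=1, m=0):
    """Angular analog of the car-following law used inside intersections:
    the follower's angular acceleration from the angular gap and angular
    speed difference to its leader (lam, l, m as in Table 7.4)."""
    return lam * omega_f**m / (theta_l - theta_f)**l * (omega_l - omega_f)

def advance_turn(theta, omega, alpha, dt):
    """One simulation cycle of turning motion: constant angular
    acceleration alpha over the time step dt."""
    theta_next = theta + omega * dt + 0.5 * alpha * dt**2
    omega_next = omega + alpha * dt
    return theta_next, omega_next

# a vehicle entering a turn at omega0 = 0.2 rad/s, accelerating gently
theta, omega = advance_turn(0.0, 0.2, 0.05, 1.0)
```

The distance down lane and reserved fields of Table 7.2 hold exactly the θ and ω state these updates consume and produce.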

7.2.3.5 Scheduler Results

Results from the scheduler show the least speedup of the simulator sections attempted. The major limitation to the experimental design lies in the division functional unit and the data dependency between leading and following vehicles. An implementation of just a simple division functional unit with registered input and output ports achieved a clock rate of 9.14 MHz. So one impediment to faster implementation on the



Fig. 7.16. Calculations for Vehicle Movement Through an Intersection A vehicle's movement through an intersection is similar to its movement along a road as illustrated in Figure 7.15. Again, the acceleration calculations determine the length of the 6 cycle calculation pipeline. Movement through the intersection differs from movement along a road. For instance, it is assumed that there is no traffic signal at the end of the intersection lane. For this study, a polar coordinate system is applied in the intersections, so the angular acceleration, velocity, and radial angle are computed for each vehicle.


if (leader = 0) then                          . No leader - open road
    if speeding then                          . Decelerate in proportion to speeding
        accel = (V^2 - V_speeding^2) / (2 * headway)
    else                                      . Not speeding
        accel = table lookup of free-flow value
    end if
else                                          . Following a leader
    accel = \ddot{x}_{n+1}(t+T) = \lambda \dot{x}^m_{n+1}(t+T) [\dot{x}_n(t) - \dot{x}_{n+1}(t)] / [x_n(t) - x_{n+1}(t)]^l
end if

Table 7.4. Acceleration Decisions for an Intersection. The acceleration decisions for traversing an intersection are similar to the decisions for travel on a road as explained in Table 7.3, but it is assumed that a vehicle does not stop at the end of the intersection before moving on to the next road. Equations 2.9 and 2.15 derived in Chapter 2 are converted to their angular counterparts and then incorporated in this algorithm. For turning lanes, the corresponding angular acceleration equations were implemented.
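The decision logic of Table 7.4 can be sketched in software. The Python below is a hypothetical rendering, not the thesis's AHDL implementation: the function and parameter names are invented here, the speeding branch applies the kinematic deceleration over the available headway, and the following branch uses the general car-following form of Equations 2.9 and 2.15 with sensitivity λ and exponents m and l.

```python
def intersection_acceleration(v, v_limit, headway, leader=None,
                              freeflow_table=None, lam=1.0, m=0.0, l=2.0):
    """Acceleration decision sketch for Table 7.4 (hypothetical names)."""
    if leader is None:                     # no leader: open road
        if v > v_limit:                    # speeding: decelerate toward the limit
            # kinematic deceleration over the available headway (negative value)
            return (v_limit ** 2 - v ** 2) / (2.0 * headway)
        return freeflow_table[round(v)]    # free-flow acceleration lookup
    # Car-following law in the general form of Equations 2.9/2.15: acceleration
    # proportional to the speed difference, scaled by the follower's speed^m
    # over the spacing^l, with sensitivity lambda.
    x_lead, v_lead = leader                # leader position and velocity
    spacing = x_lead                       # follower taken to be at x = 0
    return lam * (v ** m / spacing ** l) * (v_lead - v)
```

With m = 0 and l = 2 this reduces to a simple inverse-square spacing law; other exponent choices recover other members of the same model family.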


FPGAs is division. During the fitting of the traffic designs, the slowest routing imple-

mentation paths are composed of division signal lines. If AHDL division library routines

cannot be accelerated, providing hardwired division functional units on FPGAs would

certainly accelerate the traffic implementations.

The routines listed in Table 7.5 compute different segments of the vehicle simula-

tion. The Initialize Vehicle implementation, described briefly in Section 7.2.3.2, handles

vehicles which are entering the simulation at source nodes for the first time. These ve-

hicles have been popped off the Event Queue described in Section 7.2.2 and presented

in Table 7.1. The Initialize Vehicle component receives vehicles from the event queue

and inserts the vehicle's source and random destination node values into the vehicle data

structure. The Veh Road Init component handles vehicles anytime they enter a new

road in the simulation. Any required road specific data is injected into the vehicle data

structure in this routine. The component is described at the beginning of Section 7.2.3.3.

The rest of that section is devoted to describing the Move on Road implementation. Veh

Intersect Init, described at the beginning of Section 7.2.3.4, similarly initializes vehicles

as they enter intersections. The rest of Section 7.2.3.4 then describes the Move in Inter-

sect implementation. Both the Move in Intersect and Move on Road implementations

employ a First-In-First-Out (FIFO) queueing system composed of two FIFO queues. So

each implementation requires two chips to implement this queueing system. The system

is implemented to handle 256 vehicles in the main queue.

Each processing element has facilities to handle a four-way intersection and its

egress lanes of traffic. Therefore, each processing element contains enough FPGAs to

handle 4 sets of each implementation described in Table 7.5, except the Initialize Vehicle


Function            Chips  Chip Type           % Util  Clock Rate   Cycles/veh
Initialize Vehicle    1    EPF10K30EFC256-1     50%      5.44 MHz       1
Veh Road Init         1    EPF10K200SFC484-1    38%     46.94 MHz       1
Veh Intersect Init    2    EPF10K200SFC484-1    63%    109.89 MHz       1
                           EPF10K30EQC208-1     50%
Move in Intersect     3    EPF10K200SBC356-1    52%      5.44 MHz       1
                           EPF10K50EQC240-1     57%      5.44 MHz       1
                           EPF10K100EBC356-1    75%      7.54 MHz       4
Move on Road          3    EPF10K200SBC356-1    52%      5.44 MHz       1
                           EPF10K50EQC240-1     57%      5.44 MHz       1
                           EPF10K130EFC484-1    89%      8.00 MHz       4

Table 7.5. Scheduler Chip Implementation. The scheduler software was implemented as 5 separate components. The Initialize Vehicle component sets the vehicle's source location and destination as the vehicle is injected into the traffic network. Although its clock speed is only 5.44 MHz, it can process one vehicle per cycle and is therefore not a bottleneck for the simulator. The Veh Intersect Init component prepares vehicles for transit through an intersection by initializing their starting coordinates and lane designation. The module also performs some vehicle routing. Both the Move in Intersect and Move on Road implementations employ a first-in-first-out (FIFO) queueing system composed of two FIFO queues, so each implementation requires two chips to implement this queueing system. The Move in Intersect implementation is the system bottleneck: 4 clock cycles are required to process each vehicle data word, and the clock runs at only 7.54 MHz due to division operations in some of the pipeline stages. Table 4.6 illustrates that for the timed Trafix scheduler function, the bottleneck resides in the intersection routine. Here, the routine requires 4 cycles, so the speedup attained by the hardware over the software is 91.


implementation. Nodes may be specifically configured to act as vehicle source nodes,

in which case only the vehicle road handling hardware and the hardware used for the

source node presented in Table 7.1 are required. The resulting processing element design

is illustrated in Figure 7.17. The total number of FPGA or reconfigurable logic chips is

expected to decrease as the technology and compilers improve.

Comparing Tables 4.6 and 7.5, system bottlenecks can be seen to occur within the

function for traversing an intersection. In software, this routine required 48.4 µs. This

measurement comes from timing and averaging the vehicle movement routines while the

Trafix code was executing on a UNIX box as described by Table 4.6. The software timing

results are then compared to the time required to get the same functional result on the

FPGA implementation according to the Altera MaxPlusII simulation. In hardware, 4

cycles of a pipeline running at a 7.54 MHz clock are required. Therefore, the

speedup of the hardware implementation over the software implementation is 91. This

value compares favorably with other reported hardware acceleration, usually based on

a network of processors, used to increase the speed and capacity of simulation by up to

100 times [14].
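The reported speedup follows from simple arithmetic; the short sketch below reproduces it from the measured software time and the hardware clock rate quoted above.

```python
# Quick check of the reported speedup: software time for the intersection
# routine vs. 4 pipeline cycles of the 7.54 MHz FPGA implementation.
software_us = 48.4                # measured Trafix routine time (Table 4.6)
clock_hz = 7.54e6                 # Move in Intersect clock rate (Table 7.5)
cycles_per_vehicle = 4
hardware_us = cycles_per_vehicle / clock_hz * 1e6   # ~0.53 us per vehicle
speedup = software_us / hardware_us
print(round(speedup))             # prints 91
```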

7.3 Network

For simulation acceleration to be successful, speedup must occur within all facets

of the architecture, including the processing element interconnection network. Section 7.3

presents a method of synchronizing individual nodes to form a processing element net-

work capable of determining the smallest timestamped event rapidly. The basic processor

model used to implement the local processing elements is illustrated in Figure 7.3.


[Figure: processing element containing channel controls to the north, south, east, and west PEs, a local crossbar, a parallel bus interface with storage and CPU, a connection to the cross-point switch, an Event Generator with comparator, and four Move on Road and four Move in Intersect Scheduler units.]

Fig. 7.17. Processing Element for 4-way Intersection and Exit Roads. A detailed version of the Processing Element design depicted in Figure 7.3 is illustrated. This design is capable of modeling 4-way intersections, and consists of enough Scheduler subcomponent units to model the traffic entering an intersection from 4 directions and exiting the intersection on 4 output roads. A single Event Generator module which contains its associated Arrival and Service Queues is included to allow the Processing Element to serve as a simulation source node. The Processing Element contains 4 nearest-neighbor interconnect FIFOs and a communications FIFO pair which connects to its corresponding cross-point switch, illustrated in Figure 7.23. An additional interface connects the processing element to the parallel bus illustrated in Figure 7.22. A central crossbar matrix, similar to the Splash design described in Section 3.2.1, connects the various processing element subcomponents. The design described in this figure requires approximately 30-34 FPGAs. 6 FPGAs are required for the Event Generator and the Event Queues. Each Scheduler subcomponent used in calculating vehicle movement requires 3 FPGAs, yielding a total of 24 FPGAs for the 8 Scheduler subcomponents. Additional FPGAs are reserved for Channel Control. This value is similar to the original Splash board designs which required 32 FPGAs per Splash unit [64].


A typical simulator is composed of individual nodes joined in a network. To

prevent causality errors in conservative simulation, all nodes process the same simulation

cycle simultaneously. In conservative event-driven simulation, individual nodes all jump

to the simulation cycle which coincides with the smallest timestamped event held within

the network. Logistical difficulties occur in both the communications and sorting of the

timestamps. Each node’s local minimum timestamp must be compared against all of the

local minimum timestamps in the global network.

In a simulation network, as shown in Figure 7.18, nodes are generally synchro-

nized using either a time-driven or event-driven simulation approach. A single network

architecture can be constructed allowing a simulation to run as either a time or event-

driven model. The decision between the two models is made at the beginning of the

simulation based on calculations from Chapter 5, and the selected model is used for the

simulation duration. A communications network which can be used to determine and

select the smallest timestamp in a network of nodes when running in event-driven mode

is presented. A time-driven solution is also presented using the same implementation.

7.3.1 Communications Architectures

Communications synchronization is often a source of delay. In work on the CM-5,

Legendza notes that synchronization overhead accounts for 70% to 90% of total simulation

runtime and therefore severely limits speedup [96]. Traditional approaches in multi-

processor simulation search for the smallest next timestamp in a network of N processing

elements. The simulation model may have n active simulation model nodes distributed

across the N processing elements in a balanced fashion, but each processing element will


[Figure: six Processing Elements, each with an Event Generator, Event Queue, and Scheduler, joined by an asynchronous parallel bus with Master Sync Start and Done lines.]

Fig. 7.18. A Network of Processing Elements. A simulation consists of a network of event sources, sinks, and way-points. Each must be synchronized to the global system time clock. Two common methods of synchronization are time-driven and event-driven synchronization. The analysis of Chapter 5 can be used to gauge which method is faster. The illustrated time-driven simulation uses a controller/subordinate approach similar to Levendel [98]. The network core, illustrated in Figure 7.19, serves as the Main Synchronizer which asserts the Start line at the beginning of each time cycle. Each network processor signals it is ready for the next time cycle by asserting its Done line. The Start and Done lines are configured as reduction network lines illustrated in Figure 7.20.


Fig. 7.19. The 3-Dimensional Network Structure. Although trees have a wonderfully logarithmic decreasing structure, they offer difficult geometric constraints for actual implementation. A linear parallel bus offers a much easier structure to implement, but poses more difficult adjacency problems. In the network illustrated, each parallel bus is composed of reduction logic as shown in Figure 7.20. Much of the communications can be accomplished by the Processing Element (PE) I/O cells. The length of each bus is a trade-off between communications circuit element switching speed, bus signal propagation speed, and physical PE geometry constraints. In this figure, the PEs are arrayed along linear busses. Letting 10 elements reside on each bus, and 10 arrays of 100 PEs per quadrant, allows each network to contain 8000 elements. The core may be composed of more than one processor, but for the purposes of this research, the core is assumed to be one unit.


[Figure: a row of Processing Elements linked by a reduction bus.]

Fig. 7.20. The PE Interconnection Network. Processing Elements (PEs) can be interconnected in 1 or 2 dimensions. The interconnections consist of a high speed emitter-coupled logic design. The buses link the Processing Elements together, allowing rapid and semi-parallel determination of the next smallest time in the network. The OR network assists in the computation of the smallest timestamp and serves for both computation and signal driving. In addition, each processing element is directly connected to its north, south, east and west neighbors in what is commonly called a two-dimensional nearest-neighbor communication pattern [64].


have one minimum timestamp for the model nodes it handles. Each processor timestamp

must be compared against the other minimum timestamps in the network. Some of the

more commonly expected network search algorithms include network structures con-

structed as k-ary trees depicted in Figure 7.21. Determining the minimum timestamp

in such a network requires log_k(n) communications steps. The smallest timestamp is

filtered to the root of the tree, and from there the result must be distributed back to the

rest of the network, so the complete method requires O(log_k(n)) communications steps.

[Figure: a tree of Processing Elements.]

Fig. 7.21. K-ary Search Tree Network. The K-ary search network topology allows N processing elements in a network to compare individual local minimum timestamp results to the winner of the K elements on the level below in the network tree. Successive winners compete in tournament-style comparisons.

Another view of the simulation notes that the larger the number of event gener-

ators which exist in the system, the shorter the expected time to the next event, E(x).

This phenomenon can be gleaned from Equations 5.18 and 5.27. Although the examples


from [27] use homogeneous distributions, it is assumed that the trend holds for indepen-

dent heterogeneous distributions as well. So the larger the number of event generators in

the simulation, the faster the events arrive, and the smaller the mean time between

events becomes. As N increases, time-driven simulation becomes more and more practical.

7.3.2 Parallel Bus Architecture

For the proposed algorithm, several transmitters must share the bus and be able to

generate signals simultaneously. The bus architecture can be handled by a bi-directional

reduction logic network. Employing a technology such as Emitter Coupled Logic (ECL)

gives the interface reasonable transmission speed, and ECL hardware couples nicely with

CMOS technology [34]. ECL switching speed is accomplished by keeping its transistors

always biased in their active regions. OR or NOR logic can be used to run buses in two

directions as depicted in Figure 7.20. Reduction logic can be accomplished directly at

the processing element I/O points without processor intervention.

The primary function of the parallel bus is to locate the minimum network time

stamp and synchronize the network. A secondary function of the bus allows the pro-

cessing elements to communicate with the centralized Controllers. Data sent to the

Controller includes the address of the sending processing element, a simulation event or

location identifier, and the data values. Additional control signals to send data to the

Controller are used with the bus. Alternatively, the PE can also communicate with the

controllers serially via the cross-point matrix of Section 7.3.6.


7.3.3 Search Algorithm

One algorithm for finding the network minimum timestamp proceeds in two basic

phases. The first step consists of a general elimination which prunes processing elements

having timestamps larger than 2^k, the base-2 ceiling of the global minimum timestamp.

The second phase of the algorithm then finds the minimum among the remaining nodes.

7.3.4 Phase 1 Elimination

First, all network processing elements (PEs) find their local minimum values.

This search involves comparing the lead elements of the service and arrival queues from

Figure 7.3 in O(1) time. A hardware algorithm for maintaining the smallest event within

a processing element is presented in 7.2.2. Next, each PE computes the difference between

the current global simulation time cycle and the next local minimum timestamp, tdiff ,

in O(1) time. Each PE determines the number of bits, b, required to express tdiff .

For example, 13 requires 4 bits, 1101_2. The PEs simultaneously pull the signal line

representing b low on the global parallel bus illustrated in Figure 7.22. After all PEs

have floated their b values on the bus in O(1), the PEs whose b value is greater than the

bus minimum signal line eliminate themselves from the search. The smallest asserted

signal line of the parallel bus narrows the scope of the search to the limited range of

numbers expressed in Equation 7.2:

2^b − 2^{b−1} = 2^{b−1}(2 − 1) = 2^{b−1} (7.2)
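Phase 1 can be sketched in software. The Python fragment below is a hypothetical rendering (the helper name is invented here): each PE's bit count b is modeled with `int.bit_length()`, and the smallest asserted line on the wired-OR bus with `min()`.

```python
def phase1_eliminate(global_time, local_min_timestamps):
    """Sketch of the Phase 1 elimination over a wired-OR bus.

    Each PE computes tdiff, the difference between its next local minimum
    timestamp and the current global simulation time, then the number of
    bits b needed to express tdiff (e.g. tdiff = 13 -> b = 4). PEs whose
    b exceeds the smallest asserted bus line drop out of the search.
    """
    bits = [max((t - global_time).bit_length(), 1)
            for t in local_min_timestamps]
    b_min = min(bits)                  # smallest asserted bus line, in O(1)
    survivors = [i for i, b in enumerate(bits) if b == b_min]
    # All surviving candidates lie in [2**(b_min - 1), 2**b_min), a range
    # of 2**(b_min - 1) values, as in Equation 7.2.
    return survivors, b_min
```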


All elements not eliminated in this first phase are referred to as active elements

in the second phase.

7.3.5 Phase 2 Selection

The second phase of the algorithm can proceed in either of two methods. Method

one requires a 3-bit reduction network, and method two requires a 2-bit reduction net-

work. The first method performs a binary search through the range of timestamps

isolated in Phase 1. The second method performs binary eliminations among the re-

maining active nodes. The reduction network can also serve as the Start and Done lines

for the Main Synchronizer under time-driven simulation as described by Figure 7.18.

In the first method, a Bus Controller begins a binary search through the remaining

range of numbers to determine the minimum global timestamp. The reduction network

is used to allow the PEs to signal whether their values are higher, equal, or lower than the

value floated on the parallel bus. Using Equation 7.2, the global search can be completed

in O(log_2(2^{b−1})) = b − 1 steps, and the resulting global minimum timestamp range is visible

to all PEs simultaneously.
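Method one can be sketched in software. The following Python is a hypothetical rendering (names are invented here): the reduction network's higher/equal/lower answer is modeled by an `any()` test over the surviving timestamp differences, and the binary search over the range isolated in Phase 1 completes in at most b − 1 probes.

```python
def phase2_binary_search(survivor_tdiffs, b):
    """Sketch of Phase 2, method 1: binary search over [2**(b-1), 2**b).

    Each step floats a probe value on the parallel bus; the reduction
    network reports whether any surviving PE holds a value <= probe.
    """
    lo, hi = 2 ** (b - 1), 2 ** b - 1
    steps = 0
    while lo < hi:
        probe = (lo + hi) // 2
        if any(t <= probe for t in survivor_tdiffs):  # reduction-bus answer
            hi = probe
        else:
            lo = probe + 1
        steps += 1
    return lo, steps        # global minimum tdiff; steps <= b - 1
```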

The selection phase has several significant advantages over tree search methods.

One advantage of this method is its initial elimination step which occurs across the

network at all PEs simultaneously. This is in contrast to a k-ary tree, in which

the first comparison happens at the lowest level only. Another significant advantage is

that the network is somewhat more conducive to a geometric element layout as opposed

to a binary tree, where the interconnections between element levels get progressively more

difficult. Perhaps the most significant advantage is that the timestamps can remain in


their original locations instead of being moved and coalesced into a central location. Its

disadvantages include requirements for additional hardware and bus lines as illustrated

in Figure 7.20.

The second method proposed for Phase 2 allows the processing elements which

remain after the first phase to work in adjacent pairs. All active processing elements

generate a signal which is passed towards the core along the network Edge signal line.

Therefore any PE and the core receiving this signal know that there exists at least one

active element on their network edge side. Next, the elements use their Adjacency signal

line to form processor pairs. Active elements at the edge of the network propagate both

the Edge and Adjacency signals. The next innermost active element heading towards

the core will receive both the Edge and Adjacency signals along with the value of the

smallest timestamp on the data lines as shown in Figure 7.22. This inner core side ele-

ment will propagate only the Edge signal towards the core. Having alternating elements

propagate the Adjacency signal facilitates a pairing of the network elements. In each

pair, the element closer to the network edge automatically self eliminates. The inner

paired element compares its local minimum timestamp with the value received on the

data-bus. The smaller value becomes the minimum used in the next cycle. The core

retains the smallest value until all eight network quadrants have reported in, and then

broadcasts the final result. The advantage of this mode is that as the number of nodes,

N, increases, the expected time of the minimum timestamp becomes more isolated from

the other local minimum timestamps in the system. The first elimination becomes the

only step required.
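The pairing scheme of method two can be approximated in software. The sketch below is a hypothetical analogy (names are invented here) under a simplifying assumption: each round pairs adjacent active elements, keeps the smaller timestamp at the core-side element, eliminates the edge-side element, and repeats until one value remains at the core.

```python
def phase2_pairwise(active):
    """Sketch of Phase 2, method 2: repeated pairwise elimination.

    `active` lists the surviving local minimum timestamps ordered from the
    network edge toward the core. In each round the edge-side element of
    every adjacent pair passes its value inward and self-eliminates; the
    core-side element keeps the smaller of the two values.
    """
    while len(active) > 1:
        nxt = []
        for i in range(1, len(active), 2):
            nxt.append(min(active[i - 1], active[i]))  # core side keeps min
        if len(active) % 2 == 1:
            nxt.append(active[-1])   # unpaired core-side element survives
        active = nxt
    return active[0]                 # value retained at the core
```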


[Figure: processing elements between the network core and network edge on a data bus with Edge and Adjacency lines; paired, active, and eliminated elements are marked, and bold indicates active signal/data lines.]

Fig. 7.22. Algorithm Phase 2, Method 2. Elements eliminated by the initial reduction step are illustrated inscribed with a cross. Signals flow through the eliminated processing elements. The data signals are shown traversing the upper bus. The lower two-signal bus represents the basic handshaking signals. The Edge signal indicates to each element whether or not that element is a network edge element. All elements which have not self-eliminated during the first phase generate an active Edge signal and propagate the signal towards the network core. The Adjacency signal is used to pair processing elements. Each active element which receives the Edge signal but not the corresponding Adjacency signal propagates its own Adjacency signal towards the direction of the network core. When either another active PE or the core receives the Adjacency signal, that element does not propagate the signal but instead compares its minimum local timestamp with the timestamp value received on the Data bus. The minimum value of the pair becomes the minimum value at the node closest to the core while the outer pair node is eliminated.


7.3.6 Cross-Point Matrix

The simulator Communications Structure of Figure 7.2 is a

cross-point matrix network laid out in a star topology. Levendel’s [98] cross-point ma-

trix and Splash’s [8] crossbar switch employ an interconnect to accelerate processing

element communications thereby avoiding a communications bottleneck. In the case of

Splash [8], the requirement of a crossbar switch was learned only after creating the initial

prototype without the matrix. Each of the 8 quadrants depicted in the 3-dimensional net-

work layout of Figure 7.19, contains 2-dimensional arrays of processing elements. Each

of these 2-dimensional arrays is associated with a cross-point switch used to allow pro-

cessing element communications. The cross-point network, although serial, allows more

direct connections between the processing elements than the parallel reduction bus. For

quadrant cubes, 10 processing elements on an edge, each 2-dimensional processing ele-

ment sub-array contains 100 processing elements. Using a 300 pin cross-point matrix,

approximately one third of the lines will connect directly to the processing elements of

the 2-dimensional array. The other two thirds of the lines of the cross-point matrix are

used to connect the 2-dimensional array to the rest of the 3-dimensional network. There

is a cross-point switch at the network core. Adjacent processing elements also connect

directly to each other. The cross-point control network is illustrated in Figure 7.23.

The time required for communications using the cross-point matrix network can

be analyzed by dividing the simulator processing time into the time spent processing

vehicles, tprocess, and the communications time, tcomm. Therefore each simulation

cycle is composed of tprocess + tcomm as illustrated in Figure 7.24.


[Figure: a hierarchical cross-point switch shown in detail, with HDLC initialization, control, and status lines, data/clock lines to processing elements, links to neighboring switches, and a CPU with storage and FPGA.]

Fig. 7.23. Cross-point Switch Architecture. A cross-point hierarchical network is illustrated with one switch shown in detail. Arrays of processing elements connect to their respective switch by a high-level data link control (HDLC) line which is used to send framed connection control data to the cross-point switch controller. If a virtual circuit is available to the requested destination, the circuit is assigned to the processing element. The cross-point switch virtual circuit provides a direct serial connection for a data line and for its related clock line. Cross-point switches in the same simulator quadrant directly connect to each other. The cross-point switches are hierarchically configured, allowing a virtual circuit to connect to processing elements attached to cross-point switches in other simulator quadrants. Large cross-point matrices are used to provide as close to a fully connected network as possible, where the network is laid out in a star topology. CPUs are used to monitor and initialize the switch configurations [17].


[Figure: one simulation cycle split into a processing cycle (tprocess: run through the queues updating the vehicles; prepare vehicles which must be transferred) and a communications cycle (tcomm: transfer vehicles to the next node; accept new vehicles transferred in; process user requests).]

Fig. 7.24. Processing and Communications Time. Each simulation time cycle is divided into tprocess and tcomm subcomponents. The processing element cycles through the Scheduler Main and Entry queues, updating the position, velocity, and acceleration information of each vehicle data structure during tprocess. Vehicles which must be transferred to the next node are moved during tcomm. User-directed system interrupts and system synchronization occur during this latter phase.

Looking first at the time required to process events in each processing element

using the traffic scheduler as our model, let tprocess be defined in Equation 7.3.

tprocess = ((Qavg − 1) + numstages)(tcycle) (7.3)

The limiting motion function of Table 7.5 has a clock rate of 7.54 MHz or 133

ns/cycle. Using this clock rate for tcycle and the 6 stage pipeline implementation of

Figure 7.15, numstages = 6. Let Qavg, the expected vehicle queue size, be 25 vehicles.

The resulting tprocess = 4 µs to process the vehicles moving on the road. Next, the

value of tcomm, the communications delay through the cross-point matrix, is calculated.

The simulator was implemented with 0.5 second time resolution, so let esend = 2 be the

number of events, or vehicles, which have finished processing at the current processing element and

need to be transferred to the next during one simulation cycle. In the traffic simulation,


these events represent vehicles which have come to the end of a road and are now entering

the intersection.

From a table in [38], the propagation signal delay can be estimated as tsd = 0.05

ns/cm. The worst case communications scenario involves passage through 3 cross-point

switches. The first half of the route is illustrated in Figure 7.19, the second half of the

route travels outwards from the core to a different cube corner. The first cross-point

switch in the worst case scenario is connected to the sending processing element’s array.

This first switch is located at approximately the end of the second array displayed in

Figure 7.19. The second switch resides at the network core, where the sphere is located in

Figure 7.19, and the third connects the receiving array to the network. Each processing

element is a 10 cm cube. The worst case distance across a network composed of eight

1000-PE quadrants is 600 cm. Through that distance, the propagation delay is 30 ns. The

vehicle data messages are relatively long as compared to the gate value results of [98].

So let tdm = 300 ns as a conservative estimate.

To communicate across the cross-point matrix, a point-to-point channel is nego-

tiated between the two processing elements. First, channel request & grant time, tcrg,

is required to establish the circuit. The time required to send the message is the delay

in message transmission time, tdm. Once the process is complete, channel release time,

tcr, is used to free the circuit. Finally, if the circuit is unavailable, a penalty of time

wasted in processing a blocked request, trb, is incurred. For the calculation of tcomm,

let j denote the number of events which encounter a busy channel. Assume, on average


that the messages transmitted go half of the worst case distance, or through (1/2)(3) = 1.5

switching matrix hops. The formula for the transmission time is as follows:

tcomm = 1.5[tcrg + tdm + tcr]esend + trb(1 − (1 − j)^{1.5}) (7.4)

Equation 7.4 assumes the average communications require 1.5 network hops which

can result in 1.5 possible call blocks. To compute tcomm, the parameter values: tcrg = 50

ns, tdm = 300 ns, tcr = 50 ns, esend = 2 vehicles, trb = 50 ns, and j = 10%, which

are based on the values from [98] are used to compute the communications delay. Using

these values, tcomm computes to 1.2 µs. For this example, although tcomm is smaller

than tprocess, the values are close enough to indicate that the implementation of the

communications system is an important consideration in the machine design.
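As a check on the figures above, the sketch below plugs the stated parameter values into Equations 7.3 and 7.4, reading the blocking term's factor as an exponent of 1.5 (an interpretation consistent with the reported 1.2 µs result).

```python
# Reproducing the timing estimates of Equations 7.3 and 7.4 with the
# parameter values given in the text.
t_cycle = 1 / 7.54e6            # limiting 7.54 MHz clock, ~133 ns/cycle
q_avg, num_stages = 25, 6       # expected queue size, pipeline depth
t_process = ((q_avg - 1) + num_stages) * t_cycle          # Equation 7.3

t_crg = t_cr = t_rb = 50e-9     # channel request/grant, release, block times
t_dm, e_send, j = 300e-9, 2, 0.10
t_comm = (1.5 * (t_crg + t_dm + t_cr) * e_send
          + t_rb * (1 - (1 - j) ** 1.5))                  # Equation 7.4

print(round(t_process * 1e6, 1), round(t_comm * 1e6, 1))  # prints 4.0 1.2
```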

7.3.7 Network Results

Networks of processing elements deployed in three dimensional arrays and con-

nected by the parallel bus architecture of Section 7.3.2 were simulated for both time and

event-driven mechanisms. The event-driven synchronization time was computed assum-

ing that the worst case signal propagation delay was required for all steps. The signal

propagation delay along the parallel bus is composed of the time required to propagate

a signal through each reduction gate in the array as depicted in Figure 7.20. The gates

are assumed to be high speed Emitter-Coupled Logic (ECL) or Source-Coupled FET

Logic (SCFL) 10 ns delay gates [109]. Processing elements are deployed along linear

busses whose lengths are determined by the number of processors in the simulation. The


processing element connections to the bus are spaced 10 cm apart. Propagation delay

along the bus is assumed to be 5 ns/m [38] excluding the time required to pass through

the OR gate drivers. The peripheral buses of processing elements are connected to the

middle level of linear busses. Gates at the end of the buses bridge onto the next bus

layer. There are three interconnected bus layers from the network edge to its core. These

three layers are represented by the arrows illustrated in Figure 7.19.

The simulation mode determines the time required to locate the next smallest

timestamp in the network. For the time-driven simulation mode, the time required is

composed of two components. The first component tallies the simulation clock cycle

signal as it is passed through each repeater gate illustrated in Figure 7.20. The second

component is the propagation time along the wire runs between each gate. The signal

must pass through the entire network from its core to its edge elements. The expected

time to the smallest arrival event was computed using a network of Independent, Iden-

tically Distributed (IID) sources following the patterns set by Equations 5.18 and 5.27.

The event-driven simulation time was computed by using the same pair of ex-

pected time components described above. One bus propagation/elimination step is al-

ways necessary. Then the power of two, k, such that 2^k is the logarithmic ceiling of the expected minimum network timestamp value, is computed. Next, the number of events, E, which can be expected to arrive before 2^k is computed. Each comparison requires about 4

ns. Finally, log2(E) was determined to calculate the number of comparisons required to

calculate the network minimum timestamp. Each communication through the network

is assumed to be from core to edge, the worst case scenario.
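The two expected-time components just described can be combined into a rough cost model. The bus-step cost and the expected event count E used below are illustrative assumptions, not measured values:

```python
import math

def event_driven_search_ns(expected_min_ts, events_before_2k,
                           bus_step_ns=200.0, compare_ns=4.0):
    """Rough cost of locating the network minimum timestamp: one mandatory
    bus elimination step, then about log2(E) comparisons among the E events
    expected to arrive before 2**k, where k is the logarithmic ceiling of
    the expected minimum timestamp."""
    k = math.ceil(math.log2(expected_min_ts))
    comparisons = math.ceil(math.log2(max(events_before_2k, 2)))
    return bus_step_ns + comparisons * compare_ns, k

cost_ns, k = event_driven_search_ns(expected_min_ts=1000, events_before_2k=16)
```

With these assumed inputs the comparison phase costs only a few tens of nanoseconds on top of the bus step, which is consistent with the event-driven curves flattening for large means.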


In Figures 7.25-7.28, two event-driven methods are illustrated. The two methods

are labeled Event-Driven Range and Event-Driven Elements and are defined in Sec-

tion 7.3.5. The Event-Driven Range curve performs a binary search by dividing the

range of possible time values isolated in the first elimination phase. Alternatively, in the

Event-Driven Element method, the algorithm takes advantage of the fact that as the

distribution means increase, the elements become more isolated at the extreme ends of

the distribution curves. The first method must step through the remaining binary range

of numbers, searching for the minimum. The second method tends to jump directly to

the correct element in O(1) as the distribution means increase.
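The Event-Driven Range method can be sketched as a binary search over the value range, where each probe stands in for one bus-wide elimination step; the probe predicate here is an illustrative software stand-in for the hardware query:

```python
def range_minimum(timestamps, upper):
    """Binary search over [0, upper): each step asks whether any timestamp
    lies in the lower half of the remaining range and discards the other
    half, converging on the minimum in log2(upper) probes."""
    lo, hi, probes = 0, upper, 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        probes += 1
        if any(lo <= t < mid for t in timestamps):
            hi = mid
        else:
            lo = mid
    return lo, probes

minimum, steps = range_minimum([900, 4070, 12000], upper=2**14)
```

Note that the probe count depends on the size of the value range, not on how many elements survive, which is why the Event-Driven Elements method, indexing the isolated elements directly, pulls ahead as the means grow.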

Figure 7.25 reveals the time required by the network minimum timestamp search

algorithm as the number of processing elements and the event generator Exponential

means vary. The event generators used in the distributions are IID. Figure 7.26 shows a

slice of Figure 7.25 at the 1000 processing element mark. The graphs indicate a clear gain

which can be harvested if the time and event-driven methods are used in conjunction.

The second method in the second phase of the event-driven algorithm clearly yields

significant gains for exponential distributions with large means. A scenario which would

benefit from this algorithm might be one where the simulation has distribution means

in the millisecond range but requires nanosecond resolution.

Figure 7.27 illustrates the results of a simulation using IID Weibull distributed

sources. The plot varies the number of processing elements and the arrival rates of

those elements. Time-driven simulation works well with smaller mean arrival rates, and

the second proposed event-driven method works best with higher distribution means.


Fig. 7.25. Exponential Distribution in Event vs Time-Driven Simulation The graph illustrates a network of Independent and Identically Distributed (IID) nodes in network sizes ranging from 125 to 8000 nodes. Each node generates arrival events according to an Exponential distribution with mean arrival times ranging from 1 to 2^14. The graph illustrates that for an exponential arrival rate, the mean arrival time has the most significant impact on network synchronization. The time required for the event-driven model is computed by counting the longest signal run from the edge to the center of the network multiplied by the propagation delay per unit length; 10 ns are added for each OR gate (see Figure 7.20) encountered in traversing the path to the network core. The time-driven delay through the network is simply the duration of the number of time steps required.


Fig. 7.26. Exponential Distribution Slice of Figure 7.25 Illustrates a slice taken from Figure 7.25 where the simulation contains 1000 processing elements. The first method from Section 7.3.4 is labeled Event-Driven Range, and the second method from the same section is labeled Event-Driven Elements. The graph illustrates that a time-driven approach used in conjunction with the Event-Driven Elements method provides the fastest network search approach.


Figure 7.28 clearly illustrates the greater potential range of benefit to be gained by a

machine which can proceed using either the time or event-driven approaches.

Another issue which requires consideration is that when the mean time between statistical events exceeds a certain limit, the number of simulation events affected as a causal by-product of each isolated event decreases and is limited to a locality around that event. In that regime, a simulation may actually lend itself better to a purely software implementation: the software can jump from affected area to affected area of the simulation network, processing only the individual simulation nodes which require attention. So simulations with

sparse events may run more efficiently using a software approach, where the software

keeps the event list in a heap, pulls off the next smallest timestamped event and moves

to process it. However, for traffic simulations and simulations with continuous activity

spread over a network with events arriving rapidly and simultaneously across multiple

nodes of the network, a time-driven approach is clearly beneficial.
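The sparse-event software alternative described above is, in essence, a classical heap-ordered event loop. A minimal sketch, with illustrative event payloads and an illustrative fixed service delay:

```python
import heapq

def run(initial_events, horizon):
    """Pop the smallest-timestamped event, process it, and schedule any
    causal follow-up; only the touched node is visited at each step."""
    heap = list(initial_events)          # (timestamp, node) pairs
    heapq.heapify(heap)
    processed = []
    while heap and heap[0][0] <= horizon:
        ts, node = heapq.heappop(heap)
        processed.append((ts, node))
        follow_up = ts + 5               # illustrative service delay
        if follow_up <= horizon:
            heapq.heappush(heap, (follow_up, node))
    return processed

trace = run([(0, "A"), (3, "B")], horizon=8)
```

Each pop and push costs O(log n) in the number of pending events, which is cheap when events are sparse but becomes the bottleneck when many nodes fire simultaneously.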


Fig. 7.27. Weibull Distribution in Event vs Time-Driven Simulation The Weibull distribution results are similar to the Exponential results. The optimum crossover point from the time-driven to the event-driven method allows a wider speedup gain to be derived.

Fig. 7.28. Weibull Distribution Slice of Figure 7.27 The graph is a slice of Figure 7.27 at the 1000 processing element mark. The relative simulation search times are displayed. The optimal solution for this range of means would be a simulator which could select between the Time-Driven and Event-Driven Elements approaches.


Chapter 8

Optimistic Synchronization

In contrast to conservative simulation, which avoids causality violations, optimistic approaches allow errors to occur but are then able to detect and recover from violations. Optimistic simulation offers two important advantages over its conservative

counterpart. First, greater degrees of parallelism can be exploited. For instance, if two

events might affect each other, but the computations are such that they actually don’t,

optimistic mechanisms can process the events concurrently, while conservative methods

must sequentialize execution [57]. Second, optimistic simulation methods need not rely

on application specific information (e.g. the proximity to the next object) in order to de-

termine which events are safe to process. Conservative approaches tend to be dependent

on application specific data for correctness. The synchronization method can therefore

be more transparent to the application program in optimistic simulation. The downside

is that optimistic simulation may require more overhead computations and storage than

conservative approaches, causing performance penalties instead of the intended benefits.

For the proposed system, one optimistic modification which might prove beneficial

involves overlapping the local element processing and communications time periods. If

each simulation cycle is divided into two sub-segments, a processing phase followed by a

communications phase as described in Section 7.3.6, some optimistic processing can be

inserted by the overlap of these two phases. The idea is to allow processing elements to


begin processing their next cycle of vehicle queue data concurrently with the previous

communications phase. If the majority of simulation cycles do not transmit data between

the processing elements, some speedup may result. The state of the entire simulation cycle can initially be saved as a checkpoint. For traffic, this approach seems intuitive,

as the newly transferred vehicles will probably enter the vehicle queues towards the

beginning of the road. Processing vehicles near the front of the queues (or the end of

the road) may not engender any causality errors. As a further modification, processing

elements can begin their early vehicle computation based on the size of their main queues

or based on the simulation positions of the contents of those queues. So, for example,

thresholds can be set so that if the main queue is full, or if the vehicles in simulation are

beyond a certain entrance distance from the front of the queue, then optimistic processing

can begin at that individual node. Problems with the optimistic approach include the

amount of memory required to store off the checkpoint information. Additional circuitry

is required to detect and handle the causality error conditions.
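The checkpoint-and-rollback idea can be sketched in a few lines. The position test used to detect a causality risk below is an illustrative stand-in for the queue-threshold tests described above:

```python
import copy

class OptimisticNode:
    """Sketch: checkpoint the vehicle queue, optimistically advance during
    the communications phase, and roll back only if an arriving vehicle
    would violate causality."""

    def __init__(self, queue):
        self.queue = list(queue)      # vehicle positions along the road
        self.checkpoint = None

    def begin_cycle(self):
        """Checkpoint, then optimistically advance every vehicle one step
        while the communications phase is still in flight."""
        self.checkpoint = copy.deepcopy(self.queue)
        self.queue = [p + 1 for p in self.queue]

    def receive(self, position):
        """A vehicle arriving behind every optimistically moved vehicle is
        safe; otherwise restore the checkpoint (causality error)."""
        rolled_back = bool(self.queue) and position >= min(self.queue)
        if rolled_back:
            self.queue = self.checkpoint
        self.queue.append(position)
        return rolled_back
```

The memory cost is visible here: every node holds a full copy of its queue for the duration of the overlapped phase, which is the storage overhead noted above.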


Chapter 9

Results

Discrete event simulation acceleration is requested, needed, and feasible. Exam-

ples of articles in the press [127] and in scientific journals [118] explicitly describe the

requirement for access to accelerated means of simulation. Experimentation shows that

by applying various architectural techniques, discrete event simulation can be success-

fully accelerated. Results can be found at the end of each thesis chapter, and are also

summarized in this section.

In Section 4.2.3, using the representative software simulation model, CORSIM,

typical bottleneck areas of simulation processing are identified. These bottleneck areas,

shown in Figures 4.2 and 4.3, involve the scheduler and overhead routines. Both sets

of routines must be minimized. The presented architecture, by converting software to

hardware, accelerates the Scheduler routines in Section 7.2.3. The overhead routines from

CORSIM, which involve a significant number of data integrity checks in the simulation,

can be eliminated and moved into the simulation model input phase. Because CORSIM

was neither modular nor current, Trafix, a software simulator written in C++ using object-oriented methods, is used to verify the correctness of the car-following algorithms.

Once the car-following algorithms were verified in software, they were translated into

the hardware implementations of Section 7.2.3. The Trafix Scheduler routine timing


results are found in Table 4.6. The software bottleneck is identified as the intersection

movement routine which requires 48.4 µs to process each vehicle.

Although not specified as an initial goal in the comprehensive exam paper, the

lack of an open source road traffic simulator necessitated the development of Trafix. The

current purpose of the Trafix simulator is to test and verify the road movement software

routines developed for Scheduler modeling. Trafix is a GNU licensed open source, free,

modular traffic simulator which is further described in Section 4.3. During the Trafix

system development, the lack of a C++ shared memory allocator was also unexpectedly

noted for GNU-Linux systems. Therefore, an open source, free GNU licensed allocator

class was developed for use with the C++ Standard Template Library, STLPORT. The

allocator is described in Section 4.3.1.

Chapter 5 performs analysis which can be used to determine whether a simulation

will proceed faster in time or event-driven mode. Equation 5.7 can be used to determine

the expected time of the next event. Knowing that interval time, the mode of operation

which most rapidly advances the simulation can be determined. Specific results for both

Independent and Identically Distributed (IID) Exponential and Weibull distributions are

provided in Sections 5.1.3 and 5.1.4. Section 5.2 explores the geometric requirements

for wrapping a traffic map onto the simulator. When a traffic map is divided into n × m

sections, the distance between the discontinuities of traffic map sections laid out on the

simulator processing element arrays is n, where n ≤ m. A more detailed explanation of

the topology concerns can be found in Section 5.2.
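The mode-selection analysis summarized above can be sketched concretely: for n IID Exponential sources with mean mu, the expected interval to the next event is mu/n ticks, and the cheaper mode follows from comparing costs. The per-tick and per-search costs below are illustrative assumptions rather than measured values:

```python
def choose_mode(n_sources, mean, tick_cost_ns=10.0, search_cost_ns=216.0):
    """The expected gap to the next arrival for n IID Exponential(mean)
    sources is mean / n ticks; step the clock if walking that many ticks
    is cheaper than one event-driven minimum search."""
    expected_gap_ticks = mean / n_sources
    time_driven_cost = expected_gap_ticks * tick_cost_ns
    return "event-driven" if time_driven_cost > search_cost_ns else "time-driven"
```

For a fixed mean, adding sources shrinks the expected gap and pushes the choice toward time-driven operation, matching the behavior seen in Figures 7.25-7.28.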

The interior design of the processing element architecture is divided into the Event

Generation, Event Queue, and Scheduler components. Each subsection of the processing


element architecture is individually explored. The Event Generator design is presented in

Section 7.2.1. The results of Section 7.2.1.1 yield an implementation which can generate

an event every 200 ns. Events are produced rapidly enough that this section of the design

is not a bottleneck to throughput. An Event Queue was implemented as a linear array

in Section 7.2.2.3. The Event Queue handles 16-bit words which can be used to point

to the address of data interleaved in memory, or can be expanded to larger sized words

by working queues in parallel. The Event Queue implementation is capable of operating on an 80 nanosecond cycle time, both pushing and popping elements in each

cycle. The logic implementation required for both the Event Generator and the Service

Queue can be found in Table 7.1. Again, the speed of the event queue results show that

the event queue is not a throughput bottleneck. The event Scheduler results for the

traffic simulator model can be found in Section 7.2.3.5. The Scheduler section modeled

the scheduling algorithm in five components. The first component initialized a vehicle data

object entering the network with its source and randomly selected destination. The next

four components consisted of a pair to handle vehicles on a road and a pair to handle

vehicles traveling through an intersection. Both the road and intersection initialization

implementations set attributes before injecting vehicles either onto a road or into an

intersection.

Section 7.2.3.5 found the speedup of the software bottleneck in accelerated hard-

ware to be a factor of 91. As expected, the Scheduler is the bottleneck component in

the processing element design. The individual component speedup values are illustrated

in Figure 9.1. The lowest speedup derived in the system is 91, which coincides with the

simulation task which requires the most time, so the overall system speedup would be


approximately that value. The processing element design illustrated in Figure 7.17 would

require approximately 30-34 FPGAs. Six FPGAs are required for the Event Generator

and the Arrival and Service Queues. Each processing element Scheduler sub-component

used in calculating vehicle movement requires 3 FPGAs, yielding a total of 24 FPGAs

for the eight sub-components of the Scheduler illustrated in Figure 7.17. Additional FP-

GAs are reserved for Channel Control. This number of FPGAs is similar to the original

Splash board designs which required 32 FPGAs per Splash unit [64]. Each processing

element is capable of modeling one source, intersection, or destination node with the

associated outbound roads of traffic. A system composed of 8000 processing elements

can therefore simulate a large traffic network.
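The FPGA counts above tally as follows; the exact Channel Control reservation is approximate in the text, hence the 30 to 34 range:

```python
# FPGA budget for one processing element, using the counts quoted above.
event_generator_and_queues = 6      # Event Generator plus Arrival/Service Queues
scheduler = 8 * 3                   # eight Scheduler sub-components, 3 FPGAs each
base_total = event_generator_and_queues + scheduler   # 30 before Channel Control
# Channel Control reserves the remaining devices, for roughly 30-34 in total.
```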

Although very implementation-dependent, the Scheduler can be further accelerated by speeding up division in reconfigurable logic, which currently causes the largest pipeline stage delays. Although Splash originally intended to

pair floating-point chips along with each FPGA [64], the chip I/O pin constraints in this

existing design are already significant. Altera is planning to introduce a new FPGA containing a StrongARM processor core. However, the ARM integer core does not contain

support for floating-point data types [59]. Perhaps if the reconfigurable logic had some

functional units embedded within the chip, faster designs would be implementable. An

approach similar to the Digital Signal Processors which often include special functional

units may be worthwhile.

One interesting facet gleaned from the FPGA research is that FPGA implementation methods and user designs directly impact the resulting design clock speeds. It is


Fig. 9.1. Speedup Results by Section The speedup results obtained from the section-by-section simulation analysis are illustrated. A speedup of 150 is obtained when comparing event generation software to its reconfigurable hardware counterpart. Similarly, a minimum speedup of 125 was determined for the Event Queue and a speedup of 91 for the Scheduler. So the overall speedup determined for the system would be approximately 91, the minimum of the sub-components. This result compares reasonably with the speedup of 100 reported for deterministic logic simulators by Bauer [14].


difficult for the hardware compilers to fully optimize designs. For instance, two methods of allowing a 16-bit D flip-flop register to hold its value can be compared. One method routes the output back through an input multiplexor, so that the output is re-inserted on the next clock edge. The second method simply disables the register bits, which then retain their value. The first method requires 16 lines to be routed efficiently within the FPGA. The second requires one signal to be chained to each element of the register bits. The second method was much more efficient, producing better timing results. At this time, the compiler does not seem capable of detecting and effecting the faster design automatically.

The proposed architecture is an innovative and unique contribution to the field

of non-deterministic parallel discrete event simulation architecture. Accompanied by

the literature survey, the mathematical analysis, the FPGA research, and the software

studies, the architecture presented in this thesis represents a comprehensive coverage of

the problem. The work clearly shows that a well designed parallel discrete event simulator

can provide much needed results rapidly. Specifically, the simulator can provide timely

results which are required by road traffic management personnel handling a network

under stress.


References

[1] Miron Abramovici, Ytzhak H. Levendel, and Premachandran R. Menon. A logical

simulation machine. IEEE Transactions on Computer-Aided Design of Integrated

Circuits and Systems, CAD-2(2):82–94, April 1983.

[2] Miron Abramovici and Prem Menon. Fault simulation on reconfigurable logic.

IEEE Symposium on FPGAs for Custom Computing Machines, pages 182–190,

April 1997.

[3] P. Agrawal, W.J. Dally, W.C. Fischer, et al. Mars: A multiprocessor-based

programmable accelerator. IEEE Design & Test of Computers, pages 28–35, Oc-

tober 1987.

[4] Altera Corporation. Altera Data Book, 1996.

[5] Altera Corporation. Altera Data Book, 1998.

[6] American Association of State Highway and Transportation Officials, 444 North

Capitol Street,NW, Washington,DC 20001. A Policy on Geometric Design of High-

ways and Streets, 1995. ISBN: 1-56051-068-4.

[7] Jeffery M. Arnold. The splash 2 software environment. The Journal of Supercom-

puting, 9:277–290, 1995.


[8] Jeffery M. Arnold, Duncan A. Buell, and Elaine G. Davis. Splash 2. SPAA '92. 4th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 316–322,

June,July 1992.

[9] William Aspray and Arthur Burks, editors. Papers of John Von Neumann on

Computing and Computer Theory, volume 12 of The Charles Babbage Institute

Reprint Series of the History of Computing. The MIT Press and Tomash Publish-

ers, Cambridge, MA, 1987. QA76.5.P3145 1987.

[10] Prithviraj Banerjee. Parallel Algorithms for VLSI Computer-Aided Design. PTR

Prentice Hall, Englewood Cliffs, NJ 07632, 1994.

[11] Jerry Banks, John S. Carson II, and Barry L. Nelson. Discrete-Event System

Simulation. International Series in Industrial and Systems Engineering. Prentice

Hall, Upper Saddle River, New Jersey 07458, second edition, 1996.

[12] Robert J. Baron and Lee Higbie. Computer Architecture. Addison Wesley in

Electrical and Computer Engineering. Addison Wesley, 1 edition, 1992.

[13] R. Barto and S. A. Szygenda. A computer architecture for digital logic simulation.

Electronic Engineering, 52(642):35–66, September 1985.

[14] Jerry Bauer, Michael Bershteyn, Ian Kaplan, and Paul Vyedin. A reconfigurable

logic machine for fast event-driven simulation. In Proceedings of the 1998 35th

Design Automation Conference, pages 668–671. IEEE, 1998.


[15] C. Beaumont, P. Boronat, and J. Champeau et al. Reconfigurable technology: An

innovative solution for parallel discrete event simulation support. In 8th Work-

shop on Parallel and Distributed Simulation (PADS ’94). Proceedings of the 1994

Workshop on Parallel and Distributed Simulation, pages 160–163, Edinburgh, UK,

July 1994. IEEE, SCS, San Diego, CA, USA.

[16] Christophe Beaumont, J. Champeau, J.-M. Filloque, and B. Pottier. On fpgas as

a new hardware support for parallel discrete event simulation. moscou94.ps from

ftp://ubolib.univ-brest.fr/pub/reports/, October 1994.

[17] John Bergen. Personal communication, February 2001. ICube, Inc.

[18] Dimitri Bertsekas and Robert Gallager. Data Networks. Prentice Hall, Inc., En-

glewood Cliffs, New Jersey 07632, second edition, 1992.

[19] William H. Beyer, editor. CRC Standard Mathematical Tables. CRC Press, Inc.,

Boca Raton, FL, 25 edition, 1978.

[20] Tom Blank. A survey of hardware accelerators used in computer-aided design.

IEEE Design & Test, pages 21–39, August 1984.

[21] James Brink and Richard Spillman. Computer Architecture and VAX Assem-

bly Language Programming. The Benjamin/Cummings Publishing Company, Inc.,

Menlo Park, CA, 1987.


[22] Stephen D. Brown, Robert Francis, Jonathan Rose, and Zvonko Vranesic. Field-

programmable gate arrays. The Kluwer International Series in Engineering and

Computer Science. Kluwer Academic Publishers, 1992.

[23] Marc Bumble and Lee Coraor. Introducing parallelism to event-driven simulation.

In Proceedings of the IASTED International Conference–Applied Simulation and

Modelling, ASM ’97, Banff, Canada, July 27-August 1, 1997. The International

Association of Science and Technology for Development, August 1997.

[24] Marc Bumble and Lee Coraor. Architecture for a non-deterministic simulation

machine. In 1998 Winter Simulation Conference Proceedings, volume 2, pages

1599–1606, December 1998.

[25] Marc Bumble and Lee Coraor. Implementing parallelism in random discrete event-

driven simulation. In Lecture Notes in Computer Science 1388, Parallel and Dis-

tributed Processing, pages 418–427. IEEE Computer Society, Springer, March 1998.

[26] Marc Bumble and Lee Coraor. A global synchronization network for a non-

deterministic simulation architecture. In 1999 Winter Simulation Conference Pro-

ceedings, December 1999.

[27] Marc Bumble, Lee Coraor, and Lily Elefteriadou. Exploring corsim runtime char-

acteristics: Profiling a traffic simulator. 33rd Annual Simulation Symposium 2000

(SS 2000), pages 139–146, April 2000.


[28] Ted Burggraff, Al Love, Richard Malm, and Ann Rudy. The ibm los gatos logic

simulation machine hardware. IEEE International Conference on Computer De-

sign: VLSI in Computers, pages 584–587, 1983.

[29] Calvin A. Buzzell and Michael J. Robb. Modular VME rollback hardware for

time warp. Simulation Series, 22(1):153–156, January 1990. Monthly Publication

Number: 0114438.

[30] Pak K. Chan and Samiha Mourad. Digital Design Using Field Programmable Gate

Arrays. PTR Prentice Hall, Englewood Cliffs, New Jersey 07632, first edition,

1994.

[31] K. M. Chandy and J. Misra. Asynchronous distributed simulation via a sequence of

parallel computations. Communications of the ACM, 24(4):198–206, April 1981.

[32] C.S. Chang, S.L. Ho, T. T. Chan, and K. K. Lee. Fast ac train emergency reschedul-

ing using an event driven approach. IEE Proceedings-B, 144(4):281–288, July 1993.

[33] Gang-Len Chang, Jifeng Wu, and Henry Lieu. Real-time incident responsive sys-

tem for corridor control: modeling framework and preliminary results. In Transportation Research Record, volume 1452, pages 42–51, December 1994.

[34] Barbara A. Chappell, Terry I. Chappell, Stanley E. Schuster, Herman M. Seg-

muller, James W. Allan, Robert L. Franch, and Phillip J. Restle. Fast cmos ecl

receivers with 100-mv worst-case sensitivity. IEEE Journal of Solid-State Circuits,

23(1):59–67, February 1988.


[35] A. T. Chronopoulos. Traffic flow simulation through high order traffic modeling.

Mathematical Computing Modeling, 17(8):11–22, 1993.

[36] Anthony Theodore Chronopoulos and Charles Michael Johnson. A real-time traffic

simulation system. IEEE Transactions on Vehicular Technology, 47(1):321–331,

February 1998.

[37] Jim Clark and Gene Daigle. The importance of simulation techniques in ITS re-

search and analysis. In S. Andradottir, K. J. Healy, D. H. Withers, and B. L.

Nelson, editors, Winter Simulation Conference Proceedings, pages 1236–1243, Pis-

cataway, NJ, USA, 1997. IEEE.

[38] Alan Clements. Microprocessor Systems Design: 68000 Hardware, Software, and

Interface. PWS Publishing Company, Boston, MA, third edition, 1997.

[39] John Craig Comfort. The simulation of a master-slave event set processor. Simu-

lation, pages 117–124, March 1984.

[40] Gareth Cook. Scientists dissect dynamics of panic. The Boston Globe, page A24,

September 28 2000.

[41] Gene Daigle, Michelle Thomas, and Meenakshy Vasudevan. Field applications of

corsim: I-40 freeway design evaluation. In D. J. Medeiros, E. F. Watson, J. S. Car-

son, and M. S. Manivannan, editors, Winter Simulation Conference Proceedings,

volume 2, pages 1161–1167, Piscataway, NJ, USA, 1998. IEEE.


[42] Frederica Darema and Gregory F. Pfister. Multipurpose parallelism for vlsi cad on

the rp3. IEEE Design & Test of Computers, pages 19–27, October 1987.

[43] Samir R. Das and Richard M. Fujimoto. An empirical evaluation of performance-

memory trade-offs in time warp. IEEE Transactions on Parallel and Distributed

Systems, 8(2):210–224, February 1997.

[44] Carolyn K. Davis, Sallie V. Sheppard, and William M. Lively. Automatic devel-

opment of parallel simulation models in ADA. Proceedings of the 1988 Winter

Simulation Conference, pages 339–343, 1988.

[45] Andre DeHon. DPGA utilization and application. Proceedings of the 1996 Inter-

national Symposium on Field Programmable Gate Arrays, February 1996.

[46] Andre DeHon. Dynamically programmable gate arrays: A step toward increased

computational density. Proceedings of the Fourth Canadian Workshop on Field-

Programmable Devices, pages 47–54, May 1996.

[47] Andre DeHon. Reconfigurable architectures for general-purpose computing. A.I.

Technical Report 1586, Massachusetts Institute of Technology, Artificial Intelligence Laboratory, Cambridge, MA, October 1996.

[48] Jay L. Devore. Probability and statistics for engineering and the sciences. Duxbury

Press, fourth edition, 1995.


[49] Philippe Dhaussy, Jean-Marie Filloque, Bernard Pottier, and Stephane Rubini.

Global control synthesis for an mimd/fpga machine. In Proceedings of the IEEE

Workshop on FPGAs for Custom Computing Machines, pages 72–81, IEEE, Los

Alamitos, CA USA, April 1994. IEEE Computer Society.

[50] J. Presper Eckert. Thoughts on the history of computing. Computer, pages 58–65,

December 1976.

[51] Bradly K. Fawcett. Taking advantage of reconfigurable logic. Seventh Annual IEEE

International ASIC Conference and Exhibit, pages 227–230, September 1994.

[52] Robert E. Felderman and Leonard Kleinrock. An upper bound on the improve-

ment of asynchronous versus synchronous distributed processing. Simulation Series

Proceedings of the SCS Multiconference on Distributed Simulation, 22(1):131–136,

January 1990.

[53] Peter Fishburn and Paul Wright. Bandwidth edge counts for linear arrangements

of rectangular grids. Journal of Graph Theory, 26(4):195–202, 1997.

[54] Richard M. Fujimoto. Performance measurements of distributed simulation strate-

gies. Transactions of the Society for Computer Simulation, 6(2):89–132, April 1989.

[55] Richard M. Fujimoto. Parallel discrete event simulation. In Communications of

the ACM, volume 33 no. 10, pages 30–53. ACM, October 1990.

[56] Richard M. Fujimoto. Parallel and distributed simulation. Proceedings of the 1995

Winter Simulation Conference, pages 118–125, 1995.


[57] Richard M. Fujimoto. Parallel and distributed simulation. Proceedings of the 1999

Winter Simulation Conference, pages 122–131, 1999.

[58] Richard M. Fujimoto, Jya-Jang Tsai, and Ganesh C. Gopalakrishnan. Design and

evaluation of the rollback chip: Special purpose hardware for time warp. IEEE

Transactions on Computers, 41(1):68–82, January 1992.

[59] Steve Furber. ARM System Architecture. Addison-Wesley, Essex, England, 1st

edition, 1996.

[60] Nicolas J. Garber and Lester A. Hoel. Traffic and Highway Engineering. PWS

Publishing Company, 2 edition, 1997. ISBN 0-534-95338-7.

[61] Demos C. Gazis, Robert Herman, and Renfrey B. Potts. Car-following theory of

steady-state traffic flow. Operations Research, 7(4):499–505, 1959.

[62] Mohammed S. Ghausi. Electronic Devices and Circuits: Discrete and Integrated.

HRW Series in Electrical and Computer Engineering. Holt, Rinehart and Winston,

1985.

[63] Loys Gindraux and Gary Catlin. CAE station’s simulators tackle 1 million gates.

Electronic Design, pages 127–136, November 10 1983.

[64] Maya Gokhale, William Holmes, Andrew Kopser, Sara Lucas, Ronald Minnich,

Douglas Sweely, and Daniel Lopresti. Building and using a highly parallel pro-

grammable logic array. Computer, 24(1):81–89, January 1991.


[65] Jim Gray. International Parallel Processing Symposium keynote address, April

1998.

[66] Harold Greenberg. An analysis of traffic flow. Operations Research, 7:79–85, 1959.

[67] B. D. Greenshields. A study in highway capacity. Highway Research Board Pro-

ceedings, 14:468, 1934.

[68] Leo J. Guibas and Frank M. Liang. Systolic stacks, queues, and counters. In 1982

Conference on Advanced Research in VLSI, M.I.T., pages 155–164, January 1982.

[69] J. D. Hadley and B. L. Hutchings. Design methodologies for partially reconfigured

systems. IEEE Symposium on FPGAs for Custom Computing Machines, Proceed-

ings 1995, pages 78–84, 1995.

[70] Reiner W. Hartenstein, Jurgen Becker, Rainer Kress, and Helmut Reinig. High-

performance computing using a reconfigurable accelerator. Concurrency: Practice

and Experience, 8(6):429–443, July-August 1996.

[71] John Patrick Hayes. Computer architecture and organization. McGraw-Hill series

in computer organization and architecture. McGraw-Hill, 2nd edition, 1988.

[72] W. R. Heller, C. George Hsi, and Wadi F. Mikhaill. Wirability - designing wiring

space for chips and chip packages. IEEE Design and Test of Computers, pages

43–51, August 1984.

[73] John L. Hennessy and David A. Patterson. Computer Architecture A Quantitative

Approach. Morgan Kaufmann Publishers, Inc., first edition, 1990.


[74] John L. Hennessy and David A. Patterson. Computer Architecture A Quantitative

Approach. Morgan Kaufmann Publishers, Inc., second edition, 1996.

[75] M.P. Henry. Keynote paper: Hardware compilation - a new technique for rapid

prototyping of digital systems - applied to sensor validation. Control Engineering

Practice, 3(7):907–924, 1995.

[76] A. Hoogland, J. Spaa, B. Selman, and A. Compagner. A special-purpose processor

for the Monte Carlo simulation of Ising spin systems. Journal of Computational

Physics, 51:250–260, 1983.

[77] R. Michael Hord. Parallel Supercomputing in MIMD Architectures. CRC Press,

Inc., Boca Raton, Florida, 1993.

[78] John K. Howard, Richard L. Malm, and Larry M. Warren. Introduction to the IBM

Los Gatos logic simulation machine. Proceedings - IEEE International Conference

on Computer Design: VLSI in Computers, pages 580–583, 1983.

[79] Kai Hwang and Faye A. Briggs. Computer Architecture and Parallel Process-

ing. McGraw-Hill Series in Computer Organization and Architecture. McGraw-Hill

Book Company, 1984.

[80] David R. Jefferson. Virtual time. ACM Transactions on Programming Languages

and Systems, 7(3):404–425, July 1985.


[81] Charles Michael Johnson and Anthony Theodore Chronopoulos. A communica-

tions latency hiding parallelization of a traffic flow simulation. In 13th International

Parallel Processing Symposium & 10th Symposium on Parallel and Distributed Pro-

cessing, pages 688–695. IEEE, April 1999.

[82] Adolf D. May Jr. and Hartmut E. M. Keller. Non-integer car-following models.

Highway Research Board, 199:19–32, 1967.

[83] T. Junchaya and G. Chang. Exploring real-time traffic simulation with massively

parallel computing architecture. Transportation Research Committee, 1(1):57–

76, 1993.

[84] Tom Kean and John Gray. Configurable hardware: Two case studies of micro-grain

computation. Journal of VLSI Signal Processing, 2:9–16, 1990.

[85] Thomas F. Knight and Alexander Krymm. A self-terminating low-voltage swing

CMOS output driver. The IEEE Journal of Solid-State Circuits, 23(2):457–463,

April 1988.

[86] Donald E. Knuth. The art of computer programming. Addison-Wesley, 1968.

[87] Donald E. Knuth. The art of computer programming, volume 2. Addison-Wesley,

3rd edition, 1998.

[88] Jack Kohn, Richard Malm, Chuck Meiley, and Frank Nemec. The IBM Los Gatos

logic simulation software. IEEE International Conference on Computer Design:

VLSI in Computers, pages 588–591, 1983.


[89] Nobuhiko Koike, Kenji Ohmori, and Tohru Sasaki. HAL: A high-speed logic sim-

ulation machine. IEEE Design & Test of Computers, 2(5):61–73, October 1985.

[90] Israel Koren. Computer arithmetic algorithms. Prentice Hall, Englewood Cliffs,

N.J., 1993.

[91] H. T. Kung. Why systolic architectures? IEEE Computer, 15(1):37–46, January

1982.

[92] H. T. Kung. Systolic communications. International Conference on Systolic Arrays,

pages 695–703, May 1988.

[93] Bernard S. Landman and Roy L. Russo. On pin versus block relationship for

partitions of logic circuits. IEEE Transactions on Computers, C-20(12):1469–1479,

December 1971.

[94] Richard J. Larsen and Morris L. Marx. An Introduction to Mathematical Statistics

and its Applications. Prentice-Hall, Englewood Cliffs, NJ 07632, second edition,

1986.

[95] Doug Lea. Some storage management techniques for container classes, 1989.

[96] Ulana Legedza and William E. Weihl. Reducing synchronization overhead in par-

allel simulation. In 10th Workshop on Parallel and Distributed Simulation (PADS

’96). Proceedings of the 1996 Workshop on Parallel and Distributed Simulation,

pages 86–95, Philadelphia, PA, May 1996. IEEE, SCS, San Diego, CA, USA.


[97] F. Thomson Leighton. Introduction to Parallel Algorithms and Architectures: Ar-

rays, Trees, Hypercubes. Morgan Kaufmann Publishers, San Mateo, CA, 1992.

[98] Y. H. Levendel, P. R. Menon, and S. H. Patel. Special-purpose computer for

logic simulation using distributed processing. The Bell System Technical Journal,

61(10):2873–2909, December 1982.

[99] M. J. Lighthill and G. B. Whitham. On kinematic waves, a theory of traffic flow on

long crowded roads. Proceedings of the Royal Society, A 229(1178):317–345,

1955.

[100] M. Morris Mano. Computer System Architecture. Prentice Hall, Englewood Cliffs,

NJ, 3rd edition, 1993.

[101] John W. Mauchly. Amending the ENIAC story. Datamation, 25(11):217–218, Octo-

ber 1979.

[102] Adolf D. May. Traffic Flow Fundamentals. Prentice Hall, Englewood Cliffs, NJ

07632, 1990. ISBN 0-13-926072-2.

[103] John J. Metzner and B. N. Jamoussi. An easily programmable algorithm for window

flow control analysis. Proceedings of the 1992 Conference on Information Science

and Systems, pages 1041–1044, March 1992.

[104] Panos G. Michalopoulos, Ping Yi, and Anastasios S. Lyrintzis. Development of

an improved high-order continuum traffic flow model. Transportation Research

Record, 1365:125–132, 1993.


[105] Chris Miller. Comet crash: Teraflops computer simulates colossal comet impact

into ocean. Sandia National Laboratories - News Release WWW, April 1997.

[106] Sean Monaghan. A gate-level reconfigurable Monte Carlo processor. Journal of

VLSI Signal Processing, 6(2):139–153, August 1993.

[107] Sean Monaghan and P.D. Noakes. Reconfigurable special purpose hardware for

scientific computation and simulation. Computing & Control Engineering Journal,

page 225, September 1992.

[108] Motorola. MECL Device Data, 1989.

[109] Motorola. Motorola Military ALS/FAST/TTL Data, 1989. Q3/89 DL142.

[110] Jeffrey D. Myjak. A massively parallel microscopic traffic simulation model with

fuzzy logic. Master’s thesis, Massachusetts Institute of Technology, September

1993.

[111] William R. Newcott. The age of comets. National Geographic, 192(6):94–109,

December 1997.

[112] David M. Nicol. Principles of conservative parallel simulation. In J. M. Charnes,

D. J. Morrice, D. T. Brunner, and J. J. Swain, editors, Proceedings of the 1996

Winter Simulation Conference, pages 128–135, 1996.

[113] Bill Nitzberg and Samuel A. Fineberg. Parallel I/O on highly parallel systems

supercomputing ’94 – tutorial m11 notes. Technical Report NAS-94-005, NASA

Ames Research Center, Moffett Field, CA 94035-1000, November 1994.


[114] John V. Oldfield and Richard C. Dorf. Field Programmable Gate Arrays: Recon-

figurable Logic for Rapid Prototyping and Implementation of Digital Systems. John

Wiley & Sons, Inc., 1995.

[115] Athanasios Papoulis. Probability, Random Variables, and Stochastic Processes.

McGraw-Hill Series in Electrical Engineering. McGraw-Hill Publishing Company,

second edition, 1984.

[116] Paul F. Reynolds, Jr., Carmen M. Pancerella, and Sudhir Srinivasan. Design and

performance analysis of hardware support for parallel simulations. Journal of

Parallel and Distributed Computing, 18(4):435–453, August 1993.

[117] Paul F. Reynolds, Jr., Craig Williams, and R. R. Wagner, Jr. Isotach networks.

IEEE Transactions on Parallel and Distributed Systems, 1997.

[118] Robert B. Pearson, John L. Richardson, and Doug Toussaint. A fast processor for

Monte Carlo simulation. Journal of Computational Physics, 51:241–249, 1983.

[119] Gregory F. Pfister. The IBM Yorktown simulation engine. Proceedings of the IEEE,

74(6):850–860, June 1986.

[120] Neil S. Pickles and Martin C. Lefebvre. ECL I/O buffers for BiCMOS integrated

systems: A tutorial overview. IEEE Transactions on Education, 40(4):229–241,

November 1997.

[121] James L. Pline, editor. Traffic Engineering Handbook. Prentice Hall, Englewood

Cliffs, NJ 07632, 4th edition, 1992.


[122] Eric S. Raymond. The cathedral and the bazaar. Presented at the 1997 Linux-

Kongress, the Atlanta Linux Showcase, 1997.

[123] Daniel A. Reed, Allen D. Malony, and Bradley D. McCredie. Parallel discrete event

simulation: A shared memory approach. Proceedings of the 1987 ACM SIGMETRICS

Conference on Measurement and Modeling of Computer Systems, 15(1):36–38, May 1987.

[124] Ronni Sandroff. New jump start for hearts? Consumer Reports, 66(2):8, February

2001.

[125] Tohru Sasaki, Nobuhiko Koike, Kenji Ohmori, and Kyoji Tomita. HAL: A block

level hardware logic simulator. Proceedings - ACM IEEE 20th Design Automation

Conference, pages 150–156, 1983.

[126] Richard L. Scheaffer. Introduction to Probability and its Applications. The Duxbury

Advanced Series in Statistics and Decision Sciences. PWS-KENT Publishing Com-

pany, Boston, USA, 1990.

[127] Bruce Schecter. Putting a darwinian spin on the diesel engine. The New York

Times, page D3, September 19 2000.

[128] Donald L. Schilling and Charles Belove. Electronic Circuits Discrete and Integrated.

Series in Electrical Engineering. McGraw-Hill, second edition, 1979.

[129] Carla Sciullo. Department of Statistics - Project ID: 98-1-008. Consultation,

February 1998.


[130] Larry Soule and Tom Blank. Statistics for parallelism and abstraction level in

digital simulation. Design Automation Conference - Proceedings 1987, pages 588–

591, 1987.

[131] Daniel L. Stein. Spin glasses. Scientific American, 261:52–59, July 1989.

[132] Nancy Fortgang Stern. From ENIAC to UNIVAC: A case study in the history of

technology. PhD thesis, State University of New York at Stony Brook, August

1978.

[133] Harold S. Stone, editor. Introduction to Computer Architecture. SRA computer

science series. Science Research Associates, Inc., 2nd edition, 1980.

[134] Bjarne Stroustrup. The C++ Programming Language. Addison-Wesley, 3rd edi-

tion, 1997.

[135] Shigeru Takasaki, Nobuyoshi Nomizu, Yoshihiro Hirabayashi, Hiroshi Ishikura,

Masahiro Kurashita, Nobuhiko Koike, and Toshiyuki Nakata. HAL III: Function

level hardware logic simulation system. Proceedings of the 1990 IEEE International

Conference on Computer Design: VLSI in Computers and Processors - ICCD ’90,

pages 167–170, September 1990.

[136] Technical Education - Corporate Management Development, Crotonville, NY. Math-

ematical Analysis, second edition, August 1988. Chapter 6 - Queueing Theory.


[137] Masahiro Tomita, Naoaki Suganuma, and Kotaro Hirano. Reconfigurable machine

and its application to logic simulation. IEICE Transactions on Fundamentals of

Electronics Communications and Computer Science, E76-A(10):1705–1712, Octo-

ber 1993.

[138] A. W. VanAusdal. Use of the boeing computer simulator for logic design con-

firmation and failure diagnostics programs. Proceedings of the Advances in the

Astronautical Sciences 17th Annual Meeting, 29:573–594, June 1971.

[139] George Varghese, Roger Chamberlain, and William E. Weihl. The pessimism be-

hind optimistic simulation. In 8th Workshop on Parallel and Distributed Simulation

(PADS ’94). Proceedings of the 1994 Workshop on Parallel and Distributed Sim-

ulation, pages 126–131, Edinburgh, UK, July 1994. IEEE, SCS, San Diego, CA,

USA.

[140] George Varghese, Roger Chamberlain, and William E. Weihl. Deriving global

virtual time algorithms from conservative simulation protocols. Information Pro-

cessing Letters, 54(2):121–126, April 1995.

[141] Jean Walrand. Communication Networks: A First Course. Aksen Associates, Inc.,

1991.

[142] Kevin Watkins. Discrete Event Simulation in C. The McGraw-Hill International

Series in Software Engineering. McGraw-Hill Book Company, 1993.

[143] C. Craig Williams and Paul F. Reynolds, Jr. Combining atomic actions. Journal

of Parallel and Distributed Systems, pages 152–163, 1995.


[144] Michael J. Wirthlin and Brad L. Hutchings. A dynamic instruction set computer.

IEEE Workshop on FPGAs for Custom Computing Machines, Napa, CA, pages

1–9, April 1995.

[145] Michael J. Wirthlin, Brad L. Hutchings, and Kent L. Gilson. The nano processor:

A low resource reconfigurable processor. IEEE Workshop on FPGAs for Custom

Computing Machines, Napa, CA, pages 23–30, April 1994.

[146] Qi Yang. A microscopic traffic simulation model for IVHS applications. Master’s

thesis, Massachusetts Institute of Technology, Department of Civil and Environ-

mental Engineering, August 1993.

[147] Albert Y. Zomaya. Parallel and distributed computing handbook. Computer engi-

neering series. McGraw-Hill, New York, 1996.


Vita

Marc Bumble is a PhD candidate in the Computer Science and Engineering department

at the Pennsylvania State University in University Park, PA. He received his B.S. and

M.S. degrees in Electrical Engineering from the University of Pennsylvania in Philadel-

phia. There, he wrote his Master’s thesis in Telecommunications on a routing algorithm

for a hypothetical satellite network based on the Iridium cellular network. His current re-

search investigates architectures for accelerating non-deterministic simulation, including

the application of reconfigurable logic.