The Pennsylvania State University
The Graduate School
Department of Computer Science and Engineering
A PARALLEL ARCHITECTURE FOR NON-DETERMINISTIC
DISCRETE EVENT SIMULATION
A Thesis in
Computer Science and Engineering
by
Marc D. Bumble
© 2001 Marc D. Bumble
Submitted in Partial Fulfillment of the Requirements
for the Degree of
Doctor of Philosophy
May 2001
We approve the thesis of Marc D. Bumble.
Date of Signature
Lee D. Coraor
Associate Professor of Computer Science and Engineering
Thesis Adviser
Chair of Committee

Mary Jane Irwin
Professor of Computer Science and Engineering

John J. Metzner
Professor of Computer Science and Engineering

Ageliki Elefteriadou
Associate Professor of Civil Engineering

Dale A. Miller
Professor of Computer Science and Engineering
Head of the Department of Computer Science and Engineering
Abstract
An architecture for a non-deterministic simulation machine is described and pre-
sented for the purposes of accelerating the simulation of road traffic. The thesis includes
a survey of related work and a description of general architectural methods applied to
accelerate non-deterministic parallel event simulation. A study of the traffic simulator,
CORSIM, was undertaken to identify software simulation bottlenecks. Mathematical
analysis is used to assist in the decision between running a simulation in an event or
time-driven mode. Finally, the details of the simulator architecture are presented. The
architecture is divided into event generation, the event queue, the scheduler, and the uni-
fying communications network. The slowest subcomponent is shown to be accelerated
with a speedup of 91.
Table of Contents
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
Chapter 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 The Importance of Non-deterministic Simulation . . . . . . . . . . . 4
1.2 Simulation Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Opportunities for Acceleration in Discrete Simulation . . . . . . . . . 8
1.3.1 Difficulties Faced by Parallel Discrete Event-Driven Simulations 11
1.4 Simulation’s Niche . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.5 Microscopic & Macroscopic . . . . . . . . . . . . . . . . . . . . . . . 15
Chapter 2. Traffic Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.0.1 Reuschel & Pipes . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.0.2 General Motors’ Car-following Model . . . . . . . . . . . . . . 23
2.0.3 Vehicle Deceleration . . . . . . . . . . . . . . . . . . . . . . . 26
2.0.4 Macroscopic Models . . . . . . . . . . . . . . . . . . . . . . . 27
2.0.4.1 Greenshields . . . . . . . . . . . . . . . . . . . . . . 29
2.0.4.2 Greenberg . . . . . . . . . . . . . . . . . . . . . . . 31
Chapter 3. Previous Work Related to Simulation Architectures . . . . . . . . . . 33
3.1 Logic Simulation Machines . . . . . . . . . . . . . . . . . . . . . . . . 34
3.1.1 Boeing Computer Simulator . . . . . . . . . . . . . . . . . . . 35
3.1.2 The IBM Los Gatos Logic Simulation Machine . . . . . . . . 40
3.1.3 Barto and Szygenda’s Hardware Simulator . . . . . . . . . . . 45
3.1.4 Abramovici’s Logic Simulation Machine . . . . . . . . . . . . 50
3.1.5 Levendel, Menon, and Patel’s Logic Simulator . . . . . . . . . 54
3.1.6 Megalogician . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.1.7 The IBM Yorktown Simulation Engine . . . . . . . . . . . . . 70
3.1.8 HAL: A Block Level Logic Simulator . . . . . . . . . . . . . . 77
3.1.9 MARS: Micro-Programmable Accelerator for Rapid Simulation 87
3.1.10 Reconfigurable Machine . . . . . . . . . . . . . . . . . . . . . 93
3.1.11 Bauer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
3.2 Accelerator & General Purpose Machine . . . . . . . . . . . . . . . . 100
3.2.1 Splash . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
3.2.2 The ArMen . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
3.3 Optimistic Processing . . . . . . . . . . . . . . . . . . . . . . . . . . 112
3.4 Non-Deterministic Simulation . . . . . . . . . . . . . . . . . . . . . . 114
3.4.1 Hoogland, Spaa, Selman, and Compagner . . . . . . . . . . . 117
3.4.2 Monaghan & Pearson, Richardson, and Toussant . . . . . . . 117
3.5 Reduction Buses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
3.5.1 Parallel Reduction Network . . . . . . . . . . . . . . . . . . . 119
Chapter 4. Software Traffic Simulation . . . . . . . . . . . . . . . . . . . . . . . 123
4.1 Event Generation & Queue . . . . . . . . . . . . . . . . . . . . . . . 124
4.1.1 Event Generation Software . . . . . . . . . . . . . . . . . . . 124
4.1.2 Event Queue Software . . . . . . . . . . . . . . . . . . . . . . 127
4.2 CORSIM: An Established Software Simulator . . . . . . . . . . . . . 130
4.2.1 CORSIM Function Categories . . . . . . . . . . . . . . . . . . 131
4.2.2 NT versus Linux . . . . . . . . . . . . . . . . . . . . . . . . . 131
4.2.3 CORSIM Profile . . . . . . . . . . . . . . . . . . . . . . . . . 133
4.3 Trafix: A Road Traffic Simulator . . . . . . . . . . . . . . . . . . . . 138
4.3.1 A Shared, Pooled Allocator . . . . . . . . . . . . . . . . . . . 144
Chapter 5. Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
5.1 Event versus Time-Driven Simulation . . . . . . . . . . . . . . . . . 148
5.1.1 Expected Advantage of Event vs Time-Driven Simulation . . 148
5.1.2 Decision between Event vs Time-Driven Modes . . . . . . . . 149
5.1.3 Exponentially Distributed Example . . . . . . . . . . . . . . . 152
5.1.4 Weibull Distribution Example . . . . . . . . . . . . . . . . . . 155
5.2 Topology: Traffic Map Layout . . . . . . . . . . . . . . . . . . . . . . 158
Chapter 6. Design Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
6.1 Reconfigurable Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
6.2 Systolic Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
6.3 Content Addressable Memory . . . . . . . . . . . . . . . . . . . . . . 173
6.4 Reduction Bus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
Chapter 7. Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
7.1 Distributed Multiprocessors . . . . . . . . . . . . . . . . . . . . . . . 180
7.2 Processing Elements . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
7.2.1 Event Generation . . . . . . . . . . . . . . . . . . . . . . . . . 185
7.2.1.1 Event Generator Results . . . . . . . . . . . . . . . 190
7.2.2 Event Queue . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
7.2.2.1 The Service Event Sorter . . . . . . . . . . . . . . . 191
7.2.2.2 The Linear Array . . . . . . . . . . . . . . . . . . . 195
7.2.2.3 The Queue Model Results . . . . . . . . . . . . . . . 198
7.2.3 Scheduler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
7.2.3.1 Vehicle Data . . . . . . . . . . . . . . . . . . . . . . 203
7.2.3.2 Vehicle Initialization . . . . . . . . . . . . . . . . . . 207
7.2.3.3 Road Movement . . . . . . . . . . . . . . . . . . . . 207
7.2.3.4 Intersection Movement . . . . . . . . . . . . . . . . 213
7.2.3.5 Scheduler Results . . . . . . . . . . . . . . . . . . . 214
7.3 Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
7.3.1 Communications Architectures . . . . . . . . . . . . . . . . . 221
7.3.2 Parallel Bus Architecture . . . . . . . . . . . . . . . . . . . . 226
7.3.3 Search Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 227
7.3.4 Phase 1 Elimination . . . . . . . . . . . . . . . . . . . . . . . 227
7.3.5 Phase 2 Selection . . . . . . . . . . . . . . . . . . . . . . . . . 228
7.3.6 Cross-Point Matrix . . . . . . . . . . . . . . . . . . . . . . . . 231
7.3.7 Network Results . . . . . . . . . . . . . . . . . . . . . . . . . 235
Chapter 8. Optimistic Synchronization . . . . . . . . . . . . . . . . . . . . . . . 242
Chapter 9. Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
List of Tables
2.1 Car-Following Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2 Vehicle Deceleration Notation . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3 Harmonic Mean Speed Notation . . . . . . . . . . . . . . . . . . . . . . 28
2.4 Notation for Greenshields Equations . . . . . . . . . . . . . . . . . . . . 29
3.1 Barto’s Simulator Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 47
3.2 Synchronous Discrete Event-Driven Simulation Algorithm . . . . . . . . 107
4.1 Event Generation Code I . . . . . . . . . . . . . . . . . . . . . . . . . . 126
4.2 Event Generation Code II . . . . . . . . . . . . . . . . . . . . . . . . . . 128
4.3 Event Queue Loop Code . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
4.4 CORSIM Function Classifications . . . . . . . . . . . . . . . . . . . . . . 132
4.5 CORSIM Runtime Under Linux and NT . . . . . . . . . . . . . . . . . . 135
4.6 Scheduler Software Function Profile . . . . . . . . . . . . . . . . . . . . 144
7.1 Event Generator and Event Queue FPGA Implementation . . . . . . . . 201
7.2 Vehicle Data Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
7.3 Acceleration Decisions for a Road . . . . . . . . . . . . . . . . . . . . . . 212
7.4 Acceleration Decisions for an Intersection . . . . . . . . . . . . . . . . . 216
7.5 Scheduler Chip Implementation . . . . . . . . . . . . . . . . . . . . . . . 218
List of Figures
1.1 Simulator Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1 Time Headway . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2 Car-Following Notation Figure . . . . . . . . . . . . . . . . . . . . . . . 20
2.3 Car-Following Acceleration . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.1 The Boeing Simulator Architecture . . . . . . . . . . . . . . . . . . . . . 36
3.2 Simulation Classifications . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3 The Boeing Simulator Logic Processor . . . . . . . . . . . . . . . . . . . 39
3.4 Los Gatos Logic Simulation Machine Architecture . . . . . . . . . . . . 42
3.5 The IBM Los Gatos Logic Simulation Machine . . . . . . . . . . . . . . 44
3.6 Barto’s Logic Simulator Architecture . . . . . . . . . . . . . . . . . . . . 49
3.7 Logic Simulation Machine . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.8 Levendel’s Logic Simulator Architecture . . . . . . . . . . . . . . . . . . 55
3.9 Mapping Circuit Blocks ai and aj to Processors pi and pj . . . . . . . 57
3.10 Interface Between the Data Sequencers and the Time-Shared Parallel Bus 58
3.11 The Controlling Processor Unit Configuration . . . . . . . . . . . . . . . 60
3.12 Subordinate Processor Unit Configuration . . . . . . . . . . . . . . . . . 61
3.13 Interface Between the Parallel Bus and the Cross-Point Matrix . . . . . 65
3.14 Interface Between the Data Sequencers and a Cross-Point Matrix . . . . 67
3.15 Megalogician Architecture . . . . . . . . . . . . . . . . . . . . . . . . 69
3.16 The YSE Logic Processor Configuration . . . . . . . . . . . . . . . . . . 72
3.17 A Switch Port “K” Example with its Logic Port Connection . . . . . . 74
3.18 The YSE Scheduling Problem . . . . . . . . . . . . . . . . . . . . . . . . 78
3.19 The HAL Level Ordering Method . . . . . . . . . . . . . . . . . . . . . . 82
3.20 The HAL Hardware Architecture . . . . . . . . . . . . . . . . . . . . . . 83
3.21 Internal mechanism of a Logic Processor . . . . . . . . . . . . . . . . . . 84
3.22 Global MARS Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 88
3.23 Internal Cluster Architecture . . . . . . . . . . . . . . . . . . . . . . . . 89
3.24 Architecture of the Processing Element . . . . . . . . . . . . . . . . . . . 91
3.25 MARS logic simulation pipeline . . . . . . . . . . . . . . . . . . . . . . . 93
3.26 The RM Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
3.27 The LSIM Fanout Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
3.28 The LSIM Evaluation phase . . . . . . . . . . . . . . . . . . . . . . . . . 97
3.29 Bauer’s Reconfigurable Logic Simulator . . . . . . . . . . . . . . . . . . 99
3.30 The Splash 2 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 102
3.31 The Splash 2 Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
3.32 The ArMen Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
3.33 Digit-Serial Implementation of the Global Minimum Computation and Broadcast 111
3.34 The Ising Spin Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
3.35 A General 2-Bit Feedback Shift Register . . . . . . . . . . . . . . . . . . 118
3.36 Random Number Generator . . . . . . . . . . . . . . . . . . . . . . . . . 118
3.37 Parallel Reduction Network . . . . . . . . . . . . . . . . . . . . . . . . . 121
3.38 PRN Arithmetic Logical Unit Node . . . . . . . . . . . . . . . . . . . . . 122
4.1 Simulation Timeline Generation . . . . . . . . . . . . . . . . . . . . . . . 126
4.2 Profile Chart of CORSIM on NT . . . . . . . . . . . . . . . . . . . . . . 136
4.3 Profile Chart of CORSIM on Linux . . . . . . . . . . . . . . . . . . . . . 137
4.4 The Trafix Software Structure . . . . . . . . . . . . . . . . . . . . . . . . 140
4.5 The Trafix Display . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
4.6 Trafix Input Environment . . . . . . . . . . . . . . . . . . . . . . . . . . 142
5.1 Wrapping a Traffic Map onto the Simulator . . . . . . . . . . . . . . . . 159
5.2 Different Lexicographical Map Layouts . . . . . . . . . . . . . . . . . . . 160
6.1 General Xilinx FPGA Architecture . . . . . . . . . . . . . . . . . . . . . 164
6.2 Xilinx Architecture Interconnects . . . . . . . . . . . . . . . . . . . . . . 165
6.3 The Xilinx XC4000 Configurable Logic Block . . . . . . . . . . . . . . . 166
6.4 Block Diagram of the Altera Flex 10K Architecture . . . . . . . . . . . . 168
6.5 Diagram of the Altera Embedded Array Block (EAB) . . . . . . . . . . 169
6.6 Diagram of the Altera Logic Element (LE) . . . . . . . . . . . . . . . . . 171
6.7 Associative Memory Block Diagram . . . . . . . . . . . . . . . . . . . . 175
6.8 An Associative Memory Cell . . . . . . . . . . . . . . . . . . . . . . . . 176
6.9 Associative Memory Match Logic . . . . . . . . . . . . . . . . . . . . . . 177
7.1 System User Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
7.2 Processor Elements Network . . . . . . . . . . . . . . . . . . . . . . . . . 183
7.3 Local Processing Element Design . . . . . . . . . . . . . . . . . . . . . . 186
7.4 The Event Generator Flow Diagram . . . . . . . . . . . . . . . . . . . . 188
7.5 Service Event Sorter: Cycle 1 . . . . . . . . . . . . . . . . . . . . . . . . 193
7.6 Service Event Sorter: Cycle 2 . . . . . . . . . . . . . . . . . . . . . . . . 194
7.7 Service Event Sorter: Cycle 3 . . . . . . . . . . . . . . . . . . . . . . . . 196
7.8 Service Event Sorter: Cycle 4 . . . . . . . . . . . . . . . . . . . . . . . . 197
7.9 Linear Array Queue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
7.10 Linear Sort Array Input Example . . . . . . . . . . . . . . . . . . . . . . 199
7.11 Linear Sort Array Output Example . . . . . . . . . . . . . . . . . . . . . 199
7.12 Speedup vs Events for Event Generation, Arrival and Service Queues . . 201
7.13 An Intersection and its Departing Roads . . . . . . . . . . . . . . . . . . 204
7.14 Scheduler Vehicle Queue . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
7.15 Calculations for Vehicle Movement on a Road . . . . . . . . . . . . . . . 211
7.16 Calculations for Vehicle Movement Through an Intersection . . . . . . . 215
7.17 Processing Element for 4-way Intersection and Exit Roads . . . . . . . . 220
7.18 A Network of Processing Elements . . . . . . . . . . . . . . . . . . . . . 222
7.19 The 3-Dimensional Network Structure . . . . . . . . . . . . . . . . . . . 223
7.20 The PE Interconnection Network . . . . . . . . . . . . . . . . . . . . . . 224
7.21 K-ary Search Tree Network . . . . . . . . . . . . . . . . . . . . . . . . . 225
7.22 Algorithm Phase 2 Method 2 . . . . . . . . . . . . . . . . . . . . . . . . 230
7.23 Cross-point Switch Architecture . . . . . . . . . . . . . . . . . . . . . . . 232
7.24 Processing and Communications time . . . . . . . . . . . . . . . . . . . 233
7.25 Exponential Distribution in Event vs Time-Driven Simulation . . . . . . 238
7.26 Exponential Distribution Slice of Figure 7.25 . . . . . . . . . . . . . . . 239
7.27 Weibull Distribution in Event vs Time-Driven Simulation . . . . . . . . 241
7.28 Weibull Distribution Slice of Figure 7.27 . . . . . . . . . . . . . . . . . . 241
9.1 Speedup Results by Section . . . . . . . . . . . . . . . . . . . . . . . . . 248
Acknowledgments
First, and most importantly, I wish to thank my wife, Anna, for her patience,
guidance and assistance.
I would like to express gratitude to both the Pennsylvania State University and
Massachusetts Institute of Technology Library systems. Without their assistance and
public access policies, this thesis would not have been possible.
Thanks to Henry Lieu of the United States Federal Highway Administration for
providing access to the CORSIM source code.
Special thanks to Ms. Ralene Marcoccia and Mr. Joe Hanson of Altera's University
Programs Department for providing the Altera simulation software that formed the
backbone of our FPGA analysis.
Pie charts are rendered using the Ploticus software package which was developed
by Steve Grubb (www.sgpr.net). GNU software and the Linux kernel are used extensively
in this research.
Chapter 1
Introduction
Simulation is one important aspect of computing. There are a wide variety of
simulation applications, from gaming to financial management. Computer manufactur-
ers have thus far concentrated the vast majority of their efforts on developing general
purpose computer architectures which quickly execute a stored program. The stored
program concept dates back to the inception and seminal papers on modern comput-
ing [132]. The concept relies on the ability to store instructions and data which can
be retrieved sequentially from memory and then executed by a processor. The stored
program concept works well for general purpose computing where the applications are
diverse and the runtime speeds are non-critical. Unfortunately, not all environments are
so accommodating. The architecture proposed in this thesis deviates from the stored
program concept by moving the processor instructions out of memory and embedding
them directly in the reconfigurable logic data path of the architecture, an unusual departure.
Presented are methods for accelerating discrete event simulation in general, with
a focus on the specific example of traffic simulation. Using an accelerated simulator,
existing metropolitan models can be adjusted to reflect accidents or incidents for traffic
management. The simulation can then be rapidly re-run to determine whether proposed
detour and signal control solutions will alleviate congestion. Results can provide ad-
ditional guidance about detour impact on the rest of the traffic grid. Demonstrating
their importance as traffic management tools, air traffic simulators are used to minimize
delays and maximize passenger throughput by increasing efficiency.
Traffic simulators are often unable to simulate traffic at a rate much greater than
the time required to actually run traffic on a network of roads. A recent demonstration
of MITSIM, modeling a section of the Boston arterial flow project, simulated traffic
at a stated rate of approximately 90% of real time. The speed of
these simulators is adequate for the design of new traffic pattern construction and for
optimizing traffic signal timing sequences. However, the response time is inadequate
for handling traffic incidents, which require a greater level of acceleration before the
results become useful to traffic managers attempting to re-optimize existing networks
during unanticipated crises.
Metropolitan traffic grids are often strained by the advent of celebrations or
demonstrations which may foster abnormal traffic loads. Concentrations of congregants
may induce localized surges of congestion. Even a simple traffic incident in an already
strained metropolitan street grid yields immediate consequences. This thesis presents a
simulation machine architecture capable of serving as a rapid traffic incident response
simulation system. The accelerated machine is designed to assist traffic management
officials in obtaining and testing detours. The machine is capable of running its simu-
lations fast enough to be useful to the traffic officers on the street. Although anyone
stuck in a traffic jam can attest to the benefits and increased satisfaction level gained
by avoiding congestion, the implications of increased traffic throughput are not just a
matter of convenience. Nationally, the resuscitation rate is only 2 to 5 percent, in large
part because defibrillators do not get to victims in time [124]. Faster response time
to injury victims can be directly correlated to an increased survival rate.
An architecture composed of multiple processing elements is proposed. The pro-
cessing elements are united and synchronized towards the common goal of accelerating
discrete event simulation. Many of the examples illustrate methods applicable to
discrete event simulation in general; microscopic road traffic simulation serves
throughout as a concrete example.
The thesis organizes its discussion of discrete event simulation into the following
topics. First, Chapter 1 describes the motivation behind the research. A basic model of
discrete event simulation is illustrated which is referred to throughout the thesis as the
basis of discrete event simulation. Chapter 1 also highlights opportunities for accelera-
tion which are applied in future sections, as well as the basic constraints of simulation.
Finally, Chapter 1 describes why simulation, as opposed to mathematical analysis, is
often required. Related aspects of traffic theory are reviewed in Chapter 2. This chapter
discusses the derivation of the microscopic acceleration model developed by General Mo-
tors in Section 2.0.2. Two macroscopic models, the first developed by Greenshields and
the second developed by Greenberg, are reviewed in Sections 2.0.4.1 and 2.0.4.2, respec-
tively. Chapter 3 presents an overview of deterministic simulators. In addition to the
logic simulators, there is a review of some optimistic simulation hardware in Section 3.3
and hardware used as random number generators in Section 3.4. Chapter 4 describes
the simulation software implemented and/or studied in this research. Software is used as
a standard against which the proposed hardware implementation throughput speeds are
measured. In order to gain an understanding of the requirements of hardware simulators,
Section 4.2 presents the runtime characteristics of CORSIM, a representative software
simulator which has been developed under the aegis of the United States Department
of Transportation. CORSIM is widely used in research and practice. The results of
Section 4.2 focus hardware acceleration efforts towards the bottlenecks of simulation.
Trafix, an open source, free software, traffic simulator which was developed concurrently
by the author, is overviewed in Section 4.3. Chapter 5 performs some statistical anal-
ysis beneficial for the decision of running a simulation in time or event-driven mode.
A brief analysis of the geometric constraints required in partitioning a traffic map over
the proposed simulation architecture is examined in Section 5.2. Chapter 6 describes
some of the key hardware methods which are applied to accelerate the simulator. These
approaches include reconfigurable logic, systolic arrays, associative memory, and a reduc-
tion bus. Chapter 7 presents the architecture design of the simulator which applies the
techniques described in Chapter 6 to accelerate each component of the simulation model
defined in Chapter 1. Chapter 8 considers optimistic modifications to the simulator
design. Finally, Chapter 9 describes the results of the work.
1.1 The Importance of Non-deterministic Simulation
Non-deterministic simulation is an important tool used by a variety of disciplines.
Faster simulations will allow engineers to predict and accommodate changes in metropoli-
tan traffic models. A system which can accurately predict traffic jams or service inter-
ruptions will assist in their prevention. Faster simulation allows existing traffic models
to quickly reflect accidents or changes in available traffic routes. Accelerated simulation
models can be re-run rapidly to pinpoint expected traffic congestion. Traffic engineers
can model their proposed changes and simulate solutions quickly for verification. In a
real time application, simulators can either evaluate traffic management strategies for
re-routing during incidents, or can assess and optimize traffic control schemes under
various, changing traffic demands.
Recent instances of large scale discrete event simulation and concerns about their
slow pace abound in the press. One example, run on state-of-the-art equipment, was the
simulation of a 1-kilometer-wide comet striking the earth’s ocean to determine the impact
detonation power and resulting shockwave strength. The simulation was performed
using the teraflops (trillion floating point operations per second) supercomputer
at Sandia National Laboratories.
A kilometer is about the size of the largest fragment of Comet Shoemaker-Levy 9
which crashed into Jupiter in 1994, an event that was also the subject of computational
simulations [105, 111]. The calculation used Sandia’s bang and splat shock physics code
and was run on 1,500 processors of the new Intel Teraflops computer being installed at the
Labs. These 1,500 processors represent one-sixth of the expected final 9,000-processor configuration.
The calculation assumed a 1-kilometer-diameter comet (weighing about a billion
tons) traveling 60 kilometers per second and impacting Earth’s atmosphere at about a
45 degree angle. The model is small as far as comets go (the massive Comet Hale-Bopp
weighs about ten trillion tons). The problem was divided into 54 million zones and ran
for 48 hours. The results, although dramatic, confirmed earlier predictions about a
comet impact, but they did so with much finer resolution in three dimensions than had
ever before been possible [105].
Scientists at the University of Wisconsin have recently applied genetic algorithms
that allow Darwinian natural selection to guide the design of
diesel engines [127]. This is the kind of problem that chokes even the most powerful
supercomputer. Computers running software developed by Dr. Reitz and his colleagues
at government laboratories, universities, and in industry have begun to make progress,
though the progress is slow. “A typical simulation will run for several days on a
supercomputer,” Dr. Reitz said. “That simulation is of one engine cycle, which actually takes
place in less than a tenth of a second. . . . There can be dozens of parameters to adjust,
each of which affects the others. Finding an optimal combination by trial and error on
a real-world engine could take practically forever. But with simulations taking two days
apiece, trying all the combinations of variables with a computer does not seem to work
much faster.” [127]
In terms of simulating emergency evacuations, simulation has also recently been
applied to crowd behavior during a fire with limited egress. The intent is to allow
emergency planners to prevent death or injuries due to panic [40].
Traffic engineering presents a practical and realistic simulation application. One
possible scenario models the new millennium celebration in Times Square, New York
City. Changes could be made to existing models, factoring in the effects of traffic outages,
and thereby allowing the simulation and verification of proposed traffic detours. Traffic
outages could occur due to construction, accidents, or terrorist activity. Accelerated
simulators will prove to be highly effective in assisting engineers rerouting traffic during
emergency situations. The same scenarios hold for rail [32] and aeroplane traffic. If
transportation systems designers have fast simulators available, the following system
design enhancements become feasible [88]:
• Real time verifiable experimentation becomes plausible.
• Simulation can contribute to systems management rather than being used solely
for the design process.
• Users can be more confident of the accuracy of their implementation decisions.
1.2 Simulation Model
Discrete event simulations typically have three basic common denominators. First,
they contain a set of state variables denoting the current state of the simulation. The
state variables contain information such as the number and availability of system re-
sources. Secondly, a typical discrete simulation contains an event queue, depicted in Fig-
ure 1.1. The event queue is a list of pending events which have been created by an event
generator but not yet executed by the scheduler. These events require system resources
to execute. The availability of resources is described by the state variables. Events often
contain an arrival timestamp and possibly a duration. The arrival timestamp indicates
when the event impacts the system’s state variables. Event arrival times and service
times are frequently generated based on statistical models. For example, events may
arrive according to a Poisson Distribution. Finally, the third common denominator of
discrete event simulations is the global simulation clock which keeps track of the sim-
ulation’s progress. The simulation must maintain proper causal states, meaning that
each event must be executed in the environment created by the execution of the prior
events. Therefore, if prior events have depleted a particular resource, that resource will
be unavailable for the execution of a following event.
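As an illustration of statistically generated arrival times, events arriving according to a Poisson process have exponentially distributed inter-arrival gaps. The following is a minimal sketch, not part of the proposed architecture; the function name and parameters are hypothetical:

```python
import random

def poisson_arrivals(rate, end_time, seed=0):
    """Generate event arrival timestamps for a Poisson process of the
    given rate: the gaps between successive arrivals are drawn from an
    exponential distribution with mean 1/rate."""
    rng = random.Random(seed)
    t = 0.0
    arrivals = []
    while True:
        t += rng.expovariate(rate)   # exponential inter-arrival gap
        if t > end_time:
            return arrivals
        arrivals.append(t)
```

A hardware event generator would produce the same stream with pseudo-random number circuits, but the statistical content is identical: timestamps are monotonically increasing, with a mean spacing of 1/rate.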
The simulation generally executes a main loop [55] which repeatedly removes the
event with the smallest timestamp from the event queue. Each event is processed by
making appropriate state changes to the simulation model’s state variables.
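The main loop just described can be sketched in a few lines. This is only an illustrative software rendering of the generic algorithm, not the thesis hardware, and all names are hypothetical:

```python
import heapq

def run_simulation(initial_events, handler, end_time):
    """Generic discrete event simulation loop: repeatedly remove the
    pending event with the smallest timestamp and process it."""
    clock = 0.0                      # global simulation clock
    queue = list(initial_events)     # (timestamp, event) pairs
    heapq.heapify(queue)             # priority queue ordered by timestamp
    processed = []
    while queue:
        timestamp, event = heapq.heappop(queue)
        if timestamp > end_time:
            break
        clock = timestamp            # time jumps to the next event
        processed.append(event)
        # the handler updates the state variables and may schedule
        # new events, which are merged back into the event queue
        for new_ts, new_ev in handler(clock, event):
            heapq.heappush(queue, (new_ts, new_ev))
    return clock, processed
```

Note that the clock advances in irregular jumps from one event timestamp to the next; this is the event-driven mode whose trade-off against time-driven stepping is analyzed in Chapter 5.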
Discrete simulation creates a system which changes state at specific points in
time. The simulation model jumps from one state to the next when an event occurs and
is processed. A telephone system might contain a set of state variables which describe
telephone trunks leading from a substation as either full or available to route new calls.
Additional state variables might contain the number of calls currently being handled by
the substation. Typical events at the substation might include call arrivals, inbound
calls being routed through the station, calls being terminated, or calls being blocked.
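The telephone example can be made concrete with a toy set of state variables. This sketch is illustrative only; the class and field names are hypothetical, not drawn from the thesis:

```python
class Substation:
    """State variables for the telephone substation example: a fixed
    number of trunks, with each active call occupying one trunk."""
    def __init__(self, trunks):
        self.trunks = trunks         # total trunk capacity
        self.active_calls = 0        # calls currently being handled
        self.blocked = 0             # arrivals rejected for lack of a trunk

    def call_arrival(self):
        """Process a call-arrival event against the current state."""
        if self.active_calls < self.trunks:
            self.active_calls += 1   # a trunk is available: route the call
            return True
        self.blocked += 1            # all trunks full: the call is blocked
        return False

    def call_termination(self):
        """Process a call-termination event, freeing a trunk."""
        assert self.active_calls > 0
        self.active_calls -= 1
```

Causality is visible here: whether an arrival succeeds or is blocked depends entirely on the resource state left behind by the previously executed events.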
1.3 Opportunities for Acceleration in Discrete Simulation
Discrete event simulation provides a couple of strong candidate openings for ac-
celeration. It is easy to cite impediments to acceleration, which include the need for
causality in event execution. Requiring each event to execute in the environment cre-
ated by its predecessor does indeed stipulate an inherent sequential nature. However,
various attempts have been made to allow simulation events to be processed concur-
rently, or optimistically [29, 43, 58, 56, 80, 140]. Three main opportunities are explored
in this work.
The first opportunity views events as collections of smaller discrete subcomponents, parts of which are independent, an approach sometimes referred to as fine-grained parallelism.
Fig. 1.1. Simulator Model The simulator is divided into the components illustrated. The Event Generator creates random events, according to a user-selected statistical distribution, and the event's resource requirements. The events and their attributes are placed in the Event Queue. The Scheduler steps through the Event Queue in chronological order according to the Global Simulation Clock, attempting to allocate resources to each event. If the resources are available, the event can execute. If not, the event is blocked.
In computer architecture, there was an exploration of Reduced Instruction Set Comput-
ing (RISC) versus Complex Instruction Set Computing (CISC). The CISC proponents
proposed rich, sophisticated instruction sets with large selections of addressing modes
which tended to slow instruction execution. RISC proponents proposed a simple, fixed-
length instruction set with fewer addressing modes. Though it may take several simple
instructions on a RISC machine to replace one of the more complicated instructions, it
is possible to process the simpler instructions more rapidly [21, pg.118]. Dividing simu-
lation events into smaller sub-tasks reveals independent sub-components which may be
executed rapidly and in parallel.
The second advantage which can be exploited within simulation is the locality
of data. For example, in traffic simulation, vehicles tend to move along a continuous
trajectory, flowing from one street into an intersection and then onto the next street.
If a traffic simulation is divided along naturally occurring geographic boundaries, data
required to process the vehicle will be easily cached within the processing elements
handling the respective streets and intersections. For example, if a processing element
is assigned to handle an intersection and its egressing roads, then information about
the roads and intersection including speed limit, road grade, traffic signals, etc. can all
be maintained within the processing element and need not be moved with individual
vehicle datasets. All vehicles traversing roads and intersections store their properties locally on the road and intersection processing elements, alongside the common geographic data. The currently processed vehicle may become the following vehicle's leader
during the next processing stage. The sedentary nature of much of the data will help
to alleviate the common memory bottleneck problems which are often experienced by
general purpose computers.
The third opportunity which may be turned to advantage allows the hardware
to compute all of the possible outcomes of a conditional statement concurrently and a
priori and then simply select the appropriate result. This advantage trades the silicon real estate required to implement the needed functional units for accelerated access to the required results.
1.3.1 Difficulties faced by Parallel Discrete Event-Driven Simulations
The greatest opportunity for increasing simulation speed lies in concurrent event
processing. Parallel execution of discrete event-driven simulation is limited by the need
to always process the queue event with the smallest time stamp. If a different event
were removed from the queue and processed, that second event might incorrectly change
the simulation’s state variables in which the next smallest time stamped event would
execute. Having an event in the future affect an event in the past is called a causality
error [58].
Fujimoto [55] provides a quick example using two events, Ei and Ej with their
respective timestamps, Ti and Tj . It is assumed that if i < j then Ti < Tj . If Ei writes
into a state variable that is read by Ej , then Ei must be executed before Ej to be sure
that no causality error occurs. This example ignores partial concurrent execution of the
events, which may be possible given the sequencing constraints.
Hence, the possibilities for increasing speed via concurrency seem limited. In
the proposed architecture, reconfigurable logic devices are targeted at the computation
problems as opposed to conventional processors. Reconfigurable logic devices devote their
silicon area to a large number of computing primitives, interconnected via a configurable
network. Both the primitives and the interconnection network can be programmed to
fit the problem. Computational tasks are spatially implemented on the device, allowing intermediate computation results to flow directly from producing to receiving functions.
Since thousands of primitives can reside on a single chip, significant amounts of data flow
may occur without crossing chip boundaries. In this thesis, the entire task is mapped into
hardware. The reconfigurable logic provides a spatially oriented processing environment
as opposed to the temporally-oriented processing provided by general purpose processors.
Reconfigurable logic provides the following advantages over traditional micropro-
cessors. These advantages are used to accelerate discrete event simulation [47].
• Functional Unit Distribution - Rather than broadcasting a new instruction to
the functional units on every cycle, instructions are locally configured in the data-
path, allowing the reconfigurable device to compress the data stream distribution
and effectively deliver more instructions into active silicon on each cycle.
• Spatial routing of computational intermediates - As space and primitives
permit, intermediate values are routed in parallel from producing functions to
consuming functions rather than forcing all communication to take place in time
through a central resource bottleneck.
• Fine-grained computing approach - Having more, often finer-grained, sepa-
rately programmable building blocks, reconfigurable devices provide a large num-
ber of separately programmable units allowing a greater range of computations to
occur per time step.
• Resource Placement - Distributed, deployable resources eliminate bottlenecks.
Resources including memory, interconnect, and functional units are distributed and
deployed based on need, rather than being centralized in large pools. Independent,
local access allows reconfigurable designs to take advantage of local and parallel
on-chip bandwidth, instead of creating a central resource bottleneck.
In addition to these advantages of reconfigurable logic, the causality constraints
described in this section are similar to the constraints faced by the logic simulators of
Chapter 3. Some of the same techniques applied to the deterministic logic simulators
will also be targeted on the non-deterministic simulation problem. In traffic simulation,
causality constraints are limited to the environment directly surrounding the vehicle.
For example, in a traffic simulation, the movement of a vehicle on one city block may
be completely independent of a vehicle traveling on the next block, let alone a vehicle
traveling across town. At a higher level, communications efficiency between concurrent
processors is reviewed. All of these techniques and attributes are used to assist in the
acceleration of parallel discrete event simulation.
1.4 Simulation’s Niche
Simulation is generally carried out either macroscopically or microscopically. Macro-
scopic models use aggregate data which may include speed-flow-density and queue dis-
persion equations. These equations determine how vehicles move through the traffic
grid. The microscopic approach provides a greater level of detail, modeling each vehicle
individually. These two approaches are examined more closely in Section 1.5.
Simulation fulfills requirements to provide appropriate predictions on the behav-
ior of dependent system interactions where mathematical solutions remain elusive. In
traffic networks, there are many situations where various lanes of traffic interact in the
sense that a traffic stream departing from one queue of vehicles enters one or more
other queues, perhaps after merging with portions of yet other traffic streams departing
from still other queues. The merging of these traffic streams has the unfortunate effect
of complicating the character of the arrival process at the downstream queues [18, pg
209]. Once the vehicles travel beyond their entry points, their interarrival times become
strongly correlated with the vehicle lengths and inertia. It therefore becomes impossible
to carry out a precise and effective analysis comparable to the queueing theory analysis
for the M/M/1 and M/G/1 systems [18, pg 209]. Analysis of these systems requires the
assumption of inter-arrival independence.
Bertsekas provides a traffic analogy when describing why the assumption of in-
dependent arrival times is often not applicable in simulation models. Consider a slow
truck traveling on a busy narrow street followed by several faster cars. The truck will
typically see empty space ahead while being closely followed by the faster cars [18, pg
209]. The assumption of independent arrival times in a network of nodes is often not a
valid assumption. Without the assumption of independence, many mathematical models
do not apply.
1.5 Microscopic & Macroscopic
A microscopic analysis focuses on the speeds of individual units, and the time
and distance headways between each unit, whereas a macroscopic approach considers aggregate flow rates, average velocities, and distances. Traffic analysis methods range
from simple equations to complex simulation models. Traffic stream models can often
be used for uninterrupted flow situations where demands do not exceed capacities [102].
For oversaturated situations, where traffic flow may be complicated by interruptions,
more complex methods, including shock wave analysis, queue analysis, and simulation
modeling, can be employed.
Macroscopic analysis may be selected for higher-density, larger-scale systems in
which a study of the behavior of groups of units is sufficient [102]. Macroscopic traffic analysis focuses on three fundamental characteristics: flow, density, and speed. For
instance, macroscopic analysis might explore the average vehicle velocity on a freeway
at peak versus off-peak times of the day near a particular exit. Continuum models are
needed for better understanding the collective behavior of traffic [104]. Michalopoulos
notes that applications of the existing high-order macroscopic models have not shown satisfactory results, especially for congested networks with interrupted flows. In high-density situations, the models may suffer from stability problems, although
in some cases these problems are related to the numerical method used. Vehicle interac-
tion is also considered to be one of the components that contribute to flow acceleration.
Unfortunately, so far, there are neither theoretical arguments, nor experimental results
that lead to an unambiguous choice of such contributions [104].
Microscopic analysis may be selected for moderate-sized systems where the num-
ber of transport units passing through the system is relatively small and there is the
need to study the behavior of individual units in the system. The designed simulator is
intended to assist with traffic incident recovery. A microscopic model was selected for
this research to provide detailed incident traffic analysis.
Chapter 2
Traffic Theory
Traffic flow theory has been well studied. In a traffic stream, a minimum space
must be available in front of every vehicle so that the operator can control the vehicle without colliding with the lead vehicle. Vehicle spacing is an important criterion for the
operator’s level of service. Large vehicle spacing provides operators with considerable
freedom of motion. In theory, as vehicle spacing decreases, operators must devote more concentration to the task of driving and reduce their velocity. Decreased
spacing results in lower levels of comfort, but higher throughput as long as the vehicle
spacing remains greater than the critical spacing limit. After the critical limit is reached,
the throughput begins to drop along with the level of service. An extreme example of
the service drop is a stopped queue of vehicles where the spacing is minimal, the level of
service is at its nadir, and the flow is zero.
The results contained in this chapter are used in the development of Trafix, a traf-
fic simulator which is discussed in Section 4.3, and in the architecture implementations
described in Section 7.2.3. Specifically, the car-following model described in Section 2.0.2
is often used as the basis of traffic simulation. The simulators developed for this thesis
depend on the car-following model.
Figure 2.1 illustrates the time headway. An analogous measurement, the distance
headway, is defined as the space between two selected points on the lead and following
vehicle. The time headway is more frequently encountered in practice “... because
of the greater ease of measuring time headway. Distance headway can be obtained
photographically; however, it is more often obtained by calculation based on the time
headway and individual speed measurements.” [102]
The notation and definitions of Figure 2.2 are used to develop car-following mod-
els. Two vehicles moving left to right are illustrated with the lead vehicle, n, having a
length of Ln and the following vehicle, n + 1, of length Ln+1. The other figure parame-
ters are listed and defined in Table 2.1. Note that the acceleration rate of the following
vehicle Xn+1 occurs at time t + ∆t, not t. The ∆t, sometimes called the operator reac-
tion time, is the time required for the operator of the following car to decide and initiate
a new acceleration.
Variable     Description
n            lead vehicle
n + 1        following vehicle
L_n          length of lead vehicle
L_{n+1}      length of following vehicle
x_n          position of lead vehicle
x_{n+1}      position of following vehicle
ẋ_n          speed of lead vehicle
ẋ_{n+1}      speed of following vehicle
ẍ_{n+1}      acceleration rate of following vehicle
t            time t
t + ∆t       ∆t time after time t

Table 2.1. Car-Following Notation Notation used in car-following theories is listed. A matching illustration of the notation symbols is contained in Figure 2.2.
Fig. 2.1. Time Headway The time headway is defined as the elapsed time between the arrival of pairs of vehicles. The time headway, t5 − t3, consists of two time intervals: the occupancy time, denoted in the figure as t4 − t3, which is the time required for the vehicle to actually pass the observation point, plus the time gap between the rear of the lead vehicle and the front of the following vehicle, denoted as t5 − t4. The time headway is not specifically defined as the time between the passage of the following edges of two consecutive vehicles, but is simply taken as the time passage of identical points on two consecutive vehicles. In practice, the leading edges of vehicles are frequently used [102].
Fig. 2.2. Car-Following Notation Figure Notation used in car-following theories is illustrated. Car-following theories describing vehicle interactions were developed in the 1950s and 1960s [102]. Various car-following models were developed; however, one of the most notable was the work performed at General Motors Corporation (GM). The GM research was accompanied by field experiments and the discovery of the mathematical bridge between the microscopic and macroscopic theories of traffic flow [102]. The notation cited above, and listed in Table 2.1, was developed by General Motors.
Chapter 2 contains information on both microscopic and macroscopic simulation
theory. The first section, Section 2.0.1, describes the work by Reuschel and Pipes who
developed microscopic equations to describe vehicular traffic flow. Section 2.0.2, discusses
the General Motors formula for the calculation of acceleration. The General Motors
acceleration formula is for microscopic simulation calculation. The next sub-section,
Section 2.0.3, derives the equations used for vehicle deceleration and stopping from basic
principles. Finally, Section 2.0.4 discusses two macroscopic simulation models. One
of the first macroscopic models, the Greenshields model, is reviewed in Section 2.0.4.1.
Section 2.0.4.2 reviews Greenberg’s model which is a fluid-flow macroscopic traffic model.
Shockwave models are not reviewed, but detailed information can be found in [35, 99,
60, 104].
2.0.1 Reuschel & Pipes
The challenge to describe vehicular flow in a microscopic manner led Reuschel
and Pipes to formulate the phenomena of the motion of pairs of vehicles following each
other as described in Equation 2.1 [82]. The derivation of the expression is described
graphically in Figure 2.3.
x_n − x_{n+1} = L + S·ẋ_{n+1}    (2.1)
Differentiation of Equation 2.1 leads to Equation 2.2, which is referred to as the
basic equation of the car-following models. Research groups associated with the General
Motors Corporation developed a linear mathematical formula which fitted well against
Fig. 2.3. Car-Following Acceleration The challenge to describe vehicular flow in a microscopic manner led Reuschel and Pipes to formulate the phenomena of the motion of pairs of vehicles following each other by the expression: x_n − x_{n+1} = L + S·ẋ_{n+1} [82]. In this expression, it is assumed that each driver maintains a separation distance proportional to the speed of his vehicle, ẋ_{n+1}, plus a constant distance L, which is composed of the length of the vehicle plus a distance headway as determined at standstill when ẋ_n = ẋ_{n+1} = 0. The constant S is measured in time.
high density traffic data. The formula they derived, with the desire to maintain a linear
relationship, is provided in Equation 2.3. The equation differs from Equation 2.2 by the
introduction of ∆t, which is defined to be the time lag response to the stimulus [82].
ẍ_{n+1} = (1/S)[ẋ_n − ẋ_{n+1}]    (2.2)

ẍ_{n+1}(t + ∆t) = (1/S)[ẋ_n(t) − ẋ_{n+1}(t)]    (2.3)
2.0.2 General Motors’ Car-following Model
The General Motors research group established 5 generations of car-following
models which all take the form:
response = func(sensitivity, stimuli)    (2.4)
The response in Equation 2.4 represents vehicle acceleration. The stimulus is the relative velocity of the lead and following vehicles. The five models are distinguished by differences in their sensitivity terms.
In the first model, Equation 2.5, the sensitivity term, α, is assumed to be a
constant. The equation is equivalent to Equation 2.3.
ẍ_{n+1}(t + ∆t) = α[ẋ_n(t) − ẋ_{n+1}(t)]    (2.5)
Equation 2.5 forms a feedback mechanism for the acceleration of the following vehicle. If the lead vehicle is faster than the following vehicle, the velocity difference ẋ_n(t) − ẋ_{n+1}(t) is positive, and the following vehicle accelerates. Conversely, if the lead vehicle is slower than the following vehicle, the following vehicle's acceleration becomes negative, slowing its approach.
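As a small numerical illustration of this feedback (a sketch of the first GM model; the constant α = 0.5 and the speeds are arbitrary):

```python
def gm_linear_acceleration(alpha, lead_speed, follow_speed):
    """First GM car-following model (Eq. 2.5): the following vehicle's
    acceleration is proportional to the relative velocity."""
    return alpha * (lead_speed - follow_speed)

# faster leader -> positive acceleration; slower leader -> braking
a_speed_up = gm_linear_acceleration(0.5, lead_speed=20.0, follow_speed=15.0)
a_brake = gm_linear_acceleration(0.5, lead_speed=10.0, follow_speed=15.0)
```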
Field experimentation led the GM research team to note a wide range of values
in the α sensitivity value. Hence, the team first tried to introduce separate sensitivity
constants. The term used in the equation is selected based upon the vehicles’ relative
distances. A higher sensitivity term, α1, is used when the vehicles are closer together, as
shown in Equation 2.6. The equation is unsatisfactory due to its inherent discontinuity.
ẍ_{n+1}(t + ∆t) = {α_1 or α_2} [ẋ_n(t) − ẋ_{n+1}(t)]    (2.6)
Gazis, Herman, and Potts [61] changed the linear property of Equation 2.5 by
allowing the constant sensitivity factor, α, to become inversely proportional to the sepa-
ration distance between the vehicles. The group introduced the physical spacing between
the lead and following vehicles as a parameter, leading to Equation 2.7. As the distance
between the vehicles decreases, the sensitivity term is given more weight. Gazis’s modi-
fication is illustrated in Equation 2.7, where α is a new constant.
ẍ_{n+1}(t + ∆t) = [α / (x_n(t) − x_{n+1}(t))] [ẋ_n(t) − ẋ_{n+1}(t)]    (2.7)
Next, Equation 2.7 is modified, yielding Equation 2.8, which gives the sensitivity term more weight based on the speed of the following vehicle. The rationale is that as the speed of the traffic stream increases, the operator of the following vehicle becomes more sensitive to the relative velocity between the lead and the following vehicles. In work
subsequent to [61], Gazis generalized Equation 2.8 into its final form, Equation 2.9.
Equation 2.9 became the final version of the acceleration formula for microscopic car-
following acceleration.
ẍ_{n+1}(t + ∆t) = [α′ ẋ_{n+1}(t + ∆t) / (x_n(t) − x_{n+1}(t))] [ẋ_n(t) − ẋ_{n+1}(t)]    (2.8)
Equation 2.9 is a continued effort to generalize the sensitivity term. The final
model allows the following vehicle velocity and relative vehicle separation to have a
generalized exponential effect. The equation allows the speed and distance headway
components to be raised to powers other than one, using exponents m and l. All previous
car following models are specialized cases of Equation 2.9.
ẍ_{n+1}(t + ∆t) = [α_{l,m} (ẋ_{n+1}(t + ∆t))^m / (x_n(t) − x_{n+1}(t))^l] [ẋ_n(t) − ẋ_{n+1}(t)]    (2.9)
When m = l = 0, Equation 2.9 is reduced to Equation 2.5. Equation 2.7 results
when m = 0 and l = 1. Equation 2.9 was used in the software simulator of Section 4.3
and in the hardware reconfigurable logic implementations resulting from Figures 7.15
and 7.16. In both the hardware and software simulations, the values used for αl,m, m,
and l are the values used in Yang [146], where αl,m = 1.25, and m = l = 1.
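A Python sketch of Equation 2.9 with those parameter values (the positions and speeds are illustrative, not data from the thesis):

```python
def gm_acceleration(alpha_lm, m, l, follow_speed_next,
                    lead_pos, follow_pos, lead_speed, follow_speed):
    """Generalized GM model (Eq. 2.9): the sensitivity term weights
    the following vehicle's speed at t + dt (exponent m) against the
    vehicle separation (exponent l)."""
    sensitivity = (alpha_lm * follow_speed_next ** m
                   / (lead_pos - follow_pos) ** l)
    return sensitivity * (lead_speed - follow_speed)

# alpha_lm = 1.25 and m = l = 1, the values used in the thesis simulators
a = gm_acceleration(1.25, 1, 1, follow_speed_next=14.0,
                    lead_pos=50.0, follow_pos=30.0,
                    lead_speed=16.0, follow_speed=14.0)
```

With m = l = 0 the sensitivity collapses to the constant α of Equation 2.5, as noted above.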
2.0.3 Vehicle Deceleration
The formula for deceleration when a vehicle is speeding is derived from the basic principles of Equations 2.10, 2.12, and 2.14. Letting t_0 be zero leads to Equations 2.11 and 2.13, respectively, from Equations 2.10 and 2.12.
Variable     Description
v_f          final velocity
v_0          initial velocity
v̄            average velocity
t_f          final time
t_0          initial time
a            average acceleration
x_f          final position
x_0          initial position

Table 2.2. Vehicle Deceleration Notation Notation used in the vehicle deceleration equations of Section 2.0.3 is defined.
a = dv/dt = (v_f − v_0)/(t_f − t_0)    (2.10)

a·t_f = v_f − v_0    (2.11)

v̄ = dx/dt = (x_f − x_0)/(t_f − t_0)    (2.12)

v̄·t_f = x_f − x_0    (2.13)

v̄ = (1/2)(v_0 + v_f)    (2.14)
Then, combining Equations 2.11, 2.13, and 2.14, letting x_0 be 0, and eliminating v̄, the final form, Equation 2.15, is derived. This is the form used in both the hardware and software models to compute deceleration when a vehicle is speeding. By letting v_f go to 0, the same equation is also used to stop at the end of a road during a stop signal.
a = (v_f² − v_0²)/(2x_f)    (2.15)
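Equation 2.15 translates directly into code (a sketch; the sample speeds and distances are arbitrary):

```python
def required_acceleration(v0, vf, xf):
    """Eq. 2.15: constant acceleration needed to change speed from
    v0 to vf over distance xf (negative result = deceleration)."""
    return (vf ** 2 - v0 ** 2) / (2.0 * xf)

# slowing from 30 m/s to 20 m/s over 100 m
a_slow = required_acceleration(v0=30.0, vf=20.0, xf=100.0)
# full stop (final speed zero) at a signal 50 m ahead from 20 m/s
a_stop = required_acceleration(v0=20.0, vf=0.0, xf=50.0)
```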
2.0.4 Macroscopic Models
Macroscopic traffic simulation uses a continuum traffic model in its approach.
The macroscopic models view traffic as collections of vehicles flowing along a network
of roads. Using macroscopic theory, the equations and conditions for maximum traffic
throughput can be derived. Section 2.0.4.1 reviews Greenshields’ work and the equations
he developed. Section 2.0.4.2 follows a second approach by Greenberg.
The general equilibrium equation relating traffic flow (q), density (k) and mean
harmonic speed (µs) is given in Equation 2.16. These variables further depend on envi-
ronmental factors which include roadway, driver, and vehicle characteristics along with
basic environmental factors like weather [60]. Equation 2.16 will be used below to de-
termine the maximum flow rate [60].
q = k·µ_s    (2.16)
The harmonic mean vehicle speed, µs, is defined by Equation 2.17 where addi-
tional terms are defined in Table 2.3.
µ_s = n / Σ_{i=1}^{n} (1/µ_i) = nL / Σ_{i=1}^{n} t_i    (2.17)
Variable     Description
t_i          time the ith vehicle takes to cross the highway segment
µ_i          ith vehicle speed
n            number of vehicles passing a point on the highway
L            length of highway section

Table 2.3. Harmonic Mean Speed Notation The variables used in Equation 2.17 are defined.
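Equation 2.17 can be sketched as follows (the spot speeds are illustrative):

```python
def harmonic_mean_speed(speeds):
    """Harmonic mean (space mean) speed of Eq. 2.17: n divided by
    the sum of the reciprocals of the individual vehicle speeds."""
    return len(speeds) / sum(1.0 / u for u in speeds)

# equivalent to nL / sum(t_i) when each t_i = L / u_i over a segment of length L
mu_s = harmonic_mean_speed([10.0, 20.0])  # lower than the arithmetic mean 15.0
```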
2.0.4.1 Greenshields
Greenshields [67] published one of the earliest works where he observed the linear
relationship of Equation 2.18 between the velocity and density of vehicles on a stretch of
road. The symbols of the equation are defined in Table 2.4. The use of the Greenshields model depends on whether or not the equation satisfies the traffic model's boundary conditions. The Greenshields model satisfies the boundary conditions when the density, k, approaches zero as well as when the density approaches the jam density, k_j; therefore, the Greenshields model is used for either light or dense traffic [60].
µ_s = µ_f − (µ_f/k_j)·k    (2.18)
Variable     Description
µ_f          mean free speed - maximum speed as density tends to 0
µ_s          harmonic mean of vehicle speeds
k            density (vehicles per lane-unit)
k_j          jam density in vehicles per lane-unit
             (the density when vehicles are bumper to bumper and stopped)

Table 2.4. Notation for Greenshields Equations The variables used in Sections 2.0.4.1 and 2.0.4.2 are defined.
Corresponding relationships for flow vs. density and for flow vs. speed can be
developed by eliminating k from Equation 2.18 using Equation 2.16. Equation 2.19
results.
µ_s² = µ_f·µ_s − (µ_f/k_j)·q    (2.19)
Similarly, using Equation 2.16 to eliminate µs from Equation 2.18, results in
Equation 2.20.
q = µ_f·k − (µ_f/k_j)·k²    (2.20)
Now, using Equations 2.19 and 2.20, the speed and density required for maximum flow can be obtained. Differentiating q of Equation 2.19 with respect to µ_s provides Equation 2.21.
2µ_s = µ_f − (µ_f/k_j)·(dq/dµ_s)    (2.21)
Maximum flow is reached when dq/dµ_s = 0, or when the mean velocity is equal to half the free mean speed, µ_s = µ_f/2 [60].
Equation 2.20 can be similarly manipulated by differentiating q with respect to k
yielding Equation 2.22.
dq/dk = µ_f − 2k·(µ_f/k_j)    (2.22)
Again, setting the derivative dq/dk = 0 yields the maximum flow density, k = k_j/2, equivalent to half the jam density. The maximum flow for the Greenshields relationship can then be obtained by inserting the maximum flow density and the mean velocity which yields the maximum flow into Equation 2.16, producing Equation 2.23.
q_max = k_j·µ_f/4    (2.23)
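The Greenshields relationships of Equations 2.18 and 2.23 can be checked numerically (a sketch; the free speed and jam density values are arbitrary):

```python
def greenshields_speed(k, free_speed, jam_density):
    """Eq. 2.18: linear speed-density relationship."""
    return free_speed - (free_speed / jam_density) * k

def greenshields_max_flow(free_speed, jam_density):
    """Eq. 2.23: q_max = k_j * u_f / 4, reached at k = k_j / 2."""
    return jam_density * free_speed / 4.0

u_f, k_j = 30.0, 0.125          # e.g. 30 m/s free speed, one car per 8 m
q_max = greenshields_max_flow(u_f, k_j)
speed_at_max = greenshields_speed(k_j / 2.0, u_f, k_j)   # equals u_f / 2
# consistency with q = k * mu_s (Eq. 2.16) at the maximum-flow density
assert abs((k_j / 2.0) * speed_at_max - q_max) < 1e-12
```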
2.0.4.2 Greenberg
Concurrently with the development of the General Motors (GM) model, the Port Authority of New York, which was assisting GM with the testing of the GM model, was developing a macroscopic flow model of its own, referred to as the Greenberg Model.
Speed is defined as a function of density. Optimum speed is reached when the traffic flow
level reaches capacity. The Greenberg model satisfies its boundary conditions when the
density is approaching the jam density. Unlike the Greenshields model, the boundary
conditions of the Greenberg model are not satisfied as k approaches zero. Therefore the
Greenberg model is only useful for modeling dense traffic conditions [60]. The Greenberg
Model is contained in Equation 2.24, where c is a constant.
µ_s = c·ln(k_j/k)    (2.24)
Substituting Equation 2.24 into Equation 2.16 yields Equation 2.25, an expression for traffic flow as a function of density.
q = c·k·ln(k_j/k)    (2.25)
If q in Equation 2.25 is differentiated with respect to k, and the derivative is set
to zero to solve for the maximum flow, Equation 2.27 is obtained.
dq/dk = c·ln(k_j/k) − c    (2.26)
ln(k_j/k) = 1    (2.27)
Substituting Equation 2.27 into Equation 2.24 finds the value of c = µs, which is
the velocity at maximum flow.
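A Python sketch of the Greenberg relationships, Equations 2.24 through 2.27 (the constant c and jam density values are illustrative):

```python
import math

def greenberg_speed(k, c, jam_density):
    """Eq. 2.24: mu_s = c * ln(k_j / k)."""
    return c * math.log(jam_density / k)

c, k_j = 12.0, 0.125
k_at_max = k_j / math.e                 # Eq. 2.27: ln(k_j / k) = 1
speed_at_max = greenberg_speed(k_at_max, c, k_j)   # equals c
q_max = k_at_max * speed_at_max         # flow at capacity, via Eq. 2.16
```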
Chapter 3
Previous Work Related to Simulation Architectures
Special purpose machines have been designed and implemented expressly for the
development and performance enhancement of logic systems [1, 3, 13, 14, 28, 89, 98, 119,
135, 137, 138]. In logic design, simulations are used to verify new projects and to run
fault analysis of these designs. The simulation of logic entails modeling deterministic
behavior. The proposed machine simulates non-deterministic behavior. Additionally,
traffic requires a wider variety of dynamic behavior than a limited set of deterministic
logic functions. Recent research has begun to explore the application of parallel pro-
cessing to real time traffic simulation. Both microscopic [110] and macroscopic [36, 81]
approaches have been explored.
The proposed non-deterministic simulation architecture differs from the existing
body of published research. The simulator architecture leverages the locality of data
inherent in traffic simulation. The architecture mitigates the Von Neumann bottleneck
by embedding simulation instructions in reconfigurable logic. Distributed processing is
developed by applying both a global network of processing elements and the implemen-
tation of embedded instructions as pipelined, systolic arrays within reconfigurable logic.
In conventional general purpose processors, data and instructions are fetched from memory, computation is performed, and the data is returned to memory. In the proposed architecture, by contrast, data flows from functional unit to functional unit, accomplishing
computation as part of its transport process. The data channel is pipelined, allowing
concurrent computation of different stages of the simulation within each processing el-
ement. Multiple processing elements are networked together in a scalable architecture
facilitating further parallelism.
The chapter on related work is divided into several subsections. Section 3.1 de-
scribes simulation machines which were created to analyze and test deterministic logic
designs. Unlike the architecture presented in the thesis, logic simulators are determinis-
tic. Section 3.2 reviews the architectures of two machines, the Splash and the ArMen,
which are not specifically logic simulators, but whose designs are relevant to the archi-
tecture presented in this thesis. The Splash design provides good background for the
thesis processing element architecture. Section 3.3 describes the Rollback Chip, which
is a piece of hardware used for state saving during optimistic simulation. Section 3.4
describes three papers on random number generator hardware implementations. Finally,
Section 3.5 describes a reduction bus. The thesis develops a unique reduction bus model
which synchronizes the network of processing elements during either time or event-driven
operation.
3.1 Logic Simulation Machines
Section 3.1 reviews deterministic simulation machines which have been constructed
to simulate and verify logic circuits and designs.
3.1.1 Boeing Computer Simulator
The Boeing Simulator architecture is illustrated in Figure 3.1. The system went
into operation in June of 1970. Categorized by Figure 3.2, the Boeing Simulator is an
event-driven, fine-grained, conservative simulator. The simulator’s primary purpose is
logic simulation. The Boeing Simulator was initially used to perform an architectural
study of a navigation processor, logic design support in the development of general-
purpose computers and special-purpose processors, as well as fault tests of manually
generated logic circuit boards [138]. The simulator contains 4 independent logic pro-
cessors which implement an event-driven logic simulation algorithm. There is a paged
memory which consists of 16K 48-bit words. A memory switch exists to allow shared
memory access. The processor’s architecture is composed of 4 parts, as illustrated in
Figure 3.3. There are 3 scratch pad memories which store the logical equations (ES -
equation store), device delays (D - delay), and the events (E). The fourth part is the
logic evaluation hardware.
To initialize the simulation, the host partitions the logic design simulation among
the 4 processors. The host generates and stores the equations which specify the oper-
ations of the primitive logic elements. A maximum of twelve gates can be represented
by each equation. The host also stores the gate connectivity information. During each
simulation cycle, the smallest time delay is saved and used for the next simulation time
increment. The basic simulation flow is divided into the following steps [20]:
1. The ES points to the active equations and acts as an event queue.
2. For each active equation, the delay value, D, is reduced by the current time step.
Fig. 3.1. The Boeing Simulator Architecture The four logic processors of the Boeing Simulator operate independently, using an event-driven simulation algorithm. The crossbar switch allows the host to access internal state information from each processor's core memory. The communications loop provides interprocessor connectivity and connects the logic processors with the host interface. The loop also allows the host to access processor registers and scratch pad memories.
Fig. 3.2. Simulation Classifications [16] Computer-generated simulation can be divided into two main classes, time-driven simulation and event-driven simulation. Time-driven simulation steps through each time cycle. Event-driven simulation skips those time cycles which lack events. Event-driven simulations may attempt optimistic or conservative simulation approaches. Conservative simulations follow a strictly causal approach, where each event can execute only after all events with earlier timestamps have executed. Optimistic simulations may allow simulations to proceed in a non-causal fashion, but must ensure that the simulation results are causal. Both time and event-driven simulation have two degrees of granularity, coarse and fine. Coarse-grained simulations tend to group collections of events together and evaluate the collections as a unit. In a fine-grained simulation, each event is simulated as a discrete unit.
3. When an equation’s delay is reduced to zero, the equation is evaluated. The result-
ing value is propagated to all the connected components according to a connectivity
list. The ES memory is updated.
4. Equations whose output values change are handled as follows:
• The equation output field is updated and written back to memory.
• The corresponding logic delay is fetched from the D scratch memory.
• The new delay and logic values are stored in the E scratch memory.
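The four steps above can be sketched as a small delay-countdown loop. This is an illustrative reconstruction, not the Boeing hardware's actual logic: the equation records, field names, and connectivity map below are hypothetical stand-ins for the contents of the ES, D, and E scratch pad memories.

```python
# Illustrative sketch of the Boeing Simulator's simulation flow (steps 1-4).
# Equation records, field names, and the connectivity map are hypothetical;
# the real machine held these in the ES, D, and E scratch pad memories.

def simulate(equations, connectivity, max_cycles=100):
    """equations: name -> {'delay': int, 'eval': callable, 'inputs': dict}."""
    time = 0
    for _ in range(max_cycles):
        active = [eq for eq in equations.values() if eq['delay'] > 0]
        if not active:
            break
        # The smallest remaining delay becomes the next time increment.
        step = min(eq['delay'] for eq in active)
        time += step
        for name, eq in equations.items():
            if eq['delay'] == 0:
                continue
            eq['delay'] -= step              # step 2: reduce the delay
            if eq['delay'] == 0:             # step 3: delay exhausted, evaluate
                eq['output'] = eq['eval'](eq['inputs'])
                for dest in connectivity.get(name, []):   # step 4: propagate
                    equations[dest]['inputs'][name] = eq['output']
    return time
```

Note how the smallest remaining delay is chosen as the next time increment, matching the cycle described above.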
The Boeing Simulator was constructed of approximately 28K TTL packages, 40K
of fast memory, and 65K of core memory [138]. The system ran at a clock speed of 10
MHz. The simulator could model 36K elements, or 48K 2-input gates. Elements could
consist of flip-flops, gates, and one-shot devices specified in equation form. Its estimated
performance was 1 million gate evaluations per second.
To its credit, the Boeing Computer Simulator is the first [78] of the logic simulators
to be built and operated. Boeing recognized the limitations of using general-purpose
computing architectures for software logic simulators. The Boeing Computer Simulator
was created to assist in the design of digital equipment used in the United States space
program. Their simulator offered Boeing tremendous logic design advantages at that
time [78].
However, the Boeing simulator also had some significant drawbacks, mostly due
to the technologies available at the time of its creation. These drawbacks include:
• The Boeing simulator had non-programmable logic functions, implemented by logic
cards.
Fig. 3.3. The Boeing Simulator Logic Processor The logic processors contain 16K words of 48-bit, 650-nanosecond core memory which is divided into two banks. The memory is accessed through a local crossbar switch. Three scratch pad memories, the equation store (ES), the device delay (D), and the event scratch pad memories are all used to store intermediate results. The event computations are performed by the processor hardware.
• The Boeing simulator test patterns were interpreted on a host computer as the
simulation was run. These patterns needed to be transmitted from the host to the
simulator during runtime.
• To enter the logic formulas, the design arrived via punch cards and the output was
delivered to a line printer.
• The Boeing system employed separate processors in a multiprocessor environment.
These processors require inter-processor communication via the Communications
Loop Stations.
• The system required data to be read from and written to memory, which incurred
further time delays.
3.1.2 The IBM Los Gatos Logic Simulation Machine
The IBM Logic Simulation Machine (LSM) was conceived in 1977 by John Cocke,
Richard L. Malm, and John Schedletsky of IBM's Thomas J. Watson Research Center
in Yorktown Heights, New York. The Los Gatos Logic Simulation Machine is a logic
simulator which can simulate 64,512 logic expressions at a rate of 640·10^6 expressions per
second [28, 78, 88]. The Los Gatos Logic Simulation Machine was one of the few machines
implemented which was not an event-driven simulation machine. The other time-driven
simulation machines are the Yorktown Simulation Engine [119] and the machine by
Levendel et al [98]. All three machines are fine-grained. The Los Gatos engineering team
was under the incorrect impression that event-driven simulators “monitor all the gates in
a design and from one cycle of simulation time to the next evaluates new outputs for only
those gates that had a change of inputs.” [78] The designers felt that the “disadvantage
in building a hardware event-driven simulator is in the increased complexity and cost due
in large part to the large amount of data that needs to be passed among various parts of
the simulator. This complexity soon results in switching problems and communications
bottlenecks.” [78]
The LSM designers decided that these hazards could be avoided and reasonable
simulation speed could be attained by relying on fast, parallel hardware techniques.
They felt their assumption was justified by the results.
The Los Gatos Logic Simulation Machine is composed of 3 types of processors,
logic processors, array processors, and a control processor. The three types of processors
are interconnected via a crossbar switch. The processors are depicted in Figure 3.4. The
logic processors do the majority of the simulation work. The logic processors fetch, eval-
uate, and store the results of each gate event. The array processors are used to simulate
memory arrays such as Random Access Memory (RAM). The control processor regulates
the operation of the Los Gatos Logic Simulation Machine. The control processor starts,
stops and allows a simulation to be interrupted. The control processor also allows the
host computer to interface with the simulation. The switch in Figure 3.4 is a 64 by 64
crossbar switch which allows all processors to communicate.
The basic structure of a logic processor is depicted in Figure 3.5. The processor
contains an instruction memory, two data memories, a logic unit, and a gate delay value
memory. The instruction memory contains room for 1K of 80-bit instruction words.
Each instruction word contains a function opcode, five input operand addresses, and
Fig. 3.4. Los Gatos Logic Simulation Machine Architecture The Los Gatos Logic Simulation Machine is composed of 3 types of processors: logic processors, array processors, and a control processor. The three types of processors are interconnected via a crossbar switch. The logic processors do the majority of the simulation work. The logic processors fetch, evaluate, and store the results of each gate event. The array processors are used to simulate memory arrays such as RAM. The control processor directs the operation of the Los Gatos Logic Simulation Machine. The control processor starts, stops, and allows a simulation to be interrupted.
other control information. The input operand addresses represent gate inputs. If a gate
has more than 5 inputs, it is decomposed into a sequence of 5-input functions.
The Logic processor accesses two data memories, which each contain 2,048 2-bit
words. One memory is called the input data memory and the other is the output data
memory. The function memory contains 1024 64-bit words. The functions described by
this table are actually 6-input functions, so that the system allows internal chaining of
the 5-input functions. Thus each function memory word is 64 bits, as there are 2^6 = 64
possible input combinations for these internal gates, each requiring one output bit.
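Because a 6-input Boolean function has 2^6 = 64 possible input combinations, its entire truth table fits in one 64-bit word, and evaluation reduces to indexing a single bit. The sketch below illustrates the idea only; the function names and Python representation are not drawn from the LSM itself.

```python
# Sketch of truth-table evaluation as used conceptually by the LSM's
# function memory: a 6-input Boolean function has 2**6 = 64 possible
# input combinations, so its full truth table fits in one 64-bit word.

def make_table(fn):
    """Pack a 6-input Boolean function into a 64-bit truth-table word."""
    word = 0
    for combo in range(64):
        bits = [(combo >> i) & 1 for i in range(6)]
        if fn(*bits):
            word |= 1 << combo
    return word

def evaluate(word, inputs):
    """Evaluate the function by indexing the word with the 6 input bits."""
    index = sum(bit << i for i, bit in enumerate(inputs))
    return (word >> index) & 1
```

Evaluation is thus a constant-time bit select, regardless of the function's complexity.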
The delay value memory stores a table of rise and fall delay times associated with
each instruction in the instruction memory. Thus the delay value memory contains 1K
words each of which is 16 bits in length. Of the 16 bits, 8 bits are devoted to rise time
and 8 bits are devoted to the function’s fall time. These delay times range from 1 to
256 units of delay. The minimum delay is 1 unit. If “... more than one delay unit is
specified for an instruction in the instruction memory, the output of the logic unit is not
written to the output memory but is retained until the specified number of time steps
have elapsed.” [28]
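The 16-bit delay word layout can be sketched as follows. The bias encoding (storing delay minus one so that 8 bits can span the documented 1 to 256 range) is an assumption; the source does not state how the range is encoded.

```python
# Sketch of the delay value memory word layout: 8 bits of rise delay and
# 8 bits of fall delay packed into one 16-bit word. The delay-minus-one
# bias is an assumption, needed for 8 bits to cover the 1..256 range.

def pack_delays(rise, fall):
    """Pack rise/fall delays (1..256 units each) into a 16-bit word."""
    assert 1 <= rise <= 256 and 1 <= fall <= 256
    return ((rise - 1) << 8) | (fall - 1)

def unpack_delays(word):
    """Recover the (rise, fall) delay pair from a 16-bit word."""
    return ((word >> 8) & 0xFF) + 1, (word & 0xFF) + 1
```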
The model and test patterns are downloaded and distributed among the Los Gatos
Logic Simulation Machine logic processors. When running, an instruction is fetched from
the instruction memory and the inputs are fetched from the data input memory according
to the data addresses in the instruction. The function code and the data are passed to
the logic unit where they are evaluated. The result is sent to the inter-processor switch
which can be latched into the output data memory. This fetch, evaluate, and store cycle
is called an instruction step. Instruction steps are executed in a seven stage pipeline,
Fig. 3.5. The IBM Los Gatos Logic Simulation Machine Logic processors each contain an instruction memory, two data memories, a logic unit, and a gate delay value memory. The instruction memory contains room for 1K of 80-bit instruction words. Each instruction word contains a function opcode, five input operand addresses, and other control information. The input operand addresses represent the simulated gate inputs. If a gate has more than 5 inputs, it is decomposed into a sequence of five-input functions.
one instruction following another in the pipeline and each instruction completing every
100 nanoseconds [28].
The designers should be complimented on their bold attempt at an unusual simulation
approach. The time-driven approach simplifies the machine. The disadvantages
of this system, however, are:
• The system simulates gates even during non-event times.
• To simulate delays, the system actually holds the function outputs until a specified
number of delays has passed. This wastes simulation time and slows the system
down.
3.1.3 Barto and Szygenda’s Hardware Simulator
A 1980 PhD dissertation at the University of Texas detailed the implementation
of a special high performance simulation machine architecture designed to perform high
speed logic simulation. This special purpose simulation machine hardware is based on
distributed processing. The machine is an event-driven, fine-grained, conservative logic
simulator. The research notes that some problems are not easily handled by basic Von
Neumann computer architectures. Problems involving large data structures which must
be moved, searched, etc. are often not aided by the underlying computer architecture. A
programmer finds that the hardware provides little or no assistance in setting up the data
structures or data flows required by the problem [13]. Large databases, and event-driven
simulators are examples of problems which fall into this category. Several attempts have
been made to define architectures for databases. These attempts incorporated the application
of associative memory and intelligent disk systems. The hardware simulator research
presents a possible architecture for logic simulation [13]. The architecture of the machine
is illustrated in Figure 3.6. The logic simulator described here supports a table based,
event-driven simulator. The logic simulator performs according to the algorithm of
Table 3.1.
The evaluation phase consists of two tasks. The evaluation phase searches through
the activity flags to determine which gates are active during a particular simulation time
increment. In its second task, the evaluation phase calculates the gate changes or element
outputs and inputs according to the activity flags. The evaluation phase determines the
required gate changes and schedules them to occur appropriately. When the evaluation
phase is finished, the update phase commits the changes into the simulation.
The update phase consists of two concurrent tasks. The update phase propagates
the signals through the gates using the scheduling information generated during the
evaluation phase [13]. Gates whose output changes also have their activity flags set
during the update phase. The activity flags indicate which gates are active in the current
simulation cycle. Those gates with flagged activity indicators will be evaluated during
the evaluation phase.
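A minimal sketch of this two-phase scheme, with hypothetical gate records and flag sets standing in for the contents of the activity flag and fan-out memories:

```python
# Sketch of the two-phase eval/update scheme: the evaluation phase scans
# the activity flags and computes new outputs for flagged gates; the
# update phase commits changed outputs and flags the fan-out gates for
# the next cycle. Gate and flag representations are hypothetical.

def eval_phase(gates, active):
    """Compute (but do not commit) new outputs for the active gates."""
    pending = {}
    for name in active:
        g = gates[name]
        pending[name] = g['eval'](g['inputs'])
    return pending

def update_phase(gates, fanout, pending):
    """Commit changed outputs, propagate them, and return next activity set."""
    next_active = set()
    for name, value in pending.items():
        if gates[name]['output'] != value:
            gates[name]['output'] = value
            for dest in fanout.get(name, []):
                gates[dest]['inputs'][name] = value
                next_active.add(dest)
    return next_active
```

Only gates whose outputs actually change set activity flags, so quiescent parts of the circuit cost nothing in later cycles.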
The Event Queue Processor (EQP) schedules events which need to be evaluated
in future simulation cycles in the Event Queue Memory (EQM). The EQP maintains
the EQM as linked lists of events. Each list contains a gate descriptor and its events
are sorted in chronological order. The scheduled events contain a field, called the time
count-down (TC) field, which stores the amount of time remaining until the event is
process SIMULATE
begin
    Load and initialize simulation data;
    STI = 0;    /* STI = Simulation Time Increment */
    while (STI < STImax) do
    begin
        EVAL phase;
        UPDATE phase;
        STI = STI + 1;
    end
end.
Table 3.1. Barto's Simulator Algorithm Barto's simulator runs using the algorithm described in this table. The algorithm consists of the evaluation phase and the update phase. The evaluation phase searches through its list of gates to determine which need evaluation during this simulation cycle. The update phase performs the required gate evaluations and propagates the results.
to occur. On each pass through the update processor, this TC field of the gate event
descriptor is decremented and when the TC field equals zero, the event occurs.
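The TC countdown can be sketched as a single update-processor pass over the pending events; the field names here are hypothetical:

```python
# Illustrative sketch of the EQM time count-down (TC) scheme: each pending
# event carries a TC field that the update pass decrements; the event
# fires when TC reaches zero. Event field names are hypothetical.

def update_pass(events):
    """One pass of the update processor; returns the events that fired."""
    fired = []
    for ev in events:
        ev['tc'] -= 1
        if ev['tc'] == 0:
            fired.append(ev)
    # Fired events are removed from the queue; the rest keep counting down.
    events[:] = [ev for ev in events if ev['tc'] > 0]
    return fired
```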
Barto discusses some of the reasons for the low performance of software simulators.
The memory system of a general purpose computer has essentially no structure at all. A
software simulator operating on a general purpose machine must impose a structure on
the memory and move the data. To accomplish the data manipulations, the program’s
instructions must be fetched from memory and then executed. It may take several
instructions to move one word of data, since its present and new addresses must be
calculated, the word must be loaded from memory into a register, then stored somewhere
else in memory. The time required for this process will depend on the efficiency of the
computer’s architecture. Still, no general purpose machine has an architecture which is
natural for logic simulation [13].
To its credit, this machine is one of the earliest proposed simulation engines. The
system has the following advantages and disadvantages:
1. Advantages
• It is a specialized simulation engine, not composed of general-purpose proces-
sors.
• This machine is an event-driven simulation machine.
2. Disadvantages
• It is not expandable in terms of the processing power focused on the simulation
problem; the processing power is limited.
Fig. 3.6. Barto's Logic Simulator Architecture The logic simulator architecture is composed of three processors: the update, the evaluation, and the event queue processors. The system also contains 5 memories: the status and data memory (SDM), the fan-in memory (FIM), the fan-out memory (FOM), the activity flag memory (AFM), and the event queue memory (EQM). The AFM contains flags indicating which simulation gates are currently active. The AFM is used in conjunction with the FOM to direct gate evaluation results. The EQM serves as the simulation's event list and is maintained by the event queue processor. The majority of the simulation work is performed by the evaluation and update processors during the update phase.
3.1.4 Abramovici’s Logic Simulation Machine
Abramovici et al developed a distributed parallel processing architecture for han-
dling logic simulation [1]. The architecture, proposed in December of 1981, is based on
pipelining and concurrency. The simulator is an event-driven, fine-grained, conserva-
tive machine. The model is capable of handling both simple gates and more complex
functions. The employed timing analysis can handle more than simple unit gate delays.
Separate Processing Units (PU) are dedicated to specific tasks of the algorithm,
such that the entire logic simulation algorithm is executed as a result of the cooperation
between the individual PUs working concurrently [1]. The logic simulation tasks are
pipelined.
Each process illustrated in Figure 3.7 is assigned to a process unit. Concurrency
is achieved by pipelining the dataflow. The Event List Manager PU receives the future
events and event times or event cancellations from the scheduler PU. The Event List
Manager orders the events in causal order in the Event List Memory. When all the
events scheduled at a particular time have finished processing, the scheduler signals the
Event List Memory PU to advance to the next time cycle. The Event List Memory PU
then activates the Current Event Processor. The Current Event Processor receives a new
time value and a pointer to the first event on the list of events for that time. When the
Current Event Processor finishes the list, the Event List Memory PU issues a finished
signal.
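The Event List Manager's causal ordering can be sketched with a priority queue keyed on timestamp; the class and method names below are illustrative, not taken from the paper:

```python
import heapq

# Sketch of the Event List Manager's role: keep scheduled events in causal
# (timestamp) order, and hand the Current Event Processor the full list of
# events for the next time value. Class and method names are illustrative.

class EventList:
    def __init__(self):
        self._heap = []

    def schedule(self, time, event):
        """Insert an event from the Scheduler in causal order."""
        heapq.heappush(self._heap, (time, event))

    def next_cycle(self):
        """Advance to the next time and return all events at that time."""
        if not self._heap:
            return None, []
        time = self._heap[0][0]
        events = []
        while self._heap and self._heap[0][0] == time:
            events.append(heapq.heappop(self._heap)[1])
        return time, events
```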
The Current Event Processor retrieves each event from the list of events for a
particular event time from the Event List Memory. Each event is sent to the Model Access
Fig. 3.7. Logic Simulation Machine The Simulation Machine architecture with its refinement for simple evaluations is illustrated. The Event List Manager orders the events received from the Scheduler in increasing time order and stores them in the Event List Memory. The Current Event Processor retrieves events in order from the Event List Memory and sends them to the Model Accessing Unit and Simple Configuration Processor for evaluation. The Model Accessing Unit finds all the event receivers and sends the receiver addresses to the Simple Configuration Processor. The Simple Configuration Processor forwards the gates impacted by the events to the Evaluator. The results after evaluation are sent to the Scheduler, which delays transmission of the results by the appropriate delay for each gate.
Unit and the Simple Configuration Processor (Simple Config Proc). The Current Event
Processor can perform user interaction and simulation control tasks such as processing
user-set break points and system state monitoring.
The Model Accessing Unit retrieves each element which receives the results from
the current event. This fanout list is stored in the Model Accessing Unit’s local memory.
The fanout list entries propagate to the Simple Configuration Processor. When the
fanout list from the current event is depleted, the Model Accessing Unit forwards a done
message to the Simple Configuration Processor PU.
Both the Function Configuration Processor (Func Conf Proc) and the Simple
Configuration Processor receive the current event from the Current Event Processor and
the fanout list elements from the Model Accessing Unit. The Simple Configuration Pro-
cessor maintains a definition of each type of element which is forwarded to the Evaluation
(Eval) unit.
The Functional Configuration Processor forwards large functional element param-
eters to the appropriate Functional Evaluator (FEV) in which the function is statically
assigned. Small functional elements may be dynamically assigned to idle FEVs. The
Functional Configuration processor must transmit the configuration of the dynamically
configured FEV.
The Evaluation Unit (Eval) receives the configuration and element types to be
evaluated from the Simple Configuration Processor [1]. The Evaluator performs the
event evaluations and sends the results to the Scheduler. The Evaluation Unit may also
send a cancellation event message to the Scheduler. Cancellations occur when an element
is re-evaluated, changing the result of a previously scheduled event.
Finally, the results from the Evaluator and the FEVs are forwarded to the Sched-
uler. The Scheduler retrieves the event delay (service time) from its local memory.
Delays may be associated with a type of element or be specific to a particular circuit
element. The Scheduler determines the time of an event by adding its delay to the cur-
rent simulation time and then the new event and its simulation time are forwarded to
the Event List Manager. Event cancellations are also forwarded to the Event List Man-
ager. The PU operations on the events are quite simple, involving either data transfer or
logic/arithmetic. The designers selected micro-programmable processors to serve as the
PUs. Microcode could avoid substantial software overhead involved in fetching and de-
coding macro-instructions [1]. Any changes to the number of logic values used, the delay
modeling, or the timing analysis can be incorporated by changing the microinstructions.
The system architecture need not be modified.
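The Scheduler's timestamping step can be sketched as follows; the precedence of element-specific delays over per-type delays follows the description above, while the record layout is hypothetical:

```python
# Sketch of the Scheduler step: look up the element's delay (service time)
# in a local table and timestamp the event at current time + delay.
# The delay table and event record layout are hypothetical. An
# element-specific delay takes precedence over the per-type delay.

def schedule_result(result, current_time, delay_table):
    """Timestamp an evaluation result for the Event List Manager."""
    delay = delay_table.get(result['element'],
                            delay_table.get(result['type']))
    return {'element': result['element'],
            'value': result['value'],
            'time': current_time + delay}
```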
Some of the machine’s advantages and disadvantages are as follows:
1. Advantages
• The processors are microcoded to tailor the instruction set architecture to the
simulation problem tasks.
2. Disadvantages
• The architecture cannot be easily scaled to focus more processing power on
the simulation problem.
3.1.5 Levendel, Menon, and Patel’s Logic Simulator
The following logic simulator was developed by Y. H. Levendel, P. R. Menon and S.
H. Patel. The simulator served as the basis of Patel’s 1982 PhD dissertation at the Illinois
Institute of Technology. The simulator differs from the work by Barto and Abramovici in
that multiple processors do not perform dedicated tasks. Barto’s machine, described in
Section 3.1.3, contains processors which are dedicated to event queue management, gate
evaluation, and fan-out updates (see Figure 3.6). The machine is an event-driven, fine-
grained, conservative logic simulator. The machine developed by Abramovici contains
processors dedicated to circuit evaluations and event list management (see Figure 3.7).
In Levendel’s machine, a host pre-processor distributes the sub-circuits of a design among
various homogeneous processors. The modularity of this design allows an easy increase
of computational power to be assigned to the simulation. The architecture is illustrated
in Figure 3.8.
This simulator consists of processors p1 to pn. The circuit to be simulated is
referred to as the target circuit. The target circuit is partitioned into blocks a1 through
an. The circuit connections between blocks ai and aj are designated as bij. It should
be noted that blocks are not necessarily circuit clusters, that is to say that the elements
in a block can be from disjoint portions of the circuit. Each circuit block ax is mapped
onto a processor py and is then called sub-circuit cz as illustrated in Figure 3.9. During
the simulation, each sub-circuit, cz , is simulated independently. Different sub-circuits
become active as the signals proceed from the primary inputs to the primary outputs. As
the simulation progresses, data is carried between sub-circuits ci and cj changing the logic
Fig. 3.8. Levendel's Logic Simulator Architecture The logic simulator architecture includes a communications structure, a controlling processor, several simple subordinate evaluators for simulating gate-level blocks, and several functional subordinate evaluators for simulating functional blocks of the design under test. A cross-point matrix is used to connect the controlling processor with the simple subordinate evaluators. The functional subordinate evaluators are connected to the same cross-point matrix through a bus interface unit and a parallel bus [98].
values of bij . The interconnect between the original sub-circuit blocks, bij , is now referred
to as dij , which is the datapath between sub-circuits ci and cj . The controlling processor
and the Simple Evaluators are connected via a cross-point matrix. The Functional
Evaluators are connected to the cross-point matrix through a bus interface unit and a
parallel bus. The bus interface is shown in Figure 3.10. The parallel bus has sufficient
speed for the Functional Evaluators according to a timing calculation performed within
the study.
Concurrency during simulation is achieved by allowing the sub-circuits ci and
cj to be evaluated independently. Different circuits will become active as signal values
proceed from the primary inputs to the primary output [98].
The simulator is configured to consist of one controlling processor and a multi-
tude of subordinate processors which are interconnected by a communications structure.
Processors pi and pj in Figure 3.9 are both subordinate processors. The sub-circuits ci
and cj reside in the subordinate processor’s memories.
The system works as follows. At the beginning of each simulation cycle, the con-
trolling processor sends any primary inputs required to each subordinate processor using
the communication structure. The controlling processor then issues a start signal to the
subordinate processors ordering them to begin the next simulation cycle. The subordi-
nate processor may generate events which will need to be forwarded to other processors
for future cycles of the simulation. In the case of logic simulations, a change in a logic
value on an output signal line becomes a scheduled event. In this system, only events
scheduled for the immediately following simulation time cycle are transferred between
the subordinate processors in order to reduce communications overhead. Therefore, the
Fig. 3.9. Mapping Circuit Blocks ai and aj to Processors pi and pj The simulator consists of processors p1 through pn. The target circuit is subdivided into blocks a1 through an. The circuit connections which span two blocks are designated as bij. Each circuit block, aq, is then mapped onto a processor, pq, as sub-circuit cq. The original interconnections, bij, between the circuit blocks ai and aj are mapped into the datapath dij. The blocks aq may contain disjoint pieces of the original circuit; they are not necessarily clustered sections of the original circuit.
Fig. 3.10. Interface Between the Data Sequencers and the Time-Shared Parallel Bus This figure illustrates the communications signals required between the data sequencers and the time-shared parallel bus. When the Output Data Sequencer (ODS) has data to transmit, it pulls the Request to Send (RTS) line high. The data sequencer receives permission to transmit when it receives a pulse back on the bus grant line. The ODS then transmits all the data in its Output FIFO Buffer (OFB). The receiver stores the inbound data in its local Input FIFO Buffer (IFB). The ODS then sets the RTS line low, which releases the bus. All requesting bus users have equal priority.
scheduled time need not be sent explicitly, allowing messages to accumulate and
thereby saving on the number of transmitted communications overhead bytes. Data
transferred to the controlling processor from the subordinate processors consists of the
primary outputs and any user requested data.
When the subordinate processors finish processing their sub-circuits for the sim-
ulation cycle, they each inform the controlling processor. When all the subordinate
processors report their completion, and the controlling processor has also finished trans-
ferring its primary inputs scheduled for the next simulation cycle to the subordinate
processors, the controlling processor broadcasts a start cycle to the subordinate proces-
sors beginning the next simulation cycle.
The controlling processor is depicted in Figure 3.11. The controlling processor
contains local memory. Each subordinate unit also consists of a processing unit and local
memory. The controller and the subordinate processors both have one input and one
output FIFO buffer to handle inbound and outbound data messages. They also each
have one input and one output data sequencer for the communications interface.
The subordinate processor configuration is illustrated in Figure 3.12. The Pro-
cessing Unit (PU) is a 16-bit microprocessor. The input and output data sequencers
are either specially designed Application Specific Integrated Chips (ASIC) or single chip
microcomputers. The subordinate PU evaluates the circuit elements or functions. The
PU contains the circuit blocks to be simulated. The Output Data Sequencer (ODS) is
isolated from the PU by a FIFO buffer. The ODS allows the subordinate unit to send
data whether or not that subordinate’s PU is active during a simulation cycle. The Input
Fig. 3.11. The Controlling Processor Unit Configuration The controlling processor serves as the interface between the general-purpose host computer and the simulator [98]. The controlling processor synchronizes the subordinate processors, maintains the simulation clock, supplies the subordinate processors with their primary inputs, and gathers the primary output values from the simulation. The controlling processor unit also maintains the user-requested simulation monitor values.
Fig. 3.12. Subordinate Processor Unit Configuration The subordinate processor unit (PU) represents both the Simple Evaluator processors and the Functional Evaluator processors. The PU is a general-purpose 16-bit microprocessor. The PU evaluates the circuit function or element. Each PU receives a block of the target circuit when the target circuit is initially partitioned. The input and output data sequencers establish connections via the communications structure and transfer data to and from the FIFO buffers, respectively. Data is transferred to and from other subordinate processors or the controlling processor, which are also connected to the communications structure.
Data Sequencer (IDS) behaves and is configured in much the same way. Thus the PU runs
two concurrent processes: the simulation process and the communication process.
The PU stores data destined for the controller or other subordinate PUs in the
output FIFO buffer (OFB). The ODS will request an appropriate channel on the com-
munications structure if there is data which must be transferred from the OFB. When
granted access to the communications structure, the ODS transmits the data across the
communications channel. Data received by an IDS is placed in the Input FIFO Buffer
(IFB). To separate data from different simulation cycles, the data in the OFB is sepa-
rated by end of data (EOD) markers. The EOD marker also allows the PU to write new
data into the FIFO before the ODS has finished transferring data out. The same system
is implemented in the IFBs.
Two dedicated signal lines run between the controller and the subordinate units,
synchronizing the simulation. The controlling processor signals the subordinate proces-
sors using the start signal, and the subordinate processors signal the controller using the
done signal. The done line is pulled active when all the subordinate processors have fin-
ished their individual processing. The asserted start signal from the controller indicates
that the subordinate processors can initiate processing for the next simulation cycle.
The start signal causes all subordinate processors to load an EOD marker into the IFBs.
The EOD marks the end of input data arriving from other subordinate processors and
the controller during the current simulation cycle. When the PU reaches the EOD flag
in its IFB, then that PU has loaded all the required data for this simulation cycle, and
the PU may now begin processing the events. The start signal also alerts the ODS to
begin sending out data for the next simulation cycle.
When the PU has finished processing events for a simulation cycle, the PU loads
an EOD marker into the OFB and begins to process the next simulation cycle. When
the ODS encounters the EOD marker in the OFB, the ODS has finished transferring
its data for the current cycle. The ODS then signals the controlling processor using the
done line.
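The start/EOD handshake described above can be sketched in Python. This is a behavioral sketch only; the class and marker names are illustrative and do not appear in [98]. It shows how the EOD marker lets a PU begin processing one cycle's events while data for the next cycle is still being written into the same FIFO.

```python
from collections import deque

EOD = object()  # hypothetical end-of-data marker separating simulation cycles


class SubordinateFIFO:
    """Sketch of an Input FIFO Buffer (IFB): data from different
    simulation cycles is separated by EOD markers, so new data may be
    written in before the earlier cycle's data has been drained."""

    def __init__(self):
        self.buf = deque()

    def on_start(self):
        # The controller's start signal loads an EOD marker, closing off
        # the data that belongs to the current simulation cycle.
        self.buf.append(EOD)

    def receive(self, item):
        self.buf.append(item)

    def drain_cycle(self):
        # The PU reads until it reaches the EOD flag; at that point all
        # required inputs for this cycle have been loaded.
        events = []
        while self.buf:
            item = self.buf.popleft()
            if item is EOD:
                break
            events.append(item)
        return events


ifb = SubordinateFIFO()
ifb.receive("e1")
ifb.receive("e2")
ifb.on_start()            # EOD closes the current cycle
ifb.receive("e3")         # already belongs to the next cycle
print(ifb.drain_cycle())  # ['e1', 'e2']
```

The same mechanism, run in the opposite direction with the done line, covers the OFB side.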
The processor unit (PU) contains three important data structures which are rel-
evant to its operation. The three data structures are the:
• Sub-Circuit Description Table
• Activity List
• Event Queue
The sub-circuit description table contains the requisite information needed to
evaluate and process the PU's assigned sub-circuit. For each element in that sub-circuit,
the table contains the value, type, delay, input status word pointer, signal values on the
fan-in lines, and the corresponding fanout list which handles signals bound for other
subordinate processors. The external fanout list requires more space than the internal
fanout list, because the value, destination processor, and element index information must
be stored for the external fanout.
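The layout of one sub-circuit description table entry might be sketched as follows. The field names are illustrative, not taken from [98]; the sketch only shows why external fanout entries are larger, since each must carry a destination processor and element index in addition to the value.

```python
from dataclasses import dataclass, field


@dataclass
class ExternalFanout:
    # External fanout needs more space than internal fanout: the
    # destination processor and element index must be stored along
    # with the signal value.
    dest_processor: int
    element_index: int
    value: int = 0


@dataclass
class ElementEntry:
    """One row of the sub-circuit description table (illustrative)."""
    value: int
    elem_type: str
    delay: int
    input_status_ptr: int
    fanin_values: list
    internal_fanout: list = field(default_factory=list)  # element indices on this PU
    external_fanout: list = field(default_factory=list)  # ExternalFanout records


entry = ElementEntry(value=0, elem_type="nand", delay=1,
                     input_status_ptr=0x10, fanin_values=[1, 1],
                     external_fanout=[ExternalFanout(dest_processor=3,
                                                     element_index=42)])
print(entry.external_fanout[0].dest_processor)  # 3
```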
The controlling processor connects the simulator to the general purpose host plat-
form. The controlling processor is illustrated in Figure 3.11. The controlling processor
maintains the simulated time, synchronizes the subordinate processors, supplies primary
data to the subordinate processors, and gathers input from the subordinate processors.
The controlling processor is similar to the subordinate units. It also contains a cen-
tral processing unit (CPU) with local memory, input and output FIFO buffers (IFB
and OFB), and input and output data sequencers (IDS and ODS). The controller is
connected to the subordinate processor units via the communications structure. The
controller initiates processing for each simulation cycle by issuing the start signal. When
all the subordinate processors have reported they are finished, the controller sets the
done signal indicating the current simulation cycle is over.
In the controlling processor, the start signal is also wired to the controller’s ODS
unit. The start signal tells the ODS to begin transferring data for the cycle. The
ODS transmits its data across the communications structure until it encounters the
EOD marker, at which point the ODS signals the controlling CPU that it has finished
transferring data by setting the done signal.
The communications structure is divided into two sub-structures. The first struc-
ture is a time-shared parallel bus which connects to the slower Functional Evaluator pro-
cessors. The interface between the parallel bus and the cross-point matrix is illustrated
in Figure 3.13. The second communications structure is the cross-point matrix which
connects directly to the faster Simple Evaluators and the parallel bus.
The interface between the time-shared parallel bus and the data sequencers is
illustrated in Figure 3.10. When a Functional Evaluator’s ODS has data to send, it sets
the request to send (RTS) signal high. The bus control grants permission to the ODS
by signalling on the bus grant line. The ODS then sends all of its data to the receiving
IDS. The receiving IDS stores the data in its local IFB. When finished, the ODS sets
Fig. 3.13. Interface Between the Parallel Bus and the Cross-Point Matrix. The study [98] demonstrated that although a cross-point matrix is preferable for communications between the Simple Evaluators, a parallel bus is cost effective and sufficient for communications to and from the Functional Evaluators. The Bus Interface Unit is designed to transfer data between the cross-point matrix and the parallel bus. Data Sequencer 1 transfers data from the Functional Evaluators connected to the parallel bus to the Simple Evaluators and the controlling processor, which are connected to the cross-point matrix. The parallel data from the bus must be transmitted serially across the cross-point matrix. Data Sequencer 2 sends data from the cross-point matrix to the parallel bus, again translating the input serial data to output parallel bus data.
the RTS line low, releasing the time-shared parallel bus. All subordinate processors have
equal priority on the bus.
The data transferred between subordinate processors or to the controlling pro-
cessor consists of events scheduled for the next simulation cycle. The data sent to the
controlling processor consists of the return address of the sending subordinate processor,
the element number (either a primary output or user requested monitor point) and the
element value. A separate request line, request to send to the controller (RTSC), is used
to address the controlling processor. When transmitting to the controlling processor,
the sending ODS address lines contain the sending subordinate unit’s address.
The interface between the Simple Evaluators’ data sequencers and the cross-point
matrix is illustrated in Figure 3.14. To send data, the ODS puts the destination address
on the address lines and signals on the RTS line. If the destination is not busy, the
transfer request is granted. The ODS transmits its data serially to the receiving IDS
which stores the data in its local IFB. The data ready signal is used to indicate the
presence of data at the IDS. If an access request is denied, the data is stored locally and
attempts are made to send blocked data later. Access requests to the cross-point matrix
are denied if the destination is preoccupied with another incoming call. Call blocking
is controlled via the busy/grant line.
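The RTS/busy-grant arbitration just described can be sketched as follows. This is a minimal behavioral model under the assumption that each destination port can accept only one incoming call at a time; the class and method names are illustrative.

```python
class CrossPointMatrix:
    """Sketch of busy/grant arbitration: a request to send (RTS) to a
    destination is granted only if that destination is not already
    occupied with another incoming call; a blocked sender must store
    its data locally and retry later."""

    def __init__(self, n_ports):
        self.busy = [False] * n_ports   # is each destination occupied?

    def request(self, dest):
        if self.busy[dest]:
            return False                # call blocked: busy stays asserted
        self.busy[dest] = True          # grant: channel belongs to the sender
        return True

    def release(self, dest):
        self.busy[dest] = False         # transfer done, free the destination


m = CrossPointMatrix(4)
print(m.request(2))   # True  - granted
print(m.request(2))   # False - blocked; sender retries later
m.release(2)
print(m.request(2))   # True
```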
The machine by Levendel has several advantages and disadvantages:
1. Advantages
• The presented machine is scalable, and additional processing hardware can
be effectively added to increase the simulation execution speed.
Fig. 3.14. Interface Between the Data Sequencers and a Cross-Point Matrix. This figure illustrates the communications signals required between the data sequencers and a cross-point matrix. When the Output Data Sequencer (ODS) has data to transmit, it puts the destination address on the address lines and pulls the Request to Send (RTS) line high. If the destination is not busy, the matrix control grants the request. The Output Data Sequencer then sends the data across the cross-point matrix channel serially. The receiver stores the inbound data in its local Input FIFO Buffer (IFB). The Data Ready signal line is used to show the presence of data. If the destination is busy when the ODS attempts to do a data transfer, the data is stored back in the ODS's local memory, and an RTS is performed for the next destination. Later, the ODS will re-attempt transmission of the blocked data.
2. Disadvantages
• The system is synchronous and time-driven.
• The machine consists of a multiprocessor-based architecture.
3.1.6 Megalogican
The Megalogican contains a special purpose computing engine attached to a gen-
eral purpose workstation which is used for logic simulation and design verification [63].
The Megalogican was announced to the public in November of 1983. The machine is an
event-driven, coarse or fine-grained, conservative logic simulator. The system is an 80286
computing platform with three bit-slice engines. The bit-slice engines connect directly to
dedicated memory and to two neighboring processors through a hardware FIFO queue
forming a three-processor ring [20]. The three connected units are the State Unit, the
Evaluation Unit, and the Queue Unit.
The Queue Unit maintains the event queue, or list of simulation events. The
Queue Unit receives the results of the Evaluation Unit and provides events in time order
to the State Unit. The State Unit receives the net values from the host and maintains the
gate values along with the connectivity information in a state array. The evaluation unit
takes each logic element’s input values and function and generates the new output value.
Specific tasks were encoded as microcode instructions. Hardware-accelerated simulations
demonstrated a 100 fold speed increase over their software counterparts running on an
80286. The hardware simulator was capable of 100,000 gate evaluations per second and
was capable of handling circuits of 64,000 primitives.
Some of the Megalogican’s advantages and disadvantages are as follows:
Fig. 3.15. Megalogican Architecture. The Megalogican is composed of three processors. The Queue Unit provides events in time order to the State Processor. The State Processor gathers and maintains network values along with the circuit connectivity information in a state array. The Evaluation Processor uses the logic element's input values and functions to generate the resulting output values.
1. Advantages
• The system has some flexibility due to the use of microcoded functions im-
plemented in the system processors.
• The system was intended to be commercially available to the general public.
2. Disadvantages
• The architecture could not be easily scaled to focus more processing power
on a simulation problem.
3.1.7 The IBM Yorktown Simulation Engine
The IBM Yorktown Simulation Engine (YSE) [119] is a descendant of the Los
Gatos Logic Simulation Machine [28, 78, 88]. The machine was proposed sometime
before 1983. The YSE is a special purpose, parallel, programmable computer for logic
gate-level simulation. Like the Los Gatos Logic Simulation Machine, the YSE is also not
event-driven. It is a time-driven, fine-grained logic simulator.
The YSE architecture is also composed of logic processors, array processors, and
a control processor. The logic processor simulates a portion of the total system logic,
up to a maximum of 8K gates per processor. The gates in each processor are simulated
serially, at a rate of 80 ns per gate [119]. The array processors simulate storage devices
such as RAMs and Read Only Memories (ROMs). The control processor provides com-
munication between the YSE and a host machine. The control processor loads the YSE
processors with the necessary simulation data. There is also an inter-processor switch
which connects up to 256 array and logic processors with the control processors during
the simulation.
In the YSE, the logic processors are capable of three modes of operation. Two of the
three logic processor internal arrangements can be seen illustrated in Figure 3.16. The
first mode is the unit-delay mode. In the unit-delay mode, each gate has the same delay.
So a combinatorial network of N levels of depth takes N time units to stabilize. The
second mode is referred to as rank-order. In rank-order mode, the gates are connected
so that all equal depth combinatorial networks stabilize in a single time unit. Finally,
the third and last mode is called the mixed-mode. In the mixed-mode, the combinatorial
networks are simulated in rank-order mode and storage units are modeled in the unit-
delay mode. A single simulation clock cycle carries out one clock cycle of the simulated
machine. In actual use, the YSE is generally run in mixed mode.
The logic unit in the YSE is actually nothing more than a simple RAM access
that reads a value from a function table [119]. The YSE's operations thus consist entirely
of table lookups; there are no conditionals, branches, etc.
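The table-lookup evaluation can be illustrated with a toy sketch. The real YSE operates on multi-valued logic read from an instruction-addressed RAM; this binary two-input version is only an assumption-laden miniature showing that evaluation is a pure index into a function table, with no branching.

```python
# Each gate function is a table indexed by the concatenated input bits.
# A two-input gate over {0, 1} needs a 4-entry table.
AND_TABLE  = [0, 0, 0, 1]   # indexed by (a << 1) | b
NAND_TABLE = [1, 1, 1, 0]


def evaluate(table, a, b):
    # No conditionals or branches: evaluation is purely a RAM read.
    return table[(a << 1) | b]


print(evaluate(AND_TABLE, 1, 1))   # 1
print(evaluate(NAND_TABLE, 1, 1))  # 0
```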
In the rank-order mode, the gate instructions must be executed in order. No
gate’s instruction can be executed before those of its predecessor gates. This ordering
prohibits feedback, so it is impossible to conveniently simulate memory [119]. Rank-
order simulation imposes an order on instruction execution which allows any equal depth
combinatorial network to produce the correct output results at the end of each simulation
cycle.
Fig. 3.16. The YSE Logic Processor Configuration. Two of the three logic processor internal arrangements are illustrated; the mixed-mode, described in Section 3.1.7, is not shown. In rank-order mode, the gates are connected so that all equal depth combinatorial networks stabilize in a single time unit. In unit-delay mode, each gate has the same delay, so a combinatorial network of N levels of depth takes N time units to stabilize.
The unit-delay mode does not require this type of ordered execution. The results
of instruction executions do not affect any other instruction’s inputs until the next simu-
lation cycle. Therefore instructions may be executed in any order within each simulation
cycle. In the unit-delay mode, the processor configuration is similar to the rank-order
processor structure, except the memory is divided into parts A and B. During the unit-
delay mode simulation, alternating simulation time cycles take turns reading/writing to
the processor data memory. In the even cycles, the processor might write to the “B”
memory and read from the “A” memory. In odd cycles, the processor would then write to
the “A” memory and read from the “B” memory. The net effect is that every simulation
cycle performs a single gate delay for every gate in the entire simulated machine [119].
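The order-independence of unit-delay execution can be shown with a small sketch. This assumes a simplified gate representation (an output name mapped to a function and input names, which are not YSE structures): because every gate reads only the previous cycle's memory and writes only the next cycle's, the gates may be evaluated in any order within a cycle, which is exactly what the A/B memory swap enables.

```python
def unit_delay_cycle(read_mem, gates):
    """One unit-delay cycle: every gate reads previous-cycle values
    (the "A" memory) and produces next-cycle values (the "B" memory),
    so execution order within the cycle does not matter."""
    return {out: fn(*(read_mem[i] for i in ins))
            for out, (fn, ins) in gates.items()}


gates = {
    "x": (lambda a, b: a & b, ("a", "b")),
    "a": (lambda a, b: a, ("a", "b")),   # primary inputs carried forward
    "b": (lambda a, b: b, ("a", "b")),
}

mem_a = {"a": 1, "b": 1, "x": 0}          # "A" memory: previous values
mem_b = unit_delay_cycle(mem_a, gates)    # write results to "B" memory
mem_a = unit_delay_cycle(mem_b, gates)    # roles swap on the next cycle
print(mem_b["x"])  # 1
```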
Mixed-mode allows the unit-delay and rank-order modes to be combined. In
mixed-mode, memory elements are simulated in the unit-delay mode and the combina-
toric logic is simulated in rank-order mode.
The inter-processor switch allows communications between all the YSE processors
during the simulation. A sample switch port to a logic processor connection is illustrated
in Figure 3.17. In the YSE, all processors operate synchronously, with a common clock
and identical values in their program counters. The processors may all execute different
instructions; however, each processor will execute its first, second, ..., kth instruction
in lock step [119]. The YSE takes advantage of this synchronization by sending each
processor’s result to all the other processors via the switch multiplexor. So, at each time
increment, T, each switch multiplexor has every processor’s kth result at its data input.
Results generated at time T in one processor can be at any other processor’s inputs by
time T + 1.
Fig. 3.17. A Switch Port "K" Example with its Logic Port Connection. The inter-processor switch allows communications between all the YSE processors during the simulation. A sample switch port to a logic processor connection is illustrated.
The YSE array processor was designed to consist of two parts, the parallel adapter
and the backing store processor. The parallel adapter, or PAD, collects array data from
the logic processors, passing the data on to the backing store processor, or BSP. The
PAD also works in the reverse direction, distributing data from the BSP to the logic processors.
The PAD contains input and output memories, and an instruction memory. The input
memory is similar to the logic processor’s input memory in form and function. The
input memory is loaded from the inter-processor switch in the same manner as the logic
processor. The instruction memory words contain addresses in the input memory which
contain the gate inputs for each simulation time cycle. The relevant data is transferred
to the BSP from the input memory. The instructions contain control codes which are
passed directly to the BSP. The control codes indicate:
• whether the signals passed this cycle are valid.
• the data type passed to the BSP (e.g., address, data to be written, write enable,
etc.).
• which array is being addressed.
• what operation is to be performed (read or write).
For example, in the case of a read operation, the BSP will write the data from
the array into the PAD’s output memory. From the PAD’s output memory, the data
would then be transferred to the inter-processor switch according to an additional PAD
instruction field.
The BSP contains an array descriptor memory and a large backing store. The
BSP also contains registers holding the array addresses, the data to be written, and
data which was read. Each entry of the array descriptor memory describes one of the
simulated arrays held in the backing store. The descriptors indicate the offset to the
beginning of the array and the array’s stride (size of each element). When the BSP
receives a read or write command, the BSP uses the appropriate descriptor to calculate
the target array address and then performs the requested function.
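The descriptor-based address calculation reduces to an offset-plus-stride computation, sketched below. The descriptor layout and values here are assumptions for illustration, not the BSP's actual format.

```python
def bsp_address(descriptor, index):
    """Compute a backing-store address from an array descriptor holding
    the array's offset and stride (size of each element), as the BSP
    does when servicing a read or write command."""
    offset, stride = descriptor
    return offset + index * stride


ram_descriptor = (0x1000, 4)                  # array at 0x1000, 4 bytes/element
print(hex(bsp_address(ram_descriptor, 3)))    # 0x100c
```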
To its credit, the YSE demonstrated that a speed increase of several orders of
magnitude is achievable for a gate-level logic simulation using a parallel, special purpose
machine. The simulator relies on the user performing separate timing verification via a
proof technique called the Level Sensitive Scan Design (LSSD) discipline. The timing
analysis is data-independent, so simulation is not required for the analysis.
However, the machine also has the following disadvantages:
• All the gates of the design are executed in every simulation cycle, regardless of
whether their input data are valid in a given cycle [119]. This is a waste of processor
time which could be used to accelerate the simulation.
• The scheduling problem. The YSE has some data flow problems which may be
handled by the YSE’s compiler through the insertion of nop instructions. One
limitation of the YSE’s inter-processor communications is that a processor can
only receive one value from one other processor at a time. If two instruction inputs
must be fed to a processor, the values cannot be written during the same simulation
cycle, as illustrated in Figure 3.18. The inputs must be staggered in time, as only
one can be written during each cycle. The extra wait cycle for the instruction must
be filled by a nop, or another independent instruction.
• The YSE does not handle non-deterministic simulations. There is no probabilistic
estimate of bus contention, data traffic, etc. All instructions take exactly the
same amount of time to execute. Scheduled instructions are always performed.
Switching contention is completely resolved at compile time. It was felt that “... if
the switch had to receive and arbitrate communications defined only at run time,
its control logic alone might well have exceeded the size of the entire current YSE
switch.” [119]
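The scheduling limitation in the second disadvantage above, where a processor can receive only one value per simulation cycle, can be sketched as a simple compile-time staggering pass. This is an illustrative model, not the YSE compiler's algorithm: conflicting writes are pushed to later cycles, and the gaps they leave in the sending processors' schedules are what the compiler fills with nops or independent instructions.

```python
def schedule_inputs(senders):
    """Stagger simultaneous writes to one processor across cycles.
    senders: list of (desired_cycle, sender) pairs; at most one value
    can be delivered per cycle, so conflicts slip to later cycles."""
    schedule = {}                  # cycle -> sender delivered in that cycle
    for cycle, sender in senders:
        while cycle in schedule:
            cycle += 1             # conflict: defer this write one cycle
        schedule[cycle] = sender
    return schedule


# Processors A and C both try to write to B in cycle 0:
print(schedule_inputs([(0, "A1"), (0, "C1")]))  # {0: 'A1', 1: 'C1'}
```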
3.1.8 HAL: A Block Level Logic Simulator
HAL is another high-speed hardware logic simulation machine which gains speed
by exploiting concurrency in simulation processes. The HAL results were initially re-
ported in 1983. HAL is a special purpose simulation engine which is approximately 10^3
times faster than a comparable software simulator, or about 10^5 times slower than the
actual machine [89]. HAL contains 32 distributed special parallel processors, which utilize
a Block Oriented Simulation Technique [125]. HAL is designed to simulate custom
designed Large Scale Integration (LSI) computers composed of a central processor unit,
a system controller and a memory unit. HAL is also designed to be capable of simulat-
ing large logic networks at high speed using a reduced amount of hardware. HAL is an
event-driven, coarse-grained, conservative logic simulator.
Fig. 3.18. The YSE Scheduling Problem. In this illustration, time runs vertically, placing later instructions lower in the schedules. The top drawing shows that both data items cannot be delivered to processor B at the same time. This scheduling dilemma can be solved by moving all the instructions of processor C down one position and inserting a NOP. Or, if an independent instruction which does not require inter-processor communication can be found by the compiler, it can be substituted for the NOP.
HAL derives much of its speedup by simulating all functions used in the simu-
lation as hardware. Specialized hardware modules perform most of the simulator pro-
cessing within a single step as opposed to multiple program steps required by a software
simulation. HAL also takes advantage of pipelining to allow concurrency among inde-
pendent hardware modules. The simulation algorithm is implemented by independent
sub-function sequences, and block event streams are fed into a pipeline that is com-
posed of hardware modules [89]. Finally, sub-circuits which lie on the same level in
the level-ordering can be executed in parallel on different processors.
The HAL simulation team cites an interesting example of a software simulator
evaluating a mainframe computer design. If a software simulator whose simulation
performance is about 10^8 times slower than an actual machine is used, it would take about
15 years to execute a test program, a task that would take only five seconds on an actual
machine [89].
In their formulation of the HAL simulator, three major delay simulation models
were considered: the zero-delay model, the unit-delay model, and the nominal-delay
model. The zero-delay simulation is only applicable to a synchronous logic circuit which
does not include feedback loops or delay-dependent units. The zero-delay simulation
arranges all gates in a block according to the signal propagation order, and assigns an
equal level number for those gates at the same logic depth. The zero-delay model handles
all logic elements as ideal switching elements with no switching delay. The zero-delay
algorithm exhibits a high degree of parallelism because all gates having the same level
or depth can be executed at the same time by different processors. The unit-delay model simply
handles all gates as having the same unit delay time. The unit-delay model allows delay-
dependent logic elements or feedback loops, so it is possible to model memory using the
delay-dependent model. Unit-delay however requires two sets of memory to store the
input and output of each logic element. In each simulation cycle, all gates are evaluated
by using the previous values, stored in the input memory. New values are stored in the
output memory. In the next simulation cycle, the values in the input memory and output
memory are exchanged so that previous outputs are used as inputs in the next cycle [89].
Unit-delay simulation requires more simulation cycles than zero-delay, because unit-delay
simulation often requires several cycles to allow the events in process to settle. The final
model is the nominal-delay model, which also has concurrency in gate evaluations for all
gates which belong to the same time period. The time span selected for a time period is
set for the duration of the simulation. However, the longer the time span, the greater the
amount of possible gate evaluation concurrency. The HAL simulation engine employs
the zero-delay model based on gate level-ordering.
The HAL simulation team also created categories of simulation granularity. For
logic simulation, the group divided the granularity levels into gate-level, block-level, and
function-level granularity. Gate-level simulation evaluates all gates on the same level
as individual units. Block level evaluation groups gates into collections of several tens
of gates which are evaluated as a unit. Functional-level evaluation is more complex
than block level evaluation. As the granularity of the simulation becomes coarser, inter-
granular event propagations are reduced significantly [89]. The granularity of event
execution is a benefit derived from executing deterministic logic simulations. Non-
deterministic event-driven simulations cannot be divided into blocks of events: future
events depend on previous events for their existence, and since the simulation events are
generated randomly, events which exist in one simulation run will probably not
exist in the next.
HAL handles its simulation as a block-level simulation. Figure 3.19 illustrates
the level-ordering method. Data independent blocks which can execute concurrently are
assigned to the same level. The system begins by concurrently executing all blocks in
level 0 with new input values. If a block’s output values change, then the new output
values are propagated to the appropriate input values of the next level. When all of the
blocks of level 0 have been executed, level 1 is handled. The simulation continues until
all levels of the simulation have been processed.
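The level-by-level execution described above can be sketched as a nested loop. This is a minimal zero-delay sketch under assumed data structures (each block is a function with named inputs and one output, which is not HAL's actual block format): blocks on one level are data-independent and could run concurrently, and each block is evaluated at most once per clock cycle.

```python
def simulate_clock_cycle(levels, values):
    """Level-ordered zero-delay evaluation: execute all blocks of
    level 0, propagate outputs to later levels' inputs, then level 1,
    and so on until all levels have been processed."""
    for level in levels:                  # level 0, level 1, ...
        for fn, ins, out in level:        # blocks on one level are independent
            values[out] = fn(*(values[i] for i in ins))
    return values


# Illustrative two-level combinational network (a half of a full adder):
levels = [
    [(lambda a, b: a ^ b, ("a", "b"), "s0")],        # level 0
    [(lambda s, c: s ^ c, ("s0", "cin"), "sum")],    # level 1
]
print(simulate_clock_cycle(levels, {"a": 1, "b": 0, "cin": 1})["sum"])  # 0
```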
HAL’s hardware is organized according to the diagram illustrated in Figure 3.20.
HAL is composed of 29 logic processors, 2 memory processors, and a router cell network.
Logic processors handle block-level simulation if the block contains only combinatorial
logic gates. The logic processor is itself composed of a node processor and a dynamic gate
array (DGA). The node processor manages event processing among the blocks. Each
node processor can handle up to 1000 blocks where each block can contain 32 inputs
and 32 outputs. The DGA performs gate evaluations within each block. The DGA gate
evaluations are performed by table lookup. The DGA receives the block inputs and the
block type. The inputs are then routed to index a location according to both the block
type and the gate inputs. Individual gate functions are implemented by the table-driven
method, where the function table for the gate is embedded in the RAM area in the form
of bit patterns [89].
Fig. 3.19. The HAL Level Ordering Method. HAL implements a zero-delay simulation model. As part of that model, events propagate through the simulation in pipelined fashion. The simulation is subdivided into blocks, with independent blocks executing at the same level. In each clock cycle, events which are generated from register outputs propagate through combinatorial circuits and reach register inputs. At the end of each clock cycle, the register outputs are updated by the register inputs, and then generate events for the next clock cycle. The evaluation of each block is executed at most once per clock cycle. Output values for all blocks are preserved until the next simulation clock cycle [89].
Fig. 3.20. The HAL Hardware Architecture. HAL is an array of 29 identical logic processors (NP1-NP29), two memory processors (NP30 and NP31), and a router cell network. The logic processors perform coarse-grain logic block simulation for blocks which contain only combinational logic. Each logic processor contains a node processor and a dynamic gate array (DGA). The memory processor consists of a node processor and a memory node simulator (MNS). The control processor performs level and clock synchronization among the logic and memory processors. The router cell network connects the processors and enables store-and-forward packet transmission among them [89].
Fig. 3.21. Internal Mechanism of a Logic Processor. The internal mechanics of one of HAL's 29 logic processors is illustrated. The event-set process block receives events from the router-cell network, which can be seen illustrated in Figure 3.20. The event-set process block sets and updates the input-status memory with the received event information. The event-fetch process block searches the input-status memory for new events and sends the new event information to the dynamic gate array for evaluation. The new event's block is evaluated by the dynamic gate array, which returns its results to the output-status memory. The update-status process compares the new status with the previous one stored in the output-status memory, and puts the updated status back into the output-status memory. The fan-out process block uses the connection memory to determine the fanout list for propagating the evaluation results. Finally, the event-send process block transfers events through either the router-cell network or a local bypass if the result is needed for the same block [89].
In Figure 3.21, the interior of the first node processor is displayed. The event-set
process block receives events from the router-cell network, which can be seen illustrated in
Figure 3.20. The event-set process block sets and updates the input-status memory with
the received event information. The event-fetch process block searches the input-status
memory for new events and sends the new event information to the dynamic gate array
for evaluation. The new event’s block is evaluated by the dynamic gate array which
returns its results to the output status memory. The update-status process compares
the new status with the previous one stored in the output-status memory, and puts
the updated status back into the output-status memory. The fan-out process block
uses the connection memory to determine the fanout list for propagating the evaluation
results. Finally, the event-send process block transfers events through either the router-
cell network or a local bypass if the result is needed for the same block.[89]
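The per-event flow described above can be sketched in software as follows. This is only an illustrative analogue: `process_events`, `gate_eval`, and the dictionary layouts are hypothetical names and structures, not part of HAL's actual design.

```python
# Hypothetical sketch of one logic processor's internal event cycle
# (event-fetch -> evaluate -> update-status -> fan-out -> event-send).

def process_events(input_status, output_status, connection_memory, gate_eval):
    """One pass over the input-status memory, returning events to send."""
    outgoing = []
    # Event-fetch: scan the input-status memory for blocks with new events.
    for block, inputs in input_status.items():
        if not inputs.pop("new_event", False):
            continue
        # Block evaluation: the dynamic gate array computes the new output.
        result = gate_eval(block, inputs["values"])
        # Update-status: propagate only if the output actually changed.
        if output_status.get(block) != result:
            output_status[block] = result
            # Fan-out: look up the receiving blocks in the connection memory.
            for dest_block, dest_pin in connection_memory[block]:
                outgoing.append((dest_block, dest_pin, result))
    return outgoing  # event-send: handed to the router-cell network or bypass
```

The update-status comparison is what suppresses redundant traffic: an evaluation whose result matches the stored output generates no fan-out events.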
Memory Processors contain node processors which are the same as in the logic
processors. The memory simulator models main memory and the cache. Although
memory could have been simulated by logic processors, the amount of memory required
to model a mainframe computer would exhaust HAL’s logic processor capacity. So the
memory processors were developed to model memory without degrading the simulation
performance. The memory simulator, however, stores the memory data for its simulated
memory blocks in the host computer’s main memory. By using the host computer’s main
memory, HAL's simulation memory capacity is several megabytes. However,
HAL required approximately 3 ms to simulate each 16-bit read or write memory access
cycle [89].
The Control Processor, also called the Host Processor, synchronizes the simulation
level operations with the simulation clock across the logic and memory processors. When
a node processor ends the evaluation in the current level, it sends an end signal to the
control processor. When the control processor has received level end signals from all
the node processors indicating that they have finished, the control processor increments
the simulation level. The control processor then broadcasts the new level to all node
processors, and the simulation begins for the new level. The control processor also
manages data transfers between the logic and memory processors and the host computer.
The router cell network connects the processors and facilitates store and forward message
packet communication between the processors.
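The level-synchronization protocol above can be sketched as a simple loop. The function `run_simulation` and the callable node processors are hypothetical names for illustration only; in HAL the end signals and broadcast are hardware mechanisms.

```python
# Illustrative software analogue of the control processor's level loop.
# Each node processor is modeled as a callable that evaluates its blocks
# for the broadcast level and returns an end signal when finished.

def run_simulation(node_processors, max_level):
    """Advance the simulation level only after every node reports an end signal."""
    level = 0
    while level < max_level:
        # Broadcast the current level; collect each node's level-end signal.
        end_signals = [node(level) for node in node_processors]
        if all(end_signals):        # barrier: all level-end signals received
            level += 1              # increment and broadcast the new level
    return level
```

The `all(end_signals)` test plays the role of the control processor waiting for level-end signals from every node before incrementing the simulation level.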
The HAL simulation model introduces the novel approach of block-level simu-
lation which improves simulation speed at the cost of simulation granularity. HAL’s
disadvantages are as follows:
• The zero-delay, unit-delay, and nominal-delay models used to describe different
approaches to evaluating logic simulation are not applicable to non-deterministic
event-driven simulation. In non-deterministic simulation, the events, which are
analogous to gates in a logic simulation, are created on the fly and therefore cannot
be easily sorted according to evaluation levels. These methods all depend on
the gates existing before the simulation begins so that the events can be ordered.
• The granularity categories of the logic event-driven simulations depend on a priori
information about the hardware being simulated. Non-deterministic simulations
do not have all the events generated before the simulation occurs, so again this
categorization does not apply to non-deterministic simulation.
• HAL’s dependence on the host processor’s main memory to simulate memory in its
test models causes a substantial 3 ms time penalty for simulation memory accesses.
After the success of the initial HAL machine, second and third generation ma-
chines were developed culminating with the construction of HAL III. HAL III consists
of 127 processors and a maximum memory capacity of 254 Mbytes [135]. HAL III is
reported to be more than 10,000 times faster than conventional multiprocessor software
simulators.
3.1.9 MARS: Micro-Programmable Accelerator for Rapid Simulation
MARS, the Micro-Programmable Accelerator for Rapid Simulation, was devel-
oped at AT&T Bell Laboratories and built in approximately 1987. MARS is a pipelined,
parallel accelerator whose microprocessors can be reconfigured through microprogram-
ming [3]. MARS is classified as an event-driven, fine-grained, conservative logic simula-
tor.
MARS consists of 256 clusters which are connected to a binary 8-cube communi-
cations network. A host processor can access the network and each cluster. The clusters
and network are illustrated in Figure 3.22. A cluster contains 14 Processing Elements
(PE). Every PE serves as a single stage of the pipeline. Each cluster performs a partition
of a multiple-delay logic simulation and communicates with the other clusters via the
communication network. Figure 3.23 illustrates the internal components of a MARS
cluster.
Fig. 3.22. Global MARS Architecture. MARS consists of 256 clusters which are connected to a binary 8-cube communications network. A host processor can access the network and each cluster.
A cluster contains a communications network node, a local message switch, 14
PEs and a housekeeping processor. The 14 PEs are connected via a 16x16 crossbar switch
which also connects to the housekeeping processor and the external global communica-
tions network. The housekeeping processor is implemented as an M68020 processor,
which uses a local disk to store circuit partitions.
Figure 3.24 illustrates the architecture of the MARS Processing Elements (PE).
Each PE acts as a pipeline stage and together, the pipelined PEs perform the simula-
tion [3]. Individual PE functions include event scheduling, fanout updating, and function
Fig. 3.23. Internal Cluster Architecture. Each cluster contains a communications network node, a local message switch, 14 processing elements (PEs), and a housekeeping processor. The 14 PEs are connected via a 16x16 crossbar switch, which also connects to the housekeeping processor and the external global communications network.
evaluation. The PEs communicate with each other and other clusters through their local
message switch.
As illustrated in Figure 3.24, each PE contains a microprogram RAM, a data
RAM, a register array containing 32 registers, an address arithmetic unit (AAU), a bit
field operation unit (FOU), and a message Queue Unit which serves as the I/O queue
for the PE. The FOU can perform operations on 1, 2, 4, and 8-bit field sizes. The AAU
is used to address the PE’s data RAM using a variety of addressing methods. The AAU
also performs 16-bit arithmetic, logical and shift operations. The AAU can support
multiplication and division at the rate of 1-bit per clock cycle. The FOU, on the other
hand, can perform bit-wise data extraction from two separate words, a bit-wise addition
operation on the operands, and then re-pack the results all in the same clock cycle.
The data path consists of three 16-bit buses: A, B, and C. The microinstruction
cycle consists of three phases. Phase 1 allows data to be read from registers onto a bus.
During phase 2, the AAU and FOU operate on the retrieved data; the results are placed
on a bus during phase 3, when the contents of the buses are also written to selected
registers.
Other units in Figure 3.24 include the data RAM address register (DAR), the
data RAM high address register (DHAR), the external address (EAD), the external
data register (ED), the field select register (FSR), the microinstruction register (MIR),
the memory select register (MSR), and the program address register (PAR).
Figure 3.25 illustrates the fanout phase and evaluation phase of a cluster for logic
simulation. Each stage of the pipeline represents one PE. The same PE may be used in
both phases.
Fig. 3.24. Architecture of the Processing Element. Each Processing Element (PE) acts as a pipeline stage, and together the PE pipeline performs the simulation [3]. Individual PE functions include event scheduling, fanout updating, and function evaluation. The PEs communicate with each other and with other clusters through their local message switch. Each PE contains a microprogram RAM, a data RAM, a register array containing 32 registers, an address arithmetic unit (AAU), a bit field operation unit (FOU), and a message Queue Unit which serves as the I/O queue for the PE.
During the fanout phase, the signal scheduler contains pointers to linked lists
of events. The output filter keeps track of current and pending signal values as well as
canceled events [3]. The oscillation detector detects zero delay oscillations and interrupts
the housekeeper if a predetermined number of oscillations is exceeded. The output log
records events in its data RAM on watched signals. The fanout pointer list, fanout list,
and input table all are used to propagate the gate results to the proper elements on the
evaluated gate’s fanout list. Finally, the gate scheduler schedules the gates whose inputs
have changed for evaluation during the next appropriate evaluation cycle.
The evaluation phase starts where the fanout phase left off. The gate scheduler
pops the gates to be evaluated off its stack of events and forwards the events to the
input table. The input table then fetches the appropriate input values for the gates
and forwards the values to the gate type table. The gate type table adds its data for
the appropriate gate to the data and moves it to the function unit. The function unit
evaluates the single gate and passes its computed result to the delay table which adds the
correct gate delay. Next, the input vector list, the output filter, and the signal scheduler
detect the new events generated by evaluation and schedule the new events for the next
fanout phase.
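The evaluation-phase stage sequence can be sketched as follows. The table layouts and names here are illustrative assumptions, not MARS microcode; each dictionary lookup stands in for one pipelined PE stage.

```python
# Hypothetical sketch of the MARS evaluation phase: each scheduled gate flows
# through the input table, gate-type table, function unit, and delay table.

def evaluation_phase(scheduled_gates, input_table, gate_type_table,
                     functions, delay_table):
    """Return (gate, new_value, delay) events for the next fanout phase."""
    new_events = []
    for gate in scheduled_gates:              # gate scheduler pops its stack
        inputs = input_table[gate]            # fetch current input values
        gate_type = gate_type_table[gate]     # gate-type table adds its data
        value = functions[gate_type](inputs)  # function unit evaluates the gate
        delay = delay_table[gate_type]        # delay table adds the gate delay
        new_events.append((gate, value, delay))
    return new_events                         # scheduled for the next fanout phase
```

In the real machine these stages run concurrently on different gates, one PE per stage; the sequential loop above only shows the data dependencies between stages.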
The MARS project has the following advantages and disadvantages:
1. Advantages
• Provides programmability through its use of microcoded PE chips.
• MARS works well for designs which utilize variable bit fields and variable
memory widths.
Fig. 3.25. MARS logic simulation pipeline. The fanout phase and evaluation phase of a cluster for logic simulation are illustrated above. Each stage of the pipeline represents one PE. The same PE may be used in both phases.
2. Disadvantages
• Each PE has to receive data and instructions from RAM, which incurs a
speed penalty on data access.
• The PEs also function as processors, with data-paths, so the operations of each
PE may involve reading from a bus, calculating a result, and then writing to
a bus, with appropriate storage to either memory or registers.
3.1.10 Reconfigurable Machine
The Reconfigurable Machine [137] (RM) combines FPGAs and RAMs to support
a wide range of applications. The RM, built in approximately 1992, incorporates FPGAs
which are capable of in-circuit reconfiguration allowing the RM to reload several types
of configuration data during power-on. The RM is an event-driven, fine-grained, conser-
vative logic simulator. A first prototype version of the RM, called RM-I, has been built
and applied to a multiple-delay Logic Simulator (LSIM). LSIM can simulate 1 million
gate events per second at a 4 MHz clock rate [137].
The RM architecture employs FPGAs which allow in-circuit reconfiguration and
relatively fast switching speeds. FPGAs come in four types with respect to programming
technology. The four types are anti-fuse, EPROM, EEPROM, and the SRAM type. The
anti-fuse and EPROM types do not allow in-circuit reconfiguration. The other two types
do allow in-circuit reconfiguration, but the SRAM type offers faster switching rates. So
the RM project decided to employ SRAM FPGAs. The project used the XC3090 (9000
gate class) FPGA from Xilinx which contains 320 Configuration Logic Blocks (CLBs)
and 144 Input/Output Blocks (IOBs).
One of the FPGAs serves as the interface module. The other four FPGAs serve
as processing modules. Each of the four FPGAs accesses two types of memory, shared
and distributed. Both types of memory are implemented as 24-bit words. The FPGAs
have access only to their local memory when the FPGA is in its processing mode. When
not in the processing mode, the host has access to each FPGA memory using global
addressing. The RM can configure 4 pipeline stages with memory access.
The RM employs a tightly connected communications architecture with all FP-
GAs directly connected. There is also a 24-bit global bus used for global data transfer
and control. The system can be configured to run with a 16 MHz/2^n clock, where
n = 0, 1, 2, . . . , 7 is programmed by the configuration data [137]. When the clock
speed is less than or equal to 4 MHz, the RM-I can process two memory accesses within
a single system clock cycle which is good for read/modify/write cycles.
Fig. 3.26. The RM Machine. The FPGAs selected for the RM implementation were SRAM-type FPGAs, which allow in-circuit reconfiguration. The RM employs a distributed memory architecture, using 32K 24-bit words. Each FPGA accesses only its local memory when in the processing mode. The communications network consists of a global bus which has 24 data bits and 6 control bits. One FPGA serves as the system interface. The configuration data interface unit determines the FPGA configurations.
When applied to logic simulation, the RM implementation is called the Logic
Simulator (LSIM). The logic simulator implementation on the RM was divided into
two main phases, fanout and evaluation [137]. The fanout phase allowed events in the
current time increment to propagate to the inputs of the next gates. These gates are
then scheduled for evaluation. During the evaluation phase, the event manager fetches
each gate and its signals from the event list. The evaluator receives the gate type and
input signals and retrieves the appropriate rise and fall delays from the function code
table. Finally, a comparator compares the gate's new output to its previous output.
If there is a change, the gate, its delay information, and the new signal results are
forwarded to the scheduler as a future event.
The FPGAs are used for implementing logic functions [137]. Gate-level circuits
implemented on each FPGA of RM-I are designed manually. Xilinx tools are used for
automatic placement and routing.
The RM advantages and disadvantages are as follows:
1. Advantages
• The system has demonstrated 170 to 190 times speedup running a Logic
Diagnosis Engine as compared to the same system compiled as software on
the host computer.
2. Disadvantages
• The machine has limited storage capacity, so tasks requiring large amounts of
memory are not practical.
Fig. 3.27. The LSIM Fanout Phase. The Scheduler first increments the simulation cycle time and then gets the pointer to the linked list of current events from the time-mapping queue. Each event consists of the gate identifier and its new output value. The Propagator, which receives each event from the Scheduler, uses the Connection Index Table to locate the current gate's fan-out receiving gates in the Connection Table. The Event Manager receives the propagated gate identifier, the terminal identifier, and the signal value. The Event Manager updates the Input Signal Table according to the values from the Propagator. Gates whose inputs changed have their activity flags set and are stored in the event list.
Fig. 3.28. The LSIM Evaluation Phase. In the Evaluation phase, the Event Manager gets each event from the event list, clears its activity flag, and retrieves the gate's input values. The Evaluator retrieves each gate identifier and its input signal values from the Event List. Next, the gate type is pulled from the Function Code Table, and the rise/fall delays are retrieved from the Delay Table. The Evaluator determines the gate's output value and forwards the result to the Comparator. The Comparator examines the gate output to determine if a change has occurred. If the output is new, the gate identifier, its delay information, and the new output value are sent to the Scheduler. The Scheduler places the new event on the Time Mapping Queue.
• The 4 FPGA processors have fixed connections. The maximum bandwidth
between the processors is not scalable.
• The Xilinx XC3090 FPGA clock rate is limited to between 4 and 8 MHz.
3.1.11 Bauer
A more recent logic simulation design which utilizes FPGAs was developed in
work by Bauer [14]. This logic simulator uses reconfigurable logic to accelerate the dis-
crete event simulation of logic circuits. The focus of this work is accelerating discrete
event simulation, but the target is again deterministic. The foundation for this recon-
figurable computing system is an FPGA-based emulator, which provides large blocks of
reconfigurable logic [14]. The simulation is generated by a compiler which compiles a
behavioral Verilog HDL description of the design under test.
Each emulation module of Figure 3.29 runs a small operating system to manage
the behavioral simulation and logic netlist emulation. A separate control processor which
is not illustrated performs higher level operating system functions including network ac-
cess and disk management. The emulation modules consist of a PowerPC 403GCX
processor, local RAM, and a local FPGA array with its associated programmable inter-
connect. Emulation modules connect to each other via programmable interconnects.
The advantages and disadvantages of the system are:
1. Advantages
• Focus on accelerating Logic simulation as discrete event simulation.
• The system is scalable.
Fig. 3.29. Bauer's Reconfigurable Logic Simulator. The architecture of the system consists of one or more emulation modules. Each module consists of a CPU, RAM, and an FPGA array with a local programmable interconnect. The interconnect allows the FPGA array to be treated as one large reservoir of reconfigurable logic. The figure depicts two emulation modules.
2. Disadvantages
• Purely time-driven implementation.
• The system foundation is composed of FPGA-based emulators. Emulation
sacrifices generality for performance: it cannot be used to simulate behavioral
circuit models that contain delays or other constructs that are either non-
structural or cannot be synthesized into gate-level circuitry [14].
3.2 Accelerator & General Purpose Machine
The two machines in this section, the Splash accelerator, and the ArMen, a general
purpose parallel machine, are not specifically designed as logic simulators, and therefore
do not fit the criteria of Section 3.1. Splash, described in Section 3.2.1, is designed
to provide very high performance on a range of bit-processing problems. Similar to the
architecture proposed in this thesis, Splash employs reconfigurable logic in the form of
systolic arrays. The work also provides invaluable feedback on its architecture advan-
tages and disadvantages. Section 3.2.2 describes the ArMen, which is perhaps the closest
architecture to the system presented by the thesis in that the ArMen is a general pur-
pose machine using reconfigurable logic which is specifically designed to support parallel
discrete event simulation. The ArMen, however, has a significantly different approach
to synchronization and it lacks a reduction network.
3.2.1 Splash
The original Splash 1 is a single board which plugs into the VME bus of a Sun
workstation. In approximately 1991, Splash was designed to serve as a systolic processing
system [8] using a Sun workstation as its host. The general purpose machine is normally
an SIMD machine and the original design was motivated by a systolic algorithm for
DNA pattern matching [64]. The boards consist of 32 Xilinx XC3090 FPGAs which are
programmed to serve as processing elements. The FPGAs are connected as a linear array
by a 32-bit wide data bus. The two end chips, X0 and X31 can be connected together
allowing the FPGAs to form a ring. Between each pair of FPGAs lies a shared 128K x 8
RAM with an 8-bit wide path to the FPGAs. The Splash 1 board clock rates can be set
in factors of 2 from 1 MHz to 32 MHz. The slower speeds allow placement and routing
design difficulties to be accommodated.
Splash 2 attempts to alleviate the I/O bound drawbacks of Splash 1 by using a
Sparc II host and a connection to the system’s SBus. Splash 2 is expected to be 8 to 10
times faster than its predecessor in terms of its sustainable I/O rate.
In Figure 3.30, each Splash 2 array board is composed of 16 processing elements,
FPGAs designated X1 through X16, each of which is connected to a 16-way crossbar
switch. An additional FPGA processing element, designated X0, controls the switch
configuration. The FPGA processing elements are Xilinx XC4010 FPGAs which are
each connected to 500K of local memory. The host Sun platform can directly address the
processing element’s 500K local memories. The memories are connected to the FPGAs
by a 16-bit data bus and an 18-bit address bus. The FPGAs have 36-bit bidirectional
data paths to both left and right neighbors as well as the crossbar switch [7]. A crossbar
input may be configured to connect to any number of output ports allowing point-to-
point, multicast, and broadcast communication. This configuration allows X0 to receive
broadcast data from the host on the SIMD bus and rebroadcast it through the crossbar
Fig. 3.30. The Splash 2 Architecture. The Splash 2 architecture was designed based on the newer Xilinx XC4010 10,000-gate FPGA. Splash 2 is scalable from one board with 16 processing elements to a combination of boards yielding 256 processing elements. The input and output data streams can be provided with direct memory access (DMA) from the Sun SBus or from an external source. The crossbar switches on each board fully connect the board's 16 processing elements. Application programs can be written in behaviorally described VHDL.
Fig. 3.31. The Splash 2 Interface. The Splash 2 interface board contains 3 bidirectional DMA channels. Each DMA channel is connected to the Splash array boards via a FIFO queue. XL and XR are two user-programmable FPGAs which can process the incoming and outgoing data streams, optionally stopping and starting the system clock as data fills the output channel or new data becomes available on the input channel. In Splash 2, the clock frequency is selectable by the host in 50-Hz increments from 100 Hz to 30 MHz.
to each of the 16 PEs on the board. The input to each board’s processing element array
is through the XL processing element which connects to a 36-bit SIMD bus of each X0
element on all the boards and to the X1 element of the first board. Additional array
boards can be linked together by extending the linear data path from the X16 element
of one board to the X1 element of the next board [7]. The XR unit determines which
board is the last board in the chain.
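The crossbar's point-to-point, multicast, and broadcast modes described above can all be modeled as a mapping from an input port to a set of output ports. The names and structures below are illustrative assumptions, not the actual X0 switch configuration format.

```python
# Illustrative model of a crossbar configuration: one input port may drive
# any set of output ports, so point-to-point, multicast, and broadcast are
# just different output sets.

def route(crossbar_config, port, value):
    """Deliver a value from an input port to every configured output port."""
    return {out: value for out in crossbar_config.get(port, set())}

# Hypothetical configuration for one Splash 2 board:
config = {
    "X0": {f"X{i}" for i in range(1, 17)},  # broadcast: X0 drives all 16 PEs
    "X1": {"X2"},                           # point-to-point: X1 drives only X2
}
```

This captures why a single X0 rebroadcast can reach every processing element on a board: broadcast is simply the configuration in which one input's output set contains all ports.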
The Sparc host downloads the configuration data to the processing elements on
each board, which includes X0-X16, XL, XR and the crossbar switch. The host system
can read and write to the DMA channel FIFOs on the interface board as shown in
Figure 3.31. The host can also stop and start the system clock, setup and manage
the DMA channels, read and write to the processing element memories, and receive
interrupts from both the DMA channels and from each of the computing elements. Each
array board contains a set of bidirectional handshake registers through which the host can
communicate directly and asynchronously with the computing elements. There is also
a single-bit broadcast mechanism and a 2-bit wide global AND/OR reduction network
between the processing elements and the interface board [7].
Splash was designed to handle various programming models including a single
instruction/multiple data stream (SIMD) model, a one-dimensional pipelined systolic
model, and several higher-dimensional systolic models [7]. The SIMD applications utilize
the X0 element and the crossbar switch on each board to broadcast the instructions and
data to all processing elements simultaneously. The instruction stream is sent from the
host to the X0 chip on each board via the SIMD bus. X0 broadcasts the instruction to
all 16 of the board’s processing elements, which are each programmed with one or more
identical SIMD computing engines. These engines synchronously receive and execute
instructions and perform nearest neighbor communications through the linear data path.
Global element synchronization is accomplished with the AND/OR reduction network.
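Such a reduction can be sketched as follows; `global_reduce` is a hypothetical software analogue of the hardware AND/OR network, in which each element contributes one status bit and every element observes the same reduced result.

```python
# Software analogue of a global AND/OR reduction network: each processing
# element contributes one bit; the reduced bit is broadcast back to all.

def global_reduce(flags, op="and"):
    """Reduce one status bit per element into a single global bit."""
    if op == "and":
        return all(flags)   # e.g. 'every element has finished this step'
    return any(flags)       # e.g. 'at least one element has pending work'
```

A barrier is then just an AND reduction over per-element "done" bits: no element proceeds until the reduced bit is true.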
One-dimensional systolic arrays are formed by using the processing board’s 36-bit
linear data path to form a continuous pipeline from the host, through the array, and back
to the host. The crossbar switch allows an individual processing element to be bypassed
or multi-dimensional systolic arrays to be implemented.
The Splash system has several advantages and disadvantages:
1. Advantages
• Splash is a general purpose system.
2. Disadvantages
• I/O bandwidth and inter-processor communications have proved to be a lim-
iting factor during system testing. Splash 1 was entirely I/O bound [8].
• Splash 1 has only a single systolic datapath between all of its 32 Xilinx FPGAs.
Splash 2 implemented a crossbar interconnect, but this still limits the speed
of the communication required for simulation synchronization.
• Splash was designed for synchronous SIMD operation; however, most simu-
lations might work better with asynchronous multiple instruction, multiple
data (MIMD) operation. MIMD allows different nodes of the same simulation
to operate with different constraints and statistical distributions in a non-
homogeneous simulation network.
• The two-bit reduction network is rather small and might be constraining for
event-driven simulations.
3.2.2 The ArMen
In 1994, Beaumont et al. [15] proposed a new architecture for discrete synchronous
event-driven simulation using FPGAs. The MIMD ArMen implementation allows the
parallel execution of events with the same timestamp in virtual time. The machine is an
event-driven, fine-grained, conservative logic simulator. All processors wait until all the
event computations for a given simulation cycle are complete. Then the simulation can
proceed to the next phase. The next simulation cycle is the global minimum of all the
minimum timestamped events on each node. The protocol respects causality constraints
since all processors are always executing events with the same timestamp [15].
The two main global control operations are the synchronization barrier which ev-
ery processor must reach before the simulation can proceed to the next simulation cycle,
and the calculation of the global minimum of all the Local Virtual Times (LVTs) in order
to determine the next time to be simulated. These two operations have been implemented
in the FPGAs of the ArMen machine. The algorithm is provided in Table 3.2.
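A minimal software analogue of this synchronous protocol, assuming each node holds a sorted list of (timestamp, event) pairs, might look like the following. The function name and event format are illustrative assumptions, not ArMen code; the barrier and the GVT broadcast are implicit in the sequential loop.

```python
# Hypothetical sketch of synchronous conservative simulation: all nodes
# execute only events whose timestamp equals the current global minimum.

def synchronous_simulation(nodes, end_time):
    """Each node is a sorted list of (timestamp, event); returns the trace."""
    # Global minimum computation: GVT is the smallest next-event time.
    gvt = min((n[0][0] for n in nodes if n), default=end_time)
    trace = []
    while gvt < end_time:
        for node in nodes:                       # model evaluation, in lockstep
            while node and node[0][0] == gvt:    # only events at time == GVT
                trace.append(node.pop(0))
        # The global synchronization barrier is implicit here: the loop body
        # completes for every node before GVT advances.
        gvt = min((n[0][0] for n in nodes if n), default=end_time)
    return trace
```

Causality holds by construction: no event executes before the global minimum timestamp, so no node can receive a message from its past.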
Each ArMen node is tightly coupled to an FPGA ring. The reconfigurable ring,
called the logic layer, allows the synthesis of application-specific operators. ArMen can
be configured and specialized at runtime.
The basic ArMen architecture is illustrated in Figure 3.32. Each ArMen node
consists of a processor and FPGA combination. The processor is connected via its
system bus to a bank of memory. The FPGA can be dynamically reconfigured by the
GVT ← GVT_Computation(tWakeUpmin_i)    ▷ global minimum computation and broadcast
while (¬ End_Simulation)
    if (GVT = tWakeUpmin_i) then
        〈〈 Model evaluation;
           Sending of generated messages;
           Waiting for acknowledgments 〉〉
    Global_Synchronization()    ▷ ensure every execution is over in the current time step
    〈〈 tWakeUpmin_i evaluation 〉〉    ▷ local minima search
    GVT ← GVT_Computation(tWakeUpmin_i)    ▷ new global minimum computation and broadcast
Table 3.2. Synchronous Discrete Event-Driven Simulation Algorithm. In this algorithm, the Global Virtual Time (GVT) is the minimum of all the local virtual times at each simulation processing node. Instructions occurring between 〈〈 〉〉 are concurrent. The protocol respects the causality constraint since all processors are always executing events with the same timestamp [15].
attached processor. The processor loads configuration data into the FPGA during a 100
ms delay using memory-mapped registers [49]. There are four input/output ports on each
FPGA. The north port connects to the processor bus. The east and west FPGA ports are
connected to the adjacent FPGAs forming a ring topology. The south port is generally
free, but can be connected to other processor nodes to form different communications
topologies.
The logic layer can provide either application speedups or other services [49]. Al-
gorithms or functions are implemented in the FPGAs which serve as local accelerators
with data exchanges between the FPGA and the processor. The processor writes values
into the FPGA registers allowing the FPGA to perform its configured calculations [15].
The processor then reads its results back. Local FPGA-based accelerators can be fed
and controlled using the MIMD framework. Experimentation has shown that the system
throughput is limited by the processor read/write speed. The FPGA and the processor
are synchronized via the processor’s interrupt signal line. The FPGA to FPGA com-
munications can be either synchronous, taking advantage of the same clock signal, or
asynchronous, using ready/ack signal lines.
When implementing synchronous parallel event-driven simulation, all processors
send flags to their associated FPGAs indicating the need to synchronize. The processors
set up a synchronization barrier for each simulation cycle. When every processor reaches
the barrier and sends the appropriate signal, node 0 issues a restart signal.
To compute the global minimum time which is the time of the next event, the
ArMen machine implements the following strategy, illustrated in Figure 3.33. Each
FPGA computes the minimum of its own node’s next event time value and the local time
Fig. 3.32. The ArMen Architecture. Each ArMen node has a processor connected via its system bus to a bank of memory and an FPGA. The FPGA can be dynamically reconfigured by the attached processor. The input/output ports of each FPGA are divided into four ports. The north port connects to the processor bus. The east and west FPGA ports are connected to the adjacent FPGAs, forming a ring topology.
minimums for the nodes to its left and right. Each node writes its local minimum in its
associated FPGA. After n levels of vertical pipelining and shifting of data between the
FPGAs, the minimum over 2n+1 values is computed. So, as an example, if the minimum
timestamp needed to be computed among 21 timestamp values, then 10 levels of vertical
pipelining would be required. If N > 2n+1, where N is the number of processors in the
system, then the vertical pipeline is executed again with the results from the previous
run until at least N/2 levels are computed. The computation broadcasts the result to all
processors. All the processors of the MIMD network have to contribute to the calculation
at the same time under control of the simulation kernel. The ArMen can switch from
MIMD to Single Program Multiple Data (SPMD) mode to assist in the computation.
SPMD differs from SIMD in that the program is replicated and stored and executed at
several (i.e. multiple) nodes, as opposed to the SIMD model where the instructions are
transmitted one by one to the processing elements.
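The level-by-level minimum computation can be emulated in software. In the sketch below, which illustrates the data movement rather than the FPGA logic, each level replaces every node's value with the minimum of itself and its two ring neighbors, so after n levels each node holds the minimum over a window of 2n + 1 timestamps.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One level of the ring pipeline: every node takes the minimum of its own
// next-event time and the local minimums held by its left and right
// neighbors on the ring.
std::vector<long> ring_min_level(const std::vector<long>& t) {
    const std::size_t n = t.size();
    std::vector<long> next(n);
    for (std::size_t i = 0; i < n; ++i) {
        long left  = t[(i + n - 1) % n];
        long right = t[(i + 1) % n];
        next[i] = std::min({left, t[i], right});
    }
    return next;
}

// After `levels` iterations each node covers a window of 2*levels + 1
// values; once that window spans the whole ring, every node agrees on the
// global minimum next-event time.
long global_min_time(std::vector<long> t, int levels) {
    for (int l = 0; l < levels; ++l) t = ring_min_level(t);
    return t[0];
}
```

For N nodes, ⌈(N − 1)/2⌉ levels suffice: with five timestamps, two levels already leave every node holding the global minimum.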
Compared to a pure software implementation using the same operating system,
the ArMen group reports speedups of 40 for the global minimum computation on four
32-bit integer values and one level of the FPGA pipeline. The slow speeds are attributed
to the delay required by the processor interrupt signal. With 120 processors, a speedup
of 600 is expected.
The advantages and disadvantages of the ArMen system are:
1. Advantages
• Capable of running random event-driven simulations.
• The system is scalable.
Fig. 3.33. Digital-Serial Implementation of the Global Minimum Computation and Broadcast. The method of comparing each local next event time stored at every node to arrive at a global minimum next event time is illustrated in this figure. After n levels of vertical pipelining and shifting of data between the FPGAs, the minimum over 2n + 1 values is computed.
2. Disadvantages
• The global virtual time is computed with an O(n²) algorithm.
• Lacks a reduction network.
• Uses a ring topology, which does not scale well.
3.3 Optimistic Processing
Section 3.3 reviews an optimistic state-saving hardware approach to discrete event
simulation. This hardware device is used to allow optimistic simulation to proceed by
the application of a state saving technique. Checkpoints are saved in the event that the
optimistic path taken turns out to be incorrect, so that the previously saved point in the
simulation can be quickly restored.
In July of 1988, Fujimoto et al.[58] proposed a special purpose hardware design
called the Rollback Chip (RBC) for parallel discrete event-driven simulation. The RBC is
designed to work with the Time Warp mechanism which handles difficult clock synchro-
nization problems. Time warp relies on a lookahead and rollback mechanism to achieve
widespread exploitation of parallelism. The state of each process must be periodically
saved, and when necessary, the process must be rolled back to a previously checkpointed
time.
The Rollback Chip (RBC) is a type of memory management unit and data cache
combined into a single component [58]. The chip was specifically designed to work with
Time Warp mechanism developed by D.R. Jefferson [80].
Instead of putting data into “protected” areas of memory, the RBC manipulates
the addresses generated by the CPU in order to avoid overwriting old values which may be
required for a future roll back operation. The RBC is designed to be embedded in every
computation node of a multi-node system in order to assist with rollback operations for
that local node. The processors perform optimistic simulation, computing as far ahead
as possible and then rolling back if a straggler or late event arrives with a timestamp
which is earlier than the node’s local simulation clock time.
The RBC provides the processor it serves with version controlled memory (VCM).
Version controlled memory is identical to normal read/write memory, except that a
process may “mark” the state of the memory as one which may later need to be restored
via a rollback operation. In a parallel simulation, the processors issue a mark operation
after processing a simulation event. Simulation variables which are subject to rollback
and therefore require state saving, must be stored in version control memory.
The RBC contains six operations. The reset operation initializes the RBC. The
Mark operation preserves the current state of version controlled memory. A write(A,D)
operation writes data D into the location at memory address A. A read(A) operation
reads the most recently written version of data associated with address A (excluding
rolled back write operations) and returns the data. A rollback(k) operation restores the
version controlled memory to the kth previously marked state (k > 0). Finally, the last
operation is an advance(k), in which the kth oldest marked states are discarded and
can be reclaimed as available memory space. This memory reclamation is called fossil
collection and is similar to garbage collection, but it also performs additional irrevocable
operations such as I/O [58].
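A behavioral sketch of these six operations follows. This C++ model is our approximation; the real RBC remaps addresses in hardware rather than copying data. It keeps one map of writes per marked state, reads the newest version of an address, discards recent frames on rollback, and folds the oldest frames together on advance.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <unordered_map>

// Behavioral model of the RBC's version controlled memory (VCM): each
// frame records the writes made since the previous mark operation.
class VersionControlledMemory {
public:
    VersionControlledMemory() { reset(); }

    // reset: initialize the VCM to a single empty base state.
    void reset() { frames_.clear(); frames_.push_back({}); }

    // mark: preserve the current state; later writes go to a new frame.
    void mark() { frames_.push_back({}); }

    // write(A, D): store data D at address A in the current frame.
    void write(std::uint32_t addr, int data) { frames_.back()[addr] = data; }

    // read(A): most recently written version of A (0 if never written).
    int read(std::uint32_t addr) const {
        for (auto it = frames_.rbegin(); it != frames_.rend(); ++it) {
            auto f = it->find(addr);
            if (f != it->end()) return f->second;
        }
        return 0;
    }

    // rollback(k): discard the k most recent frames, restoring the k-th
    // previously marked state (k > 0).
    void rollback(std::size_t k) {
        while (k-- > 0 && frames_.size() > 1) frames_.pop_back();
    }

    // advance(k): fossil-collect the k oldest marked states by folding
    // them into their successor; insert() keeps the newer value.
    void advance(std::size_t k) {
        while (k-- > 0 && frames_.size() > 1) {
            auto oldest = frames_.front();
            frames_.pop_front();
            for (const auto& kv : oldest) frames_.front().insert(kv);
        }
    }

private:
    std::deque<std::unordered_map<std::uint32_t, int>> frames_;
};
```

A mark followed by writes and a rollback(1) restores the values held at the mark, while advance(1) reclaims the oldest frame without disturbing the visible state.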
The processor can invoke the reset, mark, rollback, and advance operations by
writing into the RBC’s control registers which are memory mapped to the CPU’s address
space. The RBC read and write operations represent normal CPU read/write operations
to specific program variables. Context switch controls are represented by additional
registers in the RBC.
The advantages and disadvantages of this system are:
1. Advantages
• Allows optimistic simulation processing.
• Creates a fast state restoring technique.
2. Disadvantages
• Extra address calculations are required with every memory access.
• Larger memory space is required.
3.4 Non-Deterministic Simulation
Section 3.4 gives a brief introduction to the Ising Spin model. This model inspired
three groups of physicists [76, 106, 118] to develop random number generators which are
applicable for seeding statistical distributions.
The simulation machines discussed in Section 3.1 are deterministic logic simula-
tors. An important feature which distinguishes this work from previous simulators is
its enhanced ability to model non-deterministic simulation. However, non-deterministic
number generation has been both required and implemented by previous architectures. These
special purpose architectures created processors which were designed for one specific ap-
plication, the Ising Spin model. In statistical mechanics, the calculation of static and
dynamic properties of Ising Spin systems by means of the Monte-Carlo method is today
a standard technique [76]. Ising Spin models have been compared to a stochastic vari-
ation of cellular automata, similar to the popular Game of Life [107, 106] and to the
Hamiltonian Path problem [131].
Another analogous comparison is the model of an array of pixels composed of
one or more bits which each interact with other adjacent pixels. The exact meaning of
the pixel depends on the nature of the simulation. In the case of cellular automata, the
pixel is a cell, in the simulation of a discretized fluid, it is a region of space that can
accommodate gas molecules, and in the case of statistical physics, each pixel represents
a discrete degree of freedom of a many-body system. In the last case, the pixel is called
a spin if the system to be simulated is a magnetic system [106].
Specifically, the Ising Spin Model is composed of discrete variables, Si, called
spins, which take on one of two values, up (+1), or down (-1) and occupy the sites
of a regular or random D-dimensional lattice, where D = 1, 2, 3, . . . as illustrated in
Figure 3.34. The Ising model was first used as a model for the behavior of magnetic ma-
terials. A magnetic material consists of a large number of regularly located microscopic
magnetic moments ( or dipoles ) which are also called spins because they arise from
angular momentum or spin properties of electrons. In the Ising Model, these dipoles are
only allowed to point in one of two opposite directions.
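As a concrete illustration, one Metropolis Monte-Carlo sweep over a two-dimensional Ising lattice can be written as follows. This is a generic textbook sketch with coupling J = 1 and periodic boundaries, not the algorithm of the special purpose machines discussed below; the temperature parameter T and the seed are our choices.

```cpp
#include <cmath>
#include <random>
#include <vector>

// One Metropolis sweep over an L x L lattice of Ising spins (+1 / -1)
// with periodic boundaries: a flip with energy change dE is accepted with
// probability min(1, exp(-dE / T)).
struct Ising2D {
    int L;
    std::vector<int> s;            // spins, +1 or -1, row-major
    std::mt19937 rng{12345};

    explicit Ising2D(int l) : L(l), s(l * l, +1) {}

    // Access with periodic wrap-around in both directions.
    int& at(int x, int y) { return s[((y + L) % L) * L + ((x + L) % L)]; }

    void sweep(double T) {
        std::uniform_real_distribution<double> u(0.0, 1.0);
        for (int y = 0; y < L; ++y)
            for (int x = 0; x < L; ++x) {
                int nb = at(x + 1, y) + at(x - 1, y)
                       + at(x, y + 1) + at(x, y - 1);
                double dE = 2.0 * at(x, y) * nb;   // J = 1
                if (dE <= 0.0 || u(rng) < std::exp(-dE / T))
                    at(x, y) = -at(x, y);
            }
    }

    // Magnetization per spin, the quantity checked by Pearson's group.
    double magnetization() const {
        long m = 0;
        for (int v : s) m += v;
        return double(m) / s.size();
    }
};
```

Starting from the all-up state, a sweep at very low temperature leaves the magnetization essentially at one, while high-temperature sweeps drive it toward zero.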
“When high accuracy is required or complex systems are studied, one is severely
limited by the amounts of computing time that is needed. Large amounts of computation
Fig. 3.34. The Ising Spin Model. A two dimensional array of Ising Spins is illustrated. Similar to the model of a lattice of magnetic moments, the elements can have either up or down spin. Figures of three dimensional spin models can be found in [131].
are necessary because the accuracy obtained in a Monte-Carlo study is proportional to
1/√N, where N is the number of iterations of the algorithm. For many computations that
we wish to do . . . the cost of performing the computation on a general purpose processor
is prohibitive.” [118] In order to accelerate their computations, the physicists developed
a special purpose processor for performing Monte-Carlo simulations on a particular class
of problems, the three dimensional Ising models. Despite its modest cost, this machine
is faster than the fastest supercomputers on the one particular problem for which it was
designed. Details of the algorithm required and the architecture of the process developed
can be found in [118, 76]. Here, the interest is specifically in the uniform random number
generation techniques which were developed.
3.4.1 Hoogland, Spaa, Selman, and Compagner
To generate uniformly distributed random numbers, Hoogland employed a Feed-
back Shift Register algorithm [76] which consists of a 127-bit shift register with feedback
on the input of the first bit. If we denote the nth bit of the sequence by x_n, where
n = 0, 1, 2, . . . , 15, the Feedback Shift Register algorithm used is [76]:

x_n = ( x_{n+(127−2(16−1))} + x_{n+(127−1(16−1))} ) mod 2 = ( x_{n+97} + x_{n+112} ) mod 2    (3.1)
Selecting values for p and q of Figure 3.35 which produce the maximum length
sequences, the maximum non-repeating period is 2^127 − 1 [76]. The 32-bit random num-
bers are selected out of the 127-bit sequence at intervals of 32 clock cycles. Figure 3.35
illustrates the circuit used to accomplish the random number generation. Every 32 clock
cycles produce a new 32-bit random number. An accelerated version which produces a
new uniformly distributed random number every 2 clock cycles is shown in Figure 3.36.
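The recurrence of Equation 3.1 can be sketched as a software feedback shift register. In this illustration the register length p = 127 follows the text; the tap position q = 97 is our reading of Equation 3.1 (127 − 2(16 − 1) = 97) and should be checked against [76] before serious use, since only taps forming a primitive trinomial yield the maximum 2^127 − 1 period.

```cpp
#include <bitset>
#include <cstdint>

// Two-tap (Fibonacci) feedback shift register after Figure 3.35: the new
// bit entering position 1 is the modulo-2 sum of the bits in positions
// q and p. The tap choice (p = 127, q = 97) is an assumption; see text.
struct FeedbackShiftRegister {
    static const int P = 127;
    static const int Q = 97;
    std::bitset<P> reg;

    explicit FeedbackShiftRegister(std::uint64_t seed) {
        for (int i = 0; i < 64; ++i) reg[i] = (seed >> i) & 1;
        reg[64] = 1;                        // guarantee a non-zero state
    }

    // One clock cycle: shift the register and insert the feedback bit.
    int shift() {
        int out = reg[P - 1];
        bool fb = reg[P - 1] ^ reg[Q - 1];  // modulo-2 sum of the taps
        reg <<= 1;
        reg[0] = fb;
        return out;
    }

    // One possible read-out: collect 32 output bits over 32 clock cycles,
    // matching the "new 32-bit number every 32 cycles" scheme.
    std::uint32_t next32() {
        std::uint32_t r = 0;
        for (int i = 0; i < 32; ++i) r = (r << 1) | std::uint32_t(shift());
        return r;
    }
};
```

Two generators seeded identically produce identical output streams, and a non-zero state is preserved across shifts, which is the property the all-zero-state caveat of Figure 3.35 warns about.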
3.4.2 Monaghan & Pearson, Richardson, and Toussant
Both Monaghan [106] and Pearson’s [118] random number generation work di-
rectly parallels Section 3.4.1, except that the random number generated is either 8 or
24-bits wide respectively. One of the most salient points of Pearson’s work is not the
discussion of the success of the random number generator described above, but of their
earlier failure with a different Random Number Generator (RNG) which was based on
the Linear Congruence Algorithm [87]. During testing, small but significant discrepan-
cies were discovered after comparisons with known results. The average magnetization in
Fig. 3.35. A General 2-bit Feedback Shift Register. A general 2-bit feedback shift register of p bits is used to generate a random number of L bits. From the initial state of the register, a sequence of p bits, the next state is produced by inserting the modulo-2 sum of the feedback bits in positions q and p into position 1 of the shift register. In the shift register, the original bit in position 1 is shifted into position 2, and so on up to position p, the original contents of which is lost. If all bits in the register are initially zero, the shift register will remain in that state forever. If the shift register progresses through the other 2^p − 1 states before repeating, a maximum length sequence is produced. After L shifts, the contents of the L positions used to generate the random number are completely refreshed and the next random number can be read [76].
Fig. 3.36. Random Number Generator. The actual design implemented by Hoogland differs from Figure 3.35 in that this design produces one random number per two clock cycles, instead of every 32 clock cycles, where L = 32. This design is composed of 16 8-bit shift registers and additional logic to compute the L required bits in parallel. The result of the single synchronized shift in each register coupled with the 16 feedback circuits is equivalent to 16 shifts of the circuit of Figure 3.35. The random number for this circuit is read from bit positions 96 through 127 [76].
the high temperature phase and at zero magnetic field of the Ising model is strictly zero.
However, long runs with different addends using the original linear congruence
random number generator produced magnetizations as large as 0.01, and these spurious results
were reproducible for different random number seeds [118]. As a result, the random
number generator was redesigned following Figure 3.35. With the new generator design,
which is described above, the tests yielded consistent results with zero magnetizations.
3.5 Reduction Buses
Some multi-processor implementations include a specially designed bus which
fosters a combination of both communications and computation. These buses are often
referred to as reduction buses. The CM-5 [77] contains a reduction bus tying its separate
processing units together. Another reduction bus which has a relatively high level of
functionality is the Parallel Reduction Network developed by Reynolds [116]. This bus
is the subject of Section 3.5.1.
3.5.1 Parallel Reduction Network
Reynolds [116] proposed a Parallel Reduction Network (PRN) bus which both
computes and disseminates different binary, associative operations across state vectors
of values. State vectors, composed of subcomponent values, are passed through the
reduction bus. Simultaneously traveling instructions can request that all of the first
components of the state vector be added together, and that all of the second components
of a 2 component state vector be Or-ed together. In the network, the hardware reads
state vectors of size m, computes m globally reduced values, and writes a globally reduced
state vector. Separate auxiliary processor units interface with the PRN to handle the
expected load of traffic.
Figure 3.37 illustrates a binary reduction tree of depth log2 n, where n is the num-
ber of processors in the multi-processor network. Each node of the PRN is an Arithmetic
and Logic Unit (ALU) with some additional logic for tagged selective operations. Global
reduction operations can be computed and disseminated in O(log n) time.
A single ALU is illustrated in Figure 3.38. The ALUs perform binary associative
operations on the two inputs based on a programmed operation code which accompanies
the state vectors as they flow through the PRN. The operations include sum, minimum,
maximum, logical AND, logical OR, etc. Operations including minimum and maximum
support tagged selective operations. A tag, chosen by the selector, accompanies the
winning value of the binary operation. Additionally, an error check is performed on
the two incoming opcodes. If they do not match, an error condition is set in the tag
registers denoting the problem with the resulting state vector. The PRN also pipelines
the reduction operations at a rate which equals the delay time of each stage.
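The reduction pass can be emulated in software. The sketch below is our behavioral model of a binary tree of ALUs, not Reynolds' hardware: it performs a tagged minimum, where each ALU forwards the smaller input together with its tag, so the root yields both the global minimum and the identity of the node that produced it.

```cpp
#include <cstddef>
#include <vector>

// A value flowing through the PRN together with its selector tag.
struct Tagged {
    long value;
    int tag;
};

// One reduction pass for the "minimum" opcode: pair up the current level,
// keep the winning (smaller) value and its tag, and repeat until the root
// ALU holds the globally reduced result. Assumes a non-empty input.
Tagged prn_reduce_min(const std::vector<long>& locals) {
    std::vector<Tagged> level;
    for (std::size_t i = 0; i < locals.size(); ++i)
        level.push_back(Tagged{locals[i], static_cast<int>(i)});
    while (level.size() > 1) {
        std::vector<Tagged> next;
        for (std::size_t i = 0; i + 1 < level.size(); i += 2)
            next.push_back(level[i].value <= level[i + 1].value
                               ? level[i] : level[i + 1]);
        if (level.size() % 2 != 0) next.push_back(level.back());
        level.swap(next);
    }
    return level[0];  // the result is then broadcast back to all APs
}
```

Reducing the vector {7, 3, 9, 1, 5} yields value 1 with tag 3, identifying processor 3 as the holder of the minimum.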
Reynolds’ design has both advantages and disadvantages:
1. Advantages
• Allows the simultaneous computation and dissemination of data throughout
the network.
• The computation does not impact the processors.
2. Disadvantages
• May suffer from geometric constraints in large networks.
Fig. 3.37. Parallel Reduction Network. Reynolds [116] designed a reduction bus referred to as the Parallel Reduction Network (PRN) which is presented as a k-ary tree of depth log_k n, where n is the number of processors in the network. Each node of the tree is an Arithmetic and Logic Unit (ALU) with some logic for tagged selective operations. Each Auxiliary Processor (AP) has sets of memory-mapped input and output registers. The PRN reads values from the input registers and writes the corresponding globally reduced results to the AP output registers. An interlock mechanism prevents memory access contention. The tree allows a global reduction operation to be computed and disseminated in O(log n) time.
Fig. 3.38. PRN Arithmetic and Logical Unit Node. A single Arithmetic and Logical Unit (ALU) node of the PRN network of Figure 3.37 is illustrated. The ALUs perform binary operations on two inputs based on a programmed operation code which accompanies the inputs; operations include sum, minimum, maximum, logical AND, logical OR, etc. Each input register of the ALU is paired with a Tag register. The ALU supports tagged selective operations whereby a tag, chosen by the selector, accompanies the winning value of each binary operation. An error check is performed on the two incoming opcodes. If they do not match, an error condition is set in the tag registers denoting a problem with the resulting state vector.
Chapter 4
Software Traffic Simulation
The primary focus of this work is the creation of an architecture for a non-
deterministic traffic simulation machine. In order to clearly concentrate acceleration
efforts, software models were used as a guide for the hardware development. The soft-
ware models included in this paper were studied in three phases. In determining whether
this project was worth pursuing, the small, simple code modules of Section 4.1 are used to
establish what types of speedup can be obtained to accelerate discrete event simulation.
Once established by the initial publications [24, 25, 26] that this simulator work is both
desired and justified, a study of a representative and well established traffic simulator,
CORSIM, was undertaken. This study is described in Section 4.2. Since CORSIM is not
an open source, free software simulator, sharing verifiable results is not practical using
CORSIM as a standard for comparison. Other possible candidates for study are rejected
for similar reasons. The final stage of the simulator work did require a system to verify
the accuracy of the selected Scheduler algorithm employed in Section 7.2.3. Therefore, as
a separate effort, the simulator Trafix was created. Unlike other conventional simulators,
Trafix is open source, free, and modular. The Trafix simulator is briefly described in
Section 4.3.
4.1 Event Generation & Queue
As part of the initial studies used to gauge the effectiveness and direction of
the selected approach, the event generation and the event queue of Figure 1.1 were first
examined. An event generator was constructed using software which was then translated
to a reconfigurable logic implementation. The same sequence was performed on the event
queue segment. In Section 4.1.1, the event generation software was first implemented
following the methods applied in [141]. This method facilitated a fine-grained, parallel,
systolic hardware implementation in Section 7.2.1.
In Section 4.1.2, the Event Queue software applied standard GNU C++ classes to
manage both the event queue and the random distribution calculations. Therefore the
event generation code was re-written in Section 4.1.2, so that standardized code could
be applied, and attention focused on the queuing software.
4.1.1 Event Generation Software
An abbreviated software outline is listed in Table 4.1. In the C++ code, first,
the Poisson event arrival offset, τ , is calculated according to Equation 4.1 [141]. In
Equation 4.1, Ω1, or rand1 in the code, is an independent random variable uniformly
distributed over [0,1). λ, or LAMBDA in the code, is the object or event arrival rate.
The event generator dynamically allocates space for the new event, s, and enqueues the
s object.
τ = −(1/λ) log Ω1    (4.1)
The resulting τx values generated by this equation can be seen in Figure 4.1. τ0
is the distance from the beginning of the timeline to the first event. τ1 is the distance
from the beginning of the first event to the beginning of the second event, and so on.
New event arrival times are calculated by adding the arrival offset, τ , to the previous
event arrival time. The clock is then advanced to the new event arrival time. Service
events which overlap at the end of the current timeline segment are carried over into the
next segment.
The Poisson service time, σ, is calculated according to Equation 4.2 [141]:
σ = −(1/µ) log Ω2    (4.2)
where µ, given as an average number of events per second, is the object or event
service rate. µ is the same as MU in Table 4.1. The σx values generated by Equation 4.2
are illustrated in Figure 4.1 to be the offsets from the beginning to the end of event x.
Ω2, or rand2 in the code, is also an independent random variable which is uni-
formly distributed over [0,1). The service time offset, σ, is added as an offset to the
event’s arrival time to determine the end of the event’s service time. Event resources
are released at the end of this service time. Both σ and τ are independent and expo-
nentially distributed. In software, to allocate memory and then generate Poisson arrival
and service times requires approximately 30500 nanoseconds on an Ultra Sparc.
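Equations 4.1 and 4.2 translate directly into code. The following sketch, whose names and seed are ours, builds a timeline by the inverse-transform method, drawing the uniform variates on (0, 1] so the logarithm is always defined.

```cpp
#include <cmath>
#include <limits>
#include <random>
#include <vector>

// One simulated event: its arrival time and the end of its service time.
struct Event {
    double arrival;
    double service_end;
};

// Generate n events: tau = -(1/lambda) log(Omega1) gives the next arrival
// offset (Eq. 4.1) and sigma = -(1/mu) log(Omega2) the service duration
// (Eq. 4.2); both offsets are exponentially distributed.
std::vector<Event> generate_timeline(int n, double lambda, double mu,
                                     unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> omega(
        std::numeric_limits<double>::min(), 1.0);  // avoid log(0)
    std::vector<Event> events;
    double clock = 0.0;
    for (int i = 0; i < n; ++i) {
        double tau   = -std::log(omega(rng)) / lambda;  // arrival offset
        double sigma = -std::log(omega(rng)) / mu;      // service time
        clock += tau;                 // advance clock to the new arrival
        events.push_back(Event{clock, clock + sigma});
    }
    return events;
}
```

The arrival times are non-decreasing by construction, and each service ends no earlier than its arrival, reproducing the timeline structure of Figure 4.1.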
Fig. 4.1. Simulation Timeline Generation. Each succeeding arrival starts at an offset of τx from the previous arrival. Similarly, each service time σx is an offset from the x event's corresponding arrival time. These dependencies which constrain event arrival time and event service time generation appear to prevent speedup through parallelism.
// allocate the new event and compute its Poisson arrival offset (Eq. 4.1)
event* s = new event();
clock = s->arrival = clock - (1/LAMBDA)*log(rand1);
// service time is an exponential offset from the arrival time (Eq. 4.2)
s->service = - (1/MU)*log(rand2) + s->arrival;
Table 4.1. Event Generation Code I. The initial event generation implementation followed Walrand [141], creating random arrival and service times as fine-grained parallel discrete steps in a systolic array. The approach is also illustrated in the hardware event generation block diagrams of Section 7.2.1.
4.1.2 Event Queue Software
This section focuses on the software used for the service queue. The software
version is implemented as a GNU LIBG++ XPPQ Priority Queue class. In the C++
software simulation, the time required for the insertion and extraction of events to and
from the event queue increases as the queue strays from its optimum size. The proposed
hardware queue speed, on the other hand, is not affected by its size, and provides a
speedup of 102 over the software model.
The software simulation model used for comparison is written in C++ and is
illustrated in Tables 4.2 and 4.3. Some additional processing is performed when the
event data structure is allocated. The arrival and service queues are maintained as a
single heap data structure, unlike the proposed dual queue hardware mechanism of the
processing elements for the proposed architecture which are described in Section 7.2.
while (create_event_cnt <= num_events) {
    create_event_cnt++;
    arrival_time = clock + rnd1();
    service_time = rnd2();
    Event_Class *s =
        new Event_Class(arrival_time,
                        service_time);
    queue->enq(*s);
}
Table 4.2. Event Generation Code II. The Event Generation code allocates an event with an arrival time which is a random offset from the previous event's arrival time. The service time for the event is then selected to be a random offset from its own arrival time. The two random values need not necessarily use the same statistical distribution. The event is also constructed to randomly require resources when it is executed by the scheduler. This code differs from Table 4.1 in that GNU LIBG++ standard classes are applied. Table 4.1 creates its random offsets using distribution methods from Walrand [141].
while (events <= num_events) {
    events++;
    // included in speedup test
    Event_Class event = queue->deq();
    if (event.getArrival() == true) {
        if ((event.res.get_a() <= a_resource_counter) &&
            (event.res.get_b() <= b_resource_counter)) {
            a_resource_counter -= event.res.get_a();
            b_resource_counter -= event.res.get_b();
            event.SetNextArrivalTime();
            // push service event
            // enq included in speedup test
            queue->enq(event);
        } else {
            if (event.res.get_a() >= a_resource_counter)
                block_a++;
            if (event.res.get_b() >= b_resource_counter)
                block_b++;
            // requeue a replacement event
            arrival_time = clock + rnd1();
            service_time = rnd2();
            Event_Class *arriv = new Event_Class(arrival_time,
                                                 service_time);
            // enq not included in speedup test
            queue->enq(*arriv);
        }
    } else {
        // service event: return its resources to the available pool
        a_resource_counter += event.res.get_a();
        b_resource_counter += event.res.get_b();
    }
}
Table 4.3. Event Queue Loop Code. The arrival and service queues are maintained as a single heap data structure, unlike the proposed dual queue hardware mechanism illustrated in Figure 7.3. If the dequeued event is an arrival event, then the resources available are compared against the resources required by the event. If the required resources are available, a service event is enqueued. If resources are unavailable, the event is recorded as a blocked event. When a service event is dequeued, its resources are returned to the available resources pool. To gather accurate timing results, the number of events in the event queue is kept constant. The extra time used to generate additional arrival events in order to maintain the queue size is not included in the speedup plot of Figure 7.12.
4.2 CORSIM: An Established Software Simulator
As part of the effort to develop a profile of a traffic simulator, CORSIM (COR-
ridor SIMulator) was selected as a representative software simulation model. CORSIM
microscopically models vehicular traffic flows and emissions, and accounts for pedestrians.
Developed in Fortran by the Federal Highway Administration (FHWA), CORSIM is
part of the TRAF family of simulation models. [CORSIM] combines TRAF-NETSIM,
a simulation model of non-freeway traffic, and FRESIM, a simulation model of freeway
traffic [37]. NETSIM, the older of the two simulators, grew out of the Urban Traffic
Control System, developed for mainframes in the early 1970s. The CORSIM model
and its components comprise one of the first traffic simulation environments of its kind.
CORSIM has been widely used in the traffic engineering community and claims to have
been calibrated and validated in a wide variety of traffic and highway design conditions.
The FHWA granted special access to study and to evaluate the CORSIM source code as
part of the research generated for use in this thesis and its related publications.
In current applications, CORSIM is used to evaluate alternatives planned for
highway networks [37, 41]; it may be used, for example, to evaluate new traffic signal
optimization strategies. The runtime required by the simulator has caused CORSIM to
be used in off-line applications only. Real time applications, however, are becoming more
prevalent in transportation engineering, and in such applications, speed is critical. A
study of CORSIM runtime characteristics determined that the processor tended to dwell
in simulation scheduling and overhead routines [27]. Therefore, attempts to accelerate
traffic simulation need to accelerate or eliminate overhead and event scheduling.
4.2.1 CORSIM Function Categories
Following Figure 1.1, CORSIM functions were categorized according to the Ta-
ble 4.4 classifications. The event generation, event list, scheduling, and timer classifi-
cations are derived from the simulation model depicted in Figure 1.1. Two additional
categories of overhead and statistics are not depicted in the figure, but are required by
simulators.
4.2.2 NT versus Linux
The CORSIM code was delivered as a win32 based software system which compiles
and runs under the Microsoft Fortran Compiler. This compiler is a Fortran 90 compiler
which is integrated with Microsoft’s Developer Studio. The compiler is now maintained
and developed by Compaq. As CORSIM was delivered using the Microsoft compiler,
there was strong incentive for our decision to profile under NT.
In order to study the code better, and as part of its procurement, the source was
ported to the Linux operating system. A code translation program called VAST/f90
from Pacific-Sierra Research was selected as the compiler under Linux. The VAST/f90
system translates the Fortran 90 code to an intermediate Fortran 77 version. Then
GNU’s g77 compiler is called to compile the intermediate result. The VAST/f90 system
uses a library of routines to emulate Fortran 90’s memory allocation and deallocation
routines. Some of the advantages of the VAST/f90 compiler include its target operating
system, Linux, and its freely available personal version. The incorporation of the g77
compiler provides the binaries produced by VAST/f90 with the advantages of profiling
event list: Majority of code queues or dequeues objects waiting for an event. Handles traffic queues at lights and on links, as well as other queues of data structures.

event generator: Functions which generate or sink vehicles in the simulation.

overhead: Functions which include general mathematics calculations, memory allocation, data error checks, etc.

scheduling: Majority of code executes and schedules events. Traffic routing across multiple road links as well as within each link and intersection. This classification also handles traffic signal events and pedestrian action.

scheduling/event list: Functions which are combinations of the scheduling and event list categories in an approximately 50%-50% mix of functionality.

shutdown: Simulation shutdown functions, such as flushing data and closing out files.

statistics: Functions which calculate general traffic statistics and results. These statistics include vehicle speed, stops, delays, hours of travel, miles of travel, fuel consumption, and emissions.

timer: Functions which control the simulation clocks and timers.
Table 4.4. CORSIM Function Classifications. The CORSIM functions were classified according to these eight categories. CORSIM functions often contained a myriad of category functionality, but were classified according to the majority of the function code. In some cases, the subroutines performed approximately 50% routing and 50% event list work, so an additional combinational category was added. The event generation, event list, scheduling, and timer categories are derived from Figure 1.1.
using gprof and debugging with gdb, the GNU debugger. VAST/f90’s disadvantages
include significant delays during memory allocation and deallocation.
Unlike NT, profiling under Linux includes operating system and compiler related
functions. To compare the profiling results, the non-CORSIM Linux overhead func-
tions were removed from their respective profile before the chart and graph values were
computed.
4.2.3 CORSIM Profile
CORSIM was profiled under the NT operating system as a stand-alone application
without the rest of TSIS, and again under Linux. Perl scripts were written to parse the
resulting profile data for both operating systems. A single classification file, used for
both the NT and Linux functions, was parsed by multiple profiling scripts to maintain
common criteria among the profiling data sets. The runtime statistics from 20 CORSIM
traffic models were then averaged and joined with the classification categories based on
the CORSIM function names. The pie chart data in Figures 4.2 and 4.3 are the result
of these scripts. Additional Perl scripts were written to parse and compute the simulation
runtime statistics of Table 4.5.
CORSIM functions were examined and categorized according to the eight clas-
sifications described in Table 4.4. CORSIM functions often contain code which falls
under more than one of the classifications listed. Therefore, the functions were classi-
fied according to the major functionality of their code. There were some subroutines
whose functionality was approximately a 50%-50% split between scheduling and event
queue maintenance, so these functions were put in their own classification, called schedul-
ing/event list. A large percentage of CORSIM functions included in the overhead cate-
gory were devoted to parsing and error checking the simulation input data files. Instead
of having generalized functions to bounds check and parse, there were usually specialized
functions for each input line type in the .trf file.
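The generalized alternative can be sketched as follows. This is an illustrative sketch only, not CORSIM code: the routine name, the field names in the usage below, and the error handling are all hypothetical, and the actual .trf record formats are more involved.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical generalized bounds-checking parser, sketched as a contrast to
// the specialized per-line-type routines described above. Reads one
// whitespace-delimited integer field and verifies it lies in [lo, hi].
long parse_bounded_field(std::istringstream& line, long lo, long hi,
                         const std::string& field_name) {
    long value;
    if (!(line >> value))
        throw std::runtime_error("missing field: " + field_name);
    if (value < lo || value > hi)
        throw std::runtime_error("field out of range: " + field_name);
    return value;
}
```

A single routine like this, driven by a table of field names and bounds, could validate every numeric field of every input line type instead of duplicating the parsing logic in dozens of specialized functions.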
Our study of CORSIM is motivated by our desire to determine the bottlenecks of
the simulation model illustrated in Figure 1.1. After the CORSIM functions were cate-
gorized, the profiling data derived using the NT tools, PREP, PROFILE, and PLIST,
was joined with the function classifications and reduced to yield the pie chart shown
in Figure 4.2. This figure illustrates the percentage of CORSIM runtime devoted to
each category of simulation function. When run under the NT operating system, COR-
SIM dwells mostly in its scheduling and overhead functions. Therefore, the simulation
architecture proposal must carefully consider their acceleration.
Table 4.5 lists the simulation models from the Georgia Institute of Technology’s
Civil Engineering CORSIM repository which were used to compute and average the
CORSIM simulation results. The listed runtimes indicate values derived from their
respective profiling datasets. Note that at the time these files were generated, code
optimization flags were active under NT but not under Linux due to a compiler bug;
the code is therefore expected to execute faster under Linux once compiler optimizations
become available.
Model      Runtime Linux   Runtime NT
actctrl         5.32          4.71
ca101001        9.77         10.08
ca101002       10.36         10.31
intch1         12.70          9.81
mmbs           10.04         10.16
mmd1            8.84          9.65
mmf1            9.39         10.22
mmp1            9.99         10.06
mmp2            9.86         10.12
mms1           10.28         10.18
opt 80         32.88         12.48
proj3          33.36         12.82
rabs           37.23         26.80
rad1           33.60         29.31
raf1           33.47         28.99
rap1           34.78         25.19
rap2           33.31         29.30
ras1           33.50         29.20
scen1         259.42        193.62
scen3         228.26        173.20
Averages       40.78         31.25
Table 4.5. CORSIM Runtime Under Linux and NT The Georgia Institute of Technology CORSIM repository models which were used to test CORSIM under both Linux and NT are listed along with their respective runtimes. The NT profiler only provides results for the CORSIM functions, so the operating system and compiler generated functions were culled from the Linux results. Note that although the NT results appear to run faster by about 16.5%, these results indicate the amount of CPU time used by the simulation, and are not necessarily the duration the user waited for their results. For example, the time values neglect operating system and compiler function overheads which will be present under both NT and Linux. All times shown are in seconds.
Pie chart data: scheduling (46%), overhead (44%), event_list (6%), statistics (4%), scheduling/event_list (0%), event_generator (0%), timer (0%), shutdown (0%).
Fig. 4.2. Profile Chart of CORSIM on NT Illustrated are the percentages of CORSIM runtime used by eight categories of simulation functions when run under the NT operating system. The graph represents an average of 20 simulations from Table 4.5 run under the NT operating system and profiled using NT's profiling tools. The CORSIM functions were classified into the eight categories of scheduling, scheduling/event list, event list, timer, event generator, overhead, statistics and shutdown. These categories are described in Table 4.4.
The graphs in Figures 4.2 and 4.3 illustrate the importance of accelerating the
CORSIM overhead and scheduling categories. A new proposed simulation accelerator
must focus on these two simulation components.
The first CORSIM category, overhead, is dominated by its data integrity routines
which read data from input files, verify that data, and then store the results for later
retrieval. The proposed architecture assists in alleviating much of the overhead required
by CORSIM. For example, with the reconfigurable logic approach, the simulator system
must be configured before it is used, and error checking on the input data occurs once
during initialization. The network of roads and the scheduling and routing algorithms
need to be implemented in reconfigurable hardware before the system starts the simu-
lation. Much of the data which is input into the CORSIM simulation is configured as
Pie chart data: scheduling (56%), overhead (26%), event_list (9%), scheduling/event_list (9%), statistics (1%), event_generator (0%), timer (0%), shutdown (0%).
Fig. 4.3. Profile Chart of CORSIM on Linux This chart is the same as the graph presented in Figure 4.2, but performed under Linux. The same dataset of 20 simulation models was run and averaged to create the pie chart, which depicts CORSIM categories according to their dwell time. The GNU gprof profiler was used to produce this data chart.
hardware in the proposed simulator. The setup is based on selections from available
configurations or sub-configuration model segments.
The version of CORSIM provided with version 4.2 of the TSIS package, having
been constructed over time, is not modular in its software functionality. Routines reg-
ularly blend input data integrity, event list handling, and event scheduling functions.
CORSIM source code is not generally publicly available for research study and compari-
son. For these and a myriad of other reasons, a second simulator, Trafix, was developed
for modeling the traffic scheduling software functionality. Trafix was developed in C++
and is open source. Its development models follow the research provided in [146].
4.3 Trafix: A Road Traffic Simulator
In working with road traffic simulators, it becomes immediately clear that the
traffic research and management community requires a standard open source, free traf-
fic simulator. The simulator should also have a standard general input file format, so
that simulations can be easily tested under various simulators for verification. The free
software, open source approach allows researchers to test various traffic theories on a
standardized software platform in a reliable and reproducible fashion. Further, an inex-
pensive system can provide smaller municipalities with access to a tool for making their
own local road networks and traffic signal timing schemes more efficient. Maximizing
traffic throughput and minimizing delay benefits trade and tourism both domestically
and internationally, and the potential windfall is large.
The Trafix simulator was developed in C++ with the GNU gcc compiler under the
GNU-Linux operating system. The program uses Xfig, which has created its own input
file format standard, to generate an input file describing the simulation road network.
The code is written to be modular so that various components can be replaced as the user
community requires. The overall modular design concept is outlined in Figure 4.4.
For instance, attempts have been made to allow the code to be easily changed in the
future, removing the current dependence on Xfig input files. Trafix displays its animated
output in X windows as illustrated in Figure 4.5. In addition to using Xfig for its input
and X windows for output, Trafix employs the STLPORT Standard Template Library
(STL) routines wherever expedient to foster code reuse, which is intended both to
improve efficiency and to reduce errors.
The STLPORT library is a more portable, cross-platform version of Silicon
Graphics' Standard Template Library (SGI STL). The SGI STL maintains the current implemen-
tation of the Hewlett-Packard Company's original version of the STL. Trafix employs the
STL container classes whose interfaces are described by Bjarne Stroustrup [134]. These
classes include standard library container objects and iterators which are used to main-
tain various simulation objects. For example, in Trafix, to maintain a list of vehicles, an
STL vector of vehicles is generated. This vector can then be stepped through by use of
the vector class iterator.
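The pattern just described can be sketched in a few lines; the Vehicle struct below is a minimal stand-in for the real Trafix class in vehicle.cc, which carries far more state.

```cpp
#include <vector>

// Illustrative stand-in for a Trafix-style vehicle record.
struct Vehicle {
    int id;
    double position;  // distance along the road, in meters
};

// Step through an STL vector of vehicles with the vector class iterator,
// as described above, advancing each vehicle's position.
void advance_all(std::vector<Vehicle>& vehicles, double delta) {
    for (std::vector<Vehicle>::iterator it = vehicles.begin();
         it != vehicles.end(); ++it)
        it->position += delta;
}
```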
The STL container classes provide programmers with a variety of benefits. In-
dividual containers are simple and efficient. Each container provides a set of standard
operations with standard names and semantics. Individual container classes may also
provide operations which are specific to a particular container class. The same classes
also provide a set of standard common iterators to efficiently access the container mem-
bers. The container classes are non-intrusive; that is, the objects need not be modified
to be stored within the containers. The containers each take an allocator argument
which can be used as a handle for implementing services for every container. The allo-
cator greatly eases the provision of universal services which may include persistence and
object I/O. The STL benefits are further enumerated in [134].
At this time, Trafix forks two processes. The first process displays the two win-
dows, each containing maps. One window holds input map symbols which have been
used to generate the simulation, and the second depicts the background map for the
animated traffic display. The first process is intended to eventually migrate into a more
suitable user interface as community interest materializes. The second process handles
Simulation layer: Timeline, Road Q, Intersection Q
Symbolic layer: Road, Place, Intersection
Physical layer: Xfig objects and attributes; Xwindows objects
Fig. 4.4. The Trafix Software Structure Written to be modular, the Trafix software is composed of three levels. The bottom, physical, level allows Trafix to interface with its input and output systems. As currently written, Trafix reads its input from Xfig files and animates its output in X windows displays. The middle symbolic layer serves as an intermediate level between the simulator and its physical files and converts raw data into conceptual objects. These objects include roads, intersections, and places. The top layer consists of simulation objects including such concepts as timelines, road queues, and intersection queues.
Fig. 4.5. The Trafix Display A view of the Trafix simulator is illustrated. Vehicles are moving from the left and bottom lanes to the top and right. Turning decisions at the second intersection depend on the vehicle's randomly assigned destination. Cars, buses, and trucks are represented as different sized and colored boxes.
the animation of the vehicle traffic. The animation is visible only on the single graphic
map display. Trafix simulates car, bus and truck traffic moving through intersections
and along roads.
An input example corresponding to the Trafix network displayed in Figure 4.5 is
illustrated in Figure 4.6. The figure shows an Xfig editor with a Trafix input map drawn.
The Trafix library includes sources, destinations, and intersections. Vehicle sources are
drawn as triangles, vehicle sinks are circles and the three-way intersections should be
obvious from the drawing context. The Trafix software parses the Xfig drawing files
and generates input geometric objects based on the Xfig map drawing. These Xfig data
objects consist of the raw data which define the Xfig drawing. These raw data types
include such geometric shapes as lines, circles, ellipses, compounds, etc. which are all
based on the Xfig input file format. Source, destination, intersection, and road attributes
are included in the drawn object comments.
Once the raw, physical layer objects are created from the parsed Xfig drawing
file, the geometric shapes are used to compute symbolic road network objects. Drawn
lines are used to create road objects. The Trafix library compounds composed of boxes,
triangles, and circles are used to generate source, destination, and intersection objects.
Trafix performs some network error checking to ensure that the network is fully connected.
A symbolic traffic map is created and used to generate a third level of simulation data
types as shown in Figure 4.4. This third level of Trafix objects includes timelines, vehicle
queues, and intersection queues.
A Trafix simulation proceeds by executing an event loop. The interior of the loop
is divided into two stages. The first stage of the loop checks network source nodes from
Fig. 4.6. Trafix Input Environment The Trafix input environment incorporates Xfig, an open source drawing package. Xfig allows user generated libraries of objects, which Trafix supplies. For Trafix, the current library consists of two, three, and four-way intersections, along with source and destination nodes. Roads consist of drawn lines. Object characteristics are included in the object comments. A road speed limit can be set by adding a speed attribute in the drawn road line comment. Xfig comments are not visible by default, but can be accessed using the Edit tool on the Editing modes tool bar. Object names are set either by including a name attribute in the object comments or by creating a compound consisting of a text object with the desired name along with the selected object to receive the name. Using compounds to add names to objects allows the name to be visible on the drawn map. Intersections of degree greater than four must be made by combinations of smaller intersections. Intersections are composed of boxes. Roads to and from the intersection must end inside of the intersection's associated peripheral boxes, one road per peripheral box. Source nodes, which generate vehicles, are represented by triangles, and destination nodes, represented by circles, are vehicle sinks. Roads leading to or exiting from a source or sink must have one end within the source or destination object.
which vehicles erupt at particular simulation time cycles. Each source node contains a
timeline segment of vehicle events consisting of their arrival times based on a user selected
random distribution. The arrival distribution characteristics are entered by the user as
attributes in the Xfig Trafix library source symbols. These erupting vehicle objects are
popped off the timeline and injected into the traffic network if there is sufficient space
on the road at the source node. If space is lacking, the event is blocked. The second
stage of the event loop processes all vehicles already in motion in the simulation network.
Vehicles in motion are updated in a time-driven fashion. Leaders are moved first. Vehicle
movement routines are located in the vehicle.cc source file.
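The two-stage loop can be sketched as follows. This is a heavily simplified illustration of the structure just described, not the actual Trafix code: the class names, the single-source network, the fixed headway, and the unit-distance movement model are all assumptions.

```cpp
#include <deque>
#include <vector>

struct Car { double position; };

struct Network {
    std::deque<double> timeline;      // scheduled arrival times at one source
    std::vector<Car>   moving;        // vehicles already on the road
    double             min_headway = 5.0;  // space needed to inject a car

    // A vehicle can be injected only if no car is within min_headway
    // of the source; otherwise the arrival event is blocked.
    bool space_at_source() const {
        for (std::vector<Car>::const_iterator it = moving.begin();
             it != moving.end(); ++it)
            if (it->position < min_headway) return false;
        return true;
    }

    void step(double now) {
        // Stage 1: pop arrivals whose time has come off the timeline,
        // blocking when there is insufficient space at the source.
        while (!timeline.empty() && timeline.front() <= now &&
               space_at_source()) {
            timeline.pop_front();
            moving.push_back(Car{0.0});
        }
        // Stage 2: time-driven update of vehicles in motion, leaders first
        // (the earliest-injected car, furthest down the road, comes first).
        for (std::vector<Car>::iterator it = moving.begin();
             it != moving.end(); ++it)
            it->position += 1.0;  // unit distance per simulation cycle
    }
};
```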
Trafix was created to allow verification of the car-following acceleration schemes
employed. No suitable open source traffic simulator was available when the project
started. The initial intent of Trafix is not to serve as a stand-alone finished traffic
simulation program. However, incidental materials were added as convenient, and hooks
are available in the software, such that the software has become a good starting point
towards a finished traffic simulator. The code was released and is available as a free
software, open source project with the intent of providing others with a valuable starting
point from which to improve. The Trafix simulator code is GNU-public licensed and
available from its web site at http://trafix.sourceforge.net.
The vehicle movement functions from Trafix which are used to verify the algo-
rithms and equations in Section 7.2.3 were tested and timed on a 600 MHz Pentium
III running SuSE Linux 7.0. The software runtimes measured are averages for each cited
vehicle movement function. The timing results are shown in Table 4.6.
Function                          Runtime in µs
vehicle_road_initialize               30.43
move_vehicle_on_road                  36.76
vehicle_intersection_initialize       23.19
move_vehicle_thru_intersection        48.40

Table 4.6. Scheduler Software Function Profile The four modular vehicle movement functions from Trafix were timed on a 600 MHz Pentium III running SuSE Linux 7.0. These functions are contained in the vehicle.cc source file. The times presented are in microseconds. The software bottleneck is in the intersection handling function. The time shown is the time elapsed during function execution. The functions are executed each time a vehicle is processed.
4.3.1 A Shared, Pooled Allocator
The design approach in Trafix generates two processes. One process is needed
to provide a fast response to the user. A second process is required to execute the
simulation event loop, and therefore becomes bound by it. These two processes need to
communicate easily. UNIX provides a variety of methods to allow cross-process commu-
nications including semaphores, pipes, and shared memory. A natural solution for this
application is to allocate the simulation objects in shared memory and allow both the
user and event loop processes access to the same objects. Then, if the user chooses to
change the scale of the map, the simulation process immediately sees the change and
can alter its calculations to adopt the newly selected map scale factors in its vehicle
movement computations.
In C++, local (auto) objects are stored on the run-time stack, symbols are bound
directly to the local objects, and storage management is performed via stack-based mark
and release strategies in which enough space to hold all locals is allocated, all at once,
upon entry to a block, and released upon exit. Globals or statics have similar properties.
Programmers who desire other access and lifetime control strategies must use the new
operator in order to create objects on the free store, and then explicitly manage their
storage. While block structuring accounts for some of the efficiency and flexibility advan-
tages of C++ over other languages, it is not without its cost. In many object-oriented
applications, difficulties arise in using block structures to obtain the desired effect in
controlling storage lifetimes, and requiring users to manually employ the new and delete
operators leads to error-prone results. Container classes offer an attractive alternative
to this manual manipulation of memory. Besides their value in organizing groups of ob-
jects as data structures, container classes are perhaps the best means for ensuring that
groups of objects have coexistent lifetimes. In other words, containers serve as a way of
extending the scope or lifetime rules of C++. Knowing that all objects created within
some collection exist, unless explicitly removed, until the collection itself is destroyed,
can help minimize a good deal of awkward and error-prone code [95].
The C++ language provides a programming concept, referred to as an alloca-
tor, which is used to insulate programmers from the details of physical memory. The
allocator provides standard methods and a standard interface for allocating and deal-
locating memory along with standard names of types used as pointers and references.
Further, the STLPORT library, which is the Standard Template Library employed by
Trafix, is well suited to employ a standard allocator. However, although it would seem
obvious that a programmer might need to allocate a container class in shared memory, a
shared memory allocator class was not readily available. The STL code is written to be
portable across platforms and operating systems. Memory allocation is very operating
system specific. Therefore, the desired shared memory allocator was not part of the STL
library. The STL classes are, however, properly written to accept and use an allocator
if one is provided.
Because the Trafix code required a C++ shared memory allocator class, one was
written; it now resides at http://allocator.sourceforge.net. A link from
www.stlport.org points to the allocator web-site. The allocator uses the standard mem-
ory template to allocate shared memory on the GNU-Linux operating system. The
allocator creates large blocks of shared memory and then manages the memory, allo-
cating and deallocating sub-portions of it as required by the STL container classes in
the form of chunks. The allocator keeps track of free memory using a bit vector. The
allocator works with the STLPORT container classes. Additional general information
on allocators can be found in [134].
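The STL-facing shape of such an allocator can be sketched as follows. This sketch is illustrative only, and the name PoolAllocator is hypothetical: the real allocator obtains its arena from Linux shared memory and tracks freed chunks with a bit vector, whereas this stand-in carves a static in-process buffer with a bump pointer and never reclaims memory.

```cpp
#include <cstddef>
#include <new>
#include <vector>

// Minimal C++11-style allocator sketch. A static buffer stands in for the
// shared memory block; note each element type T gets its own arena here.
template <class T>
struct PoolAllocator {
    typedef T value_type;

    static char        arena[1 << 16];  // stand-in for the shared memory block
    static std::size_t offset;          // next free byte in the arena

    PoolAllocator() {}
    template <class U> PoolAllocator(const PoolAllocator<U>&) {}

    T* allocate(std::size_t n) {
        // Align the bump pointer to pointer size for simplicity.
        offset = (offset + sizeof(void*) - 1) & ~(sizeof(void*) - 1);
        std::size_t bytes = n * sizeof(T);
        if (offset + bytes > sizeof(arena)) throw std::bad_alloc();
        T* p = reinterpret_cast<T*>(arena + offset);
        offset += bytes;
        return p;
    }
    void deallocate(T*, std::size_t) {}  // the real allocator reclaims chunks here
};
template <class T> char        PoolAllocator<T>::arena[1 << 16];
template <class T> std::size_t PoolAllocator<T>::offset = 0;

template <class T, class U>
bool operator==(const PoolAllocator<T>&, const PoolAllocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const PoolAllocator<T>&, const PoolAllocator<U>&) { return false; }
```

Because the allocator is passed as a template argument, an STL container draws all of its storage from the arena without any change to the container or element code, e.g. `std::vector<int, PoolAllocator<int> >`.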
Chapter 5
Analysis
Chapter 5 performs some basic mathematical analysis of simulation properties.
Section 5.1 uses mathematics to determine whether event versus time-driven simula-
tion yields results faster under particular constraints. Section 5.2 reviews some of the
constraints involved in wrapping a traffic map on an array of processors.
Deciding between running the simulator in event or time-driven mode is important
for simulations in which event processing is not continuous. As an example of non-
continuous event processing, consider the case of simulating telephone calls where the
calls are temporarily assigned virtual circuits within the communications network. For
this example, the circuits are the resources required by each event. An event generator
creates a sequence of calls which are placed into an event queue. The calls can be initiated
if their required circuits are available when they actually execute. The executed call
then temporarily depletes the circuit it uses from the available pool. However, in the
telephone simulation, the circuit does not require continuous adjustment or modification.
The execution of the telephone call simply needs to schedule a secondary event which
will return the circuits to the available pool for other calls to use when the current call
completes. Logic simulation is a similar example, although deterministic. When a gate
executes, it changes the state of its output signals, but it does not need continuous event
processing. The gate only needs attention when one of its input signals is scheduled to
change. Contrast these examples with traffic simulation. The event generator schedules
new vehicles to enter the traffic network. The vehicle arrival times are enqueued in the
event queue. At the appropriate time, the vehicle is popped off the event queue and
enters the traffic network. Once moving within the network, the vehicle’s acceleration,
velocity, and position require constant updates. Traffic is a type of simulation which
requires continuous event processing by the scheduler during every simulation cycle.
Since traffic needs attention every simulation cycle, the model naturally falls into
the time-driven mode of simulation. Virtual circuit telephone simulation and logic simu-
lation may be better suited for an event-driven model. Additional hardware, referenced
in Section 7.3.2, can be included in the simulator design to accelerate event-driven
simulation.
5.1 Event versus Time-Driven Simulation
The analysis in this section examines event versus time-driven simulation.
Section 5.1.1 illustrates the maximum speedup which can be expected from
running under an event versus a time-driven approach. Section 5.1.2 uses statistics to
find the solution point which optimizes runtime by selecting between an event versus a
time-driven mode.
5.1.1 Expected Advantage of Event vs Time-Driven Simulation
Felderman [52] compares two distributed processing methods which are analogous
to event and time-driven simulation. One method is asynchronous and the other is
synchronous. In the synchronous method, after each subtask is completed, all processes
must reach a barrier before being allowed to process the next subtask. The synchronous
method is analogous to the time-driven approach. The asynchronous method does not
require the barrier; individual jobs run to completion as fast as they progress. Felderman
demonstrates that the asynchronous method has an expected potential speedup over the
synchronous method of no more than ln P, where P is the number of processors used.
So the speedup gained from event-driven processing over time-driven processing will be
no more than ln P. For example, with P = 1024 processors, the expected advantage is
at most ln 1024 ≈ 6.9.
5.1.2 Decision between Event vs Time-Driven Modes
When confronted with a network of random event generators, the time of the next
expected event can be calculated using order statistics. Frequently, an objective
is to determine the fastest car in a race or the heaviest mouse among those fed a certain
diet [126]. Similarly, random variables can be ordered according to their magnitudes.
For this work, the shortest expected arrival time must be found in order to determine
which simulation approach, time-driven or event-driven, is the most appropriate.
Let X1, X2, . . . , Xn denote independent continuous random variables which have
distribution functions shown in Equation 5.1.
F1(x), F2(x), . . . , Fn(x) (5.1)
The distribution functions of Equation 5.1 have the corresponding density functions of
Equation 5.2.
f1(x), f2(x), . . . , fn(x) (5.2)
Ordered random variables, Xi, are denoted X(1), X(2), . . . , X(n) where X(1) ≤
X(2) ≤ . . . ≤ X(n). Because the random variables are continuous, ties occur with
probability zero, and the equality signs can be dropped. So the maximum value of the
Xi is

X(n) = max(X1, X2, . . . , Xn)     (5.3)

and the minimum value is

X(1) = min(X1, X2, . . . , Xn)     (5.4)
For this work, the goal is to determine the minimum next expected event time, which is
X(1). The density function of X(1), denoted g1(x), can be found as follows:

P[X(1) ≤ x] = 1 − P[X(1) > x]
            = 1 − P(X1 > x, X2 > x, . . . , Xn > x)
            = 1 − [1 − F1(x)][1 − F2(x)] · · · [1 − Fn(x)]     (5.5)
Taking the derivative of both sides yields the density function,
g1(x) = f1(x)[1 − F2(x)] · · · [1 − Fn(x)]
      + [1 − F1(x)]f2(x) · · · [1 − Fn(x)] + . . .
      + [1 − F1(x)][1 − F2(x)] · · · fn(x)     (5.6)
The expected time of the next arrival event can then be calculated by finding the
expectation of g1(x) as follows:
E(x) = ∫_0^∞ x g1(x) dx     (5.7)
For an actual simulator, the computation of Equation 5.7 would be automated,
given that the user supplies the appropriate F(x) and f(x). For the purposes of this
thesis, the expected minimum timestamp is derived for two sample distributions, the
exponential and Weibull distributions. The resulting equations are used to derive the
results of Chapter 9.
5.1.3 Exponentially Distributed Example
As an example, the next expected event time for a network of two exponentially
distributed event generators will be calculated. The exponential density function is
provided in Equation 5.8:
f(x) = (1/θ) e^(−x/θ)     (5.8)
The exponential cumulative distribution function can then be calculated as:
F(x) = 0   for x < 0
F(x) = P(X ≤ x) = ∫_0^x (1/θ) e^(−t/θ) dt = −e^(−t/θ) |_0^x = 1 − e^(−x/θ)   for x ≥ 0     (5.9)
The density function for this example contains two generators, so Equation 5.6
simplifies to:
g1(x) = f1(x) [1− F2(x)] + [1− F1(x)] f2(x) (5.10)
Substituting Equations 5.8 and 5.9 into Equation 5.10 yields the following proba-
bility density function, g1(x):

g1(x) = (1/θ1) e^(−x/θ1) [1 − (1 − e^(−x/θ2))] + [1 − (1 − e^(−x/θ1))] (1/θ2) e^(−x/θ2)
      = (1/θ1) e^(−x/θ1) e^(−x/θ2) + e^(−x/θ1) (1/θ2) e^(−x/θ2)
      = (1/θ1) e^(−(θ1+θ2)x/(θ1θ2)) + (1/θ2) e^(−(θ1+θ2)x/(θ1θ2))
      = (1/θ1 + 1/θ2) e^(−(θ1+θ2)x/(θ1θ2))     (5.11)
Next, the expected value of the minimum is derived by plugging Equation 5.11
into Equation 5.7:

E(x) = ∫_0^∞ x (1/θ1 + 1/θ2) e^(−(θ1+θ2)x/(θ1θ2)) dx
     = (1/θ1 + 1/θ2) ∫_0^∞ x e^(−αx) dx,   where α = (θ1+θ2)/(θ1θ2)     (5.12)
Equation 5.12 can be simplified by applying the following Γ(n) definition:
Γ(n) = ∫_0^∞ x^(n−1) e^(−x) dx     (5.13)
The situation in Equation 5.12 is slightly different due to the α in the exponent.
We can massage Equation 5.12 by using the substitution y = αx and dy = α dx,
deriving Equation 5.14:
∫_0^∞ x^(n−1) e^(−αx) dx = ∫_0^∞ (y/α)^(n−1) e^(−y) (1/α) dy
                         = (1/α)^n ∫_0^∞ y^(n−1) e^(−y) dy
                         = (1/α)^n Γ(n)     (5.14)
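This substitution can be spot-checked numerically; the values n = 2 and α = 0.5 below are hypothetical choices, and the trapezoidal helper is illustrative rather than thesis code. For these values the integral should equal Γ(2)/α² = 1/0.25 = 4.

```cpp
#include <cmath>

// Trapezoidal approximation of the integral of x * exp(-alpha * x) from 0
// to `upper` (chosen large enough that the neglected tail is negligible),
// standing in for the infinite upper limit of the closed-form result.
double integrate_x_exp(double alpha, double upper, double h) {
    double sum = 0.0;
    for (double x = 0.0; x + h <= upper; x += h) {
        double f0 = x * std::exp(-alpha * x);
        double f1 = (x + h) * std::exp(-alpha * (x + h));
        sum += 0.5 * (f0 + f1) * h;
    }
    return sum;
}
```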
The results of Equation 5.14 and the substitution for α can be plugged back
into Equation 5.12. For this two-generator example, n = 2. Note that
Γ(n) = (n − 1)!.
E(x) = (1/θ1 + 1/θ2) Γ(n) (1/α)^n
     = (1/θ1 + 1/θ2) Γ(2) (1/α)^2
     = (1/θ1 + 1/θ2) (2 − 1)! (1/α)^2
     = θ1θ2/(θ1 + θ2)     (5.15)
We can interpret the results of Equation 5.15 as follows. If the mean of the first
process, θ1, is 5 seconds, and the mean of the second process, θ2, is 10 seconds, then the
expected minimum arrival time of the two processes is given by Equation 5.15 as:
E(x) = θ1θ2/(θ1 + θ2) = (5 · 10)/(5 + 10) = 3.33 seconds     (5.16)
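This worked example can be verified with a short Monte Carlo experiment; the helper function and its fixed seed are illustrative choices, not part of the thesis software. With exponential means of 5 and 10 seconds, the sample mean of the pairwise minimum should approach 50/15 ≈ 3.33 seconds.

```cpp
#include <algorithm>
#include <random>

// Estimate E[min(X1, X2)] for two independent exponential arrival processes.
double mean_min_exponential(double theta1, double theta2, int trials) {
    std::mt19937 gen(42);  // fixed seed for reproducibility
    // std::exponential_distribution is parameterized by rate = 1/mean.
    std::exponential_distribution<double> a(1.0 / theta1), b(1.0 / theta2);
    double sum = 0.0;
    for (int i = 0; i < trials; ++i)
        sum += std::min(a(gen), b(gen));
    return sum / trials;
}
```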
For three exponentially distributed event generators, the next expected event
would occur at time:

E(x) = θ1θ2θ3/(θ2θ3 + θ1θ3 + θ1θ2)     (5.17)
Assuming that all event generators create events with the same θ value, the ex-
pected value of the next event in Equation 5.17 can be generalized to Equation 5.18.
E(x) = θ/N     (5.18)
5.1.4 Weibull Distribution Example
As a second example, the expected minimum for independent, identically distributed
(IID) Weibull event generators is calculated. The example uses two sources, similar to
Equation 5.16. The Weibull distribution has the density function found in Equation 5.19.

f(x) = (γ/θ) x^(γ−1) e^(−x^γ/θ)   for x > 0     (5.19)
Equation 5.19 can be integrated directly to derive the cumulative distribution
function in Equation 5.20.
F(x) = ∫_0^x (γ/θ) y^(γ−1) e^(−y^γ/θ) dy = −e^(−y^γ/θ) |_0^x = 1 − e^(−x^γ/θ)   for x > 0     (5.20)
Equations 5.19 and 5.20 can be inserted into Equation 5.10. If IID sources are
assumed, then γ1 = γ2 = γ and θ1 = θ2 = θ, allowing additional simplification:

g1(x) = ((γ1/θ1) x^(γ1−1) e^(−x^γ1/θ1)) (e^(−x^γ2/θ2)) + (e^(−x^γ1/θ1)) ((γ2/θ2) x^(γ2−1) e^(−x^γ2/θ2))
      = (2γ/θ) x^(γ−1) e^(−2x^γ/θ)     (5.21)
To find the expected minimum, the results of Equation 5.21 are inserted into
Equation 5.7:
E(x) = ∫_0^∞ (2γ/θ) x^γ e^(−2x^γ/θ) dx     (5.22)
If we then let y = x^γ, so that x = y^(1/γ) and dx = (1/γ) y^((1−γ)/γ) dy, we can
substitute these results into 5.22 to obtain:

E(x) = (2γ/θ) ∫_0^∞ y e^(−2y/θ) (1/γ) y^((1−γ)/γ) dy
     = (2/θ) ∫_0^∞ y^(1/γ) e^(−2y/θ) dy     (5.23)
Employing the Gamma function substitution listed in Equation 5.24, with a = 2/θ
and n = 1/γ, we find a solution for Equation 5.23.

∫_0^∞ x^n e^(−ax) dx = Γ(n + 1)/a^(n+1)     (5.24)
E(x) = (2/θ) ∫_0^∞ y^(1/γ) e^(−2y/θ) dy
     = (2/θ) Γ(1/γ + 1)/(2/θ)^(1/γ+1)
     = (2/θ)^(−1/γ) Γ(1/γ + 1)     (5.25)
Equation 5.25 has a nice solution for γ = 1 or γ = 2. For the latter case, the
formula yields:
E(x) = (2/θ)^(−1/2) Γ(1/2 + 1)
     = (2/θ)^(−1/2) (1/2) Γ(1/2)
     = √(θ/2) · (1/2)√π     (5.26)
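The γ = 2 case can likewise be spot-checked by Monte Carlo; the helper, its fixed seed, and the sample value θ = 2 are all illustrative choices. For θ = 2 the formula gives E(x) = √1 · (1/2)√π ≈ 0.886. Draws invert the CDF of Equation 5.20, x = (−θ ln(1 − u))^(1/γ).

```cpp
#include <algorithm>
#include <cmath>
#include <random>

// Estimate the expected minimum of two IID Weibull draws with density
// (gamma/theta) x^(gamma-1) exp(-x^gamma/theta), via inverse-CDF sampling.
double mean_min_weibull(double theta, double gamma_shape, int trials) {
    std::mt19937 gen(7);  // fixed seed for reproducibility
    std::uniform_real_distribution<double> u(0.0, 1.0);
    double sum = 0.0;
    for (int i = 0; i < trials; ++i) {
        double x1 = std::pow(-theta * std::log(1.0 - u(gen)), 1.0 / gamma_shape);
        double x2 = std::pow(-theta * std::log(1.0 - u(gen)), 1.0 / gamma_shape);
        sum += std::min(x1, x2);
    }
    return sum / trials;
}
```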
Equation 5.25 can be generalized for N independent, identically distributed (IID)
generators as:

E(x) = (N/θ)^(−1/γ) Γ(1/γ + 1)     (5.27)
5.2 Topology: Traffic Map Layout
Much of the simulator design acceleration is facilitated by judicious use of inherent
data locality. Data locality is dependent on adjacent nodes in the simulation being
physically adjacent within the simulator. Communications delays are a factor due to the
accelerated speed of the machine. The inter-processing element communications speed
is shown to be comparable to the simulation event processing speed in Section 7.3.6.
Therefore, the question arises: what is the maximum distance expected between any two
traffic map sections when the map is divided and assigned to processing arrays in the
simulator?
Fishburn [53] shows that for a graph, Gn,m, with nm vertices arranged in n
rows and m ≥ max{n, 2} columns, with an edge (u, v) between vertices u and v if the
vertices are adjacent either horizontally or vertically, the bandwidth of Gn,m equals n.
For an example of a 4 by 6 map, G4,6, shown in Figure 5.1 and arranged judiciously
into the stack of the same figure, the largest discontinuous jump required between the
normal connections of the original map squares in the stack would be 4 vertical array
sections, the value of n. Therefore, in the simulator, if the target simulation map is
divided into sections and distributed on the simulator, in terms of communications, the
largest discontinuous jump in the simulator depends on the min{rows, columns} of the
simulated traffic map.
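Fishburn's bound can be spot-checked for the 4 by 6 example. The labeling helpers below are a reconstruction of the down-diagonal and row-column arrangements, not code from the thesis; the bandwidth routine simply takes the largest label difference over all grid-adjacent cell pairs.

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// Largest label difference between horizontally or vertically adjacent cells.
int bandwidth(const std::vector<std::vector<int> >& g) {
    int n = g.size(), m = g[0].size(), best = 0;
    for (int r = 0; r < n; ++r)
        for (int c = 0; c < m; ++c) {
            if (r + 1 < n) best = std::max(best, std::abs(g[r][c] - g[r + 1][c]));
            if (c + 1 < m) best = std::max(best, std::abs(g[r][c] - g[r][c + 1]));
        }
    return best;
}

// Label cells 1..n*m along anti-diagonals, bottom cell of each diagonal first.
std::vector<std::vector<int> > down_diagonal(int n, int m) {
    std::vector<std::vector<int> > g(n, std::vector<int>(m));
    int label = 1;
    for (int d = 0; d <= n + m - 2; ++d)
        for (int r = std::min(d, n - 1); r >= std::max(0, d - m + 1); --r)
            g[r][d - r] = label++;
    return g;
}

// Plain row-major labeling for comparison.
std::vector<std::vector<int> > row_column(int n, int m) {
    std::vector<std::vector<int> > g(n, std::vector<int>(m));
    for (int r = 0; r < n; ++r)
        for (int c = 0; c < m; ++c)
            g[r][c] = r * m + c + 1;
    return g;
}
```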
Fig. 5.1. Wrapping a Traffic Map onto the Simulator The traffic network to be simulated is partitioned and assigned to the processing elements of the simulator. Assuming that the map is divided into an m by n matrix of subsections where n ≤ m, what is the largest discontinuity between the resulting tiles? For the stack of tiles illustrated, Fishburn [53] shows that when the tiles are laid out in a down-diagonal lexicographical linear arrangement, the maximum discontinuity is expected to be n. This value provides an estimate of the connectivity required by the simulator.
Fig. 5.2. Different Lexicographical Map Layouts. Four linear arrangements from [53] illustrate different possible layouts for the map sections of Figure 5.1: column-row, down-diagonal, row-column, and up-diagonal lexicographic. The algorithm used to derive the graphs is evident from their construction. The column-row and the down-diagonal lexicographic arrangements yield bandwidth arrangements with |f(u) − f(v)| = 4. The other maps are not bandwidth arrangements because |f(u) − f(v)| = 5 for the up-diagonal lexicographic arrangement and |f(u) − f(v)| = 6 for the row-column lexicographic arrangement.
Chapter 6
Design Methods
The overall design employs a few general methods of hardware design to optimize
and accelerate its performance. Chapter 6 reviews various architecture elements which
are applied within the design as described in Chapter 7. Section 6.1 describes Recon-
figurable Logic which has been applied throughout the system. Parallelism in the form
of Systolic Arrays is described in Section 6.2. Section 6.3 discusses Content Addressable
Memory which is applied to Event Generation in Section 7.2.1. Finally, Section 6.4
reviews a Reduction Bus.
6.1 Reconfigurable Logic
The major new technology which facilitates the proposed solution is the applica-
tion of reconfigurable logic [51]. Because the acceleration depends on reconfigurable
logic, this section covers it in detail, beginning with a brief overview of FPGAs, the
selected source of reconfigurable logic. Reconfigurable logic has several advantages:
• More application-specific than processors.
• No instruction fetch.
• Allows high degrees of parallelism.
• Allows reuse of silicon.
• Provides flexible redesign which is more affordable than ASICs.
A Field Programmable Gate Array is an array of uncommitted elements whose in-
terconnections can be programmed by the user. In 1985, the Xilinx company introduced
the first FPGA. Typical FPGAs consist of logic blocks with programmable interconnects.
The interconnects are wiring segments of the chip which may be of varying lengths.
Switches connect the logic blocks to the wiring segments. A design is implemented by
partitioning the design among the FPGA’s logic blocks and by using the interconnect to
route signals appropriately.
FPGA logic block structures vary widely. They may contain combinational logic,
multiplexors and lookup tables. Many logic blocks also contain flip-flops to assist in the
implementation of sequential circuits [22].
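A lookup table of this kind is easy to model in software: a 4-input LUT is just a 16-entry table of configuration bits indexed by the input values. The sketch below is an illustration only (the function names are invented here, and no vendor bitstream format is implied).

```python
def make_lut4(truth_bits):
    """Model of a 4-input lookup table: the 16 stored configuration bits
    define any Boolean function of 4 variables; the inputs form the index."""
    assert len(truth_bits) == 16
    def lut(a, b, c, d):
        return truth_bits[(a << 3) | (b << 2) | (c << 1) | d]
    return lut

# Configure the LUT as a 4-input parity (XOR) function.
xor4 = make_lut4([bin(i).count("1") & 1 for i in range(16)])
print(xor4(1, 0, 1, 1))  # 1: odd number of asserted inputs
```

Reprogramming the FPGA amounts to loading a different `truth_bits` table, which is why any 4-variable function fits in one such block.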
Reconfigurable logic consists of a collection of elementary units which can be flex-
ibly connected together to form larger functions. Reconfigurability allows more than one
custom circuit to run on a given piece of silicon. Reconfigurable logic can be config-
ured while the FPGAs remain within their resident systems. The elementary gates can
also be reconfigured as the system is running which is called “on the fly” configuration.
However, there is often a performance impact or delay when “on the fly” reconfiguration
is implemented. For discrete event-driven simulations, reconfigurable logic allows the
user to select from a smorgasbord of random statistical distributions and implement the
choice as hardware. The resulting statistical models are thus unconstrained in form and
faster than their software counterparts.
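One common way such a user-chosen distribution can be realized in hardware is to tabulate its inverse CDF in a block RAM and index the table with uniform random bits. The sketch below is a software analogy of that scheme under stated assumptions: the function names, the 256-entry resolution, and the exponential example are all choices of this sketch, not the thesis design.

```python
import math
import random

def build_inverse_cdf_table(inv_cdf, entries=256):
    """Tabulate an arbitrary inverse CDF, much as an FPGA could store it
    in on-chip RAM; a finer table trades silicon for fidelity."""
    return [inv_cdf((i + 0.5) / entries) for i in range(entries)]

# Example: exponential distribution with rate lambda = 2.0 (mean 0.5).
lam = 2.0
table = build_inverse_cdf_table(lambda u: -math.log(1.0 - u) / lam)

def sample(rng, table):
    """One 'hardware' draw: uniform random bits index the stored table."""
    return table[rng.randrange(len(table))]

rng = random.Random(42)
print([round(sample(rng, table), 3) for _ in range(5)])
```

Swapping distributions then means reloading the table, which mirrors the reconfigurability argument above.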
Field Programmable Gate Arrays (FPGAs) were selected as the reconfigurable
logic. FPGAs increase the application speed, but maintain the reconfigurability of soft-
ware. Further, the FPGAs are reconfigurable in place, meaning that the chip need not
be removed from the board to be reprogrammed. Unlike traditional microprocessors,
reconfigurable logic instructions are directly embedded in the technology and need not
be fetched from memory and executed sequentially. Commercially available FPGAs have
been categorized into four classifications [22]:
• Symmetrical Array
• Row Based
• Hierarchical Programmable Logic Device
• Sea-of-Gates
The two major FPGA competitors are Xilinx and Altera. Xilinx follows a Sym-
metric Array architecture, which consists of a two-dimensional array of logic blocks
interconnected by both horizontal and vertical routing channels as shown in Figures 6.1
and 6.2. Connections which traverse different numbers of switching matrices will have
different delays, making accurate FPGA simulation timing difficult.
Xilinx Configurable Logic Blocks (CLBs) use Lookup Tables (LUT) which can
generate functions based on stored values. The Xilinx XC4000 CLB, depicted in Fig-
ure 6.3, is capable of implementing two independent functions of 4 variables, a single
function of five variables, or some functions of up to nine variables. The two CLB outputs
can be either combinational or registered.
Fig. 6.1. General Xilinx FPGA Architecture. Xilinx FPGA architectures generally consist of a two-dimensional array of programmable Configurable Logic Blocks (CLBs). The FPGAs contain horizontal and vertical routing channels running between the rows and columns of the CLBs. The programmable resources may be controlled by setting static RAM cell values [22].
Fig. 6.2. Xilinx Architecture Interconnects. The horizontal and vertical routing lines connect at the FPGA switch matrices. Single-length lines are intended for short connections or connections which do not have critical timing requirements. The XC3000 series also contains Direct Interconnect lines which allow the CLBs to reach their neighbors on the right, top, and bottom. For connections which span a distance of more than one CLB, the connections are made via the wiring segments and the switch matrices. Connections routed through the switches incur significant routing delays [22]. Special long lines traverse the entire width or length of the chip. The long lines cross at least one switch and are used to interconnect several CLBs with minimum delay [22].
Fig. 6.3. The Xilinx XC4000 Configurable Logic Block. The Xilinx XC4000 Configurable Logic Block (CLB) utilizes a two-stage arrangement of lookup tables (LUTs) that yields a greater logic capacity per CLB than the XC3000's single-stage LUT. The arrangement allows the CLB to implement two independent functions of 4 variables, a single function of five variables, or a function of up to nine variables. The two CLB outputs can be either combinational or registered.
Altera FPGAs consist of a Hierarchical Array of Programmable Logic Devices
(PLDs). Altera refers to its devices as Complex Programmable Logic Devices (CPLDs)
and distinguishes CPLDs from FPGAs based on their interconnect structures. The
segmented interconnect structure of FPGAs is distinguished by its use of multiple metal
lines of varying lengths, joined by pass transistors or anti-fuses, to connect logic cells. In
contrast, the continuous interconnect structure of CPLDs uses seamless metal lines to
provide logic-cell-to-logic-cell connectivity [5]. Brown et al. [22] classify these
Hierarchical Arrays as FPGAs because the Altera devices consist of a two-dimensional
array of programmable blocks and a programmable routing structure. In this thesis,
the conventions found in [22] are followed and the devices are referred to as FPGAs.
The Altera devices also implement multi-level logic and are user programmable.
Figure 6.4 illustrates the block diagram of the Altera Flex 10K architecture. In
the center of the figure resides the embedded array which consists of a series of embedded
array blocks (EABs). EABs can function as either memory or logic. When configured
as memory, the EABs each provide 2,048 bits which can be used to create RAM, ROM,
FIFO functions, or dual-port RAM. EABs can also be configured into complex logic
functions, such as multipliers, micro-controllers, state machines, and DSP functions [4].
EABs can be used independently or ganged together. A more detailed diagram of the
EAB implementation can be found in Figure 6.5.
The logic array, consisting of logic array blocks (LABs), is featured in Figure 6.4.
Each LAB contains eight logic elements (LEs) and a local interconnect. The LEs consist
of a 4-input LUT, a programmable flip-flop, and dedicated signal paths for cascaded
Fig. 6.4. Block Diagram of the Altera Flex 10K Architecture. Each group of logic elements (LEs) is combined into a logic array block (LAB). The LABs are arranged into rows and columns. Each row contains a single embedded array block (EAB). The LABs and the EABs are interconnected forming a network. Chip I/O elements (IOEs) are located at the end of each row and column [5].
Fig. 6.5. Diagram of the Altera Embedded Array Block (EAB). The Embedded Array Block (EAB) is a flexible block of RAM with registers on the input and output ports. Logic functions may be created by programming the EAB with a read-only pattern during configuration, creating a large Lookup Table (LUT). The large capacity of the EAB allows complex logic functions to be implemented in one level without routing delays. The EABs can also implement large, dedicated blocks of RAM which eliminate the timing and routing concerns of competitor FPGAs. The competitor FPGAs must often string together smaller distributed RAM blocks to allocate larger memories [5].
functions. The eight LEs can be used to create functions such as 8-bit counters, address
decoders, or state machines. The internals of an LE are illustrated in Figure 6.6.
The Sea-of-Gates format consists of logic blocks in which the interconnection net-
work is overlaid on the logic blocks themselves. The Sea-of-Gates approach is not
commonly used. Row-Based FPGAs consist of a multiplexor based interconnection net-
work which employs anti-fuse technology. Anti-fuse connections are normally open, high
impedance connections. Programming permanently closes the appropriate connections
by melting a dielectric and thereby configures the connections on the chip. Actel manu-
factures some row-based devices.
Although reconfigurable logic facilitates algorithm implementations which are
faster than analogous software, the application of Application Specific Integrated Circuits
(ASICs) would yield still faster results. ASICs are silicon chips fabricated to perform a
single specific task. ASICs are generally not configurable, and their rigid nature makes
them less appropriate for application in a more general purpose simulator. Reconfig-
urable logic serves as a compromise between the slow but flexible general purpose pro-
cessor and the overly rigid ASIC approach. Whereas ASICs would allow the user access
to a limited array of statistical models, FPGAs allow the user to implement a statistical
model for event generation of the user’s choosing. Section 7.2.3 uses reconfigurable logic
to flexibly implement a configurable arrangement of functional units. Reconfigurable
logic preserves the parallel processing aspect of ASICs along with permitting some of
the flexible programming of the general purpose processor.
Fig. 6.6. Diagram of the Altera Logic Element (LE). The Altera Flex 10K Logic Element (LE) is the smallest unit of logic in the architecture. Each LE contains a 4-input LUT. Additionally, each LE contains a programmable flip-flop with a synchronous enable which can be configured as a D, T, JK, or SR flip-flop. The LE also contains a carry chain which supports high-speed counters and adders, and a cascade chain which implements wide-input (large fan-in) functions with minimum delay. Carry and cascade chains can connect all LEs in a Logic Array Block (LAB) and all LABs in the same row [5].
The Altera FPGA was selected over Xilinx for two important reasons. First,
the manufacturer claims that the hierarchical architecture of the FPGA allows realistic
timing simulation: the continuous global vertical and horizontal routing structure is
claimed to provide more predictable performance, and therefore more accurate simulation
and timing analysis, than the approaches of other manufacturers, which employ a
segmented routing interconnect with switch matrices and consequently exhibit less
predictable performance [5, pg. 69]. Second, competitor FPGAs must often string together
smaller distributed RAM blocks to allocate larger memories, causing timing and routing
concerns.
Reconfiguration allows the semantic expressiveness of very large instructions with-
out paying the commensurate bandwidth and deep storage costs for these powerful in-
structions. The sacrifice made in developing this solution is the ability to change the
entire instruction on every cycle. Reconfiguration opens a middle ground, or an inter-
mediate binding time, between 'behavior which is hardwired at fabrication time' and
'behavior which is specified on a cycle by cycle basis'. This middle ground is useful to
consider in the design of any kind of computing device, not just conventional FPGAs [47].
6.2 Systolic Arrays
In this work, the event generation hardware is implemented as a pipelined, parallel
systolic array. Translating software to hardware significantly increases simulation
calculation performance, as hardware logic is faster than comparable software.
Reconfigurable logic naturally facilitates the creation of Systolic Arrays [73, page
580]. In a systolic array, data is pumped from processing element to processing element
at regular intervals, until the data circulates back to memory. Intermediate results can
be passed along in the pipeline, instead of writing them back to the register file after
every instruction [70].
Systolic systems consist of interconnected cells, each of which is capable of per-
forming a simple operation. Systolic systems tend to have uncomplicated communication
and control structures which provide an advantage in design and implementation. Sev-
eral cells are generally joined together to form an array or tree. Data flows through the
cells which are pipelined together.
Initially, systolic arrays were proposed for special purpose computers implemented
in Very Large Scale Integration (VLSI) silicon, in an effort to reduce design costs. Design
costs in FPGAs are already somewhat minimal, but here the systolic array architecture
allows the implementation of multidimensional pipelines. Arithmetic functional units
(adders, multipliers, etc.) can be flexibly formed and interconnected allowing both par-
allelism and the continuous flow of data through computation units. These systolic
pipelines are ideal for implementing parallel algorithms to compute simulation events.
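The rhythmic pumping of data from cell to cell can be illustrated with a small software analogy. The sketch below is this author's illustration, not hardware from the thesis: each tick, every cell applies its simple operation to the value handed over by its left neighbour, so all cells compute concurrently on different data.

```python
def systolic_pipeline(stages, stream):
    """Software analogy of a 1-D systolic array: one value enters per tick,
    each cell transforms the value received from its predecessor, and the
    rightmost cell's result drains back out (to 'memory')."""
    pipeline = [None] * len(stages)     # value currently held by each cell
    out = []
    feed = iter(stream)
    for _ in range(len(stream) + len(stages)):
        last = pipeline[-1]             # value leaving the array this tick
        # shift right-to-left so each cell hands off before receiving
        for i in range(len(stages) - 1, 0, -1):
            prev = pipeline[i - 1]
            pipeline[i] = stages[i](prev) if prev is not None else None
        nxt = next(feed, None)
        pipeline[0] = stages[0](nxt) if nxt is not None else None
        if last is not None:
            out.append(last)
    return out

# Two cells: increment, then double -> computes 2*(x+1) for each input.
print(systolic_pipeline([lambda x: x + 1, lambda x: 2 * x], [1, 2, 3]))  # [4, 6, 8]
```

After the pipeline fills, one finished result emerges every tick regardless of how many stages the computation has, which is the property the event-generation hardware exploits.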
6.3 Content Addressable Memory
Content addressable memory (CAM), or associative memory, is a storage system
which can both store data and perform some minor processing, such as searches.
Although a CAM approach is less applicable to a traffic simulation example, the CAMs
are suitable for a more general discrete event simulation. When working in conjunction
with a processor, the content addressable memory queue allows some events to be pro-
cessed independently, freeing the scheduler to attend to other tasks. A block diagram
of an Associative Memory is illustrated in Figure 6.7 [100]. An example of a storage bit
within one memory word is shown in Figure 6.8 [100]. One of the proposed Event Queue
approaches of Section 7.2.2 applies Content Addressable Memory. If the required event’s
resources are unavailable, the queue assists the scheduler by removing the impotent
events.
Searches are accomplished by seeking a specific bit pattern. For the search, two
parameters are supplied, the matching bit pattern and a mask to limit the search set.
The search is dependent on which values are stored in memory as opposed to the address
or location of those values; that is, the search is by content. For example, a request
might be to search for memory words whose lowest-order 8 bits contain the pattern
"00000100" and return the first match. In this case, the mask selects the lowest-order
8 bits, the argument is the given bit pattern, and returning the first word provides
conflict resolution in the event of multiple matching words. Figure 6.9 illustrates the
matching logic from [100].
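A masked search of this kind is simple to express in software. The sketch below is an illustration only; the function name and word values are invented for the example, and the parallel hardware compares all words in one cycle rather than iterating.

```python
def cam_search(words, argument, mask):
    """Masked associative search: return indices of words whose bits under
    `mask` equal the argument's bits under the same mask. Hardware does
    this comparison for every word simultaneously."""
    return [i for i, w in enumerate(words) if (w & mask) == (argument & mask)]

words = [0b1111_0000_0100, 0b0000_1010_1010, 0b0101_0000_0100]
matches = cam_search(words, argument=0b0000_0100, mask=0xFF)  # low 8 bits only
print(matches)     # words 0 and 2 both end in 00000100
print(matches[0])  # conflict resolution: take the first match
```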
Besides the structure depicted in Figure 6.7, an additional Tag register is often
included, allowing the rapid determination of which words in the memory are valid.
When a word is written to the associative memory, the Tag Register, which contains one
bit for each memory word, is scanned until the first 0 bit is found indicating an unused
word. The new value can then be written to the corresponding associative memory word,
and that bit is then flipped to a 1 indicating that the newly written word is valid. To
Fig. 6.7. Associative Memory Block Diagram. The block diagram of an Associative Memory or Content Addressable Memory consists of the four elements shown. The memory storage array and match logic are used both for the storage of data and for allowing a parallel search. The Argument Register contains the value to be compared with all the words in the memory array in one parallel operation. The Key Register is really a mask register, used to limit which bits of the Argument Register are used in the search for a match. The Match Register contains one bit for every word in the array. During a search, words which match set a bit in the Match Register. The found data can be read sequentially by selecting words whose match bits have been set by the results of the search. These architectures usually also contain a Tag Register, which like the Match Register contains one bit for each word in the array, indicating whether or not that word contains valid data [100].
Fig. 6.8. An Associative Memory Cell. The diagram illustrates the typical logic contained within an associative memory cell [100]. The cell contains a flip-flop storage unit, Fij, and circuits for reading, writing, and matching the cell contents with an argument. A write transfers an input bit to the flip-flop, and a read outputs the stored value [100].
Fig. 6.9. Associative Memory Match Logic. The match logic compares the value stored in each cell with the corresponding bit held in the Argument Register. For a match to occur, the argument bit and the corresponding cell bit must contain the same value, so they must either both be 0 or both be 1. In Boolean form, the per-bit logic is xj = Aj Fij + Aj′ Fij′, where xj is 1 when both bits are equal. For a word to match, all bits must register a 1, so the xj terms are ANDed together. The Kj bit removes a bit from the comparison: when the bit is masked out, the comparison output for that bit is forced high regardless of the match logic outcome, so an excluded bit cannot negatively affect the matching result [100].
delete a word in memory, the corresponding Tag bit is set to 0. Further, the tag bits are
included in the match logic to prevent an invalid word from participating.
In the case of discrete event simulation, content addressable memory can assist
in removing events which lack required resources from the event queue.
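The Tag-register bookkeeping described above can be captured in a small software model. The class and method names below are this sketch's own, not the thesis hardware; the point is only the allocate-on-first-zero write, the one-bit delete, and the tag gating of the match logic.

```python
class AssociativeMemory:
    """Toy model of the Tag-register bookkeeping of an associative memory."""

    def __init__(self, size):
        self.words = [0] * size
        self.tag = [0] * size          # 1 = word holds valid data

    def write(self, value):
        idx = self.tag.index(0)        # scan for the first 0 (free) tag bit
        self.words[idx] = value
        self.tag[idx] = 1              # flip to 1: word is now valid
        return idx

    def delete(self, idx):
        self.tag[idx] = 0              # invalidate; no need to clear the data

    def search(self, argument, mask):
        # tag bits gate the match logic so stale words never match
        return [i for i in range(len(self.words))
                if self.tag[i] and (self.words[i] & mask) == (argument & mask)]

mem = AssociativeMemory(size=8)
mem.write(0b0000_0100)
mem.write(0b1010_1010)
print(mem.search(0b0000_0100, mask=0xFF))  # [0]
```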
6.4 Reduction Bus
A reduction bus is a communications structure which has the dual purpose of both
communications and some minor simultaneous computation. For instance, the bus can
both determine and disseminate the next global minimum event time through a network
of interconnected nodes. The CM-5®, developed by the Thinking Machines Corpora-
tion, can consist of hundreds or thousands of processors, linked together by two communi-
cations networks, the Control and Data networks. The Control network contains a global
reduction network which allows many data elements to be combined producing a smaller
result. The CM-5’s reduction network directly supports integer summation, finding the
integer maximum, logical OR, and logical exclusive OR operations [77]. Reynolds de-
scribes another reduction network referred to as the Parallel Reduction Network (PRN),
where each node in the binary tree structure of the PRN is an Arithmetic Logical Unit
(ALU) specifically for performing reduction computation [116]. Reynolds’ PRN is de-
scribed in Section 3.5.1. The reduction network presented in Section 7.3.2 is a flattened
network, where the considered constraints include both the node geometric layout and
short bus run-lengths.
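The minimum-time reduction such networks perform can be sketched as a binary tree of min operations in the spirit of Reynolds' PRN. The function name below is invented for this sketch, and a real reduction network combines all pairs at a tree level in parallel rather than in a Python loop.

```python
def tree_reduce_min(values):
    """Binary-tree reduction: each level combines pairs of child values
    (as the PRN's ALU nodes would) until one global result remains."""
    level = list(values)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])    # pad odd levels with a duplicate
        level = [min(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

# Each processing element contributes its next local event time;
# the reduced minimum is then broadcast back to every node.
local_next_event_times = [12.5, 9.75, 30.0, 9.75]
print(tree_reduce_min(local_next_event_times))  # 9.75
```

With p nodes the tree needs only about log2(p) combining levels, which is what makes the global minimum cheap to compute every simulation cycle.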
Chapter 7
Architecture
The proposed architecture is composed of multiple processing elements united
and synchronized towards the common goal of accelerating discrete event simulation. In
the case of traffic simulation, each processing element is responsible for simulating the
vehicles within an intersection and for simulating the traffic on the intersection’s outgoing
roads. Traffic intersections which are directly joined by roads in the simulation are
similarly co-located on adjacent nodes within the architecture to profit from the resulting
data locality. Data dependencies are local to each simulation node providing ample
opportunity for the concurrent processing of events on different processing elements.
As illustrated in Figure 1.1, the simulation architecture for each processing ele-
ment is divided into the three main categories of event generation, an event queue, and
a scheduler. Additionally, there is the interconnection network which binds the elements
into a single cohesive computing machine. The architecture of the system is sub-divided
into four smaller sub-architectures. The architecture, overviewed in Section 7.1,
is composed of multiple processing elements which are unified into a cohesive simula-
tor. Individual processing element sub-components are discussed in Sections 7.2.1, 7.2.2,
and 7.2.3. Section 7.2.1 describes the event generation design. Section 7.2.2 describes the
event queue which stores events generated by the hardware described in Section 7.2.1.
At each processing element, events are retrieved from the event queues and scheduled
and processed by scheduler hardware described by Section 7.2.3. Finally, Section 7.3
describes the network which unites and synchronizes the processing elements.
7.1 Distributed Multiprocessors
Figure 7.1 illustrates the operational environment of the simulator which is simi-
lar to Levendel’s [98]. A general purpose machine serves as a User Interface and system
Controller. The Controller pre-processes the simulation data, partitioning and loading
it across the various simulator processing elements. Because of the size and scalability
of the system, more than one Controller may initialize the simulator in order to ensure
reasonable response time. However, if there are several Controller units, one will be des-
ignated as the main Controller, responsible for the initial simulation partitioning. The
pre-processing, partitioning, and post-processing topics are not covered in this thesis.
Once the simulator is loaded, the main Controller provides the initial Start signal, il-
lustrated in Figure 7.2. The Controller machines receive, postprocess, and provide the
simulation results to the user. Intermediate simulation results can be obtained by either
programming the processing elements to automatically post results as certain thresholds
are crossed or by interruptions from the Controller. The interruptions allow the user
to monitor simulation progress. Major adjustments are not possible without halting
the simulator and reconfiguring processing element logic. Results include traffic run-
time statistics and output values from monitored points at specific simulation times or
conditions.
The Controller unit is a general purpose machine which receives the simulation
input from the user. The Controller then partitions the simulation across the simulator
Fig. 7.1. System User Interface. The proposed preprocessing and post-processing operational environment for the multi-processing-element simulator architecture is similar to Levendel's [98]. The Simulator Processing Element Network is composed of a parallel reduction bus structure and a cross-point matrix. The parallel bus is used for synchronization and initialization. The cross-point matrix network is for inter-processor communications.
processing element network of Figure 7.2, optimizing the distribution to take advantage
of data locality among the neighboring processing elements and the simulator commu-
nications network. For a road traffic network, the Controller logically assigns the roads
and intersections of the simulated road network onto the processing elements, attempting
to place adjacent simulation nodes on adjacent simulator processing elements. Once
the simulation network is partitioned, multiple controllers can be used to initialize
the system's reconfigurable logic. For instance, one controller per quadrant can be
used to load the initialization and configuration data into the processing elements of
each quadrant. Once the simulation starts, the controllers can monitor the system and check
to see when user-determined thresholds are crossed. In the traffic simulation example,
one concern is throughput. Overflowing vehicle queues indicate a traffic bottleneck, so
a threshold can be assigned to the traffic queue size as a simulation break point. The
Controller units are responsible for providing formatted output to the user. Configuration
of the processing elements will be slow, but once the simulation starts running, the Con-
trollers will not negatively affect the speed of the simulation unless the user requires the
simulation to proceed slowly. Controllers can step through a simulation and run it in a
debug mode. Citing the traffic network model example, once a simulator is initialized for
the traffic network model, major model changes are infrequent as road networks change
slowly.
Vehicle data flows into each processing element as the vehicle enters the corre-
sponding traffic map section. Vehicle data is transmitted between processing elements
using either nearest neighbor interconnect routing or the processing element array cross-
point switch. Greater detail of the processing element sub-components is provided in
Fig. 7.2. Processor Element Network. The simulator consists of a Controller and a network of processing elements, interconnected by both a shared parallel reduction bus and a dedicated communications structure. The communications structure is composed of cross-point matrices laid out in approximately fully connected star topologies. Further detail on the cross-point matrix network is found in Section 7.3.6, and the parallel bus is described in Section 7.3.2.
the following sections. A controlling processor at the core of the simulator initiates each
simulation cycle using the Start signal in both its time and event-driven modes. In
time-driven mode, the processing elements have already exchanged input values for the
beginning of the next simulation cycle during the previous cycle using the communica-
tions structure. In event-driven mode, the processing elements must wait until the next
event time is determined before exchanging data. Only data required for the next sim-
ulation cycle is exchanged by the processing elements, alleviating the need to exchange
event scheduling times with the vehicle data. Data may also be generated for the Con-
trolling unit, upon user request or instruction, and it is expected that this user-requested
data will impact the simulator’s processing speed. Processing elements signal that they
are ready using the Done signal line on the reduction bus, illustrated in Figure 7.2.
In event-driven mode, when all processing elements have signalled the end of
the current simulation cycle, the next event time is determined using the reduction
bus of Section 7.3.2. Data between the processing elements can then be exchanged
for the next time cycle. The time-driven mode avoids both the next simulation time
cycle determination and the subsequent data exchange which occurs during the cycle
processing. Once all processing for the previous simulation time cycle is complete, the
main controlling processor initiates the next simulation cycle using the Start signal.
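The event-driven Start/Done protocol just described can be caricatured in software. All names in the sketch below are invented for illustration: a min over the queue heads stands in for the reduction bus, one loop iteration stands in for one Start-to-Done cycle, and queues are Python min-heaps of (time, event) pairs.

```python
import heapq

def event_driven_cycles(node_queues):
    """Toy model of the global cycle: when every node is Done, the
    reduction yields the next event time; Start then lets each node
    process all of its events scheduled at that time."""
    trace = []
    while any(node_queues):
        now = min(q[0][0] for q in node_queues if q)   # reduction-bus step
        for pe, q in enumerate(node_queues):           # Start: every PE runs
            while q and q[0][0] == now:                # events at t = now
                _, ev = heapq.heappop(q)
                trace.append((now, pe, ev))
        # loop iteration ends when all PEs have signalled Done
    return trace

queues = [[(1.0, "a"), (3.0, "c")], [(1.0, "b")]]
for q in queues:
    heapq.heapify(q)
print(event_driven_cycles(queues))  # [(1.0, 0, 'a'), (1.0, 1, 'b'), (3.0, 0, 'c')]
```

Time-driven mode would simply advance `now` by a fixed step instead of performing the reduction, which is the trade-off analyzed elsewhere in the thesis.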
7.2 Processing Elements
The local processing element architecture, consisting of an event generator, local
event queues, and a scheduler, is shown in Figure 7.3. The processing elements perform
the actual scheduling calculation on each discrete event in the system. For the road
traffic example, the processing elements compute the acceleration, velocity, and positions
of each vehicle as they traverse the simulated road network. Each processing element
is responsible for part of the overall simulation map, as partitioned by the Controller
during the simulation initiation. For the traffic simulation, parts of both the routing
table and map attributes are implemented in reconfigurable logic before the simulation
starts. Source nodes which introduce new events into the simulation network need event
generators and queues to handle arrival and service events. Simulation nodes which
serve as pass-throughs or way-points for events already active in the simulation may
require only the scheduler components, which can themselves be complex. Further, the scheduler
components may contain queue structures which should not be confused with the Arrival
and Service queues illustrated in Figure 7.3. The processing elements each include a
microprocessor, RAM, and EEPROM to provide added design flexibility.
Each processing element incorporates hardware to exchange simulation data with
other elements connected to the Communications Structure of Figure 7.2. The inbound
events are handled by additional small communications FIFO queues not illustrated in
Figure 7.3. These communications queues are used to maintain the ordered inbound
events from other processing elements and the ordered outbound events sent to other
processing elements.
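The head-of-queue comparator of the processing element design is easy to sketch in software. The function name and event tuples below are invented for illustration; both queues are assumed sorted with the smallest timestamp first, as the hardware queues maintain.

```python
def next_local_event(arrival_q, service_q):
    """Sketch of the PE comparator: sample the head of the arrival and
    service queues and report which holds the minimum local timestamp."""
    a = arrival_q[0] if arrival_q else None
    s = service_q[0] if service_q else None
    if a is None and s is None:
        return None                              # no pending local events
    if s is None or (a is not None and a[0] <= s[0]):
        return ("arrival", a)
    return ("service", s)

print(next_local_event([(2.0, "v7 arrives")], [(1.5, "v3 served")]))
# ('service', (1.5, 'v3 served'))
```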
7.2.1 Event Generation
Speedup of event-driven simulation is attacked from two vantage points. First, a
separate event generator is created which functions in parallel with, and independently
Fig. 7.3. Local Processing Element Design The local processing element (PE) design uses two queues for each server. The arrival queue holds the sorted list of arrival events from the Event Generator and adjacent network processing elements. Service events, which are created from processing successful arrival events, are stored in the service queue. A comparator samples the heads of both queues and indicates where the next minimum local timestamped event resides.
from, the event scheduler. Although some data dependency exists during event gener-
ation, partial parallelism at this stage is reasonable. Data dependency exists because
event arrivals are calculated as random offsets from the previous event’s arrival time.
Also, service durations, required for some types of simulation, are calculated as random
offsets from the event’s arrival time. Note that not all types of simulation require service
durations. Road traffic does not necessarily require a duration, but a telephone call in
a communications network simulation needs to determine the length of time its corre-
sponding circuit is unavailable. Calculating the arrival offset and the service duration
concurrently is possible because the two random offsets have no data dependencies of
their own. Each offset must then be added to the previous event's arrival time (for
arrivals) or the current event's arrival time (for service completions). The event generator
computes event arrival times, service times, and resource requirements with some par-
tial parallelism (see Figure 7.4). The resulting event objects are stored in a memory
queue which is accessible to the scheduling software. The memory queue serves as the
simulation’s event queue.
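The dependency structure described above can be sketched in software. The following Python model is an illustration only, not the AHDL implementation: the two exponential draws per event are mutually independent and correspond to the blocks that execute in parallel, while the running addition onto the previous arrival time is the serial step.

```python
import random

def generate_events(n, arrival_rate, service_rate, seed=0):
    """Generate n (arrival_time, finish_time) event pairs.  The two
    exponential draws per event are mutually independent (the step the
    hardware performs in parallel); adding the arrival offset to the
    previous arrival time is the serial dependency."""
    rng = random.Random(seed)
    events = []
    arrival_time = 0.0
    for _ in range(n):
        # Independent draws: these correspond to the Create Arrival Time
        # Offset and Create Service Time Offset blocks of Figure 7.4.
        arrival_offset = rng.expovariate(arrival_rate)
        service_offset = rng.expovariate(service_rate)
        # Serial step: offsets are anchored to the arrival-time stream.
        arrival_time += arrival_offset
        finish_time = arrival_time + service_offset
        events.append((arrival_time, finish_time))
    return events
```

Each tuple models a start event and its related finish event; once the pipe is loaded, the hardware of Figure 7.4 emits one such pair per clock cycle.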
Speedup is accomplished by translating some simulation loop software into par-
allel, systolic, hardware. The hardware is designed through a combination of recon-
figurable logic technology and systolic arrays. Reconfigurable logic allows the user to
compile various statistical distribution models into hardware inexpensively. Systolic ar-
rays lend themselves nicely to the parallel execution of the independent sub-portions of
events which are otherwise data dependent. These non-dependent parts of the events
can be executed in parallel. The Event Generator from Figure 1.1 is translated into both
software and the hardware of Figure 7.4 for timing comparisons.
Fig. 7.4. The Event Generator Flow Diagram The Event Generator of Figure 1.1 is subdivided into arrival and service time generation. The time offsets can be created in parallel. This design converts event generation software into a two-dimensional reconfigurable, systolic array. Reconfigurable logic boosts the execution speed of event generation by fostering parallel computation. In the systolic array depicted above, data is pumped from one processing block to the next at regular intervals, until the data circulates to the Event Queue.
In the hardware version, multiple calculations happen simultaneously. First, the
three outer pipeline blocks of Figure 7.4, Create Service Time Offset, Create Arrival Time
Offset, and Set Resources execute simultaneously. Create Service Time Offset and Create
Arrival Time Offset generate the Poisson arrival and service times of Equations 4.1
and 4.2. Next, the arrival time offset is added to the current clock time to determine
the actual arrival time in the Add Offset to Previous Arrival Time block. In the next
step, the service time offset is added to the actual arrival time yielding the time at which
the event is finished and its resources become available again. Simultaneously, the start
event data is matched to its resource requirements in the Create Next Start Event block.
The Create Related Finish Event block pumps out its value in the next step. However,
when the pipe is loaded, start and finish events emerge from the pipeline simultaneously,
with each cycle.
The hardware version of Section 4.1.1 was modeled using Altera’s Max+Plus II®
FPGA simulation package. The design, written in the AHDL language, used the Flex
10K series FPGA chips. The Max+Plus II® design automation package consists of a
series of tools including an editor, a compiler, and a simulator. The editor allows designs
to be entered as text files in AHDL (Altera High Level Design Language), Verilog,
or VHDL (VHSIC Hardware Description Language where VHSIC is another acronym
standing for Very High Speed Integrated Circuits). The compiler translates the design
into files for simulation, timing, and device programming. The Max+Plus II® simulator
provides timing information and allows design functionality to be verified.
The Altera Flex 10K series of FPGAs has the following features. The devices
contain 10,000 to 100,000 typical gates, 720 to 5,392 registers, and 6,144 to 24,576 RAM
bits. Additional routing features on the chip facilitate predictable interconnect delays
which provide reliable simulation results. Software design support and automatic place-
and-route tools are provided by Altera’s Max+Plus II® development system.
7.2.1.1 Event Generator Results
The design depicted in Figure 7.4 is translated into AHDL. The Event
Generator is synthesized as a combination of five chips. Four
logarithm units are required, two producing their results on the even clock cycles and
the other two producing results on the odd clock cycles. The Create Arrival Time Offset
and the Create Service Time Offset blocks of Figure 7.4 each require one odd and one
even logarithm unit. In simulation at a 200 nanosecond clock rate, the hardware version
produces one event per clock cycle, that is, one event every 200 nanoseconds. Therefore, we achieve a
speedup of 150. This speedup is just for the translation of the event generation software
code as pipelined, systolic hardware. The event generator implementation results are
listed in Table 7.1.
7.2.2 Event Queue
The second point of attack is the queue of waiting events. This
queue is designed to hold the events in order of their arrival. One proposed memory
queue is a Content Addressable Memory Queue. If the resources an event requires are
unavailable, the queue assists the scheduler by removing such events, which cannot execute.
After events are created by the event generator, they are stored in the arrival
queue in order. The arrival queue can be easily implemented as a FIFO queue. Success-
fully executed arrival events create service events. However, the service events may be
generated out of order. The smallest timestamped events, be they service or arrival, must
be continually available to allow the events to be executed in time order. This section
presents two service queue alternatives. The first method, described in Section 7.2.2.1,
maintains a sorted queue and can select the nth element in O(1), but requires a constant
four cycles to insert a new element. The second method, described in Section 7.2.2.2,
inserts new elements in O(1) and can pop the smallest element in O(1), but does not
maintain a sorted queue.
7.2.2.1 The Service Event Sorter
The first method, the Service Event Sorter, applies associative memory to sort
events in 4 cycles, significantly faster than standard software sorts. This sorting mech-
anism maintains a sorted array facilitating selection of the kth smallest element. The
hardware consists of the Input Register, a Content Register Array, a Marked Array, and
a Maxbit Register. The input value is compared against the content array values and
inserted in the correctly sorted position within the content array. Auxiliary hardware
registers and logic are used to quickly locate the correct insertion point for the new value.
Longer queues can be created by chains of smaller queues.
In the first cycle, illustrated in Figure 7.5, all words in the content array are
compared to the input register value. If any word in the content array has the same
most significant bit (MSB) as the input register, the first bit of the maxbit register is
set. If any content array word has the same top two MSBs as the input register, then
the first two bits of the maxbit register are set. So if three bits of the maxbit register
are set, at least one word in the content array has the same top 3 MSBs as the input
register. In the example depicted in Figures 7.5, 7.6, and 7.7, two words have the same
top two MSBs as the input register value.
The proposed algorithm works as follows. All registers in the content array are
compared with the input register. A network of nodes, called the match array, is used to
determine the number of most significant bits which each content register has in common
with the input data register. A single register, the maxbit register, records the result.
For example, if one or more content array registers match the data input register on all
3 of the 3 most significant bits, then the first 3 bits of the maxbit register are set to logic
1’s.
In the second of the four cycles, all words in the content array matching the input
register with the maximum number of MSBs are marked by setting bits in a marked
array. The marked array consists of one bit per word of the content array. Content
array words which have the maximum number of MSBs matching the input register are
marked as illustrated in Figure 7.6. These words will cluster due to the binary nature of
the search and the sorted queue format.
During the third cycle, the required words in the content array will be moved
down to allow room for the insertion of the input register contents. If the least significant
marked bit in the maxbit register is i, then the i+1 bit of the input register is checked.
When the i+1 bit of the input register is a zero, as shown in Figure 7.7, all registers
in the content array from the marked register to the end of the array are shifted down
Fig. 7.5. Service Event Sorter: Cycle 1 The Service Event Sorter can sort events in 4 cycles, significantly faster than standard software sorts. In the first cycle, all words in the content array are compared to the input register value. If any word in the content array has the same most significant bit (MSB) as the input register, the first bit of the maxbit register is set. If any content array word has the same top two MSBs as the input register, then the first two bits of the maxbit register are set. So if bit three of the maxbit register is set, it indicates that at least one word in the content array has the same top 3 MSBs as the input register. In this example, two words have the same top two MSBs as the input register value.
Fig. 7.6. Service Event Sorter: Cycle 2 In the second cycle, all words in the content array which contain the same number of matching maximum bits with the input register are marked. For this example, two words match the input register with their two most significant bits.
one register in a single clock cycle. Otherwise, if the i+1 bit of the input array is a one,
then all registers below the marked registers are shifted down one position creating a
spot for the new word to be inserted as shown in Figure 7.7. Finally, in the fourth step,
the input register value can be inserted into the array in proper order.
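The four-cycle insert can be modeled in software as follows. This Python sketch is a functional model only: the hardware performs the comparisons associatively in parallel, while the model computes them in a loop. It reproduces the maxbit computation, the marking of the matching cluster, and the i+1-bit insertion rule.

```python
def sorted_insert(content, value, width=4):
    """Functional model of the four-cycle associative insert.
    `content` is a list of `width`-bit integers kept in ascending order."""
    if not content:
        content.append(value)
        return

    def msb_match(a, b):
        # Number of most significant bits that a and b share.
        for k in range(width, 0, -1):
            if a >> (width - k) == b >> (width - k):
                return k
        return 0

    # Cycle 1: the maxbit register records the longest MSB match.
    maxbits = max(msb_match(w, value) for w in content)
    # Cycle 2: mark every word achieving that match length (the marked
    # array); in a sorted array these words form one contiguous cluster.
    marked = [msb_match(w, value) == maxbits for w in content]
    first = marked.index(True)
    last = len(marked) - 1 - marked[::-1].index(True)
    # Cycles 3-4: bit i+1 of the input (with i = maxbits) decides the side.
    # A 0 bit means the input is smaller than the cluster: insert at its
    # head.  A 1 bit means it is larger: insert just below the cluster.
    if maxbits == width or not (value >> (width - 1 - maxbits)) & 1:
        content.insert(first, value)
    else:
        content.insert(last + 1, value)
```

Running the example of Figures 7.5 through 7.8, inserting 0101 into the array 0001, 0011, 0110, 0111, 1001, 1010 places the new word between 0011 and 0110, preserving the sorted order.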
7.2.2.2 The Linear Array
The second service queue mechanism consists of a linear array approach which is
described in [97] and illustrated in Figure 7.9. However, instead of using the linear array
to sort the values, the array will simply maintain fast access to the minimum timestamped
event. All new simulation events are passed into the leftmost array element, the queue
head, and when removed, the elements are also popped off the queue head. Each element
of the queue contains two registers and a comparator. The larger of the two resident
elements is passed to the right, and the smaller of the two elements is passed to the left.
Therefore the smallest entry is always at the leftmost queue element. Comparators in
each element and the queue push/pop signal steer the 2x2 multiplexor logic to route the
correct entries in and out of the processing element registers.
The service queue is required to always have the smallest element ready. The
availability of the smallest element can be reasoned as follows. Assume that at some time,
t, the queue contains N elements. Therefore, the leftmost element, K, has examined a
sequence of N values, retaining the smallest value. This value can be popped off in 1
move. The element to K’s right, K-1, has examined at least N-2 values, so the 2nd
smallest value can be either at element K, or at element K-1, but it must be in one of
Fig. 7.7. Service Event Sorter: Cycle 3 The third cycle shifts words in the content array to insert the input register word. If the least significant marked bit in the maxbit register is i, then the i+1 bit of the input register is checked. When the i+1 bit of the input register is a zero, all registers in the content array from the marked register to the end of the array are shifted down one register in a single clock cycle as illustrated in the figure. Otherwise, if the i+1 bit of the input array is a one, then all registers below the marked registers are shifted down one register, creating a spot for the new word to be inserted.
Fig. 7.8. Service Event Sorter: Cycle 4 The fourth cycle inserts the input register word into the content array. If the least significant marked bit in the maxbit register is i, then the i+1 bit of the input register is checked. When the i+1 bit of the input register is a zero, the input register word is inserted into the first marked word position. If the i+1 bit of the input register is a one, then the word is inserted below the last marked word.
Fig. 7.9. Linear Array Queue The queue consists of a linear array of processing elements. All new elements are passed into the leftmost array element, and when removed, the elements exit the same leftmost element. Each element of the queue contains two registers and a comparator. The larger of the two resident elements may be passed to the right, and the smaller of the two elements may be passed to the left. Therefore the smallest entry is always at the leftmost queue element. Comparators in each queue element steer the multiplexor logic to route the correct entries in and out of the processing element registers.
those two places, and can be accessed in 2 moves since the smallest element must be
removed first.
The nth smallest element to enter the array lies in some position from K down to
K − (n − 1). The (n − 1)st smallest element likewise lies in some position from K down
to K − (n − 2), which provides the inductive step. So the nth smallest element can
always be retrieved in n steps. This
queue allows us to push and pop each element in O(1) time. The queue is illustrated in
Figures 7.10 and 7.11.
Figure 7.10 illustrates a sequence of values being pushed into the array. The top
array illustrates the first time step, with each successive array below depicting the same
array during the next clock cycle. Comparators on each processing element and their
associated multiplexors steer the values into each element of the array. Larger elements
are pushed to the right, and smaller elements are pushed to the left. When events are
popped off the queue, the analogous sequence of steps is illustrated in Figure 7.11.
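A software model illustrates the linear array's behavior. In this Python sketch (an illustration only; the hardware performs one compare-exchange per cell per clock cycle, whereas the model lets the array settle fully between operations), larger values drift right and smaller values drift left, so the head cell always holds the minimum.

```python
class LinearArrayQueue:
    """Software model of the linear array queue of Figure 7.9: each cell
    holds up to two values, new values enter at the head, and the current
    minimum is always available at the head."""

    def __init__(self, n_cells):
        self.cells = [[] for _ in range(n_cells)]

    def push(self, value):
        self.cells[0].append(value)
        self._settle()

    def pop(self):
        smallest = min(self.cells[0])
        self.cells[0].remove(smallest)
        self._settle()
        return smallest

    def _settle(self):
        # Let the array stabilize: overfull cells pass their largest value
        # right, underfull cells pull the smallest value from the right,
        # and out-of-order neighbours exchange values.  The hardware does
        # one such compare-exchange per cell per clock cycle.
        changed = True
        while changed:
            changed = False
            for i in range(len(self.cells) - 1):
                left, right = self.cells[i], self.cells[i + 1]
                if len(left) > 2:
                    big = max(left)
                    left.remove(big)
                    right.append(big)
                    changed = True
                if len(left) < 2 and right:
                    small = min(right)
                    right.remove(small)
                    left.append(small)
                    changed = True
                if left and right and max(left) > min(right):
                    a, b = max(left), min(right)
                    left.remove(a)
                    right.remove(b)
                    left.append(b)
                    right.append(a)
                    changed = True
```

Pushing the sequence 8, 2, 7, 3, 1, 5, 6, 4 of Figure 7.10 and then popping eight times returns the values in ascending order, as in Figure 7.11.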
7.2.2.3 The Queue Model Results
The implemented hardware service queue is a five element design closely resem-
bling Figure 7.9. For the traffic simulation example, there is no need for the fully ordered
list of Figure 7.6. The linear array queue is capable of pushing one 16-bit value per 40
nanoseconds. The smallest queue value can also be popped out at that rate. It is as-
sumed that each simulator cycle needs to push one event and pop one event from the
service queue. Therefore, the queue achieves the system 80ns cycle time. Queue data
Fig. 7.10. Linear Sort Array Input Example The figure illustrates a sequence of values being pushed into the array. The top array illustrates the first time step, with each successive array below depicting the same array during successive clock cycles. Comparators on each processing element and multiplexors between each element steer the values into each element of the array of Figure 7.9. Larger elements are pushed to the right.
Fig. 7.11. Linear Sort Array Output Example The figure illustrates a sequence of values being popped out of the array. Comparators on each processing element and multiplexors between each element steer the values into each element of the array. Smaller elements are pushed to the left.
values also require pointers to the event data, so pairs of values must be pushed onto
and popped off the queue. Conversely, when new elements are inserted into a software
data structure, either the existing elements must be fetched from memory so that the
CPU can compare them to the new arrival and determine the insertion point, or an
address must be calculated to select the hash bin on which to chain the new entry.
Software methods require more time, and variable amounts of it.
Comparing the two methods in terms of their required hardware, the Service Event
Queue requires two registers for each memory word, where one is the memory register
and the second is within the Matchbit array. Additional matching logic, a Match Array
Bit, and a Tag bit are required for each word. With the Linear Array approach, each
stored word requires 1 storage register, and an amortized 50% of both a multiplexor and
a comparator plus some additional control logic. The hardware requirements are similar
in magnitude.
Using Altera’s Max+Plus II® FPGA simulation package, the Event Generator
and the Service Queue have been simulated as individual parts running with a clock
period of 80 ns. The service queue was simulated with a 40 ns period, allowing it to push
an event during the first half cycle and pop an event during the second. A 5 processing
element queue was implemented on one Altera EPF10K20TC144-3 chip utilizing 90% of
the chip’s resources. The system FPGA components are listed in Table 7.1.
Figure 7.12 illustrates the speedup expected for other distributions if the 80ns
clock is maintained in their hardware implementations. The implemented hardware
Function           Quantity   Chip Type         % Utilized   Clock Rate
Event Generator
  Logarithm Unit   4          EPF10K40RC208-3   95%          15.84 MHz
  Event Logic      1          EPF10K30RC240-3   71%          12.78 MHz
Service Queue
  Linear Array     1          EPF10K20TC144-3   90%          30.95 MHz
Table 7.1. Event Generator and Event Queue FPGA Implementation The Altera FPGAs used to simulate the event generation and event list hardware are listed. The natural logarithm unit uses pairs of FPGAs to facilitate one result per clock cycle. Two logarithm units are used to generate one arrival time per clock cycle, and another two logarithm units are used to generate a service time offset per clock cycle. Each logarithm unit of a pair generates output on alternating clock cycles. The Linear Array implementation is described in Section 7.2.2.2.
Fig. 7.12. Speedup vs Events for Event Generation, Arrival and Service Queues Illustrated are the speedup values obtained by comparing the software event generation and queuing from the code in Table 4.3 against the respective hardware implementation. Curves are shown for the Uniform, Normal, LogNormal, Weibull, and Negative Exponential distributions. The speedup values were derived on a dual Intel Pentium 350 MHz RedHat Linux box running the 2.2.15 kernel. Compilation of the software was with the GNU gcc compiler, version 2.95.2, using the optimization flag. The speedup results indicate a speedup of roughly two orders of magnitude.
distribution provides the speedup illustrated for the Negative Exponential curve in Fig-
ure 7.12. It is also important to note that the proposed linear array queue need not
be implemented as reconfigurable logic. The queue can be implemented as an
Application-Specific Integrated Circuit (ASIC), and would probably be able to function at an even
faster clock rate with many more queue elements. The code execution times were clocked
on a Dual Pentium 350 MHz machine running RedHat Linux kernel version 2.2.15. The
code was compiled using the GNU gcc compiler, version 2.95.2. To gather accurate tim-
ing results, the number of events in the queue is kept constant. The extra time used to
generate additional arrival events in order to maintain the queue size is not included in
the speedup plot of Figure 7.12.
7.2.3 Scheduler
The results of Section 4.2.3 determined that discrete event simulation scheduling
algorithms are an important target of acceleration research. However, unlike the Event
Generation and Event Queue implementations, the scheduler implementation is very
simulation dependent. For instance, a discrete event simulation of road traffic might
have a very different scheduling algorithm than a biological scenario. For this study,
the simulation of traffic was selected. The nature of microscopic traffic simulation is
the determination of position, velocity, and acceleration along with routing and other
considerations. Traffic simulation has the added benefit of a significant amount of data
locality. Vehicles in a system tend to dwell in the same locality and their dependencies
rely on other sets of data within that same locality. Even when vehicles move, they move
to an adjacent node within the traffic network. The work in this study, and especially
within this section, must be viewed in light of these properties of traffic simulation.
The other properties of event generation and the event queue are probably more widely
applicable to discrete event simulation in general. Nevertheless, a wide field of analogous
simulations can benefit from the results of this section.
Due to current technology limitations, it is assumed that each processing element
contains one intersection and all of the roads which exit from the intersection. Process-
ing elements model two, three, and four-way intersections. A four-way intersection is
illustrated in Figure 7.13. Larger intersections are modeled using combinations of these
three intersection types.
The Scheduler component of the architecture is described in several sections.
Section 7.2.3.1 explains the data structure which composes a vehicle description. Next,
Section 7.2.3.2 describes how the vehicle data structure of Section 7.2.3.1 is initialized as
the vehicle enters the traffic simulation network. Vehicle road movement computation is
described in Section 7.2.3.3. Section 7.2.3.4 describes the com-
putation involved in vehicle movement through an intersection. Finally, Section 7.2.3.5
presents the Scheduler experimental results.
7.2.3.1 Vehicle Data
The data requirements of the logic simulators discussed in Chapter 3 are less
demanding than the data requirements faced by a traffic simulator. For logic simulation,
the output of a particular logic function tends to be a relatively simple signal value. The
output of a logical gate, AND or OR for example, tends to be a single value. Traffic
simulation faces a broader, more complex, set of values which must be computed and
Fig. 7.13. An Intersection and its Departing Roads The traffic model employed assumes that each processing element can model one intersection and its exiting roads, which are highlighted. The processing element handles traffic entering the intersection from up to four directions. The processing element continues to handle the traffic on the lanes which exit the intersection. When a vehicle reaches the end of a road it is handed off to the next processing element. Processing elements handle two, three and four-way intersections. Large intersections are generated by creating combinations of these smaller intersection types.
transmitted with each simulation cycle. The challenge is to develop a minimum dataset
which is feasible and useful. For this research, the vehicle is composed of the sub-fields
described in Table 7.2.
Vehicle acceleration on roads in the scheduler is determined by a variety of condi-
tions. One factor is whether or not the vehicle has a leader within its headway. For this
thesis, the headway is a 4 second following time based on the vehicle’s velocity. Other
acceleration criteria include the distance to the end of the road, the value of the traffic
signal at the end of the road, and the vehicle’s velocity with regards to the speed-limit.
If the vehicle is determined to be in a following mode, the vehicle’s acceleration is cal-
culated using Equation 2.9. Vehicles which do not follow a leader, are not approaching
the end of a road, and are not speeding, use a table lookup to determine their free-flow
acceleration. Similar to [146], the acceleration is determined by the most restrictive
constraint of Equation 7.1. Tables 7.3 and 7.4 provide the algorithms used to determine
vehicle acceleration.
a_n = min( a_n^CarFollowing, a_n^TrafficSignal, a_n^FreeFlow, a_n^Speeding )        (7.1)
As the vehicle flows through the simulation network, additional data is required,
but that data remains stationary at each local processing element. Free flow acceleration,
for example, is maintained at each processing node. Vehicles needing the value perform
Field Name           # of bits   Description
source               16          Vehicle entered the network at source.
destination          16          Vehicle travels to destination.
velocity             9           Vehicle speed (ft/sec)
vehicle type         2           High or Low Performance Car, Bus or Truck
vehicle id           18          Unique ID, used in combination with Source
lane assignment      2           Format to handle 4 lanes, 3 travel + shoulder.
valid word bit       1           Is data word valid?
center point x       32          x coordinate
center point y       32          y coordinate
distance down lane   16          distance along road
reserved             16          Reserved.
Total:               160         Total bits for vehicle
Table 7.2. Vehicle Data Fields In order to accelerate vehicle processing, the size of the vehicle data needs to be limited to allow it to pass quickly through the datapath and avoid memory stores and fetches wherever possible. The data fields listed in the table are those which are required for movement computation. The source is included for statistics and vehicle identification. The destination is used for vehicle routing. The vehicle type is used in combination with the velocity to determine the free flow acceleration. For movement within an intersection on turning lanes, a polar coordinate system is used. Within intersections, the distance down lane field is initialized to a radial θ field which keeps track of the angle traversed during the vehicle's turning motion. Similarly, the reserved field is used to keep track of the angular velocity, and is initialized to ω0. Free flow accelerations are stored on each node in table format. The data which is vehicle specific and accessed with each calculation moves with the vehicle. If more data is incorporated with the vehicle, local memory and cache can be used in conjunction with the event list and a pre-fetch mechanism.
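The 160-bit vehicle word of Table 7.2 can be sketched as a packed integer. The field widths below are taken from the table; the ordering of the fields within the word is an assumption made for illustration, since Table 7.2 fixes only the widths.

```python
# Field widths from Table 7.2 (160 bits total); the in-word order is assumed.
FIELDS = [("source", 16), ("destination", 16), ("velocity", 9),
          ("vehicle_type", 2), ("vehicle_id", 18), ("lane_assignment", 2),
          ("valid", 1), ("center_x", 32), ("center_y", 32),
          ("distance_down_lane", 16), ("reserved", 16)]

def pack_vehicle(values):
    """Pack named field values into one 160-bit integer, MSB-first."""
    word = 0
    for name, width in FIELDS:
        v = values[name]
        assert 0 <= v < (1 << width), f"{name} overflows {width} bits"
        word = (word << width) | v
    return word

def unpack_vehicle(word):
    """Inverse of pack_vehicle: recover the named fields, LSB-first."""
    values = {}
    for name, width in reversed(FIELDS):
        values[name] = word & ((1 << width) - 1)
        word >>= width
    return values
```

Keeping the vehicle in one fixed-width word is what lets it flow through the datapath without memory stores and fetches, as the caption of Table 7.2 notes.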
a table lookup based on the vehicle type and current velocity. Each node also contains
network routing information.
7.2.3.2 Vehicle Initialization
When each vehicle is dequeued from either the Arrival or Service queue of Fig-
ure 7.3 and injected into the simulation network, the vehicle must be initialized with its
vehicle type, source node, and destination node. Additional data may also be required,
depending on the simulation. A single stage implementation was created to initialize
new vehicles.
7.2.3.3 Road Movement
The road movement implementation is composed of two parts. The first part
initializes vehicles each time they enter the road for the first time. The data fields which
are initialized include the lane assignment, center point, and distance down lane fields
from Table 7.2. In Section 7.2.3.5, this initialization implementation is referred to as
Veh Road Init. The rest of this section deals with the second implementation, referred
to as Move on Road in Table 7.5, which performs the vehicle movement computation.
The queues in Figure 7.14 are not event queues as illustrated in Figure 1.1. The
Event Queue in Figure 1.1 expels approximately one event per simulation time cycle.
The number of vehicles generated and expelled from source nodes is determined by the
user’s selected statistical generation distribution. The event queues of Section 7.2.2 are
required only in conjunction with event generation and only on simulation source nodes.
Fig. 7.14. Scheduler Vehicle Queue The scheduler vehicle queue implementation is similar for both movement on a road and movement in an intersection. Two FIFOs are maintained. Newly arrived vehicles are placed in the Entry FIFO. The Main FIFO is for those vehicles which are in-progress, either along a road or through an intersection. The comparator between the two queues selects the vehicle with the most advanced position down the lane, and routes that vehicle's data into the appropriate functional units of Figure 7.15 or 7.16 for either road-handling or intersection-handling movement calculations respectively. Vehicle datasets which are circulated back from either Figure 7.15 or 7.16 are placed in the main FIFO for the next simulation time cycle's computation. Although the dual queue design is similar to the system illustrated in Figure 7.3, the application here is much different. The vehicle events stored in the queues of Figure 7.3 are dormant until their appropriate simulation time cycle arrives. The vehicles are then dequeued, initialized, and injected into the Scheduler, which for traffic, simulates the road network. Vehicle events in the queues of Figure 7.3 are indexed based on time. The vehicles stored in the queues shown in this figure are already moving within the traffic network and are themselves circulated and processed every simulation cycle. These vehicles are indexed based on their lane position, not time.
The main queue of Figure 7.14 is used to store the vehicles which circulate to
the computation units each simulation cycle so that the vehicle position, velocity, and
acceleration fields can be updated. The second queue, the entry queue, stores vehicles
which have just entered the intersection or roadway. The implementation calculates the
vehicle position, velocity and acceleration for vehicles moving in the same direction on a
road. Similarly, each intersection implementation handles one direction of traffic enter-
ing the intersection and exiting the same intersection in one of possibly three directions.
Each processing element contains FPGAs to handle a four-way intersection and its cor-
responding exiting lanes of traffic. All queued vehicles in Figure 7.14 are processed every
simulation time cycle. The queued data values represent vehicles moving on either a
road or through an intersection. Each vehicle is circulated from the appropriate queue
into the calculation hardware implementation (Figure 7.15 or 7.16) for acceleration, ve-
locity and position updates, and then either the vehicle is circulated back to the main
FIFO or to the next traffic network node. Each vehicle in the queues is handled once
per simulation cycle. The vehicles are advanced based on position. Those closest to the
end of the road are moved first.
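The dual-FIFO service order described above can be sketched in software. The function below is an illustrative stand-in, not part of the thesis hardware: the dict field `pos` and the `update`, `at_end`, and `transfer` callbacks represent the movement pipeline, the end-of-road test, and the hand-off to the next node, while the head-of-queue comparison plays the role of the comparator between the Main and Entry FIFOs.

```python
from collections import deque

def process_lane_cycle(main_fifo, entry_fifo, update, at_end, transfer):
    """One simulation cycle over the dual-FIFO vehicle queue (sketch).

    Every queued vehicle is serviced exactly once, most-advanced first:
    the comparator between the two FIFOs always selects the head
    vehicle with the larger lane position `pos`.
    """
    next_main = deque()
    while main_fifo or entry_fifo:
        if not entry_fifo:
            veh = main_fifo.popleft()
        elif not main_fifo:
            veh = entry_fifo.popleft()
        elif main_fifo[0]["pos"] >= entry_fifo[0]["pos"]:
            veh = main_fifo.popleft()
        else:
            veh = entry_fifo.popleft()
        veh = update(veh)          # acceleration/velocity/position update
        if at_end(veh):
            transfer(veh)          # hand off to the next traffic-network node
        else:
            next_main.append(veh)  # recirculate for the next simulation cycle
    main_fifo.extend(next_main)
```

Processing most-advanced vehicles first mirrors the position-based indexing of the queues: a vehicle's leader is always updated before any follower that depends on it.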
For the road movement calculation, the vehicles are processed as follows. Although the vehicles pass through the 6 pipeline stages illustrated in Figure 7.15, the processing is divided into 4 cycles. Vehicle data passes through the first four stages, where its acceleration is determined. The next two stages repeat the work of the first two, except that during this second pass the vehicle fulfills its role as a leader to subsequent vehicles in the same lane. If no vehicle is following in the same lane, the vehicle data replaces the previous leader in the lane-based leader register.
210
The movement calculations can now be defined according to their 6 pipeline stages.
As vehicle data enters the pipeline in the first stage, several actions begin immediately.
First, the computation of all forms of acceleration which require a total of four stages
begins. In order to compute the acceleration, the vehicle’s lane assignment and leader
are determined. The lane assignment, in conjunction with the vehicle's headway, facilitates
the determination of the vehicle's leader. The second stage computes items including
the vehicle’s headway. Acceleration computation continues during the second and third
stages. In the third stage, the vehicle’s position and velocity computations begin. Free-
flow acceleration indices are computed. During the fourth stage, the correct acceleration
value is selected from the computed accelerations. The acceleration selection is based
on the algorithm described in Table 7.3 and Equation 7.1. Once the acceleration for
the vehicle is selected, the vehicle can finish its velocity and position computations and
begin to serve as a leader for any qualifying subsequent vehicle. Stage five allows the
vehicle with its newly calculated acceleration to be placed in the appropriate leader
register for subsequent vehicles. The final stage of the pipeline computation decides
whether the vehicle has reached the end of the road and therefore needs to transfer to
the next intersection, or whether the vehicle should be returned to the main queue of
Figure 7.14 to await the next simulation cycle of processing. The pipeline is illustrated
in Figure 7.15.
The scheduler hardware implementation for the selected traffic example takes
significant advantage of data locality. The selected model distinguishes road traffic and
intersection traffic. Vehicles are assumed to be initialized and injected onto a road.
Properties such as the speed-limit, grade, and other road characteristics are considered
Fig. 7.15. Calculations for Vehicle Movement on a Road. The steps required to calculate a vehicle's movement along a road are represented. For each simulation time cycle, vehicles on the road are moved from the Main and Entry FIFO queues of Figure 7.14 into the pipeline described by the block diagram. The pipeline processes the vehicle data words to adjust the vehicle acceleration, velocity, and position. Each vehicle passing through the calculation pipeline may depend on its immediate predecessor's calculation if the previous vehicle is the current vehicle's leader in traffic. Because the lead time required to calculate the acceleration is 6 cycles, and because a dependency may exist where this vehicle's acceleration may be required to determine the following vehicle's acceleration, all the possible acceleration outcomes commence calculation immediately and concurrently. Acceleration determines the duration of each vehicle's process time in the pipeline. As the accelerations are being computed, the vehicle's traveling lane and possible lead vehicle are determined. The appropriate acceleration is selected based on the vehicle's relation to its leader, the vehicle's distance to the end of the road, the traffic signal value at the end of the road, the vehicle's previous velocity, the speed-limit, etc. The block diagram was implemented as a 6-stage pipeline.
if leader = 0 then                          . Not following anyone - no leader
    if (traffic signal == green) || (plenty of open road) then
        if speeding then
            . decelerate in proportion to speeding
            accel = (V^2 − V^2_speeding) / (2 · headway)
        else
            accel = table lookup free-flow value
        end if
    else                                    . open road, but traffic signal or road end
        if (road end) then
            . stop-sign, wait for intersection access
            accel = 0, velocity = 0
        else
            accel = −V^2 / (2 · roadleft)
        end if
    end if
else                                        . Following a leader
    if (traffic signal == red) then
        accel = −V^2 / (2 · roadleft)
    else
        accel = α_{l,m} [ẋ_{n+1}(t+Δt)]^m [ẋ_n(t) − ẋ_{n+1}(t)] / [x_n(t) − x_{n+1}(t)]^l
    end if
end if
Table 7.3. Acceleration Decisions for a Road. The algorithm for determining vehicle acceleration during road travel is provided. Only stop-sign traffic signals were simulated; therefore, vehicles stop at the end of a road before proceeding into the intersection. Acceleration for free-flow traffic is determined by table lookup based on the vehicle speed and type. Equations 2.9 and 2.15 derived in Chapter 2 are incorporated in this algorithm.
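The decision tree of Table 7.3 can be expressed as a software sketch. All parameter names below are illustrative, the "plenty of open road" test is approximated by comparing remaining road to the headway, and the sign conventions follow the table as printed; `freeflow` stands in for the table-lookup free-flow value, and the final branch is the car-following term of the table.

```python
def road_acceleration(v, v_limit, headway, road_left, signal, road_end,
                      leader, freeflow, alpha_lm=1.0, l=1, m=0):
    """Sketch of the Table 7.3 acceleration decision for road travel.

    leader is None, or a dict with the leader's position x_lead and
    velocity v_lead plus this vehicle's position x. Returns a pair
    (accel, velocity) where velocity is None unless forced to zero.
    """
    if leader is None:                       # not following anyone
        if signal == "green" or road_left > headway:
            if v > v_limit:                  # decelerate in proportion to speeding
                return (v**2 - v_limit**2) / (2 * headway), None
            return freeflow, None            # table-lookup free-flow value
        if road_end:                         # stop sign: wait for intersection access
            return 0.0, 0.0
        return -v**2 / (2 * road_left), None
    if signal == "red":                      # following, but must stop at road end
        return -v**2 / (2 * road_left), None
    gap = leader["x_lead"] - leader["x"]     # x_n(t) - x_{n+1}(t)
    rel = leader["v_lead"] - v               # leader's velocity relative to ours
    return alpha_lm * v**m * rel / gap**l, None
```

In hardware all of these candidate accelerations are computed concurrently and the branch structure reduces to the final multiplexer selection.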
constant with respect to the roads. The vehicle is composed of data which includes
the destination, velocity, type, etc. Some vehicle properties, free-flow acceleration for
example, need not move with the vehicle, but can be locally accessed based on the vehicle
type and velocity in each simulation node.
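The split between data that travels with a vehicle and data that stays node-resident can be sketched as below. The field names and table values are assumptions for illustration; the thesis specifies only that fields such as destination, velocity, and type move with the vehicle, while road-constant properties like free-flow acceleration are looked up locally by type and velocity.

```python
from dataclasses import dataclass

@dataclass
class VehicleWord:
    """Illustrative layout of the circulating vehicle data word."""
    dest: int      # destination node, set at injection
    vtype: int     # vehicle type; indexes node-local free-flow tables
    pos: float     # position along the current lane
    vel: float
    accel: float

# Road-constant data (speed limit, grade, free-flow acceleration tables)
# is stored per simulation node, keyed here by vehicle type and a
# quantized velocity band; the values are assumed examples.
FREEFLOW = {(0, 0): 1.2, (0, 1): 0.8}

def freeflow_accel(node_table, veh: VehicleWord) -> float:
    """Node-local lookup: nothing here travels with the vehicle."""
    return node_table[(veh.vtype, int(veh.vel) // 10)]
```

Keeping such tables node-local shrinks the vehicle data word that must cross the interconnect.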
7.2.3.4 Intersection Movement
Similar to the road movement implementation of Section 7.2.3.3, the intersection
movement implementation is composed of two parts. The first part initializes vehicles
each time they enter the intersection for the first time. The data fields which are initial-
ized include the lane assignment, θ0, and the ω0 fields from Table 7.2. In Section 7.2.3.5,
this initialization implementation is referred to as Veh Intersect Init in Table 7.5. The
rest of this section deals with the second implementation, referred to as Move in Intersect
in Table 7.5. Move in Intersect performs the computation for vehicle movement within
the intersection.
Vehicle motion through the intersections is similar to the motion computations of
the traffic on roads. There are, however, some differences. For this simulator design, the
motion through the intersection and the related computations were converted to angular
velocity and acceleration for intersection turns. Another difference is that as vehicles
traverse the intersection, they are assumed to continue through the end of the
intersection without stopping. All roads, on the other hand, are simulated
as being terminated by stop-signs. Otherwise, the intersection computations also follow
a similar 6 stage pipeline. The vehicle’s lane position and leader are determined in the
first stage. All acceleration choices also begin their computation in the first stage. In
the second stage, a determination of whether or not the leader falls within the vehicle’s
headway is started. Acceleration computations continue. The third stage initiates the
vehicle’s free-flow acceleration table index computation. Vehicle position and velocity
computations begin during the third stage. The selection of the proper acceleration
occurs in the fourth pipeline stage. Vehicle position and velocity computation complete
during this fourth stage. In the fifth stage, the vehicle may begin to serve as the leader
for any subsequent vehicles. If the vehicle has crossed the intersection, it is handed off
to the exit road entrance queue for road handling. Otherwise, the vehicle is returned to
the intersection main queue for further processing during the next simulation cycle.
The general mathematical formulas used to describe vehicle motion which are
applied in this section are reviewed in Chapter 2.
For this thesis, all roads are assumed to be straight running either north/south
or east/west. Further, all intersections are assumed to be governed by stop signs. Traffic
lights were not modeled. For turning computations required within intersections, angular
acceleration equations analogous to 2.9 and 2.15 were derived.
7.2.3.5 Scheduler Results
Results from the scheduler show the least speedup of the simulator sections
attempted. The major limitation to the experimental design lies in the division functional
unit and the data dependency between leading and following vehicles. An implemen-
tation of just a simple division functional unit with registered input and output ports
achieved a clock rate of 9.14 MHz. So one impediment to faster implementation on the
Fig. 7.16. Calculations for Vehicle Movement Through an Intersection. A vehicle's movement through an intersection is similar to its movement along a road as illustrated in Figure 7.15. Again, the acceleration calculations determine the length of the 6-cycle calculation pipeline. Movement through the intersection differs from movement along a road. For instance, it is assumed that there is no traffic signal at the end of the intersection lane. For this study, a polar coordinate system is applied in the intersections, so the angular acceleration, velocity, and radial angle are computed for each vehicle.
if (leader = 0) then                        . No leader - open road
    if speeding then
        . decelerate in proportion to speeding
        accel = (V^2 − V^2_speeding) / (2 · headway)
    else                                    . Not speeding
        accel = table lookup free-flow value
    end if
else                                        . Following a leader
    accel = ẍ_{n+1}(t+T) = λ [ẋ_{n+1}(t+T)]^m [ẋ_n(t) − ẋ_{n+1}(t)] / [x_n(t) − x_{n+1}(t)]^l
end if
Table 7.4. Acceleration Decisions for an Intersection. The acceleration decisions for traversing an intersection are similar to the decisions for travel on a road as explained in Table 7.3, but it is assumed that a vehicle does not stop at the end of the intersection before moving on to the next road. Equations 2.9 and 2.15 derived in Chapter 2 are converted to their angular counterparts and then incorporated in this algorithm. For turning lanes, the corresponding angular acceleration equations were implemented.
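The per-cycle polar update implied by the angular conversion can be sketched as follows. The turn radius `r`, the quarter-turn exit condition, and the function name are assumptions for illustration; `alpha` is whatever angular acceleration the Table 7.4 logic selects, and the linear analogues are x = r·θ, v = r·ω, a = r·α.

```python
import math

def intersection_step(theta, omega, alpha, r, dt):
    """One time-cycle polar update for a turning vehicle (sketch).

    theta, omega, alpha: radial angle, angular velocity, and the
    angular acceleration chosen by the acceleration-decision logic.
    Returns the updated state, the arc distance travelled, and a flag
    marking that the vehicle has crossed the intersection.
    """
    omega_new = omega + alpha * dt
    theta_new = theta + omega * dt + 0.5 * alpha * dt**2
    arc = r * theta_new                 # distance travelled along the turn
    done = theta_new >= math.pi / 2     # quarter turn: hand off to exit road
    return theta_new, omega_new, arc, done
```

A straight-through movement reduces to the same update with an effectively infinite radius, which is why the road and intersection pipelines share their structure.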
FPGAs is division. During the fitting of the traffic designs, the slowest routing imple-
mentation paths are composed of division signal lines. If AHDL division library routines
cannot be accelerated, providing hardwired division functional units on FPGAs would
certainly accelerate the traffic implementations.
The routines listed in Table 7.5 compute different segments of the vehicle simula-
tion. The Initialize Vehicle implementation, described briefly in Section 7.2.3.2, handles
vehicles which are entering the simulation at source nodes for the first time. These ve-
hicles have been popped off the Event Queue described in Section 7.2.2 and presented
in Table 7.1. The Initialize Vehicle component receives vehicles from the event queue
and inserts the vehicle's source and random destination node values into the vehicle data
structure. The Veh Road Init component handles vehicles anytime they enter a new
road in the simulation. Any required road specific data is injected into the vehicle data
structure in this routine. The component is described at the beginning of Section 7.2.3.3.
The rest of that section is devoted to describing the Move on Road implementation. Veh
Intersect Init, described at the beginning of Section 7.2.3.4, similarly initializes vehicles
as they enter intersections. The rest of Section 7.2.3.4 then describes the Move in Inter-
sect implementation. Both the Move in Intersect and Move on Road implementations
employ a First-In-First-Out (FIFO) queueing system composed of two FIFO queues. So
each implementation requires two chips to implement this queueing system. The system
is implemented to handle 256 vehicles in the main queue.
Each processing element has facilities to handle a four-way intersection and its
egress lanes of traffic. Therefore, each processing element contains enough FPGAs to
handle 4 sets of each implementation described in Table 7.5, except the Initialize Vehicle
Function            Chips  Chip Type           % Util  Clock Rate   Cycles/veh
Initialize Vehicle    1    EPF10K30EFC256-1     50%      5.44 MHz       1
Veh Road Init         1    EPF10K200SFC484-1    38%     46.94 MHz       1
Veh Intersect Init    2    EPF10K200SFC484-1    63%    109.89 MHz       1
                           EPF10K30EQC208-1     50%
Move in Intersect     3    EPF10K200SBC356-1    52%      5.44 MHz       1
                           EPF10K50EQC240-1     57%      5.44 MHz       1
                           EPF10K100EBC356-1    75%      7.54 MHz       4
Move on Road          3    EPF10K200SBC356-1    52%      5.44 MHz       1
                           EPF10K50EQC240-1     57%      5.44 MHz       1
                           EPF10K130EFC484-1    89%      8.00 MHz       4
Table 7.5. Scheduler Chip Implementation. The scheduler software was implemented as 5 separate components. Initialize Vehicle sets the vehicle's source location and destination as the vehicle is injected into the traffic network. Although its clock speed is only 5.44 MHz, it can process vehicles once per cycle and is therefore not a bottleneck for the simulator. Veh Intersect Init prepares vehicles for transit through an intersection by initializing their starting coordinates and lane designation. The module also performs some vehicle routing. Both the Move in Intersect and Move on Road implementations employ a first-in-first-out (FIFO) queueing system composed of two FIFO queues, so each implementation requires two chips to implement this queueing system. The Move in Intersect implementation is the system bottleneck: 4 clock cycles are required to process each vehicle data word, and the clock runs at only 7.54 MHz due to division operations in some of the pipeline stages. Table 4.6 illustrates that for the timed Trafix scheduler function, the bottleneck resides in the intersection routine. Here, the routine requires 4 cycles, so the speedup attained by the hardware over the software is 91.
implementation. Nodes may be specifically configured to act as vehicle source nodes,
in which case only the vehicle road handling hardware and the hardware used for the
source node presented in Table 7.1 are required. The resulting processing element design
is illustrated in Figure 7.17. The total number of FPGA or reconfigurable logic chips is
expected to decrease as the technology and compilers improve.
Comparing Tables 4.6 and 7.5, system bottlenecks can be seen to occur within the
function for traversing an intersection. In software, this routine required 48.4 µs. This
measurement comes from timing and averaging the vehicle movement routines while the
Trafix code was executing on a UNIX box as described by Table 4.6. The software timing
results are then compared to the time required to get the same functional result on the
FPGA implementation according to the Altera MaxPlusII simulation. In hardware, 4
cycles of a pipeline running at a 7.54 MHz clock are required. Therefore, the
speedup of the hardware implementation over the software implementation is 91. This
value compares favorably with other reported hardware acceleration, usually based on
a network of processors, used to increase the speed and capacity of simulation by up to
100 times [14].
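The speedup figure follows directly from the measured quantities:

```python
software_us = 48.4            # measured Trafix intersection routine (Table 4.6)
clock_mhz = 7.54              # Move-in-Intersect pipeline clock (Table 7.5)
cycles_per_vehicle = 4

hardware_us = cycles_per_vehicle / clock_mhz   # 4 cycles at 7.54 MHz, about 0.53 us
speedup = software_us / hardware_us
print(round(speedup))         # 91
```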
7.3 Network
For simulation acceleration to be successful, speedup must occur within all facets
of the architecture, including the processing element interconnection network. Section 7.3
presents a method of synchronizing individual nodes to form a processing element net-
work capable of determining the smallest timestamped event rapidly. The basic processor
model used to implement the local processing elements is illustrated in Figure 7.3.
Fig. 7.17. Processing Element for 4-way Intersection and Exit Roads. A detailed version of the Processing Element design depicted in Figure 7.3 is illustrated. This design is capable of modeling 4-way intersections and consists of enough Scheduler subcomponent units to model the traffic entering an intersection from 4 directions and exiting the intersection on 4 output roads. A single Event Generator module, which contains its associated Arrival and Service Queues, is included to allow the Processing Element to serve as a simulation source node. The Processing Element contains 4 nearest-neighbor interconnect FIFOs and a communications FIFO pair which connects to its corresponding cross-point switch, illustrated in Figure 7.23. An additional interface connects the processing element to the parallel bus illustrated in Figure 7.22. A central crossbar matrix, similar to the Splash design described in Section 3.2.1, connects the various processing element sub-components. The design described in this figure requires approximately 30-34 FPGAs. 6 FPGAs are required for the Event Generator and the Event Queues. Each Scheduler sub-component used in calculating vehicle movement requires 3 FPGAs, yielding a total of 24 FPGAs for the 8 Scheduler sub-components. Additional FPGAs are reserved for Channel Control. This value is similar to the original Splash board designs, which required 32 FPGAs per Splash unit [64].
A typical simulator is composed of individual nodes joined in a network. To
prevent causality errors in conservative simulation, all nodes process the same simulation
cycle simultaneously. In conservative event-driven simulation, individual nodes all jump
to the simulation cycle which coincides with the smallest timestamped event held within
the network. Logistical difficulties occur in both the communications and sorting of the
timestamps. Each node’s local minimum timestamp must be compared against all of the
local minimum timestamps in the global network.
In a simulation network, as shown in Figure 7.18, nodes are generally synchro-
nized using either a time-driven or event-driven simulation approach. A single network
architecture can be constructed allowing a simulation to run as either a time or event-
driven model. The decision between the two models is made at the beginning of the
simulation based on calculations from Chapter 5, and the selected model is used for the
simulation duration. A communications network which can be used to determine and
select the smallest timestamp in a network of nodes when running in event-driven mode
is presented. A time-driven solution is also presented using the same implementation.
7.3.1 Communications Architectures
Communications synchronization is often a source of delay. In work on the CM-5,
Legendza notes synchronization overhead accounts for 70% to 90% of total simulation
runtime and therefore severely limits speedup [96]. Traditional approaches in multi-
processor simulation search for the smallest next timestamp in a network of N processing
elements. The simulation model may have n active simulation model nodes distributed
across the N processing elements in a balanced fashion, but each processing element will
Fig. 7.18. A Network of Processing Elements. A simulation consists of a network of event sources, sinks, and way-points. Each must be synchronized to the global system time clock. Two common methods of synchronization are time- and event-driven synchronization. The analysis of Chapter 5 can be used to gauge which method is faster. The illustrated time-driven simulation uses a controller/subordinate approach similar to Levendel [98]. The network core, illustrated in Figure 7.19, serves as the Main Synchronizer, which asserts the Start line at the beginning of each time cycle. Each network processor signals it is ready for the next time cycle by asserting its Done line. The Start and Done lines are configured as reduction network lines as illustrated in Figure 7.20.
Fig. 7.19. The 3-Dimensional Network Structure. Although trees have a wonderfully logarithmic decreasing structure, they offer difficult geometric constraints for actual implementation. A linear parallel bus offers a much easier structure to implement, but poses more difficult adjacency problems. In the network illustrated, each parallel bus is composed of reduction logic as shown in Figure 7.20. Much of the communications can be accomplished by the Processing Element (PE) I/O cells. The length of each bus is a trade-off between communications circuit element switching speed, bus signal propagation speed, and physical PE geometry constraints. In this figure, the PEs are arrayed along linear busses. Letting 10 elements reside on each bus, with 10 arrays of 100 PEs per quadrant, allows each network to contain 8000 elements. The core may be composed of more than one processor, but for the purposes of this research, the core is assumed to be one unit.
Fig. 7.20. The PE Interconnection Network. Processing Elements (PEs) can be interconnected in 1 or 2 dimensions. The interconnections consist of a high-speed emitter-coupled logic design. The buses link the Processing Elements together, allowing rapid and semi-parallel determination of the next smallest time in the network. The OR network assists in the computation of the smallest timestamp and serves for both computation and signal driving. In addition, each processing element is directly connected to its north, south, east, and west neighbors in what is commonly called a two-dimensional nearest-neighbor communication pattern [64].
have one minimum timestamp for the model nodes it handles. Each processor timestamp
must be compared against the other minimum timestamps in the network. Some of the
more commonly expected network search algorithms include network structures constructed
as k-ary trees, depicted in Figure 7.21. The smallest timestamp is
filtered to the root of the tree, and from there the result must be distributed to the rest
of the network. This method requires O(log_k(n)) communications steps.
Fig. 7.21. K-ary Search Tree Network. The K-ary search network topology allows N processing elements in a network to compare individual local minimum timestamp results to the winner of the K elements on the level below it in the network tree. Successive winners compete in tournament-style comparisons.
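The tournament reduction performed by such a tree can be sketched in a few lines; the function name is illustrative, and each list comprehension pass stands in for one level of comparators.

```python
def tree_min(timestamps, k=2):
    """Tournament-style minimum over a k-ary tree (sketch).

    Each round, groups of k local minima feed one comparator on the
    level above; the number of rounds matches the O(log_k n)
    communications steps of the k-ary search tree.
    """
    level = list(timestamps)
    rounds = 0
    while len(level) > 1:
        level = [min(level[i:i + k]) for i in range(0, len(level), k)]
        rounds += 1
    return level[0], rounds
```

For example, 8 processing elements in a binary tree need 3 comparison rounds to filter the minimum to the root, before the result is broadcast back down.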
Another view of the simulation notes that the larger the number of event generators
which exist in the system, the shorter the expected time to the next event, E(x).
This phenomenon can be gleaned from Equations 5.18 and 5.27. Although the examples
from [27] use homogeneous distributions, it is assumed that the trend holds for
independent heterogeneous distributions as well. So the larger the number of event
generators in the simulation, the faster the events arrive, and the smaller the mean
time between events becomes. As N increases, time-driven simulation becomes more
and more practical.
7.3.2 Parallel Bus Architecture
For the proposed algorithm, several transmitters must share the bus and be able to
generate signals simultaneously. The bus architecture can be handled by a bi-directional
reduction logic network. Employing a technology such as Emitter Coupled Logic (ECL)
gives the interface reasonable transmission speed, and ECL hardware couples nicely with
CMOS technology [34]. ECL switching speed is accomplished by keeping its transistors
always biased in their active regions. OR or NOR logic can be used to run buses in two
directions as depicted in Figure 7.20. Reduction logic can be accomplished directly at
the processing element I/O points without processor intervention.
The primary function of the parallel bus is to locate the minimum network time
stamp and synchronize the network. A secondary function of the bus allows the pro-
cessing elements to communicate with the centralized Controllers. Data sent to the
Controller includes the address of the sending processing element, a simulation event or
location identifier, and the data values. Additional control signals to send data to the
Controller are used with the bus. Alternatively, the PE can also communicate with the
controllers serially via the cross-point matrix of Section 7.3.6.
7.3.3 Search Algorithm
One algorithm for finding the network minimum timestamp proceeds in two basic
phases. The first step consists of a general elimination which prunes processing elements
having timestamps larger than 2^k, the base-2 ceiling of the global minimum timestamp.
The second phase of the algorithm then finds the minimum among the remaining nodes.
7.3.4 Phase 1 Elimination
First, all network processing elements (PEs) find their local minimum values.
This search involves comparing the lead elements of the service and arrival queues from
Figure 7.3 in O(1) time. A hardware algorithm for maintaining the smallest event within
a processing element is presented in Section 7.2.2. Next, each PE computes the difference between
the current global simulation time cycle and the next local minimum timestamp, t_diff,
in O(1) time. Each PE determines the number of bits, b, required to express t_diff.
For example, 13 requires 4 bits, 1101₂. The PEs simultaneously pull the signal line
representing b low on the global parallel bus illustrated in Figure 7.22. After all PEs
have floated their b values on the bus in O(1), the PEs whose b value is greater than the
bus minimum signal line eliminate themselves from the search. The smallest asserted
signal line of the parallel bus narrows the scope of the search to the limited range of
numbers expressed in Equation 7.2:
2^b − 2^(b−1) = 2^(b−1)(2 − 1) = 2^(b−1)    (7.2)
All elements not eliminated in this first phase are referred to as active elements
in the second phase.
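The Phase 1 elimination can be modeled in software. The function below is a sketch: the wired-OR parallel bus is modeled by taking the minimum over the b values the PEs would assert, and the returned survivor indices are those PEs that do not self-eliminate.

```python
def phase1_survivors(global_time, local_minima):
    """Phase-1 elimination over the wired-OR bus (sketch).

    Each PE computes t_diff = local_min - global_time and
    b = bit_length(t_diff); pulling line b low on the open-collector
    bus makes the smallest asserted b visible to all PEs in O(1), and
    every PE whose b exceeds that minimum eliminates itself.
    """
    b_values = [(pe, (t - global_time).bit_length())
                for pe, t in enumerate(local_minima)]
    b_min = min(b for _, b in b_values)      # what the bus resolves to
    survivors = [pe for pe, b in b_values if b == b_min]
    return b_min, survivors
```

With a global time of 100 and local minima of 113, 120, 104, and 109, the differences 13, 20, 4, and 9 need 4, 5, 3, and 4 bits, so only the PE holding 104 survives into Phase 2.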
7.3.5 Phase 2 Selection
The second phase of the algorithm can proceed in either of two methods. Method
one requires a 3-bit reduction network, and method two requires a 2-bit reduction net-
work. The first method performs a binary search through the range of timestamps
isolated in Phase 1. The second method performs binary eliminations among the re-
maining active nodes. The reduction network can also serve as the Start and Done lines
for the Main Synchronizer under time-driven simulation as described by Figure 7.18.
In the first method, a Bus Controller begins a binary search through the remaining
range of numbers to determine the minimum global timestamp. The reduction network
is used to allow the PEs to signal whether their values are higher, equal, or lower than the
value floated on the parallel bus. Using Equation 7.2, the global search can be completed
in O(log₂(2^(b−1))) = b − 1 steps, and the resulting global minimum timestamp range is visible
to all PEs simultaneously.
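The first method can be sketched as a binary search over the Phase 1 survivor range. Here the `any(...)` test stands in for the reduction network's lower/equal response to a candidate floated on the bus; names are illustrative.

```python
def phase2_binary_search(b, survivor_tdiffs):
    """Phase-2 method 1: binary search on the surviving range (sketch).

    After Phase 1 the minimum t_diff lies in [2**(b-1), 2**b); each
    probe step floats a candidate value on the bus and the PEs report
    over the reduction lines whether any timestamp is <= the candidate.
    """
    lo, hi = 2 ** (b - 1), 2 ** b - 1
    steps = 0
    while lo < hi:
        mid = (lo + hi) // 2
        if any(t <= mid for t in survivor_tdiffs):   # bus "lower/equal" line
            hi = mid
        else:
            lo = mid + 1
        steps += 1
    return lo, steps
```

Because the range holds 2^(b−1) values, the search always finishes in b − 1 probe steps, matching Equation 7.2.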
The selection phase has several significant advantages over tree search methods.
One advantage of this method is its initial elimination step which occurs across the
network at all PEs simultaneously. This advantage is opposed to a k-ary tree in which
the first comparison happens at the lowest level only. Another significant advantage is
that the network is somewhat more conducive to a geometric element layout as opposed
to a binary tree, where the interconnections between element levels get progressively more
difficult. Perhaps the most significant advantage is that the timestamps can remain in
their original locations instead of being moved and coalesced into a central location. Its
disadvantages include requirements for additional hardware and bus lines as illustrated
in Figure 7.20.
The second method proposed for Phase 2 allows the processing elements which
remain after the first phase to work in adjacent pairs. All active processing elements
generate a signal which is passed towards the core along the network Edge signal line.
Therefore any PE and the core receiving this signal know that there exists at least one
active element on their network edge side. Next, the elements use their Adjacency signal
line to form processor pairs. Active elements at the edge of the network propagate both
the Edge and Adjacency signals. The next innermost active element heading towards
the core will receive both the Edge and Adjacency signals along with the value of the
smallest timestamp on the data lines, as shown in Figure 7.22. This inner, core-side
element will propagate only the Edge signal towards the core. Having alternating elements
propagate the Adjacency signal facilitates a pairing of the network elements. In each
pair, the element closer to the network edge automatically self eliminates. The inner
paired element compares its local minimum timestamp with the value received on the
data-bus. The smaller value becomes the minimum used in the next cycle. The core
retains the smallest value until all eight network quadrants have reported in, and then
broadcasts the final result. The advantage of this mode is that as the number of nodes,
N, increases, the expected time of the minimum timestamp becomes more isolated from
the other local minimum timestamps in the system. The first elimination becomes the
only step required.
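The pairing eliminations of this second method can be abstracted as repeated rounds over the edge-to-core ordering of one network edge. This is a simplified sketch, not the signal-level protocol: the Adjacency pairing is modeled by grouping neighbours two at a time, the edge-side element of each pair forwards its value and self-eliminates, and the core-side element keeps the smaller timestamp.

```python
def phase2_pairing(active):
    """Phase-2 method 2: pairwise elimination rounds (sketch).

    `active` lists the surviving local minima ordered from the network
    edge towards the core. Each round halves the survivors until the
    core-side element holds the edge's minimum timestamp.
    """
    rounds = 0
    while len(active) > 1:
        paired = []
        for i in range(0, len(active) - 1, 2):   # pair edge-most neighbours
            paired.append(min(active[i], active[i + 1]))
        if len(active) % 2:                      # unpaired core-most element
            paired.append(active[-1])
        active = paired
        rounds += 1
    return active[0], rounds
```

When Phase 1 leaves only one active element on the edge, as becomes likely for large N, the loop body never runs and the first elimination is indeed the only step required.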
Fig. 7.22. Algorithm Phase 2, Method 2. Elements eliminated by the initial reduction step are illustrated inscribed with a cross. Signals flow through the eliminated processing elements. The data signals are shown traversing the upper bus. The lower two-signal bus represents the basic handshaking signals. The Edge signal indicates to each element whether or not that element is a network edge element. All elements which have not self-eliminated during the first phase generate an active Edge signal and propagate the signal towards the network core. The Adjacency signal is used to pair processing elements. Each active element which receives the Edge signal but not the corresponding Adjacency signal propagates its own Adjacency signal towards the direction of the network core. When either another active PE or the core receives the Adjacency signal, that element does not propagate the signal but instead compares its minimum local timestamp with the timestamp value received on the Data bus. The minimum value of the pair becomes the minimum value at the node closest to the core, while the outer pair node is eliminated.
7.3.6 Cross-Point Matrix
The simulator Communications Structure of Figure 7.2 is a cross-point matrix
network laid out in a star topology. Levendel’s [98] cross-point matrix and Splash’s [8]
crossbar switch employ an interconnect to accelerate processing
element communications thereby avoiding a communications bottleneck. In the case of
Splash [8], the requirement of a crossbar switch was learned only after creating the initial
prototype without the matrix. Each of the 8 quadrants depicted in the 3-dimensional net-
work layout of Figure 7.19, contains 2-dimensional arrays of processing elements. Each
of these 2-dimensional arrays is associated with a cross-point switch used to allow pro-
cessing element communications. The cross-point network, although serial, allows more
direct connections between the processing elements than the parallel reduction bus. For
quadrant cubes with 10 processing elements on an edge, each 2-dimensional processing
element sub-array contains 100 processing elements. Using a 300-pin cross-point matrix,
approximately one third of the lines will connect directly to the processing elements of
the 2-dimensional array. The other two thirds of the lines of the cross-point matrix are
used to connect the 2-dimensional array to the rest of the 3-dimensional network. There
is a cross-point switch at the network core. Adjacent processing elements also connect
directly to each other. The cross-point control network is illustrated in Figure 7.23.
The time required for communications using the cross-point matrix network can
be analyzed by dividing the simulator processing time into the time spent processing
vehicles, tprocess, and the communications time, tcomm. Therefore each simulation
cycle is composed of tprocess + tcomm as illustrated in Figure 7.24.
Fig. 7.23. Cross-point Switch Architecture. A cross-point hierarchical network is illustrated with one switch shown in detail. Arrays of processing elements connect to their respective switch by a high-level data link control (HDLC) line which is used to send framed connection control data to the cross-point switch controller. If a virtual circuit is available to the requested destination, the circuit is assigned to the processing element. The cross-point switch virtual circuit provides a direct serial connection for a data line and for its related clock line. Cross-point switches in the same simulator quadrant directly connect to each other. The cross-point switches are hierarchically configured, allowing a virtual circuit to connect to processing elements attached to cross-point switches in other simulator quadrants. Large cross-point matrices are used to provide as close to a fully connected network as possible, where the network is laid out in a star topology. CPUs are used to monitor and initialize the switch configurations [17].
Fig. 7.24. Processing and Communications Time. Each simulation time cycle is divided into tprocess and tcomm subcomponents. The processing element cycles through the Scheduler Main and Entry queues, updating the position, velocity, and acceleration information of each vehicle data structure during tprocess. Vehicles which must be transferred to the next node are moved during tcomm. User-directed system interrupts and system synchronization occur during this latter phase.
Looking first at the time required to process events in each processing element
using the traffic scheduler as our model, let tprocess be defined in Equation 7.3.
tprocess = ((Qavg − 1) + numstages)(tcycle) (7.3)
The limiting motion function of Table 7.5 has a clock rate of 7.54 MHz or 133
ns/cycle. Using this clock rate for tcycle and the 6 stage pipeline implementation of
Figure 7.15, numstages = 6. Let Qavg, the expected vehicle queue size, be 25 vehicles.
The resulting tprocess = 4 µs to process the vehicles moving on the road. Next, the
value of tcomm, the communications delay through the cross-point matrix, is calculated.
The simulator was implemented with 0.5 second time resolution, so let esend = 2 be the
number of events, or vehicles, which have finished processing at the current processing
element and need to be transferred to the next during one simulation cycle. In the traffic simulation,
these events represent vehicles which have come to the end of a road and are now entering
the intersection.
From a table in [38], the propagation signal delay can be estimated as tsd = 0.05
ns/cm. The worst case communications scenario involves passage through 3 cross-point
switches. The first half of the route is illustrated in Figure 7.19, the second half of the
route travels outwards from the core to a different cube corner. The first cross-point
switch in the worst case scenario is connected to the sending processing element’s array.
This first switch is located at approximately the end of the second array displayed in
Figure 7.19. The second switch resides at the network core, where the sphere is located in
Figure 7.19, and the third connects the receiving array to the network. Each processing
element is a 10 cm cube. The worst case distance across a network composed of eight
1000-PE quadrants is 600 cm. Through that distance, the propagation delay is 30 ns. The
vehicle data messages are relatively long as compared to the gate value results of [98].
So let tdm = 300 ns as a conservative estimate.
To communicate across the cross-point matrix, a point-to-point channel is nego-
tiated between the two processing elements. First, channel request & grant time, tcrg,
is required to establish the circuit. The time required to send the message is the delay
in message transmission time, tdm. Once the process is complete, channel release time,
tcr, is used to free the circuit. Finally, if the circuit is unavailable, a penalty of time
wasted in processing a blocked request, trb, is incurred. For the calculation of tcomm,
let j denote the number of events which encounter a busy channel. Assume, on average
that the messages transmitted travel half of the worst case distance, or through (1/2)(3) = 1.5
switching matrix hops. The formula for the transmission time is as follows:
tcomm = 1.5[tcrg + tdm + tcr]esend + trb(1 − (1 − j)^1.5) (7.4)
Equation 7.4 assumes the average communications require 1.5 network hops which
can result in 1.5 possible call blocks. To compute tcomm, the parameter values tcrg = 50
ns, tdm = 300 ns, tcr = 50 ns, esend = 2 vehicles, trb = 50 ns, and j = 10%, which
are based on the values from [98], are used to compute the communications delay. Using
these values, tcomm computes to 1.2 µs. For this example, although tcomm is smaller
than tprocess, the values are close enough to indicate that the implementation of the
communications system is an important consideration in the machine design.
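The two delay expressions can be checked numerically. The sketch below, with function names chosen for illustration, uses the parameter values quoted in the text and reads the blocking term of Equation 7.4 as trb(1 − (1 − j)^1.5):

```python
def t_process(q_avg, num_stages, t_cycle_ns):
    """Equation 7.3: pipeline fill time plus one vehicle per cycle."""
    return ((q_avg - 1) + num_stages) * t_cycle_ns

def t_comm(t_crg, t_dm, t_cr, e_send, t_rb, j, hops=1.5):
    """Equation 7.4: circuit request/grant, transmission, and release
    over an average of 1.5 hops, plus the expected blocking penalty."""
    return hops * (t_crg + t_dm + t_cr) * e_send + t_rb * (1 - (1 - j) ** hops)

proc = t_process(q_avg=25, num_stages=6, t_cycle_ns=133)  # 3990 ns, about 4 us
comm = t_comm(50, 300, 50, e_send=2, t_rb=50, j=0.10)     # about 1207 ns, 1.2 us
```

With these inputs the model reproduces the figures in the text: roughly 4 µs of processing against roughly 1.2 µs of communications per simulation cycle.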
7.3.7 Network Results
Networks of processing elements deployed in three dimensional arrays and con-
nected by the parallel bus architecture of Section 7.3.2 were simulated for both time and
event-driven mechanisms. The event-driven synchronization time was computed assum-
ing that the worst case signal propagation delay was required for all steps. The signal
propagation delay along the parallel bus is composed of the time required to propagate
a signal through each reduction gate in the array as depicted in Figure 7.20. The gates
are assumed to be high speed Emitter-Coupled Logic (ECL) or Source-Coupled FET
Logic (SCFL) 10 ns delay gates [109]. Processing elements are deployed along linear
busses whose lengths are determined by the number of processors in the simulation. The
processing element connections to the bus are spaced 10 cm apart. Propagation delay
along the bus is assumed to be 5 ns/m [38] excluding the time required to pass through
the OR gate drivers. The peripheral buses of processing elements are connected to the
middle level of linear busses. Gates at the end of the buses bridge onto the next bus
layer. There are three interconnected bus layers from the network edge to its core. These
three layers are represented by the arrows illustrated in Figure 7.19.
The simulation mode determines the time required to locate the next smallest
timestamp in the network. For the time-driven simulation mode, the time required is
composed of two components. The first component tallies the simulation clock cycle
signal as it is passed through each repeater gate illustrated in Figure 7.20. The second
component is the propagation time along the wire runs between each gate. The signal
must pass through the entire network from its core to its edge elements. The expected
time to the smallest arrival event was computed using a network of Independent, Iden-
tically Distributed (IID) sources following the patterns set by Equations 5.18 and 5.27.
The event-driven simulation time was computed by using the same pair of ex-
pected time components described above. One bus propagation/elimination step is al-
ways necessary. Then the power of two, 2^k, whose exponent is the logarithmic ceiling
of the expected minimum network timestamp value, is computed. Next, the number of
events, E, which can be expected to arrive before 2^k is computed. Each comparison requires about 4
ns. Finally, log2(E) was determined to calculate the number of comparisons required to
calculate the network minimum timestamp. Each communication through the network
is assumed to be from core to edge, the worst case scenario.
In Figures 7.25-7.28, two event-driven methods are illustrated. The two methods
are labeled Event-Driven Range and Event-Driven Elements and are defined in Sec-
tion 7.3.5. The Event-Driven Range curve performs a binary search by dividing the
range of possible time values isolated in the first elimination phase. Alternatively, in the
Event-Driven Element method, the algorithm takes advantage of the fact that as the
distribution means increase, the elements become more isolated at the extreme ends of
the distribution curves. The first method must step through the remaining binary range
of numbers, searching for the minimum. The second method tends to jump directly to
the correct element in O(1) as the distribution means increase.
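The contrast between the two methods can be illustrated with a simplified software analogue. Here range_search counts the broadcast rounds needed by the binary division of the timestamp range, and element_search models the single-step jump available when the first elimination isolates exactly one element; the function names and the fallback behaviour are illustrative assumptions, not the hardware algorithm itself:

```python
def range_search(timestamps, lo, hi):
    """Event-Driven Range: binary-search the interval [lo, hi) of
    possible timestamp values.  Each round broadcasts a midpoint and
    keeps the half of the range that still contains the minimum."""
    rounds = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if any(t < mid for t in timestamps):
            hi = mid            # some element lies below the midpoint
        else:
            lo = mid            # the minimum lies at or above it
        rounds += 1
    return lo, rounds           # lo is the minimum timestamp

def element_search(timestamps, threshold):
    """Event-Driven Elements: after the first elimination, if exactly
    one element survives below the threshold, it is the minimum and the
    search finishes in a single step."""
    survivors = [t for t in timestamps if t < threshold]
    if len(survivors) == 1:
        return survivors[0], 1  # O(1) jump straight to the element
    return min(survivors or timestamps), None  # further rounds needed
```

As the distribution means grow, the surviving element becomes increasingly isolated and element_search returns in one step, while range_search still pays a number of rounds logarithmic in the width of the timestamp range.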
Figure 7.25 reveals the time required by the network minimum timestamp search
algorithm as the number of processing elements and the event generator Exponential
means vary. The event generators used in the distributions are IID. Figure 7.26 shows a
slice of Figure 7.25 at the 1000 processing element mark. The graphs indicate a clear gain
which can be harvested if the time and event-driven methods are used in conjunction.
The second method in the second phase of the event-driven algorithm clearly yields
significant gains for exponential distributions with large means. A scenario which would
benefit from this algorithm might be one where the simulation has distribution means
in the millisecond range but requires nanosecond resolution.
Figure 7.27 illustrates the results of a simulation using IID Weibull distributed
sources. The plot varies the number of processing elements and the arrival rates of
those elements. Time-driven simulation works well with smaller mean arrival rates, and
the second proposed event-driven method works best with higher distribution means.
Fig. 7.25. Exponential Distribution in Event vs Time-Driven Simulation. The graph illustrates a network of Independent and Identically Distributed (IID) nodes in network sizes ranging from 125 to 8000 nodes. Each node is generating arrival events according to an Exponential distribution with mean arrival times ranging from 1 to 2^14. The graph illustrates that for an exponential arrival rate, the mean arrival time offers the most significant impact to network synchronization. The time required for the event-driven model is computed by counting the longest signal run from the edge to the center of the network multiplied by the propagation delay per unit length. 10 ns are added for each OR gate (see Figure 7.20) encountered in traversing the path to the network core. The time-driven delay through the network is just the duration of the number of time steps required.
Fig. 7.26. Exponential Distribution Slice of Figure 7.25. Illustrates a slice taken from Figure 7.25 where the simulation contains 1000 processing elements. The first method from Section 7.3.4 is labeled as the Event-Driven Range, and the second method from the same section is labeled Event-Driven Elements. The graph illustrates that a time-driven approach used in conjunction with the Event-Driven Element method provides the fastest network search approach.
Figure 7.28 clearly illustrates the greater potential range of benefit to be gained by a
machine which can proceed using either the time or event-driven approaches.
Another issue which requires consideration is that when the mean time between
statistical events exceeds a certain limit, the number of simulation events affected as a
causal by-product of each isolated event decreases and becomes limited to a locality
around that event; such a simulation may actually lend itself better to a purely software
implementation. The software would be able to jump from affected area to affected area
of the simulation network, processing only the individual simulation nodes which require
attention under the isolated circumstances. So simulations with
sparse events may run more efficiently using a software approach, where the software
keeps the event list in a heap, pulls off the next smallest timestamped event and moves
to process it. However, for traffic simulations and simulations with continuous activity
spread over a network with events arriving rapidly and simultaneously across multiple
nodes of the network, a time-driven approach is clearly beneficial.
Fig. 7.27. Weibull Distribution in Event vs Time-Driven Simulation. The Weibull distribution results are similar to the Exponential. The optimum crossover point from the time-driven to the event-driven method allows a wider speedup gain to be derived.
Fig. 7.28. Weibull Distribution Slice of Figure 7.27. The graph is a slice of Figure 7.27 at the 1000 processing element mark. The relative simulation search times are displayed. The optimal solution for this range of means would be a simulator which could select between the Time and Event-Driven Element approaches.
Chapter 8
Optimistic Synchronization
In contrast to conservative simulation, which avoids causality violations altogether,
optimistic approaches allow errors to occur but are able to detect and recover from
violations. Optimistic simulation offers two important advantages over its conservative
counterpart. First, greater degrees of parallelism can be exploited. For instance, if two
events might affect each other, but the computations are such that they actually don’t,
optimistic mechanisms can process the events concurrently, while conservative methods
must sequentialize execution [57]. Second, optimistic simulation methods need not rely
on application specific information (e.g. the proximity to the next object) in order to de-
termine which events are safe to process. Conservative approaches tend to be dependent
on application specific data for correctness. The synchronization method can therefore
be more transparent to the application program in optimistic simulation. The downside
is that optimistic simulation may require more overhead computations and storage than
conservative approaches, causing performance penalties instead of the intended benefits.
For the proposed system, one optimistic modification which might prove beneficial
involves overlapping the local element processing and communications time periods. If
each simulation cycle is divided into two sub-segments, a processing phase followed by a
communications phase as described in Section 7.3.6, some optimistic processing can be
inserted by the overlap of these two phases. The idea is to allow processing elements to
begin processing their next cycle of vehicle queue data concurrently with the previous
communications phase. If the majority of simulation cycles do not transmit data between
the processing elements, some speedup may result. The entire simulation state can
initially be saved as a checkpoint. For traffic, this approach seems intuitive,
as the newly transferred vehicles will probably enter the vehicle queues towards the
beginning of the road. Processing vehicles near the front of the queues (or the end of
the road) may not engender any causality errors. As a further modification, processing
elements can begin their early vehicle computation based on the size of their main queues
or based on the simulation positions of the contents of those queues. So, for example,
thresholds can be set so that if the main queue is full, or if the vehicles in simulation are
beyond a certain entrance distance from the front of the queue, then optimistic processing
can begin at that individual node. Problems with the optimistic approach include the
amount of memory required to store off the checkpoint information. Additional circuitry
is required to detect and handle the causality error conditions.
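A minimal software analogue of this overlap, using a toy vehicle-position queue and an illustrative entrance-distance threshold (neither taken from the hardware design), might look like:

```python
import copy

class Node:
    """Toy processing element: a queue of vehicle positions, processed
    optimistically while the previous communications phase completes.
    Names and the update rule are illustrative stand-ins."""
    def __init__(self, queue, entry_threshold):
        self.queue = list(queue)            # positions along the road
        self.entry_threshold = entry_threshold
        self.checkpoint = None

    def begin_optimistic_cycle(self):
        # Save state so a causality violation can be rolled back.
        self.checkpoint = copy.deepcopy(self.queue)
        # Advance every vehicle one position (stand-in for the real update).
        self.queue = [p + 1 for p in self.queue]

    def receive(self, position):
        """A vehicle transferred in during the overlapped comm phase."""
        if position > self.entry_threshold:
            # The arrival lands among already-processed vehicles: a
            # causality error -- roll back and redo the cycle with it.
            self.queue = [p + 1 for p in self.checkpoint + [position]]
        else:
            # Safe: the arrival enters behind the processed work.
            self.queue.append(position)
```

If most cycles transfer no vehicles, or the arrivals stay near the entrance of the road, the checkpoint is never consulted and the processing phase effectively overlaps the communications phase for free.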
Chapter 9
Results
Discrete event simulation acceleration is requested, needed, and feasible. Exam-
ples of articles in the press [127] and in scientific journals [118] explicitly describe the
requirement for access to accelerated means of simulation. Experimentation shows that
by applying various architectural techniques, discrete event simulation can be success-
fully accelerated. Results can be found at the ends of each thesis chapter, and are also
summarized in this section.
In Section 4.2.3, using the representative software simulation model, CORSIM,
typical bottleneck areas of simulation processing are identified. These bottleneck areas,
shown in Figures 4.2 and 4.3, involve the scheduler and overhead routines. Both sets
of routines must be minimized. The presented architecture, by converting software to
hardware, accelerates the Scheduler routines in Section 7.2.3. The overhead routines from
CORSIM, which involve a significant number of data integrity checks in the simulation,
can be eliminated and moved into the simulation model input phase. Because CORSIM
was neither modular nor current, Trafix, a software simulator written in C++ using
object-oriented methods, is used to verify the correctness of the car-following algorithms.
Once the car-following algorithms were verified in software, they were translated into
the hardware implementations of Section 7.2.3. The Trafix Scheduler routine timing
results are found in Table 4.6. The software bottleneck is identified as the intersection
movement routine which requires 48.4 µs to process each vehicle.
Although not specified as an initial goal in the comprehensive exam paper, the
lack of an open source road traffic simulator necessitated the development of Trafix. The
current purpose of the Trafix simulator is to test and verify the road movement software
routines developed for Scheduler modeling. Trafix is a GNU-licensed, open source, free,
modular traffic simulator which is further described in Section 4.3. During the Trafix
system development, the lack of a C++ shared memory allocator was also unexpectedly
noted for GNU-Linux systems. Therefore, an open source, free GNU licensed allocator
class was developed for use with the C++ Standard Template Library, STLPORT. The
allocator is described in Section 4.3.1.
Chapter 5 performs analysis which can be used to determine whether a simulation
will proceed faster in time or event-driven mode. Equation 5.7 can be used to determine
the expected time of the next event. Knowing that interval time, the mode of operation
which most rapidly advances the simulation can be determined. Specific results for both
Independent and Identically Distributed (IID) Exponential and Weibull distributions are
provided in Sections 5.1.3 and 5.1.4. Section 5.2 explores the geometric requirements
for wrapping a traffic map onto the simulator. When a traffic map is divided into n × m
sections, the distance between the discontinuities of traffic map sections laid out on the
simulator processing element arrays is n, where n ≤ m. A more detailed explanation of
the topology concerns can be found in Section 5.2.
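As one illustration of the mode decision described above: for n IID exponential sources, the minimum of the next interarrival times is itself exponentially distributed with mean mean/n (a standard result). The cost model below, a fixed cost per simulated tick versus a fixed search cost per event, is an illustrative assumption rather than the thesis's exact analysis:

```python
def expected_next_event_exponential(mean, n_sources):
    """Expected time to the earliest next arrival among n IID
    exponential sources, each with the given mean interarrival time."""
    return mean / n_sources

def faster_mode(mean, n_sources, tick_cost, event_search_cost):
    """Pick the mode that advances simulated time more cheaply:
    time-driven pays tick_cost per unit of simulated time, event-driven
    pays a fixed search cost per event (illustrative cost model)."""
    gap = expected_next_event_exponential(mean, n_sources)
    return "event-driven" if event_search_cost < gap * tick_cost else "time-driven"
```

Large means with many idle ticks favor the event-driven search, while dense, rapidly arriving events favor stepping time directly, matching the crossover seen in Figures 7.25 through 7.28.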
The interior design of the processing element architecture is divided into the Event
Generation, Event Queue, and Scheduler components. Each subsection of the processing
element architecture is individually explored. The Event Generator design is presented in
Section 7.2.1. The results of Section 7.2.1.1 yield an implementation which can generate
an event every 200 ns. Events are produced rapidly enough that this section of the design
is not a bottleneck to throughput. An Event Queue was implemented as a linear array
in Section 7.2.2.3. The Event Queue handles 16-bit words which can be used to point
to the address of data interleaved in memory, or can be expanded to larger sized words
by working queues in parallel. The Event Queue implementation is capable of operating
with an 80-nanosecond cycle time, both pushing and popping elements in each
cycle. The logic implementation required for both the Event Generator and the Service
Queue can be found in Table 7.1. Again, the speed of the event queue results show that
the event queue is not a throughput bottleneck. The event Scheduler results for the
traffic simulator model can be found in Section 7.2.3.5. The Scheduler section modeled
the scheduling algorithm in 5 components. The first component initialized a vehicle data
object entering the network with its source and randomly selected destination. The next
four components consisted of a pair to handle vehicles on a road and a pair to handle
vehicles traveling through an intersection. Both the road and intersection initialization
implementations set attributes before injecting vehicles either onto a road or into an
intersection.
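The push-and-pop-per-cycle behaviour of the linear-array Event Queue can be modelled in software. The sketch below is a behavioural stand-in (class and method names assumed), not the 16-bit FPGA structure itself:

```python
class LinearArrayQueue:
    """Behavioural model of the linear-array Event Queue: a sorted
    chain of cells that can accept a pushed key and emit the smallest
    key within the same cycle."""

    def __init__(self):
        self.cells = []          # kept sorted, smallest key first

    def cycle(self, push=None, pop=False):
        """One queue cycle.  In hardware each cell would compare with
        its neighbour and shift; here that is modelled as an ordered
        insert followed by an optional pop of the minimum."""
        if push is not None:
            i = 0
            while i < len(self.cells) and self.cells[i] <= push:
                i += 1
            self.cells.insert(i, push)
        if pop and self.cells:
            return self.cells.pop(0)
        return None
```

Because every cell shifts in parallel in the hardware version, the insert that this model performs by scanning completes in a single cycle, which is what allows both a push and a pop against the 80 ns cycle time.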
Section 7.2.3.5 found the speedup of the software bottleneck in accelerated hard-
ware to be a factor of 91. As expected, the Scheduler is the bottleneck component in
the processing element design. The individual component speedup values are illustrated
in Figure 9.1. The lowest speedup derived in the system is 91, which coincides with the
simulation task which requires the most time, so the overall system speedup would be
approximately that value. The processing element design illustrated in Figure 7.17 would
require approximately 30-34 FPGAs. Six FPGAs are required for the Event Generator
and the Arrival and Service Queues. Each processing element Scheduler sub-component
used in calculating vehicle movement requires 3 FPGAs, yielding a total of 24 FPGAs
for the eight sub-components of the Scheduler illustrated in Figure 7.17. Additional FP-
GAs are reserved for Channel Control. This number of FPGAs is similar to the original
Splash board designs which required 32 FPGAs per Splash unit [64]. Each processing
element is capable of modeling one source, intersection, or destination node with the
associated outbound roads of traffic. A system composed of 8000 processing elements
can therefore simulate a large traffic network.
Although very implementation-dependent, the Scheduler can be further accelerated
by speeding up division in reconfigurable logic, which currently causes the largest
pipeline stage delays. Although Splash originally intended to
pair floating-point chips along with each FPGA [64], the chip I/O pin constraints in this
existing design are already significant. Altera is planning to introduce a new FPGA
containing a StrongARM processor core. However, the ARM integer core does not contain
support for floating-point data types [59]. Perhaps if the reconfigurable logic had some
functional units embedded within the chip, faster designs would be implementable. An
approach similar to the Digital Signal Processors which often include special functional
units may be worthwhile.
One interesting facet gleaned from the FPGA research is that FPGA implementation
methods and user designs directly impact the resulting design clock speeds. It is
Fig. 9.1. Speedup Results by Section. The speedup results obtained from the simulation section-by-section analysis are illustrated. A speedup of 150 is obtained when comparing event generation software to its reconfigurable hardware counterpart. Similarly, a minimum speedup of 125 was determined for the Event Queue and a speedup of 91 for the Scheduler. So the overall speedup determined for the system would be approximately 91, the minimum of the sub-components. This result compares reasonably with the speedup of 100 reported for deterministic logic simulators by Bauer [14].
difficult for the hardware compilers to fully optimize designs. For instance, two methods
of allowing a 16-bit D flip-flop register to hold its value can be compared. One method
routes the output back through an input multiplexor, so that the output is re-inserted
on the next clock edge. The second method simply disables the register bits, which then
retain their value. The first method requires 16 lines to be routed efficiently within the
FPGA. The second requires one signal to be chained to each element of the register bits.
The second method was much more efficient, producing better timing results. At this
time, the compiler does not seem capable of detecting and effecting the faster design
automatically.
The proposed architecture is an innovative and unique contribution to the field
of non-deterministic parallel discrete event simulation architecture. Accompanied by
the literature survey, the mathematical analysis, the FPGA research, and the software
studies, the architecture presented in this thesis represents a comprehensive coverage of
the problem. The work clearly shows that a well designed parallel discrete event simulator
can provide much needed results rapidly. Specifically, the simulator can provide timely
results which are required by road traffic management personnel handling a network
under stress.
References
[1] Miron Abramovici, Ytzhak H. Levendel, and Premachandran R. Menon. A logical
simulation machine. IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems, CAD-2(2):82–94, April 1983.
[2] Miron Abramovici and Prem Menon. Fault simulation on reconfigurable logic.
IEEE Symposium on FPGAs for Custom Computing Machines, pages 182–190,
April 1997.
[3] P. Agrawal, W.J. Dally, W.C. Fischer, et al. Mars: A multiprocessor-based
programmable accelerator. IEEE Design & Test of Computers, pages 28–35, Oc-
tober 1987.
[4] Altera Corporation. Altera Data Book, 1996.
[5] Altera Corporation. Altera Data Book, 1998.
[6] American Association of State Highway and Transportation Officials, 444 North
Capitol Street, NW, Washington, DC 20001. A Policy on Geometric Design of High-
ways and Streets, 1995. ISBN: 1-56051-068-4.
[7] Jeffrey M. Arnold. The splash 2 software environment. The Journal of Supercom-
puting, 9:277–290, 1995.
[8] Jeffrey M. Arnold, Duncan A. Buell, and Elaine G. Davis. Splash 2. SPAA ’92. 4th
Annual ACM Symposium on Parallel Algorithms and Architectures, pages 316–22,
June,July 1992.
[9] William Aspray and Arthur Burks, editors. Papers of John Von Neumann on
Computing and Computer Theory, volume 12 of The Charles Babbage Institute
Reprint Series of the History of Computing. The MIT Press and Tomash Publish-
ers, Cambridge, MA, 1987. QA76.5.P3145 1987.
[10] Prithviraj Banerjee. Parallel Algorithms for VLSI Computer-Aided Design. PTR
Prentice Hall, Englewood Cliffs, NJ 07632, 1994.
[11] Jerry Banks, John S. Carson II, and Barry L. Nelson. Discrete-Event System
Simulation. International Series in Industrial and Systems Engineering. Prentice
Hall, Upper Saddle River, New Jersey 07458, second edition, 1996.
[12] Robert J. Baron and Lee Higbie. Computer Architecture. Addison Wesley in
Electrical and Computer Engineering. Addison Wesley, 1 edition, 1992.
[13] R. Barto and S. A. Szygenda. A computer architecture for digital logic simulation.
Electronic Engineering, 52(642):35–66, September 1985.
[14] Jerry Bauer, Michael Bershteyn, Ian Kaplan, and Paul Vyedin. A reconfigurable
logic machine for fast event-driven simulation. In Proceedings of the 1998 35th
Design Automation Conference, pages 668–671. IEEE, 1998.
[15] C. Beaumont, P. Boronat, and J. Champeau et al. Reconfigurable technology: An
innovative solution for parallel discrete event simulation support. In 8th Work-
shop on Parallel and Distributed Simulation (PADS ’94). Proceedings of the 1994
Workshop on Parallel and Distributed Simulation, pages 160–163, Edinburgh, UK,
July 1994. IEEE, SCS, San Diego, CA, USA.
[16] Christophe Beaumont, J. Champeau, J.-M. Filloque, and B. Pottier. On fpgas as
a new hardware support for parallel discrete event simulation. moscou94.ps from
ftp://ubolib.univ-brest.fr/pub/reports/, October 1994.
[17] John Bergen. Personal communication, February 2001. ICube, Inc.
[18] Dimitri Bertsekas and Robert Gallager. Data Networks. Prentice Hall, Inc., En-
glewood Cliffs, New Jersey 07632, second edition, 1992.
[19] William H. Beyer, editor. CRC Standard Mathematical Tables. CRC Press, Inc.,
Boca Raton, FL, 25 edition, 1978.
[20] Tom Blank. A survey of hardware accelerators used in computer-aided design.
IEEE Design & Test, pages 21–39, August 1984.
[21] James Brink and Richard Spillman. Computer Architecture and VAX Assem-
bly Language Programming. The Benjamin/Cummings Publishing Company, Inc.,
Menlo Park, CA, 1987.
[22] Stephen D. Brown, Robert Francis, Jonathan Rose, and Zvonko Vranesic. Field-
programmable gate arrays. The Kluwer International Series in Engineering and
Computer Science. Kluwer Academic Publishers, 1992.
[23] Marc Bumble and Lee Coraor. Introducing parallelism to event-driven simulation.
In Proceedings of the IASTED International Conference–Applied Simulation and
Modelling, ASM ’97, Banff, Canada, July 27-August 1, 1997. The International
Association of Science and Technology for Development, August 1997.
[24] Marc Bumble and Lee Coraor. Architecture for a non-deterministic simulation
machine. In 1998 Winter Simulation Conference Proceedings, volume 2, pages
1599–1606, December 1998.
[25] Marc Bumble and Lee Coraor. Implementing parallelism in random discrete event-
driven simulation. In Lecture Notes in Computer Science 1388, Parallel and Dis-
tributed Processing, pages 418–427. IEEE Computer Society, Springer, March 1998.
[26] Marc Bumble and Lee Coraor. A global synchronization network for a non-
deterministic simulation architecture. In 1999 Winter Simulation Conference Pro-
ceedings, December 1999.
[27] Marc Bumble, Lee Coraor, and Lily Elefteriadou. Exploring CORSIM runtime
characteristics: Profiling a traffic simulator. 33rd Annual Simulation Symposium
2000 (SS 2000), pages 139–146, April 2000.
[28] Ted Burggraff, Al Love, Richard Malm, and Ann Rudy. The IBM Los Gatos logic
simulation machine hardware. IEEE International Conference on Computer Design:
VLSI in Computers, pages 584–587, 1983.
[29] Calvin A. Buzzell and Michael J. Robb. Modular VME rollback hardware for
time warp. Simulation Series, 22(1):153–156, January 1990. Monthly Publication
Number: 0114438.
[30] Pak K. Chan and Samiha Mourad. Digital Design Using Field Programmable Gate
Arrays. PTR Prentice Hall, Englewood Cliffs, New Jersey 07632, first edition,
1994.
[31] K. M. Chandy and J. Misra. Asynchronous distributed simulation via a sequence of
parallel computations. Communications of the ACM, 24(4):198–206, April 1981.
[32] C. S. Chang, S. L. Ho, T. T. Chan, and K. K. Lee. Fast AC train emergency
rescheduling using an event-driven approach. IEE Proceedings-B, 144(4):281–288, July 1993.
[33] Gang-Len Chang, Jifeng Wu, and Henry Lieu. Real-time incident responsive system
for corridor control: modeling framework and preliminary results. In Transportation
Research Record, volume 1452, pages 42–51, December 1994.
[34] Barbara A. Chappell, Terry I. Chappell, Stanley E. Schuster, Herman M. Segmuller,
James W. Allan, Robert L. Franch, and Phillip J. Restle. Fast CMOS ECL
receivers with 100-mV worst-case sensitivity. IEEE Journal of Solid-State Circuits,
23(1):59–67, February 1988.
255
[35] A. T. Chronopoulos. Traffic flow simulation through high order traffic modeling.
Mathematical Computing Modeling, 17(8):11–22, 1993.
[36] Anthony Theodore Chronopoulos and Charles Michael Johnson. A real-time traffic
simulation system. IEEE Transactions on Vehicular Technology, 47(1):321–331,
February 1998.
[37] Jim Clark and Gene Daigle. The importance of simulation techniques in ITS
research and analysis. In S. Andradottir, K. J. Healy, D. H. Withers, and B. L.
Nelson, editors, Winter Simulation Conference Proceedings, pages 1236–1243,
Piscataway, NJ, USA, 1997. IEEE.
[38] Alan Clements. Microprocessor Systems Design: 68000 Hardware, Software, and
Interface. PWS Publishing Company, Boston, MA, third edition, 1997.
[39] John Craig Comfort. The simulation of a master-slave event set processor. Simu-
lation, pages 117–124, March 1984.
[40] Gareth Cook. Scientists dissect dynamics of panic. The Boston Globe, page A24,
September 28, 2000.
[41] Gene Daigle, Michelle Thomas, and Meenakshy Vasudevan. Field applications of
CORSIM: I-40 freeway design evaluation. In D. J. Medeiros, E. F. Watson, J. S.
Carson, and M. S. Manivannan, editors, Winter Simulation Conference Proceedings,
volume 2, pages 1161–1167, Piscataway, NJ, USA, 1998. IEEE.
256
[42] Frederica Darema and Gregory F. Pfister. Multipurpose parallelism for VLSI CAD
on the RP3. IEEE Design & Test of Computers, pages 19–27, October 1987.
[43] Samir R. Das and Richard M. Fujimoto. An empirical evaluation of performance-
memory trade-offs in time warp. IEEE Transactions on Parallel and Distributed
Systems, 8(2):210–224, February 1997.
[44] Carolyn K. Davis, Sallie V. Sheppard, and William M. Lively. Automatic development
of parallel simulation models in Ada. Proceedings of the 1988 Winter
Simulation Conference, pages 339–343, 1988.
[45] Andre DeHon. DPGA utilization and application. Proceedings of the 1996 Inter-
national Symposium on Field Programmable Gate Arrays, February 1996.
[46] Andre DeHon. Dynamically programmable gate arrays: A step toward increased
computational density. Proceedings of the Fourth Canadian Workshop on Field-
Programmable Devices, pages 47–54, May 1996.
[47] Andre DeHon. Reconfigurable architectures for general-purpose computing. A.I.
Technical Report 1586, Massachusetts Institute of Technology, Artificial Intelligence
Laboratory, Cambridge, MA, October 1996.
[48] Jay L. Devore. Probability and Statistics for Engineering and the Sciences. Duxbury
Press, fourth edition, 1995.
[49] Philippe Dhaussy, Jean-Marie Filloque, Bernard Pottier, and Stephane Rubini.
Global control synthesis for an MIMD/FPGA machine. In Proceedings of the IEEE
Workshop on FPGAs for Custom Computing Machines, pages 72–81, Los
Alamitos, CA, USA, April 1994. IEEE Computer Society.
[50] J. Presper Eckert. Thoughts on the history of computing. Computer, pages 58–65,
December 1976.
[51] Bradly K. Fawcett. Taking advantage of reconfigurable logic. Seventh Annual IEEE
International ASIC Conference and Exhibit, pages 227–230, September 1994.
[52] Robert E. Felderman and Leonard Kleinrock. An upper bound on the improvement
of asynchronous versus synchronous distributed processing. Simulation Series
Proceedings of the SCS Multiconference on Distributed Simulation, 22(1):131–136,
January 1990.
[53] Peter Fishburn and Paul Wright. Bandwidth edge counts for linear arrangements
of rectangular grids. Journal of Graph Theory, 26(4):195–202, 1997.
[54] Richard M. Fujimoto. Performance measurements of distributed simulation strategies.
Transactions of the Society for Computer Simulation, 6(2):89–132, April 1989.
[55] Richard M. Fujimoto. Parallel discrete event simulation. Communications of
the ACM, 33(10):30–53, October 1990.
[56] Richard M. Fujimoto. Parallel and distributed simulation. Proceedings of the 1995
Winter Simulation Conference, pages 118–125, 1995.
[57] Richard M. Fujimoto. Parallel and distributed simulation. Proceedings of the 1999
Winter Simulation Conference, pages 122–131, 1999.
[58] Richard M. Fujimoto, Jya-Jang Tsai, and Ganesh C. Gopalakrishnan. Design and
evaluation of the rollback chip: Special purpose hardware for time warp. IEEE
Transactions on Computers, 41(1):68–82, January 1992.
[59] Steve Furber. ARM System Architecture. Addison-Wesley, Essex, England, 1st
edition, 1996.
[60] Nicolas J. Garber and Lester A. Hoel. Traffic and Highway Engineering. PWS
Publishing Company, 2nd edition, 1997. ISBN 0-534-95338-7.
[61] Demos C. Gazis, Robert Herman, and Renfrey B. Potts. Car-following theory of
steady-state traffic flow. Operations Research, 7(4):499–505, 1959.
[62] Mohammed S. Ghausi. Electronic Devices and Circuits: Discrete and Integrated.
HRW Series in Electrical and Computer Engineering. Holt, Rinehart and Winston,
1985.
[63] Loys Gindraux and Gary Catlin. CAE station's simulators tackle 1 million gates.
Electronic Design, pages 127–136, November 10, 1983.
[64] Maya Gokhale, William Holmes, Andrew Kopser, Sara Lucas, Ronald Minnich,
Douglas Sweely, and Daniel Lopresti. Building and using a highly parallel pro-
grammable logic array. Computer, 24(1):81–89, January 1991.
[65] Jim Gray. International parallel processing symposium keynote address, April
1998.
[66] Harold Greenberg. An analysis of traffic flow. Operations Research, 7:79–85, 1959.
[67] B. D. Greenshields. A study in highway capacity. Highway Research Board
Proceedings, 14:468, 1934.
[68] Leo J. Guibas and Frank M. Liang. Systolic stacks, queues, and counters. In 1982
Conference on Advanced Research in VLSI, M.I.T., pages 155–164, January 1982.
[69] J.D. Hadley and B.L. Hutchings. Design methodologies for partially reconfigured
systems. IEEE Symposium on FPGAs for Custom Computing Machines, Proceed-
ings 1995, pages 78–84, 1995.
[70] Reiner W. Hartenstein, Jurgen Becker, Rainer Kress, and Helmut Reinig. High-
performance computing using a reconfigurable accelerator. Concurrency: Practice
and Experience, 8(6):429–443, July-August 1996.
[71] John Patrick Hayes. Computer Architecture and Organization. McGraw-Hill Series
in Computer Organization and Architecture. McGraw-Hill, 2nd edition, 1988.
[72] W. R. Heller, C. George Hsi, and Wadi F. Mikhaill. Wirability - designing wiring
space for chips and chip packages. IEEE Design and Test of Computers, pages
43–51, August 1984.
[73] John L. Hennessy and David A. Patterson. Computer Architecture A Quantitative
Approach. Morgan Kaufmann Publishers, Inc., first edition, 1990.
[74] John L. Hennessy and David A. Patterson. Computer Architecture A Quantitative
Approach. Morgan Kaufmann Publishers, Inc., second edition, 1996.
[75] M.P. Henry. Keynote paper: Hardware compilation - a new technique for rapid
prototyping of digital systems - applied to sensor validation. Control Engineering
Practice, 3(7):907–924, 1995.
[76] A. Hoogland, J. Spaa, B. Selman, and A. Compagner. A special-purpose processor
for the Monte Carlo simulation for Ising spin systems. Journal of Computational
Physics, 51:250–260, 1983.
[77] R. Michael Hord. Parallel Supercomputing in MIMD Architectures. CRC Press,
Inc., Boca Raton, Florida, 1993.
[78] John K. Howard, Richard L. Malm, and Larry M. Warren. Introduction to the IBM
Los Gatos logic simulation machine. Proceedings - IEEE International Conference
on Computer Design: VLSI in Computers, pages 580–583, 1983.
[79] Kai Hwang and Faye A. Briggs. Computer Architecture and Parallel Process-
ing. McGraw-Hill Series in Computer Organization and Architecture. McGraw-Hill
Book Company, 1984.
[80] David R. Jefferson. Virtual time. ACM Transactions on Programming Languages
and Systems, 7(3):404–425, July 1985.
[81] Charles Michael Johnson and Anthony Theodore Chronopoulos. A communica-
tions latency hiding parallelization of a traffic flow simulation. In 13th International
Parallel Processing Symposium & 10th Symposium on Parallel and Distributed Pro-
cessing, pages 688–695. IEEE, April 1999.
[82] Adolf D. May, Jr. and Hartmut E. M. Keller. Non-integer car-following models.
Highway Research Record, 199:19–32, 1967.
[83] T. Junchaya and G. Chang. Exploring real-time traffic simulation with massively
parallel computing architecture. Transportation Research Part C, 1(1):57–76, 1993.
[84] Tom Kean and John Gray. Configurable hardware: Two case studies of micro-grain
computation. Journal of VLSI Signal Processing, 2:9–16, 1990.
[85] Thomas F. Knight and Alexander Krymm. A self-terminating low-voltage swing
cmos output driver. The IEEE Journal of Solid-State Circuits, 23(2):457–463,
April 1988.
[86] Donald E. Knuth. The Art of Computer Programming. Addison-Wesley, 1968.
[87] Donald E. Knuth. The Art of Computer Programming, volume 2. Addison-Wesley,
3rd edition, 1998.
[88] Jack Kohn, Richard Malm, Chuck Meiley, and Frank Nemec. The ibm los gatos
logic simulation software. IEEE International Conference on Computer Design:
VLSI in Computers, pages 588–591, 1983.
[89] Nobuhiko Koike, Kenji Ohmori, and Tohru Sasaki. HAL: A high-speed logic sim-
ulation machine. IEEE Design & Test of Computers, 2(5):61–73, October 1985.
[90] Israel Koren. Computer Arithmetic Algorithms. Prentice Hall, Englewood Cliffs,
N.J., 1993.
[91] H. T. Kung. Why systolic architectures? IEEE Computer, 15(1):37–46, January
1982.
[92] H. T. Kung. Systolic communications. International Conference on Systolic Arrays,
pages 695–703, May 1988.
[93] Bernard S. Landman and Roy L. Russo. On pin versus block relationship for
partitions of logic circuits. IEEE Transactions on Computers, c-20(12):1469–1479,
December 1971.
[94] Richard J. Larsen and Morris L. Marx. An Introduction to Mathematical Statistics
and its Applications. Prentice-Hall, Englewood Cliffs, NJ 07632, second edition,
1986.
[95] Doug Lea. Some storage management techniques for container classes, 1989.
[96] Ulana Legedza and William E. Weihl. Reducing synchronization overhead in par-
allel simulation. In 10th Workshop on Parallel and Distributed Simulation (PADS
’96). Proceedings of the 1996 Workshop on Parallel and Distributed Simulation,
pages 86–95, Philadelphia,PA, May 1996. IEEE, SCS, San Diego, CA, USA.
[97] F. Thomson Leighton. Introduction to Parallel Algorithms and Architectures: Ar-
rays, Trees, Hypercubes. Morgan Kaufmann Publishers, San Mateo, CA, 1992.
[98] Y. H. Levendel, P. R. Menon, and S. H. Patel. Special-purpose computer for
logic simulation using distributed processing. The Bell System Technical Journal,
61(10):2873–2909, December 1982.
[99] M. J. Lighthill and G. B. Whitham. On kinematic waves II: A theory of traffic flow
on long crowded roads. Proceedings of the Royal Society, A 229(1178):317–345,
1955.
[100] M. Morris Mano. Computer System Architecture. Prentice Hall, Englewood Cliffs,
NJ, 3rd edition, 1993.
[101] John W. Mauchly. Amending the ENIAC story. Datamation, 25(11):217–218,
October 1979.
[102] Adolf D. May. Traffic Flow Fundamentals. Prentice Hall, Englewood Cliffs, NJ
07632, 1990. ISBN 0-13-926072-2.
[103] John J. Metzner and B. N. Jamoussi. An easily programmable algorithm for window
flow control analysis. Proceedings of the 1992 Conference on Information Science
and Systems, pages 1041–1044, March 1992.
[104] Panos G. Michalopoulos, Ping Yi, and Anastasios S. Lyrintzis. Development of
an improved high-order continuum traffic flow model. Transportation Research
Record, 1365:125–132, 1993.
[105] Chris Miller. Comet crash: Teraflops computer simulates colossal comet impact
into ocean. Sandia National Laboratories - News Release WWW, April 1997.
[106] Sean Monaghan. A gate-level reconfigurable Monte Carlo processor. Journal of
VLSI Signal Processing, 6(2):139–153, August 1993.
[107] Sean Monaghan and P.D. Noakes. Reconfigurable special purpose hardware for
scientific computation and simulation. Computing & Control Engineering Journal,
page 225, September 1992.
[108] Motorola. MECL Device Data, 1989.
[109] Motorola. Motorola Military ALS/FAST/TTL Data, 1989. Q3/89 DL142.
[110] Jeffrey D. Myjak. A massively parallel microscopic traffic simulation model with
fuzzy logic. Master’s thesis, Massachusetts Institute of Technology, September
1993.
[111] William R. Newcott. The age of comets. National Geographic, 192(6):94–109,
December 1997.
[112] David M. Nicol. Principles of conservative parallel simulation. In J. M. Charnes,
D. J. Morrice, D. T. Brunner, and J. J. Swain, editors, Proceedings of the 1996
Winter Simulation Conference, pages 128–135, 1996.
[113] Bill Nitzberg and Samuel A. Fineberg. Parallel I/O on highly parallel systems
supercomputing ’94 – tutorial m11 notes. Technical Report NAS-94-005, NASA
Ames Research Center, Moffett Field, CA 94035-1000, November 1994.
265
[114] John V. Oldfield and Richard C. Dorf. Field Programmable Gate Arrays: Reconfigurable
Logic for Rapid Prototyping and Implementation of Digital Systems. John
Wiley & Sons, Inc., 1995.
[115] Athanasios Papoulis. Probability, Random Variables, and Stochastic Processes.
McGraw-Hill Series in Electrical Engineering. McGraw-Hill Publishing Company,
second edition, 1984.
[116] Paul F. Reynolds, Jr., Carmen M. Pancerella, and Sudhir Srinivasan. Design and
performance analysis of hardware support for parallel simulations. Journal of
Parallel and Distributed Computing, 18(4):435–453, August 1993.
[117] Paul F. Reynolds, Jr., Craig Williams, and R. R. Wagner, Jr. Isotach networks.
IEEE Transactions on Parallel and Distributed Systems, 1997.
[118] Robert B. Pearson, John L. Richardson, and Doug Toussaint. A fast processor for
Monte Carlo simulation. Journal of Computational Physics, 51:241–249, 1983.
[119] Gregory F. Pfister. The IBM Yorktown simulation engine. Proceedings of the IEEE,
74(6):850–860, June 1986.
[120] Neil S. Pickles and Martin C. Lefebvre. ECL I/O buffers for BiCMOS integrated
systems: A tutorial overview. IEEE Transactions on Education, 40(4):229–241,
November 1997.
[121] James L. Pline, editor. Traffic Engineering Handbook. Prentice Hall, Englewood
Cliffs, NJ 07632, 4th edition, 1992.
[122] Eric S. Raymond. The cathedral and the bazaar. Presented at the 1997 Linux-
Kongress, the Atlanta Linux Showcase, 1997.
[123] Daniel A. Reed, Allen D. Malony, and Bradley D. McCredie. Parallel discrete event
simulation: A shared memory approach. Proceedings of the 1987 ACM SIGMETRICS
Conference on Measurement and Modeling of Computer Systems, 15(1):36–38, May 1987.
[124] Ronni Sandroff. New jump start for hearts? Consumer Reports, 66(2):8, February
2001.
[125] Tohru Sasaki, Nobuhiko Koike, Kenji Ohmori, and Kyoji Tomita. HAL: A block
level hardware logic simulator. Proceedings - ACM IEEE 20th Design Automation
Conference, pages 150–156, 1983.
[126] Richard L. Scheaffer. Introduction to Probability and its Applications. The Duxbury
Advanced Series in Statistics and Decision Sciences. PWS-KENT Publishing Com-
pany, Boston, USA, 1990.
[127] Bruce Schecter. Putting a Darwinian spin on the diesel engine. The New York
Times, page D3, September 19, 2000.
[128] Donald L. Schilling and Charles Belove. Electronic Circuits Discrete and Integrated.
Series in Electrical Engineering. McGraw-Hill, second edition, 1979.
[129] Carla Sciullo. Department of Statistics, project ID 98-1-008. Consultation,
February 1998.
[130] Larry Soule and Tom Blank. Statistics for parallelism and abstraction level in
digital simulation. Design Automation Conference - Proceedings 1987, pages
588–591, 1987.
[131] Daniel L. Stein. Spin glasses. Scientific American, 261:52–59, July 1989.
[132] Nancy Fortgang Stern. From ENIAC to UNIVAC: A case study in the history of
technology. PhD thesis, State University of New York at Stony Brook, August
1978.
[133] Harold S. Stone, editor. Introduction to Computer Architecture. SRA computer
science series. Science Research Associates, Inc., 2nd edition, 1980.
[134] Bjarne Stroustrup. The C++ Programming Language. Addison-Wesley, 3rd edi-
tion, 1997.
[135] Shigeru Takasaki, Nobuyoshi Nomizu, Yoshihiro Hirabayashi, Hiroshi Ishikura,
Masahiro Kurashita, Nobuhiko Koike, and Toshiyuki Nakata. HAL III: Function
level hardware logic simulation system. Proceedings of the 1990 IEEE International
Conference on Computer Design: VLSI in Computers and Processors - ICCD '90,
pages 167–170, September 1990.
[136] Technical Education - Corporate Management Development, Crotonville, NY.
Mathematical Analysis, second edition, August 1988. Chapter 6 - Queueing Theory.
[137] Masahiro Tomita, Naoaki Suganuma, and Kotaro Hirano. Reconfigurable machine
and its application to logic simulation. IEICE Transactions on Fundamentals of
Electronics Communications and Computer Science, E76-A(10):1705–1712, Octo-
ber 1993.
[138] A. W. VanAusdal. Use of the boeing computer simulator for logic design con-
firmation and failure diagnostics programs. Proceedings of the Advances in the
Astronautical Sciences 17th Annual Meeting, 29:573–594, June 1971.
[139] George Varghese, Roger Chamberlain, and William E. Weihl. The pessimism be-
hind optimistic simulation. In 8th Workshop on Parallel and Distributed Simulation
(PADS ’94). Proceedings of the 1994 Workshop on Parallel and Distributed Sim-
ulation, pages 126–131, Edinburgh, UK, July 1994. IEEE, SCS, San Diego, CA,
USA.
[140] George Varghese, Roger Chamberlain, and William E. Weihl. Deriving global
virtual time algorithms from conservative simulation protocols. Information Pro-
cessing Letters, 54(2):121–126, April 1995.
[141] Jean Walrand. Communication Networks: A First Course. Aksen Associates, Inc.,
1991.
[142] Kevin Watkins. Discrete Event Simulation in C. The McGraw-Hill International
Series in Software Engineering. McGraw-Hill Book Company, 1993.
[143] C. Craig Williams and Paul F. Reynolds, Jr. Combining atomic actions. Journal
of Parallel and Distributed Computing, pages 152–163, 1995.
[144] Michael J. Wirthlin and Brad L. Hutchings. A dynamic instruction set computer.
IEEE Workshop on FPGAs for Custom Computing Machines, Napa, CA, pages
1–9, April 1995.
[145] Michael J. Wirthlin, Brad L. Hutchings, and Kent L. Gilson. The nano processor:
A low resource reconfigurable processor. IEEE Workshop on FPGAs for Custom
Computing Machines, Napa, CA, pages 23–30, April 1994.
[146] Qi Yang. A microscopic traffic simulation model for IVHS applications. Master's
thesis, Massachusetts Institute of Technology, Department of Civil and Environmental
Engineering, August 1993.
[147] Albert Y. Zomaya. Parallel and Distributed Computing Handbook. Computer
Engineering Series. McGraw-Hill, New York, 1996.
Vita
Marc Bumble is a PhD candidate in the Computer Science and Engineering department
at the Pennsylvania State University in University Park, PA. He received his B.S. and
M.S. degrees in Electrical Engineering from the University of Pennsylvania in Philadelphia.
There, he wrote his master's thesis in Telecommunications on a routing algorithm
for a hypothetical satellite network based on the Iridium cellular network. His current re-
search investigates architectures for accelerating non-deterministic simulation, including
the application of reconfigurable logic.