7/29/2019 2012.TR427 VLSI Micro-Architectures High-Radix Crossbars
VLSI Micro-Architectures for High-Radix Crossbars
Giorgos Passas
Computer Architecture & VLSI Systems (CARV) Laboratory,
Institute of Computer Science (ICS)
Foundation for Research and Technology - Hellas (FORTH)
Science and Technology Park of Crete, P.O. Box 1385, Heraklion, Crete, GR-711-10, Greece
Technical Report FORTH-ICS/TR-427 April 2012
Copyright 2012 by FORTH
Work Performed as a Ph.D. Thesis
at the Department of Computer Science, University of Crete,
under the supervision of Prof. Manolis Katevenis, with the financial support of FORTH-ICS
FORTH-ICS/TR-427, April 2012
VLSI Micro-Architectures for High-Radix Crossbars
Giorgos Passas
The crossbar is the most popular switch for digital systems such as Internet routers, clusters, and
multiprocessors (on-chip as well as multichip). However, because the cost of the crossbar grows with
the square of its radix, and because of past implementations in various technologies, it is widely
believed that the crossbar does not scale to radices beyond 32 or 64, and that higher radices require
more complicated networks in which the crossbar is the basic building block. In this thesis, we scale
the crossbar to radices well beyond 100 by crafting novel VLSI micro-architectures and their detailed
CMOS layouts.
As a case study, we laid out a 128×128×24Gb/s crossbar, interconnecting 128 1mm² user tiles in a
single hop, using just 16mm² of silicon in 90nm CMOS. The crossbar is 32 bits wide, runs at 750 MHz,
and consumes 7 Watts.
In router systems, the user tiles will contain memory implementing combined queueing at the inputs
and outputs of the crossbar, plus a small part of logic for port control. We show that this architecture
is the best among a range of known router memory architectures (e.g. totally shared memory, solely
input queueing, or crosspoint queueing), for two reasons: (i) it gives top performance using only a
modest speedup on either the crossbar or the memories, independent of radix; and (ii) it partitions
the memory space only linearly with the radix, thus yielding: (a) high SRAM density by using few,
large, and area-efficient blocks; and (b) high memory space utilization through flexible sharing among
flows. In chip multiprocessors, the user tiles will contain cache or local memory, plus a small part of
logic for the processor. When traffic is global and heavy, such a system is competitive with the popular
mesh-centric systems, owing to the simplified routing and load balancing of the crossbar.
We made high-radix crossbars feasible by developing novel VLSI micro-architectures for both their
datapath and their control path. We implement the datapath using trees of multiplexor gates, as
tristate buses are slowed down by intrinsically large parasitic capacitances, and we show that highly
concentrated trees are more area efficient because they further reduce the parasitic capacitance of
their internal wires. Moreover, we contribute an experimental analysis showing that: (i) the area of
the crossbar is gate limited for all practical values of its radix N and its width W, thus growing as
O(N²W), not as O(N²W²), which would have been the case had area been wire limited, as is commonly
believed in the literature; and (ii) the delay of the crossbar is dominated by the parasitics of wires,
and because wire length grows with the perimeter of the crossbar, delay grows as O(N√W), not as
O(log N), which would have been the case had delay been gate limited, as is commonly believed in
the literature. Next, we propose novel pipelines to cope with the delay of the interconnect. Finally,
we demonstrate that modern EDA tools can be guided to exploit the abundance of wiring resources
through custom, but algorithmic, placement of gates.
For the control path, we study the architecture of iSLIP, the most popular parallel matching crossbar
scheduler. In particular, we study a traditional iSLIP architecture that implements the matching
decision of each input and each output of the crossbar in a separate arbiter block, and communicates
the matching decisions between the input and the output arbiters through global arbiter-to-arbiter
links. First, we show that this architecture is expensive because the arbiter-to-arbiter links
take up O(N⁴) area. Thus, a radix-128 iSLIP scheduler occupies 14mm², where the arbiter-to-arbiter
links account for more than 50% of the area. Next, by observing that the wiring of an arbiter fits in
O(N log N) area, we propose a novel architecture that inverts the locality of wires by orthogonally
interleaving the input with the output arbiters, thus lowering the wiring area of the scheduler down
to O(N² log² N). Using this architecture, the radix-128 iSLIP scheduler becomes gate limited, fitting
in 7mm², a 50% reduction compared to the traditional architecture. For a higher radix of 256, area
is reduced by almost an order of magnitude. Finally, the running time of the proposed scheduler is
less than 10ns, thus allowing operation with a minimum packet as small as 30 Bytes at a 24Gb/s
line rate.
Publications
Publications related to the topic:
G. Passas, M. Katevenis, D. Pnevmatikatos: Crossbar NoCs are scalable beyond 100 nodes, IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), Vol. 31, No. 4,
Apr. 2012, pp. 573-585;

G. Passas, M. Katevenis, D. Pnevmatikatos: VLSI Micro-Architectures for High-Radix Crossbar
Schedulers, proc. 5th ACM/IEEE International Symposium on Networks-on-Chip (NOCS 2011),
Pittsburgh, PA, USA, May 1-4, 2011, 8 pages, ISBN 978-1-4503-0720-8;

G. Passas, M. Katevenis, D. Pnevmatikatos: A 128 x 128 x 24Gb/s Crossbar, Interconnecting 128
Tiles in a Single Hop, and Occupying 6% of their Area, proc. 4th ACM/IEEE International Symposium
on Networks-on-Chip (NOCS 2010), Grenoble, France, May 3-6, 2010, pp. 87-95, IEEE Computer
Society, ISBN 978-0-7695-4049-8.

Other publications by the author:

G. Passas, H. Eberle, N. Gura, W. Olesinski: Fast and Fair Arbitration on a Data Link, U.S. Patent,
USPTO number 7965705, June 21, 2011;

G. Passas, M. Katevenis: Asynchronous Operation of Bufferless Crossbars, proc. IEEE International
Conference on High Performance Switching and Routing (HPSR 2007), Brooklyn, NY, USA,
May 30 - June 1, 2007, ISBN 1-4244-1206-4, paper ID 1569017531.pdf;

G. Passas, M. Katevenis: Packet Mode Scheduling in Buffered Crossbar (CICQ) Switches, proc.
IEEE Workshop on High Performance Switching and Routing (HPSR 2006), Poznan, Poland, June
7-9, 2006, pp. 105-112, ISBN 0-7803-9570-0;

M. Katevenis, G. Passas: Variable-Size Multipacket Segments in Buffered Crossbar (CICQ)
Architectures, proc. IEEE International Conference on Communications (ICC 2005), Seoul, Korea,
May 16-20, 2005, 6 pages, paper ID 09GC08-4;

M. Katevenis, G. Passas, D. Simos, I. Papaefstathiou, N. Chrysos: Variable Packet Size Buffered
Crossbar (CICQ) Switches, proc. IEEE International Conference on Communications (ICC 2004),
Paris, France, June 20-24, 2004, vol. 2, pp. 1090-1096.
Acknowledgments
I worked on my PhD thesis at the Computer Architecture and VLSI Systems (CARV) laboratory of the
Institute of Computer Science (ICS) of the Foundation for Research and Technology Hellas (FORTH).
FORTH provided my graduate scholarship, including funding by the European Commission through
the SARC project (FP6 IP #27648) and the HiPEAC Network of Excellence (NoE #004408 and #217068).
I am grateful to my advisor, Prof. Manolis Katevenis, for suggesting the evaluation of the cost of
crossbar speedup on chip using real VLSI layouts, and for supervising my work through weekly
meetings from Spring 2008 to Fall 2011. The years 2008 and 2009 in particular had been very hard,
and I can remember only him giving me a hand. Nevertheless, there was still plenty of time and space
to act on my own, which I really appreciated.
I am also grateful to my co-advisor, Prof. Dionisios Pnevmatikatos, who joined our meetings in Fall
2009 and offered fresh insights. However, I mostly thank him for all the nice things I learned from
him on technical writing and drawing; for example, the bold lines in Fig. 5.10 are due to him.
I also thank the other members of my thesis committee, Prof. Davide Bertozzi, Prof. Angelos Bilas,
Dr. Cyriel Minkenberg, Prof. Yannis Papaefstathiou, and Prof. Apostolos Traganitis, for their
questions and comments at the defense of my thesis; they proved very constructive. Special thanks
to Prof. Davide Bertozzi and Dr. Cyriel Minkenberg for going into more detail.
I thank Prof. Christos Sotiriou, whom I consulted on several of the issues I faced with EDA flows
and algorithms; the custom placement techniques were motivated by these discussions.
I thank Spyros Lyberis and Michael Ligerakis for setting up the EDA toolset for me. Spyros Lyberis also
helped me improve the oral presentation of my defense by commenting on my rehearsal.
I thank Dr. Hans Eberle, Prof. Jose Duato, and Prof. Jose Flich, with whom I had the opportunity to
cooperate during my internship at Sun Labs, Menlo Park, CA, in Fall 2007; seeing how other
researchers think about a closely related research topic was very helpful for my thesis.
Last but not least, I thank my family, especially my mother, and my uncle Nikos Arapakis, for their
love and encouragement, and my friends, especially Makis Stamos (Psilos), Kostis Anastasakis, Enrico
Schiattarella, Orestis Karamagiolas, and Giorgos Panagiotakis, for helping me further tolerate reality.
Finally, I thank the cleaning and security personnel at FORTH for being kind to me.
Contents

1 Introduction
  1.1 Motivation
  1.2 Contributions
  1.3 Outline

2 Basic Concepts
  2.1 Preliminaries
  2.2 Time Switches
  2.3 Space Switches
  2.4 Scheduling
  2.5 Routing
  2.6 Virtual Channels
  2.7 High Radix
  2.8 Summary

3 A Comparison of Architectures for High-Radix Switches
  3.1 Basic Switch Architectures
  3.2 Memory-Sharing Merits
  3.3 On-Chip SRAM Cost
  3.4 Combined Input-Output Queued (CIOQ) Crossbars
  3.5 Hierarchically-Queued Crossbars
  3.6 Related Work
  3.7 Conclusion

4 Datapath Micro-Architectures for High-Radix Crossbars
  4.1 Basic Architecture
  4.2 Cost & Performance Analysis
  4.3 Customized Layout using Link Pipelining
  4.4 Models for Area
  4.5 Related Work
  4.6 Conclusion

5 Scheduler Micro-Architectures for High-Radix Crossbars
  5.1 iSLIP Circuit
  5.2 Block Micro-Architecture
  5.3 Cross Micro-Architecture
  5.4 Cross-iSLIP versus Wavefront-Scheduler Comparison
  5.5 FIFO & Virtual-Channel Schedulers
  5.6 Related Work
  5.7 Conclusion

6 High-Radix CIOQ Crossbar Switches & Crossbar NoCs
  6.1 Tiled Architectures
  6.2 Wire-Over-SRAM Architectures
  6.3 Radix-128 Tiled System using Centralized Crossbar
  6.4 Crossbar versus Mesh Comparison
  6.5 Projections in Newer Technology Nodes
  6.6 Related Work
  6.7 Conclusion

7 Summary & Future Work

Bibliography
Chapter 1
Introduction
Today, on-chip memory systems with a few hundred memory nodes, such as chip multiprocessors
(CMPs) and switch fabrics, are pivotal digital systems. A key component in such systems is the switch
interconnecting the memories. While the crossbar is the most popular switch, it is widely considered
non-scalable to radices beyond a few tens of nodes due to its quadratic cost [1][2][3]. Thus, designers
are increasingly adopting cheaper-but-intricate topologies, such as meshes and tori [1][4][5], where
the crossbar is the basic building block. However, a study of the scaling limits of the crossbar is
missing from the literature: if the crossbar is proven feasible, designers are likely to replace the
currently deployed topologies with a crossbar to benefit from its simplicity.
We consider memory systems where each memory node is located on a user tile, and user tiles
are arranged in a 2D matrix. In a switch fabric, a user tile will implement mostly queues associated
with a switch port, plus a small circuit for the control of that port. In a CMP, a user tile will implement
a processor and the cache or local memory next to the processor. We evaluate the area, speed, and
power-consumption overhead that the crossbar imposes on the user tiles by studying their VLSI
organization in modern CMOS technology, using Electronic Design Automation (EDA) tools.
1.1 Motivation
Switch chips interconnect their ports using a crossbar. An efficient approach to handling contention
for the output ports is to use queues. In particular, traffic at different inputs contending for the same
output is queued at the inputs [6]. To reduce Head-of-Line blocking in such Input Queued (IQ)
crossbars, queue memories can be divided into lanes, e.g. Virtual Output Queues (VOQ) [6]. However,
scheduling which lane is connected to which output is a problem that is hard to solve quickly, fairly,
and efficiently [7]. To compensate for the inefficiencies of scheduling, internal speedup can be used.
Then, the throughput of both the memories and the crossbar is overprovisioned, and memories are
also placed at the outputs. This organization is known as Combined Input-Output Queueing (CIOQ)
[8][9][10][11].
CIOQ has been studied for systems where the crossbar and the memories are implemented on
separate chips [7][12]. In such systems, it is known to be expensive because it wastes costly chip-IO
resources [12]. Thus, crosspoint queueing (XQ) has been proposed as a more scalable alternative.
XQ can be considered a scalable variant of output queueing (OQ), trading queueing internal to the
crossbar for operation without internal speedup [13][14].
However, the tradeoff between speedup and memory is very different off-chip from on-chip.
Chip-to-chip link bandwidth is expensive, in terms of both pin count and power consumption. Each
pin is run at the highest possible frequency, so link speedup can only be provided by increasing the
number of pins. But more pins cost in package size, board area, and wiring. Moreover, high-speed
serial off-chip links carry their own clock information, embedded in the data encoding. Even if such a
link has no valid data to carry, it still consumes power because it has to carry synchronization signals
for the receiver's clock-recovery circuit to stay in sync (powering down is beneficial only for quite
long inactivity periods). Thus, an off-chip link with a speedup of s > 1.0 consumes power proportional
to its peak throughput, s, although its average utilization never exceeds 1.0.
On-chip, however, things are very different. The wires of an idle link need not change state, hence
CMOS circuits consume energy only when transferring valid data: on-chip link power consumption
is proportional to average throughput, not peak capacity. Moreover, on-chip links can be routed
over the memories: modern CMOS processes offer many layers of interconnect (e.g. eight), while
SRAM blocks obstruct few of them (e.g. four). Thus, on chip, CIOQ appears technologically sound.
Furthermore, CIOQ is advantageous over XQ because it reduces the partitioning of the memory
space. In particular, in CIOQ the total number of memories grows linearly with the radix, while in
XQ it grows quadratically, and higher memory partitioning translates to higher implementation cost.
On the other hand, CIOQ, compared to XQ, increases the cost of the crossbar, while also requiring a
monolithic crossbar scheduler. Hence, a comparison between CIOQ and XQ starts from the evaluation
of the cost of crossbar speedup and the feasibility of the scheduler.
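As a concrete illustration of the partitioning argument above (this arithmetic is ours, not a figure from the thesis), the number of separate queue memories can be counted for each organization:

```python
def memories_cioq(n: int) -> int:
    # CIOQ: one input-side and one output-side memory block per port,
    # so memory space is partitioned only linearly with the radix.
    return 2 * n

def memories_xq(n: int) -> int:
    # XQ: one queue memory at each of the N^2 crosspoints.
    return n * n

for n in (32, 64, 128):
    print(f"radix {n}: CIOQ {memories_cioq(n)} blocks, XQ {memories_xq(n)} blocks")
```

At radix 128, CIOQ needs 256 blocks while XQ needs 16384, which is why XQ forces many small, area-inefficient SRAMs.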
This evaluation should be done for high-radix switches [15]. Switch chip designers strive to benefit
from the advances in signaling technology by increasing the radix of their switch chips, as switch
chips with higher radix enable lower-diameter network topologies, with lower component count and
lower cost. However, most studies in the literature concern relatively low-radix crossbars, up to 32
or 64 ports [2][16], and a study of scaling to hundreds of ports is missing.
Finally, the cost of crossbar speedup and the feasibility of the scheduler should be evaluated
on real VLSI layouts. Switch chips are typically Application Specific Integrated Circuits (ASICs), and
ASICs are designed using Electronic Design Automation (EDA) tools. Thus, VLSI layouts should be
developed using such EDA tools.
1.2 Contributions
The key contributions of this thesis are as follows:
1. High-Radix Crossbar Network-on-Chip. We lay out a 128×128×24Gb/s crossbar in a 90nm
CMOS process with 9 layers of interconnect. The crossbar is 32 bits wide, runs at 750 MHz using
a 3-stage pipeline, fits in 16mm² of silicon by filling it at the 90% level, and consumes 6 Watts.
Moreover, we surround the crossbar with 128 1mm² user tiles, and we connect the crossbar to
the user tiles through global links. The global links are 32 bits wide, run at 750 MHz using a
two-stage pipeline, run on top of the user tiles, and consume 1.2 Watts.
2. Crossbar Datapath Micro-Architecture. We implement the datapath of the crossbar using trees
of multiplexor gates, as tristate buses are slowed down by intrinsically large parasitic capacitances,
and we show that highly concentrated trees are more area efficient because they further reduce
the parasitic capacitance of their internal wires. Next, we contribute an experimental scaling
analysis, showing that: (i) the area of the crossbar is gate limited for all practical values of its
radix N and its width W, thus growing as O(N²W), not as O(N²W²), which would have been
the case had area been wire limited, as is commonly believed in the literature [2][17]; and (ii)
the delay of the crossbar is dominated by the parasitics of wires, and because wire length grows
with the perimeter of the crossbar, delay grows as O(N√W), not as O(log N), which would have
been the case had delay been gate limited, as is commonly believed in the literature [15]. Next,
we propose novel pipelines to cope with the delay of the interconnect. Finally, we demonstrate
that EDA tools can be guided to compact routing solutions through custom gate placement.
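A back-of-the-envelope sketch of the two scaling laws above (the constants are placeholders of our own; only the asymptotics follow the text):

```python
import math

def mux_gates(n: int, w: int) -> int:
    # Datapath as N output trees of 2-to-1 multiplexor gates: each W-bit
    # N-to-1 tree needs (N - 1) * W gates, so the total grows as O(N^2 * W).
    return n * (n - 1) * w

def side_length(n: int, w: int) -> float:
    # Gate-limited area ~ gate count (unit gate area assumed), so the side,
    # and hence the worst-case wire length and delay, grows as O(N * sqrt(W)).
    return math.sqrt(mux_gates(n, w))

# Doubling the radix roughly quadruples area but only doubles wire delay.
area_ratio = mux_gates(128, 32) / mux_gates(64, 32)
delay_ratio = side_length(128, 32) / side_length(64, 32)
print(round(area_ratio, 2), round(delay_ratio, 2))
```

This is why the crossbar stays gate limited in area yet wire limited in delay: gate count scales quadratically with N, while wire length scales only with the perimeter.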
3. Crossbar Scheduler Micro-Architecture. We study a traditional iSLIP architecture that implements
the matching decision of each input and each output of the crossbar in a separate arbiter
block, and communicates the matching decisions between the input and the output arbiters
through global arbiter-to-arbiter links. First, we show that this architecture is expensive because
the arbiter-to-arbiter links take up O(N⁴) area. Thus, a radix-128 iSLIP scheduler occupies
14mm², where the arbiter-to-arbiter links account for more than 50% of the area. Next, by
observing that the wiring of an arbiter fits in O(N log N) area, we propose a novel cross
architecture that inverts the locality of wires by orthogonally interleaving the input with the
output arbiters, thus lowering the wiring area of the scheduler down to O(N² log² N). Using this
cross architecture, the radix-128 iSLIP scheduler becomes gate limited, fitting in 7mm², a 50%
reduction compared to the traditional architecture. For a higher radix of 256, the reduction
nears an order of magnitude.
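The growing gap between the two wiring-area laws can be sketched numerically (asymptotic proxies only; the constants hidden by the O-notation are dropped):

```python
import math

def block_wiring(n: int) -> float:
    # Traditional block architecture: N^2 global arbiter-to-arbiter links
    # crossing an O(N) x O(N) floorplan -> wiring area grows as O(N^4).
    return float(n) ** 4

def cross_wiring(n: int) -> float:
    # Cross architecture: interleaved arbiters keep wires local, so wiring
    # area grows only as O(N^2 * log^2 N).
    return (n * math.log2(n)) ** 2

for n in (64, 128, 256):
    print(n, block_wiring(n) / cross_wiring(n))
```

The ratio itself grows as N²/log²N, which is consistent with the area advantage widening from roughly 2× at radix 128 toward an order of magnitude at radix 256 once fixed gate area is added back in.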
4. Combined Input-Output Queueing is better than Crosspoint Queueing. Based on the above
findings, we conclude that crossbars are small and speedup is inexpensive. Because CIOQ,
compared to XQ, reduces memory partitioning, CIOQ is better than XQ.
1.3 Outline
Chapter 2 presents basic concepts in switch design. First, we abstract the role of the switch to scalable
distributed multiparty communication. Next, we classify switches to time and space switches, and
we explain why space switches are more scalable. Though scalable, space switches need scheduling
and routing. Thus, we also overview some popular scheduling and routing algorithms. Moreover, we
discuss the popular Virtual Channels and the argument for high-radix switches. Finally, we summarize.
Chapter 3 presents a comparison of known switch architectures for high-radix switches. First, we
overview basic switch architectures. Next, we study the merits of memory sharing and the cost of
memory implementation, and we show that the input-queued crossbar is the only scalable switch.
Because the input-queued crossbar is difficult to schedule at high radices, we also discuss scalable
variants thereof, namely combined input-output queued crossbars and hierarchically queued
crossbars. Finally, we discuss related work and conclude.
Chapter 4 presents VLSI micro-architectures for high-radix crossbar datapaths. First, we describe a
basic datapath architecture. Next, we show that in this architecture area is practically always gate
limited, while delay becomes wire limited at high radices. To increase throughput, we also describe a
customized layout using a novel wire pipeline. Moreover, we develop simple models for area. Finally,
we discuss related work and conclude.
Chapter 5 presents VLSI micro-architectures for high-radix crossbar schedulers. First, we describe
the iSLIP circuit. Next, we study a traditional block architecture, and we show that at high radices
scheduler area is wire limited. To remove the wiring limitations, we propose and study a novel cross
architecture. We find that this cross architecture has similarities with the architecture of the
Wavefront Scheduler. Moreover, we adapt the cross scheduler architecture to FIFO and Virtual
Channel crossbars. Finally, we discuss related work and conclude.
Chapter 6 presents VLSI micro-architectures for high-radix Combined Input-Output Queued (CIOQ)
switches and Networks-on-Chip (NoCs). First, we show that a tiled architecture can be used for both
CIOQ switches and NoCs. Next, we study alternative locations of the crossbar in its context of tiles,
and we show that a centralized crossbar is more practical. Moreover, we plot overall system
performance, we compare our crossbar to a popular mesh NoC, and we make projections for newer
technology nodes. Finally, we describe related work and conclude.
Chapter 7 presents a summary of the thesis, as well as directions for future work.
Chapter 2
Basic Concepts
In this chapter, we describe basic concepts in switch design. In particular, we describe the role of
a switch (section 2.1), a taxonomy of switches into time and space switches (sections 2.2 and 2.3),
the problem of scheduling (section 2.4), the problem of routing (section 2.5), the concept of Virtual
Channels (section 2.6), the argument for high-radix switches (section 2.7), and a summary (section
2.8). The description is based mainly on the transparencies of the Packet Switch Architecture class
at the University of Crete [18].
2.1 Preliminaries

Interconnection switches are intermediaries implementing the communication between the parties
of a digital system: for example, between line cards in an Internet router [19], between processors
and memories in a multiprocessor system [20], and between I/O devices, processors, and/or memories
in a storage system [21].
As the scale of systems varies from a few tens of parties (e.g. in an Internet router) to many thousands
of parties (e.g. in a multiprocessor), the scale of the switch has to vary accordingly. While a small
switch may switch data by simply passing it through a single point in space at different moments in
time (e.g. the memory switch, section 2.2), larger switches use topologies of parallel paths in space
(e.g. the crossbar or the Benes switch, section 2.3). Thus, space and time are two basic dimensions in
switch design. A third dimension is the coordination of the parties. Examples are scheduling and
routing algorithms resolving situations where parties contend for a single resource of the switch,
such as an output link or an internal path (sections 2.4 and 2.5). In any case, the parties comprise a
distributed system. When the scale is large, distribution is obviously a direct consequence of the
physical distances between the parties. At smaller scales, distribution emerges from large ratios of
path delays to processing times.
Figure 2.2: A 4×4 memory switch. (Inputs assemble words; all traffic crosses a single path in space,
the central memory, which is the bottleneck; outputs disassemble the words.)
2.2 Time Switches
For small systems, a switch may simply pass data through a single point in space at different moments
in time. A representative such switch is the memory switch. A memory switch (Fig. 2.2) switches data
from a set of input links to a set of output links through writes and reads to and from a central memory.
The access rate of the central memory is equal to the aggregate rate of the input and the output links,
in order to absorb any contention among the input links for the output links, and to fully utilize the
output links. Thus, the inputs assemble words, which are then multiplexed in time on the memory
(write) bus. In the opposite direction, words are demultiplexed from the memory (read) bus and are
disassembled at the outputs. Inside the memory, words are organized in queues. For details on the
memory switch, refer e.g. to [22][23][24].
Queues are typically simple first-in-first-out (FIFO) structures, as only such simple structures
can be implemented fast in hardware [25]. Thus, at least one queue is needed per output to maximize
the throughput of the memory: if words for different outputs are intermingled in the same queue,
words for a heavily loaded output block those for other, lightly loaded outputs. This is a fundamental
problem in switch design, known as Head-of-Line (HoL) blocking [6][26].
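The effect can be reproduced with a toy discrete-time model (our own illustration; the queue contents and the choice of blocked output are arbitrary):

```python
from collections import deque

def drain(queues, blocked_output, cycles):
    # Each cycle, every queue may forward its head word, unless that word
    # is destined to the blocked (heavily loaded) output.
    delivered = 0
    for _ in range(cycles):
        for q in queues:
            if q and q[0] != blocked_output:
                q.popleft()
                delivered += 1
    return delivered

traffic = [0, 1, 0, 1, 0, 1, 0, 1]   # word destinations; output 0 is congested

# One shared FIFO: the head word for output 0 blocks the words behind it.
shared = [deque(traffic)]
# One FIFO per output: words for output 1 are not stuck behind output 0.
per_output = [deque(w for w in traffic if w == o) for o in (0, 1)]

print(drain(shared, 0, 8), drain(per_output, 0, 8))   # → 0 4
```

With a single shared queue nothing is delivered, because the head word always targets the congested output; with per-output queues, all words for the lightly loaded output get through.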
Coordination in a memory switch concerns the sharing of memory space among the inputs and the
outputs. In particular, when some inputs are contending for the same output, the queue corresponding
to that output starts building up over time. Given that memory space is finite, a protocol is needed
to prevent the memory from overflowing. Usually, a credit-based protocol is employed [27]: inputs
are allocated a number of credits for each queue, credits are spent on writes and returned on reads,
so that operation is lossless. Notice that in this way the memory is minimally partitioned into N²
credit equivalents. Furthermore, while dropping excess words could be an alternative, this usually
degrades performance, as new words may not be able to arrive in time to replace the dropped
ones [6][26]. Finally, according to Little's law [28], the rates allocated to inputs for
Figure 2.3: (a) A 4×4 crossbar switch and (b) example connections.
a particular output are proportional to their share of that output's queue space. That is, in a memory
switch rates are allocated through memory space.
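A minimal sketch of such a credit-based protocol (the class and method names are ours, purely illustrative):

```python
class CreditedQueue:
    """One per-output queue with credit-based, lossless flow control:
    the sender spends a credit on each write and regains it on each read."""

    def __init__(self, credits: int):
        self.credits = credits       # free word slots in the memory
        self.words = []

    def write(self, word) -> bool:
        if self.credits == 0:
            return False             # back-pressure: the sender waits, never drops
        self.credits -= 1
        self.words.append(word)
        return True

    def read(self):
        word = self.words.pop(0)
        self.credits += 1            # credit flows back to the sender
        return word

q = CreditedQueue(credits=2)
assert q.write("a") and q.write("b")
assert not q.write("c")              # memory full: write refused, not dropped
assert q.read() == "a"
assert q.write("c")                  # a read returned a credit
```

Allocating at least one credit per (input, output) pair is what partitions the memory into the N² credit equivalents mentioned above.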
The memory switch is the most efficient of the known switch architectures because it optimizes the
utilization of both the links and the memory space. All links are guaranteed to run at 100% of their
capacity, while memory space can be flexibly shared among queues; see [29] for a range of possible
sharing schemes. Unfortunately, the memory switch is non-scalable: as the number and/or the rate
of the links increases, the access rate of the central memory increases proportionally. As we shall
see in chapter 3, current memory and link technology constrain the size of a state-of-the-art memory
switch to only eight ports.
2.3 Space Switches
Space switches achieve greater scalability by using parallel paths in space. Popular examples are the crossbar,
the mesh, and the Benes switch.
A crossbar switch (Fig. 2.3) is a set of N input lines, N output lines, and N² programmable
crosspoints between them. Let us denote by x_{i,j} the crosspoint between input i and output j. For
input i to connect to output j, crosspoint x_{i,j} closes, while every other crosspoint x_{k,j}, k ≠ i, opens
to avoid shorts. Finally, the input and the output lines connect to same-rate input and output links.
Thus, the crossbar is internally non-blocking.
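In behavioral terms, programming the crosspoints amounts to keeping one selected input per output column. A minimal sketch (function and parameter names are ours, not from the thesis):

```python
def crossbar_pass(inputs, config):
    """Forward one word per output through an N x N crossbar.
    config[j] = i means crosspoint x(i,j) is closed, i.e. input i
    drives output j; config[j] = None leaves output j idle.
    Representing the configuration as one selected input per output
    guarantees by construction that every other crosspoint x(k,j),
    k != i, on the same output column is open."""
    return [None if i is None else inputs[i] for i in config]
```

For example, with inputs `['a', 'b', 'c', 'd']` and configuration `[2, 0, None, 1]`, output 0 receives the word of input 2, output 1 that of input 0, output 2 stays idle, and output 3 receives the word of input 1.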
A mesh switch (Fig. 2.4) uses one 5×5 crossbar for each pair of input-output links, and connects
the crossbars in a √N × √N grid. In this way, it reduces the complexity of the crossbar from O(N²)
down to O(N). However, this comes at the cost of internal blocking. In particular, the bisection of the
mesh is O(√N) wide, that is, narrower than the O(N) connections. Furthermore, unlike the crossbar,
where there is a dedicated path for each input-output pair, in the mesh each input may connect to
any of the outputs through multiple alternative paths, intersecting with other paths connecting other
Figure 2.4: (a) A 9×9 mesh switch and (b) alternative routes for an example connection.
pairs. For details on the mesh refer e.g. to [30].
A 4×4 Benes switch (Fig. 2.5) comprises two back-to-back connected 4×4 Banyan networks of
2×2 switching elements. Larger Benes switches are constructed recursively: an N×N Benes comprises
two N/2×N/2 Benes sub-switches, sandwiched by N additional elements. Thus, the cost of the Benes
is O(N log N). Furthermore, the Benes is non-blocking, like the crossbar. The intuition behind its
non-blocking property is that the Benes has more states than all possible permutations of external
connections. In particular, it uses 2·log₂N − 1 stages of N/2 2×2 switching elements, thus providing
2^(N·log₂N − N/2) states, which are more than the N! permutations of external connections [31]. Finally,
like the mesh, the Benes is a multipath switch. For details on the Benes refer e.g. to [32].
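The counting argument above can be checked numerically. The short sketch below (our own helper, for illustration) counts the 2×2 elements of an N×N Benes and verifies that 2 raised to that count, which equals 2^(N·log₂N − N/2), exceeds N! for several power-of-two radices:

```python
import math

def benes_states(n):
    """Number of configurations of an n x n Benes switch:
    2 to the power of the number of its 2x2 switching elements.
    The Benes has 2*log2(n) - 1 stages of n/2 elements each,
    so the exponent equals n*log2(n) - n/2."""
    stages = 2 * int(math.log2(n)) - 1
    elements = (n // 2) * stages
    return 2 ** elements

# More states than permutations of external connections:
for n in (4, 8, 16, 32, 64):
    assert benes_states(n) >= math.factorial(n)
```

For n = 4, the switch has 6 elements and hence 64 states, comfortably more than the 4! = 24 permutations.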
There are a number of other known space switches. For example, the mesh can be extended to
three or more dimensions. Moreover, the Benes can be folded in a fat tree. Finally, the Clos switch
is a Benes switch using higher radix switching elements. For details on these switches, refer e.g. to
[33][34][35].
Like time switches, space switches hold contending traffic in queues. However, space switches
place these queues at the inputs. Thus, memory throughput is independent of N. Owing to this fea-
ture, space switches can scale to hundreds or even thousands of ports, as we show in chapter 3. In
the simplest case, each input memory contains one FIFO queue. The problem then is HoL blocking.
Progressing at the head of the queue, traffic destined to a congested output blocks other, irrelevant
traffic behind it. When traffic is uniformly destined, this is known to reduce the throughput of the
switch below 60% [6]. Under more stressed conditions, performance degrades even further [6]. The
solution to HoL blocking is to change the organization of memories. In particular, by separating traffic
per switch output, HoL blocking is eliminated [6]. This approach is known as Virtual Output Queueing
(VOQ) because although queues are per output, they are physically located at the inputs. Other less
expensive approaches, such as Virtual Channels (section 2.6), have also been proposed. Finally, the
Figure 2.5: (a) A 4×4 or (b) 8×8 Benes switch and (c) alternative routes for an example connection.
allocation of the space of each memory to upstream nodes can be controlled by a credit protocol, in
analogy to time switches.
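The throughput penalty of single-FIFO input queues mentioned above (below 60% under uniform traffic [6]) can be reproduced with a small saturation simulation. This is our own illustrative model, not a simulator from the thesis; all names are ours:

```python
import random

def hol_throughput(n=32, slots=20000, seed=1):
    """Saturation throughput of a switch with one FIFO per input.
    Every input is always backlogged; head-cell destinations are
    uniform random; each output serves one head-of-line cell per
    slot. Blocked heads persist into the next slot, so HoL blocking
    limits the achievable throughput."""
    rng = random.Random(seed)
    heads = [rng.randrange(n) for _ in range(n)]  # destination of each head cell
    delivered = 0
    for _ in range(slots):
        contenders = {}
        for i, d in enumerate(heads):
            contenders.setdefault(d, []).append(i)  # inputs contending per output
        for d, inputs in contenders.items():
            winner = rng.choice(inputs)       # output d serves one input
            heads[winner] = rng.randrange(n)  # winner advances its FIFO
            delivered += 1
    return delivered / (n * slots)
```

For moderately large n this converges toward the well-known 2 − √2 ≈ 0.586 limit, consistent with the "below 60%" figure cited in the text.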
Fig. 2.6 compares the operation of space and time switches using space-time diagrams. We assume
a 3×3 switch, where traffic at inputs 0 and 1 is destined to output 0, and traffic at input 2 is
destined to output 1. In the time switch (Fig. 2.6(a)), inputs assemble 3-packet frames, which are
multiplexed in time on the write bus, written in memory per output, and finally demultiplexed
on the read bus towards the outputs. Observe in Fig. 2.6(a) that there is a minimum latency of three
packet times for the assembly of frames, and backlog for output 0 accumulates inside the memory,
neither at the inputs nor at the outputs. On the other hand, in the space switch (Fig. 2.6(b)), input
memories are three times narrower and packets are multiplexed towards the outputs in space. Observe
in Fig. 2.6(b) that there is a minimum latency of one packet time, corresponding to scheduling,
and backlog accumulates solely at the inputs.
Unfortunately, the problem with space switches is coordination. Coordination concerns (i) scheduling,
i.e. deciding which input is connected to which output, and (ii) routing of connections between the matched
inputs and outputs. We overview these problems in sections 2.4 and 2.5, respectively. Notice that
Figure 2.6: Space-time diagram of packet forwarding in a 3×3 (a) time switch and (b) space switch.
Traffic at input 0 and input 1 is destined to output 0, and traffic at input 2 is destined to output 1. In
(a), inputs assemble 3-packet frames, which are multiplexed in time on the write bus, written in
memory per output, and finally demultiplexed on the read bus towards the outputs. In (b), input
memories are three times narrower and packets are multiplexed towards the outputs in space.
routing in crossbars is trivial because each input-output pair has a private path. This is a basic reason
crossbars are so popular.
Figure 2.7: Unfairness of maximum matchings. To set up the connection from input 1 to output 0, a
non-maximum match is needed.
2.4 Scheduling
Switch scheduling is a special application of bipartite graph matching [36]. The vertices of the graph
are the inputs and the outputs of the switch, and the edges are the desired connections. Although maximum
size matching algorithms maximize the throughput of the switch, they are impractical for two
reasons. First, they are too slow to implement in fast hardware [37]. Second, they are inherently unfair:
in the example of Fig. 2.7, a maximum size matching algorithm would starve the connection from
input 1 to output 0 [37]. Thus, heuristics are used in practice.
Let us first consider the simplest case, that is, the scheduler for a switch with a single FIFO per
input. Such a scheduler runs in the following two steps:
Step 1: Request. Each input sends a request to the destination output of its head word.
Step 2: Grant. If an output receives any requests, it chooses the one to grant (e.g. randomly), and
notifies each input whether its request was granted.
After the above two steps have been executed, a bipartite match has been found. The runs of the
scheduler are usually pipelined with the configuration of the switch and the forwarding of the words.
Thus, the running time of the scheduler quantizes the external traffic into fixed-size internal units,
often called packets or cells. Notice that to sustain the line rate, the internal packets have to be no
larger than the minimum-size external packets.
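The two steps above can be sketched as a behavioral model (not hardware; the function name and the random grant arbiter are our own choices):

```python
import random

def fifo_schedule(head_dests, rng=random):
    """One run of the request/grant scheduler for single-FIFO inputs.
    head_dests[i] is the destination output of input i's head word,
    or None if input i is empty. Returns {output: granted_input}."""
    # Step 1: Request. Each input requests the output of its head word.
    requests = {}
    for i, d in enumerate(head_dests):
        if d is not None:
            requests.setdefault(d, []).append(i)
    # Step 2: Grant. Each requested output grants one input (here: randomly).
    return {d: rng.choice(inputs) for d, inputs in requests.items()}
```

Since each input requests exactly one output, the grants are automatically conflict-free and directly form a bipartite match.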
Schedulers for VOQ switches are more complicated because in such switches each input may
wish to connect to more than one output. Thus, input contention is added to output contention. Below we
review four popular and representative scheduling algorithms: PIM [38], iSLIP [37], DRRM [39], and
the Wavefront Scheduler [40].
Figure 2.8: Example run of the PIM scheduling algorithm.
PIM Parallel Iterative Matching [38] runs in the following three steps:
Step 1: Request. Each input sends a request to every output for which it has at least one packet.
Step 2: Grant. If an output receives any requests, it chooses randomly the one to grant, and
notifies each input whether its request was granted.
Step 3: Accept. If an input receives any grants, it chooses randomly the one to accept.
After the above three steps have been executed, a bipartite match has been found (Fig. 2.8). Moreover,
the above steps may be iterated between the unmatched inputs and outputs to increase the size of
the match. As proved in the original PIM paper [38], log N iterations converge to a maximal match.
However, the problem with this algorithm is that (i) it needs random-number generators, which are
tricky to implement in fast hardware, and (ii) it is unfair under asymmetric traffic [41], as illustrated
in Fig. 2.9. In Fig. 2.9(a), each flow would ideally receive 1/2 of the link bandwidth, but in reality the
algorithm tends to discriminate against inputs that face contention [41]. Fig. 2.9(b) shows a second,
analogous scenario.
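The three PIM steps, iterated between unmatched inputs and outputs, can be sketched as follows. This is an illustrative software model of the algorithm described above; function and variable names are ours:

```python
import random

def pim_iteration(requests, free_in, free_out, rng=random):
    """One request/grant/accept iteration of PIM.
    requests[i] = set of outputs input i has packets for.
    free_in / free_out: currently unmatched inputs and outputs.
    Returns the {input: output} pairs added to the match."""
    # Steps 1-2: each unmatched output randomly grants one requesting input.
    grants = {}  # input -> list of outputs granting it
    for j in free_out:
        requesters = [i for i in free_in if j in requests[i]]
        if requesters:
            grants.setdefault(rng.choice(requesters), []).append(j)
    # Step 3: each granted input randomly accepts one grant.
    return {i: rng.choice(js) for i, js in grants.items()}

def pim(requests, iterations):
    """Iterate PIM between the unmatched inputs and outputs."""
    n = len(requests)
    match, free_in, free_out = {}, set(range(n)), set(range(n))
    for _ in range(iterations):
        new = pim_iteration(requests, free_in, free_out)
        match.update(new)
        free_in -= set(new)
        free_out -= set(new.values())
    return match
```

Note how an input granted by several outputs accepts only one, leaving the losing outputs for later iterations; this is exactly why additional iterations grow the match.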
iSLIP Iterative SLIP [37] overcomes the problems of PIM by resolving contention in round-robin order. It runs
in the following four steps:
Step 1: Request. Each input sends a request to every output for which it has at least one packet.
Step 2: Grant. If an output receives any requests, it decides round-robin which one to grant, and
communicates back to each input whether its request was granted.
Step 3: Accept. If an input receives any grants, it decides round-robin which one to accept, and
communicates back to each output whether its grant was accepted.
Figure 2.9: PIM is unfair when the input load is asymmetric; (a) and (b) show two examples. Inputs
request the full rate from a set of outputs. Fractions denote the rates allocated by PIM.
Step 4: Slip. If an input accepts any output, it increments (modulo N) its round-robin pointer to
one location beyond that output. If an output is accepted by the input it granted, it increments
(modulo N) its round-robin pointer to one location beyond that input.
After the first three steps have been executed, a bipartite match has been found. The fourth step
ensures that subsequent runs of the algorithm will give fair and often maximal matches. Because each
output keeps granting the same input until accepted, and because inputs arbitrate round-robin, any
output eventually gets accepted in at most N runs of the algorithm. As a consequence, it is guaranteed
that any request results in a match in at most N² runs [37]. Finally, by insisting on their grants, the
outputs tend to slip (desynchronize), speeding up convergence to maximal matches.
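A single-iteration run of iSLIP, including the slip rule, can be sketched as follows (an illustrative model; names are ours, and the pointer arrays are updated in place):

```python
def islip_run(requests, grant_ptr, accept_ptr, n):
    """One run (single iteration) of iSLIP.
    requests[i] = set of outputs input i has packets for.
    grant_ptr[j]: output j's round-robin pointer over inputs.
    accept_ptr[i]: input i's round-robin pointer over outputs.
    Returns {input: output} matched pairs."""
    def rr_pick(candidates, ptr):
        # first candidate at or after ptr, wrapping modulo n
        return min(candidates, key=lambda x: (x - ptr) % n)

    # Steps 1-2: Request and Grant (round-robin per output).
    grants = {}  # input -> list of outputs granting it
    for j in range(n):
        requesters = [i for i in range(n) if j in requests[i]]
        if requesters:
            grants.setdefault(rr_pick(requesters, grant_ptr[j]), []).append(j)
    # Step 3: Accept (round-robin per input).
    match = {i: rr_pick(js, accept_ptr[i]) for i, js in grants.items()}
    # Step 4: Slip - only accepted grants advance the pointers,
    # so an output keeps granting the same input until accepted.
    for i, j in match.items():
        accept_ptr[i] = (j + 1) % n
        grant_ptr[j] = (i + 1) % n
    return match
```

With two inputs both requesting output 0, the first run grants input 0 and advances the grant pointer past it, so the second run grants input 1: the pointer update is what delivers the N-run acceptance guarantee cited above.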
DRRM Dual Round-Robin Matching [39] runs in the following three steps:
Step 1: Request. Each input selects round-robin which output to send a request to.
Step 2: Grant. If an output receives any requests, it decides round-robin which one to grant, and
communicates back to each input whether or not its request was granted.
Step 3: Slip. If an input is granted by the output it requested, it increments (modulo N) its round-robin
pointer to one location beyond that output. If an output grants any input, it increments
(modulo N) its round-robin pointer to one location beyond that input.
Figure 2.10: Example run of the Wavefront Scheduler.
Compared to iSLIP, DRRM saves one step; thus, a bipartite match is found in the first two steps. The
third step ensures that subsequent runs of the algorithm will give fair and often maximal matches, in
analogy to the fourth step of iSLIP. Unfortunately, DRRM introduces HoL blocking [42]: if the output
that an input insists on is congested, flows from this input to non-congested outputs are blocked
[42].
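DRRM's two decision steps plus slip can be sketched in the same style as iSLIP above (again an illustrative model with our own names; pointer arrays are updated in place):

```python
def drrm_run(requests, req_ptr, grant_ptr, n):
    """One run of DRRM.
    requests[i] = set of outputs input i has packets for.
    req_ptr[i]: input i's round-robin request pointer over outputs.
    grant_ptr[j]: output j's round-robin grant pointer over inputs.
    Returns {input: output} matched pairs."""
    def rr_pick(candidates, ptr):
        # first candidate at or after ptr, wrapping modulo n
        return min(candidates, key=lambda x: (x - ptr) % n)

    # Step 1: each input requests ONE output, chosen round-robin.
    chosen = {i: rr_pick(outs, req_ptr[i])
              for i, outs in enumerate(requests) if outs}
    # Step 2: each requested output grants one input, round-robin.
    match = {}
    for j in set(chosen.values()):
        requesters = [i for i, jj in chosen.items() if jj == j]
        match[rr_pick(requesters, grant_ptr[j])] = j
    # Step 3: Slip - a granted request advances both pointers,
    # so an ungranted input keeps requesting the same output.
    for i, j in match.items():
        req_ptr[i] = (j + 1) % n
        grant_ptr[j] = (i + 1) % n
    return match
```

The single request per input is what saves a step relative to iSLIP, and also what causes the HoL blocking noted above: an input insisting on a congested output cannot simultaneously request its other, uncongested destinations.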
Wavefront Scheduler Instead of making selections locally at N inputs and N outputs, the Wavefront
Scheduler [40] operates globally on a square matrix of N² flows, where flows are prioritized by a diagonal
wavefront. We show an example run of this scheduler in Fig. 2.10. In the upper part, we show
the request matrix and the match eventually computed by the scheduler. In particular, the Wavefront
Scheduler uses a request matrix, where flow (i,j) from input i to output j corresponds to entry (i,j),
and entry (i,j) is set if and only if flow (i,j) is backlogged. The scheduler runs on the request
matrix as we show in the bottom part of Fig. 2.10. Initially, each flow in row 0 is given a vertical
token, and each flow in column 0 is given a horizontal token. A flow's request is granted if and
Figure 2.11: Priorities of flows during a period of runs of the Wavefront Scheduler in an example 3×3 switch. Top priority is shifted to a different flow from run to run. Flow 00 is given 6 times higher priority over flow 10, while flow 10 is given only 3 times higher priority over flow 00.
only if that flow grabs both the vertical and the horizontal token corresponding to its column and its
row, respectively. Thus, in the first step, flow (0,0) has the top priority, flow (0,0)'s request is granted,
and flow (0,0) stops the propagation of both its vertical and horizontal token to prevent its same-input
and same-output flows from being granted in subsequent steps. In the second step, likely to have the
top priority are flows (0,1) and (1,0). However, flow (0,1) misses the horizontal token, hence it is not
granted. Thus, it propagates its tokens unchanged to its neighbors in both directions. In parallel, flow
(1,0) also propagates its tokens unchanged, because it is idle. Scheduling proceeds similarly, and a
match is computed in 2N − 1 steps. Notice that in step i, i flows are likely to have top priority, but
these flows are always non-conflicting. In the next scheduling decision, the initial top priority flow
changes, as in Fig. 2.11, to provide a degree of fairness. Observe in Fig. 2.11 that flow (0,0) is given six
times higher priority over flow (1,0), while flow (1,0) is given only three times higher priority over flow
(0,0). Thus, the Wavefront Scheduler is inherently unfair.
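The wavefront sweep can be sketched as follows. For brevity this illustrative model fixes the top-priority flow at (0,0) and omits the run-to-run priority rotation of Fig. 2.11; the per-row and per-column "free" flags play the role of the horizontal and vertical tokens:

```python
def wavefront_schedule(request):
    """Wavefront Scheduler over an n x n boolean request matrix.
    Grants propagate along anti-diagonals: entry (i, j) is granted
    iff it is requested and both its row token (row_free) and its
    column token (col_free) have not been consumed by an earlier
    wavefront. Returns {input: output} matched pairs."""
    n = len(request)
    row_free = [True] * n  # horizontal tokens, one per input row
    col_free = [True] * n  # vertical tokens, one per output column
    match = {}
    for wave in range(2 * n - 1):            # the 2n - 1 wavefront steps
        for i in range(max(0, wave - n + 1), min(n, wave + 1)):
            j = wave - i                      # cells with i + j == wave
            if request[i][j] and row_free[i] and col_free[j]:
                match[i] = j
                row_free[i] = col_free[j] = False
    return match
```

Cells on the same anti-diagonal occupy distinct rows and columns, so, as noted in the text, the flows examined within one step are always non-conflicting and can be decided in parallel.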
Traditionally, scheduling algorithms are evaluated for crossbar switches using simulation. The
simulated performance of the above algorithms is plotted in Fig. 2.12. We assume traffic of fixed-size
packets, uniformly destined over the outputs. Moreover, packet arrivals follow a Bernoulli process,
with probability corresponding to the traffic load. In Fig. 2.12(a), we plot average packet delay as a
function of input load. We observe that PIM with one or two iterations is unable to sustain a load greater
than 0.6 or 0.9, respectively. For larger loads, VOQs keep growing in time, and the switch becomes
unstable. On the other hand, iSLIP is stable at all loads, even using a single iteration. Furthermore, the
Wavefront Scheduler performs significantly better than both iSLIP and PIM. However, performance
converges as the number of iterations of PIM and iSLIP increases. In Fig. 2.12(b), we plot the standard
deviation of delay, which serves as a metric for fairness. Here, the Wavefront Scheduler performs
Figure 2.12: Simulated (a) average delay and (b) standard deviation of delay of PIM, iSLIP, and Wavefront
Scheduler (WFS) as a function of input load. N = 128.
worse than iSLIP, while PIM falls between iSLIP and the Wavefront Scheduler. (In Fig. 2.12, the standard
deviation of PIM delay is omitted for clarity.)
Finally, several other approaches to switch scheduling have been proposed. Marsan et al. [43]
proposed extensions of the above algorithms for operation on variable length packets. Kar et al. [44]
proposed a scheme that merges multiple external packets into large internal envelopes to relax the
running time of the scheduler. Kam and Kai-Yeung Siu [45] proposed weighted matching to provide
Quality of Service (QoS) guarantees. Ahuja et al. [46] studied multicast scheduling.
2.5 Routing
In multipath space switches, an input-output pair can be connected through multiple alternative
paths, intersecting with other paths connecting other pairs. Thus, once scheduling has resolved input
and output contention, routing is needed to resolve contention for internal paths. Like scheduling algorithms,
routing algorithms are heuristics, and they can be classified into the following three categories [47]:
Deterministic Routing algorithms make deterministic path selections. For example, in the Benes
switch (Fig. 2.5), each input selects round-robin one of the available paths for each output, and
in case of conflict with other inputs, a selection is made, e.g. again round-robin. Deterministic
routing algorithms exploit path diversity ineffectively, thus suffering performance issues [47].
Randomized Routing algorithms randomly select an intermediate node, and then make a deterministic
selection between the available paths. Randomized routing algorithms perform well
under non-uniform traffic, but their performance degrades under uniform traffic [47].
Adaptive Routing algorithms aim at combining the merits of deterministic and randomized algorithms
by using network state to select among paths. However, practical adaptive routing
algorithms access only local state, thus often making globally sub-optimal decisions [47].
2.6 Virtual Channels
Scheduling algorithms are widely used to coordinate crossbars. However, multipath switches are
harder to coordinate because they need routing in addition to scheduling. Thus, multipath switches
are only implicitly coordinated, using memories inside the switching elements. An example queued
mesh is shown in Fig. 2.13, where we consider a scenario in which inputs B and I are contending
for output G; we assume dimension-ordered routing, and we show only the queues related to our
scenario. Thus, packets from B follow the path B-C-G, and packets from I the path I-J-K-G. As memory
space is finite, congestion starts from G and spreads backwards. Using credit flow control, this
is realized by the fact that credits are consumed at the upstream nodes at a rate greater than the
rate at which they are released by their downstream neighbors. In this way, rates are implicitly regulated.
Figure 2.13: Example contention situation in a 12×12 queued mesh. Flows from B and I are contending for G, blocking a flow from I destined to an idle output D. Dimension-ordered routing is assumed.
Queues also resolve contention for internal paths. For example, a route from A to D, sharing the path
segment B-C with the route from B to G, is throttled similarly.
A hard problem in queued networks is HoL blocking. In particular, in Fig. 2.13, a third connection,
from I to D, is blocked behind the connection from I to G even when its destination D is idle.
Unfortunately, it is infeasible to resolve this problem using VOQ: while it is affordable to implement
O(N) queues at each input of the switch, per-flow queueing inside the switching elements requires
O(N²) queues inside each switching element. Thus, in practice, heuristics are employed.
An example of such a heuristic is illustrated in Fig. 2.14(a), where the connection from I to D escapes
blocking using a second network. Another solution is the popular Virtual Channels [30]. The main
idea is that instead of duplicating the whole network, only the queues are duplicated, as in Fig. 2.14(b),
while links are shared; hence the name Virtual Channels. The comparison of network duplication and
Virtual Channels is technology dependent, determined by the relative cost of links and memories.
2.7 High Radix
A major limiting factor in today's chips is power consumption. A high-speed serial off-chip transceiver
implemented as a differential pair at 3.125 Gbaud consumes on the order of 150 mW [48]. Thus, for a
state-of-the-art port rate of 10 Gb/s, a few hundred ports cost on the order of a few tens of Watts.
The only way to scale up to larger fabrics is by connecting switch chips in multistage topologies, like
the Benes.
Figure 2.14: Reducing blocking using (a) a duplicated network and (b) two Virtual Channels.

Kim et al. [15] showed that it is more effective to implement switching elements with a large
number of slow ports rather than a small number of fast ports, because higher-radix switch chips enable
lower-diameter fabrics. Their argument is illustrated in Fig. 2.15. Consider a radix-4 Benes
switch, and suppose that advances in signaling technology double the I/O rate that is feasible on a single
chip. There are then two options to exploit this. First, one can keep the radix of the switching elements
and double the rate of their ports; this doubles the rate of the end ports of the fabric. Second,
one can keep the port rate of the switching elements and double their radix; this merges multiple
switching elements onto a single chip, converting the Benes topology to a Clos topology. Notice that
the end port rate doubles by arranging switching elements in parallel. (We assume that the chips at the end
ports implementing demultiplexing and multiplexing introduce negligible overhead.) Comparing the
two options, the second one is better because it reduces hop count and chip count. Lower hop count
translates to lower latency and power consumption, and lower chip count to lower cost [15].
Figure 2.15: High radix switching elements exploit increases in chip IO throughput more effectively
by reducing the diameter of the switch fabric [15].
2.8 Summary
This chapter described basic concepts in switch design. An interconnection switch provides scalable
and distributed communication between the parties of a digital system. Time switches, like the
memory switch, implement communication by switching data through a single point in space at different
moments in time. While time switches optimize resource utilization, they are non-scalable.
Space switches, like the crossbar and the Benes switch, are more scalable because they provide parallel paths
in space. However, space switches need scheduling and routing to resolve contention for these paths.
Compared to other space switches, the crossbar has the advantage of simplifying routing by providing
a private path for each input-output pair. However, switch scheduling is a hard problem on its own. Of
the scheduling algorithms known in the literature, iSLIP is the most efficient, providing both high
throughput and good fairness properties. Scheduling and routing can be simplified by using queues
internal to the switching elements. Finally, high-radix switching elements are the right tradeoff in
modern technology because they reduce the diameter of the switch fabric.
Chapter 3
A Comparison of Architectures
for High-Radix Switches
In this chapter, we first overview basic switch architectures: shared memory, block crosspoint
queueing, output queueing, crosspoint queueing, and input queueing (section 3.1). We then compare
the performance of these architectures (section 3.2) and their feasibility using on-chip SRAM
technology (section 3.3). We show that only the input-queued crossbar scales to high radices, by minimizing
both the individual and the aggregate throughput of the memories. Because the traditional
input-queued crossbar is difficult to schedule at high radices, we consider the combined input-output
queued crossbar, which compensates for scheduling inefficiencies by moderately overprovisioning
the crossbar and the memories (section 3.4). Moreover, we compare the combined input-output
queued crossbar to the hierarchically queued crossbar, recently proposed for high-radix switches, and
we show that the combined input-output queued crossbar is advantageous because it gives better
performance using only a moderate speedup of the crossbar and the memories, independent of radix
(section 3.5). Finally, we discuss related work (section 3.6) and conclude (section 3.7).
3.1 Basic Switch Architectures
We first overview four basic time switch architectures: (i) Crosspoint Queueing (XQ), (ii) Output Queueing
(OQ), (iii) Block Crosspoint Queueing (BXQ), and (iv) Shared Memory (SM). Next, we compare
these architectures to the Input Queued (IQ) space switch.
Crosspoint Queueing (XQ) The XQ architecture (Fig. 3.1(a)) switches packets from N inputs to N
outputs using N² memories. Each input selects which memory to write its head packet to according
to the destination output of that packet, and each output selects which memory to read a packet from
according to a predetermined policy, e.g. weighted round-robin. Thus, XQ additionally uses one
Figure 3.1: Time switch architectures. (a) Crosspoint Queueing (XQ). (b) Output Queueing (OQ). (c) Block Crosspoint Queueing (BXQ-2). (d) Shared Memory (SM).
bus per input to route packets to the memories of that input, and N crosspoints coupled with one N:1
arbiter per output to forward packets to that output. Because all memories operate at the rate of the inputs
and outputs, XQ can also be considered a degeneration of a time switch into a space switch. Summarizing, XQ
uses N² memories, each with a throughput of 2R, for an aggregate of 2N²R. XQ designs were proposed
by Abel et al. [13] and by Katevenis et al. [14].
Output Queueing (OQ) Compared to XQ, the OQ architecture (Fig. 3.1(b)) allows better sharing of
memory space by merging the memories of each output into a single memory, dedicated to that output.
The write access rate of that memory is NR, to resolve any contention between the inputs for the
output. Thus, the crosspoints and arbiters of XQ are removed. Moreover, using a single FIFO queue
per memory suffices to provide both high output utilization and a degree of fairness. In summary, OQ
uses N memories, each with a throughput of (N+1)R, for an aggregate of (N+1)NR. OQ designs were
proposed by Yeh et al. [49].
Block Crosspoint Queueing (BXQ) The BXQ architecture (Fig. 3.1(c)) results from XQ by merging
the k² memories between k distinct inputs and k distinct outputs into a single memory block. The write
access rate of that memory is kR, to resolve any contention between the k inputs for the k outputs, and
the read access rate is also kR, to fully utilize the k outputs. Each memory block must also implement
at least one FIFO per local output, to remove HoL blocking. Thus, the inputs of the same memory block share
the space of that block, in analogy to OQ. The space of the block can also be shared between its outputs.
However, this type of sharing increases complexity by requiring queues to be implemented as linked
lists; in contrast, to implement FIFO queues, simple circular arrays suffice. Finally, BXQ uses N/k
crosspoints coupled with one N/k:1 arbiter per output to multiplex the memory blocks of that output.
Thus, BXQ is a combination of a time and a space switch. Summarizing, BXQ-k uses (N/k)² memories,
each with a throughput of 2kR, for an aggregate of 2N²R/k.
Shared Memory (SM) By varying the parameter k of BXQ from 1 to N, we get intermediate solutions
from complete partitioning (XQ) to complete sharing; the latter solution is widely known as shared
memory (SM, Fig. 3.1(d)). Thus, SM uses a single memory with a throughput of 2NR. SM designs
were proposed by Devault et al. [22], by Katevenis et al. [23], and by Kozaki et al. [24]. (We described
SM in more detail in chapter 2.)
Now let us compare the above time switches to the Input Queued (IQ) space switch (chapter
2). An IQ switch uses N memories, like OQ, but it places these memories at the inputs instead of the
outputs. Thus, the physical partitioning of memories in IQ is analogous to OQ. However, while in
OQ sharing is between the inputs across the outputs, in IQ sharing is between the outputs across the
inputs. As a consequence, like BXQ and SM, IQ also needs linked lists to implement sharing. Moreover,
the aggregate memory throughput of IQ is 2NR, equal to that of SM. Finally, while in time switches
rates are allocated through memory space, IQ strongly depends on the efficiency of scheduling.
While IQ can use any space switch, in the rest of this thesis we will assume that IQ uses a
crossbar, to simplify routing.
Fig. 3.2 shows a conceptual derivation of the above architectures through alternative groupings of N² memory blocks. Observe that the throughput of each memory is proportional to the periphery of the rectangle enclosing the blocks, while space is proportional to the area of that rectangle. Fig. 3.2, as well as the observation on throughput geometry, is copied from the transparencies of the Packet Switch Architecture class at the University of Crete [18]. Also copied from there is the metric of aggregate memory throughput. However, in Section 3.3, we contribute a practical application of that metric: in particular, we show that the minimum total memory area needed to implement a switch architecture is proportional to the aggregate memory throughput of that architecture. Thus, architectures like XQ are costly.
Figure 3.2: Derivation of switch architectures: Output Queueing (OQ), Shared Memory (SM), Input Queueing (IQ), Block Crosspoint Queueing (BXQ-4), and Crosspoint Queueing (XQ).
3.2 Memory-Sharing Merits
Switches that allow better sharing of memory space can improve performance under a broad range of traffic conditions, because memory space is allocated on demand, which virtually increases the memory space. Equivalently, the more the sharing, the less memory space is needed to achieve a fixed level of performance. We first examine the effect of memory-space sharing on performance by comparing, through simulation, the time switch architectures described in Section 3.1. Next, we study a second type of memory sharing, which we call queue sharing.
In order to quantify the effect of memory-space sharing on performance, we evaluate the rate of packet losses as a function of memory space under Bernoulli traffic of fixed-size, uniformly-destined packets. In this approach, the better the sharing of memory space, the less memory space is needed to achieve a fixed packet loss rate [26]. While real traffic patterns may be considerably more stressful, including bursts and hot spots, the results described in this section are fundamental, and likely to be found within more complicated scenarios as well. Packet loss rates are plotted in Fig. 3.3 for a range of link loads. First, we observe that packet loss rates increase with load in all architectures, as contention for switch outputs increases correspondingly, and more packets have to be queued. Second, we observe that architectures allowing better sharing of memory space require a smaller memory space to achieve a given packet loss rate. Thus, XQ has the worst performance and SM the best, while OQ falls in between, and BXQ is better or worse than OQ depending on block size. Finally, at high loads XQ requires about 5× larger memory space to achieve the performance of SM.
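As a rough illustration of this methodology, the following Monte Carlo estimates the packet loss rate of an OQ switch with a finite FIFO per output under Bernoulli uniform traffic. This is a minimal sketch under assumed parameters, not the simulator, arbiters, or configuration used for Fig. 3.3:

```python
import random

def oq_loss_rate(N=8, space=16, load=0.8, T=100_000, seed=1):
    """Monte Carlo estimate of packet loss rate for an OQ switch.
    Assumptions: each output has its own FIFO of `space` packets; every
    input generates a packet with probability `load` per slot, destined
    uniformly; each output drains one packet per slot."""
    random.seed(seed)
    q = [0] * N                          # per-output queue occupancy
    arrived = lost = 0
    for _ in range(T):
        for _ in range(N):               # one arrival chance per input
            if random.random() < load:
                arrived += 1
                d = random.randrange(N)  # uniform destination
                if q[d] < space:
                    q[d] += 1
                else:
                    lost += 1            # queue full: packet dropped
        for o in range(N):               # each output serves one packet
            if q[o]:
                q[o] -= 1
    return lost / arrived
```

Sweeping `load` and `space` with such a loop reproduces the qualitative trend of Fig. 3.3: loss rates rise steeply with load for a fixed memory space.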
Figure 3.3: Packet loss rates as a function of memory space, for loads 0.70, 0.80, 0.90, and 0.99. N = 8 and memory space is the total divided by N. BXQ and XQ use round-robin arbiters. Simulated time is 10^6 packet times.

Figure 3.4: Queueing delay as a function of input load. N = 8 and memory space is infinite. IQ uses VOQ and 3-iteration SLIP (3SLIP).
In Fig. 3.4, we plot queueing delay when memory space is infinite. Then, XQ, OQ, BXQ, and SM all degenerate to an M/D/1 queueing system [6]. We also plot the performance of IQ using VOQ and 3SLIP. At low loads, performance is comparable for all switches, as there is low contention, and only a few packets are queued. At high loads, delay in IQ is significantly higher, as packets are contending for
both the inputs and the outputs, while in any time switch contention is for the outputs only.

Figure 3.5: Queue sharing in a 3×3 switch with a total memory space of 30 queues per input. (a) In IQ, queues are flexibly allocated on demand; (b) in XQ, queues are statically partitioned per crosspoint. IQ increases performance by reducing blocking.
Finally, memory sharing affects the performance of queue sharing mechanisms in larger fabrics [30][50]. In particular, consider that the 3×3 switches of Fig. 3.5 are elements of a larger Clos fabric. Also consider that technology limits memory space to a maximum of 30 queues per switch input. In IQ, queues are concentrated at the inputs; thus, there are 30 queues at each input. In XQ, queues are distributed across the crosspoints; thus, there are 10 queues at each crosspoint. Hence, when same-input flows are unevenly distributed over the outputs, queues are better utilized in IQ, and blocking is reduced. Notice that this queue sharing is a variation of the byte sharing we described above.
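The blocking effect can be illustrated with a toy allocator. This is hypothetical code for the scenario of Fig. 3.5, not the mechanism of [30][50]: a shared per-input budget of 30 queues (IQ) versus a static cap of 10 queues per crosspoint (XQ):

```python
# Toy queue allocator for one input of the 3x3 switch of Fig. 3.5.
def unserved_flows(demand, per_crosspoint_limit=None, per_input_budget=30):
    """demand[o] = number of flows from this input to output o.
    IQ: queues drawn freely from the shared per-input budget (limit=None).
    XQ: at most `per_crosspoint_limit` queues per (input, output) pair.
    Returns the number of flows left without a queue (blocked)."""
    blocked = 0
    budget = per_input_budget
    for flows in demand:
        cap = flows if per_crosspoint_limit is None else min(flows, per_crosspoint_limit)
        granted = min(cap, budget)   # queues actually allocated
        budget -= granted
        blocked += flows - granted
    return blocked

demand = [25, 5, 0]   # uneven: 25 flows to output 0, 5 to output 1
assert unserved_flows(demand) == 0                            # IQ: all 30 fit
assert unserved_flows(demand, per_crosspoint_limit=10) == 15  # XQ blocks 15
```

Under the uneven demand above, IQ's shared budget serves every flow, while XQ's static 10-queue partitions leave 15 flows blocked.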
3.3 On-Chip SRAM Cost
In this section, we evaluate implementation cost in on-chip SRAM technology. We consider a 90 nm CMOS process, where SRAM is available in blocks, and we decide the feasibility of a switch architecture by evaluating two metrics: (i) total silicon area to implement all memories; and (ii) individual memory width. State-of-the-art technology constrains the core of a chip to less than 400 mm², and smaller chips are typically less expensive [51]. Moreover, the budget for memories is a major cost. For example, Katevenis et al. [14] described a 180 nm CMOS implementation of XQ, where memory area accounted for as much as 70% of the switch core. Thus, we bound the feasible memory area to less than 200 mm². On the other hand, memory throughput is expanded by increasing the word width. When this becomes larger than the maximum width of the available SRAM blocks, one can arrange multiple SRAM blocks in parallel. In any case, memory width is bounded by the minimum external packet size. We will consider a maximum width of 64 Bytes, corresponding to minimum Ethernet packets [52]. Summarizing, a switch architecture is feasible when (i) total memory area is smaller than 200 mm², and (ii) individual memory width is smaller than 64 Bytes.
Figure 3.6: On-chip memory performance in 90 nm CMOS. (a) Single-port SRAM block area (2-port blocks are 20% to 60% larger, and are omitted here), (b) speed as a function of block height, (c) total memory capacity fitting in 200 mm² as a function of block size, and (d) memory throughput per pin as a function of block size.
We first plot 90 nm CMOS SRAM block performance in Fig. 3.6. Word width varies from 16 bits to 128 bits, and the number of words from 256 to 16K. We observe in Fig. 3.6(a) that as block size increases, block area increases as well, to accommodate more SRAM bit cells. However, area increases sub-linearly: larger blocks are more area-efficient because their peripheral overhead (e.g. address decoders, column multiplexors, sense amplifiers) is amortized over a larger core. Furthermore, in Fig. 3.6(b), we observe that as block size increases, SRAM blocks become slower, as internal bit-line and word-line capacitances increase accordingly (the smaller, the faster [51]). In Fig. 3.6(c), we show the memory capacity we can fit in 200 mm². We observe that, except for corner cases, capacity depends mainly on the size of SRAM blocks, rather than on their configuration; the corner cases are wide, small blocks, where extra sense amplifiers result in disproportionate overheads. We also observe that, using the largest blocks, it is feasible to implement as many as 100 Mbits of memory
space. Finally, in Fig. 3.6(d), we show the throughput of each memory block in Mb/s per pin. Again, throughput depends mainly on block size, rather than on block configuration. Moreover, the maximum throughput we can get from a single SRAM block is less than 60 Gb/s, using the smallest blocks.

Figure 3.7: (a) Memory throughput and space expansion, and (b) memory space expansion. A and X denote the area and the throughput of the memory, and a and x the area and throughput of an individual block, respectively.
To build a memory of throughput X, we need X/x parallel SRAM blocks, where x is the throughput of one SRAM block (see Fig. 3.7(a)). If a is the area of the SRAM block, the memory area is (X/x)·a. Then, if M is the number of memories, the total memory area is M·(X/x)·a; this must be smaller than 200 mm². Notice that total memory area is proportional to aggregate memory throughput. As a consequence, aggregate memory throughput is a realistic cost metric. Finally, if w is the block width, the width of each memory is (X/x)·w. Summarizing,

    Total Memory Area = M · (X/x) · a,
    Memory Width = (X/x) · w,

where x, a, and w are the throughput, area, and width of one SRAM block, respectively.
Table 3.1 summarizes the values of X and M for each switch architecture. By substituting the SRAM cost numbers of Fig. 3.6, we get the SRAM cost of each switch architecture.
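The cost model can be applied mechanically. The sketch below plugs Table 3.1 into the two formulas; the SRAM block parameters are illustrative placeholders, not the measured 90 nm values of Fig. 3.6:

```python
import math

def memory_cost(M, X, x, a, w):
    """Total memory area and individual memory width for M memories of
    throughput X, each built from ceil(X/x) parallel SRAM blocks of
    throughput x, area a (mm^2), and width w (Bytes)."""
    blocks = math.ceil(X / x)
    return M * blocks * a, blocks * w

# Placeholder SRAM block: 40 Gb/s, 0.1 mm^2, 4-Byte words (NOT Fig. 3.6 data).
x, a, w = 40e9, 0.1, 4
N, R, k = 128, 10e9, 8

# (M, X) per architecture, from Table 3.1.
archs = {
    "XQ":  (N * N,         2 * R),
    "OQ":  (N,             (N + 1) * R),
    "BXQ": ((N // k) ** 2, 2 * k * R),
    "SM":  (1,             2 * N * R),
    "IQ":  (N,             2 * R),
}
for name, (M, X) in archs.items():
    area, width = memory_cost(M, X, x, a, w)
    feasible = area < 200 and width <= 64
    print(f"{name}: {area:7.1f} mm^2, {width:4d} Bytes, feasible: {feasible}")
```

With these placeholder numbers, SM fails the width bound and XQ the area bound, while IQ and BXQ-8 pass both, mirroring the trends of Fig. 3.8 and Fig. 3.9.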
Table 3.1: Number of memories (M) and throughput per memory (X). R = 10 Gb/s.

                        switch class
            XQ      OQ        BXQ       SM      IQ
    M       N²      N         (N/k)²    1       N
    X       2R      (N+1)R    2kR       2NR     2R

Figure 3.8: Minimum total area to implement the memories of each switch architecture as a function of N. Panels correspond to SRAM block sizes from 4 Kbit to 512 Kbit. Area is proportional to aggregate memory throughput. Only IQ and SM scale above 100 ports.

Figure 3.9: Minimum width of an individual memory of each of the switch architectures. Panels correspond to SRAM block sizes from 4 Kbit to 512 Kbit. Memory width is proportional to memory throughput. Only IQ and XQ scale above 100 ports.

Total memory area and individual memory width are plotted in Fig. 3.8 and Fig. 3.9, respectively. We observe in Fig. 3.8 that area grows as O(N²) for XQ and OQ, and as O(N) for IQ and SM, following the aggregate throughput of memories. Thus, area is the same for both IQ and SM, well below 200 mm². On the other hand, XQ and OQ do not scale above 32 and 64 ports, respectively, and to scale to these radices they must use small SRAM blocks. Furthermore, we observe in Fig. 3.9 that memory width grows as O(N) for SM and OQ, while it remains constant, independent of N, for IQ and XQ, following
the individual throughput of memories. Thus, for IQ and XQ, memory width is smaller than the external packets for all N, while for SM and OQ it grows quickly, limiting these architectures to radices below 8 or 16, respectively. Combining the plots of Fig. 3.8 and 3.9, we conclude that, of IQ, XQ, OQ, and SM, only IQ scales to radices above 100.

Figure 3.10: Minimum total area to implement the memories of BXQ-8 as a function of N. Panels correspond to SRAM block sizes from 4 Kbit to 512 Kbit. BXQ does scale above 100 ports, but only using the smallest SRAM blocks.

Figure 3.11: Total memory capacity in 200 mm² for each of the switch architectures.
Another scalable architecture is BXQ. We plot total memory area for BXQ-8 in Fig. 3.10. In this plot, BXQ uses memory blocks corresponding to the largest feasible SM. We observe that BXQ does scale above 100 ports, but it can do so only using the smallest SRAM blocks. However, this has a negative impact on memory density, as plotted in Fig. 3.11. In this plot, each architecture has the density of the
maximum SRAM block it can use (when memory area is smaller than 200 mm², spare space is utilized
t