Towards Simple, High-performance Input-Queued Switch Schedulers
Devavrat Shah, Stanford University
Berkeley, Dec 5
Joint work with Paolo Giaccone and Balaji Prabhakar
Outline
• Description of input-queued switches
• Scheduling
  – the problem
  – some history
• Simple, high-performance schedulers
  – Laura
  – Serena
  – Apsara
• Conclusions
The Input-Queued (IQ) Switch Architecture
• N inputs, N outputs (in fig, N = 3)
• Time is slotted
  – at most one packet can arrive per time-slot at each input
• Equal sized cells/packets
• Buffers only at inputs
• Use a crossbar for switching packets
Scheduling
• The crossbar imposes these constraints: in each time-slot
  – only one packet can be transferred to each output
  – only one packet can be transferred from each input
• The scheduling problem: subject to the above constraints, find a matching of inputs and outputs
  – i.e. determine which output will receive a packet from which input in each time slot
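The crossbar constraints can be checked mechanically. The sketch below assumes a schedule is stored as a list where `schedule[i]` is the output matched to input `i` (a full matching, i.e. a permutation); the function name and this encoding are illustrative choices, not taken from the slides.

```python
from typing import Sequence


def is_valid_schedule(schedule: Sequence[int]) -> bool:
    """Return True if schedule[i] = j is a legal crossbar configuration.

    Each input appears exactly once by construction (one list entry per
    input), so it only remains to check that no output is used twice.
    """
    n = len(schedule)
    in_range = all(0 <= j < n for j in schedule)
    return in_range and len(set(schedule)) == n


print(is_valid_schedule([2, 0, 1]))  # every output used exactly once
print(is_valid_schedule([0, 0, 1]))  # output 0 requested by two inputs
```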
Background to switch scheduling
1. [Karol et al. 1987] Throughput is limited due to head-of-line blocking (limited to 58% for Bernoulli IID uniform traffic)
2. [Tamir 1989] Observed that with “Virtual Output Queues” (VOQs) head-of-line blocking is eliminated.
Basic Switch Model
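[Figure: N × N switch. Arrival processes A11(t), ..., A1N(t), ..., AN1(t), ..., ANN(t) feed queues with occupancies L11(t), ..., LNN(t); the schedule S(t) configures the crossbar; departure processes D1(t), ..., DN(t) leave the outputs]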
Some definitions
1. Traffic matrix: $\Lambda = [\lambda_{ij}]$, where $\lambda_{ij} = E[A_{ij}(n)]$.
   If $\sum_i \lambda_{ij} < 1$ and $\sum_j \lambda_{ij} < 1$, we say the traffic is "admissible."
2. Service matrix: $S = [s_{ij}]$, where $s_{ij} \in \{0, 1\}$ and $S$ is a permutation matrix.
3. Queue occupancies: $L_{11}(t), \ldots, L_{NN}(t)$
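As a small illustration of the admissibility condition, the sketch below checks that every row sum and every column sum of a rate matrix is strictly below 1. The function name and the list-of-lists representation are assumptions made for this example.

```python
from typing import List


def is_admissible(rates: List[List[float]]) -> bool:
    """Check that no input (row) and no output (column) is oversubscribed."""
    rows_ok = all(sum(row) < 1.0 for row in rates)
    cols_ok = all(sum(col) < 1.0 for col in zip(*rates))
    return rows_ok and cols_ok


print(is_admissible([[0.5, 0.4], [0.4, 0.5]]))  # all sums are 0.9
print(is_admissible([[0.6, 0.5], [0.1, 0.2]]))  # first row sums to 1.1
```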
More background on theory
[Anderson et al. 1993] Finding a schedule is equivalent to finding a matching in the bipartite graph induced by the input and output nodes
Background
[McKeown et al. 1995]
(a) Maximum size match does not give 100% throughput.
(b) But maximum weight match (MWM) can, where the weight can be the queue length or the age of a cell.
[Figure: bipartite graph with weighted edges illustrating the MWM]
Maximum Weight Matching
• Maximum weight matching (MWM)
  – 100% throughput
  – provable delay bounds for i.i.d. Bernoulli admissible traffic
  – but, finding MWM is like solving a network-flow problem whose complexity is $O(N^3)$, too complex for high-speed networks
• We seek to approximate maximum weight matching
• Our goal:
  – obtain a simply implementable approximation to MWM that performs competitively with MWM
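To make the object being approximated concrete, here is a brute-force sketch that finds the MWM of a tiny switch by enumerating all N! matchings. It is for illustration only: it is far slower than the $O(N^3)$ network-flow approach mentioned above, and the helper names and the example queue lengths are made up.

```python
from itertools import permutations
from typing import List, Sequence, Tuple


def weight(queue_len: List[List[int]], matching: Sequence[int]) -> int:
    """Sum of queue lengths on the edges (input i -> output matching[i])."""
    return sum(queue_len[i][j] for i, j in enumerate(matching))


def max_weight_matching(queue_len: List[List[int]]) -> Tuple[Tuple[int, ...], int]:
    """Exhaustive search over all N! permutations (only viable for small N)."""
    n = len(queue_len)
    best = max(permutations(range(n)), key=lambda m: weight(queue_len, m))
    return best, weight(queue_len, best)


queue_len = [[3, 1, 0],
             [2, 5, 1],
             [0, 4, 2]]
print(max_weight_matching(queue_len))  # ((0, 1, 2), 10)
```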
Approximating MWM
• Two performance measures
  – throughput
  – delay
• We first consider simple approximations to MWM that deliver 100% throughput (i.e. stability), and then deal with delay
Methods of Approximation
• Randomization
  – well-known method for simplifying implementation
• Using information in packet arrivals
  – since queue-sizes grow due to arrivals, and arrival times are a source of randomness
• Hardware parallelism
  – yields an efficient search procedure
Randomization
• The main idea of randomized algorithms is
  – to simplify the decision-making process by basing decisions upon a small, randomly chosen sample from the state rather than upon the complete state
An Illustrative Example
• Find the oldest person from a population of 1 billion
• Deterministic algorithm: linear search
  – has a complexity of 1 billion
• A randomized version: find the oldest of 30 randomly chosen people
  – has a complexity of 30 (ignoring complexity of random sampling)
• Performance
  – linear search will find the absolute oldest person (rank = 1)
  – if R is the person found by the randomized algorithm, we can make statements like P(R has rank < 100 million) > 0.95, since $1 - (9/10)^{30} \approx 0.958$
  – thus, we can say that the performance of the randomized algorithm is very good with a high probability
Randomizing Iterative Schemes
• Often, we want to perform some operation iteratively
• Example: find the oldest person each year
• Say in 2001 you choose 30 people at random
  – and store the identity of the oldest person in memory
  – in 2002 you choose 29 new people at random
  – let R be the oldest person from these 29 + 1 = 30 people
P(R has rank < 100 million) $\ge 1 - (9/10)^{59}$
or, P(R has rank < 50 million) $\approx 1 - (9/10)^{30}$
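The probabilities in the oldest-person example can be reproduced with a short calculation. The sketch assumes independent, uniformly random samples and an unchanged population between the two years, so that remembering last year's winner behaves like 30 + 29 = 59 samples; the function name is made up for this illustration.

```python
def prob_top_fraction(samples: int, fraction: float) -> float:
    """P(at least one of `samples` independent uniform draws is in the top `fraction`)."""
    return 1.0 - (1.0 - fraction) ** samples


one_shot = prob_top_fraction(30, 0.10)        # 30 fresh samples, top 100 million
with_memory_10 = prob_top_fraction(59, 0.10)  # 59 effective samples, top 100 million
with_memory_05 = prob_top_fraction(59, 0.05)  # 59 effective samples, top 50 million
print(f"{one_shot:.4f} {with_memory_10:.4f} {with_memory_05:.4f}")
```

With memory, the same sampling effort per year gives either a higher confidence for the same rank target or a similar confidence for a rank target twice as strict.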
Back to Switch Scheduling: Randomizing MWM
• Choose d matchings at random and use the heaviest one as the schedule
• Ideally we would like to have small d. However:
• Theorem: Even with d = N this algorithm doesn’t yield 100% throughput!
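The randomized scheduler described on this slide can be sketched as follows. The helper names are hypothetical, a matching is assumed to be a list mapping each input to an output, and the weight of an edge is assumed to be the corresponding queue length.

```python
import random
from typing import List, Sequence


def weight(queue_len: List[List[int]], matching: Sequence[int]) -> int:
    """Sum of queue lengths on the edges (input i -> output matching[i])."""
    return sum(queue_len[i][j] for i, j in enumerate(matching))


def heaviest_of_random(queue_len: List[List[int]], d: int,
                       rng: random.Random) -> List[int]:
    """Draw d matchings uniformly at random and return the heaviest one."""
    n = len(queue_len)
    candidates = [rng.sample(range(n), n) for _ in range(d)]  # random permutations
    return max(candidates, key=lambda m: weight(queue_len, m))


queue_len = [[3, 1, 0],
             [2, 5, 1],
             [0, 4, 2]]
schedule = heaviest_of_random(queue_len, d=3, rng=random.Random(1))
print(schedule, weight(queue_len, schedule))
```

Nothing is carried over between time slots here, which is the weakness the theorem points at.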
Proof
Simulation Scenario
• Switch size: 32 × 32
• Input traffic (shown for a 4 × 4 switch)
  – Bernoulli i.i.d. inputs
  – diagonal load matrix:
$$\begin{pmatrix} x & y & 0 & 0 \\ 0 & x & y & 0 \\ 0 & 0 & x & y \\ y & 0 & 0 & x \end{pmatrix}$$
    • normalized load = x + y < 1
    • x = 2y
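A minimal sketch of how the diagonal load matrix of this scenario could be built for any switch size, given the normalized load x + y and the relation x = 2y. The function name is an assumption for this example.

```python
from typing import List


def diagonal_load_matrix(n: int, load: float) -> List[List[float]]:
    """Rate x on the diagonal and y on the (wrapped) super-diagonal, with x = 2y."""
    y = load / 3.0
    x = 2.0 * y
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        matrix[i][i] = x
        matrix[i][(i + 1) % n] = y  # the last row wraps around to column 0
    return matrix


for row in diagonal_load_matrix(4, 0.9):
    print([round(v, 2) for v in row])
```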
[Figure: Diagonal Traffic. Mean IQ Len (log scale, 0.001 to 10000) versus Normalized Load (0.0 to 1.0) for MWM, R32 and R1]
Crucial Observation
• The state of the switch changes due to arrivals & departures
• Between consecutive time slots, a queue's length can change at most by 1
  – hence a heavy matching tends to stay heavy
• Therefore
  – "remembering" a heavy matching should help in improving the performance
Tassiulas’ Algorithm
• [Tassiulas 1998] proposed the following algorithm based on this observation:
  – let S(t-1) be the matching used at time t-1
  – let R(t) be a matching chosen uniformly at random
  – and let S(t) be the heavier of R(t) and S(t-1)
• This gives 100% throughput!
  – note that the boost in throughput is due to the use of memory
• But, delays are very large
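One time slot of the algorithm above can be sketched as follows, assuming a matching is a list mapping inputs to outputs and that weights are evaluated on the current queue lengths. The function names are illustrative.

```python
import random
from typing import List, Sequence


def weight(queue_len: List[List[int]], matching: Sequence[int]) -> int:
    """Sum of queue lengths on the edges (input i -> output matching[i])."""
    return sum(queue_len[i][j] for i, j in enumerate(matching))


def tassiulas_step(queue_len: List[List[int]], previous: Sequence[int],
                   rng: random.Random) -> List[int]:
    """S(t) is the heavier of a uniformly random matching R(t) and S(t-1)."""
    n = len(queue_len)
    candidate = rng.sample(range(n), n)  # R(t): uniform random permutation
    if weight(queue_len, candidate) > weight(queue_len, previous):
        return candidate
    return list(previous)


queue_len = [[3, 1, 0],
             [2, 5, 1],
             [0, 4, 2]]
schedule = [2, 1, 0]
rng = random.Random(0)
for _ in range(5):
    schedule = tassiulas_step(queue_len, schedule, rng)
print(schedule, weight(queue_len, schedule))
```

With the queue lengths held fixed, as in this toy run, the weight of the remembered schedule can only stay the same or increase.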
[Figure: Diagonal Traffic. Mean IQ Len (log scale, 0.01 to 10000) versus Normalized Load (0.1 to 1.0) for MWM and Tassiulas]
Derandomization
• Let G be a fully-connected graph where each node is one of the N! possible schedules
• Construct a Hamiltonian walk, H(t), on G
  – H(t) cycles through the nodes of G
• At any time t
  – let R(t) = H(t mod N!)
  – and let S(t) be the heavier of R(t) and S(t-1)
  – this also has 100% throughput, but delays are large (derandomization will be useful later)
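A sketch of the derandomized variant. Because G is fully connected, any fixed ordering of the N! matchings is a valid Hamiltonian walk; the lexicographic order produced by `itertools.permutations` is used here purely as a convenient assumption, and the queue lengths are frozen for the demonstration.

```python
from itertools import cycle, permutations
from typing import Iterator, List, Sequence, Tuple


def weight(queue_len: List[List[int]], matching: Sequence[int]) -> int:
    """Sum of queue lengths on the edges (input i -> output matching[i])."""
    return sum(queue_len[i][j] for i, j in enumerate(matching))


def derandomized_step(queue_len: List[List[int]], previous: Tuple[int, ...],
                      walk: Iterator[Tuple[int, ...]]) -> Tuple[int, ...]:
    """S(t) is the heavier of R(t) = H(t mod N!) and S(t-1)."""
    candidate = next(walk)
    if weight(queue_len, candidate) > weight(queue_len, previous):
        return candidate
    return previous


n = 3
walk = cycle(permutations(range(n)))  # visits all N! matchings, then repeats
queue_len = [[3, 1, 0],
             [2, 5, 1],
             [0, 4, 2]]
schedule = (2, 1, 0)
for _ in range(6):  # one full pass over the 3! = 6 matchings
    schedule = derandomized_step(queue_len, schedule, walk)
print(schedule)  # (0, 1, 2)
```

With static queues, one full pass of the walk is guaranteed to reach the MWM; in a running switch the queues keep changing, which is where the bounded-gap argument for stability comes in.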
Stability
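• Lemma: Consider an IQ switch with Bernoulli i.i.d. inputs. Let B be a matching algorithm which ensures $W_B(t) \ge W^*(t) - c$ for every t, where $W^*(t)$ is the weight of the maximum weight matching at time t and c is a constant. Then B is stable.
• Theorem: $W_{DER}(t) \ge W^*(t) - 2N \cdot N!$. Therefore, the derandomized algorithm is stable.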
Delay
• These simple approximations of MWM yield 100% throughput, but delays are large
• To obtain good delays we'll present three different algorithms which use the following features:
  – selective remembrance: Laura
  – information in the arrivals: Serena
  – hardware parallelism: Apsara
Laura
[Figure: block diagram. S(t-1) and R(t) feed COMP, which outputs S(t); S(t) is used as S(t-1) the next time]
Tassiulas
• COMP = Maximum
• R(t): uniform sample
Laura
• COMP = Merge, which picks the best edges of the two matchings
• R(t): non-uniform sample
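Merging Procedure
[Figure: S(t-1), with edge weights 10, 10, 10, 70, 60, is merged with R, with edge weights 50, 40, 30, 10, 20, to give S(t)]
• W(S(t-1)) = 160, W(R) = 150
• 10 − 40 + 10 − 30 + 10 − 50 = −90: R is heavier on these edges, so the edges of R are kept
• 70 − 10 + 60 − 20 = 100: S(t-1) is heavier on these edges, so the edges of S(t-1) are kept
• W(S(t)) = 250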
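The merging step can be sketched as follows. The union of two full matchings splits into disjoint alternating cycles; within each cycle the sketch keeps the edges of whichever matching is heavier there. The edge weights reproduce the numbers on the slide, but the assignment of those weights to particular inputs and outputs is an assumption made for this example, as are the function names.

```python
from typing import List, Sequence


def merge(queue_len: List[List[int]], s_prev: Sequence[int],
          r: Sequence[int]) -> List[int]:
    """Combine two full matchings cycle by cycle, keeping the heavier side."""
    n = len(s_prev)
    r_inverse = [0] * n
    for i, j in enumerate(r):
        r_inverse[j] = i  # r_inverse[output] = input matched to it in r
    merged = [-1] * n
    for start in range(n):
        if merged[start] != -1:
            continue  # this input's cycle has already been resolved
        # Walk the alternating cycle: s_prev edge out, r edge back in.
        cycle_inputs = []
        i = start
        while True:
            cycle_inputs.append(i)
            i = r_inverse[s_prev[i]]
            if i == start:
                break
        w_s = sum(queue_len[k][s_prev[k]] for k in cycle_inputs)
        w_r = sum(queue_len[k][r[k]] for k in cycle_inputs)
        chosen = s_prev if w_s >= w_r else r
        for k in cycle_inputs:
            merged[k] = chosen[k]
    return merged


# Hypothetical layout matching the slide's weights.
queue_len = [[10, 40, 0, 0, 0],
             [0, 10, 30, 0, 0],
             [50, 0, 10, 0, 0],
             [0, 0, 0, 70, 10],
             [0, 0, 0, 20, 60]]
s_prev = [0, 1, 2, 3, 4]  # weight 10 + 10 + 10 + 70 + 60 = 160
r = [1, 2, 0, 4, 3]       # weight 40 + 30 + 50 + 10 + 20 = 150
s_new = merge(queue_len, s_prev, r)
print(s_new, sum(queue_len[i][j] for i, j in enumerate(s_new)))  # [1, 2, 0, 3, 4] 250
```

The merged matching is at least as heavy as either input, which is why Merge can beat a plain Maximum comparison.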
Throughput
• Theorem:
  – LAURA is stable under any admissible Bernoulli i.i.d. input traffic.
Average Backlog via Simulation
• Switch size: N = 32
• Length of VOQ: QMAX = 10000
• Comparison with
  – iSLIP, iLQF, MUCS, RPA and MWM
Simulation
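• Traffic matrices
  – uniform
  – diagonal
  – sparse
  – logdiagonal
$$T_U = \frac{1}{N^2}\begin{pmatrix} 1&1&1&1&1&1 \\ 1&1&1&1&1&1 \\ 1&1&1&1&1&1 \\ 1&1&1&1&1&1 \\ 1&1&1&1&1&1 \\ 1&1&1&1&1&1 \end{pmatrix}$$
$$T_D = \frac{1}{2N}\begin{pmatrix} 1&1&0&0&0&0 \\ 0&1&1&0&0&0 \\ 0&0&1&1&0&0 \\ 0&0&0&1&1&0 \\ 0&0&0&0&1&1 \\ 1&0&0&0&0&1 \end{pmatrix}$$
$$T_S = \frac{1}{3N^2}\begin{pmatrix} 2&1&0&0&0&0 \\ 0&1&2&0&0&0 \\ 0&1&1&1&0&0 \\ 0&0&0&2&1&0 \\ 0&0&0&0&1&2 \\ 1&0&0&0&1&1 \end{pmatrix}$$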
Laura: Diagonal traffic
Laura: Sparse traffic
Serena
• Since an increase in queue sizes is due to arrivals
• And arrivals are a source of randomness
• Use arrivals to generate a random matching
Serena
[Figure: block diagram. S(t-1) and R(t) feed Merge, which outputs S(t); S(t) is used as S(t-1) the next time]
R(t) = matching generated using arrivals
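The slide does not spell out how arrivals are turned into a matching, so the following is only one plausible construction, stated as an assumption: each output keeps the heaviest edge among the packets that just arrived for it, and the inputs and outputs left over are paired in index order to complete the matching. All names are hypothetical.

```python
from typing import Dict, List, Optional, Sequence


def matching_from_arrivals(queue_len: List[List[int]],
                           arrivals: Sequence[Optional[int]]) -> List[int]:
    """arrivals[i] is the output targeted by input i's new packet, or None."""
    n = len(queue_len)
    owner: Dict[int, int] = {}  # output -> input that keeps it
    for i, j in enumerate(arrivals):
        if j is None:
            continue
        # Assumed tie-break: an output claimed twice keeps the longer queue.
        if j not in owner or queue_len[i][j] > queue_len[owner[j]][j]:
            owner[j] = i
    matching: List[Optional[int]] = [None] * n
    for j, i in owner.items():
        matching[i] = j
    # Assumed completion rule: pair the remaining inputs and outputs in index order.
    free_outputs = [j for j in range(n) if j not in owner]
    for i in range(n):
        if matching[i] is None:
            matching[i] = free_outputs.pop(0)
    return [j for j in matching if j is not None]


queue_len = [[0, 5, 0],
             [0, 9, 0],
             [0, 0, 0]]
print(matching_from_arrivals(queue_len, [1, 1, None]))  # [0, 1, 2]
```

The resulting R(t) would then be merged with S(t-1), as in the block diagram.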
Merging Procedure
[Figure: the arrival graph is converted into a matching R (Arr-R), which is then merged with S(t-1) to give S(t). W(S(t-1)) = 209, W(R) = 121, W(S(t)) = 243]
Throughput
• Theorem:
  – SERENA achieves 100% throughput under any admissible i.i.d. Bernoulli traffic pattern
Serena: Diagonal traffic
Apsara
• One way to obtain MWM is to search the space of all N! matchings
• A natural approximation: If S(t-1) is the current matching, then S(t) is the heaviest matching in a “neighborhood” of S(t-1)
• It turns out that there is a convenient way of defining neighbors (both for theory and for practice)
Neighbors
Neighbors differ from S(t) in ONLY TWO edges (for all values of N)
Example: 3 × 3 switch
[Figure: S(t) and its neighbors]
Apsara
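[Figure: block diagram. S(t-1), its neighbors N1, N2, ..., Nk (generated in parallel) and H(t) from the Hamiltonian walk feed MAX, which outputs S(t); S(t) is used as S(t-1) the next time]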
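A sequential sketch of one Apsara step. Here a neighbor is obtained by swapping the outputs of two inputs, which changes exactly two edges and gives N(N−1)/2 neighbors; hardware would evaluate them in parallel, whereas this illustration simply loops. The function names are assumptions.

```python
from itertools import combinations
from typing import List, Sequence


def weight(queue_len: List[List[int]], matching: Sequence[int]) -> int:
    """Sum of queue lengths on the edges (input i -> output matching[i])."""
    return sum(queue_len[i][j] for i, j in enumerate(matching))


def neighbors(matching: Sequence[int]) -> List[List[int]]:
    """All matchings that differ from `matching` in exactly two edges."""
    result = []
    for a, b in combinations(range(len(matching)), 2):
        neighbor = list(matching)
        neighbor[a], neighbor[b] = neighbor[b], neighbor[a]
        result.append(neighbor)
    return result


def apsara_step(queue_len: List[List[int]], previous: Sequence[int],
                hamiltonian: Sequence[int]) -> List[int]:
    """S(t) = heaviest of S(t-1), its neighbors, and the walk's matching H(t)."""
    candidates = [list(previous), list(hamiltonian)] + neighbors(previous)
    return max(candidates, key=lambda m: weight(queue_len, m))


queue_len = [[3, 1, 0],
             [2, 5, 1],
             [0, 4, 2]]
print(apsara_step(queue_len, previous=[1, 0, 2], hamiltonian=[2, 1, 0]))  # [0, 1, 2]
```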
Apsara: Throughput
• Theorem: Apsara is stable under any admissible i.i.d. Bernoulli traffic.
  (stability due to Hamiltonian matching)
• Also, note that $W(S(t)) \ge W(S(t-1), t)$
• Theorem: If $W(S(t)) = W(S(t-1), t)$ then $W(S(t)) \ge 0.5\, W^*(t)$
  (this is not enough to ensure stability)
Apsara: Diagonal traffic
Limited Parallelism
• The Apsara algorithm searches over $\binom{N}{2}$ neighbors in parallel
• If space is limited to $K < \binom{N}{2}$ modules, then search over a randomly chosen subset of size K from all $\binom{N}{2}$ neighbors
• And there are other (good) deterministic ways of searching a smaller neighborhood of matchings
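The limited-parallelism variant can be sketched by sampling K of the input pairs whose outputs are swapped. The function name and the use of `random.Random.sample` for the random subset are choices made for this illustration.

```python
import random
from itertools import combinations
from typing import List, Sequence


def sampled_neighbors(matching: Sequence[int], k: int,
                      rng: random.Random) -> List[List[int]]:
    """Return k randomly chosen neighbors (each differs in exactly two edges)."""
    pairs = list(combinations(range(len(matching)), 2))
    chosen = rng.sample(pairs, min(k, len(pairs)))  # subset without repetition
    result = []
    for a, b in chosen:
        neighbor = list(matching)
        neighbor[a], neighbor[b] = neighbor[b], neighbor[a]
        result.append(neighbor)
    return result


subset = sampled_neighbors([0, 1, 2, 3], k=3, rng=random.Random(0))
print(len(subset))  # 3
```

The K sampled neighbors would then replace the full neighbor set in the MAX comparison.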
Apsara: Limited parallelism
Diagonal traffic
Conclusions
• We have presented novel scheduling algorithms for input-queued switches
  – Laura
  – Serena
  – Apsara
• They are simple to implement and perform competitively with respect to the Maximum Weight Matching algorithm
References
1. L. Tassiulas, “Linear complexity algorithms for maximum throughput in radio networks and input-queued switches,” Proc. INFOCOM 1998.
2. D. Shah, P. Giaccone and B. Prabhakar, “An efficient randomized algorithm for input-queued switch scheduling,” Proc. of Hot Interconnects, 2001.
3. P. Giaccone, D. Shah and B. Prabhakar, “An Implementable Parallel Scheduler for Input-Queued Switches,” Proc. of Hot Interconnects, 2001.
4. P. Giaccone, B. Prabhakar and D. Shah, “Towards simple and efficient scheduler for high-aggregate IQ switches,” submitted to INFOCOM’02.
5. R. Motwani and P. Raghavan, Randomized Algorithms, Cambridge University Press, 1995.
Uniform traffic
LogDiagonal traffic
Maximum Throughput (Load 0.99)

| Algorithm | Maximum throughput |
|---|---|
| MWM | 0.99 |
| MaxLAURA | 0.99 |
| LAURA | 0.99 |
| iSLIP | 0.84 |
| iLQF | 0.97 |
| MUCS | 0.99 |
| RPA | 0.98 |