Control Planes for Optical Switching
George Papen
UC San Diego
Opportunities and Challenges for Optical Switching in the Data Center
OFC Workshop 2019
Practical Optical Switching in Data Centers
• Hardware Issues
– Link-level reconfiguration time dominates physical switch time
– Synchronization is hard
– Links w/burst-mode RXs are more complex
[Figure: measured waveforms — TX data, switch latch, switch output, BM RX start, BM RX output, and demux output vs. time (µs); the error detector gate excludes the invalid preamble, with timing parameters D_SW, D_PRBS, D_GATE, T_CYCLE, and T_GATE marked on the switch state and PRBS data traces.]
[Figure: RotorNet topologies — optical rotor switches interconnecting electronic top-of-rack switches above server racks over single-mode fiber, or directly interconnecting servers/compute nodes over single-mode or multi-mode fiber.]
• Software Issues
– Centralized scheduling does not scale
– Software has no firm concept of “Go now!”
– Unacceptable delay for µs (or less) switching
• A New Solution: RotorNet
– Decouple scheduling and routing
– Decentralized control plane
– Hide delay w/parallelism
Hardware Issues: Reconfiguration Time
• Research at IBM w/UCSD intern using a nanosecond Si-P switch
• Goal: minimize the link-level reconfiguration time of the switch
(the time during which data cannot reliably transit the network)
• Includes:
– Switch reconfiguration time
– Clock Data Recovery (CDR) locking time
– Synchronization guard delays
[Figure: link timeline — alternating data-transfer intervals and reconfiguration gaps; each gap comprises a guard delay, the switch reconfiguration, the CDR lock, and another guard delay, which together form the system-level reconfiguration time.]
OFC 2018: A. Forencich et al., “System-Level Demonstration of a Dynamically Reconfigured Burst-Mode Link Using a Nanosecond Si-Photonic Switch”
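Since these components add, a back-of-envelope sketch (with illustrative, assumed numbers, not figures from the talk) shows why shrinking the physical switch time alone barely reduces the system-level reconfiguration time:

```python
# Back-of-envelope sketch with ILLUSTRATIVE numbers (assumed, not measured):
# the link is unusable for the sum of all three components, so a ~1 ns
# physical switch still leaves tens of ns of system-level reconfiguration.
switch_reconfig_ns = 1   # ~1 ns Si-photonic switch (physical response)
cdr_lock_ns = 50         # assumed burst-mode CDR locking time
guard_ns = 2 * 15        # assumed synchronization guard delays (both ends)
system_ns = switch_reconfig_ns + cdr_lock_ns + guard_ns
print(f"system-level reconfiguration ~ {system_ns} ns "
      f"({system_ns / switch_reconfig_ns:.0f}x the physical switch time)")
```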
System Reconfiguration Time Testbed
• Data Plane
– 100G PSM-4 QSFP28 TX
– 2x2 Silicon Photonic switch
– Amp/filter/attenuator
– Burst-mode receiver
[Figure: testbed block diagram — transmitter (QSFP28, 100G PSM-4), 2x2 Si-Ph switch, PDFA, BPF, VOA, 99%/1% tap to a power meter, and burst-mode receiver in the data path; pattern generator, switch control, trigger generator, and error detector implemented on a Xilinx Virtex Ultrascale FPGA (VCU108 board), with a PC and a 1:16 demux board.]
• Control plane
– Xilinx XVCU095 FPGA
– 25 Gbps pattern generator
– Trigger generator, switch interface
– 25 Gbps gated error detector
~1 ns Si-P switching (physical response)
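A gated error detector counts bit errors only while the burst-mode RX output is valid; a minimal sketch of the gating idea (illustrative, not the actual FPGA logic):

```python
# Illustrative gating sketch (not the FPGA implementation): count bit errors
# only inside the gate window, excluding the switch transient and the
# burst-mode RX preamble, so the measured BER reflects usable payload time.
def gated_error_count(samples, gate_open_ns, gate_close_ns):
    """samples: iterable of (time_ns, bit_error) pairs from the detector."""
    return sum(err for t, err in samples if gate_open_ns <= t < gate_close_ns)

# Example: errors during the first 90 ns (reconfiguration) are ignored.
samples = [(10, 1), (50, 1), (120, 0), (400, 1), (900, 0)]
print(gated_error_count(samples, gate_open_ns=90, gate_close_ns=1366))  # -> 1
```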
Measured Waveforms of Photonic Switch and BM-RX
Payload size (B)      2048   1024   2048   1024
Data rate (Gbps)      12.5   12.5     20     20
Cycle time (ns)       1366    730    858    460
BM-Switch time (ns)     90     90     60     60
Duty cycle (%)          93     87     93     87
Reconfiguration is 60-90x slower than the physical switch time!
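The duty cycle follows directly from the cycle time and the BM-switch (reconfiguration) time; a short sketch approximately reproducing the table's duty-cycle row:

```python
# Duty cycle = usable fraction of each cycle: (cycle - reconfig) / cycle.
# Cycle and reconfiguration times are the measured values from the table;
# the computed duty cycles match the table's row up to rounding.
for cycle_ns, reconfig_ns in [(1366, 90), (730, 90), (858, 60), (460, 60)]:
    duty = 100 * (cycle_ns - reconfig_ns) / cycle_ns
    print(f"cycle {cycle_ns:4d} ns, reconfig {reconfig_ns} ns "
          f"-> duty cycle {duty:.1f}%")
```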
Software Issues
[Figure: packet-switch ASIC — scheduling driven by local queue occupancy.]
Data plane doesn’t scale to the entire datacenter!
The Centralized Control Issue
[Figure: centralized scheduling — queue occupancy from sending racks feeds a scheduler that configures a crossbar (an optical circuit switch or an electronic packet switch) connecting inputs to outputs across receiving racks.]
Information required for scheduling is not locally available
Hybrid Switch
[Figure: hybrid switch — N sender input ports reach N receiver output ports through both a circuit switch and a packet switch; a scheduling algorithm factors the senders-by-receivers traffic matrix into a schedule S of circuit configurations with durations, leaving a leftover demand matrix for the packet switch.]
Steps in Centralized Scheduling
• Collect demand information from end hosts
• Send demand information over a network to a central location
• Form the traffic matrix
• Factor the traffic matrix into a sequence of switch states (see the sketch after the reference below)
• Finally! Set the switch
“Scheduling Techniques for Hybrid Circuit/Packet Networks,” CoNEXT 2015
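The “factor the traffic matrix” step is typically a Birkhoff-von-Neumann-style decomposition into matchings held for computed durations; the sketch below is a hypothetical greedy version of that step (not the CoNEXT 2015 algorithm), with an illustrative demand matrix:

```python
# Hypothetical greedy sketch (not the authors' algorithm): peel matchings
# off an NxN demand matrix, Birkhoff-von-Neumann style. Each matching is a
# circuit configuration held for a duration; leftover demand would go to
# the packet switch in a hybrid design.
import numpy as np

def decompose(demand, min_duration=1):
    D = demand.astype(float).copy()
    schedule = []
    while D.max() >= min_duration:
        matching, used_rows, used_cols = {}, set(), set()
        # Greedily pick the largest remaining entries with free rows/columns.
        for idx in np.argsort(D, axis=None)[::-1]:
            r, c = divmod(int(idx), D.shape[1])
            if D[r, c] < min_duration:
                break
            if r not in used_rows and c not in used_cols:
                matching[r] = c
                used_rows.add(r)
                used_cols.add(c)
        # Hold this circuit configuration as long as its smallest demand.
        duration = min(D[r, c] for r, c in matching.items())
        for r, c in matching.items():
            D[r, c] -= duration
        schedule.append((duration, matching))
    return schedule, D  # (switch states with durations, leftover demand)

demand = np.array([[0, 8, 3], [5, 0, 7], [6, 2, 0]])  # illustrative only
for duration, matching in decompose(demand)[0]:
    print(f"hold {duration:3.0f} slots: {matching}")
```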
A Centralized Control Plane - REACToR
[Figure: control plane coordinating hosts, a top-of-rack switch, and an OR station.]
“A multiport microsecond optical circuit switch for data center networking,” PTL 2013
[Plot: CDF of the end-to-end flow switch time latency distribution (µs), with a marked point at 14.84 µs.]
[Figure: REACToR testbed — hosts (H) attach to REACToR (FPGA) boards, which connect to both the Mordia circuit switch and a 10G electronic packet switch (EPS); a control computer exchanges demand information and schedules over the control path (10 G data links, 1 G control links).]
Software latencies dominate switch time
Time-Varying Demand
[Plot: transmissions to Host 0 from Hosts 1-7 vs. time (µs, 0-7500), with PFC pauses, reconfigurations, demand updates, and incorrectly scheduled slots marked.]
§ Control plane tries to allocate circuits based on the calculated schedule
§ Control plane prototype was slow, did not always schedule correctly, and did not scale – hard lesson learned!
“Circuit Switching Under the Radar with REACToR,” NSDI 2014
A New Solution - RotorNet
• No centralized control – inherently more scalable!
• Co-design of optical switch and network
Why build a large, fast crossbar that you cannot control?
• Parallelism decouples minimum latency from switching time
TOMORROW: Max Mellette, invited talk M2C.3, Monday 11 AM, Room 3.
“A Practical Approach to Optical Switching in Datacenters”
Rotor switch model: N input ports, N output ports, N − 1 matchings
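Because the schedule is fixed, every port can know it in advance without any demand collection; a minimal sketch, assuming a simple round-robin construction of the N − 1 matchings (the exact matchings used by RotorNet may differ):

```python
# Minimal sketch of a rotor switch's fixed schedule (assumed round-robin
# construction): matching k connects input i to output (i + k) mod N, so
# each input reaches every other output once per cycle of N - 1 matchings,
# with no demand collection and no scheduling decisions at run time.
def rotor_schedule(n_ports):
    for k in range(1, n_ports):  # N - 1 matchings per cycle
        yield {i: (i + k) % n_ports for i in range(n_ports)}

for step, matching in enumerate(rotor_schedule(4), start=1):
    print(f"matching {step}: {matching}")
```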
RotorNet has no Central Control
[Figure: a crossbar driven by real-time scheduling from queue occupancy is replaced by a rotor switch cycling through a fixed schedule of matchings (1 → 2, 1 → 3, 1 → 4).]
Rotor switch
→ No (central) control
→ Bounded reduction in throughput
“RotorNet: A Scalable, Low-complexity, Optical Datacenter Network,” SIGCOMM ’17
Summary
• Mimicking the electronic packet-switched network control plane leads to an optical circuit switch that does not scale
• Even w/o centralized scheduling, the control plane is hard
– Must reduce/hide system reconfiguration time
– Must synchronize “asynchronous” end hosts
• RotorNet addresses control plane issues by:
– Eliminating the centralized scheduler
– Using parallelism to bypass system reconfiguration delay
• Still must address synchronization with end hosts
Contributing Researchers
Systems
• UC San Diego: Max Mellette, George Papen, George Porter, Alex C. Snoeren, Geoffrey M. Voelker, Amin Vahdat, with students Nathan Farrington, Rishi Kapoor, He Liu, Feng Lu, Rob McGuinness, Arjun Roy, Malveeka Tewari
• CMU: David G. Andersen, Srinivasan Seshan, with students Matthew K. Mukerjee, Conglong Li, Nicolas Feltman
• Intel: Michael Kaminsky
Hardware
• UC San Diego: Joe Ford, Max Mellette, George Papen, P.-C. Sun, with students Alex Forencich, Max Mellette, Glenn M. Schuster
• IBM: Nicolas Dupuis, Christian Baks, Benjamin G Lee, Laurent Schares
• Technical University of Denmark: Valerija Kamchevska (student)