Reconstructing network states in cloud using NIC and system timestamps:
A case study of Cloudlab
Shiyu Liu, Balaji Prabhakar, Mendel Rosenblum
Stanford University
Feb 7, 2018
SIMON: A Simple and Scalable Method for Sensing, Inference and Measurement in Data Center Networks (NSDI’19)
• Using edge-based measurement to reconstruct key network state variables
  • Packet queuing times at switches
  • Link utilizations
  • Queue and link compositions at the flow-level
• SIMON enables:
  • Sensitive A/B tests
  • Network troubleshooting & diagnosis
  • Network performance monitoring
HW (NIC) or SW (system) timestamps?
• HW (NIC) timestamps:
  • Accurate inputs for estimating queueing delays. SIMON’s default.
  • Not available in many cases, e.g. cloud
• SW (system) timestamps:
  • Widely available
  • Could we use SW timestamps to still get fairly good reconstructions?
  • Will improve the deployability of SIMON
[Figure: sender and receiver servers, each drawn as CPU+RAM (APP, Kernel, Driver) connected over PCIe to the NIC. Timestamp points: Tx SW and Rx SW (system), Tx HW and Rx HW (NIC).]
$T(\mathrm{Rx\,SW}) - T(\mathrm{Tx\,SW})$ vs. $T(\mathrm{Rx\,HW}) - T(\mathrm{Tx\,HW})$
• Software processing delays in driver, interrupt handling, interrupt coalescing
• PCI-E delays
• NIC queueing & hardware processing delays
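As a minimal sketch of the comparison above, both one-way delays can be computed from the four timestamps of a single probe. The `ProbeTimestamps` layout, the function name and the sample values are hypothetical, and the clocks are assumed to be already synchronized across servers.

```python
from dataclasses import dataclass


@dataclass
class ProbeTimestamps:
    """Four timestamps of one probe, in nanoseconds (hypothetical layout)."""
    tx_sw: int  # system timestamp at the sender
    tx_hw: int  # NIC timestamp at the sender
    rx_hw: int  # NIC timestamp at the receiver
    rx_sw: int  # system timestamp at the receiver


def one_way_delays(p: ProbeTimestamps):
    """Return (HW one-way delay, SW one-way delay, SW minus HW)."""
    hw = p.rx_hw - p.tx_hw
    sw = p.rx_sw - p.tx_sw
    # The excess is the host-side delay (driver, interrupts, PCIe, NIC)
    # that the SW measurement picks up on top of the HW one.
    return hw, sw, sw - hw


# Made-up sample values, for illustration only.
probe = ProbeTimestamps(tx_sw=1_000, tx_hw=4_000, rx_hw=24_000, rx_sw=31_000)
hw_owd, sw_owd, host_excess = one_way_delays(probe)
```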
Contents
• Overview of Cloudlab environment
• Performance of SIMON w/ HW timestamps
• Study of the difference between SW & HW measured one-way delays
• Performance of SIMON w/ SW timestamps
A case study of Cloudlab
• Cloudlab: 2-stage switching fabric, 10G links
• Use Huygens to sync SW (system) and HW (NIC) clocks respectively among all servers.
[Figure: Topology of Cloudlab experiment]
• OS: Linux v4.15
• NIC: Mellanox ConnectX-4
• ToR: Dell S4048-ON, 12MB shared pkt buffer
• Spine: Mellanox MSN2410, 16MB shared pkt buffer
Estimate queue recon errors without ground truth
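• Send two independent probe meshes, i.e. two independent sets of measurements.
• Run SIMON on each set separately (Measure 1 → SIMON, Measure 2 → SIMON). This gives two reconstructions of the true queue $Q$ with independent errors $e_1$ and $e_2$:
$$Q_1 = Q + e_1, \qquad Q_2 = Q + e_2$$
• Because the errors are independent:
$$\mathrm{Var}(e_1 - e_2) = \mathrm{Var}(e_1) + \mathrm{Var}(e_2)$$
• The difference between the two independent reconstructions ($Q_1$ and $Q_2$) therefore bounds the difference between each reconstruction and the ground truth (i.e. $e_1$ and $e_2$).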
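The cross-validation argument can be checked numerically. The following is a sketch on synthetic data only: the queue distribution, the noise level and the zero-mean Gaussian error model are assumptions made for illustration, not measurements from the study.

```python
import numpy as np


def rms(x):
    """Root mean square of an array."""
    return float(np.sqrt(np.mean(np.square(x))))


rng = np.random.default_rng(0)
n = 200_000

# Synthetic "true" queueing delays in us (assumed distribution).
q = rng.exponential(scale=100.0, size=n)

# Independent zero-mean reconstruction errors (assumed error model).
e1 = rng.normal(0.0, 20.0, size=n)
e2 = rng.normal(0.0, 20.0, size=n)

q1 = q + e1  # reconstruction from probe mesh 1
q2 = q + e2  # reconstruction from probe mesh 2

# Observable without ground truth: Q1 - Q2 = e1 - e2.
cross_rms = rms(q1 - q2)

# Not observable in practice; computed here only to check the bound.
rms_e1 = rms(e1)
rms_e2 = rms(e2)

print(f"RMS(Q1 - Q2) = {cross_rms:.2f} us")
print(f"RMS(e1) = {rms_e1:.2f} us, RMS(e2) = {rms_e2:.2f} us")
```

For zero-mean independent errors, the squared RMS of `q1 - q2` is approximately the sum of the squared RMS errors, so `cross_rms` is an upper bound on each individual error.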
SIMON w/ HW timestamps in Cloudlab
Cross-validation by 2 independent meshes of probes. Recon interval = 1ms.

                                                  All queues   Queues > 100us
RMS(blue - red)                                   29.33 us     108.54 us
Relative error = RMS(blue - red) / RMS(…)         7.2%         6.9%
SW & HW one-way delay in Cloudlab
• Our goal is to use SW one-way delay (red line) to estimate the HW one-way delay (blue line)
• The noise is instantaneous, but the switch queueing delays are prolonged
[Figure: SW (red) and HW (blue) one-way delays; annotations mark the DC bias and the high-freq noise in the SW delays.]
Remove the noise in SW one-way delays
• High-pass filter to remove the DC bias
• Remove peak noises: samples > threshold
• Remaining noise: averaged out by LASSO
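The two filtering steps can be sketched as follows. The slides do not specify the exact filters, so this is an illustration under assumptions rather than the authors' implementation: the high-pass step is taken to be subtraction of a long centered moving average, a peak is taken to be a sample that exceeds the mean of its two neighbors by more than a threshold, and the window, threshold and synthetic trace are all made up.

```python
import numpy as np


def remove_dc_bias(x, window):
    """Simple high-pass: subtract a long centered moving average (assumed filter)."""
    x = np.asarray(x, dtype=float)
    kernel = np.ones(window) / window
    padded = np.pad(x, window // 2, mode="edge")
    baseline = np.convolve(padded, kernel, mode="valid")[: len(x)]
    return x - baseline


def remove_peaks(x, threshold):
    """Replace isolated spikes by the mean of their two neighbors (assumed rule)."""
    x = np.asarray(x, dtype=float)
    y = x.copy()
    neighbors = 0.5 * (x[:-2] + x[2:])
    is_peak = (x[1:-1] - neighbors) > threshold
    y[1:-1][is_peak] = neighbors[is_peak]
    return y


# Synthetic SW one-way delay trace in us (illustrative values only).
sw_owd = np.full(2000, 50.0)      # constant DC bias of 50 us
sw_owd[800:840] += 30.0           # prolonged switch queueing episode
sw_owd[[200, 1500]] += 200.0      # instantaneous software noise spikes

filtered = remove_peaks(remove_dc_bias(sw_owd, window=1001), threshold=100.0)
```

The window is chosen much longer than a queueing episode so that the prolonged queueing signal survives the high-pass step almost unchanged, while the instantaneous spikes are removed by the threshold test.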
Filtering improves the approximation of SW recon results to HW results
[Figure: SW one-way delays (w/o or w/ filtering) → SIMON → SW recon results; HW one-way delays → SIMON → HW recon results. The SW recon results should approximate the HW recon results.]

RMS(SW recon - HW recon) (us)
              All queues   Queues > 100us
w/o filter    23.04        29.08
w/ filter     12.67        24.80
Recon errors using SW & HW timestamps

RMSE (us)
                All queues   Queues > 100us
HW              29.33        108.54
SW w/ filter    32.37        109.79

Relative error
                All queues   Queues > 100us
HW              7.23%        6.89%
SW w/ filter    7.98%        6.97%

• SW recon errors are close to HW, esp. for large queues
• SW (system) timestamps are good replacements for HW (NIC) timestamps for SIMON
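A short worked calculation makes the two claims concrete, using only the RMS values reported in the charts on these slides. The helper name and the dictionary layout are hypothetical.

```python
# RMS(SW recon - HW recon) in us: (w/o filter, w/ filter)
rms_diff = {"All queues": (23.04, 12.67), "Queues > 100us": (29.08, 24.80)}

# Cross-validation RMSE in us: (HW, SW w/ filter)
rmse = {"All queues": (29.33, 32.37), "Queues > 100us": (108.54, 109.79)}


def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


# How much filtering shrinks the gap between SW and HW reconstructions.
filter_gain = {k: -pct_change(*v) for k, v in rms_diff.items()}

# How much larger the SW (filtered) error is than the HW error.
sw_penalty = {k: pct_change(*v) for k, v in rmse.items()}

for k in rms_diff:
    print(f"{k}: filter reduces SW-HW gap by {filter_gain[k]:.1f}%, "
          f"SW RMSE exceeds HW RMSE by {sw_penalty[k]:.1f}%")
```

Filtering cuts the SW-to-HW gap by roughly 45% over all queues and 15% for queues above 100us, and the filtered SW RMSE is about 10% above the HW RMSE over all queues but only about 1% above it for large queues.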
Conclusion
• With proper filters applied, SW (system) timestamps become good replacements for HW (NIC) timestamps for reconstructing network states.
• This improves the deployability of SIMON, e.g. in cloud environments
Please visit our poster for more details and Q&A