IEPM-BW: Bandwidth Change Detection and Traceroute Analysis and Visualization

Connie Logg, Joint Techs Workshop, February 4-9, 2006

BW Change Detection: Important

Know what you are looking for
• How long must a change persist before alerting?
• What threshold to use for alerting (drop of N %)?
• What probes provide quality data and are relevant? (May differ between network types and technologies)
• Once an alert is detected, what circumstances must be met before another alert is generated for the same or new drop?
• Alerting and forecasting/predicting future performance are two different things; however, data taken may be relevant to both
• Remember: don't want to respond to every little glitch. More probing may escalate a minor momentary congestion event.

What to do with ALERTS
• Study them for accuracy and relevance
• What information would help diagnose the drop?
• Were there traceroute changes?
• Do changes in other probes seem to have occurred in the same time frame?
• Was there an increase in the ping RTT times?
• If TCP RTT is available, was there a change in that?
• What does OWAMP show (to be implemented)?

Algorithm - Simplified

Stream of data t0 ... tn

2 buffers: history buffer (hbuff) and trigger buffer (tbuff), sizes hmax & tmax

Load data t0 to thmax into history buffer and calculate baseline histmean (hm) & histsd (hsd)

Algorithm - Simplified (continued)

Loop over data t = {thmax+1 ... tn}
1. If t > hm - 2*hsd: tbuff oldest -> hbuff, t -> hbuff, drop hbuff oldest, calc hm & hsd, next
2. If t <= hm - 2*hsd: t -> tbuff
   If size(tbuff) < tmax, next
   Calc tbuff mean (tm); if (hm-tm)/hm > threshold, generate an alert, tbuff -> hbuff, calc hm, hsd, next

Once an alert is generated, the drop threshold must be met again from the tm, or the data stream must recover for half of the drop time.
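The two-buffer loop above can be sketched in Python roughly as follows. This is a minimal illustration, not the IEPM-BW code: the function name `detect_drops`, the default `threshold`, and the `nsd` parameter are hypothetical. The slide does not say what happens when a full trigger buffer fails the threshold test, so the sketch assumes the oldest trigger value is moved into history. The re-alert rule (drop again from tm, or recover for half of the drop time) is not implemented here.

```python
from collections import deque
from statistics import mean, pstdev


def detect_drops(data, hmax, tmax, threshold=0.33, nsd=2.0):
    """Return the indices in `data` at which a drop alert would be raised."""
    hbuff = deque(data[:hmax], maxlen=hmax)  # history buffer
    tbuff = deque()                          # trigger buffer
    hm, hsd = mean(hbuff), pstdev(hbuff)     # baseline mean and std dev
    alerts = []

    for i in range(hmax, len(data)):
        t = data[i]
        if t > hm - nsd * hsd:
            # Step 1: normal point. Move the oldest trigger value (if any)
            # and the new point into history; maxlen drops the oldest entries.
            if tbuff:
                hbuff.append(tbuff.popleft())
            hbuff.append(t)
            hm, hsd = mean(hbuff), pstdev(hbuff)
            continue

        # Step 2: suspicious point goes into the trigger buffer.
        tbuff.append(t)
        if len(tbuff) < tmax:
            continue

        tm = mean(tbuff)
        if (hm - tm) / hm > threshold:
            alerts.append(i)
            # The trigger buffer is folded into history (new baseline).
            hbuff.extend(tbuff)
            tbuff.clear()
        else:
            # Assumption (not on the slide): shed the oldest trigger value.
            hbuff.append(tbuff.popleft())
        hm, hsd = mean(hbuff), pstdev(hbuff)
    return alerts
```

Note that with a perfectly constant history hsd is 0, so the strict comparison in step 1 sends every point to the trigger buffer; real probe data is noisy enough that this rarely matters, but an implementation may want a floor on hsd.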

Overview
• What we currently look for
  • Look for a drop lasting at least 6 hours
  • Look for a drop of 33%
  • Before reporting another drop, require 3 hours of restored throughput

[Figure: bandwidth vs. time. Annotations: 33% drop for 6 hours; up for at least 3 hours; 33% drop for 6 more hours; drop of 33% for 6 hours]

Observations:
• Traceroute changes occasionally coincide with bandwidth drops
• Challenge: How do you define a traceroute change, and which changes have the highest priority?
  • Checksum error
  • Duplicate responding or non-responding hop
  • ! Annotations
  • IP addr differ in 4th octet (or 3rd and 4th octets)
• How do you quickly review traceroute changes?
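One way to make "traceroute change" concrete is to compare two routes hop by hop and report the most significant kind of difference. The sketch below is illustrative only: the category labels and the `SEVERITY` ordering are assumptions, since the slide leaves the priority question open, and a non-responding hop is represented here as `None`.

```python
# Hypothetical ordering, from least to most significant change.
SEVERITY = ["same", "octet4", "octet3_4", "no_response", "different"]


def classify_hop_change(old_ip, new_ip):
    """Classify how one hop differs between two traceroutes."""
    if old_ip == new_ip:
        return "same"
    if old_ip is None or new_ip is None:
        return "no_response"          # hop answered in only one of the runs
    old, new = old_ip.split("."), new_ip.split(".")
    if old[:3] == new[:3]:
        return "octet4"               # only the 4th octet differs
    if old[:2] == new[:2]:
        return "octet3_4"             # 3rd and 4th octets differ
    return "different"


def classify_route_change(old_route, new_route):
    """Return the most significant hop-level change between two routes."""
    if len(old_route) != len(new_route):
        return "different"
    kinds = [classify_hop_change(a, b) for a, b in zip(old_route, new_route)]
    return max(kinds, key=SEVERITY.index, default="same")
```

For example, `classify_route_change(["192.68.191.83", "137.164.23.41"], ["192.68.191.83", "137.164.23.42"])` yields `"octet4"`, a change that may deserve lower priority than a switch to a different network.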

Traceroute Visualization
• One compact page per day
• One row per host, one column per hour
• One character per traceroute to indicate pathology or change (period (.) = no change)
• Identify unique routes with a number
• Inspect the route associated with a route number
• Provide for analysis of long term route evolutions

Route # at start of day, gives idea of route stability

Multiple route changes (due to GEANT), later restored to original route

Period (.) means no change
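The row layout described above (route number at the start of the day, a period for no change, a number when the route changes) can be sketched as follows. This is a simplified illustration with a hypothetical function name; it separates cells with spaces so multi-digit route numbers stay readable, and it omits the pathology characters because their actual symbols are not given in this transcript.

```python
def render_row(routes):
    """Render one host's traceroutes for a day as a compact string.

    `routes` is a chronological list of routes, each a tuple of hop addresses.
    Returns the rendered row and the route -> number mapping.
    """
    route_numbers = {}  # numbers are assigned in order of first appearance
    cells = []
    previous = None
    for route in routes:
        number = route_numbers.setdefault(route, len(route_numbers))
        if previous is None or route != previous:
            cells.append(str(number))  # start of day or a route change
        else:
            cells.append(".")          # same route as the previous probe
        previous = route
    return " ".join(cells), route_numbers


a = ("rtr-a", "rtr-b")
b = ("rtr-a", "rtr-c")
row, numbers = render_row([a, a, b, b, a])
print(row)  # 0 . 1 . 0
```

A row that returns to its starting number, as in the example, corresponds to a route that changed and was later restored to the original.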

Pathology Encodings
• Stutter
• Probe type
• End host not pingable
• ICMP checksum
• Change in only 4th octet
• Hop does not respond
• No change
• Multihomed
• ! Annotation (!X)
• Change but same AS

Navigation

traceroute to CCSVSN04.IN2P3.FR (134.158.104.199), 30 hops max, 38 byte packets
 1 rtr-gsr-test (134.79.243.1) 0.102 ms
…
13 in2p3-lyon.cssi.renater.fr (193.51.181.6) 154.063 ms !X

#rt# firstseen  lastseen   route
0    1086844945 1089705757 ...,192.68.191.83,137.164.23.41,137.164.22.37,...,131.215.xxx.xxx
1    1087467754 1089702792 ...,192.68.191.83,171.64.1.132,137,...,131.215.xxx.xxx
2    1087472550 1087473162 ...,192.68.191.83,137.164.23.41,137.164.22.37,...,131.215.xxx.xxx
3    1087529551 1087954977 ...,192.68.191.83,137.164.23.41,137.164.22.37,...,131.215.xxx.xxx
4    1087875771 1087955566 ...,192.68.191.83,137.164.23.41,137.164.22.37,...,(n/a),131.215.xxx.xxx
5    1087957378 1087957378 ...,192.68.191.83,137.164.23.41,137.164.22.37,...,131.215.xxx.xxx
6    1088221368 1088221368 ...,192.68.191.146,134.55.209.1,134.55.209.6,...,131.215.xxx.xxx
7    1089217384 1089615761 ...,192.68.191.83,137.164.23.41,(n/a),...,131.215.xxx.xxx
8    1089294790 1089432163 ...,192.68.191.83,137.164.23.41,137.164.22.37,(n/a),...,131.215.xxx.xxx
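A route table like the one above is easy to post-process when studying long-term route evolution. The sketch below assumes one row per line and assumes that firstseen and lastseen are Unix epoch seconds (the values are consistent with mid-2004, but the slide does not say so). The function name and dictionary keys are hypothetical.

```python
from datetime import datetime, timezone


def parse_route_line(line):
    """Split one route-table row into its number, time span, and hop list."""
    number, first_seen, last_seen, route = line.split(None, 3)
    return {
        "route": int(number),
        # Assumption: timestamps are seconds since the Unix epoch (UTC).
        "first_seen": datetime.fromtimestamp(int(first_seen), tz=timezone.utc),
        "last_seen": datetime.fromtimestamp(int(last_seen), tz=timezone.utc),
        "hops": route.strip().split(","),
    }


row = parse_route_line(
    "0 1086844945 1089705757 "
    "...,192.68.191.83,137.164.23.41,137.164.22.37,...,131.215.xxx.xxx"
)
days_seen = (row["last_seen"] - row["first_seen"]).days
print(row["route"], days_seen)  # 0 33
```

Under that assumption, route 0 was observed over a span of about 33 days, while routes 5 and 6 (identical firstseen and lastseen) were each seen only once.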

AS information

Changes in network topology (BGP) can result in dramatic changes in performance

Snapshot of traceroute summary table

Samples of traceroute trees generated from the table

ABwE measurement, one per minute for 24 hours, Thurs Oct 9 9:00am to Fri Oct 10 9:01am

Drop in performance (from original path SLAC-CENIC-Caltech to SLAC-ESnet-LosNettos (100 Mbps)-Caltech)

Back to original path

Changes detected by IEPM-Iperf and ABwE

ESnet-LosNettos segment in the path (100 Mbits/s)

[Figure axis labels: Hour; Remote host]

Dynamic BW capacity (DBC)

Cross-traffic (XT)

Available BW = (DBC-XT), in Mbits/s

Notes:
1. Caltech misrouted via Los-Nettos 100Mbps commercial net 14:00-17:00
2. ESnet/GEANT working on routes from 2:00 to 14:00
3. A previous occurrence went un-noticed for 2 months
4. Next step is to auto detect and notify

Los-Nettos (100Mbps)

New Graphical Map Display
• New Traceroute Map Display

Quality Control - Bandwidth Monitoring

It is good to have a local target host for a sanity check:

Problem here was that the monitoring host rebooted into single CPU mode after maintenance had been performed on it.

More Sanity Checks
• Target host (iepm-bw@caltech) was not completely installed
  • Process cleanup did not have the perl modules that it needed to kill lingering processes (needs install check)

Probe Correlation

Pathchirp analysis shows drop

Multi-stream iperf shows drop

Single stream iperf shows drop

Traceroute change affected all 3
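Checking whether several probes saw a drop in the same time frame can be automated by grouping alert times across probes. The sketch below is an illustration under stated assumptions: the function name, the probe names, and the one-hour window are hypothetical, and alert times are plain seconds.

```python
def correlated_alerts(alerts_by_probe, window):
    """Group alerts from different probes that start within `window` seconds.

    `alerts_by_probe` maps a probe name to a list of alert times (seconds).
    Only groups seen by more than one probe are returned.
    """
    events = sorted(
        (t, probe) for probe, times in alerts_by_probe.items() for t in times
    )
    groups = []
    for t, probe in events:
        if groups and t - groups[-1]["start"] <= window:
            groups[-1]["probes"].add(probe)   # same time frame as the group
        else:
            groups.append({"start": t, "probes": {probe}})
    return [g for g in groups if len(g["probes"]) > 1]


alerts = {
    "pathchirp": [1000],
    "iperf_multi": [1600],
    "iperf_single": [2200, 90000],
}
for group in correlated_alerts(alerts, window=3600):
    print(group["start"], sorted(group["probes"]))
# 1000 ['iperf_multi', 'iperf_single', 'pathchirp']
```

A group reported by all probes at once is a natural candidate for comparison against traceroute changes in the same window.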

Analysis Results
• Email is sent to interested parties with links to graphs, data and traceroute analysis
• Alerts are saved in the ALERT table and graphs are saved in the GRAPH table for future reference
• On every analysis run (about every 2 hours), a table is generated showing which alerts occurred for which probes and when. It has links to the more detailed alert information.
• Reports on the last month's alerts are generated nightly from these tables.

Future Improvements
• Integrate Ping RTTmin and RTTmax analysis
• Optimize code for speed of execution: estimate mean and std dev
• Upload alerts to MonALISA. What info?
• Compare detection algorithms (KS, HW, PCA?)
• Recommendations on data taking frequencies and how to define the trigger and history buffer sizes still need more exploring
• Implement prediction/forecasting algorithm(s)

QUESTIONS?
