Post on 27-Dec-2015
1
Automated Fault Diagnosis in VoIP
31st March, 2006
Vishal Kumar Singh and Henning Schulzrinne
2
VoIP Diagnosis
What is automated VoIP diagnosis
  Determining failures in the network
  Automatically finding the root cause of the failure
Why VoIP diagnosis
  Networks are complex, making it difficult to troubleshoot problems
  Automatic fault diagnosis reduces human intervention
Issues in VoIP diagnosis
  Detecting failures/faults
  Finding the cause of failure, determining dependency relationships among different components for diagnosis
Solution steps and approaches
3
Issues in Automated VoIP Diagnosis
Increasingly complex and diverse network elements
Complex interactions/relationships between different network elements
Different run time bindings for each application usage instance, e.g., different calls may use different DNS, SIP proxy servers, media path
A problem in one network element may manifest itself as a user-perceived failure of another element
4
Fault Identification
Service unavailability reporting
  Node/device/UA generates faults (failure events), e.g., SNMP traps, failure messages
  Monitoring application, e.g., an SNMP based application, detects service unavailability and reports the failure event
  Affected user reports service unavailability, e.g., by e-mail, by calling the helpdesk, or automatically by pressing a button on the phone while in a call and experiencing echo
  Dependent application detects service unavailability and generates faults (failure events)
5
Fault Localization: Determining the Source of the Problem
Fault classification: local vs. global (does it affect only me, or does it affect others also?)
Global failures
  Server failure, e.g., SIP proxy, DNS failure, DB failures
  Network failures
Local failures
  Specific source failure, e.g., node A cannot make a call to anyone
  Specific destination or participant failure, e.g., no one can make a call to node B
  Locally observed but global failures, e.g., DNS service failed, but only B observed it
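
The local vs. global question above can be sketched as a small function that compares what the failing node sees with what its peers see. This is only an illustration: the `FaultScope` names, the `peer_also_failed` mapping, and the 0.5 threshold are assumptions, not part of the slides.

```python
from enum import Enum
from typing import Dict


class FaultScope(Enum):
    GLOBAL = "global"    # others see the failure too
    LOCAL = "local"      # only the reporting node sees it
    UNKNOWN = "unknown"  # no peer answered, so nothing can be concluded


def classify_fault(peer_also_failed: Dict[str, bool],
                   global_fraction: float = 0.5) -> FaultScope:
    """Classify a failure from peer observations.

    peer_also_failed maps a peer name to True when that peer repeated the
    test and it failed there as well. global_fraction is a hypothetical
    threshold for calling the failure global.
    """
    if not peer_also_failed:
        return FaultScope.UNKNOWN
    failing = sum(1 for failed in peer_also_failed.values() if failed)
    if failing / len(peer_also_failed) >= global_fraction:
        return FaultScope.GLOBAL
    return FaultScope.LOCAL
```

A "locally observed but global" failure is the case where the first answer is LOCAL only because too few peers exercised the failing service, which is one reason to keep asking more peers over time.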
6
Solution Approach
DYSWIS: “Do you see what I see” [1]
Peers (nodes) perform diagnostic tests when another peer reports or detects a failure
Nodes can choose the diagnostic test depending on the dependency encoded as a decision tree
Nodes (at least some) will be initially preloaded with the dependency relationship in some format (e.g., XML based)
Nodes (at least some) may build and update the dependency relationship based on statistical and temporal analysis of failure events which they receive and diagnostic tests which they perform
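
A preloaded, XML based dependency relationship might look like the sketch below. The slides do not define a schema, so the element names (`dependencies`, `service`, `depends-on`), the attribute names, and the test names are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical dependency file; the schema is an assumption for illustration.
DEPENDENCY_XML = """
<dependencies>
  <service name="sip-call">
    <depends-on name="sip-proxy" test="sip-ping"/>
    <depends-on name="dns-server" test="dns-lookup"/>
    <depends-on name="connectivity" test="ping"/>
  </service>
</dependencies>
"""


def load_dependencies(xml_text):
    """Return {service: [(dependency, diagnostic test), ...]} in file order."""
    root = ET.fromstring(xml_text.strip())
    table = {}
    for service in root.findall("service"):
        table[service.get("name")] = [
            (dep.get("name"), dep.get("test"))
            for dep in service.findall("depends-on")
        ]
    return table
```

A node could load such a table at startup and later append entries learned from failure events and test results.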
7
Solution Approach
Store context information of past failures experienced by each node
  E.g., the specific server that was acting as the proxy server (for my call which failed)
Store locality of past failure instances
  LAN, domain, subnet
  First hop at each layer, e.g., switch (MAC), default gateway (IP), domain’s proxy (application layer)
Failure count for each network element (statistical)
Last failure timestamp for each network element
Last successfully seen timestamp for each network element (why do I need to test the proxy for you, my call just went through)
Temporal correlation of past failures (proxy seems to be failing after DNS fails)
Each node has a runtime dependency list based on past failures and diagnostic tests
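
The per-element bookkeeping listed above (failure count, last failure timestamp, last successfully seen timestamp) can be sketched as follows. The class names and the 30 second freshness window are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ElementHistory:
    failure_count: int = 0
    last_failure: Optional[float] = None   # timestamp of last observed failure
    last_success: Optional[float] = None   # timestamp when last seen working


class FailureContext:
    """Per-node record of what happened to each network element."""

    def __init__(self, fresh_seconds: float = 30.0):
        # Hypothetical window in which a past success is still trusted.
        self.fresh_seconds = fresh_seconds
        self.elements: Dict[str, ElementHistory] = {}

    def record(self, element: str, ok: bool, now: float) -> None:
        history = self.elements.setdefault(element, ElementHistory())
        if ok:
            history.last_success = now
        else:
            history.failure_count += 1
            history.last_failure = now

    def needs_test(self, element: str, now: float) -> bool:
        """Skip the probe if the element was recently seen working."""
        history = self.elements.get(element)
        if history is None or history.last_success is None:
            return True
        if history.last_failure is not None and history.last_failure >= history.last_success:
            return True
        return now - history.last_success > self.fresh_seconds
```

`needs_test` captures the "my call just went through" shortcut: a peer that used the proxy a few seconds ago can answer without running a new probe.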
8
Solution Architecture
[Figure: peers P1 to P8, spread over Service Provider 1, Service Provider 2 and Domain A and joined by P2P links, together with a DNS server and a SIP server. Labels: Call Failed at P1; DNS Test; SIP Test; PESQ Test.]
Nodes in different domains cooperating to determine cause of failure
9
Solution Architecture: Logical View
[Figure labels: admin input (dependency relationships and tests, XML); dependencies encoded as decision tree, static and dynamic rules; triggers to perform tests (peer selection and probe selection); alerts; failures in network; test results; decision tree updates; dependency graph generation (Bayesian network based, inference, other models).]
The above figure shows the logical entities and the separation of dependency graph generation from the distributed diagnostic infrastructure (enclosed in blue).
10
Solution Requirements
Request-response protocol between the node which experiences the failure and the peer nodes
Nodes' capability to perform diagnostic tests (probes), probe selection based on cost/result
Encoding the dependency relationship into a decision tree (given as input by an expert, e.g., as XML)
Peer node discovery, based on
  Location (local network, domain)
  Capability to perform tests (based on specific tests)
Dependency graph generation and updating, based on
  Network failure events
  Diagnostic test results correlated with failures
11
Test/Probe Selection
Which diagnostic probe to run: network layer or application layer, and for what kind of failures
A probe covering a broad range of failures can give faster but cruder, less accurate results
  E.g., PING vs. TCP connect vs. SIP PING tests
Cost of probe
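
One way to picture cost based probe selection is to give each probe a relative cost and the set of layers it can confirm, then pick the cheapest probe that reaches the suspected layer. The cost values and layer sets below are illustrative assumptions, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    name: str
    cost: int            # relative cost in hypothetical units
    verifies: frozenset  # layers a successful probe confirms are working


PROBES = [
    Probe("PING", 1, frozenset({"network"})),
    Probe("TCP connect", 2, frozenset({"network", "transport"})),
    Probe("SIP PING", 5, frozenset({"network", "transport", "application"})),
]


def select_probe(layer, probes=PROBES):
    """Return the cheapest probe whose success would confirm the given layer."""
    candidates = [p for p in probes if layer in p.verifies]
    if not candidates:
        raise ValueError(f"no probe covers layer {layer!r}")
    return min(candidates, key=lambda p: p.cost)
```

The opposite strategy is also reasonable: run the broadest probe first and fall back to narrower ones only when it fails, trading probe cost for fewer round trips.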
12
Dependency Classifications
Functional dependency
  At generic service level, e.g., SIP proxy depends on DB service, DNS service
Structural dependency
  Configuration time, e.g., Columbia CS SIP proxy is configured to use MySQL database on metro-north
Operational dependency
  Runtime dependencies or run time bindings, e.g., the call which failed was using a failover SIP server obtained from DNS, which was running on host a.b.c.d in IRT lab
13
Dependency Classifications: Layered Approach
Vertical and lateral dependencies: applications depend on other application layer services (e.g., SIP service depends on DB, DNS service) as well as lower layer services
OSI layers as service dependency layers
  Application layer service also depends on transport layer service, which in turn depends on network layer service
  MAC layer: access point, switch
  Network layer: router
  Application layer: DNS, SIP, database
Topology based dependency
  E.g., calls from CS domain depend on a specific SIP server, calls from lab phones depend on specific switches and routers
14
Dependency Graph
15
Dependency Graph Encoded to Decision Tree
[Figure: a dependency graph in which A depends on B, C and D, and the decision tree derived from it. Labels: A failed, use decision tree; Yes: invokes decision tree for C; Yes: invokes decision tree for B; Yes: invokes decision tree for D; No on every branch: cause not known, report, add new dependency.]
A = SIP call, B = DNS server, C = SIP proxy, D = connectivity
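
Using the legend above (A = SIP call, B = DNS server, C = SIP proxy, D = connectivity), the walk through the decision tree can be sketched as a recursive check of each dependency. The order C, B, D follows the figure labels; the `database` entry under the proxy and the `test_failed` callback are hypothetical additions for illustration, and the graph is assumed to be acyclic.

```python
def diagnose(service, dependencies, test_failed):
    """Return the deepest failing dependency of `service`, or None.

    dependencies: {service: [dependency, ...]}, assumed acyclic.
    test_failed:  callable that runs a diagnostic test for one element and
                  returns True when that element is found to be failing.
    """
    for dep in dependencies.get(service, []):
        if test_failed(dep):
            # "Invokes decision tree for <dep>": look for a deeper cause.
            deeper = diagnose(dep, dependencies, test_failed)
            return deeper if deeper is not None else dep
    # Every branch answered No: cause not known at this level.
    return None


DEPENDENCIES = {
    "SIP call": ["SIP proxy", "DNS server", "connectivity"],
    "SIP proxy": ["database"],  # hypothetical second level
}
```

When `diagnose` returns None for the failed call itself, the node would report the failure and record it as a candidate for a new dependency.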
16
Diagnostic Tests
SIP proxy
  Proxy server availability - SIP PING
  Call routing availability - INVITE tests
  Call path determination - SIP traceroute
Media path, quality related
  Speech quality degradation - MOS
  Echo, jitter - MOS, PESQ
  QoS - RTCP
NAT/firewall
  Checking binding expiration
  Firewall failure to open a port - one-way media
  How to determine which firewall is in the path? SIP signaling?
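
The slides do not say how SIP PING is implemented. A common way to check proxy liveness is a SIP OPTIONS request, so the sketch below assumes that and only builds the request text and recognises a reply; sending it over UDP or TCP is left out. The host names, the `probe` user, and the function names are hypothetical.

```python
import uuid


def build_sip_options(target_host, local_host, local_port=5060, user="probe"):
    """Build a minimal SIP OPTIONS request with the mandatory headers."""
    branch = "z9hG4bK" + uuid.uuid4().hex  # RFC 3261 branch magic cookie
    lines = [
        f"OPTIONS sip:{target_host} SIP/2.0",
        f"Via: SIP/2.0/UDP {local_host}:{local_port};branch={branch}",
        "Max-Forwards: 70",
        f"From: <sip:{user}@{local_host}>;tag={uuid.uuid4().hex[:8]}",
        f"To: <sip:{target_host}>",
        f"Call-ID: {uuid.uuid4().hex}@{local_host}",
        "CSeq: 1 OPTIONS",
        "Content-Length: 0",
    ]
    return "\r\n".join(lines) + "\r\n\r\n"


def is_sip_response(response_text):
    """True when the text starts with a SIP status line such as 'SIP/2.0 200 OK'.

    For a liveness check, any well-formed response shows the server answers.
    """
    status_line = response_text.split("\r\n", 1)[0]
    parts = status_line.split(" ", 2)
    return len(parts) >= 2 and parts[0] == "SIP/2.0" and parts[1].isdigit()
```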
17
Diagnostic Tests
DNS tests
DHCP
Switch/router
  ARP/RARP/multicast
  BGP failures
Conference mixers
Gateway
  Echo return loss - readings - analysis
DB
XCAP server tests
Presence service availability tests
18
Example Call Failure – Possible Causes
SIP proxy server
  Database
  Authentication
Media path failure
Gateway
  Specific call legs - ERL, authentication, etc.
DNS server failure
End station failure
Network failure, e.g., router, switch failure
Different calls will have different run time dependencies
19
Mapping to a Human Medical System
Doctors perform diagnostic tests to find out the cause of a disease when the symptoms are mentioned; they may learn new things about the disease as a part of the diagnostic tests
  Failures and triggered tests update the dependency graph
Medical researchers do different types of tests to learn about new diseases and to determine the cause of a disease and its relationship with other physiological systems
  Set of tests that can run periodically and can be used to build the dependency graph independent of failures
20
Solution Evolution
Learning the dependency graph from failure events and diagnostic tests
Learning using random/periodic testing to identify failures and determine relationships
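
A very simple form of learning from failure events is to count how often one element's failure is followed by another's within a short time. The sketch below does only that; the event format and the window length are assumptions, and it is not the Bayesian approach mentioned in the logical view.

```python
from collections import Counter


def follow_counts(events, window):
    """Count ordered pairs (first, later) of distinct failing elements.

    events: iterable of (timestamp, element) failure events.
    window: maximum gap, in the same time unit, for two failures to be
            treated as related (a hypothetical tuning parameter).
    """
    ordered = sorted(events)
    counts = Counter()
    for i, (t1, first) in enumerate(ordered):
        for t2, later in ordered[i + 1:]:
            if t2 - t1 > window:
                break  # events are sorted, so later ones are further away
            if later != first:
                counts[(first, later)] += 1
    return counts
```

A pair that keeps a high count, such as (DNS, proxy), is a candidate edge for the dependency graph: the proxy appears to depend on DNS.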
21
Future Directions
Self healing
Predicting failures
Protocols for labeling event failures, which would enable automatically incorporating new devices/applications into the dependency system
Decision tree (dependency graph) based event correlation
22
Reference
[1] User-oriented Management of VoIP Applications (http://www.ibr.cs.tu-bs.de/projects/nmrg/meetings/2005/nancy/dyswis.pdf)