Automated Fault Diagnosis in VoIP

31st March, 2006

Vishal Kumar Singh and Henning Schulzrinne

VoIP Diagnosis

What is automated VoIP diagnosis:

Determining failures in the network

Automatically finding the root cause of the failure

Why VoIP diagnosis:

Networks are complex, making it difficult to troubleshoot problems

Automatic fault diagnosis reduces human intervention

Issues in VoIP diagnosis:

Detecting failures/faults

Finding the cause of failure, determining dependency relationships among different components for diagnosis

Solution steps and approaches

Issues in Automated VoIP Diagnosis

Increasingly complex and diverse network elements

Complex interactions/relationships between different network elements

Different run time bindings for each application usage instance, e.g., different calls may use different DNS, SIP proxy servers, media path

A problem in one network element may manifest itself as a user-perceived failure of another element

Fault Identification

Service unavailability reporting

Node/Device/UA generates faults (failure events) e.g. SNMP Traps, failure messages

Monitoring application e.g., SNMP based application detects service unavailability and reports the failure event

Affected user reports service unavailability, e.g., by e-mail, by calling the helpdesk, or automatically by pressing a button on the phone while in a call and experiencing echo

Dependent application detects service unavailability and generates fault (failure events)

Fault Localization: Determining the Source of the Problem

Fault classification: local vs. global (does it affect only me, or does it affect others also?)

Global failures:

Server failures, e.g., SIP proxy, DNS, or DB failures

Network failures

Local failures:

Specific source failure, e.g., node A cannot make a call to anyone

Specific destination or participant failure, e.g., no one can make a call to node B

Locally observed but global failures, e.g., DNS service failed, but only B observed it
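The local-versus-global classification above can be sketched in a few lines. This is only an illustration: it assumes each peer that repeats the failing operation answers with a boolean, and the function name, the report format and the labels "local", "global" and "partial" are hypothetical, not taken from the slides.

```python
from typing import Dict


def classify_fault(peer_sees_failure: Dict[str, bool]) -> str:
    """Classify a reported failure from the answers of other peers.

    peer_sees_failure maps a peer name to True if that peer also
    observes the failure when it repeats the same operation.
    (Hypothetical report format, for illustration only.)
    """
    if not peer_sees_failure:
        return "unknown"   # no peer answered, nothing to compare against
    seen = sum(peer_sees_failure.values())
    if seen == 0:
        return "local"     # only the reporting node is affected
    if seen == len(peer_sees_failure):
        return "global"    # every peer that tried sees the same failure
    return "partial"       # some peers affected: look at their locality next
```

A "partial" result is where locality information (same subnet, same domain, same proxy) becomes useful for narrowing down the cause.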

Solution Approach

DYSWIS: “Do you see what I see” [1]

Peers (nodes) perform diagnostic tests when another peer reports or detects a failure

Nodes can choose the diagnostic test depending on the dependency encoded as a decision tree

Nodes (at least some) will be initially preloaded with the dependency relationship in some format (e.g., XML based)

Nodes (at least some) may build and update the dependency relationship based on statistical and temporal analysis of the failure events which they receive and the diagnostic tests which they perform
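As an illustration of what a preloaded, XML-based dependency description might look like, the sketch below parses a small document into a lookup table. The slides do not define a schema, so the element names, attribute names and test names here are all assumptions made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema: the slides only say "XML based", nothing more.
DEPENDENCY_XML = """
<dependencies>
  <service name="sip-call">
    <depends-on service="sip-proxy" test="sip-ping"/>
    <depends-on service="dns" test="dns-lookup"/>
    <depends-on service="connectivity" test="ping"/>
  </service>
  <service name="sip-proxy">
    <depends-on service="database" test="db-query"/>
    <depends-on service="dns" test="dns-lookup"/>
  </service>
</dependencies>
"""


def load_dependencies(xml_text):
    """Return {service: [(dependency, diagnostic test), ...]}."""
    root = ET.fromstring(xml_text.strip())
    deps = {}
    for service in root.findall("service"):
        deps[service.get("name")] = [
            (d.get("service"), d.get("test"))
            for d in service.findall("depends-on")
        ]
    return deps


DEPENDENCIES = load_dependencies(DEPENDENCY_XML)
```

A node loaded with this table knows which tests to run when a given service fails, and can later add or reorder entries as it learns from failure events.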

Solution Approach

Store context information of past failures experienced by each node, e.g., the specific server that was acting as the proxy server (for my call which failed)

Store locality of past failure instances:

LAN, domain, subnet

First hop at each layer, e.g., switch (MAC), default gateway (IP), domain’s proxy (application layer)

Failure count for each network element (statistical)

Last failure timestamp for each network element

Last successfully seen timestamp for each network element (why do I need to test the proxy for you, my call just went through)

Temporal correlation of past failures (proxy seems to be failing after DNS fails)

Each node has a runtime dependency list based on past failures and diagnostic tests
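The per-element bookkeeping listed above (failure count, last failure timestamp, last successfully seen timestamp) could be kept in a structure like the one below. It is a minimal sketch: the class names, the `needs_test` rule and the 60-second freshness threshold are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ElementHistory:
    failure_count: int = 0
    last_failure: Optional[float] = None   # timestamp in seconds
    last_success: Optional[float] = None   # timestamp in seconds


class FailureContext:
    """Per-node record of what was seen for each network element."""

    def __init__(self) -> None:
        self.elements: Dict[str, ElementHistory] = {}

    def _get(self, element: str) -> ElementHistory:
        return self.elements.setdefault(element, ElementHistory())

    def record_failure(self, element: str, timestamp: float) -> None:
        history = self._get(element)
        history.failure_count += 1
        history.last_failure = timestamp

    def record_success(self, element: str, timestamp: float) -> None:
        self._get(element).last_success = timestamp

    def needs_test(self, element: str, now: float, freshness: float = 60.0) -> bool:
        """Skip the test only if the element was recently seen working
        and has not been seen failing since then (assumed rule)."""
        history = self.elements.get(element)
        if history is None or history.last_success is None:
            return True
        if history.last_failure is not None and history.last_failure >= history.last_success:
            return True
        return now - history.last_success > freshness
```

This captures the "my call just went through" shortcut: a peer whose own call through the proxy succeeded a few seconds ago can answer without running a new probe.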

Solution Architecture

[Figure: labels include DNS Server, SIP Server, Domain A, Service Provider 1, Service Provider 2, P2P links, DNS Test, SIP Test, PESQ Test, and "Call Failed at P1".]

Nodes in different domains cooperating to determine cause of failure

Solution Architecture: Logical View

[Figure: labels include "Dependencies encoded as decision tree, static and dynamic"; "Admin input [dependency relationships and tests (XML)]"; "Triggers to perform tests (peer selection and probe selection)"; "Alerts"; "Dependency graph generation [Bayesian network based, inference, other models]"; "Failures in network"; "Decision tree updates"; "Test results".]

The above figure shows the logical entities and the separation of dependency graph generation from the distributed diagnostic infrastructure (enclosed in blue).

Solution Requirements

Request-response protocol between the node which experiences the failure and the peer nodes

Nodes' capability to perform diagnostic tests (probes); probe selection based on cost/result

Encoding the dependency relationship into a decision tree (given as input by an expert, e.g., as XML)

Peer node discovery, based on:

Location (local network, domain)

Capability to perform tests (based on specific tests)

Dependency graph generation and updating, based on:

Network failure events

Diagnostic test results correlated with failures

Test/Probe Selection

Which diagnostic probe to run (network layer or application layer), and for what kind of failures

A probe covering a broad range of failures can give faster but cruder, less accurate results

E.g., PING vs. TCP connect vs. SIP PING tests

Cost of probe
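The trade-off between probe coverage and cost can be sketched as follows. The cost values, the layer names and the two helper functions are hypothetical; the slides name the three probes but do not give a selection algorithm.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional, Set


@dataclass(frozen=True)
class Probe:
    name: str
    cost: float               # assumed relative cost (time, traffic)
    covers: FrozenSet[str]    # layers whose failure this probe would reveal


# Illustrative values only.
PROBES = [
    Probe("ping", 1.0, frozenset({"network"})),
    Probe("tcp-connect", 2.0, frozenset({"network", "transport"})),
    Probe("sip-ping", 5.0, frozenset({"network", "transport", "application"})),
]


def cheapest_probe_for(layer: str, probes: List[Probe]) -> Optional[Probe]:
    """Pick the lowest-cost probe that would reveal a failure at `layer`."""
    candidates = [p for p in probes if layer in p.covers]
    return min(candidates, key=lambda p: p.cost) if candidates else None


def localize(results: Dict[str, bool], probes: List[Probe]) -> Set[str]:
    """Given probe name -> passed?, return the layers still under suspicion.

    Probes are examined from narrowest to broadest coverage; the first
    failing probe blames the layers not already cleared by passing probes.
    """
    cleared: Set[str] = set()
    for probe in sorted(probes, key=lambda p: len(p.covers)):
        if probe.name not in results:
            continue                      # probe was not run
        if results[probe.name]:
            cleared |= probe.covers
        else:
            return set(probe.covers - cleared)
    return set()
```

A failing SIP PING alone only says that something between the network and the application is broken; combining it with the cheaper, narrower probes is what pins the fault to one layer.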

Dependency Classifications

Functional dependency: at the generic service level, e.g., SIP proxy depends on DB service, DNS service

Structural dependency: configuration time, e.g., Columbia CS SIP proxy is configured to use the MySQL database on metro-north

Operational dependency: runtime dependencies or run-time bindings, e.g., the call which failed was using a failover SIP server obtained from DNS which was running on host a.b.c.d in IRT lab

Dependency Classifications: Layered Approach

Vertical and lateral dependencies: applications depend on other application layer services (e.g., SIP service depends on DB, DNS service) as well as lower layer services

OSI layers as service dependency layers: an application layer service also depends on a transport layer service, which in turn depends on a network layer service

MAC layer: access point, switch

Network layer: router

Application layer: DNS, SIP, database

Topology-based dependency, e.g., calls from the CS domain depend on a specific SIP server; calls from lab phones depend on specific switches and routers

Dependency Graph

Dependency Graph Encoded to Decision Tree

[Figure: decision tree. Labels include "A failed, use decision tree"; "Invokes decision tree for C"; "Yes"; "Invokes decision tree for B"; "Invokes decision tree for D"; "Cause not known: report, add new dependency".]

A = SIP call, C = SIP proxy, B = DNS server, D = connectivity
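One way to read the decision tree is as a recursive walk over the dependency graph: test each dependency of the failed component, descend into the first one that is unhealthy, and report "cause not known" if none is. The sketch below follows that reading. The edges from C and B to their own dependencies and the `is_healthy` callback are assumptions, since the figure did not survive extraction.

```python
from typing import Callable, Dict, List, Optional, Set

# A = SIP call, C = SIP proxy, B = DNS server, D = connectivity.
# Only A's dependencies follow the slide; the other edges are assumed.
DEPENDS_ON: Dict[str, List[str]] = {
    "A": ["C", "B", "D"],
    "C": ["B", "D"],
    "B": ["D"],
    "D": [],
}


def _deepest_failure(failed: str,
                     depends_on: Dict[str, List[str]],
                     is_healthy: Callable[[str], bool],
                     seen: Set[str]) -> str:
    """Follow unhealthy dependencies until none is left to blame."""
    seen.add(failed)
    for dep in depends_on.get(failed, []):
        if dep in seen:
            continue                      # guard against cyclic dependencies
        if not is_healthy(dep):           # run the diagnostic test for dep
            return _deepest_failure(dep, depends_on, is_healthy, seen)
    return failed


def root_cause(failed: str,
               depends_on: Dict[str, List[str]],
               is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the blamed component, or None when the cause is not known
    (the case where a new dependency would be reported and added)."""
    cause = _deepest_failure(failed, depends_on, is_healthy, set())
    return None if cause == failed else cause
```

For example, if the proxy and DNS tests both fail while connectivity is fine, `root_cause("A", DEPENDS_ON, lambda c: c not in {"C", "B"})` blames the DNS server.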

Diagnostic Tests

SIP proxy:

Proxy server availability: SIP PING

Call routing availability: INVITE tests

Call path determination: SIP traceroute

Media path (quality related):

Speech quality degradation: MOS

Echo, jitter: MOS, PESQ

QoS: RTCP

NAT/firewall:

Checking binding expiration

Firewall failure to open a port: one-way media

How to determine which firewall in the path? SIP signaling?

Diagnostic Tests

DNS tests

DHCP

Switch/router: ARP/RARP/multicast, BGP failures

Conference mixers

Gateway: echo return loss, readings, analysis

DB

XCAP server tests

Presence service availability tests

Example Call Failure: Possible Causes

SIP proxy server: database, authentication

Media path failure

Gateway: specific call legs (ERL, authentication, etc.)

DNS server failure

End station failure

Network failure, e.g., router, switch failure

Different calls will have different run-time dependencies

Mapping to a Human Medical System

Doctors perform diagnostic tests to find out the cause of a disease when the symptoms are mentioned. They may learn new things about the disease as a part of the diagnostic tests.

In VoIP: failures and triggered tests update the dependency graph

Medical researchers do different types of tests to learn about new diseases and to determine the cause of a disease and its relationship with other physiological systems.

In VoIP: a set of tests that can run periodically and can be used to build the dependency graph independent of failures

Solution Evolution

Learning the dependency graph from failure events and diagnostic tests

Learning using random/periodic testing to identify failures and determine relationships
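A very simple way to learn candidate dependencies from failure events is temporal co-occurrence counting: if most failures of one element are preceded, within a short window, by a failure of another element, suggest that the first depends on the second. This is a stand-in for the Bayesian-network-based generation mentioned earlier, not that method itself; the event format, the window and the 0.5 threshold are assumptions.

```python
from collections import Counter
from typing import List, Tuple

Event = Tuple[float, str]   # (timestamp in seconds, failed element)


def preceded_counts(events: List[Event], window: float) -> Counter:
    """Count, per (earlier, later) pair, how many failures of `later`
    had a failure of `earlier` at most `window` seconds before them."""
    ordered = sorted(events)
    counts: Counter = Counter()
    for j, (t2, later) in enumerate(ordered):
        preceding = {e for t1, e in ordered[:j]
                     if t2 - t1 <= window and e != later}
        for earlier in preceding:
            counts[(earlier, later)] += 1
    return counts


def suggest_dependencies(events: List[Event], window: float,
                         min_ratio: float = 0.5) -> List[Tuple[str, str]]:
    """Return (dependent, dependency) pairs supported by the events."""
    failures = Counter(element for _, element in events)
    counts = preceded_counts(events, window)
    return sorted(
        (later, earlier)
        for (earlier, later), n in counts.items()
        if n / failures[later] >= min_ratio
    )
```

With events such as DNS failing at t=0 and t=100 and the proxy failing at t=5, t=104 and t=300, a 10-second window suggests that the proxy depends on DNS, which matches the "proxy seems to be failing after DNS fails" observation. Correlation of this kind only proposes edges; diagnostic tests would still be needed to confirm them.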

Future Directions

Self-healing

Predicting failures

Protocols for labeling event failures, which would enable automatically incorporating new devices/applications into the dependency system

Decision tree (dependency graph) based event correlation

Reference

[1] User-oriented Management of VoIP Applications (http://www.ibr.cs.tu-bs.de/projects/nmrg/meetings/2005/nancy/dyswis.pdf)
