1/30/2008 International SIP 2008 (Paris)
Peer-to-Peer-based Automatic Fault Diagnosis in VoIP
Henning Schulzrinne (Columbia U.), Kai X. Miao (Intel)
Overview
• The transition in IT cost metrics
• End-to-end, application-visible reliability is still poor (~99.5%)
  – even though network elements have become much more reliable
  – particular impact on interactive applications (e.g., VoIP)
  – transient problems
• Lots of voodoo network management
• Existing network management doesn't work for VoIP and other modern applications
• Need user-centric rather than operator-centric management
• Proposal: peer-to-peer management – "Do You See What I See?"
• Using VoIP as the running example – the most complex consumer application
  – but also applies to IPTV and other services
• Also used for reliability estimation and statistical fault characterization
Circle of blame
The OS vendor, VoIP service provider (VSP), application vendor, and ISP each point at another party:
• "must be a Windows registry problem: re-install Windows" (blame the OS)
• "probably packet loss in your Internet connection: reboot your DSL modem" (blame the ISP)
• "must be your software: upgrade" (blame the application vendor)
• "probably a gateway fault: choose us as provider" (blame the VSP)
Diagnostic undecidability
• symptom: "cannot reach server"
• more precisely: packet sent, but no response
• possible causes:
  – NAT problem (return packet dropped)?
  – firewall problem?
  – path to server broken?
  – outdated server information (server moved)?
  – server dead?
• five causes, five very different remedies
  – no good way for a non-technical user to tell them apart
• Whom do you call?
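The ambiguity can be made concrete with a small probe. This is a hedged sketch (the function name and return labels are invented for illustration): it triages the "no response" symptom with a fresh DNS lookup and a TCP connect, and still cannot separate the NAT, firewall, broken-path, and dead-server cases, which is exactly the undecidability point.

```python
import socket

def classify_no_response(host, port, timeout=2.0):
    """Coarse triage of a 'cannot reach server' symptom.
    Several causes remain indistinguishable from the end system alone."""
    try:
        # fresh lookup: catches outdated/moved server information
        socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return "name-resolution-failure"
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "reachable"                # symptom not reproduced
    except ConnectionRefusedError:
        return "host-up-service-down"         # a RST came back, so the path works
    except OSError:
        # silent drop: NAT? firewall? broken path? dead server? undecidable here
        return "silent-drop (NAT? firewall? path? server dead?)"
```

A connection refusal at least proves the path is intact; only the silent-drop case needs help from other vantage points.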
Traditional network management model
[Figure: SNMP-based "management from the center": a central manager polls the network elements]
Old assumptions, now wrong
• Single provider (enterprise, carrier)
  – has access to most path elements
  – professionally managed
• Problems are hard failures, and elements otherwise operate correctly
  – element failures ("link dead")
  – substantial packet loss
• Mostly L2 and L3 elements
  – switches, routers
  – rarely 802.11 APs
• Problems are specific to a protocol
  – "IP is not working"
• Indirect detection
  – MIB variables vs. actual protocol performance
• End systems don't need management
  – DMI & SNMP never succeeded there
  – each application does its own updates
Managing the protocol stack
Each layer of the stack has its own failure modes:
• SIP: protocol problems, authorization failures, asymmetric connectivity (NAT)
• RTP: playout errors, media echo and gain problems, VAD action, protocol problems
• UDP/TCP: TCP negotiation failure, NAT time-out, firewall policy
• IP: no route, packet loss
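The layer-to-fault mapping above can be kept as data, so a diagnosis node can decide which layers to probe for a given symptom. A minimal sketch, with names invented for illustration:

```python
# Symptom lists transcribed from the stack diagram; the table and the
# helper function are illustrative, not part of any real tool.
LAYER_FAULTS = {
    "SIP":     ["protocol problem", "authorization", "asymmetric connectivity (NAT)"],
    "RTP":     ["playout errors", "media echo", "gain problems", "VAD action"],
    "UDP/TCP": ["TCP negotiation failure", "NAT time-out", "firewall policy"],
    "IP":      ["no route", "packet loss"],
}

def layers_for(symptom):
    """Return the stack layers whose known fault list mentions the symptom."""
    return [layer for layer, faults in LAYER_FAULTS.items()
            if any(symptom in fault for fault in faults)]
```

For example, a NAT-related symptom implicates both the transport layer (binding time-out) and SIP (asymmetric connectivity), so probes for both would be scheduled.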
Types of failures
• Hard failures
  – connection attempt fails
  – no media connection
  – NAT time-out
• Soft failures (degradation)
  – packet loss (bursts): access network? backbone? remote access?
  – delay (bursts): OS? access network?
  – acoustic problems (microphone gain, echo)
  – software bugs (poor voice quality): protocol stack? codec? software framework?
Examples of additional problems
• ping and traceroute no longer work reliably
  – Windows XP SP2 turns off ICMP
  – some networks filter all ICMP messages
• Early NAT binding time-out
  – initial packet exchange succeeds, but the TCP binding is then removed ("web-only Internet")
• Policy intent vs. failure
  – "broken by design"
  – "we don't allow port 25" vs. "SMTP server temporarily unreachable"
Fault localization
• Fault classification: local vs. global
  – does it affect only me, or others as well?
• Global failures
  – server failures (e.g., SIP proxy, DNS, database)
  – network failures
• Local failures
  – specific source failure: node A cannot make a call to anyone
  – specific destination or participant failure: no one can make a call to node B
  – locally observed, but actually global failures: the DNS service failed, but only B observed it
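The local-vs-global split above reduces to comparing one's own observation with what cooperating peers report for the same probe. A minimal sketch, assuming peers return a boolean "did you see the failure too?" (the function and labels are invented for illustration):

```python
def classify(my_failure, peer_failures):
    """Label a failure by comparing our view with peers' views.
    peer_failures: list of booleans, one per peer that ran the same probe."""
    if not my_failure:
        return "no fault observed"
    if not peer_failures:
        return "undetermined (no peer views)"
    failing = sum(peer_failures)
    if failing == 0:
        return "local failure"        # only we see it
    if failing == len(peer_failures):
        return "global failure"       # everyone sees it
    return "partial failure"          # e.g., one provider or path affected
```

The "partial failure" branch matters for the locally-observed-but-global case: if only peers behind the same resolver agree, the fault is global to that service even though most of the network is fine.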
Proposal: “Do You See What I See?”
• Each node has a set of active and passive measurement tools
• Use packet interception (NDIS, pcap)
  – to detect problems automatically
    • e.g., no response to a SIP, HTTP or DNS request
    • deviation from normal protocol exchange behavior
  – to gather performance statistics (packet jitter)
  – to capture RTCP and similar measurement packets
• Nodes can ask others for their view
  – possibly also dedicated "weather stations"
• Iterative process, leading to:
  – an indication to the user of the cause of the failure
  – in some cases, a work-around (application-layer routing via a TURN server, use of remote DNS servers)
• Nodes collect statistical information on failures and their likely causes

DYSWIS
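The iterative process can be sketched as a loop: run local tests first, and only consult peers when the local evidence is inconclusive. This is a hedged illustration of the flow, not DYSWIS code; all names are invented, and peer views are passed in as precomputed booleans.

```python
def diagnose(symptom, local_tests, peer_views):
    """One round of the iterative DYSWIS-style process (illustrative).
    local_tests: {test name: callable returning True if the test passed}
    peer_views:  {peer name: True if that peer reproduces the failure}"""
    evidence = {name: test() for name, test in local_tests.items()}
    failed = [name for name, ok in evidence.items() if not ok]
    if failed:
        # a local test already explains the symptom
        return f"local fault suspected: {failed[0]}", evidence
    # all local tests pass, so ask peers whether they see the same symptom
    if peer_views and all(peer_views.values()):
        return f"global fault: {symptom} visible to all peers", evidence
    return "inconclusive; gather more peer views", evidence
```

Each round either ends with a user-visible cause or widens the set of vantage points, which is what makes the process iterative.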
Architecture
[Figure: sensor, probe, and diagnosis nodes attached to a SIP proxy, DNS server, SMTP server, firewall, and other elements]

Three types of nodes: sensor, probe, and diagnosis
Architecture

• Sensor node: inspects protocol requests (DNS, HTTP, RTCP, …); reports "not working" notifications or summaries such as "DNS failure for 15m"
• Diagnosis node: requests diagnostics, orchestrates tests, and contacts other nodes (ping 127.0.0.1; can a buddy reach our resolver?)
• Admin notification: email, IM, SIP events, …
Solution architecture
[Figure: P2P nodes P1–P8 spread across Service Provider 1, Service Provider 2, and Domain A; after a call fails at P1, peers run a DNS test against the DNS server, a SIP test against the SIP server, and a PESQ test]

Nodes in different domains cooperate to determine the cause of a failure.
Failure detection tools
• STUN server: what is your IP address?
• ping and traceroute
• Transport-level liveness and QoS
  – open a TCP connection to the port
  – send a UDP ping to the port
  – measure packet loss & jitter
• Need scriptable tools with a dependency graph
  – using Drools for now
• TBD: remote diagnostics
  – fixed set ("do DNS lookup"), or
  – applets (only remote access)
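The TCP side of the transport-level liveness check is a few lines; a minimal sketch (the UDP ping and jitter measurement need a cooperating echo endpoint and are omitted here):

```python
import socket

def tcp_alive(host, port, timeout=2.0):
    """Transport-level liveness: True if a TCP connection to (host, port)
    can be opened within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:          # refused, timed out, unreachable, ...
        return False
```

Run against the SIP proxy's TCP port, this distinguishes "service accepting connections" from every harder failure class in one probe.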
Components and Operations

A distributed P2P architecture with an iterative process involving:
• data gathering from multiple perspectives
• knowledge, either pre-existing or built over time (learning)
• tools (with built-in intelligence) for active probing and observation
• inference, analysis, and decision making

• Peer nodes: detection, diagnosis, and probe nodes
• A P2P protocol for fault diagnosis
• Operation rules used to generate tests, built in or learned in real time
• Inference based on rules (inference modeling)
Components and Operations (continued)

[Figure: fault diagnosis architecture, components, and domain agents]
• Dependency graphs: dependency relationships encoded as decision trees
• Passive and active tests: monitoring deviation from normal network behavior; active and adaptive probes; diagnostic tests
• Analysis, inference, and diagnosis: diagnostic analysis, statistical inference, learning & modeling, fault profiles
• Fault types: hard vs. soft
Dependency classification
• Functional dependency: at the generic service level
  – e.g., a SIP proxy depends on a database service and a DNS service
• Structural dependency: established at configuration time
  – e.g., the Columbia CS SIP proxy is configured to use a MySQL database on host metro-north
• Operational dependency: runtime dependencies or run-time bindings
  – e.g., the call that failed was using a failover SIP server, obtained from DNS, running on host a.b.c.d in the IRT lab
Dependency Graph

[Figure: example dependency graph]
Dependency graph encoded as a decision tree

[Figure: node A (SIP call) depends on B (DNS server), C (SIP proxy), and D (connectivity). When A fails, its decision tree is used: each yes/no test either invokes the decision tree for B, C, or D, or, if no branch matches, reports "cause not known" and adds a new dependency.]
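The tree walk in the figure can be sketched in a few lines. The node names come from the slide (A = SIP call, B = DNS server, C = SIP proxy, D = connectivity); the code itself is an illustration, assuming each node has a probe that returns True when that component works.

```python
# Dependency graph from the figure: the SIP call depends on three components.
DEPENDS_ON = {"SIP call": ["DNS server", "SIP proxy", "connectivity"]}

def find_cause(failed, tests, graph=DEPENDS_ON):
    """Walk the decision tree below `failed`: descend into the first
    dependency whose own test fails; if all pass, the cause is unknown."""
    for dep in graph.get(failed, []):
        if not tests[dep]():                  # this dependency is broken:
            return find_cause(dep, tests, graph)  # invoke its decision tree
    if graph.get(failed):                     # dependencies exist, all passed
        return f"cause not known for {failed}: report, add new dependency"
    return failed                             # leaf: the actual cause
```

The "cause not known" branch is where the system grows: the unexplained failure is reported and a new dependency edge is added for next time.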
Current work
• Building the decision-tree system
• Using JBoss Rules (Drools 3.0)
Future work
• Learning the dependency graph from failure events and diagnostic tests
• Learning via random or periodic testing to identify failures and determine relationships
• Self-healing
• Predicting failures
• Protocols for labeling failure events, so that new devices and applications can be incorporated into the dependency system automatically
• Decision-tree (dependency-graph) based event correlation
Conclusion
• Hypothesis: network reliability is the single largest open technical issue preventing (some) new applications
• Existing management tools are of limited use to most enterprises and end users
• Transition to "self-service" networks
  – support non-technical users, not just NOCs running HP OpenView or Tivoli
• Need a better view of network reliability