PABX Cisco

8/11/2019 PABX Cisco

1/17

Cisco Systems, Inc.

All contents are Copyright 19922002 Cisco Systems, Inc. All rights reserved. Important Notices and Privacy Statement.

Page 1 of 17

White Paper

IP Telephony: The Five Nines Story

IP Telephony Avai labi l i ty

Overview

Availability can be defined as the

probability that a product or service will

operate w hen needed. When yo u buy a

netwo rk or a netwo rk service, you expect

availability. A network that isnt availableismore than annoying or frustrating, its

expensive. With increased dependency on

the netwo rk for business critical

applications and voice, a dow n network

costs everyone valuable time and money.

As enterprise organizations migrate legacy

voice services to IP telephony, a routine

consideration is availability of the IP

telephony solution. Everyone has high

expectations for voice service availability.

A common goa l is to consistently achieve

99.999% ava ilability w ith 5.3 minutes of

dow ntime per year o r less. Cisco Systems

understands that high availability and

reliability are a bsolute requirements for

all mission critical applications, including

IP telephony solutions. Through a

combination of PDIO (planning, design,

implementation a nd o perations)

best-practices, Cisco IP telephony solutions

can support 99.999%uptime. Five nines

(99.999 percent) availability is not only a

desired goa l, it i s an achievab le goa l tha t is

possible using the recommended high

availability IP telephony solution discussed

in th is paper.

The purpose of this document is to detail

how PBX a vailability is defined and to

demonstrate that C isco do es have a high

availab ility fi ve nines solution. The

perception is that legacy PBX solut ions are

fi ve nines end-to-end.

The truth is that legacy PBX availab ility

achieves five nines by defining availability in

a w ay that limits impact from

non-redundant components. Using asimilar defi nition, ava ilability of the Cisco

IP telephony solution is carefully examined

using industry accepted theoretical

availability analysis. This analysis includes

Telcordia (formerly Bellcore) MTBF

ana lysis using the parts count method,

software availability analysis, power

availability analysis and network design

availability analysis.

Using the definition for availability of a

redundant core architecture supporting

Cisco C allManagers and PSTN gateways

theoretical network hardware availability

w as show n to be 99.99993% in the

hardw are reliability section of this paper.

This was ba sed on a CI infra structure

model with redundant C atalyst 6509

chassis, redundant pow er supplies and

redundant supervisor modules for Cisco

Ca llMana ger access sw itches. Theoretical

software availability was computed based

on estimated software forced reloads for

early deployment software but was

negligible due to redundant devices at each

layer and f ast estimated reload t imes for

Cisco prod ucts (6 minutes). Po w er

availab ility w as 99.99962% and is based

on theoretical power availability analysis

done by APC C orporation, available at


2/17

Cisco Systems, Inc.


Page 2 of 17

http://w ww.a pcc.com/support. This level of power availability requires UPS systems w ith generator backup and UPS

monitoring with four hour response UPS service and support. Theoretical network design availability is 99.99947%

and is based on failover times between redundant components at each network la yer including the Cisco

Ca llMana ger and PSTN ga tewa ys. The analysis is provided in t he design section later in t his paper. Using norma l

serial ava ilability ana lysis w e compute overall a vailability a s the product of these results.

(.9999993) * (.9999962) * (.9999947) = 99.999% availa bility

Clearly C isco ha s a fi ve nines solution fo r IP telephony in a campus LAN d eployment mod el based on availab le

theoretical availability analysis techniques. Actual availability may vary due to other factors that include link/carrier

availab ility, netwo rk mana gement processes and expertise. By reading the follow ing definition of availab ility and

process by which availability is theoretically determined, anyone should be convinced that this is possible. The

reader should also be aw are that C isco is committed to providing high ava ilability and improving availability for

Cisco customers.

Legacy P BX Av ai lab il i ty What does 99.999% reliability mean to a customer? It means absolute minimum downtime in terms of existing calls

dropped and inability to placecalls. Five nines equates to less that 5.26 minutes of downtime in a year of operat ion.

Clearly this is an ad mirable target, a nd in mission critical applications, such as voice netwo rks, it may be a

requirement . Legacy PBX vendors are sometimes able to achieve five nines of availabili ty using a str ict definition of

outages that occur directly to a redundant PBX system. The definition does not include end-to-end ca lling where

broken phones, w iring or lost pow er can be a fa ctor. The definition, a s you w ill see below, is primarily a facto r of

tota l system uptime of t he PBX itself.

Reliability in PBX Designs

I t is important to understand how five nines reliabil ity is measured in thelegacy PBX world. In order to do that, let s

t ake a look a t the basiccomponents of a PBX and how they rela te to measuring fivenines upt ime. Thetwo diagrams

in Figure 1 below illustrate very simplistic view s of nonredunda nt and redundant PBX s.

Figure 1

Redun dant and N on-redun dant PB Xs

PSTN

T

DMBus

Legacy Line andTrunk Cards

CommonControl Cards

Single PowerSupply

Power Supply Backup

PowerSupply

Non-redundant PBX

PSTN

T

DMBus

T

DMBus

Legacy Line andTrunk Cards

Redundant CommonControl Cards

Redundant PowerSupplies

Power Supply Backup

Redundant PBX

PowerSupply

PowerSupply


3/17

Cisco Systems, Inc.


Page 3 of 17

Description of C omponents:

L ine CardsAnalog line cards support analog devices (phones, modems, fax, recorders, etc.). Digital line cardssupport pro prietary digital pho nes.

Trunk CardsProvide connectivity to thePSTN. Analog trunk cards for FXO, OPS, or TIE trunks. Digita l t runk

cards for T1, E1, PRI, BR I, etc.

Common ControlCPU, memory, and other cards used to perform call processing, diagnostics, administration,

etc., as w ell as common services (DTM F registers and senders for exa mple).

TDM BusThis is the switching matrix for the PBX that connects all the cards together for cont rol and voice

traffic.

Power SupplyTakes power from an external AC o r D C source and distributes it to the shelves to pow er the

cards.

Power Supply BackupThis is an internal UPS tha t is found in some PBXs. It provides pow er to operate the

system for several minutes, to pro vide protection f rom very brief pow er outages and to permit the system to go

through an orderly shutdown. For more than just a few minutes of protection an external UPSsystem isrequired.

The main difference betw een non-redunda nt and redundant PBX s is optional d uplication o f the Co mmon C ontrol

Ca rds, the Pow er Supplies, and t he TD M Bus. Sometimes redundancy only means CPU and memory redundancy.

Power supply redundancy isoften a separately orderable redundancy option. Note, however, that theline and trunk

cards are never redundant. No PBX vendor offers redundancy to the line card level.

Now tha t weveexamined the components of a PBX, wha t types of fa ilures would count in a calcula t ion of fivenines

reliability? For example,

Failure of a line card (typically 16 to 32 ports) putting phones out of service?

Failure of a trunk card (typically 4 to 30 bearer channels) putting a group of trunks out of service?

Loss of a couple of attendant consoles?

The answer? None of these would count as system downtime. In general, according to BellCore GR 512 standards,

outages of 64 ports or less are not counted in t he reliability calculations. The only thing t hat counts is total system

downt ime, which is loss of a ll lines, t runks, or a ll a t tendant consoles. In other words, in a non-redundant PBX you

would have to lose a CPU or power supply; in redundant system you would have to lose both the primary and the

backup element. Thats w hy redundant PBX s are so reliable.

Of course, if you think about it logically, i t would be difficult to consistent ly achieve99.999%uptime if you counted

minor failures, such as a non-redundant line card. Most PBX vendorsoffer four-hour response to minor outages such

as loss of a line or trunk card (unless specia l service level agreements are in place). For instance if a PBX has 48 line

cards in a PBX and four hour replacement ispossible, theM TBF would have to be about 400,000 hours or 45 years.

This uses the basic calcula tion of availabili ty equaling MTBF /MTBF + MTTR. 400,000 hours is achievable, but is

probably not implemented in most current PBX installations.

In fact, legacy PBX vendorsa re currently hard pressed to show five nines reliability for nonredundant systems. Again,

fi ve nines reliability typically infers that PBX s have maximum redundancy a nd pow er protection built in.


4/17

Cisco Systems, Inc.


Page 4 of 17

The legacy voice service provider has additional concerns including the local wiring system and phone reliability

w here redundancy, pow er backup and fast repair t imes are not possible. When end-to-end a vailability is closely

examined the theoretical ava ilability is less than fi ve nines. In fa ct, ma ny voice service providers target residential

service availabili ty in therangeof 99.95%. This iscloser to four hours of downtime per year and could be at tributed

to local power, broken phones or broken connections to themain switch. The main PBX may very well be five nines

according to the above defi nition but individual users w ill experience lower a vailability due to issues outside of t he

switch itself.

In summary, legacy PBX systems can maintain fi ve nines of ava ilability. H ow ever, this level of ava ilability only

includes the PBX itself and does not reflect the true end-to-end voice availability because it does not include

non-redundant line cards failures, pow er failures, bro ken phones and w iring systems. M any legacy voice service

providers have budgeted end-to-end availabili ty more in the range of 99.95% availabili ty or 4+ hours of downtime

per year.

IP Telephony Avai labi l i ty Definition

In a manner similar to legacy PBX vendors, Cisco can achievefi venines of availabili ty with theIP Telephony solut ion

by defining availabili ty in terms of a redundant core system. Using the legacy PBX availabili ty definition, Cisco can

define Campus IP telephony availabili ty as theabili ty to mainta in and originatecalls within the redundant core and

distribution layers using a redundant C isco C allM ana ger cluster.

Using this definition, any component or individual softwarefa ilure will forcea fa ilover to an alternate path or Cisco

CallManager device. With one level of redundancy and four hour MTTR (mean time to repair), theIP telephony core

can achieve fi ve nines of a vailability in a ma nner analogous to t he legacy PBX definition. This is show n w ith more

detail in the follow ing section.

Not ice that C isco ha s removed t he access layer from the definition. Access switches typically support a round 120users, fairly close to a line card o n a legacy P BX system. In the legacy w orld, this type of o utage has been removed

from the overall availabili ty definition and calcula tion. One would suppose thereason is that thelegacy PBX vendor

cannot achieve five nines of reliabil ity by including the access layer. Like legacy PBX vendors, Cisco cannot achieve

five nines without redundancy. Access switches typically have 60,000 to 150,000 hours MTBF (mean time between

failure) as computed by the Telcordia (formerly Bellcore) Parts Count Method. 1 Using this methodology we find

that a typical access switch with a 100,000-hour MTBF and a 4-hour MTTR has theoret ica l availabili ty of 99.996%

or 21 minutes of downtime per year. Clearly, no vendor can easily supply five nines of availabili ty end-to-end and in

fact current legacy voiceservice providers target availability significantly lower, around 99.95%. The Cisco IP phone

has less complicated ha rdw are and does mainta in M TBF over 400,000 hours w ith fi ve nines plus reliability.

1. Cisco uses a 2X multiplier from t he Telcordia computed value to provide theoretical results in supports full line with historical reliability da tacollected for Cisco equipment.


5/17

Cisco Systems, Inc.


Page 5 of 17

Figure 2

Com m on Infrastructureor AVVID

The diagram above shows a campus network core design often referred to as Common Infrastructure or AVVID

(architecture for voice, video & integrated da ta). The design redundancy and can support fi ve nines of a vailability

using the Telcordia parts count method . M ore informat ion on the Cisco fi ve nines solution is available in the next

section.

The IP Telephony Five Nines Solution

Availability o f the IP telephony solution can b e theoretically determined b y closely examining ava ilability

characteristics in several areas and applying known availability analysis techniques.2 By understanding and applying

this methodology, organizations will better understand the availability characteristics of a potential solution. This is

important to ensure the product or service provided meets expectations, to supply meaningful service level

agreements, to create ava ilability targets and to implement a n ava ilability improvement goal.

2. The book High Availability Network Fundamentals by Chris Oggerino provides detailed mathematical techniques for predicting availability ofa network system.

IPIP

IP

IP IP V

V

MDF

PSTN

Boundary

IP WAN

Access Switches

AccessSwitches

DistributionSwitches



CoreSwitches

CiscoCallManager

Cluster

PSTN Gateway


6/17

Cisco Systems, Inc.


Page 6 of 17

A meaningful ana lysis of a vailability includes the following a reas of netwo rk availability.

Hardware reliab ili ty Sof tware reliab ili ty

Link/Carrier availability

Power/Environment availability

Network Design reliabil ity

User-error & netwo rk support processes

Link/C arr ier availa bility ma y include all the area s as service provider netwo rks are subject to t he same failure

root-causes as any enterprise environment. Network design includes topology and failover capabilit ies of the system.

In IP telephony environments, this may include routing protocol convergence, C isco C allMa nager system fa ilover

times or spanning t ree convergence times. User-error & netw ork support pr ocesses includes netw ork ma nagement

practices that promote netw ork consistency, supportability a nd rapid pro blem resolution times.

This section provides a theoretical a vailability a nalysis of a campus IP telephony environment w here N+ 1

redundancy is implemented for the core, distribution and Cisco CallManager server devices. Access devices are not

included in the theoretical ana lysis. Link/C arr ier connectivity netw ork support processes are not evalua ted for th is

LAN model. Figure 3 below shows theload sharing redundant topology of thefi venines solution with devices outside

the dotted line removed fro m the ava ilability ana lysis.

Figure 3

IP IP

IPIP

IP

IP

IP

IP

IP IP

PSTN

Boundary

Access Switches




CoreSwitches

CiscoCallManager

Cluster

PSTN Gateways


7/17

Cisco Systems, Inc.


Page 7 of 17

Hardware Reliability

Hardware reliabil ity isest imated using theTelcordia Parts Count Method and applying theCisco 2X factor basedon historical failure data. This is accomplished by understanding MTBF values for devices, device components and

paths through the network. A theoret ica l MTTR of four hours isused for thecalcula t ion and isconsidered to be the

norm f or t heoretical vendor-comparison a vailability prediction a nalysis.

Using this methodology, hardware reliability wa s calculated to be 99.99993%for the IP telephony solution using the

availab ility definition in the above section. The following informa tion provides supporting detail for the hardw are

reliability a nalysis. Also keep in mind t hat many vendors based availab ility solely on hardw are reliability.

Hardware analysis is accomplished by analyzing MTBF information along a path from point A to point B. In IP

telephony environments, two paths arerequired for operat ion. One from phoneto Cisco CallManager and onefrom

phone to phone or phone to PSTN ga tewa y. To allow for the tw o paths, ana lysis of tw o paths w ill be done

individually and then together to determine hardware reliability in a typical CI recommended infrastructure model.

Devices or device pairs that a re on both paths need only to b e calculated o nce. Note that the second pat h below

includes gatew ays. This is not required for phone-to-phone availability w ithin the local environment.

Figure 4

Reliability Blocks for Path A nalysis

The reliability calculation begins with the calculation for one switch. The following tables show MTBF information

for the mod ules and cha ssis of one distribution switch. M TBF and availab ility of the sw itch is identified viaavailab ility calculations identifi ed in the book H igh Availability Netw orking Fundamentals . The follow ing

equation is used to d etermine overall ava ilability and can be explained in English as for the components I going

from 1 to n (thenumber of components), multiply by availabili ty i . In addit ion, thesecond table provides fa ilure rate

informatio n. This is the theoretical percentage of t ime where the switch w ill fail during a one-year interval.


CoreSwitches


PSTNGateways


AccessSwitches

Cisco CallManagerCluster

n

i=1

System Availability = = availability(i)


8/17

Cisco Systems, Inc.


Page 8 of 17

The next calcula tion can be applied to show availabili ty of a redundant parallel system. This method only accounts

for t he second f ailure while the repair fro m the fi rst failure is going on. Since the first repair only takes four hours,

thea vailability looks very high. In reality, this almost never happens. The real problems are first, that even when there

is a successful protocol switchover, a ll existing calls w ill be dropped and no new calls can be processed during the

switchover. The availability due to this problem is addressed in the design section of this paper. Other problems

include the inability to fa ilover due to an unavailable or broken standby path or that the switchover does not occur

because the failure is not detected. These are real problems that must be addressed via proper protocol design and

testing, but for t he most part have been show n to be successful with C isco layer III routing protocol, H SRP and

Spanning tree capabilities.

In this case of calculating basic hardw are reliability w e need to know the availability of redundant distribution

switches. For this equation N equals the number of componentsin parallel (in this case 2)a nd I equals thecomponent

number. In our ca se, this is equiva lent to 1 [ (1 .99985988) * (1 .99985988) ] or 99.99999802%.

Table 1 Distribution Switch Module Reliability

Part # Quantity QuantityRequired MTBF MTTR Availability

WS-6509 1 1 369,897 4hr 99.99892%

WS-CAC-1300W 2 1 316,456 4hr 99.99999%

WS-X6K-SUP1A-MSFC2 1 1 46,235 4hr 99.99135%

WS-X6408A-BGIC 1 1 93,457 4hr 99.99572%

Table 2 Distribution Switch Reliability (Non-Redundant)

Availability 99.985988%

Annual Downtime 73.7 minutes

MTBF 28,545 hrs

Annual Failure Rate 26.44%

Table 3 Distribution Switch Reliability (Redundant)


Annual Downtime < 6 seconds

MTBF 101,880,673

Annual Failure Rate .01%

n

i =1

Para llel availability = 1 [(1 component availability(i)

)]


9/17

Cisco Systems, Inc.


Page 9 of 17

Using this methodo logy w e can determine reliability of each redundant block a nd then use a normal series

methodology to d etermine overall hard w are reliability of the complete IP telephony pat h. In our ca se w e have

redundant blocks for two distribution layers, one core layer, Cisco CallManager layer and gateway layer. Overall

availability is then determined by factoring each layer. But first we will need to determine module availability, chassis

availability, and redundant chassis availability for each network layer. The following tables provide this information

using the standa rd methods used above. Follow ing this, availability of the complete path is provided.

Table 4 Core Switch Module Reliability


WS-6509 1 1 369,897 4hr 99.99892%

WS-CAC-1300W 2 1 316,456 4hr 99.99999%

WS-X6K-SUP1A-MSFC2 1 1 46,235 4hr 99.99135%

WS-X6408A-BGIC 2 2 93,457 4hr 99.99572%

Table 5 Core Switch Reliability (Non-redundant)



MTBF 21,866 hrs


Table 6 Core Switch Reliability (Redundant)



MTBF 59,787,111


Table 7 Access Switch Module Reliability


WS-6509 1 1 369,897 4hr 99.99892%

WS-CAC-1300W 2 1 316,456 4hr 99.99999%

WS-X6K-SUP1A-2GE 2 1 82,654 4hr 99.99999%

WS-G5484 2 1 432,470 4hr 99.99999%

WS-X6248A (10/100 card) 1 1 94,684 4hr 99.99577%


10/17

Cisco Systems, Inc.


Page 10 of 17

Table 8 Access Switch Reliability (Non-redundant)



MTBF 75,380 hrs.


Table 9 Access Switch Reliability (Redundant)



MTBF 710,343,430


Table 10 Gateway/Access Switch Module Reliability


WS-6509 1 1 369,897 4hr 99.99892%

WS-CAC-1300W 2 1 316,456 4hr 99.99999%

WS-X6K-SUP1A-2GE 2 1 82,654 4hr 99.99999%

WS-G5484 2 1 432,470 4hr 99.99999%

WS-X6608-T1 1 1 40,512 4hr 99.99577%

Table 11 Gateway/Access Switch Reliability (Non-redundant)



MTBF 36,511


Table 12 Gateway/Access Switch Reliability (Redundant)



MTBF 166,668,151



11/17

Cisco Systems, Inc.


Page 11 of 17

The only remaining theoretical availability numbers pertain to the Cisco CallManager servers. Ha rdware iscurrently

manufa ctured by outside vendors and the Telcordia M TBF information is not ava ilable. Cisco can only base

availab ility for this product on the actual return rate of the product. The MC S 7835 currently has over 7 million

hours of combined operat ion with a MTBF of over 1,000,000 hrs based on thenumber of returns during a 3-month

period. G iven the excellent current reliability of t he MC S-7835 chassis and m odules, the Cisco H igh Availa bility

Services group offers the following estimates of availability for primary and redundant Cisco CallManager systems.

The second table below provides reliabil ity for redundant Cisco CallManager devices, which isa capability with the

IP telephony solution.

Information is now available for the full path analysis. The following information is factored into the total path

analysis calcula tion, which is obta ined by multiplication of availabili ty numbers for each redundant module. Given

high availabili ty a t each layer overall hardware reliabil ity of theIP telephony solution in a campus environment with

an MTTR of four hours is 99.99993%. Again, this is determined simply by using the basic serial calculation method

of each of the seven redundant mo dules described a bove.

Table 13 Cisco CallManager Chassis Reliability (Non-redundant)

Availability 99.99%


MTBF 40,000


Table 14 Cisco CallManager Chassis Reliability (Redundant)



MTBF 200,040,000


Table 15 Redundant Layer Availability

Availability Path Analysis Based on Redundant (N+1) Modules

Distribution

1IP phone

Core Distribution

2PSTN

access

PSTN access

gateways

Distribution

3Cisco

CallManager

Access

layerCisco

CallManager

Cisco

CallManager

platform

99.99999 99.99999 99.99999 99.99999 99.99999 99.99999 99.99999


12/17

Cisco Systems, Inc.


Page 12 of 17

Figure 5

Software Reliability

To a ccurately predict the availability of a netw ork or device, the softw are contribution to d ow ntime must be

considered. The problem how ever is that no industry sta ndard exists for soft w are reliability. Several methods a re

available for calcula ting software reliabil ity but these methods are known to be inaccurate and difficult to perform.

The best method known is to analyzefi eld measurements for software reliabil ity. This section examines a few Cisco

internal field measurements and predicts softw are reliability ba sed on techniques documented in the boo k H igh

Availability Netw orking Fundamentals by Chris Oggerino.

Chris Ogg erino a nd Scott C herf, both of C isco Systems, have been actively collecting and ana lyzing softw are

reliability for over tw o years. Bot h individuals examined the number of softw are forced crashes from la rge sample

sets with a variety of C isco IO S versions. O ggerino a nd C herf also standard ized on a six minute M TTR a s a

conservative estimate for the time it w ould ta ke a router to reboot a nd reestablish neighbors and tra ffi c flow s in amedium sized enterprise environment. These measurements did not account for software defects that did not restart

thedevice, as these were difficult to measure. Using their methodology, Oggerino and Cherf conclude that Cisco IOS

General Deployment (GD ) softwareis five nines reliable. Cherf , for instance, concludes that Cisco IOS SoftwareG D

Release 11.1 is 99.99986% reliable w ith a MTBF of 71,740 hours and that the non-G D version of Release 11.1 is

99.9996%reliable with an MTBF of 25,484 hours. Bothindividualsalso showed that softwarematurity was a major

factor in softw are reliability.

Every Cisco customer should already know that softwarereliabili ty isprimarily a factor of softwarematurity. As the

softw are ages and gains acceptance, defects are reduced until the softw are is considered stable w ith no problems

expected. As IP telephony software increases in maturity, Cisco fully expects general deployment highly reliable

softw are. C isco is a lso committed to releasing fi ve nines softw are by conducting extensive release tests on new

softw are. Unfort unately, some problems surface with customer traffi c patterns in real w orld environments. C isco

therefore recommends that all early deployment software be tested and/or piloted in the real-world environment to

mitigate risk. Best-practices for cha nge management and new solution d eployment a re detailed in t he Cisco w hite

papers a t htt p://w w w.c isco.com/w ar p/public/126/index.sh tml.

Because the Cisco IP telephony solutions utilize Cisco IOS software classified as early deployment, a conservative

est imate is needed that reflects this sta tus. Based on field measurements for new software, a conservat ive est imate is

created f or this ana lysis of 10,000 hrs. M TBF and six minute MTTR. Oggerino uses this number in his book a s an

extremely conservat ive number. 3A conservative estimate is also created fo r G eneral d eployment softw are of


CoreSwitches


PSTNGateways


AccessSwitches

CiscoCallManager

Cluster


13/17

Cisco Systems, Inc.


Page 13 of 17

30,000 hours. These expected M TBF values will actua lly impact a vailability mo re due to failover. In the netwo rk

design section, a 10,000 hour M TBF is used for newer and more complicated layers including gat ewa y, Cisco

CallManager and distr ibut ion layer. A 30,000-hour MTBF is used for more reliable software in the core and accesslayer. This analysis is provided in the design section of the paper later on.

We can use the softw are reliability number in the hardw are calculation by facto ring in softw are reliability a s a

component or mo dule of each d evice. This can be done because softw are forced cra shes or softw are restarts are

independent activities isolated to one hardware platform. Field measurements also confirm that software failures do

not occur to both redundant devices in a par ticu la r network layer a t the same t ime. This isnot to say tha t so f tware

defects cannot affect bo th devices in the same network layer. R outing proto cols, HSRP, spanning tree and other

protocols can affect availabili ty of a network layer and should be designed carefully and tested prior to deployment .

Fortunately, these protocols have been aro und fo r years and are not expected to cause non-availab ility under

reasonable design conditions.

Using thepreviousdefinit ion of availabili ty and fully redundant system, software does not impact availabili ty of the

system to a signifi cant extent. For instance, hard w are reliability of a distribution switch is 99.98598%. Softw are

reliability using a 10,000 hour MTBF and six minute MTTR is 99.999%. This changes reliability of an individual

switch to 99.98498%. Reliabil ity of a redundant layer does not change significantly but dropsless than a second per

year from 99.99999804% to 99.99999774%. Since all calculations have been rounded off to five decimal places,

this does not appear to impact ava ilability. In the design section, w e will use this MTBF however to look a t

convergence times of devices and t he number of expected failovers to understand how this impacts ava ilability.

Power/Environmental Availability

Power and environment are also potential impacts to availability that affect the overall availability of the IP

telephony solution. Pow er is also unique in that it do es not impact o ne device at a time like softw are or hard w are.

Its affects, or ca n af fect, an entire building or multiple buildings at a time. This can impact a ll devices in the IPtelephony availability definition including distribution, core, gateway and Cisco CallManager all at once. The

calculations therefore change from d evice based to entire netwo rk based, w hich create a signifi cant impact t o

theoretical a vailability d epending on the pow er protection strategy used fo r IP t elephony.

To understand theoretical availability fo r pow er and the environment, Cisco has turned to APC C orporat ion for

theoretical ava ilability.4APC can be co ntacted, ht tp://w w w.apcc.com, f or mo re precise availability informa tion in

Nort h America based on geogra phical location, w eather patterns, site details and installation specifics. In general

however, thefollowing characteristics apply to North American power. The data isvariable from site to site and some

areas like Florida are expected t o experience an o rder of magnitude higher of non-availability. The data is also

expected to be representative of Japan and Western Europe.

The average number of outages suffi cient to cause IT system malfunction per year at a t ypical site is

approximately 15

90% of the outages are less than five minutes in duration

99% of the outages are less than one hour in duration

Total cumulative outage duration is approximately 100 minutes per year

3. Page 67 of High Availability Networking Fundamentals by Chris Oggerino

4. Based on technical no te #24 f rom APC C orporat ion loca ted a t http://sturgeon.apcc.com/kbasewb2.nsf/Tnotes+Ext ernal/6F45EFB214C9DDFF8525672300568C B6?OpenDocument


14/17

Cisco Systems, Inc.


Page 14 of 17

APC also provides theoretical availability ba sed on t he power protection strategy and concludes that ava ilability

levelsof five nines or higher requirea UPSsystem with a minimum of onehourbat terylife or a generator with a onsite

servicecontract or four hour response for UPS system failuresor problems. The Cisco recommended high availability

solution requires additional support to achieve five nines overall. The HA IP telephony solution therefore must

include UPS and generator backup for all distribution, core, gateway and Cisco CallManager devices. In addition,

the organization should ha ve UPS systems that ha ve auto-restart capability a nd a service contract f or four ho ur

response to support the UPS/generato r. G iven the follo w ing recommended core infrastructur e, APC co rpora tion

estimates that non-availability w ill be two minutes per year.

IP telephony high availability power/environment recommendations:

UPS & generator backup

UPS systems have auto-restart capability

UPS system monitoring

Four hour service response contract for UPS system problems

Ma intain recommended equipment operating temperatures 24X7

Overall power availability using the above solution is estimated to be 99.99962%. This impacts overall availability,

99.99993%, in the same wa y a new m odule w ould a ffect a device in a serial system. The calculation used to

determine overall estimated availability is then (.9999962) * (.9999993) or 99.99955%.

Network Design Availability

Netw ork design availability includes netw ork topology and netw ork protocol d esign used to provide scalability,

performance and convergence follow ing a device or link fa ilure. This is critical to availab ility. Without a properly

designed network with recommended protocols and configuration, convergenceor failover timescould easily exceed

fi ve minutes, eliminating a ny opportunity to achieve a fi ve nines solution due to expected softw are and hardw arefailures. To achieve the highest a vailability in netwo rk design, protocols need to determine w hen a failure has

occurred and immediately switch to a redundant component or device. The system must also recognize and monitor

backup components to ensure that the failover will be successful. Fortunately, Cisco is the recognized leader in these

networking capabilit iesand can provide high availability as long as the design and implementation meet IP telephony

design rules. This section helps to identify the design rules for the defined IP telephony availability solution and also

estimates availability ba sed on f ailure rates seen in softw are and hardw are.

The following design rules apply to the Cisco IP telephony five nines solution:

Standa rd Co mmon Infrastructure design with core/distribution/access layers

EIG RP or OSPF routing in the campus with default t imers

HSRP for all IP access subnets including phones and C isco CallMa nager servers Spanning tree environments eliminated or fast-converging spanning tree features (Use spanning tree only w hen

multiple access switches a re needed on 1 subnet/VLAN to support H A services)

Dual home access switches to tw o distribution switches

Dual home distribution switches to two core switches

In genera l do not t runk devices a t the same layer (except where multip le access swi tches areneeded on 1 subnet

VLAN to support HA devices)

Quality of Service configured across the LAN (or plenty of bandw idth)


15/17

Cisco Systems, Inc.


Page 15 of 17

When the recommended C ommon Infrastructure design for Ca mpus netw orks is implemented for IP telephony w e

can investigate the failure rate a t each netwo rk layer. This is accomplished b y fi rst understanding the time required

for failover and convergence at each netw ork lay er. Fortuna tely, this ha s been previously tested w ithin C isco. The

next step is to understand the annual failure rate a component at each layer for both software and hardware. The

failure rate can t hen be multiplied by the expected f ailover time to understand the overall impact to ava ilability. A

major assumption in this analysisis that thenetwork has adequateresources (bandwidth, CPU & processor memory)

to ha ndle the entire load w hile the alternate device is recovering. Figure 6 below can be used to id entify the layers

w here failover will occur.

Figure 6

IP Telephony Path

Of the seven layers above, four separate failover mechanisms will be discussed that lead to slightly different failover

or convergence times. The follow ing ta ble provides detail o n fa ilover a nd convergence requirements and ho w they

occur.

Table 16 Failover Analysis

Fa ilo ve r Ty pe A n aly sis

Distribution

layer

A failed distribution device or links may include three different failover mechanisms. These

include HSRP, IP routing and possibly spanning tree. Thom Bryant documents testing results for

failover in the Redundant Network Design Failure Analysis paper available internally at Cisco.

Thom has applied tuning to improve network convergence at the distribution layer. For the

purposeof this document,we aresimply takinga conservative number based on his test results,

30 seconds.

Core layer Distribution devices should have equal cost routes through the core, making IP convergence

unnecessary to route traffic. Thom Bryant documents testing results for failover in the

Redundant NetworkDesign FailureAnalysis white paper.Thomhas applied tuning to improve

network convergence at the distribution layer. For the purpose of this document, we are simply

taking a conservative number based on his test results, 10 seconds. The document is currently

available.

PSTN gateway The Cisco CallManager will attempt to initiate a PSTN call through the designated route group,

device list or alternate device list. If the Cisco CallManager cannot initiate a call, the alternate

device is selected. This takes approximately 10 seconds.


CoreSwitches


PSTNGateways

Distribution

Switches

Access

Switches

Cisco

CallManagerCluster


16/17

Cisco Systems, Inc.


Page 16 of 17

Availability has now been created for hardware, software, power/environment & network design. Keep in mind that

the theoretical ca lculation has ignored the number of dropped calls. This is currently d ifficult to estimate. C alls in

fact are not necessarily dro pped but can be perceived a s such w hen failover takes over a few seconds.

Overall pow er availability w as estimated to be 99.99962% and ha rdw are/softw are availab ility wa s expected to b e

.9999993. Using the estimate for netw ork design availability of 99.99947 w e can use a serial model to calculate

theoretical availability. This is (.9999962) * (.9999947) * (.9999993) which equate to 99.999% or approximately

5.3 minutes of dow ntime per year.

Cisco

CallManager

and access

switches

Redundant Cisco CallManagers are not currently stateful. This capability is expected in future

releases. Currently, a TCP keepalive is sent every 30 seconds. Three simultaneous negative

responses cause failover to the standby, which can initiate new calls. Current calls are not

dropped during failover however DTMF signaling info is lost and call may be reset. A failed

access device will causea failover to theredundantCisco CallManager. Because three keepalives

can occur in 61 seconds, the mean time before the 1st keepalive is used bringing the average

downtime to 75 seconds. Shortly following this, the redundant Cisco CallManager will have the

ability to accept new call requests.

Table 17 Expected Failure Rate Per Year

Distribution C ore Access Gatew ay CiscoCallManager

Software MTBF 10,000 hrs 30,0000 hrs 30,000 hrs. 10,000 hrs 1 0,000 hrs

Hardware MTBF (from

Table 2)

28,545 hrs 21,866 hrs 75,380 hrs 36,511 hrs 40,000 hrs

Software failure rate per

year

.876 (365*24/

10,000 =.876)

.292 .292 .876 .876

Hardware failure rate per

year

.306 .401 .116 .240 .219

Total failure rate1 1.182 .693 .408 1.116 1.095

Failure impact 30 seconds 10 seconds 75 seconds 10 seconds 75 seconds

Estimated down time per

year

35 seconds 7 seconds 31 seconds 11 seconds 82 seconds

Total downtime 166 seconds 2.77 minutes

Network design

availability

99.99947%

1.It is generally not a good idea to combine failure rates since software and hardware failure rates have different MTTR values. However this only applies toswitchover time due to a failed component.

Table 16 Failover Analysis

Fa ilo ve r Ty pe A n aly sis


17/17

Corporate HeadquartersCisco Systems, Inc.170 West Tasm an D riveSan Jose, CA 95134-1706USAwww.cisco.comTel: 408 526-4000

800 553-NETS (6387)

Fax: 408 526-4100

European HeadquartersCisco Systems Europe11, Rue Camille Desmoulins92782 Issy-les-MoulineauxCedex 9Francewww-europe.cisco.comTel: 33 1 58 04 60 00

Fax: 33 1 58 04 61 00

Americas HeadquartersCisco Systems, Inc.170 West Tasman DriveSan Jose, CA 95134-1706USAwww.cisco.comTel: 408 526-7660Fax: 408 527-0883

Asia Pacific HeadquartersCisco Systems, Inc.Ca pital Tow er168 Robinson Roa d#22-01 to #29-01Singapore 068912www.cisco.comTel: 65 317 7777

Fax: 65 317 7799

Cisco Systems has more than 200 offices in the following countries and regions. Addresses, phone numbers, and fax numbers are listed on the

Cisc o Web site at www.ci sco.com/go/offi ces

Argent ina Austra lia Aust ria Belgium Bra zil Bulga ria C a na da C hile C hina PR C C olombia C osta R ica C roa t i

C zech R epublic D enma rk D uba i, UAE Finla nd Fra nce G erma ny G reece H ong Kong SAR H unga ry India Indonesia Irela n

Isra el Ita ly Ja pa n Korea Luxembourg M ala ysia M exico The N etherla nds N ew Z ea la nd N orw ay Peru Philippines Po la n

Portuga l Puerto R ico R oma nia R ussia Sa udi Ara bia Sco t la nd Singa pore Slova kia Slovenia South Africa Spa in Sw ede

Sw it z er l a nd Ta iw a n Th a i l a nd Tu r k ey U k r a in e U n i t ed K in g d o m U n i t ed St a te s Ven ez u e l a Vi et n a m Z im b a bw

All contents are Copyright 19922002C isco Systems, Inc.All rights reserved. CC IP, the CiscoPoweredNetwork mark, the Cisco Systems Veri f ied logo , C isco Unity , Fast Step, Follow Me Browsing , FormShare, In terne

Q uo ti en t , i Q B r eak t hr o ug h , i Q E x pe rt i se , i Q F ast Tr ack, t h e i Q l og o , i Q N et R ead in es s Sco r eca r d, N e tw o r k in g A cad em y , Scr ip t Sh ar e, SMAR Tn et , Tr an sP a t h, an d V oi ce L AN ar e t r ade mar ks o f C i sco Sys te ms , I n c

Changing theWay WeWork, Live ,Play , and Learn , Discover All That sPoss ible, TheFas tes t Way to Increase Your Internet Quot ient , and iQuick Studyare servicemarks of Cisco Systems , Inc. ; and Aironet , ASIST, BPXCatalys t ,C CDA, CCD P, CCIE, CCNA, CCNP, Cisco , theC isco Cert if ied Internetwork Expert logo , C isco IOS, theC isco IOS logo, C isco Press, C isco Systems , Cisco Systems Capi tal , the Cisco Systems logo, Empoweri

the Internet Generat ion, Enterprise/Solver , EtherChannel , EtherSwi tch , GigaStack, IOS, IP/TV, LightStream, MG X, MICA, the Networkers logo, Network Regist rar ,Packet, P IX, Pos t-Rout ing , Pre-Rout ing , RateMUX

Registrar, SlideCa st, StrataView Plus, Stratm, Sw itchProb e, TeleRouter, and VC O ar e registered trademark s of Cisco Systems, Inc. and/or its affiliat es in the U.S. and certa in other countries.

User Error and Network Support Processes

User error and network support process can account for significant

dow ntime in any IT infrastructure.5However, the amount of

downtime that an organization will incur due to user error or lack

of pro cesses is diffi cult to th eoretically determine. The best-know n

method is to ana lyze expertise, staffi ng a nd processes to determine

gapsthat may impact availability. Unfortunately, this isw ell beyond

the scope of this document. For this purpose of theoretically

estimating availability, user-error and process have not been

included. This does not mean that it should be ignored. Bo th

trad itional PBX vendors and IP t elephony vendors experience

similar user-error and process issues and therefore should be

add ressed in bo th environments.

Rather than focuson theimpact to availabili ty in this area , i t makes

sense to at least identify some of the user-error and process issues

that may impact availability in an IP telephony environment. These

processes should promote t he ability t o deploy solutions

successfully, quickly identify a nd resolve faults, keep the netw ork

modular & consistent, and provide the ability to resolve potential

problems proactively. M ore informat ion is availab le on these

processes in the IP telephony so lutions guid e at: http: //

w w w.cisco.com/w ar p/custo mer/788/solut ion_guide/

New technology deployment planning & design

Staffi ng & expertise

Operational support plan (Service repair plans & problem

priorities, conta cts, escala tions)

Fault monitoring & d etection

Cisco CallManager & device backups

Performance ma nagement/Ca pacity planning

Cisco Ca llManager a nd device configuration backups

Softw are & a pplicatio n mana gement (versions contro l)

Change M anagement

Lab validation

New technology pilot

Phone pro visioning

M AC (move, a dd, change) process

IP/D H CP mana gement

Telephone invento ry management

Dial plan management

Fraud d etection

Disaster recovery

5. A Gartner Group report sta tes that 40% of non-availabi li ty isdueto user-error andprocess.

PABX Cisco

Documents

Transcript of PABX Cisco