
Strategic approach to ascertain accurate decisions after unplanned service outage in Telecom operations

Submitted to PMI National Conference 2013

Author: Deep Kamal Singh

Table of Contents

1. ABSTRACT
2. KEYWORDS
3. INTRODUCTION
3.1. OBJECTIVE OF THE RESEARCH AND PAPER
4. THE CONCEPT
5. CASE STUDY
5.1. CASE STUDY: ACTIVITY DESCRIPTION
5.2. CASE STUDY: ACTIVITY EXECUTION
5.3. KEY OBSERVATION AND LEARNING DRAWN FROM THE EVENT
6. EVOLUTION OF THE STRATEGIC APPROACH
6.1. IDENTIFYING ACTIVITY PHASES WHERE THINGS CAN GO WRONG
6.1.1. PLANNING PHASE
6.1.2. EXECUTING PHASE AND MAINTENANCE PHASE
7. CONCLUSION

Table of Figures

Figure-1: Phases of activity
Figure-2: Strategic approach to ascertain accurate decisions after unplanned service outage

Tables

Table 1 – Technical Teams involved in activity
Table 2 – Case Study: Activity timelines and major milestones
Table 3 – Sample Service Impact matrix

1. Abstract

After 50 years and four generations of mobile technology, subject matter experts have established standard operating procedures for all activities executed in the ever-improving telecom ecosystem. However, system failures still happen, and no operator is untouched by unforeseen outages that affect services for end subscribers.

To define an effective action plan to follow during unplanned service outages, a team of project managers was formed. The team was involved in the planning and execution phases of various activities in a mobile network. Ideas and lessons were exchanged between the technical managers and the project managers (PMs) of all involved teams. The team adopted a qualitative and interpretive approach to consolidate the steps for resolving a crisis in the least possible time while minimizing revenue loss. Based on these lessons, a general action plan was formulated that stakeholders can follow to take the best decisions during service outages in telecom operations.

Currently practiced recovery approaches are native to a specific domain or subsystem of a service industry; there is no commonly applicable practice for taking the right decision in any service outage. This paper is therefore an initiative to construct a standard action plan, applicable to the overall telecom ecosystem, that project managers and apex decision makers can refer to when sudden, unexpected outages and service downtime occur in an otherwise well-functioning environment. Although this paper is based on explorations of activities in telecom network operations, its guidelines and underlying approach can be applied by any service industry in case of unplanned outages.

2. Keywords

Risk planning, minimizing uncertainty, revenue loss control, decision making

3. Introduction

The telecommunication ecosystem is a varying and ever-improving industry. For reasons such as business growth, technology implementation, adaptability, innovation and cost control, upgrades, tune-ups and additions happen continually in the network. For the same reasons, activities are regularly planned and executed, ranging from a small node-level change to a complete subsystem replacement or the introduction of an entirely new setup.

Continuous change in the ecosystem exposes the network and the business to the risks of service outages and revenue impact. To ensure business continuity and keep revenue impact to a minimum, technical teams must prepare a stepwise execution plan before every change, which is then reviewed and approved by all relevant stakeholders.

Often an activity implements changes that affect various services, which in turn requires several technical teams to align, coordinate and prepare their action steps in sync with one another. In such scenarios, when an unexpected and unplanned variation occurs, it becomes paramount to define the next course of action while keeping the actual need and the impact to business in the calculation.

In these cases it is essential to have all technical and business-critical information at hand, in order to quickly assess not only all possible decisions but also which decision will lead to the lowest cost to business and keep revenue impact to a minimum.

Activity managers and technical specialists are well-qualified experts in their field. They can readily work out key planning factors such as the steps of execution, the detailed activity plan and the estimated outages. Up to a certain level they can also highlight the risks and the probable phases where an unplanned service outage might occur during the course of the activity. However, it is neither their domain nor their expertise to analyze and predict the business impact, in terms of revenue, of an unplanned outage.

On the other hand, business-driving functions such as sales and marketing, revenue assurance and customer services are always well equipped with analytical data, which makes them the experts in calculating the revenue inflow from a live service. These functional domains can therefore estimate the impact to business, in terms of revenue, when a service is down. However, they lack the technical insight to understand the risk factors involved in a planned activity. They can neither predict which service is more prone to disruption than another, nor are they aware of the overhead cost to business if the activity is delayed or cancelled.

When an unplanned service outage occurs during a complex activity, it becomes increasingly difficult to arrive at the best decision, both because of the gap in domain boundaries between the teams involved and because there is no cross-reference matrix to help judge quickly whether to go ahead, delay or roll back the activity.

Activity owners must therefore compile, well before the activity, a reference approach with all necessary analysis, to be followed in case of unexpected downtime or service outages.

This paper is an initiative to establish a common reference approach that can be consulted in case of unexpected outages during a planned activity, in order to reach an appropriate decision that ensures the least cost to business.

3.1. Objective of the research and Paper

To establish whether a methodical approach and standard operating procedure can be formulated which, when followed in the event of an unplanned and unexpected service outage, will ensure that the best possible decision is taken in the least time.

4. The Concept

As outlined in the section above, regular activities in the telecom domain require planned outages of live services. For every activity, the technical managers involved perform a diligent analysis and prepare documents covering the stepwise execution in great detail. The services that must implement changes at their own end because of the activity are also required to plan accordingly. The technical team owning each service therefore prepares its own set of activity steps, and the project manager of the activity collates these into a combined execution plan with a team name and ownership assigned against each step. This document is signed off and approved by all stakeholders, so that everyone stays informed of the downtimes and service outages involved, along with the stepwise execution planned in sequence and in parallel across all services.

However, when an unexpected outage occurs during execution, or when some services do not function properly after a step, deciding the next step becomes the first priority. Technical teams get engulfed in finding the root cause and then a solution, while the parallel work in progress on other connected systems and services may lose direction, because those teams have no clear understanding of whether to go ahead, hold or stop completely.

It is observed that ownership of decision making is also not apparent in such situations. The owners of the malfunctioning connected service will not favor carrying on with the rest of the activity until they know the root cause and the estimated time to implement a solution. The core team conducting the main track of the activity will always suggest carrying on, so that their planned timelines are not impacted and other services that are functioning properly do not face any delay. The business team could mandate the decision, but it lacks the technical insight to take an informed one.

Events like this severely consume the time approved for service outages. The lack of timely decision making then sets off a chain reaction that extends the outage duration even for those services that are progressing as planned, and as a result the business has to face un-estimated, unaccounted consequential costs.

For example, consider a complex activity in which a core system is being upgraded to a higher version. When the upgraded core system is switched on for initial testing, it is found that, among many connected services, one particular service is unable to connect to the new version. This behavior was not expected, and the launch of the new version is put on hold while the core team awaits management's "go ahead" decision. It is now imperative to take the most appropriate decision in the least possible time, because the cost to business is increasing minute by minute due to the outage of all connected services. Decision makers need to be able to derive the next course of action:

1. Go live and end the downtime of all other services, except the disrupted one.

2. Delay the go-live until the problem is resolved, thus increasing the downtime of all connected services.

3. Roll back the complete version upgrade so that all services function as they did before the activity, and plan the upgrade again, increasing the cost of the activity.

How decision makers can determine which of the above options is most beneficial to the business when a crisis occurs during a planned activity was the key challenge identified in the joint planning sessions.

5. Case Study

In Q3 2010, a leading telecom operator in India planned a nationwide upgrade of its core prepaid billing system (BSS). The activity was to be executed in 14 telecom circles across India. It was a complex activity involving hardware additions and retrofits and a complete software upgrade.

During the planning phase of this project, the group of project managers and functional heads jointly began to estimate the risks involved in each implementation and to prepare a mitigation approach. Out of many joint sessions on this subject, the concept was born that became the prime subject of this paper.

The operator had 14 sites running on the old billing system, so the same upgrade was planned for all of them.

Because different teams manage the same services at each site, a great deal of planning, coordination, testing and control was required at every site to execute the activity. Every step of the activity was therefore properly documented, reviewed and approved by all stakeholders. Even with this level of planning and coordination, various deviations were observed during execution at the first few sites, which increased the cost of the activity to business and caused sudden revenue losses.

After every implementation, the technical teams recorded the lessons needed to ensure that the same problems would not be faced in the next implementation. The project managers and functional heads observed that a reliable approach is required to ensure that critical decisions are taken in the least possible time.

5.1. Case Study: Activity description

Activity: Upgrade of the core billing system (IN system) to a newer version.

This activity requires a complete outage of the billing system for 8 hours.

While the billing system is unavailable, the connected services listed below also face downtime:

1. Voice calls – Local / National Long distance / International long distance

2. SMS

3. Data Browsing

4. Real-time data charging

5. Recharges – Voucher recharge and E-topup

6. USSD

7. Unified Subscriber life cycle management – Activation, Churn, daily jobs, and other offline processes

8. Business reporting – MIS

A complex activity involves changes to more than one functional system and therefore requires many technical teams to work in a coordinated and controlled manner. Table 1 lists the technical teams that were involved in the activity discussed.

SNO | Service | Owning Team
--- | --- | ---
1 | Voice calls | Circle Team
2 | SMS | Circle Team
3 | Data Browsing | Circle Team
4 | Real-time data charging | Data Charging Team
5 | Voucher recharge | Billing System Team
6 | ETOPUP | ETOPUP Team
7 | USSD | USSD Team
8 | Unified Subscriber life cycle management | Unify Team
9 | Customer Service | CS Team
10 | Revenue assurance and CDR analysis | RA Team
11 | Business Reporting | Mediation Team

Table 1 – Technical Teams involved in activity

Because of the version upgrade of the billing system, the underlying communication protocol between the billing system and the connected services also changes at several layers. This demands a parallel upgrade of the IT applications: ETOPUP, USSD, the Unified app and online data charging.

After several weeks of in-depth analysis and joint solution development sessions involving all functional team managers, it was concluded that the IT applications would have to upgrade their clients in parallel, during the activity night, to support the new billing system. The timelines were finalized accordingly.

Table 2 gives a high-level view of the activity and its timelines, showing the functional teams involved in the complex upgrade.

Day | Time | Activity | Team Ownership
--- | --- | --- | ---
D-1 | 18:00 | Subscriber provisioning will be stopped (new subscriber creation and deletion will be stopped) | Unify Team
D-1 | 22:00 | Etopup and paper recharge will be stopped | Etopup Team, Billing System team
D-1 | 22:00 | 'Core balance <= 0' subscriber base dump with IMSI details to be shared with Switch team for barring at HLR | Circle Team
D-1 | 22:00 | All changes from any node towards billing system will be stopped | Billing System team
D-1 | 23:30 | Billing system's interface for incoming connectivity will be stopped (all IT apps' communication towards IN will stop) | Billing System team
D | 00:00 | Billing system bypass for local and national voice calls and SMS | Circle Team
D | 00:15 | Billing system will be out of service after closing all CDR files | Billing System team
D | 01:00 | Subscriber and service data complete dump to be provided to RA | Billing System team
D | 00:30 | Billing system upgrade start | Billing System team
D | 02:50 | Information given to all IT teams (Data charging, ETOPUP, USSD, Unified processes and other downstream systems) to get their applications ready for the new version of the billing system | Etopup Team, USSD Team, Unify Team, Data charging Team, Customer Service Team, Billing System team
D | 05:00 | Confirmation of completion of activity from all IT teams | Etopup Team, USSD Team, Unify Team, Data charging Team, Customer Service Team
D | 05:30 | Billing system upgrade complete | Billing System team
D | 05:30 | Post-upgrade billing system data dumps to be provided to RA for recon | Billing System team
D | 05:45 | RA to confirm on provided data and give go-ahead | RA Team
D | 05:45 | Test traffic to be routed on upgraded billing system | Circle Team
D | 05:45 | UAT on critical products will be started by Customer Service/RA/ETOPUP/Roaming/ICR/USSD/Data charging/Unified teams | Etopup Team, USSD Team, Unify Team, Data charging Team, Customer Service Team, Billing System team
D | 06:25 | CDRs for test numbers will be shared with RA team | RA Team
D | 06:30 | Go-ahead confirmation will be given by Business UAT team | Business Team
D | 06:35 | Final go-live confirmation from management team | CxO Team
D | 07:00 | Billing system bypass will be removed and system will start handling live traffic, ending downtime | Circle Team
D | 17:00 | Complete product and services testing to be completed | CS Team, RA Team, Billing System team

The rows from D 00:00 through D 07:00 fall within the window marked DOWNTIME in the original table.

Table 2 – Case Study: Activity timelines and major milestones

5.2. Case Study: Activity Execution

Subscriber Base: 5.4 Million

The core billing system upgrade was completed on the schedule projected in the timelines. However, when the new version of the billing system was brought up for testing, it was found that ETOPUP was unable to connect, whereas all other services could connect and perform testing at their end. The technical teams started working to find the cause and a solution. Minute by minute, the time allocated for ETOPUP testing before go-live was shrinking, while testing by all other teams continued in parallel.

When the time allocated for testing ran out, the testing status from all teams was shared with the business, but the problem with the ETOPUP service had still not been found. At this stage the business owner of the ETOPUP service was advocating a rollback of the complete activity. The billing system team was adamant that, since the other services were working with the new version, there was no fault at their end and therefore no rollback should be done. The core business team had no visibility of the technical details with which to decide the next course of action, and it was not known how long the issue might take to fix. During these discussions no one took ownership of either confirming the go-live or calling off the activity and rolling back. Meanwhile the outage of all services (voice calls, SMS, VAS, USSD and others) kept lengthening, and the revenue loss to business rose with every passing minute.

After 95 minutes of extended downtime, the issue was identified and fixed, and all services were made live with the new version of the billing system.

5.3. Key observation and learning drawn from the event

When it was crucial to identify and compare the growing revenue losses, the business team did not have enough information to decide the next step.

No delay threshold had been predefined and agreed for the critical milestones of the activity, so a delay in the readiness of one service consequently delayed the go-live of the other services as well.

The activity involved many stakeholders and cross-functional teams. A specialist group of managers could have been designated specifically to help decision makers take decisions in crisis situations.

A methodical approach must be devised for future implementations to ensure quick decision making.

Based on the lessons and experience from the complex implementation at the first site, and after several rounds of analysis and review sessions, a commonly applicable reference was developed. It serves as the general methodological approach that apex decision makers can refer to in order to ensure that the right decision is taken in the least possible time during an unexpected outage in the telecom ecosystem.

6. Evolution of the strategic approach

For any planned activity there are three distinct phases in which an unexpected delay or failure can be encountered. An unexpected failure is a total loss to business, whereas in the case of an unexpected delay, revenue loss and the impact on the time-budget balance can be minimized through project management methodologies and best practices. The scope and conclusions of the study covered in this paper apply to minimizing business loss in cases of delay.

6.1. Identifying Activity Phases where things can go wrong

Figure-1: Phases of activity

Figure-1 illustrates the three phases of the activity with respect to the major cost to business if a delay or failure is experienced.

It was concluded that specific preparation is required at every phase to be ready for unexpected outages. Collecting the relevant business data is key to decision making in crisis scenarios. Before the activity is executed, the PM has to ensure that the technical team has prepared discrete figures for the hourly revenue impact associated with each service, and has also included all possible consequential costs to business after any service failure.

For example, if the SMS service is not functioning, the average revenue earned by the SMS service over the entire outage duration is the direct revenue loss, whereas subscribers calling customer care to enquire and complain about the disruption represent a consequential cost to business for the service downtime.

On the same principle, the group of PMs deduced the set of readiness points below, which were deemed the most important for planning the same activity at the next site or, more generally, for planning any change in any system or subsystem of a service industry. The approach below is intended to ensure quick and correct decision making during an unexpected outage.

6.1.1. Planning phase

Identify all services where changes will occur due to the planned activity. This is the first and foremost step in planning the activity and being ready for unexpected variations. While preparing the list of services to be affected, the project manager should quantify the level of impact on each service. A service may be impacted only partially or intermittently; the total impact to business must be calculated accordingly in such cases.

Identify all connected interfaces that will be impacted during the activity.

Work out and prepare a service impact matrix, as shown in Table 3.

o This matrix should include the hourly revenue-earning potential of every service, along with all consequential costs (whether calculable or incalculable).

The service impact matrix must be reviewed and approved by all stakeholders, including the technical and business teams.

A Sample Service Impact matrix

Column groups in the matrix:
SNO; Service Name; Subscriber Base (in Mn)
Hourly Revenue Potential [in 100K Rs.]: one column per hour, H01 to H24
Indirect cost to business: Customer Care Overhead; Dependent Services Impact; Customer Satisfaction Impact

The hourly revenue and indirect cost cells are left blank in this sample, so only the first three columns are listed here.

SNO | Service Name | Subscriber Base (in Mn)
--- | --- | ---
1 | Voice Calls | 14.34
1.1 | Local Home network | 14.34
1.2 | Local cross network | 14.34
1.3 | National Long Distance | 9.60
1.4 | International Long Distance | 1.12
2 | SMS | 14.34
2.1 | Local | 2.21
2.2 | National | 2.21
2.3 | International | 0.01
3 | Data Usage | 1.90
4 | Paper Recharges | 8.14
5 | ETOPUP | 11.89
6 | USSD Services | 14.34
6.1 | Subscriber Info | 14.34
6.2 | Subscription management | 4.30
6.3 | VAS Services | 1.80
7 | Subscriber Life Cycle | 14.34
7.1 | Subscriber Activation | 1.00
7.2 | Subscriber Churn | 0.01
7.3 | Service management | 0.01

Table 3 – Sample Service Impact matrix

This table illustrates the importance and criticality of each service in terms of its revenue-earning potential and the cost to business in case of an outage. (The figures shown are indicative and are not real.)

The revenue-earning potential is distributed over a 24-hour period because service usage varies from hour to hour. For example, the revenue earned by (3) Data Usage during H19 to H22 may be higher than that earned by (6.3) VAS Services during the same period. This data helps decision makers perform a quantitative analysis and make a go or no-go call when, during an activity, the VAS service is down but data usage is working fine.
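The hourly comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `window_revenue` is hypothetical, and the hourly figures are invented, since the sample matrix in Table 3 leaves those cells blank.

```python
from typing import List

HOURS_PER_DAY = 24


def window_revenue(hourly: List[float], start_hour: int, end_hour: int) -> float:
    """Sum the revenue potential for hours start_hour..end_hour inclusive.

    Hours are 1-based to match the matrix columns H01..H24.
    """
    if len(hourly) != HOURS_PER_DAY:
        raise ValueError("expected one value per hour, H01 to H24")
    if not 1 <= start_hour <= end_hour <= HOURS_PER_DAY:
        raise ValueError("hours must satisfy 1 <= start <= end <= 24")
    return sum(hourly[start_hour - 1:end_hour])


# Invented figures in units of 100K Rs. (assumption, not from the paper):
# data usage peaks during H19-H22, VAS revenue is flat through the day.
data_usage = [1.0] * 18 + [4.0] * 4 + [1.0] * 2
vas_services = [0.5] * 24

data_at_risk = window_revenue(data_usage, 19, 22)
vas_at_risk = window_revenue(vas_services, 19, 22)
print(f"Data usage H19-H22: {data_at_risk}, VAS H19-H22: {vas_at_risk}")
```

With figures like these, holding back a go-live in the evening because VAS is down would put far more data revenue at risk than the VAS revenue it protects, which is the kind of comparison the matrix is meant to support.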

6.1.2. Executing Phase and Maintenance Phase

Identify affected services: whenever a variation from the planned activities is observed, all affected services must be identified in order to calculate the magnitude of the impact. The revenue-impacting factors for each service must then be evaluated.

Identify the affected subscriber base and the probable service outage duration.

o A particular service may be only partially affected, with a limited set of users impacted, or the problem may be intermittent. During an outage, the first priority of the technical teams is to prepare and share these statistics.

o The expected outage duration of each service is another important input for comparing the revenue losses associated with each feasible decision.

Calculate the total revenue impact for each affected service using the service impact matrix, taking the affected subscriber base into account.

Considering the revenue-impacting factors associated with the activity, the following equation can be deduced.

Let, for each affected service Si (i = 1, 2, ..., n):

R(Si) = average revenue per user per minute of service Si

O(Si) = total estimated outage duration of service Si, in minutes

Sb(Si) = estimated percentage of the subscriber base of service Si affected during the outage

Su(Si) = total users of service Si

Then

Total loss to business during service outage = ∑ (i = 1 to n) [ R(Si) x O(Si) x Sb(Si) x Su(Si) / 100 ]

Prepare statistics for the indirect, additional and future costs to business due to service outages, as accurately as possible.

Based on the calculated revenue impact of each service and the future cost to business associated with the outage, formulate a list of possible next steps or options.

Identify the variable (uncertain) factors and the risks associated with each option.

Prepare an Option vs. Risk vs. Revenue Impact vs. Variance factor matrix: cross-functional managers and tech leads should prepare a tabulated listing of all options, with clear information about the revenue loss, risks and variable factors associated with each option.

o This matrix should be prepared for all three tracks:

Track 1 – Go-Live: Present decision makers with all information and choices for going live, meaning the activity is carried on as planned despite unexpected impacts on one or more services. This track ensures that the main activity is completed on time and that the outage duration of all properly functioning services stays within approved limits.

Track 2 – Delay: Present decision makers with all information and choices for holding all sub-activities until the unexpected problem at one or more services is fixed. This track ensures that all services will be live once the change is completely implemented. However, it inherently exposes the business to the risk of a bigger revenue loss if the affected services have low revenue-earning potential and/or the fix takes longer than expected.

Track 3 – Fall back: Present decision makers with all information and choices for calling off the activity completely. This ensures that all services function the same way as they did before the change was implemented. However, falling back means that all effort and cost invested in the activity become void; in addition, the separate cost of repeating the same activity in future should be accounted for.

From every track, rule out the options that are least feasible, carry the highest risk, or have the highest number of variable factors.

For every option listed in the tracks, evaluate the time each team will take to implement it.

o Ensure that every team is ready, technically and logistically, to go ahead with any of the options well in time.

Present decision makers with the final track, the concluded set of options, and the "time to implement" for each option.
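As a rough illustration of how the loss equation and the three tracks in this section fit together, the sketch below computes the direct loss for each track and picks the lowest. It is a simplified sketch under stated assumptions, not the procedure used in the case study: the class and function names are hypothetical, all figures are invented, and indirect costs, risks and variable factors are deliberately left out.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ServiceOutage:
    """One affected service, named after the symbols in the loss equation."""
    name: str
    revenue_per_user_per_min: float  # R(Si), in Rs.
    outage_minutes: float            # O(Si)
    affected_percent: float          # Sb(Si), 0 to 100
    total_users: int                 # Su(Si)

    def loss(self) -> float:
        # R(Si) x O(Si) x Sb(Si) x Su(Si) / 100
        return (self.revenue_per_user_per_min * self.outage_minutes
                * self.affected_percent * self.total_users / 100)


def total_loss(outages: List[ServiceOutage]) -> float:
    """Sum the direct loss over all affected services."""
    return sum(o.loss() for o in outages)


# Invented inputs (assumptions for illustration only).
FIX_MINUTES = 90                  # estimated time to fix the failing service
ROLLBACK_MINUTES = 60             # estimated time to roll back the change
REPEAT_ACTIVITY_COST = 500_000.0  # Rs., cost of redoing the activity later


def voice(minutes: float) -> ServiceOutage:
    return ServiceOutage("Voice calls", 0.02, minutes, 100, 1_000_000)


def etopup(minutes: float) -> ServiceOutage:
    return ServiceOutage("ETOPUP", 0.01, minutes, 100, 500_000)


track_loss: Dict[str, float] = {
    # Track 1: only the failing service stays down until it is fixed.
    "go_live": total_loss([etopup(FIX_MINUTES)]),
    # Track 2: every service stays down until the fix is in place.
    "delay": total_loss([voice(FIX_MINUTES), etopup(FIX_MINUTES)]),
    # Track 3: every service stays down during rollback, plus the repeat cost.
    "fall_back": total_loss([voice(ROLLBACK_MINUTES), etopup(ROLLBACK_MINUTES)])
                 + REPEAT_ACTIVITY_COST,
}

cheapest_track = min(track_loss, key=track_loss.get)
for track, loss in track_loss.items():
    print(f"{track}: Rs. {loss:,.0f}")
print("Lowest direct loss:", cheapest_track)
```

In practice the direct loss is only one column of the Option vs. Risk vs. Revenue Impact vs. Variance factor matrix; a track with the lowest computed loss may still be ruled out on risk or on the number of uncertain factors.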

7. Conclusion

To ensure that the right decision is taken in the least possible time, all stakeholders involved in the project must invest the effort required for in-depth analysis and preparation of the service impact matrix.

Before an activity is planned, a task force should be appointed with members from all teams. Its task is to swing into action when an unexpected outage occurs and quickly prepare the list of options for decision makers, using the strategic approach discussed and deduced in this paper and shown graphically below:

Figure-2: Strategic approach to ascertain accurate decisions after unplanned service outage
