

TECHNICAL PLAN SUMMARY

2007

Disaster Recovery

Presentation Purpose

Understand Impact of Various Disaster Situations
• What it means to operations
• Impact duration

Understand Alternate Work Options and Recovery Process

Understand Roles

Various Possible Disaster Scenarios

DECLARATION OF A DISASTER, which activates all DR procedures, would be made in the event of a facility loss or regional disaster.

Activation of subsections or various disaster alternate work means, however, may occur in the event of various service failures (example: phone services out of operation).

[Chart: Probability by scenario. Scenarios shown: Farr Destroyed, Regional Disaster, Hardware Failure, Software Corruption, WAN Problem]

Impact to Operations

Aside from a regional disaster, the most significant impact would occur with the loss of the Farr Regional Library.

Service                                         Farr  CP    LP    CV    Erie  Member

ILS
Comments: Store and forward on self checks will be used for up to two weeks.
Horizon                                         --    x     x     x     x     x
HIP                                             --    x     x     x     x     x
Self Check Store and Forward (Patron checkout)  --    x     x     x     x     NA
Telecirc                                        --    x     x     x     x     x

Communication Services
Comments: Reroute main number to <x>, use branch cell phones. Use FRED sharepoint for status updates/coordination.
Ability to make/send calls via headsets         --    x     x     x     x     x
Voicemail                                       --    x     x     x     x     x
Fred core                                       --    x     x     x     x     x
Fred Sharepoint                                 --    x     x     x     x     x
OWA                                             --    x     x     x     x     x
District Cell Phones                            --    x     x     x     x     --
Interlocation email                             --    x     x     x     x     x

Internet Presence/Services
Comments: Staff to use wireless network or public computers for internet access. AWW indicates available with workaround (contact the vendor to eliminate patron validation).
Staff Internet Access                           --    x     x     x     x     x
Mylibrary                                       --    x     x     x     x     x
Board, planning, other sharepoints              --    x     x     x     x     x
Download audiobooks                             --    AWW   AWW   AWW   AWW   AWW

Data WAN
Comments: Will have access to any local shares; for those at Farr, the replicated sections would be available at CP.
Access to shares (replicated)                   --    x     x     x     x     NA
Ability to Log in                               --    x     x     x     x     x

Patron Services
Comments: Public machines will function, however without patron validation.
Filtering                                       --    x     x     x     x     x
Public Computers Internet Access                --    x     x     x     x     x
PC Reservation/Print                            --

Impact to Operations

In the event of the loss of a branch or member library, impact would be less extensive.

Service                                         CP/LP Farr  CV    Erie  Member

ILS
Comments: Remote access would be available for all CP/LP staff if needed.
Horizon                                         --    x     x     x     x
HIP                                             --    x     x     x     x
Self Check Store and Forward (Patron checkout)  --    x     x     x     NA
Telecirc                                        --    x     x     x     x

Communication Services
Ability to make/send calls via headsets         --    x     x     x     x
Voicemail                                       --    x     x     x     x
Fred core                                       --    x     x     x     x
Fred Sharepoint                                 --    x     x     x     x
OWA                                             --    x     x     x     x
District Cell Phones                            --    x     x     x     --
Interlocation email                             --    x     x     x     x

Internet Presence/Services
Staff Internet Access                           --    x     x     x     x
Mylibrary                                       --    x     x     x     x
Board, planning, other sharepoints              --    x     x     x     x
Download audiobooks                             --    x     x     x     x

Data WAN
Access to shares (replicated)                   --    x     x     x     NA
Ability to Log in                               --    x     x     x     x

Patron Services
Filtering                                       --    x     x     x     x
Public Computers Internet Access                --    x     x     x     x
PC Reservation/Print                            --    x     x     x     x

DR PLAN - What’s Included

First, let’s confirm what we are trying to protect/manage with a disaster recovery plan.

1) Data (data only, not system configurations)

• ILS
• Email
• Individual Shares
• Department Shares

2) Systems – all systems to be rebuilt

3) Paper Copies – not included

What if we don’t do anything? Is it worth the effort? Basically, we’re looking at insurance plans to cover a risk/investment of $1.5-2.5 million.

Cost of Downtime

If down with no disaster recovery plan

Item                                                                                          Time in Hours  Days     Cost
System down                                                                                   80             10
Download OCLC data                                                                            80             10       $5,000
Lost fines and fees (not yet to debt collect, debt collect has more than 90 days)             NA             NA       $60,000
Fast track labor overhead (30 sec/item/est circ of 146000 items*3..April)                     625            78.125   $6,250
Update the catalog (ten minutes per bib) - 315000                                             50000          6250     $750,000
Impact to public availability (estimated at 10% of annual taxes) - PR impact, unavailability                          $604,170
Server rebuilds (mylibrary and fred)                                                                                  $60,000
Total                                                                                                                 $1,485,420

Magnitude of $1.5-2.5 million to recover
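The total in the cost table can be checked with a few lines of arithmetic. The sketch below copies the dollar figures from the table; the 8-hour workday used to convert hours to days is an assumption inferred from the table's own rows (80 hours shown as 10 days), and the variable and function names are illustrative only.

```python
# Dollar figures copied from the Cost of Downtime table.
costs = {
    "Download OCLC data": 5_000,
    "Lost fines and fees": 60_000,
    "Fast track labor overhead": 6_250,
    "Update the catalog": 750_000,
    "Impact to public availability": 604_170,
    "Server rebuilds (mylibrary and fred)": 60_000,
}

# Assumption: the table converts hours to days using 8-hour workdays
# (80 hours -> 10 days, 625 hours -> 78.125 days).
HOURS_PER_WORKDAY = 8


def hours_to_days(hours: float) -> float:
    """Convert labor hours to workdays under the 8-hour assumption."""
    return hours / HOURS_PER_WORKDAY


total = sum(costs.values())
print(f"Total: ${total:,}")          # Total: $1,485,420
print(hours_to_days(625))            # 78.125
```

The sum matches the table's stated total, which sits at the low end of the $1.5-2.5 million range quoted for recovery.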

Optimal Solution

The optimal solution is one that determines the best fit for cost versus time to recover (and also to what point in time data is recovered).

Terminology

There are a few basic terminology items to be aware of.

• Disaster Recovery (DR)
  • Disaster recovery. Typically this is associated with a technology recovery plan.

• Business Continuity Plan
  • An overarching business disaster recovery plan which includes staffing, public communication, and more

• Recovery time
  • Time for the system to be operational and available for use

• Point in time recovery
  • The amount (in time) of data that may be lost as a result of the process
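As a rough illustration of the last two terms, the sketch below computes recovery time and the point-in-time data loss from three timestamps. The timestamps and function names are hypothetical, chosen only to show the arithmetic; they are not taken from the plan.

```python
from datetime import datetime, timedelta


def recovery_time(failure: datetime, service_restored: datetime) -> timedelta:
    """Time from the failure until the system is operational again."""
    return service_restored - failure


def data_loss_window(last_good_backup: datetime, failure: datetime) -> timedelta:
    """Span of data that may be lost: everything after the last usable backup."""
    return failure - last_good_backup


# Hypothetical example values, for illustration only.
last_backup = datetime(2007, 9, 10, 22, 0)
failure = datetime(2007, 9, 11, 14, 0)
restored = datetime(2007, 9, 13, 9, 0)

print(recovery_time(failure, restored))         # 1 day, 19:00:00
print(data_loss_window(last_backup, failure))   # 16:00:00
```

A design with more frequent backups shortens the second figure; a design with faster restore paths shortens the first.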

The disaster recovery effort has been divided into three subject areas. The first effort defines the procedures to be followed during a disruption to technical services. The second effort covers the technical recovery design, which determines how long emergency procedures will need to be followed and how successfully data can be recovered to a specific point in time. The last area looks at all services to ensure the most cost-effective approaches are being taken for recovery.

Phases of Disaster Recovery

Continue Operations

This section of a disaster recovery plan focuses on ensuring appropriate materials are available and staff are trained and can operate during downtime situations.

I. Continue Operations

Emergency Boxes at each location
• Downtime Phones
• Afterhours support information for facilities and IT
• ILS downtime procedure
• PC Res downtime procedure
• All telephone and/or network downtime procedures
• Filtering product downtime

Alternate Work Processes

This table depicts, at the highest level, the alternate means by which operations can be conducted in the event of a regional failure.

Service                         Alternate means
Horizon                         Store & Forward on the Self Checks and returns on the smartchutes (OCLC for inquiries)
Phone                           Cell phones
Data network                    Use public machines for internet access. Use Sharepoints for data sharing, DR issued emails for communication.
Collaboration (email, shares)   Use FRED sharepoint for communication and set up individual accounts if critical. Share data may be available depending on situation.

Or simpler yet…

1. Know where the emergency box is at your location.

2. Immediately begin to use alternate work methods (to continue operations as normally as possible)

3. Check http://www.fred.sharepointspace.com for updates.

Recover Quickly

Speed of Recovery (how long would services be down)

II: Speed of Recovery

The speed of recovery is dependent upon

-vendor response

-resource time available (priorities)

-equipment availability

-complexity of the recovery

Service            Best Case Day Live   Worst Case Day Live
Data network       Day 2                Days 5-8
Main Phone         Day 2                Days 5-8
ILS                Day 3                Days 5-8
Email              Day 5                Days 8-12
Shares             Day 5                Days 8-12
HIP                Day 3                Days 8-12
MyLibrary          Days 8-15            Days 15-20
Financials         Days 8-15            Days 15-20
Telecirc, other…   Days 15-25           Days 25-35
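The best and worst case estimates above can be turned into a simple lookup that answers "what should be back by day N?". This is a minimal sketch: the day ranges are copied from the table, while the data structure, the function name, and the choice to count a service as live only at the upper end of its range are assumptions made for illustration.

```python
# (best-case range, worst-case range) in days, copied from the table.
# A single day such as "Day 2" is stored as the range (2, 2).
RECOVERY_DAYS = {
    "Data network": ((2, 2), (5, 8)),
    "Main Phone": ((2, 2), (5, 8)),
    "ILS": ((3, 3), (5, 8)),
    "Email": ((5, 5), (8, 12)),
    "Shares": ((5, 5), (8, 12)),
    "HIP": ((3, 3), (8, 12)),
    "MyLibrary": ((8, 15), (15, 20)),
    "Financials": ((8, 15), (15, 20)),
    "Telecirc, other": ((15, 25), (25, 35)),
}


def live_by(day: int, worst_case: bool = False) -> list[str]:
    """Services expected live by `day`, using the upper end of each range."""
    idx = 1 if worst_case else 0
    return [
        name
        for name, ranges in RECOVERY_DAYS.items()
        if ranges[idx][1] <= day
    ]


print(live_by(5))
# ['Data network', 'Main Phone', 'ILS', 'Email', 'Shares', 'HIP']
print(live_by(8, worst_case=True))
# ['Data network', 'Main Phone', 'ILS']
```

Using the upper end of each range is the conservative reading; swapping in the lower end would give the optimistic one.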

Awareness point

For the WLD, when backups occur, only the data of the system is retained. In some instances system configurations are backed up, but a full system installation is typically not captured.

What this means: although the data is available, the hardware must be recovered and then all software reloaded. WLD is assessing when virtual machines can be created and easily backed up.

Roles (Who Does What)

DR ROLE                                          PRIMARY   BACKUP
IT Coordination and Communication                Susan     Mike/Gem
Infrastructure Recovery                          Mike      3T/MTT/Susan
Client Access and Recovery                       Eric      Marcus
District and Public Communication Coordination   Janine    Kelli

Three point-in-time technical designs were considered. Note that all designs assume a baseline of primary equipment designed for high availability, with redundant power supplies, RAID, and other standard features inherent in business class servers and equipment.

Best Fit

After reviewing options, WLD will be working with Iron Mountain.

• Hosted backup w/ tape archive
  • expert resources monitoring and tracking data backup process
  • additional capacity available to provide needed data in the event of an emergency (can work with other vendor partners)
  • best medium for ensuring data integrity (avoiding bad tapes, etc)
  • estimated solution duration two, maybe three, years
    • Dependent on data and systems growth
    • Review annually to determine if best fit

• What it looks like
  • 7 days of onsite and offsite data backup on disk (fast recovery)
  • After 7 days, historical data available on tape (at risk for older recovery)

• Virtual Machines
  • HIP
  • Horizon if possible (testing to start now)
  • MyLibrary
  • Other….
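The disk-then-tape retention described above implies a simple rule for where a restore would come from. The sketch below is illustrative only: the function name is hypothetical, and treating a backup that is exactly 7 days old as still on disk is an assumption, since the plan does not state the boundary.

```python
# From the plan: 7 days of backup are kept on disk; older data is on tape.
DISK_RETENTION_DAYS = 7


def restore_source(backup_age_days: int) -> str:
    """Return the medium a restore would come from for a backup of this age.

    Assumption: a backup exactly 7 days old is still on disk.
    """
    if backup_age_days < 0:
        raise ValueError("backup age cannot be negative")
    return "disk" if backup_age_days <= DISK_RETENTION_DAYS else "tape"


print(restore_source(3))    # disk
print(restore_source(30))   # tape
```

Restores that fall through to tape are the slower, higher-risk path the plan flags as "at risk for older recovery".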

Time to Recover

The table to the right depicts the estimated length of time needed for various services to be available, both through temporary work means and through full recovery, where normal operations have resumed.

Cost Management


Cost management includes efforts to help smartly manage data growth and use.

Is money being spent backing up old or inappropriate data (mp3 files, family pictures, other?)

• Continued data collection and research in 2007
  • Share data
  • Email file size
  • Work processes (IT example)
  • What about personal folders, archives

• Review findings and develop a recommendation in 2008
  • Share policies/use
  • Email mailbox size rules
  • Other?

Conclusion

Key takeaways for each section of disaster recovery

Assuming the worst case disaster (Farr destroyed) short of a regional catastrophe

1. CONTINUE OPERATIONS:
• Staff immediately shifts to downtime operations

2. RECOVER QUICKLY:
• Director/associate director immediately updates FRED sharepoint with first conference call time
• Site is http://fred.sharepointspace.com; updates will be posted on the DR page
• All managers to participate
• use 866-258-0959 meeting room ID *1338021* using 1857
• Daily meetings at 8:15 (target brief, 15 minute information sharing) until recovery is completed
• Managers to update staff after daily meeting
• Communication to staff posted on DR site
• IT will join all morning meetings to provide updates and will post specific information on the DR site as well

What’s not defined, subjects for a business continuity plan

1. Staffing information
   1. Do staff report, where, when?
   2. Will staff be paid?

2. Public and board communication plan
   1. How to keep public notified of the status

3. Peer communication plan
   1. ILL services, other, how to operate

4. Actual physical recovery if location destroyed
   1. Rebuild/other?
   2. Insurance processes
   3. timeline for recovery (and again, staff impact in the interim)

5. Other?

Final Next Steps

Next Steps as of Sept 12, 2007

• Present to branch managers for awareness of full plan

• Complete testing of Horizon VM instance
  • Due in 30 days

• Complete testing of Iron Mountain service
  • In process, decision due October

• Complete migration of applicable services to virtual machines (includes installing separate copy
  • Q1 2008

• Finalize the archive configuration for the ILS

• Work with Kari/Managers to train appropriate staff on store and forward uploads

• Conduct DR test on Nov 5