Post on 11-Jun-2018
34 Riviera Drive, Markham, ON, L3R 5M1
Web: www.midrange.ca
E-mail: rdolewski@midrange.ca
Phone: 1-800-668-6470
Don’t Fall with the Fallen!!
Richard Dolewski
2
Definition of a Disaster
A sudden, Unplanned Event that causes great damage and loss to an organization
The time factor determines whether an interruption in service is an inconvenience or a disaster. The time
factor varies from organization to organization
3
What is Disaster Recovery
Reaction to a sudden, unplanned event that enables an organization to continue critical business functions until normal business
operations resume.
“…It is not enough to arrange for hardware replacement;… planning must address continuation of
business operations, or business continuation.”
4
What is a Disaster
ANYTHING!
That stops your business from functioning & that
cannot be corrected within an acceptable amount of time.
5
The Value of Systems Availability
Competitive value
Increased End-user & Business Productivity
Ongoing improvement in customer support & service
Positive Business Image
Reduced Outages =
7
THE GOAL
NO Business tolerance for DOWNTIME
Critical systems & Networks are continuously available
Business interruption measured in hours & minutes rather
than DAYS.
Disaster Recovery
8
Protect Important Assets
Four Primary Assets needed to operate Information Systems:
Facilities
Hardware
Network
Data
9
Protect Important Assets
Hardware and Networks can be replaced
Facilities can be rebuilt or relocated
Data is Priceless !!!
10
A shared responsibility
Entire organization's senior management
IT alone cannot determine which processes are critical
Disaster Recovery
11
How Much Data Can You Afford to Lose ?
Most IT shops depend on backups to protect their data
Orphan data considerations
Minimum 24 hrs of lost data
Application access may be more critical
12
Questions for Management
Include: Manufacturing, Finance, Purchasing, Sales, Warehousing.
Ask them about their Business, What Services do you provide them!!!
13
RTO, RPO and ROI
RTO: Recovery Time Objectives
How long can your system be down?
RPO: Recovery Point Objectives
How much data can you lose?
ROI: Return on Investment Goals
Planned versus unplanned
HA versus conventional hot site
[Diagram: RPO, RTO and ROI on a scale of Days, Hours, Minutes]
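To make the two questions above concrete, here is a minimal Python sketch that measures one outage against an RPO and an RTO. The function name, the dates, and the 4-hour and 24-hour targets are all hypothetical, chosen only for illustration; they do not come from the presentation.

```python
from datetime import datetime, timedelta

def meets_objectives(last_good_save, failure_time, service_restored, rpo, rto):
    """Compare one outage against the stated recovery objectives."""
    # Work entered since the last usable save is what can be lost (RPO side).
    data_loss_window = failure_time - last_good_save
    # Time users spend without the system (RTO side).
    downtime = service_restored - failure_time
    return {
        "data_loss_window": data_loss_window,
        "downtime": downtime,
        "rpo_met": data_loss_window <= rpo,
        "rto_met": downtime <= rto,
    }

# Hypothetical scenario: nightly save at 23:00, failure at 10:00 the next
# morning, service back at 16:00 the day after.
result = meets_objectives(
    last_good_save=datetime(2018, 6, 11, 23, 0),
    failure_time=datetime(2018, 6, 12, 10, 0),
    service_restored=datetime(2018, 6, 13, 16, 0),
    rpo=timedelta(hours=4),    # assumed target, not from the slides
    rto=timedelta(hours=24),   # assumed target, not from the slides
)
print(result)
```

In this made-up case both objectives are missed: 11 hours of data are exposed against a 4-hour RPO, and the outage lasts 30 hours against a 24-hour RTO.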
14
Planned vs. Unplanned Downtime
• Backup Window – Incremental Daily & Full System
• IBM & 3rd party Software Upgrades
• IBM & 3rd Party PTF/Fixes
• Application Maintenance ( Reorgs )
• Hardware Upgrades
Planned
15
Planned Downtime Score Card
                                    Per Week    Per Year
Daily Backups (6 times per week)    6 hrs       312 hrs
Full Backups Wkly                   3 hrs       156 hrs
Software Installs                               20 hrs
+ Housekeeping                                  24 hrs
----------------------------------------------------------
= Planned Outages                               512 hrs or 21.33 days/year
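The score card arithmetic can be reproduced with a few lines of Python. This is only a sketch: the hour figures are the ones on the slide, while the 52-week year and the variable names are assumptions made for the example.

```python
WEEKS_PER_YEAR = 52  # assumption: the slide's yearly figures imply 52 weeks

# Hours per week, taken from the score card.
weekly_hours = {
    "daily backups (6 times per week)": 6,
    "full backup (weekly)": 3,
}
# Hours per year, taken from the score card.
yearly_hours = {
    "software installs": 20,
    "housekeeping": 24,
}

def planned_outage_hours(weekly, yearly, weeks=WEEKS_PER_YEAR):
    """Total planned outage hours per year."""
    return sum(h * weeks for h in weekly.values()) + sum(yearly.values())

total_hours = planned_outage_hours(weekly_hours, yearly_hours)
total_days = total_hours / 24
print(f"{total_hours} hrs or {total_days:.2f} days/year")
# prints "512 hrs or 21.33 days/year"
```

Substituting your own backup and maintenance windows shows how much planned downtime the business is already absorbing each year.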
17
Disaster Recovery Planning
Only 70% of today’s businesses have fully documented Disaster Recovery Plans.
Of these companies with plans,
45% NEVER test their plan
18
Common Issues
We all tend to let our guard down when times improve
As Planners we must always be ready & be prepared.
19
No Disaster Recovery Plan
• Guarantees:
• Confusion
• Lack of direction
• Conflict
• Lost customers
20
Cannot be approached casually
The Plan must be ....
Well organized
Action Oriented
Comprehensive
Objective: Total restoration of Services in a timely manner
Disaster Recovery Planning
21
The Products of the Plan
Who will execute recovery actions
What is needed to continue, resume, recover or restore business functions
When business functions and operations must resume
Where to go to resume corporate, business & operational functions
How: Detailed procedures for continuity, resumption, recovery or restoration
CLASSIC: WHO-WHAT-WHERE-WHEN-HOW
22
Common Issues
Has your plan kept up to date with your IT integrations?
Expectations of the Plan are unrealistic
I no longer have the staff
Implement DR into your Change Control Process.
23
Recovery Script Issues
This procedure will pre-determine your company’s course of action:
When do I inform management?
When do I put the hotsite on Alert or Declare?
What time - What actions ???
Who will execute them ???
25
Perspective
Too often companies populate their DR Teams with raw, inexperienced staffers and the wrong solution
Volunteers to satisfy an auditor or, worse, the sacrificial lamb
26
The Right People
The Best Candidates for DR Teams:
Characterized as leaders
People that everyone goes to
Folks that understand Enterprise systems – know quickly the how and the ramifications
Understand the business
27
Characteristics of a good Team
Ideal Characteristics:
Considered an Expert by his/her peers
A go-to Person for anything and/or Everything
Works well under Pressure
Controls Emotions
Characteristics to Avoid:
Hands off Individual (Avoids Work)
New to the Organization
Totally unfamiliar with the systems
Folds under Pressure
Hot Head
28
Characteristics of a good Team
Ideal Characteristics:
Confident
Trusted by Peers
Dedicated – A company person
Willing to fix problems created by others
Characteristics to Avoid:
Lacks sense of Urgency
Tendency to blame others
Excuses, Excuses, Excuses
Totally unfamiliar with the systems
Pure 9 – 5 er. First one out the Door
Nowhere to be found
30
Vulnerability Assessment
Objective is to drive down the duration of outages
A systematic approach towards:
Reducing the frequency of outages by eliminating all single points of failure.
Reducing duration of outages by configuring hardware & software for the fastest possible recovery.
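The two levers above, fewer outages and shorter outages, map onto the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR). The Python sketch below uses that textbook formula; the hour figures are hypothetical and are not taken from the presentation.

```python
HOURS_PER_YEAR = 8760

def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: uptime share of total time."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def expected_downtime_hours(mtbf_hours, mttr_hours, period_hours=HOURS_PER_YEAR):
    """Expected hours of downtime over the period."""
    return (1 - availability(mtbf_hours, mttr_hours)) * period_hours

# Hypothetical figures for illustration only.
baseline = availability(mtbf_hours=1000, mttr_hours=8)
fewer_outages = availability(mtbf_hours=2000, mttr_hours=8)    # remove single points of failure
faster_recovery = availability(mtbf_hours=1000, mttr_hours=2)  # configure for fast recovery

for label, value in [("baseline", baseline),
                     ("fewer outages", fewer_outages),
                     ("faster recovery", faster_recovery)]:
    print(f"{label}: {value:.4%}")
```

Either lever raises availability; a vulnerability assessment looks for opportunities on both.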
31
Vulnerability Assessment
Computer room environment
  Temperature
  Humidity
  Air Flow (From the plant?)
Electrical power Supply
Key-lock switch
Data security
Tape and backup device maintenance
33
Power Redundancy
UPS/Diesel Generator
Extends system operation without Hydro supplied power.
RS232 cable Interface & System Values
Off line maintenance
34
RAID5/6 Disk Redundancy
Parity Information saved across multiple disks.
Advantages:
Lower cost than DASD mirroring.
System available during a Disk failure.
Customer responsibility to configure RAID sets.
Disadvantages:
Only protects on the DASD level, no upstream protection.
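The idea behind parity protection can be sketched with XOR in Python. This is a simplified single-stripe illustration, not how any particular disk controller is implemented: real RAID 5 rotates parity across the disks, and RAID 6 adds a second, differently computed parity block that this sketch does not model. The block contents are made up.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three hypothetical data blocks, one per disk, all the same length.
data_blocks = [b"ORDERS01", b"INVOICE2", b"PAYROLL3"]

# The parity block is the XOR of all data blocks.
parity = xor_blocks(data_blocks)

# Simulate losing the second disk: XOR the survivors with the parity block
# to rebuild the missing data.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt)  # prints b'INVOICE2'
```

The system stays available during a single disk failure because any one missing block can be recomputed this way. Nothing in the scheme protects against failures upstream of the disks, which is the disadvantage noted above.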
37
Lack of Security because…
Time pressures!!
Administrators wear too many hats.
Result: Poorly administered security schemes
Too much authority
Security software or products never installed or utilized
39
Unsecured copies of production data
• Developers need copies to test against
• The test Data is “real”
• Copies are often left unsecured on test servers
40
Default passwords
• Analyze use of Default passwords on your systems
• These are the first passwords a hacker will try
• Check Consultant & Supplier passwords:
JDEINTASLL, JDEPROD, QPGMR, your Boss!!!
Change IBM Default passwords
41
Old User profiles
• Rather than being cleaned up, profiles often
accumulate, even though staff have left the company
• Old profiles owning production Objects
42
TCP/IP applications
Many systems have TCP/IP servers started even when they are not used.
• Check autostart attribute of servers
• These will start when STRTCP is run
• Check authority to STRTCPSVR
• This starts all TCP servers regardless of
autostart value
43
Biggest security exposure
Behind the firewall
Disgruntled employees
Accidental errors due to users having too much authority
No auditing
No way to determine if there really is a problem
44
No auditing
• If you don’t audit, you have no knowledge of
what happened.
• May need to audit to meet regulations - PIPEDA
• Minimum recommendation:
• *SECURITY, *SAVRST, *AUTFAIL,
*DELETE, *CREATE, *SERVICE
• Caution – don’t audit too much!
47
Backup = Recovery
How many people back up their system?
Of these companies that perform regular backups…
• 51% are incomplete
• 23% (iSeries/400) are unrecoverable
• 42% (Intel) are unrecoverable
48
Availability problems IT is facing today !!
1. Backup window reduction
2. Scheduling a Planned outage
3. Recovery from disaster related outage events
4. Best Practices for Server Compliance
5. Recovery Solution Verification
49
Save/Restore Strategy
System saves must be reviewed
Ensure complete recovery is possible from mid week, mid day or weekend failure.
Electronic notification of exceptions
Review Restoration Procedures
Backup software BRMS
51
Last Save Info
Following are other data areas that the system uses to maintain last save information for various other general save/restore commands:
Data Area Name    Purpose
QSAVALLUSR        INFO FOR SAVLIB/RSTLIB LIB(*ALLUSR)
QSAVCFG           INFO FOR SAVCFG/RSTCFG
QSAVIBM           INFO FOR SAVLIB/RSTLIB LIB(*IBM)
QSAVLIBALL        INFO FOR SAVLIB/RSTLIB LIB(*NONSYS)
QSAVSYS           INFO FOR SAVSYS
QSAVUSRPRF        INFO FOR SAVSYS/SAVSECDTA/RSTUSRPRF
52
Reliable Backups
Backups are the backbone to any recovery situation
In most recovery situations, the backups are not adequate
Excessive time is spent recreating parts of operating system
QUSRSYS not complete
53
BRMS Log
System not in restricted state, SAVSYS Processing completed with errors
Starting SAVDLO of folder *ANY to devices TAP01.
2574 document library objects saved.
Starting save of list *LINK to devices TAP01.
43917 objects saved. 342 not saved.
Save of list *LINK completed with errors.
Starting save of media information at level *OBJ to device TAP01.
18 objects saved from library QUSRBRM.
Save of BRM media information at level *OBJ complete.
DAILY *BKU 0070 *EXIT CALL PGM(BBSYSTEM/ENDDAYBU).
Control group DAILY type *BKU completed with errors.
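A log like the one above is easy to skim past. As a rough sketch, the Python below scans log text that has been exported to plain text and flags the lines that signal an incomplete save. The sample text is the log from this slide, split one message per line (an assumption about layout); the patterns and function name are illustrative and are not part of BRMS.

```python
import re

# Log text from the slide, assumed to be exported one message per line.
SAMPLE_LOG = """\
System not in restricted state, SAVSYS Processing completed with errors
Starting SAVDLO of folder *ANY to devices TAP01.
2574 document library objects saved.
Starting save of list *LINK to devices TAP01.
43917 objects saved. 342 not saved.
Save of list *LINK completed with errors.
Starting save of media information at level *OBJ to device TAP01.
18 objects saved from library QUSRBRM.
Save of BRM media information at level *OBJ complete.
DAILY *BKU 0070 *EXIT CALL PGM(BBSYSTEM/ENDDAYBU).
Control group DAILY type *BKU completed with errors.
"""

# Illustrative patterns only; extend them to match your own messages.
PROBLEM_PATTERNS = [
    re.compile(r"completed with errors", re.IGNORECASE),
    re.compile(r"not in restricted state", re.IGNORECASE),
    re.compile(r"\b[1-9]\d* not saved", re.IGNORECASE),  # non-zero "not saved" count
]

def find_problems(log_text):
    """Return the log lines that match any problem pattern."""
    return [
        line.strip()
        for line in log_text.splitlines()
        if any(pattern.search(line) for pattern in PROBLEM_PATTERNS)
    ]

problems = find_problems(SAMPLE_LOG)
for line in problems:
    print("CHECK:", line)
```

On the sample, four lines are flagged, including the 342 objects that were not saved. A check like this could feed the electronic notification of exceptions so that a backup ending "with errors" is reviewed the same day.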
54
BRMS Maintenance
Recovery Analysis report
Recovery Volume Summary report
ASP Information report
Produce the Location Analysis report
Recovery reports by system
Send Recovery report offsite… Daily
55
Tape Management
Ensure tapes are labeled or cataloged with unique volume ID’s (BRMS/400, Robot Save)
Prevent overwriting tapes with Active data
Have at least 2 full system saves ( yes 2 )
Audit tapes for data integrity
Do NOT IGNORE tape drive problems - PRTERRLOG *VOLSTAT
56
Save Strategy
Monthly, a full system save (Option 21) is performed: SAVSYS, SAVLIB *NONSYS, SAVDLO, SAV.
Daily SAVLIB of all production libraries using Save-While-Active
IFS save is performed daily
Configuration and Security information is saved daily.
Tapes sent offsite daily.
59
Why is Continuous Availability not H/A !
It does eliminate the need for planned shutdown of production system
Objective is to allow users 24 hour access to the production system
Normally only key Production application libraries are mirrored in this approach
60
[Diagram: Users, Primary Node and Backup Node (Send & Receive, Staging Store, Match Merge), Building Address: 222 Cross my Fingers Drive]
61
Issues with Side by Side
Single point of failure
Same power grid
Same CO for communications
No alternative in a Building Disaster
62
H/A Common Findings
Mirrored System data integrity in serious question
Data inconsistencies beyond application.
Numerous application support model requirements missing or out of sync.
Little or NO documentation exists besides run book
Solution questioned by Management
63
Bridging the H/A Gap
H/A needs to be monitored 7/24 to ensure integrity
Special considerations must be given to designing a messaging model that is responsive to any mirroring condition.
Operations education will be required.
65
It’s not all bad news when your plan
fails during a test…
…frequent testing identifies gaps in your recovery process
67
Passive Testing
Hands on Plan Review
Paper Walkthrough
Assessment of current Recovery Plan
Involve every member of your recovery team(s)
68
Active Testing
Validates the Recovery Plan in terms of:
1. Recovery Capability
2. Source system configuration
3. Network Recovery
4. Data Integrity
5. Identify Weakness in the Plan
6. Provides Training
This all equals success during a Disaster
70
Offsite Strategy
[Timeline diagram: day-by-day calendar (W T F S S M T ...) showing the Full Backup, Incremental backups, the PLANNED OFFSITE BACKUP, DISASTER STRIKES at 10 AM, the DAILY EXPOSURE, POSSIBLE UNPROCESSED DATA, and OPERATIONS RECOVERY]
71
Offsite Storage Considerations
The last full system save (multiple copies)
Software Installation Keys & Proof of Entitlement documents
LAN Server CD-ROM
Cisco Router Scripts
LAN full Backup and build items (Database)
Recovery CDs
Electronic version of DRP
PRTSYSINF
LVT for LPAR-ed servers
HMC – Critical Console Data DVD
73
Recipe for Disaster
Ingredients:
One Average, everyday filing cabinet
All your important business documents
One Average business fire
74
Recipe for Disaster
Directions:
Place media in filing cabinet. Bake in fire at approximately 800° F for 20 min. Let cool. Open filing cabinet.
77
Prevent a Disaster
Review your Backup strategy!!!!
Develop a Recovery strategy First.
Review custom CL programs for application changes and for completeness
*UNLOAD
Review the logs!!!!
78
Position Your Software
Keep current software installed (V5R3/V5R4)
OS/400 upgrade paths require specific generations of hardware
Replacement hardware from IBM will dictate the release you require.
79
Prevent a Disaster
Install latest Backup/Recovery group PTFs
Know where your LIC CD is
Have you performed a FULL save since the last upgrade?
80
Logical Partitioning
Each partition needs to be backed up
There is NO Option 21 for the entire system that includes all partitions.
Logical Partition configuration maintained on Primary - not Saved… Document it !!
81
LPAR Experiences
Client: Operator in DST deletes a partition
Situation: Told nobody about his actions because all looked to be in order
Result: Partition went offline 1 month Later
Educate staff and use appropriate security measures!
82
Misconceptions
Fact: BEST EFFORT
“I am a very large IBM customer, a machine will be made available for me, do you know how much I spend each year?”