Your organization needs a robust disaster recovery program in place. The days of going back to manual workarounds when core critical systems are unavailable are fading fast for many time-sensitive processes. In fact, often the people who used to do the work manually have retired or moved on to other positions. If critical systems are down for an unacceptable period of time, it can severely imperil your company.
- Information Technology (IT) should have direct responsibility for disaster recovery. You should work with them to map the current availability of systems, meaning the recovery time objective (RTO) and recovery point objective (RPO), to the business requirements
- All systems must have a run-book
- All systems must be tested per your company’s policies (more on testing below)
System information, run-books and disaster recovery test results should all be maintained in your automated BCM tool, rather than in spreadsheets. Maintaining this type of data in spreadsheets has many drawbacks. A few include:
- Spreadsheets are difficult to keep up-to-date
- Spreadsheets have a limited audit trail
- Spreadsheets tend to become siloed
- Spreadsheets make it more difficult and time-consuming to analyze application data against upstream and downstream applications and business processes
Using a BCM tool makes all of those negatives go away. You can maintain data in structured fields and use attachments when additional information, such as network diagrams, is kept in spreadsheets or word processing documents.
Having all this information in a central BCM repository along with your business continuity requirements will empower you to do some really nice automated analysis, including real-time gap pulse reports. For instance, if a change is made to any system in your enterprise, the change can be analyzed in real time and the risk profile updated. A robust BCM tool can use rules and workflow to deliver alerts to the right people at the right time!
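To make the gap analysis idea concrete, here is a minimal sketch in Python of the kind of rule a BCM tool might run. It is not how any particular tool works. The class name, the field names and the two sample systems are all made up for illustration, and I am assuming both the IT values and the business requirements are stored in hours.

```python
from dataclasses import dataclass


@dataclass
class SystemRecovery:
    # All names below are hypothetical, chosen for this illustration only.
    name: str
    it_rto_hours: float        # what IT can currently deliver
    it_rpo_hours: float
    business_rto_hours: float  # what the business requires
    business_rpo_hours: float


def find_gaps(systems):
    """Return (system, metric, IT value, required value) for every gap.

    A gap exists when IT's current RTO or RPO is longer than the
    business requirement for the same system.
    """
    gaps = []
    for s in systems:
        if s.it_rto_hours > s.business_rto_hours:
            gaps.append((s.name, "RTO", s.it_rto_hours, s.business_rto_hours))
        if s.it_rpo_hours > s.business_rpo_hours:
            gaps.append((s.name, "RPO", s.it_rpo_hours, s.business_rpo_hours))
    return gaps


# Sample data, invented for the example.
systems = [
    SystemRecovery("Payroll", 24, 4, 8, 4),
    SystemRecovery("Order Entry", 4, 1, 4, 2),
]

for name, metric, actual, required in find_gaps(systems):
    print(f"ALERT: {name} {metric} is {actual}h but the business requires {required}h")
```

In a real tool the alert would be routed to the application owner through workflow rather than printed, but the comparison underneath is this simple.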
I get excited just thinking about it. I have set up these types of rules, triggers and workflows on many occasions and it is great!
It is important that you capture the information that will enable you to do the analysis and produce the reports and metrics. As a long-time software developer, I learned to start at the end. I realize you will need ad-hoc reports as you mature, but for now try to determine your near-term needs. Mock up some reports. Think about reports and alerts, then work backwards to understand what information you need to capture to make your dreams a reality.
Do I sound excited? Well darn it, I am excited!!!
Some of the information you might want to capture in your system will include the following. Be sure to add more to meet your needs:
- Application name
- Description
- Application owner and contact info
- Purpose of application
- Processes that use the application
- Vendor name
- Vendor representative (name/email)
- Critical (Yes/No)
- IT System RTO (hours)
- IT System RPO (hours)
- Run-book completed? (Yes/No)
- Disaster recovery plan completed? (Yes/No)
- Was last DR test successful? (Yes/No)
- List of issues from the last DR test
- Production data center
- Backup data center
- If vendor hosted, was a SAS-70 or data center walk through completed? (Yes/No)
- If hosted locally – must it be local?
- Number of users
- Number of servers
- Type of server (dedicated, virtual, operating system, etc.)
- Contract / license expiration date
- Owner of equipment
- Fail-over tested (Yes/No)
- Is the application being recovered in the primary data center? (you would be surprised)
- Full backup frequency
- Incremental backup frequency
- Type of backup (digital tape, vaulting…)
- If backup is tape, where is it stored?
- If backup is tape, who transports it?
- Systems dependencies – input
- Systems dependencies – output
- Comments, concerns, results
I know it is a lot and IT may not have it now but it is important to get the ball rolling. Engage them and find out where they are. This is critical stuff.
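If it helps to picture the structured fields, here is a rough Python sketch of an application record using a handful of the items from the list. The class, the field names and the sample values are my own assumptions for illustration, not the schema of any BCM product. The small helper shows which questions are still unanswered, which is a handy list to take to IT.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class ApplicationRecord:
    # Hypothetical field names covering a subset of the list above.
    application_name: str
    application_owner: Optional[str] = None
    critical: Optional[bool] = None
    it_rto_hours: Optional[float] = None
    it_rpo_hours: Optional[float] = None
    runbook_completed: Optional[bool] = None
    dr_plan_completed: Optional[bool] = None
    last_dr_test_successful: Optional[bool] = None
    production_data_center: Optional[str] = None
    backup_data_center: Optional[str] = None
    input_dependencies: List[str] = field(default_factory=list)
    output_dependencies: List[str] = field(default_factory=list)


def missing_fields(record):
    """Names of fields that are still None, i.e. not yet answered."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]


# Sample record with invented values: only three fields are known so far.
record = ApplicationRecord("Payroll", application_owner="Finance", critical=True)
print("Still need from IT:", ", ".join(missing_fields(record)))
```

Using None for "not yet known" keeps an unanswered Yes/No question separate from a real "No", which matters when you report on gaps.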
You MUST do your disaster recovery tests ‘before the baby is born’! (By the way, I love that line. Unfortunately, my editor would not let me use it in the title of this post)
Your policy should be that prior to any new system going into production a disaster recovery test be completed with the business actively participating. The results must be signed off by the business owner. I will repeat one more time – there must be a written, tested and approved disaster recovery plan in place – PRIOR to going live.
If your organization has a Project Management Office (PMO), they should make disaster recovery testing and sign-off a ‘toll-gate’ in every new system implementation. If the user has not signed off on the Disaster Recovery User Acceptance Test (UAT), the system cannot go into production.
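The toll-gate itself is easy to express as a rule. The sketch below is one way a PMO checklist could be automated in Python. The three checks come straight from the policy described above, while the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GoLiveChecklist:
    # Hypothetical names; each flag mirrors one requirement in the policy.
    dr_plan_written: bool
    dr_test_completed: bool
    business_signed_off_uat: bool


def toll_gate_blockers(checklist):
    """Return the list of requirements that are still outstanding."""
    blockers = []
    if not checklist.dr_plan_written:
        blockers.append("written disaster recovery plan")
    if not checklist.dr_test_completed:
        blockers.append("completed disaster recovery test")
    if not checklist.business_signed_off_uat:
        blockers.append("business sign-off on the DR UAT")
    return blockers


def can_go_live(checklist):
    """The system may go into production only when nothing is outstanding."""
    return len(toll_gate_blockers(checklist)) == 0


# Example: plan written and test done, but the business has not signed off.
new_system = GoLiveChecklist(True, True, False)
if can_go_live(new_system):
    print("Toll-gate passed")
else:
    print("Blocked, still missing: " + ", ".join(toll_gate_blockers(new_system)))
```

Returning the list of blockers, rather than a plain yes or no, gives the project team something specific to act on.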
In my experience as both an IT and BC professional, once a critical system has gone into production the urgency and incentive to complete a disaster recovery test is greatly reduced to ‘someday’ or ‘when we have time’… which often never comes. The DR testing will be pushed back indefinitely or likely forgotten as teams move on to the next critical project.
Unfortunately, your butt will be on the line when the untested system goes down or a virus hits and there is no backup, the tape backup is incompatible, the RTO/RPO does not meet the business requirements, or maybe all of the above!
So, I strongly advise you not to wait until ‘after the baby is born’. When the production system goes down is NOT the time to test or to think – woulda, coulda, shoulda.
Next step tips regarding disaster recovery testing:
Tip – Ask IT when the last disaster recovery test was done and where the results are stored. Review the results for your critical systems, looking for gaps and issues. Were they corrected and tested again? Did the business sign off on the corrections?
Tip – Ask IT if disaster recovery testing is on the new system development roadmap. If it is not, it must be added as soon as possible.
Tip – Make all requests in writing and keep a copy of the email trail and the final decision. Otherwise, your butt will be on the line when issues arise.
Tip – Partner with IT and management to develop the enterprise ‘before the baby is born’ disaster recovery testing policy and get it signed off on.
Tip – If a policy is subsequently put in place it is a significant accomplishment to be highlighted on your next annual job review.
Tip – If a policy does not get put in place, the email trail just might save your job when a critical system is not available during a disruption and the inevitable finger pointing begins. I have seen this play out, so please be forewarned.