Planning for Failure

Reghu Gopinathan |  June 2018

We all need a plan. On the one hand, we are driven by the old adage: “Failing to plan is planning to fail.” On the other, a less optimistic view comes from long-standing military advice: “No plan survives first contact with the enemy.” Both hold true in the information technology ecosystem. To reduce risk, we need to bolster planning with processes that hold plans together when the going gets tough.

High Availability and Continuity of Operations (COOP) are the processes that help organizations prepare for disruptive events. Even though these concepts are well understood, many organizations suffer because they implement them incorrectly. While extensive planning goes into mitigating major disasters, business disruption is typically caused by a simple mistake that balloons into a major event. Something as minor as releasing a script into production without first testing it on a QA server can have a huge negative impact.

The Big Picture

Information Technology managers and CIOs typically clamor for one guarantee: no disruption to operations due to system failure. Their staff must meet a Service Level Agreement (SLA) that defines how quickly operations will be restored, backed by a plan for High Availability. The SLA threshold is set by estimating the costs of lost productivity when systems are not available.

“To get an effective SLA, you have to put a price tag on how much the organization could lose if the systems are down, the relative opportunity costs of those lost hours or minutes, or even the value lost by damaging the goodwill of your customers.”
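
As a rough illustration of putting a price tag on downtime, the sketch below adds up the three cost components named in the quote. The class name, function name, and every dollar figure are hypothetical placeholders, not values from any real organization; substitute your own estimates.

```python
from dataclasses import dataclass


@dataclass
class DowntimeCostInputs:
    # All fields are hypothetical estimates supplied by the organization.
    lost_productivity_per_hour: float  # cost of staff who cannot work
    lost_revenue_per_hour: float       # opportunity cost of missed business
    goodwill_cost_per_incident: float  # one-time estimate of customer goodwill damage


def downtime_cost(inputs: DowntimeCostInputs, hours_down: float) -> float:
    """Estimate the cost of a single outage lasting hours_down hours."""
    if hours_down < 0:
        raise ValueError("hours_down must be non-negative")
    if hours_down == 0:
        return 0.0  # no outage, so no goodwill damage either
    hourly = inputs.lost_productivity_per_hour + inputs.lost_revenue_per_hour
    return hourly * hours_down + inputs.goodwill_cost_per_incident


# Illustrative numbers only.
example = DowntimeCostInputs(
    lost_productivity_per_hour=2_000.0,
    lost_revenue_per_hour=5_000.0,
    goodwill_cost_per_incident=10_000.0,
)
print(downtime_cost(example, hours_down=4))  # 7,000 x 4 + 10,000 = 38000.0
```

Comparing this estimate against the cost of the redundancy needed to avoid the outage is one way to decide how strict the SLA threshold should be.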

7 Tips to Consider BEFORE an IT Disaster Strikes

  1. You must begin with a plan. Start your preparation by doing a business impact analysis of varying levels of failure. Identify the most critical systems and the impact of each kind of outage on your business or operations. Identify all the related redundant systems as well. If your operations aren’t too complex, you can plan for an offline disaster recovery site where your systems are mirrored and only the transactional databases are replicated in real time.
  2. Establish priority levels. Identify all systems and assign a priority level:
    • Priority #1 System – Recovery time: Immediate (no downtime). Must be mirrored to a geographically dispersed location.
    • Priority #2 System – Recovery time: Must be restored within 4 hours.
    • Priority #3 System – Recovery time: Must be operational within the same day.
    • Priority #X System – Continue planning to the level that your business or operations can sustain a loss.
  3. Outline your communication strategy. Assign the actions that must be accomplished by each staff member or department across the chain of command. Remember, email may not be available during a disaster, so include redundant communication systems to support your processes, such as texting or radios. Also include plans for external communications.
  4. Practice makes ready. Major outages are often caused by human error. While it is impossible to eliminate all mistakes, risks can be minimized by proper training, documentation, and practice. Just as every office building regularly practices a fire drill, your organization should practice disaster readiness every 6 months, or at least annually. Everyone who manages critical operations must know what to do in the event of a disaster. It is also important to practice the failovers and to train more than one person. Operations staff should be rotated and cross-trained in different functional areas so that there is a pool of people who can support all systems. You should also have a runbook with a flow chart clearly explaining the steps in the recovery process.
  5. Include your infrastructure in the planning process. Beyond software and hardware systems, also remember to include related systems and dependencies such as backup power, network access, cell access, phone systems, etc.
  6. Make copies. Keep copies of your plan and disaster recovery documentation in locations where you can access them if your network drives fail.
  7. Back Up. Test your backups by doing a restore at least every 6 months (or more frequently), depending on the criticality of your operations. Don’t forget your laptops – keep all those and other mobile devices backed up, too, along with documentation on configuration settings. In a disaster, you won’t be able to look up the old settings to reconfigure the network, routers, devices, hardware, and accounts.
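
The priority levels in tip 2 can be checked against the results of a practice drill (tip 4). The following is a minimal sketch under stated assumptions: the system names and measured recovery times are hypothetical, and "same day" is approximated as 24 hours.

```python
from datetime import timedelta

# Recovery time targets from the priority levels in tip 2.
# Assumption: "same day" is treated as 24 hours.
RECOVERY_TARGETS = {
    1: timedelta(0),         # immediate, no downtime
    2: timedelta(hours=4),   # restored within 4 hours
    3: timedelta(hours=24),  # operational within the same day
}


def meets_target(priority: int, measured_recovery: timedelta) -> bool:
    """Return True if the measured recovery time is within the tier's target."""
    return measured_recovery <= RECOVERY_TARGETS[priority]


def missed_targets(drill_results: dict) -> list:
    """Return the names of systems that missed their target, most critical first.

    drill_results maps a system name to (priority, measured recovery time).
    """
    missed = [
        (priority, name)
        for name, (priority, measured) in drill_results.items()
        if not meets_target(priority, measured)
    ]
    return [name for _, name in sorted(missed)]


# Hypothetical drill results for illustration only.
drill = {
    "orders-db": (1, timedelta(0)),
    "email": (2, timedelta(hours=3)),
    "intranet": (2, timedelta(hours=6)),
    "reporting": (3, timedelta(hours=30)),
}
print(missed_targets(drill))  # ['intranet', 'reporting']
```

Any system that appears in the output is a candidate for more redundancy, more practice, or a lower priority level.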

