How to ensure business continuity in the face of disasters: keys to prevent your company from 'going down'
Until recently, the term ‘going down’ wasn’t widely known outside technical circles, but incidents like large-scale power outages have brought it into everyday conversation. As the saying goes, it’s better to prepare for the storm before it hits. And while this was not a literal storm, it certainly served as an important wake-up call.
Many companies have experienced first-hand the risk of not being adequately prepared for an incident that can 'knock a business out', threatening their very survival. Business Continuity strategies have been with us for some time, but it is in situations like this one that we become aware of their relevance, and also of the complexity involved.
Having a Disaster Recovery solution is not enough: if not all the pieces are in place, unpleasant surprises can happen.
The importance of analyzing the risks to establish key parameters
It is sometimes assumed that having a Disaster Recovery solution in place means we are already protected. While it is a fundamental part, it is far from enough. If we don't consider all the pieces, we may be in for unpleasant surprises.
The starting point is to have detailed knowledge of our business, the IT systems that deliver it and the teams that support it. A risk analysis allows us to identify the impact of being unable to offer the different functions of our services over different time intervals, for example half an hour, four hours, a day or several weeks.
This impact can be analyzed at different levels: economic (income we will not receive or penalties we may incur), legal and reputational.
This analysis provides requirements for two of the most common parameters in Business Continuity and Disaster Recovery: the RTO and the RPO.
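- RTO (Recovery Time Objective) indicates the maximum acceptable time to restore a system after an incident.
- RPO (Recovery Point Objective) indicates the maximum acceptable data loss, expressed as the point in time before the disaster to which data must be recoverable; anything recorded after that point may be lost.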
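As an illustration of how the risk analysis can turn into an RTO requirement, here is a minimal sketch in Python. It assumes impact has been estimated per outage duration for a single service function, and that the business has set its own tolerance limits; the `ImpactEstimate` type, the `derive_rto_hours` function, the 0 to 3 reputational scale and all figures are hypothetical, not part of any specific methodology or product.

```python
from dataclasses import dataclass


@dataclass
class ImpactEstimate:
    """Estimated impact of losing one service function for a given outage duration."""
    outage_hours: float
    economic_eur: float  # lost income plus possible penalties (hypothetical figures)
    legal: bool          # True if the outage would breach a legal or contractual obligation
    reputational: int    # hypothetical scale: 0 (none) to 3 (severe)


def derive_rto_hours(impacts: list[ImpactEstimate],
                     max_economic_eur: float,
                     max_reputational: int) -> float:
    """Return the longest analyzed outage duration whose impact is still tolerable.

    Durations are checked from shortest to longest; the search stops at the
    first intolerable one, on the assumption that impact only grows with time.
    Returns 0.0 if even the shortest interval is intolerable.
    """
    tolerable = 0.0
    for est in sorted(impacts, key=lambda e: e.outage_hours):
        if est.legal or est.economic_eur > max_economic_eur or est.reputational > max_reputational:
            break
        tolerable = est.outage_hours
    return tolerable


# Hypothetical analysis for one function, using intervals like those in the text.
impacts = [
    ImpactEstimate(0.5, 2_000, False, 0),
    ImpactEstimate(4.0, 20_000, False, 1),
    ImpactEstimate(24.0, 150_000, False, 2),
    ImpactEstimate(168.0, 1_200_000, True, 3),  # one week
]

print(derive_rto_hours(impacts, max_economic_eur=50_000, max_reputational=1))  # 4.0
```

With these made-up limits, a one-day outage already exceeds the tolerable economic impact, so the derived RTO requirement is four hours. The same approach can be applied to data loss to obtain an RPO requirement.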
By combining the requirements from the risk analysis with the technical solutions on which our services are built, we can produce a first design of the measures to adopt.
This process is not simple, and a very relevant factor comes into play here: cost. The more we invest, the better our RTO and RPO values will be, but budgets for these activities are limited and, with so many other urgent matters competing for attention, they are often not among the highest priorities.
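The cost trade-off can be framed as choosing the cheapest option that still meets the RTO and RPO requirements. The minimal Python sketch below assumes each candidate option has a known yearly cost and expected RTO and RPO; the `RecoveryOption` type, the `cheapest_option` function, the option names and all figures are hypothetical and only illustrate the reasoning.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RecoveryOption:
    name: str
    annual_cost_eur: float  # hypothetical yearly cost
    rto_hours: float        # expected time to restore service with this option
    rpo_hours: float        # worst-case window of data that could be lost


def cheapest_option(options: list[RecoveryOption],
                    required_rto_hours: float,
                    required_rpo_hours: float) -> Optional[RecoveryOption]:
    """Return the lowest-cost option meeting both targets, or None if none does."""
    candidates = [
        o for o in options
        if o.rto_hours <= required_rto_hours and o.rpo_hours <= required_rpo_hours
    ]
    return min(candidates, key=lambda o: o.annual_cost_eur, default=None)


# Hypothetical options, from cheapest and slowest to most expensive and fastest.
options = [
    RecoveryOption("nightly backups restored on demand", 5_000, 48.0, 24.0),
    RecoveryOption("replication every 15 min to a second site", 40_000, 4.0, 0.25),
    RecoveryOption("active-active across two sites", 150_000, 0.1, 0.0),
]

choice = cheapest_option(options, required_rto_hours=4.0, required_rpo_hours=1.0)
print(choice.name if choice else "no option meets the targets")
```

Tightening the targets shrinks the list of candidates and pushes the choice toward more expensive options, which is exactly the tension between recovery objectives and budget.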
■ Identifying which elements are essential or estimating the capacity required for a temporary contingency situation helps to optimize the solution.
The importance of considering all scenarios and resources
It is also essential to analyze the possible failure scenarios and to be aware of which scenarios a given solution actually protects us against. A solution that brings our servers up in another location is of no use if we have not provided the connectivity our users and clients need to reach them.
One aspect that is often overlooked is human resources. It is not just a matter of servers, machines and networks. It is just as important to be clear about how teams should respond, what they have to do and, simply, how and by whom recovery procedures should be activated.
In some cases, such as a complete loss of infrastructure, this may be obvious; in others, such as incidents at the application level, it may not be so easy to assess what is happening or what actions need to be taken.
Business Continuity involves continuously reviewing the design and the scenarios, and improving after each contingency.
Another relevant aspect is running periodic tests. This is not simple, because a test can affect the delivery of our services. Fortunately, technology increasingly makes non-disruptive tests possible, giving us the reassurance that, if a real incident occurs, we will be prepared and there will be no surprises.
Finally, as a major outage can show, Business Continuity is a continuous process in which we must review the validity of the design and the scenarios we contemplate and, whenever a contingency has occurred, review how the response was executed and identify areas for improvement.
Solutions to ensure continuity
At Telefónica Tech, we offer consulting capabilities to support our clients in designing effective Business Continuity policies. In addition, through our Telefónica Tech Cloud Platform service, we provide Disaster Recovery as a Service (DRaaS) solutions that safeguard customer infrastructure in the event of a disaster and automate geographic replication across multiple nodes.
These solutions also enable non-disruptive testing, which helps fine-tune the small details that are not always taken into account in the initial design. A common issue detected in testing, for example, is failing to provide all the connectivity end users need to access the services.
Don't forget Business Continuity if you don't want your business to go down.