Daniel Pous Montardit

Principal Architect in the CTO unit of TCCT. Computer Engineer and entrepreneur with more than 20 years of experience in software development.
Resilience, key to Cloud-Native systems
In the first post of the Cloud-Native series, What is a Cloud-Native Application?, or what it means for my software to be Cloud Native, we presented resilience as one of the fundamental attributes that help us ensure our systems are reliable and operate with practically no service interruptions.

Let's start by defining resilience: it is the ability to react to a failure and recover from it so as to continue operating while minimising any impact on the business. Resilience is not about avoiding failures, but about accepting them and building the service in such a way that it is able to recover and return to a fully functioning state as quickly as possible.

Cloud-Native systems are based on distributed architectures and are therefore exposed to a larger set of failure scenarios than the classical monolithic application model. Examples of failure scenarios include:

- Unexpected increases in network latency that can lead to communication timeouts between components and reduce quality of service.
- Network micro-outages causing connectivity errors.
- Downtime of a component, with a restart or change of location, which must be handled transparently to the service.
- Overloading of a component that triggers a progressive increase in its response time and may eventually cause connection errors.
- Orchestration operations such as rolling updates (a system update strategy that avoids any loss of service) or scaling/de-scaling of services.
- Hardware failures.

Although cloud platforms can detect and mitigate many failures in the infrastructure layer on which applications run, achieving an adequate level of resilience requires implementing certain practices or patterns at the level of the application or software system deployed. Let's now look at which techniques and technologies help us achieve resilience in each of these layers: the infrastructure layer and the software layer.

Resilient Infrastructure

Resilience at the hardware level can be achieved through solutions such as redundant power supplies or redundant storage arrays (RAID). However, these protections only cover certain failures, and we have to resort to other techniques, such as redundancy and scalability, to reach the desired level of resilience.

Redundancy

Redundancy consists of, as the word itself indicates, replicating each of the elements that make up the service, so that any task, or part of a task, can always be performed by more than one component. To do this, we must add a mechanism, such as a load balancer, that distributes the workload between these duplicate 'copies' within each workgroup. The level of replication needed depends on the business requirements of the service, and it affects both its cost and its complexity. It is recommended to identify the critical flows within the service and to add redundancy at every point along them, thus avoiding single points of failure: components of the system whose failure would cause a total system failure. It is also common to add multi-region redundancy with geo-replication of the information and to distribute the load by means of DNS balancing, directing each request to the appropriate region according to its geographical origin.
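To make the idea of distributing work across redundant replicas more concrete, here is a minimal, purely illustrative sketch of a round-robin dispatcher with health checks. The replica URLs, the /health endpoint and the overall shape of the dispatcher are invented for the example, not part of any specific platform.

```python
import itertools
import urllib.request
import urllib.error

# Hypothetical replica endpoints sitting behind our "load balancer".
REPLICAS = [
    "http://replica-1:8080",
    "http://replica-2:8080",
    "http://replica-3:8080",
]

def is_healthy(base_url: str) -> bool:
    """Probe a (hypothetical) /health endpoint; any error marks the replica unhealthy."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=1) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

_round_robin = itertools.cycle(range(len(REPLICAS)))

def pick_replica() -> str:
    """Round-robin over the replicas, skipping the ones that fail the health probe."""
    for _ in range(len(REPLICAS)):
        candidate = REPLICAS[next(_round_robin)]
        if is_healthy(candidate):
            return candidate
    raise RuntimeError("no healthy replica available")  # caller may trigger a fallback

# Usage: forward each incoming request to pick_replica() instead of a fixed host.
```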
Scalability

Designing scalable systems is also fundamental to achieving resilience. Scalability, that is, the capacity to adjust resources to the workload by increasing or decreasing their number, is key to avoiding failure situations such as communication timeouts due to excessive response times, service failures caused by work piling up, or the degradation of storage subsystems under massive information ingestion. There are two types of scaling:

- Vertical scaling, or scale up: increasing the power of a machine (CPU, memory, disk, etc.).
- Horizontal scaling, or scale out: adding more machines.

The ability to scale a system horizontally is closely related to redundancy. Horizontal scalability can be seen as a level above it: a non-redundant system cannot be horizontally scalable and, in turn, we achieve horizontal scalability on top of redundancy by adding a feedback loop that uses the real-time load of the system to determine how far it should grow or shrink its resources to adjust optimally to the demand at any given time. Note that this also establishes a relationship with the observability capability, which is responsible for providing the metrics needed to monitor the load and automate auto-scaling. There are libraries in many languages that implement these techniques, and we can also resort to more orthogonal solutions such as a Service Mesh, which facilitates this task and completely decouples it from our business logic.

Resilient Software

As mentioned at the beginning of this post, it is essential to incorporate resilience into the design of the software itself in order to successfully face the challenges of distributed systems. The logic of the service must treat failure as a normal case and not as an exception: it must define how to act when something fails and determine the contingency action when the preferred path is not available. This is known as the fallback action, or the backup configuration for that failure case.

Architectural patterns

Apart from the fallback pattern, there is a set of architectural patterns oriented towards providing resilience to a distributed system, for example:

Circuit Breaker: this pattern helps a service to recover from, or decouple itself from, both performance drops due to subsystem overloads and complete outages of parts of the application. When the number of consecutive failures reported by a component exceeds a certain threshold, it is usually the prelude to something more serious: the total failure of the affected subsystem. By temporarily blocking further requests, the component in trouble gets a chance to recover, and further damage is avoided. This temporary cushion may be enough for the auto-scaling system to intervene and replicate the overloaded component, thus avoiding any loss of service to its clients.

Timeouts: simply limiting the time the sender of a request will wait for its response can be key to avoiding overloads caused by the accumulation of held resources, thus contributing to the resilience of the system. If a microservice A calls microservice B and the latter does not respond within the defined timeout, there is no indefinite wait: microservice A regains control and can decide whether or not to keep trying. If the problem was caused by a network outage or an overload of microservice B, a retry may be enough to redirect the request to the already recovered instance of B or to a new instance free of load. And if no further retries are made, microservice A can free its resources and execute the defined fallback.
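The following is a minimal, illustrative sketch combining these two patterns: a tiny circuit breaker that opens after a number of consecutive failures and enforces a per-call timeout. The thresholds, the fallback value and the wrapped function are hypothetical example choices, not the API of any particular resilience library.

```python
import time
from concurrent.futures import ThreadPoolExecutor

class CircuitBreaker:
    """Opens after max_failures consecutive errors and stays open for reset_after seconds."""

    def __init__(self, max_failures=5, reset_after=30.0, call_timeout=2.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.call_timeout = call_timeout
        self.failures = 0
        self.opened_at = None          # None means closed: requests flow normally
        self._pool = ThreadPoolExecutor(max_workers=4)

    def call(self, fn, *args, fallback=None):
        # While open, fail fast: return the fallback instead of hammering the troubled component.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback
            self.opened_at = None      # half-open: let one trial request through

        future = self._pool.submit(fn, *args)
        try:
            # Timeout pattern: never wait indefinitely for the response.
            result = future.result(timeout=self.call_timeout)
        except Exception:              # includes the TimeoutError raised by future.result()
            # Note: on a timeout the underlying call may still be running in its worker thread.
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()   # trip the breaker
            return fallback
        self.failures = 0
        return result

# Hypothetical usage, wrapping calls from microservice A to microservice B:
# breaker = CircuitBreaker()
# stock = breaker.call(query_stock_service, "product-123", fallback=None)
```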
Retries: the two previous techniques, the circuit breaker and timeouts, have already hinted at the importance of retries as a base concept for resilience. But can retries be incorporated into communications between components for free? Let's imagine, continuing with the previous example, that microservice A makes a request to B and, due to a momentary network outage, B's response never reaches A. If A incorporates retries, then when the waiting time for that call (the timeout) expires it will regain control and issue the request to B again, so B will do the work twice, with whatever consequences that may have. For example, if the request subtracted a purchase from the stock of products, the sale would be recorded twice and leave an incorrect balance in the stock books. It is because of this situation that the concept of idempotence is introduced. An idempotent service is immune to duplicate requests: processing the same request repeatedly does not cause inconsistencies in the final result, giving rise to 'safe retries'. This immunity comes from a design that takes idempotency into account from the beginning; in the stock-update example, the request should include a purchase identifier, and microservice B should record it and check that the identifier has not already been processed before applying the update again.
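Below is a small, purely illustrative sketch of this idea: the client retries with exponential backoff and always sends the same purchase identifier, while the (hypothetical) stock service keeps a record of the identifiers it has already applied, so duplicates are harmless. Names such as purchase_id, apply_stock_update and the retry parameters are invented for the example.

```python
import time
import uuid

# --- server side (microservice B): idempotent stock update -------------------
processed_requests = set()        # in production this would live in a durable store
stock = {"product-123": 10}

def apply_stock_update(purchase_id: str, product: str, quantity: int) -> int:
    """Apply the update only once per purchase_id; replays just return the current value."""
    if purchase_id not in processed_requests:
        stock[product] -= quantity
        processed_requests.add(purchase_id)
    return stock[product]

# --- client side (microservice A): safe retries with backoff -----------------
def call_with_retries(purchase_id, product, quantity, attempts=3, base_delay=0.2):
    for attempt in range(attempts):
        try:
            # The same purchase_id is sent on every attempt, so duplicates are harmless.
            return apply_stock_update(purchase_id, product, quantity)
        except Exception:
            if attempt == attempts - 1:
                raise                                   # out of retries: run the fallback
            time.sleep(base_delay * 2 ** attempt)       # exponential backoff between attempts

remaining = call_with_retries(str(uuid.uuid4()), "product-123", 1)
print(remaining)   # 9, even if the call had been retried
```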
Cache: now that we know why retries need to be incorporated, note that using a cache to automatically store the responses of a microservice helps both to reduce the pressure on it and to provide a fallback for certain anomalies. In the case of a retry, the cache ensures that the component does not have to redo previously completed work and can return the result directly to the caller.

Bulkhead: this last pattern consists of dividing the distributed system into 'isolated' and independent parts, also called pools, so that if one of them fails the others can continue to function normally. This architectural tool can be seen as a contingency technique, comparable to a firewall, or to the watertight compartments that divide a ship's hull and prevent water from spreading between them. It is advisable, for example, to isolate a set of critical components from the standard ones. Bear in mind that such divisions can sometimes reduce resource efficiency and add complexity to the solution.

Resilience tests

As mentioned above, in a distributed system there are so many components interacting with each other that the probability of something going wrong is very high: hardware, the network, traffic overloads and so on can all fail. We have discussed various techniques to make our software resilient and minimise the impact of these failures. But do we have a way to test the resilience of our system? The answer is yes, and it is called Chaos Engineering. But what is Chaos Engineering? It is a discipline of infrastructure experimentation that exposes systemic weaknesses. This empirical verification process leads to more resilient systems and builds confidence in their ability to withstand turbulent situations.

Experimenting with Chaos Engineering can be as simple as manually executing kill -9 (the command to immediately terminate a process on Unix/Linux systems) on a box in a test environment to simulate the failure of a service. Or it can be as sophisticated as designing and running experiments automatically in a production environment against a small but statistically significant fraction of live traffic. There are also supporting libraries and frameworks, such as Chaos Monkey, created by Netflix, which randomly terminates virtual machines or containers in production environments and follows the principles of Chaos Engineering. The goal is to identify system weaknesses before they manifest themselves as aberrant behaviour affecting the entire system. Systemic weaknesses can take the form of incorrect fallback configurations when a service is unavailable, excessive retries due to poorly tuned timeouts, service outages when a component of the processing chain collapses under traffic saturation, massive cascading failures caused by a single point of failure, and so on.
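As a flavour of what a very small chaos experiment might look like, here is an illustrative sketch that randomly deletes one pod of a target workload using the official Kubernetes Python client. The namespace and the app=checkout label selector are invented for the example, and something like this should only be run against a controlled environment with proper safeguards.

```python
import random
from kubernetes import client, config

# Assumes a local kubeconfig pointing at a *test* cluster.
config.load_kube_config()
v1 = client.CoreV1Api()

# Hypothetical target: all pods of the 'checkout' workload in the 'shop' namespace.
pods = v1.list_namespaced_pod(namespace="shop", label_selector="app=checkout").items
if pods:
    victim = random.choice(pods)
    print(f"chaos: deleting pod {victim.metadata.name}")
    # Equivalent to an abrupt kill: the deployment should recreate the pod, and the
    # service should keep answering thanks to the remaining (redundant) replicas.
    v1.delete_namespaced_pod(name=victim.metadata.name, namespace="shop")
```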
Conclusions

The traditional approach when building systems was to treat failure as an exceptional event outside the successful execution path, and therefore it was not contemplated in the basic design at the heart of the service. This has changed radically in the cloud-native world: in distributed architectures, failure situations appear normally and recurrently in some part of the whole, and this must be considered and assumed from the outset, within the design itself. Thus, when we talk about resilience, we refer to the characteristic that allows services to respond to and recover from failures, limiting the effects on the system as a whole as much as possible and reducing the impact on it to a minimum. Achieving resilient systems not only improves the quality of the service or application, it also brings greater cost efficiency and, above all, avoids losing business opportunities due to loss of service.

Featured image: Alex Wong / Unsplash

January 30, 2023
Observability: what it is and what it offers
What is observability?

The term "observability" comes from Rudolf Kalman's control theory and refers to the ability to infer the internal state of a system from its external outputs. Applied to software systems, the concept refers to the ability to understand the internal state of an application from its telemetry. Not all systems expose enough information to be 'observed', so we classify as observable those that do. Being observable is one of the fundamental attributes of cloud-native systems.

Telemetry information can be classified into three main categories:

Logs: probably the most common and widespread mechanism for emitting information about the internal events of the processes or services of a software system. Historically they are the most detailed source of what happened, and they follow a temporal order. Their contribution is key to debugging and understanding what happened inside a system, although some argue that traces could take over this leading role. They are easy to collect, but very voluminous and consequently expensive to retain. Logs can be structured or unstructured (free text), and common formats include JSON and logfmt. There are also proposals for semantic standardisation such as OpenTelemetry or the Elastic Common Schema.

Metrics: quantitative information (numerical data) about processes or machines over time. Examples are the percentage of CPU, disk or memory usage of a machine sampled every 30 seconds, or a counter of the total number of errors returned by an API, labelled with the HTTP status returned and, for instance, the name of the Kubernetes container that processed the request. These time series are thus identified by a set of labels with values, which also serve as an entry point for exploring the telemetry. Metrics are simple to collect, inexpensive to store, dimensional (which allows quick analysis) and an excellent way to measure overall system health. In a later post we will also see that the values of a metric can carry attached data known as exemplars, also in key/value form, which serve, among other things, to easily correlate a value with other sources of information. For instance, in the API error counter above, an attached exemplar could let us jump directly from the metric to the traces of the request that originated the error, which greatly facilitates operating the system.

Traces: detailed data about the path executed inside a system in response to an external stimulus (such as an HTTP request, a message in a queue, or a scheduled execution). This information is very valuable because it shows the latency from one end of the executed path to the other, and for each of the individual calls made within it, even in a distributed architecture where the execution may span multiple components or processes. The key to this power lies in the propagation of context between the system components working together; for example, in a distributed microservices system, components may use HTTP headers to propagate the required state information so that the data can be stitched together from one end to the other. In conclusion, traces allow us to understand execution paths, find bottlenecks and optimise them efficiently, and identify errors, making them easier to understand and fix.
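As a small, illustrative taste of emitting telemetry from application code, the sketch below uses the OpenTelemetry Python API together with standard logging to produce a span, a labelled error counter and a log line for a single request. The metric name, the attribute names and the handle_request function are invented for the example, and a real setup would also configure exporters for a concrete backend.

```python
import logging
from opentelemetry import trace, metrics

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("checkout")               # logs: the first pillar

tracer = trace.get_tracer("checkout")                # traces: the third pillar
meter = metrics.get_meter("checkout")                # metrics: the second pillar
error_counter = meter.create_counter(
    "api_errors_total", description="Total errors returned by the API"
)

def handle_request(product_id: str):
    # Every request is wrapped in a span so its latency and outcome are traced.
    with tracer.start_as_current_span("handle_request") as span:
        span.set_attribute("product.id", product_id)
        try:
            ...  # business logic would go here (raising an exception on failure)
        except Exception:
            # One event, three pillars: a log line, an incremented labelled counter,
            # and a failed span, all of which can later be correlated.
            logger.exception("request failed for product %s", product_id)
            error_counter.add(1, {"http_status": "500", "container": "checkout"})
            raise
```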
These three verticals of information are referred to as the "three pillars" of observability, and making them work together is essential to maximise the benefit obtained. For example, metrics can be alarmed to report a malfunction, and their associated exemplars will let us identify the subset of traces linked to the occurrence of the underlying problem. Finally, we select the logs related to those traces, thus gaining all the context needed to efficiently identify and correct the root cause of the problem. Once the incident has been resolved, we can enrich our observability with new metrics, dashboards or alarms so as to anticipate similar problems more proactively in the future.

Why is monitoring not enough? And what does observability offer?

Monitoring lets us detect that something is not working properly, but it does not give us the reasons. Moreover, it is only possible to monitor situations that are foreseen in advance (known knowns). Observability, on the other hand, is based on the integration and correlation of multiple sources of telemetry data which, together, help us to better understand how the software system under observation works, not just to identify problems. The most critical aspect, however, is what is done with the data once it is collected: for example, why rely on predefined thresholds when we can automatically detect unusual 'change points'? It is this kind of 'intelligence' that enables the discovery of unknown unknowns (a toy illustration of such threshold-free detection is sketched after this section).

The elaboration of real-time topology maps is another capability offered by observability: it allows us to establish automatic relationships between all the telemetry gathered, going much further than a simple correlation by time. A high-impact example of what these topologies can enable is automatic incident resolution in real time, without human intervention. Observability also makes it easier to treat performance as a first-class activity in software development, by giving us profiling information (a step-by-step detail of an execution) on a continuous basis (something that, without the appropriate mechanisms, requires a lot of effort in distributed systems) and by offering the possibility of detecting bottlenecks in real time. In addition, deeply understanding what happens inside a system over time allows us to maximise the benefit of load testing (and, in general, of any type of end-to-end test) and opens the door to chaos engineering techniques. Last but not least, it reduces the mean time to resolution (MTTR) of incidents by cutting the time spent on diagnosis, allowing us to focus on solving the problem.

We can conclude that when a system embraces a mature observability solution, the benefits for the business become more tangible. Not only does it enable more efficient innovation, but the reduction in implementation times translates into greater efficiency for the teams and, consequently, into cost reductions. For all these reasons, observability is not a purely operational concern but a transversal responsibility of the whole team, as well as being considered a basic practice within the most modern and advanced software engineering.
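To illustrate the idea of detecting unusual change points instead of relying on fixed thresholds, here is a toy sketch that flags a metric sample when it deviates sharply from the recent rolling behaviour of the series. The window size and sensitivity are arbitrary example values; real observability platforms use considerably more robust statistical or machine-learning methods.

```python
from collections import deque
from statistics import mean, stdev

def detect_change_points(samples, window=30, sensitivity=4.0):
    """Yield (index, value) for samples that deviate sharply from the recent window."""
    recent = deque(maxlen=window)
    for i, value in enumerate(samples):
        if len(recent) == window:
            mu, sigma = mean(recent), stdev(recent)
            # Flag the point if it sits far outside the recent behaviour of the series,
            # regardless of any absolute, pre-defined threshold.
            if sigma > 0 and abs(value - mu) > sensitivity * sigma:
                yield i, value
        recent.append(value)

# Example: a latency series that suddenly jumps around sample 60.
latencies = [100 + (i % 5) for i in range(60)] + [180 + (i % 5) for i in range(20)]
print(list(detect_change_points(latencies)))   # reports the first samples after the jump
```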
Conclusion

The key to understanding the problems of distributed systems, problems that appear repeatedly but with marked variability, is being able to debug them backed by evidence rather than conjecture or hypotheses. We must internalise that 'errors' are part of the new normal that accompanies complex distributed systems. The degree of observability of a system is the degree to which it can be debugged, so the contribution of observability to a distributed system is comparable to what a debugger offers us on a single running process. Finally, it is worth noting that an observable system can be optimised, both at a technical and at a business level, much more easily than the rest.

Featured photo: Mohammad Metri / Unsplash
January 12, 2023