"Your flight has been cancelled for operational reasons": What!?

January 11, 2024

Three days of chaos, economic losses running into millions of dollars, and thousands of passengers affected, all for "operational reasons". In other words, because of a bug in the UK's air traffic control software.

What was the problem?

The flaw occurred in FPRSA-R, the software used to manage flights into and out of UK airspace. It forced NATS (National Air Traffic Services), the organization responsible for managing and controlling air traffic in UK airspace, to reduce the capacity of the air traffic control system.

This led to the cancellation of hundreds of flights and delays to thousands of others.

Could it have been avoided?

The preliminary incident report clearly states that the failure could have been avoided if NATS had implemented a more rigorous testing process.

The NATS testing process did not detect the flaw before the software went into production. The FPRSA-R subsystem had been operating continuously since October 2018 and had processed more than 15 million flight plans without incident, and no recent upgrade had been performed that could have introduced the error.

What was the source of the flaw?

Deciding whether the NATS test plan was rigorous enough would require analyzing whether the specifications defined the process that failed and whether that case was covered by the tests.

But to know how the error could have been avoided, we first need to understand its origin:

NATS' flight planning management software (FPRSA-R) is responsible for recording and authorizing all flights entering and leaving UK airspace.
Initially the press reported that the error was triggered by flight BAW231, a British Airways flight covering the London-New York route. The flight took off from Heathrow Airport at 10:00 local time on August 28, 2023, the day of the incident, and was able to land at New York Airport without incident. The flight plan submitted to the NATS system contained an empty line in the "arrival time" field.
But as indicated by amateur and professional aeronautical forums on Reddit, the source of the flaw was French Bee flight FBU731 from Los Angeles USA to Orly in France. The flight plan submitted to the NATS system contained a duplicate waypoint.

⚠️ Whichever was the trigger, an empty line or a duplicate waypoint, both scenarios should have been tested, among many others, to check how robustly the system responds when it does not receive the expected input and to prevent the NATS FPRSA-R system from crashing.
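As an illustration of what testing those two scenarios could look like, here is a minimal sketch in Python. The `FlightPlan` structure, its field names and the `validate_flight_plan` function are hypothetical simplifications invented for this example; they are not the real FPRSA-R data model, and whether a duplicate waypoint should be rejected or only flagged is a design decision this sketch does not settle.

```python
from dataclasses import dataclass


@dataclass
class FlightPlan:
    # Hypothetical, simplified structure; real flight plans carry many more fields.
    callsign: str
    arrival_time: str
    waypoints: list[str]


def validate_flight_plan(plan: FlightPlan) -> list[str]:
    """Return a list of problems; an empty list means the plan looks processable."""
    problems = []
    if not plan.arrival_time.strip():
        problems.append("arrival_time is empty")
    seen = set()
    for waypoint in plan.waypoints:
        if waypoint in seen:
            problems.append(f"duplicate waypoint: {waypoint}")
        seen.add(waypoint)
    return problems


# Robustness tests: malformed input must be reported, not crash the system.
assert validate_flight_plan(FlightPlan("TEST1", "12:30", ["AAA", "BBB"])) == []
assert validate_flight_plan(FlightPlan("TEST2", "", ["AAA", "BBB"])) == [
    "arrival_time is empty"
]
assert validate_flight_plan(FlightPlan("TEST3", "12:30", ["AAA", "BBB", "AAA"])) == [
    "duplicate waypoint: AAA"
]
```

The point of the sketch is that the validator returns a description of the problem instead of raising an unhandled exception, so the caller can decide what to do with the plan.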

What happens if there is an error in a critical system?

A critical system must have one or more replicas, which ensure that if a machine or system stops working there is always a backup with the same data available to continue processing the pending information. The possible redundancies and distributions of systems and data could be the subject of another post; the important thing is that there is always an alternative plan in case of a system failure.

Contingencies exist at many levels. Those that could affect a system like NATS include software, hardware, facilities (power outages, sabotage, or severe weather), and even events at the continental level (wars or blockades). The objective is for the system to keep functioning in the face of adversity, or at least to minimize the loss of information.

The system in charge of processing the list of flight plans pending authorization encountered this erroneous or malformed plan and could not process it. It raised a critical exception and went into maintenance mode, which left the system inoperative and the flight plan still waiting to be cleared.
The backup system then took over, picked up the next pending flight plan (the same one that could not be processed), made the same attempt, and failed in the same way. This continued until every system that could run the task was blocked for the same reason.
At first the technicians tried to restart the system, but it could not be returned to normal. Time was ticking: the four-hour countdown to starting the manual (and slow) entry of flight plans had begun.

Possible solutions?

The support engineers, unable to restore service, requested help from the system manufacturer. With their knowledge of the system, they discovered the source of the error and possible solutions:

Extract that flight plan from the list.
Develop an update that would correct the error in the system.

The solution would be to pull that plan and alert the operator to authorize it (or not) manually, while the rest of the authorizations continued.
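The idea of pulling the problematic plan and letting the rest continue can be sketched as a queue with a quarantine list. This is only an illustration: the `authorize` function, the `MalformedPlanError` exception and the dictionary layout of a plan are assumptions made up for this example, not the manufacturer's actual fix.

```python
from collections import deque


class MalformedPlanError(Exception):
    """Raised by the hypothetical processor when a plan cannot be handled."""


def authorize(plan: dict) -> str:
    # Stand-in for the real processing logic, which this post does not describe.
    waypoints = plan["waypoints"]
    if len(waypoints) != len(set(waypoints)):
        raise MalformedPlanError(f"duplicate waypoint in {plan['id']}")
    return plan["id"]


def process_queue(pending: deque) -> tuple[list[str], list[dict]]:
    """Authorize what can be authorized; set aside what cannot."""
    authorized, quarantined = [], []
    while pending:
        plan = pending.popleft()
        try:
            authorized.append(authorize(plan))
        except MalformedPlanError:
            # Keep the plan for manual review by an operator instead of halting.
            quarantined.append(plan)
    return authorized, quarantined


pending = deque([
    {"id": "P1", "waypoints": ["AAA", "BBB"]},
    {"id": "P2", "waypoints": ["AAA", "BBB", "AAA"]},
    {"id": "P3", "waypoints": ["CCC", "DDD"]},
])
authorized, quarantined = process_queue(pending)
print(authorized)                      # ['P1', 'P3']
print([p["id"] for p in quarantined])  # ['P2']
```

With this design a single bad plan costs one manual review, and a backup running the same code would not be dragged down by the same input.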

What lessons can be learned?

NATS' air traffic control failure is a reminder of the importance of software quality.

Companies that rely on critical software must ensure it is thoroughly tested, not only with the valid input it is expected to handle but also with invalid or unexpected input.

Validating every possible scenario of a system would be costly in time and effort, but there are many ways to approach it. One technique, known as equivalence partitioning, is to group similar scenarios and test with sample values that represent each group.

An easy example is testing a function that takes one input parameter, a number that must be between 0 and 100, in order to perform certain mathematical calculations.

A first approach would be to divide the scenarios into groups such as:

The valid positive values (0≤x≤100), with some samples [3, 17, 80].
The invalid positive values (100<x) and test with some samples [150, 500, 9999].
The invalid negative values (x<0) and their samples [-10, -33, -5555].
The boundary values, which sit on the border between valid and invalid and should be split into those expected to be valid and those expected to be invalid:
- Valid boundary values: [0, 1, 99, 100]
- Invalid boundary values: [-1, 101]
Incorrect values, that is, values of a type other than the expected one, for example: [a, Z, @, ™, "105 OR 1=1", etc.]
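The groups above translate almost directly into a table-driven test. The sketch below assumes a hypothetical function, `parse_percentage`, that accepts only whole numbers from 0 to 100 and raises an error otherwise; the function name, the choice of integers and the choice of exceptions are assumptions for illustration.

```python
def parse_percentage(raw) -> int:
    """Hypothetical function under test: accept only whole numbers from 0 to 100."""
    if isinstance(raw, bool) or not isinstance(raw, int):
        raise TypeError(f"expected an integer, got {raw!r}")
    if not 0 <= raw <= 100:
        raise ValueError(f"{raw} is outside the range 0 to 100")
    return raw


# Each group maps to (samples, exception expected), with None meaning "accepted".
PARTITIONS = {
    "valid": ([3, 17, 80], None),
    "invalid_above": ([150, 500, 9999], ValueError),
    "invalid_below": ([-10, -33, -5555], ValueError),
    "boundary_valid": ([0, 1, 99, 100], None),
    "boundary_invalid": ([-1, 101], ValueError),
    "wrong_type": (["a", "Z", "@", "™", "105 OR 1=1"], TypeError),
}


def run_partition_tests() -> int:
    """Check every sample against its group's expectation; return how many ran."""
    checked = 0
    for name, (samples, expected_error) in PARTITIONS.items():
        for sample in samples:
            try:
                result = parse_percentage(sample)
                assert expected_error is None, f"{name}: {sample!r} should be rejected"
                assert result == sample
            except (TypeError, ValueError) as error:
                assert expected_error is not None, f"{name}: {sample!r} should be accepted"
                assert isinstance(error, expected_error), f"{name}: {sample!r}"
            checked += 1
    return checked


print(run_partition_tests())  # 20
```

Adding a new group or a new sample is a one-line change to the table, which keeps the test readable as the list of scenarios grows.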

We have therefore gone from infinitely many possible cases to about 20 samples that give very high scenario coverage. And if automated tests select the samples from each group dynamically, the number of cases tested skyrockets with minimal effort.
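A minimal sketch of that dynamic selection, using only the Python standard library: each group is described by a generator that draws a random member, and the test draws as many samples as it likes. The `is_valid` function, the group names and the numeric limits of the invalid ranges are assumptions for illustration.

```python
import random


def is_valid(x: int) -> bool:
    """Hypothetical function under test: True for values from 0 to 100."""
    return 0 <= x <= 100


# Each group: (function that draws a random member, expected result).
GROUPS = {
    "valid": (lambda rng: rng.randint(0, 100), True),
    "invalid_above": (lambda rng: rng.randint(101, 10**9), False),
    "invalid_below": (lambda rng: rng.randint(-10**9, -1), False),
}


def run_sampled_tests(samples_per_group: int, seed: int) -> int:
    """Draw fresh samples from every group and check them; return how many ran."""
    rng = random.Random(seed)  # a fixed seed keeps any failure reproducible
    checked = 0
    for name, (draw, expected) in GROUPS.items():
        for _ in range(samples_per_group):
            value = draw(rng)
            assert is_valid(value) is expected, f"{name}: {value}"
            checked += 1
    return checked


print(run_sampled_tests(samples_per_group=1000, seed=42))  # 3000
```

Random sampling complements the hand-picked boundary values rather than replacing them, since a random draw may never land exactly on 0 or 100.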

Another approach is to evaluate where the software's greatest risks lie and which scenarios are most critical for the functionality under test. These scenarios are identified and prioritized so that the most important ones are covered within the time and resources allocated to the validation phase.
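One simple way to make that prioritization concrete is to score each scenario as likelihood times impact and fill the available testing budget in descending order of risk. The scenarios, scores and hours below are invented for illustration, and the scoring formula is one common convention among several.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    name: str
    likelihood: int    # 1 (rare) to 5 (frequent), hypothetical scale
    impact: int        # 1 (minor) to 5 (critical), hypothetical scale
    cost_hours: float  # estimated effort to test the scenario

    @property
    def risk(self) -> int:
        return self.likelihood * self.impact


def prioritize(scenarios: list[Scenario], budget_hours: float) -> list[Scenario]:
    """Pick scenarios in descending risk order while they still fit the budget."""
    selected, remaining = [], budget_hours
    for scenario in sorted(scenarios, key=lambda s: s.risk, reverse=True):
        if scenario.cost_hours <= remaining:
            selected.append(scenario)
            remaining -= scenario.cost_hours
    return selected


scenarios = [
    Scenario("malformed flight plan", likelihood=2, impact=5, cost_hours=4),
    Scenario("report layout glitch", likelihood=4, impact=1, cost_hours=2),
    Scenario("failover to backup", likelihood=1, impact=5, cost_hours=6),
    Scenario("peak traffic load", likelihood=3, impact=4, cost_hours=5),
]
print([s.name for s in prioritize(scenarios, budget_hours=10)])
# ['peak traffic load', 'malformed flight plan']
```

The scenarios left out are not forgotten: the list makes explicit which risks remain untested with the current budget.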

✅ In the world of software testing, the later the error is found, the higher the cost of repair.

That is why so much effort goes into validation phases that ensure the functionality delivered is the one that was specified and that it meets functional and non-functional expectations (accessibility, number of users supported, load or response times, robustness, resilience, etc.).

How do we do it at Telefónica Tech?

In our area we have very diverse projects whose common thread is the use of the data available to us as an operator. They range from helping our customers avoid fraud when paying in stores, with SmartDigits, to helping public entities make data-driven decisions.

With SmartSteps mobility data we can understand how the population moves so that their mobility experience is the best possible, helping to plan or improve services according to real demand. We always do this with the assurance that the data is anonymized and aggregated to respect the privacy of all users.

We use a common work strategy to validate all these projects at a functional level and we share the same tools so that we can reuse the knowledge acquired and the way of working is similar regardless of the project. This allows us to focus our efforts on getting to know the functionality in depth and validate it in the best possible way.

◾ If you are interested in learning more about this world, I recommend looking into BDD (Behaviour Driven Development), a methodology that is a fundamental part of our functional validation process.

After reading all this I think you will now be able to understand this joke:

A tester walks into a new bar and orders a beer, orders 0 beers, orders 99999999 beers, orders a lizard, orders -1 beer and orders an asgdhfk.

The first real customer walks into the same bar and asks where the bathroom is. The bar explodes.