Software has become the backbone of countless critical systems. However, despite the promise of efficiency and reliability, software systems are not immune to failures, and when they occur, the consequences can be profound.
According to a 2017 report by Tricentis, software failures cost the global economy an estimated $1.7 trillion in that year alone. Moreover, the average cost of a software failure for a large organization can exceed $10 million, including direct expenses such as remediation efforts and lost revenue, as well as indirect costs like reputational damage and legal liabilities.
To provide insight into the potential ramifications of software failure, this article will showcase instances of software malfunction and their impacts.
Case #1 - Canada’s Phoenix Pay System
The Phoenix Pay System, implemented by the Government of Canada in 2016, aimed to modernize payroll processing for federal employees by centralizing operations. Instead, it was plagued by software glitches, data inaccuracies, and processing delays, disrupting pay for more than half of Canada's federal workers and costing Ottawa over $2 billion. The debacle embarrassed politicians, spurred action from public sector unions, and raised lasting doubts about the system's effectiveness.
The fallout was widespread, reaching employees across many departments, including healthcare. Workers were underpaid, overpaid, or in some cases not paid at all, leading to financial hardship and frustration.
An audit by the Office of the Auditor General of Canada revealed critical shortcomings in the implementation of the Phoenix Pay System:
Firstly, the system was launched in 2016 without adequate testing or a contingency plan. It lacked essential pay-processing functions, had significant security weaknesses, and had no plan for upgrading its underlying software. Secondly, the project was marred by a false economy: rather than seek additional funding to ensure the project's success, Public Services and Procurement Canada cut project staff and reduced the number of software modules required for full pay processing.
Finally, there was a profound failure to heed warnings. Despite concerns raised by other departments, project executives at Public Services and Procurement Canada disregarded warnings that Phoenix wasn't ready for launch.
In light of these findings, the auditor recommended that government-wide projects undergo fully independent reviews well before their launch.
Case #2 - National Health Service
In 2016, a critical coding error was discovered in the SystmOne clinical computer system used by the National Health Service in the United Kingdom. The error affected approximately 150,000 patients, primarily those with heart conditions: it skewed their cardiovascular risk assessments, producing incorrect advice about their susceptibility to heart attacks and strokes. Patients incorrectly classified as low-risk may have been deprived of necessary interventions or treatments, putting their health at risk, while those erroneously categorized as high-risk may have been subjected to unnecessary medical interventions or needless anxiety about their health.
Subsequent investigations into the issue revealed that the coding error had persisted since 2009, remaining undetected for several years.
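To make the failure mode concrete, here is a minimal, hypothetical sketch in Python. The formula, threshold, and field names are invented for illustration; the real SystmOne code and the clinical algorithm behind it are far more complex and not public. The point is how wiring a single wrong input into a risk calculation can silently flip a patient from "treat" to "low risk":

```python
# Hypothetical sketch, NOT the actual SystmOne code: a toy risk score fed
# the wrong input. All numbers and names here are invented for illustration.

TREATMENT_THRESHOLD = 10.0  # % 10-year risk above which intervention is advised

def ten_year_risk(age, systolic_bp, chol_ratio):
    """Toy risk score (illustrative only, not a clinical formula)."""
    return 0.2 * (age - 40) + 0.05 * (systolic_bp - 120) + 1.5 * chol_ratio

# Correct input: total cholesterol / HDL ratio (e.g. 6.0)
correct = ten_year_risk(age=58, systolic_bp=140, chol_ratio=6.0)

# Buggy input: the code accidentally passes HDL alone (e.g. 1.2)
buggy = ten_year_risk(age=58, systolic_bp=140, chol_ratio=1.2)

print(f"correct: {correct:.1f}% -> {'treat' if correct > TREATMENT_THRESHOLD else 'low risk'}")
print(f"buggy:   {buggy:.1f}% -> {'treat' if buggy > TREATMENT_THRESHOLD else 'low risk'}")
```

With these invented numbers, the correct calculation yields 13.6% (treat) and the buggy one yields 6.4% (low risk): the same patient, two opposite recommendations, and no crash or error message to alert anyone.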
Case #3 - Nest Thermostat
The Nest thermostat bug left users feeling anything but cozy when it caused their homes to become uncomfortably cold. Imagine relying on your smart thermostat to keep you warm during the chilly months, only to wake up to a freezing house because of a software glitch.
The issue, which occurred in 2016, affected users of the popular Nest Learning Thermostat. It caused the thermostats to shut down unexpectedly, resulting in plummeting indoor temperatures and leaving users shivering in their own homes.
As temperatures plummeted, frustrated users took to Twitter to vent their complaints, amplifying the public fallout.
For many affected users, the experience was not only inconvenient but also alarming. They had trusted their smart thermostats to regulate their home's temperature reliably, only to be left in the cold due to a software bug.
The problem stemmed from a firmware update released by Nest Labs, the company behind the Nest Learning Thermostat. The update was intended to improve the thermostat's functionality and performance but inadvertently introduced a bug that drained the battery, resulting in a loss of heating function.
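The design lesson is graceful degradation. The sketch below is purely hypothetical and is not Nest's actual firmware; the thresholds, feature names, and relay interface are all invented. It shows the pattern a safety-minded thermostat might follow: shed non-essential features as the battery drains, and if the charge becomes critical, fail with the heat on rather than off:

```python
# Hypothetical failsafe sketch, not Nest's firmware. Thresholds and names
# are invented; the point is the ordering of what fails first.

class Relay:
    def __init__(self):
        self.heating = False

    def set(self, on: bool):
        self.heating = on

LOW_BATTERY_V = 3.6   # illustrative thresholds, in volts
CRITICAL_V = 3.3

def control_step(battery_v, room_temp, setpoint, relay, features):
    if battery_v < LOW_BATTERY_V:
        # Shed power-hungry extras (Wi-Fi, learning) before touching heating.
        features.discard("wifi")
        features.discard("learning")
    if battery_v < CRITICAL_V:
        # Fail safe: a warm house with no smart features beats a cold one.
        relay.set(on=True)
        return
    relay.set(on=room_temp < setpoint)

relay, features = Relay(), {"wifi", "learning"}
control_step(battery_v=3.2, room_temp=15.0, setpoint=20.0,
             relay=relay, features=features)
print(relay.heating, features)  # True set() -- heat stays on even at low charge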
Case #4 - Air Traffic Control in LA
The Air Traffic Control (ATC) system at Los Angeles International Airport (LAX) is tasked with a crucial responsibility: ensuring the safe and efficient movement of aircraft within its airspace. This involves providing pilots with vital information such as weather updates, flight routes, and the proximity of other aircraft. Prompt communication between ATC and pilots is essential to prevent potential disasters in the skies.
However, on September 14, 2004, the ATC at LAX faced a harrowing situation. Voice communication with approximately 400 aircraft in the southwestern United States was suddenly lost, and many planes were on collision courses with each other. The cause? The primary voice communication system unexpectedly shut down, leaving controllers scrambling to maintain contact with pilots.
To compound the issue, the backup communication system also failed shortly after activation.
Quick-thinking controllers used their personal cellphones to alert other traffic control centers and airlines of potential collisions. Fortunately, the Traffic Collision Avoidance System (TCAS) on board commercial jets played a crucial role in averting disaster by instructing pilots to climb or descend when danger was detected.
The root cause of the outage was traced to a countdown timer in the Voice Switching and Control System (VSCS): when the timer reached zero, the system shut itself down. That design, combined with a lapse in following the maintenance procedures that would have restarted the timer in time, led to the failure. The FAA later deployed a software patch that periodically reset the counter without human intervention, but the incident underscored the need for solid redundancies in air traffic control systems to prevent similar crises in the future.
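The arithmetic is worth seeing. Public reporting on the incident describes a 32-bit millisecond counter; the VSCS itself is proprietary, so treat the width and the reset logic below as assumptions. A 32-bit millisecond countdown runs out after roughly 49.7 days of uptime, which is why the failure only appeared when a scheduled restart was missed:

```python
# Simplified sketch of the reported failure mode. The 32-bit width is an
# assumption drawn from public reporting; the real VSCS code is not public.

MS_PER_DAY = 24 * 60 * 60 * 1000
COUNTER_START = 2**32  # 32-bit millisecond counter, counting down

print(COUNTER_START / MS_PER_DAY)  # ~49.7 days until the counter hits zero

def tick(counter_ms: int) -> int:
    """Original behavior: reaching zero kills the system."""
    counter_ms -= 1
    if counter_ms == 0:
        raise SystemExit("VSCS shutdown: countdown reached zero")
    return counter_ms

def tick_patched(counter_ms: int) -> int:
    """Conceptually what the FAA patch did: reset automatically,
    well before zero, with no human in the loop."""
    counter_ms -= 1
    if counter_ms <= 1:
        counter_ms = COUNTER_START
    return counter_ms
```

The original design effectively turned a routine maintenance step into a single point of failure: skip one reboot and, 49.7 days later, every controller in the southwestern United States loses their radios at once.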
Case #5 - St. Mary’s Mercy Hospital
The blunder at St. Mary’s Mercy Hospital was like something out of a comedy of errors. Their patient-management software system, which was supposed to keep everything running smoothly, ended up causing quite the uproar.
The glitch occurred during a routine update to St. Mary's patient-management software. Instead of marking discharged patients with the code 01, the system labeled them as "expired" with the code 20, which is to say it declared they had kicked the bucket when they were actually just heading home. The mix-up caused a whirlwind of confusion and distress for 8,500 patients and their families. And to add insult to injury, the wrong information didn't stay within the hospital walls: it also made its way to insurance companies and the Social Security Office, causing a ripple effect of administrative chaos.
St. Mary’s spokeswoman Jennifer Cammenga had this to say: “To us, this is really not a very big story. We’re not going to elaborate anymore. It was a mapping error. That’s all we have to say about it.”
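"Mapping error" sounds dismissive, but it plausibly describes something as mundane as one wrong entry in a code-translation table. The Python sketch below is purely illustrative; only the codes 01 and 20 come from the reporting, and the table and event names are invented:

```python
# Hypothetical illustration of a "mapping error". The real system's code
# tables are not public; only the codes 01 and 20 come from the article.

STATUS_CODES = {
    "01": "discharged",
    "20": "expired",  # i.e., deceased
}

# Intended mapping after the update: internal event -> wire code
correct_map = {"patient_discharged": "01"}

# Buggy mapping shipped in the update: a single wrong table entry
buggy_map = {"patient_discharged": "20"}

event = "patient_discharged"
print(STATUS_CODES[correct_map[event]])  # discharged
print(STATUS_CODES[buggy_map[event]])    # expired -- the status reported
                                         # for 8,500 living patients
```

A one-character difference in a configuration table, invisible to any test that never checks the outbound codes, was enough to tell insurers and the government that thousands of living people had died.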
Takeaway
As software developers, we must take responsibility for the quality and reliability of the systems we build. That means prioritizing thorough testing, implementing robust quality assurance processes, and consistently maintaining our software so it operates safely and effectively. Cutting corners or rushing a release can have serious consequences, not only for our users but also for the reputation and trustworthiness of our work.
Companies and individuals engaged in software development and maintenance must remain vigilant and meticulous in their work to avert these kinds of digital avalanches.