A network misconfiguration by an AT&T Mobility employee caused the 12-hour network outage in February that blocked more than 25,000 emergency 911 calls, according to a July 22 report released by the Federal Communications Commission (FCC).
The FCC report found that the misconfiguration left voice and 5G data services unavailable to all AT&T Mobility users, affecting more than 125 million registered devices and blocking more than 92 million voice calls.
Based on its investigation, the FCC determined that, in addition to the configuration error, AT&T Mobility’s response was lacking and did not adhere to industry best practices, citing the following:
- A lack of adherence to AT&T Mobility’s internal procedures.
- A lack of peer review.
- Failure to adequately test after installation.
- Inadequate laboratory testing.
- Insufficient safeguards and controls to ensure approval of changes affecting the core network.
- A lack of controls to mitigate the effects of the outage once it began.
- A variety of system issues that prolonged the outage once the configuration error had been remedied.
An AT&T spokesperson offered the following response:
“We have implemented changes to prevent what happened in February from occurring again. We fell short of the standards that we hold ourselves to, and we regret that we failed to meet the expectations of our customers and the public safety community.”
Outages have become a new threat to security
Outages, whether caused by technical errors, cyberattacks, software issues or other causes like the one involving AT&T in February, are an increasingly visible concern and pose significant risks to both service providers and their customers, said Itzik Alvas, co-founder and CEO of Entro Security. Alvas said that while the AT&T and CrowdStrike outages differ in nature, both highlight vulnerabilities in our increasingly interconnected digital infrastructure.
“They expose systems to unauthorized access, increasing the risk of data breaches, ransomware attacks and other forms of malware,” said Alvas. “The trend reflects broader concerns about infrastructure resilience in the face of increasing technical complexity and external threats. Efforts to improve redundancy, cybersecurity and regulatory oversight are critical to addressing these vulnerabilities.”
Andy Ellis, operating partner at YL Ventures, added that at first glance, it’s easy to connect the dots between AT&T’s outage and CrowdStrike’s recent failure: both involved updates, both appear to have had insufficient testing, and both resulted in catastrophic outages.
But Ellis said the two failures are very different.
In AT&T’s case, Ellis said, the outage was triggered by its own employee’s configuration change, on its own device, operating inside its own network; the change propagated through other AT&T devices until AT&T’s safety mechanisms shut down parts of the network to contain the damage. That shutdown caused even more harm because AT&T didn’t appear to have planned for a graceful return to service, said Ellis.
“While the FCC report repeatedly casts blame on ‘an AT&T Mobility employee,’ this failure mode is one that network engineering teams around the world can empathize with, as it’s difficult to implement safe, automated deployments of network hardware, often resulting in humans ‘triggering’ network failures in the absence of programmatically generated configurations,” said Ellis.
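Ellis’s point about programmatically generated configurations can be made concrete with a small sketch. The Python below is purely illustrative and assumes a hypothetical change-record format; the field names and policy rules are invented, not AT&T’s actual tooling, but they mirror the gaps the FCC cites: peer review, lab testing and approval for changes that touch the core network.

```python
# Illustrative only: a minimal pre-deployment gate for a network configuration
# change. Field names and policy rules are hypothetical, sketching the kinds of
# checks the FCC findings point to (peer review, lab testing, core approvals).

REQUIRED_FIELDS = {"device_id", "change_ticket", "peer_reviewer", "lab_tested"}

def validate_change(change: dict) -> list[str]:
    """Return a list of policy violations; an empty list means the change may proceed."""
    problems = []
    missing = REQUIRED_FIELDS - change.keys()
    if missing:
        problems.append(f"missing required fields: {sorted(missing)}")
    if not change.get("peer_reviewer"):
        problems.append("no peer review recorded")
    if not change.get("lab_tested", False):
        problems.append("change was not exercised in the lab first")
    if change.get("scope") == "core" and not change.get("approved_by_change_board", False):
        problems.append("core-network change lacks change-board approval")
    return problems

if __name__ == "__main__":
    proposed = {
        "device_id": "edge-router-17",
        "change_ticket": "CHG-0042",
        "peer_reviewer": "",      # nobody signed off
        "lab_tested": False,      # lab testing skipped
        "scope": "core",
    }
    violations = validate_change(proposed)
    if violations:
        print("Change blocked:")
        for v in violations:
            print(f"  - {v}")
    else:
        print("Change approved for staged rollout.")
```

Run against the sample change above, the gate blocks the rollout for the same kinds of reasons the FCC lists: no peer reviewer, no lab test and no approval for a core-network change.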
In CrowdStrike’s case, Ellis said the outage was triggered by what the company calls a “content update,” though other providers might call it “metadata” or simply “dynamic configuration”: a change that engages capabilities in already-deployed software.
“The CrowdStrike change, based on their own incident report, survived their canary test, which highlights a double-parser hazard, where the system that ‘checks’ is somehow using a different parser than the production systems, and so validates an unsafe message,” said Ellis.
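The double-parser hazard Ellis describes can be reduced to a toy example. The hypothetical Python below is not CrowdStrike’s pipeline; it simply shows a lenient canary-side parser accepting a content update that the stricter production-side parser cannot handle, which is how a change can survive validation yet still fail in the field.

```python
# Illustrative only: a "double-parser hazard" reduced to a toy.
# The validator vets a content update with a lenient parser, while production
# consumes it with a stricter one, so an unsafe update passes the canary check
# but fails in production. The update format and field names are hypothetical.

def canary_parser(update: str) -> dict:
    """Lenient validator: skips anything it does not understand, never raises."""
    fields = {}
    for part in update.split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            fields[key.strip()] = value.strip()
    return fields

def production_parser(update: str) -> dict:
    """Strict consumer: requires clean key=value pairs from a known schema."""
    fields = {}
    for part in update.split(";"):
        key, value = part.split("=")  # raises ValueError on malformed parts
        if key.strip() not in {"rule_id", "pattern"}:
            raise ValueError(f"unknown field: {key.strip()}")
        fields[key.strip()] = value.strip()
    return fields

if __name__ == "__main__":
    update = "rule_id=42; pattern=a=b; extra"   # malformed under the strict rules

    canary_parser(update)                       # the canary check succeeds
    print("canary: update validated")

    try:
        production_parser(update)               # production fails on the same input
    except ValueError as exc:
        print(f"production: update rejected: {exc}")
```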
Jason Soroko, senior vice president of product at Sectigo, added that the pattern of outages caused by configuration errors, procedural lapses, and failures in testing rigor has become alarming. While a single incident might be an anomaly, the recurrence points to deeper systemic issues, said Soroko.
“These findings underscore the necessity for telecom and cybersecurity firms to improve internal protocols, invest in rigorous testing, and ensure comprehensive oversight during network changes,” said Soroko. “Ignoring these issues disrupts services and endangers public safety, as seen with the AT&T outage's impact on emergency services. The recent outages reveal a troubling trend that requires immediate attention.”
Chris Clymer, director and CISO at Inversion6, said it’s worth noting that none of these recent incidents appear to be security incidents. Rather, Clymer said these are more fundamental IT “operations fumbles” that have had major impacts. With consolidation of vendors and increasing reliance on a handful of cloud services, many companies are at increasing risk of the wrong provider having this type of event, said Clymer.
Clymer added that stockholder pressure to trim overhead and maximize profit is compounding these kinds of incidents. Investment in good process and additional layers of checks and balances reads as a drag on profitability, Clymer continued, until something bad enough happens to tank the stock.
“This external pressure has pushed publicly traded companies to hollow out their quality management and governance processes more and more, until they reach the breaking point,” said Clymer. “The problems with ... CrowdStrike's software deployments or with AT&T's network equipment are all very well understood ones, and very avoidable with proper staffing, process and experienced personnel. But all of these things get questioned in times between incidents.”