AT&T urges customers to use Wi-Fi while service is down

Editor’s Note: Bob Kolasky is the senior vice president for critical infrastructure at Exiger, a provider of analysis of supply chain and third-party risk for the US government and critical infrastructure industries. He previously led the Cybersecurity and Infrastructure Security Agency’s (CISA) National Risk Management Center. The views expressed in this commentary are his own.

CNN  — 

The news Thursday morning of the AT&T service outage, which affected tens of thousands, if not hundreds of thousands, of customers, was yet another reminder of the importance of critical infrastructure resilience.

Statements from the company do not point to any nefarious activity, but the Federal Communications Commission is investigating. The White House said other federal agencies are also in touch with the company, though it has not confirmed a specific understanding of what happened. (AT&T said in a statement Thursday evening, “Based on our initial review, we believe that today’s outage was caused by the application and execution of an incorrect process used as we were expanding our network, not a cyber attack.”)

Adversarial or not, the outage is noteworthy. By a few minutes after 3 pm ET, about 11 hours after customers’ initial reports of the outage, AT&T said it had restored service to all affected customers.

Communications, such as those provided by AT&T, are one of the “lifeline functions” designated by the US federal government as essential to national security, economic competitiveness and community well-being. Without these functions, which also include transportation, water and energy, critical systems start to fail, and an incident in one industry can become systemic across the whole ecosystem.

Any time we see an incident such as Thursday’s outage, two questions arise: Just how serious is the risk of a widespread outage, and how prepared are we as a country to respond?

The answer to the first question is daunting — there are many scenarios that theoretically threaten the telecommunications sector. Operational failure is one. This occurred in Canada two years ago when Rogers Communications lost service for more than 10 million mobile and internet customers for 19 hours.

At the same time, the threat of intentional cyber-attacks from sophisticated actors such as the Chinese Communist Party has been in the news a lot lately and is top of mind for security officials. These attacks can happen either via direct attempts to overwhelm network defenses or, perhaps more likely, via attempts to exploit vulnerable elements of the telecommunications supply chain or critical dependencies such as cloud service providers.

There are, however, also physical scenarios that affect communications. In 2022, in France, suspected domestic dissidents cut terrestrial fiber optic cables, disabling internet and other services in Paris for a considerable time. In the US, severe hurricanes in Puerto Rico, the Southeast and Gulf Coast have taken down communications. In addition, there are government working groups in the US focused on starker scenarios such as geomagnetic storms or an electromagnetic pulse attack.

Given that there’s no shortage of theoretical scenarios that could result in as much panic as we saw Thursday with AT&T, if not more, the most relevant question isn’t whether there is risk (there is), but instead whether there’s sufficient resilience in the system.

Resilience is the ability to withstand, absorb and bounce back from a shock. For communications, it can be conceptualized in two different ways: What is the scope and scale of the service outage and what are the cascading consequences of the outage?

AT&T’s service is back up and running. Even a controlled outage, however, can have cascading impacts on systems that rely on telecommunications to operate, and those impacts must be managed.

The priority should be first responder and public safety communications, but it’s also important to maintain the provision of just-in-time healthcare, the functioning of financial systems and the logistics of supply chains.

Telecommunications enables the availability of data centers and cloud services, which power many critical business networks. And, as we identified in a study I just co-authored for the Carnegie Endowment for International Peace, the cloud is increasingly a source of systemic risk for much of the nation’s infrastructure.

In a connected world, a widespread communications outage can have a contagion effect. Instances such as these must be accounted for in future planning and every effort must be made to learn from and share what happened.

While we may be in the “fog of incident,” we cannot remain so. Public-private partnerships, many of which include AT&T, must collaborate across the telecommunications sector and other lifeline industries to share information about the cause and effect of the outage.

There needs to be transparent communication between government and corporate stakeholders about threats and vulnerabilities and anything that looks anomalous. There also needs to be proactive communication to the public about what is learned. Infrastructure outage incidents can’t be addressed by stove-piping information.

As we seek to move forward, we need to enhance planning for operation in degraded communications environments. Increased reliance on satellite communications, redundant public safety networks and alternative 911 capabilities all mitigated the damage Thursday. This was made possible by investments in innovation and resilience.

Investments in resilience should continue across other critical functions so that they are able to operate while communications systems are unavailable, and there needs to be planning across critical systems for maintaining functionality in the face of foreseeable incidents. As Thursday’s outage demonstrated, communications services are a dependency that most systems share. A single outage can’t become something that causes systemic shock.