Microsoft has revealed its initial conclusions on what it thinks caused a major recent outage that affected some of its most popular software offerings.
The outage saw workers across Europe and Asia unable to log into Microsoft 365 services for several hours, with the likes of Microsoft Teams, Outlook, OneDrive for Business, Exchange Online and SharePoint all affected.
Having initially identified “a wide-area networking (WAN) routing change” as the culprit, Microsoft has now released the findings of its initial investigation into the outage, revealing that things were in fact a little more complicated than that.
Microsoft Teams outage explained
“Between 07:05 UTC and 12:43 UTC on 25 January 2023, customers experienced issues with networking connectivity, manifesting as long network latency and/or timeouts when attempting to connect to resources hosted in Azure regions, as well as other Microsoft services including Microsoft 365 and Power Platform,” the company’s report noted.
“We determined that a change made to the Microsoft Wide Area Network (WAN) impacted connectivity between clients on the internet to Azure, connectivity across regions, as well as cross-premises connectivity via ExpressRoute.”
“As part of a planned change to update the IP address on a WAN router, a command given to the router caused it to send messages to all other routers in the WAN, which resulted in all of them recomputing their adjacency and forwarding tables. During this re-computation process, the routers were unable to correctly forward packets traversing them. The command that caused the issue has different behaviors on different network devices, and the command had not been vetted using our full qualification process on the router on which it was executed.”
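To illustrate why a single command could affect the whole network, here is a toy model of an update flooding from one router to every other router it can reach. It is a simplified sketch, not Microsoft's actual routing protocol: the five-router topology and the function name `routers_recomputing` are assumptions made up for this example.

```python
from collections import deque


def routers_recomputing(links, origin):
    """Return every router reached by an update flooded from `origin`.

    `links` maps a router name to the names of its direct neighbours.
    In this toy model, each router that receives the update is assumed
    to recompute its adjacency and forwarding tables.
    """
    seen = {origin}
    queue = deque([origin])
    while queue:
        router = queue.popleft()
        for neighbour in links.get(router, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


# Hypothetical five-router WAN, invented for illustration.
links = {
    "r1": ["r2", "r3"],
    "r2": ["r1", "r4"],
    "r3": ["r1", "r4"],
    "r4": ["r2", "r3", "r5"],
    "r5": ["r4"],
}

affected = routers_recomputing(links, "r1")
print(f"{len(affected)} of {len(links)} routers recompute")
```

In a connected network like this one, a change on a single router reaches all of them, which matches the report's description of every router in the WAN recomputing at once.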
Microsoft said that overall, it was able to identify the problem within an hour, and all its internal networking equipment was back to normal within two and a half hours.
To help prevent the same issue from occurring again, Microsoft says it has “blocked highly impactful commands from getting executed on the devices”. The company is also working on adding a new requirement for all command execution on its devices to follow safe change guidelines.
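As a rough idea of what such a block might look like, the sketch below refuses a high-impact command unless it has been qualified on the specific device model it is about to run on. Microsoft has not published its implementation, so the class `ChangeGuard`, the command string and the device model names here are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeGuard:
    """Hypothetical pre-execution check for router commands."""

    # Commands considered high impact (assumed list, for illustration).
    high_impact: set = field(default_factory=set)
    # (command, device_model) pairs that have passed qualification.
    qualified: set = field(default_factory=set)

    def is_allowed(self, command, device_model):
        # Low-impact commands pass through unchanged.
        if command not in self.high_impact:
            return True
        # High-impact commands need qualification on this exact model,
        # because the same command can behave differently per device.
        return (command, device_model) in self.qualified


guard = ChangeGuard(
    high_impact={"update-router-ip"},
    qualified={("update-router-ip", "model-a")},
)

print(guard.is_allowed("update-router-ip", "model-a"))
print(guard.is_allowed("update-router-ip", "model-b"))
print(guard.is_allowed("show-status", "model-b"))
```

Keying the check on the device model as well as the command reflects the report's point that the command behaved differently on different network devices.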