Microsoft has provided information on the root cause of the massive outage that last week impacted its Azure Active Directory authentication services across Europe, Asia, and the Americas.
The 14-hour outage impacted Microsoft Azure AD Multi-Factor Authentication (MFA) services on November 19 and prevented users of Office 365, Azure, Dynamics and other services from logging in if MFA was required. The event was mitigated on November 19, but details on what caused it were only provided now.
During their investigation of the issue, Microsoft’s engineers discovered three root causes, each of them leading to the other. Due to gaps in telemetry and monitoring for the MFA services, however, the identification of these causes was delayed, and the mitigation time was extended.
The first two root causes, Microsoft explains, were introduced in a roll-out of a code update that began in some datacenters on November 13. They would be activated once a certain traffic threshold was exceeded, which happened on November 19 in the Azure West Europe (EU) datacenters due to morning peak traffic.
The first root cause was a latency issue in the Azure AD MFA frontend’s communication to its cache services. Triggered under high load once a certain traffic threshold was reached, the issue would render the services susceptible to the second root cause.
The second issue was a race condition in processing responses from the MFA backend server. It eventually led to recycles of the MFA frontend server processes, thus triggering additional latency, as well as the third root cause on the MFA backend.
Previously undetected, the third identified root cause led to accumulation of processes on the MFA backend. This eventually resulted in resource exhaustion on the backend, thus preventing it from processing further requests from the MFA frontend.
The outage was essentially caused by a change recently rolled out to more effectively manage connections to the caching services, which “introduced more latency and a race-condition in the new connection management code, under heavy load,” Microsoft explains.
Because of this rollout, the MFA service slowed down processing of requests, which first impacted the West EU datacenters. Microsoft’s engineers attempted to mitigate the issue in various manners, including changing traffic patterns in the EU datacenters and disabling auto-mitigation systems to reduce traffic volumes, which eventually caused the same issues in the East US datacenters too.
After discovering that backend resource limits were exhausted and that MFA messages were no longer delivered to customers, the recent deployment was rolled back and capacity was added, which mitigated the latency issue, but the service was fully restored only after the MFA backend servers were cycled.
“The initial diagnosis of these issues was difficult because the various events impacting the service were overlapping and did not manifest as separate issues. This was made more acute by the gaps in telemetry that would identify the backend server issue,” Microsoft says.
The outage was mitigated on November 19, but the incident was kept open for two more days to monitor and investigate any further problems. Microsoft also decided to roll out additional improvements to the Azure platform by December 2018 to prevent similar issues from happening.
The company will review its update deployment procedures, review the monitoring services to reduce detection time, review containment processes to avoid propagating an issue, and update the communications process to the Service Health Dashboard and monitoring tools to detect publishing issues immediately during incidents.
Related: Microsoft Adds New Tools to Azure DDoS Protection
Related: Microsoft Boosts Azure Security With Array of New Tools