SecurityWeek

Microsoft Details Cause of Recent Multi-Factor Authentication Outage

Microsoft has provided information on the root cause of the massive outage that last week impacted its Azure Active Directory authentication services across Europe, Asia, and the Americas.

The 14-hour outage impacted Microsoft Azure AD Multi-Factor Authentication (MFA) services on November 19 and prevented users of Office 365, Azure, Dynamics and other services from logging in if MFA was required. The event was mitigated on November 19, but details on what caused it were only provided now.

During their investigation, Microsoft’s engineers discovered three root causes, each leading to the next. Gaps in telemetry and monitoring for the MFA services, however, delayed the identification of these causes and extended the mitigation time.

The first two root causes, Microsoft explains, were introduced in a roll-out of a code update that began in some datacenters on November 13. They would be activated once a certain traffic threshold was exceeded, which happened on November 19 in the Azure West Europe (EU) datacenters due to morning peak traffic.

The first root cause was a latency issue in the Azure AD MFA frontend’s communication to its cache services. Triggered under high load once a certain traffic threshold was reached, the issue would render the services susceptible to the second root cause.

The second issue was a race condition in processing responses from the MFA backend server. It eventually led to recycles of the MFA frontend server processes, thus triggering additional latency, as well as the third root cause on the MFA backend.
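Microsoft has not published the affected code, so the following is only a minimal Python sketch of the general class of bug, not the actual MFA frontend. It assumes a hypothetical `PendingRequests` table in which a frontend matches backend responses to waiting requests. If the "is it still pending?" check and the removal were separate, unlocked steps, two threads handling the same response (for example, an original and a retry) could both claim the request. Doing both steps under one lock avoids that.

```python
import threading


class PendingRequests:
    """Hypothetical frontend table of MFA requests awaiting a backend response."""

    def __init__(self):
        self._pending = {}
        self._lock = threading.Lock()

    def register(self, request_id, user):
        """Record that request_id is waiting for a backend response."""
        with self._lock:
            self._pending[request_id] = user

    def complete(self, request_id):
        """Return the user if this call claimed the request, else None.

        The lookup and the removal happen under one lock, so a duplicate or
        late response for the same request_id cannot claim it twice.
        """
        with self._lock:
            return self._pending.pop(request_id, None)


def deliver_concurrently(table, request_id, workers=8):
    """Simulate several threads handling the same backend response."""
    claimed = []
    claimed_lock = threading.Lock()

    def handle():
        user = table.complete(request_id)
        if user is not None:
            with claimed_lock:
                claimed.append(user)

    threads = [threading.Thread(target=handle) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return claimed
```

In this sketch, however many workers race on the same response, exactly one of them receives the user and the rest receive `None`.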

Previously undetected, the third identified root cause led to accumulation of processes on the MFA backend. This eventually resulted in resource exhaustion on the backend, thus preventing it from processing further requests from the MFA frontend.
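One common defense against this kind of accumulation is to cap the amount of in-flight work and reject the excess early. The sketch below illustrates that idea only; it is an assumption for illustration, not a description of how Microsoft's MFA backend works or what fix was applied. The names `BoundedBackend`, `BackendOverloaded`, and `submit_stuck_requests` are hypothetical.

```python
class BackendOverloaded(Exception):
    """Raised when the backend refuses new work instead of queueing it."""


class BoundedBackend:
    """Hypothetical backend that caps the number of in-flight requests."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self.rejected = 0

    def start(self):
        """Admit one request, or reject it if the cap is already reached."""
        if self.in_flight >= self.max_in_flight:
            self.rejected += 1
            raise BackendOverloaded("in-flight limit reached")
        self.in_flight += 1

    def finish(self):
        """Mark one admitted request as done, freeing a slot."""
        if self.in_flight > 0:
            self.in_flight -= 1


def submit_stuck_requests(backend, count):
    """Submit requests that never finish and return how many were admitted."""
    accepted = 0
    for _ in range(count):
        try:
            backend.start()
            accepted += 1
        except BackendOverloaded:
            pass  # shed the request rather than let work pile up
    return accepted
```

With a cap of three, ten stuck requests leave three in flight and seven rejected, so the backend's resource use stays bounded even though nothing completes.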

The outage was essentially caused by a change recently rolled out to more effectively manage connections to the caching services, which “introduced more latency and a race-condition in the new connection management code, under heavy load,” Microsoft explains.

Because of this rollout, the MFA service slowed down processing of requests, which first impacted the West EU datacenters. Microsoft’s engineers attempted to mitigate the issue in several ways, including changing traffic patterns in the EU datacenters and disabling auto-mitigation systems to reduce traffic volumes, which eventually caused the same issues in the East US datacenters too.

After discovering that backend resource limits were exhausted and that MFA messages were no longer being delivered to customers, the engineers rolled back the recent deployment and added capacity, which mitigated the latency issue. The service was fully restored, however, only after the MFA backend servers were cycled.

“The initial diagnosis of these issues was difficult because the various events impacting the service were overlapping and did not manifest as separate issues. This was made more acute by the gaps in telemetry that would identify the backend server issue,” Microsoft says.

The outage was mitigated on November 19, but the incident was kept open for two more days to monitor and investigate any further problems. Microsoft also decided to roll out additional improvements to the Azure platform by December 2018 to prevent similar issues from happening.

The company will review its update deployment procedures, review the monitoring services to reduce detection time, review containment processes to avoid propagating an issue, and update the communications process to the Service Health Dashboard and monitoring tools to detect publishing issues immediately during incidents.
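As a rough illustration of what a containment check in a staged deployment might look like, the sketch below compares latency samples from updated servers against a baseline and halts the rollout on a regression. This is an assumption for illustration, not Microsoft's process: the function name `should_halt_rollout`, the use of the median, and the 1.5x threshold are all hypothetical.

```python
from statistics import median


def should_halt_rollout(baseline_ms, canary_ms, max_ratio=1.5):
    """Return True if the rollout should stop.

    baseline_ms: latency samples (ms) from servers on the old code.
    canary_ms:   latency samples (ms) from servers on the new code.
    max_ratio:   hypothetical tolerance for median latency growth.
    """
    if not baseline_ms or not canary_ms:
        # Missing telemetry is treated as a reason to stop, not to proceed.
        return True
    return median(canary_ms) > max_ratio * median(baseline_ms)
```

A check like this only helps if the updated servers see representative load. According to Microsoft, the issues here were activated only above a certain traffic threshold, so samples collected at low traffic would not have revealed them.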

Written By

Ionut Arghire is an international correspondent for SecurityWeek.
