Microsoft has revealed the causes of a major global incident last week that left large numbers of Azure, Office 365, Dynamics and other Microsoft users unable to log in to their services.
The 14-hour outage affected Microsoft Azure AD Multi-Factor Authentication (MFA) services, but “gaps in telemetry and monitoring” for these services delayed efforts to spot and understand the underlying causes, Redmond admitted.
The first two causes resulted from a code update rolled out between November 13 and 16. First, there was a latency issue that surfaced during a period of high traffic load and affected communication between the MFA front-end and its cache services.
MFA services experiencing this latency were likely to trigger the second issue, which was a “race condition” in processing responses from the MFA back-end server.
This in turn led to recycles of the MFA front-end server processes, which triggered another issue on the MFA back-end.
This third root cause had gone undetected until it was triggered by the first two.
“This issue causes accumulation of processes on the MFA back-end leading to resource exhaustion on the back-end at which point it was unable to process any further requests from the MFA front-end while otherwise appearing healthy in our monitoring,” explained Microsoft.
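The failure mode Microsoft describes, a service that has run out of resources yet still looks healthy, can be illustrated with a minimal Python sketch. This is not Microsoft's code: the `Backend` class, its method names and the two-slot limit are hypothetical, chosen only to show the difference between a check that asks "is the process running?" and one that asks "could it accept another request?".

```python
import threading
from typing import Tuple


class Backend:
    """Hypothetical back-end with a fixed number of request slots."""

    def __init__(self, max_in_flight: int) -> None:
        self.max_in_flight = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_begin_request(self) -> bool:
        """Reserve a slot; refuse when every slot is already taken."""
        with self._lock:
            if self._in_flight >= self.max_in_flight:
                return False
            self._in_flight += 1
            return True

    def end_request(self) -> None:
        """Release a slot once a request finishes."""
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1

    def is_alive(self) -> bool:
        """Shallow check: the process is up, which says nothing about capacity."""
        return True

    def can_serve(self) -> bool:
        """Deeper check: healthy only if a new request could be accepted."""
        with self._lock:
            return self._in_flight < self.max_in_flight


def demo() -> Tuple[bool, bool]:
    backend = Backend(max_in_flight=2)
    # Simulate requests that never complete, so occupied slots accumulate.
    backend.try_begin_request()
    backend.try_begin_request()
    return backend.is_alive(), backend.can_serve()
```

In this sketch, a monitor that polls only `is_alive()` would keep reporting a healthy service after every slot is stuck, while one that polls `can_serve()` would flag the saturation.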
The computing giant apologized and claimed it was taking several steps to ensure the same thing doesn’t happen again.
These include reviewing its update deployment procedures to better identify problems during testing and deployment; reviewing its monitoring services to reduce time-to-detection of incidents; and reviewing its containment process to avoid propagating issues to other datacenters.
Microsoft said it is also reviewing its Service Health Dashboard and monitoring tools so that publishing issues are detected immediately during incidents.