Microsoft had “insufficient” staffing levels at its data centre campus last week when a power sag knocked the chiller plant serving two data halls offline, cooking portions of its storage hardware.
The company has released a preliminary post-incident report (PIR) for the outage, which saw large enterprise customers including Bank of Queensland and Jetstar completely lose service.
The PIR sheds light on why some enterprises lost service altogether: so many storage nodes were gracefully shut down – or had components fried – in the incident that data, and all replicas of it, were offline.
In addition, after storage nodes were finally recovered, a “tenant ring” hosting over 250,000 databases failed, albeit with uneven impact on customers.
Microsoft said the cooling capacity for the two affected data halls “consisted of seven chillers, with five chillers in operation and two chillers in standby (N+2)”.
A power sag, or voltage dip, caused all five operating chillers to fault, and only one of the two standby units worked.
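To put rough numbers on that, the N+2 arithmetic can be sketched in a few lines of Python. This is an illustration only: the class and parameter names are hypothetical, N is taken to be five based on the figures Microsoft quoted, and every chiller is assumed to have equal capacity.

```python
from dataclasses import dataclass


@dataclass
class ChillerPlant:
    """Hypothetical model of an N+k chiller plant (not from Microsoft's report)."""

    required: int  # N: chillers assumed necessary to carry the cooling load
    standby: int   # k: spare units held in reserve

    def shortfall(self, faulted_operating: int, standby_started: int) -> int:
        """Return how many chillers short of N the plant is (0 if the load is covered)."""
        if not 0 <= faulted_operating <= self.required:
            raise ValueError("faulted_operating must be between 0 and N")
        if not 0 <= standby_started <= self.standby:
            raise ValueError("standby_started must be between 0 and the standby count")
        running = (self.required - faulted_operating) + standby_started
        return max(0, self.required - running)


# Seven chillers in total: five operating plus two on standby (N+2).
plant = ChillerPlant(required=5, standby=2)

# The reported incident: all five operating units fault, one standby starts.
print(plant.shortfall(faulted_operating=5, standby_started=1))  # 4

# What N+2 is designed to absorb: two faults, both standbys start.
print(plant.shortfall(faulted_operating=2, standby_started=2))  # 0
```

On these assumptions, an N+2 design covers the loss of any two units, but a single event that faults every operating chiller at once leaves the plant four units short even with one standby running.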
Microsoft said the onsite staff “performed our documented emergency operational procedures (EOP) to attempt to bring the chillers back online, but were not successful.”
The company appears to have been caught out by the scale of the incident: too few staff were onsite, and its emergency procedures did not cater for a failure of that size.
“Due to the size of the data centre campus, the staffing of the team at night was insufficient to restart the chillers in a timely manner,” the company said.