Human Error, Even at Microsoft’s Data Centre?

Last month an unexpected release of fire suppression agent during periodic maintenance caused several services on the Microsoft Azure platform to shut down automatically. Because of this human error, customers in Northern Europe had difficulty connecting to the hosted services. You can read about it on the ‘Azure Status History’ page, where you’ll find detailed information on the event.

This confirms our experience that outages often occur during maintenance activities. Maintenance is a typical situation where humans intervene in automated systems: a filter in the HVAC needs to be replaced, or UPSs need to be taken down for inspection. These are the moments when ‘human errors’ can have a significant impact on systems that are normally fully automated. In this case someone probably connected the wrong wires or pushed the wrong button, which caused the system to release its agent. This set off a chain of events that started with the automatic shutdown of the air circulation. That is a logical step, as the system assumes that a fire has caused the fire suppression system to trip. With the cold air supply shut down, the temperature in the white space rose suddenly. This caused servers and storage systems to start their shutdown procedures, which made some of the Azure services unavailable. It is a typical domino disaster, where a relatively innocent action, the release of fire suppression agent, is followed by a series of automated responses that finally cause systems to shut down.

That brings us to the one factor that is hard to automate: the human factor. Humans are still an indispensable part of the data centre workflow. Equipment needs to be installed in the racks, filters need to be cleaned or replaced, and UPSs require regular maintenance, just like HVACs, generators and so on.

The data centre manager has to take into account that humans are more likely to make mistakes than automated systems (do those make mistakes at all?). Procedures such as proper documentation and detailed work orders decrease the error rate dramatically. Some critical tasks require at least two people on the job, keeping an eye on each other. People are great at creative work, but they have a poor track record in repetitive tasks, and most maintenance jobs are repetitive. It’s only human to make mistakes.

The point is that management should take this into consideration and anticipate that humans will make mistakes. Besides having correct and detailed work orders, they should also prepare their automated systems for human error. A proper DCIM system is able to cope with maintenance situations. If the fire suppression system in the above situation had been in ‘maintenance mode’ in the DCIM, the release of the agent would not have shut down the air circulation. The domino chain would have been stopped and no Azure customer would have noticed the incident.

In our experience it is important that your DCIM has this kind of intelligence built into it. Maintenance is a planned event and should be entered into the DCIM, so that anomalies during this period can be handled differently than during normal operation.
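
To make the idea concrete, here is a minimal sketch of how a planned maintenance window might change the automated response to an agent release. It is an illustration only and not the Perf-DCMS implementation or API: the names `MaintenanceWindow`, `respond_to_agent_release`, the subsystem label and the action strings are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class MaintenanceWindow:
    """A planned maintenance period for one subsystem (hypothetical model)."""
    subsystem: str
    start: datetime
    end: datetime

    def covers(self, subsystem: str, when: datetime) -> bool:
        # True if the given subsystem is under planned maintenance at that time.
        return subsystem == self.subsystem and self.start <= when <= self.end


def respond_to_agent_release(when: datetime, windows: list) -> list:
    """Return the actions to take when a fire suppression agent release is detected."""
    in_maintenance = any(w.covers("fire_suppression", when) for w in windows)
    if in_maintenance:
        # Planned work on the suppression system: alert operators and wait for
        # a human decision instead of cutting the cold air supply.
        return ["notify_operators", "require_manual_confirmation"]
    # Normal operation: assume a real fire and follow the automatic response.
    return ["notify_operators", "shut_down_air_circulation"]


# Example: an (assumed) maintenance window registered in advance.
window = MaintenanceWindow(
    subsystem="fire_suppression",
    start=datetime(2020, 1, 1, 9, 0),
    end=datetime(2020, 1, 1, 12, 0),
)
print(respond_to_agent_release(datetime(2020, 1, 1, 10, 0), [window]))
```

In this sketch the alarm is never ignored during maintenance. Only the automatic shutdown of the air circulation is replaced by a manual confirmation step, which is what breaks the domino chain.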

Ask us about the ‘maintenance mode’ of Perf-DCMS, our DCIM suite.

Follow us, or subscribe to our newsletter (3 to 4 times per year).