Reason for Outage – Network and VM Outage
Date of event: 2015-09-05
The following high-level timeline describes the events that occurred from Saturday 2015-09-05 08:03 through to Sunday 2015-09-06 03:45.
08:03 First Nexus switch in Telecity rebooted
08:11 Second Nexus switch in Telecity rebooted
08:35 Large volumes of VMware alerts across all sites
08:35 Hundreds of machines frozen, corrupted, or otherwise needing repair
09:00 - Sunday 03:45 Repairing machines & services
The first network switch reboot had no impact on systems, but the reboot of its resilient pair caused a Layer 2 routing loop. The routing loop meant that all virtual machines on the platform were unable to see the storage layer. Approximately 50% of all machines were affected, some so severely that their operating system images were corrupted, rendering the machines unusable.
Customers were impacted to varying degrees and at varying times, depending on which machines their service relied on and the damage to those machines.
The operations team deleted and recreated the affected machines in each cluster to restore services. The team is continuing to rebuild machines lost on the day to bring spare headroom back up to the level it was at before the issue occurred.
Cisco has identified the Nexus switch reboot as a known software issue. The vendor’s recommended action is to upgrade the switch to the latest stable version. DXI installed this version elsewhere in February this year, and it has not demonstrated any issues.
All affected switches will be upgraded to the latest stable version between 8pm and 11pm on Saturday 12th September. The configuration that would create a routing loop if both switches of a resilient pair rebooted at the same time has been removed.