March 5, 2012

Perhaps They Should Have Tested More - Windows Azure

So we know it's Leap Day, but what time is it?


On Wednesday, February 29th, Windows Azure experienced about 8 hours of downtime for some of its critical services. The cause? Leap Day.

  • a component of Windows Azure experienced a worldwide outage for eight hours
  • a series of outages that affected multiple aspects of the system 
  • prevented customers from carrying out management operations for technology that uses the cloud management service
  • issue appears to be due to a time calculation that was incorrect for the leap year
  • outage apparently was triggered by a key server in Ireland housing a certificate that expired at midnight on Feb. 28
  • Azure users posted a stream of critical comments about the outages to the service's official forums
  • a customer described the problem as an "admin nightmare" and said they couldn't understand how such an important system could go down
  • Microsoft blamed the Azure management problems on a "cert issue triggered on 2/29/12"
  • the service has not been around for four years yet, and on its first leap year day, it collapsed
  • initial problems propagated to different territories, and live customer-facing sites became unavailable
  • in some markets, Microsoft had promoted its Azure cloud service using the slogan “I laugh in the face of unpredictability”.
  • "Microsoft will have to start its cloud marketing from scratch, to rebuild a level of trust that has now crumbled"

Perhaps Leap Day wasn't predictable for Microsoft (although experts tell me that it has been known to occur almost every 4 years), and those time calculations can indeed be tricky.

But perhaps Microsoft should have tested more.
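
To show how little testing it takes to catch this class of bug, here is a minimal Python sketch. It is purely illustrative and is not Microsoft's code: the function names naive_add_year and safe_add_year are hypothetical, and the "add one year to today's date" calculation is only my assumption about the kind of time calculation that can go wrong on Leap Day.

```python
from datetime import date


def naive_add_year(d: date) -> date:
    """Hypothetical buggy version: bump the year, keep the month and day.

    Raises ValueError for Feb 29, because the following year has no Feb 29.
    """
    return d.replace(year=d.year + 1)


def safe_add_year(d: date) -> date:
    """Hypothetical fix: fall back to Feb 28 when the target date does not exist."""
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        # Only reachable for Feb 29: the next year is never a leap year.
        return d.replace(year=d.year + 1, day=28)


if __name__ == "__main__":
    leap_day = date(2012, 2, 29)

    # The safe version yields a valid date one year out.
    print("safe:", safe_add_year(leap_day))

    # The naive version blows up on exactly one day every four years or so.
    try:
        naive_add_year(leap_day)
    except ValueError as err:
        print("naive version failed:", err)
```

A single test case that feeds February 29 into the date calculation would have exposed the naive version immediately. Whether to land on February 28 or March 1 is a design decision; the point is to make it deliberately, and then test it.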

See also:
http://www.eweek.com/c/a/Enterprise-Applications/Microsoft-Windows-Azure-Downtime-Blamed-on-Leap-Year-Bug-707169/
http://www.eweek.com/c/a/Cloud-Computing/Microsoft-Azure-LeapYear-Glitch-Key-Lessons-Learned-280164/
http://www.computerworld.com/s/article/9224792/Microsoft_Azure_stabilizes_after_leap_year_glitch
http://www.tgdaily.com/software-brief/61798-azure-leap-day-bug-causes-chaos
http://www.theregister.co.uk/2012/02/29/windows_azure_outage/
http://www.informationweek.com/news/cloud-computing/infrastructure/232601812
http://www.zdnet.com/news/windows-azure-suffers-worldwide-outage/6348160
http://www.pcworld.com/businesscenter/article/251043/microsofts_azure_cloud_suffers_serious_outage.html

This article originally appeared in my blog: All Things Quality
My name is Joe Strazzere and I'm currently a Director of Quality Assurance.
I like to lead, to test, and occasionally to write about leading and testing.
Find me at http://AllThingsQuality.com/.
