So we know it's Leap Day, but what time is it?
On Wednesday, February 29th, Windows Azure experienced about 8 hours of downtime for some of its critical services. The cause? Leap Day.
- a component of Windows Azure experienced a worldwide outage for eight hours
- a series of outages that affected multiple aspects of the system
- prevented customers from carrying out management operations for technology that uses the cloud management service
- issue appears to be due to a time calculation that was incorrect for the leap year
- outage apparently was triggered by a key server in Ireland housing a certificate that expired at midnight on Feb. 28
- Azure users posted a stream of critical comments about the outages to the service's official forums
- a customer described the problem as an "admin nightmare" and said they couldn't understand how such an important system could go down.
- Microsoft blamed the Azure management problems on a "cert issue triggered on 2/29/12"
- the service has not been around for four years yet, and on its first leap year day, it collapsed
- initial problems propagated to different territories, and live customer-facing sites became unavailable
- in some markets, Microsoft had promoted its Azure cloud service using the slogan “I laugh in the face of unpredictability”.
- "Microsoft will have to start its cloud marketing from scratch, to rebuild a level of trust that has now crumbled"
Perhaps Leap Day wasn't predictable for Microsoft (although experts tell me that it has been known to occur almost every 4 years), and those time calculations can indeed be tricky.
But perhaps Microsoft should have tested more.
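To show how easily a date calculation can trip over February 29th, here is a minimal Python sketch. It is only an illustration of the general class of bug. The reports above do not describe Microsoft's actual code, so the function names and the "add one year" scenario are my own assumptions.

```python
import calendar
from datetime import date


def naive_one_year_later(start):
    # Hypothetical buggy helper: bump the year and keep month and day unchanged.
    # For Feb 29 this asks for a date that does not exist in the following year.
    return start.replace(year=start.year + 1)


def safe_one_year_later(start):
    # Clamp the day to the last valid day of the same month in the target year.
    year = start.year + 1
    last_day = calendar.monthrange(year, start.month)[1]
    return date(year, start.month, min(start.day, last_day))


def check_leap_day():
    # The kind of test case that is easy to forget: run the calculation on Feb 29.
    leap_day = date(2012, 2, 29)
    try:
        naive_one_year_later(leap_day)
        naive_failed = False
    except ValueError:
        # date.replace() rejects Feb 29 in a non-leap year.
        naive_failed = True
    return naive_failed, safe_one_year_later(leap_day)
```

The naive version works on every other day of the year, which is why a bug like this can sit unnoticed until the first Leap Day arrives. A test that feeds February 29th (and the days on either side of it) into any date arithmetic is cheap insurance.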
See also:
http://www.eweek.com/c/a/Enterprise-Applications/Microsoft-Windows-Azure-Downtime-Blamed-on-Leap-Year-Bug-707169/
http://www.eweek.com/c/a/Cloud-Computing/Microsoft-Azure-LeapYear-Glitch-Key-Lessons-Learned-280164/
http://www.computerworld.com/s/article/9224792/Microsoft_Azure_stabilizes_after_leap_year_glitch
http://www.tgdaily.com/software-brief/61798-azure-leap-day-bug-causes-chaos
http://www.theregister.co.uk/2012/02/29/windows_azure_outage/
http://www.informationweek.com/news/cloud-computing/infrastructure/232601812
http://www.zdnet.com/news/windows-azure-suffers-worldwide-outage/6348160
http://www.pcworld.com/businesscenter/article/251043/microsofts_azure_cloud_suffers_serious_outage.html
This article originally appeared in my blog: All Things Quality
My name is Joe Strazzere and I'm currently a Director of Quality Assurance. I like to lead, to test, and occasionally to write about leading and testing. Find me at http://AllThingsQuality.com/.