CIO update: Post-mortem on the Skype outage
As a follow-up to last week’s outage, here is a detailed explanation of what transpired, the root cause, and plans to mitigate this from happening again in the future. For starters, it helps to understand that Skype is based on a peer-to-peer (P2P) network, which is explained here. Last week, the P2P network became unstable and suffered a critical failure. The failure lasted approximately 24 hours from December 22, 0800 PST/1600 GMT to December 23, 0800 PST/1600 GMT.
What was the cause for the failure?
On Wednesday, December 22, a cluster of support servers responsible for offline instant messaging became overloaded. As a result of this overload, some Skype clients received delayed responses from the overloaded servers. In a version of the Skype for Windows client (version 5.0.0152), the delayed responses from the overloaded servers were not properly processed, causing Windows clients running the affected version to crash.
Users running either the latest Skype for Windows (version 220.127.116.11), older versions of Skype for Windows (4.0 versions), Skype for Mac, Skype for iPhone, Skype on your TV, and Skype Connect or Skype Manager for enterprises were not affected by this initial problem.
However, around 50% of all Skype users globally were running the 18.104.22.168 version of Skype for Windows, and the crashes caused approximately 40% of those clients to fail. These clients included 25–30% of the publicly available supernodes, also failed as a result of this problem.
If approximately 20% of total Skype clients failed, why was there a much bigger disruption to Skype functionality?
Although Skype staff responded quickly to disable the overloaded servers and to eliminate client requests to them, a significant number of supernodes had already failed. A supernode is important to the P2P network because it takes on additional responsibilities compared to regular nodes, acting like a directory, supporting other Skype clients, helping to establish connections between them and creating local clusters typically of several hundred peer nodes per each supernode.
Once a supernode has failed, even when restarted, it takes some time to become available as a resource to the P2P network again. As a result, the P2P network was left with 25–30% fewer supernodes than normal. This caused a disproportionate load on the remaining available supernodes.
Why weren’t the other supernodes available to help?
The failure of 25–30% of supernodes in the P2P network resulted in an increased load on the remaining supernodes. While we expect this kind of increase in the instance of a failure, a significant proportion of users were also restarting crashed Windows clients at this time. This massively increased the load as they reconnected to the peer-to-peer cloud. The initial crashes happened just before our usual daily peak-hour (1000 PST/1800 GMT), and very shortly after the initial crash, which resulted in traffic to the supernodes that was about 100 times what would normally be expected at that time of day.
Supernodes have a built in mechanism to protect themselves and to avoid adverse impact on the systems hosting them when operational parameters do not fall into expected ranges. We believe that increased load in supernode traffic led to some of these parameters exceeding normal limits, and as a result, more supernodes started to shut down. This further increased the load on remaining supernodes and caused a positive feedback loop, which led to the near complete failures that occurred a few hours after the triggering event.
Regrettably, as a result of the confluence of events – server overload, a bug in Skype for Windows clients (version 22.214.171.124), and the decline in available supernodes – Skype’s functionality became unavailable to many of our users for approximately 24 hours.
How did Skype help support supernode recovery?
In order to restore Skype functionality, the Skype engineering and operations team introduced hundreds of instances of the Skype software into the P2P network to act as dedicated supernodes, which we nick-named “mega-supernodes,” to provide enough temporary supernode capacity to accelerate the recovery of the peer-to-peer cloud.
By late Wednesday night (PST) it was evident that only a proportion (about 15-20%) of Skype users connections were ‘healing’ and the volume of load on the supernodes continued to be unusually high. In response, our team introduced several thousand more mega-supernodes through the night. During Wednesday night, full recovery of the P2P network was underway and the majority of users were able to connect to the P2P network normally by early morning (California-PST) on December 23rd.
As we reported during the incident, in order to recover the core Skype functionality as quickly as possible, we utilized resources normally used to support Group Video Calling, to deploy supernodes, and over the course of Thursday night and Friday morning we returned these to their normal use and restored Group Video Calling functionality in time for Christmas.
The supernodes stabilized overnight on Thursday and by Friday, several tens of thousands of supernodes were supporting the P2P network. During Friday, we withdrew a significant proportion of the mega-supernodes from service, leaving some in operation to ensure stability of the P2P network over Christmas and New Year.
What is Skype doing to prevent this from happening again?
We understand how important the reliability, security and quality of our software is to Skype users around the world, and we work hard to maintain high standards, as well as develop new features and products.
First, we will continue to examine our software for potential issues, and provide ‘hotfixes’ where appropriate, for download or automatic delivery to our users. Since a bug was identified in Skype for Windows (version 126.96.36.199), we had provided a fix to v5.0 of our Windows software prior to the incident, and we will provide further updates for download this week. We will also be reviewing our processes for providing ‘automatic’ updates to our users so that we can help keep everyone on the latest Skype software. We believe these measures will reduce the possibility of this type of failure occurring again.
Second, we are learning the lessons we can from this incident and reviewing our processes and procedures, looking in particular for ways in which we can detect problems more quickly to potentially avoid such outages altogether, and ways to recover the system more rapidly after a failure.
Third, while our Windows v5 software release was subject to extensive internal testing and months of Beta testing with hundreds of thousands of users, we will be reviewing our testing processes to determine better ways of detecting and avoiding bugs which could affect the system.
Finally, as we continue to grow, we will keep under constant review the capacity of our core systems that support the Skype user base, and continue to invest in both capacity and resilience of these systems. An investment program we initiated a year ago has significantly increased our capacity already and more investment is planned for 2011 both to support the ongoing roll out of our paid and enterprise products, and to continue to support the growth of our core Skype software that we know millions of users rely on every day.
We are truly grateful to all of our users and humbled by your continued support. We know how much you rely on Skype, and we know that we fell short in both fulfilling your expectations and communicating with you during this incident. Lessons will be learned and we will use this as an opportunity to identify and introduce areas of improvement to our software, further assess and invest in capacity and stability, and develop better processes for outage recovery and communications to our user base. Thank you to everyone.