At approximately 3:55 pm Eastern time, hypervisor 3 (one of our six hypervisor servers) on our new cloud platform crashed hard. This caused our cloud control panel (OnApp) to automatically initiate a migration sequence for the virtual machines on hypervisor 3. The migration ran as expected, but it exposed two issues.
The first issue lies in the core design of OnApp, which automatically migrates all virtual machines on a failed hypervisor to a single other hypervisor (instead of spreading them out among different hypervisors). Because a large number of virtual machines were being migrated at the same time, the receiving hypervisor (hypervisor 4) experienced high load and less than optimal performance.
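To illustrate the difference between the two placement behaviours, here is a minimal sketch. It is not OnApp code and does not use any OnApp API; the function names, the VM sizes, and the assumption that every surviving hypervisor starts with equal load are all hypothetical.

```python
import heapq


def place_all_on_one(vms, targets):
    """Mimic the behaviour described above: every VM lands on one target."""
    return {targets[0]: list(vms)}


def spread_least_loaded(vms, targets):
    """Assign each (name, size) VM to the currently least-loaded target.

    Assumption: all targets start with zero load, which a real scheduler
    would replace with each hypervisor's actual utilisation.
    """
    heap = [(0, name) for name in targets]
    heapq.heapify(heap)
    placement = {name: [] for name in targets}
    # Place the largest VMs first so the final loads come out more even.
    for vm_name, size in sorted(vms, key=lambda vm: -vm[1]):
        load, target = heapq.heappop(heap)
        placement[target].append((vm_name, size))
        heapq.heappush(heap, (load + size, target))
    return placement


def peak_load(placement):
    """Return the heaviest total size carried by any single target."""
    return max(sum(size for _, size in vms) for vms in placement.values())
```

With ten equally sized virtual machines and five surviving hypervisors, the single-target approach puts all ten on one server, while the spread approach puts two on each.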
The second issue is that the OnApp task queue, which automatically manages virtual machine migrations, reboots, and similar activities, was set to allow only three actions to run at once. This slowed down the migration and subsequent restart of virtual machines on hypervisor 4. We expedited the task queue by removing non-essential tasks and focusing resources on getting all virtual machines activated again.
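A rough way to see why a limit of three concurrent actions slows recovery is to count how many "waves" of tasks are needed. The sketch below is a simplification that assumes every task takes the same amount of time; the function name and the figures in the usage note are illustrative and are not measurements from this incident.

```python
import math


def estimated_minutes(task_count, concurrency_limit, minutes_per_task):
    """Estimate total queue time when tasks run in fixed-size waves.

    Assumption: every task takes the same time, so the queue drains in
    ceil(task_count / concurrency_limit) waves.
    """
    waves = math.ceil(task_count / concurrency_limit)
    return waves * minutes_per_task
```

For a hypothetical 30 queued tasks at 2 minutes each, a limit of 3 needs 10 waves (20 minutes), while a limit of 10 needs 3 waves (6 minutes).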
At this time, all virtual machines are working fine. We are moving virtual machines to other hypervisors to reduce the load on hypervisor 4, but these moves will be completely transparent to customers and will not affect individual or collective virtual machine performance.
Why It Happened:
At this time, we are not sure why hypervisor 3 crashed. All of the servers for our new cloud platform are brand new and have been tested extensively, but those tests cannot fully replicate a production environment.
The other issues we mentioned are a result of the design of the OnApp control panel.
What We Are Doing (and Have Done):
We will be un-racking the failed hypervisor tonight to identify what went wrong. Once we have identified the issues, we will of course correct them and redeploy hypervisor 3. Now that the automatic migrations have taken place, hypervisor 3 is no longer essential to the cloud platform's performance. This allows us to take the time to go carefully over the hardware and the software logs to determine the cause and ensure the same problem does not occur elsewhere.
We will be working with the OnApp development team to address the first issue of OnApp moving all virtual machines to one hypervisor in the event of a failure. OnApp is very responsive and has a fast-moving development organization, and we expect to have more details on this soon.
We have already addressed the issue of the task queue falling behind and causing delays by modifying a configuration setting. This setting was not available in OnApp's control panel and was encoded, so we worked with OnApp's developers to make the change. The change is effective immediately.
We are also working on monitoring systems that will automatically and proactively migrate accounts if the system believes that a hypervisor is about to fail. This will allow the hypervisor to fail gracefully and avoid downtime. We can then investigate why the particular hypervisor failed, address the issues, and redeploy it transparently.
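As a rough sketch of how such a proactive trigger could work, the function below flags a hypervisor for evacuation once a health metric stays above a threshold for several samples in a row. The metric, the threshold, and the three-sample rule are all assumptions made for illustration; they do not describe the monitoring system we are building.

```python
def should_evacuate(samples, threshold, consecutive=3):
    """Return True once `consecutive` samples in a row exceed `threshold`.

    Requiring a streak (an assumed rule) avoids migrating virtual machines
    because of a single noisy reading.
    """
    streak = 0
    for value in samples:
        # Extend the streak on a bad reading, reset it on a good one.
        streak = streak + 1 if value > threshold else 0
        if streak >= consecutive:
            return True
    return False
```

A single spike resets to zero on the next healthy reading, so only a sustained problem would start a migration.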
The storage failure with our old cloud platform obviously required us to modify our migration schedule. Our initial plan was to slowly migrate virtual machines onto our new cloud, monitor performance, and address issues if and when they came up. This outage has shown us a number of ways in which we can ensure greater reliability going forward, and we have already taken steps to work on those.
The good news is that all systems worked as they were designed to and we can now improve that performance going forward.
We apologize for the downtime and thank you for your patience and understanding. As always, if you have any questions, please do not hesitate to contact us.
Temporary Outage Incident Report :: New Cloud Platform :: 1/22/12