High Availability and Cloudy Problems

written by John Hugg on September 21, 2010

VoltDB, like many distributed systems, achieves high availability through redundant processing nodes. VoltDB calls this K-Safety. Essentially, the distributed system can answer any request on at least K+1 servers, so it can tolerate at least K hardware failures. The operator chooses the value of K that best balances robustness to failure against cost. Other systems use the term “replica set” to describe similar functionality.
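To make the K+1 arithmetic concrete, here is a minimal Python sketch. It is not VoltDB’s implementation: the function names and the round-robin placement are assumptions made up for illustration. It keeps K+1 copies of each partition and checks, by brute force, whether every partition still has a live copy after a given number of node failures.

```python
from itertools import combinations


def place_replicas(num_nodes, num_partitions, k):
    """Assign each partition to k + 1 distinct nodes (hypothetical round-robin placement)."""
    if k + 1 > num_nodes:
        raise ValueError("need at least k + 1 nodes to hold k + 1 copies")
    return {
        p: {(p + i) % num_nodes for i in range(k + 1)}
        for p in range(num_partitions)
    }


def is_available(placement, failed_nodes):
    """Every request can be answered if each partition has a surviving copy."""
    failed = set(failed_nodes)
    return all(replicas - failed for replicas in placement.values())


def survives_any_failures(placement, num_nodes, count):
    """Check every possible set of `count` simultaneously failed nodes."""
    return all(
        is_available(placement, failed)
        for failed in combinations(range(num_nodes), count)
    )


# 4 nodes, K = 1: each partition lives on 2 nodes.
placement = place_replicas(num_nodes=4, num_partitions=4, k=1)
print(survives_any_failures(placement, 4, 1))  # any single failure is tolerated
print(survives_any_failures(placement, 4, 2))  # some pairs of failures are not
```

With K = 1 the sketch survives any one failure, but there are pairs of nodes whose loss removes both copies of a partition. That is the sense in which K is the guaranteed minimum.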

Let’s talk about EC2-style clouds that provide you with a virtualized server at an hourly cost. Imagine you want to deploy a VoltDB instance of 3 nodes with K = 2, i.e. all data is replicated to all nodes so that you can tolerate 2 failures before the system becomes unavailable. Since server failures are not terribly frequent, that’s a pretty robust system. You’ve decided to use EC2’s “Large” instances. You provision them from the EC2 console, install VoltDB and start them running.

The Problem

Amazon, like most other public clouds, makes no assurance that your three instances aren’t running on the same honkin’ server. It’s possible that a single failure in that server could bring your whole cluster down. Amazon doesn’t offer a direct way to provision multiple instances that are guaranteed to be on separate physical hardware.
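The gap between the K you configured and the failures you can really survive can be sketched in a few lines of Python. This is an illustration under stated assumptions, not anything VoltDB or EC2 provides: the `vm_to_host` mapping is hypothetical, and it is exactly the information a public cloud typically does not expose. The function finds the smallest number of physical host failures that wipes out every copy of some partition.

```python
from itertools import combinations


def min_host_failures_to_outage(placement, vm_to_host):
    """Smallest number of physical host failures that removes every copy of some partition.

    placement:  partition -> set of VM names holding a copy
    vm_to_host: VM name -> physical host id (hypothetical; clouds usually hide this)
    """
    hosts = sorted(set(vm_to_host.values()))
    for count in range(1, len(hosts) + 1):
        for failed_hosts in combinations(hosts, count):
            dead_vms = {vm for vm, host in vm_to_host.items() if host in failed_hosts}
            # A partition is lost when all of its replicas sit on dead VMs.
            if any(replicas <= dead_vms for replicas in placement.values()):
                return count
    return None  # only reached if there are no partitions to lose


# 3 VMs with K = 2: every partition is on every VM.
placement = {p: {"vm-a", "vm-b", "vm-c"} for p in range(4)}

separate = {"vm-a": "host-1", "vm-b": "host-2", "vm-c": "host-3"}
colocated = {"vm-a": "host-1", "vm-b": "host-1", "vm-c": "host-1"}

# Tolerable host failures = failures needed for an outage, minus one.
print(min_host_failures_to_outage(placement, separate) - 1)   # 2
print(min_host_failures_to_outage(placement, colocated) - 1)  # 0
```

With one VM per host, the cluster tolerates the 2 failures that K = 2 promises. With all three VMs on one host it tolerates none, even though the database configuration is identical.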

Current Options

First, Amazon offers “Availability Zones”. If you provision servers across availability zones, then they will be on different hardware. This has the added benefit of protecting you from entire datacenter failures, but it has some huge downsides.
1. You’re charged for bandwidth between zones. In a system like VoltDB, that can get expensive.
2. Latency between zones is higher and more variable. In a system like VoltDB, that can affect performance.
3. You still can only be as redundant as the number of zones you’re using.

Second, you can use “Cluster Compute” instances on EC2. Amazon says they provision one instance per server, which is ideal. They also give you 10GigE and decent specs, which is great. The downside is they cost much more than the “Large” instances you wanted. The price per gigabyte of memory is also higher than that of the high-memory instances, which are a better match for VoltDB.

If you’re using a different public cloud, then I’m not even sure what your options are. I don’t think Rackspace gives you any more information than Amazon, and it’s not clear to me if they offer a 1 VM to 1 Server option. [EDIT: Rackspace offers slightly more info than Amazon. See comment below.]

If you’re on a private cloud, you may have a lot more control over the provisioning process. I’m not sure VMware will let you automatically provision several VMs without putting two VMs on the same server, but I think you can usually do this manually. Just be careful with VMotion, which can move VMs onto the same server after you’ve placed them.

VoltDB’s Position

For now, if you want to use a public cloud and you care about availability, use a cloud that offers a 1 VM to 1 Server guarantee.

We’re working on a better answer for the future, but the lack of hardware visibility and control in public cloud infrastructures is a real problem for high availability applications like VoltDB. Until cloud providers themselves offer a solution, users will have to make compromises – either by accepting the risks of undesired co-location of redundant resources or by paying the additional costs/latency of cloud infrastructure work-arounds.

The real solution is for public clouds to allow HA applications like VoltDB to detect when two VMs are co-located, and to allow explicit provisioning of N VMs on N servers. We don’t expect this to happen soon.

It seems like many users of the cloud and some vendors touting cloudy products are unaware of this problem or pretend it doesn’t exist. Do they know something we don’t? Does anyone have a better alternative?

John Hugg
Software Engineer
VoltDB