Clarifications on the CAP Theorem and Data-Related Errors

written by Mike Stonebraker on October 21, 2010

There has been another round of online conversations about the CAP theorem as the internet community continues to discuss its implications for networked databases.  Coda Hale recently wrote a well-received article titled “You Can’t Sacrifice Partition Tolerance”, acknowledged as “pretty good” by Eric Brewer.  Coda refers extensively to the CAP paper by Gilbert and Lynch.

Scattered in the larger conversation is a continued misperception of my position regarding the CAP theorem.  Coda writes “Michael Stonebraker’s assertion aside, partitions (read: failures) do happen.”  Others have made similar comments, so let me set the record straight.

I have consistently and repeatedly attempted to make just four points, which I elaborate in this post.  The most important point is that using the CAP theorem to justify giving up ACID (consistency) is flawed.  In the real world, giving up consistency does not improve availability.  Hence, you are giving up consistency in exchange for nothing.  This is a horrible engineering tradeoff, and the CAP theorem, therefore, encourages engineers to make awful decisions.

Point 1: The CAP theorem contains an idealized and incomplete model of data-related errors.

I have two main issues with the CAP theorem formulation of errors:

  1. The model does not deal with several important classes of errors that a real-world system administrator must cope with.
  2. The model suggests erroneous real-world behavior for at least one important failure mode.

As such, looking to the CAP theorem for real world guidance is highly suspect.  Let me explain.  The following important sources of outages are not considered in the CAP theorem.

Bohrbugs.  These are repeatable DBMS errors that cause the DBMS to crash.  In other words, even when multiple database replicas are available, the same transaction issued to the replicas will cause all of them to crash.  No matter what, the world stops, and high availability is an impossible goal.

Application errors.  The application inadvertently updates (all copies of) the database.  The database is now corrupted, and any sane DBA will stop the world and put the database back into a consistent state.  Again, high availability is impossible to achieve.

Human error.  A human types the database equivalent of rm * and causes a global outage.  There is no possibility of continuing operation.

Reprovisioning.  Many current DBMSs (but not all) require down time to reprovision the hardware to provide more (or less) capacity.  Again, this is typically a “stop the world” operation.

These are examples of unmaskable outages.  They will cause any distributed system to be unavailable.  In CAP theorem terms, you simply cannot have availability when issues like the above are present.  Discussion about supporting any two of Consistency, Availability and Partition tolerance is irrelevant.

Now let us turn to single node failures.  These are considered network partitions by Coda.  In Coda’s view, the dead node is in one partition and the remaining N-1 nodes are in the other one.  The guidance from the CAP theorem is that you must choose either A or C when a network partition is present.  As is obvious in the real world, it is possible to achieve both C and A in this failure mode.  You simply fail over to a replica in a transactionally consistent way.  Notably, at least Tandem and Vertica have been doing exactly this for years.  Therefore, considering a node failure as a partition results in an obviously inappropriate CAP theorem conclusion.
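The idea of transactionally consistent failover can be sketched in a few lines.  This is a minimal illustration only, not Tandem’s or Vertica’s actual mechanism: it assumes synchronous replication, so a write is acknowledged only after both copies have applied it, which is what lets the backup serve consistent reads the instant the primary dies.

```python
# Minimal sketch of transactionally consistent failover (hypothetical,
# not any real product's implementation). Writes commit on BOTH copies
# before acknowledgment, so the backup is always consistent.

class Replica:
    def __init__(self):
        self.data = {}
        self.alive = True

    def apply(self, key, value):
        self.data[key] = value

class ReplicatedStore:
    def __init__(self):
        self.primary = Replica()
        self.backup = Replica()

    def write(self, key, value):
        # Synchronous replication: apply on both copies before acking.
        self.primary.apply(key, value)
        self.backup.apply(key, value)
        return "committed"

    def read(self, key):
        # Fail over to the backup if the primary is down; because
        # replication is synchronous, no committed data is lost.
        node = self.primary if self.primary.alive else self.backup
        return node.data.get(key)

store = ReplicatedStore()
store.write("balance", 100)
store.primary.alive = False        # single-node failure
print(store.read("balance"))       # still serves consistent data: 100
```

The price of this consistency is write latency (every commit waits on the replica), which is exactly the kind of engineering tradeoff discussed below.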

A similar statement can be made about network partitions in which a single node is split off from the remaining N-1 nodes by a network failure.

To summarize my first point, the CAP theorem assumes reliable applications, reliable DBMSs, reliable humans and no change in resources.  Unfortunately, these are all issues that must be considered.  In addition, modeling node failures as network partitions leads to an obviously bad conclusion.

Point 2: Engineering tradeoffs abound in distributed systems.

When all the errors that can occur are considered, the real question to deal with is “What errors can I mask and at what cost?”  This is an engineering tradeoff of run-time cost versus probability of outage.  Coda starts to get at this issue at the end of his post.  To me the answer to this question crucially depends on the frequency of the various kinds of errors.  Unfortunately, this is very sensitive to the actual operating environment.  For example, IBM’s MVS is considerably more reliable than Linux or Windows.  It also depends on how hard it is to mask errors.  Specifically, Byzantine failures are far more difficult to deal with than “stop cleanly” errors.
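The tradeoff of run-time cost versus outage probability can be made concrete with a toy expected-cost comparison.  All the numbers here are hypothetical, chosen purely to illustrate the shape of the decision: masking is worth it when failures are frequent or masking is cheap, and not otherwise.

```python
# Toy expected-cost model (all numbers hypothetical): either pay a
# fixed overhead to mask a failure class, or accept its expected
# outage cost over the same period.
def expected_cost(masking_overhead, outage_prob, outage_cost, mask):
    # If we mask, we always pay the overhead; if not, we pay the
    # outage cost weighted by its probability.
    return masking_overhead if mask else outage_prob * outage_cost

# Masking a cheap-to-mask, frequent "stop cleanly" failure pays off...
print(expected_cost(1.0, 0.10, 50.0, mask=True))    # 1.0
print(expected_cost(1.0, 0.10, 50.0, mask=False))   # 5.0

# ...but expensive machinery for a rare failure class may not.
print(expected_cost(8.0, 0.01, 50.0, mask=True))    # 8.0
print(expected_cost(8.0, 0.01, 50.0, mask=False))   # 0.5
```

The interesting inputs (failure frequency and masking difficulty) are exactly the ones that vary with the operating environment, which is why no hard and fast statement is possible.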

Thus, architecting a distributed system is a complex set of engineering tradeoffs, often not amenable to hard and fast statements.

Point 3: The CAP theorem is often used inappropriately to justify engineering decisions.

This is my most important complaint.  Many NoSQL systems give up on transactions (ACID), using the CAP theorem as a justification.  Specifically, they argue that partitions happen, concluding that you need to sacrifice either availability or consistency.  They choose to sacrifice consistency.  I believe this is a very poor engineering tradeoff.

The reason is very simple. In my experience, network partitions do not happen often.  Specifically, they occur less frequently than the sum of bohrbugs, application errors, human errors and reprovisioning events.  So it doesn’t much matter what you do when confronted with network partitions.  Surviving them will not “move the needle” on availability because higher frequency events will cause global outages.  Hence, you are giving up something (consistency) and getting nothing in return.  This point is further explored in my “urban myths” presentation [4].
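A back-of-envelope calculation makes the point.  The downtime figures below are entirely hypothetical, chosen only to illustrate the structure of the argument: when the unmaskable event classes together cost more downtime than partitions do, masking partitions barely moves overall availability.

```python
# Back-of-envelope availability arithmetic. All downtime numbers are
# hypothetical and exist only to illustrate the argument.
unmaskable_downtime = {          # expected hours of global outage / year
    "bohrbugs": 2.0,
    "application errors": 4.0,
    "human errors": 3.0,
    "reprovisioning": 3.0,
}
partition_downtime = 0.5         # assumed hours / year lost to partitions

hours_per_year = 24 * 365
base = sum(unmaskable_downtime.values())        # 12.0 hours / year

# Availability when partitions also cause outages vs. fully masked.
avail_with_partition_outages = 1 - (base + partition_downtime) / hours_per_year
avail_partitions_masked = 1 - base / hours_per_year

print(f"{avail_with_partition_outages:.3%}")    # ≈ 99.857%
print(f"{avail_partitions_masked:.3%}")         # ≈ 99.863%
```

Under these assumed rates, perfectly surviving every partition improves availability by well under a hundredth of a percent; the needle does not move.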

In effect, this reinforces the undesirability of making engineering decisions based on the CAP theorem.

Point 4:  Node speed fundamentally alters the engineering discussion.

Some people argue that next generation systems are going to run on thousands of nodes.  In theory, the probability of failures of all kinds will increase.  The industry’s fascination with technologies like MapReduce, a fine solution for massively parallel batch operations, seems to fuel the notion that all “big data” solutions must necessarily be deployed on large networks of unreliable servers.  For transactional systems this is a poor engineering premise.

Next generation DBMS technologies, such as VoltDB, have been shown to run around 50X the speed of conventional SQL engines.  Thus, if you need 200 nodes to support a specific SQL application, then VoltDB can probably do the same application on 4 nodes.  The probability of a failure on 200 nodes is wildly different than the probability of failure on four nodes.
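The difference is easy to quantify under a simple independence assumption.  The per-node failure probability below is illustrative, not a measured number, but the shape of the curve is what matters: failure probability grows quickly with node count.

```python
# Probability that at least one node fails during some window, assuming
# independent failures with per-node failure probability p. The value
# of p is illustrative, not a measured figure.
def prob_any_failure(n_nodes, p_node=0.01):
    return 1 - (1 - p_node) ** n_nodes

print(prob_any_failure(4))     # ≈ 0.039 -- small cluster
print(prob_any_failure(200))   # ≈ 0.866 -- a failure somewhere is near-certain
```

With these illustrative numbers, the 200-node deployment is dealing with failures more than twenty times as often as the 4-node one, which is the sense in which the probabilities are "wildly different."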

The point here is that the speed of local execution definitely matters, and “brute force” on large numbers of nodes is probably a generally bad idea, because it makes failures way more frequent, and therefore more challenging to deal with.

In summary, appealing to the CAP theorem exclusively for engineering guidance is, in my opinion, inappropriate.  CAP contains an incomplete error model that can easily skew one’s thinking about what is ultimately a complex collection of engineering tradeoffs.  These tradeoffs must also fundamentally consider the speed of your local software and the hardware, the software environment in which you are running and the business criticality of data consistency.

Mike Stonebraker
Founder
VoltDB