Intro to WAN Replication in VoltDB

written by John Piekos on November 14, 2011

This is the first in a series of posts about WAN replication, a VoltDB feature that’s currently under development.  It provides a high level overview of our direction.  More detailed posts will follow as we progress through our engineering iterations.

As an in-memory transactional DBMS, VoltDB delivers breakthrough performance and scale.  Two of the product’s most significant innovations, however, are in database availability and durability:

  • Availability – VoltDB delivers high availability (HA) via a synchronous multi-master feature called K-safety.  When a database is running “K-safe”, multiple (two or more) node replicas are kept transactionally consistent.  If one of those nodes fails, the other(s) continue to provide uninterrupted service.
  • Durability – VoltDB delivers local cluster durability via a combination of continuous database snapshots and command logs.  Importantly, command logging can be configured to write synchronously (full crash recovery) or asynchronously (possible transaction loss) to allow devops teams to balance their durability and latency objectives.
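Both features are controlled through the cluster's deployment file. As a rough illustration (the values below are assumptions about a typical setup, not a recommendation):

```xml
<!-- deployment.xml sketch (illustrative values) -->
<deployment>
    <!-- kfactor="1" keeps two synchronized copies of every partition,
         so the cluster survives the loss of any one node -->
    <cluster hostcount="3" sitesperhost="4" kfactor="1" />
    <!-- synchronous="true" trades latency for full crash recovery;
         "false" favors latency and accepts possible transaction loss -->
    <commandlog enabled="true" synchronous="true" />
</deployment>
```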

In upcoming releases of VoltDB, we will continue down the path of non-stop computing with asynchronous WAN replication.  Where local synchronous replication (K-safety) provides HA, WAN replication provides disaster recovery (DR).  In a DR scenario, the primary cluster goes completely offline due to a catastrophic event such as a major power outage.

In our WAN replication design, one VoltDB cluster will be replicated to a different VoltDB cluster somewhere else in the network.  WAN replication will be one-way, from source (master, or primary) cluster to target (slave, or secondary) cluster.  The core concepts of VoltDB’s WAN replication functionality are described below.

Configuration

When a VoltDB cluster is started, it must be configured as either a Master (the default) or a Slave, which automatically makes the cluster read-only.  The Slave cluster should be able to handle the same transaction volume as the Master, though it is not required that the two clusters be the same size, nor that the Slave use the same K-safety setting as the Master.
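Since this feature is still under development, the configuration syntax is not final.  Purely as an illustration, a slave role might be declared in the deployment file along these lines (the `replication` element and its attributes are hypothetical placeholders, not shipped VoltDB syntax):

```xml
<!-- Hypothetical sketch only; actual syntax to be determined -->
<deployment>
    <cluster hostcount="2" sitesperhost="4" kfactor="0" />
    <!-- role="slave" would start this cluster read-only, applying
         transactions replicated from the named master cluster -->
    <replication role="slave" master="master-cluster.example.com" />
</deployment>
```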

Starting Replication

Enabling replication (specifically, adding a Slave to a Master cluster) can occur at any time.  If the Slave cluster has no data at start-up, VoltDB will move an entire snapshot of the data to the Slave cluster.  If the Slave cluster contains data when replication starts, VoltDB will determine the difference in state between the Master and the Slave.  If that difference is large, VoltDB will move entire snapshots to the target rather than command-log deltas.
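The snapshot-versus-delta decision described above can be sketched as a simple rule.  This is a toy model, not VoltDB internals; the method names and the threshold are invented for illustration:

```java
// Toy model of the catch-up strategy a master might choose when a
// slave joins. All names and the threshold are illustrative assumptions.
public class CatchUpPlanner {
    enum Strategy { FULL_SNAPSHOT, COMMAND_LOG_DELTAS }

    // If the slave is empty, or has diverged too far, ship a full
    // snapshot; otherwise replay only the missing command-log entries.
    static Strategy choose(boolean slaveEmpty, long missingLogEntries,
                           long deltaThreshold) {
        if (slaveEmpty || missingLogEntries > deltaThreshold) {
            return Strategy.FULL_SNAPSHOT;
        }
        return Strategy.COMMAND_LOG_DELTAS;
    }

    public static void main(String[] args) {
        System.out.println(choose(true, 0, 10_000));       // empty slave: snapshot
        System.out.println(choose(false, 500, 10_000));    // small gap: deltas
        System.out.println(choose(false, 50_000, 10_000)); // large gap: snapshot
    }
}
```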

Logical Transition from Master to Slave (when the Master does not crash)

To convert the Slave to a Master, perform the following steps:

  1. Quiesce the Master via the @Quiesce system procedure.
  2. Disable Replication and optionally shut down the Master cluster.
  3. Convert the Slave to a Master via the new @SetWriteable system procedure.

This transition can be performed via scripts or interactively via the VoltDB Enterprise Manager.

Transition from Master to Slave (when the Master DOES crash)

To convert the Slave to a Master, perform the following steps:

  1. Disable Replication.
  2. Convert the Slave to a Master via the new @SetWriteable system procedure.

This transition can be performed via scripts or interactively via the VoltDB Enterprise Manager.
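The two procedures above differ only in whether the Master is still available to be quiesced.  As a sketch, they can be modeled with a stub admin interface standing in for whatever channel (script or Enterprise Manager) drives the steps; the `ClusterAdmin` interface is invented for illustration, and @SetWriteable is the new system procedure proposed in this post:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the two promotion sequences. ClusterAdmin is an illustrative
// stand-in for the admin channel that issues each step.
public class Failover {
    interface ClusterAdmin {
        void callProcedure(String name); // e.g. "@Quiesce", "@SetWriteable"
        void disableReplication();
    }

    // Returns the ordered list of steps taken, for illustration.
    static List<String> promoteSlave(ClusterAdmin master, ClusterAdmin slave,
                                     boolean masterAlive) {
        List<String> steps = new ArrayList<>();
        if (masterAlive) {
            // Planned transition only: drain in-flight work on the master.
            master.callProcedure("@Quiesce");
            steps.add("@Quiesce on master");
        }
        slave.disableReplication();
        steps.add("disable replication");
        // Proposed new system procedure: make the read-only slave writable.
        slave.callProcedure("@SetWriteable");
        steps.add("@SetWriteable on slave");
        return steps;
    }

    public static void main(String[] args) {
        ClusterAdmin noop = new ClusterAdmin() {
            public void callProcedure(String name) {}
            public void disableReplication() {}
        };
        System.out.println(promoteSlave(noop, noop, true));  // planned transition
        System.out.println(promoteSlave(noop, noop, false)); // master crashed
    }
}
```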

In an upcoming blog post, I’ll describe how WAN replication will function in the context of different failure scenarios.