Intro to VoltDB Command Logging

written by Ariel Weisberg on July 26, 2011

VoltDB’s new command logging feature reduces the window of data loss during a cluster-wide failure from single-digit minutes (the window between snapshots) to zero. Command logging can be tuned to give you the same amazing latency and throughput you get from VoltDB, even on a single 7.2k SATA disk. Our unique approach to log-based durability exploits the determinism and replication inherent in VoltDB’s architecture to avoid the overhead and latency of ARIES-style logging.

How We Log

A command log is kept at every node and contains partially ordered stored procedure invocations. As stored procedure invocations arrive from the network, they are passed to the command log to be written to disk. If synchronous logging is enabled, a procedure will not be executed until the command log flags the invocation as durable. While the global ordering and replication process takes place, the command log has a several-millisecond window in which it can batch-commit multiple procedure invocations to disk without impacting latency. The size of this window is a configuration option and can be tuned to fit your hardware.
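To make the batching concrete, here is a minimal sketch of the group-commit idea, with invented names throughout (GroupCommitLog, WINDOW_MS); it is not VoltDB’s implementation, and the fixed five-millisecond window stands in for the tunable configuration option described above.

```java
// Sketch of group commit for a synchronous command log: invocations
// queue up during a short window, then the whole batch is written and
// made durable with a single fsync. Illustrative only, not VoltDB code.
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class GroupCommitLog implements AutoCloseable {
    // Hypothetical commit window; in VoltDB this is a tunable setting.
    private static final long WINDOW_MS = 5;

    private record Entry(byte[] invocation, CompletableFuture<Void> durable) {}

    private final LinkedBlockingQueue<Entry> queue = new LinkedBlockingQueue<>();
    private final FileChannel channel;
    private final Thread writer;
    private volatile boolean running = true;

    public GroupCommitLog(Path logFile) throws IOException {
        channel = FileChannel.open(logFile,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        writer = new Thread(this::writeLoop, "command-log-writer");
        writer.start();
    }

    /** Queue a serialized invocation; the future completes once it is on disk. */
    public CompletableFuture<Void> log(byte[] invocation) {
        CompletableFuture<Void> durable = new CompletableFuture<>();
        queue.add(new Entry(invocation, durable));
        return durable;
    }

    private void writeLoop() {
        List<Entry> batch = new ArrayList<>();
        while (running || !queue.isEmpty()) {
            try {
                Entry first = queue.poll(WINDOW_MS, TimeUnit.MILLISECONDS);
                if (first == null) continue;
                batch.add(first);
                // Collect everything else that arrives during the window.
                Thread.sleep(WINDOW_MS);
                queue.drainTo(batch);
                for (Entry e : batch) {
                    ByteBuffer buf = ByteBuffer.allocate(4 + e.invocation().length);
                    buf.putInt(e.invocation().length).put(e.invocation()).flip();
                    channel.write(buf);
                }
                channel.force(false); // one fsync covers the whole batch
                batch.forEach(e -> e.durable().complete(null));
                batch.clear();
            } catch (InterruptedException | IOException ex) {
                batch.forEach(e -> e.durable().completeExceptionally(ex));
                batch.clear();
            }
        }
    }

    @Override public void close() throws Exception {
        running = false;
        writer.join();
        channel.close();
    }
}
```

The key property is that one fsync amortizes across every invocation that arrived during the window, which is how synchronous durability can coexist with high throughput on a single spindle.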

If replication is enabled (VoltDB uses synchronous, multi-master replication for HA), then an invocation will be logged at each replica to provide additional durability and mitigate the need for RAID-based redundancy. The IO throughput necessary for the log is conveniently bounded by Gig-E and can usually be satisfied with a reasonably dense 7.2k SATA drive.
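As a rough sanity check on that bound: a saturated Gig-E link carries at most about 125 MB/s of invocations (1 Gbit/s ÷ 8), and real workloads send far less than that, while the sustained sequential write rate of a typical 7.2k SATA drive is on the order of 100 MB/s. A single drive can therefore keep pace with essentially anything the network can deliver.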

The command log is bounded using VoltDB’s transactional, non-blocking snapshot feature. When the cluster decides it is time to truncate the log, it initiates a snapshot. When the snapshot completes, all the transactions that executed before the snapshot transaction are automatically truncated from the log. VoltDB snapshots are extremely fast in terms of throughput and generally compact because indexes and views are not included. Thus, even on large databases, the command log is typically limited to single-digit minutes of replay.
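A small sketch of that truncation protocol, with invented interfaces (SnapshotService, CommandLog) standing in for the real cluster machinery: trigger a snapshot once the log grows past a threshold, and when the snapshot completes, drop everything ordered before its transaction.

```java
// Illustrative truncation loop; interfaces and the trigger policy are
// assumptions, not VoltDB's actual internals.
import java.util.concurrent.CompletableFuture;

public class LogTruncator {
    interface SnapshotService {
        /** Starts a non-blocking snapshot; completes with the snapshot's txn id. */
        CompletableFuture<Long> initiateSnapshot();
    }

    interface CommandLog {
        long sizeInBytes();
        /** Discards all entries with txnId < truncationPoint. */
        void truncateBefore(long truncationPoint);
    }

    // Hypothetical threshold; the real trigger policy is up to the cluster.
    private static final long MAX_LOG_BYTES = 1L << 30; // 1 GiB

    private final SnapshotService snapshots;
    private final CommandLog log;

    LogTruncator(SnapshotService snapshots, CommandLog log) {
        this.snapshots = snapshots;
        this.log = log;
    }

    /** Called periodically; truncates once a snapshot has captured the log's effects. */
    void maybeTruncate() {
        if (log.sizeInBytes() < MAX_LOG_BYTES) return;
        snapshots.initiateSnapshot()
                 // Everything ordered before the snapshot transaction is
                 // now recoverable from the snapshot itself, so it can go.
                 .thenAccept(log::truncateBefore);
    }
}
```

Because the snapshot is transactional and non-blocking, the log keeps accepting new invocations while it runs; only entries strictly before the snapshot transaction become garbage.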

Because command log and snapshot IO are sequential, there is no need for expensive disks that scale random reads and writes. The log also places very little demand on disk throughput and CPU because there is no expensive compaction step.

One advantage of command logging over ARIES-style logging is that a transaction can be logged before execution begins, instead of executing the transaction and then waiting for the log data to flush to disk. Another advantage is that the IO throughput necessary for a command log is bounded by the network used to relay commands and, in the case of Gig-E, can be satisfied by cheap commodity disks.
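The difference in ordering is easy to see side by side. The sketch below uses stand-in types rather than VoltDB or ARIES APIs; what matters is where the disk flush falls relative to execution.

```java
// Conceptual contrast of the two commit paths; all types are invented.
import java.util.concurrent.CompletableFuture;

public class CommitOrdering {
    interface Log { CompletableFuture<Void> append(byte[] bytes); }

    static byte[] execute(byte[] invocation) {
        return invocation; // stand-in: run the procedure, return its effects
    }

    /** Command logging: persist the invocation, execute while the flush is in flight. */
    static CompletableFuture<Void> commandLogCommit(Log log, byte[] invocation) {
        CompletableFuture<Void> durable = log.append(invocation);
        execute(invocation);          // execution overlaps the disk write
        return durable;               // acknowledge once the invocation is on disk
    }

    /** ARIES-style: execution must finish first, because the log records its effects. */
    static CompletableFuture<Void> writeAheadCommit(Log log, byte[] invocation) {
        byte[] redoUndo = execute(invocation);   // effects known only after execution
        return log.append(redoUndo);             // commit waits on this flush
    }
}
```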

How We Replay

At recovery time you can replace failed nodes, expand or shrink the cluster, and increase or decrease redundancy; the database will automatically re-shard your data and command log during replay.

The cluster starts recovery by scanning for available snapshots and command logs and then assembles and executes a recovery plan. If necessary, a snapshot is loaded to seed the cluster with the data needed to resume from the log. Then each node replays a subset of its command log such that every transaction is replayed exactly once, even if it was logged redundantly. The transactions are merge-sorted in memory and then rehashed and redistributed across the new cluster topology. The entire process is distributed and parallelized, and requires only a single sequential IO pass across the snapshot and log data.
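Here is a sketch of that merge, with invented types (LoggedTxn, dispatch) in place of VoltDB’s recovery planner: the per-node logs are merge-sorted by transaction id with a heap, duplicates contributed by replicas are replayed once, and each transaction is re-routed by hashing its partition key over the new partition count.

```java
// Illustrative k-way merge with dedup and re-sharding; not VoltDB code.
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

public class ReplayMerge {
    record LoggedTxn(long txnId, int partitionKey, byte[] invocation) {}

    /** Merge k sorted logs, skip duplicate txn ids, dispatch to new partitions. */
    static void replay(List<Iterator<LoggedTxn>> logs, int newPartitionCount) {
        record Head(LoggedTxn txn, Iterator<LoggedTxn> source) {}
        PriorityQueue<Head> heap = new PriorityQueue<>(
                Comparator.comparingLong((Head h) -> h.txn().txnId()));
        for (Iterator<LoggedTxn> log : logs) {
            if (log.hasNext()) heap.add(new Head(log.next(), log));
        }
        long lastReplayed = Long.MIN_VALUE;
        while (!heap.isEmpty()) {
            Head head = heap.poll();
            LoggedTxn txn = head.txn();
            if (txn.txnId() != lastReplayed) {      // replicas logged it too; run once
                lastReplayed = txn.txnId();
                int partition = Math.floorMod(txn.partitionKey(), newPartitionCount);
                dispatch(partition, txn);           // re-sharded for the new topology
            }
            if (head.source().hasNext()) {
                heap.add(new Head(head.source().next(), head.source()));
            }
        }
    }

    static void dispatch(int partition, LoggedTxn txn) {
        // Stand-in for handing the invocation to the execution site that
        // now owns this partition.
        System.out.printf("partition %d <- txn %d%n", partition, txn.txnId());
    }
}
```

Because each log is already sorted by transaction id, the heap pops ids in nondecreasing order, so duplicate detection only needs to remember the last id replayed.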

Future Work

Command logging will serve as the foundation for a host of other new durability features, including point-in-time replay, log archiving, and WAN replication via log shipping.