Key-Value Benchmarking

written by John Hugg on May 25, 2010

Edit on 6/2/10: Check out the follow-up Benchmarking FAQ blog post, with links to code.


The NoSQL movement was born of the need to scale data management with predictable cost. My subjective summary of the NoSQL credo: provide simpler core functionality, scale horizontally, leverage redundancy, and expect and handle failures. These systems became known as NoSQL because SQL and schemas were often given up in exchange for flexibility and simplicity. The name stuck even though SQL was never the primary complaint; scalability was.

VoltDB was born to solve a similar problem: provide horizontal scalability at a cost per transaction that anyone can afford. That said, the architectural approach is unlike most of the existing NoSQL solutions. When I was asked to compare the performance of VoltDB and a popular NoSQL system on some very simple workloads, I didn't realize what a vague assignment it would turn out to be.

Choosing Cassandra

Since we had to pick one system to start, we picked Cassandra. Cassandra is roughly the BigTable data model with the Dynamo consistency scheme. While there’s no definitive NoSQL system, Cassandra has a lot of community and commercial support and it seems to be more focused on write throughput than some of the other options. This makes it a slightly more direct comparison.

Apples and Oranges

Agreed.

Cassandra writes to disk; VoltDB is an in-memory database. So I gave both systems plenty of RAM to hold the data set and turned Cassandra's consistency settings down low. Some of the benchmark operations I ran involved multiple data mutations. VoltDB performed these atomically, isolated from other work; Cassandra didn't, so I just ignored that difference. Cassandra's client protocol and messaging became a bottleneck before VoltDB's did, so I gave Cassandra a private back-channel for intra-cluster messaging.

When I moved to a redundant cluster, the comparison became slightly more direct. Both systems easily let us store two copies of all data while sending requests to any node.

All of these steps made me feel like I did as fair a comparison as possible. Still, I’m a VoltDB expert, not a Cassandra expert, so I’m open to more feedback as we release more details.

Cluster Setup

We ran in both single-node and three-node configurations. Our nodes were dual-socket Xeon 5500 servers with 48 GB of memory and three 15k RPM SAS drives behind a fast controller. Each node ran CentOS Linux 5.5.

Benchmark Descriptions and Results

When we started developing VoltDB (back when it was H-Store), the first workload we built VoltDB to run was a TPC-C approximation. In that test, transactions were complex; each transaction included an average of about 26 SQL statements. For the Key-Value benchmark, we wanted something decidedly simpler. I admit these benchmarks are arbitrary, and perhaps other workloads would perform differently. The point is to show that VoltDB is fast at really complex transactions as well as simpler stuff like updating a single column value.

1. Basic Key-Value

Use a trivial schema of one 50-byte key paired with one 12-kilobyte payload. Load 500,000 pairs. Define the operation as follows:

  1. From the 500,000 loaded pairs, randomly select two keys, key A and key B.
  2. With probability 1/2, fetch the payload for key A.
  3. With probability 1/2, replace the payload for key B with a new payload.

Measure the database's total throughput, in operations per second, over a 5-minute run, and take the average of three such runs.
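We haven't released the benchmark code yet (the follow-up post linked above has pointers), so as a concrete illustration, here is a minimal sketch of how the two halves of this operation might look as single-partition VoltDB stored procedures. The table, column, and procedure names are my assumptions for this post, not the actual benchmark source; the client flips the coin and calls one procedure or the other.

```java
// KVGet.java -- illustrative sketch, not the released benchmark code.
// Assumed schema: CREATE TABLE kv (kv_key VARCHAR(50) NOT NULL,
//   payload VARCHAR(12288), PRIMARY KEY (kv_key));
// partitioned on kv_key, so both procedures run single-partition.
import org.voltdb.ProcInfo;
import org.voltdb.SQLStmt;
import org.voltdb.VoltProcedure;
import org.voltdb.VoltTable;

@ProcInfo(partitionInfo = "KV.KV_KEY: 0", singlePartition = true)
public class KVGet extends VoltProcedure {
    public final SQLStmt read =
        new SQLStmt("SELECT payload FROM kv WHERE kv_key = ?;");

    public VoltTable[] run(String key) {
        voltQueueSQL(read, key);            // fetch the payload for key A
        return voltExecuteSQL();
    }
}

// KVPut.java -- a separate source file with the same imports.
@ProcInfo(partitionInfo = "KV.KV_KEY: 0", singlePartition = true)
public class KVPut extends VoltProcedure {
    public final SQLStmt write =
        new SQLStmt("UPDATE kv SET payload = ? WHERE kv_key = ?;");

    public VoltTable[] run(String key, String payload) {
        voltQueueSQL(write, payload, key);  // replace the payload for key B
        return voltExecuteSQL();
    }
}
```

Since each procedure touches a single key and the table is partitioned on that key, both run as single-partition transactions, which is the cheap path in VoltDB.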

Cluster Configuration       VoltDB (ops/sec)   Cassandra (ops/sec)   Throughput Increase
1 Node                                17,000                 7,940                  2.2X
3 Nodes w/o Replication               19,800                17,400                  1.1X
3 Nodes w/ Replication                12,600                 4,450                  2.8X

This benchmark, more than the others, becomes wire-limited on gigabit Ethernet pretty easily. Since Cassandra clusters appear to use more network bandwidth per operation than VoltDB, Cassandra was given a private gigabit network for intra-cluster communication. I would have done the same for VoltDB, but multiple-network support is a forthcoming feature and not in VoltDB 1.0.01. I expect the VoltDB cluster numbers would improve considerably given more bandwidth.
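One client-side detail worth mentioning, since the wire is the bottleneck here: a driver that waits on each response before sending the next request will not saturate either system, so throughput tests generally keep many asynchronous calls in flight. Here's a rough, hypothetical sketch of such a loop using the VoltDB Java client (host name, key format, and procedure name follow the sketch above and are my assumptions, not our released harness):

```java
// Driver.java -- hypothetical measurement loop, not the released harness.
import java.util.Random;
import java.util.concurrent.atomic.AtomicLong;

import org.voltdb.client.Client;
import org.voltdb.client.ClientFactory;
import org.voltdb.client.ClientResponse;
import org.voltdb.client.ProcedureCallback;

public class Driver {
    public static void main(String[] args) throws Exception {
        Client client = ClientFactory.createClient();
        client.createConnection("server1"); // any node in the cluster works

        final AtomicLong completed = new AtomicLong(0);
        Random rand = new Random();

        // Fire calls asynchronously for 5 minutes; the callback tallies
        // successful completions while the network pipe stays full.
        long deadline = System.currentTimeMillis() + 5 * 60 * 1000;
        while (System.currentTimeMillis() < deadline) {
            String key = "key-" + rand.nextInt(500000);
            client.callProcedure(new ProcedureCallback() {
                @Override
                public void clientCallback(ClientResponse resp) {
                    if (resp.getStatus() == ClientResponse.SUCCESS)
                        completed.incrementAndGet();
                }
            }, "KVGet", key);
        }
        client.drain(); // wait for outstanding invocations to finish
        System.out.println("ops/sec: " + (completed.get() / 300));
    }
}
```

A full driver would mix in KVPut calls per the coin flip; the asynchronous pattern is the point here.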

2. Many Integer Columns

Expanding on the schema above, use the same 50-byte key, but pair it with fifty 32-bit integer values, identifiable by index. Load 500,000 keys, each paired with its 50 integers. Define the operation as follows:

  1. From the 500,000 loaded pairs, randomly select 1 key.
  2. From the 50 columns, randomly select two indices in the range 0-49 inclusive, index A and index B.
  3. For the selected key, read the integer value at index A.
  4. If the value at index A is odd, write a new random integer at index B (for the same key).

Again, measure the average throughput over three 5-minute runs.
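Because every VoltDB stored procedure is a transaction, the read-check-write in steps 3 and 4 maps naturally onto one atomic, single-partition procedure. Below is a scaled-down, hypothetical sketch (4 integer columns instead of 50; table and column names are my assumptions). One wrinkle: VoltDB SQL statements must be compile-time constants, so the sketch declares one SELECT and one UPDATE per column and picks between them at runtime.

```java
// CheckAndSet.java -- illustrative sketch, scaled down to 4 integer
// columns; the benchmark schema has 50 (c0..c49).
// Assumed schema: CREATE TABLE ints (ints_key VARCHAR(50) NOT NULL,
//   c0 INTEGER, c1 INTEGER, c2 INTEGER, c3 INTEGER,
//   PRIMARY KEY (ints_key)); partitioned on ints_key.
import org.voltdb.ProcInfo;
import org.voltdb.SQLStmt;
import org.voltdb.VoltProcedure;

@ProcInfo(partitionInfo = "INTS.INTS_KEY: 0", singlePartition = true)
public class CheckAndSet extends VoltProcedure {
    public final SQLStmt read0 = new SQLStmt("SELECT c0 FROM ints WHERE ints_key = ?;");
    public final SQLStmt read1 = new SQLStmt("SELECT c1 FROM ints WHERE ints_key = ?;");
    public final SQLStmt read2 = new SQLStmt("SELECT c2 FROM ints WHERE ints_key = ?;");
    public final SQLStmt read3 = new SQLStmt("SELECT c3 FROM ints WHERE ints_key = ?;");

    public final SQLStmt write0 = new SQLStmt("UPDATE ints SET c0 = ? WHERE ints_key = ?;");
    public final SQLStmt write1 = new SQLStmt("UPDATE ints SET c1 = ? WHERE ints_key = ?;");
    public final SQLStmt write2 = new SQLStmt("UPDATE ints SET c2 = ? WHERE ints_key = ?;");
    public final SQLStmt write3 = new SQLStmt("UPDATE ints SET c3 = ? WHERE ints_key = ?;");

    // Lookup tables so run() can pick a statement by column index.
    private final SQLStmt[] reads = { read0, read1, read2, read3 };
    private final SQLStmt[] writes = { write0, write1, write2, write3 };

    // The client supplies the random replacement value; procedures must
    // be deterministic because they re-execute on replicas.
    public long run(String key, int indexA, int indexB, int newValue) {
        voltQueueSQL(reads[indexA], key);
        long value = voltExecuteSQL()[0].asScalarLong(); // integer at index A
        if ((value % 2) != 0) {                          // odd: overwrite B
            voltQueueSQL(writes[indexB], newValue, key);
            voltExecuteSQL();
        }
        return value;
    }
}
```

In Cassandra, as noted above, this operation is a client-side read followed by a conditional write, with no isolation between the two; in VoltDB the whole thing commits or rolls back as a unit.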

Cluster Configuration       VoltDB (ops/sec)   Cassandra (ops/sec)   Throughput Increase
1 Node                               111,000                24,200                  4.6X
3 Nodes w/o Replication              293,000                38,900                  7.5X
3 Nodes w/ Replication               176,000                24,700                  7.1X

3. Batching Many Integers

Use the same schema as benchmark 2 and load the same 500,000 keys with their 50 integers each. Define the operation as follows:

  1. From the 500,000 loaded pairs, randomly select 1 key.
  2. From the 50 columns, randomly select two sets of 10 indices in the range 0-49 inclusive, indices A0-A9 and indices B0-B9.
  3. For each i from 0 to 9:
    1. For the selected key, read the integer value at index Ai.
    2. If the value at index Ai is odd, write a new random integer at index Bi (for the same key).

Again, measure the average throughput over three 5-minute runs.
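In VoltDB, the batched operation is still a single procedure invocation: the ten reads go through the SQL engine as one batch, and the conditional writes as a second batch, so the client still pays for one round trip. A hypothetical sketch of the run method, assuming the same per-column reads/writes statement arrays as the CheckAndSet sketch above:

```java
// BatchedCheckAndSet (sketch): lives in a procedure class declaring the
// same per-column `reads` and `writes` SQLStmt arrays shown earlier.
// indexesA and indexesB each carry 10 column indices; newValues carries
// 10 client-generated random replacement values.
public long run(String key, int[] indexesA, int[] indexesB, int[] newValues) {
    // Queue all ten reads, then execute them as one batch.
    for (int i = 0; i < indexesA.length; i++) {
        voltQueueSQL(reads[indexesA[i]], key);
    }
    org.voltdb.VoltTable[] results = voltExecuteSQL();

    // Queue a write wherever the value read was odd, then run those
    // writes as a second batch.
    int queued = 0;
    for (int i = 0; i < results.length; i++) {
        if ((results[i].asScalarLong() % 2) != 0) {
            voltQueueSQL(writes[indexesB[i]], newValues[i], key);
            queued++;
        }
    }
    if (queued > 0) {
        voltExecuteSQL();
    }
    return queued;
}
```

This batching is why the VoltDB numbers below barely move relative to benchmark 2: the per-transaction coordination cost is paid once for all ten read/maybe-write pairs.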

Cluster Configuration       VoltDB (ops/sec)   Cassandra (ops/sec)   Throughput Increase
1 Node                               102,000                13,300                  7.7X
3 Nodes w/o Replication              286,000                17,200                   16X
3 Nodes w/ Replication               172,000                13,000                   13X

The batched benchmark shows that neither system is 10 times slower when doing 10 times the work per operation. We're particularly proud that VoltDB shows a very small performance drop. Adding more SQL to a single-partition stored procedure is very cheap in VoltDB; the overhead of transaction coordination and network messaging dominates the actual updating of tuples.

Conclusion

It was fun to get to play with another system targeting the same scaling pain we are. Cassandra is the first distributed system I've used that is as easy as VoltDB to get running on a cluster. It also has a friendly community. One issue I did have was the lack of thorough documentation and of non-trivial (or non-Twitter) examples.

VoltDB requires a little more effort up front to build an application. That said, on the same hardware, VoltDB will do more work, sometimes a whole lot more work, and it will do it with SQL, transactional serializability, and a user manual.

Going forward, we're planning to release the code we used to run these benchmarks, and we'd like to try a few other storage layers. We'll see how much time we have over the next few months while we're adding features for VoltDB's next release. We'll shortly post a longer summary and a better description of what we did in our community forums, so visit the VoltDB community site and join the discussion.

John Hugg
Software Engineer
VoltDB