A Peek Inside VoltDB’s VARBINARY Sausage Factory

written by John Hugg on August 17, 2011

VoltDB users are pretty passionate, especially when it comes to things they want us to implement.  For example, we got a lot of feedback from early users indicating the need for VoltDB to natively support variable-length binary objects.  Some of those apps were looking to use VoltDB as a key/value (K/V) store (not as insane as it might sound); some had hybrid K/V workloads; some just needed a general-purpose solution for storing custom data structures in VoltDB.

We initially handled VARBINARY use cases by base64-encoding data, then storing it in VARCHAR fields in the database.  That workaround satisfied some application needs, but we really wanted to address the requirement in a complete way.  So, a couple of months ago, we took on a project to add native binary data support to VoltDB.

Changes to VoltDB’s type subsystem touch many parts of the code, from client libraries to the core storage layer and everything in between. Although we ended up with a simple and (hopefully) useful improvement, it may be interesting to review how a seemingly small feature like VARBINARY support got baked into VoltDB.

  • We had to make sure binary field storage couldn’t fragment memory. VoltDB is an in-memory system, so efficient memory use is a major concern. Since we spent a chunk of time last winter building a non-fragmenting string storage pool, we decided to reuse that code and handle VARCHAR and VARBINARY through mostly the same code paths at the storage layer.
  • We had to figure out a representation for SQL literal values. These appear as column default values, as literals in SQL statements, in exported or snapshotted CSV data files, and in our JSON/HTTP API. We chose vanilla hexadecimal without a “0x” prefix. Base64 would have been more compact, but we expect most binary data will be accessed through our native client libraries using binary interfaces, so our goal was to make literals easy, and hexadecimal is easy.
  • VoltDB allows strings to be passed around as byte arrays with an implied UTF-8 encoding, which is sometimes faster than worrying about Java’s string encoding. Since we now allow binary data to be encoded as hexadecimal strings, we allow you to use both strings and byte arrays to represent either VARCHAR or VARBINARY data. To make this work, we improved our support for late typecasting. So if you pass a string value to a SQL statement parameter, it won’t be assumed to be binary (or any other type) until it’s actually used or stored, at which point VoltDB will perform any needed conversion.
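
To make the literal format and the late-typecasting idea a little more concrete, here is a minimal Java sketch. It is not VoltDB’s actual code: the class BinaryLiterals, the ColumnType enum, and the methods toHex, fromHex, and toStoredBytes are hypothetical names invented for illustration, and the choice of upper-case output is an assumption. It shows plain hexadecimal encoding without a “0x” prefix, and a conversion step that only decides how to interpret a string or byte array once the target column type is known.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of VoltDB's API.
final class BinaryLiterals {
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    // Assumed stand-in for the two variable-length column types discussed above.
    enum ColumnType { VARCHAR, VARBINARY }

    // Encode bytes as plain hexadecimal, two digits per byte, no "0x" prefix.
    static String toHex(byte[] data) {
        char[] out = new char[data.length * 2];
        for (int i = 0; i < data.length; i++) {
            int b = data[i] & 0xFF;          // treat the byte as unsigned
            out[2 * i] = HEX[b >>> 4];       // high nibble
            out[2 * i + 1] = HEX[b & 0x0F];  // low nibble
        }
        return new String(out);
    }

    // Decode a hexadecimal literal; accepts upper- and lower-case digits.
    static byte[] fromHex(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("hex literal must have an even length");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            int hi = Character.digit(hex.charAt(2 * i), 16);
            int lo = Character.digit(hex.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("invalid hex digit near index " + (2 * i));
            }
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }

    // Late typecasting sketch: the parameter's meaning is decided only here,
    // when the target column type is finally known.
    static byte[] toStoredBytes(Object param, ColumnType target) {
        if (param instanceof byte[]) {
            // Raw bytes for VARBINARY, or implied UTF-8 text for VARCHAR.
            return (byte[]) param;
        }
        if (param instanceof String) {
            String s = (String) param;
            return target == ColumnType.VARBINARY
                    ? fromHex(s)                          // string is a hex literal
                    : s.getBytes(StandardCharsets.UTF_8); // string is ordinary text
        }
        throw new IllegalArgumentException("unsupported parameter type");
    }
}
```

Under these assumptions, the string "4869" bound to a VARBINARY column and the string "Hi" bound to a VARCHAR column both end up as the same two stored bytes, which is the kind of deferred interpretation described in the last bullet.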

As you can see, a “simple” data type implementation actually required some pretty thoughtful design and coding.  The result is a native VARBINARY datatype that boosts some cool application patterns like K/V processing and high performance caching. Look for some other neat stuff from us along these lines in the near future.  We’ll keep chugging with our short release cycles. And if you’re dying for a feature or fix that isn’t in there yet, let us know; we’re always hungry for your input.