The Sqoop on VoltDB Export and Hadoop Integration

Written by John Hugg on June 22, 2011

In the last couple of releases of VoltDB, we’ve made steady improvements to our Export feature. Export lets you build into your VoltDB applications an automatic flow of data from VoltDB to companion datastores (for example, an analytic database). See this earlier post. In this post, I’ll describe some of the improvements we’ve made recently, including integration with Hadoop using Apache Sqoop.

  1. Robustness.  The 1.3 release of VoltDB made great strides in increasing the robustness of the Export functionality, with a primary focus on building a looser coupling between the consumers of the Export data and the VoltDB cluster itself. We call these consumers Export Clients, and their job is to pull and acknowledge data from VoltDB, then prepare it to be inserted into the downstream system.

    In 1.3, the Export Clients can auto-discover a cluster given the address of a single node, easing configuration in the face of topology or host changes. In fact, if a node fails and is later replaced, the Export Clients will update their connections on the fly without user intervention.

    We’ve made the VoltDB server side more robust as well. VoltDB servers now overflow to disk any buffered Export data that is waiting to be sent to Export Clients. If an Export Client fails, VoltDB servers continue to operate while an administrator troubleshoots and restarts it. Server nodes also save unsent Export data to disk when saving snapshot data, so a snapshot restore also restores unsent Export data.
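To make the overflow idea concrete, here is a minimal Python sketch of a buffer that keeps a bounded number of rows in memory and spills the rest to disk while preserving order. It is an illustration of the general technique only, not VoltDB's implementation: the class name `OverflowBuffer`, its methods, the spill file naming, and the use of `pickle` are all assumptions made for this example.

```python
import os
import pickle
from collections import deque


class OverflowBuffer:
    """Illustrative only: keeps up to max_in_memory rows in RAM, spills the rest to disk."""

    def __init__(self, max_in_memory, spill_dir):
        self.max_in_memory = max_in_memory
        self.spill_dir = spill_dir
        self.memory = deque()    # oldest rows, held in RAM
        self.spilled = deque()   # paths of spill files, oldest first
        self.next_id = 0

    def append(self, row):
        # Once anything has spilled, keep spilling so FIFO order is preserved.
        if self.spilled or len(self.memory) >= self.max_in_memory:
            path = os.path.join(self.spill_dir, f"spill-{self.next_id:08d}.bin")
            self.next_id += 1
            with open(path, "wb") as f:
                pickle.dump(row, f)
            self.spilled.append(path)
        else:
            self.memory.append(row)

    def poll(self):
        """Return the oldest unsent row, or None if the buffer is empty."""
        if self.memory:
            return self.memory.popleft()
        if self.spilled:
            path = self.spilled.popleft()
            with open(path, "rb") as f:
                row = pickle.load(f)
            os.remove(path)  # the row has been handed off, so drop its file
            return row
        return None
```

A producer would call `append` for each exported row, and a consumer would call `poll` whenever it is ready for more. If the consumer stalls, the producer keeps running and the backlog accumulates in `spill_dir` rather than in memory.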

  2. Better Output.  Our Export-to-File client is our primary Export Client. It supports outputting data in CSV or TSV formats. In V1.3.3, we’ve made a few improvements to its output to allow its use in more situations.
  • We’ve improved our delimiter and escaping code, which has been tested on data with whitespace, quotes and multi-byte characters.
  • We’ve improved our file rolling code so that it’s more predictable, making it easy to discover which files are being appended to and which are safe to process.
  • The client now allows you to have one file per table, rolling every N minutes, or one folder that contains all exported table data, rolling every N minutes. This reduces the number of files in a directory and is well suited to batch processing.
  • Finally, we added the ability to export table schema along with data. Now your typeless CSV files can be enriched with type metadata without querying VoltDB itself. The schema is written alongside the CSV files in a simple JSON format.
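As a rough illustration of how a downstream job might use the exported schema, the Python sketch below converts CSV fields to typed values using a JSON schema. The schema layout shown here (a `columns` list of `name`/`type` pairs), the `CONVERTERS` table, and the assumption that rows are headerless and in schema order are all hypothetical; check the actual files your export client produces before relying on any of it. NULL handling is omitted.

```python
import csv
import io
import json
from decimal import Decimal

# Hypothetical mapping from SQL type names to Python converters.
CONVERTERS = {
    "TINYINT": int, "SMALLINT": int, "INTEGER": int, "BIGINT": int,
    "FLOAT": float, "DECIMAL": Decimal, "VARCHAR": str,
}


def read_typed_rows(csv_file, schema):
    """Yield one dict per CSV row, converting each field per the schema."""
    columns = schema["columns"]
    for record in csv.reader(csv_file):
        if len(record) != len(columns):
            raise ValueError(f"expected {len(columns)} fields, got {len(record)}")
        yield {
            # Unknown type names fall back to plain strings.
            col["name"]: CONVERTERS.get(col["type"], str)(value)
            for col, value in zip(columns, record)
        }


# In-memory data standing in for real export files (assumed layout).
schema = json.loads(
    '{"columns": [{"name": "id", "type": "BIGINT"},'
    ' {"name": "price", "type": "FLOAT"},'
    ' {"name": "note", "type": "VARCHAR"}]}'
)
data = io.StringIO('1,2.5,"hello, world"\n2,0.25,plain\n')
rows = list(read_typed_rows(data, schema))
```

Keeping the type mapping in one table makes it easy to adjust once you know the type names the schema file really uses.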
  3. Integrated Export to Hadoop.  VoltDB Enterprise Edition V1.3.3 contains a new export client that exports data from VoltDB into a Hadoop Distributed File System (HDFS) using Apache Sqoop, the SQL-to-Hadoop technology developed by Cloudera. The Export-to-Hadoop client operates in much the same way as the existing Export-to-File client, but gives Export data access to the format flexibility that Sqoop offers.

    Establishing integration between VoltDB and a Hadoop installation is quite easy. You identify which VoltDB data will be exported to Hadoop, then configure the VoltDB export client with the location of Hadoop, the location of the VoltDB cluster, Sqoop options such as output formatting, and other installation-specific settings (e.g., frequency of import). The VoltDB export client automatically manages periodic Sqoop jobs based on your configuration. For details, see Chapter 10, “Exporting Live Data” in the Using VoltDB manual.

John Hugg
Senior Software Engineer