VoltDB – Simplified CSV Loader

written by Douglas Patten on August 23, 2012 with no comments

My name is Xin Jia, a Brown University student, and I’ve spent the summer interning at VoltDB. One of the first projects that I worked on this summer was a CSV Loader with fellow intern Zheng Li.

Prior to VoltDB 2.8, loading data from CSV files into the database could be a challenging task, especially for newcomers. There were a few steps that needed to be taken: you would need to write a client program to parse the CSV data, handle errors and ultimately invoke a stored procedure to insert the data into the database. Extra care needed to be taken in the client in order to get optimal performance.

As part of the internship at VoltDB, Zheng and I wrote a CSV loader that simplifies the task of loading data into VoltDB. It was released as part of VoltDB 2.8. VoltDB 2.8 comes with a shell script “csvloader” in the bin directory that starts the actual CSV loader. This post will describe this new utility.

If you are migrating to VoltDB from another database system or perhaps defining initial data for your new VoltDB application, you may have your data stored in CSV format. To load it into VoltDB using the CSV loader, it is as simple as using the following command:

csvloader tableName < dataFile.csv

or, alternately:

csvloader tableName -f dataFile.csv

These two commands are equivalent.

In VoltDB 2.7, we released support for snapshotting to CSV files.  Using this new utility you can now load the CSV snapshot back into the system using the same commands shown above.

The first argument to the csvloader utility is the table name to which the data will be loaded into. To do this efficiently in VoltDB, the loader calls the auto-generated insert procedure with the name “<table name>.insert”. Should you want to do additional processing before inserting the data into the table or if you want to insert into multiple tables from one CSV file, the CSV loader gives you the option (“-p” or “–procedure”) of specifying your own custom stored procedure for loading. The only requirement is that the parameters of the stored procedure matches the column definition of the CSV file.

In addition to the features mentioned so far, the CSV loader can also handle CSV files with a different delimiter, quote character or escape character, as long as they are used consistently in the same file. Partial loading of data from the csv file is also supported by specifying the row offset and the row limit. In the case of a malformed CSV file, you can specify the max number of errors to ignore before the loader gives up on that file. Detailed usage of the CSV loader is documented here.

When the CSV loader finishes, three reports will be generated – the summary, error log and failed rows. These log files show the summary of the loading process (including the throughput, total number of rows loaded, etc.), all parsing errors and insertion errors from the database, as well as all failed rows, respectively.

We hope that you will find the CSV loader handy and makes the data loading process much easier, more flexible, and enjoyable!