VoltDB comes with three export clients:

- The export-to-file client, as the name implies, writes the exported data to local files, either as comma-separated or tab-delimited files.
- The export-to-JDBC client writes data to a variety of possible destination databases through the JDBC protocol.
- The export-to-Hadoop client uses Sqoop, the SQL importer for Hadoop from Cloudera, to write the exported data to a Hadoop distributed file system.
Despite the different targets, the three export clients operate in the same way. The major difference is how you configure export, depending on whether you are running the client remotely or on the database server.
Where possible, running the export client on the database server is recommended because it simplifies the overall process, helps distribute the work, and avoids possible bottlenecks. Both the export-to-file and export-to-JDBC clients can be run on the server. Because of the complexities of configuring a Hadoop connection, the export-to-Hadoop client is available for remote operation only.
To enable export on the server, you add the enabled attribute to the <export> tag and include the <onserver> tag as a child in the deployment file. You specify the client you wish to use as the value of the exportto attribute of the <onserver> tag: specify "file" for the export-to-file client or "jdbc" for the export-to-JDBC client (available in the VoltDB Enterprise Edition). For example:
```xml
<export enabled="true">
   <onserver exportto="file">
      . . .
   </onserver>
</export>
```
When you enable server-side export processing, you must also configure the export client in the deployment file. You do this by specifying the configuration properties using the <configuration> tag enclosing one or more <property> tags. For example, the following XML code enables export to comma-separated (CSV) text files using the file prefix "MyExport".
```xml
<export enabled="true">
   <onserver exportto="file">
      <configuration>
         <property name="type">csv</property>
         <property name="nonce">MyExport</property>
      </configuration>
   </onserver>
</export>
```
The properties that are allowed and/or required depend on the export client. Table 13.1, “File Export Client Properties” and Table 13.2, “JDBC Export Client Properties” list the required and optional properties for the export-to-file and export-to-JDBC clients, respectively.
Table 13.1. File Export Client Properties
| Property | Allowable Values | Description |
|----------|------------------|-------------|
| type* | csv, tsv | Specifies whether to create comma-separated (CSV) or tab-delimited (TSV) files. |
| nonce* | string | A unique prefix for the output files. |
| outdir | directory path | The directory where the files are created. If you do not specify an output path, the client writes the output files to the current default directory. |
| period | integer | The frequency, in minutes, for "rolling" the output file. The default frequency is 60 minutes. |
| binaryencoding | hex, base64 | Specifies whether VARBINARY data is encoded in hexadecimal or BASE64 format. The default is hexadecimal. |
| dateformat | format string | The format of the date used when constructing the output file names. You specify the date format as a Java SimpleDateFormat string. The default format is "yyyyMMddHHmmss". |
| timezone | string | The time zone to use when formatting the timestamp. Specify the time zone as a Java timezone identifier. The default is GMT. |
| delimiters | string | Specifies the delimiter characters for CSV output. The text string specifies four characters: the field delimiter, the enclosing character, the escape character, and the record delimiter. To use special or non-printing characters (including the space character), encode the character as an HTML entity. For example, "&lt;" for the "less than" symbol. |
| batched | true, false | Specifies whether to store the output files in subfolders that are "rolled" according to the frequency specified by the period property. The subfolders are named according to the nonce and the timestamp, with "active-" prefixed to the subfolder currently being written. |
| skipinternals | true, false | Specifies whether to include six columns of VoltDB metadata (such as transaction ID and timestamp) in the output. If you specify skipinternals as "true", the output files contain only the exported table data. |
| with-schema | true, false | Specifies whether to write a JSON representation of each table's schema as part of the export. The JSON schema files can be used to ensure the appropriate datatype and precision is maintained if and when the output files are imported into another system. |

Properties marked with an asterisk (*) are required.
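To show how the optional properties combine, the following deployment fragment is a sketch only; the directory path and property values are illustrative, not taken from this manual. It rolls tab-delimited output every 30 minutes into batched subfolders and omits the VoltDB metadata columns:

```xml
<export enabled="true">
   <onserver exportto="file">
      <configuration>
         <!-- required properties -->
         <property name="type">tsv</property>
         <property name="nonce">MyExport</property>
         <!-- optional properties (values are illustrative) -->
         <property name="outdir">/tmp/voltdb-export</property>
         <property name="period">30</property>
         <property name="batched">true</property>
         <property name="skipinternals">true</property>
      </configuration>
   </onserver>
</export>
```

With batched set to true, each 30-minute roll produces a new subfolder named from the nonce and timestamp, which can simplify downstream ingestion of completed batches.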
Table 13.2. JDBC Export Client Properties
| Property | Allowable Values | Description |
|----------|------------------|-------------|
| jdbcurl* | connection string | The JDBC connection string, also known as the URL. |
| jdbcuser* | string | The username for accessing the target database. |
| jdbcpassword | string | The password for accessing the target database. |
| jdbcdriver | string | The class name of the JDBC driver. The JDBC driver class must be accessible to the VoltDB process for the JDBC export process to work. Place the driver JAR files in the lib/extension directory where VoltDB is installed to ensure they are accessible at runtime. You do not need to specify the driver as a property value for several popular databases, including MySQL, Netezza, Oracle, PostgreSQL, and Vertica. However, you still must provide the driver JAR file. |
| schema | string | The schema name for the target database. The use of the schema name is database specific. In some cases you must specify the database name as the schema. In other cases, the schema name is not needed and the connection string contains all the information necessary. See the documentation for the JDBC driver you are using for more information. |
| ignoregenerations | true, false | Specifies whether a unique ID for the generation of the database is included as part of the output table name(s). The generation ID changes each time a database restarts or the catalog is updated. The default is false. |
| skipinternals | true, false | Specifies whether to include six columns of VoltDB metadata (such as transaction ID and timestamp) in the output. If you specify skipinternals as true, the output contains only the exported table data. The default is false. |

Properties marked with an asterisk (*) are required.
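As a sketch of a server-side JDBC export configuration — the connection URL, credentials, and schema name below are placeholders, not values from this manual — the deployment file might look like the following for a PostgreSQL target, which needs no explicit jdbcdriver property:

```xml
<export enabled="true">
   <onserver exportto="jdbc">
      <configuration>
         <!-- placeholder connection details; substitute your own -->
         <property name="jdbcurl">jdbc:postgresql://dbhost:5432/exportdb</property>
         <property name="jdbcuser">voltexport</property>
         <property name="jdbcpassword">secret</property>
         <property name="schema">public</property>
      </configuration>
   </onserver>
</export>
```

Remember that even when the driver class name can be omitted, the driver JAR file itself must still be placed where the VoltDB process can load it.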
An alternative to running the export client on the database server(s) is to run an instance of the export client remotely. You can run either the export-to-file or export-to-Hadoop clients remotely. The export-to-JDBC client must be run on the database servers.
The major differences when running the export client remotely are that you specify the properties as options when starting the client process, and that you run only one instance of the client, which connects to all of the cluster nodes.
When you start an external export client, you specify the cluster nodes for the client to query for information (using the --servers argument). The client queries these nodes, one at a time, until it receives a response. Part of the response it receives is a description of the cluster, including a list of nodes and available ports. The client then creates connections to every node in the cluster using the port specified in the --connect argument (either the client port or the admin port).
Note that you don't have to specify all of the nodes of the cluster on the command line. You only have to specify one. The client then discovers the cluster configuration from the first node it reaches. However, you can specify multiple nodes in case one or more of the nodes is unavailable when the client starts.
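As a rough sketch of the discovery options described above — the Java class name, classpath, and host names here are assumptions for illustration, not values given in this section — launching the remote export-to-file client might look like:

```shell
# Illustrative sketch only: the class name, classpath, and host names
# are assumptions. --servers lists nodes to query for the cluster
# topology (one reachable node is enough); --connect selects whether
# the client connects via the client port or the admin port.
java -cp "voltdb/lib/*" org.voltdb.exportclient.ExportToFileClient \
    --servers voltsvr1,voltsvr2 \
    --connect client \
    --type csv \
    --nonce MyExport \
    --outdir /tmp/voltdb-export
```

Listing two servers here is purely defensive: the client discovers the full cluster from whichever node answers first.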
Once the client connects to the cluster, it starts to poll for export data and acknowledge (ack) what it receives. The client "decodes" the export stream from its serialized form into the appropriate datatypes for each column in the table. It then writes the data to the appropriate target, whether local files or a Hadoop distributed file system.
If the client loses connection to any of the nodes in the cluster (either because of a node failure or a shutdown), it disconnects from the cluster and repeats the initial discovery process, using the information it collected from the original connection. In other words, it will query every node in the cluster, one at a time, until it determines the new configuration. If the client cannot reach any of the nodes (for example, if the cluster is temporarily shut down) it will periodically retry the discovery process until the database cluster comes back online.
Once the cluster comes back, the client resumes export operations, picking up with the last data packet it received prior to the interruption. This allows the export process to continue without operator intervention, even across network disruptions, node failures, and database sessions.
Both the export-to-file and export-to-Hadoop clients use the preceding process. The following sections describe each remote client and its operation in more detail.