Upsert: update or insert a record

written by vdbdev on October 22, 2013 with no comments

There are many times when a database needs to maintain unique records, and as new data arrives it may not be known if it is a new record to be inserted, or an existing record to be updated. Instead of checking first from outside the database, it would be much faster and easier if you could send the record to the database and have it do the right thing, update the record if it already exists, or insert it otherwise.

Some databases provide a MERGE statement for this purpose, while others have non-standard SQL syntax. In VoltDB it can be done with standard INSERT and UPDATE syntax within a simple stored procedure pattern.

The logic of this type of stored procedure is to run the UPDATE statement, and then check for the number of rows affected. If 1 row was affected, no further action is needed. If no rows were affected, then the INSERT statement is executed. There are no errors that need to be caught, because it is perfectly valid for an UPDATE statement to affect no rows.

This has the advantage of only running one statement in the “UPDATE” cases, and at most only running two statements. The assumption is that the entire record is being replaced with new values in the case of the update, but incremental updates such as adding 1 to a count, or summing the previous value of a column with the new value could be done using the same pattern, simply by incorporating these increments into the UPDATE statement.

Usually, the update statement is matching on the primary key of the table, which is indexed. In cases where there is no primary key, there should be a unique key index. If there was no index, the query would be executed using a full table scan which will result in poor performance.

Here is an example DDL file that includes a table that includes one of each data type.

DDL.sql

CREATE TABLE hellotypes (
  id              INTEGER NOT NULL,
  varchar_val     VARCHAR(40),     
  varbinary_val   VARBINARY(256),  
  tinyint_val     TINYINT,         
  smallint_val    SMALLINT,        
  bigint_val      BIGINT,          
  float_val       FLOAT,           
  dec_val         DECIMAL,         
  timestamp_val   TIMESTAMP,       
  CONSTRAINT pk_hellotypes PRIMARY KEY (id)
);
PARTITION TABLE hellotypes ON COLUMN id;

CREATE PROCEDURE FROM CLASS procedures.UpsertHelloTypes;
PARTITION PROCEDURE UpsertHelloTypes ON TABLE hellotypes COLUMN id PARAMETER 0;

Note that the class name of the java stored procedure is referenced in the DDL, along with a PARTITION PROCEDURE statement so that the procedure will run in a single partition for greater parallelism and throughput.

Here is the stored procedure java code.

UpsertHelloTypes.java

package procedures;

import java.math.BigDecimal;
import org.voltdb.*;
import org.voltdb.types.TimestampType;
import org.voltdb.client.ClientResponse;

public class UpsertHelloTypes extends VoltProcedure {

public final SQLStmt update = new SQLStmt(
    "UPDATE hellotypes SET " +
    "  varchar_val = ?," +
    "  varbinary_val = ?," +
    "  tinyint_val = ?," +
    "  smallint_val = ?," +
    "  bigint_val = ?," +
    "  float_val = ?," +
    "  dec_val = ?," +
    "  timestamp_val = ?" +
    " WHERE id = ?;");

public final SQLStmt insert = new SQLStmt(
    "INSERT INTO hellotypes VALUES (" +
    "?,?,?,?,?,?,?,?,?" +
    ");");

    public long run( int id,
                     String varchar_val,
                     byte[] varbinary_val,
                     byte tinyint_val,
                     short smallint_val,
                     long bigint_val,
                     double float_val,
                     BigDecimal dec_val,
                     TimestampType timestamp_val ) throws VoltAbortException {

        voltQueueSQL(update,
                     varchar_val,
                     varbinary_val,
                     tinyint_val,
                     smallint_val,
                     bigint_val,
                     float_val,
                     dec_val,
                     timestamp_val,
                     id);

        VoltTable results1[] = voltExecuteSQL();
        long rowsAffected = results1[0].asScalarLong();

        if (rowsAffected == 0) {
            // then insert
            voltQueueSQL(insert,
                         id,
                         varchar_val,
                         varbinary_val,
                         tinyint_val,
                         smallint_val,
                         bigint_val,
                         float_val,
                         dec_val,
                         timestamp_val);
            voltExecuteSQL();
        }

    return ClientResponse.SUCCESS;
    }
}

There are a few cases where the update really depends upon first reading the existing values of the record. For example, if some of the values provided to the procedure are NULL, and you want to keep the existing values rather than replace them with NULL. In cases like this, you would run a SELECT statment first, then if no rows were found you would INSERT. When one row was found, you would prepare the values to go into the UPDATE statement and run the UPDATE.