Once you determine what happened, you often need to take action to correct the problem. If, for example, the VoltDB server process stops on a node because of a hardware failure, you will need to fix the hardware before bringing the node back into the cluster.
Once the problem is diagnosed and any necessary repairs made, you can bring the node back into the running cluster directly from the dashboard. Click on the name of the stopped node in the server list and select one of the two Rejoin options from the popup menu. The Enterprise Manager copies the necessary files to the node, issues a rejoin request, and rejoins the node to the cluster. Once this procedure is complete, the indicators for the node and the cluster return to green.
Note that if two or more nodes are removed from the running cluster, you should have the nodes rejoin the cluster one at a time. Wait for each node to complete the rejoin process before starting the next node. Attempting to rejoin multiple nodes at the same time can result in some nodes timing out and failing to rejoin.
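The one-at-a-time rule above can be sketched as a simple loop. This is only an illustrative model, not an Enterprise Manager or VoltDB API: the callbacks `start_rejoin` and `is_rejoined` are hypothetical placeholders for whatever mechanism you use to trigger a rejoin and to check that it has finished, and the polling interval and timeout values are arbitrary assumptions.

```python
import time
from typing import Callable, Iterable, List


def rejoin_sequentially(
    nodes: Iterable[str],
    start_rejoin: Callable[[str], None],   # hypothetical: triggers a rejoin for one node
    is_rejoined: Callable[[str], bool],    # hypothetical: True once the node is back in the cluster
    poll_seconds: float = 5.0,             # assumed polling interval
    timeout_seconds: float = 600.0,        # assumed per-node time limit
) -> List[str]:
    """Rejoin nodes one at a time and return them in the order they rejoined."""
    rejoined: List[str] = []
    for node in nodes:
        start_rejoin(node)
        deadline = time.monotonic() + timeout_seconds
        # Do not start the next node until this one has finished rejoining.
        while not is_rejoined(node):
            if time.monotonic() >= deadline:
                raise TimeoutError(f"{node} did not finish rejoining in time")
            time.sleep(poll_seconds)
        rejoined.append(node)
    return rejoined
```

The point of the sketch is the ordering: the next rejoin request is issued only after the previous node reports that it is back in the cluster.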
When rejoining a node to the cluster, you have two choices: regular rejoin and live rejoin. Which type of rejoin you choose depends on the specific needs of your application. To rejoin a node to a busy database without negatively impacting the throughput or latency of client applications, use Live Rejoin. If your database can be paused temporarily, you can use a regular Rejoin.
Normally, when rejoining, the other nodes in the cluster send copies of the appropriate partitions to the rejoining node as part of their transactional work queue. This provides the quickest way to restore a rejoining node.
However, the downside is that the copying process ties up that partition until the copy is complete. No other transactions are executed by the partition in the meantime. If there is a multi-partition transaction in the queue, the rejoin can block all partitions as they wait for that partition to complete its part of the multi-partition transaction.
To avoid blocking transactions on an active database, you can request a "live" rejoin. A live rejoin is performed as a separate workload and does not block the existing partitions. Live rejoin is preferable because the database remains available throughout the procedure with minimal impact on throughput and performance. The drawback of a live rejoin is that, for large datasets, it can take longer to complete than a blocking rejoin.
In rare cases, if the database is near capacity in terms of throughput, a live rejoin may be unable to keep up with the ongoing changes made to the data. If this happens, VoltDB reports that the live rejoin cannot complete, and you must either wait until database activity subsides or perform a regular, blocking rejoin once it is safe to do so.
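The trade-off between the two rejoin types can be summarized as a small decision helper. This is a sketch of the reasoning in the paragraphs above, not part of any VoltDB tool; the names `RejoinType` and `choose_rejoin` and both parameters are hypothetical.

```python
from enum import Enum
from typing import Optional


class RejoinType(Enum):
    REGULAR = "Rejoin"        # blocking: fastest, but ties up partitions during the copy
    LIVE = "Live Rejoin"      # non-blocking: database stays available, may take longer


def choose_rejoin(can_pause_database: bool,
                  live_rejoin_failed: bool = False) -> Optional[RejoinType]:
    """Pick a rejoin type, or return None if the only option is to wait.

    can_pause_database: True if client work can tolerate blocked partitions.
    live_rejoin_failed: True if a live rejoin was already reported as unable
                        to complete because of heavy database activity.
    """
    if can_pause_database:
        # Blocking is acceptable, so use the quickest way to restore the node.
        return RejoinType.REGULAR
    if not live_rejoin_failed:
        # Busy database that must stay responsive.
        return RejoinType.LIVE
    # Live rejoin could not keep up and blocking is not acceptable:
    # wait until activity subsides before trying again.
    return None
```

For example, `choose_rejoin(can_pause_database=False)` selects the live rejoin, while the same call with `live_rejoin_failed=True` returns `None` to indicate that you should wait.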
If the problem is hardware related and cannot be fixed quickly, you may wish to replace that server in the cluster. Similarly, if you want to deliberately remove a specific server from the cluster (for maintenance, for example), you can click on its name and select Stop from the popup menu.
Once the server is stopped, you can tell the Enterprise Manager to replace it with another server by doing the following:
1. Click on the server name in the list of servers on the dashboard.
2. Select Replace from the popup menu.
3. A secondary menu appears to the right, listing all of the available servers. Click on the name of the server you want to use as a replacement, or click on Add... to enter a new server.
   If you choose Add..., the Add Server dialog box pops up and lets you enter the new server information. Click Replace to add the server and use it as the replacement.

At this point the menus disappear and the stopped node is replaced in the server list with the name of its replacement.
Once the replacement is complete, you can click on the new server's name and select Rejoin or Live Rejoin to have it join the cluster and restore the database to its full complement of servers.
Note that the process for rejoining and replacing nodes described above depends on a high availability cluster; that is, a cluster with the K-safety value set to one or more. If the cluster is running without K-safety, or if more nodes fail than the cluster can support, the entire cluster will stop and the database icon will turn red.
In this situation, you can still restart the database from within the Enterprise Manager. Simply click on the Start Database button next to the database name or select Start from the popup context menu in the global list of databases.
However, when the entire database stops, the data in memory is lost. If you restart the database using the create option, you create a fresh, empty database. If you want to restore the database's previous state, be sure to use the recover option when restarting the database. This is why use of command logging or automated snapshots is recommended, especially if you are not using K-safety for durability.
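The K-safety condition described above can be illustrated with a small model. It assumes the usual meaning of a K-safety value of K (each partition is kept in K + 1 copies, so the cluster is guaranteed to survive the loss of up to K nodes) and models only that guaranteed case. The function names are hypothetical and are not part of VoltDB.

```python
def copies_per_partition(k_safety: int) -> int:
    """Number of copies of each partition kept for a given K-safety value."""
    if k_safety < 0:
        raise ValueError("K-safety value cannot be negative")
    return k_safety + 1


def cluster_guaranteed_running(k_safety: int, failed_nodes: int) -> bool:
    """True if the cluster is guaranteed to keep running after the failures."""
    if k_safety < 0 or failed_nodes < 0:
        raise ValueError("values cannot be negative")
    return failed_nodes <= k_safety


def can_rejoin_or_replace(k_safety: int, failed_nodes: int) -> bool:
    """True if there is a stopped node and a running cluster for it to rejoin."""
    return failed_nodes >= 1 and cluster_guaranteed_running(k_safety, failed_nodes)


# With K-safety of 1, one failed node can be rejoined or replaced.
print(can_rejoin_or_replace(k_safety=1, failed_nodes=1))
# Without K-safety, a single failure stops the whole cluster, so the
# database must be restarted (with recover to restore its previous state).
print(can_rejoin_or_replace(k_safety=0, failed_nodes=1))
```

When `can_rejoin_or_replace` is false because the cluster has stopped, the rejoin and replace procedures no longer apply and the database must be restarted as described above.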