Scaling: 700,000 Simultaneous Connections to VoltDB

written by Ariel Weisberg on October 21, 2013

One of the original design goals for VoltDB was to support large numbers of connections per server. This kind of came for free with Volt’s event-based design. That design was driven by a desire to keep the number of active software threads 1:1 with available hardware threads, which basically ruled out any kind of thread-per-client network IO implementation.
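
To make that concrete, here is a minimal sketch of the pattern in plain java.nio: one selector thread multiplexing every client connection instead of a thread per client. This is a generic illustration, not Volt’s actual network layer, and the port and buffer size are arbitrary.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

// One selector thread handles accepts and reads for every connection.
public class EventLoopSketch {
    public static void main(String[] args) throws IOException {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(21212)); // port chosen arbitrarily
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);

        ByteBuffer buffer = ByteBuffer.allocate(4096);
        while (true) {
            selector.select(); // block until some connection is ready
            for (SelectionKey key : selector.selectedKeys()) {
                if (key.isAcceptable()) {
                    // New connection: register it with the same selector.
                    SocketChannel client = server.accept();
                    client.configureBlocking(false);
                    client.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {
                    SocketChannel client = (SocketChannel) key.channel();
                    buffer.clear();
                    if (client.read(buffer) == -1) {
                        key.cancel();
                        client.close(); // peer hung up
                    }
                    // Real code would parse the request here and hand it to
                    // a worker pool sized to the available hardware threads.
                }
            }
            selector.selectedKeys().clear();
        }
    }
}
```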

In theory the only limit on the number of connections in Volt is kernel memory and heap space. Volt allocates a handful of objects for each client connection to track incoming and outgoing message bytes and procedure invocation statistics. All told, with the way Java objects are laid out and the overhead of GC, it is probably 1-4 kilobytes of memory (I know, I should really measure) per connection. If clients call a large number of procedures there will be some additional memory overhead for tracking the statistics of each procedure on a per-connection basis.
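
If you wanted to do that measurement, the shape of it is roughly the sketch below: open and hold a batch of connections, then sample the server’s heap before and after (jstat or JMX will do) and divide by the connection count. The host, port, and count are invented, and a real test would go through the client library so each connection completes Volt’s handshake; raw sockets just keep the sketch self-contained.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.SocketChannel;
import java.util.ArrayList;
import java.util.List;

// Hold a batch of open connections so per-connection heap growth can be
// read off the server externally. You may need to raise the open-file limit.
public class ConnectionHoldHarness {
    public static void main(String[] args) throws IOException, InterruptedException {
        final int count = 10_000; // arbitrary batch size
        List<SocketChannel> connections = new ArrayList<>(count);

        for (int i = 0; i < count; i++) {
            // 21212 is Volt's default client port; host is a placeholder.
            connections.add(SocketChannel.open(new InetSocketAddress("localhost", 21212)));
        }
        System.out.println("Holding " + count + " connections; sample the server heap now.");

        Thread.sleep(60_000); // keep the sockets open while you measure

        for (SocketChannel c : connections) {
            c.close();
        }
    }
}
```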

The real upper limit is going to be based on the ability of the JVM to manage the turnover of old-gen objects associated with the connections, and that will vary based on your GC pause tolerance and throughput requirements.

We never did a ton of testing with “truly” large numbers of connections, but we have had customers for several years that have been running 100k or so persistent connections and regularly test into the tens of thousands of connections without issues.

A few months ago, in the 3.5 and 3.6 timeframe, we did a POC with a very demanding customer that not only wanted to maintain a large number of connections, but also wanted to use a very large number of worker processes that are constantly recycled, requiring the server to efficiently accept and tear down connections. Maintaining a large number of connections turned out not to be an issue, but constantly adding and removing them revealed some of the issues addressed in 3.5 and 3.6.

We made a bunch of improvements: fixing some O(n²) data structures used to track statistics, eliminating extra off-heap memory usage by idle connections, and removing synchronous reverse DNS lookups from the connection accept path. We also fixed a bug in the C++ client library that added a 10 millisecond wait to each connection creation, which also affects the PHP client library built on the C++ library. The connection pooling in both libraries now returns the same connection to a thread if it is requested multiple times, reducing the number of pooled connections in client applications that aren’t careful about reusing connections.
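
The per-thread reuse behaves roughly like this sketch. It is written in Java purely for illustration, not the C++ library’s actual code, and the Connection type and connect() call are placeholders.

```java
import java.util.HashMap;
import java.util.Map;

// A thread that asks for a connection to the same host more than once gets
// its existing connection back instead of a new pooled one.
public class PerThreadConnectionPool {
    interface Connection { /* placeholder for a real client connection */ }

    private final ThreadLocal<Map<String, Connection>> perThread =
            new ThreadLocal<Map<String, Connection>>() {
                @Override protected Map<String, Connection> initialValue() {
                    return new HashMap<String, Connection>();
                }
            };

    public Connection acquire(String hostAndPort) {
        Map<String, Connection> mine = perThread.get();
        Connection conn = mine.get(hostAndPort);
        if (conn == null) {
            // Only the first request from this thread actually dials out.
            conn = connect(hostAndPort);
            mine.put(hostAndPort, conn);
        }
        return conn;
    }

    private Connection connect(String hostAndPort) {
        // Placeholder: open a real client connection here.
        return new Connection() {};
    }
}
```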

The end result is that we had 700k incoming client connections to a single Volt node served off of a 2-gigabyte Java heap. Keep in mind that Volt nodes proxy incoming requests within the cluster, so every Volt node you add increases the number of concurrent connections a cluster can support. Supporting large numbers of connections is also helpful because it allows clients to connect to every node and route requests to the correct node, reducing latency, especially for reads, which can be load balanced across replicas.
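
Client-side, taking advantage of that looks something like the following sketch with the Java client: connect to every node in the cluster and turn on client-side routing so single-partition calls go straight to the owning node. The hostnames and the procedure name are invented.

```java
import java.io.IOException;
import org.voltdb.client.Client;
import org.voltdb.client.ClientConfig;
import org.voltdb.client.ClientFactory;
import org.voltdb.client.ProcCallException;

// Connect to every node and let the client route each call to the node
// that owns the data for that call.
public class AffinityClientSketch {
    public static void main(String[] args)
            throws IOException, ProcCallException, InterruptedException {
        ClientConfig config = new ClientConfig();
        config.setClientAffinity(true); // route calls to the owning node

        Client client = ClientFactory.createClient(config);
        for (String host : new String[] {"volt1", "volt2", "volt3"}) {
            client.createConnection(host); // one connection per cluster node
        }

        // With affinity on, this single-partition call goes straight to the
        // node that owns the partition for key 42 instead of being proxied.
        client.callProcedure("GetAccount", 42);

        client.drain();
        client.close();
    }
}
```

With affinity off, any node will still accept the call and proxy it to the right place; routing on the client just skips that extra hop.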

Another advantage of having the cluster handle this is that you don’t need a separate load balancer or connection pool process, and we don’t have to write one for $LANG. It’s also one less moving part for you to deal with when deploying and managing a cluster.