VoltDB, native memory, and you

Monday, March 3, 2014 - 12:00am

Originally published here

Target audience

This post targets other VoltDB developers who are going to be dealing with the various unconventional ways Volt now uses native memory in the Java portions of the database. It will also be of interest to other Java developers looking to step outside what is typically considered Java’s comfort zone for interacting with native memory.

What was

Out of the box, Java doesn’t give you much in the way of tools for accessing native memory allocation and deallocation mechanisms like mmap or malloc. There are DirectByteBuffers (DBBs), but DBBs are managed by the garbage collector, which introduces overhead and unpredictable latency into their allocation and reclamation.

DirectByteBuffer allocation path

Java limits the total amount of DBB memory that can be allocated. By default this limit is the same as -Xmx, but you can set it manually using -XX:MaxDirectMemorySize. I’ll cover why I think Java needs such a limit when reviewing the deallocation path.
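
A quick, throwaway way to see the limit in action (the 256 megabyte cap here is arbitrary):

    // Run with: java -XX:MaxDirectMemorySize=256m DirectLimitDemo
    import java.nio.ByteBuffer;
    import java.util.ArrayList;
    import java.util.List;

    public class DirectLimitDemo {
        public static void main(String[] args) {
            List<ByteBuffer> retained = new ArrayList<ByteBuffer>();
            try {
                while (true) {
                    retained.add(ByteBuffer.allocateDirect(1024 * 1024)); // 1MB each
                }
            } catch (OutOfMemoryError e) {
                // Thrown once the cap is hit and a forced GC reclaims nothing
                System.out.println("Gave up after " + retained.size() + " MB");
            }
        }
    }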

Because there is a limit, you end up with a global lock protecting the counters that enforce it, and you are going to hit this global lock every time you create a DBB. You can find this code in java.nio.Bits.reserveMemory(long, int). However, we aren’t done with global locks yet.

Because DBBs have their underlying allocation freed by the garbage collector, the collector has to be informed of the actions to take when a DBB is reclaimed. DBBs don’t use finalization, opting instead for sun.misc.Cleaner, which is described as a “lightweight and more robust alternative to finalization.”

My main takeaway was that Cleaners run sooner, don’t retain references to the actual object being cleaned (you provide a dedicated deallocator object), and don’t introduce any reference queue overhead. However, Cleaners do introduce another global lock into the DBB allocation path. A Cleaner object is a node in a doubly linked list of Cleaners with a single global reference to the head, because the Cleaners all need to be strongly referenced until the reference handler thread comes along and runs one, removing it from the list in the process.

The last nit when using DBBs is that the memory is zeroed out for safety, which means that every page ends up being committed, in addition to the overhead of doing the zeroing.

DBBs are not intended to be rapidly allocated and deallocated; you are expected to do your own pooling and slicing of memory, a.k.a. write your own allocator. One of the nice attributes of DBBs is that, because they are based on pointers, slices of DBBs contain a raw offset pointer and don’t have to constantly add an offset (unlike a heap based ByteBuffer slice). That said, heap byte buffers are already much faster.

If you did go the route of implementing your own allocator it would perform fine, but you would still be back in the unsafety of unmanaged memory. Given that, I don’t really understand Java’s refusal to give you a decent allocator.

DirectByteBuffer deallocation path

There are two ways to get a DBB deallocated, and only one of them is actually intended for general consumption. The safe way is to treat it like the managed buffer it is supposed to be and let the garbage collector identify when the buffer is unused and run the Cleaner. The main problem with this approach is that if you retain the buffer for even a relatively small period of time it can be promoted out of the young generation, and then you have to wait for the next old gen collection to occur.

If your goal is to write software that does old gen collections on a weekly or monthly basis, you could end up waiting a while to reclaim that memory! And that is why Bits tracks reserved memory. If you fail to reserve memory, Bits will call System.gc() and sleep for 200 milliseconds to see if the collector can reclaim enough DBBs to satisfy your reservation. Without this feedback loop the collector has no idea how much unused native memory is lying around waiting to be collected.
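
The feedback loop amounts to something like the following paraphrase (this is the behavior described above, not the actual JDK source):

    // Paraphrased sketch of java.nio.Bits bookkeeping; `max` stands in
    // for the direct memory limit.
    static long reserved;
    static final long max = Runtime.getRuntime().maxMemory();

    static synchronized void reserveMemory(long size) { // the global lock
        if (reserved + size <= max) { reserved += size; return; }
        System.gc();                                    // beg for Cleaner runs
        try { Thread.sleep(200); } catch (InterruptedException e) { }
        if (reserved + size > max) {
            throw new OutOfMemoryError("Direct buffer memory");
        }
        reserved += size;
    }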

If you have disabled explicit GC (-XX:+DisableExplicitGC) you can expect it to throw OutOfMemoryError after sleeping for 200 milliseconds.

Neither of these outcomes (unsolicited GC, OOME) is acceptable. Thankfully there is a workaround. You can cast a DBB to DirectBuffer, fetch the Cleaner, and run it yourself, deallocating the memory and updating the global state in Bits. This is about as dangerous as you would expect if you have ever tackled memory management in C. Now there is a strongly referenced ByteBuffer floating around that points to memory that has been deallocated. You will also get some nasty compile time warnings about how these interfaces could change between versions or go away completely, but such is life programming in a language that treats native memory like a second class citizen (yes, I am bitter).
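
In code the workaround is short; sun.nio.ch.DirectBuffer is the internal interface that generates those warnings:

    import java.nio.ByteBuffer;
    import sun.nio.ch.DirectBuffer;

    public class EagerFree {
        public static void main(String[] args) {
            ByteBuffer buf = ByteBuffer.allocateDirect(1 << 20);
            // Free the native memory now and update the counters in Bits.
            // buf remains strongly referenced but points at freed memory.
            ((DirectBuffer) buf).cleaner().clean();
        }
    }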

Bonus, creating your own managed objects

For the vast majority of Volt I opted to go with C style memory management, with one instance of a kind of reference counting. I found that our usage generally approximates auto_ptr because we have clear transfer of ownership, but I have not found an efficient or automatic way to implement auto_ptr in Java. The ElasticHashinator fought against this approach because we do play kind of fast and loose with how the client interface fetches the latest hashinator to use.

The ElasticHashinator stores the consistent hash function as a sorted array of integers, with odd integers representing a token on the consistent hash ring and even integers representing the shard id of the token. The integers are stored off heap and accessed via Unsafe, and there is generally a single global copy. This means the same array can be shared across threads and accessed from the native execution engine, which is a little more cache friendly than having many private copies.
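
To make that concrete, here is a hypothetical sketch of a ring lookup against such an off heap array. The (token, partition id) pair layout and all the names here are mine for illustration, not necessarily Volt’s exact layout:

    import java.lang.reflect.Field;
    import sun.misc.Unsafe;

    final class TokenRing {
        private static final Unsafe U;
        static {
            try {
                Field f = Unsafe.class.getDeclaredField("theUnsafe");
                f.setAccessible(true);
                U = (Unsafe) f.get(null);
            } catch (Exception e) { throw new ExceptionInInitializerError(e); }
        }

        // Find the pair with the largest token <= hash, wrapping around to the
        // last pair when hash precedes every token, and return its partition id.
        static int partitionForHash(long base, int pairCount, int hash) {
            int lo = 0, hi = pairCount - 1, idx = -1;
            while (lo <= hi) {
                int mid = (lo + hi) >>> 1;
                int token = U.getInt(base + mid * 8L);
                if (token <= hash) { idx = mid; lo = mid + 1; }
                else               { hi = mid - 1; }
            }
            if (idx == -1) idx = pairCount - 1; // wrap around the ring
            return U.getInt(base + idx * 8L + 4L);
        }
    }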

We still end up with private copies temporarily during elastic rebalance, but eventually everyone converges on the same copy and we need to reclaim the old versions of the hashinator. There can actually be many old versions because the hash function will be updated thousands of times when we add a node. That is also part of why I tried to pick an efficient, fast to construct representation for the hash function.

Dealing with deallocation is where things become challenging. Reference counting would be fine, since most parts of Volt retrieve a private copy based on transactional activity and retain it until it is time to update again. However, the client interface is free to pick an old or new hash function. It doesn’t really matter, since we will re-route if there is a mistake, so it just checks a global reference for whatever is hot and new.

The problem is that if the memory is managed by reference counting we end up having to increment and decrement the refcount for each transaction, or go through the coordination headache of checking for a new hashinator and only updating then. Transactions updating the hashinator can roll back, so the propagation needs to be deferred until commit.

It turns out that the static factory method for Cleaners is public, so you can register your own deallocators and get in bed with the garbage collector. What could go wrong, right? The trick is still to keep the collector in the loop when a lot of old hashinators are waiting to be collected, so we do the same allocation tracking for ElasticHashinators that Bits does for DBBs. If we go over our target we spawn a thread to invoke System.gc() and log a warning if we didn’t reclaim enough memory. We never throw OOME, since IMO it is better to commit more memory and give the operator a chance to react.
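
Registering your own deallocator is a single call to the public factory. A minimal sketch, where nativeFree is a stand-in for the real JNI entry point:

    import sun.misc.Cleaner;

    final class OffHeapHashinatorData {
        OffHeapHashinatorData(long address) {
            // The deallocator must not reference `this`, or the object could
            // never become unreachable and the Cleaner would never run.
            Cleaner.create(this, new Deallocator(address));
        }

        private static final class Deallocator implements Runnable {
            private final long address;
            Deallocator(long address) { this.address = address; }
            public void run() { nativeFree(address); } // stand-in JNI call
        }

        private static native void nativeFree(long address);
    }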

What is

Once you step out of the garden of managed buffers you always need to know what kind of ByteBuffer you are looking at when you are handed responsibility for deallocating it. You would also like the ability to perform some validation to protect you from use after free, double free, leaks, and deallocating memory that was never allocated.

I’m not asking for much, just Valgrind. My initial opinion on using native memory in Java was that it really sucked that there was no tooling to let you diagnose the inevitable bugs. When I started developing a new version of PersistentBinaryDeque that allowed for a lot of unsafe memory management, I ran into bugs that were very difficult to diagnose. Unlike GDB, the Java debugger crashes every time something goes wrong, so you can’t just play forward to the first error and start diagnosing state.

By the time I was finished working through a rewrite of PersistentBinaryDeque I found that you can make Java (well… your Java application) excellent at tracking native memory like Valgrind, with the bonus that it can be faster than Valgrind.

BBContainer

The main interface for any piece of potentially native memory in Volt is BBContainer, which is a wrapper around a ByteBuffer that provides all the functionality you need to work with it safely. DBBPool.java uses conditional compilation (don’t hate me!) because several pieces of overhead couldn’t be optimized out even if the logic were gated on assertions being enabled. There are also several kinds of fairly cheap error checking that I would want on even if assertions aren’t enabled.

I don’t think you want to go down the unmanaged memory route unless you make running with validation cheap enough that all your fellow team members will be doing it, and that means making it convenient to leave on by default. You also need to produce errors that team members will accept as useful.

The conditional compilation is controlled from ant via -Djmemcheck=${option}. MEMCHECK_FULL gives you extra tracking, but it prevents memory reclamation. NO_MEMCHECK disables everything and is what you would use for a benchmark or release build. Any other value results in partial memory checking that will catch most cases of use after free, double free, and leaks, but can’t help you if you do something wrong with a ByteBuffer you got from a container.
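
For example, a benchmark or release style build might be produced with something like (exact targets vary):

    ant -Djmemcheck=NO_MEMCHECK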

BBContainer.discard()

The discard of BBContainer is normally implemented by subclasses of BBContainer. The only thing the BBContainer version does is check for double free. Derived class implementations of discard must cooperate with double free checking by always invoking checkDoubleFree as their first action.

BBContainer checks for double free by synchronizing on the container during deallocation, checking the field that stores the stack of a previous thread that discarded the container, and populating that field with the current thread’s stack if things are kosher.

If discard is called a second time the double free check will log the thread name, stack, and time of the original discard along with the same information for the current thread. This information is logged to stderr as well as VoltLogger, and then System.exit(-1) is invoked. Without the System.exit(-1) there is really no way to get the unit tests to fail reliably. The downside is that Jenkins/junit/ant is not capturing any of the output, which is something I need to figure out.
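
Put together, the protocol looks roughly like this sketch; the names approximate the real ones and the timestamp bookkeeping is omitted:

    import java.util.Arrays;

    public abstract class BBContainerSketch {
        private volatile String freedBy; // also read by the use after free check

        protected synchronized void checkDoubleFree() {
            if (freedBy != null) {
                System.err.println("Double free! First freed by: " + freedBy);
                System.exit(-1); // only reliable way to make unit tests fail
            }
            freedBy = Thread.currentThread().getName() + "\n"
                    + Arrays.toString(Thread.currentThread().getStackTrace());
        }

        // Subclasses always call checkDoubleFree() first, then do their own cleanup.
        public void discard() {
            checkDoubleFree();
            // ... free memory, return it to a pool, run a Cleaner, ack, etc.
        }
    }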

Frequently you need to take multiple actions when deallocating so it is common to subclass BBContainer while also binding another BBContainer. This nesting allows various things to react to the consumption of the state represented by that memory. See AckingContainer and PersistentBinaryDeque.

When composing containers it is important to always do the double free checking in all the wrapper containers so that use after free checking continues to work. Composing containers, and the polymorphism inherent in BBContainer, means that in situations like AckingContainer the backing container can be one of several things. At times it is off heap memory allocated by an execution engine, other times it might be memory allocated in Java when decompressing from disk, other times it may be a slice of a memory mapped file, and in several of those cases it may be wrapped to represent an entry in a persistent binary deque, where discarding it is what “removes” the entry from the deque. The only reason it isn’t wrapped a third time is that we handle the ack explicitly in ExportDataSource.ackImpl.

In unit tests managed buffers are common, which is why DBBPool.cleanByteBuffer() tries not to make any assumptions about what kind of ByteBuffer it is looking at.

BBContainer.{b(), bD(), bDR()}

BBContainer does not expose its member ByteBuffer directly, although it will leak it through an accessor. The accessor methods b(), bD(), and bDR() check for use after free and return the raw ByteBuffer, a duplicate, or a read only duplicate respectively.

The use after free checking is similar to the double free checking, although there is no locking. Instead we do a volatile read of the stack trace field populated by the thread doing the free. If the field is populated you get the stack of the thread that did the free as well as the stack of the thread attempting the use after free.
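
Continuing the sketch above, the check is just a lock free volatile read (again an approximation):

    final void checkUseAfterFree() {
        String f = freedBy;      // volatile read of the freeing thread's stack
        if (f != null) {
            throw new IllegalStateException("Use after free; freed by:\n" + f);
        }
    }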

It’s notable that slice is missing. It’s still pretty unsafe, although there is a lot you could do to make it safe.

BBContainer.finalize()

One of the last pieces of the puzzle is checking for memory leaks. This is a real headache because it requires that everything clean up on the way out and deallocate. That means shutdown code in things that never actually need to shut down, just so that multiple unit tests can run in the same process.

The finalizer for BBContainer checks whether the container was discarded, and if it wasn’t it prints information about the thread that performed the allocation. To get the finalizer to trigger promptly in tests it helps to put a System.gc() and System.runFinalization() in tearDown(). If you are actively debugging it makes sense to put a 200 millisecond thread sleep after them to give finalization a chance to find leaked buffers before starting the next test. In most cases you want the tests to run fast, so the 200 millisecond sleep is not viable.
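
The combo looks like this in a JUnit style tearDown:

    @Override
    protected void tearDown() throws Exception {
        System.gc();
        System.runFinalization();
        // Thread.sleep(200); // enable while actively hunting a leak
        super.tearDown();
    }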

I find that when debugging leaks I tend to insert this combo all over until I get close to the test, or the portion of a test, that causes the leak.

It might be worth looking at using Cleaners instead of finalize to get the errors to appear more promptly, but things are working well enough to satisfy me.

DBBPool.deleteCharArrayMemory

The last line of defense against double free, or freeing memory that was never allocated, is in deleteCharArrayMemory. The method is private so you can’t actually access it directly, but if you wrap a DirectByteBuffer (I will get into wrapping later) you can trick DBBPool into deleting whatever that DirectByteBuffer happens to point to.

By default you won’t get any protection, but if you compile with MEMCHECK_FULL every allocation and deallocation will be tracked by address, and deallocation will never actually occur. This prevents the allocator from returning an address more than once, so we can tell when something is deallocated a second time.
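
Conceptually the full memcheck mode amounts to something like this paraphrase (not the actual implementation):

    import java.util.Collections;
    import java.util.HashSet;
    import java.util.Set;

    // Track every live address; never actually free, so an address can never
    // be handed out twice and a bogus free is always detectable.
    final class FullMemcheck {
        private static final Set<Long> live =
                Collections.synchronizedSet(new HashSet<Long>());

        static void onAllocate(long address) {
            live.add(address);
        }

        static void onDelete(long address) {
            if (!live.remove(address)) {
                System.err.println("Freeing unallocated or already freed: " + address);
                System.exit(-1);
            }
            // Deliberately leak: the memory stays committed, the address unreused.
        }
    }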

The downside is that memory usage grows without bound, but all the unit tests seem to pass this way so the utility is still pretty high.

OK, but how do I get memory?

You have five options for getting native memory. DBBPool.allocateDirect() will return a container with a managed ByteBuffer that will run the Cleaner on discard. DBBPool.allocateDirectAndPool() will return a globally pooled managed ByteBuffer, and discard returns the memory to the pool.

DBBPool.allocateDirectAndPool() returns an exact allocation for the requested size, not the nearest power of two or some other approximation. It is intended for globally pooling some commonly used sizes. This pool should never be particularly large; it is just scratch space for serializing data before sending it to disk, compressing it, putting it on the network, etc.

If allocateDirect() or allocateDirectAndPool() tries to allocate a DirectByteBuffer and gets OOME, it will clear the global pool (and log that this is happening), call System.gc(), wait 200 milliseconds, and then try again. If the second attempt fails the OOME is not caught and propagates to the caller. This is to try and recover from having the wrong mix of memory pooled, which I see as a better alternative to crashing.

The third option is to get an unmanaged DirectByteBuffer allocated via JNI using DBBPool.allocateUnsafeByteBuffer(). The discard method deallocates the memory using another JNI call. Buffers allocated this way have no Cleaner, and this is going to be the fastest way to allocate/deallocate native memory. That doesn’t mean you don’t want to pool, but pooling becomes more of a performance optimization.

The fourth option is to receive memory allocated from an execution engine. The first thing you want to do is wrap the ByteBuffer you got from the EE into a BBContainer using DBBPool.wrapBB().

The fifth option is to use good old ByteBuffer.allocateDirect(). You can also wrap these using DBBPool.wrapBB(), and things work fine because DBBPool.cleanByteBuffer() will check for a Cleaner before deallocating the buffer. dummyWrapBB is also an option if you want to use it as a managed buffer.
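
A sketch of the intended usage pattern, using the methods named above:

    // Allocate, use, discard. Depending on which option produced the container,
    // discard runs the Cleaner, returns the memory to the pool, or frees via JNI.
    BBContainer c = DBBPool.allocateDirect(8192);
    try {
        ByteBuffer b = c.b();  // checked accessor; fails loudly after a free
        b.putLong(42L);
        // ... serialize, compress, ship to disk or the network ...
    } finally {
        c.discard();
    }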

Getting a bad wrap

One of the easier ways to get into trouble is to double wrap an allocation. This is the flaw in allowing BBContainer to leak its ByteBuffer, thus allowing it to be wrapped a second time. Whoever allocates the memory should be the one to wrap it.

DBBPool.wrapBB() will work with a heap or direct ByteBuffer. For a heap buffer discard is a no-op; for a direct buffer discard will check for and ignore null, check for a Cleaner and use it, and then fall back to deallocating it via JNI.

DBBPool.dummyWrapBB() will wrap any kind of buffer and does nothing on discard. It’s something of an anti-pattern, since you are giving someone access to a buffer you don’t want them to deallocate, which means you intend to deallocate it yourself, which opens the door to deallocating it prematurely. It also bypasses use after free checking, because the dummy wrapper does not know anything about other containers that might contain this ByteBuffer. I should probably change it to require you to provide a dependency container. Right now it is used in PersistentBinaryDeque to facilitate wrapping slices of mapped byte buffers.

DBBPool.wrapMBB() returns a wrapper that is similar to wrapBB(), but the b() method returns an instance of MappedByteBuffer, allowing you to get at load() and force() without casting. Keep in mind that MappedByteBuffers that refer to memory mapped files also have associated Cleaners, that you can force the unmap by running the Cleaner, and that this is what the container provided by wrapMBB() will do. Eagerly unmapping is useful when you are trying to delete a file when you are done with it. It removes the file from the page cache, avoiding unnecessary pressure, and if you are lucky you may even skip some writeback.
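
For example, eagerly unmapping before a delete might look like this sketch (wrapMBB and the typed b() are as described above; the rest is standard NIO):

    void consumeAndDelete(java.io.File f) throws java.io.IOException {
        java.io.RandomAccessFile raf = new java.io.RandomAccessFile(f, "r");
        try {
            BBContainer c = DBBPool.wrapMBB(raf.getChannel().map(
                    java.nio.channels.FileChannel.MapMode.READ_ONLY, 0, raf.length()));
            try {
                java.nio.MappedByteBuffer mbb = c.b(); // no cast for load()/force()
                mbb.load();
                // ... consume the contents ...
            } finally {
                c.discard();                           // runs the Cleaner: eager unmap
            }
        } finally {
            raf.close();
        }
        f.delete(); // nothing left pinned in the page cache, maybe less writeback
    }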

Going forward

Initially I thought our requirement was five 9s under some number of milliseconds, but it has since become clear that the only number that really matters is the max. There are environments where Volt is deployed that suffer from cascading failures outside of the database when there is a hiccup greater than 10-20 milliseconds. The people deploying Volt are not in control of many of the causes of cascading failure and can’t work around them through the usual QoS mechanisms. I don’t endorse it, but it’s a great excuse to improve the consistency of performance in Volt.

It’s important to understand the role that various memory allocation strategies can play when deciding what allocator to use. Java’s young gen GC is fantastic and, for the purposes of Volt, can absorb a nigh unlimited allocation rate. Hotspot’s old gen GC is no longer an option if we are aiming to have an old gen GC once a week or once a month. Once we move the main offenders off the heap we will have to start looking at the causes of hourly or daily old gen GCs and figure out whether we can provision or pool our way out of those. The Java heap has the obvious advantage that it can store object graphs and you don’t have to pay a penalty to serialize objects, but if it means fisticuffs with a garbage collector it’s not worth it.

I still have to sit down and do some measurements to find out what kind of CMS pause times we are looking at once tunables are in play. I don’t think we will ever be falling back to a full GC, but the mark and remark phases of CMS are probably going to have problematic outliers.

As a developer you should think about the time scales of the different kinds of data you are working with. If your goal is to prevent promotion to the old gen you want to make sure there are no objects that survive multiple young gen GCs. That is pretty much the time scale of an individual transaction, because anything that spans an arbitrary number of transactions will tend to have a lifecycle that spans several GCs.

When the thing you are working with is going to have a problematic lifespan you can move it off heap or pool it. Don’t be afraid to do a little extra copying or use a little extra memory, because consistent latency is more important than raw throughput. The most demanding shops provision based on the max or the Nth percentile rather than throughput, because they know that represents usable capacity for them.

We have benchmarks in continuous integration that will inform us of the impact of these changes as we go, and I should have more information shortly on how much mileage moving DR, live rejoin, and export buffers off heap has gotten us.