Use Main Memory for OLTP

written by Mike Stonebraker on March 22, 2012

This is the first in a series of blog posts in which I will explore various aspects of On-Line Transaction Processing (OLTP).  In this post, I’ll examine main memory storage as an alternative to disk for traditional and “New OLTP” systems.

Traditional relational DBMSs, Hadoop and most of the NoSQL offerings store their data on disk.  In contrast, VoltDB is a main memory DBMS.

First, it should be noted that main memory is getting very cheap.  It is straightforward to put 50 Gbytes of memory on a $5,000 server.  Beefy servers these days have 10 times that amount. Moreover, many (but not all) transactional databases don’t require massive storage volumes. An OLTP application with more than a few Tbytes of data is quite rare.  The same can be said for new OLTP “fire hose” applications that require ultra-high write throughput and ACID transactions (e.g., digital advertising, wireless, real-time monitoring, online games) – these systems rarely need to manage more than a few Tbytes of hot state.  Hence, it is plausible to buy enough main memory to store the data for the vast majority of OLTP applications.
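To put rough numbers on that claim, here is a back-of-envelope sketch (my own arithmetic, using the $5,000 / 50 GB server figure above and an illustrative 2 TB dataset; it ignores replication and per-row storage overhead):

```cpp
#include <cmath>
#include <cstdio>

// Back-of-envelope sizing: how many commodity servers does it take to hold
// an OLTP working set entirely in main memory? Figures from the post:
// roughly 50 GB of RAM on a $5,000 server (2012 prices).
int main() {
    const double ram_per_server_gb = 50.0;   // usable memory per server
    const double cost_per_server   = 5000.0; // USD
    const double dataset_tb        = 2.0;    // a "few TB" OLTP database

    const double dataset_gb = dataset_tb * 1024.0;
    const int servers = static_cast<int>(std::ceil(dataset_gb / ram_per_server_gb));

    std::printf("%.0f GB dataset -> %d servers, ~$%.0f in hardware\n",
                dataset_gb, servers, servers * cost_per_server);
    return 0;
}
// Prints: 2048 GB dataset -> 41 servers, ~$205000 in hardware
```

Even at 2012 prices, holding a multi-terabyte OLTP database entirely in RAM is a modest hardware budget by data-center standards.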

So why not just run a traditional disk-based RDBMS and have your data comfortably fit in the buffer pool?

The answer comes from a SIGMOD 2008 paper called “OLTP Through the Looking Glass, and What We Found There.”  The paper explores the TPC-C benchmark implemented on a typical disk-based RDBMS architecture.  Although the Shore implementation from Wisconsin was used in the study, the authors expect the results to be very similar on other RDBMSs.  They instrumented the code and categorized where the CPU cycles went on a TPC-C workload that fit completely in the buffer pool.  The paper reports several experiments, which I will very crudely summarize below.
[Figure: Legacy RDBMS overhead pie chart]

About 10% of the CPU cycles perform useful work (retrieving and updating records).  The remaining 90% of the CPU cycles are split more-or-less equally among four tasks:

  • Buffer pool overhead (fetching records from buffer pool pages and managing buffer pool data structures)
  • Locking overhead (setting and releasing record-level locks, and managing waits for locks)
  • Crash recovery overhead (implementing a write-ahead log for crash recovery purposes)
  • Latch overhead (in a multi-threaded system, ALL the shared data structures in a DBMS must be latched when accessed to prevent corruption.  These include the lock table, B-tree indexes, buffer pool data structures, resource management data structures, etc.  A sketch of this per-access cost follows this list.)

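To make the latch overhead concrete, here is a minimal sketch (my own, not from the paper): it times a mutex-protected counter against an unprotected one in a single thread, so even the uncontended per-access cost of latching shows up. std::mutex stands in for a DBMS latch, and the absolute timings are machine-dependent.

```cpp
#include <chrono>
#include <cstdio>
#include <mutex>

// Minimal illustration of latch overhead: even with no contention, every
// lock/unlock pair on a shared structure costs cycles, and a traditional
// DBMS latches the buffer pool, lock table, and B-trees on every access.
int main() {
    constexpr long N = 10'000'000;
    volatile long counter = 0;  // volatile keeps the loops from being optimized away
    std::mutex latch;           // stands in for a DBMS latch

    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < N; ++i) counter = counter + 1;  // unlatched access
    auto t1 = std::chrono::steady_clock::now();
    for (long i = 0; i < N; ++i) {
        std::lock_guard<std::mutex> g(latch);            // latched access
        counter = counter + 1;
    }
    auto t2 = std::chrono::steady_clock::now();

    auto us = [](auto a, auto b) {
        return std::chrono::duration_cast<std::chrono::microseconds>(b - a).count();
    };
    std::printf("unlatched: %lld us, latched: %lld us\n",
                (long long)us(t0, t1), (long long)us(t1, t2));
    return 0;
}
```

A real DBMS pays this lock/unlock cost on the lock table, B-trees, and buffer pool structures for essentially every record touched, which is how latching comes to consume its share of the cycle budget.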
To implement a very high-performance system, the following two tenets must be followed:

Tenet 1: The focus must be on overhead.  A scheme that makes useful work go faster will make very little difference in overall performance.

Tenet 2: To go really fast, an implementation must get rid of the vast majority of ALL FOUR sources of overhead.  Otherwise, you will run only marginally faster than a traditional disk-based system with a buffer pool.
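To see why both tenets hold, apply Amdahl’s Law to the rough breakdown above (my own arithmetic; the 10% / 22.5%-per-overhead split is approximate and varies by workload):

```cpp
#include <cstdio>

// Amdahl's Law on the "Looking Glass" breakdown: ~10% useful work,
// ~22.5% for each of the four overheads. Eliminating a fraction f of
// total cycles yields a speedup of 1 / (1 - f).
int main() {
    const double useful = 0.10;
    const double per_overhead = 0.225;

    std::printf("eliminate useful-work cost:   %.2fx\n", 1.0 / (1.0 - useful));          // 1.11x
    std::printf("eliminate one overhead:       %.2fx\n", 1.0 / (1.0 - per_overhead));     // 1.29x
    std::printf("eliminate three overheads:    %.2fx\n", 1.0 / (1.0 - 3 * per_overhead)); // 3.08x
    std::printf("eliminate all four overheads: %.2fx\n", 1.0 / (1.0 - 4 * per_overhead)); // 10.00x
    return 0;
}
```

Making useful work free buys at most about 1.1x (Tenet 1), and removing three of the four overheads still yields only about 3x; all four must go before you approach 10x (Tenet 2).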

Hence, an obvious conclusion is that high-performance OLTP MUST be a main memory system.  Otherwise, the first source of overhead will destroy performance.  Equally obviously, just implementing a main memory DBMS with the other three sources of overhead intact will not result in high performance.  This is why you won’t gain significant throughput improvements by simply running MySQL in memory or replacing spinning disks with SSDs.

Subsequent blog posts will explore these other DBMS performance issues.