Q&A from March 21 Webinar

written by Mike Stonebraker on March 26, 2012

Last week I gave a webinar entitled OldSQL vs. NoSQL vs. NewSQL for New OLTP.  If you missed the live webinar and want to view the recorded version, you’ll find it here (you may need to scroll down to find it).  Below is a list of questions that live webinar attendees asked, in no particular order.  If you have follow-on questions, reply to this post and I or someone else from VoltDB will answer them.

Webinar Questions and Answers

1.  Does VoltDB run on scale-up, NUMA-like systems, or is it designed primarily to run on scale-out clusters?

So far, we have focused primarily on scaling out across multiple nodes in a cluster.  However, our approach (dividing the memory shared by the K cores on a node into K non-shared “chunks” and assigning each chunk to a specific core) will work fine on NUMA; it just won’t take particular advantage of the NUMA architecture.

2.  Are your criticisms of NoSQL databases valid for data warehouses (DWH)?  Do they apply to column stores used as a DWH?

Column stores can easily beat row stores across a wide range of analytic queries.  Hence, they are the best storage structure to use for most DWH/OLAP applications.  Unless your data is highly specialized (e.g., graph oriented or deeply-dimensional), I would recommend looking hard at a columnar database like Vertica for your OLAP needs.  For all but simple (or highly specialized) applications, using a NoSQL datastore as a one-size-fits-all database means you’ll do a bad job at processing transactions and a bad job at back-end analytics.
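To make the row-versus-column point concrete, here is a toy sketch in Python.  The table, its column names and its values are invented for illustration; a real columnar engine adds compression, vectorized execution and much more.  The only point shown is that an aggregate over one attribute needs to read just that attribute when the data is laid out by column.

```python
from array import array

# Hypothetical order table: (order_id, region, amount).
rows = [
    (1, "east", 120.0),
    (2, "west", 75.5),
    (3, "east", 42.0),
]

# Row layout: an aggregate over one attribute still walks every full row.
row_total = sum(row[2] for row in rows)

# Column layout: each attribute is stored contiguously, so the same
# aggregate reads only the single column it needs.
columns = {
    "order_id": array("q", (r[0] for r in rows)),
    "region": [r[1] for r in rows],
    "amount": array("d", (r[2] for r in rows)),
}
col_total = sum(columns["amount"])
```

Both totals are the same; what differs is how much data has to be touched to compute them, which is what matters once the table has hundreds of columns and billions of rows.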

3.  Do you generally support in-memory DB?

Yes, VoltDB is an in-memory database.  You can read more about VoltDB products here.

4.  Any references (production)?

Yes, there are many VoltDB applications deployed in production.  A partial list of VoltDB users can be found here.

5.  Won’t a column store combined with an in-memory DBMS provide sufficient performance for most apps?

Yes, what you’re suggesting is an excellent combination – use an in-memory RDBMS like VoltDB to handle high performance transaction processing and real-time analytics, combined with a columnar database for deep analytic processing.  In fact, VoltDB was specifically designed to be integrated with companion OLAP datastores.  In Big Data applications, we also see Hadoop being used as an alternative to columnar datastores due to Hadoop’s massively parallel processing framework.  Ultimately, analytic frameworks that combine Hadoop’s MapReduce processing tier and columnar storage may emerge as the most popular companion to VoltDB.

6.  Would VoltDB also work for OLAP?

Generally no.  VoltDB is designed for high throughput transaction processing and real-time analytics (i.e., analytics on “hot” datasets).  Although VoltDB can scale out on commodity hardware, you would probably not want to use VoltDB for databases in the tens of Tbytes and larger, which are increasingly common in OLAP applications.  For those needs, you should combine VoltDB with a companion analytic datastore.

7.  What’s the performance of VoltDB on aggregates?

VoltDB is extremely fast at real-time operations such as grouping, aggregation and counting.

8.  If paging is needed (when the data doesn’t fit in memory), we can’t stay single-threaded without waiting inefficiently.  That would necessitate multi-threading, leading to locking and so on, right?  How do we break this deadlock?  You mentioned hot/cold data as a way to deal with databases that don’t fit in memory.  But traditional buffer managers do the same thing using LRU queues and the like.  So, apart from the difference between disk and memory formats, don’t we still need a buffer manager to keep hot data in cache/memory?

A traditional buffer manager keeps “hot disk blocks” in main memory in disk format.  As noted in our SIGMOD 2008 paper, the overhead of maintaining such a disk-oriented buffer manager is prohibitive.  Hence, main memory data must be kept in main-memory format, and this requires rethinking the concept of a buffer pool.  Instead of keeping “hot” data in main memory, the DBMS should instead push “cold” data out of the main memory address space.  In effect, this is an “anti-cache”.  The policy to be employed by the anti-cache remains to be studied; however, my prediction is that it will not look anything like a traditional buffer manager.
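As a rough illustration of the idea (not VoltDB code), the Python sketch below keeps every tuple in an in-memory structure as its primary home and pushes the coldest tuples out to a separate store when a memory budget is exceeded.  The class name `AntiCache`, the tuple-count budget, and the use of least-recently-used order as the eviction policy are all assumptions made for the example; as noted above, the right policy remains to be studied.  A plain dict stands in for the on-disk cold store.

```python
from collections import OrderedDict


class AntiCache:
    """Toy anti-cache: memory is the primary store, cold tuples are pushed out."""

    def __init__(self, max_hot_tuples):
        self.max_hot = max_hot_tuples
        self.hot = OrderedDict()  # main-memory format, ordered coldest first
        self.cold = {}            # stand-in for an on-disk store of evicted tuples

    def put(self, key, value):
        self.cold.pop(key, None)  # a tuple lives in exactly one place
        self.hot[key] = value
        self.hot.move_to_end(key)
        self._evict_if_needed()

    def get(self, key):
        if key not in self.hot:
            # Un-evict: bring the tuple back into memory (KeyError if unknown).
            self.hot[key] = self.cold.pop(key)
        self.hot.move_to_end(key)
        value = self.hot[key]
        self._evict_if_needed()
        return value

    def _evict_if_needed(self):
        # Placeholder policy (an assumption): evict least recently used tuples.
        while len(self.hot) > self.max_hot:
            key, value = self.hot.popitem(last=False)
            self.cold[key] = value


store = AntiCache(max_hot_tuples=2)
store.put("a", 1)
store.put("b", 2)
store.put("c", 3)          # "a" is now the coldest tuple and is pushed out
fetched = store.get("a")   # "a" comes back into memory; "b" is pushed out
```

The contrast with a buffer pool is that nothing here is kept in disk-block format and there is no second copy of hot data on disk: a tuple is either in memory or in the cold store.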

9.  In the multi-core case, if we have transactions touching data in multiple partitions (even though they are all on the same system), they would need synchronization, necessitating locking and so on.  So isn’t removing the four big overheads limited to some simple cases (where transactions touch a single partition and the data fits entirely in memory)?

So-called single-partition transactions can be processed with no synchronization whatsoever.  In my opinion, this constitutes a very large fraction of OLTP applications.  In processing “multi-partition transactions”, VoltDB must work harder to obtain parallelism.  My current preferred approach is to use “speculative execution”.  In this model every CPU guarantees that it will process transactions in timestamp order.  However, multi-partition transactions require inter-CPU messages and there may be some delay involved.  Instead of waiting to process transactions with higher timestamps (and incurring a stall), the idea is to process those transactions in “tentative” mode until the multi-partition transaction commits.  If no conflict is observed, the tentative transactions can be committed; otherwise one or more need to be backed out.  Effectively, this is a form of optimistic concurrency control (OCC).
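The following Python sketch shows one simplified reading of speculative execution on a single partition.  It is not VoltDB’s implementation.  The class and method names are hypothetical, and the conflict rule is an assumption chosen for the example: if the multi-partition transaction ends up aborting, any tentative transaction that read or wrote a key it had written (directly, or through another backed-out transaction) is backed out, and the rest are kept.

```python
_MISSING = object()  # marks "key did not exist" in an undo log


class SpeculativePartition:
    def __init__(self, initial=None):
        self.state = dict(initial or {})
        self.mp_undo = None   # undo log of the in-flight multi-partition txn
        self.tentative = []   # (name, touched_keys, undo_log) in timestamp order

    def _apply(self, writes):
        undo = {k: self.state.get(k, _MISSING) for k in writes}
        self.state.update(writes)
        return undo

    def _rollback(self, undo):
        for key, old in undo.items():
            if old is _MISSING:
                self.state.pop(key, None)
            else:
                self.state[key] = old

    def begin_multi_partition(self, writes):
        # Apply this partition's share; the commit decision arrives later.
        self.mp_undo = self._apply(writes)

    def run_tentative(self, name, read_keys, compute):
        # Run a later single-partition txn without waiting for that decision.
        snapshot = {k: self.state.get(k) for k in read_keys}
        writes = compute(snapshot)
        undo = self._apply(writes)
        self.tentative.append((name, set(read_keys) | set(writes), undo))

    def resolve_multi_partition(self, committed):
        kept, doomed = [], []
        if committed:
            kept = [name for name, _, _ in self.tentative]
        else:
            tainted = set(self.mp_undo)
            for name, touched, undo in self.tentative:
                if touched & tainted:
                    tainted |= set(undo)      # its writes are now suspect too
                    doomed.append((name, undo))
                else:
                    kept.append(name)
            for _, undo in reversed(doomed):  # undo newest first
                self._rollback(undo)
            self._rollback(self.mp_undo)
        self.mp_undo, self.tentative = None, []
        return kept, [name for name, _ in doomed]


p = SpeculativePartition({"a": 100, "b": 50, "c": 7})
p.begin_multi_partition({"a": 90})                          # in-flight txn writes "a"
p.run_tentative("t1", ["c"], lambda s: {"c": s["c"] + 1})   # independent of "a"
p.run_tentative("t2", ["a"], lambda s: {"b": s["a"]})       # reads the uncommitted "a"
kept, backed_out = p.resolve_multi_partition(committed=False)
```

In this run `t1` survives because it never touched `a`, while `t2` is backed out because it read a value that was never committed.  Had the multi-partition transaction committed, both tentative transactions would simply have been kept, which is the common case the optimistic approach is betting on.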

10.  If it is single-threaded, how does it take advantage of multiple cores?

Given a K-core node, VoltDB divides main memory into K non-overlapping partitions and assigns each partition to a single core.  Each CPU services the transactions that involve data in its partition, and each CPU is single-threaded.
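A minimal Python sketch of this arrangement follows.  The names `Partition`, `Router` and `deposit`, and the use of a CRC32 hash to pick a partition, are assumptions made for the example and not VoltDB’s API.  Python threads stand in for cores here; because of the interpreter lock they do not actually run in parallel, so the sketch illustrates the structure (private data, one thread per partition, no locks) and not the performance.

```python
import queue
import threading
import zlib


class Partition:
    """One partition: private data plus exactly one worker thread."""

    def __init__(self, pid):
        self.pid = pid
        self.data = {}              # touched only by this partition's thread
        self.inbox = queue.Queue()  # transactions are serialized through here
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def _run(self):
        while True:
            item = self.inbox.get()
            if item is None:        # shutdown sentinel
                break
            txn, reply = item
            reply.put(txn(self.data))  # run to completion, no locks or latches


class Router:
    def __init__(self, k):
        self.partitions = [Partition(i) for i in range(k)]

    def partition_for(self, key):
        # Stable hash, so a given key always maps to the same partition.
        return zlib.crc32(str(key).encode()) % len(self.partitions)

    def execute(self, key, txn):
        reply = queue.Queue(maxsize=1)
        self.partitions[self.partition_for(key)].inbox.put((txn, reply))
        return reply.get()

    def shutdown(self):
        for p in self.partitions:
            p.inbox.put(None)
        for p in self.partitions:
            p.thread.join()


def deposit(account, amount):
    def txn(data):
        data[account] = data.get(account, 0) + amount
        return data[account]
    return txn


router = Router(k=4)
for _ in range(10):
    router.execute("alice", deposit("alice", 5))
balance = router.execute("alice", deposit("alice", 0))
router.shutdown()
```

Because every transaction on a given key is funneled to the same single-threaded partition, the ten deposits are applied one after another without any locking.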

11.  Any thoughts on graph databases such as Neo4J?

As mentioned earlier, specialized datastores are absolutely viable for applications with highly specialized data.

12.  Can you provide more details about the built-in HA?  Is this synchronous replication?  Please describe.

VoltDB’s HA architecture uses synchronous, multi-master, active-active replication.  VoltDB databases are typically partitioned across a shared-nothing cluster of servers.  In addition, you can tell VoltDB to manage one or more copies of each partition to meet your HA requirements.  In VoltDB parlance, HA is called k-safety, where “k” is the number of partition copies you wish VoltDB to manage.  Obviously, the larger the k-factor, the more resilient your database will be to failure of any particular node in the cluster.

VoltDB treats each partition node as a master, dispatching all relevant transactions to the appropriate nodes.  A transaction does not complete until it has been executed successfully on all of its partitions.  Incoming queries are also dispatched to all relevant nodes; the first node to respond is the winner and queries on the other partitions terminate.

If one of the nodes experiences a fatal fault or is taken offline for service, VoltDB continues to direct work to the other sibling node(s).  Once serviced, the “downed” node can be rejoined to a running cluster.  VoltDB will detect which transactions are needed to fully resync that node with its siblings, and will apply those transactions automatically.  Once the rejoined node has been brought up to date, it will begin accepting work again.
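The behavior described above can be sketched for a single partition in Python.  This is an illustration under stated assumptions, not VoltDB’s implementation: the class `ReplicatedPartition`, the in-memory transaction log, and the replay-based rejoin are all hypothetical, and transactions are assumed to be deterministic so that replaying them on a rejoining node reproduces the siblings’ state.

```python
class ReplicatedPartition:
    """Toy synchronous replication of one partition across several nodes."""

    def __init__(self, nodes):
        self.replicas = {n: {} for n in nodes}  # node name -> its copy of the data
        self.applied = {n: 0 for n in nodes}    # log entries each copy has applied
        self.live = set(nodes)
        self.log = []                           # transactions in execution order

    def execute(self, txn):
        if not self.live:
            raise RuntimeError("no live replica for this partition")
        self.log.append(txn)
        results = []
        for node in sorted(self.live):
            # Synchronous: every live copy runs the txn before we return.
            results.append(txn(self.replicas[node]))
            self.applied[node] = len(self.log)
        return results[0]

    def fail(self, node):
        self.live.discard(node)

    def rejoin(self, node):
        # Replay only the transactions this node missed while it was down.
        for txn in self.log[self.applied[node]:]:
            txn(self.replicas[node])
        self.applied[node] = len(self.log)
        self.live.add(node)


def add(key, amount):
    def txn(data):
        data[key] = data.get(key, 0) + amount
        return data[key]
    return txn


part = ReplicatedPartition(["node1", "node2"])
part.execute(add("x", 1))
part.fail("node2")          # node2 goes down; node1 keeps serving
part.execute(add("x", 10))
part.rejoin("node2")        # node2 catches up, then accepts work again
result = part.execute(add("x", 100))
```

After the rejoin both copies hold the same data, and the final transaction is again applied to both nodes before it returns.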

13.  Can VoltDB cope well in a cloud architecture where multiple user processes/transactions are using the same core?

Generally yes, VoltDB works very well in virtualized environments.  There are a few best practices we recommend considering when running VoltDB on cloud infrastructures.

14.  Can you please give me some example situations where NoSQL is the best fit?

NoSQL should be considered for applications where 1) the data does not lend itself to relational organization, AND 2) you can say with certainty that your application will NEVER need competent transaction processing AND no future application that uses the database you’re building will need competent transaction processing, AND 3) use of a standard data language (SQL) is not beneficial to your team/organization.  Unless ALL of the above are true, I would recommend finding an RDBMS that meets your needs (whether that’s VoltDB or something else) and using it.

As I’ve said many times, NoSQL often gives developers the illusion of agility.  Schemaless data models are very seductive early in the application lifecycle, when you’re doing rapid development iterations.  I understand that completely.  But you also need to think beyond those initial iterations to critical longer-term requirements.  You’re building systems that will probably endure for many years.  If, at some point, your application needs competent transaction processing, a NoSQL database will force you to do that heavy lifting in user code – there will be nothing agile about living in that world.  And the increasing chatter about NoSQL giving you “enough ACID for most things” is irresponsible and dangerous, in my opinion.