In this post, we’ll discuss the rationale behind deprecating PerconaFT and embracing RocksDB.
Why is Percona deprecating PerconaFT in favor of RocksDB?
Many of you may have seen Peter Zaitsev’s recent post about Percona embracing RocksDB and deprecating PerconaFT. I’m going to shed a bit more light on the issues between the locking models for PerconaFT’s and MongoDB’s core servers. When making this decision, we looked at how the differences between the engines measure up and impact other improvements we could make. In the end, we can do more for the community by focusing on engines that are in line with assumptions the core server makes every second in your daily operations.Then we have more resources available for improving the users’ experience by adding new tools, features, and improving the core server.
What is pessimistic locking?
Pessimistic locking locks an entity in the database for the entire time that it is actively used in application memory. A lock either limits or prevents other users from working with the entity in the database. A write lock indicates that the holder of the lock intends to update the entity and disallows anyone from reading, updating, or deleting the object. A read lock means that the owner of the lock does not want the object to change while it holds the lock, allowing others to read the entity but not update or delete it. The scope of a lock might be the entire database, a table, a collection of ranges of documents or a single document.
You can order pessimistic locks as follows (from broad to granular):
- Database locks
- Collection locks
- Range locks
- Document locks
The advantage of pessimistic locking is that changes to the database get made consistently and safely. The primary disadvantage is that this approach isn’t as scalable. The chance of waiting for a lock to be released increases when:
- A system has a lot of users
- The transactions (in MongoDB, there are transactions in the engine but not at the user level) involve a greater number of entities
- When transactions are long-lived
Therefore, pessimistic locks limit the practical number of simultaneous users that your system can support.
What is optimistic locking?
In most database systems (NoSQL and RDBMS) expect collisions to be relatively uncommon. For example, although two clients are working with user objects, one might be working with the Bob Vader object while another works with the Luke Vader object. These won’t collide. In this case, optimistic locking becomes the most viable concurrency control strategy. If you accept the premise that collisions infrequently occur, instead of trying to prevent them you can choose to detect and then resolve them when they do occur.
MongoDB has something called a Write Conflict Exception (WCE). A WCE is like an engine-level deadlock. If a record inside the engine changes due to thread #1, thread #2 must wait for a time that it is safe to change the record, and retry then. Typically this occurs when a single document gets updated frequently. It can also occur when there are many updates, or there are ranges of locks happening concurrently. This is a perfect case of optimistic locking, preferring to resolve or retry operations when they occur rather than prevent them from happening.
Can you make these play well while limiting the amount of development resources needed?
These views are as polar opposite as you can get in the database world. In one view you lock as much as possible, preventing anyone else from making a change. In the other view you let things be as parallel as possible, and accept you will retry if two clients are updating the same document. With the nature of how many documents fit in a single block of memory, this has some real-world concerns. When you have more than one document in a memory block, you could have a situation where locking one document means 400% more documents get affected. For example, if we have an update using the IN operator with 25 entries, you could be blocking 125 documents (not 25 documents)!
That escalated rather quickly, don’t you think? Using optimistic locking in the same situation, you at most would have to retry five document write locks as the data changed. The challenge for optimistic locking is that if I have five clients that are all updating all documents, you get a flurry of updates. WCE’s come in and eventually resolve things. If you use pessimistic locking, everybody waits their turn, and each one would finish before the next could run.
Much of Percona’s engineering effort goes into what types of systems we should put in place to simulate cheap latches or locks in optimistic locking to allow pessimistic locking to work (without killing performance). This requires an enormous amount of work just to get on-par behavior from the system – specifically in update type workloads, given delete/inserts are very similar in the systems. As a result, we’ve spent more time improving the engine rather than adding additional variables and instrumentation.
Looking forward, MongoRocks aligns more to WiredTiger in its locking structure (they both run as log sequence merges or LSMs), and this means more time working on new optimizer patterns, building things to improve diagnostics or tuning the engine/system to your specific needs. We think you will be excited to see some of the progress we have been discussing for Mongo 3.4 (some of which might even make it directly into 3.2).
What is the MongoRocks anyhow and how does it compare to PerconaFT?
The last concept I want to cover is what RocksDB is exactly, what its future is and how it stacks up to PerconaFT. The most important news is Facebook is working on the core engine, which is used both by MyRocks and MongoRocks (you might have seen some of their talks on the engine). This means Percona can leverage some of the brilliant people working on RocksDB inside Facebook and focus instead on the API linking the engine into place, as well as optimizing how it uses the engine – rather than building the engine completely. Facebook is making some very significant bets on the backend use of RocksDB in several parts of the system, and potentially some user parts of the system (which have historically used InnoDB).
So what is RocksDB, and how does it work? Its core is an LSM system, which means it puts new data into the newest files as an append. Over time, the files get merged into five different levels (not covered here). As part of this process, when you have an update a new copy of the data is saved at the end of the latest file, and a memory table points a record to that location for the latest “version” of that data. In the background, the older records are marked for deletion by something called a “tombstone.” There is a background thread merging still-relevant data from old files into new files so that empty old files get unlinked from the filesystem.
This streamlines the process better than B-Tree’s constant rebalancing and empty blocks in files that need to be compacted or re-used over time. Being write-optimized means that, like PerconaFT previously, it will be faster for write-based operations than something like WiredTiger. (WiredTiger in some cases can be faster on reads, with MMAP being the fastest possible approach for reads.) This also means things like TTL’s can work pretty well in an LSM since all the items that were inserted in time order age out, and the engine can just delete the unneeded file. This solves some of the reasons people needed partitions to begin with, and it still allows sharding to work well.
We are also very excited about creating additional tools that let you query your backups in MongoRocks, as well as some of the very simple ways it will take binary-based backups quickly, consistently (even when sharded) and continually.
I hope this explains more about lock types and what their implications mean as a follow up to Peter’s blog post about the direction Percona is moving regarding PerconaFT and MongoRocks. If you want to ask any more questions, or would like another blog that covers some areas of the system more deeply, please let us know via this blog, email, twitter or even pigeon!