Recently I worked on a feature that involved displaying a list of Folders in an auto-complete field. A simple enough task, but one with a bit of complexity behind it. For starters, each organization in our system has its own "folder hierarchy", a tree structure of folders that can be one to many levels deep. Second, the auto-complete UI we wanted meant that we would need to "search" through the folders to find a match based on the user's input, and also display path information (where each folder lives) so that the user can tell apart folders that happen to have the same or similar names. And since it was auto-complete, the searching also needed to be fast, or the user experience would be annoying.
The folder tree structure was stored in a MySQL DB table using a very typical self-referencing schema for hierarchical data (what Karwin calls the adjacency list). Each row had a column referencing the "parent" folder's id, and if there wasn't one, it was a "root" folder. Each row in the table could be represented by a Java object called Folder. Getting folders by ID, and getting them in a tree-like structure, was pretty straightforward. The path string, however, wasn't part of the deal.
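To make the setup concrete, here's a minimal sketch of what that self-referencing representation looks like on the Java side. The field names are my own illustration, not the actual production schema:

```java
// Sketch: a Folder row from a self-referencing (adjacency list) table.
// parentId refers to another row's id, or is null for a "root" folder.
// The path field is transient: computed at traversal time, never persisted.
class Folder {
    long id;
    Long parentId;          // null for a root folder
    String name;
    transient String path;  // e.g. "Clients / Acme / Reports", filled in later

    Folder(long id, Long parentId, String name) {
        this.id = id;
        this.parentId = parentId;
        this.name = name;
    }
}
```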
My first thought was that the paths would need to be pre-computed ahead of time so that they wouldn't have to be recalculated, an expensive and slow operation, every time the user typed a letter. One way to do it would be to have some application code insert into a table of folder paths whenever a folder was saved. Our resident MySQL expert, Ike Walker, suggested I look into the closure table pattern, discussed in Chapter 3, "Naive Trees", of Bill Karwin's book "SQL Antipatterns". Yes, this was exactly what I was thinking of! However, as I thought about it more, there were some things I didn't quite like about it. First, there was already a lot of existing data: using closure tables would mean scripting something to populate the new table and running it at least once. Having to script that also made me wonder how the table would get updated afterwards. "SQL Antipatterns" uses a discussion thread as its example. Individual comments have the same kind of tree structure that folders might have, but comments and replies typically don't move around, whereas folders do. Whenever a folder moved, its path, along with the paths of every folder in its branch, would have to be updated to stay in sync with the underlying tree. The thought of having to write that updating code made me think twice about my initial approach.
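For readers unfamiliar with the pattern: a closure table stores one row per ancestor-descendant pair (including each node paired with itself), which is what makes subtree queries cheap but moves expensive. A rough sketch of the rows it would hold, given folders in adjacency-list form; the names here are illustrative, not anything from our codebase:

```java
import java.util.*;

// Sketch: enumerate the (ancestor, descendant, depth) rows a closure
// table would store for a tree given as a child -> parent map.
// Every folder is its own ancestor at depth 0.
class ClosureSketch {
    // parentOf maps folder id -> parent id (null for roots)
    static List<long[]> closureRows(Map<Long, Long> parentOf) {
        List<long[]> rows = new ArrayList<>();
        for (Long id : parentOf.keySet()) {
            long depth = 0;
            Long ancestor = id;
            while (ancestor != null) {
                rows.add(new long[]{ancestor, id, depth});
                ancestor = parentOf.get(ancestor); // walk up toward the root
                depth++;
            }
        }
        return rows;
    }
}
```

For a three-folder chain (1 → 2 → 3) this produces six rows, and you can see why moving a branch means rewriting every pair that crosses the moved edge.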
Caching to the rescue
So what does a developer do when there's a problem with performance? Cache. So I thought: how can caching help me here? As I pointed out earlier, there was already a Java object, Folder, and a way to get the folders in a tree structure. If Folder had a field to store the path, then I'd be halfway there, right? But wait, doesn't that mean I'd be storing the path in the DB again? Not if I put it on a transient field, populated only while the tree structure is being built up. Hold on, doesn't that mean the paths still need to be calculated every time we search the tree? Here's where the caching bit comes in. The folder tree is traversed once, and as it recurses, the paths are computed and stored on the transient field of each Folder object. At the same time, the Folders are collected into a flat List that can be used for iterating, searching, and so on. Next, the method that retrieves the folder list is cached, with the organization's ID in the cache key so that each organization's list is kept separate. To make sure we don't cache large lists of heavy Folder objects, I cached only lists of Folder IDs (not objects) and made the Folder objects themselves cacheable by ID. Okay, then what about the updating part I talked about? The cached list of Folders is invalidated whenever a Folder update occurs. The paths do need to be re-computed, but only once until the next update, and that seemed reasonable enough.
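The traversal I described above can be sketched as follows. This is a simplified, self-contained version, with made-up names and a hard-coded " / " separator, not our actual code, and the caching layer itself is elided:

```java
import java.util.*;

// Sketch: traverse the folder tree once, computing each folder's path
// on its transient field, and flatten everything into a List that the
// autocomplete search can iterate over.
class FolderPathCache {
    static class Folder {
        long id;
        Long parentId;
        String name;
        String path; // transient in the real entity

        Folder(long id, Long parentId, String name) {
            this.id = id;
            this.parentId = parentId;
            this.name = name;
        }
    }

    // Group folders by parent, then recurse from the roots, building
    // "parent / child" paths as we go.
    static List<Folder> flattenWithPaths(List<Folder> all) {
        Map<Long, List<Folder>> children = new HashMap<>();
        List<Folder> roots = new ArrayList<>();
        for (Folder f : all) {
            if (f.parentId == null) roots.add(f);
            else children.computeIfAbsent(f.parentId, k -> new ArrayList<>()).add(f);
        }
        List<Folder> flat = new ArrayList<>();
        for (Folder root : roots) visit(root, "", children, flat);
        return flat;
    }

    private static void visit(Folder f, String prefix,
                              Map<Long, List<Folder>> children, List<Folder> flat) {
        f.path = prefix.isEmpty() ? f.name : prefix + " / " + f.name;
        flat.add(f);
        for (Folder child : children.getOrDefault(f.id, Collections.<Folder>emptyList())) {
            visit(child, f.path, children, flat);
        }
    }

    // Autocomplete match against the already-pathed flat list.
    static List<Folder> search(List<Folder> flat, String input) {
        String q = input.toLowerCase();
        List<Folder> hits = new ArrayList<>();
        for (Folder f : flat) {
            if (f.name.toLowerCase().contains(q)) hits.add(f);
        }
        return hits;
    }
}
```

In the real thing, the result of `flattenWithPaths` is what gets cached per organization (as IDs, with the Folder objects cached separately by ID), so `search` only ever runs against pre-computed paths.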
In the end, I didn’t use the closure table, although I’m keen to use it if the opportunity arises. One of the keys to arriving at this solution was knowing not only the data structure but also the nature of the data: how it’s used and how often it tends to change. In all fairness, Bill Karwin also points out frequent updating as a weakness of the closure table pattern. So there you have it, my friends: use those closure tables if you can, and if you can’t, then hopefully the approach I used here gives you some ideas!