In this post, I talk about how we created flexible tags capable of representing a variety of semantic particularities.
A while back we needed to structure our form data quickly. It existed in the shape of answers to questions in forms, which clients were free to create in their own particular ways. An effective way of achieving structure was therefore to tag the fields in the various forms with a common set of tags, to later be able to extract the data from applications made through those forms.
These tags had some particular requirements due to the nature of the forms that our clients created on our platform. A common case was that there were slight semantic differences between questions, which could have been relevant (or not) depending on how the data would be used later. All the various edge cases are too numerous to sum up here, but essentially the tags needed to have the characteristics of a unidirectional graph.
Our infrastructure at the time made storing this data structure in MySQL the most convenient given the time constraints.
Consider a hierarchical, unidirectional graph structure, that has the following characteristics:
- It consists of semantic nodes that can be combined in a hierarchical manner to create paths with new meanings. For instance, the nodes Food , Person and Vegetarian can be combined into the paths Food > Vegetarian , and Person > Vegetarian .
- Paths can be of arbitrary length.
- Paths cannot be circular.
- We can use the paths as tags on data, and create hierarchical relations between data. Using the above example, querying data tagged with:
- Food should return any data tagged with Food or Food > Vegetarian
- Vegetarian should return any data tagged with Food > Vegetarian , Person > Vegetarian or Vegetarian .
- Food > Vegetarian should return any data tagged with Food > Vegetarian only.
Considering the graph above, we have the nodes Node A , Node B , Node C and Node D .
Together, they form the following paths and subpaths:
Path 1: Consists of Node A > Node B
Path 2 : Consists of Node A > Node B > Node C
Path 3 : Consists of Node A > Node D
We will be creating two tables, nodes for storing the nodes, and paths for storing the paths.
The nodes table is relatively straight-forward. It contains an id key and a name column:
CREATE TABLE `nodes` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(30) NOT NULL
The paths table is more complicated.
Because we’re using MySQL to store our structure, and storing resources per-row is idiomatic and queryable in MySQL, we can’t store a full path in one row. Instead, we store each edge between the nodes of a path as a separate row using a from_node and to_node column referring to the id key of our nodes table. Then we group together the edges using a common path_id key.
Because order matters and the graph is unidirectional, we need to store which part of the path the edge represents. It is possible to deduce the order without explicitly declaring it by inferring it from the from_node and to_node columns, however it would make queries prohibitively expensive, especially with deeper paths. Therefore, we add an edge_order column:
CREATE TABLE `paths` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`path_id` int(11) NOT NULL,
`from_node` int(11) NOT NULL,
`to_node` int(11) NOT NULL,
`edge_order` tinyint(1) NOT NULL
Let’s begin by populating the nodes table:
INSERT INTO `nodes` (id, name) VALUES
(1, ‘Node A’),
(2, ‘Node B’),
(3, ‘Node C’),
(4, ‘Node D’)
Now, storing paths does come with some considerations if we want to be able to query every subpath.
Consider Path 3 . It contains the subpaths A → B and B → C , but also the paths representing each of the nodes in isolation ( A → A, B → B, C → C ). Therefore, to store Path 3 and make every subpath in it a tag we can query as well, we’d need to do the following:
INSERT INTO `paths` (path_id, from_node, to_node, edge_order) VALUES
-- Store each single-node subpath A->A, B->B and C->C
(1, 1, 1, 0),
(2, 2, 2, 0),
(3, 3, 3, 0),
-- Store the shorter subpaths A->B (Path 2) and B->C
(4, 1, 2, 0),
(5, 2, 3, 0),
-- Store A->B->C (Path 3).
-- Note the edge_order incrementing here.
(6, 1, 3, 0),
(6, 3, 4, 1)
As you can see, a path consisting of three nodes should actually create 6 paths: the main path and 5 sub-paths.
Implementations need to consider the case where subpaths already exist — for instance, if Path 2 is created before Path 3 . In that case, we would actually only need to create the paths C → C, B → C and A → B → C to fully represent all subpaths.
This is where the fun part begins. This essentially involves doing a group_concat with conditionals.
Say we want to retrieve all paths from the paths table, in the format of (path_id , path_name) where path_name is the string “Node X > Node Y > …”, or “Node X” if it’s a single-node path.
With Path 3 in our paths table, this should produce:
1 “Node A”
2 “Node B”
3 “Node C”
4 “Node A > Node B”
5 “Node B > Node C”
6 “Node A > Node B > Node C”
Here’s the full query for this result:
paths.path_id AS id,
—- If the current edge is not the first edge of the path
CASE WHEN paths.edge_order > 0
—- Just append the next to_node
CONCAT(‘ > ‘, to_node.name)
—- Otherwise, if it’s the first edge
-— Check if it’s a single-node edge and just
-- add the name of that node
CASE WHEN to_node.id = from_node.id
—- Or start building the path name by taking
-- the root node name and appending the next
-- node name to it
CONCAT(from_node.name, ‘ > ‘, to_node.name)
-- This is crucial to ensure that edges
-- are ordered correctly before the concat
ORDER BY paths.edge_order ASC
) AS name
-- We join in the nodes table twice, once for the from_nodes
-- and once for the to_nodes of each path
INNER JOIN nodes AS from_node ON from_node.id = paths.from_node
INNER JOIN nodes AS to_node ON to_node.id = paths.to_node
-- And finally we group by the path_id
GROUP BY paths.path_id
The easiest way to handle any sort of query at this point, is to create a view table based on this query. Let’s assume we name it tags , and it simply has an id and name column. At this point, we can use LIKE to find the tag or tags we need.
However, it is also possible to query the nodes and paths tables directly for more complex queries, such as retrieving all paths containing a certain subpath, or all paths containing certain nodes. This employs nested selects and subqueries however, and we found that using the view is the most performant solution as well as the easiest, at least when using MySQL.
That’s it for this time! Feel free to shoot any questions in the comments field.