“SQL? Isn’t that just something my ORM tools use?” Often application developers can and should remain at least one level removed from SQL. JPA and the Entity Framework each provide more abstract (and OO-friendly) query languages, and it’s often unnecessary to write any queries at all.
If you ever diagnose production support issues, though, being proficient with SQL can greatly increase your productivity. (Picking through the data table by table and record by record can be slow and tedious; and you probably don’t want to write a bunch of one-off test apps every time a new issue arises.)
Even when you use an ORM tool’s query language, it’s advisable to understand at least the basics of SQL. A good knowledge of SQL join semantics can demystify certain behaviors that might otherwise seem counterintuitive and borderline malicious. The ability to analyze the SQL generated by your ORM tool might provide valuable insight into an elusive bug.
And you could sometimes find yourself on a project that, for whatever reason, doesn’t use an ORM tool. Tasks like ETL and reporting don’t always lend themselves to ORM-based data access. Maybe you’re writing views or stored procedures. (Dogmatic wars aside, they have their place.) Maybe you’ve inherited a legacy code base that used SQL whether it was the “right choice” or not. Maybe you’re writing in a language that doesn’t have a good ORM tool.
Of course not every database uses SQL in the first place. If you prefer NoSQL solutions, and if you have control over the toolset on every one of your projects, then you might never need to read or write a line of SQL. This article is for the rest of us.
There are plenty of good resources for learning the SQL language; my focus here will be on the ways an application developer might have to shift his or her mindset to work effectively with SQL.
Procedural vs. Declarative Programming
I’m going to start with a somewhat technical distinction because it arguably has the broadest implications.
Many developers seem to be most comfortable with the procedural programming paradigm. Roughly I mean that they write a list of instructions, and the computer follows those instructions step by step. While you may not think of Java and C# as procedural languages (since they support OO patterns as well as some declarative and functional elements), the idea of specifying an algorithm step by step remains prevalent in both languages.
In an SQL query, the prevailing paradigm is declarative. A query describes what data is needed, and it’s the DBMS’s job to decide on a series of steps that it will take to provide that data – i.e. whether to use an index or just directly access the corresponding table; the order in which to read from the tables used in the query; which join strategies to use in constructing the final result set; etc.
Part of what allows a declarative approach is the optimizer. In Java or C#, we don’t expect the compiler to be overly literal; the optimizer can erase the performance difference between x = x + 1 and x++, so we write whichever one we think makes the code more readable and we move on. So even when we think procedurally, we know that the exact execution steps may vary.
On the other side of the coin, until relatively recently database optimizers were quite limited. You might write
select e.EMPLOYEE_ID
  from EMPLOYEE e
 where exists (select 1
                 from EMPLOYEE mgr
                where e.MANAGER_ID = mgr.EMPLOYEE_ID
                  and mgr.LAST_NAME = 'Lister');
or you might write
select e.EMPLOYEE_ID
  from EMPLOYEE e
 where e.MANAGER_ID in (select mgr.EMPLOYEE_ID
                          from EMPLOYEE mgr
                         where mgr.LAST_NAME = 'Lister');
These queries describe the same result, but you might find that one runs much faster than the other. It may even be that one runs faster under certain conditions but slower under other conditions. That could put you in a bad situation where you deploy the “fast” query but it later becomes the slow query after a migration to a new DBMS, or an upgrade of the existing DBMS, or even just an accumulation of changes over time to the data in the database.
Fortunately, query optimizers have gotten better, largely due to the move from rule-based optimizers (which took cues from how you wrote the query, and sometimes even relied on “hints” to be embedded in the query) to cost-based optimizers (which use statistics about the data to predict the “cost” – which roughly means time – required for each of a set of possible execution plans).
Previously you might have compared the execution plans of the two queries above in order to determine which one you should use; today you still might examine the execution plan to diagnose a slow-running query, but this has more to do with checking the reasoning the optimizer used rather than second-guessing the actual sequence of steps. (Did it think that a join would produce only a few records, when in fact it produced many? Maybe statistics need to be updated, or maybe you’ve legitimately found a limitation of your DBMS’s optimizer.) I wouldn’t expect a modern database to care which of the above queries you use.
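To make that concrete, here’s a minimal sketch of asking the optimizer to show its reasoning. It uses Python’s sqlite3 module against an in-memory database (the EMPLOYEE schema is assumed from the examples above); SQLite’s EXPLAIN QUERY PLAN is a lightweight analogue of the execution-plan tools in larger DBMSs:

```python
import sqlite3

# Illustrative only: an in-memory SQLite database with the EMPLOYEE
# table assumed from the examples above.
conn = sqlite3.connect(":memory:")
conn.execute("""create table EMPLOYEE (
                    EMPLOYEE_ID integer primary key
                  , MANAGER_ID  integer
                  , LAST_NAME   text)""")

exists_query = """select e.EMPLOYEE_ID
                    from EMPLOYEE e
                   where exists (select 1 from EMPLOYEE mgr
                                  where e.MANAGER_ID = mgr.EMPLOYEE_ID
                                    and mgr.LAST_NAME = 'Lister')"""

in_query = """select e.EMPLOYEE_ID
                from EMPLOYEE e
               where e.MANAGER_ID in (select mgr.EMPLOYEE_ID from EMPLOYEE mgr
                                       where mgr.LAST_NAME = 'Lister')"""

# Each plan row's last column describes one step the optimizer chose;
# we inspect its reasoning rather than dictate the steps ourselves.
for label, sql in (("EXISTS form:", exists_query), ("IN form:", in_query)):
    print(label)
    for step in conn.execute("explain query plan " + sql):
        print("  ", step[3])
```

The detail lines will differ by SQLite version, but the point stands: the plan is the optimizer’s choice, not a transcript of how we phrased the query.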
The upshot is that you should generally think of SQL in declarative terms. As you incrementally assemble the query logic, instead of figuring out a “next action” to tack onto the end of your code, you’re finding a way to edit the query to bring it closer to your desired result.
Suppose you need to answer this question: “What are the names of all departments that pay an average salary greater than $50,000?” (Beyond what can be inferred from the following code, we don’t need to worry about the data model or other detailed assumptions; we’re really just concerned about the concepts here.)
Now what if you go ahead with a procedural mindset? What follows is an extreme (and simplified) example, but I have seen this done:
1) I need to get salary data, associated with department, from the EMPLOYEE table.

select e.SALARY
     , e.DEPARTMENT_ID
  from EMPLOYEE e;
2) Then I need to average the salaries by department.

with salaries as (
    select e.SALARY, e.DEPARTMENT_ID
      from EMPLOYEE e
)
select s.DEPARTMENT_ID
     , avg(s.SALARY)
  from salaries s
 group by s.DEPARTMENT_ID;
3) Next I need to filter out departments with an average salary below (or at) the cutoff.

with salaries as (
    select e.SALARY, e.DEPARTMENT_ID
      from EMPLOYEE e
)
, avg_salaries as (
    select s.DEPARTMENT_ID, avg(s.SALARY) AVG_SALARY
      from salaries s
     group by s.DEPARTMENT_ID
)
select a.DEPARTMENT_ID
  from avg_salaries a
 where a.AVG_SALARY > 50000;
4) Finally, I need to get the names of the departments.

with salaries as (
    select e.SALARY, e.DEPARTMENT_ID
      from EMPLOYEE e
)
, avg_salaries as (
    select s.DEPARTMENT_ID, avg(s.SALARY) AVG_SALARY
      from salaries s
     group by s.DEPARTMENT_ID
)
, department_list as (
    select a.DEPARTMENT_ID
      from avg_salaries a
     where a.AVG_SALARY > 50000
)
select d.DEPARTMENT_NAME
  from DEPARTMENT d
 where d.DEPARTMENT_ID in (select DEPARTMENT_ID from department_list);
Wow, SQL must be complicated, eh? That’s a long, messy query to do not very much. It will be hard to read and hard to maintain. To tell you the truth, without running it, I’m not sure if it’s even correct.
Let’s try that again with a declarative mindset. Instead of thinking “What’s the next step?” we’ll think “How do I need to change the result set?”
1) I need to get salary data, associated with department, from the EMPLOYEE table.

select e.SALARY
     , e.DEPARTMENT_ID
  from EMPLOYEE e;
2) …but I need to average the salaries by department…

select e.DEPARTMENT_ID
     , avg(e.SALARY)
  from EMPLOYEE e
 group by e.DEPARTMENT_ID;
3) …and I need to filter out departments with an average salary below (or at) the cutoff…

select e.DEPARTMENT_ID
  from EMPLOYEE e
 group by e.DEPARTMENT_ID
having avg(e.SALARY) > 50000;
4) …and I need to get the names of the departments.

select d.DEPARTMENT_NAME
  from DEPARTMENT d
 where d.DEPARTMENT_ID in (select e.DEPARTMENT_ID
                             from EMPLOYEE e
                            group by e.DEPARTMENT_ID
                           having avg(e.SALARY) > 50000);
That’s quite a bit easier to work with. Just as important, because I didn’t write the query in a way that requires each step to complete before the next can begin, the optimizer may have more flexibility to come up with an efficient execution plan.
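As a sanity check, here’s the finished query run through Python’s sqlite3 module against a few made-up rows (the schema and salary figures are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Invented sample data: Engineering averages 57,500; Support averages 42,500.
conn.executescript("""
create table DEPARTMENT (DEPARTMENT_ID integer primary key, DEPARTMENT_NAME text);
create table EMPLOYEE (EMPLOYEE_ID integer primary key, DEPARTMENT_ID integer, SALARY numeric);
insert into DEPARTMENT values (1, 'Engineering'), (2, 'Support');
insert into EMPLOYEE values (10, 1, 60000), (11, 1, 55000), (12, 2, 40000), (13, 2, 45000);
""")

# Only the department whose average salary clears the cutoff comes back.
rows = conn.execute("""select d.DEPARTMENT_NAME
                         from DEPARTMENT d
                        where d.DEPARTMENT_ID in (select e.DEPARTMENT_ID
                                                    from EMPLOYEE e
                                                   group by e.DEPARTMENT_ID
                                                  having avg(e.SALARY) > 50000)""").fetchall()
print(rows)  # [('Engineering',)]
```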
To be fair, I can’t say that all procedural thought was eliminated with this version of the query. The use of a sub-query suggests that the EMPLOYEE data should be grouped and filtered prior to being compared with the DEPARTMENT data. I could’ve written:
select d.DEPARTMENT_NAME
  from DEPARTMENT d
 inner join EMPLOYEE e
    on e.DEPARTMENT_ID = d.DEPARTMENT_ID
 group by d.DEPARTMENT_ID, d.DEPARTMENT_NAME
having avg(e.SALARY) > 50000;
(Note that I added DEPARTMENT_NAME to the group by clause; since we’re already grouping by DEPARTMENT_ID this shouldn’t matter, but it allows us to include DEPARTMENT_NAME in the select clause. The other option would be to use a gratuitous aggregation function, e.g. select max(d.DEPARTMENT_NAME). I regard this as a style choice; either is fine in my opinion, unless you find that your particular DBMS handles one better than the other.)
From a 100% declarative point of view, I could argue for this form of the query. In theory, since the join key is contained in the grouping key, an optimizer could still choose to perform the aggregation before the join (potentially saving a lot of I/O, which – as we’ll discuss next – is the performance driver for an SQL query); but this is something I still don’t expect an optimizer to figure out.
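For what it’s worth, the two phrasings really do describe the same result. A quick check with sqlite3 and the same invented toy data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Invented sample data; Engineering's average clears 50,000, Support's doesn't.
conn.executescript("""
create table DEPARTMENT (DEPARTMENT_ID integer primary key, DEPARTMENT_NAME text);
create table EMPLOYEE (EMPLOYEE_ID integer primary key, DEPARTMENT_ID integer, SALARY numeric);
insert into DEPARTMENT values (1, 'Engineering'), (2, 'Support');
insert into EMPLOYEE values (10, 1, 60000), (11, 1, 55000), (12, 2, 40000);
""")

subquery_form = """select d.DEPARTMENT_NAME
                     from DEPARTMENT d
                    where d.DEPARTMENT_ID in (select e.DEPARTMENT_ID
                                                from EMPLOYEE e
                                               group by e.DEPARTMENT_ID
                                              having avg(e.SALARY) > 50000)"""

join_form = """select d.DEPARTMENT_NAME
                 from DEPARTMENT d
                inner join EMPLOYEE e on e.DEPARTMENT_ID = d.DEPARTMENT_ID
                group by d.DEPARTMENT_ID, d.DEPARTMENT_NAME
               having avg(e.SALARY) > 50000"""

# Same declared result, whatever plan the optimizer picks for each.
print(conn.execute(subquery_form).fetchall() == conn.execute(join_form).fetchall())  # True
```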
I/O Is the Key Performance Driver
In a college course on computer architecture, we wrote MIPS assembly language programs to run on a simulator called SPIM. In this over-simplified environment, the primary unit of performance was the instruction. You could learn to play a perverse game where, for some reason you could never quite explain to your classmate, writing the code this way was a little better than writing it the obvious way, as you’d save three clock cycles.
It turns out computers are pretty good at that particular game, which is why Java and C# programmers don’t have to be. We’re free to focus on things like “don’t use bubble sort” when we think performance. We might not worry too much about how we use the hard drive since we have caches. We leave register assignment to the compiler.
Apart from algorithm complexity (the reason we avoid bubble sort), the most common performance concern is probably the network round trip. This can be relevant when interacting with the database (since we often use remote database servers). But once we’ve gotten a query to the database, what determines how well it performs?
The answer is: I/O.
The details vary from one DBMS to the next, but in general you’d like to minimize the number of blocks of data read from (or written to) the disk. As discussed above, you’ll generally leave that to the query optimizer. (Your DBA also plays a role by ensuring that the right indexes, statistics, etc. are defined for your environment so the optimizer can do its job. This is part of physical data modeling.)
You may have heard that the most expensive thing you can do in a database is a join operation. This is probably a good rule of thumb, but it’s easy to misinterpret. Several of the above examples use a subquery to compare data from one table with data from another; even though we didn’t explicitly write a join expression, the query optimizer may pick an execution plan that involves joining the tables. (Whether you would ever say it isn’t joining the tables really depends on how picky you’re being with your words; the bottom line is, you can expect a cost for combining or comparing data from two tables.)
So again remembering that you generally shouldn’t second-guess the specific steps to be taken – meaning, if the optimizer decides to do a join, it probably has a good reason and you should hesitate to try changing its mind unless you’re absolutely certain it’s made a mistake – the focus would be only on making sure you don’t involve unneeded data.
The other take-away from the fact that I/O is the performance driver is that operations that may seem processing-intensive really aren’t a performance concern most of the time. SQL supports things like CASE expressions and windowed aggregate functions that can be very powerful once you understand them. If the query already requires reading all the input data these operations need, then using them (when appropriate) probably won’t slow things down any.
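For example, here’s a sketch using sqlite3 with an invented EMPLOYEE table (window functions require SQLite 3.25 or later): a CASE expression and a windowed average both operate on rows the query was reading anyway, so they add CPU work but no extra I/O.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Invented sample data for illustration.
conn.executescript("""
create table EMPLOYEE (EMPLOYEE_ID integer primary key, DEPARTMENT_ID integer, SALARY numeric);
insert into EMPLOYEE values (1, 10, 60000), (2, 10, 40000), (3, 20, 55000);
""")

# One pass over EMPLOYEE yields a per-row salary band (CASE) and each
# department's average (windowed aggregate) with no additional reads.
rows = conn.execute("""select EMPLOYEE_ID
                            , case when SALARY >= 50000 then 'upper band'
                                   else 'lower band'
                              end as BAND
                            , avg(SALARY) over (partition by DEPARTMENT_ID) as DEPT_AVG
                         from EMPLOYEE
                        order by EMPLOYEE_ID""").fetchall()
for row in rows:
    print(row)
```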
But note that I said “when appropriate”, because…
Just Because You Can Doesn’t Mean You Should
I think most people (myself certainly included) learn SQL in two phases. First you learn all the cool things you can do in SQL. Second, you learn not to do the majority of those things (or, at least, to do them sparingly and only with good reason).
A query is almost always part of a broader system. You want to make sure that it plays its proper role and nothing more. A number of factors can limit what your query should do:
A query that lives in a data access layer shouldn’t contain business logic.
When an ETL (or ELT, or ETLT) process reads from a live transactional system, you may have to limit the performance impact of the extract phase. You could use complex extract queries to “pre-transform” the data (making the transform phase(s) redundant), but you probably shouldn’t.
Report definitions are often built around a query; but a decent reporting tool will also have its own functionality for filtering, grouping/aggregating, sorting, etc. Using the reporting tool’s functionality can make the report more flexible and interactive.
Sometimes complex queries may be justified. A data mart dedicated to a reporting application likely has a surplus of computing power, and if a complex aggregation can reduce the amount of data flowing to the reporting client’s computer by an order of magnitude, that’s almost certainly a performance win.
Even then you want to be careful, though: Depending on how your DBMS handles concurrency and locking (and possibly on your transaction isolation settings) your complex queries could block each other and erase any performance gains or, worse, cause deadlocks (which at best will result in one of your queries failing and having to be restarted).
In any case, if your default mindset is to make the query do everything, then you probably have queries that are doing too much.
Also, if you do have to write a complex query, remember…
Formatting SQL Isn’t Like Formatting Java or C#
SQL is code, and just like any other code it needs to be formatted for readability. For whatever reasons, a lot of developers seem to “give up” on formatting when they hit a block of SQL. Maybe they don’t know good standards for SQL formatting; if that’s the case, it’s worth learning. It may not matter for a one-line query, but the more complex the query the more important it is to format it well.
I don’t have a comprehensive list of SQL formatting rules; I believe the important thing is to find a style that works for you. Here are a few of the rules I find helpful.
I usually lean toward a “two-column” approach, with the major clause keywords right-justified in the first column. (The examples above use this general approach.)
At least for the top-level query, each column in the select clause gets its own line. Commas go at the beginning of a line, not at the end. (I line them up with the select.) This makes it easier to add/remove (or comment/uncomment) columns at the end of the list. (For whatever reason, this seems to be a bigger issue than being able to add, remove, or comment the first column in the list.)
Similarly, each table in the from clause gets its own line. If I’m using a comma-separated list of tables, the commas go at the start of the line (under the from). If I’m using ANSI join notation, I use indentation like this:

  from FIRST_TABLE t1
 inner join SECOND_TABLE t2
    on t1.KEY1 = t2.KEY1
  left outer join THIRD_TABLE t3
    on t2.KEY2 = t3.KEY2
(Sometimes I vary that if the from clause contains a subquery.) I don’t mix comma-separation with ANSI join notation in a single from clause.
I avoid using meaningless table aliases. Table aliases are like variable names; queries with meaningful aliases are much easier to read. I almost always do use table aliases, and I consider it good practice to use the aliases to qualify column names even if there’s only one table in the query. (This avoids a bunch of editing if you add a second table to the query later.)
I use indentation to clarify where subqueries or CASE expressions begin and end. (On a related note – though technically not a formatting issue – I generally try to avoid nesting CASE expressions. It’s just usually not needed, and it’s harder to read. Remember that WHEN clauses are evaluated in order and the first match wins, so don’t be afraid of using OR in a WHEN condition.)
I expect my queries may be used/read in 80-column terminals, so I make sure to avoid letting lines get longer than that.
If you know of an SQL style guide you like, or have other rules you think help with readability, feel free to discuss in the comments.
SQL Logic is Not Binary in Nature
There are three values for a logical expression in SQL: TRUE, FALSE, and NULL. (Values of any data type can be NULL; and you have to be aware that SQL’s NULL isn’t quite like a null pointer or null reference in most languages.)
Technically a value of NULL means “unknown”. In practice it is often used with foreign keys to mean “no related entity exists”, but the semantics of the various SQL operators all interpret it as “unknown”. So, if I have an EMPLOYEE record with a NULL DEPARTMENT_ID, how would the predicate DEPARTMENT_ID = 37 be evaluated? Since the DEPARTMENT_ID value isn’t known, we don’t know the predicate to be TRUE, but we also don’t know it to be FALSE because an unknown value just might be 37, so we evaluate it to unknown (NULL) as well.
A where clause filters out any record for which the predicate isn’t TRUE; and a CASE expression’s WHEN condition only matches records for which it evaluates to TRUE. This is why people sometimes conflate NULL and FALSE. But what happens if we apply the NOT operator to a predicate whose value is NULL? Since DEPARTMENT_ID = 37 is neither known to be true nor known to be false, DEPARTMENT_ID <> 37 also is unknown, and so is NOT (DEPARTMENT_ID = 37). The logical negation of NULL is NULL, not TRUE.
Incidentally, this is why an equality test against the NULL value doesn’t do what you’d want. The predicate SOME_COLUMN = NULL would always evaluate as NULL. Hence we have IS NULL and IS NOT NULL.
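You can watch all of this happen from Python’s sqlite3 module, which surfaces SQL NULL as Python’s None:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# An equality test against NULL yields NULL (None), not TRUE:
print(conn.execute("select NULL = NULL").fetchone()[0])       # None
# Negating an unknown predicate is still unknown:
print(conn.execute("select not (NULL = 37)").fetchone()[0])   # None
# IS NULL is the test that actually answers the question (1 = TRUE):
print(conn.execute("select NULL is NULL").fetchone()[0])      # 1
```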
The other place NULL logic values have a surprising behavior is if you apply NOT IN to a list containing a NULL value. When DEPARTMENT_ID = 37, you might expect DEPARTMENT_ID NOT IN (42, 100, NULL) to evaluate to TRUE; but since that NULL value is unknown, it could be 37, so the predicate itself evaluates to NULL (and behaves, in most cases, as though it were FALSE, creating much confusion).
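A two-line sqlite3 demonstration (again, NULL comes back as Python’s None):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Without a NULL in the list, NOT IN behaves as you'd expect (1 = TRUE):
print(conn.execute("select 37 not in (42, 100)").fetchone()[0])        # 1
# Add a NULL and the predicate becomes unknown, which filters like FALSE:
print(conn.execute("select 37 not in (42, 100, NULL)").fetchone()[0])  # None
```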
Know Your DBMS
Whenever you start working with a new DBMS, it’s worth taking the time to familiarize yourself with its particular SQL quirks; every product has its own dialect, extensions, and optimizer behaviors.
SQL is very different from most of the languages we typically work with. It grew over time to address a different type of need in a different type of environment. As with any potentially unfamiliar tool, taking the time to learn the best ways of using it is an investment. Certain problems become less frustrating, and your value to your team can increase dramatically.
There’s much more to learn about SQL than what I’ve addressed here. Getting a feel for the various SQL constructs and how to assemble them to address different problems takes time and practice; and I’m sure there are additional SQL quirks I haven’t thought of that drive application developers nuts. Still, I hope you find these pointers helpful when the time comes to write a few efficient and maintainable queries.