Assertions, regression tests, and version control
December 2013 (postdoc)
Assertions, regression tests, and version control systems are valuable tools that all programmers should learn sooner rather than later. They make it significantly easier and more fun to write large programs.
Programming any non-trivial piece of software feels like rock climbing up the side of a mountain. The larger and more complex the software, the higher the peak.
You can’t make it to the top in one fell swoop, so you need to take careful steps, anchor your harnesses for safety, and set up camp to rest. Each time you start coding on your project, your sole goal is to make some progress up that mountain. You might struggle a bit to get set up at first, but once you get going, progress will be fast as you get the basic cases working. That’s the fun part; you’re in flow and slinging out dozens of lines of code at a time, climbing up that mountain step by steady step. You feel energized.
However, as you keep climbing, it will get harder and harder to write each subsequent line. When you run your program on larger data sets or with real user inputs, errors arise from rare edge cases that you didn’t plan for, and soon enough, that conceptually elegant design in your head gives way to a tangled mess of patches and bug fixes. Your software starts getting brittle and collapsing under its own weight.
Somewhere around a few hundred lines of code, your code base becomes too large to all fit inside of your head. You start to forget why you wrote certain code in a quirky way. And you don’t quite remember what you were trying to do last week, or what assumptions about your internal data structures still hold true. Is that snippet of documentation from last month still valid? What about this TODO note? And how come this function used to work so well but doesn’t anymore? What made it suddenly break? Wait, how did this other module ever work? ARGH!!!
You can write small pieces of code (up to 100 lines or so) without any protection. But as you scale up to 500 lines, 1000 lines, 5000 lines or more, you will need to use the following tools as code carabiners to prevent you from falling to your death as you climb higher and higher:
- Assertions
- Regression tests
- Version control
This article isn’t news to professional software engineers, but more and more people who aren’t trained in software engineering now need to write complex software, so it’s very important to teach them to use these tools.
Let’s now discuss each kind of code carabiner in turn.
Assertions, usually written as an assert statement in many programming languages, are conditions that should always be true about your code. Why should your code check those conditions even though you know that they ought to be true? Because your software doesn’t always behave like you expect, especially when running on unknown, messy real-world inputs and data sets that don’t fit your prior assumptions.
Get used to writing as many meaningful assertions as possible in your code, and adding extra bookkeeping code to open up opportunities for even more assertions. For instance, I just wrote a 150-line recursive function that simultaneously traverses a string while doing a depth-first traversal of an associated tree data structure. This function is supposed to process every character exactly once, so I added extra bookkeeping code to make a note of which characters were processed. At the end of the function, I wrote an assert statement that asserts that the list of processed characters is identical to the original string. Now every time I run this crazy function on newer and bigger inputs, as long as the assert statement doesn’t fail, then I gain just a bit more confidence that my function is correct.
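The bookkeeping idea can be sketched in a few lines of Python. This is not the 150-line function described above; the name process_tree, the dictionary-based tree layout, and the "span" and "children" keys are all assumptions made up for illustration. Leaves are assumed to hold (start, end) index ranges into the string.

```python
def process_tree(text, tree):
    """Depth-first walk over a hypothetical tree whose leaves cover spans of `text`."""
    processed = []  # extra bookkeeping: every character handled, in order

    def visit(node):
        children = node.get("children", [])
        if not children:
            start, end = node["span"]
            for ch in text[start:end]:
                processed.append(ch)  # real per-character work would go here
        for child in children:
            visit(child)

    visit(tree)
    # Every character must have been processed exactly once, in order.
    assert "".join(processed) == text, "processed characters do not match the input"
    return processed


# Hypothetical tree whose leaves cover "hel" and "lo".
tree = {"children": [{"span": (0, 3)}, {"children": [{"span": (3, 5)}]}]}
process_tree("hello", tree)
```

If a leaf were missing, or two leaves overlapped, the assert at the end would fail right away instead of letting a half-processed string cause trouble somewhere later.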
From an informal count, assert statements comprise roughly 3% of the total lines of code in a few multi-thousand-line projects of mine. They’ve helped me catch countless bugs that would’ve made me gouge my eyes out had they not tripped an assert and instead crashed the program later in some subtler way.
Assertions are super easy to write and don’t take any extra infrastructure to set up, so there is no excuse not to use them. If you code without assertions, that’s like climbing a mountain unprotected without any ropes or harnesses. For more information, read The benefits of programming with assertions (a.k.a. assert statements).
Regression tests are the next, more serious kind of code carabiner. The easiest way to set up a suite of regression tests is to provide some text-based output format for your program (e.g., on stderr), run your program and save that output to files, and then diff the new versions against the existing ones to see if anything changed.
Here is a super-simple Python framework that I created in grad school and still use today. You can probably write your own or use an existing open-source solution. There’s no need to get fancy – just have some kind of automated tests for your program. Simply running your code manually and “eyeballing” the results isn’t going to cut it. Your memory isn’t that good, so let the computer do all the hard work for you.
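To make the save-and-diff idea concrete, here is a minimal sketch using only the Python standard library. It is not the framework mentioned above; the function name check_against_golden and the ".golden" file suffix are assumptions chosen for illustration.

```python
import difflib
from pathlib import Path


def check_against_golden(name, actual, golden_dir):
    """Compare `actual` text with the saved expected output for test `name`.

    Returns a list of diff lines; an empty list means nothing changed.
    """
    golden_path = Path(golden_dir) / f"{name}.golden"
    if not golden_path.exists():
        # First run: record the current output as the expected output.
        golden_path.write_text(actual)
        return []
    expected = golden_path.read_text()
    return list(difflib.unified_diff(
        expected.splitlines(), actual.splitlines(),
        fromfile="expected", tofile="actual", lineterm=""))
```

You would call this once per test case with whatever text your program printed, and treat any non-empty result as a possible regression to look at. If the change was intentional, delete the saved file so the next run records the new output.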
When you have a test suite of expected behavior that you run on a regular basis (e.g., every time you make a non-trivial change to your code), then you can be sure that you’re not making backward progress. A test suite ensures that you keep climbing up the mountain – however slowly at times – and that you don’t regress and slip back down.
The best part about setting up a regression test suite is that as you find more bugs, you can add additional tests and make your suite even more effective at finding additional bugs. It’s like you’re adding more and more safety harnesses, and eventually there’s no way for you to fall. For instance, my Online Python Tutor project has a suite of over 100 tests, many of which came from bug reports, and I have an entire GitHub repository for IncPy regression tests. Without all of those tests, there would be no way for me to get those multi-thousand-line programs to work reliably. (Of course, you can get paralyzed by spending too much time writing test infrastructure, so don’t overdo it. But most people are probably better off writing more tests, not fewer.)
Finally, regression tests are even more powerful when combined with assertions, since your assertions get checked every time the tests run.
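As a rough illustration of that combination, the sketch below runs a function over a list of test inputs and records every input that trips an assertion. Both normalize and run_suite are hypothetical names invented for this example, not part of any real project mentioned here.

```python
def normalize(scores):
    """Scale scores so they sum to 1 (hypothetical function under test)."""
    total = sum(scores)
    assert total > 0, "scores must have a positive sum"
    result = [s / total for s in scores]
    assert abs(sum(result) - 1.0) < 1e-9, "normalized scores must sum to 1"
    return result


def run_suite(func, test_inputs):
    """Run func on each input and collect the inputs that trip an assertion."""
    failures = []
    for test_input in test_inputs:
        try:
            func(test_input)
        except AssertionError as exc:
            failures.append((test_input, str(exc)))
    return failures


failures = run_suite(normalize, [[1, 2, 3], [0.5, 0.5], [0, 0]])
for test_input, message in failures:
    print("FAILED on", test_input, "-", message)
```

One caveat: Python skips assert statements when run with the -O flag, so the test suite should be run without it.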
Lots of people emphasize the importance of version control and teach all sorts of advanced ways to use those systems. However, in my view, version control systems, no matter how sophisticated, are useless as code carabiners without a good set of assertions and regression tests. After all, how else are you going to know what changes to undo if you don’t have a good indication of what works and what doesn’t?
The bottom line is to learn to use version control sooner rather than later. And if you’re too lazy, at least put your code files in Dropbox, since it saves a limited version history. But start with assertions right away!
Conclusion: Restraints == Freedom
Some readers might think that latching onto these code carabiners – assertions, regression tests, and version control – will restrain them too tightly. These all seem like “stuffy” software engineering practices that are suitable for large commercial projects, but not for the kinds of creative hacks that you might want to do in your spare time. If you’re like I was back in my youth, you want to play it fast and loose, whipping out code in a caffeine-fueled late-night frenzy. Pffft! Who needs assertions? Who needs tests? Who needs version control?
In reality, the safety and restraints that these code carabiners provide actually give you more freedom to take risks in your coding. If you want to try out some risky feature, refactoring, or external library, you know something is wrong as soon as one of your assertions or tests fails, and you can revert to an earlier working state. Thus, no matter how high up the mountain you climb, with a proper set of code carabiners, you can fall down only to where you last latched on, not to your death thousands of feet below. Happy hacking!
On 2014-01-10, this article was republished at the O’Reilly Programming Blog.
Created: 2013-12-14
Last modified: 2013-12-15