The Forkless Philosopher

Unit Tests

Definition

Unit tests are small tests that each exercise a unit in such a way that, taken together, they allow a statement about the functional correctness of the unit under test.
What is a unit? In Working Effectively With Legacy Code (Prentice Hall, 2004) Michael C. Feathers defines a unit as: "In procedural code, the units are often functions. In object-oriented code, the units are classes".
A more paradigm-agnostic approach would be to define a unit as "the smallest functional unit in your code that can be tested on its own, outside of any context".
The second definition has a certain appeal, because it can have a positive effect on the way you design classes, mainly because it fosters a culture that adheres to the KISS (Keep It Small and Simple) principle: if you treat individual methods of a class as units, you need to think about how you can unit-test them. That in turn might lead to a different view of which methods actually belong in which class and, in consequence, to a finer granularity of classes. Which in turn means better testability.

Why should you do unit testing?

Because it makes things so much easier if you or someone else has to refactor the code later. Or has to change it in some way. As long as the unit tests still come up green, you know you didn't break anything.
Because unit tests are excellent documentation. They tell everybody in a very concise way "if I use this unit in this way, this is what is supposed to happen". Do you always explain the behaviour of a class or a function in some lines of comment preceding its code? Exactly.
Because it makes your life easier when a regression test fails. With unit tests in place you can approach the bug with the assumption "assuming the code covered by the unit tests is OK, what else could have gone wrong?". The alternative would be long sessions with a debugger, stepping through the sources, checking return values of methods... basically the same sort of work that unit tests do, except that they are far better equipped for it.
Another reason (and one already mentioned in the section above) is that, once you start securing your code with unit tests, it affects the way you code. You start to design your architecture and your classes with testing in mind (well, you have to), and that will usually change the way you develop code for the better. In C++ Coding Standards (Addison-Wesley, 2004) Herb Sutter and Andrei Alexandrescu sum it up: "Good design is testable, and design that isn't is bad".

Test-driven development vs. writing unit tests retrospectively?

Test-driven development describes the practice of first writing a test, which fails, and then writing the code that makes it pass. Once the test passes, you know you are finished writing the code. Kent Beck describes the technique in detail in Test-Driven Development By Example (Addison-Wesley, 2002). Test-driven development is certainly a very good way to arrive at clear, easy-to-maintain code, but especially with quick prototyping it can sometimes be a bit of a hassle. And anyway, the technique can only be applied when writing new code.
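To make the cycle concrete, here is a minimal sketch in C++ (using plain assert() instead of a real test framework, and borrowing the leap year example that appears later in this article; all names are my own):

    #include <cassert>

    bool IsLeapYear( int year ); // forward declaration so the test compiles

    // Step 1 (red): the test comes first. At this point it fails - in C++
    // it does not even link, because IsLeapYear() has no implementation yet.
    void TestIsLeapYear()
    {
        assert( IsLeapYear( 1996 ) );
    }

    // Step 2 (green): write just enough code to make the test pass.
    bool IsLeapYear( int year )
    {
        return year % 4 == 0; // incomplete, but sufficient for this one test
    }

    int main()
    {
        TestIsLeapYear();
        return 0;
    }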
But of course you can also write unit tests retrospectively, even if that is sometimes frowned upon, mainly because it will then usually not be done at all. Still, it's better to write unit tests retrospectively and at least get some unit tests from time to time than not to have any unit tests at all.
There is actually even an argument for writing unit tests retrospectively: it can then be done by somebody else, which somewhat minimizes the risk of making the same wrong assumption twice, first while coding the method under test and second when writing the test (finding the right number for the calendar week of January 1st is a very good example of this sort of problem). Which brings us to the topic of validation.

Test validation

One of the things to keep in mind when writing unit tests is that even the best test is only as good as the data used for input and for validation.
When writing unit tests retrospectively you might be tempted to look at what the unit does with the arguments received and then model the tests on that data. However, this approach inherently assumes that the unit to be put under test is working correctly at the time you are writing the test. What if it isn't? Then the fault not only goes undetected, it will also be harder to find later: since the unit test is assumed to know what it is doing, the unit will be the last place you look when searching for the cause of a bug.
In that sense, even using an existing version of the software as a reference while doing a complete rewrite can be wrong. You might not look at the source, but you look at the results and inherently treat them as correct although they might not be. (This happened to me once: I was doing a complete rewrite of a module displaying stability data of some sort and used the old module as a reference for some tests. When I finally presented the finished module to the person responsible for the feature, he looked at it and said: "that bit of data is wrong". I pointed out that it was the same as in the old version, only to learn that it had been wrong in the old version as well, and the old version therefore buggy. So much for references.)
A better approach is to get validation data independently of any existing code, through a thorough understanding of the domain, i.e. by reading and understanding the specification (assuming one exists) and by understanding the underlying knowledge domain, be that a technological domain or an economic one (think of a method that calculates interest, for example).
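A minimal sketch of what I mean, with a hypothetical interest function - the point being that the expected value in the assertion is worked out by hand from the domain formula, not by running the code first:

    #include <cassert>
    #include <cmath>

    // Hypothetical unit under test.
    double SimpleInterest( double principal, double rate, double years )
    {
        return principal * rate * years;
    }

    int main()
    {
        // The expected value is derived by hand from the domain formula
        // (interest = principal x rate x time), not taken from the code:
        // 1000 * 0.05 * 2 = 100.
        assert( std::fabs( SimpleInterest( 1000.0, 0.05, 2.0 ) - 100.0 ) < 1e-9 );
        return 0;
    }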
There is, of course, an exception to the rule that you shouldn't use the existing code as a base for writing tests. If other people are dependent on the way your system behaves, then you have to make sure that, when adding or changing parts of it, you maintain the current behaviour. Even if that means preserving wrong behaviour. The reason for this is simple: people might actually depend on errors. Be it that they have developed workarounds for them, or that they are actually exploiting bugs to their advantage. If you fix these bugs, thus altering the behaviour of the system, it might break other people's functionality or, worse still, silently lead to wrong data. Rumour has it that some Windows(tm) versions were downward-compatible to such a degree that wrong behaviour was preserved just because some widespread games relied on that wrong behaviour. For a more detailed discussion of this see the already mentioned Working Effectively With Legacy Code (p. 186 ff).
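Feathers calls tests written for this purpose characterization tests: you run the existing system, record what it actually does, and pin that behaviour down. A minimal sketch, with an invented legacy function:

    #include <cassert>

    // Hypothetical legacy function whose current behaviour - "round half
    // up" - must be preserved, whether or not the specification asks for it.
    int LegacyRound( double value )
    {
        return static_cast<int>( value + 0.5 );
    }

    int main()
    {
        // Characterization test: the expected value was obtained by running
        // the existing system once and recording what it actually returned.
        // It pins down current behaviour; it says nothing about correctness.
        assert( LegacyRound( 2.5 ) == 3 );
        return 0;
    }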

Testability of code

Another problem you might encounter when writing unit-tests retrospectively is that the code you are writing the tests for does not seem to be easily testable. You might need to instantiate classes and fill them with meaningful values because the unit you want to test expects these objects as arguments; you might not be able to run a unit test offline because the method under test retrieves data from a server. In other words: when the code was originally written, it was not written with unit-testing in mind.
There are two concepts that might help you write code that is easily testable.
The first is Programming To An Interface. Instead of using concrete classes in declarations, you define an interface for each class and use these interfaces to declare member fields or arguments. This enables you to develop so-called "mock objects" that you can swap in for the real ones: production code uses the real classes, unit tests use simple mock objects wherever the real class for some reason or another can't be instantiated or used - for instance because it requires access to a specific server that can't be reached from where you are testing.
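A minimal C++ sketch of the idea (the interface, the mock and the unit under test are all invented for illustration):

    #include <cassert>
    #include <string>

    // The interface the production code is written against.
    class IDataSource
    {
    public:
        virtual ~IDataSource() {}
        virtual std::string Fetch() const = 0;
    };

    // Production implementation - talks to a server (details omitted):
    // class ServerDataSource : public IDataSource { ... };

    // Mock implementation for unit tests: no network access required.
    class MockDataSource : public IDataSource
    {
    public:
        std::string Fetch() const override { return "canned test data"; }
    };

    // The unit under test only knows the interface, not the concrete class.
    std::size_t CountCharacters( const IDataSource& source )
    {
        return source.Fetch().size();
    }

    int main()
    {
        MockDataSource mock;
        assert( CountCharacters( mock ) == 16 );
        return 0;
    }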
The second concept is that of KISS - Keep It Small and Simple. If a method does only one job; if a class offers all the functionality really needed to do what it promises but nothing more (delegating tasks to other classes where appropriate), then your tests will be easy to write. Or seen the other way round: if your tests are hard to write, this is a good indicator that your code could do with some refactoring to make it simpler and to break dependencies. Working Effectively With Legacy Code deals with this in depth.

Code coverage

Code coverage is a metric for the amount of code covered by unit tests. Or to put it in simpler terms: the percentage of instructions in your code that are actually executed by unit tests.
Your aim, of course, should be 100%. 100% code coverage means that every single execution path in your program is covered by at least one test. If you follow the test-driven development approach, this will be pretty easy to achieve; it actually comes as a by-product of test-driven development.
If you write your unit tests retrospectively, achieving 100% code coverage might be much harder - especially if the code was not written with unit testing in mind. In large units full of convoluted code it might not be easy to identify every single execution path, let alone put it under test. If that is the case, though, it is a strong indicator that the unit could really do with some refactoring towards code that better adheres to the KISS principle.
Even if 100% code coverage for the whole application seems somewhat ambitious, not to say outright utopian, you should at least try to achieve 100% code coverage for each individual unit under test. The reason for this is simple: if less than 100% of your unit is under test, you might get false positives - i.e. the tests pass even though the unit is buggy, because the bug resides in the untested bits.
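A tiny example of how such a false positive comes about (hypothetical function, deliberately buggy):

    #include <cassert>

    // Hypothetical unit with two execution paths.
    int Absolute( int value )
    {
        if ( value >= 0 )
            return value;
        return value; // BUG: should be -value, but this branch is never tested
    }

    int main()
    {
        // Only the positive path is covered, so coverage is below 100% -
        // and the test passes although Absolute( -3 ) would return -3.
        assert( Absolute( 3 ) == 3 );
        return 0;
    }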

How many different tests should you write?

How many unit tests you should write for a single function depends on what possible errors you want or need to catch. If you assume that the caller of the function always keeps his part of the contract and only calls the function with valid arguments, then the number of unit tests necessary to ensure correctness only depends on the algorithm the function performs on the argument(s).
Let's assume the function under test is bool IsLeapYear( int year ). Then four tests, with the years 1996, 2000, 2002 and 2100, are sufficient (the first two years are leap years, the last two are not).
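In code, with plain assert() and a correct implementation added by me so the tests can actually run, that might look like this:

    #include <cassert>

    // A correct implementation, included here so the tests compile and run.
    bool IsLeapYear( int year )
    {
        if ( year % 400 == 0 ) return true;
        if ( year % 100 == 0 ) return false;
        return year % 4 == 0;
    }

    int main()
    {
        assert( IsLeapYear( 1996 ) == true );  // divisible by 4
        assert( IsLeapYear( 2000 ) == true );  // divisible by 400
        assert( IsLeapYear( 2002 ) == false ); // not divisible by 4
        assert( IsLeapYear( 2100 ) == false ); // divisible by 100, not by 400
        return 0;
    }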
But if you want to be absolutely sure that your code works correctly even if the contract is broken (which might not even be intended by the caller), then more tests are necessary. For an integer argument I would always test for zero as well, because an attempt to divide by zero will definitely crash your program (unless caught). Other prime candidates for tests are type boundaries and values close to them. For a 16-bit integer that would be 0, 65535 (the unsigned boundary), 65536 (exceeds the maximum capacity of a 16-bit integer) and -32768 (the lower boundary of a signed 16-bit integer), in addition to the four tests already mentioned above. (The boundaries of an integer are always interesting because, at least in C++, integer types can be signed or unsigned, and if you mix the two by mistake you might end up with rather strange results.)
For strings you should at least test for the empty string and, at least in an environment where mixed C/C++ code is used, some really, really long string to catch buffer overflows with sprintf() and related functions.
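A sketch of such defensive tests, with a hypothetical string-taking function:

    #include <cassert>
    #include <string>

    // Hypothetical unit under test, hardened against a broken contract.
    bool IsValidName( const std::string& name )
    {
        return !name.empty() && name.size() <= 256;
    }

    int main()
    {
        assert( IsValidName( "Alice" ) );

        // Contract-breaking inputs: the empty string and a really, really
        // long string - the kind that exposes sprintf()-style buffer
        // overflows in mixed C/C++ code.
        assert( !IsValidName( "" ) );
        assert( !IsValidName( std::string( 1000000, 'x' ) ) );

        // For integer arguments the analogous candidates would be zero
        // and the type boundaries named above.
        return 0;
    }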

How many arguments are too many?

That leads us directly to another thing to think about when writing unit tests: how many arguments may a function under test have?
In Clean Code (Prentice Hall, 2008) Robert C. Martin looks at the problem as a matter of code readability; he names zero arguments as the optimum, followed by functions with one and two arguments. Triadic functions (functions with three arguments) should be avoided where possible, and more than three arguments should not be used at all.
From the perspective of unit tests, the same is true but for a different reason. And that is: the number of unit tests you will have to write grows exponentially with the number of arguments.
Think of the example in the paragraph above. For bool IsLeapYear( int year ) we arrived at eight unit tests for full coverage of all risks. Now assume a function that takes two arguments that interact strongly within the function, and for each argument we again have eight distinct values that need to be included in our tests. We would have to write a whopping 8 x 8 = 64 tests to cover all possible combinations. Anyone for a third argument? No? Thought so.
I once heard of a company that solved the problem by generating the unit tests automatically with a script. A valid approach, but then you a) need to be able to derive the expected results automatically as well, and b) your build will take rather more time than it takes to make a coffee - and so the unit test idea is undermined, because unit tests should be run at every build. Which should be reason enough to keep the number of tests to the necessary minimum.
There is more to this, of course. You might come up with the idea of encapsulating data in a struct or plain old data (POD) class. Unfortunately that doesn't count as reducing the number of arguments. In this context, for "argument" read: every single variable that is passed into the function and used in it. This not only includes variables that are bundled together in structs or POD classes; it also includes any global variables or member fields used by the function/method.
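A small invented example to illustrate the point - the method below takes one argument on paper, but has three inputs in testing terms:

    #include <cassert>

    int g_taxRatePercent = 19; // global state read by the method under test

    class Invoice
    {
    public:
        explicit Invoice( int discountPercent )
            : m_discountPercent( discountPercent ) {}

        // Apparently a single-argument method...
        int GrossAmount( int netAmount ) const
        {
            // ...but the result also depends on the member field and the global.
            int discounted = netAmount - netAmount * m_discountPercent / 100;
            return discounted + discounted * g_taxRatePercent / 100;
        }

    private:
        int m_discountPercent;
    };

    int main()
    {
        Invoice invoice( 10 );
        // Three inputs in testing terms: the argument (100), the member
        // field (10% discount) and the global (19% tax).
        assert( invoice.GrossAmount( 100 ) == 107 );
        return 0;
    }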

Which brings us to a nice conclusion: if you start relying on unit tests, it will fundamentally alter the way you design your code. You will start to avoid monster methods with plenty of arguments just so that you don't have to write so many unit tests. Which will result in cleaner, more concise code.

What unit tests can do. And what not.

Finally, some words on what a unit test does, and what it does not. And that means stating the naked, ugly truth: no number of unit tests can assure the correctness of the unit under test. All a unit test does is compare data generated by some function against data provided by you, the author of the unit test.
There are three possible ways to arrive at unit tests that pass even if the method under test is not working correctly:

First, your knowledge of the domain might be insufficient. This is especially bad because you will write the unit tests under the same faulty assumptions as the code under test.
Take leap year calculation, for example. A year is a leap year if it is divisible by four without remainder - unless it is also divisible by 100, in which case it is only a leap year if it is also divisible by 400. The year 2000 is a leap year, 2100 is not. However, if you only know about the rule of division by four, assertEquals(true, calendar.isLeapYear(2100)) will pass, although the result is wrong.
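Here is the trap in code form (a C++ sketch with plain assert()) - code and test share the same faulty assumption, and the test passes:

    #include <cassert>

    // Implementation written under the incomplete "divisible by four" rule.
    bool IsLeapYear( int year )
    {
        return year % 4 == 0;
    }

    int main()
    {
        // A test written under the same faulty assumption passes...
        assert( IsLeapYear( 2100 ) == true );
        // ...although 2100 is not a leap year: it is divisible by 100 but
        // not by 400. The correct check would be:
        // year % 4 == 0 && ( year % 100 != 0 || year % 400 == 0 )
        return 0;
    }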

Another reason for getting the (wrong) message that everything is fine is the sheer bad luck of repeating a typo (using cut & paste between your code and your tests greatly increases the chance of this happening). Consider the following (very bad) implementation of isLeapYear(): if year in (1988, 1992, 1998, 2000, 2004, 2008, 2012) return true else return false.
assertEquals(true, calendar.isLeapYear(1998)) will pass although 1998 is definitely not a leap year.
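In runnable form (a C++ sketch; the switch statement is just one way to write the lookup):

    #include <cassert>

    // The (very bad) lookup implementation from above, with 1998 in the
    // table where 1996 was presumably intended.
    bool IsLeapYear( int year )
    {
        switch ( year )
        {
        case 1988: case 1992: case 1998: // typo: should be 1996
        case 2000: case 2004: case 2008: case 2012:
            return true;
        default:
            return false;
        }
    }

    int main()
    {
        // The same typo, copied into the test via cut & paste - it passes,
        // although 1998 is definitely not a leap year.
        assert( IsLeapYear( 1998 ) == true );
        return 0;
    }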

The third possibility of getting false positives arises from test-driven development, especially if you follow the practice to the letter. In Test-Driven Development By Example, the seminal book on the subject, Kent Beck writes: "Make the test work quickly, committing whatever sins necessary in the process" (a few pages later he gives a concrete example). Translated to our little leap year method, the following implementation would make the unit test assertEquals(true, calendar.isLeapYear(2000)) pass: isLeapYear(int year) { return true; }
Why is this bad? After all, Kent Beck goes on to define the next step as "Refactor - Eliminate all of the duplication created in merely getting the test to work". The problem is that there is a small timeframe in which the test results in a false positive. And in that timeframe you might get interrupted.
Say you are implementing a class with a number of methods; you have written a unit test for each method and you have supplied an initial (fake) implementation to make the tests pass. In comes your boss: "Red alert! Our latest release is throwing segmentation faults like hell! Drop whatever you are doing and try to find the cause and fix it!"
Chances are, when you finally go back to the code you were developing at the time of the interruption, you will miss a fake implementation or two. And the unit tests won't tell you that you did.
How can you avoid this trap? Apart from skipping the quick fake altogether and never allowing a test to go green until the functionality has been fully implemented, there are some practices that can help here. One is to only work on one method at any given time: never have more than one unit test go green with a fake implementation. And if your boss interrupts you, at least take the time to leave some artefact in the code that will break the build. That way, even if a week passes before you can work on the code again, the compiler will remind you of the last method you were working on.
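In C++, a preprocessor #error directive is one way to leave such an artefact; the following sketch deliberately refuses to compile until the fake implementation is replaced:

    // Deliberately breaking the build before an interruption: the compiler,
    // not your memory, keeps track of the unfinished work.
    bool IsLeapYear( int year )
    {
    #error "IsLeapYear() is still a fake implementation - finish it!"
        return true; // fake, only there to make the first test pass
    }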
Another practice is to supply not just one initial unit test, but two. You still get one unit test that passes, telling you that everything else is in place, but the second test will (hopefully) fail and remind you that the implementation of the functionality is still a fake.
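A sketch of the two-test practice, again with plain assert():

    #include <cassert>

    // Fake implementation, just enough to make the first test pass.
    bool IsLeapYear( int year )
    {
        return true;
    }

    int main()
    {
        assert( IsLeapYear( 1996 ) == true );  // passes: the plumbing works
        assert( IsLeapYear( 2002 ) == false ); // fails: the implementation is still fake
        return 0;
    }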