Chapter 7. Automated Testing

Writing automated tests is one of those things that separates average developers from the best in the world. Master this skill, and you will be able to write far more complex and powerful software than you ever could before. It is a superpower that changes the arc of your career.

Some of you have, so far, little or no experience writing automated tests, in any language. This chapter is primarily written for you. It introduces many fundamental ideas of test automation, explains the problems it is supposed to solve, and teaches how to apply Python’s tools for doing so.

Some of you will have extensive experience using standard test frameworks in other languages (such as JUnit in Java, PHPUnit in PHP, and so on). Generally speaking, if you have mastered an xUnit framework in another language, and are fluent in Python, you may be able to start skimming Python’s unittest module docs1 and be productive in minutes. Python’s test library, unittest, maps closely to how most xUnit libraries work.2

If you are more experienced, I believe it is worth your time to at least skim this chapter, and perhaps study it thoroughly. I have woven in useful, real-world wisdom for software testing in general, and for Python specifically. This includes topics like how to organize Python test code, writing maintainable test code, useful features like subtests, and even cognitive aspects of programming…like getting into an enjoyable, highly productive “flow” state via test-driven development.

With that in mind, let’s start with the core ideas for writing automated tests. We’ll then apply those ideas specifically to Python.

What Is Test-Driven Development?

An automated test is a program that tests another program. Generally, it tests a specific portion of that program: a function, a method, a class, or some group of these things chunked together. We call that portion the system under test, sometimes abbreviated as SUT.

If the system under test is working correctly, the automated test passes; if not, that test fails—which means it catches the error and immediately tells you what is wrong. Real applications accumulate many of these tests as development proceeds.

People have different names for different kinds of automated tests: unit tests, integration tests, end-to-end tests, etc. These distinctions are useful, but we won’t need to worry about them right now. They all share the same foundation.

In this chapter, we do test-driven development, or TDD. Test-driven development means you start working on each new feature or bugfix by writing the automated test for it first. You write the test; run it, to verify that it fails (and make sure that test code is actually working); and only then write the actual code for the feature. You know it works when the test passes.

This is a different process from implementing the feature first, then writing a test for it. Writing the test first forces you to think through the interfaces of your code, answering the question, “How will I know my code is working?” That immediate benefit is useful, but it is not the whole story.

TDD’s greatest medium-term benefits are mostly cognitive. As you become competent and comfortable with its tactics and techniques, you learn to easily get into a state of flow—where you find yourself repeatedly implementing feature after feature, keeping your focus with ease for long periods of time. You can honestly surprise and delight yourself with how much you accomplish in a short period of time.

But even greater benefits emerge over time. We’ve all done substantial refactorings of a large codebase, changing fundamental aspects of its architecture.3 Such refactorings—which threaten to break the application in confusing, hidden ways—become straightforward and safe when you have test code in place already, and use TDD to refactor from that foundation. You first update the tests: modifying where needed, and writing new tests as appropriate. Then all you have to do is make them pass. It might still be a lot of work. But you can be confident in the correctness of your code, once the new tests pass.

Among developers who know how to write tests, some love to do TDD in their day-to-day work. Some like to do it part of the time; some hate it, and do it rarely or never. However, the absolute best way to quickly master unit testing is to strictly do TDD for a while. So I’ll teach you how to do that. You do not have to do it forever if you don’t want to.

Python’s standard library ships with two modules for creating unit tests: doctest and unittest. Most engineering teams prefer unittest, as it is more full-featured than doctest. This isn’t just a convenience. There is a low ceiling of complexity that doctest can handle, and real applications will quickly bump up against that limit. With unittest, the sky is more or less the limit.

And because unittest maps closely to the xUnit libraries used in many other languages, if you are already familiar with Python, and have used an xUnit library in any language, you will feel right at home with unittest. That said, unittest has some unique tools and idioms—partly because of differences in the Python language, and partly from unique extensions and improvements. We will learn the best of what unittest has to offer as we go along.

Another popular option is pytest. This is not in the standard library, but it is widely used. For brevity, we will focus on unittest in this chapter. Once you learn the principles, picking up pytest is straightforward.

Unit Tests and Simple Assertions

Imagine a class representing an angle:

>>> small_angle = Angle(60)
>>> small_angle.degrees
60
>>> small_angle.is_acute()
True
>>> big_angle = Angle(320)
>>> big_angle.is_acute()
False
>>> funny_angle = Angle(1081)
>>> funny_angle.degrees
1
>>> total_angle = small_angle + big_angle
>>> total_angle.degrees
20

As you can see, Angle keeps track of the angle size, wrapping around so it’s in a range of 0 up to 360 degrees. You also have an is_acute() method, to tell you if its size is under 90 degrees, and an __add__() method for arithmetic.4

Suppose this Angle class is defined in a file named angles.py. Here’s how we create a simple test for it—in a separate file, named test_angles.py:

import unittest
from angles import Angle

class TestAngle(unittest.TestCase):
    def test_degrees(self):
        small_angle = Angle(60)
        self.assertEqual(60, small_angle.degrees)
        self.assertTrue(small_angle.is_acute())
        big_angle = Angle(320)
        self.assertFalse(big_angle.is_acute())
        funny_angle = Angle(1081)
        self.assertEqual(1, funny_angle.degrees)

    def test_arithmetic(self):
        small_angle = Angle(60)
        big_angle = Angle(320)
        total_angle = small_angle + big_angle
        self.assertEqual(20, total_angle.degrees,
                         'Adding angles with wrap-around')

As you look over this code, notice a few things:

  • There’s a class called TestAngle. You just define it, but you do not create any instance of it. It subclasses TestCase.

  • You define two methods, test_degrees() and test_arithmetic().

  • Both test_degrees() and test_arithmetic() have assertions, using some methods of TestCase: assertEqual(), assertTrue(), and assertFalse().

  • The last assertion includes a custom message as its third argument.

To see how this works, let’s define a stub for the Angle class in angles.py:

# angles.py - stub version
class Angle:
    def __init__(self, degrees):
        self.degrees = 0
    def is_acute(self):
        return False
    def __add__(self, other_angle):
        return Angle(0)

This Angle class defines all the attributes and methods it is expected to have, but otherwise does nothing useful. We need a stub like this to verify the test can run correctly, and alert us to the fact that the code isn’t working yet.

The unittest module is used not just to define tests, but also to run them. You do so on the command line like this:5

python -m unittest test_angles.py

Run the test, and you’ll see the following output:

$ python -m unittest test_angles.py
FF
=========================================================
FAIL: test_arithmetic (test_angles.TestAngle)
--------------------------------------------------------
Traceback (most recent call last):
  File "/src/test_angles.py", line 18, in test_arithmetic
    self.assertEqual(20, total_angle.degrees, 'Adding angles with wrap-around')
AssertionError: 20 != 0 : Adding angles with wrap-around

=========================================================
FAIL: test_degrees (test_angles.TestAngle)
--------------------------------------------------------
Traceback (most recent call last):
  File "/src/test_angles.py", line 7, in test_degrees
    self.assertEqual(60, small_angle.degrees)
AssertionError: 60 != 0

--------------------------------------------------------
Ran 2 tests in 0.001s

FAILED (failures=2)

Notice:

  • Both test methods are shown. They both have a failed assertion highlighted.

  • test_degrees() makes several assertions, but only the first one has been run—once it fails, the others are not executed.

  • For each failing assertion, you are given the line number, its expected and actual values, and its test method.

  • The custom message in test_arithmetic() shows up in the output.

This demonstrates one useful way to organize your test code: in a single test module (test_angles.py), you define one or more subclasses of unittest.TestCase. Here, I just define TestAngle, containing tests for the Angle class. Within this, I create several test methods, for testing different aspects of the class. In each of these test methods, I can have as many assertions as makes sense.

It’s traditional to start a test class name with the string Test, but that is not required; unittest will find all subclasses of TestCase automatically. But every method must start with the string test. If it starts with anything else (even Test), unittest will not run its assertions.
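To see this naming rule in action, here is a small self-contained demonstration (the class and method names are invented for this example). Only the method whose name starts with test is collected and run:

```python
import unittest

class TestNaming(unittest.TestCase):
    def test_this_runs(self):
        # Starts with "test", so unittest collects and runs it.
        self.assertTrue(True)

    def check_this_does_not(self):
        # Does not start with "test", so unittest never calls it,
        # and this exception is never raised during a test run.
        raise AssertionError("never executed by the test runner")
```

Run this module with python -m unittest, and it reports one test, not two.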

Running the test and watching it fail is an important first step. It verifies that the test does, in fact, actually test your code. As you write more and more tests, you’ll occasionally create a test; run it, expecting it to fail; and find that it unexpectedly passes. That’s a bug in your test code! Fortunately you ran the test first, so you caught it right away.

In the test code, we defined test_degrees() before test_arithmetic(), but they were actually run in the opposite order. It’s important to craft your test methods to be self-contained, and not depend on one being run before the other, for several reasons. One of them is that the order in which they are defined is generally not the order in which they are executed.6

(If you find yourself wanting to run tests in a certain order, this might be better handled with setUp() and tearDown(), explained in the next section.)

At this point, we have a correctly failing test. If I’m using version control and working in a branch, this is a good moment to check in the test code, because it specifies the correct behavior (even if it’s presently failing). The next step is to actually make that test pass. Here’s one way to do it:

# angles.py, version 2
class Angle:
    def __init__(self, degrees):
        self.degrees = degrees % 360
    def is_acute(self):
        return self.degrees < 90
    def __add__(self, other_angle):
        return Angle(self.degrees + other_angle.degrees)

Now when I run my tests again, they all pass:

$ python -m unittest test_angles.py
..
--------------------------------------------------------
Ran 2 tests in 0.000s

OK

assertEqual(), assertTrue(), and assertFalse() will be the most common assertion methods you use, along with assertNotEqual(), which does the opposite of assertEqual(). TestCase provides many others, such as assertIs(), assertIsNone(), assertIn(), and assertIsInstance()—along with “not” variants like assertIsNot(). Each takes an optional final message-string argument, like “Adding angles with wrap-around” in test_arithmetic() above. If the test fails, the message is printed in the output, which can give helpful advice to whoever is troubleshooting a broken test.7
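As a quick illustration of a few of these (the test data here is invented for the example), a single test method might mix several assertion methods:

```python
import unittest

class TestAssortedAssertions(unittest.TestCase):
    def test_membership_and_identity(self):
        colors = ["red", "green", "blue"]
        self.assertIn("green", colors)       # membership, like "in"
        self.assertNotIn("mauve", colors)
        self.assertIsInstance(colors, list)  # isinstance() check
        self.assertIsNone(None)
        sentinel = object()
        alias = sentinel
        self.assertIs(alias, sentinel)       # identity, like "is"
```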

If you try checking that two dictionaries are equal, and they are not, the output is tailored to the data type: highlighting which key is missing, or which value is incorrect, for example. This also happens with lists, tuples, and sets, making troubleshooting much easier. What’s actually happening is that unittest provides certain type-specialized assertions, such as assertDictEqual(), assertListEqual(), and more. You almost never need to invoke them directly: if you invoke assertEqual() with two dictionaries, it automatically dispatches to assertDictEqual(), and similar for the other types. So you get this usefully detailed error reporting for free.

The assertEqual() lines take two arguments, and I always write the expected, correct value first:

small_angle = Angle(60)
self.assertEqual(60, small_angle.degrees)

It does not matter whether the expected value is first or second, but it’s smart to pick an order and stick with it—at least throughout a single codebase, and maybe for all code you write. Sticking with a consistent order greatly improves the readability of your test output, because you never have to decipher which is which. Believe me, this will save you a lot of time; nothing will throw off your momentum more than confusing the expected and actual values for each other, only to realize 20 minutes later you had them mixed up in your head. If you’re on a team, negotiate with your teammates to agree on a consistent order, and enforce it.

Fixtures and Common Test Setup

As an application grows and you write more tests, you will find yourself writing groups of test methods that start or end with the same lines of code. This repeated code—which does some kind of pretest setup, and/or after-test cleanup—can be consolidated in the special methods setUp() and tearDown().

Each of your TestCase subclasses can define setUp(), tearDown(), both, or neither. If defined, setUp() is executed just before each test method starts; tearDown() is run just after. This is repeated for every single test method.

Here’s a realistic example of when you might use it. Imagine working on a tool that saves its state between runs in a special JSON file. We’ll call this the “state file”. On start, your tool reads the state from the file; if the tool has any state change while running, that gets written to the file on exit.

It makes sense to write a class to manage this state file. A stub might look like:

# statefile.py - Stub version
class State:
    def __init__(self, state_file_path):
        # Load the stored state data, and save
        # it in self.data.
        self.data = { }
    def close(self):
        # Handle any changes on application exit.
        pass

In fleshing out this stub, we want our tests to verify the following:

  • If I alter the value of an existing key, that updated value is written to the state file.

  • If I add a new key-value pair to the state, it is recorded correctly in the state file.

  • If I remove a key-value pair from the state, it is also removed in the state file.

  • If the state is not changed, the state file’s content stays the same.

For each test, we want the state file to be in a known starting state. Afterward, we want to remove that file so that our tests don’t leave garbage on the filesystem. Here’s how the setUp() and tearDown() methods accomplish this:

import os
import unittest
import shutil
import tempfile
from statefile import State

INITIAL_STATE = '{"foo": 42, "bar": 17}'

class TestState(unittest.TestCase):
    def setUp(self):
        self.testdir = tempfile.mkdtemp()
        self.state_file_path = os.path.join(
            self.testdir, 'statefile.json')
        with open(self.state_file_path, 'w') as outfile:
            outfile.write(INITIAL_STATE)
        self.state = State(self.state_file_path)

    def tearDown(self):
        shutil.rmtree(self.testdir)

    def test_change_value(self):
        self.state.data["foo"] = 21
        self.state.close()
        reloaded_statefile = State(self.state_file_path)
        self.assertEqual(21,
            reloaded_statefile.data["foo"])

    def test_add_key_value_pair(self):
        self.state.data["baz"] = 42
        self.state.close()
        reloaded_statefile = State(self.state_file_path)
        self.assertEqual(42, reloaded_statefile.data["baz"])

    def test_remove_key_value_pair(self):
        del self.state.data["bar"]
        self.state.close()
        reloaded_statefile = State(self.state_file_path)
        self.assertNotIn("bar", reloaded_statefile.data)

    def test_no_change(self):
        self.state.close()
        with open(self.state_file_path) as handle:
            checked_content = handle.read()
        self.assertEqual(INITIAL_STATE, checked_content)

In setUp(), you create a fresh temporary directory, and write the contents of INITIAL_STATE inside. Since we know each test will be working with a State object based on that initial data, we create that object, and save it in self.state. Each test can then work with that object, confident that it is in the same consistent starting state, regardless of what any other test method does. In effect, setUp() creates a private sandbox for each test method.

The tests in TestState would all work reliably with just setUp(). But we also want to clean up the temporary files we create; otherwise, they will accumulate over time with repeated test runs. The tearDown() method runs after each test_* method completes, even if some of its assertions fail. This ensures the temp files and directories are all removed completely.

The generic term for this kind of preparation is called a test fixture. A test fixture is whatever needs to be done or set up before a test can properly run. In this case, we set up the test fixture by creating the state file, and the State object. A test fixture can be a mock database, a set of files in a known state, some kind of network connection, or even starting a server process. You can do all these with setUp().

tearDown() is for shutting down and cleaning up the test fixture: deleting files, stopping the server process, etc. You will not always need a tear-down, but in some cases it is essential. If setUp() starts some kind of server process, for example, and tearDown() fails to terminate it, then setUp() may not be able to run for the next test.

When you write these method names, the camel-casing matters. People sometimes misspell them as setup() or teardown(), then wonder why they are not automatically invoked. Any uncaught exception in either setUp() or tearDown() will cause that test to fail, after which unittest immediately skips to the next test. For errors in setUp(), this means none of that test’s assertions will run (though it still shows as a clear error in the output). For tearDown(), the test is marked as failing, even if all the individual assertions passed.

Asserting Exceptions

Sometimes your code is supposed to raise an exception, under certain conditions. If that condition occurs, and your code does not raise the correct exception, that’s a bug. How do you write test code for this situation?

You can verify that behavior with a special method of TestCase, called assertRaises(). It’s used in a with statement in your test; the block under the with statement is asserted to raise the exception.

For example, suppose you are writing a library that translates Roman numerals into integers. You might define a function called roman2int():

>>> roman2int("XVI")
16
>>> roman2int("II")
2

In thinking about the best way to design this function, you decide that passing nonsensical input to roman2int() should raise a ValueError. Here’s how you write a test to assert that behavior:

import unittest
from roman import roman2int

class TestRoman(unittest.TestCase):
    def test_roman2int_error(self):
        with self.assertRaises(ValueError):
            roman2int("This is not a valid roman numeral.")

If you run this test, and roman2int() does not raise the error, this is the result:

$ python -m unittest test_roman2int.py
F
=========================================================
FAIL: test_roman2int_error (test_roman2int.TestRoman)
--------------------------------------------------------
Traceback (most recent call last):
  File "/src/test_roman2int.py", line 7, in test_roman2int_error
    roman2int("This is not a valid roman numeral.")
AssertionError: ValueError not raised

--------------------------------------------------------
Ran 1 test in 0.000s

FAILED (failures=1)

When you fix the bug, and roman2int() raises ValueError like it should, the test passes.
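assertRaises() has a couple of useful variations. You can capture the raised exception object to inspect it, and assertRaisesRegex() matches the exception’s message against a regular expression. Here is a sketch, using a toy roman2int() written just for this example (not necessarily how you would implement the real one):

```python
import unittest

def roman2int(numeral):
    # Toy implementation, invented for this example.
    values = {"I": 1, "V": 5, "X": 10, "L": 50,
              "C": 100, "D": 500, "M": 1000}
    if not numeral or any(ch not in values for ch in numeral):
        raise ValueError(f"not a valid roman numeral: {numeral!r}")
    total = 0
    for ch, nxt in zip(numeral, numeral[1:] + " "):
        value = values[ch]
        # Subtractive notation: IV is 4, IX is 9, and so on.
        total += -value if values.get(nxt, 0) > value else value
    return total

class TestRomanErrors(unittest.TestCase):
    def test_exception_details(self):
        # The "as" clause captures the exception object, so you can
        # make further assertions about it.
        with self.assertRaises(ValueError) as context:
            roman2int("banana")
        self.assertIn("banana", str(context.exception))

    def test_message_pattern(self):
        # assertRaisesRegex() also matches the message with a regex.
        with self.assertRaisesRegex(ValueError, "not a valid"):
            roman2int("???")
```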

Using Subtests

Sometimes you will want to iterate through many test inputs, to thoroughly validate the input range and cover many edge cases. You could simply write a parade of assert methods, but that becomes tediously repetitive, and more importantly, it will stop with the first failing assertion. Sometimes it is tremendously helpful to run all these assertions so that you have a full picture of which are passing and which are not.

Python’s unittest library supports this with a feature called subtests. This lets you conveniently iterate through a potentially large collection of test inputs, with well-presented (and easy to comprehend) reporting output. Pytest calls its version of this feature parameterized tests, which is probably a better name. But since we are focused on unittest, we will call them subtests.

Imagine a function called numwords(), which counts the number of unique words in a string (ignoring punctuation, capitalization, and extra whitespace):

>>> numwords("Good, good morning. Beautiful morning!")
3

Suppose you want to test how numwords() handles excess whitespace. You can easily imagine a dozen different reasonable inputs that will result in the same return value, and want to verify it can handle them all. You might create something like this:

class TestWords(unittest.TestCase):
    def test_whitespace(self):
        self.assertEqual(2, numwords("foo bar"))
        self.assertEqual(2, numwords("    foo bar"))
        self.assertEqual(2, numwords("foo\tbar"))
        self.assertEqual(2, numwords("foo   bar"))
        self.assertEqual(2, numwords("foo bar    \t   \t"))
        # And so on, and so on...

Seems a bit repetitive, doesn’t it? The only thing varying is the argument to numwords(). We might benefit from using a for loop:

    def test_whitespace_forloop(self):
        texts = [
            "foo bar",
            "    foo bar",
            "foo\tbar",
            "foo   bar",
            "foo bar    \t   \t",
            ]
        for text in texts:
            self.assertEqual(2, numwords(text))

At first glance, this seems better: more readable and maintainable. If we add new variants, it’s just another line in the texts list. And if I rename numwords(), I only need to change it in one place in the test.

However, using a for loop like this creates more problems than it solves. Suppose you run this test and get the following failure:

$ python -m unittest test_words_forloop.py
F
=========================================================
FAIL: test_whitespace_forloop (test_words_forloop.TestWords)
--------------------------------------------------------
Traceback (most recent call last):
  File "/src/test_words_forloop.py", line 17, in test_whitespace_forloop
    self.assertEqual(2, numwords(text))
AssertionError: 2 != 3

--------------------------------------------------------
Ran 1 test in 0.000s

FAILED (failures=1)

Look closely, and you’ll realize that numwords() returned 3 when it was supposed to return 2.

Pop quiz: out of all the inputs in the list, which caused the bad return value?

The way we’ve written the test, there is no way to know. All you can infer is that at least one of the test inputs produced an incorrect value. You don’t know which one. That’s the first problem.

The second problem—which the original test also suffers from—is that everything stops with the first failed assertion. For this kind of function, knowing all the failing inputs, and the incorrect results they create, would be very helpful for quickly understanding what’s going on.

Subtests solve these problems. Our for-loop solution is actually quite close. All we have to do is add one line. Do you see it below?

    def test_whitespace_subtest(self):
        texts = [
            "foo bar",
            "    foo bar",
            "foo\tbar",
            "foo   bar",
            "foo bar    \t   \t",
            ]
        for text in texts:
            with self.subTest(text=text):
                self.assertEqual(2, numwords(text))

Just inside the for loop, we write with self.subTest(text=text). This creates a context in which assertions can be made, and even fail. Regardless of whether they pass or not, the test continues with the next iteration of the for loop. At the end, all failures are collected and reported in the test result output, like this:

$ python -m unittest test_words_subtest.py

=========================================================
FAIL: test_whitespace_subtest (test_words_subtest.TestWords) (text='foo\tbar')
--------------------------------------------------------
Traceback (most recent call last):
  File "/src/test_words_subtest.py", line 16, in test_whitespace_subtest
    self.assertEqual(2, numwords(text))
AssertionError: 2 != 3

=========================================================
FAIL: test_whitespace_subtest (test_words_subtest.TestWords) (text='foo bar    \t   \t')
--------------------------------------------------------
Traceback (most recent call last):
  File "/src/test_words_subtest.py", line 16, in test_whitespace_subtest
    self.assertEqual(2, numwords(text))
AssertionError: 2 != 4

--------------------------------------------------------
Ran 1 test in 0.000s

FAILED (failures=2)

Behold the opulence of information in this output:

  • Each individual failing input has its own detailed summary.

  • We are told what the full value of text was.

  • We are told what the actual returned value was, and it is clearly compared to the expected value.

  • No values are skipped. We can be confident that these two are the only failures.

This is much better. The two offending inputs are “foo\tbar” and “foo bar \t \t”. These are the only values containing tab characters, so you can quickly realize the nature of the bug: tab characters are being counted as separate words.

Let’s look at these three lines of code again:

        for text in texts:
            with self.subTest(text=text):
                self.assertEqual(2, numwords(text))

The key-value arguments to self.subTest() are shown in the reporting output. You can pass in whatever key-value pairs you like; they can be anything that helps you understand exactly what is wrong when a test fails. Often you will want to pass everything that varies among the test cases; here, that is only the string passed to numwords().

To be clear, in these three lines the symbol text is used for two different things. Look at lines 1, 2, and 3 again:

        for text in texts:
            with self.subTest(text=text):
                self.assertEqual(2, numwords(text))

In line 1, text is the loop variable, which is passed to numwords() on line 3. But on line 2, in the call to subTest(), you have text=text. The left-hand text is a parameter name, used in the reporting output if the test fails. The right-hand side is that parameter’s value, which in this case happens to be a variable also named text.

It may be clearer if we use input_text as the parameter to subTest() instead:

        for text in texts:
            with self.subTest(input_text=text):
                self.assertEqual(2, numwords(text))

Then the failure output might look like:

FAIL: test_whitespace_subtest (test_words_subtest.TestWords) (input_text='foo\tbar')

See how at the end of that FAIL line, it says input_text instead of text? That is because we used a different parameter name when calling subTest(). In fact, we can use whatever parameter name we want, though it often works best to use the same identifier name throughout.
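Since subTest() accepts any keyword arguments, you can also pass several at once. This is handy when both the input and the expected value vary across cases. A sketch, with a toy numwords() defined just for the example:

```python
import unittest

def numwords(text):
    # Toy stand-in for the chapter's numwords(), invented for this
    # example: count unique words, ignoring case and punctuation.
    words = {word.strip(".,!?").lower() for word in text.split()}
    words.discard("")
    return len(words)

class TestWordCounts(unittest.TestCase):
    def test_various_inputs(self):
        # Each case pairs an input with its expected count; both are
        # passed to subTest(), so both appear in any failure report.
        cases = [
            ("foo bar", 2),
            ("Good, good morning. Beautiful morning!", 3),
            ("", 0),
        ]
        for text, expected in cases:
            with self.subTest(text=text, expected=expected):
                self.assertEqual(expected, numwords(text))
```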

Conclusion

Let’s recap the big ideas. Test-driven development means we create the test first, along with whatever stubs we need to make the test run. We then run it and watch it fail. This is an important step. You must run the test and see it fail.

This is important for two reasons. You don’t really know if the test is correct until you verify that it can fail. As you write automated tests more and more over time, you will be surprised at how often you write a test and run it, expecting to see it fail, only to discover it passes. As far as I can tell, every good, experienced software engineer still occasionally does this—even after doing TDD for years! This is why we build the habit of always verifying the test fails first.

The second reason is more subtle. As you gain experience with TDD and become comfortable with it, you will find the cycle of writing tests and making them pass lets you get into a state of flow. This means you are enjoyably productive and focused, in a way that is easy to maintain over time. You will get addicted to this.

Is it important that you strictly follow TDD? People have different opinions on this, some of them very strong. Personally, I went through a period of almost a year where I followed TDD quite strictly. As a result, I got very good at writing tests, and writing them rapidly.

Now that I’ve developed that level of skill, I find that I follow TDD most of the time, but less often than I did when learning. I have noticed that TDD is most powerful when I have great clarity on the software’s design, architecture and APIs; it helps me get into a cognitive state that seems accelerated, so I can more easily maintain my mental focus, and produce quality code faster.

But I find it very hard to write good tests when I don’t yet have that clarity—when I am still thinking through how I will structure the program and organize the code. In fact, I find TDD slows me down in that phase, as any test I write will probably have to be completely rewritten several times, if not deleted, before things stabilize. In these situations, I prefer to get a first version of the code working through manual testing, then write the tests afterward.

For this reason, if your particular job requires a lot of exploratory coding—data scientists, I am looking at you—then TDD may not be something you do all the time. If that is the case, there are many benefits to still doing it as much as you can. Remember, this is a superpower. But only if you use it.

No matter your situation, I encourage you to find a way to do strict TDD for a period of time, simply because of what it will teach you. As you develop your skill at writing tests, you can step back and evaluate how you want to integrate it into your daily routine.

1 https://docs.python.org/3/library/unittest.html

2 You may be in a third category, having a lot of experience with a non-xUnit testing framework. If so, you should probably pretend you’re in the first group. You’ll be able to move quickly.

3 If you haven’t done one of these yet, you will someday.

4 Remember from Chapter 6: __add__() is the magic method that makes binary addition with + work.

5 For running Python on the command line, this book uses python for the executable. But depending on how Python was installed on your computer, you may actually need to run python3 instead. Check by running python -V, which reports the version number. If it says 2.7 or lower, that is the legacy version; you want to run python3 instead.

6 In current Python versions, the test methods are executed in alphabetical order. This order is fragile, because it changes when you add a new test method.

7 This could be you, months or years down the road. Be considerate of your future self!