Writing automated tests is one of those things that separates average developers from the best in the world. Master this skill, and you will be able to write far more complex and powerful software than you ever could before. It is a superpower that changes the arc of your career.
Some of you have, so far, little or no experience writing automated tests, in any language. This chapter is primarily written for you. It introduces many fundamental ideas of test automation, explains the problems it is supposed to solve, and teaches how to apply Python’s tools for doing so.
Some of you will have extensive experience using standard test frameworks in
other languages (such as JUnit in Java, PHPUnit in PHP, and so on). Generally
speaking, if you have mastered an xUnit framework in another language,
and are fluent in Python, you may be able to start skimming Python’s unittest module
docs1
and be productive in minutes. Python’s test library, unittest, maps
closely to how most xUnit libraries work.2
If you are more experienced, I believe it is worth your time to at least skim this chapter, and perhaps study it thoroughly. I have woven in useful, real-world wisdom for software testing in general, and for Python specifically. This includes topics like how to organize Python test code, writing maintainable test code, useful features like subtests, and even cognitive aspects of programming…like getting into an enjoyable, highly productive “flow” state via test-driven development.
With that in mind, let’s start with the core ideas for writing automated tests. We’ll then apply those ideas specifically to Python.
An automated test is a program that tests another program. Generally, it tests a specific portion of that program: a function, a method, a class, or some group of these things chunked together. We call that portion the system under test, sometimes abbreviated as SUT.
If the system under test is working correctly, the automated test passes; if not, that test fails—which means it catches the error and immediately tells you what is wrong. Real applications accumulate many of these tests as development proceeds.
People have different names for different kinds of automated tests: unit tests, integration tests, end-to-end tests, etc. These distinctions are useful, but we won’t need to worry about them right now. They all share the same foundation.
In this chapter, we do test-driven development, or TDD. Test-driven development means you start working on each new feature or bugfix by writing the automated test for it first. You write the test; run it, to verify that it fails (and make sure that test code is actually working); and only then write the actual code for the feature. You know it works when the test passes.
This is a different process from implementing the feature first, then writing a test for it. Writing the test first forces you to think through the interfaces of your code, answering the question, “How will I know my code is working?” That immediate benefit is useful, but it is not the whole story.
TDD’s greatest medium-term benefits are mostly cognitive. As you become competent and comfortable with its tactics and techniques, you learn to easily get into a state of flow—where you find yourself repeatedly implementing feature after feature, keeping your focus with ease for long periods of time. You can honestly surprise and delight yourself with how much you accomplish in a short period of time.
But even greater benefits emerge over time. We’ve all done substantial refactorings of a large codebase, changing fundamental aspects of its architecture.3 Such refactorings—which threaten to break the application in confusing, hidden ways—become straightforward and safe when you have test code in place already, and use TDD to refactor from that foundation. You first update the tests: modifying where needed, and writing new tests as appropriate. Then all you have to do is make them pass. It might still be a lot of work. But you can be confident in the correctness of your code, once the new tests pass.
Among developers who know how to write tests, some love to do TDD in their day-to-day work. Some like to do it part of the time; some hate it, and do it rarely or never. However, the absolute best way to quickly master unit testing is to strictly do TDD for a while. So I’ll teach you how to do that. You do not have to do it forever if you don’t want to.
Python’s standard library ships with two modules for creating unit
tests: doctest and unittest.
Most engineering teams prefer unittest, as it
is more full-featured than doctest. This isn’t just a convenience.
There is a low ceiling of complexity that doctest can handle, and
real applications will quickly bump up against that limit. With
unittest, the sky is more or less the limit.
And because unittest maps closely to the xUnit
libraries used in many other languages, if you are already familiar
with Python, and have used an xUnit library in any language, you will
feel right at home with unittest. That said, unittest has some
unique tools and idioms—partly because of differences in the Python
language, and partly from unique extensions and improvements. We will
learn the best of what unittest has to offer as we go along.
Another popular option is pytest. This is not in
the standard library, but it is widely used. For brevity, we will focus
on unittest in this chapter. Once you learn the principles, picking up
pytest is straightforward.
Imagine a class representing an angle:
>>> small_angle = Angle(60)
>>> small_angle.degrees
60
>>> small_angle.is_acute()
True
>>> big_angle = Angle(320)
>>> big_angle.is_acute()
False
>>> funny_angle = Angle(1081)
>>> funny_angle.degrees
1
>>> total_angle = small_angle + big_angle
>>> total_angle.degrees
20
As you can see, Angle keeps track of the angle size, wrapping around
so it’s in a range of 0 up to 360 degrees. You also have an is_acute()
method, to tell you if its size is under 90 degrees, and an __add__()
method for arithmetic.4
Suppose this Angle class is defined in a file named angles.py.
Here’s how we create a simple test for it—in a separate file, named
test_angles.py:
import unittest
from angles import Angle

class TestAngle(unittest.TestCase):
    def test_degrees(self):
        small_angle = Angle(60)
        self.assertEqual(60, small_angle.degrees)
        self.assertTrue(small_angle.is_acute())
        big_angle = Angle(320)
        self.assertFalse(big_angle.is_acute())
        funny_angle = Angle(1081)
        self.assertEqual(1, funny_angle.degrees)

    def test_arithmetic(self):
        small_angle = Angle(60)
        big_angle = Angle(320)
        total_angle = small_angle + big_angle
        self.assertEqual(20, total_angle.degrees,
            'Adding angles with wrap-around')
As you look over this code, notice a few things:
There’s a class called TestAngle. You just define it, but you do
not create any instance of it. It subclasses TestCase.
You define two methods, test_degrees() and test_arithmetic().
Both test_degrees() and test_arithmetic() have assertions, using
some methods of TestCase: assertEqual(), assertTrue(), and
assertFalse().
The last assertion includes a custom message as its third argument.
To see how this works, let’s define a stub for the Angle class in angles.py:
# angles.py - stub version
class Angle:
    def __init__(self, degrees):
        self.degrees = 0

    def is_acute(self):
        return False

    def __add__(self, other_angle):
        return Angle(0)
This Angle class defines all the attributes and methods it is
expected to have, but otherwise does nothing useful. We need a stub
like this to verify the test can run correctly, and alert us to the
fact that the code isn’t working yet.
The unittest module is used not just to define tests, but also to
run them. You do so on the command line like this:5
python -m unittest test_angles.py
Run the test, and you’ll see the following output:
$ python -m unittest test_angles.py
FF
=========================================================
FAIL: test_arithmetic (test_angles.TestAngle)
--------------------------------------------------------
Traceback (most recent call last):
  File "/src/test_angles.py", line 18, in test_arithmetic
    self.assertEqual(20, total_angle.degrees, 'Adding angles with wrap-around')
AssertionError: 20 != 0 : Adding angles with wrap-around
=========================================================
FAIL: test_degrees (test_angles.TestAngle)
--------------------------------------------------------
Traceback (most recent call last):
  File "/src/test_angles.py", line 7, in test_degrees
    self.assertEqual(60, small_angle.degrees)
AssertionError: 60 != 0
--------------------------------------------------------
Ran 2 tests in 0.001s

FAILED (failures=2)
Notice:
Both test methods are shown. They both have a failed assertion highlighted.
test_degrees() makes several assertions, but only the first one has
been run—once it fails, the others are not executed.
For each failing assertion, you are given the line number, its expected and actual values, and its test method.
The custom message in test_arithmetic() shows up in the output.
This demonstrates one useful way to organize your test code: in a
single test module (test_angles.py), you define one or more
subclasses of unittest.TestCase. Here, I just define TestAngle,
containing tests for the Angle class. Within this, I create several
test methods, for testing different aspects of the class. In each
of these test methods, I can have as many assertions as makes sense.
It’s traditional to start a test class name with the string Test, but that is not
required; unittest will find all subclasses of TestCase
automatically. But every method must start with the string
test. If it starts with anything else (even Test), unittest
will not run its assertions.
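To illustrate the discovery rules, here is a small sketch; the class and method names are invented for this example, and only the first method will actually run:

```python
import unittest

class TestDiscoveryDemo(unittest.TestCase):
    # Runs: the method name starts with "test".
    def test_addition(self):
        self.assertEqual(4, 2 + 2)

    # Ignored: "Test" is capitalized, so it does not
    # match the lowercase "test" prefix.
    def Test_subtraction(self):
        self.assertEqual(0, 2 - 2)

    # Ignored: does not start with "test" at all.
    def check_multiplication(self):
        self.assertEqual(4, 2 * 2)
```

Running this through unittest reports exactly one test, silently skipping the other two methods—which is why a misspelled prefix can be a sneaky source of tests that never run.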
Running the test and watching it fail is an important first step. It verifies that the test does, in fact, actually test your code. As you write more and more tests, you’ll occasionally create a test; run it, expecting it to fail; and find that it unexpectedly passes. That’s a bug in your test code! Fortunately you ran the test first, so you caught it right away.
In the test code, we defined test_degrees() before
test_arithmetic(), but they were actually run in the opposite
order. It’s important to craft your test methods to be self-contained,
and not depend on one being run before the other, for several
reasons. One of them is that the order in which they are defined is
generally not the order in which they are executed.6
(If you find yourself wanting to run tests in a certain order, this
might be better handled with setUp() and tearDown(), explained in
the next section.)
At this point, we have a correctly failing test. If I’m using version control and working in a branch, this is a good moment to check in the test code, because it specifies the correct behavior (even if it’s presently failing). The next step is to actually make that test pass. Here’s one way to do it:
# angles.py, version 2
class Angle:
    def __init__(self, degrees):
        self.degrees = degrees % 360

    def is_acute(self):
        return self.degrees < 90

    def __add__(self, other_angle):
        return Angle(self.degrees + other_angle.degrees)
Now when I run my tests again, they all pass:
$ python -m unittest test_angles.py
..
--------------------------------------------------------
Ran 2 tests in 0.000s

OK
assertEqual(), assertTrue(), and assertFalse() will be the most
common assertion methods you use, along with assertNotEqual(), which
does the opposite of assertEqual(). TestCase provides many others,
such as assertIs(), assertIsNone(), assertIn(), and
assertIsInstance()—along with “not” variants
like assertIsNot(). Each takes an optional final message-string
argument, like “Adding angles with wrap-around” in test_arithmetic()
above. If the test fails, the message is printed in the output, which can
give helpful advice to whoever is troubleshooting a broken
test.7
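For a quick taste, here is a sketch exercising several of these assertion methods together (the data is invented for illustration):

```python
import unittest

class TestAssortedAssertions(unittest.TestCase):
    def test_assorted(self):
        primes = [2, 3, 5]
        self.assertIn(3, primes)
        self.assertNotEqual(4, primes[0])
        self.assertIsInstance(primes, list)
        self.assertIsNone(None)
        # Optional final message argument, printed on failure:
        self.assertEqual(3, len(primes), 'Expected three primes')
```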
If you try checking that two dictionaries are equal, and they are not,
the output is tailored to the data type: highlighting which key is
missing, or which value is incorrect, for example. This also
happens with lists, tuples, and sets, making troubleshooting much
easier. What’s actually happening is that unittest provides certain
type-specialized assertions, such as assertDictEqual(),
assertListEqual(), and more. You almost never need to invoke them
directly: if you invoke assertEqual() with two dictionaries, it
automatically dispatches to assertDictEqual(), and similar for the
other types. So you get this usefully detailed error reporting for
free.
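You can see that tailored reporting by triggering the assertion directly and inspecting the message. This is just a demonstration, not something you would do in real test code:

```python
import unittest

class Demo(unittest.TestCase):
    # A do-nothing test method, so Demo can be
    # instantiated directly for this demonstration.
    def runTest(self):
        pass

demo = Demo()
try:
    # assertEqual() dispatches to assertDictEqual() here.
    demo.assertEqual({"foo": 42, "bar": 17}, {"foo": 42, "bar": 99})
except AssertionError as err:
    message = str(err)

# The message pinpoints the differing key and values,
# rather than just dumping both dictionaries.
print(message)
```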
The assertEqual() lines take two arguments, and I always
write the expected, correct value first:
small_angle=Angle(60)self.assertEqual(60,small_angle.degrees)
It does not matter whether the expected value is first or second, but it’s smart to pick an order and stick with it—at least throughout a single codebase, and maybe for all code you write. Sticking with a consistent order greatly improves the readability of your test output, because you never have to decipher which is which. Believe me, this will save you a lot of time; nothing will throw off your momentum more than confusing the expected and actual values for each other, only to realize 20 minutes later you had them mixed up in your head. If you’re on a team, negotiate with your teammates to agree on a consistent order, and enforce it.
As an application grows and you write more tests, you will find
yourself writing groups of test methods that start or end with the
same lines of code. This repeated code—which does some kind of
pretest setup, and/or after-test cleanup—can be consolidated in the
special methods setUp() and tearDown().
Each of your TestCase subclasses can define setUp(),
tearDown(), both, or neither. If defined, setUp() is executed
just before each test method starts; tearDown() is run just
after. This is repeated for every single test method.
Here’s a realistic example of when you might use it. Imagine working on a tool that saves its state between runs in a special JSON file. We’ll call this the “state file”. On start, your tool reads the state from the file; if the tool has any state change while running, that gets written to the file on exit.
It makes sense to write a class to manage this state file. A stub might look like:
# statefile.py - Stub version
class State:
    def __init__(self, state_file_path):
        # Load the stored state data, and save
        # it in self.data.
        self.data = {}

    def close(self):
        # Handle any changes on application exit.
        pass
In fleshing out this stub, we want our tests to verify the following:
If I alter the value of an existing key, that updated value is written to the state file.
If I add a new key-value pair to the state, it is recorded correctly in the state file.
If I remove a key-value pair from the state, it is also removed in the state file.
If the state is not changed, the state file’s content stays the same.
For each test, we want the state file to be in a known starting
state. Afterward, we want to remove that file so that our tests don’t
leave garbage on the filesystem. Here’s how the setUp() and
tearDown() methods accomplish this:
import os
import unittest
import shutil
import tempfile

from statefile import State

INITIAL_STATE = '{"foo": 42, "bar": 17}'

class TestState(unittest.TestCase):
    def setUp(self):
        self.testdir = tempfile.mkdtemp()
        self.state_file_path = os.path.join(
            self.testdir, 'statefile.json')
        with open(self.state_file_path, 'w') as outfile:
            outfile.write(INITIAL_STATE)
        self.state = State(self.state_file_path)

    def tearDown(self):
        shutil.rmtree(self.testdir)

    def test_change_value(self):
        self.state.data["foo"] = 21
        self.state.close()
        reloaded_statefile = State(self.state_file_path)
        self.assertEqual(21, reloaded_statefile.data["foo"])

    def test_add_key_value_pair(self):
        self.state.data["baz"] = 42
        self.state.close()
        reloaded_statefile = State(self.state_file_path)
        self.assertEqual(42, reloaded_statefile.data["baz"])

    def test_remove_key_value_pair(self):
        del self.state.data["bar"]
        self.state.close()
        reloaded_statefile = State(self.state_file_path)
        self.assertNotIn("bar", reloaded_statefile.data)

    def test_no_change(self):
        self.state.close()
        with open(self.state_file_path) as handle:
            checked_content = handle.read()
        self.assertEqual(INITIAL_STATE, checked_content)
In setUp(), you create a fresh temporary directory, and write the
contents of INITIAL_STATE inside. Since we know each test will be
working with a State object based on that initial data, we create
that object, and save it in self.state. Each test can then work with
that object, confident that it is in the same consistent starting state,
regardless of what any other test method does. In effect, setUp()
creates a private sandbox for each test method.
The tests in TestState would all work reliably with just
setUp(). But we also want to clean up the temporary files we create;
otherwise, they will accumulate over time with repeated test runs. The
tearDown() method runs after each test_* method completes, even
if some of its assertions fail. This ensures the temp files and
directories are all removed completely.
The generic term for this kind of preparation is a test
fixture. A test fixture is whatever needs to be done or set up before
a test can properly run. In this case, we set up the test fixture by
creating the state file, and the State object. A test fixture can be
a mock database, a set of files in a known state, some kind of network
connection, or even starting a server process. You can do all these
with setUp().
tearDown() is for shutting down and cleaning up the test fixture:
deleting files, stopping the server process, etc. You will not always
need a tear-down, but in some cases it is essential. If setUp()
starts some kind of server process, for example, and tearDown()
fails to terminate it, then setUp() may not be able to run for the
next test.
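A related tool worth knowing: instead of (or alongside) tearDown(), you can register cleanup callbacks with TestCase.addCleanup(). These run even when setUp() raises partway through, after the resource was created. A minimal sketch:

```python
import os
import shutil
import tempfile
import unittest

class TestWithCleanup(unittest.TestCase):
    def setUp(self):
        self.testdir = tempfile.mkdtemp()
        # Registered immediately after creating the resource,
        # this cleanup runs even if a later line of setUp()
        # raises an exception.
        self.addCleanup(shutil.rmtree, self.testdir)

    def test_testdir_exists(self):
        self.assertTrue(os.path.isdir(self.testdir))
```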
When you write these method names, the camel-casing matters. People
sometimes misspell them as setup() or teardown(), then wonder why
they are not automatically invoked.

Any uncaught exception in either
setUp() or tearDown() will cause that test to fail, after which
unittest immediately skips to the next test. For errors in
setUp(), this means none of that test’s assertions will run (though
it still shows as a clear error in the output). For tearDown(), the
test is marked as failing, even if all the individual assertions
passed.
Sometimes your code is supposed to raise an exception, under certain conditions. If that condition occurs, and your code does not raise the correct exception, that’s a bug. How do you write test code for this situation?
You can verify that behavior with a special method of TestCase,
called assertRaises(). It’s used in a with statement in your test;
the block under the with statement is asserted to raise the
exception.
For example, suppose you are writing a library that translates Roman
numerals into integers. You might define a function called
roman2int():
>>> roman2int("XVI")
16
>>> roman2int("II")
2
In thinking about the best way to design this function, you decide
that passing nonsensical input to roman2int() should raise a
ValueError. Here’s how you write a test to assert that behavior:
import unittest
from roman import roman2int

class TestRoman(unittest.TestCase):
    def test_roman2int_error(self):
        with self.assertRaises(ValueError):
            roman2int("This is not a valid roman numeral.")
If you run this test, and roman2int() does not raise the error, this
is the result:
$ python -m unittest test_roman2int.py
F
=========================================================
FAIL: test_roman2int_error (test_roman2int.TestRoman)
--------------------------------------------------------
Traceback (most recent call last):
  File "/src/test_roman2int.py", line 7, in test_roman2int_error
    roman2int("This is not a valid roman numeral.")
AssertionError: ValueError not raised
--------------------------------------------------------
Ran 1 test in 0.000s

FAILED (failures=1)
When you fix the bug, and roman2int() raises ValueError like it
should, the test passes.
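assertRaises() can also capture the exception object, so you can make assertions about its message. The chapter does not define roman2int() itself, so this sketch includes a hypothetical minimal version (it handles only additive numerals, ignoring subtractive forms like "IV"):

```python
import unittest

# Hypothetical minimal roman2int(), for illustration only.
def roman2int(numeral):
    values = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100}
    total = 0
    for char in numeral:
        if char not in values:
            raise ValueError(f"not a roman numeral character: {char!r}")
        total += values[char]
    return total

class TestRomanMessage(unittest.TestCase):
    def test_error_message(self):
        # "as context" captures the raised exception...
        with self.assertRaises(ValueError) as context:
            roman2int("XQI")
        # ...so you can inspect it after the with block.
        self.assertIn("Q", str(context.exception))
```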
Sometimes you will want to iterate through many test inputs, to thoroughly validate the input range and cover many edge cases. You could simply write a parade of assert methods, but that becomes tediously repetitive, and more importantly, it will stop with the first failing assertion. Sometimes it is tremendously helpful to run all these assertions so that you have a full picture of which are passing and which are not.
Python’s unittest library supports this with a feature called
subtests. This lets you conveniently iterate through a potentially
large collection of test inputs, with well-presented (and easy to
comprehend) reporting output. Pytest calls its version of this feature
parameterized tests, which is probably a better name. But since we
are focused on unittest, we will call them subtests.
Imagine a function called numwords(), which counts the number of
unique words in a string (ignoring punctuation, capitalization, and whitespace):
>>> numwords("Good, good morning. Beautiful morning!")
3
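The chapter treats numwords() as a black box, but a plausible (correct) implementation might look like this sketch, so you can run the examples yourself:

```python
import re

def numwords(text):
    # Lowercase, then extract runs of letters; this discards
    # punctuation and every kind of whitespace, including tabs.
    words = re.findall(r"[a-z]+", text.lower())
    # Count unique words only.
    return len(set(words))
```

(The buggy behavior shown later in the chapter, where tabs are counted as words, would come from a flawed implementation, not this one.)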
Suppose you want to test how numwords() handles excess whitespace. You
can easily imagine a dozen different reasonable inputs that will
result in the same return value, and want to verify it can handle them
all. You might create something like this:
class TestWords(unittest.TestCase):
    def test_whitespace(self):
        self.assertEqual(2, numwords("foo bar"))
        self.assertEqual(2, numwords(" foo bar"))
        self.assertEqual(2, numwords("foo\tbar"))
        self.assertEqual(2, numwords("foo  bar"))
        self.assertEqual(2, numwords("foo bar \t \t"))
        # And so on, and so on...
Seems a bit repetitive, doesn’t it? The only thing varying is the
argument to numwords(). We might benefit from using a for loop:
def test_whitespace_forloop(self):
    texts = [
        "foo bar",
        " foo bar",
        "foo\tbar",
        "foo  bar",
        "foo bar \t \t",
    ]
    for text in texts:
        self.assertEqual(2, numwords(text))
At first glance, this seems better: more readable and maintainable. If
we add new variants, it’s just another line in the texts list. And
if I rename numwords(), I only need to change it in one place in the
test.
However, using a for loop like this creates more problems than it
solves. Suppose you run this test and get the following failure:
$ python -m unittest test_words_forloop.py
F
=========================================================
FAIL: test_whitespace_forloop (test_words_forloop.TestWords)
--------------------------------------------------------
Traceback (most recent call last):
  File "/src/test_words_forloop.py", line 17, in test_whitespace_forloop
    self.assertEqual(2, numwords(text))
AssertionError: 2 != 3
--------------------------------------------------------
Ran 1 test in 0.000s

FAILED (failures=1)
Look closely, and you’ll realize that numwords() returned 3 when it
was supposed to return 2.
Pop quiz: out of all the inputs in the list, which caused the bad return value?
The way we’ve written the test, there is no way to know. All you can infer is that at least one of the test inputs produced an incorrect value. You don’t know which one. That’s the first problem.
The second problem—which the original test also suffers from—is that everything stops with the first failed assertion. For this kind of function, knowing all the failing inputs, and the incorrect results they create, would be very helpful for quickly understanding what’s going on.
Subtests solve these problems. Our for-loop solution is actually quite
close. All we have to do is add one line. Do you see it below?
def test_whitespace_subtest(self):
    texts = [
        "foo bar",
        " foo bar",
        "foo\tbar",
        "foo  bar",
        "foo bar \t \t",
    ]
    for text in texts:
        with self.subTest(text=text):
            self.assertEqual(2, numwords(text))
Just inside the for loop, we write with
self.subTest(text=text). This creates a context in which assertions
can be made, and even fail. Regardless of whether they pass or not,
the test continues with the next iteration of the for loop. At the
end, all failures are collected and reported in the test result
output, like this:
$ python -m unittest test_words_subtest.py
=========================================================
FAIL: test_whitespace_subtest (test_words_subtest.TestWords) (text='foo\tbar')
--------------------------------------------------------
Traceback (most recent call last):
  File "/src/test_words_subtest.py", line 16, in test_whitespace_subtest
    self.assertEqual(2, numwords(text))
AssertionError: 2 != 3
=========================================================
FAIL: test_whitespace_subtest (test_words_subtest.TestWords) (text='foo bar \t \t')
--------------------------------------------------------
Traceback (most recent call last):
  File "/src/test_words_subtest.py", line 16, in test_whitespace_subtest
    self.assertEqual(2, numwords(text))
AssertionError: 2 != 4
--------------------------------------------------------
Ran 1 test in 0.000s

FAILED (failures=2)
Behold the opulence of information in this output:
Each individual failing input has its own detailed summary.
We are told what the full value of text was.
We are told what the actual returned value was, and it is clearly compared to the expected value.
No values are skipped. We can be confident that these two are the only failures.
This is much better. The two offending inputs are “foo\tbar” and
“foo bar \t \t”.
These are the only values containing tab characters, so
you can quickly realize the nature of the bug: tab characters are
being counted as separate words.
Let’s look at these three lines of code again:
for text in texts:
    with self.subTest(text=text):
        self.assertEqual(2, numwords(text))
The key-value arguments to self.subTest() are shown in the reporting
output. You can pass in whatever key-value pairs you like; they can be
anything that helps you understand exactly what is wrong when a test
fails. Often you will want to pass everything that varies from the
test cases; here, that is only the string passed to numwords().
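You can pass more than one parameter. For instance, if both the input and the expected count vary, each failure report will show both values; the helper function and data below are invented for this sketch:

```python
import unittest

def count_unique(text):
    # Hypothetical stand-in for numwords(), so this
    # sketch is self-contained.
    return len(set(text.split()))

class TestManyParams(unittest.TestCase):
    def test_varied_inputs(self):
        cases = [
            ("foo bar", 2),
            ("foo foo", 1),
            ("", 0),
        ]
        for text, expected in cases:
            # Both key-value pairs appear in the failure output.
            with self.subTest(text=text, expected=expected):
                self.assertEqual(expected, count_unique(text))
```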
Note that in these three lines, the symbol text is used for two
different things. Look at lines 1, 2, and 3 again:
for text in texts:
    with self.subTest(text=text):
        self.assertEqual(2, numwords(text))
In line 1, the text is the same variable that is passed to
numwords() on line 3. But on line 2, in the call to subTest(), you
have text=text. The left-hand text is actually a parameter that is
used in the reporting output if the test fails. The right-hand side
is the value of that parameter, which, in this case, happens to also
be called text.
It may be clearer if we use input_text as the parameter to subTest() instead:
for text in texts:
    with self.subTest(input_text=text):
        self.assertEqual(2, numwords(text))
Then the failure output might look like:
FAIL: test_whitespace_subtest (test_words_subtest.TestWords) (input_text='foo\tbar')
See how at the end of that FAIL line, it says input_text instead
of text? That is because we used a different parameter name when
calling subTest(). In fact, we can use whatever parameter name we
want, though it often works best to use the same identifier name
throughout.
Let’s recap the big ideas. Test-driven development means we create the test first, along with whatever stubs we need to make the test run. We then run it and watch it fail. This is an important step. You must run the test and see it fail.
This is important for two reasons. You don’t really know if the test is correct until you verify that it can fail. As you write automated tests more and more over time, you will be surprised at how often you write a test and run it, expecting to see it fail, only to discover it passes. As far as I can tell, every good, experienced software engineer still occasionally does this—even after doing TDD for years! This is why we build the habit of always verifying the test fails first.
The second reason is more subtle. As you gain experience with TDD and become comfortable with it, you will find the cycle of writing tests and making them pass lets you get into a state of flow. This means you are enjoyably productive and focused, in a way that is easy to maintain over time. You will get addicted to this.
Is it important that you strictly follow TDD? People have different opinions on this, some of them very strong. Personally, I went through a period of almost a year where I followed TDD quite strictly. As a result, I got very good at writing tests, and writing them rapidly.
Now that I’ve developed that level of skill, I find that I follow TDD most of the time, but less often than I did when learning. I have noticed that TDD is most powerful when I have great clarity on the software’s design, architecture and APIs; it helps me get into a cognitive state that seems accelerated, so I can more easily maintain my mental focus, and produce quality code faster.
But I find it very hard to write good tests when I don’t yet have that clarity—when I am still thinking through how I will structure the program and organize the code. In fact, I find TDD slows me down in that phase, as any test I write will probably have to be completely rewritten several times, if not deleted, before things stabilize. In these situations, I prefer to get a first version of the code working through manual testing, then write the tests afterward.
For this reason, if your particular job requires a lot of exploratory coding—data scientists, I am looking at you—then TDD may not be something you do all the time. If that is the case, there are many benefits to still doing it as much as you can. Remember, this is a superpower. But only if you use it.
No matter your situation, I encourage you to find a way to do strict TDD for a period of time, simply because of what it will teach you. As you develop your skill at writing tests, you can step back and evaluate how you want to integrate it into your daily routine.
1 https://docs.python.org/3/library/unittest.html
2 You may be in a third category, having a lot of experience with a non-xUnit testing framework. If so, you should probably pretend you’re in the first group. You’ll be able to move quickly.
3 If you haven’t done one of these yet, you will someday.
4 Remember from Chapter 6: __add__() is a magic method which makes binary addition with + work.
5 For running Python on the command line, this book uses python for the executable. But depending on how Python was installed on your computer, you may actually need to run python3 instead. Check by running python -V, which reports the version number. If it says 2.7 or lower, that is the legacy version; you want to run python3 instead.
6 In current Python versions, the test methods are executed in alphabetical order. This order is fragile, because it changes when you add a new test method.
7 This could be you, months or years down the road. Be considerate of your future self!