When you’re implementing Python programs that handle a non-trivial amount of data, you’ll often encounter slowdowns caused by the algorithmic complexity of your code. For example, programs you expected to scale linearly in the size of input data might actually grow quadratically, causing problems in production. Luckily, Python includes optimized implementations of many standard data structures and algorithms that can help you achieve high performance with minimal effort.
Similarly, Python provides built-in data types and helper functions for handling common tasks that frequently come up in programs: manipulating dates, times, and time zones; working with money values while preserving precision and controlling rounding behavior; and saving and restoring program state for users even as your software evolves over time. Writing code to handle these situations is fiddly and hard to get right. Having battle-tested implementations built into the language to handle them is a blessing.
Item 100: Sort by Complex Criteria Using the key Parameter

The list built-in type provides a sort method for ordering the items in a list instance based on a variety of criteria. (Don’t confuse the sort method with sorted; see Item 101: “Know the Difference Between sort and sorted.”) By default, sort orders a list’s contents by the natural ascending order of the items. For example, here I sort a list of integers from smallest to largest:
numbers = [93, 86, 11, 68, 70]
numbers.sort()
print(numbers)
>>>
[11, 68, 70, 86, 93]
The sort method works for nearly all built-in types (strings, floats, etc.) that have a natural ordering. What does sort do with objects? As an example, here I define a class—including a __repr__ method so instances are printable (see Item 12: “Understand the Difference Between repr and str when Printing Objects”)—to represent various tools you might need to use on a construction site:
class Tool:
    def __init__(self, name, weight):
        self.name = name
        self.weight = weight

    def __repr__(self):
        return f"Tool({self.name!r}, {self.weight})"

tools = [
    Tool("level", 3.5),
    Tool("hammer", 1.25),
    Tool("screwdriver", 0.5),
    Tool("chisel", 0.25),
]
Sorting objects of this type doesn’t work because the sort method tries to call comparison special methods that aren’t defined by the class:
tools.sort()
>>>
Traceback ...
TypeError: '<' not supported between instances of 'Tool' and
➥'Tool'
If your class should have a natural ordering as integers do, then you can define the necessary special methods (see Item 104: “Know How to Use heapq for Priority Queues” for an example and Item 57: “Inherit from collections.abc Classes for Custom Container Types” for background) to make sort work without any extra parameters. But the more common case is that your objects might need to support multiple orderings, in which case defining a single natural ordering really doesn’t make sense.
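If you do want a natural ordering, the functools.total_ordering decorator cuts down on the boilerplate: define __eq__ plus one comparison method, and the rest are derived for you. Here is a minimal sketch of that approach; the choice of weight as the natural order (and the OrderedTool name) is my assumption for illustration:

```python
import functools

# Hypothetical variant of the Tool class with a natural ordering by
# weight; total_ordering derives <=, >, and >= from __eq__ and __lt__.
@functools.total_ordering
class OrderedTool:
    def __init__(self, name, weight):
        self.name = name
        self.weight = weight

    def __repr__(self):
        return f"OrderedTool({self.name!r}, {self.weight})"

    def __eq__(self, other):
        return self.weight == other.weight

    def __lt__(self, other):
        return self.weight < other.weight

tools = [OrderedTool("level", 3.5), OrderedTool("chisel", 0.25)]
tools.sort()  # Works now: no key parameter required
```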
Often there’s an attribute on the object that you’d like to use for sorting. To support this use case, the sort method accepts a key parameter that’s expected to be a function (see Item 48: “Accept Functions Instead of Classes for Simple Interfaces” for background). The key function is passed a single argument, which is an item from the list that is being sorted. The return value of the key function should be a comparable value (i.e., with a natural ordering) to use in place of an item for sorting purposes.
Here I use the lambda keyword (see Item 39: “Prefer functools.partial over lambda Expressions for Glue Functions” for background) to define a function for the key parameter that enables me to sort the list of Tool objects alphabetically by the name attribute:
print("Unsorted:", repr(tools))
tools.sort(key=lambda x: x.name)
print("\nSorted: ", tools)
>>>
Unsorted: [Tool('level', 3.5), Tool('hammer', 1.25),
➥Tool('screwdriver', 0.5), Tool('chisel', 0.25)]
Sorted: [Tool('chisel', 0.25), Tool('hammer', 1.25),
➥Tool('level', 3.5), Tool('screwdriver', 0.5)]
I can just as easily define another lambda function to sort by the weight attribute and pass it as the key parameter to the sort method:
tools.sort(key=lambda x: x.weight)
print("By weight:", tools)
>>>
By weight: [Tool('chisel', 0.25), Tool('screwdriver', 0.5),
➥Tool('hammer', 1.25), Tool('level', 3.5)]
Within the lambda function that’s passed as the key parameter, you can access attributes of items as I’ve done here, index into items (for sequences, tuples, and dictionaries), or use any other valid expression.
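For instance, the same pattern works when the items being sorted are tuples or dictionaries rather than objects; this small sketch uses made-up records:

```python
# Sketch: a key function may index into each item instead of
# accessing attributes. These records are invented for illustration.
record_tuples = [("level", 3.5), ("hammer", 1.25), ("chisel", 0.25)]
record_tuples.sort(key=lambda x: x[1])  # Sort tuples by weight field

record_dicts = [
    {"name": "saw", "weight": 5},
    {"name": "drill", "weight": 4},
]
record_dicts.sort(key=lambda x: x["weight"])  # Sort dicts by a value
```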
For basic types like strings, you might even want to use the key function to do transformations on the values before sorting. For example, here I apply the lower method to each item in a list of place names to ensure that they’re in alphabetical order, ignoring any capitalization (since in the natural lexical ordering of strings, uppercase letters appear before lowercase letters):
places = ["home", "work", "New York", "Paris"]
places.sort()
print("Case sensitive: ", places)
places.sort(key=lambda x: x.lower())
print("Case insensitive:", places)
>>>
Case sensitive: ['New York', 'Paris', 'home', 'work']
Case insensitive: ['home', 'New York', 'Paris', 'work']
Sometimes you might need to use multiple criteria for sorting. For example, say that I have a list of power tools, and I want to sort them first by the weight attribute and then by name. How can I accomplish this?
power_tools = [
    Tool("drill", 4),
    Tool("circular saw", 5),
    Tool("jackhammer", 40),
    Tool("sander", 4),
]
The simplest solution in Python is to use the tuple type (see Item 56: “Prefer dataclasses for Creating Immutable Objects” for another approach). A tuple is an immutable sequence of arbitrary Python values. Tuples are comparable by default and have a natural ordering, meaning that they implement all the special methods, such as __lt__, that are required by the sort method. The tuple type implements special comparator methods by iterating over each position in the tuple and comparing the corresponding values, one index at a time. Here I show how this works when one tool is heavier than another:
saw = (5, "circular saw")
jackhammer = (40, "jackhammer")
assert not (jackhammer < saw) # Matches expectations
If the first position in the tuples being compared is equal—weight in this case—the tuple comparison will move to the second position, and so on:
drill = (4, "drill")
sander = (4, "sander")
assert drill[0] == sander[0] # Same weight
assert drill[1] < sander[1] # Alphabetically less
assert drill < sander # Thus, drill comes first
You can take advantage of this tuple comparison behavior in order to sort the list of power tools first by weight and then by name. Here I define a key function that returns a tuple containing the two attributes that I want to sort on, in order of priority:
power_tools.sort(key=lambda x: (x.weight, x.name))
print(power_tools)
>>>
[Tool('drill', 4), Tool('sander', 4), Tool('circular saw', 5),
➥Tool('jackhammer', 40)]
One limitation of having the key function return a tuple is that the direction of sorting for all criteria must be the same (either all in ascending order or all in descending order). If I provide the reverse parameter to the sort method, it will affect both criteria in the tuple the same way (note how sander now comes before drill instead of after):
power_tools.sort(
    key=lambda x: (x.weight, x.name),
    reverse=True,  # Makes all criteria descending
)
print(power_tools)
>>>
[Tool('jackhammer', 40), Tool('circular saw', 5),
➥Tool('sander', 4), Tool('drill', 4)]
For numerical values, it’s possible to mix sorting directions by using the unary minus operator in the key function. This negates one of the values in the returned tuple, effectively reversing its sort order while leaving the others intact. Here I use this approach to sort by weight descending and then by name ascending (note how sander now comes after drill instead of before):
power_tools.sort(key=lambda x: (-x.weight, x.name))
print(power_tools)
>>>
[Tool('jackhammer', 40), Tool('circular saw', 5), Tool
➥('drill', 4), Tool('sander', 4)]
Unfortunately, unary negation isn’t possible for all types. Here I try to achieve the same outcome by using the reverse argument to sort by weight descending and then negating the name attribute to put it in ascending order:
power_tools.sort(key=lambda x: (x.weight, -x.name),
                 reverse=True)
>>>
Traceback ...
TypeError: bad operand type for unary -: 'str'
For situations like this, Python provides a stable sorting algorithm. The sort method of the list type preserves the order of the input list when the key function returns values that are equal to each other. This means I can call sort multiple times on the same list to combine different criteria together. Here I produce the same sort ordering of weight descending and name ascending as I did above, but now I do it by using two separate calls to sort:
power_tools.sort(
    key=lambda x: x.name,  # Name ascending
)
power_tools.sort(
    key=lambda x: x.weight,  # Weight descending
    reverse=True,
)
print(power_tools)
>>>
[Tool('jackhammer', 40), Tool('circular saw', 5),
Tool('drill', 4), Tool('sander', 4)]
To understand why this works, note how the first call to sort puts the names in alphabetical order:
power_tools.sort(key=lambda x: x.name)
print(power_tools)
>>>
[Tool('circular saw', 5), Tool('drill', 4),
➥Tool('jackhammer', 40), Tool('sander', 4)]
When the second sort call by weight descending is made, the code sees that both sander and drill have a weight of 4. This causes the sort method to put both items into the final result list in the same order in which they appeared in the original list, thus preserving their relative ordering by name ascending:
power_tools.sort(
    key=lambda x: x.weight,
    reverse=True,
)
print(power_tools)
>>>
[Tool('jackhammer', 40), Tool('circular saw', 5),
➥Tool('drill', 4), Tool('sander', 4)]
This same approach can be used to combine as many different types of sorting criteria as you’d like in any direction. You just need to make sure that you execute the sorts in the opposite sequence of what you want the final list to contain. In this example, I wanted the sort order to be by weight descending and then by name ascending, so I had to do the name sort first, followed by the weight sort.
That said, the approach of having the key function return a tuple and using unary negation to mix sort orders is simpler to read and requires less code. I recommend using multiple calls to sort only if absolutely necessary.
The sort method of the list type can be used to rearrange a list’s contents by the natural ordering of built-in types like strings, integers, tuples, and so on.
The sort method doesn’t work for objects unless they define a natural ordering using special methods, which is uncommon.
The key parameter of the sort method can be used to supply a helper function that returns the value to use in place of each item from the list while sorting.
Returning a tuple from the key function allows you to combine multiple sorting criteria together. The unary minus operator can be used to reverse individual sort orders for types that allow it.
For types that can’t be negated, you can combine many sorting criteria together by calling the sort method multiple times using different key functions and reverse values, in the order of lowest-rank sort call to highest-rank sort call.
Item 101: Know the Difference Between sort and sorted

The most familiar way to sort data in Python is by using the list type’s built-in sort method (see Item 100: “Sort by Complex Criteria Using the key Parameter”). This method causes the data to be sorted in place, meaning the original list is modified, and the unsorted arrangement is no longer available. For example, here I alphabetize a list of butterfly names:
butterflies = ["Swallowtail", "Monarch", "Red Admiral"]
print(f"Before {butterflies}")
butterflies.sort()
print(f"After {butterflies}")
>>>
Before ['Swallowtail', 'Monarch', 'Red Admiral']
After ['Monarch', 'Red Admiral', 'Swallowtail']
Another way to sort data in Python is by using the sorted built-in function. This function sorts the contents of the given object and returns the results as a list while leaving the original intact:
original = ["Swallowtail", "Monarch", "Red Admiral"]
alphabetical = sorted(original)
print(f"Original {original}")
print(f"Sorted {alphabetical}")
>>>
Original ['Swallowtail', 'Monarch', 'Red Admiral']
Sorted ['Monarch', 'Red Admiral', 'Swallowtail']
The sorted built-in function can be used with any iterable object (see Item 21: “Be Defensive when Iterating over Arguments”), including tuples, dictionaries, and sets:
patterns = {"solid", "spotted", "cells"}
print(sorted(patterns))
>>>
['cells', 'solid', 'spotted']
It also supports the reverse and key parameters, just like the sort method of the list type (see Item 100: “Sort by Complex Criteria Using the key Parameter”):
legs = {"insects": 6, "spiders": 8, "lizards": 4}
sorted_legs = sorted(
    legs,
    key=lambda x: legs[x],
    reverse=True,
)
print(sorted_legs)
>>>
['spiders', 'insects', 'lizards']
There are two benefits of using sort over sorted. First, sort modifies the list in place, which means the memory consumed by your program will stay the same during and after the sort. On the other hand, sorted needs to make a copy of the iterable’s contents in order to produce a sorted list, which might double your program’s memory requirements. Second, sort can be a lot faster because it’s doing less work overall. The result list is already known and doesn’t need to be allocated or resized. Iteration only needs to occur over a list instead of over arbitrary iterable objects. If the data is already partially ordered, index reassignments can be avoided altogether.
However, there are two primary benefits of using sorted instead of sort. First, the original object is left alone, ensuring that you don’t inadvertently modify arguments supplied to your functions and cause perplexing bugs (see Item 30: “Know That Function Arguments Can Be Mutated” and Item 56: “Prefer dataclasses for Creating Immutable Objects” for background). Second, sorted works for any type of iterator, not just lists, which means your functions can rely on duck typing and be more flexible in the types they accept (see Item 25: “Be Cautious when Relying on Dictionary Insertion Ordering” for an example).
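For example, sorted can consume a generator directly, which a list’s sort method never could; here’s a quick sketch:

```python
# Sketch: sorted accepts any iterable, including a one-shot generator
# that has no sort method of its own.
def descending_squares(n):
    for i in range(n, 0, -1):
        yield i * i

result = sorted(descending_squares(4))
print(result)  # [1, 4, 9, 16]
```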
Arguably, sorted is more Pythonic than sort because it enables additional flexibility and is more explicit in how it produces results. The choice of which to use depends on the circumstances and what you’re trying to accomplish.
sort achieves maximum performance with minimal memory overhead but requires the target list to be modified in place.
sorted can process all types of iterators and collections as input, and won’t inadvertently mutate data.
Item 102: Consider Searching Sorted Sequences with bisect

It’s common to find yourself with a large amount of data in memory as a sorted list that you then want to search. For example, you may have loaded an English language dictionary to use for spell checking, or perhaps you have a list of dated financial transactions to audit for correctness.
Regardless of the data your specific program needs to process, searching for a specific value in a list takes linear time proportional to the list’s length when you call the index method:
data = list(range(10**5))
index = data.index(91234)
assert index == 91234
If you’re not sure whether the exact value you’re searching for is in the list, you might want to search for the closest index that is equal to or exceeds your goal value. The simplest way to do this is to linearly scan the list and compare each item to your goal value:
def find_closest(sequence, goal):
    for index, value in enumerate(sequence):
        if goal < value:
            return index
    raise ValueError(f"{goal} is out of bounds")
index = find_closest(data, 91234.56)
assert index == 91235
Python’s built-in bisect module provides better ways to accomplish these types of searches through ordered lists. You can use the bisect_left function to do an efficient binary search through any sequence of sorted items. The index it returns will either be where the item is already present in the list or where you’d want to insert the item in the list to keep it in sorted order:
from bisect import bisect_left
index = bisect_left(data, 91234) # Exact match
assert index == 91234
index = bisect_left(data, 91234.56) # Closest match
assert index == 91235
The complexity of the binary search algorithm used by the bisect module is logarithmic. This means searching in a list of length 1 million takes roughly the same amount of time with bisect as linearly searching a list of length 20 using the list.index method (math.log2(10**6) == 19.93...). So bisect is way faster!
I can verify this speed improvement for the example from above by using the timeit built-in module to run a microbenchmark (see Item 93: “Optimize Performance-Critical Code Using timeit Microbenchmarks” for details):
import random
import timeit
size = 10**5
iterations = 1000
data = list(range(size))
to_lookup = [random.randint(0, size)
             for _ in range(iterations)]

def run_linear(data, to_lookup):
    for index in to_lookup:
        data.index(index)

def run_bisect(data, to_lookup):
    for index in to_lookup:
        bisect_left(data, index)

baseline = (
    timeit.timeit(
        stmt="run_linear(data, to_lookup)",
        globals=globals(),
        number=10,
    )
    / 10
)
print(f"Linear search takes {baseline:.6f}s")
comparison = (
    timeit.timeit(
        stmt="run_bisect(data, to_lookup)",
        globals=globals(),
        number=10,
    )
    / 10
)
print(f"Bisect search takes {comparison:.6f}s")
speedup = baseline / comparison
print(f"{speedup:.1f}x faster")
>>>
Linear search takes 0.317685s
Bisect search takes 0.000197s
1610.1x faster
The best part about bisect is that it’s not limited to the list type; you can use it with any Python object that acts like a sequence (see Item 57: “Inherit from collections.abc Classes for Custom Container Types” for how to do that) containing values that have a natural ordering (see Item 104: “Know How to Use heapq for Priority Queues” for background). The module also provides additional features for more advanced situations (see https://docs.python.org/3/library/bisect.html).
Searching sorted data contained in a list takes linear time using the index method or a for loop with simple comparisons.
The bisect built-in module’s bisect_left function takes logarithmic time to search for values in sorted lists, which can be orders of magnitude faster than other approaches.
Item 103: Prefer deque for Producer-Consumer Queues

A common need in writing programs is a first-in, first-out (FIFO) queue, also known as a producer–consumer queue. A FIFO queue is used when one function gathers values to process and another function handles them in the order in which they were received. Often, programmers turn to Python’s built-in list type to act as a FIFO queue.
For example, say that I’m writing a program that processes incoming emails for long-term archival, and it’s using a list for a producer–consumer queue. Here I define a class to represent the messages:
class Email:
    def __init__(self, sender, receiver, message):
        self.sender = sender
        self.receiver = receiver
        self.message = message
    ...
I also define a placeholder function for receiving a single email, presumably from a socket, the file system, or some other type of I/O system. The implementation of this function doesn’t matter; what’s important is its interface: It will either return an Email instance or raise a NoEmailError exception (see Item 32: “Prefer Raising Exceptions to Returning None” for more about this convention):
class NoEmailError(Exception):
    pass

def try_receive_email():
    # Returns an Email instance or raises NoEmailError
    ...
The producing function receives emails and enqueues them to be consumed at a later time. This function uses the append method on the list to add new messages to the end of the queue so they are processed after all messages that were previously received:
def produce_emails(queue):
    while True:
        try:
            email = try_receive_email()
        except NoEmailError:
            return
        else:
            queue.append(email)  # Producer
The consuming function is what does something useful with the emails. This function calls pop(0) on the queue, which removes the very first item from the list and returns it to the caller. By always processing items from the beginning of the queue, the consumer ensures that the items are processed in the order in which they were received:
def consume_one_email(queue):
    if not queue:
        return
    email = queue.pop(0)  # Consumer
    # Index the message for long-term archival
    ...
Finally, I need a looping function that connects the pieces together. This function alternates between producing and consuming until the keep_running function returns False (see Item 75: “Achieve Highly Concurrent I/O with Coroutines” on how to do this concurrently):
def loop(queue, keep_running):
    while keep_running():
        produce_emails(queue)
        consume_one_email(queue)

def my_end_func():
    ...

loop([], my_end_func)
Why not process each Email object in produce_emails as it’s returned by try_receive_email? It comes down to the trade-off between latency and throughput. When using producer–consumer queues, you often want to minimize the latency of accepting new items so they can be collected as fast as possible. The consumer can then process through the backlog of items at a consistent pace—one item per loop in this case—which provides a stable performance profile and consistent throughput at the cost of end-to-end latency (see Item 70: “Use Queue to Coordinate Work Between Threads” for related best practices).
Using a list for a producer–consumer queue like this works fine up to a point, but as the cardinality—the number of items in the list—increases, the list type’s performance can degrade superlinearly. To analyze the performance of using list as a FIFO queue, I can run some microbenchmarks using the timeit built-in module (see Item 93: “Optimize Performance-Critical Code Using timeit Microbenchmarks” for details). Here I define a benchmark for the performance of adding new items to the queue using the append method of list (matching the producer function’s usage):
import timeit
def list_append_benchmark(count):
    def run(queue):
        for i in range(count):
            queue.append(i)

    return timeit.timeit(
        setup="queue = []",
        stmt="run(queue)",
        globals=locals(),
        number=1,
    )
Running this benchmark function with different levels of cardinality lets me compare its performance in relationship to data size:
for i in range(1, 6):
    count = i * 1_000_000
    delay = list_append_benchmark(count)
    print(f"Count {count:>5,} takes: {delay*1e3:>6.2f}ms")
>>>
Count 1,000,000 takes: 13.23ms
Count 2,000,000 takes: 26.50ms
Count 3,000,000 takes: 39.06ms
Count 4,000,000 takes: 51.98ms
Count 5,000,000 takes: 65.19ms
This shows that the append method takes roughly constant time for the list type, and the total time for enqueueing scales linearly as the data size increases. There is overhead for the list type to increase its capacity under the covers as new items are added, but it’s reasonably low and is amortized across repeated calls to append.
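You can glimpse this amortized growth with sys.getsizeof: in CPython, the list over-allocates capacity, so its allocated size jumps in occasional steps rather than on every append. This is only an illustrative sketch; the exact sizes and step counts vary by Python version and platform:

```python
import sys

items = []
growth_steps = 0
last_size = sys.getsizeof(items)
for i in range(1000):
    items.append(i)
    size = sys.getsizeof(items)
    if size != last_size:  # Capacity was increased under the covers
        growth_steps += 1
        last_size = size

# Far fewer growth events than appends: the growth cost is amortized
print(f"{growth_steps} capacity increases for 1,000 appends")
```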
Here I define a similar benchmark for the pop(0) call that removes items from the beginning of the queue (matching the consumer function’s usage):
def list_pop_benchmark(count):
    def prepare():
        return list(range(count))

    def run(queue):
        while queue:
            queue.pop(0)

    return timeit.timeit(
        setup="queue = prepare()",
        stmt="run(queue)",
        globals=locals(),
        number=1,
    )
I can similarly run this benchmark for queues of different sizes to see how performance is affected by cardinality:
for i in range(1, 6):
    count = i * 10_000
    delay = list_pop_benchmark(count)
    print(f"Count {count:>5,} takes: {delay*1e3:>6.2f}ms")
>>>
Count 10,000 takes: 4.98ms
Count 20,000 takes: 22.21ms
Count 30,000 takes: 60.04ms
Count 40,000 takes: 109.96ms
Count 50,000 takes: 176.92ms
Surprisingly, this shows that the total time for dequeuing items from a list with pop(0) scales quadratically as the length of the queue increases. This happens because pop(0) needs to move every item in the list back an index, effectively reassigning the entire list’s contents. I need to call pop(0) for every item in the list, and so I end up doing roughly len(queue) * len(queue) operations to consume the queue. This doesn’t scale.
Python provides the deque class from the collections built-in module to solve this problem. deque is a double-ended queue implementation. It provides constant time operations for inserting or removing items from its beginning or end. This makes it ideal for FIFO queues.
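Here’s a minimal sketch of the two deque operations that matter for a FIFO queue:

```python
from collections import deque

queue = deque()
queue.append("first")   # Producer side: push onto the right end
queue.append("second")
item = queue.popleft()  # Consumer side: pop from the left end
assert item == "first"  # Oldest item comes out first (FIFO)
```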
To use the deque class, the call to append in produce_emails can stay the same as it was when using a list for the queue. The list.pop method call in consume_one_email must change to call the deque.popleft method with no arguments instead. And the loop function must be called with a deque instance instead of a list. Everything else stays the same. Here I redefine the one function affected to use the new method and run loop again:
import collections
def consume_one_email(queue):
    if not queue:
        return
    email = queue.popleft()  # Consumer
    # Process the email message
    ...

def my_end_func():
    ...

loop(collections.deque(), my_end_func)
I can run another version of the benchmark to verify that the performance of append (matching the producer function’s usage) has stayed roughly the same (modulo a constant factor):
def deque_append_benchmark(count):
    def prepare():
        return collections.deque()

    def run(queue):
        for i in range(count):
            queue.append(i)

    return timeit.timeit(
        setup="queue = prepare()",
        stmt="run(queue)",
        globals=locals(),
        number=1,
    )

for i in range(1, 6):
    count = i * 100_000
    delay = deque_append_benchmark(count)
    print(f"Count {count:>5,} takes: {delay*1e3:>6.2f}ms")
>>>
Count 100,000 takes: 1.68ms
Count 200,000 takes: 3.16ms
Count 300,000 takes: 5.05ms
Count 400,000 takes: 6.81ms
Count 500,000 takes: 8.43ms
And I can also benchmark the performance of calling popleft to mimic the consumer function’s usage of deque:
def deque_popleft_benchmark(count):
    def prepare():
        return collections.deque(range(count))

    def run(queue):
        while queue:
            queue.popleft()

    return timeit.timeit(
        setup="queue = prepare()",
        stmt="run(queue)",
        globals=locals(),
        number=1,
    )

for i in range(1, 6):
    count = i * 100_000
    delay = deque_popleft_benchmark(count)
    print(f"Count {count:>5,} takes: {delay*1e3:>6.2f}ms")
>>>
Count 100,000 takes: 1.67ms
Count 200,000 takes: 3.59ms
Count 300,000 takes: 5.65ms
Count 400,000 takes: 7.50ms
Count 500,000 takes: 9.58ms
The popleft usage scales linearly, in contrast to the superlinear behavior of pop(0) that I measured before—hooray! If you know that the performance of your program critically depends on the speed of your producer–consumer queues, then deque is a great choice. If you’re not sure, then you should instrument your program to find out (see Item 92: “Profile Before Optimizing”).
The list type can be used as a FIFO queue by having the producer call append to add items and the consumer call pop(0) to receive items. However, this may cause problems because the performance of pop(0) degrades superlinearly as the queue length increases.
The deque class from the collections built-in module takes constant time—regardless of length—for append and popleft, making it ideal for FIFO queues.
Item 104: Know How to Use heapq for Priority Queues

One of the limitations of Python’s other queue implementations (see Item 103: “Prefer deque for Producer-Consumer Queues” and Item 70: “Use Queue to Coordinate Work Between Threads”) is that they are first-in, first-out (FIFO) queues: Their contents are sorted by the order in which they were received. Often, you need a program to process items in order of relative importance instead. To accomplish this, a priority queue is the right tool for the job.
For example, say that I’m writing a program to manage books borrowed from a library. There are people constantly borrowing new books. There are people returning their borrowed books on time. And there are people who need to be reminded to return their overdue books. Here I define a class to represent a book that’s been borrowed:
class Book:
    def __init__(self, title, due_date):
        self.title = title
        self.due_date = due_date
I need a system that will send a reminder message when each book passes its due date. Unfortunately, I can’t use a FIFO queue for this because the amount of time each book is allowed to be borrowed varies based on its recency, popularity, and other factors. For example, a book that is borrowed today may be due back later than a book that’s borrowed tomorrow. Here I achieve this behavior by using a standard list and sorting it by the due_date attribute each time a new Book object is added:
def add_book(queue, book):
    queue.append(book)
    queue.sort(key=lambda x: x.due_date, reverse=True)

queue = []
add_book(queue, Book("Don Quixote", "2019-06-07"))
add_book(queue, Book("Frankenstein", "2019-06-05"))
add_book(queue, Book("Les Misérables", "2019-06-08"))
add_book(queue, Book("War and Peace", "2019-06-03"))
If I can assume that the queue of borrowed books is always in sorted order, then all I need to do to check for overdue books is to inspect the final element in the list. Here I define a function to return the next overdue book, if any, and remove it from the queue:
class NoOverdueBooks(Exception):
    pass

def next_overdue_book(queue, now):
    if queue:
        book = queue[-1]
        if book.due_date < now:
            queue.pop()
            return book
    raise NoOverdueBooks
I can call this function repeatedly to get overdue books to remind people about in order from most overdue to least overdue:
now = "2019-06-10"
found = next_overdue_book(queue, now)
print(found.due_date, found.title)
found = next_overdue_book(queue, now)
print(found.due_date, found.title)
>>>
2019-06-03 War and Peace
2019-06-05 Frankenstein
If a book is returned before the due date, I can remove the scheduled reminder message by removing the book from the list:
def return_book(queue, book):
    queue.remove(book)
queue = []
book = Book("Treasure Island", "2019-06-04")
add_book(queue, book)
print("Before return:", [x.title for x in queue])
return_book(queue, book)
print("After return: ", [x.title for x in queue])
>>>
Before return: ['Treasure Island']
After return: []
And I can confirm that when all books are returned, the next_overdue_book function will raise the right exception (see Item 32: “Prefer Raising Exceptions to Returning None” for this convention):
try:
    next_overdue_book(queue, now)
except NoOverdueBooks:
    pass  # Expected
else:
    assert False  # Doesn't happen
However, the computational complexity of this solution isn’t ideal. Although checking for and removing an overdue book has a constant cost, every time I add a book, I pay the cost of sorting the whole list again. If I have len(queue) books to add, and the cost of sorting them is roughly len(queue) * math.log(len(queue)), the time it takes to add books will grow superlinearly (len(queue) * len(queue) * math.log(len(queue))).
Here I define a microbenchmark to measure this performance behavior experimentally by using the timeit built-in module (see Item 93: “Optimize Performance-Critical Code Using timeit Microbenchmarks” for details):
import random
import timeit
def list_overdue_benchmark(count):
    def prepare():
        to_add = list(range(count))
        random.shuffle(to_add)
        return [], to_add

    def run(queue, to_add):
        for i in to_add:
            queue.append(i)
            queue.sort(reverse=True)

        while queue:
            queue.pop()

    return timeit.timeit(
        setup="queue, to_add = prepare()",
        stmt="run(queue, to_add)",
        globals=locals(),
        number=1,
    )
I can verify that the runtime of adding and removing books from the queue scales superlinearly as the number of books being borrowed increases:
for i in range(1, 6):
    count = i * 1_000
    delay = list_overdue_benchmark(count)
    print(f"Count {count:>5,} takes: {delay*1e3:>6.2f}ms")
>>>
Count 1,000 takes: 1.74ms
Count 2,000 takes: 5.87ms
Count 3,000 takes: 11.12ms
Count 4,000 takes: 19.80ms
Count 5,000 takes: 31.02ms
When a book is returned before the due date, I need to do a linear scan in order to find the book in the queue and remove it. Removing a book causes all subsequent items in the list to be shifted back an index, which has a high cost that also scales superlinearly. Here I define another microbenchmark to test the performance of returning a book using this function:
def list_return_benchmark(count):
    def prepare():
        queue = list(range(count))
        random.shuffle(queue)
        to_return = list(range(count))
        random.shuffle(to_return)
        return queue, to_return

    def run(queue, to_return):
        for i in to_return:
            queue.remove(i)

    return timeit.timeit(
        setup="queue, to_return = prepare()",
        stmt="run(queue, to_return)",
        globals=locals(),
        number=1,
    )
And again, I can verify that indeed the performance degrades superlinearly as the number of books increases:
for i in range(1, 6):
    count = i * 1_000
    delay = list_return_benchmark(count)
    print(f"Count {count:>5,} takes: {delay*1e3:>6.2f}ms")
>>>
Count 1,000 takes: 1.97ms
Count 2,000 takes: 6.99ms
Count 3,000 takes: 14.59ms
Count 4,000 takes: 26.12ms
Count 5,000 takes: 40.38ms
Using the methods of list may work for a tiny library, but it certainly won’t scale to the size of the Great Library of Alexandria, as I want it to!
Fortunately, Python has the built-in heapq module that solves this problem by implementing priority queues efficiently. A heap is a data structure that maintains a list of items where adding a new item or removing the smallest item has logarithmic computational complexity (i.e., even better than linear scaling). In the book-borrowing example, smallest means the book with the earliest due date. The best part about heapq is that you don’t have to understand how heaps are implemented in order to use the module’s functions correctly.
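Before applying heapq to books, a minimal sketch with plain integers shows the push/pop behavior:

```python
from heapq import heappush, heappop

heap = []
for value in [7, 3, 9, 1]:
    heappush(heap, value)  # Each push costs O(log n)

assert heap[0] == 1  # The smallest item is always at index 0
ordered = [heappop(heap) for _ in range(len(heap))]
print(ordered)  # [1, 3, 7, 9]
```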
Here I reimplement the add_book function using the heapq module. The queue is still a plain list. The heappush function replaces the list.append call from before. And I no longer have to call list.sort on the queue:
from heapq import heappush
def add_book(queue, book):
    heappush(queue, book)
If I try to use this with the Book class as previously defined, I get this somewhat cryptic error:
queue = []
add_book(queue, Book("Little Women", "2019-06-05"))
add_book(queue, Book("The Time Machine", "2019-05-30"))
>>>
Traceback ...
TypeError: '<' not supported between instances of 'Book' and
➥'Book'
This happens because the heapq module requires items in the priority queue to be comparable and have a natural sort order (see Item 100: “Sort by Complex Criteria Using the key Parameter” for details). You can quickly give the Book class this behavior by using the total_ordering class decorator from the functools built-in module (see Item 66: “Prefer Class Decorators over Metaclasses for Composable Class Extensions” for background) and implementing the __lt__ special method (see Item 57: “Inherit from collections.abc Classes for Custom Container Types” for background).
Here I redefine the class with a less-than method (__lt__) that simply compares the due_date fields between two Book instances:
import functools
@functools.total_ordering
class Book:
    def __init__(self, title, due_date):
        self.title = title
        self.due_date = due_date

    def __lt__(self, other):
        return self.due_date < other.due_date
Now I can add books to the priority queue without any issues by using the heapq.heappush function:
queue = []
add_book(queue, Book("Pride and Prejudice", "2019-06-01"))
add_book(queue, Book("The Time Machine", "2019-05-30"))
add_book(queue, Book("Crime and Punishment", "2019-06-06"))
add_book(queue, Book("Wuthering Heights", "2019-06-12"))
Alternatively, I can create a list with all the books in any order and then use the sort method of list to produce the heap:
queue = [
    Book("Pride and Prejudice", "2019-06-01"),
    Book("The Time Machine", "2019-05-30"),
    Book("Crime and Punishment", "2019-06-06"),
    Book("Wuthering Heights", "2019-06-12"),
]
queue.sort()
Or I can use the heapq.heapify function to create a heap in linear time (as opposed to the sort method’s len(queue) * log(len(queue)) complexity):
from heapq import heapify
queue = [
    Book("Pride and Prejudice", "2019-06-01"),
    Book("The Time Machine", "2019-05-30"),
    Book("Crime and Punishment", "2019-06-06"),
    Book("Wuthering Heights", "2019-06-12"),
]
heapify(queue)
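Note that heapify only establishes the heap invariant; it does not fully sort the list. This sketch (repeating the Book definition from earlier in this item) shows that only the first element is guaranteed to be the smallest:

```python
import functools
from heapq import heapify

@functools.total_ordering
class Book:
    def __init__(self, title, due_date):
        self.title = title
        self.due_date = due_date

    def __lt__(self, other):
        return self.due_date < other.due_date

queue = [
    Book("Pride and Prejudice", "2019-06-01"),
    Book("The Time Machine", "2019-05-30"),
    Book("Crime and Punishment", "2019-06-06"),
]
heapify(queue)

# Only the head of the heap is guaranteed to have the earliest due date;
# the rest of the list is not necessarily in sorted order
assert queue[0].due_date == "2019-05-30"
```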
To check for overdue books, I inspect the first element in the list instead of the last, and then I use the heapq.heappop function instead of the list.pop function to remove the book from the heap:
from heapq import heappop
def next_overdue_book(queue, now):
    if queue:
        book = queue[0]  # Most overdue first
        if book.due_date < now:
            heappop(queue)  # Remove the overdue book
            return book
    raise NoOverdueBooks
Now I can find and remove overdue books in order until there are none left for the current time:
now = "2019-06-02"
book = next_overdue_book(queue, now)
print(book.due_date, book.title)
book = next_overdue_book(queue, now)
print(book.due_date, book.title)
try:
    next_overdue_book(queue, now)
except NoOverdueBooks:
    pass  # Expected
else:
    assert False  # Doesn't happen
>>>
2019-05-30 The Time Machine
2019-06-01 Pride and Prejudice
I can write another microbenchmark to test the performance of this implementation that uses the heapq module:
def heap_overdue_benchmark(count):
    def prepare():
        to_add = list(range(count))
        random.shuffle(to_add)
        return [], to_add

    def run(queue, to_add):
        for i in to_add:
            heappush(queue, i)
        while queue:
            heappop(queue)

    return timeit.timeit(
        setup="queue, to_add = prepare()",
        stmt="run(queue, to_add)",
        globals=locals(),
        number=1,
    )
This benchmark experimentally verifies that the heap-based priority queue implementation scales much better (roughly len(queue) * math.log(len(queue))) without superlinearly degrading performance:
for i in range(1, 6):
    count = i * 10_000
    delay = heap_overdue_benchmark(count)
    print(f"Count {count:>5,} takes: {delay*1e3:>6.2f}ms")
>>>
Count 10,000 takes: 1.73ms
Count 20,000 takes: 3.83ms
Count 30,000 takes: 6.50ms
Count 40,000 takes: 8.85ms
Count 50,000 takes: 11.43ms
With the heapq implementation, one question remains: How should I handle returns that are on time? The solution is to never remove a book from the priority queue until its due date. At that time, it will be the first item in the list, and I can simply ignore the book if it’s already been returned. Here I implement this behavior by adding a new field to track the book’s return status:
@functools.total_ordering
class Book:
    def __init__(self, title, due_date):
        self.title = title
        self.due_date = due_date
        self.returned = False  # New field

    ...
Then I change the next_overdue_book function to repeatedly ignore any book that’s already been returned:
def next_overdue_book(queue, now):
    while queue:
        book = queue[0]
        if book.returned:
            heappop(queue)
            continue
        if book.due_date < now:
            heappop(queue)
            return book
        break
    raise NoOverdueBooks
This approach makes the return_book function extremely fast because it makes no modifications to the priority queue:
def return_book(queue, book):
    book.returned = True
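Putting the pieces together, here is a self-contained sketch (repeating the definitions from this item) that demonstrates the lazy-removal behavior: a book returned on time is silently skipped the next time the head of the heap is inspected:

```python
import functools
from heapq import heappush, heappop

class NoOverdueBooks(Exception):
    pass

@functools.total_ordering
class Book:
    def __init__(self, title, due_date):
        self.title = title
        self.due_date = due_date
        self.returned = False

    def __lt__(self, other):
        return self.due_date < other.due_date

def next_overdue_book(queue, now):
    while queue:
        book = queue[0]
        if book.returned:
            heappop(queue)  # Lazily discard books already returned
            continue
        if book.due_date < now:
            heappop(queue)
            return book
        break
    raise NoOverdueBooks

def return_book(queue, book):
    book.returned = True  # O(1): the queue itself is untouched

queue = []
early = Book("The Time Machine", "2019-05-30")
late = Book("Pride and Prejudice", "2019-06-01")
heappush(queue, early)
heappush(queue, late)

return_book(queue, early)  # Returned on time
found = next_overdue_book(queue, "2019-06-02")
print(found.title)  # Pride and Prejudice
```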
The downside of this solution for returns is that the priority queue may grow to the maximum size needed if all books from the library were checked out and went overdue. Although the queue operations will be fast thanks to heapq, this storage overhead might require significant memory (see Item 115: “Use tracemalloc to Understand Memory Usage and Leaks” for how to debug such usage).
If you’re trying to build a robust system, you will need to plan for the worst-case scenario; thus, you should expect that it’s possible for every library book to go overdue for some reason (e.g., a natural disaster closes the road to the library). This memory cost is a design consideration that you should have already planned for and mitigated through additional constraints (e.g., imposing a maximum number of simultaneously lent books).
Beyond the priority queue primitives that I’ve used in this example, the heapq module also provides additional functionality for advanced use cases (see https://docs.python.org/3/library/heapq.html). The module is a great choice when its functionality matches the problem you’re facing (see the queue.PriorityQueue class for another thread-safe option).
A priority queue allows you to process items in order of importance instead of in first-in, first-out order.
If you try to use list operations to implement a priority queue, your program’s performance will degrade superlinearly as the queue grows.
The heapq built-in module provides all the functions you need to implement a priority queue that scales efficiently.
To use heapq, the items being prioritized must have a natural sort order, which requires special methods like __lt__ to be defined for classes.
datetime Instead of time for Local Clocks
Coordinated Universal Time (UTC) is the standard, time zone–independent representation of time. UTC works great for computers that represent time as seconds since the UNIX epoch. But UTC isn’t ideal for humans. Humans reference time relative to where they’re currently located. People say “noon” or “8 a.m.” instead of “UTC 15:00 minus 7 hours.” If your program handles time, you’ll probably find yourself converting time between UTC and local clocks for the sake of human understanding.
Python provides two ways of accomplishing time zone conversions. The old way, using the time built-in module, is terribly error prone. The new way, using datetime, works great with some help from the built-in zoneinfo module.
You should be acquainted with both time and datetime to thoroughly understand why datetime is the best choice and time should be avoided.
time Module
The localtime function from the time built-in module lets you convert a UNIX timestamp (seconds since the UNIX epoch in UTC) to a local time that matches the host computer’s time zone (Pacific Daylight Time in my case). This local time can be printed in human-readable format using the strftime function:
import time
now = 1710047865.0
local_tuple = time.localtime(now)
time_format = "%Y-%m-%d %H:%M:%S"
time_str = time.strftime(time_format, local_tuple)
print(time_str)
>>>
2024-03-09 21:17:45
You’ll often need to go the other way as well, starting with user input in human-readable local time and converting it to UTC time. You can do this by using the strptime function to parse the time string and then calling mktime to convert local time to a UNIX timestamp:
time_tuple = time.strptime(time_str, time_format)
utc_now = time.mktime(time_tuple)
print(utc_now)
>>>
1710047865.0
How do you convert local time in one time zone to local time in another time zone? For example, say that I’m taking a flight between San Francisco and New York, and I want to know what time it will be in San Francisco when I’ve arrived in New York.
I might initially assume that I can directly manipulate the return values from the time, localtime, and strptime functions to do time zone conversions. But this is a very bad idea. Time zones change all the time due to local laws. It’s too complicated to manage yourself, especially if you want to handle every global city for flight departures and arrivals.
Many operating systems have configuration files that automatically keep up with the time zone changes. Python lets you use these time zones through the time module if your platform supports it. On other platforms, such as Windows, some time zone functionality isn’t available from time at all. For example, here I parse a departure time from the San Francisco time zone, Pacific Standard Time (PST):
parse_format = "%Y-%m-%d %H:%M:%S %Z"
depart_sfo = "2024-03-09 21:17:45 PST"
time_tuple = time.strptime(depart_sfo, parse_format)
time_str = time.strftime(time_format, time_tuple)
print(time_str)
>>>
2024-03-09 21:17:45
After seeing that "PST" works with the strptime function, I might also assume that other time zones known to my computer will work. Unfortunately, this isn’t the case. strptime raises an exception when it sees Eastern Daylight Time (EDT), which is the time zone for New York (when the flight arrives after the time change):
arrival_nyc = "2024-03-10 03:31:18 EDT"
time_tuple = time.strptime(arrival_nyc, parse_format)
>>>
Traceback ...
ValueError: time data '2024-03-10 03:31:18 EDT' does not match
➥format '%Y-%m-%d %H:%M:%S %Z'
The problem here is the platform-dependent nature of the time module: Its actual behavior is determined by how the underlying C functions work with the host operating system, which makes the module unreliable for working with multiple local times. Thus, you should avoid using the time module for this purpose. If you must use time, use it only to convert between UTC and the host computer’s local time. For all other types of conversions, use the datetime module.
datetime Module
The second option for representing times in Python is the datetime class from the datetime built-in module. Like the time module, datetime can be used to convert from the current time in UTC to local time.
Here I convert the present time in UTC to my computer’s local time, PDT:
from datetime import datetime, timezone
now = datetime(2024, 3, 10, 5, 17, 45)
now_utc = now.replace(tzinfo=timezone.utc)
now_local = now_utc.astimezone()
print(now_local)
>>>
2024-03-09 21:17:45-08:00
The datetime module can also easily convert a local time back to a UNIX timestamp in UTC (matching the value above):
time_str = "2024-03-09 21:17:45"
now = datetime.strptime(time_str, time_format)
time_tuple = now.timetuple()
utc_now = time.mktime(time_tuple)
print(utc_now)
>>>
1710047865.0
Unlike the time module, the datetime module has facilities for reliably converting from one local time to another local time. However, datetime only provides the machinery for time zone operations with its tzinfo class and related methods.
Fortunately, since Python version 3.9, on many systems the zoneinfo built-in module contains a full database of every time zone definition you might need. On other systems, such as Windows, the officially endorsed tzdata community package might need to be installed to provide the time zone database that the zoneinfo module needs to function properly (see Item 116: “Know Where to Find Community-Built Modules” for details).
To use zoneinfo effectively, you should always convert local times to UTC first. Next, perform any datetime operations you need on the UTC values (such as offsetting). Finally, convert to local times.
For example, here I convert a New York City flight arrival time to a UTC datetime:
from zoneinfo import ZoneInfo
arrival_nyc = "2024-03-10 03:31:18"
nyc_dt_naive = datetime.strptime(arrival_nyc, time_format)
eastern = ZoneInfo("US/Eastern")
nyc_dt = nyc_dt_naive.replace(tzinfo=eastern)
utc_dt = nyc_dt.astimezone(timezone.utc)
print("EDT:", nyc_dt)
print("UTC:", utc_dt)
>>>
EDT: 2024-03-10 03:31:18-04:00
UTC: 2024-03-10 07:31:18+00:00
Once I have a UTC datetime, I can convert it to San Francisco local time:
pacific = ZoneInfo("US/Pacific")
sf_dt = utc_dt.astimezone(pacific)
print("PST:", sf_dt)
>>>
PST: 2024-03-09 23:31:18-08:00
Just as easily, I can convert it to the local time in Nepal:
nepal = ZoneInfo("Asia/Katmandu")
nepal_dt = utc_dt.astimezone(nepal)
print("NPT", nepal_dt)
>>>
NPT 2024-03-10 13:16:18+05:45
With datetime and zoneinfo, these conversions are consistent across all environments, regardless of what operating system the host computer is running.
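The whole UTC-pivot pattern can be wrapped in a small helper (convert_zone is a hypothetical name of my own, not a standard function):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def convert_zone(dt_naive, src_zone, dst_zone):
    # Interpret the naive datetime in the source zone, pivot through
    # UTC, then express the result in the destination zone
    src_dt = dt_naive.replace(tzinfo=ZoneInfo(src_zone))
    utc_dt = src_dt.astimezone(timezone.utc)
    return utc_dt.astimezone(ZoneInfo(dst_zone))

# Same New York arrival as above, converted to San Francisco time
arrival = datetime(2024, 3, 10, 3, 31, 18)
sf_dt = convert_zone(arrival, "US/Eastern", "US/Pacific")
print(sf_dt)  # 2024-03-09 23:31:18-08:00
```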
Avoid using the time module for translating between different time zones.
Use the datetime and zoneinfo built-in modules to reliably convert between times and dates in different time zones.
Always represent time in UTC and do conversions to local time as the very final step before presentation.
decimal when Precision Is Paramount
Python is an excellent language for writing code that interacts with numerical data. Python’s integer type can represent values of any practical size. Its double-precision floating point type complies with the IEEE 754 standard. The language also provides a standard complex number type for imaginary values. However, these aren’t enough for every situation.
For example, say that I want to compute the amount to charge a customer for an international phone call via a portable satellite phone. I know the time in minutes and seconds that the customer was on the phone (say, 3 minutes 42 seconds). I also have a set rate for the cost of calling Antarctica from the United States (say, $1.45/minute). What should the charge be?
With floating point math, the computed charge seems reasonable:
rate = 1.45
seconds = 3 * 60 + 42
cost = rate * seconds / 60
print(cost)
>>>
5.364999999999999
The result is a tiny fraction short of the correct value (5.365) due to how IEEE 754 floating point numbers are represented. I might want to round this value up to 5.37 to properly cover all costs incurred by the customer. However, due to floating point error, rounding to the nearest whole cent actually reduces the final charge to 5.36 instead of increasing it to 5.37:
print(round(cost, 2))
>>>
5.36
The solution is to use the Decimal class from the decimal built-in module. The Decimal class provides fixed point math of 28 decimal places by default—and it can go even higher, if required. This works around the precision issues in IEEE 754 floating point numbers. The class also gives you more control over rounding behaviors.
For example, redoing the Antarctica calculation with Decimal results in the exact expected charge instead of an approximation:
from decimal import Decimal
rate = Decimal("1.45")
seconds = Decimal(3 * 60 + 42)
cost = rate * seconds / Decimal(60)
print(cost)
>>>
5.365
Decimal instances can be given starting values in two different ways. The first way is by passing a str containing the number to the Decimal constructor. This ensures that there is no loss of precision due to the inherent nature of Python floating point values and number constants. The second way is by directly passing a float or an int instance to the constructor. Here you can see that the two construction methods result in different behavior:
print(Decimal("1.45"))
print(Decimal(1.45))
>>>
1.45
1.4499999999999999555910790149937383830547332763671875
This problem doesn’t occur if I supply integers to the Decimal constructor:
print(Decimal("456"))
print(Decimal(456))
>>>
456
456
If you care about exact answers, err on the side of caution and always use the str constructor for the Decimal type.
Getting back to the phone call example, say that I also want to support very short phone calls between places (like Toledo and Detroit) that are much cheaper to connect via an auxiliary cellular connection. Here I compute the charge for a phone call that was 5 seconds long with a rate of $0.05/minute:
rate = Decimal("0.05")
seconds = Decimal("5")
small_cost = rate * seconds / Decimal(60)
print(small_cost)
>>>
0.004166666666666666666666666667
The result is so low that it drops to zero when I try to round it to the nearest whole cent. This won’t do!
print(round(small_cost, 2))
>>>
0.00
Luckily, the Decimal class provides the quantize method for rounding to exactly the decimal place needed with the desired rounding behavior. This works for the higher-cost case from earlier:
from decimal import ROUND_UP
rounded = cost.quantize(Decimal("0.01"), rounding=ROUND_UP)
print(f"Rounded {cost} to {rounded}")
>>>
Rounded 5.365 to 5.37
Using the quantize method this way also properly handles the small usage case for short, cheap phone calls:
rounded = small_cost.quantize(Decimal("0.01"), rounding=ROUND_UP)
print(f"Rounded {small_cost} to {rounded}")
>>>
Rounded 0.004166666666666666666666666667 to 0.01
While Decimal works great for fixed point numbers, it still has limitations in its precision (e.g., 1/3 will be an approximation). For representing rational numbers with no limit to precision, consider using the Fraction class from the fractions built-in module.
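For instance, this quick sketch shows Fraction preserving exactness where Decimal must approximate:

```python
from decimal import Decimal
from fractions import Fraction

# Decimal can only approximate one third (28 significant digits by default)
print(Decimal(1) / Decimal(3))

# Fraction represents it exactly, so the thirds sum back to exactly 1
third = Fraction(1, 3)
total = third + third + third
assert total == 1
```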
Python has built-in types and classes in modules that can represent practically every type of numerical value.
The Decimal class is ideal for situations that require high precision and control over rounding behavior, such as computations of monetary values.
Pass str instances to the Decimal constructor instead of float instances if it’s important to compute exact answers and not floating point approximations.
pickle Serialization Maintainable with copyreg
The pickle built-in module can serialize Python objects into a stream of bytes and deserialize bytes back into objects (see Item 10: “Know the Differences Between bytes and str” for background). Pickled byte streams shouldn’t be used to communicate between untrusted parties. The purpose of pickle is to let you pass Python objects between programs that you control over binary channels.
Note
The pickle module’s serialization format is unsafe by design. The serialized data contains what is essentially a program that describes how to reconstruct the original Python objects. This means a malicious pickle payload could be used to compromise any part of the Python program that attempts to deserialize it.
In contrast, the json module is safe by design. Serialized JSON data contains a simple description of an object hierarchy. Deserializing JSON data does not expose a Python program to any additional risk beyond out-of-memory errors. Formats like JSON should be used for communication between programs or people that don’t trust each other.
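For data that crosses a trust boundary, the json round trip looks similar to pickle’s but can only ever produce basic types (dicts, lists, strings, numbers), never arbitrary reconstructed objects:

```python
import json

# Serializing and deserializing plain data with json
payload = json.dumps({"level": 1, "lives": 3})
restored = json.loads(payload)
print(restored)  # {'level': 1, 'lives': 3}
assert type(restored) is dict  # Plain data, not a reconstructed object
```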
For example, say that I want to use a Python object to represent the state of a player’s progress in a game. The game state includes the level the player is on and the number of lives they have remaining:
class GameState:
    def __init__(self):
        self.level = 0
        self.lives = 4
The program modifies this object as the game runs:
state = GameState()
state.level += 1 # Player beat a level
state.lives -= 1 # Player had to try again
print(state.__dict__)
>>>
{'level': 1, 'lives': 3}
When the user quits playing, the program can save the state of the game to a file so it can be resumed at a later time. The pickle module makes it easy to do this. Here I dump the GameState object directly to a file:
import pickle
state_path = "game_state.bin"
with open(state_path, "wb") as f:
    pickle.dump(state, f)
Later, I can load the file and get back the GameState object as if it had never been serialized:
with open(state_path, "rb") as f:
    state_after = pickle.load(f)
print(state_after.__dict__)
>>>
{'level': 1, 'lives': 3}
The problem with this approach is what happens as the game’s features expand over time. Imagine that I want the player to earn points toward a high score. To track the player’s points, I can add a new field to the GameState class:
class GameState:
    def __init__(self):
        self.level = 0
        self.lives = 4
        self.points = 0  # New field
Serializing the new version of the GameState class using pickle will work exactly as before. Here I simulate the round trip through a file by serializing to a string with dumps and back to a GameState object with loads:
state = GameState()
serialized = pickle.dumps(state)
state_after = pickle.loads(serialized)
print(state_after.__dict__)
>>>
{'level': 0, 'lives': 4, 'points': 0}
But what happens to older saved GameState objects that the user may want to resume? Here I unpickle an old game file using a program with the new definition of the GameState class:
with open(state_path, "rb") as f:
    state_after = pickle.load(f)
print(state_after.__dict__)
>>>
{'level': 1, 'lives': 3}
The points attribute is missing! This is especially confusing because the returned object is an instance of the new GameState class:
assert isinstance(state_after, GameState)
This behavior is a by-product of the way the pickle module works. Its primary use case is making it easy to serialize objects. As soon as your use of pickle expands beyond trivial usage, the module’s functionality starts to break down in surprising ways.
Fixing these problems is straightforward using the copyreg built-in module. The copyreg module lets you register the functions responsible for serializing and deserializing Python objects, allowing you to control the behavior of pickle and make it more reliable and adaptable.
In the simplest case, you can use a constructor with default arguments (see Item 35: “Provide Optional Behavior with Keyword Arguments” for background) to ensure that GameState objects will always have all attributes after unpickling. Here I redefine the constructor this way:
class GameState:
    def __init__(self, level=0, lives=4, points=0):
        self.level = level
        self.lives = lives
        self.points = points
To use this constructor for pickling, I define a helper function that takes a GameState object and turns it into a tuple of parameters for the copyreg module. The returned tuple contains the function to use for unpickling and the parameters to pass to the unpickling function:
def pickle_game_state(game_state):
    kwargs = game_state.__dict__
    return unpickle_game_state, (kwargs,)
Now I need to define the unpickle_game_state helper. This function takes serialized data and parameters from pickle_game_state and returns the corresponding GameState object. It’s a tiny wrapper around the constructor:
def unpickle_game_state(kwargs):
    return GameState(**kwargs)
Now I register these functions with the copyreg built-in module:
import copyreg
copyreg.pickle(GameState, pickle_game_state)
After registration, serializing and deserializing works as before:
state = GameState()
state.points += 1000
serialized = pickle.dumps(state)
state_after = pickle.loads(serialized)
print(state_after.__dict__)
>>>
{'level': 0, 'lives': 4, 'points': 1000}
With this registration done, now I’ll change the definition of GameState again to give the player a count of magic spells to use. This change is similar to when I added the points field to GameState:
class GameState:
    def __init__(self, level=0, lives=4, points=0, magic=5):
        self.level = level
        self.lives = lives
        self.points = points
        self.magic = magic  # New field
But unlike before, deserializing an old GameState object will result in valid game data instead of missing attributes. This works because unpickle_game_state calls the GameState constructor directly instead of using the pickle module’s default behavior of saving and restoring only the attributes that belong to an object. The GameState constructor’s keyword arguments have default values that will be used for any parameters that are missing. This causes old game state files to receive the default value for the new magic field when they are deserialized:
print("Before:", state.__dict__)
state_after = pickle.loads(serialized)
print("After: ", state_after.__dict__)
>>>
Before: {'level': 0, 'lives': 4, 'points': 1000}
After: {'level': 0, 'lives': 4, 'points': 1000, 'magic': 5}
Sometimes you need to make backward-incompatible changes to your Python objects by removing fields. Doing so prevents the default argument approach to migrating serializations from working.
For example, say I realize that a limited number of lives is a bad idea, and I want to remove the concept of lives from the game. Here I redefine the GameState class to no longer have a lives field:
class GameState:
    def __init__(self, level=0, points=0, magic=5):
        self.level = level
        self.points = points
        self.magic = magic
The problem is that this breaks deserialization of old game data. All fields from the old data—even ones removed from the class—will be passed to the GameState constructor by the unpickle_game_state function:
pickle.loads(serialized)
>>>
Traceback ...
TypeError: GameState.__init__() got an unexpected keyword
➥argument 'lives'. Did you mean 'level'?
I can fix this by adding a version parameter to the functions supplied to copyreg. New serialized data will have a version of 2 specified when pickling a new GameState object:
def pickle_game_state(game_state):
    kwargs = game_state.__dict__.copy()  # Copy to avoid mutating the object
    kwargs["version"] = 2
    return unpickle_game_state, (kwargs,)
Old versions of the data will not have a version argument present, which means I can manipulate the arguments passed to the GameState constructor accordingly:
def unpickle_game_state(kwargs):
    version = kwargs.pop("version", 1)
    if version == 1:
        del kwargs["lives"]
    return GameState(**kwargs)
Now deserializing an old object works properly:
copyreg.pickle(GameState, pickle_game_state)
print("Before:", state.__dict__)
state_after = pickle.loads(serialized)
print("After: ", state_after.__dict__)
>>>
Before: {'level': 0, 'lives': 4, 'points': 1000}
After: {'level': 0, 'points': 1000, 'magic': 5}
I can continue using this approach to handle changes between future versions of the same class. Any logic I need to adapt an old version of the class to a new version of the class can go in the unpickle_game_state function.
One other issue you may encounter with pickle is breakage from renaming a class. Often over the life cycle of a program, you’ll refactor your code by renaming classes and moving them to other modules. Unfortunately, doing so breaks the pickle module unless you’re careful.
Here I rename the GameState class to BetterGameState and remove the old class from the program entirely:
class BetterGameState:
    def __init__(self, level=0, points=0, magic=5):
        self.level = level
        self.points = points
        self.magic = magic
Attempting to deserialize an old GameState object now fails because the class can’t be found:
pickle.loads(serialized)
>>>
Traceback ...
AttributeError: Can't get attribute 'GameState'
➥on <module '__main__' from 'my_code.py'>
This exception occurs because the import path of the serialized object’s class is encoded in the pickled data:
print(serialized)
>>>
b'\x80\x04\x95A\x00\x00\x00\x00\x00\x00\x00\x8c\x08__main__\
➥x94\x8c\tGameState\x94\x93\x94)\x81\x94}\x94(\x8c\x05level\
➥x94K\x00\x8c\x06points\x94K\x00\x8c\x05magic\x94K\x05ub.'
The solution is to use copyreg again. I can specify a stable identifier for the function to use for unpickling an object. This allows me to transition pickled data to different classes with different names when it’s deserialized. It gives me a level of indirection:
copyreg.pickle(BetterGameState, pickle_game_state)
After I use copyreg, it’s evident that the import path to unpickle_game_state is encoded in the serialized data instead of BetterGameState:
state = BetterGameState()
serialized = pickle.dumps(state)
print(serialized)
>>>
b'\x80\x04\x95W\x00\x00\x00\x00\x00\x00\x00\x8c\x08__main__\
➥x94\x8c\x13unpickle_game_state\x94\x93\x94}\x94(\x8c\
➥x05level\x94K\x00\x8c\x06points\x94K\x00\x8c\x05magic\x94K\
➥x05\x8c\x07version\x94K\x02u\x85\x94R\x94.'
The only gotcha is that I can’t change the path of the module in which the unpickle_game_state function is present. Once I serialize data with a function, it must remain available on that import path for deserialization in the future.
The pickle built-in module is useful only for serializing and deserializing objects between trusted programs.
Deserializing previously pickled objects may break if the classes involved have changed over time (e.g., attributes have been added or removed).
Use the copyreg built-in module with pickle to ensure backward compatibility for serialized objects.