When you’re implementing Python programs that handle a non-trivial amount of data, you’ll often encounter slowdowns caused by the algorithmic complexity of your code. For example, programs you expected to scale linearly in the size of input data might actually grow quadratically, causing problems in production. Luckily, Python includes optimized implementations of many standard data structures and algorithms that can help you achieve high performance with minimal effort.
Similarly, Python provides built-in data types and helper functions for handling common tasks that frequently come up in programs: manipulating dates, times, and time zones; working with money values while preserving precision and controlling rounding behavior; and saving and restoring program state for users even as your software evolves over time. Writing code to handle these situations is fiddly and hard to get right. Having battle-tested implementations built into the language to handle them is a blessing.
Item 100: Sort by Complex Criteria Using the key Parameter

The list built-in type provides a sort method for ordering the items in a list instance based on a variety of criteria. (Don’t confuse the sort method with sorted; see Item 101: “Know the Difference Between sort and sorted.”) By default, sort orders a list’s contents by the natural ascending order of the items. For example, here I sort a list of integers from smallest to largest:
numbers = [93, 86, 11, 68, 70]
numbers.sort()
print(numbers)
>>>
[11, 68, 70, 86, 93]
The sort method works for nearly all built-in types (strings, floats, etc.) that have a natural ordering. What does sort do with objects? As an example, here I define a class—including a __repr__ method so instances are printable (see Item 12: “Understand the Difference Between repr and str when Printing Objects”)—to represent various tools you might need to use on a construction site:
class Tool:
    def __init__(self, name, weight):
        self.name = name
        self.weight = weight

    def __repr__(self):
        return f"Tool({self.name!r}, {self.weight})"

tools = [
    Tool("level", 3.5),
    Tool("hammer", 1.25),
    Tool("screwdriver", 0.5),
    Tool("chisel", 0.25),
]
Sorting objects of this type doesn’t work because the sort method tries to call comparison special methods that aren’t defined by the class:
tools.sort()
>>>
Traceback ...
TypeError: '<' not supported between instances of 'Tool' and
➥'Tool'
If your class should have a natural ordering as integers do, then you can define the necessary special methods (see Item 104: “Know How to Use heapq for Priority Queues” for an example and Item 57: “Inherit from collections.abc Classes for Custom Container Types” for background) to make sort work without any extra parameters. But the more common case is that your objects might need to support multiple orderings, in which case defining a single natural ordering really doesn’t make sense.
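If you do want a natural ordering, the functools.total_ordering decorator cuts down on the boilerplate: define __eq__ plus one comparison method, and the rest are derived for you. Here is a minimal sketch of that approach; the choice of weight as the natural order (and the OrderedTool name) is my assumption for illustration:

```python
import functools

# Hypothetical variant of the Tool class with a natural ordering by
# weight; total_ordering derives <=, >, and >= from __eq__ and __lt__.
@functools.total_ordering
class OrderedTool:
    def __init__(self, name, weight):
        self.name = name
        self.weight = weight

    def __repr__(self):
        return f"OrderedTool({self.name!r}, {self.weight})"

    def __eq__(self, other):
        return self.weight == other.weight

    def __lt__(self, other):
        return self.weight < other.weight

tools = [OrderedTool("level", 3.5), OrderedTool("chisel", 0.25)]
tools.sort()  # Works now: no key parameter required
```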
Often there’s an attribute on the object that you’d like to use for sorting. To support this use case, the sort method accepts a key parameter that’s expected to be a function (see Item 48: “Accept Functions Instead of Classes for Simple Interfaces” for background). The key function is passed a single argument, which is an item from the list that is being sorted. The return value of the key function should be a comparable value (i.e., with a natural ordering) to use in place of an item for sorting purposes.
Here I use the lambda keyword (see Item 39: “Prefer functools.partial over lambda Expressions for Glue Functions” for background) to define a function for the key parameter that enables me to sort the list of Tool objects alphabetically by the name attribute:
print("Unsorted:", repr(tools))
tools.sort(key=lambda x: x.name)
print("\nSorted: ", tools)
>>>
Unsorted: [Tool('level', 3.5), Tool('hammer', 1.25),
➥Tool('screwdriver', 0.5), Tool('chisel', 0.25)]
Sorted: [Tool('chisel', 0.25), Tool('hammer', 1.25),
➥Tool('level', 3.5), Tool('screwdriver', 0.5)]
I can just as easily define another lambda function to sort by the weight attribute and pass it as the key parameter to the sort method:
tools.sort(key=lambda x: x.weight)
print("By weight:", tools)
>>>
By weight: [Tool('chisel', 0.25), Tool('screwdriver', 0.5),
➥Tool('hammer', 1.25), Tool('level', 3.5)]
Within the lambda function that’s passed as the key parameter, you can access attributes of items as I’ve done here, index into items (for sequences, tuples, and dictionaries), or use any other valid expression.
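For instance, the same pattern works when the items being sorted are tuples or dictionaries rather than objects; this small sketch uses made-up records:

```python
# Sketch: a key function may index into each item instead of
# accessing attributes. These records are invented for illustration.
record_tuples = [("level", 3.5), ("hammer", 1.25), ("chisel", 0.25)]
record_tuples.sort(key=lambda x: x[1])  # Sort tuples by weight field

record_dicts = [
    {"name": "saw", "weight": 5},
    {"name": "drill", "weight": 4},
]
record_dicts.sort(key=lambda x: x["weight"])  # Sort dicts by a value
```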
For basic types like strings, you might even want to use the key function to do transformations on the values before sorting. For example, here I apply the lower method to each item in a list of place names to ensure that they’re in alphabetical order, ignoring any capitalization (since in the natural lexical ordering of strings, uppercase letters appear before lowercase letters):
places = ["home", "work", "New York", "Paris"]
places.sort()
print("Case sensitive: ", places)
places.sort(key=lambda x: x.lower())
print("Case insensitive:", places)
>>>
Case sensitive: ['New York', 'Paris', 'home', 'work']
Case insensitive: ['home', 'New York', 'Paris', 'work']
Sometimes you might need to use multiple criteria for sorting. For example, say that I have a list of power tools, and I want to sort them first by the weight attribute and then by name. How can I accomplish this?
power_tools = [
    Tool("drill", 4),
    Tool("circular saw", 5),
    Tool("jackhammer", 40),
    Tool("sander", 4),
]
The simplest solution in Python is to use the tuple type (see Item 56: “Prefer dataclasses for Creating Immutable Objects” for another approach). A tuple is an immutable sequence of arbitrary Python values. Tuples are comparable by default and have a natural ordering, meaning that they implement all the special methods, such as __lt__, that are required by the sort method. The tuple type implements special comparator methods by iterating over each position in the tuple and comparing the corresponding values, one index at a time. Here I show how this works when one tool is heavier than another:
saw = (5, "circular saw")
jackhammer = (40, "jackhammer")
assert not (jackhammer < saw) # Matches expectations
If the first position in the tuples being compared is equal—weight in this case—the tuple comparison will move to the second position, and so on:
drill = (4, "drill")
sander = (4, "sander")
assert drill[0] == sander[0] # Same weight
assert drill[1] < sander[1] # Alphabetically less
assert drill < sander # Thus, drill comes first
You can take advantage of this tuple comparison behavior in order to sort the list of power tools first by weight and then by name. Here I define a key function that returns a tuple containing the two attributes that I want to sort on, in order of priority:
power_tools.sort(key=lambda x: (x.weight, x.name))
print(power_tools)
>>>
[Tool('drill', 4), Tool('sander', 4), Tool('circular saw', 5),
➥Tool('jackhammer', 40)]
One limitation of having the key function return a tuple is that the direction of sorting for all criteria must be the same (either all in ascending order or all in descending order). If I provide the reverse parameter to the sort method, it will affect both criteria in the tuple the same way (note how sander now comes before drill instead of after):
power_tools.sort(
    key=lambda x: (x.weight, x.name),
    reverse=True,  # Makes all criteria descending
)
print(power_tools)
>>>
[Tool('jackhammer', 40), Tool('circular saw', 5),
➥Tool('sander', 4), Tool('drill', 4)]
For numerical values, it’s possible to mix sorting directions by using the unary minus operator in the key function. This negates one of the values in the returned tuple, effectively reversing its sort order while leaving the others intact. Here I use this approach to sort by weight descending and then by name ascending (note how sander now comes after drill instead of before):
power_tools.sort(key=lambda x: (-x.weight, x.name))
print(power_tools)
>>>
[Tool('jackhammer', 40), Tool('circular saw', 5), Tool
➥('drill', 4), Tool('sander', 4)]
Unfortunately, unary negation isn’t possible for all types. Here I try to achieve the same outcome by using the reverse argument to sort by weight descending and then negating the name attribute to put it in ascending order:
power_tools.sort(key=lambda x: (x.weight, -x.name),
                 reverse=True)
>>>
Traceback ...
TypeError: bad operand type for unary -: 'str'
For situations like this, Python provides a stable sorting algorithm. The sort method of the list type preserves the order of the input list when the key function returns values that are equal to each other. This means I can call sort multiple times on the same list to combine different criteria together. Here I produce the same sort ordering of weight descending and name ascending as I did above, but now I do it by using two separate calls to sort:
power_tools.sort(
    key=lambda x: x.name,  # Name ascending
)
power_tools.sort(
    key=lambda x: x.weight,  # Weight descending
    reverse=True,
)
print(power_tools)
>>>
[Tool('jackhammer', 40), Tool('circular saw', 5),
Tool('drill', 4), Tool('sander', 4)]
To understand why this works, note how the first call to sort puts the names in alphabetical order:
power_tools.sort(key=lambda x: x.name)
print(power_tools)
>>>
[Tool('circular saw', 5), Tool('drill', 4),
➥Tool('jackhammer', 40), Tool('sander', 4)]
When the second sort call by weight descending is made, the code sees that both sander and drill have a weight of 4. This causes the sort method to put both items into the final result list in the same order in which they appeared in the original list, thus preserving their relative ordering by name ascending:
power_tools.sort(
    key=lambda x: x.weight,
    reverse=True,
)
print(power_tools)
>>>
[Tool('jackhammer', 40), Tool('circular saw', 5),
➥Tool('drill', 4), Tool('sander', 4)]
This same approach can be used to combine as many different types of sorting criteria as you’d like in any direction. You just need to make sure that you execute the sorts in the opposite sequence of what you want the final list to contain. In this example, I wanted the sort order to be by weight descending and then by name ascending, so I had to do the name sort first, followed by the weight sort.
That said, the approach of having the key function return a tuple and using unary negation to mix sort orders is simpler to read and requires less code. I recommend using multiple calls to sort only if absolutely necessary.
The sort method of the list type can be used to rearrange a list’s contents by the natural ordering of built-in types like strings, integers, tuples, and so on.
The sort method doesn’t work for objects unless they define a natural ordering using special methods, which is uncommon.
The key parameter of the sort method can be used to supply a helper function that returns the value to use in place of each item from the list while sorting.
Returning a tuple from the key function allows you to combine multiple sorting criteria together. The unary minus operator can be used to reverse individual sort orders for types that allow it.
For types that can’t be negated, you can combine many sorting criteria together by calling the sort method multiple times using different key functions and reverse values, in the order of lowest-rank sort call to highest-rank sort call.
Item 101: Know the Difference Between sort and sorted

The most familiar way to sort data in Python is by using the list type’s built-in sort method (see Item 100: “Sort by Complex Criteria Using the key Parameter”). This method causes the data to be sorted in place, meaning the original list is modified, and the unsorted arrangement is no longer available. For example, here I alphabetize a list of butterfly names:
butterflies = ["Swallowtail", "Monarch", "Red Admiral"]
print(f"Before {butterflies}")
butterflies.sort()
print(f"After {butterflies}")
>>>
Before ['Swallowtail', 'Monarch', 'Red Admiral']
After ['Monarch', 'Red Admiral', 'Swallowtail']
Another way to sort data in Python is by using the sorted built-in function. This function sorts the contents of the given object and returns the results as a list while leaving the original intact:
original = ["Swallowtail", "Monarch", "Red Admiral"]
alphabetical = sorted(original)
print(f"Original {original}")
print(f"Sorted {alphabetical}")
>>>
Original ['Swallowtail', 'Monarch', 'Red Admiral']
Sorted ['Monarch', 'Red Admiral', 'Swallowtail']
The sorted built-in function can be used with any iterable object (see Item 21: “Be Defensive when Iterating over Arguments”), including tuples, dictionaries, and sets:
patterns = {"solid", "spotted", "cells"}
print(sorted(patterns))
>>>
['cells', 'solid', 'spotted']
It also supports the reverse and key parameters, just like the sort method of the list type (see Item 100: “Sort by Complex Criteria Using the key Parameter”):
legs = {"insects": 6, "spiders": 8, "lizards": 4}
sorted_legs = sorted(
    legs,
    key=lambda x: legs[x],
    reverse=True,
)
print(sorted_legs)
>>>
['spiders', 'insects', 'lizards']
There are two benefits of using sort over sorted. First, sort modifies the list in place, which means the memory consumed by your program will stay the same during and after the sort. On the other hand, sorted needs to make a copy of the iterable’s contents in order to produce a sorted list, which might double your program’s memory requirements. Second, sort can be a lot faster because it’s doing less work overall. The result list is already known and doesn’t need to be allocated or resized. Iteration only needs to occur over a list instead of over arbitrary iterable objects. If the data is already partially ordered, index reassignments can be avoided altogether.
However, there are two primary benefits of using sorted instead of sort. First, the original object is left alone, ensuring that you don’t inadvertently modify arguments supplied to your functions and cause perplexing bugs (see Item 30: “Know That Function Arguments Can Be Mutated” and Item 56: “Prefer dataclasses for Creating Immutable Objects” for background). Second, sorted works for any type of iterator, not just lists, which means your functions can rely on duck typing and be more flexible in the types they accept (see Item 25: “Be Cautious when Relying on Dictionary Insertion Ordering” for an example).
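For example, sorted can consume a generator directly, which a list’s sort method never could; here’s a quick sketch:

```python
# Sketch: sorted accepts any iterable, including a one-shot generator
# that has no sort method of its own.
def descending_squares(n):
    for i in range(n, 0, -1):
        yield i * i

result = sorted(descending_squares(4))
print(result)  # [1, 4, 9, 16]
```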
Arguably, sorted is more Pythonic than sort because it enables additional flexibility and is more explicit in how it produces results. The choice of which to use depends on the circumstances and what you’re trying to accomplish.
sort achieves maximum performance with minimal memory overhead but requires the target list to be modified in place.
sorted can process all types of iterators and collections as input, and won’t inadvertently mutate data.
Item 102: Consider Searching Sorted Sequences with bisect

It’s common to find yourself with a large amount of data in memory as a sorted list that you then want to search. For example, you may have loaded an English language dictionary to use for spell checking, or perhaps you have a list of dated financial transactions to audit for correctness.
Regardless of the data your specific program needs to process, searching for a specific value in a list takes linear time proportional to the list’s length when you call the index method:
data = list(range(10**5))
index = data.index(91234)
assert index == 91234
If you’re not sure whether the exact value you’re searching for is in the list, you might want to search for the closest index that is equal to or exceeds your goal value. The simplest way to do this is to linearly scan the list and compare each item to your goal value:
def find_closest(sequence, goal):
    for index, value in enumerate(sequence):
        if goal < value:
            return index
    raise ValueError(f"{goal} is out of bounds")
index = find_closest(data, 91234.56)
assert index == 91235
Python’s built-in bisect module provides better ways to accomplish these types of searches through ordered lists. You can use the bisect_left function to do an efficient binary search through any sequence of sorted items. The index it returns will either be where the item is already present in the list or where you’d want to insert the item in the list to keep it in sorted order:
from bisect import bisect_left
index = bisect_left(data, 91234) # Exact match
assert index == 91234
index = bisect_left(data, 91234.56) # Closest match
assert index == 91235
The complexity of the binary search algorithm used by the bisect module is logarithmic. This means searching in a list of length 1 million takes roughly the same amount of time with bisect as linearly searching a list of length 20 using the list.index method (math.log2(10**6) == 19.93...). So bisect is way faster!
I can verify this speed improvement for the example from above by using the timeit built-in module to run a microbenchmark (see Item 93: “Optimize Performance-Critical Code Using timeit Microbenchmarks” for details):
import random
import timeit
size = 10**5
iterations = 1000
data = list(range(size))
to_lookup = [random.randint(0, size)
             for _ in range(iterations)]

def run_linear(data, to_lookup):
    for index in to_lookup:
        data.index(index)

def run_bisect(data, to_lookup):
    for index in to_lookup:
        bisect_left(data, index)

baseline = (
    timeit.timeit(
        stmt="run_linear(data, to_lookup)",
        globals=globals(),
        number=10,
    )
    / 10
)
print(f"Linear search takes {baseline:.6f}s")
comparison = (
    timeit.timeit(
        stmt="run_bisect(data, to_lookup)",
        globals=globals(),
        number=10,
    )
    / 10
)
print(f"Bisect search takes {comparison:.6f}s")
speedup = baseline / comparison
print(f"{speedup:.1f}x faster")
>>>
Linear search takes 0.317685s
Bisect search takes 0.000197s
1610.1x faster
The best part about bisect is that it’s not limited to the list type; you can use it with any Python object that acts like a sequence (see Item 57: “Inherit from collections.abc Classes for Custom Container Types” for how to do that) containing values that have a natural ordering (see Item 104: “Know How to Use heapq for Priority Queues” for background). The module also provides additional features for more advanced situations (see https://docs.python.org/3/library/bisect.html).
Searching sorted data contained in a list takes linear time using the index method or a for loop with simple comparisons.
The bisect built-in module’s bisect_left function takes logarithmic time to search for values in sorted lists, which can be orders of magnitude faster than other approaches.
Item 103: Prefer deque for Producer-Consumer Queues

A common need in writing programs is a first-in, first-out (FIFO) queue, also known as a producer–consumer queue. A FIFO queue is used when one function gathers values to process and another function handles them in the order in which they were received. Often, programmers turn to Python’s built-in list type to act as a FIFO queue.
For example, say that I’m writing a program that processes incoming emails for long-term archival, and it’s using a list for a producer–consumer queue. Here I define a class to represent the messages:
class Email:
    def __init__(self, sender, receiver, message):
        self.sender = sender
        self.receiver = receiver
        self.message = message
    ...
I also define a placeholder function for receiving a single email, presumably from a socket, the file system, or some other type of I/O system. The implementation of this function doesn’t matter; what’s important is its interface: It will either return an Email instance or raise a NoEmailError exception (see Item 32: “Prefer Raising Exceptions to Returning None” for more about this convention):
class NoEmailError(Exception):
    pass

def try_receive_email():
    # Returns an Email instance or raises NoEmailError
    ...
The producing function receives emails and enqueues them to be consumed at a later time. This function uses the append method on the list to add new messages to the end of the queue so they are processed after all messages that were previously received:
def produce_emails(queue):
    while True:
        try:
            email = try_receive_email()
        except NoEmailError:
            return
        else:
            queue.append(email)  # Producer
The consuming function is what does something useful with the emails. This function calls pop(0) on the queue, which removes the very first item from the list and returns it to the caller. By always processing items from the beginning of the queue, the consumer ensures that the items are processed in the order in which they were received:
def consume_one_email(queue):
    if not queue:
        return
    email = queue.pop(0)  # Consumer
    # Index the message for long-term archival
    ...
Finally, I need a looping function that connects the pieces together. This function alternates between producing and consuming until the keep_running function returns False (see Item 75: “Achieve Highly Concurrent I/O with Coroutines” on how to do this concurrently):
def loop(queue, keep_running):
    while keep_running():
        produce_emails(queue)
        consume_one_email(queue)

def my_end_func():
    ...

loop([], my_end_func)
Why not process each Email object in produce_emails as it’s returned by try_receive_email? It comes down to the trade-off between latency and throughput. When using producer–consumer queues, you often want to minimize the latency of accepting new items so they can be collected as fast as possible. The consumer can then process through the backlog of items at a consistent pace—one item per loop in this case—which provides a stable performance profile and consistent throughput at the cost of end-to-end latency (see Item 70: “Use Queue to Coordinate Work Between Threads” for related best practices).
Using a list for a producer–consumer queue like this works fine up to a point, but as the cardinality—the number of items in the list—increases, the list type’s performance can degrade superlinearly. To analyze the performance of using list as a FIFO queue, I can run some microbenchmarks using the timeit built-in module (see Item 93: “Optimize Performance-Critical Code Using timeit Microbenchmarks” for details). Here I define a benchmark for the performance of adding new items to the queue using the append method of list (matching the producer function’s usage):
import timeit
def list_append_benchmark(count):
    def run(queue):
        for i in range(count):
            queue.append(i)

    return timeit.timeit(
        setup="queue = []",
        stmt="run(queue)",
        globals=locals(),
        number=1,
    )
Running this benchmark function with different levels of cardinality lets me compare its performance in relationship to data size:
for i in range(1, 6):
    count = i * 1_000_000
    delay = list_append_benchmark(count)
    print(f"Count {count:>5,} takes: {delay*1e3:>6.2f}ms")
>>>
Count 1,000,000 takes: 13.23ms
Count 2,000,000 takes: 26.50ms
Count 3,000,000 takes: 39.06ms
Count 4,000,000 takes: 51.98ms
Count 5,000,000 takes: 65.19ms
This shows that the append method takes roughly constant time for the list type, and the total time for enqueueing scales linearly as the data size increases. There is overhead for the list type to increase its capacity under the covers as new items are added, but it’s reasonably low and is amortized across repeated calls to append.
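You can glimpse this amortized growth with sys.getsizeof: in CPython, the list over-allocates capacity, so its allocated size jumps in occasional steps rather than on every append. This is only an illustrative sketch; the exact sizes and step counts vary by Python version and platform:

```python
import sys

items = []
growth_steps = 0
last_size = sys.getsizeof(items)
for i in range(1000):
    items.append(i)
    size = sys.getsizeof(items)
    if size != last_size:  # Capacity was increased under the covers
        growth_steps += 1
        last_size = size

# Far fewer growth events than appends: the growth cost is amortized
print(f"{growth_steps} capacity increases for 1,000 appends")
```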
Here I define a similar benchmark for the pop(0) call that removes items from the beginning of the queue (matching the consumer function’s usage):
def list_pop_benchmark(count):
    def prepare():
        return list(range(count))

    def run(queue):
        while queue:
            queue.pop(0)

    return timeit.timeit(
        setup="queue = prepare()",
        stmt="run(queue)",
        globals=locals(),
        number=1,
    )
I can similarly run this benchmark for queues of different sizes to see how performance is affected by cardinality:
for i in range(1, 6):
    count = i * 10_000
    delay = list_pop_benchmark(count)
    print(f"Count {count:>5,} takes: {delay*1e3:>6.2f}ms")
>>>
Count 10,000 takes: 4.98ms
Count 20,000 takes: 22.21ms
Count 30,000 takes: 60.04ms
Count 40,000 takes: 109.96ms
Count 50,000 takes: 176.92ms
Surprisingly, this shows that the total time for dequeuing items from a list with pop(0) scales quadratically as the length of the queue increases. This happens because pop(0) needs to move every item in the list back an index, effectively reassigning the entire list’s contents. I need to call pop(0) for every item in the list, and so I end up doing roughly len(queue) * len(queue) operations to consume the queue. This doesn’t scale.
Python provides the deque class from the collections built-in module to solve this problem. deque is a double-ended queue implementation. It provides constant time operations for inserting or removing items from its beginning or end. This makes it ideal for FIFO queues.
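Here’s a minimal sketch of the two deque operations that matter for a FIFO queue:

```python
from collections import deque

queue = deque()
queue.append("first")   # Producer side: push onto the right end
queue.append("second")
item = queue.popleft()  # Consumer side: pop from the left end
assert item == "first"  # Oldest item comes out first (FIFO)
```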
To use the deque class, the call to append in produce_emails can stay the same as it was when using a list for the queue. The list.pop method call in consume_one_email must change to call the deque.popleft method with no arguments instead. And the loop function must be called with a deque instance instead of a list. Everything else stays the same. Here I redefine the one function affected to use the new method and run loop again:
import collections
def consume_one_email(queue):
    if not queue:
        return
    email = queue.popleft()  # Consumer
    # Process the email message
    ...

def my_end_func():
    ...

loop(collections.deque(), my_end_func)
I can run another version of the benchmark to verify that the performance of append (matching the producer function’s usage) has stayed roughly the same (modulo a constant factor):
def deque_append_benchmark(count):
    def prepare():
        return collections.deque()

    def run(queue):
        for i in range(count):
            queue.append(i)

    return timeit.timeit(
        setup="queue = prepare()",
        stmt="run(queue)",
        globals=locals(),
        number=1,
    )

for i in range(1, 6):
    count = i * 100_000
    delay = deque_append_benchmark(count)
    print(f"Count {count:>5,} takes: {delay*1e3:>6.2f}ms")
>>>
Count 100,000 takes: 1.68ms
Count 200,000 takes: 3.16ms
Count 300,000 takes: 5.05ms
Count 400,000 takes: 6.81ms
Count 500,000 takes: 8.43ms
And I can also benchmark the performance of calling popleft to mimic the consumer function’s usage of deque:
def deque_popleft_benchmark(count):
    def prepare():
        return collections.deque(range(count))

    def run(queue):
        while queue:
            queue.popleft()

    return timeit.timeit(
        setup="queue = prepare()",
        stmt="run(queue)",
        globals=locals(),
        number=1,
    )

for i in range(1, 6):
    count = i * 100_000
    delay = deque_popleft_benchmark(count)
    print(f"Count {count:>5,} takes: {delay*1e3:>6.2f}ms")
>>>
Count 100,000 takes: 1.67ms
Count 200,000 takes: 3.59ms
Count 300,000 takes: 5.65ms
Count 400,000 takes: 7.50ms
Count 500,000 takes: 9.58ms
The popleft usage scales linearly, in contrast to the superlinear behavior of pop(0) that I measured before—hooray! If you know that the performance of your program critically depends on the speed of your producer–consumer queues, then deque is a great choice. If you’re not sure, then you should instrument your program to find out (see Item 92: “Profile Before Optimizing”).
The list type can be used as a FIFO queue by having the producer call append to add items and the consumer call pop(0) to receive items. However, this may cause problems because the performance of pop(0) degrades superlinearly as the queue length increases.
The deque class from the collections built-in module takes constant time—regardless of length—for append and popleft, making it ideal for FIFO queues.
Item 104: Know How to Use heapq for Priority Queues

One of the limitations of Python’s other queue implementations (see Item 103: “Prefer deque for Producer-Consumer Queues” and Item 70: “Use Queue to Coordinate Work Between Threads”) is that they are first-in, first-out (FIFO) queues: Their contents are sorted by the order in which they were received. Often, you need a program to process items in order of relative importance instead. To accomplish this, a priority queue is the right tool for the job.
For example, say that I’m writing a program to manage books borrowed from a library. There are people constantly borrowing new books. There are people returning their borrowed books on time. And there are people who need to be reminded to return their overdue books. Here I define a class to represent a book that’s been borrowed:
class Book:
    def __init__(self, title, due_date):
        self.title = title
        self.due_date = due_date
I need a system that will send a reminder message when each book passes its due date. Unfortunately, I can’t use a FIFO queue for this because the amount of time each book is allowed to be borrowed varies based on its recency, popularity, and other factors. For example, a book that is borrowed today may be due back later than a book that’s borrowed tomorrow. Here I achieve this behavior by using a standard list and sorting it by the due_date attribute each time a new Book object is added:
def add_book(queue, book):
    queue.append(book)
    queue.sort(key=lambda x: x.due_date, reverse=True)

queue = []
add_book(queue, Book("Don Quixote", "2019-06-07"))
add_book(queue, Book("Frankenstein", "2019-06-05"))
add_book(queue, Book("Les Misérables", "2019-06-08"))
add_book(queue, Book("War and Peace", "2019-06-03"))
If I can assume that the queue of borrowed books is always in sorted order, then all I need to do to check for overdue books is to inspect the final element in the list. Here I define a function to return the next overdue book, if any, and remove it from the queue:
class NoOverdueBooks(Exception):
    pass

def next_overdue_book(queue, now):
    if queue:
        book = queue[-1]
        if book.due_date < now:
            queue.pop()
            return book
    raise NoOverdueBooks
I can call this function repeatedly to get overdue books to remind people about in order from most overdue to least overdue:
now = "2019-06-10"
found = next_overdue_book(queue, now)
print(found.due_date, found.title)
found = next_overdue_book(queue, now)
print(found.due_date, found.title)
>>>
2019-06-03 War and Peace
2019-06-05 Frankenstein
If a book is returned before the due date, I can remove the scheduled reminder message by removing the book from the list:
def return_book(queue, book):
    queue.remove(book)
queue = []
book = Book("Treasure Island", "2019-06-04")
add_book(queue, book)
print("Before return:", [x.title for x in queue])
return_book(queue, book)
print("After return: ", [x.title for x in queue])
>>>
Before return: ['Treasure Island']
After return: []
And I can confirm that when all books are returned, the next_overdue_book function will raise the right exception (see Item 32: “Prefer Raising Exceptions to Returning None” for this convention):
try:
    next_overdue_book(queue, now)
except NoOverdueBooks:
    pass  # Expected
else:
    assert False  # Doesn't happen
However, the computational complexity of this solution isn’t ideal. Although checking for and removing an overdue book has a constant cost, every time I add a book, I pay the cost of sorting the whole list again. If I have len(queue) books to add, and the cost of sorting them is roughly len(queue) * math.log(len(queue)), the time it takes to add books will grow superlinearly (len(queue) * len(queue) * math.log(len(queue))).
Here I define a microbenchmark to measure this performance behavior experimentally by using the timeit built-in module (see Item 93: “Optimize Performance-Critical Code Using timeit Microbenchmarks” for details):
import random
import timeit
def list_overdue_benchmark(count):
    def prepare():
        to_add = list(range(count))
        random.shuffle(to_add)
        return [], to_add

    def run(queue, to_add):
        for i in to_add:
            queue.append(i)
            queue.sort(reverse=True)

        while queue:
            queue.pop()

    return timeit.timeit(
        setup="queue, to_add = prepare()",
        stmt="run(queue, to_add)",
        globals=locals(),
        number=1,
    )
I can verify that the runtime of adding and removing books from the queue scales superlinearly as the number of books being borrowed increases:
for i in range(1, 6):
    count = i * 1_000
    delay = list_overdue_benchmark(count)
    print(f"Count {count:>5,} takes: {delay*1e3:>6.2f}ms")
>>>
Count 1,000 takes: 1.74ms
Count 2,000 takes: 5.87ms
Count 3,000 takes: 11.12ms
Count 4,000 takes: 19.80ms
Count 5,000 takes: 31.02ms
When a book is returned before the due date, I need to do a linear scan in order to find the book in the queue and remove it. Removing a book causes all subsequent items in the list to be shifted back an index, which has a high cost that also scales superlinearly. Here I define another microbenchmark to test the performance of returning a book using this function:
def list_return_benchmark(count):
    def prepare():
        queue = list(range(count))
        random.shuffle(queue)
        to_return = list(range(count))
        random.shuffle(to_return)
        return queue, to_return

    def run(queue, to_return):
        for i in to_return:
            queue.remove(i)

    return timeit.timeit(
        setup="queue, to_return = prepare()",
        stmt="run(queue, to_return)",
        globals=locals(),
        number=1,
    )
And again, I can verify that indeed the performance degrades superlinearly as the number of books increases:
for i in range(1, 6):
    count = i * 1_000
    delay = list_return_benchmark(count)
    print(f"Count {count:>5,} takes: {delay*1e3:>6.2f}ms")
>>>
Count 1,000 takes: 1.97ms
Count 2,000 takes: 6.99ms
Count 3,000 takes: 14.59ms
Count 4,000 takes: 26.12ms
Count 5,000 takes: 40.38ms
Using the methods of list may work for a tiny library, but it certainly won’t scale to the size of the Great Library of Alexandria, as I want it to!
Fortunately, Python has the built-in heapq module that solves this problem by implementing priority queues efficiently. A heap is a data structure that maintains a list of items where adding a new item or removing the smallest item has logarithmic computational complexity (i.e., even better than linear scaling). In the book-borrowing example, smallest means the book with the earliest due date. The best part about heapq is that you don’t have to understand how heaps are implemented in order to use the module’s functions correctly.
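Before applying heapq to books, a minimal sketch with plain integers shows the push/pop behavior:

```python
from heapq import heappush, heappop

heap = []
for value in [7, 3, 9, 1]:
    heappush(heap, value)  # Each push costs O(log n)

assert heap[0] == 1  # The smallest item is always at index 0
ordered = [heappop(heap) for _ in range(len(heap))]
print(ordered)  # [1, 3, 7, 9]
```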
Here I reimplement the add_book function using the heapq module. The queue is still a plain list. The heappush function replaces the list.append call from before. And I no longer have to call list.sort on the queue:
from heapq import heappush
def add_book(queue, book):
    heappush(queue, book)
If I try to use this with the Book class as previously defined, I get this somewhat cryptic error:
queue = []
add_book(queue, Book("Little Women", "2019-06-05"))
add_book(queue, Book("The Time Machine", "2019-05-30"))
>>>
Traceback ...
TypeError: '<' not supported between instances of 'Book' and
➥'Book'
This happens because the heapq module requires items in the priority queue to be comparable and have a natural sort order (see Item 100: “Sort by Complex Criteria Using the key Parameter” for details). You can quickly give the Book class this behavior by using the total_ordering class decorator from the functools built-in module (see Item 66: “Prefer Class Decorators over Metaclasses for Composable Class Extensions” for background) and implementing the __lt__ special method (see Item 57: “Inherit from collections.abc Classes for Custom Container Types” for background).
Here I redefine the class with a less-than method (__lt__) that simply compares the due_date fields between two Book instances:
import functools
@functools.total_ordering
class Book:
    def __init__(self, title, due_date):
        self.title = title
        self.due_date = due_date

    def __lt__(self, other):
        return self.due_date < other.due_date
Now I can add books to the priority queue without any issues by using the heapq.heappush function:
queue = []
add_book(queue, Book("Pride and Prejudice", "2019-06-01"))
add_book(queue, Book("The Time Machine", "2019-05-30"))
add_book(queue, Book("Crime and Punishment", "2019-06-06"))
add_book(queue, Book("Wuthering Heights", "2019-06-12"))
Alternatively, I can create a list with all the books in any order and then use the sort method of list to produce the heap:
queue = [
    Book("Pride and Prejudice", "2019-06-01"),
    Book("The Time Machine", "2019-05-30"),
    Book("Crime and Punishment", "2019-06-06"),
    Book("Wuthering Heights", "2019-06-12"),
]
queue.sort()
Or I can use the heapq.heapify function to create a heap in linear time (as opposed to the sort method’s len(queue) * log(len(queue)) complexity):
from heapq import heapify
queue = [
    Book("Pride and Prejudice", "2019-06-01"),
    Book("The Time Machine", "2019-05-30"),
    Book("Crime and Punishment", "2019-06-06"),
    Book("Wuthering Heights", "2019-06-12"),
]
heapify(queue)
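Note that heapify only establishes the heap invariant; it does not fully sort the list. This sketch (repeating the Book definition from earlier in this item) shows that only the first element is guaranteed to be the smallest:

```python
import functools
from heapq import heapify

@functools.total_ordering
class Book:
    def __init__(self, title, due_date):
        self.title = title
        self.due_date = due_date

    def __lt__(self, other):
        return self.due_date < other.due_date

queue = [
    Book("Pride and Prejudice", "2019-06-01"),
    Book("The Time Machine", "2019-05-30"),
    Book("Crime and Punishment", "2019-06-06"),
]
heapify(queue)

# Only the head of the heap is guaranteed to have the earliest due date;
# the rest of the list is not necessarily in sorted order
assert queue[0].due_date == "2019-05-30"
```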
To check for overdue books, I inspect the first element in the list instead of the last, and then I use the heapq.heappop function instead of the list.pop function to remove the book from the heap:
from heapq import heappop
def next_overdue_book(queue, now):
    if queue:
        book = queue[0]  # Most overdue first
        if book.due_date < now:
            heappop(queue)  # Remove the overdue book
            return book
    raise NoOverdueBooks
Now I can find and remove overdue books in order until there are none left for the current time:
now = "2019-06-02"
book = next_overdue_book(queue, now)
print(book.due_date, book.title)
book = next_overdue_book(queue, now)
print(book.due_date, book.title)
try:
    next_overdue_book(queue, now)
except NoOverdueBooks:
    pass  # Expected
else:
    assert False  # Doesn't happen
>>>
2019-05-30 The Time Machine
2019-06-01 Pride and Prejudice
I can write another microbenchmark to test the performance of this implementation that uses the heapq module:
def heap_overdue_benchmark(count):
    def prepare():
        to_add = list(range(count))
        random.shuffle(to_add)
        return [], to_add

    def run(queue, to_add):
        for i in to_add:
            heappush(queue, i)
        while queue:
            heappop(queue)

    return timeit.timeit(
        setup="queue, to_add = prepare()",
        stmt="run(queue, to_add)",
        globals=locals(),
        number=1,
    )
This benchmark experimentally verifies that the heap-based priority queue implementation scales much better (roughly len(queue) * math.log(len(queue))) without superlinearly degrading performance:
for i in range(1, 6):
    count = i * 10_000
    delay = heap_overdue_benchmark(count)
    print(f"Count {count:>5,} takes: {delay*1e3:>6.2f}ms")
>>>
Count 10,000 takes: 1.73ms
Count 20,000 takes: 3.83ms
Count 30,000 takes: 6.50ms
Count 40,000 takes: 8.85ms
Count 50,000 takes: 11.43ms
With the heapq implementation, one question remains: How should I handle returns that are on time? The solution is to never remove a book from the priority queue until its due date. At that time, it will be the first item in the list, and I can simply ignore the book if it’s already been returned. Here I implement this behavior by adding a new field to track the book’s return status:
@functools.total_ordering
class Book:
    def __init__(self, title, due_date):
        self.title = title
        self.due_date = due_date
        self.returned = False  # New field

    ...
Then I change the next_overdue_book function to repeatedly ignore any book that’s already been returned:
def next_overdue_book(queue, now):
    while queue:
        book = queue[0]
        if book.returned:
            heappop(queue)
            continue
        if book.due_date < now:
            heappop(queue)
            return book
        break
    raise NoOverdueBooks
This approach makes the return_book function extremely fast because it makes no modifications to the priority queue:
def return_book(queue, book):
    book.returned = True
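Putting the pieces together, here is a self-contained sketch (repeating the definitions from this item) that demonstrates the lazy-removal behavior: a book returned on time is silently skipped the next time the head of the heap is inspected:

```python
import functools
from heapq import heappush, heappop

class NoOverdueBooks(Exception):
    pass

@functools.total_ordering
class Book:
    def __init__(self, title, due_date):
        self.title = title
        self.due_date = due_date
        self.returned = False

    def __lt__(self, other):
        return self.due_date < other.due_date

def next_overdue_book(queue, now):
    while queue:
        book = queue[0]
        if book.returned:
            heappop(queue)  # Lazily discard books already returned
            continue
        if book.due_date < now:
            heappop(queue)
            return book
        break
    raise NoOverdueBooks

def return_book(queue, book):
    book.returned = True  # O(1): the queue itself is untouched

queue = []
early = Book("The Time Machine", "2019-05-30")
late = Book("Pride and Prejudice", "2019-06-01")
heappush(queue, early)
heappush(queue, late)

return_book(queue, early)  # Returned on time
found = next_overdue_book(queue, "2019-06-02")
print(found.title)  # Pride and Prejudice
```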
The downside of this solution for returns is that the priority queue may grow to the maximum size needed if all books from the library were checked out and went overdue. Although the queue operations will be fast thanks to heapq, this storage overhead might require significant memory (see Item 115: “Use tracemalloc to Understand Memory Usage and Leaks” for how to debug such usage).
If you’re trying to build a robust system, you will need to plan for the worst-case scenario; thus, you should expect that it’s possible for every library book to go overdue for some reason (e.g., a natural disaster closes the road to the library). This memory cost is a design consideration that you should have already planned for and mitigated through additional constraints (e.g., imposing a maximum number of simultaneously lent books).
Beyond the priority queue primitives that I’ve used in this example, the heapq module also provides additional functionality for advanced use cases (see https://docs.python.org/3/library/heapq.html). The module is a great choice when its functionality matches the problem you’re facing (see the queue.PriorityQueue class for another thread-safe option).
A priority queue allows you to process items in order of importance instead of in first-in, first-out order.
If you try to use list operations to implement a priority queue, your program’s performance will degrade superlinearly as the queue grows.
The heapq built-in module provides all the functions you need to implement a priority queue that scales efficiently.
To use heapq, the items being prioritized must have a natural sort order, which requires special methods like __lt__ to be defined for classes.
datetime Instead of time for Local Clocks
Coordinated Universal Time (UTC) is the standard, time zone–independent representation of time. UTC works great for computers that represent time as seconds since the UNIX epoch. But UTC isn’t ideal for humans. Humans reference time relative to where they’re currently located. People say “noon” or “8 a.m.” instead of “UTC 15:00 minus 7 hours.” If your program handles time, you’ll probably find yourself converting time between UTC and local clocks for the sake of human understanding.
Python provides two ways of accomplishing time zone conversions. The old way, using the time built-in module, is terribly error prone. The new way, using datetime, works great with some help from the built-in zoneinfo module.
You should be acquainted with both time and datetime to thoroughly understand why datetime is the best choice and time should be avoided.
time Module
The localtime function from the time built-in module lets you convert a UNIX timestamp (seconds since the UNIX epoch in UTC) to a local time that matches the host computer’s time zone (Pacific Daylight Time in my case). This local time can be printed in human-readable format using the strftime function:
import time
now = 1710047865.0
local_tuple = time.localtime(now)
time_format = "%Y-%m-%d %H:%M:%S"
time_str = time.strftime(time_format, local_tuple)
print(time_str)
>>>
2024-03-09 21:17:45
You’ll often need to go the other way as well, starting with user input in human-readable local time and converting it to UTC time. You can do this by using the strptime function to parse the time string and then calling mktime to convert local time to a UNIX timestamp:
time_tuple = time.strptime(time_str, time_format)
utc_now = time.mktime(time_tuple)
print(utc_now)
>>>
1710047865.0
How do you convert local time in one time zone to local time in another time zone? For example, say that I’m taking a flight between San Francisco and New York, and I want to know what time it will be in San Francisco when I’ve arrived in New York.
I might initially assume that I can directly manipulate the return values from the time, localtime, and strptime functions to do time zone conversions. But this is a very bad idea. Time zones change all the time due to local laws. It’s too complicated to manage yourself, especially if you want to handle every global city for flight departures and arrivals.
Many operating systems have configuration files that automatically keep up with the time zone changes. Python lets you use these time zones through the time module if your platform supports it. On other platforms, such as Windows, some time zone functionality isn’t available from time at all. For example, here I parse a departure time from the San Francisco time zone, Pacific Standard Time (PST):
parse_format = "%Y-%m-%d %H:%M:%S %Z"
depart_sfo = "2024-03-09 21:17:45 PST"
time_tuple = time.strptime(depart_sfo, parse_format)
time_str = time.strftime(time_format, time_tuple)
print(time_str)
>>>
2024-03-09 21:17:45
After seeing that "PST" works with the strptime function, I might also assume that other time zones known to my computer will work. Unfortunately, this isn’t the case. strptime raises an exception when it sees Eastern Daylight Time (EDT), which is the time zone for New York (when the flight arrives after the time change):
arrival_nyc = "2024-03-10 03:31:18 EDT"
time_tuple = time.strptime(arrival_nyc, parse_format)
>>>
Traceback ...
ValueError: time data '2024-03-10 03:31:18 EDT' does not match
➥format '%Y-%m-%d %H:%M:%S %Z'
The problem here is the platform-dependent nature of the time module: Its actual behavior is determined by how the underlying C functions work with the host operating system, which makes the module unreliable for working with multiple local times. Thus, you should avoid using the time module for this purpose. If you must use time, use it only to convert between UTC and the host computer’s local time. For all other types of conversions, use the datetime module.
datetime Module
The second option for representing times in Python is the datetime class from the datetime built-in module. Like the time module, datetime can be used to convert from the current time in UTC to local time.
Here I convert the present time in UTC to my computer’s local time, PDT:
from datetime import datetime, timezone
now = datetime(2024, 3, 10, 5, 17, 45)
now_utc = now.replace(tzinfo=timezone.utc)
now_local = now_utc.astimezone()
print(now_local)
>>>
2024-03-09 21:17:45-08:00
The datetime module can also easily convert a local time back to a UNIX timestamp in UTC (matching the value above):
time_str = "2024-03-09 21:17:45"
now = datetime.strptime(time_str, time_format)
time_tuple = now.timetuple()
utc_now = time.mktime(time_tuple)
print(utc_now)
>>>
1710047865.0
Unlike the time module, the datetime module has facilities for reliably converting from one local time to another local time. However, datetime only provides the machinery for time zone operations with its tzinfo class and related methods.
Fortunately, since Python version 3.9, on many systems the zoneinfo built-in module contains a full database of every time zone definition you might need. On other systems, such as Windows, the officially endorsed tzdata community package might need to be installed to provide the time zone database that the zoneinfo module needs to function properly (see Item 116: “Know Where to Find Community-Built Modules” for details).
To use zoneinfo effectively, you should always convert local times to UTC first. Next, perform any datetime operations you need on the UTC values (such as offsetting). Finally, convert to local times.
For example, here I convert a New York City flight arrival time to a UTC datetime:
from zoneinfo import ZoneInfo
arrival_nyc = "2024-03-10 03:31:18"
nyc_dt_naive = datetime.strptime(arrival_nyc, time_format)
eastern = ZoneInfo("US/Eastern")
nyc_dt = nyc_dt_naive.replace(tzinfo=eastern)
utc_dt = nyc_dt.astimezone(timezone.utc)
print("EDT:", nyc_dt)
print("UTC:", utc_dt)
>>>
EDT: 2024-03-10 03:31:18-04:00
UTC: 2024-03-10 07:31:18+00:00
Once I have a UTC datetime, I can convert it to San Francisco local time:
pacific = ZoneInfo("US/Pacific")
sf_dt = utc_dt.astimezone(pacific)
print("PST:", sf_dt)
>>>
PST: 2024-03-09 23:31:18-08:00
Just as easily, I can convert it to the local time in Nepal:
nepal = ZoneInfo("Asia/Katmandu")
nepal_dt = utc_dt.astimezone(nepal)
print("NPT", nepal_dt)
>>>
NPT 2024-03-10 13:16:18+05:45
With datetime and zoneinfo, these conversions are consistent across all environments, regardless of what operating system the host computer is running.
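The whole UTC-pivot pattern can be wrapped in a small helper (convert_zone is a hypothetical name of my own, not a standard function):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def convert_zone(dt_naive, src_zone, dst_zone):
    # Interpret the naive datetime in the source zone, pivot through
    # UTC, then express the result in the destination zone
    src_dt = dt_naive.replace(tzinfo=ZoneInfo(src_zone))
    utc_dt = src_dt.astimezone(timezone.utc)
    return utc_dt.astimezone(ZoneInfo(dst_zone))

# Same New York arrival as above, converted to San Francisco time
arrival = datetime(2024, 3, 10, 3, 31, 18)
sf_dt = convert_zone(arrival, "US/Eastern", "US/Pacific")
print(sf_dt)  # 2024-03-09 23:31:18-08:00
```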
Avoid using the time module for translating between different time zones.
Use the datetime and zoneinfo built-in modules to reliably convert between times and dates in different time zones.
Always represent time in UTC and do conversions to local time as the very final step before presentation.
decimal when Precision Is Paramount
Python is an excellent language for writing code that interacts with numerical data. Python’s integer type can represent values of any practical size. Its double-precision floating point type complies with the IEEE 754 standard. The language also provides a standard complex number type for imaginary values. However, these aren’t enough for every situation.
For example, say that I want to compute the amount to charge a customer for an international phone call via a portable satellite phone. I know the time in minutes and seconds that the customer was on the phone (say, 3 minutes 42 seconds). I also have a set rate for the cost of calling Antarctica from the United States (say, $1.45/minute). What should the charge be?
With floating point math, the computed charge seems reasonable:
rate = 1.45
seconds = 3 * 60 + 42
cost = rate * seconds / 60
print(cost)
>>>
5.364999999999999
The result is a tiny fraction short of the correct value (5.365) due to how IEEE 754 floating point numbers are represented. I might want to round this value up to 5.37 to properly cover all costs incurred by the customer. However, due to floating point error, rounding to the nearest whole cent actually reduces the final charge to 5.36 instead of increasing it to 5.37:
print(round(cost, 2))
>>>
5.36
The solution is to use the Decimal class from the decimal built-in module. The Decimal class provides fixed point math of 28 decimal places by default—and it can go even higher, if required. This works around the precision issues in IEEE 754 floating point numbers. The class also gives you more control over rounding behaviors.
For example, redoing the Antarctica calculation with Decimal results in the exact expected charge instead of an approximation:
from decimal import Decimal
rate = Decimal("1.45")
seconds = Decimal(3 * 60 + 42)
cost = rate * seconds / Decimal(60)
print(cost)
>>>
5.365
Decimal instances can be given starting values in two different ways. The first way is by passing a str containing the number to the Decimal constructor. This ensures that there is no loss of precision due to the inherent nature of Python floating point values and number constants. The second way is by directly passing a float or an int instance to the constructor. Here you can see that the two construction methods result in different behavior:
print(Decimal("1.45"))
print(Decimal(1.45))
>>>
1.45
1.4499999999999999555910790149937383830547332763671875
This problem doesn’t occur if I supply integers to the Decimal constructor:
print(Decimal("456"))
print(Decimal(456))
>>>
456
456
If you care about exact answers, err on the side of caution and always use the str constructor for the Decimal type.
Getting back to the phone call example, say that I also want to support very short phone calls between places (like Toledo and Detroit) that are much cheaper to connect via an auxiliary cellular connection. Here I compute the charge for a phone call that was 5 seconds long with a rate of $0.05/minute:
rate = Decimal("0.05")
seconds = Decimal("5")
small_cost = rate * seconds / Decimal(60)
print(small_cost)
>>>
0.004166666666666666666666666667
The result is so low that it drops to zero when I try to round it to the nearest whole cent. This won’t do!
print(round(small_cost, 2))
>>>
0.00
Luckily, the Decimal class provides the quantize method for rounding to exactly the decimal place needed with the desired rounding behavior. This works for the higher-cost case from earlier:
from decimal import ROUND_UP
rounded = cost.quantize(Decimal("0.01"), rounding=ROUND_UP)
print(f"Rounded {cost} to {rounded}")
>>>
Rounded 5.365 to 5.37
Using the quantize method this way also properly handles the small usage case for short, cheap phone calls:
rounded = small_cost.quantize(Decimal("0.01"), rounding=ROUND_UP)
print(f"Rounded {small_cost} to {rounded}")
>>>
Rounded 0.004166666666666666666666666667 to 0.01
While Decimal works great for fixed point numbers, it still has limitations in its precision (e.g., 1/3 will be an approximation). For representing rational numbers with no limit to precision, consider using the Fraction class from the fractions built-in module.
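For instance, this quick sketch shows Fraction preserving exactness where Decimal must approximate:

```python
from decimal import Decimal
from fractions import Fraction

# Decimal can only approximate one third (28 significant digits by default)
print(Decimal(1) / Decimal(3))

# Fraction represents it exactly, so the thirds sum back to exactly 1
third = Fraction(1, 3)
total = third + third + third
assert total == 1
```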
Python has built-in types and classes in modules that can represent practically every type of numerical value.
The Decimal class is ideal for situations that require high precision and control over rounding behavior, such as computations of monetary values.
Pass str instances to the Decimal constructor instead of float instances if it’s important to compute exact answers and not floating point approximations.
pickle Serialization Maintainable with copyreg
The pickle built-in module can serialize Python objects into a stream of bytes and deserialize bytes back into objects (see Item 10: “Know the Differences Between bytes and str” for background). Pickled byte streams shouldn’t be used to communicate between untrusted parties. The purpose of pickle is to let you pass Python objects between programs that you control over binary channels.
Note
The pickle module’s serialization format is unsafe by design. The serialized data contains what is essentially a program that describes how to reconstruct the original Python objects. This means a malicious pickle payload could be used to compromise any part of the Python program that attempts to deserialize it.
In contrast, the json module is safe by design. Serialized JSON data contains a simple description of an object hierarchy. Deserializing JSON data does not expose a Python program to any additional risk beyond out-of-memory errors. Formats like JSON should be used for communication between programs or people that don’t trust each other.
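For data that crosses a trust boundary, the json round trip looks similar to pickle’s but can only ever produce basic types (dicts, lists, strings, numbers), never arbitrary reconstructed objects:

```python
import json

# Serializing and deserializing plain data with json
payload = json.dumps({"level": 1, "lives": 3})
restored = json.loads(payload)
print(restored)  # {'level': 1, 'lives': 3}
assert type(restored) is dict  # Plain data, not a reconstructed object
```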
For example, say that I want to use a Python object to represent the state of a player’s progress in a game. The game state includes the level the player is on and the number of lives they have remaining:
class GameState:
    def __init__(self):
        self.level = 0
        self.lives = 4
The program modifies this object as the game runs:
state = GameState()
state.level += 1 # Player beat a level
state.lives -= 1 # Player had to try again
print(state.__dict__)
>>>
{'level': 1, 'lives': 3}
When the user quits playing, the program can save the state of the game to a file so it can be resumed at a later time. The pickle module makes it easy to do this. Here I dump the GameState object directly to a file:
import pickle
state_path = "game_state.bin"
with open(state_path, "wb") as f:
    pickle.dump(state, f)
Later, I can load the file and get back the GameState object as if it had never been serialized:
with open(state_path, "rb") as f:
    state_after = pickle.load(f)
print(state_after.__dict__)
>>>
{'level': 1, 'lives': 3}
The problem with this approach is what happens as the game’s features expand over time. Imagine that I want the player to earn points toward a high score. To track the player’s points, I can add a new field to the GameState class:
class GameState:
    def __init__(self):
        self.level = 0
        self.lives = 4
        self.points = 0  # New field
Serializing the new version of the GameState class using pickle will work exactly as before. Here I simulate the round trip through a file by serializing to a string with dumps and back to a GameState object with loads:
state = GameState()
serialized = pickle.dumps(state)
state_after = pickle.loads(serialized)
print(state_after.__dict__)
>>>
{'level': 0, 'lives': 4, 'points': 0}
But what happens to older saved GameState objects that the user may want to resume? Here I unpickle an old game file using a program with the new definition of the GameState class:
with open(state_path, "rb") as f:
    state_after = pickle.load(f)
print(state_after.__dict__)
>>>
{'level': 1, 'lives': 3}
The points attribute is missing! This is especially confusing because the returned object is an instance of the new GameState class:
assert isinstance(state_after, GameState)
This behavior is a by-product of the way the pickle module works. Its primary use case is making it easy to serialize objects. As soon as your use of pickle expands beyond trivial usage, the module’s functionality starts to break down in surprising ways.
Fixing these problems is straightforward using the copyreg built-in module. The copyreg module lets you register the functions responsible for serializing and deserializing Python objects, allowing you to control the behavior of pickle and make it more reliable and adaptable.
In the simplest case, you can use a constructor with default arguments (see Item 35: “Provide Optional Behavior with Keyword Arguments” for background) to ensure that GameState objects will always have all attributes after unpickling. Here I redefine the constructor this way:
class GameState:
    def __init__(self, level=0, lives=4, points=0):
        self.level = level
        self.lives = lives
        self.points = points
To use this constructor for pickling, I define a helper function that takes a GameState object and turns it into a tuple of parameters for the copyreg module. The returned tuple contains the function to use for unpickling and the parameters to pass to the unpickling function:
def pickle_game_state(game_state):
    kwargs = game_state.__dict__
    return unpickle_game_state, (kwargs,)
Now I need to define the unpickle_game_state helper. This function takes serialized data and parameters from pickle_game_state and returns the corresponding GameState object. It’s a tiny wrapper around the constructor:
def unpickle_game_state(kwargs):
    return GameState(**kwargs)
Now I register these functions with the copyreg built-in module:
import copyreg
copyreg.pickle(GameState, pickle_game_state)
After registration, serializing and deserializing works as before:
state = GameState()
state.points += 1000
serialized = pickle.dumps(state)
state_after = pickle.loads(serialized)
print(state_after.__dict__)
>>>
{'level': 0, 'lives': 4, 'points': 1000}
With this registration done, now I’ll change the definition of GameState again to give the player a count of magic spells to use. This change is similar to when I added the points field to GameState:
class GameState:
    def __init__(self, level=0, lives=4, points=0, magic=5):
        self.level = level
        self.lives = lives
        self.points = points
        self.magic = magic  # New field
But unlike before, deserializing an old GameState object will result in valid game data instead of missing attributes. This works because unpickle_game_state calls the GameState constructor directly instead of using the pickle module’s default behavior of saving and restoring only the attributes that belong to an object. The GameState constructor’s keyword arguments have default values that will be used for any parameters that are missing. This causes old game state files to receive the default value for the new magic field when they are deserialized:
print("Before:", state.__dict__)
state_after = pickle.loads(serialized)
print("After: ", state_after.__dict__)
>>>
Before: {'level': 0, 'lives': 4, 'points': 1000}
After: {'level': 0, 'lives': 4, 'points': 1000, 'magic': 5}
Sometimes you need to make backward-incompatible changes to your Python objects by removing fields. Doing so prevents the default argument approach to migrating serializations from working.
For example, say I realize that a limited number of lives is a bad idea, and I want to remove the concept of lives from the game. Here I redefine the GameState class to no longer have a lives field:
class GameState:
    def __init__(self, level=0, points=0, magic=5):
        self.level = level
        self.points = points
        self.magic = magic
The problem is that this breaks deserialization of old game data. All fields from the old data—even ones removed from the class—will be passed to the GameState constructor by the unpickle_game_state function:
pickle.loads(serialized)
>>>
Traceback ...
TypeError: GameState.__init__() got an unexpected keyword
➥argument 'lives'. Did you mean 'level'?
I can fix this by adding a version parameter to the functions supplied to copyreg. New serialized data will have a version of 2 specified when pickling a new GameState object:
def pickle_game_state(game_state):
    kwargs = game_state.__dict__.copy()  # Copy to avoid mutating the object
    kwargs["version"] = 2
    return unpickle_game_state, (kwargs,)
Old versions of the data will not have a version argument present, which means I can manipulate the arguments passed to the GameState constructor accordingly:
def unpickle_game_state(kwargs):
    version = kwargs.pop("version", 1)
    if version == 1:
        del kwargs["lives"]
    return GameState(**kwargs)
Now deserializing an old object works properly:
copyreg.pickle(GameState, pickle_game_state)
print("Before:", state.__dict__)
state_after = pickle.loads(serialized)
print("After: ", state_after.__dict__)
>>>
Before: {'level': 0, 'lives': 4, 'points': 1000}
After: {'level': 0, 'points': 1000, 'magic': 5}
I can continue using this approach to handle changes between future versions of the same class. Any logic I need to adapt an old version of the class to a new version of the class can go in the unpickle_game_state function.
One other issue you may encounter with pickle is breakage from renaming a class. Often over the life cycle of a program, you’ll refactor your code by renaming classes and moving them to other modules. Unfortunately, doing so breaks the pickle module unless you’re careful.
Here I rename the GameState class to BetterGameState and remove the old class from the program entirely:
class BetterGameState:
    def __init__(self, level=0, points=0, magic=5):
        self.level = level
        self.points = points
        self.magic = magic
Attempting to deserialize an old GameState object now fails because the class can’t be found:
pickle.loads(serialized)
>>>
Traceback ...
AttributeError: Can't get attribute 'GameState'
➥on <module '__main__' from 'my_code.py'>
This exception occurs because the import path of the serialized object’s class is encoded in the pickled data:
print(serialized)
>>>
b'\x80\x04\x95A\x00\x00\x00\x00\x00\x00\x00\x8c\x08__main__\
➥x94\x8c\tGameState\x94\x93\x94)\x81\x94}\x94(\x8c\x05level\
➥x94K\x00\x8c\x06points\x94K\x00\x8c\x05magic\x94K\x05ub.'
The solution is to use copyreg again. I can specify a stable identifier for the function to use for unpickling an object. This allows me to transition pickled data to different classes with different names when it’s deserialized. It gives me a level of indirection:
copyreg.pickle(BetterGameState, pickle_game_state)
After I use copyreg, it’s evident that the import path to unpickle_game_state is encoded in the serialized data instead of BetterGameState:
state = BetterGameState()
serialized = pickle.dumps(state)
print(serialized)
>>>
b'\x80\x04\x95W\x00\x00\x00\x00\x00\x00\x00\x8c\x08__main__\
➥x94\x8c\x13unpickle_game_state\x94\x93\x94}\x94(\x8c\
➥x05level\x94K\x00\x8c\x06points\x94K\x00\x8c\x05magic\x94K\
➥x05\x8c\x07version\x94K\x02u\x85\x94R\x94.'
The only gotcha is that I can’t change the path of the module in which the unpickle_game_state function is present. Once I serialize data with a function, it must remain available on that import path for deserialization in the future.
The pickle built-in module is useful only for serializing and deserializing objects between trusted programs.
Deserializing previously pickled objects may break if the classes involved have changed over time (e.g., attributes have been added or removed).
Use the copyreg built-in module with pickle to ensure backward compatibility for serialized objects.