Chapter 4: Dictionaries

A natural complement to lists and sequences in Python is the dictionary type, which stores lookup keys mapped to corresponding values (in what is often called an associative array or a hash table). Their versatility makes dictionaries ideal for bookkeeping: dynamically keeping track of new and changing pieces of data and how they relate to each other. When writing a new program, using dictionaries is a great way to start before you're sure what other data structures or classes you might need.

Dictionaries provide constant time (amortized) performance for adding and removing items, which is far better than what simple lists can achieve on their own. Thus, it's understandable that dictionaries are the core data structure that Python uses to implement its object-oriented features. Python also has special syntax and related built-in modules that enhance dictionaries with additional capabilities beyond what you might expect from a simple hash table type in other languages.

Item 25: Be Cautious When Relying on Dictionary Insertion Ordering

In Python versions 3.5 and earlier, iterating over a dictionary instance would return its keys in arbitrary order. The order of iteration would not match the order in which the items were originally inserted into the dictionary. For example, here I use Python version 3.5 to create a dictionary that maps animal names to their corresponding baby names:

# Python 3.5
baby_names = {
    "cat": "kitten",
    "dog": "puppy",
}
print(baby_names)

>>>
{'dog': 'puppy', 'cat': 'kitten'}

When I created the dictionary, the keys were in the order "cat", "dog", but when I printed it, the keys were in the reverse order: "dog", "cat". This behavior is surprising, makes it harder to reproduce test cases, increases the difficulty of debugging, and is especially confusing to newcomers to Python.

This happened because the dictionary type in older versions of Python implements its hash table algorithm with a combination of the hash built-in function and a random seed that is assigned when the Python interpreter process starts executing. Together, these behaviors cause dictionary orderings to not match insertion order and to randomly shuffle between program executions.
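
To see this seed-dependence directly, here is a quick experiment (not part of the book's example code, and it assumes a POSIX-like environment where the interpreter can be re-launched via subprocess): it runs the same one-line script in two child interpreters with different PYTHONHASHSEED values and shows that the hash of a string differs between them, which is exactly what scrambled dictionary ordering before Python 3.6:

```python
import os
import subprocess
import sys

# One-line child program that prints the hash of a fixed string.
code = 'print(hash("cat"))'

def hash_with_seed(seed):
    # Launch a fresh interpreter with an explicit hash seed.
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()

print(hash_with_seed("1"))  # Deterministic for a fixed seed
print(hash_with_seed("2"))  # A different seed gives a different hash
```

Within a single process the seed is fixed, which is why the ordering only appeared to shuffle between program executions, not during one run.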

Starting with Python 3.6, and officially part of the Python specification since version 3.7, dictionaries preserve insertion order. Now, this code always prints the dictionary in the same way it was originally created by the programmer:

baby_names = {
    "cat": "kitten",
    "dog": "puppy",
}
print(baby_names)

>>>
{'cat': 'kitten', 'dog': 'puppy'}

With Python 3.5 and earlier, all methods provided by dict that rely on iteration order—including keys, values, items, and popitem—similarly demonstrate this random-looking behavior:

# Python 3.5
print(list(baby_names.keys()))
print(list(baby_names.values()))
print(list(baby_names.items()))
print(baby_names.popitem())  # Randomly chooses an item

>>>
['dog', 'cat']
['puppy', 'kitten']
[('dog', 'puppy'), ('cat', 'kitten')]
('dog', 'puppy')

These methods now provide consistent insertion ordering that you can rely on when you write your programs:

print(list(baby_names.keys()))
print(list(baby_names.values()))
print(list(baby_names.items()))
print(baby_names.popitem())  # Last item inserted

>>>
['cat', 'dog']
['kitten', 'puppy']
[('cat', 'kitten'), ('dog', 'puppy')]
('dog', 'puppy')

This change has many repercussions for other Python features that depend on the dict type and its specific implementation.

Keyword arguments to functions—including the **kwargs catch-all parameter (see Item 35: “Provide Optional Behavior with Keyword Arguments” and Item 37: “Enforce Clarity with Keyword-Only and Positional-Only Arguments”)— come through in seemingly random order in earlier versions of Python, which can make it harder to debug function calls:

# Python 3.5
def my_func(**kwargs):
    for key, value in kwargs.items():
        print("%s = %s" % (key, value))

my_func(goose="gosling", kangaroo="joey")

>>>
kangaroo = joey
goose = gosling

Now in the latest version of Python, the order of keyword arguments is always preserved to match how the programmer originally called the function:

def my_func(**kwargs):
    for key, value in kwargs.items():
        print(f"{key} = {value}")

my_func(goose="gosling", kangaroo="joey")

>>>
goose = gosling
kangaroo = joey

Classes also use the dict type for their instance dictionaries. In previous versions of Python, object fields show the randomizing behavior:

# Python 3.5
class MyClass:
    def __init__(self):
        self.alligator = "hatchling"
        self.elephant = "calf"

a = MyClass()
for key, value in a.__dict__.items():
    print("%s = %s" % (key, value))

>>>
elephant = calf
alligator = hatchling

Again, you can now assume that the order of assignment for these instance fields will be reflected in __dict__:

class MyClass:
    def __init__(self):
        self.alligator = "hatchling"
        self.elephant = "calf"

a = MyClass()
for key, value in a.__dict__.items():
    print(f"{key} = {value}")

>>>
alligator = hatchling
elephant = calf

The way that dictionaries preserve insertion ordering is now part of the Python language specification. For the language features above, you can rely on this behavior and even make it part of the APIs you design for your classes and functions (see Item 65: “Consider Class Body Definition Order to Establish Relationships Between Attributes” for an example).

Note

For a long time the collections built-in module has had an OrderedDict class that preserves insertion ordering. Although this class’s behavior is similar to that of the standard dict type (since Python 3.7), the performance characteristics of OrderedDict are quite different. If you need to handle a high rate of key insertions and popitem calls (e.g., to implement a least-recently-used cache), OrderedDict may be a better fit than the standard Python dict type (see Item 92: “Profile Before Optimizing” on how to make sure you need this).
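
To make the LRU use case concrete, here is a minimal sketch (the class name and fixed-capacity design are my own, not from the book) of a least-recently-used cache built on OrderedDict, using its move_to_end and popitem(last=False) methods:

```python
from collections import OrderedDict

class LRUCache:
    # Minimal LRU cache sketch with a fixed capacity.
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        value = self.data[key]          # Raises KeyError if missing
        self.data.move_to_end(key)      # Mark as most recently used
        return value

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # Evict least recently used
```

OrderedDict is optimized for exactly this pattern of frequent reordering and popping from both ends, which is where it can outperform a plain dict.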

However, you shouldn’t always assume that insertion ordering behavior will be present when you’re handling dictionaries. Python makes it easy for programmers to define their own custom container types that emulate the standard protocols matching list, dict, and other types (see Item 57: “Inherit from collections.abc Classes for Custom Container Types”). Python is not statically typed, so most code relies on duck typing—where an object’s behavior is its de facto type—instead of rigid class hierarchies (see Item 3: “Never Expect Python to Detect Errors at Compile Time”). This can result in surprising gotchas.

For example, say that I’m writing a program to show the results of a contest for the cutest baby animal. Here, I start with a dictionary containing the total vote count for each one:

votes = {
    "otter": 1281,
    "polar bear": 587,
    "fox": 863,
}

Now, I define a function to process this voting data and save the rank of each animal name into a provided empty dictionary. In this case, the dictionary could be the data model that powers a UI element:

def populate_ranks(votes, ranks):
    names = list(votes.keys())
    names.sort(key=votes.get, reverse=True)
    for i, name in enumerate(names, 1):
        ranks[name] = i

I also need a function that will tell me which animal won the contest. This function works by assuming that populate_ranks will assign the contents of the ranks dictionary in ascending order, meaning that the first key must be the winner:

def get_winner(ranks):
    return next(iter(ranks))

Here, I confirm that these functions work as designed and deliver the result that I expected:

ranks = {}
populate_ranks(votes, ranks)
print(ranks)
winner = get_winner(ranks)
print(winner)

>>>
{'otter': 1, 'fox': 2, 'polar bear': 3}
otter

Now, imagine that the requirements of this program have changed. The UI element that shows the results should be in alphabetical order instead of rank order. To accomplish this, I can use the collections.abc built-in module to define a new dictionary-like class that iterates its contents in alphabetical order:

from collections.abc import MutableMapping

class SortedDict(MutableMapping):
    def __init__(self):
        self.data = {}

    def __getitem__(self, key):
        return self.data[key]

    def __setitem__(self, key, value):
        self.data[key] = value

    def __delitem__(self, key):
        del self.data[key]

    def __iter__(self):
        keys = list(self.data.keys())
        keys.sort()
        for key in keys:
            yield key

    def __len__(self):
        return len(self.data)

I can use a SortedDict instance in place of a standard dict with the functions from before, and no errors will be raised since this class conforms to the protocol of a standard dictionary. However, the results are incorrect:

sorted_ranks = SortedDict()
populate_ranks(votes, sorted_ranks)
print(sorted_ranks.data)
winner = get_winner(sorted_ranks)
print(winner)

>>>
{'otter': 1, 'fox': 2, 'polar bear': 3}
fox

The problem here is that the implementation of get_winner assumes that the dictionary’s iteration is in insertion order to match populate_ranks. This code is using SortedDict instead of dict, so that assumption is no longer true. Thus, the value returned for the winner is "fox", which is alphabetically first.

There are three ways to mitigate this problem. First, I can reimplement the get_winner function to no longer assume that the ranks dictionary has a specific iteration order. This is the most conservative and robust solution:

def get_winner(ranks):
    for name, rank in ranks.items():
        if rank == 1:
            return name

winner = get_winner(sorted_ranks)
print(winner)

>>>
otter

The second approach is to add an explicit check to the top of the function to ensure that the type of ranks matches my expectations and to raise an exception if not. This solution likely has better runtime performance than the more conservative approach:

def get_winner(ranks):
    if not isinstance(ranks, dict):
        raise TypeError("must provide a dict instance")
    return next(iter(ranks))


get_winner(sorted_ranks)

>>>
Traceback ...
TypeError: must provide a dict instance

The third alternative is to use type annotations to enforce that the value passed to get_winner is a dict instance and not a MutableMapping with dictionary-like behavior (see Item 124: “Consider Static Analysis via typing to Obviate Bugs”). Here, I run the mypy tool in strict mode on a type-annotated version of the code above:

from typing import Dict, MutableMapping

def populate_ranks(votes: Dict[str, int],
                   ranks: Dict[str, int]) -> None:
    names = list(votes.keys())
    names.sort(key=votes.__getitem__, reverse=True)
    for i, name in enumerate(names, 1):
        ranks[name] = i

def get_winner(ranks: Dict[str, int]) -> str:
    return next(iter(ranks))

class SortedDict(MutableMapping[str, int]):
    ...

votes = {
    "otter": 1281,
    "polar bear": 587,
    "fox": 863,
}

sorted_ranks = SortedDict()
populate_ranks(votes, sorted_ranks)
print(sorted_ranks.data)
winner = get_winner(sorted_ranks)
print(winner)

$ python3 -m mypy --strict example.py
.../example.py:48: error: Argument 2 to "populate_ranks" has
➥incompatible type "SortedDict"; expected "dict[str, int]"
➥[arg-type]
.../example.py:50: error: Argument 1 to "get_winner" has
➥incompatible type "SortedDict"; expected "dict[str, int]"
➥[arg-type]
Found 2 errors in 1 file (checked 1 source file)

This correctly detects the mismatch between the dict and SortedDict types and flags the incorrect usage as an error. This solution provides the best mix of static type safety and runtime performance.

Things to Remember

  • Since Python 3.7, you can rely on the fact that iterating a dictionary instance’s contents will occur in the same order in which the keys were initially added.

  • Python makes it easy to define objects that act like dictionaries but that aren’t dict instances. For these types, you can’t assume that insertion ordering will be preserved.

  • There are three ways to be careful about dictionary-like classes: Write code that doesn’t rely on insertion ordering, explicitly check for the dict type at runtime, or require dict values using type annotations and static analysis.

Item 26: Prefer get over in and KeyError to Handle Missing Dictionary Keys

The three fundamental operations for interacting with dictionaries are accessing, assigning, and deleting keys and their associated values. The contents of dictionaries are dynamic, and thus it’s entirely possible—even likely—that when you try to access or delete a key, it won’t already be present.

For example, say that I’m trying to determine people’s favorite type of bread to devise the menu for a sandwich shop. Here, I define a dictionary of counters with the current votes for each style of loaf:

counters = {
    "pumpernickel": 2,
    "sourdough": 1,
}

To increment the counter for a new vote, I need to see if the key exists, insert the key with a default counter value of zero if it’s missing, and then increment the counter’s value. This requires accessing the key two times and assigning it once. Here, I accomplish this task by using an if statement with an in operator that returns True when the key is present:

key = "wheat"

if key in counters:
    count = counters[key]
else:
    count = 0

counters[key] = count + 1
print(counters)

>>>
{'pumpernickel': 2, 'sourdough': 1, 'wheat': 1}

Another way to accomplish the same behavior is by relying on how dictionaries raise a KeyError exception when you try to get the value for a key that doesn’t exist. Putting aside the cost of raising and catching exceptions (see Item 80: “Take Advantage of Each Block in try/except/else/finally”), this approach is more efficient, in theory, because it requires only one access and one assignment:

try:
    count = counters[key]
except KeyError:
    count = 0

counters[key] = count + 1

This flow of fetching a key that exists or returning a default value is so common that the dict built-in type provides the get method to accomplish this task. The second parameter to get is the default value to return in the case that the key—the first parameter—isn’t present (see Item 32: “Prefer Raising Exceptions to Returning None” on whether that's a good interface). This approach also requires only one access and one assignment, but it’s much shorter than the KeyError example and avoids the exception handling overhead:

count = counters.get(key, 0)
counters[key] = count + 1

It’s possible to shorten the in operator and KeyError approaches in various ways, but all of these alternatives suffer from requiring code duplication for the assignments, which makes them less readable and worth avoiding:

if key not in counters:
    counters[key] = 0
counters[key] += 1

if key in counters:
    counters[key] += 1
else:
    counters[key] = 1

try:
    counters[key] += 1
except KeyError:
    counters[key] = 1

Thus, for a dictionary with simple types, using the get method is the shortest and clearest option.

Note

If you’re maintaining dictionaries of counters like this, it’s worth considering the Counter class from the collections built-in module, which provides most of the functionality you're likely to need.
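
As a quick illustration of that suggestion (this snippet is mine, not the book's), Counter makes missing keys implicitly count as zero, so the whole get-then-assign dance disappears:

```python
from collections import Counter

# Counter treats absent keys as 0, so incrementing never raises KeyError.
counters = Counter({"pumpernickel": 2, "sourdough": 1})
counters["wheat"] += 1
print(counters.most_common(1))  # Highest-count bread first
```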

What if the values in a dictionary are a more complex type, like a list? For example, say that instead of only counting votes, I also want to know who voted for each type of bread. Here, I do this by associating a list of names with each key:

votes = {
    "baguette": ["Bob", "Alice"],
    "ciabatta": ["Coco", "Deb"],
}

key = "brioche"
who = "Elmer"

if key in votes:
    names = votes[key]
else:
    votes[key] = names = []

names.append(who)
print(votes)

>>>
{'baguette': ['Bob', 'Alice'],
 'ciabatta': ['Coco', 'Deb'],
 'brioche': ['Elmer']}

Relying on the in operator requires two accesses if the key is present, or one access and one assignment if the key is missing. This example is different from the counters example above because the value for each key can be assigned blindly to the default value of an empty list if the key doesn’t already exist. The triple assignment statement (votes[key] = names = []) populates the key in one line instead of two. Once the default value has been inserted into the dictionary, I don’t need to assign it again because the list is modified by reference in the later call to append (see Item 30: “Know That Function Arguments Can Be Mutated” for background).
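
A tiny standalone demonstration of this shared-reference behavior (separate from the voting example above): after the triple assignment, the dictionary entry and the local name are the very same list object, so mutating one is visible through the other:

```python
votes = {}
votes["brioche"] = names = []  # One list, two references
names.append("Elmer")

print(votes["brioche"] is names)  # True: same object, not a copy
print(votes)
```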

It’s also possible to rely on the KeyError exception being raised when the dictionary value is a list. This approach requires one key access if the key is present, or one key access and one assignment if it’s missing, which makes it more efficient than the in operator (ignoring the cost of the exception handling machinery):

try:
    names = votes[key]
except KeyError:
    votes[key] = names = []

names.append(who)

Similarly, I can use the get method to fetch a list value when the key is present, or do one fetch and one assignment if the key isn’t present:

names = votes.get(key)
if names is None:
    votes[key] = names = []

names.append(who)

The approach that involves using get to fetch list values can further be shortened by one line if you use an assignment expression (introduced in Python 3.8; see Item 8: “Prevent Repetition with Assignment Expressions”) in the if statement, which improves readability:

if (names := votes.get(key)) is None:
    votes[key] = names = []

names.append(who)

Notably, the dict type also provides the setdefault method to help shorten this pattern even further. setdefault tries to fetch the value of a key in the dictionary. If the key isn’t present, the method assigns that key to the default value provided. And then the method returns the value for that key: either the originally present value or the newly inserted default value. Here, I use setdefault to implement the same logic as in the get example above:

names = votes.setdefault(key, [])
names.append(who)

This works as expected, and it is shorter than using get with an assignment expression. However, the readability of this approach isn’t ideal. The method name setdefault doesn’t make its purpose immediately obvious. Why is it set when what it’s doing is getting a value? Why not call it get_or_set? I’m arguing about the color of the bike shed here, but the point is that if you were a new reader of the code and not completely familiar with Python, you might have trouble understanding what this code is trying to accomplish because setdefault isn’t self-explanatory.

There’s also one important gotcha: The default value passed to setdefault is assigned directly into the dictionary when the key is missing instead of being copied. Here, I demonstrate the effect of this when the value is a list:

data = {}
key = "foo"
value = []
data.setdefault(key, value)
print("Before:", data)
value.append("hello")
print("After: ", data)

>>>
Before: {'foo': []}
After:  {'foo': ['hello']}

So, I need to make sure I’m always constructing a new default value for each key I access with setdefault. This leads to a significant performance overhead in this example because I have to allocate a list instance for each call. If I reuse an object for the default value—which I might try to do to increase efficiency or readability—I might introduce strange behavior and bugs (see Item 36: “Use None and Docstrings to Specify Dynamic Default Arguments” for another example of this problem).

Going back to the earlier example that used counters for dictionary values instead of lists of who voted: Why not also use the setdefault method in that case? Here, I reimplement the same example using this approach:

count = counters.setdefault(key, 0)
counters[key] = count + 1

The problem here is that the call to setdefault is superfluous. You always need to assign the key in the dictionary to a new value after you increment the counter, so the extra assignment done by setdefault is unnecessary. The earlier approach of using get for counter updates requires only one access and one assignment, whereas using setdefault requires one access and two assignments.

There are only a few circumstances in which using setdefault is the shortest way to handle missing dictionary keys (e.g., for list instance default values that are cheap to construct and won’t raise exceptions). In these very specific cases, it might seem worth accepting the confusing method name setdefault instead of having to write more characters and lines to use get. However, often what you really should do in these situations is use defaultdict instead (see Item 27: “Prefer defaultdict over setdefault to Handle Missing Items in Internal State”).

Things to Remember

  • There are four common ways to detect and handle missing keys in dictionaries: using in operators, KeyError exceptions, the get method, and the setdefault method.

  • The get method is best for dictionaries that contain basic types like counters, and it is preferable along with assignment expressions when creating dictionary default values has a high cost or might raise exceptions.

  • When the setdefault method of dict seems like the best fit for your problem, you should consider using defaultdict instead.

Item 27: Prefer defaultdict over setdefault to Handle Missing Items in Internal State

When working with a dictionary that you didn’t create, there are a variety of ways to handle missing keys (see Item 26: “Prefer get over in and KeyError to Handle Missing Dictionary Keys”). Although using the get method is a better approach than using in operators and KeyError exceptions, for some use cases, setdefault appears to be the shortest option.

For example, say that I want to keep track of the cities I’ve visited in countries around the world. Here, I do this by using a dictionary that maps country names to a set instance containing corresponding city names:

visits = {
    "Mexico": {"Tulum", "Puerto Vallarta"},
    "Japan": {"Hakone"},
}

I can use the setdefault method to add new cities to the sets, whether the country name is already present in the dictionary or not. The code is much shorter than achieving the same behavior with the get method and an assignment expression:

# Short
visits.setdefault("France", set()).add("Arles")

# Long
if (japan := visits.get("Japan")) is None:
    visits["Japan"] = japan = set()

japan.add("Kyoto")
print(visits)

>>>
{'Mexico': {'Puerto Vallarta', 'Tulum'},
 'Japan': {'Kyoto', 'Hakone'},
 'France': {'Arles'}}

What about the situation when you do control creation of the dictionary being accessed? This is generally the case when you’re using a dictionary instance to keep track of the internal state of an object, for example. Here, I wrap the example above in a class with helper methods to access its dynamic inner state that’s stored in a dictionary:

class Visits:
    def __init__(self):
        self.data = {}

    def add(self, country, city):
        city_set = self.data.setdefault(country, set())
        city_set.add(city)

This new class hides the complexity of calling setdefault with the correct arguments, and it provides a nicer interface for the programmer:

visits = Visits()
visits.add("Russia", "Yekaterinburg")
visits.add("Tanzania", "Zanzibar")
print(visits.data)

>>>
{'Russia': {'Yekaterinburg'}, 'Tanzania': {'Zanzibar'}}

However, the implementation of the Visits.add method isn’t ideal. The setdefault method is still confusingly named, which makes it more difficult for a new reader of the code to immediately understand what’s happening. And the implementation isn’t efficient because it constructs a new set instance on every call, regardless of whether the given country was already present in the data dictionary.

Luckily, the defaultdict class from the collections built-in module simplifies this common use case by automatically storing a default value when a key doesn’t exist. All you have to do is provide a function that will return the default value to use each time a key is missing (an example of Item 48: “Accept Functions Instead of Classes for Simple Interfaces”). Here, I rewrite the Visits class to use defaultdict:

from collections import defaultdict

class Visits:
    def __init__(self):
        self.data = defaultdict(set)

    def add(self, country, city):
        self.data[country].add(city)

visits = Visits()
visits.add("England", "Bath")
visits.add("England", "London")
print(visits.data)

>>>
defaultdict(<class 'set'>, {'England': {'Bath', 'London'}})

Now, the implementation of add is short and simple. The code can assume that accessing any key in the data dictionary will always result in an existing set instance. No superfluous set instances will be allocated, which could be costly if the add method is called a large number of times.
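
This efficiency claim can be checked with a small experiment (the counting wrapper is mine, not from the book): instrument the factory function and confirm that defaultdict only invokes it for keys that are actually missing:

```python
from collections import defaultdict

factory_calls = 0

def counted_set():
    # Stand-in for set that records how often the factory runs.
    global factory_calls
    factory_calls += 1
    return set()

data = defaultdict(counted_set)
data["England"].add("Bath")    # Key missing: factory called once
data["England"].add("London")  # Key present: factory not called again

print(factory_calls)  # 1
```

Compare this with setdefault, which would have constructed a fresh set on every call regardless of whether the key already existed.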

Using defaultdict is much better than using setdefault for this type of situation (see Item 29: “Compose Classes Instead of Deeply Nesting Dictionaries, Lists, and Tuples” for another example). There are still cases in which defaultdict will fall short of solving your problems, but there are even more tools available in Python to work around those limitations (see Item 28: “Know How to Construct Key-Dependent Default Values with __missing__”, Item 57: “Inherit from collections.abc Classes for Custom Container Types”, and the collections.Counter built-in class).

Things to Remember

  • If you’re creating a dictionary to manage an arbitrary set of potential keys, then you should prefer using a defaultdict instance from the collections built-in module if it suits your problem.

  • If a dictionary of arbitrary keys is passed to you, and you don’t control its creation, then you should prefer the get method to access its items. However, it’s worth considering using the setdefault method for the few situations in which it leads to shorter code and the default object allocation cost is low.

Item 28: Know How to Construct Key-Dependent Default Values with __missing__

The built-in dict type’s setdefault method results in shorter code when handling missing keys in some specific circumstances (see Item 26: “Prefer get over in and KeyError to Handle Missing Dictionary Keys”). For many of those situations, the better tool for the job is the defaultdict type from the collections built-in module (see Item 27: “Prefer defaultdict over setdefault to Handle Missing Items in Internal State” for why). However, there are times when neither setdefault nor defaultdict is the right fit.

For example, say that I’m writing a program to manage social network profile pictures on the filesystem. I need a dictionary to map profile picture pathnames to open file handles so I can read and write those images as needed. Here, I do this by using a normal dict instance and checking for the presence of keys using the get method and an assignment expression (see Item 8: “Prevent Repetition with Assignment Expressions”):

pictures = {}
path = "profile_1234.png"

if (handle := pictures.get(path)) is None:
    try:
        handle = open(path, "a+b")
    except OSError:
        print(f"Failed to open path {path}")
        raise
    else:
        pictures[path] = handle

handle.seek(0)
image_data = handle.read()

When the file handle already exists in the dictionary, this code makes only a single dictionary access. In the case that the file handle doesn’t exist, the dictionary is accessed once by get, and then it is assigned in the else clause of the try/except statement (see Item 80: “Take Advantage of Each Block in try/except/else/finally”). The call to the read method stands clearly separate from the code that calls open and handles exceptions.

Although it’s possible to use the in operator or KeyError approaches to implement this same logic, those options require more dictionary accesses and levels of nesting. Given that these other options work, you might also assume that the setdefault method would work, too:

try:
    handle = pictures.setdefault(path, open(path, "a+b"))
except OSError:
    print(f"Failed to open path {path}")
    raise
else:
    handle.seek(0)
    image_data = handle.read()

This code has many problems. The open built-in function to create the file handle is always called, even when the path is already present in the dictionary. This results in an additional file handle that might conflict with existing open handles in the same program. Exceptions might be raised by the open call and need to be handled, but it may not be possible to differentiate them from exceptions that could be raised by the setdefault call on the same line (which is possible for other dictionary-like implementations; see Item 57: “Inherit from collections.abc Classes for Custom Container Types”).
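
The root cause is that Python evaluates all arguments before a method is called, so the default expression runs even when setdefault will never use it. Here is a small demonstration of that eager evaluation (the helper name is illustrative, not from the book), using a counter instead of a real file handle:

```python
construction_count = 0

def make_default():
    # Stand-in for an expensive or side-effecting default, like open().
    global construction_count
    construction_count += 1
    return []

data = {"a": [1]}
# The default is constructed before setdefault even checks the key,
# and then it's thrown away because "a" is already present.
data.setdefault("a", make_default())

print(construction_count)  # 1
print(data)                # {'a': [1]}: the discarded default left no trace
```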

If you’re trying to manage internal state, another assumption you might make is that a defaultdict could be used for keeping track of these profile pictures. Here, I attempt to implement the same logic as before but now using a helper function and the defaultdict class:

from collections import defaultdict

def open_picture(profile_path):
    try:
        return open(profile_path, "a+b")
    except OSError:
        print(f"Failed to open path {profile_path}")
        raise

pictures = defaultdict(open_picture)
handle = pictures[path]
handle.seek(0)
image_data = handle.read()

>>>
Traceback ...
TypeError: open_picture() missing 1 required positional
➥argument: 'profile_path'

The problem is that defaultdict expects that the function passed to its constructor doesn’t require any arguments. This means that the helper function defaultdict calls doesn’t know the specific key that’s being accessed, which hinders my ability to call open. In this situation, both setdefault and defaultdict fall short of what I need.

Fortunately, this situation is common enough that Python has another built-in solution. You can subclass the dict type and implement the __missing__ special method to add custom logic for handling missing keys. Here, I do this by defining a new class that takes advantage of the same open_picture helper method defined above:

class Pictures(dict):
    def __missing__(self, key):
        value = open_picture(key)
        self[key] = value
        return value

pictures = Pictures()
handle = pictures[path]
handle.seek(0)
image_data = handle.read()

When the pictures[path] dictionary access finds that the path key isn’t present in the dictionary, the __missing__ method is called. This method must create the new default value for the key, insert it into the dictionary, and return it to the caller. Subsequent accesses of the same path will not call __missing__ since the corresponding item is already present (similar to the behavior of __getattr__; see Item 61: “Use __getattr__, __getattribute__, and __setattr__ for Lazy Attributes”).
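
One subtlety worth knowing: __missing__ is triggered only by the square-bracket __getitem__ protocol, not by the get method. A hypothetical minimal subclass (mine, not the book's) makes the difference visible:

```python
class WithDefault(dict):
    def __missing__(self, key):
        # Key-dependent default: build it, store it, return it.
        value = f"<{key}>"
        self[key] = value
        return value

d = WithDefault()
print(d["a"])      # Calls __missing__: inserts and returns '<a>'
print(d.get("b"))  # Bypasses __missing__ entirely: None
```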

Things to Remember

  • The setdefault method of dict is a bad fit when creating the default value has high computational cost or might raise exceptions.

  • The function passed to defaultdict must not require any arguments, which makes it impossible to have the default value depend on the key being accessed.

  • You can define your own dict subclass with a __missing__ method in order to construct default values that must know which key was being accessed.

Item 29: Compose Classes Instead of Deeply Nesting Dictionaries, Lists, and Tuples

Python’s built-in dictionary type is wonderful for maintaining dynamic internal state over the lifetime of an object. By dynamic, I mean situations in which you need to do bookkeeping for an unexpected set of identifiers. For example, say that I want to record the grades of a set of students whose names aren’t known in advance. I can define a class to store the names in a dictionary instead of using a predefined attribute for each student:

class SimpleGradebook:
    def __init__(self):
        self._grades = {}

    def add_student(self, name):
        self._grades[name] = []

    def report_grade(self, name, score):
        self._grades[name].append(score)

    def average_grade(self, name):
        grades = self._grades[name]
        return sum(grades) / len(grades)

Using the class is simple:

book = SimpleGradebook()
book.add_student("Isaac Newton")
book.report_grade("Isaac Newton", 90)
book.report_grade("Isaac Newton", 95)
book.report_grade("Isaac Newton", 85)

print(book.average_grade("Isaac Newton"))

>>>
90.0

Dictionaries, lists, tuples, and sets are so easy to use that there’s a danger they’ll cause you to write brittle code. For example, say that I want to extend the SimpleGradebook class to keep a list of grades by subject, not just overall. I can do this by changing the _grades dictionary to map student names (its keys) to yet another dictionary (its values). The innermost dictionary will map subjects (its keys) to a list of grades (its values). Here, I do this by using a defaultdict instance for the inner dictionary to handle missing subjects (see Item 27: “Prefer defaultdict over setdefault to Handle Missing Items in Internal State” for background):

from collections import defaultdict

class BySubjectGradebook:
    def __init__(self):
        self._grades = {}                       # Outer dict

    def add_student(self, name):
        self._grades[name] = defaultdict(list)  # Inner dict

    def report_grade(self, name, subject, grade):
        by_subject = self._grades[name]
        grade_list = by_subject[subject]
        grade_list.append(grade)

    def average_grade(self, name):
        by_subject = self._grades[name]
        total, count = 0, 0
        for grades in by_subject.values():
            total += sum(grades)
            count += len(grades)
        return total / count

This is straightforward enough. The report_grade and average_grade methods gained quite a bit of complexity to deal with the multilevel dictionary, but it’s seemingly manageable. Using the class remains simple:

book = BySubjectGradebook()
book.add_student("Albert Einstein")
book.report_grade("Albert Einstein", "Math", 75)
book.report_grade("Albert Einstein", "Math", 65)
book.report_grade("Albert Einstein", "Gym", 90)
book.report_grade("Albert Einstein", "Gym", 95)
print(book.average_grade("Albert Einstein"))

>>>
81.25

Now, imagine that the requirements change again. I also want to track the weight of each score toward the overall grade in the class so that midterm and final exams are more important than pop quizzes. One way to implement this feature is to change the innermost dictionary; instead of mapping subjects (its keys) to a list of grades (its values), I can use tuples of (score, weight) in each key’s corresponding value list. Although the changes to report_grade seem simple—just make the grade_list store tuple instances—the average_grade method now has a loop within a loop and is difficult to read:

class WeightedGradebook:
    def __init__(self):
        self._grades = {}

    def add_student(self, name):
        self._grades[name] = defaultdict(list)

    def report_grade(self, name, subject, score, weight):
        by_subject = self._grades[name]
        grade_list = by_subject[subject]
        grade_list.append((score, weight))    # Changed

    def average_grade(self, name):
        by_subject = self._grades[name]

        score_sum, score_count = 0, 0
        for scores in by_subject.values():
            subject_avg, total_weight = 0, 0
            for score, weight in scores:      # Added inner loop
                subject_avg += score * weight
                total_weight += weight

            score_sum += subject_avg / total_weight
            score_count += 1

        return score_sum / score_count

Using the class has also gotten more difficult. It’s unclear what all the numbers in the positional arguments mean:

book = WeightedGradebook()
book.add_student("Albert Einstein")
book.report_grade("Albert Einstein", "Math", 75, 0.05)
book.report_grade("Albert Einstein", "Math", 65, 0.15)
book.report_grade("Albert Einstein", "Math", 70, 0.80)
book.report_grade("Albert Einstein", "Gym", 100, 0.40)
book.report_grade("Albert Einstein", "Gym", 85, 0.60)
print(book.average_grade("Albert Einstein"))

>>>
80.25

When you see complexity like this, it’s time to make the leap from built-in types like dictionaries, lists, tuples, and sets to a hierarchy of classes.

In the grades example, at first I didn’t know I’d need to support weighted grades, so the complexity of creating other classes seemed unwarranted. Python’s built-in dictionary and tuple types made it easy to keep going, adding layer after layer to the internal bookkeeping. But you should avoid doing this for more than one level of nesting; using dictionaries that contain dictionaries makes your code hard for other programmers to read and sets you up for a maintenance nightmare (see Item 9: “Consider match for Destructuring in Flow Control, Avoid When if Statements Are Sufficient” for another way to deal with this).

As soon as you realize that your bookkeeping is getting complicated, break it all out into classes. You can then provide well-defined interfaces that better encapsulate your data. This approach also enables you to create a layer of abstraction between your interfaces and your concrete implementations.

Refactoring to Classes

There are many approaches to refactoring (see Item 123: “Consider warnings to Refactor and Migrate Usage” for an example). In this case, I can start moving to classes at the bottom of the dependency tree: a single grade. A class seems too heavyweight for such simple information. A tuple, though, seems appropriate because grades are immutable. Here, I use a tuple of (score, weight) to track grades in a list:

grades = []
grades.append((95, 0.45))
grades.append((85, 0.55))
total = sum(score * weight for score, weight in grades)
total_weight = sum(weight for _, weight in grades)
average_grade = total / total_weight

I used _ (the underscore variable name, a Python convention for unused variables) to capture the first entry in each grade’s tuple and ignore it when calculating total_weight.

The problem with this code is that tuple instances are positional. For example, if I want to associate more information with a grade, such as a set of notes from the teacher, I need to rewrite every usage of the two-tuple to account for the three items now present instead of two, which means using _ in more places to ignore certain indexes:

grades = []
grades.append((95, 0.45, "Great job"))
grades.append((85, 0.55, "Better next time"))
total = sum(score * weight for score, weight, _ in grades)
total_weight = sum(weight for _, weight, _ in grades)
average_grade = total / total_weight

This pattern of extending tuples longer and longer is similar to deepening layers of dictionaries. As soon as you find yourself going longer than a two-tuple, it’s time to consider another approach. The dataclasses built-in module does exactly what I need in this case: It lets me easily define a small immutable class for storing values in attributes (see Item 56: “Prefer dataclasses for Creating Immutable Objects”):

from dataclasses import dataclass

@dataclass(frozen=True)
class Grade:
    score: int
    weight: float

Next, I can write a class to represent a single subject that contains a list of Grade instances:

class Subject:
    def __init__(self):
        self._grades = []

    def report_grade(self, score, weight):
        self._grades.append(Grade(score, weight))

    def average_grade(self):
        total, total_weight = 0, 0
        for grade in self._grades:
            total += grade.score * grade.weight
            total_weight += grade.weight
        return total / total_weight

Then, I write a class to hold the set of subjects that are being studied by a single student:

class Student:
    def __init__(self):
        self._subjects = defaultdict(Subject)

    def get_subject(self, name):
        return self._subjects[name]

    def average_grade(self):
        total, count = 0, 0
        for subject in self._subjects.values():
            total += subject.average_grade()
            count += 1
        return total / count

Finally, I write a container for all of the students, keyed dynamically by their names:

class Gradebook:
    def __init__(self):
        self._students = defaultdict(Student)

    def get_student(self, name):
        return self._students[name]

The line count of these classes is almost double the previous implementation’s size. But this code is much easier to read. The example driving the classes is also clearer and more extensible:

book = Gradebook()
albert = book.get_student("Albert Einstein")
math = albert.get_subject("Math")
math.report_grade(75, 0.05)
math.report_grade(65, 0.15)
math.report_grade(70, 0.80)
gym = albert.get_subject("Gym")
gym.report_grade(100, 0.40)
gym.report_grade(85, 0.60)
print(albert.average_grade())

>>>
80.25

It would also be possible to write backward-compatible methods to help migrate usage of the old API style to the new hierarchy of objects.
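For instance, one hypothetical shim (the CompatGradebook name and its methods are my own, mirroring the WeightedGradebook API from earlier) could subclass Gradebook and translate old-style calls into the new object hierarchy; the supporting classes are repeated here so the sketch runs on its own:

```python
from collections import defaultdict
from dataclasses import dataclass

# Classes from above, repeated so this sketch is self-contained
@dataclass(frozen=True)
class Grade:
    score: int
    weight: float

class Subject:
    def __init__(self):
        self._grades = []

    def report_grade(self, score, weight):
        self._grades.append(Grade(score, weight))

    def average_grade(self):
        total, total_weight = 0, 0
        for grade in self._grades:
            total += grade.score * grade.weight
            total_weight += grade.weight
        return total / total_weight

class Student:
    def __init__(self):
        self._subjects = defaultdict(Subject)

    def get_subject(self, name):
        return self._subjects[name]

    def average_grade(self):
        total, count = 0, 0
        for subject in self._subjects.values():
            total += subject.average_grade()
            count += 1
        return total / count

class Gradebook:
    def __init__(self):
        self._students = defaultdict(Student)

    def get_student(self, name):
        return self._students[name]

# Hypothetical migration shim: exposes the old WeightedGradebook-style
# methods while delegating to the new class hierarchy underneath.
class CompatGradebook(Gradebook):
    def add_student(self, name):
        self.get_student(name)  # defaultdict creates the Student lazily

    def report_grade(self, name, subject, score, weight):
        student = self.get_student(name)
        student.get_subject(subject).report_grade(score, weight)

    def average_grade(self, name):
        return self.get_student(name).average_grade()

book = CompatGradebook()
book.add_student("Albert Einstein")
book.report_grade("Albert Einstein", "Math", 75, 0.05)
book.report_grade("Albert Einstein", "Math", 65, 0.15)
book.report_grade("Albert Einstein", "Math", 70, 0.80)
book.report_grade("Albert Einstein", "Gym", 100, 0.40)
book.report_grade("Albert Einstein", "Gym", 85, 0.60)
print(book.average_grade("Albert Einstein"))
```

Callers can keep using the old interface while new code adopts the class hierarchy directly, and the shim can be deleted once the migration is complete.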

Things to Remember

  • Avoid making dictionaries with values that are dictionaries, long tuples, or complex nestings of other built-in types.

  • Use the dataclasses built-in module for lightweight, immutable data containers before you need the flexibility of a full class.

  • Move your bookkeeping code to using multiple classes when your internal state dictionaries get complicated.