2

Strings and Slicing

Python initially grew in popularity as a scripting language for orchestrating command-line utilities and processing input and output data. With built-in syntax, methods, and modules for string and sequence processing, Python was an attractive alternative to traditional shells and other common scripting languages (e.g., Perl). Since then, Python has continued to grow into adjacent domains, becoming an ideal programming language for parsing text, generating structured data, inspecting file formats, analyzing logs, and so on.

By using the bytes and str types, Python programs can interface with human language text, manipulate low-level binary data formats, and perform input/output (I/O) to communicate with the outside world. Python abstracts over these character types, lists, and other types to provide a common interface for indexing, subsequencing, and more. These capabilities are so essential that you'll see them in nearly every program.

Item 10: Know the Differences Between bytes and str

In Python, there are two types that represent sequences of character data: bytes and str. Instances of bytes contain raw, unsigned 8-bit values (often displayed in ASCII encoding):

a = b"h\x65llo"
print(type(a))
print(list(a))
print(a)

>>>
<class 'bytes'>
[104, 101, 108, 108, 111]
b'hello'

Instances of str contain Unicode code points that represent textual characters from human languages:

a = "a\u0300 propos"
print(type(a))
print(list(a))
print(a)

>>>
<class 'str'>
['a', '̀', ' ', 'p', 'r', 'o', 'p', 'o', 's']
à propos

Importantly, a str instance does not have an associated binary encoding, and a bytes instance does not have an associated text encoding. To convert Unicode data to binary data, you must call the encode method of str. To convert binary data to Unicode data, you must call the decode method of bytes. You can explicitly specify the encoding you want to use for these methods, or you can accept their default, which is UTF-8. File handles are different: they use the system default encoding, which is commonly UTF-8 but not always, as you’ll see shortly.
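
As a minimal sketch of that round trip (the variable names are illustrative and not part of the original text), the following encodes the same str with two different encodings and decodes each result back:

```python
text = "\u00e0 propos"  # "à propos" written with a single precomposed code point

# str -> bytes: encode produces raw 8-bit values for the named encoding
utf8_data = text.encode("utf-8")
latin1_data = text.encode("latin-1")

# The same text yields different bytes under different encodings
print(utf8_data)
print(latin1_data)

# bytes -> str: decode must name the encoding that produced the bytes
assert utf8_data.decode("utf-8") == text
assert latin1_data.decode("latin-1") == text
```

The bytes alone do not record which encoding produced them, which is why the caller has to carry that knowledge.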

When you’re writing Python programs, it’s important to do encoding and decoding of Unicode data at the furthest boundary of your interfaces; this approach is often called the Unicode sandwich. The core of your program should use the str type, which contains Unicode data, and should not assume anything about character encodings. This setup allows you to be very accepting of alternative text encodings (such as Latin-1, Shift JIS, and Big5) while being strict about your output text encoding (ideally, UTF-8).
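
One way to picture the Unicode sandwich is the sketch below. The function names shout and handle_request are hypothetical, and the input encoding is assumed to be known by the caller:

```python
def shout(text):
    # Core logic: works on str only and assumes nothing about encodings
    return text.upper()


def handle_request(raw_input, input_encoding):
    # Input boundary: decode bytes to str as early as possible
    text = raw_input.decode(input_encoding)
    result = shout(text)
    # Output boundary: encode str to bytes as late as possible, always UTF-8
    return result.encode("utf-8")


latin1_input = "caf\u00e9".encode("latin-1")  # "café" as Latin-1 bytes
output = handle_request(latin1_input, "latin-1")
assert output == "CAF\u00c9".encode("utf-8")  # "CAFÉ" as UTF-8 bytes
```

Only the two boundary lines mention an encoding; everything between them deals with str.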

The split between character data types leads to two common situations in Python code:

  • You want to operate on raw 8-bit sequences that contain UTF-8-encoded strings (or some other encoding).

  • You want to operate on Unicode strings that have no specific encoding.

You’ll often need two helper functions to convert between these cases and to ensure that the type of input values matches your code’s expectations.

The first function takes a bytes or str instance and always returns a str:

def to_str(bytes_or_str):
    if isinstance(bytes_or_str, bytes):
        value = bytes_or_str.decode("utf-8")
    else:
        value = bytes_or_str
    return value  # Instance of str

print(repr(to_str(b"foo")))
print(repr(to_str("bar")))

>>>
'foo'
'bar'

The second function takes a bytes or str instance and always returns a bytes:

def to_bytes(bytes_or_str):
    if isinstance(bytes_or_str, str):
        value = bytes_or_str.encode("utf-8")
    else:
        value = bytes_or_str
    return value  # Instance of bytes

print(repr(to_bytes(b"foo")))
print(repr(to_bytes("bar")))

>>>
b'foo'
b'bar'

There are two big gotchas when dealing with raw 8-bit values and Unicode strings in Python.

The first issue is that bytes and str seem to work the same way, but their instances are not compatible with each other, so you must be deliberate about the types of character sequences that you’re passing around.

By using the + operator, you can add bytes to bytes and str to str, respectively:

print(b"one" + b"two")
print("one" + "two")

>>>
b'onetwo'
onetwo

But you can’t add str instances to bytes instances:

b"one" + "two"

>>>
Traceback ...
TypeError: can't concat str to bytes

You also can’t add bytes instances to str instances:

"one" + b"two"

>>>
Traceback ...
TypeError: can only concatenate str (not "bytes") to str

By using binary operators, you can compare bytes to bytes and str to str, respectively:

assert b"red" > b"blue"
assert "red" > "blue"

But you can’t compare a str instance to a bytes instance:

assert "red" > b"blue"

>>>
Traceback ...
TypeError: '>' not supported between instances of 'str' and 'bytes'

And you also can’t compare a bytes instance to a str instance:

assert b"blue" < "red"

>>>
Traceback ...
TypeError: '<' not supported between instances of 'bytes' and 'str'

Comparing bytes and str instances for equality will always evaluate to False, even when they contain exactly the same characters (in this case, ASCII-encoded "foo"):

print(b"foo" == "foo")

>>>
False

The % operator works with format strings for each type (see Item 11: “Prefer Interpolated F-Strings over C-Style Format Strings and str.format” for background):

blue_bytes = b"blue"
blue_str = "blue"
print(b"red %s" % blue_bytes)
print("red %s" % blue_str)

>>>
b'red blue'
red blue

But you can’t pass a str instance to a bytes format string because Python doesn’t know what binary text encoding to use:

print(b"red %s" % blue_str)

>>>
Traceback ...
TypeError: %b requires a bytes-like object, or an object that implements __bytes__, not 'str'

However, you can pass a bytes instance to a str format string by using the % operator, or you can use a bytes instance in an interpolated format string, but it doesn’t do what you’d expect:

print("red %s" % blue_bytes)
print(f"red {blue_bytes}")

>>>
red b'blue'
red b'blue'

In these cases, the code actually invokes the __repr__ special method (see Item 12: “Understand the Difference Between repr and str when Printing Objects”) on the bytes instance and substitutes that in place of %s or {blue_bytes}, which is why the b"blue" literal appears in the output.
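
If the intent is to interpolate the text that the bytes represent, one option is to decode explicitly before formatting. This sketch assumes the bytes hold UTF-8 data:

```python
blue_bytes = b"blue"

# Decoding first yields the text itself instead of the repr of the bytes
decoded = f"red {blue_bytes.decode('utf-8')}"
assert decoded == "red blue"

# Without decoding, the b'...' repr leaks into the output
undecoded = f"red {blue_bytes}"
assert undecoded == "red b'blue'"
```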

The second gotcha is that operations involving file handles (returned by the open built-in function) default to requiring Unicode strings instead of raw bytes. This can cause surprising failures, especially for programmers accustomed to Python 2. For example, say that I want to write some binary data to a file. This seemingly simple code breaks:

with open("data.bin", "w") as f:
    f.write(b"\xf1\xf2\xf3\xf4\xf5")

>>>
Traceback ...
TypeError: write() argument must be str, not bytes

The cause of the exception is that the file was opened in write text mode ("w") instead of write binary mode ("wb"). When a file is in text mode, write operations expect str instances containing Unicode data instead of bytes instances containing binary data. Here, I fix this by changing the open mode to "wb":

with open("data.bin", "wb") as f:
    f.write(b"\xf1\xf2\xf3\xf4\xf5")

A similar problem exists for reading data from files. For example, here I try to read the binary file that was written above:

with open("data.bin", "r") as f:
    data = f.read()

>>>
Traceback ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 0: invalid continuation byte

This fails because the file was opened in read text mode ("r") instead of read binary mode ("rb"). When a handle is in text mode, it uses the system’s default text encoding to interpret binary data using the bytes.decode (for reading) and str.encode (for writing) methods. On most systems, the default encoding is UTF-8, which can’t accept the binary data b"\xf1\xf2\xf3\xf4\xf5", thus causing the error above. Here, I solve this problem by changing the open mode to "rb":

with open("data.bin", "rb") as f:
    data = f.read()
assert data == b"\xf1\xf2\xf3\xf4\xf5"

Alternatively, I can explicitly specify the encoding parameter to the open function to make sure I’m not surprised by any platform-specific behavior. For example, here I assume that the binary data in the file was actually meant to be a string encoded as "cp1252" (a legacy Windows encoding):

with open("data.bin", "r", encoding="cp1252") as f:
    data = f.read()
assert data == "ñòóôõ"

The exception is gone, and the string interpretation of the file’s contents is very different from what was returned when reading raw bytes. The lesson here is that you should check the default encoding on your system (using python3 -c 'import locale; print(locale.getpreferredencoding())') to understand how it differs from your expectations. When in doubt, you should explicitly pass the encoding parameter to open.
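
The same advice applies to writing. The following sketch passes encoding explicitly on both the write and the read, then inspects the stored bytes in binary mode. The file name greeting.txt is hypothetical, and a temporary directory is used so that nothing is left behind:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "greeting.txt")  # Hypothetical file name

    # Write text with an explicit encoding instead of the platform default
    with open(path, "w", encoding="utf-8") as f:
        f.write("\u00f1\u00f2\u00f3")  # The characters ñ, ò, and ó

    # Binary mode shows the bytes that were actually stored
    with open(path, "rb") as f:
        raw = f.read()

    # Reading with the same explicit encoding recovers the text
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()

assert raw == b"\xc3\xb1\xc3\xb2\xc3\xb3"
assert text == "\u00f1\u00f2\u00f3"
```

Because the encoding is named on both sides, the result does not depend on the locale of the machine that runs the code.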

Things to Remember

  • bytes contains sequences of 8-bit values, and str contains sequences of Unicode code points.

  • Use helper functions to ensure that the inputs you operate on are the type of character sequence you expect (8-bit values, UTF-8-encoded strings, Unicode code points, etc.).

  • bytes and str instances can’t be used together with operators (like >, ==, +, and %).

  • If you want to read or write binary data to/from a file, always open the file using a binary mode (like "rb" or "wb").

  • If you want to read or write Unicode data to/from a file, be careful about your system’s default text encoding. Explicitly pass the encoding parameter to open to avoid surprises.

Item 11: Prefer Interpolated F-Strings over C-Style Format Strings and str.format

Strings are present throughout Python codebases. They’re used for rendering messages in user interfaces and command-line utilities. They’re used for writing data to files and sockets. They’re used for specifying what’s gone wrong in Exception details (see Item 88: “Consider Explicitly Chaining Exceptions to Clarify Tracebacks”). They’re used in logging and debugging (see Item 12: “Understand the Difference Between repr and str when Printing Objects”).

Formatting is the process of combining predefined text with data values into a single human-readable message that’s stored as a string. Python has four different ways of formatting strings that are built into the language and standard library. All but one of them, which is covered last in this item, have serious shortcomings that you should understand and avoid.

C-Style Formatting

The most common way to format a string in Python is by using the % formatting operator. A predefined text template is provided on the left side of the operator in a format string. A value to insert into the template is provided as a single value or as a tuple of multiple values on the right side of the format operator. For example, here I use the % operator to convert difficult-to-read binary and hexadecimal values to integer strings:

a = 0b10111011
b = 0xC5F
print("Binary is %d, hex is %d" % (a, b))

>>>
Binary is 187, hex is 3167

The format string uses format specifiers (like %d) as placeholders that will be replaced by values from the right side of the formatting expression. The syntax for format specifiers comes from C’s printf function, which has been inherited by Python (as well as by other programming languages). Python supports all of the usual options you’d expect from printf, such as %s, %x, and %f format specifiers, as well as control over decimal places, padding, fill, and alignment. Many programmers who are new to Python start with C-style format strings because they’re familiar and simple to use.

There are four problems with C-style format strings in Python.

The first problem is that if you change the type or order of data values in the tuple on the right side of a formatting expression, you can get errors due to type conversion incompatibility. For example, this simple formatting expression works:

key = "my_var"
value = 1.234
formatted = "%-10s = %.2f" % (key, value)
print(formatted)

>>>
my_var     = 1.23

But if you swap key and value, you get an exception at runtime:

reordered_tuple = "%-10s = %.2f" % (value, key)

>>>
Traceback ...
TypeError: must be real number, not str

Leaving the right side parameters in the original order but changing the format string results in the same error:

reordered_string = "%.2f = %-10s" % (key, value)

>>>
Traceback ...
TypeError: must be real number, not str

To avoid this gotcha, you need to constantly check that the two sides of the % operator are in sync; this process is error prone because it must be done manually for every change.

The second problem with C-style formatting expressions is that they become difficult to read when you need to make small modifications to values before formatting them into a string—and this is an extremely common need. Here, I list the contents of my kitchen pantry, without any inline changes to the values:

pantry = [
    ("avocados", 1.25),
    ("bananas", 2.5),
    ("cherries", 15),
]
for i, (item, count) in enumerate(pantry):
    print("#%d: %-10s = %.2f" % (i, item, count))

>>>
#0: avocados   = 1.25
#1: bananas    = 2.50
#2: cherries   = 15.00

Now, I make a few modifications to the values that I’m formatting to make the printed message more useful. This causes the tuple in the formatting expression to become so long that it needs to be split across multiple lines, which hurts readability:

for i, (item, count) in enumerate(pantry):
    print(
        "#%d: %-10s = %d"
        % (
            i + 1,
            item.title(),
            round(count),
        )
    )

>>>
#1: Avocados   = 1
#2: Bananas    = 2
#3: Cherries   = 15

The third problem with formatting expressions is that if you want to use the same value in a format string multiple times, you have to repeat it in the right-side tuple:

template = "%s loves food. See %s cook."
name = "Max"
formatted = template % (name, name)
print(formatted)

>>>
Max loves food. See Max cook.

This is especially annoying and error prone if you have to repeat small modifications to the values being formatted. For example, here I call the title() method on one reference to name but not the other, which causes mismatched output:

name = "brad"
formatted = template % (name.title(), name)
print(formatted)

>>>
Brad loves food. See brad cook.

The % operator in Python helps solve some of these problems because it has the ability to also do formatting with a dictionary instead of a tuple. The keys from the dictionary are matched with format specifiers that have the same name, such as %(key)s. Here, I use this functionality to change the order of values on the right side of the formatting expression with no effect on the output, thus solving problem #1 from above:

key = "my_var"
value = 1.234

old_way = "%-10s = %.2f" % (key, value)

new_way = "%(key)-10s = %(value).2f" % {
    "key": key,  # Key first
    "value": value,
}

reordered = "%(key)-10s = %(value).2f" % {
    "value": value,
    "key": key,  # Key second
}

assert old_way == new_way == reordered

Using dictionaries in formatting expressions also solves problem #3 from above by allowing multiple format specifiers to reference the same value, thus making it unnecessary to supply that value more than once:

name = "Max"

template = "%s loves food. See %s cook."
before = template % (name, name)   # Tuple

template = "%(name)s loves food. See %(name)s cook."
after = template % {"name": name}  # Dictionary

assert before == after

However, dictionary format strings introduce and exacerbate other issues. For problem #2 above, regarding making small modifications to values before formatting them, formatting expressions become longer and more visually noisy because of the presence of the dictionary key and colon operator on the right side. Here, I render the same string with and without dictionaries to show this problem:

for i, (item, count) in enumerate(pantry):
    before = "#%d: %-10s = %d" % (
        i + 1,
        item.title(),
        round(count),
    )

    after = "#%(loop)d: %(item)-10s = %(count)d" % {
        "loop": i + 1,
        "item": item.title(),
        "count": round(count),
    }

    assert before == after

Using dictionaries in formatting expressions also increases verbosity, which is problem #4 with C-style formatting expressions in Python. Each key must be specified at least twice—once in the format specifier, once in the dictionary as a key, and potentially once more for the variable name that contains the dictionary value:

soup = "lentil"
formatted = "Today's soup is %(soup)s." % {"soup": soup}
print(formatted)

>>>
Today's soup is lentil.

Besides involving duplicative characters, this redundancy causes formatting expressions that use dictionaries to be long. These expressions often must span multiple lines, with format strings concatenated across multiple lines and dictionary assignments with one line per value to use in formatting:

menu = {
    "soup": "lentil",
    "oyster": "kumamoto",
    "special": "schnitzel",
}
template = (
    "Today's soup is %(soup)s, "
    "buy one get two %(oyster)s oysters, "
    "and our special entrée is %(special)s."
)
formatted = template % menu
print(formatted)

>>>
Today's soup is lentil, buy one get two kumamoto oysters, and our special entrée is schnitzel.

To understand what this formatting expression is going to produce, your eyes have to keep going back and forth between the lines of the format string and the lines of the dictionary. This disconnect makes it hard to spot bugs, and readability gets even worse if you need to make small modifications to any of the values before formatting.

There must be a better way.

The format Built-in Function and str.format

Python 3 added support for advanced string formatting that is more expressive than the old C-style format strings that use the % operator. For individual Python values, this new functionality can be accessed through the format built-in function. For example, here I use some of the new options (, for thousands separators and ^ for centering) to format values:

a = 1234.5678
formatted = format(a, ",.2f")
print(formatted)

b = "my string"
formatted = format(b, "^20s")
print("*", formatted, "*")


>>>
1,234.57
*      my string       *

You can use this functionality to format multiple values together by calling the new format method of the str type. Instead of using C-style format specifiers like %d, you can specify placeholders with {}. By default the placeholders in the format string are replaced by the corresponding positional arguments passed to the format method in the order in which they appear:

key = "my_var"
value = 1.234

formatted = "{} = {}".format(key, value)
print(formatted)

>>>
my_var = 1.234

Within each placeholder you can optionally provide a colon character followed by format specifiers to customize how values will be converted into strings (see https://docs.python.org/3/library/string.html#format-specification-mini-language for the full range of options):

formatted = "{:<10} = {:.2f}".format(key, value)
print(formatted)

>>>
my_var     = 1.23

The way to think about how this works is that the format specifiers will be passed to the format built-in function along with the value (format(value, ".2f") in the example above). The result of that function call is what replaces the placeholder in the overall formatted string. You can customize the formatting behavior per class by using the __format__ special method.
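
As a rough illustration of per-class customization, the hypothetical Temperature class below implements __format__ by delegating the numeric part to the format built-in function and appending a unit suffix. The class name, the suffix, and the ".1f" fallback are all assumptions made for this sketch:

```python
class Temperature:
    """Hypothetical class used only to illustrate __format__."""

    def __init__(self, celsius):
        self.celsius = celsius

    def __format__(self, format_spec):
        # Delegate number formatting to the built-in, defaulting to ".1f"
        # when no format specifier is given, then append a unit suffix
        number = format(self.celsius, format_spec or ".1f")
        return f"{number} C"


temp = Temperature(21.456)
assert format(temp, ".2f") == "21.46 C"  # format built-in
assert "{:.0f}".format(temp) == "21 C"   # str.format placeholder
assert f"{temp}" == "21.5 C"             # Empty specifier uses the fallback
```

The format built-in function, the str.format method, and f-strings all route through the same special method, so one implementation covers each of them.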

Another detail to be careful about with str.format is escaping braces ({). You need to double them ({{) so they’re not accidentally interpreted as placeholders (much as you need to double the % character to escape it properly with C-style format strings):

print("%.2f%%" % 12.5)
print("{} replaces {{}}".format(1.23))

>>>
12.50%
1.23 replaces {}

Within the braces you may also specify the positional index of an argument passed to the format method to use for replacing the placeholder. This allows the format string to be updated to reorder the output without requiring you to also change the right side of the formatting expression, thus addressing problem #1 from above:

formatted = "{1} = {0}".format(key, value)
print(formatted)

>>>
1.234 = my_var

The same positional index may also be referenced multiple times in the format string without the need to pass the value to the format method more than once, which solves problem #3 from above:

formatted = "{0} loves food. See {0} cook.".format(name)
print(formatted)

>>>
Max loves food. See Max cook.

Unfortunately, the new format method does nothing to address problem #2 from above, leaving your code difficult to read when you need to make small modifications to values before formatting them. There’s little difference in readability between the old and new options, which are similarly noisy:

for i, (item, count) in enumerate(pantry):
    old_style = "#%d: %-10s = %d" % (
        i + 1,
        item.title(),
        round(count),
    )

    new_style = "#{}: {:<10s} = {}".format(
        i + 1,
        item.title(),
        round(count),
    )

    assert old_style == new_style

There are even more advanced specifier options for the str.format method, such as using combinations of dictionary keys and list indexes in placeholders and coercing values to Unicode and repr strings:

formatted = "First letter is {menu[oyster][0]!r}".format(menu=menu)
print(formatted)

>>>
First letter is 'k'

But these features don’t help reduce the redundancy of repeated keys from problem #4 above. For example, here I compare the verbosity of using dictionaries in C-style formatting expressions to the new style of passing keyword arguments to the format method:

old_template = (
    "Today's soup is %(soup)s, "
    "buy one get two %(oyster)s oysters, "
    "and our special entrée is %(special)s."
)
old_formatted = old_template % {
    "soup": "lentil",
    "oyster": "kumamoto",
    "special": "schnitzel",
}

new_template = (
    "Today's soup is {soup}, "
    "buy one get two {oyster} oysters, "
    "and our special entrée is {special}."
)
new_formatted = new_template.format(
    soup="lentil",
    oyster="kumamoto",
    special="schnitzel",
)

assert old_formatted == new_formatted

This style is slightly less noisy because it eliminates some quotes in the dictionary and a few characters in the format specifiers, but it’s hardly compelling. Further, the advanced features of using dictionary keys and indexes within placeholders are only a tiny subset of Python’s expression functionality. This lack of expressiveness is so limiting that it undermines the value of the str.format method overall.

Given these shortcomings and the problems from C-style formatting expressions that remain (problems #2 and #4 from above), I suggest that you avoid using the str.format method in general. It’s important to know about the new mini-language used in format specifiers (everything after the colon) and how to use the format built-in function. But the rest of the str.format method should be treated as a historical artifact to help you understand how Python’s new f-strings work and why they’re so great.
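
For reference, here is a small, non-exhaustive sample of the format specifier mini-language used through the format built-in function. The values are arbitrary and chosen only for illustration:

```python
value = 1234.5678

# Thousands separator with two decimal places
assert format(value, ",.2f") == "1,234.57"

# Right-align within 12 columns, one decimal place
assert format(value, ">12.1f") == "      1234.6"

# Scientific notation with the default precision
assert format(value, "e") == "1.234568e+03"

# Hexadecimal with the 0x prefix
assert format(255, "#x") == "0xff"

# Binary, zero-padded to 8 digits
assert format(5, "08b") == "00000101"

# Percentage with no decimal places
assert format(0.25, ".0%") == "25%"
```

The same specifier strings work unchanged after the colon in str.format and f-string placeholders.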

Interpolated Format Strings

Python 3.6 added interpolated format strings—f-strings for short—to solve these issues once and for all. This new language syntax requires you to prefix a format string with an f character, which is similar to how a byte string is prefixed with a b character and a raw (unescaped) string is prefixed with an r character.

F-strings take the expressiveness of format strings to the extreme, solving problem #4 from above by completely eliminating the redundancy of providing keys and values to be formatted. They achieve this pithiness by allowing you to reference all names in the current Python scope as part of a formatting expression:

key = "my_var"
value = 1.234

formatted = f"{key} = {value}"
print(formatted)

>>>
my_var = 1.234

All of the same options from the new format built-in mini-language are available after the colon in the placeholders within an f-string, as is the ability to coerce values to Unicode and repr strings, similar to the str.format method (i.e., with !r and !s):

formatted = f"{key!r:<10} = {value:.2f}"
print(formatted)

>>>
'my_var'   = 1.23

Formatting with f-strings is shorter than using C-style format strings with the % operator and the str.format method in all cases. Here, I show every option together, in order from shortest to longest, and line up the left sides of the assignments so you can easily compare them:

f_string = f"{key:<10} = {value:.2f}"

c_tuple  = "%-10s = %.2f" % (key, value)

str_args = "{:<10} = {:.2f}".format(key, value)

str_kw   = "{key:<10} = {value:.2f}".format(key=key, value=value)

c_dict   = "%(key)-10s = %(value).2f" % {"key": key, "value": value}

assert c_tuple == c_dict == f_string
assert str_args == str_kw == f_string

F-strings also enable you to put a full Python expression within the placeholder braces, solving problem #2 from above by allowing small modifications to the values being formatted with concise syntax. What took multiple lines with C-style formatting and the str.format method now easily fits on a single line:

for i, (item, count) in enumerate(pantry):
    old_style = "#%d: %-10s = %d" % (
        i + 1,
        item.title(),
        round(count),
    )

    new_style = "#{}: {:<10s} = {}".format(
        i + 1,
        item.title(),
        round(count),
    )

    f_string = f"#{i+1}: {item.title():<10s} = {round(count)}"

    assert old_style == new_style == f_string

Or, if it’s clearer, you can split an f-string over multiple lines by relying on adjacent-string concatenation (see Item 13: “Prefer Explicit String Concatenation over Implicit, Especially in Lists”). Even though this is longer than the single-line version, it’s still much clearer than any of the other multiline approaches:

for i, (item, count) in enumerate(pantry):
    print(f"#{i+1}: "
          f"{item.title():<10s} = "
          f"{round(count)}")

>>>
#1: Avocados   = 1
#2: Bananas    = 2
#3: Cherries   = 15

Python expressions may also appear within the format specifier options. For example, here I parameterize the number of digits to print by using a variable instead of hard-coding it in the format string:

places = 3
number = 1.23456
print(f"My number is {number:.{places}f}")

>>>
My number is 1.235

The combination of expressiveness, terseness, and clarity provided by f-strings makes them the best built-in option for Python programmers. Any time you find yourself needing to format values into strings, choose f-strings over the alternatives.

Things to Remember

  • C-style format strings that use the % operator suffer from a variety of gotchas and verbosity problems.

  • The str.format method introduces some useful concepts in its formatting specifiers mini-language, but it otherwise repeats the mistakes of C-style format strings and should be avoided.

  • F-strings are a new syntax for formatting values into strings that solves the biggest problems with C-style format strings.

  • F-strings are succinct yet powerful because they allow for arbitrary Python expressions to be directly embedded within format specifiers.

Item 12: Understand the Difference Between repr and str when Printing Objects

As you debug a Python program, using the print function and format strings (see Item 11: “Prefer Interpolated F-Strings over C-Style Format Strings and str.format”) or outputting via the logging built-in module will get you surprisingly far. Python object internals are often easy to access via plain attributes (see Item 55: “Prefer Public Attributes over Private Ones”). All you need to do is call print to see how the state of your program changes while it runs and deduce where it goes wrong (see Item 114: “Consider Interactive Debugging with pdb” for a more advanced approach).

The print function outputs a human-readable string version of whatever you supply it. For example, I can use print with a basic string to see the contents of the string without the surrounding quote characters:

print("foo bar")

>>>
foo bar

This is equivalent to all of these alternatives:

  • Calling the str function before passing the value to print

  • Using the "%s" format string with the % operator

  • Using the default formatting of the value with an f-string

  • Calling the format built-in function

  • Explicitly calling the __format__ special method

  • Explicitly calling the __str__ special method

Here, I show that they all produce the same output:

my_value = "foo bar"
print(str(my_value))
print("%s" % my_value)
print(f"{my_value}")
print(format(my_value))
print(my_value.__format__("s"))
print(my_value.__str__())

>>>
foo bar
foo bar
foo bar
foo bar
foo bar
foo bar

The problem is that the human-readable string for a value doesn’t make it clear what the actual type and the specific composition of the value are. For example, notice how in the default output of print you can’t distinguish between the types of the number 5 and the string "5":

int_value = 5
str_value = "5"
print(int_value)
print(str_value)
print(f"Is {int_value} == {str_value}?")

>>>
5
5
Is 5 == 5?

If you’re debugging a program with print, these type differences matter. What you almost always want while debugging is to see the repr version of an object. The repr built-in function returns the printable representation of an object, which should be its most clearly understandable string serialization. For many built-in types, the string returned by repr is a valid Python expression:

a = "\x07"
print(repr(a))

>>>
'\x07'

Passing the value returned by repr to the eval built-in function often results in the same Python object that you started with:

b = eval(repr(a))
assert a == b

Of course, in practice you should only use eval with extreme caution (see Item 91: “Avoid exec and eval Unless You’re Building a Developer Tool”).

When you’re debugging with print, you should call repr on a value before printing to ensure that any difference in types is clear:

print(repr(int_value))
print(repr(str_value))

>>>
5
'5'

This is equivalent to using the "%r" format string with the % operator or an f-string with the !r type conversion:

print("Is %r == %r?" % (int_value, str_value))
print(f"Is {int_value!r} == {str_value!r}?")

>>>
Is 5 == '5'?
Is 5 == '5'?

When the str built-in function is given an instance of a user-defined class, it first tries to call the __str__ special method. If that’s not defined, it falls back to call the __repr__ special method instead. If __repr__ also wasn’t implemented by the class, then the call goes through method resolution (see Item 53: “Initialize Parent Classes with super”), eventually calling the default implementation from the object parent class. Unfortunately, the default implementation of repr for object subclasses isn’t especially helpful. For example, here I define a simple class and then print one of its instances, which ultimately leads to a call to object.__repr__:

class OpaqueClass:
    def __init__(self, x, y):
        self.x = x
        self.y = y

obj = OpaqueClass(1, "foo")
print(obj)

>>>
<__main__.OpaqueClass object at 0x1009be510>

This output can’t be passed to the eval function, and it says nothing about the instance fields of the object. To improve this, here I define my own __repr__ special method that returns a string containing the Python expression that re-creates the object (see Item 51: “Prefer dataclasses for Defining Lightweight Classes” for another approach to defining __repr__):

class BetterClass:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __repr__(self):
        return f"BetterClass({self.x!r}, {self.y!r})"

Now the repr value is much more useful:

obj = BetterClass(2, "bar")
print(obj)

>>>
BetterClass(2, 'bar')

Calling str on an instance of this class produces the same result because the __str__ special method isn’t defined, causing Python to fall back to __repr__:

print(str(obj))

>>>
BetterClass(2, 'bar')

To have str print out a different human-readable format of the string—for example, to display in a UI element—I can define the corresponding __str__ special method:

class StringifiableBetterClass(BetterClass):
    def __str__(self):
        return f"({self.x}, {self.y})"

Now repr and str return different strings, one for each purpose: str gives the human-readable version, and repr gives the printable version:

obj2 = StringifiableBetterClass(2, "bar")
print("Human readable:", obj2)
print("Printable:     ", repr(obj2))

>>>
Human readable: (2, bar)
Printable:      BetterClass(2, 'bar')
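
One related detail is that built-in containers call repr on their items even when the container itself is converted with str. The following sketch uses a hypothetical Point class (not one of the classes above) that defines both special methods to show where each one is used:

```python
class Point:
    """Hypothetical class defining both special methods."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __repr__(self):
        return f"Point({self.x!r}, {self.y!r})"

    def __str__(self):
        return f"({self.x}, {self.y})"

p = Point(2, "bar")
print(p)                 # (2, bar)
print([p])               # [Point(2, 'bar')]  (lists use repr for items)
print(f"{p} vs {p!r}")   # (2, bar) vs Point(2, 'bar')
```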

Things to Remember

  • Calling print on built-in Python types produces the human-readable string version of a value, which hides type information.

  • Calling repr on built-in Python types produces a string that contains the printable representation of a value. repr strings can often be passed to the eval built-in function to get back the original value.

  • %s in format strings produces human-readable strings like str. %r produces printable strings like repr. F-strings produce human-readable strings for replacement text expressions unless you specify the !r conversion suffix.

  • You can define the __repr__ and __str__ special methods on your classes to customize the printable and human-readable representations of instances, which can help with debugging and can simplify integrating objects into human interfaces.

Item 13: Prefer Explicit String Concatenation over Implicit, Especially in Lists

Earlier in its history, Python inherited many attributes directly from C, including notation for numeric literals and printf-like format strings. The language has evolved considerably since then; for example, octal numbers now require a 0o prefix instead of only 0, and the new string interpolation syntax is far superior (see Item 11: “Prefer Interpolated F-Strings over C-Style Format Strings and str.format”). However, one C-like feature that remains in Python is implicit string concatenation. This causes string literals that are adjacent expressions to be concatenated without the need for an infix + operator. Therefore, these two assignments actually do the same thing:

my_test1 = "hello" "world"
my_test2 = "hello" + "world"
assert my_test1 == my_test2
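
One caveat, shown here as a small sketch of my own: adjacent literals are merged into a single string before any operator or method call is applied. That means the implicit and explicit forms can produce different results once other operations appear in the same expression:

```python
implicit = "hello" "world".upper()    # Literals merge first, then upper()
explicit = "hello" + "world".upper()  # upper() applies only to "world"
print(implicit)  # HELLOWORLD
print(explicit)  # helloWORLD

repeated_implicit = "ab" "cd" * 2     # Same as ("ab" + "cd") * 2
repeated_explicit = "ab" + "cd" * 2   # Same as "ab" + ("cd" * 2)
print(repeated_implicit)  # abcdabcd
print(repeated_explicit)  # abcdcd
```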

This implicit concatenation behavior can be useful when you need to combine different types of string literals with varying escaping requirements, which is a common need in programs that do text templating or code generation. For example, here I implicitly merge a raw string, an f-string, and a single-quoted string:

x = 1
my_test1 = (
    r"first \ part is here with escapes\n, "
    f"string interpolation {x} in here, "
    'this has "double quotes" inside'
)
print(my_test1)

>>>
first \ part is here with escapes\n, string interpolation 1 in here, this has "double quotes" inside

Having each type of string literal on its own line makes this code easier to read, and the absence of operators reduces visual noise. In contrast, when implicit concatenation happens on a single line, it can be difficult to anticipate what the code is going to do without having to pay extra attention:

y = 2
my_test2 = r"fir\st" f"{y}" '"third"'
print(my_test2)

>>>
fir\st2"third"

Implicit concatenation like this is also error prone. If you accidentally slip in a comma character between adjacent strings, the meaning of the code will be completely different (see a similar issue in Item 6: “Always Surround Single-Element Tuples with Parentheses”):

my_test3 = r"fir\st", f"{y}" '"third"'
print(my_test3)

>>>
('fir\\st', '2"third"')

Another problem can occur if you do the opposite and accidentally delete a comma instead of adding one. For example, imagine that I want to create a list of strings to output, with one element for each line:

my_test4 = [
    "first line\n",
    "second line\n",
    "third line\n",
]
print(my_test4)

>>>
['first line\n', 'second line\n', 'third line\n']

If I delete the middle comma, the resulting data will have a similar structure, but the last two lines will be silently merged together:

my_test5 = [
    "first line\n",
    "second line\n"  # Comma removed
    "third line\n",
]
print(my_test5)

>>>
['first line\n', 'second line\nthird line\n']

As a new reader of this code, you might not even see the missing comma at first glance. If you use an auto-formatter (see Item 2: “Follow the PEP 8 Style Guide”), it might rewrap the two lines to make this implicit behavior more discoverable, like this:

my_test5 = [
    "first line\n",
    "second line\n" "third line\n",
]

But even if you do notice that implicit concatenation is happening, it’s unclear whether it’s deliberate or accidental. Thus, my advice is to always use an explicit + operator to combine strings inside a list or tuple literal to eliminate any ambiguity caused by implicit concatenation:

my_test6 = [
    "first line\n",
    "second line\n" +  # Explicit
    "third line\n",
]
assert my_test5 == my_test6

When the + operator is present, an auto-formatter might still change the line wrapping, but in this state, it’s at least clear what the author of the code originally intended:

my_test6 = [
    "first line\n",
    "second line\n" + "third line\n",
]
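
If you want to look for accidental implicit concatenation in existing code, one possible approach is to scan the token stream for two string literals in a row. The following is a rough sketch under stated assumptions, not an established tool: find_implicit_concatenations is a hypothetical helper, and it only recognizes plain string literals (on Python 3.12 and later, f-strings are tokenized differently and are not detected here):

```python
import io
import tokenize

def find_implicit_concatenations(source):
    """Return the line numbers where a string literal directly follows another."""
    ignorable = {tokenize.NL, tokenize.COMMENT}
    lines = []
    previous_type = None
    for token in tokenize.generate_tokens(io.StringIO(source).readline):
        if token.type in ignorable:
            continue  # Line breaks inside brackets and comments don't separate literals
        if token.type == tokenize.STRING and previous_type == tokenize.STRING:
            lines.append(token.start[0])
        previous_type = token.type
    return lines

source = r'''
my_test5 = [
    "first line\n",
    "second line\n"  # Comma removed
    "third line\n",
]
'''
print(find_implicit_concatenations(source))  # [5]
```

A source string that uses an explicit + between the two literals produces an empty list, because the operator token sits between the two string tokens.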

Another place that implicit string concatenation might cause issues is in function call argument lists. Sometimes using implicit concatenation within a call looks fine, such as with the print function:

print("this is my long message "
      "that should be printed out")

>>>
this is my long message that should be printed out

Implicit concatenation can even be readable when you provide additional keyword arguments after a single positional argument:

import sys

print("this is my long message "
      "that should be printed out",
      end="",
      file=sys.stderr)

However, when a call takes multiple positional arguments, implicit string concatenation can be confusing and error prone, just as it is with list and tuple literals. For example, here I create an instance of a class with implicit concatenation in the middle of the initialization argument list—how quickly can you spot it?

import sys

first_value = ...
second_value = ...

class MyData:
    ...

value = MyData(123,
               first_value,
               f"my format string {x}"
               f"another value {y}",
               "and here is more text",
               second_value,
               stream=sys.stderr)

Changing the string concatenation to be explicit makes this code much easier to scan:

value2 = MyData(123,
                first_value,
                f"my format string {x}" +  # Explicit
                f"another value {y}",
                "and here is more text",
                second_value,
                stream=sys.stderr)

My advice is to always use explicit string concatenation when a function call takes multiple positional arguments in order to avoid any confusion (see Item 37: “Enforce Clarity with Keyword-Only and Positional-Only Arguments” for a similar example). If there’s only a single positional argument, as with the print example above, then using implicit string concatenation is fine. Keyword arguments can be passed using either explicit or implicit concatenation—whichever maximizes clarity—because sibling string literals can’t be misinterpreted as positional arguments after the = character.

Things to Remember

  • When two string literals are next to each other in Python code, they will be merged as if the + operator were present between them, in a similar fashion to the implicit string concatenation feature of the C programming language.

  • Avoid implicit string concatenation of items in list and tuple literals because it creates ambiguity about the original author’s intent. Instead, you should use explicit concatenation with the + operator.

  • In function calls, it is fine to use implicit string concatenation with one positional argument and any number of keyword arguments, but you should use explicit concatenation when there are multiple positional arguments.

Item 14: Know How to Slice Sequences

Python includes syntax for slicing sequences into pieces. Slicing allows you to access a subset of a sequence’s items with minimal effort. The simplest uses for slicing are the built-in types list, tuple, str, and bytes. Slicing can be extended to any Python class that implements the __getitem__ and __setitem__ special methods (see Item 57: “Inherit from collections.abc Classes for Custom Container Types”).
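
As a brief sketch of how a custom class can take part in slicing, here I define a hypothetical read-only Squares class. It implements only __getitem__ (reading), not __setitem__, and the names are my own. When slice syntax is used, Python passes a slice object to __getitem__, and its indices method converts it to concrete start, stop, and step values for a given length:

```python
class Squares:
    """Hypothetical read-only sequence of the first `count` square numbers."""

    def __init__(self, count):
        self.count = count

    def __len__(self):
        return self.count

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Resolve omitted, negative, and out-of-range bounds
            start, stop, step = index.indices(self.count)
            return [i * i for i in range(start, stop, step)]
        if index < 0:
            index += self.count
        if not 0 <= index < self.count:
            raise IndexError("index out of range")
        return index * index

squares = Squares(10)
print(squares[3])     # 9
print(squares[2:5])   # [4, 9, 16]
print(squares[-3:])   # [49, 64, 81]
```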

The basic form of the slicing syntax is somelist[start:end], where start is inclusive and end is exclusive:

a = ["a", "b", "c", "d", "e", "f", "g", "h"]
print("Middle two:  ", a[3:5])
print("All but ends:", a[1:7])

>>>
Middle two:   ['d', 'e']
All but ends: ['b', 'c', 'd', 'e', 'f', 'g']

When slicing from the start of a sequence, you should leave out the zero index to reduce visual noise:

assert a[:5] == a[0:5]

When slicing to the end of a sequence, you should leave out the final index because it’s redundant:

assert a[5:] == a[5:len(a)]

Using negative numbers for slicing is helpful for doing offsets relative to the end of a sequence. All of these forms of slicing would be clear to a new reader of your code:

a[:]      # ["a", "b", "c", "d", "e", "f", "g", "h"]
a[:5]     # ["a", "b", "c", "d", "e"]
a[:-1]    # ["a", "b", "c", "d", "e", "f", "g"]
a[4:]     #                     ["e", "f", "g", "h"]
a[-3:]    #                          ["f", "g", "h"]
a[2:5]    #           ["c", "d", "e"]
a[2:-1]   #           ["c", "d", "e", "f", "g"]
a[-3:-1]  #                          ["f", "g"]

There are no surprises here, and I encourage you to use these variations.

Slicing deals properly with start and end indexes that are beyond the boundaries of a list by silently omitting missing items. This behavior makes it easy for your code to establish a maximum length to consider for an input sequence:

first_twenty_items = a[:20]
last_twenty_items = a[-20:]

In contrast, directly accessing the same missing index causes an exception:

a[20]

>>>
Traceback ...
IndexError: list index out of range
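
To see how the out-of-range bounds described above are handled, this small sketch uses the indices method of the built-in slice type, which reports the start, stop, and step that a slice resolves to for a sequence of a given length:

```python
a = ["a", "b", "c", "d", "e", "f", "g", "h"]

# Both bounds are clamped to the actual length of the list
print(slice(None, 20).indices(len(a)))   # (0, 8, 1)
print(slice(-20, None).indices(len(a)))  # (0, 8, 1)

assert a[:20] == a
assert a[-20:] == a
```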

Note

Beware that indexing a list by a negated variable is one of the few situations in which you can get surprising results from slicing. For example, the expression somelist[-n:] will work fine when n is greater than zero (e.g., somelist[-3:] when n is 3). However, when n is zero, the expression somelist[-0:] is equivalent to somelist[:], which results in a copy of the original list.
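
One way to sidestep this pitfall, sketched here with a hypothetical helper named last_items, is to compute a non-negative start index instead of negating the variable. This sketch assumes that n is never negative:

```python
def last_items(items, n):
    """Return the final n items of a list, or an empty list when n is 0."""
    start = max(len(items) - n, 0)  # Never negative, so no wraparound
    return items[start:]

a = ["a", "b", "c", "d", "e", "f", "g", "h"]
n = 0
print(a[-n:])             # The surprise: a copy of the whole list
print(last_items(a, 0))   # []
print(last_items(a, 3))   # ['f', 'g', 'h']
print(last_items(a, 20))  # All eight items
```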

The result of slicing a list is a whole new list. Each of the items in the new list will refer to the corresponding objects from the original list. Modifying the list created by slicing won’t affect the contents of the original list:

b = a[3:]
print("Before:   ", b)
b[1] = 99
print("After:    ", b)
print("No change:", a)

>>>
Before:    ['d', 'e', 'f', 'g', 'h']
After:     ['d', 99, 'f', 'g', 'h']
No change: ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
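
Because the new list refers to the same objects as the original, the copy is shallow. The following sketch, which uses a separate list of lists rather than the list a above, shows the difference between replacing an item in the slice and mutating an item that both lists share:

```python
rows = [[1, 2], [3, 4], [5, 6]]
part = rows[1:]

part[0].append(99)      # Mutates the inner list shared with rows
part[1] = ["replaced"]  # Rebinds a position in part only

print(rows)  # [[1, 2], [3, 4, 99], [5, 6]]
print(part)  # [[3, 4, 99], ['replaced']]
```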

When used in assignments, slices replace the specified range in the original list. Unlike unpacking assignments (e.g., a, b = c[:2]; see Item 5: “Prefer Multiple-Assignment Unpacking over Indexing” and Item 16: “Prefer Catch-All Unpacking over Slicing”), the lengths of slice assignments don’t need to be the same. All of the values before and after the assigned slice will be preserved, with the new values stitched in between. Here, the list shrinks because the replacement list is shorter than the specified slice:

print("Before ", a)
a[2:7] = [99, 22, 14]
print("After  ", a)

>>>
Before  ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
After   ['a', 'b', 99, 22, 14, 'h']

And here the list grows because the assigned list is longer than the specified slice:

print("Before ", a)
a[2:3] = [47, 11]
print("After  ", a)

>>>
Before  ['a', 'b', 99, 22, 14, 'h']
After   ['a', 'b', 47, 11, 22, 14, 'h']
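
Two edge cases follow from the same rule. In this sketch, which uses a separate list named letters so that a is left untouched, assigning to a zero-width slice inserts items without replacing anything, and assigning an empty list deletes the specified range:

```python
letters = ["a", "b", "c", "d"]

letters[1:1] = ["x", "y"]  # Zero-width slice: pure insertion
print(letters)             # ['a', 'x', 'y', 'b', 'c', 'd']

letters[3:5] = []          # Empty replacement: pure deletion
print(letters)             # ['a', 'x', 'y', 'd']
```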

If you leave out both the start and the end indexes when slicing, you end up with a copy of the whole original list:

b = a[:]
assert b == a and b is not a

If you assign to a slice with no start or end indexes, you replace the entire contents of the list with references to the items from the sequence on the right side (instead of allocating a new list):

b = a
print("Before a", a)
print("Before b", b)
a[:] = [101, 102, 103]
assert a is b             # Still the same list object
print("After a ", a)      # Now has different contents
print("After b ", b)      # Same list, so same contents as a

>>>
Before a ['a', 'b', 47, 11, 22, 14, 'h']
Before b ['a', 'b', 47, 11, 22, 14, 'h']
After a  [101, 102, 103]
After b  [101, 102, 103]

Things to Remember

  • Avoid being verbose when slicing: Don’t supply 0 for the start index or the length of the sequence for the end index.

  • Slicing is forgiving of start or end indexes that are out of bounds, which means it’s easy to express slices on the front or back boundaries of a sequence (e.g., a[:20] or a[-20:]).

  • Assigning to a list slice replaces that range in the original sequence with what’s referenced even when the lengths are different.

Item 15: Avoid Striding and Slicing in a Single Expression

In addition to basic slicing (see Item 14: “Know How to Slice Sequences”), Python has special syntax for the stride of a slice in the form somelist[start:end:stride]. This lets you take every nth item when slicing a sequence. For example, the stride makes it easy to group by even and odd ordinal positions in a list:

x = ["red", "orange", "yellow", "green", "blue", "purple"]
odds = x[::2]    # First, third, fifth
evens = x[1::2]  # Second, fourth, sixth
print(odds)
print(evens)

>>>
['red', 'yellow', 'blue']
['orange', 'green', 'purple']

The problem is that the stride syntax often causes unexpected behavior that can introduce bugs. For example, a common Python trick for reversing a byte string is to slice the string with a stride of -1:

x = b"mongoose"
y = x[::-1]
print(y)

>>>
b'esoognom'

This also works correctly for Unicode strings (see Item 10: “Know the Differences Between bytes and str”):

x = "寿司"
y = x[::-1]
print(y)

>>>
司寿

But it will break when Unicode data is encoded as a UTF-8 byte string:

w = "寿司"
x = w.encode("utf-8")
y = x[::-1]
z = y.decode("utf-8")

>>>
Traceback ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb8 in position 0: invalid start byte
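
The failure is easier to understand when you look at the individual bytes. In this sketch, each of the two characters encodes to three bytes, so reversing the byte string also reverses the order of the bytes within each character. Reversing the str first and then encoding avoids the problem:

```python
w = "寿司"
x = w.encode("utf-8")
print(len(w), len(x))    # 2 6
print(x.hex(" "))        # e5 af bf e5 8f b8
print(x[::-1].hex(" "))  # b8 8f e5 bf af e5  (starts with a continuation byte)

# Reverse the code points first, then encode
z = w[::-1].encode("utf-8")
print(z.decode("utf-8"))  # 司寿
```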

Are negative strides besides -1 useful? Consider the following examples:

x = ["a", "b", "c", "d", "e", "f", "g", "h"]
x[::2]   # ["a", "c", "e", "g"]
x[::-2]  # ["h", "f", "d", "b"]

Here, ::2 means “Select every second item starting at the beginning.” Trickier, ::-2 means “Select every second item starting at the end and moving backward.”

What do you think 2::2 means? What about -2::-2 vs. -2:2:-2 vs. 2:2:-2?

x[2::2]     # ["c", "e", "g"]
x[-2::-2]   # ["g", "e", "c", "a"]
x[-2:2:-2]  # ["g", "e"]
x[2:2:-2]   # []

>>>
['c', 'e', 'g']
['g', 'e', 'c', 'a']
['g', 'e']
[]

The point is that the stride part of the slicing syntax can be extremely confusing. Having three numbers within the brackets is hard enough to read because of its density. Then, it’s not obvious when the start and end indexes come into effect relative to the stride value, especially when the stride is negative.
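
One way to check your reading of a stride expression, sketched here, is to ask the built-in slice type which concrete start, stop, and step it resolves to for a list of this length. The loop also confirms that walking range with those values selects the same items as the slice itself:

```python
x = ["a", "b", "c", "d", "e", "f", "g", "h"]

examples = [
    slice(None, None, -2),  # x[::-2]    resolves to (7, -1, -2)
    slice(-2, None, -2),    # x[-2::-2]  resolves to (6, -1, -2)
    slice(-2, 2, -2),       # x[-2:2:-2] resolves to (6, 2, -2)
    slice(2, 2, -2),        # x[2:2:-2]  resolves to (2, 2, -2)
]
for s in examples:
    start, stop, step = s.indices(len(x))
    picked = [x[i] for i in range(start, stop, step)]
    assert picked == x[s]
    print((start, stop, step), picked)
```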

To prevent problems, I suggest that you avoid using a stride along with start and end indexes. If you must use a stride, prefer making it a positive value and omit start and end indexes. If you must use a stride with start or end indexes, consider using one assignment for striding and another for slicing:

y = x[::2]   # ["a", "c", "e", "g"]
z = y[1:-1]  # ["c", "e"]

Striding and then slicing creates an extra shallow copy of the data. The first operation should try to reduce the size of the resulting slice by as much as possible. If your program can’t afford the time or memory required for two steps, consider using the itertools built-in module’s islice function (see Item 24: “Consider itertools for Working with Iterators and Generators”), which is clearer to read and doesn’t permit negative values for the start, end, or stride.
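
As a minimal sketch of the islice alternative, here I select the same two items as the two-step example above. Because negative indexes aren't allowed, the end index has to be written out as a non-negative position (6 in this case, an equivalent I worked out for this particular list):

```python
from itertools import islice

x = ["a", "b", "c", "d", "e", "f", "g", "h"]

z = list(islice(x, 2, 6, 2))  # Items at indexes 2 and 4
print(z)                      # ['c', 'e']
assert z == x[::2][1:-1]

try:
    islice(x, 0, None, -1)    # Negative stride is rejected
except ValueError as e:
    print("Rejected:", e)
```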

Things to Remember

  • Specifying start, end, and stride together in a single slice can be extremely confusing.

  • If striding is necessary, try to use only positive stride values without start or end indexes; avoid negative stride values.

  • If you need start, end, and stride in a single slice, consider doing two assignments (one to stride and another to slice) or using islice from the itertools built-in module.

Item 16: Prefer Catch-All Unpacking over Slicing

One limitation of basic unpacking (see Item 5: “Prefer Multiple-Assignment Unpacking over Indexing”) is that you must know the length of the sequences you’re unpacking in advance. For example, here I have a list of the ages of cars that are being traded in at a car dealership. When I try to take the first two items of the list with basic unpacking, an exception is raised at runtime:

car_ages = [0, 9, 4, 8, 7, 20, 19, 1, 6, 15]
car_ages_descending = sorted(car_ages, reverse=True)
oldest, second_oldest = car_ages_descending

>>>
Traceback ...
ValueError: too many values to unpack (expected 2)

Newcomers to Python often rely on indexing and slicing (see Item 14: “Know How to Slice Sequences”) for this type of situation. For example, here I extract the oldest, second oldest, and other car ages from a list of at least two items:

oldest = car_ages_descending[0]
second_oldest = car_ages_descending[1]
others = car_ages_descending[2:]
print(oldest, second_oldest, others)

>>>
20 19 [15, 9, 8, 7, 6, 4, 1, 0]

This works, but all of the indexing and slicing is visually noisy. In practice, it’s also error prone to divide the members of a sequence into various subsets this way because you’re much more likely to make off-by-one errors; for example, you might change boundaries on one line and forget to update the others.

To better handle this situation, Python also supports catch-all unpacking through a starred expression. This syntax allows one part of the unpacking assignment to receive all values that didn’t match any other part of the unpacking pattern. Here, I use a starred expression to achieve the same result as above without any indexing or slicing:

oldest, second_oldest, *others = car_ages_descending
print(oldest, second_oldest, others)

>>>
20 19 [15, 9, 8, 7, 6, 4, 1, 0]

This code is shorter, easier to read, and no longer has the error-prone brittleness of boundary indexes that must be kept in sync between lines.

A starred expression may appear in any position—start, middle, or end—so you can get the benefits of catch-all unpacking any time you need to extract one optional slice (see Item 9: “Consider match for Destructuring in Flow Control, Avoid When if Statements Are Sufficient” for another situation where this is useful):

oldest, *others, youngest = car_ages_descending
print(oldest, youngest, others)

*others, second_youngest, youngest = car_ages_descending
print(youngest, second_youngest, others)

>>>
20 0 [19, 15, 9, 8, 7, 6, 4, 1]
0 1 [20, 19, 15, 9, 8, 7, 6, 4]

However, when you use a starred expression in an unpacking assignment, you must have at least one required part, or else you’ll get a syntax error. You can’t use a catch-all expression on its own:

*others = car_ages_descending

>>>
Traceback ...
SyntaxError: starred assignment target must be in a list or tuple

You also can’t use multiple catch-all expressions in a single unpacking pattern:

first, *middle, *second_middle, last = [1, 2, 3, 4]

>>>
Traceback ...
SyntaxError: multiple starred expressions in assignment

But it is possible to use multiple starred expressions in an unpacking assignment statement, as long as they’re catch-alls for different levels of the nested structure being unpacked. I don’t recommend doing the following (see Item 31: “Return Dedicated Result Objects Instead of Requiring Function Callers to Unpack More Than Three Variables” for related guidance), but understanding it should help you develop an intuition for how starred expressions can be used in unpacking assignments:

car_inventory = {
    "Downtown": ("Silver Shadow", "Pinto", "DMC"),
    "Airport": ("Skyline", "Viper", "Gremlin", "Nova"),
}
((loc1, (best1, *rest1)),
 (loc2, (best2, *rest2))) = car_inventory.items()
print(f"Best at {loc1} is {best1}, {len(rest1)} others")
print(f"Best at {loc2} is {best2}, {len(rest2)} others")

>>>
Best at Downtown is Silver Shadow, 2 others
Best at Airport is Skyline, 3 others

Starred expressions become list instances in all cases. If there are no leftover items from the sequence being unpacked, the catch-all part will be an empty list. This is especially useful when you’re processing a sequence that you know in advance has at least N elements:

short_list = [1, 2]
first, second, *rest = short_list
print(first, second, rest)

>>>
1 2 []
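
Two details are worth seeing in code. In this sketch, the catch-all part is a list even when the source is a tuple or a str, and unpacking fails if the sequence has fewer items than the required parts of the pattern:

```python
first, *rest = (1, 2, 3)  # Source is a tuple
head, *tail = "abc"       # Source is a str
print(rest, tail)         # [2, 3] ['b', 'c']
assert isinstance(rest, list) and isinstance(tail, list)

try:
    one, two, *more = [1]  # Two required parts, only one item
except ValueError as e:
    print("Failed:", e)
```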

You can also unpack arbitrary iterators with the unpacking syntax. This isn’t worth much with a basic multiple-assignment statement. For example, here I unpack the values from iterating over a range of length 2. This doesn’t seem useful because it would be easier to just assign to a static list that matches the unpacking pattern (e.g., [1, 2]):

it = iter(range(1, 3))
first, second = it
print(f"{first} and {second}")

>>>
1 and 2

But with the addition of starred expressions, the value of unpacking iterators becomes clear. For example, here I have a generator that yields the rows of a CSV (comma-separated values) file containing all car orders from the dealership this week:

def generate_csv():
    yield ("Date", "Make", "Model", "Year", "Price")
    ...

Processing the results of this generator using indexes and slices is fine, but it requires multiple lines and is visually noisy:

all_csv_rows = list(generate_csv())
header = all_csv_rows[0]
rows = all_csv_rows[1:]
print("CSV Header:", header)
print("Row count: ", len(rows))

>>>
CSV Header: ('Date', 'Make', 'Model', 'Year', 'Price')
Row count:  200

Unpacking with a starred expression makes it easy to process the first row—the header—separately from the rest of the iterator’s contents. This is much clearer:

it = generate_csv()
header, *rows = it
print("CSV Header:", header)
print("Row count: ", len(rows))

>>>
CSV Header: ('Date', 'Make', 'Model', 'Year', 'Price')
Row count:  200

Keep in mind, however, that because a starred expression is always turned into a list, unpacking an iterator also risks using up all the memory on your computer and causing your program to crash (see Item 115: “Use tracemalloc to Understand Memory Usage and Leaks” for how to debug this). So you should only use catch-all unpacking on iterators when you have good reason to believe that the result data will all fit in memory (see Item 21: “Be Defensive when Iterating over Arguments” for another approach).
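
When the data might not fit in memory, one possible alternative is to take the header with the next built-in function and then consume the remaining rows one at a time. This is only a sketch: generate_rows is a hypothetical stand-in for generate_csv, and the rows it yields are placeholder values:

```python
def generate_rows():
    """Hypothetical stand-in that yields a header and 200 placeholder rows."""
    yield ("Date", "Make", "Model", "Year", "Price")
    for i in range(200):
        yield ("placeholder date", "placeholder make", f"model {i}", 2020, i)

it = generate_rows()
header = next(it)  # Consume only the first row
row_count = 0
for row in it:     # Remaining rows are handled one at a time
    row_count += 1

print("CSV Header:", header)
print("Row count: ", row_count)
```

Unlike the starred expression, this version never holds all of the rows in a list at once.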

Things to Remember

  • Unpacking assignments may include a starred expression to store all values that weren’t assigned to the other parts of the unpacking pattern in a list.

  • Starred expressions may appear in any position of the unpacking pattern. They will always become a list instance containing zero or more values.

  • When dividing a list into non-overlapping pieces, catch-all unpacking is much less error prone than using separate statements that do slicing and indexing.